All Gens Objective Visual-Mathematical Method of Constructing Viability Ranking Tiers (Test Data: ADV OU 2019)

Hello everyone,

I'm here to share a mathematical-visual tool with the wider community for evaluating Viability Ranking (VR) tiers, especially with contributors who maintain VRs outside my main gen. I believe that an objective method grounded in mathematics, yet visually convincing to a lay audience, should appeal to a wide spectrum of people.

Recently, there has been some discussion on how to separate the ADV OU (my main gen) VR tiers. Questions such as "Should Zapdos belong in A+ or A?" or "Is it even meaningful to separate the B tier into B-/B/B+?" were asked, and I believe I have come up with a tool that helps answer them. I also hope this post elucidates the process for those who participated in that thread.

Key Assumptions

This methodology is premised on two key assumptions:

A1. Every mon in a tier should be mostly indistinguishable from any other in the same tier.
A2. Every mon in a tier should be convincingly distinguishable from every mon in another tier.

And mathematically, for reliability we require that

B1. Enough players contribute their VR, with outliers removed, so that the central limit theorem holds and ranking statistics can be treated as normal distributions. The standard deviation is thus a meaningful statistic.

Together, these premises require that
1. Within a tier, each mon's mean (average) ranking, plus or minus one standard deviation, overlaps the mean rankings of the other mons in that tier,
2. At the transition from a higher to a lower tier, the mean rankings go from an overestimate to an underestimate.

Methodology through an Example

Let's take a look at an instructive example. Imagine a world of eight pokemon where there are two tiers: the legendary bird trio and the eeveelutions. Contributors to the VR gave the following ranking statistics (final integer rank, mean +/- standard deviation).

1, 1.9 +/- 1.0 Articuno
2, 2.0 +/- 1.0 Zapdos
3, 2.1 +/- 1.0 Moltres
4, 5.8 +/- 2.0 Vaporeon
5, 5.9 +/- 2.0 Flareon
6, 6.0 +/- 2.0 Jolteon
7, 6.1 +/- 2.0 Umbreon
8, 6.2 +/- 2.0 Espeon

We can do a scatter plot of these results.
(Figure: scatter plot of the example data, with mean ranking and one-standard-deviation error bars on the y axis, final integer ranking on the x axis, and the y = x diagonal.)


In this figure, I have plotted the mean ranking on the y axis, with +/- one standard deviation as error bars, against the final integer ranking (decided by the mean rank) on the x axis. Notice that the legendary trio and the eeveelutions, i.e. the tiers, each form an essentially flat line, and each mon's deviations overlap the mean rankings of the others in the same group, but not those in the other group. In my imaginary sample, contributors thought the legendary trio members were on the whole better than the eeveelutions, but could not convincingly decide who came out ahead within either group. It is only by a small margin that Articuno and Vaporeon came out ahead. This is exactly the definition of a tier given at the start of this article.
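For anyone who wants to reproduce this kind of figure outside of Excel, here is a minimal sketch in Python with matplotlib (my own illustration, not the attached workbook), using the imaginary data above and including the diagonal line discussed in the next paragraph:

```python
# Sketch: recreate the tier plot from the imaginary eight-mon example.
# Assumes the ranking statistics have been exported into plain lists.
import matplotlib.pyplot as plt

names = ["Articuno", "Zapdos", "Moltres", "Vaporeon",
         "Flareon", "Jolteon", "Umbreon", "Espeon"]
means = [1.9, 2.0, 2.1, 5.8, 5.9, 6.0, 6.1, 6.2]   # mean rankings
stds  = [1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 2.0]   # standard deviations
final = list(range(1, len(means) + 1))              # final integer ranks (sorted by mean)

plt.errorbar(final, means, yerr=stds, fmt="o", capsize=4,
             label="mean rank +/- 1 s.d.")
plt.plot([1, len(means)], [1, len(means)], "--",
         label="y = x (everyone votes identically)")
plt.xlabel("final integer ranking")
plt.ylabel("mean ranking")
plt.legend()
plt.show()
```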

How else could we have noticed this? On the same plot, I have drawn the diagonal line y = x, representing where the points would lie if everyone posted the same VR. Ironically, that would be a boring world with no tiers. A way to understand this is that tiers are born out of players' natural disagreements in rankings: different people substituting different mons into the same ranking position is what generates the idea that two mons are of roughly the same quality. We can exploit this to define a tier, too.

Points that lie below the y = x line are ranking overestimates: people clearly think of Moltres as being in the same league as Articuno, but through the vagaries of a slight 0.2 disadvantage it was relegated from roughly #2 on average to #3. It's like losing a really close semifinal against the champ and settling for bronze (sorry Hclat). Similarly, points that lie above the line are underestimates. Vaporeon isn't up to Moltres's standard by a long shot, but it enjoys the luxury of being #4 instead of roughly #6 on average, because someone has to get the consolation prize anyway. Thus, the bottom of a tier will have its rank overestimated (i.e. the mon is underrated), and the top of a tier will have its rank underestimated (the mon is overrated); this can be confusing, so remember that a rank overestimate is a numerically higher, and therefore worse, rank. A tier shift will thus be a jump from below to above the y = x line.

The method of determining tiers is now simple:
Step 1. Plot the graph.
Step 2. Find flat portions (I'll call these tier-lines) that stay within the standard deviations of the mons. Each one defines a tier.
Step 3. Find positions where there is a transition from below to above the diagonal line. Each one defines a tier shift. Steps 2 and 3 should almost coincide, and should coincide exactly if no contributor's votes spilled into other tiers.

Note that Steps 2 and 3 may give different results. Step 2 is generally more reliable, because Step 3 only works if not too many people vote mons out of the tier; when Step 3 does fire, though, it is a real smoking gun.
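Step 3 is easy to automate: a point lies below the diagonal when its mean rank is smaller than its final integer rank, so a tier shift is flagged wherever the residual (mean minus final rank) flips from negative to positive. A minimal sketch, reusing `means` and `final` from the snippet above:

```python
# Sketch of Step 3: detect jumps from below to above the y = x line.
def tier_shifts(final, means):
    residuals = [m - f for f, m in zip(final, means)]  # >0: above line, <0: below
    return [i for i in range(1, len(residuals))
            if residuals[i - 1] < 0 and residuals[i] > 0]

print(tier_shifts(final, means))  # [3]: Vaporeon (0-indexed) opens a new tier
```

On the example data this flags exactly one shift, between Moltres and Vaporeon, matching the flat tier-lines found in Step 2.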

Real Data and Dealing with Ambiguity: ADV OU Viability Ranking 2019

Finally, let's take a look at some real data. This is the same plot, generated from the outlier-removed ADV OU VR data provided by McMeghan in https://www.smogon.com/forums/threads/adv-viability-ranking-ou.3503019/post-8115920

(Figure: the same plot for the outlier-removed 2019 ADV OU VR data, with green/red/blue/magenta/yellow/black lines marking the candidate tier boundaries discussed below.)


To quote verbatim from my post over there,

At the bottom left corner is Tyranitar, which everyone agrees is #1, with zero standard deviation; it clearly belongs in the S tier.

The next three mons, #2-4, are Gengar, Metagross, and Swampert, which are really close, within each other's error bars. Zapdos, at #5, is clearly behind #2-4 and is in roughly the same league as Skarmory and Blissey (#6-7). From #8-13, Celebi, Suicune, Jirachi, Snorlax, Salamence, and Dugtrio clearly form a tier of their own. This concludes what many of us might be inclined to call the A tier.

#14 Starmie, on the border of the magenta line, is in a strange league of its own: not up to the standard of Salamence/Dugtrio, but clearly preferred over Magneton. It may be considered B+. Here, the standard deviations start to no longer cover entire tiers, and the spread is more uniform. From #15-20 you have Magneton, Claydol, Aerodactyl, Jolteon, Heracross, and Moltres, which are slightly distinguished from #21-26: Milotic, Flygon, Cloyster, Forretress, Porygon2, and Gyarados. #27 is Venusaur, which sits on the border of the black line separating what you may like to call the B and C tiers (Hariyama onwards).

Rewritten for simplicity, the VR tiers nominally stand at:
S: Tyranitar
A+: Gengar, Metagross, Swampert
A: Zapdos, Skarmory, Blissey
A-: Celebi, Suicune, Jirachi, Snorlax, Salamence, Dugtrio
Borderline A/B: Starmie
B1: Magneton, Claydol, Aerodactyl, Jolteon, Heracross, Moltres
B2: Milotic, Flygon, Cloyster, Forretress, Porygon2, Gyarados

To me, these results appear reasonable. For example, Metagross and Swampert are imo up a notch in versatility compared to Zapdos. Skarmory and Blissey come together. The "B" mons Magneton-Moltres all occupy positions on some notable archetypes (Magdol, Aero spikes, Jolt spikes, Heracross phys spam, Molt TSS that can either come with Flygon or Forre), while Milotic to Gyarados have less of a clear presence on teams, with the exception of Porygon2 archetypes (CMspam) that appear to be becoming less relevant imo.
The first four tiers, which I will conventionally call S/A+/A/A-, have rather distinct, flat tier-lines and boundaries demarcated by the green/red/blue/magenta lines. Now things get tricky: from the magenta line to the black line, the B tier-line becomes less flat, and the distinguishability assumptions A1 and A2 cannot both hold true. We have now come to the question of how one should subdivide the lower tiers, where things are more ambiguous. There are two ways to interpret the data.

1. Relax A1: Do not divide the tier into subtiers. Acknowledge that there will be variation within the tier.
2. Relax A2: Divide the tier into sections the size of the error bars. Within each subtier the mons are now more equal, but it's not so easy to see the differences across subtiers. In my original post, I subdivided the tier anyway with the yellow line, because the error bars were roughly half the size of the tier, and this ended up making reasonable sense (one possible subdivision rule is sketched below).
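To illustrate option 2 concretely, here is a minimal sketch of one possible subdivision rule (a greedy heuristic of my own, not necessarily how the yellow line was actually drawn): start a new subtier whenever a mon's mean rank drifts more than one error bar away from the first member of the current subtier.

```python
# Sketch: subdivide one ambiguous tier into error-bar-sized sections.
# `means` and `stds` are the mean ranks and standard deviations of the
# mons inside that tier, sorted by mean.
def subdivide(means, stds):
    subtiers, current = [], [0]
    for i in range(1, len(means)):
        if means[i] - means[current[0]] > stds[current[0]]:
            subtiers.append(current)   # close the current subtier
            current = [i]              # start a new one at mon i
        else:
            current.append(i)
    subtiers.append(current)
    return subtiers  # lists of indices, one list per subtier

# e.g. a hypothetical B tier with 1.5-rank error bars:
print(subdivide([15.2, 15.9, 16.4, 18.1, 18.7], [1.5] * 5))  # [[0, 1, 2], [3, 4]]
```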

The tiers eventually looked reasonable, and there were some surprises, such as the controversial Starmie not belonging in any neighboring tier.

Interpretation

The fact that this method yields much more than just some semblance of sanity is actually pretty amazing. If you think about it, McMeghan had not asked anyone for opinions on their tier limits. Only the rankings were obtained, yet the tiers naturally formed from player variations. This means,

1. It is a method of tiering that minimizes human bias.
2. It was surprisingly robust to the different metrics people used to evaluate mons, be it versatility vs. efficacy, or the level of operation (individual, core, archetype). A criticism of VRs is that mons fulfill niches on teams and are therefore extremely subjective to rank. These results tell us that we can measure subjectivity objectively, dividing the metagame into tiers of contained subjectivity while keeping a common understanding of relative placements across tiers. In more concrete terms, borrowing some ADV OU context: even though Zapdos and Snorlax do different things, there's a consensus that Zapdos is better at what it does/can do than Snorlax is at what it does/can do. That should trigger some thought in people unfamiliar with the tier, or lead a semi-experienced player to think about fundamental flaws of Snorlax in its main role, e.g. a special wall that can't deal with WoW Gengar/Moltres, or gets overloaded by Sand, which is why spikeless offense tends to be Electric-weak, etc.

Caveats

1. Tiering, and especially identifying tier shifts, becomes more ambiguous at the lower ranks because of larger variance, and less reliable because some participants do not rank low-tier mons. In the ADV OU data, the techniques made sense until the black line (the start of the C rank), after which all points fell below the y = x line.
2. Cleaning your data and removing outliers may or may not be important; it didn't turn out to matter too much for ADV OU. The use of the central limit theorem in B1 requires that these outliers not skew the normal distribution. Outlier removal could be something like removing all points more than some fraction of the median away from the median. The quality of the data can be checked by bootstrapping (forming a distribution of the averaged ranking of a particular mon by sampling with replacement). A sketch of both follows this list.
3. Enough contributors are needed to generate the required statistics. The ADV OU 2019 VR had 17.
4. Again, this has only been tested on the ADV OU 2019 VR. It would be far-fetched to say it works for every tier, but I'm hoping it gets somewhere.
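Here is a minimal sketch of the outlier filter and bootstrap check mentioned in caveat 2. The data structure (one list of raw ranks per mon), the 50% window, and the sample votes are all illustrative assumptions, not the exact choices used on the ADV OU data:

```python
# Sketch: median-based outlier removal and a bootstrap of the mean rank.
import random
import statistics

def remove_outliers(votes, frac=0.5):
    """Drop votes farther than frac * median from the median."""
    med = statistics.median(votes)
    return [v for v in votes if abs(v - med) <= frac * med]

def bootstrap_means(votes, n_resamples=10_000):
    """Distribution of the mean rank under resampling with replacement."""
    return [statistics.mean(random.choices(votes, k=len(votes)))
            for _ in range(n_resamples)]

votes = [4, 5, 5, 6, 6, 7, 7, 8, 21]   # one hypothetical mon's raw ranks
clean = remove_outliers(votes)          # the stray 21 is dropped
boot = bootstrap_means(clean)
print(statistics.mean(boot), statistics.stdev(boot))
```

If the bootstrapped distribution of the mean looks roughly normal, treating the standard deviation as a meaningful statistic (assumption B1) is justified.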

Add-on 4/30:
Speaking to tjdaas and Altina, some objections have been raised about whether this method can be used
O1. In tiers that are too centralized.
O2. In tiers that are in flux.
O3. For lower ranks.

Addressing O1:
Tjdaas informed me that most RBY players would provide a ranking like
1. Tauros
2. Chansey/Snorlax
3. Chansey/Snorlax
4. Exeggutor, sometimes Starmie
5. Starmie
6. Alakazam

In this case, the method would create tiers something like
Tier 1: Tauros
Tier 2: Chansey, Snorlax
Tier 3: Exeggutor
Tier 4: Starmie
Tier 5: Alakazam

As mentioned earlier, tiers form from player variations, so when there are no variations, a single mon occupies a tier, indicating that this mon is in a different league from both the mons above it and the mons below it. This accurately captures the essence of the two key distinguishability assumptions, albeit forming tiers with a small number of mons. Remember that a corollary of our assumptions is that if everyone ranked a list the same way, there would be no need for tiers. A tier is useful only because it says something like Articuno = Zapdos = Moltres > Vaporeon = Jolteon = Flareon. Without approximate equalities defining tiers, it doesn't make sense under this article's definition to say Tauros is in the same tier as Snorlax; and indeed, I'd rather always keep my Tauros alive in an RBY game than my Snorlax (just a guess, I'm not an RBY player).

Addressing O2:
Tiers in flux will likely be subject to a high standard deviation, especially if the contributors are not up-to-date. Part of this is a data selection issue: deciding who is a relevant contributor and who isn't. If that is accounted for, though, then any remaining ambiguous tier that doesn't satisfy both distinguishability criteria (such as the B or C tier in ADV OU) is an indication that sub-tiering to that level may not be useful. To me this is not a flaw of the methodology, but a result indicating that no matter how one tries to divide the tier, a significant number of contributors will disagree (the same people who cause the error bars to bleed from one tier to the next). It's not possible to make everyone happy in such a tier, no matter what tiering method is used.

Addressing O3:
Lower ranks suffer from two problems. First, as mentioned in O2 and seen in our ADV OU experience, at least one of the distinguishability criteria is likely not satisfied; refer to O2 for that response. Second, contributors simply don't rank some mons in lower tiers because those mons aren't important to them. A practical but contentious way of resolving the latter is requiring contributors to rank from a prescribed list. An impractical but less contentious way is requiring contributors to rank at least X mons, with the intention of cutting them down to a common list. A biased but practical and non-contentious way would be to give every unranked mon a placeholder low rank, but then there isn't much point in separating the low tiers, because they would be so heavily biased by the responses.

In all, to me, the most significant impediment is the selection of contributors. Contributors have to be competent and up-to-date with the tier, and there must be a sufficient number of them for the data to look clean. Perhaps that is where the greatest subjectivity lies.

Closing Remarks

I feel that this has some potential, and I really hope this post can reach the other tiers doing Viability Rankings without my having to double-post across forums. Finally, this is an analysis that can be easily implemented in Excel (see attached .zip file), and I'd be happy to help with any questions about doing it. I'm a physicist by training, not a statistician, so I might have made some mistakes. Please feel free to point out improvements or just plain wrong things I said, share your opinion on such an endeavor and useful considerations going forward, and perhaps even point the right people to this post if you think it's useful. Thanks!
 

---
Hey, this is a really cool idea, and I think it's something we should seriously look at integrating.

One thing I'll note is that this makes VR updates a really formal process. In theory, just having an individual or team leading the thread means that updates are relatively expedient; in practice I'm not sure this is the case, since they seem to seldom be that responsive. The key thing that sticks out for me is the selection of contributors: this adds a step to the decision-making process that simply doesn't exist currently. The best way of handling it would be to make the criteria for selection as passive as possible. It seems to me that there would be a few different ways for players to prove themselves worthy of contributing.

The biggest of them is forum tournaments. This would require a schedule of all tournaments featuring the appropriate tier. Most subforums already list all major forum tournaments for their respective tiers, though I don't know how many other tournaments would count but aren't listed in the aforementioned calendars. Ideally this would also include other sites, but then there are practical issues in terms of identity: people often register on different forums under different names, which would be a pain in terms of logistics. There are ways around this, such as people simply including that info in their profiles, but idk, it would require some testing. There are more issues as well: you might want to count certain tournaments differently. For instance, you might decide that making the 3rd round of a tournament (idk what the threshold would actually be) is enough to qualify, but outright winning might grant contributor status for a longer period than just making the 3rd round. And there might not be enough major tournaments in the year, idk. PP's tournament routine is pretty valuable for this kind of thing, and is what I really had in mind when I mentioned non-Smogon sites, but yeah, like I said, it's not something that can just be thrown straight in.

Then there's PS room tournaments. Idk, I've never been into live tournaments, so I'm not going to comment too much. I guess the standard of play is considerably lower than in forum tours; idk how it compares with ladder, though I assume it's better.

Ladder is a tricky one. In principle it's good for proving competence, even when you consider that many tour players prefer to avoid it. The issue is decay. From what I understand, it's relatively easy to maintain a decent ranking while barely playing, even on ladders where "decent" is well above the level where decay is significant. Inactive ladders are a bit of a trainwreck as well: you might only have a couple dozen people at ratings above the decay threshold, with a shitload of accounts sitting at the threshold (1500 for RBY). I actually had a look, and there are 27 people above 1500 Elo on the RBY ladder, with 117 accounts sitting at exactly 1500. On top of that, 1545 gets you a top-10 spot. And then there's the question of whether you want to use Elo at all; I personally prefer to look at GXE when I give a shit about my performance on ladder. In spite of all this, I think ladders are still a piece of the puzzle, as there are some elite players who might want to contribute but simply prefer ladder (in RBY, Raish and Kaz are two players that spring to mind as elite, primarily ladder players; well, Raish was elite, I don't think they've played in some time).

O3 is an interesting topic for me. I'm not sure how you'd best handle the situation where pokemon are unranked by some but not all players. Maybe there's some statistical solution to this, but off the top of my head it seems like a major issue. I like the idea of a prescribed list, since it is easy to maintain democratically alongside the rankings themselves: at the same time that players are giving rankings, they can also nominate pokemon to be added. You can nominate a pokemon to be removed, but removal could also theoretically be flagged by prior rankings (e.g. if something is unanimously last, consider whether or not to rank it at all). To me this could also handle the inconsistently-unranked issue. For instance, if 27/30 contributors include Raticate in their RBY rankings, it doesn't get included in that set of rankings, but it's required to be voted on for the subsequent set.

Also yes, I believe Raticate should be ranked in RBY, but I won't go into it here.

---

Addressing a common misconception: Implicit vs Explicit tier cutoffs

I'd just like to clear up a common misconception (not specifically addressed to you, but gathered from multiple opinions). This is a system for segmenting ranked mons into tiers implicitly. Conventionally, people argue about whether a mon should be in A- or B+. This method says: let's not argue about that, and don't even give me your preferred cutoffs. Just rank the mons without any preconceived notion of tiering, and the statistics of the community will derive the cutoffs implicitly. It's a completely different way of deciding tiers, one that is still sensible and revealing in its own way.

---

Hello Ortheore! Thank you for your helpful comments; I appreciate them. I acknowledge the point about a more formal process, and I agree that the selection of players is going to be the biggest question.

Addressing your point about the formality of the process: I feel that this method of VR can be used either as an aid to, or in conjunction with, the current method of healthy debate with final VRs determined by a tiering council. There is definitely value in debating VRs before an objective vote, and conversely, I also see this method as something that raises topics for discussion. Assuming the selection process is addressed, the statistical analysis of VRs can be performed, with the tiering council using it as data for the final VRs. On the council's side, it can identify preferences that have been overlooked or misjudged, whether due to activity differences among potential contributors, how determined they are to make a point, unfamiliarity with certain mons, arguments being put forth due to playstyle differences, or just unavoidable biases. On the users' side, it makes the VR process more transparent, as they get to see the objective ranks and the ranking modifications decided by the council. The council can be encouraged to explain differences between their VRs and those suggested by the data.

Perhaps I should share what happened with the ADV OU tier; I hope I don't misrepresent anybody in this process. After releasing my statistical analysis, I've only gotten positive reactions on the ADV Discord channel (maybe that doesn't say much, because people may be acknowledging my hard work rather than the results), but we had a debate on where Starmie should be located, as it was borderline A/B. The statistical evidence suggested placing Starmie in its own tier of B+. I then learned there were players who indeed thought of Starmie as convincingly better than B and worse than A, yet found it strange to place a mon in its own borderline tier. It is counter-intuitive, but that isn't a good reason to reject such a suggestion. In fact, I think this is better than the alternative of having a tiering council choose an unrepresentative tier, leaving most contributors dissatisfied. Data helped us find this surprising result.

On the selection of players, I'm honestly not too sure. I guess it can be pretty informal for small, tight-knit communities where most members have been around for a long time and have ways to talk to and know of each other (ADV OU), but I agree you'd need a formal selection for larger groups. Given that competent players have their preferences, I'd be inclined to allow three different modes of qualification (forum tour/room tour/ladder) to relax the requirements; some degree of word of mouth is even possible. Another contentious issue is the difficulty of discerning whether a player's metagame knowledge is current. People can do well in tours without having built teams for the gen, or even using outdated teams, yet there are many good ladder players who can never prove their currency. It is a very soft criterion, which brings up the political aspect: we don't want to offend someone who thinks he's sufficiently current. An extreme example to think about would be SPL players who don't touch a gen until the season is around the corner: how would you evaluate their competency and relevance? Being too selective makes this elitist, yet being too lax means dealing with misinformed opinions. Regardless, I don't think this ambiguity is a reason not to try, because the process of debate is also prone to biases. Finally, as this is a relatively simple process, you could explore categorizing players by tour/ladder or by some competency/currency metric to create separate graphs and cutoffs. I think data can only help; it's in deciding which data to use that there is discretion, and rightfully so.
Edit: Weighting contributions based on performance, such as what GSC NU does, is also possible.


I actually do not think O3 is a major issue, contrary to your opinion, but let me speak objectively first. With the current algorithm, if some players do not rank a mon, I assume they assign it a rank equal to the average over all contributing players. This is an underestimate, of course: it's more likely that the mon isn't even worth mentioning to the contributor (which would mean a very bad rank) than that the contributor forgot about it. An alternative would be to place the unranked mons last on those players' ballots. The reason I think this is not so important has nothing to do with my personal preference for ranking low-tier mons: it's that the main objective of this methodology is arriving at tier cutoffs from rankings, not at the rankings themselves. If players do not rank some mons, then there is a huge disagreement in the placement of those mons, so the standard deviation will probably be so large that it wouldn't change the tier. Remember that a tier shift is marked by a convincing change in the mean rank beyond the standard deviation. It is on technical rather than personal grounds that I suspect O3 wouldn't be too big of a problem, but where it comes to the rankings themselves, I agree with you that a system could help. Both imputation choices are sketched below.
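For concreteness, here is a minimal sketch of the two imputation choices just described. The data structures are hypothetical: `ballots` maps each contributor to a dict of the mons they ranked, and every mon is assumed to be ranked by at least one contributor:

```python
# Sketch: fill in unranked mons on each contributor's ballot, either with
# the average rank over contributors who did rank the mon ("average", the
# current algorithm's assumption) or below everything that contributor
# ranked ("last", the alternative).
def impute(ballots, mons, mode="average"):
    observed = {m: [b[m] for b in ballots.values() if m in b] for m in mons}
    avg = {m: sum(v) / len(v) for m, v in observed.items()}
    filled = {}
    for who, ballot in ballots.items():
        ballot = dict(ballot)  # don't mutate the input
        worst = max(ballot.values(), default=0)
        for m in mons:
            if m not in ballot:
                ballot[m] = avg[m] if mode == "average" else worst + 1
        filled[who] = ballot
    return filled

ballots = {"A": {"Starmie": 14, "Magneton": 15}, "B": {"Starmie": 13}}
print(impute(ballots, ["Starmie", "Magneton"]))
# B's missing Magneton becomes 15.0 ("average") or 14 ("last")
```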

Thank you for reading this wall of text!
 