Hello everyone,
I'm here to share a mathematical-visual tool for evaluating Viability Ranking (VR) tiers with the wider community, especially contributors who do VRs outside my main gen. I believe that an objective method founded in mathematics but visually convincing to a lay audience should appeal to a wide spectrum of people.
Recently, there has been some discussion on how to separate the ADV OU (my main) VR tiers. Questions such as "Should Zapdos belong in A+ or A?" or "Is it even meaningful to separate the B tier into B-/B/B+?" were asked, and I believe I have come up with a tool helpful for answering these questions. I also hope that this post can elucidate the process for those who participated in that thread.
Key Assumptions
This methodology is premised on two key assumptions:
A1. Every mon in a tier should be mostly indistinguishable from any other in the same tier.
A2. Every mon in a tier should be convincingly distinguishable from every mon in another tier.
And mathematically, for reliability we require that
B1. Enough players contribute their VR, or outliers are removed, so that the central limit theorem holds and ranking statistics can be treated as normal distributions. The standard deviation is then a meaningful statistic.
Together, these premises require that
1. Within a tier, each mon's mean (average) ranking stays within a standard deviation of the other mons' mean rankings,
2. At the transition from a higher to a lower tier, the final integer ranking goes from an overestimate to an underestimate of the mean ranking.
Methodology through an Example
Let's take a look at an instructive example. Imagine a world of eight Pokemon where there are two tiers: the legendary trio and the eeveelutions. Contributors to the VR gave the following ranking statistics (final integer rank, mean +/- standard deviation).
1, 1.9 +/- 1.0 Articuno
2, 2.0 +/- 1.0 Zapdos
3, 2.1 +/- 1.0 Moltres
4, 5.8 +/- 2.0 Vaporeon
5, 5.9 +/- 2.0 Flareon
6, 6.0 +/- 2.0 Jolteon
7, 6.1 +/- 2.0 Umbreon
8, 6.2 +/- 2.0 Espeon
We can do a scatter plot of these results.
In this figure, I have plotted on the y axis the mean ranking, with +/- one standard deviation as error bars. On the x axis, I have plotted the final integer ranking decided by the mean rank. Notice that the legendary trio and the eeveelutions, which are the tiers, each essentially form a flat line, and the error bars of each mon overlap the others' mean rankings in the same group, but not in the other group. In my imaginary sample, contributors thought the legendary trio members were on the whole better than the eeveelutions, but could not decide convincingly within either group who came ahead. It is only by a small margin that Articuno and Vaporeon came ahead. This is exactly the definition of a tier given at the start of the post.
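For anyone who prefers to check the overlap claim numerically rather than by eye, here is a minimal Python sketch. It is not part of the attached Excel file; the function names `covers` and `check_overlaps` are made up for illustration, and "overlap" is assumed to mean that one mon's error bar (mean +/- one standard deviation) reaches the other mon's mean.

```python
# Example statistics from the post: (name, mean rank, standard deviation, group).
# The group labels are the two tiers the example assumes.
MONS = [
    ("Articuno", 1.9, 1.0, "legendary trio"),
    ("Zapdos", 2.0, 1.0, "legendary trio"),
    ("Moltres", 2.1, 1.0, "legendary trio"),
    ("Vaporeon", 5.8, 2.0, "eeveelutions"),
    ("Flareon", 5.9, 2.0, "eeveelutions"),
    ("Jolteon", 6.0, 2.0, "eeveelutions"),
    ("Umbreon", 6.1, 2.0, "eeveelutions"),
    ("Espeon", 6.2, 2.0, "eeveelutions"),
]


def covers(mean_a, std_a, mean_b):
    """True if mon A's error bar (mean +/- one std) reaches mon B's mean."""
    return abs(mean_a - mean_b) <= std_a


def check_overlaps(mons):
    """Return (same_group_ok, cross_group_ok) for the overlap claim."""
    same_ok, cross_ok = True, True
    for name_a, mean_a, std_a, group_a in mons:
        for name_b, mean_b, _std_b, group_b in mons:
            if name_a == name_b:
                continue
            hit = covers(mean_a, std_a, mean_b)
            if group_a == group_b:
                # Assumption A1: bars should cover same-group means.
                same_ok = same_ok and hit
            else:
                # Assumption A2: bars should miss other-group means.
                cross_ok = cross_ok and not hit
    return same_ok, cross_ok


same_ok, cross_ok = check_overlaps(MONS)
print("bars cover every same-group mean:", same_ok)
print("bars miss every other-group mean:", cross_ok)
```

Both checks pass for the example numbers, which is just the "flat line within error bars" picture restated as arithmetic.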
How else could we have noticed this? On the same plot, I have drawn the diagonal line y=x, representing where points would lie if everyone posted the same VR. Ironically, this is a boring world where there are no tiers. A way to understand this is that tiers are born out of players' natural disagreements in rankings. The substitution of mons by different people in the same ranking position is what generates the idea that two mons are roughly the same quality. We can exploit this to define a tier too.
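To make the y=x statement concrete, here is a small Python sketch of how the plotted numbers could be computed from raw ballots. The ballots are hypothetical, `rank_statistics` is an illustrative name, and I use the population standard deviation (`pstdev`), which is my own choice since the post does not fix one.

```python
from statistics import mean, pstdev


def rank_statistics(ballots):
    """Return [(final_rank, name, mean_rank, std)] sorted by mean rank.

    Each ballot is a list of names ordered best to worst, so the position
    in the list (starting at 1) is that contributor's rank for the mon.
    """
    positions = {}
    for ballot in ballots:
        for position, name in enumerate(ballot, start=1):
            positions.setdefault(name, []).append(position)
    # pstdev is the population standard deviation (an assumption; the
    # sample standard deviation would be an equally reasonable choice).
    stats = [(name, mean(p), pstdev(p)) for name, p in positions.items()]
    stats.sort(key=lambda item: item[1])  # final integer rank follows the mean
    return [(rank, name, m, s) for rank, (name, m, s) in enumerate(stats, start=1)]


# Hypothetical ballots: three contributors who agree completely.
unanimous = [["Articuno", "Zapdos", "Moltres"]] * 3
# Hypothetical ballots: three contributors who each prefer a different bird.
split = [
    ["Articuno", "Zapdos", "Moltres"],
    ["Zapdos", "Moltres", "Articuno"],
    ["Moltres", "Articuno", "Zapdos"],
]

for label, ballots in (("unanimous", unanimous), ("split", split)):
    print(label)
    for rank, name, m, s in rank_statistics(ballots):
        print(f"  x={rank}  y={m:.2f} +/- {s:.2f}  {name}")
```

With unanimous ballots every point sits exactly on y=x with zero deviation. With the split ballots all three mons share the same mean, so the points form a flat line that crosses the diagonal, which is the tier picture.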
Points that lie below the y=x line are ranking overestimates. For example, people clearly think of Moltres as being in the same league as Articuno, but through the vagaries of a slight 0.2 disadvantage it was relegated from roughly #2 on average to #3. It's like losing a really close semifinal to the champ and getting bronze (sorry Hclat). Similarly, points that lie above the line are underestimates. Vaporeon isn't up to Moltres's standard by a long shot, but enjoys the luxury of being #4 instead of roughly #6 on average because someone needs to get the consolation prize anyway. Thus, the bottom of a tier will have its rank overestimated (underrated), and the top of a tier will have its rank underestimated (overrated). Sorry if this is a bit confusing: a rank overestimate is a numerically higher rank, which is a worse rank. A tier shift will thus be a jump from below to above the y=x line.
The method of determining tiers now is simple:
Step 1. Plot the graph
Step 2. Find flat portions (I'll call this a tierline) within the standard deviations of the mons. This defines a tier.
Step 3. Find positions where there is a transition from below to above the diagonal line. This defines a tier shift. The results of Steps 2 and 3 should almost coincide, and should exactly coincide if no contributor's votes spill over into other tiers.
Note that Steps 2 and 3 may give different results. Step 2 is generally more reliable because Step 3 only works reliably if not too many people vote mons out of the tier, but Step 3 is a real smoking gun.
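Steps 2 and 3 can also be sketched in a few lines of Python. This is only one possible reading of the steps, not the procedure in the attached Excel file: `find_tiers` assumes a mon joins the current tier when its mean and every current member's mean lie within the smaller of their two standard deviations, and `find_tier_shifts` only looks at adjacent ranks. Both function names are hypothetical.

```python
# (final integer rank, mean rank, standard deviation, name) from the example.
EXAMPLE = [
    (1, 1.9, 1.0, "Articuno"),
    (2, 2.0, 1.0, "Zapdos"),
    (3, 2.1, 1.0, "Moltres"),
    (4, 5.8, 2.0, "Vaporeon"),
    (5, 5.9, 2.0, "Flareon"),
    (6, 6.0, 2.0, "Jolteon"),
    (7, 6.1, 2.0, "Umbreon"),
    (8, 6.2, 2.0, "Espeon"),
]


def find_tiers(rows):
    """Step 2: group consecutive mons that lie on a flat tierline.

    Assumed rule: a mon joins the current tier if its mean differs from
    every current member's mean by no more than the smaller of the two
    standard deviations. Otherwise it starts a new tier.
    """
    tiers = []
    for row in rows:
        _rank, mean_rank, std, _name = row
        if tiers and all(
            abs(mean_rank - other_mean) <= min(std, other_std)
            for _r, other_mean, other_std, _n in tiers[-1]
        ):
            tiers[-1].append(row)
        else:
            tiers.append([row])
    return tiers


def find_tier_shifts(rows):
    """Step 3: adjacent ranks where the point jumps from below to above y = x."""
    shifts = []
    for (rank_a, mean_a, _sa, _na), (rank_b, mean_b, _sb, _nb) in zip(rows, rows[1:]):
        below_then_above = mean_a < rank_a and mean_b > rank_b
        if below_then_above:
            shifts.append((rank_a, rank_b))
    return shifts


tier_names = [[name for *_rest, name in tier] for tier in find_tiers(EXAMPLE)]
shifts = find_tier_shifts(EXAMPLE)
print("tiers:", tier_names)
print("tier shifts between ranks:", shifts)
```

On the example this gives two tiers (the trio and the five eeveelutions) and a single shift between ranks 3 and 4, so the two steps agree. On real data a point can sit on or near the diagonal between two tiers, so the adjacent-rank test in Step 3 would likely need loosening.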
Real Data and Dealing with Ambiguity: ADV OU Viability Ranking 2019
Finally, let's take a look at some real data. This is the same kind of plot, generated from outlier-removed data for the ADV OU VR provided by McMeghan in https://www.smogon.com/forums/threads/advviabilityrankingou.3503019/post8115920
To quote verbatim from my post over there,
At the bottom left corner is Tyranitar, which everyone agrees is #1 and has zero standard deviation, and clearly belongs in the S tier.
The next three mons, #2-4, are Gengar, Metagross and Swampert, which are really close up to error bars. Zapdos, at #5, is clearly behind #2-4, and is in roughly the same league as Skarmory and Blissey (#6-7). From #8-13, Celebi, Suicune, Jirachi, Snorlax, Salamence, Dugtrio clearly form a tier of their own. This concludes what many of us might be inclined to call the A tier.
#14 Starmie, on the border of the magenta line, is in a strange league of its own, not up to the standards of Salamence/Dugtrio, but clearly more preferred than Magneton. It may be considered B+. Here, the standard deviations start not to cover entire tiers, and the spread is more uniform. From #15-20, you have Magneton, Claydol, Aerodactyl, Jolteon, Heracross, Moltres, which are slightly distinguished from #21-26, Milotic, Flygon, Cloyster, Forretress, Porygon2, Gyarados. #27 is Venusaur, which sits on the border of the black line separating what you may like to call the B and C tiers (Hariyama onwards).
Rewritten for simplicity: The VR tiers from this nominally stand at
S: Tyranitar
A+: Gengar, Metagross, Swampert
A: Zapdos, Skarmory, Blissey
A-: Celebi, Suicune, Jirachi, Snorlax, Salamence, Dugtrio
Borderline A/B: Starmie
B1: Magneton, Claydol, Aerodactyl, Jolteon, Heracross, Moltres
B2: Milotic, Flygon, Cloyster, Forretress, Porygon2, Gyarados
To me, these results appear reasonable. For example, Metagross and Swampert are imo up a notch in versatility compared to Zapdos. Skarmory and Blissey come together. The "B" mons Magneton-Moltres all occupy positions on some notable archetypes (Magdol, Aero spikes, Jolt spikes, Heracross phys spam, Molt TSS that can either come with Flygon or Forre), while Milotic to Gyarados have less of a clear presence on teams, with the exception of Porygon2 archetypes (CMspam) that appear to be becoming less relevant imo.
The first four tiers, which I will conventionally call S/A+/A/A-, have rather distinct flat tierlines and boundaries demarcated by the green/red/blue/magenta lines. Now things get tricky. At that point, the B tierline from magenta to black becomes less flat and the distinguishability assumptions A1 and A2 cannot both hold true. We have now come to the question of how one should subdivide lower tiers, where things are more ambiguous. There are two ways to interpret the data.
1. Relax A1: Do not divide the tier into subtiers. Acknowledge that there will be variation within the tier.
2. Relax A2: Divide the tier into sections of the size of the error bars. Within each subtier the mons are now more equal, but it's not so easy to see the differences across subtiers. In my original post, I just subdivided the tier anyway with the yellow line, because the error bars were roughly half the size of the tier. This ended up making reasonable sense.
The tiers eventually looked reasonable, and there are some surprises, such as the controversial Starmie not belonging in any tier.
Interpretation
The fact that this method yields much more than just some semblance of sanity is actually pretty amazing. If you think about it, McMeghan had not asked anyone for opinions on their tier limits. Only the rankings were obtained, yet the tiers naturally formed from player variations. This means,
1. It is a method of tiering that minimizes human bias.
2. It was surprisingly robust to the different metrics people used to evaluate mons, be it versatility vs efficacy, or the level of operation (individual, core, archetype). A criticism of VRs is that mons fulfill niches on teams and that rankings are extremely subjective. These results tell us that we can measure subjectivity objectively, dividing the metagame into tiers of contained subjectivity, but with a common understanding of relative placements across tiers. Borrowing some ADV OU context, in more concrete terms, even though Zapdos and Snorlax do different things, there's a consensus that Zapdos is better at what it does/can do than Snorlax is at what it does/can do. It should trigger some thought in people unfamiliar with the tier, or lead a semi-experienced player to think about fundamental flaws of Snorlax in its main role, e.g. a special wall that can't deal with WoW Gengar/Moltres, or gets overloaded by Sand, which is why spikeless offense tends to be electric weak, etc.
Caveats
1. Tiering, especially tier shift, becomes more ambiguous as you get to the lower ranks, because of larger variance, and less reliable because some participants do not rank low tier mons. In the ADV OU tier, the techniques made sense till the black line (start of C rank), after which all points fell below the y=x line.
2. Cleaning your data and removing outliers may or may not be important. It didn't turn out to matter too much for ADV OU. However, the use of the central limit theorem in B1 requires that these outliers not skew the normal distribution. Outlier removal could be something like removing all points that lie more than some fraction of the median away from the median. The quality of the data can be checked by bootstrapping (forming a distribution of the averaged ranking of a particular mon by sampling with replacement).
3. Enough contributors are needed to generate the required statistics. The ADV OU 2019 VR had 17.
4. Again, this has only been tested in the ADV OU 2019 VR. It would be far-fetched to say it works for every tier, but I'm hoping it gets somewhere.
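The outlier rule and the bootstrap check from caveat 2 could look something like the following Python sketch. The ranks are made-up numbers for a single mon, the cutoff `fraction=0.5` and the resample count are arbitrary illustrative values, and the function names are hypothetical. None of this is the cleaning actually applied to the ADV OU data.

```python
import random
from statistics import mean, median, pstdev


def remove_outliers(ranks, fraction=0.5):
    """Keep ranks that lie within `fraction` of the median away from the median.

    fraction=0.5 is an arbitrary illustrative threshold, not a recommendation.
    """
    mid = median(ranks)
    return [r for r in ranks if abs(r - mid) <= fraction * mid]


def bootstrap_means(ranks, n_resamples=2000, seed=0):
    """Resample the ranks with replacement and return the resampled means."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [mean(rng.choices(ranks, k=len(ranks))) for _ in range(n_resamples)]


# Hypothetical ranks given to one mon by 17 contributors. Only the panel
# size matches the ADV OU 2019 VR; the values themselves are invented.
ranks = [5, 6, 5, 7, 6, 5, 6, 7, 5, 6, 6, 5, 7, 6, 5, 6, 20]

cleaned = remove_outliers(ranks)
boot = bootstrap_means(cleaned)
print("kept", len(cleaned), "of", len(ranks), "ranks")
print(f"mean rank {mean(cleaned):.2f}, bootstrap spread of the mean {pstdev(boot):.2f}")
```

If the bootstrap distribution of the mean looks roughly bell-shaped and narrow compared with the gap to the neighbouring tier, treating the statistics as normal is probably fine. A lumpy or very wide distribution suggests too few contributors or leftover outliers.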
Add-on 4/30:
Speaking to tjdaas and Altina, some objections have been raised about whether this method can be used
O1. In tiers that are too centralized
O2. In tiers that are in flux
O3. For lower ranks.
Addressing O1:
Tjdaas informed me that most RBY players would provide a ranking like
1. Tauros
2. Chansey/Snorlax
3. Chansey/Snorlax
4. Exeggutor, sometimes Starmie
5. Starmie
6. Alakazam
In this case, the method would create tiers something like
Tier 1: Tauros
Tier 2: Chansey, Snorlax
Tier 3: Exeggutor
Tier 4: Starmie
Tier 5: Alakazam
As mentioned earlier, tiers form from player variations, so when there are no variations, a single mon occupies a tier, indicating that this mon is in a different league from the ones above it and also from the ones below it. This accurately captures the essence of the two key distinguishability assumptions, albeit forming tiers with a small number of mons. Remember that a corollary of our assumptions is that if everyone ranked a list the same way, there is no need for tiers. A tier is useful only because it says something like Articuno = Zapdos = Moltres > Vaporeon = Jolteon = Flareon. Without approximate equalities defining tiers, it doesn't make sense under this post's definition of a tier to say that Tauros is in the same tier as Snorlax, and yet that I'd always rather keep my Tauros alive in an RBY game than my Snorlax (just a guess, I'm not an RBY player).
Addressing O2:
Tiers in flux will likely be subject to a high standard deviation, especially if the contributors are not up-to-date. Part of this is a data selection issue: deciding who is a relevant contributor and who isn't. If that is accounted for, though, then any remaining ambiguous tiers that don't satisfy both distinguishability criteria (such as the B or C tier in ADV OU) are an indication that subtiering to that level may not be useful. This to me is not a flaw of the methodology, but a result indicating that no matter how one tries to divide the tier, there are a significant number of contributors who will disagree (the same people who cause the error bars to bleed from one tier to the other). It's not possible to make everyone happy in such a tier, even if any other tiering method were used.
Addressing O3:
Lower ranks suffer from two problems. First, as mentioned in O2 and from our experience in ADV OU, at least one of the distinguishability criteria is likely not satisfied, so refer to O2 for this response. Second, contributors just don't rank some mons in lower tiers because they're just not important to them. A practical but contentious way of resolving the latter is to require contributors to rank from a prescribed list. An impractical but less contentious way is to require contributors to rank at least X mons, with the intention of cutting a percentage of them down to a common list. A biased but practical and non-contentious way would be to just give every unranked mon a fake low rank ceiling, but then there isn't really a point in separating the low tiers because they are so heavily biased by responses.
In all, to me, the most significant impediment is the selection of contributors. Contributors have to be competent and up-to-date with the tier, and there must be a sufficient number of them for the data to look clean. Perhaps that is where the greatest subjectivity lies.
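For the last option, a minimal Python sketch of the "fake low rank ceiling" might look like this. The ballots are hypothetical, `fill_unranked` is an illustrative name, and using the length of the common list as the ceiling is my own assumption.

```python
def fill_unranked(ballots, all_mons):
    """Give every mon a contributor left out the same 'ceiling' rank.

    Assumption: the ceiling is the size of the common list, so unranked
    mons tie for last place on that contributor's ballot.
    """
    ceiling = len(all_mons)
    filled = []
    for ballot in ballots:
        ranks = {name: position for position, name in enumerate(ballot, start=1)}
        filled.append({name: ranks.get(name, ceiling) for name in all_mons})
    return filled


ALL_MONS = ["Articuno", "Zapdos", "Moltres", "Vaporeon", "Flareon"]

# Hypothetical ballots: only the first contributor ranks every mon.
ballots = [
    ["Articuno", "Zapdos", "Moltres", "Flareon", "Vaporeon"],
    ["Zapdos", "Articuno", "Moltres"],
    ["Articuno", "Moltres", "Zapdos", "Vaporeon"],
]

filled = fill_unranked(ballots, ALL_MONS)
for name in ALL_MONS:
    print(name, [ballot[name] for ballot in filled])
```

The bias is easy to see in the output: every skipped mon piles up at the same rank, which shrinks its spread artificially and makes the low end of the plot look flatter than the contributors' real opinions would.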
Closing Remarks
I feel that this has some potential, and really hope I can reach out to the other tiers doing Viability Rankings through this post, without having to double post across forums. Finally, this is an analysis that can be easily implemented in Excel (see attached .zip file), and I'd be happy to help with any questions about doing this analysis. I'm a physicist by training, not a statistician, so I might have made some mistakes. Please feel free to point out improvements or just plain wrong things I said, as well as your opinion on such an endeavor, useful considerations going forward, perhaps even pointing the right people to this post if you think it's useful. Thanks!