All Gens Objective Visual-Mathematical Method of Constructing Viability Ranking Tiers (Test Data: ADV OU 2019)

Hello everyone,

I'm here to share a mathematical-visual tool with the wider community for evaluating Viability Ranking (VR) tiers, especially with contributors who maintain VRs outside my main gen. I believe that an objective method grounded in mathematics, yet visually convincing to a lay audience, should appeal to a wide spectrum of people.

Recently, there has been some discussion on how to separate the ADV OU (my main gen) VR tiers. Questions such as "Should Zapdos belong in A+ or A?" or "Is it even meaningful to separate the B tier into B-/B/B+?" were asked, and I believe I have come up with a tool that helps answer them. I also hope this post elucidates the process for those who participated in that thread.

Key Assumptions

This methodology is premised on two key assumptions:

A1. Every mon in a tier should be mostly indistinguishable from any other in the same tier.
A2. Every mon in a tier should be convincingly distinguishable from every mon in another tier.

And mathematically, for reliability we require that

B1. Enough players contribute their VR, with outliers removed, so that the central limit theorem holds and ranking statistics can be treated as normal distributions. The standard deviation is thus a meaningful statistic.

Together, these premises require that
1. Within a tier, each mon's mean (average) ranking, plus or minus one standard deviation, overlaps the mean rankings of the other mons in that tier,
2. At the transition from a higher to a lower tier, the mean rankings go from an overestimate to an underestimate.

Methodology through an Example

Let's take a look at an instructive example. Imagine a world of eight pokemon where there are two tiers: the legendary bird trio and the eeveelutions. Contributors to the VR gave the following ranking statistics (final integer rank, mean +/- standard deviation).

1, 1.9 +/- 1.0 Articuno
2, 2.0 +/- 1.0 Zapdos
3, 2.1 +/- 1.0 Moltres
4, 5.8 +/- 2.0 Vaporeon
5, 5.9 +/- 2.0 Flareon
6, 6.0 +/- 2.0 Jolteon
7, 6.1 +/- 2.0 Umbreon
8, 6.2 +/- 2.0 Espeon

We can do a scatter plot of these results.
(Figure: scatter plot of the example data, with mean ranking and one-standard-deviation error bars on the y axis, final integer ranking on the x axis, and the y = x diagonal.)


In this figure, I have plotted the mean ranking on the y axis, with +/- one standard deviation as error bars, against the final integer ranking (decided by the mean rank) on the x axis. Notice that the legendary trio and the eeveelutions, i.e. the tiers, each form an essentially flat line, and each mon's deviations overlap the mean rankings of the others in the same group, but not those in the other group. In my imaginary sample, contributors thought the legendary trio members were on the whole better than the eeveelutions, but could not convincingly decide who came out ahead within either group. It is only by a small margin that Articuno and Vaporeon came out ahead. This is exactly the definition of a tier given at the start of this article.
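For anyone who wants to reproduce this kind of figure outside of Excel, here is a minimal sketch in Python with matplotlib (my own illustration, not the attached workbook), using the imaginary data above and including the diagonal line discussed in the next paragraph:

```python
# Sketch: recreate the tier plot from the imaginary eight-mon example.
# Assumes the ranking statistics have been exported into plain lists.
import matplotlib.pyplot as plt

names = ["Articuno", "Zapdos", "Moltres", "Vaporeon",
         "Flareon", "Jolteon", "Umbreon", "Espeon"]
means = [1.9, 2.0, 2.1, 5.8, 5.9, 6.0, 6.1, 6.2]   # mean rankings
stds  = [1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 2.0]   # standard deviations
final = list(range(1, len(means) + 1))              # final integer ranks (sorted by mean)

plt.errorbar(final, means, yerr=stds, fmt="o", capsize=4,
             label="mean rank +/- 1 s.d.")
plt.plot([1, len(means)], [1, len(means)], "--",
         label="y = x (everyone votes identically)")
plt.xlabel("final integer ranking")
plt.ylabel("mean ranking")
plt.legend()
plt.show()
```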

How else could we have noticed this? On the same plot, I have drawn the diagonal line y = x, representing where the points would lie if everyone posted the same VR. Ironically, that would be a boring world with no tiers. A way to understand this is that tiers are born out of players' natural disagreements in rankings: different people substituting different mons into the same ranking position is what generates the idea that two mons are of roughly the same quality. We can exploit this to define a tier, too.

Points that lie below the y = x line are ranking overestimates: people clearly think of Moltres as being in the same league as Articuno, but through the vagaries of a slight 0.2 disadvantage it was relegated from roughly #2 on average to #3. It's like losing a really close semifinal against the champ and settling for bronze (sorry Hclat). Similarly, points that lie above the line are underestimates. Vaporeon isn't up to Moltres's standard by a long shot, but it enjoys the luxury of being #4 instead of roughly #6 on average, because someone has to get the consolation prize anyway. Thus, the bottom of a tier will have its rank overestimated (i.e. the mon is underrated), and the top of a tier will have its rank underestimated (the mon is overrated); this can be confusing, so remember that a rank overestimate is a numerically higher, and therefore worse, rank. A tier shift will thus be a jump from below to above the y = x line.

The method of determining tiers is now simple:
Step 1. Plot the graph.
Step 2. Find flat portions (I'll call these tier-lines) that stay within the standard deviations of the mons. Each one defines a tier.
Step 3. Find positions where there is a transition from below to above the diagonal line. Each one defines a tier shift. Steps 2 and 3 should almost coincide, and should coincide exactly if no contributor's votes spilled into other tiers.

Note that Steps 2 and 3 may give different results. Step 2 is generally more reliable, because Step 3 only works if not too many people vote mons out of the tier; when Step 3 does fire, though, it is a real smoking gun.
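Step 3 is easy to automate: a point lies below the diagonal when its mean rank is smaller than its final integer rank, so a tier shift is flagged wherever the residual (mean minus final rank) flips from negative to positive. A minimal sketch, reusing `means` and `final` from the snippet above:

```python
# Sketch of Step 3: detect jumps from below to above the y = x line.
def tier_shifts(final, means):
    residuals = [m - f for f, m in zip(final, means)]  # >0: above line, <0: below
    return [i for i in range(1, len(residuals))
            if residuals[i - 1] < 0 and residuals[i] > 0]

print(tier_shifts(final, means))  # [3]: Vaporeon (0-indexed) opens a new tier
```

On the example data this flags exactly one shift, between Moltres and Vaporeon, matching the flat tier-lines found in Step 2.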

Real Data and Dealing with Ambiguity: ADV OU Viability Ranking 2019

Finally, let's take a look at some real data. This is the same plot, generated from the outlier-removed ADV OU VR data provided by McMeghan in https://www.smogon.com/forums/threads/adv-viability-ranking-ou.3503019/post-8115920

(Figure: the same plot for the outlier-removed 2019 ADV OU VR data, with green/red/blue/magenta/yellow/black lines marking the candidate tier boundaries discussed below.)


To quote verbatim from my post over there,

At the bottom left corner is Tyranitar, which everyone agrees is #1, with zero standard deviation; it clearly belongs in the S tier.

The next three mons, #2-4, are Gengar, Metagross, and Swampert, which are really close, within each other's error bars. Zapdos, at #5, is clearly behind #2-4 and is in roughly the same league as Skarmory and Blissey (#6-7). From #8-13, Celebi, Suicune, Jirachi, Snorlax, Salamence, and Dugtrio clearly form a tier of their own. This concludes what many of us might be inclined to call the A tier.

#14 Starmie, on the border of the magenta line, is in a strange league of its own: not up to the standard of Salamence/Dugtrio, but clearly preferred over Magneton. It may be considered B+. Here, the standard deviations start to no longer cover entire tiers, and the spread is more uniform. From #15-20 you have Magneton, Claydol, Aerodactyl, Jolteon, Heracross, and Moltres, which are slightly distinguished from #21-26: Milotic, Flygon, Cloyster, Forretress, Porygon2, and Gyarados. #27 is Venusaur, which sits on the border of the black line separating what you may like to call the B and C tiers (Hariyama onwards).

Rewritten for simplicity, the VR tiers nominally stand at:
S: Tyranitar
A+: Gengar, Metagross, Swampert
A: Zapdos, Skarmory, Blissey
A-: Celebi, Suicune, Jirachi, Snorlax, Salamence, Dugtrio
Borderline A/B: Starmie
B1: Magneton, Claydol, Aerodactyl, Jolteon, Heracross, Moltres
B2: Milotic, Flygon, Cloyster, Forretress, Porygon2, Gyarados

To me, these results appear reasonable. For example, Metagross and Swampert are imo up a notch in versatility compared to Zapdos. Skarmory and Blissey come together. The "B" mons Magneton-Moltres all occupy positions on some notable archetypes (Magdol, Aero spikes, Jolt spikes, Heracross phys spam, Molt TSS that can either come with Flygon or Forre), while Milotic to Gyarados have less of a clear presence on teams, with the exception of Porygon2 archetypes (CMspam) that appear to be becoming less relevant imo.
The first four tiers, which I will conventionally call S/A+/A/A-, have rather distinct, flat tier-lines and boundaries demarcated by the green/red/blue/magenta lines. Now things get tricky: from the magenta line to the black line, the B tier-line becomes less flat, and the distinguishability assumptions A1 and A2 cannot both hold true. We have now come to the question of how one should subdivide the lower tiers, where things are more ambiguous. There are two ways to interpret the data.

1. Relax A1: Do not divide the tier into subtiers. Acknowledge that there will be variation within the tier.
2. Relax A2: Divide the tier into sections the size of the error bars. Within each subtier the mons are now more equal, but it's not so easy to see the differences across subtiers. In my original post, I subdivided the tier anyway with the yellow line, because the error bars were roughly half the size of the tier, and this ended up making reasonable sense (one possible subdivision rule is sketched below).
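To illustrate option 2 concretely, here is a minimal sketch of one possible subdivision rule (a greedy heuristic of my own, not necessarily how the yellow line was actually drawn): start a new subtier whenever a mon's mean rank drifts more than one error bar away from the first member of the current subtier.

```python
# Sketch: subdivide one ambiguous tier into error-bar-sized sections.
# `means` and `stds` are the mean ranks and standard deviations of the
# mons inside that tier, sorted by mean.
def subdivide(means, stds):
    subtiers, current = [], [0]
    for i in range(1, len(means)):
        if means[i] - means[current[0]] > stds[current[0]]:
            subtiers.append(current)   # close the current subtier
            current = [i]              # start a new one at mon i
        else:
            current.append(i)
    subtiers.append(current)
    return subtiers  # lists of indices, one list per subtier

# e.g. a hypothetical B tier with 1.5-rank error bars:
print(subdivide([15.2, 15.9, 16.4, 18.1, 18.7], [1.5] * 5))  # [[0, 1, 2], [3, 4]]
```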

The tiers eventually looked reasonable, and there were some surprises, such as the controversial Starmie not belonging in any neighboring tier.

Interpretation

The fact that this method yields much more than just some semblance of sanity is actually pretty amazing. If you think about it, McMeghan had not asked anyone for opinions on their tier limits. Only the rankings were obtained, yet the tiers naturally formed from player variations. This means,

1. It is a method of tiering that minimizes human bias.
2. It was surprisingly robust to the different metrics people used to evaluate mons, be it versatility vs. efficacy, or the level of operation (individual, core, archetype). A criticism of VRs is that mons fulfill niches on teams and are therefore extremely subjective to rank. These results tell us that we can measure subjectivity objectively, dividing the metagame into tiers of contained subjectivity while keeping a common understanding of relative placements across tiers. In more concrete terms, borrowing some ADV OU context: even though Zapdos and Snorlax do different things, there's a consensus that Zapdos is better at what it does/can do than Snorlax is at what it does/can do. That should trigger some thought in people unfamiliar with the tier, or lead a semi-experienced player to think about fundamental flaws of Snorlax in its main role, e.g. a special wall that can't deal with WoW Gengar/Moltres, or gets overloaded by Sand, which is why spikeless offense tends to be Electric-weak, etc.

Caveats

1. Tiering, and especially identifying tier shifts, becomes more ambiguous at the lower ranks because of larger variance, and less reliable because some participants do not rank low-tier mons. In the ADV OU data, the techniques made sense until the black line (the start of the C rank), after which all points fell below the y = x line.
2. Cleaning your data and removing outliers may or may not be important; it didn't turn out to matter too much for ADV OU. The use of the central limit theorem in B1 requires that these outliers not skew the normal distribution. Outlier removal could be something like removing all points more than some fraction of the median away from the median. The quality of the data can be checked by bootstrapping (forming a distribution of the averaged ranking of a particular mon by sampling with replacement). A sketch of both follows this list.
3. Enough contributors are needed to generate the required statistics. The ADV OU 2019 VR had 17.
4. Again, this has only been tested on the ADV OU 2019 VR. It would be far-fetched to say it works for every tier, but I'm hoping it gets somewhere.
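Here is a minimal sketch of the outlier filter and bootstrap check mentioned in caveat 2. The data structure (one list of raw ranks per mon), the 50% window, and the sample votes are all illustrative assumptions, not the exact choices used on the ADV OU data:

```python
# Sketch: median-based outlier removal and a bootstrap of the mean rank.
import random
import statistics

def remove_outliers(votes, frac=0.5):
    """Drop votes farther than frac * median from the median."""
    med = statistics.median(votes)
    return [v for v in votes if abs(v - med) <= frac * med]

def bootstrap_means(votes, n_resamples=10_000):
    """Distribution of the mean rank under resampling with replacement."""
    return [statistics.mean(random.choices(votes, k=len(votes)))
            for _ in range(n_resamples)]

votes = [4, 5, 5, 6, 6, 7, 7, 8, 21]   # one hypothetical mon's raw ranks
clean = remove_outliers(votes)          # the stray 21 is dropped
boot = bootstrap_means(clean)
print(statistics.mean(boot), statistics.stdev(boot))
```

If the bootstrapped distribution of the mean looks roughly normal, treating the standard deviation as a meaningful statistic (assumption B1) is justified.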

Add-on 4/30:
Speaking to tjdaas and Altina, some objections have been raised about whether this method can be used
O1. In tiers that are too centralized.
O2. In tiers that are in flux.
O3. For lower ranks.

Addressing O1:
Tjdaas informed me that most RBY players would provide a ranking like
1. Tauros
2. Chansey/Snorlax
3. Chansey/Snorlax
4. Exeggutor, sometimes Starmie
5. Starmie
6. Alakazam

In this case, the method would create tiers something like
Tier 1: Tauros
Tier 2: Chansey, Snorlax
Tier 3: Exeggutor
Tier 4: Starmie
Tier 5: Alakazam

As mentioned earlier, tiers form from player variations, so when there are no variations, a single mon occupies a tier, indicating that this mon is in a different league from both the mons above it and the mons below it. This accurately captures the essence of the two key distinguishability assumptions, albeit forming tiers with a small number of mons. Remember that a corollary of our assumptions is that if everyone ranked a list the same way, there would be no need for tiers. A tier is useful only because it says something like Articuno = Zapdos = Moltres > Vaporeon = Jolteon = Flareon. Without approximate equalities defining tiers, it doesn't make sense under this article's definition to say Tauros is in the same tier as Snorlax; and indeed, I'd rather always keep my Tauros alive in an RBY game than my Snorlax (just a guess, I'm not an RBY player).

Addressing O2:
Tiers in flux will likely be subject to a high standard deviation, especially if the contributors are not up-to-date. Part of this is a data selection issue: deciding who is a relevant contributor and who isn't. If that is accounted for, though, then any remaining ambiguous tier that doesn't satisfy both distinguishability criteria (such as the B or C tier in ADV OU) is an indication that sub-tiering to that level may not be useful. To me this is not a flaw of the methodology, but a result indicating that no matter how one tries to divide the tier, a significant number of contributors will disagree (the same people who cause the error bars to bleed from one tier to the next). It's not possible to make everyone happy in such a tier, no matter what tiering method is used.

Addressing O3:
Lower ranks suffer from two problems. First, as mentioned in O2 and seen in our ADV OU experience, at least one of the distinguishability criteria is likely not satisfied; refer to O2 for that response. Second, contributors simply don't rank some mons in lower tiers because those mons aren't important to them. A practical but contentious way of resolving the latter is requiring contributors to rank from a prescribed list. An impractical but less contentious way is requiring contributors to rank at least X mons, with the intention of cutting them down to a common list. A biased but practical and non-contentious way would be to give every unranked mon a placeholder low rank, but then there isn't much point in separating the low tiers, because they would be so heavily biased by the responses.

In all, to me, the most significant impediment is the selection of contributors. Contributors have to be competent and up-to-date with the tier, and there must be a sufficient number of them for the data to look clean. Perhaps that is where the greatest subjectivity lies.

Closing Remarks

I feel that this has some potential, and I really hope this post can reach the other tiers doing Viability Rankings without my having to double-post across forums. Finally, this is an analysis that can be easily implemented in Excel (see attached .zip file), and I'd be happy to help with any questions about doing it. I'm a physicist by training, not a statistician, so I might have made some mistakes. Please feel free to point out improvements or just plain wrong things I said, share your opinion on such an endeavor and useful considerations going forward, and perhaps even point the right people to this post if you think it's useful. Thanks!
 

---
Hey, this is a really cool idea, and I think it's something we should seriously look at integrating.

One thing I'll note is that this makes VR updates a really formal process. In theory, just having an individual or team leading the thread means that updates are relatively expedient; in practice I'm not sure this is the case, since they seem to seldom be that responsive. The key thing that sticks out for me is the selection of contributors: this adds a step to the decision-making process that simply doesn't exist currently. The best way of handling it would be to make the criteria for selection as passive as possible. It seems to me that there would be a few different ways for players to prove themselves worthy of contributing.

The biggest of them is forum tournaments. This would require a schedule of all tournaments featuring the appropriate tier. Most subforums already list all major forum tournaments for their respective tiers, though I don't know how many other tournaments would count but aren't listed in the aforementioned calendars. Ideally this would also include other sites, but then there are practical issues in terms of identity: people often register on different forums under different names, which would be a pain in terms of logistics. There are ways around this, such as people simply including that info in their profiles, but idk, it would require some testing. There are more issues as well: you might want to count certain tournaments differently. For instance, you might decide that making the 3rd round of a tournament (idk what the threshold would actually be) is enough to qualify, but outright winning might grant contributor status for a longer period than just making the 3rd round. And there might not be enough major tournaments in the year, idk. PP's tournament routine is pretty valuable for this kind of thing, and is what I really had in mind when I mentioned non-Smogon sites, but yeah, like I said, it's not something that can just be thrown straight in.

Then there's PS room tournaments. Idk, I've never been into live tournaments, so I'm not going to comment too much. I guess the standard of play is considerably lower than in forum tours; idk how it compares with ladder, though I assume it's better.

Ladder is a tricky one. In principle it's good for proving competence, even when you consider that many tour players prefer to avoid it. The issue is decay. From what I understand, it's relatively easy to maintain a decent ranking while barely playing, even on ladders where "decent" is well above the level where decay is significant. Inactive ladders are a bit of a trainwreck as well: you might only have a couple dozen people at ratings above the decay threshold, with a shitload of accounts sitting at the threshold (1500 for RBY). I actually had a look, and there are 27 people above 1500 Elo on the RBY ladder, with 117 accounts sitting at exactly 1500. On top of that, 1545 gets you a top-10 spot. And then there's the question of whether you want to use Elo at all; I personally prefer to look at GXE when I give a shit about my performance on ladder. In spite of all this, I think ladders are still a piece of the puzzle, as there are some elite players who might want to contribute but simply prefer ladder (in RBY, Raish and Kaz are two players that spring to mind as elite, primarily ladder players; well, Raish was elite, I don't think they've played in some time).

O3 is an interesting topic for me. I'm not sure how you'd best handle the situation where pokemon are unranked by some but not all players. Maybe there's some statistical solution to this, but off the top of my head it seems like a major issue. I like the idea of a prescribed list, since it is easy to maintain democratically alongside the rankings themselves: at the same time that players are giving rankings, they can also nominate pokemon to be added. You can nominate a pokemon to be removed, but removal could also theoretically be flagged by prior rankings (e.g. if something is unanimously last, consider whether or not to rank it at all). To me this could also handle the inconsistently-unranked issue. For instance, if 27/30 contributors include Raticate in their RBY rankings, it doesn't get included in that set of rankings, but it's required to be voted on for the subsequent set.

Also yes, I believe Raticate should be ranked in RBY, but I won't go into it here.

---

Addressing a common misconception: Implicit vs Explicit tier cutoffs

I'd just like to clear up a common misconception (not specifically addressed to you, but gathered from multiple opinions). This is a system for segmenting ranked mons into tiers implicitly. Conventionally, people argue about whether a mon should be in A- or B+. This method says: let's not argue about that, and don't even give me your preferred cutoffs. Just rank the mons without any preconceived notion of tiering, and the statistics of the community will derive the cutoffs implicitly. It's a completely different way of deciding tiers, one that is still sensible and revealing in its own way.

---

Hello Ortheore! Thank you for your helpful comments; I appreciate them. I acknowledge the point about a more formal process, and I agree that the selection of players is going to be the biggest question.

Addressing your point about the formality of the process: I feel that this method of VR can be used either as an aid to, or in conjunction with, the current method of healthy debate with final VRs determined by a tiering council. There is definitely value in debating VRs before an objective vote, and conversely, I also see this method as something that raises topics for discussion. Assuming the selection process is addressed, the statistical analysis of VRs can be performed, with the tiering council using it as data for the final VRs. On the council's side, it can identify preferences that have been overlooked or misjudged, whether due to activity differences among potential contributors, how determined they are to make a point, unfamiliarity with certain mons, arguments being put forth due to playstyle differences, or just unavoidable biases. On the users' side, it makes the VR process more transparent, as they get to see the objective ranks and the ranking modifications decided by the council. The council can be encouraged to explain differences between their VRs and those suggested by the data.

Perhaps I should share what happened with the ADV OU tier; I hope I don't misrepresent anybody in this process. After releasing my statistical analysis, I've only gotten positive reactions on the ADV Discord channel (maybe that doesn't say much, because people may be acknowledging my hard work rather than the results), but we had a debate on where Starmie should be located, as it was borderline A/B. The statistical evidence suggested placing Starmie in its own tier of B+. I then learned there were players who indeed thought of Starmie as convincingly better than B and worse than A, yet found it strange to place a mon in its own borderline tier. It is counter-intuitive, but that isn't a good reason to reject such a suggestion. In fact, I think this is better than the alternative of having a tiering council choose an unrepresentative tier, leaving most contributors dissatisfied. Data helped us find this surprising result.

On the selection of players, I'm honestly not too sure. I guess it can be pretty informal for small, tight-knit communities where most members have been around for a long time and have ways to talk to and know of each other (ADV OU), but I agree you'd need a formal selection for larger groups. Given that competent players have their preferences, I'd be inclined to allow three different modes of qualification (forum tour/room tour/ladder) to relax the requirements; some degree of word of mouth is even possible. Another contentious issue is the difficulty of discerning whether a player's metagame knowledge is current. People can do well in tours without having built teams for the gen, or even using outdated teams, yet there are many good ladder players who can never prove their currency. It is a very soft criterion, which brings up the political aspect: we don't want to offend someone who thinks he's sufficiently current. An extreme example to think about would be SPL players who don't touch a gen until the season is around the corner: how would you evaluate their competency and relevance? Being too selective makes this elitist, yet being too lax means dealing with misinformed opinions. Regardless, I don't think this ambiguity is a reason not to try, because the process of debate is also prone to biases. Finally, as this is a relatively simple process, you could explore categorizing players by tour/ladder or by some competency/currency metric to create separate graphs and cutoffs. I think data can only help; it's in deciding which data to use that there is discretion, and rightfully so.
Edit: Weighting contributions based on performance, such as what GSC NU does, is also possible.


I actually do not think O3 is a major issue, contrary to your opinion, but let me speak objectively first. With the current algorithm, if some players do not rank a mon, I assume they assign it a rank equal to the average over all contributing players. This is an underestimate, of course: it's more likely that the mon isn't even worth mentioning to the contributor (which would mean a very bad rank) than that the contributor forgot about it. An alternative would be to place the unranked mons last on those players' ballots. The reason I think this is not so important has nothing to do with my personal preference for ranking low-tier mons: it's that the main objective of this methodology is arriving at tier cutoffs from rankings, not at the rankings themselves. If players do not rank some mons, then there is a huge disagreement in the placement of those mons, so the standard deviation will probably be so large that it wouldn't change the tier. Remember that a tier shift is marked by a convincing change in the mean rank beyond the standard deviation. It is on technical rather than personal grounds that I suspect O3 wouldn't be too big of a problem, but where it comes to the rankings themselves, I agree with you that a system could help. Both imputation choices are sketched below.
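For concreteness, here is a minimal sketch of the two imputation choices just described. The data structures are hypothetical: `ballots` maps each contributor to a dict of the mons they ranked, and every mon is assumed to be ranked by at least one contributor:

```python
# Sketch: fill in unranked mons on each contributor's ballot, either with
# the average rank over contributors who did rank the mon ("average", the
# current algorithm's assumption) or below everything that contributor
# ranked ("last", the alternative).
def impute(ballots, mons, mode="average"):
    observed = {m: [b[m] for b in ballots.values() if m in b] for m in mons}
    avg = {m: sum(v) / len(v) for m, v in observed.items()}
    filled = {}
    for who, ballot in ballots.items():
        ballot = dict(ballot)  # don't mutate the input
        worst = max(ballot.values(), default=0)
        for m in mons:
            if m not in ballot:
                ballot[m] = avg[m] if mode == "average" else worst + 1
        filled[who] = ballot
    return filled

ballots = {"A": {"Starmie": 14, "Magneton": 15}, "B": {"Starmie": 13}}
print(impute(ballots, ["Starmie", "Magneton"]))
# B's missing Magneton becomes 15.0 ("average") or 14 ("last")
```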

Thank you for reading this wall of text!
 