Tier list underusage cutoff

capefeather · Feb 9, 2011

(nb: I make frequent references to Smogon's tiering article. Just mentioning this on the off-chance that you're reading this and somehow you're not aware of that article.)

Hey, all. Since Super is apparently trying to get usage stats to work this weekend, I thought that I should bring up something that I've been wanting to bring up for a long time. Coyotte's thread was similar, but it kind of trivialized the issue of how to make the cutoff and it was never really resolved. This post intends to change that. I was originally going to have this be a series of threads where the underusage cutoff formula was constructed step by step, but I eventually thought it best to lay everything out in this thread and then maybe have polls for each step or something.

I'm bringing this up because it seems to me as if we're heading toward accepting a simple x% cutoff because that's what Coyotte is using. Besides my personal disagreement with this method, I feel that we'd be taking steps backward and nullifying the efforts of people who have discussed this in the past, like X-Act, obi, DougJustDoug and Cathy, if we didn't take a serious look at what has been used in the past, what may be used in the immediate future, and what we as a community really want out of the cutoff. I also understand that this may not seem to be the best time to talk about this, but I don't see us talking about much else right now, and I'm hoping to get all this decided before we get the hard numbers.

I also intend to give and explain formulas for each aspect of the cutoff, at least eventually, because I think that sometimes the explaining part was lacking in past discussions. Even if the calculations and explanations are "trivial" or "too technical", it's nice to have some kind of explanation around so that the constant in the tier philosophy page right now doesn't look so arbitrary. It also makes it easier on those of us who are looking for explanations for our running usage stats in the manner in which we are running them. I also wanted to give examples using maybe the August 2010 Shoddy stats as a sample, but to do that for every single formula that's going to be discussed seems not worth it at all. I may give examples if the options are shrunk down later on, though.

Step 0: A look at existing proposed cutoffs

From what I've seen, three main cutoffs have been proposed, and two have been put into use:

1. Coyotte's cutoff: x% cutoff

2. 4th gen cutoff: A Pokémon is OU if its probability of appearing in a random selection of T teams is at least x. (T = 20 and x = 0.5 in actual usage)

3. Collective cutoff: OU is the collection of the most used Pokémon who together make up x of all teams.

I also saw something about comparing each Pokémon's usage with the #1 most used Pokémon's usage during my "research" into past discussions.

I'll talk about each of these in later steps, but I wanted to highlight here the current displayed formula for the 4th gen cutoff:

X-Act said:
C = S x (1 - (0.5)^(1 / T)) / 6

S is the sum of all of the usages, being used presumably due to error in the data collection causing S to be a bit different from 1. I'm going to ignore S for simplicity's sake.

I don't quite understand why the cutoff is divided by 6. The fact that there are six Pokémon on a team shouldn't factor into the formula at all, due to Species Clause. However, the rest checks out:

Let u be the usage of a Pokémon. Then the probability that it won't appear in a random selection of T teams is (1 - u)^T. We want this to equal 1 - x when u = C, so C = 1 - (1 - x)^(1/T).

When T = 20 and x = 0.5, we get 3.41%, which if I'm not mistaken is near what we used last generation, anyway. There's also a prediction factor involved, but I'll talk about that in a later step.
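For anyone who wants to check that number, here is a minimal Python sketch of the formula as derived above (ignoring S and without the division by 6). The function name is made up for illustration.

```python
def individual_cutoff(T: int, x: float) -> float:
    """Usage C at which a Pokémon has probability x of appearing at least
    once in T randomly chosen teams, i.e. the solution of 1 - (1 - C)^T = x."""
    return 1.0 - (1.0 - x) ** (1.0 / T)


# T = 20 and x = 0.5 are the values quoted for 4th gen.
print(f"{individual_cutoff(20, 0.5):.2%}")  # about 3.41%
```

Changing T or x in the call shows how sensitive the cutoff is to either choice.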

Step 1: The purpose of tiers: individual vs collective merit

The way I see it, there are two primary motivations to make UU (and NU and such). The first is to make a "low-tier" game where Pokémon who are used at a certain frequency are banned; this ties into individual merit. The second is to create an accurate threat list that accounts for everything that a typical team might have; this ties into collective merit. The first two cutoffs address the former, while the collective cutoff addresses the latter.

Both individual and collective merit based cutoffs have their advantages and disadvantages. By setting a hard percentage cutoff, one essentially ensures that only "not uncommon" Pokémon are in OU, but a threat list composed of the OUs has inconsistent accuracy and relevance. On the other hand, by setting a collective cutoff, one is able to control how accurate the OU threat list is, at the cost of possibly letting "rare" Pokémon into the OU tier.

There are two ways that I can think of to calculate a collective cutoff. The first is cutoff 3 listed above.

"OU is the smallest collection of the most used Pokémon who together make up at least x of all teams."

Equivalently, UU is the largest collection of the least used Pokémon from which none are in less than x of all teams. That means that (if N is the total number of Pokémon) #N doesn't appear AND #N-1 doesn't appear AND etc.:

The goal is to find the lowest n for which P(n) >= x. Unfortunately, there's no "formula" for this cutoff simply because of the nature of how it's calculated, but with a few estimates it shouldn't be too bad after a few iterations of doing this.

EDIT: I think that I made the rookie mistake of assuming that I'd set this up so that correlation didn't matter. It does. Sorry about that.
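Since correlation matters, one way around it is to skip the per-Pokémon probabilities and count teams directly. The sketch below is only one reading of the definition: it takes "make up" to mean that every member of a team is in the OU set, and it assumes the raw team lists are available. The function name and the toy two-Pokémon teams are invented for illustration.

```python
from collections import Counter


def collective_ou(teams, x):
    """Smallest prefix of the usage ranking whose members fully make up
    at least a fraction x of the given teams."""
    usage = Counter(p for team in teams for p in set(team))
    # Rank by usage, breaking ties alphabetically so the result is stable.
    ranked = sorted(usage, key=lambda p: (-usage[p], p))
    team_sets = [set(t) for t in teams]
    for n in range(1, len(ranked) + 1):
        ou = set(ranked[:n])
        covered = sum(1 for t in team_sets if t <= ou)
        if covered / len(team_sets) >= x:
            return ranked[:n]
    return ranked


# Made-up data: teams shortened to two members to keep the example small.
teams = [
    ("Chansey", "Empoleon"),
    ("Chansey", "Empoleon"),
    ("Chansey", "Milotic"),
    ("Charizard", "Ninjask"),
]
print(collective_ou(teams, 0.5))
```

Because it works on whole teams, whatever correlation exists between teammates is already baked into the count.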

The other method is a slight modification:

"OU is the smallest collection of the most used Pokémon from which at least one is present in at least x of all teams."

Then the formula becomes P(n) = product(i=1,n-1)(u_i). The advantage of this becomes clear when we look at the likely values of x. With the first formula, we may be looking at something like 50%, while with this formula, that number could be more like 95%. To say that the OU threat list is 95% accurate is pretty appealing. Nonetheless, this may be a less accurate way to interpret what we want out of a threat list in the first place, since we seek to deal with teams, not individual Pokémon.
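The "at least one is present" definition can also be checked by counting teams rather than multiplying usages. This is a rough sketch under the assumption that team lists are available; the function name and toy data are made up, and the teams are cut down to two members for brevity.

```python
from collections import Counter


def at_least_one_ou(teams, x):
    """Smallest prefix of the usage ranking such that at least a fraction x
    of the given teams contain one or more of its members."""
    usage = Counter(p for team in teams for p in set(team))
    ranked = sorted(usage, key=lambda p: (-usage[p], p))
    team_sets = [set(t) for t in teams]
    for n in range(1, len(ranked) + 1):
        ou = set(ranked[:n])
        hit = sum(1 for t in team_sets if t & ou)
        if hit / len(team_sets) >= x:
            return ranked[:n]
    return ranked


teams = [
    ("Chansey", "Empoleon"),
    ("Chansey", "Empoleon"),
    ("Chansey", "Milotic"),
    ("Charizard", "Ninjask"),
]
print(at_least_one_ou(teams, 0.95))
```

With this definition a high x such as 0.95 is reachable with a short list, which is the appeal mentioned above.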

Step 2: Sample size

With sample size, we're essentially looking at the phrase, "random selection of T teams." For Coyotte's cutoff, the sample size T is the entire population, while for X-Act's cutoff, T is obviously 20. Now, I personally don't find the use of the entire population very useful at all, since in the end (I'm presuming) we're trying to appeal to an individual's experiences of what is common/rare and what is not common/rare. On the other hand, if we use the collective cutoff in Step 1, the OU definition might get just a little bit complicated:

"OU is the smallest collection of the most used Pokémon for which the probability that they fully make up at least t out of a random selection of T teams is at least x."

I hope you'll forgive me for not deriving a formula for this :P

One should note that, regardless of which cutoff we use, we're actually deciding two things here: the sample size T and the ratio t/T. Technically, there's no reason not to have picked "2 in 40 teams" or "3 in 60 teams" in X-Act's cutoff other than sheer simplicity. It's something to keep in mind if anyone wants something like "3 in 10 teams". In cases like this, we'll have to call upon the binomial distribution:

[Image: binomial distribution formula, copied from Wikipedia]

One would have to solve for 1 - p to get the cutoff C in this case. "Sheer simplicity" is looking like a very valid argument right now...
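Inverting the binomial by hand is painful, but a computer can do it numerically. The sketch below is not the formula from the image; it simply sums the binomial probabilities and bisects on the usage u until the chance of "at least t appearances in T teams" equals x. Function names and the tolerance are assumptions for illustration.

```python
from math import comb


def prob_at_least(t: int, T: int, u: float) -> float:
    """P(X >= t) for X ~ Binomial(T, u): chance of meeting a Pokémon with
    usage u in at least t out of T teams."""
    return sum(comb(T, k) * u**k * (1 - u) ** (T - k) for k in range(t, T + 1))


def cutoff_t_of_T(t: int, T: int, x: float, tol: float = 1e-12) -> float:
    """Usage at which prob_at_least(t, T, u) reaches x, found by bisection.
    Valid for t >= 1, where the probability increases with u."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if prob_at_least(t, T, mid) < x:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


print(f"1 in 20: {cutoff_t_of_T(1, 20, 0.5):.4f}")
print(f"3 in 10: {cutoff_t_of_T(3, 10, 0.5):.4f}")
```

With t = 1 this reproduces the closed form 1 - (1 - x)^(1/T), so the simple case stays simple.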

Personally, I think that, given we stick with the 4th gen cutoff, we should get a stat showing the average number of battles that a user participates in in a day, and determine T by that. t should be 1 because I don't think anyone wants to invert that huge formula I copy-pasted from Wikipedia.

Step 3: Predictive tiering and weighted stats

I consider this an extremely important step. A tier list arising from present usage stats is, in fact, obsolete right out of the box. But here's where things get vague as far as research into the past discussions of this matter goes.

First, the weighted stats. Apparently, X-Act had originally used ratios related to the Golden Ratio instead of the 20-3-1 ratio used today, but I'm guessing that he went with 20-3-1 for simplicity's sake. The point of the weighted stats is to use the usage stats from every "checkpoint" (that is, every time the stats are drawn) in such a way that more recent checkpoints have more impact than less recent ones do. There isn't much that I could find on this stuff, but for the most part this seems uncontroversial.

The real puzzling matter is stat prediction. There was discussion on it, but I'm not sure that it ever got implemented. Either that or it was scrapped. I wanted to bring this back up because I wanted to see if people could come up with a way to predict stats that would help to make the tier list more relevant to the present. Ideally, I'd want to use polynomial fitting, but in practice that would probably take way too much effort.

---
Well, my hope for this post was to demystify (at least a little bit) the usage cutoff methods that have been used, as well as a few that were discussed but never got off the ground. Even if most people end up not understanding much of it, I thought that it would be very important to have a compendium of sorts so that people don't have to dig through PR, IS, even Stark Mountain to understand the tier list that we hold so dearly. I also wanted to lay everything out so that people wouldn't feel as disadvantaged when trying to support one method or propose an entirely different method. Finally, even if people just want the status quo of X-Act's cutoff or Coyotte's cutoff or whatever, I'd feel a lot better knowing that you guys have a better idea of what exactly sticking to the status quo would entail.

"wtf" is an entirely appropriate reaction, too; I wouldn't blame you

coyotte508 · Feb 9, 2011

I want to address a few points :)

First about "coyotte's cutoff". It's a decision based on how usage stats are looking, and the % could change depending on how the usage stats look - maybe being based on a natural cutoff if it seems appropriate for the month (ie finding out if there's a significant enough gap in usage at one point and placing the cutoff there if it makes sense).
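One way to read "natural cutoff" is to look for the largest drop between neighbouring usage figures inside some range. The sketch below does that; the function name, the search range and the minimum gap are all assumptions for illustration, not what the usage stats program actually does.

```python
def natural_cutoff(usages, low, high, min_gap):
    """Midpoint of the largest gap between neighbouring usages within
    [low, high], or None if no gap is at least min_gap wide."""
    ranked = sorted((u for u in usages if low <= u <= high), reverse=True)
    best = None  # (gap size, midpoint of the gap)
    for upper, lower in zip(ranked, ranked[1:]):
        gap = upper - lower
        if gap >= min_gap and (best is None or gap > best[0]):
            best = (gap, (upper + lower) / 2)
    return None if best is None else best[1]


# Made-up usages with an obvious hole between 5.0% and 3.8%.
sample = [0.060, 0.055, 0.050, 0.038, 0.036, 0.030]
print(natural_cutoff(sample, low=0.03, high=0.06, min_gap=0.004))
```

If no gap is big enough the function returns None, which would be the case for falling back on a fixed percentage.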

capefeather said:
With sample size, we're essentially looking at the phrase, "random selection of T teams." For Coyotte's cutoff, the sample size T is the entire population, while for X-Act's cutoff, T is obviously 20. Now, I personally don't find the use of the entire population very useful at all, since in the end (I'm presuming) we're trying to appeal to an individual's experiences of what is common/rare and what is not common/rare. On the other hand, if we use the collective cutoff in Step 1, the OU definition might get just a little bit complicated:

I don't really understand. Using the entire population is kind of like having absolute probabilities. For example, a 4% cutoff is the same as removing pokemon that are in fewer than 1/25 of the teams. X-Act's 3.41%, derived from 1 in 20 battles, is more like saying pokemon that have a 50% chance of appearing in 1 out of 20 battles. It's pretty much the same idea, and both methods just end up setting a definite %. Also, if you're ever going to make tiers, of course you'll need to consider the entire population. If you decide not to consider the entire population and remove the part of the population that uses Toxicroak, you're going to alter the tiers. If you're going to randomly remove a part of the population, why do so? Why limit yourself to stats on 300 000 battles if 1 000 000 battles happened?

capefeather said:
The real puzzling matter is stat prediction. There was discussion on it, but I'm not sure that it ever got implemented.

Actually, the ratio 20-3-1 was the best way X-Act found to predict stats; it's the prediction function. If you want to predict stats better knowing the trend of previous months, then you probably need to introduce more complicated mathematical factors, such as whether the usage is rising or decreasing, etc., and one of the best ways would be something like training a neural network. But it could give inaccurate results too, because no one can predict the future with certainty.

<bit about error analysis removed seeing your edit :)>

With all that I pointed out from the post, I feel like the collective stats idea is really a great one, better than the existing systems. It also has the ability to skip some spots in the usage stats list, which a straight cutoff could never do. It's not overly complicated to implement, either. And I'm glad someone has at last decided to think about the matter.

capefeather · Feb 9, 2011

Well, the whole point of this is to have a better handle on saying what we mean. The raw usage stat is almost the expected value, and that gives no control over how "accurate" we can make the claim that you'll meet a Pokémon at least once every T battles. If a Pokémon is rated at 5% usage and I have a 20-battle session, the probability that I'll run into at least one is about 64%. It seems to me that we ought in principle to be controlling the 64% rather than the 5%, assuming that we even stick with a hard cutoff.
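The 64% figure is easy to reproduce. A minimal sketch, assuming each battle is an independent draw with the same usage u (the function name is made up):

```python
def chance_of_meeting(u: float, T: int) -> float:
    """Probability of seeing a Pokémon with usage u at least once in T
    independent battles: 1 - (1 - u)^T."""
    return 1.0 - (1.0 - u) ** T


print(f"{chance_of_meeting(0.05, 20):.0%}")  # about 64%
```

Fixing this probability and solving for u is what the 4th gen cutoff does; fixing u is what a plain x% cutoff does.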

Matchmaking aside, whether I meet a Toxicroak or not is random. I have no control over whether or not I meet a Toxicroak, or how many I meet. So I as an individual can't really alter the tiers by not meeting Toxicroak. I think that there's a misunderstanding here because I was just comparing the 4th gen cutoff to the x% cutoff.

If the 20-3-1 ratio IS the end result of the prediction function, then that's good to know. Again, I would have liked this to have been explained somewhere not obscure, but oh, well.

Chou Toshio · Feb 10, 2011

capefeather said:
Step 1: The purpose of tiers: individual vs collective merit

The way I see it, there are two primary motivations to make UU (and NU and such). The first is to make a "low-tier" game where Pokémon who are used at a certain frequency are banned; this ties into individual merit. The second is to create an accurate threat list that accounts for everything that a typical team might have; this ties into collective merit. The first two cutoffs address the former, while the collective cutoff addresses the latter.

I'm not too much of a math guy, so I'll leave that be-- but I just wanted to say that by your definitions, I think individual merit should take precedence if possible. To me, UU's identity as a tier of infrequently used Pokemon, is more important than trying to use the OU list as a threat list. The best players will make their own judgements about the threat level of Pokemon regardless of what is on or off the OU list.

Besides, if providing a "threat list" was the real goal, then last gen's OU list would have failed-- because pokemon like Weavile, Electivire, and Umbreon are, frankly, not very threatening at all. That doesn't mean I'd want to see them in UU matches. Let's face it, the OU list has a lot more practical use for, and a whole lot more influence on, the play of UU rather than OU.

coyotte508 · Feb 10, 2011

If you take the last post from X-Act's thread:

I thought of the following. We have lots of data to play with now. Why not just find the prediction function that produces the least sum of deviations artificially, i.e. via a short computer program? And that's what I did. I made a computer program to find me the best prediction function that generates the nearest value to the actual one using just the three previous months. And the function produced was the following:

Artificial Prediction Function (APF): u_0 = (86 x u_1 + 13 x u_2 + 4 x u_3) / 103

That strangely resembles the current weighting; just divide everything by 4.

Also from the introduction to tiers page:

From previous ShoddyBattle statistics, a small computer program was written to generate an equation that would fit best the predicted probability P with the actual probability for each Pokemon. This equation is the following:
P = (20×P_3 + 3×P_2 + P_1) ÷ 24

So that's how the prediction function was determined. Though I think the metagame hasn't settled down enough after the introduction of 5th gen for it to be accurate right now. And this was 4th gen too.
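To see how close the two weightings are, here is a small sketch that applies both to the same three months. It assumes the largest weight belongs to the most recent month in both formulas; the function name and the sample usages are invented.

```python
def predict_usage(latest, previous, oldest, weights=(20, 3, 1)):
    """Weighted average of the last three monthly usages, with the
    weights applied from most recent to oldest."""
    w1, w2, w3 = weights
    return (w1 * latest + w2 * previous + w3 * oldest) / (w1 + w2 + w3)


# Made-up usages for a Pokémon that is slowly rising: 3%, then 4%, then 5%.
months = (0.05, 0.04, 0.03)  # latest, previous, oldest
print(f"20-3-1:  {predict_usage(*months):.4f}")
print(f"86-13-4: {predict_usage(*months, weights=(86, 13, 4)):.4f}")
```

Normalised, 20-3-1 gives roughly 0.83/0.13/0.04 and 86-13-4 gives roughly 0.83/0.13/0.04 as well, which is why the two predictions barely differ.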

Also i think that if the collective stats system were implemented, rather than predetermining a set x% from before running it, it'd be better to run tests with different x% and take the one that seems better. And maybe do that for next month if a new x% becomes the best.

capefeather · Feb 19, 2011

I'm guessing that Super hasn't resolved the Smogon server's usage stat problems, but in case he has, I'd like to ask for more input on this. It seems to me that people are scared off by the math or whatever (I've gotten PMs to that effect), but that was just to explain how these random-looking formulas that no one seems to be able or willing to explain work. So far, we have examples of both branches of discussion possible here: Chou Toshio brought up a sort of philosophical preference, while coyotte helped to clarify some things about past discussions. I'm really hoping that something is decided here BEFORE stats come out, to get at the heart of the matter of what UU is or should be.

I'm starting to think that the 4th gen cutoff is fine, though maybe it needs to be raised. I don't think that consistent threat lists are all that useful other than maybe on RMTs, and a lot of the criticism of existing lists seems to be along the lines of "this Pokémon isn't seen enough to be OU". The prevailing problem with a hard cutoff has always been its effects on Pokémon who toe the line, but unless we make a sort of "cutoff region" that a Pokémon has to cross completely to change tiers, I'm not sure that it's worth the trouble to try to "fix" this.
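For what it's worth, a "cutoff region" is cheap to express. The sketch below treats it as a band of width 2 x margin around the cutoff; the function name, the tier labels and the margin value are assumptions for illustration only.

```python
def update_tier(current_tier: str, usage: float, cutoff: float, margin: float) -> str:
    """Change tier only when usage has crossed the whole band
    [cutoff - margin, cutoff + margin]; otherwise keep the current tier."""
    if current_tier == "UU" and usage > cutoff + margin:
        return "OU"
    if current_tier == "OU" and usage < cutoff - margin:
        return "UU"
    return current_tier


# A Pokémon hovering around a 3.41% cutoff with a hypothetical 0.5% margin.
tier = "OU"
for month_usage in (0.035, 0.033, 0.036, 0.028, 0.036, 0.040):
    tier = update_tier(tier, month_usage, cutoff=0.0341, margin=0.005)
    print(f"{month_usage:.3f} -> {tier}")
```

A Pokémon that wobbles just above and below the cutoff stays where it is; only a clear move out of the band changes its tier.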

I just find it a bit aggravating to see various complaints about existing tier cutoffs from time to time and then see exactly TWO people actually take this opportunity to make a public statement of their opinions. There are actually a lot of things to talk about. Do you agree with my interpretations of the cutoffs? Is there a different cutoff altogether to propose? Did I miss something important from past discussions? There's not much to talk about on my end if no dialogue is made.

If people who aren't in PR and who have PMed me give permission to post our conversations, I can start with that. (Actually, there's only one of that.)

jrrrrrrr · Feb 19, 2011

Since you're asking for opinions...

capefeather said:
The prevailing problem with a hard cutoff has always been its effects on Pokémon who toe the line, but unless we make a sort of "cutoff region" that a Pokémon has to cross completely to change tiers, I'm not sure that it's worth the trouble to try to "fix" this.

I like this idea: a hard arbitrary cutoff point for being OU at first, then a couple of months above that point to be considered OU again. It sounds like it would slow down the flickering issue that was brought up in the other thread, and it would keep things more stable overall. Instead of just one arbitrary cutoff point (other than the initial one, of course), make a Pokémon bridge the gap to really become OU instead of just having one fluke month and flickering.

coyotte508 · Feb 19, 2011

Well then i'll try to explain the collective usage stats thing without using too much math ^^

It relies a lot on teammates. Since in the end pokemon will only be counted in the statistics for their participation in teams that are pure OU, pokemon that are often seen as partners of OU pokemon will have a better chance of being OU even if they have lower usage than, say, Charizard, which is used by noobs but alongside very rarely used pokemon.

If I were to take an example, with that following batch of stats:

#65 - Hihidaruma (2.94%)
#66 - Charizard (2.93%)
#67 - Ninjask (2.92%)
#68 - Milotic (2.91%)
#69 - Chansey (2.91%)
#70 - Empoleon (2.85%)

Chansey and Empoleon would be more likely to be OU than say Charizard or Ninjask as they're often paired up with full OU teams while Charizard and Ninjask are used with other low usage pokemon, like gallade umbreon etc.

Of course this won't always produce perfect results, but it should be able to skip some "noobish" pokemon. At the same time, I can't tell how different the results would be from a regular cutoff. Also, this method will be extremely new: as opposed to having a pokemon percentage cutoff (like "4%!"), it will be a full OU teams cutoff that takes full teams into account. That's why I think that we can't decide on a limit before seeing sample results of the algorithm.

Also, even though a lot of math was used in the OP, when implementing it math will not be used as much. It will be a matter of the usage stats program trying different combinations of OU pokemon and choosing the one combination that fills enough usage with the lowest number of pokemon possible. Several optimizations / assumptions will make the program find the solution somewhat efficiently.
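A rough idea of what such a search could look like is sketched below. It is a greedy heuristic, not necessarily what the actual program would do and not guaranteed to find the smallest possible set: it starts with every pokemon as OU and keeps dropping the one that appears in the fewest remaining full-OU teams, as long as at least a fraction x of all teams stays fully OU. The function name and the toy two-member teams are made up.

```python
def greedy_collective_ou(teams, x):
    """Greedy heuristic: shrink the OU set while at least a fraction x of
    all teams is still made up entirely of OU pokemon."""
    team_sets = [frozenset(t) for t in teams]
    ou = set().union(*team_sets)
    alive = list(team_sets)  # teams still made up entirely of OU pokemon
    needed = x * len(team_sets)
    while ou:
        # Count appearances of each OU pokemon in the surviving teams.
        counts = {p: 0 for p in ou}
        for t in alive:
            for p in t:
                counts[p] += 1
        # Drop the least represented pokemon (alphabetical on ties).
        victim = min(sorted(ou), key=lambda p: counts[p])
        remaining = [t for t in alive if victim not in t]
        if len(remaining) < needed:
            break  # every other removal would lose at least as many teams
        ou.remove(victim)
        alive = remaining
    return ou


# Made-up data: Charizard has more usage than Milotic but weaker partners.
teams = (
    [("Chansey", "Empoleon")] * 3
    + [("Chansey", "Milotic")] * 2
    + [("Charizard", "Ninjask"), ("Charizard", "Gallade"), ("Charizard", "Umbreon")]
)
print(sorted(greedy_collective_ou(teams, 0.6)))
```

In this toy data Charizard is on 3 of 8 teams and Milotic on only 2, yet Milotic stays and Charizard is skipped, because none of Charizard's teams are fully OU.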

Edit: Now that I listed the pros there are a few things not to overlook:

* This system assumes that standard teams are composed of 6 OU pokemon. This is not really a problem here, as this is how teams are made in standard play.
* Limiting ourselves instead to battles between 1000+ rated players may or may not be more effective, but it'd really flush out a lot of pokemon that aren't fit for OU. This is much simpler than the collective usage stats solution and could be used instead for the same results if there is a lack of programming availability.

JabbaTheGriffin · Feb 20, 2011

not going to lie haven't read anything in this topic but i'm going to throw my cim cents in anyway

keep the same formula we used in gen 4 but replace t = 20 with t = 10

it effectively will probably change OU from around 45 mons to 30, which i feel is a better mark (last gen you ended up with shit like ninjask and vire being OU when clearly they weren't)

alternatively we could use weighted stats...but i'd still like only ~30 ou mons

MapleSandwich · Feb 20, 2011

JabbaTheGriffin said:
not going to lie haven't read anything in this topic but i'm going to throw my cim cents in anyway

keep the same formula we used in gen 4 but replace t = 20 with t = 10

it effectively will probably change OU from around 45 mons to 30, which i feel is a better mark (last gen you ended up with shit like ninjask and vire being OU when clearly they weren't)

alternatively we could use weighted stats...but i'd still like only ~30 ou mons

I can understand your feelings regarding stupid shit being OU, but at the same time there are a lot more things that are OU viable this gen (especially in Dream World, just look at Sheer Force Feraligatr being #1 in usage). Basically anything we do for the first OU cutoff is going to be arbitrary because we don't have the stats for this metagame, so we don't know where OU is at this point.

I think it'd be easier to start with more liberal than less liberal, and then find a point where we can agree is clearly OU. It's easier to say (as an example, this in no way reflects my opinion) "Infernape is clearly not OU" after we've set the mark below its usage, than to say "Is Infernape OU?" after we've set the mark arbitrarily high above its usage.

That said I really don't see the difference between X-Act's and coyotte's methods other than more numbers. Both give you a set, percentage cutoff number. Collective cutoff is the only thing that actually seems set to determine, based on collective usage of all Pokemon, whether or not Infernape is actually OU. However, it also discourages people from using Pokemon with new strategies in place from UU, because then their stats won't be taken as heavily into account, due to their usage of UU Pokemon.

I think as long as we make it clear that collective cutoff in no way mitigates the impact that an individual using an original strategy has on the metagame (we're going to notice if you get to no. 1 on the ladder using Emboar), it is the best way to go.

eric the espeon · Feb 20, 2011

I'd be hesitant to decide to go with collective stats until we have a good idea of what the results look like. Some Pokemon will naturally be used with Pokemon which have a low usage, but may still be fairly effective. Ninetales is the strongest example I can see right now. Almost all the top sun abusers have fairly low usage, which would drag Ninetales down. Perhaps its position in the late 30s would keep it in OU, but you can see the kind of side effects that going with collective stats could have, driving more effective and common Pokemon down simply because they work better with lower usage Pokemon while keeping less effective/common Pokemon which happen to be used by players who load their teams with standards. As for the region to pass through to cause tier change, yes. Let's have that and update more frequently. More up to date tiers without the instability.
