TL;DR -- Here's what's wrong
- The tiering cutoff definition allows a pokemon to be both truly OU and not designated OU, because the percentage cutoff is derived assuming unweighted usage stats while the percentages themselves are calculated using weighted stats (i.e. the cutoff calculation method is outdated by a decade).
- The weight system overvalues new players and lets poorly performing new players attain weights (at least 0.05) that have a noticeable impact at a large scale.
- The system has no mechanism to stop masses of bad players with mediocre weights from making a garbage pokemon appear equal to a good pokemon in a tier.
- I propose changing the weight formula and bucketing weights into tiers (similar to histogram bins) to fix these problems.
Dear Smogon community,
I've spent the past 1.5 years researching the usage stats, and Antar's weighted stat system has some serious problems that don't seem to be well known. There are two main issues: the usage calculation method and the cutoff justification calculation.
There are two parts here -- the weight calculation formula and the method for summing weights. Remember that new players start with a Glicko rating of R = 1500, RD = 130. Glicko ratings are also not Elo; they are displayed to the right of Elo on a player's ratings.
Recall from Antar's FAQ that weights are calculated by applying the normal distribution's cumulative distribution function to Glicko ratings. The problem is that this function strongly overvalues high rating deviation when calculating weights. This can be seen by comparing the following two ratings against the 1630 cutoff: P1 has R = 1480, RD = 100 and P2 has R = 1580, RD = 25. P1 has a weight of 0.0668 and P2's is 0.0228. If we compare GXE values though, we see P2 is likely much better: P2's GXE is 60.5 and P1's is 47.4. Yet the usage stats think P1 is much better just because of its higher variability.
The second problem comes from directly summing weights to determine usage. With enough 0.05 weights, which are attainable by players who lose more than they win (a 50% win rate is usually enough to maintain R = 1500), bad players can easily overwhelm good players. There's no cap on the amount of weight that bad players can contribute because all weights are directly summed. The system incorrectly assumes their weights will be too low to matter.
When I looked at UU player ratings, there was a ratio of at least 100 : 1 for players with weights below 0.1 to those above. Also, one can spam bots with new accounts that play each other to abuse this, using residential proxies at the cost of a few dollars to bypass any IP restrictions.
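To make the P1 vs P2 comparison easier to check, here is a rough Python sketch. It assumes the weight is the normal CDF of (R - 1630) / RD, and that the GXE figures above are the standard Glicko expected score against a reference player with R = 1500, RD = 130. Both assumptions reproduce the numbers quoted in this post, but the function names are mine and this is not Antar's actual script.

```python
import math

CUTOFF = 1630  # rating baseline used throughout this post

def antar_weight(r, rd, cutoff=CUTOFF):
    """Normal CDF of (r - cutoff) / rd, i.e. the chance that the player's
    'true' rating sits above the cutoff (assumed form of the weight)."""
    return 0.5 * (1.0 + math.erf((r - cutoff) / (rd * math.sqrt(2.0))))

def expected_score(r, rd, ref_r=1500, ref_rd=130):
    """Glicko expected outcome (as a percentage) against a reference player.
    Assumed here to be how the GXE values in the post were obtained."""
    q = math.log(10) / 400
    g = 1.0 / math.sqrt(1.0 + 3.0 * q ** 2 * (rd ** 2 + ref_rd ** 2) / math.pi ** 2)
    return 100.0 / (1.0 + 10 ** (-g * (r - ref_r) / 400))

p1 = (1480, 100)  # lower rating, high deviation
p2 = (1580, 25)   # higher rating, low deviation

for name, (r, rd) in (("P1", p1), ("P2", p2)):
    print(name, "weight:", round(antar_weight(r, rd), 4),
          "expected score:", round(expected_score(r, rd), 1))

# Direct summing: number of 50% win rate players (R = 1500, RD = 100) whose
# summed weight matches one hypothetical strong player (R = 1800, RD = 50).
low = antar_weight(1500, 100)
high = antar_weight(1800, 50)
print("low-weight players needed:", math.ceil(high / low))
```

The weights come out in the opposite order to the expected scores, which is the inversion described above. The R = 1800, RD = 50 player is only an illustrative choice to show how few low-weight accounts are needed when weights are summed without a cap.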
All of the cutoffs past gen 4 are calculated incorrectly. For gen 9, the dex mentions
"A Pokémon is truly OU if a typical competitive player is more than 50% likely to encounter that Pokémon at least once in a given day of playing (15 battles)."
However, 4.52% weighted usage doesn't represent a 4.52% encounter rate. That 4.52% could be made up of 1000 bad pokemon or 1 good pokemon because a ton of low weights will stack up to be equal to a high weight. Thus, we have no idea what 4.52% usage actually means.
4.52% is calculated by assuming stats represent encounter rates and are therefore unweighted. We can see this by using the following formula to calculate just above 50%
Code:
1 - (1 - 0.0452)^15 = 0.5003
Where the term in the parens represents the probability of not seeing a pokemon, and the exponent says that we don't see a pokemon 15 times in a row. More details on the old formula are here.
This leads to pokemon like gen 8's regieleki that can be both truly OU and not designated OU.
Also even if the low weights were ignored somehow, it's important to remember that weights from above average players (>0.5) can vary widely. Some are very close to 1 and others are close to 0.5, but according to Antar, both should matter (hence why the 1630 cutoff is used). This means that no matter how you put it, weighted stats cannot be used to justify cutoffs.
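For anyone who wants to verify the 4.52% figure, here is a small Python sketch of the cutoff arithmetic. It only holds under the assumption that usage is a true per-battle encounter rate and that battles are independent, which is exactly the assumption weighted stats break. The function names are made up for illustration.

```python
def encounter_probability(usage, battles=15):
    """Chance of seeing a pokemon at least once in `battles` games,
    assuming `usage` is an unweighted per-battle encounter rate."""
    return 1.0 - (1.0 - usage) ** battles

def cutoff_for(target=0.5, battles=15):
    """Per-battle rate at which the encounter probability equals `target`."""
    return 1.0 - (1.0 - target) ** (1.0 / battles)

print(round(encounter_probability(0.0452), 4))  # 0.5003
print(cutoff_for() * 100)  # just under 4.52 (percent)
```

Inverting the formula gives a rate slightly below 4.52%, so 4.52% is the rounded-up value that lands just above the 50% mark.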
I have a suggested fix, but it would require changing the usage calculation scripts (but not the data needed for them).
First, change the weighting formula to use the same formula that GXE uses. GXE uses a reference player with R = 1500, RD = 130, so you could change it to be R = 1630, RD = 0 for a cutoff of 1630.
Set buckets for weights based on ranges. You could have a category for weights from 0 - 0.1, 0.1 - 0.5, and 0.5 - 1. Each bucket calculates its own usage stats the same way as they are done now. Then force buckets to have a certain contribution percentage (e.g. force 0 - 0.1 to have 5% contribution so low weight spam no longer works). Then calculate overall usage per pokemon by taking the expected value of its buckets.
The great thing about buckets is that you can go back to using unweighted entries within buckets since the importance is already enforced by what bucket each entry falls into. This would allow stats to be based on encounter rates, while also limiting the contribution of bad players. It means people can actually understand stats, and the stats will be robust against abuse.
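Here is a minimal Python sketch of the proposal, to show the idea rather than a finished implementation. The bucket ranges and the 5% share for the lowest bucket come from the text above. The 25% and 70% shares, the renormalisation over non-empty buckets, the one-pokemon teams, and all function names are assumptions of mine. The weight function assumes the GXE-style formula is the Glicko expected score against a reference player with R = 1630, RD = 0.

```python
import math
from collections import Counter, defaultdict

def proposed_weight(r, rd, ref_r=1630, ref_rd=0):
    """Glicko expected score against the reference player (assumed form)."""
    q = math.log(10) / 400
    g = 1.0 / math.sqrt(1.0 + 3.0 * q ** 2 * (rd ** 2 + ref_rd ** 2) / math.pi ** 2)
    return 1.0 / (1.0 + 10 ** (-g * (r - ref_r) / 400))

# (low edge, high edge, forced contribution). Only the 5% share is from the
# post; the 25% and 70% shares are hypothetical placeholders.
BUCKETS = [(0.0, 0.1, 0.05), (0.1, 0.5, 0.25), (0.5, 1.0, 0.70)]

def bucket_index(weight):
    """Return the index of the bucket whose range contains `weight`."""
    for i, (lo, hi, _share) in enumerate(BUCKETS):
        if lo <= weight < hi:
            return i
    if weight == BUCKETS[-1][1]:  # a weight of exactly 1 goes in the top bucket
        return len(BUCKETS) - 1
    raise ValueError(f"weight out of range: {weight}")

def bucketed_usage(teams):
    """teams: iterable of (player_weight, list_of_pokemon), one per team.
    Usage inside each bucket is unweighted; buckets are then combined
    using their forced contribution shares."""
    counts = defaultdict(Counter)  # bucket -> pokemon -> teams using it
    totals = Counter()             # bucket -> number of teams
    for weight, pokemon in teams:
        b = bucket_index(weight)
        totals[b] += 1
        counts[b].update(set(pokemon))
    # Assumption: shares are renormalised over buckets that have any teams.
    share_sum = sum(BUCKETS[b][2] for b in totals)
    usage = Counter()
    for b, n in totals.items():
        share = BUCKETS[b][2] / share_sum
        for mon, c in counts[b].items():
            usage[mon] += share * c / n
    return dict(usage)

# 1000 low-weight spam teams vs a handful of stronger players.
teams = ([(0.05, ["Ambipom"])] * 1000
         + [(0.3, ["Ambipom"])] * 5 + [(0.3, ["Garchomp"])] * 5
         + [(0.9, ["Garchomp"])] * 10)
print(bucketed_usage(teams))
```

With direct weight summing, the same made-up data would give Ambipom 51.5 of the 62 total weight (about 83%). With buckets, the 1000 spam teams are capped at the lowest bucket's 5% share, so Ambipom ends up at 17.5% overall. Under `proposed_weight`, the P2 player from earlier (R = 1580, RD = 25) also ranks above P1 (R = 1480, RD = 100), as you would want.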
There's also a very weak fix that is bad but easy to implement.
Lower the deviation threshold that players must reach before they receive any weight. Currently it's 100, which takes about 5 battles to reach. This would help a bit, but it doesn't solve the root issue and it also cannot logically be below 60. It takes about 30 battles to reach a deviation of 60, which is notably also the minimum number of battles required to meet suspect test requirements. Putting it below 60 would be saying that a person who qualifies for suspect reqs is not worthy of contributing to the usage stats, which makes no sense.
If the threshold is 60 though, a 50% win rate player can still achieve a weight of 0.015, which only needs about 100 bad players for every good player to be even, so it does a crappy job.
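As a quick check of the 0.015 figure, this Python sketch evaluates the weight of a player sitting at R = 1500 for a few deviation thresholds. It assumes the weight is the normal CDF of (R - 1630) / RD, and the function name is just for illustration.

```python
import math

def weight_at(r, rd, cutoff=1630):
    """Assumed weight formula: normal CDF of (r - cutoff) / rd."""
    return 0.5 * (1.0 + math.erf((r - cutoff) / (rd * math.sqrt(2.0))))

# A 50% win rate player (R = 1500) sitting right at each RD threshold.
for rd_limit in (100, 80, 60):
    print(rd_limit, round(weight_at(1500, rd_limit), 4))
```

Tightening the threshold from 100 to 60 shrinks this player's weight a lot, but it stays above zero, so enough of these accounts can still add up.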
Realistically, we'd need an event like Ambipom to the Top (except via bots) to trigger any serious action, especially considering Zarel's inactivity, but it's still important to understand how to view the current usage stats critically -- the percentages alone are not enough to assess a pokemon's viability or frequency within a metagame. If you have any questions, please feel free to ask here or message me on discord (same username).
If you would like to know more, I wrote a paper on the subject and also made a video to help explain the problems. These are just supplements to the post itself though.
Video
Paper
https://drive.google.com/file/d/1GEtjBAUg_9PgdHA52cBX2BJjkZDtEMdB/view?usp=drive_link
Also thanks to pre for helping me understand some of the usage stat mechanisms as I did research.