The OU List.

imperfectluck

Banned deucer.
So, thanks to X-Act's calculations, we have a new, updated list of 49 Pokemon, for January. But, is 49 Pokemon really necessary for the OU threshold that we've been using?

Electivire is classified as OU just like Zapdos is, but in my mind Zapdos clearly outclasses Electivire for team building in almost every way. Our original way of calculating the OU list is a somewhat arbitrary method of measuring usage. While reducing the size of the OU list is also somewhat arbitrary, I would like to propose a new size of about 30 or so. While this may be a controversial statement to make, I believe that the OU list should reflect the best of the best, not stragglers like Ninjask and Dusknoir that really don't belong on the best players' teams and should properly be classed as BL: Pokemon that can perform somewhat well in OU. I'd like to hear other people's thoughts on this.
 
My only question would be where would the pokemon who are no longer OU go? Would they be included in the UU test or would they be put into BL (probably the former but I figured I should ask)?

Other than that it just seems to be a preference thing; do you consider pokemon "decent" (edit- aka pokemon in the, say, PorygonZ/Ninjask range, or what some people call "Low OU") in the standard metagame to be OU or do you not?
 
I think the list is perfect right now. X-Act has a great formula for OU that works every time, and OU is based purely on usage, not on how well something performs. If Beedrill were used on every team for the next 3 months then it would be OU; of course, that is just not going to happen.

From the way you say it, you seem to think that pokemon that aren't that strong such as Donphan or Electivire shouldn't be OU.
 

JabbaTheGriffin

Stormblessed
I'm not really sure how OU is calculated and I'm kind of too lazy to find out how. But just looking at Doug's statistics, if I'm reading them correctly, we have Pokemon that are used on less than 1 in 25 teams in OU. That seems ridiculous. I mean it may not be, I'm not really well versed in the statistics of it all, but I don't see how a Pokemon that is on 1 in 25 teams can be considered "overused".
 
Do you remember when Colin first gave us statistics he had two different statistics: "rated" and "unrated"?

I think that if Doug could get us the "rated" statistics (which are affected by the people using them as well as the amount they are used), we would probably find out that a lot of the players that keep Ninjask, Electivire, and Dusknoir OU are mostly newcomers that don't get too high on the ladder, but there are enough of them that they could actually skew the "unrated" results.
 

X-Act

np: Biffy Clyro - Shock Shock
So, thanks to X-Act's calculations, we have a new, updated list of 49 Pokemon, for January. But, is 49 Pokemon really necessary for the OU threshold that we've been using?

Electivire is classified as OU just like Zapdos is, but in my mind Zapdos clearly outclasses Electivire for team building in almost every way. Our original way of calculating the OU list is a somewhat arbitrary method of measuring usage. While reducing the size of the OU list is also somewhat arbitrary, I would like to propose a new size of about 30 or so. While this may be a controversial statement to make, I believe that the OU list should reflect the best of the best, not stragglers like Ninjask and Dusknoir that really don't belong on the best players' teams and should properly be classed as BL: Pokemon that can perform somewhat well in OU. I'd like to hear other people's thoughts on this.
Okay. I hope I'll answer you with how I answer the quoted people below.

I'm not really sure how OU is calculated and I'm kind of too lazy to find out how. But just looking at Doug's statistics, if I'm reading them correctly, we have Pokemon that are used on less than 1 in 25 teams in OU. That seems ridiculous. I mean it may not be, I'm not really well versed in the statistics of it all, but I don't see how a Pokemon that is on 1 in 25 teams can be considered "overused".
Blame it on the other policy reviewers.

If you might lower your laziness a bit and actually look at how I made the OU calculations (no hard feelings btw), I asked the policy reviewers clearly as to where to draw the line. I drew the line at T = 20 only because the majority wanted to. T = 20 means that Pokemon that are listed at least once in every 20 teams make it to OU. If you look at that thread, I mentioned explicitly that I don't agree with this number!

Also, I'd dispute your point that the OU list has Pokemon that are used in less than 1 in 25 teams, as OU is made explicitly to contain those Pokemon that are used in at least one in 20 teams (meaning that they are more likely than not to appear in at least one of 20 teams).

Do you remember when Colin first gave us statistics he had two different statistics: "rated" and "unrated"?

I think that if Doug could get us the "rated" statistics (which are affected by the people using them as well as the amount they are used), we would probably find out that a lot of the players that keep Ninjask, Electivire, and Dusknoir OU are mostly newcomers that don't get too high on the ladder, but there are enough of them that they could actually skew the "unrated" results.
This. I'm sure that if we use rated statistics, the OU list would be better. In fact, I was going to post something about rated statistics myself, but you beat me to it.

A very simple way of making rated statistics is to only count the Pokemon of the winning team, instead of those of both teams. Another slightly more complex way is to skew the winning Pokemon's usages by the rating of the opponent.
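To make the two ideas above concrete, here is a rough Python sketch. It is only an illustration under assumptions: the battle records, the field names `winner_team` and `loser_rating`, and both function names are made up, and multiplying by the opponent's rating is just one possible reading of "skewing" the usages.

```python
from collections import Counter

# Hypothetical battle records; the field names are assumptions for this sketch.
battles = [
    {"winner_team": ["Zapdos", "Scizor"], "loser_rating": 1500},
    {"winner_team": ["Zapdos", "Heatran"], "loser_rating": 1200},
]

def winner_only_usage(battles):
    """Count one usage per Pokemon on the winning team; losing teams are ignored."""
    counts = Counter()
    for battle in battles:
        counts.update(battle["winner_team"])
    return counts

def opponent_weighted_usage(battles):
    """Credit each winning Pokemon with the rating of the opponent it beat."""
    points = Counter()
    for battle in battles:
        for pokemon in battle["winner_team"]:
            points[pokemon] += battle["loser_rating"]
    return points

print(winner_only_usage(battles))
print(opponent_weighted_usage(battles))
```

Either result could then be fed into the same OU cutoff that the plain usage counts go through now.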

The problem is that Doug is reluctant to release 'rated' statistics, probably because, understandably, he has better things to do.
 

Jumpman16

np: Michael Jackson - "Mon in the Mirror" (DW mix)
To add to that, I'll quote an important part of one of your September posts:

To be honest, the number of Pokemon in OU is only important so that we know from where to start BL/UU. There's nothing important about OU other than that. The important tiers are Uber and BL (and possibly a ban tier for NU).
Given both this and the fact that the New UU movement is already in full swing, I think the timing of this thread is a little off. If the "#21-50 pokemon" (pokemon that may not be "OU" by definition) aren't included in the New UU test, so be it. The only reason this matters as far as I'm concerned is that UU may be considering some 15-30 less pokemon than it should be, but I think that besides the fact that it's too late, we're making great strides to finally fix UU, and that 50 may be a fine number anyway. If we arrive at a better OU list in the next few weeks or months, great. We can then throw the "#21-50 pokemon" into UU and see if they're really UU like we're doing now with UU. I don't think now's the time to worry about it though.
 

imperfectluck

Banned deucer.
T=10 looks about the right number of Pokemon I'd like to aim for for Pokemon that actually belong in OU.

Given both this and the fact that the New UU movement is already in full swing, I think the timing of this thread is a little off. If the "#21-50 pokemon" (pokemon that may not be "OU" by definition) aren't included in the New UU test, so be it. The only reason this matters as far as I'm concerned is that UU may be considering some 15-30 less pokemon than it should be, but I think that besides the fact that it's too late, we're making great strides to finally fix UU, and that 50 may be a fine number anyway. If we arrive at a better OU list in the next few weeks or months, great. We can then throw the "#21-50 pokemon" into UU and see if they're really UU like we're doing now with UU. I don't think now's the time to worry about it though.
I'm not particularly concerned about seeing the "lower end OUs" thrown into the New UU movement immediately, they can be sorted into BL and tested later, or be tested while the New UU movement is still fresh, I'd just like to see an OU list that represents what I feel should be an accurate definition of what is exactly "OU."
 
Apologies in advance if I'm somewhat hijacking this thread with my wall of text, but the 'rated statistics' discussion I find especially important.

I think we need to make a distinction based on usage vs viability, since they do not necessarily go hand in hand. The OU tier is based strictly on overall usage, and should remain that way. However, unweighted statistics cannot tell us everything we need to know about how healthy the standard metagame is. I really believe we make a mistake when we consider a rise in the number of pokemon which are used on one in twenty teams to mean that the metagame is by definition 'better'. We need to really weigh the statistics in order to determine that, as competitive games are balanced around the pinnacle of play - not the average of all players.

For instance, if Electivire is used by 0 players in the top hundred, can we reasonably say it contributes to diversity? Or maybe something less extreme - if, when considering the same hundred players, there are more instances of Porygon2 (which is UU now) than there are Electivire, what does that tell us?


I think we need to ask ourselves: What questions do we want the statistics to answer?


Question 1: Which pokemon are Overused?

This is mostly where we draw the line based on the statistics and why.

So, where should the cutoff line be placed? How much do we take into account estimations of usage in the past metagames OU was based on in the first place? Is T=20 a valid choice? If so, is it okay if the top pokemon in OU is used over 7 times more often than the bottom one? Should OU pokemon cover over 75% of the used pokemon in standard as they currently do? Even T=10 covers well over 50%. This is where someone who is much more versed in statistics than I am needs to take over. These aren't rhetorical questions - I'm really not sure if the numbers are acceptable or not.

Question 2: How many pokemon are competitively viable?

Here, we have removed the distinction of overall use and can focus strictly on true diversity. This is where weighted statistics are necessary, or perhaps only statistics from the top 100 battlers or so are necessary. If we have a picture of what the leaderboard looks like as far as usage, we can get a good idea of which pokemon are actually viable in high level play.

For example: Porygon-Z is OU, but can I use it on a team and hit top 10-20 on the leaderboard? It has serious trouble switching in on anything due to its typing and weak defensive stats, and its speed lets it down for sweeping purposes. It can run a scarf set, but then becomes much easier to wall. It's incredibly powerful in some circumstances, but is it viable when you're dealing with the top teams? Finding out how many times P-Z is actually featured in top-level games can help us answer questions like these, and in a way that nothing short of asking IPL and other top battlers to go make a team with X pokemon and try to hit the top can.

ST5 has a lot of potential to help out in that aspect. Will the best teams be utterly standard, or will they feature pokemon we thought weren't as viable?

The main concern with weighing things is that the better players often react to the metagame faster, so it becomes tough to gauge just exactly what it means when 75% of the top few hundred players are using both Scizor and Heatran (made up statistic, but not really stretching). Will those two pokemon be replaced on these players' teams in a month or two with threats which are covered less at that time? Is this a predictable trend which is healthy for the metagame? If it turns out the top teams rarely ever contain pokemon outside of a group of 20, for a period of 6 months, what would that say about how many pokemon are truly viable? We could theoretically have a 50 pokemon OU with only half of them being usable once you hit leaderboard level.

Question 3: What is suspect?

This one is obviously more difficult to assess. However, I feel it tells us exactly how weighted statistics can be paired with detailed ones in order to help determine suspects. Specifically, DJD's recent post on predictability opens up use of the detailed statistics to see just how unpredictable the most used pokemon are.

We've already seen how long it can take for us to realize pokemon are suspect, and, as much as it doesn't seem like it should matter, overall usage is a real factor in determining how long something sticks around before officially becoming suspect. This happened for quite some time with Wobbuffet. Many people believed that it wasn't worth considering Wobb uber as long as his usage remained lower than many other pokemon. Of course, that stance completely disregarded the many other factors at hand: unwillingness to use a taboo pokemon ('boring' as well); loyalty toward certain websites' tiering; and most importantly, disparity of use between the good and bad players (not to mention correct use as well). If everyone had seen the ratio of top players using Wobb to the ratio of those who weren't, how much faster would it have been made suspect?

Obviously, Wobbuffet serves as a striking example of when predictability does not actually make a pokemon any less viable. Of course, it doesn't have much of a movepool though. Garchomp is a better case for that, as detailed statistics would have allowed us to see exactly how predictable his moveset was becoming over the months. At that point it would have been easy to see that Garchomp's viability was increasing even though he was becoming more predictable. That, combined with the fact that 'there was no end in sight', would make Garchomp an easy suspect.

An interesting subset of this method is that of lead pokemon. We can see that Aerodactyl usage has been rising substantially, and also that he is without a doubt a shining exemplar of predictability (seriously). However, scarf Jirachi has jumped by a huge margin in order to counter Aerodactyl's rise (it doesn't hurt that it stops the number 1 and 2 leads either). Since Aerodactyl can't do anything to counter this other than scarfing itself, which defeats its entire purpose as a lead to begin with, we can consider it standard metagame behavior that his usage as a lead will likely stop increasing as we continue to see more Jirachis leading. In this way, the lead metagame is the microevolution to the overall metagame's macroevolution.


-----


I wanted to address the specific OU-BL-UU issues separately from weighted statistics, so there wouldn't be any confusion.

Porygon2 is a great example for this, because it is certainly reasonable that it will be considered balanced for play in UU, but at the same time it handles so many top tier OU threats that it could also possibly break through to OU based on usage (it was within less than 500 of Donphan in December, placing 54th). Not too long ago, this exact same thing happened as Aerodactyl filled a niche in OU as a fast taunt/SR lead.

Having a pokemon switching between OU and UU begs the question 'What is BL?', which is pretty damn important right now. Is the definition the same as Uber, or does it need more specific restraints? If BL are simply pokemon which 'break' UU, then UU itself needs a clear definition. Does UU need to be distinct from OU? Does it need to be 'fun', or does competition reign as the most important aspect?

If Aerodactyl usage drops off, but remains steady at a position which is slightly below the cutoff line for OU, will it be tiered back in UU? This creates a potential situation wherein Aero would be used quite a bit in both tiers, and likely for quite the same function (some of the previously BL pokemon are having this happen as we speak). I'm not sure this is a problem, but it certainly is 'weird', and would alienate a lot of the traditional UU crowd. If closeness to OU is a concern, I could definitely see how BL could become a 'limbo' tier with a top and bottom set along the lines of: All pokemon which are too powerful for UU, or have a usage ratio of between T=20 and T=30. Previously, it seems that BL has acted that part without ever being defined as such.
 

JabbaTheGriffin

Stormblessed
I don't pay much attention to the statistics threads and deciding on how OU is made and all that, but I guess in this case I should have been there before to argue for T=10 or 11
 
I'm not really sure how OU is calculated and I'm kind of too lazy to find out how. But just looking at Doug's statistics, if I'm reading them correctly, we have Pokemon that are used on less than 1 in 25 teams in OU. That seems ridiculous. I mean it may not be, I'm not really well versed in the statistics of it all, but I don't see how a Pokemon that is on 1 in 25 teams can be considered "overused".
This post actually makes some sense to me. I mean, if the OU list is supposed to contain Pokemon that are used at least once in 20 teams, then I would think that every Pokemon on the list is used on at least 5% of the teams, according to Doug's statistics. Is it wrong for me to think that?

I looked into this further by doing a little "experiment" of sorts:
  • I compiled Doug's Standard usage statistics over the past three months for each Pokemon on the current OU list on a worksheet, including the total number of battles for each month.
  • Then, I added X-Act's weights (1:3:20 for October, November, and December respectively) to the usage statistics for each Pokemon and the total number of battles for each. This gave me a weighted-average usage for each Pokemon and a weighted-average total number of battles.
  • Finally, I used Doug's method of calculating % usage on the data. I took the weighted-average usage for each Pokemon and divided that by 2 times the weighted-average total number of battles to get a weighted-average percentage usage for each Pokemon.
If an OU Pokemon is used at least once in 20 teams, then I would be looking for a weighted-average usage of at least 5% for a Pokemon to be considered OU.
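The three steps above can be sketched in a few lines of Python. This is a minimal sketch, not the actual worksheet: the function names are made up for illustration, and any numbers passed in would have to come from Doug's monthly statistics, which are not reproduced here.

```python
# Weights for October, November, and December, as described above.
WEIGHTS = (1, 3, 20)

def weighted_average(values, weights=WEIGHTS):
    """Weighted average of one value per month."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def weighted_usage_pct(monthly_usage, monthly_battles, weights=WEIGHTS):
    """Weighted-average usage divided by 2 x weighted-average battles.

    The factor of 2 is there because every battle has two teams.
    """
    usage = weighted_average(monthly_usage, weights)
    battles = weighted_average(monthly_battles, weights)
    return usage / (2 * battles)

def passes_cut(monthly_usage, monthly_battles, cutoff=0.05):
    """True if the Pokemon is on at least 5% of teams (once in 20 teams)."""
    return weighted_usage_pct(monthly_usage, monthly_battles) >= cutoff
```

Calling `passes_cut` with a Pokemon's three monthly usage counts and the three monthly battle totals gives the yes/no answer used for the count below.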

The number of OU Pokemon with this method: 40.

Really? That's quite a drop-off from the actual list, but it happens to equal the number of Pokemon that were used at least 5% of the time in December, so I guess it makes sense.

Here are the nine current OU Pokemon that fail to make this cut:

Alakazam
Donphan
Dragonite
Dugtrio
Empoleon
Ninjask
Porygon-Z
Rhyperior
Yanmega

To get the current number of OU Pokemon with this method, you would have to assign a "T value" of 33! O_O

Interestingly enough, here is how big this list would be based on the last three months if you used different T values:
Code:
T         # OU Pokemon
10             17
11             19
12             22
13             23
14             27
15             27
16             31
17             33
18             34
19             36
20             40
21             40
22             41
23             42
24             42
25             44
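A table like this can be produced with a short loop. The sketch below assumes the same simple criterion as the 5% cut above, i.e. a Pokemon counts at a given T if its usage percentage is at least 1/T; the usage values are invented placeholders, not the real statistics.

```python
def count_ou(usage_pcts, T):
    """Number of Pokemon whose usage fraction is at least 1/T."""
    return sum(1 for pct in usage_pcts if pct >= 1 / T)

# Made-up usage fractions purely for illustration.
example_pcts = [0.30, 0.12, 0.08, 0.06, 0.045, 0.03]

for T in range(10, 26):
    print(T, count_ou(example_pcts, T))
```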
All that said, I would much rather see "rated" or "weighted" usage statistics. Whether I want weights applied to every Pokemon based on the ratings of the battlers using them or just winning battlers' usages count is something I haven't made up my mind on yet.
 

X-Act

np: Biffy Clyro - Shock Shock
When I say "used once in 20 teams", I'm actually simplifying the wording, but that's not _exactly_ what I want to say. So I would like to apologize for any confusion and explain exactly what T = 20 means. Feel free to skip this part if you don't want to know.

---------------------------------------
Okay, let me start with an example. Suppose I have a Pokemon that has probability p of being in a team, and I choose two teams at random. (Equivalently, the Pokemon is in (100p)% of teams.) What is the probability of that Pokemon appearing in at least one of those two teams? It is equal to 1 minus the probability of that Pokemon not appearing in either of those two teams. And how do we calculate this?

The probability of that Pokemon not appearing in a team is (1 - p). Hence, treating the two teams as independent, the probability of that Pokemon not appearing in either of those two teams is (1 - p) x (1 - p), or (1 - p)^2.

Therefore, the probability of that Pokemon appearing in at least one of those two teams is 1 - (1 - p)^2.

In a similar way, if the number of teams to choose from was three, the probability of that Pokemon appearing in at least one of those three teams is 1 - (1 - p)^3. And hence, the probability of a Pokemon appearing in at least one of T teams is 1 - (1 - p)^T.

Now if this probability exceeds 0.5, it would mean that it is more likely than not that the Pokemon appears in at least one of T teams. Hence we have:

1 - (1 - p)^T > 0.5
1 - 0.5 > (1 - p)^T
0.5 > (1 - p)^T
(0.5)^(1/T) > 1 - p
p > 1 - (0.5)^(1/T)

and this is the criterion I used to find what is OU. Also, remember that p is equal to the usage of that Pokemon, u, multiplied by 6 and divided by the sum of all usages, U (assuming full teams of six, U / 6 is the number of teams). So we have:

6u / U > 1 - (0.5)^(1/T)
u > U(1 - (0.5)^(1/T)) / 6

The only problem I had was to put a value for T. When I consulted with the policy reviewers, they wanted T = 20. 1 - (0.5)^(1/20) = 0.03406, so I ended up with:

u > U x 0.03406 / 6
u > U x 0.005677

which is equivalent to

u > U / 176.1407

There, now you know from where I got the mysterious number 176.1407.

Of course, any other value of T could have been chosen if so desired.
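For anyone who wants to try other values of T, the criterion above fits in a few lines of Python. This is a minimal sketch of the formula as written in this post, not the actual program used to build the list; the function names are invented for illustration.

```python
def usage_threshold(T):
    """Smallest per-team probability p satisfying 1 - (1 - p)^T > 0.5,
    i.e. the bound p > 1 - 0.5^(1/T)."""
    return 1 - 0.5 ** (1 / T)

def ou_divisor(T, team_size=6):
    """A Pokemon is OU when u > U / ou_divisor(T), where u is its usage
    and U is the sum of all usages."""
    return team_size / usage_threshold(T)

def is_ou(u, U, T=20):
    """Apply the cutoff u > U(1 - 0.5^(1/T)) / 6."""
    return u > U / ou_divisor(T)

for T in (4, 10, 15, 20):
    print(T, round(usage_threshold(T), 5), round(ou_divisor(T), 4))
```

For T = 20 this reproduces the 0.03406 and 176.1407 figures above.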

--------------------------------------------------------
With that out of the way, I now want to mention another thing. As I said in the original post in the OU thread, the number of frequently used Pokemon in Standard is really small. I defined 'frequently-used' by the formula above again except with T = 4, i.e. the Pokemon that are more likely than not to appear in at least one in 4 teams.

Now the problem is that the number of such 'frequently-used' Pokemon is usually only about 6 or 7. So do you people really want an OU list containing 6 or 7 Pokemon? I don't think so.

Hence, the OU list is really not the Pokemon that appear frequently in teams. As I said in that thread, the OU list is the Pokemon that do not appear rarely in teams. The policy reviewers thought that "a Pokemon appearing rarely" means that it is more likely than not that that Pokemon does not even appear in one in 20 teams. And I went with that.

On second thoughts, I might even start to agree with the policy reviewers about choosing T = 20, as Rhyperior, Empoleon, Alakazam truly do not appear rarely in teams. :)

Sorry for the long explanation, but I wanted to be as clear as possible in conveying my thought processes when I made the algorithm that determines OU.
 
For the weighted statistics, rather than ranking by usage, we rank by "points". Each usage scores the pokemon a number of points equal to its trainer's conservative estimated rating at the point when this list was compiled, with a minimum of one point per usage (so a person with a -300 rating using a pokemon still scores it one point).
I don't see a problem doing it this way, since it is accurate and to the point. I don't like the idea that only a "winner's" pokemon would count towards the score, because the way Shoddy Battle works, or so I believe, you are usually pitted against people with a rating similar to yours, so that means half the people would get cut out. It would be best for a pokemon to simply count double if two highly rated players use it against each other in a battle.
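The points scheme quoted above could look something like this in Python. It is only a sketch: the input format (a list of pokemon/rating pairs, one per usage) and the function name are assumptions, and how the conservative rating estimate itself is obtained is left out.

```python
from collections import defaultdict

def weighted_points(usages):
    """Sum points per pokemon from (pokemon, trainer_cre) pairs.

    Each usage is worth the trainer's conservative rating estimate,
    with a floor of one point per usage.
    """
    points = defaultdict(float)
    for pokemon, cre in usages:
        points[pokemon] += max(1, cre)
    return dict(points)

# Hypothetical usages: one well-rated trainer and one with a negative rating.
print(weighted_points([("Zapdos", 1400), ("Zapdos", -300), ("Ninjask", 250)]))
```

Note that both players' pokemon score here, so two highly rated players using the same pokemon against each other count it twice.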

On the "what happens to the pokemon not in OU anymore" issue, I think they should go to BL, so we at least have one lol. Looking at the following pokemon to go to BL from HotC's list:

Alakazam- hits too hard without Blissey
Donphan- with Grounded poisons being prevalent it is a great Rapid Spinner and would be at the top.
Dragonite- Altaria dies
Dugtrio- Trapinch dies, plus really the thing it kills in BL is ridiculous
Empoleon- great supporter and sweeper altogether, but the lack of pokemon able to take down the agility set really gives me the creeps
Ninjask- the only one really that should go to UU, with stall being prevalent and all
Porygon-Z- look at alakazam, except a bit worse since it gets Nasty Plot and Download
Rhyperior- I think rated statistics would show this being higher tbh
Yanmega- god it's hard enough to beat this thing without using Blissey/Zapdos/Heatran (assuming no HP Ground), but without them it seems almost impossible.

of course then there's the issue of a cutoff point which I have no clue on. X-Act would probably find some magic math number thingy that I wouldn't dream of, as even though math is my best subject everything goes over my head ._.
 
I don't think we should worry about how effective low-OU Pokemon would end up being in "new UU," as that's kind of the same argument people had against the BL reset in the first place. So I wouldn't support just throwing half of OU into the BL tier, as that pretty much defeats the point of a BL tier in the first place.

Rather, I'm somewhat bothered by the idea of giving perfectly OU-viable Pokemon like Empoleon a chance to see additional play in another tier, when the UU tier exists for pretty much the exact opposite reason (to give Pokemon that aren't common in OU a chance to see play). It's pretty redundant, but on top of that, if we're going to have tiers of roughly 25 or 30 viable Pokemon each... that's a lot of tiers. It does us well to have as little overlap as possible between tiers, because nobody's going to have much interest in playing "NU 4" or whatever (and especially if a lot of "NU 4" can be seen and used in "NU 3" pretty decently).


If it's really a problem having relatively weak Pokemon lumped together with the truly common threats (though it doesn't really bother me), then we could always have little subdivisions of "High OU/Low OU" or some such. It'd have the same effect as just throwing the weaker Pokemon into BL, but without ruining the point of the entire "new UU" process. I do agree that weighted statistics are the way to go though.
 
So a Pokemon needs a usage of ~3.406% to be more likely than not to show up at least once in 20 teams. That's good to know.

Now to see if there's a less-arbitrary method of picking a number...

*gets to work
 

TAY

You and I Know
We already have an arbitrarily defined cutoff...and I don't really see a particularly pressing need to switch to a new arbitrarily defined cutoff. I doubt it will do more than confuse people, since everyone is obviously accustomed to the way it is now.
 

imperfectluck

Banned deucer.
From the way you say it, you seem to think that pokemon that aren't that strong such as Donphan or Electivire shouldn't be OU.
This pretty much sums up why I think there should be a new, if arbitrary, cutoff point for OU.
 
Articanus said:
list of pokemon with reasons
Sorry, but this is the kind of reasoning why we implemented a new UU test to begin with - because the old one is based on tradition and (oftentimes biased) theorymon. The only reason these Pokemon were not in the UU test was because they were deemed as "OU", but if we change the algorithm, they deserve every bit of a chance as, say, Raikou.
 

X-Act

np: Biffy Clyro - Shock Shock
This pretty much sums up why I think there should be a new, if arbitrary, cutoff point for OU.
Why? OU isn't about the strength of a Pokemon. OU is about the usage of Pokemon.

And what makes another arbitrary cut-off point better than the current one anyway? Let's say we make it T=15. Another person can go and say "Wtf Alakazam shouldn't be in UU! What are all you guys smoking?"

I agree, however, on implementing some sort of weighted statistics that reward a Pokemon being used by a better player, and use those for the OU list computation.
 

Great Sage

Banned deucer.
I agree, however, on implementing some sort of weighted statistics that reward a Pokemon being used by a better player, and use those for the OU list computation.
I remember discussing this with you and a few other people once. IIRC, we agreed that a fair way would be to give Pokemon on a winning team usage credit equal to the player's conservative rating estimate, and to assign Pokemon on a losing team usage credit of 0 (basically, only Pokemon that win a battle count towards weighted usage).
 

DougJustDoug

Knows the great enthusiasms
I have opposed using weighted numbers because I think the available metrics for determining "weighted usage" are far too arbitrary to be used effectively.

As I say that, I'm sure the first comment that comes to mind is: "What are you talking about Doug? Every player has a rating. That's a clean numeric representation of a player's strength. Add that every time a pokemon is used and you have a clean numeric representation of weighted usage!"

The presumption is that weighted statistics will tell us "What good players are using." That sounds so simple. But it isn't. Not really...

Here are several problems with using a player's rating to calculate weighted usage statistics:

---------------------------

What rating do you use? The rating at the time the stats are collected, or the rating at the time the battle was held? Actually, neither of those is really accurate. If you use ratings at the end of the month, they are potentially very different from the ratings at the time of the battle. So we really can't use that. If you use the player's rating at the time of the battle, you aren't really using the rating that is the basis of the Glicko2 system. The Glicko system is based on the idea that ratings are calculated for all battles conducted during a rating period. So the rating that you possess at the end of a given battle isn't actually the rating that will be recorded as your "real rating", which is the rating used to compute ratings against other players. Your "real rating" is calculated every night at 11:30pm, against all battles conducted during the previous 24 hours.

So, we would have to build new mechanisms for not only calculating ratings for every player, but also assigning ratings values to every battle conducted in a 24 hour period. Even if I was willing to do that work, it really isn't worth it for other reasons I will mention below.

---------------------------

At any given time, a player's rating is not a true assessment of the player; it is the rating of the account being used by the player at that time. We have no way to determine who the good players are at any point in time. We can only identify the IDs of people who have chosen to ladder actively at that time. Let's look at a few fictional users...

ElderChampion is a very knowledgeable and skilled battler, arguably one of the best in the history of the metagame. But he has not played in a certain length of time, and his CRE is somewhat low. That doesn't mean his rating is lower, it means his ratings deviation is higher -- which simply means we don't know whether he is still good or not. Maybe his skills have eroded, maybe not. But since we use CRE (that means CONSERVATIVE ratings estimate) -- it will be estimated on the low end. So, if this fantastic player with incredible knowledge and skill, uses a pokemon -- it will not be weighted very high. The CRE really is not a reliable indication of who is good or who is bad. It's simply a measurement of who is winning the current race.
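
To make the ElderChampion case concrete, here is a minimal sketch. It assumes the CRE is the rating minus some multiple of the ratings deviation; the multiplier `k = 3`, the function name, and every number below are assumptions for illustration only, not the server's actual formula.

```python
def conservative_estimate(rating: float, deviation: float, k: float = 3.0) -> float:
    """Hypothetical CRE: the rating pulled down by k deviations (k is an assumption)."""
    return rating - k * deviation

# Same underlying rating in both cases. ElderChampion has not played lately,
# so his deviation (our uncertainty about him) is assumed to have grown.
active_player = conservative_estimate(rating=1800, deviation=40)
elder_champion = conservative_estimate(rating=1800, deviation=120)

print(active_player)   # 1680.0
print(elder_champion)  # 1440.0
```

Under those assumptions, two players with identical ratings end up with very different weights purely because one of them took a break.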

Now look at CurrentChampionUndercover, which is the fictional alt of the current top-ranked player on the server. How should we regard this player's usage? In one scenario, the player may be trying out a new team, and wants to play a bit under an alt without affecting his main rating. Should we rate his pokemon lower? He is the best player on the server -- why would we NOT be interested in the pokemon he is using? If you are interested in what the best players are using CURRENTLY -- isn't this team the EXACT team that you want to weight highly? It's the current team of the current best player! But because of the alt system, these usages will be rated down with all the other noobs.

Maybe CurrentChampionUndercover is just screwing around and is playing with a gimmick team for fun. In this case, even if I could identify that the alt really is CurrentChampion (maybe possible by looking at IP addresses) -- I really DON'T care what the player is using, because in this case the best player on the server is intentionally NOT using his best pokemon. In that case I would prefer to completely ignore the player, because their usage is virtually meaningless for competitive weighting.

Then we have CurrentChampionStartingOver. Since rating volatility actually discourages players from keeping the same alias for a long period of time, the CurrentChampion decides to "start fresh" with a new alias. This is the exact same player, playing the exact same team, with the exact same strategy. But suddenly the weightings for that player are going to be rated down with the idiot noobs. How can that possibly be right? Well, since we have no idea that CurrentChampionStartingOver is really a great player -- we can't rate his pokemon accurately.

So when it comes to weighting the pokemon being used by the various good players mentioned above -- there is no way to weight them accurately by simply looking at the numbers. The numbers are terribly arbitrary.

---------------------------

The usage system is already exposed to skew based on individual usage -- why would we ever want to increase that exposure? Let me explain...

Which pokemon is used more -- a single player using Ninjask over and over 100 times in a single day, or 100 different players using Dugtrio once in a day? I would certainly argue that Dugtrio being used by 100 different people is far more "Used" than one guy with a shitload of time on his hands, spamming the hell out of Ninjask. Right now, the usage stats really can't differentiate, so both cases are considered equal.

Well, now let's change it up and say that the single player is one of the best players on the server. And he is spamming his Stallrein in 100 consecutive battles. On the other hand, 100 different other users, all rated highly, but not as highly as the Stallrein user -- they all are using Salamence once to great effect, because it's such a true stud of a pokemon. Which pokemon will receive a higher overall weighted usage? Stallrein will. Because a lone highly rated player spammed it severely. So, even though I could make a very valid argument that Salamence was 100 TIMES more popular amongst good players that day -- the weighted usage stats would say that Stallrein was "used more", whatever that means. That is ridiculous, in my opinion.
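
The Stallrein vs. Salamence scenario can be sketched in a few lines. This is only an illustration: the player names and the ratings (1750 and 1650) are made up, and the weighting rule (add the player's rating once per usage) is the naive scheme described earlier in this post, not anything that is actually implemented.

```python
from collections import defaultdict

# Each record is (player, player_rating, pokemon). All values are made up.
battles = [("TopPlayer", 1750, "Stallrein")] * 100
battles += [(f"GoodPlayer{i}", 1650, "Salamence") for i in range(100)]

raw = defaultdict(int)       # current system: one unit per usage
weighted = defaultdict(int)  # naive weighting: add the player's rating per usage
users = defaultdict(set)     # distinct players behind each pokemon

for player, rating, pokemon in battles:
    raw[pokemon] += 1
    weighted[pokemon] += rating
    users[pokemon].add(player)

for pokemon in raw:
    print(pokemon, raw[pokemon], weighted[pokemon], len(users[pokemon]))
# Stallrein 100 175000 1
# Salamence 100 165000 100
```

Raw usage calls it a tie, the weighted total puts Stallrein ahead, and the distinct-player count shows the 100-to-1 gap that neither number captures.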

If you wonder how far people can go to spam usage, let me give you this little tidbit -- the most active battler on the server played over 2000 matches in the month of December. Many good, active players average less than 200 battles per month. I sure hope the guy with 2000 battles is lowly rated, or using "the right pokemon" -- because that guy would have a dramatically heightened effect on weighted usage numbers.

With the current system, every usage is equal, so an active player -- potentially a player intentionally trying to game the tiers -- can only affect the stats by 1 unit each time they play a game. Yes, a spammer can still game the stats, but only one unit at a time. If we use weighted stats, we increase that player's ability to manipulate the stats.

---------------------------

Since individual battles matter, we unfairly add weighting to pokemon that are part of offensive teams. Since offensive teams play faster, they complete more battles. As such, these pokemon get higher usage numbers. However, just like my previous point -- each usage currently only counts for 1 unit on the stats. If we add weightings to the mix, we are only enhancing the skew.

---------------------------

I also disagree with using only winning pokemon for weighted stats. This argument is frequently mentioned as a way to defend against people gaming the tiers, by spamming pokemon on losing teams. However, since players of like skill are intentionally matched against each other, you would be intentionally excluding good pokemon being used by good players, every time good players face each other. This likelihood of exclusion is enhanced during times of high server activity, since there is an increased likelihood of good players being matched during those times. So, the weightings would be enhanced for overseas players that ladder during times of inactivity. If something as arbitrary as the time of day has any bearing on our pokemon tiers, then something is severely wrong.



I'm going to stop now... I'm tired of typing, and for those of you that have made it this far -- you're probably tired of reading it. If you don't understand my point by now, then you never will.

The heart of my argument is this -- we have no way of knowing what pokemon are being used by good players. Yes we have usage, and yes we have player ratings. But that doesn't mean we can mix the two in an accurate and useful way.

All we can reasonably do is report usage of pokemon in plain-vanilla terms. As in, "If you press Find Match on the Standard Ladder, here is the percentage chance of facing Pokemon X on your opponent's team."
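
For reference, plain-vanilla usage of that kind is easy to compute. The following is a minimal sketch, assuming each battle log gives us the opponent's team as a list of names; the function name and the sample teams (shortened to three pokemon each) are hypothetical.

```python
from collections import Counter

def usage_percentages(teams):
    """Share of teams (in percent) on which each pokemon appears."""
    counts = Counter()
    for team in teams:
        counts.update(set(team))  # count a pokemon once per team
    return {name: 100.0 * n / len(teams) for name, n in counts.items()}

# Made-up sample of four opposing teams.
teams = [
    ["Gengar", "Blissey", "Tyranitar"],
    ["Gengar", "Heatran", "Salamence"],
    ["Blissey", "Heatran", "Dugtrio"],
    ["Gengar", "Dugtrio", "Ninjask"],
]
stats = usage_percentages(teams)
print(stats["Gengar"])   # 75.0
print(stats["Ninjask"])  # 25.0
```

Every team counts for exactly one unit here, which is the property argued for above. The sketch assumes a non-empty list of teams.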

Everyone loves to talk about how wonderful "weighted statistics" would be. Sure, I agree that "true" weighted statistics would be great. But, as yet, I have not seen any proposal that can be implemented with our currently available stats that can be anything other than a vague arbitrary guess, masquerading as "more informative stats".
 

X-Act

np: Biffy Clyro - Shock Shock
is a Site Content Manager Alumnusis a Programmer Alumnusis a Smogon Discord Contributor Alumnusis a Top Researcher Alumnusis a Top CAP Contributor Alumnusis a Top Tiering Contributor Alumnusis a Top Contributor Alumnusis a Smogon Media Contributor Alumnusis an Administrator Alumnus
DougJustDoug said:
If you use the players rating at the time of the battle, you aren't really using the rating that is the basis of the Glicko2 system. The Glicko system is based on the idea that ratings are calculated for all battles conducted during a rating period. So, the rating that you possess at the end of a given battle, isn't actually the rating that will be recorded in your "real rating" which is the rating used to compute rating against other players. Your "real rating" is calculated every night at 11:30pm, against all battles conducted during 24 hours.
Actually I'm not sure if this is really what happens. At 11:30pm, the rating deviation gets updated, but the rating remains unchanged. It's true that the CRE gets updated as a result, but the CRE is just that: a conservative rating estimate. The real rating is not the CRE.

As far as I know, the rating period implemented in Shoddybattle is 1 battle... which means that the rating and deviation get updated in every battle. The update that happens at 11:30pm every day is there just for those players who do not compete during that day (even though all players get updated and not just them). For the players who competed, they shouldn't see a vast change in their rating deviation (unless they have high volatility). And, again, I stress that only their deviation changes... their (real) rating stays the same.

Since the rating and deviation get updated in every battle, we DO know the real R and RD for that player as soon as the battle ends. The only problem I would have is whether to choose the rating of the player(s) before the battle or the rating after it.
 

DougJustDoug

Knows the great enthusiasms

I said this in my post (bolding added):

So, the rating that you possess at the end of a given battle, isn't actually the rating that will be recorded in your "real rating" which is the rating used to compute rating against other players. Your "real rating" is calculated every night at 11:30pm, against all battles conducted during 24 hours.
I realize I wasn't being very clear there. I'll try to explain better:

A player's rating, deviation, and volatility are updated every battle. That's true. However, that per-battle rating calculation is made by taking both players' NIGHTLY ratings into account -- not their most-current ratings at the time of the battle. This indicates to me that a player's rating at the end of a rating period is perhaps "better" than the per-battle rating -- because that nightly rating is the rating used when players are rated in comparison to each other. At the conclusion of the rating period, the "current rating" is copied into the "nightly rating", and the cycle begins anew.

So, yes, players see their rating change after every battle. But, for measuring players against each other in terms of strength, only one rating is used per player for an entire day at a time. I don't know which rating is the most "correct". All of them are. None of them are. The point is that it's somewhat arbitrary to pick any one of them.



I also dispute that Glicko ratings are of an appropriate scale for weighting pokemon usage.

Player A is ranked #1 and has a CRE of 1700
Player B is ranked #100 and has a CRE of 1400

How much better is Player A than Player B?

By one measurement, player A is 100 times better -- using rank as the metric. I think we all agree that would be a horrible metric.

By CRE, Player A is ~20% better than Player B. Is that more correct? I don't really know.

If there are only 100 players on the ladder, then 20% separation seems awfully small. I would think the difference between the absolute best player and the absolute worst player is certainly more than a 20% difference in knowledge and skill. But, if there are 10,000 players on the ladder, then 20% seems awfully big. In that case, both players are in the top 1% of all players. It seems hard to believe that there is a 20% separation in skill at the very top. So where do you draw the line? I have no idea.

Yet people constantly point to using a player's rating as a proper order of magnitude for representing how good they are for purposes of weighting pokemon usage. The Glicko rating system was not devised for that purpose. It's simply the most visible number that we have, so people automatically conclude that it must somehow be relevant. I don't think so.

Here's another way to do weighted usage:

Perhaps the entire ladder should be split into quartiles with the top quartile of players receiving 3 points per usage of each pokemon, the second quartile receiving 2 points per usage, and the bottom half receiving 1 point per usage.
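
Spelled out, that quartile scheme might look like the sketch below. The function name, the ranks, and the sample usages are all hypothetical, and the boundary handling (rank exactly at 25% counts as top quartile) is an assumption.

```python
from collections import Counter

def quartile_points(rank: int, total_players: int) -> int:
    """Points per usage for a player at `rank` (1 = best) on a ladder of `total_players`."""
    if rank * 4 <= total_players:   # top quartile
        return 3
    if rank * 2 <= total_players:   # second quartile
        return 2
    return 1                        # bottom half

# (ladder rank of the user, pokemon used) on a 100-player ladder; made-up data.
usages = [(1, "Gengar"), (30, "Gengar"), (80, "Gengar"), (10, "Blissey")]

points = Counter()
for rank, pokemon in usages:
    points[pokemon] += quartile_points(rank, 100)

print(points["Gengar"])   # 6
print(points["Blissey"])  # 3
```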

Is that weighting system better? Who the hell knows. I spent about 6 seconds just now thinking it up in relation to the need for weighting pokemon usage. That's six seconds longer than Professor Glickman spent thinking about how his Glicko system should be used for weighting pokemon usage. So why does everyone assume that the Glicko ratings are appropriate for multiplying times usage for weighting purposes? It is not because it makes any sense, or has any correlation to answering the question "What are good players using currently?" It's simply a number that happens to exist in the metagame. That's one of the many reasons why I say:

Pokemon Usage * Glicko rating = ARBITRARY NUMBER
 
I'll try to address everything point by point:


Using only the winning pokemon

I've seen this mentioned quite a lot, and it's even worse than Doug painted it to be. The amount of games you win to achieve a higher rating in no way makes up for the addition of stats taken from every one of a low rated player's wins.


If going by rating what rating do you use?

What about separating only those statistics of players who are 'currently' over 1500 (this number is debatable) in rating? By 'currently', I mean for that day. Obviously this could create a problem with someone tanking their rating in a mass battling spree after reaching the rating, but I have a feeling that would be pretty easy to spot if we were looking.

We could weigh these numbers as well, especially if we feel that our cutoff point is a bit low. Or we could do something like:

-Players above 1500 are using
-Players above 1600 are using
-Players above 1700 are using

The best part about doing this, is that even if there is a large disparity in battle count between users, we can still gather useful stats. If solid players are using pokemon above 1600 or 1700 that we might not have expected them to use, that's telling. It's also notable if they're not using many OU pokemon ever, or if a couple pokemon seem to be on almost every team over 1700, but see 'normal' numbers at 1500-1600 or so.
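
A rough sketch of those threshold lists, assuming each usage record carries the player's rating for that day. The function name and the sample data are made up for illustration.

```python
from collections import Counter

def usage_above(usages, threshold):
    """Count pokemon usages from players whose rating that day is above `threshold`."""
    return Counter(pokemon for rating, pokemon in usages if rating > threshold)

# (player's rating that day, pokemon used); made-up data.
usages = [
    (1520, "Gengar"), (1540, "Blissey"), (1610, "Gengar"),
    (1650, "Heatran"), (1720, "Jirachi"), (1730, "Gengar"),
    (1450, "Ninjask"),
]

for threshold in (1500, 1600, 1700):
    print(threshold, dict(usage_above(usages, threshold)))
```

Each list is just a filtered count, so a pokemon that only shows up above 1700, or one that disappears above 1600, is easy to spot by comparing the three.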


Alternatively, we could just 'trust' that all users who have ever reached a certain rating (likely a higher one) are playing to win.

Changing pokemon and using other teams

The way I see it, is if a player can achieve the rating we set, they're a pretty good battler. More importantly, the pokemon they've used to get to that ranking can be quite telling. If they're using other pokemon at other times afterward, I don't see this being a problem. Not everyone tests teams on alternate names, but if we can say with confidence that a 'good' battler is still trying to win I don't see any problem in counting the pokemon used to do so, especially as they're staying above a good rating.


Overexposure - game speed and mass battles

You're correct when you say that there's simply no way to gather statistics which aren't marred by teams whose battles are either exceptionally fast or slow. And with a more restricted set of players, your concern about one of them heavily influencing the stats is certainly valid. However, the stats still give us a picture of 'what we will see' when battling at a certain rating, just as the current stats give us that same picture overall. The mass battlers will obviously be seen more often than the casual ones, and the offensive teams more than the defensive teams.

We're not getting 'perfect' stats to work with here, but nothing will, short of trying, say...

Find the X most commonly used pokemon each player used while over the rating we set. Using those X pokemon, determine percentage based use of each just within that set. Now, take the average of the player's rating for the month, and weigh that (something heavier than just using their rating itself - the range between 1500 and 1700 is quite a difference). Multiply the percentage by the weighted average, and there you have it.

Example using terribly brainstormed numbers:

22% - Gengar
16% - Tyranitar
15% - Jirachi
15% - Flygon
12% - Blissey
12% - Heatran
6% - Infernape
2% - Cresselia

If this user finished with a 1500 average, give points equal to the percentage. 1400 would be ½, 1600 would be x2, 1700 would be x3. Obviously it would be uniform along those lines.

So, if IPL uses Gengar 22% of the time, it's worth 66 'usages'. If I use Gengar 22% of the time, it'll be worth 11 (my OU rating is terrible right now). It doesn't matter if I used it 500 times and he used it 50, and it shouldn't, right?
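
The arithmetic above can be checked with a short sketch. The multiplier table only covers the four ratings named in the example; how ratings in between would be handled is not specified here, so the lookup is an assumption, as are the function and variable names.

```python
# Multipliers from the example: 1400 is x1/2, 1500 is x1, 1600 is x2, 1700 is x3.
MULTIPLIER = {1400: 0.5, 1500: 1, 1600: 2, 1700: 3}

def weighted_points(usage_percent, average_rating):
    """Scale a player's usage percentages by the multiplier for their average rating."""
    factor = MULTIPLIER[average_rating]  # only the four listed ratings are supported
    return {pokemon: pct * factor for pokemon, pct in usage_percent.items()}

# The brainstormed usage profile from the example (percent of battles).
profile = {
    "Gengar": 22, "Tyranitar": 16, "Jirachi": 15, "Flygon": 15,
    "Blissey": 12, "Heatran": 12, "Infernape": 6, "Cresselia": 2,
}

print(weighted_points(profile, 1700)["Gengar"])  # 66
print(weighted_points(profile, 1400)["Gengar"])  # 11.0
```

Because only percentages go in, the number of battles each player fought never enters the calculation, which is the point of the proposal.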

Now, the reason why I didn't mention this in the first place is because it seems like those statistics would be infinitely more difficult to gather and weigh. Logically, I don't see any issues, though feel free to pick that apart as well.

Saw Doug's latest post just as I was about to hit reply: I don't see any reason as to why we can't find some sort of 'magic number' which makes the Glicko actually tell us what we want it to. Though if Glicko is that arbitrary, maybe we need a new system? I assume the entire ladder rating system would have to be rewritten, which sounds like a nightmare unless there are actually 'better' systems under public domain.
 
