The decision to base UU off of 1760 stats

Status
Not open for further replies.
#76
It took me a while to realize this, but the purpose of the standard candles is to identify a subset of Pokemon Showdown users using Pokemon, movesets, abilities, and/or items possessing unequivocally no competitive merit. The candles are not meant to filter anyone out, but to provide a sample of obviously uncompetitive players, which one would then use to determine what the rating of an uncompetitive player should be. The designated criteria of the standard candles are obviously theoretically useless in gameplay, and thus their usage should be correlated with a player's ineptitude (such as ignorance of game mechanics and/or lack of knowledge of the Pokemon's movepool) or lack of competitive intent (such as trolling). Moreover, even competent players who use the standard candles would be handicapped severely, and this would adversely affect their potential rating.

From the data set of users who have a standard candle, one could find the average rating of non-competitive players and then set a cut-off that would practically exclude them. For instance, if the average rating of Gengar users having a physical attack (barring Focus Punch, which has some situational merit) is 1450 Glicko with RD 70, setting the cutoff at 1600 would exclude those players from influencing tiering (their weight would be less than .02).
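
To make the ".02" figure concrete, here is a minimal sketch. It assumes the weight is the probability that a player's "true" rating lies at or above the cutoff, with the rating treated as a normal distribution of mean R and standard deviation RD. That form is inferred from how the weighting is described in this thread, not taken from the actual stats code, and the function name `weight` is made up for illustration.

```python
import math

def weight(rating, rd, cutoff):
    """Assumed weighting: P(true rating >= cutoff) for a normal(rating, rd) belief.

    This is an illustrative guess at the formula, not the real implementation.
    """
    return 0.5 * math.erfc((cutoff - rating) / (math.sqrt(2) * rd))

# The Gengar example from the post: 1450 Glicko, RD 70, cutoff 1600.
w = weight(1450, 70, 1600)
print(f"weight = {w:.4f}")  # about 0.016, i.e. below .02
```

Under that assumption, such a player counts for under 2% of a player who is certainly above the cutoff.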

Still, I believe it is better to base tiering on empirical statistics rather than on inexact theoretical considerations about the efficacy of a given Pokemon (theorymon isn't quantum electrodynamics, where one can get verified quantitative predictions down to many sig figs!) or on subjective preferences similar to viability rankings. However, one should have some theoretical understanding of why certain Pokemon are more useful than others in a given metagame setting, so that the usage trends can be explained. But regardless of any theoretical considerations, the competent players will simply use what works; tautologically, if they did not, they wouldn't be deemed "competent" or "competitive".

----

I wonder if Antar decided on a set of criteria for the standard candles in order to determine the strength of uncompetitive players.
I get that that's the idea behind it, but it doesn't mean similar concepts to the candles can't be used in other ways, to avoid the huge problems this system has.
 

Antar

is a Battle Server Administrator, is a Programmer, is a Super Moderator, is a Community Contributor
Official Data Miner
#77
Anyway guys, it turns out that the candle idea was a turd. I'll explain more later, but it has to do with what happens when you push the weighting system too far.
 

Antar

#79
Basically, for all of these candles, as the baseline (not calling it cutoff anymore. You happy, WebBowser and @Excitement?) increased, the candle usage didn't even come close to tending towards zero. In fact, above a certain threshold, candle usage actually started going up. The reason is that once the baseline is above the vast, vast majority of players, R ends up mattering a lot less than RD. So if the baseline is, say, 1800, then a new alt (rating 1500±130) will have substantially greater weight than a player with rating 1700±25 (by a factor of over 300). And it doesn't matter if the new player has a weight that's 100x less than a player with rating of 1900±25--there are hundreds of thousands upon thousands more new alts than there are really good players.

In truth, what this demonstrates is the limits of our weighting system--it was never really designed to work at any level above 1500. I have some alternatives I may try. I'll make sure to keep everyone posted.
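
For anyone who wants to check the "factor of over 300" and "100x" numbers, here is a rough sketch. It assumes the weight is the probability that a normal(R, RD) rating lies at or above the baseline; that formula is an inference from the thread, and `weight` is a made-up helper name, not the real stats code.

```python
import math

def weight(rating, rd, baseline):
    # Assumed weighting: P(true rating >= baseline) for a normal(rating, rd) belief.
    return 0.5 * math.erfc((baseline - rating) / (math.sqrt(2) * rd))

BASELINE = 1800

new_alt = weight(1500, 130, BASELINE)  # fresh alt, maximum RD
mid = weight(1700, 25, BASELINE)       # established player below the baseline
top = weight(1900, 25, BASELINE)       # established player above the baseline

print(f"new alt vs 1700±25: {new_alt / mid:.0f}x")  # over 300x
print(f"1900±25 vs new alt: {top / new_alt:.0f}x")  # a bit under 100x
```

With that assumed formula, the ratios come out close to the ones quoted in the post.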
 
#80
Still, I cannot imagine how standard candle usage could increase after some R threshold, assuming by "threshold" you mean Glicko rating, not weight. Although it wasn't listed as an official candle, I cannot see how someone, even a skilled player with an otherwise well-constructed team, using a Gengar with physical attacks (besides Focus Punch) can get a rating above 1760.

Your explanation isn't surprising at all if "threshold" means "weight". It seemed obvious to me that an increased "baseline" can be abused for tiering purposes due to the properties of the normal distribution. If the baseline is very high (baseline >> R), then someone lacking the skill or willingness to use competitive Pokemon to reach it can still increase the value of the integral by increasing the standard deviation, so that the tail above the baseline is larger; playing more would simply decrease their rating deviation and hence their weight. It is a perverse incentive not to play many matches on a single alt.

Still, I did not think this would affect tiering unless there is a conscious and mathematically informed effort to undermine it, since most newbies/trolls/uncompetitive players are likely unaware of the tiering process and rating system; they would simply play matches on their alt while their RD shrinks and their rating languishes at 1500 or below, unconcerned about their weight. But dedicated tier trollers could take advantage of this by making numerous alts with a high initial rating deviation, challenging newbies at random, winning one or two games fairly easily even while handicapped by using uncompetitive Pokemon, and then moving on to a different alt. While the tier trollers' absolute weight has decreased, their relative weight has increased, since a higher baseline dilutes the influence of the majority of average, uncompetitive players on the ladder, making the tier trollers' influence more concentrated.

And besides there are better things to do in life than abuse the tiering system on an unofficial Pokemon website. :)
 
#81
Wait... candle usage went... up? Could the mons/sets being used at low levels be even more random than we thought? Or perhaps candles are being purposely used by folks who know what they are doing just to show off how much better they are than the average player? (à la the PS OU forum's NU + Delibird challenge, where one must make a team consisting entirely of NU mons, one of which is Delibird, and try to ladder to a semi-respectable rating). Yeah, I'm pretty baffled here. Well, good luck figuring all of this out. Also, yay slightly more accurate terminology. I know it probably seems really petty to you, but describing things well is actually really important from a PR standpoint, as it's really hard for folks to have an informed opinion of what's going on when terminology is inconsistent and/or inaccurate.

Also, you sound really tired. Get some sleep, it might help you think of new ideas (or at least help alleviate some stress. It works for me at least). I really hope this doesn't come off as patronizing, I'm just concerned is all.
 

Antar

#82
No, WebBowser and Calm_Mind_Latias, the problem is not trolls or people mistakenly using wrong stuff at the top of the ladder. The problem is that the weighting system weights players with high RDs more strongly than players whose R is closer to the baseline, and you have the maximum RD when you first start off. There is nothing odd or mysterious about this--it's just math. It's not even wrong math: a new alt, just starting out, has a nonzero chance of belonging to a player at the top of the ladder, but if the alt's played 100 games and is in the 1600-1700 range, it's extremely unlikely that that player has a "true rating" of 1800.
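
A small sketch of what "it's just math" looks like for a single player: hold the rating fixed below the baseline and let RD shrink as more games are played. The weighting formula is an assumption (probability that a normal(R, RD) rating is at or above the baseline), and the 1650 rating and the list of RD values are made-up illustration inputs.

```python
import math

def weight(rating, rd, baseline):
    # Assumed weighting: P(true rating >= baseline) for a normal(rating, rd) belief.
    return 0.5 * math.erfc((baseline - rating) / (math.sqrt(2) * rd))

BASELINE = 1800
RATING = 1650  # hypothetical player who stays in the 1600-1700 range

rds = [130, 100, 75, 50, 25]  # RD shrinking as more games are played
weights = [weight(RATING, rd, BASELINE) for rd in rds]

for rd, w in zip(rds, weights):
    print(f"RD {rd:>3}: weight {w:.2e}")
```

Under this assumption, the weight drops by many orders of magnitude between RD 130 and RD 25 even though the rating itself never moves.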
 
#83
Sorry,

I never said trolls were a problem; I just noted the possibility that tier trolls can abuse a weighting system by playing fewer matches in order to maximize their weight by focusing on the high RD from a new alt. It also gives a perverse incentive not to play many matches on a single alt and discounts moderately skilled players relative to a nascent random alt.

Still, I do not see how standard candle usage can increase after a given R (with the exception that, at around 1500, some candle users win their first battle or two and quit). The math easily allows some standard candle users to have increased weight relative to the average player who uses an alt for tens of matches. As I said before, I understand the math behind it (it mostly revolves around the AUC of the normal distribution) and do not see it as an incomprehensible black box.
 
#84
This is something that makes me even more convinced that the rating system as a whole is flawed. If at 1700 glicko with a very low deviation my contribution to usage stats can fall below a new player with high deviation, that is a major flaw. Players should be rewarded for playing many games at a high glicko, not punished by how their decreasing deviation reduces their contribution to weighting.
 
#85
Could you make a cutoff deviation then, similar to how 100+ deviation players didn't appear on the ladder leaderboards?
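
One way the deviation cutoff idea could be sketched is below. Everything here is hypothetical: the weighting formula is assumed (probability that a normal(R, RD) rating is at or above the baseline), the cap of 100 is borrowed from the leaderboard rule mentioned in the question, and `weight_with_rd_cap` is a made-up name rather than anything in the real stats code.

```python
import math

MAX_RD = 100  # hypothetical cap, taken from the leaderboard rule in the question

def weight_with_rd_cap(rating, rd, baseline, max_rd=MAX_RD):
    """Assumed weighting, but players whose RD exceeds max_rd get zero weight."""
    if rd > max_rd:
        return 0.0
    return 0.5 * math.erfc((baseline - rating) / (math.sqrt(2) * rd))

print(weight_with_rd_cap(1500, 130, 1800))  # brand-new alt is dropped: 0.0
print(weight_with_rd_cap(1500, 100, 1800))  # alt sitting right at the cap
print(weight_with_rd_cap(1700, 25, 1800))   # established 1700 player
```

Note that in this sketch an alt sitting right at the cap still outweighs the 1700±25 player at a baseline of 1800, so a cap on its own may soften the effect without removing it.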
 

Antar

#87
vayu, Elo, no, but Glicko is designed to match a normal distribution with center 1500 and standard deviation 130, so the formula for percentile is:
Code:
pctl=1-0.5*(1+erf((R-1500)/sqrt(2*130*130)))
Glicko ratings also have a deviation RD which complicates things, but treat that as an "uncertainty."
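
As a runnable version of that formula, here is a short Python sketch. The function name `top_fraction` and its default parameters are illustrative choices; the center of 1500 and standard deviation of 130 come from the post. As written, the formula gives the fraction of players expected to sit above a given rating, and RD is ignored.

```python
import math

def top_fraction(rating, center=1500.0, sd=130.0):
    """Fraction of players expected above `rating` under normal(center, sd)."""
    return 1 - 0.5 * (1 + math.erf((rating - center) / math.sqrt(2 * sd * sd)))

print(top_fraction(1500))  # 0.5: half the ladder is above the center
print(top_fraction(1760))  # about 0.023: 1760 is two standard deviations up
```

Since 1760 is exactly 1500 + 2*130, a 1760 rating corresponds to roughly the top 2.3% under this model.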
 