Screw Elo, A New Proposed Skill Rating System That Is Far More Accurate (Based on the GXE Formula)

#1
Greetings fellow smogon nerds,

Have you ever experienced the frustrations of laddering under our current ranking system? I reach out specifically to players who are ranked in the very top percentages of the ladder. Anyone who ladders competitively and who has made it to the top 10 or so of the ladder will really know what I'm talking about. I have been on pokemon showdown since late 2013, right when generation 6 made its debut, and have played a variety of metagames. As of now I simply focus on playing Anything Goes, which is by far my favorite metagame to play. I have consistently been a high ranked player on the Gen 7 AG Ladder, and have made it to number 1 only once, although I am quite often in the top 10. Anyone who plays Anything Goes knows that out of all of the ladders on the website besides Random Battle and occasionally OU, the AG ladder typically contains the highest ELO rated players. Have a look at this screenshot for instance:

[Image: Screenshot (26).png]

As can be seen, 4 players are rated above 2000, while there is a big gap between the ranked number 4 player and the ranked number 5 player. In other ladders, such as the ubers ladder, top rated players are in the high 1800s, and for inactive ladders top rated players can be as low as in the 1200s-1300s. At this time, I was at my highest ever elo rating, and I felt very satisfied. I had been playing 6 games per day, which as far as I know is the minimum number of games needed to be played in order to prevent decay. The next day, however, in the very first game I played, I was matched up against a player who was rated in the low 1800s, and I lost due to making a terrible misplay, scratching a terrible 30 something points off of my rating. I then proceeded to win several games in a row in order to get my rating back up, but before I made it to 2065+ again, I faced another crippling loss, knocking me down even further than where I had been knocked the first time. It wasn't long before I started playing badly due to frustration, and I fell to the low 1900s. Each time I lost I would win several games in succession, each one earning me a mere 4-6 ladder points, only to see myself get knocked down by 30 points again from a single loss. After I had "tilted," I decided to take a several day break from laddering due to how frustrating it was for me. And trust me, I had been through this many times already. Tilting is a common scenario that high ladder players experience, and to prevent it they often have a policy such as "if I experience 2 losses in a row, then I will take a break from laddering until tomorrow to prevent tilting." After going through the tilting process countless times, it dawned on me one day, "is the pokemon showdown rating system really that accurate?" I raised this question not only out of my frustration but also due to the several, severe issues that I noticed with the way players on showdown are ranked.

Issue Number 1: Elo Ratings From One Ladder Cannot Be Accurately Compared to Those From Another

This is something that I hinted at earlier. Every ladder has a different peak elo score, and every ladder has a different degree of variance. As of right now, for instance, the number 1 rated player in Anything Goes has an elo of 2087, and the ranked 500 player has a rating of 1581, while in OU, the number 1 player is at 1979, while the number 500 player is at 1688, as can be seen in these screenshots:

[Images: Screenshot (34).png, Screenshot (33).png, Screenshot (32).png, Screenshot (31).png]
So the big question about these screenshots is, which player is better, the number 1 AG player or the number 1 OU player? What about the number 500 AG player and the number 500 OU player? Based on elo, this question is simply impossible to answer. A 1688 elo in OU is not equivalent to a 1688 elo in AG, because both ladders have different degrees of variance, and different population sizes. The AG ladder obviously has more variance than the OU ladder, despite the fact that I am pretty sure that the OU ladder contains a greater number of players. To top that off, elo, simply put, is not an accurate method of displaying a player's unknowable true rating, especially when it comes to the game of pokemon. This will be further emphasized in the next issue with it.

Issue Number 2: Good Players Can Decay and then Rob Higher Elo Rated Players of Points

This, I must say, is a horrendous issue that needs to be fixed. There are some players who are extremely skilled, and easily capable of topping the ladder, yet they do not ladder very often. As a result, they are rated much lower than they should be, due to the fact that they decay by going long periods without laddering. When they actually do ladder, any higher rated sucker who is unfortunate enough to get matched up with them risks getting beaten and losing more points than they should.

Issue Number 3: All High Rated Players Decay to the Same Rating

As far as I know, rating decay begins at 1500 and does not occur at ratings 1499 or below (although I've heard that in some ladders this value is different, such as 1399, but maybe this isn't true). Regardless of the "minimum rating," that elo decays to, I do not think that every player above this rating should eventually decay down to the same rating. Why on earth should a 2100 rated player decay to 1499 after not playing (in this case, for a very long time) for a while, and a player rated 1550 decay to that same rating (in a shorter amount of time, though)?

Issue Number 4: Pokemon is not only a game of skill, but also a game of luck

This is perhaps the biggest reason why the elo rating system is not suited for pokemon. As we all know, in pokemon, there exists a little battle mechanic that game freak decided to implement into the game called accuracy. This acts hand in hand with the secondary effect chances that some moves have, such as a 10% chance to burn, paralyze, or freeze. We all know the frustration of getting "haxed," when we miss a 90% accurate move 3 times in a row, or when we get smacked with a double or even triple crit (I witnessed a triple crit occur in a battle yesterday, not sure if this is a bug or not, but the chance of a crit occurring is supposed to be 1/16), or when the opponent freezes a pokemon with ice beam. And don't forget about the dreaded para or flinch hax! Due to the existence of these "hax" factors, we can sometimes lose battles on the ladder that we clearly should have won, and simply lost because of luck. People may argue that hax happens to everyone, and that it is just a part of the game, and that while it does result in upset losses for a player it also results in upset wins, which is true, but there is no reason why a 2050 rated player should lose 40 points to a 1700s player simply because he lost due to an extremely unlucky game. In other words, the pokemon showdown elo rating system simply does not offer an accurate representation of a player's or pokemon team's consistency.

So if Pokemon Showdown Does Not Use Elo to Rank Players, What Ranking System Can Possibly Replace It?

This is an interesting question to answer. In the past, pokemon showdown used ACRE, a ranking system that was horribly inaccurate when it came to estimating ratings that had a high rd (rating deviation, which is a fancy term for standard deviation), and was meant as a way of interpreting the glicko-1 rating. Eventually, showdown switched to elo, and made a few changes to the system afterward such as introducing a rating floor of 1000, and some other tweaks. Currently, the pokemon showdown ladder displays a player's elo, glicko-1, GXE, and on suspect ladders, COIL. Of all of these ratings, which one takes the cake as being the most accurate? That answer is, without any doubt whatsoever, GXE. If you don't believe me, take a look at X-Act's original post that introduced the concept of GXE, and how it is calculated, which can be found here: http://www.smogon.com/forums/thread...layers-overall-rating-than-shoddys-cre.51169/ . If you read everything that X-Act wrote, you will see that he calculated the exact true rating of 250 chosen players by examining the glicko and rd of each player, and using an equation devised by Mark Glickman, the inventor of the Glicko-1 and Glicko-2 rating systems, to calculate the probability of every single player beating every single other player. He then matched these true ratings up with his GXE formula, and the results were astoundingly accurate. The GXE formula ordered the players in the exact same order that their true ratings would have ordered them in, a degree of accuracy which is absolutely stunning.

So if GXE is More Accurate Than Glicko-1 and Elo, Why Don't We Use it to Rank Players?

The big reason why gxe is not used to rank players is that it is a percentage, rather than a solid whole number. People tend to prefer their rating as a whole number, rather than as a percentage referring to their estimated chance of winning a match against a random opponent. Another issue with GXE is that it is less accurate in determining a player's rating when the player has a high rating deviation, and so rating deviations above 100 result in a rating of "provisional," or 0. It also shares some of the problems with glicko in that after many battles are played, it is difficult to change. However, I have a simple solution to all of these issues.

My Rating System

In reality, the goal of a rating system is to get as close to an estimate of a player's true rating as possible. X-Act's GXE formula does an extremely good job of doing this, so I really think that we need to consider the work that he left behind, since he is not available on showdown and hasn't been online since 2012. I am honestly baffled at why the smogon staff chose to use elo as the primary ranking system, when GXE is far superior, and can have its issues fixed with a few tweaks.

The original GXE formula looked like this:

Given a player rating R and a rating deviation RD:
GLIXARE Rating = 0, if RD > 100
GLIXARE Rating = round(10000 / (1 + 10^(((1500 - R) * pi / sqrt(3 * ln(10)^2 * RD^2 + 2500 * (64 * pi^2 + 147 * ln(10)^2)))))) / 100, otherwise
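For anyone who would rather check the numbers in code than in a spreadsheet, here is a minimal Python sketch of the formula exactly as it is written above. The function names `gxe_raw` and `gxe` are labels chosen for this sketch, not anything the ladder itself uses, and this is only the original formula, not whatever the live ladder currently runs.

```python
import math

def gxe_raw(rating: float, rd: float) -> float:
    """Original GLIXARE formula with no provisional cutoff; returns a percentage."""
    ln10_sq = math.log(10) ** 2
    scale = math.sqrt(3 * ln10_sq * rd ** 2
                      + 2500 * (64 * math.pi ** 2 + 147 * ln10_sq))
    exponent = (1500 - rating) * math.pi / scale
    # Same rounding as the formula: two decimal places of a percentage.
    return round(10000 / (1 + 10 ** exponent)) / 100

def gxe(rating: float, rd: float) -> float:
    """Same formula, but RD above 100 is reported as 0 (provisional)."""
    return 0.0 if rd > 100 else gxe_raw(rating, rd)

print(gxe(1500, 50))       # a 1500-rated player is a coin flip: 50.0
print(gxe_raw(2051, 0))    # no cutoff applied, RD of 0
print(gxe_raw(2051, 350))  # no cutoff applied, maximum RD
```

The split into two functions is only there so that the effect of RD can be looked at past the RD > 100 cutoff.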


I have tested this formula with the glicko and rds of current top ladder players, and I did not get the same resulting GXEs as what the ladder displays, so I am almost certain that this formula has been modified in some way. Regardless, the above formula is pretty darn accurate, and if it has been modified in some way for the pokemon showdown ladder, which I am sure it has, then I don't doubt that the new formula is just as accurate if not more accurate. I must say that despite the claim that GXE is not as accurate when glicko-RD is high, the pokemon showdown ladder has a maximum RD of 350 for glicko-1 ratings, and honestly, if you compare a gxe that is based on a glicko rating with an RD of 0 with a glicko rating with an RD of 350, the resulting GXEs are not tremendously different.

Example:
2051 glicko rating rd of 0 >>> GXE: 89.3
2051 glicko rating rd of 350 >>> GXE: 84.6

Yes there is a 4.7 point difference here, but it really isn't that big when you consider how far apart a deviation of 0 (which by the way, is virtually impossible) is from a deviation of 350 (the deviation of a player who has not played any games). These results were calculated using the original GXE formula in microsoft excel.

I feel that it would be very nice if a player's GXE were to be converted into a whole number, preferably a 4 digit number that falls between 1000 and 2100-2300 or so, similarly to the elo rating on pokemon showdown's ladder. X-Act provided a simple way of doing this, proposing that the GXE simply be multiplied by 20, therefore resulting in a whole number rating with a floor of 0 (for a person with a 0 gxe, which is virtually impossible), and a maximum of 2000 (for a person with a 100 gxe, which is also virtually impossible). The problem with a rating system like this is that as a player gets into the 1900s, winning battles results in extremely small point gains, and reaching 2000 or even reaching 1960 is nigh impossible (1960 would be the equivalent of a 98 gxe, which is extremely hard to obtain, unless you are that sweetlol2 guy on the ubers ladder who apparently has a 98.3 gxe).
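To put a rough number on how the times-20 scale flattens out near the top, here is a small Python sketch. It assumes the original GXE formula with an RD of 0 and uses the Glicko rating behind a GXE as a stand-in for "how much better you have to be"; the helper names are made up for this illustration.

```python
import math

# Constants of the original GXE formula at RD = 0, so that
# log10(100 / GXE - 1) = OFFSET - SLOPE * Glicko.
SLOPE = math.pi / math.sqrt(2500 * (64 * math.pi ** 2 + 147 * math.log(10) ** 2))
OFFSET = 1500 * SLOPE

def times_twenty(gxe: float) -> int:
    """The simple GXE * 20 scaling described above."""
    return round(gxe * 20)

def glicko_at_rd0(gxe: float) -> float:
    """Glicko rating that would produce this GXE if RD were 0."""
    return (OFFSET - math.log10(100 / gxe - 1)) / SLOPE

def glicko_gap(low: float, high: float) -> float:
    """Glicko points needed to move from one GXE to another."""
    return glicko_at_rd0(high) - glicko_at_rd0(low)

# The same 40-point step on the times-20 scale, mid ladder versus top ladder.
for low, high in [(50, 52), (96, 98)]:
    print(times_twenty(low), "->", times_twenty(high),
          "costs about", round(glicko_gap(low, high)), "Glicko points")
```

Under these assumptions the 1920 to 1960 step costs several times as many Glicko points as the 1000 to 1040 step, which is the squeeze described above.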

And so I wondered, "what if I solve the gxe formula for glicko in terms of gxe, using an rd of 0?" And so I did. Here is how you would derive glicko (without an RD) from the original GXE formula:

Glicko = -((LN(100/GXE-1)/LN(10))-2.50901023943244)/0.00167267349295496
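The two long decimals in that formula are not arbitrary. They fall out of the original GXE formula once RD is set to 0. A short Python sketch, with names chosen just for this example, that recomputes them and inverts the formula:

```python
import math

# With RD = 0 the square root in the GXE formula collapses to a constant,
# so the exponent becomes (1500 - Glicko) * SLOPE.
SLOPE = math.pi / math.sqrt(2500 * (64 * math.pi ** 2 + 147 * math.log(10) ** 2))
OFFSET = 1500 * SLOPE

def glicko_from_gxe(gxe: float) -> float:
    """Invert the RD = 0 formula; only defined for 0 < gxe < 100."""
    if not 0 < gxe < 100:
        raise ValueError("GXE must be strictly between 0 and 100")
    return -(math.log10(100 / gxe - 1) - OFFSET) / SLOPE

print(SLOPE)                  # compare with 0.00167267349295496
print(OFFSET)                 # compare with 2.50901023943244
print(glicko_from_gxe(50.0))  # a 50 GXE maps back to roughly 1500
```

Note that a GXE of exactly 0 or 100 has no finite Glicko equivalent, which is why the sketch rejects those inputs.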


I then considered how the elo rating system relates to the glicko rating system, and I found this: http://www.glicko.net/ratings/report08.txt , which contains a formula for doing a rough conversion of the FIDE chess rating (a chess rating system which is virtually equivalent to elo) to the USCF (United States Chess Federation) chess rating (which is virtually equivalent to glicko). From this, it is possible to derive a rough conversion of the glicko rating system to the elo rating system, with a centered elo rating of 1250 (which I am sure is pretty close to the current mean elo rating on pokemon showdown's ladders, and yes this would mean that the starting rating would NOT be 1000).

The equation looks like this:

Elo = ROUND(((Glicko-720)/0.624),0) if Glicko < ~1969.2
Elo = ROUND(((Glicko+350)/1.1585),0) if Glicko >= ~1969.2
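As a sanity check on the two branches, here is a minimal Python sketch of the same piecewise conversion. The breakpoint is computed as the point where the two lines meet rather than typed in, and the names are only illustrative.

```python
# Coefficients taken from the two-branch conversion above.
LOW_OFFSET, LOW_SLOPE = 720.0, 0.624
HIGH_OFFSET, HIGH_SLOPE = 350.0, 1.1585

# Glicko value where (g - 720) / 0.624 equals (g + 350) / 1.1585.
BREAKPOINT = (LOW_OFFSET * HIGH_SLOPE + HIGH_OFFSET * LOW_SLOPE) / (HIGH_SLOPE - LOW_SLOPE)

def glicko_to_elo(glicko: float) -> int:
    """Rough Glicko -> Elo conversion; the low branch applies below the breakpoint."""
    if glicko < BREAKPOINT:
        value = (glicko - LOW_OFFSET) / LOW_SLOPE
    else:
        value = (glicko + HIGH_OFFSET) / HIGH_SLOPE
    # Python's round() sends exact halves to the even neighbour, unlike
    # Excel's ROUND; that only matters for values ending in exactly .5.
    return round(value)

print(BREAKPOINT)            # close to 1969.2
print(glicko_to_elo(1500))   # the centre of the scale: 1250
```

Because the two lines meet at the breakpoint, the converted rating does not jump when a player crosses it.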


From all of this, we can derive an equation for essentially converting GXE into Elo. However, the rating obtained from this equation is not equivalent to elo, but rather an estimate of what a player's TRUE Elo is, a value that is FAR more accurate in depicting a player's skill level than the crappy elo system that pokemon showdown uses. I'm not sure how the community will view this, but if I were to come up with a name for this rating I suppose the most basic name would be TRUE ELO, or TELO. This equation is listed here:

TELO = ROUND((((-((LN(100/GXE-1)/LN(10))-2.50901023943244)/0.00167267349295496)-720)/0.624),0) if GXE < 85.9
TELO = ROUND((((-((LN(100/GXE-1)/LN(10))-2.50901023943244)/0.00167267349295496)+350)/1.1585),0) if GXE >= 85.9
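The same two lines as a small Python function, for anyone who wants to plug in their own GXE. This is a sketch that copies the constants and the 85.9 cutoff straight from the equations above; the function name `telo` is just a label.

```python
import math

SLOPE = 0.00167267349295496   # constant from the inverted GXE formula
OFFSET = 2.50901023943244     # constant from the inverted GXE formula
GXE_CUTOFF = 85.9             # GXE at which the two Elo branches meet

def telo(gxe: float) -> int:
    """Convert a GXE percentage (0 < gxe < 100) into the proposed TELO rating."""
    if not 0 < gxe < 100:
        raise ValueError("GXE must be strictly between 0 and 100")
    glicko = -(math.log10(100 / gxe - 1) - OFFSET) / SLOPE
    if gxe < GXE_CUTOFF:
        return round((glicko - 720) / 0.624)
    return round((glicko + 350) / 1.1585)

for g in (30, 50, 70, 85.9, 95):
    print(g, telo(g))
```

A GXE of 50 lands on 1250, the centre of the scale, and the ratings stretch out as GXE approaches 100 instead of bunching up the way the times-20 scale does.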


The Advantages of This New Rating System
  • It has built in decay, due to the fact that GXE decays as RD gets bigger
  • Not all players will decay to the same final value, since the maximum RD is 350
  • It is FAR more accurate in depicting a player's true skill than elo, due to the fact that it uses GXE
  • Good players will not lose a TON of points from losing to weaker players because of hax/misplays
  • Players with high ratings will not get robbed of points by good players who have a misrepresentative rating
  • The range of ratings will go from as low as 500 to as much as 2500 (for players who are INSANELY good), which is similar to the range of ratings for the current elo system.
  • Ratings will be able to be accurately compared across ladders
  • Displays GXE as an elo rating, therefore not being a percentage.
  • It will not be ridiculously hard for top rated players to gain points due to the fact that the rating takes into account the fact that GXE becomes harder to raise as it approaches 100.
  • Doesn't share the problem that ACRE had, in which some players can get ridiculously high ladder scores.

Issues With This Rating System
  • Shares a similar problem with glicko in that after many battles are played, the rating will not be as sensitive to change. HOWEVER, it will still be possible to regain the ability to gain points more quickly by taking a break, and letting the RD increase.
  • Players may be able to park themselves at the top of the ladder, HOWEVER their ratings would still decay due to the increase of RD.

Easy Fix to the Problems With TELO
  • If a "ladder reset" option were programmed into the game, players could easily reset their rating back to 1250 with an RD of 350, if their rating was at a standstill and the player did not feel like waiting for his/her RD to increase.
  • It would have to be required that a player could not perform a ladder reset unless they have played "x" number of games, or until their RD is at least as low as "x," to prevent players from immediately resetting their ladder ranking upon losing their first match or their first few matches.
  • If desired, it could be programmed that ONLY players with an RD of 350 have their rating temporarily removed from the ladder, until they decide to ladder again. This would prevent parking.
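
A rough Python sketch of how the reset rule could look. Everything here is hypothetical: the `LadderEntry` fields, the function names, and especially the two thresholds, which are left as arguments because the "x" values above are deliberately open.

```python
from dataclasses import dataclass

START_TELO = 1250   # starting rating proposed above
MAX_RD = 350.0      # RD of a player with no games

@dataclass
class LadderEntry:
    telo: int
    rd: float
    games_played: int

def can_reset(entry: LadderEntry, min_games: int, max_rd: float) -> bool:
    """Allow a reset once enough games are played or the RD has dropped far enough."""
    return entry.games_played >= min_games or entry.rd <= max_rd

def reset(entry: LadderEntry, min_games: int, max_rd: float) -> LadderEntry:
    """Return a fresh entry if the reset is allowed, otherwise the entry unchanged."""
    if not can_reset(entry, min_games, max_rd):
        return entry
    return LadderEntry(telo=START_TELO, rd=MAX_RD, games_played=0)

# Placeholder thresholds purely for the demo; real values would need discussion.
veteran = LadderEntry(telo=1900, rd=40.0, games_played=300)
newcomer = LadderEntry(telo=1180, rd=290.0, games_played=2)
print(reset(veteran, min_games=50, max_rd=60.0))
print(reset(newcomer, min_games=50, max_rd=60.0))
```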

For reference I have provided a chart that converts gxe into TELO, to provide an idea of how the ratings would look:

https://docs.google.com/spreadsheets/d/1ITHZxJcczf4Hd9xZT0mPLYjCDi8tc7HfRS0t61Y7mB8/edit?usp=sharing

But wait, isn't this just ranking players by GXE?

Essentially, yes, but it displays GXE in a more convenient fashion. I do not feel that we should make any changes to the current GXE formula, except that no rating with an RD below 350 be displayed as "0" or "provisional," since the Glicko RD acts as a decay agent.
 
#2
my opinions:

I mean it feels weird to respond to such a long post with a relatively short one but I don't really see the serious need for this. Elo doesn't decide "real" things like suspect test qualifications, we have COIL for that, and for pretty much any purpose that isn't "placement at the near-top of the ladder" (like a means of comparison more consistent across tiers) by all means use the other ratings we show with /rank. If you think that GXE is a better means to rank, just find the GXEs we show for each player when you look at the laddering screen and order them. I will cover one major problem I see in your proposal and some reasons why Elo is used for this cosmetic placement scheme, though.

Among the most important features of Elo is a pragmatic time-based decay. I'm not familiar with the intricacies of glicko but the following seems to be true based on your language: Your solution to prevent alt-parking isn't very practical since it lets a player who left for three months suffer no decrease so long as he wins the next game (or few games?) upon his return. Aside from the system's interpretation of such a situation being quite illogical, it makes me wary as to whether your solution would really keep alt-parking as in check as it should be. I'd rather not finally conquer the ag ladder to have like Curve (love ya) who had disappeared from the rankings unretire and pop into existence 150 points ahead of me - because he won one game, as he is wont to do, since his return. If I'm right in my interpretation of your claims the "phantom peaker" would create a probably intolerable placement insecurity.

Another key advantage is the helpfulness/simplicity of Elo's premises to newcomers: you have the nice even 1000 for a baseline, and expand on that by beating players and gaining points to show you belong in a "higher crowd" competition-wise. This makes the placement of people higher on the ladder "click" in such cases: they clearly kept up the winning and showed that they "deserved" to face/reflect tougher opponents. Winning games to try and "sway" a gxe, even if it is in point form, is rather less fluid and more frustrating to ladderers aspiring to use their scores as a means to compete. Even after lots of ladder experience, trying to pump up gxe has still frustrated me more than once. Once you're familiar enough with Elo, you basically know what you stand to win/lose from a match if you know the opponents' Elo, and as I said before getting Elo points clicks. To the human mind even if not to the details of math, gxe is a fickle mistress.
On a larger scale, Elo has the tendency to create convenient "low" "mid" and "high" Elo zones useful for kinds of analyses and expectations gxe comparisons don't allow.

Overall there are more points on both sides to be argued, but I don't think doing so would serve enough purpose in this context due to the lack of need for change.
 
#3
It appears that my solution to parking is definitely somewhat lackluster. Perhaps it would have to be programmed that ANY TELO with an RD of 350 would automatically be reset. Either this or we simply would not implement an auto reset for inactive accounts.

I would like to dive a bit further into this:

Say a random player has a GXE of 95 (which would likely be high enough to peak or nearly peak the ladder), and they have an RD of 25, which is the lowest obtainable RD on the ladder. This would give them a TELO rating of 2257. Now let's say that they stop playing, and never play again, so their RD expands all the way until it reaches 350. Their new GXE would be 91.4-91.5, placing their floor TELO rating at roughly 2125-2129. This is roughly a 130 point decay, which is pretty big, but not nearly as big as the decay that is currently on pokemon showdown's ladders. Based on how I constructed the rating system, other TELO ratings would also likely have a similar maximum decay.
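
Here is a rough Python sketch of that calculation for anyone who wants to try other starting values. It assumes the original GXE formula with no provisional cutoff, and it assumes the player's Glicko rating stays put while only the RD grows during inactivity. The helper names are made up for this example.

```python
import math

LN10_SQ = math.log(10) ** 2

def scale(rd: float) -> float:
    """The square-root term of the original GXE formula."""
    return math.sqrt(3 * LN10_SQ * rd ** 2
                     + 2500 * (64 * math.pi ** 2 + 147 * LN10_SQ))

def gxe(rating: float, rd: float) -> float:
    """Unrounded GXE percentage, with no RD > 100 cutoff."""
    return 100 / (1 + 10 ** ((1500 - rating) * math.pi / scale(rd)))

def rating_for(gxe_value: float, rd: float) -> float:
    """Glicko rating that gives this GXE at this RD."""
    return 1500 - math.log10(100 / gxe_value - 1) * scale(rd) / math.pi

def telo(gxe_value: float) -> int:
    """TELO from GXE, using the RD = 0 inversion and the 85.9 cutoff."""
    glicko0 = rating_for(gxe_value, 0)
    if gxe_value < 85.9:
        return round((glicko0 - 720) / 0.624)
    return round((glicko0 + 350) / 1.1585)

active_telo = telo(95)
hidden_rating = rating_for(95, 25)      # Glicko behind a 95 GXE at RD 25
parked_gxe = gxe(hidden_rating, 350)    # same Glicko once RD has maxed out
parked_telo = telo(parked_gxe)
print(active_telo, round(parked_gxe, 1), parked_telo, active_telo - parked_telo)
```

Swapping 95 for another GXE shows how the size of the maximum decay changes across the ladder under the same assumptions.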

So yes, in this system, players could park themselves, which would be a bit of a problem. I'd say the best solution to this would be automatically resetting any players' rating that has a glicko-RD of 350, instead of a "temporary reset" as I had initially suggested. Another option would be to reset a player's rating after a certain amount of time had passed without the player doing any battles (say, 50 days or so), although I prefer the "reset when RD reaches 350" option, as it allows more time before a reset for players who have played a great number of games than for players who have played few games. The only issue with this option is that some players would have a tremendous amount of time before a reset, but in my opinion, it really isn't that big of a deal.

As for the claim that there is a lack of need for change, I am sure that the majority of smogon staff members would agree. However, I must say that I would find it hard to believe that elo doesn't decide "real" things, because yes, though what you said is true, that elo only determines the order in which the players on the ladder are ranked, and that COIL, which is derived from GXE, is what smogon uses for suspect test qualifications, I would have to say that "deciding the numerical order in which players are ranked," is by far the most "real" thing that a rating system can be used for. The primary purpose of the showdown ladder is to rank players according to skill, not to decide who gets to vote in suspect tests. That's what suspect ladders are used for. And in many ladders, there are no suspect tests, such as in AG, Ubers (I know that ubers suspect tests do occur but they are very rare), and all of the past generation ladders. My issue with the elo system is that its function is to rank players according to skill, as well as display a numerical value that estimates a given player's probability of winning a match (because that is what a player's "true" rating actually is), and in the case of pokemon, it fails to do so. In chess, however, in which there is no such thing as "hax," the elo system is great. But pokemon is not chess, and rather than being solely based on skill, it is also based on luck. Hence, elo is not the best ranking system for showdown. I made a bunch of points about the flaws of the elo system in my post which I feel are pretty big flaws.

Also, yes, there are flaws with my rating system (which kind of isn't mine, since it's based on GXE which was invented by X-Act). A starting rating of 1250 isn't as convenient as 1000, but there will still be well defined rating ranges of skill level (1200s players would be average, 1500s-1600s would be mid ladder, 2000s+ would be superb players and 2300s+ would be gods). I think getting more people to discuss the issues with my system and how they compare to the flaws of the current system would be beneficial.
 
#7
Wow, I was wondering how the ladder could become something really competitive and how to fix the quantity over quality problem. And, unless I misunderstood something because I'm not a native English speaker, your system would work really well on a Pokemon ladder: a player can see how fast he is progressing by laddering, write down his score, then reset and try to get a better one. I'm not really great at maths, but that's the best system I have ever seen for a relatively high variance game ladder.

I hope it will be implemented on PS
 
#8
The biggest problem with using GXE over ELO is the idea of consistency. Counter-intuitive? No.

First and foremost, the ladder is not and never will be a place to show off skill. It's mainly used for team testing. Secondly, you have to take into account playstyles. Stall for example has a lot more consistency than HO, meaning that a stall player has a much easier time getting a higher GXE than the latter. Does this mean that the former is "better" than the latter?

You need to understand that the "high" ladder is there for players to test out their teams with other players who play the ladder often, which usually means they have more meta experience. They are in no way better due to this standard.

Good math though!
 
#9
I love the effort you put into this Reffrey. Glad you did this because it's really annoying to leave the ladder and decay a lot more. I like your math as well; although who would really want to do math on forums? A for effort

(You may want to shorten it a bit for people without the attention span)
 
