Screw Elo, A New Proposed Skill Rating System That Is Far More Accurate (Based on the GXE Formula)

#1
Greetings fellow smogon nerds,

Have you ever experienced the frustrations of laddering under our current ranking system? I reach out specifically to players who are ranked in the very top percentages of the ladder. Anyone who ladders competitively and who has made it to the top 10 or so of the ladder will really know what I'm talking about. I have been on pokemon showdown since late 2013, right when generation 6 made its debut, and have played a variety of metagames. As of now I simply focus on playing Anything Goes, which is by far my favorite metagame to play. I have consistently been a high ranked player on the Gen 7 AG Ladder, and have made it to number 1 only once, although I am quite often in the top 10. Anyone who plays Anything Goes knows that out of all of the ladders on the website besides Random Battle and occasionally OU, the AG ladder typically contains the highest ELO rated players. Have a look at this screenshot for instance:

Screenshot (26).png

As can be seen, 4 players are rated above 2000, while there is a big gap between the ranked number 4 player and the ranked number 5 player. In other ladders, such as the ubers ladder, top rated players are in the high 1800s, and for inactive ladders top rated players can be as low as in the 1200s-1300s. At this time, I was at my highest ever elo rating, and I felt very satisfied. I had been playing 6 games per day, which as far as I know is the minimum number of games needed to be played in order to prevent decay. The next day, however, in the very first game I played, I was matched up against a player who was rated in the low 1800s, and I lost due to making a terrible misplay, scratching a terrible 30 something points off of my rating. I then proceeded to win several games in a row in order to get my rating back up, but before I made it to 2065+ again, I faced another crippling loss, knocking me down even further than where I had been knocked the first time. It wasn't long before I started playing badly due to frustration, and I fell to the low 1900s. Each time I lost I would win several games in succession, each one earning me a mere 4-6 ladder points, only to see myself get knocked down by 30 points again from a single loss. After I had "tilted," I decided to take a several day break from laddering due to how frustrating it was for me. And trust me, I had been through this many times already. Tilting is a common scenario that high ladder players experience, and to prevent it they often have a policy such as "if I experience 2 losses in a row, then I will take a break from laddering until tomorrow to prevent tilting." After going through the tilting process countless times, it dawned on me one day, "is the pokemon showdown rating system really that accurate?" I raised this question not only out of my frustration but also due to the several, severe issues that I noticed with the way players on showdown are ranked.

Issue Number 1: Elo Ratings From One Ladder Cannot Be Accurately Compared to Those From Another

This is something that I hinted at earlier. Every ladder has a different peak elo score, and every ladder has a different degree of variance. As of right now, for instance, the number 1 rated player in Anything Goes has an elo of 2087, and the ranked 500 player has a rating of 1581, while in OU, the number 1 player is at 1979, while the number 500 player is at 1688, as can be seen in these screenshots:

Screenshot (34).png Screenshot (33).png Screenshot (32).png Screenshot (31).png
So the big question about these screenshots is, which player is better, the number 1 AG player or the number 1 OU player? What about the number 500 AG player and the number 500 OU player? Based on elo, this question is simply impossible to answer. A 1688 elo in OU is not equivalent to a 1688 elo in AG, because both ladders have different degrees of variance, and different population sizes. The AG ladder obviously has more variance than the OU ladder, despite the fact that I am pretty sure that the OU ladder contains a greater number of players. To top that off, elo, simply put, is not an accurate method of displaying a player's unknowable true rating, especially when it comes to the game of pokemon. This will be further emphasized in the next issue with it.

Issue Number 2: Good Players Can Decay and then Rob Higher Elo Rated Players of Points

This, I must say, is a horrendous issue that needs to be fixed. There are some players who are extremely skilled, and easily capable of topping the ladder, yet they do not ladder very often. As a result, they are rated much lower than they should be, due to the fact that they decay by going long periods without laddering. When they actually do ladder, any higher rated sucker who is unfortunate enough to get matched up with them risks getting beaten and losing more points than they should.

Issue Number 3: All High Rated Players Decay to the Same Rating

As far as I know, rating decay begins at 1500 and does not occur at ratings 1499 or below (although I've heard that in some ladders this value is different, such as 1399, but maybe this isn't true). Regardless of the "minimum rating," that elo decays to, I do not think that every player above this rating should eventually decay down to the same rating. Why on earth should a 2100 rated player decay to 1499 after not playing (in this case, for a very long time) for a while, and a player rated 1550 decay to that same rating (in a shorter amount of time, though)?

Issue Number 4: Pokemon is not only a game of skill, but also a game of luck

This is perhaps the biggest reason why the elo rating system is not suited for pokemon. As we all know, in pokemon, there exists a little battle mechanic that game freak decided to implement into the game called accuracy. This acts hand in hand with the secondary effect chances that some moves have, such as a 10% chance to burn, paralyze, or freeze. We all know the frustration of getting "haxed," when we miss a 90% accurate move 3 times in a row, or when we get smacked with a double or even triple crit (I witnessed a triple crit occur in a battle yesterday, not sure if this is a bug or not, but the chance of a crit occurring is supposed to be 1/16), or when the opponent freezes a pokemon with ice beam. And don't forget about the dreaded para or flinch hax! Due to the existence of these "hax" factors, we can sometimes lose battles on the ladder that we clearly should have won, and simply lost because of luck. People may argue that hax happens to everyone, and that it is just a part of the game, and that while it does result in upset losses for a player it also results in upset wins, which is true, but there is no reason why a 2050 rated player should lose 40 points to a 1700s player simply because he lost due to an extremely unlucky game. In other words, the pokemon showdown elo rating system simply does not offer an accurate representation of a player's or pokemon team's consistency.
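
For anyone who wants to see where the lopsided point swings come from, here is a minimal sketch of the textbook Elo update. The K-factor of 32 and the function names are assumptions made for illustration; they are not confirmed Pokemon Showdown settings, so the exact point values on the real ladder may differ.

```python
def expected_score(rating, opponent):
    """Textbook Elo expected score for `rating` against `opponent`."""
    return 1 / (1 + 10 ** ((opponent - rating) / 400))

def elo_change(rating, opponent, score, k=32):
    """Points gained (positive) or lost (negative).

    score is 1 for a win and 0 for a loss.
    k=32 is an assumed K-factor, not a confirmed Pokemon Showdown value.
    """
    return k * (score - expected_score(rating, opponent))

# A 2050 rated player facing a 1700 rated player.
win = elo_change(2050, 1700, 1)
loss = elo_change(2050, 1700, 0)
print(f"win: {win:+.1f}, loss: {loss:+.1f}")
```

Under these assumptions the favorite gains only a few points per win and loses most of K on a single upset, and the two magnitudes always add up to K.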

So if Pokemon Showdown Does Not Use Elo to Rank Players, What Ranking System Can Possibly Replace It?

This is an interesting question to answer. In the past, pokemon showdown used ACRE, a ranking system that was horribly inaccurate when it came to estimating ratings that had a high rd (rating deviation, which is a fancy term for standard deviation), and was meant as a way of interpreting the glicko-1 rating. Eventually, showdown switched to elo, and made a few changes to the system afterward such as introducing a rating floor of 1000, and some other tweaks. Currently, the pokemon showdown ladder displays a player's elo, glicko-1, GXE, and on suspect ladders, COIL. Of all of these ratings, which one takes the cake as being the most accurate? That answer is, without any doubt whatsoever, GXE. If you don't believe me, take a look at X-Act's original post that introduced the concept of GXE, and how it is calculated, which can be found here: http://www.smogon.com/forums/thread...layers-overall-rating-than-shoddys-cre.51169/ . If you read everything that X-Act wrote, you will see that he calculated the exact true rating of 250 chosen players by examining the glicko and rd of each player, and using an equation devised by Mark Glickman, the inventor of the Glicko-1 and Glicko-2 rating systems, to calculate the probability of every single player beating every single other player. He then matched these true ratings up with his GXE formula, and the results were astoundingly accurate. The GXE formula ordered the players in the exact same order that their true ratings would have ordered them in, a degree of accuracy which is absolutely stunning.

So if GXE is More Accurate Than Glicko-1 and Elo, Why Don't We Use it to Rank Players?

The big reason why gxe is not used to rank players is that it is a percentage, rather than a solid whole number. People tend to prefer their rating as a whole number, rather than as a percentage referring to their estimated chance of winning a match against a random opponent. Another issue with GXE is that it is less accurate in determining a player's rating when the player has a high rating deviation, and so rating deviations above 100 result in a rating of "provisional," or 0. It also shares some of the problems with glicko in that after many battles are played, it is difficult to change. However, I have a simple solution to all of these issues.

My Rating System

In reality, the goal of a rating system is to get as close to an estimate of a player's true rating as possible. X-Act's GXE formula does an extremely good job of doing this, so I really think that we need to consider the work that he left behind, since he is not available on showdown and hasn't been online since 2012. I am honestly baffled at why the smogon staff chose to use elo as the primary ranking system, when GXE is far superior, and can have its issues fixed with a few tweaks.

The original GXE formula looked like this:

Given a player rating R and a rating deviation RD:
GLIXARE Rating = 0, if RD > 100
GLIXARE Rating = round(10000 / (1 + 10^(((1500 - R) * pi / sqrt(3 * ln(10)^2 * RD^2 + 2500 * (64 * pi^2 + 147 * ln(10)^2)))))) / 100, otherwise
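
For those who would rather check numbers in code than in a spreadsheet, the formula above can be sketched in Python roughly as follows. The function name and the `rd_cutoff` parameter are my own additions for illustration (the cutoff switch simply lets you evaluate the formula at any RD); this is the original formula as quoted, not whatever modified version the ladder may be using.

```python
import math

def gxe(r, rd, rd_cutoff=100):
    """GLIXARE rating as a percentage, following the formula quoted above.

    rd_cutoff is a hypothetical parameter added for this sketch: pass None
    to skip the "0 if RD > 100" rule and evaluate the formula at any RD.
    """
    if rd_cutoff is not None and rd > rd_cutoff:
        return 0.0
    ln10_sq = math.log(10) ** 2
    denom = math.sqrt(3 * ln10_sq * rd ** 2
                      + 2500 * (64 * math.pi ** 2 + 147 * ln10_sq))
    exponent = (1500 - r) * math.pi / denom
    return round(10000 / (1 + 10 ** exponent)) / 100

print(gxe(1500, 50))                   # an exactly average rating
print(gxe(1800, 150))                  # RD above the cutoff, so provisional
print(gxe(1800, 150, rd_cutoff=None))  # same inputs with the cutoff disabled
```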


I have tested this formula with the glicko and rds of current top ladder players, and I did not get the same resulting GXEs as what the ladder displays, so I am almost certain that this formula has been modified in some way. Regardless, the above formula is pretty darn accurate, and if it has been modified in some way for the pokemon showdown ladder, which I am sure it has, then I don't doubt that the new formula is just as accurate if not more accurate. I must say that despite the claim that GXE is not as accurate when glicko-RD is high, the pokemon showdown ladder has a maximum RD of 350 for glicko-1 ratings, and honestly, if you compare a gxe that is based on a glicko rating with an RD of 0 with a glicko rating with an RD of 350, the resulting GXEs are not tremendously different.

Example:
2051 glicko rating rd of 0 >>> GXE: 89.3
2051 glicko rating rd of 350 >>> GXE: 84.6

Yes there is a 4.7 point difference here, but it really isn't that big when you consider how far apart a deviation of 0 (which by the way, is virtually impossible) is from a deviation of 350 (the deviation of a player who has not played any games). These results were calculated using the original GXE formula in microsoft excel.

I feel that it would be very nice if a player's GXE were to be converted into a whole number, preferably a 4 digit number that falls between 1000 and 2100-2300 or so, similarly to the elo rating on pokemon showdown's ladder. X-Act provided a simple way of doing this, proposing that the GXE simply be multiplied by 20, therefore resulting in a whole number rating with a floor of 0 (for a person with a 0 gxe, which is virtually impossible), and a maximum of 2000 (for a person with a 100 gxe, which is also virtually impossible). The problem with a rating system like this is that as a player gets into the 1900s, winning battles results in extremely small point gains, and reaching 2000 or even reaching 1960 is nigh impossible (1960 would be the equivalent of a 98 gxe, which is extremely hard to obtain, unless you are that sweetlol2 guy on the ubers ladder who apparently has a 98.3 gxe).
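
To illustrate how the "multiply by 20" display squeezes the top of the ladder, here is a small sketch that compares the display points earned for the same 100 point Glicko improvement at different skill levels. It assumes an RD of 0 and the original GXE formula; the names `K` and `gxe_times_20` are made up for this example.

```python
import math

# Slope of the original GXE curve at RD = 0 (same constants as the formula above).
K = math.pi / math.sqrt(2500 * (64 * math.pi ** 2 + 147 * math.log(10) ** 2))

def gxe_times_20(glicko):
    """X-Act's suggested whole-number display: GXE (at RD 0) multiplied by 20."""
    gxe = 100 / (1 + 10 ** ((1500 - glicko) * K))
    return round(gxe * 20)

for low in (1500, 2000, 2400):
    gain = gxe_times_20(low + 100) - gxe_times_20(low)
    print(f"Glicko {low} -> {low + 100}: +{gain} display points")
```

The same improvement in underlying rating is worth far fewer display points near the top than in the middle, which is the compression described above.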

And so I wondered, "what if I solve the gxe formula for glicko in terms of gxe, using an rd of 0?" And so I did. Here is how you would derive glicko (without an RD) from the original GXE formula:

Glicko = -((LN(100/GXE-1)/LN(10))-2.50901023943244)/0.00167267349295496
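
The same inversion can be sketched in Python. Instead of hardcoding the two decimal constants, this version recomputes the slope from the original formula; 1500 times that slope gives the 2.509... constant in the spreadsheet version. The function name is just illustrative.

```python
import math

# Slope of the GXE curve at RD = 0; roughly 0.00167267.
K = math.pi / math.sqrt(2500 * (64 * math.pi ** 2 + 147 * math.log(10) ** 2))

def glicko_from_gxe(gxe):
    """Invert the RD = 0 GXE formula.

    gxe is a percentage strictly between 0 and 100.
    """
    return 1500 - math.log10(100 / gxe - 1) / K

print(round(glicko_from_gxe(50)))
print(round(glicko_from_gxe(89.3)))
```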


I then considered how the elo rating system relates to the glicko rating system, and I found this: http://www.glicko.net/ratings/report08.txt , which contains a formula for doing a rough conversion of the FIDE chess rating (a chess rating system which is virtually equivalent to elo) to the USCF (United States Chess Federation) chess rating (which is virtually equivalent to glicko). From this, it is possible to derive a rough conversion of the glicko rating system to the elo rating system, with a centered elo rating of 1250 (which I am sure is pretty close to the current mean elo rating on pokemon showdown's ladders, and yes this would mean that the starting rating would NOT be 1000).

The equation looks like this:

Elo = ROUND(((Glicko-720)/0.624),0) if Glicko < ~1969.2
Elo = ROUND(((Glicko+350)/1.1585),0) if Glicko >= ~1969.2
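
As a rough sketch, the two-piece conversion looks like this in Python. The function name is made up, and the 1969.2 crossover is the approximate point where the two lines meet.

```python
def elo_from_glicko(glicko):
    """Rough Glicko -> Elo-style conversion using the two linear pieces above."""
    if glicko < 1969.2:  # approximate crossover of the two lines
        return round((glicko - 720) / 0.624)
    return round((glicko + 350) / 1.1585)

# A Glicko rating of 1500 lands on the centered rating of 1250.
print(elo_from_glicko(1500))
```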


From all of this, we can derive an equation for essentially converting GXE into Elo. However, the rating obtained from this equation is not equivalent to elo, but rather an estimate of what a player's TRUE Elo is, a value that is FAR more accurate in depicting a player's skill level than the crappy elo system that pokemon showdown uses. I'm not sure how the community will view this, but if I were to come up with a name for this rating I suppose the most basic name would be TRUE ELO, or TELO. This equation is listed here:

TELO = ROUND((((-((LN(100/GXE-1)/LN(10))-2.50901023943244)/0.00167267349295496)-720)/0.624),0) if GXE < 85.9
TELO = ROUND((((-((LN(100/GXE-1)/LN(10))-2.50901023943244)/0.00167267349295496)+350)/1.1585),0) if GXE >= 85.9
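
Putting the pieces together, a minimal Python sketch of the TELO calculation could look like the following. It assumes the original GXE formula at an RD of 0, and the names `K` and `telo` are only for illustration.

```python
import math

# Slope of the GXE curve at RD = 0; roughly 0.00167267.
K = math.pi / math.sqrt(2500 * (64 * math.pi ** 2 + 147 * math.log(10) ** 2))

def telo(gxe):
    """Convert a GXE percentage (strictly between 0 and 100) into TELO."""
    # Step 1: recover the Glicko rating implied by the GXE at RD = 0.
    glicko = 1500 - math.log10(100 / gxe - 1) / K
    # Step 2: apply the two-piece Glicko -> Elo-style conversion.
    if gxe < 85.9:
        return round((glicko - 720) / 0.624)
    return round((glicko + 350) / 1.1585)

for g in (50, 75, 85.9, 95):
    print(g, telo(g))
```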


The Advantages of This New Rating System
  • It has built in decay, due to the fact that GXE decays as RD gets bigger
  • Not all players will decay to the same final value, since the maximum RD is 350
  • It is FAR more accurate in depicting a player's true skill than elo, due to the fact that it uses GXE
  • Good players will not lose a TON of points from losing to weaker players because of hax/misplays
  • Players with high ratings will not get robbed of points by good players who have a misrepresentative rating
  • The range of ratings will go from as low as 500 to as much as 2500 (for players who are INSANELY good), which is similar to the range of ratings for the current elo system.
  • Ratings will be able to be accurately compared across ladders
  • Displays GXE as an elo rating, therefore not being a percentage.
  • It will not be ridiculously hard for top rated players to gain points due to the fact that the rating takes into account the fact that GXE becomes harder to raise as it approaches 100.
  • Doesn't share the problem that ACRE had, in which some players can get ridiculously high ladder scores.
Issues With This Rating System
  • Shares a similar problem with glicko in that after many battles are played, the rating will not be as sensitive to change. HOWEVER, it will still be possible to regain the ability to gain points more quickly by taking a break, and letting the RD increase.
  • Players may be able to park themselves at the top of the ladder, HOWEVER their ratings would still decay due to the increase of RD.

Easy Fix to the Problems With TELO
  • If a "ladder reset" option were programmed into the game, players could easily reset their rating back to 1250 with an RD of 350, if their rating was at a standstill and the player did not feel like waiting for his/her RD to increase.
  • It would have to be required that a player could not perform a ladder reset unless they have played "x" number of games, or until their RD is at least as low as "x," to prevent players from immediately resetting their ladder ranking upon losing their first match or their first few matches.
  • If desired, it could be programmed that ONLY players with an RD of 350 have their rating temporarily removed from the ladder, until they decide to ladder again. This would prevent parking.

For reference I have provided a chart that converts gxe into TELO, to provide an idea of how the ratings would look:

https://docs.google.com/spreadsheets/d/1ITHZxJcczf4Hd9xZT0mPLYjCDi8tc7HfRS0t61Y7mB8/edit?usp=sharing

But wait, isn't this just ranking players by GXE?

Essentially, yes, but it displays GXE in a more convenient fashion. I do not feel that we should make any changes to the current GXE formula, except that no rating with an RD below 350 be displayed as "0" or "provisional," since the Glicko RD acts as a decay agent.
 

Tiksi

prepare for explosive flame
is a Battle Server Moderator
#2
my opinions:

I mean it feels weird to respond to such a long post with a relatively short one but I don't really see the serious need for this. Elo doesn't decide "real" things like suspect test qualifications, we have COIL for that, and for pretty much any purpose that isn't "placement at the near-top of the ladder" (like a means of comparison more consistent across tiers) by all means use the other ratings we show with /rank. If you think that GXE is a better means to rank, just find the GXEs we show for each player when you look at the laddering screen and order them. I will cover one major problem I see in your proposal and some reasons why Elo is used for this cosmetic placement scheme, though.

Among the most important features of Elo is a pragmatic time-based decay. I'm not familiar with the intricacies of glicko but the following seems to be true based on your language: Your solution to prevent alt-parking isn't very practical since it lets a player who left for three months suffer no decrease so long as he wins the next game (or few games?) upon his return. Aside from the system's interpretation of such a situation being quite illogical, it makes me wary as to whether your solution would really keep alt-parking as in check as it should be. I'd rather not finally conquer the ag ladder to have like Curve (love ya) who had disappeared from the rankings unretire and pop into existence 150 points ahead of me - because he won one game, as he is wont to do, since his return. If I'm right in my interpretation of your claims the "phantom peaker" would create a probably intolerable placement insecurity.

Another key advantage is the helpfulness/simplicity of Elo's premises to newcomers: you have the nice even 1000 for a baseline, and expand on that by beating players and gaining points to show you belong in a "higher crowd" competition-wise. This makes the placement of people higher on the ladder "click" in such cases: they clearly kept up the winning and showed that they "deserved" to face/reflect tougher opponents. Winning games to try and "sway" a gxe, even if it is in point form, is rather less fluid and more frustrating to ladderers aspiring to use their scores as a means to compete. Even after lots of ladder experience, trying to pump up gxe has still frustrated me more than once. Once you're familiar enough with Elo, you basically know what you stand to win/lose from a match if you know the opponents' Elo, and as I said before getting Elo points clicks. To the human mind even if not to the details of math, gxe is a fickle mistress.
On a larger scale, Elo has the tendency to create convenient "low" "mid" and "high" Elo zones useful for kinds of analyses and expectations gxe comparisons don't allow.

Overall there are more points on both sides to be argued, but I don't think doing so would serve enough purpose in this context due to the lack of need for change.
 
#3
It appears that my solution to parking is definitely somewhat lackluster. Perhaps it would have to be programmed that ANY TELO with an RD of 350 would automatically be reset. Either this or we simply would not implement an auto reset for inactive accounts.

I would like to dive a bit further into this:

Say a random player has a GXE of 95 (which would likely be high enough to peak or nearly peak the ladder), and they have an RD of 25, which is the lowest obtainable RD on the ladder. This would give them a TELO rating of 2257. Now let's say that they stop playing, and never play again, so their RD expands all the way until it reaches 350. Their new GXE would be 91.4-91.5, placing their floor TELO rating at roughly 2125-2129. This is roughly a 130 point decay, which is pretty big, but not nearly as big as the decay that is currently on pokemon showdown's ladders. Based on how I constructed the rating system, other TELO ratings would also likely have a similar maximum decay.
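
Here is a rough Python sketch of that decay calculation so others can plug in their own numbers. It assumes the original GXE formula with no provisional cutoff, and that the underlying Glicko rating stays fixed while only the RD grows during inactivity; all function names are made up for this example.

```python
import math

LN10_SQ = math.log(10) ** 2
BASE = 2500 * (64 * math.pi ** 2 + 147 * LN10_SQ)

def slope(rd):
    """Slope of the GXE curve for a given rating deviation."""
    return math.pi / math.sqrt(3 * LN10_SQ * rd ** 2 + BASE)

def gxe(glicko, rd):
    """Original GXE formula, evaluated at any RD (no provisional cutoff)."""
    return round(10000 / (1 + 10 ** ((1500 - glicko) * slope(rd)))) / 100

def telo(gxe_value):
    """GXE percentage -> TELO, using the RD = 0 inversion."""
    glicko = 1500 - math.log10(100 / gxe_value - 1) / slope(0)
    if gxe_value < 85.9:
        return round((glicko - 720) / 0.624)
    return round((glicko + 350) / 1.1585)

# Glicko rating that yields a 95 GXE at RD 25 (solving the GXE formula for R).
glicko = 1500 - math.log10(100 / 95 - 1) / slope(25)

# Assumption: the Glicko rating is unchanged while RD expands to 350.
decayed_gxe = gxe(glicko, 350)
print(telo(95), decayed_gxe, telo(decayed_gxe))
```

Swapping in a different starting GXE or RD shows how large the maximum decay would be for other players.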

So yes, in this system, players could park themselves, which would be a bit of a problem. I'd say the best solution to this would be automatically resetting any players' rating that has a glicko-RD of 350, instead of a "temporary reset" as I had initially suggested. Another option would be to reset a player's rating after a certain amount of time had passed without the player doing any battles (say, 50 days or so), although I prefer the "reset when RD reaches 350" option, as it allows more time before a reset for players who have played a great number of games than for players who have played few games. The only issue with this option is that some players would have a tremendous amount of time before a reset, but in my opinion, it really isn't that big of a deal.

As for the claim that there is a lack of need for change, I am sure that the majority of smogon staff members would agree. However, I must say that I would find it hard to believe that elo doesn't decide "real" things, because yes, though what you said is true, that elo only determines the order in which the players on the ladder are ranked, and that COIL, which is derived from GXE, is what smogon uses for suspect test qualifications, I would have to say that "deciding the numerical order in which players are ranked," is by far the most "real" thing that a rating system can be used for. The primary purpose of the showdown ladder is to rank players according to skill, not to decide who gets to vote in suspect tests. That's what suspect ladders are used for. And in many ladders, there are no suspect tests, such as in AG, Ubers (I know that ubers suspect tests do occur but they are very rare), and all of the past generation ladders. My issue with the elo system is that its function is to rank players according to skill, as well as display a numerical value that estimates a given player's probability of winning a match (because that is what a player's "true" rating actually is), and in the case of pokemon, it fails to do so. In chess, however, in which there is no such thing as "hax," the elo system is great. But pokemon is not chess, and rather than being solely based on skill, it is also based on luck. Hence, elo is not the best ranking system for showdown. I made a bunch of points about the flaws of the elo system in my post which I feel are pretty big flaws.

Also, yes, there are flaws with my rating system (which kind of isn't mine, since it's based on GXE which was invented by X-Act). A starting rating of 1250 isn't as convenient as 1000, but there will still be well defined rating ranges of skill level (1200s players would be average, 1500s-1600s would be mid ladder, 2000s+ would be superb players and 2300s+ would be gods). I think getting more people to discuss the issues with my system and how they compare to the flaws of the current system would be beneficial.
 
#7
Wow, I was wondering how the ladder could become something really competitive and how to fix the quantity-over-quality problem. And, unless I misunderstood something because I'm not a native English speaker, your system would work really well on a Pokemon ladder: a player can see how fast his progress is by laddering, writing down his score, then resetting and trying to get a better one. I'm not really great at maths, but that's the best system I've ever seen for a relatively high-variance game ladder.

I hope it will be implemented on PS
 
#8
The biggest problem with using GXE over ELO is the idea of consistency. Counter-intuitive? No.

First and foremost, the ladder is not and never will be a place to show off skill. It's mainly used for team testing. Secondly, you have to take into account playstyles. Stall for example has a lot more consistency than HO, meaning that a stall player has a much easier time getting a higher GXE than the latter. Does this mean that the former is "better" than the latter?

You need to understand that the "high" ladder is there for players to test out their teams with other players who play the ladder often, which usually means they have more meta experience. They are in no way better due to this standard.

Good math though!
 
#9
I love the effort you put into this Reffrey. Glad you did this because it's really annoying to leave the ladder and decay a lot more. I like your math as well; although who would really want to do math on forums? A for effort

(You may want to shorten it a bit for people without the attention span)
 

Zarel

Not a Yuyuko fan
is a member of the Site Staff, is a Battle Server Administrator, is a Programmer, is a Pokemon Researcher, is an Administrator
Creator of PS
#10
You PMed me because this thread stagnated and you wanted me to do something, so here's my response.

Issue Number 1: Elo Ratings From One Ladder Cannot Be Accurately Compared to Those From Another

This is something that I hinted at earlier. Every ladder has a different peak elo score, and every ladder has a different degree of variance. As of right now, for instance, the number 1 rated player in Anything Goes has an elo of 2087, and the ranked 500 player has a rating of 1581, while in OU, the number 1 player is at 1979, while the number 500 player is at 1688, as can be seen in these screenshots:

View attachment 88774 View attachment 88775 View attachment 88776 View attachment 88777
So the big question about these screenshots is, which player is better, the number 1 AG player or the number 1 OU player? What about the number 500 AG player and the number 500 OU player? Based on elo, this question is simply impossible to answer. A 1688 elo in OU is not equivalent to a 1688 elo in AG, because both ladders have different degrees of variance, and different population sizes. The AG ladder obviously has more variance than the OU ladder, despite the fact that I am pretty sure that the OU ladder contains a greater number of players. To top that off, elo, simply put, is not an accurate method of displaying a player's unknowable true rating, especially when it comes to the game of pokemon. This will be further emphasized in the next issue with it.
I mean, comparing players of two different games is always going to be hard.

Like, ask yourself this: Is Magnus Carlsen a better chess player than Scarlett is a StarCraft 2 player? How would you even make that comparison?

The real answer is: Math doesn't work that way, you'll have to be more specific than that. You can't compare apples and oranges, but you can massage numbers into something comparable.

For instance, you can ask: Which of them is more likely to win against an average player of their game? (I'd guess Scarlett, because I'd guess there are more things a pro can do to completely overwhelm an amateur in StarCraft than in chess.)

Or you could ask: Which of them is a higher-percentile player? (I'd guess Magnus, because he's competing against many more people; a lot more people know how to play chess than StarCraft)

But the point is, no matter which numbers you compare to each other, it's meaningless unless you know what the numbers mean.

(Anyway, GXE is a good proxy for percentile. Or you can do ladder rank divided by ladder activity, I guess. Or straight ladder rank. Depending on what you care about, those are probably the best way to compare apples to oranges. Just understand that these numbers are only meaningful in a very abstract sense.)

Issue Number 2: Good Players Can Decay and then Rob Higher Elo Rated Players of Points

This, I must say, is a horrendous issue that needs to be fixed. There are some players who are extremely skilled, and easily capable of topping the ladder, yet they do not ladder very often. As a result, they are rated much lower than they should be, due to the fact that they decay by going long periods without laddering. When they actually do ladder, any higher rated sucker who is unfortunate enough to get matched up with them risks getting beaten and losing more points than they should.
No matter what rating system you use, matchups will always be slightly imperfect, and the real answer here for every system is "outliers don't matter, the average matchup will be similar enough, just play more games and your rating will go back to normal, it's really not that big of a deal".

Issue Number 3: All High Rated Players Decay to the Same Rating

As far as I know, rating decay begins at 1500 and does not occur at ratings 1499 or below (although I've heard that in some ladders this value is different, such as 1399, but maybe this isn't true). Regardless of the "minimum rating," that elo decays to, I do not think that every player above this rating should eventually decay down to the same rating. Why on earth should a 2100 rated player decay to 1499 after not playing (in this case, for a very long time) for a while, and a player rated 1550 decay to that same rating (in a shorter amount of time, though)?
This seems like a really minor point. One of Elo's nicest features is that it has no "memory". If you have a rating of 1600, you're the same as everyone else who has a 1600. It's not like other rating systems where you might get "stuck" at a lower rating and have a much harder time raising your rating than a new player with that rating.

But no memory means that if your rating decays, you're treated the same as everyone else at that rating for further decay. It still seems minor to me.

Issue Number 4: Pokemon is not only a game of skill, but also a game of luck

This is perhaps the biggest reason why the elo rating system is not suited for pokemon. As we all know, in pokemon, there exists a little battle mechanic that game freak decided to implement into the game called accuracy. This acts hand in hand with the secondary effect chances that some moves have, such as a 10% chance to burn, paralyze, or freeze. We all know the frustration of getting "haxed," when we miss a 90% accurate move 3 times in a row, or when we get smacked with a double or even triple crit (I witnessed a triple crit occur in a battle yesterday, not sure if this is a bug or not, but the chance of a crit occurring is supposed to be 1/16), or when the opponent freezes a pokemon with ice beam. And don't forget about the dreaded para or flinch hax! Due to the existence of these "hax" factors, we can sometimes lose battles on the ladder that we clearly should have won, and simply lost because of luck. People may argue that hax happens to everyone, and that it is just a part of the game, and that while it does result in upset losses for a player it also results in upset wins, which is true, but there is no reason why a 2050 rated player should lose 40 points to a 1700s player simply because he lost due to an extremely unlucky game. In other words, the pokemon showdown elo rating system simply does not offer an accurate representation of a player's or pokemon team's consistency.
Every game has some luck involved; the only difference is the level of abstraction.

In some games, like roulette, a literal RNG is the entire game.

Some games, like Pokémon, have an RNG, but still have strategy involved.

Some games, like competitive fighting games or rock-paper-scissors or StarCraft II, have no RNG, but there is still randomness involved in the prediction of your opponents' moves.

Some games, like chess or go, are pure combinatorial games and don't even involve prediction. But unless they're solved (which they aren't), they still have meta-prediction involved, and that's still a type of luck. Don't believe me? Pit two players (or two identical AIs) against each other, and it won't always end the same way. Sometimes one side will win, sometimes the other.

And no, the different players winning isn't always (or even usually) because one player is getting better faster than the other player. It's because they're trying out different strategies every game, because just like rock-paper-scissors, you don't want to be predictable or you'll get beaten.

The only games with no luck at all are solved games, like tic-tac-toe. Which is why no one plays them competitively. Why would you play a game if you already knew the winner ahead of time?

Anyway, the point is that every rating system, including Elo, is designed for games that have some amount of luck and some amount of skill.

So if Pokemon Showdown Does Not Use Elo to Rank Players, What Ranking System Can Possibly Replace It?

This is an interesting question to answer. In the past, pokemon showdown used ACRE, a ranking system that was horribly inaccurate when it came to estimating ratings that had a high rd (rating deviation, which is a fancy term for standard deviation), and was meant as a way of interpreting the glicko-1 rating.
It's kind of wrong to call any rating system "horribly inaccurate" at doing a thing it was never designed to do. It's like calling "1+1=2" a horribly inaccurate way of calculating 5+3. I mean, sure, it's wrong, but that's because it's not what it was meant to do.

ACRE (Advanced Conservative Rating Estimate) was intended to be a conservative rating estimate. Hence the name. "Conservative" means that it makes a "safe" guess. A lower guess is "safer" because you don't want a bad player to be on top of the ladder. It's fine because you can always lower your RD by playing more games.

Eventually, showdown switched to elo, and made a few changes to the system afterward such as introducing a rating floor of 1000, and some other tweaks. Currently, the pokemon showdown ladder displays a player's elo, glicko-1, GXE, and on suspect ladders, COIL. Of all of these ratings, which one takes the cake as being the most accurate? That answer is, without any doubt whatsoever, GXE. If you don't believe me, take a look at X-Act's original post that introduced the concept of GXE, and how it is calculated, which can be found here: http://www.smogon.com/forums/thread...layers-overall-rating-than-shoddys-cre.51169/ . If you read everything that X-Act wrote, you will see that he calculated the exact true rating of 250 chosen players by examining the glicko and rd of each player, and using an equation devised by Mark Glickman, the inventor of the Glicko-1 and Glicko-2 rating systems, to calculate the probability of every single player beating every single other player. He then matched these true ratings up with his GXE formula, and the results were astoundingly accurate. The GXE formula ordered the players in the exact same order that their true ratings would have ordered them in, a degree of accuracy which is absolutely stunning.
GXE is interesting because it presents information from a Glicko rating in a percentage with an easy-to-understand meaning.

It does not order players in the same order that their "true" ratings would have ordered them. If you wanted an order by true rating, the actual Glicko rating is the best you can do.

GXE is, like ACRE, not an accurate estimate but a conservative estimate: it's going to put some better players who've played fewer games below some worse players who've played more games. This is an intentional feature, because the alternative is to put lucky bad players who've played few games above unlucky good players who've played lots of games, and that's much worse.

So if GXE is More Accurate Than Glicko-1 and Elo, Why Don't We Use it to Rank Players?

The big reason why gxe is not used to rank players is that it is a percentage, rather than a solid whole number. People tend to prefer their rating as a whole number, rather than as a percentage referring to their estimated chance of winning a match against a random opponent. Another issue with GXE is that it is less accurate in determining a player's rating when the player has a high rating deviation, and so rating deviations above 100 result in a rating of "provisional," or 0. It also shares some of the problems with glicko in that after many battles are played, it is difficult to change. However, I have a simple solution to all of these issues.
The big reason why we use Elo rather than GXE to rank players is because Elo is more fun. I mentioned the "memory" thing above, but in general other rating systems are a lot more frustrating than Elo, because they're designed to be accurate instead of fun.

Elo is a system that gives you points when you win, and takes away points when you lose, and makes sure people who don't play don't stay on top of the ladder. It makes the top of a ladder a fun king-of-the-hill-style fight, and it gives you a good feel for approximately how many games you need to win/lose to beat the current leader, instead of having it be numbers calculated by complicated magic formulas. That's what people want. So we give it to them. That's all there is to it.

None of the other problems with GXE actually matter. The fact that it's a percentage is actually a good thing, not a bad thing. The only reason we don't sort ladders by GXE is because it's not fun.

Okay, this explanation is getting pretty huge for one post, so I guess I'm going to put my response to your "improved GXE" idea in another post.
 

Zarel

Not a Yuyuko fan
is a member of the Site Staff, a Battle Server Administrator, a Programmer, a Pokemon Researcher, and an Administrator
Creator of PS
#11
My Rating System

In reality, the goal of a rating system is to get as close to an estimate of a player's true rating as possible. X-Act's GXE formula does an extremely good job of doing this, so I really think that we need to consider the work that he left behind, since he is not available on showdown and hasn't been online since 2012. I am honestly baffled at why the smogon staff chose to use elo as the primary ranking system, when GXE is far superior, and can have its issues fixed with a few tweaks.
I answered this already, but I'll say it again because it's important. I chose Elo because it's more fun, because it's important for a ladder to be a fight for the top, not an opaque list generated by a magic algorithm.

Glicko and GXE are about estimating talent, so the amount of points gained/lost is very inconsistent/weird, and if you've played a lot, your points barely even change. Elo gives you comparatively plenty of points when you win, and in general you have a good idea of how many times you need to win to reach a particular score.

This is good because it gives you a feeling of "climbing a ladder". You get rewarded when you win, you get punished when you lose, and overall it makes the ladder a game. It makes the ladder something fun.

The original GXE formula looked like this:

Given a player rating R and a rating deviation RD:
GLIXARE Rating = 0, if RD > 100
GLIXARE Rating = round(10000 / (1 + 10^(((1500 - R) * pi / sqrt(3 * ln(10)^2 * RD^2 + 2500 * (64 * pi^2 + 147 * ln(10)^2)))))) / 100, otherwise
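As a rough sketch, the formula above can be transcribed into Python like this. The function names `gxe_raw` and `glixare` are made up for illustration and are not taken from Showdown's code; the RD > 100 cutoff is applied exactly as quoted.

```python
import math

LN10_SQ = math.log(10) ** 2

def gxe_raw(r: float, rd: float) -> float:
    """Quoted GLIXARE expression as a percentage, without the RD > 100 cutoff."""
    denom = math.sqrt(3 * LN10_SQ * rd ** 2
                      + 2500 * (64 * math.pi ** 2 + 147 * LN10_SQ))
    exponent = (1500 - r) * math.pi / denom
    return round(10000 / (1 + 10 ** exponent)) / 100

def glixare(r: float, rd: float) -> float:
    """Quoted rule: ratings with RD above 100 are reported as 0 (provisional)."""
    return 0.0 if rd > 100 else gxe_raw(r, rd)

print(gxe_raw(2051, 0))    # rating 2051 with no uncertainty
print(gxe_raw(2051, 350))  # same rating at the maximum RD, cutoff ignored
print(glixare(2051, 350))  # same inputs with the cutoff applied
```

The first two calls give 89.3 and 84.6, the same figures as the RD 0 versus RD 350 comparison quoted further down; the third gives 0 because of the cutoff.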


I have tested this formula with the glicko and rds of current top ladder players, and I did not get the same resulting GXEs as what the ladder displays, so I am almost certain that this formula has been modified in some way. Regardless, the above formula is pretty darn accurate, and if it has been modified in some way for the pokemon showdown ladder, which I am sure it has, then I don't doubt that the new formula is just as accurate if not more accurate.
It hasn't been modified at all. I'd give it a new name (and make a thread for it) if it had.

I must say that despite the claim that GXE is not as accurate when glicko-RD is high, the pokemon showdown ladder has a maximum RD of 350 for glicko-1 ratings, and honestly, if you compare a gxe that is based on a glicko rating with an RD of 0 with a glicko rating with an RD of 350, the resulting GXEs are not tremendously different.

Example:
2051 glicko rating rd of 0 >>> GXE: 89.3
2051 glicko rating rd of 350 >>> GXE: 84.6

Yes there is a 4.7 point difference here, but it really isn't that big when you consider how far apart a deviation of 0 (which by the way, is virtually impossible) is from a deviation of 350 (the deviation of a player who has not played any games). These results were calculated using the original GXE formula in microsoft excel.
Yes, deviation has a moderate (not tiny, not huge) effect on GXE. I would consider it commensurate with what it means – i.e. there's no weirdness going on that goes against my intuitions of how GXE should work.

The bigger problem is, I'm not sure what point you're trying to make here.

Like what do you even mean by "accurate" here? For your numbers to be right or wrong, they have to mean something, and I don't think you have a precise idea of what they even mean. GXE does 100% exactly what it says it does, completely as accurately as possible. So when you say "not as accurate"... not as accurate for what?

I feel that it would be very nice if a player's GXE were to be converted into a whole number, preferably a 4 digit number that falls between 1000 and 2100-2300 or so, similarly to the elo rating on pokemon showdown's ladder. X-Act provided a simple way of doing this, proposing that the GXE simply be multiplied by 20, therefore resulting in a whole number rating with a floor of 0 (for a person with a 0 gxe, which is virtually impossible), and a maximum of 2000 (for a person with a 100 gxe, which is also virtually impossible). The problem with a rating system like this is that as a player gets into the 1900s, winning battles results in extremely small point gains, and reaching 2000 or even reaching 1960 is nigh impossible (1960 would be the equivalent of a 98 gxe, which is extremely hard to obtain, unless you are that sweetlol2 guy on the ubers ladder who apparently has a 98.3 gxe).
...Why?

GXE's nicest feature is that it is a percentage with a meaning. Why would you make it a whole number?

And so I wondered, "what if I solve the gxe formula for glicko in terms of gxe, using an rd of 0?" And so I did. Here is how you would derive glicko (without an RD) from the original GXE formula:

Glicko = -((LN(100/GXE-1)/LN(10))-2.50901023943244)/0.00167267349295496
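For anyone checking the algebra, here is a minimal Python sketch of the same inversion. It assumes RD = 0, as the quoted derivation does, and the names `K` and `glicko_from_gxe` are hypothetical. The two magic constants in the spreadsheet expression are just K (roughly 0.0016727) and 1500 * K.

```python
import math

# Slope of the quoted GXE formula at RD = 0: pi divided by the square root
# of the constant term, roughly 0.0016727.
K = math.pi / math.sqrt(2500 * (64 * math.pi ** 2 + 147 * math.log(10) ** 2))

def glicko_from_gxe(gxe_percent: float) -> float:
    """Invert GXE = 100 / (1 + 10^((1500 - R) * K)) for R, assuming RD = 0."""
    if not 0 < gxe_percent < 100:
        raise ValueError("GXE must be strictly between 0 and 100")
    return 1500 - math.log10(100 / gxe_percent - 1) / K

print(glicko_from_gxe(50.0))  # the midpoint maps back to 1500
print(glicko_from_gxe(89.3))  # close to 2051, up to the rounding of GXE
```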


I then considered how the elo rating system relates to the glicko rating system, and I found this: http://www.glicko.net/ratings/report08.txt , which contains a formula for doing a rough conversion of the FIDE chess rating (a chess rating system which is virtually equivalent to elo) to the USCF (United States Chess Federation) chess rating (which is virtually equivalent to glicko). From this, it is possible to derive a rough conversion of the glicko rating system to the elo rating system, with a centered elo rating of 1250 (which I am sure is pretty close to the current mean elo rating on pokemon showdown's ladders, and yes this would mean that the starting rating would NOT be 1000).

The equation looks like this:

Elo = ROUND(((Glicko-720)/0.624),0) if Glicko < ~1969.2
Elo = ROUND(((Glicko+350)/1.1585),0) if Glicko >= ~1969.2


From all of this, we can derive an equation for essentially converting GXE into Elo. However, the rating obtained from this equation is not equivalent to elo, but rather an estimate of what a player's TRUE Elo is, a value that is FAR more accurate in depicting a player's skill level than the crappy elo system that pokemon showdown uses. I'm not sure how the community will view this, but if I were to come up with a name for this rating I suppose the most basic name would be TRUE ELO, or TELO. This equation is listed here:

TELO = ROUND((((-((LN(100/GXE-1)/LN(10))-2.50901023943244)/0.00167267349295496)-720)/0.624),0) if GXE < 85.9
TELO = ROUND((((-((LN(100/GXE-1)/LN(10))-2.50901023943244)/0.00167267349295496)+350)/1.1585),0) if GXE >= 85.9
wat

I don't even know where to start here. Except... why? This is... basically another complicated attempt at ACRE. But with some flaws (as other people mentioned, the decay is much slower than either ACRE or Elo, and is too slow to actually stop people from parking lots of accounts at the top of the ladder). But even ignoring that, we dropped ACRE for good reason, because it turned out Elo was more fun.
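For readers who want to see what numbers the quoted TELO conversion actually produces, here is a minimal Python sketch of it. The function name `telo` is hypothetical, the 85.9 breakpoint and the 0.624 / 1.1585 coefficients are copied from the quoted post, and Python's `round` stands in for the spreadsheet ROUND (they differ only on exact .5 ties).

```python
import math

# Same RD = 0 slope that the quoted constants 0.00167267... and 2.50901...
# come from (the second is just 1500 * K).
K = math.pi / math.sqrt(2500 * (64 * math.pi ** 2 + 147 * math.log(10) ** 2))

def telo(gxe_percent: float) -> int:
    """Quoted TELO: GXE -> Glicko (RD = 0) -> piecewise linear rescaling."""
    glicko = 1500 - math.log10(100 / gxe_percent - 1) / K
    if gxe_percent < 85.9:                      # below the quoted breakpoint
        return round((glicko - 720) / 0.624)
    return round((glicko + 350) / 1.1585)       # at or above the breakpoint

for gxe in (50.0, 75.0, 85.9, 89.3, 98.0):
    print(gxe, telo(gxe))
```

A GXE of 50 maps to 1250, which is the "centered" rating the proposal mentions. Note that the result depends only on GXE, so it carries no more information than GXE itself.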

The Advantages of This New Rating System
  • It has built in decay, due to the fact that GXE decays as RD gets bigger
In case it was unclear,

GXE has a tiny amount of decay to account for the effect of lower confidence pulling win/loss estimates towards 50-50, but nowhere near enough to constitute actual ladder decay for the purposes of keeping a ladder fresh and preventing people from parking alts.
 
#12
Zarel, I really appreciate your lengthy response. I just wanted to bring this idea to the attention of the viewers of this thread. Most of this comes from my pms with you so I'm sorry if it is redundant but I am bringing it up for other people on smogon to see (some of it is different from what I said in our pms though).

I am aware of why the elo system is used to rank players, since GXE is pretty hard to pump up in comparison to it and there is also the issue of players parking themselves. In other words, elo is more "fun," just like you mentioned. However, both you and Antar seem to admit that GXE is much better at capturing a player's skill than elo is. However, even GXE is not 100% accurate in capturing a player's percentage chance of winning a battle, since it is only an estimate of it based on their glicko-1, although it is a very good estimate. However, looking through X-Act's ancient post it dawned on me that there IS actually a way to find the REAL percentage chance of every player on the ladder beating every other player. X-Act listed 2 percentage columns in that chart that he posted, one column was the real percentage of each player winning and the other column was the estimated percentage that GXE yielded. This is EXACTLY what I meant when I was talking about the "accuracy of GXE" - the ability of it to yield the correct chance that a player has of winning a battle based on Glicko-1. The formula that showdown uses to compute GXE is mathematically equivalent to this:

GXE = 1 / (1 + 10^(((1500 - Rating) / (400 * sqrt(1 + (3 * ln(10)^2 / (400 * pi)^2) * (130^2 + RD^2))))))

I tested this formula out with the glicko-1 ratings of players on showdown and got the exact resulting GXEs that showdown displayed. What I take from this formula is that GXE is computing the probability of each player beating a player who is rated 1500 with an RD of 130 (someone who has just started on the ladder). This provides a very good estimate of the real winning chance but it is not exact, and this is especially true when the RD is high. However, in order to technically calculate a player's real chance of winning a battle, we would have to use the REAL mean glicko rating and REAL mean RD of every player on the ladder, rather than simply 1500 and 130. But I mean, isn't showdown capable of calculating this? I took a ladder sample from a very small inactive ladder that had fewer than 500 players, and found that the mean glicko rating was a bit under 1500 (around 1496), and the mean RD was obviously under 130 (it was about 122). It is clear that every ladder will have a different average glicko rating (but it will be close to 1500) and a different average Rating Deviation (although it will be close to 130). What I'm suggesting is for showdown to constantly compute the average glicko rating and RD of all players on the ladder, and use those 2 numbers in the GXE calculation in place of 1500 and 130. If this were done then it wouldn't be necessary to not display GXE when the RD is over 100, because it would be a 100% accurate estimate (as in, it would be their actual chance of winning a battle rather than an estimate of it). Now, since elo is more "fun" to use, I think it would simply be nice to use this number as a revamped gxe, and make it an option to sort the ladder by this number.
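To make the role of the 1500 and the 130 concrete, here is a minimal Python sketch of the formula above with the opponent's rating and RD pulled out as parameters. The function and parameter names are hypothetical, and the defaults are simply the two constants from the formula.

```python
import math

# Scaling constant from the formula: 3 * ln(10)^2 / (400 * pi)^2
Q = 3 * math.log(10) ** 2 / (400 * math.pi) ** 2

def gxe(rating, rd, opp_rating=1500.0, opp_rd=130.0):
    """Estimated win chance (0..1) against an opponent at opp_rating with opp_rd."""
    scale = 400 * math.sqrt(1 + Q * (opp_rd ** 2 + rd ** 2))
    return 1 / (1 + 10 ** ((opp_rating - rating) / scale))

print(round(100 * gxe(2051, 40), 1))             # formula as written above
print(round(100 * gxe(2051, 0, opp_rd=350), 1))  # opponent RD of 350 instead
```

With `opp_rd=350` the expression is algebraically the older GLIXARE formula quoted earlier in the thread, since (400 * pi)^2 = 2500 * 64 * pi^2 and 3 * 350^2 = 2500 * 147. The proposal in this post amounts to replacing the two defaults with ladder-wide means.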

When I mentioned this to you in pms, you replied saying that 1500 will always be the mean glicko rating, and that if 1500 were not the mean rating then it simply wouldn't be glicko. You also mentioned that my computation of the mean glicko rating for that small sample ladder that I used, which came out to be around 1496, was simply incorrect because the ladder I used displayed rounded numbers. However, even if all of the glicko rating numbers were rounded down by 1, and the actual mean glicko rating of the ladder was 1500, the lowest mean that I could have possibly calculated would be 1499. In other words, it is mathematically impossible for the calculated mean (from rounded numbers) to be 1496 if the actual mean is 1500, unless the displayed numbers were rounded off to a greater degree than 1 (but this is not the case since all glicko ratings are rounded to the nearest one's place). Correct me if I'm wrong here. But I think that there's something that you're missing. The glicko-1/2 system follows a logistic distribution, not a normal distribution (unless the pokemon showdown version follows a normally distributed one). Because of this, everyone's starting rating will be 1500 with a 130 RD (although the original glicko uses a 350 RD), and the ladder will be centered around 1500. While 1500 is the baseline for all ratings to be compared to, the average rating of all players on the ladder will NOT be equal to 1500, but it will be very close to this number. At least, that's what I found out by reading about the glicko system on the internet. Logistic distributions are supposed to have the same median and mean but apparently, for some reason, the glicko system's mean will not be equal to exactly 1500. It should be different for different ladders, but it will always be very close to 1500 (likely always within the 1496-1504 range). 
Honestly, the only way to find out the correct answer is to compute the mean glicko rating of all players on pokemon showdown's ladders and see what the means come out to. Also, the average Rating Deviation will obviously not be 130, since not every player is horribly inactive.

I really want to know:

1.) Is showdown capable of constantly calculating the mean Glicko-RD of all players on the ladder and using it in place of the 1500 and 130 in the GXE calculation?
2.) Am I correct that the mean glicko will slightly differ from 1500 (I need evidence for this from ps ladders)?
3.) Am I correct that using the mean glicko-RD will produce the true percentage chance of a player winning a battle? (According to what I have tested, I'm pretty sure I'm correct).
4.) Regardless of whether or not this can be used as a revamped GXE, can ps staff at least work on making the ladders sortable by GXE (or by this revamped GXE that I am suggesting)? Elo is more fun to use than GXE (I guess) so it might be a bad idea to scrap it, but I would at least like a "sort by GXE/glicko" option.
 

Zarel

Not a Yuyuko fan
Creator of PS
#13
I am aware of why the elo system is used to rank players, since GXE is pretty hard to pump up in comparison to it and there is also the issue of players parking themselves. In other words, elo is more "fun," just like you mentioned. However, both you and Antar seem to admit that GXE is much better at capturing a player's skill than elo is. However, even GXE is not 100% accurate in capturing a player's percentage chance of winning a battle, since it is only an estimate of it based on their glicko-1, although it is a very good estimate. However, looking through X-Act's ancient post it dawned on me that there IS actually a way to find the REAL percentage chance of every player on the ladder beating every other player. X-Act listed 2 percentage columns in that chart that he posted, one column was the real percentage of each player winning and the other column was the estimated percentage that GXE yielded. This is EXACTLY what I meant when I was talking about the "accuracy of GXE" - the ability of it to yield the correct chance that a player has of winning a battle based on Glicko-1. The formula that showdown uses to compute GXE is mathematically equivalent to this:

GXE = 1 / (1 + 10^(((1500 - Rating) / (400 * sqrt(1 + (3 * ln(10)^2 / (400 * pi)^2) * (130^2 + RD^2))))))

I tested this formula out with the glicko-1 ratings of players on showdown and got the exact resulting GXEs that showdown displayed. What I take from this formula is that GXE is computing the probability of each player beating a player who is rated 1500 with an RD of 130 (someone who has just started on the ladder). This provides a very good estimate of the real winning chance but it is not exact, and this is especially true when the RD is high. However, in order to technically calculate a player's real chance of winning a battle, we would have to use the REAL mean glicko rating and REAL mean RD of every player on the ladder, rather than simply 1500 and 130. But I mean, isn't showdown capable of calculating this? I took a ladder sample from a very small inactive ladder that had fewer than 500 players, and found that the mean glicko rating was a bit under 1500 (around 1496), and the mean RD was obviously under 130 (it was about 122). It is clear that every ladder will have a different average glicko rating (but it will be close to 1500) and a different average Rating Deviation (although it will be close to 130). What I'm suggesting is for showdown to constantly compute the average glicko rating and RD of all players on the ladder, and use those 2 numbers in the GXE calculation in place of 1500 and 130. If this were done then it wouldn't be necessary to not display GXE when the RD is over 100, because it would be a 100% accurate estimate (as in, it would be their actual chance of winning a battle rather than an estimate of it). Now, since elo is more "fun" to use, I think it would simply be nice to use this number as a revamped gxe, and make it an option to sort the ladder by this number.
They're both estimates.

The matchup probability formula itself is an estimate.

When we say "60% chance of rain tomorrow", it doesn't mean we have a probability handed down by God. It's an estimate.

Your attempt to make a revamped GXE produces a different number, but your new number is still an estimate. And I have no reason to believe it's a better estimate (and quite a few reasons to believe it's worse).

(And that's ignoring all the stuff about making it harder to calculate.)

If this were done then it wouldn't be necessary to not display GXE when the RD is over 100, because it would be a 100% accurate estimate (as in, it would be their actual chance of winning a battle rather than an estimate of it).
This line specifically betrays an utter misunderstanding of what probability actually is.

Okay, let me put it this way.

Say you are a good player. You can win approximately 90% of battles against random players.

You start a new alt. Your GXE is 50%. Your new better-accurate-you-swear GXE calculation method is... approximately 50%.

In no way is 50% a "100% accurate estimate".

And for that matter, your new better GXE calculation method is... only approximately 50%. It's kind of hard to explain exactly what's wrong with your method, but this is a pretty easy-to-understand way to prove yours is worse. Because if you are matching up two players at random which you know nothing about, your estimate should be exactly 50%. If it's not, it means your estimate is worse, not better.
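The new-alt argument can be checked with a few lines of Python. This is only a sketch: `win_chance` is a hypothetical name, the formula is the one quoted above with the opponent's rating and RD made into parameters, and 1496 / 122 are the sample means from the quoted post.

```python
import math

# Scaling constant from the quoted formula: 3 * ln(10)^2 / (400 * pi)^2
Q = 3 * math.log(10) ** 2 / (400 * math.pi) ** 2

def win_chance(rating, rd, opp_rating, opp_rd):
    """Quoted GXE expression with the opponent's rating and RD as inputs."""
    scale = 400 * math.sqrt(1 + Q * (opp_rd ** 2 + rd ** 2))
    return 1 / (1 + 10 ** ((opp_rating - rating) / scale))

fresh = (1500.0, 130.0)                       # a brand-new account
baseline = win_chance(*fresh, 1500.0, 130.0)  # current GXE reference opponent
revamped = win_chance(*fresh, 1496.0, 122.0)  # ladder means from the sample
print(baseline, revamped)
```

The baseline comes out at exactly 0.5, while the revamped version lands slightly above it, which is the "only approximately 50%" problem described above.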

When I mentioned this to you in pms, you replied saying that 1500 will always be the mean glicko rating, and that if 1500 were not the mean rating then it simply wouldn't be glicko. You also mentioned that my computation of the mean glicko rating for that small sample ladder that I used, which came out to be around 1496, was simply incorrect because the ladder I used displayed rounded numbers. However, even if all of the glicko rating numbers were rounded down by 1, and the actual mean glicko rating of the ladder was 1500, the lowest mean that I could have possibly calculated would be 1499. In other words, it is mathematically impossible for the calculated mean (from rounded numbers) to be 1496 if the actual mean is 1500, unless the displayed numbers were rounded off to a greater degree than 1 (but this is not the case since all glicko ratings are rounded to the nearest one's place). Correct me if I'm wrong here.
You're right. I was wrong about rounding being the reason it's off.

There are more complicated ways rounding can make it lower than 1499 (I went over them with you in PM; basically, floating point numbers need to be rounded, so you get some slight rounding every time you win a battle and recalculate a rating), but being off by 4 is astronomically unlikely and I should definitely have thought about other reasons why 1496 is wrong.
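The simple display-rounding bound is easy to confirm with a toy example. The ratings below are hypothetical values picked to be the worst case for rounding down; this is only a sketch of the arithmetic and does not model the accumulated floating-point rounding mentioned above.

```python
import math

def mean(values):
    return sum(values) / len(values)

# Hypothetical unrounded ratings chosen so every fractional part is large,
# which is the worst case for rounding down.
true_ratings = [1200.999, 1400.999, 1500.999, 1600.999, 1800.999]

floored = [math.floor(r) for r in true_ratings]   # always rounded down
nearest = [round(r) for r in true_ratings]        # rounded to nearest integer

gap_floor = mean(true_ratings) - mean(floored)
gap_nearest = abs(mean(true_ratings) - mean(nearest))
print(gap_floor, gap_nearest)
```

Rounding down can move the mean by less than 1, and rounding to the nearest integer by at most 0.5, so display rounding alone cannot turn a mean of 1500 into 1496.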

But I think the real answer here is: It doesn't matter what's causing error to accumulate. What matters is that you should correct for accumulated error whenever possible.

Or, to use an analogy, it doesn't matter why dividing circumference by diameter gives you a number slightly different from pi. You'll still get more accurate calculations if you use the real pi in your future calculations.

But I think that there's something that you're missing. The glicko-1/2 system follows a logistic distribution, not a normal distribution (unless the pokemon showdown version follows a normally distributed one). Because of this, everyone's starting rating will be 1500 with a 130 RD (although the original glicko uses a 350 RD), and the ladder will be centered around 1500. While 1500 is the baseline for all ratings to be compared to, the average rating of all players on the ladder will NOT be equal to 1500, but it will be very close to this number. At least, that's what I found out by reading about the glicko system on the internet. Logistic distributions are supposed to have the same median and mean but apparently, for some reason, the glicko system's mean will not be equal to exactly 1500. It should be different for different ladders, but it will always be very close to 1500 (likely always within the 1496-1504 range). Honestly, the only way to find out the correct answer is to compute the mean glicko rating of all players on pokemon showdown's ladders and see what the means come out to. Also, the average Rating Deviation will obviously not be 130, since not every player is horribly inactive.
I was wrong to say it follows a normal distribution; calling it a logistic distribution is more accurate. The normal distribution is just used as an approximation for the logistic distribution. But the difference doesn't actually matter here.

but apparently, for some reason, the glicko system's mean will not be equal to exactly 1500
This is the line that worries me most here: You've discovered a fact that doesn't line up with what you thought you knew about the world... but instead of figuring out why, you just... make an assumption and build an entire new rating system around it. And worse than that, you tell everyone it's definitely better than the original GXE without even mentioning that it's built around this assumption you don't understand.

I really want to know:

1.) Is showdown capable of constantly calculating the mean Glicko-RD of all players on the ladder and using it in place of the 1500 and 130 in the GXE calculation?
2.) Am I correct that the mean glicko will slightly differ from 1500 (I need evidence for this from ps ladders)?
3.) Am I correct that using the mean glicko-RD will produce the true percentage chance of a player winning a battle? (According to what I have tested, I'm pretty sure I'm correct).
4.) Regardless of whether or not this can be used as a revamped GXE, can ps staff at least work on making the ladders sortable by GXE (or by this revamped GXE that I am suggesting)? Elo is more fun to use than GXE (I guess) so it might be a bad idea to scrap it, but I would at least like a "sort by GXE/glicko" option.
1) In theory anything is possible, but this is very difficult and also kind of pointless

2) You are probably not wrong that it can slightly differ; I would assume this is because of various approximations Glicko makes to make itself easier to calculate; our disagreement is on what we should do about this difference

3) To the extent that "true percentage chance" exists as a concept, it is impossible to calculate; all we have are formulas that can estimate it, and the important question is whether or not one estimate is better than another (yours is not)

In particular, GXE is intended to be the win chance against a random unknown opponent (an opponent with R=1500, RD=130), not the win chance against an average opponent (an opponent with R=1500, RD=0). In neither case, though, does it make sense to just grab a new RD by taking the mean of all the RDs of all the players on the ladder; this is a case where I think you don't understand what RD means.

4) This is an entirely separate question to which the answer is "maybe"; this will have to go in an entirely separate thread probably in Policy Review
 
#14
Thank you for your response, Zarel. After all of the things that you and I have already said, I don't really have much else to say. In the future, though (in a couple of months), my friend and I are going to make our own server and code my rating system into our ladders to test it out (it'll be different than the original TELO one that I came up with, though). I'm clueless when it comes to coding but my friend is VERY experienced with it so we should be able to pull it off if that's okay with you. Thanks for putting as much effort into your posts in this thread as I did lol. You seem like a cool guy and you have my respect.
 

Zarel

Not a Yuyuko fan
Creator of PS
#15
Reffrey I do want to say that I think it's really cool that you've been thinking about stuff like this. I called you out kind of harshly because there are a few things you're wrong about that you seemed overconfident about, but overall there's a lot you do understand, and even understood well enough to identify two mistakes I made. I think you should be proud of that, especially considering the number of people who are afraid of thinking about numbers at all.

I think the main thing you could stand to improve upon is having a better feel for what things you really understand, and what things you're guessing about. Other than that, you have a strong foundation and could go far with it.

So a few things I noticed that you might be understanding incorrectly:

- Rating deviation is a representation of how sure Glicko is about your rating. So, when your rating is 1500±130 (i.e. R=1500, RD=130, don't interpret the ± as actually meaning "plus or minus"), that means Glicko thinks there's a 68% chance that your rating is within a distance of 130 of 1500 (i.e. that your rating is 1370 to 1630), and a 95% chance your rating is within a distance of 260 (i.e. that your rating is 1240 to 1760). In other words, if you draw a normal distribution where μ=R, σ=RD, that'll be the probability distribution of what Glicko thinks your rating is.

The starting probability distribution is 1500±130 because that's the best approximation of the intended actual distribution: you're going to say that the average player is at rating 1500, the 84th-percentile player is at rating 1630, etc. There's some complication involved because the intended distribution is logistic and not normal, but it's not too important to this high-level overview. And since we start out knowing absolutely nothing about a player, the starting probability distribution matches the intended actual distribution of the players.

Anyway, because of that, 1500±130 is the probability distribution of the rating of a random player on the ladder, which is why GXE calculates a matchup with that, to represent a matchup with a random ladder player.
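As a rough illustration of "a matchup with a random ladder player", here is a Python sketch. It assumes GXE is Glickman's published expected-outcome formula for two Glicko ratings, with the opponent fixed at 1500±130 and the result scaled to a percentage. That assumption is mine and is not copied from the Showdown source, and the names `g` and `gxe` are illustrative.

```python
import math

Q = math.log(10) / 400  # Glicko's scaling constant q

def g(rd):
    """Glicko's attenuation factor: shrinks toward 0 as uncertainty grows."""
    return 1 / math.sqrt(1 + 3 * Q**2 * rd**2 / math.pi**2)

def gxe(r, rd, pop_r=1500.0, pop_rd=130.0):
    """Expected win percentage against a random ladder player.

    Assumption: the random opponent is modelled as pop_r ± pop_rd, and the
    two deviations are combined as sqrt(rd^2 + pop_rd^2).
    """
    combined_rd = math.hypot(rd, pop_rd)
    expected = 1 / (1 + 10 ** (-g(combined_rd) * (r - pop_r) / 400))
    return 100 * expected

print(round(gxe(1500, 130), 1))  # an exactly average player: 50.0
```

Under this model, a higher RD pulls the result toward 50, since the system is less sure the player is really above (or below) average.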

If you wanted to synthetically construct a "better" probability distribution for a random player, you wouldn't want the RDs of individual players. Those represent uncertainty, but uncertainty is actually not very important when you're trying to consolidate a rating distribution out of a cloud of individual rating probability distributions. Instead, you'd want to take the standard deviation of the dataset of every rating on the ladder.

(This will give you a probability distribution that's approximately 1500±130. To the extent that it isn't, it's because Glicko uses various approximations to make the math easier. You'll still get generally more accurate numbers if you just pretend the entire thing is 1500±130 – this technique prevents the existing error caused by rounding from having an even greater effect than it currently does.)
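A minimal Python sketch of that approach: summarise the ladder by the mean and standard deviation of the ratings themselves, ignoring each player's RD. The ratings list is made-up sample data, and `ladder_distribution` is a hypothetical helper name.

```python
from statistics import fmean, pstdev

def ladder_distribution(ratings):
    """Return (mean, standard deviation) of the ratings on a ladder.

    Individual RDs are deliberately not used: they describe uncertainty
    about one player, not the spread of skill across players.
    """
    return fmean(ratings), pstdev(ratings)

# Made-up sample ratings, for illustration only.
sample_ratings = [1240, 1370, 1435, 1500, 1500, 1565, 1630, 1760]
centre, spread = ladder_distribution(sample_ratings)
print(f"{centre:.0f}±{spread:.0f}")
```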

- Glicko and Glicko-2 (and even the original Elo) model ratings as points on a logistic distribution representing the distribution of all skill levels. So 1500 is average, 1630 is one standard deviation above average, etc. It's possible for the actual individual Glicko estimates of player ratings to be on a different distribution (they will always be, because discrete distributions (lists of individual ratings) will always be slightly different from continuous distributions like logistic distributions), but the intended logistic distribution will still be the best way to model the playerbase as a whole.
 
