GXE (GLIXARE): A Much Better Way Of Estimating A Player's Overall Rating Than Shoddy's CRE

X-Act · Feb 10, 2009

As you probably noticed from some of my posts, I really don't like ShoddyBattle's 'Conservative Rating Estimate' (CRE) method of providing the overall rating of a player. For the record, the formula used to calculate the CRE of a player is:

Code:

CRE = Rating - 4 * Deviation

The disadvantages of CRE are the following:

Rating changes are too slow. You'll need to beat quite a lot of players in order to see your rating change acceptably. This makes players use more alts.
The higher the rating deviation of the player, the more the player's true skill is underestimated.
It provides horribly incorrect ratings for people whose rating deviation is very high. For an example, just visit this page.
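
To make the second and third points concrete, here is a minimal Python sketch of the CRE formula above. The function name `cre` and the three sample players are made up for illustration; only the formula itself comes from this post.

```python
def cre(rating: float, deviation: float) -> float:
    """Conservative Rating Estimate: Rating - 4 * Deviation."""
    return rating - 4 * deviation

# Hypothetical players (illustrative values, not taken from the ladder).
established = cre(1800, 50)   # lower rating, low deviation
uncertain = cre(1900, 100)    # higher rating, but higher deviation
newcomer = cre(1500, 350)     # rating and deviation of a brand-new player

print(established, uncertain, newcomer)
```

With these made-up numbers the 1900-rated player lands below the 1800-rated one purely because of the deviation term, which is the underestimation described in the list.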

To be fair, there are also some advantages. Let's list them:

It is simple to calculate.

Umm, actually I cannot think of more. :(

Because of this, I set out to try to find a better way of finding a player's overall rating given his Rating R and Deviation RD... and I managed to do this yesterday.

I read the paper by Glickman (the inventor of the Glicko and Glicko-2 rating systems), in which he provides an equation that essentially calculates the probability that a player with rating R_1 and deviation RD_1 beats another player with rating R_2 and deviation RD_2. It is written below:

Code:

Probability = 1 / (1 + 10^(((R_2 - R_1) / (400 * sqrt(1 + C * (RD_1^2 + RD_2^2))))))
 
where C = 3 * ln(10)^2 / (400 * pi)^2 (approximately 0.0000100724)
      pi = 3.14159265359
      sqrt(x) is the square root of x
      ln(x) is the natural logarithm of x
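
For anyone who prefers code to notation, the equation above could be written in Python roughly as follows. The names `C` and `win_probability` are my own choices for this sketch, and the sample call uses made-up ratings.

```python
import math

# Constant from the formula above: C = 3 * ln(10)^2 / (400 * pi)^2
C = 3 * math.log(10) ** 2 / (400 * math.pi) ** 2

def win_probability(r1: float, rd1: float, r2: float, rd2: float) -> float:
    """Probability that player 1 (r1, rd1) beats player 2 (r2, rd2)."""
    scale = 400 * math.sqrt(1 + C * (rd1 ** 2 + rd2 ** 2))
    return 1 / (1 + 10 ** ((r2 - r1) / scale))

print(f"C = {C:.10f}")
# Hypothetical matchup: a 1700/60 player against a brand-new 1500/350 player.
print(win_probability(1700, 60, 1500, 350))
```

Note that the two deviations only enter through their combined size, so the probabilities for the two sides of a matchup always add up to 1.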

I then simulated 250 players, each having their own rating and deviation, and found the probability of every player beating every other player using the above formula, and averaged out the probabilities for every player. This provides the true rating for every player.
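
The averaging step described above can be sketched like this. It is not the original simulation: the post does not say how the 250 ratings and deviations were generated, so the normal and uniform distributions below, the seed, and the function names are all assumptions made for illustration.

```python
import math
import random

C = 3 * math.log(10) ** 2 / (400 * math.pi) ** 2

def win_probability(r1, rd1, r2, rd2):
    """Probability that player 1 beats player 2 (formula from the post)."""
    scale = 400 * math.sqrt(1 + C * (rd1 ** 2 + rd2 ** 2))
    return 1 / (1 + 10 ** ((r2 - r1) / scale))

def true_ratings(players):
    """Average win probability of each player against every other player."""
    results = []
    for i, (r1, rd1) in enumerate(players):
        probs = [win_probability(r1, rd1, r2, rd2)
                 for j, (r2, rd2) in enumerate(players) if j != i]
        results.append(sum(probs) / len(probs))
    return results

# ASSUMPTION: placeholder distributions, not the ones used in the post.
rng = random.Random(0)
players = [(rng.gauss(1500, 250), rng.uniform(50, 100)) for _ in range(250)]

ratings = true_ratings(players)
best = max(range(len(players)), key=lambda i: ratings[i])
print(players[best], f"{100 * ratings[best]:.2f}%")
```

A handy sanity check is that the true ratings of all players average out to exactly 50%, since every pairing contributes one win probability and its complement.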

However, this is a strenuous effort to do, and hence I wanted to approximate this probability for every player using just his R and RD (not everyone else's as well). After considering various possibilities, it dawned on me that the probability of the player beating a 1500 rating, 350 deviation player (the rating and deviation of a player that has just joined the ladder) would provide a good approximation. When testing it out, it did provide a good approximation of the true rating... a very, very good approximation actually!

The only time it didn't provide a good approximation was when the deviation of the player was high. This confirmed yet again that players whose rating deviation is too high (meaning that their rating is too uncertain) shouldn't even be listed on the leaderboard, and that is what I propose the estimated rating should enforce.

After consulting a bit with the community, it was decided that this system's rating should represent the estimated percentage that that player has of winning a battle against a random opponent.

So, finally, here is what I propose to be a much better estimate of the player's rating. I'm calling it GLIXARE, short for 'Glicko - X-Act Rating Estimate':

Code:

Given a player rating R and a rating deviation RD:
GLIXARE Rating = 0, if RD > 100
GLIXARE Rating = round(10000 / (1 + 10^(((1500 - R) * pi / sqrt(3 * ln(10)^2 * RD^2 + 2500 * (64 * pi^2 + 147 * ln(10)^2)))))) / 100, otherwise
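
As a rough Python sketch of the definition above (the function name `glixare` is just a label for this example, not anything in the Shoddy code):

```python
import math

def glixare(rating: float, deviation: float) -> float:
    """GLIXARE rating as a percentage, rounded to two decimal places."""
    if deviation > 100:
        return 0.0  # deviation too high: the player is not rated
    ln10_sq = math.log(10) ** 2
    denominator = math.sqrt(
        3 * ln10_sq * deviation ** 2
        + 2500 * (64 * math.pi ** 2 + 147 * ln10_sq)
    )
    exponent = (1500 - rating) * math.pi / denominator
    return round(10000 / (1 + 10 ** exponent)) / 100

# Rating and deviation of the first player in the table below,
# where the GLIXARE column lists 86.77%.
print(glixare(1991.408347, 52.4913089))
print(glixare(1500, 50))    # a player rated exactly 1500
print(glixare(1700, 150))   # deviation above 100, so the result is 0
```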

The table below shows the top 50 of the 250 players I've tested, ranked according to the true percentage they had of winning against the other players. Next to them is the rank they would have obtained if CRE was used, and the rank they would have obtained if GLIXARE was used. Notice the stunning accuracy of the GLIXARE ranking compared to the true ranking.

Code:

Rank  Rating       Deviation    True Rating  CRE          Rank For CRE  GLIXARE  Rank For GLIXARE
  1   1991.408347  52.4913089   86.25%       1781.443112     1 (=)       86.77%       1 (=)
  2   1992.528854  68.34131414  86.21%       1719.163597    13 (+11)     86.73%       2 (=)
  3   1989.461831  80.60840798  85.95%       1667.028199    18 (+15)     86.51%       3 (=)
  4   1969.23615   50.08961382  85.08%       1768.877695     2 (-2)      85.78%       4 (=)
  5   1968.509612  50.00035476  85.04%       1768.508193     3 (-2)      85.75%       5 (=)
  6   1972.675135  99.88585936  84.88%       1573.131698    34 (+28)     85.58%       6 (=)
  7   1963.494349  50.60336877  84.76%       1761.080874     5 (-2)      85.51%       7 (=)
  8   1962.47472   50.03981759  84.71%       1762.31545      4 (-4)      85.46%       8 (=)
  9   1953.22279   52.20270923  84.18%       1744.411953     6 (-3)      85.00%       9 (=)
 10   1958.445485  96.63628542  84.13%       1571.900343    36 (+26)     84.94%      10 (=)
 11   1943.00504   50.10101356  83.60%       1742.600986     7 (-4)      84.51%      11 (=)
 12   1941.259317  52.64803719  83.48%       1730.667168    10 (-2)      84.41%      12 (=)
 13   1938.159617  50.92018226  83.31%       1734.478888     8 (-5)      84.26%      13 (=)
 14   1936.183665  51.14032287  83.19%       1731.622373     9 (-5)      84.16%      14 (=)
 15   1930.057618  50.16740178  82.84%       1729.388011    11 (-4)      83.85%      15 (=)
 16   1927.779647  51.0262639   82.70%       1723.674591    12 (-4)      83.73%      16 (=)
 17   1922.220146  57.65787498  82.32%       1691.588646    14 (-3)      83.40%      17 (=)
 18   1918.558883  74.43201955  81.99%       1620.830805    28 (+10)     83.09%      18 (=)
 19   1908.558873  59.43764926  81.47%       1670.808276    15 (-4)      82.65%      19 (=)
 20   1898.317729  81.4379487   80.68%       1572.565934    35 (+15)     81.93%      20 (=)
 21   1876.359618  52.21151457  79.44%       1667.513559    17 (-4)      80.86%      21 (=)
 22   1870.366607  50.00313579  79.06%       1670.354064    16 (-6)      80.51%      22 (=)
 23   1867.964646  51.3491295   78.89%       1662.568128    21 (-2)      80.36%      23 (=)
 24   1867.766236  50.56176086  78.88%       1665.519192    19 (-5)      80.35%      24 (=)
 25   1863.669023  50.11759802  78.61%       1663.198631    20 (-5)      80.10%      25 (=)
 26   1866.541991  95.77888912  78.50%       1483.426435    48 (+22)     79.95%      26 (=)
 27   1859.313589  55.68806985  78.28%       1636.561309    25 (-2)      79.81%      27 (=)
 28   1854.389046  51.55994467  77.97%       1648.149268    22 (-6)      79.52%      28 (=)
 29   1855.299562  72.27265208  77.92%       1566.208954    37 (+8)      79.46%      29 (=)
 30   1853.503073  52.54886057  77.90%       1643.307631    23 (-7)      79.46%      30 (=)
 31   1841.225405  50.60827664  77.06%       1638.792298    24 (-7)      78.70%      31 (=)
 32   1834.513128  50.14590546  76.59%       1633.929506    26 (-6)      78.26%      32 (=)
 33   1834.145367  52.8071478   76.55%       1622.916775    27 (-6)      78.23%      33 (=)
 34   1801.474785  50.04016895  74.21%       1601.314109    29 (-5)      76.04%      34 (=)
 35   1804.267316  94.72400253  74.15%       1425.371306    55 (+20)     75.93%      35 (=)
 36   1795.795989  50.01171801  73.78%       1595.749117    30 (-6)      75.64%      36 (=)
 37   1793.496594  50.25237181  73.61%       1592.487107    31 (-6)      75.47%      37 (=)
 38   1782.059218  51.06317634  72.75%       1577.806512    32 (-6)      74.65%      38 (=)
 39   1775.843559  50.18930449  72.28%       1575.086341    33 (-6)      74.20%      39 (=)
 40   1748.052482  56.82618848  70.11%       1520.747729    42 (+2)      72.08%      40 (=)
 41   1748.543861  92.78899389  69.97%       1377.387886    67 (+26)     71.89%      41 (=)
 42   1744.455748  50.99377091  69.85%       1540.480664    39 (-3)      71.83%      42 (=)
 43   1743.084558  60.21302067  69.71%       1502.232476    44 (+1)      71.68%      43 (=)
 44   1742.067022  50.19645562  69.66%       1541.2812      38 (-6)      71.65%      44 (=)
 45   1740.620853  84.04522098  69.40%       1404.439969    58 (+13)     71.37%      46 (+1)
 46   1738.536514  51.44593668  69.38%       1532.752767    40 (-6)      71.35%      45 (-1)
 47   1728.468483  59.51691032  68.55%       1490.400841    45 (-2)      70.54%      47 (=)
 48   1727.948725  50.16195168  68.54%       1527.300918    41 (-7)      70.54%      48 (=)
 49   1708.596637  59.16778978  66.96%       1471.925478    50 (+1)      68.94%      49 (=)
 50   1705.528967  50.05894259  66.73%       1505.293197    43 (-7)      68.72%      50 (=)

Needless to say, I propose that the GLIXARE rating be used for our Smogon leaderboards. I would also advise Colin to do the same for his ShoddyBattle rating, but, of course, I'll leave that up to him since it's his program after all.

Caelum · Feb 10, 2009

Heh. I did this stupid competitive league thing in high school, and a few of my friends and I used the TrueSkill rating system (pretty similar in origin to Elo & Glicko) to devise a better construct of the conservative skill estimate (same thing as CRE, except our league was using 3 as the constant multiplier of the deviation; oh, and it was called SKE). I was planning to dig it up and see if I could manipulate it slightly for Glicko when I heard you were doing this; didn't think you'd get it done so quickly! Anyway, that's my background story for obviously supporting the change. I just checked it and ours was similar in form (obviously not identical, since it was a different game and a slightly different system), so I'm going to assume you manipulated everything correctly, since it looks pretty similar and you are usually right anyway.

The only issue I have isn't a mathematical or factual one at all.

I think many players will be quite upset to see their "rating" (honestly, people only look at the CRE, or GXE in the future I guess, when talking about rating) listed as 0 simply because their deviation is greater than 100. When this is implemented, I'd prefer that the rating be listed just like anyone else's, with leaderboard appearance restricted solely to those with a deviation less than or equal to 100. This is more of an aesthetic thing, to please the user base, than anything.

Edit: Also, GXE is cooler than GLIXARE. GXE = Glicko- X Act estimate!

X-Act · Feb 10, 2009

Oh, I was only suggesting we do this for the Smogon leaderboards, as I say at the end of the original post.

Also I don't mind a name change; if you prefer GXE, then so be it.

Caelum · Feb 10, 2009

Oh I see.

On shoddy when you want to see your record the first line listed is your CRE. I was assuming that this rating was just going to replace that line. If you are just using it for leaderboard rankings then I have no complaints.

X-Act · Feb 10, 2009

I would also like the CRE displayed in the program to change to GLIXARE (or GXE, whatever), but I believe that is at Colin's discretion.

Syberia · Feb 10, 2009

You're the math guy; if you think this is a better option you'll get no objections from me.

darkie · Feb 10, 2009

That issue with Trolly is certainly a large one. I don't mind it at all, especially if it can help curb the number of alts people need to make. I'm sure Doug would appreciate the alts deal as well.

Seven Deadly Sins · Feb 10, 2009

I love the idea of this, especially if it keeps the whole bazillion alts thing down.

DougJustDoug · Feb 11, 2009

I'm not in a position to validate your formulas, X-Act. However, I completely agree with your motivations and reasoning. Just eye-balling the results, combined with my intuitive recollection of past empirical data -- it looks to be a better fit for our needs.

Here's my proposal:

I'll update the Shoddy server code to allow a more configurable rating system. In that way, any servers comfortable with Glicko2 can continue to use it. But, we will be able to use the GLIXARE system.
I will implement GLIXARE on the Create-A-Pokemon server ladder. I've been interested in making some changes to the way the CAP ladder works, and it involved a ratings reset anyway. I use CAP as a guinea pig for all sorts of programming changes -- since there is not as much traffic, and the CAP server community is supportive of trying out new things (introducing new pokemon every few months has a way of doing that...)
After we work out any kinks on the CAP server, we will make announcements here on the forums and set a cutover date for the Smogon University server.
If the SU implementation goes well, I'll upload the whole thing to the Shoddy source code repository, including server conversion scripts, and other server admins can use GLIXARE, if they so choose.

X-Act · Feb 11, 2009

Actually Doug, GLIXARE can be used by all systems that are based on Glicko, including the current Glicko-2 system implemented by Shoddy. This is just a replacement of CRE, not a replacement of Glicko-2. This is just a better way of interpreting a player's current Rating and Deviation as a single rating than CRE.

Seven Deadly Sins · Feb 11, 2009

After looking at the data, it looks like the biggest issue is that CRE takes Deviation into account on a massive scale due to the fact that CRE is meant to be an average of how good a player is, and Deviation specifically means that the system can't pin down exactly how good the player is. I foresee that the main issue will be that it's a double edged sword in that while it makes it less necessary to have a bunch of alts, it also makes it easier for new alts to climb the ladder due to the rating being based less on deviation.

That's just my take on it, though. There's probably something in there I missed that makes it moot.

DougJustDoug · Feb 11, 2009

Seven Deadly Sins said:
After looking at the data, it looks like the biggest issue is that CRE takes Deviation into account on a massive scale due to the fact that CRE is meant to be an average of how good a player is, and Deviation specifically means that the system can't pin down exactly how good the player is. I foresee that the main issue will be that it's a double edged sword in that while it makes it less necessary to have a bunch of alts, it also makes it easier for new alts to climb the ladder due to the rating being based less on deviation.

That's just my take on it, though. There's probably something in there I missed that makes it moot.

You probably overlooked this part of X-Act's formula:

GLIXARE Rating = 0, if RD > 100

So, new users will have a rating of 0, until they get their deviation low enough to rate them reasonably.

Seven Deadly Sins · Feb 11, 2009

Oh, I did miss that. Thanks. It looks good, actually!

X-Act · Feb 12, 2009

I've done a very slight modification to the GLIXARE rating formula. Basically, I made it so that you know at a glance your playing strength as a probability of you winning a random battle.

I've now changed the rating so that it is a number between 0 and 2000. Basically, your probability of you winning a random battle is approximately (Rating / 20) %. This can be quickly calculated by halving the rating and then moving the decimal point one place to the left.

For example, if your GLIXARE rating is 1743, the probability of you winning against a random opponent is approximately 1743 / 20, or 87.15%.

This means that if your rating is more than 1000, you are better than average, and if it is less than 1000, you are worse than average.

I'll edit the original post shortly.

darkie · Feb 12, 2009

That is a nifty little nuance, and to be honest, it makes perfect sense for rating to be correlated to the chance that a player could win a battle where really only playing skill (and some luck) matters.

DougJustDoug · Feb 12, 2009

X-Act said:
I've done a very slight modification to the GLIXARE rating formula. Basically, I made it so that you know at a glance your playing strength as a probability of you winning a random battle.

I've now changed the rating so that it is a number between 0 and 2000. Basically, your probability of you winning a random battle is approximately (Rating / 20) %. This can be quickly calculated by halving the rating and then moving the decimal point one place to the left.

For example, if your GLIXARE rating is 1743, the probability of you winning against a random opponent is approximately 1743 / 20, or 87.15%.

This means that if your rating is more than 1000, you are better than average, and if it is less than 1000, you are worse than average.

I'll edit the original post shortly.

I think we could actually display the probability percentage when users check their rating, and we could put that number on the leaderboard. It makes more sense than the four-digit number shown today, and it would easily distinguish a "GLIXARE-based server" from a "CRE-based server".

We don't have to do it, but I think it might be a good idea.

X-Act · Feb 13, 2009

Well, in that case, we could just make the GLIXARE rating to be simply this percentage. I didn't want to do this because I don't like decimals when showing a rating. Whole numbers are much simpler. That's why I multiplied this percentage by 20 and rounded to the nearest whole number.

I could have also multiplied by any other number in theory. 20 seemed to be a good all-round number though. Numbers less than 20 would have made ratings too near (or equal) to each other (remember that the number is then rounded to the nearest whole number). Numbers greater than 20, on the other hand, produced ratings that were too big for my tastes. Also, 20 has the advantage that you can know immediately whether you're better than average or not just by counting the number of digits in your rating (4 digits = better than average, less than 4 digits = worse than average). You can also say "I'm over 1200 so I have more than a 60% chance of winning" or "I'm over 1500 so I have more than a 75% chance of winning", or something like that.

Another option would be to multiply the percentage by 100. That way you'd get a rating between 0 and 10000, and the percentage would be calculated very easily. For example, if your rating is 4598, you'd have a 45.98% chance of beating a random player. This would have the following advantage:

0 - 1000: Percentage of beating someone is between 0% and 10%
1000 - 2000: Percentage of beating someone is between 10% and 20%
2000 - 3000: Percentage of beating someone is between 20% and 30%

etc.
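
The two whole-number scales discussed in this post differ only in the multiplier, which a short Python sketch makes explicit. The helper names `scaled_rating` and `win_percentage` are invented for this example.

```python
def scaled_rating(percent: float, multiplier: int = 20) -> int:
    """Whole-number rating: win percentage times multiplier, rounded."""
    return round(percent * multiplier)

def win_percentage(rating: int, multiplier: int = 20) -> float:
    """Approximate win percentage recovered from a whole-number rating."""
    return rating / multiplier

print(win_percentage(1743))        # 0 to 2000 scale (multiplier 20)
print(win_percentage(4598, 100))   # 0 to 10000 scale (multiplier 100)
print(scaled_rating(60), scaled_rating(75))
```

A smaller multiplier pushes more players onto the same whole number after rounding, which is the trade-off described above.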

DougJustDoug · Feb 13, 2009

I like the winning-chance percentage, even if it is a decimal, because it is easily explained to new users, and users will have an intuitive sense of the magnitude of the numbers.

A rating of 1723 is basically a "meaningless" four-digit number for most users. It only gains meaning when stacked up against other numbers, by looking at the leaderboard. Very few users will ever bother to find out what that number means, even if it is as simple as dividing by 20. Most users will assume the number is undecipherable by non-math-wizards, and they will go about their business.

But a rating of 87.15% almost guarantees that every player will ask a question of, "87.15% of what?". When they are told it represents their chance of winning against a random ladder opponent, that will be quickly understood and never forgotten. Users will tell other users and the "meaning" of the GLIXARE estimate would likely be disseminated quickly and understood by all.

Also, if someone is told their rating is 43.25%, for example -- there is an intuitive appreciation for the magnitude of that number, since the user implicitly knows that 100% is the upper limit. If I told them their rating is 825, they likely have no idea where that number falls in the range of possible values. The percentage has the benefit of "being a percentage", which means that it is part of the common parlance of numbers used by mere math mortals.

I think the percentage is a number more easily digested by lay users, and could be a distinguishing feature of the GLIXARE estimation system.

X-Act · Feb 13, 2009

I have no problem with displaying the GLIXARE rating as a percentage to two decimal places. Let me edit the OP.

Cathy · Feb 14, 2009

This looks good and since there are really no advantages to CRE there's no need for anything fancy to allow the client to use both CRE and this new measure. The client (i.e. the view rating dialogue box) might as well just be changed to use this new measure exclusively.

For display purposes, I'd recommend displaying a user with too high of a rating deviation as "provisional" or something similar, rather than showing a 0.

X-Act · Feb 15, 2009

I agree with using the word 'provisional' too instead of 0.

Although both have the same effects, really: that of placing your rating at the bottom of the leaderboard.

Another way maybe would be to display the word provisional, but then also display the rating that GLIXARE would provide you if your RD wasn't too high. This wouldn't be done on the leaderboard, but rather in the program itself. Something like:

Rating: Provisional (63.44%)

This would be done so that the player would know how he's doing approximately.
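
A minimal Python sketch of how that display rule might look, assuming the RD > 100 cutoff from the original post. The helper names `win_estimate` and `rating_label` are hypothetical and the sample players are made up; none of this is taken from the Shoddy client.

```python
import math

MAX_RATED_DEVIATION = 100  # cutoff from the GLIXARE definition in this thread

def win_estimate(rating: float, deviation: float) -> float:
    """GLIXARE percentage without the deviation cutoff applied."""
    ln10_sq = math.log(10) ** 2
    denominator = math.sqrt(
        3 * ln10_sq * deviation ** 2
        + 2500 * (64 * math.pi ** 2 + 147 * ln10_sq)
    )
    return 100 / (1 + 10 ** ((1500 - rating) * math.pi / denominator))

def rating_label(rating: float, deviation: float) -> str:
    """Text for the in-program rating line (hypothetical helper)."""
    percent = f"{win_estimate(rating, deviation):.2f}%"
    if deviation > MAX_RATED_DEVIATION:
        return f"Rating: Provisional ({percent})"
    return f"Rating: {percent}"

print(rating_label(1500, 350))  # brand-new player: provisional
print(rating_label(1650, 60))   # established player: plain percentage
```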

Also recall that Glicko-2 increases the RD of a player who is not playing. If it becomes too high, that player's rating would become provisional, which, in my opinion, is a good thing.

darkie · Feb 17, 2009

A thought: it would be interesting to see how the results of, say, a randbat ladder would correspond with the probability percentage of actually winning a randbat.

DougJustDoug · Feb 20, 2009

One possible negative of using the percentage -- people may mistakenly think that it is their "Percentage of wins so far". I still think we should use the percentage, but I won't be surprised if new users jump to this conclusion.

I am changing the code for testing on the CAP server. Like Colin suggested, I plan to keep both CRE and GLIXARE for testing purposes.

I totally agree with the proposal for "Provisional", and I think X-Act's suggestion for showing the user's provisional estimate is a good idea.

Cathy · Feb 20, 2009

I don't really like X-Act's proposed name for this measure and if it's in the client I'd rather it be called something more intuitive that describes what it actually is, perhaps just Percentage Rating Estimate.

X-Act · Feb 21, 2009

I don't care about the name. Name it what the hell you like.

Its real definition is "the probability that you win against a player with a 1500 rating and 350 deviation". Such a player may be considered to be a random opponent, since a player with such a rating may be the best or worst player, or something in between. In fact, this probability estimate is very near the true percentage of winning against a random opponent.

There, you now have hints on what to call it.
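
For anyone who wants to check that definition against the formula in the original post, substituting R_2 = 1500 and RD_2 = 350 into the win-probability equation gives the following. This is only a sketch of the algebra using the constants quoted in the thread; P here denotes the probability before it is turned into a percentage and rounded.

```latex
\begin{align*}
P &= \frac{1}{1 + 10^{(1500 - R)/\left(400\sqrt{1 + C\,(RD^2 + 350^2)}\right)}},
  \qquad C = \frac{3\ln(10)^2}{(400\pi)^2} \\
400\sqrt{1 + C\,(RD^2 + 350^2)}
  &= \sqrt{160000 + \frac{3\ln(10)^2\,(RD^2 + 122500)}{\pi^2}} \\
  &= \frac{1}{\pi}\sqrt{3\ln(10)^2\,RD^2 + 160000\,\pi^2 + 367500\,\ln(10)^2} \\
  &= \frac{1}{\pi}\sqrt{3\ln(10)^2\,RD^2 + 2500\left(64\pi^2 + 147\ln(10)^2\right)} \\
P &= \frac{1}{1 + 10^{(1500 - R)\,\pi / \sqrt{3\ln(10)^2\,RD^2 + 2500\,(64\pi^2 + 147\ln(10)^2)}}}
\end{align*}
```

The last line is the expression inside the GLIXARE formula, with 160000 = 2500 * 64 and 3 * 350^2 = 367500 = 2500 * 147.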
