As you probably noticed from some of my posts, I really don't like ShoddyBattle's 'Conservative Rating Estimate' (CRE) method of providing the overall rating of a player. For the record, the formula used to calculate the CRE of a player is:
Code:
CRE = Rating - 4 * Deviation
In its favour, CRE is simple to calculate. Its disadvantages, however, are the following:
- Rating changes are too slow. You need to beat quite a lot of players before your rating changes appreciably, which pushes players to use more alts.
- The higher a player's rating deviation, the more that player's true skill is underestimated.
- It provides horribly incorrect ratings for people whose rating deviation is very high. For an example, just visit this page.
Because of this, I set out to find a better way of estimating a player's overall rating given his Rating R and Deviation RD... and I managed to do this yesterday.
I read the paper by Glickman (the inventor of the Glicko and Glicko-2 rating systems), in which he provides an equation that essentially calculates the probability that a player with rating R_1 and deviation RD_1 beats another player with rating R_2 and deviation RD_2. It is written below:
Code:
Probability = 1 / (1 + 10^(((R_2 - R_1) / (400 * sqrt(1 + C * (RD_1^2 + RD_2^2))))))
where C = 3 * ln(10)^2 / (400 * pi)^2 (approximately 0.0000100724)
pi = 3.14159265359
sqrt(x) is the square root of x
ln(x) is the natural logarithm of x
I then simulated 250 players, each with their own rating and deviation, found the probability of every player beating every other player using the above formula, and averaged those probabilities for each player. This average is the true rating of the player.
However, averaging over every possible opponent is laborious, so I wanted to approximate this probability for each player using just his own R and RD (and not everyone else's as well). After considering various possibilities, it dawned on me that the probability of the player beating a player with rating 1500 and deviation 350 (the rating and deviation of a player that has just joined the ladder) would provide a good approximation. When I tested it, it did provide a good approximation of the true rating... a very, very good approximation actually!
The only time it didn't provide a good approximation was when the deviation of the player was high. This confirmed yet again that players whose rating deviation is too high (meaning that their rating is too uncertain) shouldn't even be listed on the leaderboard, and that is what I propose the estimated rating should do.
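The procedure described above (pairwise win probabilities, the averaged true rating, and the shortcut against a 1500 / 350 newcomer) could be sketched in Python roughly as follows. The function names are only illustrative, and the post does not say how the 250 simulated players were generated, so the random pool below is purely an assumption.

```python
import math
import random

# C as defined in the formula: 3 * ln(10)^2 / (400 * pi)^2, roughly 0.0000100724
C = 3 * math.log(10) ** 2 / (400 * math.pi) ** 2


def win_probability(r1, rd1, r2, rd2):
    """Probability that a player (r1, rd1) beats a player (r2, rd2)."""
    scale = 400 * math.sqrt(1 + C * (rd1 ** 2 + rd2 ** 2))
    return 1 / (1 + 10 ** ((r2 - r1) / scale))


def true_rating(index, players):
    """Average win probability of players[index] against all other players."""
    r1, rd1 = players[index]
    others = [p for i, p in enumerate(players) if i != index]
    total = sum(win_probability(r1, rd1, r2, rd2) for r2, rd2 in others)
    return total / len(others)


def approx_rating(r, rd):
    """Shortcut: win probability against a newcomer (rating 1500, deviation 350)."""
    return win_probability(r, rd, 1500, 350)


# ASSUMPTION: illustrative pool only. Ratings are drawn from a normal
# distribution around 1500 and deviations uniformly between 50 and 100.
rng = random.Random(0)
players = [(rng.gauss(1500, 200), rng.uniform(50, 100)) for _ in range(250)]

# Compare the full average with the shortcut for the first few players.
for i in range(5):
    r, rd = players[i]
    print(f"R={r:7.1f} RD={rd:5.1f} "
          f"true={true_rating(i, players):.4f} approx={approx_rating(r, rd):.4f}")
```

How closely the two columns agree depends on the assumed pool, so this only illustrates the mechanics and does not reproduce the original test.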
After consulting a bit with the community, it was decided that this system's rating should represent the estimated percentage chance that a player has of winning a battle against a random opponent.
So, finally, here is what I propose to be a much better estimate of the player's rating. I'm calling it GLIXARE, short for 'Glicko - X-Act Rating Estimate':
Code:
Given a player rating R and a rating deviation RD:
GLIXARE Rating = 0, if RD > 100
GLIXARE Rating = round(10000 / (1 + 10^(((1500 - R) * pi / sqrt(3 * ln(10)^2 * RD^2 + 2500 * (64 * pi^2 + 147 * ln(10)^2)))))) / 100, otherwise
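As a rough Python sketch of the formula above: `glixare` transcribes it term by term, while `win_vs_newcomer` is an illustrative helper (not part of the proposal) that plugs R_2 = 1500 and RD_2 = 350 into the general win-probability formula, to show that the two agree up to rounding.

```python
import math


def glixare(r, rd):
    """GLIXARE rating as a percentage with two decimals; 0 if RD > 100."""
    if rd > 100:
        return 0.0
    ln10_sq = math.log(10) ** 2
    denom = math.sqrt(3 * ln10_sq * rd ** 2
                      + 2500 * (64 * math.pi ** 2 + 147 * ln10_sq))
    return round(10000 / (1 + 10 ** ((1500 - r) * math.pi / denom))) / 100


def win_vs_newcomer(r, rd):
    """Unrounded probability of beating a 1500 / 350 player (general formula)."""
    c = 3 * math.log(10) ** 2 / (400 * math.pi) ** 2
    scale = 400 * math.sqrt(1 + c * (rd ** 2 + 350 ** 2))
    return 1 / (1 + 10 ** ((1500 - r) / scale))


print(glixare(1500, 50))                 # 50.0 (an exactly average rating)
print(glixare(1991.408347, 52.4913089))  # 86.77
print(glixare(1800, 120))                # 0.0 (deviation above the cut-off)
print(100 * win_vs_newcomer(1700, 60))   # agrees with glixare(1700, 60) before rounding
```

One caveat: Python's `round` resolves exact .5 ties to the nearest even number, which may differ from the rounding intended in the formula, although it only matters on exact ties.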
The table below shows the top 50 of the 250 players I've tested, ranked according to the true percentage they had of winning against the other players. Next to each player is the rank he would have obtained if CRE was used, and the rank he would have obtained if GLIXARE was used. Notice how closely the GLIXARE ranking tracks the true ranking: only players 45 and 46 swap places, whereas CRE misplaces several players by more than 20 positions.
Code:
Rank  Rating       Deviation    True Rating  CRE          Rank For CRE  GLIXARE  Rank For GLIXARE
1     1991.408347  52.4913089   86.25%       1781.443112  1 (=)         86.77%   1 (=)
2     1992.528854  68.34131414  86.21%       1719.163597  13 (+11)      86.73%   2 (=)
3     1989.461831  80.60840798  85.95%       1667.028199  18 (+15)      86.51%   3 (=)
4     1969.23615   50.08961382  85.08%       1768.877695  2 (-2)        85.78%   4 (=)
5     1968.509612  50.00035476  85.04%       1768.508193  3 (-2)        85.75%   5 (=)
6     1972.675135  99.88585936  84.88%       1573.131698  34 (+28)      85.58%   6 (=)
7     1963.494349  50.60336877  84.76%       1761.080874  5 (-2)        85.51%   7 (=)
8     1962.47472   50.03981759  84.71%       1762.31545   4 (-4)        85.46%   8 (=)
9     1953.22279   52.20270923  84.18%       1744.411953  6 (-3)        85.00%   9 (=)
10    1958.445485  96.63628542  84.13%       1571.900343  36 (+26)      84.94%   10 (=)
11    1943.00504   50.10101356  83.60%       1742.600986  7 (-4)        84.51%   11 (=)
12    1941.259317  52.64803719  83.48%       1730.667168  10 (-2)       84.41%   12 (=)
13    1938.159617  50.92018226  83.31%       1734.478888  8 (-5)        84.26%   13 (=)
14    1936.183665  51.14032287  83.19%       1731.622373  9 (-5)        84.16%   14 (=)
15    1930.057618  50.16740178  82.84%       1729.388011  11 (-4)       83.85%   15 (=)
16    1927.779647  51.0262639   82.70%       1723.674591  12 (-4)       83.73%   16 (=)
17    1922.220146  57.65787498  82.32%       1691.588646  14 (-3)       83.40%   17 (=)
18    1918.558883  74.43201955  81.99%       1620.830805  28 (+10)      83.09%   18 (=)
19    1908.558873  59.43764926  81.47%       1670.808276  15 (-4)       82.65%   19 (=)
20    1898.317729  81.4379487   80.68%       1572.565934  35 (+15)      81.93%   20 (=)
21    1876.359618  52.21151457  79.44%       1667.513559  17 (-4)       80.86%   21 (=)
22    1870.366607  50.00313579  79.06%       1670.354064  16 (-6)       80.51%   22 (=)
23    1867.964646  51.3491295   78.89%       1662.568128  21 (-2)       80.36%   23 (=)
24    1867.766236  50.56176086  78.88%       1665.519192  19 (-5)       80.35%   24 (=)
25    1863.669023  50.11759802  78.61%       1663.198631  20 (-5)       80.10%   25 (=)
26    1866.541991  95.77888912  78.50%       1483.426435  48 (+22)      79.95%   26 (=)
27    1859.313589  55.68806985  78.28%       1636.561309  25 (-2)       79.81%   27 (=)
28    1854.389046  51.55994467  77.97%       1648.149268  22 (-6)       79.52%   28 (=)
29    1855.299562  72.27265208  77.92%       1566.208954  37 (+8)       79.46%   29 (=)
30    1853.503073  52.54886057  77.90%       1643.307631  23 (-7)       79.46%   30 (=)
31    1841.225405  50.60827664  77.06%       1638.792298  24 (-7)       78.70%   31 (=)
32    1834.513128  50.14590546  76.59%       1633.929506  26 (-6)       78.26%   32 (=)
33    1834.145367  52.8071478   76.55%       1622.916775  27 (-6)       78.23%   33 (=)
34    1801.474785  50.04016895  74.21%       1601.314109  29 (-5)       76.04%   34 (=)
35    1804.267316  94.72400253  74.15%       1425.371306  55 (+20)      75.93%   35 (=)
36    1795.795989  50.01171801  73.78%       1595.749117  30 (-6)       75.64%   36 (=)
37    1793.496594  50.25237181  73.61%       1592.487107  31 (-6)       75.47%   37 (=)
38    1782.059218  51.06317634  72.75%       1577.806512  32 (-6)       74.65%   38 (=)
39    1775.843559  50.18930449  72.28%       1575.086341  33 (-6)       74.20%   39 (=)
40    1748.052482  56.82618848  70.11%       1520.747729  42 (+2)       72.08%   40 (=)
41    1748.543861  92.78899389  69.97%       1377.387886  67 (+26)      71.89%   41 (=)
42    1744.455748  50.99377091  69.85%       1540.480664  39 (-3)       71.83%   42 (=)
43    1743.084558  60.21302067  69.71%       1502.232476  44 (+1)       71.68%   43 (=)
44    1742.067022  50.19645562  69.66%       1541.2812    38 (-6)       71.65%   44 (=)
45    1740.620853  84.04522098  69.40%       1404.439969  58 (+13)      71.35%   46 (+1)
46    1738.536514  51.44593668  69.38%       1532.752767  40 (-6)       71.37%   45 (-1)
47    1728.468483  59.51691032  68.55%       1490.400841  45 (-2)       70.54%   47 (=)
48    1727.948725  50.16195168  68.54%       1527.300918  41 (-7)       70.54%   48 (=)
49    1708.596637  59.16778978  66.96%       1471.925478  50 (+1)       68.94%   49 (=)
50    1705.528967  50.05894259  66.73%       1505.293197  43 (-7)       68.72%   50 (=)
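To see where the two rankings diverge, a few rows of the table can be re-ranked with both formulas. The sketch below uses the players with true ranks 1, 2, 4 and 6. The helper names are illustrative, and `glixare` is left unrounded here because only the ordering matters.

```python
import math


def cre(r, rd):
    """Conservative Rating Estimate: rating minus four deviations."""
    return r - 4 * rd


def glixare(r, rd):
    """Unrounded GLIXARE value (a probability between 0 and 1); 0 if RD > 100."""
    if rd > 100:
        return 0.0
    ln10_sq = math.log(10) ** 2
    denom = math.sqrt(3 * ln10_sq * rd ** 2
                      + 2500 * (64 * math.pi ** 2 + 147 * ln10_sq))
    return 1 / (1 + 10 ** ((1500 - r) * math.pi / denom))


# (true rank, rating, deviation) copied from the table
rows = [
    (1, 1991.408347, 52.4913089),
    (2, 1992.528854, 68.34131414),
    (4, 1969.23615, 50.08961382),
    (6, 1972.675135, 99.88585936),
]

# Sort from best to worst under each formula and keep only the true ranks.
by_cre = [t[0] for t in sorted(rows, key=lambda t: cre(t[1], t[2]), reverse=True)]
by_glixare = [t[0] for t in sorted(rows, key=lambda t: glixare(t[1], t[2]), reverse=True)]

print(by_cre)      # [1, 4, 2, 6]
print(by_glixare)  # [1, 2, 4, 6]
```

CRE drops player 2 below player 4 purely because of the larger deviation, while GLIXARE keeps the true order for these four.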
Needless to say, I propose that the GLIXARE rating be used for our Smogon leaderboards. I would also advise Colin to do the same for his ShoddyBattle rating, but, of course, I'll leave that up to him since it's his program after all.