The OU List.

X-Act

np: Biffy Clyro - Shock Shock
However, that per-battle rating calculation is made by taking both players' NIGHTLY ratings into account -- not their most-current ratings at the time of the battle.
I'm sorry, I wasn't aware of this. If that's the case, then you're right.

However, I'd like to say that there is a formula that, given two players' R and RD, gives the probability of Player 1 beating Player 2, which is a suitable measure of how much better one player is than another.

I'll post it here after I end the lecture that's coming up, since I need to find it first.

EDIT: Okay, here is the formula. Brace yourselves as Prof. Glickman seems to be a wicked guy who loves really long and complex equations!

If Player 1 has rating R1 and rating deviation RD1 and Player 2 has rating R2 and rating deviation RD2, then the probability of Player 1 winning against Player 2 is:

Code:
1 / (1 + 10^(-g * (R1 - R2) / 400))
 
where g = 1 / sqrt(1 + 3 * (ln(10))^2 * (RD1^2 + RD2^2) / (400*pi)^2)
  and pi is the usual pi (approx. 3.14159)
Interestingly, the formula above returns 0.5, or 50%, if R1=R2, regardless of what RD1 and RD2 are. Also, if the formula gives the probability that Player 1 wins against Player 2 as p, then it gives the probability that Player 2 wins against Player 1 as, correctly, (1-p).
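
For anyone who wants to try the formula, here is a minimal Python sketch of it. The function name `win_probability` and the sample ratings are made up for illustration; only the formula itself comes from the post above.

```python
import math

def win_probability(r1, rd1, r2, rd2):
    """Probability that Player 1 beats Player 2, using the formula above."""
    # g shrinks toward 0 as the combined rating deviation grows,
    # which pulls the prediction toward 50%.
    g = 1.0 / math.sqrt(
        1.0 + 3.0 * math.log(10) ** 2 * (rd1 ** 2 + rd2 ** 2) / (400.0 * math.pi) ** 2
    )
    return 1.0 / (1.0 + 10.0 ** (-g * (r1 - r2) / 400.0))

# Equal ratings: the deviations should not matter.
equal = win_probability(1500, 50, 1500, 300)

# Unequal ratings: the two directions should sum to 1.
p = win_probability(1700, 80, 1550, 200)
q = win_probability(1550, 200, 1700, 80)
print(equal, p, q, p + q)
```

Swapping the two players only flips the sign of (R1 - R2) while g stays the same, which is why the two probabilities are complementary.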

This can be calculated for each and every player on the ladder against each and every other player on the ladder, and these probabilities averaged out, to find the _real_ relative strength of all players. (Of course, a computer is very well suited for this!) I think this would be a very good thing to do, actually.
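
The averaging described above could be sketched roughly like this in Python. The ladder here is randomly generated, and the rating and deviation ranges, the seed, and all function names are assumptions for illustration, not real ladder data.

```python
import math
import random

def win_probability(r1, rd1, r2, rd2):
    """Probability that Player 1 beats Player 2 (same formula as above)."""
    g = 1.0 / math.sqrt(
        1.0 + 3.0 * math.log(10) ** 2 * (rd1 ** 2 + rd2 ** 2) / (400.0 * math.pi) ** 2
    )
    return 1.0 / (1.0 + 10.0 ** (-g * (r1 - r2) / 400.0))

def average_win_probabilities(players):
    """Average each player's win probability against every other player.

    players is a list of (rating, deviation) pairs. Returns
    (rating, deviation, average) rows sorted with the best average first.
    """
    rows = []
    for i, (r1, rd1) in enumerate(players):
        total = sum(
            win_probability(r1, rd1, r2, rd2)
            for j, (r2, rd2) in enumerate(players)
            if j != i  # a player never faces themselves
        )
        rows.append((r1, rd1, total / (len(players) - 1)))
    return sorted(rows, key=lambda row: row[2], reverse=True)

rng = random.Random(0)  # fixed seed so the made-up ladder is reproducible
ladder = [(rng.uniform(1000, 2000), rng.uniform(35, 340)) for _ in range(200)]
table = average_win_probabilities(ladder)
for rating, deviation, average in table[:5]:
    print(f"{rating:7.2f} {deviation:6.2f} {average:7.2%}")
```

Because every pairing contributes p to one player and (1-p) to the other, the averages over the whole ladder always centre on 50%.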
 

X-Act

np: Biffy Clyro - Shock Shock
I decided to implement the above "Player1 beats Player2" formula for 200 randomly-generated players. I applied it for every player against every other player, summed up the probabilities and divided by the number of probabilities. I then sorted them in descending order. Here is the result:

Code:
Rating  Deviation  Average Percentage of Winning
------------------------------------------------
1983.23 66.76      87.15%
1957.91 41.12      86.09%
1997.38 224.46     85.81%
1949.17 37.16      85.68%
1998.25 267.42     85.05%
1945.41 139.34     84.67%
1924.07 70.19      84.23%
1944.96 195.91     83.84%
1951.25 219.93     83.75%
1904.87 56.15      83.29%
1901.06 45.13      83.13%
1901.28 138.25     82.38%
1898.10 154.03     82.01%
1940.64 288.15     81.95%
1877.02 114.49     81.29%
1892.77 201.38     81.04%
1882.40 167.35     80.96%
1877.35 150.43     80.90%
1902.79 255.29     80.65%
1858.76 105.46     80.31%
1916.83 321.48     80.08%
1856.85 156.04     79.65%
1843.35 135.22     79.09%
1845.10 151.42     79.00%
1844.72 154.30     78.95%
1844.25 216.56     78.04%
1874.67 317.05     77.95%
1818.45 145.72     77.44%
1851.25 314.26     76.72%
1819.67 246.34     76.09%
1785.44 86.01      75.84%
1779.00 61.09      75.55%
1788.29 149.75     75.46%
1823.07 329.85     74.81%
1792.79 245.09     74.45%
1812.40 332.96     74.13%
1771.23 170.54     74.09%
1784.61 281.98     73.34%
1753.09 156.99     73.02%
1734.63 77.98      72.37%
1730.03 163.12     71.35%
1742.02 235.36     71.29%
1719.81 181.47     70.44%
1708.41 98.77      70.34%
1749.76 333.70     70.31%
1720.29 248.23     69.65%
1704.09 150.84     69.62%
1692.47 92.30      69.20%
1713.39 247.79     69.18%
1682.03 99.20      68.38%
1680.32 190.02     67.51%
1680.18 212.32     67.26%
1650.05 37.92      66.15%
1641.17 67.50      65.37%
1625.81 71.48      64.13%
1625.11 56.86      64.12%
1650.15 313.44     63.98%
1625.63 122.36     63.88%
1625.83 160.14     63.65%
1640.33 330.95     63.11%
1632.88 314.01     62.77%
1609.62 133.08     62.55%
1610.35 213.03     62.06%
1598.89 133.10     61.70%
1610.63 282.97     61.48%
1612.46 298.16     61.47%
1606.04 314.98     60.86%
1592.97 232.67     60.58%
1597.46 300.96     60.36%
1572.13 68.47      59.77%
1568.66 173.63     59.06%
1554.23 122.84     58.12%
1555.91 235.77     57.71%
1555.61 241.36     57.66%
1543.92 250.23     56.71%
1535.43 192.24     56.31%
1521.07 146.41     55.30%
1502.95 216.03     53.62%
1500.28 215.94     53.41%
1491.05 76.36      52.91%
1492.52 195.92     52.83%
1486.64 123.04     52.47%
1491.37 332.37     52.42%
1484.18 162.48     52.21%
1483.60 310.58     51.90%
1466.43 109.26     50.78%
1463.54 69.05      50.55%
1462.24 104.23     50.42%
1461.92 90.69      50.40%
1460.58 147.25     50.26%
1458.63 214.96     50.07%
1455.76 261.16     49.82%
1454.21 127.71     49.73%
1451.04 199.77     49.46%
1447.16 280.82     49.16%
1446.52 55.72      49.08%
1444.22 245.73     48.92%
1443.61 116.84     48.84%
1442.46 116.93     48.74%
1438.16 266.20     48.46%
1437.78 84.40      48.33%
1434.61 260.44     48.17%
1430.27 285.85     47.87%
1428.09 312.95     47.74%
1425.69 223.61     47.43%
1423.95 94.51      47.15%
1415.96 298.29     46.80%
1415.51 241.27     46.65%
1413.50 230.83     46.47%
1405.42 298.90     46.01%
1407.25 85.57      45.71%
1403.19 153.72     45.47%
1398.86 261.68     45.40%
1401.19 138.56     45.28%
1388.74 284.23     44.70%
1383.37 326.26     44.47%
1389.42 158.41     44.34%
1386.58 135.71     44.05%
1385.28 57.73      43.80%
1374.82 313.54     43.78%
1377.23 85.36      43.15%
1374.41 50.06      42.86%
1352.70 331.88     42.26%
1362.51 155.09     42.11%
1353.51 257.87     41.87%
1351.29 264.65     41.74%
1349.10 165.98     41.05%
1345.91 162.57     40.78%
1333.14 297.49     40.59%
1342.71 107.66     40.30%
1328.21 276.83     40.08%
1336.54 63.51      39.66%
1316.78 323.25     39.60%
1327.40 74.11      38.92%
1314.12 258.11     38.87%
1292.72 265.69     37.33%
1304.44 139.44     37.29%
1298.94 189.11     37.18%
1288.70 227.40     36.69%
1297.54 127.79     36.67%
1268.74 304.61     35.99%
1288.07 88.82      35.72%
1259.20 317.22     35.46%
1248.78 334.59     34.94%
1248.34 325.56     34.80%
1258.16 206.71     34.19%
1255.07 194.72     33.84%
1243.24 260.66     33.66%
1248.24 203.64     33.41%
1224.65 335.92     33.32%
1221.06 337.16     33.10%
1245.78 184.10     33.04%
1226.06 277.97     32.66%
1213.53 325.41     32.43%
1211.66 265.66     31.49%
1224.68 166.58     31.29%
1197.35 274.18     30.62%
1203.47 217.32     30.29%
1208.34 185.54     30.27%
1196.33 211.44     29.71%
1173.19 320.69     29.71%
1209.04 103.69     29.59%
1170.00 251.00     28.43%
1183.43 133.26     27.93%
1138.08 335.29     27.73%
1145.32 250.63     26.78%
1171.86 88.50      26.74%
1170.25 94.41      26.66%
1134.71 272.13     26.44%
1117.77 315.69     26.14%
1142.07 221.50     26.12%
1109.73 334.40     25.99%
1146.93 54.03      24.76%
1127.52 176.63     24.53%
1118.74 203.07     24.33%
1128.21 80.85      23.61%
1108.76 195.16     23.57%
1087.69 268.59     23.47%
1116.44 105.31     23.01%
1064.69 285.76     22.43%
1102.82 93.42      22.01%
1058.88 263.63     21.68%
1048.02 286.67     21.50%
1087.57 104.79     21.12%
1079.77 124.62     20.83%
1065.77 185.89     20.77%
1031.67 287.42     20.61%
1077.44 43.42      20.07%
1000.53 296.03     19.14%
1021.67 235.55     19.08%
1051.51 101.94     18.88%
1042.78 127.08     18.63%
1035.97 126.62     18.23%
1015.35 206.60     18.22%
1020.88 164.54     17.87%
1030.80 99.64      17.65%
1001.14 195.80     17.28%
1023.69 36.20      16.83%
1011.82 93.95      16.54%
1002.74 107.62     16.18%
It seems like this percentage is a non-arbitrary way of making weighted stats.

The problem I can see is what Doug said: the R and RD are only 100% accurate at 11:30pm every day, not at the time when the battle was played. At least, however, this percentage would be a non-arbitrary multiplier.
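
As a rough illustration of using the percentage as a multiplier, here is a small Python sketch. The player names, their multipliers, the teams, and the function name `weighted_usage` are all invented sample data and not part of any actual stats script.

```python
from collections import defaultdict

def weighted_usage(teams, multiplier):
    """Sum each Pokemon's appearances, scaling every appearance by the
    multiplier (average win percentage as a fraction) of its user."""
    totals = defaultdict(float)
    for player, team in teams:
        for pokemon in team:
            totals[pokemon] += multiplier[player]
    return dict(totals)

# Hypothetical players with their average win percentages as fractions.
multiplier = {"alice": 0.80, "bob": 0.50, "carol": 0.20}

# Hypothetical teams, shortened to two Pokemon each to keep the example small.
teams = [
    ("alice", ["Garchomp", "Blissey"]),
    ("bob", ["Garchomp", "Luvdisc"]),
    ("carol", ["Luvdisc", "Blissey"]),
]

usage = weighted_usage(teams, multiplier)
for pokemon, weight in sorted(usage.items(), key=lambda item: item[1], reverse=True):
    print(f"{pokemon:10s} {weight:.2f}")
```

Every Pokemon here is used twice, so unweighted usage would tie them all; the multiplier is what separates them.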
 

DougJustDoug

Knows the great enthusiasms
We could weigh these numbers as well, especially if we feel that our cutoff point is a bit low. Or we could do something like:

-Players above 1500 are using
-Players above 1600 are using
-Players above 1700 are using

The best part about doing this is that even if there is a large disparity in battle count between users, we can still gather useful stats. If solid players above 1600 or 1700 are using pokemon that we might not have expected them to use, that's telling. It's also notable if they're not using many OU pokemon ever, or if a couple pokemon seem to be on almost every team over 1700, but see 'normal' numbers at 1500-1600 or so.
I agree that usage split into ranking tiers would yield some interesting stats. I'd love to see how the usage rankings change as you move up in rating. However, combining that information into a single formula for determining "what good players are using" is something I still think is fraught with insurmountable problems, many of which I listed in my tl;dr post.
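
A usage split by rating cutoff, as suggested in the quote, might look something like the Python sketch below. The cutoffs match the 1500/1600/1700 example, while the ratings, teams, and the function name `usage_by_cutoff` are hypothetical.

```python
from collections import Counter

def usage_by_cutoff(teams, cutoffs=(1500, 1600, 1700)):
    """Count Pokemon usage among players rated above each cutoff.

    teams is a list of (rating, team) pairs. A player above 1700 is
    counted in the 1500, 1600 and 1700 groups alike.
    """
    usage = {cutoff: Counter() for cutoff in cutoffs}
    for rating, team in teams:
        for cutoff in cutoffs:
            if rating > cutoff:
                usage[cutoff].update(team)
    return usage

# Hypothetical ratings and shortened teams, for illustration only.
teams = [
    (1720, ["Garchomp", "Blissey"]),
    (1650, ["Garchomp", "Luvdisc"]),
    (1540, ["Luvdisc", "Blissey"]),
    (1480, ["Luvdisc"]),
]

usage = usage_by_cutoff(teams)
for cutoff in sorted(usage):
    print(f"Players above {cutoff} are using: {dict(usage[cutoff])}")
```

This only reports the three lists side by side; it does not try to fold them into a single number.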

Saw Doug's latest post just as I was about to hit reply: I don't see any reason as to why we can't find some sort of 'magic number' which makes the Glicko actually tell us what we want it to. Though if Glicko is that arbitrary, maybe we need a new system? I assume the entire ladder rating system would have to be rewritten, which sounds like a nightmare unless there are actually 'better' systems under public domain.
I don't think the Glicko system is arbitrary. Using the actual Glicko rating number for weighting pokemon usage -- that is what I think is arbitrary. From what I know, Glicko is a great system for rating players. I have no reason to think we should stop using it for ranking players on our ladder. But the numbers yielded by the Glicko system were not devised for weighting pokemon usage.

Many people see the rating number and see the usage numbers, and automatically jump to the conclusion that multiplying the two will somehow yield a meaningful result. I think that is silly.

I agree with your statement that perhaps we could use some other method to make Glicko ratings give us an appropriate weighting multiplier. Maybe so. I think X-Act has an interesting approach...

I decided to implement the above "Player1 beats Player2" formula for 200 randomly-generated players. I applied it for every player against every other player, summed up the probabilities and divided by the number of probabilities. I then sorted them in descending order. Here is the result:

(Percentages)

It seems like this percentage is a non-arbitrary way of making weighted stats.

The problem I can see is what Doug said: the R and RD are only 100% accurate at 11:30pm every day, not at the time when the battle was played. At least, however, this percentage would be a non-arbitrary multiplier.
I agree that using the percentage seems to be much more relevant to making weighted stats than using the ratings numbers themselves.

However, as I have said many times in the past -- I really think our current overall rating system is deeply flawed. Any system that encourages players to "reset" their ratings on a fairly frequent basis cannot possibly be accurate for determining "the good players" at any point in time.

On top of that, how can you curb manipulation by players -- since the players themselves are in total control of how much weight is given to their usage of a pokemon? If a player desires a pokemon usage to be weighted heavily then they use the pokemon on a "good alt". If they want it to be weighted lower, then they can use it on a "bad alt".

How can the weighted usage numbers be considered an objective observation when the participants have so much control over what is observed and how much importance will be placed on the observation? This gives way too much ability for players to game the weighted usage rankings.

The rating system is predicated on the assumption that all players always do their best to win every match, and that a player has a single "identity" that can be used to accumulate a rating over time -- forever. That is patently NOT the case. Players are not always playing to win, and players cannot be uniquely identified and consistently rated over time. These two inherent problems cannot be avoided. At least I can't see a way to get around them.

No single issue I have posted is THE reason I am against using weighted numbers for determining pokemon tiers. It's all of them. There are just too many problems.
 

X-Act

np: Biffy Clyro - Shock Shock
I agree that the Shoddy rating system needs improvement, and that's why I had suggested to implement slight changes in our server ratings to try to 'fix' the problems. That was short-lived though as Shoddybattle was still displaying the CRE to the players and they got confused, so we reverted to the old rating method.

The alts problem can be quickly fixed by simply disallowing alts, I suppose. If you want to start over, you'd need to reset your current account, and not start a new one. It would automatically solve the problem of people playing against themselves to gain rating as well...

... and it would somewhat solve the problem of players not always playing to win. If players have only one account, they'll be much less tempted to play lousily. And if players want to test a new team, they could always play an unrated battle with their one and only account. Ladder matches should only be reserved for when two players play to win, and not to test stuff.

Finally, my suggestion to only count Pokemon in winning teams in weighted battles would make players not spam usages of Pokemon. Players that spam the use of Luvdisc would quickly realise that they're not winning as much as they used to, and not many Luvdisc usages would be counted in the weighted stats as they would invariably lose their games with it, not to mention that their rating would suffer as a result. The fact that they would only have one account would also be a deterrent to spam such 'lol' Pokemon.
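
To show what counting only winning teams would do to the numbers, here is a small Python sketch. The battle records and the function name `usage_counts` are invented for illustration, and it leaves out the rating-based weighting to keep the comparison simple.

```python
from collections import Counter

def usage_counts(battles, winners_only=False):
    """Count Pokemon appearances across battles.

    battles is a list of (team, won) pairs. With winners_only=True,
    teams that lost are skipped entirely.
    """
    counts = Counter()
    for team, won in battles:
        if won or not winners_only:
            counts.update(team)
    return counts

# Hypothetical battle records: shortened teams and whether that side won.
battles = [
    (["Garchomp", "Blissey"], True),
    (["Luvdisc", "Blissey"], False),
    (["Luvdisc", "Garchomp"], False),
    (["Garchomp", "Blissey"], True),
]

all_usage = usage_counts(battles)
winning_usage = usage_counts(battles, winners_only=True)
for pokemon in sorted(all_usage):
    print(f"{pokemon:10s} all: {all_usage[pokemon]}  winners only: {winning_usage[pokemon]}")
```

In this made-up sample the Luvdisc teams never win, so Luvdisc disappears from the winners-only count even though it shows up in the plain usage count.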

When all is said and done, however, no matter what you do, no system is perfect, and exploitable loopholes will arise. For example, two players could 'fix' the result. This could happen even in competitive chess, or in any other competitive game, though, so it's hardly a new thing.

So, the bottom line is: disallowing alts would go a long way toward making player ratings much more credible. I don't know if it would be too controversial to do this, however.
 
