Resource Everything You Ever Wanted to Know About Ratings

Antar · Aug 20, 2013

This post has been revised to reflect the current state of Pokemon Showdown rating systems

Many users have no understanding of how a rating system works or its fundamental limitations. This article, which could either go in the forums alongside the Weighted Stats FAQ or on-site somewhere, was written to educate members of the community about our rating system, to answer fundamental questions about how it works and, ideally, to lead to a more informed discussion of our ladder practices. I welcome any and all constructive feedback, especially questions that will help populate the FAQ section.

-------------------------

Introduction
Rating systems and ladders are a fundamental component of many competitive games, from college football to tennis to chess to competitive Pokemon. Pokemon Showdown, like other online games such as League of Legends or Overwatch, implements a rating system to attempt to rank players based on skill and to pair players with similar skill levels for battles. This is not an easy job, and there is not a system on the planet that can do it perfectly. In what follows, I will be attempting to explain the concepts, theory and execution behind Showdown's rating system (and will in the process explain Pokemon Online's as well). For those who are mathematically inclined, I will include technical details. For those of you who aren't, this article was written so as to allow you to skip those sections without negatively impacting your ability to understand the rest of the article.

Background and History
The goal of most rating systems is, most fundamentally, to determine a player's skill level. This can be useful for ranking players, for tournament seeding, or for pairing players of similar skill levels for optimally "interesting" battles, but at the end of the day, the goal is simply to try to determine how good a given player is, to the greatest degree of accuracy.

This is not a new problem, and competitive leagues have been at this for hundreds of years. Easily the simplest rating system is the win-loss record, but that has the problem of rewarding players who only play poor players and punishing those who seek out challenging opponents. Throughout the centuries, more complex points-based systems have been developed to try to correct for this issue but were often criticized as being arbitrary and unfair. Then in 1960, the United States Chess Federation adopted a new rating system designed by chess master and physicist Arpad Elo. His statistics-based rating system became widely popular due to its simplicity and fairness, and half a century later, almost all modern rating systems are built upon his concepts. Pokemon Online and Pokemon Showdown both use variants on Elo as their primary rating systems, although Showdown also implements an extension of Elo called Glicko (more on that in a bit). Older versions of Pokemon Showdown and the old Shoddy Battle used the Glicko-2 rating system, which is an extension of Glicko.

Ratings, Parameters and Estimates
I would say that 90% of people's confusion regarding the various ladders comes down to a lack of conceptual understanding concerning the difference between parameters and estimates. The issue is that a player's true skill level, the piece of information that all rating systems are fundamentally trying to determine, is fundamentally unknowable--there is no realistic way to have every player battle every other player on the ladder, and if there were, there would be no way to guarantee that the same matchup, if repeated, would have the same results. Instead, we have to approximate a player's skill level based on observation of the battles that actually took place.

In the language of statistics, a player's skill level is a "parameter"--if you know that, you know everything about how the player behaves. In contrast, the rating we generate based on the results of his or her battles is an "estimate" of this parameter. If you take nothing else away from this article, that is it, so I will repeat:

The rating shown on a ladder is simply an estimate of a player's skill level, nothing more.

When most people think about climbing the ladder, they think of it in terms of earning "points" for wins and losing them for defeats. This leads to frustration when a highly ranked player sees his or her rating change not at all when they defeat a low-rated opponent (since they didn't gain any points by winning), but this is a mistake in thinking--the "reward" for the highly-rated player is that he or she has had his or her skill estimate "validated"--he or she has successfully shown that the ladder was justified in rating him or her so highly. This is something that bears emphasizing:

Your rating is not something that you "earn." It is something you discover about yourself.

I realize that the above statement may be shocking--the idea that a player should be "rewarded" for battling more is ingrained into most players' psyches. And certainly, we want to encourage players to battle more frequently. What we don't want is players who reach the top of the ladder and "park" there, refusing to battle anyone further for fear of losing their #1 spot to an unlucky defeat, and this is something I will discuss further in a later section. In the meantime, remember that the fundamental purpose of a ladder is not to reward or to punish, but to determine the skill levels of the players on it.

Both Glicko and Elo ratings are estimates of a player's skill, and in the sections that follow, I will explain some of the mechanics underlying these systems and explain how you should interpret your ratings.

Pokemon as a Different Sort of Card Game
Before I get into describing the rating systems themselves, I'm going to describe the mathematical theory at the heart of both systems. If you're not interested in the technical details, feel free to simply skip this section.

An important feature of most rating systems is that they do not require that a better player will always defeat an inferior one. Whether due to chance ("hax") or one player being "in the zone" while another is having an off-day, there will always be some uncertainty regarding the outcome of a match between two players, even if their skill levels are well defined.

To account for this, the Elo and Glicko systems use at their heart a model of "pairwise comparison" called Bradley-Terry, which simplifies all Pokemon battles (or chess games or tennis matches) down to a simple game of chance:

Picture two players, sitting across from each other. Each has his or her own deck of cards, and each card is marked with a number. The players shuffle their decks, and on the count of three, each pulls a single card and places it on the table. The winner is the player whose card has the higher number. The key aspect of this model--and what makes it different from, say, a simple coin toss--is that the two players need not have identical decks, and if one player's deck is stacked with numbers that are generally higher, then that player will be more likely to win.

While in principle, the distributions of cards in the deck could follow any form, in the Bradley-Terry model, the distribution of cards follows the Extreme Value (or Fisher-Tippett) distribution, which is asymmetric about its mean, having a longer positive "tail." This says that it is more likely that a player will occasionally play far above their typical skill level than it will be for them to play far below it. Furthermore, Elo (and consequently Glicko) assumes that these Extreme Value distributions have the same width and vary only by their center.
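For the mathematically inclined, the card game is easy to simulate. The sketch below is only an illustration: the function names, the width `BETA` and the two deck centers are made-up values, not anything taken from Showdown. Each card is a draw from an Extreme Value (Gumbel) distribution with a shared width and a player-specific center, and the simulated win rate is compared against the logistic formula that the difference of two such draws is known to follow.

```python
import math
import random

# Assumed width of every deck; this value makes the result line up
# with the familiar 400-point Elo scale.
BETA = 400 / math.log(10)


def draw_card(center, rng):
    """Draw one 'card': a Gumbel (Extreme Value) sample around `center`."""
    u = rng.random()
    while u == 0.0:  # avoid log(0) on the (very rare) zero draw
        u = rng.random()
    return center - BETA * math.log(-math.log(u))


def simulated_win_rate(center_a, center_b, games=200_000, seed=1):
    """Fraction of games in which player A's card beats player B's."""
    rng = random.Random(seed)
    wins = sum(draw_card(center_a, rng) > draw_card(center_b, rng)
               for _ in range(games))
    return wins / games


def logistic_win_chance(center_a, center_b):
    """Closed-form chance that A's card is the higher one (Bradley-Terry)."""
    return 1 / (1 + math.exp(-(center_a - center_b) / BETA))


if __name__ == "__main__":
    print(simulated_win_rate(1600, 1500))
    print(logistic_win_chance(1600, 1500))
```

The two printed numbers should agree closely, which is the whole point: once the decks are assumed to have the same shape and width, the only thing that matters is the gap between their centers.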

While this model obscures the thrill of a well-made prediction and the frustration of an unfortunately-timed bit of hax, the point is that it's built into the model that a player will not always have the same "performance" every time he or she plays the game, even if his or her skill level hasn't changed, and the up-shot is that there's always a chance that an inferior player will defeat a superior one.

The Unknowable "True Rating"
The underlying principle behind both Elo and Glicko is that a player has a single "parameter" representing his or her skill level, and that parameter governs how the player will perform: as per the previous section, a player at a given skill level will sometimes end up performing under his or her skill level and sometimes over (due to luck, "hax" or simply being "in the zone"), but given enough battles, his or her "average" performance should converge to a well defined value. This average, or expected, performance I will refer to as the player's "true rating." This "true rating" is the end-all, be-all, the holy grail of ratings. Given two players' ratings, it is, in principle, trivial to calculate the odds that one will defeat the other, and it is the goal of both the Elo and Glicko rating systems to estimate a player's "true rating" as closely and as accurately as possible. An important point here is that ratings are absolute--there's no "rock-paper-scissors" element here where Player A usually defeats Player B, Player B usually defeats Player C, but Player C usually defeats Player A--if Player A usually defeats Player B, and Player B usually defeats Player C, then it's assumed that Player A will usually defeat Player C as well.
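To make "trivial to calculate" concrete, here is a minimal sketch of the usual Elo win-probability formula. The 400-point scale is the conventional chess choice and is an assumption here, as is the function name; nothing in this section fixes either.

```python
def win_probability(rating_a, rating_b, scale=400):
    """Chance that A beats B, given both players' 'true ratings'.

    Uses the conventional Elo logistic curve; `scale` (400) is the
    customary value, assumed here rather than taken from any ladder.
    """
    return 1 / (1 + 10 ** ((rating_b - rating_a) / scale))


if __name__ == "__main__":
    # Ratings are absolute: if A is favored over B and B over C,
    # then A is favored over C by an even wider margin.
    a, b, c = 1700, 1600, 1500
    print(win_probability(a, b), win_probability(b, c), win_probability(a, c))
```

Because the probability depends only on the rating difference, there is no room for a rock-paper-scissors cycle in this model.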

The Elo Rating System
I'll start with Elo, which is the system behind the Pokemon Online and Pokemon Showdown ladders. Given a series of wins and losses against a collection of opponents, a player's Elo rating is calculated so as to be the "true rating" that has the greatest likelihood of producing those results. Keep in mind, once again, that sometimes a less skilled player will defeat a more-skilled one, so if you defeat three players with 1200 ratings and then proceed to lose to a player with an 800 rating, that's fine, Elo can account for that (what exactly your rating would be in that case depends on a choice of parameter).

Without going into details about how Elo ratings are actually calculated or updated, one of Elo's main advantages is its simplicity (you can readily calculate how two players' ratings will change as a result of a single match) and transparency. For those used to points-based rating systems where one's rating goes up or down as a direct consequence of winning or losing, Elo feels comforting, aside from the extremes where a player with a much higher rating than his or her opponent will not gain any points from a victory. It was this simplicity and its "fairness" and accuracy when compared to previously-used points-based rating systems that led to its wide-scale adoption, and after more than fifty years, its ubiquity is one of its big selling points.
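For the curious, the textbook single-match update is short enough to show. This is a sketch of plain, unmodified Elo with an assumed K-factor of 32 and the conventional 400-point scale; the function names are mine, and neither Pokemon Online nor Showdown is claimed to use exactly these numbers.

```python
def expected_score(rating, opponent_rating):
    """Expected result (between 0 and 1) under the standard Elo curve."""
    return 1 / (1 + 10 ** ((opponent_rating - rating) / 400))


def elo_update(rating_a, rating_b, score_a, k=32):
    """Return both players' new ratings after one game.

    score_a is 1 for a win by A, 0 for a loss, 0.5 for a tie.
    k=32 is an assumed K-factor, used here only for illustration.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    print(elo_update(1500, 1500, 1))   # evenly matched: winner takes k/2
    print(elo_update(1800, 1000, 1))   # heavy favorite wins: almost no change
    print(elo_update(1800, 1000, 0))   # heavy favorite loses: big swing
```

The second and third calls show the extreme described above: a large rating gap means a win moves the favorite by a fraction of a point, while a loss costs nearly the full K.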

Elo does have a number of downsides, though, the most prominent being that players with high ratings often have little incentive to continue playing and GREAT incentive NOT to play: if you have a rating of 1800 on PO and you end up facing a player with a rating of 1000, a loss will knock 50 points off your rating, whereas a win will net you zero. Consequently, it is not uncommon for top players on Elo-based ladders to "park" themselves at the top and refuse to continue battling altogether. To combat this, Pokemon Online introduced rating decay, which automatically lowers a player's displayed rating for every so-many hours the player stays inactive. Note that this doesn't actually change the player's rating, only what is displayed, and after a few battles, this decay is erased. This system effectively forces players to keep battling to retain their ranking, but there are other related issues that must be guarded against.

Pokemon Showdown's Elo implementation also includes a few tweaks, such as its own decay and a sliding "K-factor" which makes a player gain or lose fewer points the higher his or her Elo score.

The Glicko Rating System
In 1995, Mark Glickman, a statistician who is currently the chair of the United States Chess Federation's Ratings Committee, introduced an extension of the Elo rating system which he called Glicko (which was later extended to Glicko-2). Whereas Elo estimates a player's "true rating" by using only a single value, the player's rating, Glicko adds a second value, called RD (short for "rating deviation"), which quantifies the level of uncertainty in the estimate of a player's "true rating" (Glicko-2 adds on one more variable, called volatility, which measures the erraticness of a player's performance).

Thus, where Elo attempts to directly estimate a player's "true rating," Glicko instead estimates that a player's "true rating" falls within a probability distribution, in this case, a normal distribution of width RD and center R.

Below are two sample distributions: one for a player of rating 1600±100 and one whose rating is 1575±50.

For those unfamiliar with probability distributions, the idea is that the probability of a player's "true rating" being in the infinitesimal interval r<rating<r+dr is f(r)dr.

So looking at these two graphs the question is, who's the better player? Using the Glicko system, it's impossible to know for certain, but what we can determine is the likelihood that each player's "true rating" is above some specific value. For instance, in the case pictured above, there's an 84% chance that the blue player's "true rating" is above 1500 and a 93% chance that the same is true for the red player. On the other hand, there's less than a 1% chance that the red player's true rating is at least 1700, whereas the odds that the blue player is at least at that skill level is nearly 16%.
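The percentages quoted above can be reproduced with a few lines of code. This sketch simply treats each Glicko rating as a normal distribution with mean R and standard deviation RD, as described earlier; the function name is my own.

```python
from statistics import NormalDist


def chance_true_rating_above(r, rd, threshold):
    """Probability that the 'true rating' exceeds `threshold`,
    treating the Glicko rating as Normal(mean=r, sd=rd)."""
    return 1 - NormalDist(mu=r, sigma=rd).cdf(threshold)


if __name__ == "__main__":
    blue = (1600, 100)
    red = (1575, 50)
    for name, (r, rd) in (("blue", blue), ("red", red)):
        for threshold in (1500, 1700):
            p = chance_true_rating_above(r, rd, threshold)
            print(f"{name}: P(true rating > {threshold}) = {p:.3f}")
```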

This showcases the primary problem with Glicko rating systems--whereas under Elo, comparing players was a simple matter of comparing their ratings, under Glicko it's much harder. In addition, Glicko ratings are updated only once per "rating period" (Showdown uses two days or 15 battles). Since players for the most part wouldn't stand for only seeing their ratings change every fifteen battles, players are assigned "provisional" ratings at the end of each battle which are designed to estimate how the new ratings will change. This can lead to confusion when a rating period ends, and the player's rating actually updates.

As with Elo, I won't be going into the technical details concerning implementation, although I will point out a key feature of Glicko: where Elo ladders have to manually implement rating decay to discourage periods of inactivity, Glicko builds inactivity into the rating system, with RD increasing the longer a player remains inactive.
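For readers who do want a glimpse of the mechanics, here is a sketch of the rating-period update as given in Glickman's Glicko paper (linked under Further Reading). It covers only the update from a batch of results, not the RD growth for inactivity, and it is not a copy of Showdown's code; the function names are my own.

```python
import math

Q = math.log(10) / 400


def g(rd):
    """Glickman's attenuation factor: shrinks the impact of uncertain opponents."""
    return 1 / math.sqrt(1 + 3 * Q**2 * rd**2 / math.pi**2)


def expected(r, opp_r, opp_rd):
    """Expected score against one opponent, discounted by that opponent's RD."""
    return 1 / (1 + 10 ** (-g(opp_rd) * (r - opp_r) / 400))


def glicko_update(r, rd, results):
    """Update (r, rd) from one rating period.

    results is a list of (opponent_r, opponent_rd, score) tuples,
    where score is 1 for a win, 0 for a loss and 0.5 for a tie.
    """
    inv_d2 = 0.0   # accumulates 1/d^2, the information gained this period
    surprise = 0.0  # accumulates g * (actual - expected)
    for opp_r, opp_rd, score in results:
        e = expected(r, opp_r, opp_rd)
        inv_d2 += Q**2 * g(opp_rd) ** 2 * e * (1 - e)
        surprise += g(opp_rd) * (score - e)
    precision = 1 / rd**2 + inv_d2
    return r + (Q / precision) * surprise, math.sqrt(1 / precision)


if __name__ == "__main__":
    # Worked example from Glickman's paper: a 1500 (RD 200) player beats a
    # 1400 (RD 30) opponent, then loses to 1550 (RD 100) and 1700 (RD 300).
    print(glicko_update(1500, 200, [(1400, 30, 1), (1550, 100, 0), (1700, 300, 0)]))
```

Note how RD can only shrink in this step: every game played adds information, which is why the inactivity growth has to be applied separately.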

Rating Estimates
So I said that since a player's Glicko rating is not one number but two, it's not nearly as straightforward to directly compare two players' skill levels. Ideally, a player's skill level would always be represented by his or her R and RD, and applications requiring an assessment of a player's skill level would be implemented probabilistically (see: Weighted Stats). Back in the real world, however, things like ladder rankings require us to have a set method of comparing two players and determining which one is "better." The way we do this is through "Conservative Rating Estimates" or CREs. CREs basically say, "I don't know for certain how good a player is, but I'm pretty sure he or she is better than this."

The most famous of these around here is ACRE, which was what Showdown's ladder used to use to rank players. ACRE (the "A" stands for "advanced") corresponds to about the 8th percentile of a player's rating distribution, meaning there's about a 92% chance that a player's "true rating" is at least as high as his or her ACRE. ACRE (and other CREs) is designed to be close to a player's real rating when they've played a lot of games (and Glicko is confident about their rating), and much lower if they haven't (and Glicko isn't confident). Other than that, it has the same problems Elo and Glicko have, with ratings not changing much on wins when you're at the top of the ladder.
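As an illustration of the percentile idea, the sketch below computes a generic conservative estimate as a chosen percentile of the normal rating distribution. It is not the actual ACRE formula, which this article does not spell out; the 8th-percentile default and the function name are assumptions made for the example.

```python
from statistics import NormalDist


def conservative_estimate(r, rd, percentile=0.08):
    """Rating the player is (1 - percentile) likely to exceed.

    A generic percentile-based CRE, assuming the Glicko rating is
    Normal(mean=r, sd=rd). Not Showdown's exact ACRE formula.
    """
    return NormalDist(mu=r, sigma=rd).inv_cdf(percentile)


if __name__ == "__main__":
    # The two sample players from the previous section.
    print(conservative_estimate(1600, 100))  # higher R, less certain
    print(conservative_estimate(1575, 50))   # lower R, more certain
```

With these inputs the lower-rated but better-established player comes out ahead, which is exactly the behavior a conservative estimate is meant to have.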

One proposed alternative was designed for Smogon's Shoddy ladder by X-Act, a mathematician who was very active in the community in those days. His Glicko-X-Act Estimate (or GXE) measures the odds that a player would win a battle against a randomly selected opponent from the ladder. Players' GXEs are displayed along with their Elo and Glicko ratings on the Showdown ladder. While GXE is designed to have concrete meaning, it also does not fix the problem with wins not changing rating much. That being said, GXE is the primary component of COIL, an achievement-centric ladder score we previously used to determine suspect voting requirements.
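One way to sketch the idea behind GXE is to use the expected-outcome formula from Glickman's Glicko paper and model the "randomly selected opponent" as a player about whom nothing is known. The opponent values below (rating 1500, RD 350, the defaults in Glickman's paper) are an assumption for illustration, and the function names are mine; this article does not confirm that Showdown's displayed GXE is computed exactly this way.

```python
import math

Q = math.log(10) / 400


def g(rd):
    """Glickman's attenuation factor for a given rating deviation."""
    return 1 / math.sqrt(1 + 3 * Q**2 * rd**2 / math.pi**2)


def win_chance_percent(r, rd, opp_r=1500.0, opp_rd=350.0):
    """Expected score (as a percentage) against a 'random' opponent.

    opp_r and opp_rd are assumed stand-ins for the average ladder
    player; both players' uncertainties are combined, as in
    Glickman's expected-outcome formula.
    """
    combined_rd = math.sqrt(rd**2 + opp_rd**2)
    exponent = -g(combined_rd) * (r - opp_r) / 400
    return 100 / (1 + 10**exponent)


if __name__ == "__main__":
    print(win_chance_percent(1500, 100))  # exactly average
    print(win_chance_percent(1700, 50))   # above average, well established
    print(win_chance_percent(1700, 200))  # same R, far less certain
```

Greater uncertainty pulls the figure toward 50%, so two players with the same R can show different estimates.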

There's an important point that bears emphasis about GXE and ACRE: people often erroneously refer to these rating estimates as if they were alternatives to Glicko, the same way Elo and Glicko are different rating systems, but in truth:

ACRE and GXE are measures for interpreting Glicko ratings and are not independent systems.

Quick note: Trueskill
Microsoft's Trueskill is another rating system that's been gaining traction lately, and it is occasionally suggested that PS implement Trueskill for its ladder. The problem is that Trueskill is proprietary (read: might require a license) and the differences between Trueskill and Glicko are mostly trivial.

Summary
At the end of the day, if you come away with nothing else after reading this article, I hope it's the following:

Ratings are not rewards: while one's rating may go up after winning a battle and down after losing one, at the end of the day one's rating, as implemented on Pokemon Online and Pokemon Showdown is designed to be an accurate measure of one's skill level.
It is important to differentiate between a player's "true rating," which is not knowable, his or her Glicko rating, which is not a single number but rather a rating and an uncertainty, RD, that together define the likelihood of his or her "true rating" falling within a given range, and a player's Rating Estimate, either expressed as ACRE or GXE, which whitewashes most of the nuances of your Glicko rating.

Further Reading

For those interested in a more in-depth look at the theory behind statistical rating systems, and Elo in particular, I recommend this paper by Mark Glickman: http://www.glicko.net/research/acjpaper.pdf
Here is a paper by Glickman outlining his Glicko rating system: http://www.glicko.net/glicko/glicko.pdf
And a paper describing Glicko-2: http://www.glicko.net/glicko/glicko2.pdf

Frequently Asked Questions
Ask away!

Aldaron · Aug 20, 2013

Antar said:
Ratings are not rewards: while your rating may go up after winning a battle and down after losing one, at the end of the day your rating is designed to be an accurate measure of your skill level.

By this you mean the rating systems you described here seek to do that, right? What if the purpose of the desired rating system is not to find a person's skill level (which, from what we've observed in the past 6 years of popular online Pokemon laddering, is probably not practically viable, as pretty much no one thinks the ladder rankings are "true measures" of people's skill) but actually to measure performance in that ladder?

By this I mean, what if we don't care about finding a player's "true skill" but just want to find out how he has performed in that ladder whenever he has participated in it?

In that scenario, isn't the rating system actually a reward system that we hope to measure?

I know you started this article off with "inform about our ladder practices", which means you're describing the perspective with which we've approached ladder rating systems historically and presently, but I'm asking why we assume that perspective is even what we desire? What if we have no desire to attempt to find "true skill" (considering how few people actually consider any of the shoddy / PO / PS rating systems to ever represent true skill) and simply want to measure how a person has performed in a specific ladder during a specific time?

This is a very important question, particularly because what most people use as a measure of true skill is actually tournament performance, so I would further ask why endeavoring to have a system to find "true skill" on the ladder is even worth our time.

Antar · Aug 20, 2013

@Aldaron, I have updated that bullet point to clarify, but the issue you raise is a valid one. If people wanted to, we *could* implement a point-based rating system. I approached this article, as you correctly inferred, from the perspective of "this is what we've done up until now," but you are correct that point-based rating systems exist to this day, and the article needs to reflect that.

Edit: done. Let me know if you see anywhere else where the language should be changed to reflect this alternate perspective.

Zarel · Aug 24, 2013

Oh, @Antar, if you were wondering why ACRE stands for "Advanced CRE"; it originally stood for "aeo's CRE" but then I changed my username and around the same time decided that naming a CRE after myself was really narcissistic.

Also, at the time I renamed it, the ACRE formula was much more complicated, it was something like:

R
- (RD - 50) if RD > 50
- (RD - 100) if RD > 100
- (RD - 150) if RD > 150
- (RD - 200) if RD > 200

I simplified it after too many complaints about people gaining points for losing battles.

Erico9001 · Jan 2, 2014

Antar said:
At the end of the day, if you come away with nothing else after reading this article, I hope it's the following:

Ratings are not rewards: while your rating may go up after winning a battle and down after losing one, at the end of the day your rating, as implemented on Pokemon Online and Pokemon Showdown is designed to be an accurate measure of your skill level.

While you may wish for people to not view their ranks as rewards, unfortunately it is impossible. The biggest reason people compete in showdown is to prove their skill and smartness and in effect make themselves feel competent. This means that a number or a series of numbers used to estimate the player's skill level will be seen as a reward no matter what because it is the player's only evidence of their skill that they can show other people. At least people can find comfort in knowing PS's rating system is not absolute and convince themselves their true skill level is higher than what is shown, whether they're right or wrong. You know, perhaps it is a good thing that the ranks system isn't perfect for this reason.

As a side note – I just noticed Zarel's message count is 1337. Very nice.

josehand1 · Jan 15, 2014

Hi there,

Could anyone explain how it's now? I don't need ACRE anymore and I see Elo, but Glicko 2 and GXE are still there. If I understood correctly, Elo and Glicko are two different rating systems. Are both being used together?

Antar · Jan 15, 2014

Hi, josehand1! PS is currently using Elo for its primary rating system and is calculating and displaying Glicko2 rating as well (though that's going to be replaced with Glicko1 as soon as possible). Moving forward, we'll be implementing a new kind of rating system that's entirely different from what's described in this post, in that the goal will be to reward achievement on the ladder rather than to accurately assess player skill. This system will exist in tandem with one of our skill-based rating systems--we're ironing out the details now. Stay tuned!

asbdsp · Jan 16, 2014

Antar said:
Moving forward, we'll be implementing a new kind of rating system that's entirely different from what's described in this post, in that the goal will be to reward achievement on the ladder rather than to accurately assess player skill. This system will exist in tandem with one of our skill-based rating systems--we're ironing out the details now. Stay tuned!

Wow, was just thinking about how they ought to do something like this on the toilet yesterday. An exciting new dawn for PS, one feels.

Qoseph · Jan 30, 2014

Thanks, this changed my view. One part in particular especially: Rating is not something you earn, it's something you discover about yourself. I never thought of it like that.

Mulan15262 · Dec 13, 2014

For Elo and Glicko-1's median ranking value, does the amount of battles you played in the past affect your ranking?

If that's not clear, suppose I have an Elo of 1300. I win several battles and raise it up to a 1500, and then lose several and lower it to that same 1300. Would it be any easier or harder to bring it up to 1500 again than last time, or would it be about the same?

Zarel · Dec 13, 2014

For Elo: the same

For Glicko-1: harder

Arcanuke · Jul 26, 2016

Sorry if this question was already answered in the article, but with the Glicko-1 rating system, does having a more erratic performance result in a lower rating overall than if you won and lost a consistent amount? And how is consistency measured?

Antar · Jul 26, 2016

Arcanuke IIRC this is the chief improvement of Glicko-2, that it takes into account performance consistency. Off the top of my head I don't know how differently two players with the same average performance but different volatilities would be scored, but I suspect Glickman may talk about it in one of his papers.

pyuk · Aug 26, 2016

From what I can tell, in a pure ELO system, the winner gains as many points as the loser loses. The PS! ladder definitely doesn't adhere to that rule, so the formula has clearly been modified in some way. Would it be possible for the actual formula that PS! uses to be released? I fear that the ladders may be subject to extreme deflation if a very highly skilled player floods the top of the ladder with many, many alts instead of sticking to just one or two. This would be easy to prove in a pure ELO system, as each alt is gaining all but a few points (the ones awarded for beating players rated at 1000) by taking them from the rest of the players, but PS! doesn't use one, so for all I know it might have proper safeguards in place. Still, it would be nice to have the real formula, so I can be sure.

Zarel · Aug 27, 2016

MacChaeger said:
From what I can tell, in a pure ELO system, the winner gains as many points as the loser loses. The PS! ladder definitely doesn't adhere to that rule, so the formula has clearly been modified in some way. Would it be possible for the actual formula that PS! uses to be released? I fear that the ladders may be subject to extreme deflation if a very highly skilled player floods the top of the ladder with many, many alts instead of sticking to just one or two. This would be easy to prove in a pure ELO system, as each alt is gaining all but a few points (the ones awarded for beating players rated at 1000) by taking them from the rest of the players, but PS! doesn't use one, so for all I know it might have proper safeguards in place. Still, it would be nice to have the real formula, so I can be sure.

I'll write up a detailed description of how it works in a bit...

The short answer is that PS increases gains and decreases losses when your Elo is between 1000-1200 (and doesn't allow it to fall below 1000), and is regular Elo with K-scaling above 1200, with the addition of rating decay.

Zarel · Aug 28, 2016

MacChaeger, Antar

Our ladder displays four ratings: Elo, GXE, Glicko-1, and COIL.

Elo is the main ladder rating. It's a pretty normal ladder rating: goes up when you win and down when you lose.

GXE (Glicko X-Act Estimate) is an estimate of your win chance against an average ladder player.

Glicko-1 is a different rating system. It has rating and deviation values.

COIL (Converging Order Invariant Ladder) is mainly used for suspect tests. It goes up as you play games, but not too many games.

Note that win/loss should not be used to estimate skill, since who you play against is much more important than how many times you win or lose. Our other stats like Elo and GXE are much better for estimating skill.

PS Elo

Your rating starts at 1000.

Our Elo implementation uses K-scaling. The K factor is:

K = 50 if Elo is 1100 – 1299
K = 40 if Elo is 1300 – 1599
K = 32 if Elo is 1600 or higher

We have a rating floor of 1000 (If your rating would fall below 1000, it is set to 1000). This makes it unnecessary to create new accounts to "fix" your rating.

Between 1000 and 1100, we have some special behavior:

If Elo is 1000, K = 80 for the winner and K = 20 for the loser. Between 1001 to 1099, K scales linearly from 80 to 50 for the winner and from 20 to 50 for the loser. This helps spread out low ladder people between 1000 and 1100 instead of causing the rating floor to cluster them all at 1000.
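Putting the K-factor rules and the rating floor together, a single game could be sketched like this. It is written from the description above rather than from the server source, so treat the details as assumptions: in particular, that the linear scaling reaches K = 50 exactly at 1100, that no rounding is applied, and that the expected score uses the standard 400-point Elo curve. The function names are made up for the example.

```python
def k_factor(elo, won):
    """K-factor per the tiers described above."""
    if elo >= 1600:
        return 32
    if elo >= 1300:
        return 40
    if elo >= 1100:
        return 50
    # Between the 1000 floor and 1100: assumed linear blend toward K = 50.
    progress = max(0.0, (elo - 1000) / 100)
    return 80 - 30 * progress if won else 20 + 30 * progress


def expected_score(elo, opponent_elo):
    """Standard Elo expectation (assumed 400-point scale)."""
    return 1 / (1 + 10 ** ((opponent_elo - elo) / 400))


def rating_after_game(elo, opponent_elo, won):
    """New rating for one player after one game, with the 1000 floor applied."""
    score = 1.0 if won else 0.0
    change = k_factor(elo, won) * (score - expected_score(elo, opponent_elo))
    return max(1000.0, elo + change)


if __name__ == "__main__":
    print(rating_after_game(1000, 1000, True))   # new player wins
    print(rating_after_game(1000, 1000, False))  # new player loses: held at floor
    print(rating_after_game(1650, 1650, True))   # high ladder: K = 32
```

Since the winner and loser can have different K values here, the points gained by one side need not equal the points lost by the other, which is the non-zero-sum behavior asked about earlier in the thread.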

In OU and randbats, we have rating decay above 1400. Every day:

If you played 6 or more games, there is no decay
If you played 1-5 games, you lose 1 point for every 100 points above 1500 you are
If you played 0 games, you lose 1 point for every 50 points above 1400 you are

In all other tiers, we have rating decay above 1500. Every day:

If you played 6 or more games, there is no decay
If you played 1-5 games, you lose 1 point for every 100 points above 1700 you are
If you played 0 games, you lose 1 point for every 50 points above 1500 you are

This helps combat rating inflation.

Note that there's no "official" Elo standard. K-scaling and rating floors are common, rating decay somewhat common, and our dynamic K-scaling seems to be unique.

PS Glicko-1

Your rating starts at R = 1500, RD = 130.

We use a rating period of 24 hours and an RD range of 25 to 130, with a system constant of 6.6775026092.
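To see what those numbers imply for inactivity, here is a sketch using the RD growth step from Glickman's Glicko paper, with the constants above plugged in. Whether the server applies the step in exactly this form is an assumption, and the function name is made up. The constant matches sqrt((130^2 - 25^2) / 365) to the digits shown, which would mean a fully settled RD of 25 drifts back to 130 after about a year without games; that reading is an inference from the numbers, not something stated above.

```python
import math

RD_MIN = 25.0
RD_MAX = 130.0
C = 6.6775026092  # system constant quoted above


def rd_after_inactivity(rd, idle_periods):
    """RD after `idle_periods` rating periods (days) with no games.

    Uses Glickman's growth step sqrt(RD^2 + c^2 * t), clamped to the
    RD range quoted above. Assumed form, not taken from server code.
    """
    grown = math.sqrt(rd**2 + C**2 * idle_periods)
    return min(RD_MAX, max(RD_MIN, grown))


if __name__ == "__main__":
    for days in (0, 30, 180, 365):
        print(days, round(rd_after_inactivity(25, days), 1))
```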

OrdA · Aug 28, 2016

Zarel said:
If you played over 5 games, there is no decay

If you played 1-4 games, you lose 1 point for every 100 points above 1500 you are

If you played 0 games, you lose 1 point for every 50 points above 1400 you are

Based on the source code found here, I think there's a few mistakes in this. Some possible corrections:

If you played over 5 games, there is no decay
If you played 1-5 games, you lose 1 point for every 100 points above 1400 you are
Code:
```
$decay = 0 + intval(($elo-1400)/100)
```
If you played 0 games, you lose 1 point, plus one for every 50 points above 1400 you are
Code:
```
$decay = 1 + intval(($elo-1400)/50)
```

Ladders besides OU and randbats decay by 2 points less, meaning that the decay will be 1 point for every 100 points above 1600, and 1 point plus one for every 50 points above 1500, respectively.

Code:

default:
     $decay -= 2;

Zarel · Aug 29, 2016

OrdA said:
Based on the source code found here, I think there's a few mistakes in this. Some possible corrections:

If you played over 5 games, there is no decay

If you played 1-5 games, you lose 1 point for every 100 points above 1400 you are $decay = 0 + intval(($elo-1400)/100)

If you played 0 games, you lose 1 point, plus one for every 50 points above 1400 you are $decay = 1 + intval(($elo-1400)/50)

"above 1500" and omitting "plus one" are intentional. It's mathematically equivalent.

1-5 games: lose 1 point at 1500, 2 points at 1600, 3 points at 1700, etc...
0 games: lose 1 point at 1400, 2 points at 1450, 3 points at 1500, etc...
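For anyone who wants to check their own numbers, the rule can be sketched as below. It follows the `intval` expressions quoted earlier in the thread (Python's `int()` truncates the same way for these values); the function and parameter names are made up, and the guards that stop decay below 1400 and keep it from going negative are assumptions based on the description, not copied from the source.

```python
def daily_decay(elo, games_today, ou_or_randbats=True):
    """Points lost to decay for one day.

    ou_or_randbats=False applies the 2-point reduction used for
    all other ladders.
    """
    if games_today > 5 or elo < 1400:
        return 0  # active enough, or below the decay range (assumed guard)
    if games_today >= 1:
        decay = int((elo - 1400) / 100)
    else:
        decay = 1 + int((elo - 1400) / 50)
    if not ou_or_randbats:
        decay -= 2
    return max(0, decay)  # assumed: decay never adds points


if __name__ == "__main__":
    for elo in (1400, 1450, 1500, 1600, 1700):
        print(elo, daily_decay(elo, 3), daily_decay(elo, 0),
              daily_decay(elo, 3, False), daily_decay(elo, 0, False))
```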

OrdA · Aug 29, 2016

Zarel said:
If you played 1-5 games, you lose 1 point for every 100 points above 1500 you are

Zarel said:
1-5 games: lose 1 point at 1500, 2 points at 1600, 3 points at 1700, etc...

Oh. Maybe it's just me reading this wrong. To me, 1 point for every 100 points above 1500 would mean that I lose 1 at 1600 (because that's the first time I have 100 points over 1500). Then 2 at 1700, and so on.

To you, it seems to be "1 point for every started 100" (similar to how you pay for phone calls for every minute you started).

ABR · Sep 1, 2016

Zarel said:
MacChaeger, Antar

Our ladder displays four ratings: Elo, GXE, Glicko-1, and COIL.

Elo is the main ladder rating. It's a pretty normal ladder rating: goes up when you win and down when you lose.

GXE (Glicko X-Act Estimate) is an estimate of your win chance against an average ladder player.

Glicko-1 is a different rating system. It has rating and deviation values.

COIL (Converging Order Invariant Ladder) is mainly used for suspect tests. It goes up as you play games, but not too many games.

Note that win/loss should not be used to estimate skill, since who you play against is much more important than how many times you win or lose. Our other stats like Elo and GXE are much better for estimating skill.

PS Elo

Your rating starts at 1000.

Our Elo implementation uses K-scaling. The K factor is:

K = 50 if Elo is 1100 – 1299

K = 40 if Elo is 1300 – 1599

K = 32 if Elo is 1600 or higher

We have a rating floor of 1000 (If your rating would fall below 1000, it is set to 1000). This makes it unnecessary to create new accounts to "fix" your rating.

Between 1000 and 1100, we have some special behavior:

If Elo is 1000, K = 80 for the winner and K = 20 for the loser. From 1001 to 1099, K scales linearly from 80 to 50 for the winner and from 20 to 50 for the loser. This helps spread low-ladder players out between 1000 and 1100 instead of letting the rating floor cluster them all at 1000.
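
The K rules and the rating floor described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the server implementation: the function names are hypothetical, K is assumed to depend on the player's own rating, the 1000 to 1100 interpolation is assumed to reach K = 50 exactly at 1100, draws are ignored, and no rounding is applied. The expected-score expression is the standard Elo one.

```python
def k_factor(elo, won):
    """K for one player, based on that player's own rating (assumption)."""
    if elo >= 1600:
        return 32
    if elo >= 1300:
        return 40
    if elo >= 1100:
        return 50
    # 1000-1099: assumed linear interpolation that meets K = 50 at 1100.
    t = (elo - 1000) / 100
    return 80 - 30 * t if won else 20 + 30 * t


def expected_score(elo, opponent_elo):
    """Standard Elo expected score against one opponent."""
    return 1 / (1 + 10 ** ((opponent_elo - elo) / 400))


def updated_elo(elo, opponent_elo, won):
    """New rating after one game, with the 1000 rating floor applied."""
    score = 1.0 if won else 0.0
    change = k_factor(elo, won) * (score - expected_score(elo, opponent_elo))
    return max(elo + change, 1000)
```

Under these assumptions, two new players at 1000 end up at 1040 (winner) and 1000 (loser, held by the floor), which shows how the asymmetric K pushes low-ladder ratings apart.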

In OU and randbats, we have rating decay above 1400. Every day:

If you played 6 or more games, there is no decay

If you played 1-5 games, you lose 1 point for every 100 points above 1500 you are

If you played 0 games, you lose 1 point for every 50 points above 1400 you are

In all other tiers, we have rating decay above 1500. Every day:

If you played 6 or more games, there is no decay

If you played 1-5 games, you lose 1 point for every 100 points above 1700 you are

If you played 0 games, you lose 1 point for every 50 points above 1500 you are

This helps combat rating inflation.

Note that there's no "official" Elo standard. K-scaling and rating floors are common, rating decay somewhat common, and our dynamic K-scaling seems to be unique.

PS Glicko-1

Your rating starts at R = 1500, RD = 130.

We use a rating period of 24 hours and an RD range of 25 to 130, with a system constant of 6.6775026092.
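
As a rough illustration of what these numbers imply, the sketch below applies the standard Glicko-1 between-period step, RD' = sqrt(RD^2 + c^2 * t), for a player who plays no games for t rating periods. Treating the 25 to 130 range as a simple clamp is an assumption, and the function and constant names are hypothetical.

```python
import math

RD_MIN = 25.0       # lower end of the RD range quoted above
RD_MAX = 130.0      # upper end of the RD range, also the starting RD
C = 6.6775026092    # system constant quoted above


def rd_after_idle(rd, periods):
    """RD after `periods` rating periods (days) with no games played."""
    grown = math.sqrt(rd * rd + C * C * periods)
    # Assumption: the RD range is enforced by clamping.
    return max(RD_MIN, min(grown, RD_MAX))


# A fully "settled" player (RD = 25) who stops playing for a year:
print(rd_after_idle(25.0, 365))
```

With this constant, sqrt(25^2 + 365 * c^2) works out to approximately 130, so under this sketch a completely inactive player drifts from the minimum RD back to the starting RD in about 365 daily rating periods.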

Can you clarify what "every day" means? I asked this somewhere else but it was never answered and this post explains my question well:
http://www.smogon.com/forums/thread...posting-a-thread.3520646/page-53#post-6933426

Zarel · Sep 1, 2016

ABR said:
Can you clarify what "every day" means? I asked this somewhere else but it was never answered and this post explains my question well:
http://www.smogon.com/forums/thread...posting-a-thread.3520646/page-53#post-6933426

Apparently SQSA isn't getting answers reliably? That kinda sucks.

Anyway, the answer is "once per calendar day". Rating periods roll over (Elo calculates decay and Glicko-1 rolls over a rating period) at midnight GMT (4pm-8pm in the US).
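
For anyone who wants to work out when that is locally, here is a small Python sketch. It assumes the rollover fires exactly at 00:00 UTC (treating GMT and UTC as the same thing); the real job may run at a different moment. The function name and the two fixed-offset zones are chosen purely for illustration.

```python
from datetime import datetime, timedelta, timezone


def next_rollover(now):
    """Next midnight UTC strictly after `now` (must be timezone-aware)."""
    now_utc = now.astimezone(timezone.utc)
    today_midnight = now_utc.replace(hour=0, minute=0, second=0, microsecond=0)
    return today_midnight + timedelta(days=1)


# Fixed offsets instead of named zones, to keep the sketch dependency-free.
PACIFIC_STANDARD = timezone(timedelta(hours=-8))
EASTERN_DAYLIGHT = timezone(timedelta(hours=-4))

rollover = next_rollover(datetime(2016, 9, 1, 12, 0, tzinfo=timezone.utc))
print(rollover.astimezone(PACIFIC_STANDARD))
print(rollover.astimezone(EASTERN_DAYLIGHT))
```

Midnight UTC lands at 16:00 for a UTC-8 clock and 20:00 for a UTC-4 clock, which is where the "4pm-8pm in the US" range comes from.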

ABR · Sep 1, 2016

Zarel said:
Apparently SQSA isn't getting answers reliably? That kinda sucks.

Anyway, the answer is "once per calendar day". Rating periods roll over (Elo calculates decay and Glicko-1 rolls over a rating period) at midnight GMT (4pm-8pm in the US).

Thank you!

Edit: I lost points while I was sleeping (at least 4 hours after the time you said decay happens)?

Hifructose · Feb 6, 2018

Am I missing something, or is it possible to calculate your skill percentile using your stats and scores and the number of players?

pokemonisfun · Jan 6, 2023

can anyone update the image - seems broken - I assume Antar is gone now though rip you, have a nice life :(

Sijih · Jan 29, 2023

pokemonisfun said:
can anyone update the image - seems broken - I assume Antar is gone now though rip you, have a nice life :(

Here are two recreations of the graph for you. The left one is probably what it looked like. In the right one I added indicators of where the 1500 and 1700 numbers Antar is using in the examples are.

I'm pretty sure you and I may be the only people who ever look at this thread, but I'll post this graph here first so anyone can give feedback/make the graph look nicer. Maybe I could add some shading to show probabilities, or add lines to indicate where the distribution means are.

I can also post the code I used to generate this if anyone wants.

If there are no objections I'll ask a member of upper staff to edit one of my graphs into the post in a day or three.
