Smogon Community Tournament formats and Rating systems

/edit: I've added a summary of the systems discussed to the bottom of this post.

Note: I got the okay from Groudon80 to make this thread.

I've been thinking about running my own tournament, but because tournament applications are locked, I spent my time researching tournament formats and rating systems, in the hope of running a very fair and very informative tournament. I started on this journey thinking that the research would be simple, but it is not. Tournaments and ratings are in fact a difficult mathematical problem, and some people base their PhD theses on attempts to make these systems fairer.

And believe me, when I discovered this, I did indeed wish to keep it simple. But reading up on a few of the non-technical papers, I am convinced that at least a bit of sound theory is necessary in order to conduct a fair and proper tournament.

Overview of Tournament Formats

Quote:
 Naive: 3.having or marked by a simple, unaffectedly direct style reflecting little or no formal training or technique.
The naive approaches work, but they can be improved upon. They are the simplest to understand and are the tournament formats that we would normally use. The naive tournament format is the Single Elimination Tournament, which is the vast majority of what this forum has to offer. Fortunately, research does indeed show that the naive "power of two" Single Elimination Tourney is actually optimal. There are a few issues, but I'll discuss them later.

If you want to adhere to the KISS principle, I suggest using Single Elimination. However, there are issues with the Single Elimination Tourney that must be addressed. The first one is that it is easy for the "true best player" to get unlucky and lose one round. The best player would then be knocked out, maybe early, maybe later, but either way, the "true" champion is not really known.

For example, Snow Cloak or Sand Veil has a 20% chance of activating, which translates to a 4% chance of activating 2 turns in a row. However, in the first round alone of a 16 person Tournament, the chances of this happening somewhere in the tournament (or similar, like a key critical-hit hax) is 39%. I'm not only counting Snow Cloak hax here, this is also the same probability as Dynamic Punch hitting 4 or 5 times in a row. While hax do not happen often in a single game, when many people are playing many games, a "hax loss" is a statistical certainty somewhere along the bracket.
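To put numbers on this, here is a small sketch of the "at least once" calculation. It assumes every chance is independent and equally likely; the function name and the counts of 8, 12 and 16 chances are my own illustration, not taken from any of the papers.

```python
def chance_of_at_least_one(p_single, n_chances):
    """Probability that an event with probability p_single per chance
    happens at least once across n_chances independent chances."""
    return 1.0 - (1.0 - p_single) ** n_chances


# 20% evasion proc two turns in a row: 0.2 * 0.2 = 4%
p_double_proc = 0.20 * 0.20

# Hypothetical counts: 8 first-round games, 12 chances, 16 players
for n in (8, 12, 16):
    print(n, round(chance_of_at_least_one(p_double_proc, n), 3))
```

With a 4% event, 8 chances (one per first-round game) give about 28%, 12 chances give about 39%, and 16 chances (one per player) give about 48%. Whichever way you count, the risk grows quickly with the number of games played.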

With that, comes the Double Elimination Tournament. You have 2 chances. If you get haxxed once, you get sent to the loser's bracket and then face the champion later on. And from this concept also stems the Triple Elimination Tournament, where you have 3 chances to get screwed. Needless to say, these systems are far more complicated than the Single Elimination Tournament, and they take up more time to do. However, they are clearly more fair as the best player has a better chance of actually winning, while lucky (but worse) players have a much better chance of losing. More on these later.
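To give a rough idea of how much longer these formats take, here is a sketch that counts matches. It relies only on the fact that every match produces exactly one loss; the function names are mine, and byes and scheduling are ignored.

```python
def single_elim_matches(n_players):
    # Everyone except the champion loses exactly once.
    return n_players - 1


def single_elim_rounds(n_players):
    # Rounds needed once the bracket is padded up to a power of two.
    return (n_players - 1).bit_length()


def double_elim_matches(n_players):
    # Everyone except the champion loses twice; the champion loses
    # zero times or once, so there are two possible totals.
    return (2 * n_players - 2, 2 * n_players - 1)


print(single_elim_matches(16), single_elim_rounds(16))
print(double_elim_matches(16))
```

For 16 players this comes to 15 matches over 4 rounds for Single Elimination, against 30 or 31 matches for Double Elimination, so roughly twice the work for the organizer.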

At the most extreme, we have Swiss Tournaments. These ensure that all players get to play during all rounds, and at the end the win/loss records are compared. The primary advantage is that all players get to play every round, so you get the most balance.
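A very simplified sketch of how one Swiss round could be paired: sort everyone by wins so far and pair neighbours. Real Swiss rules also avoid rematches and use tie-breaks, which this sketch deliberately leaves out; the function name and the sample players are made up.

```python
def swiss_pairings(wins):
    """Pair players with similar records for the next round.

    wins: dict mapping player name -> wins so far.
    Returns (pairs, bye), where bye is the unpaired player or None.
    """
    ranked = sorted(wins, key=lambda player: wins[player], reverse=True)
    # With an odd number of players, the lowest-ranked one sits out.
    bye = ranked.pop() if len(ranked) % 2 else None
    pairs = [(ranked[i], ranked[i + 1]) for i in range(0, len(ranked), 2)]
    return pairs, bye


standings = {"A": 2, "B": 2, "C": 1, "D": 1, "E": 0}
print(swiss_pairings(standings))
```

Nobody is eliminated, so a single unlucky loss only drops you into a lower score group instead of ending your tournament.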

Approaches to Ratings

The naive approach to ratings is the Win / Loss system. That is to say, after a tournament is done, you tally up the win/losses for every player, and then publish it. The most wins is the best, and the most losses is the worst. It is easy to understand, and easy to apply. But there are major issues. It is not uncommon to have many people with the same win/loss record. We can expect many people to have 4 wins, 4 losses for example, and the win/loss record doesn't help to determine who is better here.

Also, someone who has an 8-2 win/loss record might be better than someone who has a 9-1 win/loss record. The 9-1 guy may just have got "lucky" because he faced 9 easy opponents, while the 8-2 guy faced all hard opponents.

Chess players solved this problem a while ago, and their system has evolved to tackle new problems and challenges (and I might add... the win/loss system doesn't work there either). The systems used in Chess today are:

The Elo system (old but mature and battle tested for decades)
The Glicko system (Elo with modifications. Years of proven results)
The Glicko2 system (Glicko with modifications. State of the Art rating system)

The Elo system is explained in 49+ pages of detail over here. You can understand it by reading the first 20ish pages. Here's an executive summary (and yes, I'm making up words to help explain it):

The Elo system assumes every player has an "average ability" and a "played ability". The "played ability" is how well you actually played during a given round, while the "average ability" is how good you are on average. Professor Elo then turned this "average ability" into a number. So if your average ability is the same as someone else's average ability, you have a 50% chance of beating them. ("Beating them" in mathematical terms means that your "played ability" was better than their "played ability".) If your average ability is 35 points higher than someone else's average ability, then you have a 55% chance of winning. So on and so forth.

Elo then provided a method to make your score closer to the true score based on how you performed in a tournament. The formula is listed in that paper.

This system takes into account how good your opponent was. Let's say you are rated 1500, you play someone rated 1600, and you win 45% of the time. Although your opponent won more often than you did, you will gain points while he loses points. This is so that the system gets a better estimate of your true rating. A 1600 player should win about 64% of the time against you, and you should win only about 36% of the time. Therefore, you played better than expected, and he played worse than expected, and the points are adjusted accordingly.
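Here is a minimal sketch of that adjustment, using the logistic form of the Elo formula that most modern implementations use (Elo's original paper works from a normal distribution, which gives nearly the same numbers). The K-factor of 32 and the 20-game sample are assumptions of mine, not something fixed by the system.

```python
def expected_score(rating, opponent_rating):
    """Expected fraction of points against the opponent (logistic Elo)."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400.0))


def updated_rating(rating, opponent_rating, wins, games, k=32):
    """New rating after `games` against one opponent with `wins` wins.
    k (the K-factor) is an assumed value; organizers pick their own."""
    expected_wins = games * expected_score(rating, opponent_rating)
    return rating + k * (wins - expected_wins)


print(round(expected_score(1535, 1500), 2))  # 35 points higher
print(round(expected_score(1600, 1500), 2))  # 100 points higher

# The 1500 player wins 9 of 20 games (45%) against the 1600 player.
print(updated_rating(1500, 1600, wins=9, games=20))
print(updated_rating(1600, 1500, wins=11, games=20))
```

A 35-point gap comes out at about 0.55 and a 100-point gap at about 0.64. Because the 1500 player was only expected to win roughly 7 of the 20 games, winning 9 moves his rating up and the 1600 player's rating down by the same amount.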

The Glicko system goes ahead and realizes that, hey, I can only estimate someone's rating. So instead of giving a single number as an estimate, it gives a range. For example, a new player can have a rating of 1200 to 1800, while a well-seasoned player will have a rating of 1800 to 1850. The more you play, the more precise the system gets with its estimate.

This way, if a 1550 to 1650 player faces a newbie who has a rating of 1200 to 1800, the new player will gain many points when he wins, but the older player won't lose too many points. This is because the system is "unsure" of how good the new player is, so it won't penalize the old player's rating that much, while it is fairly sure how good the older player is. (Notice that his range is 1550 to 1650, while the newbie's range is 1200 to 1800.)
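As a rough sketch of how this works in numbers, here is the core Glicko update as I understand it from Glickman's description, where each player has a rating and a "ratings deviation" (RD) and the range is roughly rating plus or minus 2 RD. Reading "1550 to 1650" as RD 25 and "1200 to 1800" as RD 150 is my own interpretation, and the step that widens RD again after inactivity is left out.

```python
import math

Q = math.log(10) / 400.0


def g(rd):
    """Discount factor: the less certain the opponent's rating, the less a game counts."""
    return 1.0 / math.sqrt(1.0 + 3.0 * Q ** 2 * rd ** 2 / math.pi ** 2)


def expected(r, r_opp, rd_opp):
    """Expected score against an opponent rated r_opp with deviation rd_opp."""
    return 1.0 / (1.0 + 10 ** (-g(rd_opp) * (r - r_opp) / 400.0))


def glicko_update(r, rd, results):
    """results: list of (opponent_rating, opponent_rd, score),
    with score 1 for a win, 0.5 for a draw, 0 for a loss.
    Returns the new (rating, rd)."""
    d2_inv = Q ** 2 * sum(
        g(rd_j) ** 2 * expected(r, r_j, rd_j) * (1.0 - expected(r, r_j, rd_j))
        for r_j, rd_j, _ in results
    )
    denom = 1.0 / rd ** 2 + d2_inv
    change = (Q / denom) * sum(
        g(rd_j) * (s - expected(r, r_j, rd_j)) for r_j, rd_j, s in results
    )
    return r + change, math.sqrt(1.0 / denom)


# The newbie (1500, RD 150) beats the veteran (1600, RD 25).
print(glicko_update(1500, 150, [(1600, 25, 1)]))
print(glicko_update(1600, 25, [(1500, 150, 0)]))
```

With these assumed numbers the newbie gains on the order of 70 points while the veteran loses only about 2, and the newbie's RD shrinks, which is the "range gets narrower as you play" behaviour.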

The Glicko system is explained in detail here (requires a postscript reader). An example of the Glicko system in action is here.

------------

That is all the time I have for now. I'll post more of what I've learned later. Hopefully, we all can learn a little about tournament formats and make Smogon an even better place to competitively battle pokemon. Yeah, I didn't get to discuss the research on Double Elimination Tourneys (Double-Elimination Tournaments: Counting and Calculating by Christopher T. Edwards) or Glicko2, among other things... perhaps I'll have time later.

/edit: Summary!

Elo: Everyone gets a rating. If you win against someone, you get points and they lose points. If their rating is higher than yours, a lot of points are exchanged. If you win against someone with a lower rating than yours, few points are exchanged.

Glicko: Mostly the same as Elo, except instead of having a set rating, you have a range of ratings. It basically says, "You are rated somewhere between these scores", which helps account for newer players having less fixed ratings. The more you battle, the narrower this range becomes.

Glicko2: Mostly the same as Glicko, except it adds another factor (the "volatility factor"), which is a measure of how consistent you are. If you win 50% of your matches, you are more consistent if you win every other match of 100 battles than if you win 50 matches, then lose 50 matches. - Obi

One thing I would like to note is that single elimination tournaments are by far the most enjoyable. Every match counts and if you lose, you don't have to keep battling and potentially keep losing.. Single Elim sorta maximises the enjoyment for the winner while minimising the suffering of losers.. It may not be the most accurate system, but I wouldn't want to change.. Have a nice day.

My problem with tournaments like that is that they can end up very long. A 1 month tourney can take 2 months or even more. It's even worse when your opponent is gone for a little while, and it becomes a hassle finding suitable times when you can both play. Forced decisions have to be made, and although you weren't posting then, we definitely don't want problems like in Smogon Tourney 3.

It's funny that I seem so against this, I love the idea.
Quote:
 Sidebar: why does it automatically call the person fat when you quote them? I think its really funny, but what is this all about. It's like first, you sir, are fat. Second, this is what you said. Third, this is my response.

Quote:
 One thing I would like to note is that single elimination tournaments are by far the most enjoyable. Every match counts and if you lose, you don't have to keep battling and potentially keep losing.. Single Elim sorta maximises the enjoyment for the winner while minimising the suffering of losers.. It may not be the most accurate system, but I wouldn't want to change..
I don't really know, there's nothing that sucks more than losing in R1 of an important tournament because of hax.

I'm most comfortable with a Swiss style tourney, probably because that's what I'm used to from mtg.

 Sep 17th, 2007, 8:45:43 PM #5
Aeroblacktyl

Wait, so what is the point you're trying to get across and what purpose are you trying to achieve? If it's for one single short term tournament, then the way I always understood it is, the one who wins all his matches wins. If it's a long term tournament to see who is the 'best' I thought that's what the Smogon Tour was for...
Quote:
 If it's a long term tournament to see who is the 'best' I thought that's what the Smogon Tour was for...
honestly, the tour is a pretty damn good way to see who's good. rating systems aren't flawless... nothing is. there are always going to be matches that don't finish, and in that case, it is bullshit to award someone a win when no battle took place.


Quote:
 One thing I would like to note is that single elimination tournaments are by far the most enjoyable. Every match counts and if you lose, you don't have to keep battling and potentially keep losing.. Single Elim sorta maximises the enjoyment for the winner while minimising the suffering of losers.. It may not be the most accurate system, but I wouldn't want to change.. Have a nice day.
I've found Double Elimination to be the most fun. That way people who fight a really strong opponent first round still get a chance to keep going and those who got lucked get another chance. It's also pretty intense in the finals if the loser's bracket winner wins.

I'm pretty sure there are a lot of cases in SSBM (played Double Elimination 99% of the time) where someone who lost in the first or second round finished 1st or 2nd in the tournament.

Quote:
 Wait, so what is the point you're trying to get across and what purpose are you trying to achieve? If it's for one single short term tournament, then the way I always understood it is, the one who wins all his matches wins. If it's a long term tournament to see who is the 'best' I thought that's what the Smogon Tour was for...
My point is just to hold a discussion of tournaments for now. This topic is far deeper than one would first assume, and a comprehensive topic should increase the quality of the typical tournament, as well as give a theoretical background on the subject.

You only need to click on a few of the links in my post to realize how much advanced (i.e., way beyond calculus) math is involved here. I'm seeing linear algebra and matrices, calculus, and triple integrals in some of these papers, all to answer the question of "How can I build a better tournament" in one form or another.

I figure a discussion should bring the ideas from these papers to the general tournament organizer on Shoddy. Needless to say, making a good tournament is a non-trivial problem and it must be treated as such.

I admit that I don't know much about Smogon's format (like the Smogon Tour), mostly because I've never seen the rules or the webpage up. Would anyone care to tell me exactly how the Smogon Tour works? (Very specifically, like how points are awarded and such.)

Over a long period of time, clearly the best way to figure out who is best is to use a proven rating system like Elo, Glicko or Glicko2. The systems have been mathematically proven to converge at the correct values, they are consistent, and the numbers actually tell you the probability that person A would win over person B. Further, Elo has been tested in the field by international gaming organizations such as FIDE (Fédération Internationale des Échecs, the World Chess Federation) and the USCF (United States Chess Federation).

Ever hear that Fischer is a 2700+ rating? Or that Chess masters are rated 2200+ ? This is the Elo rating system in action.

Course, there may be something like that already in the Smogon Tour set-up... or maybe not. So yeah :-/ I've only joined Smogon on the D/P generation, and the Smogon tour ruleset has never been up for my stay here. Again, explanations would be wonderful :-)

 Sep 18th, 2007, 1:53:02 AM #9
Hipmonlee

As far as an overall rating system goes, the tour is pretty poor.. When we have this database of tournament results that this forum provides, it seems silly not to try to use them.. Especially with a game as heavily luck based as pokemon. Have a nice day.
 Sep 18th, 2007, 2:15:36 AM #10
Aeolus

Glicko looks like an ideal system to implement as a script into competitor.... let's not have that debate here though.
 Sep 18th, 2007, 2:19:39 AM #11
CLegacyM

I already sent PMs about this but I think the Smogon Tour isn't open enough. They have specific times to play which may be inaccessible for most people that go out on the weekend or work on the weekend as part-time employees. A year-long ladder with special tournaments for the top 16 or 32 or whatever in the ladder would probably provide more enjoyment for most of the site and give us a real ranking system for the players around here based on actual ladder matches so that not everything is based on tournaments and the possibility of luck.
 Sep 18th, 2007, 2:42:06 AM #12
Aeolus

when you complain that tours aren't held enough... realize that a real live person has to commit themselves to hosting each and every one of them... which is, at minimum, a weekly two hour commitment (during prime weekend nights) from start to finish. Two per week is plenty... we offered three for one season and there was poor turnout. The tour is not meant to measure skill, it is a tournament just like all the rest.
Quote:
 I already sent PMs about this but I think the Smogon Tour isn't open enough. They have specific times to play which may be inaccessible for most people that go out on the weekend or work on the weekend as part-time employees. A year-long ladder with special tournaments for the top 16 or 32 or whatever in the ladder would probably provide more enjoyment for most of the site and give us a real ranking system for the players around here based on actual ladder matches so that not everything is based on tournaments and the possibility of luck.
So you expect people to be dedicated for a WHOLE YEAR, yet not find some time to participate in some late night weekend tournaments over the course of 3 months?
 Sep 18th, 2007, 12:11:11 PM #14
CLegacyM

Well... Well it's like Aeolus said, it's just a tournament only it's spread over a different timeline and format. The problem is that it's an exact time that you have to do all your matches and if you miss a couple of the tour dates you will have a harder time getting into the final tournament etc.

The idea of a ladder would be to have maybe scheduled matches each week (for people that sign up or that are willing to participate in the tour) or maybe just matches that people can set up on their own (if this was the case, maybe imposing a limit on the number of times you can play a certain person on the ladder). With this in place it would be a little more lax on the timeframe that people have to play. People would have all week to play rather than just a couple of hours with limited sign-ups and only having one person with an intensive 3 hour program (tour) they have to organize and run.

This wouldn't even have to be a full year thing. It could be maybe a 3 month period, then a tournament for the top players and maybe also a smaller tournament for the lower players. Maybe a week or month break in between and then start the ladder up again or continue from where it left off before. Starting up again would allow for improvement of a lower tiered player to reach for the top faster while the declining players would be overtaken. However, going by past ladder rankings if we implemented a tier/point system like dragontamer was talking about with everyone starting at 1000 and getting more or less points from playing higher ranked or lower ranked people and so on so forth, it wouldn't really need to be reset after the tournaments.

Anyway, it's just a suggestion. It would probably require some effort and work that there might not be the will to do here. But I think this site would be the best one to run something like this especially considering the competition among the community and the high demand for tournaments in general. With a permanent competitive ladder it would offer sort of a way to do less work in the end. Maybe a weekly update at best. I guess it would just provide a better alternative to the MLG ladder because this site is better recognized and has a better playing format to use than Wifi.
 Sep 18th, 2007, 12:57:24 PM #15
Aeolus

All that does is change the tour to make it just like every other tournament held on the forum except with a larger ranking system. The whole point of the tour series is that each installment finishes in a compressed amount of time. If you can't be there at the designated times, you are welcome to play in one of our many other tournaments that have less stringent time requirements.

A season of the old tour had 18 installments... each of them a 32 man tournament spread over 9 weeks. Do you have any idea how long it would take to finish 18 separate 32 man tournaments if we attempted to do them in the format you suggest? It WOULD be a full year thing, if not longer. If we were to do something like you suggest, the number of actual battles that would take place for points would be drastically lower. That said, I'm not against a ladder... I just don't think it has any place in the tour.
Quote:
 This wouldn't even have to be a full year thing. It could be maybe a 3 month period, then a tournament for the top players and maybe also a smaller tournament for the lower players. Maybe a week or month break inbetween and then start the ladder up again or continue from where it left off before. Starting up again would allow for improvement of a lower tiered player to reach for the top faster while the declining players would be overtaken.
Actually, that would not be required if we implemented the Glicko2 system. The system detects when you are rapidly improving (or rapidly getting worse >_>) and then starts changing your score faster. Neither Glicko nor Elo does this.

Quote:
 However, going by past ladder rankings if we implemented a tier/point system like dragontamer was talking about with everyone starting at 1000 and getting more or less points from playing higher ranked or lower ranked people and so on so forth, it wouldn't really need to be reset after the tournaments.
In Elo, your starting ranking is 1500. In Glicko/Glicko2, your starting Ranking is 1500 +/- 350. And it goes from there.

Quote:
 Glicko looks like an ideal system to implement as a script into competitor.... let's not have that debate here though.
I want it in Shoddy, but Colin says that no one would care about it >_>. It is a philosophical argument however and doesn't really pertain to this topic of course.

----------

There are advantages to Elo. It is very easy to run a tournament based on Elo, even if you are not a programmer. There are many online sites with Javascript Elo calculators.

However, Glicko is a new system, and Glicko2 is even newer. You'll essentially have to program it yourself if you want to use them. Thus, while Glicko and Glicko2 systems take into account more factors, the Elo system will be useful for anyone who wants to sponsor their own tournament.

EDIT: Talking about Swiss Tournaments, there is a surprising amount of depth in organizing them as well. Professor Glickman (inventor of Glicko) is one guy who I'll be referring to a lot... and he has a design here that is similar to Swiss but apparently is better. I haven't gotten around to reading it yet however.

EDIT2: Ignore that paper unless you're a fucking statistics major. >_> That paper makes too little sense to be of much use.

 Sep 18th, 2007, 5:07:06 PM #18
Aeroblacktyl

For some unknown reason I highly, highly, highly doubt we'll be creating a new subforum or any other 'league' system. I mean, if it was really that important, it might automatically be implemented into Competitor already but no rating system of any kind will be involved. Just a hunch. But all this envisioning is highly amusing and intriguing. Plus, it seems that this is attempting to be 'taking over' the Tour's attraction, and if that were the case, why didn't we just join MLG when they came knocking on our door a few weeks ago...hmm
 Sep 18th, 2007, 6:23:26 PM #19
Aeolus

When I read your post, Siege, it seems like you suggest that the tour be supplanted by the ladder... or at least made secondary to it in some overarching system. That is all I was reacting to.

I don't know how I feel about elevating some other sort of system to the same level (or as you present it, to a higher level) than both the tour and the official smogon tournament. Though, I'm not going to veto the idea from the get-go. That would be silly since competitor is not here and we have no idea when it will be.

I kinda wish this thread would die since none of it is applicable to our current situation. I doubt Smogon will run a ladder out of Shoddy Battle, so I think this whole thing is premature.

Moved to C&C because this isn't a tournament and the tournament forum is for tournaments. Yeah.
 Sep 18th, 2007, 11:10:43 PM #20
Dragontamer

Hmm, moved to C&C... I was looking for this thread for a while. Anyway, yeah, I really don't know where this kind of discussion would go. Probably not the last move we're gonna have :-/

So you think that we should wait till Competitor comes out before actually thinking about these sorts of things? I guess it is a fair assessment, but we probably can set something unofficial up to start testing these ideas (whenever Groudon80 opens up Tournament Applications again).

I'm an empiricist first and foremost, so I don't think any of these ideas should be implemented without proper testing. I guess if Siege wants his system to start, the best way is to wait for Groudon80 to open up a tournament, and he can start holding beta testing of the system.
 Sep 20th, 2007, 5:36:39 AM #21
Hipmonlee

One thing I think has been made very clear, is that this will not be a part of competitor.. It would have to be scripted into a server or done separately from the server altogether. Have a nice day.
 Sep 20th, 2007, 8:05:20 PM #22
david stone

I don't really see a problem with just discussing something, even if we cannot do something about it right now. That way, when we actually can implement it (should we choose to do so), we'll already have an idea of just what it is we want to do. If we wait until we can do something before we even start thinking about whether we want to do it, that will increase the 'downtime' so to speak.

Basically, a summary of the Elo rating system, for those unfamiliar with it, would be something like this: Player 1 has a rating of 2000. Player 2 has a rating of 1500. If Player 1 wins, he'll gain some amount of points (let's say 10, just to give it a number, but you can really define the spread to be as big or as small as you want), and Player 2 will lose some number of points (usually the same, so in this case, 10, but it can be different). If Player 2 wins, he'll gain even more points (let's say 20), and Player 1 will lose some number of points. In the event of a tie, Player 2 would gain points (like 5, but it could be different, depending on how ties are considered), and Player 1 would lose points. This is because Player 1 is expected to win because of his higher rating, so when he does win, it's not as big of a deal.

One of the problems with this is that a new player has to be given some sort of rating (usually around 1500), and he could be a Pokemon master or a complete fish. If he's really good, he'll crush the people at or above his rating and gain huge points (as they lose huge points), which is obviously unfair. If he sucks, then mediocre players can challenge him and have an incredibly easy battle. One of the ways used to combat this is to give players a "provisional" rating for their first few games, meaning their rating is invisible to everyone else until they get a real rating, and games they fight will only affect their own rating, not the ratings of the people they fight. After they complete a certain number of battles, they would become a regular player.
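To make the point exchange and the "provisional" idea concrete, here is a rough sketch. The K-factor of 32, the 10-game provisional cutoff and the dict layout are all assumptions for illustration; a real ladder would choose its own values.

```python
K = 32                  # assumed size of the point spread
PROVISIONAL_GAMES = 10  # hypothetical cutoff for "first few games"


def expected_score(rating, opponent_rating):
    """Expected fraction of points against the opponent (logistic Elo)."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400.0))


def record_game(a, b, score_a):
    """Update two players in place.

    a, b: dicts with 'rating' and 'games'.
    score_a: 1 if a won, 0.5 for a tie, 0 if a lost.
    """
    change_a = K * (score_a - expected_score(a["rating"], b["rating"]))
    a_prov = a["games"] < PROVISIONAL_GAMES
    b_prov = b["games"] < PROVISIONAL_GAMES
    # An established player's rating is left alone when the opponent
    # is still provisional; in every other case the rating moves.
    if a_prov or not b_prov:
        a["rating"] += change_a
    if b_prov or not a_prov:
        b["rating"] -= change_a
    a["games"] += 1
    b["games"] += 1


player1 = {"rating": 2000, "games": 50}
player2 = {"rating": 1500, "games": 50}
record_game(player1, player2, 0)  # the 1500 player pulls off the upset
print(player1, player2)
```

With these assumed values, the 2000 player would gain only about 2 points for beating the 1500 player, while the 1500 player gains about 30 for the upset and about 14 for a tie. The exact numbers depend entirely on the chosen K.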
Another problem is the question of whether we want to have our ratings as close to absolute skill or relative skill as possible. If we want absolute skill, then a player with a rating of 1500 in 2004 should have a 50% chance to win against a player with a rating of 1500 in 2008, if they could somehow travel through time to fight each other. If we want relative skill, then two players with ratings of 1500 in 2008 should have a 50% chance to win against each other, but if the average skill level increases over time, a 1500 rated 2008 player would actually be better than a 1500 rated 2004 player.

One issue specific to Pokemon would be how a rating system handles different rule sets and generations. The best player at RBY isn't necessarily the best at GSC, but I think we can agree there is a correlation between, for instance, being good at ADV OU and ADV UU. This correlation is greater than between ADV OU and RBY OU, or DP ubers. We could decide to break up versions and tiers, and have performance in one not affect the other, or we could do some sort of scaling rating (doing well in GSC increases your ADV rating, but not as much as doing well in ADV). The second method would require a decision on how related they are.
 Sep 21st, 2007, 8:38:01 PM #23
david stone

I edited this summary into the first post, but I'm posting it here as well so people don't miss it. Without getting into the formulas, this is a summary of each system:

Elo: Everyone gets a rating. If you win against someone, you get points and they lose points. If their rating is higher than yours, a lot of points are exchanged. If you win against someone with a lower rating than yours, few points are exchanged.

Glicko: Mostly the same as Elo, except instead of having a set rating, you have a range of ratings. It basically says, "You are rated somewhere between these scores", which helps account for newer players having less fixed ratings. The more you battle, the narrower this range becomes.

Glicko2: Mostly the same as Glicko, except it adds another factor (the "volatility factor"), which is a measure of how consistent you are. If you win 50% of your matches, you are more consistent if you win every other match of 100 battles than if you win 50 matches, then lose 50 matches.
 Sep 21st, 2007, 11:51:38 PM #24
Dragontamer

Thank you Obi. And yes, what you posted is how I understand the 3 systems to work.
 Sep 26th, 2007, 5:53:01 AM #25
Hipmonlee

http://research.microsoft.com/mlp/apg/Details.aspx

How similar is this to glicko?

Also I was reading this: http://www.codinghorror.com/blog/archives/000961.html and from that found this: http://www.lifewithalacrity.com/2006...g_systems.html

Have a nice day.