Hello Smogon, it has been a while. I'm probably not going to stay very long, but I've been doing a lot of research into tournaments recently, and I thought that I should get it all down somewhere. Considering how l33t this site is, I think this is the best place for me to get this down. I hope you will find this subject as interesting as I did.
The (Mathematical) Problem
The tournament is designed to solve the "Comparison Sorting" problem in Computer Science. IE: the goal is to find the best player, using only a direct comparison between two competitors. If we are only interested in the best player, and if we assume that every comparison (ie: match) ever made tells the absolute truth with 100% certainty, then the standard single-elimination bracket appears to solve the problem.
However, lets make one new assumption: What if the comparison is not an absolute comparison, but instead a statistic? As we all know, there is a large degree of chance in the game of Pokemon. Stronger, Better players only have a better chance of beating a weaker player, especially when the skill divide between the two players is close.
Thus, this is not a straightforward problem. Once we consider chance and uncertainty in battles, the "sorting" problem becomes invalid. After all, what if by chance, the number #1 player loses in the first round? Heck, in a 32 man single-elimination tournament, even if the #1 player has a 90% chance of winning every battle against everyone else, there is over a 40% chance that the #1 player will lose.
On the otherhand, if we view matches as unreliable, probabilistic data, we are dealing with a statistics problem instead.
Formulating the Statistics Problem: Maximum Likelyhood and The Linear Model
Note: Forgive me Math / Statistics Majors, I'm an Engineer. So if I screw up this explanation, please correct me.
There are two things we need before we can solve this new problem. First, we need a set of assumptions, some way to model how players behave. Second, we need an idea of what a "good answer" is. A way to define an optimal "fit" between our model and the data we have collected.
Starting at the model, my suggestion is to choose a model that most people recognize. If anyone has ever played on Shoddybattle, the "Glicko2" system (with its "Conservative Rankings Estimate") is based on something called the "Linear Model". "The Linear Model", defined by Herbert A David [2], is as follows. Suppose that every player has a "real strength". Lets take two of these players and call them Alice and Bob. Now, the probability that Alice beats Bob is based on some probability function of: Magic_Function(Alice's Strength - Bob's Strength).
But what Magic_Function to use? Fortunately, there are two in popular use right now, and according to Glickman [4], the choice between them is a "moot issue". Thus, I'll just go into detail about the Bradley-Terry model. Those who are interested can look up the other one, the Thurstone-Mosteller Model (which leads to TrueSkill, Glicko, and Glicko2). Analysis of Bradley-Terry eventually will lead to the Elo system, and the ranking system I am proposing.
Here is the definition of the Bradley-Terry Model. Assume that the difference in Alice's and Bob's strength leads to the following probability: Alice's Strength / (Alice's Strength + Bob's Strength). (Which... if you are curious, is actually 1/4 integral_(-(Alice's Strength - Bob's Strength)) ^ infinity (sech^2 (1/2 x) dx). But it is easier to use the first formula. Nevertheless, it IS a linear model, a function of (Alice's Strength - Bob's Strength))
Anyway, this formula states that if Alice's "True" strength is 5, and Bob's "True" strength is 10, then Alice will win 5 times out of 15, and Bob will win 10 times out of 15. Note: if you take 400 * Log(Alice's Strength) + C (where C is value close to ~1000), you get the value that the "Elo" Ranking System tries to track, and a number that is similar to the Glicko2 system loaded to the ShoddyBattle server.
Long Story Short: This model is one that you all trust already. If you think Glicko2 is doing a good job, you'll probably agree that the above model works as well. And while Glicko2 converges to a Linear Model, the method I present here is different. So pay attention, now it changes a bit.
The second thing needed is a method to fit the model (ie: the BradleyTerry model) to the data (ie: the matches in a tournament). One popular method is the Maximum Likelyhood Estimation. Based on the data we have, what is the set of strengths that would give the highest chance of creating the tournament results, assuming the model is correct? The "Maximum Likelyhood Estimation" finds this out.
Note: Elo, Glicko, Glicko2, and Trueskill do not use Maximum Likelyhood to calculate your ranking. These systems are designed to estimate the player's strength, even as the player's strength changes from week to week, or from month to month. MLE takes all the data as "equally important", no matter how old or new it is. So MLE would work better for say... a short term tournament, in comparison to Glicko2, which would work better for long-term leagues / ladders. From what I can see, Elo / Glicko2 and so forth are based on the Newton-Raphson method, or some other fixed-point iteration.
The Algorithm
Fortunately, this part isn't nearly as hard as you might expect. A Maximum Likelyhood Estimation on the BradleyTerry Model is a solved problem, and a simple algorithm exists. In Herbert A David's book, he goes over the algorithm a bit. http://www.stat.psu.edu/~dhunter/papers/bt.pdf has the algorithm as well, and the matlab code for the algorithm is avaliable at http://www.stat.psu.edu/~dhunter/code/btmatlab/.
(Specifically, http://www.stat.psu.edu/~dhunter/code/btmatlab/btmm.m )
As you can see, the algorithm is quite simple, but there is an important assumption that was made (see Hunter's paper):
1. There are two sets of people in your data where no one played anyone else. IE: A, B, and C played each other. D, E and F all played each other, but {A, B, C} never played anyone in {D, E, F}. Obviously, you cannot compare {A, B, C} against {D, E, F} if they never played each other. For a practical example, in a Single-Elimination Tournament with 16 players, the #16 seed cannot be compared with the #15 seed for this reason. #16 lost to #1, and #15 lost to #2, but there is no real way to compare #16 and #15.
2. There is a set of people that no one defeated. IE: Lets say A beat everyone in {B, C, D}. This case actually does come up occasionally in tournaments (and it always comes up in single-elimination tournaments). If you run the algorithm, A would be considered "infinitely" better than {B, C, D} and the algorithm will essentially fail to converge. It makes sense, but lets say A beat {B, C, D} and then B beat {C, D}. Now B is considered infinitely better than {C, D} and A is infinitely better than B. Either way, the algorithm is not designed to handle this case. It simply doesn't converge.
Fortunately, this is a standard graph problem in Computer Science. It is called the "Strongly Connected Component", where a "win" is considered an edge in the graph, and the players are the nodes. http://en.wikipedia.org/wiki/Strongly_connected_component
Here is a simple algorithm to find the entire strongly-connected component given a player P.
1. Perform a Breadth First Search starting at P, where a forward edge is a "win".
2. Perform a Breadth First Search starting at P, where a forward edge is a "loss".
3. Take the intersection of the two sets, and you have a Strongly Connected Component. IE: The set you have now is the entire set of people who are connected either by a win or a loss to P.
Note: It doesn't really matter if you use a Breadth First Search or a Depth First Search. As long as you visit all the nodes that are visitable from that node, then you're good.
The algorithm can then be run on this set, and rank everyone in the set according to the Bradley Terry Model. Afterwards, it appears that the graph of these components will form a Directed Acyclic Graph (a DAG), which means you can sort it to a Topological Sorting. (http://en.wikipedia.org/wiki/Topological_sorting). This, combined with the algorithm to rank players within components, will provide a full ranking for all of the players in an arbitrary tournament format. Note that while multiple topological sortings exist, if you organize your tournament correctly, you should have a single "best" winner.
Applying the Math to a Tournament
The above math is very general, and does not immediately suggest a tournament format. Any and all formats can technically have the above math applied to it. (Swiss, Double-Elimination, etc. etc.)
However, I can point out a few things from practice. In the BlazBlue forums, I have experimented with this format a little bit, and it seems to work out more "fairly" than any other method I've used. There are a few caviats however.
1. The performance of your opponents, as well as the opponents of your opponents, (and so forth) affect your score. That is to say, if you beat someone who eventually becomes a high-rank, you will likely be a high rank as well. Beating someone of low rank doesn't "disqualify" you really, but you don't benefit as much as beating a high ranked person. On the other side, losing to someone with a high rank won't actually affect your score as much as losing to someone with a low rank.
2. A few members participating in the tournament told me that they were worried about NCAA Football BCS Bullshit. People don't like to be graded by a computer program. They prefer simple methods they can do themselves as opposed to complicated (but rigorous) methods. However, if you understand what is going on here, I'm sure you'll see that a Most Likely estimate over the Bradley Terry Model (or, some other Linear Model) is as statistically fair as you can get.
Sample Format
Over at Dustloop (BlazBlue Forums), here is a description of what I do.
16 Man Tournament.
Every person is put into a 4 man pod.
Round 1:
Each pod plays Round Robin.
Round 2:
People with 3 wins are placed in Pod 1. People with 2 wins are placed in Pod 2, etc. etc. Every pod then plays Round Robin again.
Round 3:
Rounds continue in this fashon. We usually do either 2 or 3 rounds depending on how long the tournament is lasting. People with similar amounts of Wins / Losses are placed into the same pod, so that people play with similarly skilled players.
Finals:
All of the Rounds Data is plugged into the program I wrote that does the calculations I described above. The top 4 players then play a single-elimination tournament. The best one is the winner of the tournament.
In comparison to the Status Quo
This is begging the question I guess. Why use this over Single Elimination / Double-Elimination? Why use this over Glicko2? Why use this over Round Robin?
I'll start off with the easiest one. Round Robin takes too damn long to do. Plus, this is a system that can take data from a Round Robin tournament, and find the best fit of player strengths to the data. This method provides a better fit than "Win / Loss" records in a Round Robin tournament. So even if you wanted to do a Round Robin Tournament, it would simply be more fair to send the data through this algorithm.
Okay, why this over Glicko2? Glicko2 converges in the long run, but it was not designed for a single shot tournament. Glicko2 is a long-term ranking system, and it doesn't really look "back" at the choices it makes. Take this example: if you lose to the #1 player in round 1, your score will be affected a lot in Glicko2 (because you have the same rating, and both of you have a very large ratings deviation). Later, after several rounds have gone by, the #1 player's score is now high and his deviation is low, and later players who face him don't get as big of a penalty as you did in round 1. However, this "Regression" system looks at all the data, essentially simultaniously. It does go back into the past, notes that your Round 1 loss was against the #1 player, and then negates the penalty a little bit. All players who lose to the #1 player will be equally penalized, while if you did the Glicko system, the "earlier" players are penalized more than the later ones. This is clearly unfair.
On the other hand, if you don't start everyone at an equal Rating, then their initial rankings will affect the end results of the tournament. A low-ranked player may never have had a chance to win. Either way, Glicko2, while useful for a long-term ranking / ladder system (and superior to this method for long-term ladders), it inferior to this system in a short-term one-shot tournament.
What about Single Elimination / Double Elimination ?? Well, IMO, these are extremely unfair tournament systems. It is very unlikely that the "real" best player actually wins the tournament, because he can only lose once or twice. Nevertheless, people like Elimination Tournaments, so I suggest that a small number of "Finals" be done using Single Elimination or similar.
Extra Stuffs
Herbert A David's book [2] goes into much detail about the full breadth of statistical methods that can be applied to the Linear Model. You can calculate the variance of the strengths of the players. You can perform a hypothesis test to see the probability thta you've actually got the strongest player. More or less, all standard statistics can be found from this model. It also goes a little bit into the combinatorics of this sort of thing.
Mark Glickman has a few papers on optimizing the pairings between players so that the entropy can be maximized between rounds. IE: so that you get the most amount of information from the fewest number of rounds.
For those into the Programming Language "R", there is a BradleyTerry package that performs the algorithm. Unfortunately, it does not do the Strongly Connected Component check, so there are times when the algorithm just won't converge on a dataset. Nevertheless, this is IMO the fastest and easiest way you can start experimenting with the BradleyTerry model.
I do have a crappy implementation that I programmed myself. It was programmed in Java... rather crappily... in command line, but it works. (and it does have an implementation of the Strongly Connected Components algorithm as well) There are few comments and this post provides better documentation of what I did... but if you still want the code, shoot me a private message. I'm willing to share, but I don't have Webspace >_<
References
[1] http://www.stat.psu.edu/~dhunter/papers/bt.pdf
[2] Herbert A David: The Method of Paired Comparisons (Published Book 1988)
[3] http://research.microsoft.com/pubs/74417/NIPS2007_0931.pdf
[4] Mark Glickman: Comprehensive Guide to Chess Ratings. http://math.bu.edu/people/mg/research/acjpaper.pdf
---------------
Yeah, that was long. I hope it was interesting to those who bothered to read it. Maybe you'll find a use of this format, maybe not. As far as I know, I can't think of a more fair way to run a tournament.
The (Mathematical) Problem
The tournament is designed to solve the "Comparison Sorting" problem in Computer Science. IE: the goal is to find the best player, using only a direct comparison between two competitors. If we are only interested in the best player, and if we assume that every comparison (ie: match) ever made tells the absolute truth with 100% certainty, then the standard single-elimination bracket appears to solve the problem.
However, lets make one new assumption: What if the comparison is not an absolute comparison, but instead a statistic? As we all know, there is a large degree of chance in the game of Pokemon. Stronger, Better players only have a better chance of beating a weaker player, especially when the skill divide between the two players is close.
Thus, this is not a straightforward problem. Once we consider chance and uncertainty in battles, the "sorting" problem becomes invalid. After all, what if by chance, the number #1 player loses in the first round? Heck, in a 32 man single-elimination tournament, even if the #1 player has a 90% chance of winning every battle against everyone else, there is over a 40% chance that the #1 player will lose.
On the otherhand, if we view matches as unreliable, probabilistic data, we are dealing with a statistics problem instead.
Formulating the Statistics Problem: Maximum Likelyhood and The Linear Model
Note: Forgive me Math / Statistics Majors, I'm an Engineer. So if I screw up this explanation, please correct me.
There are two things we need before we can solve this new problem. First, we need a set of assumptions, some way to model how players behave. Second, we need an idea of what a "good answer" is. A way to define an optimal "fit" between our model and the data we have collected.
Starting at the model, my suggestion is to choose a model that most people recognize. If anyone has ever played on Shoddybattle, the "Glicko2" system (with its "Conservative Rankings Estimate") is based on something called the "Linear Model". "The Linear Model", defined by Herbert A David [2], is as follows. Suppose that every player has a "real strength". Lets take two of these players and call them Alice and Bob. Now, the probability that Alice beats Bob is based on some probability function of: Magic_Function(Alice's Strength - Bob's Strength).
But what Magic_Function to use? Fortunately, there are two in popular use right now, and according to Glickman [4], the choice between them is a "moot issue". Thus, I'll just go into detail about the Bradley-Terry model. Those who are interested can look up the other one, the Thurstone-Mosteller Model (which leads to TrueSkill, Glicko, and Glicko2). Analysis of Bradley-Terry eventually will lead to the Elo system, and the ranking system I am proposing.
Here is the definition of the Bradley-Terry Model. Assume that the difference in Alice's and Bob's strength leads to the following probability: Alice's Strength / (Alice's Strength + Bob's Strength). (Which... if you are curious, is actually 1/4 integral_(-(Alice's Strength - Bob's Strength)) ^ infinity (sech^2 (1/2 x) dx). But it is easier to use the first formula. Nevertheless, it IS a linear model, a function of (Alice's Strength - Bob's Strength))
Anyway, this formula states that if Alice's "True" strength is 5, and Bob's "True" strength is 10, then Alice will win 5 times out of 15, and Bob will win 10 times out of 15. Note: if you take 400 * Log(Alice's Strength) + C (where C is value close to ~1000), you get the value that the "Elo" Ranking System tries to track, and a number that is similar to the Glicko2 system loaded to the ShoddyBattle server.
Long Story Short: This model is one that you all trust already. If you think Glicko2 is doing a good job, you'll probably agree that the above model works as well. And while Glicko2 converges to a Linear Model, the method I present here is different. So pay attention, now it changes a bit.
The second thing needed is a method to fit the model (ie: the BradleyTerry model) to the data (ie: the matches in a tournament). One popular method is the Maximum Likelyhood Estimation. Based on the data we have, what is the set of strengths that would give the highest chance of creating the tournament results, assuming the model is correct? The "Maximum Likelyhood Estimation" finds this out.
Note: Elo, Glicko, Glicko2, and Trueskill do not use Maximum Likelyhood to calculate your ranking. These systems are designed to estimate the player's strength, even as the player's strength changes from week to week, or from month to month. MLE takes all the data as "equally important", no matter how old or new it is. So MLE would work better for say... a short term tournament, in comparison to Glicko2, which would work better for long-term leagues / ladders. From what I can see, Elo / Glicko2 and so forth are based on the Newton-Raphson method, or some other fixed-point iteration.
The Algorithm
Fortunately, this part isn't nearly as hard as you might expect. A Maximum Likelyhood Estimation on the BradleyTerry Model is a solved problem, and a simple algorithm exists. In Herbert A David's book, he goes over the algorithm a bit. http://www.stat.psu.edu/~dhunter/papers/bt.pdf has the algorithm as well, and the matlab code for the algorithm is avaliable at http://www.stat.psu.edu/~dhunter/code/btmatlab/.
(Specifically, http://www.stat.psu.edu/~dhunter/code/btmatlab/btmm.m )
As you can see, the algorithm is quite simple, but there is an important assumption that was made (see Hunter's paper):
Also, in the Matlab code.In every possible partition of individuals into two nonempty subsets, some individual in the second set beats some individual in the first set at least once
This is an important assumption actually, and in my experience it is one that is often violated, at least... case #2 as follows tends to violate it.% This algorithm does not contain any checks for lack of convergence;
% it is assumed that for any partition of the teams into sets A and B,
% at least one team from A beats at least one team from B at least once.
1. There are two sets of people in your data where no one played anyone else. IE: A, B, and C played each other. D, E and F all played each other, but {A, B, C} never played anyone in {D, E, F}. Obviously, you cannot compare {A, B, C} against {D, E, F} if they never played each other. For a practical example, in a Single-Elimination Tournament with 16 players, the #16 seed cannot be compared with the #15 seed for this reason. #16 lost to #1, and #15 lost to #2, but there is no real way to compare #16 and #15.
2. There is a set of people that no one defeated. IE: Lets say A beat everyone in {B, C, D}. This case actually does come up occasionally in tournaments (and it always comes up in single-elimination tournaments). If you run the algorithm, A would be considered "infinitely" better than {B, C, D} and the algorithm will essentially fail to converge. It makes sense, but lets say A beat {B, C, D} and then B beat {C, D}. Now B is considered infinitely better than {C, D} and A is infinitely better than B. Either way, the algorithm is not designed to handle this case. It simply doesn't converge.
Fortunately, this is a standard graph problem in Computer Science. It is called the "Strongly Connected Component", where a "win" is considered an edge in the graph, and the players are the nodes. http://en.wikipedia.org/wiki/Strongly_connected_component
Here is a simple algorithm to find the entire strongly-connected component given a player P.
1. Perform a Breadth First Search starting at P, where a forward edge is a "win".
2. Perform a Breadth First Search starting at P, where a forward edge is a "loss".
3. Take the intersection of the two sets, and you have a Strongly Connected Component. IE: The set you have now is the entire set of people who are connected either by a win or a loss to P.
Note: It doesn't really matter if you use a Breadth First Search or a Depth First Search. As long as you visit all the nodes that are visitable from that node, then you're good.
The algorithm can then be run on this set, and rank everyone in the set according to the Bradley Terry Model. Afterwards, it appears that the graph of these components will form a Directed Acyclic Graph (a DAG), which means you can sort it to a Topological Sorting. (http://en.wikipedia.org/wiki/Topological_sorting). This, combined with the algorithm to rank players within components, will provide a full ranking for all of the players in an arbitrary tournament format. Note that while multiple topological sortings exist, if you organize your tournament correctly, you should have a single "best" winner.
Applying the Math to a Tournament
The above math is very general, and does not immediately suggest a tournament format. Any and all formats can technically have the above math applied to it. (Swiss, Double-Elimination, etc. etc.)
However, I can point out a few things from practice. In the BlazBlue forums, I have experimented with this format a little bit, and it seems to work out more "fairly" than any other method I've used. There are a few caviats however.
1. The performance of your opponents, as well as the opponents of your opponents, (and so forth) affect your score. That is to say, if you beat someone who eventually becomes a high-rank, you will likely be a high rank as well. Beating someone of low rank doesn't "disqualify" you really, but you don't benefit as much as beating a high ranked person. On the other side, losing to someone with a high rank won't actually affect your score as much as losing to someone with a low rank.
2. A few members participating in the tournament told me that they were worried about NCAA Football BCS Bullshit. People don't like to be graded by a computer program. They prefer simple methods they can do themselves as opposed to complicated (but rigorous) methods. However, if you understand what is going on here, I'm sure you'll see that a Most Likely estimate over the Bradley Terry Model (or, some other Linear Model) is as statistically fair as you can get.
Sample Format
Over at Dustloop (BlazBlue Forums), here is a description of what I do.
16 Man Tournament.
Every person is put into a 4 man pod.
Round 1:
Each pod plays Round Robin.
Round 2:
People with 3 wins are placed in Pod 1. People with 2 wins are placed in Pod 2, etc. etc. Every pod then plays Round Robin again.
Round 3:
Rounds continue in this fashon. We usually do either 2 or 3 rounds depending on how long the tournament is lasting. People with similar amounts of Wins / Losses are placed into the same pod, so that people play with similarly skilled players.
Finals:
All of the Rounds Data is plugged into the program I wrote that does the calculations I described above. The top 4 players then play a single-elimination tournament. The best one is the winner of the tournament.
In comparison to the Status Quo
This is begging the question I guess. Why use this over Single Elimination / Double-Elimination? Why use this over Glicko2? Why use this over Round Robin?
I'll start off with the easiest one. Round Robin takes too damn long to do. Plus, this is a system that can take data from a Round Robin tournament, and find the best fit of player strengths to the data. This method provides a better fit than "Win / Loss" records in a Round Robin tournament. So even if you wanted to do a Round Robin Tournament, it would simply be more fair to send the data through this algorithm.
Okay, why this over Glicko2? Glicko2 converges in the long run, but it was not designed for a single shot tournament. Glicko2 is a long-term ranking system, and it doesn't really look "back" at the choices it makes. Take this example: if you lose to the #1 player in round 1, your score will be affected a lot in Glicko2 (because you have the same rating, and both of you have a very large ratings deviation). Later, after several rounds have gone by, the #1 player's score is now high and his deviation is low, and later players who face him don't get as big of a penalty as you did in round 1. However, this "Regression" system looks at all the data, essentially simultaniously. It does go back into the past, notes that your Round 1 loss was against the #1 player, and then negates the penalty a little bit. All players who lose to the #1 player will be equally penalized, while if you did the Glicko system, the "earlier" players are penalized more than the later ones. This is clearly unfair.
On the other hand, if you don't start everyone at an equal Rating, then their initial rankings will affect the end results of the tournament. A low-ranked player may never have had a chance to win. Either way, Glicko2, while useful for a long-term ranking / ladder system (and superior to this method for long-term ladders), it inferior to this system in a short-term one-shot tournament.
What about Single Elimination / Double Elimination ?? Well, IMO, these are extremely unfair tournament systems. It is very unlikely that the "real" best player actually wins the tournament, because he can only lose once or twice. Nevertheless, people like Elimination Tournaments, so I suggest that a small number of "Finals" be done using Single Elimination or similar.
Extra Stuffs
Herbert A David's book [2] goes into much detail about the full breadth of statistical methods that can be applied to the Linear Model. You can calculate the variance of the strengths of the players. You can perform a hypothesis test to see the probability thta you've actually got the strongest player. More or less, all standard statistics can be found from this model. It also goes a little bit into the combinatorics of this sort of thing.
Mark Glickman has a few papers on optimizing the pairings between players so that the entropy can be maximized between rounds. IE: so that you get the most amount of information from the fewest number of rounds.
For those into the Programming Language "R", there is a BradleyTerry package that performs the algorithm. Unfortunately, it does not do the Strongly Connected Component check, so there are times when the algorithm just won't converge on a dataset. Nevertheless, this is IMO the fastest and easiest way you can start experimenting with the BradleyTerry model.
I do have a crappy implementation that I programmed myself. It was programmed in Java... rather crappily... in command line, but it works. (and it does have an implementation of the Strongly Connected Components algorithm as well) There are few comments and this post provides better documentation of what I did... but if you still want the code, shoot me a private message. I'm willing to share, but I don't have Webspace >_<
References
[1] http://www.stat.psu.edu/~dhunter/papers/bt.pdf
[2] Herbert A David: The Method of Paired Comparisons (Published Book 1988)
[3] http://research.microsoft.com/pubs/74417/NIPS2007_0931.pdf
[4] Mark Glickman: Comprehensive Guide to Chess Ratings. http://math.bu.edu/people/mg/research/acjpaper.pdf
---------------
Yeah, that was long. I hope it was interesting to those who bothered to read it. Maybe you'll find a use of this format, maybe not. As far as I know, I can't think of a more fair way to run a tournament.