Hello Smogon, it has been a while. I probably won't stay very long, but I've been doing a lot of research into tournaments recently, and I thought I should get it all down somewhere. Considering how l33t this site is, I think this is the best place for it. I hope you find the subject as interesting as I did.

The (Mathematical) Problem

A tournament is designed to solve the "comparison sorting" problem from computer science: the goal is to find the best player using only direct comparisons between two competitors at a time. If we are only interested in the best player, and if we assume that every comparison (i.e., every match) tells the absolute truth with 100% certainty, then the standard single-elimination bracket solves the problem.

However, let's make one new assumption: what if a comparison is not absolute, but statistical? As we all know, there is a large degree of chance in the game of Pokemon. Stronger players only have a better chance of beating weaker players, especially when the skill divide between the two is small. So this is not a straightforward problem.

Once we consider chance and uncertainty in battles, the "sorting" formulation breaks down. After all, what if, by chance, the #1 player loses in the first round? Heck, in a 32-man single-elimination tournament, even if the #1 player has a 90% chance of winning every battle, there is over a 40% chance that he loses somewhere along the way (he has to win five rounds in a row, and 0.9^5 is only about 0.59). If, on the other hand, we view matches as unreliable, probabilistic data, then we are dealing with a statistics problem instead.

Formulating the Statistics Problem: Maximum Likelihood and the Linear Model

Note: forgive me, Math / Statistics majors, I'm an engineer. If I screw up this explanation, please correct me.

There are two things we need before we can solve this new problem. First, we need a set of assumptions, some way to model how players behave.
Second, we need an idea of what a "good answer" is: a way to define an optimal fit between our model and the data we have collected.

Starting with the model, my suggestion is to choose one that most people already recognize. If you have ever played on ShoddyBattle, the Glicko2 system (with its "Conservative Rating Estimate") is based on something called the linear model. The linear model, as defined by Herbert A. David [2], works as follows. Suppose every player has a "real strength". Take two of these players and call them Alice and Bob. The probability that Alice beats Bob is some function of the difference in their strengths: Magic_Function(Alice's strength - Bob's strength).

But which Magic_Function should we use? Fortunately, there are two in popular use right now, and according to Glickman [4], the choice between them is a "moot issue". So I'll only go into detail about the Bradley-Terry model; those who are interested can look up the other one, the Thurstone-Mosteller model (which leads to TrueSkill, Glicko, and Glicko2). Analysis of Bradley-Terry eventually leads to the Elo system, and to the ranking system I am proposing.

Here is the definition of the Bradley-Terry model. Assume that Alice's and Bob's strengths lead to the following probability:

P(Alice beats Bob) = Alice's strength / (Alice's strength + Bob's strength)

(Which, if you are curious, is equal to (1/4) * integral from -(Alice's rating - Bob's rating) to infinity of sech^2(x/2) dx, where a player's "rating" is the log of his strength. It is easier to use the first formula, but this shows that it IS a linear model, a function of a difference between the two players.)

Anyway, this formula says that if Alice's "true" strength is 5 and Bob's is 10, then Alice will win 5 times out of 15, and Bob will win 10 times out of 15.
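Both numbers above (the 5-out-of-15 figure, and the 40% upset chance from the first section) are easy to check. Here is a quick sketch; the function name is my own:

```python
def p_win(strength_a, strength_b):
    """Bradley-Terry: probability that player A beats player B."""
    return strength_a / (strength_a + strength_b)

# Alice (strength 5) vs Bob (strength 10): Alice wins 5 games out of 15.
print(p_win(5, 10))            # 0.333... = 5/15

# 32-man single elimination: the champion must win 5 rounds in a row.
# Even at a 90% win rate per battle, the #1 seed loses the tournament
# more than 40% of the time.
print(1 - p_win(9, 1) ** 5)    # 1 - 0.9^5, about 0.41
```

Note that p_win(9, 1) = 0.9, i.e., a 90% favorite is simply a player whose Bradley-Terry strength is nine times his opponent's.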
Note: if you take 400 * log10(Alice's strength) + C (where C is a value close to ~1000), you get the number that the Elo rating system tries to track, and a number similar to the Glicko2 ratings on the ShoddyBattle server.

Long story short: this is a model you all trust already. If you think Glicko2 is doing a good job, you'll probably agree that the above model works as well. But while Glicko2 converges to a linear model, the method I present here is different. So pay attention, because now it changes a bit.

The second thing we need is a method to fit the model (the Bradley-Terry model) to the data (the matches in a tournament). One popular method is Maximum Likelihood Estimation (MLE): based on the data we have, what set of strengths would give the highest probability of producing these tournament results, assuming the model is correct?

Note: Elo, Glicko, Glicko2, and TrueSkill do not use maximum likelihood to calculate your rating. Those systems are designed to estimate a player's strength even as it changes from week to week or month to month; from what I can see, they are based on the Newton-Raphson method or some other fixed-point iteration. MLE, by contrast, treats all the data as equally important, no matter how old or new it is. So MLE works better for, say, a short-term tournament, while Glicko2 works better for long-term leagues and ladders.

The Algorithm

Fortunately, this part isn't nearly as hard as you might expect. Maximum likelihood estimation on the Bradley-Terry model is a solved problem, and a simple algorithm exists. Herbert A. David's book goes over the algorithm a bit, http://www.stat.psu.edu/~dhunter/papers/bt.pdf presents it as well, and the Matlab code for the algorithm is available at http://www.stat.psu.edu/~dhunter/code/btmatlab/.
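To make that concrete, here is a minimal sketch of the iterative update from Hunter's paper, in my own Python translation (not the official code, and it assumes the connectivity conditions discussed next actually hold):

```python
def bradley_terry_mle(wins, iters=10000, tol=1e-12):
    """Fit Bradley-Terry strengths by maximum likelihood.

    wins[i][j] = number of times player i beat player j.
    Assumes every player has at least one win and one loss, and that the
    win/loss graph is strongly connected (otherwise this diverges).
    """
    n = len(wins)
    gamma = [1.0] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            total_wins = sum(wins[i])
            # Each opponent contributes games_played / (gamma_i + gamma_j).
            denom = sum((wins[i][j] + wins[j][i]) / (gamma[i] + gamma[j])
                        for j in range(n) if j != i)
            new.append(total_wins / denom)
        scale = sum(new)
        new = [g / scale for g in new]   # normalize: strengths sum to 1
        if max(abs(a - b) for a, b in zip(new, gamma)) < tol:
            return new
        gamma = new
    return gamma
```

With consistent data, say A beating B 3-1, A beating C 4-2, and B beating C 3-1, the fitted strengths come out ordered A > B > C, and each player's expected win count under the fitted strengths matches his actual win count (that is the maximum-likelihood condition).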
(Specifically, http://www.stat.psu.edu/~dhunter/code/btmatlab/btmm.m )

As you can see, the algorithm is quite simple, but there is an important assumption behind it (see Hunter's paper, and the comments in the Matlab code): the win/loss graph must be suitably connected. In my experience this assumption is often violated; case #2 below is violated by every single-elimination tournament.

1. There are two sets of people in your data who never played each other. For example: A, B, and C played each other, and D, E, and F played each other, but no one in {A, B, C} ever played anyone in {D, E, F}. Obviously, you cannot compare {A, B, C} against {D, E, F} if they never played each other. For a practical example, in a 16-player single-elimination tournament, the #16 seed cannot be compared with the #15 seed: #16 lost to #1, and #15 lost to #2, but there is no real way to compare #16 and #15.

2. There is a player (or set of players) whom no one defeated. Say A beat everyone in {B, C, D}. This case does come up occasionally in tournaments, and it always comes up in single-elimination tournaments. If you run the algorithm, A is considered "infinitely" better than {B, C, D}, and the algorithm fails to converge. Worse, say A beat {B, C, D} and then B beat {C, D}: now B is considered infinitely better than {C, D}, and A infinitely better than B. Either way, the algorithm is simply not designed to handle this case, and it does not converge.

Fortunately, this is a standard graph problem in computer science: finding "strongly connected components", where each win is a directed edge in the graph and the players are the nodes. http://en.wikipedia.org/wiki/Strongly_connected_component

Here is a simple algorithm to find the entire strongly connected component containing a player P:

1. Perform a breadth-first search starting at P, where a forward edge is a "win".
2. Perform a breadth-first search starting at P, where a forward edge is a "loss".
3.
Take the intersection of the two sets, and you have a strongly connected component: the set of all players connected to P by both a chain of wins and a chain of losses.

Note: it doesn't really matter whether you use breadth-first or depth-first search. As long as you visit every node reachable from P, you're good.

The algorithm can then be run on each such set, ranking everyone within it according to the Bradley-Terry model. The graph of the components themselves always forms a directed acyclic graph (a DAG), which means it admits a topological sorting (http://en.wikipedia.org/wiki/Topological_sorting). This, combined with the algorithm for ranking players within components, provides a full ranking of all the players in an arbitrary tournament format. Note that while multiple topological sortings may exist, if you organize your tournament correctly, you should have a single "best" winner.

Applying the Math to a Tournament

The math above is very general and does not immediately suggest a tournament format; technically, it can be applied to any format (Swiss, double elimination, etc.). However, I can point out a few things from practice. On the BlazBlue forums I have experimented with this format a little, and it seems to work out more fairly than any other method I've used. There are a few caveats, however.

1. The performance of your opponents, and of your opponents' opponents (and so forth), affects your score. That is, if you beat someone who eventually earns a high rank, you will likely earn a high rank as well. Beating someone of low rank doesn't disqualify you, but you don't benefit as much as you would from beating a highly ranked player. Conversely, losing to someone with a high rank hurts your score less than losing to someone with a low rank.

2.
A few members participating in the tournament told me they were worried about NCAA Football BCS bullshit. People don't like being graded by a computer program; they prefer simple methods they can carry out themselves over complicated (but rigorous) ones. However, if you understand what is going on here, I'm sure you'll see that a maximum likelihood estimate over the Bradley-Terry model (or some other linear model) is as statistically fair as you can get.

Sample Format

Over at Dustloop (the BlazBlue forums), here is what I do for a 16-man tournament. Every person is put into a 4-man pod.

Round 1: Each pod plays round robin.
Round 2: Players with 3 wins are placed in Pod 1, players with 2 wins in Pod 2, and so on. Every pod then plays round robin again.
Round 3: Rounds continue in this fashion. We usually do 2 or 3 rounds, depending on how long the tournament is taking. Players with similar win/loss records are placed into the same pod, so that everyone plays similarly skilled opponents.
Finals: All of the round data is plugged into the program I wrote, which does the calculations described above. The top 4 players then play a single-elimination bracket, and its winner wins the tournament.

In Comparison to the Status Quo

I suppose the obvious questions are: why use this over single or double elimination? Why use this over Glicko2? Why use this over round robin?

I'll start with the easiest one. Round robin takes too damn long. Besides, this is a system that can take the data from a round-robin tournament and find the best fit of player strengths to that data, and that fit is better than raw win/loss records. So even if you wanted to run a round robin, it would simply be fairer to feed the results through this algorithm anyway.

Okay, why this over Glicko2? Glicko2 converges in the long run, but it was not designed for a one-shot tournament.
Glicko2 is a long-term rating system, and it doesn't really look back at the choices it makes. Take this example: if you lose to the eventual #1 player in round 1, your Glicko2 score takes a big hit (you both have the same rating, and both of you have a very large ratings deviation). Several rounds later, the #1 player's rating is high and his deviation is low, so players who face him then are penalized much less for losing than you were in round 1.

This "regression" system, by contrast, looks at all the data essentially simultaneously. It goes back into the past, notes that your round 1 loss was against the #1 player, and discounts the penalty accordingly. Everyone who loses to the #1 player is penalized equally, whereas under the Glicko system the earlier opponents are penalized more than the later ones, which is clearly unfair. On the other hand, if you don't start everyone at an equal rating, their initial ratings will affect the final results, and a low-rated player may never have had a chance to win. Either way, Glicko2, while useful for a long-term ranking or ladder system (and superior to this method for long-term ladders), is inferior to this system for a short-term, one-shot tournament.

What about single and double elimination? Well, in my opinion, these are extremely unfair tournament systems. It is very unlikely that the "real" best player actually wins the tournament, because he can only afford to lose once or twice. Nevertheless, people like elimination tournaments, so I suggest that a small "finals" be run as single elimination or similar.

Extra Stuffs

Herbert A. David's book [2] goes into much more detail about the full breadth of statistical methods that can be applied to the linear model. You can calculate the variance of the players' strengths. You can perform a hypothesis test on the probability that you've actually found the strongest player.
More or less, all the standard statistics can be derived from this model. The book also goes a little into the combinatorics of this sort of thing, and Mark Glickman has a few papers on optimizing the pairings between players so as to maximize the entropy between rounds, i.e., so that you get the most information out of the fewest rounds.

For those into the programming language R, there is a BradleyTerry package that implements the algorithm. Unfortunately, it does not do the strongly-connected-component check, so there are datasets on which it simply won't converge. Nevertheless, it is, in my opinion, the fastest and easiest way to start experimenting with the Bradley-Terry model.

I do have an implementation that I programmed myself, in Java, rather crappily, on the command line, but it works (and it does include the strongly-connected-components algorithm). It has few comments, and this post is better documentation of what I did, but if you still want the code, shoot me a private message. I'm willing to share, but I don't have webspace >_<

References
[1] http://www.stat.psu.edu/~dhunter/papers/bt.pdf
[2] Herbert A. David: The Method of Paired Comparisons (published book, 1988)
[3] http://research.microsoft.com/pubs/74417/NIPS2007_0931.pdf
[4] Mark Glickman: A Comprehensive Guide to Chess Ratings. http://math.bu.edu/people/mg/research/acjpaper.pdf

---------------

Yeah, that was long. I hope it was interesting to those who bothered to read it. Maybe you'll find a use for this format, maybe not. As far as I know, I can't think of a fairer way to run a tournament.
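P.S. For the curious, the strongly-connected-component check described in the Algorithm section really is only a few lines. Here is a rough Python sketch (my own naming, not the Java code mentioned above):

```python
from collections import deque

def component_of(player, beat):
    """Players comparable to `player`: reachable by wins AND by losses.

    beat[p] is the set of players that p has defeated.
    """
    everyone = set(beat)
    # Invert the win graph to get each player's losses.
    lost_to = {p: {q for q in everyone if p in beat[q]} for p in everyone}

    def reachable(start, edges):
        seen, queue = {start}, deque([start])
        while queue:                       # plain breadth-first search
            p = queue.popleft()
            for q in edges[p] - seen:
                seen.add(q)
                queue.append(q)
        return seen

    # Intersection of the two searches = the component containing `player`.
    return reachable(player, beat) & reachable(player, lost_to)
```

For example, with A beat B, B beat C, C beat A, and A beat D: the component of A is {A, B, C} (they are mutually comparable), while D, having never won, sits alone in a component of his own below them.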

Interesting... but perhaps overcomplicated. Elimination tournaments are unfair, but quick. Round-robin tournaments are fair, but slow. However, Pokemon battles are pretty quick, right? Quicker than soccer or chess matches, MUCH quicker than cricket tests (_everything's_ quicker than cricket tests!). So everyone playing everyone else shouldn't take too long, and scoring can then be done simply in a league system. If there are too many players, a structure like the European football leagues works over the long term: players split into multiple divisions, with promotion and relegation between them. Of course you need an initial seeding, but it's not critical that this be fair.

For your maths, is there a way to factor in the score? Pokemon gives us more than just win, draw (usually banned by clauses, mind), or lose: there's a score, the number of Pokemon the winner has remaining. I suppose the first question is: is 6-0 really much better than 1-0?

Only tangentially related, but I've an interesting idea: how about a league based on the concept of 'home' and 'away' matches? The home player sets the tier (and maybe other clauses; these should remain the same for the whole 'season'). So if you think you're best at Ubers, your home matches are Ubers. But when playing 'away', your opponent might choose OU or UU, say. To perform well would require strength in ALL tiers, meaning the league winner could be considered master of all.

No, definitely not "everyone plays everyone". Everyone-plays-everyone is O(n^2) battles, while a single-elimination tournament is O(n) battles. A single-elimination tournament with 32 players has 31 battles total (or 5 rounds in parallel). A round-robin tournament with 32 players has 31 battles PER PLAYER, or 496 battles total (31 per player, halved because each battle involves two players).

The formats this calculation can handle range anywhere from an elimination tournament to a full round robin; therein lies the flexibility of the structure. The specific format I listed is similar to a league, where everyone within a pod plays round robin, but ultimately the qualification rounds can be done in any way necessary.

Yes, it is trivial to factor in a score. However, I do not believe 6-0 is any "better" than 1-0 in Pokemon. A win is a win, especially because you may sacrifice your own Pokemon to put yourself in a position to sweep. (For example, if the opponent has no Bulky Gyarados counter, and his current Pokemon can't kill or cripple Bulky Gyarados within, say, 5 hits, you can sacrifice your current Pokemon for a nearly guaranteed victory.) Therefore, it makes no statistical sense to rate 6-0 wins above 1-0 wins, as far as I can see anyway.

The concept of "home" and "away" has been generalized to the Bradley-Terry model; details can be found in reference [1], here: http://www.stat.psu.edu/%7Edhunter/papers/bt.pdf
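The scaling argument above in a few lines (note that 32 x 31 = 992 counts each round-robin battle twice, once from each player's side, so the number of distinct battles is 496):

```python
n = 32
single_elim = n - 1                 # each battle eliminates one player: O(n)
per_player  = n - 1                 # round robin: everyone plays everyone
distinct    = n * (n - 1) // 2      # each battle counted once, not twice: O(n^2)
print(single_elim, per_player, distinct)   # 31 31 496
```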

Has anyone considered Swiss format? It never really presented any difficulties in my MTG days. Most tournaments here are 64-man or so, so Swiss rounds cut to a top 4 would work. It would also make, say, a 110-man tournament easier to run.

I read most of the OP and feel I have a solid understanding of how these tournaments can go, but I'd really like to clarify something: you're essentially saying that a tournament run under this system can be configured to run exactly as quickly as a single-elimination tournament? If so, can you explain how that would work, exactly? There are a lot of differences between the online Blazblue community and Smogon (Blazblue seems to deal with a smaller number of players, the games are very different, etc.), and I don't want to go into exactly how those differences could render this format undesirable to many of our members here; I'll just leave it at "Smogon places a lot of value in the speediness of tournaments," even though it's probably not that simple.

also, hi lol

It is a good format in general. The issue is this: wins alone are not necessarily a good way to rank people, and it is especially hard to rank the middle group of players against each other. The thing about Swiss is that you can "strategically lose" so that you face easier opponents in the future, while those who consistently win are forced to play each other. Arguably, a player who goes Win -> Win -> Win -> Loss (3-1) is stronger than a player who goes Loss -> Win -> Win -> Win (also 3-1).

The first player faces opponents with records of:
Round 1: 0 wins, 0 losses
Round 2: 1 win, 0 losses
Round 3: 2 wins, 0 losses
Round 4: 3 wins, 0 losses

The second player faces opponents with records of:
Round 1: 0 wins, 0 losses
Round 2: 0 wins, 1 loss
Round 3: 1 win, 1 loss
Round 4: 2 wins, 1 loss

So in round 5, both players are seeded the same, even though the first player faced a harder group of opponents than the second. And regardless of how you set up the tournament, those who went down the "win" path arguably had a harder time.

Anyway, the Swiss format is not mutually exclusive with the system I proposed above. You can run a Swiss tournament and then use the algorithm from the papers above to calculate the top players.

EDIT: Not exactly. I'm saying that the specific format of the tournament I proposed is deliberately left undefined. You can run a single-elimination tournament, and then run this algorithm to rank the players (as well as calculate the standard error / deviation of the estimate of each player's strength). All the above presents is a way to convert a set of data (the wins and losses) into a set of strengths (the most likely fit to the Bradley-Terry model).

Yup, I know it's a different community, and the "pods" don't make sense here, for example (there are no "PSN rooms" on ShoddyBattle). Nevertheless, the structure of the Bradley-Terry model is very generic: it applies to a multitude of formats and will attempt to rank the players.
Thus, I figured it would work here, too.

WTF are you doing here?

Ah, I see. Well, I'll be frank: if you're talking about actually applying this model to Smogon's tournaments, I think it's a long shot. I think "People don't like to be graded by a computer program. They prefer simple methods they can do themselves as opposed to complicated (but rigorous) methods" captures the situation perfectly, and this community is probably particularly resistant (best of 1, for example, is still inexplicably the tournament standard). I don't know whether the intention of this thread is to convince us to apply this, or if you're just "throwing it out there" for education's sake or because you find it fun (or a little of all three). But if your ultimate goal is to convince people that this is how tournaments should be held from now on, I don't think it's going to happen, even in the long term, without actually setting up a tournament and letting people "get a feel for things." Not many people who regularly play in tournaments will read this and feel too comfortable about it.

That's what I thought when I first saw you on Dustloop, lol. I've been around here forever; I just didn't post that much when your reign of mathematical terror was at its peak.

I am posting it here because guess where the discussion started that eventually led to all of this :-) Before Shoddybattle had a ranking system, there was a certain discussion which eventually led to research of Glicko2 and so forth. It was here where the "seed of knowledge" was planted in me, and from that seed I eventually came up with the system that we use in that Dustloop thread. Smogon is full of l33t math professors. I'm not sure if everyone who participated in that discussion long ago is still here, but if they are, I figure they'll enjoy this for the mathematical curiosity alone. Yeah, I noticed your username and join date, and was pretty curious. Technically, you were here before me. Lol.

Welcome back, Dragontamer. Correct me if I'm wrong, but you are assuming in your Glicko2 treatment that a player's estimated rating is the CRE. The problem is that the CRE is not a very good rating estimate. The Glicko and Glicko-2 systems assume your rating is never certain: at the start, it is as uncertain as can be (an 'uncertainty' of 350, the highest possible value in that system), and the more you play, the lower the uncertainty gets, though it never reaches 0. This is why your rating changes a lot initially; it changes a lot precisely because it is still very uncertain. And this is why the CRE is not a good estimate at all while the uncertainty of a player's rating is large.

The more you play, the more your rating converges to your true rating, and hence the more certain it becomes. This is why, for Glicko, your first game and your hundredth game are not the same: in your first game, the system had no prior knowledge of your playing prowess (or lack thereof), while by your hundredth game it does. It is akin to signing up to a forum for the first time: your first post is regarded differently from your thousandth, because in your first post people cannot know enough about you to answer you well, while by your 1000th they have a rough idea of your writing prowess. This happened to me when I signed up here, incidentally; when I first displayed my mathematical arguments, people disregarded them as garbage, but now they don't. (I would link you to examples of this from past posts, but they were pruned.)

This is why I agree with you that Glicko and Glicko-2 converge _eventually_, but they need about 25 to 30 games per person before that convergence starts to show. That is usually too many games for a tournament setting, which is why other systems are usually used for tournaments. As I like to say, I regard the ladder as one big, continuous tournament.

I never liked the CRE. The CRE makes the skill statistic a function of time, so players who play more often get an advantage, and it tells you less than just giving the two numbers (the rating statistic plus the rating deviation statistic). But it is in popular use (see TrueSkill), and when I shot Professor Glickman an email, he seemed to think it was an adequate way to combine the statistics, so I stopped campaigning against it. It isn't how I'd implement a ranking system, however. (BTW: Professor Glickman is a very nice guy who gives speedy email replies :-) ) If it were up to me, I'd prefer just using the rating statistic itself, with a "threshold" on the rating deviation (to cut out inactive players). Long story short: Glicko's average rating, IIRC, is what actually tracks the player's skill, so that's what I meant by Glicko2 above.

Indeed. Which is why I proposed a new scoring system and a sample format :-) The model I propose is very similar to the Glicko2 model, but the method of calculating the ranking is ultimately different, and the above proposal DOES work for a short-term tournament. Instead of an iterative method (like Glicko), it is equivalent to performing a regression on the data: finding the most likely fit of player strengths to the win/loss data collected during a tournament.

You don't require a game of chance for this to be true. Consider a chess match: if two players close in skill level play several games, it is unlikely that either will win 100% of the time, yet the game has no element of chance in it. This is because people play at different skill levels, even between games in the same hour.

That sounds dangerous. People will assume that the winner of the single-elimination finals is the winner of the tournament, when it's possible that this isn't the case (right?).

Welcome back, dragontamer. Great essay. I'll probably have more to say about it later.

What he is suggesting is calculating the top n players (where n is a power of 2) from the tournament based on the algorithm, and then deciding the actual winner of the tournament with a single-elimination bracket among those n players. The only purpose of this scheme is to make the finals more interesting or "entertaining".

I agree. People's own skill is somewhat random from day to day and hour to hour (probably a function of sleep, hunger, distractions, and stress). But it is easier to make the point when the specific game we play has so much chance involved.

True, and it should be noted that even a tournament conducted in this format is still not perfect. There is a chance that the winner under the above model isn't really the strongest player; for example, the strongest player may just have gotten really unlucky for 10 games in a row. However, based on the data, this method provides the most likely fit, and thus the winner it chooses is the most likely winner, assuming the model is true. The algorithm (provably) gives the best guess, but only a guess.

In addition, the single-elimination format is simple to understand, and everyone agrees to it. It IS the status quo, and with anything else I'll guarantee that the players will start complaining and bring up the NCAA and the BCS. The algorithm here, while simple, is ultimately too complex to run by hand for a large number of players, so the only way to implement this in practice is through a computer program. "Computer program" tournaments are kind of a political issue for those who follow the NCAA, and it doesn't seem like too many people like them. Even if they are mathematically superior, they don't carry the same authority or ethos that a single-elimination tournament offers. I think a combination of both is necessary to please the people who don't keep up with the math. "Entertaining" is a good word :-p

----------------

EDIT: I know the Bradley-Terry model is a linear model... but perhaps my notation is off or something. The formula i / (i+j) for positive reals does not seem to fit the linear model. :-( I must have copied something down wrong. I'll re-check my math later.
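Regarding that EDIT: the notation works out once the strengths are put on a log scale. For positive strengths i and j, dividing through by i gives i/(i+j) = 1/(1 + e^(log j - log i)), which is exactly the logistic function applied to log(i) - log(j). So Bradley-Terry is a linear model in the log-strengths (the "ratings"), not in the strengths themselves. A quick numerical check:

```python
import math

def bradley_terry(i, j):
    return i / (i + j)

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

# i/(i+j) == logistic(log i - log j): linear in the *log* of the strengths.
for i, j in [(5.0, 10.0), (1.0, 7.0), (3.0, 3.0)]:
    assert abs(bradley_terry(i, j) - logistic(math.log(i) - math.log(j))) < 1e-12
print("checks out")
```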