Your last blurb in there completely contradicts your first two. You said that the statistics don't really represent anything because of the "arbitrary ranking system", which you then go on to claim is "valid".
The blurb was for Colin, not for you. Colin knows what I'm talking about and I'd rather not get into the unnecessary details.
Nonetheless, I feel this deserves a decent response. While every parameter to the Conservative Ratings Estimate is arbitrary, that does not change the fact that it is a very valid method for estimating the true skill of a player. The mathematics are solid in this respect. The Glicko2 rating system would work if every number were negative. It would even work if the best ranked player were at 0. This is what I mean by arbitrary. Now, it is fortunate that the default parameters (~800 initial ranking and so forth) put people generally at a positive number, but as far as the mathematics are concerned... the "middle range" of the ranking system could have been chosen anywhere.
This becomes a problem when you start multiplying numbers. Take, for example, a case where the #1 player has a ranking of 0 (and the worse players have negative rankings... which is entirely possible in Glicko2). In this case, the pokemon used by good players would be ignored (their usage statistic is multiplied by a number ~0), while the pokemon that poor players use would be multiplied by a relatively large negative number.
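To make that concrete, here is a minimal sketch. It assumes the weighted list is built by adding each player's rating to every pokemon that player uses; I don't know the exact script, so the function names, player names and numbers below are all made up for illustration. Shifting every rating by the same constant keeps every rating difference intact, yet the weighted totals change completely.

```python
def weighted_usage(teams, ratings):
    """Add each player's rating to every pokemon on that player's team.

    Assumed weighting scheme for illustration only; not the real script.
    """
    totals = {}
    for player, pokemon_list in teams.items():
        for pokemon in pokemon_list:
            totals[pokemon] = totals.get(pokemon, 0) + ratings[player]
    return totals


def shift(ratings, offset):
    """Move every rating by the same constant; rating differences are unchanged."""
    return {player: rating + offset for player, rating in ratings.items()}


# Hypothetical players and teams.
ratings = {"ace": 1800, "mid": 1500, "novice": 1200}
teams = {"ace": ["Garchomp"], "mid": ["Blissey"], "novice": ["Blissey"]}

original = weighted_usage(teams, ratings)
# Re-centre the scale so that the #1 player sits at 0.
shifted = weighted_usage(teams, shift(ratings, -1800))

print(original)  # Blissey outweighs Garchomp
print(shifted)   # Garchomp's weight is 0, Blissey's is negative
```

Same players, same skill gaps, same teams, and the list flips depending only on where the zero of the scale happens to sit.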
Theorymon is fun and all, but the statistics do not lie.
Statistics do not lie if you interpret them correctly. However, I do not see a correct interpretation of the "Weighted List".
Sure, having to get used to the rule changes of Wobbuffet (which I have yet to see in any ladder match) and Deoxys-S might take a while, but having this test data is a good thing. "Forcing" the community to test these pokemon is the only reliable way of accumulating the data that we thankfully now have. The simple fact is that Deoxys-E and Wobbuffet are not overpowered for OU. They may still be high-class OU pokemon, and very good at what they do, but the mountains of battle statistics show that they are both more not-uber than uber (as someone earlier said).
That does not change the fact that direct testing on the ladder has unnecessarily pissed people off. If the whole thing had been tested with better consideration for this community, fewer people would be complaining about Wobbuffet and more people would be accepting the statistics.
My issue was neither with Wobbuffet nor with Deoxys-S.
The politics of this issue were ignored, and now the community is paying the price.
Forcing players to test it is the only way anything will ever get done. A 1-week testing period in the ladder is more than fair, especially since it was announced far beforehand. Saying that this test was "thrust" upon the community is misleading and unfair.
Sorry to interrupt, I was making a point to Blaziken_57 by using elements from his post.
If you didn't want to play in an environment with Wobby, you could have just not played for that week and it obviously would have been proven broken. Except it failed its test miserably, and cannot perform to the "Uber" level that people once thought, so that entire argument is thrown out the window. "The end justifies the means".
The end? Wobbuffet is unbanned, relatively few people use him now, and yet when he is used a distinct portion of the community feels isolated: split between quitting the game they love or playing the game with Pokemon they hate.
I feel this could have been at least partially avoided if the test had been conducted in a more appropriate manner. There are significantly fewer people who complain about Deoxys-S than about Wobbuffet.
You also claim that the test of Wobbuffet was "arbitrary", which is as close to wrong as one can get about the situation. Pokemon like Kyogre, Dialga, etc. are obviously overcentralizing, but the same cannot be said about Wobbuffet. There has been much doubt as to its tier status since adv. It would be a "slap in the face to the greater community" to not allow this kind of testing, as long as there is reasonable suspicion, because it broadens the metagame, opens up new strategies, ideas and concepts, and ultimately makes the game more fun to play.
If you don't mind, I'll use your own words.
Theorymon is fun and all, but the statistics do not lie.
There were no statistics to warrant the unbanning of Wobbuffet on the ladder. Period. This is why I am against this method of testing. You force the community to change to the new rules before you gather statistics.
I only ask that when the next test occurs... that some friggen statistics are gathered before the ladder is affected. Honestly, I don't think that is too much to ask.
dragontamer: your metric is very problematic. Let's suppose Garchomp often occurs with another pokemon X. Now do we ban pokemon X as well if Garchomp and X are winning 95% of battles? Your metric doesn't even attempt to measure centralisation. It just measures something entirely arbitrary, with no connection to centralisation at all. A pokemon that is rarely used might be on a winning team here and there and end up with 100% win rate. Not to mention a pokemon might be able to affect centralisation without actually winning its battles, simply by forcing everybody to be prepared for it -- it would not tend to win that many in that environment. Aside from being arbitrary, this metric is inferior to one I've already outlined many times in this topic and elsewhere.
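The rarely-used case is easy to show with a toy example. The records below are invented, not taken from any ladder data; the point is only that a raw win rate ignores how many battles sit behind it.

```python
def win_rate(wins, battles):
    """Raw fraction of battles won; says nothing about sample size."""
    return wins / battles


# Hypothetical (wins, battles) records, made up for illustration.
records = {
    "Garchomp": (950, 1000),
    "Unown": (2, 2),
}

# Rank pokemon by raw win rate, highest first.
ranked = sorted(records, key=lambda name: win_rate(*records[name]), reverse=True)
print(ranked)
```

Two lucky battles put the rare pokemon above one with a thousand battles on record, which is exactly the kind of result that has nothing to do with centralisation.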
I can agree with that. Nonetheless, I'd like to continue to think up better methods than the current one. The current measurement just does not sit well in my stomach.
I already agreed with you elsewhere that the current weighted list is arbitrary, but so is the one you proposed, and the difference in the top of the list is quite minor. And yes, the top is all that matters, because NU would be decided by statistics in UU, not statistics in standard. I could easily calculate the "sum of probability of each user winning against a 1500-rated player" weighted list for each previous month as well since I already have the script.
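For reference, a rough sketch of what a "sum of probability of each user winning against a 1500-rated player" list could look like. The actual script is not shown in this thread, so this assumes a plain Elo-style logistic curve for the win probability and ignores rating deviation; the function names and sample data are hypothetical.

```python
def win_probability(rating, opponent=1500.0):
    """Elo-style expected score against a fixed opponent rating.

    Assumption: the thread does not say which formula is used, and this
    version ignores Glicko's rating deviation entirely.
    """
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400.0))


def probability_weighted_list(teams, ratings):
    """Sum each user's win probability over the pokemon on that user's team."""
    totals = {}
    for player, pokemon_list in teams.items():
        weight = win_probability(ratings[player])
        for pokemon in set(pokemon_list):  # count each pokemon once per team
            totals[pokemon] = totals.get(pokemon, 0.0) + weight
    return totals


# Hypothetical players and teams.
ratings = {"ace": 1900, "mid": 1500, "novice": 1100}
teams = {"ace": ["Garchomp"], "mid": ["Blissey"], "novice": ["Blissey"]}

print(probability_weighted_list(teams, ratings))
```

Under this assumption every user's weight falls strictly between 0 and 1, so no pokemon can pick up a zero or negative weight just because of where the rating scale is centred.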
Actually, OU pokemon are defined by their statistics in OU, which is what affects UU (e.g. Tentacruel's banning from UU). So yes, the middle of the list actually matters a lot for the UU metagame.