Discussion in 'Smogon Metagames' started by Antar, Mar 13, 2014.
That's not bad at all. I was worried it'd only be 100 or 200 people affecting the tiering.
Kind of off-topic but not sure where else to ask this: why is the top 500 based off Elo while weighting is based off Glicko? Why not use one system for both? Been curious about this for a while.
Different measures for different purposes. Elo doesn't do a great job at accurately measuring a player's skill, but Glicko is hard to boil down to a single number (further reading), so it's difficult to rank players based on it. Neither by themselves is particularly suitable for suspect tests, so we're using a metric called COIL to do that. The fundamental idea is: we use the best tool for each job, because if you try to have a rating system do everything, it will do none of those things well.
My one and only complaint about this decision is that this topic should have been made before implementing this change on PS. This change caused some pretty hefty changes in the tiers, far more than any changes caused by monthly stats in recent memory (which, considering how young the meta is, is an accomplishment). I know that more than a few UU teams became unusable due to Zapdos and Keldeo's sudden rise to OU, and I'm certain that even relatively well informed players were blindsided. Outside of the topic on the PR board, which by the way didn't even say that you all were going to implement the change until after it was already implemented, was there even any indication that you all were going to do this, before, ya'know, you all actually did this? When lurking the friggen PR board isn't enough to alert an average user like myself of important and extremely relevant changes like this one, what is?
Anyway, sorry about the rant, especially since your posts are pretty much the only reason I know anything about the subject in the first place (not to mention that judging from the OP, you probably realize most of that already, so I'm probably beating a horse that is already dead).
Big Concern: So if I'm understanding the new policy, each month you use a combination of common sense and magical statistics mumbo jumbo that I can't even begin to understand to determine these "candles of known brightness", which you will use in combination with more magical statistics mumbo jumbo to produce a single number that we interpret as the Glicko rating of the "average competitive player". I have some concerns with the process in which these "candles" are determined.
Related Sub-Concern 1: Are you determining these candles unilaterally? I have no issues with you determining the statistical equations you use to determine the conversions from "candle usage" (the amount of usage these "candle sets" receive at certain parts of the ladder) to the final "cutoff number" (massive misnomer btw, you may want to consider trying to come up with a better name, as it's causing a fair bit of confusion even to relatively informed members, like the ones reading and posting in this topic) because equations can be quantified, they can be posted, and other folks who have some understanding of them can look at them and spot any holes in them. "Common sense" is nowhere near as quantifiable, and honestly I can't think of a good way to evaluate these "candles" except on a case by case basis, in which case you should probably open it up to others for some discussion.
Related Sub-Concern 2: So I noticed that you are largely using the presence of certain moves and items as flags for possible "candles". This is fine as long as the word "possible" is kept. Where is the line between "bad" and "no competitive value" drawn? Using recharge moves, I provide the following 2 examples.
Example 1: Choice Specs Adaptability Porygon-Z with Hyper Beam: You are probably already aware of this set, and its purpose is pretty friggen obvious: it's basically the only way in Pokemon to get a special type explosion. Does it accomplish that purpose? Yeah it does, it straight up murders anything that doesn't resist it or is a pink blob, and if the pink blob happens to be half health, it dies too.
252+ SpA Choice Specs Adaptability Porygon-Z Hyper Beam vs. 252 HP / 4 SpD Eviolite Chansey: 352-416 (50 - 59%) -- guaranteed 2HKO
Now obviously there are a lot of problems with this set. While safe switch ins are few, the opponent is guaranteed a free turn regardless of whether or not the beam actually kills something. In a meta of dangerous sweepers, that's pretty friggen dangerous and it's probably not worth allowing the opponent a free +2 on their pinsir or megamaw just to get rid of one of their pokemon. However, while that may be enough reasoning to keep duck-z off the viability rankings, is it bad enough to warrant the label of "candle of known brightness"? According to your definition (correct me on this one), a "candle of known brightness" is a set or pokemon that has zero competitive value in the meta in question. So where do we draw that line? Here we have a mon that is probably not OU viable, but does have a very unique niche of having the single most powerful unboosted special attack in the game. Is this uncompetitive?
Example 2: Slaking with Giga Impact: Here is the other mon that I have seen with a recharge move. As we all know, Slaking is a Normal type with stats rivaling Regigigas and the ability Truant, which allows it to attack only every other turn. Giga Impact can recharge during that time, making it a seemingly obvious choice, and indeed it probably would be if it weren't for the fact that you cannot switch while recharging. Giga Impact on Slaking provides additional power but hurts Slaking's ability to function as a revenge and run killer (I mean, he has 3 other moveslots, but let's ignore that for a moment). Slaking is considered unviable in OU largely because of Truant, however there is a gimmick strategy that pairs up Slaking with Eject Button Cofagrigus to create a rather dangerous mon. So the question is, is Slaking in and of itself a "candle of known brightness", or is it Slaking with Giga Impact, or are neither considered a "candle"?
Related Sub-Concern 3: How do we handle pokemon of lower tiers being used in higher tiers? For example, Donphan is currently UU, but you have specifically stated that Donphan in itself is not a "candle". However, what about RU and NU mons? What about LC mons? Some more examples for consideration.
Example from RU: Good ole Slowking: Known to be mostly outclassed by Slowbro due to his bro having a better defense stat, Slowking still has the amazing Regenerator ability and a decent movepool that can allow it to function as a specially defensive tank. Frankly Regenerator alone probably makes it OU usable, but it's still considered outclassed and I can't think of many things that it can do that aren't outclassed by Slowbro. Candle or not candle?
Example from NU: Here's one that I've actually used myself, Gorebyss! Gorebyss is known to be one of the few pokemon capable of smash passing as well as having decent offensive stats and a pair of situational but pretty powerful abilities. It's largely held back by its pitiful speed and being mostly outclassed by Keldeo outside of rain (where a Gorebyss can nab itself a notable speed advantage, even without a Shell Smash boost). Gorebyss can either be used as a smash passer, or set up a Shell Smash sweep itself, or even as an instantaneous rain threat with Swift Swim and Choice Specs. It has a relatively unique niche in smash passing and a few other niches that one can argue to be OU usable (esp since drizzle swim is a thing now), yet it was NU last gen and is prolly not leaving. Is Gorebyss a candle or not?
Example from LC: Now here I think is the first place where I struggle to find good mons, because a decent majority (if not all) of the tier has no competitive value in OU. However, there is a single mon that deserves a look, Murkrow! Murkrow is the only NFE who has Prankster recovery, allowing it to boost its pitiful defenses with Eviolite and be incredibly obnoxious to any team without Knock Off or priority (which currently exists almost nowhere in OU). It has an ok-ish support movepool, sporting things like Prankster twave, mean song, weather inducing moves, swagplay (which may or may not get banned; for this discussion, let's pretend it's not), and of course, Roost, but almost all of these are badly outclassed by Sableye or thundo-I. Is this where we draw the line between "simply bad" and "no competitive value whatsoever"? Is it higher than this? Or is it still lower than this?
WebBowser, I'll try to do a point-by-point response to your post in the next day or so, but let's start with this: before this month, there was no Official UU. UU was in Beta. Wild changes in the tier lists were expected from the get-go. So your arguments about this being a "drastic" tier shift make no sense--because before this month, UU was merely a hypothetical tier with a sample banlist.
Antar No hurry on this. As you have stated and as I have a rather bad habit of forgetting, UU is indeed still in beta. We definitely have time, but I figure now (using a very loose definition of "now", as I am aware that the end of the month is almost upon us and you will likely be swamped soon) is as good of a time as any to have this discussion. Thank you again for making this topic.
UU is really desperate to get off the ground, and given that there was no actual precedent (no previous UU banlist for Gen VI), it was decided that, for the sake of expediency, we'd move on this right away.
In truth, you're exactly right: this is a bunch of BS that we're using to justify our decisions. I really can't say I care for the decision to increase the ratings cutoff beyond 1500, because it means that players with ratings above 1500 can end up mattering less than players who have never played a match before. And we are exploring alternatives to make this no longer necessary. But in the meantime, raising the cutoff is the least bad decision, and trying to inform where to put the cutoff based on data and metrics rather than just straight-up "I think it should go here," makes this least bad decision slightly better.
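To make the point about ratings above 1500 mattering less than a brand-new account more concrete, here is a minimal sketch. It assumes the weight given to a player is the probability that their true skill lies above the cutoff, with skill modelled as a normal distribution centred on the Glicko rating with the rating deviation as its spread. The function name and the deviation values (130 for a new account, 30 for an established one) are illustrative assumptions, not figures from this thread.

```python
import math


def tiering_weight(rating, deviation, cutoff):
    """Probability that true skill exceeds `cutoff`, assuming
    skill ~ Normal(rating, deviation). This weighting form is an assumption."""
    z = (rating - cutoff) / (deviation * math.sqrt(2))
    return 0.5 * (1.0 + math.erf(z))


CUTOFF = 1760  # the "1760 stats" discussed in this thread

# Hypothetical players; both deviations are illustrative assumptions.
newcomer = tiering_weight(1500, 130, CUTOFF)  # never played: default rating, wide uncertainty
regular = tiering_weight(1600, 30, CUTOFF)    # many games: higher rating, narrow uncertainty

print(f"newcomer weight: {newcomer:.4f}")
print(f"regular weight:  {regular:.8f}")
```

Under these assumptions the newcomer ends up with a weight of roughly 0.02 while the established 1600 player is effectively at zero, simply because the system is far more certain that the second player is below the cutoff.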
I'll start off by proposing some candles, but I don't have the best knowledge of the metagame, so I'll be listening to suggestions (from both the tiering councils and from ordinary members) to come up with more.
I know, but I can't think of a better name for it.
I'll probably start a thread, and I'll definitely listen to feedback, but at the end of the day, these "candles" are being used as helpful markers, not to directly determine anything, so it'll probably be a constantly evolving process rather than a matter of "okay, here are the set of candles everyone has agreed to."
Items are probably better candles than moves, by and large. Assault Vest / Sitrus Berry / Leftovers in Little Cup is the best example. But Sitrus Berry on non-Harvest/Recycle sets would probably be a useful candle. And Focus Sash on Sturdy sets. The easiest candle metric would definitely be subpar abilities: there is no competitive reason to use Truant Durant (outside of Doubles), Flame Body Talonflame (you give up too much not running Gale Wings)...
IMO, this is the exception that proves the rule.
I've seen this discussed before. First off, if you're using Slaking in OU, I think you're "doing it wrong," but secondly, the truant move can (and should) be used to switch out, so Giga Impact isn't even generally considered to be a good move on Slaking, especially when Return has 102 base power. Or so I understand it...
It's fine to use lower-tier Pokemon in OU. I'd argue every team should have at least one. The only Pokemon that could be considered candles are ones who are 100% outclassed by another Pokemon. So Gyarados > Feraligatr used to be an example. Chansey/Blissey > Audino would be another example (Regenerator doesn't help *that* much).
If anyone reading this objects to my specific examples, please save your thoughts for when I make the "candles" discussion thread. I don't want this thread getting derailed.
And with that, I think I've responded to everything! But feel free to ask more questions and attack any of these arguments, as long as you're not discussing specific examples (you can say, "I disagree with some of your examples," but please leave it at that).
First off, thank you very much for making such a detailed post explaining everything. So if I understand you correctly, there is no direct correlation between the actual usage of these candles of known brightness and the cutoff, it's just used to attempt to justify the "cutoff number" to try and make it seem less arbitrary. So the relationship between the usage of the candles or the rating these candles are found at and the actual cutoff number is not something that can be represented as an equation or anything that neat. I'm... actually ok with that, especially since I can't think of a better solution.
One thing that I can (try to) help with is with the term "cutoff number". If I have been paying any attention at all to this, the "cutoff number" is a sort of weighted average that determines the rating of the "average competitive player", and weights are determined by the number of standard deviations away from this "average" the player's rank is at. So instead of calling this value a "cutoff", why not call it a "weighted average" or "weighted average ranking", which more accurately describes both how the number is derived and how it is used in determining tiering? I dunno, just an idea I had this morning, let me know what you think of it.
Hi Antar, I didn't know where to post this as the OU candles thread was already locked by the time I got to it, but how about marking teams based on memes as not serious/competitive? A team where the pokemon are wielding a Helix Fossil, for example, is almost always a Twitch Plays Pokemon team...
Helix Fossil wouldn't be competitive regardless of whether or not it was a meme. HOWEVER, memes tend to be more popular than completely random garbage, so they may be worth looking out for. Case in point, Air Balloon Rotom-Fan.
That being said, judging from responses on the candle thread, there is prolly going to be some considerable discussion on what exactly is a candle, because many players felt the bar was set way too low and that this defeats the purpose of using 1760 stats.
I would probably be against this had I not decided to start laddering on a new alt last week. Hyper Beam Blissey with Specs may be fun and all, but I don't think it should be impacting any competitive usage stats.
This is also a bit of a concern of mine. Swarm Scolipede is terrible. Hyper Beam on Mega Gardevoir is terrible. Competitive players will never use them.
Limber Ditto is not just terrible, it's not just something an "uncompetitive" player will do, it's something that is only going to be used either by someone so functionally incompetent they can't handle a pokemon with two abilities and one move, or by someone who just accidentally forgot to set the ability. I know the ladder has become saturated with awful players but I think the bar is being set too low; some of this stuff is only going to be found under 1200 rating or whatever.
It'd be nice to see some stats, I'd be glad to be proven wrong.
Or someone who simply doesn't know what Imposter does. When I first started playing PS, I actually hadn't played BW so I only really had knowledge of moves/mons/abilities from gen 4 and before. It is entirely possible that a guy, seeing some random ability he's never heard of before, is going to go for the status immunity ability just because he knows it's at least somewhat helpful. It doesn't help that a lot of descriptions get cut off when you click on them or try to /data them in PS.
Not saying that we should call this guy "competitive", but ignorance != incompetence.
Not to mention some competitive sets are just downright unintuitive. Sheer Force special attacker Nidoking, one of UU's top wall breakers, anyone?
Okay, complete incompetence and/or just ignorance. But Imposter is the only reason to use Ditto; someone who neither knows how to use Ditto nor knows why it's used is still going to have an abysmal rating, or at least be far from being a competitive player.
I too am severely concerned by the results of the candle thread. Just because something like Water Absorb Politoed has a tiny niche does not make it in any way competitive. Somebody being smart enough to avoid using such utter garbage does not mean they aren't a bad player anyways. I have personally never seen the vast majority of the things in the candle thread during the process of laddering multiple alts; if they see any usage, it is so low on the ladder that they are invalidated as a means to distinguish between good and bad players.
As another example take Donphan. Donphan is significantly better than any of the listed candles, but is still so bad as to have no competitive use in OU. Things like Donphan were said to be "not a candle" but the vast majority of Donphan users are so bad that they should not have an influence on tiering. If the bar for who affects tiering were set by the candles determined in the thread, I am quite confident Donphan would be back in OU, along with plenty of other shitmons.
And by "really good" I mean stuff that's in no way ambiguous or controversial. There is zero competitive advantage to Limber Ditto. There might be some small niche for Donphan or Water Absorb Politoed. So what that means is that if I plot the distribution of Glicko R's for players that use Limber Ditto, I can put the cutoff above all of them, but if I use Water Absorb Politoed, the cutoff will probably be only such that it filters out 99% of them. Does that make sense?
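A rough sketch of that idea, for anyone who wants to see it in code: take the Glicko ratings of everyone who used a given candle, then pick the rating that sits above all of them (for an unambiguous candle) or above 99% of them (for a debatable one). The function name and the sample ratings are hypothetical, and the nearest-rank percentile used here is just one simple choice, not necessarily how the actual stats scripts do it.

```python
import math


def rating_covering(ratings, fraction):
    """Return the smallest observed rating such that at least `fraction`
    of the given ratings are at or below it (nearest-rank percentile)."""
    if not ratings:
        raise ValueError("need at least one rating")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ordered = sorted(ratings)
    rank = math.ceil(fraction * len(ordered))  # 1-based rank into the sorted list
    return ordered[rank - 1]


# Made-up ratings for users of one hypothetical candle.
candle_user_ratings = [1310, 1385, 1402, 1420, 1450, 1475, 1490, 1500, 1530, 1710]

above_all = rating_covering(candle_user_ratings, 1.0)   # unambiguous candle: clear every user
above_most = rating_covering(candle_user_ratings, 0.9)  # debatable candle: tolerate a few outliers

print(above_all, above_most)
```

The gap between the two numbers shows why a candle with even a small legitimate niche can only support a percentile-style cutoff: one strong player using it would otherwise drag the cutoff up to their rating.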
I think the major issue here is that this stuff is so bad that even the 1500 stats filtered it out just fine. Even with the ladder as awful as it is, there is no way anyone will ever win with Limber Ditto, and there is also no way anyone who would ever purposefully put a Limber Ditto on their team would ever win, let alone stay above or anywhere near 1500 Glicko. The same is true for Water Absorb Politoed, Pressure Bisharp, Focus Sash Blissey (Heck, Blissey itself is bad enough since it is outclassed in every meaningful way in OU by Chansey), and other such examples, all of which are completely useless options and are actually somewhat comparable to Leftovers in LC. The Limber Ditto users were never a problem because they never even influenced the usage stats to begin with. The entire reason the cutoff was raised was that nearly all above average OU players felt that the OU list did a horrible job of actually representing what was good in the tier, which is the entire purpose of the tier lists. Staples of good OU teams such as Keldeo, Kyurem-B, and Manaphy were languishing in UU or BL while awful and entirely outclassed Pokemon such as Donphan and Forretress that saw next to zero use in high or even mid level competitive play were solidly OU. Determining the cutoff based on candles such as these would completely undermine the whole idea Jukain was proposing by suggesting to use 1760 stats for tiering and would quite honestly make next to no sense.
Except, of course, that it didn't. All of these candles appear at nonzero percentages in the stats. This is why I ruled out shit like "Pound," which *never* appears, even though it's even worse than Limber Ditto.
I think you're confusing 1500 Glicko with 1500 Elo. The only way to see your Glicko rating is to type /rating or go onto the leaderboard. That starts at 1500. Elo is the one that starts at 1000.
I personally think this analogy is just plain wrong: ORAN BERRY in OU is what's comparable to Leftovers in LC. But all these above examples are somewhat viable, just out-classed by better options. You can make a team that wins matches with all of the Pokemon you mentioned... you just probably won't end up very close to the top of the ladder.
And that's the key point here: I'm not trying to filter out all but the very best--that's what the old 1337/1850 stats were for. Here, I'm just trying to filter out the very worst players, the ones who are not at all competitive, who don't care about winning and who have probably never even been to this site.
0.758% of Ditto ran Limber. That means Ditto's usage is 0.02 percentage points higher than it should be. That may seem like a small number, but OU-UU cutoffs have been decided by less.
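The arithmetic behind that, as a small sketch: the inflation in percentage points is just the Pokemon's usage multiplied by the share of it that is a candle. Ditto's overall usage isn't quoted here, so the 2.64% below is back-solved from the two numbers above and should be read as an assumption, not a stat.

```python
def inflation_points(usage_pct, candle_share_pct):
    """Percentage points of usage that come from the candle variant."""
    return usage_pct * candle_share_pct / 100.0


def implied_usage(inflation_pct_points, candle_share_pct):
    """Usage (in percent) implied by a given inflation and candle share."""
    return inflation_pct_points / candle_share_pct * 100.0


LIMBER_SHARE = 0.758  # percent of Ditto running Limber, as quoted above

# Assumed Ditto usage, back-solved from the quoted 0.02 points; not an actual stat.
assumed_ditto_usage = round(implied_usage(0.02, LIMBER_SHARE), 2)

print(assumed_ditto_usage)
print(inflation_points(assumed_ditto_usage, LIMBER_SHARE))
```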
As I've said repeatedly, these candles are merely tools to be used in helping determine the appropriate cutoffs. The ideal is that we set the cutoff to the level at which these candles go to effectively zero, but if that doesn't work in practice, because we look at the data and see that Donphan is too high and Manaphy too low, then we adjust.
I'm enjoying the discussion here. Being relatively new to Smogon, it's nice to see things explained in detail.
As was mentioned earlier, though, "cutoff" is a bit of a misleading name for the whole thing. If someone doesn't read into the OP enough, he or she will think that accounts don't count at all if their Glicko rating doesn't exceed 1760. Why not call it something softer, like, I don't know, "Full Weight Threshold?" It doesn't even have to be completely true, which is the beauty of creative naming. The Patagonian toothfish sold horribly until it became the Chilean Seabass.
What about obvious troll/scouting teams, like one-mon teams (usually packing Imposter Ditto or Transform Smeargle)?
That's the point of using "1760" stats. Those teams will hardly count, since there's no way they're getting high enough on the ladder to have a major impact.
As for a player working his or her way up to a high rating using a viable team and then using a troll team on a drop down- that player alone won't hurt the ladder much, and it's more trouble than it's worth, really. Since OU takes place on the scale of millions of battles, this thread is about a way to determine competitive tiers as objectively as possible using massive amounts of data.
Fair enough, I should have looked it up first. However, this does raise a question: How many of the Limber Ditto had Limber intentionally and how many were used by players who simply forgot to change Ditto's ability in the teambuilder? This isn't that hard to miss, considering the fact that the default ability is Limber. You can get at least reasonably high on the ladder if you forget to change the ability and simply win your first few battles without ever sending out Ditto. I doubt that anywhere near .758% of Ditto users intentionally ran Limber. Stuff like this is even easier to miss with Pokemon that have useless default abilities that aren't as visible as non-Imposter Ditto. For example, 3.311% of Starmie run Illuminate, a completely useless ability that happens to be the default one. Realizing that you forgot to change your Starmie's ability is actually reasonably hard, as none of its abilities have an easy to notice impact on the battle. For this reason, I don't think Limber Ditto is a good candle, as it signifies absent-minded players more than it signifies bad ones. (even if it didn't, there's no real way to tell)
No, I wasn't. Limber Ditto is so awful that there is no way someone who intentionally uses it could even be good enough to stay at the starting rating for very long.
Oran Berry in OU is also comparable. There is absolutely zero reason to use Leftovers in LC when you can use Berry Juice or Eviolite, just like there is absolutely zero reason to use Water Absorb Politoed in OU when you can use Drizzle Politoed or any of the dozens of Water-types that are better than Water Absorb Politoed in every single way. Even if it weren't outclassed in any way, Water Absorb Politoed would still be worthless in OU because its stats are just so bad. Heck, Ninetales is barely viable in XY OU with Drought, let alone without it. It's just that bad.
I understand the idea behind this, but it does have some big concerns about seeming arbitrary and elitist. To work around that, I'm curious if you've considered finding ways to filter out noncompetitive teams directly? It might be possible to compile a list of characteristics which we could reasonably say no competitive team would have, at least in a given tier: particularly certain Pokemon, moves, and items which, if present on a team, would disqualify the team entirely from counting for a subclass of usage stats which would be considered "competitive" and therefore used for tiering. This seems along the lines of your "candles" idea, but targeting the candles directly rather than blanket sweeping away 98% of the teams just because that group contains the candles.
It'd be a rougher approximation than this new system, but it would keep the tiers to their original purpose of actually representing a threat list. I remember looking over the Policy Review thread about this a while ago and saw that this "threat list" concept was brought up as an argument for the change, but this strikes me as very strange. If the tiers are meant to be a threat list, they can only function as a threat list for the ratings they're based on. If the OU tier only says what threats a player at around 1760 can expect to encounter, it's serving very little purpose for newer players. And newer players are the ones who need a simple threat list laid out for them; the top 1500 players know plenty well how to look at usage stats and metagame strategies in more detail to figure out what their own threats are. They're not going to be confused into overpreparing for things like Salamence, but it's valuable for newer players to know that there are plenty of Salamence around their level.
Now, having less powerful Pokemon like Salamence be OU makes the tier not as much a list of the strongest Pokemon available. But that's not the purpose of a usage tier in the first place. If, as of Gen 6, it's now more important to have a tier list that represents the most powerful Pokemon in a tier rather than the more used ones, shouldn't we switch to that entirely? We could abolish the system of usage tiers entirely for Gen 6 and replace OU with a non-usage-based tier representing the 50 or so best Pokemon in the standard metagame, based on criteria similar to the viability rankings, or based on the viability rankings themselves. (And so forth for lower tiers.) This would allow us to match the tier to the best Pokemon directly, rather than picking a group of players whose usage we believe will best estimate that group.
Whether what we really want are usage tiers or effectiveness tiers, these could be ways to pick one of them and make it work without letting it be influenced by blatantly uninterested players either way, rather than going for a strange midpoint by deciding over and over again which group of players represents the "right" usage tiers.
It took me a while to realize this, but the purpose of the standard candles is to identify a subset of Pokemon Showdown users using Pokemon, movesets, abilities, and/or items possessing unequivocally no competitive merit. The candles are not meant to filter those players out directly, but to provide a sample of obviously uncompetitive players, which one can then use to determine what the rating of an uncompetitive player should be. The designated criteria of the standard candles are obviously theoretically useless in gameplay, and thus their usage should be correlated with a player's ineptitude (such as ignorance of game mechanics and/or lack of knowledge of the Pokemon's movepool) or lack of competitive intent (such as trolling). Moreover, even competent players who use the standard candles would be handicapped severely and this would adversely affect their potential rating.
From the data set of the users who have a standard candle, one could know the average rating of non-competitive players and then set a cut-off that would practically exclude them. For instance, if the average rating of Gengar users having a physical attack (barring Focus Punch which has some situational merit) is 1450 GLICKO with RD 70, setting the cutoff at 1600 would exclude those players from influencing tiering (their weight would be less than .02).
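As a sanity check on the Gengar numbers, here is a minimal sketch that assumes a player's weight is the probability that their true skill exceeds the cutoff, with skill modelled as normal around the Glicko rating with RD as the standard deviation. It also runs the calculation backwards to find the lowest cutoff that keeps the weight under a target. The function names are made up for illustration.

```python
from statistics import NormalDist


def weight_above(rating, deviation, cutoff):
    """Probability mass of Normal(rating, deviation) lying above `cutoff`."""
    return 1.0 - NormalDist(mu=rating, sigma=deviation).cdf(cutoff)


def cutoff_for_weight(rating, deviation, max_weight):
    """Cutoff at which a player with this rating and deviation has
    exactly `max_weight`; any higher cutoff gives them less weight."""
    return NormalDist(mu=rating, sigma=deviation).inv_cdf(1.0 - max_weight)


# The example from the post: rating 1450, RD 70, cutoff 1600.
gengar_weight = weight_above(1450, 70, 1600)

# Inverse question: where must the cutoff sit for the weight to be 0.02?
needed_cutoff = cutoff_for_weight(1450, 70, 0.02)

print(f"weight at cutoff 1600: {gengar_weight:.4f}")
print(f"cutoff for weight 0.02: {needed_cutoff:.1f}")
```

Under this assumed weighting, the cutoff for a weight of exactly 0.02 lands a little below 1600, which agrees with the claim that 1600 puts such a player under .02.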
Still, I believe it is better to base tiering on empirical statistics rather than on inexact theoretical considerations (theorymon isn't quantum electrodynamics where one can get verified quantitative predictions down to many sig figs!) about the efficacy of a given Pokemon and subjective preferences and considerations that are similar to viability rankings. However, one should have some theoretical understanding to know why certain Pokemon are more useful than others in a given metagame setting that can explain the usage trends. But regardless of any theoretical considerations, the competent players will simply use what works; tautologically, if they did not, they wouldn't be deemed "competent" or "competitive".
I wonder whether Antar has decided on criteria for the standard candles and determined the strength of uncompetitive players.