OK so, I probably should have been aware of this sooner, but PO's rating system basically (tries to) implement the Elo rating system, which is used for FIDE ratings and in similar places. While Elo itself is, I suppose, "legitimate", there are still a few problems with it that result in what I demonstrated in the OP:
Shoddy had Glicko2. Elo is still worse than Glicko2 (the convergence complaint in the OP still applies), so it's still a step backward. I'll happily admit that I (and most others) probably wouldn't be complaining at all if Shoddy had never existed, but Shoddy set this standard, as well as others. We should never accept steps backward.
There's also the fact that chess player ratings, at least at the top, come from many years of play and thousands of matches. Compare that to Pokémon, specifically our suspect tests. Our tests last a month; in my best attempt at laddering to the voting requirements, I played a bit over 200 battles within a month, and Jibaku apparently played around 300. IMO, more would have been unreasonable for anyone with a life. Elo, with its lack of a rating deviation, was simply never meant to be used for such "low" numbers of matches. We time our suspect tests to reflect the time that it supposedly takes to understand the metagame, but we pay no heed to the time that it may take to get a proper reading of true player ratings. Coyotte likes to argue that players will approach their true ratings "eventually", but in light of the chess comparison, it's a pretty lazy argument.
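To make the "no deviation" point concrete, here is a rough Python sketch of the textbook Elo update. This is not PO's actual code: the K-factor of 32 and the function names are my own assumptions for illustration. The thing to notice is that the step size depends only on K and the two ratings, never on how many games either player has played, so a brand-new account and a 500-game veteran move by exactly the same amount for the same result.

```python
def expected_score(rating_a, rating_b):
    """Textbook Elo expected score for A against B (400-point logistic scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating, opponent, score, k=32):
    """New rating after one game. score: 1 = win, 0.5 = draw, 0 = loss.

    k=32 is an assumed K-factor, not a value taken from PO.
    """
    return rating + k * (score - expected_score(rating, opponent))


# Two hypothetical players with the same displayed rating but very
# different amounts of history. Elo has no field to tell them apart.
newcomer_after_win = elo_update(1500, 1500, 1)  # say, 5 games played so far
veteran_after_win = elo_update(1500, 1500, 1)   # say, 500 games played so far
print(newcomer_after_win, veteran_after_win)
```

Both come out at 1516. Glicko2 keeps a rating deviation per player that shrinks as games accumulate, so early results move the rating more than later ones; plain Elo has nothing like that.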
There's a significant random element in Pokémon. Chess players may experience performance variation for other reasons, but the luck factor is nowhere near as prevalent (zomg I'm Black I'm slightly disadvantaged!!!). Random outcomes show up as "wrong" rating changes, and those changes keep affecting the rating afterwards.
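As a rough illustration of how one lucky result lingers, here is a small self-contained sketch using the textbook Elo formula. The K-factor of 32, the 1700 and 1500 ratings, and the helper name `wins_to_recover` are all made up for the example and are not PO's real parameters.

```python
def expected_score(rating_a, rating_b):
    """Textbook Elo expected score for A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def wins_to_recover(rating, opponent, k=32):
    """Count the consecutive wins needed to get back to the starting
    rating after a single loss to the same lower-rated opponent.

    Simplification: the opponent's rating is held fixed throughout.
    """
    target = rating
    # One loss: the favourite was expected to win, so the penalty is large.
    rating += k * (0 - expected_score(rating, opponent))
    wins = 0
    while rating < target:
        rating += k * (1 - expected_score(rating, opponent))
        wins += 1
    return wins


loss_penalty = 32 * (0 - expected_score(1700, 1500))
print(round(loss_penalty, 1))       # -24.3
print(wins_to_recover(1700, 1500))  # 3
```

So under these assumptions, a single luck-based loss as a 200-point favourite costs about 24 points and takes three straight wins against the same level of opposition to undo.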
The matchmaker uses the ratings to find opponents. This may not seem like a big deal, but considering everything said before, it makes laddering more of a chore than it should be (or at least more than it was on Shoddy), and it widens the gap between tournament performance and ladder performance. I'm not sure how relevant this is atm, but I'll mention it anyway: I really do believe that the matchmaker magnifies the issue (though that's hardly its fault).
I know that this post may be a bit irrelevant right now considering the server problems (lol PO strikes again!), but it came up in chat, so it's here.