I honestly don't see the point of this test.
First, I think the test is self-defeating - it assumes that every Uber has the same type of effect on the metagame. The easiest example is the difference between Rayquaza and Wobbuffet - it is obvious what I mean. Of course, even ignoring Wobbuffet, who is commonly called the anomaly, we can see how Pokemon like Kyogre and Rayquaza each have their own ways of wrecking the metagame. I don't see how this will be useful at *all* in the suspect test, if that is the purpose behind this to begin with, considering the things we are testing are not quite that broken.
We already know what broken is, and how it affects the metagame. You may argue that "we don't have a good definition of an uber", but that's precisely because the LINE between what is uber and what is just "OU level centralizing" is a very, very grey one that we are still experimenting with. The second reason I find this test pointless is that it doesn't solve the problem of where to draw the line of what is "uber" or not, simply because every Pokemon has a "centralizing" effect on the metagame to begin with. The question we are skirting around is "at what point is the Pokemon centralizing too much?" The test doesn't answer that at all.
Third reason: the test purposely sets up a non-competitive metagame. The results will be flawed, as there is *zero* incentive to play seriously on this ladder, unlike suspect, where the ability to vote gives people an incentive. What kind of results do you expect? Will you hold a tournament and hand out trophy badges? Because that is honestly the only way you will generate enough incentive to play seriously within this "ladder". It is a purposely broken metagame - what incentive is there to play competitively, other than the user being "bored"? Do we want results that came from a bored user who decided to toy around to count as an "objective" result on what is uber or not? Creating a new ladder for this will have this exact effect.