i don't mean to interject where i'm not welcome, but it's generally a terrible idea to compute symmetric wald confidence intervals (aka the ones that were computed itt) when the estimated proportion sits near the boundary (p_hat close to 0 or 1), because the intervals can suffer from empirical undercoverage. if you want a better estimator for small proportions you should use a bayesian estimate with a beta prior (probably Beta(1/2, 1/2), i.e. the jeffreys prior, if you want good behavior near the boundary) and compute an acceptable sample size from the bayesian credible intervals rather than the wald CIs
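to make the contrast concrete, here's a minimal python sketch (scipy assumed; the counts x = 3 out of n = 1000 are made up for illustration, not from the thread):

```python
# Compare a symmetric Wald CI with a Jeffreys (Beta(1/2,1/2)) credible
# interval for a proportion near the boundary.
import math
from scipy import stats

x, n = 3, 1000             # hypothetical counts near the boundary
p_hat = x / n
z = stats.norm.ppf(0.975)  # 95% two-sided

# Symmetric Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)
se = math.sqrt(p_hat * (1 - p_hat) / n)
wald = (p_hat - z * se, p_hat + z * se)

# Jeffreys interval: equal-tailed credible interval under a Beta(1/2, 1/2)
# prior, i.e. the 2.5% and 97.5% quantiles of Beta(x + 1/2, n - x + 1/2)
post = stats.beta(x + 0.5, n - x + 0.5)
jeffreys = (post.ppf(0.025), post.ppf(0.975))

print(wald)      # the lower endpoint can dip below 0 near the boundary
print(jeffreys)  # always stays inside (0, 1)
```

the wald endpoint going negative is exactly the kind of boundary misbehavior i mean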

guess it might not matter much thanks to the CLT, but the credible intervals will almost surely perform better than the wald CIs for inferential purposes anyway

anyway, the sample sizes that were computed on page 1 were definitely too low (edit: not really, i just misread the question), so i ran a quick simulation study to estimate the sample size required to keep the length of the intervals below the cutoff. i'll throw it in a pastebin in case anyone is interested (it's written in R):

https://pastebin.com/n8XV0CPX
it probably won't be easy to follow if you aren't familiar with the bayesian paradigm, in which case you'll have to trust the method for now

anyway, i used the median of the simulated N values rather than the mean (it just depends whether you'd rather optimize L2 loss or L1 loss, and in this case i was more interested in L1 loss) and got that, to be safe:

90% confidence: N = 355,000

95% confidence: N = 504,000

99% confidence: N = 870,000

s.t. the length of our intervals is less than .001 (note this corresponds to p_hat ± .0005).
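for reference, here's a rough python sketch of the kind of simulation i mean. this is NOT the pastebin code (which is in R), and p_true = 0.03 is a placeholder rather than the actual proportion from the thread:

```python
# For each replicate: draw binomial data at a candidate N, compute the width
# of the Jeffreys (Beta(1/2,1/2)) credible interval, and bisect for the
# smallest N whose width falls below the cutoff. Report the median over
# replicates (optimizing L1 loss, as in the post).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def jeffreys_width(x, n, conf=0.95):
    # width of the equal-tailed credible interval under a Beta(1/2,1/2) prior
    a = (1 - conf) / 2
    post = stats.beta(x + 0.5, n - x + 0.5)
    return post.ppf(1 - a) - post.ppf(a)

def required_n(p_true, cutoff=0.001, conf=0.95, reps=100):
    ns = []
    for _ in range(reps):
        lo, hi = 1_000, 2_000_000          # bracket, then bisect on N
        while hi - lo > 1_000:
            mid = (lo + hi) // 2
            x = rng.binomial(mid, p_true)  # fresh data at each candidate N
            if jeffreys_width(x, mid, conf) < cutoff:
                hi = mid
            else:
                lo = mid
        ns.append(hi)
    return int(np.median(ns))

print(required_n(0.03))
```

the bisection is a little noisy since each evaluation redraws the data, but at these sample sizes the interval width concentrates tightly enough that the median over replicates is stable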

for estimating the sample from the first page that we were talking about, i get (87250, 123950, and 214800), which is a bit less conservative

the code can be adjusted fairly easily to compute N values if you want to change the cutoff at all. i also threw in an example using the normal approximation to the binomial, but simulation studies at N = 100,000 still showed empirical undercoverage near the boundaries (https://pastebin.com/eKb9XS9q for those who are interested). that said, the normal approximation gave similar values, at least for the 90% confidence case
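the normal-approximation version has a closed form: solve 2z·sqrt(p(1−p)/N) ≤ cutoff for N. a quick sketch (again, p = 0.03 is a placeholder, not the thread's actual proportion):

```python
# Closed-form sample size from the normal approximation to the binomial:
# the interval is p_hat +/- z * sqrt(p (1 - p) / N), so requiring a total
# width <= cutoff gives N >= z^2 * p * (1 - p) / (cutoff / 2)^2.
import math
from scipy import stats

def normal_approx_n(p, cutoff=0.001, conf=0.95):
    z = stats.norm.ppf(1 - (1 - conf) / 2)
    half = cutoff / 2
    return math.ceil(z**2 * p * (1 - p) / half**2)

for conf in (0.90, 0.95, 0.99):
    print(conf, normal_approx_n(0.03, conf=conf))
```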

yea, hope this helped, sorry for ranting. if you're confused about what empirical undercoverage is, shoot me a pm and i'll send you a paper i wrote on this exact topic. also it's late for me, i misread some stuff, and i ended up giving a lot of unnecessary calculations
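for anyone who'd rather see it than pm me: here's what empirical undercoverage looks like in a quick simulation (p_true = 0.001, n = 500, and the rep count are illustrative choices, not values from the thread):

```python
# Generate many binomial samples at a true p near the boundary, then count
# how often the nominal 95% Wald interval actually contains the true p.
# "Empirical undercoverage" = this rate landing well below 0.95.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
p_true, n, reps = 0.001, 500, 20_000
z = stats.norm.ppf(0.975)

x = rng.binomial(n, p_true, size=reps)
p_hat = x / n
se = np.sqrt(p_hat * (1 - p_hat) / n)
covered = (p_hat - z * se <= p_true) & (p_true <= p_hat + z * se)
print(covered.mean())  # lands well below the nominal 0.95
```

(when x = 0 the wald interval degenerates to a single point at 0 and can't cover the true p, which is a big part of why coverage collapses near the boundary)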

also, i saw people tossing around potentially being ok with the cutoff being .005 (i.e. p_hat ± .0025), in which case the simulation yields medians of 13700, 19550, and 33900 respectively, which is definitely MORE than enough to make good inferences about the data. a sample of 1/10th of the TOTAL population, taken randomly to mitigate bias, is pretty much a statistician's dream. i fully support using the sub-populations

tl;dr: basing tiering decisions on 10% of the total population chosen at random is a perfectly legitimate way to do things and there should be no problems. also, credible intervals are better when we're deciding whether to drop something (p relatively close to the boundary), and they require even smaller sample sizes