Serious Application of the Infinite Monkey Theorem (Math)

forestflamerunner

Ain't no rest for the wicked
So in my free time I try to solve math problems as they relate to "real" world situations. I recently came up with a question that I've been trying to solve for a couple of days but I can't figure out where to start. I was hoping some of the stronger mathematicians onsite can help me get started. Note that I do not want the answer as that soaks up all the fun. I want a nudge in the right direction.

The question: There is a Monkey who, once a day, chooses one of the twenty-six letters of the alphabet at random. How many days would it take the monkey, on average, to choose every letter at least once? what is the probability that the monkey chose every letter at least once after fifty days? After how many days would there be a 95%probability that the monkey hit every letter at least once?

Like I said, I've made very little headway on this problem. My gut is telling me that the answer is has something to do with 26^n/n! but that may just be because I feel like this is a combination problem. I tried Googling this problem a bit and I'm pretty sure it has something to do with the infinite monkey theorem, but I can't really extrapolate much from it.
 
Sorry, I got a 5 on the ap stat test and then immediately deleted it all from my brain. This kind of problem is easy with binary situations but 26 possibilities every trial is just??? You could program a simulation to find approximate percentages but that's no real solution.

The probability that all come within 50 days might be just P(A comes at least once in 50 days)^26, which could be easily found using a binary model for A vs. A'. Wait, the probability that all come within 50 days is just the number of 50-letter sequences where all appear once (which I think is 26! * 26^24) divided by 26^50 (the number of possible sequences for 50 days).

Double edit: nope, that simplifies to the probability of all letters appearing in 26 days so I was off by some factor (50 ncr 26?).

Anyways, an "application" usually has a slightly more practical purpose but I guess I could think about this more.
 
Last edited:
for this type of problem, the answer can be found by writing a program that models the probabilities and then running trials of 50 days, repeat the trials and you find a percentage by asking in how many cases out of the whole did the monkey choose the entire alphabet after 50 days?. Try looking at the birthday problem, i think they have important similarities.
 
for this type of problem, the answer can be found by writing a program that models the probabilities and then running trials of 50 days, repeat the trials and you find a percentage by asking in how many cases out of the whole did the monkey choose the entire alphabet after 50 days?. Try looking at the birthday problem, i think they have important similarities.

I think this is exactly what i need to get started. Thanks
 
it's been a while since I took a stat class and I never learned much probability, but the first and third problems seem difficult to solve without numerical methods. I thought about the birthday problem a little and maybe it is possible to expand that to address the second problem - this link might help http://en.wikipedia.org/wiki/Multiset.

important edit: when I say it might help I say that as someone who has had a long week and doesn't want to think any longer about whether it actually helps. not as a teacher who says this might help *hint*

edit2 - specifically "counting multisets" in that link. had a few lines about poisson distribution & renewal theory that I deleted because I couldn't see how they were exactly relevant but maybe you can find something there
 
Last edited:
i thought the infinite monkey theorem was a simulation of population growth in china

anyway

Wrote a simple little script and ran it to simulate this; if you don't want to see the results then don't open the hide tags.
Code:
public class InfiniteMonkey {

    public static void main() {
        //int[] data = new int[1000];
        int fiftyCount = 0;
        for (int i = 0; i < 1000000; i++) {
            String str = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
            String str2 = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
            int roll, index;
            int day = 1;
            while (str.length() > 0) {
                roll = (int) (Math.random() * 26);
                index = str.indexOf(str2.substring(roll, roll+1));
                if (index > -1 && index+1 < str.length()) {
                    str = str.substring(0, index) + str.substring(index+1);
                    //System.out.println("Found: "+str2.substring(roll, roll+1)+" on day "+day+".");
                }
                else if (index > -1 && index+1 == str.length()) {
                    str = str.substring(0, index);
                    //System.out.println("Found: "+str2.substring(roll, roll+1)+" on day "+day+".");
                }
                day++;
            }
            //System.out.println("All letters found after "+day+" days.");
            //data[i] = day;
            if (day <= 50)
                fiftyCount++;
        }
        double fiftyProb = (double) fiftyCount / 1000000;
        System.out.println(fiftyProb);
        /* double average = 0;
        for (int num : data) {
            System.out.print(num + ",");
            average += num;
        }
        average /= 100;
        System.out.println();
        System.out.println("average: "+average); */
  
    }
}
77,61,91,97,94,108,146,139,147,96,90,207,78,81,69,103,117,73,83,69,104,89,102,87,112,83,187,76,135,75,81,59,99,133,92,158,96,86,104,79,79,88,79,112,90,197,137,146,98,120,82,100,68,92,123,89,121,80,71,157,77,127,87,112,75,95,87,105,103,112,66,170,93,114,101,70,99,126,80,83,71,88,102,103,97,223,78,110,101,108,148,144,93,79,79,204,84,61,172,116,
average: 104.35
83,83,117,162,101,90,140,78,65,84,154,70,105,87,98,118,114,95,67,92,59,83,76,89,160,145,76,92,99,170,95,101,166,103,56,73,62,102,85,70,146,126,177,123,93,137,87,111,112,86,104,75,93,192,71,153,78,118,90,76,142,111,72,145,76,76,68,111,155,91,133,75,68,192,102,68,110,100,69,94,102,104,85,63,155,94,220,75,96,125,64,100,96,136,109,95,143,90,95,100,
average: 104.18
99,92,70,85,113,82,117,81,79,169,109,72,91,68,153,86,100,130,107,74,79,113,70,91,156,64,84,83,80,123,106,183,63,127,86,102,125,94,76,68,112,95,176,147,80,61,104,94,104,100,98,92,93,114,93,128,59,127,130,121,141,93,92,103,102,104,131,172,98,66,186,85,102,71,101,76,132,125,165,72,48,83,93,93,104,129,106,95,84,135,124,110,86,100,80,78,74,84,90,84,
average: 101.75
97,86,110,90,87,100,96,157,76,78,104,91,167,281,74,101,75,105,176,104,81,164,92,91,113,73,75,66,185,96,164,111,109,81,82,156,136,90,102,99,117,54,144,91,93,92,121,94,139,69,92,122,68,81,70,141,69,133,94,100,134,67,121,95,80,97,181,73,113,62,138,103,62,115,102,133,64,82,79,50,104,104,161,64,80,78,97,91,92,88,82,66,117,90,75,82,70,90,126,73,
average: 101.86
82,84,107,77,117,74,61,80,109,88,149,106,79,140,141,85,119,122,106,95,129,75,118,126,143,92,82,89,91,68,120,68,81,99,117,137,78,65,116,155,84,75,107,81,67,223,67,112,129,121,68,80,128,95,124,115,98,72,129,77,101,111,73,111,88,108,168,69,84,72,110,96,114,153,77,86,82,73,71,94,119,113,101,93,85,140,99,139,120,104,88,141,104,110,121,71,93,75,96,110,
average: 101.85
119,78,83,102,120,70,122,63,108,109,143,116,61,143,115,73,124,106,85,82,87,67,147,90,79,90,78,65,77,187,106,80,127,147,71,130,96,75,68,124,63,88,102,102,122,116,117,72,90,102,70,71,87,142,83,80,169,98,164,89,101,66,100,70,84,107,69,146,114,67,102,91,90,66,143,81,125,117,95,57,132,72,99,120,95,110,128,67,89,83,76,114,77,89,121,95,120,231,64,111,
average: 100.24
85,81,151,79,92,86,91,159,144,62,130,58,102,95,68,105,80,179,108,115,104,73,72,169,62,101,142,73,59,102,102,64,108,59,155,93,112,93,66,91,80,78,81,79,75,97,91,119,72,67,55,112,78,65,109,85,176,117,79,77,103,92,104,184,102,70,104,89,79,129,111,78,78,60,91,132,115,129,113,87,80,72,136,89,102,144,127,116,90,74,107,106,133,84,94,75,168,107,60,115,
average: 98.91
83,97,134,72,77,82,81,85,92,87,90,202,87,112,116,95,95,97,86,80,57,57,93,138,122,105,97,134,60,97,87,114,142,120,126,72,105,158,89,72,89,73,83,70,82,65,66,153,170,94,80,70,116,124,99,109,112,82,82,84,73,100,120,89,153,82,107,71,74,109,97,104,114,149,89,73,61,94,71,136,136,86,121,54,108,81,112,103,106,86,59,103,63,74,83,194,116,157,94,60,
average: 98.6
102,112,67,106,52,71,127,106,88,76,79,94,52,185,89,98,112,79,69,113,113,121,124,109,88,76,89,93,108,88,62,80,138,105,78,83,108,65,55,152,132,158,86,135,68,83,82,94,73,92,106,87,88,131,78,104,86,98,60,115,66,109,136,115,78,74,93,101,98,204,97,81,121,119,83,90,83,129,97,76,74,63,87,118,80,103,105,82,119,120,168,117,168,107,74,112,143,64,115,98,
average: 99.35
0.006179
0.005972
0.006023
0.005883
0.005936
0.006074
as for the last question, "After how many days would there be a 95%probability that the monkey hit every letter at least once?", the answer should be whatever number that is above 95% of the day counts returned by the script, which i guess i'll edit in at some point.

e:
ran 10 million trials and got {average: 100.220335}, but maybe moonbound starts with day 0 and i start with day 1?????
double e: yeah, i seem to be counting an extra day, w/e.

alright i've corrected it in the brackets too, it's verrry close
 
Last edited:
I also simulated it and got a similar and consistent average of about [whitetext: 99 days].


Edit: ran 10 million trials, actual value should be very close to {100.2206602}. RIP my laptop.
Double Edit: Yeah, I was returning 1 less, I corrected it in the brackets.

Code:
import java.util.Random;
public class monkey2{
   public static void main(String[] args){
     Random gen = new Random();
     double total = 0.0;
     for(int q=0;q<10000000;q++){
     int[] letter;
     letter = new int[26];
     for(int x = 0; x<26; x++){
       letter[x]=1;
     }
     for(int day=1; day>-1; day++){
       int z = gen.nextInt(26);
       letter[z]=0;
       int sum=0;
       for(int x=0;x<26;x++){
         sum+=letter[x];
       }
       if(sum == 0){
         total+=day;
         break;
       }
     }
     }
     System.out.println((total/10000000.0));
   }
}
 
Last edited:
Think about it in terms of all the possibilities of the sets of letters that he can pick. For example, he can pick 50 As with probability 1/(26)^50, or he can pick 49 As and 1 B with probability 50*1/26^(50), or he can pick...

To do the average time needed, it would be the time needed such that the number of valid choices (those where you pick all 26 letters) would have a 50 percent chance of occurring.

Another helpful hint is to start doing it in small cases at first: for example, if you only have three letters, then what's the chance of picking all three letters within 3 days? within 5 days?

For the three day case, it's easy to show that the chance of picking all three letters is 2/9. Let's assume the letters are A, B, and C. Then on the first day he picks a letter, and we can assume that this letter is A. So, he has two letters that he needs to pick in the next 2 days. The chance of him picking one of B or C is 2/3, and the chance of him then picking the other one is 1/3, so the chance of him picking A then B then C is 2/9.

Another way to see this is that we can denote his choices by XYZ, where X, Y, and Z are the letters he picks (not necessarily unique). For example, if he picks A, then B, then C, his choice will be ABC. There are 3 choices for the first pick, second pick and third pick each, so there are 3^3 = 27 different combinations. On the other hand, how many choices are there where we pick all three letters? There are 6, and it's not that hard to find them all (it's 3!, because there are 3 places we can put A, 2 places we can put B, and 3 places we can put C).

Then you generalize, although it's a bit trickier.

Yet another way (which is probably dumb, to be honest) is to think about it in terms of a graph. Let each letter be a specific vertex, and then something like ABC would be a graph in which vertex A is connected to vertex B is connected to vertex C. Then there's probably some neat trick you can do but idk what it would be.

Or, finally, you can do it by calculating the ways to pick when you only have 25 letters. Since all the choices only use 25 letters, you just subtract the ways to pick from the ways to pick using 26 letters and then you have the number of choices that use all 26 letters. This is probably the smartest way to do it btw!
 
Or, finally, you can do it by calculating the ways to pick when you only have 25 letters. Since all the choices only use 25 letters, you just subtract the ways to pick from the ways to pick using 26 letters and then you have the number of choices that use all 26 letters. This is probably the smartest way to do it btw!
I think there's a snag here. You have to calculate all the ways to do it with 25 or less letters. It's not quite as simple as saying, "let's not use the letter A" and then multiplying by 26 because you could choose not to use any letter. If you did that, you could come up with the sequence BBBBBBBBB... which is also included in "let's not use C, or D, or anything besides B"

I'm sure there's a better way to count how to do it with 25 or less letters, but I'm not thinking of it right now.
 
I'm pretty sure you just do it inductively. Something like

there are 26 ways to do it with 1 letter, then if you choose one more letter then n slots are filled up with letter one so 50-n slots are filled up with letter 2 and you calculate how to arrange that, and then you just combine letter 1 and 2, so if n slots are filled up with letter one and two (you have to account for the way to arrange them but that's not that hard) then 50-n slots are filled up with letter 3 and you calculate how to arrange that, but that's the same thing you calculated so you're done

actually i'm dumb and there are 25^n ways to arrange 25 letters if you have n slots LOL
 
6HQc0.png

thanks blarajan
 
on average, it will take 26/26 trials to get the first one, 26/25 trials for the second, 26/24 trials for the third, etc.

thus, on average, it will take 26 (sum of 1/x from 1 to 26). This is, as wolfram alpha conveniently gives me, {100.21...}, or the 26th Harmonic number multiplied by 26.

you guys waaaaay overdid this; ffs you don't write code to brute force stats problems >:(

The 95% one is definitely something very stats-y that's flying a bit over my head (honestly I feel like any percent is a linear multiple (or something really trivial) of the answer for the first question, which is essentially just asking 50% instead of 95%); the 50 one is a bit more elusive.
 
For 50% and 95% chances, you just apply the central limit theorem - to a very good approximation the expected value of the number of trials required is normally distributed. So to have a 50% chance of getting enough trials, you just wait for the expected number of days (which comes out at roughly 100).

To get a 95% chance, you'd need to know the standard deviation of the distribution too. But you can get this too, because the standard deviation is just the square root of the sum of the squares of the standard deviations of the steps taken (standard stats theorem). And it turns out that the standard deviation of the nth step is: ((n-1)/(27-n))

Therefore st dev = sqrt(((1-1)/(27-1))^2 + ((2-1)/(27-2))^2 + ....)

Which comes out as about 30. So to have a 95% chance of typing ever letter at least once, that works out at the mean + 2 standard deviations = 100 + 30 x 2 = 160 days.

I used a lot of statistical theorems in there that I'm not going to prove: I suggest you Google the terms "Standard Deviation", "Normal Distribution", "Poisson Distribution" and "Central Limit Theorem" for more information.
 
You definitely got further than I was able to. For some reason CLT didn't cross my mind. However, I do have a couple nits to pick.

First, if my reading of the CLT Wikipedia entry is correct, that's slightly cavalier application of the CLT because each step does not have the same pmf, so you'd need to demonstrate Lyapunov's condition or something similar. This is tiny, and it probably doesn't even matter.

Second, ((n-1)/(27-n)) would be the variance of each step assuming that the count in each step is a Poisson distribution truncated at 1. So you'd just take the root sum, rather than the root sum square, of these values for each step. Somewhat minor, the idea is there, but you get a fucked-up number if you take that as the stdev.

However, my last beef is the biggie: the counts at each step are not poisson-distributed. Within a step, count increments do not occur independently of each other; each count increment within a step is dependent on the previous count increments not having moved to the next step. However, we can determine the pmf even though we can't assume a poisson distribution. Within each step, the pmf of the count should look like:

P(k|n) = (27-n)/26 * ( (n-1)/26 )^(k-1), n ∈ [2,26], k ∈ [1,∞)
P(k|n=1) = δ[k-1]

where n is the step number (i.e., you have picked n-1 unique letters and are looking to pick your nth) and k is the count within the step (i.e., how many letters you pick until you get a unique letter). From this you can identify the mean and variance and then use the magic of CLT to get the aggregate mean and confidence interval.
 
When I saw this thread I was expecting something along the lines of the two envelopes problem.

Anyway I did what others did before me and simulated it (in C, fuck java).

Code:
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#define NUM_OF_TRIALS 10000
int main()
{
    srand(time(NULL));
    int j, amountOfLetters = 26, results[NUM_OF_TRIALS], sum = 0;
    for (j = 0; j < NUM_OF_TRIALS; ++j)
    {
        results[j] = 0;
        while (amountOfLetters > 0)
        {
            if ((rand() % 26) + 1 <= amountOfLetters) amountOfLetters--;
            results[j]++;
        }
        amountOfLetters = 26;
        sum += results[j];
    }
    printf("The average number of days it took was %f.\n", (double) sum / NUM_OF_TRIALS);
    return 0;
}
The average number of days it took was 100.245800.
 
Last edited:
Back
Top