Monday, September 15, 2014

If the typing monkeys have met Mr Markov: probabilities of spelling "omglolbbq" after the digitial monkeys have read Dracula

On the weekend, randomly after watching Catching Fire, I remember the problem of the typing monkeys (Infinite monkey theorem) in which basically could be defined as (Thanks to Wiki):

# *******************
# *******************

The infinite monkey theorem states that a monkey hitting keys at random on a typewriter keyboard for an infinite amount of time will almost surely type a given text, such as the complete works of William Shakespeare.

There is a straightforward proof of this theorem. As an introduction, recall that if two events are statistically independent, then the probability of both happening equals the product of the probabilities of each one happening independently. For example, if the chance of rain in Moscow on a particular day in the future is 0.4 and the chance of an earthquake in San Francisco on that same day is 0.00003, then the chance of both happening on that day is 0.4 * 0.00003 = 0.000012, assuming that they are indeed independent.

Suppose the typewriter has 50 keys, and the word to be typed is banana. If the keys are pressed randomly and independently, it means that each key has an equal chance of being pressed. Then, the chance that the first letter typed is 'b' is 1/50, and the chance that the second letter typed is a is also 1/50, and so on. Therefore, the chance of the first six letters spelling banana is

less than one in 15 billion, but not zero, hence a possible outcome.

# *******************
# *******************

In my implementation, I will only consider 26 characters of the alphabet (from a to z, excluding the whitespace). The real question I would like to ask is the following:

Given a target word, say "banana", how many monkeys would be needed to have at least one successful event (a monkey typed the target) after the monkeys have typed 6 characters.

To solve this, first calculate the probability of typing the word banana:

Now, just compute the number of monkeys that might be needed:

The model that assigns the same probability for each character is labeled as "uniform model" in my simulation.

My goal is to optimize n (minimize the number of monkeys needed because I am on a tight budget). So I decided to use a Markov Chain model of order 1 to do so. If you are unfamiliar with Markov Chains here is a very nice explanation of the models here.

The training set of the emission probability matrix, consist on a parsed version of Dracula (chapters 1 to 3, no punctuation signs, lowercase characters only)

The emission probability matrix of the Markov Chain ensures that the transition from one character to another character is constrained by previous character and this relation is weighted based on the frequencies obtained in the training text. 

It is like having a keyboard with lights for each key, after "a" is pressed, the light intensity of each key would be proportional of what characters are more likely to appear after an "a". For example "b" would have more light than "a", because it is more common to find words having *a-b* than *a-a*.

# *******************
# *******************

1) Plot the distribution of characters in the uniform model

Distribution of characters after 10,000 iterations using the uniform model

2) Plot the emission matrices

A) As expected, the transition from one character to another character is constrained by previous character and this relation is weighted based on the frequencies obtained in the training text. B) in the uniform model each character has the same probability to be typed and does not depend on the previous character. 

3) Compare the performance of the two models

In this plot I am comparing the number of monkeys (log10(x)) required to type the target words (indicated in red text) using the Markov Chain model and the uniform model. In general the Markov Chain model requires less monkeys in words that are likely to appear in the training set, like "by", "the", "what" , "where" and "Dracula". On the other hand, words that only have one character like "a", given that there's no prior information the models perform equally. Now another interesting example is the word "linux",  in which is not very likely to appear in the training set and therefore the models perform likely equally. The extreme case example is the word "omglolbbq", in which the Markov Chain model performs worse than the uniform model due of the very low probability of this word to happen, so it is penalized and I will need more monkeys to get this target word

# *******************
# *******************

Source and files


Sunday, August 10, 2014

To cooperate of defect (besides of coding): Prisoners dilemma, a game theory example in R

Hello Computer Science and/or R enthusiasts.

This week I had the opportunity to try something that was in my To-Do list a while ago. The idea came almost instantly after reading Dr. Richard Dawkins book, The Selfish Gene (which was a BD gift, thanks Andy).

I feel the obligated necessity to program my own implementation of the prisoners dilemma and make my own version of the contest. To try different parameters of greediness and vindictiveness, just coding in the name of science.

You can download the code here


Here's a basic definition of prisoners dilemma that can be found in this Wikipedia article:

The structure of the traditional Prisoners’ Dilemma can be generalized from its original prisoner setting. Suppose that the two players are represented by the colors, red and blue, and that each player chooses to either "Cooperate" or "Defect".

If both players cooperate, they both receive the reward, R, for cooperating. If Blue defects while Red cooperates, then Blue receives the temptation, T payoff while Red receives the "sucker's", S, payoff. 

Similarly, if Blue cooperates while Red defects, then Blue receives the sucker's payoff S while Red receives the temptation payoff T. If both players defect, they both receive the punishment payoff P.

And to be a prisoner's dilemma game in the strong sense, the following condition must hold for the payoffs:

T > R > P > S


I decided to program three basic strategy functions:

  1. this simple strategy repeats opponent's last choice
    1. This bot/strategy is parameter-free
  2. this strategy is affected by the parameter of greedy.level, which increases the probability of the bot to "defect" when this parameter is close to 1.0. If this parameter is set to 0.5, then the bot will choose randomly. 
    1. This bot does not care about the previous action of the other player, it only decides to "defect" or "cooperate" based on its own greediness.
  3.  this strategy is affected by the parameter of vindictive.level, which is only used when the previous action of the other player is "defect" in the same sense as the greedy.level affects the, otherwise it will "cooperate". 
    1. This  means that if the other player plays "cooperate", this bot will "cooperate" too, but at any moment when the other player plays "defect", this bot will trigger its own vindictive.level to decide if it will play "defect" or to "cooperate".
  4. All bots have three parameters as arguments:
    1. action: last action of the other player
    2. greedy.level*
    3. vindictive.level
  5. *The greedy.level parameter in the and the is only used when this bots are playing as player.1 and have to decide the first action/move against player.2. After that event, this parameter is no longer used for these bots.
  6. Each point in the plots means 1000 (iter=1000) plays of player.1 vs player.2
  7. Parameters for the payoff matrix: T=20, R=5, P=1 and S=0


  1. Try different pairwise comparisons of strategies varying the greediness and the vindictiveness of the bots
  2. Make our clandestine fight club with the bots (1st RULE: You do not code about FIGHT CLUB)



Using different thresholds of greediness in a vs we can see that the handles very well when this bot increases its greediness. this bot cooperates when the adversary cooperates, but reacts when the other player decides to "defects"


Using different thresholds of vindictive.level in a vs  A) We can see that the does not "defect" unless provoked.  Therefore as these bots "cooperate" along all the simulations, they get the score for 5000 which is the maximum score for mutual cooperation along 1000 iterations.  B) For this experiment, we are increasing the greedy.level and the vindictive.level by steps of 0.05. The greedy.level in the only determines the probability to "defect" in the first move only and only if the is playing as player.1. We can see that the handles very well when this bot increases its greediness and the vindictiveness. This bot cooperates when the adversary cooperates, but reacts when the other player decides to "defects"


Using different thresholds of greedy.level in vs the effect of vindictive.level in a  A) greedy.level and vindictive.level increased proportionally equal among the two bots. we can see that the when becomes more greedy it outperform the because the only triggers its self-defense strategy when provoked, and the probability to take revenge is based on the vindictive.level. B) greedy.level and vindictive.level increased proportionally inverse among the two bots. This means that the most greedy bot will face with the most forgiving bot.

2.1) FIGHT CLUB: ROUND 1 (0.7 to 1.0 in greedy and vindictive.level)

Tournament of all vs all using a range of parameters from 0.7 to 1.0 in greedy and vindictive.level  A) Scatter-plot of each strategy. in x-axis is the mean score of the strategy versus all other strategies when it is playing as player.1 and y.axis when is playing as player.2. We can see that the (greedy.level=1.0) has the lowest score followed by the vindictive.bots and the The bots that gets higher scores are the ones that have a greedy personality, but not "so-greedy" after all, for example the most successful bot only has 0.7 of probability of "defect". This experiment is in someway biased because the greedy.bots are taking advantage by exploiting the "nice bots" and getting higher scores for that reason.  B)  This barplot shows how is the effect of the greediness in the, as it becomes more greedy, the scores tend to be lower. The opposite behavior is shown for the

2.2) FIGHT CLUB: ROUND 2 (0.8 to 1.0 in greedy and vindictive.level)

To avoid that bias, now we show the experiment with more fierce adversaries

Tournament of all vs all using a range of parameters from 0.8 to 1.0 in greedy and vindictive.level  A) Scatter-plot of each strategy. in x-axis is the mean score of the strategy versus all other strategies when it is playing as player.1 and y.axis when is playing as player.2. We can see that the (greedy.level=1.0) has the lowest score followed by the vindictive.bots and the not-so-greedy bots. Now the bot that wins the contest is the simplest strategy, tit.for.that. B)  This barplot shows how is the effect of the greediness in the, as it becomes more greedy, the scores tend to be lower. The opposite behavior is shown for the In general we can conclude that the the vindictive.bots performs better than the greedy.bots because they tend to cooperate unless provoked, so, they are nicer bots.


NICE "BOTS" FINISH FIRST, as Dr Dawkins concluded in his book.

Dr. Dawkins illustrates the four conditions of tit for tat.

  1. Unless provoked, the agent will always cooperate.
  2. If provoked, the agent will retaliate.
  3. The agent is quick to forgive.
  4. The agent must have a good chance of competing against the opponent more than once.

In conclusion, it is better to be a nice human than a greedy one. 


Friday, August 1, 2014

UPDATE: 100 Prisoners, 100 lines of code, a game theory example in R

I  was looking for some Game-Theory post in R-bloggers and found this very cool post,  obviously I decided to read the post, to look at the code and implement a version of it.

The idea of the post was very cool for me that I thought it would be a nice idea to get more "juice" from it.

All the credit for this post, is for its original author.

The code is available here.

As Matt Asher describes in his original post:

# ***************
# ***************

In this game, the warden places 100 numbers in 100 boxes, at random with equal probability that any number will be in any box. Each convict is assigned a number. One by one they enter the room with the boxes, and try to find their corresponding number. They can open up to 50 different boxes. Once they either find their number or fail, they move on to a different room and all of the boxes are returned to exactly how they were before the prisoner entered the room.

The prisoners can communicate with each other before the game begins, but as soon as it starts they have no way to signal to each other. The warden is requiring that all 100 prisoners find their numbers, otherwise she will force them to listen to hundreds of hours of non-stop, loud rock musician interviews. Can they avoid this fate?

The first thing you might notice is that if every prisoner opens 50 boxes at random, they will have a 0.5 probability of finding their number.

There's actually a strategy that can guarantee all the prisoners to get their numbers.

For this post, we will call that strategy, model, and the strategy with no strategy at all, we will call it random.model.

The model strategy consist in that at the beginning, each prisoner will open the box based on their number, for example, the prisoner 42 will open the box at the position 42, after open it, say it contains the number 12, and given that 42 is not equal to 12, he will have to open the box at position 12. This will happen until he had opened 50 boxes or found his number.

The random.model strategy consist in no strategy at all, only opening boxes at random until he had opened 50 boxes or found his number.

Every time a prisoner found his number we will call this event a success.

Our interest is to find the strategy that maximizes the success rate. We will run each simulation 1000 times.

# ***************
# ***************

Let's take a look at the performance of each model individually with this histogram:

We can see that the random.model behaves as we were expecting assuming that by chance we have a 0.5 chance of getting our number by selecting 50 random boxes, On the other hand, the other strategy has a very different distribution, there's a gap between 51 and 99 with 0 of success rate!, any suggestion to explain how is this possible?

Another way to plot the distribution of the success rates obtained from each model is by looking at a density plot:

This histogram is telling the same story than the previous plots, but in the same plot:

And as usual, I always left the best plot for the grand finale. 

What we can say about this plot?, before going there, I will continue using Matt's explanation:

The theoretical success rate of the model strategy is about 31%

One way to look at the distribution of numbers in boxes is to see it as a permutation of the numbers from 1 to 100. Each permutation can be partitioned into what are called cycles. A cycle works like this: pick any number in your permutation. Let's say it's 23. Then you look at the number the 23rd place (ie the number in the 23rd box, counting from the left). If that number is 16, you look at the number in the 16th place. If that number is 87, go open box number 87 and follow that number. Eventually, the box you open up will have the number that brings you back to where you started, completing the cycle. Different permutations have different cycles.

The key for the prisoner is that by starting with the box that is the same place from the left as his number, and by following the numbers in the boxes, the prisoner guarantees that if he is in a cycle of length less than 50, he will eventually open the box with his number in it, which would complete the cycle he began. One way to envision cycles of different lengths is to think about the extreme cases. If a particular permutation shifted every single number over one to the left (and wrapped number 1 onto the end), you would have a single cycle of length 100. Box 1 would contain number 2, box 2 number 3 and so on. On the other hand, if a permutation flipped every pair of consecutive numbers, you would have 50 cycles, each of length 2: box 1 would have number 2, box 2 would have number 1. Of course if your permutation doesn't change anything you have 100 cycles of length 1.

As you can see from the histogram of the model strategy, you can never have between 50 and 100 winning prisoners. Anytime you have a single cycle of length greater than 50, for example 55, then all 55 prisoners who start on that cycle will fail to find their number. If no cycles are longer than 50, everyone wins.

Now, looking at the boxplot, OK we would say, using the model strategy we have like a 30% percent of chance that every one of the prisoners will get their number, but on the other hand, the random strategy guarantees that at least every prisoner will have 50% chance of getting his number, and therefore win.

So, the model strategy is like an all of nothing strategy, you can win the lottery or die trying. This is observable looking at the large standard deviation in the boxplot, on the other hand, the random.model strategy is very consistent with low dispersion values. This means that if the prisoners decides to use the model strategy, they will have to face tree possible outcomes, or everybody wins, or nobody does, or only a few does. On the other hand, with the random.model strategy, they will expect that at least 50 prisoners will win, thinking as a prisoner as a logical individual, I think this is the best strategy he can use.

Write your opinions.

Thanks to Matt Asher for the idea


Monday, July 7, 2014

What are the names of the school principals in Mexico?, If your name is Maria, probably this post will interest you. Trends and cool plots from the national education census of Mexico in 2013

I will start this post with a disclaimer:

The main intention of the post is to show how is the distribution of the school principal names in Mexico, for example, to show basic trends regarding about what is the most common nation-wide first name and so on, also to show trends delimited by state and regions.

These trends in data would answer questions such:

1. Are the most common first names distributed equally among the states?
2. Does the states sharing the same region also share the same "naming" behavior?

Additionally, this post includes cool wordclouds.

Finally and the last part of my disclaimer is that, I am really concerned about the privacy of the persons involved. I am not by any sense promoting the exploitation of this personal data, if you decide to download the dataset, I would really ask you to study it and to generate information that is beneficial, do not join the Dark side.



The database is located here
The R code can be downloaded here
Additional data can be downloaded here

All the results were computed exploring 202,118 schools across the 32 states of Mexico from the 2013 census


Here is the wordcloud of names (by name, I am referring to first name only), it can be concluded that MARIA is by far the most common first name of a school principal in all Mexican schools, followed by JOSE and then by JUAN

The following wordcloud includes every word in the responsible_name column (this includes, first name, last names). Now the plot shows that besides the common first name of MARIA, also the last names of HERNANDEZ, MARTINEZ and GARCIA are very common.


Looking at this barplot, the name MARIA is by far the most common name of the Mexican school's principals, with a frequency ~25,000. The next most popular name is JOSE with a frequency of ~7,500

Looking at the same data, just adjusted to represent the % of each name inside the pool of first names we have that MARIA occupy ~11% of the names pool.


 With this heatmap, my intention is to show the distribution of the top 20 most common first names across all the Mexican states

It can be concluded that there is a small cluster of states which keep the most number of principals named MARIA(but no so fast!, some states, for example Mexico and Distrito Federal are very populated, so I will reduce this effect in the following plot). In summary the message of this plot is the distribution of frequency of the top 20 most frequent first-names across the country.


For me, a young data-science-padawan, this is my favorite analysis: "hunting down the trends".

The setup of the experiment is very simple, map the top 1,000 most frequent nation-wide names across each state to create a 32 x 1000 matrix (32 states and 1,000 most nation-wide frequent names).

With this matrix, normalize the values by diving each row by the sum of it  (this will minimize the effect of the populated states vs the non populated while maintaining the proportion of the name frequencies per state). Then I just computed a distance matrix and plotted it as a heatmap.

What I can conclude with this plot is that, there are clusters of states that seems to maintain a geographical preference to be clustered within its region, this would be concluded that it is likely that states sharing the same regions would be more likely to share the "naming" trends due to some cultural factors (like the cluster that includes Chihuahua, Sonora and Sinaloa). But this effect is not present in all the clusters.

All images can be downloaded in PDF format here, just don't do evil with them!

Plot 1 here
Plot 2 here
Plot 3 here
Plot 4 here
Plot 5 here
Plot 6 here


Wednesday, June 18, 2014

[SOLVED] Problems compiling Rcpp dependent R packages in Crunchbang Linux 11 (Debian 7.5 (wheezy) 64-bit)

I had a very weird issue when I tried to compile (install) certain R packages like "wordcloud", "RSNNS" or "GOSemSim" (a Bioconductor package). The installation always ended with a compiling error at the end.

At first, I tried to solve the problem by finding anyone that had faced the same issue, so I Duck-Duck-go-it, Google-it and at least for me, I did not find anyone.

At the end, and which is the core of my post, I solved the problem by removing or commenting all my customizations in the /usr/lib/R/etc/ file (for example, loading libraries automatically).

In summary, using the default worked for me.

I wrote this post to make a record of the issue, so maybe it would help someone to solve the same in the future.


Tuesday, February 25, 2014

Convert Ensembl, Unigene, Uniprot and RefSeq IDs to Symbol IDs in R using Bioconductor

Hello, I have programmed a function that converts different sources of IDs to Symbol IDs.

The input ID types allowed are (at the moment):  Ensembl, Unigene, Uniprot and RefSeq.

The code is available clicking here

NOTE: The function depends on the Bioconductor package "" available here

For example, lets show 10 Ensembl IDs:

> id[1:10]
 [1] "ENSG00000121410" "ENSG00000175899" "ENSG00000256069" "ENSG00000171428"
 [5] "ENSG00000156006" "ENSG00000196136" "ENSG00000114771" "ENSG00000127837"
 [9] "ENSG00000129673" "ENSG00000090861"

And their Symbol IDs:

> res[1:10]
 [1] "A1BG"     "A2M"      "A2MP1"    "NAT1"     "NAT2"     "SERPINA3"
 [7] "AADAC"    "AAMP"     "AANAT"    "AARS"    

This is a running example of the function to convert Unigene IDs to Symbol IDs (For all the other IDs types, just replace "unigene" to "ensembl" or "refseq" or "uniprot"):

unigene <- toTable(org.Hs.egUNIGENE)
# extract 100 random unigene entries
id  <- unigene[sample(1:length(unigene[,2]),100),2]
id.type  <- "unigene"
res <- get.symbolIDs(id,id.type)


Monday, February 10, 2014

Upgrade and update R 2.15 to R 3.0 in Debian Wheezy

Following the instructions from CRAN, you need to add the R backports in your source list.


First, open a Terminal and open the sources.list file:

$ gksudo gedit /etc/apt/sources.list

Then, add these lines at the bottom of the file (Note, I use the Revolution Analytics Dallas, TX server, but this can be easily changed taking a look here for the mirrors):

deb wheezy-cran3/
#deb-src wheezy-cran3/


There's a folder where R uses to store the packages we download, just rename it to the current version of R. For example, mine was "2.15" and then I just renamed it to "3.0" and was inside this path:


Remember that some packages also needs to install some files in folders that belongs to the root, so, I would recommend to open R in sudo mode (only if you're sure about what you're doing :P) just by executing R this way: "sudo R" and then, in the R console type :

update.packages(checkBuilt=TRUE, ask=FALSE)


The Debian backports archives on CRAN are signed with the key ID 381BA480, to add them, in a Terminal prompt type:

gpg --keyserver --recv-key 381BA480
gpg -a --export 381BA480 > jranke_cran.asc
sudo  apt-key add jranke_cran.asc


Save the file and you can either enter to Synaptic, update the packages list and then just upgrade the packages or in a terminal type:

sudo apt-get update
sudo apt-get upgrade

And that's all.