Showing posts with label Markov Chains. Show all posts

Monday, September 15, 2014

If the typing monkeys had met Mr Markov: probabilities of spelling "omglolbbq" after the digital monkeys have read Dracula

Over the weekend, shortly after watching Catching Fire, I remembered the problem of the typing monkeys (the infinite monkey theorem), which can basically be defined as follows (thanks to Wikipedia):

# *******************
#  INTRODUCTION
# *******************

The infinite monkey theorem states that a monkey hitting keys at random on a typewriter keyboard for an infinite amount of time will almost surely type a given text, such as the complete works of William Shakespeare.

There is a straightforward proof of this theorem. As an introduction, recall that if two events are statistically independent, then the probability of both happening equals the product of the probabilities of each one happening independently. For example, if the chance of rain in Moscow on a particular day in the future is 0.4 and the chance of an earthquake in San Francisco on that same day is 0.00003, then the chance of both happening on that day is 0.4 * 0.00003 = 0.000012, assuming that they are indeed independent.

Suppose the typewriter has 50 keys, and the word to be typed is banana. If the keys are pressed randomly and independently, each key has an equal chance of being pressed. Then, the chance that the first letter typed is 'b' is 1/50, the chance that the second letter typed is 'a' is also 1/50, and so on. Therefore, the chance of the first six letters spelling banana is

(1/50)^6 = 1/15,625,000,000

less than one in 15 billion, but not zero, hence a possible outcome.
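As a quick sanity check, this probability can be computed directly in R (a small illustrative snippet, not part of the original scripts):

```r
# Probability that 6 independent, uniform keystrokes on a
# 50-key typewriter spell "banana": (1/50)^6
p_banana <- (1/50)^6
p_banana       # about 6.4e-11
1/p_banana     # 15,625,000,000 -- "less than one in 15 billion"
```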

# *******************
#  METHODS
# *******************

In my implementation, I will only consider the 26 characters of the alphabet (from a to z, excluding whitespace). The real question I would like to ask is the following:

Given a target word, say "banana", how many monkeys would be needed to have at least one successful event (a monkey typed the target) after each monkey has typed 6 characters?

To solve this, first calculate the probability of typing the word banana:

p(banana) = (1/26)^6 ≈ 3.24 × 10^-9

Now, just compute the expected number of monkeys that might be needed:

n = 1/p(banana) = 26^6 = 308,915,776

The model that assigns the same probability to each character is labeled the "uniform model" in my simulation.
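Both quantities under the uniform model can be computed in a couple of lines of R (a minimal sketch; the names `p` and `n` are mine, not from the original scripts):

```r
# Uniform model over 26 letters: probability of typing the target
# word in nchar(target) keystrokes, and the expected number of
# monkeys needed for one success (n = 1/p).
target <- "banana"
p <- (1/26)^nchar(target)
n <- 1/p     # 26^6 = 308,915,776 monkeys
```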

My goal is to optimize n (minimize the number of monkeys needed, because I am on a tight budget), so I decided to use a Markov chain model of order 1. If you are unfamiliar with Markov chains, there is a very nice explanation of the models here.

The training set for the transition probability matrix consists of a parsed version of Dracula (chapters 1 to 3, no punctuation signs, lowercase characters only).

The transition probability matrix of the Markov chain ensures that the transition from one character to another is constrained by the previous character, and this relation is weighted based on the frequencies obtained in the training text.

It is like having a keyboard with a light for each key: after "a" is pressed, the light intensity of each key would be proportional to how likely that character is to appear after an "a". For example, "b" would be brighter than "a", because it is more common to find words containing *a-b* than *a-a*.
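As an illustration of how such a transition matrix could be estimated, here is a minimal sketch using `table` and `prop.table` on a short training string (a stand-in for the parsed Dracula text, not the original implementation):

```r
# Estimate first-order transition probabilities from a training
# string by counting character bigrams and normalizing each row.
train <- "abracadabra"
chars <- strsplit(train, "")[[1]]
# Bigram counts: rows = previous character, columns = next character
pairs <- table(head(chars, -1), tail(chars, -1))
transMat <- prop.table(pairs, 1)   # each row sums to 1
transMat["a", ]   # probabilities of each character following "a"
```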

# *******************
#  RESULTS
# *******************

1) Plot the distribution of characters in the uniform model

Distribution of characters after 10,000 iterations using the uniform model


2) Plot the transition matrices

A) As expected, in the Markov chain model the transition from one character to another is constrained by the previous character, and this relation is weighted based on the frequencies obtained in the training text. B) In the uniform model, each character has the same probability of being typed and does not depend on the previous character.


3) Compare the performance of the two models

In this plot I am comparing the number of monkeys (log10(x)) required to type the target words (indicated in red text) using the Markov chain model and the uniform model. In general, the Markov chain model requires fewer monkeys for words that are likely to appear in the training set, like "by", "the", "what", "where" and "Dracula". On the other hand, for words that only have one character, like "a", there is no prior information, so the models perform equally. Another interesting example is the word "linux", which is not very likely to appear in the training set, and therefore the models perform roughly equally. The extreme case is the word "omglolbbq", for which the Markov chain model performs worse than the uniform model: the probability of this word under the trained chain is very low, so it is penalized and I will need more monkeys to get this target word.
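To make the comparison concrete: under a first order chain, a word's probability is the initial probability of its first character times the product of the transition probabilities along the word, and the expected number of monkeys is again 1/p. A toy sketch (the 3-letter alphabet and all probabilities below are made up for illustration, not taken from the Dracula matrix):

```r
# Score a target word under a first-order Markov chain:
# p(word) = p(first char) * prod( p(char_i | char_{i-1}) )
alp <- c("a", "n", "b")
transMat <- matrix(c(0.2, 0.6, 0.2,    # from "a"
                     0.7, 0.2, 0.1,    # from "n"
                     0.8, 0.1, 0.1),   # from "b"
                   nrow = 3, byrow = TRUE, dimnames = list(alp, alp))
initProb <- c(a = 0.5, n = 0.3, b = 0.2)

wordProb <- function(w) {
    ch <- strsplit(w, "")[[1]]
    p <- initProb[ch[1]]
    for (i in seq_along(ch)[-1]) p <- p * transMat[ch[i-1], ch[i]]
    unname(p)
}

wordProb("banana")    # 0.028224, much larger than (1/3)^6
1/wordProb("banana")  # expected monkeys: far fewer than uniform
```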


# *******************
#  SOURCE AND FILES
# *******************

Source and files


Benjamin





Wednesday, May 2, 2012

My poster "Mining Biosequences" (Spanish) at the Scientific Poster Contest for the anniversary of the BSc in Genomic Biotechnology at the Faculty of Biological Sciences, UANL

My poster is an introduction to HMMs using some awesome material from Eddy S (among other awesome authors). Hope you like it.

Click the image to download the poster.


Benjamin

Friday, April 13, 2012

Introduction to Markov Chains and modeling DNA sequences in R

Markov chains are probabilistic models that can be used to model sequences given a probability distribution. They are also very useful for characterizing certain parts of a DNA or protein string, given, for example, a bias towards AT or GC content.

For now, I am going to introduce how to build our own Markov chains of zero order and first order in the R programming language. The defining property of a zero order Markov chain is that the current state (or nucleotide) does not depend on the previous state, so there is no "memory" and every state is independent.

For a first order Markov chain the case is different, because the current state depends only on the previous state. With those points clear, a second order Markov model is one in which the current state depends only on the two states before it (this model will be useful for the study of codons, given that they are substrings 3 nucleotides long).
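A second order chain is not implemented in the scripts below, but the idea can be sketched like this: the transition table is indexed by the previous two bases (a dinucleotide context), with one row per context. The uniform probabilities here are only placeholders:

```r
# Second order sketch: p(next base | previous two bases).
# 16 dinucleotide contexts, each row a distribution over 4 bases.
alp <- c("A", "C", "G", "T")
dinucs <- as.vector(outer(alp, alp, paste0))   # "AA","CA",...,"TT"
secondOrderMat <- matrix(1/4, nrow = 16, ncol = 4,
                         dimnames = list(dinucs, alp))  # placeholder
# Sampling the next base given the context "AT":
nextBase <- sample(alp, 1, prob = secondOrderMat["AT", ])
```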

Download the scripts (R script and graphviz scripts) here

Example of a Markov Chain of zero order (the current nucleotide is totally independent of the previous nucleotide).

The multinomial model is:
p(A)+p(C)+p(G)+p(T) = 1.0
0.4 +0.1 +0.1 +0.4 = 1.0


Example of the structure of the zero order model:




Example of a Markov Chain of first order (the current nucleotide only depends on the previous nucleotide).

    The multinomial model per base is:
    p(A|A)+p(C|A)+p(G|A)+p(T|A) = 1.0
    p(A|C)+p(C|C)+p(G|C)+p(T|C) = 1.0
    p(A|G)+p(C|G)+p(G|G)+p(T|G) = 1.0
    p(A|T)+p(C|T)+p(G|T)+p(T|T) = 1.0      

    So:
    0.6 + 0.1 + 0.1 + 0.2  = 1.0
    0.1 + 0.5 + 0.3 + 0.1  = 1.0
    0.05+ 0.2 + 0.7 + 0.05 = 1.0
    0.4 + 0.05+0.05 + 0.5  = 1.0


Example of the structure of the first order model:
Code: 
#    Author: Benjamin Tovar
#    Date: 13/April/2012
#
#    Example of a Markov Chain of zero order (the current nucleotide is
#    totally independent of the previous nucleotide).

#    The multinomial model is:
#    p(A)+p(C)+p(G)+p(T) = 1.0
#    0.4 +0.1 +0.1 +0.4  = 1.0

# Define the DNA alphabet that will be used to put names to the objects
alp <- c("A","C","G","T")
# Create the vector that represents the probability distribution of the model
zeroOrder <- c(0.4,0.1,0.1,0.4)
# Put the name of reference of each base 
names(zeroOrder) <- alp 
# Create a sequence of 1000 bases using this model.
zeroOrderSeq <- sample(alp,1000,rep=T,prob=zeroOrder)

# ***** Study the composition bias of the sequence *****
# We will use the "seqinr" package.
# For the installation of the package, type:
# install.packages("seqinr")
# Then, load the package:
require("seqinr")

# Count the frequency of each base 
# in the sequence using the "count" function
zeroOrderFreq <- count(zeroOrderSeq,1,alphabet=alp,freq=TRUE)

# Count the frequency of dinucleotides 
# in the sequence using the "count" function
zeroOrderFreqDin <- count(zeroOrderSeq,2,alphabet=alp,freq=TRUE) 
 
# Now, plot the results in the same plot:
layout(1:2)
barplot(zeroOrderFreq,col=1:4,main="Compositional bias of each nucleotide",
        xlab="Base",
        ylab="Base proportion")
barplot(zeroOrderFreqDin,col=rainbow(16),
        main="Compositional bias of each dinucleotide",
        xlab="Base",
        ylab="Base proportion")

# &&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&

#    Example of a Markov Chain of first order (the current nucleotide only
#    depends on the previous nucleotide).

#    The multinomial model per base is:
#    p(A|A)+p(C|A)+p(G|A)+p(T|A) = 1.0
#    p(A|C)+p(C|C)+p(G|C)+p(T|C) = 1.0
#    p(A|G)+p(C|G)+p(G|G)+p(T|G) = 1.0
#    p(A|T)+p(C|T)+p(G|T)+p(T|T) = 1.0        

#    So:
#    0.6 + 0.1 + 0.1 + 0.2  = 1.0
#    0.1 + 0.5 + 0.3 + 0.1  = 1.0
#    0.05+ 0.2 + 0.7 + 0.05 = 1.0
#    0.4 + 0.05+0.05 + 0.5  = 1.0
    
    
# Create the matrix that will store the probability distribution given 
# a certain nucleotide:
firstOrderMat <- matrix(NA, nrow=4, ncol=4)
# Put names to the 2 dimensions of the matrix 
colnames(firstOrderMat) <- rownames(firstOrderMat) <- alp 
# Add the probability distribution per base:
firstOrderMat[1,] <- c(0.6,0.1,0.1,0.2)  # Given an A in the 1st position
firstOrderMat[2,] <- c(0.1,0.5,0.3,0.1)  # Given a  C in the 1st position
firstOrderMat[3,] <- c(0.05,0.2,0.7,0.05)# Given a  G in the 1st position
firstOrderMat[4,] <- c(0.4,0.05,0.05,0.5)# Given a  T in the 1st position

# Now we got a matrix
#    > firstOrderMat
#         A    C    G    T
#    A 0.60 0.10 0.10 0.20
#    C 0.10 0.50 0.30 0.10
#    G 0.05 0.20 0.70 0.05
#    T 0.40 0.05 0.05 0.50

# In order to continue, we need an initial probability distribution to know
# which base is the most probable to start up the sequence.
inProb <- c(0.4,0.1,0.1,0.4); names(inProb) <- alp
# So, the sequence will have a 40% to start with an A or a T and 10% with C or G

# Create a function to generate the sequence.
# NOTE: To load the function to the current environment, just copy
# the entire function and paste it inside the R prompt.

generateFirstOrderSeq <- function(lengthSeq,
                                  alphabet,  
                                  initialProb,
                                  firstOrderMatrix){
#    lengthSeq = length of the sequence
#    alphabet = alphabet that compounds the sequence 
#    initialProb   = initial probability distribution
#    firstOrderMatrix = matrix that stores the probability distribution of a 
#                       first order Markov Chain

    # Construct the object that stores the sequence
    outputSeq <- rep(NA,lengthSeq)
    # Which is the first base:
    outputSeq[1]  <- sample(alphabet,1,prob=initialProb) 
    # Let the computer decide:
    for(i in 2:length(outputSeq)){
        prevNuc <- outputSeq[i-1]    
        currentProb <- firstOrderMatrix[prevNuc,]
        outputSeq[i] <- sample(alphabet,1,prob=currentProb)
    } 
    cat("** DONE: Sequence computation is complete **\n")
    return(outputSeq)
}

# Use the generateFirstOrderSeq function to generate a sequence of 1000 bases 
# long
firstOrderSeq <- generateFirstOrderSeq(1000,alp,inProb,firstOrderMat)

# ***** Study the composition bias of the sequence *****
# We will use the "seqinr" package.
# For the installation of the package, type:
# install.packages("seqinr")
# Then, load the package:
require("seqinr")

# Count the frequency of each base 
# in the sequence using the "count" function
firstOrderFreq <- count(firstOrderSeq,1,alphabet=alp,freq=TRUE)

# Count the frequency of dinucleotides 
# in the sequence using the "count" function
firstOrderFreqDin <- count(firstOrderSeq,2,alphabet=alp,freq=TRUE) 
 
# Now, plot the results in the same plot:
layout(1:2)
barplot(firstOrderFreq,col=1:4,main="Compositional bias of each nucleotide",
        xlab="Base",
        ylab="Base proportion")
barplot(firstOrderFreqDin,col=rainbow(16),
        main="Compositional bias of each dinucleotide",
        xlab="Base",
        ylab="Base proportion")
        
## Let's plot the 4 plots in one window
layout(matrix(1:4, nrow=2, ncol=2))
# Results from the zero order model
barplot(zeroOrderFreq,col=1:4,
        main="Compositional bias of each nucleotide\nZero Order Markov Chain",
        xlab="Base",
        ylab="Base proportion")
barplot(zeroOrderFreqDin,col=rainbow(16),
        main="Compositional bias of each dinucleotide\nZero Order Markov Chain",
        xlab="Base",
        ylab="Base proportion")
# Results from the first order model
barplot(firstOrderFreq,col=1:4,
        main="Compositional bias of each nucleotide\nFirst Order Markov Chain",
        xlab="Base",
        ylab="Base proportion")
barplot(firstOrderFreqDin,col=rainbow(16),
        main="Compositional bias of each dinucleotide\nFirst Order Markov Chain",
        xlab="Base",
        ylab="Base proportion")
        
        
# end.   

     



Plot of the zero order model:


Plot of the first order model



All plots


Benjamin