Computational Biology Blog in fasta format: 2012

Sunday, December 2, 2012

Póster presentado en la XIV Escuela de Otoño en Biología Matemática, México

Póster presentado en la XIV Escuela de Otoño en Biología Matemática - 8vo Encuentro de Biología Matemática celebrado en San Luis Potosí, S.L.P, México:

"Predicción de promotores RNA POL-II en Drosophila melanogaster utilizando propiedades de señal, contexto y estructura a partir de secuencias nucleotídicas"

Póster disponible aquí

Brownian motion simulation in R

I have based this post on a very useful piece of code which basically is the core of my own implementation of a Brownian Motion simulation in R. The original reference code <- http://landshape.org/enm/r-code-for-brownian-motion/

According to Wikipedia the mathematical model for Brownian motion (also known as random walks) can also be used to describe many phenomena as well as the random movements of minute particles,
such as stock market fluctuations and the evolution of physical characteristics in the fossil record. The simple form of the mathematical model for Brownian motion has the form:

S_t = eS_t-1

where e is drawn from a probability distribution.

The source code is here

After loading the source code, there are two functions:

The first one, brownian will plot in an R graphics window the resulting simulation in an animated way.

The second function, export.brownian will export each step of the simulation in independent PNG files.

Example of running:

> source("brownian.motion.R")
> brownian(500)

The second function will produce this output
> export.brownian(500)
CODE:

# *******************************
# BROWNIAN MOTION SIMULATION
# December 2012 | Benjamin Tovar
# *******************************
#
#   REFERENCES
#   http://landshape.org/enm/r-code-for-brownian-motion/
#
#   According to Wikipedia the mathematical model for Brownian motion 
#   (also known as random walks) can also be used to describe many 
#   phenomena as well as the random movements of minute particles, 
#   such as stock market fluctuations and the evolution of physical 
#   characteristics in the fossil record. The simple form of the 
#   mathematical model for Brownian motion has the form:
#
#    S_t = eS_t-1
#
#    where e is drawn from a probability distribution.
#
#######################################################################

brownian <- function(n.times){
    x <- y <- x.new <- y.new <- x.new.p <- y.new.p <- vector()
    for(i in 1:n.times){
        # Initialize variables
        x <- rnorm(1)
        y <- rnorm(1)
        # concatenate variables 
        # to increase the vector size
        x.new <- c(x.new,x)
        y.new <- c(y.new,y)
        # sum the vector numbers
        x.new.p <- cumsum(x.new)
        y.new.p <- cumsum(y.new)  
        # plot the model
        plot(x.new.p,y.new.p,type="b",
             main=paste("Brownian motion simulation in R\nTime =",i,sep=" "),
             xlab="x coordinates",ylab="y coordinates",
             col=c(rep("gray",i-1),"red"),
             pch=c(rep(20,i-1),1))    
    }
}

# Test the function
# brownian(500)

# ****************************************
# EXPORT BROWNIAN MOTION SIMULATION IMAGES
# ****************************************

export.brownian <- function(n.times){
    x <- y <- x.new <- y.new <- x.new.p <- y.new.p <- vector()
    for(i in 1:n.times){
        # Initialize variables
        x <- rnorm(1)
        y <- rnorm(1)
        # concatenate variables to increase the
        # vector size
        x.new <- c(x.new,x)
        y.new <- c(y.new,y)
        # sum the vector numbers
        x.new.p <- cumsum(x.new)
        y.new.p <- cumsum(y.new)  
        # plot the model
        png(paste("image",i,"png",sep="."),width=600,height=600)
            plot(x.new.p,y.new.p,type="b",
                 main=paste("Brownian motion simulation in R\nTime =",
                 i,sep=" "),
                 xlab="x coordinates",ylab="y coordinates",
                 col=c(rep("gray",i-1),"red"),
                 pch=c(rep(20,i-1),1))
                 cat("image",i,"DONE",date(),"\n")
        dev.off()
    }
}

# Test the function
# export.brownian(500)

Sunday, July 29, 2012

Extracting upstream regions of a RefSeq human gene list in R using Bioconductor

Suppose that you want to do local mapping of upstream regions of a given RefSeq IDs in a particular genome in R using Bioconductor. Download the script here.

In this case, you may take a look at the Bioconductor AnnotationData Packages here: http://www.bioconductor.org/packages/release/data/annotation/

The goal of this post is that for example I have the following RefSeq IDs and want to extract 250 bases upstream of each gene in a single list with another useful information such the entrez.id, the symbol and the gene description.


# RefSeqs IDs:

gene.list.refseq <- c("NM_003588","NM_001145436", "NM_001135188","NM_020760","NM_173362", "NM_198393","NM_022736","NM_025074","NM_033449","NM_015726", "NM_022110","NM_016478","NM_020634","NM_002291","NM_000418", "NM_001862","NM_017752","NM_006591","NM_000124","NM_144610")



# How many bases upstream of each gene:    

bases.upstream <- 250

Before starting, please download the following packages in R:


       source("http://bioconductor.org/biocLite.R")

       biocLite("BSgenome.Hsapiens.UCSC.hg19")                               

       biocLite("Biostrings")

       biocLite("org.Hs.eg.db")

Load the function using:


source("extract.five.utr.sequence.R")

And finally make the computations:


output.sequences <- extract.five.utr.sequence(gene.list.refseq,bases.upstream)

CODE:

# ******************************************************************************
#       FUNCTION: extract.five.utr.sequence | 29/JULY/2012 | BENJAMIN TOVAR
# ******************************************************************************
extract.five.utr.sequence <- function(gene.list.refseq,
                                      number.bases.upstream=1000){
    
    
    # *****************************************************************
    # RUNNING NOTES: Please download this packages from Bioconductor
    # http://www.bioconductor.org/packages/release/data/annotation/
    # ***************************************************************** 
    #   source("http://bioconductor.org/biocLite.R")
    #   biocLite("BSgenome.Hsapiens.UCSC.hg19")                                
    #   biocLite("Biostrings")
    #   biocLite("org.Hs.eg.db")
                                      
    # *****************************************************************
    # RUNNING EXAMPLE
    # *****************************************************************                              
    #
    ## Extract 250 bases upstream of each gene in gene.list.refseq 
    #
    # gene.list.refseq <- c("NM_003588","NM_001145436", "NM_001135188","NM_020760","NM_173362",   
    #                    "NM_198393","NM_022736","NM_025074","NM_033449","NM_015726",   
    #                    "NM_022110","NM_016478","NM_020634","NM_002291","NM_000418",   
    #                    "NM_001862","NM_017752","NM_006591","NM_000124","NM_144610") 
    #
    # bases.upstream <- 250
    #  
    # result <- extract.five.utr.sequence(gene.list.refseq,bases.upstream)                             
                             
    # *****************************************************************
    # LOAD THE LIBRARIES
    # ***************************************************************** 
    cat("Loading libraries",date(),"\n")   
    # human genome DNA sequences
        library(BSgenome.Hsapiens.UCSC.hg19) 
    # human genome wide annotations 
        library(org.Hs.eg.db) 
    # load IDs, symbol and gene descriptions
        refseq.id <- toTable(org.Hs.egREFSEQ)
        symbol <- toTable(org.Hs.egSYMBOL)
        gene.description <- toTable(org.Hs.egGENENAME)
    
    # *****************************************************************
    # LOAD THE UPSTREAM SEQUENCES
    # *****************************************************************

    if(number.bases.upstream <= 1000){
        # load the 1000 bases upstream of all genes in the human genome 
        # in the BSgenome.Hsapiens.UCSC.hg19 package
        upstream <- Hsapiens$upstream1000
    }
    
    if(number.bases.upstream > 1000 && number.bases.upstream <= 2000){
        # load the 1000 bases upstream of all genes in the human genome 
        # in the BSgenome.Hsapiens.UCSC.hg19 package
        upstream <- Hsapiens$upstream2000
    }
    
    if(number.bases.upstream > 2000 && number.bases.upstream <= 5000){
        # load the 1000 bases upstream of all genes in the human genome 
        # in the BSgenome.Hsapiens.UCSC.hg19 package
        upstream <- Hsapiens$upstream5000
    }        
    
    if(number.bases.upstream > 5000){
        cat("ERROR: The number of bases upstream of 5'UTR is too large (max value = 5000)\n")
        return(0);
    } 
    
    if(number.bases.upstream < 1){
        cat("ERROR: The number of bases upstream of 5'UTR cannot be less than 1 nucleotide (min value = 1)\n")
        return(0);
    }                   
        
    # extract the names of each gene
    names.genes.upstream <- names(upstream)

    # extract the refseq of each gene  
    refseq.upstream.id <- vector("character",length(names.genes.upstream))
    for(i in 1:length(names.genes.upstream)){
        refseq.upstream.id[i] <- gsub(", ","_",toString(unlist(strsplit(names.genes.upstream[i],
                                      "\\_"))[c(1,2)],sep=""))
    }    
    names(refseq.upstream.id) <- 1:length(refseq.upstream.id)

    # *****************************************************************
    # CREATE OUTPUT OBJECT
    # *****************************************************************    
    output.params <- c("entrez.id","symbol","gene.description","five.UTR.sequence")
                        
    output.list <- list()
    for(i in 1:length(gene.list.refseq)){
        output.list[[i]] <- list()
        for(j in 1:length(output.params)){
            output.list[[i]][[j]] <- list()
        }
        names(output.list[[i]]) <- output.params
    }
    names(output.list) <- gene.list.refseq

    # *****************************************************************
    # LET THE COMPUTER DECIDE
    # *****************************************************************     
    cat("Extracting data",date(),"\n")
    
    for(i in 1:length(gene.list.refseq)){
        cat(rep(".",1))

        # extract the entrez.id of each gene in geneList 
        output.list[[i]]$entrez.id <- refseq.id[which(refseq.id$accession==gene.list.refseq[i]),]$gene_id

        # extract the symbol of each gene in geneList 
        output.list[[i]]$symbol <- symbol[which(symbol==output.list[[i]]$entrez.id),]$symbol
        
        # extract the gene description of each gene in geneList 
        output.list[[i]]$gene.description <- gene.description[which(gene.description==output.list[[i]]$entrez.id),]$gene_name  
        
        # Return the index of the target genes
        index.target.gene <- sequence <- NA;
        index.target.gene <- as.numeric(names(refseq.upstream.id[which(refseq.upstream.id==gene.list.refseq[i])]))
        
        # NOTE: some genes have repeated RefSeq.id which corresponds to the same gene in other Chromosomes.
        # by using index.target.gene[1] we are selecting only the first RefSeq entry.
        sequence <- upstream[index.target.gene[1]]
        sequence.length <- nchar(sequence)
        start.position <- ((sequence.length-number.bases.upstream)+1)
        sequence <- substring(toString(sequence),1:sequence.length,1:sequence.length)
        sequence <- sequence[start.position:sequence.length]
        output.list[[i]]$five.UTR.sequence <- sequence
    }
    cat("\n")
    cat("Computations DONE",date(),"\n")
    return(output.list)
}

Benjamin.

Saturday, June 23, 2012

Handling large FASTA sequence datasets in R: Shuffle and retrieve "n" number of sequences of fixed length from the whole FASTA file and export them in a new FASTA file.

When you are working with large FASTA datasets is probable to find out that the sequences are in sort of a mixed quality (obviously, depending on your scientific question),

I mean for example, imagine that you retrieve the whole collection of exons of a given organism and suppose that the FASTA file is 50mb and there are included ~50,000 DNA sequences but, if you look at them, you may find that there are sequences much larger than others and others will be probably 50 bases long.

Well that's what I mean with mixed quality and therefore, a sort of filter might be very helpful because you may find more informative to put a threshold and say "hey, from the pool of ~50K sequences I only want 5K sequences randomly chosen (from the entire FASTA dataset) and every sequence must have a fixed length, say 1000 bases long".

That's why I wrote some code in R language, it's a function titled "shuffleAndExtract" and you can download the example set and the source here.

Here is the description of the function:

handleFastaDatasets.R

shuffleAndExtract: This function in R is designed to open a fasta file dataset, shuffle the sequences and extract the desired sequences wanted by the user to generate a new dataset of fixed size (number of required sequences) and with the same length for each sequence.

And, after you download the example set and the R source file, you can try to run a very simple example:

NOTE: my implementation depends on the function "seqinr", if is not installed, you may do this before all the magic begin with a simple install.packages("seqinr").

# run example:
source("handleFastaDatasets.R")
shuffleAndExtract("example.fasta",1000,200)

The arguments of the function are:

inputFastaFile: name of the input fasta file
numberOfoutputSeqs: number of desired sequences in the output file
lengthOutputSeqs: fixed length of every sequence in the output file
initialPos: Position where the new window sizing will begin, default = 1
outputFileName: name of the output file, by default will be (e.g): "inputFastaFile.fasta.output.fasta"

CODE:

################################################################################

shuffleAndExtract <- function(inputFastaFile,
                              numberOfoutputSeqs,
                              lengthOutputSeqs,
                              initialPos=1,
                              outputFileName = paste(inputFastaFile,"output",
                                                     "fasta",sep=".")
                              ){

    #
    #    handleFastaDatasets.R   
    #
    #    This function in R is designed to open a fasta file dataset, shuffle the 
    #    sequences and extract the desired sequences wanted by the user to generate 
    #    a new dataset of fixed size (number of required sequences) and with the 
    #    same length for each sequence. 
    #
    #    Author: Benjamin Tovar
    #    Date: 22/JUNE/2012
    #
    ################################################################################

    #    run example:
    #    source("handleFastaDatasets.R")
    #    shuffleAndExtract("example.fasta",1000,200)

    # inputFastaFile: name of the input fasta file
    # numberOfoutputSeqs: number of desired sequences in the output file 
    # lengthOutputSeqs: fixed length of every sequence in the output file
    # initialPos: Position where the new window sizing will begin, default = 1
    # outputFileName: name of the output file, by default will be (e.g):
    #                 "inputFastaFile.fasta.output.fasta"

    cat("*** Starting computations |",date(),"****\n")    
        
    # Load the seqinr package
    require(seqinr)

    # Load the large seq and not shuffled dataset
    inputSeqs <- read.fasta(inputFastaFile)

    cat("\tProcessing for",length(inputSeqs),"sequences |",date(),"\n")    

    # Extract the length of every sequence and store the results in a vector.
    inputSeqsSize <- rep(NA,length(inputSeqs))
    for(i in 1:length(inputSeqs)){ 
        inputSeqsSize[i] <- summary(inputSeqs[[i]])$length
    }

    # Extract the index of the sequences which are longer than threshold
    inputSeqsLongSeqIndex <- which(inputSeqsSize > (lengthOutputSeqs+initialPos))

    # randomly pick numberOfoutputSeqs indexes that will be used to create 
    # the output dataset 
    inputSeqsIndex <- sample(inputSeqsLongSeqIndex,numberOfoutputSeqs,rep=F)
    
    # Store the Fasta header of each selected sequence in a vector
    inputSeqsIndexNames <- rep(NA,numberOfoutputSeqs)
    
    # create output object 
    outputSeqs <- list()
    for(i in 1:numberOfoutputSeqs){
        # Extract the fasta headers 
        inputSeqsIndexNames[i] <- attr(inputSeqs[[inputSeqsIndex[i]]],"name")
        # Extract the sequence
        outputSeqs[[i]] <- inputSeqs[[inputSeqsIndex[i]]][initialPos:((initialPos+lengthOutputSeqs)-1)]
    }
 
    # Export the sequences in a new fasta file 
    write.fasta(outputSeqs,inputSeqsIndexNames,outputFileName)
    
    cat("\n*** DONE |",date(),"****\n")
}


# 22 Jun 2012 | Benjamin Tovar

Benjamin

Sunday, May 6, 2012

First anniversary of my blog

Exactly today (5th/May), a year ago I decided to start with a project, mostly motivated in a way that I feel that probably some of my post will help someone in the future (bots included).

My motivation comes from, because my knowledge about Biology is ~80% and the other ~20% is about Computer Science, I thought that it will be a very cool idea to share some code, ideas, tutorials and even random messages for people that have not been surrounded with bunches of code in front of a UNIX Terminal.

I promise to still publishing posts, and in the same way I still continue my own training, the awesomeness :P of the posts will increase too.

Wishing you the best for every person which has visited my blog. Thank you so much and keep coding!

Benjamin.

Wednesday, May 2, 2012

My poster "Mining Biosequences" (Spanish) at the Scientific Poster Contest due to the BSc in Genomic Biotechnology anniversary at Faculty of Biological Sciences in UANL

My poster is an introduction to HMM using some awesome material from Eddy S (among other awesome authors)., hope you like it.

Click the image to download the poster.

Benjamin

Saturday, April 21, 2012

Calculate the average distance between a given DNA motif within DNA sequences in R

Suppose that we want to calculate the expected distance of a DNA motif within a DNA target sequence, if we know the composition bias or the probability distribution (multinomial model) we can compute it just fine.

Download the R code <- here

FIRST PART

For example, supose that we want to compute the expected distance of the motif "GATC" in a sequence composed of 10,000 bases given that the whole sequence follows a probability distribution of p(A) = 0.3, p(C) = 0.2, p(G) = 0.2, p(T) = 0.3.

So open an R prompt and load the functions:


source("motifOccurrence.R")

Then, enter the initial values:


lengthSeq <- 10000

motif <- c("G","A","T","C")

probDistr <- c(0.3,0.2,0.2,0.3)

And finally, compute the average expected distance of every occurrence of the motif inside the target sequence. Using the following formula (this computation corresponds to the "computeExpectedDistance" function of the R script "motifOccurrence.R"):

expectedDistance = lengthDNAseq/(lengthDNAseq*p))

where "p" stands for the joint probability of the motif, in other words: p = p(GATC) = p(G)*p(A)*p(T)*p(C) = 0.2*0.3*0.3*0.2 = 0.0036

To easily compute the expected distance in R, type:


expectedDist <- computeExpectedDistance(probDistr,lengthSeq,motif)

As you see, it returns the value of "277", this number means that, for the "GATC" motif inside a sequence of 10,000 bases with a composition bias of p(A) = 0.3, p(C) = 0.2, p(G) = 0.2, p(T) = 0.3 we may expect a distance of 277 bases between each "GATC".

Or, a little more graphic:

277 bases | GATC | 277 bases | GATC | .....

SECOND PART

Another way to compute this (even though it involves more computations) we can simulate X number of DNA sequences with a fixed length with an equal probability distribution per sequence and extract the coordinates of the motif within each sequence to finally compute the average distance of the motif.

There is a function titled "iterateComputeDistance" to do the calculations for you. Add the next parameter to the R environment:


iterations <- 100

Compute the average distance of the "GATC" motif within 100 DNA sequences (the other parameters remain equal)


expectedDistWithSimSeqs <- iterateComputeDistance(probDistr,lengthSeq,motif,iterations)

As we expected, the results of the two approaches are highly similar (ouuu yeah!)

THIRD PART

But, what happens when we already have a sequence and want to know the expected distance of that motif inside of it?.

Just like "Hey dude, I have an E.coli plasmid DNA sequence and want to know the average distance of the GATC motif".

Lets test using the sequence "Escherichia coli 2078 plasmid pQNR2078 complete sequence" <- http://www.ncbi.nlm.nih.gov/nuccore/HE613857.1

Ok, use the "ape" library to import the sequence to the R environment (if this library is not installed, type: install.packages("ape"))

NOTE: the GenBank sequences are in lowecase, so it will be needed to use a motif in lowercase to do the right computations.

Import the library:


require("ape")

Import the sequence:


plasmid <- read.GenBank("HE613857.1")

plasmidDNA <- as.character.DNAbin(plasmid)

plasmidDNA <- plasmidDNA[[1]]

motifEcoli <- c("g","a","t","c")

Get the coordinates:
plasmidDNAcoord <- coordMotif(plasmidDNA,motifEcoli)
Get the average distance between the motif occurrences.


plasmidDNAmotifDistance <- computeDistance(plasmidDNAcoord)

> plasmidDNAmotifDistance
[1] 270

Compute the number of occurrences of the motif among the plasmid (the result is 151 occurrences):

The number of occurrences of the motif in R is:


(length(plasmidDNAcoord)-1)

So, the main distance between the motif "gatc" finally is 270 bases and we are done :D

CODE:

#
#    Script: motifOccurrence.R
#    Author: Benjamin Tovar
#    Date: 21/April/2012
#
################################################################################

#                       ############
#                        FUNCTIONS:
#                       ############

##############################################################################
iterateComputeDistance <- function(multinomialDNAmodel,
                                   lengthDNAseq,
                                   motif,
                                   numberOfIterations){

    # This function returns the mean distance 
    # of a given motif given X number of DNA sequences given a multinomial model
    # (probability distribution of each base).
    
    # So, it will generate X number of DNA sequences using a given 
    # probability distribution and then it will compute the distance among 
    # that mofit within the total set of sequences to finally returns 
    # the average distance of the motif.
    
    result <- rep(NA,numberOfIterations)
    for(i in 1:numberOfIterations){
        currentGenome <- NA
        currentCoordinatesOfMotif <- NA        
        currentSequence <- sample(c("A","C","G","T"),
                                    lengthDNAseq,rep=T,
                                    prob=multinomialDNAmodel)
        currentCoordinatesOfMotif <- coordMotif(currentSequence,motif)
        result[i] <- computeDistance(currentCoordinatesOfMotif)
        cat(" *** Iteration number: ",i," completed *** | average distance = "
            ,result[i],"\n")  
    }
    result <- trunc(mean(result))
    cat(" \n*** Computation status: DONE ***\n\n")
    return(result)  
}
##############################################################################
coordMotif <- function(targetSequence,motif){

    # This function returns the coordinates of the motif of study in a target 
    # DNA sequence. In other words, if I found the motif, tell me exactly in
    # which position of the DNA sequence is.
    
    lengthMotif <- length(motif)
    lengthTargetSeq <- (length(targetSequence)-lengthMotif)
    motif <- toString(motif)
    motif <- gsub(", ","",motif)
    res <- 1    
    for(i in 1:lengthTargetSeq){
        currentTargetSeq <- targetSequence[i:(i+(lengthMotif)-1)]
        currentTargetSeq <- toString(currentTargetSeq)
        currentTargetSeq <- gsub(", ","",currentTargetSeq)
        if(currentTargetSeq == motif){
            res[(length(res)+1)] <- i
        }
    }
    return(res)
}
##############################################################################
computeDistance <- function(coordinatesOfMotif){
    
    # This function returns the mean distance 
    # of a given motif given its coordinates within a target DNA sequence.
    # In other words, If I already got a list with the coordinates where the 
    # motif is inside a DNA sequence, tell me the average distance between
    # this coordinates to get the expected distance of that motif.

    currentDistance <- rep(NA,(length(coordinatesOfMotif)-1))
    lengthCoord <- length(currentDistance)
    for(i in 1:lengthCoord){
        currentDistance[i] <- coordinatesOfMotif[i+1]-coordinatesOfMotif[i]
    }
    res <- trunc(mean(currentDistance))
    return(res)   
}
##############################################################################
computeExpectedDistance <- function(multinomialModel,
                                    lengthDNAseq,
                                    motif){
                                    
    # This function computes the expected distance of a given motif in a DNA
    # sequence given its multinomial model (probability distribution of 
    # each base)
    
    # Convert the motif into an index                                       
    motifIndex <- gsub("A",1,motif); motifIndex <- gsub("C",2,motifIndex)
    motifIndex <- gsub("G",3,motifIndex); motifIndex <- gsub("T",4,motifIndex)
    motifIndex <- as.numeric(motifIndex)
    # Compute p value of the motif given the multinomial model
    p <- rep(NA,length(motif))
    for(i in 1:length(motifIndex)){
        p[i] <- multinomialModel[motifIndex[i]]
    }    
    p <- prod(p)
    result <- trunc(lengthDNAseq/(lengthDNAseq*p))
    return(result)
}                                   
##############################################################################

# Benjamin

Benjamin

Friday, April 13, 2012

Introduction to Markov Chains and modeling DNA sequences in R

Markov chains are probabilistic models which can be used for the modeling of sequences given a probability distribution and then, they are also very useful for the characterization of certain parts of a DNA or protein string given for example, a bias towards the AT or GC content.

For now, I am going to introduce how to build our own Markov Chain of zero order and first order in R programming language. The definition of a zero order Markov Chain relies in that, the current state (or nucleotide) does not depends on the previous state, so there's no "memory" and every state is untied.

For the first order Markov Chain the case is different because the current state actually depends only on the previous state. Given that points clear, a second order Markov Model will be a model that reflects that the current state only depends on the previous two states before it (This model will be useful for the study of codons, given that they are substrings of 3 nucleotides long).

Download the scripts (R script and graphviz scripts) here

Example of a Markov Chain of zero order (the current nucleotide is totally independent of the previous nucleotide).

The multinomial model is:
p(A)+p(C)+p(G)+p(T) = 1.0
0.4 +0.1 +0.1 +0.4 = 1.0

Example of the structure of the zero order model:

Example of a Markov Chain of first order (the current nucleotide only depends on the previous nucleotide).

The multinomial model per base is:
p(A|A)+p(C|A)+p(G|A)+p(T|A) = 1.0
p(A|C)+p(C|C)+p(G|C)+p(T|C) = 1.0
p(A|G)+p(C|G)+p(G|G)+p(T|G) = 1.0
p(A|T)+p(C|T)+p(G|T)+p(T|T) = 1.0

So:
0.6 + 0.1 + 0.1 + 0.2 = 1.0
0.1 + 0.5 + 0.3 + 0.1 = 1.0
0.05+ 0.2 + 0.7 + 0.05 = 1.0
0.4 + 0.05+0.05 + 0.5 = 1.0

Example of the structure of the first order model:

Code:

#    Author: Benjamin Tovar
#    Date: 13/April/2012
#
#    Example of a Markov Chain of zero order (the current nucleotide is
#    totally independent of the previous nucleotide).

#    The multinomial model is:
#    p(A)+p(C)+p(G)+p(T) = 1.0
#    0.4 +0.1 +0.1 +0.4  = 1.0

# Define the DNA alphabet that will be used to put names to the objects
alp <- c("A","C","G","T")
# Create the vector that represents the probability distribution of the model
zeroOrder <- c(0.4,0.1,0.1,0.4)
# Put the name of reference of each base 
names(zeroOrder) <- alp 
# Create a sequence of 1000 bases using this model.
zeroOrderSeq <- sample(alp,1000,rep=T,prob=zeroOrder)

# ***** Study the composition bias of the sequence *****
# We wil use the "seqinr" package.
# For the installation of the package, type:
# install.packages("seqinr")
# Then, load the package:
require("seqinr")

# Count the frequency of each base 
# in the sequence using the "count" function
zeroOrderFreq <- count(zeroOrderSeq,1,alphabet=alp,freq=TRUE)

# Count the frequency of dinucleotides 
# in the sequence using the "count" function
zeroOrderFreqDin <- count(zeroOrderSeq,2,alphabet=alp,freq=TRUE) 
 
# Now, plot the results in the same plot:
layout(1:2)
barplot(zeroOrderFreq,col=1:4,main="Compositional bias of each nucleotide",
        xlab="Base",
        ylab="Base proportion")
barplot(zeroOrderFreqDin,col=rainbow(16),
        main="Compositional bias of each dinucleotide",
        xlab="Base",
        ylab="Base proportion")

# &&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&

#    Example of a Markov Chain of first order (the current nucleotide only
#    depends on the previous nucleotide).

#    The multinomial model per base is:
#    p(A|A)+p(C|A)+p(G|A)+p(T|A) = 1.0
#    p(A|C)+p(C|C)+p(G|C)+p(T|C) = 1.0
#    p(A|G)+p(C|G)+p(G|G)+p(T|G) = 1.0
#    p(A|T)+p(C|T)+p(G|T)+p(T|T) = 1.0        

#    So:
#    0.6 + 0.1 + 0.1 + 0.2  = 1.0
#    0.1 + 0.5 + 0.3 + 0.1  = 1.0
#    0.05+ 0.2 + 0.7 + 0.05 = 1.0
#    0.4 + 0.05+0.05 + 0.5  = 1.0
    
    
# Create the matrix that will store the probability distribution given 
# a certain nucleotide:
firstOrderMat <- matrix(NA,nr=4,nc=4)
# Put names to the 2 dimensions of the matrix 
colnames(firstOrderMat) <- rownames(firstOrderMat) <- alp 
# Add the probability distribution per base:
firstOrderMat[1,] <- c(0.6,0.1,0.1,0.2)  # Given an A in the 1st position
firstOrderMat[2,] <- c(0.1,0.5,0.3,0.1)  # Given a  C in the 1st position
firstOrderMat[3,] <- c(0.05,0.2,0.7,0.05)# Given a  G in the 1st position
firstOrderMat[4,] <- c(0.4,0.05,0.05,0.5)# Given a  T in the 1st position

# Now we got a matrix
#    > firstOrderMat
#         A    C    G    T
#    A 0.60 0.10 0.10 0.20
#    C 0.10 0.50 0.30 0.10
#    G 0.05 0.20 0.70 0.05
#    T 0.40 0.05 0.05 0.50

# In order to continue, we need an initial probability distribution to know
# which base is the most probable to start up the sequence.
inProb <- c(0.4,0.1,0.1,0.4); names(inProb) <- alp
# So, the sequence will have a 40% to start with an A or a T and 10% with C or G

# Create a function to generate the sequence.
# NOTE: To load the function to the current environment, just copy
# the entire function and paste it inside the R prompt.

generateFirstOrderSeq <- function(lengthSeq,
                                  alphabet,  
                                  initialProb,
                                  firstOrderMatrix){
#    lengthSeq = length of the sequence
#    alphabet = alphabet that compounds the sequence 
#    initialProb   = initial probability distribution
#    firstOrderMatrix = matrix that stores the probability distribution of a 
#                       first order Markov Chain

    # Construct the object that stores the sequence
    outputSeq <- rep(NA,lengthSeq)
    # Which is the first base:
    outputSeq[1]  <- sample(alphabet,1,prob=initialProb) 
    # Let the computer decide:
    for(i in 2:length(outputSeq)){
        prevNuc <- outputSeq[i-1]    
        currentProb <- firstOrderMatrix[prevNuc,]
        outputSeq[i] <- sample(alp,1,prob=currentProb)
    } 
    cat("** DONE: Sequence computation is complete **\n")
    return(outputSeq)
}

# Use the generateFirstOrderSeq function to generate a sequence of 1000 bases 
# long
firstOrderSeq <- generateFirstOrderSeq(1000,alp,inProb,firstOrderMat)

# ***** Study the composition bias of the sequence *****
# We wil use the "seqinr" package.
# For the installation of the package, type:
# install.packages("seqinr")
# Then, load the package:
require("seqinr")

# Count the frequency of each base 
# in the sequence using the "count" function
firstOrderFreq <- count(firstOrderSeq,1,alphabet=alp,freq=TRUE)

# Count the frequency of dinucleotides 
# in the sequence using the "count" function
firstOrderFreqDin <- count(firstOrderSeq,2,alphabet=alp,freq=TRUE) 
 
# Now, plot the results in the same plot:
layout(1:2)
barplot(firstOrderFreq,col=1:4,main="Compositional bias of each nucleotide",
        xlab="Base",
        ylab="Base proportion")
barplot(firstOrderFreqDin,col=rainbow(16),
        main="Compositional bias of each dinucleotide",
        xlab="Base",
        ylab="Base proportion")
        
## Lets plot the 4 plots in one window
    layout(matrix(1:4,nr=2,nc=2))
    # Results from the zero order 
    barplot(zeroOrderFreq,col=1:4,
            main="Compositional bias of each nucleotide\nZero Order Markov Chain",
            xlab="Base",
            ylab="Base proportion")
    barplot(zeroOrderFreqDin,col=rainbow(16),
            main="Compositional bias of each dinucleotide\nZero Order Markov Chain",
            xlab="Base",
            ylab="Base proportion")
    # Results from the first order 
    barplot(firstOrderFreq,col=1:4,
            main="Compositional bias of each nucleotide\nFirst Order Markov Chain",
            xlab="Base",
            ylab="Base proportion")
    barplot(firstOrderFreqDin,col=rainbow(16),
            main="Compositional bias of each dinucleotide\nFirst Order Markov Chain",
            xlab="Base",
            ylab="Base proportion")
        
        
# end.

Plot of the zero order model:

Plot of the first order model

All plots

Benjamin

Wednesday, April 11, 2012

Generate artificial DNA or protein sequences in R in a single line of code.

To generate an artificial DNA sequence of "n" bases long with a fixed composition bias in just one line of code, just open your R prompt and type:


seqX <- sample(c("A","C","G","T"),10000,rep=TRUE,prob=c(0.4,0.1,0.1,0.4))

As you see, the alphabet is the four letter alphabet of the DNA (so, if desired, you can replace that alphabet with any other alphabet that you might need, like the amino acids 20 letter code), the next parameter is the length of the desired output sequence. rep=TRUE means that each letter of the alphabet could be repeated and finally, I think that the most useful parameter of the function sample is the prob parameter because it allows you to select the multinomial model (the proportion or "bias" of each base). For example our simulated sequence is 80% AT rich and 20% GC rich given that for the "A" base we got a probability of 0.4, for the "C" base 0.2 and so on.

Now, to check it out that our artificial sequence follows that bias, a simple plot will tell us more than thousand words:

Lets extract the proportion of each base along the sequence using the table and the length function:


freqX <- table(seqX)/length(seqX)

Now, lets plot it doing:


barplot(freqX,col=1:4,main="Compositional bias of seqX",xlab="Base",ylab="Base proportion")

And finally we got this awesome plot.

So, the barplot shows that effectively each base is well represented by the multinomial model and our artificial sequence is loyal to it.

Benjamin

Epic R is Epic <- Beginners command reference card

This reference card has been written by Tom Short.

Click the image to see the R magic:

Thank you for your support Tom.
Benjamin

Friday, April 6, 2012

Using the Blogger app for Android..

Well, the application has a simple and elegant UI.

Let's check it out and see how it works :P

Install R 2.15 and further versions in Debian Squeeze

The last Friday, March 30th, the last stable version of R, the version 2.15.0 was released.

So, to install it in Debian Squeeze, or in another Distro powered by Debian (I actually use CrunchBang Linux), just follow the same instructions described here for the installation of R 2.14.2 by clicking here.

Do not forget to upgrade the packages too.

Benjamin.

Install octave 3.6.1 in Debian Squeeze

Following the nice instructions found it here <- http://verahill.blogspot.mx/2012/02/debian-testing-wheezy-64-compiling.html , I have successfully installed Octave 3.6.1 in Squeeze (I had to rename some packages names to fit with the Squeeze packages).

FIRST STEP: Install the required libraries and dependencies.



sudo apt-get install gfortran build-essential

sudo apt-get install libqhull-dev libpcre++-dev libblas-dev liblapack-dev libreadline-dev

sudo apt-get install libcurl4-openssl-dev

sudo apt-get install libfltk1.1-dev

sudo apt-get install libgraphicsmagick++-dev

sudo apt-get install libhdf5-serial-dev

sudo apt-get install libqrupdate-dev

sudo apt-get install libsuitesparse-dev

sudo apt-get install  glpk gperf flex bison libfontconfig1-dev

SECOND STEP: Download Octave and extract it.



wget ftp://ftp.gnu.org/gnu/octave/octave-3.6.1.tar.gz

tar -xvf octave-3.6.1.tar.gz

THIRD STEP: Compile it.

 cd octave-3.6.1/

./configure

make -j3

where 3 is the number of cores +1 (for me 2 cores)

FOURTH STEP: Validate the compiled version:

make check

FIFTH STEP: Install Octave



sudo make install

That's all. Thanks to Lindqvist - a blog about Linux and Science. Mostly.
Benjamin

Thursday, March 29, 2012

UPDATE: Install R 2.14.2 or R 2.15 in Debian Squeeze

Since not so long, I upgraded the version of R that is included in the Debian repositories (2.11.1 in Squeeze) and I was very happy with it since 2010. But there was some packages that I could not install and then, I decided to upgrade R despite of having to upgrade every package too (I have more than 1 GB in my home R packages folder).

There's nothing new or tricky about how to upgrade, the awesome guys of CRAN always brings an answer for our needs.

FIRST PART: ADD R BACKPORTS:

First, open a Terminal and open the sources.list file:



$ gksudo gedit /etc/apt/sources.list

Then, add these lines at the bottom of the file (Note, I use the UCLA server, but this can be easily changed taking a look here for the mirrors):

## R BACKPORTS

deb http://cran.stat.ucla.edu/bin/linux/debian squeeze-cran/

#deb-src http://cran.stat.ucla.edu/bin/linux/debian squeeze-cran/

The original set of instructions can be seen here.

SECOND PART: RENAME THE R PACKAGES FOLDER:

There's a folder where R uses to store the packages we download, just rename it to the current version of R.

For example, mine was "2.11" and then I just renamed it to "2.14" (or "2.15", without quotes) and was inside this path:

Before:

/home/benjamin/R/x86_64-pc-linux-gnu-library/2.11

After:

/home/benjamin/R/x86_64-pc-linux-gnu-library/2.15

Remember that some packages also needs to install some files in folders that belongs to the root, so, I would recommend to open R in sudo mode (only if you're sure about what you're doing :P) just by executing R this way: "sudo R" and then, in the R console type :

update.packages(checkBuilt=TRUE, ask=FALSE)

THIRD PART: SECURE APT:

The Debian backports archives on CRAN are signed with the key ID 381BA480, to add them, in a Terminal prompt type:


gpg --keyserver pgp.mit.edu --recv-key 381BA480

gpg -a --export 381BA480 > jranke_cran.asc
sudo  apt-key add jranke_cran.asc

FOURTH PART: UPDATE AND UPGRADE R:

Save the file and you can either enter to Synaptic, update the packages list and then just upgrade the packages or in a terminal type:


sudo apt-get update

sudo apt-get upgrade

And that's all.

Benjamin

Sunday, March 25, 2012

Transcription Factors and gene expression demands ...

Currently, I feel sad about how much abandoned this project is.. I know it have been a very long time since the last post but, to be honest I am doing my best to finish my thesis research and finally conclude my undergraduate education.

I promise to come back and since the last post I have learned a lot of new things that will be pretty cool to be shared. Specially in R language: Markov Chains, Hidden Markov Models and their applications. All the programming lessons will be from the scratch.

Thanks.
Benjamin

Labels

Sunday, December 2, 2012

Póster presentado en la XIV Escuela de Otoño en Biología Matemática, México

Brownian motion simulation in R

Sunday, July 29, 2012

Extracting upstream regions of a RefSeq human gene list in R using Bioconductor

Saturday, June 23, 2012

Handling large FASTA sequence datasets in R: Shuffle and retrieve "n" number of sequences of fixed length from the whole FASTA file and export them in a new FASTA file.

Sunday, May 6, 2012

First anniversary of my blog

Wednesday, May 2, 2012

My poster "Mining Biosequences" (Spanish) at the Scientific Poster Contest due to the BSc in Genomic Biotechnology anniversary at Faculty of Biological Sciences in UANL

Saturday, April 21, 2012

Calculate the average distance between a given DNA motif within DNA sequences in R

Friday, April 13, 2012

Introduction to Markov Chains and modeling DNA sequences in R

Wednesday, April 11, 2012

Generate artificial DNA or protein sequences in R in a single line of code.

Epic R is Epic <- Beginners command reference card

Friday, April 6, 2012

Using the Blogger app for Android..

Install R 2.15 and further versions in Debian Squeeze

Install octave 3.6.1 in Debian Squeeze

Thursday, March 29, 2012

UPDATE: Install R 2.14.2 or R 2.15 in Debian Squeeze

Sunday, March 25, 2012

Transcription Factors and gene expression demands ...

POWERED BY LINUX

POWERED BY #!