Suppose that we want to calculate the expected distance of a DNA motif within a DNA target sequence, if we know the composition bias or the probability distribution (multinomial model) we can compute it just fine.

Download the R code <-

here
FIRST PART
For example, supose that we want to compute the expected distance of the motif

"GATC" in a sequence composed of 10,000 bases given that the whole sequence follows a probability distribution of p(A) = 0.3, p(C) = 0.2, p(G) = 0.2, p(T) = 0.3.

So open an R prompt and load the functions:

```
source("motifOccurrence.R")
```

Then, enter the initial values:

```
lengthSeq <- 10000
```

motif <- c("G","A","T","C")

probDistr <- c(0.3,0.2,0.2,0.3)

And finally, compute the average expected distance of every occurrence of the motif inside the target sequence. Using the following formula (this computation corresponds to the

"computeExpectedDistance" function of the R script

"motifOccurrence.R"):

expectedDistance = lengthDNAseq/(lengthDNAseq*p))
where

"p" stands for the joint probability of the motif, in other words: p = p(GATC) = p(G)*p(A)*p(T)*p(C) = 0.2*0.3*0.3*0.2 = 0.0036

To easily compute the expected distance in R, type:

```
expectedDist <- computeExpectedDistance(probDistr,lengthSeq,motif)
```

As you see, it returns the value of

"277", this number means that, for the

"GATC" motif inside a sequence of 10,000 bases with a composition bias of p(A) = 0.3, p(C) = 0.2, p(G) = 0.2, p(T) = 0.3 we may expect a distance of 277 bases between each

"GATC".

Or, a little more graphic:

277 bases | GATC | 277 bases | GATC | .....

SECOND PART
Another way to compute this (even though it involves more computations) we can simulate X number of DNA sequences with a fixed length with an equal probability distribution per sequence and extract the coordinates of the motif within each sequence to finally compute the average distance of the motif.

There is a function titled

"iterateComputeDistance" to do the calculations for you. Add the next parameter to the R environment:

```
iterations <- 100
```

Compute the average distance of the

"GATC" motif within 100 DNA sequences (the other parameters remain equal)

```
expectedDistWithSimSeqs <- iterateComputeDistance(probDistr,lengthSeq,motif,iterations)
```

As we expected, the results of the two approaches are highly similar (ouuu yeah!)

THIRD PART
But, what happens when we already have a sequence and want to know the expected distance of that motif inside of it?.

Just like "Hey dude, I have an E.coli plasmid DNA sequence and want to know the average distance of the GATC motif".

Lets test using the sequence

"Escherichia coli 2078 plasmid pQNR2078 complete sequence" <- http://www.ncbi.nlm.nih.gov/nuccore/HE613857.1

Ok, use the

"ape" library to import the sequence to the R environment (if this library is not installed, type:

install.packages("ape"))

NOTE: the GenBank sequences are in lowecase, so it will be needed to use a motif in lowercase to do the right computations.

Import the library:

```
require("ape")
```

Import the sequence:

```
plasmid <- read.GenBank("HE613857.1")
```

plasmidDNA <- as.character.DNAbin(plasmid)

plasmidDNA <- plasmidDNA[[1]]

motifEcoli <- c("g","a","t","c")

Get the coordinates:

`plasmidDNAcoord <- coordMotif(plasmidDNA,motifEcoli)`

Get the average distance between the motif occurrences.

```
plasmidDNAmotifDistance <- computeDistance(plasmidDNAcoord)
```

> plasmidDNAmotifDistance
[1] 270
Compute the number of occurrences of the motif among the plasmid (the result is 151 occurrences):

The number of occurrences of the motif in R is:

```
(length(plasmidDNAcoord)-1)
```

So, the main distance between the motif

"gatc" finally is 270 bases and we are done :D

CODE:

```
#
# Script: motifOccurrence.R
# Author: Benjamin Tovar
# Date: 21/April/2012
#
################################################################################
# ############
# FUNCTIONS:
# ############
##############################################################################
iterateComputeDistance <- function(multinomialDNAmodel,
lengthDNAseq,
motif,
numberOfIterations){
# This function returns the mean distance
# of a given motif given X number of DNA sequences given a multinomial model
# (probability distribution of each base).
# So, it will generate X number of DNA sequences using a given
# probability distribution and then it will compute the distance among
# that mofit within the total set of sequences to finally returns
# the average distance of the motif.
result <- rep(NA,numberOfIterations)
for(i in 1:numberOfIterations){
currentGenome <- NA
currentCoordinatesOfMotif <- NA
currentSequence <- sample(c("A","C","G","T"),
lengthDNAseq,rep=T,
prob=multinomialDNAmodel)
currentCoordinatesOfMotif <- coordMotif(currentSequence,motif)
result[i] <- computeDistance(currentCoordinatesOfMotif)
cat(" *** Iteration number: ",i," completed *** | average distance = "
,result[i],"\n")
}
result <- trunc(mean(result))
cat(" \n*** Computation status: DONE ***\n\n")
return(result)
}
##############################################################################
coordMotif <- function(targetSequence,motif){
# This function returns the coordinates of the motif of study in a target
# DNA sequence. In other words, if I found the motif, tell me exactly in
# which position of the DNA sequence is.
lengthMotif <- length(motif)
lengthTargetSeq <- (length(targetSequence)-lengthMotif)
motif <- toString(motif)
motif <- gsub(", ","",motif)
res <- 1
for(i in 1:lengthTargetSeq){
currentTargetSeq <- targetSequence[i:(i+(lengthMotif)-1)]
currentTargetSeq <- toString(currentTargetSeq)
currentTargetSeq <- gsub(", ","",currentTargetSeq)
if(currentTargetSeq == motif){
res[(length(res)+1)] <- i
}
}
return(res)
}
##############################################################################
computeDistance <- function(coordinatesOfMotif){
# This function returns the mean distance
# of a given motif given its coordinates within a target DNA sequence.
# In other words, If I already got a list with the coordinates where the
# motif is inside a DNA sequence, tell me the average distance between
# this coordinates to get the expected distance of that motif.
currentDistance <- rep(NA,(length(coordinatesOfMotif)-1))
lengthCoord <- length(currentDistance)
for(i in 1:lengthCoord){
currentDistance[i] <- coordinatesOfMotif[i+1]-coordinatesOfMotif[i]
}
res <- trunc(mean(currentDistance))
return(res)
}
##############################################################################
computeExpectedDistance <- function(multinomialModel,
lengthDNAseq,
motif){
# This function computes the expected distance of a given motif in a DNA
# sequence given its multinomial model (probability distribution of
# each base)
# Convert the motif into an index
motifIndex <- gsub("A",1,motif); motifIndex <- gsub("C",2,motifIndex)
motifIndex <- gsub("G",3,motifIndex); motifIndex <- gsub("T",4,motifIndex)
motifIndex <- as.numeric(motifIndex)
# Compute p value of the motif given the multinomial model
p <- rep(NA,length(motif))
for(i in 1:length(motifIndex)){
p[i] <- multinomialModel[motifIndex[i]]
}
p <- prod(p)
result <- trunc(lengthDNAseq/(lengthDNAseq*p))
return(result)
}
##############################################################################
# Benjamin
```

Benjamin