Wednesday, April 11, 2012

Generate artificial DNA or protein sequences in R in a single line of code.

To generate an artificial DNA sequence of  "n" bases long with a fixed composition bias in just one line of code, just open your R prompt and type:

seqX <- sample(c("A","C","G","T"),10000,rep=TRUE,prob=c(0.4,0.1,0.1,0.4))

As you see, the alphabet is the four letter alphabet of the DNA (so, if desired, you can replace that alphabet with any other alphabet that you might need, like the amino acids 20 letter code), the next parameter is the length of the desired output sequence. rep=TRUE means that each letter of the alphabet could be repeated and finally, I think that the most useful parameter of the function sample is the prob parameter because it allows you to select the multinomial model (the proportion or "bias" of each base). For example our simulated sequence is 80% AT rich and 20% GC rich given that for the "A" base we got a probability of 0.4, for the "C" base 0.2 and so on.

Now, to check it out that our artificial sequence follows that bias, a simple plot will tell us more than thousand words:

Lets extract the proportion of each base along the sequence using the table and the length function:

freqX <- table(seqX)/length(seqX)

Now, lets plot it doing:

barplot(freqX,col=1:4,main="Compositional bias of seqX",xlab="Base",ylab="Base proportion")

And finally we got this awesome plot.

So, the barplot shows that effectively each base is well represented by the multinomial model and our artificial sequence is loyal to it.