Saturday, June 23, 2012

Handling large FASTA sequence datasets in R: Shuffle and retrieve "n" number of sequences of fixed length from the whole FASTA file and export them in a new FASTA file.

When you are working with large FASTA datasets is probable to find out that the sequences are in sort of a mixed quality (obviously, depending on your scientific question),

I mean for example, imagine that you retrieve the whole collection of exons of a given organism and suppose that the FASTA file is 50mb and there are included ~50,000 DNA sequences but, if you look at them, you may find that there are sequences much larger than others and others will be probably 50 bases long.

Well that's what I mean with mixed quality and therefore, a sort of filter might be very helpful because you may find more informative to put a threshold and say "hey,  from the pool of ~50K sequences I only want 5K sequences randomly chosen (from the entire FASTA dataset) and every sequence must have a fixed length, say 1000 bases long".

That's why I wrote some code in R language, it's a function titled "shuffleAndExtract" and you can download the example set and the source here.

Here is the description of the function:

handleFastaDatasets.R 

shuffleAndExtract: This function in R is designed to open a fasta file dataset, shuffle the sequences and extract the desired sequences wanted by the user to generate a new dataset of fixed size (number of required sequences) and with the same length for each sequence.


And, after you download the example set and the R source file, you can try to run a very simple example:

NOTE: my implementation depends on the function "seqinr", if is not installed, you may do this before all the magic begin with a simple install.packages("seqinr").

        # run example:
        source("handleFastaDatasets.R")
        shuffleAndExtract("example.fasta",1000,200)

The arguments of the function are:

     inputFastaFile: name of the input fasta file
     numberOfoutputSeqs: number of desired sequences in the output file
     lengthOutputSeqs: fixed length of every sequence in the output file
     initialPos: Position where the new window sizing will begin, default = 1
     outputFileName: name of the output file, by default will be (e.g):  "inputFastaFile.fasta.output.fasta"

CODE:

################################################################################

shuffleAndExtract <- function(inputFastaFile,
                              numberOfoutputSeqs,
                              lengthOutputSeqs,
                              initialPos=1,
                              outputFileName = paste(inputFastaFile,"output",
                                                     "fasta",sep=".")
                              ){

    #
    #    handleFastaDatasets.R   
    #
    #    This function in R is designed to open a fasta file dataset, shuffle the 
    #    sequences and extract the desired sequences wanted by the user to generate 
    #    a new dataset of fixed size (number of required sequences) and with the 
    #    same length for each sequence. 
    #
    #    Author: Benjamin Tovar
    #    Date: 22/JUNE/2012
    #
    ################################################################################

    #    run example:
    #    source("handleFastaDatasets.R")
    #    shuffleAndExtract("example.fasta",1000,200)

    # inputFastaFile: name of the input fasta file
    # numberOfoutputSeqs: number of desired sequences in the output file 
    # lengthOutputSeqs: fixed length of every sequence in the output file
    # initialPos: Position where the new window sizing will begin, default = 1
    # outputFileName: name of the output file, by default will be (e.g):
    #                 "inputFastaFile.fasta.output.fasta"

    cat("*** Starting computations |",date(),"****\n")    
        
    # Load the seqinr package
    require(seqinr)

    # Load the large seq and not shuffled dataset
    inputSeqs <- read.fasta(inputFastaFile)

    cat("\tProcessing for",length(inputSeqs),"sequences |",date(),"\n")    

    # Extract the length of every sequence and store the results in a vector.
    inputSeqsSize <- rep(NA,length(inputSeqs))
    for(i in 1:length(inputSeqs)){ 
        inputSeqsSize[i] <- summary(inputSeqs[[i]])$length
    }

    # Extract the index of the sequences which are longer than threshold
    inputSeqsLongSeqIndex <- which(inputSeqsSize > (lengthOutputSeqs+initialPos))

    # randomly pick numberOfoutputSeqs indexes that will be used to create 
    # the output dataset 
    inputSeqsIndex <- sample(inputSeqsLongSeqIndex,numberOfoutputSeqs,rep=F)
    
    # Store the Fasta header of each selected sequence in a vector
    inputSeqsIndexNames <- rep(NA,numberOfoutputSeqs)
    
    # create output object 
    outputSeqs <- list()
    for(i in 1:numberOfoutputSeqs){
        # Extract the fasta headers 
        inputSeqsIndexNames[i] <- attr(inputSeqs[[inputSeqsIndex[i]]],"name")
        # Extract the sequence
        outputSeqs[[i]] <- inputSeqs[[inputSeqsIndex[i]]][initialPos:((initialPos+lengthOutputSeqs)-1)]
    }
 
    # Export the sequences in a new fasta file 
    write.fasta(outputSeqs,inputSeqsIndexNames,outputFileName)
    
    cat("\n*** DONE |",date(),"****\n")
}


# 22 Jun 2012 | Benjamin Tovar

Benjamin