Monday, July 11, 2011

Bash script: Extract DNA sequences from each motif of MEME output files and parse them in FASTA format

This script parses the DNA sequences that conforms each motif in MEME output files ("meme.txt") and finally exports them in FASTA format.

MEME (Multiple EM for Motif Elicitation) program is a very powerful tool for motif mining in sequence datasets (This means that the program looks for overrepresented substrings with statistical significance among every sequence inside the dataset. A practical and common use is, for example, characterization in silico of motifs that could function as promoters upstream of a dataset constructed with CDS).

For more information about this program and other very useful Bioinformatics tools for motif mining tasks, take a look here in the MEME-suite home page.

So, with this script you will be able to extract every DNA sequence dataset that conforms each MEME motif.

The script is called extract-FASTA_from-MEME.sh 

To use it, just change the properties of the file to be executable:


$ chmod +x extract-FASTA_from-MEME.sh


and finally, just execute inside the folder that contains all the meme output files (meme.txt, meme.output.txt and so on):


$ ./extract-FASTA_from-MEME.sh


Code:

#!/bin/bash

# extract-FASTA_from-MEME.sh
#
# I used this script to parse the DNA sequences obtained 
# from each motif of MEME output files "meme.txt"
# to generate a single FASTA file per motif.

# Finally I used the FASTA files to build PWMs

# Author: Benjamin Tovar
# Date: 11 July 2011

###########################################################
# Parse the data that is among the line "BL MOTIF" and "//":
# to retrieve the DNA sequences that defines each motif
##########################################################

for meme_file in *.txt
do
    sed -n '/BL   MOTIF/,/\/\//p' $meme_file > $meme_file.sed
done;

##########################################################
# Split every DNA motif into separated files in "*.csplit"
# format
##########################################################

for sed_file in *.sed
do
csplit -z $sed_file '/^BL   MOTIF/' '{*}' --suffix="%02d.csplit" --prefix=$sed_file- -s
done

##########################################################
# Parse the DNA sequences from each *.csplit files
##########################################################

for csplit_file in *.csplit
do

# grep -v '^$' <- delete blank lines
# sed 's/1//g' <- deletes the number "1" from the line.
 
cut -c34-150 $csplit_file |grep -v '^$' | sed 's/1//g' > $csplit_file.cut
done

##########################################################
# Generate Fasta files 
##########################################################

for cut_file in *.cut

do
pr -n:3 -t -T $cut_file | sed 's/^[ ]*/>/' | tr ":" "\n" | fold -w 100 > $cut_file.fa
done

# remove unnecessary files:

rm *.sed | rm *.csplit | rm *.cut

# Rename the FASTA files

rename -f 's/\.csplit.cut.fa$/\.fa/' *.fa

# Benjamin Tovar

Benjamin