This script parses the DNA sequences that conforms each motif in MEME output files ("meme.txt") and finally exports them in FASTA format.
MEME (Multiple EM for Motif Elicitation) program is a very powerful tool for motif mining in sequence datasets (This means that the program looks for overrepresented substrings with statistical significance among every sequence inside the dataset. A practical and common use is, for example, characterization in silico of motifs that could function as promoters upstream of a dataset constructed with CDS).
MEME (Multiple EM for Motif Elicitation) program is a very powerful tool for motif mining in sequence datasets (This means that the program looks for overrepresented substrings with statistical significance among every sequence inside the dataset. A practical and common use is, for example, characterization in silico of motifs that could function as promoters upstream of a dataset constructed with CDS).
For more information about this program and other very useful Bioinformatics tools for motif mining tasks, take a look here in the MEME-suite home page.
So, with this script you will be able to extract every DNA sequence dataset that conforms each MEME motif.
The script is called extract-FASTA_from-MEME.sh
To use it, just change the properties of the file to be executable:
$ chmod +x extract-FASTA_from-MEME.sh
and finally, just execute inside the folder that contains all the meme output files (meme.txt, meme.output.txt and so on):
$ ./extract-FASTA_from-MEME.sh
Code:
#!/bin/bash
# extract-FASTA_from-MEME.sh
#
# I used this script to parse the DNA sequences obtained
# from each motif of MEME output files "meme.txt"
# to generate a single FASTA file per motif.
# Finally I used the FASTA files to build PWMs
# Author: Benjamin Tovar
# Date: 11 July 2011
###########################################################
# Parse the data that is among the line "BL MOTIF" and "//":
# to retrieve the DNA sequences that defines each motif
##########################################################
for meme_file in *.txt
do
sed -n '/BL MOTIF/,/\/\//p' $meme_file > $meme_file.sed
done;
##########################################################
# Split every DNA motif into separated files in "*.csplit"
# format
##########################################################
for sed_file in *.sed
do
csplit -z $sed_file '/^BL MOTIF/' '{*}' --suffix="%02d.csplit" --prefix=$sed_file- -s
done
##########################################################
# Parse the DNA sequences from each *.csplit files
##########################################################
for csplit_file in *.csplit
do
# grep -v '^$' <- delete blank lines
# sed 's/1//g' <- deletes the number "1" from the line.
cut -c34-150 $csplit_file |grep -v '^$' | sed 's/1//g' > $csplit_file.cut
done
##########################################################
# Generate Fasta files
##########################################################
for cut_file in *.cut
do
pr -n:3 -t -T $cut_file | sed 's/^[ ]*/>/' | tr ":" "\n" | fold -w 100 > $cut_file.fa
done
# remove unnecessary files:
rm *.sed | rm *.csplit | rm *.cut
# Rename the FASTA files
rename -f 's/\.csplit.cut.fa$/\.fa/' *.fa
# Benjamin Tovar
Benjamin
No comments:
Post a Comment
Note: Only a member of this blog may post a comment.