Saturday, July 30, 2011

MEME2fasta.sh <- Bash script to parse MEME motifs into separate FASTA files

Hi, this script parses motifs from MEME output files (meme.txt) and prints each motif to a separate FASTA-formatted file.

To automate the procedure, the script processes every *.txt MEME output file in the folder where you run it.
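Because the script picks up every *.txt file in the folder, it can help to check first which files actually look like MEME output. Here is a minimal sketch of such a check. The function name list_meme_files is hypothetical (it is not part of MEME2fasta.sh); the "BL   MOTIF" marker is the one the script itself searches for.

```shell
#!/bin/bash

# Hypothetical helper (not part of MEME2fasta.sh):
# print the *.txt files in the current folder that contain
# a line starting with the "BL   MOTIF" marker.
list_meme_files() {
    local f
    for f in *.txt
        do
            # Skip the literal "*.txt" when no file matches the glob
            [ -f "$f" ] || continue
            if grep -q '^BL   MOTIF' "$f"
                then
                    printf '%s\n' "$f"
            fi
        done
}

list_meme_files
```

Any *.txt file that is not listed has no motif blocks, so it is probably best moved out of the folder before running the script.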

You can download the script here: MEME2fasta.sh

NOTES: It works on Debian and Debian-based Linux systems; I have not yet tested it on other Linux distributions.

In order to run the script:

STEP 1 <- Make the file executable so it can be run as a program:

$ chmod +x MEME2fasta.sh

STEP 2 <- Run the program (you can either copy it into a folder on your PATH or run it locally):
# From a folder on your PATH:
# Go to the folder that contains the target meme.txt output files and then:

$ MEME2fasta.sh

# From the local folder (which contains the script and the target meme.txt files):

$ ./MEME2fasta.sh


SHORT TUTORIAL

INPUT FOLDER AND INPUT FILES:



OUTPUT FOLDER AND OUTPUT FILES:
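The output names can be worked out from the script itself: each motif block of "name.txt" passes through "name.txt.sed-NN.csplit.cut.fa" and is finally renamed to "name-NN.fa". The sketch below only predicts that final name. The function fasta_name is hypothetical, and it assumes the pieces are numbered from 00, which is how GNU csplit numbers its output files.

```shell
#!/bin/bash

# Hypothetical helper (not part of MEME2fasta.sh):
# predict the final FASTA file name for one motif block.
#   $1 = MEME output file name (for example meme.txt)
#   $2 = piece index as a plain integer, starting at 0
fasta_name() {
    local meme_file=$1
    local index=$2
    # Drop the trailing ".txt" and append "-NN.fa"
    printf '%s-%02d.fa\n' "${meme_file%.txt}" "$index"
}

fasta_name meme.txt 0    # prints meme-00.fa
fasta_name meme.txt 2    # prints meme-02.fa
```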


Code:

#!/bin/bash

# MEME2fasta.sh
#
# I used this script to parse the DNA sequences obtained 
# from each motif of MEME output files "meme.txt"
# to generate a single FASTA file per motif.

# Finally I used the FASTA files to build PWMs

# Author: Benjamin Tovar
# Date: 11 July 2011

###########################################################
# Extract the lines between "BL   MOTIF" and "//"
# to retrieve the DNA sequences that define each motif
###########################################################

for meme_file in *.txt
    do
        # Quote the variable so file names with spaces still work
        sed -n '/BL   MOTIF/,/\/\//p' "$meme_file" > "$meme_file.sed"
    done

##########################################################
# Split every DNA motif into a separate file in "*.csplit"
# format
##########################################################

for sed_file in *.sed
    do
        # -s: silent, -z: skip empty pieces,
        # -f: output prefix, -b: output suffix format
        csplit -s -z -f "$sed_file-" -b '%02d.csplit' "$sed_file" '/^BL   MOTIF/' '{*}'
    done

##########################################################
# Parse the DNA sequences from each *.csplit file
##########################################################

for csplit_file in *.csplit
    do
        # cut -c34-150 <- keep only columns 34 to 150 of each line
        # grep -v '^$' <- delete blank lines
        # sed 's/1//g' <- delete every "1" character from the line
        cut -c34-150 "$csplit_file" | grep -v '^$' | sed 's/1//g' > "$csplit_file.cut"
    done

##########################################################
# Generate Fasta files 
##########################################################

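for cut_file in *.cut
    do
        # pr -n:3 numbers each line ("  1:SEQUENCE"), sed turns the
        # leading spaces into ">", and tr moves the sequence to its own line
        pr -n:3 -t -T "$cut_file" | sed 's/^[ ]*/>/' | tr ":" "\n" | fold -w 100 > "$cut_file.fa"
    done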

# remove unnecessary files:

rm -f -- *.sed *.csplit *.cut

# Rename the FASTA files

rename -f 's/\.csplit\.cut\.fa$/.fa/' *.fa

rename -f 's/\.txt\.sed//' *.fa

exit;

# Benjamin Tovar
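The FASTA step above numbers each sequence and puts the number on a ">" header line. As an alternative sketch (this is not the method used in the script), the same layout can be produced with awk. The function name seqs_to_fasta is hypothetical. Unlike the pr/tr pipeline, this version also drops trailing spaces, skips blank lines, and does not wrap long lines.

```shell
#!/bin/bash

# Hypothetical helper (not part of MEME2fasta.sh):
# read one sequence per line from standard input and print
# a FASTA record for each, with headers ">1", ">2", ...
seqs_to_fasta() {
    awk 'NF > 0 {
        n++            # count only the non-blank lines
        print ">" n    # header line
        print $1       # first field = sequence without surrounding spaces
    }'
}

# Example with two made-up sequences
printf 'ACGT  \nTTGA\n' | seqs_to_fasta
# prints:
# >1
# ACGT
# >2
# TTGA
```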

Benjamin