Saturday, July 30, 2011

MEME2fasta.sh <- Bash script to Parse MEME motifs into separated FASTA files

Hi, this script Parses motifs from MEME output files (meme.txt) and print them in separated FASTA formated files.

The script parses every *.txt meme output file inside the target folder where you run it to automatize the procedure.

You can download the script here: MEME2fasta.sh

NOTES: It works in Debian and Debian based Linux systems and I have not tested yet in another Linux distributions.

In order to run the script:

STEP 1 <- To execute it, just change the permission of the file to run as a program:

$ chmod +x MEME2fasta.sh

STEP 2 <- To run the program (you can copy and paste it inside your bin path or run the script locally):
# From the bin folder:
# Go to the path of the target meme.txt output files and then:

$ MEME2fasta.sh

# From the local folder (Which contain the script and the target meme.txt files)

$ ./MEME2fasta.sh


SHORT TUTORIAL

INPUT FOLDER AND INPUT FILES:



OUTPUT FOLDER AND OUTPUT FILES:


Code:

#!/bin/bash

# MEME2fasta.sh
#
# I used this script to parse the DNA sequences obtained 
# from each motif of MEME output files "meme.txt"
# to generate a single FASTA file per motif.

# Finally I used the FASTA files to build PWMs

# Author: Benjamin Tovar
# Date: 11 July 2011

###########################################################
# Parse the data that is among the line "BL MOTIF" and "//":
# to retrieve the DNA sequences that defines each motif
##########################################################

for meme_file in *.txt
    do
        sed -n '/BL   MOTIF/,/\/\//p' $meme_file > $meme_file.sed
    done;

##########################################################
# Split every DNA motif into separated files in "*.csplit"
# format
##########################################################

for sed_file in *.sed
    do
        csplit -z $sed_file '/^BL   MOTIF/' '{*}' --suffix="%02d.csplit" --prefix=$sed_file- -s
    done

##########################################################
# Parse the DNA sequences from each *.csplit files
##########################################################

for csplit_file in *.csplit
    do
        # grep -v '^$' <- delete blank lines
        # sed 's/1//g' <- deletes the number "1" from the line. 
        cut -c34-150 $csplit_file |grep -v '^$' | sed 's/1//g' > $csplit_file.cut
    done

##########################################################
# Generate Fasta files 
##########################################################

for cut_file in *.cut
    do
        pr -n:3 -t -T $cut_file | sed 's/^[ ]*/>/' | tr ":" "\n" | fold -w 100 > $cut_file.fa
    done

# remove unnecessary files:

rm *.sed | rm *.csplit | rm *.cut

# Rename the FASTA files

rename -f 's/\.csplit.cut.fa$/\.fa/' *.fa 

rename -f 's/.txt.sed//s' *.fa

exit;

# Benjamin Tovar

Benjamin

Friday, July 29, 2011

fasta2clustal.pl <- Perl script for convert aligned and gapless FASTA files to aligned and gapless CLUSTAL formated files

Fasta to Clustal:

Hello there, recently I am working with literally thousands of PWM (Positional Weight Matrices) generated from MEME and MotifClick output.

So, to use the motifs in FASTA format to generate PWM I need to align them first to generate CLUSTAL formated files but errors occur when gaps delivered from the alignment appear..

Yep, the UGENE PWM builder does not like gaps and shows up errors building the PWM.

The situation is that you already got an aligned and gapless FASTA file and you only need to convert it to CLUSTAL without "the need" to align the sequences once more and deal with gaps.

So, here is my Perl script called: fasta2clustal.pl

STEP 1 <- To execute it, just change the permission of the file to run as a program:

$ chmod +x fasta2clustal.pl


STEP 2 <- To run the program (you can copy and paste it inside your bin path or run the script locally):

# From the bin folder (The folder must contain the target FASTA files):

$ fasta2clustal.pl <input_fasta_file.fasta>

EXAMPLE: fasta2clustal.pl fasta_sequences.fasta

# From the local folder (Which contain the script and the target FASTA files)

$ ./fasta2clustal.pl <input_fasta_file.fasta>

EXAMPLE: ./fasta2clustal.pl fasta_sequences.fasta

SHORT TUTORIAL:

RUNNING THE SCRIPT


INPUT FILE:



OUTPUT FILE:


Code:

#!/usr/bin/perl


#       fasta2clustal.pl
#       
#       Copyright 2011 Benjamin Tovar
#       
#       This program is free software; you can redistribute it and/or modify
#       it under the terms of the GNU General Public License as published by
#       the Free Software Foundation; either version 2 of the License, or
#       (at your option) any later version.
#       
#       This program is distributed in the hope that it will be useful,
#       but WITHOUT ANY WARRANTY; without even the implied warranty of
#       MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#       GNU General Public License for more details.
#       
#       You should have received a copy of the GNU General Public License
#       along with this program; if not, write to the Free Software
#       Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
#       MA 02110-1301, USA.
#
################################################################################
#
# DATE: 27/July/2011
# AUTHOR: Benjamin Tovar
#
# This program converts aligned and gapless FASTA file to
# a gapless CLUSTAL file, ideal for convert motifs to PWM.
#
################################################################################

use warnings;
use strict;

###### Print USAGE if no argument is given
my $usage = "\nUSAGE: fasta2clustal.pl 
EXAMPLE: fasta2clustal.pl dna_sequence.fa\n\n";

###### Get the arguments:  
my $input_fasta_file = shift or die $usage;

###### Let the computer decide:

 my $dna_sequence ='';
 my $sequence_name='';
 my $output_file_name='';

 open(INPUT_FILE,$input_fasta_file);
 my @file = ;
 close INPUT_FILE;

  # ClUSTAL HEADER
  print "CLUSTAL W 2.0 multiple sequence alignment\n\n";

 foreach my $line(@file){
  
  # Discard empty lines
  if($line =~ /^\s*$/){
  next;
  }
  
  else{
    # Extract sequence names
    if($line =~ /^>/) {
     $sequence_name = $line;
     $sequence_name =~ s/\s//g;
     $sequence_name =~ s/\>//g;
     $sequence_name =~ s/\Start_position/_Start_position/g;
     $sequence_name =~ s/\:/_/g;
     chomp $sequence_name;
    }

    # Extract DNA sequence
    else{
    $dna_sequence = $line;
    chomp $dna_sequence;
  # Print results
                printf("%-50s %10s \n", "$sequence_name","$dna_sequence");
    } 
  }
 }

# Powred by #!CrunchBang Linux
# Benjamin Tovar

exit;

Benjamin

Wednesday, July 13, 2011

Bash tips: copy and move files avoiding the annoying "Argument list too long"

Hello there! Today I was into an interesting situation:

First: I had more than 80,000 GENBANK (*.gb) files inside a folder in combination with 100 FASTA files (*.fa) and 100 MAFFT alignment files (*.mafft).

Second: I wanted to create separated folders and then just create a list or summary of every kind of file with a simple:


# Create a summary of every FASTA file inside this folder:

$ ls *.fa | sort > fasta_summary


Third: Yep, it worked for the FASTA files and the MAFFT ones, but my life suddenly changed when I tried to use that code line for the GENBANK files because I got this -> "Argument list too long".

Ok I said, lets do it different, and finally here is the solution:

1) To create a summary of every genbank file inside the folder.


# be sure to be inside the folder with the Terminal
# Let Perl work ;)

$ perl -e 'opendir(DIR, "."); @all = grep /.gb/, readdir DIR; closedir DIR; print "@all\n";' | xargs ls > GENBANK_FILES_SUMMARY


grep/.gb/ <- Perl will look for that regular expression and list every file that have the ".gb" extension (you could  adapt the argument depending on your needs).


2) To copy every GENBANK file to a folder called "GENBANK_FOLDER-COPY":


$ find -name "*.gb" | xargs -i cp {} GENBANK_FOLDER-COPY/


3) To move every GENBANK file to a folder called "GENBANK_FOLDER":

$ find -name "*.gb" | xargs -i mv {} GENBANK_FOLDER/


# Description of the last line:

$ find source/ -name "*.txt" | xargs -i mv {} target/ <- Where "source/" is the input path and "target/" is the output path. "-name "*.txt" is the regular expression to look up for every *.txt file inside the input folder.

Check out the "cp" argument in task #2 and "mv" argument in task #3 for copying and moving respectively.

Hope this helps someone.

Benjamin

Monday, July 11, 2011

Bash script: Extract DNA sequences from each motif of MEME output files and parse them in FASTA format

This script parses the DNA sequences that conforms each motif in MEME output files ("meme.txt") and finally exports them in FASTA format.

MEME (Multiple EM for Motif Elicitation) program is a very powerful tool for motif mining in sequence datasets (This means that the program looks for overrepresented substrings with statistical significance among every sequence inside the dataset. A practical and common use is, for example, characterization in silico of motifs that could function as promoters upstream of a dataset constructed with CDS).

For more information about this program and other very useful Bioinformatics tools for motif mining tasks, take a look here in the MEME-suite home page.

So, with this script you will be able to extract every DNA sequence dataset that conforms each MEME motif.

The script is called extract-FASTA_from-MEME.sh 

To use it, just change the properties of the file to be executable:


$ chmod +x extract-FASTA_from-MEME.sh


and finally, just execute inside the folder that contains all the meme output files (meme.txt, meme.output.txt and so on):


$ ./extract-FASTA_from-MEME.sh


Code:

#!/bin/bash

# extract-FASTA_from-MEME.sh
#
# I used this script to parse the DNA sequences obtained 
# from each motif of MEME output files "meme.txt"
# to generate a single FASTA file per motif.

# Finally I used the FASTA files to build PWMs

# Author: Benjamin Tovar
# Date: 11 July 2011

###########################################################
# Parse the data that is among the line "BL MOTIF" and "//":
# to retrieve the DNA sequences that defines each motif
##########################################################

for meme_file in *.txt
do
    sed -n '/BL   MOTIF/,/\/\//p' $meme_file > $meme_file.sed
done;

##########################################################
# Split every DNA motif into separated files in "*.csplit"
# format
##########################################################

for sed_file in *.sed
do
csplit -z $sed_file '/^BL   MOTIF/' '{*}' --suffix="%02d.csplit" --prefix=$sed_file- -s
done

##########################################################
# Parse the DNA sequences from each *.csplit files
##########################################################

for csplit_file in *.csplit
do

# grep -v '^$' <- delete blank lines
# sed 's/1//g' <- deletes the number "1" from the line.
 
cut -c34-150 $csplit_file |grep -v '^$' | sed 's/1//g' > $csplit_file.cut
done

##########################################################
# Generate Fasta files 
##########################################################

for cut_file in *.cut

do
pr -n:3 -t -T $cut_file | sed 's/^[ ]*/>/' | tr ":" "\n" | fold -w 100 > $cut_file.fa
done

# remove unnecessary files:

rm *.sed | rm *.csplit | rm *.cut

# Rename the FASTA files

rename -f 's/\.csplit.cut.fa$/\.fa/' *.fa

# Benjamin Tovar

Benjamin