Friday, July 29, 2011

fasta2clustal.pl <- Perl script for convert aligned and gapless FASTA files to aligned and gapless CLUSTAL formated files

Fasta to Clustal:

Hello there, recently I am working with literally thousands of PWM (Positional Weight Matrices) generated from MEME and MotifClick output.

So, to use the motifs in FASTA format to generate PWM I need to align them first to generate CLUSTAL formated files but errors occur when gaps delivered from the alignment appear..

Yep, the UGENE PWM builder does not like gaps and shows up errors building the PWM.

The situation is that you already got an aligned and gapless FASTA file and you only need to convert it to CLUSTAL without "the need" to align the sequences once more and deal with gaps.

So, here is my Perl script called: fasta2clustal.pl

STEP 1 <- To execute it, just change the permission of the file to run as a program:

$ chmod +x fasta2clustal.pl


STEP 2 <- To run the program (you can copy and paste it inside your bin path or run the script locally):

# From the bin folder (The folder must contain the target FASTA files):

$ fasta2clustal.pl <input_fasta_file.fasta>

EXAMPLE: fasta2clustal.pl fasta_sequences.fasta

# From the local folder (Which contain the script and the target FASTA files)

$ ./fasta2clustal.pl <input_fasta_file.fasta>

EXAMPLE: ./fasta2clustal.pl fasta_sequences.fasta

SHORT TUTORIAL:

RUNNING THE SCRIPT


INPUT FILE:



OUTPUT FILE:


Code:

#!/usr/bin/perl


#       fasta2clustal.pl
#       
#       Copyright 2011 Benjamin Tovar
#       
#       This program is free software; you can redistribute it and/or modify
#       it under the terms of the GNU General Public License as published by
#       the Free Software Foundation; either version 2 of the License, or
#       (at your option) any later version.
#       
#       This program is distributed in the hope that it will be useful,
#       but WITHOUT ANY WARRANTY; without even the implied warranty of
#       MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#       GNU General Public License for more details.
#       
#       You should have received a copy of the GNU General Public License
#       along with this program; if not, write to the Free Software
#       Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
#       MA 02110-1301, USA.
#
################################################################################
#
# DATE: 27/July/2011
# AUTHOR: Benjamin Tovar
#
# This program converts aligned and gapless FASTA file to
# a gapless CLUSTAL file, ideal for convert motifs to PWM.
#
################################################################################

use warnings;
use strict;

###### Print USAGE if no argument is given
my $usage = "\nUSAGE: fasta2clustal.pl 
EXAMPLE: fasta2clustal.pl dna_sequence.fa\n\n";

###### Get the arguments:  
my $input_fasta_file = shift or die $usage;

###### Let the computer decide:

 my $dna_sequence ='';
 my $sequence_name='';
 my $output_file_name='';

 open(INPUT_FILE,$input_fasta_file);
 my @file = ;
 close INPUT_FILE;

  # ClUSTAL HEADER
  print "CLUSTAL W 2.0 multiple sequence alignment\n\n";

 foreach my $line(@file){
  
  # Discard empty lines
  if($line =~ /^\s*$/){
  next;
  }
  
  else{
    # Extract sequence names
    if($line =~ /^>/) {
     $sequence_name = $line;
     $sequence_name =~ s/\s//g;
     $sequence_name =~ s/\>//g;
     $sequence_name =~ s/\Start_position/_Start_position/g;
     $sequence_name =~ s/\:/_/g;
     chomp $sequence_name;
    }

    # Extract DNA sequence
    else{
    $dna_sequence = $line;
    chomp $dna_sequence;
  # Print results
                printf("%-50s %10s \n", "$sequence_name","$dna_sequence");
    } 
  }
 }

# Powred by #!CrunchBang Linux
# Benjamin Tovar

exit;

Benjamin