Thursday, June 2, 2011

Simple Perl script for computing the reverse complement of a single DNA string

This entry will help you if you are looking for a Perl script that reads an input file (it can be in FASTA format or just the sequence without the FASTA header) that contains just one sequence (at least for now, later on I will upload a script that can handle n number of sequences inside a single file in FASTA format), computes the reverse complement of the string and then create an output file in FASTA format (even if the original file was not in FASTA format).

The script is named "rev_com.pl" and can be downloaded by clicking in the script name.

NOTE: In a Linux/UNIX system we can set the properties of the script to be executable and then put it inside the "bin" folder (like /usr/local/bin or /home/USER/bin), So in this way we can run the script in every directory we want without the need to copy it to every folder where we want to run the script.

Here is an example of my personal "bin" folder:

But in this short Tutorial I will run the Perl script that is inside the folder of my DNA sequences and not the one that is inside my "bin" folder.



Lets run an example to see how it works ;)

Lets imagine that the got this two files inside a folder:
  1. One is in FASTA format <- D_mel.fa 
  2. The other one just contains the DNA sequence <- dna_sequence


    And we want to compute the reverse complement of each file and export them in a new output file in FASTA format with the FASTA header.

    So, lets run the script:

    NOTE: remember that we got to access the folder with the Terminal to execute the script. We can do this by using the UNIX command "cd" and finally get access to the folder that contains our files.
    1. STEP 1 <- Access the folder with the terminal
    2. STEP 2 <- Execute the Perl script:

    $ perl rev_com.pl

    Example of executing the Perl script

    EXPLANATION OF THE PARAMETERS

    perl <- You are telling the Terminal that you want to run a Perl script.

    rev_com.pl <- name of the perl script.

    STEP 3 <- Write the name of the input file (if the file has extension, do not forget to type it too):


    STEP 4 <- Hit enter and run! (it was a joke):

    Enjoy the results directly from the Terminal

    In the bottom of the Terminal, you can read how is the name of the output file :D

    Enjoy the results of another example of an input file that is not in FASTA format (it just contains the DNA sequence pasted in)

    Finally, we can go to the folder and see our new files that contains the reverse complement of an input user string:


    Can you see it now? The script had named the output files with the ".fa" extension and inside the file, we can see the FASTA headers in making them officially FASTA readable and recognizable files.

    ##############################################

    If you are curious and want to know how to run the script from the "bin" folder, here is an screenshot about it:


    As you can see, there is no need to tell the Terminal that we want to run a Perl script and another good thing about running programs from the bin folder is that we do not need to copy them to every input folder (where our input files lives in).

    #!/usr/bin/perl
    #       rev_com.pl
    #       
    #       Copyright 2011 Benjamin Tovar 
    #       
    #       This program is free software; you can redistribute it and/or modify
    #       it under the terms of the GNU General Public License as published by
    #       the Free Software Foundation; either version 2 of the License, or
    #       (at your option) any later version.
    #       
    #       This program is distributed in the hope that it will be useful,
    #       but WITHOUT ANY WARRANTY; without even the implied warranty of
    #       MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    #       GNU General Public License for more details.
    #       
    #       You should have received a copy of the GNU General Public License
    #       along with this program; if not, write to the Free Software
    #       Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
    #       MA 02110-1301, USA.
    #
    ################################################################################
    #
    # NAME OF THE PROGRAM: rev_com.pl
    # DATE: 01/Jun/2011
    # AUTHOR: Benjamin Tovar
    # COMMENTS: Little script that can be used for the computation of 
    # the reverse complement of a DNA string and then export the resulting string
    # to a file in Fasta format with fasta headers (even if the original file was not)
    #
    ################################################################################
    
    use warnings;
    use strict;
    
    #### script introduction
    
    print "\n\n-------------------------------------------------------------\n\n";
    
    print "PROGRAM DEFINITION: rev_com.pl <- PERL SCRIPT WRITTEN BY BENJAMIN TOVAR
    THAT COMPUTES THE REVERSE COMPLEMENT OF DNA STRINGS AND EXPORT THE RESULTS TO AN 
    OUTPUT FILE IN FASTA FORMAT (EVEN IF THE ORIGINAL FILE WAS NOT)\n\n";
    
    print "This program will ask the user to type the name of the file 
    (if the file name has an extension, write it please) that contains the DNA sequence\n\n";
    
    #### Ask the user for the name of the file
    
    print "Please, type the complete name of the file:\n\n";
    
    my $user_in =;
    
    ### Remove empty spaces
    
    chomp $user_in;
    
    ### Open the file
    
    open(INPUT_FILE,$user_in) or die "\n\n WARNING: The file does not exist, please check the spelling, the extension and the existence of the file\n\n";
    
    ### Copy the content of the file to an array variable called "@input_file"
    
    my @input_file = ;
    
    ### Remove the Fasta header and extract the DNA sequence
    
    my $input_file = extract_sequence_from_fasta_data(@input_file);
    
    ### Close the opened file
    
    close INPUT_FILE;
    
    ### Compute the reverse DNA string
    
    my $rev_com = reverse ($input_file);
    
    ### Compute the complement DNA string
    
    $rev_com =~ tr/ATGCatgc/TACGtacg/;
    
    ### Compute the length of the input DNA string
    
    my $length = length ($input_file);
    
    ############## RESULTS #################################################
    
    print "\n----------- RESULTS --------------\n\n";
    
    print "Input file string of length $length:\n\n";
    
    print $input_file,"\n";
    
    print "\nOutput reverse complement of the DNA string:\n\n";
    
    print $rev_com, "\n\n";
    
    ############## EXPORTING THE RESULTS TO A FILE IN FASTA FORMAT #########
    
    ######################## Naming the output file: #######################
    
    ### sdfgh
    
    my $fasta_header_name = extract_fasta_header_name(@input_file);
    
    ## Remove the input file extension
    
    $user_in =~ s/\..*//; 
    
    # Concatenate the name of the file (without the .fa/.fasta extension) with "-rev_com.fa"
    
    my $rev_com_name = "-revcom.fa";
    
    my $output_name = $user_in . $rev_com_name;
    
    # Name of the output file
    
    my $out = "$output_name";
    
    # Set the file handle "OUTPUT".
    
    open (OUTPUT, ">$out"); 
    
    # Print the results (content) of the variable "$rev_com" (this variable contains the reverse complement string) 
    # into a file named "$output_name" and put the Fasta header before the output DNA string with "$fasta_header_name","-reverse_complementary\n"
    
    print OUTPUT "$fasta_header_name","-reverse_complement\n","$rev_com";
    
    
    print "-------- EXPORT THE RESULTS TO A FILE IN FASTA FORMAT ----------\n";
    print "\nThe output string has been exported to the file \"$output_name\"\n\n";
    
    exit;
    
    ################################################################################
    ############################### SUBROUTINES ####################################
    ################################################################################
    
    
    ################################################################################
    # extract_fasta_header_name
    # A subroutine to extract the FASTA header of the original input file
    # and use it to name the FASTA header of the output file
    ################################################################################
    
    sub extract_fasta_header_name{
    
        my(@fasta_file_data) = @_;
    
        use strict;
        use warnings;
    
        # Declare and initialize variables
        my $fasta_header_name = '';
    
        foreach my $line (@fasta_file_data) {
        
            if($line =~ /^>/) {
                
                $fasta_header_name = $line;
        
        # If the file is not in Fasta format, use the name of the file to name the fasta header           
            } else {
            
                $fasta_header_name = ">$user_in";
            
            }
            
        # Remove non-sequence data (in this case, whitespace) from $fasta_header_name string
            $fasta_header_name =~ s/\s//g;
            
        # Export the results of the subroutine to the main program    
            return $fasta_header_name;
            
        }
    }
    
    ################################################################################
    # extract_sequence_from_fasta_data
    # A subroutine to extract FASTA sequence data from an array
    # taken from James Tisdall's Beginning Perl for Bioinformatics
    ################################################################################
    
    sub extract_sequence_from_fasta_data {
    
        my(@fasta_file_data) = @_;
    
        use strict;
        use warnings;
    
        # Declare and initialize variables
        my $sequence = '';
    
        foreach my $line (@fasta_file_data) {
    
            # discard blank line
            if ($line =~ /^\s*$/) {
                next;
    
            # discard comment line
            } elsif($line =~ /^\s*#/) {
                next;
    
            # discard fasta header line
            } elsif($line =~ /^>/) {
                next;
    
            # keep line, add to sequence string
            } else {
                $sequence .= $line;
            }
        }
    
        # remove non-sequence data (in this case, whitespace) from $sequence string
        $sequence =~ s/\s//g;
    
        return $sequence;
    }
    

    Benjamin.