Wednesday, June 29, 2011

Bash script to split an input FASTA file into single-sequence files in FASTA format

Hello everyone! Yesterday I downloaded a FASTA file with more than 16,500 DNA sequences, but I needed each sequence in its own file: about 16,500 files, each containing a single DNA sequence.

To do so, I wrote this short and simple bash script called "fasta_split.sh".

To use the script:

$ ./fasta_split.sh

Here is an example. Just follow the instructions the program gives you, and that's all.



Example input/output folders and input/output files:


 Code:

#!/bin/bash

#       fasta_split.sh
#       
#       2011 - Benjamin Tovar 
#       
#
# NAME OF THE PROGRAM: fasta_split.sh
# 
# DATE: 29/JUN/2011
# 
# AUTHOR: Benjamin Tovar
# 
# COMMENTS: This script will split a FASTA file that contains many sequences into
# separate files containing one single sequence per file.
#
# I used this script to split a FASTA file containing ~16,500 sequences into
# ~16,500 independent files quickly and simply.
#
################################################################################

# BEGINNING OF THE PROGRAM
echo
echo "This script relies on the program called \"csplit\" available in Linux/UNIX systems."

# Ask the user to type the name of the input FASTA file:
echo
echo -n "   Enter the name of the input file in FASTA format: "
    read input_file

# Ask the user to type the name of the output FASTA files. Note that in the csplit
# command near the end of this script, --suffix="%02d.fa" means that every output
# file will have the extension ".fa" (--suffix is an accepted abbreviation of GNU
# csplit's --suffix-format option). If you would like to use another extension
# (for example, ".fasta"), just replace ".fa" with ".fasta" this way -> --suffix="%02d.fasta"

# The "02" in --suffix="%02d.fa" means that every file will be numbered with at least
# two digits, following the order of the sequences in the original FASTA input file.

# For example: for an input file that contains 2 sequences and with --suffix="%02d.fa"
# the output will be:
#
# outputfile-00.fa
# outputfile-01.fa
#
# For example: for an input file that contains 2 sequences and with --suffix="%04d.fa"
# the output will be:
#
# outputfile-0000.fa
# outputfile-0001.fa
echo      
echo -n "   Enter the name of the output files: "
    read output_file

# Ask the user to type the name of the output folder that will contain all the output files   
echo 
echo -n "   Enter the name of the output folder: "
read dir
echo
echo "Creating directory called \"$dir\"."

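# Create the output folder (variables are quoted so that names with spaces work)
mkdir -p "$dir" || exit 1

# Copy the input FASTA file to the output folder
cp "$input_file" "$dir" || exit 1

# Move into the output folder
cd "$dir" || exit 1
echo
echo "   ...splitting FASTA file ...saving them in \"$dir\"."

# Inside the output folder the copy is known by its base name only,
# even if the user typed a path to the input file
input_copy=$(basename "$input_file")

# Splitting the input FASTA file (for the naming settings, read the comments
# above the prompt for the name of the output files)
csplit -z "$input_copy" '/^>/' '{*}' --suffix="%02d.fa" --prefix="$output_file-" -s

# Delete the copy of the input FASTA file in the output folder
rm "$input_copy"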
echo
echo "Printing output summary in a file called \"OUTPUT-SUMMARY.out\""

# Create the output summary (the summary file itself is left out of the list)
ls | grep -vxF 'OUTPUT-SUMMARY.out' > OUTPUT-SUMMARY.out
echo
echo " ---- PROCESS DONE ---- "
echo
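
To see the effect of the suffix format described in the script comments, here is a minimal sketch, assuming GNU coreutils csplit (the '{*}' repeat count and the -z option are GNU extensions). The function name split_fasta_demo, the two-sequence file demo.fa and the temporary folder are made up for illustration; only the '/^>/' pattern and the "outputfile-" prefix come from the script above.

```shell
#!/bin/bash
# Hypothetical demo: split a tiny two-sequence FASTA file and list the
# files that csplit creates for a given number of suffix digits.
split_fasta_demo() {
    # usage: split_fasta_demo DIGITS
    local width=$1 workdir
    workdir=$(mktemp -d) || return 1
    (
        cd "$workdir" || exit 1
        # Made-up input: two short sequences
        printf '>seq1\nACGT\n>seq2\nGGCC\n' > demo.fa
        # Start a new piece at every line that begins with ">"
        csplit -s -z --prefix=outputfile- \
            --suffix-format="%0${width}d.fa" demo.fa '/^>/' '{*}'
        rm demo.fa
        ls
    )
    rm -rf "$workdir"
}
```

Calling "split_fasta_demo 2" should list outputfile-00.fa and outputfile-01.fa, and "split_fasta_demo 4" should list outputfile-0000.fa and outputfile-0001.fa, as in the examples in the script comments.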


Benjamin

Thursday, June 16, 2011

And here I go ..

Hello everyone! I realized how important it is to get the right box, the right colors and the right font, among other things, to present source code well.

So this morning I added some lines to my blog's template.

And this is the cool output:


#!/bin/bash

echo "random code"

exit;

Benjamin

Monday, June 13, 2011

How to automatically download every image that is linked from a web page with a bash script

This time I did not write any code. Instead, I want to share with you this very useful bash script written by Darrin Goodman.

For version 1 of the script and its full reference, please go here, and click here for the full reference of version 2 (the one that is posted in this blog entry).

There are occasions when you might wish to download any or all of the images that are linked from a web page, such as when a thumbnail image links to a larger version of the same image. In all those cases this script will do the job for you ;).

NOTE: This script depends on programs such as "awk", "lynx", "grep" and "wget".

If you use Debian or a Debian-based distro you must install the "lynx" program first:

$ sudo apt-get install lynx

So let's begin:

STEP 1 <- Download the script here

STEP 2 <- Make it executable 

$ chmod +x  image_downloader-v2.sh

STEP 3 <- Run the script

$ ./image_downloader-v2.sh

Example

Enjoy ;)

Code:

 #!/bin/bash  
 # Written by Darrin Goodman with inspiration from:  
 # http://www.go2linux.org/linux/2010/09/how-download-all-links-webpage-including-hidden-776  
 # THIS PROGRAM WILL DOWNLOAD IMAGES THAT ARE LINKED FROM A WEBSITE,  
 # SUCH AS WHEN THERE IS A THUMBNAIL IMAGE THAT IS LINKED TO A LARGER VERSION OF THE SAME IMAGE.  
 # THIS IS VERSION 2 - THE SIMPLE IMAGE DOWNLOADER - NO EXTRA FRILLS  
 # THE MORE FULL-FEATURED VERSION CAN BE FOUND HERE:   
 # http://www.hilltopyodeler.com/blog/?p=324  
   
 function grabURL {  
 echo -n " Enter the desired URL or 'q' to QUIT: "
 # Use the URL typed by the user (it is kept in the variable "a")
 read -r a
 if [ "$a" == "q" ]
 then
 # figlet Done!
 echo "Done!"
 echo
 echo
 exit 0
 else
 echo " The URL that you entered is: $a"
 echo
 echo "Ok...... working........................."
 echo
 fi
 }  
 function imageDownload {  
 grabURL  
 echo  
   
 # GRAB A LIST OF ALL IMAGES BEING LINKED TO AND STORE THEM IN FILE CALLED images.txt
 # THIS LIST HAS BEEN EXPANDED TO ALSO GRAB SWF's AND FLV's
 # ONE CASE-INSENSITIVE grep REPLACES THE TEN SEPARATE lynx CALLS, SO THE PAGE IS FETCHED ONLY ONCE
 lynx --dump "$a" | awk '/http/{print $2}' | grep -iE 'png|jpg|gif|flv|swf' > images.txt

 # LOOP THROUGH THE LIST OF IMAGES STORED IN images.txt AND DOWNLOAD THEM TO THE CURRENT DIRECTORY
 for i in $(cat images.txt); do wget "$i"; done
 echo "//////////////////////////////////////////////////////////////////////////////"  
 echo  
 echo "Your images have downloaded to your current working directory."  
 echo  
 echo "Thank you for using ImageDownloader"  
 echo  
 # figlet Done!  
 echo "Done!"  
 echo  
 echo  
 exit 0  
 }  
 function whatNext {  
 echo "What would you like to do?"  
 echo  
 imageDownload  
 }  
   
 # BEGINNING OF THE PROGRAM  
   
 # Clear the screen  
 clear  
 echo "This script relies on the program called \"lynx\"."  
 echo " ImageDownloader v.2 is ready"  
 echo  
   
 # PROMPT USER TO DECIDE WHAT TO DO NEXT  
 whatNext  
   
   
   
 exit 0  
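
The link-filtering step of the script can be tried without lynx or a network connection. The sketch below assumes that the reference list printed by "lynx --dump" has lines of the form "   1. http://..." (a number, then the URL), which is what the awk '{print $2}' in the script relies on. The function names filter_media_links and sample_links and the example.com URLs are hypothetical. Unlike the script, this version only matches the extension at the end of the URL and drops duplicated links.

```shell
#!/bin/bash
# Hypothetical helper: read "lynx --dump" style text on stdin and print
# the unique URLs that end in one of the media extensions.
filter_media_links() {
    awk '/http/ {print $2}' \
        | grep -iE '\.(png|jpg|gif|flv|swf)$' \
        | sort -u
}

# Made-up sample of a lynx reference list (no network access needed)
sample_links() {
    printf '%s\n' \
        'References' \
        '   1. http://example.com/index.html' \
        '   2. http://example.com/big.PNG' \
        '   3. http://example.com/photo.jpg' \
        '   4. http://example.com/big.PNG'
}
```

Running "sample_links | filter_media_links" should print only the big.PNG and photo.jpg links, once each. With a real page, the input would come from lynx --dump instead of sample_links.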

Benjamin

Sunday, June 12, 2011

Firefox 4 Installer For Debian

Hello there! While surfing the "Internets" I found this bash script, "install_firefox_debian.sh", written by AnonyMous.

So, if you use Debian or a Debian-based distro, as I do with #!Crunchbang Linux, you can install Firefox 4 instead of the default Iceweasel.


#!/bin/bash
# FireFox 4 Installer For Debian 
# Run As Root 
# Written By AnonyMous :)

echo " Downloading FireFox "

wget http://mozilla.c3sl.ufpr.br/releases//firefox/releases/4.0.1/linux-i686/en-US/firefox-4.0.1.tar.bz2

# Install FireFox4 

echo " Installing FireFox 4.0.1 "

tar -xvjf firefox*.tar.bz2

sudo mv firefox /usr/local/firefox4

sudo ln -s /usr/local/firefox4/firefox /usr/local/bin/firefox4

echo " FireFox Installed Successfully "


Benjamin

Saturday, June 11, 2011

Perl script to delete repeated entries of a plain text file using the Linux/UNIX command "awk"

This script, called "delete_repeats.pl", will help you if you are looking for a more user-friendly alternative to typing the following line inside a Terminal:

$ awk ' !x[$0]++' input_file > output_file
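
For readers who wonder how that one-liner works: x[$0] is a counter indexed by the whole line, and "!x[$0]++" is true only the first time a line is seen, so awk prints the first occurrence of every line and keeps the original order. Here is a minimal sketch of the same idea wrapped in a shell function; the function name dedupe_lines and the array name seen are my own choices for illustration.

```shell
#!/bin/bash
# Hypothetical wrapper around the awk one-liner: keep the first
# occurrence of every line of INPUT and write the result to OUTPUT.
dedupe_lines() {
    # usage: dedupe_lines INPUT OUTPUT
    # seen[$0] counts how many times the current line has appeared so far;
    # the line is printed only while that count is still zero.
    awk '!seen[$0]++' "$1" > "$2"
}
```

For example, an input file with the lines a, b, a, c, b becomes an output file with the lines a, b, c.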


I know! The syntax is logical and simple, and a one-line command like this probably does not deserve a Perl script to automate it.

But the good thing about a script that asks me for the input and output names and then inserts those values into the awk system command is that I no longer need to type the whole line.

I copied the script into my "bin" folder, so now I just execute it and only have to type the correct names of the input file and the output file, and that is all ;)

Here is an example:

Input file:


Now, let's execute the Perl script with the following line inside a terminal (remember that we must first go into the folder that contains our input file and make sure that our script is in the same folder):


$ perl delete_repeats.pl


NOTE: I execute the script this way:


$ delete_repeats.pl


Because I copied the script to my /home/benjamin/bin folder, it is no longer necessary to copy the script to every folder where I want to execute it. This lets the Terminal recognize the script and execute it from whatever folder I am working in :D
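
Here is a minimal sketch of that setup, assuming the script starts with a "#!" line (so the Terminal knows which interpreter to use) and that the "bin" folder may not be on the PATH yet. The function name install_to_bin is hypothetical, and the PATH change only lasts for the current Terminal session unless it is also added to a startup file such as ~/.bashrc.

```shell
#!/bin/bash
# Hypothetical helper: copy a script into a personal "bin" folder,
# make it executable and make sure that folder is on the PATH.
install_to_bin() {
    # usage: install_to_bin SCRIPT [BIN_DIR]   (BIN_DIR defaults to ~/bin)
    local script=$1 bin_dir=${2:-$HOME/bin}
    mkdir -p "$bin_dir" || return 1
    cp "$script" "$bin_dir/" || return 1
    chmod +x "$bin_dir/$(basename "$script")" || return 1
    # Add the folder to PATH only if it is not already there
    case ":$PATH:" in
        *":$bin_dir:"*) ;;
        *) PATH="$bin_dir:$PATH" ;;
    esac
}
```

After something like "install_to_bin delete_repeats.pl", the script can be started from any folder just by typing its name.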


And the output :D


Benjamin

Bug fixed: Perl script to generate "n" random DNA strings of "n" length with a desired nucleotide probability distribution defined by the user and export them in a file in FASTA format

After a short review of my own script "random_dna_strings.pl", I realized that it is more useful to set the C content probability (the last probability parameter that the user was asked for) automatically:

line 116:

my $C_content = (1-($A_content+$T_content+$G_content));

rather than asking the user to type manually the remaining value that brings the total probability to 1.00.

Total p value = 1 = (A content + T content + G content + C content)


C content = (1 - (A content + T content + G content))
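
The same arithmetic can be checked quickly from a Terminal. This is only a sketch of the formula above, not part of random_dna_strings.pl: the function name c_content is hypothetical, and the check that rejects A + T + G greater than 1 is my own addition.

```shell
#!/bin/bash
# Hypothetical helper: compute C content = 1 - (A + T + G)
c_content() {
    # usage: c_content A_CONTENT T_CONTENT G_CONTENT
    awk -v a="$1" -v t="$2" -v g="$3" 'BEGIN {
        c = 1 - (a + t + g)
        # Assumption: refuse inputs whose sum is clearly larger than 1
        if (c < -1e-9) {
            print "error: A + T + G is greater than 1" > "/dev/stderr"
            exit 1
        }
        # Tiny negative values are only floating point rounding noise
        if (c < 0) c = 0
        printf "%.2f\n", c
    }'
}
```

For example, "c_content 0.25 0.25 0.25" prints 0.25, and "c_content 0.7 0.2 0.1" prints 0.00.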

Benjamin

Wednesday, June 8, 2011

SUPPORT HERE: LINUX 20TH YEAR ANNIVERSARY 2011 T SHIRT DESIGN CONTEST FINALIST

Information proudly taken from www.linux.com

After reviewing more than 130 submissions for the 20th Anniversary of Linux T-shirt design contest, we are excited to reveal our finalists!

Please check out the designs below and vote for your favorite. You will be able to vote once a day. Feel passionate about a specific design? Share this page with your friends online and encourage them to vote!

Voting will be open for two weeks: today, June 8, 2011, through midnight PT on June 22, 2011. The winner will be announced shortly thereafter. The winning design will be used as the basis for this year's LinuxCon attendee T-shirt and will be available on T-shirts for purchase at the Linux.com Store later this summer.

Now, get out the vote!


VOTE AND BE COOLER THAN BEFORE BY CLICKING HERE

(IT HAS BEEN DEMONSTRATED THAT VOTING WILL REDUCE THE PROBABILITY OF INCIDENCE OF MUTATIONS IN YOUR p53 GENES)
Benjamin

Sunday, June 5, 2011

Perl script to generate "n" random DNA strings of "n" length with a desired nucleotide probability distribution defined by the user and export them in a file in FASTA format

How many of you who work with artificial DNA sets and use the tools available on the Web have to edit the output because the FASTA header does not fit your needs, or because the output is not in FASTA format at all? Or, for the lazy reason, because copying and pasting the output sequences into a manually created file spends important ATP molecules that could be used for other tasks?

Well, this Perl script called "random_dna_strings.pl" can be downloaded by clicking the link.

This script takes arguments given by the user, such as the frequency of each nucleotide (on a scale from 0.0 to 1.0), generates "n" sequences of a chosen length with a FASTA header also given by the user, and finally writes an output file in FASTA format.
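
To give an idea of the sampling step, here is a minimal sketch in bash and awk. It is not the code of random_dna_strings.pl: the function name random_dna, the order of its arguments and the "_1", "_2", ... numbers appended to the header are assumptions made for illustration. Each position draws a random number between 0 and 1 and picks A, T, G or C according to the cumulative probabilities, and C takes whatever probability remains.

```shell
#!/bin/bash
# Hypothetical sketch: print COUNT random DNA strings of LENGTH nucleotides
# in FASTA format, using the given A, T and G probabilities
# (C gets the remaining probability).
random_dna() {
    # usage: random_dna COUNT LENGTH P_A P_T P_G HEADER
    awk -v n="$1" -v len="$2" -v pa="$3" -v pt="$4" -v pg="$5" -v header="$6" '
    BEGIN {
        srand()                      # seed from the current time
        for (i = 1; i <= n; i++) {
            seq = ""
            for (j = 1; j <= len; j++) {
                r = rand()           # uniform number in [0, 1)
                if (r < pa)                seq = seq "A"
                else if (r < pa + pt)      seq = seq "T"
                else if (r < pa + pt + pg) seq = seq "G"
                else                       seq = seq "C"
            }
            printf ">%s_%d\n%s\n", header, i, seq
        }
    }'
}
```

A call such as "random_dna 100 100 0.25 0.25 0.25 my_header > ran_set_1.fa" would write 100 sequences of 100 nucleotides each. Because the seed comes from the clock, two calls within the same second may return the same sequences.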

So, here is a short Tutorial about how to use it:

Well, imagine that we want to generate 100 random DNA sequences, each 100 nucleotides long, with an equal probability for every nucleotide, that is "A=0.25, T=0.25, G=0.25 and C=0.25", with the FASTA header "just a bunch of random sequences of equal random prob nt distribution" in a file called "ran_set_1.fa".

STEP 1 <- Open a Terminal inside the output folder (where you want to put the output file)

STEP 2 <- Execute the Perl script (this time I copied the script to my bin folder; if you have questions about how to run your scripts this way, please visit this entry).

Command:

$ random_dna_strings.pl


STEP 3 <- Please type the number of iterations (How many random sequences do you want)

In this Tutorial I want to generate 100 artificial sequences.


STEP 4 <- Please type the length of the random DNA strings (how many nucleotides long)

In this Tutorial I want each sequence to be 100 nucleotides long


STEP 5 <- Please type the probability distribution of A, T, G and C content:

In this Tutorial I want each nucleotide to have an equal probability (that is, a probability of 1/4 for A, T, G or C at each string position)



STEP 6 <- Please, type the name of the fasta header for each sequence (it is not necessary to type the ">")

In this Tutorial, I want the FASTA header to be: "just a bunch of random sequences of equal random prob nt distribution"



STEP 7 <- Please, type the name of the output file:

In this Tutorial, I want to name it "ran_set_1.fa"



STEP 8 <- Enjoy the results


Another example:

Now I want to generate 100 random DNA sequences, each 100 nucleotides long, with a nucleotide distribution of "A=0.7, T=0.2, G=0.1 and C=0.0" with the FASTA header "random dna sequence with too many As" in a file called "random_set_A_rich.fa".


Now, take a look at the output folder:


Do you see it? :D Now we have two output files in FASTA format with randomly generated sequences that follow the nucleotide probability distribution that we chose.

NOTE: The yellow highlighting just marks the difference in the content of A nucleotides between the two files, showing us that the probability distribution clearly has an impact on the nucleotide composition of the randomly generated DNA strings.

Benjamin

Saturday, June 4, 2011

Perl script to export every line of independent DNA sequences inside a single file to a FASTA formatted output file

This Perl script takes every line of the input file as a single DNA sequence and then creates a new file with every sequence in FASTA format, with the FASTA headers defined by the user.

The script is called "fasta_seq.pl" and can be downloaded by clicking the link.

So here is a short Tutorial about how to use it:

Imagine that we have a file called "seq" that contains a single DNA sequence per line, and we do not want to separate the sequences and name them manually.

NOTE: This script will only run under Linux/UNIX environments because it depends on the commands "pr", "sed", "tr" and "fold" (sorry to all the Windows users).
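
To give an idea of what such a conversion involves, here is a minimal bash sketch that uses "fold" for the line width. It is not the code of fasta_seq.pl: the function name lines_to_fasta, the order of its arguments and the "_1", "_2", ... numbers appended to the header are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical sketch: treat every non-empty line of INPUT as one DNA
# sequence and print it in FASTA format, wrapped at WIDTH nucleotides.
lines_to_fasta() {
    # usage: lines_to_fasta INPUT HEADER WIDTH
    local n=0 line
    # "|| [ -n "$line" ]" also handles a last line without a newline
    while IFS= read -r line || [ -n "$line" ]; do
        [ -z "$line" ] && continue
        n=$((n + 1))
        printf '>%s_%d\n' "$2" "$n"
        # fold breaks the sequence into lines of at most WIDTH characters
        printf '%s\n' "$line" | fold -w "$3"
    done < "$1"
}
```

With the values of this Tutorial, the call would look like: lines_to_fasta seq "just a dna sequence" 45 > output_sequences.fa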



STEP 1 <- Open a Terminal and go into the folder that contains the input file:

STEP 2 <- Execute the Perl script with

perl fasta_seq.pl



EXPLANATION OF THE PARAMETERS:
  1. perl <- here you are telling the Terminal that you want to run the Perl interpreter
  2. fasta_seq.pl <- name of the Perl script

STEP 3 <- Type the name of the input file that contains the DNA sequences (in this Tutorial, our file is called "seq")



STEP 4 <- Type the name of the fasta header for each sequence (it is not necessary to type the ">" symbol).

In this Tutorial, I want the FASTA header to be "just a dna sequence"



STEP 5 <- Type the width of the sequences (how many nucleotides per line)

In this Tutorial, I want the width to be "45" nucleotides per line



STEP 6 <- Type the complete name of the output file (in this Tutorial, I want the output file to be called "output_sequences.fa")



STEP 7 <- Finally, enjoy the results ;)


Now we can go into our input folder and take a look at the output file:



Do you see it?? We now have an output file that contains every sequence of the input file (this script considers every line as an independent DNA string), each under a fasta header that we named ourselves, ready in FASTA format.


Benjamin

Thursday, June 2, 2011

Simple Perl script for computing the reverse complement of a single DNA string

This entry will help you if you are looking for a Perl script that reads an input file containing just one sequence (in FASTA format or just the sequence without the FASTA header), computes the reverse complement of the string, and then creates an output file in FASTA format (even if the original file was not in FASTA format). Later on I will upload a script that can handle any number of sequences inside a single FASTA file.
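
For a quick check from the Terminal, the same operation can be sketched with standard tools. This is not the Perl script itself: the function name revcom is hypothetical, and it assumes that the "rev" command (part of util-linux on most Linux systems) is installed. It drops the FASTA header if there is one, joins the sequence lines, reverses the string and swaps each nucleotide with its complement.

```shell
#!/bin/bash
# Hypothetical sketch: print the reverse complement of the single DNA
# sequence stored in FILE (with or without a FASTA header).
revcom() {
    # usage: revcom FILE
    {
        # Drop header lines and join the sequence into one line
        grep -v '^>' "$1" | tr -d ' \t\r\n'
        echo
    } | rev | tr 'ATGCatgc' 'TACGtacg'
}
```

For a file that contains the sequence ATGCAAT, revcom prints ATTGCAT.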

The script is named "rev_com.pl" and can be downloaded by clicking on the script name.

NOTE: In a Linux/UNIX system we can make the script executable and then put it inside a "bin" folder (like /usr/local/bin or /home/USER/bin). This way we can run the script from any directory without copying it to every folder where we want to use it.

Here is an example of my personal "bin" folder:

But in this short Tutorial I will run the copy of the Perl script that is inside the folder with my DNA sequences, not the one that is inside my "bin" folder.



Let's run an example to see how it works ;)

Let's imagine that we have these two files inside a folder:
  1. One is in FASTA format <- D_mel.fa 
  2. The other one just contains the DNA sequence <- dna_sequence


    And we want to compute the reverse complement of each file and export each result to a new output file in FASTA format with a FASTA header.

    So, let's run the script:

    NOTE: remember that we have to go into the folder with the Terminal before executing the script. We can do this with the UNIX command "cd", which takes us to the folder that contains our files.
    1. STEP 1 <- Access the folder with the terminal
    2. STEP 2 <- Execute the Perl script:

    $ perl rev_com.pl

    Example of executing the Perl script

    EXPLANATION OF THE PARAMETERS

    perl <- You are telling the Terminal that you want to run a Perl script.

    rev_com.pl <- name of the perl script.

    STEP 3 <- Write the name of the input file (if the file has an extension, do not forget to type it too):


    STEP 4 <- Hit enter and run! (it was a joke):

    Enjoy the results directly from the Terminal

    At the bottom of the Terminal, you can read the name of the output file :D

    Enjoy the results of another example, with an input file that is not in FASTA format (it just contains the DNA sequence pasted in)

    Finally, we can go to the folder and see our new files that contain the reverse complement of the input strings:


    Can you see it now? The script named the output files with the ".fa" extension, and inside each file we can see the FASTA header, which makes them proper, recognizable FASTA files.

    ##############################################

    If you are curious and want to know how to run the script from the "bin" folder, here is a screenshot about it:


    As you can see, there is no need to tell the Terminal that we want to run a Perl script. Another good thing about running programs from the bin folder is that we do not need to copy them to every input folder (where our input files live).

    #!/usr/bin/perl
    #       rev_com.pl
    #       
    #       Copyright 2011 Benjamin Tovar 
    #       
    #       This program is free software; you can redistribute it and/or modify
    #       it under the terms of the GNU General Public License as published by
    #       the Free Software Foundation; either version 2 of the License, or
    #       (at your option) any later version.
    #       
    #       This program is distributed in the hope that it will be useful,
    #       but WITHOUT ANY WARRANTY; without even the implied warranty of
    #       MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    #       GNU General Public License for more details.
    #       
    #       You should have received a copy of the GNU General Public License
    #       along with this program; if not, write to the Free Software
    #       Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
    #       MA 02110-1301, USA.
    #
    ################################################################################
    #
    # NAME OF THE PROGRAM: rev_com.pl
    # DATE: 01/Jun/2011
    # AUTHOR: Benjamin Tovar
    # COMMENTS: Little script that can be used for the computation of 
    # the reverse complement of a DNA string and then export the resulting string
    # to a file in Fasta format with fasta headers (even if the original file was not)
    #
    ################################################################################
    
    use warnings;
    use strict;
    
    #### script introduction
    
    print "\n\n-------------------------------------------------------------\n\n";
    
    print "PROGRAM DEFINITION: rev_com.pl <- PERL SCRIPT WRITTEN BY BENJAMIN TOVAR
    THAT COMPUTES THE REVERSE COMPLEMENT OF DNA STRINGS AND EXPORT THE RESULTS TO AN 
    OUTPUT FILE IN FASTA FORMAT (EVEN IF THE ORIGINAL FILE WAS NOT)\n\n";
    
    print "This program will ask the user to type the name of the file 
    (if the file name has an extension, write it please) that contains the DNA sequence\n\n";
    
    #### Ask the user for the name of the file
    
    print "Please, type the complete name of the file:\n\n";
    
    my $user_in = <STDIN>;
    
    ### Remove the trailing newline
    
    chomp $user_in;
    
    ### Open the file
    
    open(INPUT_FILE, '<', $user_in) or die "\n\n WARNING: The file does not exist, please check the spelling, the extension and the existence of the file\n\n";
    
    ### Copy the content of the file to an array variable called "@input_file"
    
    my @input_file = <INPUT_FILE>;
    
    ### Remove the Fasta header and extract the DNA sequence
    
    my $input_file = extract_sequence_from_fasta_data(@input_file);
    
    ### Close the opened file
    
    close INPUT_FILE;
    
    ### Compute the reverse DNA string
    
    my $rev_com = reverse ($input_file);
    
    ### Compute the complement DNA string
    
    $rev_com =~ tr/ATGCatgc/TACGtacg/;
    
    ### Compute the length of the input DNA string
    
    my $length = length ($input_file);
    
    ############## RESULTS #################################################
    
    print "\n----------- RESULTS --------------\n\n";
    
    print "Input file string of length $length:\n\n";
    
    print $input_file,"\n";
    
    print "\nOutput reverse complement of the DNA string:\n\n";
    
    print $rev_com, "\n\n";
    
    ############## EXPORTING THE RESULTS TO A FILE IN FASTA FORMAT #########
    
    ######################## Naming the output file: #######################
    
    ### Extract the Fasta header of the input file (or build one from the file name)
    
    my $fasta_header_name = extract_fasta_header_name(@input_file);
    
    ## Remove the input file extension
    
    $user_in =~ s/\..*//; 
    
    # Concatenate the name of the file (without the .fa/.fasta extension) with "-revcom.fa"
    
    my $rev_com_name = "-revcom.fa";
    
    my $output_name = $user_in . $rev_com_name;
    
    # Name of the output file
    
    my $out = "$output_name";
    
    # Set the file handle "OUTPUT".
    
    open (OUTPUT, '>', $out) or die "\n\n WARNING: The output file \"$out\" could not be created\n\n"; 
    
    # Print the results (content) of the variable "$rev_com" (this variable contains the reverse complement string) 
    # into a file named "$output_name" and put the Fasta header before the output DNA string with "$fasta_header_name","-reverse_complement\n"
    
    print OUTPUT "$fasta_header_name","-reverse_complement\n","$rev_com\n";
    
    # Close the output file so that everything is written to disk
    
    close OUTPUT;
    
    
    print "-------- EXPORT THE RESULTS TO A FILE IN FASTA FORMAT ----------\n";
    print "\nThe output string has been exported to the file \"$output_name\"\n\n";
    
    exit;
    
    ################################################################################
    ############################### SUBROUTINES ####################################
    ################################################################################
    
    
    ################################################################################
    # extract_fasta_header_name
    # A subroutine to extract the FASTA header of the original input file
    # and use it to name the FASTA header of the output file
    ################################################################################
    
    sub extract_fasta_header_name{
    
        my(@fasta_file_data) = @_;
    
        # Default: if the file is not in Fasta format, use the name of the file
        # to name the fasta header
        my $fasta_header_name = ">$user_in";
    
        # Use the first Fasta header line found in the file, if there is one
        foreach my $line (@fasta_file_data) {
        
            if($line =~ /^>/) {
                
                $fasta_header_name = $line;
                last;
            
            }
        }
    
        # Remove whitespace (including the trailing newline) from $fasta_header_name string
        $fasta_header_name =~ s/\s//g;
    
        # Export the results of the subroutine to the main program    
        return $fasta_header_name;
    }
    
    ################################################################################
    # extract_sequence_from_fasta_data
    # A subroutine to extract FASTA sequence data from an array
    # taken from James Tisdall's Beginning Perl for Bioinformatics
    ################################################################################
    
    sub extract_sequence_from_fasta_data {
    
        my(@fasta_file_data) = @_;
    
        use strict;
        use warnings;
    
        # Declare and initialize variables
        my $sequence = '';
    
        foreach my $line (@fasta_file_data) {
    
            # discard blank line
            if ($line =~ /^\s*$/) {
                next;
    
            # discard comment line
            } elsif($line =~ /^\s*#/) {
                next;
    
            # discard fasta header line
            } elsif($line =~ /^>/) {
                next;
    
            # keep line, add to sequence string
            } else {
                $sequence .= $line;
            }
        }
    
        # remove non-sequence data (in this case, whitespace) from $sequence string
        $sequence =~ s/\s//g;
    
        return $sequence;
    }
    

    Benjamin.