Tuesday, November 15, 2011

Perl script to convert raw biological strings in a file (DNA or amino acid strings) to FASTA format

Hello, this simple script is an update for another one I wrote eons ago.
Download the script <- seq2fas.pl

NOTE: It depends on the UNIX commands "pr", "sed", "tr" and "fold" to work ;)

The functionality is very simple. It takes a file with one raw sequence per line, like this:

ACCTTACGCC
AGTAACGTAG
TTAGTATATA
ACCTACGATA
AAACAGGCCC
ACCGCTAGAT
AGCCCCATCC
CCGGTATACC
AGCGGACCCC
AACAACCCCC

And prints the output in FASTA format:

>Random_seq-1
ACCTTACGCC
>Random_seq-2
AGTAACGTAG
>Random_seq-3
TTAGTATATA
>Random_seq-4
ACCTACGATA
>Random_seq-5
AAACAGGCCC
>Random_seq-6
ACCGCTAGAT
>Random_seq-7
AGCCCCATCC
>Random_seq-8
CCGGTATACC
>Random_seq-9
AGCGGACCCC
>Random_seq-10
AACAACCCCC

This script is very useful if you are working with artificial sets produced in R, Perl, etc.
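
For comparison, here is a minimal sketch of the same idea written with awk alone. The function name `seq_to_fasta` is hypothetical and not part of seq2fas.pl. It assumes one sequence per input line, and unlike the `pr | sed | tr | fold` pipeline it wraps only the sequence lines, never the header lines.

```shell
#!/bin/bash
# seq_to_fasta HEADER [WIDTH]
# Reads one raw sequence per line on stdin and writes FASTA records named
# HEADER-1, HEADER-2, ... to stdout. (Hypothetical helper, not seq2fas.pl.)
seq_to_fasta() {
    local header="$1"
    local width="${2:-60}"
    awk -v h="$header" -v w="$width" '
        NF {                                    # skip blank lines
            printf ">%s-%d\n", h, ++n           # header with a running counter
            for (i = 1; i <= length($0); i += w)
                print substr($0, i, w)          # wrap the sequence at w columns
        }'
}

# Demo with the first two sequences of the example input
printf 'ACCTTACGCC\nAGTAACGTAG\n' | seq_to_fasta Random_seq 60
# >Random_seq-1
# ACCTTACGCC
# >Random_seq-2
# AGTAACGTAG
```

With a file, this could be called as `seq_to_fasta random_seqs 60 < randomDNAsequences.dna > randomSeqs.fas`.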

To run the script, just download it, change its permission to be executable and run it:

$ chmod +x seq2fas.pl
$ perl seq2fas.pl


Here is the USAGE:


perl seq2fas.pl randomDNAsequences.dna random_seqs 60 randomSeqs.fas

CODE:
#!/usr/bin/perl
 
################################################################################
# seq2fas.pl
# This script takes an input file like this:
#
#    ACCTTACGCC
#    AGTAACGTAG
#    TTAGTATATA
#    ACCTACGATA
#    AAACAGGCCC
#    ACCGCTAGAT
#    AGCCCCATCC
#    CCGGTATACC
#    AGCGGACCCC
#    AACAACCCCC
#
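# and prints an output file in FASTA format like this (in the real output each
# number is prefixed with the header name given on the command line,
# e.g. >random_seqs-1):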
#
#    >1
#    ACCTTACGCC
#    >2
#    AGTAACGTAG
#    >3
#    TTAGTATATA
#    >4
#    ACCTACGATA
#    >5
#    AAACAGGCCC
#    >6
#    ACCGCTAGAT
#    >7
#    AGCCCCATCC
#    >8
#    CCGGTATACC
#    >9
#    AGCGGACCCC
#    >10
#    AACAACCCCC
#
################################################################################
# Author: Benjamin Tovar
################################################################################

use warnings;
use strict;

my $USAGE = "
USAGE:

seq2fas.pl <input_file> <fasta_header> <width> <output_file>

EXAMPLE: 

seq2fas.pl randomDNAsequences.dna random_seqs 60 randomSeqs.fas

";

my $user_in = shift or die $USAGE;
my $fasta_header = shift or die $USAGE;
my $width = shift or die $USAGE;
my $output_file = shift or die $USAGE;

system("pr -n:3 -t -T $user_in | sed 's/^[ ]*/>$fasta_header-/' | tr \":\" \"\n\" | fold -w $width > $output_file");


exit;


Author: Benjamin Tovar

Thursday, August 18, 2011

Simple Bash command line to reduce the length of the fasta header lines.

Hi there, how many times have we downloaded a FASTA file that contains huge FASTA header lines like this:



So, to clean up the header, just use this simple command line:

$ cat <input_file> | awk '{print $1}' > <output_file>


EXAMPLE:

$ cat data.fa | awk '{print $1}' > data_parsed.fa

And the output will be:
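
To make the effect concrete, here is a minimal sketch using a made-up record (the header text is hypothetical, not the author's data). The function name `shorten_headers` is also an assumption. It applies the same `print $1` idea only to lines that start with ">", so sequence lines pass through unchanged even if they contain spaces.

```shell
#!/bin/bash
# shorten_headers [FILE...]
# Keeps only the first whitespace-delimited word of each FASTA header line;
# every other line is printed unchanged. Reads stdin when no file is given.
shorten_headers() {
    awk '/^>/ { print $1; next } { print }' "$@"
}

# Demo with a made-up record
printf '>seq1 hypothetical protein [Example organism]\nACGTACGT\n' | shorten_headers
# >seq1
# ACGTACGT
```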



Hope this helps

Benjamin.

Saturday, July 30, 2011

MEME2fasta.sh <- Bash script to Parse MEME motifs into separated FASTA files

Hi, this script parses motifs from MEME output files (meme.txt) and prints them to separate FASTA-formatted files.

To automate the procedure, the script parses every *.txt MEME output file inside the folder where you run it.
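
The first step of the script relies on a sed address range. As a minimal sketch, the function below (the name `extract_blocks` is hypothetical) prints only the lines from a "BL   MOTIF" line through the next "//" line. The demo input is a toy stand-in that only loosely imitates the layout of a MEME text report.

```shell
#!/bin/bash
# extract_blocks [FILE...]
# Prints every line from a line containing "BL   MOTIF" up to and including
# the next line containing "//". -n suppresses all other output.
extract_blocks() {
    sed -n '/BL   MOTIF/,/\/\//p' "$@"
}

# Demo on a toy input (not real MEME output)
printf 'junk\nBL   MOTIF 1 width=4 seqs=2\nseqA (    3) ACGT  1\n//\nmore junk\n' | extract_blocks
# BL   MOTIF 1 width=4 seqs=2
# seqA (    3) ACGT  1
# //
```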

You can download the script here: MEME2fasta.sh

NOTE: It works on Debian and Debian-based Linux systems; I have not yet tested it on other Linux distributions.

In order to run the script:

STEP 1 <- To execute it, just change the permission of the file to run as a program:

$ chmod +x MEME2fasta.sh

STEP 2 <- To run the program (you can copy and paste it inside your bin path or run the script locally):
# From the bin folder:
# Go to the path of the target meme.txt output files and then:

$ MEME2fasta.sh

# From the local folder (Which contain the script and the target meme.txt files)

$ ./MEME2fasta.sh


SHORT TUTORIAL

INPUT FOLDER AND INPUT FILES:



OUTPUT FOLDER AND OUTPUT FILES:


Code:

#!/bin/bash

# MEME2fasta.sh
#
# I used this script to parse the DNA sequences obtained 
# from each motif of MEME output files "meme.txt"
# to generate a single FASTA file per motif.

# Finally I used the FASTA files to build PWMs

# Author: Benjamin Tovar
# Date: 11 July 2011

###########################################################
# Parse the data between the lines "BL MOTIF" and "//"
# to retrieve the DNA sequences that define each motif
##########################################################

for meme_file in *.txt
    do
        sed -n '/BL   MOTIF/,/\/\//p' $meme_file > $meme_file.sed
    done;

##########################################################
# Split every DNA motif into separated files in "*.csplit"
# format
##########################################################

for sed_file in *.sed
    do
        csplit -z $sed_file '/^BL   MOTIF/' '{*}' --suffix="%02d.csplit" --prefix=$sed_file- -s
    done

##########################################################
# Parse the DNA sequences from each *.csplit files
##########################################################

for csplit_file in *.csplit
    do
        # grep -v '^$' <- delete blank lines
        # sed 's/1//g' <- deletes the number "1" from the line. 
        cut -c34-150 $csplit_file |grep -v '^$' | sed 's/1//g' > $csplit_file.cut
    done

##########################################################
# Generate Fasta files 
##########################################################

for cut_file in *.cut
    do
        pr -n:3 -t -T $cut_file | sed 's/^[ ]*/>/' | tr ":" "\n" | fold -w 100 > $cut_file.fa
    done

# remove unnecessary files:

rm *.sed *.csplit *.cut

# Rename the FASTA files

rename -f 's/\.csplit\.cut\.fa$/.fa/' *.fa

rename -f 's/\.txt\.sed//' *.fa

exit;

# Benjamin Tovar

Benjamin

Friday, July 29, 2011

fasta2clustal.pl <- Perl script to convert aligned and gapless FASTA files to aligned and gapless CLUSTAL formatted files

Fasta to Clustal:

Hello there, recently I have been working with literally thousands of PWMs (Position Weight Matrices) generated from MEME and MotifClick output.

To build PWMs from motifs in FASTA format, I first need to align them and generate CLUSTAL formatted files, but errors occur when the alignment introduces gaps.

Yep, the UGENE PWM builder does not like gaps and reports errors while building the PWM.

The situation is that you already have an aligned and gapless FASTA file and you only need to convert it to CLUSTAL, without "the need" to align the sequences once more and deal with gaps.

So, here is my Perl script called: fasta2clustal.pl

STEP 1 <- To execute it, just change the permission of the file to run as a program:

$ chmod +x fasta2clustal.pl


STEP 2 <- To run the program (you can copy and paste it inside your bin path or run the script locally):

# From the bin folder (The folder must contain the target FASTA files):

$ fasta2clustal.pl <input_fasta_file.fasta>

EXAMPLE: fasta2clustal.pl fasta_sequences.fasta

# From the local folder (Which contain the script and the target FASTA files)

$ ./fasta2clustal.pl <input_fasta_file.fasta>

EXAMPLE: ./fasta2clustal.pl fasta_sequences.fasta

SHORT TUTORIAL:

RUNNING THE SCRIPT


INPUT FILE:



OUTPUT FILE:


Code:

#!/usr/bin/perl


#       fasta2clustal.pl
#       
#       Copyright 2011 Benjamin Tovar
#       
#       This program is free software; you can redistribute it and/or modify
#       it under the terms of the GNU General Public License as published by
#       the Free Software Foundation; either version 2 of the License, or
#       (at your option) any later version.
#       
#       This program is distributed in the hope that it will be useful,
#       but WITHOUT ANY WARRANTY; without even the implied warranty of
#       MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#       GNU General Public License for more details.
#       
#       You should have received a copy of the GNU General Public License
#       along with this program; if not, write to the Free Software
#       Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
#       MA 02110-1301, USA.
#
################################################################################
#
# DATE: 27/July/2011
# AUTHOR: Benjamin Tovar
#
# This program converts an aligned and gapless FASTA file to
# a gapless CLUSTAL file, ideal for converting motifs to PWMs.
#
################################################################################

use warnings;
use strict;

###### Print USAGE if no argument is given
my $usage = "\nUSAGE: fasta2clustal.pl <input_fasta_file>
EXAMPLE: fasta2clustal.pl dna_sequence.fa\n\n";

###### Get the arguments:  
my $input_fasta_file = shift or die $usage;

###### Let the computer decide:

 my $dna_sequence ='';
 my $sequence_name='';
 my $output_file_name='';

 open(INPUT_FILE, '<', $input_fasta_file)
     or die "Cannot open $input_fasta_file: $!\n";
 my @file = <INPUT_FILE>;
 close INPUT_FILE;

  # CLUSTAL HEADER
  print "CLUSTAL W 2.0 multiple sequence alignment\n\n";

 foreach my $line(@file){
  
  # Discard empty lines
  if($line =~ /^\s*$/){
  next;
  }
  
  else{
    # Extract sequence names
    if($line =~ /^>/) {
     $sequence_name = $line;
     $sequence_name =~ s/\s//g;
     $sequence_name =~ s/\>//g;
     # "\S" would mean "any non-whitespace character", so match the literal word
     $sequence_name =~ s/Start_position/_Start_position/g;
     $sequence_name =~ s/\:/_/g;
     chomp $sequence_name;
    }

    # Extract DNA sequence
    else{
    $dna_sequence = $line;
    chomp $dna_sequence;
  # Print results
                printf("%-50s %10s \n", "$sequence_name","$dna_sequence");
    } 
  }
 }

# Powered by #!CrunchBang Linux
# Benjamin Tovar

exit;
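
For comparison, the same conversion can be sketched as a shell function around awk. This is an illustration under stated assumptions, not the author's script: the name `fasta_to_clustal` is hypothetical, each sequence is assumed to sit on a single line, and the "Start_position" renaming from the Perl version is left out.

```shell
#!/bin/bash
# fasta_to_clustal [FILE...]
# Prints a CLUSTAL header, then one "name sequence" row per FASTA record.
# Assumes an aligned, gapless FASTA file with one line per sequence.
fasta_to_clustal() {
    awk '
        BEGIN { print "CLUSTAL W 2.0 multiple sequence alignment\n" }
        !NF   { next }                          # skip blank lines
        /^>/  {
            name = substr($0, 2)                # drop the leading ">"
            gsub(/[ \t\r]/, "", name)           # remove whitespace
            gsub(/:/, "_", name)                # replace colons with "_"
            next
        }
        { printf "%-50s %s\n", name, $0 }       # name padded to 50 columns
    ' "$@"
}
```

Usage would look like `fasta_to_clustal fasta_sequences.fasta > fasta_sequences.aln`, where the output file name is only an example.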

Benjamin

Wednesday, July 13, 2011

Bash tips: copy and move files avoiding the annoying "Argument list too long"

Hello there! Today I ran into an interesting situation:

First: I had more than 80,000 GENBANK (*.gb) files inside a folder, together with 100 FASTA files (*.fa) and 100 MAFFT alignment files (*.mafft).

Second: I wanted to create separate folders and then create a list or summary of every kind of file with a simple:


# Create a summary of every FASTA file inside this folder:

$ ls *.fa | sort > fasta_summary


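Third: Yep, it worked for the FASTA files and the MAFFT ones, but my life suddenly changed when I tried that line on the GENBANK files, because I got this -> "Argument list too long". The shell expands *.gb into one argument per file, and 80,000 file names exceed the size limit that the system places on a command's argument list.

OK, I said, let's do it differently, and finally here is the solution: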

1) To create a summary of every genbank file inside the folder.


# be sure to be inside the folder with the Terminal
# Let Perl work ;)

$ perl -e 'opendir(DIR, "."); @all = grep /\.gb$/, readdir DIR; closedir DIR; print "@all\n";' | xargs ls > GENBANK_FILES_SUMMARY


grep /\.gb$/ <- Perl will apply that regular expression and list every file whose name ends with the ".gb" extension (you can adapt the pattern depending on your needs).


2) To copy every GENBANK file to a folder called "GENBANK_FOLDER-COPY":


$ find . -name "*.gb" | xargs -I {} cp {} GENBANK_FOLDER-COPY/


3) To move every GENBANK file to a folder called "GENBANK_FOLDER":

$ find . -name "*.gb" | xargs -I {} mv {} GENBANK_FOLDER/


# Description of the last line:

$ find source/ -name "*.txt" | xargs -I {} mv {} target/ <- Where "source/" is the input path and "target/" is the output path. -name "*.txt" is the pattern (a shell-style glob, not a regular expression) that matches every *.txt file inside the input folder.

Check out the "cp" argument in task #2 and "mv" argument in task #3 for copying and moving respectively.
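
The same two tasks can be sketched as small functions that also cope with spaces in file names. This is a minimal sketch, assuming a `find` that supports `-maxdepth` and `-print0` and an `xargs` that supports `-0` (GNU and BSD versions do). The function names `list_by_ext` and `copy_by_ext` are hypothetical.

```shell
#!/bin/bash
# list_by_ext DIR EXT
# Prints the sorted names of the regular files DIR/*.EXT, one per line.
# find produces the names itself, so no long argument list is ever built.
list_by_ext() {
    find "$1" -maxdepth 1 -type f -name "*.$2" | sed 's|.*/||' | sort
}

# copy_by_ext SRC EXT DEST
# Copies every regular file SRC/*.EXT into DEST, one cp call per file.
# -print0 / -0 pass names NUL-separated, so spaces in names are safe.
copy_by_ext() {
    mkdir -p "$3"
    find "$1" -maxdepth 1 -type f -name "*.$2" -print0 |
        xargs -0 -I {} cp {} "$3"/
}
```

For example, `list_by_ext . gb > GENBANK_FILES_SUMMARY` followed by `copy_by_ext . gb GENBANK_FOLDER-COPY`. Replacing `cp` with `mv` in the second function gives the move variant.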

Hope this helps someone.

Benjamin

Monday, July 11, 2011

Bash script: Extract DNA sequences from each motif of MEME output files and parse them in FASTA format

This script parses the DNA sequences that make up each motif in MEME output files ("meme.txt") and exports them in FASTA format.

The MEME (Multiple EM for Motif Elicitation) program is a very powerful tool for motif mining in sequence datasets. This means that the program looks for statistically overrepresented substrings among the sequences in the dataset. A practical and common use is, for example, the in silico characterization of motifs that could function as promoters upstream of the CDS in a dataset.

For more information about this program and other very useful Bioinformatics tools for motif mining tasks, take a look at the MEME-suite home page.

So, with this script you will be able to extract the set of DNA sequences that makes up each MEME motif.

The script is called extract-FASTA_from-MEME.sh 

To use it, just change the properties of the file to be executable:


$ chmod +x extract-FASTA_from-MEME.sh


and finally, just execute it inside the folder that contains all the MEME output files (meme.txt, meme.output.txt and so on):


$ ./extract-FASTA_from-MEME.sh


Code:

#!/bin/bash

# extract-FASTA_from-MEME.sh
#
# I used this script to parse the DNA sequences obtained 
# from each motif of MEME output files "meme.txt"
# to generate a single FASTA file per motif.

# Finally I used the FASTA files to build PWMs

# Author: Benjamin Tovar
# Date: 11 July 2011

###########################################################
# Parse the data between the lines "BL MOTIF" and "//"
# to retrieve the DNA sequences that define each motif
##########################################################

for meme_file in *.txt
do
    sed -n '/BL   MOTIF/,/\/\//p' $meme_file > $meme_file.sed
done;

##########################################################
# Split every DNA motif into separated files in "*.csplit"
# format
##########################################################

for sed_file in *.sed
do
csplit -z $sed_file '/^BL   MOTIF/' '{*}' --suffix="%02d.csplit" --prefix=$sed_file- -s
done

##########################################################
# Parse the DNA sequences from each *.csplit files
##########################################################

for csplit_file in *.csplit
do

# grep -v '^$' <- delete blank lines
# sed 's/1//g' <- deletes the number "1" from the line.
 
cut -c34-150 $csplit_file |grep -v '^$' | sed 's/1//g' > $csplit_file.cut
done

##########################################################
# Generate Fasta files 
##########################################################

for cut_file in *.cut

do
pr -n:3 -t -T $cut_file | sed 's/^[ ]*/>/' | tr ":" "\n" | fold -w 100 > $cut_file.fa
done

# remove unnecessary files:

rm *.sed *.csplit *.cut

# Rename the FASTA files

rename -f 's/\.csplit\.cut\.fa$/.fa/' *.fa

# Benjamin Tovar

Benjamin

Wednesday, June 29, 2011

Bash script to split an input FASTA file into single-sequence FASTA files

Hello everyone, yesterday I downloaded a FASTA file with more than 16,500 DNA sequences, and I needed to split it into 16,500 separate files, each containing a single DNA sequence.

To do so, I wrote this short and simple Bash script called "fasta_split.sh".

To use the script:

$ ./fasta_split.sh
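
As a non-interactive alternative, the split can also be sketched with awk instead of csplit. This is an illustration, not the script itself: the function name `split_fasta` and the five-digit numbering are assumptions. Like csplit, it numbers the output files from 0.

```shell
#!/bin/bash
# split_fasta INPUT PREFIX OUTDIR
# Writes each record of INPUT to OUTDIR/PREFIX-00000.fa, PREFIX-00001.fa, ...
split_fasta() {
    local input="$1"
    local prefix="$2"
    local outdir="$3"
    mkdir -p "$outdir"
    awk -v p="$outdir/$prefix" '
        /^>/ {                                  # a header starts a new record
            if (out) close(out)                 # keep few files open at once
            out = sprintf("%s-%05d.fa", p, n++)
        }
        out { print > out }                     # text before the first header is dropped
    ' "$input"
}
```

For example, `split_fasta sequences.fa outputfile split_output` would create `split_output/outputfile-00000.fa` and so on, where all three names are placeholders.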

Here is an example; just follow the instructions the program gives you and that's all.



Example input/output folders and input/output files:


 Code:

#!/bin/bash

#       fasta_split.sh
#       
#       2011 - Benjamin Tovar 
#       
#
# NAME OF THE PROGRAM: fasta_split.sh
# 
# DATE: 29/JUN/2011
# 
# AUTHOR: Benjamin Tovar
# 
# COMMENTS: This script will split a FASTA file that contains many sequences into
# Separated files containing one single sequence per file.
#
# I used this script to split a FASTA file containing ~16,500 sequences into
# ~16,500 independent files very quickly and simple.
#
################################################################################

# BEGINNING OF THE PROGRAM
echo
echo "This script relies on the program called \"csplit\" available in Linux/UNIX systems."

# Ask the user to type the name of the input FASTA file:
echo
echo -n "   Enter the name of the input file in FASTA format: "
    read input_file

# Ask the user to type the name of the output FASTA files. Note that in the
# csplit call below, "--suffix="%02d.fa" means that every output file will have
# the extension "*.fa". If you prefer another extension (for example *.fasta),
# just replace ".fa" with ".fasta" this way -> "--suffix="%02d.fasta"

# The "02" in "--suffix="%02d.fa" means that every file is numbered with at least
# two digits, in the order the sequences occur in the original FASTA input file.

# For example: in an input file that contains 2 sequences and with "--suffix="%02d.fa"
# the output will be:
#
# outputfile-00.fa
# outputfile-01.fa
#
# For example: in an input file that contains 2 sequences and with "--suffix="%04d.fa"
# the output will be:
#
# outputfile-0000.fa
# outputfile-0001.fa
echo      
echo -n "   Enter the name of the output files: "
    read output_file

# Ask the user to type the name of the output folder that will contain all the output files   
echo 
echo -n "   Enter the name of the output folder: "
read dir
echo
echo "Creating directory called \"$dir\"."

# Create the output folder
mkdir $dir

# Copy the input FASTA file to the output folder
cp $input_file $dir

# Open to the output folder
cd $dir
echo
echo "   ...splitting FASTA file ...saving them in \"$dir\"."

# Splitting the input FASTA file (for the suffix settings, see the comments above)
csplit -z $input_file '/^>/' '{*}' --suffix="%02d.fa" --prefix=$output_file- -s

# Delete the input FASTA file in the output folder
rm $input_file
echo
echo "Printing output summary in a file called \"OUTPUT-SUMMARY.out\""

# Create the output summary
ls | sort | sed -e 's/OUTPUT-SUMMARY.out//g' > OUTPUT-SUMMARY.out 
echo
echo " ---- PROCESS DONE ---- "
echo


Benjamin

Thursday, June 16, 2011

And here I go ..

Hello everyone! I realized how important it is to get the right box, the right colors and the right font, among other things, to present source code well.

So this morning I added some lines to my blog's template.

And this is the cool output:


#!/bin/bash

echo "random code"

exit;

Benjamin

Monday, June 13, 2011

How to download automatically every image that is linked from inside a website with a bash script

This time I did not write any code; instead I want to share with you this very useful bash script written by Darrin Goodman.

For version 1 of the script with its full reference, please enter here, and click here for the full reference of version 2 (the one posted in this blog entry).

There are occasions when you might wish to download any or all of the images linked from a web page, such as when a thumbnail image links to a larger version of the same image. In all those cases this script will help you do that job ;).

NOTE: This script depends on programs such as "awk", "lynx", "grep" and "wget".

If you use Debian or a Debian based distro you must install the "lynx" program first:

$ sudo apt-get install lynx

So lets begin:

STEP 1 <- Download the script here

STEP 2 <- Make it executable 

$ chmod +x  image_downloader-v2.sh

STEP 3 <- Run the script

$ ./image_downloader-v2.sh

Example

Enjoy ;)

Code:

 #!/bin/bash  
 # Written by Darrin Goodman with inspiration from:  
 # http://www.go2linux.org/linux/2010/09/how-download-all-links-webpage-including-hidden-776  
 # THIS PROGRAM WILL DOWNLOAD IMAGES THAT ARE LINKED FROM A WEBSITE,  
 # SUCH AS WHEN THERE IS A THUMBNAIL IMAGE THAT IS LINKED TO A LARGER VERSION OF THE SAME IMAGE.  
 # THIS IS VERSION 2 - THE SIMPLE IMAGE DOWNLOADER - NO EXTRA FRILLS  
 # THE MORE FULL-FEATURED VERSION CAN BE FOUND HERE:   
 # http://www.hilltopyodeler.com/blog/?p=324  
   
 function grabURL {  
 echo -n " Enter the desired URL or 'q' to QUIT: "  
 read a  
 for a in $(cat url); do #$a; done  
 if [ $a == "q" ]  
 then  
 # figlet Done!  
 echo "Done!"  
 echo  
 echo  
 exit 0  
 else  
 echo " The URL that you entered is: $a"  
 echo  
 echo "Ok...... working........................."  
 echo  
 fi ; done  
 }  
 function imageDownload {  
 grabURL  
 echo  
   
 # GRAB A LIST OF ALL IMAGES BEING LINKED TO AND STORE THEM IN FILE CALLED images.txt  
 # THIS LIST HAS BEEN EXPANDED TO ALSO GRAB SWF's AND FLV's  
 lynx --dump $a | awk '/http/{print $2}' | grep png > images.txt  
 lynx --dump $a | awk '/http/{print $2}' | grep jpg >> images.txt  
 lynx --dump $a | awk '/http/{print $2}' | grep gif >> images.txt  
 lynx --dump $a | awk '/http/{print $2}' | grep flv >> images.txt  
 lynx --dump $a | awk '/http/{print $2}' | grep swf >> images.txt  
 lynx --dump $a | awk '/http/{print $2}' | grep PNG >> images.txt  
 lynx --dump $a | awk '/http/{print $2}' | grep JPG >> images.txt  
 lynx --dump $a | awk '/http/{print $2}' | grep GIF >> images.txt  
 lynx --dump $a | awk '/http/{print $2}' | grep FLV >> images.txt  
 lynx --dump $a | awk '/http/{print $2}' | grep SWF >> images.txt  
   
 # LOOP THROUGH THE LIST OF IMAGES STORED IN images.txt AND DOWNLOAD THEM TO THE CURRENT DIRECTORY  
 for i in $(cat images.txt); do wget $i; done  
 echo "//////////////////////////////////////////////////////////////////////////////"  
 echo  
 echo "Your images have downloaded to your current working directory."  
 echo  
 echo "Thank you for using ImageDownloader"  
 echo  
 # figlet Done!  
 echo "Done!"  
 echo  
 echo  
 exit 0  
 }  
 function whatNext {  
 echo "What would you like to do?"  
 echo  
 imageDownload  
 }  
   
 # BEGINNING OF THE PROGRAM  
   
 # Clear the screen  
 clear  
 echo "This script relies on the program called \"lynx\"."  
 echo " ImageDownloader v.2 is ready"  
 echo  
   
 # PROMPT USER TO DECIDE WHAT TO DO NEXT  
 whatNext  
   
   
   
 exit 0  

Benjamin

Sunday, June 12, 2011

Firefox 4 Installer For Debian

Hello there, surfing in the "Internets" I found this bash script "install_firefox_debian.sh" written by AnonyMous.

So, if you use Debian or a Debian-based distro, like I do with #!CrunchBang Linux, you can install Firefox 4 instead of the default Iceweasel.


#!/bin/bash
# FireFox 4 Installer For Debian 
# Open As Root 
# Written By AnonyMous :)

echo " Downloading FireFox "

wget http://mozilla.c3sl.ufpr.br/releases//firefox/releases/4.0.1/linux-i686/en-US/firefox-4.0.1.tar.bz2

# Install FireFox4 

echo " Installing FireFox 4.0.1 "

tar -xvjf firefox*.tar.bz2

sudo mv firefox /usr/local/firefox4

sudo ln -s /usr/local/firefox4/firefox /usr/local/bin/firefox4

echo " FireFox Installed Successfully "


Benjamin

Saturday, June 11, 2011

Perl script to delete repeated entries of a plain text file using the Linux/UNIX command "awk"

This script called "delete_repeats.pl" will help you if you are looking for a more user friendly alternative to execute the following lines inside a Terminal:

$ awk ' !x[$0]++' input_file > output_file


I know! The syntax is very logical and simple, and such a one-liner probably does not deserve a Perl script to automate it.

But the good thing about writing a script that asks me for the input and output names and then places those values inside the awk system command is that I no longer need to type the whole line.

I copied my script into my "bin" folder, so I just execute it and only have to type the correct names of the input and output files, and that is all ;)
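
How the one-liner works is easy to miss, so here is a minimal sketch that wraps it in a function (the name `dedupe_lines` is hypothetical) with a more descriptive array name. The array is indexed by the whole line; the post-increment returns the old count, which is 0 (false) the first time a line is seen, so the negation is true only for first occurrences and awk prints just those.

```shell
#!/bin/bash
# dedupe_lines [FILE...]
# Prints each distinct line once, keeping the order of first appearance.
dedupe_lines() {
    # seen[$0]++ yields 0 for a new line and a positive count afterwards,
    # so !seen[$0]++ selects only the first occurrence of every line.
    awk '!seen[$0]++' "$@"
}

# Demo
printf 'b\na\nb\nc\na\n' | dedupe_lines
# b
# a
# c
```

Unlike `sort -u`, this keeps the original line order.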

Here is an example:

Input file:


Now, let's execute the Perl script with the following line inside a terminal (remember that we must first go into the folder that contains our input file, and make sure that our script is inside the same folder):


$ perl delete_repeats.pl


NOTE: I execute the script this way:


$ delete_repeats.pl


Because I copied the script to my /home/benjamin/bin folder, it is no longer necessary to copy the script to every folder where I want to execute it. This way the Terminal recognizes it and can execute it from any folder :D


And the output :D


Benjamin

Bug fixed: Perl script to generate "n" random DNA strings of "n" length with a desired nucleotide probability distribution defined by the user and export them in a file in FASTA format

After a short review of my own script "random_dna_strings.pl" I realized that it is more useful to set the C content probability (the last probability parameter that the user was asked for) automatically to:

line 116:

my $C_content = (1-($A_content+$T_content+$G_content));

rather than asking the user to type the remaining value that brings the total probability to 1.00.

Total p value = 1 = (A content + T content + G content + C content)


C content = (1 - (A content + T content + G content))
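
As a worked example of this formula, the sketch below computes the C content from the other three values with awk. The function name `c_content` is hypothetical, and the input check and rounding to two decimals are assumptions that are not taken from the Perl script.

```shell
#!/bin/bash
# c_content A T G
# Prints 1 - (A + T + G), rounded to two decimals.
c_content() {
    awk -v a="$1" -v t="$2" -v g="$3" 'BEGIN {
        c = 1 - (a + t + g)
        if (c < -1e-9) {                        # the three values exceed 1
            print "error: A + T + G is greater than 1" > "/dev/stderr"
            exit 1
        }
        if (c < 0) c = 0                        # clamp tiny rounding errors
        printf "%.2f\n", c
    }'
}

c_content 0.25 0.25 0.25    # prints 0.25
c_content 0.7 0.2 0.1       # prints 0.00
```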

Benjamin

Wednesday, June 8, 2011

SUPPORT HERE: LINUX 20TH YEAR ANNIVERSARY 2011 T SHIRT DESIGN CONTEST FINALIST

Information proudly taken from www.linux.com

After reviewing more than 130 submissions for the 20th Anniversary of Linux T-shirt design contest, we are excited to reveal our finalists!

Please check out the designs below and vote for your favorite. You will be able to vote once a day. Feel passionate about a specific design? Share this page with your friends online and encourage them to vote!

Voting will be open for two weeks: today, June 8, 2011, through midnight PT on June 22, 2011. The winner will be announced shortly thereafter. The winning design will be used as the basis for this year's LinuxCon attendee T-shirt and will be available on T-shirts for purchase at the Linux.com Store later this summer.

Now, get out the vote!


VOTE AND BE COOLER THAN BEFORE BY CLICKING HERE

(IT HAS BEEN DEMONSTRATED THAT VOTING WILL REDUCE THE PROBABILITY OF INCIDENCE OF MUTATIONS IN YOUR p53 GENES)
Benjamin

Sunday, June 5, 2011

Perl script to generate "n" random DNA strings of "n" length with a desired nucleotide probability distribution defined by the user and export them in a file in FASTA format

How many of you who work with artificial DNA sets and use the tools available on the Web have to edit the output, because the FASTA header of such programs does not fit your needs or the output is not in FASTA format? Or, for the lazy reason, because copying and pasting the output sequences into a manually generated file spends important ATP molecules that could be used for other tasks?

Well, this Perl script called "random_dna_strings.pl" can be downloaded by clicking the link.

This script takes arguments given by the user, such as the frequency of each nucleotide (on a scale from 0.0 to 1.0), generates "n" sequences of a chosen length with a FASTA header also given by the user, and finally prints an output file in FASTA format.
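
The sampling idea can be sketched with awk as follows. This is an illustration under assumptions, not the code of random_dna_strings.pl: the function name `random_dna_fasta`, the argument order, the numbered headers and the optional seed are all hypothetical. Each position draws a uniform number in [0, 1) and picks the nucleotide whose cumulative probability interval contains it; C takes whatever probability remains.

```shell
#!/bin/bash
# random_dna_fasta COUNT LENGTH P_A P_T P_G HEADER [SEED]
# Prints COUNT random sequences of LENGTH nucleotides in FASTA format.
# The C probability is implied: 1 - (P_A + P_T + P_G).
random_dna_fasta() {
    awk -v n="$1" -v len="$2" -v pa="$3" -v pt="$4" -v pg="$5" \
        -v h="$6" -v seed="${7:-0}" '
    BEGIN {
        srand(seed)                             # fixed seed gives repeatable output
        for (i = 1; i <= n; i++) {
            seq = ""
            for (j = 1; j <= len; j++) {
                r = rand()                      # uniform in [0, 1)
                if      (r < pa)           seq = seq "A"
                else if (r < pa + pt)      seq = seq "T"
                else if (r < pa + pt + pg) seq = seq "G"
                else                       seq = seq "C"
            }
            printf ">%s_%d\n%s\n", h, i, seq
        }
    }'
}
```

A call such as `random_dna_fasta 100 100 0.25 0.25 0.25 random_seq "$(date +%s)" > ran_set_1.fa` would use the current time as the seed so that each run differs.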

So, here is a short Tutorial about how to use it:

Well, imagine that we want to generate 100 random DNA sequences, each 100 nucleotides long, with an equal probability for every nucleotide, that is "A=0.25, T=0.25, G=0.25 and C=0.25", with the FASTA header "just a bunch of random sequences of equal random prob nt distribution" in a file called "ran_set_1.fa".

STEP 1 <- Open a Terminal inside the output folder (where you want to put the output file)

STEP 2 <- Execute the Perl script (this time I copied the script to my bin folder; if you have questions about how to run your scripts this way, please visit this entry).

Command:

$ random_dna_strings.pl


STEP 3 <- Please type the number of iterations (How many random sequences do you want)

In this Tutorial I want to generate 100 artificial sequences.


STEP 4 <- Please type the length of the random DNA strings (how many nucleotides length)

In this Tutorial I want each sequence to be 100 nucleotides long


STEP 5 <- Please type the probability distribution of A, T, G and C content:

In this Tutorial I want each nucleotide to have an equal probability (this means a probability of 1/4 for A, T, G or C at each string position)



STEP 6 <- Please, type the name of the fasta header for each sequence (is not necessary to put the >)

In this Tutorial, I want that the FASTA header be: "just a bunch of random sequences of equal random prob nt distribution"



STEP 7 <- Please, type the name of the output file:

In this Tutorial, I want to name it "ran_set_1.fa"



STEP 8 <- Enjoy the results


Another example:

Now I want to generate 100 random DNA sequences of 100 nucleotides length with a nucleotide distribution of "A=0.7, T=0.2, G=0.1 and C=0.0" with the FASTA header of "random dna sequence with too many As" in a file called "random_set_A_rich.fa".


Now, take a look of the output folder:


Do you see it :D, now we have two output files in FASTA format with randomly generated sequences that follow the nucleotide probability distribution given by us.

NOTE: The yellow highlighting just marks the differences in the content of A nucleotides between the two files, showing us that the probability distribution obviously has an impact on the nucleotide composition of the randomly generated DNA strings.

Benjamin

Saturday, June 4, 2011

Perl script to export every line of independent DNA sequences inside a single file to a FASTA formatted output file

This Perl script will take every line of the input file as a single DNA sequence and then it will create a new file with every sequence in FASTA format with the FASTA headers defined by the user.

The script is called "fasta_seq.pl" and can be downloaded by clicking the link.

So here it is a short Tutorial about how to use it:

Imagine that we got a file called "seq" that contains a single DNA sequence per row and we do not want to separate each one with spaces and name them manually.

NOTE: This script will only run under Linux/UNIX environments because it depends on the commands "pr", "sed", "tr" and "fold" (sorry to all the Windows users).



STEP 1 <- Open a Terminal and get inside the input files folder:

STEP 2 <- Execute the Perl script with

perl fasta_seq.pl





EXPLANATION OF THE PARAMETERS:
  1. perl <- here you are telling the Terminal that you want to run the Perl interpreter
  2. fasta_seq.pl <- name of the Perl script

STEP 3 <- Type the name of the input file that contains the DNA sequences (in this Tutorial, our file is called "seq")



STEP 4 <- Type the name of the fasta header for each sequence (is not necessary to put the ">" symbol).

In this Tutorial, I want the FASTA header to be "just a dna sequence"



STEP 5 <- Type the width of the sequences (how many nucleotides per column)

In this Tutorial, I want each line of sequence to be "45" nucleotides wide



STEP 6 <- Type the complete name of the output file (in this Tutorial, I want the output file to be named "output_sequences.fa")



STEP 7 <- Finally, enjoy the results ;)


Now we can go into our input folder and take a look of the output file:



Do you see it?? Now we have an output file that contains every sequence of the input file (this script considers every line as an independent DNA string) under a FASTA header named by ourselves and ready in FASTA format.


Benjamin

Thursday, June 2, 2011

Simple Perl script for computing the reverse complement of a single DNA string

This entry will help you if you are looking for a Perl script that reads an input file containing just one sequence, either in FASTA format or as the bare sequence without a FASTA header, computes the reverse complement of the string and then creates an output file in FASTA format (even if the original file was not in FASTA format). Later on I will upload a script that can handle any number of sequences inside a single FASTA file.

The script is named "rev_com.pl" and can be downloaded by clicking in the script name.
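
The core operation can be sketched in a few lines of shell. This is a minimal illustration, not the code of rev_com.pl: the function names and the default header text are hypothetical, and it assumes the `rev` utility is installed and that the sequence uses only A, C, G and T (upper or lower case).

```shell
#!/bin/bash
# revcomp
# Reads one sequence per line on stdin: reverses it, then swaps each base
# with its complement (A<->T, C<->G).
revcomp() {
    rev | tr 'ACGTacgt' 'TGCAtgca'
}

# revcomp_fasta FILE [HEADER]
# Drops any FASTA header lines in FILE, joins the sequence lines into one
# string and prints its reverse complement under a new FASTA header.
revcomp_fasta() {
    printf '>%s\n' "${2:-reverse_complement}"
    { grep -v '^>' "$1" | tr -d '\n\r'; echo; } | revcomp
}

# Demo
printf 'AACG\n' | revcomp
# CGTT
```

For example, `revcomp_fasta D_mel.fa D_mel_revcom > D_mel_revcom.fa`, where the header and output names are placeholders.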

NOTE: In a Linux/UNIX system we can set the properties of the script to be executable and then put it inside the "bin" folder (like /usr/local/bin or /home/USER/bin), So in this way we can run the script in every directory we want without the need to copy it to every folder where we want to run the script.

Here is an example of my personal "bin" folder:

But in this short Tutorial I will run the Perl script that is inside the folder of my DNA sequences and not the one that is inside my "bin" folder.



Let's run an example to see how it works ;)

Let's imagine that we have these two files inside a folder:
  1. One is in FASTA format <- D_mel.fa 
  2. The other one just contains the DNA sequence <- dna_sequence


    And we want to compute the reverse complement of each file and export each result to a new output file in FASTA format, with a FASTA header.

    So, let's run the script:

    NOTE: remember that we have to be inside the folder in the Terminal to execute the script. We can get there with the UNIX command "cd", followed by the path of the folder that contains our files.
    1. STEP 1 <- Access the folder with the terminal
    2. STEP 2 <- Execute the Perl script:

    $ perl rev_com.pl

    Example of executing the Perl script

    EXPLANATION OF THE PARAMETERS

    perl <- You are telling the Terminal that you want to run a Perl script.

    rev_com.pl <- name of the perl script.

    STEP 3 <- Write the name of the input file (if the file has extension, do not forget to type it too):


    STEP 4 <- Hit enter and run! (it was a joke):

    Enjoy the results directly from the Terminal

    At the bottom of the Terminal, you can read the name of the output file :D

    Enjoy the results of another example, with an input file that is not in FASTA format (it just contains the DNA sequence pasted in)

    Finally, we can go to the folder and see our new files, which contain the reverse complement of each input string:


    Can you see it now? The script has named the output files with the ".fa" extension and, inside each file, we can see the FASTA header that makes it a readable and recognizable FASTA file.

    ##############################################

    If you are curious and want to know how to run the script from the "bin" folder, here is a screenshot of it:


    As you can see, there is no need to tell the Terminal that we want to run a Perl script. Another good thing about running programs from the bin folder is that we do not need to copy them to every folder where our input files live.
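
The "bin" folder setup described in this post could look roughly like the following sketch. The function name `install_script` and its arguments are hypothetical; the only things taken from the post are the idea of a personal bin folder and the need to make the script executable. Running the script by name works because its first line, `#!/usr/bin/perl`, tells the system which interpreter to use.

```shell
#!/bin/sh
# Hypothetical helper: copy a script into a personal bin folder, make it
# executable and make sure that folder is searched by the shell.
#   $1 = path to the script, $2 = bin folder (defaults to $HOME/bin)
install_script() {
    script=$1
    bindir=${2:-"$HOME/bin"}
    mkdir -p "$bindir"
    cp "$script" "$bindir/"
    chmod +x "$bindir/$(basename "$script")"
    # Prepend the folder to PATH for the current session if it is missing
    case ":$PATH:" in
        *":$bindir:"*) ;;
        *) PATH="$bindir:$PATH"; export PATH ;;
    esac
}
```

After something like `install_script rev_com.pl`, typing `rev_com.pl` from any folder should start the script in the current session. To keep the setting for new Terminals, the PATH line would have to go into a shell startup file, which this sketch does not do.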

    #!/usr/bin/perl
    #       rev_com.pl
    #       
    #       Copyright 2011 Benjamin Tovar 
    #       
    #       This program is free software; you can redistribute it and/or modify
    #       it under the terms of the GNU General Public License as published by
    #       the Free Software Foundation; either version 2 of the License, or
    #       (at your option) any later version.
    #       
    #       This program is distributed in the hope that it will be useful,
    #       but WITHOUT ANY WARRANTY; without even the implied warranty of
    #       MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    #       GNU General Public License for more details.
    #       
    #       You should have received a copy of the GNU General Public License
    #       along with this program; if not, write to the Free Software
    #       Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
    #       MA 02110-1301, USA.
    #
    ################################################################################
    #
    # NAME OF THE PROGRAM: rev_com.pl
    # DATE: 01/Jun/2011
    # AUTHOR: Benjamin Tovar
    # COMMENTS: Little script that can be used for the computation of 
    # the reverse complement of a DNA string and then export the resulting string
    # to a file in Fasta format with fasta headers (even if the original file was not)
    #
    ################################################################################
    
    use warnings;
    use strict;
    
    #### script introduction
    
    print "\n\n-------------------------------------------------------------\n\n";
    
    print "PROGRAM DEFINITION: rev_com.pl <- PERL SCRIPT WRITTEN BY BENJAMIN TOVAR
    THAT COMPUTES THE REVERSE COMPLEMENT OF DNA STRINGS AND EXPORT THE RESULTS TO AN 
    OUTPUT FILE IN FASTA FORMAT (EVEN IF THE ORIGINAL FILE WAS NOT)\n\n";
    
    print "This program will ask the user to type the name of the file 
    (if the file name has an extension, write it please) that contains the DNA sequence\n\n";
    
    #### Ask the user for the name of the file
    
    print "Please, type the complete name of the file:\n\n";
    
    my $user_in = <STDIN>;
    
    ### Remove the trailing newline
    
    chomp $user_in;
    
    ### Open the file
    
    open(INPUT_FILE, '<', $user_in) or die "\n\n ERROR: Cannot open \"$user_in\" ($!), please check the spelling, the extension and the existence of the file\n\n";
    
    ### Copy the content of the file to an array variable called "@input_file"
    
    my @input_file = <INPUT_FILE>;
    
    ### Remove the Fasta header and extract the DNA sequence
    
    my $input_file = extract_sequence_from_fasta_data(@input_file);
    
    ### Close the opened file
    
    close INPUT_FILE;
    
    ### Compute the reverse DNA string
    
    my $rev_com = reverse ($input_file);
    
    ### Compute the complement DNA string
    
    $rev_com =~ tr/ATGCatgc/TACGtacg/;
    
    ### Compute the length of the input DNA string
    
    my $length = length ($input_file);
    
    ############## RESULTS #################################################
    
    print "\n----------- RESULTS --------------\n\n";
    
    print "Input file string of length $length:\n\n";
    
    print $input_file,"\n";
    
    print "\nOutput reverse complement of the DNA string:\n\n";
    
    print $rev_com, "\n\n";
    
    ############## EXPORTING THE RESULTS TO A FILE IN FASTA FORMAT #########
    
    ######################## Naming the output file: #######################
    
    ### Extract the FASTA header of the input file (or build one from the file name)
    
    my $fasta_header_name = extract_fasta_header_name(@input_file);
    
    ## Remove the input file extension (only the part after the last dot)
    
    $user_in =~ s/\.[^.\/]*$//; 
    
    # Concatenate the name of the file (without the .fa/.fasta extension) with "-revcom.fa"
    
    my $rev_com_name = "-revcom.fa";
    
    my $output_name = $user_in . $rev_com_name;
    
    # Set the file handle "OUTPUT" and stop if the file cannot be created
    
    open(OUTPUT, '>', $output_name) or die "\n\n ERROR: Cannot write to \"$output_name\" ($!)\n\n";
    
    # Print the content of the variable "$rev_com" (the reverse complement string) 
    # into the file "$output_name", preceded by the Fasta header "$fasta_header_name" plus "-reverse_complement"
    
    print OUTPUT "$fasta_header_name", "-reverse_complement\n", "$rev_com\n";
    
    close OUTPUT;
    
    
    print "-------- EXPORT THE RESULTS TO A FILE IN FASTA FORMAT ----------\n";
    print "\nThe output string has been exported to the file \"$output_name\"\n\n";
    
    exit;
    
    ################################################################################
    ############################### SUBROUTINES ####################################
    ################################################################################
    
    
    ################################################################################
    # extract_fasta_header_name
    # A subroutine to extract the FASTA header of the original input file
    # and use it to name the FASTA header of the output file
    ################################################################################
    
    sub extract_fasta_header_name{
    
        my(@fasta_file_data) = @_;
    
        # Default header: if the file is not in Fasta format,
        # use the name of the file to name the fasta header
        my $fasta_header_name = ">$user_in";
    
        foreach my $line (@fasta_file_data) {
        
            # Use the first Fasta header found in the file
            if($line =~ /^>/) {
                
                $fasta_header_name = $line;
                last;
            
            }
            
        }
    
        # Remove whitespace (including the trailing newline) from $fasta_header_name string
        $fasta_header_name =~ s/\s//g;
            
        # Export the results of the subroutine to the main program    
        return $fasta_header_name;
    
    }
    
    ################################################################################
    # extract_sequence_from_fasta_data
    # A subroutine to extract FASTA sequence data from an array
    # taken from James Tisdall's Beginning Perl for Bioinformatics
    ################################################################################
    
    sub extract_sequence_from_fasta_data {
    
        my(@fasta_file_data) = @_;
    
        use strict;
        use warnings;
    
        # Declare and initialize variables
        my $sequence = '';
    
        foreach my $line (@fasta_file_data) {
    
            # discard blank line
            if ($line =~ /^\s*$/) {
                next;
    
            # discard comment line
            } elsif($line =~ /^\s*#/) {
                next;
    
            # discard fasta header line
            } elsif($line =~ /^>/) {
                next;
    
            # keep line, add to sequence string
            } else {
                $sequence .= $line;
            }
        }
    
        # remove non-sequence data (in this case, whitespace) from $sequence string
        $sequence =~ s/\s//g;
    
        return $sequence;
    }
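
For a quick check of the script's output from the Terminal, the same idea can be sketched with standard UNIX tools. This is not part of rev_com.pl: the function name `revcom` is hypothetical, and the sketch assumes that the `rev` command (shipped with util-linux on most Linux systems) is available. Like the Perl script, it only complements A, T, G and C in upper or lower case and leaves any other character unchanged.

```shell
#!/bin/sh
# Hypothetical shell counterpart of rev_com.pl for quick checks.
# Reads one sequence (FASTA or raw) from standard input and prints its
# reverse complement on a single line.
revcom() {
    # Drop FASTA header and comment lines, then join the rest into one string
    seq=$(grep -v -e '^>' -e '^[[:space:]]*#' | tr -d ' \t\r\n')
    # Reverse the string, then complement each base
    printf '%s\n' "$seq" | rev | tr 'ATGCatgc' 'TACGtacg'
}
```

Something like `revcom < D_mel.fa` should then print the same string that the script writes below the FASTA header of its output file.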
    

    Benjamin.

    Tuesday, May 31, 2011

    How to print every file inside a folder with just one command line in Linux/UNIX

    This time I want to show you a very easy way to print every file inside a folder directly from the Terminal on a Linux/UNIX system:

    The command is called lpr and here is a short definition of it, taken from http://linux.about.com/:


    NAME
    lpr - print files  

    SYNOPSIS
    lpr [ -E ] [ -P destination ] [ -# num-copies ] [ -l ] [ -o option ] [ -p ] [ -r ] [ -C/J/T title ] [ file(s) ]

    DESCRIPTION
    lpr submits files for printing. Files named on the command line are sent to the named printer (or the system default destination if no destination is specified). If no files are listed on the command-line lpr reads the print file from the standard input.

    OPTIONS

    The following options are recognized by lpr:

    -E 
    Forces encryption when connecting to the server.

    -P destination
    Prints files to the named printer.

    -# copies
    Sets the number of copies to print from 1 to 100.

    -C name
    Sets the job name.

    -J name
    Sets the job name.

    -T name
    Sets the job name.

    -l
    Specifies that the print file is already formatted for the destination and should be sent without filtering. This option is equivalent to "-oraw".

    -o option
    Sets a job option.

    -p
    Specifies that the print file should be formatted with a shaded header with the date, time, job name, and page number. This option is equivalent to "-oprettyprint" and is only useful when printing text files.

    -r
    Specifies that the named print files should be deleted after printing them.


    #####################

    So, here is a little example using the lpr command:

    Let's suppose that we have 229 PDF files inside a folder and we want to print them all, obviously without having to open each of them, click "File", then "Print", select the number of copies and finally click PRINT.

    Just imagine the amount of time that would take if we did it manually, so we will do something much better with just one line of text inside a Terminal ;)

    NOTE: If you want to export your files (the ones that are supported by OpenOffice, like *.doc, *.ppt, *.odp, *.odt and so on) to another format such as PDF, you can use another Linux/UNIX program called UNOCONV. In a past post I show in a simple way how to use it.

    Folder with the many PDF files of the example

    The command that best fits my needs is this one:

    $ lpr -o landscape -s -P Photosmart-C5200-series *.pdf


    Explanation:

    -o landscape = a printing option that specifies that every file must be printed in landscape orientation.

    -s = this option is not in the man page excerpt above; in the traditional BSD lpr it asks the spooler to use symbolic links instead of copying the files to the spool directory.

    -P Photosmart-C5200-series = tells the command that the printer I want to use is Photosmart-C5200-series (I recommend that you double-check this parameter to avoid sending 10,000 files to the wrong printer).

    *.pdf = means that every PDF file inside the folder will be printed (this can be changed, for example to *.odt to print every OpenOffice Writer text file, and so on).
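
Because a wrong printer name can waste a lot of paper, it may help to preview the commands before running them. The sketch below is an assumption-laden wrapper, not something from the lpr documentation: the function name `print_pdfs` and the `LPR` variable are hypothetical. It only uses the `-o landscape` and `-P` options explained above and sends the files one by one.

```shell
#!/bin/sh
# Hypothetical wrapper around the lpr call shown above.
#   $1 = printer name, $2 = folder (defaults to the current one)
# Set LPR=echo to preview the commands instead of printing.
print_pdfs() {
    printer=$1
    dir=${2:-.}
    for f in "$dir"/*.pdf; do
        # If nothing matches, the pattern stays literal; skip it
        [ -e "$f" ] || continue
        ${LPR:-lpr} -o landscape -P "$printer" "$f"
    done
}
```

Running `LPR=echo print_pdfs Photosmart-C5200-series` first lists the arguments of every lpr call without printing anything; the same call without `LPR=echo` then submits the jobs.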

    Here is a screenshot of my Terminal inside that folder with the whole command typed in:



    Hope this entry will help someone ;)

    Benjamin

    Thursday, May 26, 2011

    Document automata .. How to export my *.doc, *.odp, *.odt (and so on) files to other formats (*.pdf included) directly from the Terminal.

    Hello there! In this little tutorial I want to introduce a UNIX program called UNOCONV:

    Here is the official description of the program:


    Unoconv converts between any document format that OpenOffice understands. It uses OpenOffice's UNO bindings for non-interactive conversion of documents.

    Supported document formats include Open Document Format (.odt), MS Word (.doc), MS Office Open/MS OOXML (.xml), Portable Document Format (.pdf), HTML, XHTML, RTF, Docbook (.xml), and more.

    And here is the reference of the program from the Linux Man Page: http://linux.die.net/man/1/unoconv

    In this example, we are going to convert a bunch of *.ppt, *.doc and *.odt files that are inside the same folder to *.pdf (now imagine the human time required to open every file and export it manually to PDF; with UNOCONV that is done in one line of code).

    Example files inside the same folder


    1. Install the program:
    On Debian-based distros (like Ubuntu), type this in the Terminal:


    $  sudo apt-get install unoconv


    2. Open a Terminal (it can be the same one that we used to install the program) and go to the directory of the files whose format we want to change (remember that it is possible to look at what is inside a folder from the Terminal by typing the UNIX command "ls").

    3. Now, inside the directory, we can combine the parameters to pick whatever best fits our problem.


    In the particular case of this example, I want to export every file inside the folder to *.pdf, with this command:


    $ unoconv -f pdf *.*


    Example


    Explanation of the parameters:

    1. unoconv = name of the program
    2. -f = output format. I choose "pdf" because I want all the files inside the folder (*.doc, *.ppt, etc.) to be exported to PDF.
    3. *.* = means that every file with any name (the asterisk before the dot) and any extension (the asterisk after the dot) will be used as input by the unoconv program.
    4. Finally, we execute the program by pressing enter and see the results inside the folder.
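
Since `*.*` matches every file name that contains a dot, it also picks up files that are already PDFs or that are not documents at all. A more selective variant could look like the sketch below. The function name `convert_to_pdf`, the `UNOCONV` variable and the list of extensions are assumptions for illustration; only the `unoconv -f pdf` call comes from this tutorial.

```shell
#!/bin/sh
# Hypothetical wrapper: convert only the listed document types to PDF.
#   $1 = folder (defaults to the current one)
# Set UNOCONV=echo to preview the commands instead of converting.
convert_to_pdf() {
    dir=${1:-.}
    for f in "$dir"/*.doc "$dir"/*.ppt "$dir"/*.odt; do
        # Skip patterns that matched nothing
        [ -e "$f" ] || continue
        ${UNOCONV:-unoconv} -f pdf "$f"
    done
}
```

As a dry run, `UNOCONV=echo convert_to_pdf` prints the arguments of each conversion, so the list of files can be checked before anything is converted.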



    And the output from the Terminal with the UNIX command "ls" :


    As you can see, it is very easy to use and very helpful when we need to export thousands of files from an editable format to PDF and, obviously, do not want to do it manually.


    This program also helps if you use OpenOffice and want to convert every MS Office file to a format that better suits us dear Linux users.

    Hope this short Tutorial helps someone!

    Benjamin Tovar

    Wednesday, May 25, 2011

    Hello, is there any GENOBIOTEC out there?..

    It has been so long since my last post, but trust me, I was busier by orders of magnitude, because I am a member of AsEBioGen (Student Association of Genomic Biotechnology, in Spanish) and we have the proud tradition of organizing an International Congress of Biotechnology every 2 years ;)

    This year our congress began on the 19th of May, and last Saturday, the 21st, GENOBIOTEC 2011 finally concluded.
    It was crazy; you can never imagine the amount of work that the whole project requires. I can sum up thousands of words by saying that in three days I slept only 6 or 7 hours.

    But now, to the relief of my exhausted and much appreciated teammates, we are done, and we did a very good job.

    Congratulations to all of you AsEBioGen 2010-2012 members!




    Special thanks to:


    Workshop Manager:
    Paulyna Magaña.

    Webmaster:
    Daniel Rodríguez.  


    Transportation Chief and Logistics:

    Jesús Montes.
    Daniel Rodríguez.  

    Allyson Treviño.
    Gilberto Saca.


    FOR ALL THE STAFF OF GENOBIOTEC 2011 THANK YOU VERY MUCH!