Wednesday, June 29, 2011

Bash script for split an input FASTA file into single files in FASTA format output

Hello everyone, yesterday I downloaded a FASTA file with more than 16,500 DNA sequences but I also need that every single sequence of that original FASTA file be split into 16,500 single files and each file must contain a single DNA sequence.

To do so, I write this short and simple script in bash called "fasta_split.sh".

To use the script:

$ ./fasta_split.sh

Here is an example, just follow the instructions the program ask you and that's all



Example input/output folders and input/output files:


 Code:

#!/bin/bash

#       fasta_split.sh
#       
#       2011 - Benjamin Tovar 
#       
#
# NAME OF THE PROGRAM: fasta_split.sh
# 
# DATE: 29/JUN/2011
# 
# AUTHOR: Benjamin Tovar
# 
# COMMENTS: This script will split a FASTA file that contains many sequences into
# Separated files containing one single sequence per file.
#
# I used this script to split a FASTA file containing ~16,500 sequences into
# ~16,500 independent files very quickly and simple.
#
################################################################################

# BEGINNING OF THE PROGRAM
echo
echo "This script relies on the program called \"csplit\" available in Linux/UNIX systems."

# Ask the user to type the name of the input FASTA file:
echo
echo -n "   Enter the name of the input file in FASTA format: "
    read input_file

# Ask the user to type the name of the output FASTA files (note in line 73, 
# "--suffix="%02d.fa" means that every output file will have the extension "*.fa"
# if you like to use another extension (for example, you like that every output file have
# the extension *.fasta), just replace the ".fa" with ".fasta" this way -> "--suffix="%02d.fasta"

# The part of "02" in "--suffix="%02d.fa" means that every file will be named with two numbers 
# sorted by their occurrence in the original FASTA input file.

# For example: in an input file that contains 2 sequences and with "--suffix="%02d.fa"
# the output will be:
#
# outputfile-00.fa
# outputfile-01.fa
#
# For example: in an input file that contains 2 sequences and with "--suffix="%04d.fa"
# the output will be:
#
# outputfile-0000.fa
# outputfile-0001.fa
echo      
echo -n "   Enter the name of the output files: "
    read output_file

# Ask the user to type the name of the output folder that will contain all the output files   
echo 
echo -n "   Enter the name of the output folder: "
read dir
echo
echo "Creating directory called \"$dir\"."

# Create the output folder
mkdir $dir

# Copy the input FASTA file to the output folder
cp $input_file $dir

# Open to the output folder
cd $dir
echo
echo "   ...splitting FASTA file ...saving them in \"$dir\"."

# Splitting the input FASTA file (For some settings, read line 31 to 49)
csplit -z $input_file '/^>/' '{*}' --suffix="%02d.fa" --prefix=$output_file- -s

# Delete the input FASTA file in the output folder
rm $input_file
echo
echo "Printing output summary in a file called \"OUTPUT-SUMMARY.out\""

# Create the output summary
ls | sort | sed -e 's/OUTPUT-SUMMARY.out//g' > OUTPUT-SUMMARY.out 
echo
echo " ---- PROCESS DONE ---- "
echo


Benjamin