Tuesday, February 25, 2014

Convert Ensembl, Unigene, Uniprot and RefSeq IDs to Symbol IDs in R using Bioconductor

Hello, I have written a function that converts IDs from different sources to Symbol IDs.

The input ID types currently supported are Ensembl, Unigene, Uniprot and RefSeq.

The code is available here.

NOTE: The function depends on the Bioconductor package "org.Hs.eg.db", available here.

For example, let's show 10 Ensembl IDs:

> id[1:10]
 [1] "ENSG00000121410" "ENSG00000175899" "ENSG00000256069" "ENSG00000171428"
 [5] "ENSG00000156006" "ENSG00000196136" "ENSG00000114771" "ENSG00000127837"
 [9] "ENSG00000129673" "ENSG00000090861"

And their Symbol IDs:

> res[1:10]
 [1] "A1BG"     "A2M"      "A2MP1"    "NAT1"     "NAT2"     "SERPINA3"
 [7] "AADAC"    "AAMP"     "AANAT"    "AARS"    

This is a running example that uses the function to convert Unigene IDs to Symbol IDs (for the other ID types, just replace "unigene" with "ensembl", "refseq" or "uniprot"):

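# USAGE EXAMPLE: UNIGENE
library(org.Hs.eg.db)
# table with Entrez gene IDs (column 1) and Unigene IDs (column 2)
unigene <- toTable(org.Hs.egUNIGENE)
# extract 100 random Unigene entries
id <- unigene[sample(nrow(unigene), 100), 2]
id.type <- "unigene"
# get.symbolIDs() is the function from the code linked above
res <- get.symbolIDs(id, id.type)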
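
To give an idea of what such a conversion involves, here is a minimal sketch in base R. It is not the code of get.symbolIDs: the helper ids.to.symbols is hypothetical, and the two small data frames are toy stand-ins that I assume have the same layout as toTable(org.Hs.egENSEMBL) and toTable(org.Hs.egSYMBOL) (Entrez gene ID in the first column). The idea is a two-step lookup: source ID to Entrez gene ID, then Entrez gene ID to Symbol.

```r
# Toy stand-ins for the org.Hs.eg.db tables (assumed layout, two genes only)
ensembl.table <- data.frame(
  gene_id    = c("1", "2"),
  ensembl_id = c("ENSG00000121410", "ENSG00000175899"),
  stringsAsFactors = FALSE
)
symbol.table <- data.frame(
  gene_id = c("1", "2"),
  symbol  = c("A1BG", "A2M"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns one Symbol per input ID, NA when unmapped
ids.to.symbols <- function(id, id.table, symbol.table) {
  # Step 1: source ID -> Entrez gene ID (source IDs sit in column 2)
  gene.id <- id.table$gene_id[match(id, id.table[, 2])]
  # Step 2: Entrez gene ID -> Symbol
  symbol.table$symbol[match(gene.id, symbol.table$gene_id)]
}

# The last ID is made up on purpose to show the unmapped case
id  <- c("ENSG00000175899", "ENSG00000121410", "ENSG00000000000")
res <- ids.to.symbols(id, ensembl.table, symbol.table)
print(res)
```

Note that match() keeps only the first hit, so an ID that maps to several genes would need extra handling in real use.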

Benjamin

Monday, February 10, 2014

Upgrade and update R 2.15 to R 3.0 in Debian Wheezy

Following the instructions from CRAN, you need to add the R backports to your sources list.

FIRST PART: ADD R BACKPORTS: 

First, open a Terminal and open the sources.list file:

$ gksudo gedit /etc/apt/sources.list

Then, add these lines at the bottom of the file (note: I use the Revolution Analytics server in Dallas, TX, but you can easily pick another one from the list of mirrors here):

## R BACKPORTS FOR WHEEZY
deb http://cran.revolutionanalytics.com/bin/linux/debian wheezy-cran3/
#deb-src http://cran.revolutionanalytics.com/bin/linux/debian wheezy-cran3/

SECOND PART: RENAME THE R PACKAGES FOLDER:

R stores the packages we download in a folder named after its version. Just rename that folder to the new version of R. For example, mine was "2.15" and I renamed it to "3.0"; it was inside this path:

Before:
/home/benjamin/R/x86_64-pc-linux-gnu-library/2.15
After:
/home/benjamin/R/x86_64-pc-linux-gnu-library/3.0
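
If you are not sure where that folder is on your machine, R itself can tell you. A minimal sketch using only base R (the paths printed will differ from mine, and the variable names are just for illustration):

```r
# Folders where R looks for installed packages; when a per-user
# library exists it is normally listed first
lib.paths <- .libPaths()
print(lib.paths)

# Major.minor version of the running R (e.g. "3.0"), which is the
# name used for the per-user library folder in the path above
r.version <- paste(R.version$major,
                   sub("\\..*$", "", R.version$minor),
                   sep = ".")
print(r.version)
```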

Remember that some packages also need to install files in folders that belong to root, so I would recommend opening R in sudo mode (only if you're sure about what you're doing :P) by running "sudo R". Then, once R 3.0 is installed (see the fourth part), type in the R console:

update.packages(checkBuilt=TRUE, ask=FALSE)

THIRD PART: SECURE APT:

The Debian backports archives on CRAN are signed with the key ID 381BA480. To add the key, type in a terminal prompt:

gpg --keyserver pgp.mit.edu --recv-key 381BA480
gpg -a --export 381BA480 > jranke_cran.asc
sudo apt-key add jranke_cran.asc


FOURTH PART: UPDATE AND UPGRADE R:

Save the sources.list file. Then either open Synaptic, reload the package list and upgrade the packages, or type in a terminal:

sudo apt-get update
sudo apt-get upgrade

And that's all.
Benjamin

Saturday, October 19, 2013

Optimizing a multivariable function parameters using a random method, genetic algorithm and simulated annealing in R


Say that you are implementing a non-linear regression analysis, which Wikipedia briefly describes as:

"In statistics, nonlinear regression is a form of regression analysis in which observational data are modeled by a function which is a nonlinear combination of the model parameters and depends on one or more independent variables."

For the training set, we have the following:


And the function to optimize the parameters is:



Which leads us to the following equality:


In other words, we want to optimize the value of theta in order to minimize the sum of the errors between y and predicted.y.

Given theta (each parameter a0, ..., a3 ranges from 0 to 15):


And the error function:


Finally, the goal function:


In other words, the goal function searches for the value of theta that minimizes the error.
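
To make the idea concrete, here is a minimal sketch in base R. It is not the downloadable implementation from this post. The model, the simulated training set and the squared-error measure are assumptions chosen only for illustration; the only things taken from the post are the four parameters a0..a3 and their range from 0 to 15. It shows the random method and simulated annealing (through optim with method = "SANN"); a genetic algorithm is left out to keep it short.

```r
set.seed(1)
lower <- 0
upper <- 15

# Hypothetical model, nonlinear in a2 (theta = c(a0, a1, a2, a3))
model <- function(x, theta) {
  theta[1] + theta[2] * exp(-theta[3] * x) + theta[4] * x
}

# Simulated training set (assumed, not the data from the post)
x <- seq(0, 5, length.out = 50)
y <- model(x, c(2, 10, 1.5, 0.5)) + rnorm(length(x), sd = 0.2)

# Error function: sum of squared differences between y and predicted.y
error.fn <- function(theta) {
  predicted.y <- model(x, theta)
  sum((y - predicted.y)^2)
}

# Random method: draw candidates uniformly in [0, 15]^4, keep the best
n.candidates <- 5000
candidates   <- matrix(runif(4 * n.candidates, lower, upper), ncol = 4)
errors       <- apply(candidates, 1, error.fn)
random.best  <- candidates[which.min(errors), ]
random.error <- min(errors)

# Simulated annealing: with method = "SANN", the gr argument of optim
# generates a new candidate; here it is a Gaussian step clamped to [0, 15]
propose <- function(theta) {
  pmin(pmax(theta + rnorm(length(theta), sd = 0.5), lower), upper)
}
sann <- optim(random.best, error.fn, gr = propose, method = "SANN",
              control = list(maxit = 5000))

print(round(random.best, 3)); print(random.error)
print(round(sann$par, 3));    print(sann$value)
```

Starting the annealing from the best random candidate is just one possible design choice; any point inside the allowed range would do as a starting value.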

COMPUTATIONS BEGIN

This is the scatter plot of the training set:


Here is the implementation in R; you can download the file here.

Here is a result plot using the genetic algorithm:


Benjamin

Saturday, August 17, 2013

Lieutenant Dan You Got a New Interface..

After some days of thinking, I realized that this blog deserved a little more attention, so I decided to change the interface and I am happy with how it looks now.

Hope to see you again in my personal blog.

Keep on programming!

Benjamin

Friday, August 16, 2013

Accuracy versus F score: Machine Learning for the RNA Polymerases

Hello, today I'm going to show you the difference between two common performance measures (useful not only for Machine Learning, but in every scientific field). So far, I have found accuracy values reported more often than F scores when measuring the performance of methods ranging from metaheuristics (Genetic Algorithm fitness functions) to promoter recognition programs, diagnostic methods and so on.

However, I would really recommend avoiding the accuracy measure. The reason is shown below with a nice example in the R programming language (all the functions used in the simulation are included; you can download them here).

Case study 1:

Imagine that you are on a Computer Vision project and your task is to "teach" a program to distinguish electric guitars from acoustic guitars by showing it pictures of different guitars.

Suppose that you've already developed that program and now you want to measure the performance of this Boolean classifier (that is, you show the program a picture of an electric guitar, and the program has to decide whether to classify it as an electric or as an acoustic guitar).

For the purposes of this post, let's write down some useful concepts.

Consider the following:

TP: a true positive is when the program classifies an electric guitar as an electric guitar. We will use the letter "E" to denote the electric guitar "class".

FP: a false positive is when the program classifies an acoustic guitar as an electric guitar. We will use the letter "A" to denote the acoustic guitar "class".

FN: a false negative is when the program classifies an electric guitar as an acoustic guitar

TN: a true negative is when the program classifies an acoustic guitar as an acoustic guitar

Now that we are ready, let's begin with the calculations.

In R, I have simulated the results of the program for, say, 1,000 electric guitar pictures and 1,000 acoustic guitar pictures.

The program produces the following results:

        PREDICTED.E PREDICTED.A
TRUE.E         485         515
TRUE.A           9         991

As you can see, of the 1,000 electric guitar pictures, only 485 were labeled as electric (TP=485); the rest were labeled as acoustic (FN=515). I feel bad for the hypothetical programmer of this hypothetical example.

On the other hand, of the 1,000 acoustic guitars, 991 were labeled as acoustic (TN=991) and only 9 were labeled as electric (FP=9). Well, not bad!... or is it?

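The accuracy of this program is (TP + TN) / (TP + FP + FN + TN) = (485 + 991) / 2000 = 0.738

To compute the F score, it is necessary to compute the precision, TP / (TP + FP), and the recall, TP / (TP + FN), first:

precision = 0.9817814 and recall = 0.485

Then the F score, 2 * precision * recall / (precision + recall), is equal to 0.6492637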
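
The numbers above can be reproduced with a few lines of R. This is a minimal sketch and not the downloadable code from this post: the function name performance is hypothetical, and it assumes the balanced F score (F1).

```r
# Hypothetical helper: performance measures from the four counts
performance <- function(TP, FP, FN, TN) {
  accuracy  <- (TP + TN) / (TP + FP + FN + TN)
  precision <- TP / (TP + FP)
  recall    <- TP / (TP + FN)
  # Balanced F score: harmonic mean of precision and recall
  f.score   <- 2 * precision * recall / (precision + recall)
  c(accuracy = accuracy, precision = precision,
    recall = recall, f.score = f.score)
}

# Counts taken from the confusion matrix of case study 1
case1 <- performance(TP = 485, FP = 9, FN = 515, TN = 991)
print(round(case1, 7))
```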


Well, the F score seems to be more "strict" than accuracy, and in fact it is. But this example is not very "cool". Let's move on to case study 2.

Case study 2:

Now we have 1,000 electric guitar pictures and 100,000 acoustic guitar pictures. The confusion matrix of the results is:

        PREDICTED.E PREDICTED.A
TRUE.E         493         507
TRUE.A        1017       98983

As you can see, of the 1,000 electric guitar pictures, only 493 were labeled as electric (TP=493); the rest were labeled as acoustic (FN=507).

On the other hand, of the 100,000 acoustic guitars, 98,983 were labeled as acoustic (TN=98983) and only 1,017 were labeled as electric (FP=1017).

Now (cha cha chan!), the performance values are:

Accuracy: 0.9849109
Precision: 0.3264901
Recall: 0.493
F score: 0.3928287
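
A quick way to see how much the class imbalance inflates accuracy is to compare the program with a hypothetical baseline that is not part of the simulation: a "classifier" that labels every picture as acoustic. The sketch below uses the F score in its equivalent form 2*TP / (2*TP + FP + FN), which avoids the 0/0 precision of the baseline; all variable names are just for illustration.

```r
n.electric <- 1000
n.acoustic <- 100000

# Baseline: every picture is labeled acoustic, so no electric guitar is found
TP <- 0; FN <- n.electric
FP <- 0; TN <- n.acoustic
baseline.accuracy <- (TP + TN) / (TP + FP + FN + TN)
baseline.f        <- 2 * TP / (2 * TP + FP + FN)

# The program from case study 2 (counts from its confusion matrix)
program.accuracy <- (493 + 98983) / (493 + 1017 + 507 + 98983)
program.f        <- 2 * 493 / (2 * 493 + 1017 + 507)

print(c(baseline.accuracy = baseline.accuracy, baseline.f = baseline.f))
print(c(program.accuracy = program.accuracy, program.f = program.f))
```

Under these assumptions the useless baseline gets a slightly higher accuracy than the program, while its F score is 0.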

Do you see it now? How is it possible that, while missing about 50% of the electric guitar labels, the program reaches an accuracy of almost 0.99, even though neither its precision nor its recall is greater than 0.50? The reason is that the 100,000 acoustic pictures dominate the accuracy, so the true negatives hide the errors made on the electric class. So we have a winner, and it is the F score.

For references, visit the following pages:

http://en.wikipedia.org/wiki/Accuracy
http://en.wikipedia.org/wiki/F1_score
http://en.wikipedia.org/wiki/Precision_and_recall


Saturday, May 25, 2013

Make your LaTeX presentations using the Beamer class

Hello, I have been working on this tutorial. All you need to do is download the source files and take a look at the source code and the PDF file.


Download link here

Hope you find LaTeX as useful as I do.

Benjamin

Sunday, December 2, 2012

Poster presented at the XIV Escuela de Otoño en Biología Matemática, Mexico

Poster presented at the XIV Escuela de Otoño en Biología Matemática - 8vo Encuentro de Biología Matemática, held in San Luis Potosí, S.L.P., Mexico:

"Prediction of RNA Pol II promoters in Drosophila melanogaster using signal, context and structure properties from nucleotide sequences"

Poster available here