Computational Biology Blog in fasta format: 2013

Saturday, October 19, 2013

Optimizing the parameters of a multivariable function using a random method, a genetic algorithm and simulated annealing in R

Say that you are implementing a non-linear regression analysis, which is briefly described by Wikipedia as:

"In statistics, nonlinear regression is a form of regression analysis in which observational data are modeled by a function which is a nonlinear combination of the model parameters and depends on one or more independent variables."

For the training set, we have the following:

And the function to optimize the parameters is:

Which leads us to the following equality:

In other words, we want to optimize the value of theta in order to minimize the sum of the errors between y and predicted.y:

Given theta (each parameter a0,..a3 has a range from 0 to 15):

And the error function:

Finally, the goal function:

In other words, the goal function searches for the value of theta that minimizes the error.
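As a rough illustration of the random method, here is a minimal sketch in R. The model and the training set of this post are not reproduced in the text, so everything specific below is an assumption made for the example: a cubic polynomial as the model, a sum of squared differences as the error, a synthetic training set, and the hypothetical names `predict_y`, `error_fn` and `random_search`. Only the parameter range (0 to 15 for each of a0,..a3) comes from the post.

```r
# Hypothetical model: a cubic polynomial in x (an assumption; the actual
# function of the post is not reproduced here). theta = c(a0, a1, a2, a3).
predict_y <- function(theta, x) {
  theta[1] + theta[2] * x + theta[3] * x^2 + theta[4] * x^3
}

# Assumed error: sum of squared differences between y and predicted.y.
error_fn <- function(theta, x, y) {
  sum((y - predict_y(theta, x))^2)
}

# Random method: draw candidate thetas uniformly from [lower, upper]
# and keep the candidate with the smallest error seen so far.
random_search <- function(x, y, n_iter = 5000, lower = 0, upper = 15) {
  best_theta <- runif(4, lower, upper)
  best_error <- error_fn(best_theta, x, y)
  for (i in seq_len(n_iter)) {
    candidate <- runif(4, lower, upper)
    candidate_error <- error_fn(candidate, x, y)
    if (candidate_error < best_error) {
      best_theta <- candidate
      best_error <- candidate_error
    }
  }
  list(theta = best_theta, error = best_error)
}

# Synthetic training set generated from a known theta (illustrative only).
set.seed(1)
true_theta <- c(2, 5, 1, 3)
x <- seq(-1, 1, length.out = 50)
y <- predict_y(true_theta, x) + rnorm(length(x), sd = 0.1)

fit <- random_search(x, y)
print(fit)
```

A genetic algorithm or simulated annealing would replace the uniform draws with guided proposals, but the error function being minimized stays the same.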

COMPUTATIONS BEGIN

This is the scatter plot of the training set:

Here is the implementation in R; you can download the file by clicking here.

Here is a result plot using the genetic algorithm:

Benjamin

Saturday, August 17, 2013

Lieutenant Dan You Got a New Interface..

After some days of thinking, I realized that this blog deserved a little more attention, so I decided to change the interface, and I am happy with how it looks now.

Hope to see you again on my personal blog.

Keep on programming!

Benjamin

Friday, August 16, 2013

Accuracy versus F score: Machine Learning for the RNA Polymerases

Hello, today I'm going to show you the difference between two common performance measures (useful not only in Machine Learning but in every scientific field). So far, I have seen accuracy values reported more often than F scores when measuring the performance of methods ranging from metaheuristics (Genetic Algorithm fitness functions) to promoter recognition programs, diagnostic methods and so on.

However, I would really recommend avoiding the accuracy measure. The reason is shown below with a nice example in the R programming language (all the functions used in the simulation are included; you can download them by clicking here).

Case study 1:

Imagine that you are in a Computer Vision project and your task is to "teach" a program to distinguish between electric guitars and acoustic guitars by showing it pictures of different guitars.

Suppose that you've already developed that program and now you want to measure the performance of this Boolean classifier (for example, you show the program a picture of an electric guitar, and the program has to decide whether to "classify" it as an electric or as an acoustic guitar).

For the purposes of this post, let's write down some useful concepts.

Consider the following:

TP: a true positive is when the program classifies an electric guitar as an electric guitar. We will use the letter "E" to denote the electric guitar "class".

FP: a false positive is when the program classifies an acoustic guitar as an electric guitar. We will use the letter "A" to denote the acoustic guitar "class".

FN: a false negative is when the program classifies an electric guitar as an acoustic guitar.

TN: a true negative is when the program classifies an acoustic guitar as an acoustic guitar.

Now that we are ready, we shall begin with the calculations.

In R, I have simulated the results of the program for 1,000 electric guitar pictures and 1,000 acoustic guitar pictures.

The program produced the following results:

        PREDICTED.E  PREDICTED.A
TRUE.E          485          515
TRUE.A            9          991

If you notice, of the 1,000 electric guitar pictures, only 485 were labeled as electric (TP=485); the rest were labeled as acoustic (FN=515). I feel bad for the hypothetical programmer of this hypothetical example.

On the other hand, of the 1,000 acoustic guitars, 991 were labeled as acoustic (TN=991) and only 9 of them were labeled as electric (FP=9). Well, not bad!..... or is it?

The accuracy of this program is 0.738.

To compute the F score, it is necessary to compute the precision and the recall first, where:

precision = 0.9817814 and recall = 0.485

Then, the F score is equal to 0.6492637
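The values above can be checked with a short R sketch. It uses the standard definitions (accuracy = (TP + TN) / total, precision = TP / (TP + FP), recall = TP / (TP + FN), F score = harmonic mean of precision and recall); the variable names are chosen for this illustration and are not taken from the downloadable functions.

```r
# Counts taken from the confusion matrix of case study 1.
TP <- 485  # electric guitars labeled as electric
FN <- 515  # electric guitars labeled as acoustic
FP <- 9    # acoustic guitars labeled as electric
TN <- 991  # acoustic guitars labeled as acoustic

accuracy  <- (TP + TN) / (TP + TN + FP + FN)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)

# F score (F1): harmonic mean of precision and recall.
f_score <- 2 * precision * recall / (precision + recall)

print(round(c(accuracy = accuracy, precision = precision,
              recall = recall, f_score = f_score), 7))
```

Note that TN never enters the precision, the recall or the F score; it only appears in the accuracy.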

Well, the F score seems to be more "strict", and in fact it is, compared with the accuracy measure. But this example is not very "cool". Let's move on to case study 2.

Case study 2:

Now we have 1,000 electric guitar pictures and 100,000 acoustic guitar pictures. The confusion matrix of the results is:

        PREDICTED.E  PREDICTED.A
TRUE.E          493          507
TRUE.A         1017        98983

If you notice, of the 1,000 electric guitar pictures, only 493 were labeled as electric (TP=493); the rest were labeled as acoustic (FN=507).

On the other hand, of the 100,000 acoustic guitars, 98983 were labeled as acoustic (TN=98983) and only 1017 of them were labeled as electric (FP=1017).

Now (cha cha chan!), the performance values are:

Accuracy: 0.9849109
Precision: 0.3264901
Recall: 0.493
F score: 0.3928287
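For readers who want to reproduce these numbers, here is a minimal sketch that works directly on the confusion matrix. The helper name `performance` is hypothetical (it is not one of the downloadable functions), and it assumes the layout used in this post: rows are the true classes, columns are the predicted classes, and the electric guitar "E" is the positive class.

```r
# Confusion matrix of case study 2 (rows: true class, columns: predicted class).
cm <- matrix(c(493, 507,
               1017, 98983),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("TRUE.E", "TRUE.A"),
                             c("PREDICTED.E", "PREDICTED.A")))

# Hypothetical helper: treats the first row and column as the positive class.
performance <- function(cm) {
  TP <- cm[1, 1]; FN <- cm[1, 2]
  FP <- cm[2, 1]; TN <- cm[2, 2]
  precision <- TP / (TP + FP)
  recall    <- TP / (TP + FN)
  c(accuracy  = (TP + TN) / sum(cm),
    precision = precision,
    recall    = recall,
    f_score   = 2 * precision * recall / (precision + recall))
}

perf <- performance(cm)
print(round(perf, 7))

# Share of the correct predictions that are true negatives.
tn_share <- cm[2, 2] / (cm[1, 1] + cm[2, 2])
print(tn_share)
```

The last value shows how much of the accuracy comes from the acoustic guitars alone when the classes are this unbalanced.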

Do you see it now? How is it possible that, while missing almost 50% of the electric guitar labels, the program reaches an accuracy of almost 0.99, despite having a precision and a recall not greater than 0.50? Accuracy counts the 98983 true negatives, which dominate the total, while the F score ignores them. So we have a winner, and it is the F score measure.

For references, visit the following pages:

http://en.wikipedia.org/wiki/Accuracy
http://en.wikipedia.org/wiki/F1_score
http://en.wikipedia.org/wiki/Precision_and_recall

Saturday, May 25, 2013

Make your LaTeX presentations using the Beamer class

Hello, I have been working with this tutorial. All you need to do is download the source files and take a look at the source code and the PDF file.

Download link here

Hope you find LaTeX as useful as I do.

Benjamin
