Monday, July 7, 2014

What are the names of the school principals in Mexico? If your name is Maria, this post will probably interest you. Trends and cool plots from Mexico's 2013 national education census

I will start this post with a disclaimer:

The main intention of this post is to show the distribution of school principal names in Mexico: for example, basic trends such as the most common nation-wide first name, as well as trends by state and region.

These trends in the data would answer questions such as:

1. Are the most common first names distributed equally among the states?
2. Do states sharing the same region also share the same "naming" behavior?

Additionally, this post includes cool wordclouds.

Finally, the last part of my disclaimer: I am really concerned about the privacy of the people involved. I am not in any sense promoting the exploitation of this personal data. If you decide to download the dataset, I ask you to study it and generate information that is beneficial. Do not join the Dark Side.

Benjamin

##################
# GETTING THE DATASET AND CODE
##################

The database is located here
The R code can be downloaded here
Additional data can be downloaded here

All the results were computed by exploring 202,118 schools across the 32 states of Mexico from the 2013 census.

##################
# EXPLORING THE DATA
# WITH WORDCLOUDS
##################

Here is the wordcloud of names (by name, I am referring to first names only). It can be concluded that MARIA is by far the most common first name of a school principal in all Mexican schools, followed by JOSE and then by JUAN.

The following wordcloud includes every word in the responsible_name column (this includes first name and last names). Now the plot shows that besides the common first name MARIA, the last names HERNANDEZ, MARTINEZ and GARCIA are also very common.
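For reference, here is a minimal sketch of how these wordclouds can be built with the "wordcloud" package. It assumes the census was read into a data frame called "census" with a "responsible_name" column (the object and column names are assumptions; the actual code is in the downloadable script):

library(wordcloud)
library(RColorBrewer)

# first names only: take the first word of each principal's full name
first.names <- toupper(sapply(strsplit(as.character(census$responsible_name), " "), `[`, 1))
freqs <- sort(table(first.names), decreasing=TRUE)
wordcloud(names(freqs), as.numeric(freqs), max.words=200, random.order=FALSE,
          colors=brewer.pal(8, "Dark2"))

# every word in the column (first and last names together)
all.words <- toupper(unlist(strsplit(as.character(census$responsible_name), " ")))
freqs.all <- sort(table(all.words), decreasing=TRUE)
wordcloud(names(freqs.all), as.numeric(freqs.all), max.words=200, random.order=FALSE,
          colors=brewer.pal(8, "Dark2"))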



##################
# EXPLORING THE FREQUENCY
# OF FIRST NAMES (TOP 30 | NATION-WIDE)
##################

Looking at this barplot, the name MARIA is by far the most common name among Mexican school principals, with a frequency of ~25,000. The next most popular name is JOSE, with a frequency of ~7,500.


Looking at the same data, adjusted to represent the percentage of each name within the pool of first names, MARIA occupies ~11% of the name pool.
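Both barplots can be produced with a minimal sketch like the following, reusing the first.names vector assumed in the wordcloud sketch above:

top30 <- sort(table(first.names), decreasing=TRUE)[1:30]

# absolute frequency of the top 30 first names
barplot(top30, las=2, cex.names=0.7,
        main="Top 30 first names of school principals (nation-wide)")

# same data as a percentage of the whole pool of first names
barplot(100 * top30 / length(first.names), las=2, cex.names=0.7,
        ylab="% of first-name pool")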


##################
# HEATMAPS OF THE DATA
##################

With this heatmap, my intention is to show the distribution of the top 20 most common first names across all the Mexican states.



It can be concluded that there is a small cluster of states that holds the largest number of principals named MARIA (but not so fast! Some states, for example Mexico and Distrito Federal, are very populous, so I will reduce this effect in the following plot). In summary, the message of this plot is the frequency distribution of the top 20 most frequent first names across the country.
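A minimal sketch of this heatmap, assuming the census data frame also has a "state" column (the column name is an assumption):

top20 <- names(sort(table(first.names), decreasing=TRUE))[1:20]
state.by.name <- table(census$state, first.names)[, top20]   # 32 states x 20 names
heatmap(as.matrix(state.by.name), scale="none", margins=c(8, 10))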

##################
# CLUSTERS OF THE DATA
##################

For me, a young data-science-padawan, this is my favorite analysis: "hunting down the trends".


The setup of the experiment is very simple: count the top 1,000 most frequent nation-wide names in each state to create a 32 x 1,000 matrix (32 states and the 1,000 most frequent nation-wide names).

With this matrix, normalize the values by dividing each row by its sum (this will minimize the effect of the populous states vs. the less populous ones while maintaining the proportion of name frequencies per state). Then I just computed a distance matrix and plotted it as a heatmap.
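A minimal sketch of this setup, reusing the assumed objects from the sketches above (the exact plotting options of the original may differ):

top1000 <- names(sort(table(first.names), decreasing=TRUE))[1:1000]
m <- as.matrix(table(census$state, first.names)[, top1000])   # 32 states x 1,000 names

# normalize each row by its sum to remove the effect of state population
m.norm <- sweep(m, 1, rowSums(m), "/")

# distance matrix between states, plotted as a heatmap (rows and columns are states)
d <- dist(m.norm)
heatmap(as.matrix(d), symm=TRUE, margins=c(8, 8))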

What I can conclude from this plot is that there are clusters of states that seem to follow a geographical pattern, grouping within their region. From this, it is likely that states sharing the same region also share "naming" trends due to cultural factors (like the cluster that includes Chihuahua, Sonora and Sinaloa). But this effect is not present in all the clusters.

All images can be downloaded in PDF format here, just don't do evil with them!

Plot 1 here
Plot 2 here
Plot 3 here
Plot 4 here
Plot 5 here
Plot 6 here

Benjamin





Wednesday, June 18, 2014

[SOLVED] Problems compiling Rcpp-dependent R packages in Crunchbang Linux 11 (Debian 7.5 (wheezy) 64-bit)

I had a very weird issue when I tried to compile (install) certain R packages like "wordcloud", "RSNNS" or "GOSemSim" (a Bioconductor package). The installation always ended with a compilation error.

At first, I tried to solve the problem by finding anyone who had faced the same issue, so I DuckDuckGo-ed it and Googled it, but at least for me, I did not find anyone.

In the end, and this is the core of my post, I solved the problem by removing or commenting out all my customizations in the /usr/lib/R/etc/Rprofile.site file (for example, loading libraries automatically).

In summary, using the default Rprofile.site worked for me.
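As a hypothetical example of the kind of customization I mean (not my actual file), this is the sort of block that had to be commented out:

## /usr/lib/R/etc/Rprofile.site
## hypothetical customization: attach extra packages automatically at startup
# local({
#   old <- getOption("defaultPackages")
#   options(defaultPackages = c(old, "ggplot2", "data.table"))
# })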



I wrote this post to make a record of the issue, so maybe it will help someone solve the same problem in the future.

Benjamin

Tuesday, February 25, 2014

Convert Ensembl, Unigene, Uniprot and RefSeq IDs to Symbol IDs in R using Bioconductor

Hello, I have programmed a function that converts different sources of IDs to Symbol IDs.

The input ID types allowed are (at the moment):  Ensembl, Unigene, Uniprot and RefSeq.

The code is available by clicking here

NOTE: The function depends on the Bioconductor package "org.Hs.eg.db" available here

For example, let's show 10 Ensembl IDs:

> id[1:10]
 [1] "ENSG00000121410" "ENSG00000175899" "ENSG00000256069" "ENSG00000171428"
 [5] "ENSG00000156006" "ENSG00000196136" "ENSG00000114771" "ENSG00000127837"
 [9] "ENSG00000129673" "ENSG00000090861"

And their Symbol IDs:

> res[1:10]
 [1] "A1BG"     "A2M"      "A2MP1"    "NAT1"     "NAT2"     "SERPINA3"
 [7] "AADAC"    "AAMP"     "AANAT"    "AARS"    

This is a running example of the function to convert Unigene IDs to Symbol IDs (for all the other ID types, just replace "unigene" with "ensembl", "refseq" or "uniprot"):

# USAGE EXAMPLE: UNIGENE
require(org.Hs.eg.db)
# table of Entrez gene <-> Unigene mappings
unigene <- toTable(org.Hs.egUNIGENE)
# extract 100 random Unigene IDs from the mapping table
id <- unigene[sample(1:nrow(unigene), 100), 2]
id.type <- "unigene"
res <- get.symbolIDs(id, id.type)
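The actual implementation is in the downloadable file linked above; as a rough idea, a converter of this kind can be sketched with the standard org.Hs.eg.db mappings (this is an assumed, simplified version, not the exact code):

# minimal sketch of a get.symbolIDs-style converter (assumed, simplified)
get.symbolIDs <- function(id, id.type) {
  require(org.Hs.eg.db)
  # pick the mapping from the input ID type to Entrez gene IDs
  map <- switch(id.type,
                ensembl = org.Hs.egENSEMBL2EG,
                unigene = org.Hs.egUNIGENE2EG,
                refseq  = org.Hs.egREFSEQ2EG,
                uniprot = revmap(org.Hs.egUNIPROT),
                stop("unsupported id.type"))
  eg <- unlist(mget(as.character(id), map, ifnotfound=NA))    # input ID -> Entrez
  unlist(mget(as.character(eg[!is.na(eg)]), org.Hs.egSYMBOL,
              ifnotfound=NA))                                 # Entrez -> Symbol
}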

Benjamin

Monday, February 10, 2014

Upgrade and update R 2.15 to R 3.0 in Debian Wheezy

Following the instructions from CRAN, you need to add the R backports to your sources list.

FIRST PART: ADD R BACKPORTS: 

First, open a Terminal and open the sources.list file:

$ gksudo gedit /etc/apt/sources.list

Then, add these lines at the bottom of the file (note: I use the Revolution Analytics Dallas, TX server, but this can easily be changed by taking a look here for the mirrors):

## R BACKPORTS FOR WHEEZY
deb http://cran.revolutionanalytics.com/bin/linux/debian wheezy-cran3/
#deb-src http://cran.revolutionanalytics.com/bin/linux/debian wheezy-cran3/

SECOND PART: RENAME THE R PACKAGES FOLDER:

There is a folder that R uses to store the packages you download; just rename it to the new version of R. For example, mine was "2.15", I renamed it to "3.0", and it was at this path:

Before:
/home/benjamin/R/x86_64-pc-linux-gnu-library/2.15
After:
/home/benjamin/R/x86_64-pc-linux-gnu-library/3.0

Remember that some packages also need to install files in folders that belong to root, so I would recommend opening R in sudo mode (only if you're sure about what you're doing :P) by executing R this way: "sudo R", and then typing in the R console:

update.packages(checkBuilt=TRUE, ask=FALSE)

THIRD PART: SECURE APT:

The Debian backports archives on CRAN are signed with the key ID 381BA480. To add it, type in a Terminal prompt:

gpg --keyserver pgp.mit.edu --recv-key 381BA480
gpg -a --export 381BA480 > jranke_cran.asc
sudo  apt-key add jranke_cran.asc


FOURTH PART: UPDATE AND UPGRADE R:

Save the file, and then you can either open Synaptic, update the package list and upgrade the packages, or type in a terminal:

sudo apt-get update
sudo apt-get upgrade

And that's all.
Benjamin

Saturday, October 19, 2013

Optimizing a multivariable function parameters using a random method, genetic algorithm and simulated annealing in R


Say that you are implementing a non-linear regression analysis, which is briefly described by Wikipedia as:

"In statistics, nonlinear regression is a form of regression analysis in which observational data are modeled by a function which is a nonlinear combination of the model parameters and depends on one or more independent variables."

For the training set, we have the following:


And the function to optimize the parameters is:



Which leads us to the following equality:


In other words, we want to optimize the value of theta in order to minimize the sum of the error between y and predicted.y:

Given theta (each parameter a0, ..., a3 has a range from 0 to 15):


And the error function:


finally, the goal function:


In other words, the goal function searches for the value of theta that minimizes the error.

COMPUTATIONS BEGIN

This is the scatter plot of the training set:


Here is the implementation in R; you can download the file by clicking here
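In case you just want the general idea, here is a minimal sketch of the setup. The training set and model function below are placeholders (the real ones are in the downloadable file), the error is assumed to be a sum of squared differences, and the genetic algorithm step is only indicated:

set.seed(1)

# hypothetical training set (placeholder; the real one is in the downloadable file)
x <- seq(0, 10, length.out=50)
y <- 3 + 2 * sin(1.5 * x) + rnorm(50, sd=0.3)

# placeholder model with parameters theta = c(a0, a1, a2, a3), each in [0, 15]
f <- function(x, theta) theta[1] + theta[2] * sin(theta[3] * x + theta[4])

# goal function: sum of the squared errors between y and predicted.y
goal <- function(theta) sum((y - f(x, theta))^2)

# 1) random method: sample many candidate thetas and keep the best one
candidates <- matrix(runif(4 * 5000, 0, 15), ncol=4)
best.random <- candidates[which.min(apply(candidates, 1, goal)), ]

# 2) simulated annealing via optim(method="SANN")
best.sann <- optim(par=runif(4, 0, 15), fn=goal, method="SANN")$par

# 3) a genetic algorithm can be run, e.g., with the GA package
#    (maximize the negative of goal(); see the downloadable file for details)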

Here is a result plot using the genetic algorithm:


Benjamin

Saturday, August 17, 2013

Lieutenant Dan You Got a New Interface..

After some days of thinking, I realized that this blog deserved a little bit more attention, so I decided to change the interface, and I'm happy with how it looks now.

Hope to see you again in my personal blog.

keep on programming!

Benjamin

Friday, August 16, 2013

Accuracy versus F score: Machine Learning for the RNA Polymerases

Hello, today I'm going to show you the difference between two common performance measures (useful not only for Machine Learning purposes, but in every scientific field). Until now, I have found accuracy values more often than F scores when measuring the performance of methods ranging from metaheuristics (Genetic Algorithm fitness functions) to promoter recognition programs, diagnostic methods and so on.

But I would really recommend avoiding the accuracy measure. The reason is shown below with a nice example in the R programming language (all the functions used in the simulation are included; you can download them by clicking here).

Case study 1:

Imagine that you are in a Computer Vision project and your task is to "teach" a program to distinguish between electric guitars and acoustic guitars by showing the program pictures of different guitars.

Suppose that you've already developed that program and now you want to measure the performance of this Boolean classifier (for example, you show the program a picture of an electric guitar, and the program has to decide whether to recognize and "classify" it as an electric or an acoustic guitar).

For the purposes of this post, let's write down some useful concepts.

Consider the following:

TP: a true positive is when the program classifies an electric guitar as an electric guitar; we will use the letter "E" to denote the electric guitar "class".

FP: a false positive is when the program classifies an acoustic guitar as an electric guitar; we will use the letter "A" to denote the acoustic guitar "class".

FN: a false negative is when the program classifies an electric guitar as an acoustic guitar.

TN: a true negative is when the program classifies an acoustic guitar as an acoustic guitar.

Now that we are ready, we shall begin with the calculations.

In R, I have simulated the results of the program, say, for 1,000 electric guitar pictures and 1,000 acoustic guitar pictures.

The program prompted the following results:

        PREDICTED.E PREDICTED.A
TRUE.E         485         515
TRUE.A           9         991

If you notice, from the 1,000 electric guitar pictures, only 485 were labeled as electric (TP=485); the rest were labeled as acoustic (FN=515). I feel bad for the hypothetical programmer of this hypothetical example.

On the other hand, from the 1,000 acoustic guitars, 991 were labeled as acoustic (TN=991) and only 9 of them were labeled as electric (FP=9). Well, not bad!... or is it?

The accuracy value of this program is = 0.738

And, to compute the F score, it is necessary to compute the precision and the recall first, where:

precision = 0.9817814 and recall = 0.485

Then, the F score is equal to 0.6492637
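For reference, here is a minimal sketch of these computations in R (the simulation functions themselves are in the downloadable file); the values reproduce case study 1:

TP <- 485; FN <- 515; FP <- 9; TN <- 991

accuracy  <- (TP + TN) / (TP + TN + FP + FN)                  # 0.738
precision <- TP / (TP + FP)                                   # 0.9817814
recall    <- TP / (TP + FN)                                   # 0.485
f.score   <- 2 * precision * recall / (precision + recall)    # 0.6492637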


Well, the F score seems to be more "strict", and in fact it is compared to the accuracy measure. But this example is not very "cool". Let's move on to case study 2.

Case study 2:

Now we have 1,000 electric guitar pictures and 100,000 acoustic guitar pictures; the confusion matrix of the results is:

        PREDICTED.E PREDICTED.A
TRUE.E         493         507
TRUE.A        1017       98983

If you notice, from the 1,000 electric guitar pictures, only 493 were labeled as electric (TP=493), the rest were labeled as acoustic (FN=507)

On the other hand, from the 100,000 acoustic guitars, 98983 were labeled as acoustic (TN=98983) and only 1017 of them were labeled as electric (FP=1017)

Now (cha cha chan!), the performance values are:

Accuracy: 0.9849109
Precision: 0.3264901
Recall: 0.493
F score: 0.3928287

Now do you see it? How is it possible that, while missing almost 50% of the electric guitar labels, the program's accuracy is almost 0.99, despite having a precision and recall no greater than 0.50? Then we have a winner, and it is the F score measure.

For references, visit the following pages:

http://en.wikipedia.org/wiki/Accuracy
http://en.wikipedia.org/wiki/F1_score
http://en.wikipedia.org/wiki/Precision_and_recall