The main intention of this post is to show how the first names of school principals are distributed in Mexico: basic trends such as the most common first name nationwide, as well as trends broken down by state and region.
These trends in the data would answer questions such as:
1. Are the most common first names distributed equally among the states?
2. Do states in the same region also share the same "naming" behavior?
Additionally, this post includes cool wordclouds.
Finally, the last part of my disclaimer: I am really concerned about the privacy of the people involved. I am not in any sense promoting the exploitation of this personal data. If you decide to download the dataset, I ask you to study it and to generate information that is beneficial. Do not join the Dark side.
Benjamin
##################
# GETTING THE DATASET AND CODE
##################
The database is located here
The R code can be downloaded here
Additional data can be downloaded here
All the results were computed by exploring 202,118 schools across the 32 states of Mexico, using data from the 2013 census.
##################
# EXPLORING THE DATA
# WITH WORDCLOUDS
##################
Here is the wordcloud of names (by name, I mean first name only). It shows that MARIA is by far the most common first name of a school principal across all Mexican schools, followed by JOSE and then JUAN.
The following wordcloud includes every word in the responsible_name column (first names and last names). This plot shows that, besides the common first name MARIA, the last names HERNANDEZ, MARTINEZ and GARCIA are also very common.
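As a rough illustration of the counting behind the two wordclouds, here is a minimal Python sketch (not the R code linked above). It assumes each responsible_name value is a whitespace-separated string whose first token is the first name; the sample rows and the helper names `first_name_counts` and `all_word_counts` are made up for this example.

```python
from collections import Counter

# Hypothetical rows standing in for the responsible_name column;
# these names are invented for illustration, not taken from the dataset.
responsible_names = [
    "MARIA HERNANDEZ GARCIA",
    "JOSE MARTINEZ LOPEZ",
    "MARIA GARCIA MARTINEZ",
    "JUAN HERNANDEZ PEREZ",
]

def first_name_counts(names):
    """Count only the first whitespace-separated token of each name."""
    return Counter(name.split()[0] for name in names if name.split())

def all_word_counts(names):
    """Count every token, first names and last names together."""
    return Counter(token for name in names for token in name.split())

first_counts = first_name_counts(responsible_names)
word_counts = all_word_counts(responsible_names)

print(first_counts.most_common(3))
print(word_counts.most_common(3))
```

Either counter can then be handed to a wordcloud tool, which typically scales each word by its count. Note that taking only the first token treats compound first names (for example, a name starting with MARIA followed by a second given name) as their first word.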
##################
# EXPLORING THE FREQUENCY
# OF FIRST NAMES (TOP 30 | NATION-WIDE)
##################
In this barplot, MARIA is by far the most common first name among Mexican school principals, with a frequency of ~25,000. The next most popular name is JOSE, with a frequency of ~7,500.
Looking at the same data, adjusted to show each name as a percentage of the pool of first names, MARIA accounts for ~11% of the pool.
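The step from raw frequencies to percentages can be sketched as follows. This is a minimal Python example under the assumption that the percentage is each name's count divided by the total count over all first names; the toy counts and the helper `name_shares` are hypothetical and do not reproduce the real figures.

```python
from collections import Counter

def name_shares(counts, top_n=30):
    """Return (name, count, percent of pool) for the top_n most common names."""
    total = sum(counts.values())
    if total == 0:
        return []
    return [
        (name, n, 100.0 * n / total)
        for name, n in counts.most_common(top_n)
    ]

# Toy counts, invented for illustration only.
toy_counts = Counter({"MARIA": 5, "JOSE": 3, "JUAN": 2})

shares = name_shares(toy_counts, top_n=3)
for name, n, pct in shares:
    print(f"{name:<8}{n:>5}{pct:>7.1f}%")
```

The denominator is the whole pool of first names, not just the top 30 shown in the barplot, so the plotted percentages do not need to add up to 100.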
##################
# HEATMAPS OF THE DATA
##################
# CLUSTERS OF THE DATA
##################
With this matrix, I normalized the values by dividing each row by its sum (this minimizes the effect of populous states versus less populous ones while preserving the proportions of the name frequencies within each state). Then I computed a distance matrix and plotted it as a heatmap.
What I can conclude from this plot is that some clusters of states follow geography: states in the same region tend to be clustered together, which suggests that they are more likely to share "naming" trends, possibly due to cultural factors (for example, the cluster that includes Chihuahua, Sonora and Sinaloa). However, this effect is not present in all the clusters.
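To make the normalization concrete, here is a small Python sketch of the two steps. It assumes rows are states and columns are first-name counts, and it uses Euclidean distance, which is an assumption on my part since the text does not say which metric was used. The state labels, counts and helper names are invented for illustration.

```python
import math

def normalize_rows(matrix):
    """Divide each row by its sum so that every non-empty row adds up to 1."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized

def distance_matrix(rows):
    """Pairwise Euclidean distances between rows (assumed metric)."""
    return [[math.dist(a, b) for b in rows] for a in rows]

# Hypothetical states (rows) and counts for three first names (columns).
states = ["STATE_A", "STATE_B", "STATE_C"]
counts = [
    [200, 100, 100],  # large state
    [20, 10, 10],     # small state, same proportions as STATE_A
    [10, 20, 10],     # small state, different proportions
]

profiles = normalize_rows(counts)
distances = distance_matrix(profiles)

for state, row in zip(states, distances):
    print(state, [round(d, 3) for d in row])
```

In this toy example STATE_A and STATE_B end up at distance zero even though one has ten times as many principals, which is the point of the normalization: only the naming proportions drive the distances that the heatmap and the clustering are built on.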
All images can be downloaded in PDF format here, just don't do evil with them!
Plot 1 here
Plot 2 here
Plot 3 here
Plot 4 here
Plot 5 here
Plot 6 here
Benjamin