Using Census Data to Help Pick your Child’s Name

Hello World!

So I had a life-changing event this past Sunday, 5/24/2015, at 8:55am.  My first child was born!  Both child and wife are healthy and happy.  Everything is good in life.  Like many couples though, my wife and I struggled to find the right name for our child.  We didn't want something too common, or an old person's name, or something so rare and funky that nobody could spell it.  We also realized we just had a general lack of knowledge about what names were out there.  So after much debate and discussion over what to name her, I started doing a bit of analysis using some census data.  I want to thank Jamie Dixon for providing the data that he found for use in his Dinner Nerds article.  The data itself can be found here.  This article will discuss the code used to go through all of the data and provide insights into child names.

About the Data

The data is Census Data provided by each state in the United States; it goes back as far as each state is able to provide data (the early 1900s, I believe, is the oldest).  Any name given fewer than 5 times during a year is not shown.  This applies per state, therefore a single state must have at least 5 children given that specific name within the same year.  Each row provides: Name, Gender, Count, Year.  The file name indicates the state represented.

About the Code

This is simply a data exploration exercise, and therefore we only used .fsx files; no compiled code was used.  Everything was done within the context of the interactive window.  There are some packages that you will need to install and reference:

  1. F# Data
  2. FSharp.Charting
  3. FSharp.Collections.ParallelSeq (optional)

[Image: NuGet packages]

Bringing in External References

#r @"C:\projects\NamesAnalysis\NamesAnalysisSolution\packages\FSharp.Data.2.2.2\lib\net40\FSharp.Data.dll"
#r @"C:\projects\NamesAnalysis\NamesAnalysisSolution\packages\FSharp.Charting.0.90.10\lib\net40\FSharp.Charting.dll"
#r @"C:\projects\NamesAnalysis\NamesAnalysisSolution\packages\FSharp.Collections.ParallelSeq.1.0.2\lib\net40\FSharp.Collections.ParallelSeq.dll"
#r @"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.5\System.Windows.Forms.DataVisualization.dll"

open FSharp.Data
open FSharp.Collections.ParallelSeq
open FSharp.Charting
open System
open System.Windows.Forms.DataVisualization

You will need to bring in all of the references that we need.  #r gives the location of a reference.  The @ symbol in front of the string makes it a verbatim string, so you do not have to worry about escape characters (\n is one example; without the @, the \net40 in these paths would cause issues).  We then open each of the libraries we will need.  DataVisualization shows up as not being used, but it contains extensions that Charting uses.
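To make the escape-character point concrete, here is a minimal sketch.  The path fragment is shortened and made up for illustration; it is not one of the real reference paths.

```fsharp
// In a regular string, "\n" is parsed as a single newline character.
let regular = "C:\net40"

// In a verbatim string (prefixed with @), the backslash is kept as-is.
let verbatim = @"C:\net40"

printfn "%d" regular.Length   // 7: 'C', ':', newline, 'e', 't', '4', '0'
printfn "%d" verbatim.Length  // 8: the backslash and the 'n' are both kept
```

The regular string silently loses the backslash, which is exactly the kind of thing that makes a #r path fail to resolve.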

Using a Type Provider

F# type providers are amazing!  We are going to use the CSV type provider.  Notice that I modified the AK.TXT file to have an additional first row (a header), which provides the typing I want.  If you do not do this, the first data row is treated as the header: you will get an actual name, year and count as your IntelliSense, and that name will be stricken from that particular dataset.  The [<Literal>] attribute marks the string as a compile-time constant, which the type provider needs in order to infer, from the sample file, a format that all of the files are going to follow.  This provides for ease of programming and loading additional similar datasets (the states array below has 51 entries: the 50 states plus DC).

[<Literal>]
let namesByStateSample = @"C:\Users\dacrook\Downloads\datasets\namesbystate\AK.TXT"
let baseString = @"C:\Users\dacrook\Downloads\datasets\namesbystate\"
let states = [| "AK" ; "AL"; "AR" ; "AZ" ; "CA" ; "CO" ; "CT" ; "DC" ; "DE" ; "FL" ; "GA" ; "HI";
                    "IA";"ID";"IL";"IN";"KS";"KY";"LA";"MA";"MD";"ME";"MI";"MN";"MO";"MS";"MT";
                    "NC";"ND";"NE";"NH";"NJ";"NM";"NV";"NY";"OH";"OK";"OR";"PA";"RI";"SC";"SD";
                    "TN";"TX";"UT";"VA";"VT";"WA";"WI";"WV";"WY"|]
type nameData = CsvProvider<namesByStateSample>

Removing Popular and Obscure Names

So the first thing we wanted for our daughter was a name that is not within the realm of names you hear every day, but also not so obscure that you have never heard of it.  So how do we answer this particular question?  Well, if you recall from statistics, there is a measure called a percentile: the value below which a given percentage of observations fall.  There are multiple ways to compute percentiles; based on the type of data we have here, I decided that a form of weighted percentile would work best.  What we will do is take our list of all names within a particular time period, sort them from most popular to most rare, determine each name's percentage share of the dataset, accumulate those shares down the sorted list, and remove the names that fall in the top and bottom quartiles (25% each).  This will ensure that the list of names we have available is manageable to go through and neither highly common nor very obscure.  So what does the code look like for this?
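Before the real code, here is a small worked sketch of that idea on made-up numbers.  The names and counts are invented for illustration, and plain List functions are used instead of PSeq so it runs without any packages.

```fsharp
// Hypothetical counts, invented purely for illustration.
let counts = [ "Mary", 24; "Emma", 22; "Alice", 20; "Ruth", 18; "Zelda", 16 ]

let total = counts |> List.sumBy snd |> float

// Sort from most to least popular, then accumulate each name's share.
let cumulative =
    counts
    |> List.sortByDescending snd
    |> List.map (fun (name, count) -> name, float count / total)
    |> List.scan (fun (_, acc) (name, share) -> name, acc + share) ("", 0.0)
    |> List.tail   // drop the ("", 0.0) seed that scan emits first

// Keep names whose cumulative share lands strictly between the bounds.
let kept =
    cumulative
    |> List.filter (fun (_, p) -> p > 0.25 && p < 0.75)
    |> List.map fst

printfn "%A" kept   // ["Emma"; "Alice"]
```

Mary ends at a cumulative 0.24 and is dropped as too popular, while Ruth (0.84) and Zelda (1.0) are dropped as too rare.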

Load and Aggregate Data

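let getData() = 
    states
    // Load every state file using the schema inferred from the sample file.
    |> PSeq.map(fun f -> nameData.Load(baseString + f + ".txt"))
    // Flatten the rows of all files into a single data set.
    |> PSeq.collect(fun f -> f.Rows)
    // Keep only the rows whose gender is not male.
    |> PSeq.filter(fun f -> f.Gender <> "M")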

We first need to load all of the data from the .txt files, aggregate all of the data into one data set and filter out all occurrences of the Gender M for male, as we are having a baby girl (this reduces the data set dramatically).
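If the collect-then-filter shape is unfamiliar, this is a minimal sketch of the same pipeline on in-memory data.  The two "files" and their rows are hypothetical, and Seq is used in place of PSeq so nothing needs to be installed.

```fsharp
// Two hypothetical "files", each a list of (name, gender) rows.
let files =
    [ [ "Mary", "F"; "John", "M" ]
      [ "Alice", "F"; "James", "M" ] ]

let girls =
    files
    |> Seq.collect id                                 // flatten every file's rows into one sequence
    |> Seq.filter (fun (_, gender) -> gender <> "M")  // drop the male rows
    |> Seq.map fst
    |> Seq.toList

printfn "%A" girls   // ["Mary"; "Alice"]
```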

Occurrences counted the way WE need it.

let nameTrends(data:pseq<nameData.Row>) = 
    data
    |> PSeq.groupBy(fun f -> f.Name)
    |> PSeq.map(fun (key, dataPoints) -> 
                    let yearSort = 
                        dataPoints 
                        |> Seq.groupBy(fun f -> f.Year)
                        |> Seq.map(fun (k,d) -> 
                                    let count = d |> Seq.fold(fun acc d -> acc + d.Count) 0
                                    k , count)
                        |> Seq.sort
                    key , yearSort)

The incoming data is just every single row of every single file we have, which means we potentially have 50 occurrences of "Mary" in the year 1990 (one per state file).  We need to group all of the Marys together, and not only that, but also accumulate all of the data points we have for Mary within a single year into a single count for Mary in that year.
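Here is a minimal sketch of that two-level grouping on a handful of made-up rows.  The (name, year, count) tuples stand in for the type provider's rows and are an assumption for illustration only.

```fsharp
// Hypothetical rows (name, year, count), as if read from several state files.
let rows =
    [ "Mary", 1990, 12
      "Mary", 1990, 30
      "Mary", 1991, 8
      "Alice", 1990, 5 ]

let trends =
    rows
    |> Seq.groupBy (fun (name, _, _) -> name)
    |> Seq.map (fun (name, points) ->
        // Collapse the per-state rows into one total per year.
        let byYear =
            points
            |> Seq.groupBy (fun (_, year, _) -> year)
            |> Seq.map (fun (year, ps) -> year, ps |> Seq.sumBy (fun (_, _, c) -> c))
            |> Seq.sort
            |> Seq.toList
        name, byYear)
    |> Seq.toList

printfn "%A" trends
```

The two 1990 rows for Mary collapse into a single (1990, 42) entry, which is the shape the later steps expect.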

Filter to a Particular Time Period

let TotalDataDuringPeriod(yearBegin:int, yearEnd:int, data:pseq<string * seq<int * int>>) =
    data 
    |> PSeq.toArray
    |> PSeq.map(fun (key, data) -> 
                let c = data 
                        // Keep only the years inside the inclusive range [yearBegin, yearEnd].
                        |> Seq.filter(fun f -> (fst f) >= yearBegin && (fst f) <= yearEnd)
                        |> Seq.fold(fun acc d -> acc + (snd d)) 0
                key, c)

We then need a total count for each name, giving us a name,value pair, but we may not want that total over all time, so we allow for a filter on the time period.  The time period could span the entire dataset (1900-2013), or just a more recent section of it.  The result will be, for example, Mary,4678, which would mean 4678 children were named Mary during the provided time period.
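As a quick sanity check of the intended behavior, here is a minimal sketch on made-up numbers.  The helper name totalInPeriod and the yearly counts are hypothetical, and I am assuming both endpoints of the period are inclusive.

```fsharp
// Hypothetical (year, count) pairs for a single name; values are invented.
let maryByYear = [ 1988, 100; 1990, 40; 2000, 25; 2013, 10; 2014, 7 ]

// Sum only the years inside the inclusive range [yearBegin, yearEnd].
let totalInPeriod yearBegin yearEnd points =
    points
    |> Seq.filter (fun (year, _) -> year >= yearBegin && year <= yearEnd)
    |> Seq.sumBy snd

let maryTotal = totalInPeriod 1990 2013 maryByYear

printfn "%d" maryTotal   // 75 = 40 + 25 + 10
```

The 1988 and 2014 rows fall outside the period and do not contribute to the total.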

This is the data you were looking for

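let DataBetweenPercentile(bottom:float, top:float, data:pseq<string * int>) =
    // Total number of children across every name in the period.
    let totalCount = data |> PSeq.fold(fun acc d -> acc + (snd d)) 0
    data
    // Each name's share of the total.
    |> PSeq.map(fun (name, count) -> name, float count / float totalCount)
    // Most popular first.
    |> Seq.sortBy(fun (_, perc) -> -perc)
    // Running sum of the shares; the ("", 0.0) seed is removed by the filter whenever bottom >= 0.
    |> Seq.scan (fun (_,acc) (name, perc) -> name, acc + perc) ("", 0.0)
    |> Seq.filter(fun (_,p) -> p > bottom && p < top)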

This function takes in the data from the time period, calculates each name's percent representation in the total data set, sorts the data, determines the cumulative percentile at which each name sits, and then removes all names above and below the acceptable range.

I want to make note of the scan higher-order function (special thanks to Mathias Brandewinder for showing me this cool function).  It is like Seq.fold, but it returns the intermediate results as well.  So with a sorted list, we are able to start at the 0 percentile, add the next observation's percent representation, return that as a data point, and then move to the next one.  This means the result of scan is each data point paired with the accumulated representation of everything at or above it in the list.  SWEET!
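A tiny side-by-side makes the difference clear.  The whole-number "percents" below are made up for illustration so the running totals are easy to read.

```fsharp
// Hypothetical shares, written as whole-number percents.
let percents = [ 40; 30; 20; 10 ]

// fold returns only the final accumulated value.
let finalTotal = percents |> Seq.fold (+) 0

// scan returns every intermediate value, starting with the seed.
let runningTotals = percents |> Seq.scan (+) 0 |> Seq.toList

printfn "%d" finalTotal      // 100
printfn "%A" runningTotals   // [0; 40; 70; 90; 100]
```

Note that scan emits the seed as its first element, so the result is one item longer than the input.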

Putting it all together

let data = getData() |> nameTrends 
let timePeriodData = TotalDataDuringPeriod(1990, 2013, data)
DataBetweenPercentile(0.25, 0.75, timePeriodData) 
    |> Seq.iter(fun f -> Console.WriteLine(f))

So in this example, we take just the representation of names between 1990 and 2013 and remove the bottom and top quartiles.  This was super cool, but still WAY too many names, so let's narrow it a bit more: gear more toward rare names, cut a few more of the obscure ones, and keep less than 50 but greater than 30 (that is, bounds of 0.30 and 0.50).  This gives us a good 20% of the occurrences to work with, so that's a start.  We then picked a few names from that list based on what we knew about the names and a general "we like these": Alice and Evelyn.  Of course we did a few more iterations of this with various values to add in Margeaux, Madalyn, Ariel, Beatrice and Florence.

Finally we have some good names that are uncommon, that we like, and that have some meaning behind them, plus I feel good because I did some data crunching to get this far.  So I thought, let's take this a step further.  There are always things about names we don't necessarily know, so let's do some plots and trending with the names over time to see if there is anything interesting…

Charting The Data

This is pretty easy, we already have the data set in a format where we can do this, we just need to filter down to individual names, so the code is just a filtration…

// Named arielData so the value does not shadow the nameData type defined earlier.
let arielData = data |> PSeq.find(fun f -> fst f = "Ariel")

Next we just need to chart it…

let plotPoints = snd arielData
let plotName = fst arielData
let chart = Chart.Line(plotPoints, plotName)
chart.ShowChart()

Super easy!  So what can we discover…We start with Ariel…

[Image: chart of the name Ariel over time]

Whoa… What in the world happened here?  Ariel just jumped to popularity right around 1980.  What is 1980 and Ariel?  Bing search… go over to images, as I'm not looking for an exact date…

[Image: The Little Mermaid]

I should have known!  Well, I mean, it's not that bad; everybody will think of The Little Mermaid.  Hovering over the spike, it is right on 1989, the year The Little Mermaid was released.  That's pretty darned cool!

So we did a few more charts…

[Image: more name charts]

Florence, Beatrice and Evelyn are all great names.  Looking at the data, all of these are names primarily given to the older generations, which is not too big of a deal, but the popularity of all of them appears to have dropped sharply by 1929.  These were all highly popular names during the Roaring 20s but seem to drop off right at the Wall Street crash and the start of the Great Depression.  Doing some quick reading, Florence was actually the name of the woman who was the subject of many famous photos, including Migrant Mother.  As we can see from the chart, Florence never recovered as a name after those pictures.  Beatrice appears to have never recovered either, though quick searches do not turn much up.

[Image: chart of the name Alice over time]

Enter Alice.  Popular in the early 1900s, it slowly drops off until 2010, when it spikes again, which aligns with the release of the 2010 version of the film Alice in Wonderland.  I actually really liked that film and like the whole premise of the movie and book.  With its low popularity, sitting at the 37.9 percentile for 1990-2013, this is a great name!  Of course, we wait until she is born to see her and make sure she really looks like an Alice.

The Final Test

So our daughter was born on 5/24/2015 and we finally got to meet her.  She has beautiful blue eyes and a little nose, and she was just looking around the room in wonder the whole time, not crying a bit.  This is in fact an Alice.  We decided to go with a very rare name, Margeaux, as the middle name, and thus using these tools we were able to find names more to our liking, ending with Alice Margeaux Crook.

Summary

So if you too are struggling with all of the websites that provide crazy names, you have difficulty knowing what you want to name your child, or you want something somewhat uncommon but not too uncommon, this is a way you can reveal some very interesting statistics.  For example, Mary was only the most popular name prior to 1960 and then became fairly rare, and the most popular names post 2005 are Isabella, Emma, Sophia, Olivia, Ava, Madison, Abigail and Mia.  I would not have thought any of those would be the most popular names in that range; in fact, over 157,000 children were named Isabella between 2005 and 2013.  I would have never guessed.  It is always best to get to the source and make your own decisions off of real data.  I hope this has demonstrated some of the usefulness of F#, charting, data exploration and a little statistics, and what they can do for you.
