How is babby named

The Social Security Administration’s baby names database got some notice recently as the source of this visualisation by Reuben Fischer-Baum.

There are a number of different visualisations of this dataset, but they all deal with the most popular names.

What if you wanted to dig into the torrid minutiae of the least popular names in America? And what if you wanted to do it with Splunk?

Well, it would be mildly annoying because Splunk has some limitations that make it difficult to work with large historical datasets, but it’s not too hard to make it work:

The Epoch

For Splunk, time begins at the Unix epoch – January 1, 1970. Since our dataset starts in 1880, _time has no meaning here, and we’ll have to use another field to hold the dates for our events.
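To make the "another field" idea concrete, here is a rough sketch of carrying the year as an ordinary column rather than as a timestamp. It assumes the per-year national files are named like yob1880.txt and hold Name,Sex,Count rows; the function name rows_with_year and the idea of pre-processing the files in Python are illustrative assumptions, not necessarily how the babbynames app does it.

```python
import csv
import re
from pathlib import Path


def rows_with_year(path):
    """Yield (year, name, sex, count) tuples for one per-year file.

    Assumes the file name contains the four-digit year (e.g. yob1880.txt)
    and each row is Name,Sex,Count. Both are assumptions about the input.
    """
    match = re.search(r"(\d{4})", Path(path).name)
    if match is None:
        raise ValueError(f"no four-digit year in file name: {path}")
    year = int(match.group(1))
    with open(path, newline="") as handle:
        for name, sex, count in csv.reader(handle):
            # The year travels with the row as a plain field, not as _time.
            yield year, name, sex, int(count)
```

With the year stored as a regular field, searches can filter and chart on it directly, and _time is left to be whatever Splunk decides.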

The Limit

Since we can’t set _time how we’d like to, Splunk defaults _time to the mod time of each file. You’ve just downloaded and unzipped the files from the SSA, so they all have the same mod time, and Splunk will attempt to set the same timestamp on every event. Unfortunately, Splunk has trouble when too many events share the same timestamp. YMMV – when I first built the babbynames app with version 4.3, the error I received was “The search failed. More than 3125000 events found at time 1284355500”. As of 2012, the dataset has more than 7 million events.

A Workaround

One way to get Splunk to consume this dataset in a usable form is to set the mod time of each file to something different, so that no single timestamp ends up with too many events.
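A minimal sketch of that workaround, assuming the unzipped files sit in one directory and match a *.txt pattern (adjust the pattern if your files are named differently, e.g. upper-case extensions). The function name stagger_mtimes and the one-day step are arbitrary choices for illustration; any spacing that gives every file its own timestamp should do.

```python
import os
import time
from pathlib import Path


def stagger_mtimes(directory, pattern="*.txt", step_seconds=86400):
    """Give each matching file a distinct modification time.

    Files are sorted by name; the first is stamped with the current time
    and each later file is pushed one more step into the past.
    """
    now = int(time.time())
    files = sorted(Path(directory).glob(pattern))
    for i, path in enumerate(files):
        mtime = now - i * step_seconds
        # os.utime takes (access time, modification time) in epoch seconds.
        os.utime(path, (mtime, mtime))
    return files
```

Run it on the data directory before pointing Splunk at the files, for example stagger_mtimes("names"), where "names" is a placeholder for wherever you unzipped the download. No single file comes close to the 3,125,000-event limit quoted in the error above, so one timestamp per file keeps every timestamp under it.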

Please name your baby after a Splunk release!

Now we can glean some truly astonishing insights from this dataset:

Ace is pretty big in Hawaii:
Ace results from babbynames

5 parents in 1927 named their daughters Bubbles:
Bubbles results from babbynames

Roll your own

A sample app is available at
If you don’t want to go through the hassle of indexing the data, drop me a line and I’ll give you mine.

Russell Uman