Course preview

GEOG 30323, Data Analysis & Visualization

Fall 2015

Professor: Dr. Kyle Walker

Chances are, you’ve heard some of the hype around “big data” in the media and in industry, which is often represented in images like this:


In reality, data problems are not always so mysterious. Organizations and businesses across industries and fields collect data of many different types and sizes, from small to “big.” In turn, data literacy – the ability to engage with, extract meaning from, and communicate with data – is highly sought after on the job market.

This new course at TCU is designed to help you become data literate. It focuses on the following areas:

  • Obtaining data from a variety of sources, including databases and the Web;
  • Cleaning messy datasets, and converting data between different formats and types;
  • Using exploratory data analysis to summarize and generate insights from your data;
  • Visualizing your data to communicate insights with a larger audience.

To accomplish this, you’ll learn the basics of the Python programming language, which is one of the most popular languages for working with data. And if you don’t know how to code already? Don’t worry: Python is one of the best programming languages for beginners, and this course is open to everyone, regardless of technical background or field of study.

So what types of things will you learn how to do in this class? Let’s start with a couple of example questions:

  • What are the most popular female baby names in Texas?
  • How has this changed over time?

We’ll be using a raw dataset from the Social Security Administration, available at http://www.ssa.gov/oact/babynames/limits.html, that tracks baby name counts by state back to 1910. Names with a frequency less than 5 in a given year are suppressed.

With a little bit of work, we can determine the top female names in 2013, as well as their rates per 1000 names in the data file:

We see that Sophia ranked #1 in 2013, representing nearly 15 per 1000 entries in the dataset. Emma, Isabella, Mia, and Olivia round out the top 5. However, we are not just interested in data from last year; we would also like to know how this has changed over time. To do this, we’ll use data visualization, which can be a more effective data exploration and communication tool than a table of numbers.
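The ranking and rate calculation described above could be sketched in plain Python roughly as follows. This is a minimal illustration, not the course's actual code: it assumes each row of the state file has the layout state, sex, year, name, count, and it assumes the rate is a name's count per 1000 recorded births of that sex in that year. The sample rows and the `top_names` helper are made up for the example.

```python
import csv
import io
from collections import Counter

# Tiny made-up sample in the assumed "state,sex,year,name,count" layout.
# The counts are illustrative only, not real SSA figures.
SAMPLE = """\
TX,F,2013,Sophia,30
TX,F,2013,Emma,20
TX,F,2013,Mia,10
TX,M,2013,Noah,40
TX,F,2012,Sophia,25
"""

def top_names(lines, sex, year, n=5):
    """Return [(name, count, rate_per_1000)] for the n most common names."""
    counts = Counter()
    for state, row_sex, row_year, name, count in csv.reader(lines):
        # Keep only the rows for the requested sex and year.
        if row_sex == sex and int(row_year) == year:
            counts[name] += int(count)
    total = sum(counts.values())
    return [(name, c, 1000 * c / total) for name, c in counts.most_common(n)]

for name, count, rate in top_names(io.StringIO(SAMPLE), "F", 2013):
    print(f"{name:<8}{count:>6}{rate:>8.1f}")
```

With the real file, an open file handle would take the place of the `io.StringIO` sample.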

The following visualization is a heatmap, which visualizes the data in a grid, in which darker cells represent greater rates. The heatmap shows the top 25 female names for 2013 in Texas, and how their rates have shifted since 1990.

The heatmap shows us that some of the top names of 2013, like Sophia, Emma, and Isabella, were not nearly as popular in 1990; instead, names like Elizabeth and Samantha were much more prevalent. We also see notable growth in the rates of Sofia and Camila, reflecting the growing Hispanic population in the state.
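Behind a heatmap like this one sits a grid of rates with one row per name and one column per year. The sketch below shows one way such a grid might be built and then drawn crudely as text, with denser characters standing in for darker cells. The records, the `rate_grid` and `render` helpers, and the shading characters are all hypothetical; a real chart would hand the grid to a plotting library instead.

```python
from collections import defaultdict

# Made-up (year, name, count) records for one sex in one state; not real SSA figures.
RECORDS = [
    (1990, "Elizabeth", 60), (1990, "Sophia", 5), (1990, "Emma", 35),
    (2013, "Elizabeth", 20), (2013, "Sophia", 50), (2013, "Emma", 30),
]

def rate_grid(records, names):
    """Map each name to {year: rate per 1000 of that year's total count}."""
    totals = defaultdict(int)
    for year, _, count in records:
        totals[year] += count
    grid = {name: {} for name in names}
    for year, name, count in records:
        if name in grid:
            grid[name][year] = 1000 * count / totals[year]
    return grid

def render(grid, years, shades=" .:*#"):
    """Draw one text row per name; denser characters mean higher rates."""
    peak = max(rate for row in grid.values() for rate in row.values())
    lines = []
    for name, row in grid.items():
        cells = "".join(
            shades[round(row.get(y, 0) / peak * (len(shades) - 1))] for y in years
        )
        lines.append(f"{name:<10}|{cells}|")
    return lines

grid = rate_grid(RECORDS, ["Sophia", "Emma", "Elizabeth"])
print("\n".join(render(grid, [1990, 2013])))
```

Scaling every cell against the single highest rate in the grid keeps the shading comparable across names as well as across years.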

Now let’s go even further back - all the way to 1910 - to get a long-term view of these trends. In this instance, we’ll use an interactive line chart to compare four names: Mary, Gertrude, Camila, and Sophia. Move your cursor over any line to get its value; click and drag to zoom in on any area; and click any data series in the legend to turn it on or off.

We observe from the chart how Mary dominated among female baby names in the early part of the 20th century; in fact, in the 1910s, there were nearly twice as many Marys as the second-most popular name. However, the popularity of Mary started to wane after World War II. Now, click “Mary” in the legend to turn off that data series, then double-click the chart to resize it. This allows us to see clearly how Gertrude has not been popular since the 1920s, and to observe the recent spike in baby Sophias and Camilas in Texas.
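The two ideas in this paragraph, a per-name series over time and a comparison of the leading name with the runner-up, can be sketched as below. Everything here is illustrative: the records (including the runner-up names) are made up rather than taken from the SSA data, and `name_series` and `lead_ratio` are hypothetical helpers.

```python
from collections import defaultdict

# Made-up (year, name, count) records; not real SSA figures.
RECORDS = [
    (1910, "Mary", 200), (1910, "Ruby", 110), (1910, "Gertrude", 40),
    (1950, "Mary", 150), (1950, "Linda", 140), (1950, "Gertrude", 5),
]

def name_series(records, names):
    """Return {name: [(year, count), ...]} sorted by year, ready to plot as lines."""
    series = defaultdict(list)
    for year, name, count in sorted(records):
        if name in names:
            series[name].append((year, count))
    return dict(series)

def lead_ratio(records, year):
    """Ratio of the most common name's count to the runner-up's in one year."""
    counts = sorted((c for y, _, c in records if y == year), reverse=True)
    return counts[0] / counts[1]

series = name_series(RECORDS, {"Mary", "Gertrude"})
for name, points in series.items():
    print(name, points)
print(f"1910 lead ratio: {lead_ratio(RECORDS, 1910):.2f}")
```

Each list in `series` corresponds to one line on the chart, and a lead ratio close to 2 is what "nearly twice as many" would look like numerically.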

Using the tools of exploratory data analysis, we’ve discovered some trends in our data! If this interests you, check out the new class; you’ll learn how to do all this and much more. I’m happy to answer questions as well at kyle.walker@tcu.edu.