Fabio De Sousa



(published in April 2016)

In the spring of my junior year, with two semesters left until graduation, I decided to change my major from Sustainable Development to Sociology. The biggest factor in that decision was an idea for a thesis I felt would be best executed within the Sociology department at Columbia. Almost exactly a year later, I can safely say that it was the right decision.

The basic idea of my thesis is to use Google Location Services data to study how students at Columbia University make use of New York City. More specifically, I am examining how factors such as having a job or receiving financial assistance from family affect that usage.

I've been analyzing survey responses combined with Google Location Services data for this investigation, but before I get into that I need to explain how I recruited my respondents.

Recruitment

My first attempts at recruitment involved reaching out to student groups and asking them to include a blurb in their listserv emails. I should have known better, because nobody reads student group listservs. I then tried to scrape student emails from Columbia's student directory, but that required an enormous number of server requests and would have taken too long. After talking it over with some friends, I realized I could use Columbia's internal computer system (which every student has access to) to quickly and thoroughly collect all of the publicly available student information. I brought my proof of concept to a friend I had consulted earlier, and after his improvements we had a directory of every person at Columbia University, including all 8,000+ undergraduates.

I then randomly selected 1,000 students at a time and emailed them using Mailchimp. By making use of Mailchimp's A/B testing platform, I improved my recruitment effectiveness. I ended up reaching out to 8,000 students. Of those 8,000, almost 1,000 opened the survey, and 46 students ended up completing the survey and uploading their Google Location data. I was content with the result, because my study really asked each respondent to provide quite a bit of information.
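A minimal sketch of how a directory could be split into random, non-overlapping batches of 1,000 before each batch is handed to a mailing tool. The function name `random_batches`, the fixed seed, and the placeholder addresses are assumptions made for illustration, not the script actually used.

```python
import random

def random_batches(emails, batch_size=1000, seed=0):
    """Shuffle the directory once, then cut it into non-overlapping batches."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(emails)        # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    return [shuffled[i:i + batch_size]
            for i in range(0, len(shuffled), batch_size)]

# Placeholder addresses standing in for the real directory (hypothetical).
directory = [f"student{i}@example.edu" for i in range(8000)]
batches = random_batches(directory)
print(len(batches), len(batches[0]))
```

Shuffling once and slicing guarantees that no student lands in two batches, which repeated independent draws of 1,000 would not.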

Data

At the end of my recruitment, I had over 6 million points of location information, including coordinates, estimated heading, estimated speed, and a timestamp. Each respondent also provided demographic information along with details about the semesters they spent on campus.

The Survey

When I was designing the survey, I had to balance getting all of the information I wanted against keeping it short enough for respondents to complete. I chose to ask for:

- Basic demographic information
- Parental education and income
- Which semesters they spent on campus
- Whether or not they worked during each semester
- Whether or not their family provided financial assistance during each semester

I can look for relationships between these variables and spatial patterns. I am especially focused on identifying anything that limits the extent and frequency of student trips beyond campus.


Cleaning the data was a particularly interesting challenge. The data that respondents downloaded from Google Takeout was a JSON file that I needed to convert to GeoJSON. Using the survey responses, I could eliminate all coordinates that were not recorded during a semester when that respondent was on campus. I then determined which census tract each coordinate belonged to, which will let me bring in census data later if need be.
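The first two cleaning steps above might look roughly like the sketch below. It assumes the Takeout export is a JSON object with a `locations` list whose entries carry `timestampMs`, `latitudeE7`, and `longitudeE7` (with optional `heading` and `velocity`); that layout, the function name `takeout_to_geojson`, and the semester dates are assumptions for illustration, and the census tract lookup is left out.

```python
from datetime import datetime, timezone

def takeout_to_geojson(takeout, semesters):
    """Convert parsed Takeout data to GeoJSON, keeping in-semester points only.

    takeout:   dict parsed from the export (assumed layout, see above)
    semesters: list of (start, end) timezone-aware datetimes
    """
    features = []
    for loc in takeout.get("locations", []):
        ts = datetime.fromtimestamp(int(loc["timestampMs"]) / 1000,
                                    tz=timezone.utc)
        # Drop points recorded outside every on-campus semester.
        if not any(start <= ts < end for start, end in semesters):
            continue
        features.append({
            "type": "Feature",
            # GeoJSON orders coordinates as [longitude, latitude].
            "geometry": {
                "type": "Point",
                "coordinates": [loc["longitudeE7"] / 1e7,
                                loc["latitudeE7"] / 1e7],
            },
            "properties": {
                "timestamp": ts.isoformat(),
                "heading": loc.get("heading"),
                "velocity": loc.get("velocity"),
            },
        })
    return {"type": "FeatureCollection", "features": features}

def ms(year, month, day):
    """Milliseconds since the epoch for midnight UTC, as a string."""
    return str(int(datetime(year, month, day, tzinfo=timezone.utc).timestamp() * 1000))

# Two made-up points: one in the summer, one during a fall semester.
sample = {"locations": [
    {"timestampMs": ms(2015, 7, 1), "latitudeE7": 407580000, "longitudeE7": -739855000},
    {"timestampMs": ms(2015, 10, 1), "latitudeE7": 408075000, "longitudeE7": -739626000},
]}
fall = (datetime(2015, 9, 8, tzinfo=timezone.utc),
        datetime(2015, 12, 23, tzinfo=timezone.utc))
geo = takeout_to_geojson(sample, [fall])
print(len(geo["features"]))
```

In practice the parsed dictionary would come from `json.load` on each respondent's file, and the semester ranges from that respondent's survey answers.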

This short description of the data wrangling process makes it seem as if it was pretty straightforward. However, every step of the way involved multiple failed attempts, Google searches, and lessons learned. I've heard that often the majority of data analysis is actually just preparing the data to be analyzed, and this experience reinforced that notion.


As I examined the data, I realized that many respondents' patterns varied widely from semester to semester. This, along with the fact that I had asked for information about each semester, led me to treat the semester, rather than the respondent, as my unit of analysis. The resulting sample of 146 semesters also proved easier to work with than the smaller number of respondents.

I also decided to take some time to learn and set up my work in an IPython Notebook. The initial investment has paid off, as I can now iterate on, visualize, and share my analysis much more easily than I could from the command line.

Preliminary Results

I initially hypothesized that having to work or having no familial financial assistance would be a barrier to accessing the city. Judging by the three measures I used (average distance from campus when a student was off campus, percentage of time spent off campus, and range of areas visited by the student), I seem to have been completely wrong. In fact, the opposite of my prediction seems to be true: Columbia University students who do not receive financial assistance from their families spend more time in more parts of the city than students who do.
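To make the first two measures concrete, here is a rough sketch of how they could be computed for one semester's points. The campus coordinates, the 0.5 km radius that counts as "on campus", and the use of the share of recorded points as a stand-in for the share of time are all assumptions for illustration, not the definitions used in the thesis.

```python
import math

CAMPUS = (40.8075, -73.9626)   # approximate campus lat, lon (assumption)
CAMPUS_RADIUS_KM = 0.5         # anything farther is "off campus" (assumption)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def semester_metrics(points):
    """points: list of (lat, lon) pairs recorded during one semester."""
    if not points:
        return None
    dists = [haversine_km(lat, lon, *CAMPUS) for lat, lon in points]
    off = [d for d in dists if d > CAMPUS_RADIUS_KM]
    return {
        # Share of points off campus, a proxy for share of time off campus.
        "share_off_campus": len(off) / len(dists),
        # Mean distance from campus, counting off-campus points only.
        "mean_off_campus_km": sum(off) / len(off) if off else 0.0,
    }

# One point on campus and one near Times Square (made-up example).
metrics = semester_metrics([CAMPUS, (40.7580, -73.9855)])
print(metrics)
```

Because location points are not recorded at even intervals, weighting each point by the time until the next one would be a closer match to "percentage of time" than the simple point share used here.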