Simpsons Database

I compiled a corpus of Simpsons episode transcripts from www.simpsonsworld.com. Using this database, I analyzed character lines by episode and season.

[github repo]

Data Collection

This project involved web scraping to collect the text and a PostgreSQL database to store it. First, I found links to each episode available on www.simpsonsworld.com and parsed the HTML to extract the scripts. The links to each episode, along with additional identifying information, were stored in one table. The scripts for all of the episodes were stored in another table, line by line. The common identifier for the episode and script tables is the URL from which the data were retrieved.
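The two-table layout described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the HTML markup, the class names `line` and `speaker`, the table and column names, and the example URL are all hypothetical, and Python's built-in `sqlite3` stands in for PostgreSQL so the sketch runs without a database server.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical markup: each spoken line is assumed to look like
# <p class="line"><span class="speaker">Homer</span> D'oh!</p>
SAMPLE_HTML = """
<div id="script">
  <p class="line"><span class="speaker">Homer</span> D'oh!</p>
  <p class="line"><span class="speaker">Marge</span> Homer, please.</p>
</div>
"""


class ScriptParser(HTMLParser):
    """Collect (speaker, text) pairs from the assumed markup."""

    def __init__(self):
        super().__init__()
        self.lines = []            # finished (speaker, text) pairs
        self._in_line = False      # inside <p class="line">
        self._in_speaker = False   # inside <span class="speaker">
        self._speaker = []
        self._text = []

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "p" and cls == "line":
            self._in_line = True
            self._speaker, self._text = [], []
        elif tag == "span" and cls == "speaker" and self._in_line:
            self._in_speaker = True

    def handle_endtag(self, tag):
        if tag == "span" and self._in_speaker:
            self._in_speaker = False
        elif tag == "p" and self._in_line:
            self._in_line = False
            speaker = "".join(self._speaker).strip()
            text = "".join(self._text).strip()
            self.lines.append((speaker, text))

    def handle_data(self, data):
        if self._in_speaker:
            self._speaker.append(data)
        elif self._in_line:
            self._text.append(data)


def build_db(url, season, episode, title, parsed_lines):
    """Create both tables (assumed schema) and load one episode."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE episodes (
            url     TEXT PRIMARY KEY,   -- common identifier
            season  INTEGER,
            episode INTEGER,
            title   TEXT
        );
        CREATE TABLE script_lines (
            url       TEXT REFERENCES episodes(url),
            line_no   INTEGER,          -- position within the script
            character TEXT,
            line      TEXT
        );
    """)
    conn.execute(
        "INSERT INTO episodes VALUES (?, ?, ?, ?)",
        (url, season, episode, title),
    )
    conn.executemany(
        "INSERT INTO script_lines VALUES (?, ?, ?, ?)",
        [(url, i, who, text)
         for i, (who, text) in enumerate(parsed_lines, start=1)],
    )
    return conn


parser = ScriptParser()
parser.feed(SAMPLE_HTML)
# Placeholder URL and metadata, not a real episode record.
conn = build_db("https://example.com/episode-1", 1, 1,
                "Placeholder title", parser.lines)
```

Keying both tables on the URL means a script line can always be traced back to the page it came from, at the cost of repeating a fairly long string on every row; a surrogate integer key would be a common alternative.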

Data Analysis

To check that the database was well formed, I ran several test queries and plotted the number of lines spoken by each of the main characters (Homer, Marge, Bart, Lisa, and Maggie) in every episode of season 5. The output seemed reasonable: the line counts reflected the prominence of the Simpson family, with episode-to-episode variations consistent with the events suggested by the episode titles. I also looked at the average number of lines per episode for each character for the first 15 seasons of the show. Lines attributed to the nuclear Simpson family accounted for ~45% of the show's dialogue.
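Queries of the kind described above might look like the sketch below. It is only an illustration: the table and column names are assumed, the handful of rows are made-up toy data rather than real transcript counts, and `sqlite3` again stands in for PostgreSQL so the example is self-contained.

```python
import sqlite3

FAMILY = ("Homer", "Marge", "Bart", "Lisa", "Maggie")

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE episodes (url TEXT PRIMARY KEY, season INTEGER, episode INTEGER);
    CREATE TABLE script_lines (url TEXT, character TEXT);
""")

# Toy rows for illustration only; they do not reflect the real data.
conn.executemany(
    "INSERT INTO episodes VALUES (?, ?, ?)",
    [("url-a", 5, 1), ("url-b", 5, 2)],
)
conn.executemany(
    "INSERT INTO script_lines VALUES (?, ?)",
    [("url-a", "Homer"), ("url-a", "Homer"), ("url-a", "Bart"), ("url-a", "Moe"),
     ("url-b", "Lisa"), ("url-b", "Marge"), ("url-b", "Apu"), ("url-b", "Apu")],
)


def lines_per_episode(conn, season):
    """Line count per family member for every episode of one season."""
    placeholders = ",".join("?" * len(FAMILY))
    sql = f"""
        SELECT e.episode, s.character, COUNT(*) AS n_lines
        FROM script_lines AS s
        JOIN episodes AS e ON e.url = s.url
        WHERE e.season = ? AND s.character IN ({placeholders})
        GROUP BY e.episode, s.character
        ORDER BY e.episode, s.character
    """
    return conn.execute(sql, (season, *FAMILY)).fetchall()


def family_share(conn):
    """Fraction of all stored lines attributed to the five family members."""
    placeholders = ",".join("?" * len(FAMILY))
    sql = f"""
        SELECT AVG(CASE WHEN character IN ({placeholders}) THEN 1.0 ELSE 0.0 END)
        FROM script_lines
    """
    return conn.execute(sql, FAMILY).fetchone()[0]


for episode, character, n_lines in lines_per_episode(conn, 5):
    print(episode, character, n_lines)
print(f"family share: {family_share(conn):.1%}")
```

The rows returned by `lines_per_episode` are already in the long format that most plotting libraries expect, with one row per episode and character.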