Assignment 2: Exploratory Data Analysis
Due: Monday Oct 11, 2021 by 11:30am
In this assignment you will identify a dataset of interest and use an exploratory data analysis tool to better understand the data, investigate initial questions about it and develop preliminary insights and hypotheses. Your final submission will be a report consisting of a series of captioned visualizations that convey the key insights gained over the course of your analysis. Documenting the data analysis process you went through is the main pedagogical goal of the assignment and more important than the design of the visualizations.
Part 1: Select and Prepare the Data
You should start by picking a topic area of interest to you and finding a dataset that can provide insights into that topic. We’ve included some datasets below that you can start from. However, if you would like to investigate a different topic and dataset, you are encouraged to do so. If you self-select a dataset and are concerned about its appropriateness for the assignment, please check with the teaching staff.
Be advised that data collection and preparation (also known as data wrangling) can be a very tedious and time-consuming process. Be sure to leave yourself sufficient time for exploratory analysis after preparing the data.
After selecting a topic and dataset – but prior to analysis – you should write down an initial set of at least three questions you’d like to investigate.
Part 2: Exploratory Analysis
Phase 1: You should start the exploration by first examining the shape and structure of your data. What dimensions/variables does it contain and how are the data values distributed? Are there any notable data quality issues? Are there any surprising relationships between the dimensions/variables? Make sure to perform sanity checks for patterns you expect to see! Note that it may be the case that after doing a bit of exploration in phase 1 you find that your data is not as interesting as you first thought. In such cases you might consider returning to Part 1 and identifying a different dataset to work with. Such iteration on choosing the dataset is common, but also time-consuming, so make sure you leave time in your schedule for this.
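As a rough illustration of a Phase 1 first pass, the sketch below uses Pandas (one of the tools listed under Data Wrangling Tools). The inline table, its column names, and the negative-value check are hypothetical stand-ins; with your own data you would load a file with `pd.read_csv` and adapt the checks to what you expect to see.

```python
import io

import pandas as pd

# Hypothetical stand-in data; with a real file, use pd.read_csv("your_file.csv").
RAW = """title,genre,year,gross_musd
Alpha,Drama,2001,12.5
Bravo,Comedy,2003,
Charlie,Drama,1999,48.0
Delta,Action,2010,310.2
Echo,Comedy,2010,-1
"""

df = pd.read_csv(io.StringIO(RAW))

# Shape and structure: how many rows and columns, and what type is each column?
n_rows, n_cols = df.shape
print(f"{n_rows} rows x {n_cols} columns")
print(df.dtypes)

# Data quality: count missing values per column.
missing = df.isna().sum()
print(missing)

# Distributions: summary statistics for a numeric column,
# and category frequencies for a nominal one.
summary = df["gross_musd"].describe()
genre_counts = df["genre"].value_counts()
print(summary)
print(genre_counts)

# Sanity check (an assumption about this toy data): gross should never be negative.
suspicious = df[df["gross_musd"] < 0]
print(suspicious)
```

Checks like the last one are worth writing down even when they pass, since confirming an expected pattern is itself a finding you can report.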
Phase 2: Next, you should investigate your initial questions, as well as any new questions that arise during your exploration. For each question, start by creating a visualization that might provide a useful answer. Then refine the visualization (by adding additional variables, changing the sort ordering or axis scales, filtering or subsetting data, etc.) to develop better perspectives, explore unexpected observations, or sanity check your assumptions. You should repeat this process for each of your questions, and also feel free to revise your questions or branch off to explore new questions as the data warrants.
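To make the refinement loop concrete, here is a minimal sketch that builds a Vega-Lite specification as a plain Python dictionary, first as a default view and then refined with a filter, a log-scaled axis, and a sort order. The records, the field names, and the `make_spec` helper are all hypothetical and exist only for illustration; the printed JSON is intended to be pasted into a Vega-Lite editor or notebook.

```python
import json

# Hypothetical records; the field names are illustrative only.
records = [
    {"genre": "Drama", "year": 1999, "gross_musd": 48.0},
    {"genre": "Drama", "year": 2001, "gross_musd": 12.5},
    {"genre": "Comedy", "year": 2003, "gross_musd": 30.1},
    {"genre": "Action", "year": 2010, "gross_musd": 310.2},
]


def make_spec(values, min_year=None, log_scale=False, sort_by_value=False):
    """Build a Vega-Lite spec showing median gross per genre (hypothetical helper)."""
    x = {"field": "gross_musd", "type": "quantitative", "aggregate": "median"}
    if log_scale:
        # Changing the axis scale: a log scale helps when values span orders of magnitude.
        x["scale"] = {"type": "log"}

    y = {"field": "genre", "type": "nominal"}
    if sort_by_value:
        # Changing the sort ordering: order genres by the x value, descending.
        y["sort"] = "-x"

    spec = {
        "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
        "data": {"values": values},
        "mark": "point",
        "encoding": {"x": x, "y": y},
    }
    if min_year is not None:
        # Filtering or subsetting: keep only records from min_year onward.
        spec["transform"] = [{"filter": f"datum.year >= {min_year}"}]
    return spec


first_pass = make_spec(records)
refined = make_spec(records, min_year=2000, log_scale=True, sort_by_value=True)
print(json.dumps(refined, indent=2))
```

Keeping each refinement as a separate, named option makes it easier to recall afterwards which change revealed which observation.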
Your final submission should take the form of a report – similar to a slide show or comic book – that consists of 8 or more captioned visualizations detailing your most important insights. Your “insights” can include important surprises or issues (such as data quality problems affecting your analysis) as well as responses to your analysis questions. Where appropriate, we encourage you to include annotated visualizations to guide viewers’ attention and provide interpretive context. (If you aren’t sure what we mean by “annotated visualization,” see this page for some examples.)
Provide sufficient detail such that anyone can read your report and understand what you’ve learned without already being familiar with the dataset. To help you gauge the scope of this assignment, see this example report analyzing National Football League data, or this example notebook analyzing motion picture data.
If you are using Tableau, you can assemble your report document by exporting images from Tableau, using the Worksheet > Export > Image… menu item. Place the images in a document, write a text caption for each one and export a PDF. If you are using Vega-Lite, you can either export the images you generate and make a PDF, or you can submit a link to a published version of your Observable Notebook.
Do not submit a report or notebook cluttered with every little thing you tried. Submit a clean report that highlights the most important “milestones” in your exploration, which can include initial overviews, identification of data quality problems, confirmations of key assumptions, and potential “discoveries”. Your report should only present the final dataset you analyzed and should not describe any iterations on earlier datasets you might have initially explored.
There are a variety of data sources available online. Here are some possible sources to consider. If you have any questions about whether your dataset is appropriate, please talk to the course staff ASAP.
- Data is Plural - Variety of datasets and sources covering many topics.
- Stanford Institutional Research & Decision Support - Stanford institutional data (e.g. enrollment, admissions, diversity, etc.).
- Stanford Open Data Portal - Stanford Daily’s open data sets.
- Big Local News Data Archive - Stanford Journalism’s Big Local News Project data sets.
- data.gov - U.S. Government open datasets.
- U.S. Census Bureau - Census data.
- Federal Election Commission - Campaign finance and expenditures.
- Federal Aviation Administration - FAA data.
- Awesome Public Datasets - Variety of public datasets.
- Stanford Cable TV News Analyzer - We have recently released a tool that can be used to analyze who and what appears in the last decade of Cable TV News (i.e. CNN, Fox News, MSNBC). The site also lets you download data, which you could use to conduct further analysis.
- Tableau - Provides a graphical user interface for importing, transforming and especially for visualizing data. You can get a free student license that allows you to install the software on your own computer. Be advised that the free student license can take a few days to arrive via email, so install the software soon.
Data Wrangling Tools
The data you choose may require reformatting, transformation or cleaning prior to visualization. Here are tools you can use for data preparation. Contact the course staff if you are unsure what might be the best option for your data!
- Tableau - Tableau provides basic facilities for data import, transformation & blending.
- OpenRefine - A free, open source tool for working with messy data.
- Trifacta Wrangler - Interactive tool for data transformation & visual profiling. This tutorial will walk you through the basics of using the tool.
- Pandas - Data frame and manipulation utilities for Python.
- dplyr - A library for data manipulation in R.
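For a sense of what typical cleaning steps look like, here is a small Pandas sketch covering whitespace and capitalization fixes, type coercion, and duplicate removal. The table and its column names are hypothetical; which steps you actually need depends on your data.

```python
import pandas as pd

# Hypothetical messy records, with every value read in as a string.
raw = pd.DataFrame(
    {
        "city": [" Palo Alto", "palo alto", "San Jose ", "Oakland"],
        "date": ["2021-10-01", "2021-10-01", "not a date", "2021-10-03"],
        "visits": ["120", "120", "n/a", "95"],
    }
)

clean = raw.copy()

# Normalize text: trim stray whitespace and unify capitalization.
clean["city"] = clean["city"].str.strip().str.title()

# Coerce types: values that cannot be parsed become NaT / NaN instead of raising.
clean["date"] = pd.to_datetime(clean["date"], errors="coerce")
clean["visits"] = pd.to_numeric(clean["visits"], errors="coerce")

# Remove rows that are exact duplicates once the text has been normalized.
clean = clean.drop_duplicates().reset_index(drop=True)

print(clean)
print(clean.dtypes)
```

Note that coercion silently turns unparseable values into missing ones, so it is worth counting missing values before and after and reporting any data quality problems you find.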
Each submission will be graded based on both the analysis process and the included visualizations. Here are our grading criteria:
- Appropriate Data Assessment (5): Overview/understanding of the data is built from transformations and appropriate assessment of data quality. Poses clear questions.
- Exploration Thoroughness (5): Sufficient breadth of analysis, exploring questions in sufficient depth (with appropriate follow-up questions).
- Documentation (5): Clear documentation of exploratory process, including clearly written, understandable captions that communicate primary insights.
This is an individual assignment. You may not work in groups. Your completed assignment is due on Monday Oct 11, 2021 by 11:30am.
To submit your assignment, prepare a PDF containing your report or a
link to a website (e.g. Observable Notebook, or website running
Vega-Lite) with the filename: