Assignment 2: Exploratory Data Analysis

Due: Monday Jan 27, 2020 by 4:30pm

In this assignment you will identify a dataset of interest and use an exploratory data analysis tool to better understand the data, investigate initial questions about it and develop preliminary insights and hypotheses. Your final submission will be a report consisting of a series of captioned visualizations that convey the key insights gained over the course of your analysis. Documenting the data analysis process you went through is the main pedagogical goal of the assignment and more important than the design of the final visualization.

Part 1: Select and Prepare the Data

You should start by picking a topic area of interest to you and finding a dataset that can provide insights into that topic. We’ve included some datasets below that you can start from. But, if you would like to investigate a different topic and dataset, you are encouraged to do so. If you self-select a dataset and are concerned about its appropriateness for the assignment, please check with the teaching staff.

Be advised that data collection and preparation (also known as data wrangling) can be a very tedious and time-consuming process. Be sure you leave sufficient time to conduct exploratory analysis after preparing the data.

After selecting a topic and dataset – but prior to analysis – you should write down an initial set of at least three questions you’d like to investigate.

Part 2: Exploratory Analysis

Next you will use an exploratory analysis tool to investigate your data. For this assignment we would like you to use Tableau, a commercial visualization tool that supports many different ways to visually explore the data.

Phase 1: You should start the exploration by first examining the shape and structure of your data. What dimensions/variables does it contain and how are the data values distributed? Are there any notable data quality issues? Are there any surprising relationships between the dimensions/variables? Make sure to perform sanity checks for patterns you expect to see! Note that it may be the case that after doing a bit of exploration in phase 1 you find that your data is not as interesting as you first thought. In such cases you might consider returning to Part 1 and identifying a different dataset to work with. Such iteration on choosing the dataset is common, but also time-consuming, so make sure you leave time in your schedule for this.
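If you would like a quick programmatic check alongside Tableau, the Phase 1 questions about shape, distributions and data quality can be sketched in a few lines of Python. This is only a minimal sketch using the standard library. The sample data, the column names and the `profile` helper are hypothetical stand-ins for your own dataset, not part of the assignment.

```python
import csv
import io
import statistics

# Hypothetical sample data standing in for your own CSV file.
SAMPLE_CSV = """city,year,temp_f
Seattle,2019,52.1
Seattle,2020,
Boston,2019,51.3
Boston,2020,53.0
"""


def profile(rows):
    """Summarize each column: row count, missing values, distinct values, numeric range."""
    summary = {}
    columns = list(rows[0].keys()) if rows else []
    for col in columns:
        values = [r[col] for r in rows]
        present = [v for v in values if v not in ("", None)]
        info = {
            "rows": len(values),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
        # Treat the column as numeric only if every non-missing value parses as a float.
        try:
            numbers = [float(v) for v in present]
        except ValueError:
            numbers = []
        if numbers:
            info.update(
                min=min(numbers),
                max=max(numbers),
                median=statistics.median(numbers),
            )
        summary[col] = info
    return summary


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
report = profile(rows)
for col, info in report.items():
    print(col, info)
```

To run this on a real file, replace `io.StringIO(SAMPLE_CSV)` with an opened file handle. A column with many missing values or an implausible minimum or maximum is a good candidate for one of the sanity checks described above.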

Phase 2: Next, you should investigate your initial questions, as well as any new questions that arise during your exploration. For each question, start by creating a visualization that might provide a useful answer. Then refine the visualization (by adding additional variables, changing the sort ordering or axis scales, filtering or subsetting data, etc.) to develop better perspectives, explore unexpected observations, or sanity check your assumptions. You should repeat this process for each of your questions, and also feel free to revise your questions or branch off to explore new questions as the data warrants.

Final Deliverable

Your final submission should take the form of a report – similar to a slide show or comic book – that consists of 10 or more captioned visualizations detailing your most important insights. Your “insights” can include important surprises or issues (such as data quality problems affecting your analysis) as well as responses to your analysis questions. To help you gauge the scope of this assignment, see this example report analyzing National Football League data.

You should assemble your report document by exporting images from Tableau, using the Worksheet > Export > Image… menu item. Place the images in a document, write a text caption for each one and export a PDF. Note that in addition to Tableau you may also use other exploratory analysis tools such as R and MATLAB, or even Python, but the other tools are not required. If you do use another tool, at least one of the visualizations in your final report should be from Tableau.

Each visualization image should be a screenshot exported from a visualization tool, accompanied by a title and a descriptive caption (1-4 sentences long) describing the insight(s) learned from that view. Provide sufficient detail for each caption such that anyone could read through your report and understand what you’ve learned. You are free, but not required, to annotate your images to draw attention to specific features of the data. You may perform highlighting within the visualization tool itself, or draw annotations on an exported image.

Do not submit a report cluttered with every little thing you tried. Submit a clean report that highlights the most important “milestones” in your exploration, which can include initial overviews, identification of data quality problems, confirmations of key assumptions, and potential “discoveries”. Your report should only present the final dataset you analyzed and should not describe any iterations on earlier datasets you might have initially explored.

Data Sources

There are a variety of data sources available online. Here are some possible sources to consider. If you have any questions about whether your dataset is appropriate, please talk to the course staff ASAP.

Visualization Tool (Tableau)

To create the visualizations, we will be using Tableau, a commercial visualization tool that offers free student licenses so that you can install the software on your own computer. Be advised that the free student license can take a few days to arrive via email so install the software soon.

One goal of this assignment is for you to learn to use and evaluate the effectiveness of Tableau. Please talk to the course staff if you think it will not be possible for you to use the tool. In addition to Tableau, you are free to also use other visualization tools as you see fit.

Data Wrangling Tools

The data you choose may require reformatting, transformation or cleaning prior to visualization. Here are tools you can use for data preparation. Contact the course staff if you are unsure what might be the best option for your data!

Graphical Tools

  • Tableau - Tableau provides basic facilities for data import, transformation & blending.
  • OpenRefine - A free, open source tool for working with messy data.
  • Trifacta Wrangler - Interactive tool for data transformation & visual profiling. This tutorial will walk you through the basics of using the tool.

Programming Tools
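If you prefer to prepare your data in code, the reformatting and cleaning steps mentioned above might look something like the following. This is a minimal sketch using only Python's standard library. The messy input, the column names, the list of missing-value markers and the helper functions are all hypothetical and would need to be adapted to your own data.

```python
import csv
import io

# Hypothetical messy input; column names and values are illustrative only.
RAW_CSV = """ State ,Population,Median Income
 Washington ,"7,614,893",$78687
oregon,N/A,$67058
"""

# Strings treated as "no value" (an assumption; adjust for your dataset).
MISSING_MARKERS = {"", "n/a", "na", "null", "-"}


def clean_number(text):
    """Strip currency symbols and thousands separators; return None if missing."""
    text = text.strip()
    if text.lower() in MISSING_MARKERS:
        return None
    return float(text.replace("$", "").replace(",", ""))


def clean_rows(raw_text):
    """Return a list of dicts with normalized headers and parsed numeric fields."""
    reader = csv.DictReader(io.StringIO(raw_text))
    cleaned = []
    for row in reader:
        # Normalize header names: trim, lowercase, replace spaces with underscores.
        row = {k.strip().lower().replace(" ", "_"): v for k, v in row.items()}
        cleaned.append({
            "state": row["state"].strip().title(),
            "population": clean_number(row["population"]),
            "median_income": clean_number(row["median_income"]),
        })
    return cleaned


cleaned = clean_rows(RAW_CSV)
for record in cleaned:
    print(record)
```

The cleaned records can then be written back out with `csv.DictWriter` and imported into Tableau. Keeping missing values as `None` rather than guessing a replacement makes data quality problems easier to spot during exploration.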

Grading

Each submission will be graded based on both the analysis process and the included visualizations. Here are our grading criteria:

  • Appropriate Data Assessment (5): Overview/understanding of the data is built from transformations and appropriate assessment of data quality. Poses clear questions.
  • Exploration Thoroughness (5): Sufficient breadth of analysis, exploring questions in sufficient depth (with appropriate follow-up questions).
  • Documentation (5): Clear documentation of exploratory process, including clearly written, understandable captions that communicate primary insights.

Submission Details

This is an individual assignment. You may not work in groups. Your completed assignment is due on Monday Jan 27, 2020 by 4:30pm.

To submit your assignment, prepare a PDF containing your notebook and your final visualization and description with the filename: A2-FirstnameLastname.PDF

Upload this PDF to Canvas.