Final Project

Description

Instead of a final exam, you will complete a final project that draws together the statistics and data analysis skills we have been building through our labs with R.

Important dates

  • Project topic: Tuesday, October 24 @ 5pm
  • Rough Draft due: Tuesday, November 21 @ 5pm
  • Project due: Tuesday, December 5 @ 5pm
  • Presentations: Tuesday, December 5, 8:30-11:30am

Dataset

Your first task is to find an interesting dataset that you will explore in your project. This could be something related to your academic interests, or something else you would just like to explore.

Use your chosen dataset to create a new public dataset on Kaggle in your account. Then create a notebook that includes this dataset and reads the data into a data frame.

You will need to have your topic and dataset approved by the instructor by Tuesday, October 24 @ 5pm.

There are an enormous number of datasets available formatted as CSV files. Here are some resources to help your initial search.

The following table lists a few of them that I think are in an easily manageable state and could provide some interesting analysis questions given our R skills, but feel free to choose one not on this list.

Dataset | Documentation | Data
Fertility | HTML | Fertility2.csv
Home Mortgage Disclosure Act | HTML | HMDA.csv
Parade Salary Survey | HTML | Parade2005.csv
Labor Force Participation | HTML | PSID1976.csv
Teacher Ratings | HTML | TeachingRatings.csv
Smoking Cessation | HTML | Smoking.csv
British Election Panel Study | HTML | BEPS.csv
Canadian National Election Study | HTML | CES11.csv
Titanic Survival | HTML | TitanicSurvival.csv
Vocabulary and Education | HTML | Vocab.csv
Post-Coma Recovery of IQ | HTML | Wong.csv
Mortgage Subsidies and GI Bill | HTML | mortgages.csv
Cricketers Lifespans | HTML | cricketer.csv
Airbags and Accidents | HTML | nassCDS.csv
Unemployment | HTML | Benefits.csv
Car Choice | HTML | Car.csv
Doctor Visits | HTML | DoctorAUS.csv
Extramarital Affairs | HTML | Fair.csv
Preservation of Kakadu National Park | HTML | Kakadu.csv
Ketchup Buying | HTML | Ketchup.csv
Visits to Physician Office | HTML | OFP.csv
Return to School | HTML | RetSchool.csv
Tobacco Budget Share | HTML | Tobacco.csv
Medical Expenses in Vietnam | HTML | VietNamI.csv
Wife Working Hours | HTML | Workinghours.csv
Air Pollution | HTML | ohio.csv
Baseball | HTML | Hitters.csv
Orange Juice Prices | HTML | OJ.csv
Eighth-Grade Test Scores | HTML | nlschools.csv
Blizzard Salary | HTML | blizzard_salary.csv

Paper

Based on the data you choose, you will write a paper as a Kaggle Notebook, with code richly annotated with Markdown blocks, describing your analysis of particular questions involving statistics.

Your paper should have the following headings and sections:

  • Overview
  • Methodology
  • Data Notebook
  • Exploratory Data Analysis
  • Inference
  • Conclusions

Overview

First, you should lay the groundwork for the topic you will be discussing in your paper. What would someone unfamiliar with the topic need to know in order to understand your analysis? What questions will you be answering with your graphs, statistical analysis, and later discussion? Do you have a hypothesis for what you will find when you do the analysis?

This will probably involve a few references to outside material or other websites. If it is a website, make a link to it; if it is a textbook, use a consistent citation structure. You may use MLA, APA, or other formats, as long as a reader can easily find the reference.

Also in this section, explain what interests you about this topic and why you selected it for this paper.

Methodology

Detail in a few sentences how this specific data was gathered, when, and by whom. Is it from a survey or from an experiment? Clearly explain the methodology used for the survey or experiment design, using the language we discussed in Unit 3: Producing Data of our textbook. You will need to research the original paper that discussed the dataset to find this information. Summarize this in your own words; do not copy from the source.

Data Notebook

The reader will want to know some information about your chosen dataset. Load the data into a variable, show a few rows of the data frame, and provide a description of each of the variables being used in your analysis. Clearly indicate the type of each variable of interest (categorical or quantitative) and the role it is playing in your analysis.
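As a rough sketch of what this section of the notebook might look like, the code below uses R's built-in mtcars data as a stand-in for your own dataset. The temporary CSV file and the variable name cars are assumptions made only for illustration; in your Kaggle notebook you would instead point read.csv() at your own file, which Kaggle normally mounts under /kaggle/input/.

```r
# Stand-in for a project dataset: the built-in mtcars data is written to a
# temporary CSV so the read.csv() step can be shown from start to finish.
# (Assumption: your real file would live under /kaggle/input/ instead.)
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)

# Read the CSV file into a data frame.
cars <- read.csv(csv_path)

# Show the first few rows of the data frame.
print(head(cars))

# A categorical variable stored as 0/1 codes should be converted to a factor
# so that R treats it as categorical rather than quantitative.
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

# str() lists every variable with its storage type
# (num/int for numbers, Factor for categorical variables).
str(cars)
```

After a listing like this, a short Markdown block can describe each variable you plan to use, state whether it is categorical or quantitative, and say whether it acts as an explanatory or response variable.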

Exploratory Data Analysis

Next, you should examine the data using appropriate exploratory tools we have learned (boxplots, scatter plots, two-way tables, bar charts, etc.). Use these graphs, and generate some summary statistics (mean, median, standard deviation, etc.), to visualize and understand your data. Clearly discuss why the tools and statistics you show are the right way to analyze the data.
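The following is a minimal sketch of how these exploratory tools might be called in base R. It again uses the built-in mtcars data as a stand-in for your dataset; the variable names (cars, group_means, two_way) are illustrative assumptions, and the right choice of plot depends on the types of your own variables.

```r
# Stand-in data: built-in mtcars, with transmission type made categorical.
cars <- mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Summary statistics for one quantitative variable.
mpg_mean   <- mean(cars$mpg)
mpg_median <- median(cars$mpg)
mpg_sd     <- sd(cars$mpg)
print(c(mean = mpg_mean, median = mpg_median, sd = mpg_sd))

# Mean of a quantitative variable within each level of a categorical one.
group_means <- aggregate(mpg ~ am, data = cars, FUN = mean)
print(group_means)

# Side-by-side boxplots: quantitative response vs. categorical explanatory.
boxplot(mpg ~ am, data = cars,
        xlab = "Transmission", ylab = "Miles per gallon")

# Scatter plot: two quantitative variables.
plot(cars$wt, cars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Two-way table and bar chart: two categorical variables.
two_way <- table(cars$am, cars$cyl)
print(two_way)
barplot(two_way, beside = TRUE, legend.text = TRUE,
        xlab = "Number of cylinders", ylab = "Count")
```

Each output should be followed by a sentence or two interpreting it, not just displayed.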

Inference

Finally, use the appropriate statistical tools we have discussed in class to make inferences about your data. Are you estimating population means and comparing them across subgroups or against a hypothesized value? Are you estimating population proportions and examining differences? Are confidence intervals involved? Are your conclusions statistically significant? Clearly explain the tools you use and why they are the right approach to answer your questions.
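As one possible illustration, the sketch below compares two group means with t.test() and two group proportions with prop.test(), both from base R. The mtcars data is only a stand-in, the cutoff of 20 miles per gallon is an arbitrary assumption chosen for the example, and whether these procedures suit your project depends on how your data was collected.

```r
# Stand-in data: built-in mtcars, with transmission type made categorical.
cars <- mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Difference of two means: two-sample t test (Welch version by default).
mean_test <- t.test(mpg ~ am, data = cars)
print(mean_test$conf.int)   # confidence interval for the difference in means
print(mean_test$p.value)    # p-value for H0: the two population means are equal

# Difference of two proportions: share of cars above 20 mpg in each group.
# (The 20 mpg cutoff is an arbitrary choice for this illustration.)
high_mpg  <- cars$mpg > 20
successes <- as.vector(tapply(high_mpg, cars$am, sum))  # count above cutoff per group
totals    <- as.vector(table(cars$am))                  # group sizes
prop_test <- prop.test(x = successes, n = totals)
print(prop_test$estimate)   # sample proportion in each group
print(prop_test$p.value)    # p-value for H0: the two population proportions are equal
```

In your paper, state the hypotheses in words before running a test and interpret the p-value or confidence interval in the context of your question afterward.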

Conclusions

Write a paragraph summarizing your findings and put them in the context of your original topic introduction. Did your results match your hypotheses? Are there any reasons why your results might not be accurate? What future questions could be asked about your dataset?

Presentation

Everyone will present their projects on Tuesday, December 5. Your presentation, using PowerPoint, Prezi, Google Slides, or some other appropriate presentation medium, will be at most 5 minutes long. You should not present your Kaggle notebook itself; instead, extract key portions and data from your writeup into a formal presentation.

Your presentation should include four slides that follow this structure:

  • A quick summary of the context of the dataset
  • What are the questions you are investigating?
  • What are some visualizations and summary statistics?
  • What conclusions did you reach through inference techniques?

What to turn in

You should turn in two things on Teams:

  • Your presentation slides (if your slides are on Prezi or Google Slides or some other cloud-based system, just submit the URL).
  • A link to your public Kaggle notebook.