In this lab, you will refresh your knowledge of Python, upload data to Kaggle, and get some practice using pandas. You will also learn a great deal about how to determine the age of lake trout in Alaska using otolith growth measurements.
The data you’ll need can be downloaded as a zip file here. An abstract of a paper published using this data can be found here. (This is not essential to the lab, but you may be curious to learn more about trout!)
To analyze our data this semester, we will be using Kaggle, which hosts Jupyter Notebooks. First, you will need to create a Kaggle account.
Next, make a New Notebook using the large + icon. Kaggle can run both R and Python notebooks, so make sure you are creating a Python notebook. If the notebook started in R, find the Session options section on the right-hand side and change the language to Python.
Download the archive and unzip it to find three files: the trout data as a .csv, plus the metadata files described below.
The included metadata files (.html or .xml) can be thought of as a data dictionary, which typically provides critical context for the data you are looking to analyze. Every time you begin working with a new, unfamiliar dataset, your first question should always be: Where is the data dictionary? Once the data is loaded, compare what the metadata says about each column against the output of df.dtypes.
Now you should upload the data to Kaggle. On the right-hand side of the notebook, find the Input section, then click Upload, and finally New Dataset. Pick the .csv file for the trout data and give the dataset a name. This should make the data available to you in the notebook under the /kaggle/input/ directory.
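Once the upload finishes, you can confirm exactly where the file landed. The snippet below (essentially the same boilerplate Kaggle places at the top of every new notebook) lists everything under /kaggle/input:

    import os

    # Walk the read-only input directory and print every file Kaggle has mounted,
    # so you can copy the exact path to your uploaded .csv.
    for dirname, _, filenames in os.walk('/kaggle/input'):
        for filename in filenames:
            print(os.path.join(dirname, filename))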
Load the data into a pandas.DataFrame and begin exploring its contents, for example with .head() and .tail().
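A minimal sketch of that step, assuming a placeholder dataset name and file name (substitute whatever path the listing from /kaggle/input shows for your upload):

    import pandas as pd

    # Placeholder path -- replace with the actual path to your uploaded file.
    df = pd.read_csv('/kaggle/input/your-dataset-name/trout_data.csv')

    print(df.head())   # first five rows
    print(df.tail())   # last five rows
    print(df.dtypes)   # how pandas interpreted each column; compare against the metadata

In a notebook you would normally put each of these in its own cell so the output renders as a table rather than plain text.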
Perform some common descriptive statistics on the data: run .describe() on the DataFrame, and use .value_counts() to look at the year, age, and lake columns. What conclusions can you draw from steps 1-5 about the data?
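For example (the column names below are the ones referenced above; adjust them if the data dictionary spells them differently):

    print(df.describe())              # summary statistics for the numeric columns

    # How many rows fall into each value of a column:
    print(df['year'].value_counts())  # fish sampled per year
    print(df['age'].value_counts())   # distribution of estimated ages
    print(df['lake'].value_counts())  # fish observed per lake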
For this section of the lab, you will write out a CSV file that has two columns: lake and fish_count. (Hint: .groupby('lake').count() is a good place to start.) This CSV file should be turned in with your notebook and raw data.
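One way to get there, building on the hint; .size() is used here instead of .count() so that the result is a single count column, and the output file name is just a placeholder:

    # Count the rows (fish) observed in each lake and write a two-column CSV.
    fish_count = (
        df.groupby('lake')
          .size()                          # one count per lake
          .reset_index(name='fish_count')  # turn the group key back into a 'lake' column
    )
    fish_count.to_csv('lake_fish_count.csv', index=False)

Files written this way land in the notebook's working directory (/kaggle/working), from which they can be downloaded for submission.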
As you can (hopefully) appreciate by now, doing all of this analysis in pure Python (sans pandas) would be a daunting task even for the most savvy Pythonistas. To drive this point home, please re-do step 1.3.4 (for the lake column, run .value_counts()) using only the Python standard library. In other words, compute how many times each lake appears in the CSV file. You’ll need to load the CSV file using the csv module and store the results in a data structure. Hint: a dictionary is a good data structure to use for this.
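A sketch of the standard-library version, assuming the raw file has a column literally named lake and reusing the placeholder path from earlier:

    import csv

    # Placeholder path -- point this at the same raw .csv you uploaded.
    DATA_PATH = '/kaggle/input/your-dataset-name/trout_data.csv'

    lake_counts = {}  # maps lake name -> number of fish observed in that lake
    with open(DATA_PATH, newline='') as f:
        for row in csv.DictReader(f):
            lake = row['lake']
            lake_counts[lake] = lake_counts.get(lake, 0) + 1

    print(lake_counts)

collections.Counter would shorten this further, but a plain dictionary keeps every step of the counting explicit, which is the point of the exercise.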
As you work on this lab, record all of your progress in a Jupyter notebook. Record your solutions and attempts in Code blocks, and annotate what you did with Markdown blocks. Cite all of the webpages you find and use in your search for a solution. You should turn in this notebook, all of the data you used, and the CSV file produced in Step 2. A good solution should read like a self-contained report.
Complete: Notebook can be read easily without needing to reference this page for additional detail. It should read like a self-contained report. It can be executed without producing runtime errors. All steps (1, 2, and 3) are finished and all discussion questions are answered. All data loaded into the notebook and the CSV file produced in Step 2 should be provided.
Partially complete: Notebook can be executed without producing runtime errors. All steps (1, 2, and 3) are attempted.