Exam 1: Data Analysis & Visualization

In this exam, you will demonstrate your mastery of CSCI 285 Module #1 concepts in three parts. Part #1 focuses on generating data and visualizing it with pandas and seaborn. Part #2 focuses on working with an unknown dataset and visualizing it. Part #3 asks you to take the dataset from Part #2 and perform KMeans clustering and PCA decomposition.

Part 1 - 50 points

(10 points) Using Python and/or pandas, create a data frame that has the following schema:

id	age	income
0	25	30,000
1	26	32,000
…	…	…
9998	28	29,000
9999	21	41,000

The data contained in this data frame should satisfy the following constraints:

10k rows of data.
id should be incrementing integers (e.g. 0 - 9,999, as in the table above). This does not need to be a column but can serve as the index of the data frame.
age should be randomly generated by sampling the “standard normal” distribution. Ages should approximately fall between 20 and 30 years old and have a mean of 25. (It's fine if your numbers look slightly different. The goal here is to generate a normal distribution of ages that looks realistic.)
income should be generated using the same method as age but with income approximately falling between 25k and 45k with a mean of 35k. Again, it is fine if your numbers vary a bit as long as income is normally distributed.
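
One way to read the constraints above is to shift and scale standard normal samples. The sketch below is a minimal illustration under stated assumptions, not the expected solution: it reads "approximately between 20 and 30" as the mean plus or minus three standard deviations, and the seed and variable names are arbitrary choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)  # arbitrary seed, only for reproducibility
n_rows = 10_000

# Assumption: the stated ranges are treated as mean +/- 3 standard deviations,
# giving sigma = 5/3 for age and 10_000/3 for income.
age = 25 + (5 / 3) * rng.standard_normal(n_rows)
income = 35_000 + (10_000 / 3) * rng.standard_normal(n_rows)

# The default RangeIndex (0..9999) serves as the id.
df = pd.DataFrame({"age": age, "income": income})
df.index.name = "id"
```

A different spread is equally acceptable as long as the result is normally distributed around the requested means; rounding the values to match the table is optional.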

Warning: If you cannot complete the data generation step and want to forfeit the points in order to work on 1.1 through 1.5, then please see me.

(5 points) Display the minimum, maximum, and average (mean) values of age and income.
(10 points) Draw two histograms, using seaborn, in order to show that your data is normally distributed. Make sure that your axes are labeled and that you set a title for each chart. Increase the size and aspect ratio to improve visibility.
(10 points) Draw a scatter plot, using seaborn, that plots age vs. income. Make sure that your axes are labeled and that you set a title. Increase the size and aspect ratio to improve visibility.
(10 points) Create a new column on your data frame, called sector, that contains one of four values: “jazz”, “puck”, “camp”, and “flub”. Draw another scatter plot similar to the one you created in 1.3, except that “sector” is displayed using color.

Assign “jazz” if age > 25 and income > 35k
Assign “puck” if age < 25 and income > 35k
Assign “camp” if age > 25 and income < 35k
Assign “flub” to all remaining points
(5 points) Display the number of points in each sector where income is greater than 40k. If you were unable to do 1.4, then simply display the number of points where income is greater than 40k (sans sector).
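
The sector rules can be checked by eye on a handful of rows. The sketch below uses a tiny hand-made frame with hypothetical values (the name `demo` is not part of the exam) and `numpy.select`, which is one of several ways to express the rules. Note that a row with age exactly 25 matches none of the first three rules and falls through to “flub”.

```python
import numpy as np
import pandas as pd

# Hypothetical rows chosen so that each rule is exercised at least once.
demo = pd.DataFrame({
    "age": [27, 23, 28, 22, 25],
    "income": [41_000, 42_000, 30_000, 31_000, 45_000],
})

# Conditions are evaluated in order; rows matching none get the default.
conditions = [
    (demo["age"] > 25) & (demo["income"] > 35_000),  # jazz
    (demo["age"] < 25) & (demo["income"] > 35_000),  # puck
    (demo["age"] > 25) & (demo["income"] < 35_000),  # camp
]
demo["sector"] = np.select(conditions, ["jazz", "puck", "camp"], default="flub")

# Number of points per sector among rows with income above 40k.
high_income_counts = demo.loc[demo["income"] > 40_000, "sector"].value_counts()
```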

Part 2 - 50 points

The iris flower dataset is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper “The use of multiple measurements in taxonomic problems”. It is sometimes called Anderson’s Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.

(5 points) Load iris.csv into a pandas data frame. Record the shape of the data frame.
(5 points) Display the first 10 rows of the dataset. Display the last 5 rows of the dataset.
(5 points) Display the count of each value in the species column.
(5 points) Display the data types of the columns.
(10 points) Draw a box plot, using seaborn, of iris petal width (cm) vs. species. Make sure that your axes are labeled and that you set a title. Increase the size and aspect ratio to improve visibility. What differences do you notice in this plot?
(10 points) Draw a pairwise plot, using seaborn, of the four numerical columns and the one categorical column. Make sure to color by the categorical column. What do you notice about the dataset through analyzing this array of charts?
(10 points) Draw a scatter plot that includes linear regressions, using seaborn, of two features that appear linearly related. Color by species. Increase the size and aspect ratio to improve visibility. Make sure that your axes are labeled and that you set a title. What conclusions can you draw from this plot?
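
The inspection steps at the start of this part can be sketched as follows. This is an illustration under an assumption: it uses the copy of the iris data bundled with scikit-learn as a stand-in, so its column names may differ from those in iris.csv. With the exam file, the frame would come from `pd.read_csv("iris.csv")` instead.

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled iris data stands in for iris.csv.
iris = load_iris(as_frame=True)
df = iris.frame.rename(columns={"target": "species"})
# Replace the integer class codes with the species names.
df["species"] = df["species"].map(dict(enumerate(iris.target_names)))

print(df.shape)                      # (rows, columns)
print(df.head(10))                   # first 10 rows
print(df.tail(5))                    # last 5 rows
print(df["species"].value_counts())  # count of each species
print(df.dtypes)                     # data type of each column
```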

Part 3 - 50 points

Clustering & PCA - 30 points

Continuing with the iris flower data set from Part 2, scale your features and use KMeans to cluster your data into 3 clusters. Use PCA to decompose your features into two dimensions. Draw a scatter plot for your two PCA dimensions and color the plot using the clustering results. Discuss the results from creating this chart.
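
The pipeline described above (scale, cluster, decompose) can be sketched with scikit-learn as follows. This is a minimal illustration, not the expected notebook: the bundled `load_iris` data stands in for iris.csv, and `n_init` and `random_state` are arbitrary choices made for repeatability. Plotting is left out.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: stands in for the four numeric columns of iris.csv.
X = load_iris().data

# Standardize each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# Cluster the scaled features into 3 groups.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

# Project the scaled features onto the first two principal components.
components = PCA(n_components=2).fit_transform(X_scaled)
```

The two columns of `components` are the coordinates for the scatter plot, and `labels` supplies the color of each point.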

Discuss why k=3 clusters is the right choice. How can you show that k=3 is the right choice using the principle of inertia?
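
Inertia is the sum of squared distances from each point to its nearest cluster centroid, and it can be recorded for a range of k values. The sketch below is one hedged way to collect those values. It again assumes scikit-learn's bundled iris data in place of iris.csv, and the range of k and the `random_state` are arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Assumption: bundled iris data stands in for iris.csv.
X_scaled = StandardScaler().fit_transform(load_iris().data)

inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    # inertia_: sum of squared distances to the nearest centroid
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

Plotting these values against k and looking for the point where the decrease levels off is the usual way to argue for a particular k.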

Discussion Questions - 20 points

What is feature scaling and why is it important to do as a preprocessing step?
What is meant by an n-dimensional Euclidean space? Is the iris dataset an example of a Euclidean space? What’s an example of a non-Euclidean space that we discussed in class or examined in lab? Why does it matter?
Qualitatively, what is the difference between L1 and L2 distance measures?
What is meant by the “curse of dimensionality”? Does the iris dataset suffer from this “curse”?
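
For reference, the two distance measures named above can be computed for a pair of points as in the sketch below. The points are arbitrary illustrative values, and the snippet shows only the definitions, not the qualitative discussion the question asks for.

```python
import numpy as np

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

# L1 (Manhattan) distance: sum of absolute coordinate differences, 3 + 4 = 7.
l1 = np.sum(np.abs(a - b))

# L2 (Euclidean) distance: square root of the summed squared differences,
# sqrt(9 + 16) = 5.
l2 = np.sqrt(np.sum((a - b) ** 2))
```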

What To Turn In

A Jupyter notebook that begins with the following statement:

All of the below work is my own. I adhered to the test-taking procedure by not receiving any help from my peers. I have cited all resources I found online or from notebooks shared from class that helped me complete this exam.

Please print (sign) your name and date.
Please label each Part of the exam using markdown headers.
Turn in a zip file that contains your notebook and any data needed to run the notebook.

Grading

Complete - Earn (at least) 90% of the points.
Partially Complete - Earn (at least) 70% of the points.