Imagine the Cleveland Clinic has tasked you with helping them train new doctors to diagnose patients with heart disease. While reviewing past patient histories, they noticed some patterns after narrowing their focus to a few key features of each patient. They also provide more detailed information about these features in the form of a data dictionary.
You will be analyzing past patient histories to understand patterns of heart disease. Create a Jupyter notebook and read cleveland-testing.csv into a pandas data frame.
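A minimal sketch of this step, assuming `cleveland-testing.csv` sits in the same directory as the notebook:

```python
import pandas as pd

# Load the past patient history into a data frame
df = pd.read_csv("cleveland-testing.csv")
df.head()  # quick sanity check of the first few rows
```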
Explore the data frame with `.dtypes`, `.describe()`, and `.value_counts()`: check each column's type with `.dtypes`, summarize the numerical features with `.describe()`, and tabulate the categorical features `chest_pain`, `rest_ecg`, and `heart_disease` with `.value_counts()`.
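For example, a sketch of the exploration cells (only the three categorical columns named above are known here; the numerical columns come from your data dictionary):

```python
# Column types: which features pandas parsed as numeric vs. object/categorical
print(df.dtypes)

# Summary statistics (count, mean, std, quartiles) for the numerical features
print(df.describe())

# Frequency tables for the categorical features
for col in ["chest_pain", "rest_ecg", "heart_disease"]:
    print(df[col].value_counts(), "\n")
```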
Discuss any patterns or other interesting traits you notice from your numerical and categorical analysis.
Run K-Means using the five numerical features as input. Make sure to drop any NaN values in these features (garbage in, garbage out), and remember to scale your features so that features with larger ranges do not dominate the distance calculation. Store two new columns in your original data frame: `cluster` and `prediction`; one way to produce them is sketched below.
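A sketch of this step. The names in `numeric_cols` are placeholders to be replaced with the five numerical features from your data dictionary, and mapping each cluster to its majority ground-truth label is one reasonable way (an assumption, not the only option) to turn arbitrary cluster ids into predictions:

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Placeholder names: substitute the five numerical features from the data dictionary
numeric_cols = ["feat_1", "feat_2", "feat_3", "feat_4", "feat_5"]

# Garbage in, garbage out: drop rows with NaNs in the input features
df = df.dropna(subset=numeric_cols)

# Standardize so no feature dominates the Euclidean distance by sheer scale
x = StandardScaler().fit_transform(df[numeric_cols])

# Two clusters, hoping they align with disease / no disease
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
df["cluster"] = kmeans.fit_predict(x)

# Cluster ids are arbitrary (0/1 may be swapped), so map each cluster to the
# majority heart_disease label among its members to get a prediction
majority = df.groupby("cluster")["heart_disease"].agg(lambda s: s.mode().iloc[0])
df["prediction"] = df["cluster"].map(majority)
```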
Produce a confusion matrix (2x2) and an associated heatmap that compares ground truth values (i.e. `heart_disease`) against predicted values. Discuss the outcome of running K-Means on this dataset and explain it in terms of the four categories: true positives, false positives, true negatives, and false negatives.
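A sketch using scikit-learn and seaborn, assuming `heart_disease` is coded 0 = no disease, 1 = disease:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Rows are ground truth, columns are predictions
cm = confusion_matrix(df["heart_disease"], df["prediction"])

# Annotated heatmap: the four cells are the true/false positives/negatives
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["no disease", "disease"],
            yticklabels=["no disease", "disease"])
plt.xlabel("Predicted")
plt.ylabel("Ground truth")
plt.title("K-Means predictions vs. ground truth")
plt.show()
```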
Use PCA to find the first two principal components. The variance of each component measures its “importance” in the decomposition. Running `pca = PCA(n_components=2).fit(x)` fits the decomposition, `pca.explained_variance_` returns the variance of each component, and `pca.transform(x)` rotates your input features into PCA-1/PCA-2 space.
Describe how much variance is explained by these components.
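A sketch that reuses the scaled feature matrix `x` from the K-Means step. `explained_variance_ratio_` (also part of scikit-learn's PCA) reports each component's share of the total variance, which directly answers the question above:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Fit the first two principal components on the scaled features
pca = PCA(n_components=2).fit(x)
print(pca.explained_variance_)        # absolute variance per component
print(pca.explained_variance_ratio_)  # fraction of total variance per component

# Rotate the inputs into PCA-1/PCA-2 space and color by cluster assignment
x_pca = pca.transform(x)
plt.scatter(x_pca[:, 0], x_pca[:, 1], c=df["cluster"], s=15, cmap="coolwarm")
plt.xlabel("PCA-1")
plt.ylabel("PCA-2")
plt.title("Patients in PCA space, colored by K-Means cluster")
plt.show()
```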
How would you describe the results of this experiment? What would you recommend to the doctors who tasked you with this problem? What could be done to improve your model's outcome? Assuming you cannot build a perfect model, how should you optimize your model with respect to the confusion matrix analysis?
As you work on this lab, record all of your progress in a Jupyter notebook. Record your solutions and attempts in Code cells, and annotate what you did in Markdown cells. Cite all the webpages you find and use in your search for a solution. Turn in this notebook along with all of the data you used. A good solution should read like a self-contained report.
Complete: Notebook can be read easily without needing to reference this page for additional detail; it should read like a self-contained report. It can be executed without producing runtime errors. All steps (1, 2, 3, 4, and 5) are finished and all discussion questions are answered. All data loaded into the notebook should be provided.
Partially complete: Notebook can be executed without producing runtime errors. All steps (1, 2, 3, 4, and 5) are attempted.