Cardiovascular Disease Data Analysis and Prediction

The analysis is centered around the prediction of presence of cardiovascular disease, given data about patients. The data includes different categorical and numerical variables commonly associated to an individual’s health status. After a brief data exploration, we perform first cluster analysis on the dataset and then fit different types of models: Trees and Random Forest, Logistic Regression and Natural Splines. Finally, we discuss some issues with the data, considering its unknown origin and unspecified gathering methods, and hint at how this work could possibly be improved.

More specifically, the following steps are performed:

Data exploration
- Preprocessing
- Distribution analysis
- Correlation analysis
K-Means Clustering
Trees & Random Forest
Logistic Regression
- Base Model
- Models with interection between variables
- Polynomial Regressions
- LASSO Regression
Natural Splines
Results overview

Cardiovascular Disease Data Analysis and Prediction

Collaborators