Cardiovascular Disease Data Analysis and Prediction

Cardiovascular Disease Data Analysis and Prediction

Analysis centered around the prediction of presence of cardiovascular diseases. Data is preprocessed and various analysis are performed over it. Multiple models are applied: K-Means Clustering, Random Forest, Logistic Regression and Natural Splines. Finally, a result overview and comparison is provided.

Data AnalysisMachine LearningHealthcare
Rcaretglmnetranger

The analysis is centered around the prediction of presence of cardiovascular disease, given data about patients. The data includes different categorical and numerical variables commonly associated to an individual’s health status. After a brief data exploration, we perform first cluster analysis on the dataset and then fit different types of models: Trees and Random Forest, Logistic Regression and Natural Splines. Finally, we discuss some issues with the data, considering its unknown origin and unspecified gathering methods, and hint at how this work could possibly be improved.


More specifically, the following steps are performed:
  • Data exploration
    • Preprocessing
    • Distribution analysis
    • Correlation analysis
  • K-Means Clustering
  • Trees & Random Forest
  • Logistic Regression
    • Base Model
    • Models with interection between variables
    • Polynomial Regressions
    • LASSO Regression
  • Natural Splines
  • Results overview
View on GitHub

Collaborators

Giovanni Billo, Muhammad Mubashar Shahzad, Fabio Vicig