Predicting Heart Attacks with Machine Learning

Alec Garza
5 min read · Apr 25, 2021


Heart attacks are one of the most pervasive causes of death in the United States. According to the CDC, someone has a heart attack every 40 seconds. In addition to their frequency, 1 in 5 heart attacks is silent, meaning the damage is done without the individual ever being aware of it.

Thanks to advances in machine learning, we can deploy tools that estimate an individual's likelihood of suffering a heart attack in their lifetime. By understanding a person's risk level and the factors that drive it, we can better treat, prevent, and rehabilitate people who are at risk.

We will build a simple classifier model using the Heart Attack Analysis & Prediction Dataset from Kaggle.

Features of the dataset include:

  • age — age of the patient
  • sex — sex of the patient
  • exng — exercise-induced angina (1 = yes; 0 = no)
  • caa — number of major vessels (0–3)
  • cp — chest pain type
  • trtbps — resting blood pressure
  • chol — cholesterol
  • fbs — fasting blood sugar
  • restecg — resting electrocardiographic results
  • thalachh — maximum heart rate achieved
  • oldpeak — ST depression induced by exercise relative to rest
  • slp — slope of the peak exercise ST segment
  • thall — thalassemia type
  • output — 0 = lower chance of a heart attack; 1 = higher chance of a heart attack

We will start off by importing the necessary libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn.preprocessing import RobustScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

Next, we will write a couple of helper functions to use later on

def make_distr_plot(dataset, X, title):
    # Plot the distribution of a single column with seaborn.
    sns.displot(data=dataset, x=X)
    plt.xticks(rotation=90)
    plt.title(title)
    plt.show()

def get_score(model, X_train, X_test, y_train, y_test):
    # Fit a model and return its accuracy on the test set.
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

def unique_val_dict(df):
    # Count the unique values in each column, returned as a DataFrame.
    counts = {}
    for col in list(df.columns):
        counts[col] = df[col].value_counts().shape[0]
    return pd.DataFrame(counts, index=["unique count"]).transpose()

We will now load our data and begin inspecting it

heart_df = pd.read_csv('/content/heart.csv')
print("Dataset overview")
print(heart_df.head())

print("Dataset Statistical Overview")
print(heart_df.describe())

The next step is to check for any missing values

print('Total Missing Values in Each Column')
print(heart_df.isnull().sum())
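As it turns out, this dataset has no missing values, so no imputation is needed. If it did, one simple option would be to fill numeric gaps with the column median; a minimal, hypothetical sketch:

# Hypothetical: fill missing numeric values with each column's median.
# Not needed here, since this dataset has no missing entries.
heart_df = heart_df.fillna(heart_df.median(numeric_only=True))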

Let us now inspect the data types we will be working with. This will tell us whether we need to encode any categorical variables or perform other feature engineering tasks later on

print('Checking Datatypes')
print(heart_df.dtypes)

Next, we’ll check for and remove duplicate rows

heart_df[heart_df.duplicated()]
heart_df.drop_duplicates(inplace=True)

Finally, we’ll use our unique_val_dict() helper function to count the unique values in each column

unique_vals = unique_val_dict(heart_df)
print(unique_vals)

Next, we’ll separate our categorical and continuous variables

categorical_cols = ['sex', 'exng', 'caa', 'cp', 'fbs', 'restecg', 'slp', 'thall']
continuous_cols = ['age', 'trtbps', 'chol', 'thalachh', 'oldpeak']

We’ll use another one of our helper functions to quickly visualize the distribution of our continuous features

for col in continuous_cols:
    make_distr_plot(heart_df, col, col + " distribution")

These graphs show the distribution of our continuous features. We could take the exploratory analysis further and graph these variables against the target output, which would visualize which features have the greatest influence on the likelihood of a heart attack.

An example would be to plot the overall count for each sex alongside the count of positive heart attack outcomes for each sex, like so

sns.countplot(data=heart_df, x='sex')
plt.title("Total Count by Sex")
plt.show()

sns.countplot(data=heart_df.loc[heart_df['output'] == 1], x='sex')
plt.title("Heart Attacks by Sex")
plt.show()

For the sake of simplicity in this post, and to provide you with a challenge, I will leave further and deeper data visualization up to you.
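If you want a starting point, a correlation heatmap is a quick way to see how each feature relates to the others and to the target. A minimal sketch using seaborn (the figure size and color map are just suggestions):

# Correlation heatmap of all features, including the 'output' target.
plt.figure(figsize=(10, 8))
sns.heatmap(heart_df.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Feature Correlation Heatmap")
plt.show()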

Next, we will begin transforming the data to get it ready for our machine learning models.

We will first make a copy of our dataset; this copy is what we will transform and use to train our models.

After making a copy of the dataset, we will encode our categorical variables using pd.get_dummies() from pandas. Even though our categorical data is already numeric, encoding it with get_dummies() turns each category into an indicator (dummy) variable. Indicator variables take the value 0 or 1 to mark the presence or absence of a category that can shift the outcome we are predicting.
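To see concretely what get_dummies() does, here is a toy example on a made-up column (not part of the actual pipeline):

# Toy illustration of pd.get_dummies with drop_first=True.
demo = pd.DataFrame({'cp': [0, 1, 2, 1]})
print(pd.get_dummies(demo, columns=['cp'], drop_first=True))
# cp=0 becomes the baseline (all indicator columns zero);
# cp_1 and cp_2 are indicators marking the other two categories.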

After that, we will use RobustScaler() to transform our continuous variables. The robust scaler removes the median and scales the data according to the interquartile range (IQR). This is a form of standardization. Standardization typically removes the mean and scales to unit variance; however, outliers can heavily skew both the mean and the variance, and RobustScaler() sidesteps this problem with its median/IQR approach.
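In other words, each value x is mapped to (x − median) / IQR, where IQR = Q3 − Q1. A quick sketch on a tiny made-up array with an outlier, to confirm the formula against the scaler's output:

import numpy as np
from sklearn.preprocessing import RobustScaler

# Tiny made-up sample; the 100.0 is an outlier.
vals = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
scaled = RobustScaler().fit_transform(vals)

# Manual check: (x - median) / IQR, with IQR = Q3 - Q1.
q1, q3 = np.percentile(vals, [25, 75])
manual = (vals - np.median(vals)) / (q3 - q1)
print(np.allclose(scaled, manual))  # True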

Finally, we will split our dataset into training and test sets, which will be fed into our models.

Here is the code for these actions:

dfcopy = heart_df.copy()  # copy so transformations don't modify heart_df
dfcopy = pd.get_dummies(dfcopy, columns=categorical_cols, drop_first=True)
y, X = dfcopy['output'], dfcopy.drop('output', axis=1)
X[continuous_cols] = RobustScaler().fit_transform(X[continuous_cols])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Finally, we will build our models and score them with our get_score() helper function. Since this is a classification problem, we will try a Support Vector Machine, Logistic Regression, and a Random Forest classifier.

print("Support Vector Classifier Score:", get_score(SVC(), X_train, X_test, y_train, y_test)*100)print("Logistic Regression Score:",get_score(LogisticRegression(), X_train, X_test, y_train, y_test)*100)print("Random Forest Classifer Score:",get_score(RandomForestClassifier(), X_train, X_test, y_train, y_test)*100)

The Logistic Regression model appears to have the best performance with almost 89% accuracy!
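Keep in mind that a single 80/20 split gives a somewhat noisy estimate; the exact numbers will shift with random_state. As a sanity check (not part of the original pipeline), you could average over several folds with cross_val_score:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation on the full (scaled) feature matrix gives a
# more stable accuracy estimate than a single train/test split.
# max_iter is raised to help logistic regression converge.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("CV accuracy: %.2f +/- %.2f" % (scores.mean(), scores.std()))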

Check out the source code on my GitHub.
