Predicting Diabetes from Glycohemoglobin Data Using Machine Learning

Wuraola Oyewusi
5 min read · Apr 30, 2019


AIM: This article shows how to use Machine Learning on health-related datasets. I will attempt to give the rationale for most of my decisions, and I hope it helps spark the interest of more people in the health sciences to think of Machine Learning algorithms when thinking of solutions.

REQUIREMENT: An interactive notebook (I recommend Google's Colab notebook https://colab.research.google.com), a basic understanding of Python programming, and a basic understanding of Machine Learning.

The model built will be a classification model that learns from selected features and predicts whether a client is Normal, Pre-diabetic or Diabetic.

Glycohemoglobin (HbA1c, A1c) is a measure of how much glucose is bound to hemoglobin in red blood cells (over the preceding 2–3 months). People with diabetes or other conditions that raise blood glucose have higher A1c values.

Normal: <5.7%

Pre-diabetes: 5.7–6.4%

Diabetes: ≥6.5%

THE DATASET: The original dataset has about 6800 rows and 20 columns, obtained from http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets

Only a subset of the columns (9 out of 20) will be used for this analysis.

The dataset features are explained at http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/nhgh.html.

FEATURE SELECTION: The features selected were seqn (unique identifier), sex, age, BMI, waist circumference, glycohemoglobin, albumin, serum creatinine (SCr) and blood urea nitrogen (BUN). I made these decisions based on domain knowledge, e.g. BMI already captures the relationship between height and weight, a high waist circumference is a risk factor for developing diabetes, and albumin, SCr and BUN can be used to assess kidney function.
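Loading the data and keeping only those columns can be done with pandas. A minimal sketch; the file name and column names here follow the nhgh data dictionary and may need adjusting:

```python
import pandas as pd

# Load the dataset; the file name here is illustrative.
df = pd.read_csv("nhgh.tsv", sep="\t")

# Keep only the 9 columns used in this analysis
# (names per the nhgh data dictionary; adjust if yours differ).
selected = ["seqn", "sex", "age", "bmi", "waist", "gh", "albumin", "SCr", "bun"]
df = df[selected]
```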

FEATURE ENGINEERING: Machine learning algorithms require data to be in numerical formats to work well, so I will create a column (gh_int) with the glycohemoglobin values assigned to three integer classes (Normal: 0, Pre-diabetes: 1, Diabetes: 2).
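A minimal sketch of that mapping, assuming the glycohemoglobin column is named gh and using the thresholds above:

```python
def gh_to_class(gh):
    """Map a glycohemoglobin (HbA1c) value to an integer class."""
    if gh < 5.7:
        return 0  # Normal
    elif gh < 6.5:
        return 1  # Pre-diabetes
    else:
        return 2  # Diabetes

df["gh_int"] = df["gh"].apply(gh_to_class)
```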

I will also create a column (age_range); the code below shows how to bin the ages into age ranges, which makes working with age easier.
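The original code embed did not survive here, so this is a sketch of one way to do the binning with pandas (the exact bin edges and labels are assumptions):

```python
# Bin ages into decade-wide ranges; edges and labels are illustrative.
bins = [0, 20, 30, 40, 50, 60, 70, 80, 120]
labels = ["<20", "20-29", "30-39", "40-49", "50-59", "60-69", "70-79", "80+"]
df["age_range"] = pd.cut(df["age"], bins=bins, labels=labels, right=False)
```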

EXPLORATORY DATA ANALYSIS: I will do some visualization and see what we can learn about the dataset.
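The plots summarised below were made in the notebook; here is a minimal seaborn sketch that produces similar count plots, assuming the columns created earlier:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Count plots for sex, age range and the glycohemoglobin classes.
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
sns.countplot(x="sex", data=df, ax=axes[0])
sns.countplot(x="age_range", data=df, ax=axes[1])
sns.countplot(x="gh_int", data=df, ax=axes[2])
plt.tight_layout()
plt.show()
```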

There is a good balance between male and female data.

This particular dataset seems to have the most data in the single age ranges covering people in their 20s–30s. Most cases of type 2 diabetes are diagnosed in people aged 40 and above; if the age ranges above 40 are summed up, the dataset is a good representation of what is obtainable.

Many people in this dataset have normal glycohemoglobin values, so we will test how well our model can learn and generalize given this imbalance.

A correlation map for the final selected features
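Such a correlation map can be reproduced with seaborn's heatmap; a minimal sketch over the numeric features (the styling of the original figure may differ):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlations between the numeric features and the target classes.
corr = df[["age", "bmi", "waist", "gh", "albumin", "SCr", "bun", "gh_int"]].corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation map for the selected features")
plt.show()
```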

The above gives an idea of what to do in EDA; there is so much more that can be explored.

When I tried to compare algorithm performance using cross validation, I ran into the error described below.

The error showed that there were missing values in my data. One must always check for missing values. On investigation, Python was right and I was wrong!

I dropped the rows with missing values and that fixed it. There are other methods of dealing with missing values, such as imputation.
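A sketch of the check and the fix described above:

```python
# Check how many values are missing in each column.
print(df.isnull().sum())

# Drop rows with any missing value (imputation is an alternative).
df = df.dropna()
```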

I split the dataset into a 33% test set (which will be hidden from the algorithm) and a 67% train set that the algorithms will train on. I also scaled the data using MinMaxScaler because the features are on different ranges. These are all readily available in the sklearn library.
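A minimal sketch of the split and scaling; the exact feature list and random seed are assumptions on my part:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Assumed feature matrix and target: the selected numeric columns vs. gh_int.
X = df[["age", "bmi", "waist", "gh", "albumin", "SCr", "bun"]]
y = df["gh_int"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Scale every feature to the 0-1 range.
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```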

I used cross validation to compare the performance of about 7 Machine Learning algorithms on the data.
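A sketch of such a comparison; the exact set of seven algorithms in the original notebook may differ from the candidates chosen here:

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from xgboost import XGBClassifier  # needs the xgboost package (preinstalled on Colab)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(),
    "XGBoost": XGBClassifier(),
}

# 10-fold cross validation accuracy on the training set for each candidate.
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=10, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```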

The algorithms with the highest accuracy were Random Forest, Decision Tree and XGBoost; any of these could be used in the final model. It's proper to compare algorithm performance, but the decision of which algorithm to use in the end can be based on many factors such as familiarity, sensitivity to scaling, compute power, ease of interpretation, etc.

Build Decision Tree Model
Plot of Confusion Matrix for Decision Tree Model
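The code and confusion matrix plot were embedded in the original post; here is a minimal sketch of this step, assuming the scaled split from above (the notebook on GitHub has the exact code):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Train the decision tree and evaluate on the held-out test set.
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)

labels = ["Normal", "Pre-diabetes", "Diabetes"]
print(classification_report(y_test, y_pred, target_names=labels))

# Plot the confusion matrix as a heatmap.
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=labels, yticklabels=labels)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix for Decision Tree Model")
plt.show()
```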

The Decision Tree model had a perfect score. I wonder how it will perform if the data set is larger.

I will try Logistic Regression, which doesn't have an accuracy as high as the Decision Tree for this data, to show what its metrics will look like.

Plot of Confusion Matrix for Logistic Regression Model
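Again, the original embed is not shown here; a sketch of the equivalent logistic regression step under the same assumptions:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Fit logistic regression on the same scaled training data.
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
y_pred_lr = logreg.predict(X_test)

labels = ["Normal", "Pre-diabetes", "Diabetes"]
cm = confusion_matrix(y_test, y_pred_lr)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=labels, yticklabels=labels)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix for Logistic Regression Model")
plt.show()
```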

This is an example of how to build Machine Learning models.

I hope this was simple and easy to follow.

The code is on GitHub: https://github.com/WuraolaOyewusi/Predicting-Diabetes-from-Glyheamoglobin-Data-Using-Machine-Learning
