Predicting Diabetes Types: A Deep Learning Approach
Machine learning analysis of diabetes in young Indians: Deep learning vs. XGBoost (64.75% vs. 74% accuracy) using health and lifestyle data.
Diabetes has become a significant health concern in India, particularly among young adults. In this article, we'll explore a comprehensive analysis of diabetes prediction using machine learning techniques, working with a dataset that contains various health and lifestyle factors of young adults in India.
Understanding the Dataset
The dataset comprises 100,000 records with 22 features, including demographic information, health metrics, and lifestyle factors. The key features include age, gender, BMI, family history of diabetes, genetic risk scores, and various lifestyle indicators such as physical activity level, dietary habits, and sleep patterns. What makes this dataset particularly interesting is its focus on young adults and the inclusion of both Type 1 and Type 2 diabetes cases.
Here is the link to the Dataset.
Let's start by loading and examining our data:
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.model_selection import train_test_split
# Load the data
df = pd.read_csv("diabetes_young_adults_india.csv")
# Check basic information about the dataset
print(df.info())
Data Preprocessing Journey
Our preprocessing pipeline begins with handling categorical variables through ordinal encoding. Here's how we implement this:
# Define category orders for ordinal encoding.
# Each list lines up, in order, with an entry in columns_to_encode below.
categories_order = [
    ["No", "Yes"],                         # Family_History_Diabetes
    ["Unknown", "Type 1", "Type 2"],       # Parent_Diabetes_Type
    ["Sedentary", "Moderate", "Active"],   # Physical_Activity_Level
    ["Unhealthy", "Moderate", "Healthy"],  # Dietary_Habits
    ["No", "Yes"],                         # Smoking
    ["No", "Yes"],                         # Alcohol_Consumption
    ["No", "Yes"],                         # Prediabetes
    ["Unknown", "Type 1", "Type 2"]        # Diabetes_Type
]
columns_to_encode = [
    "Family_History_Diabetes", "Parent_Diabetes_Type",
    "Physical_Activity_Level", "Dietary_Habits", "Smoking",
    "Alcohol_Consumption", "Prediabetes", "Diabetes_Type"
]
# Handle missing values first
for col in columns_to_encode:
    df[col] = df[col].fillna("Unknown")
# Apply ordinal encoding
encoder = OrdinalEncoder(categories=categories_order)
df[columns_to_encode] = encoder.fit_transform(df[columns_to_encode])
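To make the mapping concrete, here is a minimal plain-Python sketch of what ordinal encoding with an explicit category order produces: each value is replaced by its position in the supplied list. The helper name `ordinal_encode` and the sample values are illustrative assumptions, not part of the original notebook. `OrdinalEncoder` returns the same positions as floats by default.

```python
def ordinal_encode(values, order):
    """Map each value to its index in `order` (hypothetical helper)."""
    position = {category: float(i) for i, category in enumerate(order)}
    return [position[v] for v in values]

# Same order as the Physical_Activity_Level entry in categories_order
activity_order = ["Sedentary", "Moderate", "Active"]
sample = ["Active", "Sedentary", "Moderate", "Active"]  # made-up values

encoded = ordinal_encode(sample, activity_order)
print(encoded)
```

Because the position carries meaning, this encoding suits ordered categories such as activity level; for unordered ones such as region, one-hot encoding avoids implying a ranking.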
For gender and region variables, we implement one-hot encoding:
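# One-hot encoding for nominal categories.
# dtype=float keeps the new columns numeric rather than boolean, so the
# feature matrix converts cleanly to a tensor later on.
df = pd.get_dummies(df, columns=["Gender", "Region"], dtype=float)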
We then standardize our numerical features:
# Standardize numerical features
scaler = StandardScaler()
num_columns = ["Age", "Family_Income", "Genetic_Risk_Score", "BMI",
               "Fast_Food_Intake", "Fasting_Blood_Sugar", "HbA1c",
               "Cholesterol_Level", "Sleep_Hours", "Stress_Level", "Screen_Time"]
df[num_columns] = scaler.fit_transform(df[num_columns])
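As a quick check on what standardization does, the sketch below reproduces the z-score calculation with NumPy on a few made-up BMI values (illustrative only, not taken from the dataset). `StandardScaler` uses the population standard deviation, which corresponds to NumPy's default `ddof=0`.

```python
import numpy as np

bmi = np.array([18.5, 22.0, 25.5, 30.0])  # made-up values

# z-score: subtract the column mean, divide by the column standard deviation
z = (bmi - bmi.mean()) / bmi.std()        # std() defaults to ddof=0

print(z.round(3))
print("mean:", z.mean().round(6), "std:", z.std().round(6))
```

After scaling, each column has mean 0 and standard deviation 1, which keeps features with large raw ranges (such as income) from dominating the gradient updates.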
Deep Learning Model Architecture
The neural network architecture is designed with careful consideration of the problem's complexity. Here's our implementation:
def create_diabetes_model(input_shape, num_classes):
    model = tf.keras.Sequential([
        # Input layer
        tf.keras.layers.Dense(64, activation='relu', input_shape=(input_shape,)),
        # Expansion layers
        tf.keras.layers.Dense(256, activation='relu'),
        tf.keras.layers.Dense(512, activation='relu'),
        tf.keras.layers.Dense(1024, activation='relu'),
        # Compression layers with dropout for regularization
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(32, activation='relu'),
        # Output layer
        tf.keras.layers.Dense(num_classes, activation='softmax')
    ])
    return model
# Separate features and target
X = df.drop(["ID", "Diabetes_Type"], axis=1)
y = df["Diabetes_Type"]
# Convert target to one-hot encoding
y_onehot = tf.keras.utils.to_categorical(y)
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y_onehot, test_size=0.2, random_state=123
)
# Initialize and compile the model
model = create_diabetes_model(X_train.shape[1], y_train.shape[1])
model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
Let's break down the architecture:
- Input layer (64 neurons): Strictly the first hidden layer, this is where the features enter the network. We use ReLU activation to introduce non-linearity.
- Expansion phase (256 → 512 → 1024 neurons): These layers progressively increase in size, allowing the model to learn increasingly complex patterns. This expansion helps the model capture intricate relationships in the data.
- Compression phase (128 → 64 → 32 neurons): After expansion, we gradually reduce the dimensionality, forcing the model to condense what it has learned into a compact representation. This acts loosely like a learned form of feature selection.
- Dropout layers (0.3 rate): We add dropout after the 128- and 64-neuron layers to reduce overfitting. During training, each dropout layer randomly zeroes 30% of the preceding layer's outputs, making the model more robust.
- Output layer (3 neurons with softmax): The final layer uses softmax activation to produce probability distributions across our three classes (Unknown/Type 1/Type 2).
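To get a feel for the size of this network, the sketch below counts the trainable parameters of each dense layer (weights plus biases). The helper `dense_param_counts` is hypothetical, and the figure of 30 input features is an assumption, since the exact column count after one-hot encoding depends on the dataset. Dropout layers add no parameters.

```python
def dense_param_counts(n_inputs, layer_sizes):
    """Return the weight-plus-bias count for each Dense layer in sequence."""
    counts = []
    fan_in = n_inputs
    for units in layer_sizes:
        counts.append(fan_in * units + units)  # weights + biases
        fan_in = units
    return counts

# Layer widths from create_diabetes_model; 30 input features is an assumption.
layer_sizes = [64, 256, 512, 1024, 128, 64, 32, 3]
counts = dense_param_counts(30, layer_sizes)

for units, n in zip(layer_sizes, counts):
    print(f"Dense({units}): {n:,} parameters")
print(f"Total: {sum(counts):,}")
```

Under this assumption, roughly 525,000 of the approximately 817,000 parameters sit in the single 512 → 1024 layer. Calling `model.summary()` on the compiled Keras model reports the actual figures for the real input width.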
Training the model:
history = model.fit(
    X_train, y_train,
    epochs=100,
    batch_size=64,
    validation_split=0.2,
    verbose=1
)
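A short back-of-the-envelope calculation shows how many samples each stage sees under these settings. It assumes all 100,000 records survive preprocessing, which the article does not confirm. Keras takes the `validation_split` fraction from the end of the training arrays, before any shuffling.

```python
total_rows = 100_000                       # assumed: no rows dropped earlier

test_rows = total_rows * 20 // 100         # train_test_split(test_size=0.2)
train_rows = total_rows - test_rows
val_rows = train_rows * 20 // 100          # validation_split=0.2
fit_rows = train_rows - val_rows

batch_size, epochs = 64, 100
steps_per_epoch = -(-fit_rows // batch_size)   # ceiling division
total_updates = steps_per_epoch * epochs

print(f"fit on {fit_rows:,} rows, validate on {val_rows:,}, test on {test_rows:,}")
print(f"{steps_per_epoch:,} steps per epoch, {total_updates:,} weight updates in total")
```

So only 64% of the data actually drives the weight updates; the rest is held out for validation and final testing.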
Model Performance Analysis
The neural network achieved an accuracy of 64.75% on the test set. Here's how we evaluate it:
from sklearn.metrics import classification_report
y_pred = model.predict(X_test)
y_pred_classes = np.argmax(y_pred, axis=1)
y_test_classes = np.argmax(y_test, axis=1)
print("\nClassification Report:")
print(classification_report(y_test_classes, y_pred_classes))
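The evaluation step above hinges on turning softmax probabilities into class labels with `argmax`. The sketch below walks through that conversion on four made-up predictions (not real model output) and builds a small confusion matrix with NumPy to show where the errors fall.

```python
import numpy as np

# Made-up softmax outputs for four samples over three classes
probs = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.30, 0.60],
    [0.20, 0.50, 0.30],
    [0.40, 0.35, 0.25],
])
true_classes = np.array([0, 2, 2, 1])

# Pick the most probable class for each row
pred_classes = np.argmax(probs, axis=1)
accuracy = np.mean(pred_classes == true_classes)

# Confusion matrix: rows are true classes, columns are predicted classes
confusion = np.zeros((3, 3), dtype=int)
np.add.at(confusion, (true_classes, pred_classes), 1)

print("predicted:", pred_classes)
print("accuracy:", accuracy)
print(confusion)
```

Accuracy alone can hide poor performance on rare classes, which is why the per-class precision and recall in the classification report are worth reading alongside it.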
To improve upon these results, we implemented an XGBoost classifier:
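from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
# XGBClassifier expects integer class labels, so convert the one-hot targets back
y_train_classes = np.argmax(y_train, axis=1)
xgb_model = XGBClassifier(
    n_estimators=20,
    eval_metric="mlogloss",
    use_label_encoder=False,
    random_state=123
)
xgb_model.fit(X_train, y_train_classes)
y_pred_xgb = xgb_model.predict(X_test)
print("XGBoost accuracy:", accuracy_score(np.argmax(y_test, axis=1), y_pred_xgb))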
The XGBoost model achieved a higher accuracy of 74%, suggesting that tree-based methods might be more suitable for this particular problem.
Model Enhancement Strategies
Several strategies could potentially improve the model's performance. First, addressing class imbalance through techniques like SMOTE or class weights could help improve the detection of minority diabetes types. Second, feature engineering could create more informative inputs, such as BMI categories or combined risk scores. Finally, hyperparameter tuning using techniques like grid search or Bayesian optimization could optimize both the neural network and XGBoost models.
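As one illustration of the class-weight idea, the sketch below computes "balanced" weights, where each class is weighted by `n_samples / (n_classes * class_count)`. The helper name `balanced_class_weights` and the label distribution are made up for illustration, and the helper assumes every class appears at least once. This was not part of the original experiment.

```python
import numpy as np

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency (hypothetical helper)."""
    counts = np.bincount(labels)                    # samples per class
    weights = len(labels) / (len(counts) * counts)  # rarer class -> larger weight
    return {cls: float(w) for cls, w in enumerate(weights)}

# Made-up, imbalanced labels: six of class 0, one of class 1, three of class 2
labels = np.array([0] * 6 + [1] * 1 + [2] * 3)
class_weight = balanced_class_weights(labels)
print(class_weight)
```

A dictionary of this shape can be passed to Keras through the `class_weight` argument of `model.fit`, so that mistakes on rare classes contribute more to the loss. Whether it improves the results here would need to be tested.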
Real-World Implications
The insights gained from this analysis have significant implications for healthcare in India. Understanding the relative importance of different factors in diabetes prediction can help healthcare providers focus on the most relevant risk factors when screening young adults. The model's ability to distinguish between diabetes types, albeit with room for improvement, could assist in early intervention and appropriate treatment planning.
Looking Forward
This project demonstrates the potential and challenges of using machine learning for medical diagnosis. While our models show promising results, they also highlight the complexity of medical prediction tasks and the importance of continuous refinement. Future work could focus on collecting more balanced datasets, incorporating temporal data, and exploring the integration of other health markers.
The complete code is available for practitioners looking to build upon this work and can be adapted for similar healthcare prediction tasks. Remember that while machine learning models can provide valuable insights, they should be used as supporting tools alongside professional medical judgment, not as replacements for clinical diagnosis.
Link to notebook: https://www.kaggle.com/code/adityak74/predict-diabetes-in-youth-vs-adult-in-india.
This analysis is a starting point for more sophisticated diabetes prediction models, particularly those focused on young adults in specific geographic regions. Combining deep learning and gradient-boosting approaches provides a comprehensive framework for tackling similar healthcare prediction challenges.
Disclaimer: This model is for research purposes only and should not be used for actual medical diagnosis without proper clinical validation.