This study introduces a novel model for the multiclassification of diabetes types. The model encompasses a series of steps, beginning with data collection and proceeding through preprocessing, data-analysis-based feature extraction, and a training/testing phase. The ML multiclassification process is then followed by the classifier-evaluation, optimization, and prediction phases. The system architecture is graphically presented in Fig. 2.
The diabetes data types system architecture.
Data collection
Data forms the foundation of ML models, shaping their effectiveness. Gathering the appropriate data, both in terms of quality and quantity, is pivotal for constructing an optimal model. The reliability of the collected data plays a crucial role across all phases of model classification, ultimately influencing the quality of predictions. This phase involved selecting significant data features and determining the necessary sample size, guided by assumptions regarding the attributes most relevant to diabetes. The quality of the data therefore directly impacts the performance of the model, contributing to its overall efficacy. A detailed explanation of how the different datasets were unified into the new DTD dataset is provided in the online supplementary material.
Dataset
The proposed system employs the newly developed DTD dataset, which integrates data from four distinct sources: the Pediatrics, PID, Pone, and Gestational diabetes datasets, resulting in a comprehensive dataset of 5312 patients with 13 attributes: Age, Sex, BPressure, NPregnancies, BMI, HbA1c, Insulin, POGTT, FOGTT, PGlucose, FGlucose, Diagnosis, and DiagnosisType. The Pediatrics dataset, sourced from Mansoura University Children’s Hospital in Egypt, includes 619 patients aged 1 to 19. The PID dataset, obtained from the UCI Repository, contributes data from 768 patients. The Pone dataset, collected from hospitals in Thanjavur district, Tamil Nadu, India, includes 400 patients with 23 features. The Gestational dataset comprises data from 3012 patients with 17 attributes, gathered by diabetes researchers. The goal of the DTD dataset is to identify key factors influencing diabetes occurrence within a multiclass classification framework. We also use two external datasets. The first, diabetes_prediction_dataset, contains 100,000 patient records (91,501 non-diabetic and 8499 diabetic cases) with 9 features, namely gender, age, hypertension, heart_disease, smoking_history, bmi, HbA1c_level, blood_glucose_level, and diabetes.
The second external dataset is diabetes_Dataset, containing 34 features, namely: Target, Genetic Markers, Autoantibodies, Family History, Environmental Factors, Insulin Levels, Age, BMI, Physical Activity, Dietary Habits, Blood Pressure, Cholesterol Levels, Waist Circumference, Blood Glucose Levels, Ethnicity, Socioeconomic Factors, Smoking Status, Alcohol Consumption, Glucose Tolerance Test, History of PCOS, Previous Gestational Diabetes, Pregnancy History, Weight Gain During Pregnancy, Pancreatic Health, Pulmonary Function, Cystic Fibrosis Diagnosis, Steroid Use History, Genetic Testing, Neurological Assessments, Liver Function Tests, Digestive Enzyme Levels, Urine Test, Birth Weight, and Early Onset Symptoms. These features are used to predict twelve distinct diabetes types plus prediabetes. The diabetes types are: Type 1, Type 2, Type 3c, Gestational, MODY, LADA, Secondary, Neonatal Mellitus, Wolcott-Rallison Syndrome, Steroid-Induced, Cystic Fibrosis-Related, and Wolfram Syndrome.
Figure 3 shows the binary distribution of diabetes (2317 diabetic and 2995 non-diabetic patients), while Fig. 4 outlines the composition of the four source datasets. Additionally, a rich Kaggle diabetes dataset covering various forms of the disease—such as Steroid-Induced, Neonatal, Prediabetes, Type 1, and Wolfram Syndrome—was referenced to support broader analysis of genetic, lifestyle, and medical factors contributing to diabetes.
The number of diabetes and non-diabetes patients.
The PIMA, pediatrics, CS, and gestational diabetes datasets.
Table 1 presents the DTD dataset, which comprises thirteen attributes and their comprehensive descriptions. The Diagnosis attribute serves as the dependent output variable, while the remaining twelve attributes are considered independent input features. The DiagnosisType label is utilized for multiclassification purposes, as depicted in Fig. 5.
Data preprocessing
Some errors can result from human mistakes during the previous data collection phase. This led us to perform a preprocessing step that included adjusting the data format, handling missing values, feature selection, data sampling, and feature scaling.
Data format
The current phase is concerned with manipulating the collected input attributes to be in a clear and correct format. The preprocessing phase is used to organize and clean the data for further analysis and processing. This approach assists in the accuracy and precision of interpreting data features via the ML classifier algorithms. We prepared the DTD dataset in CSV file format.
Missing values
An empty value within the attributes of the DTD dataset indicates a missing value, which is typically denoted by null indicators. Missing data samples may arise from errors during the data collection phase or from unperformed analysis requests. Such missing values can detrimentally impact the overall performance of the system25. In patient records, missing values may occur for one or multiple attributes for a defined percentage of patients. Addressing the issue of missing values can be approached in two ways. First, one may opt to eliminate features with missing values, although this risks discarding pertinent information and reducing the dataset size. Second, one can replace missing values by Multiple Imputation by Chained Equations (MICE) for the DTD dataset26. MICE is widely used to handle missing data in datasets that contain both numerical (continuous) and categorical (discrete) variables. The imputation used in MICE is a regression-based model where each missing value Yj is predicted using the observed values of other variables27. The general form for imputation is indicated using Eq. (1):
$$Y_{j}^{m} = f\left( {X^{m} ,\theta_{j} } \right)$$
(1)
where \({\text{Yj}}^{m}\) is the imputed value of variable j in the m-th imputed dataset, Xm represents the observed values of all other variables in the dataset for the m-th imputed dataset, and f is a regression model (e.g., linear regression or LR) that predicts \({\text{Yj}}^{m}\) based on Xm. The DTD dataset used for multiclass classification includes numerical features such as Age, BMI, BPressure, NPregnancies, HbA1c, FGlucose, PGlucose, FOGTT, POGTT, and Insulin, and categorical features (e.g., Sex, Diagnosis), with the target classes being Normal, Type 1, Type 2, and Gestational diabetes. A sample of the DTD dataset is depicted in Fig. 6, where missing values are indicated by white cells. The number of missing values in the DTD dataset is detailed in Table 2. We then standardize the data using the Standard Scaler.
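The imputation and scaling steps above can be sketched as follows; this is a minimal illustration using scikit-learn's MICE-style IterativeImputer on a small synthetic frame standing in for a slice of the DTD numerical features (the column values here are hypothetical).

```python
# A minimal sketch of MICE-style imputation followed by standardization.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for a few DTD numerical features with missing cells.
df = pd.DataFrame({
    "Age":   [25.0, 40.0, np.nan, 60.0, 33.0],
    "BMI":   [22.0, np.nan, 31.5, 28.2, 24.1],
    "HbA1c": [5.1, 6.8, 7.2, np.nan, 5.6],
})

# MICE: each column with missing values is imputed by regressing it on the
# other columns, cycling through the columns until the estimates stabilize.
imputer = IterativeImputer(max_iter=10, random_state=0)
imputed = imputer.fit_transform(df)

# Standardize to zero mean and unit variance, as done before training.
scaled = StandardScaler().fit_transform(imputed)
print(np.isnan(imputed).sum())  # 0: no missing values remain
```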
The dataset mini-batches.
Handle class imbalance
We use the Synthetic Minority Over-sampling Technique (SMOTE), a technique for balancing imbalanced datasets by creating synthetic samples of the minority classes35. Instead of duplicating rows, it generates new data points by interpolation. It is applied before training classifiers to prevent the model from being biased toward the majority classes. SMOTE helps by equalizing the number of samples per class and improving model generalization across all classes. The DTD samples are distributed as follows: 3003 Normal patients, 277 Type 1 patients, 659 Type 2 patients, and 1373 Gestational patients. Using the SMOTE technique, we balance the dataset to 3004 Normal, 3004 Type 1, 3004 Type 2, and 3004 Gestational patients, to prevent the model from overfitting to the majority class.
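The interpolation idea behind SMOTE can be sketched in a few lines of NumPy; the paper uses the standard SMOTE implementation, while this dependency-light version only illustrates the core mechanism (sample a minority point, pick one of its nearest minority neighbours, and place a synthetic point on the segment between them).

```python
# A minimal sketch of SMOTE-style interpolation for a single minority class.
import numpy as np

rng = np.random.default_rng(0)

def smote_oversample(X, n_new, k=5):
    """Create n_new synthetic rows by interpolating between each sampled
    minority point and one of its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        # distances from X[i] to every other minority sample
        d = np.linalg.norm(X - X[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbours)
        gap = rng.random()                    # position on the segment
        synthetic.append(X[i] + gap * (X[j] - X[i]))
    return np.array(synthetic)

# Hypothetical minority class (e.g., Type 1) with far fewer rows than the majority.
X_min = rng.normal(size=(30, 4))
X_new = smote_oversample(X_min, n_new=70)
print(X_new.shape)  # (70, 4)
```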
Feature selection
Feature selection plays a crucial role in the feature extraction process, aiming to eliminate redundant features and retain those essential for constructing an efficient predictive model, thereby enhancing classification accuracy. This process aids in comprehending the significance of all extracted features, allowing for the utilization of valuable values while discarding outlier features. By prioritizing features with the highest correlation and importance scores, the feature selection method minimizes the execution time and mitigates the risk of data overfitting. Identifying interactions among input variables that influence system output performance is of paramount importance.
To analyze correlations between different attribute types (numerical and categorical), we apply two statistical techniques: the one-way ANOVA test and the Chi-square test. The ANOVA test is used to compare the means of a numerical variable across the categories of DiagnosisType (the target classes). ANOVA checks whether the mean Age, BMI, BPressure, NPregnancies, HbA1c, FGlucose, PGlucose, FOGTT, POGTT, or Insulin differs significantly across DiagnosisType. It helps to assess whether the variation in the data is due to differences between groups or is just random variation. To perform the ANOVA test, state the null hypothesis (H₀), that the means of the groups are equal, using Eq. (2):
$${\text{H}}_{0} : \mu_{1} = \mu_{2} = \mu_{3} = \mu_{4} = \cdots = \mu_{k}$$
(2)
where k is the total number of groups.
The alternative hypothesis (H₁) states that at least one group mean differs from the others, as given by Eq. (3):
$${\text{H}}_{1} : \mu_{i} \ne \mu_{j} \;{\text{for at least one pair}}\;i \ne j$$
(3)
where i, j ∈ {1, 2, 3, …, k} and i ≠ j.
The F-statistic (F) is calculated by comparing the variance between groups and within groups using Eq. (4):
$$F = \frac{{{\text{Variance between groups}}}}{{{\text{Variance within groups}}}}$$
(4)
We use the p-value, a statistical measure that helps to determine the significance of the results in a hypothesis test. It gives the probability of obtaining results at least as extreme as the observed results if the null hypothesis (H0) is true. All the p-values are very small (< 0.0001), with many close to zero, indicating a statistically significant difference in Age, BPressure, NPregnancies, BMI, HbA1c, POGTT, FOGTT, PGlucose, FGlucose, and Insulin across DiagnosisType, as indicated in Table 3.
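The per-feature ANOVA screen can be sketched as follows; the group values below are a synthetic stand-in for one numerical feature (e.g., HbA1c) grouped by DiagnosisType, and the real analysis runs this test once per feature on the DTD dataset.

```python
# A minimal sketch of the one-way ANOVA test using scipy.
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(42)
# Hypothetical HbA1c values for the four DiagnosisType groups, with shifted means.
normal      = rng.normal(5.2, 0.4, 200)
type1       = rng.normal(8.5, 0.9, 200)
type2       = rng.normal(7.8, 0.8, 200)
gestational = rng.normal(6.4, 0.6, 200)

# F = variance between groups / variance within groups (Eq. 4);
# a small p-value rejects H0 that all group means are equal (Eq. 2).
f_stat, p_value = f_oneway(normal, type1, type2, gestational)
print(p_value < 0.0001)  # True: the group means differ significantly
```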
We also apply the Chi-square test (χ²) to evaluate whether two categorical variables are statistically associated. This test is especially suitable for examining whether DiagnosisType is related to categorical features such as Sex or Diagnosis. In practice, the test compares each observed cell frequency with its expected count under the null hypothesis of independence. The Chi-square statistic is computed using Eq. (5):
$$\chi^{2} = \sum \frac{{\left( {O - E} \right)^{2} }}{E}$$
(5)
where O is the actual count in each category, E is the expected count if there is no association.
The degrees of freedom (D) using Eq. (6):
$$D = \left( {R - 1} \right) \times \left( {C - 1} \right)$$
(6)
where R is the number of rows and C is the number of columns.
The Chi-square statistic is notably large, and the p-value is less than 0.0001, well below the conventional threshold of 0.05, indicating a significant association between the Sex and DiagnosisType variables. Similarly, the Chi-square statistic between Diagnosis and DiagnosisType is also very large, with a p-value less than 0.0001, confirming a significant association, as shown in Table 4.
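The association test can be sketched with scipy's contingency-table routine; the Sex × DiagnosisType counts below are illustrative only, not the paper's actual cross-tabulation.

```python
# A minimal sketch of the Chi-square test of independence.
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts. Rows: Sex (female, male);
# columns: Normal, Type 1, Type 2, Gestational.
observed = np.array([
    [1400, 130, 310, 1373],   # female (all gestational cases are female)
    [1603, 147, 349,    0],   # male
])

# chi2 = sum((O - E)^2 / E) over all cells (Eq. 5);
# dof = (R - 1) * (C - 1) (Eq. 6).
chi2, p_value, dof, expected = chi2_contingency(observed)
print(dof)  # 3, i.e. (2 - 1) * (4 - 1)
```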
Data sampling
We partition the training dataset into smaller chunks. Dividing the DTD dataset into these smaller mini-batches aids the training phase, as illustrated in Fig. 6.
Train phase
This phase focuses on training multiple classifiers using a set of input attributes prepared during the data preprocessing stage. The classifiers LR, NB, DT, AB, RF, GB, ET, KNN, and ANN are each individually trained on the dataset. The objective is to build models capable of accurately classifying unseen data into one of four categories: Type 1 Diabetes, Type 2 Diabetes, Gestational Diabetes, or Normal, based on the predefined DiagnosisType label. The dataset, containing 5312 records, is split into two primary subsets: 70% (3718 samples) for training and 30% (1594 samples) for testing, as illustrated in Fig. 7. The training set is further subjected to 5-fold cross-validation to enhance model reliability34. In each fold, approximately 56% of the total dataset is used for training, while about 14% is used for validation. This process is repeated five times, with each fold serving as the validation set once. Stratified sampling is applied during splitting to maintain balanced class distributions. The test set remains untouched throughout the cross-validation process and is used only for final performance evaluation.
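The stratified 70/30 split and 5-fold cross-validation above can be sketched as follows, using a synthetic stand-in for the DTD feature matrix and DiagnosisType labels.

```python
# A minimal sketch of the stratified train/test split and 5-fold CV.
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(5312, 12))        # 12 input features
y = rng.integers(0, 4, size=5312)      # 0=Normal, 1=Type1, 2=Type2, 3=Gestational

# Stratified 70/30 split keeps the class proportions in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

# 5-fold CV on the training set: each validation fold is ~14% of the full
# dataset, leaving ~56% of the full dataset for training in every round.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(X_train, y_train):
    pass  # fit and validate a classifier here

print(len(X_train), len(X_test))  # 3718 1594
```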
The train, validation, and test splits across the five folds.
Machine learning multi-class classification
The pivotal aspect of the multiclass classification process is training the classifier. This step involves creating a model by training it with nine classifiers, which will subsequently be utilized to classify unlabeled “DiagnosisType” records into the four output categories: Normal, Type 1, Type 2, and Gestational patients. We proceed by training nine ML classifiers: ANN, LR, NB, DT, AB, RF, GB, ET, and KNN34. These supervised ML algorithms are specifically chosen to perform multiclassification on the DTD dataset. Following this, the nine ML classifiers are compared, and the most suitable ones for the dataset are selected. The output of this phase is a trained classifier, referred to as a model, which is prepared for testing. The parameters are fine-tuned on the developed predictive model, and performance measures are calculated, resulting in the construction of a superior ML model with a highly accurate performance level.
ANN classification algorithm
Previously, we conducted binary classification to distinguish between diabetic and non-diabetic patients. In our proposed model, we utilize nine supervised ML techniques for multiclassification of diabetes type; here we describe the ANN. An ANN typically comprises multiple layers of units known as neurons. In each mini-batch, the input features from the DTD dataset are forwarded to the initial input layer. The neurons in the first layer receive a vector composed of twelve input features, while the subsequent hidden layer’s neurons are linked to the input layer’s neurons through a combination of weights and the Rectified Linear Unit (ReLU) activation function37. The neurons in the last output layer receive a combination of outputs with corresponding weights and apply the SoftMax activation function. The ANN is configured with four output nodes: Normal, Type 1, Type 2, and Gestational diabetes patients, as shown in Fig. 8. The cross-entropy of the multinomial distribution serves as the cost function, measuring the disparity between predicted and actual outputs to adjust weights and biases accordingly. We use Particle Swarm Optimization for fine-tuning hyperparameters36. The ADAM algorithm is employed to update the assigned weight and bias values, with a regularization factor of 0.01 over 50 epochs. The number of iterations needed to complete one epoch corresponds to the number of batches. This iterative process continues until the predicted output aligns closely with the actual output. The SoftMax function yields the final classification with output probabilities over the four classes (0, 1, 2, or 3), as depicted in Fig. 10. With this, training concludes, yielding a multiclassification prediction with high accuracy. The ANN algorithm is described in online Appendix B.
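The architecture above (ReLU hidden layers, softmax output, cross-entropy loss, Adam updates, regularization factor 0.01, 50 epochs) can be sketched compactly; scikit-learn's MLPClassifier stands in here for the paper's Keras/TensorFlow model, and the layer widths are hypothetical.

```python
# A minimal stand-in sketch of the described ANN on synthetic DTD-like data.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(800, 12))         # 12 DTD input features
y = rng.integers(0, 4, size=800)       # four DiagnosisType classes

ann = MLPClassifier(
    hidden_layer_sizes=(32, 16),  # hypothetical hidden-layer widths
    activation="relu",            # ReLU in the hidden layers
    solver="adam",                # Adam weight/bias updates
    alpha=0.01,                   # L2 regularization factor
    max_iter=50,                  # 50 epochs
    random_state=42,
)
ann.fit(X, y)

# predict_proba applies softmax-style normalization over the four outputs.
proba = ann.predict_proba(X[:1])
print(proba.shape)  # (1, 4)
```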
Age and diagnosis type attributes.
Test phase
The multiclassification technique was subsequently applied to accurately assign a class label to unlabelled DiagnosisType records, effectively distinguishing between Type 1, Type 2, Gestational diabetes, and normal cases. A mapping function is utilized to classify the unlabelled DiagnosisType and ascertain its appropriate label. The probability of assigning each patient to the respective class label is computed for the nine multiclassification techniques.
Optimization
We utilize the Keras and TensorFlow libraries to construct the ANN model. The training–testing split and cross-validation functions from the Scikit-Learn library are employed for data splitting. Various ML algorithms, including LR, NB, DT, AB, RF, GB, ET, KNN, and ANN, are implemented in this phase. The primary objective is to enhance multiclassification efficiency and accuracy and to reduce computational complexity compared to manual tuning. The ANN model is trained with hyperparameters tuned by the Particle Swarm Optimization (PSO) technique, which determines the best-fit probable values for model optimization. PSO plays a crucial role in improving the model’s performance by adjusting the particles’ positions in the swarm to find the optimal set of parameters. PSO optimizes several key parameters, such as the number of hidden units, batch size, learning rate, momentum, and the number of hidden layers. In this case, PSO searches for the optimal number of neurons. The batch size, which determines the number of training examples processed in each iteration before updating the model’s weights, is optimized between 5 and 64. PSO also searches for the best learning rate (lr) within the range 0.0001 to 0.1, which controls how quickly the model updates its weights during training; the Adam optimizer then uses this rate to minimize the output error during backpropagation.
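The PSO search over the hyperparameter ranges above can be sketched in plain NumPy; the quadratic "loss" function here is only a stand-in for the cross-validated model error the paper actually minimizes, and the swarm coefficients are typical textbook values, not the paper's.

```python
# A minimal PSO sketch over two hyperparameters: learning rate in
# [0.0001, 0.1] and batch size in [5, 64].
import numpy as np

rng = np.random.default_rng(0)
lo = np.array([0.0001, 5.0])
hi = np.array([0.1, 64.0])

def loss(p):  # hypothetical surrogate, minimized near lr=0.01, batch=32
    return (p[0] - 0.01) ** 2 + ((p[1] - 32.0) / 64.0) ** 2

n, iters, w, c1, c2 = 20, 60, 0.7, 1.5, 1.5
pos = rng.uniform(lo, hi, size=(n, 2))
vel = np.zeros((n, 2))
pbest, pbest_val = pos.copy(), np.array([loss(p) for p in pos])
gbest = pbest[pbest_val.argmin()].copy()

for _ in range(iters):
    r1, r2 = rng.random((n, 2)), rng.random((n, 2))
    # velocity: inertia + pull toward personal best + pull toward global best
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, lo, hi)
    vals = np.array([loss(p) for p in pos])
    improved = vals < pbest_val
    pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
    gbest = pbest[pbest_val.argmin()].copy()

print(gbest)  # converges near lr = 0.01, batch size = 32
```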
Evaluation
This phase focuses on evaluating the performance of each classifier by comparing predicted class labels with the actual labels using the trained models. To interpret and understand model behaviour, we employ SHapley Additive exPlanations (SHAP) Summary Plots, which illustrate the contribution of each feature to individual predictions as well as the overall model output as shown in Fig. 9. These plots enhance transparency by highlighting which features influence predictions positively or negatively, thereby improving interpretability and trust in the model. Additionally, we apply Particle Swarm Optimization (PSO) for hyperparameter tuning to optimize each classifier’s performance. The effectiveness of the machine learning algorithms is assessed using various evaluation metrics, including precision, Mean Squared Error (MSE), R2 score, training accuracy and AUC3.
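The evaluation metrics named above can be computed with scikit-learn as sketched below; the labels and class probabilities are small hypothetical examples, not the paper's results.

```python
# A minimal sketch of the evaluation metrics: precision, MSE, R2, accuracy, AUC.
import numpy as np
from sklearn.metrics import (accuracy_score, mean_squared_error,
                             precision_score, r2_score, roc_auc_score)

y_true = np.array([0, 1, 2, 3, 0, 1, 2, 3, 0, 2])
y_pred = np.array([0, 1, 2, 3, 0, 1, 1, 3, 0, 2])
# Per-sample class probabilities for the four DiagnosisType classes
# (each row sums to 1, as a softmax output would).
y_proba = np.eye(4)[y_pred] * 0.85 + 0.0375

print(precision_score(y_true, y_pred, average="macro"))
print(mean_squared_error(y_true, y_pred))
print(r2_score(y_true, y_pred))
print(accuracy_score(y_true, y_pred))
print(roc_auc_score(y_true, y_proba, multi_class="ovr"))
```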
Prediction
The performance of all classifiers is compared to determine the most effective model for recommendation. The selected classifier demonstrating the highest predictive accuracy serves as the optimal model, significantly enhancing the system’s overall capability to predict class labels for new data. To support this evaluation, three key visualizations are generated: a Confusion Matrix applied to the full dataset, ROC Curves for each class to assess classification performance, and a Learning Curve that illustrates the relationship between training size and both training and validation accuracy. Finally, the best model is used to predict the DiagnosisType on an external dataset namely “diabetes_prediction_dataset”. The proposed system is designed to use a trained and optimized classifier to make predictions on a new, external dataset that was not part of the original training data. It first loads the external dataset from a CSV file, removes the target column if it exists (since the goal is to predict it), and ensures that the data’s structure matches the training features. It then applies the same preprocessing steps used during training such as imputing missing values and standardizing the data to prepare it for prediction. Using the final model, which was optimized through PSO, it generates predictions for each sample, mapping the predicted numeric labels to meaningful class names like “Normal,” “Type1,” “Type2,” or “Gestational.” Finally, the results, including the original features and predicted labels, are saved to a new CSV file for review or further analysis following the procedure outlined in Algorithm 1.
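The external-prediction procedure described above can be sketched as follows; the file name, feature subset, and the toy stand-in model (a logistic regression in place of the PSO-tuned classifier) are all hypothetical.

```python
# A minimal sketch of predicting DiagnosisType on an external dataset.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

label_names = {0: "Normal", 1: "Type1", 2: "Type2", 3: "Gestational"}
features = ["Age", "BMI", "HbA1c"]                 # illustrative feature subset

# Stand-ins for the fitted preprocessing steps and the final trained model.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
y_train = rng.integers(0, 4, size=200)
imputer = SimpleImputer(strategy="mean").fit(X_train)
scaler = StandardScaler().fit(X_train)
model = LogisticRegression(max_iter=200).fit(scaler.transform(X_train), y_train)

# External data: drop the target column if present, align columns with training.
external = pd.DataFrame(rng.normal(size=(10, 3)), columns=features)
external["Diagnosis"] = 0                          # pretend target column
X_ext = external.drop(columns=["Diagnosis"])[features]

# Apply the same imputation and scaling, predict, and map labels to class names.
X_ready = scaler.transform(imputer.transform(X_ext))
external["PredictedType"] = [label_names[c] for c in model.predict(X_ready)]
external.to_csv("predictions.csv", index=False)    # hypothetical output path
print(external["PredictedType"].isin(label_names.values()).all())  # True
```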
Predict and Interpret Using Trained ML Model