Bayesian network analysis of factors influencing type 2 diabetes, coronary heart disease, and their comorbidities | BMC Public Health

A total of 3824 complete records were collected in this study. The control group comprised 1175 subjects, 726 males (61.79%) and 449 females (38.21%), with a mean age of 40.79 ± 14.02 years. There were 1163 T2DM patients, 635 males (54.60%) and 528 females (45.40%), with a mean age of 59.98 ± 12.06 years. The CAD group comprised 982 cases, 641 males (65.27%) and 341 females (34.73%), with a mean age of 67.16 ± 11.75 years. The comorbidity group comprised 504 patients, 312 males (61.90%) and 192 females (38.10%), with a mean age of 67.65 ± 10.38 years. The general characteristics of the study subjects are shown in Table 1.

Table 1 General description of research objects

Univariate analysis showed that for T2DM, except for BMI and TC, the distributions of the other 24 variables differed between groups, and the differences were statistically significant (P < 0.05). For CAD, except for gender and FHx of CAD, the distributions of the other 25 variables differed significantly between groups (P < 0.05). For comorbidities, except for gender, FHx of DM, FHx of CAD, BMI and LDL-C, the distributions of the other 22 variables differed significantly between groups (P < 0.05). The results of the univariate analysis used for preliminary screening of disease influencing factors are shown in Table 2.

Table 2 Results of single factor analysis of disease influencing factors screening
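The univariate screening step can be illustrated with the gender counts reported above. The following is a minimal sketch, assuming that gender was compared between each disease group and the control group with a Pearson chi-square test without continuity correction; the test actually used is not named in this section, and the function name `chi_square_2x2` is illustrative.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test (no continuity correction) for a 2 x 2 table.

    Returns the test statistic and the p-value for 1 degree of freedom.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom the chi-square tail probability reduces to erfc.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Each tuple is (males, females); counts are taken from the group description.
control = (726, 449)
t2dm = (635, 528)
cad = (641, 341)

stat_t2dm, p_t2dm = chi_square_2x2((control, t2dm))
stat_cad, p_cad = chi_square_2x2((control, cad))
print(f"T2DM vs control: chi2 = {stat_t2dm:.2f}, p = {p_t2dm:.4f}")
print(f"CAD vs control:  chi2 = {stat_cad:.2f}, p = {p_cad:.4f}")
```

Under this assumption the gender difference is significant at the 0.05 level for T2DM but not for CAD, which agrees with the univariate results described above.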

The more nodes (variables) a BN contains, the larger the sample size required to construct a reasonable network, and too many nodes obscure the relationship between the main factors and the outcome. Therefore, variables with P < 0.05 in the univariate analysis were further screened by stepwise logistic regression to simplify the structure of the subsequent BN. The variables and their assignments are described in Table 3.

Table 3 Variables and their assignments

Logistic regression for T2DM showed that 16 factors entered the model: age, region, education level, occupation, FHx of DM, smoking, drinking, intake of staple food, meat, fruit and sweets, exercise, heart rate, TG, HDL-C and LDL-C. Older age, lower education level, a FHx of DM and frequent smoking were associated with a higher risk of T2DM. The risk in the rural population was 1.798 (1/0.556) times that in the suburban population and 14.386 (1/0.070) times that in the urban population. Business service workers had a higher risk than other occupational groups. Occasional drinkers had a 58.9% lower risk than non-drinkers. Those who ate no staple food or less than 3 liang per day had 2.299 (1/0.435) times the risk of those who ate 3–6 liang per day. Higher intake of meat, fruit and sweets was associated with a lower risk of T2DM. For each unit increase in heart rate, TG, LDL-C and HDL-C, the likelihood of T2DM increased by 2.7%, 8.9%, 34.3% and 62.3%, respectively. See Table 4 for details.

Table 4 Results of multivariate logistic regression analysis of T2DM
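The reciprocals and percentages quoted for T2DM are simple transformations of odds ratios. The sketch below assumes that the values in parentheses (0.556, 0.435) are odds ratios estimated with the rural group and the low staple food group as reference, and that a "58.9% lower risk" corresponds to an odds ratio of 0.411; strictly speaking these are changes in odds rather than in risk. The function names are illustrative.

```python
def swap_reference(odds_ratio):
    """Odds ratio after exchanging the reference and comparison groups."""
    return 1.0 / odds_ratio


def percent_change_in_odds(odds_ratio):
    """Percentage change in the odds implied by an odds ratio."""
    return (odds_ratio - 1.0) * 100.0


# Rural versus suburban residents, from an odds ratio of 0.556.
rural_vs_suburban = swap_reference(0.556)
# Low staple food intake versus 3-6 liang per day, from an odds ratio of 0.435.
low_vs_moderate_staple = swap_reference(0.435)
# Occasional drinkers versus non-drinkers (odds ratio 0.411 is inferred, not quoted).
occasional_drinking = percent_change_in_odds(0.411)

print(f"rural vs suburban: {rural_vs_suburban:.4f}")
print(f"low vs moderate staple intake: {low_vs_moderate_staple:.4f}")
print(f"occasional drinking: {occasional_drinking:.1f}% change in odds")
```

Small differences in the last digit relative to the text can arise because the published odds ratios are themselves rounded.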

Logistic regression for CAD showed that 17 factors finally entered the model: age, region, education level, marital status, occupation, smoking, drinking, intake of staple food, meat and fruit, exercise, sleep time, SBP, heart rate, TG, TC and LDL-C. The risk of CAD increased by 0.091 times, 0.020 times, 0.140 times and 2.197 times for each unit increase in age, SBP, TG and LDL-C, respectively. People who lived in rural areas, had a low education level, smoked often, ate more staple food or did not exercise had an increased risk of CAD. Compared with married people, unmarried, divorced and widowed people had a lower risk of CAD. Military/national party and government personnel had a lower risk than other occupational groups. Occasional drinkers had a 48.1% lower risk than non-drinkers. Compared with those who slept less than 5 h/day or needed medication to sleep, those who slept 5–7 h/day and 7–9 h/day had a 51.8% and 48.6% lower risk, respectively. Higher meat and fruit intake was also associated with a lower risk of CAD. See Table 5 for details.

Table 5 Results of multivariate logistic regression analysis of CAD

Logistic regression for comorbidities showed that 11 variables were related to comorbidity: age, region, education level, fruit intake, sweet food intake, exercise, SBP, DBP, FBG, TC and HDL-C. The risk of comorbidities increased by 10.6%, 2.6%, 2.9% and 49.6% for each one-grade increase in age, SBP, DBP and FBG, respectively. With each one-grade increase in TC and HDL-C, the risk decreased by 32.3% and 85%, respectively. The risk of comorbidity in rural areas was 9.091 (1/0.110) times that in urban areas. People with a college degree or above were 82% less likely to develop comorbidity than those who were illiterate. Compared with eating no fruit, eating fruit 7–14 times or fewer per week was associated with a lower probability of comorbidities. Those who exercised had a 49.3% lower risk than those who did not. See Table 6 for details.

Table 6 Results of multivariate logistic regression analysis of comorbidity

Variables that were statistically significant in the multivariate logistic regression analysis were selected, and 70% of the T2DM, CAD and comorbidity data were randomly selected as the training set. The tabu search algorithm was used to learn the BN structure, and the BN model was constructed by combining expert prior knowledge with information from the data. Structure learning requires continuous variables to be discretized, which can both improve the accuracy of network learning and reduce the risk of overfitting, giving the results greater practical value. Table 7 shows the variable discretization rules.

Table 7 Variable discretization rules
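To make the structure learning step more concrete, the following is a minimal sketch of tabu search over directed acyclic graphs. It rests on several assumptions that are not stated in this section: the BIC score is used to rate candidate structures, the tabu list stores recently visited structures, and no expert constraints are applied. The data are synthetic, with hypothetical binary variables `A`, `B` and `C`, and all function names are illustrative.

```python
import itertools
import math
import random
from collections import Counter, deque


def bic(data, parents):
    """BIC score of a discrete network; parents maps node -> frozenset of parents."""
    n = len(data)
    card = {v: len({row[v] for row in data}) for v in parents}
    score = 0.0
    for node, pa in parents.items():
        pa = sorted(pa)
        joint = Counter((tuple(row[p] for p in pa), row[node]) for row in data)
        marg = Counter(tuple(row[p] for p in pa) for row in data)
        loglik = sum(c * math.log(c / marg[cfg]) for (cfg, _), c in joint.items())
        n_params = (card[node] - 1) * math.prod(card[p] for p in pa)
        score += loglik - 0.5 * math.log(n) * n_params
    return score


def is_acyclic(parents):
    """True if the nodes can be peeled off in topological order."""
    placed, remaining = set(), set(parents)
    while remaining:
        ready = {v for v in remaining if parents[v] <= placed}
        if not ready:
            return False
        placed |= ready
        remaining -= ready
    return True


def neighbours(parents):
    """Yield every structure one edge addition, deletion or reversal away."""
    for a, b in itertools.permutations(parents, 2):
        if a in parents[b]:
            deleted = {**parents, b: parents[b] - {a}}
            yield deleted                              # delete a -> b
            yield {**deleted, a: parents[a] | {b}}     # reverse to b -> a
        elif b not in parents[a]:
            yield {**parents, b: parents[b] | {a}}     # add a -> b


def tabu_search(data, nodes, tabu_len=10, max_iter=50):
    """Move to the best non-tabu neighbour at every step, even if it scores worse."""
    current = {v: frozenset() for v in nodes}
    best, best_score = current, bic(data, current)
    tabu = deque([frozenset(current.items())], maxlen=tabu_len)
    for _ in range(max_iter):
        candidates = []
        for cand in neighbours(current):
            key = frozenset(cand.items())
            if key not in tabu and is_acyclic(cand):
                candidates.append((bic(data, cand), key, cand))
        if not candidates:
            break
        score, key, current = max(candidates, key=lambda item: item[0])
        tabu.append(key)
        if score > best_score:
            best, best_score = current, score
    return best, best_score


# Synthetic data: B copies A 90% of the time, C is generated independently.
random.seed(0)
rows = []
for _ in range(500):
    a = random.random() < 0.5
    b = a if random.random() < 0.9 else not a
    c = random.random() < 0.3
    rows.append({"A": a, "B": b, "C": c})

structure, score = tabu_search(rows, ["A", "B", "C"])
print({node: sorted(pa) for node, pa in structure.items()}, round(score, 2))
```

Accepting the best non-tabu move even when it lowers the score is what distinguishes tabu search from plain hill climbing; the tabu list prevents the search from immediately returning to a structure it has just left.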

The BN of T2DM influencing factors is shown in Fig. 1; it contains 17 nodes and 21 directed edges. A directed edge represents a dependency between related factors and T2DM. The network structure shows that age, education level and FHx are the parent nodes of T2DM, that is, they are direct influencing factors of T2DM. Smoking, alcohol consumption, heart rate, occupation, HDL-C, staple food and meat intake affect T2DM indirectly through other factors. Region, fruit intake, exercise, LDL-C, TG and sweet food intake are the child nodes of T2DM, that is, they are also directly related to T2DM. Among them, age, region, education level, FHx and sweet food intake, in addition to their direct relationship with T2DM, are also indirectly related to it through other factors.

Fig. 1

BN diagram of influencing factors of T2DM
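The parent and child relationships described for T2DM can be written down as a list of directed edges. The sketch below encodes only the nine edges that the text states explicitly; the remaining edges of Fig. 1 are not listed in this section and are therefore omitted. The node labels and helper functions are illustrative.

```python
# Directed edges (source, target) stated in the text for the T2DM network.
# The other edges of Fig. 1 are not described here and are left out.
edges = [
    ("age", "T2DM"),
    ("education level", "T2DM"),
    ("FHx of DM", "T2DM"),
    ("T2DM", "region"),
    ("T2DM", "fruit intake"),
    ("T2DM", "exercise"),
    ("T2DM", "LDL-C"),
    ("T2DM", "TG"),
    ("T2DM", "sweet food intake"),
]


def parents(node):
    """Nodes with an edge pointing into the given node."""
    return {source for source, target in edges if target == node}


def children(node):
    """Nodes that the given node points to."""
    return {target for source, target in edges if source == node}


print("parents of T2DM:", sorted(parents("T2DM")))
print("children of T2DM:", sorted(children("T2DM")))
```

In a BN, the conditional probability table of a node is indexed by the states of its parents, which is why the parents of T2DM determine the layout of its conditional probability table.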

The BN of CAD influencing factors is shown in Fig. 2; it contains 18 nodes and 25 directed edges. The network structure showed that age, SBP, smoking, sleep time, heart rate, exercise, and meat and fruit intake were directly related to CAD. Alcohol consumption, education level, occupation and marital status were indirectly related to CAD through other nodes. The remaining variables were related to one another but lay far from CAD in the network.

Fig. 2

BN diagram of influencing factors of CAD

The BN of comorbidity influencing factors is shown in Fig. 3; it contains 12 nodes and 16 directed edges. The network structure showed that age, FBG, SBP, exercise, sweets, fruit intake and HDL-C were directly related to comorbidity. Among them, age and sweet food intake had both direct and indirect effects on comorbidities. Region and education level were indirectly associated with comorbidities through FBG, and TC and DBP were indirectly associated with comorbidities through SBP.

Fig. 3

BN diagram of influencing factors of comorbidity

For the constructed BN structure, maximum likelihood estimation was used for parameter learning. Table 8 is the conditional probability table with T2DM as the child node. The probability of T2DM increases significantly with age and decreases with higher education level. The probability of T2DM in people with a FHx of DM is significantly higher than in people without a FHx.

Table 8 Conditional probability of influencing factors of T2DM
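For discrete nodes, maximum likelihood estimation of a conditional probability table amounts to counting and normalising within each combination of parent states. The following is a minimal sketch with a single parent and a handful of made-up records; the counts are hypothetical and do not come from Table 8, and `fit_cpt` is an illustrative name.

```python
from collections import Counter, defaultdict


def fit_cpt(records, child, parents):
    """Maximum likelihood estimate of P(child | parents) from discrete records."""
    counts = defaultdict(Counter)
    for row in records:
        config = tuple(row[p] for p in parents)
        counts[config][row[child]] += 1
    return {
        config: {state: n / sum(states.values()) for state, n in states.items()}
        for config, states in counts.items()
    }


# Hypothetical toy records: 1 of 4 younger subjects and 3 of 4 older subjects have T2DM.
records = (
    [{"age": "18-44", "T2DM": "yes"}] * 1
    + [{"age": "18-44", "T2DM": "no"}] * 3
    + [{"age": ">=60", "T2DM": "yes"}] * 3
    + [{"age": ">=60", "T2DM": "no"}] * 1
)

cpt = fit_cpt(records, child="T2DM", parents=["age"])
for config, distribution in cpt.items():
    print(config, distribution)
```

With several parents, as for T2DM in this study, `parents` would list all of them and each row of the table would correspond to one combination of their states.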

Table 9 is the conditional probability table with CAD as the child node. The probability of CAD increases with age. The probability in regular smokers is much higher than in non-smokers, and this gap is especially obvious in middle-aged and elderly people. The probability of CAD in people with abnormal SBP is significantly higher than in people with normal SBP.

Table 9 Conditional probability of CAD influencing factors

Table 10 is the conditional probability table with comorbidity as the child node. Similarly, the probability of comorbidity increases with age. The probability in people who did not exercise was higher than in those who exercised, and the difference became more obvious with increasing age. Abnormal SBP and FBG significantly increase the probability of comorbidity, and the effect of FBG is stronger than that of SBP.

Table 10 Conditional probability of factors influencing comorbidity

After BN structure learning and parameter learning were completed, the remaining 30% of the data were used as the test set; the resulting confusion matrix is shown in Table 11. For T2DM prediction, the accuracy was 84.33%, the precision was 83.91%, the sensitivity was 86.23% and the specificity was 82.30%, and the area under the ROC curve was 0.844 (95% CI: 0.817–0.871) (Fig. 4A). For CAD prediction, the accuracy, precision, sensitivity and specificity were 85.34%, 83.62%, 83.62% and 86.72%, respectively, and the area under the ROC curve was 0.852 (95% CI: 0.824–0.880) (Fig. 4B). For comorbidity prediction, the accuracy was 87.62%, the precision was 80.79%, the sensitivity was 78.71% and the specificity was 91.62%, and the area under the ROC curve was 0.857 (95% CI: 0.822–0.892) (Fig. 4C). These evaluation indices show that the three BN models have good predictive performance.

Table 11 Confusion matrix by disease

Fig. 4

ROC curves of each disease model. Note: (A) ROC curve of T2DM prediction, (B) ROC curve of CAD prediction, (C) ROC curve of comorbidity prediction
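The evaluation indices reported for the test set can all be derived from a 2 × 2 confusion matrix. The sketch below uses hypothetical counts, not the values of Table 11, and assumes that the second index reported for each model is precision (positive predictive value); the function name is illustrative.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard indices computed from the four cells of a confusion matrix."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical counts: 100 diseased and 100 non-diseased test subjects.
metrics = classification_metrics(tp=80, fp=10, fn=20, tn=90)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Sensitivity and specificity depend only on the diseased and non-diseased subjects, respectively, whereas accuracy and precision also depend on the proportion of diseased subjects in the test set.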

BNs use the conditional probability distributions determined by structure and parameter learning to carry out predictive and diagnostic reasoning about uncertain events, and third-party software can display the complex relationships and probability distributions between variables and outcomes more intuitively. Their biggest advantage is that the network probabilities are updated automatically by Bayes' theorem as new evidence is entered. In this study, the learning results were imported into Netica, where each directed arc represents a probabilistic dependence between the connected nodes, and each node contains the different states of its variable.

Predictive reasoning starts from the prior probability of a specific basic event (node) and uses the conditional probability relationships between nodes to obtain the probability of an outcome given its causes, that is, it predicts the probability of disease from known characteristics of the study subject. Suppose a subject is known to be 45–59 years old, illiterate, living in an urban area and to have a FHx of DM; the risk of T2DM is then 57.6% (Fig. 5). If the person is further found to have abnormal HDL-C and LDL-C on clinical examination, the BN shows that the risk of T2DM increases to 65.8% (Fig. 6), suggesting that patients with a FHx of DM and hyperlipidemia should receive sufficient attention to reduce the risk of T2DM. If a subject is 45–59 years old, lives in a rural area, and regularly smokes and drinks alcohol, the risk of CAD is 63.2% (Fig. 7); if the subject quits smoking and drinking and maintains exercise and adequate sleep, the risk of CAD falls to 28.6% (Fig. 8), indicating that a healthy lifestyle can effectively reduce the risk of coronary heart disease. If an individual is 60 years of age or older, illiterate and living in an urban area, the risk of comorbidity is 76.6% (Fig. 9). If the person is found to have abnormal SBP and HDL-C on further biochemical examination, the risk of comorbidity increases to 90.6% (Fig. 10), indicating that hypertension and hyperlipidemia are closely related to comorbidity. Therefore, people with hypertension and hyperlipidemia, especially the elderly, should receive adequate attention to prevent T2DM and CAD.
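The way a risk estimate moves when a new finding is entered can be sketched with Bayes' theorem for a single child node. The example starts from the 57.6% risk reported for the first T2DM scenario; the two probabilities of an abnormal lipid finding are hypothetical, chosen only for illustration, and the result is not intended to reproduce the 65.8% obtained from the full network.

```python
def update_with_child(prior, p_finding_if_disease, p_finding_if_healthy):
    """Posterior disease probability after observing one child node finding."""
    numerator = prior * p_finding_if_disease
    return numerator / (numerator + (1.0 - prior) * p_finding_if_healthy)


prior = 0.576                  # risk given age, education, region and FHx (Fig. 5)
p_abnormal_if_t2dm = 0.40      # hypothetical P(abnormal lipids | T2DM)
p_abnormal_if_healthy = 0.25   # hypothetical P(abnormal lipids | no T2DM)

posterior = update_with_child(prior, p_abnormal_if_t2dm, p_abnormal_if_healthy)
print(f"risk before the lipid finding: {prior:.1%}")
print(f"risk after the lipid finding:  {posterior:.1%}")
```

The risk rises whenever the finding is more probable in diseased than in non-diseased subjects, and it is unchanged when the two probabilities are equal.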

Fig. 5

Prediction inference of T2DM influencing factors by BN model I

Fig. 6

Prediction inference of T2DM influencing factors by BN model II

Fig. 7

Prediction inference of CAD influencing factors by BN model I

Fig. 8

Prediction inference of CAD influencing factors by BN model II

Fig. 9

Prediction inference of comorbidity influencing factors by BN model I

Fig. 10

Prediction inference of comorbidity influencing factors by BN model II

BNs can not only predict the risk of outcome events but also explore the conditional probability of causes through diagnostic reasoning: given a known disease state, they infer the likely characteristics of the study subject and identify the conditions associated with the disease. As shown in Fig. 11, the probability of T2DM was set to 100% and the changes in the node probabilities were observed. For age, the probability of being 18–44 years old decreased from 38.1% to 10.3%, the probability of being 45–59 years old increased from 29.1% to 35.5%, and the probability of being 60 years or older increased from 32.8% to 54.2%. The last increase was the largest, indicating that the elderly are the population at highest risk of T2DM. For region, the rural probability increased from 19.6% to 36.3% and the suburban probability increased from 10.6% to 16.4%, indicating that people living in rural and suburban areas were more likely to develop the disease, and the rural population more so than the suburban population. For education level, the probability of college and above decreased, while the probabilities of illiteracy and primary school, and of middle school, high school or secondary school, increased (7.86% and 80.4%, respectively), indicating that these two groups are more likely to suffer from T2DM. For occupation, the probability of agriculture and forestry personnel increased from 18.3% to 28.4% and that of business service personnel increased from 9.69% to 10.8%, indicating that these two occupational groups are more likely to develop the disease, and that the incidence in agriculture and forestry personnel may be higher than in business service personnel. The probability of having a FHx increased from 16.3% to 19.0%, indicating that a FHx increases the risk of T2DM.

For heart rate, the probability of bradycardia increased from 1.83% to 2.30%, the probability of a normal heart rate was unchanged, and the probability of tachycardia decreased from 6.31% to 5.77%. The overall change was small, suggesting that bradycardia may increase the probability of T2DM, but only slightly. For dietary habits, the probabilities of eating no fruit or eating fruit fewer than 7 times/week, eating no sweets or eating sweets more than 7 times/week, eating meat 7–14 times/week or less, and eating no staple food or less than 3 liang/day all increased, indicating a higher likelihood of T2DM. The probability of occasional drinking decreased, suggesting that moderate drinking may reduce the probability of the disease, while the probability of smoking changed little. In addition, the probabilities of no exercise and of abnormal TG, HDL-C and LDL-C also increased, indicating a greater probability of T2DM.

Fig. 11

Diagnostic inference of BN model for T2DM influencing factors
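The shift in the age distribution under diagnostic reasoning can be related back to risk through Bayes' theorem: for each age group, P(age | T2DM) / P(age) = P(T2DM | age) / P(T2DM). The sketch below applies this identity to the age probabilities reported for the T2DM network; only the dictionary keys and variable names are introduced here.

```python
# Age distribution before and after setting T2DM to 100% (values from Fig. 11).
prior = {"18-44": 0.381, "45-59": 0.291, ">=60": 0.328}
posterior = {"18-44": 0.103, "45-59": 0.355, ">=60": 0.542}

# Risk of T2DM in each age group relative to the overall risk in the network.
relative_risk = {group: posterior[group] / prior[group] for group in prior}

for group, ratio in relative_risk.items():
    print(f"{group}: {ratio:.2f} times the overall probability of T2DM")
```

The ratios are about 0.27, 1.22 and 1.65 for the three age groups, so a ratio above 1 marks a group that is over-represented among patients, which is how the changes in node probabilities are interpreted in the text.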

The diagnostic inference of the BN model for the influencing factors of CAD is shown in Fig. 12. With the probability of CAD set to 100%, the probability of being 18–44 years old decreased from 37.2% to 4.41%, the probability of being 45–59 years old decreased from 23% to 22.6%, and the probability of being 60 years or older increased from 39.8% to 73.0%, indicating that advanced age significantly increases the risk of CAD. For region, the probability of rural residence increased from 18.9% to 19.8%, that of suburban residence increased from 7.55% to 7.68%, and that of urban residence decreased, indicating that the probability of CAD was higher in rural and suburban residents than in urban residents. The probability of agriculture and forestry occupations increased to 21.3%, indicating that this occupational group has a higher risk of CAD. For education level, the probability of college and above decreased, while the probabilities of illiteracy and primary school, and of middle school, high school or secondary school, increased, indicating that people with a low education level were more likely to develop the disease. For marital status, the probability of being married increased from 83.1% to 94.9%, indicating that the probability of CAD was higher in married people than in unmarried, divorced and widowed people. The probability of regular smoking increased from 13.8% to 17.3%, indicating that regular smokers had a higher risk of CAD. The probability of occasional drinking decreased, suggesting that moderate drinking may reduce the probability of disease. For dietary habits, the probabilities of eating no fruit or eating fruit fewer than 7 times/week, eating meat 7–14 times/week or less, and eating 3–6 liang or more of staple food per day increased, indicating a higher likelihood of CAD.

For sleep time, the probability of < 5 h/day or needing drugs to sleep increased from 11.1% to 20.3%, that of 5–7 h/day decreased from 47.6% to 44.2%, that of 7–9 h/day decreased from 36.7% to 27.5%, and that of ≥ 9 h/day increased from 4.53% to 8.01%. These results showed that both insufficient and excessive sleep increase the probability of CAD, and that the effect of insufficient sleep might be greater than that of excessive sleep. The probability of abnormal SBP increased from 27.2% to 46.1%, indicating that elevated blood pressure increases the incidence of CAD. The probabilities of no exercise, slow heart rate and abnormal TC, TG and LDL-C also increased, indicating that CAD was more likely.

Fig. 12

Diagnostic inference of BN model for CAD influencing factors

The diagnostic inference of the BN model for the influencing factors of comorbidity is shown in Fig. 13. With the probability of comorbidity set to 100%, the probability of being 18–44 years old decreased from 46.4% to 2.64%, that of being 45–59 years old increased from 22.6% to 23.2%, and that of being 60 years or older increased from 31% to 74.2%, showing that old age significantly increases the risk of comorbidities. The probabilities of rural and suburban residence increased to 10.9% and 8.58%, respectively, suggesting that people living in rural and suburban areas had a higher probability of disease. The probability of abnormal SBP increased from 23.5% to 46.9% and that of abnormal DBP increased from 10.1% to 15%, indicating that hypertension can significantly increase the incidence of comorbidities. The probability of abnormal FBG increased from 17.2% to 46.1%, suggesting that DM may accelerate the course of coronary heart disease and increase the probability of comorbidity. For exercise, the probability of no exercise increased from 36.3% to 44.7%, indicating that lack of exercise can increase the risk of comorbidity. For TC and HDL-C, the probability of an abnormal value increased by 0.5% and 13.4%, respectively, indicating that hyperlipidemia was one of the risk factors for comorbidities. In addition, the probabilities of eating no fruit or eating fruit fewer than 7 times/week, and of eating no sweets or eating sweets more than 7 times/week, also increased, indicating a greater likelihood of comorbidities.

Fig. 13

Diagnostic inference of BN model for comorbidity influencing factors

Sensitivity analysis quantifies how strongly the target node of a BN model depends on each factor: it measures the change in the target node caused by changes in local parameters of the network, and thereby identifies the sensitive factors in the model. Through the forward reasoning of the BN, sensitivity analysis was conducted for the target variables “T2DM”, “CAD” and “Comorbidity” to obtain the influence of each factor on the disease outcome. The results are expressed as the percentage of variance reduction, which reflects the influence of a specific variable on the target variable: the larger the percentage of variance reduction, the greater the influence of the input factor. The results are shown in Table 12. For T2DM, education level, age, region and occupation ranked highest among the influencing variables. For CAD, the most sensitive factors were age, education level, occupation, fruit intake and SBP. For comorbidities, age, FBG, education level, fruit intake and SBP had significant effects. By focusing prevention and control on these sensitive factors, the risk of disease can be effectively reduced.

Table 12 Sensitivity to target variable
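The percentage of variance reduction can be illustrated for a disease node coded 0/1. The sketch below assumes one common definition, namely the expected decrease in the variance of the target once the state of a factor is known, expressed as a percentage of the original variance; the exact formula applied by the software used in this study is not given in this section, and the probabilities are hypothetical.

```python
def variance_reduction_percent(p_factor, p_disease_given_factor):
    """Expected percentage drop in the variance of a 0/1 target after observing a factor."""
    p_disease = sum(p_factor[s] * p_disease_given_factor[s] for s in p_factor)
    var_before = p_disease * (1.0 - p_disease)
    var_after = sum(
        p_factor[s] * p_disease_given_factor[s] * (1.0 - p_disease_given_factor[s])
        for s in p_factor
    )
    return 100.0 * (var_before - var_after) / var_before


# Hypothetical factor with two equally likely states.
p_factor = {"normal": 0.5, "abnormal": 0.5}

# A factor that changes the disease probability strongly, and one that does not.
influential = variance_reduction_percent(p_factor, {"normal": 0.2, "abnormal": 0.8})
irrelevant = variance_reduction_percent(p_factor, {"normal": 0.4, "abnormal": 0.4})
print(f"influential factor: {influential:.1f}%")
print(f"irrelevant factor:  {irrelevant:.1f}%")
```

A factor whose states leave the disease probability unchanged yields a reduction of zero, which is why larger values are read as greater influence on the target variable.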
