Modifiable risk factors and plasma proteomics in relation to complications of type 2 diabetes

Ethics statement

The UK Biobank received ethical approval from the North West Multi-Centre Research Ethics Committee. All participants provided written informed consent. This research was conducted under UK Biobank application number 109546.

Study population

The UK Biobank is a large population-based cohort study, which recruited around half a million participants aged 37–73 years in 2006–2010 across England, Scotland, and Wales. Each participant completed touchscreen questionnaires, underwent physical examinations, and provided biological samples44.

Prevalent cases of type 2 diabetes were identified using a validated algorithm developed by UK Biobank, which draws on self-reported medical history and medication information at baseline and has been shown to be reliable, with 96% accuracy45. A total of 24,225 participants with type 2 diabetes were identified. After exclusion of participants with incomplete data on modifiable risk factors (n = 10,123), 14,102 were included in the association analyses. After further exclusion of those without proteome data, 1287 participants with type 2 diabetes were included in the proteomic and mediation analyses (Fig. 1). Condition-specific exclusions were also applied: participants with a previous diagnosis of a given disease at baseline were excluded from the analysis of that disease.

Definition of cardiovascular health score and degree of risk factor control

The cardiovascular health score was calculated based on the LE8 metrics defined by the American Heart Association in 2022 (ref. 46). In brief, the components of LE8 include four health behaviors (diet, physical activity, nicotine exposure, and sleep duration) and four cardiometabolic factors (BMI, blood lipids, blood glucose, and blood pressure). Each component metric score ranges from 0 to 100 points, with a higher score indicating a better lifestyle or cardiometabolic status. Data on diet, physical activity, nicotine exposure, and sleep duration were self-reported. Dietary quality was evaluated using a recommendation for cardiovascular health that considered consumption of fruits, vegetables, whole grains, refined grains, fish, dairy products, vegetable oils, processed meats, unprocessed meats, and sugar-sweetened beverages47. Physical activity was evaluated according to the total duration of moderate or vigorous physical activity per week. Nicotine exposure was evaluated based on self-reported use of cigarettes, smoking cessation, and secondhand smoke exposure. Sleep health was assessed according to average hours of sleep per night. Weight and height were measured during physical examination. BMI was calculated as weight in kilograms divided by the square of height in meters. Random blood samples were drawn at baseline for blood biomarkers, including total cholesterol, high-density lipoprotein (HDL) cholesterol, glucose, and glycosylated hemoglobin (HbA1c); these biomarkers have been externally validated in the UK Biobank48. Serum total and HDL cholesterol levels and self-reported use of antihyperlipidemic medication were used to evaluate the blood lipid metric. Serum glucose levels (among those with fasting time >8 h) and HbA1c were used to evaluate the blood glucose metric. The blood pressure metric was based on systolic and diastolic blood pressures (SBP, DBP) measured during physical examination and self-reported use of antihypertensive medication. Detailed information is shown in Supplementary Table 18. All self-reported information, physical examination data, and blood samples for each participant were obtained at the initial assessment visit. The cardiovascular health score was calculated as the average of the eight component metric scores and was analyzed both as a continuous score and in quartiles.
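As a minimal sketch of the scoring step described above, the overall score can be computed as the unweighted mean of the eight component scores and then grouped into quartiles. The component names below are illustrative labels, not UK Biobank field names, and the rank-based quartile assignment is an assumption about how the cut points were formed.

```python
from statistics import mean

# Illustrative labels for the eight LE8 components (assumed names).
LE8_COMPONENTS = (
    "diet", "physical_activity", "nicotine_exposure", "sleep",
    "bmi", "blood_lipids", "blood_glucose", "blood_pressure",
)

def cardiovascular_health_score(metrics):
    """Average the eight component scores (each 0-100) into one overall score."""
    values = [metrics[name] for name in LE8_COMPONENTS]
    if any(not 0 <= v <= 100 for v in values):
        raise ValueError("each component score must lie between 0 and 100")
    return mean(values)

def assign_quartiles(scores):
    """Return a quartile label (1-4) per score, based on rank.

    Tied scores are split by their position in the input, which is a
    simplification; a real analysis would use percentile cut points.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    labels = [0] * len(scores)
    for rank, idx in enumerate(order):
        labels[idx] = rank * 4 // len(scores) + 1
    return labels
```

A participant scoring 50 on every component would receive an overall score of 50, and the quartile labels can then be used as a categorical exposure.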

In addition, we defined the degree of risk factor control according to the target ranges of these eight risk factors, based on guidelines for diabetes7,8,49,50 combined with the LE8 definition: healthy diet (adherence to at least five items of the recommendations); at least 150 min of physical activity each week; non-current smoker; sleep duration ≥7 h and <9 h; BMI <25 kg/m²; HbA1c <7%; non-HDL cholesterol <130 mg/dL; and SBP <140 mmHg and DBP <90 mmHg. A detailed definition of risk factor control can be found in Supplementary Table 19. Each risk factor within its target range receives 1 point, giving a total risk factor control score ranging from 0 to 8, with a higher score indicating more risk factors within the target range.
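The eight targets listed above translate directly into a point count. The following sketch assumes a participant record stored as a dictionary; the key names are hypothetical, and the thresholds are those stated in the text.

```python
def risk_factor_control_score(p):
    """Count how many of the eight risk factors are within target (0-8).

    `p` is a dict with hypothetical keys; thresholds follow the text.
    """
    on_target = [
        p["diet_items_met"] >= 5,           # healthy diet: at least 5 items
        p["mvpa_min_per_week"] >= 150,      # physical activity
        not p["current_smoker"],            # non-current smoker
        7 <= p["sleep_hours"] < 9,          # sleep duration
        p["bmi"] < 25,                      # kg/m^2
        p["hba1c_percent"] < 7,             # HbA1c
        p["non_hdl_mg_dl"] < 130,           # non-HDL cholesterol
        p["sbp"] < 140 and p["dbp"] < 90,   # blood pressure (both required)
    ]
    return sum(on_target)
```

Note that the blood pressure item requires both SBP and DBP to be below target, so it contributes a single point.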

Proteomics data

The UK Biobank Pharma Proteomics Project includes 53,017 participants, among whom 46,788 individuals were randomly selected from the baseline cohort. The randomly selected samples are highly representative of the overall UK Biobank population and were included in our analyses51. Details of data processing and quality control have been described on the UK Biobank online resource52,53. In brief, the plasma samples were stored in a −80 °C freezer before being shipped on dry ice to Olink Analysis Service in Sweden. The Proximity Extension Assay, in combination with next-generation sequencing, was used to measure in parallel the relative concentrations of 2923 unique proteins on the Olink proteomics platform. Measurements are expressed as normalized protein expression values (log2-transformed). After excluding eight proteins with >20% missing values, a total of 2915 proteins were included in the proteomic analyses. All protein levels were standardized in the analyses.
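The filtering and standardization steps can be sketched as follows. This is an illustration only: it assumes missing measurements are stored as `None` and that "standardized" means a z-score (mean 0, standard deviation 1) computed over observed values, which the text does not state explicitly.

```python
from statistics import mean, stdev

def filter_and_standardize(npx, max_missing=0.20):
    """Drop proteins with too many missing values and z-score the rest.

    `npx` maps a protein name to a list of normalized protein expression
    values, with None marking a missing measurement (assumed layout).
    A protein with constant observed values would need separate handling.
    """
    kept = {}
    for protein, values in npx.items():
        missing_fraction = sum(v is None for v in values) / len(values)
        if missing_fraction > max_missing:
            continue  # excluded: more than 20% missing
        observed = [v for v in values if v is not None]
        mu, sd = mean(observed), stdev(observed)
        kept[protein] = [None if v is None else (v - mu) / sd for v in values]
    return kept
```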

Ascertainment of outcomes

Based on UK Biobank data and previous research on complications of type 2 diabetes, ten types of complications were considered23,24. The primary analyses covered six types of common complications: macrovascular diseases, microvascular diseases, cancer, neurological and mental diseases, respiratory diseases, and mortality. The secondary analyses covered disorders of the digestive, endocrine, genitourinary, and musculoskeletal systems. The International Classification of Diseases, 10th Revision was used to define outcomes (Supplementary Table 20). Information on cancer and death was obtained from cancer registry data and death registry data. Other incident outcomes were identified through linkage with hospital admissions data. Death data were available up to 12 November 2021 for all participants. The electronic health records were available up to 1 November 2022, 25 September 2021, and 29 May 2021 for centers in England, Scotland, and Wales, respectively. Participants were censored at the time of event onset, death, loss to follow-up, or end of follow-up, whichever occurred first.
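The censoring rule can be illustrated with a short sketch for outcomes ascertained from hospital records. The function and argument names are hypothetical; the region-specific end dates are those given in the text, and mortality outcomes would instead use the death registry end date.

```python
from datetime import date

# End of hospital record coverage by region, as stated in the text.
RECORD_END = {
    "England": date(2022, 11, 1),
    "Scotland": date(2021, 9, 25),
    "Wales": date(2021, 5, 29),
}

def follow_up(baseline, region, event=None, death=None, lost=None):
    """Return (days_of_follow_up, event_observed) for one participant.

    Follow-up stops at the earliest of event onset, death, loss to
    follow-up, or the end of record coverage for the region.
    """
    candidates = (event, death, lost, RECORD_END[region])
    end = min(d for d in candidates if d is not None)
    observed = event is not None and event <= end
    return (end - baseline).days, observed
```

For example, a participant recruited in England on 1 January 2008 with an event on 1 January 2010 contributes 731 days and an observed event, whereas an event dated after the regional end date is treated as censored.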

Assessment of covariates

Demographic factors (age, sex, and ethnicity), socioeconomic factors (household income, educational attainment, employment status, and Townsend deprivation index), alcohol consumption, and diabetes-related factors (diabetes duration and use of diabetes medication) were collected at baseline through a touchscreen questionnaire and nurse-led interviews. Household income before tax was classified into five groups: <£18 000, £18 000–£30 999, £31 000–£51 999, £52 000–£100 000, and >£100 000. Educational attainment was classified into three groups: high (college or university degree), intermediate (A levels, AS levels, or equivalent; O levels, GCSEs, or equivalent; CSEs or equivalent; NVQ, HND, HNC, or equivalent; other professional qualifications), and low qualification (none of the above). Employment status was divided into employed (those in paid employment or self-employed, retired, doing unpaid or voluntary work, or being full- or part-time students) and unemployed (those unemployed, looking after home and/or family, or unable to work because of sickness or disability). Townsend deprivation index scores represented the level of socioeconomic deprivation. Moderate drinking was defined as alcohol consumption of 1–14 g per day for women or 1–28 g per day for men54.

Statistical analysis

Baseline characteristics were compared across quartiles of cardiovascular health score using the Chi-squared test or t-test (Wilcoxon rank-sum test for non-normally distributed continuous variables).

Multivariable Cox proportional hazards regression models were used to calculate the HRs and 95% CIs for the associations of cardiovascular health score (as a continuous score and in quartiles) and degree of risk factor control (as a continuous score) with primary and secondary outcomes in the main cohort (n = 14,102). Schoenfeld residuals were used to test the proportional hazards assumption, and no violation was observed. The models were adjusted for age (continuous, years), sex (male, female), ethnicity (white, non-white), Townsend deprivation index (continuous), educational attainment (high, intermediate, and low qualifications), employment status (employed, unemployed), household income before tax (<£18 000, £18 000–£30 999, £31 000–£51 999, £52 000–£100 000, >£100 000), moderate alcohol consumption (yes, no), diabetes duration (continuous, years), and diabetes medication use (none, oral hypoglycemic drugs only, insulin therapy and others). Missing values for each covariate were <20% and were imputed using multiple imputation with 5 imputations (SAS PROC MI). The results from the Cox regression analyses were pooled using Rubin's rules. In addition, a restricted cubic spline model with three knots (10th, 50th, and 90th percentiles) was fitted to explore the dose-response relationship between cardiovascular health score and risk of primary and secondary outcomes.
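To make the pooling step concrete, the sketch below combines log hazard ratios from several imputed datasets using Rubin's rules. It is an illustration rather than the SAS procedure used in the study, and the confidence interval uses a normal quantile (1.96) instead of the t-based degrees of freedom, which is a simplifying assumption.

```python
from math import exp, sqrt

def pool_rubin(estimates, variances):
    """Pool per-imputation log hazard ratios and their variances.

    Returns the pooled estimate and its standard error, where the total
    variance is the within-imputation variance plus (1 + 1/m) times the
    between-imputation variance.
    """
    m = len(estimates)
    q_bar = sum(estimates) / m
    within = sum(variances) / m
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, sqrt(total)

def hazard_ratio_ci(log_hr, se, z=1.96):
    """Convert a log hazard ratio and standard error to (HR, lower, upper)."""
    return exp(log_hr), exp(log_hr - z * se), exp(log_hr + z * se)
```

With 5 imputations, the between-imputation variance is inflated by a factor of 1.2 before being added to the average within-imputation variance.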

Stratified analyses were performed to examine the association of cardiovascular health score and degree of risk factor control with risks of outcomes by age (<60 years, ≥60 years) and sex (male, female). Potential effect modification by the stratification factors was examined by testing the corresponding multiplicative interaction terms. In sensitivity analyses, we used a competing risk model to account for the competing risk of death. To reduce the risk of reverse causation, which might arise from the influence of pre-existing or latent conditions on the modifiable risk factors at baseline, we repeated the analyses after excluding individuals who died or developed endpoints within the first year of follow-up. Additionally, 42% of participants with type 2 diabetes were excluded from our analyses because of missing values for any modifiable risk factor, potentially leading to selection bias. To address this concern, we used multiple imputation to impute missing information on risk factors and then repeated the above analyses to test the robustness of our results.

All analyses below relating to plasma proteomics were carried out in the proteomic subset (n = 1287). To build the proteomic profile of cardiovascular health score, multiple linear regression models were used to assess the association between each of the 2915 proteins and cardiovascular health score, adjusting for the same confounders as the Cox model. Bonferroni correction was applied for multiple testing. Further, we conducted KEGG and GO enrichment analyses to elucidate the pathways and biological processes related to the proteins derived from the previous step55,56, using the online software Hiplot (https://hiplot.com.cn/). Hiplot is a comprehensive data computing and visualization cloud platform based on the R language. The GO and KEGG enrichment analyses in Hiplot were produced by the clusterProfiler package. The KEGG database, with the species set to Homo sapiens, was used to analyze the relevant genes. GO analysis covers biological process (BP), cellular component (CC), and molecular function (MF). The rankings of relevant pathways and BP, CC, or MF terms were based on the p-value, which represents the statistical significance of the enrichment observed for a particular KEGG pathway or GO term.
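The multiple-testing step can be sketched as below, assuming the per-protein p-values from the linear regressions are already available in a dictionary (a hypothetical layout). With 2915 proteins and a family-wise level of 0.05, the per-test threshold is 0.05/2915, roughly 1.7 × 10⁻⁵.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return proteins passing the Bonferroni threshold, plus the threshold.

    `p_values` maps a protein name to its regression p-value; the
    threshold is alpha divided by the number of tests performed.
    """
    threshold = alpha / len(p_values)
    selected = {name for name, p in p_values.items() if p < threshold}
    return selected, threshold
```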

To further select representative proteins for mediation analyses, LASSO regression (R package glmnet) was used to select candidate proteins from the proteomic profile of cardiovascular health score57. Ten-fold cross-validation was performed to identify the optimal tuning parameter, lambda.min. First, the original dataset was randomly divided into ten subsets of equal size. Second, nine of the subsets were used as the training set to fit the model, while the remaining subset served as the testing set to evaluate the model's performance. The training and testing process was repeated ten times, with a different testing set each time. The final estimate of the model's performance was calculated as the average of these ten performance metrics. Because the small sample size of the proteomic subset might affect the stability of results and the cross-validation estimator can have high variance, we set 1000 seeds (ranging from 1 to 1000) and repeated the LASSO regression 1000 times to prevent overfitting and promote stability. Only the proteins with 100% repeatability (non-zero coefficients in all 1000 LASSO regressions) were selected as representative proteins for further analyses58,59. Additionally, we performed a sensitivity analysis using elastic net regression, which applies a weaker penalty to coefficients, to select proteins60.
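The fold assignment and the repeatability filter can be sketched as follows. This is not the glmnet workflow itself: `select` is a hypothetical callable standing in for one seeded cross-validated LASSO fit that returns the names of proteins with non-zero coefficients. When the sample size is not divisible by ten, as with n = 1287, the folds here differ in size by at most one.

```python
import random

def ten_fold_indices(n, seed, k=10):
    """Shuffle indices 0..n-1 with the given seed and deal them into k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def stable_proteins(select, seeds=range(1, 1001)):
    """Keep only proteins selected in every seeded run (100% repeatability).

    `select` is a hypothetical function: seed -> iterable of protein
    names with non-zero coefficients in that run.
    """
    runs = [set(select(seed)) for seed in seeds]
    return set.intersection(*runs)
```

Each fold serves once as the testing set while the other nine form the training set; a protein that drops out in even one of the 1000 runs is excluded.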

Mediation analysis was performed to evaluate the mediating effect of the representative proteins on the associations of combined modifiable risk factors with the risk of common complications of type 2 diabetes. Given the limited sample size of the proteomic subset, we first estimated statistical power using PASS (version 15.0.5) to ensure that these analyses had sufficient sensitivity to detect true associations at an adjusted significance level61. Specifically, the power calculation was based on the following parameters: sample size (N), effect size (HR), significance level (α), event rate (P), and a two-sided test. The sample size was set to 1200, and the hazard ratio was set to 0.75, as estimated from the Cox regression in the main cohort. Under these settings, associations between cardiovascular health score and 15 outcomes with an event rate of at least 6.42% (corresponding to 77 cases) reached 80% power at a significance level of 0.05/15. Therefore, these 15 outcomes (i.e., MACE, ischemic heart diseases, atrial fibrillation, heart failure, peripheral artery disease, diabetic kidney diseases, diabetic retinopathy, stroke, ischemic stroke, all-site cancer, depression, COPD, all-cause mortality, CVD mortality, and cancer mortality) were taken forward for mediation analysis. The difference method, which compares estimates from models with and without the hypothesized mediator, was used to calculate the mediation proportion (SAS macro %mediate)62. The mediation analyses were adjusted for the same covariates as the Cox regression. Because the SAS macro %mediate does not work with multiple data sets, the first imputed dataset was used in the mediation analysis. In addition, to test the robustness of the mediation effect, a sensitivity analysis was conducted using the medsens function from the R package mediation63.
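The arithmetic behind the difference method and the outcome selection rule can be sketched as below. This is a simplified illustration, not the %mediate macro: it assumes the mediation proportion is computed on the log hazard ratio scale as the relative drop in the exposure coefficient after adding the mediator, and it omits confidence interval estimation.

```python
from math import log

def mediation_proportion(beta_total, beta_direct):
    """Share of the total effect explained by the mediator.

    beta_total:  exposure coefficient (log HR) without the mediator.
    beta_direct: exposure coefficient (log HR) with the mediator added.
    """
    if beta_total == 0:
        raise ValueError("total effect is zero; proportion is undefined")
    return (beta_total - beta_direct) / beta_total

def minimum_cases(event_rate, sample_size):
    """Number of cases implied by an event rate in a given sample."""
    return round(event_rate * sample_size)

# Example: HR 0.75 without the mediator, 0.80 with it (illustrative values).
example = mediation_proportion(log(0.75), log(0.80))
```

Using the parameters in the text, an event rate of 6.42% in a sample of 1200 corresponds to 77 cases, and the adjusted significance level is 0.05/15, about 0.0033.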

All analyses were performed using SAS (version 9.4; SAS Institute, Cary, NC) and R software (version 4.3.2; R Foundation for Statistical Computing) unless otherwise specified. All p-values were based on two-sided tests, and an FDR-corrected or Bonferroni-corrected p-value < 0.05 was considered statistically significant.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
