Study population
The UK Biobank is a prospective study that recruited more than 500,000 participants aged 37 to 73 years at 22 sites across Wales, Scotland, and England from 2006 to 2010. At baseline, participants filled out a touchscreen questionnaire, completed a face-to-face interview and physical measurements, and provided biological samples45.
Participants with prevalent T2D at baseline were identified using electronic health records (ICD-10 code: E11) or an algorithm developed by the UK Biobank study. The algorithm identified T2D with 96% accuracy through medical and medication history46. In the present study, after excluding participants without valid baseline dietary data, with extreme dietary intake (males with <800 kcal/day or >4200 kcal/day or females with <600 kcal/day or >3500 kcal/day), or without baseline plasma metabolomic data, a total of 3597 participants were included to derive the metabolomic signature correlated with UPF intake (Fig. 3). For the analysis of the metabolomic signature in relation to diabetic microvascular complications, after further exclusion of participants with prevalent microvascular complications or CVD, 2477 participants were included.
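As an illustration of the energy-intake exclusion rule only, a minimal Python sketch is given below. The function name, the record layout, and the example participants are hypothetical; only the sex-specific bounds come from the text.

```python
def has_plausible_energy(sex: str, kcal_per_day: float) -> bool:
    """Return True when total energy intake lies inside the sex-specific range."""
    # Bounds follow the exclusion criteria stated in the text:
    # males <800 or >4200 kcal/day and females <600 or >3500 kcal/day are excluded.
    bounds = {"male": (800.0, 4200.0), "female": (600.0, 3500.0)}
    low, high = bounds[sex]
    return low <= kcal_per_day <= high


# Hypothetical participants, used only to show the filter in action.
participants = [
    {"id": 1, "sex": "male", "kcal": 750.0},
    {"id": 2, "sex": "female", "kcal": 2100.0},
    {"id": 3, "sex": "female", "kcal": 3600.0},
]
kept = [p["id"] for p in participants
        if has_plausible_energy(p["sex"], p["kcal"])]
print(kept)
```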
Participants with available baseline dietary and metabolomic data were randomly assigned to a training set (70%, n = 2517) and a testing set (30%, n = 1080). Elastic net regression, a regularized regression model that combines the Ridge and Lasso penalties, was used to construct a metabolomic signature related to UPF consumption. To make full use of the data in the training set, leave-one-out cross-validation was used to select optimal hyperparameters for elastic net regression. Based on the selected elastic net regression, a coefficient was assigned for each metabolite in the training set. The metabolomic signature was calculated as the weighted sum of metabolites with nonzero β coefficients in the training set. In the testing set, coefficients derived from the training set were applied, and the metabolomic signature was calculated in the same way.
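For reference, the elastic net objective can be written in one common parameterization, shown below. This form is an assumption, because the text does not state which software or parameterization was used to fit the model. Here $y_i$ denotes UPF intake for participant $i$, $x_{ij}$ the level of metabolite $j$, $\alpha$ the mixing weight between the Lasso and Ridge penalties, and $\lambda$ the overall penalty strength. The second line expresses the signature as the weighted sum over metabolites with nonzero coefficients.

```latex
\begin{align}
(\hat{\beta}_0, \hat{\beta}) &= \arg\min_{\beta_0,\,\beta}\;
  \frac{1}{2n}\sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^{2}
  + \lambda\Bigl[\alpha\,\lVert\beta\rVert_1 + \frac{1-\alpha}{2}\,\lVert\beta\rVert_2^{2}\Bigr] \\
\text{signature}_i &= \sum_{j\,:\,\hat{\beta}_j \neq 0} \hat{\beta}_j\, x_{ij}
\end{align}
```

With $\alpha = 1$ the penalty reduces to the Lasso, and with $\alpha = 0$ to Ridge. In the testing set, the same $\hat{\beta}_j$ estimated in the training set are reused in the second line.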
The UK Biobank was approved by the Community Health Index Advisory Group in Scotland, the North West Multi-Centre Research Ethics Committee, and the National Information Governance Board for Health and Social Care in England and Wales in accordance with the Declaration of Helsinki. All participants provided written informed consent. This study followed the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guideline (Supplementary Checklist).
Dietary assessment
Self-reported dietary data were collected through the Oxford WebQ, a web-based 24-h dietary recall, which records 32 beverages and 206 foods in the previous 24 h. Participants were invited to fill out the Oxford WebQ on five separate occasions, including a baseline assessment between 2009 and 2010 and four online cycles between 2011 and 2012 via e-mail. The Oxford WebQ has been validated by an interviewer-administered 24-h dietary recall, demonstrating similar recordings of daily nutrient intake and estimated energy47. In addition, the Oxford WebQ showed a high level of validity compared with the urinary biomarkers48.
Because the 24-h recalls were not always completed at the time of the baseline assessment, dietary recalls within 36 months of the baseline assessment were used to assess the baseline dietary intake level10,49. Dietary intakes were averaged for participants who had more than one 24-h dietary recall within 36 months. The quantity of each beverage or food consumed was computed by multiplying the number of servings consumed by the portion size of each beverage or food50. Nutrient and total energy intakes were then computed from these quantities.
The Nova food classification was applied to classify each beverage and food into one of four food groups based on the extent and purpose of food processing9: 1) unprocessed or minimally processed foods, e.g., milk and fresh vegetables; 2) processed culinary ingredients, e.g., salt and butter; 3) processed foods, e.g., canned fish and cheese; 4) UPFs, e.g., soft drinks and savory snacks. Detailed information about the Nova classification is available elsewhere9. Our study primarily focuses on the UPF group, and the specific details as well as several examples are shown in Supplementary Table 10. For each participant, the proportion of UPF was computed by dividing the summed weight of all food items in the UPF group (g/day) by the weight of total food consumption (g/day). We used the weight proportion (%) instead of the energy ratio because it can capture UPFs providing no or low energy (e.g., artificially sweetened beverages)51.
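The weight-based UPF proportion described above can be sketched in a few lines of Python. The function name and the example intake are hypothetical, and the input is assumed to be a list of (Nova group, grams per day) pairs already averaged across recalls.

```python
def upf_weight_proportion(items):
    """Percentage of total food weight (g/day) contributed by Nova group 4 items."""
    total = sum(grams for _, grams in items)
    if total <= 0:
        raise ValueError("total food weight must be positive")
    upf = sum(grams for group, grams in items if group == 4)
    return 100.0 * upf / total


# Hypothetical daily intake as (Nova group, grams per day) pairs.
# The 330 g item could be an artificially sweetened beverage: it adds
# almost no energy but still counts fully toward the weight proportion.
day = [(1, 300.0), (2, 10.0), (3, 90.0), (4, 330.0), (4, 70.0)]
print(upf_weight_proportion(day))
```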
Metabolomic measurement
Metabolomic profiles of ethylenediaminetetraacetic acid (EDTA) plasma samples were measured using high-throughput nuclear magnetic resonance (NMR) spectroscopy in Nightingale Health’s laboratories among random subsets of the full cohort. Phases 1 and 2 have been completed, and metabolomic data are currently available for ~292,000 participants, of which 275,000 are from baseline recruitment. The samples were stored at −80 °C before preparation, slowly thawed at +4 °C overnight, and then centrifuged (3400 × g) at +4 °C for 3 min. Aliquots of each sample were transferred into NMR tubes and mixed with a phosphate buffer. The NMR spectra of each sample were recorded using a 500 MHz NMR spectrometer (Bruker AVANCE IIIHD), and the metabolomic biomarkers were quantified with Nightingale Health’s proprietary software. In each well plate, two blind duplicate samples and two control samples were included to monitor the consistency across multiple spectrometers. The coefficients of variation (CVs) were below 5% for most metabolomic biomarkers. Further details of the experimentation and the NMR platform can be found elsewhere52.
A total of 251 metabolomic measures, including 170 absolute levels and 81 ratio measures, were quantified for each plasma sample, covering lipid concentrations and compositions of 14 lipoprotein subclasses, fatty acids, and various low-molecular-weight metabolites (e.g., ketones, amino acids, and glycolysis metabolites). Technical variation in the metabolomic data (i.e., sample preparation time, the position of samples in shipping plates, temporal drift within the spectrometer, and outlier shipping plates) was removed using the R package “ukbnmr”53. Specifically, the technical variation was removed by regressing on the time elapsed between sample preparation and sample measurement, plate row, plate column, and plates grouped by date within each of six spectrometers. Furthermore, outlier plates were systematically identified and removed. This strategy to remove technical variation was also applied in previous studies conducted in the UK Biobank54,55. In the present study, we excluded ratio measures (e.g., Phospholipids to Total Lipids) and the original alanine measure, which was replaced by spectrometer-corrected alanine, leaving 169 metabolites in absolute levels for the final analysis. Detailed information on these metabolites is provided in Supplementary Table 11. Values flagged as “below the limit of quantification” were replaced, for each metabolite, by half of the minimum non-missing value, and the remaining missing values were then imputed using random forest imputation, which has been recommended and used in metabolomic analyses56,57.
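The first imputation step (replacing values below the limit of quantification) can be illustrated with the following Python sketch for a single metabolite. The function name and data layout are hypothetical: missing values are represented as `None`, and a parallel list of flags marks which entries were below the limit of quantification. Other missing values are left untouched here, since the random forest imputation is a separate step that is not shown.

```python
def replace_below_loq(values, below_loq_flags):
    """Replace flagged entries with half of the minimum non-missing value."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no non-missing values to derive a replacement from")
    half_min = min(observed) / 2.0
    return [half_min if flagged else v
            for v, flagged in zip(values, below_loq_flags)]


# Hypothetical measurements of one metabolite across five participants.
values = [0.8, None, 0.4, None, 1.2]
flags = [False, True, False, False, False]
print(replace_below_loq(values, flags))
```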
Outcome ascertainment
Diagnoses of incident diabetic microvascular complications were obtained from two sources of health records: hospital inpatient records and death registries. ICD-10 codes were used to define diabetic neuropathy (E114, E144, G590, G629, G632, G990), diabetic retinopathy (E113, E143, H280, H360), and diabetic kidney disease (E112, E142, N180, N181, N182, N183, N184, N185, N188, N189)58,59. Health records were available until 25 September 2021 for Scotland, 1 November 2022 for England, and 29 May 2021 for Wales.
Covariate assessment
Information about age, race, sex, education level, smoking status, leisure-time physical activity, drinking status, and family history of CVD was collected via a touchscreen questionnaire at baseline. The Townsend Deprivation Index (TDI), which reflects socioeconomic status, was calculated based on national census data according to postcodes of residence60. Leisure-time physical activity was assessed through the long-form version of the International Physical Activity Questionnaire, and weekly metabolic equivalent minutes (MET-min/week) were computed. Body weight and height were measured by trained nurses, and BMI was computed as body weight (kg) divided by the square of height (m²). A healthy diet score was calculated to evaluate the overall diet quality according to a previous UK Biobank study61, which considered adequate consumption of healthy food (whole grains, fruit, vegetables, seafood, vegetable oils, and dairy) and reduced consumption of unhealthy food (refined grains, sugar-sweetened beverages, processed meats, and unprocessed meats). The detailed scoring method for the healthy diet score can be found in Supplementary Table 12. The medical history of hyperlipidemia, hypertension, cancer, and CVD, and the duration of T2D, were ascertained through verbal interviews, electronic health records, and questionnaires. Information on medication for T2D was collected by verbal interviews and questionnaires.
Statistical analysis
The association between UPF intake and each metabolite (log1p-transformed and z score standardized) was assessed using multivariable linear regression62,63, with P values < 0.05 after Benjamini–Hochberg FDR correction considered statistically significant. To identify the metabolomic signature of UPF intake, all participants were randomly assigned to a training set or a testing set in a ratio of 7:3 (Fig. 3). In the training set, we applied an elastic net model, which is a regularized regression model combining the Lasso and Ridge penalties, to derive a metabolomic signature reflecting UPF intake based on metabolites passing the threshold after FDR correction and to avoid collinearity among metabolites64. Two tuning parameters for the elastic net model were determined: α (balancing the Lasso and Ridge penalties) and λ (the penalty intensity parameter). To select core metabolites and achieve sparsity, α was chosen from 0.50, 0.75, and 1.00 (ref. 62). We then used a leave-one-out cross-validation framework to select the optimal λ that achieved the minimum mean squared error15. In the final model, α = 0.50 and λ = 0.12298 were selected. The metabolomic signature was computed as the weighted sum of the chosen metabolites, with weights being the coefficients from the elastic net regression. In the testing set, the weights derived from the training set were applied to compute the metabolomic signature. Spearman correlation coefficients were calculated to assess the correlation between UPF intake and the metabolomic signature in the training and testing sets.
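The metabolite pre-screening step relies on Benjamini–Hochberg FDR correction. A minimal Python sketch of the standard step-up adjustment is given below to show how adjusted P values are obtained from raw ones. The function name and the example P values are hypothetical and are not taken from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four metabolites.
raw = [0.001, 0.04, 0.03, 0.5]
adj = benjamini_hochberg(raw)
# Metabolites whose adjusted p-value is below 0.05 would pass the screen.
passing = [i for i, q in enumerate(adj) if q < 0.05]
print(adj, passing)
```

In this toy input, two raw P values below 0.05 no longer pass after adjustment, which is the behavior the screening threshold is meant to enforce.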
UPF intake and the metabolomic signature were converted to z scores (mean = 0, SD = 1) to ensure comparability. Cox proportional hazards regression models were applied to estimate the HRs and 95% CIs for associations of UPF intake and the metabolomic signature with risks of total and individual microvascular complications, using the combined data from the training and testing sets (Fig. 3). Follow-up time was computed from the recruitment date until death, the diagnosis of microvascular complications, or the end of follow-up, whichever came first. We assessed the proportional hazards assumption by including a product term of UPF intake or the metabolomic signature with follow-up time and found no significant violation.
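The follow-up rule ("whichever came first") can be sketched in Python as follows. The function name, its arguments, and the conversion to years by dividing by 365.25 are assumptions made for illustration. The country-specific end dates are those reported under outcome ascertainment.

```python
from datetime import date

# End of available health records by country, as reported in the text.
END_OF_FOLLOW_UP = {
    "England": date(2022, 11, 1),
    "Scotland": date(2021, 9, 25),
    "Wales": date(2021, 5, 29),
}


def follow_up_years(recruited, country, diagnosed=None, died=None):
    """Return (follow-up time in years, event indicator) for one participant."""
    candidates = [END_OF_FOLLOW_UP[country], diagnosed, died]
    # Follow-up stops at the earliest of diagnosis, death, or end of records.
    end = min(d for d in candidates if d is not None)
    # The event counts only when the diagnosis is what ended follow-up.
    event = diagnosed is not None and diagnosed == end
    return (end - recruited).days / 365.25, event


# Hypothetical participants.
print(follow_up_years(date(2008, 3, 1), "Wales", diagnosed=date(2015, 3, 1)))
print(follow_up_years(date(2008, 3, 1), "England"))
```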
Two models were built in this study. In Model 1, we adjusted for age (continuous, y), total energy intake (continuous, kcal/d), and sex (female, male). In Model 2, we additionally adjusted for race (White participants, Asian or Asian British participants, Mixed participants, and Black or Black British participants), smoking status (never, past, current), drinking status (never, past, current light to moderate [1–28 g ethanol per day for males or 1–14 g ethanol per day for females], current heavy [>28 g ethanol per day for males or >14 g ethanol per day for females]), education level (college/university degree, other degrees), TDI (continuous), leisure-time physical activity (continuous, MET-min/wk), history of hyperlipidemia, hypertension, and cancer (yes, no), duration of T2D (continuous, y), family history of CVD (yes, no), use of glucose-lowering medications (none, only oral medicine, insulin, and others), and the number of 24-h dietary recalls (continuous). To maximize data availability, we imputed missing covariates by generating 10 imputed datasets using multiple imputation. To test whether associations between the metabolomic signature and risks of diabetic microvascular complications were attributable to its correlation with UPF intake, we further included both UPF intake and the metabolomic signature in Model 2. To assess the predictive performance of the metabolomic signature for composite microvascular complications, we calculated the C statistic, continuous NRI, and absolute IDI65. The basic model was based on all covariates in Model 2. UPF intake, UPF-related traditional biomarkers, and the metabolomic signature were then added to this model successively.
The traditional biomarkers included renal function (urate and urea), lipid profile (total cholesterol [TC], LDL [low-density lipoprotein] cholesterol, HDL cholesterol, triglycerides, apolipoprotein B, and apolipoprotein A), inflammation (white blood cell count and C-reactive protein), liver function (alkaline phosphatase [ALP], alanine aminotransferase [ALT], aspartate aminotransferase [AST], total bilirubin, total protein, and gamma glutamyltransferase [GGT]), blood pressure (systolic and diastolic blood pressure), and glycated hemoglobin A1c (HbA1c)10,59. When UPF intake was significantly associated with a biomarker in the multivariable-adjusted linear regression model, it was selected as a UPF intake-related traditional biomarker (Supplementary Table 13).
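To make the reclassification metrics used for predictive performance more concrete, the sketch below computes the continuous NRI and absolute IDI in their simplest binary-outcome form, from predicted risks under a base model and an extended model. This is an illustrative simplification: the text does not state which estimator was used, and survival-data versions that account for censoring differ from this form. All names and numbers are hypothetical.

```python
def continuous_nri(old_risk, new_risk, event):
    """Continuous (category-free) NRI for a binary outcome."""
    up_e = down_e = up_n = down_n = 0
    for old, new, e in zip(old_risk, new_risk, event):
        if new > old:
            up_e, up_n = up_e + e, up_n + (1 - e)
        elif new < old:
            down_e, down_n = down_e + e, down_n + (1 - e)
    n_events = sum(event)
    n_nonevents = len(event) - n_events
    # Events should move up in risk; non-events should move down.
    return (up_e - down_e) / n_events + (down_n - up_n) / n_nonevents


def absolute_idi(old_risk, new_risk, event):
    """Absolute IDI: change in mean risk among events minus that among non-events."""
    diff_e = [n - o for o, n, e in zip(old_risk, new_risk, event) if e == 1]
    diff_n = [n - o for o, n, e in zip(old_risk, new_risk, event) if e == 0]
    return sum(diff_e) / len(diff_e) - sum(diff_n) / len(diff_n)


# Hypothetical predicted risks for four participants (1 = event, 0 = no event).
event = [1, 1, 0, 0]
old = [0.4, 0.5, 0.3, 0.2]
new = [0.6, 0.7, 0.1, 0.3]
print(continuous_nri(old, new, event), absolute_idi(old, new, event))
```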
Among the chosen metabolites in the metabolomic signature, potential mediating metabolites of associations between UPF intake and risks of diabetic microvascular complications were identified based on the mediation principle that mediators are associated with both the exposure and the outcome66. These criteria were evaluated in the multivariable linear regression models (Fig. 1 and Supplementary Table 1) and multivariable Cox regression models (P values < 0.05 after FDR adjustment considered significant; Supplementary Table 14). The %MEDIATE macro in SAS was used to compute the proportion of the diet-disease association that could be mediated by the metabolomic signature and individual metabolites.
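As a rough guide to what a "proportion mediated" means, one widely used difference-method definition is given below. Whether the macro's output corresponds exactly to this expression is an assumption here, and the symbols are introduced only for illustration: $\hat{\beta}_{\text{total}}$ is the log hazard ratio for UPF intake from a Cox model without the candidate mediator, and $\hat{\beta}_{\text{direct}}$ is the corresponding log hazard ratio after the mediator is added.

```latex
\begin{equation}
\text{proportion mediated (\%)} =
  \frac{\hat{\beta}_{\text{total}} - \hat{\beta}_{\text{direct}}}{\hat{\beta}_{\text{total}}} \times 100
\end{equation}
```

Under this definition, a mediator that fully accounts for the association drives $\hat{\beta}_{\text{direct}}$ to zero and yields 100%, whereas a variable that leaves the coefficient unchanged yields 0%.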
Stratified analyses were conducted according to sex (female, male), age (≤60, >60 y), BMI (<30, ≥30 kg/m²), and the number of dietary recalls (1, ≥2). The interactions of stratification factors with UPF intake or the metabolomic signature on the risk of outcomes were tested using the Wald test after adding product terms to Model 2. Several sensitivity analyses were conducted to test the robustness of these findings. First, we excluded participants who developed diabetic microvascular complications within the first 3 years of follow-up. Second, considering that BMI is an important mediator linking UPF intake and diabetic microvascular complications10, we did not adjust for it in the main analysis but further adjusted for it as a sensitivity analysis to verify the identified associations. Third, we further adjusted for the healthy diet score. Fourth, we investigated the association of UPF intake and the metabolomic signature with the risk of diabetic kidney disease after additionally adjusting for baseline eGFR. Fifth, we removed glucose-lowering medications, which might be a confounder or a mediator, from the multivariable model. Sixth, to test whether the pre-screening process affected the results, we applied an elastic net model to all 169 metabolites instead of only those metabolites significantly associated with UPF intake. Seventh, to further validate the constructed metabolomic signature, we applied it to individuals with baseline metabolomic data but without valid baseline dietary data and assessed its associations with risks of diabetic microvascular complications. Eighth, we used an 80:20 split (80% of the data for training and 20% for testing) to derive a metabolomic signature67. Finally, we repeated the analysis with UPF intake measured as the proportion of energy.
All analyses were performed using SAS version 9.4 (SAS Institute) and R software (version 4.4.3). Unless FDR correction was specified, a two-sided P < 0.05 was considered statistically significant.