Study design
In the discovery phase, participants were enrolled from the Men’s LVS (MLVS) and Women’s LVS (WLVS), the goal of which was to validate self-reported diet and lifestyle through the use of 7DDRs and objective biomarkers33. The MLVS was conducted in 2011–2013 within the HPFS cohort and the Harvard Pilgrim Health Care cohort. The WLVS was conducted in 2010–2012 among selected participants from the NHS and NHSII. All LVS participants (including MLVS and WLVS) were free of a history of chronic diseases as per study protocol. In all, 1,196 LVS participants who completed 7DDR assessments and had existing metabolomics data were included in the current analyses.
In the external replication phase, participants were from the NPAAS-FS, which involved 153 participants of the WHI cohort. The NPAAS-FS was conducted in 2011–2013. The study targeted postmenopausal women who were free from major medical conditions. This 2-week feeding study provided participants with meals that were prepared according to each participant’s habitual diet, assessed using a 4-day diet record as a starting point for individualizing diet specifications. A total of 153 women completed the feeding study and attended two clinic visits34. Blood samples were collected after the 2-week controlled feeding period, which was designed to mimic participants’ usual diets, ensuring stable biomarker concentrations and retention of the variation in intake34.
For the cohort analysis of metabolomic profiles with incident T2D, participants were from NHS, NHSII and HPFS cohorts. In brief, blood samples were collected from 32,826 NHS participants during 1989–1990, 29,611 NHSII participants during 1996–1999 and 18,225 HPFS participants during 1993–1995. Metabolomic data were generated from multiple individual studies within these cohorts, which collectively provided data for the third component of the current analyses. Participants with existing metabolomics data were excluded if they had a daily energy intake below 500 kcal for women or 800 kcal for men or above 3,500 kcal for women and 4,000 kcal for men, if they were lost to follow-up after blood collection, or reported a history of cancer, cardiovascular disease or T2D at the time of blood draw. Ultimately, 11,454 participants were included from the pooled cohort (Extended Data Fig. 5). Of note, these participants did not include the LVS participants.
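The exclusion rules above can be sketched as a simple filter. This is a minimal illustration only: the function and argument names are hypothetical, and the energy bounds are copied from the text as written.

```python
# Plausible energy-intake bounds (kcal per day), as stated in the text.
ENERGY_BOUNDS = {"female": (500, 3500), "male": (800, 4000)}


def is_eligible(sex, energy_kcal, lost_to_follow_up, prevalent_disease):
    """Return True if a participant passes the stated exclusion criteria.

    prevalent_disease should be True when cancer, cardiovascular disease
    or T2D was reported at blood draw. All names here are hypothetical.
    """
    low, high = ENERGY_BOUNDS[sex]
    implausible_energy = energy_kcal < low or energy_kcal > high
    return not (implausible_energy or lost_to_follow_up or prevalent_disease)


print(is_eligible("female", 1800, False, False))  # within bounds, no exclusions
print(is_eligible("male", 4100, False, False))    # above the upper bound for men
```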
The study protocol was approved by the Human Subjects Committees of the Harvard T.H. Chan School of Public Health and Brigham and Women’s Hospital. In the WHI study, participants provided written informed consent for the overall WHI programme and the NPAAS-FS substudy. Study protocols were approved by the Institutional Review Board at the Fred Hutchinson Cancer Research Center and all participating clinical centres.
Dietary assessment
We used two sets of 7DDR data collected during LVS examinations to represent participants’ habitual diet. Participants were provided with detailed instructions for completing their 7DDRs. They weighed their food before and after eating and submitted recipes for homemade dishes and labels from commercial products. Diet records were analysed using the Nutrition Data System for Research software at the Nutrition Coordinating Center, University of Minnesota, yielding data on over 150 nutrients and dietary constituents35,36. Total carbohydrate intake was expressed as a percentage of calories. The intakes of added sugar and carbohydrates from whole grains, refined grains, vegetables, fruits, potatoes and legumes were adjusted for total energy intake using the residual method and then expressed as grams per day. The food contributors for these carbohydrate variables are summarized in Extended Data Table 4. We further categorized potatoes into baked/boiled/mashed potatoes versus fried potatoes.
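As an illustration of the residual method mentioned above, the sketch below regresses a nutrient on total energy intake and adds each residual to the intake predicted at the mean energy intake, so that the result remains in grams per day. This is an assumed, commonly used form of the method; the function name and toy data are hypothetical, and the text does not state which constant was added back.

```python
from statistics import mean


def energy_adjust(nutrient, energy):
    """Residual-method energy adjustment for one nutrient.

    Fits nutrient = intercept + slope * energy by ordinary least squares,
    then returns residual + predicted intake at the mean energy intake.
    """
    e_bar, n_bar = mean(energy), mean(nutrient)
    sxx = sum((e - e_bar) ** 2 for e in energy)
    sxy = sum((e - e_bar) * (n - n_bar) for e, n in zip(energy, nutrient))
    slope = sxy / sxx
    intercept = n_bar - slope * e_bar
    expected_at_mean = intercept + slope * e_bar  # equals n_bar under OLS
    return [
        n - (intercept + slope * e) + expected_at_mean
        for n, e in zip(nutrient, energy)
    ]


# Toy data: intake is exactly proportional to energy, so every
# energy-adjusted value collapses to the mean intake.
adjusted = energy_adjust([30.0, 40.0, 50.0], [1500.0, 2000.0, 2500.0])
print(adjusted)
```

By construction, the adjusted values are uncorrelated with total energy intake and keep the same mean as the unadjusted intake.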
In addition, participants from the LVS also completed a validated FFQ37. Participants in the NHS, HPFS and NHSII cohorts completed similar FFQs quadrennially since 1984, 1986 and 1991, respectively. Averaged nutrient intake was calculated based on the most recent FFQ cycle before blood collection (1990 for NHS, 1994 for HPFS and 1999 for NHSII). Total and types of carbohydrate intake were calculated by multiplying the frequency of food consumption by the nutrient content based on the Harvard University Food Composition Database and then summing these values. All carbohydrate variables were adjusted for total energy intake.
In the NPAAS-FS, participants’ intake of total carbohydrate and of individual types of carbohydrate was derived from the menus used to prepare the controlled meals. To calculate intake for each food or food group, menu items were converted into standard servings per day using the Nutrition Data System for Research serving sizes. The food intake variables were then calculated by averaging the intake over the 14-day feeding period (mean servings per day)38. In the current study, types of carbohydrate intake included added sugars, whole grains, refined grains, vegetables, whole fruits and potatoes (in grams per day). As in the LVS, total carbohydrate intake and the types of carbohydrate intake were adjusted for total energy intake using the residual method.
Metabolomics measurement
In the LVS, plasma metabolomics profiling was conducted using high-throughput liquid chromatography–mass spectrometry techniques at the Broad Institute of MIT and Harvard (Cambridge, MA)32. Hydrophilic interaction liquid chromatography (HILIC) with positive ionization mode detection (HILIC-pos) was used to separate polar metabolites in positive ion mode, HILIC with negative ionization mode detection (HILIC-neg) was used for polar metabolites in negative ion mode and C8 chromatography with positive ionization mode detection (C8-pos) was used for polar and non-polar lipids in positive ion mode. Only named metabolites were analysed; features with missing rates >75% or a mean coefficient of variation >30% were excluded. We also excluded metabolites that did not pass our pilot study investigating the effects of delayed sample processing during blood collection (intraclass correlation coefficient <0.4)32. Two drug metabolites (acetaminophen and α-hydroxymetoprolol) were further excluded. Metabolites were then natural log-transformed and scaled to z-scores. Missing values for each metabolite were imputed with half of the minimum observed value for that metabolite. Finally, we included a total of 293 known metabolites in the metabolomics analysis during the discovery phase (Extended Data Fig. 6).
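The per-metabolite processing described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: missing values are marked with None, imputation is applied on the raw scale before the log transformation (the text does not fix the order unambiguously), and the sample standard deviation is used for scaling. Function names are hypothetical.

```python
import math
from statistics import mean, stdev


def missing_rate(values):
    """Fraction of samples in which the metabolite was not measured."""
    return sum(v is None for v in values) / len(values)


def preprocess(values):
    """Impute, natural log-transform and z-score one metabolite.

    values: raw levels for one metabolite, with None for missing samples.
    Assumption: half-minimum imputation happens before transformation.
    """
    observed = [v for v in values if v is not None]
    half_min = min(observed) / 2
    filled = [half_min if v is None else v for v in values]
    logged = [math.log(v) for v in filled]
    mu, sd = mean(logged), stdev(logged)
    return [(x - mu) / sd for x in logged]


raw = [4.0, None, 8.0, 16.0]
if missing_rate(raw) <= 0.75:  # missingness filter from the text
    print(preprocess(raw))
```

In this toy example the imputed sample receives the lowest z-score, which is the intended behaviour of half-minimum imputation for values assumed to fall below the detection limit.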
Serum metabolomics profiles for NPAAS-FS participants were derived using the Metabolon platform (Metabolon), which employs Q-exactive ultrahigh-performance liquid chromatography tandem mass spectrometry38. For the Metabolon method, the sample analysis and data processing, including peak alignment and compound identification, have been detailed in prior publications39. In addition, lipidomics profiling was conducted in Dr Daniel Raftery’s lab at the Northwest Metabolomics Research Center at the University of Washington using the Sciex QTRAP 5500 Lipidyzer platform, which incorporates the SelexION differential mobility spectrometry method that targeted 1,070 lipids in 13 major lipid classes40. The measurements of metabolomics and lipidomics are complementary in the spectrum of metabolites that each method emphasizes. We mapped the metabolites from the three labs between LVS and NPAAS-FS by the HMDB numbers, metabolite names, or synonyms of these metabolites using the Human Metabolome Database or Lipid Maps. Lipid metabolites in the LVS could be mapped by summing species with the same number of carbons and double bonds in NPAAS-FS (Supplementary Table 5).
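The lipid mapping step, in which species sharing the same total number of carbons and double bonds are summed, might look like the sketch below. The species naming pattern (for example 'PC(16:0/18:1)') and the example levels are assumptions made for illustration; the actual platform output and the scale on which levels were summed are not specified in the text.

```python
import re
from collections import defaultdict


def sum_composition(species_name):
    """Collapse a chain-resolved name such as 'PC(16:0/18:1)' to 'PC(34:1)'."""
    lipid_class, chains = re.fullmatch(r"(\w+)\((.+)\)", species_name).groups()
    carbons = double_bonds = 0
    for chain in chains.split("/"):
        c, d = chain.split(":")
        carbons += int(c)
        double_bonds += int(d)
    return f"{lipid_class}({carbons}:{double_bonds})"


def aggregate(species_levels):
    """Sum levels of all species that share a class and sum composition."""
    totals = defaultdict(float)
    for name, level in species_levels.items():
        totals[sum_composition(name)] += level
    return dict(totals)


# Hypothetical levels for three chain-resolved species.
example = {"PC(16:0/18:1)": 2.0, "PC(16:1/18:0)": 3.0, "PE(18:0/20:4)": 1.5}
print(aggregate(example))
```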
Ascertainment of T2D
For the cohort analysis, self-reported T2D cases were confirmed via a supplemental questionnaire if at least one of the following criteria from the American Diabetes Association was met: (1) presence of one or more classic symptoms (for example, excessive thirst, frequent urination, weight loss, hunger, itching or coma) along with fasting plasma glucose (PG) ≥126 mg dl−1 (7.0 mmol l−1) or random PG ≥200 mg dl−1 (11.1 mmol l−1); (2) at least two elevated PG levels on separate occasions (fasting PG ≥126 mg dl−1, random PG ≥200 mg dl−1 or PG ≥200 mg dl−1 at 2 h during an oral glucose tolerance test) without accompanying symptoms; or (3) use of hypoglycaemic medication (either insulin or oral hypoglycaemic agents). Before 1998, a fasting PG level of ≥7.8 mmol l−1 (140 mg dl−1) was used for diagnosing diabetes based on National Diabetes Data Group criteria41. Beginning in 2010, HbA1c ≥6.5% was included in the diagnostic criteria42. The validity of the supplementary questionnaire was examined in two prior studies conducted within the NHS and HPFS cohorts. These studies used blinded medical record reviews, which confirmed T2D diagnoses in 98% and 97% of participants, respectively43.
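The three confirmation criteria can be expressed as a small decision function. This is a simplified sketch with hypothetical names: glucose values are in mg/dl, each list entry is assumed to come from a separate occasion, a single fasting cut-off is applied throughout and exposed as a parameter because the text notes that it differed by era, and the HbA1c criterion added in 2010 is not modelled.

```python
def confirms_t2d(symptoms, fasting_pg=(), random_pg=(), ogtt_2h_pg=(),
                 on_hypoglycaemic_medication=False, fasting_cutoff=126.0):
    """Return True if any of the three stated criteria is met.

    symptoms: True if at least one classic symptom was reported.
    fasting_pg, random_pg, ogtt_2h_pg: plasma glucose values in mg/dl.
    fasting_cutoff: 126 by default; 140 reflects the pre-1998 criterion.
    """
    elevated_count = (
        sum(v >= fasting_cutoff for v in fasting_pg)
        + sum(v >= 200.0 for v in random_pg)
        + sum(v >= 200.0 for v in ogtt_2h_pg)
    )
    # (1) symptoms plus an elevated fasting or random glucose value
    criterion_1 = symptoms and (
        any(v >= fasting_cutoff for v in fasting_pg)
        or any(v >= 200.0 for v in random_pg)
    )
    # (2) no symptoms, but at least two elevated values
    criterion_2 = (not symptoms) and elevated_count >= 2
    # (3) treatment with insulin or oral hypoglycaemic agents
    criterion_3 = on_hypoglycaemic_medication
    return bool(criterion_1 or criterion_2 or criterion_3)


print(confirms_t2d(True, fasting_pg=[130.0]))
print(confirms_t2d(False, fasting_pg=[130.0]))
```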
Covariates
In the LVS, demographics, lifestyle and medical conditions were assessed from self-reported questionnaires completed at the blood draw, including sex, age, ancestry, body weight, smoking status, physical activity and alcohol drinking. BMI was calculated by dividing weight in kilograms by the square of height in metres. The AHEI was derived and cumulatively averaged based on FFQ assessments since baseline (1986 in NHS, 1991 in NHSII and 1986 in HPFS) through 2010.
In the cohort analysis, information was obtained from self-reported biennial questionnaires until blood draw, including age, ancestry, family history of diabetes, BMI in early adulthood (age 18 in the NHS and NHSII or age 21 in the HPFS), history of hypertension, history of high cholesterol, fasting status, smoking status, alcohol drinking, physical activity, total calorie intake, per cent of calories from protein and AHEI.
Statistical analyses
In the LVS, we first explored metabolites that were associated with carbohydrate intake using multivariable linear regression models, with Bonferroni correction. To build metabolomic indices of carbohydrate intake, we applied elastic net regression to select relevant metabolites from all measured metabolites and constructed metabolomic scores for total carbohydrate intake as well as carbohydrate intake from different dietary sources. Individuals were randomly assigned to either a training set or a testing set in a 7:3 ratio. Elastic net regression with tenfold cross-validation and a leave-one-out approach was performed using the cv.glmnet function in the R package ‘glmnet’, with an α of 0.5 to indicate an equal mix of LASSO and ridge regularization and with the lambda value that minimized the cross-validated mean square error (‘lambda.min’)44,45. We then constructed metabolomic indices in both the training and testing sets using the β coefficients estimated from the trained model. The performance of the metabolomic indices was evaluated using Pearson correlation coefficients between the indices and carbohydrate intake. Correlations between ‘true’ intake and intakes measured using 7DDRs, FFQ and the metabolomic indices were assessed using the triad method in the LVS46,47. In light of the lack of well-accepted quantitative criteria for the performance of dietary biomarkers, we considered r ≥ 0.30 as evidence of successful development of the indices. Of note, correlations between dietary intake and many established nutrient biomarkers, such as long-chain n − 3 fatty acids and trans fatty acids, were in the range of 0.30 and above48. We built the same carbohydrate metabolomic indices in the NPAAS-FS using available metabolites and calculated correlation coefficients to quantify the replication performance.
Considering the heterogeneity between the discovery cohort and replication cohort in terms of dietary assessments and metabolomic profiling, we used statistical significance at 0.05 as the criterion for determining whether the replication was acceptable. The metabolomic indices were calculated using the following formula:
$$\mathrm{Metabolomic}\,\mathrm{indices}={\beta }_{1}{M}_{1}+{\beta }_{2}{M}_{2}+{\beta }_{3}{M}_{3}+\ldots +{\beta }_{i}{M}_{i},$$
where $M_i$ represents the level or concentration of the $i$th metabolite, and $\beta_i$ represents the coefficient associated with the $i$th metabolite.
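To make the formula concrete, the sketch below computes an index as a weighted sum of metabolite levels and evaluates it against intake with a Pearson correlation, the performance measure described above. The metabolite names, coefficients and data are hypothetical placeholders, not values from the study; metabolites without a coefficient (that is, not selected by the model) simply do not contribute.

```python
import math


def metabolomic_index(levels, coefficients):
    """Weighted sum of metabolite levels: sum over i of beta_i * M_i."""
    return sum(beta * levels[name] for name, beta in coefficients.items())


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical coefficients for two selected metabolites.
betas = {"metabolite_a": 0.20, "metabolite_b": -0.10}
# Hypothetical z-scored levels for one participant.
participant = {"metabolite_a": 1.5, "metabolite_b": -1.0, "metabolite_c": 0.5}
print(metabolomic_index(participant, betas))

# Toy check of the performance measure on made-up index and intake values.
print(pearson_r([0.1, 0.4, 0.9, 1.2], [38.0, 45.0, 50.0, 61.0]))
```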
In the cohort analysis, we constructed the same indices based on available metabolites. Correlations were calculated to examine the relationships between the indices and carbohydrate intake assessed using FFQs. We used Cox regression models to evaluate prospective associations of the indices with incident T2D during follow-up. Person-time was calculated from the blood collection date until the diagnosis of T2D, death, loss to follow-up or end of the study period (June 2020 in the three cohorts), whichever came first. We built two models. Model 1 was adjusted for study cohort and age at blood draw. Model 2 was further adjusted for ancestry (white or others), fasting status (fasting or non-fasting), family history of diabetes (yes or no), smoking status (never smoking or smoking), alcohol drinking (quintiles), BMI in early adulthood (<25, 25–29.9 or ≥30 kg m−2), physical activity (quintiles), hypertension (yes or no), high cholesterol (yes or no), total calorie intake (quintiles) and AHEI (quintiles). In addition, several sensitivity analyses were conducted. First, as current BMI can be a potential mediator, mediation analyses using bootstrapping with 500 resamples were employed to explore indirect effects. Second, the metabolites and their coefficients were alternatively selected using LASSO regression with the lambda value yielding the minimum mean square error49,50. Third, to assess the robustness of metabolite selection, we performed stability selection by repeatedly fitting elastic net regression models to random subsamples of the data with cross-validated penalty parameters. Metabolites selected in at least 80% of 100 subsampling iterations were considered robust and reproducible. Fourth, we additionally adjusted for the respective dietary carbohydrate variables (for example, whole grain or added sugar intake) to examine whether the metabolomic indices were associated with T2D independently of self-reported diet.
Fifth, to gain a better understanding of the underlying biological processes, we organized metabolites into groups and performed metabolite set enrichment analysis to identify those specifically associated with T2D. Last, we examined associations between the indices and T2D risk in a nested case–control study within the NHS, which included 1,456 participants (728 T2D cases and 728 healthy controls) who were free of diabetes at blood draw in 1989–1990, with T2D ascertained through 2008. Conditional logistic regression models were used to investigate associations of interest. All statistical tests were two-sided with significance set at P < 0.05, and Bonferroni correction was applied for multiple comparisons when analysing individual metabolites. All statistical analyses were performed using R version 4.0.
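The person-time definition used for the Cox models (follow-up from blood collection to T2D diagnosis, death, loss to follow-up or end of study, whichever came first) can be sketched as below. The function name is hypothetical, the exact end date within June 2020 is an assumption because the text gives only the month, and expressing follow-up in years of 365.25 days is likewise an illustrative choice.

```python
from datetime import date

# Assumption: the text states only "June 2020"; the last day is used here.
STUDY_END = date(2020, 6, 30)


def follow_up(blood_draw, t2d_diagnosis=None, death=None, lost=None,
              study_end=STUDY_END):
    """Return (person_years, event) for one participant.

    Follow-up ends at the earliest of diagnosis, death, loss to follow-up
    or end of study. event is True only if that earliest date is the
    T2D diagnosis date.
    """
    candidates = [d for d in (t2d_diagnosis, death, lost, study_end)
                  if d is not None]
    exit_date = min(candidates)
    event = t2d_diagnosis is not None and t2d_diagnosis == exit_date
    person_years = (exit_date - blood_draw).days / 365.25
    return person_years, event


# A participant diagnosed ten calendar years after blood draw.
print(follow_up(date(1990, 1, 1), t2d_diagnosis=date(2000, 1, 1)))
# A participant censored at death before any diagnosis.
print(follow_up(date(1990, 1, 1), death=date(2001, 1, 1)))
```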
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.