Issue in Statistics, Analysis and Presentation of Nutritional Data Wen-Harn Pan

The underlying objectives of data analysis and presentation • learn as much as possible • present w/ maximum clarity. The approaches for presentation • depending on the intended readership • simpler analytic approaches and greater use of figures may be appropriate for a general medical journal; • for an epidemiologic publication, more complex methods and primarily tabular results may be best.

DATA CLEANING: BLANKS and OUTLIERS Blanks in questionnaires • Should subjects w/ more than a specific number of blanks be excluded? • How should blanks be treated in calculating nutrient intakes? • It helps to understand why participants may not have completed a response for a specific food. • Inattention or carelessness • The participant did not eat the food (even though they should have answered “never”). • Several patterns can be seen • Blanks interspersed w/ plausible and seemingly carefully completed responses => not consumed. • Whole sections left blank => missed.

A firm rule for the allowable number of blanks cannot be made. • Nurses’ Health Study: allowed up to 70 blanks (out of about 130 items) as long as no whole sections or pages were blank. • Validation study • Examined the correlation between the number of blanks on a questionnaire and measurement error (calculated for each person as the absolute value of the difference between the FFQ and diet record values). • For all nutrients examined there was no appreciable correlation.

Once nutrients are calculated, some responses will be implausibly high or low. • Additional decisions are needed regarding allowable ranges • The use of total energy intake as a primary criterion • It is generally considered that total energy intakes below approximately 1.2 times the resting or basal metabolic rate estimated from age, gender and weight are unlikely to be correct. • Intakes of >4000 kcal/day are unlikely to be true for even relatively active men. • Ranges of 500 to 3500 kcal/day for women and 800 to 4000 kcal/day for men are used. • Chinese range? • Adjustment of nutrient intakes for total energy intake will, to a large extent, compensate for overall under- or overreporting.
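The energy-range criterion above can be sketched as a simple filter. This is only an illustration: the function name, the sex codes and the four records are hypothetical, and the ranges are the ones quoted on the slide, not values validated for any other population.

```python
# Allowable total-energy ranges (kcal/day) as quoted on the slide above.
ENERGY_RANGES = {"F": (500, 3500), "M": (800, 4000)}

def is_plausible_energy(sex, kcal_per_day, ranges=ENERGY_RANGES):
    """Return True if reported energy intake lies inside the allowed range for this sex."""
    low, high = ranges[sex]
    return low <= kcal_per_day <= high

# Hypothetical records: (subject id, sex, reported kcal/day).
records = [(1, "F", 1800), (2, "F", 420), (3, "M", 4500), (4, "M", 2600)]

kept = [r for r in records if is_plausible_energy(r[1], r[2])]
excluded_ids = [r[0] for r in records if not is_plausible_energy(r[1], r[2])]
```

Passing a different `ranges` dictionary would be the natural place to plug in population-specific limits, such as the Chinese range the slide asks about.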

Dietary data are highly sensitive to coding and data entry errors • Multiple choice formats and machine-readable questionnaires are less prone to such errors • Extreme values • due to the skewed distributions of most nutrients • improper completion of questionnaires • coding, data entry, or food composition database errors may be discovered. • unusual food intake patterns without obvious error.

Dealing with outliers • Gender specific scattergrams • Age vs variables of interest • Check original data for outliers • Make decision on how to deal with them • Correction • deletion

§ CATEGORIZED vs. CONTINUOUS INDEPENDENT VARIABLES • Dietary nutrients are primarily continuous. • Traditional presentation of epidemiologic data • Rate ratios / rate differences for levels of exposure • Arbitrarily defined quantiles (e.g., quartiles or quintiles) • Standard round-numbered cutpoints • Cut points with biologic relevance • RDA • intake at which an enzyme is saturated. • Dose-response relationship.

Arguments for using continuous variables: • The greatest statistical power is provided by a continuous variable if the function reasonably fits the data, although this advantage may be slight w/ the use of five or more categories combined w/ an overall test for trend. • When used as a covariate, a crudely categorized variable may not fully account for the effect of that variable, resulting in possible residual confounding. • The use of continuous variables may facilitate comparisons among studies because a single relative risk is reported for an arbitrarily specified increment of intake (e.g., RR for 100 mg of cholesterol per day) that does not depend on the distribution of the dietary factor in the particular population or on the choice of cut-points of individual studies. • Tests for nonlinearity, such as the addition of a quadratic term, can be used to evaluate whether a linear function is adequate.
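As a small illustration of reporting a single relative risk for a specified increment: under a log-linear model (logistic or proportional hazards, an assumption here), the RR for an increment is exp(beta × increment). The coefficient below is hypothetical and not taken from any study.

```python
import math

def rr_for_increment(beta_per_unit, increment):
    """Relative risk for a given increment of intake under a log-linear model."""
    return math.exp(beta_per_unit * increment)

# Hypothetical coefficient: log relative risk per 1 mg/day of cholesterol.
beta = 0.001

rr_100mg = rr_for_increment(beta, 100)  # RR per 100 mg/day
rr_200mg = rr_for_increment(beta, 200)  # equals rr_100mg squared under log-linearity
```

Because the increment is chosen by the analyst, two studies w/ very different intake distributions can still report the RR on the same scale.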

Comparison between fitting BMI as a categorical variable and as a continuous variable

§ GRAPHICAL PRESENTATION of DATA • The central data of epidemiologic studies • should be presented in numerical form to provide the actual numbers of exposed subjects and the numbers of endpoints. • Judicious ancillary use of figures summarizing the primary findings can be helpful to many readers. Also, a clear and attractive summary figure is likely to enhance the probability that others will include your data in their presentations. • For presenting the effects of one or a few dichotomous exposure variables, a graphic display provides little additional perspective and tables should be used. • With multiple ordinal categories, a figure can assist in visualizing an overall relationship (Fig 13-1). • An RR is more correctly represented as a point w/ its C.I. than as a bar.

Absolute risk vs relative risk • Relative risk is usually used. • Absolute risks are usually arbitrary depending on the age to which the data have been standardized. • The actual width of categories of the exposure variable should be represented in the graphic display. [Example] in Fig 13-1, trans-fatty acid intake was divided into quintiles

Actual vs. Predicted Relationships • In graphic displays of the relationship between two variables • Whether to provide the actual data or the prediction from a model derived from the data. • Many epidemiologists believe that the actual data should be provided so that the reader can view the findings w/o being forced to assume the appropriateness of any model assumptions, such as whether a linear relationship adequately describes the relationship. • The best solution may be to do both; for example, provide the data points for categories and superimpose the fitted regression line (Fig 13-2).

One clearly inappropriate approach is • To analyze the data as continuous, but then present the findings as though they were categorical. • This provides the potentially misleading impression of a clearly monotonic relationship, and C.I. that are too narrow for a specific level because they are based on the overall data. • For example, by displaying the odds ratios (and C.I.) for multiple discrete levels of intake that are all based on a single regression coefficient.

Display of Joint Effects • A common approach to represent the joint effects of two exposure variables is the 3-D histogram (Fig 13-3A). - This has been criticized by some for the same reasons that the use of histograms to present univariate findings has been discouraged. => alternative display using points (Fig 13-3B). - A display of C.I. is usually incompatible w/ the 3-D histogram, but this may also be problematic w/ the use of points as their C.I. are frequently overlapping. => critical C.I. will usually need to be presented in tabular form or in text.

Locally Smoothed Regression Curves and Regression Splines • Locally smoothed regression curves and regression splines have been used increasingly to display epidemiologic findings w/ a continuous exposure variable. - With smoothed regression curves, the values of the dependent variable, such as a relative risk, are estimated for a continuously moving window of values of the independent variable. - In using regression splines, separate linear or nonlinear functions are fit between specified points (“knots”) on the exposure distribution, and the functions are connected at the knots to produce a smooth curve. • The principal advantages of these approaches are that a priori assumptions are not imposed regarding the shape of the dose-response relationship and that maximal use is made of the continuous nature of the exposure variable.
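A minimal sketch of the regression-spline idea, using the simplest case of a piecewise-linear spline. The knots and coefficients below are hypothetical; in practice the coefficients would be estimated by regressing the outcome on these basis terms. The point of the sketch is only that the separate segments join at the knots.

```python
def linear_spline_basis(x, knots):
    """Basis terms [1, x, (x - k1)+, (x - k2)+, ...] for a piecewise-linear spline."""
    return [1.0, x] + [max(0.0, x - k) for k in knots]

def spline_value(x, knots, coefs):
    """Evaluate the spline as the sum of coefficient * basis term."""
    return sum(c * b for c, b in zip(coefs, linear_spline_basis(x, knots)))

# Hypothetical knots on an exposure scale (e.g., g/day).
knots = [10.0, 30.0]
# Hypothetical coefficients: intercept, initial slope, change in slope at each knot.
coefs = [0.0, 0.01, 0.02, -0.03]

# Slope is 0.01 below the first knot, 0.03 between the knots, and 0.0 above the second.
value_at_first_knot = spline_value(10.0, knots, coefs)
value_at_second_knot = spline_value(30.0, knots, coefs)
value_above_knots = spline_value(40.0, knots, coefs)
```

The choice of the number and position of knots is left to the analyst, which is why the slides caution that conclusions may depend on it.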

[Example] Relation of alcohol intake to risk of breast cancer (Fig 13-4) - The data fit using regression splines provide a strong sense that the relationship is approximately linear and that a significant increase in risk is seen even at about 10 g (~1 drink) per day.

[Example2] Use of smoothing techniques to evaluate the relationship between vitamin A intake and risk of neural crest congenital malformations (Fig 13-5) - This analysis suggested the existence of a threshold at ~10,000 IU, above which risk increased substantially. - One possible concern is that inflection points may be accepted too literally, especially when the data are sparse (e.g., the method does not provide a C.I. for the apparent threshold, which would probably be quite wide). => this might be regarded as an over-fitting of the data (analogous to an extreme form of selecting optimal cut-points to demonstrate a relationship).

The degree to which the conclusions are affected by somewhat arbitrary choices such as the width of the window used for smoothing and the number and spacing of knots deserves further consideration. • Although smoothing methods and regression splines for analyses involving dietary intake may prove valuable, particularly in exploratory data analysis, their use deserves further evaluation.

§ EXAMINATION of FOODS and NUTRIENTS • A full evaluation of the relationship between diet and a disease should involve the analysis of data on both food and nutrient intakes. • If an association w/ disease is found for a specific nutrient, it is important to examine and report whether the major foods contributing to this nutrient are also related similarly to risk of disease. • The large number of items • The p value should be adjusted • Consistency w/ other information internal and external to the study • Reporting bias

Reporting bias • No simple solution exists for publication bias • A partial solution may be to deposit data in the National Auxiliary Publications Service or on the internet for all foods when an analysis on a particular disease is published. • The best approach for avoiding publication bias on specific foods is probably to analyze collaboratively the primary data from all available studies on a topic.

Nutritional supplements • Adds complexity to dietary analyses • Provide important insight by greatly extending the range • Details on dose and duration of supplement use are usually obtainable • Excluding supplement users to study dietary intake • Stratification by nutrient intake from foods to study supplement effects • The greatest contrast in risk would usually be expected, comparing • long-term supplement users w/ high intakes from diet • nonsupplement users w/ low intakes from diet.

The Effect of Time • Dietary factors may operate at various stages of the sequence of events (e.g., • an antioxidant might reduce the effect of ionizing radiation, which is an early event • alcohol may influence endogenous hormone metabolism, which is likely to be most important later in carcinogenesis). • The effect of diet may also be cumulative so that risk is related to a function of both dose and duration of exposure. • Dietary factors may also have effects at specific periods in life far removed from the time of diagnosis (e.g., high growth rates before puberty appear to increase breast cancer risk by advancing the age at menarche, and effects of maternal diet during pregnancy on the offspring’s risk of breast cancer have been hypothesized). • Because our knowledge is often not sufficient to be confident that an effect of diet would be limited to only a particular period, it will usually be difficult to exclude an effect of a dietary factor until a fairly wide range of temporal relationships has been examined.

Ideally, a comprehensive dietary assessment would include a measurement of current diet and also diet at various times in the past. • truly independent retrospective assessments of various periods appear impossible • In practice, • in cohort studies an assessment of current diet (usually over the past year) is used • in case-control studies a period in the past thought to be most plausibly relevant to the disease (typically 5 to 10 years ago for cancer) is the focus of recall. • For intake of vitamin and mineral supplements, information on duration of use can be readily collected and can be critical • (e.g., for vitamin E supplements in relation to CHD risk, and for vitamin C supplements in relation to risk of cataracts, the associations were limited to longer-term users).

Prospective studies provide an important additional means of assessing temporal relationships w/ diet. - Baseline data on current diet can be examined in relation to disease incidence at various follow-up periods; under the reasonable assumption that diet varies over time, the maximum relative risk should provide information on the true induction period. - The limitation of this approach is largely practical, as few cohorts will be sufficiently large to provide statistically stable estimates of risk during multiple time periods. • Prospective studies w/ replicate dietary assessments provide the opportunity to examine various intervals between dietary intake and disease diagnosis w/ much greater power.

§ THE USE of MULTIPLE DIETARY ASSESSMENTS in PROSPECTIVE STUDIES • A powerful feature of cohort studies is the opportunity to collect repeated dietary data over time. Such repeated measurements of dietary intake provide many possible analytic opportunities to reduce the effects of measurement error and to evaluate various hypothesized temporal relationships between the dietary factor and the disease outcome (Table 13-2).

Measured changes in diets of individuals over time are a mix of true variation and measurement error. - The comparison of persons whose intakes are consistently high w/ those whose intakes are consistently low provides a strong test of cumulative exposure, as well as both long and short latency, because it is highly likely that these persons were truly high or truly low over long durations. - The major limitation of this strategy is the loss of power due to the exclusion of the many persons who changed categories and the need to exclude cases that occur before the repeated measurement if the analyses are to be truly prospective. • The use of cumulative average measurements (i.e., the average of all measurements for an individual up to the start of each follow-up interval) takes advantage of all prior data and thus should provide a statistically more powerful test of an association of cumulative exposure. - This approach deserves further methodologic development, though, to take into account the different degrees of measurement error and information provided at each interval.
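The cumulative average described above can be sketched in a few lines. The function name and the intake values are hypothetical; each returned value is the exposure that would be assigned to the follow-up interval starting right after the corresponding assessment.

```python
def cumulative_averages(measurements):
    """Average of all measurements up to and including each assessment."""
    averages = []
    running_total = 0.0
    for count, value in enumerate(measurements, start=1):
        running_total += value
        averages.append(running_total / count)
    return averages

# Hypothetical repeated intakes for one person (e.g., g/day at three questionnaires).
intakes = [30.0, 20.0, 25.0]
cum_avg = cumulative_averages(intakes)  # exposure for each successive follow-up interval
```

This simple version weights every assessment equally, which is exactly the point the slide flags as needing further methodologic development.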

Because our understanding of disease etiology is often inadequate to specify a temporal relationship w/ confidence, the use of several rather than just one analytic strategy to examine various temporal relationships will generally be appropriate. • Clear evidence that an association is strongest w/ a particular temporal relationship can provide important information on the pathogenetic process and possibilities for intervention. - If no association is observed, the demonstration that this lack of relationship is seen when a full range of temporal relationships is examined provides the most compelling evidence that an important association has not been missed. [Example] Analysis of the relationship between coffee consumption and risk of CHD in women (Table 13-3)

§ MULTIVARIATE ANALYSES • Particularly important in nutritional epidemiology because dietary factors tend to be intercorrelated. • A common reason is to determine whether an observed association between a specific dietary factor and disease risk is only secondary to its correlation w/ another truly causal dietary factor. • A standard approach is simply to include both variables together in the same model. • As the focus will typically be on dietary composition rather than on absolute amounts, the specific nutrients should usually be expressed as energy-adjusted residuals or nutrient densities • Total energy should be included as a term unless it is unrelated to disease risk. • Many possible alternative dietary factors could be considered as potential confounding variables, and the temptation may be to simply include these all simultaneously in a model.
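The two energy-adjusted expressions named above can be sketched as follows, assuming a simple linear regression of nutrient intake on total energy for the residual method. The function names and the four data points are hypothetical, and a real analysis might transform the variables first.

```python
def mean(values):
    return sum(values) / len(values)

def energy_adjusted_residuals(nutrient, energy):
    """Residuals from a simple linear regression of nutrient intake on total energy."""
    mean_e, mean_n = mean(energy), mean(nutrient)
    sxx = sum((e - mean_e) ** 2 for e in energy)
    sxy = sum((e - mean_e) * (n - mean_n) for e, n in zip(energy, nutrient))
    slope = sxy / sxx
    intercept = mean_n - slope * mean_e
    return [n - (intercept + slope * e) for e, n in zip(energy, nutrient)]

def nutrient_density(nutrient_g, energy_kcal, kcal_per_g):
    """Percentage of total energy supplied by the nutrient."""
    return [100.0 * kcal_per_g * n / e for n, e in zip(nutrient_g, energy_kcal)]

# Hypothetical data: total energy (kcal/day) and total fat (g/day) for four people.
energy = [1500.0, 2000.0, 2500.0, 3000.0]
fat = [50.0, 80.0, 85.0, 120.0]

fat_residuals = energy_adjusted_residuals(fat, energy)   # uncorrelated w/ energy by construction
fat_density = nutrient_density(fat, energy, kcal_per_g=9.0)  # % of energy from fat
```

By construction the residuals carry only the variation in fat intake that is not explained by total energy, whereas the nutrient density still contains energy in its denominator.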

A problem w/ including a large number of dietary factors simultaneously is that the remaining independent variation in the primary dietary factor may become quite small because, collectively, the other variables can account for almost all of its variation. • An alternative strategy is to conduct a series of analyses that include the standard non-dietary factors and add the alternative dietary factors one at a time. In this process, it may be possible to eliminate several or all alternative variables by showing that they have no independent association w/ disease and that the association w/ the primary variable remains. • There are no clear limits as to the number of nutrients that may be included simultaneously, as this will depend on their intercorrelation as well as the size of the dataset. However, because many dietary variables are strongly correlated w/ many others, the maximum number is likely to be modest before the C.I. become uninformatively wide.

Another common situation arises when one or more dietary factors are subcomponents of another (e.g., saturated, monounsaturated, and polyunsaturated fats are the components of total fat): entering all four variables simultaneously is impossible as they are redundant. - Options (using types of fat as an example) are listed in Table 13-4. - Model 1a: Total fat + sat fat + energy. This is a standard multivariate model and does address the independent effect of saturated fat. The term for total fat no longer has the biologic meaning of total fat because a major component, saturated fat, is included separately; its meaning then becomes mono- and polyunsaturated fat.

- Model 1b: Total fat(res E) + sat fat(res total fat) + energy. The residual from the regression of energy-adjusted saturated fat on energy-adjusted total fat is included. This will provide the same coefficient for saturated fat as in model 1a, but the full biologic meaning of total fat is retained => this model describes disease risk in relation both to the total fat composition of the diet and to the type of fat. - Model 1c: Total fat/E + sat fat/E + energy. The term for saturated fat (as a nutrient density) can be interpreted as substituting a certain percentage of energy from saturated fat for the same amount of other types of fat. The term for total fat as a nutrient density reflects primarily the energy density of mono- and polyunsaturated fats. - Models 2a: Total fat + sat fat + poly fat + energy; 2b: Total fat(res E) + sat fat(res total fat) + poly fat(res total fat) + energy; 2c: Total fat/E + sat fat/E + poly fat/E + energy. These are analogous to models 1a-1c, but more specifically address the substitution of saturated fat for monounsaturated fat because polyunsaturated fat is included as a separate term. They can be used for testing the general question: Does the type of fat add independently to the prediction of disease above and beyond total fat?

- Models 3a: Sat fat + mono fat + poly fat + energy; 3b: Sat fat(res E) + mono fat(res E) + poly fat(res E) + energy; 3c: Sat fat/E + mono fat/E + poly fat/E + energy. These are not fat substitution models because the total fat composition of the diet is not constrained. These models address a somewhat different issue: Is each of the types of fat, substituted for other sources of energy, independently associated w/ disease risk? • Another common example of a dietary factor w/ nested subcomponents is alcohol intake, where the question frequently arises whether an observed association is due to a certain type of alcoholic beverage. • The interpretation of multivariate analyses including two or more dietary factors should always be tempered by the knowledge that none of the dietary variables are measured perfectly. Moreover, the degree of measurement error may vary among different dietary factors. - One dietary factor may appear to be the true predictor and the other a confounder only because the former is better measured.

§ EMPIRICAL DIETARY SCORES • Two traditional methods of combining data on intakes of various foods have been (1) to compute nutrient intakes using food composition tables, and (2) to create food groups based on similarities in nutrient content. - Use of global scores to describe dietary patterns or quality has also been suggested. • Factor Analysis • Can be used to identify two or more uncorrelated dietary patterns based on foods that tend to be used (or avoided) by the same persons. • A score is created for each person for each factor by assigning weights to their frequency of use of each food. Once the scores are computed for each person, their relation to risk of disease can be examined.

The role of factor analysis or other multivariate methods (e.g., principal components or cluster analysis) to create scores for dietary patterns in nutritional epidemiology remains unclear. - In contrast to calculations of nutrient intakes, there is no biologic basis for these scores. - This approach may be useful for describing the intercorrelation of foods and thus the identification of potential confounders. • Empirically Selected Variable Score • A tempting strategy for developing a prediction score is to examine the relation of each food (or nutrient) w/ risk of disease, pick the significant associations, and create a summary score comprised of these variables. - The problem w/ this approach can be appreciated by considering that, if 100 foods are examined as disease predictors, by chance alone about 5 will be statistically significant at p < 0.05. A score based on these 5 variables will be highly statistically significant as a predictor of disease in the same data, all on the basis of chance.
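The "about 5 of 100 by chance" figure above follows from simple arithmetic. The sketch below assumes every food is truly unrelated to disease, tests are made at the 0.05 level, and, for the second quantity only, that the tests are independent (which correlated foods would violate).

```python
n_foods = 100   # number of foods examined, as in the slide's example
alpha = 0.05    # conventional significance level (assumed)

# Expected number of "significant" foods when none is truly associated.
expected_false_positives = n_foods * alpha

# Probability that at least one food is significant by chance,
# under the additional assumption that the tests are independent.
p_at_least_one = 1.0 - (1.0 - alpha) ** n_foods
```

Under these assumptions a chance finding somewhere among the 100 foods is nearly certain, which is why a score built from the "significant" foods needs validation in independent data.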

One common strategy for cross-validation is to divide the dataset into halves, create an empirical prediction score in one half (training set) and evaluate the score in the other half (test or validation set). • More statistically efficient alternatives for cross-validation exist, such as the jackknife, which involves successively leaving out one observation and fitting the model w/ the remaining data to predict the omitted observation. • Nutrient Prediction Models Using an Independent Gold Standard • In calculating nutrient intakes from a FFQ, foods are weighted by their frequency of use and their nutrient content using a food composition database. - Ideally, the weight would also take into account the validity w/ which intake of each food was assessed and the bioavailability of the nutrient from each food. - This additional weighting can be accomplished by using an independent quantitative assessment of nutrient intake or a biochemical indicator of nutrient intake in a sample of the population.
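The leave-one-out procedure mentioned above can be sketched w/ a one-predictor linear model. The function names and data are hypothetical, and a real prediction score would involve many foods; the structure of the loop is the point.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def leave_one_out_predictions(xs, ys):
    """Predict each observation from a model fit to all the other observations."""
    predictions = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        intercept, slope = fit_line(train_x, train_y)
        predictions.append(intercept + slope * xs[i])
    return predictions

# Hypothetical data: a score (x) and an outcome measure (y) for six people.
score = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
outcome = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

loo_pred = leave_one_out_predictions(score, outcome)
loo_errors = [y - p for y, p in zip(outcome, loo_pred)]  # out-of-sample prediction errors
```

Each prediction is made w/o the observation it predicts, so the errors give a less optimistic picture than residuals from a single fit to all the data.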

Diet records, which are often used to assess the validity of a food-frequency instrument, could be used as an independent estimate of true nutrient intake to develop a prediction score from foods on a FFQ. - To do so, the nutrient intake from the diet record would be used as the dependent variable in a multiple regression analysis w/ all foods from the FFQ being allowed to enter in a stepwise fashion. Foods that explain the most between-person variance in the nutrient intake enter first. - If the validity of a food item on the questionnaire is low (e.g., if it was worded poorly), it should not contribute appreciably to the prediction of the nutrient. - The nutrient score based on the coefficients from the stepwise regression could then be computed for each person and used in analyses predicting disease. - Limitations of this approach: (1) a large number of subjects are needed to provide stable estimates of regression coefficients, probably considerably larger than most validation studies; (2) the regression coefficients will reflect in part the respondent characteristics, which is a desirable feature because they may influence validity, but this means that they may not be generalizable to other populations.

A biochemical indicator could also serve as the standard for developing a prediction score from foods on a FFQ. - Such an approach would take into account factors such as the bioavailability of the nutrient in each food and the validity of each food item. [Example] Giovannucci et al. (1995) used this approach to develop a prediction score for lycopene intake using plasma lycopene levels as a standard. They found that cooked tomato products predicted plasma lycopene levels better than did raw tomato products, so this was incorporated into the empirical prediction equation. When this empirical score was used to examine the relation between lycopene intake and risk of prostate cancer in the total cohort, a stronger association was observed than when using the standard calculation of intake. - The size of the substudy is a critical issue because it determines the precision of the coefficients, but the desirable size is not clear.

§ SUBGROUP ANALYSES AND INTERACTIONS • The effects of most dietary factors are likely to vary among subgroups, depending on the intake of other dietary factors and characteristics of the subjects. This issue is generally known as “effect modification” or “interaction”. - A fundamental issue is whether the interaction should be assessed on an absolute scale (whether the rate difference is constant across categories of the third variable) or relative scale (whether the rate ratio or RR is constant across these categories). • One general concern has been that an extensive search for interactions and associations within subgroups of other variables creates a high likelihood of statistically significant associations arising by chance. • Although there is no simple solution for evaluating subgroups w/ confidence, this should not deter investigators from examining them. - Some subgroup analyses are so important a priori that they must be examined, and failure to find evidence of effect modification may even cast doubt on an association.

When two dietary factors act by a similar mechanism, it may be difficult to observe an association by examining only one of these variables at a time; an examination of joint exposures may be most powerful. • Purely exploratory analysis of associations among subgroups of known risk factors is also a good practice even when little a priori reason exists because new knowledge may be gained - Such explorations w/o strong prior expectations should be clearly described as such, and the reader should be skeptical of any findings. - Some would suggest not reporting p values in such circumstances. • Ultimately, the only fail-safe protection against spurious conclusions based on subgroups is demonstration of reproducibility, perhaps over time within the same study and, most importantly, in other independent datasets.

§ ERROR CORRECTION • Methods to correct observed associations for errors in measurement of exposure variables require data on either reproducibility or validity; this information is now being collected as part of many large studies. • The most common use of error correction procedures is the de-attenuation of correlation coefficients, probably because this only requires replicates of one or both measurements being compared. - This procedure has become quite routine in validation studies, where a small number of days of diet records or 24-hour recall data are collected as an independent representation of true long-term intake. • In studies of disease incidence, the calculation of RRs and C.I. adjusted for measurement error has usually been done as a secondary analysis. - These analyses address 3 interrelated objectives: (1) to obtain the best estimate of the RR after accounting for attenuation due to imperfect measurement of the primary exposure; (2) to obtain the best estimate of the true C.I., which is particularly important when little association is seen, because the central question becomes whether the adjusted C.I. are sufficiently narrow to be informative; (3) to account for residual confounding by imperfectly measured covariates.
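A minimal sketch of de-attenuating a correlation coefficient. The slide does not give a formula; the one used here is the commonly cited correction for random within-person variation in the reference measure, r_true = r_obs × sqrt(1 + λ/n), where λ is the ratio of within- to between-person variance and n is the number of replicate days per person. The input values are hypothetical.

```python
import math

def deattenuate_correlation(r_observed, variance_ratio, n_replicates):
    """Correct an observed correlation for random within-person error in one measure.

    variance_ratio is the within-person / between-person variance ratio (lambda)
    of the replicated measure; n_replicates is the number of replicates per person.
    """
    return r_observed * math.sqrt(1.0 + variance_ratio / n_replicates)

# Hypothetical validation study: FFQ vs. the mean of 4 days of diet records.
r_obs = 0.5
r_corrected = deattenuate_correlation(r_obs, variance_ratio=3.0, n_replicates=4)
```

As the number of replicates grows, the correction factor approaches 1, reflecting that the mean of many days is already close to true long-term intake.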

§ ROLE of META-ANALYSIS and POOLED ANALYSIS in NUTRITIONAL EPIDEMIOLOGY • The place of meta-analysis in epidemiology has been controversial. - Some have argued that the combining of data from randomized trials is appropriate because statistical power is increased w/o concern for validity since the comparison groups have been randomized, but that in observational epidemiology the issue of validity is determined largely by confounding and bias rather than limitations of statistical power => the great statistical precision obtained by the combining of data may be misleading because the findings may still be invalid. - The combining of all available epidemiologic data can be of great value, particularly when a body of evidence becomes substantial and difficult to assimilate at once, provided the potential for bias is not ignored.

An alternative to the combining of published epidemiologic data is to pool and analyze the primary data from all available studies on a topic that meet specified criteria. - Because of the complexity of dietary data, this approach has great advantages in nutritional epidemiology and can address many limitations of individual studies. • Any attempt to combine published data on diet and disease is immediately confronted w/ the problem that various investigators have usually used different approaches for presenting their findings that make them difficult to combine: - Sometimes RRs are given for arbitrary quantiles and other times for specified increments using continuous variables. - Adjustments for total energy intake are often done using a variety of methods or not at all, and the inclusion of other covariates typically differs among studies. • A major advantage of pooling primary data is that all data can be analyzed simultaneously using common approaches and definitions of exposure.
