Covariate information in complex event history data - some thoughts arising from a case study. Elja Arjas Department of Mathematics and Statistics, University of Helsinki and National Public Health Institute (KTL) Based on ongoing joint work with Olli Saarela and Sangita Kulathinal.
Background and motivation: • Assessment of risk factors of cardiovascular diseases (e.g. coronary heart disease, stroke); • Traditional approach for cohort analysis: hazard regression model, with covariates (e.g. blood pressure, cholesterol level, or body mass index) measured only at the baseline; • Adding “a genetic component”: usually candidate loci, potentially causative on the basis of the available information about their function.
Emphasis on causal ideas: • Stressing probabilistic predictions: “How would the probability of the outcome change if a covariate had a different value?” • Association vs. causation: the issue of confounding (change by intervention, “do”-conditioning, Pearl 2000).
Considering causal effects … • Compare, e.g., predictive probabilities of a future response y*: p(y*|data, attrib*, hist*, do(exposure*’)) vs. p(y*|data, attrib*, hist*, do(exposure*)) for a generic individual “*” (or for an equivalence class of exchangeable individuals) characterized by the attributes and past history used in the conditioning (cf. Arjas and Parner 2004).
Causal ideas…: • Causal mechanisms can involve pathways that are • direct in the sense that they influence, in the postulated model structure, directly the outcome variable, or • indirect in that their effect on the outcome is mediated via the levels of the measured risk factors.
MORGAM study • Evans et al. (2005) • Individuals of different ages in a cohort are monitored for • (fatal and non-fatal) occurrences of coronary heart disease (CHD) or stroke, • death from other causes. • Information on risk factors such as • smoking status, • blood pressure (BP), • body mass index (BMI), • total cholesterol and HDL cholesterol and • possible earlier occurrences (yes/no) of CHD or stroke is collected at cohort baseline.
Genetic information… • SNP (single nucleotide polymorphism) level genotype data from candidate loci, e.g. • functionally connected e.g. to blood clotting, • associated with cardiovascular diseases, • associated with increased lipid levels. • Due to the cost involved genotyping is only done on • all known cases of CHD or stroke, and • individuals belonging to a random subset of the original cohort.
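The selection rule on this slide (genotype all known cases plus a random subset of the cohort) can be sketched as follows. This is a minimal illustration only: the function name `case_cohort_set`, the sampling fraction and the identifiers are hypothetical and not taken from MORGAM.

```python
import random

def case_cohort_set(cohort_ids, case_ids, subcohort_fraction, seed=0):
    """Return the ids selected for genotyping: a simple random subcohort
    of the full cohort, plus every known case (hypothetical helper)."""
    rng = random.Random(seed)
    ids = sorted(cohort_ids)
    n_sub = round(subcohort_fraction * len(ids))
    subcohort = set(rng.sample(ids, n_sub))   # random subset of the cohort
    return subcohort | set(case_ids)          # all cases are always included

# Illustrative numbers only: 1000 cohort members, 3 cases, 10% subcohort.
selected = case_cohort_set(range(1000), {3, 17, 256}, 0.10)
```

Everyone outside `selected` has follow-up and baseline covariate data but no genotype data.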
Information missing… • There is • no genetic information of any kind available on most members of the original cohort, and even for those belonging to the case-cohort set, only on the chosen candidate loci; • no knowledge of early fatal occurrences of CHD or stroke from outside the cohort.
Graphical representation [Figure: graphical model with nodes for the event endpoint, parameters of interest, time (age), underlying covariate process, candidate gene, measurement error variance, and covariate measurement.]
Aspects to be considered... • Time: • BMI, BP and cholesterol level do not remain constant over time: “individually varying stochastic processes”. • Even an accurate measurement at a particular time cannot be directly related to the endpoints as a “cause”. • The interpretation, and value for a causal analysis, of covariate measurements made in the past will generally depend on how long ago they were measured.
Further aspects… • Feedback to covariate values from earlier events: Covariate values of individuals who had experienced a CHD event or stroke already before being recruited to the cohort may have been influenced by this event (e.g., the person quits smoking, changes diet, or gets medication to lower blood pressure). • Influence of an earlier treatment: After a first occurrence of non-fatal CHD or stroke, the risk for later similar events or death is likely to be more strongly influenced by the availability and success of the acute medical treatment than by the values of the measured risk factors/covariates.
Further aspects… • Potential confounding issue: The considered candidate loci can influence both the values of the measured covariates and those of the outcome variables. If this is not properly accounted for in the modelling and analysis of data, they become a potential source of confounding in an observational study. Here also: How about the rest of the genome, outside the selected candidate loci?
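A minimal numerical sketch of this confounding issue: a binary genotype G influences both a binary exposure E (a measured risk factor) and the event Y. All probabilities below are hypothetical, chosen only to make the arithmetic easy to follow. Conditioning on the observed exposure weights genotypes by p(g | E = e), whereas “do”-conditioning weights them by p(g).

```python
# Hypothetical probabilities for illustration only (not MORGAM estimates).
P_G = {0: 0.8, 1: 0.2}                        # p(genotype = g)
P_E1_GIVEN_G = {0: 0.3, 1: 0.7}               # p(exposure = 1 | genotype = g)
P_Y1_GIVEN_EG = {(0, 0): 0.05, (1, 0): 0.10,  # p(event = 1 | exposure, genotype)
                 (0, 1): 0.20, (1, 1): 0.40}

def p_event_given_exposure(e):
    """Observational p(Y=1 | E=e): genotypes are weighted by p(g | E=e)."""
    num = den = 0.0
    for g, pg in P_G.items():
        pe = P_E1_GIVEN_G[g] if e == 1 else 1.0 - P_E1_GIVEN_G[g]
        num += pg * pe * P_Y1_GIVEN_EG[(e, g)]
        den += pg * pe
    return num / den

def p_event_do_exposure(e):
    """Interventional p(Y=1 | do(E=e)): genotypes are weighted by p(g)."""
    return sum(pg * P_Y1_GIVEN_EG[(e, g)] for g, pg in P_G.items())

observed_contrast = p_event_given_exposure(1) - p_event_given_exposure(0)
causal_contrast = p_event_do_exposure(1) - p_event_do_exposure(0)
```

With these numbers the causal contrast is 0.08 while the observed contrast is about 0.146: ignoring the genotype overstates the effect of the exposure.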
Further aspects… • Large dimension of parameter space: The degree of SNP-based polymorphisms present in the data generally exceeds by far numbers for which it would be possible, given the amounts of data, to reliably estimate risks associated with individual genotypes. Particularly problematic in this sense is the MHC/HLA region.
Some shortcuts… • Problem 2 (feedback from earlier events): Ignore the current status covariate information that may have been influenced by the earlier occurrence, only keeping information on covariates that do not change in time (age, sex, genotype). • Problem 3 (influence of an earlier treatment): Consider follow-up data only up to the first occurrence of CHD or stroke. • Problems 1 (time), 4 (confounding, missing genotype data) and 5 (dimension): Try something more systematic: For problem 5, apply a monotonicity postulate and the consequent partial ordering of risks. For problems 1 and 4, treat the missing covariate information in a distributional form (using data augmentation and MCMC).
Problem 5: dimension Partial ordering: • The two variants (alleles) of a biallelic SNP are labeled as 0 and 1, with 0 for the “common” and 1 for the “rare” form; • Within each gene (more generally, linkage group), arrange the sequence of SNP genotypes (pairs of the form 00, 01, 10 and 11), each determined from the same SNP locus, into haplotypes. (Alleles belonging to the same - maternal or paternal - chromosome form a haplotype.)
Problem 5: dimension (2) • Let (−, ø, +) denote a “less risky”, “neutral” and “more risky” allele, respectively. • For each pair of alleles, there are three possibilities: • allele 0 is less risky than allele 1 (−+), • no effect (øø) and • allele 1 is less risky than allele 0 (+−). • Postulate: this ordering of alleles is extendible to a partial ordering of haplotype risks. For example, haplotype h1 is “more risky” than haplotype h2 if all its alleles are either “more risky” or “neutral” compared to the corresponding alleles in h2, and at least one is “more risky”. • Haplotypes can then be classified into groups, each being represented by a vector with elements chosen from {−, ø, +}. Modelling of risks is then done via such classes. • Extend this partial ordering into a partial ordering between haplotype pairs (diplotypes).
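The haplotype part of the postulate can be sketched in a few lines. The encoding is an assumption made for illustration: risk labels −, ø, + are coded as −1, 0, +1, locus orderings are written with ASCII keys ("-+", "00", "+-"), and the function names are hypothetical. The extension to diplotypes is not shown.

```python
# Labels of (allele 0, allele 1) under each of the three possible locus orderings.
# -1 = "less risky", 0 = "neutral", +1 = "more risky" (illustrative encoding).
LOCUS_ORDERINGS = {"-+": (-1, +1), "00": (0, 0), "+-": (+1, -1)}

def haplotype_class(haplotype, orderings):
    """Map a haplotype (tuple of 0/1 alleles) to its vector of risk labels."""
    return tuple(LOCUS_ORDERINGS[o][a] for a, o in zip(haplotype, orderings))

def compare_haplotypes(h1, h2, orderings):
    """Compare two haplotypes under the partial ordering of risks."""
    c1 = haplotype_class(h1, orderings)
    c2 = haplotype_class(h2, orderings)
    if c1 == c2:
        return "same class"
    if all(a >= b for a, b in zip(c1, c2)):
        return "more risky"      # h1 is at least as risky at every locus
    if all(a <= b for a, b in zip(c1, c2)):
        return "less risky"
    return "incomparable"        # riskier at one locus, less risky at another

# Example with three loci: the second locus has no effect.
EXAMPLE_ORDERINGS = ("-+", "00", "+-")
result = compare_haplotypes((1, 0, 0), (0, 0, 1), EXAMPLE_ORDERINGS)
```

Haplotypes (1, 0, 0) and (1, 1, 0) fall into the same class here because they differ only at the neutral locus, which is how the classification reduces the number of risk parameters.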
Problem 5: dimension (5) [Figure: graphical model with nodes for the event endpoint, genotype, diplotype, restrictions for parameters from the allele ordering, population haplotype frequencies, ordering of alleles of causal loci, number of causal loci, and location of causal loci.]
Problem 1: time • Regression dilution: Measuring time dependent and individually varying covariates (such as BP, cholesterol level and BMI) at a single time point generally leads to an underestimation of the effect size. • But what should one do if for each individual there is only a single covariate measurement in the data?
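Regression dilution can be illustrated with a small simulation. For simplicity this sketch assumes a linear model rather than a hazard regression, and treats the single measurement as the long-term covariate level plus independent noise; all parameter values are hypothetical. Under these assumptions the slope on the single measurement is attenuated by roughly var(true) / (var(true) + var(noise)).

```python
import random

def ols_slope(x, y):
    """Least-squares slope of y on x."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def simulate(n=20000, beta=1.0, sd_true=1.0, sd_noise=1.0, seed=1):
    """Return (slope on long-term level, slope on single noisy measurement)."""
    rng = random.Random(seed)
    level = [rng.gauss(0.0, sd_true) for _ in range(n)]        # long-term covariate level
    single = [x + rng.gauss(0.0, sd_noise) for x in level]     # one measurement per person
    outcome = [beta * x + rng.gauss(0.0, 0.5) for x in level]  # outcome driven by the level
    return ols_slope(level, outcome), ols_slope(single, outcome)

slope_level, slope_single = simulate()
```

With equal variances for the level and the noise, `slope_single` comes out near half of `slope_level`.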
Problem 1: time (2) • Modelling the underlying covariate process • For dealing with time dependent covariates in an explicit form, one needs a generator (stochastic intensities) for the covariate process considered as a function of pre-t histories, as well as corresponding stochastic intensities for the endpoint (T;X) itself. • One possibility is to apply the Marked Point Process (MPP) framework. The considered endpoint, with a corresponding description of the outcome, can then be embedded into this process in a natural way as a marked point (T;X).
Problem 1: time (3) • Measurement error: If the covariate measurements also involve a random error, we need a measurement model. The model parameters can be estimated if there are additional data available on the progression of the covariates. • Numerical implementation: Using MCMC and data augmentation methods, but practical implementation can be difficult. • Dependence of the covariates on genotype information? Fortunately, only long time averages of covariates are likely to be of importance for the considered endpoints. But the potential confounding problem remains.
Problem 4: missing data, confounding… • Genetic factors are potential confounders in causal questions. If the relevant genotype information is known and its role has been properly accounted for in the statistical model, this problem can be dealt with by proper conditioning on such information. • But what to do when a majority of the cohort members, as in MORGAM, have not been genotyped? • Usual solution: restrict the analysis only to those individuals who have been genotyped. But then the relevant follow-up and covariate information that exists on the other cohort members will not be used in the analysis at all.
Problem 4: missing data, confounding… • Treat also problem 4 as a missing data problem, considering a probability model for the missing genotypes and applying “full likelihood” and Bayesian inference (Kulathinal and Arjas 2006, cf. Scheike and Martinussen 2004). This solution involves considering the unknown genotypes in a distributional form. • Note, however, that a person's genotype, the measured risk factors and phenotype (time to event and event type) may all be statistically dependent on each other. Therefore the likelihood contribution from an individual who has not been genotyped involves an integration with respect to a (conditional) genotype distribution (which is generally different for different individuals).
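The integration over the genotype distribution can be sketched for a single non-genotyped individual. This is a simplified illustration, not the model of the cited papers: it assumes a constant (exponential) hazard for each genotype, a discrete genotype with three values, and an individual-specific conditional genotype distribution supplied as an input. All names and numbers are hypothetical.

```python
import math

def event_likelihood(time, event, hazard):
    """Exponential-model contribution given the genotype:
    hazard**event * exp(-hazard * time), with event = 1 (observed) or 0 (censored)."""
    return hazard ** event * math.exp(-hazard * time)

def marginal_contribution(time, event, hazards, genotype_probs):
    """Contribution of a non-genotyped individual: the genotype-specific
    contributions averaged over the individual's conditional genotype
    distribution p(g | that individual's observed covariates)."""
    return sum(p * event_likelihood(time, event, hazards[g])
               for g, p in genotype_probs.items())

# Hypothetical values: genotype = number of copies of the rare allele.
hazards = {0: 0.01, 1: 0.02, 2: 0.04}        # genotype-specific hazards
genotype_probs = {0: 0.64, 1: 0.32, 2: 0.04}  # conditional genotype distribution
contribution = marginal_contribution(12.0, 1, hazards, genotype_probs)
```

In an MCMC implementation with data augmentation, the sum would typically be replaced by sampling the missing genotype from its conditional distribution at each iteration.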
Problem 4: missing data, confounding… • In general, and depending on the information available, one can consider different levels of conditioning in the predictive probabilities p(y*|data, attrib*, hist*, do(exposure*’)). • Depending on such a level, the interpretation of the results from causal analysis will differ, with more detailed conditioning taking us closer to “individual causal effect” - which, however, can never be achieved by a statistical analysis of data.
Problem 4: missing data, confounding… • More detailed conditioning is also attractive as a recipe against potential confounders (“no unmeasured confounders” postulate). • Playing with finer level conditioning by using latent variable modelling can be attractive, but also risky if there is very little data, noisy data, or no data at all to support such modelling efforts. • In essence, such finer level predictive probabilities are calibrated against data that are actually observed.
“Take-home” messages: • Careful consideration of sources of information is important. • Interpretation of results is often facilitated by establishing intuitive links to causal “what if” ideas (“do”-conditioning). • Less emphasis on inference (particularly statistical significance testing) concerning individual regression coefficients.
“Take-home” messages: (2) • A general modelling approach based on MPPs is useful, offering possibilities to consider conditioning of probabilities on different levels of information. • The Bayesian approach, with MCMC for numerical computations, provides a flexible framework for statistical inference, keeping it within the domain of probability.