Causal Inference 因果推论 of Intermediate 中级 Phenotypes 表型 and Biomarkers 生物标记 in Rheumatoid Arthritis 风湿性关节炎
[An Application of Machine Learning 机器学习 Techniques to Genetic Epidemiology 遗传流行病学]
Wentian Li 李问天, Ph.D.
Feinstein Institute for Medical Research
Wentian Li, North Shore LIJ Health System
Genetic Association
• Association 相关 is not equivalent to a causal 因果的 relationship
• An association between wrinkles and cancer risk does not mean that one causes 导致 the other
• Age is a confounding factor 混杂因素
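The wrinkle/cancer point can be illustrated with a small worked example. This is only a sketch: the counts below are hypothetical numbers chosen for illustration (they are not data from the talk), constructed so that wrinkles and cancer are independent within each age group.

```python
# Hypothetical counts, NOT data from the talk.
# counts[age][(wrinkle, cancer)] = number of people; 1 = present, 0 = absent.
# Within each age group, wrinkles and cancer are independent by construction.
counts = {
    "young": {(1, 1): 5, (1, 0): 95, (0, 1): 45, (0, 0): 855},
    "old": {(1, 1): 240, (1, 0): 560, (0, 1): 60, (0, 0): 140},
}


def risk_ratio(table):
    """Cancer risk among people with wrinkles divided by risk among those without."""
    risk_wrinkle = table[(1, 1)] / (table[(1, 1)] + table[(1, 0)])
    risk_no_wrinkle = table[(0, 1)] / (table[(0, 1)] + table[(0, 0)])
    return risk_wrinkle / risk_no_wrinkle


# Pool the two age groups, ignoring age.
pooled = {cell: counts["young"][cell] + counts["old"][cell] for cell in counts["young"]}

rr_pooled = risk_ratio(pooled)
rr_by_age = {age: risk_ratio(table) for age, table in counts.items()}

print("pooled risk ratio:", round(rr_pooled, 2))
for age, rr in rr_by_age.items():
    print(age, "risk ratio:", round(rr, 2))
```

In the pooled table the risk ratio is well above 1, while within each age group it is 1: the apparent wrinkle/cancer association comes entirely from age.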
When do we need to know cause and effect?
• Rarely discussed in genetic analysis, because the genotype is always the cause 原因 and the phenotype is always the effect 效果
• In epidemiology 流行病学, a factor 因素-disease 疾病 association can arise in three situations: (1) the factor is a cause; (2) reverse causality; (3) a third, confounding factor
• For two intermediate phenotypes (biomarkers), the causal arrow can point either way
Causal Inference in Machine Learning
• Large text databases (e.g. Google)
• Observational data (no controlled experiment, and no other approach to determine causality)
• A two-point association indeed cannot be used to claim causality
• The key is a third variable, together with the conditional 条件的 association given that third variable
Data Mining and Knowledge Discovery (2000) v4, pp.163-192
An Example
Cooper’s Local Causality Discovery (LCD) Rule
• Six assumptions: 1. database completeness. 2. discrete variables. 3. Bayesian network model (directed acyclic 非环式的 graph: no loops). 4. … 5. no selection bias. 6. valid statistical testing.
• Three variables: x, y, z
• A hidden 潜在的 variable is allowed (but it is not in the dataset)
• Determine three correlations: the unconditional C(x,y) and C(y,z), and the conditional C(x,z|y)
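The three correlations can be checked on binary data with plain 2x2 chi-square tests. The sketch below rests on several assumptions that are not in the slides: the counts are hypothetical (built to follow a chain x => y => z), the conditional test simply sums the chi-square statistics over the two strata of y, and the significance level is 0.05. Helper names such as `chi2_2x2` and `collapse` are illustrative.

```python
from collections import Counter

# Chi-square critical values at alpha = 0.05, indexed by degrees of freedom.
CRIT = {1: 3.841, 2: 5.991}

# Hypothetical counts keyed by (x, y, z), constructed to follow x => y => z.
counts = {
    (1, 1, 1): 280, (1, 1, 0): 120, (1, 0, 1): 30, (1, 0, 0): 70,
    (0, 1, 1): 70, (0, 1, 0): 30, (0, 0, 1): 120, (0, 0, 0): 280,
}
X, Y, Z = 0, 1, 2  # positions of the variables inside each key


def chi2_2x2(table):
    """Pearson chi-square statistic (no continuity correction) for a 2x2 table."""
    a, b, c, d = table[(1, 1)], table[(1, 0)], table[(0, 1)], table[(0, 0)]
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def collapse(counts, i, j, given=None):
    """2x2 table of variables i and j, optionally restricted to given=(index, value)."""
    table = Counter()
    for cell, n in counts.items():
        if given is None or cell[given[0]] == given[1]:
            table[(cell[i], cell[j])] += n
    return table


chi2_xy = chi2_2x2(collapse(counts, X, Y))
chi2_yz = chi2_2x2(collapse(counts, Y, Z))
# Conditional test: sum the statistic over the strata y = 0 and y = 1 (2 d.f.).
chi2_xz_given_y = sum(chi2_2x2(collapse(counts, X, Z, given=(Y, v))) for v in (0, 1))

c_xy = chi2_xy > CRIT[1]
c_yz = chi2_yz > CRIT[1]
c_xz_given_y = chi2_xz_given_y > CRIT[2]
print("C(x,y):", c_xy, " C(y,z):", c_yz, " C(x,z|y):", c_xz_given_y)
```

With real data, a stratified test such as Cochran-Mantel-Haenszel or a log-linear model would be a reasonable alternative to summing per-stratum statistics.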
Between two variables, there are only 6 (4) causal relationships (allowing a confounding variable)
[Figure: six panels showing no relationship; causing; reverse causing; confounding; confounding + causing; confounding + reverse causing. Two panels are marked "NO", leaving 4.]
Number of causal relationships among three variables
• 6x6x6=216 possibilities
• 4x4x6=96 if x is not caused by either y or z (but can receive an arrow from a hidden variable) [Cooper’97 paper]
• 2x2x6=24 if x doesn’t even receive an arrow from hidden confounding variables [Li and Wang, unpublished]
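The three counts can be reproduced by brute-force enumeration. This sketch assumes the six pairwise relationship labels from the preceding slide and, like the slide's arithmetic, simply multiplies the options per pair without filtering out cyclic combinations. The names `PAIR_RELATIONS` and `count_models` are illustrative.

```python
from itertools import product

# The six possible relationships between an ordered pair (a, b).
PAIR_RELATIONS = [
    "no relationship",
    "causing",                      # a => b
    "reverse causing",              # b => a
    "confounding",                  # a <= h => b
    "confounding + causing",
    "confounding + reverse causing",
]


def count_models(x_pair_options):
    """Count combinations for the pairs (x,y), (x,z), (y,z).

    The two pairs involving x are restricted to x_pair_options;
    the pair (y,z) may take any of the six relationships.
    """
    return sum(1 for _ in product(x_pair_options, x_pair_options, PAIR_RELATIONS))


# No restriction on x.
n_all = count_models(PAIR_RELATIONS)

# x is not caused by y or z: drop every relationship with a reverse arrow.
no_reverse = [r for r in PAIR_RELATIONS if "reverse" not in r]
n_not_effect = count_models(no_reverse)

# x also receives no arrow from a hidden confounder.
no_hidden = [r for r in no_reverse if "confounding" not in r]
n_no_hidden = count_models(no_hidden)

print(n_all, n_not_effect, n_no_hidden)
```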
Given a causal model…
• The unconditional 无条件 association between any two variables can be determined by whether they are connected by a path
• The conditional 条件的 association can be determined by the so-called “d-separation” rule
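A d-separation check can be sketched in a few lines. The version below does not trace paths directly; it uses the moralized ancestral graph criterion, a standard equivalent formulation: keep only the two variables, the conditioning set, and their ancestors, connect co-parents, drop arrow directions, remove the conditioning set, and test connectivity. The graph encoding (a dict from child to its set of parents) and the function names are assumptions made for this example.

```python
def ancestors(dag, nodes):
    """Return the given nodes together with all of their ancestors.

    dag maps each child to the set of its parents; root nodes may be absent.
    """
    seen, stack = set(), list(nodes)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(dag.get(node, ()))
    return seen


def d_separated(dag, a, b, given=()):
    """True if a and b are d-separated by the set `given` in the DAG."""
    keep = ancestors(dag, {a, b, *given})
    # Moralize: link each child to its parents, and link parents that share a child.
    nbrs = {node: set() for node in keep}
    for child in keep:
        parents = list(dag.get(child, ()))
        for p in parents:
            nbrs[child].add(p)
            nbrs[p].add(child)
            for q in parents:
                if p != q:
                    nbrs[p].add(q)
    # Remove the conditioning set and search for an undirected path from a to b.
    blocked, seen, stack = set(given), set(), [a]
    while stack:
        node = stack.pop()
        if node in seen or node in blocked:
            continue
        seen.add(node)
        stack.extend(nbrs[node])
    return b not in seen


chain = {"y": {"x"}, "z": {"y"}}   # x => y => z
collider = {"y": {"x", "z"}}       # x => y <= z

print(d_separated(chain, "x", "z"))                # False: x and z are associated
print(d_separated(chain, "x", "z", given=["y"]))   # True: association lost given y
print(d_separated(collider, "x", "z"))             # True
print(d_separated(collider, "x", "z", given=["y"]))  # False: conditioning opens the path
```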
“CCC” causal inference rule
(Cooper version) If C(x,y)+, C(y,z)+, but C(x,z|y)-, then there are only three possible causal models (h is a hidden variable):
x => y => z
x <= h => y => z
h => x => y => z
(Silverstein et al. version) If C(x,y)+, C(y,z)+, C(x,z)+, but C(x,z|y)-, C(x,y|z)+, C(y,z|x)+, then...
In a three-way correlated set
IF one of the variables (x) is not an effect (only a cause) AND the correlation between x and z is lost conditional on y, THEN y causes z
x: gene; y, z: two intermediate phenotypes
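The rule reduces to a simple decision function once the three association tests have been run. This is a minimal sketch of the Cooper version as stated on the slides; the function name `ccc_rule` and its return strings are illustrative, and the boolean inputs are assumed to come from whatever statistical tests the analyst trusts.

```python
def ccc_rule(c_xy, c_yz, c_xz_given_y, x_is_not_an_effect=True):
    """Apply the CCC rule to three test outcomes.

    c_xy, c_yz     -- True if the unconditional association is significant
    c_xz_given_y   -- True if x and z are still associated conditional on y
    x_is_not_an_effect -- background knowledge, e.g. x is a genotype
    """
    if not x_is_not_an_effect:
        return "rule not applicable: x could be an effect"
    if c_xy and c_yz and not c_xz_given_y:
        return "y causes z"
    return "no conclusion"


# x: gene; y, z: two intermediate phenotypes.
print(ccc_rule(c_xy=True, c_yz=True, c_xz_given_y=False))  # y causes z
print(ccc_rule(c_xy=True, c_yz=True, c_xz_given_y=True))   # no conclusion
```

To decide the direction between two phenotypes, the function can be called twice with the roles of y and z swapped; the rule only speaks when exactly one of the two conditional associations is lost.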
The use of a not-an-effect variable has an amazing parallel in epidemiology
• Called an “instrumental variable”
• Martijn Katan’s idea on the cholesterol 胆固醇-cancer 癌症 association: he proposed to use a genotype (apolipoprotein 载脂蛋白 E) as the third variable (Lancet 1986, i:507-508)
• Katan did not use conditional correlation
• This idea is now called “Mendelian randomization”
Rheumatoid Arthritis (RA)
• An autoimmune 自我免疫的 disease
• Chronic inflammation 炎症 of the joints 关节
• Three times more likely to occur in women than in men
• Age of onset: 40-60
• Twin 双胞胎 concordance rates: 12-15% for MZ 单合子, 单卵双生; 5% for DZ 异卵双生
• Genetic and environmental (e.g. smoking) risk factors
MHC/HLA: the main genetic contribution to RA
• MHC (Major Histocompatibility Complex 主要组织相容性复合体) or HLA (Human leukocyte antigens 人类白血球抗原): the HLA-DRB1 gene on chromosome 6 (6p21.3)
• The RA-associated alleles in Caucasians are HLA-DRB1*0401, *0404, *0408, but not *0402, *0403, *0407
• In Asian populations, different DRB1 alleles are associated with RA (e.g. *0405, *0901)
• A group of DRB1 risk alleles is called the “shared epitope” (SE) 共同表位, or rheumatoid epitope; it codes for amino acid positions 70-74 in the third hypervariable region
Two auto-antibodies are strongly associated with RA: RF and anti-CCP
• RF (rheumatoid factor 类风湿因子): 80% of RA patients are RF positive
• anti-CCP (anti-cyclic citrullinated peptide antibody 抗环瓜氨酸肽抗体, 抗CCP抗体): an even better predictor of RA in the early stage
• HLA-DRB1, RF, and anti-CCP are all associated with RA, and they are associated with each other. The CCC rule can be applied!
张利方, 阎有功, 黄前川, et al., “Application of anti-cyclic citrullinated peptide antibody in the diagnosis of rheumatoid arthritis”, 免疫学杂志 (Immunological Journal), 2004, 20:52-57
Q: Between RF and anti-CCP, which one is the cause and which is the effect?
1723 Caucasian RA patients
[Figure: results shown separately for anti-CCP positive and anti-CCP negative patients]
The association between RF and DRB1 genotype is lost conditional on anti-CCP
By the CCC rule, anti-CCP is the cause and RF is the effect.
Or: anti-CCP is upstream and RF is downstream in a pathway.
Discussion/Issues
• There is evidence that RA patients become anti-CCP positive before becoming RF positive
• The three-way correlation might be lost in normal controls (here we have a “case-only” analysis)
• Other factors are possible in between anti-CCP and RF (so the cause-effect relationship may not be direct)
• It is not clear where the smoking factor comes in (could be an intriguing analysis with smoking data!)
Revisit Katan’s “Mendelian Randomization” (MR) by LCD [Wang, Li, unpublished]
MR:
• Needs a not-an-effect variable (gene)
• Conditional association is not used
• Only needs a counterexample (e.g. Apo E2 samples have low cholesterol, but NOT high cancer risk)
LCD:
• Needs a variable that is not an effect
• Conditional association is used
• Needs complete information on the (G, IP, D) trio for all samples (e.g. Apo genotype, cholesterol level, cancer status)
Co-Authors
• Mingyi WANG (Zhejiang Univ, Computer Science Department, causal inference)
• Patricia Irigoyen, Peter Gregersen (North Shore LIJ, RA data)