Learning the Genetics of Common Human Diseases: Applications of Machine Learning Methods in Genomic Epidemiological Studies
Yan Sun
Wayne State, Detroit, MI
Epidemiology
Epidemiology is the study of factors affecting the health and illness of populations, and serves as the foundation and logic of interventions made in the interest of public health and preventive medicine. It is considered a cornerstone methodology of public health research, and is highly regarded in evidence-based medicine for identifying risk factors for disease and determining optimal treatment approaches to clinical practice.
Genetic/Genomic Epidemiology
Genetic epidemiology is the epidemiological evaluation of the role of inherited causes of disease in families and in populations; it aims to detect the inheritance pattern of a particular disease, localize the gene and find a marker associated with disease susceptibility. Gene-gene and gene-environment interactions are also studied in genetic epidemiology of a disease. In its broad context, genetic epidemiology includes family studies, molecular epidemiologic studies with genetic components, and more traditional cohort and case-control studies with family history components.
Genetics of Common Human Diseases
Heritability of Common Human Diseases
Disease: Heritability
• Asthma: ~60%
• T2D: ~70%
• Obesity: ~50%
[Table: CVD risk factors with a significant genetic component (heritability). From: Genetics of atherosclerosis, Lusis, A.J., Mar, R., Pajukanta, P., 2004]
Dimensionality of Data
• Phenotypic Data
• Genotypic Data
  • Microsatellites
  • Single Nucleotide Polymorphism (SNP)
  • Copy Number Variation (CNV)
SNP AAGGCGTTTGTCCC . . . CCCTAGGATGCAA . . . AAGCTTGGCGCC AAGGCCTTTGTCCC . . . CCCTTGGATGCAA . . . AAGCTGGGCGCC AAGGCCTTTGTCCC . . . CCCTAGGATGCAA . . . AAGCTTGGCGCC AAGGCCTTTGTCCC . . . CCCTTGGATGCAA . . . AAGCTTGGCGCC AAGGCGTTTGTCCC . . . CCCTTGGATGCAA . . . AAGCTTGGCGCC AAGGCGTTTGTCCC . . . CCCTTGGATGCAA . . . AAGCTTGGCGCC AAGGCCTTTGTCCC . . . CCCTAGGATGCAA . . . AAGCTTGGCGCC AAGGCGTTTGTCCC . . . CCCTAGGATGCAA . . . AAGCTTGGCGCC AAGGCGTTTGTCCC . . . CCCTAGGATGCAA . . . AAGCTTGGCGCC AAGGCGTTTGTCCC . . . CCCTTGGATGCAA . . . AAGCTTGGCGCC SNP1 C/G SNP2 A/T SNP3 T/G Yan Sun
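As a minimal sketch of how the three SNPs on this slide can be read off the aligned sequences, the following Python scans every alignment column and reports the columns that carry more than one allele. The function name `find_snps` and the choice to store each sequence as a tuple of three segments are assumptions made for illustration; they are not part of the talk.

```python
from collections import Counter

# The ten aligned sequences from the slide, each split into its three segments.
HAPLOTYPES = [
    ("AAGGCGTTTGTCCC", "CCCTAGGATGCAA", "AAGCTTGGCGCC"),
    ("AAGGCCTTTGTCCC", "CCCTTGGATGCAA", "AAGCTGGGCGCC"),
    ("AAGGCCTTTGTCCC", "CCCTAGGATGCAA", "AAGCTTGGCGCC"),
    ("AAGGCCTTTGTCCC", "CCCTTGGATGCAA", "AAGCTTGGCGCC"),
    ("AAGGCGTTTGTCCC", "CCCTTGGATGCAA", "AAGCTTGGCGCC"),
    ("AAGGCGTTTGTCCC", "CCCTTGGATGCAA", "AAGCTTGGCGCC"),
    ("AAGGCCTTTGTCCC", "CCCTAGGATGCAA", "AAGCTTGGCGCC"),
    ("AAGGCGTTTGTCCC", "CCCTAGGATGCAA", "AAGCTTGGCGCC"),
    ("AAGGCGTTTGTCCC", "CCCTAGGATGCAA", "AAGCTTGGCGCC"),
    ("AAGGCGTTTGTCCC", "CCCTTGGATGCAA", "AAGCTTGGCGCC"),
]


def find_snps(haplotypes):
    """Return (segment index, column index, allele counts) for each polymorphic column."""
    snps = []
    for seg in range(len(haplotypes[0])):
        # zip(*...) turns the list of sequences into a list of alignment columns.
        columns = zip(*(h[seg] for h in haplotypes))
        for col, bases in enumerate(columns):
            counts = Counter(bases)
            if len(counts) > 1:  # more than one allele observed at this position
                snps.append((seg, col, dict(counts)))
    return snps


snps = find_snps(HAPLOTYPES)
for seg, col, counts in snps:
    print(f"segment {seg + 1}, position {col + 1}: {counts}")
```

Each segment yields exactly one variable position, with alleles C/G, A/T and T/G respectively, which corresponds to SNP1, SNP2 and SNP3 on the slide.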
Genome-Wide Association Data
Data Mining
The data are not mine, they are public.
• NIH dbGaP (http://www.ncbi.nlm.nih.gov/sites/entrez?db=gap)
• The Wellcome Trust Case Control Consortium (http://www.wtccc.org.uk/)
• Genetic Analysis Workshop
Goals
• Variable Selection
  • Noise removal
  • Dimension reduction
  • Feature selection
• Prediction Model
  • Predictive ability of genetic factors alone
  • Improvement of predictive ability
• Underlying Mechanism*
  • Interactions, pathways and biological networks
Machine Learning Methods
• Neural Networks
• Support Vector Machine
• Ensemble Learning Methods
  • Bagging, Boosting, Random Forests & RuleFit
Ensemble Learning Methods
Ensemble learning refers to a collection of methods that learn a target function by training a number of individual learners and combining their predictions.
Ensemble Learning Methods
• Accuracy: a more reliable mapping can be obtained by combining the output of multiple "experts"
• Efficiency: a complex problem can be decomposed into multiple sub-problems that are easier to understand and solve (divide-and-conquer approach). Mixture of experts, ensemble feature selection.
• There is no single model that works for all pattern recognition problems! (no free lunch theorem)
"To solve really hard problems, we’ll have to use several different representations... It is time to stop arguing over which type of pattern-classification technique is best... Instead we should work at a higher level of organization and discover how to build managerial systems to exploit the different virtues and evade the different limitations of each of these ways of comparing things." --Minsky, 1991.
Why do ensembles work?
• Because uncorrelated errors of individual classifiers largely cancel out when their outputs are averaged.
• Assume: 40 base classifiers, majority voting, each with error rate 0.3
• Probability that an instance will be misclassified by r out of 40 classifiers (Dietterich, 1997): r=21 -> 0.002
• Theoretical results by Hansen & Salamon (1990)
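The figure on this slide can be checked with a short binomial calculation. The sketch below assumes the 40 classifiers err independently, each with probability 0.3, so the number of wrong votes follows a Binomial(40, 0.3) distribution. The function names are illustrative.

```python
from math import comb


def prob_exactly_wrong(r, n=40, error_rate=0.3):
    """Binomial probability that exactly r of n independent classifiers are wrong."""
    return comb(n, r) * error_rate**r * (1 - error_rate) ** (n - r)


def prob_majority_wrong(n=40, error_rate=0.3):
    """Probability that more than half of n independent classifiers are wrong."""
    first_losing_count = n // 2 + 1  # 21 wrong votes out of 40 is the smallest wrong majority
    return sum(prob_exactly_wrong(r, n, error_rate)
               for r in range(first_losing_count, n + 1))


p21 = prob_exactly_wrong(21)
p_majority = prob_majority_wrong()
print(f"P(exactly 21 of 40 wrong)  = {p21:.4f}")
print(f"P(21 or more of 40 wrong)  = {p_majority:.4f}")
```

Under the independence assumption, both probabilities are on the order of 0.002, far below the individual error rate of 0.3. Correlated errors, which are the norm in practice, weaken this gain.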
Ensemble Learning Methods
• How to generate base classifiers? Generation strategy:
  • Decision tree learning: ID3, C4.5 & CART
  • Instance-based learning: k-nearest neighbor
  • Bayesian classification: Naïve Bayes
  • Neural networks
  • Regression analysis
  • Clustering, etc.
• How to integrate them? Integration strategy:
  • BAGGing = Bootstrap AGGregation (Breiman, 1996)
  • Boosting (Schapire and Singer, 1998)
  • Random Forests (Breiman, 2001)
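To make the generation and integration steps concrete, here is a minimal sketch of bagging in plain Python. It assumes a very simple base classifier (a one-feature threshold rule, or decision stump) and a toy data set of two hypothetical SNPs coded 0/1/2; neither comes from the talk, and a real analysis would use full decision trees and a tested library.

```python
import random
from collections import Counter


def train_stump(rows, labels):
    """Fit a one-feature threshold rule that minimizes training errors."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            for above_label in (0, 1):
                preds = [above_label if row[feature] > threshold else 1 - above_label
                         for row in rows]
                errors = sum(p != y for p, y in zip(preds, labels))
                if best is None or errors < best[0]:
                    best = (errors, feature, threshold, above_label)
    _, feature, threshold, above_label = best
    return feature, threshold, above_label


def stump_predict(stump, row):
    feature, threshold, above_label = stump
    return above_label if row[feature] > threshold else 1 - above_label


def bagging_fit(rows, labels, n_learners=25, seed=0):
    """Generation step: train one stump per bootstrap sample."""
    rng = random.Random(seed)
    n = len(rows)
    ensemble = []
    for _ in range(n_learners):
        idx = [rng.randrange(n) for _ in range(n)]  # draw n rows with replacement
        ensemble.append(train_stump([rows[i] for i in idx],
                                    [labels[i] for i in idx]))
    return ensemble


def bagging_predict(ensemble, row):
    """Integration step: majority vote over the base classifiers."""
    votes = Counter(stump_predict(stump, row) for stump in ensemble)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): genotypes at two SNPs; the label depends only on the first.
rows = [(g1, g2) for g1 in range(3) for g2 in range(3)] * 4
labels = [1 if g1 == 2 else 0 for g1, _ in rows]

ensemble = bagging_fit(rows, labels)
print(bagging_predict(ensemble, (2, 0)), bagging_predict(ensemble, (1, 2)))
```

Boosting and Random Forests change the generation step (reweighting the training examples, or sampling features at each split), while the integration step remains a weighted or unweighted vote.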
Random Forests
[Figure: Tree 1, Tree 2, Tree 3, ..., Tree i, Tree i+1, Tree i+2, Tree i+3, ..., Tree N, each casting a vote (No, Yes, No, Yes, No, Yes, No, No). Final classification is based on votes from all N trees.]
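The voting step in the Random Forests figure can be sketched in a few lines. The votes are listed in the order they appear in the extracted slide text, which may not match the order of the trees in the original figure, and the function name `forest_vote` is illustrative.

```python
from collections import Counter


def forest_vote(tree_votes):
    """Return the majority label and the vote counts across all trees."""
    counts = Counter(tree_votes)
    # most_common orders labels by count; ties keep first-seen order.
    winner = counts.most_common(1)[0][0]
    return winner, dict(counts)


votes = ["No", "Yes", "No", "Yes", "No", "Yes", "No", "No"]
winner, counts = forest_vote(votes)
print(winner, counts)
```

With five "No" votes against three "Yes" votes, the forest classifies this instance as "No".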
Random Forests
• Its accuracy is as good as AdaBoost and sometimes better
• It is relatively robust to outliers and noise
• It is faster than bagging or boosting
• It gives useful internal estimates of error, strength, correlation and variable importance
• It is simple and can be easily parallelized
Random Forests
[Figure: Heidema 2006, BMC Genetics]
Application of Random Forests
• Candidate Genes
• Genome-wide Markers
Application of Random Forests
[Workflow diagram]
• Data: ds1 (n=360) and ds2 (n=360), each with 16 covariates and 471 SNPs
• Missing SNP genotype imputation
• SNP sets: all 471 SNPs; tagSNPs (LD Rsq<0.5)
• Random Forests and RuleFit
• Prediction models with ROC curve, 5-fold CV
• Identify replicable covariates and SNPs
• KGraph
• Summary and biological relevance
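Two evaluation steps in this workflow, 5-fold cross-validation and the ROC curve, can be sketched without any library. The helpers below are assumptions for illustration, not the code used in the study: `k_fold_indices` splits n subjects into k disjoint folds, and `roc_auc` computes the area under the ROC curve as the probability that a randomly chosen case receives a higher score than a randomly chosen control.

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def roc_auc(labels, scores):
    """AUC as the fraction of (case, control) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case and one control")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


folds = k_fold_indices(360, k=5)
for i, test_idx in enumerate(folds):
    # Every subject outside the held-out fold is used for training.
    train_idx = [j for f in folds if f is not test_idx for j in f]
    print(f"fold {i + 1}: {len(train_idx)} training, {len(test_idx)} test subjects")

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(f"AUC on a four-subject toy example: {auc}")
```

In a cross-validated analysis, the model would be refit on each training set and the scores predicted for the held-out fold would be pooled before computing the AUC.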
Predicting CAC
[Figure panels: Random Forests, RuleFit]
Using All SNPs vs. Tag SNPs
A tagSNP is a representative SNP in a region of the genome that is in high linkage disequilibrium with the other SNPs in that region. Using tagSNPs makes it possible to capture genetic variation without genotyping every SNP in a chromosomal region; a tagSNP is the maximally informative SNP for its region.
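One simple way to pick tagSNPs is greedy pruning on pairwise r², sketched below. This is an assumption for illustration and not necessarily the selection procedure used in the study: r² is approximated by the squared Pearson correlation of 0/1/2 genotype codes, the 0.5 threshold mirrors the "LD Rsq<0.5" label in the workflow, and the SNP names and genotypes are made up.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors coded 0/1/2."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def greedy_tag_snps(genotypes, threshold=0.5):
    """Keep a SNP only if its r^2 with every SNP already kept is below the threshold."""
    tags = []
    for name, genotype in genotypes.items():
        if all(r_squared(genotype, genotypes[t]) < threshold for t in tags):
            tags.append(name)
    return tags


# Hypothetical genotypes for eight subjects at three SNPs.
genotypes = {
    "snp_a": [0, 1, 2, 0, 1, 2, 0, 1],
    "snp_b": [0, 1, 2, 0, 1, 2, 0, 2],  # nearly identical to snp_a (high LD)
    "snp_c": [2, 0, 1, 1, 2, 0, 0, 1],  # weakly correlated with snp_a
}

tags = greedy_tag_snps(genotypes)
print(tags)
```

Here `snp_b` is dropped because `snp_a` already represents it, while `snp_c` is kept because it is only weakly correlated with `snp_a`. The result of greedy pruning depends on the order in which SNPs are visited.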
Predicting CAC
GPR35 Protein
KGraph Presentation
Predicting RA Status
• Sample: One subject was randomly selected from each family to create dataset 1, and a second subject was then randomly selected from the remaining members of that family for dataset 2. The singletons were randomly divided between the two samples. Each of the two replicate samples has 740 unrelated subjects.
• Genetic Markers: 5,742 genome-wide informative SNP markers.
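The sampling scheme described on this slide can be sketched as follows. The input format (a list of `(subject_id, family_id)` pairs), the function name, and the treatment of a one-member family as a singleton are assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_replicates(subjects, seed=0):
    """Build two replicate samples of unrelated subjects from family data."""
    rng = random.Random(seed)
    families = defaultdict(list)
    for subject_id, family_id in subjects:
        families[family_id].append(subject_id)

    ds1, ds2, singletons = [], [], []
    for members in families.values():
        if len(members) == 1:
            singletons.append(members[0])
            continue
        # Two distinct members chosen at random: one per replicate sample.
        first, second = rng.sample(members, 2)
        ds1.append(first)
        ds2.append(second)

    # Singletons are divided at random between the two samples.
    rng.shuffle(singletons)
    half = len(singletons) // 2
    ds1.extend(singletons[:half])
    ds2.extend(singletons[half:])
    return ds1, ds2


# Hypothetical subjects: two multi-member families and four singletons.
subjects = [("s1", "f1"), ("s2", "f1"), ("s3", "f1"),
            ("s4", "f2"), ("s5", "f2"),
            ("s6", "f3"), ("s7", "f4"), ("s8", "f5"), ("s9", "f6")]
family_of = dict(subjects)

ds1, ds2 = split_replicates(subjects)
print(sorted(ds1), sorted(ds2))
```

No subject appears in both samples, and each sample contains at most one member of any family, so the subjects within a sample are unrelated.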
Predicting RA Status
[Figure: Sun YV et al., 2007, In Press]
Predicting RA Status
[Figure: Sun YV et al., 2007, In Press]
Challenges
• Validation, validation and validation!
“So far, comprehensive reviews of the published literature, most of which reports work based on the candidate-gene approach, have demonstrated a plethora of questionable genotype–phenotype associations, replication of which has often failed in independent studies.”
Computational Challenge
• 100K SNP data
• 500K SNP data
• 1M SNP data
• 3.1M HapMap SNPs (Nature, Oct. 2007)
• And more: different types of genetic variation
• And more rare genetic variants
• And larger samples
If all a man has is a hammer, then every problem looks like a nail.
Acknowledgements
• University of Michigan: Sharon Kardia, Lawrence Bielak, Patricia Peyser, Ji Zhu
• Mayo Clinic: Stephen Turner, Patrick Sheedy, II
• University of Texas: Eric Boerwinkle