Learning the Genetics of Common Human Diseases

Applications of Machine Learning Methods in Genomic Epidemiological Studies
Yan Sun
Wayne State, Detroit, MI

Epidemiology
Epidemiology is the study of factors affecting the health and illness of populations, and serves as the foundation and logic of interventions made in the interest of public health and preventive medicine. It is considered a cornerstone methodology of public health research, and is highly regarded in evidence-based medicine for identifying risk factors for disease and determining optimal treatment approaches to clinical practice.

Genetic/Genomic Epidemiology
Genetic epidemiology is the epidemiological evaluation of the role of inherited causes of disease in families and in populations; it aims to detect the inheritance pattern of a particular disease, localize the gene and find a marker associated with disease susceptibility. Gene-gene and gene-environment interactions are also studied in the genetic epidemiology of a disease. In its broad context, genetic epidemiology includes family studies, molecular epidemiologic studies with genetic components, and more traditional cohort and case-control studies with family history components.

Genetics of Common Human Diseases
Heritability of Common Human Diseases (disease: heritability)
• Asthma: ~60%
• T2D: ~70%
• Obesity: ~50%
CVD risk factors with a significant genetic component (heritability). Source: Lusis, A.J., Mar, R., Pajukanta, P., 2004, Genetics of atherosclerosis.

Dimensionality of Data
• Phenotypic Data
• Genotypic Data
• Microsatellites
• Single Nucleotide Polymorphism (SNP)
• Copy Number Variation (CNV)

SNP AAGGCGTTTGTCCC . . . CCCTAGGATGCAA . . . AAGCTTGGCGCC AAGGCCTTTGTCCC . . . CCCTTGGATGCAA . . . AAGCTGGGCGCC AAGGCCTTTGTCCC . . . CCCTAGGATGCAA . . . AAGCTTGGCGCC AAGGCCTTTGTCCC . . . CCCTTGGATGCAA . . . AAGCTTGGCGCC AAGGCGTTTGTCCC . . . CCCTTGGATGCAA . . . AAGCTTGGCGCC AAGGCGTTTGTCCC . . . CCCTTGGATGCAA . . . AAGCTTGGCGCC AAGGCCTTTGTCCC . . . CCCTAGGATGCAA . . . AAGCTTGGCGCC AAGGCGTTTGTCCC . . . CCCTAGGATGCAA . . . AAGCTTGGCGCC AAGGCGTTTGTCCC . . . CCCTAGGATGCAA . . . AAGCTTGGCGCC AAGGCGTTTGTCCC . . . CCCTTGGATGCAA . . . AAGCTTGGCGCC SNP1 C/G SNP2 A/T SNP3 T/G Yan Sun
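
The alignment above can be scanned for variable columns programmatically. The following is a minimal sketch, not the method used in the talk: it joins the three segments shown for the first four sequences into one string (skipping the elided regions), reports every column with more than one allele, and computes a minor allele frequency. The function names `find_snps` and `minor_allele_frequency` are illustrative, and positions are 0-based offsets in the joined string rather than genomic coordinates.

```python
from collections import Counter

# First four aligned sequences from the slide. The three visible segments
# are concatenated; the elided ". . ." regions are simply skipped.
haplotypes = [
    "AAGGCGTTTGTCCC" "CCCTAGGATGCAA" "AAGCTTGGCGCC",
    "AAGGCCTTTGTCCC" "CCCTTGGATGCAA" "AAGCTGGGCGCC",
    "AAGGCCTTTGTCCC" "CCCTAGGATGCAA" "AAGCTTGGCGCC",
    "AAGGCCTTTGTCCC" "CCCTTGGATGCAA" "AAGCTTGGCGCC",
]


def find_snps(seqs):
    """Return {position: Counter of alleles} for every variable column."""
    snps = {}
    for pos, column in enumerate(zip(*seqs)):
        counts = Counter(column)
        if len(counts) > 1:  # more than one allele observed at this column
            snps[pos] = counts
    return snps


def minor_allele_frequency(counts):
    """Frequency of the least common allele at one site."""
    return min(counts.values()) / sum(counts.values())


snps = find_snps(haplotypes)
for pos, counts in sorted(snps.items()):
    alleles = "/".join(sorted(counts))
    print(pos, alleles, round(minor_allele_frequency(counts), 2))
```

With real data the same column scan would run over phased haplotypes or genotype calls rather than hand-copied strings.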

Genome-Wide Association Data

Estivill and Armengol, 2007, PLoS Genetics

Data Mining
The data are not mine, they are public.
• NIH dbGaP (http://www.ncbi.nlm.nih.gov/sites/entrez?db=gap)
• The Wellcome Trust Case Control Consortium (http://www.wtccc.org.uk/)
• Genetic Analysis Workshop

Goals
• Variable Selection
• Noise removal
• Dimension reduction
• Feature selection
• Prediction Model
• Predictive ability of genetic factors alone
• Improvement of predictive ability
• Underlying Mechanism*
• Interactions, pathways and biological networks

Machine Learning Methods
• Neural Networks
• Support Vector Machine
• Ensemble Learning Methods
• Bagging, Boosting, Random Forests & RuleFit

Ensemble Learning Methods
Ensemble learning refers to a collection of methods that learn a target function by training a number of individual learners and combining their predictions.

Ensemble Learning Methods
• Accuracy: a more reliable mapping can be obtained by combining the output of multiple "experts".
• Efficiency: a complex problem can be decomposed into multiple sub-problems that are easier to understand and solve (divide-and-conquer approach). Mixture of experts, ensemble feature selection.
• There is no single model that works for all pattern recognition problems! (no free lunch theorem)
"To solve really hard problems, we’ll have to use several different representations... It is time to stop arguing over which type of pattern-classification technique is best... Instead we should work at a higher level of organization and discover how to build managerial systems to exploit the different virtues and evade the different limitations of each of these ways of comparing things." -- Minsky, 1991

Why do ensembles work?
• Because uncorrelated errors of individual classifiers can be eliminated by averaging.
• Assume: 40 base classifiers, majority voting, each with error rate 0.3 and errors independent of one another
• Probability that an instance will be misclassified by r out of 40 classifiers (Dietterich, 1997): r=21 -> 0.002
• Theoretical results by Hansen & Salamon (1990)
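
The figure quoted above follows from the binomial distribution. The sketch below assumes the classifiers err independently with the same error rate, which is the idealized setting of the slide and not something real ensembles achieve exactly. It computes both the probability that exactly 21 of 40 classifiers are wrong and the probability that 21 or more are wrong (the event in which a majority vote fails); both come out at roughly 0.002 when rounded to three decimals. The function names are illustrative.

```python
from math import comb


def prob_exactly(r, n, p):
    """Binomial probability that exactly r of n independent classifiers err."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)


def prob_majority_wrong(n, p):
    """Probability that more than half of n independent classifiers err."""
    smallest_majority = n // 2 + 1
    return sum(prob_exactly(r, n, p) for r in range(smallest_majority, n + 1))


n, p = 40, 0.3
print(f"P(exactly 21 of {n} wrong) = {prob_exactly(21, n, p):.4f}")
print(f"P(majority vote wrong)    = {prob_majority_wrong(n, p):.4f}")
```

Each individual classifier is wrong 30% of the time, so the gain from voting is large, but only to the extent that the independence assumption holds.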

Ensemble Learning Methods
• How to generate base classifiers? Generation strategy:
• Decision tree learning: ID3, C4.5 & CART
• Instance-based learning: k-nearest neighbor
• Bayesian classification: Naïve Bayes
• Neural networks
• Regression analysis
• Clustering, etc.
• How to integrate them? Integration strategy:
• BAGGing = Bootstrap AGGregation (Breiman, 1996)
• Boosting (Schapire and Singer, 1998)
• Random Forests (Breiman, 2001)
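
As an illustration of one integration strategy, bagging can be sketched in a few lines: draw bootstrap samples, fit one base classifier per sample, and combine the predictions by majority vote. The example below is a toy under stated assumptions, not the talk's implementation: the base learner is a one-feature decision stump, the data are simulated with 10% label noise, and all names (`fit_stump`, `bagging`, `predict`) are hypothetical.

```python
import random
from collections import Counter


def fit_stump(data):
    """Fit a decision stump: the threshold and label orientation with fewest errors."""
    best = None
    for threshold in sorted({x for x, _ in data}):
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= threshold else right) != y for x, y in data)
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return lambda x: left if x <= threshold else right


def bagging(data, n_learners, rng):
    """Train one stump per bootstrap sample (n draws with replacement)."""
    learners = []
    for _ in range(n_learners):
        sample = [rng.choice(data) for _ in data]
        learners.append(fit_stump(sample))
    return learners


def predict(learners, x):
    """Aggregate the base classifiers by majority vote."""
    votes = Counter(learner(x) for learner in learners)
    return votes.most_common(1)[0][0]


rng = random.Random(0)
# Simulated data: label is 1 when x > 0.5, with 10% of labels flipped.
data = []
for _ in range(200):
    x = rng.random()
    y = int(x > 0.5)
    if rng.random() < 0.1:
        y = 1 - y
    data.append((x, y))

ensemble = bagging(data, n_learners=25, rng=rng)
print(predict(ensemble, 0.1), predict(ensemble, 0.9))
```

An odd number of learners avoids tied votes in the two-class case. Random Forests add a second source of randomness on top of this scheme by sampling candidate variables at each tree split.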

Random Forests
[Figure: N decision trees (Tree 1, Tree 2, Tree 3, ..., Tree i, Tree i+1, ..., Tree N), each casting a Yes/No vote. Final classification is based on votes from all N trees.]

Random Forests
• Its accuracy is as good as AdaBoost and sometimes better
• It is relatively robust to outliers and noise
• It is faster than bagging or boosting
• It gives useful internal estimates of error, strength, correlation and variable importance
• It is simple and can be easily parallelized
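
The internal error estimates mentioned above rest on out-of-bag samples in Breiman's (2001) formulation: each tree is grown on a bootstrap sample, and the subjects left out of that sample serve as a built-in test set for that tree. The sketch below only illustrates how many subjects are typically left out; it does not grow any trees. The sample size of 360 is borrowed from the datasets described later in the talk, and the function name is illustrative.

```python
import random


def out_of_bag_fraction(n_samples, rng):
    """Fraction of subjects never drawn in one bootstrap sample of size n_samples."""
    in_bag = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1 - len(in_bag) / n_samples


rng = random.Random(1)
n = 360
fractions = [out_of_bag_fraction(n, rng) for _ in range(200)]
mean_oob = sum(fractions) / len(fractions)
expected = (1 - 1 / n) ** n  # approaches 1/e (about 0.368) as n grows

print(round(mean_oob, 3), round(expected, 3))
```

Roughly a third of the subjects are therefore available to test each tree without setting aside a separate validation set.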

Random Forests
Heidema 2006, BMC Genetics

Application of Random Forests
• Candidate Genes
• Genome-wide Markers

Application of Random Forests
• Datasets: ds1 (n=360) and ds2 (n=360), each with 16 covariates and 471 SNPs
• Missing SNP genotype imputation
• SNP sets: all 471 SNPs, or tagSNPs (LD r² < 0.5)
• Methods: Random Forests, RuleFit
• Prediction models with ROC curve, 5-fold CV
• Identify replicable covariates and SNPs
• KGraph
• Summary and biological relevance
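
The evaluation step in this workflow (ROC curve with 5-fold cross-validation) can be sketched without any modelling library. The example below is a minimal illustration under loud assumptions: the data are simulated, and a one-feature midpoint scorer (`fit_centroid_scorer`, a hypothetical stand-in) takes the place of Random Forests or RuleFit. The area under the ROC curve is computed as the probability that a case scores higher than a control, counting ties as one half.

```python
import random


def auc(scores, labels):
    """Area under the ROC curve via pairwise case/control comparisons."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n, k, rng):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_centroid_scorer(xs, ys):
    """Stand-in model: signed distance from the midpoint of the two class means."""
    cases = [x for x, y in zip(xs, ys) if y == 1]
    ctrls = [x for x, y in zip(xs, ys) if y == 0]
    case_mean = sum(cases) / len(cases)
    ctrl_mean = sum(ctrls) / len(ctrls)
    midpoint = (case_mean + ctrl_mean) / 2
    direction = 1.0 if case_mean >= ctrl_mean else -1.0
    return lambda x: direction * (x - midpoint)


rng = random.Random(0)
n = 360  # same size as ds1 and ds2; the values themselves are simulated
ys = [i % 2 for i in range(n)]
xs = [rng.gauss(1.0 if y else 0.0, 1.0) for y in ys]

fold_aucs = []
for test_idx in k_fold_indices(n, 5, rng):
    held_out = set(test_idx)
    train_idx = [i for i in range(n) if i not in held_out]
    scorer = fit_centroid_scorer([xs[i] for i in train_idx], [ys[i] for i in train_idx])
    fold_aucs.append(auc([scorer(xs[i]) for i in test_idx], [ys[i] for i in test_idx]))

mean_auc = sum(fold_aucs) / len(fold_aucs)
print([round(a, 2) for a in fold_aucs], round(mean_auc, 2))
```

The model is refit inside every fold, so each held-out subject is scored by a model that never saw it during training.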

Predicting CAC
[Figure panels: Random Forests, RuleFit]

Using All SNPs vs. Tag SNPs
A tagSNP is a representative SNP in a region of the genome that is in high linkage disequilibrium with the other SNPs in that region. Using tagSNPs makes it possible to identify genetic variation without genotyping every SNP in a chromosomal region. The tagSNP is the maximally informative SNP for its region.
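
One simple way to obtain a reduced SNP set at a given LD threshold is greedy pruning on pairwise r². The sketch below is an assumption-laden illustration and not necessarily the tagSNP selection procedure used in this study: r² is approximated by the squared Pearson correlation of genotype codes (0, 1 or 2 copies of an allele) rather than estimated from haplotypes, the genotype vectors are made up, and `select_tag_snps` is a hypothetical name. The 0.5 threshold mirrors the "LD r² < 0.5" criterion from the workflow slide.

```python
def r_squared(g1, g2):
    """Squared Pearson correlation between two genotype vectors coded 0/1/2."""
    n = len(g1)
    m1, m2 = sum(g1) / n, sum(g2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    v1 = sum((a - m1) ** 2 for a in g1)
    v2 = sum((b - m2) ** 2 for b in g2)
    if v1 == 0 or v2 == 0:  # a monomorphic SNP carries no LD information
        return 0.0
    return cov * cov / (v1 * v2)


def select_tag_snps(genotypes, threshold=0.5):
    """Keep a SNP only if its r^2 with every SNP kept so far is below threshold."""
    tags = []
    for name, g in genotypes.items():
        if all(r_squared(g, genotypes[t]) < threshold for t in tags):
            tags.append(name)
    return tags


# Made-up genotypes for eight subjects at three SNPs.
genotypes = {
    "snpA": [0, 1, 2, 0, 1, 2, 0, 1],
    "snpB": [0, 1, 2, 0, 1, 2, 0, 2],  # nearly identical to snpA
    "snpC": [2, 0, 1, 1, 0, 2, 1, 0],  # almost uncorrelated with snpA
}

print(select_tag_snps(genotypes))
```

Here snpB is dropped because snpA already represents it, while snpC is kept. The result of greedy pruning depends on the order in which SNPs are visited.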

Predicting CAC

GPR35 Protein

Kgraph Presentation

Predicting RA Status
• Sample: One subject was randomly selected from each family to create dataset 1, and a second subject was then randomly selected from the remaining subjects for dataset 2. The singletons were randomly divided between the two datasets. Each of the two replicate samples has 740 unrelated subjects.
• Genetic Markers: 5,742 genome-wide informative SNP markers.
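
The sampling scheme described above can be sketched as follows. This is one reading of the slide, with the assumptions stated plainly: the second subject is taken from the remaining members of the same family, singletons are split as evenly as possible, and the family and subject identifiers are invented for the example. `split_families` is a hypothetical helper, not code from the study.

```python
import random


def split_families(families, rng):
    """Build two datasets of unrelated subjects from {family_id: [subject_ids]}."""
    ds1, ds2, singletons = [], [], []
    for members in families.values():
        if len(members) == 1:
            singletons.append(members[0])
            continue
        first, second = rng.sample(members, 2)  # two distinct relatives
        ds1.append(first)
        ds2.append(second)
    # Assumption: singletons are shuffled and divided as evenly as possible.
    rng.shuffle(singletons)
    half = len(singletons) // 2
    ds1.extend(singletons[:half])
    ds2.extend(singletons[half:])
    return ds1, ds2


# Invented pedigree data for illustration only.
families = {
    "F1": ["a1", "a2", "a3"],
    "F2": ["b1", "b2"],
    "F3": ["c1"],
    "F4": ["d1"],
    "F5": ["e1", "e2", "e3", "e4"],
}

ds1, ds2 = split_families(families, random.Random(0))
print(sorted(ds1), sorted(ds2))
```

Because each family contributes at most one subject to each dataset, subjects within a dataset are unrelated, while the two datasets share families and so are replicate rather than fully independent samples.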

Predicting RA Status
Sun YV et al., 2007, in press

Challenges
• Validation, validation and validation!
“So far, comprehensive reviews of the published literature, most of which reports work based on the candidate-gene approach, have demonstrated a plethora of questionable genotype–phenotype associations, replication of which has often failed in independent studies.”

Computational Challenge
• 100K SNP data
• 500K SNP data
• 1M SNP data
• 3.1M HapMap SNPs (Nature, Oct. 2007)
• And more: different types of genetic variation
• And more rare genetic variants
• And larger samples

If all a man has is a hammer, then every problem looks like a nail.

Acknowledgements
University of Michigan: Sharon Kardia, Lawrence Bielak, Patricia Peyser, Ji Zhu
Mayo Clinic: Stephen Turner, Patrick Sheedy, II
University of Texas: Eric Boerwinkle
