Recursive Partitioning And Its Applications in Genetic Studies Chin-Pei Tsai Assistant Professor Department of Applied Mathematics Providence University
OUTLINE
• Genetic data
• Example
• Basic ideas of recursive partitioning
• Applications in genetic studies
  - linkage analysis
  - association analysis
• Recursive-partitioning based tools for data analyses
Tree-based Analyses in Genetic Studies: Genetic Data
[Figure: pedigree of a nuclear family (father, mother, and offspring, with affected members marked), shown alongside the corresponding pedigree data columns.]
Genetic Data: Genotype
[Figure: the pedigree data columns extended with marker genotype columns, each genotype given as a pair of allele numbers.]
Application of Recursive Partitioning in Microarray Data (Zhang et al., PNAS, 2001) • Gene expression profiles of 2,000 genes in 22 normal and 40 colon cancer tissues • Purpose: to predict whether a new tissue is normal or cancerous
Automatically Selected Tree (by RTREE)
Node 1 (CT: 40, NT: 22), split on M26383 (> 60)
  Node 2 (CT: 0, NT: 14)
  Node 3 (CT: 40, NT: 8), split on R15447 (> 290)
    Node 4 (CT: 10, NT: 8), split on M28214 (> 770)
      Node 6 (CT: 10, NT: 1)
      Node 7 (CT: 0, NT: 7)
    Node 5 (CT: 30, NT: 0)
Concluding Remarks • The three genes, IL-8 (M26383), CANX (R15447) and RAB3B (M28214), were chosen from 2,000 genes. • Using three genes can achieve high classification accuracy. • These three genes are related to tumors.
Basic Ideas in Classification Trees • Splitting criterion: goodness of split = weighted sum of node impurities • Impurity functions: entropy
Tree Growing: For a binary outcome y = 0, 1, let p = proportion of (y = 1). Entropy: -p log(p) - (1-p) log(1-p), where 0 log(0) = 0.
[Figure: entropy plotted against p from 0 to 1, with its maximum at p = 1/2.]
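The entropy impurity above can be sketched in a few lines of Python. This is only an illustration: the function name `entropy` is ours, and natural logarithms are assumed because the value at p = 1/2 is then log(2), roughly .6931.

```python
import math


def entropy(p: float) -> float:
    """Entropy impurity of a node in which a proportion p has y = 1.

    Natural logarithms are an assumption of this sketch.
    """
    total = 0.0
    for q in (p, 1.0 - p):
        if q > 0.0:  # convention from the slide: 0 * log(0) = 0
            total -= q * math.log(q)
    return total


pure = entropy(0.0)       # a node containing only one class
balanced = entropy(0.5)   # a node split evenly between the classes
print(round(pure, 4), round(balanced, 4))
```

A pure node has impurity 0, and the impurity is largest when the two classes are equally frequent.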
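Example: 21 subjects (11 cancer, 10 normal). Each candidate split sends the subjects to a left and a right node; counts are given as cancer/normal, and the node impurity is the entropy.
Split by   Left node   Right node   Entropy (left)   Entropy (right)
Gender     10/9        1/1          .6918            .6931
Race       9/7         2/3          .6853            .6365
Smoked     9/1         2/9          .3251            .4741
Age        7/7         4/3          .6931            .6829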
Goodness of Split: s = p(L)i(L) + p(R)i(R), with entropy i(t) and weight p(t)
Split by   i(L)    i(R)    p(L)    p(R)    s
Gender     .6918   .6931   19/21   2/21    .6919
Race       .6853   .6365   16/21   5/21    .6737
Smoked     .3251   .4741   10/21   11/21   .4031
Age        .6931   .6829   14/21   7/21    .6897
No split: .6920
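As a rough check of the arithmetic, the goodness of split s = p(L)i(L) + p(R)i(R) can be recomputed from the cancer/normal counts of three of the candidate splits. The function names are ours, natural logarithms are assumed, and we assume the split with the smallest weighted impurity is the preferred one; results agree with the slide values only up to rounding.

```python
import math


def entropy(cancer: int, normal: int) -> float:
    """Entropy of a node given its class counts (natural log, 0*log(0) = 0)."""
    n = cancer + normal
    total = 0.0
    for count in (cancer, normal):
        if count > 0:
            q = count / n
            total -= q * math.log(q)
    return total


def goodness_of_split(left, right):
    """Weighted sum of daughter-node impurities: p(L)i(L) + p(R)i(R)."""
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    return (n_left / n) * entropy(*left) + (n_right / n) * entropy(*right)


# (cancer, normal) counts in the left and right nodes, read off the example.
candidate_splits = {
    "Gender": ((10, 9), (1, 1)),
    "Smoked": ((9, 1), (2, 9)),
    "Age": ((7, 7), (4, 3)),
}

scores = {name: goodness_of_split(left, right)
          for name, (left, right) in candidate_splits.items()}
best = min(scores, key=scores.get)  # assumption: smaller impurity is better
no_split = entropy(11, 10)          # impurity of the unsplit root node

for name, s in scores.items():
    print(f"{name}: s = {s:.4f}")
print("best split:", best)
```

Under these assumptions the split on Smoked gives by far the largest reduction relative to the unsplit node.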
Tree Pruning • Fisher Exact Test • Misclassification cost and rate • Cost-complexity and complexity parameter • Optimal sub-trees
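To make the cost-complexity idea concrete, here is a minimal sketch using the standard CART definition from Breiman et al. (1984): the cost of a subtree T is R(T) + alpha * (number of leaves), where alpha is the complexity parameter. As an illustration only, R(T) is taken to be the number of learning-sample tissues misclassified when each leaf predicts its majority class, with leaf counts read from the RTREE colon tree; the subtree labels and function names are ours, and this is not necessarily how RTREE itself prunes.

```python
# Leaves of each nested subtree of the colon tree, as (CT, NT) counts.
subtrees = {
    "root only": [(40, 22)],
    "split on M26383": [(0, 14), (40, 8)],
    "plus split on R15447": [(0, 14), (10, 8), (30, 0)],
    "full tree": [(0, 14), (10, 1), (0, 7), (30, 0)],
}


def misclassified(leaves):
    """Errors made when every leaf predicts its majority class."""
    return sum(min(ct, nt) for ct, nt in leaves)


def cost_complexity(leaves, alpha):
    """R(T) + alpha * |leaves|, with R(T) measured as an error count."""
    return misclassified(leaves) + alpha * len(leaves)


def best_subtree(alpha):
    """Subtree with the smallest cost-complexity for a given alpha."""
    return min(subtrees, key=lambda name: cost_complexity(subtrees[name], alpha))


for alpha in (0, 5, 15):
    print(alpha, best_subtree(alpha))
```

With alpha = 0 the largest tree is favoured; as alpha grows, smaller subtrees become optimal, which is what makes a sequence of optimal sub-trees possible.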
Key Idea in Tree-based Analysis: If a marker locus is close to a disease locus, then individuals from a given family who are phenotypically similar are expected to be more similar genotypically than chance alone would predict.
[Figure: pedigree highlighting a sib pair.]
Tree-based Linkage Analysis • Unit of observation: sib pair • Covariate: the expected IBD (identity by descent) sharing at each marker locus • The response variable y takes three possible values depending on whether none, one, or both sibs are affected, which we arbitrarily coded as 0, 1, and 2.
Identity by Descent (IBD): Genes (or alleles) inherited by relatives from the same ancestor. For two sibs, they can share at most one IBD gene from the father, and at most one from the mother. Thus, 0, 1, or 2 genes can be shared by two siblings.
[Figure: parents with genotypes 1/2 and 3/4. IBD=0: sibs 1/3 and 2/4. IBD=1: sibs 1/3 and 2/3. IBD=2: sibs 1/3 and 1/3.]
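The three cases above can be reproduced with a small sketch. It assumes the simplest situation: each sib's paternal and maternal alleles are known, and the parents carry four distinct alleles (here, hypothetically, father 1/2 and mother 3/4), so two alleles that look the same are also identical by descent. The function name `ibd_sharing` is ours. When the inheritance pattern is not fully determined, an expected IBD sharing is used instead, which this sketch does not compute.

```python
def ibd_sharing(sib1, sib2):
    """Number of alleles (0, 1, or 2) shared IBD by two sibs.

    Each sib is a (paternal_allele, maternal_allele) tuple. Assumes phase
    is known and the four parental alleles are all distinct.
    """
    paternal_shared = int(sib1[0] == sib2[0])  # at most one from the father
    maternal_shared = int(sib1[1] == sib2[1])  # at most one from the mother
    return paternal_shared + maternal_shared


# Expected IBD count -> (sib 1, sib 2), with a hypothetical father 1/2
# and mother 3/4.
examples = {
    0: ((1, 3), (2, 4)),
    1: ((1, 3), (2, 3)),
    2: ((1, 3), (1, 3)),
}

for expected, (sib1, sib2) in examples.items():
    print(sib1, sib2, "IBD =", ibd_sharing(sib1, sib2))
```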
The Gilles de la Tourette Syndrome (GTS) Phenotype data (Joint work with Zhang et al., 2002) • Genome scan of the hoarding phenotype collected by the Tourette Syndrome Association International Consortium for Genetics (TSAICG) • Hoarding is a component of obsessive-compulsive disorder. • We used data from 223 individuals in 51 families with 77 sib pairs. • Genotypes are allele sizes from 370 markers on 22 chromosomes.
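The Gilles de la Tourette Syndrome Phenotype Data: Linkage Tree
Each node shows three sib-pair counts; each split is on the IBD sharing at a marker and is listed with its split p-value.
Root: 23 28 26, split on IBD sharing at D5SMfd154 > 1.9 (P = 0.0011)
  Node: 16 28 18, split on D5S408 > 0 (P = 0.0034)
    Node: 16 20 18, split on D4S1652 > 1.16 (P = 0.0078)
      Node: 6 17 14
      Node: 10 3 4
    Node: 0 8 0
  Node: 7 0 8
Overall p-value = 2.63e-6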
Tree-based Association Study • The response variable is affection status. • The covariates include gender, the parental phenotypes, race and the variables constructed using the marker information. • If a marker has n distinct alleles, then n covariates, each taking a value of 0, 1 or 2, are then constructed for this marker. For example, if n=7, then the 7 covariates take values (0,0,0,1,0,1,0) for a genotype of 4/6 and (0,0,0,0,0,0,2) for a genotype of 7/7.
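The allele-count covariates described above can be sketched as follows. The function name is ours, and the sketch assumes alleles are labelled 1 to n and that the genotype is fully observed; missing genotypes are not handled.

```python
def allele_count_covariates(genotype, n_alleles):
    """Return n covariates, one per allele, each counting copies (0, 1, or 2).

    genotype is a pair of allele labels in 1..n_alleles (an assumption of
    this sketch).
    """
    counts = [0] * n_alleles
    for allele in genotype:
        counts[allele - 1] += 1  # allele k is stored at position k - 1
    return tuple(counts)


heterozygous = allele_count_covariates((4, 6), 7)
homozygous = allele_count_covariates((7, 7), 7)
print(heterozygous)
print(homozygous)
```

For n = 7 this reproduces the two examples in the slide: genotype 4/6 has one copy each of alleles 4 and 6, and genotype 7/7 has two copies of allele 7.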
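The Gilles de la Tourette Syndrome Phenotype Data: Association Tree
Each node shows two subject counts; each split is on the number of copies of an allele and is listed with its split p-value.
Root: 85 135, split on copies of allele D4S403-5 > 0 (P = 2e-4)
  Node: 46 106, split on D5S816-7 > 0, NA (P = 0.0017)
    Node: 0 18
    Node: 46 88, split on D4S2431-10 > 1, NA (P = 0.016)
      Node: 0 11
      Node: 46 77, split on D4S2632-5 > 0 (P = 0.0023)
        Node: 19 54
        Node: 27 23
  Node: 39 29
Overall p-value = 1.03e-7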
Why Recursive Partitioning?
• Attempt to discover possibly very complex structure in huge databases
  - genotypes for hundreds of markers
  - expression profiles for thousands of genes
  - all possible predictors (continuous, categorical)
• No need to do transformation
• Impervious to outliers
• Easy to use
• Easy to interpret
Recursive partitioning based tools for data analysis
• Classification and regression
  - RTREE (http://peace.med.yale.edu)
  - CART
• Multivariate Adaptive Regression Splines
  - MASAL (http://peace.med.yale.edu)
  - MARS
• Longitudinal data analysis
  - MASAL (http://peace.med.yale.edu)
• Survival Analysis
  - STREE (http://peace.med.yale.edu)
References: Books
• L. Breiman, J. H. Friedman, R. A. Olshen and C. J. Stone, 1984, Classification and Regression Trees, Wadsworth, California.
• H. Zhang and B. Singer, 1999, Recursive Partitioning in the Health Sciences, Springer, New York.
• T. Hastie, R. Tibshirani and J. Friedman, 2001, The Elements of Statistical Learning, Springer, New York.
References: Papers
• Zhang, Tsai, Yu, and Bonney, 2001, Genetic Epidemiology, 21, Supplement 1, S317-S322.
• Zhang, Leckman, Pauls, Tsai, Kidd, Campos and The TSAICG, 2002, American Journal of Human Genetics, 70, 896-904.
• Zhang, Yu, Singer and Xiong, 2001, Proc Natl Acad Sci U S A, 98, 6730-6735.
• Tsai, Acharyya, Yu and Zhang, 2002, In Recent Research Developments in Human Genetic.
Recent Development
• Instability of Trees (high variance)
  - Bagging: averages many trees to reduce variance (Breiman, 1996)
  - Boosting (Breiman, 1998; Mason et al., 2000; Friedman et al., 1998)
  - Random forest (Breiman, 1999)
• Lack of Smoothness
  - MARS procedure (Zhang & Singer, 1999; Hastie et al., 2001)
• Difficulty in Capturing Additive Structure
  - MARS procedure
Competitive Tree for Colon Data
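Competitive Tree
Node 1 (CT: 40, NT: 22), three-way split on R87126
  Node 2 (CT: 34, NT: 3)
    Node 5 (CT: 0, NT: 3)
    Node 6 (CT: 34, NT: 0)
  Node 3 (CT: 6, NT: 13), R87126 in (372, 1052]
    Node 7 (CT: 0, NT: 13)
    Node 8 (CT: 6, NT: 0)
  Node 4 (CT: 0, NT: 6), R87126 > 1052
Second-level splits use T62947 and X15183, with cut-offs > 457 and > 28.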