A combining approach to statistical methods for p >> n problems

Workshop on Statistical Genetics, Nov 9, 2004 at ISM. Shinto Eguchi

Microarray data: cDNA microarray

Prediction from gene expressions. Feature vector: dimension = number of genes p; components = quantities of gene expression. Class label: disease, adverse effect. Classification machine based on a training dataset.

Leukemic diseases, Golub et al. http://www.broad.mit.edu/cgi-bin/cancer/publications/

Web microarray data http://microarray.princeton.edu/oncology/ http://mgm.duke.edu/genome/dna micro/work/ p >> n

Genomic data Genome Protein mRNA

Problem: p >> n. A fundamental issue in bioinformatics: p is the dimension of the biomarker (SNPs, proteome, microarray, …); n is the number of individuals (informed consent, institutional protocol, … bioethics).

Current paradigm: biomarker space
SNPs: haplotype block (Fujisawa); haplotype & adverse effects (Matsuura)
Proteome: peak data reduction (Miyata); model-based clustering
Microarray: network gene model; GroupBoost (Takenouchi)

An approach by combining. Let B be a biomarker space. Let there be K experimental facilities. Rapid expansion of genomic data.

Bridge Study? CAMDA (Critical Assessment of Microarray Data Analysis) DDBJ (DNA Data Bank Japan, NIG) result …. …. ….

CAMDA 2003 4 datasets for Lung Cancer http://www.camda.duke.edu/camda03/datasets/

Some problems
1. Heterogeneity in feature space: cDNA, Affymetrix
2. Heterogeneous class-labeling: differences in covariates, medical diagnosis
3. Heterogeneous generalization powers: uncertainty in microarray experiments
4. Publication bias: a vast number of unpublished studies

Machine learning. Learnability: boosting weak learners? AdaBoost: Freund & Schapire (1997). Weak classifiers are combined stagewise into a strong classifier.

AdaBoost

One-gene classifier. Let x_j be the expression of the j-th gene. [Figure: error numbers of one-gene classifiers]

The second training. Update the weight: weight up to 2 for misclassified examples, weight down to 0.5 for correctly classified examples. [Figure: weighted error numbers after the update]
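The factors 2 and 0.5 on this slide are what the usual AdaBoost reweighting gives when the first classifier has a weighted error rate of 0.2. The sketch below checks that arithmetic. The 10 examples with 2 errors are an assumption chosen for illustration, not the data from the slide, and the function name is hypothetical.

```python
import math

def adaboost_reweight(weights, misclassified):
    """One unnormalised AdaBoost reweighting step.

    weights       -- current example weights
    misclassified -- True where the current weak classifier is wrong
    Assumes the weighted error rate is strictly between 0 and 1.
    """
    total = sum(weights)
    eps = sum(w for w, m in zip(weights, misclassified) if m) / total
    alpha = 0.5 * math.log((1 - eps) / eps)
    # Misclassified examples are multiplied by exp(alpha), the rest by exp(-alpha).
    new_weights = [w * math.exp(alpha if m else -alpha)
                   for w, m in zip(weights, misclassified)]
    return alpha, new_weights

# Assumed toy setting: 10 equally weighted examples, 2 of them misclassified.
weights = [1.0] * 10
misclassified = [True, True] + [False] * 8
alpha, new_weights = adaboost_reweight(weights, misclassified)
```

With an error rate of 0.2, alpha equals log 2, so each misclassified example gets weight 2 and each correctly classified one gets weight 0.5.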

Learning algorithm Final machine
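The formulas on this slide did not survive, so the following is only a minimal sketch of textbook AdaBoost with one-gene threshold classifiers as weak learners, not necessarily the exact algorithm presented in the talk. The function names, the exhaustive threshold search, and the toy data are assumptions made for illustration.

```python
import math

def stump_predict(x, j, b, s):
    """One-gene classifier: predict s if gene j exceeds threshold b, else -s."""
    return s if x[j] > b else -s

def best_one_gene_stump(X, y, w):
    """Search all genes, observed thresholds and signs for the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for b in sorted({row[j] for row in X}):
            for s in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(row, j, b, s) != yi)
                if best is None or err < best[0]:
                    best = (err, j, b, s)
    return best

def adaboost(X, y, rounds=5):
    """Stagewise learning; returns the final machine as a list of (alpha, j, b, s)."""
    n = len(X)
    w = [1.0 / n] * n
    machine = []
    for _ in range(rounds):
        err, j, b, s = best_one_gene_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)   # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        machine.append((alpha, j, b, s))
        # Reweight: up-weight mistakes, down-weight correct examples, renormalise.
        w = [wi * math.exp(-alpha * yi * stump_predict(row, j, b, s))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return machine

def predict(machine, x):
    """Final machine: sign of the alpha-weighted vote of the one-gene classifiers."""
    score = sum(alpha * stump_predict(x, j, b, s) for alpha, j, b, s in machine)
    return 1 if score >= 0 else -1

# Assumed toy data: 4 individuals, 2 genes, labels in {-1, +1}.
X = [[0.1, 5.0], [0.2, 1.0], [0.8, 4.0], [0.9, 2.0]]
y = [-1, -1, 1, 1]
machine = adaboost(X, y)
predictions = [predict(machine, x) for x in X]
```

On this toy data the first gene separates the two classes, so the final machine reproduces the training labels. The exhaustive search is quadratic in the number of individuals per gene, which is acceptable for a sketch in the p >> n setting where n is small.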

Exponential loss. Update:
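The equations on this slide were lost in extraction. For orientation, the standard AdaBoost formulation is written out below. The notation (labels y_i in {-1, +1}, weak classifier f_t, combined score F_t, weights w_t, weighted error ε_t) is assumed here and may differ from the notation used in the talk.

```latex
\begin{align*}
L_{\exp}(F) &= \frac{1}{n} \sum_{i=1}^{n} \exp\{-y_i F(x_i)\},
  \qquad y_i \in \{-1, +1\}, \\
\epsilon_t &= \frac{\sum_{i=1}^{n} w_t(i)\, I\bigl(y_i \neq f_t(x_i)\bigr)}
                   {\sum_{i=1}^{n} w_t(i)}, \qquad
\alpha_t = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}, \\
F_t(x) &= F_{t-1}(x) + \alpha_t f_t(x), \qquad
w_{t+1}(i) \propto w_t(i) \exp\{-\alpha_t\, y_i f_t(x_i)\}.
\end{align*}
```

The coefficient α_t is the minimiser of the exponential loss of F_{t-1} + α f_t over α, which is why the update can be read as a stagewise minimisation of the exponential loss.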

Different datasets: expression vectors of the same genes; labels of the same clinical item. Normalization: ∋

Weighted Errors The k-th weighted error The combined weighted error
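The definitions on this slide are missing from the extracted text. As a hedged illustration only, the sketch below computes a weighted error within each dataset k and one possible combined version obtained by pooling the weights across datasets. The pooling rule, the function names, and the toy numbers are assumptions, and the talk may define the combined error differently.

```python
def weighted_error(preds, labels, weights):
    """Weighted error in a single dataset: weight of the mistakes / total weight."""
    wrong = sum(w for p, y, w in zip(preds, labels, weights) if p != y)
    return wrong / sum(weights)

def combined_weighted_error(datasets):
    """Assumed pooled version: mistakes and weights are summed over all K datasets.

    datasets -- list of (preds, labels, weights) triples, one per dataset
    """
    wrong = sum(w for preds, labels, weights in datasets
                for p, y, w in zip(preds, labels, weights) if p != y)
    total = sum(sum(weights) for _, _, weights in datasets)
    return wrong / total

# Assumed toy example with K = 2 datasets.
data1 = ([1, 1, -1, -1], [1, -1, -1, -1], [1.0, 1.0, 1.0, 1.0])
data2 = ([1, -1], [-1, -1], [2.0, 2.0])
err1 = weighted_error(*data1)
err2 = weighted_error(*data2)
err_combined = combined_weighted_error([data1, data2])
```

Under this pooling assumption the combined error is a weighted average of the per-dataset errors, with each dataset contributing in proportion to its total weight.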

BridgeBoost

Learning Stage t : Stage t+1:

Mean exponential loss. Exponential loss and mean exponential loss. Note: convexity of the exponential loss.
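One consequence of convexity, which may be what the note refers to, is Jensen's inequality: the exponential loss of an averaged score never exceeds the average of the individual exponential losses. The sketch below checks this numerically on made-up scores. The scores, labels, and function name are assumptions for illustration only.

```python
import math

def exp_loss(scores, labels):
    """Empirical exponential loss: mean of exp(-y * F(x))."""
    return sum(math.exp(-y * f) for f, y in zip(scores, labels)) / len(labels)

# Assumed toy scores from two separately learned machines on the same examples.
labels = [1, -1, 1, -1]
scores_a = [0.8, -0.2, -0.5, 0.3]
scores_b = [-0.1, -1.0, 0.9, -0.4]

# Loss of the averaged machine versus the mean of the two separate losses.
averaged = [(a + b) / 2 for a, b in zip(scores_a, scores_b)]
loss_of_average = exp_loss(averaged, labels)
average_of_losses = (exp_loss(scores_a, labels) + exp_loss(scores_b, labels)) / 2
```

Because exp is convex, `loss_of_average <= average_of_losses` holds for any choice of scores, not just these.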

Meta-learning: meta-learning versus separate learning

Simulation with 3 datasets. data1, data2: test error 0 (ideal). data3: test error 0.5 (ideal). [Figure: training error and test error for the collapsed dataset]

Comparison. [Figure: training error and test error for separate AdaBoost and for BridgeBoost]

Test errors. [Figure: test error curves for collapsed AdaBoost, BridgeBoost, and separate AdaBoost; panel minima as labelled: Min = 43%, Min = 3%, Min = 4%, Min = 15%, Min = 4%]

Conclusion …. …. …. result Separate learning Meta-learning

Unsolved problems
1. Which datasets should be joined or deleted in BridgeBoost?
2. How should the class label be predicted for a given new x?
3. How should the information on the unmatched genes be used when combining datasets?
4. Heterogeneity is OK, but what about publication bias?

Publication bias? Passive smokers vs lung cancer (Copas & Shi, 2001). Mean and s.d. of 37 studies: heterogeneity, publication bias. [Figure: funnel plot]

References
[1] A class of logistic-type discriminant functions. S. Eguchi and J. Copas. Biometrika 89, 1-22 (2002).
[2] Information geometry of U-Boost and Bregman divergence. N. Murata, T. Takenouchi, T. Kanamori and S. Eguchi. Neural Computation 16, 1437-1481 (2004).
[3] Robustifying AdaBoost by adding the naive error rate. T. Takenouchi and S. Eguchi. Neural Computation 16, 767-787 (2004).
[4] GroupAdaBoost for selecting important genes. T. Takenouchi, M. Ushijima and S. Eguchi. In preparation.
[5] Local model uncertainty and incomplete data bias. J. Copas and S. Eguchi. ISM Research Memo. 884, July (2003).
[6] Local sensitivity approximation for selectivity bias. J. Copas and S. Eguchi. J. Royal Statistical Society B 63, 871-895 (2001).
[7] Reanalysis of epidemiological evidence on lung cancer and passive smoking. J. Copas and J.Q. Shi. British Medical Journal 7232, 417-418 (2000).
