A combining approach to statistical methods for p >> n problems. Workshop on Statistical Genetics, Nov 9, 2004 at ISM. Shinto Eguchi. Microarray data. cDNA microarray. Prediction from gene expressions. Feature vector dimension = number of genes p
A combining approach to statistical methods for p >> n problems Workshop on Statistical Genetics, Nov 9, 2004 at ISM Shinto Eguchi
Microarray data: cDNA microarray
Prediction from gene expressions. Feature vector: dimension = number of genes p; components = quantities of gene expression. Class label: disease, adverse effect. Classification machine based on training dataset
Leukemic diseases, Golub et al http://www.broad.mit.edu/cgi-bin/cancer/publications/
Web microarray data http://microarray.princeton.edu/oncology/ http://mgm.duke.edu/genome/dna micro/work/ p >> n
Genomic data Genome Protein mRNA
Problem: p >> n. Fundamental issue in bioinformatics: p is the dimension of the biomarker (SNPs, proteome, microarray, …); n is the number of individuals (informed consent, institutional protocol, … bioethics)
Current paradigm Biomarker space SNPs Haplotype block (Fujisawa) Haplotype & adverse effects (Matsuura) Proteome Peak data reduction (Miyata) Model-based clustering Microarray Network gene model GroupBoost (Takenouchi)
An approach by combining. Let B be a biomarker space. Let there be K experimental facilities. Rapid expansion of genomic data
Bridge Study? CAMDA (Critical Assessment of Microarray Data Analysis), DDBJ (DNA Data Bank of Japan, NIG) result …. …. ….
CAMDA 2003 4 datasets for Lung Cancer http://www.camda.duke.edu/camda03/datasets/
Some problems: 1. Heterogeneity in feature space: cDNA, Affymetrix. 2. Heterogeneous class-labeling: differences in covariates, medical diagnosis. 3. Heterogeneous generalization powers: uncertainty for microarray experiments. 4. Publication bias: a vast number of unpublished studies
Machine learning. Learnability: boosting weak learners? AdaBoost: Freund & Schapire (1997). Weak classifiers are combined stagewise into a strong classifier
One-gene classifier. Let x_j be the expressions of the j-th gene. [Figure: error number of each candidate one-gene classifier: 4, 5, 6, 5, 6, 5, 6, 5, 5, 6, 5]
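A one-gene classifier can be read as a threshold rule on a single gene's expression, scored by how many training samples it gets wrong. The sketch below is only an illustration under that reading: the function names, the toy expressions, and the candidate thresholds are all made up here and are not taken from the slides.

```python
def stump_predict(x, threshold, sign=1):
    """Classify one expression value of a single gene as +1 or -1."""
    return sign if x > threshold else -sign


def error_number(xs, ys, threshold, sign=1):
    """Count the training samples that the one-gene classifier misclassifies."""
    return sum(1 for x, y in zip(xs, ys) if stump_predict(x, threshold, sign) != y)


# Toy expressions of one gene and their class labels (hypothetical data).
xs = [0.2, 0.5, 0.9, 1.4, 1.8, 2.3]
ys = [-1, -1, 1, -1, 1, 1]

# Candidate thresholds placed between neighbouring expression values.
thresholds = [0.35, 0.7, 1.15, 1.6, 2.05]
errors = [error_number(xs, ys, b) for b in thresholds]

# Pick the threshold with the smallest error number (ties go to the smaller threshold).
best_error, best_threshold = min(zip(errors, thresholds))
print(errors, best_error, best_threshold)
```

Scanning every gene and threshold in this way yields the pool of weak classifiers from which boosting picks one per stage.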
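The second training. Update the weight: weight up to 2 (misclassified samples), weight down to 0.5 (correctly classified samples). [Figure: weighted error number of each candidate classifier: 9, 7, 8.5, 9, 7, 5.5, 7.5, 4.5, 6, 8, 4]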
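The factors 2 and 0.5 are what the standard AdaBoost update gives when the selected classifier has weighted error 0.2. The sketch below checks this arithmetic; the sample size of 20 with 4 errors is an assumption chosen to make the numbers round, not a figure stated in the slides.

```python
import math


def adaboost_factors(weighted_error):
    """Return (up, down): multipliers for misclassified and correct samples."""
    alpha = 0.5 * math.log((1 - weighted_error) / weighted_error)
    return math.exp(alpha), math.exp(-alpha)


up, down = adaboost_factors(0.2)  # approximately 2 and 0.5

# Hypothetical sample: 20 points with unit weights, 4 of them misclassified.
n_wrong, n_right = 4, 16
new_wrong_weight = n_wrong * up
new_total_weight = n_wrong * up + n_right * down

# After the update the selected classifier has weighted error 1/2,
# so the next stage is pushed towards a different classifier.
new_error = new_wrong_weight / new_total_weight
print(up, down, new_wrong_weight, new_error)
```

Under this assumption the reweighted error number of the selected classifier becomes 4 × 2 = 8 out of a total weight of 16.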
Learning algorithm Final machine
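The formulas on this slide did not survive extraction. As a stand-in, here is a minimal sketch of textbook AdaBoost with one-gene classifiers, where the final machine is the sign of the weighted vote. The helper names, the toy one-gene dataset, and the candidate pool are illustrative assumptions.

```python
import math


def stump(j, b, s):
    """One-gene classifier: looks only at gene j, thresholded at b."""
    return lambda x: s if x[j] > b else -s


def weighted_error(f, X, y, w):
    return sum(wi for xi, yi, wi in zip(X, y, w) if f(xi) != yi) / sum(w)


def adaboost(X, y, candidates, rounds):
    w = [1.0 / len(X)] * len(X)
    machine = []  # list of (alpha_t, f_t)
    for _ in range(rounds):
        f = min(candidates, key=lambda g: weighted_error(g, X, y, w))
        eps = weighted_error(f, X, y, w)
        if eps >= 0.5:  # no candidate beats chance: stop
            break
        eps = max(eps, 1e-12)  # guard against an infinite alpha for a perfect stump
        alpha = 0.5 * math.log((1 - eps) / eps)
        machine.append((alpha, f))
        # Weight up the misclassified samples, weight down the correct ones.
        w = [wi * math.exp(-alpha * yi * f(xi)) for xi, yi, wi in zip(X, y, w)]
    return machine


def final_machine(machine, x):
    """Sign of the weighted vote of the selected weak classifiers."""
    return 1 if sum(alpha * f(x) for alpha, f in machine) > 0 else -1


def training_errors(machine, X, y):
    return sum(1 for xi, yi in zip(X, y) if final_machine(machine, xi) != yi)


# Toy data with a single gene: no single stump separates + + - - + +.
X = [(1,), (2,), (3,), (4,), (5,), (6,)]
y = [1, 1, -1, -1, 1, 1]
candidates = [stump(0, b, s) for b in (0.5, 2.5, 4.5) for s in (1, -1)]

errs1 = training_errors(adaboost(X, y, candidates, rounds=1), X, y)
errs3 = training_errors(adaboost(X, y, candidates, rounds=3), X, y)
print(errs1, errs3)
```

On this toy set every single stump makes at least two errors, while three boosting stages classify all six points correctly.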
Exponential loss. Update:
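In standard AdaBoost the update coefficient is the value that minimizes the exponential loss along the chosen classifier, alpha = 0.5 log((1 - eps) / eps), where eps is its weighted error. The sketch below checks this numerically on a grid; the value eps = 0.2 and the grid itself are illustrative choices.

```python
import math


def loss_along(alpha, eps):
    """Normalized exponential loss after adding alpha * f, where f has weighted error eps.

    Misclassified samples (total weight eps) are scaled by exp(alpha),
    correctly classified ones (total weight 1 - eps) by exp(-alpha).
    """
    return eps * math.exp(alpha) + (1 - eps) * math.exp(-alpha)


eps = 0.2
alpha_star = 0.5 * math.log((1 - eps) / eps)  # closed-form minimizer

# Brute-force check on a grid of step sizes from 0 to 3.
grid = [i / 1000 for i in range(3001)]
alpha_grid = min(grid, key=lambda a: loss_along(a, eps))

min_loss = loss_along(alpha_star, eps)  # equals 2 * sqrt(eps * (1 - eps))
print(alpha_star, alpha_grid, min_loss)
```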
Different datasets: expression vector of the same genes, label of the same clinical item. Normalization: ∋
Weighted Errors The k-th weighted error The combined weighted error
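The definitions on this slide were lost in extraction. The sketch below shows one possible reading, not the author's definition: the k-th weighted error is the usual normalized weighted error on dataset k, and the combined error pools the weights of all K datasets. The pooling rule, function names, and toy numbers are assumptions.

```python
def weighted_error(preds, labels, weights):
    """Normalized weighted error of one classifier on one dataset."""
    wrong = sum(w for p, y, w in zip(preds, labels, weights) if p != y)
    return wrong / sum(weights)


def combined_weighted_error(datasets):
    """Pooled weighted error over K datasets (assumed combination rule).

    datasets: list of (preds, labels, weights) triples, one per facility.
    """
    wrong = sum(
        w
        for preds, labels, weights in datasets
        for p, y, w in zip(preds, labels, weights)
        if p != y
    )
    total = sum(sum(weights) for _, _, weights in datasets)
    return wrong / total


# Two hypothetical datasets scored by the same classifier.
data1 = ([1, 1, -1, -1], [1, -1, -1, -1], [1.0, 1.0, 1.0, 1.0])
data2 = ([1, -1], [-1, -1], [2.0, 2.0])

err1 = weighted_error(*data1)
err2 = weighted_error(*data2)
err_all = combined_weighted_error([data1, data2])
print(err1, err2, err_all)
```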
Learning Stage t : Stage t+1:
Mean exponential loss Exponential loss Mean exponential loss Note: convexity of Expo-Loss
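The note on convexity can be illustrated with Jensen's inequality: the exponential loss of an averaged score never exceeds the average of the individual exponential losses. In the sketch below three score vectors stand in for machines trained on K = 3 datasets and evaluated on the same labelled points; this setup and all numbers are assumptions for illustration.

```python
import math


def exp_loss(scores, labels):
    """Empirical exponential loss: mean of exp(-y * F(x))."""
    return sum(math.exp(-y * s) for s, y in zip(scores, labels)) / len(labels)


labels = [1, -1, 1, -1]

# Hypothetical scores F_k(x_i) from K = 3 separately trained machines.
scores_by_dataset = [
    [0.8, -0.2, 0.1, -1.0],
    [0.3, -0.9, 1.2, 0.4],
    [-0.1, -0.5, 0.6, -0.7],
]

# Mean exponential loss: average of the K individual losses.
mean_loss = sum(exp_loss(s, labels) for s in scores_by_dataset) / len(scores_by_dataset)

# Exponential loss of the averaged machine.
avg_scores = [sum(col) / len(col) for col in zip(*scores_by_dataset)]
loss_of_avg = exp_loss(avg_scores, labels)

# Convexity of exp gives loss_of_avg <= mean_loss.
print(loss_of_avg, mean_loss)
```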
Meta-learning: meta-learning vs. separate learning
Simulation: 3 datasets. data1, data2: test error 0 (ideal). data3: training error, test error; test error 0.5 (ideal). Collapsed dataset
Comparison [Figure: training error and test error curves for separate AdaBoost and BridgeBoost]
Test errors: Collapsed AdaBoost, BridgeBoost, Separate AdaBoost. Min = 43%, Min = 3%, Min = 4%, Min = 15%, Min = 4%
Conclusion …. …. …. result Separate learning Meta-learning
Unsolved problems: 1. Which dataset should be joined or deleted in BridgeBoost? 2. Prediction of the class label for a given new x? 3. How to use the information on the unmatched genes when combining datasets? 4. Heterogeneity is OK, but publication bias?
Publication bias? Passive smokers vs lung cancer (Copas & Shi, 2001) Mean and s.d. of 37 studies heterogeneity publication bias Funnel plot
References
[1] A class of logistic-type discriminant functions. S. Eguchi and J. Copas. Biometrika 89, 1-22 (2002).
[2] Information geometry of U-Boost and Bregman divergence. N. Murata, T. Takenouchi, T. Kanamori and S. Eguchi. Neural Computation 16, 1437-1481 (2004).
[3] Robustifying AdaBoost by adding the naive error rate. T. Takenouchi and S. Eguchi. Neural Computation 16, 767-787 (2004).
[4] GroupAdaBoost for selecting important genes. T. Takenouchi, M. Ushijima and S. Eguchi. In preparation.
[5] Local model uncertainty and incomplete data bias. J. Copas and S. Eguchi. ISM Research Memo 884, July (2003).
[6] Local sensitivity approximation for selectivity bias. J. Copas and S. Eguchi. J. Royal Statistical Society B 63, 871-895 (2001).
[7] Reanalysis of epidemiological evidence on lung cancer and passive smoking. J. Copas and J.Q. Shi. British Medical Journal 7232, 417-418 (2000).