
A combining approach to statistical methods for p >> n problems

A combining approach to statistical methods for p >> n problems. Workshop on Statistical Genetics, Nov 9, 2004 at ISM. Shinto Eguchi. Microarray data. cDNA microarray. Prediction from gene expressions. Feature vector dimension = number of genes p


Presentation Transcript


  1. A combining approach to statistical methods for p >> n problems. Workshop on Statistical Genetics, Nov 9, 2004 at ISM. Shinto Eguchi

  2. Microarray data: cDNA microarray

  3. Prediction from gene expressions. Feature vector: dimension = number of genes p; components = quantities of gene expression. Class label: disease, adverse effect. Classification machine based on training dataset.

  4. Leukemic diseases, Golub et al http://www.broad.mit.edu/cgi-bin/cancer/publications/

  5. Web microarray data http://microarray.princeton.edu/oncology/ http://mgm.duke.edu/genome/dna micro/work/ p >> n

  6. Genomic data Genome Protein mRNA

  7. Problem: p >> n. Fundamental issue in bioinformatics. p is the dimension of the biomarker (SNPs, proteome, microarray, …); n is the number of individuals (informed consent, institutional protocol, … bioethics).

  8. Current paradigm Biomarker space SNPs Haplotype block (Fujisawa) Haplotype & adverse effects (Matsuura) Proteome Peak data reduction (Miyata) Model-based clustering Microarray Network gene model GroupBoost (Takenouchi)

  9. An approach by combining Let B be a biomarker space Let be K experimental facilities Rapid expansion of genomic data

  10. Bridge Study? CAMDA (Critical Assessment of Microarray Data Analysis) DDBJ (DNA Data Bank Japan, NIG) result …. …. ….

  11. CAMDA 2003 4 datasets for Lung Cancer http://www.camda.duke.edu/camda03/datasets/

  12. Some problems: 1. Heterogeneity in feature space (cDNA, Affymetrix). 2. Heterogeneous class-labeling (differences in covariates, medical diagnosis). 3. Heterogeneous generalization powers (uncertainty for microarray experiments). 4. Publication bias (a vast number of unpublished studies).

  13. Machine learning. Learnability: boosting weak learners? AdaBoost: Freund & Schapire (1997). Weak classifiers; a strong classifier; stagewise.

  14. AdaBoost

  15. One-gene classifier: a classifier based on the expressions of the j-th gene. [Figure: error numbers of the one-gene classifier]
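A one-gene classifier of this kind can be sketched as a decision stump on a single gene. The slide's own formula is not recoverable from the transcript, so the threshold form, the function names, and the toy expression values below are all assumptions made for illustration.

```python
def one_gene_classifier(j, threshold, sign=1):
    """Hypothetical stump: predict `sign` if gene j's expression exceeds
    `threshold`, otherwise `-sign` (labels are +1 / -1)."""
    def f(x):
        return sign if x[j] > threshold else -sign
    return f


def error_number(f, X, y, weights=None):
    """Sum of example weights over misclassified examples.
    With the default weight of 1 per example this is the plain error count."""
    if weights is None:
        weights = [1.0] * len(y)
    return sum(w for x, label, w in zip(X, y, weights) if f(x) != label)


# Toy data (made up): 6 samples, 2 genes each.
X = [[0.1, 2.0], [0.4, 1.5], [0.5, 0.2], [0.9, 1.8], [1.2, 0.3], [1.5, 0.1]]
y = [-1, -1, 1, -1, 1, 1]

stump = one_gene_classifier(j=0, threshold=0.45)
errors = error_number(stump, X, y)
print(errors)
```

Passing a non-uniform `weights` list gives the weighted error number that a boosting stage would use in place of the plain count.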

  16. The second training. Update the weight: weight up to 2, weight down to 0.5. [Figure: error numbers under the updated weights]
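The factors 2 and 0.5 are consistent with the standard AdaBoost reweighting. As a worked instance, assuming (the slide does not state this) that the first classifier had weighted error rate ε = 0.2 and that the factors are the multiplicative updates e^{±α}:

```latex
\alpha = \tfrac{1}{2}\log\frac{1-\epsilon}{\epsilon}
       = \tfrac{1}{2}\log\frac{0.8}{0.2}
       = \log 2,
\qquad
w_i \leftarrow w_i\, e^{\alpha} = 2\, w_i \quad (\text{misclassified}),
\qquad
w_i \leftarrow w_i\, e^{-\alpha} = 0.5\, w_i \quad (\text{correctly classified}).
```

Under this reading, each misclassified example counts double and each correctly classified example counts half when the error numbers are recomputed for the second training.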

  17. Learning algorithm Final machine
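The slide's formulas are lost in the transcript. As a minimal sketch, the following follows the standard discrete AdaBoost of Freund & Schapire with one-gene threshold classifiers as weak learners; the function names, the candidate thresholds (midpoints between observed values), and the toy data are assumptions, not the talk's own notation.

```python
import math


def stump(j, b, s):
    """One-gene classifier: s if x[j] > b, else -s (s is +1 or -1)."""
    return lambda x: s if x[j] > b else -s


def candidate_stumps(X):
    """All stumps with thresholds midway between consecutive observed values."""
    pool = []
    for j in range(len(X[0])):
        vals = sorted(set(x[j] for x in X))
        for lo, hi in zip(vals, vals[1:]):
            for s in (1, -1):
                pool.append(stump(j, (lo + hi) / 2, s))
    return pool


def adaboost(X, y, T):
    """Return the final machine as a list of (alpha_t, f_t) pairs."""
    n = len(y)
    w = [1.0 / n] * n                      # uniform initial weights
    pool = candidate_stumps(X)
    machine = []
    for _ in range(T):
        def err(f):                        # weighted error under current w
            return sum(wi for wi, xi, yi in zip(w, X, y) if f(xi) != yi)
        f = min(pool, key=err)             # best weak classifier this stage
        eps = min(max(err(f), 1e-10), 1 - 1e-10)   # guard against log(0)
        alpha = 0.5 * math.log((1 - eps) / eps)
        machine.append((alpha, f))
        # Up-weight errors, down-weight correct examples, then renormalize.
        w = [wi * math.exp(-alpha * yi * f(xi)) for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return machine


def predict(machine, x):
    """Final machine: sign of the alpha-weighted vote."""
    score = sum(alpha * f(x) for alpha, f in machine)
    return 1 if score > 0 else -1


# Toy one-gene data (made up); no single stump classifies it perfectly.
X = [[float(i)] for i in range(10)]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

machine = adaboost(X, y, T=3)
train_predictions = [predict(machine, x) for x in X]
```

On this toy data the best single stump misclassifies three of the ten examples, while the three-stage vote classifies all ten correctly, which illustrates how weak classifiers are combined stagewise into a stronger one.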

  18. Exponential loss Exponential loss Update :
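The equations on this slide did not survive extraction. For reference, the textbook form of the exponential loss and the stagewise update that AdaBoost performs is given below; the symbols (F for the score function, f_t for the weak classifier chosen at stage t, ε_t for its weighted error) are assumed notation and may differ from the talk's.

```latex
L_{\exp}(F) = \frac{1}{n}\sum_{i=1}^{n} \exp\{-y_i F(x_i)\},
\qquad
F_t = F_{t-1} + \alpha_t f_t,
\qquad
\alpha_t = \arg\min_{\alpha} L_{\exp}(F_{t-1} + \alpha f_t)
         = \tfrac{1}{2}\log\frac{1-\epsilon_t(f_t)}{\epsilon_t(f_t)},
```

where ε_t(f) is the error rate of f under weights w_t(i) proportional to exp{-y_i F_{t-1}(x_i)}.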

  19. Different datasets expression vector of the same genes label of the same clinical item Normalization: ∋

  20. Weighted Errors The k-th weighted error The combined weighted error
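The formulas for the two errors are missing from the transcript. The sketch below computes a per-dataset weighted error and then combines the K errors as a convex combination; the combination rule and its default weights (proportional to sample size) are an assumption made for illustration, not necessarily the definition used in the talk.

```python
def weighted_error(f, X, y, w):
    """Weighted error rate of classifier f on one dataset (weights normalized)."""
    total = sum(w)
    return sum(wi for xi, yi, wi in zip(X, y, w) if f(xi) != yi) / total


def combined_weighted_error(f, datasets, pi=None):
    """ASSUMED combination: sum_k pi_k * eps_k(f), with pi_k defaulting to n_k / n.
    `datasets` is a list of (X, y, w) triples, one per experimental facility."""
    errs = [weighted_error(f, X, y, w) for X, y, w in datasets]
    if pi is None:
        sizes = [len(y) for _, y, _ in datasets]
        pi = [nk / sum(sizes) for nk in sizes]
    return sum(p * e for p, e in zip(pi, errs)), errs


# Toy example (made up): one classifier evaluated on K = 2 datasets.
f = lambda x: 1 if x[0] > 0 else -1
data1 = ([[1.0], [-1.0], [2.0], [-2.0]], [1, -1, -1, -1], [1.0, 1.0, 2.0, 1.0])
data2 = ([[1.0], [-1.0]], [1, 1], [1.0, 1.0])

combined, per_dataset = combined_weighted_error(f, [data1, data2])
```

Here the classifier has weighted error 0.4 on the first dataset and 0.5 on the second; with size-proportional weights 4/6 and 2/6 the combined value is about 0.433.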

  21. BridgeBoost

  22. Learning Stage t : Stage t+1:

  23. Mean exponential loss Exponential loss Mean exponential loss Note: convexity of Expo-Loss
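One consequence of the convexity noted on this slide can be written out as follows. The notation F_1, …, F_K for score functions learned on the K datasets is assumed, and how exactly the talk uses the inequality is not recoverable from the transcript. By Jensen's inequality, for any example (x, y):

```latex
\exp\Bigl\{-y \cdot \frac{1}{K}\sum_{k=1}^{K} F_k(x)\Bigr\}
\;\le\;
\frac{1}{K}\sum_{k=1}^{K} \exp\{-y F_k(x)\},
\qquad\text{hence}\qquad
L_{\exp}\Bigl(\frac{1}{K}\sum_{k=1}^{K} F_k\Bigr)
\;\le\;
\frac{1}{K}\sum_{k=1}^{K} L_{\exp}(F_k).
```

In words, the exponential loss of the averaged score never exceeds the mean of the individual exponential losses evaluated on the same data.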

  24. Meta-learning. Meta-learning; separate learning

  25. Simulation 3 datasets data1, data2 Test error 0 (ideal) data3 Training error Test error Test error 0.5 (ideal) Collapsed dataset

  26. Comparison Test error Training error Training error Test error Separate AdaBoost BridgeBoost

  27. Test errors Collapsed AdaBoost BridgeBoost Separate AdaBoost Min = 43% Min = 3% Min = 4% Min = 15% Min = 4%

  28. Conclusion …. …. …. result Separate learning Meta-learning

  29. Unsolved problems: 1. Which dataset should be joined or deleted in BridgeBoost? 2. Prediction of the class label for a given new x? 3. How to use the information on the unmatched genes when combining datasets? 4. Heterogeneity is OK, but publication bias?

  30. Publication bias? Passive smokers vs lung cancer (Copas & Shi, 2001) Mean and s.d. of 37 studies heterogeneity publication bias Funnel plot

  31. References
      [1] S. Eguchi and J. Copas. A class of logistic-type discriminant functions. Biometrika 89, 1-22 (2002).
      [2] N. Murata, T. Takenouchi, T. Kanamori and S. Eguchi. Information geometry of U-Boost and Bregman divergence. Neural Computation 16, 1437-1481 (2004).
      [3] T. Takenouchi and S. Eguchi. Robustifying AdaBoost by adding the naive error rate. Neural Computation 16, 767-787 (2004).
      [4] T. Takenouchi, M. Ushijima and S. Eguchi. GroupAdaBoost for selecting important genes. In preparation.
      [5] J. Copas and S. Eguchi. Local model uncertainty and incomplete data bias. ISM Research Memo. 884, July (2003).
      [6] J. Copas and S. Eguchi. Local sensitivity approximation for selectivity bias. J. Royal Statistical Society B 63, 871-895 (2001).
      [7] J. Copas and J.Q. Shi. Reanalysis of epidemiological evidence on lung cancer and passive smoking. British Medical Journal 7232, 417-418 (2000).
