
Reducing false positives in molecular pattern recognition


Presentation Transcript


  1. Reducing false positives in molecular pattern recognition Xijin Ge, Shuichi Tsutsumi, Hiroyuki Aburatani & Shuichi Iwata Genome Science Division Research Center for Advanced Science and Technology The University of Tokyo

  2. Towards bedside application of DNA microarrays to cancer treatment • Hardware: technologies, cost • Knowledge accumulation • Software: availability, evaluation [Diagram: needs/market drive hardware, software, and knowledge accumulation through testing toward bedside application]

  3. Algorithms for cancer classification • Support vector machine (SVM) • k-nearest neighbor (kNN) • Prototype matching (PM) • Artificial neural networks (ANN) • Weighted voting (WV) • Naive Bayes (NB) • ……

  4. Algorithms for cancer classification • Support vector machine (SVM) • k-nearest neighbor (kNN) • Prototype matching (PM) • Artificial neural networks (ANN) • Weighted voting (WV) • Naive Bayes (NB) • ……

  5. Our objective • Evaluate the reliability of existing algorithms • Testing for practical applications • Suggestions for improvement • How to use them?

  6. Major result: SVM & kNN with false positive error rates >50%! [Charts: false positive and false negative error rates for kNN, SVM, and PM]

  7. How are classifiers tested? Golub et al, Science, 1999: Acute Lymphoblastic Leukemia, ALL (N=27) vs. Acute Myeloid Leukemia, AML (N=11); kNN, SVM, PM, etc. reach 83.3% accuracy on an “independent” test set. Implicit assumption: samples must be accurately diagnosed prior to classification (AML or ALL). Question: Is this a true measure of reliability?

  8. What happens if we present the classifier with samples that are neither AML nor ALL?

  9. Bedside reality • Diagnosis accuracy • Metastatic cancers • Novel subtypes absent from training dataset • Is any training dataset ever “complete”? • Important for particular patients • Important for progress in cancer diagnosis • …

  10. How should classifiers be tested? kNN, SVM, PM, etc., trained on ALL (N=27) and AML (N=11), can still produce false positives on an “independent” test set!

  11. The “null” test: present “strangers” (samples that are neither ALL nor AML) to a classifier trained on Acute Lymphoblastic Leukemia, ALL (N=27) and Acute Myeloid Leukemia, AML (N=11). Instead of answering “no comment”, kNN, SVM, PM, etc. produce false positives!

  12. A benchmark dataset • Training (38 samples): 11 AML, 19 B-ALL, 8 T-ALL • Independent test, for false negatives (34 samples): 14 AML, 19 B-ALL, 1 T-ALL • Null test, for false positives: 239 samples (stomach, ovarian, liver, ……)

  13. Leave-one-class-out cross validation (LOCO-CV) for the testing of false positives. Dataset of Su et al, Cancer Res., 2001. [Diagram: one tissue class (e.g. OV) is held out for testing while the remaining classes (BR, PR, KI, LI, LU_S, BL, PA, LU_A, GA, CO) are used for training]

  14. Algorithm evaluation • Positive test: leave-one-sample-out cross validation, measuring false negatives • Null test: leave-one-class-out cross validation, measuring false positives
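The two evaluation protocols on this slide can be sketched in Python; `clf` stands for any classifier with a reject option that returns `None` for “no prediction” (an illustrative interface, not the authors' code):

```python
import numpy as np

def leave_one_class_out_fp_rate(clf, X, y):
    """Null test: hold out each class in turn and count every confident
    prediction on the held-out "stranger" samples as a false positive."""
    fp = total = 0
    for c in np.unique(y):
        train, test = y != c, y == c
        clf.fit(X[train], y[train])
        fp += sum(p is not None for p in clf.predict(X[test]))
        total += int(test.sum())
    return fp / total

def leave_one_sample_out_fn_rate(clf, X, y):
    """Positive test: hold out each sample in turn; a rejection or a wrong
    label on a sample of a known class counts as a false negative."""
    fn = 0
    for i in range(len(y)):
        train = np.arange(len(y)) != i
        clf.fit(X[train], y[train])
        p = clf.predict(X[i:i + 1])[0]
        fn += int(p is None or p != y[i])
    return fn / len(y)
```

A maximally cautious classifier that always rejects would score a perfect null test (0% false positives) but fail every positive test, which is why both protocols are needed together.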

  15. “Unsupervised” gene selection • One-vs-all (1,0,0) (0,1,0) (0,0,1) • Cluster-and-select ─ Classification of genes for the classification of samples ─ Data-structure dependent

  16. Cluster-and-select: variation filter → Kruskal-Wallis H test → dividing genes into M clusters (e.g. K-means clustering) → selecting the S genes with the highest H value from each cluster
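A minimal sketch of the cluster-and-select pipeline above, assuming an expression matrix `X` (samples × genes) and class labels `y`; the variation-filter cutoff here is a placeholder, and the authors' implementation may differ in detail:

```python
import numpy as np
from scipy.stats import kruskal
from sklearn.cluster import KMeans

def cluster_and_select(X, y, M=10, S=5):
    """Select up to M*S genes: cluster genes, then take the S genes with
    the highest Kruskal-Wallis H statistic from each cluster."""
    # Variation filter: drop genes with low spread (cutoff is illustrative)
    idx = np.flatnonzero(X.std(axis=0) > np.median(X.std(axis=0)))
    Xf = X[:, idx]
    # Kruskal-Wallis H statistic per gene (non-parametric, multi-class)
    classes = np.unique(y)
    H = np.array([kruskal(*[Xf[y == c, g] for c in classes]).statistic
                  for g in range(Xf.shape[1])])
    # Divide genes into M clusters by their expression profiles (K-means)
    labels = KMeans(n_clusters=M, n_init=10, random_state=0).fit_predict(Xf.T)
    # From each cluster, keep the S genes with the highest H value
    selected = []
    for m in range(M):
        members = np.flatnonzero(labels == m)
        top = members[np.argsort(H[members])[::-1][:S]]
        selected.extend(idx[top])
    return np.array(selected)
```

Selecting per cluster rather than globally is what reduces redundancy: highly correlated genes land in the same cluster, so the final feature set covers distinct expression patterns.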

  17. Cluster-and-select vs. One-vs-All [Diagram: gene expression patterns across B-ALL, AML, and T-ALL; one-vs-all selects only genes with patterns (1,0,0), (0,1,0), (0,0,1), while cluster-and-select also captures patterns such as (1,0,1), (1,1,0), (0,1,1)]

  18. Brief description of algorithms • Support vector machine (SVM) • k-nearest neighbor (kNN) • Prototype matching (PM) • Artificial neural networks (ANN) • Weighted voting (WV) • Naive Bayes (NB) • ……

  19. Support Vector Machine (SVM) • Multi-class via multiple binary classifiers • Code: SvmFu (Ryan Rifkin): www.ai.mit.edu/projects/cbcl/ • No Prediction Zone
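The slide cites the SvmFu code; as a stand-in, a one-vs-all scheme with a simple no-prediction zone can be sketched with scikit-learn's `LinearSVC`: a sample is labeled only when exactly one binary classifier gives a positive margin, and is rejected otherwise.

```python
import numpy as np
from sklearn.svm import LinearSVC

def ova_svm_predict(X_train, y_train, X_test):
    """One-vs-all SVM with a no-prediction zone: reject a sample unless
    exactly one binary classifier votes positive."""
    classes = np.unique(y_train)
    # One binary SVM per class; columns are decision values for "is class c"
    scores = np.column_stack([
        LinearSVC(C=1.0, max_iter=10000)
        .fit(X_train, (y_train == c).astype(int))
        .decision_function(X_test)
        for c in classes
    ])
    preds = []
    for row in scores:
        positive = np.flatnonzero(row > 0)
        # Zero or multiple positive votes -> no prediction (None)
        preds.append(classes[positive[0]] if len(positive) == 1 else None)
    return preds
```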

  20. K-nearest neighbor (KNN) • Case-based reasoning • Simplicity
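A sketch of kNN with a rejection rule, using the >80% neighbour-consistency threshold mentioned in the appendix (slide 37); Euclidean distance is an assumption here:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5, consistency=0.8):
    """Predict the majority label of the k nearest neighbours, but only
    when more than `consistency` of them agree; otherwise reject (None)."""
    d = np.linalg.norm(X_train - x, axis=1)
    neighbours = y_train[np.argsort(d)[:k]]
    label, count = Counter(neighbours.tolist()).most_common(1)[0]
    return label if count / k > consistency else None
```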

  21. Prototype Matching Nearest-centroid classification “PAM”, Tibshirani et al, PNAS, 2002

  22. Prototype Matching: modifications • Confidence threshold: Pearson correlation > 0.2
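The modification on this slide (accept a match only when the Pearson correlation with the nearest class centroid exceeds 0.2) can be sketched as follows; details of the authors' implementation may differ:

```python
import numpy as np

def pm_predict(X_train, y_train, x, threshold=0.2):
    """Prototype matching with a confidence threshold: assign x to the
    class whose centroid it correlates with best, but reject (None) if
    that Pearson correlation does not exceed `threshold`."""
    classes = np.unique(y_train)
    centroids = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
    corrs = np.array([np.corrcoef(x, cen)[0, 1] for cen in centroids])
    best = corrs.argmax()
    return classes[best] if corrs[best] > threshold else None
```

The threshold is what gives PM its "null test" behaviour: a stranger sample that resembles no prototype falls below the cutoff and is left unclassified rather than forced into a class.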

  23. Results: comparison of classification algorithms [Charts: false positive and false negative error rates for kNN, SVM, and PM]

  24. Results: comparison of gene selection methods [Charts for PM and SVM] Different feature sets for different algorithms.

  25. The full procedure • Feature selection, global filter: variation filter → Kruskal-Wallis H test • Feature selection, redundancy reduction: dividing genes into M clusters (e.g. K-means clustering) → selecting the S genes with the highest H value from each cluster • Outlier detection • Pattern recognition: PM • Verification: positive test (leave-one-sample-out cross validation) and null test (leave-one-class-out cross validation)

  26. Results on other datasets (Cluster-and-select + PM): leave-one-class-out cross validation

  27. Discussion (1) Why do we see such a big difference?

  28. Two strategies of classification • Binary problems capture differences (e.g. metastasis vs. non-metastasis) • Multi-class problems capture uniqueness (e.g. tumor origin)

  29. Discussion (2) How many genes should we use? [Charts: false positive and false negative error rates for kNN, SVM, and PM as the number of genes varies]

  30. Discussion (3) Which algorithm should we use? [Chart: sensitivity (false negatives) vs. specificity (false positives) for kNN, SVM, and PM] Don't fall in love with SVM! Focus on the problem and always try other methods!

  31. Gene prediction vs. tumor classification • Gene prediction ─ Algorithm development: FGENES, GeneMark, Genie, MZEF, Morgan, Genscan, HMMgene ─ Algorithm evaluation: Burset & Guigo, Genomics, 1996; … ─ “Meta-algorithm”: Murakami & Takagi, Bioinformatics, 1998; Rogic et al, Bioinformatics, 2002; Shah et al, Bioinformatics, 2003 (GeneComber) • Tumor classification ─ Algorithm development: SVM, PM, kNN, NB, WV ─ Algorithm evaluation: Dudoit et al, JASA, 2002; Liu et al, GIW 2002 ─ “Meta-algorithm”: ?

  32. Conclusions • A benchmark dataset to evaluate algorithms: “null test” & “leave-one-class-out” cross validation • High false positive rates (>50%) for kNN & SVM with small feature sets • PM can be modified to achieve high specificity (~90%) • A “cluster-and-select” gene selection procedure

  33. Hiroyuki Aburatani Shuichi Tsutsumi Shogo Yamamoto Shingo Tsuji Kunihiro Nishimura Daisuke Komura Makoto Kano Shigeo Ihara Naoko Nishikawa Shuichi Iwata Naohiro Shichijo Jerome Piat Todd R. Golub Qing Guo Jiang Fu GIW reviewers Thanks to: Yoshitaka Hippo Shumpei Ishikawa Akitake Mukasa Yongxin Chen Yingqiu Guo Other lab members

  34. Supplementary information (Benchmark datasets, source code, …) www2.genome.rcast.u-tokyo.ac.jp/pm/

  35. Additional data Xijin Ge

  36. Support Vector Machine (SVM) • Multi-class via multiple binary classifiers • Code: SvmFu (Ryan Rifkin): www.ai.mit.edu/projects/cbcl/ • No Prediction Zone

  37. K-nearest neighbor (KNN) • Threshold >80% consistency

  38. Prototype Matching: modifications Confidence: Pearson Co. >0.2

  39. Lymphoma dataset: Alizadeh et al, Nature, 2000. Samples: DLBCL, FL, CLL

  40. SRBCT dataset: Khan et al, Nature Medicine, 2001 • Burkitt lymphomas (BL) • Neuroblastoma (NB) • Ewing family of tumors (EWS) • Rhabdomyosarcoma (RMS)

  41. Dataset of Su et al, Cancer Res., 2001. Samples: OV, BR, PR, KI, LI, LU_S, BL, PA, LU_A, GA, CO

  42. [Charts: false positive and false negative error rates]
