Reducing false positives in molecular pattern recognition. Xijin Ge, Shuichi Tsutsumi, Hiroyuki Aburatani & Shuichi Iwata. Genome Science Division, Research Center for Advanced Science and Technology, The University of Tokyo.
Towards bedside application of DNA microarrays to cancer treatment • Hardware • Technologies • Cost • Knowledge accumulation • Software • Availability • Evaluation [Diagram: Hardware, Software, Knowledge accumulation, Testing, Needs/Market, Bedside application]
Algorithms for cancer classification • Support vector machine (SVM) • k-nearest neighbor (kNN) • Prototype matching (PM) • Artificial neural networks (ANN) • Weighted voting (WV) • Naive Bayes (NB) • ……
Our objective • Evaluate the reliability of existing algorithms • Testing for practical applications • Suggestions for improvement • How to use them?
Major result: SVM & kNN with false positive error rates >50%! [Figure: false positive error and false negative error for KNN, SVM, and PM]
How are classifiers tested? • Training: Acute Lymphoblastic Leukemia, ALL (N=27), and Acute Myeloid Leukemia, AML (N=11) (Golub et al, Science, 1999) • Classifiers: KNN, SVM, PM, etc. • “Independent” test: 83.3% accuracy • Implicit assumption: samples must be accurately diagnosed prior to classification (AML or ALL) • Question: is this a true measure of reliability?
What happens if we present the classifier with samples that are neither AML nor ALL?
Bedside reality • Diagnosis accuracy • Metastatic cancers • Novel subtypes absent from training dataset • “Complete” training dataset? • Important for particular patients • Important for progress in cancer diagnosis • …
How should classifiers be tested? [Diagram: KNN, SVM, PM, etc. trained on ALL (N=27) and AML (N=11); “Independent” test; False positives!]
“Null” test [Diagram: KNN, SVM, PM, etc. trained on Acute Lymphoblastic Leukemia, ALL (N=27), and Acute Myeloid Leukemia, AML (N=11), are shown “Strangers!”; outputs labeled AML, ALL (False positives!) and No comment!]
A benchmark dataset • Training (38 samples): 11 AML, 19 B-ALL, 8 T-ALL • Independent test (false negative; 34 samples): 14 AML, 19 B-ALL, 1 T-ALL • Null test (false positive): 239 samples (stomach, ovarian, liver, ……)
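The two error rates measured on this benchmark can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a classifier is any function that returns a class label or `None` for "no prediction", that a false positive is a null-test sample receiving any label, and that a false negative is an independent-test sample not assigned to its true class. `toy_classifier` and its cut-offs are hypothetical.

```python
from typing import Callable, Hashable, Optional, Sequence

Sample = Sequence[float]
Classifier = Callable[[Sample], Optional[Hashable]]  # None means "no prediction"


def false_positive_rate(classify: Classifier, null_samples: Sequence[Sample]) -> float:
    """Fraction of null-test samples (from no trained class) that still get a label."""
    calls = [classify(s) for s in null_samples]
    return sum(c is not None for c in calls) / len(calls)


def false_negative_rate(classify: Classifier, samples: Sequence[Sample],
                        labels: Sequence[Hashable]) -> float:
    """Fraction of positive-test samples not assigned to their true class."""
    misses = sum(classify(s) != y for s, y in zip(samples, labels))
    return misses / len(labels)


def toy_classifier(sample: Sample) -> Optional[str]:
    """Hypothetical one-gene rule, used only to exercise the two functions."""
    x = sample[0]
    if x > 0.8:
        return "AML"
    if x < 0.2:
        return "ALL"
    return None  # stays silent in between


null_set = [[0.5], [0.9], [0.4], [0.1]]   # "strangers": neither AML nor ALL
test_set = [[0.9], [0.1], [0.5]]
test_labels = ["AML", "ALL", "AML"]

fp = false_positive_rate(toy_classifier, null_set)
fn = false_negative_rate(toy_classifier, test_set, test_labels)
```

A classifier that always returns a label can never pass the null test, which is why the sketch treats "no prediction" as a first-class outcome.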
Leave-one-class-out cross validation (LOCO-CV) for the testing of false positives • Dataset of Su et al., Cancer Res., 2001 [Figure: samples by tumor class (OV, BR, PR, KI, LI, LU_S, BL, PA, LU_A, GA, CO), split into Training and Testing]
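Leave-one-class-out cross validation can be sketched as a loop that hides one whole class, trains on the rest, and counts every held-out sample that still receives a label. This is a sketch under assumptions: `train_centroid_with_reject` is a hypothetical stand-in trainer (nearest centroid with a Euclidean distance cut-off `max_dist`), not one of the classifiers evaluated in the slides.

```python
from collections import defaultdict
from math import dist


def train_centroid_with_reject(samples, labels, max_dist=1.0):
    """Hypothetical trainer: nearest centroid, silent beyond max_dist."""
    groups = defaultdict(list)
    for s, y in zip(samples, labels):
        groups[y].append(s)
    centroids = {y: [sum(col) / len(col) for col in zip(*rows)]
                 for y, rows in groups.items()}

    def classify(sample):
        label, d = min(((y, dist(sample, c)) for y, c in centroids.items()),
                       key=lambda pair: pair[1])
        return label if d <= max_dist else None

    return classify


def loco_cv_false_positive_rate(samples, labels, train):
    """Hold out each class in turn; any label given to a held-out sample is a false positive."""
    false_positives = 0
    for held_out in sorted(set(labels)):
        train_x = [s for s, y in zip(samples, labels) if y != held_out]
        train_y = [y for y in labels if y != held_out]
        test_x = [s for s, y in zip(samples, labels) if y == held_out]
        classify = train(train_x, train_y)
        false_positives += sum(classify(s) is not None for s in test_x)
    return false_positives / len(samples)


samples = [[0, 0], [0.2, 0.1], [5, 5], [5.1, 4.9], [10, 0], [9.9, 0.2]]
labels = ["A", "A", "B", "B", "C", "C"]

strict = loco_cv_false_positive_rate(samples, labels, train_centroid_with_reject)
lenient = loco_cv_false_positive_rate(
    samples, labels,
    lambda x, y: train_centroid_with_reject(x, y, max_dist=100.0))
```

With the strict cut-off every held-out class is left unlabeled; with the lenient one every held-out sample is forced into a wrong class.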
Algorithm evaluation • False negative: leave-one-sample-out cross validation; positive test • False positive: leave-one-class-out cross validation; null test
“Unsupervised” gene selection • One-vs-all: (1,0,0), (0,1,0), (0,0,1) • Cluster-and-select ─ Classification of genes for the classification of samples ─ Data-structure dependent
Cluster-and-select • Variation filter • Kruskal-Wallis H test • Dividing genes into M clusters (e.g. K-means clustering) • Selecting S genes from each cluster with the highest H value
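The four steps above can be sketched in plain Python. Several details are assumptions, since the slides do not give them: the variation filter is taken to be a minimum expression range (`min_range`), the H statistic omits the tie correction, genes are clustered on raw expression values, and the k-means routine is a bare-bones version with a fixed random seed.

```python
import random
from math import dist


def kruskal_wallis_h(values, labels):
    """H statistic of one gene across classes (average ranks for ties, no tie correction)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # average 1-based rank of the tie group
        i = j + 1
    n = len(values)
    rank_sums, counts = {}, {}
    for r, y in zip(ranks, labels):
        rank_sums[y] = rank_sums.get(y, 0.0) + r
        counts[y] = counts.get(y, 0) + 1
    total = sum(rank_sums[y] ** 2 / counts[y] for y in counts)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)


def kmeans(points, m, iters=20, seed=0):
    """Bare-bones k-means; returns a cluster index for each point."""
    centers = random.Random(seed).sample(points, m)
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(m), key=lambda c: dist(p, centers[c])) for p in points]
        for c in range(m):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(col) for col in zip(*members)]
    return assign


def cluster_and_select(expr, labels, m, s, min_range=1.0):
    """expr is genes x samples; returns sorted indices of the selected genes."""
    kept = [g for g, row in enumerate(expr) if max(row) - min(row) >= min_range]
    h = {g: kruskal_wallis_h(expr[g], labels) for g in kept}
    assign = kmeans([expr[g] for g in kept], m)
    selected = []
    for c in range(m):
        members = sorted((g for g, a in zip(kept, assign) if a == c),
                         key=lambda g: -h[g])
        selected.extend(members[:s])  # top-S genes by H within this cluster
    return sorted(selected)


labels = ["AML", "AML", "B-ALL", "B-ALL", "T-ALL", "T-ALL"]
expr = [
    [5.0, 5.0, 0.0, 0.0, 0.0, 0.0],
    [5.1, 4.9, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 5.0, 5.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 5.0, 5.0],
    [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],  # flat gene, dropped by the variation filter
]
selected = cluster_and_select(expr, labels, m=3, s=1)
```

Taking at most S genes per cluster is what limits redundancy: two genes with nearly the same profile tend to land in the same cluster, so only the one with the higher H survives. The result depends on the k-means initialization, so which genes are picked can change with the seed.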
Cluster-and-select vs. One-vs-All [Figure: expression patterns of selected genes across B-ALL, AML, and T-ALL; pattern labels shown: (1,0,0), (1,0,1), (1,1,0), (0,1,0), (0,1,1), (0,0,1)]
Brief description of algorithms • Support vector machine (SVM) • k-nearest neighbor (kNN) • Prototype matching (PM) • Artificial neural networks (ANN) • Weighted voting (WV) • Naive Bayes (NB) • ……
Support Vector Machine (SVM) • Multiple binary classifiers • Code: SvmFu (Ryan Rifkin): www.ai.mit.edu/projects/cbcl/ • No prediction zone
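One way to combine multiple binary classifiers with a no prediction zone is sketched below. The slides do not state the exact rule, so this is an assumption: each class has a one-vs-rest decision value, and a call is made only when exactly one value clears a cut-off. The function name, the `threshold` default, and the input format are hypothetical and are not the SvmFu interface.

```python
from typing import Dict, Optional


def call_with_no_prediction_zone(decision_values: Dict[str, float],
                                 threshold: float = 0.5) -> Optional[str]:
    """Return the single class whose decision value exceeds threshold, else None."""
    confident = [label for label, f in decision_values.items() if f > threshold]
    # Zero confident classes (sample sits in the zone) or several (conflict): no call.
    return confident[0] if len(confident) == 1 else None


clear_call = call_with_no_prediction_zone({"AML": 1.4, "B-ALL": -1.2, "T-ALL": -0.9})
in_zone = call_with_no_prediction_zone({"AML": 0.3, "B-ALL": -0.4, "T-ALL": -0.8})
conflict = call_with_no_prediction_zone({"AML": 1.1, "B-ALL": 0.9, "T-ALL": -1.0})
```

Widening the zone (a larger `threshold`) trades false positives for more samples left without a call.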
K-nearest neighbor (KNN) • Case-based reasoning • Simplicity
Prototype Matching • Nearest-centroid classification • “PAM”, Tibshirani et al, PNAS, 2002
Prototype Matching: modifications • Confidence: Pearson correlation coefficient > 0.2
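A minimal sketch of prototype matching with this confidence threshold follows. It assumes that each prototype is the per-gene mean of a class's training samples and that the confidence is the Pearson correlation between the sample and its best-matching prototype; the slide gives only the 0.2 cut-off, so the function names and these details are illustrative.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def build_prototypes(samples, labels):
    """One prototype per class: the per-gene mean of its training samples."""
    groups = defaultdict(list)
    for s, y in zip(samples, labels):
        groups[y].append(s)
    return {y: [mean(col) for col in zip(*rows)] for y, rows in groups.items()}


def prototype_match(sample, prototypes, min_corr=0.2):
    """Best-correlated class, or None when the correlation is not above min_corr."""
    label, r = max(((y, pearson(sample, p)) for y, p in prototypes.items()),
                   key=lambda pair: pair[1])
    return label if r > min_corr else None


train_x = [[9, 1, 1, 1], [7, 1, 1, 3], [1, 9, 1, 1], [1, 7, 3, 1]]
train_y = ["AML", "AML", "ALL", "ALL"]
prototypes = build_prototypes(train_x, train_y)

known = prototype_match([10, 2, 1, 2], prototypes)     # resembles the AML prototype
stranger = prototype_match([1, 1, 9, 1], prototypes)   # resembles neither prototype
```

The threshold is what lets the classifier decline to label a sample from a class it was never trained on.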
Results: Comparison of clustering algorithms [Figure: false positive error and false negative error for KNN, SVM, and PM]
Results: Comparison of gene selection methods • Different feature sets for different algorithms [Figure: panels for PM and SVM]
[Flowchart: Feature selection, global filter (variation filter; Kruskal-Wallis H test) → Feature selection, redundancy reduction (dividing genes into M clusters, e.g. K-means clustering; outlier detection; selecting S genes from each cluster with the highest H value) → Pattern recognition (PM) → Verification (leave-one-sample-out cross validation; leave-one-class-out cross validation; null test; positive test)]
Results on other datasets (Cluster-and-select + PM) • Leave-one-class-out cross validation
Two strategies of classification • Uniqueness: multi-class problems (tumor origin) • Differences: binary problems (metastasis vs. non-metastasis)
Discussion (2) How many genes should we use? [Figure: false positive error and false negative error for KNN, SVM, and PM]
Discussion (3) Which algorithm should we use? • Don’t fall in love with SVM! • Focus on the problem and always try other methods! [Figure: false positive vs. false negative for SVM, KNN, and PM; also labeled Sensitivity and Specificity]
Gene-prediction vs. tumor classification • Gene prediction: Algorithm Development (FGENES, GeneMark, Genie, MZEF, Morgan, Genescan, HMMgene); Algorithm Evaluation (Burset & Guigo, Genomics, 1996, …); “Meta-algorithm” (Murakami & Takagi, Bioinformatics, 1998; Rogic et al, Bioinformatics, 2002; Shah et al, Bioinformatics, 2003 (GeneComber)) • Tumor classification: Algorithm Development (SVM, PM, kNN, NB, WV); Algorithm Evaluation (Dudoit et al, JASA, 2002; Liu et al, GIW2002); “Meta-algorithm”
Conclusions • A benchmark dataset to evaluate algorithms, with a “null test” & “leave-one-class-out” cross validation • High false positive rates (>50%) for KNN & SVM with a small feature set • PM can be modified to achieve high specificity (~90%) • “Cluster-and-select” gene selection procedure
Thanks to: Hiroyuki Aburatani, Shuichi Tsutsumi, Shogo Yamamoto, Shingo Tsuji, Kunihiro Nishimura, Daisuke Komura, Makoto Kano, Shigeo Ihara, Naoko Nishikawa, Shuichi Iwata, Naohiro Shichijo, Jerome Piat, Todd R. Golub, Qing Guo, Jiang Fu, GIW reviewers, Yoshitaka Hippo, Shumpei Ishikawa, Akitake Mukasa, Yongxin Chen, Yingqiu Guo, other lab members
Supplementary information (Benchmark datasets, source code, …) www2.genome.rcast.u-tokyo.ac.jp/pm/
Additional data Xijin Ge
K-nearest neighbor (KNN) • Threshold >80% consistency
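The consistency threshold can be sketched as a kNN vote that is accepted only when more than 80% of the k nearest neighbors agree. This reading of "consistency" is an assumption, as are the Euclidean distance, the default `k`, and the function name.

```python
from collections import Counter
from math import dist


def knn_with_consistency(sample, train_x, train_y, k=5, min_consistency=0.8):
    """Majority label of the k nearest neighbors, or None if agreement is not above the threshold."""
    nearest = sorted(range(len(train_x)), key=lambda i: dist(sample, train_x[i]))[:k]
    votes = Counter(train_y[i] for i in nearest)
    label, count = votes.most_common(1)[0]
    return label if count / k > min_consistency else None


train_x = [[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0.5],
           [10, 10], [9, 10], [10, 9], [9, 9], [9.5, 9.5]]
train_y = ["A"] * 5 + ["B"] * 5

near_a = knn_with_consistency([0.4, 0.6], train_x, train_y)    # all 5 neighbors are A
between = knn_with_consistency([4.9, 4.9], train_x, train_y)   # 4 of 5 agree, not above 80%
```

Without the threshold, the second query would be labeled "A" by simple majority even though it lies far from both groups.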
Lymphoma dataset: Alizadeh et al, Nature, 2000 [Figure: samples by class (DLBCL, FL, CLL)]
SRBCT dataset: Khan et al, Nature Medicine, 2001 • Burkitt lymphomas (BL) • Neuroblastoma (NB) • Ewing family of tumors (EWS) • Rhabdomyosarcoma (RMS)
Dataset of Su et al, Cancer Res., 2001 [Figure: samples by tumor class (OV, BR, PR, KI, LI, LU_S, BL, PA, LU_A, GA, CO)]
[Figure: false positive vs. false negative]