Reducing false positives in molecular pattern recognition Xijin Ge, Shuichi Tsutsumi, Hiroyuki Aburatani & Shuichi Iwata Genome Science Division, Research Center for Advanced Science and Technology, The University of Tokyo
Towards bedside application of DNA microarrays to cancer treatment • Hardware ─ technologies, cost, knowledge accumulation • Software ─ availability, evaluation [Diagram: needs/market drive hardware and software development, which feed testing and knowledge accumulation toward bedside application]
Algorithms for cancer classification • Support vector machine (SVM) • k-nearest neighbor (kNN) • Prototype matching (PM) • Artificial neural networks (ANN) • Weighted voting (WV) • Naive Bayes (NB) • ……
Our objective • Evaluate the reliability of existing algorithms • Testing for practical applications • Suggestions for improvement • How to use them?
Major result: SVM & kNN with false positive error rates >50%! [Chart: false positive and false negative error rates for kNN, SVM, and PM]
How are classifiers tested? Golub et al., Science, 1999: kNN, SVM, PM, etc. trained on acute lymphoblastic leukemia ALL (N=27) vs. acute myeloid leukemia AML (N=11); 83.3% accuracy on an "independent" test set. Implicit assumption: samples must be accurately diagnosed prior to classification (AML or ALL). Question: is this a true measure of reliability?
What happens if we present the classifier with samples that are neither AML nor ALL?
Bedside reality • Diagnostic accuracy • Metastatic cancers • Novel subtypes absent from the training dataset • Is a "complete" training dataset ever possible? • Important for individual patients • Important for progress in cancer diagnosis • …
How should classifiers be tested? [Diagram: classifiers trained on ALL (N=27) and AML (N=11); at the "independent" test stage, kNN, SVM, PM, etc. produce false positives on samples from classes they have never seen]
"Null" test: present the classifier with "strangers". [Diagram: kNN, SVM, PM, etc. trained on acute lymphoblastic leukemia ALL (N=27) and acute myeloid leukemia AML (N=11); the correct response to a stranger is "no comment!", while any ALL or AML call is a false positive!]
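In code, the null test reduces to counting confident calls on strangers. A minimal sketch, assuming a classifier with a single-sample `predict(x)` method that returns `None` when it abstains (this interface is illustrative, not the talk's code):

```python
def null_test(classifier, X_null):
    """Fraction of 'stranger' samples that receive a confident call.

    classifier.predict(x) is assumed (hypothetical interface) to return
    a class label, or None when the classifier abstains. Every non-None
    answer on a null sample is a false positive.
    """
    calls = [classifier.predict(x) for x in X_null]
    return sum(c is not None for c in calls) / len(X_null)
```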
A benchmark dataset • Training (38 samples): 11 AML, 19 B-ALL, 8 T-ALL • Independent test, for false negatives (34 samples): 14 AML, 19 B-ALL, 1 T-ALL • Null test, for false positives: 239 samples (stomach, ovarian, liver, ……)
Leave-one-class-out cross validation (LOCO-CV) for the testing of false positives. [Diagram: dataset of Su et al., Cancer Res., 2001; each tissue class (OV, BR, PR, KI, LI, LU_S, BL, PA, LU_A, GA, CO) is held out for testing in turn while the remaining classes are used for training]
Algorithm evaluation • False negatives (positive test): leave-one-sample-out cross validation • False positives (null test): leave-one-class-out cross validation
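The false positive side of this scheme can be sketched as leave-one-class-out cross validation, assuming the same abstaining `predict(x)` interface as above (`make_classifier` is a hypothetical factory function, not from the talk):

```python
import numpy as np

def leave_one_class_out_cv(make_classifier, X, y):
    """LOCO-CV: each class plays the 'stranger' in turn.

    make_classifier(X_train, y_train) returns an object with the
    abstaining predict(x) interface sketched above. Any label assigned
    to a sample of the held-out class is counted as a false positive.
    """
    fp, total = 0, 0
    for held_out in np.unique(y):
        keep = y != held_out                      # drop the whole class
        clf = make_classifier(X[keep], y[keep])
        for x in X[~keep]:
            total += 1
            fp += clf.predict(x) is not None      # any call is an error
    return fp / total
```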
"Unsupervised" gene selection • One-vs-all: (1,0,0) (0,1,0) (0,0,1) • Cluster-and-select ─ classification of genes for the classification of samples ─ data-structure dependent
Cluster-and-select • Variation filter • Kruskal-Wallis H test • Divide genes into M clusters (e.g. K-means clustering) • Select the S genes with the highest H value from each cluster
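One possible rendering of these steps with SciPy and scikit-learn, assuming the variation filter has already been applied and that M and S (here `n_clusters` and `genes_per_cluster`) are free parameters chosen for illustration:

```python
import numpy as np
from scipy.stats import kruskal
from sklearn.cluster import KMeans

def cluster_and_select(X, y, n_clusters=10, genes_per_cluster=5):
    """Cluster-and-select gene selection (sketch).

    X: samples x genes expression matrix, y: class labels.
    1. Score each gene with the Kruskal-Wallis H statistic.
    2. Cluster gene expression profiles into n_clusters (K-means).
    3. Keep the genes_per_cluster highest-H genes from each cluster,
       so the selected set is informative but not redundant.
    """
    classes = np.unique(y)
    h = np.array([kruskal(*(X[y == c, g] for c in classes)).statistic
                  for g in range(X.shape[1])])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X.T)
    selected = []
    for k in range(n_clusters):
        members = np.flatnonzero(labels == k)
        top = members[np.argsort(h[members])[::-1][:genes_per_cluster]]
        selected.extend(top)
    return np.array(sorted(selected))
```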
[Diagram: cluster-and-select vs. one-vs-all gene selection on B-ALL, AML, and T-ALL; one-vs-all restricts gene expression patterns to (1,0,0), (0,1,0), (0,0,1), while cluster-and-select also admits mixed patterns such as (1,0,1), (1,1,0), (0,1,1)]
Brief description of algorithms • Support vector machine (SVM) • k-nearest neighbor (kNN) • Prototype matching (PM) • Artificial neural networks (ANN) • Weighted voting (WV) • Naive Bayes (NB) • ……
Support Vector Machine (SVM) • Multiple binary classifiers • Code: SvmFu (Ryan Rifkin, www.ai.mit.edu/projects/cbcl/) • "No prediction zone" for ambiguous samples
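One hedged reading of the multi-binary-SVM scheme with a rejection region, written with scikit-learn rather than SvmFu; the class, the linear kernel, and the `margin` parameter are illustrative assumptions, not details from the talk:

```python
import numpy as np
from sklearn.svm import SVC

class RejectingSVM:
    """One-vs-rest SVMs with a 'no prediction zone' (illustrative sketch).

    A label is assigned only when exactly one binary SVM claims the
    sample with a decision value above `margin`; otherwise the
    classifier abstains.
    """
    def __init__(self, margin=0.0):
        self.margin = margin

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.models_ = [SVC(kernel="linear").fit(X, (y == c).astype(int))
                        for c in self.classes_]
        return self

    def predict(self, x):
        scores = [m.decision_function([x])[0] for m in self.models_]
        winners = np.flatnonzero(np.array(scores) > self.margin)
        return self.classes_[winners[0]] if len(winners) == 1 else None
```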
K-nearest neighbor (KNN) • Case-based reasoning • Simplicity
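A minimal sketch of case-based kNN with a rejection rule; the >80% consistency threshold comes from the appendix slide, while the function name and the Euclidean metric are assumptions:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5, consistency=0.8):
    """kNN with a rejection rule (sketch).

    Returns the majority label of the k nearest neighbors only when
    more than `consistency` of them agree (the appendix slide uses a
    >80% consistency threshold); otherwise returns None.
    """
    dist = np.linalg.norm(X_train - x, axis=1)    # Euclidean distances
    votes = Counter(y_train[np.argsort(dist)[:k]])
    label, count = votes.most_common(1)[0]
    return label if count / k > consistency else None
```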
Prototype Matching • Nearest-centroid classification • cf. "PAM", Tibshirani et al., PNAS, 2002
Prototype Matching: modifications • Confidence threshold: Pearson correlation > 0.2
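A hedged reconstruction of prototype matching with the confidence cut-off; only the r > 0.2 threshold is from the slides, the rest of the interface is assumed:

```python
import numpy as np
from scipy.stats import pearsonr

class PrototypeMatcher:
    """Nearest-centroid 'prototype matching' with a confidence cut-off.

    Each class is represented by its mean expression profile; a sample
    is assigned to the best-correlated prototype only if the Pearson
    correlation exceeds `threshold` (the slide uses r > 0.2).
    A sketch, not the authors' exact code.
    """
    def __init__(self, threshold=0.2):
        self.threshold = threshold

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.prototypes_ = np.array([X[y == c].mean(axis=0)
                                     for c in self.classes_])
        return self

    def predict(self, x):
        r = np.array([pearsonr(p, x)[0] for p in self.prototypes_])
        best = int(np.argmax(r))
        return self.classes_[best] if r[best] > self.threshold else None
```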
Results: comparison of classification algorithms. [Chart: false positive and false negative error rates for kNN, SVM, and PM]
Results: comparison of gene selection methods. [Charts for PM and SVM] A different feature set works best for each algorithm.
[Flowchart of the full pipeline] Feature selection, global filter: variation filter, Kruskal-Wallis H test, outlier detection. Feature selection, redundancy reduction: divide genes into M clusters (e.g. K-means clustering) and select the S genes with the highest H value from each cluster. Pattern recognition: PM. Verification: leave-one-sample-out cross validation (positive test) and leave-one-class-out cross validation / null test.
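Read as code, the flowchart might compose the earlier sketches like this; the arrays here are synthetic placeholders, not the benchmark data:

```python
import numpy as np

# Synthetic stand-in data: 60 samples x 500 genes, three classes.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 500))
y_train = np.repeat(np.array(["AML", "B-ALL", "T-ALL"]), 20)
X_null = rng.normal(size=(40, 500))      # "strangers" with no true class

genes = cluster_and_select(X_train, y_train)             # feature selection
pm = PrototypeMatcher().fit(X_train[:, genes], y_train)  # pattern recognition

print("null-test FP rate:", null_test(pm, X_null[:, genes]))
print("LOCO-CV FP rate:",
      leave_one_class_out_cv(lambda X, y: PrototypeMatcher().fit(X, y),
                             X_train[:, genes], y_train))
```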
Results on other datasets (cluster-and-select + PM): leave-one-class-out cross validation
Two strategies of classification • Uniqueness: multi-class problems (e.g. tumor origin) • Differences: binary problems (e.g. metastasis vs. non-metastasis)
Discussion (2): How many genes should we use? [Chart: false positive and false negative error rates for kNN, SVM, and PM as the feature set grows]
Discussion (3): Which algorithm should we use? [Chart: sensitivity and specificity; SVM and kNN show high false positive rates, PM lower false negative rates] Don't fall in love with SVM! Focus on the problem and always try other methods!
Gene prediction vs. tumor classification • Gene prediction ─ algorithm development: FGENES, GeneMark, Genie, MZEF, Morgan, Genscan, HMMgene ─ algorithm evaluation: Burset & Guigo, Genomics, 1996; … ─ "meta-algorithms": Murakami & Takagi, Bioinformatics, 1998; Rogic et al., Bioinformatics, 2002; Shah et al., Bioinformatics, 2003 (GeneComber) • Tumor classification ─ algorithm development: SVM, PM, kNN, NB, WV ─ algorithm evaluation: Dudoit et al., JASA, 2002; Liu et al., GIW 2002 ─ "meta-algorithm": ?
Conclusions • A benchmark dataset to evaluate algorithms: "null test" & "leave-one-class-out" cross validation • High false positive rates (>50%) for kNN & SVM with small feature sets • PM can be modified to achieve high specificity (~90%) • A "cluster-and-select" gene selection procedure
Thanks to: Hiroyuki Aburatani, Shuichi Tsutsumi, Shogo Yamamoto, Shingo Tsuji, Kunihiro Nishimura, Daisuke Komura, Makoto Kano, Shigeo Ihara, Naoko Nishikawa, Shuichi Iwata, Naohiro Shichijo, Jerome Piat, Todd R. Golub, Qing Guo, Jiang Fu, Yoshitaka Hippo, Shumpei Ishikawa, Akitake Mukasa, Yongxin Chen, Yingqiu Guo, the GIW reviewers, and other lab members
Supplementary information (Benchmark datasets, source code, …) www2.genome.rcast.u-tokyo.ac.jp/pm/
Additional data Xijin Ge
K-nearest neighbor (KNN) • Rejection threshold: >80% consistency among the k nearest neighbors
Lymphoma dataset (Alizadeh et al., Nature, 2000) [Chart: samples from DLBCL, FL, and CLL]
SRBCT dataset (Khan et al., Nature Medicine, 2001): Burkitt lymphomas (BL), neuroblastoma (NB), Ewing family of tumors (EWS), rhabdomyosarcoma (RMS)
Dataset of Su et al. (Cancer Res., 2001) [Charts: false positive and false negative rates across tissue classes OV, BR, PR, KI, LI, LU_S, BL, PA, LU_A, GA, CO]