A gene expression analysis system for medical diagnosis. D. Maroulis, D. Iakovidis, S. Karkanis, I. Flaounas University of Athens Dept. of Informatics and Telecommunications. Objectives. A system to support medical diagnosis using molecular level information
A gene expression analysis system for medical diagnosis D. Maroulis, D. Iakovidis, S. Karkanis, I. Flaounas University of Athens Dept. of Informatics and Telecommunications
Objectives • A system to support medical diagnosis using molecular-level information • Efficient classification of pathological conditions into multiple classes • A user-friendly interface for physicians and biologists
DNA Microarrays • Microscope glass slides • Thousands of spots • Spot: cDNA part
DNA Microarrays Gene expression level (feature)
DNA Microarrays Gene expression vector (feature vector)
DNA Microarrays Gene expression matrix (data set)
Gene expression analysis tools • Image processing & analysis for microarray spot detection • Visualization & clustering for discovery of unknown classes of pathological conditions • Gene ranking for identification of differentially expressed marker genes • Supervised classification of gene expression vectors into known classes
Gene expression analysis tools • GeneClust (Do et al, 2000) • dChip (Li & Wong, 2001) • Clusfavor (Peterson, 2002) • Genesis (Sturn et al, 2002) • Snomad (Colantuoni et al, 2002) • Base (Saal et al, 2002) • TM4 Suite (Saeed et al, 2003) • RankGene (Yang et al, 2003) • Excavator (Xu et al, 2003) • KnowledgeEditor (Toyoda & Konagaya, 2003) • ArrayNorm (Pieler et al, 2004)
Today’s challenge • None of the existing tools takes into account the usability profile of a physician or a biologist • Such tools can hardly be used in everyday medical practice
Supervised approaches • Most well-known supervised approaches have been applied to the classification of gene expression vectors • Linear discriminant analysis • k-nearest neighbors • Parzen windows • Decision trees • Neural networks, etc. • Support Vector Machines (Brown et al, 2000; Furey et al, 2000; Ryu & Cho, 2000; Dudoit et al, 2002; Lu & Han, 2003; Aliferis et al, 2003)
Support Vector Machines • Robust binary classifiers • Not easily affected by the dimensionality of the feature vectors • SVM methods for classification into multiple classes • One vs one • One vs all • Directed Acyclic Graph (DAG) • Weston & Watkins • Crammer & Singer (Weston & Watkins, 1999; Platt, 2000; Yeang et al, 2001; Crammer & Singer, 2001; Hsu & Lin, 2002)
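As a rough illustration of the first two multiclass methods listed above, the sketch below enumerates the binary problems that one vs one and one vs all would hand to binary SVMs. The class names `w1` to `w4` are placeholders, not taken from the slides, and no SVM is trained here.

```python
from itertools import combinations

# Placeholder class labels (hypothetical; the slides use ω1 ... ωN).
classes = ["w1", "w2", "w3", "w4"]

# One vs one: one binary problem for every unordered pair of classes.
one_vs_one = list(combinations(classes, 2))

# One vs all: one binary problem per class, against all remaining classes.
one_vs_all = [(c, [o for o in classes if o != c]) for c in classes]

print(len(one_vs_one), "one-vs-one problems:", one_vs_one)
print(len(one_vs_all), "one-vs-all problems:", one_vs_all)
```

With four classes this gives six pairwise problems and four class-against-rest problems.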
About multiclass SVM classifiers • They all lead to comparable results • They utilize a common, constant set of genes as input in each SVM node • They assume that the various pathological conditions correspond to separable clusters in the same gene space (Hsu et al, 2002; Lee et al, 2003; Statnikov et al, 2004)
The proposed approach • We consider the fact that • Only a small subset of genes is differentially expressed for each type or subtype of a pathological condition • We propose • The combination of SVMs in a cascading architecture that embodies gene selection in its structure
Cascading architecture Diagnostic Unit Pre-processing Unit Classifies input vector x into ω1, ω2, …, ωN
Cascading architecture Diagnostic Unit Pre-processing Unit Poor quality cDNA targets generate missing values (Troyanskaya et al, 2001)
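The slide does not say how the system fills in missing values. One common choice is k-nearest-neighbour imputation, where a missing entry is estimated from the most similar genes. The sketch below is a simplified version under that assumption; the function name `knn_impute`, the distance measure and the toy data are illustrative only.

```python
import math

def knn_impute(matrix, k=2):
    """Fill None entries using the k most similar genes (rows).

    matrix: list of gene rows, each a list of per-sample values or None.
    Simplification: distance is the root mean squared difference over the
    columns that both rows have observed.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [row[:] for row in matrix]  # leave the input untouched
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbours must have an observed value in column j.
            candidates = [
                (distance(row, other), other[j])
                for m, other in enumerate(matrix)
                if m != i and other[j] is not None
            ]
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for d, v in candidates[:k] if d < math.inf]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled

# Toy data: gene 0 is missing its value in the third sample.
genes = [
    [1.0, 2.0, None],
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 3.2],
    [9.0, 9.0, 9.0],
]
imputed = knn_impute(genes, k=2)
print(imputed[0])
```

Here the missing entry is replaced by the average of the two closest genes' values in that sample, and the dissimilar fourth gene is ignored.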
Cascading architecture Diagnostic Unit Pre-processing Unit Normalization facilitates comparability of samples (Zhang & Shmulevich, 2002)
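The normalization method used by the system is not named on the slide. As a minimal sketch of why normalization makes samples comparable, the example below assumes a simple per-sample standardization (zero mean, unit variance); the function name and the values are hypothetical.

```python
import statistics

def standardize_sample(values):
    """Rescale one sample's expression levels to zero mean and unit variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

# Two hypothetical samples with the same profile, one scanned 10x brighter.
sample_a = [2.0, 4.0, 6.0]
sample_b = [20.0, 40.0, 60.0]

norm_a = standardize_sample(sample_a)
norm_b = standardize_sample(sample_b)
print(norm_a)
print(norm_b)
```

After standardization the two samples have the same values up to rounding, so a classifier sees the shared profile rather than the difference in overall intensity.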
Cascading architecture Diagnostic Unit Pre-processing Unit • A subset of genes is selected by ranking for each block • Three ranking criteria are available
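The three ranking criteria are not named on the slide. To illustrate what selecting genes by ranking can look like, the sketch below assumes one possible criterion, an absolute signal-to-noise ratio between a target class and all other samples. The function `rank_genes` and the toy data are hypothetical.

```python
import statistics

def rank_genes(matrix, labels, positive, top_n):
    """Return indices of the top_n genes by absolute signal-to-noise ratio.

    matrix: samples x genes; labels: class of each sample;
    positive: the class contrasted against all other samples.
    """
    scores = []
    for g in range(len(matrix[0])):
        pos = [row[g] for row, y in zip(matrix, labels) if y == positive]
        neg = [row[g] for row, y in zip(matrix, labels) if y != positive]
        gap = abs(statistics.fmean(pos) - statistics.fmean(neg))
        spread = statistics.pstdev(pos) + statistics.pstdev(neg)
        # Small constant avoids division by zero for constant genes.
        scores.append(gap / (spread + 1e-12))
    order = sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)
    return order[:top_n]

# Toy data: gene 0 separates the classes well, gene 2 does not.
matrix = [
    [5.0, 1.0, 0.2],
    [5.2, 1.1, 0.9],
    [1.0, 1.0, 0.3],
    [1.2, 0.9, 0.8],
]
labels = ["tumor", "tumor", "normal", "normal"]
selected = rank_genes(matrix, labels, positive="tumor", top_n=2)
print(selected)
```

Because the ranking is computed per target class, each block of the cascade can end up with a different gene subset.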
Cascading architecture The classification module Cj is autonomously trained using a subset Xj of the available training samples
Cascading architecture A standard binary SVM classifier implements each classification module
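The slides do not spell out how the modules are wired together. The sketch below assumes one plausible reading: module Cj separates class ωj from the classes that come after it, is trained only on the samples of those classes (the subset Xj), and looks only at its own gene subset. To stay dependency-free, a nearest-centroid rule stands in for the binary SVM; all names and data are hypothetical.

```python
import math

class CentroidModule:
    """Stand-in binary classifier (nearest centroid), NOT an SVM."""

    def __init__(self, gene_idx):
        self.gene_idx = gene_idx  # genes this module looks at

    def _pick(self, x):
        return [x[g] for g in self.gene_idx]

    def fit(self, X, is_target):
        def centroid(rows):
            return [sum(col) / len(rows) for col in zip(*rows)]
        self.pos = centroid([self._pick(x) for x, t in zip(X, is_target) if t])
        self.neg = centroid([self._pick(x) for x, t in zip(X, is_target) if not t])
        return self

    def predict(self, x):
        v = self._pick(x)
        return math.dist(v, self.pos) <= math.dist(v, self.neg)


def train_cascade(X, y, class_order, genes_per_module):
    """Train N-1 modules; module j separates class_order[j] from later classes."""
    modules = []
    for j, target in enumerate(class_order[:-1]):
        remaining = set(class_order[j:])
        # X_j: only samples whose class was not handled by an earlier module.
        subset = [(x, label) for x, label in zip(X, y) if label in remaining]
        Xj = [x for x, _ in subset]
        tj = [label == target for _, label in subset]
        modules.append(CentroidModule(genes_per_module[j]).fit(Xj, tj))
    return modules


def classify(x, modules, class_order):
    """Walk the cascade; the first module that accepts x decides its class."""
    for module, target in zip(modules, class_order):
        if module.predict(x):
            return target
    return class_order[-1]  # rejected by every module


# Toy data: gene 0 separates A from the rest, gene 1 separates B from C.
X = [[5.0, 0.0], [5.1, 3.0], [0.0, 4.0], [0.2, 4.2], [0.1, 0.0], [0.0, 0.3]]
y = ["A", "A", "B", "B", "C", "C"]
order = ["A", "B", "C"]
modules = train_cascade(X, y, order, genes_per_module=[[0], [1]])
print([classify(x, modules, order) for x in ([4.8, 2.0], [0.1, 3.9], [0.0, 0.1])])
```

Three classes need only two modules in this arrangement, and each module can use a different gene subset.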
Model selection • The best architecture is determined by leave-one-out cross validation • Selection bias is minimized • Gene selection and parameter tuning take place on the training samples only, within each leave-one-out iteration (Ambroise & McLachlan, 2002; Varma & Simon, 2006)
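The key point above is that gene selection happens inside every leave-one-out iteration, so the held-out sample never influences which genes are chosen. The sketch below illustrates that loop. The `fit_one_gene` step (pick the gene with the largest class-mean gap, then threshold on it) is a deliberately simple stand-in for the system's gene ranking and SVM training, and the data are made up.

```python
def fit_one_gene(X_train, y_train):
    """Select one gene on the training fold, then threshold on it.

    Stand-in for gene ranking + SVM training; assumes two classes, 0 and 1.
    """
    def class_mean(g, c):
        vals = [x[g] for x, y in zip(X_train, y_train) if y == c]
        return sum(vals) / len(vals)

    n_genes = len(X_train[0])
    best = max(range(n_genes),
               key=lambda g: abs(class_mean(g, 1) - class_mean(g, 0)))
    m0, m1 = class_mean(best, 0), class_mean(best, 1)
    threshold = (m0 + m1) / 2
    high_class = 1 if m1 > m0 else 0
    return lambda x: high_class if x[best] > threshold else 1 - high_class


def leave_one_out_error(X, y, fit):
    """Fraction of samples misclassified when each is held out in turn."""
    errors = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        # Selection and training see only the training fold, never sample i.
        model = fit(X_train, y_train)
        errors += model(X[i]) != y[i]
    return errors / len(X)


# Toy data: gene 0 carries the class signal, gene 1 is noise.
X = [[0.1, 2.0], [0.2, 2.1], [0.0, 1.9], [1.0, 2.0], [1.1, 2.1], [0.9, 1.9]]
y = [0, 0, 0, 1, 1, 1]
loo_error = leave_one_out_error(X, y, fit_one_gene)
print(loo_error)
```

Selecting genes once on the full data set and only then cross-validating would leak information from the held-out sample into the model, which is the selection bias the slide refers to.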
Results • Prostate cancer data • 112 samples (patients) • Classes • 62 primary prostate tumors • 41 normal prostate specimens • 9 pelvic lymph node metastases • 44016 gene expressions per sample • (Lapointe et al, 2004)
Results Minimum error 6.3% using 1 input gene
Results • Colon cancer dataset (Alon et al, 1999) • Minimum classification error 9.7% • Lung cancer dataset (Bhattacharjee et al, 2001) • Minimum classification error 1.5%
Conclusions • We presented a user-friendly system that implements a cascading SVM architecture • It aims at classifying gene expression data into known classes • The cascading architecture automatically tunes its parameters and determines its optimal configuration • In most cases it leads to a diagnostic accuracy that exceeds 90%
Conclusions • Its performance is usually better than that of the one-vs-one SVM combination method • It utilizes N-1 binary SVM classifiers, whereas one-vs-one utilizes N(N-1)/2 • It could be used in everyday clinical practice • Future work includes the adoption of incremental learning approaches
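To make the classifier counts above concrete, the short sketch below evaluates both formulas for a few values of N. The function names are illustrative only.

```python
def cascade_classifiers(n_classes):
    """Binary classifiers needed by the cascade: N - 1."""
    return n_classes - 1

def one_vs_one_classifiers(n_classes):
    """Binary classifiers needed by one-vs-one: N(N - 1) / 2."""
    return n_classes * (n_classes - 1) // 2

for n in (3, 5, 10):
    print(n, cascade_classifiers(n), one_vs_one_classifiers(n))
# prints:
# 3 2 3
# 5 4 10
# 10 9 45
```

For the three-class prostate data the difference is small (2 versus 3 classifiers), but it grows quadratically with the number of classes.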