A gene expression analysis system for medical diagnosis

D. Maroulis, D. Iakovidis, S. Karkanis, I. Flaounas • University of Athens, Dept. of Informatics and Telecommunications

Objectives • A system to support medical diagnosis using molecular level information • Efficient classification of pathological conditions into multiple classes • A user friendly interface for physicians and biologists

DNA Microarrays • Microscope glass slides • Thousands of spots • Each spot: part of a cDNA

DNA Microarrays Gene expression level (feature)

DNA Microarrays Gene expression vector (feature vector)

DNA Microarrays Gene expression matrix (data set)
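The three terms on the DNA Microarrays slides (expression level, expression vector, expression matrix) can be illustrated with a small sketch. It assumes samples are stored as rows and genes as columns, which the slides do not specify; the values, labels and helper names are invented for illustration.

```python
# Hypothetical toy data: 3 samples (rows) x 4 genes (columns).
# Each entry is one gene expression level (a feature).
expression_matrix = [
    [0.8, -1.2, 0.1, 2.3],   # sample 0
    [0.5, -0.9, 0.0, 2.1],   # sample 1
    [-1.1, 1.4, 0.2, -0.7],  # sample 2
]
labels = ["tumor", "tumor", "normal"]  # one known class per sample


def expression_vector(matrix, sample_index):
    """Return the gene expression vector (feature vector) of one sample."""
    return list(matrix[sample_index])


def gene_profile(matrix, gene_index):
    """Return the expression level of one gene across all samples."""
    return [row[gene_index] for row in matrix]
```

With this layout a classifier consumes rows, while gene ranking works on columns.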

Gene expression analysis tools • Image processing & analysis for microarray spot detection • Visualization & clustering for discovery of unknown classes of pathological conditions • Gene ranking for identification of differentially expressed marker genes • Supervised classification of gene expression vectors into known classes

Gene expression analysis tools • GeneClust Do et al, 2000 • dChip Li & Wong, 2001 • Clusfavor Peterson, 2002 • Genesis Sturn et al, 2002 • Snomad Colantuoni et al, 2002 • Base Saal et al, 2002 • TM4 Suite Saeed et al, 2003 • RankGene Yang et al, 2003 • Excavator Xu et al, 2003 • KnowledgeEditor Toyoda & Konagaya, 2003 • ArrayNorm Pieler et al, 2004

Today’s challenge • None of the existing tools takes into account the usability needs of a physician or a biologist • Such tools can hardly be used in everyday medical practice

Supervised approaches • Most known supervised approaches have been applied to classification of gene expression vectors • Linear discriminant analysis • k-nearest neighbors • Parzen windows • Decision trees • Neural networks, etc. • Support Vector Machines (Brown et al, 2000; Furey et al, 2000; Ryu & Cho, 2000; Dudoit et al, 2002; Lu & Han, 2003; Aliferis et al, 2003)

Support Vector Machines • Robust binary classifiers • Not easily affected by the dimensionality of the feature vectors • SVM methods for classification into multiple classes • One vs one • One vs all • Directed Acyclic Graph (DAG) • Weston & Watkins • Crammer & Singer (Weston & Watkins, 1999; Platt, 2000; Yeang et al, 2001; Crammer & Singer, 2001; Hsu & Lin, 2002)

About multiclass SVM classifiers • They all lead to comparable results • They utilize a common, constant set of genes as input in each SVM node • They assume that the various pathological conditions correspond to separable clusters in the same gene space (Hsu et al, 2002; Lee et al, 2003; Statnikov et al, 2004)

The proposed approach • We consider the fact that • Only a small subset of genes is differentially expressed for each type or subtype of a pathological condition • We propose • The combination of SVMs in a cascading architecture that embodies gene selection in its structure

Cascading architecture Diagnostic Unit Pre-processing Unit Classifies input vector x into one of the classes ω1, ω2, …, ωN

Cascading architecture Diagnostic Unit Pre-processing Unit Poor quality cDNA targets generate missing values (Troyanskaya et al, 2001)
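The slide does not say how the pre-processing unit fills missing values. As a minimal sketch, assuming the simplest strategy (replace each missing entry with the mean of the observed values of the same gene), with missing entries encoded as `None` and samples stored as rows:

```python
def impute_gene_means(matrix):
    """Fill missing values (None) with the per-gene (per-column) mean.

    This is an illustrative stand-in, not necessarily the method used
    by the presented system.
    """
    n_genes = len(matrix[0])
    filled = [list(row) for row in matrix]  # leave the input untouched
    for g in range(n_genes):
        observed = [row[g] for row in matrix if row[g] is not None]
        # Arbitrary fallback of 0.0 when a gene was never observed.
        mean = sum(observed) / len(observed) if observed else 0.0
        for row in filled:
            if row[g] is None:
                row[g] = mean
    return filled
```

More elaborate schemes, such as nearest-neighbour imputation, follow the same interface: a matrix with gaps goes in and a complete matrix comes out.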

Cascading architecture Diagnostic Unit Pre-processing Unit Normalization facilitates comparability of samples (Zhang & Shmulevich, 2002)
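The normalization method is not named on the slide. One common choice, shown here purely as an assumption, is to standardize each sample to zero mean and unit variance so that samples from different arrays share a scale:

```python
import math


def standardize_sample(vector):
    """Scale one gene expression vector to zero mean and unit variance.

    Illustrative only; the presented system may normalize differently.
    """
    mean = sum(vector) / len(vector)
    variance = sum((v - mean) ** 2 for v in vector) / len(vector)
    std = math.sqrt(variance)
    if std == 0.0:
        # A constant vector carries no contrast; map it to all zeros.
        return [0.0 for _ in vector]
    return [(v - mean) / std for v in vector]
```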

Cascading architecture Diagnostic Unit Pre-processing Unit • A subset of genes is selected by ranking for each block • Three ranking criteria are available

Gene ranking criteria
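The three criteria offered by the system are not listed in the text of this slide. As a hedged illustration of what a ranking criterion looks like, the sketch below scores each gene with a signal-to-noise ratio, |μ1 − μ2| / (σ1 + σ2), between one class and all remaining samples. The choice of this criterion and the function names are assumptions.

```python
import math


def _mean_std(values):
    """Mean and population standard deviation of a non-empty list."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return mean, std


def rank_genes_s2n(matrix, labels, positive):
    """Return gene indices sorted from most to least discriminative.

    Assumes both the `positive` class and the rest are non-empty.
    """
    scores = []
    for g in range(len(matrix[0])):
        pos = [row[g] for row, y in zip(matrix, labels) if y == positive]
        neg = [row[g] for row, y in zip(matrix, labels) if y != positive]
        m1, s1 = _mean_std(pos)
        m2, s2 = _mean_std(neg)
        spread = s1 + s2
        if spread > 0:
            scores.append(abs(m1 - m2) / spread)
        else:
            # No within-class spread: perfect separation or no signal.
            scores.append(float("inf") if m1 != m2 else 0.0)
    return sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)
```

Keeping only the first few indices of the returned list gives the gene subset fed to one block of the cascade.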

Cascading architecture The classification module Cj is autonomously trained using a subset Xj of the available training samples

Cascading architecture A standard binary SVM classifier implements each classification module
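The cascade can be sketched as a chain of N-1 binary modules, each with its own gene subset. The sketch assumes that module Cj separates class ωj from the classes that follow it, and that Xj holds the training samples of those classes only; the slides do not spell this out. To stay dependency-free, a nearest-centroid rule stands in for the binary SVM, so this is not the authors' classifier. All names are hypothetical.

```python
def centroid(rows):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


class CentroidModule:
    """Stand-in binary classifier (NOT an SVM) on a chosen gene subset."""

    def __init__(self, genes):
        self.genes = genes  # indices of the genes this module looks at

    def _project(self, x):
        return [x[g] for g in self.genes]

    def fit(self, X, is_target):
        pos = [self._project(x) for x, t in zip(X, is_target) if t]
        neg = [self._project(x) for x, t in zip(X, is_target) if not t]
        self.pos_c, self.neg_c = centroid(pos), centroid(neg)
        return self

    def predict(self, x):
        p = self._project(x)
        return sq_dist(p, self.pos_c) <= sq_dist(p, self.neg_c)


def train_cascade(X, y, class_order, genes_per_module):
    """Train N-1 modules; module j sees only classes class_order[j:]."""
    modules = []
    for j, target in enumerate(class_order[:-1]):
        remaining = set(class_order[j:])
        Xj = [x for x, c in zip(X, y) if c in remaining]
        tj = [c == target for c in y if c in remaining]
        modules.append(CentroidModule(genes_per_module[j]).fit(Xj, tj))
    return modules


def classify(modules, class_order, x):
    """Pass x down the cascade until one module claims it."""
    for module, target in zip(modules, class_order):
        if module.predict(x):
            return target
    return class_order[-1]  # nothing claimed it: last class
```

Swapping `CentroidModule` for any binary classifier with the same `fit` and `predict` methods leaves the cascade logic unchanged.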

Model selection • The best architecture is determined by leave-one-out cross-validation • Selection bias is minimized • Gene selection and parameter tuning take place on the training samples only, within each leave-one-out iteration (Ambroise & McLachlan, 2002; Varma & Simon, 2006)
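The point about selection bias can be made concrete with a short sketch: the held-out sample must never influence gene selection or tuning, so both happen inside the loop. The function `one_gene_centroid` is a hypothetical toy learner used only to exercise the loop; it is not the system's classifier.

```python
def leave_one_out_error(X, y, fit_predict):
    """Leave-one-out error rate.

    fit_predict(train_X, train_y, test_x) must perform ALL gene selection
    and parameter tuning using train_X / train_y only.
    """
    errors = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if fit_predict(train_X, train_y, X[i]) != y[i]:
            errors += 1
    return errors / len(X)


def one_gene_centroid(train_X, train_y, test_x):
    """Toy learner: pick one gene on the training fold, then assign the
    class whose mean on that gene is nearest to the test value."""
    classes = sorted(set(train_y))

    def class_mean(g, c):
        vals = [x[g] for x, lab in zip(train_X, train_y) if lab == c]
        return sum(vals) / len(vals)

    # Gene selection happens here, inside the fold, never on test_x's label.
    best = max(
        range(len(test_x)),
        key=lambda g: abs(class_mean(g, classes[0]) - class_mean(g, classes[-1])),
    )
    return min(classes, key=lambda c: abs(test_x[best] - class_mean(best, c)))
```

Ranking genes once on the full data set and then cross-validating would leak information from the held-out sample and give an optimistically low error.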

Graphical User Interface

Results • Prostate cancer data • 112 samples (patients) • Classes • 62 primary prostate tumors • 41 normal prostate specimens • 9 pelvic lymph node metastases • 44016 gene expressions per sample • (Lapointe et al, 2004)

Results Minimum error 6.3% using 1 input gene

Results • Colon cancer dataset (Alon et al, 1999) • Minimum classification error 9.7% • Lung cancer dataset (Bhattacharjee et al, 2001) • Minimum classification error 1.5%

Conclusions • We presented a user-friendly system that implements a cascading SVM architecture • It aims at classifying gene expression data into known classes • The cascading architecture automatically tunes its parameters and determines its optimal configuration • In most cases it leads to a diagnostic accuracy that exceeds 90%

Conclusions • Its performance is usually better than that of the one-vs-one SVM combination method • It utilizes N-1 binary SVM classifiers, whereas one-vs-one utilizes N(N-1)/2 • It could be used in everyday clinical practice • Future work includes the adoption of incremental learning approaches
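The classifier counts quoted above are easy to check for a given number of classes N. A small sketch with hypothetical function names:

```python
def cascade_classifier_count(n_classes):
    """Binary classifiers needed by the cascade: N - 1."""
    return n_classes - 1


def one_vs_one_classifier_count(n_classes):
    """Binary classifiers needed by one-vs-one: N(N - 1) / 2."""
    return n_classes * (n_classes - 1) // 2
```

For the three-class prostate data this gives 2 versus 3 classifiers, and the gap widens with N: at N = 5 the counts are 4 versus 10.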

Thank you
