A Novel Framework to Elucidate Core Classes in a Dataset

A Novel Framework to Elucidate Core Classes in a Dataset Daniele Soria daniele.soria@nottingham.ac.uk G54DMT - Data Mining Techniques and Applications 9th May 2013

Outline • Aims and motivations • Framework • Clustering and consensus • Supervised learning • Validation on two clinical data sets • Conclusions and questions

Aims and motivations • Develop an original multi-step framework to extract the most representative classes from any dataset • Refine the phenotypic characterisation of breast cancer • Move the medical decision-making process from a single-technique approach to a multi-technique one • Guide clinicians in choosing the preferred and most powerful treatment, towards personalised healthcare

Framework (1) [Flowchart of the framework; decision branches labelled no / yes]

Framework (2) [Flowchart of the framework, continued; decision branches labelled yes / no]

Data pre-processing • Dealing with missing values • Homogeneous variables • Compute descriptive statistics

Clustering • Four different algorithms: • Hierarchical (HCA) • Fuzzy c-means (FCM) • K-means (KM) • Partitioning Around Medoids (PAM) • Methods run with the number of clusters varying from 2 to 20

Hierarchical method • Hierarchy of clusters • Represented with a tree (dendrogram) • Clusters obtained by cutting the dendrogram at a specific height
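A minimal sketch of cutting a hierarchy at a given height. The slides do not say which linkage was used, so single linkage and Euclidean distance are assumptions here, and `single_linkage` is a hypothetical helper name. Merging stops as soon as the closest pair of clusters is further apart than `height`, which is equivalent to cutting the dendrogram there.

```python
from math import dist

def single_linkage(points, height):
    """Agglomerative clustering; stop merging once the closest pair
    of clusters is further apart than `height` (the dendrogram cut)."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Single linkage (assumed): distance between the closest members.
        return min(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        d, i, j = min((linkage(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > height:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)

# Toy points, made up for illustration.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters = single_linkage(points, height=2.0)
```

A lower cut height yields more, smaller clusters; a higher one merges them.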

Hierarchical method (cont.) 6 clusters

FCM method • Minimisation of the objective function • X = {x1, x2, ..., xn}: n data points • V = {v1, v2, …, vc}: c cluster centres • U = (μij), n × c: fuzzy partition matrix • μij: membership of xi to vj • m: fuzziness index
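The objective function itself did not survive in this transcript. The sketch below assumes the standard fuzzy c-means objective, the sum over i and j of μij^m · ||xi - vj||², minimised by alternating the usual membership and centre updates. The function names, the fixed iteration count and the Euclidean distance are illustrative assumptions.

```python
from math import dist

def memberships(X, V, m):
    """mu_ij = 1 / sum_k (d_ij / d_ik) ** (2 / (m - 1))."""
    U = []
    for x in X:
        d = [max(dist(x, v), 1e-12) for v in V]  # guard against zero distance
        U.append([1.0 / sum((dj / dk) ** (2.0 / (m - 1.0)) for dk in d)
                  for dj in d])
    return U

def centres(X, U, m):
    """v_j = weighted mean of the points, with weights mu_ij ** m."""
    V = []
    for j in range(len(U[0])):
        w = [row[j] ** m for row in U]
        V.append(tuple(sum(wi * x[t] for wi, x in zip(w, X)) / sum(w)
                       for t in range(len(X[0]))))
    return V

def fcm(X, V, m=2.0, iters=50):
    """Alternate the two updates for a fixed number of iterations."""
    for _ in range(iters):
        V = centres(X, memberships(X, V, m), m)
    return memberships(X, V, m), V

# Toy data and initial centres, made up for illustration.
X = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
U, V = fcm(X, V=[(0.0, 0.0), (5.0, 5.0)])
```

Each row of `U` sums to 1, so a point can belong partly to several clusters; a hard partition can be obtained by taking the largest membership in each row.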

FCM method (cont.)

KM method • Minimisation of the objective function • ||xi - vj||: Euclidean distance between xi and vj • cj: data points in cluster j • vj: centre of cluster j
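A minimal sketch of k-means, assuming the usual objective (the sum of squared Euclidean distances between each point and the centre of its cluster) and Lloyd's alternating scheme. The initial centres are passed in explicitly so the example is deterministic; `kmeans` is a hypothetical helper name.

```python
from math import dist

def kmeans(X, V, iters=100):
    """Assign each point to its nearest centre, then move each centre
    to the mean of its points; stop when the centres no longer change."""
    for _ in range(iters):
        labels = [min(range(len(V)), key=lambda j: dist(x, V[j])) for x in X]
        new_V = []
        for j in range(len(V)):
            members = [x for x, l in zip(X, labels) if l == j]
            # Keep the old centre if a cluster ends up empty.
            new_V.append(tuple(sum(c) / len(members) for c in zip(*members))
                         if members else V[j])
        if new_V == V:
            break
        V = new_V
    return labels, V

# Toy data and initial centres, made up for illustration.
X = [(0, 0), (0, 2), (10, 0), (10, 2)]
labels, V = kmeans(X, V=[(0, 0), (10, 0)])
# labels == [0, 0, 1, 1]; V == [(0.0, 1.0), (10.0, 1.0)]
```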

KM method (cont.)

PAM method • Search for k representative objects (medoids) among the observations • Minimum sum of dissimilarities • k clusters are constructed by assigning each observation to the nearest medoid
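The criterion above can be illustrated on a tiny dataset by trying every set of k observations as medoids and keeping the one with the smallest total dissimilarity. This exhaustive search is not the PAM algorithm itself, which uses build and swap heuristics to avoid enumerating all subsets; it only shows what is being minimised. Euclidean distance as the dissimilarity and the name `best_medoids` are assumptions.

```python
from itertools import combinations
from math import dist

def best_medoids(X, k):
    """Exhaustively pick the k observations minimising the sum of
    dissimilarities from every point to its nearest medoid."""
    def cost(medoids):
        return sum(min(dist(x, X[m]) for m in medoids) for x in X)

    medoids = min(combinations(range(len(X)), k), key=cost)
    # Each observation is labelled with the index of its nearest medoid.
    labels = [min(medoids, key=lambda m: dist(x, X[m])) for x in X]
    return medoids, labels

# Toy data, made up for illustration.
X = [(0, 0), (0, 1), (0, 2), (10, 0), (10, 1), (10, 2)]
medoids, labels = best_medoids(X, k=2)
# medoids == (1, 4); labels == [1, 1, 1, 4, 4, 4]
```

Unlike k-means centres, medoids are always actual observations, which makes the method usable with any dissimilarity measure.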

PAM method (cont.)

If n is unknown… • Validity indices computation • Defined considering the data dispersion within and between clusters • According to decision rules, the best number of clusters may be selected
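The slides do not list the indices here, so the sketch below uses the Calinski-Harabasz index purely as an assumed example of an index built from within-cluster and between-cluster dispersion. Its decision rule is to prefer the partition with the largest value.

```python
def calinski_harabasz(X, labels):
    """(between-cluster dispersion / (k - 1)) divided by
    (within-cluster dispersion / (n - k)); larger is better.
    Undefined when k == 1 or the within-cluster dispersion is zero."""
    def mean(pts):
        return [sum(c) / len(pts) for c in zip(*pts)]

    def sq(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    n, ks = len(X), sorted(set(labels))
    grand = mean(X)
    within = between = 0.0
    for k in ks:
        pts = [x for x, l in zip(X, labels) if l == k]
        c = mean(pts)
        within += sum(sq(x, c) for x in pts)
        between += len(pts) * sq(c, grand)
    return (between / (len(ks) - 1)) / (within / (n - len(ks)))

# Toy data with a sensible and a poor 2-cluster partition (made up).
X = [(0, 0), (0, 2), (10, 0), (10, 2)]
good = calinski_harabasz(X, [0, 0, 1, 1])
poor = calinski_harabasz(X, [0, 1, 0, 1])
```

In practice the index would be computed for each candidate number of clusters and each algorithm, and the decision rule applied to the resulting curve.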

Validity Indices

Characterisation & agreement • Visualisation techniques • Biplots • Boxplots • Indices for assessing agreement • Cohen’s kappa (κ) • Rand
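Both agreement indices can be computed directly from two label vectors. The sketch below assumes the labels of the two partitions have already been aligned for Cohen's kappa (the Rand index does not need this, since it only compares pairs of points). The function names are illustrative.

```python
from collections import Counter
from itertools import combinations

def cohen_kappa(a, b):
    """(observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[c] / n * cb[c] / n for c in set(a) | set(b))
    return (po - pe) / (1 - pe)

def rand_index(a, b):
    """Fraction of point pairs on which the two partitions agree
    (both put the pair together, or both put it apart)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

# Toy partitions that differ on one point (made up).
a = [0, 0, 1, 1, 2, 2]
b = [0, 0, 1, 1, 2, 1]
kappa = cohen_kappa(a, b)   # 0.75, up to floating-point rounding
rand = rand_index(a, b)     # 0.8
```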

Definition of classes • Consensus clustering: • Align labels to have similar clusters named in the same way by different algorithms • Take into account points assigned to groups with the same label
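One way to sketch the two steps above: relabel each partition with the permutation of names that best matches a reference partition, then keep only the points on which every algorithm agrees. Trying all permutations is only practical for a handful of clusters, and both the alignment rule and the function names are assumptions rather than the procedure used in the talk.

```python
from itertools import permutations

def align(reference, labels):
    """Rename `labels` with the permutation that maximises
    point-wise agreement with `reference`."""
    names = sorted(set(reference) | set(labels))

    def agreement(perm):
        mapping = dict(zip(names, perm))
        return sum(r == mapping[l] for r, l in zip(reference, labels))

    mapping = dict(zip(names, max(permutations(names), key=agreement)))
    return [mapping[l] for l in labels]

def consensus(partitions):
    """Points labelled identically by every algorithm keep that label;
    the others are left unclassified (None)."""
    ref = partitions[0]
    aligned = [ref] + [align(ref, p) for p in partitions[1:]]
    return [votes[0] if len(set(votes)) == 1 else None
            for votes in zip(*aligned)]

# Toy outputs of three algorithms using different label names (made up).
km = [0, 0, 1, 1, 2, 2]
pam = [2, 2, 0, 0, 1, 1]
fcm = [1, 1, 2, 2, 0, 2]
core = consensus([km, pam, fcm])
# core == [0, 0, 1, 1, 2, None]
```

The `None` entries correspond to points that would be reported as not classified.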

Supervised learning (1) • Model-based classification for prediction of future cases • Aims • High quality prediction • Reduce number of variables (biomarkers) • Prefer ‘white-box’ prediction models

Supervised learning (2) • Different techniques: • C4.5 Decision Tree • Multi-Layer Perceptron Artificial Neural Network (MLP-ANN) • Naïve Bayes (NB) or Non-Parametric Bayesian Classifier (NPBC)

C4.5 classifier • Each attribute can be used to make a decision that splits the data into smaller subsets • Information gain that results from choosing an attribute for splitting the data • Attribute with highest information gain is the one used to make the decision
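A minimal sketch of the information-gain calculation for categorical attributes: the entropy of the class labels minus the weighted entropy of the subsets produced by the split. Quinlan's C4.5 additionally normalises this quantity into a gain ratio, which is omitted here. The attribute names and values are made up.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy before the split minus weighted entropy after it."""
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    after = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - after

# Hypothetical toy records: "er" separates the classes, "grade" does not.
rows = [{"er": "+", "grade": "low"}, {"er": "+", "grade": "high"},
        {"er": "-", "grade": "low"}, {"er": "-", "grade": "high"}]
labels = ["A", "A", "B", "B"]
best = max(["er", "grade"], key=lambda a: information_gain(rows, labels, a))
# best == "er"
```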

Multi-Layer Perceptron • Feed-forward ANN • Nonlinear activation function used by each neuron • Layers of hidden nodes connected with every other node in the following layer • Learning carried out through back-propagation
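A minimal sketch of one back-propagation step for a network with a single hidden layer. The sigmoid activation, the squared-error loss, the learning rate and the trick of passing the bias as a constant input of 1.0 are all assumptions made for illustration, as are the weights.

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def forward(x, W1, W2):
    """Feed-forward pass: inputs -> hidden layer -> single output."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    y = sigmoid(sum(w * hi for w, hi in zip(W2, h)))
    return h, y

def backprop_step(x, t, W1, W2, lr=0.5):
    """One gradient-descent update for the loss E = 0.5 * (y - t) ** 2."""
    h, y = forward(x, W1, W2)
    delta_out = (y - t) * y * (1 - y)                  # dE/dz at the output
    delta_hid = [delta_out * W2[j] * h[j] * (1 - h[j])  # error sent backwards
                 for j in range(len(h))]
    new_W2 = [W2[j] - lr * delta_out * h[j] for j in range(len(h))]
    new_W1 = [[W1[j][i] - lr * delta_hid[j] * x[i] for i in range(len(x))]
              for j in range(len(h))]
    return new_W1, new_W2

# Hypothetical input (last component is the bias), target and weights.
x, t = [1.0, 0.5, 1.0], 1.0
W1 = [[0.1, -0.2, 0.3], [0.4, 0.1, -0.1]]
W2 = [0.2, -0.3]
_, y_before = forward(x, W1, W2)
W1, W2 = backprop_step(x, t, W1, W2)
_, y_after = forward(x, W1, W2)
```

After the update the output moves towards the target; training repeats this step over many examples.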

Naïve Bayes classifier • Probabilistic classifier based on Bayes’ theorem • Good for multi-dimensional data • Common assumptions: • Independence of variables • Normality
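Under the two assumptions listed above, the classifier reduces to estimating a mean and variance per class and per variable, then picking the class with the highest posterior. The sketch below works in log space to avoid underflow; the small constant added to the variance and the toy data are illustrative choices.

```python
from math import log, pi
from statistics import mean, pvariance

def fit(X, y):
    """Per class: prior probability and (mean, variance) of each variable."""
    model = {}
    for c in set(y):
        rows = [x for x, l in zip(X, y) if l == c]
        stats = [(mean(col), pvariance(col) + 1e-9) for col in zip(*rows)]
        model[c] = (len(rows) / len(X), stats)
    return model

def predict(model, x):
    """Class maximising log prior + sum of per-variable normal log densities
    (the sum is where the independence assumption enters)."""
    def log_posterior(c):
        prior, stats = model[c]
        return log(prior) + sum(
            -0.5 * log(2 * pi * v) - (xi - m) ** 2 / (2 * v)
            for xi, (m, v) in zip(x, stats))
    return max(model, key=log_posterior)

# Toy two-variable data, made up for illustration.
X = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2), (5.0, 6.0), (5.2, 5.8), (4.8, 6.2)]
y = ["low", "low", "low", "high", "high", "high"]
model = fit(X, y)
```

For example, `predict(model, (1.1, 2.0))` falls in the "low" class and `predict(model, (5.1, 6.1))` in the "high" class.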

NPBC: ratio between areas (1) • Similar to Naïve Bayes • Useful for non-normal data • Based on ratio between areas under the histogram • The closer a data point is to the median, the higher the probability to belong to that specific class

NPBC: ratio between areas (2) [Figure: two histograms of ER, one for the case x < m and one for x > m, each marking the data point x and the median m] Soria et al. (2011): A 'Non-Parametric' Version of the Naïve Bayes Classifier. Knowledge-Based Systems, 24, 775-784
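The exact definition is given in Soria et al. (2011) and is not reproduced in these slides. The sketch below is only one reading of the two cases in the figure: for x below the median, the area of the histogram tail up to x is divided by the area up to the median, and symmetrically above it. Counting observations instead of integrating histogram bins is a further simplification, and `area_ratio` is a hypothetical name.

```python
from statistics import median

def area_ratio(x, values):
    """Assumed reading of the 'ratio between areas' score for one variable
    within one class: 1.0 at the median, shrinking towards the tails."""
    m = median(values)
    if x <= m:
        tail = sum(1 for v in values if v <= x)   # area from the minimum to x
        half = sum(1 for v in values if v <= m)   # area from the minimum to m
    else:
        tail = sum(1 for v in values if v >= x)   # area from x to the maximum
        half = sum(1 for v in values if v >= m)   # area from m to the maximum
    return tail / half

# Toy marker values for one class, made up for illustration.
values = [1, 2, 3, 4, 5, 6, 7]
scores = [area_ratio(x, values) for x in (1, 2, 4, 6)]
# scores == [0.25, 0.5, 1.0, 0.5]
```

As with Naïve Bayes, per-variable scores of this kind would then be combined across variables for each class, without requiring the data to be normally distributed.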

Characterisation of classes • Biplots • Boxplots • Relation with clinical information • Survival analysis

Validation of the framework (1) • Set of markers involved in breast cancer cell cycle regulation • 347 patients and 4 markers • Survival and grade available • K-means, PAM and Fuzzy C-means used • Validity indices for best number of clusters

KM validity indices

PAM validity indices

Optimum number of clusters

FCM validity indices

Optimum number of clusters

Agreement and consensus • Kappa index high for 3 groups classification • 3 common classes found (8.9% not classified) • Intermediate expression (class 1) • High expression (class 2) • Low expression (class 3)

Biplots of classes (1) °: Class 1 °: Class 2 °: Class 3

Biplots of classes (2) °: Class 1 °: Class 2 °: Class 3 °: N.C.

Boxplots of classes

Kaplan-Meier curves
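The curves on this slide are not recoverable from the transcript. As a reminder of how such a curve is built, the sketch below implements the Kaplan-Meier product-limit estimator: at each observed event time, the survival estimate is multiplied by one minus the fraction of at-risk patients who had the event. The follow-up times are made up.

```python
def kaplan_meier(times, events):
    """Return [(time, survival estimate)] at each observed event time.
    events[i] is 1 if the event was observed, 0 if the case is censored."""
    curve, survival = [], 1.0
    for t in sorted({u for u, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve

# Hypothetical follow-up times (e.g. months) and event indicators.
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
# Survival is about 0.8 after t=2, 0.6 after t=3 and 0.3 after t=5.
```

One curve per class, plotted together, allows the survival of the classes to be compared.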

Clinical information • High grade patients (poor prognosis) in classes 1 and 3 (worst survival) • Common classes group patients with similar outcome

C4.5 decision tree

Validation of the framework (2) • Patients entered into the Nottingham Tenovus Primary Breast Carcinoma Series between 1986 and 1998 • 1076 cases informative for all 25 biological markers • Clinical information (grade, size, age, survival, follow-up, etc.) available

Previous work: 4 groups

Breast Cancer [Diagram, Consensus Clustering (Soria et al., Computers in Biology and Medicine, 2010): tree splitting breast cancer by ER, luminal and basal CKs, HER2, PgR, p53, HER3 and HER4 status into the classes below] • Luminal A: Class 1 (18.8%) • Luminal N: Class 2 (14.2%) • Luminal B: Class 3 (7.4%) • Basal, p53 altered: Class 4 (7.6%) • Basal, p53 normal: Class 5 (6.4%) • HER2: Class 6 (7.2%) • Mixed Class (38.4%)

Conclusions • Original framework for identification of core classes • Formed by different logical steps • Validation over novel data sets • 3 classes (Low, Intermediate and High) • High marker levels for better survival • Discovery of novel cancer subtypes

Main references • L. Kaufman and P.J. Rousseeuw. Finding Groups in Data. Wiley Series in Probability and Mathematical Statistics, 1990. • A. Weingessel, et al. An Examination of Indexes for Determining the Number of Clusters in Binary Data Sets. Working Paper No. 29, 1999. • I.H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Publishers, 2005. • A.K. Jain and R.C. Dubes. Algorithms for Clustering Data. Prentice-Hall Advanced Reference Series, Prentice-Hall, Englewood Cliffs, NJ, USA, 1988. • A.K. Jain, M.N. Murty, and P.J. Flynn. Data clustering: A review. ACM Computing Surveys, 31(3):264–323, 1999. • P.F. Velleman and D.C. Hoaglin. Applications, Basics and Computing of Exploratory Data Analysis. Duxbury Press, Boston, Mass., 1981. • J. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, Los Altos, California, 1993. • S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, 2nd edition, 1998. • G. John and P. Langley. Estimating continuous distributions in Bayesian classifiers. Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, 1995.

Acknowledgements Prof JM Garibaldi, Dr J Bacardit Nottingham Breast Cancer Pathology RG Prof IO Ellis, Dr AR Green, Dr D Powe, Prof G Ball

Thank You! Contact: daniele.soria@nottingham.ac.uk
