A Novel Framework to Elucidate Core Classes in a Dataset Daniele Soria daniele.soria@nottingham.ac.uk G54DMT - Data Mining Techniques and Applications 9th May 2013
Outline • Aims and motivations • Framework • Clustering and consensus • Supervised learning • Validation on two clinical data sets • Conclusions and questions
Aims and motivations • Develop an original framework with multiple steps to extract the most representative classes from any dataset • Refine the phenotypic characterisation of breast cancer • Move the medical decision-making process from a single-technique approach to a multi-technique one • Guide clinicians in choosing the most suitable and most powerful treatment, towards personalised healthcare
Framework (1) [flowchart: first half of the framework, with yes/no decision branches]
Framework (2) [flowchart: second half of the framework, with yes/no decision branches]
Data pre-processing • Dealing with missing values • Homogeneous variables • Compute descriptive statistics
Clustering • Four different algorithms: • Hierarchical (HCA) • Fuzzy c-means (FCM) • K-means (KM) • Partitioning Around Medoids (PAM) • Methods run with the number of clusters varying from 2 to 20
Hierarchical method • Hierarchy of clusters • Represented with a tree (dendrogram) • Clusters obtained cutting the dendrogram at specific height
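The idea of cutting the dendrogram can be sketched in plain Python: merge the closest clusters until the desired number of groups remains. This is an illustrative single-linkage sketch, not the framework's implementation; the slides do not specify which linkage criterion was used.

```python
def single_linkage(points, k, dist):
    """Agglomerative single-linkage clustering: repeatedly merge the two
    closest clusters until k remain (i.e. cut the dendrogram at k groups)."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # closest pair of clusters = smallest minimum inter-point distance
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: min(dist(a, b)
                                      for a in clusters[ij[0]]
                                      for b in clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
    return clusters
```

With Euclidean distance and two well-separated groups of points, cutting at k = 2 recovers the two groups.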
Hierarchical method (cont.) [dendrogram cut to yield 6 clusters]
FCM method • Minimisation of the objective function J_m = sum_{i=1..n} sum_{j=1..c} mu_ij^m ||x_i - v_j||^2 • X = {x_1, x_2, ..., x_n}: n data points • V = {v_1, v_2, ..., v_c}: c cluster centres • U = (mu_ij): n×c fuzzy partition matrix • mu_ij: membership of x_i to v_j • m: fuzziness index (m > 1)
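The alternating membership and centre updates that minimise the FCM objective can be sketched in plain Python. This is an illustrative sketch (naive initialisation from the first c points), not the implementation used in the framework.

```python
def _d2(p, q):
    """Squared Euclidean distance, floored to avoid division by zero."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) or 1e-12

def fcm(points, c, m=2.0, iters=30):
    """Fuzzy c-means: alternate the membership and centre updates that
    minimise J = sum_i sum_j mu_ij^m * d2(x_i, v_j)."""
    centres = [points[i] for i in range(c)]   # naive init: first c points
    U = []
    for _ in range(iters):
        # membership update: mu_ij = 1 / sum_l (d2_ij / d2_il)^(1/(m-1))
        U = [[1.0 / sum((_d2(x, centres[j]) / _d2(x, centres[l])) ** (1 / (m - 1))
                        for l in range(c))
              for j in range(c)]
             for x in points]
        # centre update: mean of the data weighted by mu_ij^m
        centres = [tuple(sum(U[i][j] ** m * points[i][k] for i in range(len(points))) /
                         sum(U[i][j] ** m for i in range(len(points)))
                         for k in range(len(points[0])))
                   for j in range(c)]
    return centres, U
```

Each row of the partition matrix sums to 1, and for well-separated data the memberships become nearly crisp.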
KM method • Minimisation of the objective function J = sum_{j=1..k} sum_{x_i in c_j} ||x_i - v_j||^2 • ||x_i - v_j||: Euclidean distance between x_i and v_j • c_j: set of data points in cluster j • v_j: centroid of cluster j
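The standard k-means iteration (nearest-centre assignment followed by centroid update) can be sketched in a few lines of Python; this is a generic sketch, not the framework's implementation.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: alternate nearest-centre assignment and centroid update,
    reducing the total squared Euclidean distance to the centres."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)            # random distinct starting centres
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # assignment step: each point joins its nearest centre
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centres[j])))
            clusters[j].append(p)
        # update step: each centre moves to the mean of its cluster
        centres = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centres[j]
                   for j, cl in enumerate(clusters)]
    return centres, clusters
```

On two well-separated groups of points the iteration converges to one centre per group within a few steps.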
PAM method • Search for k representative objects (medoids) among the observations • Minimum sum of dissimilarities • k clusters are constructed by assigning each observation to the nearest medoid
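The medoid search can be sketched as a greedy swap loop: replace a medoid with a non-medoid whenever the total dissimilarity drops. This is a simplified sketch of the PAM idea (points are assumed to be hashable tuples), not the Kaufman–Rousseeuw implementation.

```python
def pam(points, k, dist):
    """PAM sketch: choose k medoids minimising the total dissimilarity,
    improving greedily by single medoid/non-medoid swaps until stable."""
    def cost(meds):
        return sum(min(dist(p, m) for m in meds) for p in points)
    medoids = list(points[:k])                 # naive init: first k points
    improved = True
    while improved:
        improved = False
        for m in list(medoids):
            for p in points:
                if p in medoids:
                    continue
                cand = [p if x == m else x for x in medoids]
                if cost(cand) < cost(medoids):
                    medoids, improved = cand, True
    # build the k clusters by assigning each point to its nearest medoid
    clusters = {m: [] for m in medoids}
    for p in points:
        clusters[min(medoids, key=lambda m: dist(p, m))].append(p)
    return medoids, clusters
```

Unlike k-means, the cluster representatives are actual observations, so any dissimilarity (here Manhattan distance) can be used.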
If the number of clusters is unknown… • Computation of validity indices • Defined from the data dispersion within and between clusters • According to decision rules over the indices, the best number of clusters may be selected
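One well-known index of this kind, the Calinski-Harabasz ratio of between- to within-cluster dispersion, can be sketched as follows. The slides do not name the specific indices used, so this is only an example of the within/between idea.

```python
def ch_index(points, labels):
    """Calinski-Harabasz-style validity index: ratio of between-cluster to
    within-cluster dispersion, normalised by the degrees of freedom."""
    n, dim = len(points), len(points[0])
    classes = sorted(set(labels))
    k = len(classes)
    grand = tuple(sum(p[d] for p in points) / n for d in range(dim))
    between = within = 0.0
    for c in classes:
        members = [p for p, l in zip(points, labels) if l == c]
        centre = tuple(sum(p[d] for p in members) / len(members) for d in range(dim))
        between += len(members) * sum((a - b) ** 2 for a, b in zip(centre, grand))
        within += sum(sum((a - b) ** 2 for a, b in zip(p, centre)) for p in members)
    return (between / (k - 1)) / (within / (n - k))
```

A decision rule would then favour the partition (and hence the number of clusters) with the higher index value.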
Characterisation & agreement • Visualisation techniques • Biplots • Boxplots • Indices for assessing agreement • Cohen’s kappa (κ) • Rand index
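As an illustration, Cohen's kappa can be computed from two label sequences in a few lines of Python (a sketch of the standard definition, not the implementation used in the framework):

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa: agreement between two label sequences,
    corrected for the agreement expected by chance."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b.get(k, 0) for k in count_a) / n ** 2
    return (observed - expected) / (1 - expected)

print(cohen_kappa([1, 1, 2, 2], [1, 1, 2, 2]))  # → 1.0 (perfect agreement)
```

A value of 1 indicates perfect agreement between two clusterings, 0 indicates agreement no better than chance.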
Definition of classes • Consensus clustering: • Align labels so that similar clusters are named in the same way by different algorithms • Take into account the points assigned to the same-labelled group by all algorithms
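These two consensus steps can be sketched in Python: relabel each clustering with the label permutation that best matches a reference run, then keep only points on which all runs agree. This is an illustrative sketch of the idea, not the framework's implementation.

```python
from itertools import permutations

def align(ref, other):
    """Relabel `other` with the permutation of its labels that agrees most
    with `ref` (feasible for the small numbers of classes used here)."""
    labs = sorted(set(other))
    best = max(permutations(labs),
               key=lambda perm: sum(r == dict(zip(labs, perm))[o]
                                    for r, o in zip(ref, other)))
    mapping = dict(zip(labs, best))
    return [mapping[o] for o in other]

def consensus(runs):
    """Keep a point's class only where all aligned runs agree;
    None marks a point left 'not classified'."""
    ref = runs[0]
    aligned = [ref] + [align(ref, run) for run in runs[1:]]
    return [col[0] if len(set(col)) == 1 else None for col in zip(*aligned)]
```

Points where the algorithms disagree stay unclassified, which is how a consensus can leave a fraction of cases (e.g. the 8.9% mentioned later) without a class.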
Supervised learning (1) • Model-based classification for prediction of future cases • Aims • High-quality prediction • Reduced number of variables (biomarkers) • Preference for ‘white-box’ prediction models
Supervised learning (2) • Different techniques: • C4.5 Decision Tree • Multi-Layer Perceptron Artificial Neural Network (MLP-ANN) • Naïve Bayes (NB) and Non-Parametric Bayesian Classifier (NPBC)
C4.5 classifier • Each attribute can be used to make a decision that splits the data into smaller subsets • Splitting criterion: the information gain that results from choosing an attribute to split the data • The attribute with the highest information gain is used to make the decision
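The splitting criterion can be sketched directly from its definition: parent entropy minus the size-weighted entropy of the subsets produced by an attribute. This is a sketch of the information-gain computation only, not a full C4.5 implementation (which also uses gain ratio and pruning).

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Information gain from splitting the data on attribute index `attr`:
    parent entropy minus the size-weighted entropy of the subsets."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    n = len(rows)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())
```

An attribute that splits the data into pure subsets gets the full parent entropy as gain; an uninformative attribute gets zero.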
Multi-Layer Perceptron • Feed-forward ANN • Nonlinear activation function used by each neuron • Layers of hidden nodes, each node connected to every node in the following layer • Learning carried out through back-propagation
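The feed-forward computation for a single hidden layer can be sketched as follows; the back-propagation training step is omitted, and sigmoid units are assumed for illustration (the slides do not specify the activation function used).

```python
import math

def mlp_forward(x, W1, b1, W2, b2):
    """Forward pass of a one-hidden-layer perceptron with sigmoid units;
    training (back-propagation of the output error) is omitted here."""
    sig = lambda z: 1.0 / (1.0 + math.exp(-z))
    hidden = [sig(sum(w * v for w, v in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return [sig(sum(w * h for w, h in zip(row, hidden)) + b)
            for row, b in zip(W2, b2)]
```

Each layer is a weighted sum of the previous layer's outputs passed through the nonlinear activation.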
Naïve Bayes classifier • Probabilistic classifier based on Bayes’ theorem • Good for multi-dimensional data • Common assumptions: • Independence of variables • Normality
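Under those two assumptions the classifier reduces to per-class Gaussian likelihoods per feature, which can be sketched as follows (a generic Gaussian Naive Bayes sketch, not the framework's implementation):

```python
import math
from collections import defaultdict

def gnb_fit(X, y):
    """Gaussian Naive Bayes: per class, store the prior and a
    (mean, variance) pair for each feature, assuming independence
    of variables and normality within each class."""
    grouped = defaultdict(list)
    for x, c in zip(X, y):
        grouped[c].append(x)
    model = {}
    for c, rows in grouped.items():
        stats = []
        for col in zip(*rows):
            mu = sum(col) / len(col)
            var = sum((v - mu) ** 2 for v in col) / len(col) or 1e-9
            stats.append((mu, var))
        model[c] = (len(rows) / len(X), stats)
    return model

def gnb_predict(model, x):
    """Pick the class maximising log prior + sum of Gaussian log-likelihoods."""
    def log_posterior(c):
        prior, stats = model[c]
        return math.log(prior) + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
            for v, (mu, var) in zip(x, stats))
    return max(model, key=log_posterior)
```

The independence assumption is what makes the likelihood a simple per-feature product, and hence tractable in high dimensions.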
NPBC: ratio between areas (1) • Similar to Naïve Bayes • Useful for non-normal data • Based on the ratio between areas under the histogram • The closer a data point is to the median, the higher the probability of belonging to that specific class
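One plausible empirical reading of the ratio-of-areas idea is sketched below: the mass of the class distribution from the tail up to the data point, over the mass from the tail up to the median. This is only an illustrative interpretation with the stated peak-at-the-median behaviour; the exact estimator is defined in Soria et al. (2011), cited on the next slide.

```python
def npbc_score(train_vals, x):
    """Illustrative ratio-of-areas score using the empirical distribution:
    mass from the tail up to x, over mass from the tail up to the median.
    Peaks at 1 when x equals the median; NOT the exact NPBC estimator."""
    vals = sorted(train_vals)
    median = vals[len(vals) // 2]
    if x < median:
        return sum(v <= x for v in vals) / sum(v <= median for v in vals)
    return sum(v >= x for v in vals) / sum(v >= median for v in vals)
```

Because it works directly on the empirical distribution, no normality assumption is needed, which matches the stated motivation for NPBC.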
NPBC: ratio between areas (2) [diagram: areas under the class histogram either side of the median m, for a data point x with x < m and x > m, giving the ratio ER] Soria et al. (2011): A ‘Non-Parametric’ Version of the Naïve Bayes Classifier. Knowledge-Based Systems, 24, 775-784
Characterisation of classes • Biplots • Boxplots • Relation with clinical information • Survival analysis
Validation of the framework (1) • Set of markers involved in breast cancer cell cycle regulation • 347 patients and 4 markers • Survival and grade available • K-means, PAM and Fuzzy C-means used • Validity indices for best number of clusters
Agreement and consensus • Kappa index high for the 3-group classification • 3 common classes found (8.9% of cases not classified) • Intermediate expression (class 1) • High expression (class 2) • Low expression (class 3)
Biplots of classes (1) [biplot; colours denote Class 1, Class 2 and Class 3]
Biplots of classes (2) [biplot; colours denote Class 1, Class 2, Class 3 and N.C. (not classified)]
Clinical information • High grade patients (poor prognosis) in classes 1 and 3 (worst survival) • Common classes group patients with similar outcome
Validation of the framework (2) • Patients entered into the Nottingham Tenovus Primary Breast Carcinoma Series between 1986 and 1998 • 1076 cases informative for all 25 biological markers • Clinical information(grade, size, age, survival, follow-up, etc.) available
Breast Cancer subtypes from consensus clustering (Soria et al., Computers in Biology and Medicine, 2010): • Luminal (ER+, CKs+), split by PgR and HER3/HER4 status: • Luminal A: Class 1 (18.8%) • Luminal N: Class 2 (14.2%) • Luminal B: Class 3 (7.4%) • Basal (ER-, CKs-), split by p53 status: • p53 altered: Class 4 (7.6%) • p53 normal: Class 5 (6.4%) • HER2 (ER-, HER2+): Class 6 (7.2%) • Mixed Class (38.4%)
Conclusions • Original framework for the identification of core classes • Formed by different logical steps • Validated on two clinical data sets • 3 classes (Low, Intermediate and High expression) • Higher marker levels associated with better survival • Discovery of novel cancer subtypes
Main references • L. Kaufman, P.J. Rousseeuw. Finding Groups in Data, Wiley series in probability and mathematical statistics, 1990. • A. Weingessel, et al. An Examination of Indexes for Determining the Number of Clusters in Binary Data Sets, Working Paper No. 29, 1999. • I.H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Publishers, 2005. • A.K. Jain and R.C. Dubes. Algorithms for Clustering Data. Prentice-Hall advanced reference series, Prentice-Hall, Englewood Cliffs, NJ, USA, 1988. • A.K. Jain, M.N. Murty, and P.J. Flynn. Data clustering: A review. ACM Computing Surveys, 31(3):264-323, 1999. • P.F. Velleman and D.C. Hoaglin. Applications, Basics and Computing of Exploratory Data Analysis. Boston, Mass.: Duxbury Press, 1981. • J. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, Los Altos, California, 1993. • S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, 2nd edition, 1998. • G. John and P. Langley. Estimating continuous distributions in Bayesian classifiers. Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, 1995.
Acknowledgements Prof JM Garibaldi, Dr J Bacardit Nottingham Breast Cancer Pathology RG Prof IO Ellis, Dr AR Green, Dr D Powe, Prof G Ball
Thank You! Contact: daniele.soria@nottingham.ac.uk