Robust methodologies for partition clustering. Paulo Lisboa Terence Etchells, Ian Jarman and Simon Chambers. Overview. Partition clustering - critique Decomposition of the covariance matrix Landscape mapping of cluster solutions
Robust methodologies for partition clustering
Paulo Lisboa, Terence Etchells, Ian Jarman and Simon Chambers
Overview
• Partition clustering - critique
• Decomposition of the covariance matrix
• Landscape mapping of cluster solutions
• Validation for two synthetic data sets and metabolic sub-typing
Bioinformatics: Nottingham Tenovus Primary Breast Carcinoma Series
• Consecutive series of 1,944 cases of primary operable invasive breast cancer (n=1,076 with all markers present)
• Patients presenting during 1986-98
• Protein expression comprising 25 immunohistochemical markers related to tumour malignancy, derived through high-throughput protein expression using tissue microarrays (TMA)
Abd El-Rehim et al, Int J Cancer, 116, 340-350, 2005.
Partition clustering – relevance to bioinformatics
[Figure; panel labels: p53, CK 5/6, C-erbB-2, BRCA1, ER, PgR]
Partition clustering – open issues
• Identify a suitable algorithm:
• Model-based or model-free?
• Hierarchical, K-means, PAM?
• Return {Sa,...,Sz} solutions
• Validate & interpret each solution
K-means
i. Assume #K
ii. Initialise #N?
iii. Sort by optimality?
iv. Select best for #K?
v. Select #K(s)?
vi. Single cluster or ensemble?
Separation index: Decomposition of the scatter matrix
• Scatter matrices
[Figure labels: SW1, SW2, SB]
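The scatter matrices named on this slide can be illustrated with a minimal sketch. It assumes the textbook definitions (total scatter about the grand mean, within-cluster scatter about each cluster mean, between-cluster scatter of the cluster means weighted by cluster size); the slide's own formulas are not in the transcript, so the function names and the toy data below are hypothetical.

```python
# Sketch only: checks numerically that S_T = S_W + S_B for a labelled data set.
# Standard library only; the data and labels are made up for illustration.

def mean(rows):
    """Column-wise mean of a list of equal-length rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def outer(u, v):
    """Outer product u v^T as a nested list."""
    return [[a * b for b in v] for a in u]

def add(A, B):
    """Element-wise sum of two matrices."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def scale(A, s):
    """Multiply every entry of A by the scalar s."""
    return [[a * s for a in row] for row in A]

def scatter(rows, centre):
    """Sum of (x - centre)(x - centre)^T over all rows."""
    d = len(centre)
    S = [[0.0] * d for _ in range(d)]
    for x in rows:
        diff = [xi - ci for xi, ci in zip(x, centre)]
        S = add(S, outer(diff, diff))
    return S

def scatter_decomposition(X, labels):
    """Return (S_T, S_W, S_B); S_W is the sum of the per-cluster S_Wk."""
    m = mean(X)
    d = len(m)
    S_T = scatter(X, m)
    S_W = [[0.0] * d for _ in range(d)]
    S_B = [[0.0] * d for _ in range(d)]
    for k in sorted(set(labels)):
        members = [x for x, lab in zip(X, labels) if lab == k]
        m_k = mean(members)
        S_W = add(S_W, scatter(members, m_k))
        diff = [a - b for a, b in zip(m_k, m)]
        S_B = add(S_B, scale(outer(diff, diff), len(members)))
    return S_T, S_W, S_B

# Hypothetical toy data: three small groups in two dimensions.
X = [[0.0, 0.1], [0.2, -0.1], [4.0, 4.2], [4.1, 3.9], [8.0, 0.0], [8.2, 0.3]]
labels = [0, 0, 1, 1, 2, 2]
S_T, S_W, S_B = scatter_decomposition(X, labels)
```

The identity S_T = S_W + S_B holds for any labelling, so a well-separated partition is one that moves most of the total scatter into S_B.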
Separation index: Decomposition of the scatter matrix
• Invariant separation matrix and index
[Figure labels: SW1, SW2, SB]
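The formulas for this slide are not in the transcript. For orientation, the block below writes out the textbook scatter decomposition together with one commonly used invariant criterion, the trace of S_T^{-1} S_B. Treating that trace as the separation index J is an assumption made here for illustration, not something the slides confirm.

```latex
\begin{align}
S_T &= \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^{\top}, \\
S_{W_k} &= \sum_{i \in C_k} (x_i - \bar{x}_k)(x_i - \bar{x}_k)^{\top},
  \qquad S_W = \sum_{k=1}^{K} S_{W_k}, \\
S_B &= \sum_{k=1}^{K} n_k (\bar{x}_k - \bar{x})(\bar{x}_k - \bar{x})^{\top},
  \qquad S_T = S_W + S_B, \\
J &= \operatorname{tr}\!\left(S_T^{-1} S_B\right)
  \quad \text{(assumed form of the index)}.
\end{align}
% Invariance under an invertible linear map x \mapsto A x:
% every scatter matrix transforms as S \mapsto A S A^{\top}, so
\begin{align}
\operatorname{tr}\!\left((A S_T A^{\top})^{-1} A S_B A^{\top}\right)
  = \operatorname{tr}\!\left(A^{-\top} S_T^{-1} S_B A^{\top}\right)
  = \operatorname{tr}\!\left(S_T^{-1} S_B\right).
\end{align}
```

The last line uses only the cyclic property of the trace, and it requires S_T to be invertible.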
N.B. If |ST| = 0 → Project onto subspace of cohort means
[Figure labels: a1, a2, a3]
Theorem: the separation index is invariant to dimensionality reduction under Mahalanobis rotations
[Figure labels: ~a1, ~a2, ~a3]
Optimality principle
i. N initialisations
ii. Sort by J
iii. Select top p%
iv. Calculate pairwise CV
v. Retain med(CV)
vi. Plot (J, med_CV)
• Reproducibility with
• Best Separation – max(J)
• Best Concordance – max(CV)
• under repeated initialisations
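Steps i to vi can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: J is replaced by a simple stand-in (the fraction of the total sum of squares that lies between clusters), CV is taken to be Cramér's V between two label vectors, med(CV) is read as each solution's median CV against the other retained solutions, and the k-means routine, function names and toy data are all hypothetical.

```python
import random
from statistics import median

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def kmeans(X, k, rng, iters=50):
    """Plain Lloyd's algorithm from a random choice of k data points."""
    centres = rng.sample(X, k)
    labels = None
    for _ in range(iters):
        new = [min(range(k), key=lambda j: sq_dist(x, centres[j])) for x in X]
        if new == labels:
            break
        labels = new
        for j in range(k):
            members = [x for x, lab in zip(X, labels) if lab == j]
            if members:  # keep the old centre if a cluster empties
                centres[j] = centroid(members)
    return labels

def separation(X, labels):
    """Stand-in for J (assumption): 1 - within SSQ / total SSQ."""
    grand = centroid(X)
    total = sum(sq_dist(x, grand) for x in X)
    within = 0.0
    for k in set(labels):
        members = [x for x, lab in zip(X, labels) if lab == k]
        c = centroid(members)
        within += sum(sq_dist(x, c) for x in members)
    return 1.0 - within / total

def cramers_v(a, b):
    """Cramér's V between two labelings (assumed meaning of CV)."""
    n = len(a)
    ra, rb = sorted(set(a)), sorted(set(b))
    counts = {}
    for pair in zip(a, b):
        counts[pair] = counts.get(pair, 0) + 1
    row = {i: a.count(i) for i in ra}
    col = {j: b.count(j) for j in rb}
    chi2 = 0.0
    for i in ra:
        for j in rb:
            expected = row[i] * col[j] / n
            chi2 += (counts.get((i, j), 0) - expected) ** 2 / expected
    m = min(len(ra), len(rb)) - 1
    return (chi2 / (n * m)) ** 0.5 if m > 0 else 0.0

def landscape(X, k, n_init=20, top_frac=0.5, seed=0):
    """Return (J, median CV) pairs for the top fraction of solutions."""
    rng = random.Random(seed)
    runs = [kmeans(X, k, rng) for _ in range(n_init)]               # step i
    scored = sorted(((separation(X, lab), lab) for lab in runs),
                    key=lambda t: t[0], reverse=True)               # step ii
    top = scored[:max(2, int(len(scored) * top_frac))]              # step iii
    points = []
    for i, (J, lab) in enumerate(top):
        cvs = [cramers_v(lab, other)                                # step iv
               for j, (_, other) in enumerate(top) if j != i]
        points.append((J, median(cvs)))                             # step v
    return points                                                   # step vi: plot these

# Hypothetical toy data: three Gaussian blobs in two dimensions.
data_rng = random.Random(1)
X = [[cx + data_rng.gauss(0, 0.3), cy + data_rng.gauss(0, 0.3)]
     for cx, cy in [(0, 0), (5, 5), (10, 0)] for _ in range(20)]
points = landscape(X, k=3)
```

Plotting the returned pairs gives the landscape the slide describes: solutions in the upper-right corner are both well separated (high J) and reproducible across initialisations (high median CV).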
Synthetic data (10 cohorts)
[Figure: numeric labels from the plot omitted]
Synthetic data (10 cohorts)
[Figure panels: Max J, SeCo, Max Cv]
Cluster hierarchy (1)
[Figure: hierarchy of cluster solutions; nodes labelled by cluster and size, e.g. "C1, 781", "C3, 459"]
Cluster hierarchy (2)
[Figure: hierarchy of cluster solutions; nodes labelled by cluster and size, e.g. "C1, 781", "C3, 459"]
Sub-type profiling: Clusters A, Clusters B
[Figure panels: Luminal New 2, Luminal N]
Sub-type profiling: Clusters A, Clusters B
[Figure panels: Luminal A, HER2]
Sub-type profiling: Clusters A, Clusters B
[Figure panels: Basal p53 -, Basal muc1 +, Basal p53 +, Basal muc1 -]
Summary
• Partition clustering - critique
• Decomposition of the covariance matrix
• Landscape mapping of cluster solutions
• Validation for two synthetic data sets and metabolic sub-typing
Ferrara data (n=633)
[Figure panels: JMU Cluster 1/5, JMU Cluster 2/5, JMU Cluster 3/5, JMU Cluster 4/5, JMU Cluster 5/5]