Searching For a Few Good Features. Pathology Informatics 2010. Bülent Yener, Rensselaer Polytechnic Institute, Department of Computer Science.
The Hard Problem: Bad or Just Ugly? Unlike recognizing healthy tissue, discriminating damaged (diseased but not cancerous) tissue from cancerous tissue is one of the main challenges. We need a few good features!
Brain Tissue - Diffused Good: healthy Ugly: inflammation Bad: glioma
Gland based tissue: Prostate Good (Healthy) Ugly (PIN) Bad (cancerous)
Gland based tissue: Breast Ugly (in situ) Good Bad (invasive)
Bone Tissue Images Osteosarcoma (bad) Healthy (good) Fracture (ugly)
Two related problems • Feature Extraction • Identify and compute attributes that will characterize the information encoded in the histology images • Need to quantify! • Feature Selection • Identify an optimal subset.
Feature Selection • Select a subset of the original features • reduces the number of features (dimensionality reduction) • removes irrelevant or redundant data (noise reduction) • speeds up a data mining algorithm • improves prediction accuracy • It is a hard optimization problem! • Optimal feature selection requires an exhaustive search of all possible subsets of features of the chosen cardinality. • Too expensive • In practice: ad hoc heuristics
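To make "too expensive" concrete, the short sketch below counts how many subsets an exhaustive search would have to evaluate. The function names are illustrative, not from the talk.

```python
from math import comb


def subsets_of_size(n_features: int, k: int) -> int:
    """Number of feature subsets with exactly k of n_features features."""
    return comb(n_features, k)


def all_nonempty_subsets(n_features: int) -> int:
    """Number of non-empty feature subsets of any size."""
    return 2 ** n_features - 1


if __name__ == "__main__":
    # Each subset costs one full train-and-validate run of the classifier.
    print(subsets_of_size(20, 5))
    print(all_nonempty_subsets(18))
```

The count doubles with every additional feature, which is why heuristics take over beyond a few dozen features.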
Greedy Algorithms • Search for a local optimum • evaluate a candidate subset of features • modify the subset and evaluate it • if the new subset is an improvement over the old, take it as current • otherwise • if the algorithm is deterministic, reject the modification (e.g. hill climbing) • else accept it with some probability (e.g. simulated annealing).
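The accept/reject loop above can be sketched as follows. This is a minimal illustration, not the speaker's implementation: the `score` callback, the single-feature flip move, and the fixed `temperature` are all assumptions. With `temperature=0` the loop behaves as hill climbing; with a positive value, a worse subset is accepted with probability exp(delta / temperature), as in simulated annealing (without a cooling schedule).

```python
import math
import random


def local_search(n_features, score, steps=200, temperature=0.0, seed=0):
    """Local search over feature subsets by flipping one feature per step.

    score: hypothetical callback mapping a frozenset of feature indices
    to a number (higher is better), e.g. a validation accuracy.
    """
    rng = random.Random(seed)
    current = frozenset()
    current_score = score(current)
    best, best_score = current, current_score
    for _ in range(steps):
        f = rng.randrange(n_features)
        candidate = current ^ {f}  # add f if absent, drop it if present
        delta = score(candidate) - current_score
        if delta > 0:
            accept = True  # improvement: always take it
        elif temperature > 0:
            # stochastic variant: sometimes accept a non-improving move
            accept = rng.random() < math.exp(delta / temperature)
        else:
            accept = False  # deterministic variant: reject
        if accept:
            current, current_score = candidate, current_score + delta
            if current_score > best_score:
                best, best_score = current, current_score
    return best, best_score


def toy_score(subset):
    """Toy objective: features 1 and 3 help, every other feature hurts."""
    useful = {1, 3}
    return len(subset & useful) - 0.5 * len(subset - useful)
```

Calling `local_search(6, toy_score)` runs the hill-climbing variant on the toy objective; passing `temperature=0.5` switches to the stochastic acceptance rule.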
Methods (partial list) • Exhaustive search: evaluate all possible subsets. • Branch and Bound Search: enumerate a fraction of the subsets; can find the optimum, but the worst case is exponential. • Best features (isolated): evaluate all m features in isolation; no guarantee of an optimum. • Sequential Forward Selection (SFS): start with the best feature and add one at a time; no backtracking. • Sequential Backward Selection (SBS): start with all d features and eliminate one at a time; more expensive than SFS, and no backtracking either. • Variants of SFS and SBS: start with the k best features and then delete r of them, etc.
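As an illustration of one entry in this list, here is a minimal sketch of Sequential Forward Selection. The `score` callback and the stopping rule (stop at `k` features) are assumptions made for the example.

```python
def sequential_forward_selection(n_features, score, k):
    """Greedily grow a feature subset one feature at a time.

    score: hypothetical callback taking a frozenset of feature indices
    and returning a number (higher is better).
    """
    selected = []
    remaining = set(range(n_features))
    while len(selected) < k and remaining:
        # Pick the single feature whose addition scores best right now.
        best_f = max(
            sorted(remaining),
            key=lambda f: score(frozenset(selected + [f])),
        )
        selected.append(best_f)  # never revisited: no backtracking
        remaining.remove(best_f)
    return selected
```

Because a chosen feature is never removed, SFS evaluates on the order of k·n subsets rather than all of them, at the price of possibly missing the optimal subset.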
Types of Algorithms • Supervised, unsupervised, and semi-supervised (embedded) feature selection algorithms • e.g. Principal Component Analysis (PCA) is an unsupervised feature extraction method: it finds a set of mutually orthogonal basis functions that capture the directions of maximum variance in the data. • But these features may not be useful for discriminating between data in different classes. • Wrappers (wrap the selection process around the learning algorithm) and Filters (examine intrinsic properties of the data) • Feature selection algorithms with filter and embedded models may return either a subset of selected features or weights (measuring feature relevance) for all features.
Relevance and redundancy • A feature is statistically relevant if its removal from a feature set will reduce the prediction power. • A feature may be redundant due to the existence of other relevant features, which provide similar prediction power as this feature.
Filter Model • Filtering is independent of the induction algorithm (induction algorithms: algorithms inducing concept descriptions from examples, i.e. learning algorithms) • It is a preprocessing step • Example: Relief method • Pipeline: all d features -> subset selection -> m < d features -> induction algorithm
Relief Method • It assigns relevance to features based on their ability to disambiguate similar samples • Similarity is defined by proximity in feature space. • Relevant features accumulate high positive weights, while irrelevant features retain near-zero weights. • For each target sample, • find the nearest sample in feature space of the same category, the "hit" sample. • find the nearest sample of the other category, the "miss" sample. • The relevance of feature f near the target sample is measured by comparing its difference to the miss sample with its difference to the hit sample. Source: K. Kira and L.A. Rendell
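The hit/miss procedure can be sketched in a few lines. This is a common textbook variant, not necessarily the exact formula on the original slide: it assumes two classes, uses absolute differences scaled by each feature's range, visits every sample once as the target, and averages over targets. All function and variable names are illustrative.

```python
def relief(X, y):
    """Relief-style feature weights for a two-class problem.

    X: list of samples, each a list of numeric feature values.
    y: list of class labels, one per sample.
    """
    n, d = len(X), len(X[0])
    # Range of each feature, used to scale differences into [0, 1].
    spans = []
    for f in range(d):
        column = [row[f] for row in X]
        spans.append((max(column) - min(column)) or 1.0)

    def distance(a, b):
        return sum(abs(a[f] - b[f]) / spans[f] for f in range(d))

    weights = [0.0] * d
    for i in range(n):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        if not hits or not misses:
            continue
        hit = min(hits, key=lambda j: distance(X[i], X[j]))
        miss = min(misses, key=lambda j: distance(X[i], X[j]))
        for f in range(d):
            # Reward separation from the miss, penalize distance to the hit.
            diff_miss = abs(X[i][f] - X[miss][f]) / spans[f]
            diff_hit = abs(X[i][f] - X[hit][f]) / spans[f]
            weights[f] += (diff_miss - diff_hit) / n
    return weights
```

A feature that separates the classes ends up with a clearly positive weight, while a feature that varies as much within a class as between classes drifts toward zero or below.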
Other Filter Algorithms • Laplacian Score: focuses on the local structure of the data space; scores each feature by its locality-preserving power. • SPEC: similar, but uses the normalized Laplacian matrix. • Fisher Score: assigns the highest score to the feature on which the data points of different classes are far from each other. • Chi-square Score: tests whether the class label is independent of a particular feature. • Minimum-Redundancy-Maximum-Relevance (mRMR): selects features that are mutually far away from each other while still having "high" correlation with the classification variable (an approximation to maximizing the dependency between the joint distribution of the selected features and the classification variable). • Kruskal-Wallis: non-parametric method based on ranks, for comparing the population medians among groups. • Information Gain: measures the dependence between the feature and the class label. Source: Zhao et al http://featureselection.asu.edu
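As one example from this list, the Fisher score of a single feature can be sketched as the ratio of between-class scatter to within-class scatter. This is one common definition, assumed here for illustration; the cited sources may normalize differently.

```python
def fisher_score(values, labels):
    """Fisher score of one feature: between-class over within-class scatter.

    values: the feature's value for each sample.
    labels: the class label of each sample.
    """
    overall_mean = sum(values) / len(values)
    between = 0.0
    within = 0.0
    for c in set(labels):
        class_values = [v for v, l in zip(values, labels) if l == c]
        class_mean = sum(class_values) / len(class_values)
        between += len(class_values) * (class_mean - overall_mean) ** 2
        within += sum((v - class_mean) ** 2 for v in class_values)
    # A feature that is constant within every class separates perfectly.
    return between / within if within > 0 else float("inf")
```

Scoring each feature independently like this is cheap, but it cannot detect redundancy between features, which is the gap that methods such as mRMR try to close.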
Wrapper Model Source: Zhao et al http://featureselection.asu.edu
BLogReg: Gavin C. Cawley and Nicola L. C. Talbot. Gene selection in cancer classification using sparse logistic regression with Bayesian regularization. Bioinformatics, 22(19):2348-2355, 2006.
CFS: Mark A. Hall and Lloyd A. Smith. Feature selection for machine learning: Comparing a correlation-based filter approach to the wrapper, 1999.
Chi-Square: H. Liu and R. Setiono. Chi2: Feature selection and discretization of numeric attributes. In J.F. Vassilopoulos, editor, Proceedings of the Seventh IEEE International Conference on Tools with Artificial Intelligence, November 5-8, 1995, pages 388-391, Herndon, Virginia, 1995. IEEE Computer Society.
FCBF: H. Liu and L. Yu. Feature selection for high-dimensional data: A fast correlation-based filter solution. In Proceedings of The Twentieth International Conference on Machine Learning (ICML-03), pages 856-863, Washington, D.C., 2003. ICM.
Fisher Score: R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification. John Wiley & Sons, New York, 2nd edition, 2001.
Information Gain: T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, 1991.
Kruskal-Wallis: L. J. Wei. Asymptotic conservativeness and efficiency of Kruskal-Wallis test for k dependent samples. Journal of the American Statistical Association, 76(376):1006-1009, December 1981.
mRMR: H. Peng, F. Long, and C. Ding. Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1226-1238, 2005.
Relief: K. Kira and L.A. Rendell. A practical approach to feature selection. In Sleeman and P. Edwards, editors, Proceedings of the Ninth International Conference on Machine Learning (ICML-92), pages 249-256. Morgan Kaufmann, 1992.
SBMLR: Gavin C. Cawley, Nicola L. C. Talbot, and Mark Girolami. Sparse multinomial logistic regression via Bayesian L1 regularisation. In NIPS, pages 209-216, 2006.
Spectrum: Huan Liu and Zheng Zhao. Spectral feature selection for supervised and unsupervised learning. Proceedings of the 24th International Conference on Machine Learning, 2007.
Source: Zhao et al http://featureselection.asu.edu
Feature Space over Histology Images is Large • Texture based • Intensity based • Graph theoretical • Voronoi graphs • Cell-graphs
Voronoi Graphs and Their Features • Minimum spanning tree and its properties
Cell-Graphs Represent the tissue as a graph: • A node of the graph represents a cell or cell cluster • An edge of the graph represents a relation between a pair of nodes (e.g., spatial, ECM); a generalization of Voronoi graphs (a) Healthy (b) Damaged (c) Cancerous
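A minimal sketch of building such a graph from cell positions follows. It assumes the simplest spatial relation, linking two cells whose centroids lie within a fixed Euclidean distance; the slide does not specify the linking rule, so `max_distance` and the function name are hypothetical.

```python
from math import dist


def build_cell_graph(centroids, max_distance):
    """Return a symmetric 0/1 adjacency matrix over cell centroids.

    centroids: list of (x, y) cell or cell-cluster positions.
    max_distance: assumed linking radius; two nodes are connected when
    their Euclidean distance does not exceed it.
    """
    n = len(centroids)
    adjacency = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if dist(centroids[i], centroids[j]) <= max_distance:
                adjacency[i][j] = adjacency[j][i] = 1
    return adjacency
```

Other relations mentioned on the slide, such as ECM-aware links, would replace only the distance test inside the loop.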
What do we gain from Cell-graphs? • Mathematical representation (adjacency matrix, normalized Laplacian) • We can apply operations on them using • (multi)linear algebra • algorithms • We can quantify the structural properties with mathematically well-defined graph metrics. • Subgraph mining • Descriptor subgraphs • Subgraph search in a large graph • Subgraph kernels
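For reference, the normalized Laplacian can be computed from the adjacency matrix as sketched below. The sketch assumes the standard definition L = I - D^(-1/2) A D^(-1/2) for an undirected graph without self-loops, with rows and columns of isolated nodes set to zero; the slide's own formula is not recoverable from the text.

```python
from math import sqrt


def normalized_laplacian(adjacency):
    """Normalized Laplacian of an undirected 0/1 adjacency matrix."""
    n = len(adjacency)
    degree = [sum(row) for row in adjacency]
    laplacian = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # Diagonal is 1 for connected nodes, 0 for isolated ones.
                laplacian[i][j] = 1.0 if degree[i] > 0 else 0.0
            elif adjacency[i][j]:
                laplacian[i][j] = -1.0 / sqrt(degree[i] * degree[j])
    return laplacian
```

The eigenvalues of this matrix lie between 0 and 2, which is what makes spectral features comparable across graphs of different sizes.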
Cell-graph Features • Local: cell-level • Graph theoretical: e.g. Degree, clustering coeff. • Morphological: e.g., shape • Global: tissue-level • Graph theoretical • Spectral
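Two of the local graph-theoretical features named above, degree and clustering coefficient, can be computed directly from an adjacency matrix. The sketch assumes an undirected graph with no self-loops and uses the usual local clustering coefficient (fraction of a node's neighbor pairs that are themselves linked); the function names are illustrative.

```python
def average_degree(adjacency):
    """Mean number of neighbors per node."""
    return sum(sum(row) for row in adjacency) / len(adjacency)


def clustering_coefficient(adjacency, v):
    """Fraction of pairs of v's neighbors that are connected to each other."""
    neighbors = [u for u, linked in enumerate(adjacency[v]) if linked]
    k = len(neighbors)
    if k < 2:
        return 0.0  # no neighbor pairs to test
    links = sum(
        adjacency[a][b]
        for i, a in enumerate(neighbors)
        for b in neighbors[i + 1:]
    )
    return 2.0 * links / (k * (k - 1))
```

Averaging `clustering_coefficient` over all nodes gives a tissue-level (global) feature from a cell-level (local) one.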
Cell-graph Feature Selection • Pairwise correlation of features. Goal: to find a set of features which are pairwise independent. • Discriminative power. Goal: to find a smaller subset of features which is as expressive as the full feature set.
Pairwise Correlation Graph • The correlations among the graph features themselves can be represented as a correlation graph. • The correlation graph is obtained by the procedure below. • Calculate the n×n correlation matrix for n features and obtain the correlation coefficients (n = 20 in this case). • Create a node for each feature and arrange the nodes in a circle. • Set a threshold for correlation and establish an edge between two feature nodes if |correlation coefficient| ≥ threshold (threshold = 0.9 in this case).
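The procedure can be sketched as follows, assuming Pearson correlation (the slide does not name the coefficient) and omitting the circular layout, which only affects drawing. Feature names and values in any call are made up for illustration.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # constant feature: treat as uncorrelated
    return cov / sqrt(var_a * var_b)


def correlation_graph(columns, threshold=0.9):
    """Edges between features whose |correlation| reaches the threshold.

    columns: dict mapping a feature name to its values over all samples.
    """
    names = list(columns)
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(columns[a], columns[b])) >= threshold:
                edges.append((a, b))
    return edges
```

The number of edges returned is the "complexity" of the correlation graph, so it can be compared across tissue types and tissue states.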
Correlation Graphs for Healthy Tissue Brain Breast Bone
Correlation Graphs for Cancerous Tissue Breast Brain Bone
Observations on Correlation Graphs • The correlation graphs differ greatly depending on tissue type and (dys)functional status. • The complexity of the correlation graph (number of edges) depends on the tissue type and tissue status. • Some features in some cases show cluster structures (e.g. node number, edge number and average degree in healthy breast tissue), • but a cluster structure may not be present in all cases (e.g. cancerous brain tissue). • The features are highly correlated.
Interpretation • Strong correlation means a high dependency between the features, which leads to a complex joint probability density function. Any probabilistic or statistical modeling attempt should account for this complexity. • An uncorrelated feature is not necessarily a distinguishing feature. It might not be a discriminative feature for classification. • High correlation may indicate that a smaller subset of features is enough to discriminate the classes, but not always.
Feature Selection: good, bad, and ugly Breast – Average Degree Brain – Average Degree
Feature Selection - cont Breast – End Point Percentage Brain – End Point Percentage
Feature Selection Need a few good features! Two phase approach: • Find the best classifier (MLP) • Determine the features
Feature Selection • The data is not linearly separable. Also, the features, as expected, show different distributions in each tissue type. • 10-fold cross-validation results (accuracy percentages) for breast tissue are obtained with all 20 existing features, to see which classifier is most successful at classifying the data from cell-graph features, using • Adaboost (30 C4.5 trees), • k-nn (k = 5), • MLP (1 hidden layer, 12 hidden units, back propagation). • These classifiers are used because they are good at separating non-linearly distributed data and come from different families of classification algorithms.
Feature Selection: next step • The classification problem is reduced to 2-class problems (healthy vs. cancerous, healthy vs. damaged, damaged vs. cancerous). • Number of edges and number of nodes are excluded. This exclusion also decreases the runtime for selection.
Details • An exhaustive search over 18 features is done using MLP. Since MLP gave the highest accuracy with all features, it is intuitively expected to show higher accuracy than the other classifiers during subset selection. • The procedure is described below. • Start with an empty selected feature subset with 0 accuracy percentage (seq. forward selection alg.). • Repeat the procedure below for all possible feature subsets (2^18). • Train the classifier and validate its accuracy with 10-fold cross-validation. • If the average 10-fold CV accuracy percentage of the current subset is higher than that of the selected feature subset, assign the current subset as the selected feature subset.
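The loop described above can be sketched as follows. To keep the example dependency-free, a nearest-centroid classifier stands in for the MLP; that substitution, the interleaved fold assignment, and all function names are assumptions for illustration, not the setup used in the study.

```python
from itertools import combinations


def nearest_centroid_accuracy(train, test, subset):
    """Accuracy of a nearest-centroid classifier (stand-in for the MLP)."""
    sums, counts = {}, {}
    for x, label in train:
        acc = sums.setdefault(label, [0.0] * len(subset))
        for i, f in enumerate(subset):
            acc[i] += x[f]
        counts[label] = counts.get(label, 0) + 1
    centroids = {l: [s / counts[l] for s in sums[l]] for l in sums}
    correct = 0
    for x, label in test:
        predicted = min(
            centroids,
            key=lambda l: sum(
                (x[f] - c) ** 2 for f, c in zip(subset, centroids[l])
            ),
        )
        correct += predicted == label
    return correct / len(test)


def cross_val_accuracy(data, subset, folds):
    """Average accuracy over interleaved folds (sample i goes to fold i % folds)."""
    scores = []
    for k in range(folds):
        test = data[k::folds]
        train = [s for i, s in enumerate(data) if i % folds != k]
        scores.append(nearest_centroid_accuracy(train, test, subset))
    return sum(scores) / folds


def exhaustive_search(data, n_features, folds=10):
    """Evaluate every non-empty feature subset and keep the most accurate."""
    best_subset, best_accuracy = (), 0.0  # empty subset, 0 accuracy
    for size in range(1, n_features + 1):
        for subset in combinations(range(n_features), size):
            accuracy = cross_val_accuracy(data, subset, folds)
            if accuracy > best_accuracy:
                best_subset, best_accuracy = subset, accuracy
    return best_subset, best_accuracy
```

Here `data` is a list of `(feature_vector, label)` pairs. Ties are resolved in favor of the subset found first, which with this enumeration order means the smaller subset.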
MLP + Exhaustive Search Results on Breast Cancer • The results for breast data are given below. (no normalization)
Cell-graph Feature Selection with Relief Method 1. Average degree 2. Average Clustering coefficient 3. Average eccentricity 4. Maximum eccentricity 5. Minimum eccentricity 6. Average effective eccentricity 7. Maximum effective eccentricity 8. Minimum effective eccentricity 9. Average path length (closeness) 10. Giant connected ratio 11. Percentage of isolated points 12. Percentage of end points 13. Number of central points 14. Percentage of central points 15. Number of nodes 16. Number of edges 17. Spectral radius 18. Second largest eigenvalue 19. Trace 20. Energy 21. Number of eigenvalues
Problem Definition • Treated with ROCK (Rho-associated coiled-coil kinase) that regulates branching morphogenesis • Untreated • Can we quantify the organizing principles and distinguish between different states of the branching process?
An Even Richer Set of Features
1 Average_degree 2 C 3 C2 4 D 5 Average_eccentricity 6 Maximum_eccentricity_(diameter) 7 Minimum_eccentricity_(radius) 8 Average_eccentricity_90 9 Maximum_eccentricity_90 10 Minimum_eccentricity_90 11 Average_path_length_(closeness) 12 Giant_connected_ratio 13 Number_of_Connected_Components 14 Percentage_of_isolated_points 15 Percentage_of_end_points 16 Number_of_central_points 17 Percentage_of_central_points 18 Number_of_nodes 19 Number_of_edges
20 elongation_ 21 area 22 orientation 23 eccentricity 24 perimeter 25 circularity_ 26 solidity
27 largest_eigen_adjacency_ 28 second_largest_adjacency 29 trace_adjacency_ 30 energy_adjacency 31 #of_zeros_normalized_laplacian 32 slope_0-1_normalized_laplacian 33 #of_ones_normalized_laplacian 34 slope_1-2_normalized_laplacian 35 #of_twos_normalized_laplacian 36 trace_laplacian 37 energy_laplacian
38 degree_cluster_1 39 degree_cluster_2 40 degree_cluster_3 41 clustering_coefficient_C_cluster_1 42 clustering_coefficient_C_cluster_2 43 clustering_coefficient_C_cluster_3 44 clustering_coefficient_D_cluster_1 45 clustering_coefficient_D_cluster_2 46 clustering_coefficient_D_cluster_3 47 eccentricity_cluster_1 48 eccentricity_cluster_2 49 eccentricity_cluster_3 50 effective_eccentricity_cluster_1_ 51 effective_eccentricity_cluster_2 52 effective_eccentricity_cluster_3 53 closeness_cluster_1 54 closeness_cluster_2 55 closeness_cluster_3
Classifier Comparison Since MLP has a higher overall accuracy, it is used in later studies in feature selection.
Epithelial vs Mesenchymal comparison in treated tissue samples
Epithelial vs Mesenchymal comparison in untreated tissue samples
Concluding Remarks • Feature extraction and selection are strongly coupled for accuracy; there is always room for new features • Feature selection performance depends on the induction algorithm (i.e., learning algorithm) • Quantifiable features are not always interpretable; mapping the features to biology or pathology is a crucial link!