A Taxonomy of Similarity Mechanisms for Case-Based Reasoning • Pádraig Cunningham • TKDE, Vol. 21, 2009, pp. 1532–1543 • Presenter: Wei-Shen Tai • 2009/11/17
Outline • Introduction • Representation • Similarity measures • Direct similarity mechanisms • Transformation-based measures • Information-theoretic measures • Emergent measures • Implications for CBR research • Conclusion • Comments
Motivation • Similarity is central to CBR • More recently, a number of novel mechanisms have emerged that introduce interesting alternative perspectives on similarity.
Objective • Review novel similarity mechanisms. • Present a taxonomy of similarity mechanisms that places these new techniques in the context of established CBR techniques.
Feature value representation • Cases are described in terms of the values of their attributes (features). • Enhancement • Discover word associations in a text corpus and then use these associations to add terms to the representation. • Bill Gates -> software, CEO, Microsoft • This allows texts to be represented by more features.
Structural representations • Hierarchical structure • Feature values themselves reference nonatomic objects. • Network structure • Typically a semantic network. • The Semantic Web describes the relationships between things (such as "a tire is a part of a car" and "John Lennon was a member of the Beatles") and the properties of things (such as size, weight, age, and price). • Flow structure • Shares many of the characteristics of hierarchical and network representations. For example, a workflow or job flow.
String and sequence representations • Used for free text (unstructured data). • The most straightforward representation of free text that supports similarity assessment is the bag-of-words strategy from information retrieval.
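The bag-of-words strategy can be illustrated with a minimal sketch. The tokenizer and the choice of cosine similarity are assumptions made for illustration; the slide names only the bag-of-words representation, and the function names `bag_of_words` and `cosine_similarity` are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count word occurrences; word order is discarded."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (assumed measure)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text shares nothing with any other text
    return dot / (norm_a * norm_b)


doc1 = bag_of_words("The cat sat on the mat")
doc2 = bag_of_words("The cat lay on the rug")
print(cosine_similarity(doc1, doc2))
```

Two texts with no words in common score 0, and a text compared with itself scores 1 (up to floating-point rounding).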
Direct similarity mechanisms • Similarity and distance metrics • k-NN • Set-theoretic measures • Jaccard index, Dice similarity • Kullback-Leibler divergence and the χ² statistic • Compare two images described as histograms. • Symbolic attributes in taxonomies • Case representation is organized by feature values into a taxonomy of is-a relationships.
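The set-theoretic and histogram measures listed above can be sketched in a few lines. This is an illustrative sketch using the standard textbook definitions, not code from the paper; the function names are hypothetical, and the convention that two empty sets have similarity 1 is an assumption.

```python
import math


def jaccard(a, b):
    """Jaccard index: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention for two empty sets
    return len(a & b) / len(a | b)


def dice(a, b):
    """Dice similarity: 2|A ∩ B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention for two empty sets
    return 2 * len(a & b) / (len(a) + len(b))


def kl_divergence(p, q):
    """Kullback-Leibler divergence between two normalized histograms.

    Assumes q[i] > 0 wherever p[i] > 0; note that it is not symmetric.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))
print(dice({"a", "b", "c"}, {"b", "c", "d"}))
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))
```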
Transformation-based measures I • Edit distance • The number of edit operations needed to transform one string into another. • From cat to rat is 1, from cats to cat is 1. • Alignment measures for biological sequences • A variety of sequence alignment measures are used in biology (e.g., for DNA).
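The edit distance examples above can be reproduced with a short dynamic-programming sketch. It assumes the common Levenshtein variant, in which insertion, deletion, and substitution each cost 1; the slide does not specify the operation costs, and the function name `edit_distance` is hypothetical.

```python
def edit_distance(s, t):
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    # previous[j] holds the distance between the processed prefix of s and t[:j].
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            substitute = previous[j - 1] + (cs != ct)
            delete = previous[j] + 1
            insert = current[j - 1] + 1
            current.append(min(substitute, delete, insert))
        previous = current
    return previous[-1]


print(edit_distance("cat", "rat"))   # one substitution
print(edit_distance("cats", "cat"))  # one deletion
```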
Transformation-based measures II • Earth mover distance • A transformation-based distance for image data.
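For intuition, the earth mover distance has a simple closed form in the one-dimensional case. The sketch below assumes two histograms over the same ordered bins, with equal total mass and a ground distance of 1 between adjacent bins; the general image case requires solving a transportation problem and is not shown. The function name `emd_1d` is hypothetical.

```python
def emd_1d(p, q):
    """Earth mover distance between two 1-D histograms on the same bins.

    Assumes equal total mass and unit distance between adjacent bins.
    The surplus carried past each bin boundary is the mass that must
    be moved across it.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    if abs(sum(p) - sum(q)) > 1e-9:
        raise ValueError("histograms must have the same total mass")
    carried = 0.0
    total = 0.0
    for pi, qi in zip(p, q):
        carried += pi - qi
        total += abs(carried)
    return total


print(emd_1d([1, 0, 0], [0, 0, 1]))          # all mass moves two bins
print(emd_1d([0.5, 0.5, 0], [0, 0.5, 0.5]))  # half the mass moves two bins
```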
Transformation-based measures III • Similarity for networks and graphs • Structure mapping engine (SME) • Identify the appropriate mapping between the two domains.
Information-theoretic measures • They work directly on the raw case representation. • Compression-based similarity for text • For two very similar documents, the compressed size of both together will not be much greater than the compressed size of one alone. • Information-based similarity for biological sequences • Specialized algorithms are required to compress them. • Similarity in a taxonomy • Distinguish the weight of the is-a relationships between features. • The information content of a node in the taxonomy can be quantified as its negative log likelihood. • The similarity of two features is the information content of their common parent node with the highest value.
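The compression idea can be sketched with a general-purpose compressor. The example below uses the normalized compression distance (NCD) with `zlib` as one concrete instance; the choice of NCD, the compressor, and the function names are assumptions for illustration and may differ from the specific measures covered in the paper.

```python
import zlib


def compressed_size(data):
    """Size in bytes of the zlib-compressed data."""
    return len(zlib.compress(data, 9))


def ncd(x, y):
    """Normalized compression distance: small when x and y share structure."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


text = b"the quick brown fox jumps over the lazy dog " * 20
other = bytes(range(256)) * 4  # unrelated content

print(ncd(text, text))   # close to 0: the second copy compresses almost for free
print(ncd(text, other))  # close to 1: little shared structure
```

Because real compressors are imperfect, the distance of a document to itself is small but generally not exactly zero.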
Emergent measures I • Random forests • An ensemble of decision trees. • For each ensemble member, draw a bootstrap sample of the N training cases and build a decision tree that considers only a small random subset of the features at each split (m << M). • Track the frequency with which pairs of cases are located at the same leaf node. • The more often two cases share a leaf node, the more similar they are.
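The leaf co-occurrence step can be sketched independently of any tree-learning library. The sketch assumes the forest has already been trained and that `leaf_ids[t][i]` gives the leaf reached by case `i` in tree `t`; this input format and the function name are hypothetical.

```python
def leaf_cooccurrence_similarity(leaf_ids):
    """Fraction of trees in which each pair of cases lands in the same leaf.

    leaf_ids[t][i] is the (assumed) leaf identifier of case i in tree t.
    Returns an n_cases x n_cases similarity matrix with values in [0, 1].
    """
    n_trees = len(leaf_ids)
    n_cases = len(leaf_ids[0])
    counts = [[0] * n_cases for _ in range(n_cases)]
    for tree in leaf_ids:
        for i in range(n_cases):
            for j in range(n_cases):
                if tree[i] == tree[j]:
                    counts[i][j] += 1
    return [[c / n_trees for c in row] for row in counts]


# Three trees, three cases: cases 0 and 1 share a leaf in two of three trees.
leaves = [
    [0, 0, 1],
    [2, 2, 2],
    [5, 6, 6],
]
for row in leaf_cooccurrence_similarity(leaves):
    print(row)
```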
Emergent measures II • Cluster kernels • A semi-supervised learning technique, where only some of the available data are labeled. • Assumes that class labels do not change in regions of high density. • Cluster kernels allow the unlabeled data to influence similarity. • K(xi, xj) = K(xi, xj)orig · K(xi, xj)bag, where K(xi, xj)orig is a basic neighborhood kernel and K(xi, xj)bag is a kernel derived from repeated clustering of all the data.
Emergent measures III • Web-based kernel • Measures the similarity of two text snippets by comparing the documents returned when each is submitted as a Web search query.
Implications for CBR research • Vocabulary knowledge container • In some circumstances (e.g., information-theoretic measures) the role of the similarity knowledge container is increased. • Speed-up techniques • The new methodologies are typically computationally intensive, so strategies for speeding up case retrieval become more important.
Conclusions • Similarity measurement taxonomy • Organizes the broad range of strategies for similarity assessment in CBR into a coherent taxonomy. • Improved effectiveness of CBR • An alternative metric can offer better accuracy because it embodies specific knowledge about the data.
Comments • Advantage • This paper introduces and discusses those alternative metrics of similarity assessment for CBR. • Drawback • . • Application • Similarity measurement.