Text Similarity & Clustering Qinpei Zhao 15.Feb.2011
Outline • String matching metrics • Implementation and applications • Online Resources • Location-based clustering
Exact String Matching • Given a text string T of length n and a pattern string P of length m, the exact string matching problem is to find all occurrences of P in T. • Example: T = “AGCTTGA”, P = “GCT” • Applications: • Searching for keywords in a file • Search engines (like Google) • Database searching
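A minimal sketch of the problem with a naive scan (real search engines use indexes or linear-time algorithms such as Knuth-Morris-Pratt or Boyer-Moore):

```python
def find_all_occurrences(text: str, pattern: str) -> list:
    """Return the start index of every occurrence of pattern in text (naive scan)."""
    n, m = len(text), len(pattern)
    return [i for i in range(n - m + 1) if text[i:i + m] == pattern]

# Example from the slide: P = "GCT" occurs once in T = "AGCTTGA", at index 1.
print(find_all_occurrences("AGCTTGA", "GCT"))  # [1]
```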
Approximate String Matching • Determine whether a text string T of length n and a pattern string P of length m “partially” match. • Consider the string “approximate”. Which of these are partial matches? aproximate, approximately, appropriate, proximate, approx, approximat, apropos, approxximate • A partial match can be thought of as one that has k differences from the string, where k is some small integer (for instance 1 or 2). • A difference occurs if string1.charAt(j) != string2.charAt(j), or if string1.charAt(j) does not appear in string2 (or vice versa). • The former case is a revise (substitution) difference, the latter a delete or insert difference. • What about two characters that appear out of position? For instance, approximate vs. apporximate?
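A rough sketch of this informal, position-aligned definition; it also shows why such counting breaks down once characters shift out of position, which motivates the edit distance on the next slides:

```python
def difference_count(s1: str, s2: str) -> int:
    """Position-by-position comparison per this slide's informal definition:
    a revise (substitution) difference where aligned characters differ, plus one
    insert/delete difference for each unmatched trailing character."""
    shared = min(len(s1), len(s2))
    revises = sum(1 for j in range(shared) if s1[j] != s2[j])
    return revises + abs(len(s1) - len(s2))

# A single deletion ("aproximate") misaligns everything after it, so this naive
# count overstates the differences; edit distance handles such shifts properly.
print(difference_count("approximate", "aproximate"))  # 9, although only 1 edit apart
```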
Approximate String Matching • Example query: “Schwarrzenger” (a misspelling of “Schwarzenegger”) • Query errors: • Limited knowledge about the data • Typos • Limited input device (e.g. cell phone keypad) • Data errors: • Typos • Web data • OCR errors • Similarity functions: • Edit distance • Q-gram • Cosine • … • Applications: • Spell checking • Query relaxation • …
Edit distance (Levenshtein distance) • Given two strings T and P, the edit distance is the minimum number of substitutions, insertions and deletions that will transform T into P. • Time complexity by dynamic programming: O(mn)
Edit distance (1974) • Dynamic programming recurrence: m[i][j] = min{ m[i-1][j] + 1, m[i][j-1] + 1, m[i-1][j-1] + d(i,j) }, where d(i,j) = 0 if T[i] = P[j], and d(i,j) = 1 otherwise
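A minimal implementation of this recurrence (the Wagner-Fischer dynamic program); the example strings are the ones reused on the q-gram slides:

```python
def edit_distance(t: str, p: str) -> int:
    """Wagner-Fischer dynamic programming; O(mn) time and space."""
    n, m = len(t), len(p)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                      # delete all characters of t[:i]
    for j in range(m + 1):
        d[0][j] = j                      # insert all characters of p[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if t[i - 1] == p[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution (or match)
    return d[n][m]

print(edit_distance("bingo", "going"))  # 3
```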
Q-grams • Substrings of fixed length q (the figure showed the 2-grams of “bingo”) • Count filter: if ed(T, P) <= k, then (# of common grams) >= (# of T's grams) - k * q
Q-grams (example, q = 2) • T = “bingo”, P = “going” • gram1 = {#b, bi, in, ng, go, o#} • gram2 = {#g, go, oi, in, ng, g#} • Unique(gram1, gram2) = {#b, bi, in, ng, go, o#, #g, oi, g#} • gram1.length = (T.length + (q - 1) * 2 + 1) - q = 6; gram2.length = (P.length + (q - 1) * 2 + 1) - q = 6 • L = gram1.length + gram2.length = 12 • Similarity = (L - difference) / L, where difference is the number of grams not shared between the two gram sets (here 6, giving similarity 0.5)
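A sketch of the computation above, taking the “difference” to be the total count mismatch between the two padded q-gram multisets:

```python
from collections import Counter

def qgrams(s: str, q: int = 2) -> Counter:
    """Multiset of q-grams of s, padded with '#' on both sides."""
    padded = "#" * (q - 1) + s + "#" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))

def qgram_similarity(t: str, p: str, q: int = 2) -> float:
    g1, g2 = qgrams(t, q), qgrams(p, q)
    total = sum(g1.values()) + sum(g2.values())                 # L = gram1.length + gram2.length
    difference = sum(abs(g1[g] - g2[g]) for g in set(g1) | set(g2))
    return (total - difference) / total

print(qgram_similarity("bingo", "going"))  # 0.5: 3 common grams (in, ng, go) out of 6 + 6
```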
Cosine similarity • For two vectors A and B, the cosine of the angle θ between them is expressed with the dot product and magnitudes: cos θ = (A · B) / (||A|| ||B||) • Implementation for term sets: Cosine similarity = (Common Terms) / (sqrt(Number of terms in String1) * sqrt(Number of terms in String2))
Cosine similarity (example) • T = “bingo right”, P = “going right” • T1 = {bingo, right}, P1 = {going, right} • L1 = unique(T1).length; L2 = unique(P1).length • Unique(T1 & P1) = {bingo, right, going}; L3 = Unique(T1 & P1).length • Common terms = (L1 + L2) - L3 • Similarity = Common terms / (sqrt(L1) * sqrt(L2)) = 1 / (sqrt(2) * sqrt(2)) = 0.5
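A minimal sketch of this token-set cosine similarity:

```python
import math

def cosine_similarity(t: str, p: str) -> float:
    """Cosine similarity over unique word tokens (binary term weights)."""
    t_terms, p_terms = set(t.split()), set(p.split())
    common = len(t_terms & p_terms)
    return common / (math.sqrt(len(t_terms)) * math.sqrt(len(p_terms)))

print(cosine_similarity("bingo right", "going right"))  # 1 / (sqrt(2)*sqrt(2)) = 0.5
```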
Dice coefficient • Similar to cosine similarity • Dice's coefficient = (2 * Common Terms) / (Number of terms in String1 + Number of terms in String2)
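And the corresponding Dice coefficient on the same token sets:

```python
def dice_coefficient(t: str, p: str) -> float:
    """Dice's coefficient over unique word tokens."""
    t_terms, p_terms = set(t.split()), set(p.split())
    common = len(t_terms & p_terms)
    return 2 * common / (len(t_terms) + len(p_terms))

print(dice_coefficient("bingo right", "going right"))  # 2*1 / (2+2) = 0.5
```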
Similarity metrics • Edit distance • Q-gram • Cosine distance • Dice coefficient • … • Demo: similarity between two strings
Applications in MOPSI • Duplicate record cleaning • Spell checking (e.g. “Communication” vs. “comunication”) • Query relevance/expansion • Text-level annotation recommendation * • Keyword clustering * • MOPSI search engine **
String clustering • The similarity between every string pair is calculated as a basis for determining the clusters • Using the vector model for clustering • A similarity measure is required to calculate the similarity between two strings.
String clustering (Cont.) • The final step in creating clusters is to determine when two objects (words) belong to the same cluster • Hierarchical agglomerative clustering (HAC) – start with un-clustered items and use pair-wise similarities to merge them into clusters (see the sketch below) • Hierarchical divisive clustering – start with one cluster and break it down into smaller clusters
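A minimal HAC sketch over a pairwise string-distance matrix, assuming NumPy/SciPy are available; difflib's similarity ratio is used here only as a convenient stand-in for the metrics above:

```python
import difflib
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

words = ["approximate", "aproximate", "approx", "bingo", "going"]

def string_distance(a: str, b: str) -> float:
    # 1 - difflib similarity ratio as a simple stand-in distance; any of the
    # metrics above (edit distance, q-grams, cosine, Dice) could be used instead.
    return 1.0 - difflib.SequenceMatcher(None, a, b).ratio()

D = np.array([[string_distance(a, b) for b in words] for a in words])
D = (D + D.T) / 2.0                                  # enforce symmetry for squareform
Z = linkage(squareform(D), method="average")         # agglomerative (bottom-up) merging
labels = fcluster(Z, t=0.5, criterion="distance")    # cut the dendrogram at distance 0.5
print(dict(zip(words, labels)))  # typically groups the "approx*" variants together, and bingo/going together
```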
Objectives of a Hierarchy of Clusters • Reduce the overhead of search • Perform top-down searches of the centroids of the clusters in the hierarchy and trim branches that are not relevant • Provide a visual representation of the information space • Visual cues on the size of clusters (size of ellipse) and the strength of the linkage between clusters (dashed line, solid line, …) • Expand the retrieval of relevant items • A user, once having identified an item of interest, can request to see other items in the cluster • The user can increase the specificity of items by going to child clusters, or increase the generality of items being reviewed by going to a parent cluster
Keyword clustering (semantic) • Thesaurus-based: WordNet • An advanced web interface to browse the WordNet database • Thesauri are not available for every language, e.g. Finnish • Example
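A small sketch of a thesaurus-based relatedness check using NLTK's WordNet interface (assumes the nltk package and the WordNet corpus are installed):

```python
from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

def share_synset(w1: str, w2: str) -> bool:
    """Treat two keywords as semantically related if any of their WordNet synsets overlap."""
    return bool(set(wn.synsets(w1)) & set(wn.synsets(w2)))

print(share_synset("car", "automobile"))   # True: both belong to the synset car.n.01
print(share_synset("car", "bingo"))        # False
```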
Useful resources • Similarity metrics (http://staffwww.dcs.shef.ac.uk/people/S.Chapman/stringmetrics.html ) • Similarity metrics (javascript) (http://cs.joensuu.fi/~zhao/Link/ ) • Flamingo package (http://flamingo.ics.uci.edu/releases/4.0/ ) • WordNet (http://wordnet.princeton.edu/wordnet/related-projects/ )
DBSCAN: density-based clustering (KDD’96) • Parameters: MinPts, eps • Time complexity: O(log n) – getNeighbors (with a spatial index); O(n log n) – total • Advantages: • Clusters of arbitrary shape • Noise is taken into account
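A minimal location-clustering sketch with scikit-learn's DBSCAN; the coordinates and the eps/MinPts values below are illustrative assumptions, not the Mopsi data or settings:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Example (lon, lat) points around Joensuu (~29.76, 62.60) and Helsinki (~24.9, 60.2).
points = np.array([
    [29.76, 62.60], [29.78, 62.61], [29.74, 62.59],   # Joensuu area
    [24.94, 60.17], [24.95, 60.18],                    # Helsinki area
    [27.00, 61.00],                                     # an isolated point -> noise
])

db = DBSCAN(eps=0.1, min_samples=2).fit(points)   # eps in degrees here, purely illustrative
print(db.labels_)                                 # e.g. [0 0 0 1 1 -1]; label -1 marks noise
```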
DBSCAN result: clustering of location data around Joensuu (29.76, 62.60) and Helsinki (24, 60)
Gaussian Mixture Model • Maximum likelihood estimation (via the Expectation-Maximization algorithm) • Parameters required: • Number of components • Number of iterations • Advantages: • Probabilistic (fuzzy) cluster memberships
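A corresponding sketch with scikit-learn's GaussianMixture, fitted by EM; the points and the number of components are the same illustrative assumptions as in the DBSCAN sketch:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Example (lon, lat) points around Joensuu and Helsinki, as in the DBSCAN sketch.
points = np.array([
    [29.76, 62.60], [29.78, 62.61], [29.74, 62.59],   # Joensuu area
    [24.94, 60.17], [24.95, 60.18], [24.90, 60.20],   # Helsinki area
])

gmm = GaussianMixture(n_components=2, max_iter=100, random_state=0).fit(points)
print(gmm.means_)                      # component centres, close to the two cities
print(gmm.predict_proba(points[:1]))   # soft (probabilistic) membership of the first point
```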
GMM result: clustering of the same location data around Joensuu (29.76, 62.60) and Helsinki (24, 60)