This comprehensive study delves into similarity metrics for strings and graphs, offering theoretical insights with practical examples. Covering distance definitions, overlap of substructures, and various approaches to measuring similarity, this exploration provides valuable insights for analyzing structure similarities. The Exhaustive Substructure Vector Space and Graph/String Distance examples offer practical applications in this fascinating field.
Similarity Metric for Strings and Graphs Dr. David Dailey david.dailey@sru.edu Dr. Beverly Gocal beverly.gocal@sru.edu Dr. Deborah Whitfield deborah.whitfield@sru.edu
Outline • Introduction • Graph distance • String Distance • Definitions • Examples • Implementation • Theoretical Results • String Space Examples
Problem Framework • Distance • may be defined for any structure • Overlap of the substructures of two structures • Strings • Graphs • Algebraic structures • Semi-groups • Trees • Web site and web page similarity
Background • Past 15 years • Over 20 papers on graph similarity • Several more on string similarity • Semi-Group • Let T=(S, A) together with the concatenation operation, where A consists of the set of axioms • ∀ x, y ∈ S, xy ∈ S • ∀ x, y, z ∈ S, x(yz) = (xy)z
Graph and String • Graph: Let T=(S, A) together with a relation ~, where A consists of the set of axioms • ∀ x, y ∈ S, x ~ y ⇒ y ~ x • ∀ x ∈ S, ¬(x ~ x) • String: Let T=(S, A) together with an associative operation (expressed by concatenation). • Then let S^n be defined recursively by • S^1 = S and • S^n = S × S^(n-1) and • S* be defined as the infinite union of ordered tuples: S^1 ∪ S^2 ∪ … ∪ S^n ∪ …
Approaches • Levenshtein distance: the minimum number of single-character edits (insertions, deletions, substitutions) that transform one string into the other • Largest shared substructure • Smallest superstructure • All of these approaches are relative
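For comparison with the substructure-based approach developed in these slides, here is a minimal sketch of the Levenshtein distance using the standard dynamic-programming recurrence. The function name `levenshtein` is illustrative and does not come from the slides.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(levenshtein("aabc", "abcd"))
```

Only two rows of the table are kept at a time, so memory grows with the length of `b` rather than with the product of the two lengths.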
Exhaustive Substructure Vector Space (ESVS) • Enumerate all substructures within T and U, giving the sets T* and U* • Take the union of those two sets: Z = T* ∪ U* • Treat Z as the basis of a |Z|-dimensional vector space • For each z ∈ Z, let z(T) be the number of occurrences of z as a substructure of T • Calculate the Minkowski distance d(T,U) between the two count vectors
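The ESVS steps can be sketched for strings as follows, taking "substructure" to mean a non-empty contiguous substring. The names `substring_counts` and `esvs_distance` and the parameter `p` are illustrative assumptions, not taken from the slides. The default order p = 1 (Manhattan distance) is also an assumption, chosen because it matches the value 8 that the slides report for aabc and abcd.

```python
from collections import Counter


def substring_counts(s):
    """Count every occurrence of every non-empty contiguous substring of s."""
    return Counter(s[i:j]
                   for i in range(len(s))
                   for j in range(i + 1, len(s) + 1))


def esvs_distance(t, u, p=1):
    """Minkowski distance of order p between the substring-count vectors."""
    ct, cu = substring_counts(t), substring_counts(u)
    z = set(ct) | set(cu)  # Z = T* ∪ U*
    # A Counter returns 0 for a missing key, so absent substrings count as 0.
    return sum(abs(ct[k] - cu[k]) ** p for k in z) ** (1 / p)


print(esvs_distance("aabc", "abcd"))       # order 1 (Manhattan)
print(esvs_distance("aabc", "abcd", p=2))  # order 2 (Euclidean)
```

A string of length n has n(n+1)/2 substring occurrences, so this brute-force enumeration is practical only for short strings.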
String Distance Example • Alphabet S = {a,b,c}, α = abaac and β = cbaac • α* = {a, b, c, ab, ba, aa, ac, aba, baa, aac, abaa, baac, abaac} • β* = {a, b, c, cb, ba, aa, ac, cba, baa, aac, cbaa, baac, cbaac} • Z = {a, b, c, ab, cb, ba, aa, ac, cba, aba, baa, aac, cbaa, abaa, baac, cbaac, abaac} • Equal frequency: I = {b, ba, aa, ac, baa, aac, baac} • Different frequency: D = {a, c} (a occurs 3 times in α and 2 times in β; c occurs once in α and twice in β) • Unique to α* or β*: O = {ab, cb, cba, aba, cbaa, abaa, cbaac, abaac} • |I| = 7, |D| = 2, and |O| = 8
String Distance Example • |I| = 7, |D| = 2, and |O| = 8 • |I| + |D| + |O| = |Z| = 17 • Contribution of O is |O| = 8, since each unique substring occurs exactly once • Contribution of I is 0: these substrings appear equally often • Contribution of D is |3 − 2| + |1 − 2| = 2 • d(α,β) = contribution(I) + contribution(D) + contribution(O) = 0 + 2 + 8 = 10
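The partition of Z into I, D and O can be recomputed mechanically for any pair of strings. The sketch below uses illustrative names (`partition`, `contributions`) and assumes the order-1 (Manhattan) form of the distance, in which each substring contributes the absolute difference of its two occurrence counts.

```python
from collections import Counter


def substring_counts(s):
    """Count every occurrence of every non-empty contiguous substring of s."""
    return Counter(s[i:j]
                   for i in range(len(s))
                   for j in range(i + 1, len(s) + 1))


def partition(alpha, beta):
    """Split Z into equal-frequency, different-frequency, and unique substrings."""
    ca, cb = substring_counts(alpha), substring_counts(beta)
    shared = set(ca) & set(cb)
    equal_freq = {z for z in shared if ca[z] == cb[z]}  # the set I
    diff_freq = shared - equal_freq                      # the set D
    unique = set(ca) ^ set(cb)                           # the set O
    return equal_freq, diff_freq, unique


def contributions(alpha, beta):
    """Order-1 contribution of each of the three sets to d(alpha, beta)."""
    ca, cb = substring_counts(alpha), substring_counts(beta)
    parts = partition(alpha, beta)
    return tuple(sum(abs(ca[z] - cb[z]) for z in part) for part in parts)


equal_freq, diff_freq, unique = partition("abaac", "cbaac")
print(sorted(equal_freq), sorted(diff_freq), sorted(unique))
print(contributions("abaac", "cbaac"))
```

Running it on abaac and cbaac gives a quick way to check the hand-built sets; the distance is the sum of the three contributions.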
Examples • A = aabc, B = abcd • S = {a, a, aa, aab, aabc, ab, abc, b, bc, c} (substrings of A, with repeats) • T = {a, ab, abc, abcd, b, bc, bcd, c, cd, d} (substrings of B) • Counts for S and T • S: a:2 aa:1 aab:1 aabc:1 ab:1 abc:1 b:1 bc:1 c:1 • T: a:1 ab:1 abc:1 abcd:1 b:1 bc:1 bcd:1 c:1 cd:1 d:1 • Differences: a:1 aa:1 aab:1 aabc:1 ab:0 abc:0 abcd:1 b:0 bc:0 bcd:1 c:0 cd:1 d:1 • Distance (aabc, abcd) = 8
Examples • Too tedious by hand • http://srufaculty.sru.edu/david.dailey/javascript/StringDistances.html • Distance (aabc, abcd) = 8
Theoretical Results • Conjecture: if |α| = |β| = n and α and β share no substrings in common (i.e., |I ∪ D| = 0), then d(α,β) = n(n+1) • Lemma: if α = a^n, then d(α,αα) = n^2 + n(n+1)/2 • Conjecture: if |α| = |β| = n, then d(α,αα) = d(α,αβ) = d(β,αβ) = d(β,ββ) = n^2 + n(n+1)/2
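These statements can be spot-checked by brute force on short strings. The sketch below is a finite check, not a proof. It assumes the order-1 (Manhattan) distance over substring occurrence counts, and the names `d`, `formula` and `check` are illustrative.

```python
from collections import Counter
from itertools import product


def substring_counts(s):
    """Count every occurrence of every non-empty contiguous substring of s."""
    return Counter(s[i:j]
                   for i in range(len(s))
                   for j in range(i + 1, len(s) + 1))


def d(t, u):
    """Order-1 distance between the substring-count vectors of t and u."""
    ct, cu = substring_counts(t), substring_counts(u)
    return sum(abs(ct[z] - cu[z]) for z in set(ct) | set(cu))


def formula(n):
    """The closed form n^2 + n(n+1)/2 from the lemma."""
    return n * n + n * (n + 1) // 2


def check(max_n=4, alphabet="ab"):
    """Test the lemma and both conjectures for all words up to length max_n."""
    for n in range(1, max_n + 1):
        # Lemma: alpha = a^n compared with alpha alpha = a^(2n).
        assert d("a" * n, "a" * (2 * n)) == formula(n)
        # First conjecture: a^n and b^n share no substrings.
        assert d("a" * n, "b" * n) == n * (n + 1)
        # Second conjecture: every pair of length-n words over the alphabet.
        words = ["".join(w) for w in product(alphabet, repeat=n)]
        for alpha in words:
            for beta in words:
                for x, y in ((alpha, alpha + alpha), (alpha, alpha + beta),
                             (beta, alpha + beta), (beta, beta + beta)):
                    assert d(x, y) == formula(n)
    return True


print(check())
```

The number of word pairs grows exponentially with n, so `max_n` should stay small.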
Explorations of String Space • Pretty pics
Conclusion • Exhaustive substructure vector space • Calculate distance • Interesting observations used to study structure similarity based on size