Similarity Metric for Strings and Graphs

This presentation covers similarity metrics for strings and graphs. It defines distance over general structures in terms of the overlap of their substructures, reviews existing approaches to measuring similarity, introduces the Exhaustive Substructure Vector Space (ESVS), and works through graph and string distance examples.

Presentation Transcript


  1. Similarity Metric for Strings and Graphs • Dr. David Dailey (david.dailey@sru.edu) • Dr. Beverly Gocal (beverly.gocal@sru.edu) • Dr. Deborah Whitfield (deborah.whitfield@sru.edu)

  2. Outline • Introduction • Graph Distance • String Distance • Definitions • Examples • Implementation • Theoretical Results • String Space Examples

  3. Problem Framework • Distance may be defined for any structure • Based on the overlap of the substructures of two structures • Strings • Graphs • Algebraic structures • Semigroups • Trees • Web site and web page similarity

  4. Background • Past 15 years • Over 20 papers on graph similarity • Several more on string similarity • Semigroup • Let T=(S, A) together with the concatenation operation, where A consists of the set of axioms • $\forall x, y \in S,\ xy \in S$ • $\forall x, y, z \in S,\ x(yz) = (xy)z$

  5. Graph and String • Graph: Let T=(S, A) together with a relation ~, where A consists of the set of axioms • $\forall x, y \in S,\ x \sim y \Rightarrow y \sim x$ • $\forall x \in S,\ \neg(x \sim x)$ • String: Let T=(S, A) together with an associative operation (expressed by concatenation) • Then let $S^n$ be defined recursively by • $S^1 = S$ and • $S^n = S \times S^{n-1}$, and • let $S^*$ be defined as the infinite union of ordered tuples: $S^1 \cup S^2 \cup \dots \cup S^n \cup \dots$
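
As a small illustration of the recursive definition of the tuple sets, the sketch below enumerates the tuples of length 1 through a chosen cutoff for a finite set S and writes each tuple as a string. The names `tuples_of_length` and `star_up_to` and the cutoff `max_len` are hypothetical, introduced only for this example; the full union is infinite, so only a finite prefix can be listed.

```python
from itertools import product


def tuples_of_length(alphabet, n):
    """All ordered n-tuples over the alphabet, each joined into a string."""
    return ["".join(t) for t in product(sorted(alphabet), repeat=n)]


def star_up_to(alphabet, max_len):
    """Finite prefix of the infinite union: tuples of length 1..max_len."""
    result = []
    for n in range(1, max_len + 1):
        result.extend(tuples_of_length(alphabet, n))
    return result


if __name__ == "__main__":
    # Three symbols give 3 + 9 + 27 = 39 strings of length 1 to 3.
    print(len(star_up_to("abc", 3)))
```

For a three-symbol set, lengths 1 through 3 give 3 + 9 + 27 = 39 strings.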

  6. Approaches • Levenshtein distance: the minimum number of single-character insertions, deletions, and substitutions needed to transform one string into the other • Largest shared substructure • Smallest superstructure • All of these approaches are relative
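
The first approach in this list can be sketched with the standard dynamic-programming recurrence. This is a textbook formulation rather than code from the presentation, and it assumes unit cost for each insertion, deletion, and substitution; the function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(s, t):
    """Minimum number of unit-cost insertions, deletions, and substitutions
    needed to turn s into t, computed row by row."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(
                prev[j] + 1,         # delete cs
                curr[j - 1] + 1,     # insert ct
                prev[j - 1] + cost,  # substitute, or keep on a match
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(levenshtein("abaac", "cbaac"))
    print(levenshtein("aabc", "abcd"))
```

For the string pairs used later in the slides, this gives 1 for abaac and cbaac (one substitution) and 2 for aabc and abcd (one deletion and one insertion).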

  7. Exhaustive Substructure Vector Space (ESVS) • Enumerate all substructures within T and U • Union those two sets: $Z = T^* \cup U^*$ • Treat each element of Z as one dimension of a |Z|-dimensional vector space • Let $z(T)$ be the number of occurrences of structure z as a substructure of T • Calculate the Minkowski distance $d(T, U) = \left( \sum_{z \in Z} |z(T) - z(U)|^p \right)^{1/p}$
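
For strings, the steps above can be sketched as follows, assuming that "substructure" means a contiguous non-empty substring and that every occurrence is counted. The names `substring_counts` and `esvs_distance` and the parameter `p` are assumptions for illustration. The slides do not state which Minkowski order is used; p = 1 is the default here because it reproduces the value Distance(aabc, abcd) = 8 reported in the examples.

```python
from collections import Counter


def substring_counts(s):
    """Count every occurrence of every non-empty contiguous substring of s."""
    return Counter(
        s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)
    )


def esvs_distance(t, u, p=1):
    """Minkowski distance of order p between the substring-count vectors."""
    ct, cu = substring_counts(t), substring_counts(u)
    z = set(ct) | set(cu)  # Z: union of the two substring sets
    # A Counter returns 0 for a missing key, so absent substrings count as 0.
    total = sum(abs(ct[k] - cu[k]) ** p for k in z)
    return total ** (1 / p)


if __name__ == "__main__":
    print(esvs_distance("aabc", "abcd"))       # order 1 (Manhattan)
    print(esvs_distance("aabc", "abcd", p=2))  # order 2 (Euclidean)
```

Enumerating all substrings takes quadratic space in the string length, which is acceptable for short examples like these but would need a more compact index for long inputs.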

  8. Graph Distance

  9. String Distance Example • Alphabet S = {a, b, c}, $\alpha$ = abaac and $\beta$ = cbaac • $\alpha^*$ = {a, b, c, ab, ba, aa, ac, aba, baa, aac, abaa, baac, abaac} • $\beta^*$ = {a, b, c, cb, ba, aa, ac, cba, baa, aac, cbaa, baac, cbaac} • $Z = \alpha^* \cup \beta^*$ = {a, b, c, ab, cb, ba, aa, ac, cba, aba, baa, aac, cbaa, abaa, baac, cbaac, abaac} (cb, cba, cbaa, cbaac are unique to $\beta^*$; ab, aba, abaa, abaac are unique to $\alpha^*$) • Equal frequency: I = {b, ba, aa, ac, baa, aac, baac} • Different frequency: D = {a, c} (a occurs 3 times in $\alpha$ and 2 times in $\beta$; c occurs once in $\alpha$ and twice in $\beta$) • Unique: O = {ab, cb, cba, aba, cbaa, abaa, cbaac, abaac} • |I| = 7, |D| = 2, and |O| = 8

  10. String Distance Example • |I| = 7, |D| = 2, and |O| = 8 • |I| + |D| + |O| = |Z| = 17 • Contribution of O is |O| = 8, since each unique substring occurs exactly once • Contribution of I is 0, since these substrings appear equally often • Contribution of D in this case is 2: the counts of a and of c each differ by 1 • $d(\alpha, \beta)$ = contribution(I) + contribution(D) + contribution(O) = 0 + 2 + 8 = 10
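
The split into equal-frequency, different-frequency, and unique substrings can be recomputed mechanically for abaac and cbaac. The sketch below assumes contiguous substrings with every occurrence counted and the order-1 (Manhattan) distance; `substring_counts` and `partition` are hypothetical helper names, not part of the presentation.

```python
from collections import Counter


def substring_counts(s):
    """Count every occurrence of every non-empty contiguous substring of s."""
    return Counter(
        s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)
    )


def partition(t, u):
    """Split the union of substrings into the sets I, D, and O and return
    them together with each set's contribution to the order-1 distance."""
    ct, cu = substring_counts(t), substring_counts(u)
    shared = set(ct) & set(cu)
    equal = {z for z in shared if ct[z] == cu[z]}  # I: same count in both
    different = shared - equal                     # D: shared, counts differ
    unique = set(ct) ^ set(cu)                     # O: in exactly one string
    contrib = {
        "I": 0,
        "D": sum(abs(ct[z] - cu[z]) for z in different),
        "O": sum(ct[z] + cu[z] for z in unique),
    }
    return equal, different, unique, contrib


if __name__ == "__main__":
    I, D, O, contrib = partition("abaac", "cbaac")
    by_length = lambda z: (len(z), z)
    print("I:", sorted(I, key=by_length))
    print("D:", sorted(D, key=by_length))
    print("O:", sorted(O, key=by_length))
    print("contributions:", contrib, "distance:", sum(contrib.values()))
```

Running it prints the three sets and the per-set contributions, which makes it easy to confirm the set sizes and the final sum without enumerating substrings by hand.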

  11. Examples • A = aabc, B = abcd • Substrings of A, with repeats: S = {a, a, aa, aab, aabc, ab, abc, b, bc, c} • Substrings of B: T = {a, ab, abc, abcd, b, bc, bcd, c, cd, d} • Counts for S: a:2 aa:1 aab:1 aabc:1 ab:1 abc:1 b:1 bc:1 c:1 • Counts for T: a:1 ab:1 abc:1 abcd:1 b:1 bc:1 bcd:1 c:1 cd:1 d:1 • Differences: a:1 aa:1 aab:1 aabc:1 ab:0 abc:0 abcd:1 b:0 bc:0 bcd:1 c:0 cd:1 d:1 • Distance(aabc, abcd) = 8
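
Written out as a sum, and assuming the order-1 (Manhattan) form of the Minkowski distance, only the eight substrings whose counts differ contribute; ab, abc, b, bc, and c cancel:

```latex
\[
d(A, B) = \sum_{z \in Z} \lvert z(A) - z(B) \rvert
= \underbrace{1}_{\text{a}} + \underbrace{1}_{\text{aa}}
+ \underbrace{1}_{\text{aab}} + \underbrace{1}_{\text{aabc}}
+ \underbrace{1}_{\text{abcd}} + \underbrace{1}_{\text{bcd}}
+ \underbrace{1}_{\text{cd}} + \underbrace{1}_{\text{d}} = 8
\]
```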

  12. Examples • Computing these distances by hand is too tedious • Online implementation: http://srufaculty.sru.edu/david.dailey/javascript/StringDistances.html • Distance(aabc, abcd) = 8

  13. Theoretical Results • Conjecture: if $|\alpha| = |\beta| = n$ and $\alpha$ and $\beta$ share no substrings (i.e., $|I \cup D| = 0$), then $d(\alpha, \beta) = n(n+1)$ • Lemma: if $\alpha = a^n$, then $d(\alpha, \alpha\alpha) = n^2 + n(n+1)/2$ • Conjecture: if $|\alpha| = |\beta| = n$, then $d(\alpha, \alpha\alpha) = d(\alpha, \alpha\beta) = d(\beta, \alpha\beta) = d(\beta, \beta\beta) = n^2 + n(n+1)/2$
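
These statements can be spot-checked by brute force on short strings. The sketch below assumes the order-1 substring-count distance and reads the lemma as concerning the string made of the letter a repeated n times; `check_small_cases`, its parameters, and the two test alphabets are hypothetical choices for illustration. A finite check of this kind is supporting evidence only, not a proof.

```python
from collections import Counter
from itertools import product


def substring_counts(s):
    """Count every occurrence of every non-empty contiguous substring of s."""
    return Counter(
        s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)
    )


def d(t, u):
    """Order-1 distance between the substring-count vectors of t and u."""
    ct, cu = substring_counts(t), substring_counts(u)
    return sum(abs(ct[z] - cu[z]) for z in set(ct) | set(cu))


def check_small_cases(max_n=4, alphabet="ab", other_alphabet="cd"):
    """Assert the three statements for all strings of length 1..max_n."""
    for n in range(1, max_n + 1):
        expected = n * n + n * (n + 1) // 2
        # Lemma: a repeated n times versus a repeated 2n times.
        assert d("a" * n, "a" * (2 * n)) == expected
        words = ["".join(w) for w in product(alphabet, repeat=n)]
        others = ["".join(w) for w in product(other_alphabet, repeat=n)]
        for alpha in words:
            # Disjoint alphabets guarantee no shared substrings.
            for gamma in others:
                assert d(alpha, gamma) == n * (n + 1)
            # Concatenation conjecture for every pair of equal length.
            for beta in words:
                assert d(alpha, alpha + alpha) == expected
                assert d(alpha, alpha + beta) == expected
                assert d(beta, alpha + beta) == expected
                assert d(beta, beta + beta) == expected
    return True


if __name__ == "__main__":
    print(check_small_cases())
```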

  14. Explorations of String Space • Pretty pics

  15. Conclusion • Exhaustive substructure vector space • Calculate distance • Interesting observations used to study structure similarity based on size
