1 / 16

Bitap Algorithm

Bitap Algorithm. Approximate string matching. Evlogi Hristov. Telerik Corporation. Student at Telerik Academy. Table of Contents. Levenshtein distance. Bitap overview. Bitap Exact search. Bitap Fuzzy search . Additional information. Levenshtein distance. Edit distance.

aggie
Download Presentation

Bitap Algorithm

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Bitap Algorithm Approximate string matching EvlogiHristov Telerik Corporation Student at Telerik Academy

  2. Table of Contents • Levenshtein distance. • Bitap overview. • Bitap Exact search. • Bitap Fuzzy search. • Additional information.

  3. Levenshtein distance Edit distance

  4. Levenshtein distance • Edit distance: Primitive operations necessary to convert the string into an exact match. • insertion: cot → coat • deletion: coat → cot • substitution: coat → cost • Example: • Set n to be the length of s = "GUMBO"Set m to be the length of t="GAMBOL"If n = 0, return m and exitIf m = 0, return n and exit

  5. Levenshtein distance (2) • Initialize matrix M [m + 1, n + 1] • Examine each character of s ( i from 1 to n ) • Examine each character of t ( j from 1 to m ) • If s[i] equals t[j], the cost is 0If s[i] is not equal to t[j], the cost is 1 • Set cell M[j, i] equal to the minimum of: • The cell immediately above plus 1: M [j-1, i] + 1 • The cell immediately to the left plus 1: M [j, i-1] + 1 • The cell diagonally above and to the left plus the cost: M [j-1, i-1] + cost • After the iteration steps (3, 4, 5, 6) are complete, the distance is found in the cell M [m - 1, n - 1]

  6. Levenstein distance (3) private int Levenshtein(string source, string target) { if (string.IsNullOrEmpty(source)) { if (!string.IsNullOrEmpty(target)) { return target.Length; } return 0; } if (string.IsNullOrEmpty(target)) { if (!string.IsNullOrEmpty(source)) { return source.Length; } return 0; } int[,] dist = new int[source.Length + 1, target.Length + 1]; int min1, min2, min3, cost; // ..continues on text page

  7. Levenstein distance (4) for (int i = 0; i < dist.GetLength(0); i += 1) { dist[i, 0] = i; } for (int i = 0; i < dist.GetLength(1); i += 1) { dist[0, i] = i; } for (int i = 1; i < dist.GetLength(0); i++) { for (int j = 1; j < dist.GetLength(1); j++) { cost = Convert.ToInt32(!(source[i-1] == target[j - 1])); min1 = dist[i - 1, j] + 1; min2 = dist[i, j - 1] + 1; min3 = dist[i - 1, j - 1] + cost; dist[i, j] = Math.Min(Math.Min(min1, min2), min3); } } return dist[dist.GetLength(0)-1,dist.GetLength(1)-1]; }

  8. Bitap algorithm shift-or/shift-and

  9. Bitap algorithm • Also known as the shift-or, shift-and orBaeza–Yates–Gonnet algorithm. • Aproximate string matching algorithm. • Approximate equality is defined in terms of Levenshtein distance. • Often used for fuzzy search without indexing. • Does most of the work with bitwise operations. • Runs in O(mn) operations, no matter the structure of the text or the pattern.

  10. Bitap Exact search(2) public static List<int> ExactMatch(string text, string pattern) { long[] alphabet = new long[128]; //ASCII range (0 – 127) for (int i = 0; i < pattern.Length; ++i) { int letter = (int)pattern[i]; alphabet[letter] = alphabet[letter] | (1 << i); } long result = 1; //0000 0001 List<int> indexes = new List<int>(); for (int index = 0; index < text.Length; index++) { result &= alphabet[text[index]]; //if result != pattern => result = 0 result = (result << 1) + 1; if ((result & (1 << pattern.Length)) > 0) { indexes.Add(index - pattern.Length + 1); } } return indexes; }

  11. Bitap Exact search • Example: text = cbdabababc pattern = ababc start res: = 1 text[i] text[i] bits: res: res: text[i] text[i] alphabet[a] = = 5 res: res: text[i] text[i] alphabet[b] = = 10 res: res: text[i] text[i] alphabet[c] = = 16 res: res: text[i] text[i] alphabet[d] = = 0 res: res:

  12. Fuzzy searching • Instead of having a single array result that changes over the length of the text, we now have k distinct arrays  result 1..k • ... • long[] result = new long[k + 1]; • for (int i = 0; i <= k; i++) • { • result[i] = 1; • } • ... • for (int j = 1; j <= k; ++j) • { • // Three operations of the Levenshtein distance • long insertion = current | ((result[j] & patternMask[text[i]]) << 1); • long deletion = (previous | (result[j] & patternMask[text[i]])) << 1; • long substitution = (previous | (result[j] & patternMask[text[i]])) << 1; • current = result[j]; • result[j] = substitution | insertion | deletion | 1; • previous = result[j]; • } • ...

  13. Shift-and vs. Shift-or • Shift-and : • Uses bitwise & and 1’s for matches • More intuitive and easyer to understand • Needs to add result |= 1 • Shift-or : • Uses bitwise | and zeroes’s for matches • A bit faster

  14. Bitap algorithm http://algoacademy.telerik.com

  15. Links for more information • Original paper of Baeza-Yates and Gonnet: • http://www.akira.ruc.dk/~keld/teaching/algoritmedesign_f08/Artikler/09/Baeza92.pdf • Google implementation using bitap: • https://code.google.com/p/google-diff-match-patch • Levenshtein algorithm: • http://www.codeproject.com/Articles/13525/Fast-memory-efficient-Levenshtein-algorithm • http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance

  16. Free Trainings @ Telerik Academy • “C# Programming @ Telerik Academy • csharpfundamentals.telerik.com • Telerik Software Academy • academy.telerik.com • Telerik Academy @ Facebook • facebook.com/TelerikAcademy • Telerik Software Academy Forums • forums.academy.telerik.com

More Related