Bitap Algorithm
Approximate string matching
Evlogi Hristov
Telerik Corporation
Student at Telerik Academy
Table of Contents
• Levenshtein distance
• Bitap overview
• Bitap exact search
• Bitap fuzzy search
• Additional information
Levenshtein distance
Edit distance
Levenshtein distance
• Edit distance: the minimum number of primitive operations needed to convert one string into the other.
  • insertion: cot → coat
  • deletion: coat → cot
  • substitution: coat → cost
• Example:
  • Set n to be the length of s = "GUMBO"
  • Set m to be the length of t = "GAMBOL"
  • If n = 0, return m and exit
  • If m = 0, return n and exit
Levenshtein distance (2)
• Initialize matrix M[m + 1, n + 1]: the first row holds 0..n and the first column holds 0..m
• Examine each character of s (i from 1 to n)
• Examine each character of t (j from 1 to m)
• If s[i] equals t[j], the cost is 0; if s[i] is not equal to t[j], the cost is 1
• Set cell M[j, i] equal to the minimum of:
  • The cell immediately above plus 1: M[j-1, i] + 1
  • The cell immediately to the left plus 1: M[j, i-1] + 1
  • The cell diagonally above and to the left plus the cost: M[j-1, i-1] + cost
• After the iteration is complete, the distance is found in the cell M[m, n]. For "GUMBO" and "GAMBOL" it is 2 (substitute U → A, insert L).
Levenshtein distance (3)
    private int Levenshtein(string source, string target)
    {
        // If one string is null or empty, the distance is the length of the other.
        if (string.IsNullOrEmpty(source))
        {
            if (!string.IsNullOrEmpty(target))
            {
                return target.Length;
            }
            return 0;
        }
        if (string.IsNullOrEmpty(target))
        {
            // source is known to be non-empty at this point.
            return source.Length;
        }
        // dist[i, j] = distance between the first i chars of source and the first j chars of target.
        int[,] dist = new int[source.Length + 1, target.Length + 1];
        int min1, min2, min3, cost;
        // ...continues on the next page
Levenshtein distance (4)
        // First column: i deletions turn the first i chars of source into the empty string.
        for (int i = 0; i < dist.GetLength(0); i += 1)
        {
            dist[i, 0] = i;
        }
        // First row: i insertions turn the empty string into the first i chars of target.
        for (int i = 0; i < dist.GetLength(1); i += 1)
        {
            dist[0, i] = i;
        }
        for (int i = 1; i < dist.GetLength(0); i++)
        {
            for (int j = 1; j < dist.GetLength(1); j++)
            {
                cost = source[i - 1] == target[j - 1] ? 0 : 1;
                min1 = dist[i - 1, j] + 1;        // deletion
                min2 = dist[i, j - 1] + 1;        // insertion
                min3 = dist[i - 1, j - 1] + cost; // match or substitution
                dist[i, j] = Math.Min(Math.Min(min1, min2), min3);
            }
        }
        return dist[dist.GetLength(0) - 1, dist.GetLength(1) - 1];
    }
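The recurrence above only ever reads the current row and the one before it, so the full matrix is not strictly needed. The following is a minimal sketch of a two-row variant, not the author's code; the class name LevenshteinSketch and the method name Distance are hypothetical, and null input is treated as an empty string by assumption.

```csharp
using System;

public static class LevenshteinSketch
{
    // Two-row variant: keeps only the previous and the current row of the matrix.
    public static int Distance(string source, string target)
    {
        source = source ?? string.Empty; // assumption: null behaves like ""
        target = target ?? string.Empty;

        int[] previous = new int[target.Length + 1];
        int[] current = new int[target.Length + 1];

        // Row 0: j insertions turn "" into the first j chars of target.
        for (int j = 0; j <= target.Length; j++)
        {
            previous[j] = j;
        }

        for (int i = 1; i <= source.Length; i++)
        {
            current[0] = i; // i deletions turn the first i chars of source into ""
            for (int j = 1; j <= target.Length; j++)
            {
                int cost = source[i - 1] == target[j - 1] ? 0 : 1;
                int deletion = previous[j] + 1;
                int insertion = current[j - 1] + 1;
                int substitution = previous[j - 1] + cost;
                current[j] = Math.Min(Math.Min(deletion, insertion), substitution);
            }

            // Swap the rows so that "previous" holds the row just computed.
            int[] tmp = previous;
            previous = current;
            current = tmp;
        }

        return previous[target.Length];
    }
}
```

For example, LevenshteinSketch.Distance("GUMBO", "GAMBOL") evaluates to 2, at a memory cost proportional to the length of target instead of the product of both lengths.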
Bitap algorithm
shift-or/shift-and
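Bitap algorithm
• Also known as the shift-or, shift-and or Baeza–Yates–Gonnet algorithm.
• Approximate string matching algorithm.
• Approximate equality is defined in terms of Levenshtein distance.
• Often used for fuzzy search without indexing.
• Does most of the work with bitwise operations.
• Runs in O(mn) operations in general (m = pattern length, n = text length), no matter the structure of the text or the pattern. When the pattern fits in one machine word, each text character costs a constant number of bitwise operations for exact search and O(k) for fuzzy search with up to k errors.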
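Bitap Exact search (2)
    public static List<int> ExactMatch(string text, string pattern)
    {
        // One bit per pattern position plus one match bit must fit in a long.
        if (pattern.Length == 0 || pattern.Length > 63)
        {
            throw new ArgumentException("Pattern length must be between 1 and 63.");
        }
        // ASCII range (0..127); characters outside it are not supported.
        long[] alphabet = new long[128];
        for (int i = 0; i < pattern.Length; ++i)
        {
            int letter = (int)pattern[i];
            // 1L, not 1: an int shift would wrap around for i >= 32.
            alphabet[letter] = alphabet[letter] | (1L << i);
        }
        long result = 1; // 0000 0001: only the empty prefix is matched
        List<int> indexes = new List<int>();
        for (int index = 0; index < text.Length; index++)
        {
            // Keep only the prefixes that text[index] extends.
            result &= alphabet[text[index]];
            // Shift to the next position and re-set the "empty prefix" bit.
            result = (result << 1) + 1;
            // Bit pattern.Length set: the whole pattern ends at this index.
            if ((result & (1L << pattern.Length)) != 0)
            {
                indexes.Add(index - pattern.Length + 1);
            }
        }
        return indexes;
    }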
Bitap Exact search
• Example:
  text = cbdabababc
  pattern = ababc
  start: res = 1
  alphabet[a] = 00101 = 5
  alphabet[b] = 01010 = 10
  alphabet[c] = 10000 = 16
  alphabet[d] = 00000 = 0
  [Animation: for each text[i], the slide shows its alphabet bits and res before and after the update]
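The step-by-step states of this example can be reproduced with a small helper. This is only a sketch under assumptions: the class BitapTraceSketch and its Trace method are hypothetical names, the update rule is the shift-and step res = ((res & alphabet[c]) << 1) | 1 from the exact search slide, and both strings are assumed to be ASCII with a pattern of at most 63 characters.

```csharp
using System;
using System.Collections.Generic;

public static class BitapTraceSketch
{
    // Returns one line per text character: "<char> <res in binary>[ match]".
    public static List<string> Trace(string text, string pattern)
    {
        long[] alphabet = new long[128]; // assumption: ASCII input only
        for (int i = 0; i < pattern.Length; i++)
        {
            alphabet[pattern[i]] |= 1L << i;
        }

        long res = 1; // only the empty prefix is matched at the start
        var lines = new List<string>();
        foreach (char c in text)
        {
            // Same update as the exact search: AND with the mask, shift, set bit 0.
            res = ((res & alphabet[c]) << 1) | 1;

            string bits = Convert.ToString(res, 2).PadLeft(pattern.Length + 1, '0');
            bool match = (res & (1L << pattern.Length)) != 0;
            lines.Add(c + " " + bits + (match ? " match" : ""));
        }
        return lines;
    }
}
```

For text = cbdabababc and pattern = ababc, the first 'a' (index 3) produces the state 000011, and the final 'c' (index 9) produces 100001, which has the match bit set; the match therefore starts at index 9 - 5 + 1 = 5.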
Fuzzy searching
• Instead of a single bit vector result that changes over the length of the text, we now keep k + 1 bit vectors result[0..k], one per allowed number of errors.
    ...
    long[] result = new long[k + 1];
    for (int i = 0; i <= k; i++)
    {
        result[i] = 1;
    }
    ...
    for (int j = 1; j <= k; ++j)
    {
        // Three operations of the Levenshtein distance.
        // Per the assignments at the end of the loop body:
        //   current  = result[j - 1] before text[i] was processed
        //   previous = result[j - 1] after text[i] was processed
        long insertion = current | ((result[j] & patternMask[text[i]]) << 1);
        long deletion = (previous | (result[j] & patternMask[text[i]])) << 1;
        // Substitution consumes text[i], so it extends the old state (current).
        long substitution = (current | (result[j] & patternMask[text[i]])) << 1;
        current = result[j];
        result[j] = substitution | insertion | deletion | 1;
        previous = result[j];
    }
    ...
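To show how the fragment fits into a complete search, here is a minimal self-contained sketch of shift-and fuzzy matching with up to k errors. It is an illustration under assumptions rather than the author's full implementation: the names BitapFuzzySketch, FuzzyMatchEnds, oldLower and newLower are hypothetical, input is assumed to be ASCII, and the pattern is limited to 63 characters. One deliberate difference from the fragment: result[d] starts with its low d + 1 bits set, so that matches which skip pattern characters at the very start of the text are also found.

```csharp
using System;
using System.Collections.Generic;

public static class BitapFuzzySketch
{
    // Returns the end indexes (inclusive) in text where pattern matches
    // with at most k insertions, deletions or substitutions.
    public static List<int> FuzzyMatchEnds(string text, string pattern, int k)
    {
        if (pattern.Length == 0 || pattern.Length > 63)
            throw new ArgumentException("Pattern length must be between 1 and 63.");
        if (k < 0)
            throw new ArgumentException("k must not be negative.");

        long[] patternMask = new long[128]; // assumption: ASCII pattern
        for (int i = 0; i < pattern.Length; i++)
        {
            if (pattern[i] >= 128) throw new ArgumentException("ASCII pattern expected.");
            patternMask[pattern[i]] |= 1L << i;
        }

        // result[d], bit p set: the first p pattern chars match a suffix of the
        // text read so far with at most d errors. Bits 0..d are set initially
        // because up to d leading pattern chars may be skipped.
        long[] result = new long[k + 1];
        for (int d = 0; d <= k; d++)
        {
            result[d] = (1L << (d + 1)) - 1;
        }

        long matchBit = 1L << pattern.Length;
        var ends = new List<int>();

        for (int i = 0; i < text.Length; i++)
        {
            long mask = text[i] < 128 ? patternMask[text[i]] : 0L;

            long oldLower = result[0];                 // level d-1 before text[i]
            result[0] = ((result[0] & mask) << 1) | 1; // exact level
            long newLower = result[0];                 // level d-1 after text[i]

            for (int d = 1; d <= k; d++)
            {
                long old = result[d];
                long match = (old & mask) << 1;    // text[i] equals the next pattern char
                long insertion = oldLower;         // extra char in the text
                long substitution = oldLower << 1; // text[i] replaces a pattern char
                long deletion = newLower << 1;     // a pattern char is skipped
                result[d] = match | insertion | substitution | deletion | 1;
                oldLower = old;
                newLower = result[d];
            }

            if ((result[k] & matchBit) != 0)
            {
                ends.Add(i);
            }
        }
        return ends;
    }
}
```

As a usage example, FuzzyMatchEnds("cot", "coat", 1) reports a match ending at index 2, because "cot" is one deletion away from "coat", while the same call with k = 0 reports nothing. The sketch returns end positions because, with errors allowed, the start of a match is not determined by the pattern length alone.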
Shift-and vs. Shift-or
• Shift-and:
  • Uses bitwise & and 1s for matches
  • More intuitive and easier to understand
  • Needs an explicit result |= 1 after each shift
• Shift-or:
  • Uses bitwise | and 0s for matches
  • A bit faster: the shift brings in the needed 0 automatically, so no extra operation is required
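For comparison with the shift-and exact search, the shift-or formulation can be sketched as follows. This is an illustrative sketch, not code from the slides: the class name ShiftOrSketch is hypothetical, input is assumed to be ASCII, and the pattern is limited to 64 characters so that it fits in a long. Here a 0 bit marks a match, and the left shift supplies the 0 in bit 0 without any extra operation.

```csharp
using System;
using System.Collections.Generic;

public static class ShiftOrSketch
{
    // Exact search, shift-or style: 0 bits mark matched prefixes.
    public static List<int> ExactMatch(string text, string pattern)
    {
        if (pattern.Length == 0 || pattern.Length > 64)
            throw new ArgumentException("Pattern length must be between 1 and 64.");

        // All ones means "this character matches no pattern position".
        long[] mask = new long[128]; // assumption: ASCII pattern
        for (int c = 0; c < mask.Length; c++)
        {
            mask[c] = ~0L;
        }
        for (int i = 0; i < pattern.Length; i++)
        {
            if (pattern[i] >= 128) throw new ArgumentException("ASCII pattern expected.");
            mask[pattern[i]] &= ~(1L << i); // clear bit i: pattern[i] matches here
        }

        long state = ~0L; // nothing matched yet
        long lastBit = 1L << (pattern.Length - 1);
        var indexes = new List<int>();

        for (int i = 0; i < text.Length; i++)
        {
            long m = text[i] < 128 ? mask[text[i]] : ~0L;
            // The shift brings a 0 into bit 0; OR keeps a 0 only where the
            // previous prefix matched and text[i] equals the next pattern char.
            state = (state << 1) | m;

            if ((state & lastBit) == 0)
            {
                indexes.Add(i - pattern.Length + 1);
            }
        }
        return indexes;
    }
}
```

With text = cbdabababc and pattern = ababc this finds the single match starting at index 5, the same result as the shift-and version, with one operation fewer per character in the inner step.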
Bitap algorithm http://algoacademy.telerik.com
Links for more information
• Original paper of Baeza-Yates and Gonnet:
  • http://www.akira.ruc.dk/~keld/teaching/algoritmedesign_f08/Artikler/09/Baeza92.pdf
• Google implementation using bitap:
  • https://code.google.com/p/google-diff-match-patch
• Levenshtein algorithm:
  • http://www.codeproject.com/Articles/13525/Fast-memory-efficient-Levenshtein-algorithm
  • http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance