CSC 213 Lecture 16: Strings and Pattern Matching
Announcements • Last quiz results were not good • Good news: neighbors did not read book either • Scores were universally poor • Bad news: neighbors are also unlucky • Average score was 4.2 • Flipping a coin would produce an average of 5 • Best news: Another daily quiz!
Strings (§ 11.1) • Algorithmically, any sequence of concatenated data is a string: • “CSC213 STUDENTS RAWK” • “I can’t believe this is a String.” • Java programs • HTML documents • Digitized image • DNA sequences
String Terminology • A string is made up of elements of an alphabet – the characters usable within a family of strings • ASCII • Unicode • Bits • Pixels • DNA bases • Substring P[i ... j] of a string P has the characters of P at ranks i to j • Any substring starting at rank 0 is called a prefix • Any substring that ends at a string’s last rank is called a suffix
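The terminology above can be sketched in Java. This is a minimal illustration, not code from the lecture: the class `StringTerms` and its methods are hypothetical names, and the sketch assumes the empty string is not counted as a prefix or suffix.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class illustrating substring, prefix, and suffix.
class StringTerms {
    // Substring P[i ... j], inclusive of both ranks.
    // (String.substring takes an exclusive end index, hence j + 1.)
    static String sub(String p, int i, int j) {
        return p.substring(i, j + 1);
    }

    // Every prefix P[0 ... j], shortest first.
    static List<String> prefixes(String p) {
        List<String> out = new ArrayList<>();
        for (int j = 0; j < p.length(); j++) {
            out.add(sub(p, 0, j));
        }
        return out;
    }

    // Every suffix P[i ... last rank], longest first.
    static List<String> suffixes(String p) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < p.length(); i++) {
            out.add(sub(p, i, p.length() - 1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(prefixes("abc")); // prints [a, ab, abc]
        System.out.println(suffixes("abc")); // prints [abc, bc, c]
    }
}
```

A string of length n has n non-empty prefixes and n non-empty suffixes; the whole string is both.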
Pattern Matching Problem (§ 11.2) • Given two strings T and P, find the first substring of T that matches P • T is the “text” and P is the “pattern” • This has many, many applications • Search engines • Database queries • Biological research
Brute-Force Approach • Common method of solving problems • Easy to develop, requires little coding, and needs little brain power • Instead, uses the computer’s raw speed to consider and analyze all possible options • This can be painfully slow and use lots of memory • Generally good only for small problems
Brute-Force Pattern Matching • Compare P with all substrings in T, until • find a substring of T equal to P, or • reject all possible substrings of T • If P has size m and T has size n, this takes time O(nm) • Worst-case: • T = aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa • P = aaag • Common in images, DNA, & biological data
Brute-Force Pattern Matching
Algorithm BruteForceMatch(String T, String P)
  // For each rank of T, see if it starts a matching substring
  for i ← 0 to T.length() - P.length()
    // Compare each character in the substring with its counterpart in P
    j ← 0
    while j < P.length() && T.charAt(i + j) == P.charAt(j)
      j ← j + 1
    if j == P.length()
      return i // Return the first rank in T where we find P
  return -1 // No matching substring exists
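The pseudocode above translates almost line for line into Java. The following is a sketch rather than a reference solution: the class name `BruteForce` and the example strings in `main` are assumptions made for illustration.

```java
// Hypothetical class wrapping a direct translation of BruteForceMatch.
class BruteForce {
    // Returns the first rank in text where pattern occurs, or -1 if it never does.
    static int bfMatch(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        // Try every rank of text that leaves room for the whole pattern.
        for (int i = 0; i <= n - m; i++) {
            int j = 0;
            // Advance while the characters line up.
            while (j < m && text.charAt(i + j) == pattern.charAt(j)) {
                j++;
            }
            if (j == m) {
                return i; // all m characters matched
            }
        }
        return -1; // no matching substring exists
    }

    public static void main(String[] args) {
        // Illustrative text and pattern, not taken from the slides.
        System.out.println(bfMatch("abacaabaccabacabaabb", "abacab")); // prints 10
        // Same shape as the worst case above: nearly every alignment matches until the last character.
        System.out.println(bfMatch("aaaaaaaa", "aaag")); // prints -1
    }
}
```

With this translation an empty pattern matches at rank 0, and a pattern longer than the text returns -1 because the outer loop never runs.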
Your Turn • What are all of the prefixes and suffixes of the string: I am the Lizard King! • How many character comparisons does brute-force do to find a substring of: ccagcctccgcc that matches this pattern: ccgcc
My Turn I am the Lizard King!
Boyer-Moore Heuristics (§ 11.2.2) • Looking-glass heuristic: When comparing P with a substring of T, start from the end of P and work backward to P’s start • Character-jump heuristic: When a mismatch occurs at T[i] = c • If P contains c, shift P so that T[i] is aligned with the last occurrence of c in P • Else, shift P completely past T[i], so the new comparison has P[0] aligned with T[i+1]
Last-Occurrence Function • Boyer-Moore precomputes the last-occurrence function, L, for P and stores it in an array • L(c) = the largest i where P.charAt(i) = c, or -1 if c is not in P • Example: • Consider alphabet S = {a, b, c, d, e, f} • P = badfeed
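One way to compute the last-occurrence function for the example above is sketched below. It assumes a `char` alphabet and stores L in a map rather than an array, which is a substitution made here for readability; the class and method names are hypothetical, and the pattern is assumed to use only characters from the alphabet.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical class computing L(c) for every character c of a small alphabet.
class LastOccurrence {
    static Map<Character, Integer> lastOccurrence(String pattern, char[] alphabet) {
        Map<Character, Integer> last = new LinkedHashMap<>();
        // Start every character at -1, meaning "not in the pattern".
        for (char c : alphabet) {
            last.put(c, -1);
        }
        // Scan left to right; later ranks overwrite earlier ones,
        // so each entry ends up holding the largest rank.
        for (int i = 0; i < pattern.length(); i++) {
            last.put(pattern.charAt(i), i);
        }
        return last;
    }

    public static void main(String[] args) {
        System.out.println(lastOccurrence("badfeed", "abcdef".toCharArray()));
        // prints {a=1, b=0, c=-1, d=6, e=5, f=3}
    }
}
```

The scan takes time proportional to the pattern length plus the alphabet size.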
The Boyer-Moore Algorithm
Algorithm BoyerMooreMatch(String T, String P, Alphabet S)
  L ← lastOccurence(P, S)
  i ← P.length() - 1
  j ← P.length() - 1
  repeat
    if T.charAt(i) = P.charAt(j)
      if j = 0
        return i // We have a match starting at i
      else
        i ← i - 1
        j ← j - 1
    else
      // We do not have a match at character i, so we can skip letters
      l ← L[T.charAt(i)]
      i ← i + P.length() - min(j, 1 + l)
      j ← P.length() - 1
  until i > T.length() - 1
  return -1
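A possible Java rendering of the pseudocode above follows. It is a sketch under a few assumptions: the class name `BoyerMoore` is hypothetical, the last-occurrence function is kept in a `HashMap` instead of an alphabet-indexed array, and the `repeat ... until` loop is written as a `while` loop with an explicit empty-pattern check so that short texts and empty patterns do not index out of bounds.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical class implementing the simplified Boyer-Moore match
// (looking-glass and character-jump heuristics only).
class BoyerMoore {
    static int bmMatch(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        if (m == 0) {
            return 0; // assumption: an empty pattern matches at rank 0
        }
        // Last-occurrence function; characters absent from the map count as -1.
        Map<Character, Integer> last = new HashMap<>();
        for (int k = 0; k < m; k++) {
            last.put(pattern.charAt(k), k);
        }
        int i = m - 1; // current rank in text
        int j = m - 1; // current rank in pattern
        while (i <= n - 1) {
            if (text.charAt(i) == pattern.charAt(j)) {
                if (j == 0) {
                    return i; // matched all the way back to the pattern's start
                }
                i--; // looking-glass: keep comparing right to left
                j--;
            } else {
                // Character jump: realign using the mismatched text character.
                int l = last.getOrDefault(text.charAt(i), -1);
                i = i + m - Math.min(j, 1 + l);
                j = m - 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Illustrative text and pattern, not taken from the slides.
        System.out.println(bmMatch("abacaabaccabacabaabb", "abacab")); // prints 10
        System.out.println(bmMatch("aaaaaaaa", "baaa"));               // prints -1
    }
}
```

The `min(j, 1 + l)` term keeps the pattern moving forward: when the last occurrence of the mismatched character lies to the right of rank j, the pattern shifts by one position instead of sliding backward.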
Boyer-Moore’s Algorithm • Runs in time O(nm + |S|), where |S| is the size of the alphabet • Worst-case: • T = aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa • P = baaa • May occur in images, DNA sequences • Unlikely on larger alphabets, like English • Significantly faster than the brute-force algorithm on English text
Your Turn • How many character comparisons does Boyer-Moore do to find a substring of: ccagcctccgcc matching this pattern: ccgcc, with S = {a, c, g, t} • Compute the Boyer-Moore algorithm’s last-occurrence function for the string: the quick brown fox jumped over the lazy dog
Your Turn • Write the brute-force pattern matching method: public int bfMatch(String text, String pattern) { • Suppose we are using a non-ASCII alphabet that is stored in an array. Write the last-occurrence method: public static int[] lastFn(Sequence pattern, Object[] alphabet) { • Hint: You cannot use the value at each rank in pattern as an index into last. You may want to write a method that, given an Object and the alphabet, examines each location in alphabet and returns the index where the Object is found.