Textual Patterns Finding that which is hidden http://www.flickr.com/photos/seadigs/4810947743/sizes/z/in/photostream/
Patterns • Consider the text below. Identify the: • Proper names • Phone numbers • Email addresses • Dates
To: Sajil
From: Yaalini
Hi Sajil, I'm throwing a birthday party for mom this week. Give me a call at 608-555-1234 or 618-508-0001 and we can arrange the details. I will be gone between 10/9/2011 until 10/12/2011 but I'll have my cell phone with me. See you
Yallini
yallini.vrinda@rckaa.edu
Patterns • Have you ever been prompted to enter a date and the prompt is something like "MM/DD/YYYY"? • How can we define a 'pattern'? • For example, we might say that a social security number has nine digits, with a dash after the first three digits and again after the next two. • Do the following strings match this definition? • 123-456-789 • 123-45-6789 • Abc-de-fghi • 123456789 • A regular expression is a pattern that some strings will match and others will not match.
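The verbal definition can be tested by hand before any pattern notation is introduced. The following is a minimal sketch in Python; the function name `looks_like_ssn` is hypothetical and chosen only for illustration.

```python
def looks_like_ssn(text: str) -> bool:
    """Check the verbal definition: 3 digits, a dash, 2 digits, a dash, 4 digits."""
    parts = text.split("-")
    if len(parts) != 3:
        return False
    expected_lengths = [3, 2, 4]
    # Every part must have the right length and consist only of ASCII digits.
    return all(
        len(part) == n and part.isascii() and part.isdigit()
        for part, n in zip(parts, expected_lengths)
    )


for candidate in ["123-456-789", "123-45-6789", "Abc-de-fghi", "123456789"]:
    print(candidate, looks_like_ssn(candidate))
```

Of the four strings on the slide, only 123-45-6789 satisfies the definition. Writing such a check by hand for every kind of pattern quickly becomes tedious, which is the motivation for regular expressions.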
The Language of Regular Expressions • Let's define the "pattern language". • Any single character is a pattern. • The character represents itself. • The '.' (period) is a pattern. • The period represents "any character". • If A and B are both patterns, then so are: • AB : this represents the pattern A followed by the pattern B • "P." is a pattern that matches "PA", "PR", and "PQ" but not "pa". • A|B : this represents either the pattern A or the pattern B • "(P|Q)" is a pattern that matches "P" and "Q" but not "R". • (A) : this represents a group; the parentheses are special. Anything in parentheses is a group. A group is one "thing".
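The two example patterns can be tried out directly. This sketch assumes Python's standard `re` module and uses `fullmatch` so that the whole string must fit the pattern; the helper name `matches` is hypothetical.

```python
import re

# Concatenation with the period: "P" followed by any one character.
p_dot = re.compile(r"P.")
# Alternation inside a group: either "P" or "Q".
p_or_q = re.compile(r"(P|Q)")


def matches(pattern, text):
    """Return True when the entire text fits the compiled pattern."""
    return pattern.fullmatch(text) is not None


for text in ["PA", "PR", "PQ", "pa"]:
    print("P.   ", text, matches(p_dot, text))

for text in ["P", "Q", "R"]:
    print("(P|Q)", text, matches(p_or_q, text))
```

Note that matching is case sensitive by default, which is why "pa" does not match "P.".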
Examples • We might want to say that a social security number has nine characters, each of which is a digit. A dash occurs after the first three digits and then again after the next two. • Can you write down a pattern for this? It will be ugly! • (0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)-(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)-(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)
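Rather than typing the long pattern by hand, it can be assembled from one digit group. This is only a sketch, assuming Python's `re` module; the names `DIGIT` and `ugly_ssn` are made up for the example.

```python
import re

# One group that matches a single digit using only alternation.
DIGIT = "(0|1|2|3|4|5|6|7|8|9)"

# Three digits, a dash, two digits, a dash, four digits.
ugly_ssn = DIGIT * 3 + "-" + DIGIT * 2 + "-" + DIGIT * 4

print(ugly_ssn)
print(re.fullmatch(ugly_ssn, "123-45-6789") is not None)
print(re.fullmatch(ugly_ssn, "123-456-789") is not None)
```

The pattern works, but its length shows why a shorter notation for repetition and for sets of characters is worth having.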
Repeating Patterns within Patterns • Quantifiers are used to allow and constrain repetitions. If A is a pattern, then so are: • A* : this represents zero or more repetitions of A • A+ : this represents one or more repetitions of A • A? : this represents zero or one occurrences of A • A{n} : this represents exactly n repetitions of A • A{m,n} : this represents at least m and no more than n repetitions of A • Give a regex for social security numbers: • (0|1|2|3|4|5|6|7|8|9){3}-(0|1|2|3|4|5|6|7|8|9){2}-(0|1|2|3|4|5|6|7|8|9){4}
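Each quantifier can be checked against a short string. The sketch below assumes Python's `re` module; the `cases` table and the helper `full` are hypothetical names. A quantifier applies to the single item just before it, so a group is needed to repeat a longer piece.

```python
import re


def full(pattern, text):
    """Return True when the entire text fits the pattern."""
    return re.fullmatch(pattern, text) is not None


# (pattern, text, whether the whole text is expected to match)
cases = [
    (r"ab*", "a", True),         # zero repetitions of "b" is allowed
    (r"ab*", "abbb", True),
    (r"ab+", "a", False),        # at least one "b" is required
    (r"ab+", "abbb", True),
    (r"ab?", "ab", True),
    (r"ab?", "abb", False),      # at most one "b"
    (r"ab{3}", "abbb", True),
    (r"ab{3}", "abb", False),    # exactly three "b"s
    (r"ab{2,3}", "abb", True),
    (r"ab{2,3}", "abbbb", False),
    (r"(ab)+", "abab", True),    # the group "ab" is repeated as one thing
]

for pattern, text, expected in cases:
    print(pattern, text, full(pattern, text), "expected:", expected)
```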
Character Classes • Square brackets denote a character class (this is a set of characters). The class will match any one character in the set. • Inside the brackets, characters are just listed one by one. • The '-' is used to denote a range. • Examples (most characters that are special elsewhere lose their special meaning inside a character class): • [abc] will match "a" or "b" or "c" and nothing else • [a-c] will match "a" or "b" or "c" and nothing else • [a-z] will match any lower case alphabetic symbol • [a-zA-Z0-9] will match any alphanumeric symbol • [a+] will match either an "a" or a "+". • A caret at the start of a set, however, is special: it means "not". • [^a] will match any single character except "a" • [^0-9] will match any single character except a digit.
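One way to see what a character class accepts is to list every character of a sample string that it picks out. This is a small sketch assuming Python's `re.findall`; the helper name `pick` and the sample strings are invented for the example.

```python
import re


def pick(char_class, text):
    """Return every single character in text that the class matches."""
    return re.findall(char_class, text)


sample = "a+b=c, 42!"

print(pick(r"[abc]", sample))        # the letters a, b, c
print(pick(r"[a-c]", sample))        # same set, written as a range
print(pick(r"[a+]", sample))         # "+" is an ordinary character here
print(pick(r"[0-9]", sample))        # the digits
print(pick(r"[^0-9]", "a1b2"))       # everything except digits
print(pick(r"[a-zA-Z0-9]", "x_9!"))  # alphanumerics only, so no "_" or "!"
```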
Examples • Which of the following match [a-z][0-9]*? • abc • 1z93 • a-9 • Which of the following match [0-9]*[^02468]? • 03 • 999 • 354 • Give a pattern for social security numbers: • [0-9]{3}-[0-9]{2}-[0-9]{4}
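The answers to these exercises depend on what "match" means. The sketch below, assuming Python's `re` module, reports both readings: whether the whole string fits the pattern (`fullmatch`) and whether the pattern occurs somewhere inside the string (`search`). The slides do not say which reading is intended; the names `exercises`, `whole_string_matches`, and `contains_match` are hypothetical.

```python
import re

exercises = {
    r"[a-z][0-9]*": ["abc", "1z93", "a-9"],
    r"[0-9]*[^02468]": ["03", "999", "354"],
}


def whole_string_matches(pattern, text):
    """True when the entire text fits the pattern."""
    return re.fullmatch(pattern, text) is not None


def contains_match(pattern, text):
    """True when some part of the text fits the pattern."""
    return re.search(pattern, text) is not None


for pattern, candidates in exercises.items():
    for text in candidates:
        print(
            pattern,
            text,
            "whole:", whole_string_matches(pattern, text),
            "contains:", contains_match(pattern, text),
        )
```

Under the whole-string reading, none of the first three strings match, and of the second three only "03" and "999" do. Under the looser reading every string contains a match, which is why it helps to state the intended meaning up front.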
Predefined classes • Some character classes are common and have shorthand definitions • \d : matches any decimal digit; equivalent to [0-9] • \D : matches any non-digit character; equivalent to [^0-9] • \w : matches any 'word' character; equivalent to [a-zA-Z0-9_] • \W : matches any non-word character; equivalent to [^a-zA-Z0-9_] • Also, there are two interesting 'location' matches • $ : matches the end of a string (or just before a trailing newline); in multiline mode it also matches before each newline • ^ : matches the start of a string; in multiline mode it also matches right after each newline • Give a regex for social security numbers • \d{3}-\d{2}-\d{4}
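The shorthand classes and the two location matches can be explored with a few calls. This sketch assumes Python's `re` module, where the newline-related behaviour of `^` and `$` is switched on with the `re.MULTILINE` flag; other regex flavours may differ in the details. The variable names are invented for the example.

```python
import re

# The social security pattern written with the digit shorthand.
ssn = re.compile(r"\d{3}-\d{2}-\d{4}")
print(ssn.fullmatch("123-45-6789") is not None)

# "Word" characters include letters, digits, and the underscore.
print(re.fullmatch(r"\w+", "snake_case_1") is not None)
# A space is not a word character, so \W accepts it.
print(re.fullmatch(r"\W", " ") is not None)

text = "first line\nsecond line"

# Without MULTILINE, ^ anchors only at the very start of the string
# and $ only at the very end.
print(re.findall(r"^\w+", text))
print(re.findall(r"\w+$", text))

# With MULTILINE, ^ also matches right after each newline
# and $ also matches right before each newline.
print(re.findall(r"^\w+", text, flags=re.MULTILINE))
print(re.findall(r"\w+$", text, flags=re.MULTILINE))
```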
Example: Phone Numbers • Consider creating a regular expression to match phone numbers. The phone numbers can take on the following forms: • 800-555-1212 • 800 555 1212 • 800.555.1212 • 1-800-555-1212 • 800-555-1212-1234 • 800-555-1212x1234 • 800-555-1212 ext. 1234 • 1-800 555.1212 #1234
Example: Phone Numbers • Divide and conquer! • Note that each phone number has at most five parts. • prefix (the number 1) • area code • trunk (first three digits) • rest (next 4 digits) • extension (last digits; may be between 1 and 4 in length) • Consider defining each of these parts • what is the prefix? • what is the area code? • what is the trunk? • what is the rest? • what is the extension?
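One possible set of answers to these questions is sketched below, assuming Python's `re` module. The constant names are hypothetical, and the area code is assumed to be three digits, as in every example number on the previous slide.

```python
import re

PREFIX = r"1"             # the single digit 1
AREA_CODE = r"\d{3}"      # assumed: exactly three digits
TRUNK = r"\d{3}"          # first three digits of the local number
REST = r"\d{4}"           # next four digits
EXTENSION = r"\d{1,4}"    # between one and four digits

# Try each part on a piece taken from "1-800-555-1212x1234".
pieces = [
    (PREFIX, "1"),
    (AREA_CODE, "800"),
    (TRUNK, "555"),
    (REST, "1212"),
    (EXTENSION, "1234"),
]
for pattern, piece in pieces:
    print(pattern, piece, re.fullmatch(pattern, piece) is not None)
```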
Example: Phone Numbers • We need to 'conquer' by combining the solutions for the parts. • Rules: • The prefix is optional • When a prefix is present, one of the following must occur between the prefix and the area code: space, comma, dash, period • One of the following must occur between the area code and the trunk: space, comma, dash, period • One of the following must occur between the trunk and the rest: space, comma, dash, period • When an extension is present, one of the following must occur between the rest and the extension: space, 'x', ' ext. ', dash, octothorpe (#).
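The rules can be combined into one pattern. The following is a sketch, not a definitive solution, assuming Python's `re` module. Two assumptions go beyond the stated rules: the extension is treated as optional (several example numbers have none), and a space is allowed before the octothorpe so that "1-800 555.1212 #1234" is accepted. The names `SEP`, `EXT_SEP`, `PHONE`, and the group names are hypothetical.

```python
import re

# Separator between prefix, area code, trunk, and rest.
SEP = r"[ ,.-]"
# Separator before the extension: " ext. ", "x", "#" (optionally after a
# space, which is an assumption), a single space, or a dash.
EXT_SEP = r"(?: ext\. |x| ?#|[ -])"

PHONE = re.compile("".join([
    r"(?:1", SEP, r")?",                       # optional prefix and its separator
    r"(?P<area>\d{3})", SEP,
    r"(?P<trunk>\d{3})", SEP,
    r"(?P<rest>\d{4})",
    r"(?:", EXT_SEP, r"(?P<ext>\d{1,4}))?",    # optional extension (assumption)
]))

examples = [
    "800-555-1212",
    "800 555 1212",
    "800.555.1212",
    "1-800-555-1212",
    "800-555-1212-1234",
    "800-555-1212x1234",
    "800-555-1212 ext. 1234",
    "1-800 555.1212 #1234",
]

for number in examples:
    match = PHONE.fullmatch(number)
    print(number, "->", match.groupdict() if match else None)
```

Naming the groups makes the parts easy to pull out after a match. Because the rules require a separator between parts, a run of bare digits such as 8005551212 is rejected by this pattern.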