Textual Patterns
Finding the needle(s) in the textual haystack
http://www.flickr.com/photos/seadigs/4810947743/sizes/z/in/photostream/
Patterns
From: Gow, Joe <gow.joe@uwlax.edu>
Subject: Reminder About Open Forums Today
Date: March 25, 2011 8:44:08 AM CDT
Bcc: personnel@uwlax.edu

Hello, everyone. I just wanted to send a quick reminder about the two campus wide Open Forums we're holding today from 2 to 3 and 3 to 4 p.m. in the Cleary Center. I'll host the first session from 2 to 3, and we'll cover any topics you'd like to discuss. Then from 3 to 4 Vice Chancellor Bob Hetzel will lead a conversation about the plans for a new Cowley Science Building. Please join us!

Thanks,
Joe

Joe Gow, Chancellor
University of Wisconsin-La Crosse

Consider the text above. How would you identify
• Proper names?
• Email addresses?
• Dates?
Patterns
What do you think of when you see the following?
MM/DD/YYYY
This is a (string) pattern.
• Are there different patterns for this same thing?
• How would you describe the pattern of a credit card number?
Regular Expressions
• Regular expressions are “formulas” for string patterns.
• Regular expressions follow a standard notation.
• Regular expressions can be used in various computer applications and programming languages.
• Applying a regular expression to a string (piece of text) is called pattern matching.
  • The regular expression might match the string (or part of it) or it might not.
Regular Expression Notation
Regular expressions use a standard pattern language.
• Any (non-meta) character is a pattern.
  • The character pattern represents itself.
• The '.' (period) is a pattern.
  • The period (a meta character) represents "any single character".
• If A and B are both patterns, then so are
  • AB : This represents the pattern A followed by the pattern B
    • F. matches Fa, FR, and F3 but not fa or aF
  • A|B : This represents either the pattern A or the pattern B
    • P|Q matches P and Q but not R
• Parentheses are special; they form a pattern group.
  • Anything in parentheses is a group. A group is one "thing".
  • What strings does (red|blue) fish match?
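A minimal Python sketch for trying out the notation above. It assumes Python's standard `re` module and whole-string matching with `re.fullmatch`; the helper name `matches` is made up for illustration.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True when the pattern matches the whole string."""
    return re.fullmatch(pattern, text) is not None

# AB: pattern A followed by pattern B ('.' stands for any one character).
for text in ["Fa", "FR", "F3", "fa", "aF"]:
    print("F.", repr(text), matches(r"F.", text))

# A|B: either pattern A or pattern B.
for text in ["P", "Q", "R"]:
    print("P|Q", repr(text), matches(r"P|Q", text))

# A group is one "thing", so the | applies only inside the parentheses.
for text in ["red fish", "blue fish", "red", "green fish"]:
    print("(red|blue) fish", repr(text), matches(r"(red|blue) fish", text))
```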
Example
How would you write an expression for the time on a digital 12-hour clock? [HINT: Let’s divide & conquer]
• A regular expression matching any possible hour:
  1|2|3|4|5|6|7|8|9|10|11|12
• A regular expression matching any possible minute:
  (0|1|2|3|4|5)(0|1|2|3|4|5|6|7|8|9)
• A regular expression matching any possible time:
  (1|2|3|4|5|6|7|8|9|10|11|12):(0|1|2|3|4|5)(0|1|2|3|4|5|6|7|8|9)
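The divide-and-conquer steps above can be sketched in Python by building the full pattern from its parts. The names `HOUR`, `MINUTE`, `TIME`, and `is_clock_time` are illustrative, and whole-string matching with `re.fullmatch` is an assumption.

```python
import re

# Each part is written exactly as on the slide, then combined.
HOUR = r"(1|2|3|4|5|6|7|8|9|10|11|12)"
MINUTE = r"(0|1|2|3|4|5)(0|1|2|3|4|5|6|7|8|9)"
TIME = HOUR + ":" + MINUTE

def is_clock_time(text: str) -> bool:
    """Return True when the whole string is a 12-hour clock time."""
    return re.fullmatch(TIME, text) is not None

for text in ["7:05", "10:30", "12:59", "13:00", "0:15", "9:60"]:
    print(repr(text), is_clock_time(text))
```

Grouping the hour alternatives in parentheses matters: without the group, the `|` would split the entire expression rather than just the hour.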
Repeating Patterns within Patterns
Quantifiers are used to allow and constrain repetitions. If re is a regular expression (pattern), then so are:
• re* represents zero or more repetitions of re
• re+ represents one or more repetitions of re
• re? represents zero or one occurrences of re
• re{n} represents exactly n repetitions of re (n is some positive integer)
• re{m,n} represents at least m and no more than n repetitions of re (m, n are positive integers, m ≤ n)
Write a regular expression for Social Security Numbers, e.g. 123-45-6789
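A short Python sketch of the quantifiers, assuming the standard `re` module. The small test patterns (`ab*`, `a{2,3}`, and so on) are invented for illustration, and the `SSN` pattern is only one possible answer to the exercise, built from a digit group because character classes have not been introduced yet.

```python
import re

def full(pattern: str, text: str) -> bool:
    """Return True when the pattern matches the whole string."""
    return re.fullmatch(pattern, text) is not None

# A quantifier applies to the pattern immediately before it.
print(full(r"ab*", "a"), full(r"ab*", "abbb"))      # zero or more b's
print(full(r"ab+", "a"), full(r"ab+", "ab"))        # one or more b's
print(full(r"ab?", "ab"), full(r"ab?", "abb"))      # zero or one b
print(full(r"a{3}", "aaa"), full(r"a{3}", "aa"))    # exactly three a's
print(full(r"a{2,3}", "aa"), full(r"a{2,3}", "aaaa"))  # two or three a's

# One possible Social Security Number pattern: 3 digits, 2 digits, 4 digits.
DIGIT = r"(0|1|2|3|4|5|6|7|8|9)"
SSN = DIGIT + r"{3}-" + DIGIT + r"{2}-" + DIGIT + r"{4}"
print(full(SSN, "123-45-6789"))
```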
Escaped Characters
Some characters have special meaning in regular expressions, and others have no printable form. Such characters can still be represented using a 2-character notation, known as an escape code.
• \+ represents +
• \. represents .
• The same technique works for * ? ( ) { } [ ] \ ^ $ |
• \n represents the new line character
• \t represents the tab character
• \r represents the carriage return character
• \v represents the vertical tab character
• \f represents the form feed character
Location Symbols
There are also two “location” symbols.
• ^ matches the start of a line, including right after \n
• $ matches the end of a line, including right before \n
In some tools and languages, matching at interior line breaks requires a multi-line option.
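A small Python sketch of the two location symbols. It assumes the standard `re` module, where `^` and `$` match at interior line breaks only when the `re.MULTILINE` flag is given; the sample strings are made up for illustration.

```python
import re

text = "Right now\nRight here\nnot Right"

# With re.MULTILINE, ^ matches at the start of every line.
starts_multiline = re.findall(r"^Right", text, re.MULTILINE)
# Without the flag, ^ matches only at the start of the whole string.
starts_default = re.findall(r"^Right", text)

text2 = "Right now\nnot now"

# With re.MULTILINE, $ matches at the end of every line.
ends_multiline = re.findall(r"now$", text2, re.MULTILINE)
# Without the flag, $ matches only at the end of the whole string here.
ends_default = re.findall(r"now$", text2)

print(len(starts_multiline), len(starts_default))
print(len(ends_multiline), len(ends_default))
```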
Sample Regular Expressions
• (snow|rain)(flake|drop)
• g(rr|ee)*
• W.*W
• B\.C\.
• ^Right now.$
• ^Right now.\$
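One way to explore the sample expressions is to run them against a few candidate strings. The sketch below assumes Python's `re` module and whole-string matching; the candidate strings and the names `SAMPLES` and `full` are invented for illustration.

```python
import re

def full(pattern: str, text: str) -> bool:
    """Return True when the pattern matches the whole string."""
    return re.fullmatch(pattern, text) is not None

# Each sample pattern is paired with hypothetical strings to try.
SAMPLES = {
    r"(snow|rain)(flake|drop)": ["snowflake", "rainflake", "snowrain"],
    r"g(rr|ee)*": ["g", "grree", "gr"],
    r"W.*W": ["WW", "WoW", "Wow"],
    r"B\.C\.": ["B.C.", "BxCx"],          # escaped periods are literal
    r"^Right now.$": ["Right now!", "Right now"],
    r"^Right now.\$": ["Right now.$", "Right now."],  # \$ is a literal $
}

for pattern, candidates in SAMPLES.items():
    for text in candidates:
        print(pattern, repr(text), full(pattern, text))
```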
Character Classes
Square brackets enclose a character class (a set of characters). The class will match any one character from the set.
Within brackets…
• specific characters can be listed
• ranges are denoted using -
Examples
• [aDb] matches a or D or b and nothing else
• [c-e] matches c or d or e and nothing else
• [a-z] matches any lowercase letter and nothing else
• [a-zA-Z0-9] matches any alphabetic or numeric symbol
• [a+*] matches a or + or * and nothing else
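The character class examples can be checked with a short Python sketch, assuming the standard `re` module. The helper name `one_char` is illustrative only.

```python
import re

def one_char(char_class: str, ch: str) -> bool:
    """Return True when the class matches the single character ch."""
    return re.fullmatch(char_class, ch) is not None

print(one_char(r"[aDb]", "D"), one_char(r"[aDb]", "d"))   # listed characters
print(one_char(r"[c-e]", "d"), one_char(r"[c-e]", "f"))   # a range
print(one_char(r"[a-zA-Z0-9]", "7"), one_char(r"[a-zA-Z0-9]", "_"))
# Inside brackets, + and * are ordinary characters, not quantifiers.
print(one_char(r"[a+*]", "+"), one_char(r"[a+*]", "*"), one_char(r"[a+*]", "b"))
```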
Examples
Which of the following match [a-z][0-9]* ?
• abc
• 1z93
• a-9
Which of the following match [0-9]*[02468] ?
• 03
• 9929
• 354
Give a pattern for social security numbers using character classes.
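One way to check answers to the questions above is sketched below in Python. Whether a string "matches" depends on whether the whole string or only part of it must fit the pattern, so the hypothetical helper `check` reports both: `re.fullmatch` for the whole string and `re.search` for any part. The `SSN` pattern is one possible answer, not the only one.

```python
import re

def check(pattern: str, candidates: list[str]) -> dict[str, tuple[bool, bool]]:
    """Map each candidate to (matches whole string, matches some part)."""
    return {
        text: (
            re.fullmatch(pattern, text) is not None,
            re.search(pattern, text) is not None,
        )
        for text in candidates
    }

first = check(r"[a-z][0-9]*", ["abc", "1z93", "a-9"])
second = check(r"[0-9]*[02468]", ["03", "9929", "354"])
print(first)
print(second)

# One possible Social Security Number pattern using character classes.
SSN = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"
print(re.fullmatch(SSN, "123-45-6789") is not None)
```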
Example 1: Phone Numbers
Create a regular expression to match phone numbers. The phone numbers can take on the following forms:
• 800-555-1212
• 800 555 1212
• 800.555.1212
• 1-800-555-1212
• 800-555-1212-1234
• 800-555-1212x1234
Example 1: Phone Numbers
• Divide and conquer. Note that each phone number has at most five parts.
  • prefix (the number 1)
  • area code
  • trunk (first three digits)
  • rest (next 4 digits)
  • extension (last digits; may be between 1 and 4 digits in length)
• Consider defining each of these parts
  • what is the prefix?
  • what is the area code?
  • what is the trunk?
  • what is the rest?
  • what is the extension?
Example 1: Phone Numbers
• We need to 'conquer' by combining the solutions for the parts.
• Rules:
  • The prefix is optional
  • One of the following must occur between the prefix and the area code: space, comma, dash, period
  • One of the following must occur between the area code and the trunk: space, comma, dash, period
  • One of the following must occur between the trunk and the rest: space, comma, dash, period
  • An ‘x’ must occur between the rest and the extension.
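The rules above can be sketched in Python by defining each part and then combining them. This is one possible solution, not the only one. It assumes the extension is optional, and it follows the 'x' rule literally, so the form 800-555-1212-1234 (dash before the extension) is not accepted by this sketch. All names are illustrative.

```python
import re

SEP = r"[ ,.-]"                  # space, comma, period, or dash
PREFIX = r"(1" + SEP + r")?"     # optional prefix 1 plus its separator
AREA = r"[0-9]{3}"
TRUNK = r"[0-9]{3}"
REST = r"[0-9]{4}"
EXT = r"(x[0-9]{1,4})?"          # assumed optional: x plus 1 to 4 digits

PHONE = PREFIX + AREA + SEP + TRUNK + SEP + REST + EXT

def is_phone(text: str) -> bool:
    """Return True when the whole string fits the phone number rules."""
    return re.fullmatch(PHONE, text) is not None

for text in ["800-555-1212", "800 555 1212", "800.555.1212",
             "1-800-555-1212", "800-555-1212x1234", "800-555-1212-1234"]:
    print(repr(text), is_phone(text))
```

Inside a character class the period is literal, and a dash placed last is literal as well, so `SEP` needs no escapes.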
Example 2: User Name
Suppose the rules for some system are that a user name must begin with a capital letter, followed by lowercase letters and/or dashes and/or periods. User names must be 3 to 16 characters long.
Examples
• Dave
• D.-riley
• Rdave
Invalid
• dave: doesn’t begin with a capital letter
• DDR3: capital letters and digits not permitted after the first symbol
• R: too short
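One possible pattern for these rules, sketched in Python. Since the first character is fixed as a capital letter, the remaining 2 to 15 characters come from a single character class. The names `USER_NAME` and `is_user_name` are illustrative.

```python
import re

# One capital letter, then 2 to 15 lowercase letters, periods, or dashes,
# for a total length of 3 to 16 characters.
USER_NAME = r"[A-Z][a-z.-]{2,15}"

def is_user_name(text: str) -> bool:
    """Return True when the whole string is a valid user name."""
    return re.fullmatch(USER_NAME, text) is not None

for text in ["Dave", "D.-riley", "Rdave", "dave", "DDR3", "R"]:
    print(repr(text), is_user_name(text))
```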
Example 3: MAC Address
Every computer network connection has a unique MAC address that is expressed as six numbers separated by colons. Each number consists of two hexadecimal digits.
Examples
• 10:22:93:04:91:00
• AF:0C:AA:ED:B7:21
Invalid
• 10:22:93:04:91 (too short)
• 10:22:013:04:91 (numbers must be two digits long, not three)
• AG:0C:AA:ED:B7:21 (the letter “G” is not a hexadecimal digit)
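A possible MAC address pattern, sketched in Python. Accepting lowercase hexadecimal digits (a-f) is an assumption; the slide's examples use only uppercase. The names are illustrative.

```python
import re

HEX_PAIR = r"[0-9A-Fa-f]{2}"                 # two hexadecimal digits
MAC = HEX_PAIR + r"(:" + HEX_PAIR + r"){5}"  # first pair, then five ":pair"

def is_mac(text: str) -> bool:
    """Return True when the whole string is a colon-separated MAC address."""
    return re.fullmatch(MAC, text) is not None

for text in ["10:22:93:04:91:00", "AF:0C:AA:ED:B7:21",
             "10:22:93:04:91", "10:22:013:04:91", "AG:0C:AA:ED:B7:21"]:
    print(repr(text), is_mac(text))
```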
Example 4: IPv4
Internet addresses are referred to as IP numbers. A common address consists of four non-negative integers separated by periods. Each integer must be within the range 0-255.
Examples
• 1.01.001.0
• 255.255.255.255
• 193.24.17.2
Invalid
• 256.255.255.255 (no number can be greater than 255)
• 193.24.175. (too few numbers)
• 193.24:17.2 (separators must be periods)
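Regular expressions cannot compare numbers, so the 0-255 limit has to be spelled out as digit patterns. The Python sketch below is one possible solution; it allows leading zeros (up to three digits in total) because 1.01.001.0 is listed as valid. The names are illustrative.

```python
import re

# 250-255, or 200-249, or 0-199 with optional leading zeros.
OCTET = r"(25[0-5]|2[0-4][0-9]|[01]?[0-9]?[0-9])"
IPV4 = OCTET + r"(\." + OCTET + r"){3}"   # first number, then three ".number"

def is_ipv4(text: str) -> bool:
    """Return True when the whole string is a dotted IPv4 address."""
    return re.fullmatch(IPV4, text) is not None

for text in ["1.01.001.0", "255.255.255.255", "193.24.17.2",
             "256.255.255.255", "193.24.175.", "193.24:17.2"]:
    print(repr(text), is_ipv4(text))
```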
Example 5: Email Addresses
• An email address consists of two strings separated by an @
  localString@domainString
• localString
  • Must be one or more of the following characters: alphabetic, digits (0 through 9), or any of these !#$&'+-_/=?^`{|}~
  • Periods are permitted but with the following restrictions: the first and last characters cannot be periods and there cannot be any consecutive periods.
  • Note: There is another unusual notation for selected characters only allowed inside double quotes, which we will ignore.
• domainString
  • Must be one or more of the following characters: alphabetic, digits, dashes or periods.
  • Alternately, the domain could be written as a pair of square brackets enclosing four numbers separated by periods, where each of the four numbers is a non-negative number of one to three digits, e.g., [138.93.200.0]
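A possible Python sketch of the email rules above, again built by divide and conquer. It follows only the rules stated on this slide (not the full email standard), treats the apostrophe in the character list as the ASCII ', and ignores the quoted-string notation. All names are illustrative.

```python
import re

# One permitted local character (the dash is last so it stays literal).
ATOM = r"[A-Za-z0-9!#$&'+_/=?^`{|}~-]"
# Runs of permitted characters separated by single periods:
# no leading, trailing, or consecutive periods.
LOCAL = ATOM + r"+(\." + ATOM + r"+)*"

DOMAIN_NAME = r"[A-Za-z0-9.-]+"
NUM = r"[0-9]{1,3}"
DOMAIN_IP = r"\[" + NUM + r"(\." + NUM + r"){3}\]"

EMAIL = LOCAL + r"@(" + DOMAIN_NAME + r"|" + DOMAIN_IP + r")"

def is_email(text: str) -> bool:
    """Return True when the whole string fits the slide's email rules."""
    return re.fullmatch(EMAIL, text) is not None

for text in ["gow.joe@uwlax.edu", "joe@[138.93.200.0]",
             ".joe@uwlax.edu", "joe.@uwlax.edu", "gow..joe@uwlax.edu"]:
    print(repr(text), is_email(text))
```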