Information Management DIG 3563: Lecture 2a: Regular Expressions Michael Moshell University of Central Florida
DO NOT BE A RABBIT!
If you don't know how to do something, don't hide under a bush. Tell me, or come see me.
Naturphoto.cz
Regular Expressions
• A "grammar" for validating input
• Useful for many kinds of pattern recognition
• The basic built-in matching function in PHP is called 'preg_match'.
• It takes two or three arguments:
  • the pattern, like "cat"
  • the test string, like "catastrophe"
  • and an (optional) array variable, which we can ignore for now
• It returns 1 (which PHP treats as TRUE) if the pattern matches anywhere in the test string, and 0 (treated as FALSE) if it does not.
Perl-Compatible Regular Expressions (PCRE)
preg_match uses Perl-compatible (PCRE) syntax, not POSIX syntax. The pattern always begins with "/ and ends with /" (for today's lesson).
    $instring = "catastrophe";
    if (preg_match("/cat/", $instring)) {
        print "I found a cat!";
    } else {
        print "No cat here.";
    }
Output: I found a cat!
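The slide's if/else works because PHP treats the result of preg_match as a truth value. As a minimal sketch of what the function actually hands back (the helper name contains_pattern is made up for this example and is not part of PHP):

```php
<?php
// preg_match() returns 1 on a match, 0 on no match,
// and false on error (for example, a malformed pattern).
$found   = preg_match('/cat/', 'catastrophe');
$missing = preg_match('/cat/', 'dogma');

var_dump($found);    // int(1)
var_dump($missing);  // int(0)

// Hypothetical helper: turn the integer result into a real boolean.
function contains_pattern(string $pattern, string $subject): bool
{
    return preg_match($pattern, $subject) === 1;
}

print contains_pattern('/cat/', 'catastrophe') ? "I found a cat!\n" : "No cat here.\n";
```

Comparing against 1 with === keeps an error (false) from being mistaken for "no match".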
PRACTICE 1
"/cat/" is the regular expression.
Make up a regular expression to recognize not the word cat, but rather the word dog. Write it on your paper, now.
Yes, I mean YOU. Where is your paper and pencil? (You can use your laptop if that's what you have.)
Answer: "/dog/"
Yep, it's that simple. But I gotta get you STARTED.
Regular Expressions
Wild cards: a period . matches any single character (except a newline, by default)
    $instring = "cotastrophe";
    if (preg_match("/c.t/", $instring)) {
        print "I found a matching string!";
    } else {
        print "No c.t here.";
    }
Output: I found a matching string!
Regular Expressions
Wild cards: a* matches any number of a characters (including none at all, i.e. the empty string)
    $instring = "caaaatastrophe";
    if (preg_match("/ca*t/", $instring)) {
        print "I found a match!";
    } else {
        print "No ca*t here.";
    }
Output: I found a match!
Regular Expressions
Wild cards: .* matches any string of characters (including the empty string)
    $instring = "cotastrophe";
    if (preg_match("/c.*t/", $instring)) {
        print "I found a c.*t!";
    } else {
        print "No c.*t here.";
    }
Output: I found a c.*t!
Regular Expressions
Wild cards: .* matches any string of characters (including the empty string), however long
    $instring = "cflippingmonstroustastrophe";
    if (preg_match("/c.*t/", $instring)) {
        print "I found a c.*t!";
    } else {
        print "No c.*t here.";
    }
Output: I found a c.*t!
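To compare the three wild-card patterns side by side, here is a small sketch that runs each of them against the same handful of strings. The test strings are chosen for illustration and are not from the lecture.

```php
<?php
$patterns = ['/c.t/', '/ca*t/', '/c.*t/'];
$subjects = ['cat', 'ct', 'cot', 'caaaat', 'cflippingmonstroust'];

// $results[pattern][subject] is true when the pattern matches.
$results = [];
foreach ($patterns as $pattern) {
    foreach ($subjects as $subject) {
        $results[$pattern][$subject] = preg_match($pattern, $subject) === 1;
        $verdict = $results[$pattern][$subject] ? 'match' : 'no match';
        print "$pattern vs $subject: $verdict\n";
    }
}
```

Only /c.*t/ accepts all five strings. /c.t/ insists on exactly one character between c and t, and /ca*t/ allows nothing but a's there.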
PRACTICE 2
Model REs for you: "/c.t/", "/c.*t/", "/ca*t/"
Make up a regular expression to recognize Rob or Rb or Roob or Rooob, etc., but to REJECT Reb and Rab and Rats and Mike.
Answer: "/Ro*b/"
Quantification
Multiple copies of something:
• a+ means ONE OR MORE a's. Example: "/fa+ther/" matches father, faather, faaather, etc.
• a* means ZERO OR MORE a's. Example: "/fa*ther/" matches fther, father, faather, etc.
• a? means ZERO OR ONE a. Example: "/flavou?r/" will match flavor AND flavour.
• a{33} means exactly 33 instances of a
How to recognize "Rob" or "Robb"? "/Robb?/"
Another way: "/Rob{1,2}/" (b{1,2} means one or two b's)
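One way to see the quantifiers at work is to build "fther", "father", "faather" and "faaather" and ask each pattern about each word. This is only a sketch; the {2} and {1,2} patterns are extra illustrations in the same spirit as a{33}.

```php
<?php
$patterns = [
    'a+'     => '/fa+ther/',
    'a*'     => '/fa*ther/',
    'a?'     => '/fa?ther/',
    'a{2}'   => '/fa{2}ther/',
    'a{1,2}' => '/fa{1,2}ther/',
];

// $rows[number of a's][quantifier] holds the preg_match result (1 or 0).
$rows = [];
foreach ([0, 1, 2, 3] as $n) {
    $word = 'f' . str_repeat('a', $n) . 'ther';
    foreach ($patterns as $label => $pattern) {
        $rows[$n][$label] = preg_match($pattern, $word);
    }
    print $word . ': ' . json_encode($rows[$n]) . "\n";
}
```

With zero a's only a* and a? match, and with three a's only a+ and a* do.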
Escaping
Backslash means "don't interpret this":
• \. is just a dot
• \* is just an asterisk
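A short sketch of the difference an escape makes. It uses single-quoted PHP strings so the backslash reaches the regex engine unchanged, and PHP's built-in preg_quote to escape a whole piece of text at once. The sample strings are made up for illustration.

```php
<?php
$wild    = preg_match('/a.b/', 'axb');   // unescaped dot: any character will do
$literal = preg_match('/a\.b/', 'axb');  // escaped dot: only a real "." will do
$dotted  = preg_match('/a\.b/', 'a.b');

print "wild=$wild literal=$literal dotted=$dotted\n"; // wild=1 literal=0 dotted=1

// preg_quote() adds the backslashes for you; the second argument
// also escapes the "/" delimiter if it appears in the text.
$needle  = '3.50*2';
$pattern = '/' . preg_quote($needle, '/') . '/';
print "$pattern\n"; // /3\.50\*2/

$quoted   = preg_match($pattern, '3x50002');    // 0: dot and star are literal now
$unquoted = preg_match('/3.50*2/', '3x50002');  // 1: dot and star act as wild cards
```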
The concept
    $t = "/a{3}\.b{1,4}/";
    $s = "aaa.bbb";
Would $s be accepted? Is preg_match($t, $s) true or false?
TRUE, because $s matches the pattern string $t: three a's, one dot, and between one and four b characters.
The concept
    $t = "/a{3}\.b{1,4}/";
    $s = "aaa.bbbbb";
Would $s be accepted? Is preg_match($t, $s) true or false?
Perhaps surprisingly, TRUE: $s contains three a's, a dot, and four b's, so the pattern matches the first part of the string ("aaa.bbbb") and the fifth b is simply ignored.
If you have $1.00 and I asked you "do you have 75 cents?" the answer would be YES.
If you wanted an EXACT match, I'll show you how in a bit.
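The optional third argument of preg_match makes the "75 cents out of a dollar" point visible: it receives the text that actually matched. A minimal sketch, with the pattern in single quotes so the backslash is passed through as-is:

```php
<?php
$t = '/a{3}\.b{1,4}/';
$s = 'aaa.bbbbb';

// $matches[0] is filled with the part of $s that the pattern matched.
$result = preg_match($t, $s, $matches);

print "result: $result\n";            // result: 1
print "matched text: $matches[0]\n";  // matched text: aaa.bbbb
```

The match stops after four b's because b{1,4} cannot take a fifth. The leftover b does not make the match fail.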
Grouping
Multiple copies of something:
• (abc)+ means ONE OR MORE copies of the string abc
• (abc)* means ZERO OR MORE copies of the string abc, like abcabcabc
SETS:
• [0-9] matches any single digit
• [A-Z] matches any single uppercase letter
• [AZ] matches A or Z
• [AZ]? (i.e. 0 or 1 of the previous) matches the empty string, A, or Z
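The parentheses decide what the quantifier repeats. The sketch below compares (abc)+ with abc+ by printing the text each one matched; first_match is a hypothetical helper written for this example, not a PHP built-in.

```php
<?php
// Hypothetical helper: return the matched text, or null if there is no match.
function first_match(string $pattern, string $subject): ?string
{
    return preg_match($pattern, $subject, $m) === 1 ? $m[0] : null;
}

print first_match('/(abc)+/', 'abcabcabc') . "\n"; // abcabcabc  (+ repeats the whole group)
print first_match('/abc+/', 'abcabcabc') . "\n";   // abc        (+ repeats only the c)
print first_match('/abc+/', 'abccc') . "\n";       // abccc
print first_match('/(abc)+/', 'abccc') . "\n";     // abc

// (abc)* is satisfied by zero copies, so it "matches" the empty string here.
var_dump(first_match('/(abc)*/', 'xyz'));          // string(0) ""
```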
Starting and Ending
preg_match("/cat/","abunchofcats") is TRUE
but preg_match("/^cat/","abunchofcats") is FALSE, because ^ means the match must start at the beginning of the string.
preg_match("/cats$/","abunchofcats") is TRUE
but preg_match("/cats$/","mycatsarelazy") is FALSE, because $ means the match must end at the end of the string.
So, ^ marks the head and $ marks the tail.
Exact Matching with ^ and $
    $t = "/^a{3}\.b{1,4}$/";
    $s = "aaa.bbbbb";
Would $s be accepted? Is preg_match($t, $s) true or false?
FALSE, because the ending $ in the pattern says "no more input is acceptable", but a fifth b follows the four that b{1,4} can match.
This would also reject $s = "aaa.bbbbAndMoreText";
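The effect of the anchors is easiest to see by running the same strings through both versions of the pattern. This is a sketch; the last test string, "xaaa.b", is an extra example that is not in the lecture.

```php
<?php
$unanchored = '/a{3}\.b{1,4}/';
$anchored   = '/^a{3}\.b{1,4}$/';

$subjects = ['aaa.bbb', 'aaa.bbbbb', 'aaa.bbbbAndMoreText', 'xaaa.b'];

// $table[subject] holds the preg_match result (1 or 0) for each pattern.
$table = [];
foreach ($subjects as $s) {
    $table[$s] = [
        'unanchored' => preg_match($unanchored, $s),
        'anchored'   => preg_match($anchored, $s),
    ];
    print "$s: unanchored={$table[$s]['unanchored']} anchored={$table[$s]['anchored']}\n";
}
```

The unanchored pattern accepts all four strings, while the anchored one accepts only "aaa.bbb". One detail worth knowing: in PHP's regex engine, $ also matches just before a single newline at the very end of the string.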
Alternatives - the 'or' mark |
    $t = "/flav(o|ou)r/";
This will match 'flavor' and 'flavour'.
And (yes!) there is often more than one way to do things; for instance, our good old ? mark: "/flavou?r/"
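A quick sketch to check that the two spellings of the pattern agree, and to show why the parentheses around (o|ou) matter. The words "flavr", "flavoour" and "hour" are extra test strings chosen for illustration.

```php
<?php
$withAlternation = '/flav(o|ou)r/';
$withOptional    = '/flavou?r/';

// Both patterns should give the same verdict on every word.
$agree = true;
foreach (['flavor', 'flavour', 'flavr', 'flavoour'] as $word) {
    $a = preg_match($withAlternation, $word);
    $b = preg_match($withOptional, $word);
    print "$word: alternation=$a optional=$b\n";
    $agree = $agree && ($a === $b);
}

// Without parentheses, | splits the WHOLE pattern: "flavo" OR "our".
$loose = preg_match('/flavo|our/', 'hour');     // 1, because "hour" contains "our"
$tight = preg_match('/flav(o|ou)r/', 'hour');   // 0
print "loose=$loose tight=$tight\n";            // loose=1 tight=0
```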
Sets - Examples
• [A-E]{3} matches AAA, ABA, ADD, ... EEE
• [PQX]{2,4} matches PP, PQ, PX ... up to XXXX
• [A-Za-z]+ matches any alphabetic string with 1 or more characters
• [A-Z][a-z]* matches any alpha string with first letter capitalized
• [a-z0-9]+ matches any string of lowercase letters and digits
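To try the set examples as whole-string tests, the sketch below wraps each one in ^ and $. Without the anchors, [A-Z][a-z]* would also be satisfied by the "D" inside "mcDonald". The helper matches_whole and the test strings are assumptions made for this example.

```php
<?php
// Hypothetical helper: true only if the ENTIRE subject fits the pattern body.
// (Simple concatenation is enough here because none of the bodies use |.)
function matches_whole(string $body, string $subject): bool
{
    return preg_match('/^' . $body . '$/', $subject) === 1;
}

// Each row: pattern body, test string, expected verdict.
$checks = [
    ['[A-E]{3}',     'ABA',     true],
    ['[A-E]{3}',     'ABF',     false],
    ['[PQX]{2,4}',   'XXXX',    true],
    ['[PQX]{2,4}',   'PQXPQ',   false],
    ['[A-Za-z]+',    'Hello',   true],
    ['[A-Za-z]+',    'Hello1',  false],
    ['[A-Z][a-z]*',  'Rob',     true],
    ['[A-Z][a-z]*',  'rob',     false],
    ['[a-z0-9]+',    'smith23', true],
    ['[a-z0-9]+',    'Smith23', false],
];

$allAsExpected = true;
foreach ($checks as [$body, $subject, $expected]) {
    $actual = matches_whole($body, $subject);
    print "$body vs $subject: " . ($actual ? 'match' : 'no match') . "\n";
    $allAsExpected = $allAsExpected && ($actual === $expected);
}
```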
Practice in class
From now on, the RE is just ^cat. You don't need to write the other stuff (preg_match, "/, etc.)
1) Write a RE that recognizes any string that begins with "sale".
Here's an example to help you remember: ^cat
Answer: ^sale
2) Write a RE that recognizes a string that begins with "smith" and a two-digit integer, like smith23 or smith99.
Here's an example from your recent past: a{3}\.b{1,4}
Answer: ^smith[0-9]{2}
3) Write a RE that recognizes Social Security numbers in a form like 123-45-6789.
Helpers from the recent past: ^smith[0-9]{2} and a{3}\.b{1,4}
Answer: [0-9]{3}\-[0-9]{2}\-[0-9]{4}
NOTE: That's a conservative answer. It turns out that the dash character is not a special symbol outside sets, and so you could also write [0-9]{3}-[0-9]{2}-[0-9]{4}
But I don't like to remember stuff, so I use \ a lot.
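Here is a sketch that tries both spellings of the answer on a few sample strings. The third pattern, $strict, adds ^ and $ and is an assumption about what you would want when validating a form field; the lecture's answer is the unanchored one.

```php
<?php
$escaped   = '/[0-9]{3}\-[0-9]{2}\-[0-9]{4}/';
$unescaped = '/[0-9]{3}-[0-9]{2}-[0-9]{4}/';
$strict    = '/^[0-9]{3}-[0-9]{2}-[0-9]{4}$/'; // assumption: whole-string variant

$samples = ['123-45-6789', '123456789', 'SSN: 123-45-6789.', '1234-45-67890'];

// $ssnTable[sample] = [escaped result, unescaped result, strict result]
$ssnTable = [];
foreach ($samples as $s) {
    $ssnTable[$s] = [
        preg_match($escaped, $s),
        preg_match($unescaped, $s),
        preg_match($strict, $s),
    ];
    print $s . ' => ' . implode(' ', $ssnTable[$s]) . "\n";
}
```

The escaped and unescaped versions agree on every sample. Only the anchored variant rejects strings that merely contain something shaped like an SSN.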
How to study this stuff?
• Practice making up REs for problems like these:
  • The UCF NID
  • French telephone numbers like (+33 5 23 46 22 91)
  • Dollars and cents, like $942.73
  • A field that may contain only lowercase strings with exactly ONE vowel.
• How do you know if they're good? If you know PHP, you can test them. Otherwise, check out each other's work.
• (OR come see me in office hours!) (Or by appointment!)
• 407 694 6763
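If you do know PHP, a tiny test harness makes checking your answers quick. The sketch below is one way to do it: check_pattern is a hypothetical helper, and the dollars-and-cents pattern is just one possible answer (an assumption, not the lecture's official solution).

```php
<?php
// Hypothetical helper: returns a list of complaints. An empty list means the
// pattern accepted and rejected every sample as expected.
function check_pattern(string $pattern, array $accept, array $reject): array
{
    $problems = [];
    foreach ($accept as $subject) {
        if (preg_match($pattern, $subject) !== 1) {
            $problems[] = "should accept: $subject";
        }
    }
    foreach ($reject as $subject) {
        if (preg_match($pattern, $subject) !== 0) {
            $problems[] = "should reject: $subject";
        }
    }
    return $problems;
}

// One possible dollars-and-cents pattern: a literal $, one or more digits,
// a literal dot, exactly two digits, and nothing else.
$dollars = '/^\$[0-9]+\.[0-9]{2}$/';

$problems = check_pattern(
    $dollars,
    ['$942.73', '$0.05'],
    ['942.73', '$942.7', '$942.730', '$.50']
);

print $problems === [] ? "Pattern looks good.\n" : implode("\n", $problems) . "\n";
```

Listing strings that must be rejected is as important as listing strings that must be accepted, since an overly loose pattern passes every positive test.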