
Textual Patterns


aysel

Presentation Transcript


  1. Textual Patterns: Finding that which is hidden
     http://www.flickr.com/photos/seadigs/4810947743/sizes/z/in/photostream/

  2. Patterns
     • Consider the text below. Identify the:
       • Proper names
       • Phone numbers
       • Email addresses
       • Dates
     To: Sajil
     From: Yaalini
     Hi Sajil,
     I'm throwing a birthday party for mom this week. Give me a call at 608-555-1234 or 618-508-0001 and we can arrange the details. I will be gone between 10/9/2011 until 10/12/2011 but I'll have my cell phone with me.
     See you
     Yallini
     yallini.vrinda@rckaa.edu
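As a preview of where the deck is heading, here is a minimal sketch in Python that pulls the phone numbers, dates, and email address out of the message. The three patterns are assumptions written for this one message, and they use notation that later slides introduce. Proper names are left out because no simple pattern identifies them reliably.

```python
import re

# The message from the slide, joined into one string.
message = (
    "Hi Sajil, I'm throwing a birthday party for mom this week. "
    "Give me a call at 608-555-1234 or 618-508-0001 and we can arrange "
    "the details. I will be gone between 10/9/2011 until 10/12/2011 but "
    "I'll have my cell phone with me. See you Yallini "
    "yallini.vrinda@rckaa.edu"
)

# Assumed shapes, chosen only to fit this message:
# three digits, dash, three digits, dash, four digits.
phones = re.findall(r"\d{3}-\d{3}-\d{4}", message)
# one or two digits, slash, one or two digits, slash, four digits.
dates = re.findall(r"\d{1,2}/\d{1,2}/\d{4}", message)
# word characters or periods on both sides of an '@'.
emails = re.findall(r"[\w.]+@[\w.]+", message)

print(phones)
print(dates)
print(emails)
```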

  3. Patterns
     • Have you ever been prompted to enter a date with a prompt like "MM/DD/YYYY"?
     • How can we define a "pattern"? For example:
       • We might say that a social security number has nine digits, with a dash after the first three digits and another after the next two.
     • Do the following strings match this definition?
       • 123-456-789
       • 123-45-6789
       • Abc-de-fghi
       • 123456789
     • A regular expression is a pattern that some strings match and others do not.
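The English definition of a social security number can be checked by hand before any pattern notation is introduced. The sketch below is one possible reading of that definition; the function name `matches_ssn` is hypothetical.

```python
def matches_ssn(text):
    """Check the definition: nine digits, with a dash after the
    first three digits and another after the next two."""
    # Nine digits plus two dashes gives eleven characters.
    if len(text) != 11:
        return False
    # The dashes must sit at positions 3 and 6 (counting from 0).
    if text[3] != "-" or text[6] != "-":
        return False
    # Everything else must be a digit.
    digits = text[:3] + text[4:6] + text[7:]
    return all(ch in "0123456789" for ch in digits)


candidates = ["123-456-789", "123-45-6789", "Abc-de-fghi", "123456789"]
for text in candidates:
    print(text, matches_ssn(text))
```

Only 123-45-6789 satisfies the definition. The first string has its second dash in the wrong place, the third has letters where digits belong, and the last has no dashes.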

  4. The Language of Regular Expressions
     • Let's define the "pattern language".
     • Any single character is a pattern.
       • The character represents itself.
     • The '.' (period) is a pattern.
       • The period represents "any character".
     • If A and B are both patterns, then so are:
       • AB : this represents the pattern A followed by the pattern B.
         • "P." is a pattern that matches "PA", "PR", and "PQ" but not "pa".
       • A|B : this represents either the pattern A or the pattern B.
         • "(P|Q)" is a pattern that matches "P" and "Q" but not "R".
       • (A) : this represents a group; the parentheses are special. Anything in parentheses is a group, and a group is one "thing".
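These building blocks can be tried out in Python's `re` module. The sketch assumes "matches" means the whole string fits the pattern, which is what `re.fullmatch` tests; the helper name `matches` is hypothetical.

```python
import re


def matches(pattern, text):
    """Return True when the entire text fits the pattern."""
    return re.fullmatch(pattern, text) is not None


# Concatenation with a period: "P" followed by any one character.
print(matches("P.", "PA"))     # True
print(matches("P.", "PQ"))     # True
print(matches("P.", "pa"))     # False: patterns are case sensitive

# Alternation inside a group: either "P" or "Q".
print(matches("(P|Q)", "P"))   # True
print(matches("(P|Q)", "R"))   # False

# A group is one "thing", so it can be followed by another pattern.
print(matches("(P|Q).", "QX")) # True
```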

  5. Examples
     • We might want to say that a social security number has nine characters, each of which is a digit. A dash occurs after the first three digits and then again after the next two. Can you write down a pattern for this? It will be ugly!
     • (0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)-(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)-(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)
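Rather than typing the ugly pattern by hand, it can be assembled from one digit group. This is a sketch using only concatenation, alternation, and groups; the names `DIGIT`, `SSN_PATTERN`, and `is_ssn` are hypothetical.

```python
import re

# One digit, spelled out with alternation only.
DIGIT = "(0|1|2|3|4|5|6|7|8|9)"

# Concatenate: three digits, a dash, two digits, a dash, four digits.
SSN_PATTERN = DIGIT * 3 + "-" + DIGIT * 2 + "-" + DIGIT * 4


def is_ssn(text):
    """Return True when the entire text fits the long-hand pattern."""
    return re.fullmatch(SSN_PATTERN, text) is not None


print(SSN_PATTERN)
print(is_ssn("123-45-6789"))  # True
print(is_ssn("123-456-789"))  # False
```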

  6. Repeating Patterns within Patterns
     • Quantifiers are used to allow and constrain repetitions. If A is a pattern, then so are:
       • A* : this represents zero or more repetitions of A.
       • A+ : this represents one or more repetitions of A.
       • A? : this represents zero or one occurrences of A.
       • A{n} : this represents exactly n repetitions of A.
       • A{m,n} : this represents at least m and no more than n repetitions of A.
     • Give a regex for social security numbers:
       • (0|1|2|3|4|5|6|7|8|9){3}-(0|1|2|3|4|5|6|7|8|9){2}-(0|1|2|3|4|5|6|7|8|9){4}
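Each quantifier can be checked against a few short strings. The sketch below uses `re.fullmatch` so that the whole string has to fit; the helper name `matches` and the test strings are illustrative choices.

```python
import re


def matches(pattern, text):
    """Return True when the entire text fits the pattern."""
    return re.fullmatch(pattern, text) is not None


print(matches("ab*", "a"))         # True: zero repetitions of "b"
print(matches("ab+", "a"))         # False: at least one "b" is required
print(matches("ab?", "abb"))       # False: at most one "b" is allowed
print(matches("a{3}", "aaa"))      # True: exactly three
print(matches("a{2,4}", "aaaaa"))  # False: five is more than four
print(matches("(ab){2}", "abab"))  # True: the quantifier repeats the whole group

# The social security number pattern from the slide.
DIGIT = "(0|1|2|3|4|5|6|7|8|9)"
SSN = DIGIT + "{3}-" + DIGIT + "{2}-" + DIGIT + "{4}"
print(matches(SSN, "123-45-6789"))  # True
```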

  7. Character Classes
     • Square brackets denote a character class (a set of characters). The class will match any one character in the set.
     • Inside the brackets, characters are just listed one by one.
     • The '-' is used to denote a range.
     • Examples (most characters that are special elsewhere lose their special meaning inside a character class):
       • [abc] will match "a" or "b" or "c" and nothing else
       • [a-c] will match "a" or "b" or "c" and nothing else
       • [a-z] will match any lowercase alphabetic symbol
       • [a-zA-Z0-9] will match any alphanumeric symbol
       • [a+] will match either an "a" or a "+"
     • A caret at the start of a set, however, is special. It means "not".
       • [^a] will match any character but "a"
       • [^0-9] will match any character but a digit
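The character classes above can be exercised one character at a time. This is a small sketch with Python's `re` module; the helper name `matches` and the test characters are illustrative choices.

```python
import re


def matches(pattern, text):
    """Return True when the entire text fits the pattern."""
    return re.fullmatch(pattern, text) is not None


print(matches("[abc]", "b"))          # True
print(matches("[a-c]", "d"))          # False: "d" is outside the range
print(matches("[a-z]", "Q"))          # False: uppercase is not in a-z
print(matches("[a-zA-Z0-9]", "7"))    # True
print(matches("[a+]", "+"))           # True: "+" is literal inside brackets
print(matches("[^a]", "a"))           # False: the caret negates the set
print(matches("[^0-9]", "x"))         # True: "x" is not a digit
```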

  8. Examples
     • Which of the following match [a-z][0-9]*
       • abc
       • 1z93
       • a-9
     • Which of the following match [0-9]*[^02468]
       • 03
       • 999
       • 354
     • Give a pattern for social security numbers
       • [0-9]{3}-[0-9]{2}-[0-9]{4}
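The two exercises can be checked mechanically. The sketch assumes that "match" means the entire string must fit the pattern (`re.fullmatch`); under a "contains a match somewhere" reading (`re.search`) the answers would differ.

```python
import re


def matches(pattern, text):
    """Return True when the entire text fits the pattern."""
    return re.fullmatch(pattern, text) is not None


# Exercise 1: one lowercase letter followed by zero or more digits.
first = {text: matches("[a-z][0-9]*", text) for text in ["abc", "1z93", "a-9"]}
print(first)

# Exercise 2: zero or more digits, then one character that is not 0, 2, 4, 6, or 8.
second = {text: matches("[0-9]*[^02468]", text) for text in ["03", "999", "354"]}
print(second)
```

Under the whole-string reading, none of the three strings in the first exercise match, since each has something other than digits after its first letter or does not start with a letter. In the second exercise, 03 and 999 match and 354 does not, because its last character is 4. Note that [^02468] also accepts non-digits, so the second pattern is looser than "an odd number".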

  9. Predefined classes
     • Some character classes are common and have shorthand definitions:
       • \d : matches any decimal digit; equivalent to [0-9]
       • \D : matches any non-digit character; equivalent to [^0-9]
       • \w : matches any 'word' character; equivalent to [a-zA-Z0-9_]
       • \W : matches any non-word character; equivalent to [^a-zA-Z0-9_]
     • Also, there are two interesting 'location' matches:
       • $ : matches at the end of a string (or just before a trailing newline); in multiline mode it also matches before each newline
       • ^ : matches at the start of a string; in multiline mode it also matches right after each newline
     • Give a regex for social security numbers:
       • \d{3}-\d{2}-\d{4}
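A short sketch of the shorthand classes and the two location matches, using Python's `re` module. The sample strings are illustrative; `re.MULTILINE` is the flag Python uses to make `^` and `$` work line by line.

```python
import re

# Shorthand classes: raw strings keep the backslashes intact.
ssn_ok = re.fullmatch(r"\d{3}-\d{2}-\d{4}", "123-45-6789") is not None
underscore_is_word = re.fullmatch(r"\w", "_") is not None
space_is_word = re.fullmatch(r"\w", " ") is not None
print(ssn_ok, underscore_is_word, space_is_word)  # True True False

text = "one two\nthree four"

# Without MULTILINE, ^ and $ refer to the whole string.
print(re.findall(r"^\w+", text))                # ['one']
print(re.findall(r"\w+$", text))                # ['four']

# With MULTILINE, they also match around each newline.
print(re.findall(r"^\w+", text, re.MULTILINE))  # ['one', 'three']
print(re.findall(r"\w+$", text, re.MULTILINE))  # ['two', 'four']
```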

  10. Example: Phone Numbers
     • Consider creating a regular expression to match phone numbers. The phone numbers can take on the following forms:
       • 800-555-1212
       • 800 555 1212
       • 800.555.1212
       • 1-800-555-1212
       • 800-555-1212-1234
       • 800-555-1212x1234
       • 800-555-1212 ext. 1234
       • 1-800 555.1212 #1234

  11. Example: Phone Numbers
     • Divide and conquer!
     • Note that each phone number has at most five parts:
       • prefix (the number 1)
       • area code
       • trunk (first three digits after the area code)
       • rest (next four digits)
       • extension (last digits; may be between 1 and 4 digits long)
     • Consider defining each of these parts:
       • what is the prefix?
       • what is the area code?
       • what is the trunk?
       • what is the rest?
       • what is the extension?
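One possible set of answers to the five questions, written as Python strings. The slide does not say how long the area code is; three digits is an assumption taken from the example numbers, and all five names are hypothetical.

```python
import re

PREFIX = r"1"         # the number 1
AREA_CODE = r"\d{3}"  # assumed: three digits, as in the examples
TRUNK = r"\d{3}"      # first three digits after the area code
REST = r"\d{4}"       # next four digits
EXTENSION = r"\d{1,4}"  # between one and four digits

# Each part can be tested on its own before the parts are combined.
print(re.fullmatch(AREA_CODE, "800") is not None)     # True
print(re.fullmatch(REST, "121") is not None)          # False: too short
print(re.fullmatch(EXTENSION, "12345") is not None)   # False: too long
```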

  12. Example: Phone Numbers
     • We need to 'conquer' by combining the solutions for the parts.
     • Rules:
       • The prefix is optional.
       • One of the following must occur between the prefix and the area code: space, comma, dash, period.
       • One of the following must occur between the area code and the trunk: space, comma, dash, period.
       • One of the following must occur between the trunk and the rest: space, comma, dash, period.
       • One of the following must occur between the rest and the extension: space, 'x', ' ext. ', dash, octothorpe.
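The rules above can be combined into a single pattern. The sketch below is one way to do it, not the deck's own answer. It makes two assumptions: the extension is optional, like the prefix, and an optional space may precede the octothorpe, because the last sample number uses " #" even though the rules list one separator. The names `SEP`, `EXT_SEP`, and `PHONE` are hypothetical.

```python
import re

# Separator between prefix, area code, trunk, and rest.
# The dash is last inside the brackets so it is read literally.
SEP = r"[ ,.-]"

# Separator before the extension. " ext. " is tried first, then "#"
# with an optional leading space (an assumption), then the
# single-character options.
EXT_SEP = r"( ext\. | ?#|x|-| )"

PHONE = (
    r"(1" + SEP + r")?"           # optional prefix and its separator
    + r"\d{3}" + SEP              # area code
    + r"\d{3}" + SEP              # trunk
    + r"\d{4}"                    # rest
    + r"(" + EXT_SEP + r"\d{1,4})?"  # optional extension
)

samples = [
    "800-555-1212",
    "800 555 1212",
    "800.555.1212",
    "1-800-555-1212",
    "800-555-1212-1234",
    "800-555-1212x1234",
    "800-555-1212 ext. 1234",
    "1-800 555.1212 #1234",
]

for number in samples:
    print(number, re.fullmatch(PHONE, number) is not None)
```

All eight sample forms fit this pattern. A number with no separators, such as 8005551212, does not, because the rules require a separator between the parts.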
