Regular Expressions
Simple Example • Sequence of simple characters:
Optionality and Repetition • /[Ww]oodchucks?/ matches woodchucks, Woodchucks, woodchuck, Woodchuck • /colou?r/ matches color or colour • /he{3}/ matches heee • /(he){3}/ matches hehehe • /(he){3,}/ matches a sequence of at least 3 he's
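The bullets above can be checked mechanically. This is a minimal sketch that uses Python's `re` module as a stand-in for the slide's /.../ notation; the helper name `matches` and the negative examples are assumptions added for illustration, not part of the slides.

```python
import re

# Pattern -> candidate strings. fullmatch is used so that only
# whole-string matches count, which makes the negative cases visible.
patterns = {
    r"[Ww]oodchucks?": ["woodchuck", "Woodchucks", "woodchuckss"],
    r"colou?r": ["color", "colour", "colouur"],
    r"he{3}": ["heee", "hehehe"],        # {3} applies to the "e" only
    r"(he){3}": ["hehehe", "heee"],      # {3} applies to the group "he"
    r"(he){3,}": ["hehe", "hehehe", "hehehehe"],
}

def matches(pattern, text):
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

for pattern, candidates in patterns.items():
    for text in candidates:
        verdict = "matches" if matches(pattern, text) else "rejects"
        print(f"/{pattern}/ {verdict} {text!r}")
```

The contrast between `he{3}` and `(he){3}` shows that a counter applies to the single preceding item unless parentheses group a longer unit.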
Operator Precedence Hierarchy (highest to lowest) 1. Parentheses () 2. Counters * + ? {} 3. Sequences and anchors: the ^my end$ 4. Disjunction | Examples: /moo+/ /try|ies/ /and|or/
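A short sketch of how the hierarchy plays out on the slide's examples, using Python's `re` module (which follows the same precedence). The helper `full` and the parenthesized variants are assumptions added for contrast.

```python
import re

def full(pattern, text):
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

checks = {
    # Counters bind tighter than sequence: + repeats only the last "o".
    ("moo+", "mooooo"): full(r"moo+", "mooooo"),
    ("moo+", "moomoo"): full(r"moo+", "moomoo"),
    # Parentheses bind tightest, so the whole group is repeated.
    ("(moo)+", "moomoo"): full(r"(moo)+", "moomoo"),
    # Sequence binds tighter than disjunction: "try" OR "ies".
    ("try|ies", "ies"): full(r"try|ies", "ies"),
    ("try|ies", "tries"): full(r"try|ies", "tries"),
    # Grouping restricts the disjunction to the suffix.
    ("tr(y|ies)", "tries"): full(r"tr(y|ies)", "tries"),
}

for (pattern, text), result in checks.items():
    print(f"/{pattern}/ on {text!r}: {result}")
```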
A Simple Exercise • Write a regular expression to find all instances of the determiner “the”: /the/ /[tT]he/ /\b[tT]he\b/ /(^|[^a-zA-Z])[tT]he[^a-zA-Z]/ The recent attempt by the police to retain their current rates of pay has not gathered much favor with the southern factions.
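To see why each successive pattern is an improvement, the four attempts can be run against the example sentence. This sketch uses Python's `re`; the fourth pattern is written with its closing parenthesis after the first character class, and the helper `count_matches` is an assumed name.

```python
import re

SENTENCE = ("The recent attempt by the police to retain their current rates "
            "of pay has not gathered much favor with the southern factions.")

ATTEMPTS = [
    r"the",                            # misses "The", hits "their", "gathered"
    r"[tT]he",                         # finds "The", still hits substrings
    r"\b[tT]he\b",                     # word boundaries on both sides
    r"(^|[^a-zA-Z])[tT]he[^a-zA-Z]",   # same idea without \b
]

def count_matches(pattern, text):
    """Count non-overlapping matches of pattern in text."""
    return sum(1 for _ in re.finditer(pattern, text))

counts = {pattern: count_matches(pattern, SENTENCE) for pattern in ATTEMPTS}
for pattern, n in counts.items():
    print(f"/{pattern}/ finds {n} match(es)")
```

Only the last two patterns find exactly the three determiners; the first two also fire inside "their", "gathered", and "southern".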
Substitutions (Transductions) • Sed or ‘s’ operator in Perl • s/regexp1/pattern/ • s/I am feeling (.+)/You are feeling \1?/ • s/I gave (.+) to (.+)/Why would you give \2 \1?/
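The same substitutions can be sketched with Python's `re.subn`, which accepts `\1` and `\2` back-references in the replacement template just as the s operator does. The `RULES` list and the `respond` function are assumed names for illustration.

```python
import re

# (pattern, replacement) pairs taken from the two substitution rules.
RULES = [
    (r"I am feeling (.+)", r"You are feeling \1?"),
    (r"I gave (.+) to (.+)", r"Why would you give \2 \1?"),
]

def respond(utterance):
    """Apply the first rule whose pattern occurs in the utterance."""
    for pattern, template in RULES:
        rewritten, n_subs = re.subn(pattern, template, utterance)
        if n_subs:
            return rewritten
    return utterance  # no rule applied: echo the input unchanged

print(respond("I am feeling sad"))
print(respond("I gave the book to Mary"))
```

In the second rule the captured groups are emitted in swapped order, which is what turns "gave X to Y" into "give Y X".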
Finite State Automata • Regular Expressions (REs) can be viewed as a way to describe machines called Finite State Automata (FSA, also known as automata, finite automata). • FSAs and their close variants are a theoretical foundation of much of the field of NLP.
Finite State Automata [Figure: state diagram with states q0, q1, q2, q3, q4 and arcs labeled b, a, a, a, !] • FSAs recognize the regular languages represented by regular expressions • SheepTalk: /baa+!/ • Directed graph with labeled nodes and arc transitions • Five states: q0 the start state, q4 the final state, 5 transitions
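A minimal sketch of the SheepTalk recognizer as a transition table, assuming the usual layout for /baa+!/: arcs b, a, a lead from q0 to q3, q3 loops on a, and ! leads to q4. The names `TRANSITIONS` and `accepts` are illustrative.

```python
# States are numbered 0..4 for q0..q4. Each key is (state, symbol).
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,   # self-loop: any number of additional a's
    (3, "!"): 4,
}
START_STATE = 0
FINAL_STATE = 4

def accepts(text):
    """Return True if the automaton ends in the final state on text."""
    state = START_STATE
    for symbol in text:
        state = TRANSITIONS.get((state, symbol))
        if state is None:      # no arc for this symbol: reject
            return False
    return state == FINAL_STATE

for sample in ["baa!", "baaaa!", "ba!", "baa", "baa!a"]:
    print(sample, accepts(sample))
```

The table has exactly five entries, one per transition, and a string is accepted only if the input is exhausted while the machine sits in q4.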
Regular Expressions for XML Schema
• Recall that the string datatype has a pattern facet. The value of a pattern facet is a regular expression. Below are some examples of regular expressions:

Regular Expression    Example
Chapter \d            Chapter 1
a*b                   b, ab, aab, aaab, …
[xyz]b                xb, yb, zb
a?b                   b, ab
a+b                   ab, aab, aaab, …
[a-c]x                ax, bx, cx
Regular Expressions (cont.)

Regular Expression    Example
[a-c]x                ax, bx, cx
[-ac]x                -x, ax, cx
[ac-]x                ax, cx, -x
[^0-9]x               any non-digit char followed by x
\Dx                   any non-digit char followed by x
Chapter\s\d           Chapter followed by a blank followed by a digit
(ho){2} there         hoho there
(ho\s){2}there        ho ho there
.abc                  any (one) char followed by abc
(a|b)+x               ax, bx, aax, bbx, abx, bax, ...
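The rows of this table can be spot-checked with Python's `re`, which agrees with XML Schema on these particular constructs. This is a sketch under that assumption; `TABLE` and `row_holds` are illustrative names. Note that the `(ho\s){2}` row is written here without a literal space before `there`, because the second `\s` already consumes the blank in "ho ho there".

```python
import re

# Pattern -> example strings that the table says should match.
TABLE = {
    r"[a-c]x": ["ax", "bx", "cx"],
    r"[-ac]x": ["-x", "ax", "cx"],      # leading hyphen is a literal
    r"[ac-]x": ["ax", "cx", "-x"],      # trailing hyphen is a literal
    r"[^0-9]x": ["ax", "-x"],
    r"\Dx": ["ax", "-x"],
    r"Chapter\s\d": ["Chapter 1"],
    r"(ho){2} there": ["hoho there"],
    r"(ho\s){2}there": ["ho ho there"],
    r".abc": ["xabc"],
    r"(a|b)+x": ["ax", "bx", "aax", "bbx", "abx", "bax"],
}

def row_holds(pattern, examples):
    """Return True if every example matches the pattern in full."""
    return all(re.fullmatch(pattern, e) is not None for e in examples)

results = {pattern: row_holds(pattern, ex) for pattern, ex in TABLE.items()}
for pattern, ok in results.items():
    print(f"{pattern:20} {ok}")
```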
Regular Expressions (cont.)

Regular Expression    Example
a{1,3}x               ax, aax, aaax
a{2,}x                aax, aaax, aaaax, …
\w\s\w                a word character, followed by a whitespace character, followed by a word character (in XML Schema, \w is any character other than punctuation, separators, and control characters, so it does not match a dash)

• A string composed of any lower and upper case letters, except "O" and "l"
  [a-zA-Z-[Ol]]*
• The period "." (without the backward slash, the period means "any character")
  \.
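Character class subtraction such as `[a-zA-Z-[Ol]]` is an XML Schema feature that Python's `re` module does not offer. A rough equivalent, sketched here as an assumption rather than an exact translation, places a negative lookahead in front of each letter; `NO_O_OR_L` and `is_allowed` are illustrative names.

```python
import re

# Emulates the XSD pattern [a-zA-Z-[Ol]]* : zero or more ASCII letters,
# none of which is uppercase "O" or lowercase "l".
NO_O_OR_L = re.compile(r"(?:(?![Ol])[a-zA-Z])*")

def is_allowed(text):
    """Return True if text consists only of letters other than O and l."""
    return NO_O_OR_L.fullmatch(text) is not None

for sample in ["Heap", "open", "Lab", "Open", "Hello", "abc1", ""]:
    print(repr(sample), is_allowed(sample))
```

Lowercase "o" and uppercase "L" remain allowed, since only the two visually confusable letters are subtracted.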
Regular Expressions (cont.)

Escape    Meaning
\n        linefeed
\r        carriage return
\t        tab
\\        The backward slash \
\|        The vertical bar |
\-        The hyphen -
\^        The caret ^
\?        The question mark ?
\*        The asterisk *
\+        The plus sign +
\{        The open curly brace {
\}        The close curly brace }
\(        The open paren (
\)        The close paren )
\[        The open square bracket [
\]        The close square bracket ]
Regular Expressions (concluded)

Escape    Meaning
\p{L}     A letter, from any language
\p{Lu}    An uppercase letter, from any language
\p{Ll}    A lowercase letter, from any language
\p{N}     A number - Roman, fractions, etc
\p{Nd}    A digit from any language
\p{P}     A punctuation symbol
\p{Sc}    A currency sign, from any language

"currency sign from any language, followed by one or more digits from any language, optionally followed by a period and two digits from any language"

<xsd:simpleType name="money">
    <xsd:restriction base="xsd:string">
        <xsd:pattern value="\p{Sc}\p{Nd}+(\.\p{Nd}\p{Nd})?"/>
    </xsd:restriction>
</xsd:simpleType>
<xsd:element name="cost" type="money"/>

<cost>$45.99</cost>
<cost>¥300</cost>
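Python's built-in `re` module does not support `\p{...}` escapes, so the money pattern cannot be pasted into it directly. The sketch below checks the same rule by hand with `unicodedata.category`, whose two-letter codes (`Sc`, `Nd`) are the Unicode general categories that the `\p{...}` escapes refer to. The function name `is_money` is an assumption.

```python
import unicodedata

def all_decimal_digits(text):
    """True if text is non-empty and every character is in category Nd."""
    return bool(text) and all(unicodedata.category(c) == "Nd" for c in text)

def is_money(text):
    """Hand-written check mirroring \\p{Sc}\\p{Nd}+(\\.\\p{Nd}\\p{Nd})?"""
    # \p{Sc}: the first character must be a currency sign.
    if not text or unicodedata.category(text[0]) != "Sc":
        return False
    whole, dot, fraction = text[1:].partition(".")
    # \p{Nd}+: one or more decimal digits.
    if not all_decimal_digits(whole):
        return False
    # (\.\p{Nd}\p{Nd})?: if a period is present, exactly two digits follow.
    if dot:
        return len(fraction) == 2 and all_decimal_digits(fraction)
    return True

for sample in ["$45.99", "¥300", "45.99", "$45.9", "$45."]:
    print(sample, is_money(sample))
```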
Example R.E.

[1-9]?[0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]

Alternative       Range
[1-9]?[0-9]       0 to 99
1[0-9][0-9]       100 to 199
2[0-4][0-9]       200 to 249
25[0-5]           250 to 255

This regular expression restricts a string to values between 0 and 255. Such an R.E. might be useful in describing an IP address.
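The claim that the four alternatives cover exactly 0 to 255 can be verified exhaustively. This sketch uses Python's `re.fullmatch`, which, like an XML Schema pattern, requires the whole string to match; `OCTET` and `is_octet` are assumed names.

```python
import re

OCTET = r"[1-9]?[0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]"

def is_octet(text):
    """Return True if text is a decimal number from 0 to 255."""
    return re.fullmatch(OCTET, text) is not None

# Try every integer from 0 to 999 and collect the ones accepted.
accepted = [n for n in range(1000) if is_octet(str(n))]
print(accepted[0], accepted[-1], len(accepted))

# Leading zeros are rejected because [1-9]? cannot match "0".
print(is_octet("07"))
```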
IP Datatype Definition

<xsd:simpleType name="IP">
    <xsd:restriction base="xsd:string">
        <xsd:pattern value="(([1-9]?[0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])\.){3}
                            ([1-9]?[0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])">
            <xsd:annotation>
                <xsd:documentation>
                    Datatype for representing IP addresses. Examples:
                    129.83.64.255, 64.128.2.71, etc.
                    This datatype restricts each field of the IP address
                    to have a value between zero and 255, i.e.,
                    [0-255].[0-255].[0-255].[0-255]
                    Note: in the value attribute (above) the regular
                    expression has been split over two lines. This is
                    for readability purposes only. In practice the R.E.
                    would all be on one line.
                </xsd:documentation>
            </xsd:annotation>
        </xsd:pattern>
    </xsd:restriction>
</xsd:simpleType>
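The pattern in the IP datatype can be tried out on one line in Python. This is a sketch, not a schema validator: it only exercises the regular expression, and the names `OCTET`, `IP_PATTERN`, and `is_ip` are assumptions.

```python
import re

OCTET = r"[1-9]?[0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]"

# Three "octet plus period" groups followed by a final octet.
# Each octet is parenthesized so that | does not leak outside its field.
IP_PATTERN = re.compile(r"((" + OCTET + r")\.){3}(" + OCTET + r")")

def is_ip(text):
    """Return True if text is a dotted quad with each field in 0..255."""
    return IP_PATTERN.fullmatch(text) is not None

for sample in ["129.83.64.255", "64.128.2.71", "256.1.1.1", "1.2.3", "1.2.3.4.5"]:
    print(sample, is_ip(sample))
```

The parentheses around each octet matter: without them, the disjunction would split the whole expression rather than a single field.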
Regex for different platform line breaks?

Different systems use different line break characters. How do you handle this in XML? Read on and find out.

Consider this XML document (line break characters are explicitly shown):

<?xml version="1.0"?> \r\n
<Test> \r\n
<para xml:space="preserve">This is a \r\n
simple paragraph. What \r\n
do you think of it?</para> \r\n
</Test> \r\n

When an XML parser reads in this document it "normalizes" ALL line breaks. Thus, after normalization the XML document looks like this:

<?xml version="1.0"?> \n
<Test> \n
<para xml:space="preserve">This is a \n
simple paragraph. What \n
do you think of it?</para> \n
</Test> \n

Things to note:
1. All line breaks have been normalized to \n. Consequence: you don't have to be concerned about different platforms using different line break characters, since all XML documents will have their line break characters normalized to \n regardless of the platform. (So, if you're writing an XML Schema regular expression you can simply use \n to indicate a line break, regardless of the platform.)
2. The xml:space="preserve" attribute has no impact on line break normalization.
3. Suppose that you want a line break character in your XML document other than \n. For example, suppose that you want \r in your XML document. By default, it would get normalized to \n. To prevent this, use a character reference: &#xD;
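Line break normalization can be observed with Python's standard-library parser, `xml.etree.ElementTree`. The sketch below uses a shortened version of the document with Windows-style \r\n line breaks and one carriage return written as the character reference `&#xD;`; the variable names are assumptions.

```python
import xml.etree.ElementTree as ET

# Raw bytes with \r\n line breaks, plus one carriage return supplied
# as a character reference at the end of the paragraph.
raw = (
    b'<?xml version="1.0"?>\r\n'
    b'<Test>\r\n'
    b'<para xml:space="preserve">This is a\r\n'
    b'simple paragraph.&#xD;</para>\r\n'
    b'</Test>\r\n'
)

root = ET.fromstring(raw)
para_text = root.find("para").text

# repr() makes the surviving line break characters visible.
print(repr(para_text))
```

The literal \r\n inside the paragraph comes back as a single \n, while the carriage return written as a character reference survives, even though xml:space="preserve" is set in both cases.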
 Do Lab 5