Chapter 2. Regular Expressions and Automata: 2.1 Regular Expressions. March 30, 2007. Kim Minho, Artificial Intelligence Laboratory, Pusan National University. Text: Speech and Language Processing, pp. 21-33.
Outline • Introduction • Basic Regular Expression Patterns • Disjunction, Grouping, and Precedence • A Simple Example • A More Complex Example • Advanced Operators • Regular Expression Substitution, Memory, and ELIZA
Introduction • Regular expressions: one of the unsung successes of standardization in computer science • a language for specifying text search strings • an algebraic notation for characterizing a set of strings • regular expression search • requires a pattern that we want to search for • a search function then goes through the corpus and returns all texts that contain the pattern
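A minimal sketch of regular expression search over a corpus, assuming Python's `re` module stands in for the search function described above. The name `search_corpus` and the sample corpus are hypothetical, chosen only for illustration.

```python
import re

def search_corpus(pattern: str, corpus: list[str]) -> list[str]:
    """Return every text in the corpus that contains a match for pattern."""
    compiled = re.compile(pattern)
    return [text for text in corpus if compiled.search(text)]

# Hypothetical three-document corpus.
corpus = ["I saw a woodchuck", "Woodchucks chuck wood", "No rodents here"]
hits = search_corpus(r"[wW]oodchuck", corpus)
print(hits)
```

The function returns whole texts rather than the matched substrings, which mirrors the "return all texts that contain the pattern" behavior on the slide.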
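Basic Regular Expression Patterns (1/6) • the slash / delimits a regular expression (Perl notation); it is not part of the pattern itself • metacharacter: the square brackets [ ] specify a disjunction of characters, e.g., /[wW]oodchuck/ matches Woodchuck or woodchuck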
Basic Regular Expression Patterns (2/6) • metacharacter: the dash - specifies a range inside square brackets • /[123456789]/ can be written /[1-9]/ • /[ABCDEFGHIJKLMNOPQRSTUVWXYZ]/ can be written /[A-Z]/
Basic Regular Expression Patterns (3/6) • metacharacter: the caret ^ • as the first symbol inside [ ], it negates the set: /[^a]/ matches any single character except a
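The bracket, range, and negation patterns from the last three slides can be tried out with a short sketch. It assumes Python's `re` module, where patterns are written as raw strings without the surrounding slashes; the sample strings are made up for illustration.

```python
import re

# Square brackets: either character in the set may appear at that position.
either_case = re.findall(r"[wW]oodchuck", "Woodchuck 7 woodchucks")

# Dash inside brackets: a range of characters (here the digits 1 through 9).
digits = re.findall(r"[1-9]", "Chapter 2, page 30")

# Range of capital letters.
capitals = re.findall(r"[A-Z]", "Speech and Language")

# Caret as the first symbol inside brackets: anything NOT in the set.
not_lowercase = re.findall(r"[^a-z]", "aB3")

print(either_case, digits, capitals, not_lowercase)
```

Note that /[1-9]/ does not match the 0 in "30", since 0 lies outside the range.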
Basic Regular Expression Patterns (4/6) • metacharacter: the question mark ? • zero or one occurrence of the immediately previous character or regular expression • Kleene * • zero or more occurrences of the immediately previous character or regular expression • /a*/ means 'any string of zero or more a's' • /aa*/ means 'one or more a's' • /[ab]*/ means 'zero or more a's or b's'
Basic Regular Expression Patterns (5/6) • Kleene + • one or more occurrences of the previous character • /baaa*!/ = /baa+!/ • metacharacter: the period . (wildcard expression) • /beg.n/ • matches any single character between beg and n • begin, beg'n, begun • .* • any string of characters
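A small sketch of the counters and the wildcard, again assuming Python's `re` module. The pattern `colou?r` and the test strings are illustrative assumptions, not taken from the slides; `re.fullmatch` is used so that the whole string must match.

```python
import re

# ? : the previous character is optional.
optional_u = [w for w in ["color", "colour", "colouur"]
              if re.fullmatch(r"colou?r", w)]

# + : one or more of the previous character.
candidates = ["ba!", "baa!", "baaaa!"]
sheep = [s for s in candidates if re.fullmatch(r"baa+!", s)]

# /baaa*!/ and /baa+!/ accept exactly the same candidate strings.
equivalent = all(
    bool(re.fullmatch(r"baaa*!", s)) == bool(re.fullmatch(r"baa+!", s))
    for s in candidates
)

# . : any single character between "beg" and "n".
wildcard = re.findall(r"beg.n", "begin beg'n begun begn")

print(optional_u, sheep, equivalent, wildcard)
```

The last string, "begn", is not matched because the period must consume exactly one character.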
Basic Regular Expression Patterns (6/6) • Anchors • special metacharacters that anchor a regular expression to a particular place in a string • caret ^ matches the start of a line • dollar sign $ matches the end of a line • /^The dog\.$/ matches a line that contains only the phrase The dog. • \b matches a word boundary • /the/ vs. /\bthe\b/ • /the/ also matches inside there; /\bthe\b/ matches only the word the • /^$/ matches an empty line
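The anchors can be checked with the following sketch, assuming Python's `re` module. The `re.MULTILINE` flag is a Python detail (an assumption beyond the slides) that makes ^ and $ work per line instead of per string.

```python
import re

# ^ and $ pin the pattern to the start and end of the line.
whole_line = bool(re.search(r"^The dog\.$", "The dog."))
with_extra = bool(re.search(r"^The dog\.$", "The dog. barked"))

# Without \b, "the" is also found inside "there" and "other".
substring_hits = re.findall(r"the", "there is the other")
word_hits = re.findall(r"\bthe\b", "there is the other")

# /^$/ finds empty lines when ^ and $ are applied line by line.
empty_lines = re.findall(r"^$", "a\n\nb", flags=re.MULTILINE)

print(whole_line, with_extra, len(substring_hits), len(word_hits), len(empty_lines))
```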
Disjunction, Grouping, and Precedence • We can't use [ ] to search for "cat or dog" • metacharacter: the pipe symbol | (disjunction) • /cat|dog/ matches either the string cat or the string dog • How can I specify both guppy and guppies? • /guppy|ies/ does not work: it matches only guppy or ies • sequences like guppy take precedence over the | • parentheses make the | apply only to the suffix: /gupp(y|ies)/
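The difference between the ungrouped and the grouped disjunction shows up clearly in a quick test. This is a sketch assuming Python's `re` module; the word list is made up for illustration.

```python
import re

words = ["guppy", "guppies", "ies", "guppyies"]

# Ungrouped: the alternatives are the whole sequences "guppy" and "ies".
ungrouped = [w for w in words if re.fullmatch(r"guppy|ies", w)]

# Grouped: the disjunction applies only to the suffix after "gupp".
grouped = [w for w in words if re.fullmatch(r"gupp(y|ies)", w)]

# Plain disjunction between two strings.
pets = re.findall(r"cat|dog", "my cat chased a dog")

print(ungrouped, grouped, pets)
```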
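Disjunction, Grouping, and Precedence • operator precedence hierarchy, from highest to lowest • parentheses ( ) • counters * + ? { } • sequences and anchors (the, ^my, end$) • disjunction |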
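One consequence of the precedence hierarchy is that counters bind more tightly than sequences, so a counter applies only to the single character before it unless parentheses are used. A minimal sketch, assuming Python's `re` module and illustrative test strings:

```python
import re

# The * applies only to the final "e", not to the whole sequence "the".
star_on_char = bool(re.fullmatch(r"the*", "theee"))
star_not_on_word = bool(re.fullmatch(r"the*", "thethe"))

# Parentheses have the highest precedence, so * now repeats "the".
star_on_group = bool(re.fullmatch(r"(the)*", "thethe"))

print(star_on_char, star_not_on_word, star_on_group)
```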
A Simple Example • write an RE to find cases of the English article the • /the/ • this pattern misses the word when it begins a sentence and hence is capitalized (i.e., The) • /[tT]he/ • this still matches the embedded in other words (e.g., other or theology) • /\b[tT]he\b/ • without \b: require a non-alphabetic character on both sides • /[^a-zA-Z][tT]he[^a-zA-Z]/ • this misses the at the start of a line, so also allow the line start • /(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
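The successive refinements can be compared on one sentence. This is a sketch assuming Python's `re` module; the sample sentence is hypothetical, and the last pattern uses a non-capturing group `(?:...)` plus one capturing group (a Python convenience, not on the slide) so that `findall` returns only the article itself.

```python
import re

text = "The other day the theology student saw the_end"

v1 = re.findall(r"the", text)            # misses "The", hits "other", "theology"
v2 = re.findall(r"[tT]he", text)         # adds the capitalized form
v3 = re.findall(r"\b[tT]he\b", text)     # whole words only; "_" counts as a word character
v4 = re.findall(r"(?:^|[^a-zA-Z])([tT]he)[^a-zA-Z]", text)  # non-letters on both sides

print(len(v1), len(v2), v3, v4)
```

The last version also finds the article in "the_end", which the \b version skips because the underscore is treated as part of a word.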
A More Complex Example (1/2) • "any PC with more than 500 MHz and 32 Gb of disk space for less than $1000" • regular expression for prices (e.g., $999.99) • simple regular expression for prices • /$[0-9]+/ • to deal with fractions of dollars • /$[0-9]+\.[0-9][0-9]/ • this pattern allows $199.99 but not $199 • make the cents optional and add word boundaries • /\b$[0-9]+(\.[0-9][0-9])?\b/
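A runnable version of the price pattern, as a sketch under two stated assumptions about Python's `re` module that differ from the slide notation: the literal dollar sign has to be written `\$`, and the leading `\b` is dropped because there is no word boundary between a space and `$` (neither is a word character). The sample sentence is made up.

```python
import re

# Dollar sign, one or more digits, optional two-digit cents, then a word boundary.
PRICE = re.compile(r"\$[0-9]+(?:\.[0-9][0-9])?\b")

prices = PRICE.findall("PCs from $999.99 to $1000, or $199 used")
print(prices)
```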
A More Complex Example (2/2) • regular expression for processor speed • regular expressions for operating systems and vendors
Advanced Operators (1/3) • Aliases for common sets of characters (e.g., \d for [0-9], \w for [a-zA-Z0-9_], \s for whitespace)
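A short sketch of the common aliases, assuming Python's `re` module. The `re.ASCII` flag is passed so that `\d` and `\w` behave like the ASCII sets [0-9] and [a-zA-Z0-9_]; without it, Python also accepts Unicode digits and letters. The sample strings are illustrative.

```python
import re

digits = re.findall(r"\d+", "32 Gb for $1000", flags=re.ASCII)      # runs of digits
non_digits = re.findall(r"\D+", "a1b22c", flags=re.ASCII)           # runs of non-digits
word_chars = re.findall(r"\w+", "the_end, ok", flags=re.ASCII)      # letters, digits, underscore
fields = re.split(r"\s+", "a  b\tc")                                # split on any whitespace

print(digits, non_digits, word_chars, fields)
```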
Advanced Operators (2/3) • Regular expression operators for counting (e.g., {n}, {n,m}, {n,}) • /a\.{24}z/ matches a followed by exactly 24 dots followed by z
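The counting operators can be verified by building test strings of known length. A minimal sketch assuming Python's `re` module; the helper name `dotted` and the `[0-9]{2,4}` pattern are hypothetical additions for illustration.

```python
import re

def dotted(n: int) -> str:
    """Build the string 'a', then n dots, then 'z'."""
    return "a" + "." * n + "z"

# {24}: exactly 24 occurrences of the escaped dot.
exactly_24 = [n for n in (23, 24, 25) if re.fullmatch(r"a\.{24}z", dotted(n))]

# {2,4}: between two and four digits.
two_to_four = [s for s in ["1", "12", "1234", "12345"]
               if re.fullmatch(r"[0-9]{2,4}", s)]

print(exactly_24, two_to_four)
```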
Advanced Operators (3/3) • Some characters that need to be backslashed (e.g., \*, \., \?)
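Why backslashing matters can be shown by comparing an unescaped and an escaped period. This sketch assumes Python's `re` module; `re.escape` is a Python helper (not mentioned on the slides) that backslashes special characters automatically.

```python
import re

# Unescaped, the period is a wildcard and matches any character.
unescaped_matches_x = bool(re.fullmatch(r"1.99", "1x99"))

# Backslashed, it matches only a literal period.
escaped_matches_x = bool(re.fullmatch(r"1\.99", "1x99"))
escaped_matches_dot = bool(re.fullmatch(r"1\.99", "1.99"))

# re.escape builds a pattern that matches the given text literally.
literal = "$1.99?"
round_trip = bool(re.fullmatch(re.escape(literal), literal))

print(unescaped_matches_x, escaped_matches_x, escaped_matches_dot, round_trip)
```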
Regular Expression Substitution, Memory, and ELIZA (1/3) • Perl substitution operator s/regexp1/regexp2/ • s/colour/color/ • number operator (using memory) • changing the 35 boxes to the <35> boxes • s/([0-9]+)/<\1>/ • /the (.*)er they were, the \1er they will be/ • will match The bigger they were, the bigger they will be • but not The bigger they were, the faster they will be • these numbered memories are called registers • an "extended" feature of regular expressions
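The two substitutions have direct counterparts in Python, shown here as a sketch that assumes `re.sub` in place of the Perl `s/.../.../` operator. The input sentences are illustrative.

```python
import re

# s/colour/color/
american = re.sub(r"colour", "color", "the colour red")

# s/([0-9]+)/<\1>/ : the parentheses store the digits, \1 recalls them.
bracketed = re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes")

print(american)
print(bracketed)
```

Unlike Perl's `s///` without the `g` flag, `re.sub` replaces every match by default; pass `count=1` to replace only the first.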
Regular Expression Substitution, Memory, and ELIZA (2/3) • number operator (cont'd) • /the (.*)er they (.*), the \1er they \2/ • will match The bigger they were, the bigger they were • but not The bigger they were, the bigger they will be • ELIZA • simple natural-language understanding program (1966) • substitution using memory
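Both register patterns can be tested against the example sentences. This is a sketch assuming Python's `re` module; `re.IGNORECASE` is added (an assumption beyond the slide) so that the lowercase pattern accepts the capitalized sentence-initial The.

```python
import re

ONE_REGISTER = re.compile(
    r"the (.*)er they were, the \1er they will be", re.IGNORECASE)
TWO_REGISTERS = re.compile(
    r"the (.*)er they (.*), the \1er they \2", re.IGNORECASE)

# \1 must repeat whatever the first parentheses matched.
same_adjective = bool(ONE_REGISTER.fullmatch(
    "The bigger they were, the bigger they will be"))
different_adjective = bool(ONE_REGISTER.fullmatch(
    "The bigger they were, the faster they will be"))

# \2 must additionally repeat what the second parentheses matched.
same_ending = bool(TWO_REGISTERS.fullmatch(
    "The bigger they were, the bigger they were"))
different_ending = bool(TWO_REGISTERS.fullmatch(
    "The bigger they were, the bigger they will be"))

print(same_adjective, different_adjective, same_ending, different_ending)
```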
Regular Expression Substitution, Memory, and ELIZA (3/3)
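An ELIZA-style responder can be sketched as an ordered list of pattern and response pairs, where the first matching rule wins and memory (\1) copies part of the input into the reply. The three rules, the fallback reply, and the names `RULES` and `respond` below are illustrative assumptions, not the original ELIZA script.

```python
import re

# Hypothetical rules: (pattern, response template). Order matters.
RULES = [
    (re.compile(r".*\bI am (depressed|sad)\b.*", re.IGNORECASE),
     r"I AM SORRY TO HEAR YOU ARE \1"),
    (re.compile(r".*\ball\b.*", re.IGNORECASE), "IN WHAT WAY"),
    (re.compile(r".*\balways\b.*", re.IGNORECASE),
     "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]

def respond(utterance: str) -> str:
    """Return the reply of the first rule whose pattern matches the utterance."""
    for pattern, template in RULES:
        match = pattern.fullmatch(utterance)
        if match:
            # expand() fills \1 with the text stored by the parentheses.
            return match.expand(template).upper()
    return "PLEASE GO ON"  # fallback when no rule applies

print(respond("Men are all alike."))
print(respond("I am sad today"))
print(respond("Hello"))
```

Keeping the rules in a list makes the cascade explicit: more specific patterns go first, and adding a rule does not require changing `respond`.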