CSE 467/567 Computational Linguistics
Carl Alphonce
cse-467-alphonce@cse.buffalo.edu
Computer Science & Engineering
University at Buffalo
Levels of processing
• phonetics/phonology – sounds
• morphology – word structure
• syntax – sentence structure
• semantics – meaning
• pragmatics – goals of language use
• discourse – utterances in context
Words: the building blocks of sentences
Words have internal structure
• readable = read + able
• readability = read + able + ity
• the structure of words can be described using a regular grammar
Chomsky hierarchy
Problem
• I often need to find an e-mail, but I have thousands of e-mails in my various folders.
• Suppose I want to find an e-mail about geese. The e-mail may mention “geese” or “goose”; also, if the word appears at the start of a sentence, its initial letter will be capitalized.
• The search needs to match “goose”, “geese”, “Goose” or “Geese”.
Regular expressions (in Perl)
“a regular expression is an algebraic notation for characterizing a set of strings” [p. 22]
Regular expressions are commonly used to specify search strings. For example, the UNIX utility program grep lets the user specify a pattern to search for in files.
Sequences of characters
Matching a sequence of characters: /…/
Examples:
/a/ matches the character ‘a’
/fred/ matches the string ‘fred’
Note: /fred/ does not match the string ‘Fred’! In other words, patterns are case-sensitive.
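Case sensitivity is easy to check by hand. The following is a minimal sketch in Python rather than Perl, assuming that Python's re module treats these simple patterns the same way; the helper name contains is made up for illustration.

```python
import re

def contains(pattern: str, text: str) -> bool:
    """Return True if the pattern matches anywhere in the text."""
    # re.search scans the whole string, much like Perl's  $text =~ /pattern/
    return re.search(pattern, text) is not None

lower = contains(r"fred", "alfred went home")  # 'fred' occurs inside 'alfred'
upper = contains(r"fred", "Fred went home")    # capital 'F' is not matched by 'f'
print(lower, upper)
```

Note that the first call succeeds even though ‘fred’ is only part of a longer word: a plain sequence matches anywhere in the string.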
Character disjunction (character classes)
Square brackets are used to indicate disjunction of characters.
Examples:
/[Ff]/ matches either ‘f’ or ‘F’
/[Ff]red/ matches either ‘fred’ or ‘Fred’
This form of disjunction applies only at the character level.
A set of characters in square brackets is sometimes referred to as a character class.
Ranges
Sometimes it is useful to specify “any digit” or “any letter”.
“Any digit” can be written as /[0123456789]/, since any of the ten digits satisfies the pattern.
An alternative is to use a special range notation: /[0-9]/
Any letter can be specified as /[A-Za-z]/
Range notation does not extend the power of regular expressions, but gives us a convenient way to express them.
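Character classes and ranges can be tried out directly. This is a small sketch in Python (an assumption of these notes, not the Perl used on the slides); the sample text and variable names are illustrative only.

```python
import re

text = "Fred met fred in room 467"

# [Ff] is a disjunction over single characters.
freds = re.findall(r"[Ff]red", text)

# The explicit class and the range notation describe the same set of characters.
explicit = re.findall(r"[0123456789]", text)
ranged = re.findall(r"[0-9]", text)

# Two ranges in one class: any upper-case or lower-case ASCII letter.
letters = re.findall(r"[A-Za-z]", "a1 B2")

print(freds, explicit, ranged, letters)
```

The explicit class and the range return the same matches, which illustrates that range notation is a convenience and not an extension.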
Complementing character classes
To search for a character that is not in a character class, place a caret (^) immediately after the opening square bracket.
Examples:
/[^a]/ matches any single character except ‘a’
/[^0-9]/ matches any single character except a digit
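A complemented class still matches exactly one character at a time. The sketch below uses Python's re module as a stand-in for Perl; the variable names are illustrative, and the last line relies on the fact that a caret anywhere but the first position inside the brackets is an ordinary character.

```python
import re

# Every single character that is not 'a'.
not_a = re.findall(r"[^a]", "banana")

# Every single character that is not a digit.
not_digit = re.findall(r"[^0-9]", "a1b2")

# A caret that is not first inside the brackets stands for itself.
literal_caret = re.findall(r"[a^]", "a^b")

print(not_a, not_digit, literal_caret)
```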
Matching 0 or 1 occurrence
The ‘?’ matches zero or one occurrence of the preceding expression.
Examples:
/a?/ matches ‘a’ or ‘’ (nothing)
/cats?/ matches ‘cat’ or ‘cats’
Note that the “preceding expression”, in these examples, is a single letter. We’ll see how to form longer expressions later.
The Kleene star and plus
The Kleene star (*) matches zero or more occurrences of the preceding expression.
Examples:
/a*/ matches ‘’, ‘a’, ‘aa’, ‘aaa’, etc.
/[ab]*/ matches ‘’, ‘a’, ‘b’, ‘aa’, ‘ab’, ‘ba’, ‘bb’, etc.
The plus (+) matches one or more occurrences of the preceding expression.
The plus adds no expressive power: /[ab]+/ is equivalent to /[ab][ab]*/
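The three counting operators (?, * and +) can be compared side by side. This is a minimal Python sketch, assuming re.fullmatch as a way to ask whether a whole string is in the set a pattern describes; the helper full and the sample strings are made up for illustration.

```python
import re

def full(pattern: str, text: str) -> bool:
    """Return True if the entire text is matched by the pattern."""
    return re.fullmatch(pattern, text) is not None

# '?' makes the preceding 's' optional.
optional = [full(r"cats?", w) for w in ["cat", "cats", "catss"]]

# '*' allows zero or more characters from the class, '+' requires at least one.
star = [full(r"[ab]*", w) for w in ["", "ab", "abc"]]
plus = [full(r"[ab]+", w) for w in ["", "ab"]]

# [ab]+ and [ab][ab]* agree on every sample string.
samples = ["", "a", "b", "ab", "ba", "abc", "c"]
equivalent = all(full(r"[ab]+", s) == full(r"[ab][ab]*", s) for s in samples)

print(optional, star, plus, equivalent)
```

Agreement on a handful of samples is not a proof of equivalence, only a quick sanity check.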
Wildcard
The period (.) matches any single character except the newline (\n).
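The newline exception is visible in a short experiment. The sketch below is in Python, assuming its default treatment of the period matches the behaviour described here; re.DOTALL is Python's flag for letting the period match a newline as well (Perl offers the /s modifier for the same purpose).

```python
import re

text = "cat cot c\nt c-t"

# By default the period matches any single character except a newline.
default_hits = re.findall(r"c.t", text)

# With re.DOTALL the period also matches the newline.
dotall_hits = re.findall(r"c.t", text, flags=re.DOTALL)

print(default_hits, dotall_hits)
```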
Anchors
Anchors are used to restrict a match to a particular position within a string.
^ anchors to the start of a string
$ anchors to the end of a string
/[Ff]red/ matches both ‘Fred’ and ‘Fred is home’
/^[Ff]red$/ matches ‘Fred’ but not ‘Fred is home’
\b anchors to a word boundary
\B anchors to a non-boundary
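The effect of each anchor can be checked on the slide's own examples. This is a minimal sketch in Python, assuming its ^, $, \b and \B behave like Perl's on these single-line strings; the variable names are illustrative.

```python
import re

strings = ["Fred", "Fred is home"]

# Without anchors the pattern may match anywhere in the string.
unanchored = [bool(re.search(r"[Ff]red", s)) for s in strings]

# With ^ and $ the pattern must account for the whole string.
anchored = [bool(re.search(r"^[Ff]red$", s)) for s in strings]

# \b requires a word boundary, so 'red' must stand alone as a word.
whole_word = [bool(re.search(r"\bred\b", s)) for s in ["a red car", "Fred"]]

# \B requires a non-boundary, so 'red' must start inside a longer word.
inside_word = [bool(re.search(r"\Bred\b", s)) for s in ["a red car", "Fred"]]

print(unanchored, anchored, whole_word, inside_word)
```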
Conjunction
Two regular expressions are conjoined by juxtaposition (placing the expressions side by side).
Examples:
/a/ matches ‘a’
/m/ matches ‘m’
/am/ matches ‘am’ but not ‘a’ or ‘m’ alone
Disjunction
We have already seen disjunction of characters using the square bracket notation.
General disjunction is expressed using the vertical bar (|), also called the pipe symbol.
This form of disjunction matches any one of several alternative patterns, whereas the [ ] form chooses only among single characters.
Example: /cat|dog/ matches ‘cat’ or ‘dog’
Grouping
• Parentheses, ‘(’ and ‘)’, are used to group subpatterns of a larger pattern.
• Ex: /[Gg](ee|oo)se/
• The parentheses limit the scope of the vertical bar: without them, /[Gg]ee|oose/ would match ‘Gee’ or ‘oose’.
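Grouping is what makes the goose/geese search work. The sketch below is in Python, assuming the same precedence rules as Perl (the vertical bar binds more loosely than juxtaposition); the word list and the names goose and loose are made up for illustration.

```python
import re

# Parentheses confine the alternation to the vowel pair.
goose = re.compile(r"\b[Gg](ee|oo)se\b")
words = ["goose", "geese", "Goose", "Geese", "gease", "moose"]
matched = [w for w in words if goose.search(w)]

# Without grouping, '|' splits the whole pattern into '[Gg]ee' and 'oose'.
loose = re.compile(r"[Gg]ee|oose")
loose_hits = [w for w in ["Gee", "moose", "goose"] if loose.search(w)]

print(matched)
print(loose_hits)
```

The ungrouped pattern accepts ‘Gee’ and ‘moose’, which shows why the placement of the parentheses matters.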
Replacement
In addition to matching, we can do replacements when a match is found.
Example: to replace the British spelling of color with the American spelling, we can write:
s/colour/color/
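The same substitution can be sketched in Python, assuming re.sub as the counterpart of Perl's s/// operator. One difference to keep in mind: Perl's s/colour/color/ replaces only the first match unless the g modifier is added, whereas re.sub replaces every match unless count is given. The sample sentence is illustrative.

```python
import re

text = "The colour of the colourful sky"

# Replace every occurrence (comparable to Perl's s/colour/color/g).
all_replaced = re.sub(r"colour", "color", text)

# Replace only the first occurrence (comparable to Perl's s/colour/color/).
first_replaced = re.sub(r"colour", "color", text, count=1)

print(all_replaced)
print(first_replaced)
```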
Registers – saving matches
• To save the text matched by part of a pattern for later reuse, Perl provides registers
• Each parenthesized group fills one register; groups are numbered left to right by their opening parenthesis
• Registers are named \#, where # is the number of the register
• Ex.
DE DO DO DO DE DA DA DA
IS ALL I WANT TO SAY TO YOU
/(D[AEO].)*/ will match the first line
/(D[AEO])(.D[AEO])\2\2\s\1(.D[AEO])\3\3/ matches it more specifically
This pattern also matches strings like DA DE DE DE DA DO DO DO
\s matches a whitespace character
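The register pattern can be tested against the example strings. This is a minimal sketch in Python, assuming its numbered backreferences (\1, \2, \3) work like Perl's registers. The pattern is written with no spaces between its terms, because each of groups 2 and 3 already captures the separating character through its leading period, so an extra literal space before \2 or \3 would demand a doubled space in the text.

```python
import re

pattern = re.compile(r"(D[AEO])(.D[AEO])\2\2\s\1(.D[AEO])\3\3")

# The original line: register 1 holds 'DE', register 2 ' DO', register 3 ' DA'.
m1 = pattern.fullmatch("DE DO DO DO DE DA DA DA")

# Same shape with different syllables also matches.
m2 = pattern.fullmatch("DA DE DE DE DA DO DO DO")

# Breaking the repetition (third syllable differs) makes \2 fail.
m3 = pattern.fullmatch("DE DO DA DO DE DA DA DA")

registers = m1.groups()
print(registers, m2 is not None, m3 is None)
```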
For more information
• Perl regular expression tutorial (perlretut)
• http://perldoc.perl.org/perlretut.html
• Perl regular expression reference page (perlre)
• http://perldoc.perl.org/perlre.html