Regular expressions
Day 2
LING 681.02 Computational Linguistics
Harry Howard
Tulane University
Course organization LING 681.02, Prof. Howard, Tulane University
Regular expressions SLP 2.1
Questions
• What is a string?
  • A sequence of symbols.
  • In text, a sequence of alphanumeric characters.
• What is a regular expression (RE or regex)?
  • A language for specifying text search strings; a search requires a pattern to search for and a corpus to search through.
• What is an algebra?
  • A set of elements and a set of operations defined on them
  • e.g. the set of real numbers and the operations +, –, *, and /.
• What is a false positive?
  • A string that is incorrectly matched; false positives decrease accuracy.
• What is a false negative?
  • A string that is incorrectly excluded; false negatives decrease coverage.
• What is precedence?
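The two error types can be seen in a small Python search for the word "the". This is only a sketch: the sample sentence and the variable names are made up for illustration.

```python
import re

# Hypothetical sample text, chosen to contain "The", "the", "other" and "there".
text = "The other day the cat sat there."

# Lowercase-only pattern: misses "The" (a false negative) and also
# matches inside "other" and "there" (false positives).
narrow = re.findall(r"the", text)

# Allowing a capital T recovers "The" but still matches inside other words.
broad = re.findall(r"[tT]he", text)

# Word boundaries remove the false positives while keeping "The".
precise = re.findall(r"\b[tT]he\b", text)

print(narrow)   # ['the', 'the', 'the']
print(broad)    # ['The', 'the', 'the', 'the']
print(precise)  # ['The', 'the']
```

Each refinement trades one kind of error against the other, which is why accuracy and coverage are tracked separately.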
Notation in Perl
*      0 or more occurrences of the previous character or RE
+      1 or more occurrences of the previous character or RE
-      the two ends of a range (inside brackets)
^      not (negation, inside brackets) or beginning of line; "caret"
?      the previous character or RE is optional
.      any character
|      either … or; "pipe"
()     grouping, or put in a register
{n}    n occurrences of the previous character or RE
\b     word boundary
\w     word character (letter, digit, or underscore); white space is \s
$      end of line
\1     the string matched by the group in register 1 (backreference)
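Python's re module follows the same Perl-style notation, so each operator can be tried out directly. The following is a minimal sketch; the test strings are invented for illustration, and the third element of each tuple is the list that re.findall is expected to return.

```python
import re

# (pattern, text, expected result of re.findall)
examples = [
    (r"ab*", "a ab abbb", ["a", "ab", "abbb"]),          # * : zero or more b
    (r"ab+", "a ab abbb", ["ab", "abbb"]),               # + : one or more b
    (r"[0-9]", "a1b22", ["1", "2", "2"]),                # - : range in brackets
    (r"[^0-9]", "a1b", ["a", "b"]),                      # ^ : negation in brackets
    (r"^a", "aba", ["a"]),                               # ^ : beginning of line
    (r"colou?r", "color colour", ["color", "colour"]),   # ? : optional u
    (r"b.g", "bag big b g", ["bag", "big", "b g"]),      # . : any character
    (r"cat|dog", "cat dog cow", ["cat", "dog"]),         # | : either ... or
    (r"a{3}", "aa aaa", ["aaa"]),                        # {n} : exactly n
    (r"\bthe\b", "the other the", ["the", "the"]),       # \b : word boundary
    (r"\w+", "hi, you", ["hi", "you"]),                  # \w : word character
    (r"\s", "a b\tc", [" ", "\t"]),                      # \s : white space
    (r"a$", "aba", ["a"]),                               # $ : end of line
    (r"(\w+) \1", "the the cat", ["the"]),               # \1 : register 1
]

# Collect whether each pattern behaves as described.
results = []
for pattern, text, expected in examples:
    found = re.findall(pattern, text)
    results.append(found == expected)
    print(f"{pattern!r:14} on {text!r:16} -> {found}")
```

Note that when a pattern contains a group, as in the last entry, re.findall returns the contents of the group rather than the whole match.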
Exercise 2.1: REs
• The set of all alphabetic strings.
  • [a-zA-Z][a-zA-Z]*
  • [a-zA-Z]+
• The set of all lower case alphabetic strings ending in a b.
  • [a-z]*b
• The set of all strings with two consecutive repeated words (e.g., “Humbert Humbert” and “the the” but not “the bug” or “the big bug”).
  • ([a-zA-Z]+)\s+\1
  • With word boundaries, \b([a-zA-Z]+)\s+\1\b, the pattern no longer matches inside longer words such as “the theory”.
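These answers can be checked in Python. The sketch below uses fullmatch to test set membership and search to look for repeated words; the variable names and test strings are assumptions made for illustration, and the \b-anchored variant of the repeated-word pattern is included for comparison.

```python
import re

alphabetic = re.compile(r"[a-zA-Z]+")
ends_in_b = re.compile(r"[a-z]*b")
repeated = re.compile(r"([a-zA-Z]+)\s+\1")
repeated_bounded = re.compile(r"\b([a-zA-Z]+)\s+\1\b")  # word-boundary variant

# fullmatch succeeds only if the entire string is in the set.
is_alpha = alphabetic.fullmatch("Tulane") is not None        # True
not_alpha = alphabetic.fullmatch("LING681") is not None      # False (digits)
crab = ends_in_b.fullmatch("crab") is not None               # True
crabs = ends_in_b.fullmatch("crabs") is not None             # False (ends in s)

# search looks for the pattern anywhere in the string.
humbert = repeated.search("Humbert Humbert") is not None     # True
big_bug = repeated.search("the big bug") is not None         # False

# Without boundaries, "the the" is found inside "the theory".
loose = repeated.search("the theory") is not None            # True
tight = repeated_bounded.search("the theory") is not None    # False

print(is_alpha, not_alpha, crab, crabs, humbert, big_bug, loose, tight)
```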
Exercise 2.1: REs, cont.
• The set of all strings from the alphabet a, b such that each a is immediately preceded by and immediately followed by a b.
  • (b+(ab+)*)?
• All strings that start at the beginning of the line with an integer and that end at the end of the line with a word.
  • ^\d+\b.*\b[a-zA-Z]+$
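A few test strings make the behavior of these patterns concrete. The sketch below checks the pattern (b+(ab+)*)? and, for contrast, the variant (b+(ab+)+)?, which demands at least one a and so rejects strings consisting only of b's. The helper name accepted and the test strings are assumptions made for illustration.

```python
import re

with_star = re.compile(r"(b+(ab+)*)?")  # b-only strings are allowed
with_plus = re.compile(r"(b+(ab+)+)?")  # at least one a when non-empty
int_to_word = re.compile(r"^\d+\b.*\b[a-zA-Z]+$")


def accepted(pattern, s):
    """Return True if the whole string s belongs to the pattern's language."""
    return pattern.fullmatch(s) is not None


# Strings in which every a has a b on both sides (or that contain no a).
for s in ["", "bab", "babbab", "bb"]:
    print(repr(s), accepted(with_star, s), accepted(with_plus, s))

# Strings that violate the condition are rejected by both patterns.
for s in ["ab", "ba", "baab"]:
    print(repr(s), accepted(with_star, s), accepted(with_plus, s))

# Line starting with an integer and ending with a word.
print(int_to_word.search("42 is the answer") is not None)   # True
print(int_to_word.search("42nd street") is not None)        # False: no \b after 42
print(int_to_word.search("42 is the answer.") is not None)  # False: ends in "."
```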
Exercise 2.1: REs, cont.
• All strings that have both the word grotto and the word raven in them (but not, e.g., words like grottos that merely contain the word grotto).
  • \bgrotto\b.*\braven\b|\braven\b.*\bgrotto\b
• Write a pattern that places the first word of an English sentence in a register. Deal with punctuation.
  • ^[^a-zA-Z]*([a-zA-Z]+)
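Both patterns can be tried out in Python as follows. This is a sketch only; the sample sentences and variable names are invented for illustration.

```python
import re

both_words = re.compile(r"\bgrotto\b.*\braven\b|\braven\b.*\bgrotto\b")
first_word = re.compile(r"^[^a-zA-Z]*([a-zA-Z]+)")

# Either order of the two words is accepted.
raven_first = both_words.search("The raven flew into the grotto.") is not None
grotto_first = both_words.search("The grotto hid a raven.") is not None

# "grottos" merely contains "grotto", so the word boundary blocks the match.
plural = both_words.search("Two grottos and a raven.") is not None

# Register 1 holds the first word; leading punctuation is skipped.
quoted = first_word.search('"Hello," she said.').group(1)
dotted = first_word.search("... and then it rained.").group(1)

print(raven_first, grotto_first, plural)  # True True False
print(quoted, dotted)                     # Hello and
```

Because the first-word pattern only accepts letters, a word with an internal apostrophe is cut at the apostrophe.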
Exercise 2.2
• patterns
    import re
    import string

    # (pattern, replacement) pairs for an ELIZA-style substitution cascade
    patterns = [
        (r"\b(i'm|i am)\b", "YOU ARE"),
        (r"\b(i|me)\b", "YOU"),
        (r"\b(my)\b", "YOUR"),
        (r"\b(well,?) ", ""),
        (r".* YOU ARE (depressed|sad) .*", r"I AM SORRY TO HEAR YOU ARE \1"),
        (r".* YOU ARE (depressed|sad) .*", r"WHY DO YOU THINK YOU ARE \1"),
        (r".* all .*", "IN WHAT WAY"),
        (r".* always .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
        (r"[%s]" % re.escape(string.punctuation), ""),
    ]
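One way to run such a pattern list is to apply the substitutions in order with re.sub. The sketch below is an assumption about how the list is meant to be used, not the exercise's official solution: the respond function, the lowercasing of the input, and the padding with spaces (so that patterns such as " all " can match at the edges) are all hypothetical choices.

```python
import re
import string

patterns = [
    (r"\b(i'm|i am)\b", "YOU ARE"),
    (r"\b(i|me)\b", "YOU"),
    (r"\b(my)\b", "YOUR"),
    (r"\b(well,?) ", ""),
    (r".* YOU ARE (depressed|sad) .*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".* YOU ARE (depressed|sad) .*", r"WHY DO YOU THINK YOU ARE \1"),
    (r".* all .*", "IN WHAT WAY"),
    (r".* always .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
    (r"[%s]" % re.escape(string.punctuation), ""),
]


def respond(utterance):
    """Apply every substitution in order (hypothetical driver function)."""
    # Assumption: lowercase the input and pad it with spaces so that
    # space-delimited patterns can match the first and last words.
    response = " " + utterance.lower() + " "
    for pattern, replacement in patterns:
        response = re.sub(pattern, replacement, response)
    return response.strip()


print(respond("Well, I am sad today."))
# I AM SORRY TO HEAR YOU ARE sad
print(respond("They are all alike"))
# IN WHAT WAY
print(respond("My boyfriend always argues with me."))
# CAN YOU THINK OF A SPECIFIC EXAMPLE
```

With this driver the second "YOU ARE (depressed|sad)" rule never fires, because the first rule has already rewritten the input; a fuller implementation would need some way to choose between alternative responses.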
REs in Python
• The re module provides Perl-style regular expression patterns; see http://www.amk.ca/python/howto/regex/
• NLPP goes into REs in §3.4, p. 97ff
Next time
• SLP Automata: §2.2-end & Ex. 2.3-end
• NLPP: finish §1, do as many of the exercises as you can