Regular Expressions: grep LING 5200 Computational Corpus Linguistics Martha Palmer
Homework 2 • Bytes • Read path names • ~ not necessary in home directory • Display results of commands if they’re just a few lines. BASED on Kevin Cohen’s LING 5200
Switches • -c print only a count of matching lines (like adding | wc -l) • -i ignore the case of the letters in the pattern • -n include the line numbers • -v show lines that do NOT match the pattern • grep -i lemma README.english • grep -ic lemma README.english • grep -in lemma README.english
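A minimal sketch of the four switches, run on a three-line stand-in for README.english. The sample text is invented for illustration and is not taken from the real file.

```shell
# Hypothetical sample text standing in for README.english.
sample='The lemma field holds the base form.
Each Lemma has an ID.
Frequencies come from the corpus.'

# -i: ignore case, so both "lemma" and "Lemma" match.
matches=$(printf '%s\n' "$sample" | grep -i lemma)

# -c: print only the number of matching lines.
count=$(printf '%s\n' "$sample" | grep -ic lemma)

# -n: prefix each matching line with its line number.
numbered=$(printf '%s\n' "$sample" | grep -in lemma)

# -v: print the lines that do NOT match.
rest=$(printf '%s\n' "$sample" | grep -iv lemma)

printf '%s\n' "$count" "$numbered" "$rest"
```

Single-letter switches can be combined, so -ic is the same as -i -c.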
The Chomsky Grammar Hierarchy • Regular grammars, e.g. aabbbb: S → aS | nil | bS • Context-free grammars, e.g. aaabbb: S → aSb | nil • Context-sensitive grammars, e.g. aaabbbccc: xSy → xby • Transformational grammars: Turing machines
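The regular grammar above (S → aS | nil | bS) generates every string made of a's and b's, which is the language of the regular expression (a|b)*. A small sketch of that correspondence follows; the helper name accepts is hypothetical.

```shell
# accepts STRING: succeeds if STRING consists only of a's and b's
# (possibly none), i.e. if it is in the language of (a|b)*.
accepts() {
    printf '%s\n' "$1" | grep -Eq '^(a|b)*$'
}

accepts aabbbb && echo "aabbbb: accepted"
accepts aabbbc || echo "aabbbc: rejected"
```

The context-free example S → aSb | nil requires equal numbers of a's and b's, which a regular expression of this kind cannot check.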
Movement • What did John give to Mary? • *Where did John give to Mary? • John gave cookies to Mary. • John gave <what> to Mary.
Nested Dependencies and Crossing Dependencies • John, Mary and Bill ate peaches, pears and apples, respectively. (crossing dependencies: CS) • The dog chased the cat that bit the mouse that ran. (CF) • The mouse the cat the dog chased bit ran. (nested dependencies: CF)
Most parsers are Turing Machines • To give a more natural and comprehensible treatment of movement • For a more efficient treatment of features • Not because of respectively – most parsers can’t handle it.
• b*c matches the first character in the string cabbbcde • b*cd matches the third to seventh characters in the string cabbbcdebbbbbbcdbc
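The two matches above can be checked with grep -o, which prints each matched substring on its own line. This is only a sketch: -o is available in GNU and BSD grep but is not required by POSIX.

```shell
# First match of b*c in cabbbcde: zero b's followed by the initial c.
first=$(printf 'cabbbcde\n' | grep -o 'b*c' | head -n 1)

# First match of b*cd: the initial c is not followed by d, so the
# match starts at the third character and covers characters 3 to 7.
second=$(printf 'cabbbcdebbbbbbcdbc\n' | grep -o 'b*cd' | head -n 1)

echo "$first"     # c
echo "$second"    # bbbcd
```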
Character classes: ranges • All upper-case, all lower-case, all letters, any digit from zero to 9… • [A-Z] • [a-z] • [A-Za-z] • [0-9] • Practice!
Character classes: complements • Any character that's not a vowel • [^aeiouAEIOU] • In this context, ^ means "not"
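A short sketch of ranges and complements on a few invented tokens. Setting LC_ALL=C is an added precaution, since in some locales a range like [A-Z] can follow dictionary order rather than ASCII order.

```shell
# Pin the locale so that ranges such as [A-Z] follow ASCII order.
export LC_ALL=C

# Hypothetical tokens, one per line.
tokens='Treebank
corpus
42
VBZ'

# Lines containing at least one upper-case letter.
upper=$(printf '%s\n' "$tokens" | grep '[A-Z]')

# Lines containing at least one digit.
digits=$(printf '%s\n' "$tokens" | grep '[0-9]')

# Number of lines containing any letter.
letters=$(printf '%s\n' "$tokens" | grep -c '[A-Za-z]')

# Every character of "Treebank" that is not a vowel, joined together.
consonants=$(printf 'Treebank\n' | grep -o '[^aeiouAEIOU]' | tr -d '\n')

printf '%s\n' "$upper" "$digits" "$letters" "$consonants"
```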
Anchors • Any line that begins with… • Any line that ends with… • ^T line that begins with T • VBZ$ line that ends with VBZ
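The two anchors, sketched on a few invented word/tag lines (the sample is hypothetical, not treebank data):

```shell
# Hypothetical "word tag" lines.
tagged='The DT
runs VBZ
VBZ is a tag
Treebank NN'

# ^T: only lines whose first character is T.
starts_T=$(printf '%s\n' "$tagged" | grep '^T')

# VBZ$: only lines whose last three characters are VBZ.
ends_VBZ=$(printf '%s\n' "$tagged" | grep 'VBZ$')

printf '%s\n' "$starts_T" "$ends_VBZ"
```

The line "VBZ is a tag" contains VBZ but does not end with it, so VBZ$ skips it.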
Quantifiers • One or more… • Zero or more… • One or zero… • a+ one or more “a's” • a* zero or more “a's” • a? one “a”, or nothing • And more…
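A sketch comparing the three quantifiers on invented strings. It uses grep -E (equivalent to egrep) because + and ? are extended-syntax operators, and -x so that the whole line has to match.

```shell
# Hypothetical test strings: zero, one, two and three a's before a b.
strings='b
ab
aab
aaab'

# a+b: one or more a's, then b.
plus=$(printf '%s\n' "$strings" | grep -Exc 'a+b')

# a*b: zero or more a's, then b.
star=$(printf '%s\n' "$strings" | grep -Exc 'a*b')

# a?b: at most one a, then b.
opt=$(printf '%s\n' "$strings" | grep -Exc 'a?b')

echo "a+b: $plus  a*b: $star  a?b: $opt"
```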
grep/egrep • x+ instead of xx* • (xxx|yyy) for alternatives • ? matches a single character in shell filename patterns; in egrep, x? means zero or one x
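A sketch of the grep/egrep differences on invented lines. grep -E is used here as the equivalent of egrep.

```shell
# Hypothetical sample lines.
sample='x
xxx
yyy
zzz'

# Basic grep: "one or more x" has to be spelled xx*.
basic=$(printf '%s\n' "$sample" | grep -c 'xx*')

# Extended syntax: x+ means the same thing.
extended=$(printf '%s\n' "$sample" | grep -Ec 'x+')

# Extended syntax also allows alternatives in parentheses.
either=$(printf '%s\n' "$sample" | grep -Ec '(xxx|yyy)')

echo "$basic $extended $either"
```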
Searching the treebank • cat ??/* | egrep -i '(push|pull)[a-z]*'
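A self-contained sketch of the treebank search, using a throwaway directory in place of the real corpus. The directory and file names below are hypothetical; the point is that the shell expands ??/* to every file inside a directory with a two-character name before grep sees any text.

```shell
# Build a tiny stand-in corpus in a temporary directory.
work=$(mktemp -d)
mkdir "$work/00" "$work/01" "$work/doc"
printf 'He pushed the cart .\n' > "$work/00/wsj_0001"
printf 'She was Pulling ahead .\nNothing here .\n' > "$work/01/wsj_0101"
printf 'push notes\n' > "$work/doc/README"

# ??/* selects files in 00/ and 01/ but not in doc/ (three characters).
# -o is added here to print just the matched words.
hits=$(cd "$work" && cat ??/* | grep -Eio '(push|pull)[a-z]*')

rm -rf "$work"
echo "$hits"
```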
grep/egrep • grep '^[^a-z]*epl' README.english • grep ' epl' README.english • egrep '^[^a-z]*(epl|epw)' README.english • egrep ' (epl|epw)' README.english • Nice when you have tokenized strings…
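A sketch of why the leading space or the ^[^a-z]* prefix matters, on three invented lines standing in for README.english:

```shell
export LC_ALL=C

# Hypothetical stand-in for README.english.
readme='  epl  English phonology, lemmas
  epw  English phonology, wordforms
notes on depletion'

# Bare pattern: also matches the "epl" inside "depletion".
naive=$(printf '%s\n' "$readme" | grep -c 'epl')

# Leading space: epl has to start a token.
spaced=$(printf '%s\n' "$readme" | grep -c ' epl')

# Anchored: nothing but non-lower-case characters before epl.
anchored=$(printf '%s\n' "$readme" | grep -c '^[^a-z]*epl')

# Extended syntax: either file type in one pattern.
both=$(printf '%s\n' "$readme" | grep -Ec '^[^a-z]*(epl|epw)')

echo "$naive $spaced $anchored $both"
```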
More grepping • But when you don’t… • /corpora/celex/english/epw/epw.cd • Find all capitalized words • grep '^[0-9][0-9]*.[A-Z]' epw.cd | wc -l
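A sketch of the capitalized-word count on invented records. It assumes each line starts with a numeric ID, then one separator character (a backslash here), then the word; the field values are made up and are not real CELEX data.

```shell
export LC_ALL=C

# Hypothetical records in an ID\word\... layout.
sample='1\a\100
2\A\20
3\Aachen\5
4\aback\7'

# As on the slide: digits, any one character, then a capital letter.
# grep -c counts matching lines, like piping to wc -l.
loose=$(printf '%s\n' "$sample" | grep -c '^[0-9][0-9]*.[A-Z]')

# Stricter variant: require the separator to be a literal backslash.
strict=$(printf '%s\n' "$sample" | grep -c '^[0-9][0-9]*\\[A-Z]')

echo "$loose $strict"
```

The unescaped dot matches any character, so the stricter form is safer if other characters can follow the ID.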
Exercises – pick a directory • How many 5 letter words? • head -10 wsj_0564 | grep -i ' [a-z][a-z][a-z][a-z][a-z] ' | wc • grep -i ' [a-z][a-z][a-z][a-z][a-z] ' * | wc
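The commands above count lines that contain a five-letter word, not the words themselves. One possible way to count the words is sketched below on an invented sentence; it swaps the surrounding spaces for -w (whole-word match) and adds -o, both available in GNU and BSD grep.

```shell
export LC_ALL=C

# Hypothetical tokenized sentence.
line='The quick brown foxes jumped over seven lazy dogs .'

# Slide version: number of LINES containing a five-letter word.
lines=$(printf '%s\n' "$line" | grep -ic ' [a-z][a-z][a-z][a-z][a-z] ')

# Word count: -o prints each match separately, -w requires the
# match to be a whole word, so "jumped" does not contribute.
words=$(printf '%s\n' "$line" \
    | grep -iow '[a-z][a-z][a-z][a-z][a-z]' | wc -l | tr -d ' ')

echo "lines: $lines  words: $words"
```

Using -w instead of literal spaces also avoids missing adjacent five-letter words, because matches cannot share the space between them.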
Lab (cont.) • Are there any words with no vowels? • grep -i ' [^aeiouy][^aeiouy]* ' wsj_0564 | wc • grep -i ' [^aeiouy][^aeiouy.]* ' wsj_0564 | wc • grep -i ' [^aeiouy"][^aeiouy."]* ' wsj_0564 • 80%?
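The complement class [^aeiouy] also accepts digits and punctuation, which is why tokens like 80% keep turning up. One alternative, sketched on an invented sentence, is to split the text into one token per line and list the consonants explicitly. Both the sentence and this approach are illustrative assumptions, not the lab's intended solution.

```shell
export LC_ALL=C

# Hypothetical tokenized sentence.
sentence='The rhythm of Mr. Smith said hmm about 80% on TV'

# Complement class: any token made only of non-vowel characters.
loose=$(printf '%s\n' "$sentence" | tr ' ' '\n' \
    | grep -ix '[^aeiouy][^aeiouy]*')

# Explicit consonant ranges: letters only, with y counted as a vowel.
strict=$(printf '%s\n' "$sentence" | tr ' ' '\n' \
    | grep -ix '[b-df-hj-np-tv-xz][b-df-hj-np-tv-xz]*')

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
```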
Lab (cont.) • Find “1-syllable” words (words with exactly one vowel) • grep -i ' [^aeiouy]*[aeiouy][^aeiouy]* ' • Find “2-syllable” words (words with exactly two vowels) • Delete words ending with a silent “e” from the “2-syllable” list
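One possible approach to all three steps, sketched on an invented sentence. It puts one token per line and uses -x so each pattern must cover the whole token; this sidesteps the fact that [^aeiouy] also matches a space. Treating every final e as silent is a simplifying assumption.

```shell
export LC_ALL=C

# Hypothetical tokenized sentence, split into one token per line.
tokens=$(printf 'the cat made seven cakes today\n' | tr ' ' '\n')

# Exactly one vowel.
one=$(printf '%s\n' "$tokens" | grep -ix '[^aeiouy]*[aeiouy][^aeiouy]*')

# Exactly two vowels: repeat the vowel-plus-consonants part once more.
two=$(printf '%s\n' "$tokens" \
    | grep -ix '[^aeiouy]*[aeiouy][^aeiouy]*[aeiouy][^aeiouy]*')

# Drop two-vowel words that end in e.
kept=$(printf '%s\n' "$two" | grep -iv 'e$')

printf 'one:\n%s\ntwo:\n%s\nkept:\n%s\n' "$one" "$two" "$kept"
```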
Emacs • emacs -nw • Control x, control c – exit • Control x, control s – save • Control x, control v – visit • Apropos