Regular Expressions and Xkwic
LING 5200 Computational Corpus Linguistics
Martha Palmer
February 28, 2006
grep/egrep
• x+ instead of xx*
• (xxx|yyy): xxx OR yyy
• ?: matches zero or one occurrence of the preceding item (a character, a character set, or a group)
BASED on Kevin Cohen’s LING 5200
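The three operators above can be tried on a throwaway word list. This is a minimal sketch: the words are made up for illustration, and `grep -E` stands in for `egrep`, which recent GNU grep releases flag as obsolescent.

```shell
# Hypothetical word list, one item per line (not course data).
words='color
colour
colouur
x
xx
xxx
yyy'

# x+ (extended) and xx* (basic) both mean "one or more x".
plus=$(printf '%s\n' "$words" | grep -E -c '^x+$')
star=$(printf '%s\n' "$words" | grep -c '^xx*$')

# (xxx|yyy) accepts either alternative.
alt=$(printf '%s\n' "$words" | grep -E -c '^(xxx|yyy)$')

# u? accepts zero or one u, so "colouur" is rejected.
opt=$(printf '%s\n' "$words" | grep -E -c '^colou?r$')

echo "$plus $star $alt $opt"   # prints: 3 3 2 2
```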
More grepping/egrepping
• /corpora/celex/english/epw/epw.cd
• Find all capitalized words
• grep ^'[0-9][0-9]*.[A-Z]' epw.cd | wc -l
• OR
• egrep ^'[0-9]+.[A-Z]' epw.cd | wc -l
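To try the two commands without access to the CELEX file, a small stand-in can be used. The sketch below assumes records of the form id\word\ and uses invented ids; `grep -c` reports the same count as piping to `wc -l`.

```shell
# Hypothetical records in an id\word\ layout (invented ids, not real CELEX data).
f=$(mktemp)
cat > "$f" <<'EOF'
1\USSR\
2\Union Jacks\
3\ayah\
4\i.e.\
5\X chromosome\
EOF

# Basic regex: [0-9][0-9]* spells out "one or more digits".
basic=$(grep -c '^[0-9][0-9]*.[A-Z]' "$f")
# Extended regex: [0-9]+ says the same thing more compactly.
extended=$(grep -E -c '^[0-9]+.[A-Z]' "$f")

echo "$basic $extended"   # both count records 1, 2 and 5
rm -f "$f"
```

Note that the `.` matches any single character, not only the backslash delimiter; writing `\\` instead is stricter.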
Homework 3
• Please give me command AND results!
• 1. In the file /corpora/celex/english/epw/epw.cd, find all words that contain only upper-case letters, e.g. USSR and VTOL.
• ANS: 158
• grep '^[0-9][0-9]*\\[A-Z][A-Z]*\\' epw.cd | wc -l
• egrep '^[0-9]+\\[A-Z]+\\' epw.cd | wc -l
• egrep ^'[0-9]+[\][A-Z]+\\' epw.cd | wc -l
• egrep ^'[0-9]+.[A-Z]+\\' epw.cd | wc -l
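The four answers differ only in how they write "one or more" and how they match the backslash delimiter. A minimal sketch on invented records (hypothetical ids, with "Nato" added as a mixed-case distractor) suggests they select the same lines:

```shell
# Hypothetical records in an id\word\ layout (not real CELEX data).
f=$(mktemp)
cat > "$f" <<'EOF'
1\USSR\
2\VTOL\
3\Union Jacks\
4\ayah\
5\Nato\
EOF

# Basic regex, delimiter written as \\ (an escaped backslash).
a=$(grep -c '^[0-9][0-9]*\\[A-Z][A-Z]*\\' "$f")
# Extended regex with +.
b=$(grep -E -c '^[0-9]+\\[A-Z]+\\' "$f")
# Delimiter written as [\] (a backslash is literal inside brackets).
c=$(grep -E -c '^[0-9]+[\][A-Z]+\\' "$f")
# Delimiter matched loosely with . (any character).
d=$(grep -E -c '^[0-9]+.[A-Z]+\\' "$f")

echo "$a $b $c $d"   # only USSR and VTOL qualify in each case
rm -f "$f"
```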
Homework 3
• 2. How many entries have a syllable that ends with a 4-consonant cluster?
• ANS: 45
• egrep 'CCCC]' epw.cd (why not \]?) → 56
• grep 'CCCC]' epw.cd → 56
• grep 'CCCC]' epw.cd | grep -v 'ed[ \\]' → 36
• egrep 'CCCC]\\' epw.cd → 45
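On the "why not \]" question: in POSIX regular expressions a `]` with no opening `[` before it is an ordinary character, so it needs no escape. The sketch below uses invented records with a made-up CV-skeleton field to show how the three searches differ; it does not reproduce the counts above.

```shell
# Hypothetical records: id\word\CV-skeleton\ (invented, not real CELEX data).
f=$(mktemp)
cat > "$f" <<'EOF'
1\alpha\[CVCCCC]\
2\beta\[CVCCCC][CVC]\
3\gamma\[CVC][CVC]\
4\glimpsed\[CCVCCCC]\
5\zeta\[CVCCCC][CV]\
EOF

# Any syllable ending in CCCC: the bare ] is matched literally.
any=$(grep -E -c 'CCCC]' "$f")
# Same search, then drop records whose word ends in "ed".
no_ed=$(grep 'CCCC]' "$f" | grep -v -c 'ed[ \\]')
# Require the field delimiter right after ], i.e. the last syllable of the field.
final=$(grep -E -c 'CCCC]\\' "$f")

echo "$any $no_ed $final"   # prints: 4 3 2
rm -f "$f"
```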
Homework 3
• 3. Find all multi-word terms in which only the first letter is capitalized, e.g. Colorado potato beetle.
• ANS: 238/243
• egrep ^'[0-9]+.[A-Z][a-z]+( [a-z]+)+\\' epw.cd | wc -l
• egrep ^'[0-9]+\\[A-Z][a-z]*( [a-z]+)+\\' epw.cd | wc -l
• Entries whose first word is a single capital letter match only the second pattern ([a-z]* instead of [a-z]+):
100184\X chromosome\0\52203\1\P\'Eks-"kr5-m@-s5m\[VCC][CCVV][CV][CVVC]\[Eks][kr@U][m@][s@Um]
100185\X chromosomes\0\52203\1\P\'Eks-"kr5-m@-s5mz\[VCC][CCVV][CV][CVVCC]\[Eks][kr@U][m@][s@Umz]
100287\Y chromosome\0\52250\1\P\'w2-"kr5-m@-s5m\[CVV][CCVV][CV][CVVC]\[waI][kr@U][m@][s@Um]
100288\Y chromosomes\0\52250\1\P\'w2-"kr5-m@-s5mz\[CVV][CCVV][CV][CVVCC]\[waI][kr@U][m@][s@Umz]
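The difference between `[a-z]+` and `[a-z]*` after the capital can be seen on a few invented records (hypothetical ids, words taken from the examples on this slide):

```shell
# Hypothetical records in an id\word\ layout (not real CELEX data).
f=$(mktemp)
cat > "$f" <<'EOF'
1\Colorado potato beetle\
2\X chromosome\
3\Union Jacks\
4\ayah\
5\Colorado\
EOF

# [a-z]+ demands at least one lower-case letter after the capital.
strict=$(grep -E -c '^[0-9]+.[A-Z][a-z]+( [a-z]+)+\\' "$f")
# [a-z]* also lets a one-letter first word such as "X" through.
loose=$(grep -E -c '^[0-9]+\\[A-Z][a-z]*( [a-z]+)+\\' "$f")

echo "$strict $loose"   # prints: 1 2
rm -f "$f"
```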
Homework 3
• 4. Find all multi-word terms in which the first letter (and only the first letter) of each word is capitalized, e.g. Union Jacks and Royal Automobile Club. Note: your regex should be able to accommodate an arbitrary number of words.
• ANS: 296/298
• egrep ^'[0-9]+.[A-Z][a-z]+( [A-Z][a-z]*)+\\' epw.cd
• egrep ^'[0-9]+.[A-Z][a-z]*( [A-Z][a-z]*)+\\' epw.cd
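A quick check of the two patterns on invented records follows. The ids are hypothetical, and "X Ray" is a made-up entry added to show what the `[a-z]*` variant additionally admits.

```shell
# Hypothetical records in an id\word\ layout (not real CELEX data).
f=$(mktemp)
cat > "$f" <<'EOF'
1\Union Jacks\
2\Royal Automobile Club\
3\Colorado potato beetle\
4\X Ray\
5\USSR\
EOF

# First word needs a capital plus at least one lower-case letter.
strict=$(grep -E -c '^[0-9]+.[A-Z][a-z]+( [A-Z][a-z]*)+\\' "$f")
# First word may be a lone capital letter.
loose=$(grep -E -c '^[0-9]+.[A-Z][a-z]*( [A-Z][a-z]*)+\\' "$f")

echo "$strict $loose"   # prints: 2 3
rm -f "$f"
```

The repeated group `( [A-Z][a-z]*)+` is what accommodates an arbitrary number of further words.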
Homework 3
• 5. Find all disyllabic words that contain only vowels.
• ANS: 4
• egrep '\\\[V+\]\[V+\]\\' epw.cd
5\AA\52\5\1\P\"1-'1\[VV][VV]\[eI][eI]
6\AA\95\6\1\P\"1-'1\[VV][VV]\[eI][eI]
4727\ayah\13\2714\2\P\'2-@\[VV][V]\[aI][@]\S\'#-j@\[VV][CV]\[A:][j@]
43355\i.e.\424\22210\1\P\"2-'i\[VV][VV]\[aI][i:]
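The search can be replayed on four records copied from the slides (three of the hits plus the "X chromosome" entry as a non-match). As an assumption for this sketch, the closing brackets are left unescaped, which should be equivalent since a lone `]` is an ordinary character.

```shell
# Records copied from the slides; the quoted 'EOF' keeps backslashes and quotes literal.
f=$(mktemp)
cat > "$f" <<'EOF'
5\AA\52\5\1\P\"1-'1\[VV][VV]\[eI][eI]
4727\ayah\13\2714\2\P\'2-@\[VV][V]\[aI][@]\S\'#-j@\[VV][CV]\[A:][j@]
43355\i.e.\424\22210\1\P\"2-'i\[VV][VV]\[aI][i:]
100184\X chromosome\0\52203\1\P\'Eks-"kr5-m@-s5m\[VCC][CCVV][CV][CVVC]\[Eks][kr@U][m@][s@Um]
EOF

# A CV field that is exactly two bracketed syllables, each made only of V,
# delimited by backslashes on both sides.
count=$(grep -E -c '\\\[V+]\[V+]\\' "$f")

echo "$count"   # prints: 3
rm -f "$f"
```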
Homework 3
• 6. Multiword expressions (find a similar phrase in the wsj/raw corpus, and search for all variants of it in the entire corpus)
• egrep -i '.tip of the *[a-z] iceberg'
• egrep '[Tt]he tip of (a|the).* iceberg'
• patriarchical / a more alarming
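How the second pattern picks up variants of the phrase can be sketched on a few invented sentences (these are illustrations, not lines quoted from the WSJ corpus):

```shell
# Hypothetical sentences, one per line.
f=$(mktemp)
cat > "$f" <<'EOF'
This is only the tip of the iceberg.
The tip of a more alarming iceberg emerged.
It was the tip of the patriarchical iceberg.
An iceberg drifted past the ship.
EOF

# (a|the) allows either determiner; .* absorbs any modifiers before "iceberg".
variants=$(grep -E -c '[Tt]he tip of (a|the).* iceberg' "$f")

echo "$variants"   # prints: 3
rm -f "$f"
```

Because `.*` matches anything, including nothing, the plain phrase and its modified variants are found by the same pattern.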
Homework 3
• 6. Other multiword expressions
• war on (inflation/drugs/the dictator)
• fight the war on the expenditure side rather
• rule of (the day/journalism/Ferdinand Marcos)
• cream of the (British) crop
Searching the treebank
• cat ??/* | egrep -i '(push|pull)[a-z]*'
• OR xkwic?
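The `??/*` glob expands to every file in every two-character subdirectory of the current directory. A minimal sketch, assuming a throwaway directory tree with invented file names and contents rather than the real treebank:

```shell
# Hypothetical directory tree with two-character section directories.
d=$(mktemp -d)
mkdir "$d/00" "$d/01"
printf 'He pushed the cart.\nNothing here.\n' > "$d/00/a.txt"
printf 'Pulling ahead now.\nThey PULL together.\n' > "$d/01/b.txt"

# Concatenate every file in every two-character directory, then search
# case-insensitively for push/pull plus any following letters.
hits=$(cd "$d" && cat ??/* | grep -E -i -c '(push|pull)[a-z]*')

echo "$hits"   # prints: 3
rm -rf "$d"
```

The trailing `[a-z]*` does not change which lines match; it only extends the matched text to cover inflected forms such as "pushed" or "pulling".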
XWin 32
• See e-mail
• Load on laptops, bring laptops to class if any issues
• Go to Feb 9 Emacs & Xkwic lecture