More Xkwic and Tgrep LING 5200 Computational Corpus Linguistics Martha Palmer March 2, 2006
Resources – Laura is bugging me to make a CU Corpora page…
• Like this http://www.stanford.edu/dept/linguistics/corpora/cas-home.html
• TGREP http://www.stanford.edu/dept/linguistics/corpora/cas-tut-tgrep.html
BASED on Kevin Cohen’s LING 5200

Searching with pos tags and !
• [word = "[tT]he" & !( pos = "DT" ) ]; wsj
• [ !(word = "water" | pos = "NN")];
• [ !(word = "water") & !( pos = "NN")];
• [ word != "water" & pos != "NN" ];

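
The last three queries above select the same tokens (De Morgan's law). A minimal Python sketch of that equivalence follows. It is a toy model, not CQP: the token list is invented, and plain string equality stands in for CQP's regular-expression matching on attribute values.

```python
# Toy tokens; the words and tags are made up for illustration.
tokens = [
    {"word": "water", "pos": "NN"},
    {"word": "water", "pos": "VB"},
    {"word": "dog", "pos": "NN"},
    {"word": "runs", "pos": "VBZ"},
]

def q_not_or(t):
    # [ !(word = "water" | pos = "NN") ]
    return not (t["word"] == "water" or t["pos"] == "NN")

def q_and_nots(t):
    # [ !(word = "water") & !(pos = "NN") ]
    return (not t["word"] == "water") and (not t["pos"] == "NN")

def q_not_equal(t):
    # [ word != "water" & pos != "NN" ]
    return t["word"] != "water" and t["pos"] != "NN"

# One list of matching words per query formulation.
matches = [
    [t["word"] for t in tokens if query(t)]
    for query in (q_not_or, q_and_nots, q_not_equal)
]
print(matches)
```

All three formulations keep only the token that is neither the word "water" nor tagged NN.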
Operator precedence
The precedence of the (logical) operators is defined by the following list: if operator x is listed before operator y, x has precedence over y. Operators are evaluated left to right.
• =, !=, !, &, |
• [ ! word = "water" & ! pos = "NN" ]; disambiguates as
• [ !(word = "water") & !( pos = "NN")];

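
Python ranks `not`, `and`, `or` in the same relative order as the `!`, `&`, `|` in the list above, so a truth table in Python can illustrate the grouping. This is only an analogy under that assumption; the variables `a` and `b` stand in for the two attribute tests.

```python
from itertools import product

rows = []
for a, b in product([False, True], repeat=2):
    unparenthesised = not a and not b        # like [ ! word = ... & ! pos = ... ]
    negation_first = (not a) and (not b)     # the reading the slide gives
    wide_negation = not (a and (not b))      # a different grouping, for contrast
    rows.append((a, b, unparenthesised, negation_first, wide_negation))

for row in rows:
    print(row)
```

The third and fourth columns agree on every row, while the fifth differs on some rows, which is why the parentheses are optional in the first case but not in the second.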
Searching sequences with | and ?
• "Bill" [pos = "NP"];
• [pos = "NP"] [pos = "NP"] [pos = "NP"];
• ([pos = "NP"] [pos = "NP"]) | ([pos = "NP"] "of" [pos = "NP"]);
• ([pos = "NP"] "of"? [pos = "NP"]);
Note: First match applies

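
A rough Python sketch of the NP "of"? NP pattern, assuming an invented token list and treating the note about the first match as "try the alternatives in order at each position". That reading of the note is an assumption, not a statement about how CQP is implemented.

```python
# (word, pos) pairs; the sentence and its tags are invented for illustration.
tokens = [
    ("Bank", "NP"), ("of", "IN"), ("America", "NP"),
    ("hired", "VBD"), ("Bill", "NP"), ("Clinton", "NP"),
]

def np_optional_of_np(tokens):
    """Find NP NP or NP "of" NP, trying the shorter alternative first."""
    hits = []
    for i in range(len(tokens)):
        if tokens[i][1] != "NP":
            continue
        if i + 1 < len(tokens) and tokens[i + 1][1] == "NP":
            hits.append(tokens[i:i + 2])          # NP NP
        elif (i + 2 < len(tokens)
              and tokens[i + 1][0] == "of"
              and tokens[i + 2][1] == "NP"):
            hits.append(tokens[i:i + 3])          # NP "of" NP
    return hits

phrases = [" ".join(word for word, _ in hit) for hit in np_optional_of_np(tokens)]
print(phrases)
```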
Corpus Position: wild cards and contexts
• "give" []* "up";
• "give" []{0,5} "up";
• "give" []* "up" within 7;
• "Clinton" expand to 5;
• "Clinton" expand left to 5;
• "Clinton" expand right to 5;

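
The bounded wildcard `[]{0,5}` can be pictured as "at most five arbitrary tokens in between". A small Python sketch under that reading follows; the sentence is invented, and taking the nearest "up" is a simplification rather than a claim about CQP's matching strategy.

```python
def give_up(words, max_gap):
    """Return (start, end) index pairs for "give" ... "up" with at most
    max_gap tokens between the two words."""
    hits = []
    for i, word in enumerate(words):
        if word != "give":
            continue
        # j may be at most max_gap + 1 positions to the right of i.
        for j in range(i + 1, min(i + 2 + max_gap, len(words))):
            if words[j] == "up":
                hits.append((i, j))
                break
    return hits

words = "they give the whole idea up and never give up".split()
wide = give_up(words, 5)     # like "give" []{0,5} "up"
narrow = give_up(words, 2)   # like "give" []{0,2} "up"
print(wide, narrow)
```

With the narrower bound the first "give ... up" drops out, because three tokens separate the two words.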
Assignments and Intersect
• Q1 = "rain";
• Q2 = [pos="NN"];
• intersect Q1 Q2;
• Q1 = [pos = "JJ"] [pos = "NN"];
• Q2 = "acid" "rain";
• intersect Q1 Q2;
• [word = "acid" & pos = "JJ"] [word = "rain" & pos = "NN"]

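
One way to think about `intersect` is as set intersection over match positions. The Python sketch below uses that model on an invented mini-corpus; the helper `bigram_matches` is hypothetical and only handles two-token queries.

```python
# (word, pos) pairs; the corpus is made up for illustration.
corpus = [
    ("acid", "JJ"), ("rain", "NN"), ("fell", "VBD"),
    ("acid", "NN"), ("rain", "NN"),
    ("heavy", "JJ"), ("rain", "NN"),
]

def bigram_matches(corpus, first, second):
    """Set of (start, end) positions where two adjacent tokens satisfy
    the predicates `first` and `second`."""
    return {
        (i, i + 1)
        for i in range(len(corpus) - 1)
        if first(corpus[i]) and second(corpus[i + 1])
    }

q1 = bigram_matches(corpus, lambda t: t[1] == "JJ", lambda t: t[1] == "NN")
q2 = bigram_matches(corpus, lambda t: t[0] == "acid", lambda t: t[0] == "rain")
both = q1 & q2   # intersect Q1 Q2

# The single combined query from the last bullet.
combined = bigram_matches(
    corpus,
    lambda t: t == ("acid", "JJ"),
    lambda t: t == ("rain", "NN"),
)
print(sorted(both), sorted(combined))
```

In this model the intersection and the combined query pick out the same span.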
Structural restrictions
• "give" []* "up" within s;
• ("gain" []* "profit") | ("profit" []* "gain") within 3 s;
• ("gain" []* "profit") | ("profit" []* "gain") within article;
• "Clinton" expand left to 2 s;

Defining structural restrictions
• Nounphrase = [pos = "DT"] [pos = "JJ"] [pos = "NN"];
• Nounphrase;
• [pos = "JJ"]
• Go back to select

For fun
• <s> [pos = "V.*"][pos = "PN.*"] </s>
• <s> []* [pos = "V.*"][pos = "PN.*"] </s>
• ( [pos = "V.*"] [pos = "PN.*"]) within s
• Not a question, not beginning of sentence…

less is more
• less <filename>
• cat ??/* | less
• Switches
  • SPACE – next screenful
  • b – previous screenful
  • /<reg exp pattern> – search for pattern (e.g. /RNR)
  • ?<reg exp pattern> – search backwards for pattern
  • q – quit

Searching for a word
• tgrep Halloween – what happens?
• Why don’t you have to specify a file?
babel> grep tgrep .cshrc
# tgrep stuff
#setenv TGREP_CORPUS /corpora/treebank2/tbl_075/tgrepabl/brwn_cmb.crp
setenv TGREP_CORPUS /corpora/treebank2/tgrepabl/wsj_mrg.crp
• Count results: tgrep research | wc -l
• cat ??/* | grep Halloween | wc -l

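
A caveat on the `grep ... | wc -l` count: `grep` prints each matching line once and `wc -l` counts lines, so a line containing the word twice still counts once. A tiny Python sketch with invented lines shows the difference between counting lines and counting occurrences.

```python
import re

# Invented text lines standing in for the corpus files.
lines = [
    "Halloween candy for Halloween night",
    "no match here",
    "a Halloween party",
]

# What grep | wc -l reports: lines with at least one hit.
matching_lines = sum(1 for line in lines if "Halloween" in line)

# Total number of occurrences, counting repeats within a line.
occurrences = sum(len(re.findall("Halloween", line)) for line in lines)

print(matching_lines, occurrences)
```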
Tgrep Switches
• -a Match on all patterns in a sentence
• -w Return the whole sentence
• -n Put the entire string on one line
• -t Print only the terminals

Viewing it in sentential context
• tgrep -wn Halloween | more
• tgrep -wn research | more (20,865 hits)
• Can also use less

Viewing it in sentential context
• tgrep -wn research | more

Searching by POS
• tgrep NNS | more
Another way to do your sanity check

See more data?
• tgrep NNS | grep . | more

Sentential context (again)
• tgrep -wn NNS | more

Searching by syntactic constituent
• tgrep NP | more

Single-line outputs
• tgrep -n NP | more

Viewing tree-like output
• tgrep -w NP | head -20

Searching for relations between nodes
• tgrep 'NP < CC' | head -16

tgrep –g (whole language)
• A < B – A immediately dominates B
• A > B – A is immediately dominated by B
• A << B – A dominates B
• A >> B – A is dominated by B
• A . B – A immediately precedes B
• A .. B – A precedes B
• A <<, B – B is the leftmost descendant of A
• A <<' B – B is the rightmost descendant of A

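
The dominance relations above can be sketched in Python over a small hand-built tree. This is a toy model of the relations only, not of tgrep itself; the sentence, its bracketing, and all function names are invented for illustration, and the precedence relations (. and ..) are left out.

```python
# A node is (label, children); a word is a node with no children.
TREE = ("S", [
    ("NP", [("NNP", [("Clinton", [])])]),
    ("VP", [
        ("VBD", [("liked", [])]),
        ("NP", [("DT", [("the", [])]), ("NN", [("plan", [])])]),
    ]),
])

def descendants(node):
    """Yield every node strictly below `node`, depth first."""
    for child in node[1]:
        yield child
        yield from descendants(child)

def immediately_dominates(a, b_label):      # A < B
    return any(child[0] == b_label for child in a[1])

def dominates(a, b_label):                  # A << B
    return any(n[0] == b_label for n in descendants(a))

def edge_descendant(a, b_label, index):
    """index 0 follows first children (leftmost descendant),
    index -1 follows last children (rightmost descendant)."""
    node = a
    while node[1]:
        node = node[1][index]
        if node[0] == b_label:
            return True
    return False

VP = TREE[1][1]
print(immediately_dominates(VP, "NP"), immediately_dominates(VP, "DT"))
print(dominates(VP, "DT"))
print(edge_descendant(TREE, "Clinton", 0), edge_descendant(TREE, "plan", -1))
```

Here VP immediately dominates NP but only dominates DT, which is the difference between `<` and `<<`.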
Alternation
• node names can be ORed e.g.
• tgrep 'Clinton|Gore' | head

Character classes
• Regular expressions
• tgrep '/[Cc]hild/' | egrep . | head

Working towards that weird example…
• tgrep '/[Pp]resident/' | head

Combining alternation and a regular expression
• tgrep 'Clinton|Gore|/[Pp]resident/' | head

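
A Python sketch of how such a mixed pattern could be read: plain names must equal the node label, while the slash-delimited part is a regular expression. Whether the regular expression may match anywhere inside the label is an assumption here (modelled with `re.search`), and the label list is invented.

```python
import re

def node_matches(label):
    """Model of Clinton|Gore|/[Pp]resident/ over a single node label."""
    # Literal alternatives: the whole label must be one of these names.
    if label in ("Clinton", "Gore"):
        return True
    # Regex alternative: assumed to match anywhere inside the label.
    return re.search(r"[Pp]resident", label) is not None

labels = ["Clinton", "Clintons", "Gore", "President", "presidential", "vice"]
hits = [label for label in labels if node_matches(label)]
print(hits)
```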
Searching for a transitive verb
• tgrep -w 'VP << like < NP << DT' | more

Verbs + Particles
• tgrep -w 'VP << kick' > kick
• tgrep 'VP << /kick.*/ <2 PRT' kick
• tgrep 'VP <1 VB <2 PRT' kick
• tgrep -nw 'VP <1 /VB.*/ <2 PRT' kick
• tgrep 'VP <1 (VB < kick) <2 PRT' kick
• tgrep 'VP <1 (/VB.*/ < kick) <2 PRT' kick
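
Reading `<1` and `<2` as "first child" and "second child", the last pattern asks for a VP whose first child is a verb tag over the word kick and whose second child is PRT. A toy Python check under that reading follows; the three VP trees and the helper names are invented, and `re.search` again stands in for tgrep's regular-expression matching as an assumption.

```python
import re

# A node is (label, children); a word is a node with no children.
VP_KICKED_OFF = ("VP", [("VBD", [("kicked", [])]),
                        ("PRT", [("RP", [("off", [])])])])
VP_KICK_OFF = ("VP", [("VB", [("kick", [])]),
                      ("PRT", [("RP", [("off", [])])])])
VP_KICK_BALL = ("VP", [("VB", [("kick", [])]),
                       ("NP", [("DT", [("the", [])]), ("NN", [("ball", [])])])])

def nth_child(node, n):
    """Return the n-th child (1-based), or None if there is none."""
    children = node[1]
    return children[n - 1] if len(children) >= n else None

def verb_particle(vp, verb=None):
    """Model of VP <1 /VB.*/ <2 PRT, optionally with (/VB.*/ < verb)."""
    first, second = nth_child(vp, 1), nth_child(vp, 2)
    if vp[0] != "VP" or first is None or second is None:
        return False
    if re.search(r"VB.*", first[0]) is None or second[0] != "PRT":
        return False
    if verb is None:
        return True
    # The verb tag must immediately dominate the given word.
    return any(child[0] == verb for child in first[1])

results = [
    verb_particle(VP_KICKED_OFF),
    verb_particle(VP_KICKED_OFF, "kick"),
    verb_particle(VP_KICK_OFF, "kick"),
    verb_particle(VP_KICK_BALL),
]
print(results)
```

The second result is False because the word under the verb tag is kicked, not kick, which is why the earlier pattern on the slide uses /kick.*/.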