Extraction of Cooccurrence Data
Basic Assumptions • The collocates of a collocation cooccur more frequently within text than arbitrary word combinations. (Recurrence) • Stricter control of cooccurrence data leads to more meaningful results in collocation extraction.
Word (Co)occurrence • Distribution of words and word combinations in text approximately described by Zipf’s law. • Distribution of combinations is more “extreme” than that of individual words.
Word (Co)occurrence • Zipf's law (frequency spectrum form): n_m ∝ 1/m², where n_m is the number of different words occurring m times • i.e., there is a large number of low-frequency words, and few high-frequency ones
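The frequency spectrum n_m can be computed directly from token counts. A minimal Perl sketch on a made-up toy token list (not from the slides):

```perl
# Illustration: compute the frequency spectrum n_m for a toy token list,
# where n_m = number of distinct word types occurring exactly m times.
my @tokens = qw(the dog saw the cat the dog barked);
my %freq;
$freq{$_}++ foreach @tokens;            # type frequencies f(w)
my %spectrum;
$spectrum{$_}++ foreach values %freq;   # n_m
# $spectrum{1} counts the hapax legomena (types seen exactly once)
```

On real corpora the spectrum is heavily skewed towards m = 1, which is the point the slide makes.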
Word (Co)occurrence • Collocations will preferably be found among highly recurrent word combinations extracted from text. • Large amounts of text need to be processed to obtain a sufficient number of high-frequency combinations.
Control of Candidate Data • Extract collocations from relational bigrams • Syntactic homogeneity of candidate data • (Grammatical) cleanness of candidates e.g. N+V pairs: Subject+V vs. Object+V • Text type, domain, and size of source corpus influence the outcome of collocation extraction
Terminology • Extraction corpus: tokenized, pos-tagged, or syntactically analysed text • Base data: list of bigrams found in the corpus • Cooccurrence data: bigrams with contingency tables • Collocation candidates: ranked bigrams
Types and Tokens • Frequency counts (from corpora) • identify labelled units (tokens), e.g. words, NPs, Adj-N pairs • set of different labels (types) • type frequency = number of tokens labelled with this type • example: ... what the black box does ...
Types and Tokens • Counting cooccurrences • bigram tokens = pairs of word tokens • bigram types = pairs of word types • contingency table = four-way classification of bigram tokens according to their components
Contingency Tables

contingency table for pair type (u, v):

              v         ¬v
    u        O11       O12
   ¬u        O21       O22

O11 = number of pair tokens (u, v); O12 = pairs with first component u but a different second component; O21 = pairs with second component v but a different first component; O22 = all remaining pairs; O11 + O12 + O21 + O22 = N
Collocation Extraction: Processing Steps • Corpus preprocessing • tokenization (orthographic words) • pos-tagging • morphological analysis / lemmatization • partial parsing • (full parsing)
Collocation Extraction: Processing Steps • Extraction of base data from corpus • adjacent word pairs • Adj-N pairs from NP chunks • Object-V & Subject-V from parse trees • Calculation of cooccurrence data • compute contingency table for each pair type (u,v)
Collocation Extraction: Processing Steps • Ranking of cooccurrence data by "association scores" • measure statistical association between types u and v • true collocations should obtain high scores • using association measures (AM) • N-best list = listing of N highest-ranked collocation candidates
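The ranking step can be sketched with one simple association measure, pointwise mutual information, MI = log2( f(u,v) · N / (f1(u) · f2(v)) ). The counts below are made-up toy data, not taken from any corpus:

```perl
# Rank pair types by pointwise mutual information (toy counts).
my %f  = ("black,box" => 2, "old,box" => 1);  # pair frequencies f(u,v)
my %f1 = ("black" => 3, "old" => 1);          # marginal frequencies f1(u)
my %f2 = ("box" => 4);                        # marginal frequencies f2(v)
my $N  = 8;
my %score;
for my $pair (keys %f) {
    my ($u, $v) = split /,/, $pair;
    # MI = log2( f(u,v) * N / (f1(u) * f2(v)) )
    $score{$pair} = log($f{$pair} * $N / ($f1{$u} * $f2{$v})) / log(2);
}
# N-best list: candidates sorted by descending association score
my @nbest = sort { $score{$b} <=> $score{$a} } keys %score;
```

Note how MI penalizes high marginal frequencies: (old, box) outranks (black, box) here despite its lower pair frequency.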
Base Data: How to Get It? • Adj-N • adjacency data • numerical span • NP chunking • (lemmatized)
Base Data: How to Get It? • V-N • adjacency data • sentence window • (partial) parsing • identification of grammatical relations • (lemmatized)
Base Data: How to Get It? • PP-V • adjacency data • PP chunking • separable verb particles (in German) • (full syntactic analysis) • (lemmatization?)
Adj-N In the first place, the ‘less genes, more behavioural flexibility’ argument is a total red herring. In/PRP the/ART first/ORD place/N ,/$, the/ART ‘/$’ less/ADJ genes/N ,/$, more/ADJ behavioural/ADJ flexibility/N ’/$’ argument/N is/V a/ART total/ADJ red/ADJ herring/N ./$.
Adj-N: span size 1 (adjacency), w_j with j = -1, filter pos(w_i) = N • first/ORD place/N • less/ADJ genes/N • behavioural/ADJ flexibility/N • ’/$’ argument/N • red/ADJ herring/N
Adj-N: span size 2, w_j with j = -2, -1, filter pos(w_i) = N • first/ORD place/N • the/ART place/N • less/ADJ genes/N • ‘/$’ genes/N • behavioural/ADJ flexibility/N • more/ADJ flexibility/N • ’/$’ argument/N • flexibility/N argument/N • red/ADJ herring/N • total/ADJ herring/N
Adj-N: span size 2, additional filter pos(w_j) = ADJ, pos(w_i) = N • less/ADJ genes/N • behavioural/ADJ flexibility/N • more/ADJ flexibility/N • red/ADJ herring/N • total/ADJ herring/N
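The span-based extraction above can be sketched in a few lines of Perl. The tagged words are illustrative; a real run would iterate over the whole tagged corpus:

```perl
# Span-based Adj-N candidate extraction, span size 2 to the left:
# pair each noun w_i with the words w_j at j = -2, -1, keeping only
# pairs where w_j is tagged as an adjective.
my @tagged = (["a","ART"], ["total","ADJ"], ["red","ADJ"], ["herring","N"]);
my @pairs;
for my $i (0 .. $#tagged) {
    next unless $tagged[$i][1] eq "N";           # filter: pos(w_i) = N
    for my $j ($i - 2 .. $i - 1) {
        next if $j < 0;
        next unless $tagged[$j][1] eq "ADJ";     # filter: pos(w_j) = ADJ
        push @pairs, "$tagged[$j][0] $tagged[$i][0]";
    }
}
# @pairs is ("total herring", "red herring")
```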
Adj-N (S (PP In/PRP (NP the/ART first/ORD place/N ) ) ,/$, (NP the/ART ‘/$’ less/ADJ genes/N ,/$, more/ADJ behavioural/ADJ flexibility/N ’/$’ argument/N ) (VP is/V (NP a/ART total/ADJ red/ADJ herring/N ) ) ) ./$.
Adj-N (S (PP-mod In/PRP (NP the/ART first/ORD place/N ) ) ,/$, (NP-subj the/ART ‘/$’ less/ADJ genes/N ,/$, more/ADJ behavioural/ADJ flexibility/N ’/$’ argument/N ) (VP-copula is/V (NP a/ART total/ADJ red/ADJ herring/N ) ) ) ./$.
Adj-N: NP chunks
NP chunks • (NP the/ART first/ORD place/N ) • (NP the/ART ‘/$’ less/ADJ genes/N ,/$, more/ADJ behavioural/ADJ flexibility/N ’/$’ argument/N ) • (NP a/ART total/ADJ red/ADJ herring/N )
Adj-N pairs • less/ADJ genes/N • more/ADJ flexibility/N • behavioural/ADJ flexibility/N • more/ADJ argument/N • behavioural/ADJ argument/N • total/ADJ herring/N • red/ADJ herring/N
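One plausible reading of the chunk-based pairing shown above: within an NP chunk, each adjective is paired with every noun that follows it in the same chunk. A hedged Perl sketch on part of the example NP (chunk identification itself is assumed done):

```perl
# Pair each adjective with every following noun inside one NP chunk.
my @chunk = (["more","ADJ"], ["behavioural","ADJ"],
             ["flexibility","N"], ["argument","N"]);
my @pairs;
for my $i (0 .. $#chunk) {
    next unless $chunk[$i][1] eq "ADJ";
    for my $j ($i + 1 .. $#chunk) {
        push @pairs, "$chunk[$i][0] $chunk[$j][0]"
            if $chunk[$j][1] eq "N";
    }
}
# reproduces: more flexibility, more argument,
#             behavioural flexibility, behavioural argument
```

This is why chunk-based extraction also yields "wrong" pairs like behavioural argument, which the later association scoring must weed out.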
N-V: Object-VERB • spill the beans • Good for you for guessing the puzzle but from the beans Mike spilled to me, I think those kind of twists are more maddening than fun. • bury the hatchet • Paul McCartney has buried the hatchet with Yoko Ono after a dispute over the songwriting credits of some of the best-known Beatles songs.
N-V: Object-Mod-VERB • keep <one's> nose to the grindstone • I'm very impressed with you for having kept your nose to the grindstone, I'd like to offer you a managerial position. • We've learned from experience and kept our nose to the grindstone to make sure our future remains a bright one. • She keeps her nose to the grindstone.
N-V: Object-Mod-VERB • keep <one's> nose to the grindstone (VP {kept, keeps, ...} {(NP-obj your nose), (NP-obj our nose), (NP-obj her nose), ... } (PP-mod to the grindstone) )
PN-V: P-Object-VERB • zur Verfügung stellen (make available)Peter stellt sein Auto Maria zur Verfügung (Peter makes his car available to Maria) • in Frage stellen (question) Peter stellt Marias Loyalität in Frage (Peter questions Maria’s loyalty) • in Verbindung setzen (to contact) Peter setzt sich mit Maria in Verbindung (Peter contacts Maria)
Contingency Tables for Relational Cooccurrences • pair tokens: (big, dog), (black, box), (black, dog), (small, cat), (small, box), (black, box), (old, box), (tabby, cat) • pair type: (u,v) = (black, box)
Contingency Tables for Relational Cooccurrences • for (u,v) = (black, box): f(u,v) = 2, f1(u) = 3, f2(v) = 4, N = 8
Contingency Tables for Relational Cooccurrences • real data from the BNC (adjacent adj-noun pairs, lemmatised)
Contingency Tables in Perl

%F = (); %F1 = (); %F2 = (); $N = 0;
while (($u, $v) = get_pair()) {
    $F{"$u,$v"}++;
    $F1{$u}++;
    $F2{$v}++;
    $N++;
}
Contingency Tables in Perl

foreach $pair (keys %F) {
    ($u, $v) = split /,/, $pair;
    $f  = $F{$pair};
    $f1 = $F1{$u};
    $f2 = $F2{$v};
    $O11 = $f;
    $O12 = $f1 - $f;
    $O21 = $f2 - $f;
    $O22 = $N - $f1 - $f2 + $f;   # add f back: it is subtracted in both marginals
    # ...
}
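A quick sanity check of the cell formulas on the toy counts from the (black, box) example. Note that f(u,v) must be added back when computing O22, because it is subtracted once in each marginal:

```perl
# Contingency cells for f(u,v) = 2, f1(u) = 3, f2(v) = 4, N = 8.
my ($f, $f1, $f2, $N) = (2, 3, 4, 8);
my $O11 = $f;                    # pairs (u, v)
my $O12 = $f1 - $f;              # u with some other second component
my $O21 = $f2 - $f;              # some other first component with v
my $O22 = $N - $f1 - $f2 + $f;   # everything else; f(u,v) added back,
                                 # since it was subtracted in both marginals
# cells: O11 = 2, O12 = 1, O21 = 2, O22 = 3; they sum to N = 8
```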
Why are Positional Cooccurrences Different? • adjectives and nouns cooccurring within sentences • "I saw a black dog" → (black, dog); f(black, dog) = 1, f1(black) = 1, f2(dog) = 1 • "The old man with the silly brown hat saw a black dog" → (old, dog), (silly, dog), (brown, dog), (black, dog), ..., (black, man), (black, hat); f(black, dog) = 1, f1(black) = 3, f2(dog) = 4
Why are Positional Cooccurrences Different? • "wrong" combinations could be considered extraction noise (association measures are meant to distinguish noise from recurrent combinations) • but: very large amount of noise • statistical models assume that noise is completely random • but: marginal frequencies often increase in large steps
Contingency Tables for Segment-Based Cooccurrences • within pre-determined segments (e.g. sentences) • components of cooccurring pairs may be syntactically restricted (e.g. adj-noun, or singular noun + 3rd-person-singular verb) • for a given pair type (u,v), the set of all sentences is classified into four categories
Contingency Tables for Segment-Based Cooccurrences • u ∈ S = at least one occurrence of u in sentence S • u ∉ S = no occurrence of u in sentence S • v ∈ S = at least one occurrence of v in sentence S • v ∉ S = no occurrence of v in sentence S
Contingency Tables for Segment-Based Cooccurrences • fS(u,v) = number of sentences containing both u and v • fS(u) = number of sentences containing u • fS(v) = number of sentences containing v • NS = total number of sentences
Frequency Counts for Segment-Based Cooccurrences • adjectives and nouns cooccurring within sentences • "I saw a black dog" → (black, dog); fS(black, dog) = 1, fS(black) = 1, fS(dog) = 1 • "The old man with the silly brown hat saw a black dog" → (old, dog), (silly, dog), (brown, dog), (black, dog), ..., (black, man), (black, hat); fS(black, dog) = 1, fS(black) = 1, fS(dog) = 1
Segment-Based Cooccurrences in Perl

foreach $S (@sentences) {
    %words = map { $_ => 1 } words($S);
    %pairs = map { $_ => 1 } pairs($S);
    foreach $w (keys %words) { $FS_w{$w}++; }
    foreach $p (keys %pairs) { $FS_p{$p}++; }
    $NS++;
}
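A possible second pass over the sentence-level counts then fills the segment-based contingency table (assuming pair keys of the form "u,v", as in the relational code earlier). The cells now count sentences, not bigram tokens; the toy counts below correspond to the two example sentences, both of which contain black and dog:

```perl
# Build a segment-based contingency table from sentence-level counts.
my %FS_w = ("black" => 2, "dog" => 2);   # fS(w): sentences containing w
my %FS_p = ("black,dog" => 2);           # fS(u,v): sentences with both
my $NS = 2;                              # total number of sentences
my ($u, $v) = ("black", "dog");
my $O11 = $FS_p{"$u,$v"} // 0;           # sentences with both u and v
my $O12 = $FS_w{$u} - $O11;              # u but not v
my $O21 = $FS_w{$v} - $O11;              # v but not u
my $O22 = $NS - $O11 - $O12 - $O21;      # neither u nor v
```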
Contingency Tables for Distance-Based Cooccurrences • problems are similar to those for segment-based cooccurrence data • but: no pre-defined segments • accurate counting is difficult • here: sketched for a special case • all orthographic words • numerical span: nL words to the left, nR to the right • no stop word lists
Contingency Tables for Distance-Based Cooccurrences • example: nL = 3, nR = 2
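One way the special case might be counted, as a sketch under the stated assumptions (plain word tokens, no sentence boundaries, no stop word lists, one cooccurrence token per pair of positions):

```perl
# Distance-based cooccurrence counting with nL = 3, nR = 2: each token
# w_i cooccurs with the tokens up to 3 positions to its left and 2 to
# its right.
my ($nL, $nR) = (3, 2);
my @words = qw(the old man saw a black dog);
my %F;
for my $i (0 .. $#words) {
    for my $j ($i - $nL .. $i + $nR) {
        next if $j < 0 or $j > $#words or $j == $i;
        $F{"$words[$i],$words[$j]"}++;
    }
}
# e.g. "black" sees "dog" once (one position to its right)
```

Note the asymmetric span makes the counts direction-sensitive, which is one reason accurate marginal frequencies are hard to define for distance-based data.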