Corpus • What is a corpus? • A collection of naturally occurring language text, chosen to characterize a state or variety of a language. John Sinclair, Corpus, Concordance, Collocation, OUP, 1991 • Balanced corpus • A corpus is a representative sample if what we can find in the sample also holds for the general population.
Some Well-Known Corpora • Brown Corpus • Created in the 1960s at Brown University • 1 million words • Balanced • POS tagged
A01 0010 1 The Fulton County Grand Jury said Friday an investigation
A01 0020 1 of Atlanta's recent primary election produced "no evidence"
A01 0020 9 that any irregularities took place.
A01 0030 5 The jury further said in term-end presentments that
A01 0040 3 the City Executive Committee, which had over-all charge
A01 0050 2 of the election, "deserves the praise and thanks of
A01 0050 11 the City of Atlanta" for the manner in which the election
A01 0060 11 was conducted.
British National Corpus (BNC) • A 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English, both spoken and written. • About 10 meters of shelf space if printed
Some Well-Known Corpora • TREC • Text REtrieval Conference • newspaper articles (majority) • abstracts of scientific articles • federal register • 3GB of compressed text • Used to test information retrieval systems
Sample from the TREC Corpus
<DOC>
<DOCNO> DOE1-03-0002 </DOCNO>
<TEXT>
Pressure and fluid oscillations at the steam injection into pool water were discussed from the view point of the conversion of thermal energy into work. When the change of fluid state moves clockwise in the p-V diagram, the oscillation sustains since the thermal energy changes into positive work. The oscillation threshold at the condensation oscillation was discussed as putting the conversion ratio equal to zero. The change of oscillation pattern by the steam mass flow at the chugging was also discussed deriving the p-V diagram by a numerical model of chugging.
</TEXT>
</DOC>
Project Gutenberg • History • Project Gutenberg began in 1971 when Michael Hart was given an operator's account with $100,000,000 of computer time on a mainframe at the University of Illinois. • Contents • Light Literature: Alice in Wonderland • Heavy Literature: Bible, Shakespeare, Moby Dick • References: Roget's Thesaurus, almanacs
Some Well-Known Corpora • Penn Treebank • Parse trees for 1 million words of WSJ (Wall Street Journal) text. • Created by LDC at UPenn • The largest treebank • Created semi-automatically.
( (S
    (NP-SBJ
      (NP Pierre Vinken) ,
      (ADJP (NP 61 years) old) ,)
    (VP will
      (VP join
        (NP the board)
        (PP-CLR as (NP a nonexecutive director))
        (NP-TMP Nov. 29)))
    .))
( (S
    (NP-SBJ Mr. Vinken)
    (VP is
      (NP-PRD
        (NP chairman)
        (PP of
          (NP
            (NP Elsevier N.V.) ,
            (NP the Dutch publishing group)))))
    .))
Some Well-Known Corpora • SUSANNE corpus • Created by University of Sussex in England • 1/7 of Brown corpus • Manually parsed • Checked many times • Very well documented
A01:0010a - YB <minbrk> - [Oh.Oh]
A01:0010b - AT The the [O[S[Nns:s.
A01:0010c - NP1s Fulton Fulton [Nns.
A01:0010d - NNL1cb County county .Nns]
A01:0010e - JJ Grand grand .
A01:0010f - NN1c Jury jury .Nns:s]
A01:0010g - VVDv said say [Vd.Vd]
A01:0010h - NPD1 Friday Friday [Nns:t.Nns:t]
A01:0010i - AT1 an an [Fn:o[Ns:s.
A01:0010j - NN1n investigation investigation .
A01:0020a - IO of of [Po.
A01:0020b - NP1t Atlanta Atlanta [Ns[G[Nns.Nns]
A01:0020c - GG +<apos>s - .G]
A01:0020d - JJ recent recent .
A01:0020e - JJ primary primary .
A01:0020f - NN1n election election .Ns]Po]Ns:s]
A01:0020g - VVDv produced produce [Vd.Vd]
A01:0020h - YIL <ldquo> - .
A01:0020i - ATn +no no [Ns:o.
A01:0020j - NN1u evidence evidence .
A01:0020k - YIR +<rdquo> - .
A01:0020m - CST that that [Fn.
A01:0030a - DDy any any [Np:s.
A01:0030b - NN2 irregularities irregularity .Np:s]
A01:0030c - VVDv took take [Vd.Vd]
A01:0030d - NNL1c place place [Ns:o.Ns:o]Fn]Ns:o]Fn:o]S]
A01:0030e - YF +. - .O]
A01:0030f - YB <minbrk> - [Oh.Oh]
Some Well-Known Corpora • SemCor • a 200,000 word corpus manually tagged by lexicographers as part of the WordNet Project.
Canadian Hansards • A bilingual corpus of the proceedings of the Canadian Parliament. • Contains parallel texts in English and French, which have been used to investigate statistically based machine translation.
<PAIR>
<ENGLISH> no , it is a falsehood . </ENGLISH>
<FRENCH> non , ce est un mensonge . </FRENCH>
</PAIR>
<PAIR>
<ENGLISH> Mr. Speaker , the record speaks for itself with regard to what I said about the price of fertilizer . </ENGLISH>
<FRENCH> monsieur le Orateur , ma déclaration sur le prix de les engrais a été confirmée par les événements . </FRENCH>
</PAIR>
Word Counting • Simplest kind of statistics • What is a word? • The answer is not as easy as it looks. • Space separated? What about punctuation marks? • Examples: New York; bookstore’s; $22.50; McDonald’s; google.com; can’t; O’Connor; I’d; Tiburon, Calif.-based; data base
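To see why "space separated" is not enough, here is a minimal sketch that splits a made-up sentence on white space. The sentence is an illustration built from the kinds of examples above, not text from any corpus.

```python
# Made-up example sentence (an assumption for illustration only).
text = "The New York bookstore's price is $22.50, but I can't pay."

# Splitting on white space keeps punctuation glued to the neighbouring word
# and breaks the single name "New York" into two tokens.
tokens = text.split()
print(tokens)
```

The comma stays attached to the price, the final period stays attached to the last word, and "New York" becomes two tokens.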
Words in Chinese/Japanese/Korean • No word boundary • Can be treated in the same way as phrasal words in English. • almost all words are phrasal words. • Segmentation Problem: • Tokenize the Chinese text so that each token is a word. • Lack of standard definition of what is a word.
Words in Web Pages • Issues: tags, scripts, images • Source on the next page
</table><p class=e><table border=0 cellpadding=1 cellspacing=0 width=100%><tr><td width=1% valign=top nowrap><font size=-1 class=f>Category: </font></td><td><font size=-1><a href=http://directory.google.com/Top/Regional/Europe/United_Kingdom/Education/Products_and_Services/?tc=1>Regional > Europe > ... > Education > Products and Services</a> </font></table><div><p class=g><a href=http://www.cobuild.collins.co.uk/ onmousedown="return clk(1,this)"><b>Cobuild</b> Home Page</a><br><font size=-1><b>Cobuild</b> English Dictionary, New Edition About <b>Cobuild</b> About the Bank of English Idiom<br> of the Day Wordwatch Feature About WordbanksOnline Corpus Access <b>Cobuild</b> <b>...</b> <br><span class=f><font size=-1>Description:</font></span> Develops and maintains corpora for modern written and spoken text. Features an online resource of...<br><span class=f>Category: </span><a class=fl href=http://directory.google.com/Top/Reference/Education/Products_and_Services/English_as_a_Second_Language/?il=1>Reference > Education > ... > English as a Second Language</a><br><font color=#008000>www.cobuild.collins.co.uk/ - 2k - </font><a class=fl href=http://216.239.39.104/search?q=cache:Z7MzsR4S6hYJ:www.cobuild.collins.co.uk/+cobuild&hl=en&ie=UTF-8>Cached</a> - <a class=fl href=/search?hl=en&lr=&ie=UTF-8&q=related:www.cobuild.collins.co.uk/>Similar pages</a></font> <blockquote class=g><p class=g><a href=http://www.cobuild.collins.co.uk/about.html onmousedown="return clk(2,this)">About <b>COBUILD</b></a><br><font size=-1>Welcome to <b>Cobuild</b>. If you're interested in the English <b>...</b> A Brief Introduction<br> to <b>Cobuild</b>. <b>Cobuild</b> is a department of HarperCollins Publishers <b>...</b>
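One possible way to get at the words in a page like this is to drop the markup before tokenizing. The sketch below uses Python's standard `html.parser` module; the class name `TextExtractor` and the function `visible_text` are hypothetical names chosen for this example, and real pages usually need more care (entities, hidden elements, words split across tags).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text outside of tags, skipping script and style content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def visible_text(html):
    """Return the text content of an HTML fragment with white space normalized."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


sample = ('<p class="g"><a href="http://www.cobuild.collins.co.uk/">'
          '<b>Cobuild</b> Home Page</a><br><script>var x = 1;</script></p>')
print(visible_text(sample))
```

Joining the pieces with spaces is a simplification: a word that is partly bold, for instance, would be split in two.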
Tokenizer • Space tokenizer: • A token is a consecutive sequence of characters between white spaces. • Simple • Fails in many cases • Regular Expression tokenizer • Use regular expressions to define tokens. • The longest prefix that matches a regular expression is a token. • Remove the token from the input stream and repeat the process.
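A minimal sketch of the regular expression tokenizer described above: at each position, try every pattern, take the longest match as the token, remove it, and repeat. The four patterns in `TOKEN_PATTERNS` are assumptions made for this example, not a standard token definition.

```python
import re

# Hypothetical token patterns; the longest match at the current position wins.
TOKEN_PATTERNS = [
    re.compile(r"\$?\d+(?:[.,]\d+)*"),           # numbers and prices such as $22.50
    re.compile(r"[A-Za-z]+(?:['’][A-Za-z]+)*"),  # words with internal apostrophes: can't, O'Connor
    re.compile(r"[A-Za-z]+(?:\.[A-Za-z]+)+"),    # dotted names such as google.com
    re.compile(r"[^\sA-Za-z0-9]"),               # any other single character, e.g. punctuation
]


def regex_tokenize(text):
    """Split text into tokens by repeatedly taking the longest matching prefix."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1  # white space separates tokens but is not a token itself
            continue
        best = ""
        for pattern in TOKEN_PATTERNS:
            match = pattern.match(text, pos)
            if match and len(match.group()) > len(best):
                best = match.group()
        if not best:
            best = text[pos]  # safety net: emit one character so the loop always advances
        tokens.append(best)
        pos += len(best)
    return tokens


print(regex_tokenize("I can't pay $22.50 at google.com."))
```

Unlike the space tokenizer, this keeps the price and the domain name whole and emits the final period as its own token.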
Counting Words: Example • If you pay, the story rolls. If you don’t, the story folds. • 12 word tokens: the number of running words • 8 word types: the number of distinct words
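The counts in this example can be reproduced with a few lines of code. This sketch assumes that a word is a run of letters with optional internal apostrophes and that case is ignored; other definitions of "word" would give other counts.

```python
import re
from collections import Counter

text = "If you pay, the story rolls. If you don't, the story folds."

# Assumed word definition: letters, optionally joined by apostrophes, lowercased.
tokens = re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())
types = Counter(tokens)

print(len(tokens))  # number of word tokens
print(len(types))   # number of word types
```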
Zipf’s Law • Rank * frequency = constant • For English, the constant is about 0.1 when frequency is expressed as a fraction of all word tokens • Example • The frequency count of the 50th most frequent word is 3 times that of the 150th. • Implications: • 20% of the words cover 80% of the text • Difficult to achieve (near) complete coverage.
word      1000*rf/n    word      1000*rf/n    word      1000*rf/n
the        59          from       92          or        101
of         58          he         95          about     102
to         82          million    98          market    101
a          98          year      100          they      103
in        103          its       100          this      105
and       122          be        104          would     107
that       75          was       105          you       106
for        84          company   109          which     107
is         72          an        105          bank      109
said       78          has       106          stock     110
it         78          are       109          trade     112
on         77          have      112          his       114
by         81          but       114          more      114
as         80          will      117          who       106
at         80          say       113          one       107
mr         86          new       112          their     108
with       91          share     114
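A sketch of how a table like this could be computed from any token list, reading r as the rank of a word, f as its frequency count, and n as the total number of tokens (that reading of the column header is an interpretation). The function name `zipf_table` and the toy data are assumptions for illustration.

```python
from collections import Counter


def zipf_table(tokens, top=10):
    """Return (word, rank, frequency, 1000*r*f/n) for the most frequent words."""
    n = len(tokens)
    ranked = Counter(tokens).most_common(top)
    return [(word, rank, freq, 1000 * rank * freq / n)
            for rank, (word, freq) in enumerate(ranked, start=1)]


# Toy data built so that frequency falls roughly as 1/rank.
toy_tokens = ["a"] * 6 + ["b"] * 3 + ["c"] * 2 + ["d"]
for row in zipf_table(toy_tokens):
    print(row)
```

If Zipf's law held exactly, the last column would be the same for every row; on real text it drifts, as the values in the table show.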
What Do Word Counts Tell Us? • Information retrieval (IR) systems use word counts to determine the importance of words in a document. • Two intuitions: • If a word is frequently used in a document, it is probably important in the document. • If a word is frequently used in all documents, it is not important in any of them.
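One common way to combine the two intuitions is tf-idf weighting: multiply a word's count in a document by the logarithm of the inverse fraction of documents that contain it. The slides do not name a specific formula, so the choice of tf-idf, the function name `tf_idf`, and the toy documents below are assumptions.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists. Returns one {word: weight} dict per document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each word once per document
    weights = []
    for tokens in documents:
        term_freq = Counter(tokens)
        weights.append({word: count * math.log(n_docs / doc_freq[word])
                        for word, count in term_freq.items()})
    return weights


docs = [["the", "stock", "stock", "market"],
        ["the", "bank"],
        ["the", "trade", "bank"]]
weights = tf_idf(docs)
print(weights[0])
```

A word that occurs in every document gets weight zero, while a word repeated within one document and absent elsewhere gets the highest weight.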
Keyword Extraction • How do we find the keywords in a document? • Peter Turney’s Web demo.
Automatic Summarization • Many IR systems have a feature called summarization. • A summary of a document is typically a small number of sentences in the document.
What is a Sentence? • Always treating “.?!” as sentence boundary is correct about 92% of the time. • Abbreviations have . at the end • But abbreviations could end sentence too! • A good sentence boundary detector has over 99.8% accuracy.
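A minimal sketch of the baseline that always treats ".?!" as a sentence boundary, to show where it goes wrong. The function name `naive_split` and the example sentence are made up for illustration.

```python
import re


def naive_split(text):
    """Split after every '.', '?' or '!' that is followed by white space."""
    pieces = re.split(r"(?<=[.?!])\s+", text)
    return [piece.strip() for piece in pieces if piece.strip()]


print(naive_split("Dr. Smith arrived. He left at 5 p.m. today!"))
```

The two real sentences come out as four pieces, because the periods in "Dr." and "p.m." are treated as boundaries.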
Sentence Boundary Detection • For English, it is generally sufficient to look at a token that contains a potential sentence boundary ending mark (?.!) and the following token.
Declare Sentence Boundary If: • The first token ends with a double quote ("). • The first token begins with a lower case letter but is not one of • p.m. a.m. v. vs. v.s. i.e. cf. viz. e.g. p. pp. • The second token is a word that often appears at the sentence initial positions.
All of the following are true: • The first token is a corporate name designator such as Inc. or Ltd. • The second token does not begin with an open brace or parenthesis • The second token is either a title word, such as president, or Dr. or a word containing no periods, but not both; • The second token is not a country name.
None of the following is true: • The first token is an abbreviation; • The first token is a capitalized word enclosed by a pair of parentheses; • The first token is one of the title words; • The second token is a single letter initial or a number.
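A simplified sketch of this style of detector, looking only at a candidate token and the token that follows it. It covers just the double-quote rule and part of the "none of the following" group (abbreviation, title word, single-letter initial or number). The abbreviation list is copied from the rules above; the title list, the function names, and the choice to test candidates on white-space tokens are assumptions, and the remaining rules are left out.

```python
import re

# Abbreviations listed in the rules above.
ABBREVIATIONS = {"p.m.", "a.m.", "v.", "vs.", "v.s.", "i.e.", "cf.", "viz.", "e.g.", "p.", "pp."}
# Hypothetical sample of title words.
TITLE_WORDS = {"Mr.", "Mrs.", "Ms.", "Dr.", "Prof."}


def is_boundary(first, second):
    """Decide whether a sentence ends between two adjacent white-space tokens."""
    core = first.rstrip('"')
    if not core.endswith((".", "?", "!")):
        return False  # not a candidate: no ending mark
    if first.endswith('"'):
        return True   # ending mark followed by a closing double quote
    if first in ABBREVIATIONS or first in TITLE_WORDS:
        return False
    if re.fullmatch(r"[A-Z]\.", second) or second[0].isdigit():
        return False  # next token is a single-letter initial or a number
    return True


def split_sentences(text):
    """Group white-space tokens into sentences using is_boundary."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        following = tokens[i + 1] if i + 1 < len(tokens) else None
        if following is None or is_boundary(token, following):
            sentences.append(" ".join(current))
            current = []
    return sentences


print(split_sentences("Dr. Smith arrived at 5 p.m. today. He paid $22.50 for it."))
```

This sketch never ends a sentence on a listed abbreviation, so it will miss the cases where an abbreviation really is sentence-final.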
Concordance • KWIC: Key Word In Context • Displays word occurrences and their contexts • Aligns all occurrences of a word to make the left and right contexts more visible. • A concordance is an important tool when building lexicons (dictionaries). • Example • Cobuild web site.
Concordance Example
     I suspected that, aside from the sheer amount of time Deirdre spent alone
  incapacitating ones, because of the sheer amount of role change required. [p] The
     end was not a release - it was a sheer, blissful deliverance. I tumbled
     modifications of slavery itself. Sheer brute force was sufficient to get
 in total darkness but the top of the sheer cliff on the west side was tinged
cultures of the two continents? Is it sheer coincidence that the poorer parts
 BONE coat. No other can match it for sheer comfort & downright toughness.
reality, not in dreams.” For Day, the sheer concreteness of Thrse's teachings
  months ago. Sometimes, he says, the sheer contrast in living standards makes
        For all we know, God may take sheer delight in being probed
  [h] Striped Semi-Sheer [/h] [p] Not sheer enough to see through…but
    age because it--it was almost all sheer entertainment. And a lovely--
       from leaping overboard through sheer exuberance -- and probably where
  the baby bunting in &hellip [p] The sheer familiarity of that ancient nursery
  trackless scrub and finally stop in sheer grass and sage surrounded by bush.
 filthy rainwater in the gutter. Or on sheer hope and courage, on days when even
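A KWIC display like this one can be sketched in a few lines: find each occurrence of the keyword, take a fixed number of tokens on either side, and right-align the left context so the keyword forms a column. The function names `kwic` and `format_kwic`, the context width, and the toy sentence are assumptions for illustration.

```python
def kwic(tokens, keyword, width=4):
    """Return (left context, keyword, right context) for each occurrence."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


def format_kwic(lines):
    """Right-align the left contexts so every keyword starts in the same column."""
    pad = max((len(left) for left, _, _ in lines), default=0)
    return "\n".join(f"{left:>{pad}} {keyword} {right}" for left, keyword, right in lines)


toy_tokens = "it was sheer luck and sheer hard work".split()
hits = kwic(toy_tokens, "sheer", width=2)
print(format_kwic(hits))
```

Sorting the hits by the first word of the right context would give the alphabetical ordering seen in the example.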