• Lemma: canonical (citation) form of a lexeme, which conventionally represents the set of related words
• Lexeme: the set of related words
• But….
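To make the lemma/lexeme distinction concrete, here is a minimal Python sketch. The lexicon `FORM_TO_LEMMA` and the helpers `build_lexemes` and `lemma_of` are hypothetical names invented for illustration, and the word forms are a hand-picked toy sample, not data from any of the corpora listed here. A real system would use a morphological analyzer or a full lexicon instead of a lookup table.

```python
from collections import defaultdict

# Hypothetical toy lexicon (an assumption for illustration only):
# each inflected word form maps to its lemma, the citation form.
FORM_TO_LEMMA = {
    "sing": "sing",
    "sings": "sing",
    "sang": "sing",
    "sung": "sing",
    "singing": "sing",
    "corpus": "corpus",
    "corpora": "corpus",
}


def build_lexemes(form_to_lemma):
    """Group word forms by lemma; each group is one lexeme."""
    lexemes = defaultdict(set)
    for form, lemma in form_to_lemma.items():
        lexemes[lemma].add(form)
    return dict(lexemes)


def lemma_of(word):
    """Return the lemma for a word form, or None if the form is unknown."""
    return FORM_TO_LEMMA.get(word.lower())


# The lexeme is the whole set of related forms; the lemma is the key
# that conventionally stands for that set.
LEXEMES = build_lexemes(FORM_TO_LEMMA)
```

Here `lemma_of("sang")` looks up the citation form of one word form, while `LEXEMES["sing"]` is the full set of forms that the lemma represents.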
Text Corpora:
• British National Corpus: 100M words
• Brown Corpus: 1M words
• Hansards: 750K words
• Wall Street Journal: 914K words
• AP newswire: 620+M words
• Penn Treebank: +1M words, bracketed syntactically, WSJ+

Speech Corpora:
• London-Lund Corpus: 1M words
• Call Home: lots
• ATIS (7812 words)
• Switchboard: 240h (+3M words)
• Broadcast News: lots
• TDT: ++1000h (Eng,Ara,Mand)
• Communicator: 62h (317k words)