Issues: Large Corpora
• What are large corpora?
  • anything that is too large for manual or semi-automatic compilation and annotation
  • from ca. 10 million words up to billions
• The challenges of large corpora
  • acquisition, cleanup & duplicate detection
  • linguistic annotation
  • classification and clustering
  • retrieval (corpus queries, frequency data)
Acquisition
• How to obtain texts
  • OCR on scanned page images
  • online archives (structured and unstructured)
  • Web as corpus, online periodicals
• Cleanup: error correction & normalisation
  • OCR errors, typographical errors
  • historic & other variant spellings
  • annotate the normalisation, but keep the original form! (sketch below)
• Fine-grained language identification (incl. quotations)
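A minimal sketch of "annotate the normalisation, but keep the original form", assuming a Python pipeline; the VARIANTS table and the token dictionaries are illustrative assumptions, not part of the slides, and a real project would use much larger normalisation resources.

```python
# Minimal sketch: normalise variant spellings as an annotation layer,
# never overwriting the original surface form.

VARIANTS = {            # hypothetical variant -> normalised spelling table
    "shew": "show",
    "connexion": "connection",
    "to-day": "today",
}

def normalise(tokens):
    """Return tokens annotated with a normalised form; the original is kept."""
    annotated = []
    for tok in tokens:
        norm = VARIANTS.get(tok.lower(), tok)
        annotated.append({"orig": tok, "norm": norm})  # annotate, do not replace
    return annotated

if __name__ == "__main__":
    print(normalise("I shall shew the connexion to-day".split()))
```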
Duplicates
• Duplicates are a ubiquitous problem
  • not just on the Web (newswires, editions)
  • quotations = small units of duplicated text
• Identification of duplicates
  • paragraph level or finer
  • near-duplicates (how can these be defined? see sketch below)
• Marking up duplicates
  • duplicates should not be deleted, but annotated
  • adds a complex link structure to the document collection
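One common way to make "near-duplicate" concrete is overlap of word n-gram shingles. The sketch below uses 5-word shingles and a Jaccard threshold of 0.8; both values are illustrative assumptions rather than figures from the slides, and for large collections the pairwise loop would be replaced by something like MinHash/LSH.

```python
# Minimal sketch: paragraph-level near-duplicate detection via word shingles.

def shingles(text, n=5):
    """Set of word n-grams for one paragraph."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def find_near_duplicates(paragraphs, threshold=0.8):
    """Yield index pairs of paragraphs whose shingle overlap exceeds the threshold."""
    sets = [shingles(p) for p in paragraphs]
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if jaccard(sets[i], sets[j]) >= threshold:
                yield (i, j)   # annotate the link; neither copy is deleted
```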
Annotation
• Desirable annotation levels
  • POS tagging, lemmatization, named entities
  • shallow parsing, anaphora and NE resolution
• Key problems
  • ensure acceptable quality of automatic processing on "nonstandard" text (no training data available)
  • prerequisites: normalisation and language identification
• Evaluation is important
  • targeted evaluation: queries, frequencies of unknown words (sketch below)
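A minimal sketch of one possible targeted evaluation: rank the tokens that a lexicon does not cover by corpus frequency, so the most damaging gaps are inspected first. The `lexicon` and `tokens` inputs are hypothetical; the slide does not prescribe this exact procedure.

```python
# Minimal sketch: frequency-ranked report of words unknown to the lexicon.

from collections import Counter

def unknown_word_report(tokens, lexicon, top=20):
    """Return the most frequent alphabetic tokens not covered by the lexicon."""
    counts = Counter(t.lower() for t in tokens if t.isalpha())
    unknown = {w: c for w, c in counts.items() if w not in lexicon}
    return sorted(unknown.items(), key=lambda kv: -kv[1])[:top]
```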
Classification & Clustering • Metadata • archives usually have rich metadata • Web as corpus: usually no or little metadata • even metadata from archives will often have to be validated or refined (e.g. anthology of poetry) • Machine learning approach • classification (predefined categories, training data) • clustering (unsupervised, suggests categories) • critical evaluation is of utmost importance
Retrieval
• What do we want to find?
  • words, phrases, regular expressions
  • part-of-speech patterns, metadata constraints
  • semi-structured data (shallow parsing)
• When is it sufficient to use Google or LexisNexis?
• Building our own query engine: approaches
  • SQL database: frequency information is central (sketch below)
  • CQP and friends (+ MySQL for frequency data)
  • scalable backbone (Lucene), linguistic extensions
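A minimal sketch of the "frequency information is central" idea, using an in-memory SQLite table rather than the CQP/MySQL or Lucene setups mentioned on the slide; the schema and sample rows are illustrative assumptions.

```python
# Minimal sketch: a frequency table as the core of a corpus query backend.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE freq (word TEXT, pos TEXT, n INTEGER)")
con.executemany(
    "INSERT INTO freq VALUES (?, ?, ?)",
    [("corpus", "NN", 120), ("corpora", "NNS", 85), ("corpus", "JJ", 2)],
)

# Query: total frequency per word form, most frequent first.
for word, total in con.execute(
        "SELECT word, SUM(n) FROM freq GROUP BY word ORDER BY SUM(n) DESC"):
    print(word, total)
```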