Issues: Large Corpora

Presentation Transcript


  1. Issues: Large Corpora
  • What are large corpora?
    • anything that is too large for manual or semi-automatic compilation and annotation
    • from ca. 10 million words, up to billions
  • The challenges of large corpora
    • acquisition, cleanup & duplicate detection
    • linguistic annotation
    • classification and clustering
    • retrieval (corpus queries, frequency data)

  2. Acquisition
  • How to obtain texts
    • OCR on scanned page images
    • online archives (structured and unstructured)
    • Web as corpus, online periodicals
  • Cleanup: error-correction & normalisation
    • OCR errors, typographical errors
    • historic & other variant spellings
    • annotate normalisation, but keep original form!
  • Fine-grained language ident. (incl. quotations)
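
The "annotate normalisation, but keep original form" point can be sketched in a few lines of Python: variant or historic spellings are mapped to a normalised form, and the original token is kept alongside rather than overwritten. The variant table and helper name below are illustrative, not taken from the slides.

    # Toy variant table; a real one would come from a lexicon of historic
    # and regional spellings (hypothetical entries for illustration only).
    VARIANTS = {
        "shew": "show",
        "to-day": "today",
        "vnto": "unto",
    }

    def normalise(tokens):
        """Return (original, normalised) pairs so the original form is preserved."""
        return [(tok, VARIANTS.get(tok.lower(), tok)) for tok in tokens]

    for original, norm in normalise("I shew the letter vnto him to-day".split()):
        flag = "  <- normalised" if original.lower() != norm else ""
        print(f"{original:10s}{norm:10s}{flag}")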

  3. Duplicates
  • Duplicates are a ubiquitous problem
    • not just on the Web (newswires, editions)
    • quotations = small units of duplicated text
  • Identification of duplicates
    • paragraph level or finer
    • near-duplicates (how can these be defined?)
  • Marking up duplicates
    • duplicates should not be deleted, but annotated
    • adds complex link structure to document collection
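
One common way to operationalise "near-duplicates" at paragraph level is Jaccard similarity over word n-gram shingles; pairs above a threshold are linked (annotated) rather than deleted, in line with the slide. The shingle size and the 0.5 threshold below are arbitrary illustrative settings, and the code is a plain-Python sketch, not the method proposed in the talk.

    def shingles(text, n=3):
        """Set of word n-grams ("shingles") for a paragraph."""
        words = text.lower().split()
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

    def jaccard(a, b):
        return len(a & b) / len(a | b) if (a | b) else 0.0

    def link_near_duplicates(paragraphs, threshold=0.5):
        """Return (i, j, similarity) links; duplicates are annotated, not deleted."""
        sigs = [shingles(p) for p in paragraphs]
        links = []
        for i in range(len(sigs)):
            for j in range(i + 1, len(sigs)):
                sim = jaccard(sigs[i], sigs[j])
                if sim >= threshold:
                    links.append((i, j, round(sim, 2)))
        return links

    paras = [
        "The quick brown fox jumps over the lazy dog near the river bank.",
        "The quick brown fox jumps over the lazy dog near the river bend.",
        "A completely different paragraph about corpus acquisition and cleanup.",
    ]
    print(link_near_duplicates(paras))   # links paragraphs 0 and 1 only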

  4. Annotation
  • Desirable annotation levels
    • POS tagging, lemmatization, named entities
    • shallow parsing, anaphora and NE resolution
  • Key problems
    • ensure acceptable quality of automatic processing on “nonstandard” text (no training data available)
    • prerequisites: normalisation and language ident.
  • Evaluation is important
    • targeted evaluation: queries, freq. of unknown words
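
As a concrete example of the desirable annotation levels, the sketch below runs POS tagging, lemmatization and named-entity recognition with spaCy. It assumes spaCy and its en_core_web_sm model are installed; such off-the-shelf models are trained on standard modern text, so their output on “nonstandard” text needs exactly the targeted evaluation the slide asks for.

    import spacy

    # Assumes: pip install spacy && python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("Dr. Johnson compiled his dictionary in London in 1755.")

    for token in doc:            # POS tags and lemmas
        print(f"{token.text:12s}{token.pos_:8s}{token.lemma_}")

    for ent in doc.ents:         # named entities
        print(f"{ent.text:20s}{ent.label_}")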

  5. Classification & Clustering • Metadata • archives usually have rich metadata • Web as corpus: usually no or little metadata • even metadata from archives will often have to be validated or refined (e.g. anthology of poetry) • Machine learning approach • classification (predefined categories, training data) • clustering (unsupervised, suggests categories) • critical evaluation is of utmost importance
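
The two machine-learning approaches can be contrasted in a small scikit-learn sketch: classification needs predefined categories and labelled training data, while clustering is unsupervised and only suggests categories, which is why critical evaluation matters. The toy documents and labels below are invented for illustration.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.cluster import KMeans

    docs = [
        "sonnet ode stanza rhyme metre",             # poetry
        "verse lyric couplet rhyme poem",            # poetry
        "parliament election vote minister policy",  # politics
        "government budget tax vote reform",         # politics
    ]
    labels = ["poetry", "poetry", "politics", "politics"]  # predefined categories

    X = TfidfVectorizer().fit_transform(docs)

    # Classification: supervised, requires training data with known categories.
    clf = LogisticRegression().fit(X, labels)
    print(clf.predict(X[:1]))      # -> ['poetry']

    # Clustering: unsupervised, cluster ids only *suggest* categories.
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.labels_)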

  6. Retrieval
  • What do we want to find?
    • words, phrases, regular expressions
    • part-of-speech patterns, metadata constraints
    • semi-structured data (shallow parsing)
  • When is it sufficient to use Google or LexisNexis?
  • Building our own query engine: approaches
    • SQL database: frequency information is central
    • CQP and friends (+ MySQL for frequency data)
    • scalable backbone (Lucene), linguistic extensions
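
For the "SQL database: frequency information is central" option, a minimal sqlite3 sketch can show the idea: per-document token counts go into a table, and corpus-wide frequency lists fall out of a GROUP BY query. The schema and toy documents are hypothetical stand-ins; a real engine would be CQP, a MySQL-backed setup, or a Lucene-based backbone as listed above.

    import sqlite3
    from collections import Counter

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE freq (doc_id INTEGER, token TEXT, n INTEGER)")

    docs = {1: "the cat sat on the mat", 2: "the dog sat on the log"}
    for doc_id, text in docs.items():
        for token, n in Counter(text.split()).items():
            conn.execute("INSERT INTO freq VALUES (?, ?, ?)", (doc_id, token, n))

    # Corpus-wide frequency list, most frequent tokens first.
    for token, total in conn.execute(
            "SELECT token, SUM(n) FROM freq GROUP BY token ORDER BY SUM(n) DESC"):
        print(token, total)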
