Issues: Large Corpora
• What are large corpora?
  • anything that is too large for manual or semi-automatic compilation and annotation
  • from ca. 10 million words up to billions
• The challenges of large corpora
  • acquisition, cleanup & duplicate detection
  • linguistic annotation
  • classification and clustering
  • retrieval (corpus queries, frequency data)
Acquisition
• How to obtain texts
  • OCR on scanned page images
  • online archives (structured and unstructured)
  • Web as corpus, online periodicals
• Cleanup: error correction & normalisation
  • OCR errors, typographical errors
  • historic & other variant spellings
  • annotate the normalisation, but keep the original form! (sketch below)
• Fine-grained language identification (incl. quotations)
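A minimal sketch of "annotate the normalisation, but keep the original form", assuming a Python pipeline; the VARIANTS table and the token dictionaries are illustrative assumptions, not part of the slides, and a real project would use much larger normalisation resources.

```python
# Minimal sketch: normalise variant spellings as an annotation layer,
# never overwriting the original surface form.

VARIANTS = {            # hypothetical variant -> normalised spelling table
    "shew": "show",
    "connexion": "connection",
    "to-day": "today",
}

def normalise(tokens):
    """Return tokens annotated with a normalised form; the original is kept."""
    annotated = []
    for tok in tokens:
        norm = VARIANTS.get(tok.lower(), tok)
        annotated.append({"orig": tok, "norm": norm})  # annotate, do not replace
    return annotated

if __name__ == "__main__":
    print(normalise("I shall shew the connexion to-day".split()))
```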
Duplicates
• Duplicates are a ubiquitous problem
  • not just on the Web (newswires, editions)
  • quotations = small units of duplicated text
• Identification of duplicates
  • paragraph level or finer
  • near-duplicates (how can these be defined? see sketch below)
• Marking up duplicates
  • duplicates should not be deleted, but annotated
  • adds a complex link structure to the document collection
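One common way to make "near-duplicate" concrete is overlap of word n-gram shingles. The sketch below uses 5-word shingles and a Jaccard threshold of 0.8; both values are illustrative assumptions rather than figures from the slides, and for large collections the pairwise loop would be replaced by something like MinHash/LSH.

```python
# Minimal sketch: paragraph-level near-duplicate detection via word shingles.

def shingles(text, n=5):
    """Set of word n-grams for one paragraph."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def find_near_duplicates(paragraphs, threshold=0.8):
    """Yield index pairs of paragraphs whose shingle overlap exceeds the threshold."""
    sets = [shingles(p) for p in paragraphs]
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if jaccard(sets[i], sets[j]) >= threshold:
                yield (i, j)   # annotate the link; neither copy is deleted
```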
Annotation
• Desirable annotation levels
  • POS tagging, lemmatization, named entities
  • shallow parsing, anaphora and NE resolution
• Key problems
  • ensure acceptable quality of automatic processing on "nonstandard" text (no training data available)
  • prerequisites: normalisation and language identification
• Evaluation is important
  • targeted evaluation: queries, frequencies of unknown words (sketch below)
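A minimal sketch of one possible targeted evaluation: rank the tokens that a lexicon does not cover by corpus frequency, so the most damaging gaps are inspected first. The `lexicon` and `tokens` inputs are hypothetical; the slide does not prescribe this exact procedure.

```python
# Minimal sketch: frequency-ranked report of words unknown to the lexicon.

from collections import Counter

def unknown_word_report(tokens, lexicon, top=20):
    """Return the most frequent alphabetic tokens not covered by the lexicon."""
    counts = Counter(t.lower() for t in tokens if t.isalpha())
    unknown = {w: c for w, c in counts.items() if w not in lexicon}
    return sorted(unknown.items(), key=lambda kv: -kv[1])[:top]
```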
Classification & Clustering • Metadata • archives usually have rich metadata • Web as corpus: usually no or little metadata • even metadata from archives will often have to be validated or refined (e.g. anthology of poetry) • Machine learning approach • classification (predefined categories, training data) • clustering (unsupervised, suggests categories) • critical evaluation is of utmost importance
Retrieval
• What do we want to find?
  • words, phrases, regular expressions
  • part-of-speech patterns, metadata constraints
  • semi-structured data (shallow parsing)
• When is it sufficient to use Google or LexisNexis?
• Building our own query engine: approaches
  • SQL database: frequency information is central (sketch below)
  • CQP and friends (+ MySQL for frequency data)
  • scalable backbone (Lucene), linguistic extensions
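A minimal sketch of the "frequency information is central" idea, using an in-memory SQLite table rather than the CQP/MySQL or Lucene setups mentioned on the slide; the schema and sample rows are illustrative assumptions.

```python
# Minimal sketch: a frequency table as the core of a corpus query backend.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE freq (word TEXT, pos TEXT, n INTEGER)")
con.executemany(
    "INSERT INTO freq VALUES (?, ?, ?)",
    [("corpus", "NN", 120), ("corpora", "NNS", 85), ("corpus", "JJ", 2)],
)

# Query: total frequency per word form, most frequent first.
for word, total in con.execute(
        "SELECT word, SUM(n) FROM freq GROUP BY word ORDER BY SUM(n) DESC"):
    print(word, total)
```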