outline
• goals, current situation, project description
• requirements
• implementation

the initiative
• goals
• standardisation
• flexibility

the initiative
• goal: a diachronic corpus of German, from Old High German (800) to Modern German (1900), for linguistic, philological and historical research
• current situation: a lot of digitized texts, but
  • different (mostly implicit) quality standards (source, diplomaticity)
  • different formats (WordPerfect, WordCruncher, XML, ...)
  • different header structures (if any)
  • different positional or structural annotation (if any)
  • unequal coverage and different corpus composition for the language stages
  • availability sometimes problematic, no common search tools

the initiative
• linguists, philologists, corpus linguists and computer scientists from 15 German universities, with international cooperation
• 5 language groups + architecture group
• pilot project for corpus architecture at Humboldt-Universität, Berlin
• planned duration: 7 years; size after 7 years:
  • core corpus: 40 M words
  • extension corpus: 60 M words
• current situation: funding declined by the German Research Foundation (DFG); we are looking for other financing options

requirements - standardisation
• standardisation
  • common quality standard(s)
    • source: original (preferred) or edited text
    • diplomaticity
  • common header structure, compatible with TEI/Menota
    • dialect
    • text type/genre
    • paleography/codicology
  • common structural annotation
    • graphic
    • logical
    • conflicting hierarchies

requirements - standardisation
• common positional annotation
  • levels
  • tagsets
  • lemmatisation
    • within language group: normalisation
    • across language groups: hyperlemma
• multi-linguality
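
The hyperlemma idea can be illustrated with a small lookup. This is only a sketch: the stage labels (`OHG`, `MHG`, `NHG`), the table `HYPERLEMMA` and the function `cross_stage_lemmas` are hypothetical names chosen for illustration, and the single entry is an example, not corpus data.

```python
# Hypothetical mapping: (language stage, stage-internal lemma) -> hyperlemma.
# Within a language group, spellings are normalised to one lemma;
# across language groups, lemmas are linked through a shared hyperlemma.
HYPERLEMMA = {
    ("OHG", "hūs"): "Haus",   # Old High German
    ("MHG", "hûs"): "Haus",   # Middle High German
    ("NHG", "Haus"): "Haus",  # New High German
}


def cross_stage_lemmas(hyperlemma):
    """Return all (stage, lemma) pairs linked to the given hyperlemma."""
    return sorted(
        key for key, value in HYPERLEMMA.items() if value == hyperlemma
    )


if __name__ == "__main__":
    # A search for the hyperlemma reaches the lemma of every language stage.
    for stage, lemma in cross_stage_lemmas("Haus"):
        print(stage, lemma)
```

With such a link, one query on the hyperlemma can be expanded into stage-specific lemma queries.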

requirements - flexibility
• different texts may have different annotation layers
  • every text (extension corpus): header information, minimal structural annotation
  • core corpus: additionally lemmatisation, pos-tags
  • presentation corpus: aligned facsimiles, sound files
• multi-modality
• in addition: texts may have more annotation layers (syntax, information structure, narratological information, paleographical information, ...); the tagsets and guidelines for each layer are standardised
• texts and annotation layers can be added at any time
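
A minimal sketch of how the varying annotation depth could be checked, assuming each text simply carries a set of layer names. The layer names, `REQUIRED_LAYERS` and `missing_layers` are hypothetical; the presentation corpus is left out because the slide does not state its minimum layer set.

```python
# Hypothetical minimum layer sets per corpus tier, following the slide:
# every text needs a header and minimal structural annotation,
# core-corpus texts additionally need lemmatisation and pos-tags.
REQUIRED_LAYERS = {
    "extension": {"header", "structure"},
    "core": {"header", "structure", "lemma", "pos"},
}


def missing_layers(tier, layers):
    """Return the required layers that a text of the given tier lacks."""
    return REQUIRED_LAYERS[tier] - set(layers)


if __name__ == "__main__":
    # Extra layers (here: syntax) are always allowed; only the minimum is checked.
    print(missing_layers("extension", {"header", "structure", "syntax"}))
    print(missing_layers("core", {"header", "structure", "lemma"}))
```

Because only the minimum is checked, new layers can be attached to a text at any time without changing the check.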

requirements - character-wise addressing
• the token cannot be the graphemic word because of
  • the difference between graphical word and lexeme
  • paleographic annotation
  • word-formation information
• (Dipper et al. 2004; Lüdeling, Poschenrieder & Faulstich 2005)
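
Character-wise addressing can be sketched as standoff annotation over character offsets. The sample string, the layer names and the helpers `covered` and `spans_at` are invented for illustration (a scribe writing "in der" as one graphical word); they are not taken from the corpus or its tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    layer: str
    value: str


# Invented example: "in der" written together as one graphical word.
TEXT = "inder stat"

SPANS = [
    Span(0, 5, "graphic_word", "inder"),
    Span(0, 2, "lexeme", "in"),
    Span(2, 5, "lexeme", "der"),
    Span(6, 10, "graphic_word", "stat"),
    Span(6, 10, "lexeme", "stat"),
    Span(0, 1, "paleography", "initial"),  # annotation on a single character
]


def covered(text, span):
    """Return the stretch of text a span points at."""
    return text[span.start:span.end]


def spans_at(spans, offset):
    """Return every span, on any layer, that covers the given character."""
    return [s for s in spans if s.start <= offset < s.end]


if __name__ == "__main__":
    for span in spans_at(SPANS, 3):
        print(span.layer, covered(TEXT, span))
```

Since every layer points at characters rather than at words, a lexeme or a paleographic mark can cover less than one graphical word.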

Swerlenrecht kůnnen wil•d~volge

Implementation Concept

Key requirements (again)
• multilinguality
• multimodality
• open set of annotation layers
• varying annotation depth
• search

architecture
• web-based client-server architecture
• corpus is stored in a relational database
• additional tools for annotation, search, presentation etc. (based on XPath; Vitt 2005; Faulstich, Leser & Vitt 2005)
• extended ODAG model (Carletta et al. 2003; Dipper et al. 2004; Faulstich, Leser & Lüdeling 2005; Faulstich & Leser 2005)
• why a relational database?
  • conflicting hierarchies and partial annotation: neither a tabular model nor a pure XML tree model is possible
  • complex search on several text versions and their annotations
  • complex search on syntactic structures (graphs) and alignments (graphs over spans)
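
The point about conflicting hierarchies can be sketched with a toy relational table. The schema, the offsets and the function `crossing_pairs` are assumptions made for illustration; this is not the project's database schema and it does not implement the ODAG model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE span (
        id        INTEGER PRIMARY KEY,
        layer     TEXT NOT NULL,     -- e.g. graphic or logical structure
        name      TEXT NOT NULL,
        start_pos INTEGER NOT NULL,  -- inclusive character offset
        end_pos   INTEGER NOT NULL   -- exclusive character offset
    )
    """
)
conn.executemany(
    "INSERT INTO span (layer, name, start_pos, end_pos) VALUES (?, ?, ?, ?)",
    [
        ("graphic", "line1", 0, 20),
        ("graphic", "line2", 20, 40),
        ("logical", "sentence1", 0, 27),  # runs across the line break
        ("logical", "sentence2", 27, 40),
    ],
)


def crossing_pairs(connection):
    """Find graphic/logical spans that overlap without nesting."""
    sql = """
        SELECT g.name, l.name
        FROM span AS g
        JOIN span AS l
          ON g.layer = 'graphic' AND l.layer = 'logical'
         AND g.start_pos < l.end_pos AND l.start_pos < g.end_pos
         AND NOT (g.start_pos <= l.start_pos AND l.end_pos <= g.end_pos)
         AND NOT (l.start_pos <= g.start_pos AND g.end_pos <= l.end_pos)
        ORDER BY g.start_pos, l.start_pos
    """
    return connection.execute(sql).fetchall()


if __name__ == "__main__":
    print(crossing_pairs(conn))
```

A pair returned here cannot be placed in one XML tree, because neither element can be the parent of the other; as rows in a table, both simply coexist.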

import & export
• import of annotated files (XML) into the database
• export from the database into XML for presentation
• export from the database into XML for external annotation tools
• Vitt (2004); Faulstich, Leser & Lüdeling (2005); Vitt (2005); Faulstich, Leser & Vitt (2006)
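
A minimal import/export round trip, sketched with the Python standard library. The XML layout (`text`, `tok`, `lemma`, `pos`), the table `token` and the functions `import_xml` and `export_xml` are hypothetical and far simpler than a real corpus format; the sketch assumes every token carries both attributes.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical input format: one <tok> element per token.
SAMPLE = (
    '<text id="t1">'
    '<tok lemma="in" pos="APPR">in</tok>'
    '<tok lemma="der" pos="ART">der</tok>'
    '<tok lemma="stat" pos="NN">stat</tok>'
    "</text>"
)


def make_db():
    """Create an in-memory database with a single token table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE token ("
        "text_id TEXT, position INTEGER, form TEXT, lemma TEXT, pos TEXT)"
    )
    return conn


def import_xml(conn, xml_string):
    """Store every <tok> of an annotated XML text as one table row."""
    root = ET.fromstring(xml_string)
    text_id = root.get("id")
    for position, tok in enumerate(root.iter("tok")):
        conn.execute(
            "INSERT INTO token VALUES (?, ?, ?, ?, ?)",
            (text_id, position, tok.text, tok.get("lemma"), tok.get("pos")),
        )
    return text_id


def export_xml(conn, text_id):
    """Rebuild the XML for one text from the table rows, in token order."""
    root = ET.Element("text", {"id": text_id})
    rows = conn.execute(
        "SELECT form, lemma, pos FROM token WHERE text_id = ? ORDER BY position",
        (text_id,),
    )
    for form, lemma, pos in rows:
        tok = ET.SubElement(root, "tok", {"lemma": lemma, "pos": pos})
        tok.text = form
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    db = make_db()
    print(export_xml(db, import_xml(db, SAMPLE)))
```

An export for presentation or for an external annotation tool would use the same table rows but write a different target format.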
