160 likes | 275 Views
outline. goals, current situation, project description requirements implementation. the initiative goals standardisation flexibility. the initiative. goal: diachronic corpus of German, Old High German (800) to Modern German ( 1900) for linguistic, philological and historic research
E N D
outline • goals, current situation, project description • requirements • implementation
the initiative goals standardisation flexibility
the initiative • goal: diachronic corpus of German, Old High German (800) to Modern German (1900) for linguistic, philological and historic research • current situation: a lot of digitized texts, but • different (mostly implicit) quality standards (source, diplomaticity) • different formats (WordPerfect, WordCruncher, XML, ...) • different header structures (if any) • different positional or structural annotation (if any) • unequal coverage and different corpous composition for the language stages • availability sometimes problematic, no common search tools
the initiative • linguists, philologists, corpus linguists, computer scientists from 15 German universities, international cooperation • 5 language groups + architecture group • pilot project for corpus architecture at Humboldt-Universität, Berlin • planned duration 7 years, size after 7 years • core corpus: 40 M words • extension corpus: 60 M words • current situation: funding declined by the German Science Foundation (DFG), we are looking for other financing options
requirements - standardisation • standardisation • common quality standard(s) • source: original (preferred) or edited text • diplomaticity • common header structure – compatible with TEI/Menota • dialect • text type/genre • paleography/codicology • common structural annotation • graphic • logical • conflicting hierarchies
requirements - standardisation • common positional annotation • levels • tagsets • lemmatisation • within language group - normalisation • across language groups – hyperlemma • multi-linguality
requirements - flexibility • different texts may have different annotation layers • every text (extension corpus): header information, minimal structural annotation • core corpus: additionally lemmatisation, pos-tags • presentation corpus: aligned facsimiles, sound files • multi-modality • in addition: texts may have more annotation layers (syntax, information structure, narratological information, paleographical information, ...) – the tagsets and guidelines for each layer are standardised • texts and annotation layers can be added at any time
requirements – character-wise addressing • token cannot be the graphemic word because of • difference between graphical word and lexeme • paleographic annotation • word-formation information • (Dipper et al. 2004, Lüdeling, Poschenrieder & Faulstich 2005)
Key requirements (again) • multilinguality • multimodality • open set of annotation layers • varying annotation depth • search
architecture • web-based client-server architecture • corpus is stored in a relational database • additionally tools for annotation, search, presentation etc. (based on XPath, Vitt 2005, Faulstich, Leser & Vitt 2005) • extended ODAG model (Carletta et al.2003, Dipper et al. 2004, Faulstich, Leser & Lüdeling 2005; Faulstich & Leser 2005) • why relational database? • conflicting hierarchies and partial annotation – tabular model and pure XML tree model not possible • complex search on several text versions and their annotations • complex search on syntactic structures (graphs) and alignments (graphs over spans)
import & export • import of annotated files (XML) into the database • export from the database into XML for presentation • export from the database into XML for external annotation tools • Vitt (2004); Faulstich, Leser & Lüdeling (2005); Vitt (2005); Faulstich, Leser & Vitt (2006)