TEI for language resources: a missed chance or a coming opportunity? Tomaž Erjavec, Dept. of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia. Overview: Some history; Why TEI isn’t used for LRs (as much as expected); MULTEXT-East and other case studies; Conclusions.
TEI for language resources: a missed chance or a coming opportunity? Tomaž Erjavec, Dept. of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia
TEI for Language Resources Overview • Some history • Why TEI isn’t used for LRs (as much as expected) • MULTEXT-East and other case studies • Conclusions
TEI for Language Resources History At its inception TEI was meant to cover CL/NLP LRs, esp. corpora: • ACL was one of the supporting associations • modules for corpora, linguistic analysis, feature-structures, graphs • BNC in TEI • At the time CL/NLP did not use SGML: a clear playing field
TEI for Language Resources The age of XML and LRs The release of XML (more or less) coincides with the beginning of the era of language resources: 1998: XML 1.0, first LREC conference. But the LRs that were developed (mostly) did not use TEI. Why?
TEI for Language Resources Reason 1: (X)CES • EAGLES Corpus Encoding Standard • „constraining or simplifying the TEI specifications in order to ensure interoperability“ (Ide 1998) • So, more compact and easier to apply than TEI • Almost TEI, but not quite • No methods for extension
TEI for Language Resources Reason 2: Comp Sci attitude • I don’t care about the data format, I want to develop algorithms! (... I even hate XML...) • If I use XML I will roll my own schema, optimal for my experiments / application (...that’s what ‚X‘ means...) • I won’t spend weeks (months, years) just getting to know TEI (...I need only 4 different elements anyway...)
TEI for Language Resources Reason 3: General gripes • Missing modules for syntactic analysis & lexical databases • Not prescriptive / precise enough • Too general elements • Too book oriented
TEI for Language Resources Result • Project-local proposals: • TIGER treebank format • Concede lexical database format • GENIA NER format • ... • Semantic Web: DC, RDF, OWL • ISO TC 37 SC4: • LMF, isoCat, • LAF, MAF, SynAF, ...
TEI for Language Resources MyTEI • MULTEXT-East: multilingual corpora and lexica • Fida(PLUS): Slovene Reference Corpus • IJS-ELAN, SVEZ-IJS: en-sl parallel corpora • jaSlo: Japanese-Slovene L2 dictionary • eZISS: Scholarly Digital Editions of Slovene Literature • JRC-ACQUIS: Parallel corpus of EC laws • SDT: Slovene Dependency Treebank • SBL: Slovene Biographic Lexicon • AHLib: DL/corpus of 19th century Slovene books • JOS: Slovene gold-standard corpus for HLT • MULTEXT-East...
TEI for Language Resources MULTEXT-East • EU project 1995-97: MULTEXT sequel • Development of standardised language resources for Central and Eastern European languages + English hub • Corpora, lexica, morphosyn. specifications • V1: 1998, 7 languages, LaTeX + CES/SGML • V4: 2010, 16 languages, TEI P5 • http://nl.ijs.si/ME/
TEI for Language Resources MULTEXT-East Version 4 by language and resource type
TEI for Language Resources Why TEI for MTE? • Because I like TEI • Varied resources: • Metadata / Documentation • „Document“ corpus: rich annotation structure • Linguistically annotated „1984“ corpus • Sentence alignments: stand-off markup • Morphosyntactic specifications: book-like • Either choose several (moving target) schemas or use TEI.
TEI for Language Resources Documentation
TEI for Language Resources TEI Header • v4 • v3 • v2 • v1 • eci • ota • soas
TEI for Language Resources Annotated 1984
<text xml:id="Osl." xml:lang="sl">
  <body>
    <div type="part" xml:id="Osl.1">
      <div type="chapter" xml:id="Osl.1.2">
        <p xml:id="Osl.1.2.2">
          <s xml:id="Osl.1.2.2.1">
            <w xml:id="Osl.1.2.2.1.1" lemma="biti" ana="#Va-p-sm">Bil</w>
            <w xml:id="Osl.1.2.2.1.2" lemma="biti" ana="#Va-r3s-n">je</w>
            <w xml:id="Osl.1.2.2.1.3" lemma="jasen" ana="#Agpmsnn">jasen</w>
            <c xml:id="Osl.1.2.2.1.4">,</c> ← sorry!
            <w xml:id="Osl.1.2.2.1.5" lemma="mrzel" ana="#Agpmsnn">mrzel</w>
            <w xml:id="Osl.1.2.2.1.6" lemma="aprilski" ana="#Agpmsny">aprilski</w>
            <w xml:id="Osl.1.2.2.1.7" lemma="dan" ana="#Ncmsn">dan</w>
            <w xml:id="Osl.1.2.2.1.8" lemma="in" ana="#Cc">in</w>
            <w xml:id="Osl.1.2.2.1.9" lemma="ura" ana="#Ncfpn">ure</w>
            <w xml:id="Osl.1.2.2.1.10" lemma="biti" ana="#Va-r3p-n">so</w>
            <w xml:id="Osl.1.2.2.1.11" lemma="biti" ana="#Va-p-pf">bile</w>
            <w xml:id="Osl.1.2.2.1.12" lemma="trinajst" ana="#Mlc-pa">trinajst</w>
            <c xml:id="Osl.1.2.2.1.13">.</c>
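A minimal sketch of how such token-level annotation might be read back, using only Python's standard library. The sample is a shortened copy of the fragment above with closing tags added so that it parses; like the slide, it carries no TEI namespace declaration, which a real corpus file would have. The function name `tokens` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is predeclared, so xml:id arrives under this namespace.
XML_NS = "{http://www.w3.org/XML/1998/namespace}"

# Shortened excerpt of the slide's fragment; closing tags added (assumption).
SAMPLE = """
<text xml:id="Osl." xml:lang="sl">
  <body>
    <div type="part" xml:id="Osl.1">
      <div type="chapter" xml:id="Osl.1.2">
        <p xml:id="Osl.1.2.2">
          <s xml:id="Osl.1.2.2.1">
            <w xml:id="Osl.1.2.2.1.1" lemma="biti" ana="#Va-p-sm">Bil</w>
            <w xml:id="Osl.1.2.2.1.2" lemma="biti" ana="#Va-r3s-n">je</w>
            <w xml:id="Osl.1.2.2.1.3" lemma="jasen" ana="#Agpmsnn">jasen</w>
            <c xml:id="Osl.1.2.2.1.4">,</c>
          </s>
        </p>
      </div>
    </div>
  </body>
</text>
"""


def tokens(xml_text):
    """Return (id, form, lemma, msd) tuples for every <w> and <c> in each <s>."""
    root = ET.fromstring(xml_text)
    rows = []
    for sentence in root.iter("s"):
        for tok in sentence:
            tok_id = tok.get(XML_NS + "id")
            if tok.tag == "w":
                # ana points at an MSD definition; drop the leading '#'.
                rows.append((tok_id, tok.text, tok.get("lemma"),
                             tok.get("ana").lstrip("#")))
            elif tok.tag == "c":
                # Punctuation carries no lemma or MSD in this fragment.
                rows.append((tok_id, tok.text, None, None))
    return rows


if __name__ == "__main__":
    for row in tokens(SAMPLE):
        print(row)
```

Because every token carries its own `xml:id`, the tuples can later be joined against stand-off annotation such as sentence alignments.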
TEI for Language Resources Whitespace • A long time ago „1984“ lost its spaces • Whitespace is brittle but important: • Retokenisation • Reading • TEI <space> no good! • So <mte:space> </mte:space>, 24:1? • Sitting on the fence JOS solution: </S> • <mte:g/>?
TEI for Language Resources Sentence alignments In MTE V3:
<?xml version="1.0" encoding="us-ascii"?>
<!DOCTYPE cesAlign SYSTEM "xcesAlign.dtd">
<cesAlign version="4.1">
  <linkList id="Oruen">
    <linkGrp type="body" targType="s" domains="Oru Oen">
      <link xtargets="Oru.1.1.1.1 ; Oen.1.1.1.1"/>
      <link xtargets="Oru.1.1.16.6 Oru.1.1.16.7 ; Oen.1.1.15.6"/>
      <link xtargets="Oru.1.3.4.1 ; Oen.1.3.4.1 Oen.1.3.4.2"/>
      <link xtargets=" ; Oen.1.3.4.3"/>
TEI for Language Resources TEI P5 Alignments • The TEI way uses two-level indirection: first grouping, then alignment • Too complicated, esp. as 98% of alignments are 1-1 • Chose a fence-sitting one-level solution:
<linkGrp type="alignment" corresp="oana-mk.xml oana-sl.xml">
  <link n="1:1" targets="oana-mk.xml#Omk.1.1.1.1 oana-sl.xml#Osl.1.2.2.1"/>
  <link n="2:1" targets="oana-mk.xml#Omk.1.1.2.6 oana-mk.xml#Omk.1.1.2.7 oana-sl.xml#Osl.1.2.3.6"/>
  <link n="1:2" targets="oana-mk.xml#Omk.1.1.2.8 oana-sl.xml#Osl.1.2.3.7 oana-sl.xml#Osl.1.2.3.8"/>
  <!--link n="0:1" targets="oana-sl.xml#Osl.4.12.2"/-->
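The step from the V3 `xtargets` notation to the one-level P5 links can be sketched roughly as follows. This is an illustration only, not the conversion actually used for MULTEXT-East. The function name `convert_link` is hypothetical, and the file names `oana-ru.xml` and `oana-en.xml` are assumptions made by analogy with `oana-mk.xml` and `oana-sl.xml` above.

```python
def convert_link(xtargets, src_doc, tgt_doc):
    """Turn a V3 'src ids ; tgt ids' string into a P5-style (n, targets) pair."""
    left, right = xtargets.split(";")
    src_ids = left.split()   # an empty side yields an empty list
    tgt_ids = right.split()
    # n records the alignment type, e.g. "2:1" for two source sentences
    # aligned with one target sentence.
    n = f"{len(src_ids)}:{len(tgt_ids)}"
    # Each bare ID becomes a pointer into the document that holds it.
    targets = [f"{src_doc}#{i}" for i in src_ids]
    targets += [f"{tgt_doc}#{i}" for i in tgt_ids]
    return n, " ".join(targets)


if __name__ == "__main__":
    for xt in ["Oru.1.1.1.1 ; Oen.1.1.1.1",
               "Oru.1.1.16.6 Oru.1.1.16.7 ; Oen.1.1.15.6",
               " ; Oen.1.3.4.3"]:
        n, targets = convert_link(xt, "oana-ru.xml", "oana-en.xml")
        print(f'<link n="{n}" targets="{targets}"/>')
```

A 0:1 link, like the one commented out on the slide, comes out with targets in only one document, so whether to emit it at all is a separate decision.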
TEI for Language Resources Morphosyntactic specifications • Define categories (PoS) and their features • Map feature-structures to morphosyntactic descriptions (MSD tagsets) • Specify which languages have which features and tagsets • E.g. [Category=Adverb Type=general Degree=superlative] ≡ Rgs ∈ Tagsetsl • Complex morphology → complex specifications • MSD tagsets are grounded in lexicon and corpus
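The mapping between an MSD and its feature structure can be illustrated with a small positional decoder. This is a sketch under stated assumptions: the `SPEC` table is a hypothetical two-category subset holding only the values mentioned in these slides (Adverb `R` with general/superlative, Particle `Q` with negative/interrogative), and a hyphen is taken to mark an attribute that does not apply, as in tags like `Va-p-sm`.

```python
# Hypothetical subset of a specification: category code -> (category name,
# ordered list of (attribute name, {value code: value name})).
SPEC = {
    "R": ("Adverb", [("Type", {"g": "general"}),
                     ("Degree", {"s": "superlative"})]),
    "Q": ("Particle", [("Type", {"z": "negative", "q": "interrogative"})]),
}


def decode_msd(msd, spec=SPEC):
    """Expand a positional MSD string into a feature dictionary."""
    category, attributes = spec[msd[0]]      # position 0 is the category
    features = {"Category": category}
    # Position i (i >= 1) holds the value code of the i-th attribute.
    for (name, values), code in zip(attributes, msd[1:]):
        if code != "-":                      # '-' means: not applicable
            features[name] = values[code]
    return features


if __name__ == "__main__":
    print(decode_msd("Rgs"))
```

A code that is missing from the table raises a `KeyError`, which doubles as a crude check that a tag belongs to the tagset.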
TEI for Language Resources Example: common specifications
<table n="msd.cat" xml:lang="en" xml:id="msd.cat.Q">
  <head>Common specifications for Particle</head>
  <row role="type">
    <cell role="position">0</cell>
    <cell role="name">CATEGORY</cell>
    <cell role="value">Particle</cell>
    <cell role="code">Q</cell>
    <cell role="lang">ro</cell>
    <cell role="lang">sl</cell>
    ...
  </row>
  <row role="attribute">
    <cell role="position">1</cell>
    <cell role="name">Type</cell>
    <cell>
      <table>
        <row role="value">
          <cell role="name">negative</cell>
          <cell role="code">z</cell>
          <cell role="lang">ro</cell>
        </row>
        <row role="value">
          <cell role="name">interrogative</cell>
          <cell role="code">q</cell>
          <cell role="lang">bg</cell>
          <cell role="lang">hr</cell>....
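Because the specification tables are regular, they can be read mechanically. The sketch below parses a closed-off copy of the Particle table into a plain dictionary using Python's standard library. The closing tags are added here, the language lists contain only the codes visible on the slide (the original elides the rest), and `read_spec` and `cell_text` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Closed-off copy of the slide's table; language lists are incomplete.
SAMPLE = """
<table n="msd.cat" xml:lang="en" xml:id="msd.cat.Q">
  <head>Common specifications for Particle</head>
  <row role="type">
    <cell role="position">0</cell>
    <cell role="name">CATEGORY</cell>
    <cell role="value">Particle</cell>
    <cell role="code">Q</cell>
    <cell role="lang">ro</cell>
    <cell role="lang">sl</cell>
  </row>
  <row role="attribute">
    <cell role="position">1</cell>
    <cell role="name">Type</cell>
    <cell>
      <table>
        <row role="value">
          <cell role="name">negative</cell>
          <cell role="code">z</cell>
          <cell role="lang">ro</cell>
        </row>
        <row role="value">
          <cell role="name">interrogative</cell>
          <cell role="code">q</cell>
          <cell role="lang">bg</cell>
          <cell role="lang">hr</cell>
        </row>
      </table>
    </cell>
  </row>
</table>
"""


def cell_text(row, role):
    """Text of the first direct <cell> child with the given role, or None."""
    cell = row.find(f"cell[@role='{role}']")
    return cell.text if cell is not None else None


def read_spec(xml_text):
    """Collect category, code and attribute values from one msd.cat table."""
    table = ET.fromstring(xml_text)
    spec = {"attributes": []}
    for row in table.findall("row"):          # top-level rows only
        if row.get("role") == "type":
            spec["category"] = cell_text(row, "value")
            spec["code"] = cell_text(row, "code")
        elif row.get("role") == "attribute":
            values = {}
            for v in row.findall("cell/table/row[@role='value']"):
                values[cell_text(v, "code")] = {
                    "name": cell_text(v, "name"),
                    "langs": [c.text for c in v.findall("cell[@role='lang']")],
                }
            spec["attributes"].append({
                "position": int(cell_text(row, "position")),
                "name": cell_text(row, "name"),
                "values": values,
            })
    return spec


if __name__ == "__main__":
    print(read_spec(SAMPLE))
```

The per-value language lists are what make it possible to derive a tagset for a single language from the common tables.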
TEI for Language Resources Language particular specifications
<div type="section" select="sl" xml:id="msd.Q-sl">
  <head>Slovene Particle</head>
  <table n="msd.cat" select="sl" xml:id="msd.cat.Q-sl">
    <head>Slovene Specification for Particle</head>
    <row role="type">
      <cell role="position">0</cell>
      <cell role="name" xml:lang="sl">besedna_vrsta</cell>
      <cell role="value" xml:lang="sl">členek</cell>
      <cell role="code" xml:lang="sl">L</cell>
      <cell role="name" xml:lang="en">CATEGORY</cell>
      <cell role="value" xml:lang="en">Particle</cell>
      <cell role="code" xml:lang="en">Q</cell>
    </row>
  </table>
  <p xml:lang="sl">Opombe:
    <list>
      <item>kot členki so označene le pojavnice, ki so navedene v leksikonu</item>
    </list>
  </p>
  <divGen xml:id="msd.Q-sl.lexicon" type="msd.lex" select="sl"/>
</div>
MTEsl = JOS
TEI for Language Resources Encoding • TEI provides needed elements, also for commentary, bibliography, ... • TEI XSLT used to render as HTML • Tables retained from MULTEXT • Several XSLT scripts for MSD conversions, e.g. to collating sequence, to fvLib and fsLib • Interesting challenge: conversion to isoCat (Adam P. for Polish tagset), OWL
TEI for Language Resources MTE specifications in OWL (by Christian Chiarcos)
TEI for Language Resources Morals, 1 • TEI good for in-place markup of richly annotated resources with varied structure: • Readable • Updatable (validation) • Not good for huge datasets with shallow annotation: • Processable • Read only → useful for (small, medium size) gold standard hand-corrected language resources / „new“ languages → localisation
TEI for Language Resources IMPACT @ JSI • EU IP „Improving Access to Text“ • Make better OCR and IR for historical texts • JSI: Developing a lemmatisation (+ modernisation) module for XIX century Slovene • Background: Lexicon, Tagging and Lemmatisation for modern Slovene + FSA rewrite patterns • Current dataset: AHLib (~100 books) • AHLib marked up in TEI
TEI for Language Resources AHLib Digital Library
TEI for Language Resources IMPACT Lexicon
TEI for Language Resources Mark-up challenges • Text-critical apparatus vs. linguistic annotation • „Parallel“ corpora of transcriptions and modernisations • Layered linguistic annotations: tokenisation, tagsets • Lexicon (+dictionary) encoding
TEI for Language Resources Morals, 2 • Text-critical editions use TEI anyway • Ditto for DLs of historical texts • HLT increasingly applied also to such texts • TEI provides a good basis to join the two views
TEI for Language Resources Current EU Projects: FlareNet • Fostering Language Resources Network (2008-11) • WG4 - Harmonisation of Formats and Standards • D4.1 Identification of problems in the use of LR standards and of standardisation needs (M12): • „For academic purposes the TEI Guidelines (current version P5) has been a well established and widely used resource of LR‐specific standards mainly for corpus analysis, markup and annotation. But TEI is hardly known in industrial communities (with a few exceptions) and completely foreign to professional groups such as localizers and translators. We see great potential in using TEI Guidelines in industrial contexts.“ /underlined by T.E./ • D4.2 Proposal of a European Language Resource Standards Framework (M24 / 2010-09-01)
TEI for Language Resources Research Infrastructures for the Humanities • DG Research funded RIs; pilot phase, 2008-2010 • DARIAH ask Lou... • EU RI CLARIN: Common Language Resources and Technology Infrastructure • WP5 Language Resources and Technologies Overview • D5C-3: Interoperability & Standards: „Due to the versatile nature of TEI, most of the following chapters include details on encoding digital text by following the P5 guidelines and conversion methods.“
TEI for Language Resources Morals, 3 • TEI is firmly acknowledged in current work on LR encoding standardisation • But it is not prescriptive enough and lacks modules for many types of LRs → Need for constrained solutions & linkages to ISO/W3C standards: • Cross-walks • Roma & Schema „namespace“ catalogue to DC, LMF, MAF, ...
TEI for Language Resources TEI for LRs: SWOT • Strengths: Universality, Maturity, Community, Extensibility (compare ISO) • Weaknesses: Vagueness, Learning curve, ISO/W3C linkage • Opportunities: HLT (Humanities Language Technologies), New languages • Threats: Marginalisation, Technical obsolescence
TEI for Language Resources Conclusions • Frontiers: DL+HLT, Gold standard LRs • Priority: Instantiated connections to other standards and languages • Connection with linguistics? SIG will tell...