Representing dictionaries with the TEI
Proposal for basic guidelines
Laurent Romary - Max Planck Digital Library
With the help of Susanne Alt - CNRS
Background
• The P5 edition of the TEI guidelines
  • XML
  • ODD - Roma
  • Modules and classes
  • DTD, RelaxNG, W3C schemas
• The dictionary chapter
  • Very close to the P4 version
  • Work to be done
    • Enhancing the coherence with the class system
    • Providing more examples
    • …
Proposal for today
• Browse through the main features of the dictionary chapter
  • Identify questionable issues
  • Select best practices
• Work with Roma and implement (part of) the best practices
  • Minimal schema that dictionary projects can start with
  • Bottom-up approach to customization
• Discuss conformance
Dictionaries as TEI documents
• Same general document structure as any other TEI document
  • <teiHeader>, <text>
    • Define a common strategy for source identification, shared with general text sources
    • Specific documentation of previous editions
    • Intuition that <teiCorpus> is not to be retained here
  • <front>, <body>, <back>
  • Divisions…
    • Strong case for unnumbered <div>s
    • Can we recommend/implement a basic dictionary-oriented typology?
Issues [see Wuerzburg.xml]
• Providing precise guidelines for
  • <publicationStmt>
    • Clarify the role and possible content of <publisher>
  • <sourceDesc>
    • Base the guidelines on <biblStruct> (<biblItem>?) and <listBibl>
Describing dictionary entries
• A variety of possible objects
  • <entry>, <entryFree>, <superEntry>, <dictScrap>
  • <hom>, <re>
• First issue: dealing with the editorial workflow
  • Keep <dictScrap> for ongoing tagging activity
    • Depends on the degree of structure of the dictionary
  • Stay consistent in the use of entry/entryFree/superEntry/hom
    • Strong feeling for limiting ourselves to <entry>
  • Point to the importance of <re>
    • Embedded entries
Finding the right granularity
• The core lexical unit: <entry>
  • Should be used coherently in a dictionary project to gather up homogeneous lexical objects
• Possible combination with:
  • <superEntry> to group sets of homographs
    • Should only be used to record such a feature when it exists in legacy data
    • Should be avoided for new editorial projects
  • <hom> to subdivide senses into groups of homonyms
Example
• Recording a series of homographs with <superEntry>

  <body>
    <entry/>
    <entry/>
    <superEntry>
      <entry type="hom" n="1"/>
      <entry type="hom" n="2"/>
    </superEntry>
  </body>

• Issues
  • Values of the ‘n’ attribute according to the source
  • Values of ‘type’ defined in ‘att.entryLike’
Example
• Recording a series of homographs with <hom>

  <entry>
    <hom n="1">
      <sense n="1"/>
      <sense n="2"/>
    </hom>
    <hom n="2">
      <sense n="1"/>
      <sense n="2"/>
      <sense n="3"/>
    </hom>
  </entry>

• Issues
  • Weak boundary between polysemes and homonyms
  • Why not just have separate entries?
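The last question can be made concrete with a small sketch. It assumes Python's standard xml.etree.ElementTree and a hypothetical helper, split_homonyms, that is not part of any TEI tool; it rewrites the <hom> encoding as separate entries, reusing the type="hom" and n conventions of the <superEntry> example. TEI namespaces are left out for brevity.

```python
import xml.etree.ElementTree as ET

# The <hom>-based encoding from the slide (no TEI namespace, for brevity).
SOURCE = """
<entry>
  <hom n="1"><sense n="1"/><sense n="2"/></hom>
  <hom n="2"><sense n="1"/><sense n="2"/><sense n="3"/></hom>
</entry>
"""


def split_homonyms(entry):
    """Hypothetical helper: return one <entry type="hom"> per <hom> child."""
    result = []
    for hom in entry.findall("hom"):
        new_entry = ET.Element("entry", {"type": "hom"})
        if hom.get("n") is not None:
            # Keep the homonym number found in the source.
            new_entry.set("n", hom.get("n"))
        # Carry over the senses (and any other children) unchanged.
        new_entry.extend(list(hom))
        result.append(new_entry)
    return result


entries = split_homonyms(ET.fromstring(SOURCE))
for e in entries:
    print(ET.tostring(e, encoding="unicode"))
```

Nothing is lost in the conversion for this example, which is one way to read the slide's question: if the two encodings carry the same information, a project may prefer the one that uses only <entry>.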
From word to senses…
• Background
  • Semasiological vs. onomasiological views on lexical data
  • Two complementary data organisations
  • Two sets of standards
    • In ISO: TMF (ISO 16642) vs. LMF
    • In the TEI: Terminology vs. Print dictionary chapters
The LMF Model
[Diagram: the LMF model, relating Lexical DB, Global Info, Lexical Entry, Form and Sense through 1..1 and 0..n cardinalities]
Consequences for dictionaries
• Strong <form> to <sense> orientation
  • <form> qualifies the entry, with the identification of the headword and its morphological variations
  • <sense> is subordinated to the choice made for <form>
• Role of grammatical information
  • Overall qualification of the entry
  • Qualification of morphological variants
• Issue
  • <re> does not necessarily fit into the theory
Example
• Basic structure of an <entry>

  <entry>
    <form>
      <orth>chat</orth>
    </form>
    <sense>
      <def>Petit animal familier</def>
    </sense>
  </entry>
Representing form and grammar
• General issues
  • Multiple forms
    • <orth>, <pron>, etc.
  • Compounds
    • May be represented using embedded forms
  • Role of grammar (<gramGrp>)
    • In isolation: qualifies the entry
    • Within a form: marks special features associated with the form
  • Inflexions
    • Can be represented by means of additional <form> elements
Example
• A simple entry

  <entry>
    <form>
      <orth>chat</orth>
      <pron>ʃa</pron>
    </form>
    <gramGrp>
      <pos>N</pos>
      <gen>m</gen>
    </gramGrp>
  </entry>
Example
• Simple entry with inflected form

  <entry>
    <form type="lemma">
      <orth>chat</orth>
    </form>
    <gramGrp>
      <pos>N</pos>
      <gen>m</gen>
    </gramGrp>
    <form type="inflected">
      <orth>chats</orth>
      <gramGrp>
        <number>p</number>
      </gramGrp>
    </form>
  </entry>
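To illustrate how the two positions of <gramGrp> could be read by software, here is a minimal sketch using Python's standard xml.etree.ElementTree. The helpers features and describe are hypothetical names, and the merging rule (form-level features are added to the entry-level ones) is an assumption made for illustration, not something the guidelines prescribe. Namespaces are omitted.

```python
import xml.etree.ElementTree as ET

# The entry from the slide, without TEI namespace.
ENTRY = """
<entry>
  <form type="lemma"><orth>chat</orth></form>
  <gramGrp><pos>N</pos><gen>m</gen></gramGrp>
  <form type="inflected">
    <orth>chats</orth>
    <gramGrp><number>p</number></gramGrp>
  </form>
</entry>
"""


def features(gram_grp):
    """Turn a <gramGrp> into a {feature: value} dict (empty if absent)."""
    if gram_grp is None:
        return {}
    return {child.tag: child.text for child in gram_grp}


def describe(entry):
    """List (type, orth, features) for each direct <form> child of the entry.

    Assumption: a <gramGrp> placed directly under <entry> applies to every
    form, and a <gramGrp> inside a <form> adds features for that form only.
    """
    entry_level = features(entry.find("gramGrp"))
    rows = []
    for form in entry.findall("form"):
        merged = dict(entry_level)
        merged.update(features(form.find("gramGrp")))
        rows.append((form.get("type"), form.findtext("orth"), merged))
    return rows


rows = describe(ET.fromstring(ENTRY))
for row in rows:
    print(row)
```

Under that assumption the inflected form "chats" ends up described as a masculine noun in the plural, while the lemma carries only the entry-level features.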
<form>: the case of the Campe dictionary
• Step 1: dealing with the presence of determiners

  <form type="lemma">
    <form type="determiner">
      <orth>Das</orth>
    </form>
    <form type="headword">
      <orth>Aak</orth>
    </form>
  </form>
<form>: the case of the Campe dictionary
• Step 2: adding grammatical information

  <form type="lemma">
    <form type="determiner">
      <orth>Das</orth>
      <gramGrp>
        <pos value="D"/>
        <gen>n</gen>
      </gramGrp>
    </form>
    <form type="headword">
      <orth>Aak</orth>
      <gramGrp>
        <pos>N</pos>
        <gen>n</gen>
      </gramGrp>
    </form>
  </form>
<form>: the case of the Campe dictionary
• Step 3: dealing with inflected forms

  <form type="inflected">
    <form type="determiner">
      <orth>des</orth>
      <gramGrp>…</gramGrp>
    </form>
    <form type="headword">
      <orth><oVar><oRef/>-es</oVar></orth>
      <gramGrp>
        <case value="G">G</case>
      </gramGrp>
    </form>
  </form>
Main arguments for the proposed changes
• Coherent use of <form> and <orth>
  • Ensures coherent access to orthographic information in form/orth
• Coherent use of grammatical features
  • Danger of tag abuse with <gram type="art_n">Das</gram>
    • The ‘type’ attribute should indicate a grammatical feature
    • <gram> content should be the value of that feature
    • Non-differentiation of features (art_n -> pos + gen)
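The first argument can be sketched in a few lines. Assuming Python's standard xml.etree.ElementTree, the Campe lemma from step 2 (namespaces omitted), and a hypothetical helper orth_by_type, every written form is reached the same way, through form/orth, whether it is a determiner or a headword.

```python
import xml.etree.ElementTree as ET

# The Campe lemma with nested forms, as proposed in step 2 (no namespace).
LEMMA = """
<form type="lemma">
  <form type="determiner">
    <orth>Das</orth>
    <gramGrp><pos value="D"/><gen>n</gen></gramGrp>
  </form>
  <form type="headword">
    <orth>Aak</orth>
    <gramGrp><pos>N</pos><gen>n</gen></gramGrp>
  </form>
</form>
"""


def orth_by_type(form):
    """Hypothetical helper: (type, orth) for every <form> with its own <orth>."""
    return [
        (sub.get("type"), sub.findtext("orth"))
        # iter() also visits the outer form, which has no direct <orth>
        # child and is therefore filtered out.
        for sub in form.iter("form")
        if sub.find("orth") is not None
    ]


pairs = orth_by_type(ET.fromstring(LEMMA))
print(pairs)
```

With <gram type="art_n">Das</gram>, the same written form would sit outside form/orth, so a query like this one would need a special case for it.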
<sense>: main components
• Core elements
  • <def>: to provide the definition
  • <dicteg>
    • Need to establish guidelines on the identification of sources
  • <etym>: a complex issue…
Documenting examples
• Quotation only

  <dicteg>
    <q>Ta gamine est assise trop <oRef/>, elle ne dépasse pas de la table.</q>
  </dicteg>

• Quotation with a free-text source

  <dicteg>
    <cit>
      <q>Ta gamine est assise trop <oRef/>, elle ne dépasse pas de la table.</q>
      <bibl>Benoit M., Michel C., Le Parler de Metz...</bibl>
    </cit>
  </dicteg>

• Quotation with a structured source

  <dicteg>
    <cit>
      <q>Ta gamine est assise trop <oRef/>, elle ne dépasse pas de la table.</q>
      <biblStruct>
        <author>BENOIT M, MICHEL C.</author>
        <title>Le Parler de Metz et du pays messin</title>
        <imprint>
          <pubPlace>Metz</pubPlace>
          <publisher>Serpenoise</publisher>
          <date>2001</date>
          <biblScope>p. 38</biblScope>
        </imprint>
      </biblStruct>
    </cit>
  </dicteg>
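A project that adopts guidelines on source identification may want to check how far its examples follow them. The sketch below, using Python's standard xml.etree.ElementTree, sorts <dicteg> elements into the three levels of documentation shown on this slide. The helper source_level and the labels "none", "free-text" and "structured" are hypothetical, the third sample is shortened, and namespaces are omitted.

```python
import xml.etree.ElementTree as ET

# Three examples at increasing levels of source documentation.
# The third one is abbreviated compared with the slide.
EXAMPLES = """
<sense>
  <dicteg><q>Ta gamine est assise trop <oRef/>, elle ne dépasse pas de la table.</q></dicteg>
  <dicteg><cit>
    <q>Ta gamine est assise trop <oRef/>, elle ne dépasse pas de la table.</q>
    <bibl>Benoit M., Michel C., Le Parler de Metz...</bibl>
  </cit></dicteg>
  <dicteg><cit>
    <q>Ta gamine est assise trop <oRef/>, elle ne dépasse pas de la table.</q>
    <biblStruct><title>Le Parler de Metz et du pays messin</title></biblStruct>
  </cit></dicteg>
</sense>
"""


def source_level(dicteg):
    """Hypothetical classification of how an example's source is recorded."""
    if dicteg.find("cit/biblStruct") is not None:
        return "structured"
    if dicteg.find("cit/bibl") is not None:
        return "free-text"
    return "none"


levels = [source_level(d) for d in ET.fromstring(EXAMPLES).findall("dicteg")]
print(levels)
```

A count of each level across a whole dictionary would give a rough picture of how much work is needed to bring legacy examples up to the recommended practice.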
A quick glimpse into Roma
• A journey in three steps
  • Adding the PD module and generating a schema
  • Checking out elements
  • Expressing constraints on specific values
Final discussion
• What does it mean to be TEI conformant?