480 likes | 507 Views
The Guidelines (P5) of the Text Encoding Initiative (TEI). Laurent Romary Max Planck Digital Library With many thanks to Lou Burnard (OUCS). In the beginning. Michael Sperberg-McQuen. Lou Burnard. Antonio Zampolli. 1. Novembre 1987: Vassar College, Poughkeepsie.
E N D
The Guidelines (P5) of the Text Encoding Initiative (TEI) Laurent Romary Max Planck Digital Library With many thanksto Lou Burnard (OUCS)
In the beginning Michael Sperberg-McQuen Lou Burnard Antonio Zampolli 1. Novembre 1987: Vassar College, Poughkeepsie Seite 1
Digitizing source documents Further work on documents The basic scenario? Seite 2
material transcriptions of early printed books organization library metaphor technologies XSLT rendering in the browser; simple search via php TEI source visible? Yes, via view source Early America’s Digital Archive http://www.mith2.umd.edu/eada/ Seite 3
material multiple recensions from different traditions of the Buddhist Canon organization aligned texts for each sutra technologies Cocoon, eXist, Xquery TEI source visible? yes Samyukta Agama http://buddhistinformatics.chibs.edu.tw/BZA/ Seite 4
Can scientists bear standards? • Standards are essentially bad for scientists • Freezing knowledge • Making one lose time that could be dedicated to research • Forcing diverging views to agree • …especially if the work is done by others • A positive view on standards • Documenting data • Giving semantics to data • Pooling data from various origins • Allowing interoperability of tools • A possible answer • Standards as specification platforms • Does the TEI provide you with this? Seite 5
Basic structure(s) • Every TEI-conformant document comprises a header followed by (at least one) text • the header contains: • mandatory file description • optional encoding, profile and revision descriptions • the header is essential for: • bibliographic control and identification • resource documentation and processing Seite 7
Structure of a TEI text • In the simplest case, a text just consists of paragraphs or clauses or verse lines • A TEI text has a little more structure: it contains • optional front matter • optional back matter • a body Seite 8
The body of a text usually has divisions • usually nested one with another • the type attribute labels a particular level e.g. as "part" or "chapter“ • the n attribute gives a particular division a name or number • the xml:id attribute gives a particular division a unique identifier Seite 9
For example... <text> <front> <!-- titlepage, etc here --> </front> <body> <div type="book" n="I" xml:id="JA0100"> <head>Book I.</head> <div type="chapter" n="1" xml:id="JA0101"> <head>Of writing lives in general...</head> <!-- remainder of chapter 1 here --> </div> <div n="2" xml:id="JA0102"> <!-- chapter 2 here --> </div> <!-- remainder of book 1 here --> </div> <div type="book" n="II" xml:id="JA0200"> <!-- book 2 here --> </div> <!-- remaining books here --> </body> </text> Seite 10
TEI global attributes • The attribute class att.global defines these for all elements: • xml:id supplies a unique identifier • n supplies a (non-unique) name or number • rend gives a suggestion about rendition (appearance) • xml:lang identifies the language using an ISO standard code • The linking module extends this class with: • corresp, synch, ana for specific association types • next, prev for aggregating fragmented elements Seite 11
Text components What are divisions composed of? • prose is mostly paragraphs (p) • verse is mostly lines (l), sometimes in hierarchic groups (lg) • drama is mostly speeches (sp) containing p or l elements interspersed with stage directions (stage) These may be mixed, and may also appear directly within undivided texts .... but divisions can also contain embedded text or quote elements. Seite 12
For example <div type="book"> <l>Of Man's first disobedience, and the fruit</l> <l>Of that forbidden tree whose mortal taste</l> <l>Brought death into the World, and all our woe,</l> <l>With loss of Eden...</l> .... </div> <lg type="haiku"> <l n="1">Summer grass —</l> <l n="2">all that's left</l> <l n="3">of warriors' dreams</l> </lg> Seite 13
For example <stage>Enter Barnardo and Francisco, two Sentinels,at several doors</stage> <sp who="Barnardo"> <l part="f">Who's there? </l> </sp> <sp who="Francisco"> <l>Nay, answer me. Stand and unfold yourself.</l> </sp> <sp who="Barnardo"> <l part="i">Long live the king! </l> </sp> <sp who="Francisco"> <l part="m">Barnardo? </l> </sp> <sp who="Barnardo"> <l part="f">He. </l> </sp> Seite 14
not to mention <p>.... And he wrote on one side of the paper: <quote> <p>HELP!</p> <p>PIGLIT (ME)</p> </quote> and on the other side: <quote>IT'S ME PIGLIT, HELP HELP</quote> </p> <p>Then he put the paper in the bottle... </p> Seite 15
What are speeches, paragraphs, and lines made of? • phrases that are conventionally typographically distinct • “data-like” (names, numbers, dates, times, addresses) • editorial interventions (corrections, regularizations, additions, omissions ...) • cross references and links • lists, notes, graphics, tables, bibliographic citations... • all kinds of annotations! Which of these you need to markup will depend on your research agenda Seite 16
for example... <head>Of writing lives in general,and particularly of <title>Pamela</title>, with a word by the bye of <name key="#CIBC03">Colley Cibber</name> and others.</head> <p>It is a trite but true observation, that <q>examples work more forcibly on the mind than precepts</q> </p> <p> <name key="#JA">Mr. Joseph Andrews</name>, <rs>the hero of our ensuing history</rs>, was esteemed to be ...</p> Seite 17
Direct speech • Use the who attribute to show speakers • Speeches can be nested in other speeches <q who="Wilson"> Spaulding, he came down into the office just this day eight weeks with this very paper in his hand, and he says: <q who="Spaulding">I wish to the Lord, Mr. Wilson, that I was a red-headed man.</q> </q> Seite 18
Foreign language phrases • The xml:lang attribute may be attached to any element • Use <foreign> if nothing else is available • Use ISO 639-2 code to identify language Have you read <title xml:lang="deu">Die Dreigroschenoper</title>? <mentioned xml:lang="fra">Savoir-faire</mentioned> is French for know-how. John has real <foreign xml:lang="fra">savoir-faire</foreign>. Seite 19
‘Inter’ class elements • <list> • lists of all kinds • <note> • notes (authorial or editorial) • <figure> • pictures or figures • <table> • tables • <bibl> • bibliographic descriptions Seite 20
for example... <list type="xmas"> <label>For my true love</label> <item> <list type="bullets"> <item>three calling birds></item> <item>two french hens</item> <item>a partridge in a pear tree</item> </list> </item> <label>For Uncle Joe</label> <item>socks as usual</item> </list> Seite 21
Example <figure> <head>Mr Fezziwig's Ball</head> <figDesc>A Cruikshank engraving showing Mr Fezziwig leading a group of revellers.</figDesc> <graphic url="fezz.gif"/> </figure> Seite 22
Feeling overwhelmed? All of this is just one way of looking at the TEI. 1. The TEI is a modular system: you use it to build an encoding scheme appropriate to your needs, by selecting specific modules 2. Each module defines a group of elements and attributes 3. Elements are classified structurally and semantically Define your goals before using the TEI! Seite 23
Some other modules Your choice from: 1. Transcription of spoken texts 2. Dictionaries and lexica 3. Varieties of linguistic annotation 4. Nonstandard characters and glyphs 5. Linking, alignment, non-hierarchic structures 6. Detailed metadata (the TEI Header) 7. Manuscript Description 8. Text-critical apparatus 9. Physical description 10. Onomastics and ontologies 11. The ODD system Seite 24
Customizing the TEIHow do you use the TEI modular structure?
The global TEI architecture Seite 26
Following the TEI spirit Conformance to the TEI means: • Sharing a common text encoding culture • Sharing the same vocabulary (when applicable) • Allowing user autonomy in defining modifications (extensions, customization), but sharing the mechanisms to do so The TEI gives you a lot of help in following these rules. Seite 27
Important concepts The TEI's literary programming with ODD (One Document Does it all) provides: • Schema specification • User oriented documentation • Modularity: all specifications pertaining to a coherent sub-domain of the TEI • Classes: identifying shared behaviours or semantics • Extensibility: a consequence of the above mechanisms Seite 28
The TEI ODD in practice The TEI Guidelines, its schema, and its schema fragments, are all produced from a single XML resource containing: 1. Descriptive prose (lots of it) 2. Examples of usage (plenty) 3. Formal declarations for components of the TEI Abstract Model: • elements and attributes • Modules • classes and macros Seite 29
Possibilities of customizing the TEI The TEI has over 20 modules. A working project will: • Choose the modules they need • Probably narrow the set of elements within a module • Probably add local datatype constraints • Possibly add new elements • Possibly localize the names of elements Seite 30
Quick and simple access to the TEI Imagine that you have seen your colleague next door doing some encoding with the TEI and want to do the same thing: • Go to Roma at http://tei.oucs.ox.ac.uk/Roma/ • Generate a schema [Schema] • Make a trial with the editor, creating a simple document • Get back to Roma and make basic documentation Seite 31
Roma:Start Seite 32
Roma:Schema Select Seite 33
Roma:Generate Doc Seite 34
Subsetting the TEI Suppose you now feel you want to use some more of the TEI, but not all of it • Go to Roma... • Look at [Modules] • Explore default modules by pointing to main elements (by order of interest). You can throw away most things, but • In textstructure, you should really keep TEI, text, body and div • In core, most people need p, q, list, pb and head • From header, keep everything unless you really understand the details • Start checking out elements • Make editorial choices (numbered vs. unnumbered head’s) Seite 35
Roma: Modules Seite 36
Roma: Change Module Seite 37
Adding TEI objects You can add your own elements and attributes. But • make very sure you are not just making something which is syntactic sugar for an existing TEI concept • do not rename existing elements - you can do that directly in ODD • if you want facilities from a very different field of discourse, such as maths or vector graphics, use the existing standards in that area • consider interoperability Seite 38
Roma: Add Element Seite 39
Under the hood TEI customizations are themselves expressed in TEI XML, using elements from the tagdocs module. For example: <schemaSpec ident="myTEIlite"> <desc>This is TEI Lite with simplified heads</desc> <moduleRef key="tei"/> <moduleRef key="core"/> <moduleRef key="textstructure"/> <moduleRef key="header"/> <moduleRef key="linking"/> <elementSpec ident="head" mode="change"> <content> <rng:text/> </content> </elementSpec> </schemaSpec> produces something like TEI Lite, with a slight change Seite 40
What does an ODD look like? <elementSpec module="spoken" ident="pause"> <classes> <memberOf key="model.divPart.spoken"/> <memberOf key="att.timed"/> <memberOf key="att.typed"/> </classes> <content> <rng:empty/> </content> <attList> <attDef ident="who" usage="opt"> <gloss>A unique identifier</gloss> <desc>supplies the identifier of the person or group pausing. Its value is the identifier of a <gi>person</gi> or <gi>persGrp</gi> element in the TEI header. </desc> <datatype> <rng:ref name="data.pointer"/> </datatype> </attDef> </attList> <desc>a pause either between or within utterances.</desc> </elementSpec> Seite 41
... from which we generate element pause { pause.content, pause.attributes } pause.content = empty pause.attributes = att.global.attributes, att.timed.attributes, att.typed.attributes, att.ascribed.attributes, model.divPart.spoken |= pause att.timed |= pause att.typed |= pause att.ascribed |= pause Seite 42
.. or <!ELEMENT %n.pause; %om.RR; EMPTY> <!ATTLIST %n.pause; %att.global.attributes; %att.timed.attributes; %att.typed.attributes; %att.ascribed.attributes;> <!ENTITY % model.divPart.spoken "%x.model.divPart.spoken; %n.event; | %n.kinesic; | %n.pause; | %n.shift; | %n.u; | %n.vocal; | %n.writing;"> Seite 43
... and, indeed Seite 44
MPDL CoLaboratory (MPDL CoLab) • Platform for community building and knowledge exchange • Aim: • improve exchange of explicit knwoledge and make tacit and individual know-how explicit • Supports community-building processes • Connects people with similar fields of interest and goals • within the MPS: MPDL, librarians, scientists • Outside: underlying basis of our national and international collaborations • Provide information about existing standards and best practices in the domain of supporting scientific life cycles • Ensuring long-term compatibility between local and centralized initiatives within the MPDL Seite 45
Exploring TEI P5 • Visit http://www.tei-c.org/release/doc/tei-p5-doc/html/ • alphabetical lists of classes, macros, elements • each chapter describes a distinct module • each module presents a semantically related list of elements, with examples of their use Feedback and advice available to all on tei-l@listserv.brown.edu Seite 47