
TEI Introduction: Encoding Information for Interchange

Learn about Text Encoding Initiative (TEI) - origins, goals, modular architecture, customization, and designing DTDs for textual interchange. Discover the TEI solution for modularization in markup schemes.


Presentation Transcript


  1. Encoding Information for Interchange: An introduction to the TEI • Lou Burnard, Humanities Computing Unit, Oxford University

  2. The problem • SGML/XML markup is powerful, flexible, and can be customised to meet most (all?) needs • But to use it, you need a formal specification (aka document type definition or DTD) • Where do you get one from? • How do you choose?

  3. Some answers • Roll your own • from scratch • within an existing framework • Take what’s on offer • Use the TEI architecture

  4. The Text Encoding Initiative • Origins and Goals • Modular Architecture • Customization

  5. Where did the TEI come from? • From the humanities research community • librarians and cybernauts • linguists, historians, lexicographers... • Sponsors • ACH Association for Computers and the Humanities • ACL Association for Computational Linguistics • ALLC Association for Literary and Linguistic Computing • Funders • U.S. National Endowment for the Humanities • Mellon Foundation • Commission of European Communities DG XIII • Social Science and Humanities Research Council of Canada

  6. … and where is it going? • Continued work in new application areas • manuscript description • physical description • non-SGML data • XML conformance • Continued take-up • Need for new infrastructure • Corrected reprint of P3 due summer 1998

  7. Goals of the TEI • better interchange and integration of data • support for all texts, in all languages, from all periods • guidance for the perplexed: what to encode • assistance for the specialist: how to encode any information of interest • a user-driven codification of existing best practice

  8. TEI Deliverables • coherent set of recommendations for text encoding • comprising several distinct SGML tagsets • based on existing practice • documented in a reference manual • tutorials for general and specialised audiences ... but no software

  9. The TEI modus operandi... • identify significant particularities independent of notation or realisation • avoid controversy, over-delicacy, inadequacy • seek generalizable solutions, acceptable to a consensus

  10. ... and some consequences • focus on content, not presentation • descriptive, not prescriptive • Occam's razor • modular, extensible dtd • highly general in application, needs customization for particular areas

  11. Who uses TEI? • see http://www-tei.uic/orgs/tei/app/ • digital librarians and archivists • LC, HTI, UVA, CETH, OTA... • Language Engineering projects • EAGLES, BNC, MULTEX, Parole, Silfide • academic researchers • Women Writers Project, Project Orlando, Model Editions Partnership, Canterbury Tales Project, Bodleian Library, and many more...

  12. Designing your DTD • How can a single mark-up scheme handle a large variety of requirements ? • all texts are alike • every text is different • Learn from the database designers • one construct, many views • each view a selection from the whole

  13. How many dtds might you need? • one (the Corporate or WKWBFY approach) • none (the Anarchic or NWEUMP approach) • as many as it takes (the Mixed Economy or WNSA approach) • or is there a better way?

  14. The TEI solution: modularization • a (very) large number of element and attribute definitions • organised as tagsets (core, base, additional, or auxiliary) • grouped into classes • a single main DTD with many faces (a British DTD)

  15. Combining Tag Sets • And how does one combine tagsets? The how-many-dtds problem is back. • all tag sets, all the time (the table d'hôte model) • a few pre-selected combinations (the combination plate model) • in completely unconstrained abandon (the smorgasbord model) • one from column A, two from column B (the Chinese menu model)

  16. The Chicago Pizza Model <!ENTITY % base "(deepDish|thinCrust|stuffed)" > <!ENTITY % topping "(pepperoni|mushrooms|sausage|pepper|anchovies|...)" > <!ELEMENT pizza - - (%base;, (tomatoSauce & cheese), %topping;*) >

  17. To build a view of the TEI dtd, take... • the core tagsets • the base of your choice • the toppings of your choice <!DOCTYPE TEI.2 SYSTEM 'tei2.dtd' [ <!ENTITY % TEI.prose 'INCLUDE' > <!ENTITY % TEI.analysis 'INCLUDE' > ]> <TEI.2>.....</TEI.2>
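The recipe on this slide (one base plus any number of toppings, each switched on in the DOCTYPE subset) can be sketched as a small generator. This is only an illustration in Python: the function name `build_doctype` and the two lookup sets are hypothetical, and the sets list only tagset entities that appear elsewhere in these slides, not the full TEI inventory.

```python
# Tagset entity names that appear in these slides (not the complete TEI list).
BASES = {"TEI.prose", "TEI.dictionaries"}
ADDITIONAL = {"TEI.analysis", "TEI.corpus", "TEI.figures", "TEI.transcr"}


def build_doctype(base, additional=()):
    """Return a DOCTYPE declaration that switches on one base and some toppings."""
    if base not in BASES:
        raise ValueError(f"unknown base tagset: {base}")
    unknown = set(additional) - ADDITIONAL
    if unknown:
        raise ValueError(f"unknown additional tagsets: {sorted(unknown)}")
    lines = ['<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [']
    # Exactly one base, followed by the chosen additional tagsets.
    for name in (base, *additional):
        lines.append(f'  <!ENTITY % {name} "INCLUDE">')
    lines.append("]>")
    return "\n".join(lines)


doctype = build_doctype("TEI.prose", ["TEI.analysis"])
print(doctype)
```

Taking the base as a single required argument is one way to reflect the rule that only one base may be selected; mixed bases would need separate handling.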

  18. … trim to fit ... • user extension files • rename elements • undefine elements to be redefined* or removed <!ENTITY % TEI.extensions.ent SYSTEM 'myMods.ent' > <!ENTITY % n.p 'para' > <!ENTITY % seg 'IGNORE'> * see later

  19. … and cook thoroughly • ‘compile’ the dtd to remove all parameterization • easier to use for some software • better project management • see http://firth.natcorp.ox.ac.uk/~tei/pizza.html • don’t forget the documentation!

  20. TEI base tagsets • one only must be selected • defines basic structural components • currently defined: • prose, verse, drama • transcribed speech • dictionaries • terminological databases • mixtures of bases require special treatment

  21. TEI additional tagsets • sets of elements for specialised application areas • can be mixed and matched ad lib • currently provided: • linking and alignment; analysis; feature structures; certainty; physical transcription; textual criticism, names and dates; graphs and trees; figures and tables; language corpora....

  22. How does this work? • Main dtd consists of marked sections, each (potentially) containing one tagset • By default, all tagsets are IGNOREd <![ %TEI.tagset [ <!-- declarations for tagset here --> ]]> <!ENTITY % TEI.tagset "INCLUDE">
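The switch works because, in SGML, the first declaration of an entity is the one that binds: the user's INCLUDE in the DOCTYPE subset is read before the main DTD's default IGNORE. A rough Python simulation of that behaviour follows; the helper names `declare` and `select_sections` are hypothetical, the one-element "tagsets" are invented for brevity, and the regular expression handles only flat, un-nested marked sections.

```python
import re


def declare(entities, name, value):
    """Record a parameter entity; as in SGML, the first declaration wins."""
    entities.setdefault(name, value)


def select_sections(dtd, entities):
    """Keep the body of each marked section whose keyword entity is INCLUDE."""
    pattern = re.compile(r"<!\[\s*%([\w.]+);?\s*\[(.*?)\]\]>", re.S)

    def keep(match):
        name, body = match.group(1), match.group(2)
        return body.strip() if entities.get(name) == "INCLUDE" else ""

    return pattern.sub(keep, dtd)


entities = {}
declare(entities, "TEI.prose", "INCLUDE")  # user's DOCTYPE subset, read first
declare(entities, "TEI.prose", "IGNORE")   # main DTD default, read later: no effect
declare(entities, "TEI.verse", "IGNORE")   # main DTD default, never overridden

main_dtd = """
<![ %TEI.prose; [ <!ELEMENT p - - (#PCDATA)> ]]>
<![ %TEI.verse; [ <!ELEMENT l - - (#PCDATA)> ]]>
"""
active = select_sections(main_dtd, entities).strip()
print(active)
```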

  23. How does this work? (contd) • Tagsets contain element and attlist declarations, each also enclosed by a marked section • By default all elements are INCLUDEd <![ %element [ <!ELEMENT %n.element - - (#PCDATA)> <!ATTLIST %n.element %a.global > ]]> <!ENTITY % element "IGNORE">

  24. How does this work? (contd) • Element names (GIs) are always referred to indirectly, so that they may be renamed <!ELEMENT %n.elem1; - - (%n.elem2;+)> <!ENTITY % n.elem1 "elem1"> <!ENTITY % n.elem2 "foo">
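To see what the indirection buys, it helps to expand the references by hand. The Python sketch below does that mechanically, using the entity names from this slide; the `expand` helper is hypothetical, assumes every reference is written with a closing semicolon, and does not guard against circular definitions.

```python
import re


def expand(text, entities):
    """Replace %name; references repeatedly until none remain."""
    ref = re.compile(r"%([\w.]+);")
    while ref.search(text):
        text = ref.sub(lambda m: entities[m.group(1)], text)
    return text


declaration = "<!ELEMENT %n.elem1; - - (%n.elem2;+)>"

# Names as given on the slide: elem2 is already renamed to "foo".
entities = {"n.elem1": "elem1", "n.elem2": "foo"}
expanded = expand(declaration, entities)
print(expanded)

# Renaming again touches one entity value, not the declaration itself.
renamed = expand(declaration, {**entities, "n.elem2": "bar"})
print(renamed)
```

The same expansion, applied to a whole DTD, is roughly what "compiling" a DTD to remove parameterization amounts to.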

  25. Element Classes • Model classes • elements which share syntactic properties (i.e. occur in same position) • Attribute classes • elements which share attributes • Class membership can be inherited • Another way of doing architectural forms

  26. Some TEI model classes • divn: structural elements like divisions <div>, <div1>, <div2>, <lg>, <lg1>... • divtop: elements which can appear at the start of a divn element <head>, <epigraph>, <byLine>... • chunk: paragraph-like elements <sp>, <p>, <lg>, <l>… • phrase: elements which appear within chunks <hi>, <foreign>, <date>, <q> ...

  27. Some TEI semantic classes • data: phrases likely to be normalised or processed non textually <date>, <time>, <name>... • biblpart: specialised components of bibliographic descriptions <author>, <title>, <editor>... • demographic: descriptive features of participants in a language interaction <birth>, <socEcstat>, <occupation>...

  28. Some TEI attribute classes • global: attributes which are available to every element n, lang, id, TEIform • linking: attributes for elements which have linking semantics targType, targOrder, evaluate

  29. The class system in action • Simplifying documentation and understanding of the DTD • Parameterizing content models • different for different bases • Simplifies customization • class membership is unaffected • adding new elements to an existing class

  30. Parameterized content models • “Components”, for example: • a dictionary is composed of entries • a play is composed of speeches • a novel is composed of paragraphs • in each case, the basic “text soup” (and the structural divisions) remain the same, but they are organized differently

  31. How does this work? (contd) • the component class has different members in different bases <![ %TEI.prose [ <!ENTITY % m.component "p|list|note"> ]]> <![ %TEI.dictionaries [ <!ENTITY % m.component "entry"> ]]> <!ENTITY % component.seq "(%m.component;)+"> <!ELEMENT div - - (head?, (%component.seq;), div*) >
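The effect of the base-dependent class can be shown by writing out the `div` declaration that results for each base. This Python sketch hard-codes the two `m.component` values from the slide; the dictionary and the function name `div_declaration` are illustrative only.

```python
# Members of the component class per base, as given on the slide.
M_COMPONENT = {
    "TEI.prose": "p|list|note",
    "TEI.dictionaries": "entry",
}


def div_declaration(base):
    """Return the div declaration with the component class filled in for a base."""
    component_seq = f"({M_COMPONENT[base]})+"
    return f"<!ELEMENT div - - (head?, ({component_seq}), div*)>"


prose_div = div_declaration("TEI.prose")
dictionary_div = div_declaration("TEI.dictionaries")
print(prose_div)
print(dictionary_div)
```

The structural frame (optional head, then components, then nested divisions) is identical in both results; only the component members differ.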

  32. Customization... • Removing an element involves • undeclaring it • (NB: ISO 8879 permits references to undefined elements -- though not all vendors know this) • Adding a new element involves • determining its class • defining it • adding it to that class

  33. Customization (contd) • Modification of an element implies removal followed by addition • Class membership should be unaffected <!-- in TEI.extensions.ent --> <!ENTITY % p "IGNORE"> <!-- in TEI.extensions.dtd --> <!ELEMENT %n.p - - (#PCDATA)>

  34. How does this work? (contd) • Each model class is defined as a parameter entity • Reference to class members is always indirect • Membership extensible (by a kludge) <!ENTITY % x.class ""> <!ENTITY % m.class "%x.class; name1 | name2 | name3 ..."> <!ELEMENT %n.element; - - ((%m.class;)+)>
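The kludge is that the extension entity is simply pasted in front of the default member list, so a non-empty extension has to carry its own trailing `|`. A small Python sketch makes this visible; the helpers `model_class` and `members` are hypothetical, the default members are the phrase-class examples listed on the model classes slide, and the extension string is the `x.phrase` value from the Lampeter slides.

```python
def model_class(default_members, extension=""):
    """Build m.class as the x.class prefix followed by the default members."""
    return extension + "|".join(default_members)


def members(model):
    """Split a model-class string back into its element names."""
    return [name.strip() for name in model.split("|")]


phrase_defaults = ["hi", "foreign", "date", "q"]  # examples only, not the full class

unextended = model_class(phrase_defaults)
extended = model_class(phrase_defaults, extension="it|ro|sc|su|bo|go|")
print(unextended)
print(extended)
print(len(members(extended)))
```

With the default empty extension the class is unchanged, which is why an unmodified DTD behaves as if the mechanism were not there.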

  35. An example: the Lampeter corpus • Requirements • light presentational tagging • structural markup for access • demographic information about text production • small number of tags to ease data capture and validation • Implementation • tagsets: prose base, and tags from four additional sets • some extensions, many exclusions

  36. The Lampeter corpus DTD subset <!DOCTYPE TEICORPUS.2 SYSTEM "tei2.dtd" [ <!ENTITY % TEI.prose "INCLUDE"> <!ENTITY % TEI.corpus "INCLUDE"> <!ENTITY % TEI.figures "INCLUDE"> <!ENTITY % TEI.transcr "INCLUDE"> <!ENTITY % TEI.extensions.ent SYSTEM "lampext.ent"> <!ENTITY % TEI.extensions.dtd SYSTEM "lampext.dtd"> <!-- more declarations here --> ]>

  37. The Lampeter corpus extensions.ent <!ENTITY % analytic 'IGNORE' > <!ENTITY % biblStruct 'IGNORE' > <!-- hic desunt multa --> <!ENTITY % supplied 'IGNORE' > <!ENTITY % x.phrase "it|ro|sc|su|bo|go|"> <!ENTITY % x.biblPart "printer|pubFormat|bookSeller|"> <!ENTITY % x.demographic "socecstatusPat|biogNote|"> <!ENTITY % x.globincl "gap|">

  38. The Lampeter corpus extensions.dtd <!ELEMENT (it|ro|sc|su|bo|go) - - (%phrase.seq)> <!ELEMENT (persName|printer|pubFormat |bookSeller|biogNote|socecstatusPat) - - (%phrase.seq) > NB: This is a provisional version only! (no attlists, no documentation…)

  39. Summary • Designing a successful DTD involves careful, conscious, controlled theft • Modularize the task • A class system helps identify • what is true of all documents • what is true of some documents • Modifiability can be compatible with standardization
