Anyone for pizza? Designing a TEI document type declaration
The T E what? • Originally, a research project within the humanities • Sponsored by ALLC, ACH, ACL • Funded 1990-1994 by US NEH, EU LE Programme et al • Major influences • digital libraries and text collections • language corpora • scholarly datasets • Now an international membership consortium incorporated Jan 2001 http://www.tei-c.org
Current TEI activity • Preliminary XML version of DTD now available • Text update in progress • Workgroups under consideration • character set issues • manuscript description • modelling • lexica and termbanks • physical description • resource discovery • Membership will vote by end 2001
Who uses TEI? • digital librarians and archivists • HTI, UVA, CETH, OTA... • Language Engineering projects • EAGLES, BNC, MULTEXT, ECI, Silfide • academic researchers • Women Writers Project, CURIA Project, VWWP, Orlando, Model Editions Partnership, Canterbury Tales Project, Bodleian Library... • http://www.hcu.ox.ac.uk/TEI/Applications/
Goals of the TEI • better interchange and integration of scholarly data • support for all texts, in all languages, from all periods • guidance for the perplexed: what to encode • assistance for the specialist: how to encode any information of interest • Hence a loose framework into which unpredictable extensions can be fitted
Legacy of the TEI • a way of looking at what “text” really is • a codification of current scholarly practice • (crucially) a set of shared assumptions and priorities about the digital agenda: • focus on content and function (rather than presentation) • generic solutions (rather than application-specific ones)
TEI Deliverables • A set of recommendations for text encoding, • covering both generic text structures and some highly specific areas based on (but not limited by) existing practice • A very large collection of element definitions • combined into a very loose document type declaration • A mechanism for creating multiple views (DTDs) of the foregoing • One such view and associated tutorial: TEI Lite; others exist (e.g. CES, BNC)
The TEI modus operandi... • identify significant particularities • independent of notation or realization • avoid controversy, over-delicacy, inadequacy • seek generalizable solutions, acceptable to a consensus
... and some consequences • focus on content, not presentation • descriptive, not prescriptive • Occam's razor • modular, extensible dtd
Designing a dtd for the TEI • How can a single markup scheme handle a large variety of requirements? • all texts are alike • every text is different • Learn from the database designers • one construct, many views • each view a selection from the whole
How Many dtds? • How many dtds might the TEI require? • one (the Corporate or WKWBFY approach) • none (the Anarchic or NWEUMP approach) • as many as it takes (the Mixed Economy or XML approach) • Or a single main dtd with many faces (the British approach)
The TEI solution: modularization • a (very) large number of element and attribute definitions • organized as tagsets (core, base, additional, or auxiliary) • grouped into classes
How to combine Tag Sets… • all tag sets, all the time (the table d'hôte model) • a few pre-selected combinations (the combination plate model) • in completely unconstrained abandon (the smorgasbord model) • one from column A, two from column B (the Chinese menu model)
The Chicago Pizza Model
<!ENTITY % base "(deepDish | thinCrust | stuffed)" >
<!ENTITY % topping "( pepperoni | mushrooms | sausage | pepper | anchovies | ...)" >
<!ELEMENT pizza - - ( %base;, (tomatoSauce & cheese), %topping; ) >
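For comparison, the same model can be sketched in XML DTD syntax. This is an illustrative rewrite and not part of the TEI: the file name pizza.dtd is hypothetical, the elided toppings are left out, and a repeatable topping group is an assumption. XML has no & connector and no tag-minimization flags, so sauce and cheese are given a fixed order here.

```dtd
<!-- pizza.dtd: hypothetical external DTD, XML syntax -->
<!ENTITY % base    "(deepDish | thinCrust | stuffed)">
<!ENTITY % topping "(pepperoni | mushrooms | sausage | pepper | anchovies)">

<!-- One base, then sauce and cheese in fixed order, then any number of toppings. -->
<!-- The occurrence indicator sits outside its own parentheses: (%topping;)*      -->
<!ELEMENT pizza (%base;, tomatoSauce, cheese, (%topping;)*)>

<!-- Every ingredient is an empty element in this sketch -->
<!ELEMENT deepDish    EMPTY>
<!ELEMENT thinCrust   EMPTY>
<!ELEMENT stuffed     EMPTY>
<!ELEMENT tomatoSauce EMPTY>
<!ELEMENT cheese      EMPTY>
<!ELEMENT pepperoni   EMPTY>
<!ELEMENT mushrooms   EMPTY>
<!ELEMENT sausage     EMPTY>
<!ELEMENT pepper      EMPTY>
<!ELEMENT anchovies   EMPTY>
```

The declarations live in a separate file because XML only allows parameter-entity references inside declarations in the external subset. Writing (%topping;)* rather than %topping;* matters in XML, where the replacement text of a parameter entity is padded with spaces when it is expanded in the DTD.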
To build a view of the TEI dtd, take... • the core tagsets • the base of your choice • the toppings of your choice
<!DOCTYPE TEI.2 SYSTEM 'tei2.dtd' [
  <!ENTITY % TEI.prose 'INCLUDE' >
  <!ENTITY % TEI.analysis 'INCLUDE' >
]>
<TEI.2>.....</TEI.2>
TEI base tagsets • one only must be selected • defines basic structural components • currently defined: • prose, verse, drama • transcribed speech • dictionaries • terminological databases • mixtures of bases require special treatment
TEI additional tagsets • sets of elements for specialized application areas • can be mixed and matched ad lib • currently provided: • linking and alignment; analysis; feature structures; certainty; physical transcription; textual criticism, names and dates; graphs and trees; figures and tables; language corpora....
For an XML DTD • Just add another declaration to the subset • (This is new in TEI P4)
<!ENTITY % TEI.XML "INCLUDE">
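Put together with the tagset selections shown earlier, an XML view of the DTD might be requested as follows. This is a minimal sketch that only combines declarations already used in these slides; the choice of the prose base and the analysis tagset is just an example.

```dtd
<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [
  <!-- Ask for the XML version of the DTD (new in TEI P4) -->
  <!ENTITY % TEI.XML      "INCLUDE">
  <!-- One base tagset -->
  <!ENTITY % TEI.prose    "INCLUDE">
  <!-- One additional tagset (a "topping") -->
  <!ENTITY % TEI.analysis "INCLUDE">
]>
```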
How does this work? • Declaring a tagset's parameter entity as INCLUDE enables all declarations within that tagset's marked section in the main TEI dtd • these may include element, attribute, and class definitions
<!ENTITY % TEI.tagset "INCLUDE">
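How does this work? • Within the main DTD: • the declarations making up each tagset are enclosed by a marked section that defaults to IGNORE • the declarations for each element are enclosed by a marked section that defaults to INCLUDE • either default can be overridden by your declaration within the DTD subset, because the subset is read first and the first declaration of an entity is the one that binds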
Customizing the TEI DTD • In DTD subset • selection of tag sets • specification of document entities • in TEI.extensions.ent • renaming of elements • suppression of elements • modification of TEI classes • in TEI.extensions.dtd • definition of new elements
Entity definitions • typically will include entity declarations for embedded graphics etc. • may also invoke special characters etc.
<!ENTITY % myStuff SYSTEM "myEnts.dtd">
%myStuff;
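As an illustration of what such document entities can look like in the DTD subset, here is a hedged sketch. The notation name, its system identifier, the entity name frontispiece, and the file frontis.jpg are all made up for the example, and it assumes no notation called jpeg has already been declared elsewhere.

```dtd
<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [
  <!ENTITY % TEI.prose "INCLUDE">

  <!-- A notation naming the image format (illustrative identifier) -->
  <!NOTATION jpeg SYSTEM "image/jpeg">

  <!-- An unparsed (NDATA) entity pointing at an embedded graphic -->
  <!ENTITY frontispiece SYSTEM "frontis.jpg" NDATA jpeg>

  <!-- A parameter entity pulling in a file of further declarations -->
  <!ENTITY % myStuff SYSTEM "myEnts.dtd">
  %myStuff;
]>
```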
To modify the dtd • Define your modifications in a pair of extension files
<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [
  <!ENTITY % TEI.prose "INCLUDE" >
  <!ENTITY % TEI.extensions.ent SYSTEM "myMods.ent" >
  <!ENTITY % TEI.extensions.dtd SYSTEM "myMods.dtd" >
]>
<TEI.2>...</TEI.2>
In your extension files you can… • rename elements <!ENTITY % n.p "para" > • undefine elements <!ENTITY % seg "IGNORE"> • The pizzaChef gives you a list of all the elements available from your chosen tagsets, and generates extension files for you
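Renaming works because element names in the main DTD are themselves referenced through parameter entities. The sketch below is schematic: the lines standing in for the main DTD are simplified (the content model is reduced to #PCDATA) and are not the literal TEI source.

```dtd
<!-- myMods.ent: extension file, read before the default names are declared -->
<!ENTITY % n.p "para">

<!-- Main DTD, schematically. This later declaration of n.p is ignored, -->
<!-- because an entity keeps its first declaration.                     -->
<!ENTITY % n.p "p">

<!-- The element is declared through the entity, so it is now called para -->
<!ELEMENT %n.p; (#PCDATA)>
```

With this in place a document uses <para> wherever the unmodified DTD would expect <p>.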
You can also • supply additional (or replacement) declarations • supply entirely new elements and embed them in the architecture
<!ENTITY % seg "IGNORE">
<!ELEMENT %n.seg; (#PCDATA)>
<!ENTITY % x.phrase 'blort|'>
<!ELEMENT blort (#PCDATA)>
<!ATTLIST blort %a.global; farble (foo|bar|baz) "baz">
An example • In the DTD subset we write:
<!ENTITY % TEI.prose "INCLUDE">
<!ENTITY % biblStruct "IGNORE">
• In the prose tagset it says:
<!ENTITY % TEI.prose "IGNORE">
<![ %TEI.prose; [
  .... lots of other declarations ...
  <!ENTITY % biblStruct "INCLUDE">
  <![ %biblStruct; [
    <!ELEMENT biblStruct .... >
    <!ATTLIST biblStruct .... >
  ]]>
  … yet more declarations …
]]>
Finally, the pizza is cooked • The carthage program removes • parameterization in the DTD • unreferenced or inaccessible elements • The pizzachef website • http://www.tei-c.org/pizza.html • command line equivalent: • http://www.tei-c.org/maketeidtd/
Element Classes • Most TEI elements are assigned to one or more • model classes, identifying their syntactic properties, or • attribute classes, identifying their attributes • This provides a (relatively) simple way of • documenting and understanding the DTD • parameterizing content models • facilitating customization • An alternative way of doing architectural forms
Some TEI model classes • divn: structural elements like divisions (<div>, <div1>, <div2>…) • divtop: elements which can appear at the start of a divn element (<head>, <epigraph>, <byline>…) • chunk: paragraph-like elements (<sp>, <p>, <lg>…) • phrase: elements which appear within chunks (<hi>, <foreign>, <date>…)
Implementation of classes • Each model class is defined as a pair of parameter entities • Reference to class members is always indirect
<!ENTITY % x.class "">
<!ENTITY % m.class "%x.class;name1|name2" >
<!ELEMENT foo (%m.class;)+>
Class mobility • Each model class is defined as a parameter entity, containing • a reference to an initially null extension class • a list of members • To add a new member to a class, we redefine the extension class (note the trailing |, which joins the new members to the existing list):
<!ENTITY % x.class "myChunk|myOther|">
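A worked expansion may make the mechanism clearer. The names class, name1, name2, foo, myChunk and myOther are the placeholders used in these slides, not real TEI names, and the main-DTD lines are schematic.

```dtd
<!-- Extension file (read first): new members, with a trailing "|" -->
<!ENTITY % x.class "myChunk|myOther|">

<!-- Main DTD: this empty default is ignored, since x.class is already declared -->
<!ENTITY % x.class "">

<!-- The class entity splices the extension in front of the standard members -->
<!ENTITY % m.class "%x.class;name1|name2">

<!-- After expansion, the content model is (myChunk|myOther|name1|name2)+ -->
<!ELEMENT foo (%m.class;)+>
```

Without an extension file, x.class stays empty and the content model is simply (name1|name2)+. Every element whose content model refers to the class picks up the new members without its own declaration being touched.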
TEI attribute classes • global: attributes which are available to every element (n, lang, id, TEIform) • linking: attributes for elements which have linking semantics (targType, targOrder, evaluate)
The TEIform attribute • protects applications from the effect of element renaming <titre TEIform="title">...</titre> • protects applications from the effect of syntactic sugar <abc type="xyz"> can be rewritten as <xyz TEIform="abc">
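One way to picture how this survives renaming: the attribute carries the original element name as its default value, so the document does not have to state it. The following is a schematic sketch, not the literal TEI declarations; the content model is reduced to #PCDATA and the other attributes are omitted.

```dtd
<!-- Extension file: rename <title> to <titre> -->
<!ENTITY % n.title "titre">

<!-- Main DTD, schematically: the element takes whatever name n.title holds -->
<!ELEMENT %n.title; (#PCDATA)>

<!-- The default value still records the canonical TEI name -->
<!ATTLIST %n.title; TEIform CDATA "title">
```

An application that looks at TEIform rather than the element name then treats <titre>...</titre> as a title, whether or not the attribute is written out in the document.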
TEI Auxiliary DTDs • independent dtds for specialized information: • writing system / character set • feature system (for feature-structure notation) • tag set documentation • independent, free-standing TEI header
What can go wrong? • extensions must use SGML syntax • beware of zombie elements • beware of overzealous pruning • remember that some TEI rules are not enforced (or enforceable) by the DTD • You have to know what's on the menu before you can choose from it
A case study: the Lampeter corpus • Fairly typical requirements for historical language corpora: • light presentational tagging • structural markup for access • detailed information about source text production • small number of tags to ease data capture and validation • Implementation • tagsets: prose base, and tags from four additional sets • some extensions, many exclusions
The Lampeter corpus DTD subset
<!DOCTYPE teiCorpus.2 SYSTEM "tei2.dtd" [
  <!ENTITY % TEI.prose "INCLUDE">
  <!ENTITY % TEI.corpus "INCLUDE">
  <!ENTITY % TEI.figures "INCLUDE">
  <!ENTITY % TEI.transcr "INCLUDE">
  <!ENTITY % TEI.extensions.ent SYSTEM "lampext.ent">
  <!ENTITY % TEI.extensions.dtd SYSTEM "lampext.dtd">
]>
The Lampeter corpus extensions.ent
<!ENTITY % analytic 'IGNORE' >
<!ENTITY % biblStruct 'IGNORE' >
<!-- hic desunt multa -->
<!ENTITY % supplied 'IGNORE' >
<!ENTITY % x.phrase "it|ro|sc|su|bo|go|">
<!ENTITY % x.biblPart "printer|pubFormat|bookSeller|">
<!ENTITY % x.demographic "socecstatusPat|biogNote|">
The Lampeter corpus extensions.dtd
<!ELEMENT it (%phrase.seq;)>
<!ELEMENT printer (%phrase.seq;)>
<!ATTLIST it %a.global; >
<!-- etc. for all other new elements -->
To finish the job • Document your extensions, using the TEI tagset for tagset documentation • Write a manual using the ODD system to generate your DTD fragments http://www.hcu.ox.ac.uk/TEI/Master/Reference
Why bother? • The TEI is a well-known reference point • Using the TEI enables • sharing of data and resources • shared modular software development • lower learning curve and reduced training costs • The TEI is stable, rigorous, and well-documented • The TEI is also flexible, customizable, and extensible in documented ways • The architectural approach offers the best compromise for practical work.