OSIS – A Closer Look Steven J. DeRose, Ph.D. Chair, Bible Technologies Group http://www.bibletechnologies.net sderose@acm.org November 22, 2002
Why have a standard? (first, for publishers) • Can reduce the costs of: • Editing and publication process • Software purchase, training, maintenance • Rekeying, scanning, and conversion • Lets texts survive when your WP or typesetting program goes obsolete • Facilitates multi-format, multi-platform delivery and distribution • Enables use of generic tools
Why have a standard? (next, for users) • Lets you obtain the same texts regardless of what reading and other tools you use • Because the publisher does no more work to support 10, than to support 1 • Helps texts survive when your book-reading software goes obsolete • Reduced costs • Better, more reliable resources • Enables communities of interest • Shared notes, collaborative study,…
The medium picture • [Diagram: XML/OSIS text as the hub connecting XHTML, Typeset, OCR, Braille, HTML, PDF, WPs, Open eBook, Other XML, Palmtops, and Cell delivery] • Cost savings usually start here • 4+7 convertors instead of 4×7 (and reality is bigger)
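The converter arithmetic on this slide can be checked with a short sketch. The counts 4 and 7 come from the slide; the function names are illustrative only.

```python
def direct_converters(sources: int, targets: int) -> int:
    """Without a shared format, every (source, target) pair needs its own converter."""
    return sources * targets


def hub_converters(sources: int, targets: int) -> int:
    """With a hub format, each source needs one converter in and each target one converter out."""
    return sources + targets


if __name__ == "__main__":
    print(direct_converters(4, 7))  # 28
    print(hub_converters(4, 7))     # 11
```

The gap widens as formats are added: each new delivery format costs one converter with a hub, but one per source without it.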
The basic principle: “Descriptive markup” • WPs only see “huge, bold, space before” • Now find/reformat all chapter headings • Expensive to apply a house style or look/feel • Hard to create diverse forms: • Web, paper, and braille publication • A perfect user could use stylesheets • But interfaces make inconsistent work easier • Instead: say what kind of portion each is • A formatter applies rules by kind
Why should I separate out the formatting? • It speeds your work • You can use a stylesheet from someone else, and not have to do any manual formatting • Typesetter can enhance formatting without risking corrupting your content • Therefore, less time wasted reviewing galleys • Multiple formats from the same source • Print, braille, Web, etc. • House styles for different journals • Last-minute changes are safer, cheaper • Especially crucial for Bible publishing
Why not just use HTML? • HTML is nice but lacks • Units like poem, chapter, verse, inscription • Ways to annotate for meaning, grammar, etc. • Support for reference systems: "Matt 1:1" • Multi-purpose tags like <b>, <i>, etc. • Are hard to tease apart when you need to • HTML limitations encourage using tables to force layout, making re-use infeasible • And…
Compare • <item> <desc>Cashmere sweater</desc> <price unit='yen'>120000</price> </item> <item> <desc>Socks</desc> <price unit='yen'>1000</price> </item> • versus: • <br>Cashmere sweater, ¥120000<br>Socks, ¥1000
Why is the markup better? • When relations are marked, an indexer can match price with item • If not, there is no reliable way • (there are lots of ways one might guess…) • A search for “Cashmere and ¥1000” still hits the unmarked text • Needlessly annoying the searcher • How many false hits have you had like this? • Markup is not just about formatting
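A minimal sketch of the indexing point, using Python's standard `xml.etree.ElementTree`. The `<catalog>` wrapper element and the function name are assumptions added so the two items from the previous slide form one parseable document.

```python
import xml.etree.ElementTree as ET

# The two items from the "Compare" slide, wrapped in a hypothetical <catalog> root.
MARKED_UP = """
<catalog>
  <item><desc>Cashmere sweater</desc><price unit='yen'>120000</price></item>
  <item><desc>Socks</desc><price unit='yen'>1000</price></item>
</catalog>
"""

# The same information with formatting-only markup.
FLAT = "<br>Cashmere sweater, ¥120000<br>Socks, ¥1000"


def find_items(xml_text, word, price):
    """Return descriptions of items whose own desc contains `word` and whose own price equals `price`."""
    root = ET.fromstring(xml_text)
    hits = []
    for item in root.iter("item"):
        desc = item.findtext("desc", default="")
        amount = int(item.findtext("price", default="0"))
        if word.lower() in desc.lower() and amount == price:
            hits.append(desc)
    return hits


def flat_search(text, word, price_text):
    """Naive text search: true whenever both strings occur anywhere in the text."""
    return word in text and price_text in text
```

With the marked-up version, `find_items(MARKED_UP, "cashmere", 1000)` finds nothing, because no single item has both properties. The flat search reports a hit for "Cashmere" and "¥1000" even though they belong to different products.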
How do you spell XML? • The Extensible Markup Language • HTML on steroids (sort of) • Key features: • Intrinsic support for Unicode • Ability to create your own units • Ability to validate how they are used • (no chapters inside footnotes, etc.) • Very easy for computers to process • Separates formatting (remember earlier)
OSIS and XML • OSIS is an application of XML • XML specifies the syntax • OSIS specifies a lexicon for our genre Life would be easy if natural languages were that simple! • There are many other lexica for XML • Humanities: Text Encoding Initiative • Closely related to OSIS
What is OSIS, really? • OSIS defines: • A set of XML element types • p, verse, inscription, note,… • Certain attributes for those types • type="devotional" • A standard form for Biblical references • A consistent way to write them down • A way to specify within-verse locations • A way to refer to editions and translations, or to refer to a passage generically
Concept: a hierarchy • [Diagram: element tree] • osis > osisText > header > work osisWork='KJV' > title, language, identifier • osis > osisText > div type='book' > div type='chapter' > p > verse (e.g. verse osisID='Gen.1.3') > text content, note, inscription
What's under the covers? • All of this is represented by inserting markers ("tags") into the text • Like HTML but more consistent • All starts and ends are explicit • Three kinds: • Start tags: <p> • End tags: </p> • Empty tags: <milestone/> • A start tag, its content, and its end tag together, as in <p>Jesus wept.</p>, form an element.
What else is there? • Elements can contain other elements • <div type="chapter"> <verse>In the beginning...</verse> <verse>And the Word...</verse>...</div> • Many elements can also contain text • Some elements require or prohibit others • No <div> inside <abbr> • An empty tag just marks a point • <milestone type="pb"/>
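Because every start tag needs a matching end tag, a generic XML parser can catch nesting mistakes mechanically. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper name and the two sample strings are illustrative.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# A chapter div that is opened and closed with the same element name.
GOOD = '<div type="chapter"><verse>In the beginning...</verse><milestone type="pb"/></div>'

# The same content closed with the wrong name: "chapter" is an attribute value, not an element.
BAD = '<div type="chapter"><verse>In the beginning...</verse></chapter>'

if __name__ == "__main__":
    print(is_well_formed(GOOD))  # True
    print(is_well_formed(BAD))   # False
```

This checks well-formedness only. Rules such as "no div inside abbr" are a matter of validity and need a validator working from the OSIS schema.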
Attributes • Usually modify a whole element • Appear only inside start tags • <name type="nonhuman">Baal</name> • <div type="chapter">…</div> • <verse osisID="Rev.22.21"> • <q who="God"> • <transChange type="added">
The full set of (68) tags • a • abbr • actor • caption • castGroup • castItem • castList • catchWord • cell • closer • contributor • coverage • creator • date • description • div • divineName • figure • foreign • format • head • header • hi • identifier • index • inscription • item • l • label • language • lg • list • mentioned • milestone • milestoneEnd • milestoneStart • name • note • osis • osisCorpus • osisText • p • publisher • q • rdg • reference • refSystem • relation • revisionDesc • rights • role • roleDesc • row • salute • seg • signed • source • speaker • speech • subject • table • teiHeader • title • transChange • type • verse • w • work
Don't panic • A lot of these get used once each, in the header, almost as a ritual • You can paste a sample header and fill it in • About a dozen form the Dublin Core set for cataloging and identification info • Most of the rest fall into nice groups • The hard parts (later) include • Milestones • Quotes when they cross verses/paragraphs
Three major pieces to OSIS • The markup elements and their attributes • Defined by a schema • The standardized reference system • Partly defined in the schema • Partly defined in grammar and prose • The authority system • A way to declare formal/normalized names • Declaration portion still in process
Basic OSIS markup (What's in a name?)
Sample markup <div type="testament"> <div type="book" osisID="Gen"> <div type="chapter" osisID="Gen.1"> <verse osisID="Gen.1.1">In the beginning God created the heaven and the earth.</verse> <verse osisID="Gen.1.2">And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters.</verse> <verse osisID="Gen.1.3">And God said, Let there be light: and there was light.</verse> <verse osisID="Gen.1.31">And God saw every thing that he had made, and, behold, it was very good. And the evening and the morning were the sixth day. <note type="x-StudyNote">And the evening...: Heb. And the evening was, and the morning was etc.</note></verse> </div></div></div> </osisText></osis>
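As a hedged illustration of what generic tools can do with markup like this, the sketch below pulls each verse's text out by its osisID while skipping embedded notes. It uses Python's standard `xml.etree.ElementTree`. The sample is a shortened, self-contained variant of the slide's markup with a placeholder note, and it assumes the elements carry no XML namespace; a namespaced file would need qualified tag names.

```python
import xml.etree.ElementTree as ET

# Shortened variant of the sample markup; the note text is a placeholder.
SAMPLE = """<div type="book" osisID="Gen">
  <div type="chapter" osisID="Gen.1">
    <verse osisID="Gen.1.1">In the beginning God created the heaven and the earth.</verse>
    <verse osisID="Gen.1.3">And God said, Let there be light: and there was light.<note type="x-StudyNote">placeholder note</note></verse>
  </div>
</div>"""


def verses_by_id(xml_text):
    """Map each verse's osisID to its text, leaving out notes that sit directly inside the verse."""
    root = ET.fromstring(xml_text)
    result = {}
    for verse in root.iter("verse"):
        parts = [verse.text or ""]
        for child in verse:
            if child.tag != "note":
                # Keep the text of other inline children (simplification: nested notes are not filtered).
                parts.append("".join(child.itertext()))
            parts.append(child.tail or "")
        result[verse.get("osisID")] = "".join(parts).strip()
    return result
```

Because the note is marked as a note rather than merely styled differently, leaving it out of a reading text takes one comparison.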
Big generic elements • div Testament, book, chap, section • type the type of division, as above • divTitle optional display title • title Title of any div • list Genealogies and other lists • label • item • table Mainly for appendixes, etc. • row • cell
Book/chapter/verse • Large units all use the <div> element • It has a type attribute, with values • appendix • book • chapter • concordance • glossary • As with most attributes you can add new values if they start with "x-" • <div type='x-toronto-thing'> • We expect to add more div types in time • <verse osisID="Rev.3.20"> Note: There are no separate tags for testament, book, or chapter
Small items • abbr <abbr expansion="">… • divineName <divineName>The Lord… • foreign <foreign lang="">Talitha… • hi Emphasis in notes/comm • inscription Mene, mene, tekel, parsin • mentioned The name <mentioned>Peter • name Destroyed the <name type= "nonhuman">Baals</name> • p The ubiquitous paragraph • q Quotations (more later)
Genre-specific elements • Epistolary: salute, closer • <closer>I, Paul, sign this with my own hand.</closer> • Illustrations: figure • May contain caption, note, index • Poetry: lg, l • Also used for other line-oriented text • lg (line group) can be nested • Drama: speech, speaker • speaker ok in: speech cell closer div inscription l p q salute verse • who attribute can point to a castItem in the header
Inscription <verse osisID="Dan.5.25">This is the inscription that was written: <inscription>Mene, Mene, Tekel, Parsin<note type="">Aramaic UPARSIN (that is, AND PARSIN)</note></inscription></verse> • How many inscriptions can you think of?
About the source/target layout • <milestone> • Use to mark point events • page and column breaks of a source manuscript • Intended screen breaks for display • Types: column footer header line page screen • Note: Do not confuse with milestoneStart and milestoneEnd, which stand in for several other elements when they must cross verse/p boundaries in certain ways.
About the text itself • transChange Changed in translation • Types: added amplified changed deleted moved • rdg Variant readings • Used only within notes (for now) • <note>Some ancient mss <rdg>kiss the Son</rdg></note> • seg (extensions) • w word-level linguistics • Attributes: POS, morph, lemma, gloss, src, xlit
Attributes of all elements (all are optional) • Name (Type): Meaning • osisRef (osisRefType) • annotateWork (anything): I am about W • annotateType (osisAnnotation): My relation to W • ews (anything) • ID (xs:ID): For Web to link to • lang (languageType): language, writing system • osisID (osisIDType): reference to here • resp (anything): responsible person • splitID (anything): (later) • type (anything) • subType (anything) • n (anything): name/num of unit
The reference system (I am named, therefore I am)
Header overview • Purpose • Identify the file as an XML file • Identify the file as using the OSIS schema • Say whether it's one text or a collection • Identify and declare names for: • The work itself (title, author, etc) • Other works referenced • Verse reference systems used • Characters in the text <castList>
Header sample <?xml version="1.0" encoding="UTF-8" ?> <osis xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="osisCore.1.1.xsd"> <osisText osisIDWork="KJV" osisRefWork="defaultReferenceScheme"> <header> <work osisWork="KJV"> <title>King James Version of 1769</title> <identifier type="OSIS">KJV</identifier> <language>en</language> <refSystem>Bible.KJV</refSystem></work> <work osisWork="defaultReferenceScheme"> <refSystem>Bible.KJV</refSystem></work> </header>
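A minimal sketch of reading the work declarations back out of such a header with Python's standard `xml.etree.ElementTree`. The document below repeats the slide's header with the `xsi` schema attributes left out and closing tags added so that it parses on its own; the function name is illustrative.

```python
import xml.etree.ElementTree as ET

# The slide's header, closed off so it is a complete document (xsi attributes omitted).
HEADER_DOC = """<osis>
  <osisText osisIDWork="KJV" osisRefWork="defaultReferenceScheme">
    <header>
      <work osisWork="KJV">
        <title>King James Version of 1769</title>
        <identifier type="OSIS">KJV</identifier>
        <language>en</language>
        <refSystem>Bible.KJV</refSystem>
      </work>
      <work osisWork="defaultReferenceScheme">
        <refSystem>Bible.KJV</refSystem>
      </work>
    </header>
  </osisText>
</osis>"""


def declared_works(xml_text):
    """Map each declared osisWork name to a dict of its child elements' text."""
    root = ET.fromstring(xml_text)
    works = {}
    for work in root.iter("work"):
        works[work.get("osisWork")] = {
            child.tag: (child.text or "").strip() for child in work
        }
    return works
```

Here `declared_works(HEADER_DOC)["KJV"]["title"]` gives the declared title, and both work names resolve to the same `refSystem` value.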
Other header elements • osisCorpus • Use inside <osis> when there will be several texts in one document, as for a polyglot • osisCorpus can have its own header • osisCorpus then contains osisText elements • teiHeader • Allows including a fuller TEI-style header • work • Uses the standard "Dublin Core" tags to give catalog/bibliography info
Dublin Core • title The title of the work or collection • creator The primary author • contributor Other contributors (set 'role') • identifier ISBN or similar unique ID of work • date Publication date • language Primary language of the work • rights Statement of permissions/rights • publisher Name of the publisher • description An abstract or precis of the work • format What representation (=OSIS) • coverage Intended audience and scope • relation • source If derived from another work • subject LCSH or similar subject descr • type • refSystem (OSIS only, not in D.C.)
Identifying parts of the work • osisID must be specified on any element that has a canonical reference: • <verse osisID="Luk.3.10"> • <p osisID="Rev.3.20"> • <div type="chapter" osisID="Luk.3"> • 3-letter book names, periods to separate • HTML <a name="…"> available as well • More useful in notes/commentary, not Bible • Back-of-book index entries • <index level1="Idols" level2= "burning of" level3="by Hezekiah"> • <index level1="False gods" see="Idols">
When it won't come out even • If several verses are translated as (say) a p • Put all the appropriate osisIDs on the p • <p osisID="Matt.1.1 Matt.1.2"> • If a verse is split across paragraphs • Tag each part; use splitID to number them • <p>…<verse osisID="1Pe.1.3" splitID="1">…</verse></p> <p>…<verse osisID="1Pe.1.3" splitID="2">…</verse>…</p> • milestoneStart … milestoneEnd • Used to mark units that cross boundaries • abbr closer div foreign l lg q salute seg signed speech verse
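A sketch of how a consumer might reassemble passages under these two conventions, using Python's standard `xml.etree.ElementTree`. The sample document and its text are invented for illustration. Treating `splitID` as an integer is an assumption, since the attribute table only types it as "anything".

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Invented sample: one paragraph covering two verses, and one verse split across paragraphs.
DOC = """<div type="chapter" osisID="1Pe.1">
  <p osisID="1Pe.1.1 1Pe.1.2">Two verses rendered as one paragraph.</p>
  <p>Earlier text. <verse osisID="1Pe.1.3" splitID="1">Start of verse three,</verse></p>
  <p><verse osisID="1Pe.1.3" splitID="2">and its continuation.</verse> Later text.</p>
</div>"""


def collect_passages(xml_text):
    """Map each osisID to its text, joining split parts in splitID order."""
    root = ET.fromstring(xml_text)
    pieces = defaultdict(list)
    for elem in root.iter():
        if elem.tag == "div":
            continue  # divs identify whole books or chapters, not single passages
        ids = elem.get("osisID", "").split()  # several osisIDs may share one element
        if not ids:
            continue
        order = int(elem.get("splitID", "0"))  # assumption: splitID values are integers
        text = "".join(elem.itertext()).strip()
        for osis_id in ids:
            pieces[osis_id].append((order, text))
    return {
        osis_id: " ".join(text for _, text in sorted(parts))
        for osis_id, parts in pieces.items()
    }
```

For the sample, both `1Pe.1.1` and `1Pe.1.2` map to the shared paragraph, and `1Pe.1.3` comes back as one string built from its two parts.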
References • Reference to other places/works • <note>See also <reference osisRef= "Mat.1.1">Matthew</reference> for a similar theme.</note> • div, figure, note, and reference can also directly refer: • <div type="commentary" osisRef="Luk.3.10"> • This identifies the passage this commentary div is about. • HTML <a href="…"> also available • (more useful in notes/commentary, not Bible)
Reference syntax • NIV.Heb:Psa.42.1-Psa.43.12@cp[12] • work ref: NIV.Heb (labelled refsystem and edition) • canonical ref: Psa.42.1 (book, chapter, verse) • range ref: Psa.42.1-Psa.43.12 (two canonical refs) • grain ref / finegrain ref: @cp[12] (grain type cp, grain value 12) • cp = 'code point', ~=character
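The pieces of the example reference can be pulled apart with a regular expression. The pattern below is a sketch reverse-engineered from the single example on this slide (`NIV.Heb:Psa.42.1-Psa.43.12@cp[12]`); it is not the official OSIS reference grammar, and the function and key names are hypothetical.

```python
import re

# Assumed shape: [work ":"] canonical-ref ["-" canonical-ref] ["@" grain-type "[" grain-value "]"]
REFERENCE = re.compile(
    r"^(?:(?P<work>[^:]+):)?"
    r"(?P<start>[^-@]+)"
    r"(?:-(?P<end>[^@]+))?"
    r"(?:@(?P<grain_type>\w+)\[(?P<grain_value>[^\]]+)\])?$"
)


def parse_reference(ref):
    """Split a reference string into work, start, end, and grain parts (None when absent)."""
    match = REFERENCE.match(ref)
    if match is None:
        raise ValueError(f"not a recognizable reference: {ref!r}")
    parts = match.groupdict()
    # A canonical ref such as Psa.42.1 is period-separated: book, chapter, verse.
    parts["start"] = tuple(parts["start"].split("."))
    if parts["end"] is not None:
        parts["end"] = tuple(parts["end"].split("."))
    return parts
```

Parsing the slide's example yields work `NIV.Heb`, start `("Psa", "42", "1")`, end `("Psa", "43", "12")`, grain type `cp`, and grain value `12`. A bare `Gen.1.3` yields only a start.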
Notes • Notes are placed right where they are referenced in the text. • Notes have several types • allusion alternative background citation devotional exegesis explanation study translation enumeration variant • Additional types must start with "x-" • catchWord -- marks referenced text cited within a note • <note><catchWord>hello</catchWord> may also be translated "goodbye" here.</note> • rdg -- marks alternate readings
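The note type rule is easy to check mechanically. This sketch takes the standard values from the list above and applies the "x-" convention for extensions; the function name is illustrative, and a schema validator remains the authoritative check.

```python
# Standard note types as listed on the slide.
STANDARD_NOTE_TYPES = {
    "allusion", "alternative", "background", "citation", "devotional",
    "exegesis", "explanation", "study", "translation", "enumeration", "variant",
}


def is_allowed_note_type(value: str) -> bool:
    """Accept a standard note type, or any extension type that starts with 'x-'."""
    return value in STANDARD_NOTE_TYPES or value.startswith("x-")


if __name__ == "__main__":
    print(is_allowed_note_type("study"))        # True
    print(is_allowed_note_type("x-StudyNote"))  # True
    print(is_allowed_note_type("StudyNote"))    # False
```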
On to the authority system The name is the thing, and the true name is the true thing. To know the name is to control the thing. -- Ursula LeGuin
Cast-lists • To declare cast of characters • Provides a formal ID for each • Can refer to ID from <speaker>, <q>, etc. • castList • castGroup • castItem • actor • role • roleDesc
The authority system • Only supported for castList at present • We intend to provide • A schema for declaring sets of formal names • A way to invoke such lists in documents • Standard name sets for • Bible versions • Versification schemes • People, places, etc. in the Bible • Journals, classical literature, and other works commonly cited in Biblical studies
OSIS in practice Tourist to police officer: Can you tell me how to get to Carnegie Hall? Officer to tourist: Practice, practice, practice.
How do I know if the markup is correct? • 5 levels of 'correct': • SLipshod: no check required • Only well-formed: load in IE 5+ • Valid: xp, xmetal, and other true validators • Accurate: requires human proofreading and interpretation • Complete: there is always more that could be marked up
Tools vs. today • Today we will use the raw form • Experts will need to know this • Users should have protective software • Some XML editing programs: • SoftQuad XMetal -- $300 • Open Office -- free, very promising • Some generic-enough HTML editors: • BBEdit, emacs, Netscape Communicator
Getting to OSIS • The cleaner your data, the easier it is • Data is seldom as clean as you think it is • Structured formats (USFM, XSEM, LGM, ThML) are the easiest sources • Tools: • Perl/awk/sed/cc and the like • XSLT if coming from XML • BTG has sponsored development of several convertors. • BTG will maintain a repository of utilities
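As a rough illustration of the conversion step, the sketch below turns a simple line-per-verse source into nested OSIS divs and verses. The input format (`chapter:verse text`) is hypothetical and far cleaner than most real data, and the function name is invented. It builds the output with Python's standard `xml.etree.ElementTree`, so special characters are escaped automatically.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical source format: "1:1 In the beginning ..."
LINE = re.compile(r"^(\d+):(\d+)\s+(.+)$")


def lines_to_osis(book, lines):
    """Build a book div with chapter divs and verse elements; return it as an XML string."""
    book_div = ET.Element("div", {"type": "book", "osisID": book})
    chapter_divs = {}
    for line in lines:
        match = LINE.match(line.strip())
        if match is None:
            # Dirty data should fail loudly rather than be dropped silently.
            raise ValueError(f"unrecognized line: {line!r}")
        chapter, verse, text = match.groups()
        if chapter not in chapter_divs:
            chapter_divs[chapter] = ET.SubElement(
                book_div, "div", {"type": "chapter", "osisID": f"{book}.{chapter}"}
            )
        verse_elem = ET.SubElement(
            chapter_divs[chapter], "verse", {"osisID": f"{book}.{chapter}.{verse}"}
        )
        verse_elem.text = text
    return ET.tostring(book_div, encoding="unicode")
```

Calling `lines_to_osis("Gen", ["1:1 In the beginning...", "1:2 And the earth..."])` returns one book div holding one chapter div with two verses. Raising on unrecognized lines reflects the slide's warning that data is seldom as clean as you think it is.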
Getting your OSIS XML to display in IE • Make sure the document is at least WF • Name it filename.xml • Refer to a stylesheet if you want formatting instead of just an outline view <?xml version="1.0"?><!DOCTYPE osis []><?xml-stylesheet href="mystyle.css" type="text/css"?><osis xmlns="http://www.bibletechnologies.org/namespaces/OSIS-1.1"><header>…
Getting your OSIS printed • Most typesetting programs now import XML • OSIS converts easily to most relevant XML schemas, using XSLT • Word processors are also gaining ability to import arbitrary XML • Typesetting firms, esp. for journals, are starting to accept XML as well.
Near-term concerns of OSIS • Linguistic annotation • Formal name lists for people, places, translations, etc. • Connecting text to multimedia • Greater support for secondary genres • Tool development and conformance
How you can help • Find the best place to apply OSIS in your organization, and do it. • Join a Working Group • Send feedback, feature requests, etc. • Join a Working Group • Convert or create OSIS texts • Join a Working Group • Create a converter for your current format • Join a Working Group • Tell your friends and colleagues • Join a Working Group