Learn about encoding SBL into TEI-XML format, up-conversion, methodology, article structure, and future implementation plans for an IR system. Understand the significance and process involved in this digital initiative.
Towards a Digital Edition of the Slovenian Biographical Lexicon
Petra Vide Ogrin, Slovenian Academy of Sciences and Arts, Library
Tomaž Erjavec, Department of Knowledge Technologies, Jožef Stefan Institute
INFuture 2007, Zagreb
Overview of the talk
• SBL (publication, nature, significance)
• Methodology:
  • TEI P5
  • up-conversion into TEI-XML format
• Example of TEI-XML article structure:
  • skeleton
  • actual XML document
• Future plans: implementation of an IR system
SBL
• 15 volumes plus an index, published over a long period (1925-1991)
• Who is included: notable figures important for Slovenian cultural life, from the beginnings up to contemporary times (criteria)
• 5,031 biographical entries covering more than 5,100 persons
• Data in the articles are checked against the relevant primary sources
Methodology of encoding
• Use of open standards and software
• Use of the TEI P5 Guidelines
• Up-conversion from the OCR source into TEI-XML
• Down-conversion into XHTML (implementation of open-source DL software → full-text and advanced searching)
TEI: Text Encoding Initiative
• What is TEI?
• Why do we encode?
  • to make explicit (to a machine) what is implicit (to a person)
  • to add value by supplying annotations (structural metadata)
  • to facilitate re-use of the same material
• XML (eXtensible Markup Language):
  • international standard
  • application-, platform- and vendor-independent
  • extensible
TEI P5
• no backward compatibility with P4, but new possibilities for text encoding
• validation of an XML document: checking it against an XML schema
• an XML schema (in XML syntax) = a project-specific combination of TEI modules
• extension and generalization of the modular system
• interoperability and standards (ISO, W3C: Unicode, lang → xml:lang, id → xml:id)
• some new elements, e.g. for biographical and prosopographical data → relevant for the SBL project
Up-conversion into TEI-XML
• OpenOffice with the TEI OO package (XSLT stylesheets) → TEI-XML document (basic structure)
• (semi-)automatic encoding to achieve the needed structure:
  • Perl, XSLT
• manual intervention (correction)
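The (semi-)automatic step can be illustrated with a small sketch. The talk names Perl and XSLT as the tools actually used; Python is swapped in here purely for illustration. The abbreviation table and the function names `tag_abbreviations` and `to_tei_paragraph` are assumptions (only "Sept." is mentioned in the talk), while `<choice>`, `<abbr>` and `<expan>` are standard TEI elements for pairing an abbreviation with its expansion.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical abbreviation table; only "Sept." appears in the talk itself.
MONTH_ABBREVIATIONS = {
    "Jan.": "January",
    "Sept.": "September",
    "Dec.": "December",
}

# Match a known abbreviation only when it does not continue a longer word.
_PATTERN = re.compile(
    r"(?<!\w)(?:"
    + "|".join(
        re.escape(abbr)
        for abbr in sorted(MONTH_ABBREVIATIONS, key=len, reverse=True)
    )
    + ")"
)


def tag_abbreviations(plain_text):
    """Wrap known abbreviations in TEI <choice><abbr/><expan/></choice>."""

    def wrap(match):
        abbr = match.group(0)
        expansion = MONTH_ABBREVIATIONS[abbr]
        return "<choice><abbr>{}</abbr><expan>{}</expan></choice>".format(
            abbr, expansion
        )

    # Escape &, < and > first so the result stays well-formed XML.
    return _PATTERN.sub(wrap, escape(plain_text))


def to_tei_paragraph(plain_text):
    """Return the text as a TEI <p> element with abbreviations tagged."""
    return "<p>" + tag_abbreviations(plain_text) + "</p>"
```

A rule-based pass of this kind only covers unambiguous cases, which is why it would still need to be followed by manual correction.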
An SBL article
• Typical structure:
  • biographical entry
  • biography: data about birth, death, residence, occupation, important events (marriage, ordination etc.)
  • representative bibliography that depicts a person's life and work
• One or more paragraphs
• Encyclopaedic style: dense language, many abbreviations (bibliography, authors, general ones such as months, e.g. Sept.)
Article TEI-XML structure
<div>
  <listPerson>
    <person>
      <!--other elements for biographical data: birth, death, occupation ...-->
    </person>
  </listPerson>
  <p>
    <!--the annotated text of the article-->
  </p>
</div>
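To show how such a skeleton could be consumed once filled in, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The sample article is invented for illustration (the person, dates and `xml:id` value are hypothetical, not taken from SBL), and the choice of `<persName>`, `<birth>`, `<death>` and `<occupation>` with a `when` attribute is an assumption about how the project might fill the `<person>` element, although all of these are defined in TEI P5.

```python
import xml.etree.ElementTree as ET

# Invented sample data following the skeleton; not an actual SBL article.
SAMPLE_ARTICLE = """\
<div xmlns="http://www.tei-c.org/ns/1.0">
  <listPerson>
    <person xml:id="sbl-example-1">
      <persName>Janez Novak</persName>
      <birth when="1850-09-03">3 Sept. 1850</birth>
      <death when="1910-01-15">15 Jan. 1910</death>
      <occupation>teacher</occupation>
    </person>
  </listPerson>
  <p>Annotated article text would go here.</p>
</div>
"""

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}
# ElementTree exposes xml:id under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def extract_people(article_xml):
    """Return one dictionary of biographical data per <person> element."""
    root = ET.fromstring(article_xml)
    people = []
    for person in root.findall("tei:listPerson/tei:person", TEI):
        birth = person.find("tei:birth", TEI)
        death = person.find("tei:death", TEI)
        people.append(
            {
                "id": person.get(XML_ID),
                "name": person.findtext("tei:persName", default="", namespaces=TEI),
                "born": birth.get("when") if birth is not None else None,
                "died": death.get("when") if death is not None else None,
                "occupation": person.findtext(
                    "tei:occupation", default="", namespaces=TEI
                ),
            }
        )
    return people
```

Keeping the normalized dates in attributes and the original wording in the element content lets the dense encyclopaedic text stay untouched while the data becomes machine-readable.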
Future plans
• Implementation of an IR system for full-text and advanced searching
• Possible adoption of PhiloLogic
• Exploring automatic recognition, extraction and encoding of data
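As a rough idea of what full-text searching involves, the following sketch builds a tiny inverted index over article texts. It is a generic illustration, not how PhiloLogic or any planned SBL system works; the function names `build_index` and `search` and the sample article ids are hypothetical.

```python
import re
from collections import defaultdict


def build_index(articles):
    """Map each lowercased word to the set of article ids containing it."""
    index = defaultdict(set)
    for article_id, text in articles.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(article_id)
    return index


def search(index, query):
    """Return ids of articles that contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

Advanced searching over the encoded data (for example by birth year or occupation) would additionally rely on the structured `<person>` markup rather than on plain words.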