Research and Instructional Support, LITS Mount Holyoke College French 331, Fall 2005 Creating an Electronic Edition of an Original 18th Century Manuscript -- Mémoires de la comtesse de L… Shaoping Moss Monday, Oct. 3, 2005
Today’s Topics • An electronic edition: • what an electronic edition is and why create one • Significance of Manuscripts • Technologies used behind the scenes • Markup languages: SGML, XML and HTML • Stylesheets: XSLT • TEI -- Guidelines and DTDs • Group Project • Encoding Project Objectives • Text interpretation and markup of the manuscript
What and Why an Electronic Edition? • An electronic edition -- a transcription of a text, which can • be encoded as an object of study for literary, linguistic, historical, or related purposes. • be searched and manipulated by computer programs in many different ways. • facilitate and expand access. • facilitate the long-term preservation of the original form of the materials.
Significance of Manuscripts The term 'manuscript' simply means “written by hand.” Works written by hand by authors, artists, scientists, and others not only contain invaluable information for studying the genesis, meaning, and reception of their work, but also help us reconstruct and better understand the society and mentality of the time in which they lived. In addition, manuscripts throw light on economics, psychology, politics, and the social sciences, as well as the history and philosophy of science.
What does Encoding a Text Mean? • The purpose of encoding a document is to embed intelligence in the text in such a way that the computer program can derive information from it. • The information embedded in the text is variously called encoding, markup, or tagging.
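As a minimal sketch of what embedded markup looks like, the fragment below tags a personal name and a place name in one sentence. The sentence is invented for illustration and is not taken from the manuscript. The `<name>` element with a `type` attribute follows TEI Lite usage, but the `type` values chosen here are assumptions.

```xml
<!-- Hypothetical sentence, not a transcription of the manuscript -->
<p>
  <name type="person">Madame de Valmont</name> left
  <name type="place">Paris</name> in the spring.
</p>
```

With tags like these in place, a program can list every person or place mentioned in a text, something it could not do reliably with plain, untagged text.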
What’s a Document? • A document is: • A set of information presented to the reader in different forms and media: books, web pages, magazines, articles, advertisements. • A collection of small elements, which can be headings, paragraphs, quotations, etc. • Structure versus Format • Structure concerns the content of a document. • Format concerns the way a document looks.
Sample Digital Collections • The Newton Papers Project http://www.newtonproject.ic.ac.uk/ The Newton Project aims to create a printed edition of Newton's theological, alchemical and administrative writings and an electronic edition of all his writings, including his correspondence. Sample Transcriptions: http://www.newtonproject.ic.ac.uk/texts/cul3996_d.html • The Adams Family Papers: An Electronic Archive http://www.masshist.org/digitaladams/aea/ • Five College Archives & Manuscript Collections -- uses XML (EAD) to improve the searching capabilities of archival finding aids http://asteria.fivecolleges.edu/index.html
Markup Languages • Address the structure of a document. • Identify different components of the document. • A set of symbols that can be placed in the text of a document to define and label the parts of the document. • Convey information to software that will allow it to: • determine the functions and boundaries of document parts. • index the data for searching. • render the data (e.g. for screen display or print). • transform the data (e.g. for a voice synthesizer) for some output device(s).
Development of Markup Languages • SGML -- Standard Generalized Markup Language (‘86) • Initiated by Charles Goldfarb at IBM in the 1960s • Adopted as a standard of the International Organization for Standardization (ISO 8879) in 1986 • HTML -- Hypertext Markup Language (‘91) • Developed by Tim Berners-Lee at CERN, a physics lab near Geneva, Switzerland, around 1991 • XML -- eXtensible Markup Language (‘98) • A Web standard published by the World Wide Web Consortium (W3C) as a Recommendation in 1998
SGML and Its Subdivisions • SGML is a toolkit for developing specialized markup languages. • SGML is composed of tag-set building rules. • SGML has given birth to other sets of subdivisions: • HTML and XML • CALS for the U.S. Department of Defense • BOEING for commercial airlines • C-H for publishing • OED for the Oxford English Dictionary • TEI guidelines for the Text Encoding Initiative • EAD for Encoded Archival Description
HTML: Good v. Bad Good: Its simplicity contributed to the rapid growth of the World Wide Web in the 1990s. XHTML 1.0 is the latest HTML standard. Bad: Loosely written HTML is harder for browsers to handle. Tags are predefined in HTML. Format and content are mixed, so content is hard to reuse, e.g. <H1>My First XML</H1> <H2>Introduction to XML</H2> <b><FONT SIZE=2><P>What is HTML?….</P></FONT SIZE=2>
What is XML? http://www.w3.org/XML/ • XML stands for eXtensible Markup Language. • XML was designed to describe data. • XML tags are not predefined. • You define your own tags when using XML. • XML separates format from content and semantic structure, e.g. <title>What is XML?</title> <chapter>Introduction to XML</chapter> • Data encoded in XML can function much like a traditional database. • XML content can be output in many formats, such as XHTML, text, Word documents, PDF, etc.
A Sample XML Document
<?xml version="1.0" encoding="ISO-8859-1" ?>
<booklist>
  <book>
    <booktitle>Project Cool Guide to XML for Web Designers</booktitle>
    <author>Teresa A. Martin</author>
    <country>USA</country>
    <publisher>John Wiley and Sons</publisher>
    <price>25.00</price>
    <year>1999</year>
  </book>
  …
</booklist>
Transformation of the XML Document [Diagram; labels: XSLT file, Word file]
XSLT • XSLT -- eXtensible Stylesheet Language Transformations • A markup language and programming syntax for processing XML data • Contains a set of template rules that define what information is taken out of the XML document and how it is structured in the output • Is most often used to: • Transform XML to HTML for delivery to standard Web clients or wireless devices • Transform XML from one structure to another • Convert XML data into other output formats: text, Word document, PDF, etc.
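The idea of template rules can be sketched with a small stylesheet for the sample book list shown earlier. This is an illustrative example rather than the stylesheet used in the project: it assumes the element names from the sample (`booklist`, `book`, `booktitle`, `author`, `year`), and the page title and HTML layout are arbitrary choices.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: turns the sample <booklist> into a simple HTML list -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="html" encoding="UTF-8"/>

  <!-- Template rule for the root element: build the page frame -->
  <xsl:template match="/booklist">
    <html>
      <head><title>Book List</title></head>
      <body>
        <h1>Book List</h1>
        <ul>
          <!-- Apply the rule below to each book, ordered by title -->
          <xsl:apply-templates select="book">
            <xsl:sort select="booktitle"/>
          </xsl:apply-templates>
        </ul>
      </body>
    </html>
  </xsl:template>

  <!-- Template rule for each book: take out only title, author and year -->
  <xsl:template match="book">
    <li>
      <em><xsl:value-of select="booktitle"/></em>
      <xsl:text>, </xsl:text>
      <xsl:value-of select="author"/>
      <xsl:text> (</xsl:text>
      <xsl:value-of select="year"/>
      <xsl:text>)</xsl:text>
    </li>
  </xsl:template>

</xsl:stylesheet>
```

Note that the stylesheet leaves out `country`, `publisher` and `price`: the XML file keeps all the data, and each stylesheet decides what to show. If this were saved under a hypothetical name such as booklist.xsl, an XML file could point to it with an `<?xml-stylesheet href="booklist.xsl" type="text/xsl"?>` instruction placed before the root element.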
Markup Languages in Libraries • EAD for Encoded Archival Descriptions http://www.loc.gov/ead/ • The Dublin Core Metadata http://dublincore.org/ • MARC XML - MARC 21 XML Schema http://www.loc.gov/standards/marcxml/ • MODS XML - Metadata Object Description Schema http://www.loc.gov/standards/mods/
Markup Languages in Academics • TEI -- guidelines and DTDs http://www.tei-c.org/Guidelines2/index.htm • Bioinformatic Sequence Markup Language (BSML) • Mathematical Markup Language (MathML)
What is TEI? • Initially launched in 1987, the Text Encoding Initiative (TEI) is an international and interdisciplinary standard for encoding, keeping and analyzing textual content & structure of digital texts. This standard is designed for use with a broad range of text types. Now it is widely used in libraries, archives, and by publishers and researchers for online research and teaching and for the storage and exchange of large and small text collections. http://www.tei-c.org/
TEI Guidelines • The TEI encoding system was originally built upon Standard Generalized Markup Language (SGML) and shifted to XML in 2002. The system is described in the TEI Guidelines. It is modular and flexible, including basic modules for prose, poetry, drama, speech, lexicography, and terminology. These modules can be combined in various ways to suit a great number of text-encoding purposes. http://www.tei-c.org/Guidelines2/index.html
TEI Lite • http://www.tei-c.org/Lite/ (documentation) • http://www.tei-c.org/Lite/DTD/ (download the files) • TEI Lite is a simplified ‘starter set’ of TEI elements, defined in a simple DTD. It includes most of the core tags, basic structural components, and an adequate set of header elements. It is a good starting point for simple encoding projects, has proved very popular, and serves about 85% of its users’ needs.
DTD -- Document Type Definition • A DTD is a computer-readable text file that defines a markup language for a particular type of document, such as a poem, a novel, or an archival finding aid. • Its purpose is to define the document structure with a list of legal elements -- a root element, parent and child elements, and where data can be placed. • It lays out the logical structure of the data. • It establishes rules about which elements a document may have, which are required, which can repeat, etc. • A DTD can be declared inline in your XML document, or as an external reference.
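To make these rules concrete, here is a sketch of a DTD declared inline for a book list like the earlier sample. The element names come from that sample, but the content rules (for example, that `author` may repeat and `country` is optional) are assumptions made for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: the content rules below are illustrative assumptions -->
<!DOCTYPE booklist [
  <!-- Root element: must contain one or more book elements -->
  <!ELEMENT booklist (book+)>
  <!-- Each book: children in this order; author may repeat,
       country is optional, the rest are required exactly once -->
  <!ELEMENT book (booktitle, author+, country?, publisher, price, year)>
  <!-- Leaf elements hold character data only -->
  <!ELEMENT booktitle (#PCDATA)>
  <!ELEMENT author    (#PCDATA)>
  <!ELEMENT country   (#PCDATA)>
  <!ELEMENT publisher (#PCDATA)>
  <!ELEMENT price     (#PCDATA)>
  <!ELEMENT year      (#PCDATA)>
]>
<booklist>
  <book>
    <booktitle>Project Cool Guide to XML for Web Designers</booktitle>
    <author>Teresa A. Martin</author>
    <country>USA</country>
    <publisher>John Wiley and Sons</publisher>
    <price>25.00</price>
    <year>1999</year>
  </book>
</booklist>
```

In the rules, `+` means "one or more", `?` means "optional", and a plain name means "exactly once". For the external form, the same declarations would be saved in a separate file (a hypothetical booklist.dtd) and referenced with `<!DOCTYPE booklist SYSTEM "booklist.dtd">`, which lets many documents share one set of rules.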
TEI Document Format • All TEI documents follow the same essential format: • TEI header -- documents the bibliographic information about the electronic edition being created. • TEI body -- contains the content being created.
Relationships in a TEI Document
<TEI.2>
  <teiHeader></teiHeader>
  <text>
    <body>
    </body>
  </text>
</TEI.2>
• <TEI.2> is the parent element of <teiHeader> and <text>. • <teiHeader> and <text> are sibling elements. • <TEI.2> is an ancestor element of <body>.
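The same structure can be sketched with a minimal header and body filled in. All header and body text below is placeholder wording, not the project's actual header; the DTD path is the one used in the encoding example for this course, and the choice of header elements follows the usual TEI Lite file description.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE TEI.2 SYSTEM "DTD/teixlite.dtd">
<!-- Sketch only: all header and body text below is placeholder wording -->
<TEI.2>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Mémoires de la comtesse de L…: an electronic edition</title>
      </titleStmt>
      <publicationStmt>
        <p>Placeholder: who is publishing this electronic edition.</p>
      </publicationStmt>
      <sourceDesc>
        <p>Placeholder: description of the original manuscript.</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>Placeholder for the transcribed text.</p>
    </body>
  </text>
</TEI.2>
```

The header describes the electronic edition and its source, while everything inside `<body>` is the transcription itself.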
The Encoding Example Sample TEI markup for Mémoires de la comtesse de L… http://www.mtholyoke.edu/courses/smoss/TEI_projects/french331/example.html
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE TEI.2 SYSTEM "DTD/teixlite.dtd">
<?xml-stylesheet href="example.xsl" type="text/xsl"?>
<TEI.2> … </TEI.2>
Encoding Project Objectives Encoding Mémoires de la comtesse de L… is an act of analysis and interpretation. It presents intellectual challenges that bring us closer to the text and thus help us better understand the author's work, life, and social environment.
Group Project: Let’s have Fun!! • October 3: • Overview of XML/TEI technology • Hands-on encoding exercises: themes, personal and place names • November 7: • In-class demo by Shaoping: encoding images • November 28: • In-class demo by Shaoping: encoding translations of selected words • December 5 and 14: • Class demo of each group project (Note: Students will have to make an appointment with Alexandra for help with encoding problems.)
Contact Info Shaoping Moss Information Technology Consultant Research and Instructional Support Mount Holyoke College Email: smoss@mtholyoke.edu Phone: (413) 538-3034 Alexandra Balan Tech Mentor abalan@MtHolyoke.edu