LIS 450EP Case Study: The Illinois Digital Library Initiative Project

Timothy W. Cole, William H. Mischo (t-cole3@uiuc.edu, w-mischo@uiuc.edu) • Grainger Engineering Library Information Center, University of Illinois at Urbana-Champaign • http://dli.grainger.uiuc.edu/Publications/WHMischo/LIS450EP/

Outline • Digital Libraries, Publishers, XML, & the Scholarly Information Environment. • The Illinois DLI / D-Lib Testbed Project. • XML Technologies in Journal Publishing. • Current work: linking, metadata, metasearch, & the Open Archives Initiative Protocol for Metadata Harvesting.

References • Cole, Timothy W., William H. Mischo, Thomas G. Habing, and Robert H. Ferrer. "Using XML and XSLT to Process and Render Online Journals," Library Hi Tech 19, no. 3 (2001): 210-222. Available: http://dx.doi.org/10.1108/07378830110405067 • Shreeves, Sarah L., Joanne S. Kaczmarek, and Timothy W. Cole. "Harvesting Cultural Heritage Metadata Using the OAI Protocol," Library Hi Tech 21, no. 2 (2003): 159-169. Available: http://dx.doi.org/10.1108/07378830310479802 • Lagoze, Carl and Herbert Van de Sompel. "The Making of the Open Archives Initiative Protocol for Metadata Harvesting," Library Hi Tech 21, no. 2 (2003): 118-128. Available: http://dx.doi.org/10.1108/07378830310479776 • XML Schemas for Qualified Dublin Core: see bottom of Web page at http://www.dublincore.org/schemas/xmls/

Overview • We now have the tools to pursue the grand challenges of Information retrieval: • Standard retrieval environment (Web) and interface/client (Web Browser). • Standardized search/retrieval mechanisms (HTTP Post/Get, SQL, Z39.50, OAI). • Standard language for describing and transforming content and metadata (XML, XSLT, XML Schemas). • Standard interoperability mechanisms to connect heterogeneous content (HTTP, SOAP, OAI).

XML and Publishers • Tim Gill of Quark, “…the use of XML could lead to a drop in the cost of Web publishing by 30% to 50% and a significant reduction in the time it takes to produce sites.” • Gill: “I don’t believe that there is any innovation in print that is going to save us even 10% in costs.” • AIP all-XML Journal • Issues and Challenges remain. • Use of XML behind the scenes commonplace

XML and Publishers • Vendor-Neutral, platform-independent structured information standard. • Document representation & interchange standard. • Applications can externalize their data/metadata as XML. • Based on Document Object Model (DOM), std. OOP-style components (XSLT, CSS, …) • Issues with full-text representation: PDF, XML/HTML. Value in indexing, retrieval.

The Digital Library • ‘Digital’, ‘Virtual’, ‘Electronic’ Library as network-based library without regard to place and time. • Digital Collections vs. Digital Library. • Tendency to call collections & resources DLs. • IMLS Framework of Guidance for Building Good Digital Collections • Emphasis on the integration of collections and creation of DL services (e.g., NSDL). • Application of standards and protocols enables and facilitates development of services.

Scholarly Communication Overview • Web-based E-Resources still publisher-centric. • Not user-centric or topic-centric • Growth of Heterogeneous Distributed Repositories. • Value-added services and ‘branding’ of journals. • Prestige of Journals and Publishers • Reciprocal linking relationships between publishers. • Cooperation on linking standards (DOI, CrossRef). • Alternative publishing models - Academia (e.g., SPARC), Preprint Servers, disintermediation.

Full-Text Technologies • Continuum of Web-Enabled technologies presently being utilized. • Evolving technologies and standards. • Role and history of markup. • Increasing role and importance of XML. • Towards a “Smart Document”

Distributed Repositories • Current Resources: • publisher repositories; A & Is (remote and local); course management systems; OAI and preprint servers; Web search engines; vendor portals; institutional repositories • Goal for distributed repositories: Integration of discrete publisher repositories, locally loaded full-text, local and remote A & I services, OPAC, Web resources, and local data.

Distributed Repository - Needs • Support simultaneous searching of A & I Services, Distributed Repositories, OPACs, Web search engines, local files. Integrate TOC, full-text. • Remote Reference 24 X 7. • Metadata harvesting • Digital archiving. • Local Resolver services for locally loaded or Aggregator Resources.

Illinois Testbed Project • Funded under DLI-I by NSF, DARPA, and NASA, 1994--1998. Awards made to 6 universities. • Large-scale Testbed, Distributed Repository models, evaluation, Web software. • Funded under CNRI D-Lib Test Suite Program, 1998--2001. • Collaborating Partners Program: AIP, APS, ASCE, IEE, NRL, ASM, ACM, NTT Learning Systems, Elsevier. • All-XML Journals: AIP, APS, ACM.

Illinois Testbed • American Institute of Physics--APL, JAP, RSI • 18,000+ articles, 1995--. • American Physical Society--PRL • 14,000+ articles, 1995--, weekly updates. • ASCE Journals (25 titles) • 10,000+ articles, 1995--. • IEE Proceedings and Electronics Letters • 8,500+ articles, 1993--. • IEEE Computer Society. • ASM International (formerly American Society for Metals) Handbook. • ACM (Association for Computing Machinery) Transactions. • Elsevier Science.

Project Issues • Evolution of the Document. • Distributed information environment. • Use of Metalanguages & Transformations (SGML, XML). • Searching over full-text of journals vs. document surrogates in A & I format. • Rendering and styling (SGML, XML, MathML). • Dynamic metadata for normalization, linking. • Breadth and depth of collections. • User needs.

Accomplishments • Process & retrieve from multiple publishers & heterogeneous DTDs. • Metadata specification that uses RDF, Qualified Dublin Core, XML Schemas, XML Namespaces. • Cross-repository searching (Testbed & D-LIB Test Suite). Full-Text and Metadata. • SGML to XML Conversion. • XSLT, CSS, for transformation & rendering, including Mathematics.

Accomplishments (2) • Linking: Forward/Backward within Testbed, from/to A & I Services. • Conversion of ISO 12083 math markup to MathML; rendering of MathML. • Enhanced Web retrieval mechanisms: Author Word Wheels, Co-Occurrence Matrices. • Detailed user transaction logs, gathered at the search argument level, with identification of the characteristics of each user's search session. • Simultaneous search within DeLiver of Testbed repositories, A & Is, NCSTRL, …

Ongoing Investigations • Support federated/broadcast searching of A & I Services, Distributed Repositories, enhanced navigation, expanded gateway functions. • Interoperability models, e.g., Metadata harvesting vs. Federated (Broadcast) • Z39.50 protocols, HTTP harvesting, Spider technology (gathering). • E-Journal Archiving (AIP). • Local link server with context-sensitive resources. • MathML & other ENTS (Essential Non-Text Stuff)

XML Parser APIs: Tree-Based and Event-Based • DOM (Document Object Model for XML & HTML). • DOM Level 1 and Level 2 W3C recommendation. Widely implemented, Tree-Based. Hierarchy of nodes. Loads entire document into memory. Level 2 adds namespace support, traversal, stylesheets, events, triggers. Level 3 W3C candidate recommendation. Parsers allow developers to iterate through documents, change document content. • SAX (Simple API for XML). • Open-source, not W3C. Initially Java-based. Event-based, fires events as it reads document, need not load entire document into memory. Good for single-pass processing. Xerces, XML4C, Sun Project X (Crimson), MSXML.
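
The contrast between the two API styles can be sketched with Python's standard-library parsers. This is only an illustration, not the Testbed's code: the sample markup reuses the `<ag>`/`<au>` author-group elements that appear later in these slides, and the function and class names are made up for the example.

```python
import xml.dom.minidom
import xml.sax

SAMPLE = "<ag><au>Tom</au><au>Tim</au><au>Bob</au></ag>"


def authors_dom(xml_text):
    """Tree-based: parse the whole document into memory, then walk the nodes."""
    doc = xml.dom.minidom.parseString(xml_text)
    return [node.firstChild.data for node in doc.getElementsByTagName("au")]


class AuthorHandler(xml.sax.ContentHandler):
    """Event-based: the parser calls these methods as it reads the stream."""

    def __init__(self):
        super().__init__()
        self.authors = []
        self._in_au = False
        self._buf = []

    def startElement(self, name, attrs):
        if name == "au":
            self._in_au = True
            self._buf = []

    def characters(self, content):
        # Text may arrive in several chunks, so buffer it until the end tag.
        if self._in_au:
            self._buf.append(content)

    def endElement(self, name):
        if name == "au":
            self.authors.append("".join(self._buf))
            self._in_au = False


def authors_sax(xml_text):
    handler = AuthorHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.authors
```

The DOM version is shorter but holds the full tree; the SAX version keeps only the state it needs, which is why it suits single-pass processing of large documents.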

XML Schema and Structure • DTD • Original schema representation, defines structural rules for a class of XML documents. Inherited from SGML. • XML Schema http://www.w3.org/XML/Schema • W3C recommendation. Also sets out standardized structure for class of XML documents. Is coded in XML, can be parsed and edited with standard software. Two separate parts: structures and datatypes. • Namespaces http://www.w3.org/TR/REC-xml-names/ • W3C recommendation (1.1 candidate in work) Allows developers to qualify element and attribute names with unique URIs, avoids recognition errors.
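
A small sketch of why namespaces avoid recognition errors: two elements share the local name `title` but are told apart by their namespace URIs. The Dublin Core namespace URI is the real one; the `http://example.org/local` namespace and the record content are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET

NS = {
    "dc": "http://purl.org/dc/elements/1.1/",  # Dublin Core element set
    "loc": "http://example.org/local",         # hypothetical local vocabulary
}

RECORD = """<record xmlns:dc="http://purl.org/dc/elements/1.1/"
                    xmlns:loc="http://example.org/local">
  <dc:title>Using XML and XSLT to Process and Render Online Journals</dc:title>
  <loc:title>internal working title</loc:title>
</record>"""

root = ET.fromstring(RECORD)

# Same local name, different namespaces: each lookup matches exactly one element.
dc_title = root.find("dc:title", NS).text
local_title = root.find("loc:title", NS).text
```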

XML, XSLT, and CSS • Use XML full-text articles as ordered hierarchy of content objects. • Generate item-level metadata in XML, using RDF and Dublin Core syntax and semantics. • XSLT and CSS used to present metadata and articles in either XML or HTML format depending on Browser. • Mathematics rendering using MathML tools (conversion from ISO 12083 to MathML). • Real-time transformation between XML and HTML using XSLT (scalability issues).

XML Linking • XML Base http://www.w3.org/TR/xmlbase • W3C recommendation. Permits use of relative URI path prefixes. Can then shorten references. • XLink http://www.w3.org/TR/xlink/ • W3C recommendation. Method for specifying navigational links. Allows enforcement of specific path order through links. xlink:type=“simple” corresponds to HTML <a> or <img> tags. May be used with XPointer. • XInclude http://www.w3.org/TR/xinclude • W3C working draft. Copies entire XML documents or selected portions into current document. Uses XPath and XPointer to specify document elements to include. Unlike XML external entities, no DTD is required. • XML Pointer Language http://www.w3.org/XML/Linking • Composed of multiple W3C recommendations and working drafts. A language to be used for fragment identifier in XML. Uses XPath. Permits string searches and range specifiers.

Searching and Transformation • XPath http://www.w3.org/TR/xpath • W3C recommendation. Defines pattern-matching syntax used by XSLT and XPointer. Method for selecting data (e.g. nodes, attributes, …) in a document. • XSL-FO http://www.w3.org/TR/xsl/ • W3C recommendation. FO similar to CSS but more powerful for XML document formatting. • XSLT http://www.w3.org/TR/xslt • W3C recommendation. (2.0 working draft) Mechanism for transforming XML documents. Can be used for normalization of XML documents from different schemas. • XML Query http://www.w3.org/XML/Query • Composed of multiple W3C working drafts. Designed to bring database-style queries to XML documents.
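
As a rough illustration of XPath-style selection, the sketch below uses Python's `xml.etree.ElementTree`, which implements only a limited subset of XPath 1.0. The element names (`au`, `sn`, `fn`, `sect`, `emph`) are borrowed from the conversion examples elsewhere in these slides; the sample article itself is invented.

```python
import xml.etree.ElementTree as ET

ARTICLE = """<article>
  <ag>
    <au><sn>Habing</sn><fn>Tom</fn></au>
    <au><sn>Cole</sn><fn>Tim</fn></au>
  </ag>
  <sect><emph type="1">Results</emph> are shown.</sect>
</article>"""

root = ET.fromstring(ARTICLE)

# Select nodes by path: every surname inside the author group.
surnames = [sn.text for sn in root.findall("./ag/au/sn")]

# Select by attribute value: emphasis elements of type 1, at any depth.
emphasized = [e.text for e in root.findall(".//emph[@type='1']")]

# Select by child content: first name of the author whose surname is Habing.
habing_first = root.find(".//au[sn='Habing']/fn").text
```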

Converting XML to HTML (XSLT) • Simple one-to-one conversions: <sect> becomes <span class="sect"> • span.sect {display:block;margin-left:2em} • Attribute-based conversions: <emph type="1"> becomes <span class="emph_1"> • span.emph_1 {font-style:italic} • Generated text, such as punctuation: <ag><au>Tom</au><au>Tim</au><au>Bob</au></ag> becomes Tom, Tim, Bob. • Rearranged children: <au><sn>Habing</sn><fn>Tom</fn></au> becomes Tom Habing
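
The four conversion patterns on this slide can be mimicked in a few lines of Python. This is a stand-in written with the standard library, not XSLT and not the project's stylesheets; the `render` function and its rule order are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from html import escape


def render(elem):
    """Recursively convert one source element into an HTML string."""
    if elem.tag == "ag":
        # Generated text: commas between authors and a closing period.
        return ", ".join(render(au) for au in elem.findall("au")) + "."
    if elem.tag == "au" and elem.find("sn") is not None:
        # Rearranged children: emit first name before surname.
        return f"{escape(elem.findtext('fn', ''))} {escape(elem.findtext('sn', ''))}"
    if elem.tag == "au":
        return escape(elem.text or "")
    if elem.tag == "emph":
        # Attribute-based conversion: the type attribute becomes part of the class.
        css = f"emph_{elem.get('type')}"
    else:
        # Simple one-to-one conversion: the element name becomes the class.
        css = elem.tag
    inner = escape(elem.text or "")
    for child in elem:
        inner += render(child) + escape(child.tail or "")
    return f'<span class="{css}">{inner}</span>'
```

For example, `render(ET.fromstring("<au><sn>Habing</sn><fn>Tom</fn></au>"))` yields `Tom Habing`, and styling is then left to CSS rules such as `span.emph_1 {font-style:italic}`.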

XSLT Where Should It Happen • Client-side • IE5+, Netscape 7+/Mozilla • Not Netscape 6 and earlier • IE5 not fully compliant w/ XSLT and XPath standard • Can reduce the load on your servers • But performance on low-end clients can be BAD • Server-side • Performance could be a problem on busy servers, serving large, complex documents • More control & flexibility over the conversion (metamerge) • Offline Preconversion • Best performance • Not best for dynamic documents (metamerge)

Remote Object Access • Web Services: • Based on XML, SOAP (Simple Object Access Protocol – W3C), UDDI (Universal Description, Discovery, and Integration), and WSDL (Web Services Description Language). Applications are assembled on the fly in XML, exposed to the world, and accessed via the Web from different devices. • Supported by Microsoft .net, IBM WebSphere, SUN One. • OCLC looking at implementing Web Services (e.g., for Name Authority lookup)

Schemas vs. DTDs • Both are systems of representing a data model that defines the data’s elements and attributes, and the relationship among elements. • Schemas add namespaces, address limitations of DTDs & facilitate data-typing. • W3C XML Schema Working Group: two documents: XML structures and datatypes. • Alternatives to XML Schema: RELAX NG, Schematron.

Examples from DLI / D-Lib • ACM Search • XML & XSLT for layered views of content (publisher.toc, journal.toc, XSLT, HTML) • Transforms of SGML to MathML (png image, SGML math, MathML) • On the fly XML to HTML • Transforms of Qualified DC to Simple DC (Qualified, Simple, XSLT, Alt. XSLT)

Linking & Metadata Aggregation • Digital Object Identifier (DOI) and CrossRef. • OpenURL and Value-Added Service Components (SFX, Encompass). • Local Resolver Servers. • OAI-PMH, Dublin Core (DC) & Qualified DC.

Metadata in DLI • To normalize & augment presentation. • To normalize searching (e.g. Names). • To store dynamic links. • Types of links: • Articles referenced by the item (Backward). • Articles that reference the item (Forward). • A & I Records for references and items. • Other relationships (TOC, Other items by Author, Collaborative Data). • Known item and presumptive linking.
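
One way to picture the two link directions: backward links come straight from each article's reference list, and forward links are the same data inverted. The sketch below assumes a simple dictionary of article IDs; the IDs and the function name are hypothetical and say nothing about how the Testbed actually stored its link metadata.

```python
from collections import defaultdict


def forward_links(backward):
    """Invert backward links (article -> articles it cites) into
    forward links (article -> articles that cite it)."""
    forward = defaultdict(list)
    for citing, cited_ids in backward.items():
        for cited in cited_ids:
            forward[cited].append(citing)
    return dict(forward)


# Hypothetical article IDs: A cites B and C, D cites B.
backward = {"A": ["B", "C"], "D": ["B"]}
forward = forward_links(backward)
```

Because forward links change whenever a newly loaded article cites an older one, they are a natural fit for the dynamic metadata the slide mentions.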

Digital Object Identifier (DOI) • DOI is both a unique identifier of a piece of digital content AND a system to access that content digitally. Persistent object identifier. • ‘The ISBN for the 21st Century’ -- Norman Paskin. • DOI system has two main parts (the identifier and a directory system) and a third logical component, a database. • Developed by AAP (Association of American Publishers), now managed by International DOI Foundation. • 5 million+ DOI records in CrossRef.

DOI Construction • First real open standard for content identification. • DOI is a number that identifies a digital object: • 10.1063/S000369519903216 • 10 Registration Agency Prefix • 1063 Publisher Prefix • S000369519903216 Suffix (Publisher-assigned ID) • Suffix can be SICI or PII. • The DOI and the URL pointing to the digital object are registered with the International DOI Foundation, e.g.: • 10.1063/333 | http://www.pubsite.org/apr99/artl1.pdf
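
The DOI anatomy above can be sketched as a small parser. The dictionary keys follow the slide's own labels, the resolver host `dx.doi.org` is the one used in the reference list, and the function names are hypothetical.

```python
from urllib.parse import quote


def split_doi(doi):
    """Split a DOI such as 10.1063/S000369519903216 into its labeled parts."""
    prefix, slash, suffix = doi.partition("/")
    if not slash or not suffix:
        raise ValueError(f"not a DOI (missing suffix): {doi!r}")
    agency, dot, publisher = prefix.partition(".")
    if not dot or not publisher:
        raise ValueError(f"not a DOI (malformed prefix): {doi!r}")
    return {
        "registration_agency_prefix": agency,
        "publisher_prefix": publisher,
        "suffix": suffix,  # publisher-assigned ID, e.g. a SICI or PII
    }


def resolver_url(doi):
    """Build the proxy URL that redirects to the registered location."""
    return "http://dx.doi.org/" + quote(doi, safe="/")
```

Only the part before the first slash has fixed structure; everything after it is opaque and left to the publisher.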

Reference Linking • Alternatives to DOI: • Proprietary Link Managers (AIP, APS). • Even then, most still use DOIs as well. • CrossRef Project: major Sci-Tech professional societies and commercial publishers. • 252 members • 9.3 million registered items (journal articles & conference papers). • Appropriate Copy Problem (OhioLink, Los Alamos, NRL).

Local Resolver • Issue: Directing users to locally held or licensed version of Digital Object (locally loaded or from Aggregator). • Appropriate Copy problem. • Additional desire to direct users to local value-added services: local print holdings, interlibrary borrowing, other articles in A & I Services. • Special Services • http://g118.grainger.uiuc.edu/linker/
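
A local resolver is typically addressed with an OpenURL: citation metadata packed into a query string that the resolver maps to the appropriate copy. The sketch below assumes OpenURL 0.1-style keys (`genre`, `title`, `volume`, `issue`, `spage`, `date`, `id`) and a hypothetical resolver address; it is not a description of the Illinois link server's actual interface. The citation is the first entry in the reference list.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Hypothetical resolver base URL, used only for illustration.
RESOLVER = "http://resolver.example.edu/linker"


def build_openurl(base_url, citation, doi=None):
    """Encode citation metadata (and optionally a DOI) as an OpenURL query."""
    params = {key: value for key, value in citation.items() if value}
    if doi:
        params["id"] = f"doi:{doi}"
    return f"{base_url}?{urlencode(params)}"


citation = {
    "genre": "article",
    "title": "Library Hi Tech",
    "volume": "19",
    "issue": "3",
    "spage": "210",
    "date": "2001",
}
url = build_openurl(RESOLVER, citation, doi="10.1108/07378830110405067")
```

The resolver, not the citing publisher, then decides whether to send the user to a locally loaded copy, an aggregator, or print holdings.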

[Diagram: local link resolution. Labels: Client (Web Browser); DOI Proxy; Handle Server (dx.doi.org/10.1063/1234); OpenURL; Nosfx=y cookie on client; Aware; publishers AIP, IEE, Elsevier; Local AIP, IEE; Local Value Added; Illinois Local Link Server; DOI; CrossRef Metadata Database; Metadata; UIUC Metadata Registry.]

Open Archives Initiative (OAI) • Version 1 released Jan ‘01, V.2 released June ‘02 • Mechanism for data providers to expose their metadata through an HTTP protocol and a mechanism for harvesting records containing metadata from repositories. • Roots in e-print archives. • Lightweight, low-barrier. Easy to implement on standard Web servers to handle OAI protocol requests; need to incorporate into workflow used to create / maintain metadata.

OAI Continued • Requires repositories to support the Dublin Core schema as lowest common denominator. • Allows communities to expose metadata in other formats as long as records are structured as XML data with corresponding XML schema. • Application for discipline specific portals, institutional repositories, NSDL, IMLS • Over 250 OAI 2.0 metadata providers. • http://oai.grainger.uiuc.edu/registry • OAI extensions in development: • OAI Static Repository Gateway • OAI Rights
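
To make the lowest-common-denominator idea concrete, here is a sketch that reads an unqualified Dublin Core (`oai_dc`) record into a dictionary. The two namespace URIs are the standard ones; the record content is assembled from the first entry in the reference list purely as sample data, and `dc_fields` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

RECORD = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                       xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Using XML and XSLT to Process and Render Online Journals</dc:title>
  <dc:creator>Cole, Timothy W.</dc:creator>
  <dc:creator>Mischo, William H.</dc:creator>
  <dc:date>2001</dc:date>
  <dc:identifier>http://dx.doi.org/10.1108/07378830110405067</dc:identifier>
</oai_dc:dc>"""


def dc_fields(xml_text):
    """Collect Dublin Core elements into {element name: [values]}.
    Every DC element is optional and repeatable, hence the lists."""
    fields = {}
    for child in ET.fromstring(xml_text):
        namespace, _, local_name = child.tag[1:].partition("}")
        if namespace == DC_NS:
            fields.setdefault(local_name, []).append(child.text or "")
    return fields


fields = dc_fields(RECORD)
```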

How OAI Works • OAI “verbs”: Identify, ListMetadataFormats, ListSets, ListIdentifiers, ListRecords, GetRecord. • Service Provider (harvester) sends an HTTP request (OAI verb) to the Metadata Provider (repository). • Repository returns an HTTP response (valid XML).
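
The request/response cycle can be sketched in Python without touching the network: one function builds the harvester's HTTP GET URL for a verb, another pulls identifiers out of a response. The base URL `http://example.org/oai` and the sample response are made up for illustration; the verb names, the `metadataPrefix` argument, and the OAI-PMH 2.0 namespace are those of the protocol.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
VERBS = {"Identify", "ListMetadataFormats", "ListSets",
         "ListIdentifiers", "ListRecords", "GetRecord"}


def oai_request_url(base_url, verb, **arguments):
    """Build the URL a harvester would fetch for one OAI-PMH request."""
    if verb not in VERBS:
        raise ValueError(f"unknown OAI verb: {verb}")
    return f"{base_url}?{urlencode({'verb': verb, **arguments})}"


# Trimmed, invented response body; a real response carries more elements.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>oai:example.org:1</identifier></header>
    <header><identifier>oai:example.org:2</identifier></header>
  </ListIdentifiers>
</OAI-PMH>"""


def identifiers(xml_text):
    """Extract record identifiers from a ListIdentifiers response."""
    root = ET.fromstring(xml_text)
    return [elem.text for elem in root.iter(f"{{{OAI_NS}}}identifier")]
```

A harvester would typically issue `ListRecords` with `metadataPrefix="oai_dc"` and store the returned records for local indexing.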

Metadata Schemas Used By OAI Metadata Providers

Illinois-Mellon OAI Project • Funded to create a web portal to scholarly information resources in cultural heritage harvested via OAI-PMH • Primary objectives: • Build harvesting and search service • Investigate viability and utility of searching OAI harvested resources • Explore issues of advanced search/indexing/display • Document user needs & usage patterns • Identify critical issues and best practices for using OAI-PMH with cultural heritage material

Technical achievements (Mellon) • Developed harvesting tools (OpenSource) • Refined data provider tools (OpenSource) • Investigated logistics and scalability of harvesting activities • Created XSL stylesheets for metadata transformations • Experimented w/configurations for scalability and performance issues

Metadata aggregation (Mellon) • 39 providers (OAI-compliant and surrogates) • Metadata describing resources of 580 institutions • 1.1 million original records • 2.6 million including item-level records derived from EAD finding aids

Type of resources (Mellon) • Hidden web • Other includes: • archival collections • websites • moving images • audio • 30% of metadata describes digitized objects (of any type)

DC element usage (Mellon) • Records containing subject & description element • Many different controlled and local vocabularies in use • Granularity: a record may describe a collection of coins — or one coin

Related ongoing & future work • Test usability with targeted user community • Linking resources • Including linking using MathML • Simultaneous search, automated metadata generation, & automated metadata normalization • NSF National Science Digital Library Projects • Mathematics resources & MathML • Combining sci-tech journals with other Web resources • Additional OAI Implementations • IMLS NLG • CIC • DLF - DODL

Open Issues • Role of Authors, Academic Institutions, Libraries, Publishers, Abstracting & Indexing Services. • Disintermediation may affect both Libraries and Publishers. • Information as Function not Place. • Provide ‘Digital Library’ services built atop digital collections. • Role of XML technology. • Service mechanisms: processing & archiving, search and discovery, presentation, linking.
