Digital Library Technologies at the Grainger Library. William H. Mischo, Timothy W. Cole, Tom Habing w-mischo@uiuc.edu Grainger Engineering Library Information Center University of Illinois at Urbana-Champaign National Digital Archives Project Office of Taiwan March 25, 2002.
Outline • IR Tools and Full-Text • Distributed Information Environment. • Illinois Projects. • XML Technologies. • Metadata Technologies. • DOIs, Linking, Local Resolver • OAI • Portals, Simultaneous Search, Linking • Issues & Trends.
Overview • We now have the tools to pursue the grand challenges of information retrieval: • Standard retrieval environment (Web) and interface/client (Web Browser). • Standardized search/retrieval mechanisms (HTTP Post/Get, SQL, Z39.50). • Standard language for describing and transforming content and metadata (XML, XSLT, DC, DCQ, RDF, Schemas). • Standard transport mechanisms to connect heterogeneous content (HTTP, SOAP, OAI). • Candidate set of ‘best practices’ for IR.
The Digital Library • ‘Digital’, ‘Virtual’, ‘Electronic’ Library as network-based library without regard to place and time. • Tendency to apply term to collections and resources. • Digital Collections vs. Digital Library. • Emphasis on the integration of collections and services (NSDL). • Application of standards and protocols is important.
Full-Text Technologies • Continuum of Web-Enabled technologies -- all presently being utilized. • Evolving technologies and standards. • Role and history of markup. • XML: its role and importance. • The Smart Document.
Scholarly Communication Overview • E-Resources are Web-based and publisher-centric. • Growth of Heterogeneous Distributed Repositories. • Value-added services and ‘branding’ of journals. • Prestige of Journals and Publishers • Reciprocal linking relationships between publishers. • Cooperation on linking standards (DOI, CrossRef). • Alternative publishing models - Academia, Preprint Servers, disintermediation.
Distributed Information Model • Diverse information environment in which we operate. • Multiple elements, relationships and nodes. • Need for gateway, interface, and navigation tools. • Need for document representation, transmission, linking, and retrieval middleware tools and standards. • Role of A & I Services.
Distributed Repository Issues • Integration of discrete publisher repositories, locally loaded full-text, local and remote A & I services, OPAC, Web resources, and local data. • Issues for user access: • need to identify appropriate publisher repository, but presently interfaces are different and full-text and controlled vocabulary searching often not offered. • A & Is: not full-text but offer controlled vocabulary, no links to full-text repositories.
Distributed Repository - Needs • Integration of discrete publisher repositories, locally loaded full-text, local and remote A & I services, OPAC, Web resources, and local data. • Support simultaneous searching of A & I Services, Distributed Repositories, OPACs, Web search engines, local files. Integrate TOC, full-text. • Remote Reference 24 X 7. • Metadata harvesting, archiving. • Local Resolver services for locally loaded or Aggregator Resources.
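The simultaneous-search need listed above can be sketched as a broadcast search: one query is sent to every target in parallel and the hits are grouped by source. This is only an illustration in Python. The three search functions are hypothetical stand-ins that return canned strings, not the project's actual gateway software.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins for real search targets; each returns canned hits.
def search_ai_service(query):
    return [f"A&I record matching '{query}'"]


def search_opac(query):
    return [f"OPAC record matching '{query}'"]


def search_repository(query):
    return [f"Full-text article matching '{query}'"]


TARGETS = {
    "A&I": search_ai_service,
    "OPAC": search_opac,
    "Repository": search_repository,
}


def broadcast_search(query):
    """Send the same query to every target at once and group hits by source."""
    with ThreadPoolExecutor(max_workers=len(TARGETS)) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in TARGETS.items()}
        return {name: future.result() for name, future in futures.items()}


results = broadcast_search("mathml")
```

In a real gateway each target would need its own query translation and result normalization, which is where most of the integration effort would likely go.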
Illinois Testbed Project • Funded under DLI-I by NSF, DARPA, and NASA, 1994--1998. Awards made to 6 universities. • Large-scale Testbed, Distributed Repository models, evaluation, Web software. • Funded under CNRI D-Lib Test Suite Program, 1998--2001. • Collaborating Partners Program. AIP, APS, ASCE, IEE, NRL, ASM, ACM, NTT Learning Systems, Elsevier. • All-XML Journal -- AIP, APS, ACM.
Illinois Testbed • American Institute of Physics--APL, JAP, RSI • 18,000+ articles, 1995--. • American Physical Society--PRL • 14,000+ articles, 1995--, weekly updates. • ASCE Journals (25 titles) • 10,000+ articles, 1995--. • IEE Proceedings and Electronics Letters • 8,500+ articles, 1993--. • IEEE Computer Society. • ASM International (formerly American Society for Metals) Handbook. • ACM (Association for Computing Machinery) Transactions. • Elsevier Science.
Project Issues • Evolution of the Document. • Distributed information environment. • Use of Metalanguages & Transformations (SGML, XML). • Searching over full-text of journals vs. document surrogates in A & I format. • Rendering and styling (SGML, XML, MathML). • Dynamic metadata for normalization, linking. • Breadth and depth of collections. • User needs.
Accomplishments • Process & retrieve from multiple publishers & heterogeneous DTDs. • Metadata specification that uses RDF, Dublin Core (DCQ, DC Agents) Schemas, IDLI Namespace. • Cross-repository searching (Testbed & D-LIB Test Suite). Full-Text and Metadata. • SGML to XML Conversion. • XSLT, CSS, for transformation & rendering, including Mathematics.
Accomplishments (2) • Linking: Forward/Backward within Testbed, from/to A & I Services. • Conversion of ISO 12083 math markup to MathML. • Enhanced Web retrieval mechanisms: Author Word Wheels, Co-Occurrence Matrices. • Detailed user transaction logs, gathered at the search argument level, with identification of the characteristics of each user's search session. • Local Link Server for DOIs, Context-Sensitive linking.
Accomplishments (3) • CSS/DHTML Math rendering techniques, TechExplorer integration. Two international math conferences. • Simultaneous search within DeLiver of Testbed repositories, A & Is, NCSTRL. • Local Link Server and Appropriate Copy Issues. • Simultaneous search of A & Is, OPAC, Google, Local resources with integrated reference linking using OpenURL and DOIs from A & Is. • Open Archives Initiative (OAI).
Ongoing Investigations (1) • Support simultaneous searching of A & I Services, Distributed Repositories, enhanced navigation, expanded gateway functions. • Interoperability models, e.g., Metadata harvesting vs. Federated (Broadcast). • OAI Provider and Harvesting software. OAI EAD and Cultural Heritage collection and retrieval system. • HTTP harvesting, Spider technology (gathering).
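For the harvesting side of OAI, a harvester issues plain HTTP requests whose query string carries a verb and its arguments. The sketch below only builds such a request URL in Python and makes no network call. The base URL is a hypothetical placeholder, and `ListRecords` with `metadataPrefix=oai_dc` is the standard OAI-PMH request for unqualified Dublin Core records.

```python
from urllib.parse import urlencode

# Hypothetical repository endpoint; substitute a real OAI base URL.
BASE_URL = "http://repository.example.org/oai"


def oai_request(verb, **arguments):
    """Build an OAI harvesting request URL from a verb and its arguments."""
    return BASE_URL + "?" + urlencode({"verb": verb, **arguments})


# Ask the provider for all records in unqualified Dublin Core.
list_url = oai_request("ListRecords", metadataPrefix="oai_dc")
```

The provider answers with an XML document, so the harvester side would typically hand the response to one of the XML parsers discussed later in the deck.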
Ongoing Investigations (2) • Archiving. • Local Link Server with context-sensitive resources. • Reference Linking integration built on OpenURL and DOI. • NSDL presence. • Reference Assistant software with simultaneous search, point-of-contact assistance, and remote reference capability.
XML (eXtensible Markup Language) • Like SGML, a Data Description Language (Metalanguage). • Subset/version of SGML. • Allows fine-granularity markup of content and structure. Authors can define their own elements (extensible). • Tags define the structure of the document, not its presentation format. • Validated vs. “well-formed” - separation of authoring process from representation & presentation. • Either validated against a DTD/Schema or well-formed. • Compatible with relational DBs.
XML and Publishers • Seybold Seminars Publishing 2000, Boston, February 2000. • Tim Gill of Quark, “…the use of XML could lead to a drop in the cost of Web publishing by 30% to 50% and a significant reduction in the time it takes to produce sites.” • Gill: “I don’t believe that there is any innovation in print that is going to save us even 10% in costs.” • Issues and Challenges remain. • Publishers are looking at the all-XML journal.
XML Features • The milestones in document description and transmission: ASCII, TCP/IP, HTTP and HTML, XML. Web Programmability. • DTD not required with XML. Needed if internal entities. • Use of Document Object Model (DOM). • Technology approach from Web developer’s standpoint: XML data, CSS presentation layer, XSLT to transform the structure (‘view’) of the data/document.
Role of XML • “If you ask 20 people in the industry, ‘What is XML?’ you’ll get 20 different answers.” -- Dale Fuller, CEO, Inprise Corporation. • Vendor-neutral, platform-independent structured information standard. • Document representation and interchange standard. • Applications can externalize their data/metadata as XML. • Issues with full-text representation: PDF, XML/HTML. Value in indexing, retrieval.
XML Parser APIs: Tree-Based and Event-Based • DOM (Document Object Model). • DOM Level 1 and Level 2 W3C recommendation. Widely implemented, Tree-Based. Hierarchy of nodes. Loads entire document into memory. Level 2 adds namespace support, traversal, stylesheets, events, triggers. Level 3 working draft. DOM HTML candidate. Parsers allow developers to iterate through documents, change document content. • SAX (Simple API for XML). • Open-source, XML-DEV, not W3C. Event-based, fires events as it reads document, need not load entire document into memory. Good for single-pass processing. Xerces, XML4C, Sun Project X (Crimson).
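The difference between the two API styles can be illustrated with Python's standard library, used here purely for demonstration (the slide itself names Xerces, XML4C, and Crimson). Both halves of the sketch extract the same author names from a small document modeled on the author-group markup used elsewhere in this deck. The DOM half builds the whole tree first, while the SAX half reacts to events as the parser reads.

```python
import xml.sax
from xml.dom import minidom

DOC = "<ag><au>Tom</au><au>Tim</au><au>Bob</au></ag>"

# Tree-based (DOM): the whole document is loaded, then queried.
dom = minidom.parseString(DOC)
dom_authors = [au.firstChild.data for au in dom.getElementsByTagName("au")]


# Event-based (SAX): callbacks fire as the parser reads; no tree is kept.
class AuthorHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.authors = []
        self._in_au = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "au":
            self._in_au = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so collect it until the end tag.
        if self._in_au:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "au":
            self.authors.append("".join(self._buffer))
            self._in_au = False


handler = AuthorHandler()
xml.sax.parseString(DOC.encode("utf-8"), handler)
sax_authors = handler.authors
```

The SAX handler needs more bookkeeping for the same result, which is the usual trade-off for not holding the document in memory.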
XML Linking • XML Base http://www.w3.org/TR/xmlbase • Permits use of relative URI path prefixes. Can then shorten references. • XLink http://www.w3.org/TR/xlink/ • Method for specifying navigational links. Allows enforcement of specific path order through links. xlink:type=“simple” corresponds to HTML <a> or <img> tags. • XInclude http://www.w3.org/TR/xinclude • Copies entire XML documents or selected portions into current document. Candidate recommendation. Uses XPath and XPointer to specify document elements to include. • XPointer http://www.w3.org/TR/xptr • Uses XPath to identify portion of a document. Permits string searches and range specifiers.
XML Schema and Structure • DTD • Original schema representation, defines structural rules for a class of XML documents. • XML Schema http://www.w3.org/XML/Schema • Also sets out standardized structure for class of XML documents. Is coded in XML, can be parsed and edited with standard software. Two separate parts: structures and datatypes. • Namespaces http://www.w3.org/TR/REC-xml-names/ • Allows developers to qualify element and attribute names with unique URIs, avoids recognition errors.
XML Implementations • XHTML, SVG (Scalable Vector Graphics), XForms (similar to HTML forms). • MathML http://www.w3.org/Math/ • Markup language for describing mathematics, both presentation and content. • RDF http://www.w3.org/RDF/ • Resource Description Framework. Defines structure for encoding object metadata. Facilitates metadata interchange & harvesting. RDF Schemas. • Others: DocBook, XML ISO12083, Open eBook, WAP/WML.
Searching and Transformation • XPath http://www.w3.org/TR/xpath • Defines pattern-matching syntax used by XSLT and XPointer. Method for selecting data in a document. MSXML 3.0 supports XPath. Supersedes XPatterns. Example: /descendant-or-self::node()/child::name • XSL • Includes transformations and formatting objects (FO). FO will replace CSS for document formatting. • XSLT http://www.w3.org/TR/xslt • Mechanism for encoding style rules, ensures consistent rendering of XML documents of the same type. • XML Query http://www.w3.org/XML/Query • Response to limitations of XPath. Would bring database-style queries to XML documents.
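The XPath expression on this slide, /descendant-or-self::node()/child::name, is the unabbreviated form of //name: it selects every name element at any depth. A small Python sketch shows the same selection. Note the assumptions: the sample document is invented for illustration, and Python's ElementTree implements only a subset of XPath, so the abbreviated relative form .//name is used instead of the full axis syntax.

```python
import xml.etree.ElementTree as ET

# Invented sample document with <name> elements at different depths.
DOC = """<article>
  <front><name>Habing</name></front>
  <body><sect><name>Cole</name></sect></body>
</article>"""

root = ET.fromstring(DOC)

# ".//name" selects every <name> descendant of the root, in document order.
names = [element.text for element in root.findall(".//name")]
```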
Remote Object Access • SOAP (Simple Object Access Protocol) • Microsoft, IBM, Sun. Allows applications to invoke objects or functions residing on remote servers. Creates request block in XML. • XML-RPC http://www.xmlrpc.com/ • Remote procedure calling using HTTP as the transport and XML as the encoding. Open, but not standard protocol; widely adopted. • Web Services.
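To make "XML as the encoding" concrete, the sketch below uses Python's standard xmlrpc.client module to serialize a call into the XML request body and then decode it again, without contacting any server. The method name sample.add and its two arguments are hypothetical.

```python
import xmlrpc.client

# Encode a call to a hypothetical remote method as an XML-RPC request body.
request_xml = xmlrpc.client.dumps((2, 3), methodname="sample.add")

# A server would decode the same body back into parameters and method name.
params, method = xmlrpc.client.loads(request_xml)
```

Printing request_xml shows a methodCall element containing the method name and one typed value per parameter. Over the wire, this body would be sent as the payload of an HTTP POST.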
Remote Object Access • Web Services: • Based on XML, SOAP, UDDI (Universal Description, Discovery, and Integration), and WSDL (Web Services Description Language). Applications are assembled on the fly in XML, exposed to the world, and accessed via the Web from different devices. • Supported by Microsoft .net, IBM WebSphere, SUN ONE.
XML, XSLT, and CSS • Use XML full-text articles as ordered hierarchy of content objects. • Generate item-level metadata in XML, using RDF and Dublin Core syntax and semantics. • XSLT and CSS used to present metadata and articles in either XML or HTML format depending on Browser. • Mathematics rendering using MathML tools (conversion from ISO 12083 to MathML). • Real-time transformation between XML and HTML using XSLT (scalability issues).
XSLT Where Should It Happen • Client-side • IE5+ only • Not Netscape 6 or Mozilla (yet) • IE5 not yet fully compliant w/ XSLT and XPath standard • Can reduce the load on your servers • But performance on low-end clients can be BAD • Server-side • Performance could be a problem on busy servers, serving large, complex documents • More control & flexibility over the conversion (metamerge) • Offline Preconversion • Best performance • Not best for dynamic documents (metamerge)
Converting XML to HTML (XSLT) • Simple one-to-one conversions: <sect> becomes <span class="sect"> • span.sect {display:block;margin-left:2em} • Attribute-based conversions: <emph type="1"> becomes <span class="emph_1"> • span.emph_1 {font-style:italic} • Generated text, such as punctuation: <ag><au>Tom</au><au>Tim</au><au>Bob</au></ag> becomes Tom, Tim, Bob. • Rearranged children: <au><sn>Habing</sn><fn>Tom</fn></au> becomes Tom Habing
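The four kinds of conversion rule on this slide can be mimicked in a few lines of Python. This is not XSLT and not the project's stylesheet. It is a hand-written recursive converter, offered only to show what each rule does to its input. The element names come from the slide, and everything else (function names, the fallback for unknown elements) is an assumption.

```python
import xml.etree.ElementTree as ET
from html import escape


def inner(element):
    """Convert an element's text and children, preserving mixed content."""
    parts = [escape(element.text or "")]
    for child in element:
        parts.append(convert(child))
        parts.append(escape(child.tail or ""))
    return "".join(parts)


def convert(element):
    """Apply the slide's four kinds of rule to one element."""
    if element.tag == "ag":
        # Generated text: commas between authors and a closing period.
        return ", ".join(convert(au) for au in element.findall("au")) + "."
    if element.tag == "au":
        if element.find("sn") is not None:
            # Rearranged children: first name before surname.
            return escape(f"{element.findtext('fn')} {element.findtext('sn')}")
        return inner(element)
    if element.tag == "sect":
        # Simple one-to-one conversion.
        return f'<span class="sect">{inner(element)}</span>'
    if element.tag == "emph":
        # Attribute-based conversion: the type attribute picks the CSS class.
        return f'<span class="emph_{element.get("type")}">{inner(element)}</span>'
    return inner(element)  # Unknown elements: keep content, drop the tag.


authors = convert(ET.fromstring("<ag><au>Tom</au><au>Tim</au><au>Bob</au></ag>"))
full_name = convert(ET.fromstring("<au><sn>Habing</sn><fn>Tom</fn></au>"))
section = convert(ET.fromstring('<sect>See <emph type="1">this</emph>.</sect>'))
```

In XSLT each branch would be a separate template matched by element name, which keeps the rules declarative rather than buried in one function.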
Converting XML to HTML (cont.) • Some elements are converted into HTML elements other than <span> or <div> • Figures are converted to <img src="…"> tags. • Internal links with ID and IDREF attributes are usually converted into HTML anchor tags. • Table elements are converted into corresponding HTML <table>, <tr>, or <td> tags. • ‘Real’ DTDs require some fairly complex processing. • So far XSLT seems to be able to handle nearly every case we have come across • However, some cases have required JScript extensions to XSLT
Schemas vs. DTDs • Both are systems of representing a data model that defines the data’s elements and attributes, and the relationship among elements. • Schema addresses limitations of DTDs and the increasingly data-oriented role of XML. • Initial Arbortext, DataChannel, Inso, Microsoft, and Univ of Edinburgh proposal: XML-Data. • W3C XML Schema Working Group: two documents: XML structures and datatypes.
Schema Justification • Description of document type’s structure should be in an XML document instead of written in special syntax (DTD). • Schema are in XML: easier to edit and process using standard XML DOM manipulation tools. • DTD notation doesn’t allow schema designers the power to impose strong data typing -- for example, the ability to say that a certain element type must always have a positive integer value, that it may not be empty, or that it must be one of a list of possible choices.
Metadata and Linking Standards • Digital Object Identifier (DOI) and Persistent Object Identifiers. • OpenURL and Value-Added Service Components (SFX). • Open Archives Initiative (OAI), Dublin Core and Qualifiers. • Local Resolver Servers.
Metadata in DLI • To normalize & augment presentation. • To normalize searching (e.g. Names). • To store dynamic links. • Types of links: • Articles referenced by the item (Backward). • Articles that reference the item (Forward). • A & I Records for references and items. • Other relationships (TOC, Other items by Author, Collaborative Data). • Known item and presumptive linking.
DLI Metadata Schema • Maintained as XML files using RDF and Qualified Dublin Core syntax and semantics. • Example: <dcq:issued> <!-- subproperty/refinement of DC Date --> <dcq:W3CDTF> <!-- DC Date encoding --> <rdf:value>1999-09</rdf:value> </dcq:W3CDTF> </dcq:issued> • Application of XML DOM for processing at DC or idli level.
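Processing such a record at the DC level might look like the following Python sketch, which reads the issue date out of the dcq:issued fragment shown on this slide. The slide does not give namespace URIs, so the dcq URI below is a placeholder chosen only so that the fragment parses. The rdf URI is the standard RDF syntax namespace.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    # Placeholder: the slide does not state the DC Qualifiers namespace URI.
    "dcq": "http://example.org/dcq#",
}

# The slide's fragment, with namespace declarations added so it is parseable.
FRAGMENT = f"""<dcq:issued xmlns:dcq="{NS['dcq']}" xmlns:rdf="{NS['rdf']}">
  <dcq:W3CDTF>
    <rdf:value>1999-09</rdf:value>
  </dcq:W3CDTF>
</dcq:issued>"""

issued = ET.fromstring(FRAGMENT)

# Follow the encoding-scheme wrapper down to the literal date value.
date_text = issued.findtext("dcq:W3CDTF/rdf:value", namespaces=NS)
year, month = (int(part) for part in date_text.split("-"))
```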
New DLI Metadata Schema <dc:creator> <rdf:Seq> <rdf:li> <dca:Person rdf:ID="AUTHOR-1"> <dca:agentname> <dca:FNF> <rdf:value>L'Ecuyer, Pierre</rdf:value> </dca:FNF> </dca:agentname> <dca:agentaffiliation>Université de Montréal Département...</dca:agentaffiliation> <dca:agentidentifier rdf:resource="mailto:lecuyer@iro.umontreal.ca" /> </dca:Person> </rdf:li> ….. </rdf:Seq> </dc:creator>
Digital Object Identifier (DOI) • DOI is both a unique identifier of a piece of digital content AND a system to access that content digitally. Persistent object identifier. • ‘The ISBN for the 21st Century’ -- Norman Paskin. • DOI system has two main parts (the identifier and a directory system) and a third logical component, a database. • Developed by AAP (Association of American Publishers), now managed by International DOI Foundation.
DOI Construction • First real open standard for content identification. • DOI is a number that identifies a digital object: • 10.1063/S000369519903216 • 10 Registration Agency Prefix • 1063 Publisher Prefix • S000369519903216 Suffix (Publisher-assigned ID) • Suffix can be SICI or PII. • The DOI and the URL pointing to the digital object are registered with the International DOI Foundation, e.g.: • 10.1063/333 | http://www.pubsite.org/apr99/artl1.pdf
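The breakdown above translates directly into a small parsing routine. The Python sketch below splits a DOI into the three parts named on the slide. The field names and the error handling are illustrative choices, and the split happens at the first slash so that any later slash stays inside the publisher-assigned suffix.

```python
from typing import NamedTuple


class DOI(NamedTuple):
    agency_prefix: str     # "10" in the slide's example
    publisher_prefix: str  # "1063" in the slide's example
    suffix: str            # publisher-assigned ID


def parse_doi(doi):
    """Split a DOI string into the parts named on the slide."""
    prefix, slash, suffix = doi.partition("/")
    agency, dot, publisher = prefix.partition(".")
    if not (slash and dot and agency and publisher and suffix):
        raise ValueError(f"not a well-formed DOI: {doi!r}")
    return DOI(agency, publisher, suffix)


example = parse_doi("10.1063/S000369519903216")
```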
Using a DOI • DOIs are resolved using the Handle System technology from CNRI (Corporation for National Research Initiatives). • Retrieval of object is a two-step process: link is sent to central directory where current Web address is stored, location is sent back to browser with special message to redirect to address, e.g.: • dx.doi.org/10.1063/333 redirects to www.pubsite.org/apr99/artl1.pdf
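The two-step retrieval can be simulated without any network access. In the Python sketch below, a dictionary stands in for the central directory and the function returns what the browser would receive: a redirect status and the current location. The single directory entry is the slide's own example. The use of status 302 for the "special message" is an assumption, as the slide does not name the status code.

```python
# Stand-in for the central directory: DOI -> current Web address.
DIRECTORY = {"10.1063/333": "http://www.pubsite.org/apr99/artl1.pdf"}


def resolver_url(doi):
    """Step 1: the link the browser follows to reach the central directory."""
    return "http://dx.doi.org/" + doi


def resolve(doi):
    """Step 2: the directory's answer, as (HTTP status, redirect location)."""
    location = DIRECTORY.get(doi)
    if location is None:
        return 404, None
    return 302, location  # 302 is assumed; the slide does not name the code.


link = resolver_url("10.1063/333")
status, location = resolve("10.1063/333")
```

Because only the directory entry changes when a publisher moves a file, links that use the DOI keep working, which is the persistence property claimed for the identifier.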
Reference Linking • Alternatives to DOI: • PubMed/PubRef (National Library of Medicine) • PubSCIENCE (DOE/OSTI) • Proprietary Link Managers (AIP, APS) • CrossRef Project: major Sci-Tech professional societies and commercial publishers. • System design calls for one URL for each DOI; underlying technology can handle multiple URLs however.
Local Resolver • Issue: Directing users to locally held or licensed version of Digital Object (locally loaded or from Aggregator). • Harvard problem, Appropriate Copy problem. • Additional desire to direct users to local value-added services: local print holdings, interlibrary borrowing, other articles in A & I Services.
Local Resolver • Local Resolver Servers • OpenURL Protocol, CookiePusher vs. IP Addresses. • Demonstration Project at Illinois, OhioLink (Ex Libris SFX), Los Alamos. • Localizing Name Resolution for AIP, ASCE, Elsevier, other publishers. • Use of CrossRef Metadata Database for identifying Publisher from DOI and linking to Local Copy, A & I Services, Library Assistance.
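An OpenURL carries citation metadata in the query string of a link addressed to the user's own resolver, which then decides where the appropriate copy lives. The Python sketch below builds such a link. The resolver address is hypothetical, and the key names (id with a doi: prefix, genre) reflect the draft OpenURL syntax as best understood here, so they should be checked against the resolver actually in use.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical address of the institution's local resolver.
LOCAL_RESOLVER = "http://resolver.example.edu/openurl"


def openurl_for(citation):
    """Attach citation metadata to the local resolver's base URL."""
    return LOCAL_RESOLVER + "?" + urlencode(citation)


# Key names are assumptions based on the draft OpenURL syntax.
link = openurl_for({"id": "doi:10.1063/333", "genre": "article"})

# The resolver side recovers the same fields from the query string.
received = parse_qs(urlsplit(link).query)
```

Pointing the link at a local resolver rather than at the publisher is what lets the library route users to a locally loaded or aggregator copy, or to services such as print holdings.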