The Mellon-Funded Fedora Project A Briefing for the Los Alamos National Laboratory August 26, 2002. Sandy Payette Cornell Information Science. Motivation. The Problem of Complex Content. Some familiar objects. Digital Library Content not just documents. Complex, compound, dynamic objects.
The Mellon-Funded Fedora Project: A Briefing for the Los Alamos National Laboratory, August 26, 2002. Sandy Payette, Cornell Information Science
Motivation: The Problem of Complex Content
Some familiar objects • Digital Library Content: not just documents ... • Complex, compound, dynamic objects
Key Research Questions • How can clients interact with heterogeneous collections of complex objects in a simple and interoperable manner? • How can complex objects be designed to be both generic and genre-specific at the same time? • How can we hide the complexity of an object’s underlying data structures and relationships from clients? • How can we associate services and tools with objects to provide different presentations or transformations of the object content? • How can we associate specialized, fine-grained access control policies with specific objects, or with groups of objects?
The Flexible Extensible Digital Object Repository Architecture (FEDORA) • Developed as a DARPA and NSF-funded research project at Cornell (1997-present) • CORBA-based reference implementation • Extensive interoperability testing • Policy Enforcement • Interpreted and re-implemented at University of Virginia (1999) • Simple web-oriented implementation, focused on access to collections • Java servlet and relational database • Virginia prototype supported testbed of 10,000,000 digital objects with very good results (1999-2001) • Andrew W. Mellon Foundation granted Virginia and Cornell $1,000,000 to develop a full-featured production FEDORA system that is web-based (2002+)
FEDORA Original Research Goals • Flexibility: object model that fits many different contexts • Management: of distributed digital content and services • Access: stable interfaces to digital objects; behavior-centric • Interoperability: among digital objects and repositories • Extensibility: easy evolution of object behaviors • Security: rights management and access control • Preservation: of content, plus “look and feel”
Model for Collaboration Digital Library Research and Real Library Requirements • University of Virginia developing extensive digital collections since 1992 • Virginia Digital Library R&D Group chartered with finding solution for integration • Formal Requirements analysis • Search for commercial products • Discovery: Cornell research parallels stated requirements
Virginia Requirements: Managing the Collections • Scalability to support hundreds of millions of objects • Persistent unique names for all resources without respect to machine address • Support inter-relationships among objects • Manage the digital resources and metadata, as well as computer programs, services and tools that support them • Enforce appropriate policies for use of Library resources • Provide a high level of security • Support preservation activities appropriately
Virginia Requirements: Delivering the Collections • Well-architected, flexible relationships between services/tools and digital content • Digital objects themselves have the ability to provide users with an appropriate launch-pad or tool to use the object content • Every resource can be used in any number of contexts • Move towards a digital library that is configurable by an “aware” user • Provide resource discovery (searching) across the full collection • Deep searching in particular collections
Shortcomings of commercial digital library products • Narrow focus on specific media formats (e.g. image databases, document management) • Fail to effectively address interrelationships among digital entities • Fail to address interoperability; no open interfaces to facilitate sharing of services; no standard protocols for cross-system interoperability • Fail to provide facilities for managing programs and tools that are integral to delivering digital content • Not extensible; do not enable easy integration of new tools and services
The Fedora Architecture Overview of Basic Model
FEDORA Basic Architectural Abstractions • Digital Object: container for aggregating any digital content; content disseminations based on behavior definitions; extensibility of behavior mechanisms • Repository: service layer for “contained” Digital Objects; object lifecycle management; access management
FEDORA Digital Object • Persistent ID (PID): globally unique persistent id • Disseminators: public view; access methods for obtaining “disseminations” of digital object content • System Metadata: internal view; metadata necessary to manage the object • Datastreams: protected view; content that makes up the “basis” of the object
FEDORA Digital Object Architecture • Data Object: Persistent ID (PID), System Metadata, Datastreams, Disseminators • Behavior Definition Object: Persistent ID (PID), System Metadata, Datastreams (Service Definition Metadata) • Behavior Mechanism Object: Persistent ID (PID), System Metadata, Datastreams (Service Binding Metadata)
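The parts of a data object listed above can be pictured as a small record type. The following is only an illustrative sketch, not the Fedora implementation: the class and field names are assumptions made for this example, the PID is hypothetical, and the disseminator and datastream names are borrowed from the General Image Content Model slide.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Datastream:
    """Protected view: one piece of content forming the 'basis' of the object."""
    ds_id: str
    description: str


@dataclass
class Disseminator:
    """Public view: pairs a behavior definition with a behavior mechanism."""
    behavior_definition: str  # name of the Behavior Definition Object
    behavior_mechanism: str   # name of the Behavior Mechanism Object


@dataclass
class DataObject:
    pid: str  # globally unique persistent id
    system_metadata: Dict[str, str] = field(default_factory=dict)
    datastreams: Dict[str, Datastream] = field(default_factory=dict)
    disseminators: Dict[str, Disseminator] = field(default_factory=dict)


# Hypothetical object assembled from names used on the content model slides.
image = DataObject(pid="example:1")
image.datastreams["basis1"] = Datastream("basis1", "pointer to thumbnail size image")
image.disseminators["web_image1"] = Disseminator("web_image", "web_image1")
```

Keeping the disseminator as a pair of references, rather than embedding code in the object, mirrors the slide's separation of data objects from behavior definition and behavior mechanism objects.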
Digital Object Interoperability: Common Behaviors for Variable Content • Functional equivalency
Digital Object Extensibility: Adding New Behaviors • The same underlying content can be operated on in novel ways to create new disseminations not originally conceived of • (Figure labels: Book, Photo, Collection, Digital Object)
Virginia Prototype Content Models and Fedora Demos
General Image Content Model (Mycenae image example)
• Persistent ID (PID)
• Disseminators
  • web_image1 (Behavior Definition: web_image; Behavior Mechanism: web_image1)
    • get_thumb: HTTP GET
    • get_med: imagedisplay.java
    • get_high: HTTP GET
    • get_veryhigh: HTTP GET
  • web_default_image (Behavior Definition: web_default; Behavior Mechanism: web_default_image)
    • get_as_page: imagedisplay.java
    • get_in_context: HTTP GET (thumb)
• System Metadata
  • admin: Administrative metadata
  • desc: Descriptive metadata
• Datastreams
  • basis1: pointer to thumbnail size image
  • basis2: pointer to medium resolution image
  • basis3: pointer to high resolution image
  • basis4: pointer to highest resolution image
MrSID Image Content Model (Pavilion III image example)
• Persistent ID (PID)
• Disseminators
  • web_image_mrsid (Behavior Definition: web_image; Behavior Mechanism: web_image_mrsid)
    • get_thumb: get_image.pl
    • get_med: get_image.pl
    • get_high: get_image.pl
    • get_veryhigh: get_image.pl
  • web_default_image (Behavior Definition: web_default; Behavior Mechanism: web_default_image)
    • get_as_page: get_image.pl
    • get_in_context: get_image.pl
• System Metadata
  • admin: Administrative metadata
  • desc: Descriptive metadata
• Datastreams
  • basis1: pointer to MrSID formatted image
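The two image content models expose the same web_image behaviors through different mechanisms, which is the "functional equivalency" idea from the interoperability slide. The sketch below illustrates that idea only. The function names, the file names, and the returned strings are assumptions made for this example, and it simplifies the general image model by treating every method as a fetch of one stored rendition.

```python
from typing import Callable, Dict

# Behavior definition: the method names every web_image object promises.
WEB_IMAGE_BDEF = ("get_thumb", "get_med", "get_high", "get_veryhigh")

Mechanism = Dict[str, Callable[[], str]]


def general_image_mechanism(datastreams: Dict[str, str]) -> Mechanism:
    """One stored rendition per size (basis1..basis4)."""
    by_method = dict(zip(WEB_IMAGE_BDEF, ("basis1", "basis2", "basis3", "basis4")))
    return {
        method: (lambda ds_id=ds_id: f"fetch {datastreams[ds_id]}")
        for method, ds_id in by_method.items()
    }


def mrsid_mechanism(datastreams: Dict[str, str]) -> Mechanism:
    """Every size is derived from the single MrSID source (basis1)."""
    source = datastreams["basis1"]
    return {
        method: (lambda size=method[4:]: f"get_image.pl {source} size={size}")
        for method in WEB_IMAGE_BDEF
    }


def disseminate(mechanism: Mechanism, method: str) -> str:
    """Clients call the same method name regardless of the mechanism behind it."""
    if method not in WEB_IMAGE_BDEF:
        raise KeyError(f"{method} is not part of the web_image behavior definition")
    return mechanism[method]()


# Hypothetical file names for the two example objects.
mycenae = general_image_mechanism(
    {"basis1": "thumb.jpg", "basis2": "med.jpg", "basis3": "high.jpg", "basis4": "max.jpg"}
)
pavilion = mrsid_mechanism({"basis1": "pavilion.sid"})
```

A client that only knows the behavior definition can call `disseminate(mycenae, "get_thumb")` and `disseminate(pavilion, "get_thumb")` without knowing how either object stores its content.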
Finding Aid Content Model (Finding Aid example)
• Persistent ID (PID)
• Disseminators
  • web_ead1 (Behavior Definition: web_ead; Behavior Mechanism: web_ead1)
    • get_web_default: eaddoc.java
    • get_tp: tp.xsl
    • get_admin: admin.xsl
    • get_summary: summary.xsl
    • get_scopecontent: scopecontent.xsl
    • get_bioghist: bioghist.xsl
    • get_component: component.xsl
    • get_arrangement: arrangement.xsl
    • get_organization: organization.xsl
    • get_document: document.xsl
    • get_menu: menu.xsl
  • web_default_ead1 (Behavior Definition: web_default; Behavior Mechanism: web_default_ead1)
    • get_as_page: eaddoc.java
    • get_in_context: document.xsl
• System Metadata
  • admin: Administrative metadata
  • desc: Descriptive metadata
• Datastreams
  • basis1: pointer to XML Finding Aid source
TEI Letter Content Model (TEI letter example)
• Persistent ID (PID)
• Disseminators
  • web_teiletter1 (Behavior Definition: web_teiletter; Behavior Mechanism: web_teiletter1)
    • get_teiletter_default: teiletterdoc.pl
    • get_original: letter.header.xsl
    • get_modern: modern.xsl
    • get_teiheader: teiheader.xsl
    • get_pageimages: pageimages.xsl
  • web_default_teiletter (Behavior Definition: web_default; Behavior Mechanism: web_default_teiletter)
    • get_as_page: teiletterdoc.pl
    • get_in_context: letter.header.xsl
• System Metadata
  • admin: Administrative metadata
  • desc: Descriptive metadata
• Datastreams
  • basis1: pointer to XML TEI letter source
TEI Book Content Model (TEI book example)
• Persistent ID (PID)
• Disseminators
  • web_teibook1 (Behavior Definition: web_teibook; Behavior Mechanism: web_teibook1)
    • get_web_default: teidoc.java
    • get_teiheader: admin.xsl
    • get_toc: contents.xsl
    • get_menu_teibook: menu.xsl
    • get_tp_teibook: tp.xsl
    • get_id: id.xsl
  • web_default_teibook (Behavior Definition: web_default; Behavior Mechanism: web_default_teibook)
    • get_as_page: teidoc.java
    • get_in_context: contents.xsl
• System Metadata
  • admin: Administrative metadata
  • desc: Descriptive metadata
• Datastreams
  • basis1: pointer to XML TEI book source
GDMS Content Model (Mycenae example) (lawn example)
• Persistent ID (PID)
• Disseminators
  • web_gdms2 (Behavior Definition: web_gdms; Behavior Mechanism: web_gdms2)
    • get_web_default: imagedef.java
    • get_gdmswalk: gdmswalk.xsl
    • get_menu: imagemenu.xsl
  • web_default_gdms (Behavior Definition: web_default; Behavior Mechanism: web_default_gdms)
    • get_as_page: imagedef.java
    • get_in_context: HTTP GET
• System Metadata
  • admin: Administrative metadata
  • desc: Descriptive metadata
• Datastreams
  • basis1: pointer to XML GDMS source file
Numerical Data Content Model (ICPSR survey example)
The New FEDORA Technical Specifications – Part I
Background Material Overview of Web Service Technologies
What is a Web Service? • A distributed application that runs over the internet. • An addressable network endpoint which receives structured messages and returns structured responses. • A web application that publishes an open interface through which clients can send requests and receive responses.
How is this different from plain old web applications? • Formally defined API (application programming interface) defines a set of abstract operations for a web service • Published bindings for client to run operations • Standard protocol for invoking operations on the service. • XML as standard means of encoding service requests and responses.
Why are Web Services important? • Interoperability • Web applications can interact and build upon each other • Data is transferred in an interoperable manner (e.g., over HTTP) • Data is encoded in an interoperable format (XML) • Works in decentralized, distributed, operating-system independent environment. • Standards-oriented • Means to expose complex operations with rich data typing (via XML Schema language typing) • Ease of integrating distributed systems via the Web • W3C effort to develop this service architecture
How are Web Services Implemented? • The Simple Object Access Protocol (SOAP) Approach • SOAP is a messaging protocol that can run over different transport protocols (e.g., HTTP, SMTP) • Operation oriented (send a request to an endpoint) • Like CORBA, RMI, DCOM…but for the Web and simpler • Application APIs can be defined and published using the Web Service Description Language (WSDL) • Requests and responses sent as XML messages • Supports simple and complex data typing in requests and responses • Supports transmission of binary data within request or response packages
How are Web Services Implemented? • The REST (Representational State Transfer) Approach • URI + HTTP + XML • URI/resource driven; message built into a URI (URL) • HTTP GET or POST • Response is XML data • Issues: • Not a standard, but a style of doing web apps; arguably it just gives a fancy name to how lots of people already build applications on the web by default; nothing really new here; it argues for doing things the way we have been, maybe a little more standardized by using XML. • Fragile service definition: URLs change • No data typing on requests • Limited ability to transmit complex requests on a URL • W3C is behind SOAP, but there is only one strong voice out there for REST (Prescod).
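To make "message built into a URI" concrete, here is a minimal sketch of a REST-style request for the spelling-suggestion operation used on the SOAP example slides. The base URL `http://example.org/spell` and the helper name are hypothetical; no such endpoint is described in this briefing.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_rest_request(base_url: str, operation: str, **params: str) -> str:
    """Encode the operation name and its arguments directly into a URL."""
    return f"{base_url.rstrip('/')}/{operation}?{urlencode(params)}"


# Hypothetical endpoint; the whole request is carried by the URL itself.
url = build_rest_request(
    "http://example.org/spell", "doSpellingSuggestion", phrase="payet"
)

# On the receiving side every parameter arrives as an untyped string,
# which illustrates the "no data typing on requests" issue listed above.
received = parse_qs(urlsplit(url).query)
```

The request would then be sent with a plain HTTP GET, and the XML response parsed by the client.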
Example of Web Service using SOAP • My Application sends a SOAP Request (XML) over SOAP/HTTP to the Google Web Service: doSpellingSuggestion(payet) • The Google Web Service returns a SOAP Response (XML) over SOAP/HTTP: payette
XML SOAP Request
<?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/1999/XMLSchema-instance" xmlns:xsd="http://www.w3.org/1999/XMLSchema">
  <SOAP-ENV:Body>
    <m:doSpellingSuggestion xmlns:m="urn:GoogleSearch">
      <key>/e325JlNPASJu</key>
      <phrase>payet</phrase>
    </m:doSpellingSuggestion>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>
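A request with this envelope structure can be assembled programmatically. The sketch below uses Python's standard `xml.etree.ElementTree` and is only an illustration: the function name is an assumption, the key value is a placeholder, and the serializer chooses its own prefix for the `urn:GoogleSearch` namespace, so the output is equivalent to the slide's XML rather than byte-identical.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"
GOOGLE = "urn:GoogleSearch"


def build_spelling_request(key: str, phrase: str) -> bytes:
    """Build a SOAP envelope that calls doSpellingSuggestion."""
    # Ask the serializer to use the same envelope prefix as the slide.
    ET.register_namespace("SOAP-ENV", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{GOOGLE}}}doSpellingSuggestion")
    # The parameters are unqualified child elements, as in the slide.
    ET.SubElement(call, "key").text = key
    ET.SubElement(call, "phrase").text = phrase
    return ET.tostring(envelope, encoding="utf-8")


request_xml = build_spelling_request("placeholder-key", "payet")
```

The resulting bytes would be sent as the body of an HTTP POST to the service endpoint.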
XML SOAP Response
<?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/1999/XMLSchema-instance" xmlns:xsd="http://www.w3.org/1999/XMLSchema">
  <SOAP-ENV:Body>
    <ns1:doSpellingSuggestionResponse xmlns:ns1="urn:GoogleSearch" SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
      <return xsi:type="xsd:string">payette</return>
    </ns1:doSpellingSuggestionResponse>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>
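On the client side, the suggestion has to be pulled out of the response envelope. A minimal sketch using the standard library follows; the function name is an assumption, and the embedded document is the response shown on this slide.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"
GOOGLE = "urn:GoogleSearch"

RESPONSE = b"""<?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
    xmlns:xsi="http://www.w3.org/1999/XMLSchema-instance"
    xmlns:xsd="http://www.w3.org/1999/XMLSchema">
  <SOAP-ENV:Body>
    <ns1:doSpellingSuggestionResponse xmlns:ns1="urn:GoogleSearch"
        SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
      <return xsi:type="xsd:string">payette</return>
    </ns1:doSpellingSuggestionResponse>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>"""


def extract_suggestion(soap_xml: bytes) -> str:
    """Return the text of the <return> element inside the SOAP body."""
    root = ET.fromstring(soap_xml)
    body = root.find(f"{{{SOAP_ENV}}}Body")
    response = body.find(f"{{{GOOGLE}}}doSpellingSuggestionResponse")
    # <return> carries no namespace prefix, so it is looked up unqualified.
    return response.find("return").text


suggestion = extract_suggestion(RESPONSE)
```

The `xsi:type="xsd:string"` attribute is what the earlier slide means by data typing in responses; this sketch reads the value as text and does not interpret the type annotation.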
New Fedora: Key Features • Repository system exposed as two related Web services • described using WSDL • both SOAP and HTTP bindings • Digital objects encoded and stored as XML using Metadata Encoding and Transmission Standard (METS) • Digital object behaviors implemented as linkages to distributed web services (also described using WSDL) • Digital objects support versioning of both content and services.
The New FEDORA Encoding Digital Objects in XML
Metadata Encoding and Transmission Standard (METS) • XML “standard” for encoding descriptive, administrative, and structural metadata of digital library objects • Developed under auspices of the Digital Library Federation • METS standard maintained by the Network Development and MARC Standards Office of the Library of Congress http://www.loc.gov/standards/mets/
METS Schema • METS is written in the XML Schema Language • METS defines four sections for an object • Descriptive metadata • Administrative metadata • File group • Structure map • METS goals include: • Facilitate management of objects within a repository • Provide a standard format for exchange of objects between repositories • Provide standard format for transmission of objects to users for rendering (via tools or applications)
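As a rough picture of the four sections, the sketch below builds an empty METS skeleton with Python's standard library. The element names (`mets`, `dmdSec`, `amdSec`, `fileSec`, `fileGrp`, `structMap`) and the namespace come from the METS schema; the function name, the `ID` values, and the PID are assumptions for illustration, and the result is not a schema-valid document because the sections are left empty.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"


def mets_skeleton(pid: str) -> ET.Element:
    """Build an empty METS document with the four sections named on the slide."""
    ET.register_namespace("METS", METS_NS)
    root = ET.Element(f"{{{METS_NS}}}mets", {"OBJID": pid})
    ET.SubElement(root, f"{{{METS_NS}}}dmdSec", {"ID": "DESC"})   # descriptive metadata
    ET.SubElement(root, f"{{{METS_NS}}}amdSec", {"ID": "ADMIN"})  # administrative metadata
    file_sec = ET.SubElement(root, f"{{{METS_NS}}}fileSec")       # holds the file groups
    ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp", {"ID": "DATASTREAMS"})
    ET.SubElement(root, f"{{{METS_NS}}}structMap")                # structure map
    return root


skeleton = mets_skeleton("example:1")  # hypothetical PID
section_names = [child.tag.split("}")[1] for child in skeleton]
```

Calling `ET.tostring(skeleton)` serializes the skeleton for inspection.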
Digital Object Versioning • Versioning within Data Objects • Datastream versioning • Date/time stamped • New version every time datastream is modified • Disseminator versioning • Date/time stamped • New version if disseminator is modified to reference a different Behavior Mechanism (“better mousetrap”) • Versioning within Behavior Definition and Mechanism Objects • New versions of WSDL metadata recorded in these objects (with date/time stamps) • This deserves much more explanation than this slide can offer!
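The datastream rule above (a new date/time-stamped version on every modification) can be sketched as an append-only history with an "as of" lookup. This is an assumption about one way to realize the rule, not the Fedora design; the class and method names are invented for this example.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class VersionedDatastream:
    ds_id: str
    stamps: List[datetime] = field(default_factory=list)
    contents: List[str] = field(default_factory=list)

    def modify(self, content: str, when: datetime) -> None:
        """Record a new version; earlier versions are never overwritten."""
        if self.stamps and when <= self.stamps[-1]:
            raise ValueError("versions must be recorded in increasing time order")
        self.stamps.append(when)
        self.contents.append(content)

    def as_of(self, when: datetime) -> Optional[str]:
        """Return the version that was current at the given date/time."""
        index = bisect_right(self.stamps, when)
        return self.contents[index - 1] if index else None


basis1 = VersionedDatastream("basis1")
basis1.modify("first scan", datetime(2002, 1, 1))
basis1.modify("rescanned image", datetime(2002, 6, 1))
```

Disseminator versioning could follow the same pattern, with each version recording which Behavior Mechanism the disseminator referenced at that time.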
METS: Sample Fedora Object (image digital object example)
Fedora Dissemination Database • Alternate form of object storage that will act as a cache of the most recent versions of digital objects • Ensure high-performance access (disseminations) • Repository system replicates from authoritative XML version of objects to relational database • Plan to phase out the database in Phase 2-3: • Access sub-system to work completely off the XML storage, as XML tools improve performance-wise. • Pursue different caching strategies as necessary
The New FEDORA Repository System Design
FEDORA Web Service API Definitions • “API-M” – interface for management sub-system • Operations necessary to create and maintain objects and their components • Interface directly with authoritative XML version of object • “API-A” – interface for access sub-system • Operations necessary for clients to perform disseminations on objects in the repository • No direct access to object internal structure or components • Will work against cached representation of object to optimize performance.
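The division of labor between the two interfaces can be sketched as two small classes sharing a cache. Everything named here is hypothetical: the operation names `put_object` and `get_dissemination`, the dictionaries standing in for the XML store and the dissemination database, and the behavior table are assumptions for illustration, not the published API-M or API-A operations.

```python
from typing import Callable, Dict

Datastreams = Dict[str, str]


class ManagementAPI:
    """API-M sketch: creates and maintains objects in the authoritative store."""

    def __init__(self, authoritative: Dict[str, Datastreams], cache: Dict[str, Datastreams]):
        self._authoritative = authoritative
        self._cache = cache

    def put_object(self, pid: str, datastreams: Datastreams) -> None:
        self._authoritative[pid] = dict(datastreams)
        # Replicate the most recent version into the cache used for access.
        self._cache[pid] = dict(datastreams)


class AccessAPI:
    """API-A sketch: performs disseminations against the cached copy only."""

    def __init__(self, cache: Dict[str, Datastreams],
                 behaviors: Dict[str, Callable[[Datastreams], str]]):
        self._cache = cache
        self._behaviors = behaviors

    def get_dissemination(self, pid: str, method: str) -> str:
        # Clients name a behavior method; they never see the datastreams directly.
        return self._behaviors[method](self._cache[pid])


authoritative: Dict[str, Datastreams] = {}
cache: Dict[str, Datastreams] = {}
management = ManagementAPI(authoritative, cache)
access = AccessAPI(cache, {"get_thumb": lambda ds: f"thumbnail from {ds['basis1']}"})

management.put_object("example:1", {"basis1": "thumb.jpg"})  # hypothetical PID and file
result = access.get_dissemination("example:1", "get_thumb")
```

Because the access class holds only the cache and a behavior table, it has no operation that exposes or changes an object's internal structure, which matches the separation described on the slide.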