This presentation introduces ResourceSync, a framework for synchronizing web-based resources. It discusses the problem domain, scope, and technology of the framework, and provides a demonstration of its capabilities.
ResourceSync A Modular Framework for Web-Based Resource Synchronization Herbert Van de Sompel Los Alamos National Laboratory @hvdsomp Martin Klein Los Alamos National Laboratory @mart1nkle1n http://www.openarchives.org/rs #resourcesync ResourceSync was funded by the Sloan Foundation & JISC
ResourceSync • Collaboration between NISO and the Open Archives Initiative • Funded by the Sloan Foundation and JISC • Goal: Devise a specification for web-based resource synchronization
This ResourceSync Presentation • Problem Domain • Scope • Framework - Overview • Framework – Technology • Demonstration • Status
Background - OAI-PMH • Recurrent metadata exchange from a Data Provider to Service Providers • XML metadata only • Repository centric • Devised 1999-2002, prior to REST, prior to dominance of web search engines
Revisit the Problem Domain - ResourceSync • Synchronization of resources from a Source to Destinations • Web resources, anything with an HTTP URI & representation • Resource centric • Devised 2012-2013, leverages key ingredients of web interoperability, existing specifications, existing Search Engine Optimization practice
Problem Statement • Consideration: • Source (server) A has resources that change over time: they get created, modified, deleted • Destination (servers) X, Y, and Z leverage (some) resources of Source A • Problem: • Destinations want to keep in step with the resource changes at Source A • Goal: • Design an approach for resource synchronization aligned with the Web Architecture that has a fair chance of adoption by different communities
Scope – Collection Size • Size of a Source’s resource collection: • A few resources - small web sites, repositories • Millions of resources – large repositories, datasets, linked data collections
Scope – Change Frequency • Change frequency of a Source’s resources: • Low – daily, weekly, monthly • High – seconds, minutes
Scope – Synchronization Latency • Destination’s requirements regarding synchronization latency: • High latency acceptable • Low latency essential
Scope – Collection Coverage • Destination’s requirements regarding the coverage of a Source’s resources: • Partial coverage of the Source’s resources acceptable • Full coverage of the Source’s resources verifiable
Scope – Bitstream Accuracy • Destination’s requirements regarding bitstream accuracy: • Unverifiable bitstream accuracy acceptable • Verifiable bitstream accuracy essential
Solution Perspective - Destination • Destination needs regarding synchronization: • Baseline synchronization: Initial catch-up operation to align with the Source’s resources • Incremental synchronization: Remain synchronized as the Source’s resources evolve • Audit: Destination determines whether it effectively is in sync with the Source • Bitstream accuracy • Coverage of resources
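The coverage part of the audit can be pictured as a set comparison. This is a minimal sketch, assuming the Destination already knows the URIs in the Source's inventory and the URIs it holds locally; the function name `audit_coverage` and the example URIs are hypothetical and not part of ResourceSync.

```python
def audit_coverage(source_uris, local_uris):
    """Compare the Source's inventory with what the Destination holds."""
    source_uris, local_uris = set(source_uris), set(local_uris)
    return {
        # Listed by the Source but not yet held by the Destination
        "missing": sorted(source_uris - local_uris),
        # Held by the Destination but no longer listed by the Source
        "stale": sorted(local_uris - source_uris),
        "in_sync": source_uris == local_uris,
    }


# Hypothetical example data
report = audit_coverage(
    {"http://example.com/res1", "http://example.com/res2"},
    {"http://example.com/res2", "http://example.com/res3"},
)
print(report)
```

This only checks which resources are present; bitstream accuracy needs a content comparison such as a hash check.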
Solution Perspective - Source • Source communicates about the state of its resources: • Publish inventory: snapshot of the state of resources at a moment in time • Publish changes: enumeration of resource changes that occurred during a temporal interval • Notify about changes: send notifications as changes occur • Communication payload: • Minimal, e.g. HTTP URI of resource • Additional, e.g. content-based hash of resource
Resource List • In order to meet a Destination’s need for baseline synchronization, the Source may publish a Resource List • A Resource List is an inventory, a snapshot of existing resources • Per resource, it minimally provides the resource’s URI • Process: • Destination obtains the Resource List • Destination obtains listed resources by their URI • Optimization: Resource Dump, a list pointing to ZIP files that contain resource representations
Publish Resource List: Inventory at Tx Resource List@Tx= { A ; B ; C }
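The baseline process (obtain the Resource List, then obtain each listed resource by its URI) can be sketched as follows. This is an illustration only: the in-memory dictionary stands in for the Source, `baseline_sync` is a hypothetical helper, and a real Destination would fetch each URI with an HTTP GET.

```python
def baseline_sync(resource_list, fetch):
    """Build a local copy by fetching every URI named in the Resource List."""
    return {uri: fetch(uri) for uri in resource_list}


# Hypothetical Source state at Tx, standing in for HTTP-accessible resources
source_at_tx = {
    "A": b"content of A",
    "B": b"content of B",
    "C": b"content of C",
}
resource_list_at_tx = ["A", "B", "C"]

local_copy = baseline_sync(resource_list_at_tx, fetch=source_at_tx.__getitem__)
print(sorted(local_copy))
```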
Change List • In order to meet a Destination’s need for incremental synchronization, the Source may publish a Change List • A Change List enumerates resource change events that occurred in a temporal interval • For each event, it minimally lists datetime, URI of the resource, the nature of the change • Process: • Destination obtains the Change List • Destination obtains created/updated resources, removes deleted resources • Optimization: Change Dump
Publish Change List: Resource Changes During Interval Ty-Tz Change List[Ty,Tz] = { A updated @Tc ; B updated @Tc ; C created @Td ; D deleted @Te ; C updated @Tf }
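Processing the Change List above can be sketched as replaying its events in order. The starting state of the Destination (holding A, B and D) and the stand-in `fetch` function are assumptions made for illustration; the slides do not state them.

```python
def apply_change_list(local_copy, change_list, fetch):
    """Replay change events in order against the Destination's local copy."""
    for uri, change, _when in change_list:
        if change in ("created", "updated"):
            local_copy[uri] = fetch(uri)   # obtain the new representation
        elif change == "deleted":
            local_copy.pop(uri, None)      # remove if present
        else:
            raise ValueError(f"unknown change type: {change}")
    return local_copy


# Events from Change List[Ty,Tz] as (resource, change, time label)
change_list = [
    ("A", "updated", "Tc"),
    ("B", "updated", "Tc"),
    ("C", "created", "Td"),
    ("D", "deleted", "Te"),
    ("C", "updated", "Tf"),
]

# Assumed state: the Destination synchronized before Ty and holds A, B, D
local_copy = {"A": "A-old", "B": "B-old", "D": "D-old"}
# Stand-in for the Source's current resources (normally fetched over HTTP)
source_now = {"A": "A-new", "B": "B-new", "C": "C-new"}

apply_change_list(local_copy, change_list, fetch=source_now.__getitem__)
print(local_copy)
```

Because the sketch fetches the current representation for every created or updated event, C is fetched twice; a Destination could instead keep only the latest event per URI before fetching.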
Change Notification • In order to meet a Destination’s need for incremental synchronization and low latency, the Source may send Change Notifications • A Change Notification conveys resource change events as they occur • For each event, it minimally lists datetime, URI of the resource, the nature of the change • Process: • Destination receives Change Notification • Destination obtains created/updated resources, removes deleted resources
Send Change Notification – Resource Changes at Ta Change Notification @Ta = { A updated @Ta }
Send Change Notification – Resource Changes at Tb Change Notification @Tb = { D updated @Tb }
Send Change Notification – Resource Changes at Tc Change Notification @Tc = { A updated @Tc ; B updated @Tc }
Send Change Notification – Resource Changes at Td Change Notification @Td = { C created @Td }
Send Change Notification – Resource Changes at Te Change Notification @Te = { D deleted @Te }
Send Change Notification – Resource Changes at Tf Change Notification @Tf = { C updated @Tf }
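The notifications at Ta through Tf carry the same kind of events that a Change List enumerates for an interval. The sketch below makes that relationship concrete under explicit assumptions: the times are invented integers (Ta=1 ... Tf=6), and the interval [Ty, Tz] is assumed to start after Tb and end after Tf, which the slides suggest but do not state.

```python
# Notifications in the order they were sent: (resource, change, time)
notifications = [
    ("A", "updated", 1),   # Ta
    ("D", "updated", 2),   # Tb
    ("A", "updated", 3),   # Tc
    ("B", "updated", 3),   # Tc
    ("C", "created", 4),   # Td
    ("D", "deleted", 5),   # Te
    ("C", "updated", 6),   # Tf
]


def events_in_interval(events, start, end):
    """Select events whose time falls in the closed interval [start, end]."""
    return [e for e in events if start <= e[2] <= end]


# Assumed interval bounds: Ty = 2.5, Tz = 6.5
window = events_in_interval(notifications, 2.5, 6.5)
print([(uri, change) for uri, change, _ in window])
```

Under these assumptions the selected events match the Change List[Ty,Tz] shown earlier.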
Communication Payload – Metadata & Links • A Source may provide additional metadata and links pertaining to resources conveyed in Resource Lists, Change Lists, Change Notifications • Metadata about a resource: content encoding, content length, mime type, content-based hash • Linking to related resources: mirror copies, alternate representations, resource versions, diff between current and previous version, metadata-to-content link, content-to-metadata link, collection membership, etc.
Communication Payload – Metadata – Hash • In order to meet a Destination’s need for audit, the Source may provide a content-based hash pertaining to a resource • Source computes the content-based hash for a resource • Source provides the hash as metadata pertaining to the resource in its communication payload • Destination processes communication payload, obtains the resource • Destination computes the content-based hash for the obtained resource, compares with the Source’s
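The Destination's side of this check can be sketched with Python's `hashlib`. The sketch assumes the hash is conveyed as `<algorithm>:<hex digest>`, the form used in the Resource List example in this presentation (`md5:...`), and that the algorithm name is one `hashlib` recognizes; the function name and sample bytes are hypothetical.

```python
import hashlib


def matches_declared_hash(content: bytes, declared: str) -> bool:
    """Check content against a declared hash of the form 'algorithm:hexdigest'."""
    algorithm, _, expected = declared.partition(":")
    digest = hashlib.new(algorithm, content).hexdigest()
    return digest == expected.lower()


# Hypothetical resource bytes and the hash a Source might publish for them
content = b"example resource bytes"
declared = "md5:" + hashlib.md5(content).hexdigest()

ok = matches_declared_hash(content, declared)
tampered = matches_declared_hash(content + b"!", declared)
print(ok, tampered)
```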
Communication Payload – Link – Interlink Metadata & Content • In order to allow a Destination to establish the relationship between a Source’s metadata and a Source’s content, the Source may provide appropriate links • Metadata resources and content resources are just resources identified by HTTP URIs • Both can independently be subject to synchronization and can be interlinked using appropriately typed links
Communication Payload – Link – Link to Diff • In order to minimize content transfer, a Source may link to a diff between the previous and the new version of a resource • Destination can obtain the diff and patch its (previous) version of the resource • Connection between the resource and the diff is established by means of an appropriately typed link • Nature of the diff is established by means of its MIME type • Few diff MIME types exist. Communities can establish their own.
Further Framework Characteristics • Modular: A Source does not have to implement all capabilities • Source decides which capabilities to support based on local and community requirements • Sets of Resources: Division of a Source’s resource collection into logical groupings • Supported capabilities can differ per set • Discovery: Mechanisms for Destinations to determine whether and how a Source supports ResourceSync • Based on conventions for web discovery and documents that detail the level of support
Sitemap Protocol • ResourceSync builds on the Sitemap protocol used by major search engines • Similarity between resource synchronization and resource discovery/indexing • Extends the Sitemap protocol to meet synchronization needs • Cf. Metadata and Links • Sitemap document format is used throughout the framework to express Resource Lists, Change Lists, etc. • Type of ResourceSync document can be determined through explicit declaration
Common Sitemap
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2013-01-02T14:00:00Z</lastmod>
  </url>
  …
</urlset>
Resource List
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist" at="2013-01-03T09:00:00Z" />
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6" length="8876" type="application/pdf" />
  </url>
  <url> … </url>
</urlset>
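A Destination could read such a Resource List with a standard XML parser. The following is a minimal sketch using Python's `xml.etree.ElementTree`; the function name and returned dictionary layout are hypothetical, and the embedded document is a one-entry copy of the slide's example with the elided entries left out.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

RESOURCE_LIST = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist" at="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
           length="8876" type="application/pdf"/>
  </url>
</urlset>
"""


def parse_resource_list(xml_text):
    """Return the declared capability and one dict per listed resource."""
    root = ET.fromstring(xml_text)
    # The document-level rs:md element declares the document type
    capability = root.find("rs:md", NS).get("capability")
    resources = []
    for url in root.findall("sm:url", NS):
        md = url.find("rs:md", NS)  # optional per-resource metadata
        attrs = md.attrib if md is not None else {}
        resources.append({
            "uri": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "hash": attrs.get("hash"),
            "length": int(attrs["length"]) if "length" in attrs else None,
            "type": attrs.get("type"),
        })
    return capability, resources


capability, resources = parse_resource_list(RESOURCE_LIST)
print(capability, resources[0]["uri"], resources[0]["length"])
```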
Change List
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist" from="2013-01-02T09:00:00Z" until="2013-01-03T09:00:00Z" />
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md change="updated" />
    <rs:ln href="http://example.com/res2/meta" rel="describedby" />
  </url>
  <url> … </url>
</urlset>
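Reading a Change List follows the same pattern, with the change type and typed links extracted per entry. This is a sketch under the same assumptions as any illustrative parser here: `parse_change_list` and its return shape are hypothetical, and the embedded document is a one-entry copy of the slide's example.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

CHANGE_LIST = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist"
         from="2013-01-02T09:00:00Z" until="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md change="updated"/>
    <rs:ln href="http://example.com/res2/meta" rel="describedby"/>
  </url>
</urlset>
"""


def parse_change_list(xml_text):
    """Return the covered interval and one dict per change event."""
    root = ET.fromstring(xml_text)
    header = root.find("rs:md", NS)
    interval = (header.get("from"), header.get("until"))
    events = []
    for url in root.findall("sm:url", NS):
        md = url.find("rs:md", NS)
        events.append({
            "uri": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "change": md.get("change") if md is not None else None,
            # Map each link relation to its target, e.g. describedby -> metadata URI
            "links": {ln.get("rel"): ln.get("href")
                      for ln in url.findall("rs:ln", NS)},
        })
    return interval, events


interval, events = parse_change_list(CHANGE_LIST)
print(interval, events[0]["change"], events[0]["links"])
```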