Beyond Storage: Rethinking the role of repositories in scholarly communication. DELOS Workshop, Digital Repositories: Interoperability and Common Services, May 11, 2005. Sandy Payette, Cornell University. First… is there a problem? Existing scholarly communication system.
Beyond Storage: Rethinking the role of repositories in scholarly communication
DELOS Workshop, Digital Repositories: Interoperability and Common Services
May 11, 2005
Sandy Payette, Cornell University
Existing scholarly communication system
• Does not mirror the reality of the scholarly process
• Published information artifacts do not resemble the rich information that is produced along the process
• Not evolved enough to enable easy and effective integration and dissemination of new, rich forms of digital information
Roles of digital repositories today
• Early Dissemination:
  • Enhance upstream scholarly communication
  • Improvement over traditional pre-print (paper) sharing among scholars
• Open Access:
  • Harnad’s “subversive proposal”
  • Possibility of bypassing or eliminating traditional publisher model
• Document Discovery:
  • Searching for documents in a repository
  • Federation or metadata harvest for search over multiple repositories
• Storage and Archiving:
  • E-print archives: author self-archiving gives scholars control over their intellectual output
  • Institutional repositories: institutions commit to preservation
Evolutionary, but not revolutionary
• In many ways repositories represent an evolution of the traditional publishing paradigm
  • Submit documents
  • Gain access to documents…
  • Share results earlier in the scholarly process, and electronically
• Still locked into document-centric paradigm
  • Store documents to promote access
  • Store documents to promote archiving
  • Index documents to promote search and discovery
  • Citation analysis to understand relationships of documents
Signs of Change – Scholars exercising the network
• Grid computing in sciences
  • Share computing resources
  • Share services and distributed virtual file systems
  • Examples
    • Enabling Grids for E-Science (http://public.eu-egee.org/)
    • National Virtual Observatory (http://www.us-vo.org/)
• Humanities computing
  • Hyperlinked historical documentary editions
  • New Forms of Digital Scholarship
    • Rossetti archive (http://www.rossettiarchive.org/)
    • Perseus (www.perseus.tufts)
    • Pompeii Forum (http://pompeii.virginia.edu)
    • Tibetan and Himalayan Digital Library (thdl.org)
The revolutionary opportunity…
• Looming on the horizon is the potential of a future scholarly communication system that is
  • Highly collaborative
  • Network-based
  • Data-intensive
  • Process-oriented
• We can change the way research and education are conducted by exposing rich knowledge-oriented information assets
• Digital repositories must be rationalized within this broader vision.
New Functionality
• Content aggregation:
  • combining information entities in novel ways
• Knowledge integration:
  • capturing semantic and factual relationships among information entities
• Information reuse:
  • allowing secondary, tertiary products
• Information transformation:
  • combining information entities with computational services
• Collaboration and contribution:
  • blurring the line between authors, publishers, users, experts…
A New Scholarly Information System: 3 Basic Requirements
• Redefine the “information unit” of scholarly communication
• Create a scholarly communication system that better supports the process of research and learning
• Record the “crumb trails” of the scholarly process
(1) The new “information unit”
• Documents
• Text
• Data
• Simulations
• Images
• Video
• Computations
• Automated Analyses
• Aggregations
(2) Process-oriented Scholarly Communication System
• Decompose the traditional process (Roosendaal & Geurts)
  • Registration (establish intellectual priority of result)
  • Certification (certify quality and validity of result)
  • Awareness (ensure accessibility)
  • Archiving (ensure availability for future use)
  • Rewarding (means to support tenure, promotion, compensation)
• But, they missed some things…
(2) Process-oriented Scholarly Communication System
• Add new services to the mix
  • Workflow
  • Collaborative functions (e.g., annotation, re-use)
  • Data mining and analysis
  • Preservation monitoring and migration
• Expose all as network-accessible atomic services
  • Service discovery
  • Service invocation
  • Service aggregation, orchestration, choreography
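The three service capabilities listed above (discovery, invocation, aggregation) can be sketched in a few lines. This is only an illustration under stated assumptions: the `ServiceRegistry` class, the service names, and the in-memory callables are all hypothetical, and a real deployment would publish service descriptions over the network.

```python
from typing import Callable, Dict, List

# Hypothetical: a "service" here is just a function from a dict to a dict.
Service = Callable[[dict], dict]


class ServiceRegistry:
    """Toy in-memory registry standing in for network-accessible services."""

    def __init__(self) -> None:
        self._services: Dict[str, Service] = {}

    def register(self, name: str, service: Service) -> None:
        self._services[name] = service

    def discover(self, keyword: str) -> List[str]:
        """Service discovery: names containing the keyword, sorted."""
        return sorted(n for n in self._services if keyword in n)

    def invoke(self, name: str, payload: dict) -> dict:
        """Service invocation: call a single atomic service."""
        return self._services[name](payload)

    def orchestrate(self, names: List[str], payload: dict) -> dict:
        """Service aggregation: feed each service's output to the next."""
        for name in names:
            payload = self.invoke(name, payload)
        return payload


def check_format(obj: dict) -> dict:
    # Illustrative preservation check; the accepted format is an assumption.
    return {**obj, "format_ok": obj.get("format") == "application/pdf"}


def add_annotation(obj: dict) -> dict:
    # Illustrative collaborative function: append a fixed annotation.
    return {**obj, "annotations": obj.get("annotations", []) + ["reviewed"]}


registry = ServiceRegistry()
registry.register("preserve.check_format", check_format)
registry.register("annotate.add", add_annotation)

result = registry.orchestrate(
    ["preserve.check_format", "annotate.add"],
    {"id": "obj:1", "format": "application/pdf"},
)
```

Keeping each service atomic is what makes the `orchestrate` step trivial: any ordered list of names is a candidate workflow.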
Process-orientation: workflows
[Diagram: two workflows drawing on a “World of Services”]
• Ingest-oriented process (input labeled SIP). Step labels: Ingest to Repo; Assign Access Policy; Validate byte-streams; Index and Register; Link to Simulation Service; Ingest To Archive
• Preservation-oriented process (input labeled Digital Object In Repo). Step labels: Format Migration; Make Copies; Visit The Doctor; Object Versioning; Ingest To Archive
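An ingest-oriented workflow of this kind can be sketched as an ordered list of steps applied to an object. The step names follow the diagram labels, but their order, the `DigitalObject` fields, and the "open" policy value are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class DigitalObject:
    identifier: str
    bytestream: bytes
    access_policy: str = ""
    indexed: bool = False
    log: List[str] = field(default_factory=list)  # steps completed so far


Step = Callable[[DigitalObject], DigitalObject]


def assign_access_policy(obj: DigitalObject) -> DigitalObject:
    obj.access_policy = "open"  # placeholder policy name (assumption)
    obj.log.append("policy assigned")
    return obj


def validate_bytestream(obj: DigitalObject) -> DigitalObject:
    # Minimal stand-in for validation: reject empty content.
    if not obj.bytestream:
        raise ValueError(f"{obj.identifier}: empty byte-stream")
    obj.log.append("validated")
    return obj


def index_and_register(obj: DigitalObject) -> DigitalObject:
    obj.indexed = True
    obj.log.append("indexed")
    return obj


def run_workflow(obj: DigitalObject, steps: List[Step]) -> DigitalObject:
    """Apply each step in order; an exception in any step aborts the run."""
    for step in steps:
        obj = step(obj)
    return obj


INGEST_WORKFLOW: List[Step] = [
    assign_access_policy,
    validate_bytestream,
    index_and_register,
]

ingested = run_workflow(DigitalObject("obj:42", b"%PDF-1.4"), INGEST_WORKFLOW)
```

A preservation-oriented workflow would reuse `run_workflow` with a different list of steps, which is the point of treating steps as interchangeable services.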
(3) Record the “crumb trails”
• Events
  • Critical state transitions of information assets
  • Preservation-noteworthy events
• Provenance
  • When we enable re-use and re-combination of assets, we must be able to show where each asset came from
• Relationships
  • Among information assets
  • Versions of an asset
  • Between agents and assets
  • Between services and assets
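One way to picture a crumb trail is an append-only event log in which each event may point back to the asset it was derived from. The sketch below assumes a hypothetical `CrumbTrail` class and made-up asset and agent identifiers; it is not drawn from any particular repository system.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class Event:
    asset: str                   # asset the event is about
    kind: str                    # e.g. "ingested", "derived", "migrated"
    agent: str                   # person or service responsible
    derived_from: Optional[str]  # source asset, if this asset was derived
    timestamp: datetime


class CrumbTrail:
    """Append-only log of events, with simple provenance lookup."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def record(self, asset: str, kind: str, agent: str,
               derived_from: Optional[str] = None) -> Event:
        event = Event(asset, kind, agent, derived_from,
                      datetime.now(timezone.utc))
        self._events.append(event)
        return event

    def history(self, asset: str) -> List[Event]:
        """All recorded events for one asset, in recording order."""
        return [e for e in self._events if e.asset == asset]

    def provenance(self, asset: str) -> List[str]:
        """Follow derived_from links back toward the original source."""
        chain: List[str] = []
        seen = {asset}
        current = asset
        while True:
            parent = next(
                (e.derived_from for e in self._events
                 if e.asset == current and e.derived_from),
                None,
            )
            if parent is None or parent in seen:  # stop at the root or a cycle
                return chain
            chain.append(parent)
            seen.add(parent)
            current = parent


trail = CrumbTrail()
trail.record("dataset:raw", "ingested", "agent:lab")
trail.record("dataset:clean", "derived", "service:cleaner",
             derived_from="dataset:raw")
trail.record("figure:1", "derived", "agent:author",
             derived_from="dataset:clean")

lineage = trail.provenance("figure:1")
```

Here `lineage` lists the assets that `figure:1` ultimately came from, nearest first.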
Selected repositories with notable features re: the vision
• Open-source repository software
  • Fedora
  • DSpace
• Installed Systems
  • aDORe (Los Alamos National Laboratory)
  • arXiv
• Grid projects
  • Storage Resource Broker (SRB)
  • Chimera
Fedora vs. the vision
• Flexible digital object model
• Services associated with digital objects
• Relationships among digital objects
  • Relationship ontology
  • RDF-based metadata
  • Search the repository “as a graph”
• Upcoming – new security architecture
  • Policy enforcement (XACML)
  • Repository policy
  • Object policies (fine-grained control)
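Searching a repository “as a graph” amounts to pattern matching over subject-predicate-object statements. The sketch below uses plain Python tuples rather than any Fedora API; the object identifiers and predicate names are illustrative assumptions.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Illustrative statements; identifiers and predicates are made up.
TRIPLES: List[Triple] = [
    ("obj:article-1", "isMemberOf", "obj:collection-A"),
    ("obj:dataset-7", "isMemberOf", "obj:collection-A"),
    ("obj:article-1", "isDerivedFrom", "obj:dataset-7"),
]


def match(triples: List[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> List[Triple]:
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# "Which objects are members of collection A?"
members = [s for s, _, _ in match(TRIPLES, p="isMemberOf", o="obj:collection-A")]

# "What is article-1 related to, and how?"
relations = [(p, o) for _, p, o in match(TRIPLES, s="obj:article-1")]
```

A production system would hand such patterns to an RDF store and query language; the wildcard matching above only shows the shape of the question being asked.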
Fedora Repository – Web Services
[Diagram: Web Services Exposure]
Fedora Objects – RDF Graph view
[Diagram labels: Member Object; Collection Object]
DSpace vs. the vision
• The related Simile project is most interesting
  • Significance: semantic web technologies brought to the task of search and discovery across different repository systems
  • RDF-based search across heterogeneous metadata formats
  • Ontology-based
• DSpace History system
  • Event recording
  • RDF-based
• Opportunity in DSpace 2
  • Web service exposure?
  • Service-based dissemination architecture?
LANL’s aDORe vs. the vision
• Standards-based repository architecture
  • OAI-PMH
  • MPEG-21 DIDL
  • OpenURL
• Very good example of the use of simple protocols to enable modular service-based architecture
• Services dynamically associated with objects
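Part of what makes OAI-PMH a “simple protocol” is that a request is just an HTTP URL with a `verb` and a few arguments. The sketch below builds such URLs with the standard library; the base URL and set name are hypothetical, and the helper itself is not part of aDORe.

```python
from urllib.parse import urlencode

# The six request verbs defined by OAI-PMH 2.0.
OAI_VERBS = {
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
}


def oai_request_url(base_url: str, verb: str, **arguments: str) -> str:
    """Build an OAI-PMH request URL; arguments set to None are skipped."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    params = {"verb": verb}
    params.update({k: v for k, v in arguments.items() if v is not None})
    return f"{base_url}?{urlencode(params)}"


BASE_URL = "https://repository.example.org/oai"  # hypothetical endpoint

harvest_url = oai_request_url(
    BASE_URL, "ListRecords", metadataPrefix="oai_dc", set="physics"
)
identify_url = oai_request_url(BASE_URL, "Identify")
```

Fetching `harvest_url` from a real endpoint would return an XML response, possibly with a resumption token for paging; that part is left out here.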
aDORe architecture
[Diagram with numbered steps 1 to 7. Labels: publisher (sources); Pre-Ingest; Ingest; Repo Index; OAI-PMH interfaces; DID with DIM; A&I; FTXT; TechReport; OpenURL; MPEG-21 DIP Engine; Registry of transformations; Profile/Behavior Registry; Identifier Resolver (CNRI handle, Java, C); application; Indata.lanl.gov LANL]
Slide courtesy of Herbert Van de Sompel
arXiv vs. the vision • Progress in decomposition and distribution of traditional steps in scholarly publishing value chain
Selected Grid vs. the vision
• SRB
  • Distributed, virtualized file system
  • Support for very large amounts of data
  • Data grid compatible with computational grid
  • Possible as backend persistent store for other repository systems (e.g., DSpace, Fedora)
• Chimera
  • Derived data as first class information entities
  • Information model (Virtual Data System)
  • Process model (Virtual Data Language)
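Treating derived data as a first-class entity means recording how a dataset is produced, so that it can be regenerated from its description. The sketch below illustrates that idea in plain Python; it is not Virtual Data Language syntax, and the catalog class, transformation names, and sample values are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical named transformations over lists of numbers.
TRANSFORMATIONS: Dict[str, Callable[[List[float], float], List[float]]] = {
    "scale": lambda data, factor: [x * factor for x in data],
    "clip": lambda data, limit: [min(x, limit) for x in data],
}


@dataclass(frozen=True)
class Derivation:
    output: str          # name of the derived dataset
    transformation: str  # key into TRANSFORMATIONS
    source: str          # name of the input dataset
    argument: float      # single parameter passed to the transformation


class VirtualDataCatalog:
    """Stores raw data plus recipes; derived data is computed on request."""

    def __init__(self, raw: Dict[str, List[float]]) -> None:
        self._raw = dict(raw)
        self._derivations: Dict[str, Derivation] = {}

    def describe(self, derivation: Derivation) -> None:
        self._derivations[derivation.output] = derivation

    def materialize(self, name: str) -> List[float]:
        """Return raw data directly, or rebuild derived data recursively."""
        if name in self._raw:
            return self._raw[name]
        d = self._derivations[name]
        source_data = self.materialize(d.source)
        return TRANSFORMATIONS[d.transformation](source_data, d.argument)


catalog = VirtualDataCatalog({"raw": [1, 2, 3]})
catalog.describe(Derivation("scaled", "scale", "raw", 10))
catalog.describe(Derivation("clipped", "clip", "scaled", 20))

clipped = catalog.materialize("clipped")
```

Because `clipped` is stored only as a recipe, its full derivation chain is always available alongside the data itself.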
The architecture challenge
• Current situation
  • Heterogeneous repository systems
  • Heterogeneous object models (or no object model)
  • Multiple protocols and service APIs
  • Services lacking formal interface definitions
• Can these resources ever play nicely together?
• Need common abstractions…
Solution: Information Network Overlay
[Diagram layers, top to bottom: Client Layer; Information Network API; Network Representation Layer; Source Layer (Publisher Repositories, Document Repositories, Web Resources, Data Stores, Databases)]
Translate to Technical Requirements
• Rich information objects
  • Integration of local and remote sources
  • Mixed genre
• Dynamic information objects
  • Integration with local and distributed services
• Graph-based information model to enable overlay
  • Nodes are information objects
  • Edges are relationships among those objects
• Service-oriented process model:
  • Coordination of information entities and services
  • Workflow; multi-step executions; transformations
• Interoperable access and management API for objects
• Fine-grained access control
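The graph-based overlay requirement can be sketched as a small structure in which nodes carry a pointer to their underlying source and edges carry a relationship name. The `InformationNetwork` class, the node identifiers, and the relationship names below are hypothetical.

```python
from collections import deque
from typing import Dict, List, Tuple


class InformationNetwork:
    """Overlay graph: nodes are information objects, edges are relationships."""

    def __init__(self) -> None:
        self.nodes: Dict[str, dict] = {}
        self.edges: List[Tuple[str, str, str]] = []  # (subject, relation, target)

    def add_node(self, node_id: str, source: str, **attributes: str) -> None:
        # "source" records which underlying system holds the object.
        self.nodes[node_id] = {"source": source, **attributes}

    def add_edge(self, subject: str, relation: str, target: str) -> None:
        if subject not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.append((subject, relation, target))

    def neighbors(self, node_id: str) -> List[Tuple[str, str]]:
        return [(r, t) for s, r, t in self.edges if s == node_id]

    def reachable(self, start: str) -> List[str]:
        """Breadth-first list of nodes reachable from start, excluding start."""
        seen = {start}
        queue = deque([start])
        order: List[str] = []
        while queue:
            current = queue.popleft()
            for _, target in self.neighbors(current):
                if target not in seen:
                    seen.add(target)
                    order.append(target)
                    queue.append(target)
        return order


network = InformationNetwork()
network.add_node("article:1", source="publisher repository", genre="text")
network.add_node("dataset:9", source="data store", genre="data")
network.add_node("simulation:3", source="web resource", genre="simulation")
network.add_edge("article:1", "cites", "dataset:9")
network.add_edge("dataset:9", "producedBy", "simulation:3")

related = network.reachable("article:1")
```

The overlay never copies content; each node only records where the object lives, so heterogeneous sources can sit behind one graph.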
Pathways Project
• National Science Foundation Funding 2004-2007 (http://www.infosci.cornell.edu/pathways)
• Van de Sompel, Payette, Erickson, Lagoze, Warner. Rethinking Scholarly Communication: Building the System that Scholars Deserve. D-Lib Magazine, September 2004.
Vision: “Graphite” Information Model
Most things can be represented as a graph of nodes and arcs.
Cornell/LANL Pathways Project
Service-oriented process model
• Key challenge is to integrate a distributed service model within the information network overlay.
• Technologies to watch
  • OWL-S (W3C)
    • Ontology-based service descriptions
    • Service modeled within semantic web
  • NetKernel (1060 Research)
    • Enables a graph-like overlay for URI-identified resources
    • Information entities and services can be accommodated
  • Grid technologies (Open Grid Services Infrastructure)
    • Enables creation of ‘virtual organizations’ that can share distributed computational resources and services
    • Web services and WSDL in latest incarnation
The W3C’s Take on Things…
• People and communities have data stores and programs to share
• Vision: Expanding Web of machine accessible resources
• Key Web technologies:
  • Web Services: Web of programs*
    • Standards for interactions between programs on the Web
    • Easier to expose and use services
  • Semantic Web: Web of data*
    • Standards for data, relationships, descriptions on the Web
    • Easier to Search for, Share, Aggregate, Extend information
* abstractions :-)
Source: http://www.w3.org/2004/Talks/0923-sb-whoiw3c/slide12-0.html
Beyond Storage
Must understand new scholarly activities and new technical developments… so we can frame repositories within a broader service-oriented architecture.
What basic changes can occur now?
• Expose repositories as web services
• Support compound digital objects
  • Local and remote content
  • Any media type
• Provide a way to associate services with objects (dynamic views)
• Provide ability to assert relationships among objects
• Move toward ontology-based metadata
• Enable easy integration of repository with other services
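A compound digital object with associated services might look like the sketch below: components hold either local bytes or a remote reference, and named views are functions attached to the object. All class names, the example URL, and the “toc” view are assumptions for illustration, not any repository’s actual object model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Component:
    label: str
    media_type: str
    content: Optional[bytes] = None  # local content, stored with the object
    url: Optional[str] = None        # remote content, held by reference


@dataclass
class CompoundObject:
    identifier: str
    components: List[Component] = field(default_factory=list)
    views: Dict[str, Callable[["CompoundObject"], object]] = field(
        default_factory=dict
    )

    def attach_view(self, name: str,
                    service: Callable[["CompoundObject"], object]) -> None:
        """Associate a service with this object under a view name."""
        self.views[name] = service

    def disseminate(self, name: str) -> object:
        """Produce a dynamic view by running the associated service."""
        return self.views[name](self)


def table_of_contents(obj: CompoundObject) -> List[str]:
    # Example service: summarize each component and where it lives.
    return [
        f"{c.label} ({c.media_type}, {'remote' if c.url else 'local'})"
        for c in obj.components
    ]


paper = CompoundObject(
    "obj:paper-1",
    components=[
        Component("paper", "application/pdf", content=b"%PDF-1.4"),
        Component("dataset", "text/csv",
                  url="https://data.example.org/run-17.csv"),  # hypothetical
    ],
)
paper.attach_view("toc", table_of_contents)

toc = paper.disseminate("toc")
```

New views can be added without touching the stored components, which is what makes the views “dynamic”.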
Research Challenges
• Enable low barrier to entry
  • Simple protocols (e.g., OAI)
  • Light-weight (REST vs. SOAP?)
  • Simple tools to create overlays
  • Note complexity in setting up Grid-based services
• Integration of information and service models
• Security and Trust
  • Authentication and trust among repositories and services
  • Interoperability of authorization policy
• Preservation
  • Distributed and dynamic resources
Beyond Storage
Questions and Discussion!