The Technical Infrastructure of the NSDL. Dean Krafft, Cornell University dean@cs.cornell.edu. NSDL Technical Overview. Structure of the talk: NSDL 1.0 Architecture and Lessons Learned The Fedora-based NSDL Data Repository (NDR) and NSDL 2.0
The Technical Infrastructure of the NSDL • Dean Krafft, Cornell University • dean@cs.cornell.edu
NSDL Technical Overview Structure of the talk: • NSDL 1.0 Architecture and Lessons Learned • The Fedora-based NSDL Data Repository (NDR) and NSDL 2.0 • Inspiring Contribution and Collaboration - ExpertVoices • Other NSDL 2.0 Services and Tools • Q&A
What is the NSDL? • An NSF-funded $20 million/year program in Science, Technology, Engineering and Mathematics (STEM) education • A digital library describing over a million carefully selected online STEM resources from over 100 collections (at http://nsdl.org) • A core integration team (Cornell, UCAR, Columbia) working with 9 “pathways” portals and over 200 NSF grantees • A large community of researchers, librarians, content providers, developers, students, and teachers
What are the building blocks? • A distributed set of NSDL collections • A central repository of aggregated information about Science, Technology, Engineering, and Mathematics digital resources • A set of services that build on the repository, initially search and archive • A set of portals that expose NSDL STEM resources to a variety of user communities
Infrastructure overview: NSDL 1.0 [Diagram. Components: STEM Collections on the Web, Collection Registration System, Central Metadata Repository, Search Service, Archive Service, NSDL.org Portal. Protocols shown: OAI-PMH, HTTP, REST, SQL.]
NSDL 1.0 Ingest • Create a “union catalog” of Dublin Core metadata records for STEM resources • Harvest those records from collections using OAI-PMH (openarchives.org) • Normalize and augment metadata records (change to qualified DC) • Store records in an Oracle DB and re-serve qualified DC through OAI-PMH
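The harvest step above can be illustrated with a minimal sketch of an OAI-PMH `ListRecords` exchange. The verb, the `metadataPrefix`, `from`, and `resumptionToken` arguments, and the XML namespaces come from the OAI-PMH and Dublin Core specifications; the base URL, the sample record, and the function names are hypothetical and are not taken from the NSDL ingest code. The sample response is parsed from a string so the sketch runs without network access.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if token:
        # When resuming, the token is the only argument allowed besides the verb.
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            params["from"] = from_date  # selective (incremental) harvesting
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return ([(identifier, title), ...], resumption_token_or_None)."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter(OAI_NS + "record"):
        identifier = rec.findtext(OAI_NS + "header/" + OAI_NS + "identifier")
        title = rec.findtext(".//" + DC_NS + "title")
        records.append((identifier, title))
    token = root.findtext(".//" + OAI_NS + "resumptionToken")
    return records, (token or None)


# Hypothetical response from a collection's OAI server (illustrative data only).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:1</identifier>
        <datestamp>2006-01-15</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Plate Tectonics Animation</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

records, token = parse_list_records(SAMPLE)
next_url = list_records_url("http://example.org/oai", token=token)
```

A harvester would keep requesting `next_url` until the response carries no resumption token, then hand the collected records to the normalization step.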
Metadata Aggregation [Diagram. Components: Collection OAI Servers …, Harvest Management/Collection Registration System, Metadata Ingest Service, Metadata Repository (Oracle DB), OAI-PMH Server. Labels shown: OAI-PMH, NSDL Dublin Core, NSDL DC.]
NSDL 1.0 Search • Harvest metadata records from MR using OAI-PMH • Crawl URLs specified in the metadata using Nutch • Build a search index using metadata plus full-text of available content pages • Expose search in a web portal at nsdl.org for K-gray access to NSDL resources
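The idea of indexing metadata together with crawled full text can be sketched with a toy inverted index. This is not how Lucene or Nutch work internally; it is a minimal illustration, and the document URLs, field names, and functions are assumptions made up for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """docs: {url: {"metadata": str, "fulltext": str}} -> {term: set of urls}."""
    index = defaultdict(set)
    for url, doc in docs.items():
        # Metadata and crawled page text feed the same index entry for the URL.
        combined = doc["metadata"] + " " + doc.get("fulltext", "")
        for term in tokenize(combined):
            index[term].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every query term, sorted."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(hits)


# Illustrative documents only.
docs = {
    "http://example.org/tsunami": {
        "metadata": "Tsunami basics, grades 6-8",
        "fulltext": "How earthquakes under the ocean generate waves",
    },
    "http://example.org/quakes": {
        "metadata": "Earthquake hazards",
        "fulltext": "Seismic waves and fault lines",
    },
}
index = build_index(docs)
print(search(index, "tsunami waves"))  # matches a metadata term plus a full-text term
```

The query "tsunami waves" only succeeds because the metadata supplies "tsunami" and the page text supplies "waves", which is the benefit of combining the two sources.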
Search Architecture [Diagram. Components: Metadata Repository, Search and Discovery Server, OAI Harvester, Nutch Harvester, Lucene index generator, Lucene Query Engine, REST Query Interface, Web Content, Pathway Portals …, NSDL.org Portal. Protocols shown: OAI-PMH, http/ftp.]
NSDL 1.0 Lessons • Rather than one portal for everyone, support communities with common interests: Pathways now provide discipline- and area-specific portals • Metadata is expensive: unlike traditional library catalogs (e.g. records shared through OCLC), digital collections have very “mixed quality” metadata, with unusual and inconsistent coding • On the good side: the Oracle DB and OAI-PMH server scaled successfully to over 1 million catalog records
NSDL 1.0 Lessons continued • OAI-provided collections need 3 types of expertise: domain (resources & pedagogy), metadata (vocabulary & formatting), and technical (XML schema, UTF-8, HTTP, OAI-PMH). • In many cases it took several months from first contact to successful OAI harvest, and the average harvest failure rate has stayed at 25%-50%, with only 23% of those failures being transient • Incremental harvesting is fundamental to efficient processing, but problematic: issues with persisting deleted records and recovering from partial harvests • Result: some automation, but high people cost
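The two incremental-harvesting problems named above (deleted records and partial harvests) can be sketched as follows. In OAI-PMH a deleted record is signalled by `status="deleted"` on its header; here that is modeled as a `deleted` flag on a plain dictionary. The record layout and the function are assumptions for illustration, not the NSDL ingest implementation.

```python
def apply_harvest(store, batch, last_harvest):
    """Apply one incremental batch to a local store, all or nothing.

    store: {identifier: metadata dict}
    batch: list of dicts with "identifier", "datestamp", and either
           "deleted": True or a "metadata" dict
    Returns the new high-water-mark datestamp to use as the next "from" value.
    """
    staged = dict(store)  # work on a copy so a failure leaves the store untouched
    newest = last_harvest
    for rec in batch:
        identifier = rec["identifier"]
        if rec.get("deleted", False):
            staged.pop(identifier, None)  # honor the provider's deletion
        else:
            staged[identifier] = rec["metadata"]
        # ISO 8601 datestamps in the same granularity compare correctly as strings.
        newest = max(newest, rec["datestamp"])
    # Commit only after the whole batch has been processed without error.
    store.clear()
    store.update(staged)
    return newest


# Illustrative data only.
store = {"oai:a:1": {"title": "Old record"}}
batch = [
    {"identifier": "oai:a:1", "datestamp": "2006-02-01", "deleted": True},
    {"identifier": "oai:a:2", "datestamp": "2006-02-03", "metadata": {"title": "New record"}},
]
mark = apply_harvest(store, batch, "2006-01-01")
```

If a batch fails partway through (for example on a malformed record), neither the store nor the high-water mark changes, so the same date range can simply be harvested again.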
NSDL 1.0 Summary • Metadata Repository was quick to implement using known technologies, but • Limited model • Metadata-centric orientation • No content – only metadata • Limited relationships – collection/item • Limits on context, structure, and access • Severe limits on contribution and collaboration • One-way data flow: NSDL → Users
Going beyond the card catalog • Create an NSDL that guides not just resource discovery, but resource selection, use, and contribution • Supports creating “context” for resources • Presents resources in context: in a lesson plan; with ratings; correlated with education standards • Supports creating a permanent archive of resources • Enables community tools for structuring, evaluation, annotation, contribution, collaboration • Goal: Create a dynamic, living library
NSDL 2.0: NSDL Data Repository • Goals: • Architecture of participation: service-based, not a monolithic application/single user experience • Remixable data sources and data transformations • Harnessing (and capturing) collective intelligence • A free market of millions of inter-related resources (create the “long tail”) • Two-way data flow: NSDL ↔ users • Solution: Fedora-based NSDL Data Repository
Fedora: the NDR middleware • A Flexible, Extensible Digital Object Repository Architecture (http://www.fedora.info) • Open source project with $2.2 million in Mellon funding 2002-2007 • Collaboration of Cornell and Univ. of Virginia • Key funded users include: • eSciDoc project (collaboration of the Max Planck Society and FIZ Karlsruhe) • VTLS Corp., Harris Corp., Library of Congress • Australian Research Repositories Online to the World (ARROW) • Royal Library Denmark, National Library, and DTU
The Fedora Vision: A Repository for Rich Information Networks
What is Fedora? • An architecture, toolkit, and implementation: middleware, not a vertical application • DSpace in contrast: a vertical application with a fixed workflow targeted at users • Stores arbitrary internal and external digital objects, disseminations (transformations and combinations), relationships among objects • Entirely SOAP/REST based, disseminations are URLs • XML data store; RDBMS cache; RDF triplestore supports relationship queries
Fedora Key Features • Content aggregation • Digital object model to combine information entities in novel ways • Knowledge integration • Ontology-based relationships among objects • Information reuse • Create secondary, tertiary objects • Information transformation • Combine objects with computational services • Collaboration and contribution • Enable annotation, info sharing, workflow, contextualization • Information management and preservation • XML-based object storage • Service-oriented architecture; web services • Store relationships and service linkages with objects
Fedora Digital Object Model: Component View • Digital object identifier: Persistent ID (PID) • Reserved Datastreams (key object metadata): Relations (RELS-EXT), Dublin Core (DC), Audit Trail (AUDIT) • Datastreams: set of content or metadata items (local or external URL redirects) • Disseminators: web-service methods for distributing views of recombined content, including a Default Disseminator
Implementing the NDR with Fedora • Multiple Object Types: • Resources (with local or remote content) • Metadata • Aggregations (collections) • Metadata Providers (branding) • Agents • Relationships with arbitrary graph queries: • Structural (part of) • Equivalence • Annotation
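The relationship graph described above can be pictured as a set of subject-predicate-object triples with pattern matching and simple traversal. The sketch below is a tiny in-memory stand-in for a triplestore; the object identifiers and the predicate names (`memberOf`, `metadataFor`, `annotates`) are hypothetical and are not the NDR's actual vocabulary.

```python
class TripleStore:
    """Minimal in-memory store of (subject, predicate, object) triples."""

    def __init__(self):
        self.triples = set()

    def add(self, s, p, o):
        self.triples.add((s, p, o))

    def match(self, s=None, p=None, o=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        return sorted(
            t for t in self.triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        )

    def reachable(self, target, predicate):
        """Everything linked to target, directly or indirectly, via predicate."""
        seen, frontier = set(), [target]
        while frontier:
            node = frontier.pop()
            for subj, _, _ in self.match(p=predicate, o=node):
                if subj not in seen:
                    seen.add(subj)
                    frontier.append(subj)
        return sorted(seen)


# Illustrative objects: a resource in a nested aggregation, with metadata
# and an annotation attached to it.
store = TripleStore()
store.add("resource:1", "memberOf", "agg:earth")
store.add("agg:earth", "memberOf", "agg:science")
store.add("md:1", "metadataFor", "resource:1")
store.add("entry:9", "annotates", "resource:1")

print(store.reachable("agg:science", "memberOf"))
print(store.match(o="resource:1"))
```

The first query walks the structural (part of) links transitively; the second collects everything that points at a resource, whether metadata or annotation.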
Draft NDR API Characteristics • Uses REST calls for all interactions • Specializes Fedora for NDR objects/relationships • Disseminations allow combining metadata from multiple sources, or related content • Authentication: Requests signed with private key associated with an agent • Authorization: Agent can become a metadata provider or aggregator; can create resources • Documentation being developed at http://ndr.comm.nsdl.org
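The authentication bullet above describes requests signed with an agent's private key. The slides do not give the signing scheme, and Python's standard library has no public-key signing, so the sketch below substitutes HMAC-SHA256 with a shared secret purely to show the sign-then-verify shape. The header names, the request path, and the agent identifier are all hypothetical and are not part of the NDR API.

```python
import hashlib
import hmac


def sign_request(method, path, body, agent_id, secret):
    """Return headers carrying the agent id and a signature over the request.

    Stand-in scheme: HMAC-SHA256 over method, path, and a hash of the body.
    """
    message = "\n".join(
        [method.upper(), path, hashlib.sha256(body).hexdigest()]
    ).encode("utf-8")
    signature = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return {"X-Agent": agent_id, "X-Signature": signature}


def verify_request(method, path, body, headers, secrets_by_agent):
    """Recompute the signature for the claimed agent and compare safely."""
    agent_id = headers.get("X-Agent")
    secret = secrets_by_agent.get(agent_id)
    if secret is None:
        return False  # unknown agent
    expected = sign_request(method, path, body, agent_id, secret)["X-Signature"]
    return hmac.compare_digest(expected, headers.get("X-Signature", ""))


# Illustrative values only.
secrets_by_agent = {"agent:demo": b"not-a-real-key"}
body = b"<resource><url>http://example.org/page</url></resource>"
headers = sign_request("POST", "/hypothetical/addResource", body, "agent:demo", b"not-a-real-key")
print(verify_request("POST", "/hypothetical/addResource", body, headers, secrets_by_agent))
```

With a real private-key scheme the server would hold only the agent's public key, so a compromise of the server would not let an attacker forge requests; the shared-secret stand-in here does not have that property.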
An Information Network Overlay • Think of the NDR as a lens for viewing science content on the net • Content can be: • Local: stored directly in the NDR • Remote: accessed through a URL • Computed: derived from a database or web service • Archived: an older version stored at SDSC • It all has a repository-based URL
Network Overlay View [Diagram. Layers shown: User View, API/UI, Repository View with Relations & Annotations, Resources on the Web.]
Status of the NDR • Repository in test load • over 875,000 metadata records • over 2 million digital objects • Over 163 million RDF triples (lots) • Scaling challenges • moved to 64-bit architecture with 32GB memory • need to carefully structure RDF queries • can scale current system by factor of five • need to move to more powerful triplestore (Oracle) • Estimating fully operational beta version of new NDR in June
How should we use the NDR? • The NDR provides powerful capabilities for: • Creating context around resources • Enabling the NSDL community to directly contribute resources and context • Representing a web of relationships among science resources and information about those resources • How do we use it? Here’s one specific example …
Issues in STEM Education • Issue: Need to support scientific inquiry • Issue: Students need a better understanding of the processes of scientific research • Issue: Teachers are often under-prepared to teach science and math • Issue: Scientists need tools to make science and math research more available
Addressing the Needs • In Response: NSDL is building an educational tool that… • Models scientific inquiry and exposes the processes of scientific research • Promotes and facilitates conversations between research and education communities • Brings content expertise into the classroom to support under-prepared teachers • Allows scientists, teachers, and media specialists to collaboratively develop instructional context around NSDL resources
What is Expert Voices? • A system using blogging technology to: • Support STEM conversations among scientists, teachers and students • Tie NSDL resources to real-world science news • Create context for resources to enhance discovery, selection and use • Enable NSDL community members to become NSDL contributors: of resources, questions, reviews, annotations, and metadata • Expert Voices ≠ LiveJournal • Contributors are carefully selected, contributions are about science, the process of science, and education
Expert Voices As An Educational Tool • Topic-based discussion (e.g. tsunamis) with pointers to related resources • Research outreach (Criterion 2) – explaining and documenting NSF-funded research • Experts can add resources with topical context to the NSDL • Resources can be reviewed and annotated • Question/answer and discussion forum: scientist ↔ teacher ↔ student ↔ librarian
Broadening Participation: An Expert Voices Learning Scenario • “Hurricane Season Blog” run by a National Weather Service hurricane expert, an Earth Science teacher, and a school media specialist familiar with NSDL resources • Expert creates an entry for Hurricane Gertrude • “On track to hit Ft. Lauderdale in 72 hours” • “Currently undergoing eyewall replacement cycle” • “Expecting 15 foot storm surge” • Media specialist adds links to NSDL resources: Hurricane Hunters site, latest satellite photos, and USGS flooding and flood plain site (storm surge context) • Teacher makes connections to relevant standards and appropriate pedagogy for use by other teachers • Students experience engaging real-time, real-world applications of science lessons
Broadening Participation: An Expert Voices Outreach Scenario • NSF grantee: Bioluminescence researcher wants to make research K-12 accessible • Creates an Expert Voices conversation • Enables his students and researchers to document process and results – how science really works • Writes about publications and educational resources (e.g. www.photobiology.info) • Adds these to the NSDL, creating audience-level metadata • Entries serve as annotations that create K-12 context for the college-level research
Expert Voices Implementation • Open source multi-user blogging system • Published entries become NSDL resources • Owner controls publication of entries and visibility of comments • Entries can contain linked references to NSDL resources, references to URLs that should become resources, and new resource metadata • Integrated with NSDL community sign-on
Expert Voices Implementation • Initial blog system is multi-user WordPress • WordPress plug-ins provide NDR integration and Shibboleth authentication • Publication of blog entry creates: • Content, as a new resource with simple metadata • New NDR resources • New metadata for any referenced resources in content • Graph of relationships between entry and all referenced resources • Blog available as independent RSS feed
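What publishing an entry produces can be sketched as a function that turns entry text into resources, metadata records, and relationship triples. This is only an illustration of the list above, not the WordPress plug-in's logic; the function, the dictionary layout, and the `annotates` predicate are assumptions.

```python
import re

URL_PATTERN = re.compile(r"https?://[^\s)>\"]+")


def publish_entry(entry_url, title, text, known_resources):
    """Derive the repository objects implied by publishing one blog entry."""
    # Referenced URLs, de-duplicated while preserving order of appearance.
    referenced = list(dict.fromkeys(URL_PATTERN.findall(text)))
    # The entry itself is a new resource; so is any referenced URL not yet known.
    new_resources = [entry_url] + [u for u in referenced if u not in known_resources]
    # Simple metadata for the entry, plus new metadata for each referenced resource.
    metadata = [{"about": entry_url, "title": title}]
    metadata += [{"about": u, "source": entry_url} for u in referenced]
    # Relationship graph between the entry and everything it references.
    triples = [(entry_url, "annotates", u) for u in referenced]
    return {"new_resources": new_resources, "metadata": metadata, "triples": triples}


# Illustrative entry loosely based on the hurricane scenario.
result = publish_entry(
    "http://blog.example.org/hurricane/1",
    "Storm surge update",
    "See http://example.org/hunters and http://example.org/surge for context",
    known_resources={"http://example.org/hunters"},
)
```

In this example one referenced URL is already a known resource and one is not, so the entry adds two new resources (itself and the unknown URL), three metadata records, and two relationships.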
NDR Entry for Expert Voices [Diagram. Nodes: Existing Collection, Topic-based Blog, Blog Entry, Referenced New Resource 1, Referenced Existing Resource 2, New Metadata, New Audience MD, Metadata Providers. Edge labels: Member of, Metadata for, Annotates, Inferred relationship between resources.]
NDR Application: OnRamp • NDR-integrated multi-user, multi-project content management system • Supports NSDL single sign-on and group management • Decentralized workflow for the creation and distribution of both simple and complex content • Disseminates content in multiple publication and online forms • Delivery estimated 3Q06
NDR Application: Instructional Architect • Created by Mimi Recker and colleagues at Utah State University • Teacher develops a lesson plan, incorporating NSDL and other resources • Assigns subject, grade level, ed standard • Distributes to class or public • Available now, with NDR integration in design
NDR Application: Integrated Wiki • Community of approved contributors (e.g. teachers, librarians, scientists) are granted edit access on OpenNSDL wiki • New resources and metadata are created as wiki pages and reflected into the NDR • Non-wiki-based NDR resources and metadata are displayed as read-only wiki pages, subject to comment and linking • User and project pages organize NDR resources
NDR Application: Content Assignment Tool • Developed by Anne Diekema, Elizabeth Liddy, et al. at the Syracuse University Center for Natural Language Processing • Uses text analysis and machine learning to suggest Educational Standards alignment for resources • Content expert assigns standard, and system learns from the assignment • Standalone tool available now; standards associated with resources in the NDR by 3Q06
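The suggest-then-learn loop described above can be sketched with a deliberately simple word-overlap scorer. This is not the method used by the Syracuse tool, whose text analysis and machine learning are not detailed in the slides; the class, the scoring rule, and the standard identifiers are hypothetical.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


class StandardSuggester:
    """Suggests standards by counting terms shared with past expert assignments."""

    def __init__(self):
        self.term_counts = defaultdict(Counter)  # standard id -> term frequencies

    def learn(self, text, standard):
        """Record an expert's assignment of a standard to a resource description."""
        self.term_counts[standard].update(tokenize(text))

    def suggest(self, text, top_n=3):
        """Rank standards by how often their learned terms occur in the text."""
        terms = set(tokenize(text))
        scores = {
            std: sum(counts[t] for t in terms)
            for std, counts in self.term_counts.items()
        }
        ranked = sorted((s for s in scores if scores[s] > 0), key=lambda s: (-scores[s], s))
        return ranked[:top_n]


# Illustrative training data; the standard ids are made up.
suggester = StandardSuggester()
suggester.learn("Plate tectonics and earthquakes shape the crust", "ESS-plates")
suggester.learn("Cells divide by mitosis", "LS-cells")
print(suggester.suggest("Why earthquakes happen along plate boundaries"))
```

Each call to `learn` plays the role of the content expert confirming an assignment, so later suggestions for similar text improve as more assignments accumulate.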
Other applications in development • Automated grade-level assignment based on vocabulary analysis (San Diego Supercomputer Center) • iVia-based Expert-Guided crawl: Tool for Pathways and others to turn websites into resource collections (UC Riverside) • Automated subject assignment (UC Riverside) • MyNSDL: Bookmark and tag STEM resources within and outside the NSDL (Cornell)
NSDL 2.0 Ecosystem [Diagram. Components: STEM Collections, Search Service, Archive Service, Fedora-based NDR, … Protocols shown: OAI-PMH, HTTP, REST, NDR API.]