SEAD. Sustainable Environment – Actionable Data. CNI Fall Members Meeting Arlington, VA 12/12/2011. Robert H. McDonald SEAD Sr. Personnel Assoc. Dean/Associate Director Indiana University. Margaret Hedstrom SEAD PI/Project Director Professor & Associate Dean UM School of Information.
NSF DataNet Program • New types of organizations that integrate library & archival sciences, cyberinfrastructure, computer & information sciences, & domain science expertise • Provide reliable digital preservation, access, integration, and analysis capabilities for science and/or engineering data over a decades-long timeline • Continuously anticipate and adapt to changes in technologies and in user needs and expectations • Engage in research to drive the leading edge forward • Serve as component elements of an interoperable data preservation and access network http://www.nsf.gov/funding/pgm_summ.jsp?pims_id=503141
Partners • SEAD’s Unique Contributions • Address domain-driven needs & requirements • Serve scientists and researchers in the “long tail” • Integrate existing technologies, tools & services (rather than build new from scratch)
Data challenges Heterogeneity of all kinds Multiple scales Multidisciplinary Many small datasets
The long tail of scientific research • Small and derived data sets • Heterogeneous data • Multiple sources of data • Short-lived data with long-term value • Value of data grows when combined & integrated
SEAD’s Goals • Provide data services that address the needs of researchers working toward sustainability • Integrate these services into a generalizable “Active and Social Curation” infrastructure suited to the social structure and economics of long-tail research communities • Develop capabilities to package and migrate the most valuable datasets to a federated repository infrastructure for long-term preservation • Provide education, outreach, & training to disseminate SEAD’s contributions to other projects & communities
SEAD’s Strategy • Leverage social media for discovery of data, interest, and expertise • Move data curation upstream in the data life cycle • Involve domain scientists in setting priorities for evolution of data and services • Take advantage of existing infrastructures (Institutional Repositories, ICPSR) for long-term preservation
Active and Social Curation • Engage researchers during projects, not at the end • Automatically capture metadata as defined by the data producers • Provide facilities for commentary, recommendations, and mark-up of data • Further reduce costs by re-engineering curation processes to leverage this rich metadata and volunteered effort
Active Curation Model [Diagram. Labels: Workflows; Data; Active Curation; Social Media; Metadata; Review; Rating; Commenting]
SEAD Status SEAD start date: 10/1/2011 In other words, SEAD is not ready to accept your data!
SEAD Personnel • Margaret Hedstrom, PI (Michigan) • Praveen Kumar, co-PI (Illinois) • Jim Myers, co-PI (RPI) • Beth Plale, co-PI (Indiana) • Ann Zimmerman, co-PI/Project Manager (Michigan) • George Alter (ICPSR) • Bryan Beecher (ICPSR) • Katy Börner (Indiana) • Robert McDonald (Indiana) • Jude Yew, Post-doc (Michigan) • + many more to come
SEAD TEAM University of Michigan: Margaret Hedstrom (UM PI), Ann Zimmerman (Co-PI and Project Manager), George Alter, Bryan Beecher, Charles Severance, Karen Woollams, Jude Yew. Indiana University: Beth Plale (IU PI), Katy Börner, Robert H. McDonald, Kavitha Chandrasekar, Robert Ping, Stacy Kowalczyk, Robert Light. University of Illinois: Praveen Kumar (UIUC PI), Rob Kooper, Luigi Marini, Terry McLaren. Rensselaer Polytechnic Institute: Jim Myers (RPI PI), Ram Prasanna Govind Krishnan, Lindsay Todd, Adam Wilson.
SEAD Cyberinfrastructure • An international resource for sustainability science • Novel technical and business approaches to supporting the long-tail of research data • Lifecycle support: actionable data services integrated with curation and preservation infrastructure
Key Challenges for SEAD Cyberinfrastructure • Managed data storage and services are expensive! • Begging for metadata doesn’t work! • Curation and preservation are time consuming! • The long tail is not standardized! • Data collections are always missing something valuable! • Data models evolve! • Cyberinfrastructure is obsolete by the time you build it! • Building community as you leverage cyberinfrastructure
SEAD: Social Networking • Co-authorship • Co-funding • Micro-citation • Shared project repositories • Shared tags • Threaded discussions • Quoting, forwarding, …
Linked Data and Repositories • Tag and annotate data • Overlay it with reference data • Organize it in domain terminology • Link it to people, papers, projects, conversations…
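The linking idea on this slide can be illustrated as subject-predicate-object statements. The sketch below is a minimal in-memory example and not SEAD's implementation; the `TripleStore` class, the predicate names (`hasTag`, `createdBy`, `citedIn`), and all identifiers are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class TripleStore:
    """Tiny in-memory store of (subject, predicate, object) statements."""

    def __init__(self):
        # Maps each subject to the set of (predicate, object) pairs about it.
        self._by_subject = defaultdict(set)

    def add(self, subject, predicate, obj):
        self._by_subject[subject].add((predicate, obj))

    def objects(self, subject, predicate):
        """Everything a subject points to through one predicate."""
        pairs = self._by_subject.get(subject, set())
        return sorted(o for p, o in pairs if p == predicate)

    def subjects_linked_to(self, obj):
        """Every subject with at least one statement pointing at obj."""
        return sorted(
            s for s, pairs in self._by_subject.items()
            if any(o == obj for _, o in pairs)
        )


# Hypothetical identifiers: tag a dataset, then link it to a person and a paper.
store = TripleStore()
store.add("dataset:soil-moisture", "hasTag", "hydrology")
store.add("dataset:soil-moisture", "createdBy", "person:a-researcher")
store.add("dataset:soil-moisture", "citedIn", "paper:example-2011")
store.add("dataset:stream-flow", "hasTag", "hydrology")

print(store.objects("dataset:soil-moisture", "hasTag"))
print(store.subjects_linked_to("hydrology"))
```

Because tags, people, and papers are all stored as the same kind of statement, one lookup such as `subjects_linked_to` serves discovery by topic, by author, or by publication.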
KEY SEAD Questions • What could SEAD capture when? • How can SEAD provide direct value to data producers, users, and curators? • How can robust web-services and social computing lower barriers and reduce/realign costs?
SEAD: Active Content Repository • With the ‘Big Picture’ graph in-hand, curators can: • Focus on what to curate and when, • Automate parts of the process • Use existing/emerging technologies for packaging and preserving datasets • Better manage federated repositories
SEAD: Leveraging Existing Resources • Cyberinfrastructure • IU Data Capacitor/HPC Capabilities • UIUC/NCSA HPC Capabilities • Rensselaer CCNI Capabilities • Repositories • UM Deep Blue • IU ScholarWorks • ICPSR Repository • UIUC IDEALS
SEAD LayerCake View • Services over an active content layer that is backed by/harvested into a federated archive infrastructure based on institutional resources
CI Technical Approach [Architecture diagram. Labels: OAIS Repository Federation; Active and Social Curation; Curation Boundary; Automated Curation Workflow/Rule Engine; Metadata Management; Data Acquisition, Analysis and Simulation; Scholarly Communication; Operates on Metadata, Content Objects and Trigger Events; DDI3, METS, PREMIS, MODS, DC, SensorML, OGC, …; Ingest scripts: fixity, integrity, authentication, transformation; Ingest, AIPs; VIVO/Linked Data; Digital Repository Federation (OAIS compliant); Appraisal and Selection; Active Content Repository; Compound Objects - OAI-ORE; Preservation Actions; Dissemination Packages; Wide-Area File System; Search, Browse, Annotation, Visualization Tools; Migration and Emulation Tools; Use, Reuse, Repurposing Tools; Access Mechanisms and E-Scholarship Services; Contributor; User]
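The diagram lists fixity among the ingest-script checks. As a minimal sketch of what such a check could look like, and not SEAD's actual ingest code, the example below records a SHA-256 digest and byte count at ingest time and verifies them later. The function names, the record layout, and the sample file name are assumptions made for illustration.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def record_fixity(name: str, data: bytes) -> dict:
    """Capture fixity information at ingest time (hypothetical record layout)."""
    return {"name": name, "size": len(data), "sha256": sha256_of(data)}


def verify_fixity(record: dict, data: bytes) -> bool:
    """True only if the content still matches the recorded size and digest."""
    return len(data) == record["size"] and sha256_of(data) == record["sha256"]


# Hypothetical file content standing in for a small long-tail dataset.
original = b"site,date,value\nA,2011-10-01,3.2\n"
record = record_fixity("observations.csv", original)

print(verify_fixity(record, original))                # unchanged content
print(verify_fixity(record, original + b"tampered"))  # altered content
```

Storing the digest alongside the object lets any repository in a federation re-run the same check after a transfer or migration without contacting the original source.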
Toward PetaScale Data • Internet2 upgrade: • Total bandwidth from 100 Gbps to 8.8 Tbps • Moving a petabyte of data will go from 10 days to 25 hrs
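The transfer times quoted above can be sanity-checked with simple arithmetic. The sketch below assumes a decimal petabyte (10^15 bytes) and an ideal link with no protocol overhead. The 10 Gbps and 100 Gbps rates are an assumption introduced here: the slide does not state which per-transfer rates its "10 days" and "25 hrs" figures are based on, only the aggregate totals.

```python
PETABYTE_BYTES = 10**15  # decimal petabyte; an assumption, the slide does not say


def transfer_time_seconds(size_bytes: float, rate_bits_per_second: float) -> float:
    """Ideal transfer time: total bits divided by the link rate."""
    return size_bytes * 8 / rate_bits_per_second


# Candidate rates: two assumed single-link rates plus the quoted aggregate.
rates = [("10 Gbps", 10e9), ("100 Gbps", 100e9), ("8.8 Tbps", 8.8e12)]

for label, rate in rates:
    seconds = transfer_time_seconds(PETABYTE_BYTES, rate)
    print(f"{label}: {seconds / 3600:.1f} hours ({seconds / 86400:.2f} days)")
```

Under these assumptions a petabyte takes a little over 9 days at 10 Gbps and about 22 hours at 100 Gbps, which is close to the slide's rounded figures. At the full 8.8 Tbps aggregate the ideal time would be roughly 15 minutes, so the quoted times most likely describe a single path rather than the whole backbone.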
SEAD 18 Month Prototype Targets for Cyberinfrastructure • Active and Social Content Curation • Pilot Active Content Repository, VIVO deployments • Exemplar services for Data Ingest, Discovery, Re-use, Curation • CI for Long-term Access • Data model, protocol design/development • Pilot Federated Repository infrastructure
SEAD CI QuickView • SEAD will quickly build a repository and data services infrastructure for sustainability research that can be responsively adapted based on community feedback – Community Agile Development • SEAD will leverage existing tools and emerging practices to dramatically enhance the interactions of researchers and data librarians – Active Curation • SEAD’s focus on the long-tail will force an emphasis on ease-of-use and low costs that is critical for long-term sustainability – Leverage Existing Institution Resources for Long-term Access • SEAD will leverage experiences in the sustainability research community to provide guidance for other long-tail communities making the transition to an interdisciplinary, systems-oriented approach to research – Sustainability and Resource Growth Partnership and Collaboration
Acknowledgments SEAD is funded by the National Science Foundation under cooperative agreement #OCI0940824 • For more on SEAD go to: http://sead-data.net • Follow us on Twitter @SEADdatanet