Semantic Application for Digital Repositories

Semantic Application for Digital Repositories Fabrizio Gagliardi EMEA & LATAM Director Technical Computing MSR External Research Microsoft Corporation

Microsoft Research’s Commitment to Science • Advancement of Science • Global Collaboration • Technology Excellence • Interoperability • Putting computing into science… • Applying Microsoft products and research technologies to advance the scientific research and engineering innovation process • Putting science into computing… • Ensuring that research community requirements are factored into future versions of Microsoft software

Scholarly Communications: Project Overview • Current or Completed Projects • Cornell – arXiv.org + Word 2007 (and repository interoperability via SWORD) • MIT / Broad Institute – Authoring (Word 2007) + data for research reproducibility • MSR – CMT++ interoperability with data + metadata transfer/exchange (conference management tool enhancements) • LiveLabs – eJournal publishing online service (community publishing tool) • UC San Diego / PLoS – Semantic mark-up of scholarly articles (+ submission) • Chem4Word with Office & Cambridge University – Create add-in to Word 2007 to facilitate drawing of chemical compounds and equations • Johns Hopkins University – Digital Archive for Astronomy/Astrophysics data (storage, preservation and access) • Planets Project / EU (with MSR – Cambridge) – OpenXML and file format preservation + interoperability • eChemistry Project (Cornell, Penn State, Indiana, Cambridge, Southampton) – ORE exemplar: access to compound chemical info objects (cross-repository access to open chemistry data) • British Library – Researcher Information Centre (RIC) online workflow tool for scientists and researchers • Creative Commons Add-in for Office 2007 – evolving the Word 2003 effort • University of Southampton (UK) – Port ePrints Repository Software for installation on the Windows platform • University of Manchester / “MyExperiment” Project – social networking for scientists • ORE Acceleration Project (OAI – Object Reuse & Exchange) – Alpha spec development • Indiana University – Toolbox for Social Networking (SRT) • UK National Archives – Virtual PC / Emulation of legacy systems to facilitate preservation • National Library of Medicine / NCBI – “PubMed Int’l” UK version of PubMed + NLM DTD • Pipeline • DRIVER 2 (EU) – Infrastructure integration across a network of European research repositories

Research Output Repository Platform Goals • A platform for building services and tools for research output repositories • Papers, Videos, Presentations, Lectures, References, Data, Code, etc. • Relationships between stored entities • Enable a tools and services ecosystem for “research output” repositories on MS technologies Execution • Utilizing OAI-ORE, SWORD, and other community protocols • In development, deployment within MSR in early Q4 • Beta release to the community in late Q4 • Built on SQL Server 2008 + Entity Framework • Using WPF and Silverlight for UI

Research Output Repository Platform Non-goals • A generic platform for asset management • Support the lifecycle of publications • Compete with existing repository solutions Goals • Create a platform for building “research output” repositories • Engage with the digital library and scholarly communications community • Become the “research output” repository for MSR (RMCr project) • Papers, Videos, Presentations, Lectures, References, Data, Code, etc. • Support an ecosystem of services and tools • Available to the community for free (we are still considering the open source route) • Build an easy-to-install collection of basic services and tools

An Ecosystem of Research Repositories Support of harvesting & federation to/from Institutional Repositories - arXiv.org - DSpace - ePrints - Fedora - etc. Entities + Relationships can be synched to cloud storage so that they are: - Always Available - Sharable - Mixable - Harvestable Researchers manage their personal research entities (data, citations, documents, workflows, etc.)

Current Project Status • Limited Tech Preview release due June 2008 • Public Beta targeted for Aug/Sept 2008 For more details • Contact: Alex Wade (Program Manager) / alex.wade@microsoft.com • Community Forum: http://community.research.microsoft.com/forums/90.aspx

eScience and Semantic Computing meet the Cloud The cyberinfrastructure for the next generation of researchers

The Future: Software plus Services for Science? • Expect scientific research environments will follow similar trends to the commercial sector • Leverage computing and data storage in the cloud • Scientists already experimenting with Amazon S3 and EC2 services, with mixed results • For many of the same reasons • Siloed research teams, no resource sharing across labs • High storage costs • Low resource utilization • Excess capacity • High costs of reliably keeping machines up-to-date • Little support for developers, system operators

A smart cyberinfrastructure • Collective intelligence • If last.fm can recommend what song to broadcast to me based on what my friends are listening to, why can't the cyberinfrastructure of the future recommend articles of potential interest based on what the experts in the field that I respect are reading? • Examples are already emerging, but the process is manual (Connotea, BioMedCentral Faculty of 1000 ...) • Automatic correlation of scientific data • Smart composition of services and functionality • Cloud computing to aggregate, process, analyze and visualize data
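
The "collective intelligence" idea on this slide can be illustrated with a minimal sketch: rank unread articles by how many trusted experts are currently reading them. The function name, the expert names, and the reading lists below are hypothetical and exist only for illustration; a real service would need real usage data and a more careful ranking model.

```python
from collections import Counter


def recommend_articles(reading_lists, trusted_experts, already_read, top_n=3):
    """Rank articles by how many trusted experts are reading them."""
    votes = Counter()
    for expert in trusted_experts:
        # Count each article once per expert; experts with no data are skipped.
        for article in set(reading_lists.get(expert, [])):
            if article not in already_read:
                votes[article] += 1
    # Sort by vote count (descending), then by title for a stable order.
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return [article for article, _ in ranked[:top_n]]


# Hypothetical reading lists, for illustration only.
reading_lists = {
    "alice": ["paper-A", "paper-B", "paper-C"],
    "bob": ["paper-B", "paper-C"],
    "carol": ["paper-C", "paper-D"],
}

print(recommend_articles(reading_lists, ["alice", "bob", "carol"], {"paper-A"}))
# -> ['paper-C', 'paper-B', 'paper-D']
```

Counting one vote per expert is the simplest possible choice; weighting experts by how much the reader trusts them would be a natural extension.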

A world where all data is linked… • Important/key considerations • Formats or “well-known” representations of data/information • Pervasive access protocols are key (e.g. HTTP) • Data/information is uniquely identified (e.g. URIs) • Links/associations between data/information • Data/information is inter-connected through machine-interpretable information (e.g. paper X is about star Y) • Social networks are a special case of ‘data networks’ Attribution: Richard Cyganiak
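
As a rough illustration of the "paper X is about star Y" example, linked data can be modeled as subject/predicate/object statements in which every item is named by a URI. The sketch below uses only the Python standard library; all `example.org` URIs and the `isAbout` / `writtenBy` vocabulary terms are made up for this example and do not come from any real vocabulary.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical URIs: every entity and relationship gets a unique identifier.
PAPER_X = "http://example.org/papers/X"
STAR_Y = "http://example.org/stars/Y"
AUTHOR_Z = "http://example.org/people/Z"
IS_ABOUT = "http://example.org/vocab/isAbout"
WRITTEN_BY = "http://example.org/vocab/writtenBy"

graph = {
    Triple(PAPER_X, IS_ABOUT, STAR_Y),
    Triple(PAPER_X, WRITTEN_BY, AUTHOR_Z),
}


def match(graph, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    )


# Which papers are about star Y?
papers_about_star = [t.subject for t in match(graph, predicate=IS_ABOUT, obj=STAR_Y)]
print(papers_about_star)  # -> ['http://example.org/papers/X']
```

Because the link itself (`isAbout`) is also identified by a URI, a program can follow it without knowing anything else about the paper or the star, which is the sense in which the slide calls the information machine-interpretable.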

…and stored/processed/analyzed in the cloud • Vision of Future Research Environment with both Software + Services • Diagram labels: visualization and analysis services, scholarly communications, domain-specific services, search, books, citations, blogs & social networking, reference management, instant messaging, identity, mail, project management, notification, document store, storage/data services, knowledge management, compute services, virtualization, knowledge discovery • The Microsoft Technical Computing mission to reduce time to scientific insights is exemplified by the June 13, 2007 release of a set of four free software tools designed to advance AIDS vaccine research. The code for the tools is available now via CodePlex, an online portal created by Microsoft in 2006 to foster collaborative software development projects and host shared source code. Microsoft researchers hope that the tools will help the worldwide scientific community take new strides toward an AIDS vaccine.

Thank you for your attention

Emergence of a New Research Paradigm? • Thousand years ago – Experimental Science • Description of natural phenomena • Last few hundred years – Theoretical Science • Newton’s Laws, Maxwell’s Equations… • Last few decades – Computational Science • Simulation of complex phenomena • Today – eScience or Data-centric Science • Unify theory, experiment, and simulation • Using data exploration and data mining • Data captured by instruments • Data generated by simulations • Data generated by sensor networks • Scientists overwhelmed with data • Computer Science and IT companies have technologies that will help (With thanks to Jim Gray)

Today Scientists... • Annotate, share, discover data • Custom, standalone tools • Conferences, Journals • Publication process is long, subscriptions, discoverability issues • Collaborate on projects, exchange ideas • Email, F2F meetings, video-conferences • Use workflow tools to compose services • Domain-specific services/tools Web users... • Generate content on the Web • Blogs, wikis, podcasts, videocasts, etc. • Form communities • Social networks, virtual worlds • Interact, collaborate, share • Instant messaging, web forums, content sites • Consume information and services • Search, annotate, syndicate

Data can be easily produced http://ecrystals.chem.soton.ac.uk Thanks to Jeremy Frey

Data and services can be easily composed • Taverna Workflow • Compose services from the Web SensorMap Functionality: Map navigation Data: sensor-generated temperature, video camera feed, traffic feeds, etc.

Data is easily accessible With thanks to Catharine van Ingen

Data is easily shareable Sloan Digital Sky Survey / SkyServer http://cas.sdss.org/dr5/en/

Today… Computers are great tools for huge amounts of data For example, Google and Microsoft both have copies of the Web for indexing purposes

Tomorrow… Computers will still be great tools for huge amounts of data We would like computers to also help with the automatic processing of the world’s information

Semantic Computing

What is Semantic Computing? • Set of concepts and technologies • Data modeling • Relationships • Ontologies • Machine learning (entity extraction) • Inference, reasoning • Data, information, knowledge… Current technologies Possibilities for innovation
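
To make the "inference, reasoning" bullet concrete, here is a minimal sketch of one kind of inference an ontology enables: if an entity is an instance of a class, it is also an instance of every ancestor of that class. The class names and the single-parent hierarchy are assumptions chosen for illustration; real ontology languages such as OWL support far richer rules.

```python
def infer_types(instance_of, subclass_of):
    """Forward-chain: if x is a C and C is a subclass of D, then x is a D."""
    inferred = {name: set(types) for name, types in instance_of.items()}
    changed = True
    while changed:
        changed = False
        for types in inferred.values():
            # Iterate over a copy because the set may grow inside the loop.
            for cls in list(types):
                parent = subclass_of.get(cls)
                if parent is not None and parent not in types:
                    types.add(parent)
                    changed = True
    return inferred


# Hypothetical mini-ontology: each class has at most one parent class.
subclass_of = {"WhiteDwarf": "Star", "Star": "CelestialObject"}
# Explicitly stated fact: object-1 is a white dwarf.
instance_of = {"object-1": {"WhiteDwarf"}}

types = infer_types(instance_of, subclass_of)
print(sorted(types["object-1"]))  # -> ['CelestialObject', 'Star', 'WhiteDwarf']
```

Only one fact was stated about `object-1`; the other two were derived from the relationships between classes, so a query for all stars would now find it.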

Semantics • Term used to refer to the concept of “meaning” • The linguistics, AI, Natural Language Processing, etc. communities have been working on “meaning” and “knowledge” related technologies for decades • Pragmatic approach to Semantic Computing • Emergence of a new breed of technologies to capture meaning (RDF, OWL, etc.) • Combine with the pervasiveness of the Web community technologies such as folksonomies …

A word about the “Semantic Web” • The term is used to describe a set of technologies used to represent data, concepts, and their relationships • Has become a buzzword like Web 2.0 • Prefer to use the term “Semantic Computing”, which is about modeling data in ways that can be automatically processed by computers

Semantic Computing • Some efforts are driven by the traditional “knowledge engineering” community • Engaged in building well-controlled ontologies • Important for domain-specific vocabularies with data formats and relationships specific to a community • Model does not easily scale to the Internet • Some efforts are driven by the Web 2.0 community • Focus on the pervasiveness of Web protocols/standards • Emphasis on microformats (small, flexible, embeddable structures) • Exploit evolving and ever-expanding vocabularies such as folksonomies and tag clouds
