Online tools and standards for Biodiversity data in the Semantic Web

Dr Dimitris Koureas, Biodiversity Informatics Group | Department of Life Sciences, The Natural History Museum London

What is the semantic web? http://… http://… Slide adjusted from Page R. presentation in pro-iBiosphere

What is the semantic web? http://… http://… Diagram labels: Fred "is a" person; Fred is "author of" a book; http://… Slide adjusted from Page R. presentation in pro-iBiosphere
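The typed links on this slide can be written down as subject-predicate-object statements (triples). The sketch below is only an illustration in plain Python: every example.org address and the `objects` helper are assumptions made up for this example, not identifiers from the talk. Only the `rdf:type` URI is the standard one.

```python
# Minimal sketch of "typed links" as subject-predicate-object triples.
# All example.org URIs are hypothetical placeholders for illustration.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"  # standard "is a"
AUTHOR_OF = "http://example.org/vocab/authorOf"               # assumed predicate
PERSON = "http://example.org/vocab/Person"                    # assumed class
BOOK_CLASS = "http://example.org/vocab/Book"                  # assumed class
FRED = "http://example.org/people/fred"                       # assumed resource
BOOK = "http://example.org/books/1"                           # assumed resource

# Each statement links two resources and says what the link means.
triples = {
    (FRED, RDF_TYPE, PERSON),      # Fred is a person
    (BOOK, RDF_TYPE, BOOK_CLASS),  # the resource is a book
    (FRED, AUTHOR_OF, BOOK),       # Fred is author of the book
}


def objects(subject, predicate):
    """Return every object linked from `subject` by `predicate`, sorted."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


if __name__ == "__main__":
    print(objects(FRED, AUTHOR_OF))
```

A plain hyperlink only says that two pages are connected; a triple also says how, which is what lets a machine answer a question such as "what did Fred write?".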

The Semantic web: What is the semantic web? “The future of the web …and always will be” – Peter Norvig (Google) Slide adjusted from Page R. presentation in pro-iBiosphere

Biodiversity informatics: the study of the transformation and communication of information in Life and Earth sciences. Provides the means (generating and enhancing the necessary infrastructure)

Research: • Discovery • Ephemeral • Individualistic • Massive redundancy • Optional • Risk taking vs Infrastructure: • Implementation • Communal / agreed • Essential • Persistent • Robust & reliable • Adaptable. Slide adapted from Patterson D. 2013, Tempe, Arizona

What are the current challenges in Biodiversity informatics?

Current taxonomic data production: typically generated by small communities for “local” research projects. Publications based on countless specimens, images, maps, keys and datasets. Figure from Costello M.J. et al, 2013 doi: 10.1126/science.1230318

Our current taxonomic data production
• 15-20k new spp. described annually (2M total) [1]
• 30k nomenclatural acts (12M total) [1]
• 20k phylogenies (750k total) [2]
• 31k taxa sequenced (360k taxa total) [3]
• 800k BioMed papers (40M total pp. of taxonomy) [4]
• Countless specimens, images, maps, keys and datasets
1.8M described spp. (17M names); 300M pages (over last 250 years); 1.5-3B specimens
Figures from [1] Zhang, Zootaxa 2011 4, 1-4; [2] Web-of-Science; [3] GenBank and [4] PubMed.

Now imagine that… Estimates of 7.5 million species still undescribed [1]. [1] How Many Species Are There on Earth and in the Ocean? Mora C et al. doi:10.1371/journal.pbio.1001127

Biodiversity informatics landscape • Key problems • Landscape is complex, fragmented & hard to navigate • Many audiences (policy makers, scientists, amateurs, citizen scientists) • Many scales (global solutions to local problems) Figure adapted from Peterson et al, Syst. & Biodiv. 2010 doi: 10.1080/14772001003739369

Science is global • It needs global standards • Global workflows • Cooperation of global players BUT science is carried out “locally” • By local scientists • Being part of local infrastructures • Having local funders

Expected volume of taxonomic and biodiversity data: need for extracting, aggregating and linking data on a global level

To achieve this… data, information & knowledge need to be:
• Digital (not printed paper)
• Openly accessible (not behind barriers, e.g. paywalls)
• Linked-up (not in silos)
“Link together evolutionary data… by developing analytical tools and proper documentation and then use this framework to conduct comparative analyses, studies of evolutionary process and biodiversity analyses” Cyndy Parr, Rob Guralnick, Nico Cellinese and Rod Page. TREE doi:10.1016/j.tree.2011.11.001

Hour-glass motif for big data infrastructure Data re-use Data pool Data generation Slide adapted from Patterson D. 2013, Tempe, Arizona

Big data world with re-use data
• Re-use • Quality enhancement • Distribute • Make discoverable and actionable • Atomize • Standardize (metadata, ontology) • Use stable UUIDs to identify content • Preserve • Federate • Register • Make accessible • Normalize data • Structure data • Make data digital
Data re-use: Visualization, Analysis, Aggregation, Manipulation
Data pool
Data generation: Observations, Experiments, Models, Processed

Nodes interconnected • Dynamically interconnected • Nodes with sub-discipline specific responsibilities • Standard Exchange formats • Using UUIDs to identify content • Ontologies • Nodes are the essence of infrastructure Slide adapted from Patterson D. 2013, Tempe, Arizona
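One possible way to obtain UUIDs that stay stable across nodes is to derive them deterministically from a key the data provider already controls. The sketch below uses Python's standard `uuid` module; the record URL pattern and the `record_uuid` helper are hypothetical, and the talk does not prescribe how UUIDs should be minted.

```python
import uuid

# Assumption: each record already has a locally unique key, here an
# institution code plus a catalogue number. The example.org URL pattern
# is a hypothetical placeholder, not a real service.
RECORD_URL = "http://example.org/specimens/{institution}/{catalogue}"


def record_uuid(institution, catalogue):
    """Derive a name-based (version 5) UUID from a record's local key.

    The same inputs always give the same UUID, so two nodes that share
    this convention can agree on the identifier without a central registry.
    """
    url = RECORD_URL.format(institution=institution, catalogue=catalogue)
    return uuid.uuid5(uuid.NAMESPACE_URL, url)


if __name__ == "__main__":
    print(record_uuid("EXAMPLE", "000123"))
```

A randomly generated identifier (`uuid.uuid4()`) works equally well as long as it is stored with the record and never regenerated. The name-based variant simply makes the identifier reproducible.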

But how many biodiversity informatics projects are out there? At least 679!
Categories:
• Data Aggregator - a web site that collates data from a variety of sources (digital and hardcopy) and presents it in one form
• Data Indexer - a web site that provides lists or indexes of other sites that provide data
• Data Provider - a web site that provides data directly from research or other studies
• Data Standards - a web site that contributes to formulating or developing standards for data
• Facilitator - a web site that facilitates the provision of data by other projects or web sites
Sources: EDIT, TDWG & ViBRANT 2013

Aggregators GBIF: Our global leader in occurrence data

Aggregators http://www.eu-nomen.eu/portal/ EU-NOMEN - PESI

Aggregators Making taxonomy digital, open & linked

Scratchpads are an integrated system to Enter, Curate, Mark-up, Link and Publish data: the taxonomic workflow in a single virtual environment

The Scratchpads concept External data & services Your data A Scratchpad is a website that holds data for you and your community

580 Scratchpads communities with 8,185 active registered users, covering 55,607 taxa in 653,274 pages. In total, more than 1,300,000 visitors. Unique visitors to Scratchpads sites per month: 65,000

Facilitators BOLD Barcode of Life Data Systems Researchers can assemble, test, and analyse their data records in BOLD before uploading them to: International Nucleotide Sequence Database Collaboration (DDBJ, ENA, GenBank)

Providers Biodiversity Heritage Library BHL http://www.biodiversitylibrary.org/ Biodiversity literature openly available to the world as part of a global biodiversity community > 40 M pages of legacy literature

Standard Exchange formats

Standard Exchange formats http://rs.tdwg.org/dwc/index.htm Darwin Core (DwC) Primarily used as a specimen records metadata standard
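As an illustration of what a Darwin Core specimen record can look like, the sketch below writes one record to CSV using a handful of Darwin Core term names as column headers. The record values are invented for the example, and the `to_dwc_csv` helper is an assumption of this sketch rather than part of any Darwin Core tooling.

```python
import csv
import io

# A small subset of Darwin Core term names used as column headers.
DWC_FIELDS = [
    "occurrenceID",
    "basisOfRecord",
    "institutionCode",
    "catalogNumber",
    "scientificName",
    "eventDate",
    "country",
    "decimalLatitude",
    "decimalLongitude",
]

# Hypothetical specimen record; every value here is made up for illustration.
record = {
    "occurrenceID": "urn:example:specimen:000123",
    "basisOfRecord": "PreservedSpecimen",
    "institutionCode": "EXAMPLE",
    "catalogNumber": "000123",
    "scientificName": "Thymus vulgaris L.",
    "eventDate": "2013-05-14",
    "country": "United Kingdom",
    "decimalLatitude": "51.4966",
    "decimalLongitude": "-0.1764",
}


def to_dwc_csv(records):
    """Serialise records to CSV text with Darwin Core terms as the header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=DWC_FIELDS)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_dwc_csv([record]))
```

Because the column names come from a shared vocabulary, an aggregator can read the file without knowing anything about the database that produced it.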

Standard Exchange formats http://www.tdwg.org/standards/115/ Access to Biological Collection Data (ABCD) highly detailed and aims to provide a complete set of data elements for natural history collection items

Standard Exchange formats http://www.tdwg.org/standards/638/ Audubon Core Multimedia Resources Metadata Schema The Audubon Core metadata schema ("AC") is a representation-neutral metadata vocabulary for describing biodiversity-related multimedia resources and collections.

Standard Exchange formats Taxonomic Concept Transfer Schema (TCS) http://tdwg.napier.ac.uk/index.php?pagename=HomePage Mechanism to exchange data concerning the names of organisms

Standards facilitate systems interoperability

We need Unique Identifiers: UPIDs (Unique Persistent Identifiers) to identify content. Identifier: a key to find something in a database.

We need Unique Identifiers 10.4289/0013-8797.115.1.75

We need Unique Identifiers http://hdl.handle.net/10.4289/0013-8797.115.1.75 http://dx.doi.org/10.4289/0013-8797.115.1.75 http://www.google.co.uk/search?q=10.4289/0013-8797.115.1.75 http://zoobank.org/10.4289/0013-8797.115.1.75
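The point of the four addresses above is that one identifier string can be handed to several services. A minimal sketch of that idea follows; the URL patterns are copied from the slide, while the `RESOLVERS` table and both helper functions are assumptions of this example.

```python
# The DOI shown on the previous slide.
DOI = "10.4289/0013-8797.115.1.75"

# URL patterns taken from the slide; the dictionary keys are just labels
# chosen for this sketch.
RESOLVERS = {
    "handle": "http://hdl.handle.net/{doi}",
    "doi": "http://dx.doi.org/{doi}",
    "google": "http://www.google.co.uk/search?q={doi}",
    "zoobank": "http://zoobank.org/{doi}",
}


def resolver_urls(doi):
    """Build one URL per service for the same identifier string."""
    return {name: pattern.format(doi=doi) for name, pattern in RESOLVERS.items()}


def split_doi(doi):
    """Split a DOI into its prefix and suffix at the first slash."""
    prefix, _, suffix = doi.partition("/")
    return prefix, suffix


if __name__ == "__main__":
    for name, url in resolver_urls(DOI).items():
        print(name, url)
```

The identifier itself never changes; only the service in front of it does, which is what makes it usable as a key across databases.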

We need Unique Identifiers Can a taxonomic name be used as a UPID? Are taxonomic names enough for communication between scientists? YES. Are taxonomic names enough for communication between machines? CAN BE, IF: Is it Unique? Is it Persistent? Is it an Identifier?

We need Unique Identifiers For example: Page R., Brief Bioinform (2008) 9 (5): 345-354. doi: 10.1093/bib/bbn022

We need Unique Identifiers ONLY IF: Name reconciliation. Patterson, D. J. et al. 2010. Names are key to the big new biology. TREE 25: 686-691 doi: 10.1016/j.tree.2010.09.004
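A toy sketch of what name reconciliation means in practice: different spellings of the same name string are mapped to one identifier so that machines can treat them as the same thing. The normalisation rule (keep genus and species epithet, drop authorship) and the `name:N` identifiers are assumptions of this example. Real reconciliation also has to handle synonyms, misspellings and homonyms, which this sketch does not attempt.

```python
def canonical_binomial(name):
    """Reduce a name string to 'Genus epithet', ignoring case, spacing
    and anything after the second word (e.g. authorship)."""
    parts = name.split()
    if len(parts) < 2:
        raise ValueError(f"not a binomial: {name!r}")
    return f"{parts[0].capitalize()} {parts[1].lower()}"


def reconcile(names):
    """Map each raw name string to an identifier shared by all its variants.

    Identifiers are hypothetical local keys of the form 'name:N',
    assigned in order of first appearance.
    """
    registry = {}  # canonical form -> identifier
    result = {}    # raw string -> identifier
    for raw in names:
        key = canonical_binomial(raw)
        if key not in registry:
            registry[key] = f"name:{len(registry) + 1}"
        result[raw] = registry[key]
    return result


if __name__ == "__main__":
    variants = [
        "Thymus vulgaris L.",
        "thymus  VULGARIS",
        "Thymus vulgaris Linnaeus",
        "Thymus serpyllum L.",
    ]
    for raw, identifier in reconcile(variants).items():
        print(identifier, raw)
```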

Ontologies Knowledge Organisation Systems The need for Controlled Vocabularies and Ontologies Google has done it: http://googleblog.blogspot.co.uk/2012/05/introducing-knowledge-graph-things-not.html Plant anatomical and structural development Ontology http://www.plantontology.org/

Example of ontology usage • Deans A. et al. Time to change how we describe biodiversity, Trends in Ecology & Evolution 2012 • doi:10.1016/j.tree.2011.11.007

Examples of integrated projects http://protectedplanet.net http://thymus.myspecies.info

How is all this relevant to my work? What should I take home?

Providers Community Data silos Repositories #bigdata

The four nodes of data workflow: 1. We collect and generate data 2. We curate, link and structure data 3. We analyse data 4. We publish data
