Online tools and standards for Biodiversity data in the Semantic Web. Dr Dimitris Koureas, Biodiversity Informatics Group | Department of Life Sciences, The Natural History Museum, London.
What is the semantic web? Web resources identified by addresses (http://… http://… http://…) and joined by labelled links, e.g. Fred "is a" person and Fred is "author of" a book. Slide adjusted from Page R. presentation in pro-iBiosphere
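The labelled links on this slide (Fred, person, book, "is a", "author of") can be written down as subject-predicate-object triples. The sketch below is only an illustration in plain Python: every example.org address is a made-up stand-in for the elided "http://…" addresses, and the `objects` helper is hypothetical, not part of any semantic web library.

```python
# Illustrative stand-ins for the elided "http://…" addresses on the slide.
FRED = "http://example.org/people/fred"
BOOK = "http://example.org/books/1"
PERSON = "http://example.org/types/Person"
IS_A = "http://example.org/terms/isA"
AUTHOR_OF = "http://example.org/terms/authorOf"

# Each statement is one (subject, predicate, object) triple.
triples = [
    (FRED, IS_A, PERSON),
    (FRED, AUTHOR_OF, BOOK),
]


def objects(subject, predicate, graph):
    """Return every object linked from `subject` by `predicate`."""
    return [o for s, p, o in graph if s == subject and p == predicate]


# Follow the "author of" link starting from Fred.
print(objects(FRED, AUTHOR_OF, triples))
```

Because every node and every link type is itself an address, a machine can follow the links without knowing in advance what "Fred" or "book" means.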
The Semantic Web: “The future of the web …and always will be” – Peter Norvig (Google). Slide adjusted from Page R. presentation in pro-iBiosphere
Biodiversity informatics: the study of the transformation and communication of information in Life and Earth sciences. It provides the means (generating and enhancing the necessary infrastructure).
Research vs Infrastructure
Research: • Discovery • Ephemeral • Individualistic • Massive redundancy • Optional • Risk taking
Infrastructure: • Implementation • Communal / agreed • Essential • Persistent • Robust & reliable • Adaptable
Slide adapted from Patterson D. 2013, Tempe, Arizona
What are the current challenges in Biodiversity informatics?
Current taxonomic data production: Typically generated by small communities for “local” research projects. Publications based on countless specimens, images, maps, keys and datasets. Figure from Costello M.J. et al, 2013 doi: 10.1126/science.1230318
Our current taxonomic data production • 15-20k new spp. described annually (2M total) [1] • 30k nomenclatural acts (12M total) [1] • 20k phylogenies (750k total) [2] • 31k taxa sequenced (360k taxa total) [3] • 800k BioMed papers (40M total pp. of taxonomy) [4] • Countless specimens, images, maps, keys and datasets
1.8M described spp. (17M names) • 300M pages (over last 250 years) • 1.5-3B specimens
Figures from [1] Zhang, Zootaxa 2011 4, 1-4; [2] Web-of-Science; [3] Genbank and [4] PubMed.
Now imagine that… Estimates of 7.5 million species still undescribed [1]
[1] Mora C et al. How Many Species Are There on Earth and in the Ocean? doi:10.1371/journal.pbio.1001127
Biodiversity informatics landscape • Key problems • Landscape is complex, fragmented & hard to navigate • Many audiences (policy makers, scientists, amateurs, citizen scientists) • Many scales (global solutions to local problems) Figure adapted from Peterson et al, Syst. & Biodiv. 2010 doi: 10.1080/14772001003739369
Science is global • It needs global standards • Global workflows • Cooperation of global players
BUT
Science is carried out “locally” • By local scientists • Being part of local infrastructures • Having local funders
Expected volume of taxonomic and biodiversity data: the need for extracting, aggregating and linking data on a global level
To achieve this… data, information & knowledge need to be: • Digital (not printed paper) • Openly accessible (not behind barriers, e.g. paywalls) • Linked-up (not in silos)
“Link together evolutionary data… by developing analytical tools and proper documentation and then use this framework to conduct comparative analyses, studies of evolutionary process and biodiversity analyses” Cyndy Parr, Rob Guralnick, Nico Cellinese and Rod Page. TREE doi:10.1016/j.tree.2011.11.001
Hour-glass motif for big data infrastructure Data re-use Data pool Data generation Slide adapted from Patterson D. 2013, Tempe, Arizona
Big data world with re-use data • Re-use • Quality enhancement • Distribute • Make discoverable and actionable • Atomize • Standardize (metadata, ontology) • Use stable UUIDs to identify content • Preserve • Federate • Register • Make accessible • Normalize data • Structure data • Make data digital
Figure labels: Visualization, Analysis, Aggregation, Manipulation | Data re-use, Data pool, Data generation | Observations, Experiments, Models, Processed
Nodes interconnected • Dynamically interconnected • Nodes with sub-discipline specific responsibilities • Standard Exchange formats • Using UUIDs to identify content • Ontologies • Nodes are the essence of infrastructure Slide adapted from Patterson D. 2013, Tempe, Arizona
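As a rough illustration of "using UUIDs to identify content", the sketch below uses Python's standard `uuid` module. The record address and the two helper names are assumptions made for this example; the slides do not say how identifiers should be minted.

```python
import uuid


def stable_id(source_url: str) -> uuid.UUID:
    """Derive a name-based (version 5) UUID from a record's address.

    The same address always yields the same UUID, so any node can
    recompute the identifier independently.
    """
    return uuid.uuid5(uuid.NAMESPACE_URL, source_url)


def new_id() -> uuid.UUID:
    """Mint a random (version 4) UUID; it must then be stored with the record."""
    return uuid.uuid4()


record_url = "http://example.org/specimens/12345"  # hypothetical record address
print(stable_id(record_url))
print(new_id())
```

A name-based UUID is only as stable as the address it is derived from; a random UUID is stable only if the node that minted it keeps it alongside the content.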
But how many biodiversity informatics projects are out there? At least 679!
Categories:
• Data Aggregator - a web site that collates data from a variety of sources (digital and hardcopy) and presents it in one form
• Data Indexer - a web site that provides lists or indexes of other sites that provide data
• Data Provider - a web site that provides data directly from research or other studies
• Data Standards - a web site that contributes to formulating or developing standards for data
• Facilitator - a web site that facilitates the provision of data by other projects or web sites
Sources: EDIT, TDWG & ViBRANT 2013
Aggregators GBIF: Our global leader in occurrence data
Aggregators http://www.eu-nomen.eu/portal/ EU-NOMEN - PESI
Aggregators: Making taxonomy digital, open & linked
Scratchpads are an integrated system to Enter, Curate, Mark-up, Link and Publish data: the taxonomic workflow in a single virtual environment
The Scratchpads concept External data & services Your data A Scratchpad is a website that holds data for you and your community
580 Scratchpads communities with 8,185 active registered users, covering 55,607 taxa in 653,274 pages. In total more than 1,300,000 visitors. Unique visitors to Scratchpads sites: 65,000 per month
Facilitators BOLD Barcode of Life Data Systems Researchers can assemble, test, and analyse their data records in BOLD before uploading them to: International Nucleotide Sequence Database Collaboration (DDBJ, ENA, GenBank)
Providers Biodiversity Heritage Library BHL http://www.biodiversitylibrary.org/ Biodiversity literature openly available to the world as part of a global biodiversity community > 40 M pages of legacy literature
Standard Exchange formats http://rs.tdwg.org/dwc/index.htm Darwin Core (DwC) Primarily used as a specimen records metadata standard
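To make the idea of a specimen-record metadata standard concrete, here is a minimal sketch that writes one record as CSV using Darwin Core term names as column headers. All values are invented for illustration, and the `to_csv` helper is a hypothetical name; this is one plausible flat-file layout, not a complete or validated Darwin Core export.

```python
import csv
import io

# Column names are Darwin Core terms; every value below is invented.
record = {
    "occurrenceID": "urn:uuid:00000000-0000-0000-0000-000000000000",
    "basisOfRecord": "PreservedSpecimen",
    "institutionCode": "EXAMPLE",
    "catalogNumber": "12345",
    "scientificName": "Puma concolor",
}


def to_csv(rows):
    """Serialise a list of same-shaped records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


text = to_csv([record])
print(text)
```

Because both sides agree on the column names, a receiving system can read the file back without any project-specific mapping.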
Standard Exchange formats http://www.tdwg.org/standards/115/ Access to Biological Collection Data (ABCD): highly detailed; aims to provide a complete set of data elements for natural history collection items
Standard Exchange formats http://www.tdwg.org/standards/638/ Audubon Core Multimedia Resources Metadata Schema The Audubon Core metadata schema ("AC") is a representation-neutral metadata vocabulary for describing biodiversity-related multimedia resources and collections.
Standard Exchange formats Taxonomic Concept Transfer Schema (TCS) http://tdwg.napier.ac.uk/index.php?pagename=HomePage Mechanism to exchange data concerning the names of organisms
We need Unique Identifiers: UPIDs (unique, persistent identifiers) to identify content. Identifier: a key to find something in a database.
We need Unique Identifiers 10.4289/0013-8797.115.1.75
We need Unique Identifiers http://hdl.handle.net/10.4289/0013-8797.115.1.75 http://dx.doi.org/10.4289/0013-8797.115.1.75 http://www.google.co.uk/search?q=10.4289/0013-8797.115.1.75 http://zoobank.org/10.4289/0013-8797.115.1.75
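The four addresses on this slide are the same identifier string appended to different base addresses. A small sketch of that pattern follows; the base addresses are copied from the slide, while the function and constant names are assumptions for illustration.

```python
# Base addresses exactly as they appear on the slide.
BASES = [
    "http://hdl.handle.net/",
    "http://dx.doi.org/",
    "http://www.google.co.uk/search?q=",
    "http://zoobank.org/",
]


def identifier_urls(identifier):
    """Append one identifier string to each base address."""
    return [base + identifier for base in BASES]


for url in identifier_urls("10.4289/0013-8797.115.1.75"):
    print(url)
```

The identifier itself never changes; only the service asked to look it up does.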
We need Unique Identifiers. Can a taxonomic name be used as a UPID? • Is it Unique? • Is it Persistent? • Is it an Identifier? Are taxonomic names enough for communication between scientists? YES. Are taxonomic names enough for communication between machines? CAN BE, IF
We need Unique Identifiers For example: Page R., Brief Bioinform (2008) 9 (5): 345-354. doi: 10.1093/bib/bbn022
We need Unique Identifiers. ONLY IF: name reconciliation. Patterson, D. J. et al. 2010. Names are key to the big new biology. TREE 25: 686-691 doi: 10.1016/j.tree.2010.09.004
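One very simplified way to picture name reconciliation is a lookup that maps different name strings for the same taxon onto a single identifier. The sketch below assumes a tiny hand-made table and a made-up identifier scheme (`urn:example:taxon:…`); real reconciliation services are far more involved, and nothing here reflects a specific service's behaviour.

```python
def normalise(name: str) -> str:
    """Collapse whitespace and case so trivially different spellings compare equal."""
    return " ".join(name.split()).lower()


# Hypothetical lookup table: each known name string points at one identifier.
NAME_TO_ID = {
    normalise("Puma concolor"): "urn:example:taxon:1",
    # Older combination of the same species, mapped to the same identifier.
    normalise("Felis concolor"): "urn:example:taxon:1",
}


def reconcile(name):
    """Return the identifier for a name string, or None if the name is unknown."""
    return NAME_TO_ID.get(normalise(name))


print(reconcile("  felis   CONCOLOR "))
print(reconcile("Puma concolor"))
```

Once names are reconciled to identifiers, two datasets that use different names for the same taxon can still be linked by machine.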
Ontologies Knowledge Organisation Systems The need for Controlled Vocabularies and Ontologies Google has done it: http://googleblog.blogspot.co.uk/2012/05/introducing-knowledge-graph-things-not.html Plant anatomical and structural development Ontology http://www.plantontology.org/
Example of ontology usage • Deans A. et al. Time to change how we describe biodiversity, Trends in Ecology & Evolution 2012 • doi:10.1016/j.tree.2011.11.007
Examples of integrated projects http://protectedplanet.net http://thymus.myspecies.info
How is all this relevant to my work? What should I take home?
Providers Community Data silos Repositories #bigdata
The four nodes of data workflow 1. We collect and generate data 2. We curate, link and structure data 3. We analyse data 4. We publish data