The OBIS Index
Where we are – as at October 2003
Tony Rees – CSIRO Marine Research, Hobart
for: OBIS IC meeting, Washington DC
Advance information
Subject of this talk is ...
- New (mostly created within last 8 weeks, some within last 8 days)
- Innovative (uses special components available only from CSIRO, plus others custom created for this project)
- Powerful (offers a major ramp-up of OBIS functionality, for modest additional complexity)
- Exciting (opens up the possibility of many new features)
- so – worth a look!
OBIS: A Distributed System
Strengths of this approach ...
• Data sources remain under custodianship of OBIS contributors (no IP issues, good for community building, owners do their own QA and updates)
• Portal concerns itself with technical issues only; it does not act as a data manager
• Portal size and resource requirements don’t increase as OBIS membership and content grow
• No problems with version control
OBIS: A Distributed System
Weaknesses with this approach ...
• Availability, speed of links, and speed of responses to/from contributors are critical to proper functioning of the system (compounds with increasing number of contributors) – i.e., system response depends on factors outside OBIS’ control
• Portal has no knowledge of OBIS provider content (has to do a live distributed query for every piece of information) – also, a user may search repeatedly on taxa for which no data are held in the system
• No opportunity to add value, such as search by taxonomic group (as contributors do not provide this information in any enforced way)
• No opportunities for advanced search functions, e.g. “near match” (would be difficult to do in a real-time distributed query)
One example: The “Zero Records” Problem ...
• Incorrect spelling?
• No data available via OBIS for this taxon?
• Data exist, but are in one of the sources which are off-line?
NB, these responses are actually the slowest to generate, as well!
A solution - the OBIS Index
= a reduced subset of OBIS data, stored in a standardized format, in a convenient location
• Single record per species, with relevant summary information, i.e., number of records, date range, depth range, plus “c-squares” spatial index (sufficient distribution information for “quick maps” and spatial searches)
• Master genus list, with cross-references to a simple taxonomic hierarchy
• Degree of QA, i.e. masking informal/unresolved taxa, and freshwater/terrestrial species
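To make the "single record per species" idea concrete, here is a minimal sketch of how raw occurrence records might be collapsed into one summary row. The class name, field names, and the (row, col) square key are assumptions for illustration; the slides do not describe the actual index schema, and the key used here is a simplified stand-in for a real c-squares code.

```python
import math
from dataclasses import dataclass, field
from datetime import date
from typing import Optional, Set, Tuple


@dataclass
class SpeciesSummary:
    """One index row per species (all field names are hypothetical)."""
    name: str
    n_records: int = 0
    first_date: Optional[date] = None
    last_date: Optional[date] = None
    min_depth_m: Optional[float] = None
    max_depth_m: Optional[float] = None
    squares: Set[Tuple[int, int]] = field(default_factory=set)


def summarize(name, records, cell_deg=0.5):
    """Collapse (lat, lon, date, depth_m) records into one summary row."""
    s = SpeciesSummary(name)
    for lat, lon, when, depth in records:
        s.n_records += 1
        s.first_date = when if s.first_date is None else min(s.first_date, when)
        s.last_date = when if s.last_date is None else max(s.last_date, when)
        if depth is not None:  # depth may be missing in a source record
            s.min_depth_m = depth if s.min_depth_m is None else min(s.min_depth_m, depth)
            s.max_depth_m = depth if s.max_depth_m is None else max(s.max_depth_m, depth)
        # Simplified square key, not an official c-squares code.
        s.squares.add((math.floor(lat / cell_deg), math.floor(lon / cell_deg)))
    return s


# Toy input, invented for this example.
sample = [
    (-42.9, 147.3, date(2001, 5, 1), 120.0),
    (-42.7, 147.4, date(1998, 3, 2), 40.0),
    (-10.2, 142.0, date(2003, 1, 1), None),
]
row = summarize("Example species", sample)
```

The point of the design is that the row stays the same size whether the species has three source records or tens of thousands; only the set of occupied squares grows, and that is bounded by the grid.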
C-squares spatial indexing ...
• Doesn’t store the point data, just a list of the squares in which data are present, for each taxon
• Efficient for data reduction
• Easy to store and query
Choice of square size is a design decision (this index uses 0.5 x 0.5 degree squares, approx. 50 km)
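The data-reduction step can be sketched as follows. This is an illustration of the general grid-cell technique only: the (row, col) key below is a simplified assumption and is not the official c-squares notation, and the function names are hypothetical.

```python
import math


def cell_key(lat, lon, size_deg=0.5):
    """Return a (row, col) key for the grid cell containing a point.

    Simplified key for illustration; real c-squares codes use a
    different, hierarchical string notation.
    """
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError("coordinates out of range")
    return (math.floor(lat / size_deg), math.floor(lon / size_deg))


def occupied_cells(points, size_deg=0.5):
    """Reduce a list of (lat, lon) points to the set of cells holding data."""
    return {cell_key(lat, lon, size_deg) for lat, lon in points}


def cell_bounds(key, size_deg=0.5):
    """Return (south, west, north, east) for a cell, e.g. for drawing a map."""
    row, col = key
    return (row * size_deg, col * size_deg,
            (row + 1) * size_deg, (col + 1) * size_deg)


# Three toy points; the first two fall in the same 0.5 degree cell.
points = [(-42.9, 147.3), (-42.7, 147.4), (-42.4, 147.3)]
cells = occupied_cells(points)
```

Passing a larger `size_deg` gives a coarser grid with fewer cells, which is the square-size trade-off mentioned above.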
Index benefits ...
- Initial taxon searches and mapping take place by querying the index, not the remote data sources:
  • rapid response time
  • always complete (irrespective of whether any data sources are off line)
  • can return lists of multiple taxa as desired (no longer need to search for taxa sequentially)
  • limits user selection to a picklist of species represented in the system - no more “zero records” responses
  • correct user spelling not required (enter part of a name, or browse a category, or ask for “near matches”)
  • can return information for user’s desired taxonomic group(s) only
- Use as “pre-filter” to answer many queries directly from the index, without needing to do a distributed search until actual data are required – i.e., a 2-stage process.
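The 2-stage process can be sketched roughly as below. Everything here is an assumption made for illustration: the index layout, the idea that each row lists its providers, and the `fetch_remote` callback standing in for a live distributed query. The slides do not specify how the portal would actually dispatch Stage 2.

```python
def stage1(index, name_part):
    """Stage 1: answer from the local index only (partial names allowed)."""
    needle = name_part.lower()
    return sorted(name for name in index if needle in name.lower())


def stage2(name, index, fetch_remote):
    """Stage 2: query remote sources, but only for a species the index lists."""
    if name not in index:
        return []  # nothing held for this name, so no remote call is made
    records = []
    for provider in index[name]["providers"]:
        records.extend(fetch_remote(provider, name))
    return records


# Toy index and a fake remote call that records how often it is used.
index = {
    "Raja clavata": {"n_records": 12, "providers": ["provider_a", "provider_b"]},
    "Raja montagui": {"n_records": 3, "providers": ["provider_a"]},
}
calls = []


def fake_fetch(provider, name):
    calls.append((provider, name))
    return [provider + ":" + name]


picklist = stage1(index, "raja")
data = stage2("Raja montagui", index, fake_fetch)
nothing = stage2("Raja imaginaria", index, fake_fetch)
```

In this sketch a name absent from the index never triggers a remote query, which is how the "zero records" case is avoided.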
Index Development So Far ...
• Nov. 2002 – July 2003: initial concept development and refinement (Tony, Rainer, Phoebe) – incl. endorsement by OBIS IC, Mar. ’03
• Aug. – Sept. 2003:
  • design/build initial prototype Index, plus partially populate with summary data (Tony)
  • construct master genus list and taxonomic hierarchy (=“OBIS categories”), and tag most genera with relevant category (Tony)
• Sept. 2003: circulate URL and background information to OBIS IC, TWG for comment
• Sept. – Oct. 2003:
  • refine prototype index (Tony)
  • construct “crawler” and finish first-pass population of the index (Tony, Pamela)
  • tag remaining genera with taxonomic attribution (Tony)
  • build spatial search module (Tony)
Reality check – what do users need ...
Key OBIS functions:
• Show/get distribution data for a desired species
• Show/get species information for an area
(preceded by ...)
• List species for which data are available! (e.g. by organism type)
• Show areas for which data are available! (e.g. by organism type)
Current (prototype) OBIS Index Search Interface - as at October 2003 (www.marine.csiro.au/datacentre/obis/quicksearch1.htm)
Current OBIS Categories (Oct. 2003) - page 1 of 2 (approx. 140 in total)
Example Possible Index Searches ...
(the mark after each search shows whether it was previously offered: Y, N, or Y/N)
“Generate List ...” function:
• All fishes beginning with “B...”, or “Bathy...” (N)
• All whales, or decapods, or bryozoans (N)
• All species of the genus “Raja” (Y/N)
• All “near matches” to “Coelorhynchus” (N)
“Spatial Search ...” function:
• All fishes, or hexacorals, or “any invertebrates”, or any OBIS taxa, in any 10 x 10 degree square (N)
• All species of “Raja” in a given 10 x 10 degree square (Y/N)
• Global distribution map for any OBIS taxonomic category, e.g. can use to identify data gaps (N)
(Note, could also offer searching by 5 x 5 degree square or smaller, but data are probably too patchy for this to be useful at present time)
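As a rough sketch of how the list and spatial searches above could run against index rows alone, consider the following. The row layout, function names, and the (floor(lat / 10), floor(lon / 10)) square key are assumptions for illustration, and the sample rows are made up (the squares are not real distribution data).

```python
import math

# Made-up toy rows; a square is (floor(lat / 10), floor(lon / 10)).
ROWS = [
    {"name": "Bathyraja spinicauda", "category": "fishes", "squares": {(6, -6)}},
    {"name": "Raja clavata", "category": "fishes", "squares": {(5, 0), (4, -1)}},
    {"name": "Raja montagui", "category": "fishes", "squares": {(5, 0)}},
    {"name": "Balaenoptera physalus", "category": "whales", "squares": {(5, 0), (-5, 14)}},
]


def ten_degree_square(lat, lon):
    """Key of the 10 x 10 degree square containing a point (simplified)."""
    return (math.floor(lat / 10), math.floor(lon / 10))


def generate_list(rows, prefix="", category=None, genus=None):
    """List names filtered by name prefix, category, and/or exact genus."""
    out = []
    for row in rows:
        if not row["name"].lower().startswith(prefix.lower()):
            continue
        if category is not None and row["category"] != category:
            continue
        if genus is not None and row["name"].split()[0] != genus:
            continue
        out.append(row["name"])
    return sorted(out)


def spatial_search(rows, lat, lon, category=None, genus=None):
    """List names recorded in the 10 x 10 degree square around a point."""
    square = ten_degree_square(lat, lon)
    hits = [row for row in rows if square in row["squares"]]
    return generate_list(hits, category=category, genus=genus)
```

For example, `generate_list(ROWS, prefix="Bathy")` covers the "beginning with" case, and `spatial_search(ROWS, 55.0, 3.0, genus="Raja")` covers the "all species of Raja in a given square" case; neither needs to contact a remote source.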
Costs associated with the Index ...
• Design, build costs (i.e., person hours)
  • mostly done - although will be refined further (CSIRO contribution)
• Hosting costs
  • CSIRO is happy to host, at least for present; access via web can be seamless, once integrated into the portal
• Refreshing/content maintenance costs
  • some person time needed, in addition to automated “crawler” – upload taxon lists from new data contributors, check for bad data, flag new genera with relevant taxonomic group as needed
  • crawling ideally should be repeated frequently, to keep index current
• Continued development and integration into OBIS Portal
  • ??
Recap – what’s new ...
• Speed, consistency, reliability
  • includes no more 0 records, or “try later” messages (at least on “Stage 1” searches)
• Many new functions, including
  • User need only enter part of a name
  • Can automatically correct for spelling errors
  • Report on multiple taxa simultaneously (tens to thousands)
  • Spatial searches from clickable maps
• Introduction of “OBIS categories”
• OBIS content available at a glance (summary statistics, spatial coverage by category)
• Screening of irrelevant, and/or bad data
• Expansion of ease-of-use, from expert to increasingly non-expert users, without compromising integrity of the system.
Future tasks ...
• Include common names in search results, search interface
• Auto-resolution of synonyms, variants ...
• Quick Images? Quick Species Pages?
• How to embed seamlessly into Portal
• Further development of CSIRO mapper, and/or c-squares enabling for other mappers? (KGS, SEAMAP...)
• Think about replication, system load issues
• How to manage development process from here
• Any overlap with GBIF activities? (OBIS is “marine component of GBIF”; GBIF has indicated interest in indexing)
• Other ??
Summary ...
- Interesting challenge thus far!
- Reasonably complex package (database, software, content building and maintenance)
- Personal opinion – major step forward in OBIS functionality
- Close to deployment in “production” version
- How to integrate with OBIS work plan?
“Quick map” for Balaenoptera physalus (38,000+ OBIS records) - in < 6 secs.
Also: maps are all now “active maps” – click on/near any red square initiates “live” OBIS spatial search for the relevant base data.
Another example: “near match ...” for a genus name – where user is unsure of correct spelling ...
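The slides do not say which algorithm the prototype uses for near matching. As a stand-in, the sketch below uses the similarity ratio from Python's standard `difflib` module against a small toy genus list; the function name, the cutoff value, and the list itself are assumptions for illustration.

```python
import difflib

# Toy master genus list, for illustration only.
GENERA = ["Coelorinchus", "Coryphaenoides", "Raja", "Bathyraja", "Balaenoptera"]


def near_matches(query, genera, limit=5, cutoff=0.85):
    """Return genus names whose spelling is close to the query.

    Comparison is case-insensitive; results come back best match first.
    """
    by_lower = {g.lower(): g for g in genera}
    hits = difflib.get_close_matches(
        query.lower(), list(by_lower), n=limit, cutoff=cutoff
    )
    return [by_lower[h] for h in hits]


suggestions = near_matches("Coelorhynchus", GENERA)
```

Because the comparison runs against the locally held genus list, it avoids the real-time distributed query that made near matching impractical before. A lower `cutoff` returns more, and looser, candidates.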
Pre-generated map presented, showing all records for category ...
Another feature: can use stored record count information to generate OBIS summary statistics per category, e.g.:
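A minimal sketch of how such per-category statistics might be derived from the stored record counts is given below. The row layout and function name are assumptions, and the counts are invented toy values rather than real OBIS figures.

```python
from collections import defaultdict


def category_stats(rows):
    """Total the species count and record count for each category."""
    totals = defaultdict(lambda: {"species": 0, "records": 0})
    for row in rows:
        entry = totals[row["category"]]
        entry["species"] += 1
        entry["records"] += row["n_records"]
    return dict(totals)


# Invented toy rows; names and counts are placeholders.
rows = [
    {"name": "species A", "category": "fishes", "n_records": 120},
    {"name": "species B", "category": "fishes", "n_records": 30},
    {"name": "species C", "category": "whales", "n_records": 500},
]
stats = category_stats(rows)
```

Since the counts are already held in the index, a report like this needs one pass over local rows and no contact with the data providers.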