Controlled Vocabulary Working Group Activities 2005-2006
The Problem
• Inconsistent, disjunct and sparse keywords negatively impact data discovery
• 72.2% of all keywords are used at only a single LTER site
• 90% of all keywords are used at 4 or fewer LTER sites
The Problem
• Good “Browse” interfaces require some organization of keywords
• E.g.
  • BIOSPHERE
    • PLANTS
      • VASCULAR PLANTS
        • OAK
      • NON-VASCULAR PLANTS
    • ANIMALS
      • VERTEBRATES
      • INVERTEBRATES
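A browse hierarchy like the one on this slide can be sketched as a nested mapping. This is only an illustration: the nesting is one reading of the slide's example (PLANTS and ANIMALS under BIOSPHERE, OAK under VASCULAR PLANTS), and the names `BROWSE_TREE`, `browse_paths` and `narrower_terms` are hypothetical, not part of any LTER tool.

```python
# Hypothetical browse tree; the nesting is an assumed reading of the slide.
BROWSE_TREE = {
    "BIOSPHERE": {
        "PLANTS": {
            "VASCULAR PLANTS": {"OAK": {}},
            "NON-VASCULAR PLANTS": {},
        },
        "ANIMALS": {
            "VERTEBRATES": {},
            "INVERTEBRATES": {},
        },
    }
}


def browse_paths(tree, prefix=()):
    """Yield the full path (root to term) for every keyword in the tree."""
    for term, children in tree.items():
        path = prefix + (term,)
        yield path
        yield from browse_paths(children, path)


def narrower_terms(tree, term):
    """Return the direct children of `term`, or None if the term is absent."""
    for name, children in tree.items():
        if name == term:
            return list(children)
        found = narrower_terms(children, term)
        if found is not None:
            return found
    return None


if __name__ == "__main__":
    for path in browse_paths(BROWSE_TREE):
        print(" > ".join(path))
```

A browse interface could list `narrower_terms(BROWSE_TREE, "PLANTS")` when a user opens the PLANTS node, which is exactly the organization that free-form keywords lack.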
Possible Solutions
• Create an LTER Controlled Vocabulary, Thesaurus or Ontology
  • Advantages:
    • Absolute control over contents
    • Ability to customize to meet LTER needs
  • Disadvantages:
    • Development will be expensive in time and resources
    • Such development can be highly technical, requiring specialists
Possible Solutions
• Adopt an existing controlled vocabulary, thesaurus or ontology
  • Advantages:
    • Minimal cost to LTER
    • Aids in linking LTER to a larger world of data systems
  • Disadvantages:
    • Lack of control
    • Existing systems may not be suitable for LTER use
    • Existing systems may lack desirable terms
2005 LTER IM Meeting
• At the 2005 IM meeting we decided that the best option to explore was Option 2 (use an existing resource)
• Rationale:
  • Could potentially save lots of time, trouble and money!
  • Helps forge links with other groups
  • Could make LTER systems interact better with other similar systems
General Steps
• Identify existing resources that LTER could use
  • NBII Thesaurus
  • GEMET (GEneral Multilingual Environmental Thesaurus)
  • Global Change Master Directory (GCMD)
  • SEEK Ontology
• Evaluate the usability of existing systems
• Develop tools and relationships needed to exploit and improve the system(s) of choice
Assembling Resources
• Assemble a list of existing keywords
  • EML
    • Keywords
    • Title words
    • Attribute definition words
    • Taxonomy keywords
      • ITIS SPIRE web service from UMD.BaltCo....
  • DTOC
    • Publication titles, keywords and abstracts
  • Site keyword lists - e.g., AND-LTER
• Need to count word frequency, site frequency and the number of keywords per document
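The counting step on this slide (word frequency, site frequency, keywords per document) can be sketched as follows. The records are made-up sample data, and the record layout, the lower-casing of keywords and the helper names `keyword_stats` and `share_used_at_most` are assumptions for illustration, not the working group's actual procedure.

```python
from collections import Counter, defaultdict

# Made-up sample records: (site code, document id, keywords).
documents = [
    ("AND", "doc1", ["Temperature", "precipitation"]),
    ("AND", "doc2", ["temperature"]),
    ("VCR", "doc3", ["temperature", "salt marsh"]),
]


def keyword_stats(docs):
    """Return (uses per keyword, sites per keyword, keywords per document)."""
    uses = Counter()
    sites = defaultdict(set)
    per_doc = {}
    for site, doc_id, keywords in docs:
        # Assumption: keywords are compared case-insensitively.
        normalized = [k.strip().lower() for k in keywords]
        per_doc[doc_id] = len(normalized)
        for k in normalized:
            uses[k] += 1
            sites[k].add(site)
    site_freq = {k: len(s) for k, s in sites.items()}
    return uses, site_freq, per_doc


def share_used_at_most(site_freq, max_sites):
    """Fraction of distinct keywords used at `max_sites` or fewer sites."""
    n = sum(1 for count in site_freq.values() if count <= max_sites)
    return n / len(site_freq)


uses, site_freq, per_doc = keyword_stats(documents)
```

With real data, `share_used_at_most(site_freq, 1)` and `share_used_at_most(site_freq, 4)` would give the kind of figures quoted on the first slide (share of keywords used at a single site, or at 4 or fewer sites).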
Consolidated List
• The consolidated list includes 21,153 words or terms, each recorded along with:
  • Number of “lists” on which it appeared (max 5)
  • Number of sites and uses from each list
  • Max and Min number of sites using it (0-26)
  • Max and Min number of uses (0-12,611)
  • Whether it is a multi-word term
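One way such a consolidated record could be built is sketched below. The source names and tallies are invented, and treating a term that is absent from a list as 0 sites and 0 uses is an assumption (it would explain minimum values of 0, but the slide does not say so).

```python
# Invented per-list tallies: list name -> {term: (sites, uses)}.
source_lists = {
    "eml_keywords": {"temperature": (20, 900), "salt marsh": (3, 40)},
    "eml_titles": {"temperature": (15, 300)},
    "publications": {"temperature": (26, 1200), "salt marsh": (2, 25)},
}


def consolidate(lists):
    """Merge per-list tallies into one summary row per term."""
    terms = set().union(*(d.keys() for d in lists.values()))
    rows = {}
    for term in terms:
        # Assumption: a term missing from a list counts as (0 sites, 0 uses).
        per_list = [d.get(term, (0, 0)) for d in lists.values()]
        site_counts = [sites for sites, _ in per_list]
        use_counts = [uses for _, uses in per_list]
        rows[term] = {
            "n_lists": sum(1 for _, uses in per_list if uses > 0),
            "max_sites": max(site_counts),
            "min_sites": min(site_counts),
            "max_uses": max(use_counts),
            "min_uses": min(use_counts),
            "multi_word": len(term.split()) > 1,
        }
    return rows


rows = consolidate(source_lists)
```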
Ranking/Rating Words
• Terms were sorted by:
  • Number of lists
  • Max. number of sites on any single list
  • Min. number of sites on any single list
  • Number of uses
• The top 1010 terms were then rated as “useful” (U), “marginal/not sure” (M) or “not useful” (N) by volunteers
  • Rating was needed to catch abbreviations (e.g., CO2) and words that are too general (e.g., “Above”, “Total”)
• The resulting list was then additionally sorted by a term score, where U, M and N are the number of ratings of each kind: T = ((U*1) + (M*0) + (N*-1)) / (U+M+N)
  • Always “Useful” = 1.00, Always “Not Useful” = -1.00
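The term score above reduces to (U - N) / (U + M + N). A small sketch of computing it and sorting by it follows; the rating counts are invented, and sorting from highest to lowest score is an assumption about the intended order.

```python
def term_score(useful, marginal, not_useful):
    """T = ((U*1) + (M*0) + (N*-1)) / (U + M + N), ranging from -1.0 to 1.0."""
    total = useful + marginal + not_useful
    if total == 0:
        raise ValueError("term has no ratings")
    return (useful - not_useful) / total


# Invented rating counts per term: (U, M, N).
ratings = {
    "temperature": (5, 0, 0),
    "above": (0, 0, 4),
    "co2": (2, 2, 0),
    "total": (1, 1, 2),
}

# Assumption: highest-scoring (most consistently "useful") terms come first.
ranked = sorted(ratings, key=lambda t: term_score(*ratings[t]), reverse=True)
```

A term rated “useful” by every volunteer scores 1.00 and one always rated “not useful” scores -1.00, matching the endpoints given on the slide; marginal votes pull a score toward 0 without changing its sign.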
Preliminary Evaluation
• Volunteers have used highly ranked words from the “list of 1000” to test retrieval from various thesauri
• So far NBII seems to be preferred, but we need additional testers
• Inigo San Gil has been working on automated queries of the NBII Thesaurus
Tasks for this meeting
• Once we have a controlled vocabulary, how shall we use it?
• What tools do we need to develop?
• What additional testing/evaluation is required (bring in PIs?)?
• What institutional relationships need to be pursued?
• What actions do we need to take to improve the usability of resources for LTER use?