
Controlled Vocabulary Working Group Activities 2005-2006




  1. Controlled Vocabulary Working Group Activities 2005-2006

  2. The Problem
     • Inconsistent, disjunct and sparse keywords negatively impact data discovery
     • 72.2% of all keywords are used at only a single LTER site
     • 90% of all keywords are used at 4 or fewer LTER sites

  3. The Problem
     • Good “Browse” interfaces require some organization of keywords
     • E.g.
       • BIOSPHERE
         • PLANTS
           • VASCULAR PLANTS
             • OAK
           • NON-VASCULAR PLANTS
         • ANIMALS
           • VERTEBRATES
           • INVERTEBRATES

  4. Possible Solutions
     • Create an LTER Controlled Vocabulary or Thesaurus or Ontology
     • Advantages:
       • Absolute control on contents
       • Ability to customize to meet LTER needs
     • Disadvantages:
       • Development will be time and resource expensive
       • Such development can be a highly technical field requiring specialists

  5. Possible Solutions
     • Adopt an existing controlled vocabulary, thesaurus or ontology
     • Advantages:
       • Minimal cost to LTER
       • Aids in linking LTER to a larger world of data systems
     • Disadvantages:
       • Lack of control
       • Existing systems may not be suitable for LTER use
       • Existing systems may lack desirable terms

  6. 2005 LTER IM Meeting
     • At the 2005 IM meeting we decided that the best option to explore was Option 2 (use an existing resource)
     • Rationale:
       • Could potentially save lots of time, trouble and money!
       • Helps forge links with other groups
       • Could make LTER systems interact better with other similar systems

  7. Plan of Action

  8. General Steps
     • Identify existing resources that LTER could use
       • NBII Thesaurus
       • GEMET (GEneral Multilingual Environmental Thesaurus)
       • Global Change Master Directory (GCMD)
       • SEEK Ontology
     • Evaluate the usability of existing systems
     • Develop tools and relationships needed to exploit and improve the system(s) of choice

  9. Assembling Resources
     • Assemble list of existing keywords
     • EML
     • Keywords
     • Title words
     • Attribute definition words
     • Taxonomy keywords
     • ITIS SPIRE web service from UMD.BaltCo....
     • DTOC
     • Publication titles, keywords and abstracts
     • Site keyword lists, e.g., AND-LTER
     • Need to count word and site frequency and number of keywords per document
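
The counting step named on this slide (word frequency, site frequency, and number of keywords per document) could be sketched roughly as follows. This is a minimal illustration and not the working group's actual tooling: the function name `keyword_stats`, the `(site, keywords)` input shape, the case-folding of keywords, and the sample records are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def keyword_stats(documents):
    """Count keyword uses, sites per keyword, and keywords per document.

    `documents` is an iterable of (site, keywords) pairs; this input
    shape is hypothetical, chosen only for the sketch.
    """
    uses = Counter()            # keyword -> total number of uses
    sites = defaultdict(set)    # keyword -> set of sites using it
    per_doc = []                # number of keywords in each document
    for site, keywords in documents:
        # Assumption: keywords are compared case-insensitively.
        normalized = [k.strip().lower() for k in keywords if k.strip()]
        per_doc.append(len(normalized))
        for keyword in normalized:
            uses[keyword] += 1
            sites[keyword].add(site)
    site_freq = {k: len(s) for k, s in sites.items()}
    return uses, site_freq, per_doc


# Made-up sample records, for illustration only.
docs = [
    ("AND", ["Nitrogen", "Stream chemistry"]),
    ("AND", ["nitrogen"]),
    ("VCR", ["Nitrogen", "Salt marsh"]),
]
uses, site_freq, per_doc = keyword_stats(docs)

# Share of keywords that appear at only one site.
single_site_share = sum(1 for n in site_freq.values() if n == 1) / len(site_freq)
```

The same site-frequency table is what a figure such as "share of keywords used at only a single site" would be computed from.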

  10. Some Statistics

  11. Consolidated List
     • The consolidated list includes 21,153 words or terms, each recorded along with:
       • Number of “lists” on which it appeared (max 5)
       • Number of sites and uses from each list
       • Max and Min number of sites using (0-26)
       • Max and Min number of uses (0-12,611)
       • Is it a multi-word term?

  12. Ranking/Rating Words
     • Terms were sorted by:
       • Number of Lists
       • Max. number of sites on any single list
       • Min. number of sites on any single list
       • Number of uses
     • The top 1010 terms were then rated as “useful” (U), “marginal/not sure” (M) or “not useful” (N) by volunteers
       • Needed for abbreviations (e.g., CO2) and words that are too general (e.g., “Above”, “Total”)
     • The resulting list was then additionally sorted by a term score, where U, M and N are the number of ratings of each kind: T = ((U*1) + (M*0) + (N*-1)) / (U+M+N)
       • Always “Useful” = 1.00, Always “Not Useful” = -1.00
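
As a worked illustration of the term score on this slide, the sketch below computes T from counts of U, M and N ratings and sorts terms by it. The function name `term_score` and the rating counts are hypothetical; only the formula itself comes from the slide.

```python
def term_score(useful: int, marginal: int, not_useful: int) -> float:
    """T = ((U*1) + (M*0) + (N*-1)) / (U+M+N), as given on the slide."""
    total = useful + marginal + not_useful
    if total == 0:
        raise ValueError("term has no ratings")
    # Marginal ratings contribute 0 to the numerator but still count
    # in the denominator, pulling the score toward zero.
    return (useful - not_useful) / total


# Made-up rating counts (U, M, N) for illustration only.
ratings = {
    "nitrogen": (5, 0, 0),
    "biomass": (3, 1, 1),
    "total": (0, 1, 4),
}

# Highest-scoring terms first.
ranked = sorted(ratings, key=lambda t: term_score(*ratings[t]), reverse=True)
```

With these counts, a term rated “useful” by every volunteer scores 1.00, a mixed term such as (3, 1, 1) scores 0.40, and a mostly “not useful” term such as (0, 1, 4) scores -0.80.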

  13. Top of the list

  14. Bottom of the list

  15. Preliminary Evaluation
     • Volunteers have used highly ranked words from the “list of 1000” to test retrieval from various thesauri
     • So far NBII seems to be preferred, but we need additional testers
     • Inigo San Gil has been working on automated queries of the NBII Thesaurus

  16. Tasks for this meeting
     • Once we have a controlled vocabulary, how shall we use it? What tools do we need to develop?
     • What additional testing/evaluation is required (bring in PIs?)? What institutional relationships need to be pursued? What actions do we need to take to improve the usability of resources for LTER use?
