LTER Controlled Vocabulary Workshop, May 26-27, 2011
“Scientists seeking data should be able to efficiently and reliably locate LTER datasets through searching, browsing …” • Get feedback on general direction of working group activities • Resolve some specific issues • Decide on “Next Steps” • Products • Comments to be acted on • White paper concerning specific issues and “next steps” Objectives
Eclectic use of terms used for discovering LTER data makes it difficult to perform reliable or efficient searches • Often several terms for one concept • One site uses CO2, another Carbon Dioxide, another Carbon-dioxide • Carbon to Nitrogen Ratio, C:N, C:N Ratio, Carbon-to-nitrogen Ratio • No way to relate broader terms with narrower terms • Searching on “Landscape Change” doesn’t find data sets related to “desertification” even though desertification is a kind of landscape change The Challenge
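A minimal sketch of how a synonym ring could collapse the variants listed above into one searchable form. The `SYNONYMS` table and the preferred forms chosen in it are illustrative assumptions, not the working group's actual list.

```python
# Hypothetical synonym ring: every variant maps to one preferred term.
# The preferred forms below are assumptions made for illustration only.
SYNONYMS = {
    "co2": "carbon dioxide",
    "carbon dioxide": "carbon dioxide",
    "carbon-dioxide": "carbon dioxide",
    "c:n": "carbon to nitrogen ratio",
    "c:n ratio": "carbon to nitrogen ratio",
    "carbon to nitrogen ratio": "carbon to nitrogen ratio",
    "carbon-to-nitrogen ratio": "carbon to nitrogen ratio",
}


def normalize(keyword: str) -> str:
    """Return the preferred form of a keyword, or the cleaned keyword if unknown."""
    cleaned = " ".join(keyword.lower().split())  # lowercase, collapse whitespace
    return SYNONYMS.get(cleaned, cleaned)


variants = ["CO2", "Carbon Dioxide", "Carbon-dioxide"]
print({normalize(v) for v in variants})  # all three collapse to one term
```

Keywords that are not in the table pass through unchanged (lowercased), so the function can be applied to every keyword in a metadata document without losing terms.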
2006 Analysis of LTER keywords Only 3.2%! * Allows multi-word terms
We started off by surveying what terms were already being used in a variety of LTER documents. Our goal was to see if there were any existing lexical resources that we could simply adopt. Past
Test of list vs NBII Thesaurus - 2008 • 58% of LTER terms were not found in the NBII Thesaurus • Results suggested that we needed to develop our own resource
Identify a list of preferred terms that would be used by sites in creating metadata documents • Focus on LTER-wide searches • Want to facilitate cross-site synthesis • People searching LTER Metacat rather than individual sites are interested in relevant data from multiple sites • Want to hit the “sweet spot” for the number of terms • Too many terms make keywording documents difficult, and result in searches that return too few datasets • Too few terms make it hard to narrow results down to a usably small number of datasets Goals for Development of Keyword List
Assembled list of words already in LTER Metadata (EML documents) • Selected using criteria: • Keywords shared with GCMD and NBII, or • Keywords used at more than one LTER site • Reviewed by Information Managers • Removals and additions were suggested • Edited based on voting • Created a Draft set of Taxonomies • Included some additions and deletions Steps Taken
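The two selection criteria can be sketched as a simple filter. This is an illustration only: it assumes "shared with GCMD and NBII" means the keyword appears in both external vocabularies, and the function name, site codes, and sample data are hypothetical.

```python
from collections import Counter


def select_candidates(site_keywords, gcmd_terms, nbii_terms):
    """Pick candidate keywords from per-site keyword sets.

    site_keywords: dict mapping a site code to the set of keywords used there.
    A keyword is kept if it is used at more than one site, or if it appears
    in both external vocabularies (an assumed reading of the criterion).
    """
    site_counts = Counter()
    for keywords in site_keywords.values():
        site_counts.update(set(keywords))  # count each site at most once
    shared_external = set(gcmd_terms) & set(nbii_terms)
    return {
        kw for kw, n_sites in site_counts.items()
        if n_sites > 1 or kw in shared_external
    }


# Made-up sample data, for illustration only.
sites = {
    "SITE_A": {"biomass", "nitrogen", "plot 12"},
    "SITE_B": {"biomass", "soil moisture"},
}
print(sorted(select_candidates(sites, {"nitrogen", "soil moisture"}, {"nitrogen"})))
```

The later steps (review by Information Managers, voting) are human judgments and are not modeled here.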
Goal: Improve Searching & Browsing • Reliability (of all the suitable target documents, what percentage did you find) • Efficiency (of the documents your search returned, what percentage were suitable) • A list alone is not sufficient to support browsing and sophisticated searching of data – more structure is needed Structuring the Controlled vocabulary
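As defined on this slide, reliability corresponds to what information retrieval calls recall, and efficiency to precision. A minimal sketch of both measures follows; the function names and dataset identifiers are hypothetical.

```python
def reliability(relevant: set, retrieved: set) -> float:
    """Share of all suitable documents that the search found (recall)."""
    if not relevant:
        return 0.0
    return len(relevant & retrieved) / len(relevant)


def efficiency(relevant: set, retrieved: set) -> float:
    """Share of returned documents that were suitable (precision)."""
    if not retrieved:
        return 0.0
    return len(relevant & retrieved) / len(retrieved)


# Made-up example: four suitable datasets exist, the search returns three
# datasets, two of which are suitable.
relevant = {"ds1", "ds2", "ds3", "ds4"}
retrieved = {"ds1", "ds2", "ds9"}
print(f"reliability = {reliability(relevant, retrieved):.2f}")
print(f"efficiency  = {efficiency(relevant, retrieved):.2f}")
```

The two measures pull in opposite directions: broadening a search with synonyms and narrower terms tends to raise reliability, while it can lower efficiency if the added terms are ambiguous.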
Structures • Multiple taxonomies are a Polytaxonomy • [Figure label: Complexity]
Relationships should be independent of context • Must pass “Some-not-all test” • Each taxonomy should include only one type of entity (listed in Z39.19 section 6.3.2) • Things and their physical parts (birds, trees, leaves) • Materials (wood, nitrogen, sand) • Activities or processes (acidification, production) • Events or occurrences (germination, death) • Properties or states of persons, things, materials or actions (age, speed, nitrogen content) • Disciplines or subject fields (ecology, ornithology) • Units of measurement (m, km, miles) • Unique entities (LTER, HJ Andrews Forest) • You can get into trouble if you start “mixing and matching” things within a single taxonomy! Taxonomies – Rules of the Road
The VOCAB Working Group has created a draft set of 10 taxonomies containing 713 terms • Includes additional “broader” terms needed for grouping • Includes synonyms (non-preferred terms) • Some terms originally in the list have been removed because they were perceived to be too ambiguous or context-sensitive to be useful for the purposes of searching or browsing • E.g., “Aboveground” • Some “related” terms have also been identified Activities
In 2010 a request for information was forwarded to the LTER Executive Board: “The Information Management Committee has studied how keywords are used at LTER sites, how LTER keywords relate to external lexicographical resources, and compiled a draft keyword list. We request guidance from the LTER Executive Board on how a controlled vocabulary might be implemented within the context of LTER to improve the reliability of data searches.” The EB generally endorsed the idea of an LTER Controlled Vocabulary, and agreed to help have scientists participate in vetting the list and deciding on next steps (THIS WORKSHOP) Approvals
Permit use of a browse interface • Make searches more sophisticated • See “Use case” for searching • search includes synonyms plus narrower terms and/or related terms • Develop tools to help in adding keywords to LTER metadata documents • Prototype versions of a couple are already available • See Keywording “Use Case” How List and Polytaxonomy Will be used
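A rough sketch of the kind of query expansion described on this slide, where a search term is widened to its synonyms and all narrower terms. The `NARROWER` and `SYNONYMS` tables are hypothetical stand-ins for the polytaxonomy; only the "landscape change" and "carbon dioxide" examples come from earlier slides.

```python
# Hypothetical fragments of a polytaxonomy, for illustration only.
NARROWER = {
    "landscape change": ["desertification"],
}
SYNONYMS = {
    "carbon dioxide": ["co2", "carbon-dioxide"],
}


def expand(term: str, include_narrower: bool = True) -> set:
    """Return the term plus its synonyms and, optionally, all narrower terms."""
    expanded = set()
    stack = [term.lower()]
    while stack:
        current = stack.pop()
        if current in expanded:
            continue  # guards against cycles or repeated terms
        expanded.add(current)
        expanded.update(SYNONYMS.get(current, []))
        if include_narrower:
            stack.extend(NARROWER.get(current, []))
    return expanded


print(sorted(expand("Landscape Change")))
print(sorted(expand("carbon dioxide")))
```

A search interface could then match datasets whose keywords contain any term in the expanded set, so a query on “Landscape Change” would also find datasets keyworded only with “desertification”. Related terms could be handled with a third table in the same way.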
What are your experiences with finding LTER data? What would be most helpful in finding data in the future? Review of “Use Cases” Task 1: Locating data
Evaluate the utility of the draft polytaxonomy • Is it better than the existing LTER Metacat interfaces? • Are there large changes that need to be made? • Elimination of specific taxonomies? • Creation of new taxonomies? • Addition of related terms to make a thesaurus? • Are there small changes needed? • Removal or replacement of terms Task 2: Reviewing the list & taxonomy
Improvement of existing documents • Review existing keywords and change to preferred forms • Note: even without doing this the synonym ring will help improve searching and browsing • Use preferred terms for new documents • Ideally at least one term from each of the relevant taxonomies • Note: addition of new terms to the list should require review of all existing documents to see if they should be added – so term additions should be rare • Changes in taxonomies and term relationships do not require re-keywording of existing documents Changes: Implications for Sites
Todd & Margaret • Focus on INTERFACE • Ways to present the data • Allow “query within result set” • Intersect query sets • Group options – by site, by time • side by side comparisons • Be able to find where different types of data intersect • Can be very difficult due to missing data etc. • Problem extends beyond query interface • Interface needs to be a higher priority – sooner rather than later • Recommendation to IMC/NISAC/EB Group feedback
Rodger and Kristin • Highest level of hierarchy • Found some things to change or add “root production”, “belowground productivity” • Were generally happy with overall organization • Need system for adding new keywords – this is just a start • Intrigued by theory and where we go from here • How does it matter what is in one place or another? • Want to make sure things are well-organized…. • Data vs research question • Does not matter where it is when adding to keyword list • Need to have “best practices” for adding keywords • How will that affect sites? • How many data sets have no preferred terms? Group Feedback
• At least one word from list • At least one term from each of at least 5 of the 10 taxonomies • Signature datasets should be flagged with “signature dataset” tag • Should include Core area(s) Best practice
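A minimal sketch of how the first two best-practice rules could be checked automatically for one dataset. The taxonomy names and the `term_taxonomy` lookup are hypothetical placeholders, since the slides do not list the 10 taxonomies.

```python
def check_keywords(keywords, term_taxonomy, min_taxonomies=5):
    """Check a dataset's keywords against the best-practice rules.

    term_taxonomy: dict mapping each preferred term to the taxonomy it sits in
    (a hypothetical lookup; a term in several taxonomies would need a set).
    """
    preferred = [kw for kw in keywords if kw in term_taxonomy]
    covered = {term_taxonomy[kw] for kw in preferred}
    return {
        "has_preferred_term": bool(preferred),
        "taxonomies_covered": len(covered),
        "meets_coverage": len(covered) >= min_taxonomies,
    }


# Made-up lookup with placeholder taxonomy names.
lookup = {
    "biomass": "taxonomy_1",
    "nitrogen": "taxonomy_2",
    "forests": "taxonomy_3",
}
print(check_keywords(["biomass", "nitrogen", "my local plot"], lookup))
```

A report like this could be run across a site's metadata documents to answer the earlier question of how many datasets have no preferred terms.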
Core area - Problems with definitions • Some datasets are either none, or all core areas • Weather data • Change entities to core areas? • People will want to look for this • Would not have hierarchy? • That would be OK – can have related terms • Could link to signature datasets • Need “signature dataset” keyword – used to weight • Or prioritize signature datasets for adding preferred terms • Treat as unique: • Primary Production (core area) • Data can be applied to MANY core areas - won’t map • e.g. Climate • Try adding core area taxonomy and then add core areas and related terms????? • May not be needed or appropriate – we are asking the data catalog to do too much – need catalog of research topics Core Areas
Want to search for signature datasets at top level of the hierarchy • Needs to be one click away “Signature data” Consensus
Julia and Don • Would be interesting to tally the number of hits for each keyword for each site • Tally of number of datasets for each site • GIS should be preferred term • Can mean Geographical Information Science Group reports
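The suggested tallies could be produced along these lines. This is a sketch only: the input format (one `(site, keywords)` pair per dataset) and the sample records are assumptions.

```python
from collections import Counter


def tally(datasets):
    """Count datasets per site and datasets per (site, keyword) pair.

    datasets: iterable of (site, keywords) pairs, one pair per dataset.
    """
    per_site = Counter()
    per_site_keyword = Counter()
    for site, keywords in datasets:
        per_site[site] += 1
        for kw in set(keywords):  # count a keyword once per dataset
            per_site_keyword[(site, kw)] += 1
    return per_site, per_site_keyword


# Made-up records, for illustration only.
records = [
    ("SITE_A", ["biomass", "nitrogen"]),
    ("SITE_A", ["biomass"]),
    ("SITE_B", ["nitrogen"]),
]
per_site, per_site_keyword = tally(records)
print(per_site)
print(per_site_keyword)
```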
Atmospheric processes cross listed under hydrologic properties • Evapotranspiration should be above transpiration and evaporation • Snow not under precipitation • Geographical Properties -> Spatial Properties • Move imagery under that with satellite and photos under that – deprecate Landsat • Methods – field, spatial, lab, analytical subcategories • Also cores, dendrometers etc. tools could go under this • Entities • For detailed ones, tried to find other homes • Diseases to disease and move under bio processes • Levels of organization for communities, populations, species • Are these useful terms? How often used • Biomes instead of Ecosystems Group Reports – Julia & Don Structural changes
Core areas • Do we need a special taxonomy for core areas? • Are related-terms needed, or is a polytaxonomy (hierarchy) sufficient? • Management of the vocabulary – role of researchers? • Preferred terms – are all really preferred? • E.g., Permanent forest plots Task 3: Specific Issues
How do we engage larger LTER community? • How much, and what sort of engagement is needed? • Requests we should make to the EB or IMC? • Managing the controlled vocabulary • What technology development is needed, and who should pursue it? Task 4: Next Steps
Anyone can propose adding, editing, deleting or moving terms within the hierarchy, with justification. • Proposals would be evaluated by the Controlled Vocabulary Working Group according to the following criteria: • The proposed terms should provide clear utility for searching and browsing, and not introduce ambiguity • The proposed terms should be suitable for inclusion (e.g., not locations or specific taxonomic identifiers) • Proposed terms should not be redundant with existing term(s) already in the vocabulary • Terms and their proposed places in taxonomies or thesauri should conform in form with NISO Z39.19-2005 and successor documents (e.g., sections 6.5.1, 8.3) Proposed management plan
Best Practices for adding keywords • Preferred terms (and preferred preferred terms) • Presentation to PIs • Statistics on numbers of hits • Add workshop participants to VOCAB • Put in supplement proposal for development of search interface • Write it up now – Shovel Ready! • Like MALS – need to have all sites sign up with letters of endorsement Recommendations