EcoTerm IV NBII/EioNet Demo of Federated KOS Search. Mike Frame Vienna, Austria April 2007. Discussion Topics…. Project Background NBII Thesaurus GEMET Thesaurus Prototype Client Sample Query Results Including no, 1, or both thesauri Overall Findings.
EcoTerm IV NBII/EioNet Demo of Federated KOS Search Mike Frame Vienna, Austria April 2007
Discussion Topics… • Project Background • NBII Thesaurus • GEMET Thesaurus • Prototype Client • Sample Query Results • Including no, 1, or both thesauri • Overall Findings
Biocomplexity Thesaurus http://thesaurus.nbii.gov
EIONET GEMET Thesaurus http://www.eionet.europa.eu/gemet/webservices?langcode=en
NBII/EIONET Thesaurus Web-service • Background – collaboration through Ecoinformatics TWG • Primary Goal – access distributed multi-lingual thesauri • Results – SKOS web-service & client
Latest Client & Service capabilities • Access to both NBII and GEMET • Single language capability • Results are provided by source • All documentation is completed http://thesaurus.nbii.gov
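To illustrate "Results are provided by source", here is a minimal Python sketch of a federated lookup that queries each thesaurus separately and keeps the answers grouped by source. It does not call the real NBII or GEMET web services: `make_source` and `federated_search` are hypothetical helpers, and the label lists are small in-memory stand-ins built from terms mentioned in this deck.

```python
from typing import Callable, Dict, List

# Each source is modelled as a function from a search term to matching labels.
Lookup = Callable[[str], List[str]]


def make_source(labels: List[str]) -> Lookup:
    """Build a stand-in lookup over a fixed label list (substring match)."""
    def lookup(term: str) -> List[str]:
        needle = term.lower()
        return [label for label in labels if needle in label.lower()]
    return lookup


def federated_search(term: str, sources: Dict[str, Lookup]) -> Dict[str, List[str]]:
    """Query every source and keep the answers grouped by source name."""
    return {name: lookup(term) for name, lookup in sources.items()}


# Assumed sample labels; the real services hold far larger vocabularies.
sources = {
    "NBII": make_source(["Endangered species", "Special status species", "Taxa"]),
    "GEMET": make_source(["endangered species", "environmental protection"]),
}

results = federated_search("species", sources)
for source_name, labels in results.items():
    print(source_name, labels)
```

Keeping the results keyed by source, rather than merging them into one list, lets a client show which thesaurus contributed each term.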
Initial Challenges Identified • Thesaurus scope, intent, purpose, and coverage differ • NBII = sub-discipline of environment • Endangered species • Broader Terms: Species, Special status species, Taxa • EIONET = broad environment • Broader Terms: environmental protection
Current State • Users • Most aren’t aware of the underlying vocabulary • Vocabularies are often unique to an organization and used more for “categorization” than retrieval • Goal • Include all vocabularies and let the search engine handle results
Demonstration Search Retrieval • Created demonstration datasets • NBII Cataloged Resources • ~30,000 web-sites, publications, images, maps, etc. • XML structured data – controlled subject • NBII FGDC Metadata • ~22,000 resources on research studies • 150-200 elements • Semi-structured with no controlled vocabulary
NBII Catalog Records • Based on the Dublin Core + • 18 elements, of which 10 are mandatory • In place since 2002 • Used by distributed content managers
Process • Added thesaurus capabilities to Development Search Engine for: • NBII Thesaurus • EIONET GEMET Thesaurus • Used BT, RT, NT relationships & weighting • Performed sample queries within the test repositories for: • No thesaurus • GEMET only aided searching • NBII only aided searching • GEMET+NBII aided searching (X)
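The process above can be sketched in Python as weighted query expansion over BT, NT, and RT relationships. This is an illustration under stated assumptions, not the search engine's actual implementation: the weight values are invented, `expand` is a hypothetical helper, and the two dictionaries are tiny stand-ins that hold only the broader terms for "endangered species" shown earlier in this deck.

```python
# Hypothetical relationship weights; the slides do not give the real values.
WEIGHTS = {"BT": 0.5, "NT": 0.7, "RT": 0.3}

# Stand-in thesauri. Only the broader terms for "endangered species" come
# from the slides; the data structure itself is an assumption.
NBII = {"endangered species": {"BT": ["species", "special status species", "taxa"]}}
GEMET = {"endangered species": {"BT": ["environmental protection"]}}


def expand(query, thesauri):
    """Return {term: weight}: the query at 1.0 plus weighted related terms."""
    key = query.lower()
    expanded = {key: 1.0}
    for thesaurus in thesauri:
        for relation, terms in thesaurus.get(key, {}).items():
            for term in terms:
                weight = WEIGHTS[relation]
                # Keep the highest weight when several sources offer a term.
                expanded[term] = max(expanded.get(term, 0.0), weight)
    return expanded


# The four test conditions listed in the process.
runs = {
    "No thesaurus": expand("endangered species", []),
    "GEMET only": expand("endangered species", [GEMET]),
    "NBII only": expand("endangered species", [NBII]),
    "GEMET+NBII": expand("endangered species", [GEMET, NBII]),
}
for name, terms in runs.items():
    print(name, terms)
```

Each expanded dictionary would then be handed to the search engine as a weighted query, so that the original term still dominates the ranking.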
Test Repository 1 • NBII Resource Catalog (Dublin Core)
GEMET Thesaurus – “rare species” (expanded degrees of relevance)
Test Repository 2 • NBII FGDC Metadata
Sample Queries – No vocabularies Metadata CH “invasive species”
Sample Queries – No vocabularies Metadata CH “endangered species”
Sample Queries – No vocabularies Metadata CH “protected species”
Overall Results – General Findings • Assumption that a thesaurus improves the “number” of results is valid • Degree does vary by the term and mappings • Since users search from a # of perspectives, backgrounds, and levels of expertise, multiple thesauri do improve the number of results
Overall Results – Using only GEMET Terminology • Terms not included in the NBII thesaurus that were in GEMET improved search results • GEMET strength of broad coverage aided searches • In general for the Metadata repository • Results varied somewhat, but often same top 10 results
Overall Results – General Findings • “No thesaurus” tests produced poorer #1 results • Thesaurus use changed the result ordering more for the structured set than for the unstructured set (Metadata)
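Findings such as "often same top 10 results" and "poorer #1 results" can be put in numbers by comparing the ranked lists from two runs. The Python sketch below shows one simple way to do that; it is not the method used in the study, the function names are hypothetical, and the record identifiers are made up.

```python
def top_k_overlap(run_a, run_b, k=10):
    """Fraction of the top-k items shared by two ranked result lists."""
    top_a, top_b = set(run_a[:k]), set(run_b[:k])
    if not top_a and not top_b:
        return 1.0  # two empty lists are trivially identical
    return len(top_a & top_b) / max(len(top_a), len(top_b))


def same_first_hit(run_a, run_b):
    """True when both runs are non-empty and agree on the #1 result."""
    return bool(run_a) and bool(run_b) and run_a[0] == run_b[0]


# Made-up record identifiers standing in for two runs of the same query.
no_thesaurus = ["r3", "r1", "r2", "r4"]
gemet_aided = ["r1", "r3", "r2", "r5"]

overlap_top3 = top_k_overlap(no_thesaurus, gemet_aided, k=3)
overlap_top4 = top_k_overlap(no_thesaurus, gemet_aided, k=4)
first_hit_agrees = same_first_hit(no_thesaurus, gemet_aided)
print(overlap_top3, overlap_top4, first_hit_agrees)
```

A high overlap with a different first hit would match the pattern reported here: much the same records are retrieved, but the thesaurus changes their order.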
Issues • “Integrating” thesauri of differing scope and purpose presents challenges: • Can’t turn the effort into a thesaurus project • Degrees of relevance of terms are an issue • Concept matching or different intent • Differing classification (RT vs. NT) across thesauri • Differing “weighting” algorithms
Further Study Options 1.) Take multiple thesauri “as is” 2.) Do some “attempted” concept matching, e.g. “endangered animal species” – “endangered animal” 3.) If no match is present, add term and relationship as is 4.) Obtain terms from XMDR
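Options 2 and 3 can be sketched in Python as follows. This is an assumed matching rule for illustration only: two terms are treated as the same concept when the words of one are contained in the words of the other, which pairs “endangered animal” with “endangered animal species”. The helpers `tokens`, `find_match`, and `merge` are hypothetical, and the rule is deliberately crude (a single generic word such as “species” would match many terms).

```python
def tokens(term):
    """Lower-cased word set of a term."""
    return frozenset(term.lower().split())


def find_match(term, vocabulary):
    """Return the first vocabulary term whose words contain, or are
    contained in, the words of the given term; None when nothing matches."""
    wanted = tokens(term)
    for candidate in vocabulary:
        have = tokens(candidate)
        if wanted <= have or have <= wanted:
            return candidate
    return None


def merge(base, incoming):
    """Map each incoming term to a matching term, or add it as is."""
    merged = list(base)
    mapping = {}
    for term in incoming:
        match = find_match(term, merged)
        if match is None:
            merged.append(term)  # option 3: no match, so add the term as is
            match = term
        mapping[term] = match
    return merged, mapping


merged, mapping = merge(
    ["endangered animal species", "taxa"],
    ["endangered animal", "environmental protection"],
)
print(merged)
print(mapping)
```

A real integration would also have to carry over each added term's relationships, which this sketch leaves out.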
Further Study Options – cont. • Follow-up with additional repositories • Repeat with other query terms • Re-look at weighting algorithms • Do queries with subset of terms • Repeat with completely integrated thesaurus as compared to • Repeat queries with machine integration • Complete by June