GI-DAYS MÜNSTER
A software tool for thesauri management, browsing and supporting advanced searches
J. Nogueras-Iso, J.A. Bañares, J. Lacasta, J. Zarazaga-Soria
Münster, 26-27 June 2003
Contents
• Introduction
• Architecture of THManager application
• Basic capabilities
• Enhanced capabilities
• Conclusions
Introduction to thesauri
• "A thesaurus is a set of terms that describe the vocabulary of a controlled indexing language, formally organized so that the a priori relationships between concepts (for example synonymous terms, broader terms, narrower terms and related terms) are made explicit" [ISO 2788]
• Thesauri are used to improve the precision and recall of information retrieval in digital libraries. They:
  • provide a uniform and consistent vocabulary for indexing metadata ("description of the data holdings")
  • supply users with a suitable vocabulary for retrieval
  • allow the expansion of users' queries by automatically adding new terms to the query
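The query expansion mentioned in the last bullet can be sketched as follows. This is a minimal illustration, not ThManager code: the `Term` record, the `expand_query` function and the sample entries (including the synonym "mishap" and the related term) are assumptions made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One thesaurus entry with ISO 2788-style relationships (hypothetical model)."""
    label: str
    broader: set = field(default_factory=set)    # BT
    narrower: set = field(default_factory=set)   # NT
    related: set = field(default_factory=set)    # RT
    synonyms: set = field(default_factory=set)   # non-preferred terms


def expand_query(query_terms, thesaurus, include_related=False):
    """Return the query terms plus their synonyms and narrower terms.

    `thesaurus` maps a preferred label to its Term. Query terms that are
    not in the thesaurus are kept unchanged. Duplicates are dropped while
    preserving order.
    """
    expanded = []
    for q in query_terms:
        candidates = [q]
        term = thesaurus.get(q)
        if term is not None:
            candidates += sorted(term.synonyms) + sorted(term.narrower)
            if include_related:
                candidates += sorted(term.related)
        for c in candidates:
            if c not in expanded:
                expanded.append(c)
    return expanded


# Toy data, invented for illustration only.
thesaurus = {
    "accident": Term(
        "accident",
        narrower={"traffic accident", "nuclear accident"},
        related={"explosion"},
        synonyms={"mishap"},
    ),
}

print(expand_query(["accident", "river"], thesaurus))
```

Adding narrower terms and synonyms tends to raise recall; related terms are left optional here because they can lower precision.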
Introduction to thesauri
• A thesaurus management tool is a vital component in the development of any kind of digital library
• One of the main objectives of Spatial Data Infrastructures (SDIs) is to provide discovery, evaluation and access to spatial data for a community of users
• An SDI can be considered a digital library specialised in geographic information resources
• A thesaurus management tool is therefore also a vital component in the development of SDIs
Architecture of THManager application
[Figure: layered architecture diagram; components are labelled "basic" or "enhanced"]
• Level 3. Application: ThManager (ThesaurusMngmt)
• Level 2. GUI: Thesaurus.gui, generic GUI components for thesauri visualization
• Level 1. Model: Thesaurus.model (thesaurus management, import/export); Keywords (keywords expansion); Polisemy (polysemy extraction); WordNet; branch disambiguation; Lexicon
• << JDBC >>
• Level 0. Database
  • Thesaurus: 100% SQL (basic), Oracle IntermediaText (enhanced)
  • WordNet files
  • Metadata records
Basic Capabilities
• Editing of thesauri according to ISO norms
  • Broader terms (BT), narrower terms (NT)
  • Related terms (RT), preferred terms (PT)
  • Scope notes (SN), synonyms (SYN, USE)
  • Language translations (TR)
• Visualization of thesauri
  • Hierarchical, alphabetical
  • Search of terms
• Multilingual access support
  • Browsing according to the language selected by users
• Import/Export
  • Proprietary text file formats
Import/export formats
• Formats
  • Dot-based notation
    • succession of narrower terms + additional relationships (SYN, TR, ...)
  • Hierarchical numbering of terms
• The tool should support more standardized formats:
  • RDFS/XML, ...
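The slides do not give the exact syntax of the dot-based notation, so the following is only a sketch under an assumed layout: one term per line, the number of leading dots giving the depth below the root, and optional `SYN:` / `TR:` lines attached to the preceding term. The function name `parse_dot_notation` and the sample text are hypothetical.

```python
def parse_dot_notation(text):
    """Parse an ASSUMED dot-indented thesaurus listing (not the real ThManager format).

    Returns {term: {"BT": parent or None, "NT": [...], "SYN": [...], "TR": [...]}}.
    """
    terms = {}
    stack = []      # stack[d] is the most recent term seen at depth d
    current = None  # term that relationship lines attach to
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        code, sep, value = line.partition(":")
        if sep and code in ("SYN", "TR"):
            if current is None:
                raise ValueError(f"relationship before any term: {line!r}")
            terms[current][code].append(value.strip())
            continue
        depth = len(line) - len(line.lstrip("."))
        label = line[depth:].strip()
        if depth > len(stack):
            raise ValueError(f"depth jumps by more than one level: {line!r}")
        parent = stack[depth - 1] if depth > 0 else None
        terms[label] = {"BT": parent, "NT": [], "SYN": [], "TR": []}
        if parent is not None:
            terms[parent]["NT"].append(label)
        del stack[depth:]       # forget deeper terms from earlier siblings
        stack.append(label)
        current = label
    return terms


# Invented sample; the hierarchy shown is illustrative only.
sample = """\
accident
SYN: mishap
.major accident
..nuclear accident
.traffic accident
TR: Verkehrsunfall
"""

for term, info in parse_dot_notation(sample).items():
    print(term, info)
```

A standardized exchange format such as RDFS/XML would make this kind of ad hoc parser unnecessary, which is the motivation stated on the slide.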
Enhanced capabilities
• Thesauri are intended for the homogeneous classification of resources
  • They are used to fill metadata keywords
• However, there is still heterogeneity in metadata keywords
  • Metadata creators use different thesauri in different application domains
• If metadata catalogs provide access to the general public
  • Queries may not contain the same terms as the keywords in metadata records
• A possible solution to bridge the semantic gap
  • Disambiguation of thesauri (and queries) in relation to the concepts of an upper-level ontology
Enhanced capabilities
[Figure: diagram showing WordNet together with Thesaurus 1, Thesaurus 2, Thesaurus N, Controlled list 1, Controlled list 2, Controlled list N and other knowledge representation models]
• Additional tools around semantic disambiguation
  • Browsing WordNet as another thesaurus
  • Searching polysemic senses in WordNet
  • Thesauri disambiguation
  • Automatic expansion of keywords
Browsing WordNet
• WordNet is structured in a hierarchy of synsets
• Synsets are defined as sets of synonyms representing a particular concept (sense)
• WordNet libraries and files are accessed through JNI
Searching polysemic senses in WordNet
• Functionality provided by the Polisemy package
• Compound terms are partitioned if no synset is found
• If adjectives are found, the associated nouns are also searched to reduce the number of not-found words
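The lookup strategy above can be sketched as follows. The dictionaries stand in for WordNet and for an adjective-to-noun mapping; their contents, the sense identifiers and the function name `find_senses` are invented for illustration, and the reading that nouns are searched in addition to (not instead of) the adjective is an assumption.

```python
# Toy stand-ins for WordNet lookups; all entries are invented.
SYNSETS = {
    "traffic": ["traffic#1", "traffic#2"],
    "accident": ["accident#1", "accident#2"],
    "environment": ["environment#1"],
}
ADJECTIVE_TO_NOUN = {"environmental": "environment"}


def find_senses(term, synsets=SYNSETS, adj_to_noun=ADJECTIVE_TO_NOUN):
    """Return (senses_by_word, not_found) for a thesaurus term.

    The whole term is tried first. Only if it has no synset is it split
    into words; for a known adjective the associated noun is searched too.
    """
    term = term.lower().strip()
    if term in synsets:
        return {term: list(synsets[term])}, []
    senses, not_found = {}, []
    for word in term.split():
        found = False
        for candidate in (word, adj_to_noun.get(word)):
            if candidate is not None and candidate in synsets:
                senses[candidate] = list(synsets[candidate])
                found = True
        if not found:
            not_found.append(word)
    return senses, not_found


print(find_senses("environmental accident"))
print(find_senses("core meltdown"))
```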
Thesauri Disambiguation
[Figure: sample thesaurus branches with terms such as accident, administration, environmental accident, major accident, traffic accident, work accident, accident source, technological accident, ..., nuclear accident, shipping accident, explosion, oil slick, leakage, core meltdown]
• Unsupervised disambiguation method
• The senses of every thesaurus term are searched in WordNet
• The hierarchical structure of the thesaurus is used as the word context for a voting algorithm that finds the closest sense
• Thesauri are partitioned into branches (trees formed by BT/NT terms whose root has no BT)
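The partition into branches can be sketched as below, assuming each term has at most one broader term (the slides do not say whether polyhierarchy is allowed). The function name and the sample hierarchy are hypothetical.

```python
def partition_into_branches(broader):
    """Group terms by the root of their BT chain.

    `broader` maps each term to its broader term, or None for a root
    (a term with no BT). Assumes at most one BT per term; a cycle in the
    BT chain raises ValueError.
    """
    def root_of(term):
        seen = set()
        while broader.get(term) is not None:
            if term in seen:
                raise ValueError(f"cycle in BT chain at {term!r}")
            seen.add(term)
            term = broader[term]
        return term

    branches = {}
    for term in broader:
        branches.setdefault(root_of(term), []).append(term)
    return branches


# Invented hierarchy for illustration only.
broader = {
    "accident": None,
    "major accident": "accident",
    "nuclear accident": "major accident",
    "administration": None,
}

print(partition_into_branches(broader))
```

Each resulting branch supplies the context terms used by the voting step.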
Thesauri Disambiguation II
• Voting algorithm to obtain the disambiguated synset of a term "a"
  • Every synset "s" associated with the other terms in the branch votes (with a proximity weight) for the synsets of term "a"
• Main weight: number of subsumers in the WordNet hierarchy
  • Matches in the WordNet hierarchy of ancestors
• Discounting factors:
  • Synset depth
  • Branch distance
  • Polysemy of the term associated with synset "s"
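The slides name the ingredients of the vote but not the formula, so the sketch below assumes one simple combination: the number of shared subsumers divided by the candidate's depth, the branch distance and the polysemy of the voting term. The hypernym table, sense identifiers and distances are invented toy data, not real WordNet content.

```python
# Toy hypernym links (synset -> parent); invented, not real WordNet data.
HYPERNYM = {
    "entity": None,
    "event": "entity",
    "abstraction": "entity",
    "mishap": "event",
    "chance": "abstraction",
    "accident#1": "mishap",
    "accident#2": "chance",
    "explosion#1": "event",
    "explosion#2": "abstraction",
    "leakage#1": "event",
}
BRANCH_SENSES = {
    "accident": ["accident#1", "accident#2"],
    "explosion": ["explosion#1", "explosion#2"],
    "leakage": ["leakage#1"],
}
# BT/NT steps between "accident" and each other term (assumed values).
DISTANCE = {"explosion": 1, "leakage": 2}


def ancestors(synset, hypernym):
    """Return the synset followed by all its subsumers up to the root."""
    chain = []
    while synset is not None:
        chain.append(synset)
        synset = hypernym[synset]
    return chain


def disambiguate(term, branch_senses, distance, hypernym):
    """Return (best_synset, scores) for `term` using the assumed vote formula."""
    scores = {}
    for candidate in branch_senses[term]:
        cand_chain = ancestors(candidate, hypernym)
        total = 0.0
        for other, senses in branch_senses.items():
            if other == term:
                continue
            for s in senses:
                shared = len(set(cand_chain) & set(ancestors(s, hypernym)))
                # ASSUMED discount: synset depth * branch distance * polysemy.
                total += shared / (len(cand_chain) * distance[other] * len(senses))
        scores[candidate] = total
    if not scores:
        return None, scores
    return max(scores, key=scores.get), scores


best, scores = disambiguate("accident", BRANCH_SENSES, DISTANCE, HYPERNYM)
print(best, scores)
```

With this toy data the "event" sense of accident collects more votes than the "chance" sense because the other branch terms share more ancestors with it.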
Thesauri disambiguation III
[Figure: annotation of disambiguated synsets]
Automatic expansion of keywords with new disambiguated thesauri
[Figure: comparison between the initial collection of synsets and the synsets of a new term]
Conclusions & future lines
• ThManager is a flexible tool to manage thesauri
  • It provides enhanced functionality for the improvement of classifications
• This tool can be easily integrated into other tools
  • It is used by a metadata editing tool (also presented here) to select the appropriate term for the distinct metadata fields
• Future lines:
  • Creation of a thesaurus Web Service providing some of the functionality offered by this tool
    • thesaurus browsing, WordNet polysemy extraction, keywords expansion, ...
  • Concept-based retrieval
    • Exploit the semantic disambiguation of thesauri to test different information retrieval strategies for geographic data catalogs
    • It is possible to index metadata records according to a unified system: the disambiguated WordNet synsets
Advanced Information Systems Laboratory http://iaaa.cps.unizar.es