Learn how the THManager application supports the management and browsing of thesauri, aiding advanced searches and keyword expansion for improved information retrieval. Explore its basic and enhanced capabilities, relevant for digital libraries and spatial data infrastructures.
GI-DAYS MÜNSTER: A software tool for thesauri management, browsing and supporting advanced searches. J. Nogueras-Iso, J.A. Bañares, J. Lacasta, J. Zarazaga-Soria. Münster, 26-27 June 2003
Contents • Introduction • Architecture of THManager application • Basic capabilities • Enhanced capabilities • Conclusions
Introduction to thesauri • "A thesaurus is a set of terms that describe the vocabulary of a controlled indexing language, formally organized so that the a priori relationships between concepts (for example synonymous terms, broader terms, narrower terms and related terms) are made explicit" [ISO 2788] • Used to improve the precision and recall of information retrieval in digital libraries • Provide a uniform and consistent vocabulary for indexing metadata ("description of the data holdings") • Supply users with a suitable vocabulary for retrieval • Expand users' queries by automatically adding new terms to the query
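The query expansion named in the last bullet can be sketched as follows. This is a minimal illustration under stated assumptions, not ThManager's implementation: the toy thesaurus, its relations, and the function name `expand_query` are all made up for the example.

```python
# Toy thesaurus: each term maps relation codes (NT, BT, RT) to lists of terms.
# The entries are illustrative and not taken from a real thesaurus.
THESAURUS = {
    "accident": {"NT": ["traffic accident", "nuclear accident"], "RT": ["explosion"]},
    "traffic accident": {"BT": ["accident"]},
    "nuclear accident": {"BT": ["accident"], "RT": ["core meltdown"]},
}


def expand_query(terms, thesaurus, relations=("NT", "RT")):
    """Return the query terms plus terms one step away via the given relations."""
    expanded = list(terms)
    for term in terms:
        entry = thesaurus.get(term, {})
        for relation in relations:
            for related in entry.get(relation, []):
                if related not in expanded:
                    expanded.append(related)
    return expanded
```

Which relations to follow is a design choice: adding narrower terms tends to raise recall without hurting precision much, while related terms widen the query further.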
Introduction to thesauri • A thesaurus management tool becomes a vital component in the development of any kind of digital library • One of the main objectives of Spatial Data Infrastructures (SDIs) is to provide the discovery, evaluation and access to spatial data for a community of users • An SDI can be considered as a digital library specialised in geographic information resources • A thesaurus management tool will also be a vital component for the development of SDIs
Architecture of THManager application • Level 3. Application • Level 2. GUI • Level 1. Model • << JDBC >> • Level 0. Database: Thesaurus, 100% SQL (basic), Oracle IntermediaText (enhanced) • WordNet files • Metadata records • [Figure labels: ThesaurusMngmt; ThManager; Thesaurus.gui (generic GUI components for thesauri visualization); Thesaurus.model (keywords, thesaurus management, import/export, keywords expansion); Polisemy (WordNet polysemy extraction, branch disambiguation); Lexicon; components are marked as basic or enhanced]
Basic Capabilities • Editing of thesauri according to ISO standards • Broader terms (BT), narrower terms (NT) • Related terms (RT), preferred terms (PT) • Scope notes (SN), synonyms (SYN, USE) • Language translations (TR) • Visualization of thesauri • Hierarchical, alphabetical • Search of terms • Multilingual access support • Browsing according to the language selected by users • Import/Export • Proprietary text file formats
Import/export formats • Formats • Dot-based notation • succession of narrower terms + additional relationships (SYN, TR, ...) • Hierarchical numbering of terms • More standardized formats should be used: • RDFS/XML, ...
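The slides do not specify the dot-based notation, so the following is only a sketch of how such a file could be read. It assumes a hypothetical layout in which each line holds one term and the number of leading dots gives its depth in the BT/NT hierarchy; additional relationships (SYN, TR, ...) are ignored. The function name `parse_dot_notation` is likewise invented for the example.

```python
def parse_dot_notation(text):
    """Parse lines such as '..term' into (term, broader_term) pairs.

    Assumed (hypothetical) layout: the count of leading dots is the depth
    of the term, and its broader term is the nearest preceding term that
    sits one level higher. Top-level terms get None as broader term.
    """
    pairs = []
    stack = []  # stack[d] is the most recent term seen at depth d
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        name = line.lstrip(".")
        depth = len(line) - len(name)
        del stack[depth:]  # drop terms at this depth or deeper
        broader = stack[-1] if stack else None
        term = name.strip()
        pairs.append((term, broader))
        stack.append(term)
    return pairs
```

A stack of "most recent term per depth" is enough here because a depth-first listing always places a narrower term after its broader term.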
Enhanced capabilities • Thesauri are intended for the homogeneous classification of resources • They are used to fill metadata keywords • However, there is still heterogeneity in metadata keywords • Metadata creators use different thesauri in different application domains • If metadata catalogs provide access to the general public • Queries may not contain the same terms as the keywords in metadata records • A possible solution to bridge the semantic gap • Disambiguation of thesauri (and queries) in relation to the concepts of an upper-level ontology
Enhanced capabilities • Additional tools around semantic disambiguation • Browsing WordNet as another thesaurus • Searching polysemic senses in WordNet • Thesauri disambiguation • Automatic expansion of keywords • [Figure labels: WordNet; Thesaurus 1, Thesaurus 2, ..., Thesaurus N; Controlled list 1, Controlled list 2, ..., Controlled list N; other knowledge representation models]
Browsing WordNet • WordNet is structured in a hierarchy of synsets • Synsets are defined as sets of synonyms representing a particular concept (sense) • WordNet libraries and files are accessed by JNI
Searching polysemic senses in WordNet • Functionality provided by the Polisemy package • Compound terms are partitioned if no synset is found • If adjectives are found, the associated nouns are also searched to reduce the number of not-found words
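The lookup strategy on this slide can be sketched as follows. The sketch uses a small in-memory dictionary as a stand-in for WordNet (the tool itself reaches WordNet through JNI), and the sense identifiers, the adjective-to-noun table, and the function name `find_senses` are all assumptions for illustration.

```python
# Stand-in for WordNet: word -> list of sense identifiers (all hypothetical).
LEXICON = {
    "accident": ["accident#1", "accident#2"],
    "traffic": ["traffic#1"],
    "environment": ["environment#1"],
}

# Hypothetical table linking adjectives to their associated nouns.
ADJECTIVE_NOUNS = {"environmental": "environment"}


def find_senses(term, lexicon=LEXICON, adjective_nouns=ADJECTIVE_NOUNS):
    """Return {word: senses}, splitting a compound term when it has no entry."""
    term = term.lower()
    if term in lexicon:
        return {term: lexicon[term]}
    found = {}
    for word in term.split():
        candidates = [word]
        if word in adjective_nouns:
            # Also try the noun associated with an adjective.
            candidates.append(adjective_nouns[word])
        for candidate in candidates:
            if candidate in lexicon:
                found[candidate] = lexicon[candidate]
    return found
```

With this toy data, "environmental accident" has no entry of its own, so it is split and resolved through "environment" and "accident".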
Thesauri Disambiguation • Unsupervised disambiguation method • The senses of every thesaurus term are searched in WordNet • The hierarchical structure of the thesaurus is used as the word context for a voting algorithm to find the closest sense • Thesauri are partitioned into branches (trees formed by BT/NT terms whose root has no BT) • [Figure: example thesaurus branches with the terms accident, administration, environmental accident, major accident, traffic accident, work accident, accident source, technological accident, ..., nuclear accident, shipping accident, explosion, oil slick, leakage, core meltdown]
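The partitioning into branches can be sketched as follows, assuming the thesaurus is available as a mapping from each term to its broader term. The single-BT assumption, the absence of cycles, and the function name `partition_into_branches` are simplifications for the example.

```python
def partition_into_branches(broader):
    """Group terms by the root of their BT chain.

    `broader` maps each term to its broader term, or to None for a root.
    Assumes at most one BT per term and no cycles in the hierarchy.
    """
    def root_of(term):
        # Climb BT links until a term without a broader term is reached.
        while broader.get(term) is not None:
            term = broader[term]
        return term

    branches = {}
    for term in broader:
        branches.setdefault(root_of(term), []).append(term)
    return branches
```

Each resulting list is one branch, that is, the context within which the terms later vote for each other's senses.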
Thesauri Disambiguation II • Voting algorithm to obtain the disambiguated synset of a term "a" • Every synset "s" associated with the rest of the terms in the branch votes (proximity weight) for the synsets of term "a" • Main weight: number of subsumers in the WordNet hierarchy • Matches in the WordNet hierarchy of ancestors • Discounting factors: • Synset depth • Branch distance • Polysemy of the term associated with synset "s"
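The slides give the ingredients of the vote but not the exact formula, so the following sketch uses an assumed weighting: each vote equals the number of shared ancestors, divided by the branch distance and by the polysemy of the voting term. The synset depth discount is left out for brevity. All data and names are hypothetical.

```python
def disambiguate(term, branch_senses, ancestors, distance):
    """Pick the synset of `term` that receives the highest total vote.

    branch_senses: term -> list of candidate synset ids for that term
    ancestors:     synset id -> set of ancestor (subsumer) ids
    distance:      other term -> branch distance from `term` (>= 1)

    Assumed vote: shared ancestors / (branch distance * polysemy of voter).
    """
    scores = {candidate: 0.0 for candidate in branch_senses[term]}
    if not scores:
        return None, scores
    for other, senses in branch_senses.items():
        if other == term:
            continue
        for synset in senses:
            for candidate in scores:
                shared = len(ancestors[candidate] & ancestors[synset])
                scores[candidate] += shared / (distance[other] * len(senses))
    best = max(scores, key=scores.get)
    return best, scores
```

Dividing by the number of senses of the voting term keeps highly polysemic terms from dominating the result, which is the role the slide assigns to the polysemy discount.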
Thesauri Disambiguation III • Annotation of disambiguated synsets
Automatic expansion of keywords with new disambiguated thesauri • Comparison between the initial collection of synsets and the synsets of a new term
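One way to read this comparison is sketched below: the keywords of a metadata record are mapped to their disambiguated synsets, and terms from another disambiguated thesaurus are added when they carry one of those synsets. Treating the comparison as exact synset equality is an assumption, as are the data layout and the function name `expand_keywords`.

```python
def expand_keywords(keywords, source_synsets, other_synsets):
    """Add terms from another thesaurus that share a synset with the keywords.

    source_synsets: term -> disambiguated synset id (thesaurus of the keywords)
    other_synsets:  term -> disambiguated synset id (newly added thesaurus)
    The comparison is plain equality of synset ids (an assumption).
    """
    initial = {source_synsets[k] for k in keywords if k in source_synsets}
    additions = sorted(
        term
        for term, synset in other_synsets.items()
        if synset in initial and term not in keywords
    )
    return list(keywords) + additions
```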
Conclusions & future lines • ThManager is a flexible tool to manage thesauri • It provides enhanced functionality for the improvement of classifications • This tool can be easily integrated into other tools • It is used by a metadata editing tool (also presented here) to select the appropriate term for the distinct metadata fields • Future lines: • Creation of a thesaurus Web Service providing some of the functionality offered by this tool • thesaurus browsing, WordNet polysemy extraction, keywords expansion, ... • Concept-based retrieval • Exploit the semantic disambiguation of thesauri to test different information retrieval strategies for geographic data catalogs • It is possible to index metadata records according to a unified system: the disambiguated WordNet synsets
Advanced Information Systems Laboratory http://iaaa.cps.unizar.es