Thesauri and Ontologies for Digital Libraries Pavel Smrž, Anna Sinopalnikova, Martin Povolny {smrz, anna, xpovolny}@fi.muni.cz Faculty of Informatics, Masaryk University in Brno, Czech Republic
Outline • Motivation • Role of Thesauri and Ontologies in Present DLs, Relations Covered • Word-Association Thesaurus, CLIR • XML Document Management System • XML Family Standards, XSLT Processor Extension • Conclusions and Future Directions
Motivation • size and complexity of DLs grow rapidly • future DLs will need algorithms to process and understand the data they contain • intelligent procedures must be implemented to transform natural-language knowledge into a more appropriate representation • description of concepts and relations between them becomes crucial
Motivation • common understanding of application domains is provided by ontologies • creation of broad-coverage ontologies from scratch is extremely labour-intensive • efforts to reuse (clean-up, refine, merge) existing resources = wordnet-like semantic networks, lexical databases, thesauri, ...
Thesauri and Ontologies in Present Digital Libraries • structuring and classification of digital data (bibliographic classification supplemented or replaced by automatic conceptual document indexing) • contradictory results in the area of information retrieval (IR): standard IR measures (precision/recall) vs. navigation through documents and user-interface aspects
Relations Covered • Synonymy – query expansion (validated by the user) • true synonyms • style, register, regional variants • orthographic variants (proper names) • Hierarchical relations (hyponymy, meronymy) – query expansion, named entity recognition, ... • “see-also”, “related-to” relations – definition of topics
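The query-expansion use of these relations can be illustrated with a minimal sketch. The thesaurus contents, the relation names, and the function `expand_query` are hypothetical and only serve the illustration; in the setting described on the slide, the proposed expansions would still be shown to the user for validation before the query is run.

```python
from typing import Dict, List, Set

# Hypothetical toy thesaurus: relation name -> term -> related terms.
THESAURUS: Dict[str, Dict[str, List[str]]] = {
    "synonym": {"car": ["automobile"], "colour": ["color"]},
    "hyponym": {"car": ["taxi", "convertible"]},
}


def expand_query(terms: List[str], relations: List[str]) -> List[str]:
    """Return the original terms followed by their related terms, without duplicates."""
    expanded: List[str] = []
    seen: Set[str] = set()
    for term in terms:
        # Each term contributes itself plus everything reachable via the chosen relations.
        candidates = [term]
        for relation in relations:
            candidates.extend(THESAURUS.get(relation, {}).get(term, []))
        for candidate in candidates:
            if candidate not in seen:
                seen.add(candidate)
                expanded.append(candidate)
    return expanded
```

Calling `expand_query(["car"], ["synonym", "hyponym"])` would propose `automobile`, `taxi` and `convertible` in addition to the original term.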
Word-Association Thesauri • Large-scale psycholinguistic experiments (free association test) • Large numbers of stimulus-reaction pairs (170 000), many subjects (1 500) of different age, sex, profession, ... • Availability for English, German, Russian, Czech • Concept search rather than context search
Cross-lingual information retrieval and extraction • CLIR = finding documents in a language different from the one used in the query • Multilingual resources (wordnets) for many languages (EuroWordNet, BalkaNet) linked by the Inter-Lingual Index (ILI) • CLIE = translation of answers back to the language of the user query • Visualisation of terms referring to hierarchically organized concepts
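How an interlingual index can support cross-lingual query translation is sketched below. The record identifiers, the sample words, and the function `translate_term` are assumptions made for illustration; they do not reproduce the actual EuroWordNet or BalkaNet data formats.

```python
from typing import Dict, List

# Hypothetical ILI records: record id -> language code -> words for that concept.
ILI: Dict[str, Dict[str, List[str]]] = {
    "ili-0001": {"en": ["dog"], "cs": ["pes"], "de": ["Hund"]},
    "ili-0002": {"en": ["library"], "cs": ["knihovna"], "de": ["Bibliothek"]},
}


def translate_term(term: str, source: str, target: str) -> List[str]:
    """Map a query term to target-language words that share an ILI record with it."""
    results: List[str] = []
    for record in ILI.values():
        # A record links the wordnets: a match in the source language
        # yields the words of the same concept in the target language.
        if term in record.get(source, []):
            results.extend(record.get(target, []))
    return results
```

A Czech query term such as `pes` would then be mapped to `dog` before English documents are searched; terms missing from the index yield an empty list.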
XML Document Management System Integrating Ontologies • Several systems allow storing data and metadata together • BUT no support for efficient integration of thesauri and ontologies • DEB – open-source client/server system for efficient storage and retrieval of arbitrary XML collections • XML-family standards employed in the data format, customization of UI, query language, visualisation, ...
XML-Family Standards in DEB • DEB clients use XSLT for transforming XML data into HTML (presented with the help of an HTML widget) • User-defined data views by means of XSLT • Client-side caching of parsed DOM objects • XPath for accessing information • OWL for storing ontologies transformed automatically from
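The combination of client-side caching of parsed documents and XPath access can be sketched as follows. This is not DEB code: the sample entry, its element names, and the helper functions are hypothetical, and Python's `xml.etree.ElementTree` supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical dictionary entry; the element names are illustrative only.
SAMPLE_ENTRY = """<entry id="e1">
  <headword>dog</headword>
  <sense><synonym>hound</synonym><hypernym>canine</hypernym></sense>
</entry>"""

# Cache of parsed trees keyed by document id, so each document is parsed once.
_parsed_cache: Dict[str, ET.Element] = {}


def get_document(doc_id: str, xml_text: str) -> ET.Element:
    """Parse the XML text on first access and reuse the parsed tree afterwards."""
    if doc_id not in _parsed_cache:
        _parsed_cache[doc_id] = ET.fromstring(xml_text)
    return _parsed_cache[doc_id]


def select_text(doc_id: str, xml_text: str, path: str) -> List[str]:
    """Return the text of all elements matched by a (limited) XPath expression."""
    root = get_document(doc_id, xml_text)
    return [node.text or "" for node in root.findall(path)]
```

For example, `select_text("e1", SAMPLE_ENTRY, "./sense/synonym")` reads the synonym of the entry from the cached tree without re-parsing the XML text.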
Extension of the Standard XSLT Processor • nested queries for efficient processing • XSLT sheets can request data from the DEB server based on the information already processed • A special URI scheme (deb://) creates a virtual space of XML documents that are the results of queries • The XSLT processor accesses server data the same way as any other external resource
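The idea of a virtual document space behind a custom URI scheme can be sketched as a resolver that turns a `deb://` URI into the XML result of a query. The URI layout (`deb://collection?word=...`), the function names, and the stand-in server function are all assumptions; the slides do not specify the actual DEB URI syntax or protocol.

```python
from typing import Callable, Dict
from urllib.parse import parse_qs, urlsplit


def fake_server_query(collection: str, params: Dict[str, str]) -> str:
    """Hypothetical stand-in for a server round trip; returns an XML document as text."""
    word = params.get("word", "")
    return f'<results collection="{collection}"><entry>{word}</entry></results>'


def resolve(uri: str,
            query: Callable[[str, Dict[str, str]], str] = fake_server_query) -> str:
    """Resolve a deb:// URI to the XML document produced by the corresponding query."""
    parts = urlsplit(uri)
    if parts.scheme != "deb":
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    # Assumed layout: the authority part names the collection,
    # the query string carries the search parameters.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return query(parts.netloc, params)
```

A stylesheet processor extended with such a resolver could treat `deb://wordnet?word=dog` like any other external document reference, and a value taken from one result could be used to build the URI of a nested query.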
Conclusions and Future Directions • Our research on the role of thesauri and ontologies in DL influenced the development of the Czech part of the multilingual lexical resource developed under the current BalkaNet project and the last extensions to the RussNet project. • DEB is currently used as the core DL engine at NLP Lab, FI MU, Brno, Czech Republic. It manipulates standard document collections as well as dictionaries, lexical semantic databases, e-learning materials, ...
Future Directions • Open research problems related to the conceptual design of lexical resources (integration of generative concepts into the structure of knowledge bases) • DEB development – specialized modules for new W3C standards, three-level architecture (thin clients), simplification of UI customization by means of automatically generated XSLT, reimplementation in Ruby, ...