Multilingual support to a proposed Semantic Web architecture

Andrea Ferrato
TOP-UIC MS Thesis, 2003/’04
Advisor: Laura Farinetti

Purpose of this work
• Design and (partially) implement multilingual support on a pre-existing Semantic Web platform
• Provide an approach that is as generic as possible
• Exploit features of the pre-existing architecture
• Cope with the typically chaotic structure of currently available resources
A. Ferrato, TOP-UIC 2003-'04

Outline
• Semantic Web
• Multilinguality
• The DOSE platform
• Proposed solution
• Given implementation
• Experimental results
• Conclusions

Semantic Web
• The next evolutionary stage of the WWW
• Goal: make network data usable by intelligent agents
• Deployable only on top of existing infrastructure
• Two pressing tasks
  • Transform existing contents to include semantics
  • Set up ad hoc user agents to work on them

Transform existing contents
• Basic data units: resources
  • Every single information entity that can be semantically isolated
• Features to be given
  • Identification: URI
  • Structure: XML
  • Meaning: RDF
  • Knowledge: ontologies

Set up ad hoc user agents
• Major players in Semantic Web deployment
• Invoked by users, can proceed autonomously
• Key facilities to be supported
  • Logic
  • Proof
  • Trust

Semantic Web: layer cake view (Berners-Lee)
[Figure: layered stack. Layers: Unicode, URI; XML + NS + XMLschema; RDF + RDFschema; Ontology vocabulary; Logic; Proof; Trust. Other labels: Self desc. doc., Data, Rules, Digital signatures.]

Multilinguality
• The extension to multiple languages of tasks already performed in a monolingual context
• Typical issues from cross-language mapping
  • Lexical gaps
  • Role of the context
  • Lack of pre-acquired knowledge

Multilinguality and Semantic Web
• A problem of Text Retrieval in multiple languages (NLP)
• Start from popular approaches (Controlled Vocabulary, Free text, etc.)
• Two main requirements
  • Recognize language ID of resources
  • Map contents independently from language

Language ID retrieval
• Two possible scenarios
  • Retrieve a given ID via resource parsing
  • Recreate the ID via resource analysis
• When retrieving a declared language attribute, conform to existing language specification standards

Language ID specification
[Diagram labels: “lang” attribute + language inheritance; Content-language; CSS-level declarations]
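The “lang” attribute and its inheritance can be illustrated with a small sketch. This is not the DOSE code: the class and function names are hypothetical, the markup is assumed to be properly nested, and `default_lang` stands in for a document-level declaration such as a Content-Language header.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, enough for a sketch).
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class LangCollector(HTMLParser):
    """Records the effective language of every text fragment (hypothetical helper)."""

    def __init__(self, default_lang=None):
        super().__init__()
        self.stack = [default_lang]  # effective language of each open element
        self.fragments = []          # (language, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        declared = dict(attrs).get("lang")
        # An element without its own "lang" inherits the language of its parent.
        self.stack.append(declared or self.stack[-1])

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.fragments.append((self.stack[-1], text))


def effective_languages(html, default_lang=None):
    """Return (language, text) pairs; assumes every opened element is closed."""
    parser = LangCollector(default_lang)
    parser.feed(html)
    parser.close()
    return parser.fragments
```

Real Web pages often omit closing tags, so a production version would need a more tolerant parser than this one.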

Language-independent contents mapping
• Investigate the form/meaning relationship
• Ontology design is crucial
• Three main requirements
  • Consistency (based on linguistic evidence)
  • Flexibility (meaningful for all languages)
  • Extensibility (easy addition of new languages)

Ontology models
• Conceptual
  • Founded upon general knowledge
• Language-based
  • Built on a particular language
• Interlingua
  • A combination of the above two
• None is definitely superior for multilinguality

The DOSE platform
• Distributed Open Semantic Elaboration platform
• Key features
  • Modularity
  • Scalability
  • Semantic integration
• Main functionalities offered
  • Annotation
  • Search

DOSE: layered view
[Figure: layers: Service layer, Front-end layer, Back-end layer. Components: Indexer, Search Engine, Substructure Extractor, Semantic Mapper, Fragment Retriever, Ontology, Annotation Repository, Synset.]

DOSE: distributed view
[Figure: components: Substructure Extractor, Indexer, Search Engine, Fragment Retriever, Semantic Mapper, Ontology, Annotation Repository, Synset; XML-RPC infrastructure.]

DOSE: annotation
[Figure: annotation flow in 11 numbered steps among The Web, Substructure Extractor, Indexer, Fragment Retriever, Semantic Mapper and Annotation Repository.]

DOSE: search
[Figure: search flow in 8 numbered steps among The Web, Semantic Mapper, Search Engine, Fragment Retriever and Annotation Repository.]

DOSE and multilinguality
• Traditionally: a new ontology for each different language
• DOSE: the ontology language is totally independent of the synset language
  • Use synsets to store lexical representations only
  • Let the ontology focus on knowledge modeling

Practical requirements for multilinguality
• Indexing
  • Recognize the language of resources and configure the system accordingly
  • Store language IDs with annotations
• Search
  • Interpret user queries expressed in natural languages
  • Allow for cross-language search tasks

Extension to language
• Proposed approach: one ontology, many synsets
• A concept is expressed by a different synset for each supported language
• Each synset contains multiple lexical representations of a related concept in a single language
• Separate semantic and textual layers

Extension to language (cont’d)
(one concept, three synsets)
[Figure: word groups shown on the slide: salary, job, employment, … / travail, chomeur, … / job, lavoro, stipendio, datore di lavoro, …]
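The "one ontology, many synsets" idea can be sketched as a mapping from language to concept to words. Everything here is illustrative: the concept identifier `job`, the grouping of the slide's example words by language, and the function names are assumptions, not the DOSE data model.

```python
# Hypothetical synset store: language -> concept id -> lexical representations.
# The concept id "job" and the per-language grouping are assumptions.
SYNSETS = {
    "en": {"job": {"job", "employment", "salary"}},
    "fr": {"job": {"travail", "chomeur"}},
    "it": {"job": {"lavoro", "stipendio", "datore di lavoro"}},
}


def concepts_for(word, lang, synsets=SYNSETS):
    """Return the ontology concepts that `word` expresses in language `lang`."""
    word = word.lower()
    return {
        concept
        for concept, lexicalizations in synsets.get(lang, {}).items()
        if word in lexicalizations
    }


def add_language(lang, concept_words, synsets=SYNSETS):
    """Support a new language by adding synsets; the ontology is untouched."""
    synsets[lang] = {concept: set(words) for concept, words in concept_words.items()}
```

Words in different languages resolve to the same concept identifier, which is what makes cross-language search possible without duplicating the ontology.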

Advantages
• Reduced implementation requirements
  • Ontology design
  • Resource occupation
• Simplicity (in ontology management)
• Flexibility
  • A new language just brings a new bag of synsets
  • Expansion of indexing word set

Language recognition
• Proposed approach
  • Retrieve language IDs whenever present
  • Otherwise, recognize language(s)
• Design constraints
  • To be activated in the annotation phase
  • Refined at the document substructure level
  • Has to deal with the generally low authoring quality of Web documents

Language recognition (cont’d)
• Validate explicit request
• Retrieve “lang” value
• Guess via heuristics
• Retrieve from ancestor
• Accept default
[Figure: sample fragments: “There was an Old Man of Coblenz, The length of whose legs was immense…”; <H1 lang=“fr”>Le Bilboquet</H1> <P>C’était un vieux passe-temps…. Other labels: Hindi; default = “it”; <P lang=“ru”> ?; Russian; Italian; “<P> is French”; English; Hindi synset.]
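The strategies listed on this slide can be combined into a fallback chain. The sketch below is only one possible arrangement: the priority order, the tiny stopword lists, the supported language set and all function names are assumptions made for illustration, and stopword counting is a stand-in for whatever heuristics the Language Detector actually uses.

```python
# Hypothetical stopword lists; a real detector would use far richer evidence.
STOPWORDS = {
    "en": {"the", "of", "was", "and", "an", "whose"},
    "it": {"il", "di", "che", "e", "la", "un"},
    "fr": {"le", "un", "était", "et", "de", "les"},
}


def guess_language(text, supported):
    """Guess a language by counting stopword hits; None if nothing matches."""
    words = text.lower().replace(",", " ").replace(".", " ").split()
    scores = {
        lang: sum(word in STOPWORDS.get(lang, ()) for word in words)
        for lang in supported
    }
    best = max(scores, key=scores.get, default=None)
    return best if best is not None and scores[best] > 0 else None


def resolve_language(text, own_lang=None, ancestor_lang=None, requested=None,
                     default="it", supported=("en", "it", "fr")):
    """Assumed order: explicit request, own "lang", ancestor, heuristics, default."""
    for candidate in (requested, own_lang, ancestor_lang):
        # A declared language is accepted only if a synset exists for it.
        if candidate in supported:
            return candidate
    return guess_language(text, supported) or default
```

For instance, a fragment declared as “ru” with no Russian synset available falls through to its ancestor's language, then to the heuristic guess, and finally to the default.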

Current implementation
• A new English synset paired with a disability ontology (~500 concepts)
• A set of 20 bilingual documents (Italian, English) on disability
• A basic Language Detector XML-RPC module implemented in Java
• Testing scenarios
  • Parallel annotation
  • Language recognition

Implementation work
• Language Detector module (Java, ~1000 lines of code)
• Additions to pre-existing modules (Java, ~1000 lines of code)
• English synset (RDF, ~3500 lines of code)
• ~24 Mb of annotations produced
• Simulation results analysis (a 600x40 .XLS for <BODY>, a 925x250 .XLS for <Hx>)

Multilingual DOSE in action

Parallel annotation
• Two parallel documents have
  • The same structure elements with the same contents
  • Two different languages of expression
• Goal: demonstrate that two sets of parallel documents are (almost) symmetrically mapped to the same concepts (“parallel annotation”)
• Both sets indexed separately, with language explicitly specified

Parallel annotation (cont’d)
• Test methodology: “Vector Space Model”
  • Document fragments described as vectors
  • Dimensions are ontology concepts
  • Components are weighted (tf/idf) occurrences of such concepts
• The correlation between two fragments is quantified as the cosine of the angle between their vectors

Parallel annotation (cont’d)
[Figure: Italian and English fragment vectors drawn on concept axes X and Y, with their correlation.]
• IT/html/body/p[3]: X: Part-time job (2.5), Y: Retirement (0)
• EN/html/body/p[3]: X: Part-time job (1.5), Y: Retirement (1.5)
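The correlation for the two fragments on this slide can be worked out with a few lines of code. This is a minimal sketch: the function name and the dictionary representation are assumptions, and the weights are taken from the slide as given rather than recomputed with tf/idf.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse concept vectors (dicts)."""
    concepts = set(u) | set(v)
    dot = sum(u.get(c, 0.0) * v.get(c, 0.0) for c in concepts)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a fragment with no annotated concepts correlates with nothing
    return dot / (norm_u * norm_v)


# Weights of the two parallel fragments, as shown on the slide.
it_fragment = {"Part-time job": 2.5, "Retirement": 0.0}
en_fragment = {"Part-time job": 1.5, "Retirement": 1.5}

correlation = cosine(it_fragment, en_fragment)  # about 0.707, i.e. a 45 degree angle
```

A value of 1 would mean the two fragments are annotated with the same concepts in the same proportions; 0 would mean they share no concept at all.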

Parallel annotation results at <BODY> level

Correlation results at <BODY> level

Correlation results at <BODY> level (alt)

Parallel annotation results at <Hx> level

Parallel annotation: notes
• Parallel and nonparallel pairs can be grouped as two different distributions
  • i.e. Gaussian distributions
• Average values of the two distributions are clearly separated, both for <BODY> and <Hx> levels
• This shows that the indexing system is able to annotate relevant document fragments independently of language

Language recognition
• Separate testing on the same document set
• Italian and English documents are alternated in batch processing
  • Avoid reuse of default settings for contiguous documents of the same language
• Two ways to retrieve ancestor language
  • Via Annotation Repository (acceptable)
  • Via a “Language Stack” (still inefficient)

Annotation Repository vs. Language Stack
(On the slide, the words to annotate, i.e. those included in the synsets, are cyan and underlined.)
<BODY lang="en">
  <H1 lang="it">Passatempi</H1>
  <H2 lang="en">Board Games</H2>
  <P>Gomuku</P>
  <P>Dama</P>
  …
• Language Stack: Dama is ignored (language “en” inherited from <H2>)
• Annotation Repository: Dama is annotated (language “it” inherited from <H1>, annotated)

Language recognition results (via Annotation Repository)

Conclusions
• Typical issues discussed
• Overall validity of the approach shown
• Further work and improvements
  • Synset composition
  • Annotation testing with more languages
  • Optimize proposed language recognition techniques, add new ones

Thank you…
• Questions?

Language recognition (2)
