Multilingual support to a proposed Semantic Web architecture
Andrea Ferrato
TOP-UIC MS Thesis, 2003/’04
Advisor: Laura Farinetti
Purpose of this work
• Design and (partially) implement multilingual support on a pre-existing Semantic Web platform
• Provide an approach that is as general as possible
• Exploit features of the pre-existing architecture
• Cope with the typically chaotic structure of resources currently available
A. Ferrato, TOP-UIC 2003-'04
Outline
• Semantic Web
• Multilinguality
• The DOSE platform
• Proposed solution
• Current implementation
• Experimental results
• Conclusions
Semantic Web
• The next evolutionary stage of the WWW
• Goal: make network data usable by intelligent agents
• Deployable only on top of the existing infrastructure
• Two pressing tasks
  • Transform existing contents to include semantics
  • Set up ad hoc user agents to work on them
Transform existing contents
• Basic data units: resources
  • Every single information entity that can be semantically isolated
• Features to be given
  • Identification: URI
  • Structure: XML
  • Meaning: RDF
  • Knowledge: ontologies
Set up ad hoc user agents
• Major players in Semantic Web deployment
• Invoked by users, can proceed autonomously
• Key facilities to be supported
  • Logic
  • Proof
  • Trust
Semantic Web: layer cake view (Berners-Lee)
[Figure: layered stack, bottom to top: Unicode and URI; XML + NS + XML Schema; RDF + RDF Schema; Ontology vocabulary; Logic; Proof; Trust. Side labels: Self-describing documents, Data, Rules, Digital signatures.]
Multilinguality
• The extension to multiple languages of tasks already performed in a monolingual context
• Typical issues arising from cross-language mapping
  • Lexical gaps
  • Role of the context
  • Lack of pre-acquired knowledge
Multilinguality and Semantic Web
• A problem of Text Retrieval in multiple languages (NLP)
• Start from popular approaches (Controlled Vocabulary, Free text, etc.)
• Two main requirements
  • Recognize the language ID of resources
  • Map contents independently of language
Language ID retrieval
• Two possible scenarios
  • Retrieve a given ID via resource parsing
  • Recreate the ID via resource analysis
• When retrieving a given language attribute, conform to existing language specification standards
Language ID specification
• “lang” attribute + language inheritance
• Content-language
• CSS-level declarations
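The “lang” attribute combined with language inheritance can be illustrated with a minimal sketch. It assumes Python's standard `html.parser` (the thesis module itself was written in Java), and the names `LangResolver`, `resolve_languages`, and `default_lang` are hypothetical. An element without its own “lang” value takes the effective language of its parent.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID = {"br", "hr", "img", "input", "meta", "link"}


class LangResolver(HTMLParser):
    """Hypothetical helper: records the effective language of each text fragment."""

    def __init__(self, default_lang=None):
        super().__init__()
        self.stack = [("#root", default_lang)]  # (tag, effective language)
        self.fragments = []                     # (text, effective language)

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        declared = dict(attrs).get("lang")
        inherited = self.stack[-1][1]
        # An element without its own "lang" inherits from its parent.
        self.stack.append((tag, declared or inherited))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; unmatched closes are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.fragments.append((text, self.stack[-1][1]))


def resolve_languages(html, default_lang=None):
    """Return (text, language) pairs for every non-empty text node."""
    parser = LangResolver(default_lang)
    parser.feed(html)
    parser.close()
    return parser.fragments
```

For example, `resolve_languages('<body lang="en"><h1 lang="fr">Le Bilboquet</h1><p>An old pastime</p></body>')` tags the heading as "fr" and the paragraph as "en", because the paragraph inherits from `<body>`.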
Language-independent contents mapping
• Investigate the form/meaning relationship
• Ontology design is crucial
• Three main requirements
  • Consistency (based on linguistic evidence)
  • Flexibility (meaningful for all languages)
  • Extensibility (easy addition of new languages)
Ontology models
• Conceptual
  • Founded upon general knowledge
• Language-based
  • Built on a particular language
• Interlingua
  • A combination of the above two
• None is definitely superior for multilinguality
The DOSE platform
• Distributed Open Semantic Elaboration platform
• Key features
  • Modularity
  • Scalability
  • Semantic integration
• Main functionalities offered
  • Annotation
  • Search
DOSE: layered view
[Figure: DOSE modules (Indexer, Search Engine, Substructure Extractor, Semantic Mapper, Fragment Retriever, Ontology, Annotation Repository, Synset) arranged across three layers: Service layer, Front-end layer, Back-end layer.]
DOSE: distributed view
[Figure: DOSE modules (Substructure Extractor, Indexer, Search Engine, Fragment Retriever, Semantic Mapper, Ontology, Annotation Repository, Synset) connected through an XML-RPC infrastructure.]
DOSE: annotation
[Figure: annotation flow in 11 numbered steps involving the Web, the Substructure Extractor, the Indexer, the Fragment Retriever, the Semantic Mapper, and the Annotation Repository.]
DOSE: search
[Figure: search flow in 8 numbered steps involving the Web, the Semantic Mapper, the Search Engine, the Fragment Retriever, and the Annotation Repository.]
DOSE and multilinguality
• Traditionally: a new ontology for each different language
• DOSE: the ontology language is totally independent of the synset language
  • Use synsets to store lexical representations only
  • Let the ontology focus on knowledge modeling
Practical requirements for multilinguality
• Indexing
  • Recognize the language of resources and set up the system accordingly
  • Store language IDs with annotations
• Search
  • Interpret user queries expressed in natural languages
  • Allow for cross-language search tasks
Extension to language
• Proposed approach: one ontology, many synsets
  • A concept is expressed by a different synset for each supported language
  • Each synset contains multiple lexical representations of a related concept in a single language
• Separate semantic and textual layers
Extension to language (cont’d)
[Figure: one concept, three synsets. The concept “job” is linked to an English synset (salary, job, employment, …), a French synset (travail, chomeur, …), and an Italian synset (lavoro, stipendio, datore di lavoro, …).]
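The "one ontology, many synsets" idea can be sketched as a small lookup structure. This is an illustration only: the dictionary layout, the concept key "job", and the function names `concepts_for` and `add_language` are assumptions, not the RDF representation used in DOSE. The terms are the ones shown on the slide.

```python
# Hypothetical in-memory model: one concept, one synset per supported language.
SYNSETS = {
    "job": {
        "en": {"salary", "job", "employment"},
        "fr": {"travail", "chomeur"},
        "it": {"lavoro", "stipendio", "datore di lavoro"},
    },
}


def concepts_for(term, lang, synsets=SYNSETS):
    """Return the concepts whose synset in `lang` contains `term`."""
    term = term.lower()
    return {
        concept
        for concept, by_lang in synsets.items()
        if term in by_lang.get(lang, set())
    }


def add_language(concept, lang, terms, synsets=SYNSETS):
    """Supporting a new language only adds a synset; the concept is untouched."""
    synsets.setdefault(concept, {})[lang] = set(terms)
```

With this layout, `concepts_for("lavoro", "it")` and `concepts_for("travail", "fr")` both resolve to the same concept, which is what makes cross-language search possible without duplicating the ontology.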
Advantages
• Reduced implementation requirements
  • Ontology design
  • Resource occupation
• Simplicity (in ontology management)
• Flexibility
  • A new language just brings a new bag of synsets
  • Expansion of indexing word set
Language recognition
• Proposed approach
  • Retrieve language IDs whenever present
  • Otherwise, recognize language(s)
• Design constraints
  • To be activated in the annotation phase
  • Refined at the document substructure level
  • Has to deal with the typically low authoring quality of Web documents
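Language recognition (cont’d)
[Figure: language recognition strategies and sample inputs.]
• Strategies shown: validate explicit request; retrieve “lang” value; guess via heuristics; retrieve from ancestor; accept default
• Sample inputs: an unmarked English text (“There was an Old Man of Coblenz, The length of whose legs was immense…”); a French fragment (<H1 lang=“fr”>Le Bilboquet</H1> <P>C’était un vieux passe-temps…), where <P> is French; a <P lang=“ru”> element; a request for a Hindi synset; default = “it”
• Language labels in the figure: English, French, Russian, Hindi, Italian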
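The strategies named on this slide can be read as a fallback chain. The sketch below is a hedged illustration: the order in which the strategies are tried, the tiny stop-word lists, and the names `guess_language` and `recognize_language` are all assumptions, since the figure layout did not survive and the actual heuristics of the Language Detector module are not described here.

```python
# Tiny stop-word lists used only for this illustration (an assumption,
# not the heuristics of the actual Language Detector module).
STOPWORDS = {
    "en": {"the", "of", "was", "an", "and", "is"},
    "it": {"il", "la", "di", "che", "un", "era"},
    "fr": {"le", "la", "un", "de", "et", "était"},
}


def guess_language(text):
    """Return the language whose stop words occur most often, or None."""
    words = text.lower().replace(",", " ").replace(".", " ").split()
    scores = {
        lang: sum(word in stop for word in words)
        for lang, stop in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def recognize_language(text, explicit=None, lang_attr=None,
                       ancestor_lang=None, default="it", supported=STOPWORDS):
    """Try each strategy in turn and report which one produced the answer."""
    if explicit in supported:          # validate explicit request
        return explicit, "explicit"
    if lang_attr in supported:         # retrieve "lang" value
        return lang_attr, "lang attribute"
    if ancestor_lang in supported:     # retrieve from ancestor
        return ancestor_lang, "ancestor"
    guessed = guess_language(text)     # guess via heuristics
    if guessed is not None:
        return guessed, "heuristics"
    return default, "default"          # accept default
```

In this sketch a declared language that has no synset (for example "ru" or "hi" here) is skipped rather than trusted, so the chain moves on to the next strategy and finally to the default.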
Current implementation
• A new English synset to pair with a disability ontology (~500 concepts)
• A set of 20 bilingual documents (Italian, English) on disability
• A basic Language Detector XML-RPC module implemented in Java
• Testing scenarios
  • Parallel annotation
  • Language recognition
Implementation work
• Language Detector module (Java, ~1000 lines of code)
• Additions to pre-existing modules (Java, ~1000 lines of code)
• English synset (RDF, ~3500 lines of code)
• ~24 MB of annotations produced
• Simulation results analysis (a 600x40 .XLS for <BODY>, a 925x250 .XLS for <Hx>)
Multilingual DOSE in action
Parallel annotation
• Two parallel documents have
  • The same structure elements with the same contents
  • Two different languages of expression
• Goal: demonstrate that two sets of parallel documents are (almost) symmetrically mapped to the same concepts (“parallel annotation”)
• Both sets indexed separately, with language explicitly specified
Parallel annotation (cont’d)
• Test methodology: “Vector Space Model”
  • Document fragments described as vectors
  • Dimensions are ontology concepts
  • Components are weighted (tf/idf) occurrences of such concepts
• The correlation between two fragments is quantified as the cosine of the angle between their vectors
Parallel annotation (cont’d)
[Figure: the Italian and English fragment vectors plotted on concept axes X and Y, with their correlation shown as the angle between them.]
• IT/html/body/p[3]: X: Part-time job (2.5), Y: Retirement (0)
• EN/html/body/p[3]: X: Part-time job (1.5), Y: Retirement (1.5)
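The correlation for the two fragments on this slide can be worked out with a short sketch. The function name `cosine` and the dictionary representation of the vectors are assumptions; the weights are the ones given on the slide and are taken as already tf/idf-weighted.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse concept vectors (dicts)."""
    dot = sum(weight * v.get(concept, 0.0) for concept, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0 or norm_v == 0:
        # Convention chosen here: an empty fragment correlates with nothing.
        return 0.0
    return dot / (norm_u * norm_v)


# Concept weights for the paragraph p[3] in the Italian and English documents.
it_p3 = {"Part-time job": 2.5, "Retirement": 0.0}
en_p3 = {"Part-time job": 1.5, "Retirement": 1.5}

print(round(cosine(it_p3, en_p3), 3))  # 0.707
```

The dot product is 2.5 × 1.5 = 3.75 and the norms are 2.5 and √4.5 ≈ 2.121, so the cosine is about 0.707, an angle of 45° between the two fragments.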
Parallel annotation results at <BODY> level
Correlation results at <BODY> level
Correlation results at <BODY> level (alt)
Parallel annotation results at <Hx> level
Parallel annotation: notes
• Parallel and nonparallel pairs can be grouped as two different distributions
  • i.e. Gaussian distributions
• Average values of the two distributions are clearly separated, both for <BODY> and <Hx> levels
• This shows that the indexing system is able to annotate relevant document fragments independently of language
Language recognition
• Separate testing on the same document set
• Italian and English documents are alternated in batch processing
  • Avoids reuse of default settings for contiguous documents of the same language
• Two ways to retrieve ancestor language
  • Via Annotation Repository (acceptable)
  • Via a “Language Stack” (still inefficient)
Annotation Repository vs. Language Stack
All cyan, underlined words are to be annotated (they are included in the synsets)
<BODY lang="en">
<H1 lang="it"> Passatempi</H1>
<H2 lang="en"> Board Games</H2>
<P>Gomuku</P>
<P>Dama</P>
…
• Language Stack: Dama is ignored (language “en” inherited from <H2>)
• Annotation Repository: Dama is annotated (language “it” inherited from <H1>, annotated)
Language recognition results (via Annotation Repository)
Conclusions
• Typical issues discussed
• Overall validity of the approach shown
• Further work and improvements
  • Synset composition
  • Annotation testing with more languages
  • Optimize the proposed language recognition techniques and add new ones
Thank you…
• Questions?
Language recognition (2)