Multi-Language Ontology-based Search Engine

Leyla Zhuhadar and Olfa Nasraoui
Knowledge Discovery and Web Mining Lab, Dept. of Computer Engineering and Computer Science, University of Louisville, KY 40292, USA
Leyla.zhuhadar@wku.edu, Olfa.nasraoui@louisville.edu

Robert Wyatt and Elizabeth Romero
The Office of Distance Learning, Western Kentucky University, KY 42101, USA
Robert.wyatt@wku.edu, Elizabeth.romero@wku.edu
The “Big Issues” in Information Retrieval.
• Performance: Efficient search and indexing (Bruce Croft, 2009);
• Incorporating new data: Coverage and freshness (Bruce Croft, 2009);
• Scalability: Growing with data and users (Bruce Croft, 2009);
• Adaptability: Tuning for applications and users (Bruce Croft, 2009);
• Current problems: Information overload, keyword matching, ambiguity, handling evolving domains and users (Nasraoui: PKDD-2006-Invited-Talk).
ACHI 2010: Multi-Language Ontology-based Search Engine
HyperManyMedia @WKU. http://hypermanymedia.wku.edu
Semantic Search using Ontology.
Why do we need a Cross/Multi-language Information Retrieval System?
• Some major reasons for designing an MLIR system:
  • A repository may hold documents written in multiple languages, with individual documents containing more than one language, for example:
    • technical documents written in a language other than English that use expressions (jargon terms) written in English,
    • a document that uses quotes written in languages different from the language of the article itself, and
    • a document that cites foreign articles, where those citations are written in a language different from the language of the article itself.
  • A user may be able to read or use documents written in a specific language without being fluent enough in that language to formulate the right query terms to find them, for example:
    • a user who is searching for images that are tagged and indexed in a language the user does not understand,
    • a researcher who is interested in a specific research topic and would like to know which individuals or institutes worldwide are working on the same topic, and
    • a user who has a system to translate documents into different languages and would like to search for documents in languages he or she is unfamiliar with.
Natural Language Processing & Machine Translation.
Multi-language Information Retrieval System.
Approaches to MLIR.
Some History.
• First MLIR system in 1969 by Gerard Salton (enhanced the SMART system to retrieve multilingual documents in English and German)
• Pigur's system IRRD in 1979, based on a vocabulary thesaurus that used three languages (English, French and German)
• Van der Eijk in 1993 used linguistic knowledge:
  • Subject Thesaurus,
  • Concept List,
  • Term List, and
  • Lexicon.
HyperManyMedia Methods for Cross-language Retrieval.
• Falls under domain-specific retrieval (e-learning).
• A synergistic approach that combines:
  • a Thesaurus-based Approach (Query Translation), and
  • a Corpus-based Approach (Term Vector Translation).
Thesaurus-based Approach.
• A simple bilingual ontology thesaurus that lists terms, phrases, concepts, and subconcepts;
• Uses domain-specific terminology to capture the HyperManyMedia domain in two languages (English and Spanish).
Building the OWL File Using Protégé. http://protege.stanford.edu/
Thesaurus-based Approach. Building the HyperManyMedia Bilingual Ontology:
• We used Protégé (the current ontology consists of ~40,000 lines of code: http://161.6.105.21:8084/ontology/semantic.owl)
Thesaurus-based Approach (Query translation approach) Method.
• Scenario: When a user submits a query in the semantic search interface, the following two parallel processes occur:
  • All documents relevant to the query term are retrieved and ranked based on Eq(1).
  • The query term is automatically mapped to the HyperManyMedia ontology, which is resident in memory. If the query term is part of the HyperManyMedia ontology, the information retrieval system automatically presents two semantic entities:
    • all the subconcepts related to the query term in both languages (English and Spanish), and
    • the synonym of the query term in the alternative language.
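
The semantic mapping step above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the HyperManyMedia implementation: the `Concept` class, the `build_index` and `semantic_mapping` functions, and the toy chemistry entries are all hypothetical, and the real ontology is an OWL file rather than Python objects. The sketch shows the idea of an in-memory lookup that returns the subconcepts in both languages and the synonym in the alternative language.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One ontology concept with a label per language (hypothetical structure)."""
    labels: dict                      # e.g. {"en": "chemistry", "es": "química"}
    subconcepts: list = field(default_factory=list)


def build_index(roots):
    """Map (language, lowercased label) -> Concept for every concept in the tree."""
    index = {}
    stack = list(roots)
    while stack:
        concept = stack.pop()
        for lang, label in concept.labels.items():
            index[(lang, label.lower())] = concept
        stack.extend(concept.subconcepts)
    return index


def semantic_mapping(index, query, lang):
    """Return the two semantic entities for a query term, or None if the
    term is not part of the ontology. Assumes exactly two languages."""
    concept = index.get((lang, query.strip().lower()))
    if concept is None:
        return None
    other = "es" if lang == "en" else "en"
    return {
        "synonym": concept.labels[other],
        "subconcepts": {
            code: [sub.labels[code] for sub in concept.subconcepts]
            for code in ("en", "es")
        },
    }


# Toy data invented for this example; not taken from the real ontology.
ONTOLOGY = [
    Concept(
        {"en": "chemistry", "es": "química"},
        [
            Concept({"en": "organic chemistry", "es": "química orgánica"}),
            Concept({"en": "inorganic chemistry", "es": "química inorgánica"}),
        ],
    )
]
INDEX = build_index(ONTOLOGY)

result = semantic_mapping(INDEX, "Chemistry", "en")
```

Keeping the index in a dictionary keyed by language and label mirrors the "resident in memory" design mentioned above: the lookup is a single hash access per query term, so it can run in parallel with document retrieval without adding noticeable latency.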
Corpus-based Approach (Term Vector Translation) Method.
• Scenario: A user submits a query in one of the languages, English or Spanish, and clicks on cross-language translation. If the query contains any of our indexed translated terms, the search engine does the following:
  • Translates the query to the alternative language, as shown in Algorithm 1.
  • Uses the Vector Space Model to calculate the dot product between the translated query and the documents in the HyperManyMedia repository, after substituting each query term with its translation, to retrieve the relevant documents and rank them based on the score in Eq(1).
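
A minimal Python sketch of this scenario is given below. It rests on several assumptions: Algorithm 1 and Eq(1) are not reproduced in this transcript, so the sketch substitutes a plain term-by-term dictionary lookup for the translation step and a raw term-frequency dot product for the score. The names `EN_TO_ES`, `translate_query`, `dot_product`, and `cross_language_search`, as well as the three toy documents, are hypothetical.

```python
from collections import Counter

# Hypothetical index of translated terms (English -> Spanish).
EN_TO_ES = {"history": "historia", "art": "arte"}

# Toy repository invented for this example: document id -> text.
DOCS = {
    "d1": "historia del arte",
    "d2": "historia de la ciencia",
    "d3": "art history",
}


def translate_query(query, term_map):
    """Substitute each query term that has an indexed translation.
    Returns None when no query term is covered by the index."""
    terms = query.lower().split()
    if not any(term in term_map for term in terms):
        return None
    return [term_map.get(term, term) for term in terms]


def dot_product(query_terms, doc_terms):
    """Dot product of two raw term-frequency vectors (stand-in for Eq(1))."""
    q, d = Counter(query_terms), Counter(doc_terms)
    return sum(weight * d[term] for term, weight in q.items())


def cross_language_search(query, term_map, docs):
    """Translate the query, score every document, and return the ids of
    documents with a non-zero score, best score first."""
    translated = translate_query(query, term_map)
    if translated is None:
        return []
    scored = [
        (dot_product(translated, text.lower().split()), doc_id)
        for doc_id, text in docs.items()
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored if score > 0]


ranking = cross_language_search("art history", EN_TO_ES, DOCS)
```

With the toy data, the English query matches only the Spanish documents, because the search runs on the translated term vector; a production system would normally weight terms (for example with TF-IDF) instead of using raw counts.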
Evaluation of Cross-Language Search Model.
• Research Questions
  • Will there be a difference in Top-n-Recall and Top-n-Precision between the college level, the course level, and the lecture level?
  • Will there be a difference in Top-n-Recall and Top-n-Precision when we cross from Spanish to English vs. from English to Spanish?
Top-n-Recall/Precision for Cross-language Search Engine.
• Top-n Recall: the number of relevant retrieved documents among the top n retrieved documents divided by the total number of relevant documents.
• Top-n Precision: the number of relevant retrieved documents within the top n divided by n.
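
The two definitions above translate directly into code. The following Python sketch assumes a ranked list of document ids and a set of relevant ids; the function names and the example data are hypothetical and are not results from the evaluation.

```python
def top_n_precision(ranked, relevant, n):
    """Relevant documents within the top n, divided by n."""
    if n <= 0:
        raise ValueError("n must be positive")
    hits = sum(1 for doc_id in ranked[:n] if doc_id in relevant)
    return hits / n


def top_n_recall(ranked, relevant, n):
    """Relevant documents within the top n, divided by the total number
    of relevant documents (0.0 when nothing is relevant)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:n] if doc_id in relevant)
    return hits / len(relevant)


# Toy example invented for illustration.
ranked_ids = ["d1", "d2", "d3", "d4", "d5"]
relevant_ids = {"d1", "d3", "d6", "d7"}

precision_at_5 = top_n_precision(ranked_ids, relevant_ids, 5)  # 2 hits / 5
recall_at_5 = top_n_recall(ranked_ids, relevant_ids, 5)        # 2 hits / 4 relevant
```

Note that precision divides by n even when fewer than n documents are retrieved, which follows the definition on the slide literally.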
Evaluation Conclusion.
• The cross-language search engine achieves better precision when crossing from Spanish to English, and better recall when crossing from English to Spanish.
• The following factors influenced the results:
  • English courses have been indexed and boosted in multiple stages during the design of the platform (over the last two years).
  • The Spanish courses were added over a very short period of time; thus we have not been able to add sophisticated tagging to these resources.
  • The ontology relationships between the two languages need to be logically improved using a higher level of interrelationship between entities and concepts.
Future Work.
• In the domain of Natural Language Processing:
  • One area of research that could be beneficial is to build the manual thesauri not only from the controlled vocabulary extracted from the domain ontology as concepts/subconcepts, but also by using computational linguistics; this requires integrating the thesauri with techniques based on corpus statistics.
• In the domain of the Semantic Web:
  • “Linked Data” is the right place to extend this research (Linked Data is a project directed by Christian Bizer, Tom Heath and Tim Berners-Lee).
  • Multilinguality and linked data (generation, querying, visualization & presentation).
[Figure: The growth of Linked Dataset (July 2009)]