Multi-Language Ontology-based Search Engine

Leyla Zhuhadar and Olfa Nasraoui, Knowledge Discovery and Web Mining Lab, Dept. of Computer Engineering and Computer Science, University of Louisville, KY 40292, USA (Leyla.zhuhadar@wku.edu, Olfa.nasraoui@louisville.edu).


Presentation Transcript


  1. Multi-Language Ontology-based Search Engine.
     Leyla Zhuhadar and Olfa Nasraoui, Knowledge Discovery and Web Mining Lab, Dept. of Computer Engineering and Computer Science, University of Louisville, KY 40292, USA (Leyla.zhuhadar@wku.edu, Olfa.nasraoui@louisville.edu)
     Robert Wyatt and Elizabeth Romero, The Office of Distance Learning, Western Kentucky University, KY 42101, USA (Robert.wyatt@wku.edu, Elizabeth.romero@wku.edu)

  2. The “Big Issues” in Information Retrieval.
     • Performance: efficient search and indexing (Bruce Croft, 2009);
     • Incorporating new data: coverage and freshness (Bruce Croft, 2009);
     • Scalability: growing with data and users (Bruce Croft, 2009);
     • Adaptability: tuning for applications and users (Bruce Croft, 2009);
     • Current problems: information overload, keyword matching, ambiguity, handling evolving domains and users (Nasraoui: PKDD-2006-Invited-Talk).
     ACHI 2010: Multi-Language Ontology-based Search Engine

  3. HyperManyMedia @WKU. http://hypermanymedia.wku.edu

  4. Semantic Search using Ontology.

  5. Why do we need a Cross/Multi-language Information Retrieval (MLIR) System?
     Some major reasons for designing an MLIR system:
     • A repository may hold documents written in multiple languages, with individual documents containing more than one language, for example:
       • technical documents written in a language other than English that use jargon terms written in English,
       • a document that quotes text in languages other than the language of the article itself, and
       • a document that cites foreign articles, with citations written in a language other than that of the article itself.
     • A user may be able to read or use documents written in a specific language without being fluent enough in that language to formulate the right query terms to find them, for example:
       • a user searching for images that are tagged and indexed in a language the user does not understand,
       • a researcher interested in a specific topic who would like to know which individuals or institutes worldwide are working on the same topic, and
       • a user who has a system for translating documents into different languages and would like to search for documents in languages he or she is unfamiliar with.

  6. Natural Language Processing & Machine Translation.

  7. Multi-language Information Retrieval System.

  8. Approaches to MLIR.

  9. Some History.
     • First MLIR system in 1969 by Gerard Salton: the SMART system, enhanced to retrieve multilingual documents (English and German).
     • Pigur's system IRRD in 1979, based on a vocabulary thesaurus that used three languages (English, French, and German).
     • Van der Eijk in 1993 used linguistic knowledge:
       • Subject Thesaurus,
       • Concept List,
       • Term List, and
       • Lexicon.

  10. HyperManyMedia Methods for Cross-language.
      • Falls into Domain-Specific Retrieval (E-learning).
      • A synergistic approach:
        • Thesaurus-based Approach (Query Translation), and
        • Corpus-based Approach (Term Vector Translation).

  11. HyperManyMedia @WKU. http://hypermanymedia.wku.edu

  12. Thesaurus-based Approach.
      • A simple bilingual ontology thesaurus listing terms, phrases, concepts, and subconcepts;
      • Uses domain-specific terminology to capture the HyperManyMedia domain in two languages (English and Spanish).

  13. Thesaurus-based Approach.

  14. Building the OWL File Using Protégé. http://protege.stanford.edu/

  15. Thesaurus-based Approach. Building the HyperManyMedia Bilingual Ontology:
      • We used Protégé (the current ontology consists of ~40,000 lines of code: http://161.6.105.21:8084/ontology/semantic.owl)

  16. Thesaurus-based Approach (Query Translation Approach): Method.
      Scenario: when a user submits a query in the semantic search interface, the following two parallel processes occur:
      • All documents relevant to the query term are retrieved and then ranked based on Eq(1).
      • The query term is automatically mapped against the HyperManyMedia ontology, which is resident in memory. If the query term is part of the ontology, the information retrieval system automatically presents two semantic entities:
        • all the subconcepts related to the query term in both languages (English and Spanish), and
        • the synonym of the query term in the alternative language.
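The semantic mapping step on slide 16 can be sketched as a lookup in an in-memory bilingual thesaurus. This is a minimal illustration, not the authors' implementation: the `Concept` class, the sample mathematics entries, and the `semantic_mapping` function are all hypothetical, and the real ontology is an OWL file rather than Python objects.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One entry of a toy bilingual thesaurus (hypothetical structure)."""
    labels: dict                      # language code -> label
    subconcepts: list = field(default_factory=list)


# Hypothetical sample entries, invented for illustration only.
algebra = Concept({"en": "algebra", "es": "álgebra"})
calculus = Concept({"en": "calculus", "es": "cálculo"})
mathematics = Concept({"en": "mathematics", "es": "matemáticas"},
                      [algebra, calculus])
ONTOLOGY = [mathematics, algebra, calculus]


def build_index(concepts):
    """Map (language, lowercased label) to its concept for fast lookup."""
    index = {}
    for concept in concepts:
        for lang, label in concept.labels.items():
            index[(lang, label.lower())] = concept
    return index


INDEX = build_index(ONTOLOGY)


def semantic_mapping(query, lang):
    """Return the synonym in the other language and the subconcepts in
    both languages, or None if the query term is not in the thesaurus."""
    concept = INDEX.get((lang, query.strip().lower()))
    if concept is None:
        return None
    other = "es" if lang == "en" else "en"
    return {
        "synonym": concept.labels[other],
        "subconcepts": [dict(sub.labels) for sub in concept.subconcepts],
    }


result = semantic_mapping("Mathematics", "en")
print(result)
```

A query term that is absent from the thesaurus returns `None`, which matches the slide's condition that the semantic entities are shown only when the term is part of the ontology.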

  17. Corpus-based Approach (Term Vector Translation): Method.
      Scenario: a user submits a query in one of the languages, English or Spanish, and clicks on cross-language translation. If the query contains any of our indexed translated terms, the search engine does the following:
      • Translates the query into the alternative language, as shown in Algorithm 1.
      • Uses the Vector Space Model to calculate the dot product between the translated query and the documents in the HyperManyMedia repository, after substituting each query term, to retrieve relevant documents and rank them based on the score in Eq(1).
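The two steps on slide 17 can be sketched as follows. Neither Algorithm 1 nor Eq(1) is reproduced in this transcript, so this sketch assumes a simple term-by-term substitution and uses a raw term-frequency dot product as a stand-in score. The translation table, the sample documents, and all function names are hypothetical.

```python
from collections import Counter

# Hypothetical indexed translations (English -> Spanish), for illustration.
TRANSLATIONS = {"history": "historia", "art": "arte"}

# Hypothetical repository: document id -> text.
DOCUMENTS = {
    "doc1": "historia del arte moderno",
    "doc2": "historia de la ciencia",
    "doc3": "introduction to art history",
}


def translate_query(query, table):
    """Substitute each query term that has an indexed translation;
    terms without a translation are kept unchanged."""
    return [table.get(term, term) for term in query.lower().split()]


def dot(query_vec, doc_vec):
    """Dot product of two sparse term-frequency vectors."""
    return sum(weight * doc_vec[term] for term, weight in query_vec.items())


def cross_language_search(query, table, documents):
    """Rank documents by dot product with the translated query.
    Stand-in scoring: raw term frequencies, not the slides' Eq(1)."""
    query_vec = Counter(translate_query(query, table))
    scores = {
        doc_id: dot(query_vec, Counter(text.lower().split()))
        for doc_id, text in documents.items()
    }
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(doc_id, score) for doc_id, score in ranked if score > 0]


results = cross_language_search("art history", TRANSLATIONS, DOCUMENTS)
print(results)  # [('doc1', 2), ('doc2', 1)]
```

The English document `doc3` scores zero here because the query is matched only after translation into Spanish. A production system would normally weight terms (for example with tf-idf) rather than use raw counts.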

  18. Evaluation of Cross-Language Search Model.
      Research Questions:
      • Will there be a difference in Top-n-Recall and Top-n-Precision between the college level, course level, and lecture level?
      • Will there be a difference in Top-n-Recall and Top-n-Precision when we cross from Spanish to English vs. from English to Spanish?

  19. Top-n-Recall/Precision for Cross-language Search Engine.
      • Top-n Recall: the number of relevant documents among the top n retrieved documents, divided by the total number of relevant documents.
      • Top-n Precision: the number of relevant documents within the top n retrieved documents, divided by n.
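The two definitions on slide 19 translate directly into code. The sketch below assumes a ranked list of document ids and a set of ids judged relevant. The function names and sample data are hypothetical.

```python
def top_n_hits(ranked, relevant, n):
    """Count relevant documents among the top n retrieved documents."""
    return sum(1 for doc_id in ranked[:n] if doc_id in relevant)


def top_n_precision(ranked, relevant, n):
    """Relevant documents within the top n, divided by n."""
    return top_n_hits(ranked, relevant, n) / n


def top_n_recall(ranked, relevant, n):
    """Relevant documents within the top n, divided by the total
    number of relevant documents (0.0 if there are none)."""
    if not relevant:
        return 0.0
    return top_n_hits(ranked, relevant, n) / len(relevant)


# Hypothetical ranking and relevance judgments.
ranked = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d6", "d7"}

print(top_n_precision(ranked, relevant, 5))  # 0.4
print(top_n_recall(ranked, relevant, 5))     # 0.5
```

In this example two of the top five documents are relevant, so precision is 2/5, and two of the four relevant documents are found, so recall is 2/4.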

  20. Top-n-Recall/Precision for Cross-language Search Engine.

  21. Top-n-Recall/Precision for Cross-language Search Engine.

  22. Evaluation Conclusion.
      • The cross-language search engine achieves higher precision when crossing from Spanish to English, and higher recall when crossing from English to Spanish.
      • The following factors influenced the results:
        • English courses have been indexed and boosted in multiple stages during the design of the platform (over the last two years).
        • The Spanish courses were added over a very short period of time; thus we have not been able to add sophisticated tagging to these resources.
        • The ontology relationships between the two languages need to be logically improved using a higher level of interrelationship between entities and concepts.

  23. Future Work.
      • In the domain of Natural Language Processing:
        • One potentially beneficial direction is to build the manual thesauri not only from the controlled vocabulary extracted from the domain ontology as concepts/subconcepts, but also by using computational linguistics; this requires integrating the thesauri with techniques based on corpus statistics.
      • In the domain of the Semantic Web:
        • “Linked Data” is the right place to extend this research (Linked Data is a project directed by Christian Bizer, Tom Heath, and Tim Berners-Lee).
        • Multilinguality and linked data (generation, querying, visualization, and presentation).
      [Figure: The growth of Linked Dataset (July 2009)]
