
Cvetana Krstev, Ranka Stanković, Duško Vitas, Ivan Obradović

The Usage of Various Lexical Resources and Tools to Improve the Performance of Web Search Engines. Cvetana Krstev, Ranka Stanković, Duško Vitas, Ivan Obradović. Human Language Technology Group, University of Belgrade, Serbia.



Presentation Transcript

  1. The Usage of Various Lexical Resources and Tools to Improve the Performance of Web Search Engines Cvetana Krstev, Ranka Stanković, Duško Vitas, Ivan Obradović Human Language Technology Group, University of Belgrade, Serbia

  2. Contents
     • Typical problems when retrieving documents using a web search engine
     • The lexical resources used
     • The system options
     • Technical implementation
     • Results and evaluation
     HLT Group, University of Belgrade

  3. Typical problems when retrieving documents using a web search engine
     • Highly inflective language:
       Donošenjem odluke… ‘By making a decision…’
       Odluka o priređivanju igara… ‘A decision to organize games…’
       Ministarstvo donosi odluku… ‘The Ministry shall make a decision…’
     • Sastojci za 10 porcija: 3 glavice crnog luka, 1 šoljica ulja, 1/2 čaša belog vina, 1 čaša soka od paradajza (The ingredients for 10 portions: 3 onions, 1 cup of oil, ½ glass of white wine, 1 glass of tomato juice.)
     • Bilingual search, in order to find documents on the chosen subject in two languages, e.g. English and Serbian.
     • Lexical realization of a concept:
       synonyms: beli luk ‘garlic’ → češnjak
       hyponyms: muzički instrument ‘musical instrument’ → klavir ‘piano’, gitara ‘guitar’, etc.
       derivations: Beograd → Beograđanin, Beograđanka, etc.
       and other relations

  4. The lexical resources used
     WS4QE: Work Station for Query Expansion
     • WordNets: the Serbian WN, conceived within the BalkaNet project, with 14,593 synsets, and the Princeton WN are used for query expansion with related words and for bilingual searches.
     • Prolex: a multilingual database of proper names organized around a conceptual proper name that represents the same concept in different languages. http://www.cnrtl.fr/lexiques/prolex/
     • Inflectional finite state transducers (FST): FSTs for inflection of both simple and compound words, developed for the Unitex system. http://www-igm.univ-mlv.fr/~unitex
     • Morphological dictionaries: the Serbian morphological dictionary is in LADL format: 117,000 lemmas with 1,400,000 different lexical words.

  5. The lexical resources used
     • For the query beli luk, two FSTs for the components and one for the compound are used, producing only 12 instead of 216 possible combinations:
       beli luk AND belim lukom AND beli lukovi AND belih lukova AND belima lukovima AND belim lukovima AND bele lukove AND bela luka AND beloga luka AND belog luka AND belome luku AND belom luku
     • thus preventing false retrievals such as:
       ...posmatrano sa dna vidika, izgleda kao da iz širokih lukova belog mosta teče i razliva se ne samo zelena Drina…
       (…seen from the far end of the view, it looks as if not only the green Drina flows and spills out from the wide arches of the white bridge…)
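The effect of the compound FST on slide 5 can be illustrated with a small sketch. It is not the Unitex transducer itself: the form tables below are a hand-written, simplified subset (vocative omitted, a single `pau` tag for the form used after the numbers 2 to 4), and the tag names and function names are hypothetical. The point is only that pairing forms by matching case and number yields far fewer phrases than the unconstrained product of all component forms; the unconstrained count here depends on the toy tables and is not meant to reproduce the 216 quoted on the slide.

```python
from itertools import product

# Hypothetical toy tables: tag -> forms. Tags are simplified "case.number" labels.
BELI = {
    "nom.sg": ["beli"], "gen.sg": ["belog", "beloga"], "dat.sg": ["belom", "belome"],
    "acc.sg": ["beli"], "ins.sg": ["belim"], "loc.sg": ["belom", "belome"],
    "pau": ["bela"],
    "nom.pl": ["beli"], "gen.pl": ["belih"], "dat.pl": ["belim", "belima"],
    "acc.pl": ["bele"], "ins.pl": ["belim", "belima"], "loc.pl": ["belim", "belima"],
}
LUK = {
    "nom.sg": ["luk"], "gen.sg": ["luka"], "dat.sg": ["luku"],
    "acc.sg": ["luk"], "ins.sg": ["lukom"], "loc.sg": ["luku"],
    "pau": ["luka"],
    "nom.pl": ["lukovi"], "gen.pl": ["lukova"], "dat.pl": ["lukovima"],
    "acc.pl": ["lukove"], "ins.pl": ["lukovima"], "loc.pl": ["lukovima"],
}

def agreeing_phrases(adj, noun):
    """Combine adjective and noun forms only when their tags match."""
    phrases = []
    for tag, adj_forms in adj.items():
        for a, n in product(adj_forms, noun.get(tag, [])):
            phrase = f"{a} {n}"
            if phrase not in phrases:  # several tags can share one surface form
                phrases.append(phrase)
    return phrases

def unconstrained_pairs(adj, noun):
    """Every distinct adjective form paired with every distinct noun form."""
    adj_forms = {f for forms in adj.values() for f in forms}
    noun_forms = {f for forms in noun.values() for f in forms}
    return {f"{a} {n}" for a in adj_forms for n in noun_forms}

if __name__ == "__main__":
    good = agreeing_phrases(BELI, LUK)
    print(len(good), "agreeing phrases;", len(unconstrained_pairs(BELI, LUK)), "unconstrained pairs")
    print(good)
```

Searching for the agreeing forms as exact phrases is also what rules out the bridge example above, where lukova and belog occur next to each other but do not form the compound.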

  6. The system options
     • Alternate alphabet usage: štrajk ‘strike’ → штрајк
     • Inclusion of inflectional forms: štrajk ‘strike’ → štrajk, štrajka, štrajkovi, etc.
     • Addition of related words:
       štrajk ‘strike’ → obustava rada ‘work stoppage’
       solarni sistem ‘solar system’ → Merkur, Venera, Zemlja, Mars
       Engleska ‘England’ → Englez ‘Englishman’, Engleskinja ‘English woman’, plus Albion
     • Inflection of free phrases: inflection of free phrases by predicting their syntactic structure
     Result: improved query
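The alternate-alphabet option can be sketched as a plain Latin-to-Cyrillic transliteration. The letter correspondences are the standard Serbian ones; the function name `to_cyrillic` is hypothetical, and this is a minimal illustration rather than the code used in WS4QE. One known limitation is stated in the comments: the sketch always treats lj, nj and dž as digraphs, which is wrong for the few words where they are two separate letters.

```python
# Standard Serbian Latin -> Cyrillic correspondences (lowercase).
DIGRAPHS = {"lj": "љ", "nj": "њ", "dž": "џ"}
SINGLE = {
    "a": "а", "b": "б", "c": "ц", "č": "ч", "ć": "ћ", "d": "д", "đ": "ђ",
    "e": "е", "f": "ф", "g": "г", "h": "х", "i": "и", "j": "ј", "k": "к",
    "l": "л", "m": "м", "n": "н", "o": "о", "p": "п", "r": "р", "s": "с",
    "š": "ш", "t": "т", "u": "у", "v": "в", "z": "з", "ž": "ж",
}

def to_cyrillic(text):
    """Transliterate Serbian Latin text to Cyrillic.

    Assumes precomposed characters (NFC) and always reads lj/nj/dž as
    digraphs; characters outside the alphabet are copied unchanged.
    """
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair.lower() in DIGRAPHS:
            cyr = DIGRAPHS[pair.lower()]
            out.append(cyr.upper() if pair[0].isupper() else cyr)
            i += 2
            continue
        ch = text[i]
        cyr = SINGLE.get(ch.lower())
        if cyr is None:
            out.append(ch)  # spaces, digits, punctuation, foreign letters
        else:
            out.append(cyr.upper() if ch.isupper() else cyr)
        i += 1
    return "".join(out)

if __name__ == "__main__":
    for word in ["štrajk", "beli luk", "češnjak"]:
        print(word, "->", to_cyrillic(word))
```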

  7. Rule-based procedure for inflection
     • Procedure for automatic inflection of compounds and phrases based on a set of rules
     • Rule design strategy: the result of expert knowledge of morphology and of the analysis of existing manually created compound dictionaries
     • Experiments with various rule strategies are possible; the final strategy is the result of several iterations
     • The rule-based strategy presently consists of 53 rules with a total of 1014 rule subtypes (rule parts)

  8. Rule-based procedure for inflection
     <Rule ID="43" CFLX="NC_N6X" Status="true">
       <RuleType ID="1">
         <WordRT ID="1" POS="N" Flex="true" />
         <WordRT ID="2" POS="*" Flex="false" Condition="GramCats,2" />
         <WordRT ID="3" POS="*" Flex="false" Condition="GramCats,2" />
         <WordRT ID="4" POS="*" Flex="false" Condition="GramCats,2" />
       </RuleType>
       <RuleType ID="2">
         <WordRT ID="1" POS="N" Flex="true" />
         <WordRT ID="2" POS="PREP" Flex="false" />
         <WordRT ID="3" POS="*" Flex="false" Condition="PrepAgr,2" />
         <WordRT ID="4" POS="*" Flex="false" />
       </RuleType>
       <RulePart ID="1" Frequency="3" Example="princ na belom konju">
         <WordRP ID="1" GramCats="ms1v" />
       </RulePart>
       <RulePart ID="2" Frequency="2">
         <WordRP ID="1" GramCats="ms1q" />
       </RulePart>
       <RulePart ID="3" Frequency="2">
         <WordRP ID="1" GramCats="ns1q" />
       </RulePart>
       <RulePart ID="4" Frequency="1">
         <WordRP ID="1" GramCats="fs1q" />
       </RulePart>
       <RulePart ID="5" Frequency="0">
         <WordRP ID="1" GramCats="ns1v" />
       </RulePart>
       <RulePart ID="6" Frequency="0">
         <WordRP ID="1" GramCats="fs1v" />
       </RulePart>
     </Rule>
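A rule of this shape can be read with a few lines of Python. The sketch below uses an abbreviated copy of rule 43 and rests on assumptions that the slide does not confirm: that the elements are named `Rule`, `RuleType`, `RulePart`, `WordRT` and `WordRP` with an `ID` attribute, that `Flex="true"` marks the words that inflect, and that `Frequency` counts how often a rule part was observed. The helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of rule 43; element and attribute names are assumed.
RULE_XML = """
<Rule ID="43" CFLX="NC_N6X" Status="true">
  <RuleType ID="2">
    <WordRT ID="1" POS="N" Flex="true"/>
    <WordRT ID="2" POS="PREP" Flex="false"/>
    <WordRT ID="3" POS="*" Flex="false" Condition="PrepAgr,2"/>
    <WordRT ID="4" POS="*" Flex="false"/>
  </RuleType>
  <RulePart ID="1" Frequency="3" Example="princ na belom konju">
    <WordRP ID="1" GramCats="ms1v"/>
  </RulePart>
  <RulePart ID="2" Frequency="2">
    <WordRP ID="1" GramCats="ms1q"/>
  </RulePart>
  <RulePart ID="5" Frequency="0">
    <WordRP ID="1" GramCats="ns1v"/>
  </RulePart>
</Rule>
"""

def flex_pattern(rule_type):
    """One boolean per word position: does the word inflect?"""
    return [w.get("Flex") == "true" for w in rule_type.findall("WordRT")]

def parts_by_frequency(rule):
    """(GramCats of the first word, Frequency) pairs, most frequent first."""
    parts = [
        (part.find("WordRP").get("GramCats"), int(part.get("Frequency")))
        for part in rule.findall("RulePart")
    ]
    return sorted(parts, key=lambda p: p[1], reverse=True)

if __name__ == "__main__":
    rule = ET.fromstring(RULE_XML)
    print(flex_pattern(rule.find("RuleType")))
    print(parts_by_frequency(rule))
```

Under this reading, the example princ na belom konju matches the second rule type: a noun that inflects, followed by a preposition and two words that stay fixed.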

  9. Rule-based procedure for inflection
     • System evaluation on three separate sets of data that differ both in content and in structure:
       compound toponyms (238)
       formal names of professions (356)
       search engine queries (728), taken from the log file of a Serbian professional journal
     • Evaluation indicated that the strategy can be integrated into the morphological query expansion mechanism for compounds and phrases that do not exist in the compound dictionary

  10. Technical implementation
     • The process: the developed web application
       receives the user query,
       uses the local web service WS4QE to expand the query, and
       forwards it to the Google search engine using the Google AJAX Search API (which enables the embedding of Google searches into personal web pages or web applications)
     • Interface
       Query expansion is implemented with different possibilities and levels of detail, so the web user can choose from several options, from simple query expansion to complex wordnet advanced search
       Search results are displayed within our own web pages for different types of query expansion, depending on the resources and the type of expansion
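The expand-then-forward step can be sketched as follows. This is an illustration under stated assumptions, not the WS4QE code: the function names are hypothetical, the variants are joined with Google's OR operator as quoted phrases (the slides do not say how the expanded query is serialized), and an ordinary search URL stands in for the Google AJAX Search API call.

```python
from urllib.parse import urlencode

def build_expanded_query(variants):
    """Join query variants as quoted phrases separated by OR, dropping duplicates."""
    seen = []
    for v in variants:
        v = v.strip()
        if v and v not in seen:
            seen.append(v)
    return " OR ".join(f'"{v}"' for v in seen)

def search_url(query, base="https://www.google.com/search"):
    """Build a search URL for the query; `base` is a stand-in endpoint."""
    return f"{base}?{urlencode({'q': query})}"

if __name__ == "__main__":
    q = build_expanded_query(["beli luk", "бели лук", "češnjak", "чешњак"])
    print(q)
    print(search_url(q))
```

Quoting each variant keeps multi-word expansions such as beli luk together as a phrase, which matters for the compound forms discussed on slide 5.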

  11. Technical implementation
     • The web service WS4QE uses classes from .NET dll components developed within WS4LR (WorkStation for Lexical Resources)
     • WS4LR enables the usage of lexical resources for query expansion
     [Figure: the components that make up the WS4LR system and their inter-relationships]

  12. Technical implementation
     [Screenshots: WS4QE home page; Wordnet advanced search]
     Compare:
     • The query submitted directly to Google with only the initial string ‘beli luk’ returned a total of 54,900 documents
     • The query expanded with ‘бели лук’, ‘češnjak’, ‘чешњак’ and then submitted by WS4QE to Google returned a total of 92,700 documents

  13. Results for expanded query
     Compare:
     • The query submitted directly to Google obtained 66,600 documents
     • The query expanded with a hypernym, in both alphabets, obtained 160,000 documents
     • Morphological expansion in two alphabets (without semantic expansion) obtained 285,000 documents

  14. Results for expanded query

  15. Conclusion
     • Approach: lexical resources can be put to the aid of the user by offering him/her various possibilities of query expansion
     • Formulation of queries: queries often need to be ‘fine tuned’ in order to obtain an optimal balance between recall and precision
     • Further endeavors:
       1. We shall continue to develop our lexical resources
       2. We will strive to broaden the scope of tasks that can be solved with our tools

  16. Thank you!
     cvetana@matf.bg.ac.yu, ranka@rgf.bg.ac.yu, vitas@matf.bg.ac.yu, ivano@rgf.bg.ac.yu
