
Wordnets for information retrieval: a hole in one!

Piek Vossen, OmniPaper seminar, Leuven, December 3rd, 2004.


Presentation Transcript


  1. Wordnets for information retrieval: a hole in one! Piek Vossen, OmniPaper seminar, Leuven, December 3rd, 2004. Irion Technologies (c)

  2. Content
     • WordNet, EuroWordNet, Global Wordnet
     • Why should we use wordnets?
     • How do we use wordnets?
     • Why are wordnets not enough?
     • Conceptual indexing
     • Conceptual matching
     • Conceptual dialogue
     • Demos

  3. Princeton WordNet
     • Developed by George Miller and his team at Princeton University, as the implementation of a mental model of the lexicon
     • Organized around the notion of a synset: a set of synonyms in a language that represent a single concept
     • Semantic relations between concepts
     • Currently covers about 100,000 concepts and 120,000 English words

  4. Wordnet Model [diagram: the vocabulary of a language is linked to concepts, and concepts are linked to each other by relations]
     • bank (1): rec 12345, financial institute
     • bank (2): rec 54321, side of a river
     • fiddle, violin (1): rec 9876, small string instrument
     • violin (2), fiddler, violist: rec 65438, musician playing violin
     • string (1): rec 35576, string of instrument
     • string (2): rec 29551, underwear
     • Further concepts: rec 42654, musician; rec 25876, string instrument
     • Relations shown between concepts: type-of, part-of
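The model on this slide can be sketched as a small data structure. The Python below is only an illustration, not the actual WordNet database format: the class name `Synset`, its field names and the direction chosen for each `type-of` and `part-of` link are assumptions, while the record numbers, glosses and words are taken from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    rec: int                                        # record number as shown on the slide
    gloss: str                                      # short description of the concept
    words: list = field(default_factory=list)       # synonyms expressing the concept
    relations: dict = field(default_factory=dict)   # relation name -> target record


# Concepts from the slide; the relation directions are an assumption.
SYNSETS = {s.rec: s for s in [
    Synset(12345, "financial institute", ["bank"]),
    Synset(54321, "side of a river", ["bank"]),
    Synset(25876, "string instrument"),
    Synset(42654, "musician"),
    Synset(9876, "small string instrument", ["fiddle", "violin"], {"type-of": 25876}),
    Synset(65438, "musician playing violin", ["violin", "fiddler", "violist"], {"type-of": 42654}),
    Synset(35576, "string of instrument", ["string"], {"part-of": 25876}),
    Synset(29551, "underwear", ["string"]),
]}


def senses(word):
    """Return every concept that a word form can express."""
    return [s for s in SYNSETS.values() if word in s.words]


def synonyms(word):
    """Return the other words that share at least one concept with `word`."""
    return sorted({w for s in senses(word) for w in s.words if w != word})
```

Calling `senses("bank")` returns two records, which is exactly the ambiguity that the later slides on disambiguation try to resolve.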

  5. EuroWordNet
     • EU project (1996-1999) to develop wordnets for 8 European languages
     • Each wordnet is linked to the English wordnet, which functions as an interlingua
     • Cross-lingual wordnet database, where you can go from a synset in one language to a synset in any other language
     • Coverage: 10,000 to 50,000 synsets and up to 55,000 words in Dutch, German, French, Spanish, Italian, Estonian and Czech

  6. EuroWordNet Model [diagram: language-specific wordnets, each with its own Lexical Items Table, are linked through the Inter-Lingual-Index (example ILI-record {drive}) to a Domains hierarchy (Traffic: Air, Road) and to a top Ontology (2ndOrderEntity: Location, Dynamic)]
     • Dutch: bewegen, gaan, rijden, berijden
     • English: move, go, ride, drive
     • Italian: guidare, cavalcare, andare, muoversi
     • Spanish: conducir, cabalgar, jinetear, mover, transitar
     • I = language-independent link
     • II = link from language-specific wordnet to Inter-Lingual-Index
     • III = language-dependent link
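A minimal sketch of how an Inter-Lingual-Index lets you go from a word in one language to words in another. The record identifiers (`ili:drive`, `ili:ride`) and the assignment of each word to a record are assumptions made for illustration; only the words themselves appear on the slide.

```python
# Hypothetical ILI records: identifier -> language code -> words.
# The words come from the slide; which record each belongs to is assumed.
ILI = {
    "ili:drive": {
        "en": ["drive"],
        "nl": ["rijden"],
        "it": ["guidare"],
        "es": ["conducir"],
    },
    "ili:ride": {
        "en": ["ride"],
        "nl": ["berijden"],
        "it": ["cavalcare"],
        "es": ["cabalgar", "jinetear"],
    },
}


def translate(word, src, tgt):
    """Map a word to target-language words through shared ILI records."""
    result = []
    for languages in ILI.values():
        if word in languages.get(src, []):
            result.extend(languages.get(tgt, []))
    return result
```

Because every language links only to the index, adding a new language needs one set of links rather than one per language pair.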

  7. Example of EuroWordNet structure [diagram: the concepts organism, being, person, doctor, sick person/patient, organ, stomach, disease, stomach disease, scalpel, treat, operate and to get well, connected by relations labelled Causes, Patient, Part of, Agent, Instrument and Involves]

  8. Global Wordnet Association (http://www.globalwordnet.org)
     • Projects: EuroWordNet, BalkaNet
     • Languages: Arabic, Polish, Welsh, Chinese, 20 Indian languages, Brazilian Portuguese, Hebrew, Latvian, Persian, Kurdish, Avestan, Baluchi, Hungarian, Romanian, Bulgarian, Turkish, Slovenian, Greek, Serbian, English, German, Spanish, French, Italian, Dutch, Czech, Estonian, Danish, Norwegian, Swedish, Portuguese, Korean, Russian, Basque, Catalan, Thai

  9. Why should we use wordnets?

  10. Why are wordnets not used by Internet search engines?
     • Without wordnets recall is very low, but this does not seem to be a problem:
        • There is too much information on the Internet to handle anyway;
        • There is redundancy of information, i.e. it is expressed in any conceivable way and any conceivable language;
        • Whatever you type in, you always get many results;
     • Google approach:
        • All content words should occur (boolean AND);
        • Pigeon ranking: pages to which many people link are on top, show what others know;

  11. Why should wordnets be used?
     • Cross-lingual retrieval is not possible unless you map words across languages;
     • Very specific questions still give no results if the query is formulated differently from the answer, e.g. Google:
        • “evaluate web of concepts for OmniPaper search system” (3 results)
        • “evaluate web of concepts for OmniPaper search engine” (0 results)
     • Small-scale indexes have no redundancy, so queries formulated differently from the indexed text return no results;
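The effect of a strict boolean AND on a small index can be sketched as follows. This is a toy illustration, not how any real search engine is implemented: the one-document index and the two-entry synonym table are invented for the example.

```python
import re

# Invented toy index with a single document.
DOCS = {1: "We evaluate a web of concepts for the OmniPaper search system."}

# Hypothetical synonym table standing in for a wordnet.
SYNONYMS = {"engine": {"system"}, "system": {"engine"}}


def tokens(text):
    """Lower-case a text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def boolean_and(query, docs, expand=False):
    """Return ids of documents containing every query word.

    With expand=True a query word also matches if one of its synonyms occurs.
    """
    hits = []
    for doc_id, text in docs.items():
        doc_words = tokens(text)
        matched_all = True
        for word in tokens(query):
            alternatives = {word}
            if expand:
                alternatives |= SYNONYMS.get(word, set())
            if not alternatives & doc_words:
                matched_all = False
                break
        if matched_all:
            hits.append(doc_id)
    return hits
```

With the query "evaluate web of concepts for OmniPaper search engine", `boolean_and` finds nothing unless `expand=True`, because the document says "system" where the query says "engine".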

  12. Language technology: a hole in one! [diagram labelled "Funnel", with the terms golf club(s), golf clubs, golf sticks and Tiger Woods, and the resources thesaurus and semantic network]

  13. How do we use wordnets?

  14. How to get the correct recall? [diagram: results for the query word "cell" at increasing levels of processing]
     • Levels: No NLP; Morphology; Wordnet Full Expansion; Index Disambiguation (Communication, Legal, Biology); Index & query Disambiguation (Biology)
     • cell [prison]: police cells, jail, prison cell
     • cell [phone]: mobile phone, cellular
     • cell [tissue]: cell growth, cell-division, neuron
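The difference between full expansion and disambiguated expansion can be sketched in a few lines. The grouping of expansion terms under each sense and the pairing of senses with domain labels are assumptions read off the diagram; the function name `expand` is hypothetical.

```python
# Senses of "cell" with domain labels and expansion terms from the slide.
# Which terms belong to which sense is an assumption.
CELL_SENSES = {
    "prison": {"domain": "Legal", "terms": ["police cells", "jail", "prison cell"]},
    "phone": {"domain": "Communication", "terms": ["mobile phone", "cellular"]},
    "tissue": {"domain": "Biology", "terms": ["cell growth", "cell-division", "neuron"]},
}


def expand(word_senses, domain=None):
    """Return expansion terms for all senses, or only for senses in `domain`."""
    terms = []
    for sense in word_senses.values():
        if domain is None or sense["domain"] == domain:
            terms.extend(sense["terms"])
    return terms
```

`expand(CELL_SENSES)` corresponds to "Wordnet Full Expansion" (high recall, much noise), while `expand(CELL_SENSES, "Biology")` corresponds to expansion after disambiguation.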

  15. Why are wordnets not enough?

  16. Words out of context:
     • The traditional search paradigm focuses on document/page retrieval and not on phrase retrieval:
        • Dominant meanings overrule other meanings: “Internet services on Java” gives results only for the software, none for the island Java.
        • Compositional differences are neglected: “toxic medication” versus “medication against toxication”, “animal party” versus “party animal”
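Why compositional differences get lost can be shown with a bag-of-words representation, a common simplification in document retrieval. The two helper functions below are hypothetical names used only for this illustration.

```python
def bag_of_words(phrase):
    """Represent a phrase as an unordered set of lower-cased words."""
    return frozenset(phrase.lower().split())


def ordered_phrase(phrase):
    """Represent a phrase as an ordered tuple of lower-cased words."""
    return tuple(phrase.lower().split())
```

As a bag of words, "animal party" and "party animal" are identical; only a representation that keeps the phrase structure can tell them apart.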

  17. Where are we heading?
     • There is a growing need for more precision and more complex applications to find more fine-grained facts regardless of ‘form’
     • Information retrieval (IR): documents
     • Classification: topics
     • Information extraction (IE): facts
     • Multimodal human-machine interfaces (speech, mobile, chat);
     • Question-answering systems (QA): simple human-machine interface
     • Dialogue systems: iterative human-machine interface
     • Intelligent machines (reasoning, decisions): intelligent human-machine interface
     • Summarization -> Multidoc summaries -> Language generation -> Machine translation

  18. Approach
     • A multilingual wordnet database and morpho-syntactic processing are used to decompose text into concept elements: -> maximum recall;
     • Word-sense disambiguation at index and query side: -> reduce noise;
     • Synonym selection: -> reduce more noise;
     • Match query phrases with document phrases: -> match concept combinations in context;
     • Intelligent dialogues to create context at the user side: -> match intended meanings

  19. Cut out the noise from a multilingual semantic network
     • Concept selection
        • Assign domain labels and selectional patterns to documents and phrases
        • Select word meanings within domains and patterns
     • Synonym selection
        • Most frequent synonyms for selected concepts
        • Co-occurrence relations
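Selecting the most frequent synonyms for a concept can be sketched with a simple corpus count. This is one plausible reading of "most frequent synonyms", not the method actually used in the system; the function name, the `top_n` cut-off and the toy corpus in the usage note are assumptions.

```python
from collections import Counter


def select_synonyms(candidates, corpus_tokens, top_n=2):
    """Keep the `top_n` candidate synonyms that occur most often in the corpus.

    Candidates that never occur in the corpus are dropped.
    """
    counts = Counter(t for t in corpus_tokens if t in candidates)
    return [word for word, _ in counts.most_common(top_n)]
```

For example, with the invented corpus `"pub bar bar cafe bar pub tavern".split()` and the candidates `{"pub", "bar", "cafe", "tavern", "saloon"}`, only "bar" and "pub" are kept for expansion.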

  20. Wordnet: Domain information [diagram: the wordnet model of vocabularies, concepts and relations, extended with domain labels]
     • Domains shown: Clothing, Sport, Finance, Culture, Music, Ball sports, Winter sports
     • bank (1): rec 12345, financial institute
     • bank (2): rec 54321, river side
     • violin (1): rec 9876, small string instrument
     • violin (2), violist: rec 65438, musician playing a violin
     • string (1): rec 35576, string of an instrument
     • string (2): rec 29551, underwear
     • Further concepts: rec 42654, musician; rec 25876, string instrument
     • Relations shown between concepts: type-of, part-of

  21. Domain based concept selection (IST-project MEANING) [diagram: for each domain, a set of concepts is taken from WordNet/Semnet; the synsets and glosses are exported and, together with more contexts, form text grouped by domains (e.g. Sport words); this text is used to train a text classifier (TwentyOne Classify), which then classifies unseen documents and phrases for concept selection]
     • Un-seen document: Microworld: Sport
     • Phrase "financial scandal Juventus": Nanoworld: Finance
     • Phrase "Players boycott the match": Nanoworld: Sport
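Assigning a domain label to a document or phrase can be sketched with a keyword-overlap classifier. This swaps the trained text classifier mentioned on the slide for a much simpler technique, purely for illustration; the word lists in `DOMAIN_WORDS` and the function name `classify` are invented.

```python
# Hypothetical domain vocabularies; a real system would learn these from
# text grouped by domain rather than list them by hand.
DOMAIN_WORDS = {
    "Sport": {"ball", "goal", "game", "score", "match", "players"},
    "Finance": {"financial", "bank", "money", "scandal"},
}


def classify(text):
    """Return the domain whose vocabulary overlaps most with the text, or None."""
    words = set(text.lower().split())
    scores = {domain: len(words & vocab) for domain, vocab in DOMAIN_WORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Applied to a whole document the label plays the role of the microworld; applied to a single phrase it plays the role of the nanoworld, so a sports article can still contain a finance phrase.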

  22. When to apply what strategy? [diagram: polysemy of word types versus word tokens in document(s), with an 80% / 20% split, and the share of words handled by each strategy; percentages as they appear on the slide: Microworld 20%, Nanoworld 20%, Factotum 70%]
     • Microworld: ball, goal, game, score
     • Nanoworld: eat, food
     • Factotum: be, person, have, begin, stop, part

  23. Conceptual Indexing [diagram: a Document (Microworld = sport) provides the Context for each Phrase (Nanoworld = finance); each phrase is stored as Concept1..N, with Word form1 linked to ConceptN, Word form2 linked to ConceptM, and so on up to Word formN]
     Processing steps shown:
     • Restrict synonym expansion
     • Apply multiword lookup
     • Resolve compounds and derivations
     • Normalize word
     • Select concepts within Nanoworld & Microworld
     • Assign domain label to phrase in context
     • Assign domain label to document
     • Extract phrases
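The index entry drawn on this slide can be sketched as a record per phrase. Everything here is a simplified assumption: the `LEXICON` table, its domain labels, the class `IndexedPhrase` and the fallback to all senses when no sense fits the context are illustrative choices, not the system's documented behaviour.

```python
from dataclasses import dataclass

# Hypothetical lexicon: word form -> list of (concept id, domain label).
LEXICON = {
    "bank": [(12345, "finance"), (54321, "geography")],
    "string": [(35576, "music"), (29551, "clothing")],
}


@dataclass
class IndexedPhrase:
    doc_id: str
    microworld: str   # domain label of the whole document
    nanoworld: str    # domain label of the phrase in context
    concepts: dict    # word form -> concept ids kept after selection


def index_phrase(doc_id, microworld, nanoworld, phrase):
    """Index one phrase, keeping only the senses that fit its context."""
    concepts = {}
    for word in phrase.lower().split():
        word_senses = LEXICON.get(word, [])
        # Prefer senses in the phrase or document domain; otherwise keep all.
        kept = [c for c, d in word_senses if d in (nanoworld, microworld)]
        if not kept:
            kept = [c for c, _ in word_senses]
        if kept:
            concepts[word] = kept
    return IndexedPhrase(doc_id, microworld, nanoworld, concepts)
```

Indexing "The bank scandal" with nanoworld "finance" keeps only concept 12345 for "bank", so the river-side sense never enters the index for that phrase.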

  24. Conceptual query analysis [diagram: a Query (Nanoworld = finance) is analysed into Concept1..N, with Word form1 linked to ConceptN, Word form2 linked to ConceptM, and so on up to Word formN]
     Processing steps shown:
     • Assign domain label to query
     • Normalize word
     • Resolve compounds and derivations
     • Apply multiword lookup
     • Select concepts within Nanoworld
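Two of the steps listed, word normalization and multiword lookup, can be sketched as below. The multiword table is hypothetical and only covers two-word units; compound splitting, derivation and concept selection are left out of this sketch.

```python
import unicodedata

# Hypothetical table of known two-word multiwords.
MULTIWORDS = {("human", "rights"), ("coffee", "shop"), ("tea", "room")}


def normalize(word):
    """Lower-case a word and strip diacritics, so 'CAFÉ' and 'cafe' match."""
    decomposed = unicodedata.normalize("NFKD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def analyse_query(query):
    """Normalize each word and join known multiwords into single units."""
    words = [normalize(w) for w in query.split()]
    units = []
    i = 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in MULTIWORDS:
            units.append(words[i] + " " + words[i + 1])
            i += 2
        else:
            units.append(words[i])
            i += 1
    return units
```

Treating "human rights" as one unit matters because the concept lookup that follows would otherwise retrieve senses of "human" and "rights" separately.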

  25. Conceptual matching [diagram: a Query (Nanoworld = finance, Concept1..N with word forms linked to concepts) is matched against each Phrase (Nanoworld = finance, same structure) of a Document (Microworld = sport); the context of the query is unknown]
     Phrase-score:
     • number of matching concepts
     • matching nanoworlds
     • matching nanoworld-microworlds
     • fuzzy word match: potatos, potatoes; Afganistan & afghanistan; café, cafe, Café, CaFé, CAFÉ, café-noir
     • flexion and derivation: depart, departure, departures, departing, departings
     • multiwords and compounds: mensenrechtenactivistenleider (Dutch: "human rights activist leader"), human rights
     • original word, synonym or translation: café, pub, bar, coffee shop, tea room; United States of America, US, USA, VS, Amerika; Pays-Bas, Holland, the Netherlands
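The first three components of the phrase score can be sketched as a simple sum. The slide does not give weights or a formula, so the equal weighting below, the dictionary layout and the function name `phrase_score` are all assumptions.

```python
def phrase_score(query, phrase):
    """Score a document phrase against a query.

    Both arguments are dicts with the keys 'concepts' (a set of concept ids)
    and 'nanoworld'; the document phrase also carries 'microworld'.
    """
    score = len(query["concepts"] & phrase["concepts"])   # number of matching concepts
    if query["nanoworld"] == phrase["nanoworld"]:          # matching nanoworlds
        score += 1
    if query["nanoworld"] == phrase.get("microworld"):     # query nanoworld vs. document microworld
        score += 1
    return score
```

A finance query then scores higher on a finance phrase inside a sports document than on a sports phrase in the same document, even before word-level matches are considered.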

  26. How to create more context?
     • Replace the front-end by an intelligent dialogue system;
     • Users are invited to ask questions in natural language;
     • The system uses the linguistic structure to infer valuable information about information states;
     • The system evaluates the answers (results);
     • Context history is built up and used to find more precise results or adjust results;

  27. Conceptual Dialogue system [diagram: a Dialogue Manager uses a Classifier Engine and a Retrieval Engine to select among arrangements A to I, grouped under Active holidays, Winter holidays, Apartments, Fly & Drive, Summer holidays and Camping]
     • System: What can I do for you?
     • User: I want to book a holiday.
     • System: Can you provide me with more details?
     • User: Nice apartment with swimming pool.
     • System: There are two arrangements that might be what you are looking for. Have a look at F or G.
     • User: No, I would like something near the sea!
     • System: Perhaps H and I are a better option?
     • User: Do you also have flight drive arrangements?
     • System: Yes, but not within your first selection.
     • User: And without swimming pool?
     • System: Please have a look at E.
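How a context history narrows or adjusts results over several turns can be sketched with accumulated constraints. This is a strongly simplified illustration: the features attached to each arrangement are invented, the class `DialogueContext` is hypothetical, and the sketch does not reproduce the classification or ranking that the real dialogue manager would apply.

```python
# Arrangement letters follow the slide; the features are invented.
ARRANGEMENTS = {
    "E": {"apartment", "sea"},
    "F": {"apartment", "pool"},
    "G": {"apartment", "pool"},
    "H": {"apartment", "pool", "sea"},
    "I": {"apartment", "pool", "sea"},
}


class DialogueContext:
    """Accumulates required and excluded features over the turns of a dialogue."""

    def __init__(self):
        self.required = set()
        self.excluded = set()

    def require(self, *features):
        """Add features the user asked for; they stop being excluded."""
        self.required.update(features)
        self.excluded.difference_update(features)

    def exclude(self, *features):
        """Add features the user rejected; they stop being required."""
        self.excluded.update(features)
        self.required.difference_update(features)

    def candidates(self):
        """Return arrangements that satisfy the whole context so far."""
        return sorted(
            name for name, feats in ARRANGEMENTS.items()
            if self.required <= feats and not (self.excluded & feats)
        )
```

Requiring "apartment" and "pool", then "sea", then excluding "pool" leaves only E, which mirrors the last system turn on the slide: the final answer depends on the whole history, not just on the last utterance.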

  28. Demos
     • Cross-lingual retrieval, where queries in 6 languages can be matched with a conceptual index
     • Dutch dialogue system, where the complete context is used to guide users to information step by step

  29. Cross-lingual retrieval system
     • Antonya: portal of environmental information
     • More than 3000 URLs crawled (mostly in the Netherlands)
     • Indexing languages: English, German, French, Dutch, Spanish, Italian
     • Search languages: English, German, French, Dutch, Spanish, Italian
     • http://www.antonya.net

  30. Conceptual Dialogue system
     • Service desk for the city of Nijmegen
     • 256 products on their website, which are services for citizens
     • Dialogue system to analyse user queries and evaluate information states (33 different states)
     • Classification system trained with the documents to find and retrieve answers
     • http://kundera.irion.nl/burgerloket/

  31. Thank you for your attention!


  44. Conceptual Dialogue system
     • Service desk for the city of Nijmegen
     • 100 products on their website, which are services for citizens
     • Dialogue system to analyse user queries and evaluate information states (33 different states)
     • Classification system trained with the documents to find and retrieve answers
     • http://kundera.irion.nl/burgerloket/
