Mastering Information Retrieval: The Art of Finding What You Need

Explore the world of information retrieval with Alexander Gelbukh. Learn to find necessary data in huge, poorly structured information sets using innovative techniques. Discover the importance of knowledge and the challenges faced in searching the web. This comprehensive guide covers various topics in computer science and electronic commerce. Delve into natural language processing research seminars and practical applications for intelligent technologies. Don't miss out on the opportunity to enhance your information retrieval skills with this valuable resource.

Presentation Transcript


  1. Alexander Gelbukh Moscow, Russia

  2. Mexico

  3. Computing Research Center (CIC), Mexico

  4. Chung-Ang University, Korea. Electronic Commerce and Internet Application Lab

  5. Special Topics in Computer Science: The Art of Information Retrieval. Alexander Gelbukh, www.Gelbukh.com

  6. Information Retrieval • In a huge amount • of poorly structured information • find the information that you need • when you don’t know exactly what you need • or can’t explain it • The Web • User information need • Ranking

  8. Importance • Knowledge: the main treasure of man • Web: Repository? Cemetery of information! • Natural language and multimedia information • Poorly structured, badly written • Corporate and organizational document bases • Senate speeches: Mexico • Medical data collections • Corporate memory. Microsoft knowledge base • Future: data explosion → increasing importance

  9. Perspectives • Corporations: corporate databases • Organizations: document bases • Government • European Union multilingual problem • The same in Asia • Academy • Lots of open research topics • Web topics • Computational Linguistics topics • Intelligent technologies, AI

  10. Textbook http://sunsite.dcc.uchile.cl/irbook/

  11. Contents • Introduction • Modeling • Retrieval Evaluation • Query Languages • Query Operations • Text and Multimedia Languages and Properties • Text Operations • Indexing and Searching • Parallel and Distributed IR • User Interfaces and Visualization • Multimedia IR: Models and Languages • Multimedia IR: Indexing and Searching • Searching the Web • Libraries and Bibliographical Systems • Digital Libraries

  12. Calendar • September 18 Chapter 1 Introduction • September 25 Chapter 2 Modeling • October 2 Chapter 3 Retrieval Evaluation • October 9 Chapter 4 Query Languages • October 16 Chapter 5 Query Operations • October 23 – midterm exam • October 30 Chapter 6 Text and Multimedia Languages... • November 6 Chapter 7 Text Operations • November 13 Chapter 8 Indexing and Searching • November 20 Chapter 10 User Interfaces and Visualization • November 27 Chapter 13 Searching the Web • December 4 Chapter 14 Libraries and Bibliographical Systems • December 11 Chapter 15 Digital Libraries • December – final exam

  13. Class structure Main course: Information Retrieval • Discussion of previous chapter. Questions • I briefly present a new chapter Research seminar: Natural Language Processing • Discussion of previous paper. Questions. • Identification of possible research topics • Presentation of a new paper or current work • Discussion and questions • Goal: publications!

  14. Natural Language Processing Research Seminar

  15. What CL is about Computers to process natural language text • “Understand” • Generate • Search • Organize • Translate • … Useful in IR

  16. Methods • No: text as a stream of letters • Brute force statistics • Simplified heuristics (ex.: Porter) • Yes: attention to language rules • Linguistically motivated approaches • Knowledge-based approaches • Corpus-based approaches

  17. What IR is about • Classical IR: find words? Concepts! • Question answering • Summarization • Clustering • … Take language seriously

  18. Text representations for IR • Represent the retrieval features • Strings → stems (lexemes), synsets, phrases. • Women → woman, lady, female • Old men and women → old woman • Structured representation of text • Network of related events and entities • Enables logical inference

  19. CL tasks useful in IR • Morphology (stemming) • POS / Word sense disambiguation • Word relatedness • Anaphora resolution • Parsing and semantics (phrase search) • Synonymic rephrasing • Translation etc… Each one a whole science in itself

  20. Morphology • Q: pig T: piggish • Simple: stemming • piggish → pig- • Lexeme: set of word forms • same stem can give different words • pigment → not pig; piny → pine, not pin • Dictionary/corpus-based methods • Learning; dictionary management
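The contrast on the Morphology slide between simple stemming and dictionary-based lexemes can be illustrated with a small sketch. Everything below is an assumption made for illustration: the suffix list, the `naive_stem` and `lemma` helpers, and the tiny `LEXEMES` dictionary are hypothetical and are not taken from the course or from the Porter algorithm.

```python
# Hypothetical suffix list for a toy stemmer; not the Porter rule set.
SUFFIXES = ("ment", "ish", "y")

# Hypothetical hand-made dictionary mapping word forms to their lexemes.
LEXEMES = {"piggish": "pig", "pigment": "pigment", "piny": "pine"}


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undouble a final consonant: "pigg" becomes "pig".
            if len(stem) >= 2 and stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return word


def lemma(word):
    """Prefer the dictionary entry; fall back to naive stemming."""
    return LEXEMES.get(word, naive_stem(word))


for w in ("piggish", "pigment", "piny"):
    print(w, "stem:", naive_stem(w), "lexeme:", lemma(w))
```

With this toy rule set, the stemmer handles "piggish" but conflates "pigment" with "pig" and "piny" with "pin", while the dictionary lookup keeps them apart. This is the kind of error the slide attributes to purely string-based methods.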

  21. Part of Speech Disambiguation • Q: oil well T: He did very well • Q: what is an are? T: They are nice • Important for English, Chinese. Less important for other types • Perhaps not so helpful directly, but is necessary for most other tasks • Usually statistical / heuristic methods

  22. Word Sense Disambiguation • Q: bank account T: on the beautiful banks of Han river ... • bill: document, banknote, law, ax, peak, Gates... • Very frequent, almost any word in text • Statistical & dictionary methods • International competitions
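As a rough illustration of a dictionary method for the "bank" example, the sketch below picks the sense whose gloss shares the most words with the context, in the spirit of a simplified Lesk overlap. The glosses, the stop-word list, and the `simplified_lesk` helper are invented for this example and do not come from any real dictionary or library.

```python
# Hypothetical glosses for two senses of "bank"; not from a real dictionary.
SENSES = {
    "bank": {
        "financial": "institution that keeps money in an account for customers",
        "river": "sloping land along the side of a river or lake",
    }
}

# Small hypothetical stop-word list.
STOP = {"the", "of", "on", "a", "an", "in", "for", "that", "or", "and"}


def tokens(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return {w for w in text.lower().split() if w not in STOP}


def simplified_lesk(word, context):
    """Return the sense label whose gloss overlaps most with the context."""
    ctx = tokens(context)
    glosses = SENSES[word]
    return max(glosses, key=lambda s: len(ctx & tokens(glosses[s])))


print(simplified_lesk("bank", "bank account"))
print(simplified_lesk("bank", "on the beautiful banks of Han river"))
```

The overlap count is a deliberately crude signal. With realistic glosses and contexts, ties and empty overlaps are common, which is one reason the slide also lists statistical methods.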

  23. Word relatedness • Q: female T: woman (women) • Synonyms. Subtypes/super-types • Dictionaries. WordNet. Similarity. Lesk. • Q: Korea T: Seoul • Other linguistic relationships (e.g., part) • Real-world relationships (facts) • Q: Clinton T: Lewinsky • Statistical co-occurrence (MI)
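The "statistical co-occurrence (MI)" bullet can be sketched as pointwise mutual information computed from document-level co-occurrence counts. The four-document toy collection and the `make_pmi` helper are assumptions made for illustration. Real collections are far larger and usually need smoothing for rare words.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical toy collection: each document is reduced to a set of words.
DOCS = [
    {"korea", "seoul", "han", "river"},
    {"korea", "seoul", "university"},
    {"bank", "account", "bill"},
    {"bank", "river", "han"},
]


def make_pmi(docs):
    """Build a PMI function from document frequencies of words and pairs."""
    n = len(docs)
    word_df = Counter(w for d in docs for w in d)
    pair_df = Counter(
        frozenset(pair) for d in docs for pair in combinations(sorted(d), 2)
    )

    def pmi(a, b):
        joint = pair_df[frozenset((a, b))]
        if joint == 0:
            return float("-inf")  # the words never co-occur in this collection
        # log2( P(a, b) / (P(a) * P(b)) ) with probabilities estimated as df / n
        return math.log2(joint * n / (word_df[a] * word_df[b]))

    return pmi


pmi = make_pmi(DOCS)
print(pmi("korea", "seoul"), pmi("bank", "river"), pmi("korea", "bank"))
```

In this toy data, "korea" and "seoul" always appear together and score higher than "bank" and "river", which co-occur only as often as chance would predict.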

  24. Anaphora resolution • Q: Awards of Prof. Han T: Prof. Han said... He did... IBM awarded him... • Frequency • Phrases, co-occurrence, summarization, inference, translation • Heuristic (Mitkov) and knowledge-based methods • Other types of co-reference

  25. Parsing, semantics • Q: Awards of Prof. Han T1: Prof. Han among many other prizes has several IBM awards T2: Mr. Kang has an award Prof. Han does not know of • Understanding of text • Rich structured representation • Better phrase search; question answering, summarization, ...

  26. Synonymic rephrasing, reasoning • Q: experienced computer scientists T: Prof. Han has been programming for many years and awarded an IBM award • Requires good syntactic and semantic analysis • Knowledge-based methods

  27. Multilingual access • Q: 요구르트 T: We sell excellent yoghurt. Продаем йогурт. Se vende rico yogur. • Search multilingual collections • Europe: dozens of official languages of EU • If you don’t know how to say it in English • Dictionaries, bilingual corpora, ...

  28. Tasks are entangled • Many CL tasks require other tasks • Morphology → syntax → semantics • Many CL tasks form circles • parsing ← WSD ← parsing • I see a wild cat with a telescope (tripod?) • Can be done quick-and-dirty (?) • Fighting for last %s • Zipf law: 20% of men drink 80% of beer

  29. Tools and infrastructure • Analysis tools • Tasks, methods • Dictionaries and grammars • Types, structure • Automatic acquisition • Corpora • Corpora analysis tools and methods

  30. Possible tasks • WSD to help IR • Clustering + summarization in IR results • Anaphora and coreference resolution to help IR • Multilingual IR • Applications to Korean • ... a lot of others

  31. Reading • Textbooks • Manning & Schütze, Allen, Jurafsky, Hausser, ... • CICLing proceedings • Computational Linguistics • Google, ResearchIndex

  32. Questions • Who expects to publish? • Who will make a presentation at the next seminar?

  33. Thank you! Till September 18
