Alexander Gelbukh Moscow, Russia
Chung-Ang University, Korea • Electronic Commerce and Internet Application Lab
Special Topics in Computer Science: The Art of Information Retrieval • Alexander Gelbukh • www.Gelbukh.com
Information Retrieval • In a huge amount • of poorly structured information • find the information that you need • when you don’t know exactly what you need • or can’t explain it • The Web • User information need • Ranking
Importance • Knowledge: the main treasure of man • Web: Repository? Cemetery of information! • Natural language and multimedia information • Poorly structured, badly written • Corporate and organizational document bases • Senate speeches: Mexico • Medical data collections • Corporate memory. Microsoft knowledge base • Future: data explosion increasing importance
Perspectives • Corporations: corporate databases • Organizations: document bases • Government • European Union multilingual problem • The same in Asia • Academia • Lots of open research topics • Web topics • Computational Linguistics topics • Intelligent technologies, AI
Textbook http://sunsite.dcc.uchile.cl/irbook/
Contents • Introduction • Modeling • Retrieval Evaluation • Query Languages • Query Operations • Text and Multimedia Languages and Properties • Text Operations • Indexing and Searching • Parallel and Distributed IR • User Interfaces and Visualization • Multimedia IR: Models and Languages • Multimedia IR: Indexing and Searching • Searching the Web • Libraries and Bibliographical Systems • Digital Libraries
Calendar • September 18 Chapter 1 Introduction • September 25 Chapter 2 Modeling • October 2 Chapter 3 Retrieval Evaluation • October 9 Chapter 4 Query Languages • October 16 Chapter 5 Query Operations • October 23 – midterm exam • October 30 Chapter 6 Text and Multimedia Languages... • November 6 Chapter 7 Text Operations • November 13 Chapter 8 Indexing and Searching • November 20 Chapter 10 User Interfaces and Visualization • November 27 Chapter 13 Searching the Web • December 4 Chapter 14 Libraries and Bibliographical Systems • December 11 Chapter 15 Digital Libraries • December – final exam
Class structure Main course: Information Retrieval • Discussion of previous chapter. Questions • I briefly present a new chapter Research seminar: Natural Language Processing • Discussion of previous paper. Questions. • Identification of possible research topics • Presentation of a new paper or current work • Discussion and questions • Goal: publications!
What CL is about Computers to process natural language text • “Understand” • Generate • Search • Organize • Translate • … Useful in IR
Methods • No: text as a stream of letters • Brute force statistics • Simplified heuristics (ex.: Porter) • Yes: attention to language rules • Linguistically motivated approaches • Knowledge-based approaches • Corpus-based approaches
What IR is about • Classical IR: find words? Concepts! • Question answering • Summarization • Clustering • … Take language seriously
Text representations for IR • Represent the retrieval features • Strings → stems (lexemes), synsets, phrases • Women → woman, lady, female • Old men and women → old woman • Structured representation of text • Network of related events and entities • Enables logical inference
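A minimal sketch of moving from strings to lexemes and synsets as retrieval features. The lookup tables `LEMMAS` and `SYNSETS` and the concept labels such as `C_WOMAN` are hypothetical toy data invented for illustration; a real system would draw them from a morphological dictionary and a resource such as WordNet.

```python
# Hypothetical toy tables (illustrative only, not from the slides).
LEMMAS = {"women": "woman", "men": "man", "ladies": "lady"}
SYNSETS = {"woman": "C_WOMAN", "lady": "C_WOMAN", "female": "C_WOMAN", "man": "C_MAN"}

def features(text):
    """Map each token to its lemma, then to a synset id when one is known."""
    feats = set()
    for token in text.lower().split():
        lemma = LEMMAS.get(token, token)
        feats.add(SYNSETS.get(lemma, lemma))
    return feats

def matches(query, text):
    """True if every query feature also occurs among the text features."""
    return features(query) <= features(text)

print(matches("female", "old men and women"))  # query and text share no string
print(matches("lady", "old men"))
```

Here the query "female" matches a text containing only "women" because both map to the same concept. Phrase-level features such as "old woman" would need syntactic analysis, which this sketch does not attempt.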
CL tasks useful in IR • Morphology (stemming) • POS / Word sense disambiguation • Word relatedness • Anaphora resolution • Parsing and semantics (phrase search) • Synonymic rephrasing • Translation etc… Each one a whole science in itself
Morphology • Q: pig T: piggish • Simple: stemming • piggish → pig- • Lexeme: set of word forms • same stem can give different words • pigment → not pig; piny → pine, not pin • Dictionary/corpus-based methods • Learning; dictionary management
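The contrast between simple stemming and dictionary-based analysis can be sketched as follows. The suffix list and the tiny lexicon are hypothetical, chosen only to reproduce the slide's examples; this is not the Porter algorithm.

```python
# Hypothetical toy data (illustrative only).
SUFFIXES = ["gish", "ment", "y", "s"]
LEXICON = {"piggish": "pig", "pigs": "pig", "pigment": "pigment", "piny": "pine"}

def naive_stem(word):
    """Strip the first matching suffix, using no linguistic knowledge."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def lemma(word):
    """Prefer the dictionary; fall back to suffix stripping for unknown words."""
    return LEXICON.get(word, naive_stem(word))

for w in ["piggish", "pigment", "piny"]:
    print(w, naive_stem(w), lemma(w))
```

The naive stemmer conflates "pigment" with "pig" and "piny" with "pin", while the dictionary lookup keeps "pigment" separate and maps "piny" to "pine". The price is that the dictionary has to be built and maintained.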
Part of Speech Disambiguation • Q: oil well T: He did very well • Q: what is an are? T: They are nice • Important for English, Chinese. Less important for other types • Perhaps not so helpful directly, but is necessary for most other tasks • Usually statistical / heuristic methods
Word Sense Disambiguation • Q: bank account T: on the beautiful banks of Han river ... • bill: document, banknote, law, ax, peak, Gates... • Very frequent, almost any word in text • Statistical & dictionary methods • International competitions
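One dictionary-based method is the simplified Lesk idea: pick the sense whose dictionary gloss shares the most words with the context. The sketch below assumes a hypothetical two-sense inventory for "bank" with made-up glosses and a small stopword list; real glosses would come from a dictionary or WordNet.

```python
STOPWORDS = {"a", "an", "the", "of", "on", "in", "and", "or", "that", "for", "to"}

# Hypothetical mini sense inventory; the glosses are illustrative only.
SENSES = {
    "bank": {
        "financial": "institution that keeps money and customer account deposits",
        "river": "sloping land along the side of a river or lake",
    }
}

def content_words(text):
    """Lowercased tokens with stopwords removed."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def simplified_lesk(word, context):
    """Return the sense whose gloss overlaps most with the context words."""
    ctx = content_words(context) - {word}
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(ctx & content_words(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(simplified_lesk("bank", "bank account"))
print(simplified_lesk("bank", "on the beautiful banks of Han river"))
```

The target lexeme is passed in explicitly, since recognizing "banks" as a form of "bank" is itself a morphology task.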
Word relatedness • Q: female T: woman (women) • Synonyms. Subtypes/super-types • Dictionaries. WordNet. Similarity. Lesk. • Q: Korea T: Seoul • Other linguistic relationships (e.g., part) • Real-world relationships (facts) • Q: Clinton T: Lewinsky • Statistical co-occurrence (MI)
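Statistical co-occurrence can be scored with pointwise mutual information, PMI(x, y) = log2(P(x, y) / (P(x) P(y))). The sketch below estimates the probabilities as document frequencies over a hypothetical four-document toy collection; the collection and the function name `pmi` are assumptions made for illustration.

```python
import math

# Hypothetical toy collection (illustrative only).
DOCS = [
    "korea seoul capital",
    "seoul korea travel",
    "korea economy",
    "river travel",
]

def pmi(docs, x, y):
    """Pointwise mutual information of x and y co-occurring in a document."""
    bags = [set(d.lower().split()) for d in docs]
    n = len(bags)
    df_x = sum(x in b for b in bags)
    df_y = sum(y in b for b in bags)
    df_xy = sum(x in b and y in b for b in bags)
    if df_xy == 0:
        return float("-inf")  # the words never co-occur
    return math.log2((df_xy / n) / ((df_x / n) * (df_y / n)))

print(pmi(DOCS, "korea", "seoul"))
print(pmi(DOCS, "korea", "travel"))
```

A positive score means the two words appear together more often than chance would predict. Such scores can capture real-world associations that no dictionary lists, although on small collections the estimates are unreliable.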
Anaphora resolution • Q: Awards of Prof. Han T: Prof. Han said... He did... IBM awarded him... • Frequency • Phrases, co-occurrence, summarization, inference, translation • Heuristic (Mitkov) and knowledge-based methods • Other types of co-reference
Parsing, semantics • Q: Awards of Prof. Han T1: Prof. Han among many other prizes has several IBM awards T2: Mr. Kang has an award Prof. Han does not know of • Understanding of text • Rich structured representation • Better phrase search; question answering, summarization, ...
Synonymic rephrasing, reasoning • Q: experienced computer scientists T: Prof. Han has been programming for many years and was awarded an IBM award • Requires good syntactic and semantic analysis • Knowledge-based methods
Multilingual access • Q: 요구르트 T: We sell excellent yoghurt. Продаем йогурт. Se vende rico yogur. • Search multilingual collections • Europe: dozens of official languages of the EU • If you don’t know how to say it in English • Dictionaries, bilingual corpora, ...
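A minimal sketch of dictionary-based query translation for multilingual search, using the slide's own example. The `TRANSLATIONS` table, the extra bread document, and the function names are hypothetical; a real system would use full bilingual dictionaries or corpora and would have to disambiguate between competing translations.

```python
import re

# Hypothetical translation dictionary built from the slide's example.
TRANSLATIONS = {"요구르트": ["yoghurt", "йогурт", "yogur"]}

DOCS = [
    "We sell excellent yoghurt.",
    "Продаем йогурт.",
    "Se vende rico yogur.",
    "We sell fresh bread.",  # hypothetical non-matching document
]

def tokens(text):
    """Lowercased word tokens; \\w is Unicode-aware for str patterns."""
    return set(re.findall(r"\w+", text.lower()))

def expand(query):
    """Add every known translation of each query term."""
    terms = set()
    for term in query.lower().split():
        terms.add(term)
        terms.update(TRANSLATIONS.get(term, []))
    return terms

def search(query, docs):
    """Return the documents sharing at least one term with the expanded query."""
    terms = expand(query)
    return [d for d in docs if terms & tokens(d)]

for doc in search("요구르트", DOCS):
    print(doc)
```

The Korean query retrieves the English, Russian, and Spanish documents even though none of them contains the query string itself.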
Tasks are entangled • Many CL tasks require other tasks • Morphology → syntax → semantics • Many CL tasks form circles • parsing ← WSD ← parsing • I see a wild cat with a telescope (tripod?) • Can be done quick-and-dirty (?) • Fighting for the last few percent • Zipf law: 20% of men drink 80% of beer
Tools and infrastructure • Analysis tools • Tasks, methods • Dictionaries and grammars • Types, structure • Automatic acquisition • Corpora • Corpora analysis tools and methods
Possible tasks • WSD to help IR • Clustering + summarization in IR results • Anaphora and coreference resolution to help IR • Multilingual IR • Applications to Korean • ... a lot of others
Reading • Textbooks • Manning & Schütze, Allen, Jurafsky, Hausser, ... • CICLing proceedings • Computational Linguistics • Google, ResearchIndex
Questions • Who expects to publish? • Who will make a presentation at the next seminar?
Thank you! Till September 18