IR Lecture 1. Information Retrieval
Information Retrieval • Information retrieval is concerned with representing, searching, and manipulating large collections of electronic text and other human-language data. • Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
Basic techniques (Boolean Retrieval) • Searching, browsing, ranking, retrieval • Indexing algorithms and data structures
[Figure: IR shown alongside related areas: NLP, DB, ML-AI]
Database Management • Library and Information Science • Artificial Intelligence • Natural Language Processing • Machine Learning
Database Management • Focused on structured data stored in relational tables rather than free-form text. • Focused on efficient processing of well-defined queries in a formal language (SQL). • Clearer semantics for both data and queries. • Recent move towards semi-structured data (XML) brings it closer to IR.
Library and Information Science • Focused on the human user aspects of information retrieval (human-computer interaction, user interface, visualization). • Concerned with effective categorization of human knowledge. • Concerned with citation analysis and bibliometrics (structure of information). • Recent work on digital libraries brings it closer to CS & IR.
Artificial Intelligence • Focused on the representation of knowledge, reasoning, and intelligent action. • Formalisms for representing knowledge and queries: • First-order Predicate Logic • Bayesian Networks • Recent work on web ontologies and intelligent information agents brings it closer to IR.
Natural Language Processing • Focused on the syntactic, semantic, and pragmatic analysis of natural language text and discourse. • Ability to analyze syntax (phrase structure) and semantics could allow retrieval based on meaning rather than keywords.
Natural Language Processing: IR Directions • Methods for determining the sense of an ambiguous word based on context (word sense disambiguation). • Methods for identifying specific pieces of information in a document (information extraction). • Methods for answering specific NL questions from document corpora.
Machine Learning • Focused on the development of computational systems that improve their performance with experience. • Automated classification of examples based on learning concepts from labeled training examples (supervised learning). • Automated methods for clustering unlabeled examples into meaningful groups (unsupervised learning).
Machine Learning: IR Directions • Text Categorization • Automatic hierarchical classification (Yahoo). • Adaptive filtering/routing/recommending. • Automated spam filtering. • Text Clustering • Clustering of IR query results. • Automatic formation of hierarchies (Yahoo). • Learning for Information Extraction • Text Mining
Information Retrieval • The indexing and retrieval of textual documents. • Searching for pages on the World Wide Web is the most recent and widely used application. • Concerned firstly with retrieving relevant documents to a query. • Concerned secondly with retrieving from large sets of documents efficiently.
Information Retrieval Systems Given: • A corpus of textual natural-language documents. • A user query in the form of a textual string. Find: • A ranked set of documents that are relevant to the query. • Most IR systems share a basic architecture and organization (adapted to the requirements of specific applications). • Like any technical field, IR has its own jargon. • The next page illustrates the major components in an IR system.
[Figure: the IR system takes a document corpus and a query string as input and returns ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...)]
Before conducting a search, a user has an information need. • This information need is sometimes referred to as a topic. • This information need drives the search process. • A major task of a search engine is to maintain and manipulate an inverted index for a document collection. • The user constructs and issues a query to the IR system. • Typically this query consists of a small number of terms (we use «term» instead of «word»). • The index provides a mapping between terms and the locations in the collection in which they occur. • This index forms the principal data structure used by the engine for searching and relevance ranking.
Relevance • Relevance is a subjective judgment and may include: • Being on the proper subject. • Being timely (recent information). • Being authoritative (from a trusted source). • Satisfying the goals of the user and his/her intended use of the information (information need). • Simplest notion of relevance is that the query string appears verbatim in the document. • Slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words).
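The two simplest notions of relevance above can be sketched in a few lines of Python. This is only an illustration: the function names and the regular-expression tokenizer are assumptions made for the example, not part of the lecture.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def verbatim_match(query, document):
    """Strictest notion: the query string appears verbatim (ignoring case)."""
    return query.lower() in document.lower()

def bag_of_words_score(query, document):
    """Looser notion: total occurrences of the query words, in any order."""
    counts = Counter(tokenize(document))
    return sum(counts[word] for word in tokenize(query))

doc = "The noble Brutus hath told you Caesar was ambitious"
print(verbatim_match("Caesar Brutus", doc))      # False: the words are not adjacent in this order
print(bag_of_words_score("Caesar Brutus", doc))  # 2: each query word occurs once
```

The bag-of-words score still finds the document when the word order differs, which is exactly what the verbatim test misses.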
Problems • May not retrieve relevant documents that include synonymous terms. • “restaurant” vs. “café” • “Turkey” vs. “TR” • May retrieve irrelevant documents that include ambiguous terms. • “bat” (baseball vs. mammal) • “Apple” (company vs. fruit) • “bit” (unit of data vs. act of eating) For instance, the word "bank" has several distinct lexical definitions, including "financial institution" and "edge of a river".
IR System Architecture [Figure: components are User Interface, Text Operations, Query Operations, Indexing, Database Manager, Searching, and Ranking; the data flowing between them are the user need, text, logical view, user feedback, query, inverted file (index), text database, retrieved docs, and ranked docs]
IR Components • Text Operations forms index words (tokens). • Stopword removal • Stemming • Indexing constructs an inverted index of word to document pointers. • Searching retrieves documents that contain a given query token from the inverted index. • Ranking scores all retrieved documents according to a relevance metric.
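As a rough sketch of the text operations step, the snippet below removes stopwords and applies a deliberately crude suffix-stripping stemmer. Both the stopword list and `crude_stem` are toy assumptions for illustration; a real system would use a fuller list and an established stemmer such as Porter's.

```python
import re

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to"}

def crude_stem(token):
    """Strip one common suffix if at least three letters remain (toy stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def text_operations(text):
    """Tokenize, drop stopwords, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

print(text_operations("Searching and ranking the retrieved documents"))
# ['search', 'rank', 'retriev', 'document']
```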
IR Components • User Interface manages interaction with the user: • Query input and document output. • Relevance feedback. • Visualization of results. • Query Operations transform the query to improve retrieval: • Query expansion using a thesaurus. • Query transformation using relevance feedback.
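Query expansion using a thesaurus might look like the following minimal sketch. The `THESAURUS` dictionary is a hypothetical hand-built example; real systems draw synonyms from much larger resources.

```python
# Hypothetical hand-built thesaurus mapping a term to its synonyms.
THESAURUS = {
    "restaurant": ["café"],
    "café": ["restaurant"],
}

def expand_query(terms, thesaurus=THESAURUS):
    """Return the query terms followed by their synonyms, without duplicates."""
    expanded = []
    for term in terms:
        for candidate in [term] + thesaurus.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

print(expand_query(["cheap", "restaurant"]))  # ['cheap', 'restaurant', 'café']
```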
Web Search • Application of IR to HTML documents on the World Wide Web. • Differences: • Must assemble document corpus by spidering the web. • Can exploit the structural layout information in HTML (XML). • Documents change uncontrollably. • Can exploit the link structure of the web.
[Figure: Web search system. A web spider assembles the document corpus; the IR system takes the corpus and a query string and returns ranked documents (1. Page1, 2. Page2, 3. Page3, ...)]
Other IR-Related Tasks • Automated document categorization • Information filtering (spam filtering) • Information routing • Automated document clustering • Recommending information or products • Information extraction • Information integration • Question answering
History of IR • 1940-50’s: • World War II marked the official formation of information representation and retrieval. Because of the war, a massive number of technical reports and documents were produced to record the research and development activities surrounding weaponry production.
History of IR • 1960-70’s: • Initial exploration of text retrieval systems for “small” corpora of scientific abstracts, and law and business documents. • Development of the basic Boolean and vector-space models of retrieval. • Prof. Salton and his students at Cornell University are the leading researchers in the area.
IR History Continued • 1980’s: • Large document database systems, many run by companies: • Lexis-Nexis (on April 2, 1973, LEXIS launched publicly, offering full-text searching of all Ohio and New York cases) • Dialog (manual to computerized information retrieval) • MEDLINE (Medical Literature Analysis and Retrieval System Online)
IR History Continued • 1990’s: Networked Era • Searching FTPable documents on the Internet • Archie • WAIS • Searching the World Wide Web • Lycos • Yahoo • Altavista
IR History Continued • 1990’s continued: • Organized Competitions • NIST TREC (Text REtrieval Conference (TREC) is an on-going series of workshops focusing on a list of different information retrieval (IR) research areas, or tracks.) • Recommender Systems • Ringo • Amazon • Automated Text Categorization & Clustering
Recent IR History • 2000’s • Link analysis for Web Search • Google (PageRank) • Automated Information Extraction • Whizbang (Build Highly Structured Topic-Specific/Data-Centric Databases) (White Paper, “Information Extraction and Text Classification” via WhizBang! Corp) • Burning Glass (Burning Glass’s technology for reading, understanding, and cataloging information directly from free text resumes and job postings is truly state-of-the-art) • Question Answering • TREC Q/A track (Question answering systems return an actual answer, rather than a ranked list of documents, in response to a question.)
Recent IR History • 2000’s continued: • Multimedia IR • Image • Video • Audio and music • Cross-Language IR • DARPA Tides (Translingual Information Detection, Extraction and Summarization) • Document Summarization
An example information retrieval problem • A fat book which many people own is Shakespeare’s Collected Works. Suppose you wanted to determine which plays of Shakespeare contain the words Brutus AND Caesar AND NOT Calpurnia. • One way to do that is to start at the beginning and to read through all the text, noting for each play whether it contains Brutus and Caesar and excluding it from consideration if it contains Calpurnia. • The simplest form of document retrieval is for a computer to do this sort of linear scan through documents.
This process is commonly referred to as grepping. • Grepping through text can be a very effective process, especially given the speed of modern computers. • With modern computers, for simple querying of modest collections (the size of Shakespeare’s Collected Works is a bit under one million words of text in total), you really need nothing more.
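A linear scan of this kind can be sketched as follows. The `plays` dictionary is a stand-in for the full texts: the Hamlet entry is a single quoted line, and the other two entries are placeholder strings, not quotations.

```python
import re

def contains(word, text):
    """True if `word` occurs in `text` as a whole word, ignoring case."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

def linear_scan(plays):
    """Read every play and keep those matching Brutus AND Caesar AND NOT Calpurnia."""
    hits = []
    for title, text in plays.items():
        if (contains("Brutus", text) and contains("Caesar", text)
                and not contains("Calpurnia", text)):
            hits.append(title)
    return hits

plays = {
    "Hamlet": "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.",
    "Julius Caesar": "placeholder text mentioning Brutus, Caesar and Calpurnia",
    "The Tempest": "placeholder text mentioning none of the three names",
}
print(linear_scan(plays))  # ['Hamlet']
```

Every query rereads every document, so the cost grows with the size of the collection.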
But for many purposes, you do need more: • To process large document collections quickly. The amount of online data has grown at least as quickly as the speed of computers, and we would now like to be able to search collections that total in the order of billions to trillions of words. • To allow more flexible matching operations. For example, it is impractical to perform the query Romans NEAR countrymen with grep, where NEAR might be defined as “within 5 words” or “within the same sentence”. • To allow ranked retrieval: in many cases you want the best answer to an information need among many documents that contain certain words.
Indexing • The way to avoid linearly scanning the texts for each query is to index the documents in advance.
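To make the NEAR operator concrete, here is a small sketch that defines it as "within k words" over token positions. The `near` function and its tokenizer are assumptions for illustration; it still scans the text, so it shows what the operator means rather than how an index would answer it efficiently.

```python
import re

def near(term_a, term_b, text, k=5):
    """True if some occurrence of term_a is within k word positions of term_b."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positions_a = [i for i, t in enumerate(tokens) if t == term_a.lower()]
    positions_b = [i for i, t in enumerate(tokens) if t == term_b.lower()]
    return any(abs(i - j) <= k for i in positions_a for j in positions_b)

print(near("Romans", "countrymen", "Friends, Romans, countrymen."))  # True
```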
Basics of the Boolean Retrieval Model: Term-document incidence • Entry is 1 if term occurs. Example: Calpurnia occurs in Julius Caesar. • Entry is 0 if term doesn’t occur. Example: Calpurnia doesn’t occur in The Tempest.
So we have a 0/1 vector for each term. • To answer the query Brutus AND Caesar AND NOT Calpurnia: • Take the vectors for Brutus, Caesar, and Calpurnia • Complement the vector of Calpurnia • Do a (bitwise) AND on the three vectors • 110100 AND 110111 AND 101111 = 100100
Answers to query • Antony and Cleopatra, Act III, Scene ii. Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain. • Hamlet, Act III, Scene ii. Lord Polonius: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
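The bitwise computation can be checked with Python integers used as bit vectors. The Brutus and Caesar vectors and the complemented Calpurnia vector come from the slide. The order of plays in `PLAYS` is an assumption (the usual textbook column order); only the positions of Julius Caesar and The Tempest are stated in the slides.

```python
# Assumed column order of the incidence matrix (leftmost bit = first play).
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

incidence = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,  # its complement is 101111, as on the slide
}

MASK = (1 << len(PLAYS)) - 1  # keep only the six meaningful bits after complementing

result = incidence["Brutus"] & incidence["Caesar"] & (~incidence["Calpurnia"] & MASK)
print(format(result, "06b"))  # 100100

# Translate the set bits back into play titles.
matches = [play for i, play in enumerate(PLAYS)
           if (result >> (len(PLAYS) - 1 - i)) & 1]
print(matches)  # ['Antony and Cleopatra', 'Hamlet']
```

The mask is needed because `~` on a Python integer yields a negative number rather than a fixed-width complement.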
Bigger collections • Consider N = 10^6 documents, each with about 1000 tokens ⇒ total of 10^9 tokens. • On average 6 bytes per token, including spaces and punctuation ⇒ size of document collection is about 6 × 10^9 bytes = 6 GB. • Assume there are M = 500,000 distinct terms in the collection. • The term-document incidence matrix then has M × N = 500,000 × 10^6 = half a trillion 0s and 1s.
But the matrix has no more than one billion 1s, since the collection contains only 10^9 tokens. • The matrix is extremely sparse. • What is a better representation? • We only record the 1s.
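The arithmetic behind the sparsity claim can be worked through directly. The variable names are chosen for this sketch; the figures are the ones given on the slides.

```python
N = 10**6              # documents
tokens_per_doc = 1000  # approximate tokens per document
bytes_per_token = 6    # average, including spaces and punctuation
M = 500_000            # distinct terms

total_tokens = N * tokens_per_doc                  # 10^9 tokens
collection_bytes = total_tokens * bytes_per_token  # 6 * 10^9 bytes, about 6 GB
matrix_cells = M * N                               # 5 * 10^11 cells
max_ones = total_tokens                            # each token sets at most one cell to 1

density = max_ones / matrix_cells
print(f"{matrix_cells:.0e} cells, at most {density:.1%} of them are 1")
```

At most about 0.2% of the cells can be 1, so at least 99.8% are 0.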
Inverted index • For each term t, we must store a list of all documents that contain t. • Identify each by a docID, a document serial number. • Can we use fixed-size arrays for this?
Brutus → 1 2 4 11 31 45 173 174
Caesar → 1 2 4 5 6 16 57 132
Calpurnia → 2 31 54 101
What happens if the word Caesar is added to document 14?
Sec. 1.2 Inverted index • We need variable-size postings lists • On disk, a continuous run of postings is normal and best • In memory, can use linked lists or variable length arrays • Some tradeoffs in size/ease of insertion
Dictionary → Postings (each docID entry is a posting)
Brutus → 1 2 4 11 31 45 173 174
Caesar → 1 2 4 5 6 16 57 132
Calpurnia → 2 31 54 101
Sorted by docID (more later on why).
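A minimal in-memory sketch of variable-size postings lists uses a Python dict of sorted lists. The postings are the ones shown in the slide's diagram; `add_posting` is a hypothetical helper that illustrates what happens when Caesar is added to document 14.

```python
import bisect

# Dictionary: term -> postings list of docIDs, kept sorted.
index = {
    "Brutus":    [1, 2, 4, 11, 31, 45, 173, 174],
    "Caesar":    [1, 2, 4, 5, 6, 16, 57, 132],
    "Calpurnia": [2, 31, 54, 101],
}

def add_posting(index, term, doc_id):
    """Insert doc_id into the term's postings list, keeping it sorted and duplicate-free."""
    postings = index.setdefault(term, [])
    pos = bisect.bisect_left(postings, doc_id)
    if pos == len(postings) or postings[pos] != doc_id:
        postings.insert(pos, doc_id)

add_posting(index, "Caesar", 14)
print(index["Caesar"])  # [1, 2, 4, 5, 6, 14, 16, 57, 132]
```

A Python list grows as needed, which a fixed-size array cannot do. The insertion still shifts later elements, which is one of the size versus ease-of-insertion tradeoffs mentioned above.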
Inverted index construction [Figure: Documents to be indexed (Friends, Romans, countrymen.) → Tokenizer → Token stream (Friends Romans Countrymen) → Linguistic modules (more on these later) → Modified tokens (friend roman countryman) → Indexer → Inverted index (friend → 2 4; roman → 1 2; countryman → 13 16)]
Tokenization and preprocessing
Doc 1. I did enact Julius Caesar: I was killed i’ the Capitol; Brutus killed me.
Doc 2. So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious:
⇒
Doc 1. i did enact julius caesar i was killed i’ the capitol brutus killed me
Doc 2. so let it be with caesar the noble brutus hath told you caesar was ambitious
Indexer steps: Token sequence • Sequence of (Modified token, Document ID) pairs.
Doc 1. i did enact julius caesar i was killed i’ the capitol brutus killed me
Doc 2. so let it be with caesar the noble brutus hath told you caesar was ambitious
Indexer steps: Sort • Sort by terms • And then docID
Indexer steps: Dictionary & Postings • Multiple term entries in a single document are merged. • Split into Dictionary and Postings • Doc. frequency information is added. Why frequency? Will discuss later.
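The indexer steps (token sequence, sort, then dictionary and postings) can be sketched on the two example documents as follows. The tokenizer here is a simple assumption (lowercasing plus a regular expression), and storing each entry as a `(document frequency, postings)` tuple is just one convenient layout.

```python
import re
from itertools import groupby

docs = {
    1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious:",
}

def tokenize(text):
    """Lowercase and split into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

# Step 1: sequence of (modified token, docID) pairs.
pairs = [(term, doc_id) for doc_id, text in docs.items() for term in tokenize(text)]

# Step 2: sort by term, then by docID.
pairs.sort()

# Step 3: merge repeated entries per document and split into
# dictionary (term plus document frequency) and postings.
index = {}
for term, group in groupby(pairs, key=lambda pair: pair[0]):
    postings = sorted({doc_id for _, doc_id in group})
    index[term] = (len(postings), postings)

print(index["caesar"])  # (2, [1, 2])
print(index["killed"])  # (1, [1])
```

"killed" occurs twice in document 1 but yields a single posting, which is the merging step described above.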