Fundamentals of Information Retrieval
Illustration with Apache Lucene
By Majirus FANSI
Abstract (ApacheCon Europe 2012)
• Fundamentals of Information Retrieval
  • Core of any IR application
  • Scientific underpinning of information retrieval
  • Boolean and Vector Space Models
  • Inverted index construction and scoring
• Apache Lucene Library
Definition
Information Retrieval
• Finding material
  • usually documents
• Of an unstructured nature
  • usually text
• That satisfies an information need from within large collections
  • usually stored on computers
• A query is an attempt to communicate the information need
• “Some argue that on the web, users should specify more accurately what they want and add more words to their query. We disagree vehemently with this position.” S. Brin and L. Page, Google, 1998
An example IR problem
• Corporation’s internal documents
  • Technical docs, meeting reports, specs, …
  • Thousands of documents
• Lucene AND Cutting AND NOT Solr
  • Grepping the collection?
  • What about the response time?
• Flexible queries: lucene cutting ~5
At which scale do you operate?
• Web search
  • Search over billions of documents stored on millions of computers
• Personal search
  • Consumer operating systems integrate IR
  • Email program search
• Enterprise, domain-specific search
  • Retrieval for collections such as research articles
  • Scenario for software developers
Domain-specific search - Models
• Boolean models
  • Main option until approximately the arrival of the WWW
  • Queries in the form of Boolean expressions
• Vector Space Models
  • Free text queries
  • Queries and documents are viewed as vectors
• Probabilistic Models
  • Rank documents by their estimated probability of relevance w.r.t. the information need
  • Classification problem
Core notions
• Document
  • The unit we have decided to build a retrieval system on
  • Bad idea to index an entire book as a document
  • Bad idea to index a sentence in a book as a document
  • Precision/recall tradeoff
• Term
  • Indexed unit, usually a word
  • The set of terms is your IR dictionary
• Index
  • “An alphabetical list, such as one printed at the back of a book showing which page a subject is found on” Cambridge dictionary
  • We index documents to avoid grepping the texts
  • “Queries must be handled quickly, at a rate of hundreds to thousands per second” Brin and Page
How good are the retrieved docs?
• Precision
  • Fraction of retrieved docs that are relevant to the user’s information need
• Recall
  • Fraction of relevant docs in the collection that are retrieved
• “People are still only willing to look at the first few tens of results. Because of this, as the collection size grows, we need tools that have very high precision...This very high precision is important even at the expense of recall” Brin & Page
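The two fractions above can be computed directly from sets of docIDs. The following is a minimal sketch; the function name and the example docIDs are illustrative assumptions, not part of the talk.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of docIDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant docs that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 docs returned, 3 of them relevant,
# and 6 relevant docs exist in the whole collection.
p, r = precision_recall({1, 2, 3, 4}, {2, 3, 4, 7, 8, 9})
print(p, r)  # 0.75 0.5
```

Returning fewer results tends to raise precision and lower recall, which is the tradeoff the Brin and Page quote refers to.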
Index: structure and construction
Index structure
• Consider N = 1 million documents, each with about 1000 words
  • About 1 billion words in total
• M = 500K distinct terms among these
• Which structure for our index?
Term-document incidence matrix
• The matrix is extremely sparse
• Do we really need to record the 0s?
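A quick back-of-the-envelope check of the sparsity claim, using the collection sizes from the previous slide (1 million documents, about 1000 words each, 500K distinct terms). The variable names are illustrative.

```python
N = 1_000_000          # documents in the collection
WORDS_PER_DOC = 1_000  # approximate length of each document
M = 500_000            # distinct terms in the dictionary

cells = N * M                 # size of the term-document incidence matrix
max_ones = N * WORDS_PER_DOC  # upper bound on 1s: every word occurrence is a new term
fill_ratio = max_ones / cells

print(cells)       # 500000000000 (half a trillion cells)
print(max_ones)    # 1000000000 (at most one billion 1s)
print(fill_ratio)  # at most 0.2% of the cells are non-zero
```

At least 99.8% of the matrix is zeros, which is why only the 1s are worth recording.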
Inverted index
• Postings
  • For each term t, we must store a list of all documents that contain t
  • Identify each by a docID, a document serial number
• Dictionary → postings (sorted by docID)
  • Hatcher → 1 2 4 11 31 45 173
  • lucene → 1 2 4 5 6 16 57 132
  • solr → 2 31 54 101
• Each docID entry in a list is a posting (for example, 132 in the list for lucene)
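A minimal in-memory sketch of this structure: a dictionary that maps each term to its postings list, kept sorted by docID. The function name and the tiny document collection are assumptions made for illustration; this is not how Lucene stores its index on disk.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """docs maps docID -> list of terms; returns term -> sorted list of docIDs."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)  # a set ignores repeated occurrences in one doc
    return {term: sorted(doc_ids) for term, doc_ids in index.items()}


# Hypothetical collection of already-analyzed documents.
docs = {
    1: ["lucene", "hatcher"],
    2: ["lucene", "solr", "hatcher"],
    4: ["hatcher", "lucene"],
    31: ["solr", "hatcher"],
}
index = build_inverted_index(docs)
print(index["solr"])  # [2, 31]
```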
Inverted index construction
• Documents to be indexed: Friends, Romans, countrymen.
• Analysis step
  • Tokenizer → token stream: Friends Romans Countrymen
  • Linguistic modules → modified tokens: friend roman countryman
• Indexing step
  • Indexer → inverted index
    • friend → 2 4
    • roman → 1 2
    • countryman → 13 16
Analyzing the text: Tokenization
• Tokenization
  • Given a character sequence, tokenization is the task of chopping it up into pieces, called tokens
  • Perhaps at the same time throwing away characters such as punctuation
  • Language-specific
• Dropping common terms: stop words
  • Sort the terms by collection frequency
  • Take the most frequent terms as a candidate stop list and let the domain experts decide
  • Be careful about phrase queries: “Queen of England”
Analyzing the text: Normalization
• If you search for USA, you might hope to also match documents containing U.S.A
• Normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in character sequences
  • Removes accents and diacritics (cliché, naïve)
• Capitalization/case-folding
  • Reducing all letters to lowercase
• Stemming and lemmatization
  • Reduce inflectional forms and sometimes derivationally related forms of a word to a common base form
  • Porter stemmer
  • Ex: breathe, breathes, breathing reduced to breath
  • Increases recall while harming precision
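The tokenization and normalization steps can be sketched as a small analysis function. This is an illustration only: the regular expression, the function names, and the stop list are assumptions, stemming is left out, and Lucene's own analyzers work differently.

```python
import re
import unicodedata

# Dotted acronyms such as U.S.A stay one token; everything else is a run of word characters.
TOKEN_PATTERN = re.compile(r"(?:[A-Za-z]\.){2,}[A-Za-z]?|\w+")


def strip_accents(token):
    """Remove accents and diacritics: 'cliché' -> 'cliche'."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text, stop_words=frozenset()):
    """Tokenize, drop acronym dots, strip accents, lowercase, remove stop words."""
    terms = []
    for token in TOKEN_PATTERN.findall(text):
        term = strip_accents(token.replace(".", "")).lower()
        if term and term not in stop_words:
            terms.append(term)
    return terms


print(analyze("Friends, Romans, countrymen."))  # ['friends', 'romans', 'countrymen']
print(analyze("U.S.A") == analyze("USA"))       # True
print(analyze("Queen of England", {"of"}))      # ['queen', 'england']
```

The last call shows why stop words need care: once "of" is dropped, the phrase query “Queen of England” can no longer be matched exactly.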
Indexing steps: Dictionary & Postings
• Map: doc collection → list(termID, docID)
• Entries with the same termID are merged
• Reduce: (<termID1, list(docID)>, <termID2, list(docID)>, …) → (postings_list1, postings_list2, …)
• Positional indexes for phrase queries
  • Document frequency, term frequency, and positions are added
  • lucene (128): doc1, 2 <1, 8>
    • lucene occurs in 128 documents; in doc1 it occurs 2 times, at positions 1 and 8
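A single-process sketch of a positional index that records document frequency, term frequency, and positions. It is not a MapReduce implementation; the function names and the two sample documents are assumptions chosen so that lucene appears in doc 1 at positions 1 and 8, as in the slide's example.

```python
from collections import defaultdict


def build_positional_index(docs):
    """docs maps docID -> list of terms; returns term -> {docID: [positions]}."""
    index = defaultdict(dict)
    for doc_id, terms in docs.items():
        for position, term in enumerate(terms, start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index


def describe(index, term):
    """Return the document frequency and, per docID, (term frequency, positions)."""
    postings = index[term]
    return {
        "doc_freq": len(postings),
        "postings": {d: (len(p), p) for d, p in sorted(postings.items())},
    }


docs = {
    1: "lucene is a search library built around lucene segments".split(),
    2: "solr uses lucene".split(),
}
index = build_positional_index(docs)
print(describe(index, "lucene"))
# {'doc_freq': 2, 'postings': {1: (2, [1, 8]), 2: (1, [3])}}
```

With positions stored, a phrase query can check that the terms occur in adjacent positions within the same document.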
Lucene: Document, Fields, Index structure
How Lucene models content: Documents & Fields
• To index your raw content sources, you must first translate them into Lucene’s documents and fields
• A document is what is returned as a hit
  • It is a set of fields
• A field is what searches are performed on
  • It is the actual content holder
• Multi-valued field
  • Preferred to a catch-all field
Field options
• For indexing (enum Field.Index)
  • Index.ANALYZED: (body, title, …)
  • Index.NOT_ANALYZED: treats the field’s entire value as a single token (social security number, identifier, …)
  • Index.ANALYZED_NO_NORMS: doesn’t store norms information
  • Index.NO: don’t make this field value available for searching
• For storing fields (enum Field.Store)
  • Store.YES: stores the value of the field
  • Store.NO: recommended for large text fields
• doc.add(new Field("author", author, Field.Store.YES, Field.Index.ANALYZED));
Document and Field Boosting
• Boost a document
  • Instructs Lucene to consider it more or less important w.r.t. other documents in the index when computing relevance
  • doc.setBoost(boostValue)
  • boostValue > 1 upgrades the document
  • boostValue < 1 downgrades the document
• Boost a field
  • Instructs Lucene to consider a field more or less important w.r.t. other fields
  • aField.setBoost(boostValue)
  • Be careful with multi-valued fields
• Payload mechanism for per-term boosting
Lucene Index Structure
• IndexWriter.addDocument(doc) adds the document to the index
• After analyzing the input, Lucene stores it in an inverted index
  • Tokens extracted from the input doc are treated as lookup keys
• A Lucene index directory consists of one or more segments
  • Each segment is a standalone index (a subset of the indexed docs)
  • Documents are updated by deleting and reinserting them
  • Periodically, IndexWriter will select segments and merge them
• Lucene is a dynamic indexing tool
Boolean model
Query processing: AND
• Consider the query lucene AND solr
  • Locate lucene in the dictionary
  • Retrieve its postings
  • Locate solr in the dictionary
  • Retrieve its postings
  • Merge (intersect) the two postings lists
• Example
  • lucene → 2 4 8 16 32 64 128
  • solr → 1 2 3 5 8 13 21 34
  • Merge result → 2 8
• If the list lengths are x and y, the merge takes O(x+y) operations
• Crucial: postings sorted by docID
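The merge step can be sketched as a two-pointer walk over the two sorted postings lists. Each step advances at least one pointer, hence the O(x+y) bound. The function name is an assumption; the postings are the ones from the slide's figure.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by docID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # docID present in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer behind the smaller docID
        else:
            j += 1
    return answer


lucene = [2, 4, 8, 16, 32, 64, 128]
solr = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(lucene, solr))  # [2, 8]
```

If the lists were not sorted by docID, each lookup would need a scan or a hash table, and the linear-time merge would be lost.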
Boolean queries: Exact match
• The Boolean retrieval model lets you pose any query that is a Boolean expression
  • Boolean queries use AND, OR and NOT to join query terms
  • Views each doc as a set of words
  • Is precise: a document either matches the condition or not
• Lucene adds Boolean shortcuts like + and -
  • +lucene +solr means lucene AND solr
  • +lucene -solr means lucene AND NOT solr
Problem with boolean model
• Boolean queries often result in either too few (=0) or too many (1000s) results
  • AND gives too few; OR gives too many
• Considered for expert usage
  • As a user, are you able to process 1000 results?
• Limited w.r.t. the user’s information need
• Extended Boolean model with term proximity
  • “Apache Lucene” ~10
What do we need?
• A Boolean model only records term presence or absence
  • We wish to give more weight to documents that contain a term several times than to ones that contain it only once
  • Need for term frequency information in the postings lists
• Boolean queries just retrieve a set of matching documents
  • We wish to have an effective method to order the returned results
  • Requires a mechanism for determining a document score
  • The score encapsulates how good a match a document is for a query
Ranked retrieval
Ranked retrieval models
• Free text queries: rather than a query language of operators and expressions, the user’s query is just one or more words in a human language
• Rather than a set of documents satisfying a query expression, in ranked retrieval the system returns an ordering over the (top) documents in the collection for a query
• Large result sets are not an issue: just show the top k (≈ 10)
  • Premise: the ranking algorithm works
• Score is the key component of ranked retrieval models
Term frequency and weighting
• We would like to compute a score between a query term t and a document d
• The simplest way is to say score(q, d) = tf_{t,d}
  • The term frequency tf_{t,d} of term t in document d
  • Number of times that t occurs in d
• Relevance does not increase proportionally with term frequency
• Certain terms have little or no discriminating power in determining relevance
  • Need a mechanism for attenuating the effect of frequent terms
  • They are less informative than rare terms
tf-idf weighting
• The tf-idf weight of a term is the product of its tf weight and its idf weight
• Best-known weighting scheme in information retrieval
• Increases with the number of occurrences within a document
• Increases with the rarity of the term in the collection
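The slide does not fix a particular formula, so the sketch below assumes one common textbook variant: a log-scaled tf weight, 1 + log10(tf), and idf = log10(N / df). The function names are illustrative, and Lucene's own tf and idf definitions differ.

```python
import math


def tf_weight(tf):
    """Log-scaled term frequency: grows with tf, but less than proportionally."""
    return 1.0 + math.log10(tf) if tf > 0 else 0.0


def idf_weight(num_docs, doc_freq):
    """Inverse document frequency: large for rare terms, near 0 for common ones."""
    return math.log10(num_docs / doc_freq)


def tf_idf(tf, doc_freq, num_docs):
    """tf-idf weight as the product of the tf weight and the idf weight."""
    return tf_weight(tf) * idf_weight(num_docs, doc_freq)


# A term occurring 10 times in a doc and found in 1000 of 1,000,000 docs.
print(round(tf_idf(10, 1_000, 1_000_000), 6))  # 6.0
```

A term that occurs in every document gets idf = log10(1) = 0, so it contributes nothing to the score regardless of its frequency.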
Document vector
• At this point, we may view each document as a vector
  • with one component corresponding to each term in the dictionary, together with a tf-idf weight for each component
  • This is an M-dimensional vector, where M is the number of terms in the dictionary
  • For dictionary terms that do not occur in a document, this weight is zero
• In practice we consider d as a |q|-dimensional vector
  • |q| is the number of distinct terms in the query q
Vector Space Model (VSM)
VSM principles
• The documents in the collection are viewed as a set of vectors in a vector space
  • One axis for each term in the query
• The user query is treated as a very short doc
  • It is represented as a vector in this space
• VSM computes the similarity between the query vector and each document vector
  • Rank documents in increasing order of the angle between query and document
  • The user is returned the top-scoring documents
Cosine similarity
• How do you determine the angle between a document vector and a query vector?
• Instead of ranking in increasing order of the angle (q, d)
  • Rank documents in decreasing order of cosine(q, d)
• Thus the cosine similarity
  • The model assigns a score between 0 and 1
  • cos(0) = 1
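cosine(query, document)
• Fundamental to IR systems based on VSM
• $\cos(q, d) = \frac{q \cdot d}{|q|\,|d|} = \frac{q}{|q|} \cdot \frac{d}{|d|} = \frac{\sum_i q_i d_i}{\sqrt{\sum_i q_i^2}\,\sqrt{\sum_i d_i^2}}$
  • Numerator: the dot product of q and d
  • Denominator: the product of the Euclidean norms of q and d
  • Equivalently, the dot product of the unit vectors q/|q| and d/|d|
• q_i is the tf-idf weight of term i in the query q
• d_i is the tf-idf weight of term i in the document d
• Cosine is computed on the vector representatives to compensate for doc length
• Variations from one VS scoring method to another hinge on the specific choices of weights in the vectors v(d) and v(q)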
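A minimal sketch of the cosine computation on sparse vectors, each stored as a dict from term to weight. The function name and the example weights are illustrative assumptions.

```python
import math


def cosine(q, d):
    """Cosine similarity of two sparse vectors given as term -> weight dicts."""
    dot = sum(w * d.get(term, 0.0) for term, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # an empty vector has no direction
    return dot / (norm_q * norm_d)


query = {"lucene": 1.0, "solr": 1.0}
doc_a = {"lucene": 2.0, "solr": 2.0}       # same direction as the query
doc_b = {"lucene": 1.0}                    # matches one query term only
doc_c = {"hadoop": 3.0}                    # shares no term with the query

ranked = sorted(
    [("a", cosine(query, doc_a)), ("b", cosine(query, doc_b)), ("c", cosine(query, doc_c))],
    key=lambda pair: pair[1],
    reverse=True,
)
print([name for name, _ in ranked])  # ['a', 'b', 'c']
```

doc_a has larger weights than the query but the same direction, so its cosine is 1: dividing by the Euclidean norms removes the effect of vector length.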
Lucene scoring algorithm
• Lucene combines the Boolean Model (BM) of IR and the Vector Space Model (VSM) of IR
  • Documents “approved” by BM are scored by VSM
  • This is weighted zone scoring, or ranked Boolean retrieval
• The Lucene VSM score of document d for query q is the cosine similarity
• Lucene refines the VSM score for both search quality and ease of use
How does Lucene refine VSM?
• Normalizing the document vector by its Euclidean length eliminates all information on the length of the original document
  • Fine only if the doc is made of successive duplicates of distinct terms
• doc-len-norm(d) normalizes to a vector equal to or larger than the unit vector
  • It is a pivoted normalized document length
  • The compensation is independent of term and doc frequency
• Users can boost docs at indexing time
  • The score of a doc d is multiplied by doc-boost(d)
How does Lucene refine VSM (2)
• At search time, users can specify boosts for each query, sub-query, and query term
  • The contribution of a query term to the score of a document is multiplied by the boost of that query term (query-boost(q))
• A document may match a multi-term query without containing all the terms of that query
  • coord-factor(q, d) rewards documents matching more query terms
Lucene conceptual scoring formula
• Assuming the document is composed of only one field
• doc-len-norm(d) and doc-boost(d) are known at indexing time
  • They are computed in advance and their product is saved in the index as norm(d)
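The formula itself did not survive in this transcript. As a reference point, the conceptual formula given in Lucene's Similarity documentation combines the factors introduced on the previous slides as follows, where V(q) and V(d) are the weighted query and document vectors:

```latex
\mathrm{score}(q,d) =
  \text{coord-factor}(q,d) \cdot \text{query-boost}(q) \cdot
  \frac{V(q) \cdot V(d)}{|V(q)|} \cdot
  \text{doc-len-norm}(d) \cdot \text{doc-boost}(d)
```

The last two factors are the ones stored together in the index as norm(d).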
Lucene practical scoring function: DefaultSimilarity
• Derived from the conceptual formula, now assuming a document has more than one field
• idf(t) is squared because t appears in both d and q
• queryNorm(q) is computed by the query Weight object
• lengthNorm is computed so that shorter fields contribute more to the score
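The formula is missing from this transcript. The practical scoring function, as documented for Lucene's TF-IDF similarity, has the following shape; boost(t) stands for the search-time boost of term t, and norm(t,d) bundles the document boost, field boost, and lengthNorm of the field containing t:

```latex
\mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot
  \sum_{t \in q} \Big( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^2 \cdot
  \mathrm{boost}(t) \cdot \mathrm{norm}(t,d) \Big)
```

As far as the Lucene 3.x documentation goes, DefaultSimilarity fills in these factors with defaults along the following lines (worth checking against the Javadoc of the version in use):

```latex
\mathrm{tf}(t \in d) = \sqrt{\mathrm{freq}(t,d)}, \qquad
\mathrm{idf}(t) = 1 + \ln\frac{\mathrm{numDocs}}{\mathrm{docFreq}(t) + 1}, \qquad
\mathrm{lengthNorm} = \frac{1}{\sqrt{\mathrm{numTerms}}}
```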
Acknowledgments
A big thank you
• Pandu Nayak and Prabhakar Raghavan: Introduction to Information Retrieval
• Apache Lucene Dev Team
• S. Brin and L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine
• M. McCandless, E. Hatcher, and O. Gospodnetic: Lucene in Action, 2nd Ed.
• ApacheCon Europe 2012 organizers
• Management at Valtech Technology Paris
• Michels, Maj-Daniels, and Sonzia FANSI
• Of course, all of you for your presence and attention
To those whose life is dedicated to Education and Research By Majirus FANSI