Architecture of Search Engines: Index Construction & Query Process

Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University Lecture 6: Search Engines Mining Massive Datasets

Outline Architecture of Search Engines Index Construction Boolean Retrieval Vector Space Model for Ranked Retrieval

Paid Search Ads Algorithmic results.

Architecture How do search engines like Google work? Architecture of Search Engines (SE)

Architecture Architecture User Web spider Search Indexer The Web Indexes Ad indexes

Architecture Indexing Process

Architecture Indexing Process • Text acquisition • identifies and stores documents for indexing • Text transformation • transforms documents into index terms • Index creation • takes index terms and creates data structures (indexes) to support fast searching

Architecture Query Process

Architecture Query Process • User interaction • supports creation and refinement of query, display of results • Ranking • uses query and indexes to generate ranked list of documents • Evaluation • monitors and measures effectiveness and efficiency (primarily offline)

Architecture Details: Text Acquisition • Crawler • Identifies and acquires documents for search engine • Many types – web, enterprise, desktop • Web crawlers follow links to find documents • Must efficiently find huge numbers of web pages (coverage) and keep them up-to-date (freshness) • Single site crawlers for site search • Topicalorfocusedcrawlers for verticalsearch • Document crawlers for enterprise and desktop search • Follow links and scan directories

Architecture Web Crawler Starts with a set of seeds, which are a set of URLs given to it asparameters Seeds are added to a URL request queue Crawler starts fetching pages from the request queue Downloaded pages are parsed to find link tags that might contain other useful URLs to fetch New URLs added to the crawler’s request queue, or frontier Continue until no more new URLs or disk full

Architecture Crawling picture URLs crawled and parsed URLs frontier Web Unseen Web Seed pages

Architecture Crawling the Web

Architecture Text Transformation • Parser • Processing the sequence of text tokensin the document to recognize structural elements • e.g., titles, links, headings, etc. • Tokenizer recognizes “words” in the text • must consider issues like capitalization, hyphens, apostrophes, non-alpha characters, separators • Markup languages such as HTML, XML often used to specify structure • Tags used to specify document elements • E.g., <h2> Overview </h2> • Document parser uses syntax of markup language (or other formatting) to identify structure

Architecture Text Transformation • Stopping • Remove common words (stop words) • e.g., “and”, “or”, “the”, “in” • Some impact on efficiency and effectiveness • Can be a problem for some queries • Stemming • Group words derived from a common stem • e.g., “computer”, “computers”, “computing”, “compute” • Usually effective, but not for all queries • Benefits vary for different languages

Architecture Text Transformation • Link Analysis • Makes use of links and anchortextin web pages • Link analysis identifies popularity and community information • e.g., PageRank • Anchor text can significantly enhance the representation of pages pointed to by links • Significant impact on web search • Less importance in other applications

Architecture Text Transformation • Information Extraction • Identify classes of index terms that are important for some applications • e.g., named entity recognizers identify classes such as people, locations, companies, dates, etc. • Classifier • Identifies class-related metadata for documents • i.e., assigns labels to documents • e.g., topics, reading levels, sentiment, genre • Use depends on application

Architecture Index Creation • Document Statistics • Gathers counts and positions of words and other features • Used in ranking algorithm • Weighting • Computes weights for index terms • Used in ranking algorithm • e.g.,tf.idfweight • Combination of term frequency in document and inverse document frequency in the collection

Architecture Index Creation • Inversion • Core of indexing process • Converts document-term information to term-document for indexing • Difficult for very large numbers of documents • Format of inverted file is designed for fast query processing • Must also handle updates • Compression used for efficiency

Architecture Index Creation • Index Distribution • Distributes indexes across multiple computers and/or multiple sites • Essential for fast query processing with large numbers of documents • Many variations • Document distribution, term distribution, replication • P2P and distributed IR involve search across multiple sites

Architecture Query Process

Architecture User Interaction • Query input • Provides interface and parser for query language • Most web queries are very simple, other applications may use forms • Query language used to describe more complex queries and results of query transformation • e.g., Boolean queries, Indri and Galago query languages • similar to SQL language used in database applications • IR query languages also allow content and structure specifications, but focus on content

Architecture User Interaction • Query transformation • Improves initial query, both before and after initial search • Includes text transformation techniques used for documents • Spell checking and query suggestion provide alternatives to original query • Query expansion and relevance feedbackmodify the original query with additional terms

Architecture User Interaction • Results output • Constructs the display of ranked documents for a query • Generates snippets to show how queries match documents • Highlights important words and passages • Retrieves appropriate advertising in many applications • May provide clustering and other visualization tools

Architecture Ranking • Scoring • Calculates scores for documents using a ranking algorithm • Core component of search engine • Basic form of score is  qi di • qi and di are query and document term weights for term i • Many variations of ranking algorithms and retrieval models

Architecture Ranking • Performance optimization • Designing ranking algorithms for efficient processing • Term-at-a time vs. document-at-a-time processing • Safe vs. unsafe optimizations • Distribution • Processing queries in a distributed environment • Query broker distributes queries and assembles results • Caching is a form of distributed searching

Index Construction Unstructured data in 1680 • Which plays of Shakespeare contain the words BrutusANDCaesar but NOTCalpurnia? • One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia? • Why is that not the answer? • Slow (for large corpora) • NOTCalpurnia is non-trivial • Other operations (e.g., find the word Romans nearcountrymen) not feasible • Ranked retrieval (best documents to return) 28

Index Construction Term-document incidence 1 if play contains word, 0 otherwise BrutusANDCaesarBUTNOTCalpurnia

Index Construction Incidence vectors So we have a 0/1 vector for each term. To answer query: take the vectors for Brutus, Caesar and Calpurnia (complemented)  bitwise AND. 110100 AND 110111 AND 101111 = 100100. 30

Index Construction Answers to query Antony and Cleopatra,Act III, Scene ii Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain. Hamlet, Act III, Scene ii Lord Polonius: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me. 31

Index Construction Basic assumptions of Information Retrieval Collection: Fixed set of documents Goal: Retrieve documents with information that is relevant to the user’s information needand helps the user complete a task 32

Index Construction The classic search model Misconception? Mistranslation? Misformulation? Get rid of mice in a politically correct way TASK Info Need Info about removing mice without killing them Verbal form How do I trap mice alive? Query mouse trap SEARCHENGINE QueryRefinement Results Corpus

Index Construction Bigger collections • Consider N = 1 million documents, each with about 1000 words. • Avg 6 bytes/word including spaces/punctuation • 6GB of data in the documents. • Say there are M = 500K distinct terms among these. 34

Index Construction Can’t build the matrix Why? • 500K x 1M matrix has half-a-trillion 0’s and 1’s. • But it has no more than one billion 1’s. • matrix is extremely sparse. • What’s a better representation? • We only record the 1 positions. 35

Index Construction Inverted index 1 2 4 11 31 45 173 1 2 4 5 6 16 57 132 Brutus 174 Caesar Calpurnia 2 31 54 101 What happens if the word Caesar is added to document 14? • For each term t, we must store a list of all documents that contain t. • Identify each by a docID, a document serial number • Can we use fixed-size arrays for this? 36

Index Construction Inverted index 1 2 4 11 31 45 173 1 2 4 5 6 16 57 132 Dictionary Postings Posting Brutus 174 Caesar Calpurnia 2 31 54 101 Sorted by docID (more later on why). • We need variable-size postings lists • On disk, a continuous run of postings is normal and best • In memory, can use linked lists or variable length arrays • Some tradeoffs in size/ease of insertion 37

Index Construction Inverted index construction Tokenizer Friends Romans Countrymen Token stream. Linguistic modules friend friend roman countryman Modified tokens. roman Indexer 2 4 countryman 1 2 Inverted index. 16 13 Documents to be indexed. Friends, Romans, countrymen.

Index Construction Indexer steps: Token sequence Doc 1 Doc 2 I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me. So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious Sequence of (Modified token, Document ID) pairs.

Index Construction Indexer steps: Sort Core indexing step • Sort by terms • And then docID

Index Construction Indexer steps: Dictionary & Postings Multiple term entries in a single document are merged. Split into Dictionary and Postings Doc. frequency information is added.

Index Construction Where do we pay in storage? Lists of docIDs Terms and counts Pointers 42

Index Construction For web-scale indexing must use a distributed computing cluster Individual machines are fault-prone Can unpredictably slow down or fail How do we exploit such a pool of machines? MapReduce for indexing Distributed indexing

Index Construction We will use two sets of parallel tasks Parsers Inverters Break the input document collection into splits Each split is a subset of documents Parallel tasks

Index Construction Master assigns a split to an idle parser machine Parser reads a document at a time and emits (term, doc) pairs Parser writes pairs into j partitions Each partition is for a range of terms’ first letters (e.g., a-f, g-p, q-z) – here j = 3. Now to complete the index inversion Parsers

Index Construction Inverters An inverter collects all (term,doc) pairs (= postings) for one term-partition. Sorts and writes to postings lists

Index Construction Data flow Master assign assign Postings Parser Inverter a-f g-p q-z a-f Parser a-f g-p q-z Inverter g-p Inverter splits q-z Parser a-f g-p q-z Map phase Reduce phase Segment files

Index Construction Schema for index construction in MapReduce Schema of map and reduce functions map: input → list(k, v) reduce: (k,list(v)) → output Instantiation of the schema for index construction map: web collection → list(termID, docID) reduce: (<termID1, list(docID)>, <termID2, list(docID)>, …) → (postings list1, postings list2, …) Example for index construction map: d2 : C died. d1 : C came, C c’ed. → (<C, d2>, <died,d2>, <C,d1>, <came,d1>, <C,d1>, <c’ed, d1> reduce: (<C,(d2,d1,d1)>, <died,(d2)>, <came,(d1)>, <c’ed,(d1)>) → (<C,(d1:2,d2:1)>, <died,(d2:1)>, <came,(d1:1)>, <c’ed,(d1:1)>)

Boolean Retrieval • How do we process a (Boolean) query? 50

Architecture of Search Engines: Index Construction & Query Process

Architecture of Search Engines: Index Construction & Query Process

Presentation Transcript

Hongjian Li Department of Computer Science and Engineering Chinese University of Hong Kong

Hongjian Li Department of Computer Science and Engineering Chinese University of Hong Kong

Spring 2013 China ETA Program Shanghai Jiao Tong University

Xiangdong Ji University of Maryland Shanghai Jiao Tong University

Shanghai Jiao Tong University/ Northwestern University Dual MS Degree Program

Dept. Phys., Shanghai Jiao Tong Univ., China

Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University

Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University

Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University

Minglu Li ( Department of Computer Science and Engineering, Shanghai Jiao Tong University )

Shanghai Jiao Tong University

Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University

Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University

Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University

Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University

Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University

Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University

Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University

Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University

Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University

Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University

Yong Yang Shanghai Jiao Tong University On behalf of Collaboration