CSCI5250/ENGG5106: Information Retrieval and Search Engines

CSCI5250/ENGG5106: Information Retrieval and Search Engines • Lecture 1: Introduction and Boolean Information Retrieval • Prof. Michael R. Lyu

Outline • Administrative • Overall Course Introduction • Boolean Retrieval System (Ch.1 of IR Book) • Inverted Index • Processing Boolean Queries • Query Optimization

Motivation • Do you want to work in these companies?

Motivation of the Course • To understand the infrastructure and techniques behind search engines. • To know the existing literature and research challenges in the area of Information Retrieval. • To learn how to organize and manage huge amounts of information, such as that found on the Web. • To carry out a real project in Information Retrieval and/or Search Engines.

Textbook • Introduction to Information Retrieval • Christopher Manning, Associate Professor of Linguistics and Computer Science at Stanford • Prabhakar Raghavan, Consulting Professor of Computer Science at Stanford, Vice President of Engineering at Google, previously Head of Yahoo! Research • Hinrich Schütze, Chair of Theoretical Computational Linguistics, Institute for Natural Language Processing, University of Stuttgart

Textbook • Amazon • Link • PDF of the book for online viewing • http://www-nlp.stanford.edu/IR-book/

Instructors • Prof. Michael R. Lyu 呂榮聰 • www.cse.cuhk.edu.hk/~lyu • Room 927; lyu@cse • TA: HU Junjie 胡俊傑 • www.cse.cuhk.edu.hk/~jjhu • Room 1024; jjhu@cse • TA: ZHAO Tong 趙桐 • www.cse.cuhk.edu.hk/~tzhao • Room 1024; tzhao@cse

Time, Venue, and Website • Lecture • Monday from 9:30 am to 12:15 pm • LHC 104 & G04 (Y.C. Liang Hall, 潤昌堂) • Tutorial • Tuesday from 10:30 am to 11:15 am • LSB LT4 • Course URL • http://www.cse.cuhk.edu.hk/csci5250 • http://www.cse.cuhk.edu.hk/engg5106 • Course email: csci5250@cse / engg5106@cse

Grade Assessment Scheme • Two Assignments (20%) • Written assignments • Some programming • One Midterm Examination (40%) • 9:30am – 12:15pm, November 3, 2014 • Open one A4-size paper (double sided fine) • One Project (40%) • Presentations • Report

Class Project • Project is for everyone • 3-4 persons per project group • Each group is to design and implement a search engine of your choice • Email to csci5250@cse or engg5106@cse the names and student IDs of your group by Friday • Project specification and schedule will be assigned next Monday and published on course website.

STUDENT EXPECTATIONS • a positive, respectful, and engaged academic environment inside and outside the classroom; • to attend classes at regularly scheduled times without undue variations, and to receive before term-end adequate make-ups of classes that are canceled due to leave of absence of the instructor; • to receive a course syllabus • to consult with the instructor and tutors through regularly scheduled office hours or a mutually convenient appointment;

STUDENT EXPECTATIONS • to have reasonable access to University facilities and equipment for assignments and/or objectives; • to have access to guidelines on University’s definition of academic misconduct; • to have reasonable access to grading instruments and/or grading criteria for individual assignments, projects, or exams and to review graded material; • to consult with each course’s faculty member regarding the petition process for graded coursework.

FACULTY EXPECTATIONS • a positive, respectful, and engaged academic environment inside and outside the classroom; • students to arrive at class meetings on time; • to select qualified course tutors; • students to appear at office hours or a mutually agreed appointment for official academic matters; • full attendance at examinations, midterms, presentations, and laboratories;

FACULTY EXPECTATIONS • students to be prepared for class, appearing with appropriate materials and having completed assigned readings and homework; • full engagement within the classroom, including focus during lectures, appropriate and relevant questions, and class participation; • to cancel class due to emergency situations and to cover missed material during subsequent classes; • students to act with integrity and honesty. • CUHK has zero tolerance for plagiarism. Read: http://www.cuhk.edu.hk/policy/academichonesty/

Outline • Administrative • Overall Course Introduction • Boolean Retrieval System • Inverted Index • Processing Boolean Queries • Query Optimization

Definition of Information Retrieval • Information retrieval (IR) is finding material (usually documents) of an unstructured nature that satisfies an information need from within large collections (usually stored on computers).

Information Retrieval • A hot topic in both industry and the research community

Information Retrieval • Conferences related to IR • SIGIR • WWW • AAAI • CIKM • WSDM • KDD • TREC • ECIR • ACL • EMNLP • COLING • …

Search Engine Issues • Domain of Information • Size, type, etc. • Search Interface • User Interface • Hardware Systems • Scaling Problems • Performance Issues • Search Accuracy • Search Speed

Anatomy of A Search Page

Anatomy of A Search Result Page

Anatomy of A Search Engine (circa 1997)

Search Engine Modules • Crawling • Storage • Indexing • Queries

Crawler • Web crawling (the downloading of web pages) is done by several distributed crawlers. • The URLserver sends lists of URLs to be fetched to the crawlers. • The web pages that are fetched are then sent to the storeserver. • The storeserver then compresses and stores the web pages into a repository. • Every web page is assigned an associated ID number called a docID.

Indexer • The indexing function is performed by the indexer and the sorter. • The indexer reads the repository, uncompresses the documents, and parses them. • Each document is converted into a set of word occurrences called hits. • The hits record the word, position in document, an approximation of font size, and capitalization. • The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index. • It parses out all the links in every web page and stores important information about them in an anchors file. • This file contains enough information to determine where each link points from and to, and the text of the link.

URLresolver • The URLresolver reads the anchors file and converts relative URLs into absolute URLs and in turn into docIDs. • It puts the anchor text into the forward index, associated with the docID that the anchor points to. • It also generates a database of links, which are pairs of docIDs. • The links database is used to compute PageRanks for all the documents.
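The URLresolver steps above can be sketched roughly as follows. This is a minimal illustration, not the original system's code: the function name resolve_anchors, the tuple layout of the anchors file, and the example.org URLs are all assumptions made for the example. Only urllib.parse.urljoin is a real standard-library call.

```python
from urllib.parse import urljoin


def resolve_anchors(anchors, doc_ids):
    """Turn anchor records into docID link pairs and per-target anchor text.

    anchors: list of (source_url, href, anchor_text); href may be relative.
    doc_ids: dict mapping absolute URL -> docID (hypothetical layout).
    """
    links = []        # pairs (source docID, target docID)
    anchor_text = {}  # target docID -> list of anchor texts pointing at it
    for source_url, href, text in anchors:
        # Relative URL -> absolute URL, resolved against the source page.
        target_url = urljoin(source_url, href)
        if source_url not in doc_ids or target_url not in doc_ids:
            continue  # skip links to pages that have no docID yet
        src, dst = doc_ids[source_url], doc_ids[target_url]
        links.append((src, dst))
        # Anchor text is attached to the page the link points TO.
        anchor_text.setdefault(dst, []).append(text)
    return links, anchor_text


# Toy data (invented for illustration).
doc_ids = {
    "http://example.org/a/index.html": 1,
    "http://example.org/a/about.html": 2,
    "http://example.org/b.html": 3,
}
anchors = [
    ("http://example.org/a/index.html", "about.html", "About us"),
    ("http://example.org/a/index.html", "../b.html", "Page B"),
    ("http://example.org/b.html", "/a/index.html", "Home"),
]
links, anchor_text = resolve_anchors(anchors, doc_ids)
print(links)
print(anchor_text)
```

The resulting list of docID pairs is the kind of links database that a PageRank computation would take as input.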

Sorter • The sorter takes the barrels, which are sorted by docID, and re-sorts them by wordID to generate the inverted index. • The sorter also produces a list of wordIDs and offsets into the inverted index. • A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher. • The searcher is run by a web server and uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks to answer queries.
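The re-sorting step can be illustrated with a small sketch. It assumes a simplified hit of the form (docID, wordID, position) and ignores barrels, font size, and capitalization; the function name invert is hypothetical.

```python
from collections import defaultdict


def invert(forward_hits):
    """Re-sort hits from docID order into wordID order.

    forward_hits: list of (doc_id, word_id, position), sorted by doc_id
    (a simplified stand-in for the forward index).
    Returns a dict: word_id -> list of (doc_id, position).
    """
    # Re-sort by wordID first, then docID and position.
    resorted = sorted(forward_hits, key=lambda h: (h[1], h[0], h[2]))
    inverted = defaultdict(list)
    for doc_id, word_id, position in resorted:
        inverted[word_id].append((doc_id, position))
    return dict(inverted)


# Toy forward index (invented): two documents, two wordIDs.
forward = [(1, 7, 0), (1, 3, 1), (2, 3, 0), (2, 7, 4)]
inverted = invert(forward)
print(inverted)
```

After inversion, all occurrences of one word sit together, which is what lets a searcher answer a query by reading one list per query term.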

Topics to Cover • Boolean retrieval • The term vocabulary & postings lists • Dictionaries and tolerant retrieval • Scoring, term weighting & the vector space model • Computing scores in a complete search system • Probabilistic information retrieval • Text classification & Naive Bayes • Matrix decompositions & latent semantic indexing • Web search basics • Web crawling and indexes • Link analysis • Multimedia Information Retrieval

Information Retrieval and Search Engine (diagram; book chapter numbers in parentheses) • Web Crawling (20) • Document Parsing (1, 2) • Indexing (2, 20) • Indices • Ads (19) • Scoring (6, 7, 9, 11, 12, 18) • Classification (13, 14, 15) • Result Clustering (16, 17) • Quality Ranking (21) • Query • Query Processing (3)

Crawling (Ch. 20) • Initialize queue with URLs of known seed pages • Repeat • Take URL from queue • Fetch and parse page • Extract URLs from page • Add URLs to queue • Fundamental assumption: The Web is well linked
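The loop above can be sketched as a breadth-first traversal. To keep the example runnable without network access, it assumes a hypothetical fetch_links callback and an in-memory TOY_WEB dictionary in place of real fetching and parsing; the seen set and the max_pages limit are additions that keep the loop from revisiting pages or running forever.

```python
from collections import deque


def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl starting from the seed URLs.

    fetch_links(url) stands in for "fetch and parse page, extract URLs";
    it must return the list of URLs found on that page.
    """
    queue = deque(seeds)   # initialize queue with seed URLs
    seen = set(seeds)      # URLs already queued, to avoid repeats
    order = []             # pages in the order they were fetched
    while queue and len(order) < max_pages:
        url = queue.popleft()          # take URL from queue
        order.append(url)
        for link in fetch_links(url):  # fetch, parse, extract URLs
            if link not in seen:
                seen.add(link)
                queue.append(link)     # add new URLs to queue
    return order


# Toy link graph (invented): page -> pages it links to.
TOY_WEB = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
visited = crawl(["a"], lambda url: TOY_WEB.get(url, []))
print(visited)
```

Page "d" is reached only because "b" links to it, which is the "well linked" assumption in miniature: a page with no inbound path from the seeds would never be found.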

Crawling (Ch. 20) • How do we distribute the crawler so we can scale up? • We can’t index everything: we need to subselect. How? • How do we fight against spam and spider traps? • What are basic requirements a crawler should meet?

Document Parsing (Ch. 2) • What decisions should we make when parsing a document? • Language? Character set? Tokenization?

Indexing (Ch. 2, 20) • Why do we need an index? • To gain the speed benefits of indexing at retrieval time, we need to build the index in advance • Search query: Brutus Calpurnia
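As a rough illustration of why a prebuilt index helps, the sketch below builds a tiny inverted index and answers the query Brutus Calpurnia as a Boolean AND by merging two sorted postings lists. The four documents are invented for the example, and whitespace tokenization with lowercasing is a simplifying assumption.

```python
def build_index(docs):
    """Map each term to a sorted list of the docIDs that contain it."""
    index = {}
    for doc_id in sorted(docs):  # ascending docIDs keep postings sorted
        for term in set(docs[doc_id].lower().split()):
            index.setdefault(term, []).append(doc_id)
    return index


def intersect(p1, p2):
    """Merge two sorted postings lists, keeping docIDs present in both."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result


# Toy collection (invented for illustration).
docs = {
    1: "Brutus killed Caesar",
    2: "Calpurnia warned Caesar",
    3: "Brutus and Calpurnia spoke",
    4: "Brutus met Calpurnia again",
}
index = build_index(docs)
hits = intersect(index.get("brutus", []), index.get("calpurnia", []))
print(hits)
```

At query time only the two postings lists are read, rather than every document, and the merge takes time linear in the combined length of the lists.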

Indexing (Ch. 2, 20) • If we employ a distributed crawler, how do we construct distributed indices? • Partition by terms? • Partition by documents?
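The two partitioning options can be contrasted with a small sketch. The shard assignment rules here (docID modulo the shard count, and a CRC32 hash of the term) are assumptions chosen for simplicity, not a recommendation from the lecture.

```python
import zlib
from collections import defaultdict


def partition_by_documents(postings, num_shards):
    """Each shard holds a full index over its own subset of documents."""
    shards = [defaultdict(list) for _ in range(num_shards)]
    for term, doc_ids in postings.items():
        for doc_id in doc_ids:
            shards[doc_id % num_shards][term].append(doc_id)
    return [dict(shard) for shard in shards]


def partition_by_terms(postings, num_shards):
    """Each shard holds the complete postings list for a subset of terms."""
    shards = [{} for _ in range(num_shards)]
    for term, doc_ids in postings.items():
        shard_no = zlib.crc32(term.encode("utf-8")) % num_shards
        shards[shard_no][term] = list(doc_ids)
    return shards


# Toy postings lists (invented).
postings = {"brutus": [1, 2, 4], "caesar": [1, 2, 3, 4], "calpurnia": [2, 3]}
by_docs = partition_by_documents(postings, 2)
by_terms = partition_by_terms(postings, 2)
print(by_docs)
print(by_terms)
```

With document partitioning, every shard must be consulted for every query and the partial results merged. With term partitioning, a query touches only the shards that own its terms, but a multi-term query may need postings lists shipped between machines.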

Query Processing (Ch. 3) • How do we deal with wildcard queries? E.g. mon*, *mon • How do we do spelling correction? E.g., googel->google
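One simple way to approach both questions is sketched below, under stated assumptions: the vocabulary is a small in-memory list, trailing wildcards (mon*) are answered by binary search over the sorted vocabulary, leading wildcards (*mon) by the same search over reversed terms, and spelling correction picks the vocabulary term with the smallest Levenshtein edit distance. All function names and the toy vocabulary are hypothetical.

```python
import bisect


def prefix_matches(sorted_vocab, prefix):
    """Terms starting with prefix, found by binary search (handles mon*)."""
    lo = bisect.bisect_left(sorted_vocab, prefix)
    hi = bisect.bisect_left(sorted_vocab, prefix + "\uffff")
    return sorted_vocab[lo:hi]


def suffix_matches(vocab, suffix):
    """Terms ending with suffix, via a reversed-term dictionary (handles *mon)."""
    reversed_vocab = sorted(word[::-1] for word in vocab)
    found = prefix_matches(reversed_vocab, suffix[::-1])
    return sorted(word[::-1] for word in found)


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def correct(word, vocab):
    """Closest vocabulary term by edit distance (ties broken alphabetically)."""
    return min(vocab, key=lambda w: (edit_distance(word, w), w))


# Toy vocabulary (invented).
vocab = sorted(["google", "lemon", "salmon", "monday", "money", "month"])
print(prefix_matches(vocab, "mon"))
print(suffix_matches(vocab, "mon"))
print(correct("googel", vocab))
```

Scanning the whole vocabulary for the closest term is fine for a toy list but too slow for a real dictionary, where candidate terms would first be narrowed down by some cheaper filter.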

Classification (13,14,15) • How does a computer know whether a news article is about technology or health?

Clustering (16,17) • Document clustering is the process of grouping a set of documents into clusters of similar documents

Quality Ranking (21) • There are millions of documents relevant to the query “information retrieval”. How do we rank them? • Some spam pages repeat keywords many times. How do we downgrade their rankings?

Scoring (6,7,9,11,12,18) • Goal: measure how relevant a document is to a query • Term frequency, inverse document frequency • Vector space modeling • Relevance feedback • Probabilistic information retrieval • Language modeling • Latent semantic indexing • …
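As a small taste of the first item, the sketch below scores documents with one common tf-idf variant: each query term present in a document contributes (1 + log10 tf) × log10(N / df). Other weightings exist, and the function name, tokenization, and toy documents are assumptions made for the example.

```python
import math
from collections import Counter


def tf_idf_scores(query, docs):
    """Rank documents by summed tf-idf weight of the query terms.

    tf = occurrences of the term in the document,
    df = number of documents containing the term, N = collection size.
    """
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    n = len(docs)
    df = Counter(t for tokens in tokenized.values() for t in set(tokens))
    scores = {}
    for d, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term]:  # terms absent from the document add nothing
                score += (1 + math.log10(tf[term])) * math.log10(n / df[term])
        scores[d] = score
    # Highest score first; ties broken by docID.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy collection (invented).
docs = {
    1: "information retrieval retrieval",
    2: "information systems",
    3: "web search",
}
ranking = tf_idf_scores("information retrieval", docs)
print(ranking)
```

Document 1 ranks first because it contains the rarer term "retrieval", which has a higher inverse document frequency than "information".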

Motivation of this Lecture • Introduce the simplest form of information retrieval system • Boolean information retrieval • Understand each component of the Boolean information retrieval system

Does Google Use the Boolean Model? • On Google, the default interpretation of a query [w1 w2 . . . wn] is w1 AND w2 AND . . . AND wn

Cases where you get hits that do not contain one of the wi • Anchor text • Anchor text usually gives relevant descriptive and contextual information about the content of the link's destination • <a href="http://en.wikipedia.org/wiki/Main_Page">Wikipedia</a> Anchor text: “Wikipedia”

Cases where you get hits that do not contain one of the wi • Page contains a variant of wi (morphology, spelling correction, synonym) • Long queries (n is large) • Boolean expression which generates very few hits

Simple Boolean vs. Ranking of Result Set • Simple Boolean retrieval returns matching documents in no particular order • Google (and most well-designed Boolean engines) rank the result set: they rank good hits (according to some estimator of relevance) higher than bad hits.
