620 likes | 890 Views
Search Engine Technology (1). Prof. Dragomir R. Radev radev@cs.columbia.edu. SET FALL 2013. … Introduction … … … …. Examples of search engines. Conventional (library catalog). Search by keyword, title, author, etc .
E N D
Search Engine Technology(1) Prof. Dragomir R. Radev radev@cs.columbia.edu
SET FALL 2013 • … • Introduction • … • … • … • …
Examples of search engines • Conventional (library catalog). Search by keyword, title, author, etc. • Text-based (Lexis-Nexis, Google, Yahoo!).Search by keywords. Limited search using queries in natural language. • Multimedia (QBIC, WebSeek, SaFe)Search by visual appearance (shapes, colors,… ). • Question answering systems (Ask, NSIR, Answerbus)Search in (restricted) natural language • Clustering systems (Vivísimo, Clusty) • Research systems (Lemur, Nutch)
What does it take to build a search engine? • Decide what to index • Collect it • Index it (efficiently) • Keep the index up to date • Provide user-friendly query facilities
What else? • Understand the structure of the web for efficient crawling • Understand user information needs • Preprocess text and other unstructured data • Cluster data • Classify data • Evaluate performance
Goals of the course • Understand how search engines work • Understand the limits of existing search technology • Learn to appreciate the sheer size of the Web • Learn to write code for text indexing and retrieval • Learn about the state of the art in IR research • Learn to analyze textual and semi-structured data sets • Learn to appreciate the diversity of texts on the Web • Learn to evaluate information retrieval • Learn about standardized document collections • Learn about text similarity measures • Learn about semantic dimensionality reduction • Learn about the idiosyncracies of hyperlinked document collections • Learn about web crawling • Learn to use existing software • Understand the dynamics of the Web by building appropriate mathematical models • Build working systems that assist users in finding useful information on the Web
Course logistics • Wednesdays 6:10-7:55 in 410 IAB • Dates: • Sep 4, 11, 18, 25 • Oct 2, 9, 16, 23, 30 • Nov 6, 13, 20, 27 • Dec 4 + final in mid-December, date TBA • URL: http://www1.cs.columbia.edu/~cs6998/ • Instructor: Dragomir Radev • Email: radev@cs.columbia.edu • Office hours: TBA • TAs: Amit Ruparel and Ashlesha Shirbhate • {ar3202, ass2167}@columbia.edu • set_ta@lists.cs.columbia.edu
Course outline • Classic document retrieval: storing, indexing, retrieval • Web retrieval: crawling, query processing. • Text and web mining: classification, clustering • Network analysis: random graph models, centrality, diameter and clustering coefficient
Syllabus • Introduction. • Queries and Documents. Models of Information retrieval. The Boolean model. The Vector model. • Document preprocessing. Tokenization. Stemming. The Porter algorithm. Storing, indexing and searching text. Inverted indexes. • Word distributions. The Zipf distribution. The Benford distribution. Heap's law. TF*IDF. Vector space similarity and ranking. • Retrieval evaluation. Precision and Recall. F-measure. Reference collections. The TREC conferences. • Automated indexing/labeling. Compression and coding. Optimal codes. • String matching. Approximate matching. • Query expansion. Relevance feedback. • Text classification. Naive Bayes. Feature selection. Decision trees.
Syllabus • Linear classifiers. k-nearest neighbors. Perceptron. Kernel methods. Maximum-margin classifiers. Support vector machines. Semi-supervised learning. • Lexical semantics and Wordnet. • Latent semantic indexing. Singular value decomposition. • Vector space clustering. k-means clustering. EM clustering. • Random graph models. Properties of random graphs: clustering coefficient, betweenness, diameter, giant connected component, degree distribution. • Social network analysis. Small worlds and scale-free networks. Power law distributions. Centrality. • Graph-based methods. Harmonic functions. Random walks. • PageRank. Hubs and authorities. Bipartite graphs. HITS. • Models of the Web.
Syllabus • Crawling the web. Webometrics. Measuring the size of the web. The Bow-tie-method. • Hypertext retrieval. Web-based IR. Document closures. Focused crawling. • Question answering • Burstiness. Self-triggerability • Information extraction • Adversarial IR. Human behavior on the web. • Text summarization POSSIBLE TOPICS • Discovering communities, spectral clustering • Semi-supervised retrieval • Natural language processing. XML retrieval. Text tiling. Human behavior on the web.
Readings • required: Information Retrieval by Manning, Schuetze, and Raghavan (http://nlp.stanford.edu/IR-book/information-retrieval-book.html), freely available, hard copy for sale • optional: Modeling the Internet and the Web: Probabilistic Methods and Algorithms by Pierre Baldi, Paolo Frasconi, Padhraic Smyth, Wiley, 2003, ISBN: 0-470-84906-1 (http://ibook.ics.uci.edu). • papers from SIGIR, WWW and journals (to be announced in class).
Prerequisites • Linear algebra: vectors and matrices. • Calculus: Finding extrema of functions. • Probabilities: random variables, discrete and continuous distributions, Bayes theorem. • Programming: experience with at least one web-aware programming language such as Perl (highly recommended) or Java in a UNIX environment. • Required CS account
Course requirements • Three assignments (30%) • Some of them will be in Perl. The rest can be done in any appropriate language (e.g. Python or Java). All will involve some data analysis and evaluation. • Final project (30%) • Research paper or software system. • Class participation (10%) • Final exam (30%)
Final project format • Research paper - using the SIGIR format. Students will be in charge of problem formulation, literature survey, hypothesis formulation, experimental design, implementation, and possibly submission to a conference like SIGIR or WWW. • Software system - develop a working system or API. Students will be responsible for identifying a niche problem, implementing it and deploying it, either on the Web or as an open-source downloadable tool. The system can be either stand alone or an extension to an existing one.
Active research projects • Scientific paper analysis, bibliometrics • Citation analysis • Question answering • Social media • Political debates • Blogs and rumors • IR for the humanities • Health IR • Collective intelligence • Sentiment analysis and word polarity • Cartoons • Social networks
More project ideas • Shingling • Build a language identification system. • Participate in the Netflix challenge. • Query log analysis. • Build models of Web evolution. • Information diffusion in blogs or web. • Author-topic models of web pages. • Using the web for machine translation. • News recommendation system. • Compress the text of Wikipedia (losslessly). • Spelling correction using query logs. • Automatic query expansion.
List of projects from the past • Document Closures for Indexing • Tibet - Table Structure Recognition Library • Ruby Blog Memetracker • Sentence decomposition for more accurate information retrieval • Extracting Social Networks from LiveJournal • Google Suggest Programming Project (Java Swing Client and Lucene Back-End) • Leveraging Social Networks for Organizing and Browsing Shared Photographs • Media Bias and the Political Blogosphere • Measuring Similarity between search queries • Extracting Social Networks and Information about the people within them from Text • LSI + dependency trees
Netflix challenge AOL query logs Blogs Bio papers AAN Email Generifs Web pages Political science corpus VAST del.icio.us SMS News data: aquaint, tdt, nantc, reuters, setimes, trec, tipster Europarl multilingual US congressional data DMOZ Pubmedcentral DUC/TAC Timebank Wikipedia wt2g/wt10g/wt100g dotgov RTE Paraphrases GENIA Generifs Hansards IMDB MTA/MTC nie cnnsumm Poliblog Sentiment xml epinions Enron Available corpora
Related courses elsewhere • Stanford (Chris Manning, Prabhakar Raghavan, and Hinrich Schuetze) • Cornell (Jon Kleinberg) • CMU (Yiming Yang and Jamie Callan) • UMass (James Allan) • UTexas (Ray Mooney) • Illinois (Chengxiang Zhai) • Johns Hopkins (David Yarowsky) • UNT (Rada Mihalcea)
The size of the World Wide Web • The size of the indexed world wide web pages (By Sep.4, 2012) • Indexed by Google: about 40 billion pages • Indexed by Bing: about 16.5 billion pages • Indexed by Yahoo: about 4.8 billion pages http://www.worldwidewebsize.com/
Twitter hits 400 million tweets per day (June, 2012. Dick Costolo, CEO at Twitter) • Over 2.5 billion photos uploaded to Facebook each month (2010. blog.facebook.com) • Google’s clusters process a total of more than 20 petabytes of data per day. (2008. Jeffrey Dean from Google [link])
55 Million WordPress Sites in the World • WordPress.com users produce about 500,000 new posts and 400,000 new comments on an average day http://en.wordpress.com/stats/
Dynamically generated content • New pages get added all the time • The size of the blogosphere doubles every 6 months • Yahoo deals with 12TB of data per day (according to Ron Brachman)
SET FALL 2013 … 2. Models of Information retrieval The Vector model The Boolean model … …
Sample queries (from Excite) In what year did baseball become an offical sport? play station codes . com birth control and depression government "WorkAbility I"+conference kitchen appliances where can I find a chines rosewood tiger electronics 58 Plymouth Fury How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero? emeril Lagasse Hubble M.S Subalaksmi running
Key Terms Used in IR • QUERY: a representation of what the user is looking for - can be a list of words or a phrase. • DOCUMENT: an information entity that the user wants to retrieve • COLLECTION: a set of documents • INDEX: a representation of information that makes querying easier • TERM: word or concept that appears in a document or a query
Mappings and abstractions Reality Data Information need Query From Robert Korfhage’s book
Documents • Not just printed paper • Can be records, pages, sites, images, people, movies • Document encoding (Unicode) • Document representation • Document preprocessing (e.g., removing metadata) • Words, terms, types, tokens
Sample query sessions (from AOL) • toley spies gramestolley spies gamestotally spies games • tajmahal restaurant brooklyn nytaj mahal restaurant brooklyn nytaj mahal restaurant brooklyn ny 11209 • do you love me like you saydo you love me like you say lyricsdo you love me like you say lyrics marvin gaye
Characteristics of user queries • Sessions: users revisit their queries. • Very short queries: typically 2 words long. • A large number of typos. • A small number of popular queries. A long tail of infrequent ones. • Almost no use of advanced query operators with the exception of double quotes
Queries as documents • Advantages: • Mathematically easier to manage • Problems: • Different lengths • Syntactic differences • Repetitions of words (or lack thereof)
Document representations • Term-document matrix (m x n) • Document-document matrix (n x n) • Typical example in a medium-sized collection: 3,000,000 documents (n) with 50,000 terms (m) • Typical example on the Web: n=30,000,000,000, m=1,000,000 • Boolean vs. integer-valued matrices
Storage issues • Imagine a medium-sized collection with n=3,000,000 and m=50,000 • How large a term-document matrix will be needed? • Is there any way to do better? Any heuristic?
Tokenizing text • (CNN) -- A tropical storm has strengthened into Hurricane Leslie in the Atlantic Ocean, forecasters said Wednesday. • The slow-moving storm could affect Bermuda this weekend, according to the National Hurricane Center in Miami. • The Category 1 hurricane was churning Wednesday afternoon about 465 miles (750 kilometers) south-southeast of the British territory and moving north at 2 mph (4 kph), the hurricane center said. http://www.cnn.com/2012/09/05/world/americas/bermuda-hurricane-leslie/index.html
Inverted index • Instead of an incidence vector, use a posting table • CLEVELAND: D1, D2, D6 • OHIO: D1, D5, D6, D7 • Use linked lists to be able to insert new document postings in order and to remove existing postings. • Can be used to compute document frequency • Keep everything sorted! This gives you a logarithmic improvement in access.
Basic operations on inverted indexes • Conjunction (AND) – iterative merge of the two postings: O(x+y) • Disjunction (OR) – very similar • Negation (NOT) – can we still do it in O(x+y)? • Example: MICHIGAN AND NOT OHIO • Example: MICHIGAN OR NOT OHIO • Recursive operations • Optimization: start with the smallest sets
Major IR models • Boolean • Vector • Probabilistic • Language modeling • Fuzzy retrieval • Latent semantic indexing
The Boolean model Venn diagrams z x w y D1 D2
Boolean queries • Operators: AND, OR, NOT, parentheses • Example: • CLEVELAND AND NOT OHIO • (MICHIGAN AND INDIANA) OR (TEXAS AND OKLAHOMA) • Ambiguous uses of AND and OR in human language • Exclusive vs. inclusive OR • Restrictive operator: AND or OR?
Canonical forms of queries • De Morgan’s Laws: NOT (A AND B) = (NOT A) OR (NOT B) NOT (A OR B) = (NOT A) AND (NOT B) • Normal forms • Conjunctive normal form (CNF) • Disjunctive normal form (DNF) • Some people swear by CNF - why?
Evaluating Boolean queries • Incidence vectors: • CLEVELAND: 1100010 • OHIO: 1000111 • Examples: • CLEVELAND AND OHIO • CLEVELAND AND NOT OHIO • CLEVALAND OR OHIO
Exercise • D1 = “computer information retrieval” • D2 = “computer retrieval” • D3 = “information” • D4 = “computer information” • Q1 = “information AND retrieval” • Q2 = “information AND NOT computer”
Exercise ((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))
SET FALL 2013 … 3. Document preprocessing. Tokenization. Stemming. The Porter algorithm. Storing, indexing and searching text. Inverted indexes. …
Document preprocessing • Dealing with formatting and encoding issues • Hyphenation, accents, stemming, capitalization • Tokenization: • USA vs. U.S.A. – equivalence class • Paul’s, Willow Dr., Dr. Willow, 555-1212, New York, ad hoc, can’t • Example: “The New York-Los Angeles flight” • Hewlett-Packard • numbers, e.g., (888) 555-1313, 1-888-555-1313 • dates, e.g., Jan-13-2012, 20120113, 13 January 2012, 01/13/12 • MIT, mit (in German)?