Web Search – Summer Term 2006
I. General Introduction
(c) Wolfgang Hürst, Albert-Ludwigs-University
Introduction: Search

What is "search" (by machine)?
- Data bases (relational data bases, SQL, …): search in structured data
- Information Retrieval: search in un- (or semi-)structured data

Example: email archive
- 'All emails with sender x@y.z from April 1st to 3rd, 2006': search in exactly specified (meta) data
- 'All emails that are somehow related to project x': search in an unspecified and unstructured body
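The contrast between the two example queries can be sketched in code. This is a minimal illustration only: the `emails` list, its field names, and the matching logic are invented for this sketch, not part of any real mail system.

```python
from datetime import date

# Hypothetical toy mail archive; field names are illustrative assumptions.
emails = [
    {"sender": "x@y.z", "date": date(2006, 4, 2), "body": "Status report for project x"},
    {"sender": "a@b.c", "date": date(2006, 4, 5), "body": "Lunch on Friday?"},
]

# Data-base-style search: exact conditions on specified metadata fields.
structured = [m for m in emails
              if m["sender"] == "x@y.z"
              and date(2006, 4, 1) <= m["date"] <= date(2006, 4, 3)]

# IR-style search: match against the unstructured body, here naively by
# keyword occurrence; a real IR system would rank by estimated relevance.
unstructured = [m for m in emails if "project x" in m["body"].lower()]
```

The first query can be answered exactly; the second can only be approximated, which is precisely the IR problem.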
Information Retrieval (IR)

[Diagram: an information need is expressed as a query; information is represented as data / documents.]

Information Retrieval (IR) deals with the representation, storage, organization of, and access to information items. (Page 1, Baeza-Yates & Ribeiro-Neto [1])

Information Retrieval (IR) = part of computer science which studies the retrieval of information (not data) from a collection of written documents. The retrieved documents aim at satisfying a user information need usually expressed in natural language. (Glossary, page 444, Baeza-Yates & Ribeiro-Neto [1])

Note: Many other definitions exist. Generally, all share this common view:
[Diagram: in the search process, the user's information need meets the documents, which are represented as data inside the information retrieval system.]
[Diagram: inside the information retrieval system, indexing turns the documents (data) into an index; the user's information need becomes a query, which query processing, searching, and ranking answer with a result.]
Information Retrieval (IR)

Main problem: unstructured, imprecise, and imperfectly defined data.
But also: the whole search process can be characterized as uncertain and vague.
Hence: information is often returned in the form of a sorted list (documents ranked by relevance).
'Data Retrieval' vs. 'IR'

Source: C. J. van Rijsbergen: Information Retrieval (http://www.dcs.gla.ac.uk/Keith/Chapter.1/Ch.1.html)
Summary of the most important terms

Query = The expression of the user information need in the input language provided by the information system. The most common type of input language simply allows the specification of keywords and of a few boolean connectivities. (Glossary, page 449, Baeza-Yates & Ribeiro-Neto [1])

Index = A data structure built on the text to speed up searching. (Glossary, page 443, Baeza-Yates & Ribeiro-Neto [1])

The concept of relevance = a measure to quantify the relevance of a particular document for a particular user in a particular situation.
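To make the index definition concrete, here is a minimal sketch of the classic inverted index with a Boolean AND query. The function names and the toy document collection are invented for illustration; real indexes add term processing, positional information, and compression.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def search_and(index, query):
    """Boolean AND: intersect the posting lists of all query terms."""
    lists = [set(index.get(t, ())) for t in query.lower().split()]
    return sorted(set.intersection(*lists)) if lists else []

docs = ["web search engines", "search in structured data", "web crawling"]
index = build_index(docs)
```

For example, `search_and(index, "web search")` returns `[0]`: only document 0 contains both terms.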
IR Process: Tasks Involved

[Diagram: through the user interface, the information need becomes a query; query processing (parsing & term processing) yields the logical view of the information need. Documents are selected for indexing and, after parsing & term processing, form the logical view of the documents (the index). Searching and ranking produce results, which are passed to result representation; performance evaluation spans the whole process.]
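The "parsing & term processing" step applied to both documents and queries can be sketched as follows. The stop-word list and the function name are illustrative assumptions; real systems use larger, language-specific lists and usually add stemming.

```python
import re

# Illustrative stop-word list; real systems use larger, language-specific ones.
STOPWORDS = {"the", "a", "of", "in", "to", "and"}

def process(text):
    """Parsing & term processing: tokenize, lowercase, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]
```

Applied to "The representation of Information Items", this yields the terms `representation`, `information`, `items`; both sides of the diagram (documents and queries) go through the same normalization so that their terms match.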
Evaluation of IR Systems

Standard approaches for algorithm and computer system evaluation:
- Speed / processing time
- Storage requirements
- Correctness of the algorithms used

But most importantly: performance, effectiveness.

Questions: What is a good / better search engine? How to measure search engine quality? Etc.
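The effectiveness measures hinted at here (treated later in the course under "precision & recall") compare the retrieved documents against documents judged relevant. A minimal sketch, assuming such relevance judgments are given:

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved docs that are relevant.
    Recall: fraction of relevant docs that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Toy judgment: 2 of 4 retrieved docs are relevant, 2 of 3 relevant docs found.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
```

Here `p` is 0.5 and `r` is 2/3, illustrating the usual trade-off: retrieving more documents tends to raise recall and lower precision.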
Evaluation of IR Systems

Another important issue: usability, the users' perception.

Example:
User 1 & system 1: 'It took me 10 min to find the information. Those were the worst 10 minutes of my life. I really hate this system!'
User 2 & system 2: 'It took me 14 min to find the information. I never had so much fun using any search engine before!'
Some Historical Remarks

1950s: Basic idea of searching text with a computer.
1960s: Key developments, e.g. the SMART system (G. Salton, Harvard/Cornell) and the Cranfield evaluations.
1970s and 1980s: Advancement of the basic ideas, but mainly with small test collections.
1990s: Establishment of the TREC (Text REtrieval Conference) series (from 1992 until today): large text collections, expansion to other fields and areas, e.g. spoken document retrieval, non-English and multilingual retrieval, information filtering, user interaction, WWW, video retrieval, etc.

Source: Amit Singhal, 'Modern Information Retrieval: A Brief Overview' (Ch. 1), IEEE Bulletin, 2001
Information Retrieval & Web Search

Historically, IR was mainly motivated by text search (libraries, etc.). Today it covers various other areas and data, e.g. multimedia (images, video, etc.), the WWW, etc.

Web search is a perfect example of an IR system. Goal: find the best possible results (web pages) based on
a) unstructured, heterogeneous, semi-structured data, and
b) imprecise, ambiguous, short queries.
(Note: 'best possible results' is itself a very vague specification of the ultimate goal.)

But: very different from traditional IR tasks!
Characteristics of the Web

Size: The web is big! And there are lots of users!
Documents: Extreme variety regarding formats, structure, quality, etc.
Users: Very different skills & intentions, e.g.
- find all information about related patents,
- find some good tourist information about Paris,
- find the phone number of the tourist office.
Location: The web is a distributed system.
Spam: Expect manipulation instead of cooperation from the document providers.
Dynamics: The web keeps growing & changing.
Web Search

Web search is an active research area with high economic impact.

Many open questions & challenges for research: improving existing systems, adapting to new scenarios (more data, spam, …), new challenges (different data formats, multimedia, …), new tasks (desktop search, personalization, …), etc.

Many other approaches & techniques exist, e.g. clustering, specialized search engines, meta search engines, etc.

We will cover some of this here, i.e. …
Web Search Course: Rough Outline

Traditional (text) retrieval: index generation (data structures), text processing, ranking (TF*IDF, …), models (Boolean, vector space, probabilistic), evaluation (precision & recall, TREC, …). Only the most important concepts, as required for the main part of the course, i.e.:

Web search (a special case of IR): special characteristics of the web, ranking (PageRank, HITS, …), crawling (spiders, robots), indexing, and some selected topics.
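As a taste of the web-search part of the outline, the link-based ranking idea behind PageRank can be sketched as a simple power iteration over a link graph. The toy graph, the damping factor d = 0.85, and the fixed iteration count are illustrative assumptions; production systems work on sparse web-scale matrices.

```python
def pagerank(links, d=0.85, iterations=50):
    """Power iteration for PageRank; `links` maps each page to its out-links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}            # start from a uniform distribution
    for _ in range(iterations):
        new = {p: (1.0 - d) / n for p in pages}   # random-jump component
        for p, outs in links.items():
            if outs:
                share = d * rank[p] / len(outs)   # split rank over out-links
                for q in outs:
                    new[q] += share
            else:                                  # dangling page: spread uniformly
                for q in pages:
                    new[q] += d * rank[p] / n
        rank = new
    return rank

ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

On this toy graph, page c ends up ranked above b, since c collects rank from both a and b while b receives only half of a's rank; the ranks always sum to 1, i.e. they form a probability distribution over pages.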
Text books about (text) IR

[1] Ricardo Baeza-Yates, Berthier Ribeiro-Neto: 'Modern Information Retrieval', Addison Wesley, 1999
[2] William B. Frakes, Ricardo Baeza-Yates (eds.): 'Information Retrieval – Data Structures and Algorithms', P T R Prentice Hall, 1992
[3] C. J. van Rijsbergen: 'Information Retrieval', 1979, available online at http://www.dcs.gla.ac.uk/Keith/Preface.html
[4] I. Witten, A. Moffat, T. Bell: 'Managing Gigabytes', Morgan Kaufmann Publishing, 1999

Excerpts from a new book, 'Introduction to Information Retrieval' by C. Manning, P. Raghavan, H. Schütze (to appear 2007), are available online at http://www-csli.stanford.edu/~schuetze/information-retrieval-book.html

Only certain topics will be covered in this course. There are no books on web search; instead, selected articles will be recommended in the lecture.