Introduction to Information Retrieval and Web-based Searching Methods Mark Sanderson, University of Sheffield m.sanderson@shef.ac.uk, dis.shef.ac.uk/mark/ ©Mark Sanderson, Sheffield University
Contents • Introduction • Ranked retrieval • Models • Evaluation • Advanced ranking • Future • Sources ©Mark Sanderson, Sheffield University
Aims • To introduce you to basic notions in the field of Information Retrieval, with a focus on Web-based retrieval issues. • To squeeze it all into 4 hours, including coffee breaks • If it’s not covered in here, hopefully there will at least be a reference ©Mark Sanderson, Sheffield University
Objectives • At the end of this you will be able to… • Demonstrate the workings of document ranking • Remove suffixes from words. • Explain how recall and precision are calculated. • Exploit Web specific information when searching. • Outline the means of automatically expanding users’ queries. • List IR publications. ©Mark Sanderson, Sheffield University
Introduction • What is IR? • General definition • Retrieval of unstructured data • Most often it is • Retrieval of text documents • Searching newspaper articles • Searching on the Web • Other types • Image retrieval ©Mark Sanderson, Sheffield University
Typical interaction • User has an information need. • Expresses it as a query • in their natural language? • The IR system finds documents relevant to the query. ©Mark Sanderson, Sheffield University
Text • No computer understanding of document or query text • Use “bag of words” approach • Pay no heed to inter-word relations: • syntax, semantics • Bag does characterise document • Not perfect, words are • ambiguous • used in different forms or synonymously ©Mark Sanderson, Sheffield University
To recap • [Diagram: documents are processed into a store; the user’s query is processed and matched against the store by the IR system (the retrieval part), which returns retrieved, relevant(?) documents] ©Mark Sanderson, Sheffield University
Processing • “The destruction of the amazon rain forests” • Case normalisation • Stop word removal. • From a fixed list • “destruction amazon rain forests” • Suffix removal, also known as stemming. • “destruct amazon rain forest” • Documents are processed in the same way ©Mark Sanderson, Sheffield University
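A minimal sketch of these query-processing steps (the stop-word list here is a tiny illustrative one, not the fixed list a real system would use; stemming is deferred to the plural stemmer discussed below):

```python
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "to"}  # tiny illustrative list

def process(text):
    # 1. Case normalisation
    tokens = text.lower().split()
    # 2. Stop word removal from a fixed list
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 3. Suffix removal (stemming) would be applied to each token here
    return tokens

print(process("The destruction of the amazon rain forests"))
# ['destruction', 'amazon', 'rain', 'forests']
```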
Different forms - stemming • Matching the query term “forests” • to “forest” and “forested” • Stemmers remove affixes • removal of suffixes - worker • prefixes? - megavolt • infixes? - un-bloody-likely • Stick with suffixes ©Mark Sanderson, Sheffield University
Plural stemmer • Plurals in English • If word ends in “ies” but not “eies”, “aies” • “ies” -> “y” • if word ends in “es” but not “aes”, “ees”, “oes” • “es” -> “e” • if word ends in “s” but not “us” or “ss” • “s” -> “” • First applicable rule is the one used ©Mark Sanderson, Sheffield University
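A direct reading of those three rules as code (a sketch from the slide’s wording; the published algorithm in the Frakes chapter may differ in detail):

```python
def plural_stem(word):
    # First applicable rule is the one used
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"   # "ies" -> "y"
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-2] + "e"   # "es" -> "e"
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]         # "s" -> ""
    return word

for w in ["forests", "cities", "horses", "plus"]:
    print(w, "->", plural_stem(w))
```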
Plural stemmer reference • Good review of stemming • Frakes, W. (1992): Stemming algorithms, in Frakes, W. & Baeza-Yates, B. (eds.), Information Retrieval: Data Structures & Algorithms: 131-160 ©Mark Sanderson, Sheffield University
Plural stemmer • Examples • Forests - ? • Statistics - ? • Queries - ? • Foes - ? • Does - ? • Is - ? • Plus - ? • Plusses - ? ©Mark Sanderson, Sheffield University
Take more off? • What about • “ed”, “ing”, “ational”, “ation”, “able”, “ism”, etc, etc. • Porter, M.F. (1980): An algorithm for suffix stripping, in Program - automated library and information systems, 14(3): 130-137 • Three pages of rules • What about • “bring”, “table”, “prism”, “bed”, “thing”? • When to strip, when to stop ©Mark Sanderson, Sheffield University
CVCs • Porter used the pattern of consonant (C) and vowel (V) sequences in a word • [C*](VC)^m[V*], where the superscript m counts the VC pairs (the word’s “measure”) • Tree - m=? • Trouble - m=? • Troubles - m=? • If m = 0, or for some rules 1 • stop (don’t strip) • Syllables? • Pinker, S. (1994): The Language Instinct ©Mark Sanderson, Sheffield University
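A sketch of computing the measure m from a word’s consonant/vowel pattern (here ‘y’ counts as a vowel whenever it follows a non-vowel, a slight simplification of Porter’s rule):

```python
VOWELS = set("aeiou")

def measure(word):
    """Count m in [C*](VC)^m[V*]: the number of vowel-then-consonant runs."""
    w = word.lower()
    pattern = []
    for i, ch in enumerate(w):
        # 'y' is treated as a vowel when the previous letter is not a plain vowel
        is_vowel = ch in VOWELS or (ch == "y" and i > 0 and w[i - 1] not in VOWELS)
        tag = "V" if is_vowel else "C"
        if not pattern or pattern[-1] != tag:   # collapse runs of the same type
            pattern.append(tag)
    return "".join(pattern).count("VC")

print(measure("tree"), measure("trouble"), measure("troubles"))  # 0 1 2
```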
Problems • Porter doesn’t always return words • “query”, “queries”, “querying”, etc • -> “queri” • Krovetz, R. (1993): Viewing morphology as an inference process, in Proceedings of the 16th annual international ACM SIGIR conference on Research and Development in Information Retrieval: 191-202 • Xu, J., Croft, W.B. (1998): Corpus-Based Stemming using Co-occurrence of Word Variants, in ACM Transactions on Information Systems, 16(1): 61-81 ©Mark Sanderson, Sheffield University
Is it used? • Research says it is useful • Hull, D.A. (1996): Stemming algorithms: A case study for detailed evaluation, in Journal of the American Society for Information Science, 47(1): 70-84 • Web search engines hardly use it • Why? • Unexpected results • computer, computation, computing, computational, etc. • User expectation? • Foreign languages? ©Mark Sanderson, Sheffield University
Ranked retrieval • Everything processed into a bag… • …calculate relevance score between query and every document • Sort documents by their score • Present top scoring documents to user. ©Mark Sanderson, Sheffield University
The scoring • For each document • Term frequency (tf) • t: Number of times term occurs in document • dl: Length of document (number of terms) • Inverse document frequency (idf) • n: Number of documents term occurs in • N: Number of documents in collection ©Mark Sanderson, Sheffield University
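The tf and idf formulas themselves appeared as images on the original slide; a common textbook form using the quantities above is tf = t/dl and idf = log(N/n). A minimal ranking sketch built on that hedged reconstruction (toy documents, not necessarily Sanderson’s exact weighting):

```python
import math
from collections import Counter

def rank(query_terms, documents):
    """Score every document against the query with tf.idf, then sort by score."""
    N = len(documents)
    bags = [Counter(doc) for doc in documents]            # documents as bags of words
    df = Counter(t for bag in bags for t in set(bag))     # n: documents each term occurs in
    scores = []
    for bag in bags:
        dl = sum(bag.values())                            # document length in terms
        score = sum((bag[t] / dl) * math.log(N / df[t])   # tf = t/dl, idf = log(N/n)
                    for t in query_terms if df[t])
        scores.append(score)
    return sorted(range(N), key=lambda i: scores[i], reverse=True)

docs = [["amazon", "rain", "forest", "destruct"],
        ["rain", "rain", "weather"],
        ["amazon", "book", "store"]]
print(rank(["amazon", "rain", "forest"], docs))  # document 0 ranks first
```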
TF • The more often a term is used in a document • the more likely the document is about that term • Depends on document length? • Harman, D. (1992): Ranking algorithms, in Frakes, W. & Baeza-Yates, B. (eds.), Information Retrieval: Data Structures & Algorithms: 363-392 • Watch out for a common mistake: count all occurrences of the term, not just unique terms. • Problems with spamming ©Mark Sanderson, Sheffield University
Spamming the tf weight • Searching for Jennifer Anniston? SEX SEXY MONICA LEWINSKY JENNIFER LOPEZ CLAUDIA SCHIFFER CINDY CRAWFORD JENNIFER ANNISTON GILLIAN ANDERSON MADONNA NIKI TAYLOR ELLE MACPHERSON KATE MOSS CAROL ALT TYRA BANKS FREDERIQUE KATHY IRELAND PAM ANDERSON KAREN MULDER VALERIA MAZZA SHALOM HARLOW AMBER VALLETTA LAETITA CASTA BETTIE PAGE HEIDI KLUM PATRICIA FORD DAISY FUENTES KELLY BROOK … (the same keyword list is repeated several more times on the page) ©Mark Sanderson, Sheffield University
IDF • Some query terms are better than others? • In general, it is fair to say that… • “amazon” > “forest”, “destruction” > “rain” ©Mark Sanderson, Sheffield University
To illustrate • [Diagram: the set of all documents, with the relevant documents shown as a subset] ©Mark Sanderson, Sheffield University
To illustrate • [Diagram: all documents, with the subset containing “amazon” shown] ©Mark Sanderson, Sheffield University
To illustrate • [Diagram: all documents, with the subset containing “rain” shown] ©Mark Sanderson, Sheffield University
IDF and collection context • IDF is sensitive to the content of the document collection • General newspapers: “amazon” > “forest”, “destruction” > “rain” • Amazon book store press releases: “forest”, “destruction” > “rain” > “amazon” ©Mark Sanderson, Sheffield University
Very successful • Simple, but effective • Core of most weighting functions • tf (term frequency) • idf (inverse document frequency) • dl (document length) ©Mark Sanderson, Sheffield University
Robertson’s BM25 • Q is a query containing terms T • w is a form of IDF • k1, b, k2, k3 are parameters. • tf is the document term frequency. • qtf is the query term frequency. • dl is the document length (arbitrary units). • avdl is the average document length. ©Mark Sanderson, Sheffield University
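The BM25 formula itself appeared as an image in the original slide; the commonly cited Okapi form from the TREC-4 paper referenced below, using the symbols listed above, is roughly (a reconstruction from that paper, not copied from the slide):

```latex
\sum_{T \in Q} w^{(1)}\,
  \frac{(k_1 + 1)\,tf}{K + tf}\,
  \frac{(k_3 + 1)\,qtf}{k_3 + qtf}
  \; + \; k_2 \cdot |Q| \cdot \frac{avdl - dl}{avdl + dl},
\qquad K = k_1\Big((1 - b) + b\,\frac{dl}{avdl}\Big)
```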
Reference for BM25 • Popular weighting scheme • Robertson, S.E., Walker, S., Beaulieu, M.M., Gatford, M., Payne, A. (1995): Okapi at TREC-4, in NIST Special Publication 500-236: The Fourth Text REtrieval Conference (TREC-4): 73-96 ©Mark Sanderson, Sheffield University
Getting the balance • Documents with all the query terms? • Just those with high tf•idf terms? • What sorts of documents are these? • Search for a picture of Arbour Low • Stone circle near Sheffield • Try Google and AltaVista ©Mark Sanderson, Sheffield University
[Screenshot of search results] A very short page containing “arbour” only; a longer page with lots of “arbour” but no “low” ©Mark Sanderson, Sheffield University
[Screenshot of search results] Searching for “arbour low”: Arbour Low documents do exist ©Mark Sanderson, Sheffield University
[Screenshot of search results] Lots of Arbour Low documents • Disambiguation? ©Mark Sanderson, Sheffield University
Result • From Google • “The Stonehenge of the north” ©Mark Sanderson, Sheffield University
Caveat • Search engines don’t say much • Hard to know how they work ©Mark Sanderson, Sheffield University
Boolean searching? • Start with query • “amazon” & “rain forest*” & (“destroy” | “destruction”) • Break collection into two unordered sets • Documents that match the query • Documents that don’t • User has complete control but… • …not easy to use. ©Mark Sanderson, Sheffield University
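A toy sketch of Boolean set retrieval over a hypothetical inverted index (term -> set of document ids); the wildcard in “rain forest*” is ignored and the phrase is treated as a single indexed unit for simplicity:

```python
# hypothetical inverted index: term -> set of document ids
index = {
    "amazon":      {1, 2, 5},
    "rain forest": {1, 3, 5},
    "destroy":     {3, 5},
    "destruction": {1, 4},
}

# "amazon" & "rain forest" & ("destroy" | "destruction")
matches = index["amazon"] & index["rain forest"] & (index["destroy"] | index["destruction"])
print(matches)   # {1, 5}: an unordered set; everything else is simply not retrieved
```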
Boolean • Two forms of query/retrieval system • Ranked retrieval • Long championed by academics • Boolean • Rooted in commercial systems from 1970s • Koenig, M.E. (1992): How close we came, in Information Processing and Management, 28(3): 433-436 • Modern systems • Hybrid of both ©Mark Sanderson, Sheffield University
Don’t need Boolean? • Ranking found to be better than Boolean • But lack of specificity in ranking • destruction AND (amazon OR south american) AND rain forest • destruction, amazon, south american, rain forest • Jansen, B.J., Spink, A., Bateman, J., and Saracevic, T. (1998): Real Life Information Retrieval: A Study Of User Queries On The Web, in SIGIR Forum: A Publication of the Special Interest Group on Information Retrieval, 32(1): 5-17 ©Mark Sanderson, Sheffield University
Models • Mathematically modelling the retrieval process • So as to better understand it • Draw on work of others • Vector space • Probabilistic ©Mark Sanderson, Sheffield University
Vector Space • Document/query is a vector in N space • N = number of unique terms in the collection • If a term occurs in the doc/qry, set that element of its vector • Angle between vectors = similarity measure • Cosine of the angle (cos(0) = 1) • Doesn’t model term dependence • [Diagram: document vector D and query vector Q with the angle between them] ©Mark Sanderson, Sheffield University
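A small sketch of the cosine measure between two term vectors (illustrative weights only):

```python
import math

def cosine(d, q):
    """Cosine of the angle between a document vector and a query vector."""
    dot = sum(a * b for a, b in zip(d, q))
    norm_d = math.sqrt(sum(a * a for a in d))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_d * norm_q)

# element i holds the weight of term i; identical directions give cos = 1
d = [2, 1, 0, 1]   # document vector
q = [1, 1, 0, 0]   # query vector
print(round(cosine(d, q), 3))  # 0.866
```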
Model references • w_{x,y} - weight of vector element • Vector space • Salton, G. & Lesk, M.E. (1968): Computer evaluation of indexing and text processing. Journal of the ACM, 15(1): 8-36 • Any of the Salton SMART books ©Mark Sanderson, Sheffield University
Modelling dependence • Latent Semantic Indexing (LSI) • Reduce dimensionality of N space • Bring related terms together. • Furnas, G.W., Deerwester, S., Dumais, S.T., Landauer, T.K., Harshman, R.A., Streeter, L.A., Lochbaum, K.E. (1988): Information retrieval using a singular value decomposition model of latent semantic structure, in Proceeding of the 11th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval: 465-480 • Manning, C.D., Schütze, H. (1999): Foundations of Statistical Natural Language Processing: 554-566 ©Mark Sanderson, Sheffield University
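A minimal sketch of the dimensionality reduction behind LSI, using a truncated singular value decomposition of a toy term-document matrix (real systems weight the matrix and keep hundreds of dimensions):

```python
import numpy as np

# toy term-document matrix: rows = terms, columns = documents
A = np.array([[1, 1, 0, 0],    # "car"
              [1, 0, 1, 0],    # "automobile"
              [0, 0, 0, 1],    # "forest"
              [0, 1, 1, 1]],   # "amazon"
             dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                     # reduced dimensionality of the N space
docs_k = (np.diag(s[:k]) @ Vt[:k, :]).T   # documents in the k-dimensional latent space
print(docs_k.round(2))                    # related documents end up close together
```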
Probabilistic • Assume independence ©Mark Sanderson, Sheffield University
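The weighting formula on this slide was shown as an image; the classic relevance weight from the Robertson & Sparck Jones paper cited on the next slide, derived under the term-independence assumption, is:

```latex
w = \log \frac{(r + 0.5)\,(N - n - R + r + 0.5)}{(n - r + 0.5)\,(R - r + 0.5)}
```

where N is the number of documents, n the number containing the term, R the number of known relevant documents, and r the number of relevant documents containing the term (a reconstruction from the cited paper, not from the slide itself).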
Model references • Probabilistic • Original papers • Robertson, S.E. & Sparck Jones, K. (1976): Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3): 129-146. • Van Rijsbergen, C.J. (1979): Information Retrieval • Chapter 6 • Survey • Crestani, F., Lalmas, M., van Rijsbergen, C.J., Campbell, I. (1998): “Is This Document Relevant? ...Probably”: A Survey of Probabilistic Models in Information Retrieval, in ACM Computing Surveys, 30(4): 528-552 ©Mark Sanderson, Sheffield University
Recent developments • Probabilistic language models • Ponte, J., Croft, W.B. (1998): A Language Modelling Approach to Information Retrieval, in Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval: 275-281 ©Mark Sanderson, Sheffield University
Evaluation • Measure how well an IR system is doing • Effectiveness • Number of relevant documents retrieved • Also • Speed • Storage requirements • Usability ©Mark Sanderson, Sheffield University
Effectiveness • Two main measures • Precision is easy • P at rank 10. • Recall is hard • Total number of relevant documents? ©Mark Sanderson, Sheffield University
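For reference, the standard definitions behind those two measures (not spelled out on the slide):

```latex
\text{Precision} = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{retrieved}|}
\qquad
\text{Recall} = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{relevant}|}
```

Precision at rank 10 only needs the top 10 retrieved documents to be judged, which is why it is easy; recall needs the total number of relevant documents in the whole collection, which is why it is hard.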
Test collections • Test collection • Set of documents (a few thousand to a few million) • Set of queries (50-400) • Set of relevance judgements • Humans can’t check all documents! • Use pooling instead • Take the top 100 from every submission • Remove duplicates • Manually assess these only. ©Mark Sanderson, Sheffield University
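A sketch of the pooling step just described (assuming each submitted run is simply a ranked list of document ids):

```python
def build_pool(runs, depth=100):
    """Union of the top `depth` documents from every submitted run, duplicates removed."""
    pool = set()
    for ranked_docs in runs:
        pool.update(ranked_docs[:depth])
    return pool   # only these documents are manually assessed for relevance

runs = [["d3", "d7", "d1"], ["d7", "d9", "d3"]]   # two tiny example runs
print(sorted(build_pool(runs, depth=2)))          # ['d3', 'd7', 'd9']
```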
Test collections • Small collections (~3Mb) • Cranfield, NPL, CACM - title (& abstract) • Medium (~4 Gb) • TREC - full text • Large (~100Gb) • VLC track of TREC • Compare with reality (~10Tb) • CIA, GCHQ, Large search services ©Mark Sanderson, Sheffield University
Where to get them • Cranfield, NPL, CACM • www.dcs.gla.ac.uk/idom/ • TREC, VLC • trec.nist.gov ©Mark Sanderson, Sheffield University