260 likes | 630 Views
Advanced best match Mark Sanderson Porto, 2000 Aims To build on what you have done so far, reviewing more sophisticated ways of ranking documents in relation to a query. Objectives At the end of this lecture you will be able to describe a range of statistically based approaches to IR namely
E N D
Advanced best match Mark Sanderson Porto, 2000
Aims • To build on what you have done so far, reviewing more sophisticated ways of ranking documents in relation to a query.
Objectives • At the end of this lecture you will be able to • describe a range of statistically based approaches to IR namely • Best match ranking • Passage retrieval • Web page popularity techniques • Web link counting
Why? • Extend your knowledge in how IR systems work
Up till now… • Inverse document frequency • More frequent - bad • Less frequent - good • More often a term is used (term frequency) • More likely document is about that term • Depends on document length?
Basically covers it • More sophisticated formulas about • Basically tf*idf • Some variation, e.g. • tf normalisation • Divide by max tf? • Look at test collections • Singhal, A., Buckley, C., Mitra, M. (1996): Pivoted Document Length Normalization, in Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval : 21-29
Anything else? • How to make ranking better? • Term location? • Why? big documents • Web link analysis • Why? • Popularity • Why?
Warning • Customise system to a collection… • …and it become less generally applicable.
Term location • Prefer documents where terms are closer together? • Passage retrieval • Hearst, M.A., Plaunt, C. (1993): Subtopic structuring for full-length document access, in Proceedings of the 16th annual international ACM SIGIR conference on Research and Development in Information Retrieval: 59-68 • Callan, J. (1994): PassageLevel Evidence in Document Retrieval, in Proceedings of the 17th annual international ACM-SIGIR conference on Research and development in information retrieval: 302-310
Research interests/publications/lecturing/supervising My publications list probably does a reasonable job of ostensively defining my interests and past activities. For those with a preference for more explicit definitions, see below. I'm now working at the CIIR with an interest in automatically constructed categorisations and means of explaining these constructions to users. I also plan to work some more on a couple aspects of my thesis that look promising. Supervised by Keith van Rijsbergen in the Glasgow IR group, I finished my Ph.D. in 1997 looking at the issues surrounding the use of Word Sense Disambiguation applied to IR: a number of publications have resulted from this work. While doing my Ph.D., I was fortunate enough to apply for and get a small grant which enabled Ross Purves and I to investigate the use of IR ranked retrieval in the field of avalanche forecasting (snow avalanches that is), this resulted in a paper in JDoc. At Glasgow, I also worked on a number of TREC submissions and also co-wrote the guidelines for creating the very short queries introduced in TREC-6. Finally, I was involved in lecturing work on the AIS Advanced MSc course writing and presenting two short courses on Implementation and NLP applied to IR. I also supervised/co-supervised four MSc students. The work of three of these bright young things have been published in a number of good conferences. My first introduction to IR was the building (with Iain Campbell) of an interface to a probabilistic IR system searching over two years of the Financial Times newspaper. The system, called NRT (News Retrieval Tool), was described in a paper by Donna Harman, "User Friendly Systems Instead of User-Friendly Front Ends". Donna's paper appears in JASIS and the "Readings in IR" book by Sparck-Jones and Willett. 1st 2nd 3rd 4th Hearst and Plaunt • Split document into passages • Rank a document based on scores of each of it’s passages • What is a passage? • Paragraph? • TextTiling? • (non-overlapping) Fixed window?
TextTiling? • Compute sentence block similarity
Results? • TextTiling and paragraphs work equally well • Fixed windows work less well.
Research interests/publications/lecturing/supervising My publications list probably does a reasonable job of ostensively defining my interests and past activities. For those with a preference for more explicit definitions, see below. I'm now working at the CIIR with an interest in automatically constructed categorisations and means of explaining these constructions to users. I also plan to work some more on a couple aspects of my thesis that look promising. Supervised by Keith van Rijsbergen in the Glasgow IR group, I finished my Ph.D. in 1997 looking at the issues surrounding the use of Word Sense Disambiguation applied to IR: a number of publications have resulted from this work. While doing my Ph.D., I was fortunate enough to apply for and get a small grant which enabled Ross Purves and I to investigate the use of IR ranked retrieval in the field of avalanche forecasting (snow avalanches that is), this resulted in a paper in JDoc. At Glasgow, I also worked on a number of TREC submissions and also co-wrote the guidelines for creating the very short queries introduced in TREC-6. Finally, I was involved in lecturing work on the AIS Advanced MSc course writing and presenting two short courses on Implementation and NLP applied to IR. I also supervised/co-supervised four MSc students. The work of three of these bright young things have been published in a number of good conferences. My first introduction to IR was the building (with Iain Campbell) of an interface to a probabilistic IR system searching over two years of the Financial Times newspaper. The system, called NRT (News Retrieval Tool), was described in a paper by Donna Harman, "User Friendly Systems Instead of User-Friendly Front Ends". Donna's paper appears in JASIS and the "Readings in IR" book by Sparck-Jones and Willett. 1st 2nd 3rd 4th Callan • Split document into passages • Rank a document based on score of its highest ranking passage • What is a passage? • Paragraph? • Bounded paragraph • (overlapping) Fixed window?
Results? • Overlapping passage did the best.
Other use for best passage? • Anyone?
How do they do that? • Best passage retrieval in the result list • Tombros, A. and Sanderson, M. Advantages of query-biased summaries in IR, in Proceedings of the 21st ACM SIGIR conference, Pages 2-10, 1998
Research interests/publications/lecturing/supervising My publications list probably does a reasonable job of ostensively defining my interests and past activities. For those with a preference for more explicit definitions, see below. I'm now working at the CIIR with an interest in automatically constructed categorisations and means of explaining these constructions to users. I also plan to work some more on a couple aspects of my thesis that look promising. Supervised by Keith van Rijsbergen in the Glasgow IR group, I finished my Ph.D. in 1997 looking at the issues surrounding the use of Word Sense Disambiguation applied to IR: a number of publications have resulted from this work. While doing my Ph.D., I was fortunate enough to apply for and get a small grant which enabled Ross Purves and I to investigate the use of IR ranked retrieval in the field of avalanche forecasting (snow avalanches that is), this resulted in a paper in JDoc. At Glasgow, I also worked on a number of TREC submissions and also co-wrote the guidelines for creating the very short queries introduced in TREC-6. Finally, I was involved in lecturing work on the AIS Advanced MSc course writing and presenting two short courses on Implementation and NLP applied to IR. I also supervised/co-supervised four MSc students. The work of three of these bright young things have been published in a number of good conferences. My first introduction to IR was the building (with Iain Campbell) of an interface to a probabilistic IR system Research interests/publications/lecturing/supervising My publications list probably does a reasonable job of ostensively defining my interests and past activities. For those with a preference for more explicit definitions, see below. I'm now working at the CIIR with an interest in automatically constructed categorisations and means of explaining these constructions to users. I also plan to work some more on a couple aspects of my thesis that look promising. Supervised by Keith van Rijsbergen in the Glasgow IR group, I finished my Ph.D. in 1997 looking at the issues surrounding the use of Word Sense Disambiguation applied to IR: a number of publications have resulted from this work. While doing my Ph.D., I was fortunate enough to apply for and get a small grant which enabled Ross Purves and I to investigate the use of IR ranked retrieval in the field of avalanche forecasting (snow avalanches that is), this resulted in a paper in JDoc. At Glasgow, I also worked on a number of TREC submissions and also co-wrote the guidelines for creating the very short queries introduced in TREC-6. Finally, I was involved in lecturing work on the AIS Advanced MSc course writing and presenting two short courses on Implementation and NLP applied to IR. I also supervised/co-supervised four MSc students. The work of three of these bright young things have been published in a number of good conferences. My first introduction to IR was the building (with Iain Campbell) of an interface to a probabilistic IR system Research interests/publications/lecturing/supervising My publications list probably does a reasonable job of ostensively defining my interests and past activities. For those with a preference for more explicit definitions, see below. I'm now working at the CIIR with an interest in automatically constructed categorisations and means of explaining these constructions to users. I also plan to work some more on a couple aspects of my thesis that look promising. Supervised by Keith van Rijsbergen in the Glasgow IR group, I finished my Ph.D. in 1997 looking at the issues surrounding the use of Word Sense Disambiguation applied to IR: a number of publications have resulted from this work. While doing my Ph.D., I was fortunate enough to apply for and get a small grant which enabled Ross Purves and I to investigate the use of IR ranked retrieval in the field of avalanche forecasting (snow avalanches that is), this resulted in a paper in JDoc. At Glasgow, I also worked on a number of TREC submissions and also co-wrote the guidelines for creating the very short queries introduced in TREC-6. Finally, I was involved in lecturing work on the AIS Advanced MSc course writing and presenting two short courses on Implementation and NLP applied to IR. I also supervised/co-supervised four MSc students. The work of three of these bright young things have been published in a number of good conferences. My first introduction to IR was the building (with Iain Campbell) of an interface to a probabilistic IR system Research interests/publications/lecturing/supervising My publications list probably does a reasonable job of ostensively defining my interests and past activities. For those with a preference for more explicit definitions, see below. I'm now working at the CIIR with an interest in automatically constructed categorisations and means of explaining these constructions to users. I also plan to work some more on a couple aspects of my thesis that look promising. Supervised by Keith van Rijsbergen in the Glasgow IR group, I finished my Ph.D. in 1997 looking at the issues surrounding the use of Word Sense Disambiguation applied to IR: a number of publications have resulted from this work. While doing my Ph.D., I was fortunate enough to apply for and get a small grant which enabled Ross Purves and I to investigate the use of IR ranked retrieval in the field of avalanche forecasting (snow avalanches that is), this resulted in a paper in JDoc. At Glasgow, I also worked on a number of TREC submissions and also co-wrote the guidelines for creating the very short queries introduced in TREC-6. Finally, I was involved in lecturing work on the AIS Advanced MSc course writing and presenting two short courses on Implementation and NLP applied to IR. I also supervised/co-supervised four MSc students. The work of three of these bright young things have been published in a number of good conferences. My first introduction to IR was the building (with Iain Campbell) of an interface to a probabilistic IR system Web link analysis • Expand representation of document with text • Image retrieval by hypertext linksV. Harmandas, M. Sanderson and M.D. DunlopIn the Proceedings of the 20th ACM SIGIR conference, Pages 296-303, 1997
Research interests/publications/lecturing/supervising My publications list probably does a reasonable job of ostensively defining my interests and past activities. For those with a preference for more explicit definitions, see below. I'm now working at the CIIR with an interest in automatically constructed categorisations and means of explaining these constructions to users. I also plan to work some more on a couple aspects of my thesis that look promising. Supervised by Keith van Rijsbergen in the Glasgow IR group, I finished my Ph.D. in 1997 looking at the issues surrounding the use of Word Sense Disambiguation applied to IR: a number of publications have resulted from this work. While doing my Ph.D., I was fortunate enough to apply for and get a small grant which enabled Ross Purves and I to investigate the use of IR ranked retrieval in the field of avalanche forecasting (snow avalanches that is), this resulted in a paper in JDoc. At Glasgow, I also worked on a number of TREC submissions and also co-wrote the guidelines for creating the very short queries introduced in TREC-6. Finally, I was involved in lecturing work on the AIS Advanced MSc course writing and presenting two short courses on Implementation and NLP applied to IR. I also supervised/co-supervised four MSc students. The work of three of these bright young things have been published in a number of good conferences. My first introduction to IR was the building (with Iain Campbell) of an interface to a probabilistic IR system Research interests/publications/lecturing/supervising My publications list probably does a reasonable job of ostensively defining my interests and past activities. For those with a preference for more explicit definitions, see below. I'm now working at the CIIR with an interest in automatically constructed categorisations and means of explaining these constructions to users. I also plan to work some more on a couple aspects of my thesis that look promising. Supervised by Keith van Rijsbergen in the Glasgow IR group, I finished my Ph.D. in 1997 looking at the issues surrounding the use of Word Sense Disambiguation applied to IR: a number of publications have resulted from this work. While doing my Ph.D., I was fortunate enough to apply for and get a small grant which enabled Ross Purves and I to investigate the use of IR ranked retrieval in the field of avalanche forecasting (snow avalanches that is), this resulted in a paper in JDoc. At Glasgow, I also worked on a number of TREC submissions and also co-wrote the guidelines for creating the very short queries introduced in TREC-6. Finally, I was involved in lecturing work on the AIS Advanced MSc course writing and presenting two short courses on Implementation and NLP applied to IR. I also supervised/co-supervised four MSc students. The work of three of these bright young things have been published in a number of good conferences. My first introduction to IR was the building (with Iain Campbell) of an interface to a probabilistic IR system Research interests/publications/lecturing/supervising My publications list probably does a reasonable job of ostensively defining my interests and past activities. For those with a preference for more explicit definitions, see below. I'm now working at the CIIR with an interest in automatically constructed categorisations and means of explaining these constructions to users. I also plan to work some more on a couple aspects of my thesis that look promising. Supervised by Keith van Rijsbergen in the Glasgow IR group, I finished my Ph.D. in 1997 looking at the issues surrounding the use of Word Sense Disambiguation applied to IR: a number of publications have resulted from this work. While doing my Ph.D., I was fortunate enough to apply for and get a small grant which enabled Ross Purves and I to investigate the use of IR ranked retrieval in the field of avalanche forecasting (snow avalanches that is), this resulted in a paper in JDoc. At Glasgow, I also worked on a number of TREC submissions and also co-wrote the guidelines for creating the very short queries introduced in TREC-6. Finally, I was involved in lecturing work on the AIS Advanced MSc course writing and presenting two short courses on Implementation and NLP applied to IR. I also supervised/co-supervised four MSc students. The work of three of these bright young things have been published in a number of good conferences. My first introduction to IR was the building (with Iain Campbell) of an interface to a probabilistic IR system Research interests/publications/lecturing/supervising My publications list probably does a reasonable job of ostensively defining my interests and past activities. For those with a preference for more explicit definitions, see below. I'm now working at the CIIR with an interest in automatically constructed categorisations and means of explaining these constructions to users. I also plan to work some more on a couple aspects of my thesis that look promising. Supervised by Keith van Rijsbergen in the Glasgow IR group, I finished my Ph.D. in 1997 looking at the issues surrounding the use of Word Sense Disambiguation applied to IR: a number of publications have resulted from this work. While doing my Ph.D., I was fortunate enough to apply for and get a small grant which enabled Ross Purves and I to investigate the use of IR ranked retrieval in the field of avalanche forecasting (snow avalanches that is), this resulted in a paper in JDoc. At Glasgow, I also worked on a number of TREC submissions and also co-wrote the guidelines for creating the very short queries introduced in TREC-6. Finally, I was involved in lecturing work on the AIS Advanced MSc course writing and presenting two short courses on Implementation and NLP applied to IR. I also supervised/co-supervised four MSc students. The work of three of these bright young things have been published in a number of good conferences. My first introduction to IR was the building (with Iain Campbell) of an interface to a probabilistic IR system Research interests/publications/lecturing/supervising My publications list probably does a reasonable job of ostensively defining my interests and past activities. For those with a preference for more explicit definitions, see below. I'm now working at the CIIR with an interest in automatically constructed categorisations and means of explaining these constructions to users. I also plan to work some more on a couple aspects of my thesis that look promising. Supervised by Keith van Rijsbergen in the Glasgow IR group, I finished my Ph.D. in 1997 looking at the issues surrounding the use of Word Sense Disambiguation applied to IR: a number of publications have resulted from this work. While doing my Ph.D., I was fortunate enough to apply for and get a small grant which enabled Ross Purves and I to investigate the use of IR ranked retrieval in the field of avalanche forecasting (snow avalanches that is), this resulted in a paper in JDoc. At Glasgow, I also worked on a number of TREC submissions and also co-wrote the guidelines for creating the very short queries introduced in TREC-6. Finally, I was involved in lecturing work on the AIS Advanced MSc course writing and presenting two short courses on Implementation and NLP applied to IR. I also supervised/co-supervised four MSc students. The work of three of these bright young things have been published in a number of good conferences. My first introduction to IR was the building (with Iain Campbell) of an interface to a probabilistic IR system Research interests/publications/lecturing/supervising My publications list probably does a reasonable job of ostensively defining my interests and past activities. For those with a preference for more explicit definitions, see below. I'm now working at the CIIR with an interest in automatically constructed categorisations and means of explaining these constructions to users. I also plan to work some more on a couple aspects of my thesis that look promising. Supervised by Keith van Rijsbergen in the Glasgow IR group, I finished my Ph.D. in 1997 looking at the issues surrounding the use of Word Sense Disambiguation applied to IR: a number of publications have resulted from this work. While doing my Ph.D., I was fortunate enough to apply for and get a small grant which enabled Ross Purves and I to investigate the use of IR ranked retrieval in the field of avalanche forecasting (snow avalanches that is), this resulted in a paper in JDoc. At Glasgow, I also worked on a number of TREC submissions and also co-wrote the guidelines for creating the very short queries introduced in TREC-6. Finally, I was involved in lecturing work on the AIS Advanced MSc course writing and presenting two short courses on Implementation and NLP applied to IR. I also supervised/co-supervised four MSc students. The work of three of these bright young things have been published in a number of good conferences. My first introduction to IR was the building (with Iain Campbell) of an interface to a probabilistic IR system Some sites better than others? • Hubs and authorities • Search for “Harvard” • Gibson, D., Kleinberg, J., Raghavan, P. (1998): Inferring Web Communities from Link Topology, in Proceedings of The 9th ACM Conference on Hypertext and Hypermedia: links, objects, time and space—structure in hypermedia systems: 225-234
Like Google? • www.google.com • Base ranking on best match plus authority weight • Harder to spam authorities?
Popularity? • Query IMDB (www.imdb.org) for “Titanic” Titanic (1915) Titanic (1997) Titanic (1943) Titanic (1953) Titanic 2000 (1999) Titanic: Anatomy of a Disaster (1997) Titanic: Answers from the Abyss (1999) Titanic Chronicles, The (1999) Titanic in a Tub: The Golden Age of Toy Boats (1981) Titanic Too: It Missed the Iceberg (2000) Titanic Town (1998) Titanic vals (1964)...aka Titanic Waltz (1964) (USA) Atlantic (1929)...aka Titanic: Disaster in the Atlantic (1999) (USA: video title) Night to Remember, A (1958)...aka Titanic latitudine 41 Nord (1958) (Italy) Gigantic (2000)...aka Untitled Titanic Spoof (1998) (USA: working title) …
Use popularity • Query “titanic” on IMDB • Titanic (1997) • Titanic Too: It Missed the Iceberg (2000)
Other examples • Direct Hit • www.directhit.com • Increasingly popular
URL specific stuff • Prefer web pages with short URLs • http://dis.shef.ac.uk/mark/ • http://dis.shef.ac.uk/mark/trivia/links/stuff/ • Prefer with “.com” • not “.pt”?
Conclusions • By now, you will be able to • describe a range of statistically based approaches to IR namely • Best match ranking • Passage retrieval • Web page popularity techniques • Web link counting