The study of information retrieval – a long view
Stephen Robertson
Microsoft Research Cambridge and City University
ser@microsoft.com
IIiX, London
A half-century of lab experiments
Cranfield began in 1958
• some precursor experiments, but we can treat that as the start of the experimental tradition in IR
A brief timeline:
• 1960s & 70s: various experiments, mostly with purpose-built test collections
• late 1960s on: exchange of test collections among researchers
• mid-to-late 1970s: the ‘ideal’ test collection project
• 1981: The Book (Information Retrieval Experiment, KSJ)
• 1980s: relatively fallow period
• 1990s to date: TREC
• late 1990s on: TREC spin-offs (CLEF, NTCIR, INEX etc.)
• (and of course, late 1990s on: web search engines)
Some highlights (personal selection)
• Cranfield 1 and 2
• Smart: VSM
• Medlars: indexing and searching
• KSJ: term weighting; test collections
• Keen: index languages
• Belkin and Oddy: ASK and user models
• Okapi: simple search and feedback
• UMass: various experimental systems
• TREC: adhoc; feedback; the web; interaction
• CLEF, NTCIR, INEX, DUC etc.
[S Robertson, On the history of evaluation in IR, Journal of Information Science, Vol. 34, No. 4, 439–456 (2008)]
A half-century of lab experiments
Recapitulation of outcome (a gross over-simplification!)
• Don’t worry too much about the NLP
• ... or the semantics
• ... or the knowledge engineering
• ... or the interaction issues
• ... or the user’s cognitive processes
• but pay attention to the statistics
• ... and to the ranking algorithms
• bag-of-words rules OK (a sketch of such scoring follows)
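As an illustration of the kind of bag-of-words ranking this summary points to, here is a minimal sketch of BM25 scoring, the weighting scheme associated with the Okapi work mentioned above. The function and its data structures are illustrative, not taken from the talk; k1 and b are given their conventional values.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Score one document against a query under bag-of-words BM25.

    doc_freqs maps each term to the number of documents containing it;
    k1 and b are the usual free parameters (conventional values shown).
    """
    tf = Counter(doc_terms)      # term frequencies: word order is ignored
    doc_len = len(doc_terms)
    score = 0.0
    for term in set(query_terms):
        df = doc_freqs.get(term, 0)
        if df == 0:
            continue
        # Robertson/Sparck Jones style idf, without relevance information
        idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
        # Saturating term-frequency component with length normalisation
        norm = tf[term] * (k1 + 1) / (
            tf[term] + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += idf * norm
    return score
```

Nothing here looks at syntax, semantics, or word order: the “statistics and ranking algorithms” of the slide are exactly the idf and length-normalised tf components above.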
A half-century...
... that deserves considerable celebration
• but of course has a downside
So, let’s explore a little
• why we do lab experiments in the first place
• what the alternatives are
• what they might or might not tell us
• what is good and bad about them
• which directions they lead us in
• more importantly, which they deflect us from
and maybe, finally,
• how they might be improved
Note: this is my personal take on these questions!
Abstraction
Lab experiments involve abstraction
• choice of variables included/excluded
• control on variables
• restrictions on values/ranges of variables
[Note: models and theories also involve abstraction
• but usually different abstractions, for different reasons]
Why?
• First, to make them possible
Abstraction
Why else?
• study simple cases
• clarify relationships
• reduce noise
• ensure repeatability
• validate abstract theories
Example: Newton’s laws
The scientific method (simple-minded outline!)
Collect empirical data
• by observation and/or experiment
Formulate hypotheses/models/theories
Derive testable predictions
• about events which may be studied empirically
Conduct further observation/experiment
• designed to test predictions
Refine/reject models/theories
• and reiterate
Observation versus experiment (simple-minded outline again!)
The experimental approach is a very powerful one
Given a simple choice, we would usually choose experiment over observation
• at least for hypothesis testing
... but the choice is rarely simple
Traditional science
The traditional image of science involves experiments in laboratories
• but actually this is misleading
Some sciences thrive in the laboratory
• e.g. chemistry, small-scale physics
Others have made a transition
• e.g. the biochemical end of biology
Others still are almost completely resistant
• e.g. astrophysics, geology
(not to mention non-traditional sciences such as economics)
Limitations of abstraction
Abstractions involve assumptions
• choosing one variable and eliminating another assumes that the two can be treated separately
• if an abstraction is built into an experiment, then its assumptions cannot be tested by the experiment
Even if we could do everything in a laboratory, we should not all do the same thing!
• that is, we should not all use the same abstractions based on the same assumptions
Limitations of abstraction
Some phenomena resist abstraction
• so that an abstract representation would be unrealistic or even illusory
This gives us the basic conflict
• between control and realism
Note: I have exaggerated the polarity between observation and experiment
• most investigations have elements of both
... but I have not exaggerated the conflict
• most investigations struggle seriously with it
• and have to make compromises
Research in IR
A conventional separation:
• Laboratory experiments in the Cranfield/TREC tradition, usually on ranking algorithms
• Observational experiments addressing user-oriented issues
Of course this is over-simplified
• there are laboratory experiments addressing other issues
  • semantics, language, etc.
  • user interaction etc.
• as well as observational experiments on algorithms
Research in IR
The Cranfield/TREC tradition is richer than it is often given credit for
• TREC tracks and spin-offs have pushed the boundaries of lab experimentation, with some different outcomes
Some examples:
• QA: Here NLP and some aspects of semantics / knowledge engineering are critical
• Cross-lingual: Here we need resources constructed from comparable corpora
• The web: Here we are beginning to extract useful knowledge from usage data and resources such as Wikipedia
All of these are unconventional
• although all are dominated by statistical ideas
Research in IR
Communities involved in user-oriented issues have developed laboratory methods
• in interactive tasks within TREC-like projects
• in new forms of lab experiments
Some core IR algorithm work is moving into observational user experiments
• particularly in the web environment
• particularly using click (and other user behaviour) data
Observational IR research
Aspects that suggest an observational approach:
• interaction (human-system)
• collaboration (human-human)
• temporal scale
• user cognition
• context
  • task context
  • user knowledge
Observational IR research
Issues:
• scale
  • it is hard to expand the scale of an observational study
• reproducibility
  • it is hard to perform an observational study in such a way that it can be repeated by someone else
• control
  • it is hard to control the variables that might affect an experiment (either the independent variables of interest, or the noise variables)
Observational IR research
Advantages:
• realism
  • we have more confidence that the results of an observational study represent some kind of reality
• context
  • those (perhaps unknown) aspects of context that have an effect can be assumed to be present
Maybe another significant difference...
Hypothesis testing
Back to the scientific method:
• need to formulate predictions as testable hypotheses
Properly, any prediction of a model or theory is a candidate for this
• the objective is to test the model or theory
• not to achieve some practical result from it
• ideally, look for critical cases
  • where the predictions of the model in question differ from those of other models
IR models and theories
What are IR models designed to tell us?
Different kinds of models might be expected to explain/predict many observables
... but in the Cranfield/TREC tradition, we usually interpret them in a narrow way
• specifically, we look only for effects on effectiveness
This seems to be a limitation in our ways of thinking about them
Hypothesis testing
At least some user-oriented studies in IR ask other questions
• and try to develop appropriate models/theories
• e.g. about user behaviour
Obviously we are interested in making systems better...
• but a model or theory may (should) tell us more than just how to achieve that aim
• and indeed other predictions may also be useful
Even statistical models could be interpreted more broadly
Other predictions (maybe accessible to statistical models)
Patterns of term occurrence
• maybe simply not believable
Calibrated probabilities of relevance
• hard to do but maybe useful
Clicks
• probability of click
• patterns of click behaviour
• e.g. click trails
Other behaviours
• abandonment
• reformulation
• dwell time
Probabilities of relevance
Usual assumption:
• do not need actual probabilities, only rank order
• the result of focussing on standard evaluation metrics
• independence models are typically bad at giving calibrated probabilities
Cooper suggested systems should give probabilities
• as a guide to the user
There are other practical reasons
• filtering
• combination of evidence
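One standard way to obtain calibrated probabilities from raw rank scores (a common post-hoc technique, not something proposed in the talk) is Platt-style logistic calibration fitted on held-out relevance judgements. A minimal sketch, assuming scikit-learn is available; the scores and judgements below are made-up illustrative data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical held-out data: raw retrieval scores and binary relevance
scores = np.array([[5.7], [4.2], [2.1], [1.0], [0.8], [0.3]])
relevant = np.array([1, 1, 1, 0, 0, 0])

# Platt-style calibration: fit a logistic curve from score to P(relevant)
calibrator = LogisticRegression().fit(scores, relevant)

# Calibrated probabilities; these can be shown to users, thresholded
# for filtering, or combined with other evidence
probs = calibrator.predict_proba(scores)[:, 1]
```

Because the logistic function is monotonic, calibration of this kind leaves the rank order untouched while supporting the uses listed above (a guide to the user, filtering, combination of evidence).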
Clicks
There is a new movement in statistical modelling for IR:
• we would like to integrate aspects of user behaviour into our models
• specifically clicks
Predicting patterns of click behaviour is a major component
• which gives us the impetus to investigate and test other kinds of hypothesis
Might use clicks to justify effectiveness metrics
• but such predictions may also be useful for other reasons
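For concreteness, one simple and widely used click model (my choice of example; no specific model is named in the talk) is the cascade model: the user scans results top-down, clicks the first attractive result, and stops. A minimal sketch:

```python
def cascade_click_probs(attract_probs):
    """Click probability at each rank under the simple cascade model.

    attract_probs[i] is the probability that the result at rank i
    attracts a click when examined; the user examines ranks top-down
    and stops at the first click.
    """
    click_probs = []
    p_reach = 1.0  # probability the user examines this rank at all
    for p in attract_probs:
        click_probs.append(p_reach * p)
        p_reach *= (1.0 - p)  # user continues only if no click here
    return click_probs

# e.g. cascade_click_probs([0.6, 0.3, 0.1]) -> [0.6, 0.12, 0.028]
```

A model like this makes testable predictions about observable behaviour (where clicks fall) rather than directly about effectiveness, which is exactly the broader kind of prediction argued for here.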
In general
It seems to me that we should be trying to move in this direction
• Constructing models or theories which are capable of making other kinds of predictions
• Devising tests of these other predictions
  • Laboratory tests
  • Observational tests
… which would encourage rapprochement between the laboratory and observational traditions
Finally
I strongly believe in the science of search
• as a theoretical science
  • in which models and theories have a major role to play
• and as an empirical science
  • requiring the full range of empirical investigations
  • including, specifically, both laboratory experiments and observational studies
The lack of a strong unified theory of IR reinforces the need for good empirical work