A Design Space Approach to Analysis of Information Retrieval Adaptive Filtering Systems Dmitriy Fradkin, Paul Kantor DIMACS, Rutgers University Dmitriy Fradkin, CIKM'04
What Is This Work About? • Small-scale view: We analyze differences between two implementations of the Rocchio method and discuss choices of parameters. • Large-scale view: The problem of constructing an IR/AF system can be seen as an optimization problem in a large design space. (Well-known methods are simply points in this space.)
Large-Scale View • Use optimization methods to find optimal choices of parameters. These optimal choices do not have to correspond to well-known methods or standard practices. • Design space optimization methods have been suggested for designing VLSI chips [Bahuman et al. 2002], airplanes [Schwabacher and Gelsey, 1996; Zha et al. 1996] and HVAC systems [Szykman 1997].
What’s in a name? • We find that even a single “name” involves an enormous number of design choices. • TREC2002 Adaptive Filtering • DIMACS: Rocchio method • Chinese Academy of Sciences (CAS): Rocchio method • One method performs almost twice as well as the other.
For any system: • Choose Data Representation • Construct Initial Classifier • Training Phase: • Incorporate labeled examples • Supplement with “pseudo positives” and “pseudo negatives” • Set the threshold • Filtering Phase: as new documents arrive • Evaluate performance • Update the classifier model • Update threshold
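The filtering phase listed above can be sketched as a loop. This is a minimal illustration, not either system's code: `FilterState`, `run_filter`, and the three callbacks (`get_label`, `update_query`, `update_threshold`) are hypothetical names, and documents are assumed to be plain weight vectors scored by a dot product.

```python
from dataclasses import dataclass


@dataclass
class FilterState:
    query: list        # current classifier, here a plain weight vector (assumption)
    threshold: float   # current submission threshold


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def run_filter(state, stream, get_label, update_query, update_threshold):
    """Score each arriving document, submit it if it reaches the threshold,
    then update the query and the threshold from the feedback."""
    submitted = []
    for doc in stream:
        score = dot(state.query, doc)
        if score < state.threshold:
            continue  # not submitted, so no feedback is available
        label = get_label(doc)  # True, False, or None for an unjudged document
        submitted.append((doc, label))
        state.query = update_query(state.query, doc, label)
        state.threshold = update_threshold(state.threshold, score, label)
    return submitted
```

Each callback corresponds to one design decision; swapping a callback changes one coordinate of the design space while leaving the loop itself unchanged.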
All of these are usually: • Characterized informally, as a choice, and the exclusion of alternatives. • Seen as points on a map – but to understand the significance of these choices we need to explore the real territory. • So: we must interpolate between the choices made in one method and those made in another.
Interpolation • Identify the corresponding design decisions • Develop a “path” between them, sometimes called a “homotopy”, from the topological concept of smoothly deforming one shape (say, a coffee cup) into another (say, a doughnut). • Study the effectiveness along various paths among design options.
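For design decisions that reduce to numeric parameters, the simplest path is a straight line. The sketch below assumes that setting; `homotopy`, `system_a`, and `system_b` are illustrative names with made-up values, not parameters of either TREC system.

```python
def homotopy(a, b, lam):
    """Point on the straight path from setting `a` (lam = 0) to setting `b` (lam = 1).

    `lam` is either one number shared by all parameters (a "diagonal" path)
    or a dict giving a separate position for each parameter.
    """
    lams = lam if isinstance(lam, dict) else {name: lam for name in a}
    return {name: (1 - lams[name]) * a[name] + lams[name] * b[name] for name in a}


# Hypothetical numeric settings for two systems.
system_a = {"alpha": 1.0, "beta": 0.5}
system_b = {"alpha": 0.0, "beta": 2.0}

midpoint = homotopy(system_a, system_b, 0.5)
mixed = homotopy(system_a, system_b, {"alpha": 0.0, "beta": 1.0})
```

Here `midpoint` lies halfway between the two settings on every axis, while `mixed` keeps `alpha` from the first system and takes `beta` from the second.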
Interpolation Aspects for IR/AF • Term Representation • Term Weighting • Computing Scores • Setting Classifier Threshold • Document Set Representation • Pseudolabeled Documents in Training
Interpolation Aspects (cont.) • Query Initialization • Unjudged Documents in Test • Query Update • Quitting Strategy
Example: Term Representation • Here f’(t,d) is the number of times term t occurs in document d.
Example: Term Weighting • DIMACS: • CAS: • Homotopy: • Here i’(t) is the number of documents in training set T that contain term t.
Example: Score Computation • Here i’(t) is the number of documents in training set T that contain term t, and W is a diagonal matrix of weights. • DIMACS: • CAS: • Homotopy:
Example: Score Interpolation • The same mapping from the CAS scale to the DIMACS scale is used for both scores and thresholds: • Homotopy:
Example: Setting Thresholds Threshold for query q after seeing document i: • DIMACS: • CAS: • Homotopy: is chosen to optimize utility
Example: Set Representation • DIMACS • CAS • Homotopy
Example: Pseudo-labeled Documents • The CAS method does not use pseudo-labeled documents in the training stage. • DIMACS method: Given “density” parameters (d+ and d-) and “proportion” parameters (p+ and p-), score the unlabeled training documents and choose top and bottom sets according to the “proportion” parameters. Then pick documents out of these sets according to the corresponding “density”. • Interpolate between the density and proportion parameters (DIMACS) and 0 (CAS).
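One possible reading of this selection step is sketched below. It assumes that "proportion" is the fraction of ranked documents forming each candidate set and that "density" is the fraction of each candidate set actually kept, chosen evenly by rank; the slide does not spell out either detail, and `pick_evenly` and `pseudo_label` are hypothetical names.

```python
def pick_evenly(items, density):
    """Keep roughly a `density` fraction of `items`, evenly spaced by rank."""
    k = round(len(items) * density)
    if k <= 0:
        return []
    step = len(items) / k
    return [items[int(i * step)] for i in range(k)]


def pseudo_label(scored, p_pos, p_neg, d_pos, d_neg):
    """Return (pseudo_positives, pseudo_negatives) from (doc_id, score) pairs.

    p_pos / p_neg: proportions defining the top and bottom candidate sets.
    d_pos / d_neg: densities with which documents are drawn from those sets.
    """
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    n = len(ranked)
    n_top = int(n * p_pos)
    n_bottom = int(n * p_neg)
    top = ranked[:n_top]
    bottom = ranked[n - n_bottom:] if n_bottom else []
    positives = [doc for doc, _ in pick_evenly(top, d_pos)]
    negatives = [doc for doc, _ in pick_evenly(bottom, d_neg)]
    return positives, negatives
```

Setting all four parameters to 0 selects no pseudo-labeled documents, which reproduces the CAS end of the path.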
Example: Query Initialization General Formula: DIMACS: CAS: Homotopy:
Example: Unjudged Documents • A submitted document for which there is no label is “unjudged”. DIMACS ignores such documents. CAS treats such a document as pseudo-negative if its score is less than 0.6. • This can be viewed as a threshold:
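The threshold view can be sketched as follows. The parameter name `theta` and the function `label_for_update` are illustrative, and the sketch assumes nonnegative scores, so that `theta = 0.0` never produces a pseudo-negative (the DIMACS behavior) while `theta = 0.6` matches the CAS rule.

```python
def label_for_update(judged_label, score, theta):
    """Decide which label, if any, a submitted document contributes to the update.

    judged_label: True, False, or None when the document is unjudged.
    theta: unjudged documents scoring below theta become pseudo-negatives.
    Returns True, False, or None (None means the document is ignored).
    """
    if judged_label is not None:
        return judged_label  # a real judgment always wins
    if score < theta:
        return False         # unjudged and low-scoring: pseudo-negative
    return None              # unjudged and not low-scoring: ignore
```

Under these assumptions, intermediate values of `theta` between 0.0 and 0.6 give the points between the two systems.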
Example: Query Update General Formula: DIMACS: CAS: Homotopy:
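For orientation, the textbook Rocchio update moves the query toward the centroid of positive documents and away from the centroid of negative ones. The sketch below shows that textbook form only; it is not the specific DIMACS or CAS formula, and `alpha`, `beta`, and `gamma` are the conventional coefficient names rather than values taken from either system.

```python
def centroid(docs, dim):
    """Mean vector of `docs`; the zero vector if `docs` is empty."""
    if not docs:
        return [0.0] * dim
    return [sum(d[i] for d in docs) / len(docs) for i in range(dim)]


def rocchio_update(query, positives, negatives, alpha, beta, gamma):
    """Textbook Rocchio: alpha * query + beta * centroid(pos) - gamma * centroid(neg)."""
    dim = len(query)
    pos_c = centroid(positives, dim)
    neg_c = centroid(negatives, dim)
    return [alpha * q + beta * p - gamma * n for q, p, n in zip(query, pos_c, neg_c)]
```

Two implementations of this formula can differ in the coefficients, in which documents count as positive or negative, and in whether negative term weights are clipped; this sketch does no clipping.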
Example: Quitting Strategy • DIMACS: if the utility is negative after 50 submissions, stop submitting for this topic • CAS: no quitting strategy • Alternatively:
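The DIMACS rule can be written as a small predicate. This is a sketch under one reading of the slide, namely that the check applies at any point once 50 submissions have been made; `should_quit`, `min_submissions`, and `utility_floor` are illustrative names.

```python
def should_quit(n_submitted, utility, min_submissions=50, utility_floor=0.0):
    """Stop submitting for a topic once enough documents have been submitted
    and the running utility is still below the floor."""
    return n_submitted >= min_submissions and utility < utility_floor


# A system with no quitting strategy can be modeled with an unreachable minimum.
never_quits = should_quit(10_000, -500.0, min_submissions=float("inf"))
```

Exposing the two constants as parameters is what makes this decision a coordinate in the design space rather than a fixed rule.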
Experimental Evaluation • TREC11 data: Reuters Corpus v1 • 23,000 training documents; 800,000 test documents • 100 topics (50 assessor, 50 intersection) • 3 positive and 0 negative examples per topic • Notation: T+ = all positive documents; D+ = submitted positive documents; D- = submitted negative documents; Du = submitted unlabeled documents
Diagonal Interpolation
Documents Retrieved
Parameter Analysis • It is possible to analyze the effect of individual parameters at each point in the space by taking “small steps” along each parameter axis. • This requires a lot of computational effort. • The results may not be easy to interpret.
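The "small steps" idea amounts to a one-sided finite difference per parameter. In the sketch below, `evaluate` stands for whatever maps a parameter setting to a performance number (in practice a full filtering run, which is where the computational cost comes from); `sensitivities` and `step` are hypothetical names.

```python
def sensitivities(evaluate, point, step=0.05):
    """Estimate the effect of each parameter near `point`.

    Moves one parameter at a time by `step`, re-evaluates, and reports the
    change in the result per unit of parameter change.
    Costs len(point) + 1 calls to `evaluate`.
    """
    base = evaluate(point)
    effects = {}
    for name in point:
        moved = dict(point)          # copy so the other parameters stay fixed
        moved[name] = point[name] + step
        effects[name] = (evaluate(moved) - base) / step
    return effects
```

Because performance as a function of the parameters may be non-smooth, an estimate from a single step size should be read with caution.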
Example of Parameter Analysis • Effect of individual parameters on the number of relevant and nonrelevant documents retrieved, around the 0.8 point
Results Based on Topic Type • Comparison of CAS results and the 0.8 diagonal homotopy point
Additional Experiments • Reordered TREC documents • Experimented with 77 topics on the OHSUMED dataset (1987-1988 as training data, 1989-1991 as test data) • The results are similar to those on the original TREC task.
Results of Experiments with Reordering • Average results over 5 re-orderings of the TREC test set:
OHSUMED Results
Documents Retrieved: OHSUMED
Discussion • We demonstrate the design complexity hidden under the name “Rocchio method” • We provide specific models for interpolating between design choices • These interpolation options can also work for methods that differ much more from each other (for example, Rocchio and SVM).
Discussion (cont.) • These models should help researchers explore their systems, and regions “between systems” • This suggests a new approach to designing IR systems: finding a set of (interpolation) parameters that optimizes performance • This can be done with existing optimization methods.
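As a deliberately simple instance of such a search, the sketch below scans a grid of positions along a single diagonal path and keeps the best one. It is a derivative-free illustration, not the optimization method used in this work; `best_on_diagonal` and `evaluate` are hypothetical names, and `evaluate` is assumed to return a performance number (higher is better) for an interpolation position.

```python
def best_on_diagonal(evaluate, steps=10):
    """Return the grid position in [0, 1] with the highest value of `evaluate`."""
    candidates = [i / steps for i in range(steps + 1)]
    return max(candidates, key=evaluate)


# Toy stand-in for a real evaluation; its peak location is arbitrary.
toy_performance = lambda lam: -(lam - 0.3) ** 2
best = best_on_diagonal(toy_performance)
```

A full design-space search would vary each interpolation parameter separately instead of moving them together, at a correspondingly higher cost.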
A Note on Interpolation Limits The need for two endpoint systems is not very restrictive: • Some interpolation parameters can be moved beyond the [0,1] interval. • The endpoints themselves can be moved.
Abstract Interpolation • More abstractly: do not interpolate every single parameter; work at higher levels of abstraction • Example: representation block, scoring block, thresholding block, etc. • This can be used with several systems • This is at a lower level than ensembles of classifiers.
Caveat In moving to a large design space we still face two major problems: • The range of parameters cannot be explored exhaustively, and non-smooth optimization is needed • Optimization requires a lot of labeled data, which is usually produced manually and is in short supply.
Acknowledgments • KD-D group via NSF grant EIA-0087022 • Andrei Anghelescu, Vladimir Menkov • Jamie Callan • Members of DIMACS MMS project • CAS researchers • Ian Soboroff • Anonymous reviewers