Some thoughts on failure analysis (success analysis) for CLIR
Donna Harman
Scientist Emeritus, National Institute of Standards and Technology
Welcome to the family!! TREC FIRE
Congratulations!! • To the FIRE organizers who actually made this happen (in an incredibly short amount of time) • To the participants who got their systems running, using (probably) insufficient resources, and FINISHED!!
Now what?? • Do some success/failure analysis, looking more deeply at what worked (and didn’t work) • Write papers so that others can learn from your successes and failures • Come up with plans for the next FIRE!!
TREC-1 (1992) • Hardware • Most machines ran at 75 MHz • Most machines had 32 MB of memory • A 2 GB disk drive cost $5,000 • Software • IR systems previously worked on CACM and other small collections • This means 2 or 3 thousand documents • And those documents were abstracts
TREC Ad Hoc Task (diagram): 50 topics used to create 50 ad hoc queries, either automatically (no manual intervention) or manually, and run against ~2 GB of documents (WSJ, AP, Ziff, Federal Register)
Some TREC-1 participants and methods
• Carnegie Mellon University (Evans: CLARIT system)
• City University, London (Robertson: OKAPI)
• Cornell University (Salton/Buckley: SMART)
• Universitaet Dortmund (Fuhr: SMART)
• Siemens Corporate Research, Inc (Voorhees: SMART)
• New York University (Strzalkowski: NLP methods)
• Queens College, CUNY (Kwok: PIRCS, spreading activation)
• RMIT (Moffat, Wilkinson, Zobel: compression study)
• University of California, Berkeley (Cooper/Gey: logistic regression)
• University of Massachusetts, Amherst (Croft: inference network)
• University of Pittsburgh (Korfhage: genetic algorithms)
• VPI&SU (Fox: combining multiple manual searches)
• Bellcore (Dumais: LSI)
• ConQuest Software, Inc (Nelson)
• GE R & D Center (Jacobs/Rau: Boolean approximation)
• TRW Systems Development Division (Mettler: hardware array processor)
What SMART did in TREC-1 • Two official runs testing single term indexing vs single term plus two-term statistical phrases • Many other runs investigating the effects of standard procedures • Which stemmer to use • How large a stopword list • Different “simple” weighting schemes
After TREC-1 • Community had more confidence in the evaluation (TREC was unknown before) • Training data available • Major changes to algorithms for most systems to cope with the wide variation in documents, and in particular for the very long documents
Pseudo-relevance feedback • Pseudo-relevance feedback “pretends” that the top X documents are relevant and then uses these to add expansion terms and/or to reweight the original query terms
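As a rough illustration of the mechanics (not any particular system's implementation), a minimal sketch in Python, where search() and doc_terms() are hypothetical hooks into an existing retrieval system:

```python
from collections import Counter

def pseudo_relevance_feedback(query_terms, search, doc_terms,
                              top_x=10, num_expansion=20):
    """"Pretend" the top X documents are relevant and expand the query.

    search(query_terms) -> ranked list of doc_ids          (hypothetical)
    doc_terms(doc_id)   -> list of indexed terms in a doc  (hypothetical)
    """
    top_docs = search(query_terms)[:top_x]

    # Count how often each term occurs in the pseudo-relevant set.
    counts = Counter()
    for doc_id in top_docs:
        counts.update(doc_terms(doc_id))

    # Keep the most frequent terms that are not already in the query.
    expansion = [t for t, _ in counts.most_common()
                 if t not in query_terms][:num_expansion]

    # A real system would also reweight the original query terms here.
    return list(query_terms) + expansion
```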
The TREC Tracks
• Retrieval in a domain: Blog, Spam, Personal documents, Legal, Genome
• Answers, not docs: Novelty, Q&A
• Web searching, size: Enterprise, Terabyte, Web, VLC
• Beyond text: Video, Speech, OCR
• Beyond just English: X{X,Y,Z}, Chinese, Spanish
• Human-in-the-loop: Interactive, HARD
• Streamed text: Filtering, Routing
• Static text: Ad Hoc, Robust
TREC Spanish and Chinese • Initial approaches to Spanish (1994) used methods for English but with new stopword lists and new stemmers • Initial approaches to Chinese (1996) worked with character bi-grams in place of words, and sometimes used stoplists (many of the groups had no access to speakers of Chinese)
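For the bi-gram approach, a minimal sketch of what indexing unsegmented text by character bi-grams looks like; the stop-character handling is an assumption, not a description of any specific group's system:

```python
def char_bigrams(text, stopchars=set()):
    """Index unsegmented text (e.g., Chinese) as overlapping character bi-grams."""
    chars = [c for c in text if not c.isspace() and c not in stopchars]
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]

# e.g., char_bigrams("人口增长") -> ["人口", "口增", "增长"]
```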
CLIR for English, French, and German (1996) • Initial collection was Swiss newswire in three languages, plus the AP newswire from the TREC English collection • Initial approaches for monolingual work were stemmers and stoplists for French and German, and the use of n-grams • Initial use of machine-readable bi-lingual dictionaries for translation of queries
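A minimal sketch of dictionary-based query translation, under the assumption that the machine-readable dictionary has been loaded as a mapping from source-language terms to lists of target-language translations (all names here are illustrative):

```python
def translate_query(query_terms, bilingual_dict, max_translations=3):
    """Replace each source-language query term with its dictionary translations.

    Terms missing from the dictionary (names, etc.) are kept untranslated,
    since they often match anyway; this is an assumption, not a fixed rule.
    """
    translated = []
    for term in query_terms:
        translations = bilingual_dict.get(term)
        if translations:
            # Keeping every translation hurts precision, so cap the number kept.
            translated.extend(translations[:max_translations])
        else:
            translated.append(term)
    return translated
```

Capping the number of translations per term is one simple way to keep ambiguous dictionary entries from swamping the query; it is only one of many possible choices.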
NTCIR-1 (1999) • 339,483 documents in Japanese and English • 23 groups, of which 17 did monolingual Japanese and 10 groups did CLIR • Initial approaches worked with known methods for English, but had to deal with the issues of segmenting Japanese or working with bi-grams or n-grams
CLEF 2000 • Co-operative activity across five European countries • Multilingual, bilingual and monolingual tasks; 40 topics in 8 European languages • 20 groups, over half working in the multilingual task; others were groups new to IR who worked monolingually in their own language • Many different kinds of resources used for the CLIR part of the task
So you finished--now what?? • Analyze the results, otherwise NOTHING will have been learned • Do some success/failure analysis, looking more deeply at what worked (and didn’t work) • Try to understand WHY something worked or did not work!!
Macro-analysis • Bugs in system; if your results are seriously worse than others, check for bugs • Effects of document length; look at the ranking of your documents with respect to their length • Effects of topic lengths • Effects of different tokenizers/“stemmers” • Baseline monolingual results vs CLIR results; both parts should be analyzed separately
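For the document-length check, one simple way to see whether length is skewing the rankings is to bucket retrieved documents by length and compare average ranks; a sketch, assuming a run stored as (topic, doc_id, rank) triples and a doc_length lookup (both hypothetical):

```python
from collections import defaultdict

def mean_rank_by_length(run, doc_length, buckets=(500, 2000, 10000)):
    """Average retrieval rank per document-length bucket.

    run        : iterable of (topic, doc_id, rank) triples   (hypothetical)
    doc_length : dict mapping doc_id -> length in tokens     (hypothetical)
    """
    ranks = defaultdict(list)
    for _topic, doc_id, rank in run:
        length = doc_length[doc_id]
        bucket = next((b for b in buckets if length <= b), "longer")
        ranks[bucket].append(rank)
    return {bucket: sum(r) / len(r) for bucket, r in ranks.items()}
```

If one length bucket dominates the top ranks, the length normalization in the weighting scheme is a likely suspect.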
Stemming: performance increases and decreases on a per-topic basis (out of 225 topics)
Now dig into a per-topic analysis. This is a lot of work, but it is really the only way to understand what is happening.
Micro-analysis, step 1 • Select specific topics to investigate • Look at results on a per topic basis with respect to the median of all the groups; pick a small number of topics (10?) that did much worse than the median; these are the initial set to explore; • Optionally pick a similar set that did much BETTER than the median to see where your successes are
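A minimal sketch of this selection step, assuming per-topic average precision for your run and the per-topic median over all groups are available as dicts keyed by topic id (illustrative names only):

```python
def pick_failure_topics(my_ap, median_ap, n=10, margin=0.05):
    """Pick the n topics where this run falls furthest below the median AP."""
    deltas = {t: my_ap[t] - median_ap[t] for t in my_ap if t in median_ap}
    worst = sorted(deltas, key=deltas.get)[:n]
    return [t for t in worst if deltas[t] < -margin]
```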
Micro-analysis, step 2 • Now for each topic, pick a set of documents to analyze • Look at top X documents (around 20) • Look at the non-relevant ones (failure analysis) • Optionally, look at the relevant ones (success analysis) • Also look at the relevant documents that were NOT retrieved in the top Y (say 100) documents
Micro-analysis, step 3 • For the relevant documents that were NOT retrieved in the top Y set, analyze for each document why it was not retrieved: what query terms were not in the document, very short or long document, etc. • Do something similar for the top X non-relevant documents; why were they ranked highly? • Develop a general hypothesis for this topic as to what the problems were • Now try to generalize across topics
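For steps 2 and 3, a sketch of the bookkeeping, assuming a topic's ranked doc_ids, its relevant set from the qrels, and a doc_terms() lookup are available (all names are assumptions):

```python
def diagnose_topic(ranked, relevant, query_terms, doc_terms,
                   top_x=20, top_y=100):
    """Split a topic's results into the document sets worth reading by hand."""
    top = ranked[:top_x]
    false_hits = [d for d in top if d not in relevant]   # failure analysis
    true_hits  = [d for d in top if d in relevant]       # success analysis
    missed     = [d for d in relevant if d not in ranked[:top_y]]

    # For each missed relevant document, note which query terms it lacks.
    missing_terms = {d: [t for t in query_terms if t not in set(doc_terms(d))]
                     for d in missed}
    return false_hits, true_hits, missed, missing_terms
```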
Possible Monolingual issues • Tokenization and “stemmer” problems • Document length normalization problems • Abbreviation/common-word problems • Term weighting problems, such as: where are the global weights (IDF, etc.) coming from?? • Term expansion problems; generally not enough expansion (low recall)
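The question of where the global weights come from can be checked directly; a sketch of one common IDF variant computed from the target collection's own document-frequency table (names are illustrative, and other IDF formulations exist):

```python
import math

def idf(term, doc_freq, num_docs):
    """One common IDF variant, computed from the target collection itself.

    doc_freq : dict mapping term -> number of documents containing it
    num_docs : total number of documents in the target collection
    If these counts come from some other collection (training data,
    source-language documents), the global weights can be badly skewed.
    """
    return math.log(num_docs / (1 + doc_freq.get(term, 0)))
```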
Possible CLIR Problems • Bi-lingual dictionary too small or missing too many critical words (names, etc.) • Multiple translations in dictionary leading to bad precision; particularly important when using term expansion techniques • Specific issues with cross-language “synonyms”, acronyms, etc.; need a better technique for acquiring these
Reliable Information Access Workshop (RIA), 2003 • Goals: understand/control variability • Participating systems: Clairvoyance, Lemur (CMU, UMass), MultiText, OKAPI, SMART (2 versions) • Methodology: • controlled experiments in pseudo-relevance feedback across 7 systems • massive, cooperative failure analysis • http://ir.nist.gov/ria
RIA Failure analysis • Chose 44 “failure” topics from 150 old TREC topics • Mean Average Precision <= average • Also picked topics with the most variance across systems • Used results from the 6 systems’ standard runs • For each topic, people spent 45-60 minutes looking at results from their assigned system • Short group discussion to come to consensus • Individual and overall reports are on-line.
Topic 362 • Title: Human smuggling • Description: Identify incidents of human smuggling • 39 relevant: FT (3), FBIS (17), LA (19)
Issues with 362 • Most documents dealt with “smuggling” but missed the “human” concept • City’s title-only run worked OK, but had no expansion • CMU expansion: smuggle (0.14), incident (0.13), identify (0.13), human (0.13) • Sabir expansion: smuggl (0.84), incid (0.29), identif (0.19), human (0.19) • Waterloo SSR and passages worked well • Other important terms: aliens, illegal emigrants/immigrants
Topic 435 • Title: curbing population growth • Description: What measures have been taken worldwide and what countries have been effective in curbing population growth? • 117 relevant: FT (25), FBIS (81), LA (1)
Issues with 435 • Use of phrases was important here • Sabir was the only group using phrases • City’s use of title only “approximated” this; note that expansion was not important • Waterloo’s SSR also “approximated” this, but why did they get so few relevant documents by rank 1000?
Topic 436 • Title: railway accidents • Description: What are the causes of railway accidents throughout the world? • 180 relevant: FT (49), FR (1), FBIS (5), LA (125)
Issues with 436 • Query expansion is critical here, but it is tricky to pick correct expansion terms • City did no expansion, and the title was not helpful • CMU and Sabir did good expansion, but with the full documents • Waterloo’s passage-level expansion was good • Most relevant documents were not retrieved • “some very short relevant documents (LA)” • “55 relevant documents contain no query keywords”
Thoughts to take home • You have spent months of time on software and runs; now make that effort pay off • Analysis is more than statistical tables • Failure/Success analysis: look at topics where your method failed or worked well • Dig REALLY deep to understand WHY something worked or didn’t work • Think about generalizing what you have learned • Then PUBLISH
What needs to be in that paper • Basic layout of your FIRE experiment • Related work--where did you get your ideas and why did you pick these techniques; what resources did you use • What happened when you applied these to a new language; what kinds of language specific issues did you find • What worked (and why) and what did not work (and why)