Some thoughts on failure analysis (success analysis) for CLIR
Donna Harman
Scientist Emeritus
National Institute of Standards and Technology
Welcome to the family!! TREC FIRE
Congratulations!!
• To the FIRE organizers who actually made this happen (in an incredibly short amount of time)
• To the participants who got their systems running, using (probably) insufficient resources, and FINISHED!!
Now what??
• Do some success/failure analysis, looking more deeply at what worked (and didn’t work)
• Write papers so that others can learn from your successes and failures
• Come up with plans for the next FIRE!!
TREC-1 (1992)
• Hardware
  • Most machines ran at 75 MHz
  • Most machines had 32 MB of memory
  • A 2 GB disk drive cost $5,000
• Software
  • IR systems previously worked on CACM and other small collections
  • This means 2 or 3 thousand documents
  • And those documents were abstracts
TREC Ad Hoc Task
• ~2 GB of documents: WSJ, AP, Ziff, Federal Register
• 50 topics (the ad hoc queries)
• Automatic runs (no manual intervention) and manual runs
Some TREC-1 participants and methods
Carnegie Mellon University (Evans: CLARIT system)
City University, London (Robertson: OKAPI)
Cornell University (Salton/Buckley: SMART)
Universitaet Dortmund (Fuhr: SMART)
Siemens Corporate Research, Inc (Voorhees: SMART)
New York University (Strzalkowski: NLP methods)
Queens College, CUNY (Kwok: PIRCS, spreading activation)
RMIT (Moffat, Wilkinson, Zobel: compression study)
University of California, Berkeley (Cooper/Gey: logistic regression)
University of Massachusetts, Amherst (Croft: inference network)
University of Pittsburgh (Korfhage: genetic algorithms)
VPI&SU (Fox: combining multiple manual searches)
Bellcore (Dumais: LSI)
ConQuest Software, Inc (Nelson)
GE R & D Center (Jacobs/Rau: Boolean approximation)
TRW Systems Development Division (Mettler: hardware array processor)
What SMART did in TREC-1
• Two official runs testing single term indexing vs single term plus two-term statistical phrases
• Many other runs investigating the effects of standard procedures
  • Which stemmer to use
  • How large a stopword list
  • Different “simple” weighting schemes
After TREC-1
• Community had more confidence in the evaluation (TREC was unknown before)
• Training data available
• Major changes to algorithms for most systems to cope with the wide variation in documents, and in particular for the very long documents
Pseudo-relevance feedback
• Pseudo-relevance feedback “pretends” that the top X documents are relevant and then uses these to add expansion terms and/or to reweight the original query terms
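A minimal sketch of this idea, assuming a bag-of-words query and an already-ranked document list; the function, its parameters, and the frequency-based expansion scoring are illustrative, not any particular system's implementation.

```python
from collections import Counter

def pseudo_relevance_feedback(query_terms, ranked_docs, k=10, n_expansion=10,
                              orig_boost=1.0, expansion_weight=0.5):
    """Expand a query by treating the top-k retrieved documents as relevant.

    query_terms  -- dict mapping term -> weight for the original query
    ranked_docs  -- list of documents (each a list of tokens), best first
    Returns a new term -> weight dict.
    """
    # Pretend the top k documents are relevant.
    pseudo_relevant = ranked_docs[:k]

    # Count how often each term appears across the pseudo-relevant set.
    term_counts = Counter()
    for doc in pseudo_relevant:
        term_counts.update(doc)

    # Pick the most frequent terms that are not already in the query.
    candidates = [(t, c) for t, c in term_counts.most_common()
                  if t not in query_terms]
    expansion = candidates[:n_expansion]

    # Reweight: boost the original terms, add expansion terms with a smaller weight.
    new_query = {t: w * orig_boost for t, w in query_terms.items()}
    for term, count in expansion:
        new_query[term] = expansion_weight * count / k
    return new_query
```

Real systems typically score expansion terms with something stronger than raw frequency (Rocchio weights, idf-weighted scores, etc.), but the control flow is the same.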
The TREC Tracks
• Static text: Ad Hoc, Robust
• Streamed text: Filtering, Routing
• Human-in-the-loop: Interactive, HARD
• Beyond just English: Spanish, Chinese, X{X,Y,Z}
• Beyond text: OCR, Speech, Video
• Web searching, size: VLC, Web, Terabyte, Enterprise
• Answers, not docs: Q&A, Novelty
• Retrieval in a domain: Genome, Legal
• Personal documents: Spam, Blog
TREC Spanish and Chinese
• Initial approaches to Spanish (1994) used methods for English but with new stopword lists and new stemmers
• Initial approaches to Chinese (1996) worked with character bi-grams in place of words, and sometimes used stoplists (many of the groups had no access to speakers of Chinese)
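A minimal sketch of the character bi-gram idea, assuming unsegmented text; the function and the small punctuation set standing in for a stoplist are illustrative only.

```python
def character_bigrams(text, stopchars=frozenset("，。、；：？！")):
    """Index unsegmented text as overlapping character bi-grams.

    Punctuation (and anything in stopchars) breaks the bi-gram sequence,
    standing in for a stoplist when no word segmenter is available.
    """
    bigrams = []
    run = []
    for ch in text:
        if ch.isspace() or ch in stopchars:
            run = []          # punctuation/space ends the current run
            continue
        run.append(ch)
        if len(run) >= 2:
            bigrams.append(run[-2] + run[-1])
    return bigrams

# Example: "人口增长" -> ["人口", "口增", "增长"]
```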
CLIR for English, French, and German (1996)
• Initial collection was Swiss newswire in three languages, plus the AP newswire from the TREC English collection
• Initial approaches for monolingual work were stemmers and stoplists for French and German, and the use of n-grams
• Initial use of machine-readable bi-lingual dictionaries for translation of queries
NTCIR-1 (1999)
• 339,483 documents in Japanese and English
• 23 groups, of which 17 did monolingual Japanese and 10 groups did CLIR
• Initial approaches worked with known methods for English, but had to deal with the issues of segmenting Japanese or working with bi-grams or n-grams
CLEF 2000
• Co-operative activity across five European countries
• Multilingual, bilingual and monolingual tasks; 40 topics in 8 European languages
• 20 groups, over half working in the multilingual task; others were groups new to IR who worked monolingually in their own language
• Many different kinds of resources used for the CLIR part of the task
So you finished--now what??
• Analyze the results, otherwise NOTHING will have been learned
• Do some success/failure analysis, looking more deeply at what worked (and didn’t work)
• Try to understand WHY something worked or did not work!!
Macro-analysis
• Bugs in system; if your results are seriously worse than others, check for bugs
• Effects of document length; look at the ranking of your documents with respect to their length
• Effects of topic lengths
• Effects of different tokenizers/“stemmers”
• Baseline monolingual results vs CLIR results; both parts should be analyzed separately
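A minimal sketch of the document-length check, assuming a TREC-style run (topic, docid, rank) and a table of document lengths; the data layout and function are assumptions, not part of any evaluation toolkit.

```python
import statistics

def length_bias_by_rank(run_entries, doc_lengths, bucket_size=100):
    """Compare the mean length of retrieved documents, bucketed by rank,
    against the mean length over the whole collection.

    run_entries -- iterable of (topic, docid, rank) tuples from your run
    doc_lengths -- dict mapping docid -> document length in tokens
    """
    collection_mean = statistics.mean(doc_lengths.values())
    buckets = {}
    for topic, docid, rank in run_entries:
        if docid not in doc_lengths:
            continue
        bucket = (rank - 1) // bucket_size
        buckets.setdefault(bucket, []).append(doc_lengths[docid])

    print(f"collection mean length: {collection_mean:.1f} tokens")
    for bucket in sorted(buckets):
        lo, hi = bucket * bucket_size + 1, (bucket + 1) * bucket_size
        print(f"ranks {lo:4d}-{hi:4d}: mean length "
              f"{statistics.mean(buckets[bucket]):.1f}")
```

If the top buckets are much longer (or shorter) than the collection mean, length normalization is a good first suspect.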
Stemming
Performance increases and decreases on a per topic basis (out of 225 topics)
Now dig into a per topic analysis
This is a lot of work but it is really the only way to understand what is happening
Micro-analysis, step 1
• Select specific topics to investigate
• Look at results on a per topic basis with respect to the median of all the groups; pick a small number of topics (10?) that did much worse than the median; these are the initial set to explore
• Optionally pick a similar set that did much BETTER than the median to see where your successes are
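A minimal sketch of this selection step, assuming you have your own per-topic average precision figures and the per-topic medians from the overview tables; the data layout and thresholds are illustrative.

```python
def pick_failure_topics(my_ap, median_ap, n=10, margin=0.05):
    """Pick topics where this run did much worse than the per-topic median.

    my_ap, median_ap -- dicts mapping topic id -> average precision
    Returns the n topics with the largest shortfall below the median
    (at least `margin` worse), worst first.
    """
    shortfall = {t: median_ap[t] - ap
                 for t, ap in my_ap.items()
                 if t in median_ap and median_ap[t] - ap >= margin}
    return sorted(shortfall, key=shortfall.get, reverse=True)[:n]

# The same function with my_ap and median_ap swapped picks the
# "success" topics that did much better than the median.
```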
Micro-analysis, step 2
• Now for each topic, pick a set of documents to analyze
• Look at top X documents (around 20)
  • Look at the non-relevant ones (failure analysis)
  • Optionally, look at the relevant ones (success analysis)
• Also look at the relevant documents that were NOT retrieved in the top Y (say 100) documents
Micro-analysis, step 3
• For the relevant documents that were NOT retrieved in the top Y set, analyze for each document why it was not retrieved: what query terms were not in the document, very short or long document, etc.
• Do something similar for the top X non-relevant documents; why were they ranked highly?
• Develop a general hypothesis for this topic as to what the problems were
• Now try to generalize across topics
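A minimal sketch of the bookkeeping behind steps 2 and 3 for the missed relevant documents, assuming tokenized documents, the qrels for the topic, and your ranked list; all names here are illustrative.

```python
def analyze_missed_relevant(query_terms, ranked_docids, relevant_docids,
                            doc_tokens, top_y=100):
    """For relevant documents missing from the top-Y, report which query
    terms they lack and how long they are.

    query_terms     -- iterable of (stemmed) query terms
    ranked_docids   -- list of docids in rank order for this topic
    relevant_docids -- set of docids judged relevant
    doc_tokens      -- dict mapping docid -> list of (stemmed) tokens
    """
    retrieved = set(ranked_docids[:top_y])
    query_terms = set(query_terms)
    report = []
    for docid in relevant_docids - retrieved:
        tokens = set(doc_tokens.get(docid, []))
        missing = sorted(query_terms - tokens)
        report.append({
            "docid": docid,
            "length": len(doc_tokens.get(docid, [])),
            "missing_query_terms": missing,
            "has_no_query_terms": len(missing) == len(query_terms),
        })
    # Put the hardest cases (fewest query terms present) first.
    report.sort(key=lambda r: -len(r["missing_query_terms"]))
    return report
```

The same kind of loop over the top X non-relevant documents (checking which query terms they do contain) covers the other half of step 3.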
Possible Monolingual issues
• Tokenization and “stemmer” problems
• Document length normalization problems
• Abbreviation/common-word problems
• Term weighting problems, such as
  • Where are the global weights (IDF, etc.) coming from??
• Term expansion problems; generally not enough expansion (low recall)
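As one concrete version of the global-weights question, a minimal sketch that computes IDF directly from the target collection, assuming tokenized documents; comparing this table against whatever weights your system actually uses is a quick sanity check. The function and layout are assumptions.

```python
import math

def idf_from_collection(doc_token_sets):
    """Compute a standard idf = log(N / df) from the target collection.

    doc_token_sets -- iterable of sets of (stemmed) tokens, one per document.
    If your global weights come from a different corpus or a translation
    resource, compare the two idf values for your query terms.
    """
    df = {}
    n_docs = 0
    for tokens in doc_token_sets:
        n_docs += 1
        for term in tokens:
            df[term] = df.get(term, 0) + 1
    return {term: math.log(n_docs / count) for term, count in df.items()}
```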
Possible CLIR Problems
• Bi-lingual dictionary too small or missing too many critical words (names, etc.)
• Multiple translations in dictionary leading to bad precision; particularly important when using term expansion techniques
• Specific issues with cross-language “synonyms”, acronyms, etc.; need a better technique for acquiring these
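A minimal sketch of dictionary-based query translation that keeps the multiple-translation problem visible by splitting each source term's weight across its translations rather than adding every translation at full weight; the dictionary format and function are assumptions.

```python
def translate_query(query_terms, bilingual_dict):
    """Translate a bag-of-words query with a bilingual dictionary.

    query_terms    -- dict mapping source-language term -> weight
    bilingual_dict -- dict mapping source term -> list of target-language
                      translations (may be missing or empty for OOV terms)
    Returns (translated query, untranslated terms). Splitting the weight
    across alternatives keeps a term with many dictionary senses from
    dominating the translated query.
    """
    translated = {}
    untranslated = []
    for term, weight in query_terms.items():
        options = bilingual_dict.get(term, [])
        if not options:
            # Names, acronyms, etc. often fall out here -- worth logging.
            untranslated.append(term)
            continue
        share = weight / len(options)
        for target in options:
            translated[target] = translated.get(target, 0.0) + share
    return translated, untranslated
```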
Reliable Information Access Workshop (RIA), 2003
• Goals: understand/control variability
• Participating systems: Clairvoyance, Lemur (CMU, UMass), MultiText, OKAPI, SMART (2 versions)
• Methodology:
  • controlled experiments in pseudo-relevance feedback across 7 systems
  • massive, cooperative failure analysis
• http://ir.nist.gov/ria
RIA Failure analysis
• Chose 44 “failure” topics from 150 old TREC topics
  • Mean Average Precision <= average
  • Also picked topics with the most variance across systems
• Used results from 6 systems’ standard runs
• For each topic, people spent 45-60 minutes looking at results from their assigned system
• Short group discussion to come to consensus
• Individual and overall reports on-line
Topic 362
• Title: Human smuggling
• Description: Identify incidents of human smuggling
• 39 relevant: FT (3), FBIS (17), LA (19)
Issues with 362
• Most documents dealt with “smuggling” but missed the “human” concept
• City’s title-only run worked OK, but no expansion
• CMU expansion: smuggle (0.14), incident (0.13), identify (0.13), human (0.13)
• Sabir expansion: smuggl (0.84), incid (0.29), identif (0.19), human (0.19)
• Waterloo SSR and passages worked well
• Other important terms: aliens, illegal emigrants/immigrants
Topic 435
• Title: curbing population growth
• Description: What measures have been taken worldwide and what countries have been effective in curbing population growth?
• 117 relevant: FT (25), FBIS (81), LA (1)
Issues with 435
• Use of phrases important here
• Sabir was the only group using phrases
• City’s use of title only “approximated” this; note that expansion was not important
• Waterloo’s SSR also “approximated” this, but why did they get so few relevant documents by rank 1000?
Topic 436
• Title: railway accidents
• Description: What are the causes of railway accidents throughout the world?
• 180 relevant: FT (49), FR (1), FBIS (5), LA (125)
Issues with 436
• Query expansion is critical here, but tricky to pick correct expansion terms
  • City did no expansion; the title was not helpful
  • CMU and Sabir did good expansion, but with the full documents
  • Waterloo’s passage-level expansion was good
• Most relevant documents not retrieved
  • “some very short relevant documents (LA)”
  • “55 relevant documents contain no query keywords”
Thoughts to take home
• You have spent months of time on software and runs; now make that effort pay off
• Analysis is more than statistical tables
• Failure/Success analysis: look at topics where your method failed or worked well
• Dig REALLY deep to understand WHY something worked or didn’t work
• Think about generalization of what you have learned
• Then PUBLISH
What needs to be in that paper
• Basic layout of your FIRE experiment
• Related work--where did you get your ideas and why did you pick these techniques; what resources did you use
• What happened when you applied these to a new language; what kinds of language-specific issues did you find
• What worked (and why) and what did not work (and why)