Some thoughts on failure analysis (success analysis) for CLIR
Donna Harman
Scientist Emeritus
National Institute of Standards and Technology
Welcome to the family!! TREC FIRE
Congratulations!!
• To the FIRE organizers who actually made this happen (in an incredibly short amount of time)
• To the participants who got their systems running, using (probably) insufficient resources, and FINISHED!!
Now what??
• Do some success/failure analysis, looking more deeply at what worked (and didn’t work)
• Write papers so that others can learn from your successes and failures
• Come up with plans for the next FIRE!!
TREC-1 (1992)
• Hardware
  • Most machines ran at 75 MHz
  • Most machines had 32 MB of memory
  • A 2 GB disk drive cost $5,000
• Software
  • IR systems previously worked on CACM and other small collections
  • This means 2 or 3 thousand documents
  • And those documents were abstracts
TREC Ad Hoc Task
• ~2 GB of documents: WSJ, AP, Ziff, Federal Register
• 50 topics (the ad hoc queries)
• Automatic runs (no manual intervention) and manual runs
Some TREC-1 participants and methods
Carnegie Mellon University (Evans: CLARIT system)
City University, London (Robertson: OKAPI)
Cornell University (Salton/Buckley: SMART)
Universitaet Dortmund (Fuhr: SMART)
Siemens Corporate Research, Inc (Voorhees: SMART)
New York University (Strzalkowski: NLP methods)
Queens College, CUNY (Kwok: PIRCS, spreading activation)
RMIT (Moffat, Wilkinson, Zobel: compression study)
University of California, Berkeley (Cooper/Gey: logistic regression)
University of Massachusetts, Amherst (Croft: inference network)
University of Pittsburgh (Korfhage: genetic algorithms)
VPI&SU (Fox: combining multiple manual searches)
Bellcore (Dumais: LSI)
ConQuest Software, Inc (Nelson)
GE R & D Center (Jacobs/Rau: Boolean approximation)
TRW Systems Development Division (Mettler: hardware array processor)
What SMART did in TREC-1
• Two official runs testing single term indexing vs single term plus two-term statistical phrases
• Many other runs investigating the effects of standard procedures
  • Which stemmer to use
  • How large a stopword list
  • Different “simple” weighting schemes
After TREC-1
• Community had more confidence in the evaluation (TREC was unknown before)
• Training data available
• Major changes to algorithms for most systems to cope with the wide variation in documents, and in particular for the very long documents
Pseudo-relevance feedback
• Pseudo-relevance feedback “pretends” that the top X documents are relevant and then uses these to add expansion terms and/or to reweight the original query terms
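A minimal sketch of this idea, assuming a bag-of-words query and an already-ranked document list; the function, its parameters, and the frequency-based expansion scoring are illustrative, not any particular system's implementation.

```python
from collections import Counter

def pseudo_relevance_feedback(query_terms, ranked_docs, k=10, n_expansion=10,
                              orig_boost=1.0, expansion_weight=0.5):
    """Expand a query by treating the top-k retrieved documents as relevant.

    query_terms  -- dict mapping term -> weight for the original query
    ranked_docs  -- list of documents (each a list of tokens), best first
    Returns a new term -> weight dict.
    """
    # Pretend the top k documents are relevant.
    pseudo_relevant = ranked_docs[:k]

    # Count how often each term appears across the pseudo-relevant set.
    term_counts = Counter()
    for doc in pseudo_relevant:
        term_counts.update(doc)

    # Pick the most frequent terms that are not already in the query.
    candidates = [(t, c) for t, c in term_counts.most_common()
                  if t not in query_terms]
    expansion = candidates[:n_expansion]

    # Reweight: boost the original terms, add expansion terms with a smaller weight.
    new_query = {t: w * orig_boost for t, w in query_terms.items()}
    for term, count in expansion:
        new_query[term] = expansion_weight * count / k
    return new_query
```

Real systems typically score expansion terms with something stronger than raw frequency (Rocchio weights, idf-weighted scores, etc.), but the control flow is the same.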
The TREC Tracks
• Static text: Ad Hoc, Robust
• Streamed text: Filtering, Routing
• Human-in-the-loop: Interactive, HARD
• Beyond just English: Spanish, Chinese, X{X,Y,Z}
• Beyond text: OCR, Speech, Video
• Web searching, size: VLC, Web, Terabyte, Enterprise
• Answers, not docs: Q&A, Novelty
• Retrieval in a domain: Genome, Legal
• Personal documents: Spam, Blog
TREC Spanish and Chinese
• Initial approaches to Spanish (1994) used methods for English but with new stopword lists and new stemmers
• Initial approaches to Chinese (1996) worked with character bi-grams in place of words, and sometimes used stoplists (many of the groups had no access to speakers of Chinese)
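A minimal sketch of the character bi-gram idea, assuming unsegmented text; the function and the small punctuation set standing in for a stoplist are illustrative only.

```python
def character_bigrams(text, stopchars=frozenset("，。、；：？！")):
    """Index unsegmented text as overlapping character bi-grams.

    Punctuation (and anything in stopchars) breaks the bi-gram sequence,
    standing in for a stoplist when no word segmenter is available.
    """
    bigrams = []
    run = []
    for ch in text:
        if ch.isspace() or ch in stopchars:
            run = []          # punctuation/space ends the current run
            continue
        run.append(ch)
        if len(run) >= 2:
            bigrams.append(run[-2] + run[-1])
    return bigrams

# Example: "人口增长" -> ["人口", "口增", "增长"]
```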
CLIR for English, French, and German (1996)
• Initial collection was Swiss newswire in three languages, plus the AP newswire from the TREC English collection
• Initial approaches for monolingual work were stemmers and stoplists for French and German, and the use of n-grams
• Initial use of machine-readable bi-lingual dictionaries for translation of queries
NTCIR-1 (1999)
• 339,483 documents in Japanese and English
• 23 groups, of which 17 did monolingual Japanese and 10 groups did CLIR
• Initial approaches worked with known methods for English, but had to deal with the issues of segmenting Japanese or working with bi-grams or n-grams
CLEF 2000
• Co-operative activity across five European countries
• Multilingual, bilingual and monolingual tasks; 40 topics in 8 European languages
• 20 groups, over half working in the multilingual task; others were groups new to IR who worked monolingually in their own language
• Many different kinds of resources used for the CLIR part of the task
So you finished--now what??
• Analyze the results, otherwise NOTHING will have been learned
• Do some success/failure analysis, looking more deeply at what worked (and didn’t work)
• Try to understand WHY something worked or did not work!!
Macro-analysis
• Bugs in system; if your results are seriously worse than others, check for bugs
• Effects of document length; look at the ranking of your documents with respect to their length
• Effects of topic lengths
• Effects of different tokenizers/“stemmers”
• Baseline monolingual results vs CLIR results; both parts should be analyzed separately
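A minimal sketch of the document-length check, assuming a TREC-style run (topic, docid, rank) and a table of document lengths; the data layout and function are assumptions, not part of any evaluation toolkit.

```python
import statistics

def length_bias_by_rank(run_entries, doc_lengths, bucket_size=100):
    """Compare the mean length of retrieved documents, bucketed by rank,
    against the mean length over the whole collection.

    run_entries -- iterable of (topic, docid, rank) tuples from your run
    doc_lengths -- dict mapping docid -> document length in tokens
    """
    collection_mean = statistics.mean(doc_lengths.values())
    buckets = {}
    for topic, docid, rank in run_entries:
        if docid not in doc_lengths:
            continue
        bucket = (rank - 1) // bucket_size
        buckets.setdefault(bucket, []).append(doc_lengths[docid])

    print(f"collection mean length: {collection_mean:.1f} tokens")
    for bucket in sorted(buckets):
        lo, hi = bucket * bucket_size + 1, (bucket + 1) * bucket_size
        print(f"ranks {lo:4d}-{hi:4d}: mean length "
              f"{statistics.mean(buckets[bucket]):.1f}")
```

If the top buckets are much longer (or shorter) than the collection mean, length normalization is a good first suspect.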
Stemming
Performance increases and decreases on a per topic basis (out of 225 topics)
Now dig into a per topic analysis
This is a lot of work but it is really the only way to understand what is happening
Micro-analysis, step 1
• Select specific topics to investigate
• Look at results on a per topic basis with respect to the median of all the groups; pick a small number of topics (10?) that did much worse than the median; these are the initial set to explore
• Optionally pick a similar set that did much BETTER than the median to see where your successes are
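A minimal sketch of this selection step, assuming you have your own per-topic average precision figures and the per-topic medians from the overview tables; the data layout and thresholds are illustrative.

```python
def pick_failure_topics(my_ap, median_ap, n=10, margin=0.05):
    """Pick topics where this run did much worse than the per-topic median.

    my_ap, median_ap -- dicts mapping topic id -> average precision
    Returns the n topics with the largest shortfall below the median
    (at least `margin` worse), worst first.
    """
    shortfall = {t: median_ap[t] - ap
                 for t, ap in my_ap.items()
                 if t in median_ap and median_ap[t] - ap >= margin}
    return sorted(shortfall, key=shortfall.get, reverse=True)[:n]

# The same function with my_ap and median_ap swapped picks the
# "success" topics that did much better than the median.
```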
Micro-analysis, step 2
• Now for each topic, pick a set of documents to analyze
• Look at top X documents (around 20)
  • Look at the non-relevant ones (failure analysis)
  • Optionally, look at the relevant ones (success analysis)
• Also look at the relevant documents that were NOT retrieved in the top Y (say 100) documents
Micro-analysis, step 3
• For the relevant documents that were NOT retrieved in the top Y set, analyze for each document why it was not retrieved: what query terms were not in the document, very short or long document, etc.
• Do something similar for the top X non-relevant documents; why were they ranked highly?
• Develop a general hypothesis for this topic as to what the problems were
• Now try to generalize across topics
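A minimal sketch of the bookkeeping behind steps 2 and 3 for the missed relevant documents, assuming tokenized documents, the qrels for the topic, and your ranked list; all names here are illustrative.

```python
def analyze_missed_relevant(query_terms, ranked_docids, relevant_docids,
                            doc_tokens, top_y=100):
    """For relevant documents missing from the top-Y, report which query
    terms they lack and how long they are.

    query_terms     -- iterable of (stemmed) query terms
    ranked_docids   -- list of docids in rank order for this topic
    relevant_docids -- set of docids judged relevant
    doc_tokens      -- dict mapping docid -> list of (stemmed) tokens
    """
    retrieved = set(ranked_docids[:top_y])
    query_terms = set(query_terms)
    report = []
    for docid in relevant_docids - retrieved:
        tokens = set(doc_tokens.get(docid, []))
        missing = sorted(query_terms - tokens)
        report.append({
            "docid": docid,
            "length": len(doc_tokens.get(docid, [])),
            "missing_query_terms": missing,
            "has_no_query_terms": len(missing) == len(query_terms),
        })
    # Put the hardest cases (fewest query terms present) first.
    report.sort(key=lambda r: -len(r["missing_query_terms"]))
    return report
```

The same kind of loop over the top X non-relevant documents (checking which query terms they do contain) covers the other half of step 3.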
Possible Monolingual issues
• Tokenization and “stemmer” problems
• Document length normalization problems
• Abbreviation/common-word problems
• Term weighting problems, such as
  • Where are the global weights (IDF, etc.) coming from??
• Term expansion problems; generally not enough expansion (low recall)
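As one concrete version of the global-weights question, a minimal sketch that computes IDF directly from the target collection, assuming tokenized documents; comparing this table against whatever weights your system actually uses is a quick sanity check. The function and layout are assumptions.

```python
import math

def idf_from_collection(doc_token_sets):
    """Compute a standard idf = log(N / df) from the target collection.

    doc_token_sets -- iterable of sets of (stemmed) tokens, one per document.
    If your global weights come from a different corpus or a translation
    resource, compare the two idf values for your query terms.
    """
    df = {}
    n_docs = 0
    for tokens in doc_token_sets:
        n_docs += 1
        for term in tokens:
            df[term] = df.get(term, 0) + 1
    return {term: math.log(n_docs / count) for term, count in df.items()}
```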
Possible CLIR Problems
• Bi-lingual dictionary too small or missing too many critical words (names, etc.)
• Multiple translations in dictionary leading to bad precision; particularly important when using term expansion techniques
• Specific issues with cross-language “synonyms”, acronyms, etc.; need a better technique for acquiring these
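A minimal sketch of dictionary-based query translation that keeps the multiple-translation problem visible by splitting each source term's weight across its translations rather than adding every translation at full weight; the dictionary format and function are assumptions.

```python
def translate_query(query_terms, bilingual_dict):
    """Translate a bag-of-words query with a bilingual dictionary.

    query_terms    -- dict mapping source-language term -> weight
    bilingual_dict -- dict mapping source term -> list of target-language
                      translations (may be missing or empty for OOV terms)
    Returns (translated query, untranslated terms). Splitting the weight
    across alternatives keeps a term with many dictionary senses from
    dominating the translated query.
    """
    translated = {}
    untranslated = []
    for term, weight in query_terms.items():
        options = bilingual_dict.get(term, [])
        if not options:
            # Names, acronyms, etc. often fall out here -- worth logging.
            untranslated.append(term)
            continue
        share = weight / len(options)
        for target in options:
            translated[target] = translated.get(target, 0.0) + share
    return translated, untranslated
```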
Reliable Information Access Workshop (RIA), 2003
• Goals: understand/control variability
• Participating systems: Clairvoyance, Lemur (CMU, UMass), MultiText, OKAPI, SMART (2 versions)
• Methodology:
  • controlled experiments in pseudo-relevance feedback across 7 systems
  • massive, cooperative failure analysis
• http://ir.nist.gov/ria
RIA Failure analysis
• Chose 44 “failure” topics from 150 old TREC topics
  • Mean Average Precision <= average
  • Also picked topics with the most variance across systems
• Used results from 6 systems’ standard runs
• For each topic, people spent 45-60 minutes looking at results from their assigned system
• Short group discussion to come to consensus
• Individual and overall reports on-line
Topic 362
• Title: Human smuggling
• Description: Identify incidents of human smuggling
• 39 relevant: FT (3), FBIS (17), LA (19)
Issues with 362
• Most documents dealt with “smuggling” but missed the “human” concept
• City’s title-only run worked OK, but no expansion
• CMU expansion: smuggle (0.14), incident (0.13), identify (0.13), human (0.13)
• Sabir expansion: smuggl (0.84), incid (0.29), identif (0.19), human (0.19)
• Waterloo SSR and passages worked well
• Other important terms: aliens, illegal emigrants/immigrants
Topic 435
• Title: curbing population growth
• Description: What measures have been taken worldwide and what countries have been effective in curbing population growth?
• 117 relevant: FT (25), FBIS (81), LA (1)
Issues with 435
• Use of phrases important here
• Sabir was the only group using phrases
• City’s use of title only “approximated” this; note that expansion was not important
• Waterloo’s SSR also “approximated” this, but why did they get so few relevant documents by rank 1000?
Topic 436
• Title: railway accidents
• Description: What are the causes of railway accidents throughout the world?
• 180 relevant: FT (49), FR (1), FBIS (5), LA (125)
Issues with 436
• Query expansion is critical here, but tricky to pick correct expansion terms
  • City did no expansion; the title was not helpful
  • CMU and Sabir did good expansion, but with the full documents
  • Waterloo’s passage-level expansion was good
• Most relevant documents not retrieved
  • “some very short relevant documents (LA)”
  • “55 relevant documents contain no query keywords”
Thoughts to take home
• You have spent months of time on software and runs; now make that effort pay off
• Analysis is more than statistical tables
• Failure/Success analysis: look at topics where your method failed or worked well
• Dig REALLY deep to understand WHY something worked or didn’t work
• Think about generalization of what you have learned
• Then PUBLISH
What needs to be in that paper
• Basic layout of your FIRE experiment
• Related work--where did you get your ideas and why did you pick these techniques; what resources did you use
• What happened when you applied these to a new language; what kinds of language-specific issues did you find
• What worked (and why) and what did not work (and why)