
Online Learning for Web Query Generation: Finding Documents Matching a Minority Concept on the Web

Rayid Ghani (Accenture Technology Labs, USA), Rosie Jones (Carnegie Mellon University, USA), Dunja Mladenic (J. Stefan Institute, Slovenia)


Presentation Transcript


  1. Online Learning for Web Query Generation: Finding Documents Matching a Minority Concept on the Web • Rayid Ghani, Accenture Technology Labs, USA • Rosie Jones, Carnegie Mellon University, USA • Dunja Mladenic, J. Stefan Institute, Slovenia

  2. Motivation • Need a collection of documents matching a particular concept • Search on the web, modify query, analyze documents, modify query,… • Repetitive, time-consuming, requires reasonable familiarity with the concept

  3. Task • Given: • 1 Document in Target Concept • 1 Other Document (negative example) • Access to a Web Search Engine • Create a Corpus of the Target Concept quickly with no human effort

  4. Algorithm [diagram components: Seed Docs, Query Generator, WWW, Filter/Classifier]

  5. [diagram labels: Initial Docs, Word Statistics, Build Query, Web, Filter, Relevant, Non-Relevant, Learning]
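The loop suggested by the diagram labels on slides 4 and 5 might be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' system: `next_query`, `build_corpus`, `toy_search`, `toy_filter`, and `TOY_WEB` are hypothetical names, queries are reduced to a single inclusion term, and the "Web" is a four-document list.

```python
from collections import Counter

def next_query(relevant, nonrelevant, issued):
    """Pick the most frequent unused word that occurs only in relevant docs."""
    negative_vocab = {w for d in nonrelevant for w in d.split()}
    counts = Counter(
        w for d in relevant for w in d.split()
        if w not in negative_vocab and w not in issued
    )
    return counts.most_common(1)[0][0] if counts else None

def build_corpus(seed_pos, seed_neg, search, is_relevant, max_queries=10):
    """Query, filter the results, update the document pools, repeat."""
    relevant, nonrelevant = [seed_pos], [seed_neg]
    seen, issued = {seed_pos, seed_neg}, set()
    while len(issued) < max_queries:
        query = next_query(relevant, nonrelevant, issued)
        if query is None:          # no candidate terms left
            break
        issued.add(query)
        for doc in search(query):
            if doc in seen:        # skip documents already classified
                continue
            seen.add(doc)
            (relevant if is_relevant(doc) else nonrelevant).append(doc)
    return relevant, nonrelevant, issued

# Hypothetical stand-ins for the search engine and the filter/classifier.
TOY_WEB = ["dober dan slovenija", "slovenija je lepa",
           "good day world", "lepa hisa"]

def toy_search(query):
    return [d for d in TOY_WEB if query in d.split()]

def toy_filter(doc):
    return not set(doc.split()) & {"good", "day", "world"}

relevant, nonrelevant, issued = build_corpus(
    "dober dan", "good day", toy_search, toy_filter)
print(relevant)
```

Each newly accepted document contributes fresh vocabulary, which is what lets later queries reach documents that share no words with the seed.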

  6. Query Generation • Examine the current relevant and non-relevant documents to generate a query likely to find documents that ARE similar to the relevant ones and NOT similar to the non-relevant ones • A query consists of m inclusion terms and n exclusion terms • e.g., +intelligence +web -military
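As a small illustration of the query syntax on this slide, the sketch below renders inclusion and exclusion terms as a single query string. The function name `format_query` is hypothetical, and the output uses a plain hyphen for exclusion.

```python
def format_query(inclusion_terms, exclusion_terms):
    """Render terms in the +word / -word style shown on the slide."""
    included = [f"+{term}" for term in inclusion_terms]
    excluded = [f"-{term}" for term in exclusion_terms]
    return " ".join(included + excluded)

query = format_query(["intelligence", "web"], ["military"])
print(query)
```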

  7. Query Term Selection Methods • Uniform (UN) – select k words randomly from the current vocabulary • Term-Frequency (TF) – select the top k words ranked according to their frequency • Probabilistic TF (PTF) – select k words with probability proportional to their frequency
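The three methods on this slide can be sketched over a word-count table. The function names are hypothetical, and sampling without replacement in PTF is an assumption; the slide does not say whether a word may be drawn twice.

```python
import random
from collections import Counter

def select_uniform(counts, k, rng):
    """UN: k words drawn uniformly at random from the vocabulary."""
    vocab = sorted(counts)
    return rng.sample(vocab, min(k, len(vocab)))

def select_tf(counts, k):
    """TF: the k most frequent words."""
    return [word for word, _ in counts.most_common(k)]

def select_ptf(counts, k, rng):
    """PTF: k distinct words, each draw weighted by frequency (assumed)."""
    pool = dict(counts)
    chosen = []
    while pool and len(chosen) < k:
        words = list(pool)
        weights = [pool[w] for w in words]
        word = rng.choices(words, weights=weights, k=1)[0]
        chosen.append(word)
        del pool[word]           # without replacement
    return chosen

counts = Counter({"slovenija": 5, "je": 3, "lepa": 1})
rng = random.Random(0)
print(select_tf(counts, 2))
print(select_uniform(counts, 2, rng))
print(select_ptf(counts, 2, rng))
```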

  8. Query Term Selection Methods • RTFIDF – top k words according to their rtfidf scores • Odds-Ratio (OR) – top k words according to their odds-ratio scores • Probabilistic OR (POR) – select k words with probability proportional to their Odds-Ratio scores
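For the Odds-Ratio method, a minimal sketch using the standard log odds-ratio, log[P(w|rel)(1 - P(w|non)) / ((1 - P(w|rel)) P(w|non))], is given below. Estimating the probabilities from document frequencies and the additive `smoothing` constant are assumptions made here to avoid division by zero; the slides do not specify either, and the rtfidf score is not defined in the slides, so it is not sketched.

```python
import math

def odds_ratio_scores(relevant_docs, nonrelevant_docs, smoothing=0.5):
    """Log odds-ratio per word; each document is a list of tokens."""
    pos_sets = [set(doc) for doc in relevant_docs]
    neg_sets = [set(doc) for doc in nonrelevant_docs]
    vocab = set().union(*pos_sets, *neg_sets)
    scores = {}
    for word in vocab:
        # Smoothed fraction of documents in each class containing the word.
        p_pos = (sum(word in d for d in pos_sets) + smoothing) / (len(pos_sets) + 2 * smoothing)
        p_neg = (sum(word in d for d in neg_sets) + smoothing) / (len(neg_sets) + 2 * smoothing)
        scores[word] = math.log(p_pos * (1 - p_neg) / ((1 - p_pos) * p_neg))
    return scores

def top_k(scores, k):
    """Top k words by score, ties broken alphabetically."""
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]

scores = odds_ratio_scores(
    [["slovenija", "je"], ["slovenija"]],
    [["the", "je"], ["the"]],
)
print(top_k(scores, 1))
```

Words that are equally common in both classes score zero, so they are poor candidates for either inclusion or exclusion.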

  9. Query Parameters • 4 Parameters • Inclusion Term-Selection Method • Exclusion Term-Selection Method • Inclusion Length • Exclusion Length • Example: Odds-Ratio, rtfidf, 3,6
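The four parameters can be held in a small record, and the space of settings enumerated. In this sketch the class name `QueryParameters` and the length range 1 to 6 are hypothetical; the slides give only the single example (Odds-Ratio, rtfidf, 3, 6).

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class QueryParameters:
    inclusion_method: str
    exclusion_method: str
    inclusion_length: int
    exclusion_length: int

# Abbreviations taken from slides 7 and 8.
METHODS = ("UN", "TF", "PTF", "RTFIDF", "OR", "POR")
# Hypothetical range of query lengths, chosen only for illustration.
LENGTHS = range(1, 7)

example = QueryParameters("OR", "RTFIDF", 3, 6)

grid = [
    QueryParameters(inc, exc, n_inc, n_exc)
    for inc, exc, n_inc, n_exc in product(METHODS, METHODS, LENGTHS, LENGTHS)
]
print(len(grid), example in grid)
```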

  10. Experimental Setup • Language: Slovenian • Initial documents: 1 web page in Slovenian, 1 in English • Search engine: Altavista

  11. Evaluation • Goal: Collect as many relevant documents as possible while minimizing the cost • Cost • Number of total documents retrieved from the Web • Number of distinct queries issued to the search engine • Evaluation Measures • Percentage of retrieved documents that are relevant • Number of relevant documents retrieved per unique query
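The two evaluation measures reduce to simple ratios. A minimal sketch follows; the function name and the choice to return zero when nothing has been retrieved are assumptions.

```python
def evaluation_measures(n_relevant, n_retrieved, n_unique_queries):
    """Return (percent of retrieved docs that are relevant,
    relevant docs per unique query)."""
    percent_relevant = 100.0 * n_relevant / n_retrieved if n_retrieved else 0.0
    per_query = n_relevant / n_unique_queries if n_unique_queries else 0.0
    return percent_relevant, per_query

# Example: 30 relevant documents among 120 retrieved, using 10 unique queries.
print(evaluation_measures(30, 120, 10))
```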

  12. Fixed Query Parameters • Fix Query Lengths and Vary Term-Selection Methods • Fix Term-Selection Methods and Vary Query Lengths • Results (Ghani et al., SIGIR 2001): • Odds-Ratio works well overall • Long Queries are precise but with low recall

  13. Why Online Learning? • Different Term-Selection Methods Excel with different Query Lengths • Best Combination of methods and lengths may change as different parts of the Web/feature space are explored

  14. Learning Methods • Memory-Less (ML) Learning • Ignore all history and only use the current performance • Long-Term Memory (LT) Learning • Use all of the previous history • Additive Update Rule • Multiplicative Update Rule • Fading Memory (FM) Learning • Use all of the history but with a decay function over time
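The slide names the learning schemes without giving their formulas, so the following is only one plausible reading. It assumes each parameter setting receives a numeric reward after every query (for example, the fraction of returned documents that were relevant); the function names, the `rate` and `decay` constants, and the exact update forms are assumptions.

```python
def memory_less(rewards):
    """ML: only the most recent reward counts."""
    return rewards[-1]

def long_term_additive(rewards, initial=0.0):
    """LT, additive rule: add every reward ever observed."""
    score = initial
    for r in rewards:
        score += r
    return score

def long_term_multiplicative(rewards, initial=1.0, rate=1.0):
    """LT, multiplicative rule: scale the weight up after each reward."""
    weight = initial
    for r in rewards:
        weight *= 1.0 + rate * r
    return weight

def fading_memory(rewards, decay=0.5):
    """FM: all history counts, but older rewards are discounted."""
    score = 0.0
    for r in rewards:
        score = decay * score + r
    return score

rewards = [1.0, 0.0, 0.5]   # illustrative reward history for one setting
print(memory_less(rewards), long_term_additive(rewards),
      long_term_multiplicative(rewards), fading_memory(rewards))
```

Under these assumed rules, a setting that performed well early but poorly recently keeps a high long-term score, while its fading-memory score drifts toward its recent performance.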

  15. Results [figure panels: LTM, Memory-Less]

  16. Results

  17. Further Experiments • Other Languages • Similar results with Croatian, Czech and Tagalog • Keywords • Similar results when initializing with keywords instead of documents • Comparison to Altavista’s “More Like This” • Better performance than Altavista’s feature

  18. Conclusions • Successfully able to build corpora for minority languages (Slovenian, Croatian, Czech, Tagalog) using Web search engines • Online Learning is useful in adapting to different parts of the Web space • System and Corpora are/will be available at www.cs.cmu.edu/~TextLearning/CorpusBuilder
