Explore how to generate queries efficiently for finding documents matching a minority concept on the web using innovative online learning methods. Experiment with various term selection techniques to maximize relevant search results. Access the system and corpora at www.cs.cmu.edu/~TextLearning/CorpusBuilder.
Online Learning for Web Query Generation: Finding Documents Matching a Minority Concept on the Web • Rayid Ghani, Accenture Technology Labs, USA • Rosie Jones, Carnegie Mellon University, USA • Dunja Mladenic, J. Stefan Institute, Slovenia
Motivation • Need a collection of documents matching a particular concept • Search on the web, modify query, analyze documents, modify query,… • Repetitive, time-consuming, requires reasonable familiarity with the concept
Task • Given: • 1 Document in Target Concept • 1 Other Document (negative example) • Access to a Web Search Engine • Create a Corpus of the Target Concept quickly with no human effort
Algorithm • [Diagram components: Seed Docs, Query Generator, WWW, Filter/Classifier]
[Diagram labels: Initial Docs, Build Query, Web, Filter, Relevant, Non-Relevant, Word Statistics, Learning]
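The diagram labels suggest a loop: build a query from the current documents, send it to the web, filter the results into relevant and non-relevant sets, and update the statistics used for the next query. A minimal sketch of that loop follows. The ordering of the steps is a reading of the labels, and `make_query`, `search`, and `is_relevant` are hypothetical callables supplied by the caller, not part of the authors' system.

```python
def build_corpus(seed_relevant, seed_nonrelevant, make_query, search, is_relevant, rounds):
    """Grow relevant / non-relevant document sets by repeated querying.

    make_query(relevant, nonrelevant) -> query string   (hypothetical)
    search(query) -> iterable of documents              (hypothetical)
    is_relevant(doc) -> bool, the filter/classifier     (hypothetical)
    """
    relevant = list(seed_relevant)
    nonrelevant = list(seed_nonrelevant)
    seen = set(relevant) | set(nonrelevant)
    for _ in range(rounds):
        query = make_query(relevant, nonrelevant)
        for doc in search(query):
            if doc in seen:
                continue  # skip documents already retrieved
            seen.add(doc)
            if is_relevant(doc):
                relevant.append(doc)
            else:
                nonrelevant.append(doc)
    return relevant, nonrelevant
```

Passing the search engine and the filter in as functions keeps the sketch independent of any particular search API or language classifier.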
Query Generation • Examine the current relevant and non-relevant documents to generate a query likely to find documents that ARE similar to the relevant ones and NOT similar to the non-relevant ones • A query consists of m inclusion terms and n exclusion terms • e.g. +intelligence +web -military
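As an illustration of the query format above, the following sketch joins inclusion and exclusion terms into a single string. The function name `build_query` is an assumption made for this example.

```python
def build_query(inclusion_terms, exclusion_terms):
    """Format terms in the +include / -exclude style shown on the slide."""
    parts = [f"+{t}" for t in inclusion_terms]
    parts += [f"-{t}" for t in exclusion_terms]
    return " ".join(parts)


print(build_query(["intelligence", "web"], ["military"]))
# prints: +intelligence +web -military
```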
Query Term Selection Methods • Uniform (UN) – select k words randomly from the current vocabulary • Term-Frequency (TF) – select top k words ranked according to their frequency • Probabilistic TF (PTF) – k words with probability proportional to their frequency
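The three methods above can be sketched over a word-frequency table. This is an illustrative version only: it assumes `counts` maps each word to a positive frequency, and it assumes PTF draws words without replacement, which the slide does not specify.

```python
import random
from collections import Counter


def select_uniform(counts, k, rng):
    """UN: pick k words uniformly at random from the vocabulary."""
    vocab = sorted(counts)
    return rng.sample(vocab, min(k, len(vocab)))


def select_tf(counts, k):
    """TF: pick the k most frequent words."""
    return [word for word, _ in Counter(counts).most_common(k)]


def select_ptf(counts, k, rng):
    """PTF: pick k words with probability proportional to frequency.

    Assumption: sampling is without replacement, so a query never
    repeats a term.
    """
    pool = dict(counts)
    chosen = []
    while pool and len(chosen) < k:
        words = list(pool)
        weights = [pool[w] for w in words]
        word = rng.choices(words, weights=weights, k=1)[0]
        chosen.append(word)
        del pool[word]
    return chosen


counts = {"jezik": 5, "in": 3, "web": 1}
rng = random.Random(0)  # seeded only to make the example repeatable
print(select_tf(counts, 2))
```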
Query Term Selection Methods • RTFIDF – top k words according to their rtfidf scores • Odds-Ratio (OR) – top k words according to their odds-ratio scores • Probabilistic OR (POR) – select k words with probability proportional to their Odds-Ratio scores
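The slide does not spell out the odds-ratio formula. A common definition is $\log \frac{P(w \mid rel)\,(1 - P(w \mid nonrel))}{(1 - P(w \mid rel))\,P(w \mid nonrel)}$, and the sketch below uses it as an assumption, estimating each probability from document frequency with add-one smoothing so the ratio is always defined. Documents are assumed to be lists of tokens.

```python
import math


def odds_ratio_scores(relevant_docs, nonrelevant_docs):
    """Score each word by a smoothed log odds ratio (assumed definition)."""
    vocab = {w for doc in relevant_docs + nonrelevant_docs for w in doc}

    def prob(word, docs):
        # Fraction of documents containing the word, with add-one smoothing.
        df = sum(1 for doc in docs if word in doc)
        return (df + 1) / (len(docs) + 2)

    scores = {}
    for word in vocab:
        p_rel = prob(word, relevant_docs)
        p_non = prob(word, nonrelevant_docs)
        scores[word] = math.log(p_rel * (1 - p_non) / ((1 - p_rel) * p_non))
    return scores


def top_k(scores, k):
    """OR: the k highest-scoring words, ties broken alphabetically."""
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]
```

Under this definition, high scores mark words typical of relevant documents (candidate inclusion terms) and low scores mark words typical of non-relevant ones (candidate exclusion terms).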
Query Parameters • 4 Parameters • Inclusion Term-Selection Method • Exclusion Term-Selection Method • Inclusion Length • Exclusion Length • Example: Odds-Ratio, RTFIDF, 3, 6
Experimental Setup • Language: Slovenian • Initial documents: 1 web page in Slovenian, 1 in English • Search engine: Altavista
Evaluation • Goal: Collect as many relevant documents as possible while minimizing the cost • Cost • Total number of documents retrieved from the Web • Number of distinct queries issued to the search engine • Evaluation Measures • Percentage of retrieved documents that are relevant • Number of relevant documents retrieved per unique query
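The two evaluation measures are simple ratios. A small sketch, with hypothetical argument names and made-up example numbers:

```python
def evaluate(num_relevant, num_retrieved, num_unique_queries):
    """Return (percent of retrieved docs that are relevant,
    relevant docs per unique query)."""
    percent_relevant = 100.0 * num_relevant / num_retrieved if num_retrieved else 0.0
    per_query = num_relevant / num_unique_queries if num_unique_queries else 0.0
    return percent_relevant, per_query


# Illustrative numbers only: 30 relevant out of 120 retrieved, 10 queries.
print(evaluate(30, 120, 10))  # prints: (25.0, 3.0)
```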
Fixed Query Parameters • Fix Query Lengths and Vary Term-Selection Methods • Fix Term-Selection Methods and Vary Query Lengths • Results (Ghani et al., SIGIR 2001): • Odds-Ratio works well overall • Long queries are precise but have low recall
Why Online Learning? • Different Term-Selection Methods Excel with different Query Lengths • Best Combination of methods and lengths may change as different parts of the Web/feature space are explored
Learning Methods • Memory-Less (ML) Learning • Ignore all history and only use the current performance • Long-Term Memory (LT) Learning • Use all of the previous history • Additive Update Rule • Multiplicative Update Rule • Fading Memory (FM) Learning • Use all of the history but with a decay function over time
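The slide names the update styles without giving formulas. The sketch below shows one plausible form of each, applied to a single weight that tracks how well one query-parameter combination has performed. The `reward` signal (for example, the fraction of relevant documents a query returned), the `rate` and `decay` constants, and the exact update forms are all assumptions made for illustration, not the authors' definitions.

```python
def memoryless_update(weight, reward):
    """ML: ignore history, keep only the latest performance."""
    return reward


def additive_update(weight, reward, rate=1.0):
    """LT, additive: accumulate all past rewards."""
    return weight + rate * reward


def multiplicative_update(weight, reward, rate=0.5):
    """LT, multiplicative: scale the weight up by each reward."""
    return weight * (1.0 + rate * reward)


def fading_update(weight, reward, decay=0.5):
    """FM: keep all history, but discount older rewards geometrically."""
    return decay * weight + reward


def run(update, rewards, start):
    """Apply an update rule to a sequence of rewards."""
    weight = start
    for reward in rewards:
        weight = update(weight, reward)
    return weight


rewards = [0.2, 0.8, 0.0]  # made-up per-query rewards
for name, rule, start in [
    ("memory-less", memoryless_update, 0.0),
    ("additive", additive_update, 0.0),
    ("multiplicative", multiplicative_update, 1.0),
    ("fading", fading_update, 0.0),
]:
    print(name, run(rule, rewards, start))
```

With one weight per parameter combination, the next query could use the combination with the highest weight, or sample combinations in proportion to their weights.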
Results • [Figure: results plots comparing Long-Term Memory (LTM) and Memory-Less learning]
Further Experiments • Other Languages • Similar results with Croatian, Czech and Tagalog • Keywords • Similar results when initializing with keywords instead of documents • Comparison to Altavista’s “More Like This” • Better performance than Altavista’s feature
Conclusions • Successfully able to build corpora for minority languages (Slovenian, Croatian, Czech, Tagalog) using Web search engines • Online Learning is useful in adapting to different parts of the Web space • System and Corpora are/will be available at www.cs.cmu.edu/~TextLearning/CorpusBuilder