Automatically Building a Stopword List for an Information Retrieval System University of Glasgow Rachel Tsz-Wai Lo, Ben He, Iadh Ounis
Outline • Stopwords • Investigation of two approaches • Approach based on Zipf’s Law • New Term-based random sampling approach • Experimental Setup • Results and Analysis • Conclusion
What is a Stopword? • A very common word in a document collection • e.g. the, is, and, am, to, it • Conveys little information about document content • Low discrimination value in terms of IR: meaningless, contributes nothing to retrieval • Searching with stopwords will usually result in retrieving irrelevant documents
Objective • Different collections contain different content and word patterns • Different collections may therefore require different sets of stopwords • Given a collection of documents, investigate ways to automatically create a stopword list
Objective (cont.) • Baseline approach (benchmark): 4 variants inspired by Zipf’s Law • TF • Normalised TF • IDF • Normalised IDF • New proposed approach: based on how informative a term is
Fox’s Classical Stopword List and Its Weaknesses • Contains 733 stopwords • Over 20 years old, hence outdated • Lacks potentially new words • Defined for general-purpose use, whereas different collections require different stopword lists
Zipf’s Law • Rank the terms according to their term frequencies • the term with the highest TF has rank 1, the next highest rank 2, etc. • Zipf’s Law: a term’s frequency is inversely proportional to its rank, $f(r) \propto 1/r$ • e.g. the rank-2 term occurs roughly half as often as the rank-1 term
Baseline Approach Algorithm • Generate a list of term frequencies from the corpus • Sort the frequencies in descending order • Rank the terms according to their frequencies: the highest frequency gets rank = 1, the next highest rank = 2, etc. • Plot frequency against rank
Baseline Approach Algorithm (cont.) • Choose a threshold; any words whose frequency lies above it are treated as stopwords (sketched below) • Run the queries with this stopword list, removing all stopwords from the queries • Evaluate the system with Average Precision
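As an illustration, a minimal Python sketch of the TF variant; the toy corpus, whitespace tokenisation and threshold value are assumptions for illustration, not taken from the paper:

```python
from collections import Counter

def build_stopword_list(documents, threshold):
    """Rank terms by collection frequency; every term whose frequency
    lies above the threshold is treated as a stopword."""
    tf = Counter(tok for doc in documents for tok in doc.split())
    ranked = tf.most_common()          # index 0 is rank 1, etc.
    return [term for term, freq in ranked if freq > threshold]

docs = ["the cat sat on the mat", "the dog and the cat"]
print(build_stopword_list(docs, threshold=2))   # ['the']
```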
Baseline Approach - Variants • Term Frequency • Normalised Term Frequency • Inverse Document Frequency (IDF) • Normalised IDF
Baseline Approach – Choosing Threshold • Goal: produce the best set of stopwords • Over 50 stopword lists generated for each variant • Investigate the frequency difference between two consecutive ranks • a big difference (i.e. a sudden jump) suggests a cut-off, as sketched below • Choosing an appropriate threshold is important
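One way to locate that jump automatically is to take the largest drop between consecutive ranks; a hypothetical sketch (the slides inspect the differences directly, so this heuristic is an assumption):

```python
def find_threshold(ranked_freqs):
    """Return the frequency just below the largest drop between two
    consecutive ranks; terms with higher frequency become stopwords."""
    drops = [ranked_freqs[i] - ranked_freqs[i + 1]
             for i in range(len(ranked_freqs) - 1)]
    cut = drops.index(max(drops))      # position of the sudden jump
    return ranked_freqs[cut + 1]

freqs = [9000, 8500, 8000, 1200, 1100]   # sorted descending; jump after rank 3
print(find_threshold(freqs))             # 1200 -> top 3 terms are stopwords
```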
Term-Based Random Sampling Approach (TBRSA) • Our proposed new approach • Depends on how informative a term is • Based on the Kullback-Leibler divergence measure • Similar to the idea of query expansion
Kullback-Leibler Divergence Measure • Used to measure the distance between two distributions • In our case, the distribution of a term in the sampled document set versus its distribution in the whole collection, the sample being retrieved with a random term • The weight of a term t in the sampled document set is given by: $w(t) = P_x(t) \cdot \log_2 \frac{P_x(t)}{P_c(t)}$ • where $P_x(t) = tf_x / token_x$ is the term’s relative frequency in the sampled document set and $P_c(t) = tf_c / token_c$ its relative frequency in the whole collection
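A minimal sketch of this weight in Python, assuming raw term and token counts are available (function and variable names are illustrative):

```python
import math

def kl_weight(tf_sample, sample_tokens, tf_coll, coll_tokens):
    """KL-based weight of a term: how far its relative frequency in the
    sampled document set diverges from that in the whole collection."""
    p_x = tf_sample / sample_tokens    # P_x(t): rel. frequency in sample
    p_c = tf_coll / coll_tokens        # P_c(t): rel. frequency in collection
    return p_x * math.log2(p_x / p_c)

# An informative term: rare in the collection, frequent in the sample.
print(kl_weight(50, 1000, 200, 10_000_000))      # large positive weight
# A stopword: roughly equally frequent everywhere.
print(kl_weight(60, 1000, 60_000, 1_000_000))    # weight ~ 0
```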
TBRSA Algorithm • Repeat Y times: • pick a random term • retrieve the documents containing it • weight every term in the retrieved set with the KL divergence measure • normalise the weights by the maximum weight • rank the terms by weight in ascending order • keep the top X ranked terms
TBRSA Algorithm (cont.) • Merge the Y lists of weighted terms • Sort the merged list in ascending order of weight • Extract the top L ranked terms as stopwords
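Putting the two slides together, a hypothetical end-to-end sketch reusing kl_weight from the earlier snippet; the inverted-index access, the duplicate rule in the merge and the values of Y, X and L are assumptions for illustration:

```python
import random
from collections import Counter

def count_terms(docs):
    """Term frequencies and total token count of a document sample."""
    counts = Counter(tok for doc in docs for tok in doc)
    return counts, sum(counts.values())

def tbrsa(index, coll_tf, coll_tokens, Y=100, X=200, L=400):
    """Term-based random sampling: repeatedly sample a random term,
    weight the terms of its retrieved documents by KL divergence, and
    pool the least informative candidates as stopwords."""
    pool = {}
    for _ in range(Y):
        term = random.choice(list(coll_tf))   # random query term
        sample = index[term]                  # token lists of docs containing it
        counts, tokens = count_terms(sample)
        weights = {t: kl_weight(tf, tokens, coll_tf[t], coll_tokens)
                   for t, tf in counts.items()}
        top = max(weights.values())           # normalise by the maximum weight
        least = sorted(weights, key=weights.get)[:X]   # ascending order
        for t in least:                       # keep each term's smallest weight
            pool[t] = min(pool.get(t, 1.0), weights[t] / top)
    return sorted(pool, key=pool.get)[:L]     # top L ranked as stopwords
```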
Advantages / Disadvantages • Advantages • based on how informative a term is • minimal computational effort compared to the baselines • better coverage of the collection • no need to monitor progress • Disadvantages • the first term is generated randomly, so it could retrieve only a small document set • the experiments must be repeated Y times
Experimental Setup • Four TREC collections • http://trec.nist.gov/data/docs_eng.html • Each collection is indexed and stemmed with no pre-defined stopwords removed • no stopwords are assumed at the outset • Long queries (Title, Description and Narrative) were used • maximises the chances of exercising the new stopword lists
Experimental Platform • Terrier - TERabyte RetrIEveR • IR Group, University of Glasgow • Based on Divergence From Randomness (DFR) framework • Deriving parameter-free probabilistic models • PL2 model • http://ir.dcs.gla.ac.uk/terrier/
PL2 Model • One of the DFR document weighting models • The relevance score of a document d for a query Q is: $score(d,Q) = \sum_{t \in Q} qtw \cdot \frac{1}{tfn+1}\left(tfn \cdot \log_2\frac{tfn}{\lambda} + (\lambda - tfn)\cdot\log_2 e + 0.5\cdot\log_2(2\pi\cdot tfn)\right)$ • where $tfn = tf \cdot \log_2(1 + c \cdot avg\_l / l)$ is the normalised term frequency (c a free parameter, l the document length, avg_l the average document length) and $\lambda = F/N$ is the mean of the assumed Poisson distribution (F the term’s frequency in the collection, N the number of documents)
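A sketch of this score for a single query term in Python (the statistics in the example call and the default c = 1 are illustrative; Terrier’s own implementation is authoritative):

```python
import math

def pl2_score(tf, doc_len, avg_len, F, N, c=1.0, qtw=1.0):
    """PL2 (DFR) weight of one query term in one document."""
    tfn = tf * math.log2(1 + c * avg_len / doc_len)   # normalised term freq.
    lam = F / N                                       # Poisson mean F/N
    return (qtw / (tfn + 1)) * (
        tfn * math.log2(tfn / lam)
        + (lam - tfn) * math.log2(math.e)
        + 0.5 * math.log2(2 * math.pi * tfn)
    )

# A term occurring 3 times in a short document, rare in the collection:
print(pl2_score(tf=3, doc_len=100, avg_len=250, F=2000, N=500_000))
```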
Collections • disk45, WT2G, WT10G and DOTGOV
Merging Stopword Lists • Merge the classical list with the best list generated by the baseline and the novel approach respectively • i.e. add the two lists together and remove duplicates — a set union, as sketched below • The merged list might be stronger in terms of effectiveness • Follows the classical IR technique of combining evidence
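The merge itself is just a set union; a trivial sketch with illustrative lists:

```python
fox_list = {"the", "is", "and", "to"}        # classical stopwords
generated = {"http", "www", "the", "page"}   # best automatically built list
merged = sorted(fox_list | generated)        # union removes duplicates
print(merged)   # ['and', 'http', 'is', 'page', 'the', 'to', 'www']
```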
Results and Analysis • Produce as many sets of stopwords as possible (by choosing different thresholds for the baseline approach) • Compare the results against Fox’s classical stopword list, based on average precision
Baseline Approach – Overall Results • * indicates significant difference at 0.05 level • Normalised IDF performed best, for every collection
TBRSA – Overall Results • * indicates significant difference at 0.05 level • disk45 and WT2G both show improvements
Refinement - Merging • The new approach (TBRSA) gives comparable results with less computational effort • Fox’s classical stopword list was very effective despite its age • still worth using • The queries were quite “conservative”
Merging – Baseline Approach • * indicates significant difference at 0.05 level • Produced a more effective stopword list
Merging – TBRSA • * indicates significant difference at 0.05 level • Produced an improved stopword list with less computational effort
Conclusion & Future Work • Proposed a novel approach for automatically generating a stopword list • Effectiveness and robustness • Compared to 4 baseline variants, based on Zipf’s Law • Merge classical stopword list with best found result to produce a more effective stopword list
Conclusion & Future Work (cont.) • Investigate other divergence metrics • Poisson-based approach • Verb vs Noun • “I can open a can of tuna with a can opener” • “to be or not to be” • Detect nature of context • Might have to keep some of the terms but remove others
Thank you for your attention! • Any questions?