Automatically Building a Stopword List for an Information Retrieval System University of Glasgow Rachel Tsz-Wai Lo, Ben He, Iadh Ounis
Outline • Stopwords • Investigation of two approaches • Approach based on Zipf’s Law • New Term-based random sampling approach • Experimental Setup • Results and Analysis • Conclusion
What is a Stopword? • A very common word in a document collection • e.g. the, is, and, am, to, it • Conveys little information about document content • Low discrimination value in terms of IR: meaningless, contributes nothing to retrieval • Searching with stopwords will usually result in retrieving irrelevant documents
Objective • Different collections contain different content and word patterns • Different collections may therefore require different sets of stopwords • Given a collection of documents, investigate ways to automatically create a stopword list
Objective (cont.) • Baseline approach (benchmark): 4 variants inspired by Zipf’s Law • TF • Normalised TF • IDF • Normalised IDF • New proposed approach: based on how informative a term is
Fox’s Classical Stopword List and Its Weaknesses • Contains 733 stopwords • Over 20 years old, hence outdated • Lacks potentially new words • Defined for general-purpose use, whereas different collections require different stopword lists
Zipf’s Law • Rank the terms according to their term frequencies • the term with the highest TF has rank 1, the next highest rank 2, etc. • Zipf’s Law: a term’s frequency is inversely proportional to its rank, $f(r) \propto 1/r$ • e.g. the rank-2 term occurs roughly half as often as the rank-1 term
Baseline Approach Algorithm • Generate a list of term frequencies from the corpus • Sort the frequencies in descending order • Rank the terms according to their frequencies: the highest frequency gets rank = 1, the next highest rank = 2, etc. • Plot frequency against rank
Baseline Approach Algorithm (cont.) • Choose a threshold; any words whose frequency lies above it are treated as stopwords (sketched below) • Run the queries with this stopword list, removing all stopwords from the queries • Evaluate the system with Average Precision
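As an illustration, a minimal Python sketch of the TF variant; the toy corpus, whitespace tokenisation and threshold value are assumptions for illustration, not taken from the paper:

```python
from collections import Counter

def build_stopword_list(documents, threshold):
    """Rank terms by collection frequency; every term whose frequency
    lies above the threshold is treated as a stopword."""
    tf = Counter(tok for doc in documents for tok in doc.split())
    ranked = tf.most_common()          # index 0 is rank 1, etc.
    return [term for term, freq in ranked if freq > threshold]

docs = ["the cat sat on the mat", "the dog and the cat"]
print(build_stopword_list(docs, threshold=2))   # ['the']
```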
Baseline Approach - Variants • Term Frequency • Normalised Term Frequency • Inverse Document Frequency (IDF) • Normalised IDF
Baseline Approach – Choosing Threshold • Goal: produce the best set of stopwords • Over 50 stopword lists generated for each variant • Investigate the frequency difference between two consecutive ranks • a big difference (i.e. a sudden jump) suggests a cut-off, as sketched below • Choosing an appropriate threshold is important
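One way to locate that jump automatically is to take the largest drop between consecutive ranks; a hypothetical sketch (the slides inspect the differences directly, so this heuristic is an assumption):

```python
def find_threshold(ranked_freqs):
    """Return the frequency just below the largest drop between two
    consecutive ranks; terms with higher frequency become stopwords."""
    drops = [ranked_freqs[i] - ranked_freqs[i + 1]
             for i in range(len(ranked_freqs) - 1)]
    cut = drops.index(max(drops))      # position of the sudden jump
    return ranked_freqs[cut + 1]

freqs = [9000, 8500, 8000, 1200, 1100]   # sorted descending; jump after rank 3
print(find_threshold(freqs))             # 1200 -> top 3 terms are stopwords
```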
Term-Based Random Sampling Approach (TBRSA) • Our proposed new approach • Depends on how informative a term is • Based on the Kullback-Leibler divergence measure • Similar to the idea of query expansion
Kullback-Leibler Divergence Measure • Used to measure the distance between two distributions • In our case, the distribution of a term in the sampled document set versus its distribution in the whole collection, the sample being retrieved with a random term • The weight of a term t in the sampled document set is given by: $w(t) = P_x(t) \cdot \log_2 \frac{P_x(t)}{P_c(t)}$ • where $P_x(t) = tf_x / token_x$ is the term’s relative frequency in the sampled document set and $P_c(t) = tf_c / token_c$ its relative frequency in the whole collection
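A minimal sketch of this weight in Python, assuming raw term and token counts are available (function and variable names are illustrative):

```python
import math

def kl_weight(tf_sample, sample_tokens, tf_coll, coll_tokens):
    """KL-based weight of a term: how far its relative frequency in the
    sampled document set diverges from that in the whole collection."""
    p_x = tf_sample / sample_tokens    # P_x(t): rel. frequency in sample
    p_c = tf_coll / coll_tokens        # P_c(t): rel. frequency in collection
    return p_x * math.log2(p_x / p_c)

# An informative term: rare in the collection, frequent in the sample.
print(kl_weight(50, 1000, 200, 10_000_000))      # large positive weight
# A stopword: roughly equally frequent everywhere.
print(kl_weight(60, 1000, 60_000, 1_000_000))    # weight ~ 0
```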
TBRSA Algorithm • Repeat Y times: • pick a random term • retrieve the documents containing it • weight every term in the retrieved set with the KL divergence measure • normalise the weights by the maximum weight • rank the terms by weight in ascending order • keep the top X ranked terms
TBRSA Algorithm (cont.) • Merge the Y lists of weighted terms • Sort the merged list in ascending order of weight • Extract the top L ranked terms as stopwords
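Putting the two slides together, a hypothetical end-to-end sketch reusing kl_weight from the earlier snippet; the inverted-index access, the duplicate rule in the merge and the values of Y, X and L are assumptions for illustration:

```python
import random
from collections import Counter

def count_terms(docs):
    """Term frequencies and total token count of a document sample."""
    counts = Counter(tok for doc in docs for tok in doc)
    return counts, sum(counts.values())

def tbrsa(index, coll_tf, coll_tokens, Y=100, X=200, L=400):
    """Term-based random sampling: repeatedly sample a random term,
    weight the terms of its retrieved documents by KL divergence, and
    pool the least informative candidates as stopwords."""
    pool = {}
    for _ in range(Y):
        term = random.choice(list(coll_tf))   # random query term
        sample = index[term]                  # token lists of docs containing it
        counts, tokens = count_terms(sample)
        weights = {t: kl_weight(tf, tokens, coll_tf[t], coll_tokens)
                   for t, tf in counts.items()}
        top = max(weights.values())           # normalise by the maximum weight
        least = sorted(weights, key=weights.get)[:X]   # ascending order
        for t in least:                       # keep each term's smallest weight
            pool[t] = min(pool.get(t, 1.0), weights[t] / top)
    return sorted(pool, key=pool.get)[:L]     # top L ranked as stopwords
```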
Advantages / Disadvantages • Advantages • based on how informative a term is • minimal computational effort compared to the baselines • better coverage of the collection • no need to monitor progress • Disadvantages • the first term is generated randomly, so it could retrieve only a small document set • the experiments must be repeated Y times
Experimental Setup • Four TREC collections • http://trec.nist.gov/data/docs_eng.html • Each collection is indexed and stemmed with no pre-defined stopwords removed • no stopwords are assumed at the outset • Long queries (Title, Description and Narrative) were used • maximises the chances of exercising the new stopword lists
Experimental Platform • Terrier - TERabyte RetrIEveR • IR Group, University of Glasgow • Based on Divergence From Randomness (DFR) framework • Deriving parameter-free probabilistic models • PL2 model • http://ir.dcs.gla.ac.uk/terrier/
PL2 Model • One of the DFR document weighting models • The relevance score of a document d for a query Q is: $score(d,Q) = \sum_{t \in Q} qtw \cdot \frac{1}{tfn+1}\left(tfn \cdot \log_2\frac{tfn}{\lambda} + (\lambda - tfn)\cdot\log_2 e + 0.5\cdot\log_2(2\pi\cdot tfn)\right)$ • where $tfn = tf \cdot \log_2(1 + c \cdot avg\_l / l)$ is the normalised term frequency (c a free parameter, l the document length, avg_l the average document length) and $\lambda = F/N$ is the mean of the assumed Poisson distribution (F the term’s frequency in the collection, N the number of documents)
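A sketch of this score for a single query term in Python (the statistics in the example call and the default c = 1 are illustrative; Terrier’s own implementation is authoritative):

```python
import math

def pl2_score(tf, doc_len, avg_len, F, N, c=1.0, qtw=1.0):
    """PL2 (DFR) weight of one query term in one document."""
    tfn = tf * math.log2(1 + c * avg_len / doc_len)   # normalised term freq.
    lam = F / N                                       # Poisson mean F/N
    return (qtw / (tfn + 1)) * (
        tfn * math.log2(tfn / lam)
        + (lam - tfn) * math.log2(math.e)
        + 0.5 * math.log2(2 * math.pi * tfn)
    )

# A term occurring 3 times in a short document, rare in the collection:
print(pl2_score(tf=3, doc_len=100, avg_len=250, F=2000, N=500_000))
```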
Collections • disk45, WT2G, WT10G and DOTGOV
Merging Stopword Lists • Merge the classical list with the best list generated by the baseline and the novel approach respectively • i.e. add the two lists together and remove duplicates — a set union, as sketched below • The merged list might be stronger in terms of effectiveness • Follows the classical IR technique of combining evidence
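The merge itself is just a set union; a trivial sketch with illustrative lists:

```python
fox_list = {"the", "is", "and", "to"}        # classical stopwords
generated = {"http", "www", "the", "page"}   # best automatically built list
merged = sorted(fox_list | generated)        # union removes duplicates
print(merged)   # ['and', 'http', 'is', 'page', 'the', 'to', 'www']
```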
Results and Analysis • Produce as many sets of stopwords as possible (by choosing different thresholds for the baseline approach) • Compare the results against Fox’s classical stopword list, based on average precision
Baseline Approach – Overall Results • * indicates significant difference at 0.05 level • Normalised IDF performed best, for every collection
TBRSA – Overall Results • * indicates significant difference at 0.05 level • disk45 and WT2G both show improvements
Refinement - Merging • The new approach (TBRSA) gives comparable results with less computational effort • Fox’s classical stopword list was very effective despite its age • still worth using • The queries were quite “conservative”
Merging – Baseline Approach • * indicates significant difference at 0.05 level • Produced a more effective stopword list
Merging – TBRSA • * indicates significant difference at 0.05 level • Produced an improved stopword list with less computational effort
Conclusion & Future Work • Proposed a novel approach for automatically generating a stopword list • Effectiveness and robustness • Compared to 4 baseline variants, based on Zipf’s Law • Merge classical stopword list with best found result to produce a more effective stopword list
Conclusion & Future Work (cont.) • Investigate other divergence metrics • Poisson-based approach • Verb vs Noun • “I can open a can of tuna with a can opener” • “to be or not to be” • Detect nature of context • Might have to keep some of the terms but remove others
Thank you for your attention! • Any questions?