Terrier: TERabyte RetRIevER

Terrier: TERabyte RetRIevER • An Introduction • By: Kavita Ganesan (Last Updated April 21st, 2009)

About Terrier • Information Retrieval Toolkit • Developed by the Information Retrieval Group at the University of Glasgow since 2001 • The team: 3 Researchers, 5 PhD students, 5 Programmers

About Terrier • Provides a platform for developing large-scale IR applications • Uses Hadoop to distribute indexing: splits indexing tasks across the nodes of a cluster • Java-based • Terrier's weighting models are based on the Divergence From Randomness (DFR) framework • Also includes other IR models

State-of-the-art functionalities • hyperlink structure analysis to rank pages • automatic query expansion/re-formulation techniques • pre-retrieval query performance predictors • compression techniques

Other notable features • Selects the optimal weighting model based on the statistical features of the query

Toolkit Comparison

Out of the box capabilities • Index and evaluate on TREC test collections • Index standard file formats: HTML, PDF, Word, Excel, PowerPoint • GUI-based desktop search application

Other out of the box capabilities • Indexing support using Hadoop • Highly compressed index data structures • Options for various stemming techniques • Many document weighting model options: 126 Divergence From Randomness (DFR) models, Okapi BM25, language modeling, TF-IDF • Modifiable code: open source code base (Mozilla Public Licence)
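To make "document weighting model" concrete, here is a minimal Java sketch of one of the models named above, Okapi BM25, in a common textbook form. This is not Terrier's own implementation: the class name `Bm25Sketch`, the method names, and the parameter values k1 = 1.2 and b = 0.75 are assumptions chosen for illustration, and the IDF uses a "+1 inside the logarithm" variant so that scores never go negative.

```java
final class Bm25Sketch {
    // Commonly quoted defaults; assumed here, not taken from Terrier.
    static final double K1 = 1.2;
    static final double B = 0.75;

    /** IDF component: rarer terms (small docFreq) get a larger weight. */
    static double idf(long numDocs, long docFreq) {
        return Math.log(1.0 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    /**
     * Score contribution of one query term for one document.
     * tf        = term frequency in the document
     * docLen    = length of the document in tokens
     * avgDocLen = average document length in the collection
     */
    static double termScore(double tf, double docLen, double avgDocLen,
                            long numDocs, long docFreq) {
        if (tf <= 0) {
            return 0.0; // term absent from the document
        }
        // Length normalisation: long documents are penalised when B > 0.
        double norm = K1 * (1.0 - B + B * docLen / avgDocLen);
        return idf(numDocs, docFreq) * (tf * (K1 + 1.0)) / (tf + norm);
    }

    public static void main(String[] args) {
        // Hypothetical collection statistics, for illustration only.
        long numDocs = 1000;
        double avgDocLen = 100.0;
        System.out.println("rare term:   " + termScore(3, 100, avgDocLen, numDocs, 10));
        System.out.println("common term: " + termScore(3, 100, avgDocLen, numDocs, 500));
    }
}
```

A document's score for a query is the sum of `termScore` over the query terms. The other listed models plug a different formula into the same slot, which is why a toolkit can let you switch models through configuration.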

Nice to have…but not there • Ability to easily build a search engine • Incremental indexing: the index has to be re-created every time, unless you write your own code for incremental indexing • Flexible indexer: you have to implement your own indexer for non-standard data formats

Benefits of using Terrier • Terrier is an active, ongoing project: benefit from new models, performance enhancements, new features • Can index large amounts of data: scalable in the long run • Good support from the team: wiki, discussion forums

…Benefits of using Terrier • Easy to set up and use • Very modular • Source files are fully modifiable and well documented

How To Get Started? • Download the binary (you get the full source code with this download) • Unzip the file to a directory • Modify the configuration files: models to use, stemmer, etc. • You are now ready to index and evaluate • Use the pre-existing scripts to index and evaluate
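As a rough illustration of the "modify configuration files" step, the Java sketch below parses the kind of key/value entries you would edit to choose a stemmer and a stopword list. The property names `termpipelines` and `stopwords.filename` and their values are recalled from Terrier's documentation and should be treated as assumptions; check them against the properties file shipped in `etc/` of your download. The class `TerrierConfigSketch` and its methods are hypothetical helpers, not part of Terrier.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

final class TerrierConfigSketch {
    // Sample entries (assumed names/values; verify against your Terrier version).
    static final String SAMPLE = String.join("\n",
        "# Processing applied to each term: stopword removal, then stemming",
        "termpipelines=Stopwords,PorterStemmer",
        "# Stopword list file",
        "stopwords.filename=stopword-list.txt");

    /** Parses properties-file text into a java.util.Properties object. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Returns the configured term pipeline stages in order (empty if unset). */
    static List<String> termPipeline(Properties props) {
        String raw = props.getProperty("termpipelines", "").trim();
        if (raw.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(raw.split("\\s*,\\s*"));
    }

    public static void main(String[] args) {
        Properties props = parse(SAMPLE);
        System.out.println("pipeline:  " + termPipeline(props));
        System.out.println("stopwords: " + props.getProperty("stopwords.filename"));
    }
}
```

Swapping the stemmer would then be a one-line change to the `termpipelines` value, with no code changes needed.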

Terrier’s Directory Structure
The directories of Terrier are:
– bin/ : contains useful scripts for running Terrier
– etc/ : contains the configuration files (which models to use, stopword list, stemmer, etc.)
– doc/ : contains the documentation of Terrier
– lib/ : contains the compiled Terrier classes and the external libraries used by Terrier
– licenses/ : contains the license information of the components included with Terrier
– share/ : contains a stop word list, an example of documents to test with Terrier, and other infrequently changing files
– src/ : contains the source code of Terrier (the source files needed to start modifications)
– var/index : contains the data structures
– var/results : contains the retrieval results
