
Terrier: TERabyte RetRIevER

Terrier: TERabyte RetRIevER. An Introduction. By: Kavita Ganesan (Last Updated April 21st, 2009).


Presentation Transcript


  1. Terrier: TERabyte RetRIevER • An Introduction • By: Kavita Ganesan (Last Updated April 21st, 2009)

  2. About Terrier • Information Retrieval Toolkit • Developed by the Information Retrieval Group at the University of Glasgow since 2001 • The team: • 3 Researchers • 5 PhD students • 5 Programmers

  3. About Terrier • Provides a platform for developing large-scale IR applications • Uses Hadoop to distribute indexing • Splits indexing tasks across different nodes on a cluster • Java-based • The weighting models in Terrier are based on the Divergence From Randomness (DFR) framework • Also includes other IR models
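To make the DFR idea concrete, here is a minimal sketch of one DFR-style weighting function, InL2 (inverse document frequency as the randomness model, Laplace after-effect, length normalisation 2), as it is commonly described in the literature. The function name, argument names, and default `c` are illustrative assumptions; the slides do not show Terrier's own code, and its implementation may differ in detail.

```python
import math


def inl2_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, c=1.0, key_freq=1.0):
    """Score one query term in one document with an InL2-style DFR model.

    All names here are hypothetical; this is not Terrier's API.
    """
    # Normalisation 2: rescale the raw term frequency by document length.
    tfn = tf * math.log2(1.0 + c * avg_doc_len / doc_len)
    # Basic randomness model (In): an inverse document frequency.
    idf = math.log2((n_docs + 1.0) / (doc_freq + 0.5))
    # Laplace after-effect (L): diminishing gain from repeated occurrences.
    gain = 1.0 / (tfn + 1.0)
    return key_freq * tfn * idf * gain


if __name__ == "__main__":
    # A rare term (in 10 of 1000 documents) versus a common one (in 500).
    print(inl2_score(tf=3, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=10))
    print(inl2_score(tf=3, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=500))
```

The rarer a term is in the collection, the more its occurrences diverge from what a random distribution would predict, so the higher the score.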

  4. State-of-the-art functionalities • hyperlink structure analysis to rank pages • automatic query expansion/re-formulation techniques • pre-retrieval query performance predictors • compression techniques
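The slides do not say which compression techniques are meant. As an illustration only, the sketch below shows one technique widely used for inverted indexes: store the gaps between sorted document ids instead of the ids themselves, and write each gap in Elias gamma code. Bits are kept as a text string for readability; a real index would pack them into bytes. All function names are hypothetical.

```python
def gamma_encode(n):
    """Elias gamma code of a positive integer, as a string of '0'/'1'."""
    if n < 1:
        raise ValueError("gamma coding is defined for integers >= 1")
    binary = bin(n)[2:]
    # Prefix with (length - 1) zeros so the decoder knows how many bits follow.
    return "0" * (len(binary) - 1) + binary


def gamma_decode_all(bits):
    """Decode a concatenation of gamma codes back into integers."""
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


def compress_postings(doc_ids):
    """Gap-encode strictly increasing doc ids (each >= 1), then gamma-code."""
    previous = 0
    chunks = []
    for doc_id in doc_ids:
        chunks.append(gamma_encode(doc_id - previous))
        previous = doc_id
    return "".join(chunks)


def decompress_postings(bits):
    """Invert compress_postings by summing the decoded gaps."""
    doc_ids = []
    current = 0
    for gap in gamma_decode_all(bits):
        current += gap
        doc_ids.append(current)
    return doc_ids


if __name__ == "__main__":
    postings = [3, 7, 8, 20, 1000]
    packed = compress_postings(postings)
    print(len(packed), "bits instead of", 32 * len(postings))
    print(decompress_postings(packed))
```

Small gaps get short codes, so frequent terms, whose postings lists are dense, compress best.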

  5. Other notable features • Selects the optimal weighting model based on the statistical features of the query

  6. Toolkit Comparison

  7. Out of the box capabilities • Index and evaluate on TREC test collections • Index standard file formats • HTML, PDF, Word, Excel, PowerPoint files • GUI-based desktop search application

  8. Other out of the box capabilities • Indexing support using Hadoop • Highly compressed index data structures • Options for various stemming techniques • Many document weighting model options • 126 Divergence From Randomness (DFR) models • Okapi BM25 • Language modeling • TF-IDF • Modifiable Code • open source code base (Mozilla Public Licence).
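For readers unfamiliar with the models listed above, the following is a minimal, self-contained sketch of Okapi BM25 scoring over a tiny in-memory collection. It uses the non-negative IDF variant and conventional default values for `k1` and `b`; these choices, and all names, are assumptions made for illustration and are not taken from Terrier.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document.

    docs is a non-empty list of token lists. Names are hypothetical.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for d in docs:
        doc_freq.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Non-negative IDF variant (assumption; other variants exist).
            idf = math.log(1.0 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length-normalised, saturating term frequency component.
            norm = tf[term] + k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores


if __name__ == "__main__":
    collection = [
        ["terrier", "indexes", "documents"],
        ["terrier", "terrier", "retrieval", "toolkit"],
        ["hadoop", "cluster"],
    ]
    print(bm25_scores(["terrier", "retrieval"], collection))
```

Unlike plain TF-IDF, the term frequency contribution saturates as `tf` grows, and `b` controls how strongly long documents are penalised.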

  9. Nice to have…but not there • Ability to easily build a search engine • Incremental indexing • The index must be re-created every time • Or write your own code for incremental indexing • Flexible indexer • You must implement your own indexer for non-standard data formats
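To show what incremental indexing means in practice, here is a toy in-memory inverted index that accepts new documents without rebuilding existing postings. It is a sketch of the concept only: the class and method names are hypothetical, and it does not model Terrier's on-disk data structures or how one would extend them.

```python
from collections import defaultdict


class IncrementalIndex:
    """Toy inverted index that can grow one document at a time."""

    def __init__(self):
        # term -> {doc_id: term frequency in that document}
        self.postings = defaultdict(dict)
        # doc_id -> number of tokens, useful for length normalisation
        self.doc_lengths = {}

    def add_document(self, doc_id, tokens):
        """Add one new document; existing postings are left untouched."""
        if doc_id in self.doc_lengths:
            raise ValueError("document %r is already indexed" % (doc_id,))
        self.doc_lengths[doc_id] = len(tokens)
        for token in tokens:
            entry = self.postings[token]
            entry[doc_id] = entry.get(doc_id, 0) + 1

    def lookup(self, term):
        """Return a copy of the postings for a term (empty if unseen)."""
        return dict(self.postings.get(term, {}))


if __name__ == "__main__":
    index = IncrementalIndex()
    index.add_document(1, ["terrier", "retrieval", "toolkit"])
    index.add_document(2, ["terrier", "terrier", "hadoop"])  # added later
    print(index.lookup("terrier"))
```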

  10. Benefits of using Terrier • Terrier – active ongoing project • Benefit from new models • Performance enhancements • New features • Can index large amounts of data • Scalable in the long run • Good support from the team • Wiki • Discussion forums

  11. …Benefits of using Terrier • Easy to set up and use • Very modular • Source files are fully modifiable and well documented

  12. How To Get Started? • Download the binary • You get the full source code with this download • Unzip the file to a directory • Modify configuration files • Models to use • Stemmer • Etc. • You are now ready to index and evaluate • Use pre-existing scripts to index and evaluate

  13. Terrier’s Directory Structure • The directories of Terrier are:
      – bin/: contains useful scripts for running Terrier
      – etc/: contains the configuration files
      – doc/: contains the documentation of Terrier
      – lib/: contains the compiled Terrier classes and the external libraries used by Terrier
      – licenses/: contains the license information of the components included with Terrier
      – share/: contains a stop word list, an example of documents to test with Terrier, and other infrequently changing files
      – src/: contains the source code of Terrier
      – var/index: contains the data structures
      – var/results: contains the retrieval results
      Slide callouts: • Which models? • Stopword list • Stemmer • etc.
      Slide callout: • Source files needed to start modifications
