Introduction to Terrier

Presentation Transcript


  1. Introduction to Terrier

  2. Existing IR Platforms
     • Academic:
       – Terrier
       – Zettair
       – Lemur/Indri
     • Non-Academic:
       – Lucene/Nutch
       – Xapian
     • Terrier:
       – Flexible & ideal for experimentation
       – Rapid development of new research ideas
       – Not just one model: implements various modern state-of-the-art IR models
       – Proven effective retrieval

  3. Open Source Terrier
     • Why Open Source? Terrier is a community project
       – You use & benefit
       – You contribute
       – Everyone benefits
     • Cross-OS, developed in Java
       – Runs on Windows, *nix, Mac OS X
     • Indexing and Querying APIs
       – Easy to extend
       – Adapt for new applications
       – Modular architecture
       – Simple to start working with
       – Many configuration options

  4. What’s in Terrier
     • Scripts to start Terrier
     • Documentation
     • Configuration files
     • Compiled Java files
     • Stopwords & tests
     • Java source
     • index/
     • results/

  5. Compiling Terrier
     • To use your own code with Terrier, add your jar file or class folder to the CLASSPATH environment variable
     • If you need to alter the Terrier code itself, you have to recompile, using one of:
       – bin/compile.sh
       – bin/compile.bat
       – ant
     • In Eclipse, you will need the ANTLR plugin to compile

  6. File->New->Project

  7. [No text on this slide]

  8. Using Terrier
     Terrier comes with three applications:
     • Desktop Terrier
     • Interactive Terrier
     • Batch (TREC) Terrier

  9. Desktop Terrier
     • Uses SimpleFileCollection: simple text, PDF, MS Word, MS PowerPoint, MS Excel, HTML, XML, XHTML, etc.
     • Java Swing GUI
     • Comes with Terrier

  10. Interactive Terrier

  11. Batch (TREC) Terrier
      • ./bin/trec_terrier.sh -i (indexing)
        – -H for Hadoop indexing
      • ./bin/trec_terrier.sh -r -Dtrec.model=PL2 (retrieval)
        – Classical models, such as TF-IDF and BM25, can also be selected
        – -q for query expansion: Bo1, Bo2 and KL
      • ./bin/trec_terrier.sh -e (evaluation)
        – P@20, P@30, etc.

  12. Indexing

  13. Term Pipelining
      • In Terrier, each token from a Document is passed through the Term Pipeline
      • Each Term Pipeline stage can either:
        – Transform the term, e.g. stemming (such as Porter’s English stemmer)
        – Drop the term, e.g. stopword removal

  14. Example
      • Original text
      • Tokenisation
      • Stopword removal
      • Stemming
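The four steps above can be sketched in a few lines of Python. This is a minimal illustration of the pipeline idea, not Terrier's Java implementation: the stage functions, the tiny stopword list and the crude suffix-stripping "stemmer" are all assumptions made for the example (a real setup would use a proper Porter stemmer).

```python
import re
from typing import Callable, List, Optional

# A pipeline stage returns the (possibly transformed) term, or None to drop it.
Stage = Callable[[str], Optional[str]]

# Tiny illustrative stopword list (assumption, not Terrier's list).
STOPWORDS = {"the", "a", "of", "and", "is", "in", "to"}


def tokenise(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stopword_stage(term: str) -> Optional[str]:
    """Drop the term if it is a stopword."""
    return None if term in STOPWORDS else term


def toy_stem_stage(term: str) -> Optional[str]:
    """Crude suffix stripping; a stand-in for a real stemmer, NOT Porter."""
    for suffix in ("ing", "ed", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def run_pipeline(tokens: List[str], stages: List[Stage]) -> List[str]:
    """Pass every token through each stage in order; dropped terms vanish."""
    output = []
    for term in tokens:
        current: Optional[str] = term
        for stage in stages:
            current = stage(current)
            if current is None:
                break  # a stage dropped the term
        if current is not None:
            output.append(current)
    return output


terms = run_pipeline(tokenise("The indexing of documents"),
                     [stopword_stage, toy_stem_stage])
print(terms)  # ['index', 'document']
```

Because each stage only sees one term and either returns or drops it, stages can be reordered or swapped without touching the rest of the indexing code.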

  15. Indexing API

  16. Retrieval in IR

  17. [No text on this slide]

  18. Scoring Documents
      • A simple model for scoring documents against a query is TF.IDF: [formula not preserved]
      • Also Language Modelling (Hiemstra): the weight w(t,d) of a query term t depends on how different its distribution in the document d is from its distribution in the whole collection
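As a rough illustration of TF.IDF scoring, the sketch below sums tf × log(N/df) over the query terms. This is one common textbook variant chosen for the example; it is an assumption, not necessarily the formula on the original slide or the one Terrier implements, and the function and variable names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf_scores(query_terms: List[str],
                  docs: Dict[str, List[str]]) -> Dict[str, float]:
    """Score each document as the sum over query terms of tf * log(N / df).

    Assumed textbook variant: raw term frequency, natural-log IDF,
    no length normalisation.
    """
    n_docs = len(docs)

    # Document frequency: in how many documents does each term occur?
    df: Counter = Counter()
    for terms in docs.values():
        df.update(set(terms))

    scores: Dict[str, float] = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        score = 0.0
        for term in query_terms:
            if tf[term] > 0:
                score += tf[term] * math.log(n_docs / df[term])
        if score > 0:
            scores[doc_id] = score
    return scores


docs = {
    "d1": ["terrier", "search", "engine"],
    "d2": ["terrier", "dog"],
    "d3": ["cat"],
}
scores = tf_idf_scores(["terrier", "search"], docs)
# d1 matches both query terms, d2 only the more common one, d3 neither.
```

A term that occurs in every document gets an IDF of zero under this variant, so it contributes nothing to the score.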

  19. Weighting Models in Terrier
      • Terrier provides many state-of-the-art document weighting models:
        – TF-IDF (with length normalisation, aka BM11)
        – Lemur’s TF-IDF
        – Okapi BM25
        – Hiemstra and Ponte & Croft Language Models
      • All in org.terrier.matching.models

  20. Score Documents
      • TAAT: Term-At-A-Time
      • DAAT: Document-At-A-Time, advantageous for retrieving from large indices
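The difference between the two strategies can be sketched as follows. This is a simplified illustration, assuming an in-memory index that maps each term to a posting list of (doc_id, weight) pairs sorted by doc_id; the structure and function names are hypothetical and do not reflect Terrier's matching classes.

```python
import heapq
from collections import defaultdict
from typing import Dict, List, Tuple

Posting = Tuple[int, float]          # (doc_id, precomputed term weight)
Index = Dict[str, List[Posting]]     # posting lists sorted by doc_id


def score_taat(query_terms: List[str], index: Index) -> Dict[int, float]:
    """Term-at-a-time: process one whole posting list after another,
    keeping a score accumulator for every document seen so far."""
    accumulators: Dict[int, float] = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in index.get(term, []):
            accumulators[doc_id] += weight
    return dict(accumulators)


def score_daat(query_terms: List[str], index: Index) -> Dict[int, float]:
    """Document-at-a-time: walk all posting lists in parallel in doc_id
    order, so each document's score is complete before moving on."""
    lists = [index[t] for t in query_terms if t in index]
    scores: Dict[int, float] = {}
    current_doc, current_score = None, 0.0
    # heapq.merge lazily merges the already-sorted posting lists.
    for doc_id, weight in heapq.merge(*lists, key=lambda p: p[0]):
        if doc_id != current_doc:
            if current_doc is not None:
                scores[current_doc] = current_score
            current_doc, current_score = doc_id, 0.0
        current_score += weight
    if current_doc is not None:
        scores[current_doc] = current_score
    return scores


index = {"a": [(1, 0.5), (3, 1.0)], "b": [(1, 0.25), (2, 2.0)]}
taat = score_taat(["a", "b"], index)
daat = score_daat(["a", "b"], index)
```

Both functions return the same scores. In the DAAT version a document's score is final as soon as the traversal moves past it, so a real system could keep only the current top-k results instead of one accumulator per matching document, which is one reason DAAT suits large indices.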

  21. Query Expansion
      • Why use Query Expansion?
        – To achieve better retrieval performance
      • How to use it?
        – Add the -q parameter to your command
      • Terrier’s QE is a pseudo-relevance feedback technique that
        – Expands the query by adding new query terms
        – Re-weights the query terms (KL, Bo1, Bo2)

  22. Expanding the query
      • The added query terms are meant to be related to the topic
      • QE brings more information to the query
      • It helps to retrieve more relevant documents, BUT it can also bring noise
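The general shape of pseudo-relevance feedback can be sketched as below. To keep the example short it selects expansion terms by raw frequency in the top-ranked documents, which is a deliberate simplification: it is not the Bo1, Bo2 or KL term-weighting models that Terrier uses, and it does not re-weight the original query terms. All names and parameters are hypothetical.

```python
from collections import Counter
from typing import List


def expand_query(query_terms: List[str],
                 ranked_docs: List[List[str]],
                 top_docs: int = 3,
                 new_terms: int = 2) -> List[str]:
    """Assume the top-ranked documents are relevant and add their most
    frequent terms (that are not already in the query) to the query."""
    feedback: Counter = Counter()
    for terms in ranked_docs[:top_docs]:
        feedback.update(terms)

    candidates = [(t, c) for t, c in feedback.items() if t not in query_terms]
    # Highest count first; ties broken alphabetically for determinism.
    candidates.sort(key=lambda tc: (-tc[1], tc[0]))
    return list(query_terms) + [t for t, _ in candidates[:new_terms]]


# Documents as returned by a first retrieval pass, best first.
ranked = [
    ["terrier", "retrieval", "index"],
    ["terrier", "retrieval", "java"],
    ["dog", "terrier"],
    ["cat"],
]
expanded = expand_query(["terrier"], ranked, top_docs=2, new_terms=1)
print(expanded)  # ['terrier', 'retrieval']
```

If the third document were included in the feedback set, "dog" would become a candidate term, which shows how a poor first-pass ranking can add noise to the query.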

  23. Extending Retrieval Use Cases: Document Priors
      • Assumption: you have a file containing PageRank scores for each document in the collection
      • Integrate with the retrieval score as: [formula not preserved]
      • How: use a DocumentScoreModifier
        – Modify retrieval scores at the end of Matching

  24. [No text on this slide]
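One way such a prior could be combined with retrieval scores is sketched below. The combination used here, score + weight × log(prior), is an assumption chosen for illustration, since the slide's own formula is not available. The function is plain Python and does not reproduce Terrier's DocumentScoreModifier interface; every name and default value is hypothetical.

```python
import math
from typing import Dict


def apply_prior(scores: Dict[str, float],
                priors: Dict[str, float],
                weight: float = 1.0,
                default_prior: float = 1e-6) -> Dict[str, float]:
    """Adjust each retrieval score with a query-independent document prior.

    Assumed combination: score + weight * log(prior). Documents without a
    known prior fall back to a small default value.
    """
    return {
        doc_id: score + weight * math.log(priors.get(doc_id, default_prior))
        for doc_id, score in scores.items()
    }


# Scores from the matching stage and PageRank-like priors (made-up numbers).
scores = {"d1": 2.0, "d2": 2.5}
priors = {"d1": 0.5, "d2": 0.01}
adjusted = apply_prior(scores, priors)
# d2 had the higher retrieval score, but its low prior now ranks it below d1.
```

The weight parameter controls how strongly the prior influences the final ranking; with a weight of 0 the original scores are returned unchanged.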

  25. Evaluation
      • How well did the system perform?
      • Specify the qrels file with the relevance assessments to use in etc/trec.qrels
      • Evaluate all the result files in the var/results directory
      • The .eval file contains the usual evaluation measures: P@10, P@20, etc.
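Measures such as P@10 and P@20 are precision at a rank cutoff: the fraction of the top k retrieved documents that the qrels mark as relevant. A minimal sketch, with hypothetical document ids and a hand-written relevance set standing in for a qrels file:

```python
from typing import List, Set


def precision_at_k(ranked_doc_ids: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant.

    If fewer than k documents were retrieved, the missing ranks count as
    non-relevant (the denominator stays k).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_doc_ids[:k] if doc_id in relevant)
    return hits / k


ranking = ["d3", "d1", "d7", "d2"]   # system output for one query, best first
relevant = {"d1", "d2"}              # relevance assessments for that query

print(precision_at_k(ranking, relevant, 1))  # 0.0
print(precision_at_k(ranking, relevant, 2))  # 0.5
print(precision_at_k(ranking, relevant, 4))  # 0.5
```

Reported figures are normally this value averaged over all queries in the topic set.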

  26. Data Structures Builders
      • Lexicon

  27. DocumentIndex

  28. CollectionStatistics

  29. DirectIndex

  30. InvertedIndex
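How these five structures relate to one another can be illustrated with toy in-memory versions. This is a simplified sketch under the usual textbook definitions (direct index: document to terms; inverted index: term to documents; lexicon: per-term statistics; document index: per-document statistics); the dictionary layouts and field names are assumptions and do not mirror Terrier's on-disk formats or Java classes.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def build_structures(docs: Dict[str, List[str]]):
    """Build toy versions of the five index structures from tokenised docs."""
    # Direct index: for each document, the terms it contains with frequencies.
    direct = {doc_id: Counter(terms) for doc_id, terms in docs.items()}

    # Inverted index: for each term, the documents containing it.
    inverted: Dict[str, Dict[str, int]] = defaultdict(dict)
    for doc_id, term_freqs in direct.items():
        for term, freq in term_freqs.items():
            inverted[term][doc_id] = freq

    # Lexicon: per-term statistics (document and collection frequency).
    lexicon = {
        term: {"df": len(postings), "cf": sum(postings.values())}
        for term, postings in inverted.items()
    }

    # Document index: per-document statistics (here just the length).
    document_index = {doc_id: len(terms) for doc_id, terms in docs.items()}

    # Collection statistics: global counts used by weighting models.
    stats = {
        "num_docs": len(docs),
        "num_tokens": sum(document_index.values()),
        "num_unique_terms": len(lexicon),
    }
    return lexicon, document_index, stats, direct, dict(inverted)


docs = {"d1": ["a", "b", "a"], "d2": ["b", "c"]}
lexicon, document_index, stats, direct, inverted = build_structures(docs)
```

Retrieval mostly reads the lexicon and the inverted index, while the direct index is the natural structure for tasks that need the terms of a given document, such as picking expansion terms from top-ranked documents.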
