Introduction to Terrier
Existing IR Platforms
• Academic:
  – Terrier
  – Zettair
  – Lemur/Indri
• Non-Academic:
  – Lucene/Nutch
  – Xapian
Terrier:
• Flexible & ideal for experimentation
• Rapid development of new research ideas
• Not just one model
  – Implements various modern state-of-the-art IR models
• Proven effective retrieval
Open Source Terrier
• Why Open Source? Terrier is a community project
  – You use & benefit
  – You contribute
  – Everyone benefits
• Cross-OS: developed in Java
  – Runs on Windows, *nix, Mac OS X
• Indexing and Querying APIs
  – Easy to extend
  – Adapt for new applications
  – Modular architecture
  – Simple to start working with
  – Many configuration options
What's in Terrier
• Scripts to start Terrier
• Documentation
• Configuration files
• Compiled Java files
• Stopwords & tests
• Java source
– index/
– results/
Compiling Terrier
• To use your own code with Terrier, add your jar file or your class folder to the CLASSPATH environment variable
• If you need to alter the code in Terrier itself, you have to recompile it, using one of:
  – bin/compile.sh
  – bin/compile.bat
  – ant
• In Eclipse, you will need the Antlr plugin to compile
File->New->Project
Using Terrier
Terrier comes with three applications:
• Desktop Terrier
• Interactive Terrier
• Batch (TREC) Terrier
Desktop Terrier
• SimpleFileCollection
  – Simple text, PDF, MS Word, MS PowerPoint, MS Excel, HTML, XML, XHTML, etc.
• Java Swing GUI
• Comes with Terrier
Interactive Terrier
Batch (TREC) Terrier
• Indexing: ./bin/trec_terrier.sh -i
  – -H for Hadoop indexing
• Retrieval: ./bin/trec_terrier.sh -r -Dtrec.model=PL2
  – Classical models, such as tf-idf and BM25, are also available
  – -q for Query Expansion: Bo1, Bo2 and KL
• Evaluation: ./bin/trec_terrier.sh -e
  – P@20, P@30, etc.
Indexing
Term Pipelining
• In Terrier, each token from a Document is passed through the Term Pipeline
• Each Term Pipeline stage can either:
  – Transform the term: stemming, e.g. Porter's English stemmer
  – Drop the term: stopword removal
Example
• Original text
• Tokenisation
• Stopword removal
• Stemming
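The pipeline stages listed above can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: `TermPipelineSketch`, `Stage`, the tiny stopword list and the suffix-stripping "stemmer" are all hypothetical names invented here. The sketch does not use Terrier's own classes, and the toy stemmer is not Porter's algorithm.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

/** Toy term pipeline: each stage either transforms a term or drops it. */
final class TermPipelineSketch {

    /** One pipeline stage; an empty Optional means "drop the term". */
    interface Stage {
        Optional<String> apply(String term);
    }

    // Hypothetical, tiny stopword list for illustration only.
    static final Set<String> STOPWORDS = Set.of("the", "a", "of", "are", "is", "and");

    static final Stage STOPWORD_REMOVAL =
            term -> STOPWORDS.contains(term) ? Optional.empty() : Optional.of(term);

    // Crude suffix stripper standing in for a real stemmer; NOT Porter's algorithm.
    static final Stage TOY_STEMMER = term -> {
        if (term.length() > 5 && term.endsWith("ing")) {
            return Optional.of(term.substring(0, term.length() - 3));
        }
        if (term.length() > 3 && term.endsWith("s")) {
            return Optional.of(term.substring(0, term.length() - 1));
        }
        return Optional.of(term);
    };

    /** Tokenises on non-alphanumeric characters, then runs each token through the stages in order. */
    static List<String> process(String text, List<Stage> stages) {
        List<String> out = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue;
            }
            Optional<String> current = Optional.of(token);
            for (Stage stage : stages) {
                // Once a stage drops the term, later stages are skipped.
                current = current.flatMap(stage::apply);
            }
            current.ifPresent(out::add);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Stage> pipeline = List.of(STOPWORD_REMOVAL, TOY_STEMMER);
        System.out.println(process("The dogs are jumping over the fences", pipeline));
    }
}
```

The order of the stages matters: stopwords are removed before stemming here, so the stopword list has to contain unstemmed forms.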
Indexing API
Retrieval in IR
Scoring Documents
• A simple model for scoring documents against a query is TF.IDF
• Also Language Modelling (Hiemstra)
  – The weight w(t,d) of a query term t in document d reflects how different the term's distribution in d is from its distribution in the whole collection
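As a rough illustration of TF.IDF scoring, the sketch below uses one common textbook variant, score(d, q) = Σ tf(t, d) · ln(N / df(t)), summed over the query terms t. The exact formula is an assumption made for this example, and `TfIdfSketch` with its toy statistics is hypothetical; it is not Terrier's implementation.

```java
import java.util.List;
import java.util.Map;

/** Textbook TF.IDF scoring over toy in-memory statistics (not Terrier's implementation). */
final class TfIdfSketch {

    /** idf(t) = ln(N / df(t)); a term that occurs in no document contributes nothing. */
    static double idf(int numDocs, int docFreq) {
        return docFreq == 0 ? 0.0 : Math.log((double) numDocs / docFreq);
    }

    /** score(d, q) = sum over query terms t of tf(t, d) * idf(t). */
    static double score(List<String> query, Map<String, Integer> docTermFreqs,
                        Map<String, Integer> docFreqs, int numDocs) {
        double total = 0.0;
        for (String term : query) {
            int tf = docTermFreqs.getOrDefault(term, 0);
            int df = docFreqs.getOrDefault(term, 0);
            total += tf * idf(numDocs, df);
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical statistics: 4 documents; "terrier" occurs in 1 of them, "search" in all 4.
        Map<String, Integer> docFreqs = Map.of("terrier", 1, "search", 4);
        Map<String, Integer> doc = Map.of("terrier", 2, "search", 3);
        // "search" occurs in every document, so its idf is 0 and only "terrier" contributes.
        System.out.println(score(List.of("terrier", "search"), doc, docFreqs, 4));
    }
}
```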
Weighting Models in Terrier
• Terrier provides many state-of-the-art document weighting models:
  – TF-IDF (with length normalisation, aka BM11)
  – Lemur's TF-IDF
  – Okapi BM25
  – Hiemstra and Ponte & Croft Language Models
• All in org.terrier.matching.models
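To make one of the listed models concrete, here is a minimal sketch of the textbook Okapi BM25 term weight. It is written from the commonly cited formula, not copied from org.terrier.matching.models, so Terrier's own class may differ in details such as the logarithm base or the query-side component. `Bm25Sketch` and the parameter values k1 = 1.2 and b = 0.75 are assumptions for illustration.

```java
/** Textbook Okapi BM25 term weight; a sketch, not Terrier's own BM25 class. */
final class Bm25Sketch {

    static final double K1 = 1.2;  // term-frequency saturation; a commonly used value, assumed here
    static final double B = 0.75;  // strength of document-length normalisation; also assumed

    /** Robertson/Sparck Jones idf; it becomes negative for terms in more than half the documents. */
    static double idf(long numDocs, long docFreq) {
        return Math.log((numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** Weight of one query term in one document. */
    static double termScore(double tf, double docLen, double avgDocLen, long numDocs, long docFreq) {
        // Longer-than-average documents get a larger normaliser, which lowers the score.
        double norm = K1 * (1.0 - B + B * docLen / avgDocLen);
        return idf(numDocs, docFreq) * tf * (K1 + 1.0) / (tf + norm);
    }

    public static void main(String[] args) {
        // Hypothetical statistics: 1000 documents, the term occurs in 10 of them.
        System.out.println(termScore(1, 100, 100, 1000, 10));
        System.out.println(termScore(5, 100, 100, 1000, 10));
        System.out.println(termScore(5, 400, 100, 1000, 10));
    }
}
```

A document's score for a query is the sum of these term weights over the query terms. Note how the weight grows with tf but saturates, and shrinks as the document gets longer than average.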
Score Documents
• TAAT: Term-At-A-Time
• DAAT: Document-At-A-Time
  – Advantageous for retrieving from large indices
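The difference between the two strategies is easiest to see side by side. The following sketch assumes each query term has a posting list sorted by document id with precomputed weights; `TaatDaatSketch` and `Posting` are hypothetical names and do not correspond to Terrier's matching classes.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy comparison of Term-At-A-Time and Document-At-A-Time scoring. */
final class TaatDaatSketch {

    /** One posting: a document id plus the precomputed weight of the term in that document. */
    static final class Posting {
        final int docId;
        final double weight;

        Posting(int docId, double weight) {
            this.docId = docId;
            this.weight = weight;
        }
    }

    /** TAAT: walk one posting list at a time, adding into per-document accumulators. */
    static Map<Integer, Double> taat(List<List<Posting>> postingLists) {
        Map<Integer, Double> accumulators = new TreeMap<>();
        for (List<Posting> list : postingLists) {
            for (Posting p : list) {
                accumulators.merge(p.docId, p.weight, Double::sum);
            }
        }
        return accumulators;
    }

    /** DAAT: advance all lists in parallel; each document is fully scored before the next one. */
    static Map<Integer, Double> daat(List<List<Posting>> postingLists) {
        // A real DAAT loop would usually keep only the top-k results in a heap;
        // every score is stored here so the two strategies can be compared.
        Map<Integer, Double> scores = new TreeMap<>();
        int[] cursor = new int[postingLists.size()];
        while (true) {
            // Find the smallest document id under any cursor.
            int current = Integer.MAX_VALUE;
            for (int i = 0; i < cursor.length; i++) {
                List<Posting> list = postingLists.get(i);
                if (cursor[i] < list.size()) {
                    current = Math.min(current, list.get(cursor[i]).docId);
                }
            }
            if (current == Integer.MAX_VALUE) {
                break; // every list is exhausted
            }
            double score = 0.0;
            for (int i = 0; i < cursor.length; i++) {
                List<Posting> list = postingLists.get(i);
                if (cursor[i] < list.size() && list.get(cursor[i]).docId == current) {
                    score += list.get(cursor[i]).weight;
                    cursor[i]++;
                }
            }
            scores.put(current, score);
        }
        return scores;
    }

    public static void main(String[] args) {
        List<List<Posting>> lists = List.of(
                List.of(new Posting(1, 0.5), new Posting(3, 1.0)),
                List.of(new Posting(1, 0.25), new Posting(2, 2.0)));
        System.out.println(taat(lists));
        System.out.println(daat(lists));
    }
}
```

Both methods produce the same scores. TAAT needs an accumulator for every candidate document, whereas DAAT finishes one document at a time, which is one reason it suits large indices.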
Query Expansion
• Why use Query Expansion?
  – To achieve better retrieval performance
• How to use it?
  – Add the -q parameter to your command
• Terrier's QE is a pseudo-relevance feedback technique that:
  – Expands the query by adding new query terms
  – Re-weights the query terms (KL, Bo1, Bo2)
Expanding the query
• The added query terms are meant to be related to the topic
• QE brings more information to the query
• It helps to retrieve more relevant documents, BUT it can also bring noise
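The general idea of pseudo-relevance feedback can be sketched as follows: treat the top-ranked documents as if they were relevant, and add their most frequent new terms to the query. This is deliberately simplified. It ranks candidate terms by raw frequency, which is a stand-in and not the Bo1, Bo2 or KL term-weighting models, and it does not re-weight the query terms. `QueryExpansionSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified pseudo-relevance feedback: frequency-based term selection, no re-weighting. */
final class QueryExpansionSketch {

    /**
     * @param query       original query terms
     * @param topDocs     terms of the top-ranked documents from a first retrieval pass
     * @param numNewTerms how many expansion terms to add at most
     */
    static List<String> expand(List<String> query, List<List<String>> topDocs, int numNewTerms) {
        // Count how often each non-query term occurs in the feedback documents.
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> doc : topDocs) {
            for (String term : doc) {
                if (!query.contains(term)) {
                    counts.merge(term, 1, Integer::sum);
                }
            }
        }
        List<Map.Entry<String, Integer>> candidates = new ArrayList<>(counts.entrySet());
        // Highest count first; ties broken alphabetically so the result is deterministic.
        candidates.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> expanded = new ArrayList<>(query);
        for (int i = 0; i < Math.min(numNewTerms, candidates.size()); i++) {
            expanded.add(candidates.get(i).getKey());
        }
        return expanded;
    }

    public static void main(String[] args) {
        List<List<String>> topDocs = List.of(
                List.of("terrier", "retrieval", "index"),
                List.of("terrier", "retrieval", "java"),
                List.of("retrieval", "index"));
        System.out.println(expand(List.of("terrier"), topDocs, 2));
    }
}
```

If the top-ranked documents are off topic, the added terms will be too, which is the noise the slide warns about.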
Extending Retrieval
Use Cases: Document Priors
• Assumption: you have a file containing PageRank scores for each document in the collection
• Integrate the prior with the retrieval score
• How: use a DocumentScoreModifier
  – Modifies retrieval scores at the end of Matching
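The sketch below shows what modifying scores with a document prior might look like. The combination formula is not given in the text above, so the linear form score + weight × prior used here is purely an assumption chosen for illustration. The class `DocumentPriorSketch` is hypothetical and does not implement Terrier's DocumentScoreModifier interface.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy score modification with a per-document prior (e.g. PageRank loaded from a file). */
final class DocumentPriorSketch {

    /**
     * ASSUMPTION: a linear combination, score + weight * prior.
     * The source does not state which formula is used.
     */
    static Map<String, Double> applyPrior(Map<String, Double> scores,
                                          Map<String, Double> priors, double weight) {
        Map<String, Double> modified = new HashMap<>();
        for (Map.Entry<String, Double> entry : scores.entrySet()) {
            // Documents without a known prior keep their original score.
            double prior = priors.getOrDefault(entry.getKey(), 0.0);
            modified.put(entry.getKey(), entry.getValue() + weight * prior);
        }
        return modified;
    }

    public static void main(String[] args) {
        // Hypothetical retrieval scores and priors.
        Map<String, Double> scores = Map.of("doc1", 1.0, "doc2", 1.2);
        Map<String, Double> priors = Map.of("doc1", 0.5, "doc2", 0.1);
        System.out.println(applyPrior(scores, priors, 1.0));
    }
}
```

With these toy numbers the prior is strong enough to move doc1 above doc2, so the weight has to be tuned to balance the prior against the content-based score.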
Evaluation
• How well did the system perform?
• Specify the qrels file with the relevance assessments to use in etc/trec.qrels
• Terrier evaluates all the result files in the var/results directory
• Each .eval file contains the usual evaluation measures: P@10, P@20, etc.
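For reference, precision at rank k (P@k) is the fraction of the top k retrieved documents that are judged relevant. A minimal sketch, with the hypothetical class `PrecisionAtKSketch` and toy document ids standing in for a results file and a qrels file:

```java
import java.util.List;
import java.util.Set;

/** Precision at rank k over a toy ranking and a toy set of relevant documents. */
final class PrecisionAtKSketch {

    static double precisionAt(int k, List<String> ranking, Set<String> relevant) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        int hits = 0;
        for (int i = 0; i < Math.min(k, ranking.size()); i++) {
            if (relevant.contains(ranking.get(i))) {
                hits++;
            }
        }
        // Divide by k even when fewer than k documents were retrieved.
        return (double) hits / k;
    }

    public static void main(String[] args) {
        List<String> ranking = List.of("d3", "d1", "d7", "d2", "d9");
        Set<String> relevant = Set.of("d1", "d2", "d5");
        System.out.println("P@2 = " + precisionAt(2, ranking, relevant));
        System.out.println("P@5 = " + precisionAt(5, ranking, relevant));
    }
}
```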
Data Structures Builders
• Lexicon
• DocumentIndex
• CollectionStatistics
• DirectIndex
• InvertedIndex
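How these structures relate to one another can be illustrated with a toy in-memory index. The field names below echo the structure names listed above, but the layout is an assumption made for this sketch: `ToyIndexSketch` is hypothetical and says nothing about Terrier's actual classes or on-disk formats.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy in-memory index showing what each structure records. */
final class ToyIndexSketch {

    final Map<String, Integer> lexicon = new HashMap<>();              // term -> term id
    final List<Integer> documentIndex = new ArrayList<>();             // doc id -> document length
    final List<Map<Integer, Integer>> directIndex = new ArrayList<>(); // doc id -> (term id -> tf)
    final Map<Integer, Map<Integer, Integer>> invertedIndex = new HashMap<>(); // term id -> (doc id -> tf)
    long numberOfTokens = 0;                                           // a collection statistic

    /** Adds one already-tokenised document and returns its doc id. */
    int addDocument(List<String> terms) {
        int docId = documentIndex.size();
        Map<Integer, Integer> termFreqs = new TreeMap<>();
        for (String term : terms) {
            Integer termId = lexicon.get(term);
            if (termId == null) {
                termId = lexicon.size();
                lexicon.put(term, termId);
            }
            termFreqs.merge(termId, 1, Integer::sum);
            invertedIndex.computeIfAbsent(termId, id -> new TreeMap<>())
                    .merge(docId, 1, Integer::sum);
        }
        documentIndex.add(terms.size());
        directIndex.add(termFreqs);
        numberOfTokens += terms.size();
        return docId;
    }

    /** Number of documents containing the term; 0 if the term is not in the lexicon. */
    int documentFrequency(String term) {
        Integer termId = lexicon.get(term);
        return termId == null ? 0 : invertedIndex.get(termId).size();
    }

    public static void main(String[] args) {
        ToyIndexSketch index = new ToyIndexSketch();
        index.addDocument(List.of("terrier", "ir", "terrier"));
        index.addDocument(List.of("ir", "platform"));
        System.out.println("documents: " + index.documentIndex.size());
        System.out.println("tokens: " + index.numberOfTokens);
        System.out.println("df(ir): " + index.documentFrequency("ir"));
    }
}
```

The direct index answers "which terms occur in this document?", which query expansion needs, while the inverted index answers "which documents contain this term?", which retrieval needs.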