Evaluating Retrieval Systems with Findability Measurement
Shariq Bashir, PhD Student, Technology University of Vienna
Agenda
• Document Findability
• Calculating Findability Measure
• GINI Coefficient
• Queries Creation for Findability Measure
• Experiments
Document Findability
• High findability of each and every document in the collection is considered an important factor in legal or patent retrieval settings.
• For example, in patent retrieval settings, the inaccessibility of a single related patent document can lead to the approval of a wrong patent application.
Document Findability
• Easy vs. hard findability
• A patent is called easily findable if it is accessible in the top-ranked results of several of its relevant queries.
• The farther a patent is from the top-ranked results, the harder it is to find.
• Why? Because users are mostly interested in only the top-ranked results (say, the top 30).
Document Findability
• Consider two retrieval systems (RS1, RS2) and three patents (P1, P2, P3).
• The following table shows the findability values of the three patents within the top 30 results.
• It is clear that RS2 makes all patents more findable than RS1.
What Makes Documents Hard to Find
• System bias
• Bias is a term used in IR for when a retrieval system gives preference to some features of documents when it ranks the results of queries.
• For example, PageRank is biased toward documents with more in-links; BM25, BM25F, and TF-IDF are biased toward large term frequencies.
• Why is bias dangerous? Because under bias some documents become more findable, while the rest become very hard to find.
Bias with Findability Analysis
• We can capture the impact of bias in different retrieval systems using findability analysis.
• If a system has less bias, it makes individual documents more findable.
• Findability evaluation vs. precision-based evaluation
• We cannot use findability evaluation at the level of individual queries.
• It is a large-scale evaluation, used only for capturing the bias of retrieval systems.
Findability Measure
• Given a collection of documents D and a large set of queries Q, the findability r(d) is measured for each document d ∈ D.
• k_dq is the rank of d ∈ D in the result set of query q ∈ Q, and c denotes the maximum rank that a user is willing to proceed down. The function f(k_dq, c) returns a value of 1 if k_dq <= c, and 0 otherwise.
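The formula itself is not preserved in this transcript. As a minimal sketch, assume the cumulative form in which r(d) is the sum of f(k_dq, c) over all queries q in Q, with every query weighted equally. The function names, the `documents` argument, and the toy rankings below are hypothetical and only illustrate the counting.

```python
def f(rank, c):
    """Return 1 if the document is ranked within the cutoff c, else 0."""
    return 1 if rank <= c else 0


def findability(rankings, documents, c):
    """Count, for each document, the queries that rank it within the top c.

    rankings  : dict mapping a query to its list of document ids in rank order
    documents : all document ids in the collection (so unretrieved ones get 0)
    c         : maximum rank a user is assumed to examine
    """
    r = {doc: 0 for doc in documents}
    for ranked_docs in rankings.values():
        for rank, doc in enumerate(ranked_docs, start=1):
            r[doc] += f(rank, c)
    return r


# Hypothetical toy data: three queries over four patents, cutoff c = 2.
rankings = {
    "q1": ["P1", "P2", "P3"],
    "q2": ["P2", "P1", "P3"],
    "q3": ["P2", "P3", "P1"],
}
scores = findability(rankings, documents=["P1", "P2", "P3", "P4"], c=2)
print(scores)
```

Passing the full document list matters for the later bias analysis: a patent that no query ever retrieves must still appear with r(d) = 0.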
GINI Coefficient
• To view the bias of a retrieval system as a single value, we can use the GINI coefficient.
• To compute the GINI coefficient, the values r(d_i) should be sorted in ascending order. N is the total number of documents.
• If G = 0, there is no bias, because all documents are equally findable. If G = 1, only one document is findable, and all other documents have r(d) = 0.
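The GINI formula is also missing from this transcript. The sketch below assumes a common normalized form, G = sum over i of (2i - N - 1) * r(d_i), divided by (N - 1) times the sum of all r(d_j), with the r(d_i) sorted in ascending order. This form is an assumption, chosen because it gives exactly G = 0 and G = 1 in the two cases described above. Returning 0.0 for an empty or all-zero input is a convention of this sketch, not something stated in the slides.

```python
def gini(r_values):
    """GINI coefficient of findability scores (assumed normalized form)."""
    values = sorted(r_values)          # ascending order, as the slide requires
    n = len(values)
    total = sum(values)
    if n < 2 or total == 0:
        return 0.0                     # convention of this sketch only
    weighted = sum((2 * i - n - 1) * v for i, v in enumerate(values, start=1))
    return weighted / ((n - 1) * total)


print(gini([5, 5, 5, 5]))   # all documents equally findable
print(gini([0, 0, 0, 7]))   # a single document holds all findability
print(gini([1, 2, 3]))      # something in between
```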
Bias with Findability (Example): GINI Coefficient with Lorenz Curve
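The plot from this slide is not preserved. As a hedged illustration of how a Lorenz curve can be derived from findability scores, the sketch below sorts the scores in ascending order and returns, for each fraction of the collection, the cumulative share of total findability. The function name and the sample values are hypothetical.

```python
def lorenz_curve(r_values):
    """Return (fraction of documents, cumulative share of findability) points."""
    values = sorted(r_values)
    total = sum(values)
    if total == 0:
        raise ValueError("Lorenz curve is undefined when all scores are zero")
    n = len(values)
    points = [(0.0, 0.0)]
    cumulative = 0
    for i, value in enumerate(values, start=1):
        cumulative += value
        points.append((i / n, cumulative / total))
    return points


# With no bias the points lie on the diagonal; with bias the curve sags below it.
print(lorenz_curve([1, 1, 1, 1]))
print(lorenz_curve([0, 0, 0, 4]))
```

The further the curve bows away from the diagonal, the more the findability is concentrated in a few documents.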
Bias of Retrieval Systems
• Experiment setting
• We used all patents listed under the following United States Patent Classification (USPC) classes:
• 433 (Dentistry), 424 (Drug, bio-affecting and body treating compositions), 422 (Chemical apparatus and process disinfecting, deodorizing, preserving, or sterilizing), and 423 (Chemistry of inorganic compounds).
Experiment Setting
• Retrieval systems used:
• The OKAPI retrieval function (BM25).
• Exact match model.
• TF-IDF.
• Language modeling with term smoothing for pseudo-relevance feedback selection (LM).
• Kullback-Leibler divergence (KLD).
• Term selection value (Robertson and Walker) (QE TS).
• Pseudo-relevance feedback document selection using a clustering approach (Cluster).
• For all query expansion models, we used the top 35 documents for pseudo-relevance feedback and 50 terms for query expansion.
Experiment Setting
• Query creation for findability analysis
• In query creation, we try to reflect how patent examiners create their query sets during a “patent invalidity search”.
Experiment Setting
• Approach 1:
• First, we extract all single frequent terms from the claim sections that have support greater than some threshold.
• Then we combine these single frequent terms into two-, three-, and four-term combinations to construct longer queries.
Figure: Use Patent (A) as a query for searching related documents.
Experiment Setting
Terms with Support >= 3
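A minimal sketch of Approach 1 follows, under stated assumptions: "support" is taken to mean the number of claims in which a term occurs, tokenization is a plain whitespace split, and queries are built only from two-, three-, and four-term combinations. The slides do not define these details, and the function names and claim texts are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_terms(claims, min_support):
    """Return terms that occur in at least min_support claims (assumed meaning)."""
    support = Counter()
    for claim in claims:
        support.update(set(claim.lower().split()))  # count each term once per claim
    return sorted(term for term, count in support.items() if count >= min_support)


def build_queries(terms, sizes=(2, 3, 4)):
    """Combine single frequent terms into longer queries of the given sizes."""
    queries = []
    for size in sizes:
        queries.extend(" ".join(combo) for combo in combinations(terms, size))
    return queries


# Hypothetical claim texts of one patent.
claims = [
    "dental implant screw thread",
    "dental implant screw coating",
    "dental implant thread",
    "screw thread",
]
terms = frequent_terms(claims, min_support=3)
queries = build_queries(terms)
print(terms)
print(len(queries), queries[:3])
```

The number of combinations grows quickly with the number of frequent terms, so in practice the support threshold also controls the size of the query set.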
Experiment Setting
• Approach 2:
• If a patent contains many rare terms, then queries collected from only that single patent document cannot retrieve all of its similar patents.
• In this query creation approach, we construct queries by taking patent relatedness into account.
Experiment Setting
• Approach 2 steps:
• (Step 1): For each patent, group all of its related patents in a set R using a k-nearest neighbor approach.
• (Step 2): Using R, construct its language model to find the dominant terms that can retrieve the documents in R.
• Here P_jm(t|R) is the probability of term t in the set R, and P_jm(t|corpus) is the probability of term t in the whole collection.
• This is similar to how terms are used in language modeling (query expansion) to bring up relevant documents.
• (Step 3): Combine single terms into two-, three-, and four-term combinations to construct longer queries.
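The term-scoring formula for Step 2 is not preserved in this transcript. The sketch below is therefore only one plausible reading, not the author's formula: it assumes a Jelinek-Mercer smoothed P_jm(t|R), a maximum-likelihood corpus probability, and a KL-divergence-style score P_jm(t|R) * log(P_jm(t|R) / P(t|corpus)). The related set R is taken as given (the k-nearest neighbor step is omitted), and the smoothing weight `lam`, the function names, and the toy documents are hypothetical.

```python
import math
from collections import Counter


def term_probs(docs):
    """Maximum-likelihood term probabilities over a list of tokenized documents."""
    counts = Counter(term for doc in docs for term in doc)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def dominant_terms(related, corpus, lam=0.5, top_n=3):
    """Rank terms by an assumed KL-style score; related must be part of corpus."""
    p_rel = term_probs(related)
    p_cor = term_probs(corpus)
    scores = {}
    for term, p_c in p_cor.items():
        # Assumed Jelinek-Mercer smoothing of the related-set model.
        p_r = (1 - lam) * p_rel.get(term, 0.0) + lam * p_c
        scores[term] = p_r * math.log(p_r / p_c)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Hypothetical tokenized patents; the first two form the related set R.
corpus = [
    ["dental", "implant", "screw"],
    ["dental", "implant", "coating"],
    ["acid", "reactor", "catalyst"],
    ["acid", "catalyst", "process"],
    ["process", "reactor", "acid"],
]
related = corpus[:2]
print(dominant_terms(related, corpus, top_n=2))
```

Terms that are frequent in R but rare in the whole collection score highest, which is the behaviour Step 2 describes; the selected terms would then be combined into longer queries as in Step 3.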
Experiment Setting
• Properties of queries used in experiments
CQG 1: Approach 1
CQG 2: Approach 2
Bias of Retrieval Systems with Patent Collection (433, 424) With Query Creation Approach 1
Bias of Retrieval Systems with Patent Collection (433, 424) With Query Creation Approach 2
GINI Index of Retrieval Systems with Patent Collection (433, 424)
GINI Index of Retrieval Systems with Patent Collection (422, 423)
Future Work
• We are working toward improving the findability of patents using a query expansion approach.
• We have results in which selecting better documents for pseudo-relevance feedback can improve the findability of documents.
• Incorporating an externally provided ontology in query expansion may also play a role in improving the findability of documents.
References
• Leif Azzopardi, Vishwa Vinay. Retrievability: an evaluation measure for higher order information access tasks. CIKM '08: Proceedings of the 17th ACM Conference on Information and Knowledge Management, pages 561-570, October 26-30, 2008, Napa Valley, California, USA.
• Chris Jordan, Carolyn Watters, Qigang Gao. Using controlled query generation to evaluate blind relevance feedback algorithms. JCDL '06: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries, pages 286-295, 2006, Chapel Hill, NC, USA.
• Tonya Custis, Khalid Al-Kofahi. A new approach for evaluating query expansion: query-document term mismatch. SIGIR 2007: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 575-582, July 23-27, 2007, Amsterdam, The Netherlands.