Gravitation-Based Model for Information Retrieval
Shuming Shi, Ji-Rong Wen, Qing Yu, Ruihua Song, Wei-Ying Ma
Microsoft Research Asia
shumings@microsoft.com
Image from: http://www.awesomelibrary.org/images/solar-system-nasa.jpg
SIGIR’2005

Background
A core problem in Information Retrieval (IR): determine the relevance of a document to a query.
Query: Bill Clinton
Document: Relevant? How relevant?

Background
• IR Models & Perspectives
• IR models define the representation of documents, queries, and the relevance relationship between them
• Behind every IR model lies a primary perspective on information retrieval

Background
• Hard questions
  • What is the essence of information retrieval?
  • What is the right perspective on it?
• So far, we have learned more about IR each time a new perspective was adopted
• It would therefore be helpful to view IR problems from further new perspectives
• We try to view IR from the perspective of physics

Background
[Figure, 1687 AD. From: http://csep10.phys.utk.edu/astr161/lect/history/newtongrav.html]

Background
[Figure. From: http://www.enterprisemission.com/hyper2a.php]

Background
• We live in a physical world that is governed by fundamental laws of physics.
• Can we get help from “the God” in acquiring a deeper understanding of information retrieval?
• Simply start from Newton’s Universal Law of Gravitation…

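
For reference, Newton's law of universal gravitation in its standard textbook form is given here. The symbols are the usual physics ones, not notation taken from the slides: F is the magnitude of the attractive force between two point masses m_1 and m_2 separated by a distance r, and G is the gravitational constant.

```latex
F = G \, \frac{m_1 m_2}{r^2}
```

The force grows with either mass and falls off with the square of the distance, which is the behavior the model borrows when it treats a relevance score as an attractive force.
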
Preliminary Achievements
It is encouraging that we can really benefit from nature. With the new perspective, we obtain the following preliminary achievements:
• We build a new IR model, GBM, from which many effective ranking functions can be derived.
• The BM25 formula can be derived from our model, which gives an intuitive physical interpretation of this powerful and robust function.
• A more reasonable approach for structured document retrieval can be obtained directly from the model. This approach is not only highly effective but also robust under various conditions.
Note on BM25:
• It was first discovered by Robertson et al., inspired by the shape of a complex formula derived from a probabilistic model under the 2-Poisson assumption.
• Amati and van Rijsbergen proposed a probabilistic framework with which the BM25 function with some special parameters (k1=1.2, b=0.75; or k1=2, b=0.75) can be approximated numerically.
• A complete theoretical derivation of the BM25 formula is still lacking.

Outline
• Background
• Gravitation-based Model
• Notations & Basic Concepts
• Discrete GBM Model
• Continuous GBM Model
• Model analysis
• GBM Model for Structured Document Retrieval
• Summary

GBM: Initial Idea
A mapping needs to be built from the concepts of information retrieval to those of physics.
IR concepts & notations:
• |D|: document length
• df(t): document frequency of t
• avdl: average document length in a collection
• N: total number of documents
• c(t,D): number of occurrences of t in D (also written as tf(t,D))
Physics concepts: mass, distance, …
Query: Bill Clinton
Document:
Relevance score → Attractive force

GBM: Notations & Basic Concepts
• Particle (= atom): basic element of any object
• A particle has two attributes: mass and type
• Type: determined by the term object it composes

GBM: Notations & Basic Concepts
• A term object has 4 attributes: type, shape, mass, and diameter
• H(D): hidden terms in document D
• Two natural assumptions:

Notation List

Discrete GBM Model
Key points:
1. Under the attraction of query terms, the structure of each document would be adjusted to an optimized-term-placement state.
2. The relevance between a document and a query is defined by the attractive force between them when the document is in its optimized-term-placement state.
Optimized-term-placement state: a state in which the aggregated force between the document and the query is maximized.

Term Weighting Formula
The force between query term t and its i-th nearest occurrence in D:
The maximal (optimized) gravitational force between t and D:
The attractive force between D and Q:
Unknown expressions: m(t,Q), m(t,D), and di(t,D)
Need: mass and diameter estimation

Mass and Diameter Estimation
• Define a document-independent mass for each (type of) term. It denotes the average mass of term t in the whole collection.
• For any two terms, their mass ratio in any document is equal to the ratio of their average masses in the whole collection.
• Assume that all terms in the same document have equal diameters.
(Assumption-1) (Assumption-2) (Assumption-3) (Assumption-4)

Ultimate Discrete GBM Formula
The ultimate term-weighting function: where and
The average (document-independent) mass of term t in the collection
• The mass of a document is a measure of its quality, which depends on how informative and important it is.
• Relationship with PageRank? <Future work>

Ultimate Discrete GBM Formula
If m(D) = const, di(D) = const, and
Then a special case of the term-weighting function: where
Two parameters:

Continuous GBM Model
Term shape: ideal cylinder
Document D is now in its optimized-term-placement state

Term Weighting Formula
The force between query term t and its i-th nearest occurrence in D:
The maximal (optimized) gravitational force between t and D:

Ultimate Continuous GBM Formula
By doing mass and diameter estimation, we have the ultimate term-weighting function: where and
If m(D) = const, di(D) = const, and
Then a special case of the above term-weighting function:
(Two parameters: )

Continuous GBM Formula vs. BM25
A special case of the continuous GBM term-weighting function: where
BM25 term-weighting function:

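
As a point of comparison, the BM25 term weight can be sketched as below. This is one commonly cited form of BM25 (Robertson/Sparck Jones IDF with 0.5 smoothing, query term frequency ignored), written with the notation list of this talk; the function name and the default parameters are illustrative assumptions, and the exact variant shown on the slide is not recoverable from the text.

```python
import math


def bm25_term_weight(tf, df, doc_len, avdl, n_docs, k1=1.2, b=0.75):
    """One common form of the BM25 weight of term t in document D.

    tf      -- c(t,D), number of occurrences of t in D
    df      -- df(t), number of documents containing t
    doc_len -- |D|, document length
    avdl    -- average document length in the collection
    n_docs  -- N, total number of documents
    """
    # Inverse document frequency (can go negative for very common terms).
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    # Document-length normalisation controlled by b.
    length_norm = k1 * ((1 - b) + b * doc_len / avdl)
    # Saturating term-frequency component controlled by k1.
    return idf * tf * (k1 + 1) / (tf + length_norm)
```

The two free parameters here are k1 (how quickly repeated occurrences saturate) and b (how strongly long documents are penalised).
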
Other Ranking Formulas Derived
Ranking formulas (highly simplified version) derived from the continuous GBM model with various gravitational-field functions

Check with Heuristic Constraints
• [Fang et al, SIGIR’04]: Some heuristic constraints related to TF, IDF, and document length that all reasonable ranking formulas should satisfy
  • TFC1, TFC2
  • TDC, M-TDC
  • LNC1, LNC2
  • TF-LNC
• All our derived term-weighting functions satisfy all the above constraints.

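
To give a feel for what such a check involves, here is a minimal numerical spot-check of three of the constraints, paraphrased from memory of Fang et al.: TFC1 (more occurrences of a query term score higher), TFC2 (each extra occurrence adds less), and LNC1 (adding a non-query term must not raise the score). It is applied to the TF/length part of BM25 with a positive IDF assumed, not to the GBM formulas themselves, and a grid check like this is evidence rather than a proof. All function names are hypothetical.

```python
def bm25_tf_part(tf, doc_len, avdl=100.0, k1=1.2, b=0.75):
    """TF/length component of BM25 (IDF factor omitted, assumed positive)."""
    norm = k1 * ((1 - b) + b * doc_len / avdl)
    return tf * (k1 + 1) / (tf + norm)


def check_constraints(score, max_tf=20, doc_len=100):
    """Spot-check TFC1, TFC2 and LNC1 for score(tf, doc_len) on a small grid."""
    tfs = range(1, max_tf)
    # TFC1: one more occurrence of the query term gives a higher score.
    tfc1 = all(score(tf + 1, doc_len) > score(tf, doc_len) for tf in tfs)
    # TFC2: the gain from each additional occurrence keeps shrinking.
    tfc2 = all(
        score(tf + 1, doc_len) - score(tf, doc_len)
        > score(tf + 2, doc_len) - score(tf + 1, doc_len)
        for tf in tfs
    )
    # LNC1: a longer document with the same tf must not score higher.
    lnc1 = all(score(tf, doc_len + 1) <= score(tf, doc_len) for tf in tfs)
    return {"TFC1": tfc1, "TFC2": tfc2, "LNC1": lnc1}


print(check_constraints(bm25_tf_part))
```

Any candidate term-weighting function with the same `(tf, doc_len)` signature can be passed to `check_constraints` in place of `bm25_tf_part`.
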
Preliminary Experiments
• Experimental Setup
  • Corpora characteristics
  • Query sets used in the experiments

Preliminary Experiments
• Experimental Results
Optimal performance comparison among some formulas over various corpora and tasks (measure: mean average precision)

Structured Document Retrieval
• A document is said to be structured here when it contains multiple fields.
• Current approaches for structured document retrieval
  • Score combination
    • The most commonly used and well-studied approach
    • Rank combination is a special case of score combination
  • Term-frequency combination
    • [Robertson et al, CIKM’04]: An extension of BM25
    • [Ogilvie et al, SIGIR’03]: Linearly combining language models
Each approach works moderately well, but…

Score Combination Issues
• For a multi-term query, a document matching a single query term over many fields could get an unreasonably higher score than another document that matches all the query terms in a few fields. (See discussions in [Robertson et al, CIKM’04])
score(d1) = s + s + s + … + s = 8s
score(d2) = 2s + 2s + 0 + … + 0 = 4s
score(d1) > score(d2): Unreasonable

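
The 8s versus 4s example can be reproduced with a toy sketch. It assumes eight fields, unit field weights, and a per-field score of s for every query term the field contains; the field contents, the value of s, and the function names are illustrative assumptions, not the scoring used in the talk.

```python
def field_score(field_terms, query_terms, s=1.0):
    """Toy per-field score: s for every query term the field contains."""
    return s * sum(1 for t in query_terms if t in field_terms)


def combined_score(doc_fields, query_terms, field_weights=None):
    """Score combination: weighted sum of independently computed field scores."""
    if field_weights is None:
        field_weights = [1.0] * len(doc_fields)
    return sum(
        w * field_score(f, query_terms)
        for w, f in zip(field_weights, doc_fields)
    )


query = ["bill", "clinton"]
# d1 matches only "bill", but in all eight fields.
d1 = [{"bill"}] * 8
# d2 matches both query terms, but only in two fields.
d2 = [{"bill", "clinton"}] * 2 + [set()] * 6

print(combined_score(d1, query))  # 8.0
print(combined_score(d2, query))  # 4.0
```

Because each field is scored in isolation, nothing in the sum rewards d2 for covering the whole query, which is why d1 comes out ahead.
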
TF Combination Issues
Consider a single-term query Q = t, and some documents with two fields (F1, F2).
Assuming: w1 = weight(F1) = 5; w2 = weight(F2) = 1
Example-1 (assuming |d1| = |d2|):
tf(t,d1) = w1 * 1 + w2 * 0 = 5
tf(t,d2) = w1 * 0 + w2 * 6 = 6
score(d1) < score(d2): Reasonable
Example-2 (assuming |d3| = |d4|):
tf(t,d3) = w1 * 1 + w2 * 8 = 13
tf(t,d4) = w1 * 0 + w2 * 14 = 14
score(d3) < score(d4): Unreasonable
• Larger w1?
  • Can’t remove this issue
  • Potential risk of making the case of Example-1 unreasonable

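
The arithmetic of the two examples, and the reason a larger w1 does not help, can be checked with a short sketch. It uses the per-field counts from the slide (with w1 = 5 and w2 = 1, d4's combined count evaluates to 14) and assumes, as a reading of the slide, that the reasonable outcomes are d2 above d1 and d3 above d4. The helper names are hypothetical.

```python
def combined_tf(field_tfs, field_weights):
    """Term-frequency combination: weighted sum of per-field term counts."""
    return sum(w * tf for w, tf in zip(field_weights, field_tfs))


weights = (5, 1)  # w1 = weight(F1), w2 = weight(F2)

# Each document is (occurrences of t in F1, occurrences of t in F2).
d1, d2 = (1, 0), (0, 6)    # Example-1
d3, d4 = (1, 8), (0, 14)   # Example-2

for name, doc in [("d1", d1), ("d2", d2), ("d3", d3), ("d4", d4)]:
    print(name, combined_tf(doc, weights))


def both_examples_reasonable(w1, w2=1):
    """True if d2 beats d1 (Example-1) and d3 beats d4 (Example-2)."""
    ex1_ok = combined_tf(d1, (w1, w2)) < combined_tf(d2, (w1, w2))
    ex2_ok = combined_tf(d3, (w1, w2)) > combined_tf(d4, (w1, w2))
    return ex1_ok and ex2_ok


# No integer w1 in this range satisfies both examples at once.
print(any(both_examples_reasonable(w1) for w1 in range(1, 101)))  # False
```

For these counts, Example-1 needs w1 < 6 while Example-2 needs w1 > 6, so raising w1 only trades one unreasonable ranking for the other.
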
Structured Document Retrieval by GBM

Experimental Results
Performance comparison of different approaches for the combination of body and title fields

Summary
• Viewing IR from a different viewpoint is as important as going deeper along traditional perspectives.
• This paper may be a first step toward taking a physics viewpoint
• It is encouraging that we can really benefit from nature
  • A family of effective ranking functions is derived
  • BM25 is given a physics interpretation
  • A more reasonable approach for structured document retrieval is obtained

Sorry, Sir Isaac Newton. Hope I am not abusing your laws.

The End
Gravitation-Based Model for Information Retrieval
Please send your comments to: shumings@microsoft.com