Dependence Language Model for Information Retrieval. Jianfeng Gao, Jian-Yun Nie, Guangyuan Wu, Guihong Cao. SIGIR 2004.
Reference • Ciprian Chelba, David Engle, et al. Structure and performance of a dependency language model. Eurospeech 1997. • Daniel D. K. Sleator and Davy Temperley. Parsing English with a Link Grammar. Technical Report CMU-CS-91-196, 1991.
Why use the independence assumption? • The independence assumption is one of the assumptions widely adopted in probabilistic retrieval theory. • Why? • It makes retrieval models simpler. • It makes retrieval operations tractable. • The shortcoming of the independence assumption • The independence assumption does not hold in textual data.
Recent ideas on the dependence assumption • Bigram • Some language modeling approaches try to incorporate word dependence by using bigrams. • Shortcomings: • Word dependencies exist not only between adjacent words but also at greater distances. • Adjacent words are not always actually related. • The bigram language model shows only marginally better effectiveness than the unigram model. • Bi-term • The bi-term language model is similar to the bigram model, except that the order constraint on terms is relaxed. • "information retrieval" and "retrieval of information" are assigned the same probability of generating the query (see the sketch below).
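A minimal sketch of the bigram vs. bi-term distinction, with hypothetical counts and a simple pooled normalization that is only one possible choice:

```python
from collections import Counter

# Hypothetical counts of adjacent word pairs and single words in a document.
pair_counts = Counter({("information", "retrieval"): 3,
                       ("retrieval", "information"): 1})
unigram_counts = Counter({"information": 10, "retrieval": 8})

def bigram_prob(w_prev, w):
    # Ordered: P(w | w_prev) uses only the exact pair (w_prev, w).
    return pair_counts[(w_prev, w)] / unigram_counts[w_prev]

def biterm_prob(w_prev, w):
    # Unordered: pool the counts of both orders, so "information retrieval"
    # and "retrieval (of) information" contribute the same evidence.
    pooled = pair_counts[(w_prev, w)] + pair_counts[(w, w_prev)]
    return pooled / (2 * unigram_counts[w_prev])

print(bigram_prob("retrieval", "information"))  # 0.125, order-sensitive
print(biterm_prob("retrieval", "information"))  # 0.25, order-insensitive
```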
Introduction • This paper presents a maximum entropy language model that incorporates both syntax and semantics via a dependency grammar. • Dependency grammar: expresses the relations between words by a directed graph, which can incorporate the predictive power of words that lie outside of bigram or trigram range.
Introduction • Why we use N-grams • For an N-gram model over a vocabulary V, we need to store on the order of |V|^N independent parameters. • The drawback of N-grams • An N-gram model blindly discards relevant words that lie N or more positions in the past.
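A quick worked check of that storage argument (the vocabulary size of 20,000 is illustrative, not from the slides):

```python
V = 20_000  # illustrative vocabulary size

for n in (1, 2, 3):
    # An n-gram model conditions each word on the previous n-1 words,
    # giving roughly |V|**n independent parameters to store.
    print(f"{n}-gram: ~{V**n:.2e} parameters")
# 1-gram: ~2.00e+04, 2-gram: ~4.00e+08, 3-gram: ~8.00e+12
```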
Structure of the model • Develop an expression for the joint probability P(W, K), where W is the word sequence and K is the linkage of the sentence. • Then we get P(W) = Σ_K P(W, K). • Assume that the sum is dominated by a single term; then P(W) ≈ max_K P(W, K).
A dependency language model for IR • Given a query Q = q1 q2 ... qm, we want to rank documents D by P(Q|D). • Previous work: • Assume independence between query terms: P(Q|D) = ∏i P(qi|D). • New work: • Assume that the term dependencies in a query form a linkage L, a hidden variable: P(Q|D) = ΣL P(L|D) P(Q|L, D).
A dependency language model for IR • Assume that the sum over all possible linkages L is dominated by a single term, the most probable linkage: P(Q|D) ≈ P(L|D) P(Q|L, D). • Assume that each term is dependent on exactly one related query term generated previously: P(Q|L, D) = P(q_h|D) ∏(i,j)∈L P(qj|qi, D), where q_h is the term generated first.
A dependency language model for IR • Assume that the generation of a single term is independent of L: P(qj|qi, L, D) = P(qj|qi, D). • By this assumption, we would have arrived at the same result by starting from any term, so L can be represented as an undirected graph. • Rewriting each conditional as P(qj|qi, D) = P(qi, qj|D)/P(qi|D) yields the ranking formula: log P(Q|D) = log P(L|D) + Σi log P(qi|D) + Σ(i,j)∈L log R(qi, qj|D), where R(qi, qj|D) = P(qi, qj|D) / (P(qi|D) P(qj|D)). A code sketch follows.
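A minimal sketch of that ranking formula, assuming a doc_model object that exposes the three estimators (the names log_p_linkage, p_term, and log_dependence are hypothetical; the estimators themselves are sketched under Parameter Estimation below):

```python
import math

def log_p_query_given_doc(query_terms, linkage, doc_model):
    """Score log P(Q|D) under the dependence decomposition:
    log P(L|D) + sum_i log P(q_i|D) + sum_{(i,j) in L} log R(q_i, q_j|D).
    `linkage` is a set of undirected pairs of term indices."""
    score = doc_model.log_p_linkage(linkage)               # log P(L|D)
    for q in query_terms:
        score += math.log(doc_model.p_term(q))             # log P(q_i|D)
    for i, j in linkage:
        score += doc_model.log_dependence(query_terms[i],
                                          query_terms[j])  # log R(qi,qj|D)
    return score
```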
Parameter Estimation • Estimating P(L|D) • Assume that the links are independent: P(L|D) ≈ ∏(i,j)∈L P(R = 1 | qi, qj). • Then count the relative frequency of a link l between qi and qj, given that they appear in the same sentence: P(R = 1 | qi, qj) ≈ C(qi, qj, R = 1) / C(qi, qj), where C(qi, qj, R = 1) is the number of training sentences in which qi and qj have a link, and C(qi, qj) is the number of sentences in which they co-occur.
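A sketch of that frequency estimate, assuming the training sentences have already been parsed into links (e.g., with Link Grammar, which is outside this sketch):

```python
from collections import Counter
from itertools import combinations

cooccur = Counter()  # C(qi, qj): sentences where both terms appear
linked = Counter()   # C(qi, qj, R=1): sentences where they have a link

def observe(sentence_terms, links):
    """Update counts from one parsed training sentence.
    `links` is the set of word pairs connected by the parser."""
    for pair in combinations(sorted(set(sentence_terms)), 2):
        cooccur[pair] += 1
    for a, b in {tuple(sorted(link)) for link in links}:  # dedupe per sentence
        linked[(a, b)] += 1

def p_link(a, b):
    # Relative frequency of a link, given co-occurrence in a sentence.
    key = tuple(sorted((a, b)))
    return linked[key] / cooccur[key] if cooccur[key] else 0.0
```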
Parameter Estimation • Estimating P(qi|D) • The document language model is smoothed with a Dirichlet prior: P(qi|D) = (c(qi; D) + μ P(qi|C)) / (|D| + μ), where c(qi; D) is the count of qi in D, |D| is the document length, P(qi|C) is the collection model, and the constant μ is the parameter of the Dirichlet distribution.
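A direct sketch of that estimator (μ = 2000 is a common default in the smoothing literature, not a value taken from these slides):

```python
def p_term_dirichlet(term, doc_counts, doc_len, collection_probs, mu=2000):
    """Dirichlet-smoothed unigram model:
    P(q|D) = (c(q; D) + mu * P(q|C)) / (|D| + mu).
    `collection_probs` is assumed to cover the whole vocabulary."""
    return (doc_counts.get(term, 0) + mu * collection_probs[term]) / (doc_len + mu)
```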
Parameter Estimation • Estimating R(qi, qj|D), the dependence between two linked terms: R(qi, qj|D) = P(qi, qj|D) / (P(qi|D) P(qj|D)).
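A sketch of the dependence factor as a pointwise mutual information style ratio; the smoothed joint estimate P(qi, qj|D) is left as a hypothetical helper, since the slide's estimation formula did not survive conversion:

```python
import math

def log_dependence(qi, qj, p_joint, p_term):
    """log R(qi, qj|D) = log P(qi, qj|D) - log P(qi|D) - log P(qj|D).
    `p_joint` and `p_term` are hypothetical smoothed estimators."""
    return math.log(p_joint(qi, qj)) - math.log(p_term(qi)) - math.log(p_term(qj))
```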
Experimental Setting • Words were stemmed and stop words were removed. • Queries are TREC topics 202 to 250, run on TREC disks 2 and 3.
The flow of the experiment • Parse the training data to estimate link probabilities for weight computation. • For each query, find the best linkage L by maximizing P(L|Q). • Count frequencies in the document collection to estimate P(L|D), P(qi|D), and R(qi, qj|D). • Combine the three estimates to rank documents.
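Putting the earlier sketches together, a hypothetical end-to-end ranking loop might look like this (parse_query and doc_model_for are assumed helpers standing in for the linkage parser and the per-document estimators):

```python
def rank(query_terms, documents, parse_query, doc_model_for):
    # Find the most probable linkage of the query once (argmax_L P(L|Q)),
    # then score every document with the dependence ranking formula.
    linkage = parse_query(query_terms)
    return sorted(documents,
                  key=lambda d: log_p_query_given_doc(query_terms, linkage,
                                                      doc_model_for(d)),
                  reverse=True)  # highest log P(Q|D) first
```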
Result: BM & UG • BM: binary independence retrieval model • UG: unigram language model approach • UG achieves performance similar to, or worse than, that of BM.
Result: DM • DM: dependency model • The improvement of DM over UG is statistically significant.
Result: BG • BG: bigram language model • BG is slightly worse than DM in five out of six TREC collections, but substantially outperforms UG in all collections.
Result: BT1 & BT2 • BT: bi-term language model
Conclusion • This paper introduces the linkage of a query as a hidden variable. • Each term is generated in turn, depending on other related terms, according to the linkage. • This approach covers several language modeling approaches as special cases. • The experiments show that this approach substantially outperforms the unigram model, the bigram model, and the classical probabilistic retrieval model.