
Minimum Rank Error Training for Language Modeling


Presentation Transcript


  1. Minimum Rank Error Training for Language Modeling Meng-Sung Wu Department of Computer Science and Information Engineering National Cheng Kung University, Tainan, TAIWAN

  2. Contents • Introduction • Language Model for Information Retrieval • Discriminative Language Model • Average Precision versus Classification Accuracy • Evaluation of IR Systems • Minimum Rank Error Training • Summary and Discussion

  3. Introduction • Language modeling: • Provides linguistic constraints on the text sequence W. • Typically based on statistical N-gram language models. • Speech recognition systems are usually evaluated by the word error rate. • Discriminative learning methods: • maximum mutual information (MMI) • minimum classification error (MCE) • Classification error rate, however, is not a suitable metric for measuring the ranking of retrieved documents.

  4. Language Model for Information Retrieval

  5. Standard Probabilistic IR • [Diagram: an information need is expressed as a query, which is matched against the documents d1, d2, …, dn of the document collection.]

  6. IR based on LM • [Diagram: each document d1, d2, …, dn of the collection is viewed as generating the query that expresses the information need.]

  7. Language Models • Mathematical model of text generation • Particularly important for speech recognition, information retrieval and machine translation • N-gram models are commonly used to estimate probabilities of words • Unigram, bigram and trigram • An N-gram model is equivalent to an (N-1)th order Markov model • Estimates must be smoothed by interpolating combinations of n-gram estimates (see the sketch below)
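A minimal Python sketch of such an interpolated (Jelinek-Mercer style) bigram estimate; the corpus, interpolation weights, and function names are illustrative, not taken from the slides:

```python
from collections import Counter

def train_bigram(corpus_tokens):
    # Collect unigram and bigram counts from a token sequence.
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    return unigrams, bigrams

def interp_bigram_prob(w_prev, w, unigrams, bigrams, vocab_size,
                       lam_bi=0.6, lam_uni=0.3, lam_unif=0.1):
    """Interpolate bigram, unigram and uniform estimates (weights are illustrative)."""
    total = sum(unigrams.values())
    p_bi = bigrams[(w_prev, w)] / unigrams[w_prev] if unigrams[w_prev] else 0.0
    p_uni = unigrams[w] / total if total else 0.0
    p_unif = 1.0 / vocab_size
    return lam_bi * p_bi + lam_uni * p_uni + lam_unif * p_unif

tokens = "the cat sat on the mat the cat ate".split()
uni, bi = train_bigram(tokens)
print(interp_bigram_prob("the", "cat", uni, bi, vocab_size=len(uni)))
```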

  8. Using Language Models in IR • Treat each document as the basis for a model (e.g., unigram sufficient statistics) • Rank document d based on P(d | q) • P(d | q) = P(q | d) x P(d) / P(q) • P(q) is the same for all documents, so ignore • P(d) [the prior] is often treated as the same for all d • But we could use criteria like authority, length, genre • P(q | d) is the probability of q given d’s model • Very general formal approach
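A hedged sketch of query-likelihood ranking under the assumptions of this slide (unigram document models, uniform prior P(d), Jelinek-Mercer smoothing against the collection); all names and the λ value are illustrative:

```python
import math
from collections import Counter

def score_query_likelihood(query_terms, doc_terms, collection_counts,
                           collection_len, lam=0.5):
    """log P(q | d) with Jelinek-Mercer smoothing; P(d) is assumed uniform."""
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    log_p = 0.0
    for t in query_terms:
        p_doc = doc_counts[t] / doc_len if doc_len else 0.0
        p_coll = collection_counts[t] / collection_len
        log_p += math.log(lam * p_doc + (1 - lam) * p_coll)
    return log_p

docs = {"d1": "the cat sat on the mat".split(),
        "d2": "dogs chase the cat in the park".split()}
coll = Counter(t for d in docs.values() for t in d)
coll_len = sum(coll.values())
query = "cat mat".split()
ranking = sorted(docs, reverse=True,
                 key=lambda d: score_query_likelihood(query, docs[d], coll, coll_len))
print(ranking)  # documents ordered by log P(q | d)
```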

  9. Using Language Models in IR • Principle 1: • Document D: Language model P(w|MD) • Query Q = sequence of words q1,q2,…,qn (uni-grams) • Matching: P(Q|MD) • Principle 2: • Document D: Language model P(w|MD) • Query Q: Language model P(w|MQ) • Matching: comparison between P(.|MD) and P(.|MQ) • Principle 3: • Translate D to Q

  10. Problems • Limitation to uni-grams: • No dependence between words • Problems with bi-grams: • Consider all adjacent word pairs (noise) • Cannot consider more distant dependencies • Word order is not always important for IR • Entirely data-driven, no external knowledge • e.g., programming ↔ computer • Direct comparison between D and Q • Despite smoothing, requires that D and Q contain identical words (except with a translation model) • Cannot deal with synonymy and polysemy

  11. Discriminative Language Model

  12. Minimum Classification Error • The advent of powerful computing devices and the success of statistical approaches • A renewed pursuit of more powerful methods to reduce the recognition error rate • Although MCE-based discriminative methods are rooted in classical Bayes decision theory, instead of reducing the classification task to a distribution estimation problem they take a discriminant-function-based statistical pattern classification approach • For a given family of discriminant functions, optimal classifier/recognizer design involves finding a set of parameters that minimizes the empirical pattern recognition error rate

  13. Minimum Classification Error LM • MCE classifier design is based on three steps: • Discriminant function • Misclassification measure: compares the score of the target hypothesis with the scores of the competing hypotheses • Loss function • Expected loss over the training data
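The equations themselves appeared as images on the slide; the standard MCE formulation these steps refer to (following Juang and Katagiri, 1992, cited in the references) can be sketched as:

```latex
% Discriminant function for hypothesis W with model parameters \Lambda:
g_W(X;\Lambda) = \log P_\Lambda(X, W)

% Misclassification measure: target hypothesis W_0 vs. K competing hypotheses W_k:
d(X;\Lambda) = -g_{W_0}(X;\Lambda)
  + \log\Big[\tfrac{1}{K}\sum_{k \neq 0} \exp\big(\eta\, g_{W_k}(X;\Lambda)\big)\Big]^{1/\eta}

% Loss function: smoothed 0-1 loss via a sigmoid:
\ell(X;\Lambda) = \frac{1}{1 + \exp\big(-\gamma\, d(X;\Lambda) + \theta\big)}

% Expected loss, minimized over the N training samples:
L(\Lambda) = \frac{1}{N}\sum_{n=1}^{N} \ell(X_n;\Lambda)
```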

  14. The MCE approach has several advantages in classifier design: • It is meaningful in the sense that it directly minimizes the empirical recognition error rate of the classifier • If the true class posterior distributions are used as discriminant functions, the asymptotic behavior of the classifier approximates the minimum Bayes risk

  15. Average Precision versus Classification Accuracy

  16. Example • Two ranked lists of 10 retrieved documents, 5 of which are relevant, with the same classification accuracy (50.0%) but different average precision:
  System 1 (AvgPrec = 62.2%): Recall 0.2 0.2 0.4 0.4 0.4 0.6 0.6 0.6 0.8 1.0; Precision 1.0 0.5 0.67 0.5 0.4 0.5 0.43 0.38 0.44 0.5
  System 2 (AvgPrec = 52.0%): Recall 0.0 0.2 0.2 0.2 0.4 0.6 0.8 1.0 1.0 1.0; Precision 0.0 0.5 0.33 0.25 0.4 0.5 0.57 0.63 0.55 0.5

  17. Evaluation of IR Systems

  18. Measures of Retrieval Effectiveness • Precision and Recall • Single-valued P/R measure • Significance tests

  19. Precision and Recall • Precision • Proportion of a retrieved set that is relevant • Precision = |relevant ∩ retrieved| / | retrieved | = P(relevant | retrieved) • Recall • Proportion of all relevant documents in the collection included in the retrieved set • Recall = |relevant ∩ retrieved| / | relevant | = P(retrieved | relevant) • Precision and recall are well-defined for sets

  20. Average Precision • Often want a single-number effectiveness measure • E.g., so a machine-learning algorithm can detect improvement • Average precision is widely used in IR • It averages precision at the ranks of relevant documents, i.e., at the points where recall increases • Example (5 relevant documents; see the sketch below):
  System 1 (AvgPrec = 62.2%): Recall 0.2 0.2 0.4 0.4 0.4 0.6 0.6 0.6 0.8 1.0; Precision 1.0 0.5 0.67 0.5 0.4 0.5 0.43 0.38 0.44 0.5
  System 2 (AvgPrec = 52.0%): Recall 0.0 0.2 0.2 0.2 0.4 0.6 0.8 1.0 1.0 1.0; Precision 0.0 0.5 0.33 0.25 0.4 0.5 0.57 0.63 0.55 0.5
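A minimal Python sketch of this calculation; the 0/1 relevance patterns below are reconstructed from the recall/precision rows above:

```python
def average_precision(relevance, num_relevant):
    """Average of the precision values at the ranks of relevant documents."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank      # precision at this rank
    return total / num_relevant

# 1 = relevant, 0 = non-relevant; 5 relevant documents in total.
system_1 = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]
system_2 = [0, 1, 0, 0, 1, 1, 1, 1, 0, 0]
print(round(average_precision(system_1, 5), 3))  # ~0.622
print(round(average_precision(system_2, 5), 3))  # ~0.519 (the slide rounds to 52.0%)
```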

  21. Trec-eval demo • Queryid (Num): 225 • Total number of documents over all queries • Retrieved: 179550 • Relevant: 1838 • Rel_ret: 1110 • Interpolated Recall - Precision Averages: • at 0.00 0.6139 • at 0.10 0.5743 • at 0.20 0.4437 • at 0.30 0.3577 • at 0.40 0.2952 • at 0.50 0.2603 • at 0.60 0.2037 • at 0.70 0.1374 • at 0.80 0.1083 • at 0.90 0.0722 • at 1.00 0.0674 • Average precision (non-interpolated) for all rel docs(averaged over queries) • 0.2680 • Precision: • At 5 docs: 0.3173 • At 10 docs: 0.2089 • At 15 docs: 0.1564 • At 20 docs: 0.1262 • At 30 docs: 0.0948 • At 100 docs: 0.0373 • At 200 docs: 0.0210 • At 500 docs: 0.0095 • At 1000 docs: 0.0049 • R-Precision (precision after R (= num_rel for a query) docs retrieved): • Exact: 0.2734

  22. Significance tests • System A beats system B on one query • Is it just a lucky query for system A? • Maybe system B does better on some other query • Need as many queries as possible • Empirical research suggests 25 queries is the minimum needed • TREC tracks generally aim for at least 50 queries • Systems A and B may be identical on all but one query • If system A beats system B by enough on that one query, the average will make A look better than B.

  23. Sign Test Example • For methods A and B, compare the average precision of each pair of results generated by the queries in the test collection. • If the difference is large enough, count it as + or −; otherwise ignore the query. • Use the number of +'s and −'s (the significant differences) to determine the significance level. • E.g., for 40 queries, method A produced a better result than B 12 times, B was better than A 3 times, and 25 were the “same”: p < 0.035, so method A is significantly better than B. • If A > B 18 times and B > A 9 times, p < 0.1222, so A is not significantly better than B at the 5% level.
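A quick check of these numbers with SciPy's exact binomial test (a sketch; assumes SciPy ≥ 1.7, where binomtest is available, and ignores the tied queries as the slide does):

```python
from scipy.stats import binomtest

# Two-sided sign test on the counts from the slide.
# 12 "+" vs. 3 "-" out of 15 non-tied queries:
print(binomtest(12, n=12 + 3, p=0.5).pvalue)   # ~0.035 -> significant at 5%
# 18 "+" vs. 9 "-" out of 27 non-tied queries:
print(binomtest(18, n=18 + 9, p=0.5).pvalue)   # ~0.122 -> not significant at 5%
```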

  24. Wilcoxon Test • Compute differences • Rank differences by absolute value • Sum separately + ranks and – ranks • Two tailed test • T= min (+ ranks, -ranks) • Reject null hypothesis if T < T0, where T0 is found in a table

  25. Wilcoxon Test Example • + ranks = 44 • - ranks = 11 • T= 11 • T0 = 8 (from table) • Conclusion : not significant
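A Python sketch of the three steps from the previous slide; the per-query AP differences below are hypothetical, not the data behind the slide's + ranks = 44 / − ranks = 11:

```python
import numpy as np
from scipy.stats import rankdata

def wilcoxon_T(diffs):
    """T = min(sum of + ranks, sum of - ranks), ranking |differences| (zeros dropped)."""
    d = np.asarray([x for x in diffs if x != 0], dtype=float)
    ranks = rankdata(np.abs(d))           # average ranks for ties
    pos = ranks[d > 0].sum()
    neg = ranks[d < 0].sum()
    return pos, neg, min(pos, neg)

# Hypothetical differences (method A minus method B) over 10 queries.
diffs = [0.12, 0.05, -0.02, 0.08, 0.03, -0.01, 0.07, 0.04, 0.09, -0.06]
pos, neg, T = wilcoxon_T(diffs)
print(pos, neg, T)  # compare T against the critical value T0 from a Wilcoxon table
```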

  26. Minimum Rank Error Training

  27. Document ranking principle • A ranking algorithm aims at estimating a scoring function f. • The problem can be described as follows: • Two disjoint sets SR (relevant documents) and SI (irrelevant documents) • A ranking function f assigns a score value to each document d of the document collection. • f(di) > f(dj) denotes that di is ranked higher than dj. • The objective function is defined over the pairs (di, dj) with di in SR and dj in SI.

  28. Document ranking principle • There are different ways to measure the ranking error of a scoring function f. • A natural criterion is the proportion of misordered pairs over the total number of pairs. • This criterion is an estimate of the probability of misordering a pair.
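A minimal Python sketch of this criterion (the scores below are hypothetical):

```python
def pairwise_ranking_error(scores_relevant, scores_irrelevant):
    """Proportion of (relevant, irrelevant) pairs in which the irrelevant document
    scores at least as high as the relevant one (misordered pairs / total pairs)."""
    misordered = sum(1 for r in scores_relevant
                       for i in scores_irrelevant if i >= r)
    total = len(scores_relevant) * len(scores_irrelevant)
    return misordered / total

# Hypothetical scores f(d) for documents in SR and SI.
print(pairwise_ranking_error([2.3, 1.7, 0.9], [1.2, 0.5]))  # 1 misordered pair of 6 -> ~0.17
```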

  29. Document ranking principle • Total distance measure is defined as
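The equation on this slide was an image; one common way to write such a total pairwise distance, consistent with the misordered-pair criterion of the previous slide, is:

```latex
% Total number of misordered (relevant, irrelevant) pairs under scoring function f:
D(f) \;=\; \sum_{d_i \in S_R} \sum_{d_j \in S_I} \mathbf{1}\big[f(d_j) \ge f(d_i)\big]
% Dividing by |S_R|\,|S_I| recovers the misordering proportion of the previous slide.
```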

  30. Illustration of the metric of average precision

  31. Intuition and Theory • Precision at a given rank is the ratio of relevant documents retrieved to documents retrieved up to that rank. • Average precision is the average of the precision values at the ranks of the relevant documents, where r is the number of returned documents and sk is the (binary) relevance of document k.
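The formula itself was an image on the slide; with the definitions above, and writing R for the total number of relevant documents (a symbol added here), the standard form it corresponds to is:

```latex
% Precision at rank k and average precision over the ranked list of r returned documents,
% with s_k \in \{0,1\} the relevance of the document at rank k.
\mathrm{Prec}(k) = \frac{1}{k}\sum_{j=1}^{k} s_j,
\qquad
\mathrm{AvgPrec} = \frac{1}{R}\sum_{k=1}^{r} s_k \,\mathrm{Prec}(k)
```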

  32. Discriminative ranking algorithms • Maximizing the average precision is tightly related to minimizing the following ranking error loss
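The loss on the slide was an image; a hedged reconstruction, using the fact that the precision at the i-th relevant document equals i/(i + n_i), where n_i counts the irrelevant documents ranked above it, is:

```latex
% Ranking error loss as one minus average precision (R relevant documents in total):
L_{\mathrm{AP}} \;=\; 1 - \mathrm{AvgPrec}
          \;=\; 1 - \frac{1}{R}\sum_{i=1}^{R} \frac{i}{\,i + n_i\,}
```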

  33. Discriminative ranking algorithms • Similar to the MCE algorithm, the ranking loss function LAP is expressed as a differentiable objective. • The error count nir is approximated by a differentiable loss function defined as follows.
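A plausible form of this approximation, assuming nir denotes the 0/1 indicator that irrelevant document dr outscores relevant document di (so that n_i sums the n_ir), smoothed with a sigmoid as in MCE training:

```latex
% Sigmoid smoothing of the pairwise error indicator (gamma controls the slope):
n_{ir} \;\approx\; \frac{1}{1 + \exp\!\big(-\gamma\,[\,f(d_r) - f(d_i)\,]\big)},
\qquad n_i \;\approx\; \sum_{d_r \in S_I} n_{ir}
```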

  34. Discriminative ranking algorithms The differentiation of the ranking loss function turns out to be

  35. Discriminative ranking algorithms • We use a bigram language model as an example. • Using the steepest-descent algorithm, the parameters of the language model are adjusted iteratively.
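A toy Python sketch of such a steepest-descent update; the feature-sum scoring, learning rate, and gamma below are illustrative simplifications, not the exact parameterization of the slides:

```python
import math

def sigmoid(x, gamma=1.0):
    return 1.0 / (1.0 + math.exp(-gamma * x))

def mre_style_update(weights, rel_docs, irrel_docs, lr=0.1, gamma=1.0):
    """One steepest-descent step on a sigmoid-smoothed pairwise ranking loss.
    weights: dict mapping bigram -> weight; a document's score is the sum of the
    weights of the query bigrams it contains (a deliberately simplified model)."""
    def score(feats):
        return sum(weights.get(b, 0.0) for b in feats)

    for rel in rel_docs:                        # relevant documents
        for irr in irrel_docs:                  # irrelevant documents
            p = sigmoid(score(irr) - score(rel), gamma)   # smoothed error n_ir
            g = gamma * p * (1.0 - p)                     # gradient of the sigmoid
            for b in irr:                       # lower weights that raise the loss
                weights[b] = weights.get(b, 0.0) - lr * g
            for b in rel:                       # raise weights that lower the loss
                weights[b] = weights.get(b, 0.0) + lr * g
    return weights

w = {("language", "model"): 0.2, ("rank", "error"): 0.1}
rel_docs = [[("language", "model"), ("rank", "error")]]
irrel_docs = [[("language", "model")]]
print(mre_style_update(w, rel_docs, irrel_docs))
```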

  36. Experiments

  37. Experimental Setup • We evaluated our model with two different TREC collections: • Wall Street Journal 1987 (WSJ87) • Associated Press Newswire 1988 (AP88)

  38. Language Modeling • We used the WSJ87 dataset as training data for language model estimation; the AP88 dataset is used as the test data. • During the MRE training procedure, the parameters are set as follows. • Comparison of perplexity

  39. Experiments on Information Retrieval • Two query sets and their corresponding relevant documents are used with this collection: • TREC topics 51-100 as training queries • TREC topics 101-150 as test queries • Queries were sampled from the ‘title’ and ‘description’ fields of the topics. • The maximum-likelihood (ML) language model is used as the baseline system. • To test the significance of the improvement, the Wilcoxon test was employed in the evaluation.

  40. Comparison of Average Precision

  41. Comparison of Precision in Document Level

  42. Summary

  43. • Learning to rank requires considering non-relevance information. • We will extend this method to spoken document retrieval. • Future work will focus on the area under the ROC curve (AUC).

  44. References • M. Collins, “Discriminative reranking for natural language parsing”, in Proc. 17th International Conference on Machine Learning, pp. 175-182, 2000. • J. Gao, H. Qi, X. Xia, J.-Y. Nie, “Linear discriminant model for information retrieval”, in Proc. ACM SIGIR, pp.290-297, 2005. • D. Hull, “Using statistical testing in the evaluation of retrieval experiments”, in Proc ACM SIGIR, pp. 329-338, 1993. • B. H. Juang, W. Chou, and C.-H. Lee, “Minimum classification error rate methods for speech recognition”, IEEE Trans. Speech and Audio Processing, pp. 257-265, 1997. • B.-H. Juang and S. Katagiri, “Discriminative learning for minimum error classification”, IEEE Trans. Signal Processing, vol. 40, no. 12, pp. 3043-3054, 1992. • H.-K. J. Kuo, E. Fosler-Lussier, H. Jiang, and C.-H. Lee, “Discriminative training of language models for speech recognition”, in Proc. ICASSP, pp. 325-328, 2002. • R. Nallapati, “Discriminative models for information retrieval”, in Proc. ACM SIGIR, pp. 64-71, 2004. • J. M. Ponte and W. B. Croft, “A language modeling approach to information retrieval”, in Proc. ACM SIGIR, pp.275-281, 1998. • J.-N. Vittaut and P. Gallinari, “Machine learning ranking for structured information retrieval”, in Proc. 28th European Conference on IR Research, pp.338-349, 2006.

  45. Thank You for Your Attention
