Robust Pseudo Feedback & HMM Passage Extraction UIUC at TREC 2006 Genomics Track

Robust Pseudo Feedback & HMM Passage Extraction UIUC at TREC 2006 Genomics Track. Jing Jiang, Xin He, ChengXiang Zhai University of Illinois at Urbana-Champaign. Goal of Participation. To test the effectiveness of some recent language modeling methods for genomics retrieval

Presentation Transcript


  1. Robust Pseudo Feedback & HMM Passage Extraction: UIUC at TREC 2006 Genomics Track. Jing Jiang, Xin He, ChengXiang Zhai, University of Illinois at Urbana-Champaign

  2. Goal of Participation • To test the effectiveness of some recent language modeling methods for genomics retrieval • Robust pseudo feedback [Tao & Zhai 06] • HMM passage extraction [Jiang & Zhai 06] • Task at 2006 genomics track • Document-level retrieval • Passage-level retrieval • Aspect-level retrieval

  3. Overall Approach [Diagram components: Medline articles, paragraphs, query Q, Document Retrieval Module, ranked paragraphs 1 … k, pseudo relevance feedback, user relevance feedback, Passage Extraction Module, ranked passages 1 … k]

  4. Goal of Participation • To test the effectiveness of some recent language modeling methods for genomics retrieval • Robust pseudo feedback [Tao & Zhai 06] • HMM passage extraction [Jiang & Zhai 06]

  5. KL-Divergence Retrieval Model [Lafferty & Zhai 01] [Figure: a document language model (the 0.020, for 0.015, prp 0.102, mad 0.034, cow 0.034, diseas 0.068, …) is estimated from document text and compared with the topic model (role 0.2, prnp 0.2, mad 0.2, cow 0.2, diseas 0.2). Example documents: D1 "The…for… spongiform…PrP protein…", D2 "Prion diseases… that…(PrP C)…This…", …, Dk "…which…(PrP C)…to the…prion protein…".]
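
The comparison on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' system: it ranks documents by the cross entropy between the topic model and a smoothed document model, which is rank-equivalent to negative KL divergence for a fixed query. Dirichlet prior smoothing, the value of `mu`, the `1e-6` floor for unseen words, and all function names are assumptions made for the example.

```python
import math
from collections import Counter


def collection_model(docs):
    """Maximum-likelihood unigram model over all documents (toy background model)."""
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def kl_score(topic_model, doc_tokens, background, mu=10.0):
    """Sum over query words of p(w|topic) * log p(w|doc).

    p(w|doc) uses Dirichlet prior smoothing (an assumption of this sketch).
    Higher is better; the score differs from -KL(topic || doc) only by the
    entropy of the topic model, which is constant for a given query.
    """
    counts = Counter(doc_tokens)
    length = len(doc_tokens)
    score = 0.0
    for w, p_topic in topic_model.items():
        if p_topic <= 0:
            continue
        # 1e-6 is an arbitrary floor for words missing from the background.
        p_doc = (counts[w] + mu * background.get(w, 1e-6)) / (length + mu)
        score += p_topic * math.log(p_doc)
    return score


# Topic model copied from the slide; the two documents are made up.
topic = {"role": 0.2, "prnp": 0.2, "mad": 0.2, "cow": 0.2, "diseas": 0.2}
d1 = "mad cow diseas prnp role prion".split()
d2 = "the for protein structur the of".split()
background = collection_model([d1, d2])
ranked = sorted([("d1", d1), ("d2", d2)],
                key=lambda item: kl_score(topic, item[1], background),
                reverse=True)
```

Here `ranked` lists the documents in decreasing score order; with the toy data the document that actually contains the topic words comes first.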

  7. Model-Based Feedback [Zhai & Lafferty 01] [Figure: feedback documents D1, D2, …, Dk; a background model (the 0.02, for 0.01, …, prp 0.003, prion 0.004); the topic model (role 0.2, prnp 0.2, mad 0.2, cow 0.2, diseas 0.2); and a feedback model whose probabilities are still unknown (the ?, for ?, …, prp ?, prion ?, …).]

  8. Model-Based Feedback [Zhai & Lafferty 01] [Figure: same layout as the previous slide; the EM algorithm fills in the feedback model (the 0.003, for 0.002, …, prp 0.02, prion 0.05, …).]

  9. Model-Based Feedback [Zhai & Lafferty 01] [Figure: same layout, with the estimated feedback model (the 0.003, for 0.002, …, prp 0.02, prion 0.05, …).] 2 parameters: α and λ
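
A rough sketch of the estimation these slides depict, assuming the usual two-component mixture: each word in the feedback documents is drawn from the background model with probability λ and from the unknown feedback model otherwise, EM recovers the feedback model, and α interpolates it with the original topic model. The function names, the iteration count, the `1e-6` floor, and the toy documents are assumptions for illustration; the background probabilities are the ones shown on the slides.

```python
from collections import Counter


def estimate_feedback_model(feedback_docs, background, lam=0.5, iters=30):
    """EM for a mixture of a fixed background model and a feedback model."""
    counts = Counter(w for doc in feedback_docs for w in doc)
    total = sum(counts.values())
    # Initialize the feedback model with relative frequencies.
    theta = {w: c / total for w, c in counts.items()}
    for _ in range(iters):
        # E-step: probability that an occurrence of w came from the feedback model.
        t = {}
        for w in counts:
            from_topic = (1 - lam) * theta[w]
            from_bg = lam * background.get(w, 1e-6)  # floor is an assumption
            t[w] = from_topic / (from_topic + from_bg)
        # M-step: re-estimate the feedback model from the expected counts.
        norm = sum(counts[w] * t[w] for w in counts)
        theta = {w: counts[w] * t[w] / norm for w in counts}
    return theta


def interpolate(topic_model, feedback_model, alpha=0.5):
    """Updated query model: (1 - alpha) * topic + alpha * feedback."""
    vocab = set(topic_model) | set(feedback_model)
    return {w: (1 - alpha) * topic_model.get(w, 0.0)
               + alpha * feedback_model.get(w, 0.0) for w in vocab}


# Background probabilities as shown on the slides; documents are made up.
background = {"the": 0.02, "for": 0.01, "prp": 0.003, "prion": 0.004}
docs = [["the", "prp", "prion", "the", "prp"], ["for", "prion", "prp", "the"]]
topic = {"role": 0.2, "prnp": 0.2, "mad": 0.2, "cow": 0.2, "diseas": 0.2}

feedback = estimate_feedback_model(docs, background, lam=0.5)
updated_query = interpolate(topic, feedback, alpha=0.5)
```

In this toy run "the" and "prp" occur equally often in the feedback documents, but "prp" ends up with more feedback-model mass because the background model explains "the" better. Both λ and α have to be chosen by hand here, which is the sensitivity the slide points out.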

  10. Regularized Estimation [Tao & Zhai 06] [Figure: feedback documents D1, D2, …, Dk; a background model (the 0.02, for 0.01, …, prp 0.003, prion 0.004); the topic model (role 0.2, prnp 0.2, mad 0.2, cow 0.2, diseas 0.2); and a feedback model whose probabilities are still unknown (the ?, for ?, …, prp ?, prion ?, …).]

  11. Regularized Estimation [Tao & Zhai 06] [Figure: same layout as the previous slide; the topic model serves as a prior, and the regularized EM algorithm fills in the feedback model (the 0.003, for 0.002, …, prp 0.02, prion 0.05, …).]

  12. Regularized Estimation [Tao & Zhai 06] [Figure: same layout, with the prior and the estimated feedback model (the 0.003, for 0.002, …, prp 0.02, prion 0.05, …).] 1 parameter: η

  13. Original vs. Regularized EM [Figure: feedback documents D1, D2, …, Dk under both methods. Original: α manually set. Regularized: α dynamically set.]

  14. Goal of Participation • To test the effectiveness of some recent language modeling methods for genomics retrieval • Robust pseudo feedback [Tao & Zhai 06] • HMM passage extraction [Jiang & Zhai 06]

  15. HMM Passage Extraction [Jiang & Zhai 06] [Figure: a paragraph is a word sequence w w … w labeled with the state sequence B … B R … R B … B; the words labeled R form the relevant passage. HMM states: B1, R, B2. Output probabilities: p(w|B1): the 0.02, for 0.01, prp 0.001, …; p(w|R): the 0.003, for 0.002, prp 0.02, …; p(w|B2): the 0.02, for 0.01, prp 0.001, …. Transition probabilities: p(B1|B1) = 0.9, p(R|B1) = 0.1, p(R|R) = 0.95, p(B2|R) = 0.05, p(B2|B2) = 1.]
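
The three-state HMM on this slide can be decoded with the standard Viterbi algorithm. The sketch below uses the transition and output probabilities shown on the slide; starting in B1, the `UNSEEN` probability for words not listed, and the function names are assumptions for illustration, and the output tables are toy values rather than proper distributions.

```python
import math

STATES = ["B1", "R", "B2"]
# Transition probabilities from the slide; missing pairs are impossible.
TRANS = {("B1", "B1"): 0.9, ("B1", "R"): 0.1,
         ("R", "R"): 0.95, ("R", "B2"): 0.05,
         ("B2", "B2"): 1.0}
# Output probabilities from the slide (only three words are listed there).
EMIT = {"B1": {"the": 0.02, "for": 0.01, "prp": 0.001},
        "R": {"the": 0.003, "for": 0.002, "prp": 0.02},
        "B2": {"the": 0.02, "for": 0.01, "prp": 0.001}}
UNSEEN = 1e-4  # assumed probability for any word not listed above


def _log(p):
    return math.log(p) if p > 0 else float("-inf")


def viterbi(words):
    """Most likely state sequence, assuming the paragraph starts in B1."""
    if not words:
        return []
    scores = {s: float("-inf") for s in STATES}
    scores["B1"] = _log(EMIT["B1"].get(words[0], UNSEEN))
    backpointers = []
    for w in words[1:]:
        new_scores, ptr = {}, {}
        for s in STATES:
            prev = max(STATES,
                       key=lambda p: scores[p] + _log(TRANS.get((p, s), 0.0)))
            new_scores[s] = (scores[prev] + _log(TRANS.get((prev, s), 0.0))
                             + _log(EMIT[s].get(w, UNSEEN)))
            ptr[s] = prev
        scores = new_scores
        backpointers.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    return path[::-1]


def extract_passage(words):
    """Words labeled with the relevant state R."""
    return [w for w, s in zip(words, viterbi(words)) if s == "R"]


paragraph = "the for prp prp prp the for".split()
labels = viterbi(paragraph)
passage = extract_passage(paragraph)
```

Because R can only be entered from B1 and left to B2, the decoded R words always form one contiguous span, which is what makes the output a single passage.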

  16. HMM Passage Extraction [Jiang & Zhai 06] [Figure: extended HMM with states B1, R, B2, B3 and E. Annotations: transition probabilities estimated from observations; an end-of-paragraph state; a background state for smoothing.]

  17. Experiment Design • Pre-processing • HTML parsing • paragraph boundaries • Tokenization • User relevance feedback

  18. Official Runs [Diagram components: Medline articles, paragraphs, query Q, KL-Div Retrieval, ranked paragraphs 1 … k, Q', HMM Passage Extraction, ranked passages 1 … k]

  19. UIUCauto [Diagram: same components as the Official Runs slide, with Q' obtained by regularized estimation]

  20. UIUCinter [Diagram: same components as the Official Runs slide, with Q' obtained by regularized estimation]

  21. UIUCinter2 [Diagram: same components as the Official Runs slide, plus F, with Q' obtained by original estimation]

  22. Pseudo Relevance Feedback (k = 10) η is similar to λ / (1 − λ)

  25. Parameter Sensitivity (pseudo feedback, k = 10)

  26. User Relevance Feedback

  29. HMM Passage Extraction

  30. Passage Length (In Bytes) HMM passages are generally too long!

  31. Example Passage Prion diseases, which include Creutzfeldt-Jacob disease in humans, mad cow disease in cattle, and scrapie in sheep, involve the misfolding of the benign cellular prion protein (PrP C) 1 to the infectious disease-causing scrapie isoform PrP Sc. The prion protein (PrP C) is a copper-binding cell surface glycoprotein. The role of copper in the normal function of PrP, as well as in prion diseases, has been the subject of a number of excellent reviews. The mature cellular form of PrP consists of residues 23 to 231 and is tethered to the cell surface via a glycosylphosphatidylinositol anchor at the C terminus. There are now a number of NMR solution structures of copper-free mammalian PrPs. A crystal structure of PrP C has also been published; this structure is dimeric involving domain swapping of the monomeric form.

  33. Conclusions and Future Work • The two language modeling methods generally work well in the genomics domain • Regularized feedback estimation can effectively eliminate parameter α • HMM passages improve over paragraphs • User relevance feedback is effective • Limitations and future work • Regularized feedback estimation still has parameter η to tune • How to eliminate η? • The inherent coherence property of HMM passages may not suit the task well • Different/better HMM architecture?

  34. The End • Questions?
