Robust Pseudo Feedback & HMM Passage Extraction: UIUC at TREC 2006 Genomics Track
Jing Jiang, Xin He, ChengXiang Zhai
University of Illinois at Urbana-Champaign
Goal of Participation
• To test the effectiveness of some recent language modeling methods for genomics retrieval
  • Robust pseudo feedback [Tao & Zhai 06]
  • HMM passage extraction [Jiang & Zhai 06]
• Tasks at the 2006 genomics track
  • Document-level retrieval
  • Passage-level retrieval
  • Aspect-level retrieval
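Overall Approach
[Figure: pipeline diagram. Query Q and Medline articles, split into paragraphs, go into the Document Retrieval Module, which outputs ranked paragraphs 1 … k. The Passage Extraction Module turns these into ranked passages 1 … k. Pseudo relevance feedback and user relevance feedback are both drawn from the ranked paragraphs.]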
KL-Divergence Retrieval Model [Lafferty & Zhai 01]
[Figure: a topic language model (role 0.2, prnp 0.2, mad 0.2, cow 0.2, diseas 0.2) is compared with the language model of each document D1 … Dk (for example: the 0.020, for 0.015, prp 0.102, mad 0.034, cow 0.034, diseas 0.068, …). Sample document text: "Prion diseases… that…(PrP C)…This…".]
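The scoring idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the system used in the official runs: the Dirichlet smoothing, the value of `mu`, the probability floor, and the toy collection model and documents are all assumptions introduced here. Only the topic model values come from the slide.

```python
import math
from collections import Counter

def smoothed_doc_model(tokens, collection, mu=1000.0, floor=1e-9):
    """Return p(w|D) with Dirichlet smoothing (mu and floor are assumed values)."""
    counts = Counter(tokens)
    length = len(tokens)
    def prob(word):
        return (counts[word] + mu * collection.get(word, floor)) / (length + mu)
    return prob

def neg_kl_score(topic_model, doc_prob):
    """Negative KL divergence from the topic model to the document model.

    Higher (closer to zero) means the document model is nearer the topic model.
    """
    return sum(q * math.log(doc_prob(w) / q) for w, q in topic_model.items() if q > 0)

# Topic model from the slide; everything below it is toy data for illustration.
topic = {"role": 0.2, "prnp": 0.2, "mad": 0.2, "cow": 0.2, "diseas": 0.2}
collection = {"the": 0.05, "for": 0.02, "prion": 0.001, "prnp": 0.0005, "mad": 0.001,
              "cow": 0.001, "diseas": 0.002, "role": 0.002, "protein": 0.003, "cell": 0.003}
doc_on_topic = "the role for prnp mad cow diseas prion protein".split()
doc_off_topic = "the cell for the protein cell".split()

score_on = neg_kl_score(topic, smoothed_doc_model(doc_on_topic, collection))
score_off = neg_kl_score(topic, smoothed_doc_model(doc_off_topic, collection))
```

Ranking documents by `neg_kl_score` in descending order puts `doc_on_topic` ahead of `doc_off_topic`, because it assigns more probability to the topic words.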
Model-Based Feedback [Zhai & Lafferty 01]
[Figure: a feedback model is estimated with the EM algorithm from the feedback documents D1 … Dk and a background model (the 0.02, for 0.01, …, prp 0.003, prion 0.004). Before estimation its probabilities are unknown (the ?, for ?, …, prp ?, prion ?); after estimation they are: the 0.003, for 0.002, …, prp 0.02, prion 0.05. The feedback model is shown next to the topic model (role 0.2, prnp 0.2, mad 0.2, cow 0.2, diseas 0.2).]
• 2 parameters: α and λ
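A minimal Python sketch of the estimation step, assuming the usual two-component mixture reading of the slide: each word in the feedback documents comes from the background model with probability λ or from the feedback model with probability 1 − λ, and the estimated feedback model is then interpolated with the topic model using weight α. The function names, the toy documents, the toy background model, and the default values of λ and α are illustrative assumptions, not the settings of the official runs.

```python
from collections import Counter

def estimate_feedback_model(feedback_docs, background, lam=0.5, iters=50, floor=1e-9):
    """EM for p(w|feedback) under a fixed background model and fixed noise weight lam."""
    counts = Counter(w for doc in feedback_docs for w in doc)
    vocab = list(counts)
    theta = {w: 1.0 / len(vocab) for w in vocab}  # uniform start
    for _ in range(iters):
        expected = {}
        for w in vocab:
            topic_part = (1.0 - lam) * theta[w]
            noise_part = lam * background.get(w, floor)
            # E-step: share of this word's count attributed to the feedback model
            expected[w] = counts[w] * topic_part / (topic_part + noise_part)
        total = sum(expected.values())
        # M-step: renormalize the expected counts
        theta = {w: e / total for w, e in expected.items()}
    return theta

def interpolate(topic_model, feedback_model, alpha=0.5):
    """Updated topic model: (1 - alpha) * topic + alpha * feedback."""
    words = set(topic_model) | set(feedback_model)
    return {w: (1.0 - alpha) * topic_model.get(w, 0.0) + alpha * feedback_model.get(w, 0.0)
            for w in words}

# Toy data (assumed): common words dominate the background model.
docs = [["the", "prion", "prp", "the", "for"], ["prion", "prp", "the", "protein"]]
background = {"the": 0.6, "for": 0.3, "prion": 0.004, "prp": 0.003, "protein": 0.05}
topic = {"role": 0.2, "prnp": 0.2, "mad": 0.2, "cow": 0.2, "diseas": 0.2}

feedback = estimate_feedback_model(docs, background)
updated_topic = interpolate(topic, feedback)
```

In this toy run the background model absorbs most of the mass of "the" and "for", so the feedback model concentrates on "prion" and "prp". Both λ and α have to be chosen by hand here, which is the two-parameter burden the slide points out.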
Regularized Estimation [Tao & Zhai 06]
[Figure: same layout as model-based feedback, with a background model (the 0.02, for 0.01, …, prp 0.003, prion 0.004), feedback documents D1 … Dk, and the topic model (role 0.2, prnp 0.2, mad 0.2, cow 0.2, diseas 0.2). A prior is introduced, and a regularized EM algorithm estimates the feedback model (the 0.003, for 0.002, …, prp 0.02, prion 0.05).]
• 1 parameter: η
Original vs. Regularized EM
[Figure: two panels over feedback documents D1 … Dk.]
• Original: a single α, manually set
• Regularized: one α per document (α_D1, α_D2, …, α_Dk), dynamically set
HMM Passage Extraction [Jiang & Zhai 06]
[Figure: the words w … w of a paragraph are labeled with the state sequence B B … B R R R … R R R B … B; the span labeled R is the relevant passage. The HMM has three states: B1, R, B2.]
• Output probabilities
  • p(w|B1): the 0.02, for 0.01, prp 0.001, …
  • p(w|R): the 0.003, for 0.002, prp 0.02, …
  • p(w|B2): the 0.02, for 0.01, prp 0.001, …
• Transition probabilities
  • p(B1|B1) = 0.9, p(R|B1) = 0.1
  • p(R|R) = 0.95, p(B2|R) = 0.05
  • p(B2|B2) = 1
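The three-state HMM on this slide can be decoded with the Viterbi algorithm to find the relevant span. The sketch below uses the transition and output probabilities shown on the slide. The probability floor for words not listed on the slide, the choice to always start in B1, and the toy word sequence are assumptions made for illustration.

```python
import math

STATES = ["B1", "R", "B2"]
# Transition probabilities from the slide; missing entries mean probability 0.
TRANS = {"B1": {"B1": 0.9, "R": 0.1}, "R": {"R": 0.95, "B2": 0.05}, "B2": {"B2": 1.0}}
BACKGROUND = {"the": 0.02, "for": 0.01, "prp": 0.001}
RELEVANT = {"the": 0.003, "for": 0.002, "prp": 0.02}
EMIT = {"B1": BACKGROUND, "R": RELEVANT, "B2": BACKGROUND}
FLOOR = 1e-4  # assumed probability for words not listed on the slide
NEG_INF = float("-inf")

def log_emit(state, word):
    return math.log(EMIT[state].get(word, FLOOR))

def viterbi(words):
    """Most likely state sequence, assuming the paragraph starts in B1."""
    if not words:
        return []
    score = {s: NEG_INF for s in STATES}
    score["B1"] = log_emit("B1", words[0])
    backpointers = []
    for word in words[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            candidates = [(score[p] + math.log(TRANS[p][s]), p)
                          for p in STATES if s in TRANS[p] and score[p] > NEG_INF]
            if candidates:
                best, prev = max(candidates)
                new_score[s], pointers[s] = best + log_emit(s, word), prev
            else:
                new_score[s], pointers[s] = NEG_INF, None
        backpointers.append(pointers)
        score = new_score
    state = max(score, key=score.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

def extract_passage(words):
    """Words whose decoded state is R."""
    return [w for w, s in zip(words, viterbi(words)) if s == "R"]

words = ["the", "for", "the", "prp", "prp", "prp", "the", "for", "the"]
path = viterbi(words)
passage = extract_passage(words)
```

Because the only allowed moves are B1 to R to B2, the decoded R states always form one contiguous span, which is the coherence property the conclusions slide refers to.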
HMM Passage Extraction [Jiang & Zhai 06]
[Figure: extended HMM with states B1, R, B2, B3, and E.]
• Transition probabilities estimated from observations
• An end-of-paragraph state (E)
• A background state for smoothing
Experiment Design
• Pre-processing
  • HTML parsing
  • Paragraph boundaries
  • Tokenization
• User relevance feedback
Official Runs
[Figure: the same pipeline for every run. Query Q goes through KL-Div Retrieval over the paragraphs of Medline articles to produce ranked paragraphs 1 … k; feedback updates Q to Q'; HMM Passage Extraction then produces ranked passages 1 … k.]
• UIUCauto: regularized estimation
• UIUCinter: regularized estimation
• UIUCinter2: original estimation
Pseudo Relevance Feedback (k = 10)
• η is similar to λ / (1 − λ)
Passage Length (in Bytes)
• HMM passages are generally too long!
Example Passage
Prion diseases, which include Creutzfeldt-Jakob disease in humans, mad cow disease in cattle, and scrapie in sheep, involve the misfolding of the benign cellular prion protein (PrPC) to the infectious disease-causing scrapie isoform PrPSc. The prion protein (PrPC) is a copper-binding cell surface glycoprotein. The role of copper in the normal function of PrP, as well as in prion diseases, has been the subject of a number of excellent reviews. The mature cellular form of PrP consists of residues 23 to 231 and is tethered to the cell surface via a glycosylphosphatidylinositol anchor at the C terminus. There are now a number of NMR solution structures of copper-free mammalian PrPs. A crystal structure of PrPC has also been published; this structure is dimeric, involving domain swapping of the monomeric form.
Conclusions and Future Work
• The two language modeling methods generally work well in the genomics domain
  • Regularized feedback estimation can effectively eliminate parameter α
  • HMM passages improve over paragraphs
• User relevance feedback is effective
• Limitations and future work
  • Regularized feedback estimation still has parameter η to tune
    • How to eliminate η?
  • The inherent coherence property of HMM passages may not suit the task well
    • Different/better HMM architecture?
The End • Questions?