NLDB 2006
An Information Retrieval Approach based on Discourse Type
D. Y. Wang, R. W. P. Luk, K. F. Wong¹ and K. L. Kwok²
Department of Computing, The Hong Kong Polytechnic University
¹ Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong
² Department of Computer Science, City University of New York
DY Wang @ 2006
Content
• Introduction
• Motivation
• Discourse Type
• Information Unit
• Problem Formulation
• Score of topic terms
• Score of discourse type
• Document Re-ranking
• Experimental Results
• Conclusion
Motivation
• The effectiveness of information retrieval (IR) systems varies substantially from one topic to another.
• One reason: users' information needs are very diverse.
• Our approach: identify the discourse type of the topic and adopt an appropriate retrieval strategy.
Discourse Type
• Definition of discourse type: the functions (including properties and relations that cannot exist independently) of the independent entities
Performance Difference
[Chart not reproduced. Average = 0.2768]
Why Choose “Advantage / Disadvantage” as our example?
• Its performance is worse than the average (0.204 vs. 0.277).
• It is relatively abstract compared with concrete things (e.g., people, countries), so it is unlikely to have been investigated before.
• It is related to cue phrases (e.g., “more than”) that are composed of stop words, which conventional IR ignores.
Why Choose “Advantage / Disadvantage” as our example? (cont.)
• It is a popular discourse type of information need: we found at least 40 questions on one website (http://www.answerbag.com) that ask about the advantages and disadvantages of something.
• It has a reasonable number (eight) of TREC topics for investigation (see next slide).
Eight Queries with Discourse Type Advantage / Disadvantage
Information Unit (IU)
[Figure: a document containing several occurrences of topic terms (term1, term2, term1). Around a topic term t, a window extends w words to each side.]
Why IU?
• Assumption: terms inside an IU (i.e., around topic terms) are more important to the relevance of a document than terms outside the IU.
• IUs simplify the processing of documents.
• Compute a score for each IU.
• Aggregate the scores of all IUs as the score of the document.
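The IU idea above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function names, the whitespace tokenisation, the symmetric window of `w` tokens on each side, and the use of a plain sum for aggregation are all assumptions, since the slides do not specify them.

```python
from typing import List, Set, Tuple


def extract_information_units(
    tokens: List[str], topic_terms: Set[str], w: int
) -> List[Tuple[int, List[str]]]:
    """Return (centre position, window tokens) for every topic-term occurrence.

    Assumption: an IU is the topic term plus up to w tokens on each side,
    clipped at the document boundaries.
    """
    units = []
    for i, tok in enumerate(tokens):
        if tok in topic_terms:
            start = max(0, i - w)
            end = min(len(tokens), i + w + 1)
            units.append((i, tokens[start:end]))
    return units


def document_score(iu_scores: List[float]) -> float:
    """Aggregate IU scores into one document score.

    Assumption: aggregation is a plain sum; the slides only say "aggregate".
    """
    return sum(iu_scores)


if __name__ == "__main__":
    doc = "a b term1 c d e term2 f".split()
    for centre, window in extract_information_units(doc, {"term1", "term2"}, w=2):
        print(centre, window)
```

Overlapping windows are kept as separate IUs here; merging them would be an equally plausible design choice.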
Score of Topic Terms
• sumtf = 4
• Dtf = 3 (d: distinct)
Graph-based Model:
• atS3 = 1/1 + 1/5 + 1/3
• atS4 = 1/5 + 1/3
[Figure not reproduced; it carries the labels 1, 5 and 3.]
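The two graph-based sums on this slide can be checked with exact arithmetic. The sketch below only evaluates the sums as written; reading the values 1, 5 and 3 as distances between topic-term occurrences is an assumption, and the helper name `reciprocal_distance_sum` is hypothetical.

```python
from fractions import Fraction
from typing import Iterable


def reciprocal_distance_sum(distances: Iterable[int]) -> Fraction:
    """Sum 1/d over the given values, using exact fractions."""
    return sum((Fraction(1, d) for d in distances), Fraction(0))


# Values taken from the slide; their interpretation as distances is assumed.
at_s3 = reciprocal_distance_sum([1, 5, 3])  # 1/1 + 1/5 + 1/3 = 23/15
at_s4 = reciprocal_distance_sum([5, 3])     # 1/5 + 1/3 = 8/15

if __name__ == "__main__":
    print(at_s3, float(at_s3))
    print(at_s4, float(at_s4))
```

So atS3 is 23/15 (about 1.53) and atS4 is 8/15 (about 0.53).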
Example: Score of Discourse Type
• more (comparative words) = 3
support = [' back ',' confirm ',' contest ',' contrari ',' defend ',' encourag ',' endors ',' object ',' oppon ',' oppos ',' opposit ',' prove ',' quibbl ',' refer ',' sponsor ',' support '] (from www.answers.com)
• support = 2
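Counting cue words inside an IU, as in the example above, might look like the following sketch. It assumes the IU has already been lowercased, tokenised and stemmed, and it matches whole tokens. The stem list is copied from the slide with the surrounding spaces dropped; the one-word comparative list, the function name `count_cues` and the sample IU are hypothetical. How the counts are turned into a final discourse-type score is not stated on the slide, so the sketch stops at the counts.

```python
from typing import Dict, Iterable, Set

# Stems of "support" and related words, copied from the slide.
SUPPORT_STEMS: Set[str] = {
    "back", "confirm", "contest", "contrari", "defend", "encourag",
    "endors", "object", "oppon", "oppos", "opposit", "prove",
    "quibbl", "refer", "sponsor", "support",
}

# Assumption: the slide names only "more" as a comparative word.
COMPARATIVE_WORDS: Set[str] = {"more"}


def count_cues(iu_tokens: Iterable[str], cue_sets: Dict[str, Set[str]]) -> Dict[str, int]:
    """Count how many tokens of the IU fall into each named cue set."""
    tokens = list(iu_tokens)
    return {
        name: sum(1 for tok in tokens if tok in cues)
        for name, cues in cue_sets.items()
    }


if __name__ == "__main__":
    # Hypothetical, already-stemmed IU chosen to reproduce the slide's counts.
    iu = ["more", "oppon", "of", "plan", "more", "support", "than", "more"]
    print(count_cues(iu, {"comparative": COMPARATIVE_WORDS, "support": SUPPORT_STEMS}))
```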
Document Re-ranking
• IU score before re-ranking: S0, the similarity score of the document that contains the IU
• IU re-ranking score S’, in three variants:
• S’ = S0 * score of topic terms
• S’ = S0 * score of discourse type
• S’ = S0 * score of topic terms * score of discourse type
• Aggregate the re-ranking scores of all IUs in a document as the final score of the document.
• Re-rank the documents by the final score.
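The re-ranking procedure above can be sketched as follows. The class and function names are hypothetical, the per-IU topic and discourse scores are taken as given inputs, and aggregation by summation is an assumption (the slide does not say how IU scores are aggregated).

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass
class InformationUnit:
    doc_id: str
    s0: float               # similarity score of the document containing the IU
    topic_score: float      # score of topic terms for this IU
    discourse_score: float  # score of discourse type for this IU


def rerank(
    units: Iterable[InformationUnit],
    use_topic: bool = True,
    use_discourse: bool = True,
) -> List[Tuple[str, float]]:
    """Return (doc_id, final score) pairs sorted by descending final score."""
    totals: Dict[str, float] = {}
    for u in units:
        s = u.s0
        if use_topic:
            s *= u.topic_score
        if use_discourse:
            s *= u.discourse_score
        # Assumption: the document score is the sum of its IU scores.
        totals[u.doc_id] = totals.get(u.doc_id, 0.0) + s
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    units = [
        InformationUnit("d1", 0.5, 2.0, 1.0),
        InformationUnit("d1", 0.5, 1.0, 1.0),
        InformationUnit("d2", 0.8, 1.0, 2.0),
    ]
    print(rerank(units))                       # both factors
    print(rerank(units, use_discourse=False))  # topic terms only
```

The two flags select among the three S’ variants listed on the slide, which makes it easy to compare them on the same set of IUs.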
Re-ranking Results in MAP
Conclusion
• Re-ranking based on topic terms and re-ranking based on discourse type can each improve retrieval performance.
• Combining the two improves the results the most; the improvement is significant at the 95% confidence level (taking the sample size into account).
• This approach is promising and is worth further investigation.
Acknowledgement: We thank the Center for Intelligent Information Retrieval, University of Massachusetts, for helping Robert Luk develop the basic IR system while he was on leave there. This work is supported by the CERG Project # PolyU 5226/05E.