An information-pattern-based approach to novelty detection. Presenter: Lin, Shu-Han. Authors: Xiaoyan Li, W. Bruce Croft. Information Processing and Management (2008). Outline: Motivation, Objective, Definition, Observation, Methodology, Experiments, Conclusion, Personal Comments.
An information-pattern-based approach to novelty detection
Presenter: Lin, Shu-Han
Authors: Xiaoyan Li, W. Bruce Croft
Information Processing and Management (2008)
Outline
• Motivation
• Objective
• Definition
• Observation
• Methodology
• Experiments
• Conclusion
• Personal Comments
Motivation - specific topic
• It is very difficult for traditional word-based approaches to separate the two non-relevant sentences (3 and 4) from the two relevant sentences (1 and 2).
• The two non-relevant sentences are very likely to be identified as novel because they contain many new words that do not appear in previous sentences.
Motivation - general topic
• It is very difficult for traditional word-based approaches to separate the non-relevant sentence (2) from the relevant sentence (1).
Objectives
• To attack the hard problem above:
  • To provide a new and more explicit definition of novelty. Novelty is defined as new answers to the potential questions representing a user’s request or information need.
  • To propose a new concept in novelty detection: query-related information patterns. Very effective information patterns for novelty detection at the sentence level have been identified.
  • To propose a unified pattern-based approach that includes the following three steps: query analysis, relevant sentence detection and new pattern detection. The unified approach works for both specific topics and general topics.
Definition - Information Patterns
• Information patterns of specific topics
• Information patterns of general topics
  • Opinion patterns and opinion sentences
  • Event patterns and event sentences
Table. Word patterns for the five types of NE (Named Entity) questions
Table. Examples of opinion patterns
Observation – information patterns
• Sentence lengths
  • Relevant sentences on average have more words than non-relevant sentences.
  • Novel sentences on average have slightly more words than relevant sentences.
• Opinion patterns
  • There are relatively more opinion sentences in relevant (and novel) sentences than in non-relevant sentences.
  • The percentage of opinion sentences among novel sentences is slightly larger than among relevant sentences.
Table. Statistics of sentence lengths
Table. Statistics on opinion patterns for 22 opinion topics (2003)
Observation – information patterns (Cont.)
• NE (Named Entity) combinations
  • PLD types (PERSON, LOCATION, DATE) are more effective in separating relevant and non-relevant sentences.
  • POLD types (PERSON, ORGANIZATION, LOCATION, DATE) will be used in new pattern detection; NEs of the ORGANIZATION type may provide different sources of new information.
  • NEs of the PLD types play a more important role in event topics than in opinion topics.
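The role of POLD named entities in new pattern detection can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the sentences have already been tagged by some NE tagger, and the function name `novel_by_named_entities` and the input layout are hypothetical.

```python
from typing import List, Set, Tuple

# Entity types the slide says are used for new pattern detection.
POLD_TYPES = {"PERSON", "ORGANIZATION", "LOCATION", "DATE"}

Entity = Tuple[str, str]  # (entity text, entity type); assumed tagger output


def novel_by_named_entities(sentences: List[List[Entity]]) -> List[bool]:
    """Flag a sentence as novel if it has a POLD entity not seen before.

    Assumption: sentences arrive in reading order, already NE-tagged.
    """
    seen: Set[Entity] = set()
    flags: List[bool] = []
    for entities in sentences:
        # Keep only POLD entities; lowercase the text so case does not matter.
        current = {(text.lower(), etype)
                   for text, etype in entities if etype in POLD_TYPES}
        flags.append(bool(current - seen))
        seen |= current
    return flags


tagged = [
    [("Clinton", "PERSON"), ("Washington", "LOCATION")],
    [("clinton", "PERSON")],
    [("NATO", "ORGANIZATION")],
    [("42", "NUMBER")],
]
print(novel_by_named_entities(tagged))
```

In this sketch a sentence that only repeats known entities, or that only contains entity types outside POLD, is not flagged as novel.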
Methodology
Fig. ip-BAND: a unified information-pattern-based approach to novelty detection.
Methodology (Cont.)
• (1) Query analysis and question formulation
[Figure: example topic; labels "How many" (2) and "Where" (3)]
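Question formulation can be pictured as mapping question words in a topic to the named entity type expected in the answer. The sketch below is only an assumption about how such a mapping might look: the trigger phrases, the type names and the function `expected_answer_types` are illustrative and are not taken from the paper's word-pattern table.

```python
import re
from typing import List

# Hypothetical trigger patterns; the paper's actual word patterns are not
# reproduced in the slides.
QUESTION_PATTERNS = [
    (r"\bhow many\b", "NUMBER"),
    (r"\bwho\b", "PERSON"),
    (r"\bwhere\b", "LOCATION"),
    (r"\bwhen\b", "DATE"),
    (r"\b(?:which|what) (?:organization|company|agency)\b", "ORGANIZATION"),
]


def expected_answer_types(query: str) -> List[str]:
    """Return the NE types suggested by question words in the query."""
    text = query.lower()
    found: List[str] = []
    for pattern, ne_type in QUESTION_PATTERNS:
        if re.search(pattern, text) and ne_type not in found:
            found.append(ne_type)
    return found


print(expected_answer_types(
    "How many people died and where did the crash happen?"))
```

A query with no question words yields an empty list, which in this sketch would correspond to a general topic with no specific NE question.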
Methodology (Cont.)
• (2) Using patterns in relevance re-ranking
  • Ranking with TFISF (term frequency - inverse sentence frequency) models
  • TFISF with information patterns
    • Sentence lengths
    • Named Entities
    • Opinion patterns
• (3) Novel sentence extraction
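A TFISF ranking can be sketched as below. The slides do not show the exact formula, so this uses one common TF-ISF variant (sentence term frequency times query term frequency times squared inverse sentence frequency) as an assumption. The length adjustment in `adjust_for_length`, including its `alpha` parameter, is purely hypothetical and only illustrates how a pattern such as sentence length could modify a score.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfisf_scores(query: str, sentences: List[str]) -> List[float]:
    """Score each sentence against the query with a TF-ISF variant."""
    tokenized = [tokenize(s) for s in sentences]
    n = len(tokenized)
    sentence_freq = Counter()
    for tokens in tokenized:
        sentence_freq.update(set(tokens))  # count sentences, not occurrences
    query_tf = Counter(tokenize(query))
    scores = []
    for tokens in tokenized:
        sent_tf = Counter(tokens)
        score = 0.0
        for term, q_count in query_tf.items():
            if sent_tf[term]:
                isf = math.log((n + 1) / (0.5 + sentence_freq[term]))
                score += sent_tf[term] * q_count * isf * isf
        scores.append(score)
    return scores


def adjust_for_length(scores: List[float], sentences: List[str],
                      alpha: float = 0.5) -> List[float]:
    """Hypothetical adjustment: favor longer sentences (not the paper's formula)."""
    lengths = [len(tokenize(s)) for s in sentences]
    avg = sum(lengths) / len(lengths) if lengths else 0.0
    if avg == 0:
        return list(scores)
    return [sc * (1 + alpha * ln / avg) for sc, ln in zip(scores, lengths)]


docs = [
    "The earthquake struck the city at dawn.",
    "Stocks rose on Monday.",
    "Rescue teams reached the earthquake zone.",
]
base = tfisf_scores("earthquake rescue", docs)
print(base)
print(adjust_for_length(base, docs))
```

The multiplicative form keeps a zero TFISF score at zero, so in this sketch a pattern can re-rank sentences that match the query but cannot promote one that shares no query terms.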
Experiments
• Baseline approaches
  • B-NN: initial retrieval ranking
  • B-NW: new word detection
  • B-NWT: new word detection with a threshold
  • B-MMR: Maximal Marginal Relevance (MMR)
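The word-based baselines can be sketched roughly as follows. This is an illustration under stated assumptions rather than the configuration used in the experiments: `new_word_flags` with `threshold=1` stands in for plain new word detection and larger thresholds for the thresholded variant, while `mmr_order` uses Jaccard word overlap as the similarity measure and a weight `lam`, both of which are choices made here for the example.

```python
import re
from typing import List, Set


def word_set(text: str) -> Set[str]:
    """Lowercased set of alphanumeric tokens in a sentence."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def new_word_flags(sentences: List[str], threshold: int = 1) -> List[bool]:
    """Flag a sentence as novel if it adds at least `threshold` unseen words."""
    seen: Set[str] = set()
    flags: List[bool] = []
    for sentence in sentences:
        words = word_set(sentence)
        flags.append(len(words - seen) >= threshold)
        seen |= words
    return flags


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Word-overlap similarity; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mmr_order(sentences: List[str], relevance: List[float],
              lam: float = 0.7) -> List[int]:
    """Greedy MMR: trade relevance against similarity to chosen sentences."""
    sets = [word_set(s) for s in sentences]
    remaining = list(range(len(sentences)))
    selected: List[int] = []
    while remaining:
        def mmr(i: int) -> float:
            redundancy = max(
                (jaccard(sets[i], sets[j]) for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected


sents = ["the cat sat", "the cat sat", "the dog barked loudly"]
print(new_word_flags(sents))
print(mmr_order(sents, [0.9, 0.85, 0.5], lam=0.5))
```

In the example the duplicate second sentence is neither flagged as novel nor ranked ahead of the third by MMR, even though its relevance score is higher.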
Experiments
• Performance for specific topics from TREC 2002, 2003, 2004
Table. Performance of novelty detection for 8 specific topics (queries) from TREC 2002 (3.4 of 15 novel sentences)
Table. Performance of novelty detection for 15 specific topics (queries) from TREC 2003 (10.1 of 15 novel sentences)
Table. Performance of novelty detection for 11 specific topics (queries) from TREC 2004 (4.6 of 15 novel sentences)
Note: Data marked * pass the significance test at the 95% confidence level by the Wilcoxon test; data marked ** pass at the 90% level. Chg%: improvement over the first (B-NN) baseline in %.
Experiments
• Performance for general topics from TREC 2002, 2003, 2004
Table. Performance of novelty detection for 41 general topics (queries) from TREC 2002 (3.2 of 15 novel sentences)
Table. Performance of novelty detection for 35 general topics (queries) from TREC 2003 (7.5 of 15 novel sentences)
Table. Performance of novelty detection for 3 general topics (queries) from TREC 2004 (3.4 of 15 novel sentences)
Note: Data marked * pass the significance test at the 95% confidence level by the Wilcoxon test; data marked ** pass at the 90% level. Chg%: improvement over the first (B-NN) baseline in %.
Experiments
• Comparison among specific, general and all topics at top 15 ranks
Table. Comparison among specific, general and all topics at top 15 ranks
Note: Chg%: improvement over the first baseline in percentage; Nvl#: number of true novel sentences; Rdd#: number of relevant but redundant sentences; NRl#: number of non-relevant sentences.
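The quantities named in the note relate to each other in a simple way, sketched below. The function names and the sample counts are made up for illustration and are not results from the paper; the only assumption is that Nvl#, Rdd# and NRl# together account for all sentences in the top ranks.

```python
def precision_at_k(novel: float, redundant: float, non_relevant: float) -> float:
    """Fraction of the ranked sentences that are truly novel (Nvl# / k)."""
    total = novel + redundant + non_relevant
    if total == 0:
        raise ValueError("no sentences in the ranked list")
    return novel / total


def change_percent(baseline: float, system: float) -> float:
    """Chg%: relative improvement of a system over the baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (system - baseline) / baseline * 100


# Hypothetical counts at top 15 ranks: 6 novel, 4 redundant, 5 non-relevant.
print(precision_at_k(6, 4, 5))
print(change_percent(0.20, 0.25))
```

Averaged counts such as "3.4 of 15" fit the same formula, since the counts are allowed to be fractional.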
Conclusions
• Novelty means new answers to the potential questions representing a user’s request or information need.
• The proposed ip-BAND outperforms all baselines for both specific topics and general topics, and its performance on specific topics is better than on general topics.
• It is impossible to collect complete novelty judgments in reality.
• Baseline selection and evaluation measures by human assessors
• Misjudgment of relevance and/or novelty by human assessors, and disagreement of judgments between the human assessors
• Limitation and accuracy of question formulations
• Novelty detection precision will be low since some non-relevant sentences may be treated as novel.
Personal Comments
• Advantage
  • …
• Drawback
  • …
• Application
  • …