Sentence Semantic Distance and Novelty Detection

Sentence Semantic Distance and Novelty Detection Hua-Ping Zhang zhanghp@software.ict.ac.cn LCC Group, Software Division, Inst. of Computing Tech., CAS 2003-10-20

Happy birthday to Wang Shuxi! Happy birthday to Yu Manquan!

Outline • Motivation • Semantic distance computation • Overview on WordNet • WordNet-based sentence semantic distance computation • Novelty detection using sentence semantic distance • Experiments and analysis • Problems and future work • Conclusion

Motivation • Introduction to the Novelty Track in TREC: The Novelty Track is designed to investigate systems' abilities to locate relevant AND new information within a set of documents relevant to a TREC topic. Systems are given the topic and a set of relevant documents ordered by date, and must identify sentences containing relevant and/or new information in those documents.

Motivation II • Topic • <num>Number: N3 • <title>Gingrich Speaker of House • <toptype>event/opinion • <desc>Description: After a serious challenge by Rep. Livingston in 1998, Newt Gingrich announced he would not seek re-election as Speaker of the U.S. House of Representatives. • <narr>Narrative: Relevant is the political maneuvering that led up to Newt Gingrich's announcement in 1998 that he would not run again for the position of speaker of the house; • Relevant document

Motivation III • Contrast between sentence and document • Content: 6-12 words on average vs. over 30 sentences • Overlap: very little vs. key words that occur frequently • Information (after stemming and stop-word removal): 4 or 5 (word, frequency) pairs, with frequencies mostly 1, vs. dozens of (word, frequency) pairs

Motivation IV • Sample • Event: 1) However, the Scottish team was the first to make a clone from adult animal cell. 2) The seminar was held in the context of a recently reported sheep cloning case in Britain. • Opinion: 1) Daily we read news stories about dissatisfaction with managed care, Medicare fraud and overbilling. 2) Eighty percent agreed with this. • Conclusion: Considering only the words themselves falls far short of what is required; we must expand each sentence as far as we can.

Semantic distance computation • Semantic distance or similarity computation • Previous work focuses on word-word semantic distance • Corpus-based: co-occurrence distribution [Church and Hanks 1989, Grefenstette 1992]; knowledge-free • Ontology-based: taxonomy or other pseudo-knowledge: WordNet, HowNet, Tongyi Ci Lin [Liu Qun 2002, Wang Bin 1999, Li Sujian 2002, Che Wanxiang 2003] • Hybrid [Jay J. Jiang 1997]

Semantic distance computation II • Corpus-based approach • Select a group of words as features and train a feature vector for each word, then compute similarity with a vector measure (e.g. cosine) • Assumption: similar words <-> similar contexts • Lu Song 1999, Degan (1999) • “Invariable Point” [Bai Shuo 1995] for word clustering • [Zhang San]{eats}<a meal> • [Li Si]{wears}<clothes> • One category per word _____?_____ one category for all words
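
The corpus-based idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `context_vector` and `cosine`, the window size, and the toy corpus are all hypothetical and do not come from the cited work.

```python
import math
from collections import Counter
from typing import Dict, List, Set


def context_vector(target: str, sentences: List[List[str]],
                   features: Set[str], window: int = 2) -> Dict[str, int]:
    """Count how often each feature word occurs within `window` tokens of `target`."""
    counts: Counter = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i and tokens[j] in features:
                    counts[tokens[j]] += 1
    return dict(counts)


def cosine(u: Dict[str, int], v: Dict[str, int]) -> float:
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus (hypothetical), mirroring the [subject]{verb}<object> pattern.
corpus = [["zhangsan", "eats", "rice"],
          ["lisi", "eats", "noodles"],
          ["lisi", "wears", "clothes"]]
features = {"eats", "wears"}
rice = context_vector("rice", corpus, features)
noodles = context_vector("noodles", corpus, features)
clothes = context_vector("clothes", corpus, features)
```

Here `rice` and `noodles` share the context "eats" and so come out as similar, while `clothes` does not, which is the "similar words <-> similar contexts" assumption in miniature.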

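Semantic distance computation III • Liu Qun, Li SuJian, 2002, using HowNet: For two Chinese words W1 and W2, if W1 has n senses (concepts) S11, S12, ……, S1n and W2 has m senses (concepts) S21, S22, ……, S2m, we define the similarity of W1 and W2 as the maximum of the similarities between their concepts, that is: Sim(W1, W2) = max Sim(S1i, S2j) over i = 1..n, j = 1..m …… (3) where p1 and p2 denote two sememes (primitives), d is the path length between p1 and p2 in the sememe hierarchy (a positive integer), and α is an adjustable parameter.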
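
A minimal sketch of the "maximum over sense pairs" rule follows. The slide does not show equation (3); the sememe similarity is assumed here to take the form α/(d + α), which fits the description of d and α but should be checked against the original paper. Treating each concept as a single sememe, the default α, and the toy path lengths are simplifications for illustration only.

```python
from typing import Callable, Sequence


def sememe_similarity(path_length: int, alpha: float = 1.6) -> float:
    """Assumed form of the sememe similarity: alpha / (d + alpha)."""
    return alpha / (path_length + alpha)


def word_similarity(senses1: Sequence[str], senses2: Sequence[str],
                    concept_sim: Callable[[str, str], float]) -> float:
    """Similarity of two words = maximum similarity over all pairs of their senses."""
    return max(concept_sim(s1, s2) for s1 in senses1 for s2 in senses2)


# Hypothetical path lengths between the senses of two words.
paths = {("a1", "b1"): 3, ("a1", "b2"): 1, ("a2", "b1"): 4, ("a2", "b2"): 2}


def toy_concept_sim(s1: str, s2: str) -> float:
    # Simplification: each concept is represented by a single sememe.
    return sememe_similarity(paths[(s1, s2)], alpha=1.0)


best = word_similarity(["a1", "a2"], ["b1", "b2"], toy_concept_sim)
```

With these toy values the closest sense pair is ("a1", "b2") at path length 1, so that pair alone determines the word similarity.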

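Semantic distance computation IV • Wang Bin, using Tongyi Ci Lin • [Figure: the Tongyi Ci Lin hierarchy drawn as a tree with root O, upper-level classes A, B, … L, mid-level classes a, b, …, and numbered lower-level classes 01, 02, …] • Dashed lines mark the path from an upper-level node to a lower-level node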

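Semantic distance computation V • Che Wanxiang • Edit distance: deletion, insertion, substitution (costs estimated from HowNet and Tongyi Ci Lin) (a) The edit distance between “爱吃苹果” ("love eating apples") and “喜欢吃香蕉” ("like eating bananas") is 4, as shown by the four dashed lines; (b) the improved edit distance between “爱吃苹果” and “喜欢吃香蕉” is 1.1, where “爱” → “喜欢” ("love" → "like") costs 0.5 and “苹果” → “香蕉” ("apple" → "banana") costs 0.6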
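
The two distances on this slide can be reproduced with a standard dynamic-programming edit distance whose substitution cost is pluggable. This is a sketch, not the author's implementation: the function name, the unit insertion/deletion cost, and the hand-written cost table (standing in for costs estimated from HowNet or Tongyi Ci Lin) are assumptions.

```python
from typing import Callable, Optional, Sequence


def edit_distance(source: Sequence[str], target: Sequence[str],
                  sub_cost: Optional[Callable[[str, str], float]] = None,
                  indel_cost: float = 1.0) -> float:
    """Edit distance with a configurable substitution cost."""
    if sub_cost is None:
        sub_cost = lambda a, b: 0.0 if a == b else 1.0
    m, n = len(source), len(target)
    dist = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i * indel_cost
    for j in range(1, n + 1):
        dist[0][j] = j * indel_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dist[i][j] = min(
                dist[i - 1][j] + indel_cost,                       # deletion
                dist[i][j - 1] + indel_cost,                       # insertion
                dist[i - 1][j - 1] + sub_cost(source[i - 1], target[j - 1]),
            )
    return dist[m][n]


# (a) Plain character-level distance.
plain = edit_distance("爱吃苹果", "喜欢吃香蕉")

# (b) Word-level distance with semantic substitution costs from the slide.
semantic_costs = {("爱", "喜欢"): 0.5, ("苹果", "香蕉"): 0.6}


def semantic_sub(a: str, b: str) -> float:
    return 0.0 if a == b else semantic_costs.get((a, b), 1.0)


improved = edit_distance(["爱", "吃", "苹果"], ["喜欢", "吃", "香蕉"], semantic_sub)
```

The plain variant works on characters and the improved one on segmented words; with the slide's two substitution costs they give the slide's values of 4 and 1.1.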

Semantic distance computation VI • Ontology-based: simple but effective and understandable, yet subjective • Corpus-based: ignores inherent semantic knowledge; practical for applications in parsing, semantics or language usage • Hybrid: uses the taxonomy as its frame and uses the corpus to compute edge values instead of setting them arbitrarily.

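Overview on WordNet • URL: • http://www.cogsci.princeton.edu/~wn/ • Developer: • Psycholinguistics laboratory, Princeton University • Originally conceived as a psycholinguistic study of human lexical memory • Widely used in natural language processing • A free online lexical database • Counterpart versions have been developed for many of the world's languages • European languages: EuroWordNet • Chinese: CCD (Chinese Concept Dictionary)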

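WordNet II • Synset (synonym set) • A concept is represented by a set of synonyms, called a synset • Each concept has a descriptive gloss • Relations • Hypernym/hyponym relations (hyponymy, troponymy) • Synonym/antonym relations (synonymy, antonymy) • Part/whole relations (entailment, meronymy) • ……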

WordNet III • Organization of noun concepts:

WordNet IV • Organization of adjective concepts:

WordNet V • Covers only open-class words: adj, n, v, adv • Scale: see attachment • Data format: see attachment • API: how to use it? wninit() • (word, POS) -> [getindex(char *, int)] -> (synset offset) -> [read_synset(int, long, char *)] -> tagged frequency • Tagged-frequency record: sense_key sense_number tag_cnt, e.g. be%2:42:03:: 1 10720
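
The tagged-frequency record shown on this slide can be read without the C library. The sketch below assumes the three whitespace-separated fields named on the slide (sense_key, sense_number, tag_cnt) and the usual WordNet sense-key layout, in which the number right after "%" encodes the part of speech; the class and function names are hypothetical.

```python
from typing import NamedTuple

# ss_type codes used in WordNet sense keys.
SS_TYPE_TO_POS = {1: "n", 2: "v", 3: "a", 4: "r", 5: "s"}


class SenseCount(NamedTuple):
    lemma: str
    pos: str
    sense_key: str
    sense_number: int
    tag_cnt: int


def parse_sense_count(line: str) -> SenseCount:
    """Parse one 'sense_key sense_number tag_cnt' record."""
    sense_key, sense_number, tag_cnt = line.split()
    lemma, lex_sense = sense_key.split("%", 1)
    ss_type = int(lex_sense.split(":")[0])
    return SenseCount(lemma, SS_TYPE_TO_POS[ss_type], sense_key,
                      int(sense_number), int(tag_cnt))


record = parse_sense_count("be%2:42:03:: 1 10720")
```

For the slide's example this yields the verb "be", sense number 1, tagged 10720 times; such counts are what the freq(w) terms of an information-content computation would be built from.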

WordNet-based Sentence semantic distance • Information content • P(c) -> 1/P(c): the less probable a concept, the more information it carries • IC(c) = -log P(c) • Example: "A dog bites a man" vs. "A man bites a dog" • Entropy: H = ∑ P(c) * IC(c) • Information content of a synset: IC(c) = -log P(c), with P(c) = freq(c)/N and freq(c) = ∑ freq(w), where w ∈ c or w ∈ c* and c* is a child of c
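
The frequency propagation and IC formula can be sketched as follows. Assumptions are made loudly: the taxonomy is a toy tree (WordNet allows multiple parents), "child" is applied recursively so that every descendant contributes to freq(c), N is taken to be the root's frequency, and the counts are invented.

```python
import math
from typing import Dict, List


def concept_frequencies(children: Dict[str, List[str]],
                        word_freq: Dict[str, int],
                        root: str) -> Dict[str, int]:
    """freq(c) = own word counts of c plus the freq of every concept below c."""
    freq: Dict[str, int] = {}

    def visit(concept: str) -> int:
        total = word_freq.get(concept, 0)
        for child in children.get(concept, []):
            total += visit(child)
        freq[concept] = total
        return total

    visit(root)
    return freq


def information_content(freq: Dict[str, int], total: int) -> Dict[str, float]:
    """IC(c) = -log(freq(c) / N); concepts with zero frequency are skipped."""
    return {c: -math.log(f / total) for c, f in freq.items() if f > 0}


# Toy taxonomy and counts (hypothetical).
children = {"animal": ["dog", "cat"]}
word_freq = {"animal": 2, "dog": 5, "cat": 3}
freq = concept_frequencies(children, word_freq, "animal")
ic = information_content(freq, total=freq["animal"])
```

The root covers every occurrence, so its IC is zero, and IC grows as concepts become rarer and more specific.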

WordNet-based Sentence semantic distance II • Edge value: Wt(c,p) = (β + (1-β) E'/E(p)) (1 + 1/d(p))^α [IC(c) - IC(p)] T(c,p), where: • c: child; p: parent • E: density • T(c,p): link type • Focusing on IC only: Wt(c,p) = IC(c) - IC(p) • [Jay J. Jiang, 1997] Summing these edge values along the path from c1 up to LSuper and down to c2 gives Dist(w1,w2) = IC(c1) + IC(c2) - 2*IC(LSuper(c1,c2)) • LSuper: lowest super-ordinate of c1 and c2
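
A minimal sketch of the IC-only distance on this slide, on a toy single-parent taxonomy. The helper names, the parent map and the IC values are hypothetical; real WordNet concepts can have several hypernyms, which this sketch ignores.

```python
from typing import Dict, List, Optional


def ancestors(concept: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """The concept itself followed by its chain of super-ordinates up to the root."""
    chain = [concept]
    while parent.get(chain[-1]) is not None:
        chain.append(parent[chain[-1]])
    return chain


def lowest_superordinate(c1: str, c2: str,
                         parent: Dict[str, Optional[str]]) -> str:
    """First concept on c2's ancestor chain that is also an ancestor of c1."""
    seen = set(ancestors(c1, parent))
    for candidate in ancestors(c2, parent):
        if candidate in seen:
            return candidate
    raise ValueError("concepts share no super-ordinate")


def jiang_conrath_distance(c1: str, c2: str,
                           parent: Dict[str, Optional[str]],
                           ic: Dict[str, float]) -> float:
    """Dist = IC(c1) + IC(c2) - 2 * IC(LSuper(c1, c2))."""
    lsuper = lowest_superordinate(c1, c2, parent)
    return ic[c1] + ic[c2] - 2 * ic[lsuper]


# Toy taxonomy with made-up information content values.
parent = {"dog": "animal", "cat": "animal", "animal": None}
ic = {"animal": 0.0, "dog": 1.0, "cat": 2.0}
d_dog_cat = jiang_conrath_distance("dog", "cat", parent, ic)
```

A concept is at distance 0 from itself, and the distance to a direct parent is the IC difference, i.e. the simplified edge value Wt(c,p).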

WordNet-based Sentence semantic distance III • Following hypernym links alone is not enough. • We introduce more relations, such as: • VERBGROUP ($): develop 6 acquire 5 evolve • Similar (&): aborning 0 003 & 00003552 a 0000 & 00003671 a 0000 ! 00003777 a 0101 • Derivation: adj->adv; v<->n; v->adj • Noun-noun: ISMEMBERPTR, ISSTUFFPTR, ISPARTPTR, HASMEMBERPTR, HASSTUFFPTR, HASPARTPTR • Example: friend->friendly->friendliness

WordNet-based Sentence semantic distance IV • Word-Sentence Semantic Distance (WSSD): WSSD(W,S) = min { WWSD(W, wi) | wi ∈ S }, where W is a word, S is a sentence, wi is a word in sentence S, and WWSD is the word-word semantic distance. • Sentence-Sentence Semantic Distance (SSSD)

WordNet-based Sentence semantic distance IV (figure) • [Figure: words w1 to w6 of Sentence A linked to words W1 to W4 of Sentence B] • Note: wi: word in Sentence A; Wi: word in Sentence B • Legend: semantic distance from word wi to Sentence B; semantic distance from word Wi to Sentence A; sentence semantic distance
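
The WSSD definition translates directly into code. The slides do not give the SSSD formula, so the combination below (the mean of all word-to-sentence distances in both directions) is purely an assumption chosen to match the figure's two directions, and the toy word distance table is invented.

```python
from typing import Callable, Dict, Sequence, Tuple

WordDistance = Callable[[str, str], float]


def wssd(word: str, sentence: Sequence[str], wwsd: WordDistance) -> float:
    """WSSD(W, S) = min over words wi in S of WWSD(W, wi)."""
    return min(wwsd(word, w) for w in sentence)


def sssd(sent_a: Sequence[str], sent_b: Sequence[str], wwsd: WordDistance) -> float:
    """ASSUMED combination: average of the WSSD values in both directions."""
    a_to_b = [wssd(w, sent_b, wwsd) for w in sent_a]
    b_to_a = [wssd(w, sent_a, wwsd) for w in sent_b]
    return (sum(a_to_b) + sum(b_to_a)) / (len(sent_a) + len(sent_b))


# Toy word-word distances (hypothetical); unknown pairs default to 1.0.
toy_table: Dict[Tuple[str, str], float] = {("puppy", "dog"): 0.2}


def toy_wwsd(a: str, b: str) -> float:
    if a == b:
        return 0.0
    return toy_table.get((a, b), toy_table.get((b, a), 1.0))


same_words = sssd(["dog", "bites", "man"], ["man", "bites", "dog"], toy_wwsd)
```

Note that `same_words` is 0 even though the two sentences mean different things: a bag-of-words distance of this kind cannot see word order.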

Novelty Detection using sentence semantic distance • What factors determine whether a sentence S is novel? • Semantic distance between S and the topic T: smaller is better • Semantic distance between S and the previous valid content: larger is better • Word overlap with T: more is better • Word overlap with previous sentences: less is better • Is S a paragraph head? A head sentence is more likely to be new. • …
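
One way to turn the word-overlap factors into numbers is sketched below. The definition used (share of the sentence's content words that also occur in the reference text), the tiny stop-word list, and the absence of stemming are all assumptions made for illustration; the slides do not specify how overlap was computed.

```python
from typing import Set

# A deliberately tiny stop-word list, for illustration only.
STOP_WORDS = frozenset({"a", "an", "the", "of", "in", "to", "was", "is",
                        "and", "with", "this"})


def content_words(text: str) -> Set[str]:
    """Lower-case the tokens, strip surrounding punctuation, drop stop words."""
    tokens = (t.strip(".,;:!?()\"'").lower() for t in text.split())
    return {t for t in tokens if t and t not in STOP_WORDS}


def overlap(sentence: str, reference: str) -> float:
    """Fraction of the sentence's content words that also appear in the reference."""
    sent_words = content_words(sentence)
    if not sent_words:
        return 0.0
    return len(sent_words & content_words(reference)) / len(sent_words)


with_topic = overlap("The seminar was held in Britain.", "sheep cloning in Britain")
no_shared_words = overlap("Eighty percent agreed with this.", "managed care")
```

The second call shows the problem raised in the motivation slides: a short opinion sentence can share no words at all with its topic, which is why a semantic distance is added alongside overlap.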

Novelty Detection using sentence semantic distance II • How to link the various factors to a decision? • The decision is binary: new or not. • Each factor is treated as one dimension, so together they form a factor (feature) vector; each dimension carries its own weight. • The problem thus becomes binary categorization: given the factor vector of a relevant sentence S, which category does it belong to, new or not? • Similar approaches could be applied to relevance detection.

Novelty Detection using sentence semantic distance III • Train on known results using the Winnow algorithm. • All factors should point in the same direction and share the same range: ascending from 0 to 1 • N2 XIE19970228.0169:05 0.430 1.00 0.174 1.00 1 1 • N2 XIE19970228.0169:06 0.364 0.52 0.143 0.81 1 0 • N2 XIE19970302.0039:04 0.5978 1.0 0.20 1.0 1 ? • ∑Wi*Fi > N: new; otherwise: not new
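
The decision rule ∑Wi*Fi > N and a Winnow-style trainer can be sketched as below. This is not the program used in the experiments: classic Winnow is defined for boolean features, so the multiplicative update is generalized here to factors in [0, 1] by scaling each weight by alpha to the power of the factor value. That generalization, the initial weights of 1.0, the parameter names, and the toy samples are all assumptions.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], bool]  # (factor vector, is_new label)


def is_new(weights: Sequence[float], factors: Sequence[float],
           threshold: float) -> bool:
    """Decision rule from the slide: sum(Wi * Fi) > N means 'new'."""
    return sum(w * f for w, f in zip(weights, factors)) > threshold


def train_winnow(samples: Sequence[Sample], threshold: float,
                 alpha: float = 2.0, epochs: int = 20) -> List[float]:
    """Mistake-driven multiplicative updates (Winnow2-style, real-valued factors)."""
    weights = [1.0] * len(samples[0][0])
    for _ in range(epochs):
        mistakes = 0
        for factors, label in samples:
            if is_new(weights, factors, threshold) == label:
                continue
            mistakes += 1
            # Promote the active factors on a missed 'new', demote on a false 'new'.
            direction = 1.0 if label else -1.0
            weights = [w * alpha ** (direction * f)
                       for w, f in zip(weights, factors)]
        if mistakes == 0:
            break
    return weights


# Toy training data (hypothetical): the first factor signals novelty.
samples = [([1.0, 0.0], True), ([0.0, 1.0], False)]
weights = train_winnow(samples, threshold=1.5)
```

On boolean factors the update reduces to the usual "multiply or divide by alpha" rule; the initial weights and the step size alpha are the two knobs that the later "winnow init" and "winnow step" items refer to.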

Experiments and analysis • ICT03NOV4OTP: word overlap with the topic and the previous valid context • Averages over 50 topics: average precision 0.59; average recall 0.70; average F 0.610 • Best: N5 209 227 200 0.88 0.96 0.917 • Worst: N49 50 5 0 0.00 0.00 0.000 • Not bad for such simple information.

Experiments and analysis II • ICT03NOV4LFF: word overlap + semantic similarity with the topic and the previous valid context • Averages over 50 topics: average precision 0.59; average recall 0.64; average F 0.568 • Best: N5 209 224 197 0.88 0.94 0.910 • Worst: N46 93 2 0 0.00 0.00 0.000 • Performance drops slightly after introducing semantic distance.

Experiments and analysis III • ICT03NOV4ALL: word overlap + semantic similarity with the topic and the previous valid context + head sentence • Averages over 50 topics: average precision 0.60; average recall 0.68; average F 0.598 • Best: N5 209 224 197 0.88 0.94 0.910 • Worst: N46 93 2 0 0.00 0.00 0.000 • Performance improves after adding head-sentence information.
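
The per-topic rows in the experiment slides appear to list, after the topic ID, three counts followed by precision, recall and F. Reading the counts as (new sentences in the ground truth, sentences returned, sentences matched) is an inference from the arithmetic, not something the slides state; under that reading the scores can be recomputed as follows.

```python
from typing import Tuple


def precision_recall_f(truth: int, returned: int,
                       matched: int) -> Tuple[float, float, float]:
    """Precision, recall and balanced F-measure from raw counts."""
    precision = matched / returned if returned else 0.0
    recall = matched / truth if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Best topic of the overlap-only run: N5 209 227 200
p, r, f = precision_recall_f(truth=209, returned=227, matched=200)
```

Rounded, `p`, `r` and `f` agree with the 0.88, 0.96 and 0.917 reported for N5. The averages over 50 topics are presumably averaged per topic, which is why the average F is not the F of the average precision and recall.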

Problems and future work • Corpus-based semantic computation depends heavily on the corpus. Data sparseness is the bottleneck, especially since no large-scale semantic corpus is available so far. • The semantic computation model should unify taxonomic relations with the other relations. • The Winnow training algorithm is deficient at modeling the (factors, decision) mapping

Problems and future work II • Some modifications could be applied to the semantic computation. • Within the current approach, several parts could be optimized: 1) extend the semantic distance between the selected sentence and its previous one to the whole valid context; 2) Winnow initialization; 3) Winnow step size; 4) KNN or another method to replace Winnow

Problems and future work III • Synset-based VSM or synset overlap to expand the sentence, given that plain overlap/VSM has proved simple but effective.

Conclusion • Semantic distance computation using WordNet is effective for many applications, such as disambiguation and bilingual word alignment. It helps compare words whose surface forms differ. • Linking factors to a decision can be cast as categorization • Relevance and novelty detection look promising but difficult. The work is still at an early stage, and more approaches could be tried. The best results determine the best approach: good ideas should be checked against the final result.

Acknowledgements • Dr. Jian Sun for instructive discussion and for providing papers on semantic computation and novelty. • Mr. Wei-Feng Pan for the Winnow program. • Associate Prof. Qun Liu for discussion on semantic similarity. • Princeton Univ. for providing WordNet

Thanks for your attention!
