Topic Segmentation with Shared Topic Detection and Alignment of Multiple Documents Bingjun Sun, Prasenjit Mitra, Hongyuan Zha, C. Lee Giles, John Yen
Introduction • Capture the local and sequential information of documents • Topic detection and tracking • Topic segmentation • Two variants of topic segmentation • Text stream segmentation • Application: automatic speech recognition • Coherent document segmentation • Documents split into subtopics • Application: partial-text queries over long documents in information retrieval • Problems with topic segmentation • Traditional approaches segment documents one at a time • Most of them perform poorly
Introduction • Dealing with stop words: • Do not remove stop words directly • Removing common stop words may cause the loss of useful information in a specific domain • A hard classification into stop words and non-stop words cannot represent the gradually changing amount of information carried by each word • Instead, employ a soft classification using term weights
The main structure of this paper • [Figure: model structure — each document is divided into segments, and each segment consists of sentences]
Mutual Information • Measure how dependent the terms T and segments S are: $I(T;S)=\sum_{t}\sum_{s} p(t,s)\log\frac{p(t,s)}{p(t)\,p(s)}$ • Minimizing the loss of mutual information caused by clustering, $I(T;S)-I(\hat{T};\hat{S})$, is equivalent to maximizing $I(\hat{T};\hat{S})$
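For concreteness, here is a minimal sketch (not the authors' code) of computing this quantity from a term-by-segment count matrix; the function name and normalization are my own:

```python
import numpy as np

def mutual_information(counts):
    """I(T;S) from a term-by-segment count matrix: counts[t, s] is how
    often term t occurs in segment s."""
    p_ts = counts / counts.sum()            # joint p(t, s)
    p_t = p_ts.sum(axis=1, keepdims=True)   # marginal p(t)
    p_s = p_ts.sum(axis=0, keepdims=True)   # marginal p(s)
    mask = p_ts > 0                         # treat 0 * log 0 as 0
    return float((p_ts[mask] * np.log2(p_ts[mask] / (p_t @ p_s)[mask])).sum())
```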
Weighted Mutual Information • Categorize terms into four types • Common stop words • Common stop words are common along both the document and segment dimensions • Document-dependent stop words • Document-dependent stop words, which depend on personal writing style, are common only along the segment dimension for some documents • Cue words • Cue words are common along the document dimension only for the same segment, and are not common along the segment dimension • Noisy words • Noisy words are the remaining words, which are not common along either dimension
Weighted Mutual Information • [Figure: illustration of the four term types — common stop words, document-dependent stop words, cue words, and noisy words — and their distributions along the document and segment dimensions]
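To make the four categories concrete, the sketch below separates terms by the entropy of their distributions along the document and segment dimensions. The median thresholds and the classification rule are illustrative assumptions, not the paper's actual weighting formula:

```python
import numpy as np

def entropy_along(counts, axis):
    """Shannon entropy of each term's distribution along one axis."""
    p = counts / np.maximum(counts.sum(axis=axis, keepdims=True), 1e-12)
    plogp = np.where(p > 0, p * np.log2(np.maximum(p, 1e-12)), 0.0)
    return -plogp.sum(axis=axis)

# counts_dst[d, s, t]: occurrences of term t in segment s of document d
counts_dst = np.random.randint(0, 5, size=(4, 3, 50)).astype(float)

h_doc = entropy_along(counts_dst.sum(axis=1), axis=0)  # spread over documents
h_seg = entropy_along(counts_dst.sum(axis=0), axis=0)  # spread over segments

hi_d, hi_s = h_doc > np.median(h_doc), h_seg > np.median(h_seg)
common_stop = hi_d & hi_s    # common along both dimensions
doc_stop = ~hi_d & hi_s      # common along segments only
cue = hi_d & ~hi_s           # common along documents only
noisy = ~hi_d & ~hi_s        # common along neither dimension
```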
Algorithm • Stage 1: initialization • Segmentation: • Segment documents equally by sentences • Find the optimal segmentation for each document alone that maximizes the WMI • Alignment: • Assume that the order of segments is the same for every document • Term clusters: • Assign cluster labels randomly
Algorithm • Stage 2: iterative optimization • Find the best cluster for each term • For each document, check all sequential segmentations and keep the best one • The cycle repeats, increasing the MI objective, until it converges to a local maximum (see the sketch below)
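A runnable sketch of the per-document step, reusing the `mutual_information` helper above: it exhaustively checks all sequential segmentations of one document and keeps the one with maximal MI. Term weighting, term clustering, and the multi-document alignment loop are omitted, and all names are mine, not the authors':

```python
from itertools import combinations
import numpy as np

def term_segment_counts(doc, bounds, vocab):
    """Term-by-segment counts for one document; doc is a list of tokenized
    sentences, bounds the sentence indices where new segments start."""
    edges = [0, *bounds, len(doc)]
    counts = np.zeros((len(vocab), len(edges) - 1))
    for s in range(len(edges) - 1):
        for sentence in doc[edges[s]:edges[s + 1]]:
            for word in sentence:
                counts[vocab[word], s] += 1
    return counts

def best_segmentation(doc, n_segments, vocab):
    """Try every placement of n_segments - 1 boundaries between sentences
    and return the one maximizing MI (exhaustive: short documents only)."""
    return max(combinations(range(1, len(doc)), n_segments - 1),
               key=lambda b: mutual_information(term_segment_counts(doc, b, vocab)))

# Toy usage: two topics, boundary expected after sentence 2.
doc = [["budget", "tax"], ["budget", "deficit"],
       ["goal", "match"], ["team", "goal"]]
vocab = {w: i for i, w in enumerate(sorted({w for s in doc for w in s}))}
print(best_segmentation(doc, 2, vocab))   # -> (2,)
```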
Evaluation • Compare the predicted segmentation against the real segmentation • real: the true segmentation • pred: the predicted segmentation • diff: a pair of words in different segments • same: a pair of words in the same segment • miss: a pair of words in different segments in the real segmentation, but predicted to be in the same segment • false alarm: a pair of words in the same real segment, but predicted to be in different segments
Evaluation • The error probability combines misses and false alarms over pairs of positions a distance k apart: $p(\text{error} \mid \text{real}, \text{pred}, k) = p(\text{miss} \mid \text{real}, \text{pred}, \text{diff}, k)\, p(\text{diff} \mid \text{real}, k) + p(\text{false alarm} \mid \text{real}, \text{pred}, \text{same}, k)\, p(\text{same} \mid \text{real}, k)$
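A minimal sketch of this pairwise error measure in the style of the P_k metric; the sentence-level labels and the default window size are my assumptions, and the paper's exact windowing may differ:

```python
def segmentation_error(real, pred, k=None):
    """Probability that a pair of positions k apart is segmented
    inconsistently: a miss when real says 'diff' but pred says 'same',
    a false alarm when real says 'same' but pred says 'diff'."""
    n = len(real)
    if k is None:
        k = max(1, round(n / (2 * len(set(real)))))  # half the mean segment length
    pairs = [(real[i] == real[i + k], pred[i] == pred[i + k])
             for i in range(n - k)]
    miss = sum(1 for r, p in pairs if not r and p)
    false_alarm = sum(1 for r, p in pairs if r and not p)
    return (miss + false_alarm) / len(pairs)

# real/pred give each sentence's segment id
print(segmentation_error([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]))  # 0.5
```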
Experiment 1 • 700 samples • Each sample is a concatenation of ten segments; each segment consists of the first n sentences of a document selected randomly from the Brown corpus
Experiment 2: shared topic detection • 80 news articles from Google News • 8 topics, each with 10 articles • [Figure: for a pair of documents, the similarity measure is based on the sets of sentences assigned to the same topic]
Experiment 3: multi-document segmentation • The data set consists of the introduction parts of lab reports selected from the corpus of Biol 240W, Pennsylvania State University • 102 samples and 2264 sentences • Each sample has two segments • Sample lengths range from 2 to 56 sentences
Conclusion • This paper proposed a novel method for multi-document topic segmentation and alignment based on weighted mutual information, which can also handle the single-document case • Future work: natural segmentation cues such as paragraphs are hints that can be used to find the optimal boundaries • Supervised learning
System design • [Figure: system pipeline — term frequencies of words in the training documents and their titles, together with term frequencies of words in the new input document, feed the NBL and NBF systems, which estimate P(T|W) and P(W) and output the top K words as the generated title]
NBL (Naïve Bayesian Title Generation with Limited Vocabulary) • From the training corpus, count the occurrences of each document-word/title-word pair whose document word and title word are the same • This uses the term frequency of each word w in the training documents and the term frequency of w in the corresponding titles
NBL (Naïve Bayesian Title Generation with Limited Vocabulary) • To generate a title for a new document: the more often w occurs in the document, the more important it is for the document's title (see the sketch below)
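A hedged sketch of NBL as just described: estimate, for each word seen in both a document and its title, the probability that it reappears as a title word, then score candidate words for a new document by that probability times their document frequency. The function names and the unsmoothed estimates are my simplifications:

```python
from collections import Counter

def train_nbl(corpus):
    """corpus: list of (document_words, title_words) pairs. Returns an
    estimate of P(w appears as a title word | w appears as a document word),
    defined only for words seen in both roles (the limited vocabulary)."""
    doc_tf, both_tf = Counter(), Counter()
    for doc_words, title_words in corpus:
        doc_tf.update(doc_words)                  # tf of w in the documents
        title_set = set(title_words)
        both_tf.update(w for w in doc_words if w in title_set)
    return {w: both_tf[w] / doc_tf[w] for w in both_tf}

def nbl_title(p_title, new_doc_words, k=6):
    """The more often w occurs in the new document, the higher it scores;
    return the top-k words as the generated title."""
    tf = Counter(w for w in new_doc_words if w in p_title)
    return sorted(tf, key=lambda w: p_title[w] * tf[w], reverse=True)[:k]
```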
NBF (Naïve Bayesian Approach with Full Vocabulary) • Count the occurrences of all document-word/title-word pairs, where the document word is dw and the title word is tw • Normalize the counts into the probability P(tw | dw) • Combine these probabilities with the word frequencies of the new document to score every possible title word (see the sketch below)
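A matching sketch of NBF: count every (document word, title word) pair, normalize into P(tw | dw), and combine with the new document's term frequencies. The names and the simple normalization are again my assumptions:

```python
from collections import Counter, defaultdict

def train_nbf(corpus):
    """Estimate P(tw | dw) over the full vocabulary by counting every
    document-word/title-word pair in the training corpus."""
    pair_counts, dw_totals = defaultdict(Counter), Counter()
    for doc_words, title_words in corpus:
        for dw in set(doc_words):
            pair_counts[dw].update(title_words)
            dw_totals[dw] += len(title_words)
    return {dw: {tw: c / dw_totals[dw] for tw, c in cnts.items()}
            for dw, cnts in pair_counts.items()}

def nbf_title(p_tw_given_dw, new_doc_words, k=6):
    """Score each possible title word by summing P(tw | dw) weighted by
    how often dw occurs in the new document; keep the top-k words."""
    tf = Counter(w for w in new_doc_words if w in p_tw_given_dw)
    scores = Counter()
    for dw, f in tf.items():
        for tw, p in p_tw_given_dw[dw].items():
            scores[tw] += f * p
    return [tw for tw, _ in scores.most_common(k)]
```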