120 likes | 245 Views
Draft of segmentation. Haei-Ming Chu 2004/08/09. Reference. Topic Detection and Tracking Pilot Study Final Report James Allan, Jaime Carbonell y , George Doddington z , Jonathan Yamron x , and Yiming Yang y UMass Amherst, yCMU, zDARPA, xDragon Systems, and yCMU 1997
E N D
Draft of segmentation Haei-Ming Chu 2004/08/09
Reference • Topic Detection and Tracking Pilot Study Final ReportJames Allan, Jaime Carbonelly, George Doddingtonz, Jonathan Yamronx, and Yiming YangyUMass Amherst, yCMU, zDARPA, xDragon Systems, and yCMU1997 • A Critique and Improvement of an Evaluation Metric for Text SegmentationLev Pevzne Marti A. HearstyHarvard University University of California, BerkeleyComputational Linguistics 2002 p.19~p.36 • Chinese Spoken Document Segmentation withConsideration of Features, Language Models and ExtraInformation - Examples using Broadcast NewNTU2003-陳家甫
IntroductionThe tasks definition in the TDT final report • The Segmentation Task • Segmenting a continuous stream of text into its constituent stories • The Detection Task • Retrospective Event Detection • Identify all of the events in a corpus of stories • On-line New event Detection • The Tracking Task • Associating incoming stories with events known to the system
IntroductionSegmentation • Automatically dividing a text stream into topically homogeneous blocks • Segmentation is therefore an “enabling” technology for other applications. • Segmenter which is designed for this task will find story boundaries
IntroductionCorpus • Training data • 40 hours公視新聞 人工轉寫文字檔 • 採用 “,”及 “。”做為斷句依據 • 只取長度大於 4 及小於 40 的句子 • 共23945句 • Testing data • 80 hours公視新聞人工轉寫文字檔 • 共58090句
IntroductionEvaluation • Window difference • b(i,j) represents the number of boundaries between positions i and j in the text • N represent the number of sentences in the text • Slightly affected by variation of segment size distribution • With less penalty in Near-Miss Error
MethodHMM based segmentation • Based on the Dragon approach
MethodHMM based segmentation • Training topic language model • Clustering the training data to 10 clusters (topics) • Using K-means Clustering with cosine measure and 20 iterations • Get the character unigram of each topics • Feature Selection • 中文詞庫-character count >9 and <1000 • total 2872 characters
MethodHMM based segmentation • Testing • Using Viterbi algorithm to find the maximum topic sequences • Using LogAdd() avoid the under flow • Set the unknown character probability in the sentence in testing data LSMALL (-0.5E10)
Future Wordother models • K-means algorithm modification • Language model smoothing • Another baseline : sentence similarity • PLSI model • Relevance model or Translation model • Survey method used in the segmentation or tracking