Draft of segmentation

Draft of segmentation Haei-Ming Chu 2004/08/09

Reference • Topic Detection and Tracking Pilot Study Final ReportJames Allan, Jaime Carbonelly, George Doddingtonz, Jonathan Yamronx, and Yiming YangyUMass Amherst, yCMU, zDARPA, xDragon Systems, and yCMU1997 • A Critique and Improvement of an Evaluation Metric for Text SegmentationLev Pevzne Marti A. HearstyHarvard University University of California, BerkeleyComputational Linguistics 2002 p.19~p.36 • Chinese Spoken Document Segmentation withConsideration of Features, Language Models and ExtraInformation － Examples using Broadcast NewNTU2003-陳家甫

IntroductionThe tasks definition in the TDT final report • The Segmentation Task • Segmenting a continuous stream of text into its constituent stories • The Detection Task • Retrospective Event Detection • Identify all of the events in a corpus of stories • On-line New event Detection • The Tracking Task • Associating incoming stories with events known to the system

IntroductionSegmentation • Automatically dividing a text stream into topically homogeneous blocks • Segmentation is therefore an “enabling” technology for other applications. • Segmenter which is designed for this task will find story boundaries

IntroductionCorpus • Training data • 40 hours公視新聞人工轉寫文字檔 • 採用 “，”及 “。”做為斷句依據 • 只取長度大於 4 及小於 40 的句子 • 共23945句 • Testing data • 80 hours公視新聞人工轉寫文字檔 • 共58090句

IntroductionEvaluation • Window difference • b(i,j) represents the number of boundaries between positions i and j in the text • N represent the number of sentences in the text • Slightly affected by variation of segment size distribution • With less penalty in Near-Miss Error

MethodHMM based segmentation • Based on the Dragon approach

MethodHMM based segmentation

MethodHMM based segmentation • Training topic language model • Clustering the training data to 10 clusters (topics) • Using K-means Clustering with cosine measure and 20 iterations • Get the character unigram of each topics • Feature Selection • 中文詞庫-character count >9 and <1000 • total 2872 characters

MethodHMM based segmentation • Testing • Using Viterbi algorithm to find the maximum topic sequences • Using LogAdd() avoid the under flow • Set the unknown character probability in the sentence in testing data LSMALL (-0.5E10)

Experiment ResultHMM based segmentation

Future Wordother models • K-means algorithm modification • Language model smoothing • Another baseline : sentence similarity • PLSI model • Relevance model or Translation model • Survey method used in the segmentation or tracking

Draft of segmentation

Draft of segmentation

Presentation Transcript

Levels of Market Segmentation

Segmentation

Data Segmentation for Privacy SATVA Pilot Brief - draft

Segmentation

Draft of layouts

Evaluation of segmentation

Segmentation

Segmentation

SEGMENTATION

Types of Segmentation

Segmentation

Segmentation

Segmentation

SEGMENTATION

Draft of layouts

Segmentation

Segmentation

Concept of market segmentation