A Statistical Model for Domain-Independent Text Segmentation Masao Utiyama and Hitoshi Isahara Presentation by Matthew Waymost
Introduction • The algorithm finds the maximum-probability segmentation of a text using a statistical method. • No training is required. • The method is domain-independent.
Other Methods • Lexical cohesion • Statistical, e.g., the hidden Markov model (Yamron et al., 1998)
Statistical Model • Find the probability of a segmentation S given a text W. • Use Bayes' rule to find the maximum-probability segmentation, as shown below.
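Since Pr(W) is constant for a given text, maximizing Pr(S|W) reduces to maximizing the product of the likelihood and the prior:

Pr(S|W) = Pr(W|S) Pr(S) / Pr(W), so S* = argmax_S Pr(W|S) Pr(S)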
Definition of Pr(W|S) • Assume topics are statistically independent of one another, and that words are independent within the scope of a topic. • Assume different topics have different word distributions. • Pr(W|S) then breaks down into a double product of word probabilities across segments, shown below. • A Laplace estimator is used to estimate word probabilities.
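Following the paper's formulation, for a segmentation S = S_1 ... S_m of a text W = w_1 ... w_n:

Pr(W|S) = Π_{i=1..m} Π_{j=1..n_i} Pr(w_ij | S_i), with Pr(w_ij | S_i) = (f_i(w_ij) + 1) / (n_i + k)

where n_i is the number of words in segment S_i, f_i(w) is the frequency of w in S_i, and k is the number of distinct words in W (the Laplace estimator).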
Definition of Pr(S) • Varies depending on the prior information available. • In general, assume no prior information; the paper's default prior is shown below. • Penalizes segmentations with many segments, counteracting Pr(W|S), which on its own favors many short, homogeneous segments.
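With no prior information, the paper uses a description-length prior, under which each segment costs log n bits to describe:

Pr(S) = n^(-m), i.e. -log Pr(S) = m log n

where n is the number of words in W and m is the number of segments, so every additional segment adds log n to the total cost.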
Algorithm • Convert the probability function into a cost function by taking the negative logarithm. • Given a text W of n words, define g_i to be the gap between words w_i and w_{i+1}. • Create a directed graph whose nodes are the gaps between words; an edge from g_i to g_j covers the segment of words between those two gaps. • Weight every edge with the cost function and find the minimum-cost path from the first node to the last.
Algorithm • The edges on the minimum-cost path correspond to the segments of the minimum-cost segmentation; a sketch of this dynamic program follows below.
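A minimal Python sketch of the search, under the assumptions above: the text is a list of word tokens, edge costs use the Laplace-smoothed likelihood plus the log n prior term, and the minimum-cost path is found by dynamic programming over the gaps (the graph is acyclic). The `allowed` parameter is a hypothetical extension for restricting boundary positions, used on the next slide.

```python
import math
from collections import Counter

def segment(words, k=None, allowed=None):
    """Sketch of the minimum-cost segmentation search.

    Nodes are the n + 1 gaps around the n words; an edge (i, j)
    represents the segment words[i:j]. Edge cost is the segment's
    negative log-likelihood under the Laplace estimator plus log(n)
    for the n^(-m) prior. `allowed` (a hypothetical extension) limits
    which gaps may act as segment boundaries.
    """
    n = len(words)
    k = k if k is not None else len(set(words))  # vocabulary size of W
    allowed = set(range(n + 1)) if allowed is None else set(allowed) | {0, n}

    def edge_cost(i, j):
        seg = words[i:j]
        freq = Counter(seg)
        n_i = len(seg)
        # -log Pr(w | segment), summed over the segment's tokens,
        # plus log(n) for this segment's share of the prior
        return sum(-math.log((freq[w] + 1) / (n_i + k)) for w in seg) + math.log(n)

    # Dynamic programming over gaps: best[j] = cost of the cheapest
    # segmentation of words[:j]; back[j] = previous boundary on that path.
    best = [math.inf] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for j in range(1, n + 1):
        if j not in allowed:
            continue
        for i in range(j):
            if i not in allowed or best[i] == math.inf:
                continue
            c = best[i] + edge_cost(i, j)
            if c < best[j]:
                best[j], back[j] = c, i

    # Walk the back pointers to recover the segment-ending gap indices.
    bounds, j = [], n
    while j > 0:
        bounds.append(j)
        j = back[j]
    return sorted(bounds)
```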
Algorithm – Features • Determines the number of segments automatically, but the number of edges in the shortest path (and hence the number of segments) can also be specified. • Segmentation positions can be constrained by using only the subset of edges whose endpoints meet user-specified conditions, as in the example below. • The algorithm is insensitive to text length. • Useful for summarization.
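For example, using the hypothetical `allowed` parameter from the sketch above to permit boundaries only at sentence-final gaps:

```python
# Hypothetical usage: boundaries may fall only after a sentence-final
# period, so no segment ever splits a sentence.
words = "the cat sat . it slept . stocks fell . markets closed .".split()
sentence_ends = {i + 1 for i, w in enumerate(words) if w == "."}
print(segment(words, allowed=sentence_ends))
```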
Algorithm – Evaluation • The algorithm was compared against C99 (Choi, 2000). • An artificial test corpus extracted from the Brown corpus was used. • A probabilistic error metric was used to evaluate performance (sketched below). • The results of the Utiyama algorithm were significantly better than those of the Choi algorithm at the 1% level.
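A minimal sketch of a probabilistic error metric in the style of Beeferman et al. (1999), assuming segmentations are given as lists of segment lengths; the exact metric and window width used in the paper may differ:

```python
def pk(reference, hypothesis, k=None):
    """Sketch of a probabilistic segmentation error metric (Pk-style).

    reference/hypothesis are lists of segment lengths over the same
    text. A window of width k slides over the text; the error is the
    fraction of windows whose two endpoints are classified differently
    (same segment vs. different segments) by the two segmentations.
    """
    def labels(lengths):
        # map each word position to the index of its segment
        return [s for s, length in enumerate(lengths) for _ in range(length)]

    ref, hyp = labels(reference), labels(hypothesis)
    n = len(ref)
    if k is None:  # conventional default: half the mean segment length
        k = max(1, round(n / (2 * len(reference))))
    disagree = sum(
        (ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k])
        for i in range(n - k)
    )
    return disagree / (n - k)
```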
Algorithm – Evaluation • Assessment of the algorithm on real texts is still needed. • Advantages over the HMM approach • No training is required (which implies domain independence). • Probabilistic prior information can be incorporated into the model. • The method might be extensible to detecting word descriptions in text.