A Statistical Model for Domain-Independent Text Segmentation Masao Utiyama and Hitoshi Isahara Presentation by Matthew Waymost
Introduction • The algorithm finds the maximum-probability segmentation of a text using a statistical method. • No training is required. • The method is domain-independent.
Other Methods • Lexical cohesion • Statistical, e.g., the hidden Markov model (Yamron et al., 1998)
Statistical Model • Find the probability of a segmentation S given a text W. • Use Bayes' rule to find the maximum-probability segmentation, as shown below.
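Since Pr(W) is constant for a given text, maximizing Pr(S|W) reduces to maximizing the product of the likelihood and the prior:

Pr(S|W) = Pr(W|S) Pr(S) / Pr(W), so S* = argmax_S Pr(W|S) Pr(S)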
Definition of Pr(W|S) • Assume topics are statistically independent of one another, and that words are independent within the scope of a topic. • Assume different topics have different word distributions. • Pr(W|S) then breaks down into a double product of word probabilities across segments, shown below. • A Laplace estimator is used to estimate word probabilities.
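Following the paper's formulation, for a segmentation S = S_1 ... S_m of a text W = w_1 ... w_n:

Pr(W|S) = Π_{i=1..m} Π_{j=1..n_i} Pr(w_ij | S_i), with Pr(w_ij | S_i) = (f_i(w_ij) + 1) / (n_i + k)

where n_i is the number of words in segment S_i, f_i(w) is the frequency of w in S_i, and k is the number of distinct words in W (the Laplace estimator).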
Definition of Pr(S) • Varies depending on the prior information available. • In general, assume no prior information; the paper's default prior is shown below. • Penalizes segmentations with many segments, counteracting Pr(W|S), which on its own favors many short, homogeneous segments.
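With no prior information, the paper uses a description-length prior, under which each segment costs log n bits to describe:

Pr(S) = n^(-m), i.e. -log Pr(S) = m log n

where n is the number of words in W and m is the number of segments, so every additional segment adds log n to the total cost.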
Algorithm • Convert the probability function into a cost function by taking the negative logarithm. • Given a text W of n words, define g_i to be the gap between words w_i and w_{i+1}. • Create a directed graph whose nodes are the gaps between words; an edge from g_i to g_j covers the segment of words between those two gaps. • Weight every edge with the cost function and find the minimum-cost path from the first node to the last.
Algorithm • The edges on the minimum-cost path correspond to the segments of the minimum-cost segmentation; a sketch of this dynamic program follows below.
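A minimal Python sketch of the search, under the assumptions above: the text is a list of word tokens, edge costs use the Laplace-smoothed likelihood plus the log n prior term, and the minimum-cost path is found by dynamic programming over the gaps (the graph is acyclic). The `allowed` parameter is a hypothetical extension for restricting boundary positions, used on the next slide.

```python
import math
from collections import Counter

def segment(words, k=None, allowed=None):
    """Sketch of the minimum-cost segmentation search.

    Nodes are the n + 1 gaps around the n words; an edge (i, j)
    represents the segment words[i:j]. Edge cost is the segment's
    negative log-likelihood under the Laplace estimator plus log(n)
    for the n^(-m) prior. `allowed` (a hypothetical extension) limits
    which gaps may act as segment boundaries.
    """
    n = len(words)
    k = k if k is not None else len(set(words))  # vocabulary size of W
    allowed = set(range(n + 1)) if allowed is None else set(allowed) | {0, n}

    def edge_cost(i, j):
        seg = words[i:j]
        freq = Counter(seg)
        n_i = len(seg)
        # -log Pr(w | segment), summed over the segment's tokens,
        # plus log(n) for this segment's share of the prior
        return sum(-math.log((freq[w] + 1) / (n_i + k)) for w in seg) + math.log(n)

    # Dynamic programming over gaps: best[j] = cost of the cheapest
    # segmentation of words[:j]; back[j] = previous boundary on that path.
    best = [math.inf] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for j in range(1, n + 1):
        if j not in allowed:
            continue
        for i in range(j):
            if i not in allowed or best[i] == math.inf:
                continue
            c = best[i] + edge_cost(i, j)
            if c < best[j]:
                best[j], back[j] = c, i

    # Walk the back pointers to recover the segment-ending gap indices.
    bounds, j = [], n
    while j > 0:
        bounds.append(j)
        j = back[j]
    return sorted(bounds)
```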
Algorithm – Features • Determines the number of segments automatically, but the number of edges in the shortest path (and hence the number of segments) can also be specified. • Segmentation positions can be constrained by using only the subset of edges whose endpoints meet user-specified conditions, as in the example below. • The algorithm is insensitive to text length. • Useful for summarization.
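For example, using the hypothetical `allowed` parameter from the sketch above to permit boundaries only at sentence-final gaps:

```python
# Hypothetical usage: boundaries may fall only after a sentence-final
# period, so no segment ever splits a sentence.
words = "the cat sat . it slept . stocks fell . markets closed .".split()
sentence_ends = {i + 1 for i, w in enumerate(words) if w == "."}
print(segment(words, allowed=sentence_ends))
```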
Algorithm – Evaluation • The algorithm was compared against C99 (Choi, 2000). • An artificial test corpus extracted from the Brown corpus was used. • A probabilistic error metric was used to evaluate performance (sketched below). • The results of the Utiyama algorithm were significantly better than those of the Choi algorithm at the 1% level.
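A minimal sketch of a probabilistic error metric in the style of Beeferman et al. (1999), assuming segmentations are given as lists of segment lengths; the exact metric and window width used in the paper may differ:

```python
def pk(reference, hypothesis, k=None):
    """Sketch of a probabilistic segmentation error metric (Pk-style).

    reference/hypothesis are lists of segment lengths over the same
    text. A window of width k slides over the text; the error is the
    fraction of windows whose two endpoints are classified differently
    (same segment vs. different segments) by the two segmentations.
    """
    def labels(lengths):
        # map each word position to the index of its segment
        return [s for s, length in enumerate(lengths) for _ in range(length)]

    ref, hyp = labels(reference), labels(hypothesis)
    n = len(ref)
    if k is None:  # conventional default: half the mean segment length
        k = max(1, round(n / (2 * len(reference))))
    disagree = sum(
        (ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k])
        for i in range(n - k)
    )
    return disagree / (n - k)
```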
Algorithm – Evaluation • Assessment of the algorithm on real texts is still needed. • Advantages over the HMM approach • No training is required (which implies domain independence). • Probabilistic prior information can be incorporated into the model. • The method might be extensible to detecting word descriptions in text.