
A Statistical Model for Domain-Independent Text Segmentation


Presentation Transcript


  1. A Statistical Model for Domain-Independent Text Segmentation Masao Utiyama and Hitoshi Isahara Presentation by Matthew Waymost

  2. Introduction • The algorithm finds the maximum-probability segmentation of a text using a statistical method. • No training is required. • Domain-independent.

  3. Other Methods • Lexical cohesion • Statistical methods, e.g., the hidden Markov model of Yamron et al. (1998)

  4. Statistical Model • Find the probability of a segmentation S given a text W. • Use Bayes' rule to find the maximum-probability segmentation.
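
Concretely, since Pr(W) is fixed for a given text, the maximization reduces to the product of the two terms defined on the next two slides (our rendering of the slide's description):

    argmax_S Pr(S|W) = argmax_S Pr(W|S) Pr(S) / Pr(W) = argmax_S Pr(W|S) Pr(S)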

  5. Definition of Pr(W|S) • Assume statistical independence of topics and of words within the scope of a topic. • Assume different topics have different word distributions. • Can be broken down into a double product of probabilities across words and segments. • Uses the Laplace estimator for word-frequency prediction.
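
Written out, the double product has the following shape. The sequential Laplace estimate shown here (occurrences so far, plus one, over position in the segment plus vocabulary size) is our reading of the slide, not a quotation from the paper; it is also what makes the edge costs on slide 7 cheap to compute incrementally:

    Pr(W|S) = prod_{i=1..m} prod_{j=1..n_i} (f_i(w_ij) + 1) / (j + k)

where m is the number of segments, n_i the length of segment i, w_ij the j-th word of segment i, f_i(w_ij) the number of earlier occurrences of w_ij within segment i, and k the number of distinct words in W.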

  6. Definition of Pr(S) • The definition varies with the available prior information. • In general, assume no prior information. • The prior prevents the algorithm from generating too many segments, counteracting Pr(W|S), which on its own favors many small, homogeneous segments.
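
One concrete no-prior-information choice, and the one we take the slide to mean, is a description-length prior that charges log n per segment:

    Pr(S) = n^(-m), so -log Pr(S) = m log n

where n is the number of words in W and m the number of segments; each extra segment then adds log n to the total cost.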

  7. Algorithm • Convert the probability function into a cost function by taking the negative log. • Given a text W, define g_i to be the gap between words w_i and w_(i+1). • Create a directed graph whose nodes are the gaps between words; an edge from g_i to g_j represents the candidate segment spanning the words between those gaps. • Compute all edge weights with the cost function and find the minimum-cost path from the first node to the last (see the sketch after slide 8).

  8. Algorithm • The resulting path gives the minimum-cost segmentation: each edge on the path corresponds to one segment.
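
A minimal Python sketch of slides 7-8, under the assumptions above (sequential Laplace estimator, Pr(S) = n^(-m) prior). The function name, the O(n^2) enumeration of edges, and the back-pointer bookkeeping are ours, not the authors':

    import math
    from collections import Counter

    def segment(words):
        """Minimum-cost segmentation of a word list (a sketch, not the
        authors' implementation).  Gap i sits between words[i-1] and
        words[i]; an edge from gap i to gap j (i < j) is the candidate
        segment words[i:j]."""
        if not words:
            return []
        n = len(words)
        k = len(set(words))               # vocabulary size of W
        prior = math.log(n)               # -log Pr(S) charged per segment
        INF = float("inf")
        best = [0.0] + [INF] * n          # best[j]: min cost of segmenting words[:j]
        back = [0] * (n + 1)              # back[j]: start gap of the last segment
        for i in range(n):                # segment start gap
            counts = Counter()            # word counts inside the growing segment
            cost = 0.0
            for j in range(i + 1, n + 1): # extend the segment one word at a time
                w = words[j - 1]
                # sequential Laplace estimate: (count so far + 1) / (position + k)
                cost -= math.log((counts[w] + 1) / (j - i + k))
                counts[w] += 1
                if best[i] + cost + prior < best[j]:
                    best[j] = best[i] + cost + prior
                    back[j] = i
        bounds, j = [], n                 # recover boundaries from back-pointers
        while j > 0:
            bounds.append(j)
            j = back[j]
        return bounds[::-1]               # gap indices of segment ends, left to right

On a toy input such as segment("cats purr cats sleep . dogs bark dogs run .".split()), the repeated vocabulary on each side should pull a boundary toward the middle gap, though the exact split depends on the estimator details assumed above.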

  9. Algorithm – Features • Determines the number of segments automatically, but the number can also be fixed by specifying the number of edges in the shortest path. • Can control where segmentation occurs by using only the subset of edges whose endpoint gaps meet user-specified conditions (see the sketch below). • The algorithm is insensitive to text length. • Well suited to summarization.
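
To illustrate the second bullet: restricting boundary placement only requires filtering which gaps may start or end a segment. The sketch below is a variant of segment() above; `allowed` is a hypothetical user-supplied set of gap indices (for instance, sentence-final gaps). Fixing the number of segments (the first bullet) would similarly add a segment-count index to the dynamic-programming table.

    import math
    from collections import Counter

    def segment_at(words, allowed):
        """Variant of segment(): interior boundaries may fall only at the
        gap indices in `allowed` (a hypothetical user-supplied set)."""
        n = len(words)
        k = len(set(words))
        ok = set(allowed) | {0, n}        # the text's ends are always boundaries
        INF = float("inf")
        best = [0.0] + [INF] * n
        back = [0] * (n + 1)
        for i in sorted(ok - {n}):        # segments may start only at allowed gaps
            counts, cost = Counter(), 0.0
            for j in range(i + 1, n + 1):
                w = words[j - 1]
                cost -= math.log((counts[w] + 1) / (j - i + k))
                counts[w] += 1
                # ... and may end only at allowed gaps
                if j in ok and best[i] + cost + math.log(n) < best[j]:
                    best[j], back[j] = best[i] + cost + math.log(n), i
        bounds, j = [], n
        while j > 0:
            bounds.append(j)
            j = back[j]
        return bounds[::-1]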

  10. Algorithm – Evaluation • The algorithm was compared against C99 (Choi, 2000). • An artificial test corpus extracted from the Brown corpus was used. • A probabilistic error metric was used to evaluate performance. • The Utiyama algorithm's results were significantly better than the Choi algorithm's at the 1% level.

  11. Algorithm – Evaluation • An assessment of the algorithm on real texts is still needed. • Advantages over the HMM approach: • No training required (which implies domain independence). • Probabilistic information can be incorporated into the model. • Might be extendable to detecting word descriptions in text.
