Document Summarization using Conditional Random Fields Dou Shen, Jian-Tao Sun, Hua Li, Qiang Yang, Zheng Chen IJCAI 2007. Hao-Chin Chang, Department of Computer Science & Information Engineering, National Taiwan Normal University, 2011/09/05
Outline • Introduction • CRF-based Summarization • Experiments and Result • Conclusion and Future work
Introduction(1/2) • Text document summarization has attracted much attention since the original work by Luhn (1958) • Text mining tasks such as document classification [Shen 2004] • Help readers to catch the main points of a long document with less effort • Summarization tasks can be grouped into different categories • Input • Single document summary • Multiple documents summary • Purpose • Generic summary • Query-oriented summary [Goldstein 1999] • Output [Mani 1999] • Extractive summary • Abstractive summary
Introduction(2/2) • Extractive document summarization • Supervised algorithms • treat summarization as a two-class classification problem • classify each sentence individually without leveraging the relationship among sentences • Unsupervised algorithms • use heuristic rules to select the most informative sentences into a summary directly, which are hard to generalize • Conditional Random Fields (CRF) • avoid both disadvantages • treat summarization as a sequence labeling problem instead of a simple classification problem • avoid the failure of generative models, which often cannot predict the label sequence given the observation sequence because they inappropriately use a generative joint model P(D|S) to solve a discriminative conditional problem when observations are given
CRF-based Summarization(1/3) • Observation sequence (sentence sequence): X = (x_1, x_2, …, x_N) • Corresponding state sequence (label sequence): Y = (y_1, y_2, …, y_N), where y_i ∈ {0, 1} marks whether sentence x_i belongs in the summary • The probability of Y conditioned on X defined in CRF: P(Y|X) = (1/Z(X)) exp( Σ_i Σ_k λ_k f_k(y_{i-1}, y_i, X, i) ), where Z(X) is the normalization constant • Feature functions f_k(y_{i-1}, y_i, X, i) • Weights λ_k
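The conditional probability P(Y|X) can be made concrete with a brute-force sketch that enumerates all 2^N label sequences to compute the normalizer Z(X) — feasible only for short documents, and the feature functions and weights below are toy assumptions, not the paper's:

```python
import itertools
import math

def crf_probability(weights, feature_fns, x, y):
    """P(Y|X) for a linear-chain CRF, computed by brute force.

    weights:     list of lambda_k
    feature_fns: list of f_k(y_prev, y_cur, x, i)
    x:           observation sequence (here: sentences)
    y:           candidate label sequence (0 = skip, 1 = include)
    """
    def score(labels):
        # Unnormalized log-score: sum of weighted features over all positions.
        s, prev = 0.0, None
        for i, cur in enumerate(labels):
            for lam, f in zip(weights, feature_fns):
                s += lam * f(prev, cur, x, i)
            prev = cur
        return s

    # Partition function Z(X): sum over all 2^N label sequences.
    z = sum(math.exp(score(labels))
            for labels in itertools.product([0, 1], repeat=len(x)))
    return math.exp(score(y)) / z
```

With toy features such as "sentence is included" and "two adjacent sentences are both included", the probabilities over all label sequences sum to one, as the definition requires.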
CRF-based Summarization(2/3) • λ = {λ_1, λ_2, …} is the set of weights in a CRF model • λ is usually estimated by a maximum-likelihood procedure on the training data • To avoid overfitting, some regularization methods add Gaussian priors over the weights, with the variances of the priors controlling the penalty
CRF-based Summarization(3/3) • Given probability CRF and paremeters, the most probable labeling sequence can be obtained as • We can order the sentences based on andselect the top ones into the summary • Forward value • Back value
Experiment(1/5) • Basic features • Position • Thematic words: the most frequent words • Upper-case words: words the authors want to emphasize • Similarity to neighboring sentences • Complex features • LSA score • HITS score: the document must be treated as a graph
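The basic features above can be sketched per sentence as follows; the tokenization, the thematic-word cutoff `top_k`, and the Jaccard similarity used for neighboring sentences are assumptions of this sketch, not the paper's exact definitions:

```python
import re
from collections import Counter

def basic_features(sentences, top_k=10):
    """Sketch of per-sentence basic features for extractive summarization."""
    tok = lambda s: re.findall(r"[A-Za-z']+", s.lower())
    counts = Counter(w for s in sentences for w in tok(s))
    thematic = {w for w, _ in counts.most_common(top_k)}  # most frequent words

    def sim(a, b):
        # Jaccard word overlap as a simple sentence-similarity proxy.
        wa, wb = set(tok(a)), set(tok(b))
        return len(wa & wb) / max(1, len(wa | wb))

    feats = []
    for i, s in enumerate(sentences):
        words, raw = tok(s), s.split()
        feats.append({
            "position": i / max(1, len(sentences) - 1),     # relative position
            "thematic": sum(w in thematic for w in words),  # thematic-word count
            "upper": sum(w[0].isupper() for w in raw),      # upper-case words
            "sim_prev": sim(s, sentences[i - 1]) if i > 0 else 0.0,
            "sim_next": sim(s, sentences[i + 1]) if i + 1 < len(sentences) else 0.0,
        })
    return feats
```

In the CRF these values would be plugged into the feature functions f_k alongside the labels.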
Experiment(2/5) • 147 document–summary pairs from the Document Understanding Conference (DUC) 2001 • Supervised methods • Naive Bayes (NB) • Logistic Regression (LR) • Support Vector Machine (SVM) • Hidden Markov Model (HMM) • Conditional Random Fields (CRF) • Unsupervised methods • RANDOM: select sentences randomly from the document • LEAD: select the lead sentence of each paragraph • LSA • Graph-based ranking algorithms such as HITS
Experiment(3/5) • RANDOM is the worst; CRF is the best • HMM and LR improve the performance compared to NB due to the advantage of leveraging sequential information • CRF makes a further improvement of 8.4% and 11.1% over both HMM and LR in terms of ROUGE-2 and F1 • CRF outperforms HITS by 5.3% and 5.7% in terms of ROUGE-2 and F1
Experiment(4/5) • CRF is still the best method; it improves the ROUGE-2 and F1 values achieved by the best baselines by more than 7.1% and 8.8% • Compared with the best unsupervised method, HITS, the CRF based on both kinds of features improves the performance by 12.1% and 13.9% in terms of ROUGE-2 and F1 • We also compared CRF to a linear combination method that combines the results of LSA, HITS, and CRF based only on the basic features; the best result we can obtain on DUC01 is 0.458 and 0.392 in terms of ROUGE-2 and F1
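ROUGE-2, the metric used in these comparisons, counts bigram overlap between a system summary and a reference summary. A simplified recall-only sketch (whitespace tokenization is an assumption here; the official ROUGE toolkit additionally supports stemming and stopword removal):

```python
from collections import Counter

def rouge2_recall(candidate, reference):
    """Fraction of the reference's bigrams also found in the candidate."""
    def bigrams(text):
        toks = text.lower().split()
        return Counter(zip(toks, toks[1:]))  # clipped bigram counts

    cand, ref = bigrams(candidate), bigrams(reference)
    overlap = sum(min(cand[b], n) for b, n in ref.items())
    return overlap / max(1, sum(ref.values()))
```

A candidate identical to the reference scores 1.0, and one sharing no bigrams scores 0.0.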
Experiment(5/5) • 10-fold cross-validation procedure, where one fold is used for training and the other nine folds for testing • We can obtain more precise model parameters with more training data • The gap between the CRF-based methods and the other four supervised methods is clearly larger when the size of the training data is small • The hidden states of HMM are not particularly relevant to the task of inferring the class labels • The bad performance of NB, LR, and SVM may be due to overfitting on a small amount of training data
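Note that this split is the reverse of the usual convention: each round trains on a single fold and tests on the remaining nine. A minimal sketch of generating such splits (the round-robin fold assignment is an assumption of this sketch):

```python
def one_train_rest_test_splits(items, k=10):
    """Yield (train, test) pairs where ONE fold trains and k-1 folds test."""
    folds = [items[i::k] for i in range(k)]  # round-robin fold assignment
    for i in range(k):
        train = folds[i]
        test = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

Each item appears in exactly one fold, so the train and test sets of every round are disjoint and together cover the whole corpus.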
Conclusion • We provided a framework that considers all available features, including the interactions between sentences • We plan to exploit more features, especially linguistic features not covered in this paper, such as rhetorical structures