The use of unlabeled data to improve supervised learning for text summarization MR Amini, P Gallinari (SIGIR 2002) Slides prepared by Jon Elsas for the Semi-supervised NL Learning Reading Group
Presentation Outline • Overview of Document Summarization • Major contribution: Semi-Supervised Logistic Classification EM (CEM) for extract summaries • Evaluation • Baseline Systems • Results
Document Summarization • Motivation: [text volume] >> [user’s time] • Single Document Summarization: • Used for display of search results, automatic ‘abstracting’, browsing, etc. • Multi-Document Summarization: • Describe clusters & document collections, QA, etc. • Problem: What is the summary used for? Does a generic summary exist?
Document Summarization • Generative Summaries: • Synthetic text produced after analysis of high-level linguistic features: discourse, semantics, etc. • Hard. • Extract Summaries: • Text excerpts (usually sentences) composed together to create a summary • Boils down to a passage classification/ranking problem (sketched below)
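Framed as ranking, an extract summarizer needs only a sentence scorer plus a top-k selection step. A minimal Python sketch (the scorer here is a toy placeholder, not the paper's model):

```python
def extract_summary(sentences, score, k=3):
    """Generic extractive summarization: score every sentence,
    keep the top-k, and emit them in original document order."""
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:k])  # restore document order
    return " ".join(sentences[i] for i in chosen)

# Toy scorer: prefer longer sentences (stand-in for a learned classifier).
docs = ["Short one.", "A much longer, more informative sentence.",
        "A mid-length sentence here."]
print(extract_summary(docs, score=lambda s: len(s.split()), k=2))
```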
Major Contribution • Semi-supervised Logistic Classifying Expectation Maximization (CEM) for passage classification • Advantage over other methods: • Works on a small set of labeled data plus a large set of unlabeled data • No modeling assumptions for density estimation • Cons: • (probably) slow; no runtime numbers are given
Expectation Maximization (EM) • Finds maximum-likelihood estimates of parameters when the underlying distribution depends on unobserved latent variables. • Maximizes model fit to the data distribution • Criterion function:
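In standard mixture-model notation (assumed here: mixing proportions \pi_k, component densities f(\cdot;\theta_k), parameters \Theta), the EM criterion is the incomplete-data log-likelihood:

```latex
L(\Theta) \;=\; \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k \, f(x_i ; \theta_k)
```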
Classifying EM (CEM) • Like EM, with the addition of an indicator variable for component membership. • Maximizes ‘quality’ of clustering • Criterion function:
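With hard indicators t_{ik} \in \{0,1\} (t_{ik} = 1 iff x_i is assigned to component k), CEM maximizes the classification log-likelihood; this is the standard form, reconstructed here rather than copied from the slide:

```latex
L_C(t, \Theta) \;=\; \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik} \, \log\bigl(\pi_k \, f(x_i ; \theta_k)\bigr)
```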
Semi-supervised generative-CEM • Fix component membership for labeled data. • Criterion function (one term over the labeled data, one over the unlabeled data, as shown below):
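Writing t_{ik} for the indicators fixed by the labels (examples 1..n) and \tilde{t}_{ik} for the indicators re-estimated in the C-step (examples n+1..n+m), the criterion takes the form (a reconstruction in the same notation as above):

```latex
L_C(t, \Theta) \;=\;
\underbrace{\sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik} \log\bigl(\pi_k f(x_i;\theta_k)\bigr)}_{\text{labeled data}}
\;+\;
\underbrace{\sum_{i=n+1}^{n+m} \sum_{k=1}^{K} \tilde{t}_{ik} \log\bigl(\pi_k f(x_i;\theta_k)\bigr)}_{\text{unlabeled data}}
```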
Semi-supervised logistic-CEM • Use a discriminative classifier (logistic) instead of a generative model. • M-step: re-run gradient descent to estimate the β's. • The criterion again splits into a labeled-data term and an unlabeled-data term, as shown below.
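For the two-class case, with logistic output G_\beta(x) = 1 / (1 + e^{-\beta^\top x}) modeling the class posterior directly, the criterion takes the shape of a classification log-likelihood over both data sets (a reconstruction; the paper's exact notation may differ):

```latex
L_C(t, \beta) \;=\;
\sum_{i=1}^{n} \Bigl[\, t_i \log G_\beta(x_i) + (1 - t_i) \log\bigl(1 - G_\beta(x_i)\bigr) \Bigr]
\;+\;
\sum_{i=n+1}^{n+m} \Bigl[\, \tilde{t}_i \log G_\beta(x_i) + (1 - \tilde{t}_i) \log\bigl(1 - G_\beta(x_i)\bigr) \Bigr]
```

Procedurally, the algorithm alternates a hard C-step on the unlabeled pool with a logistic refit. A minimal Python sketch (scikit-learn's LogisticRegression stands in for the paper's gradient-descent M-step; the function name is an assumption):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def logistic_cem(X_lab, y_lab, X_unlab, max_iters=20):
    """Sketch of semi-supervised logistic-CEM: labeled indicators
    stay fixed; unlabeled indicators are re-assigned (C-step) and
    the logistic model is refit (M-step) until the partition is stable."""
    clf = LogisticRegression()
    clf.fit(X_lab, y_lab)              # initialize beta on labeled data only
    y_unlab = clf.predict(X_unlab)     # initial hard assignments
    for _ in range(max_iters):
        # M-step: refit on labeled + currently assigned unlabeled data.
        X = np.vstack([X_lab, X_unlab])
        y = np.concatenate([y_lab, y_unlab])
        clf.fit(X, y)
        # C-step: re-assign each unlabeled point to its most likely class.
        new_y = clf.predict(X_unlab)
        if np.array_equal(new_y, y_unlab):   # partition stable -> converged
            break
        y_unlab = new_y
    return clf
```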
Evaluation • Algorithm evaluated against 3 other single-document summarization algorithms • Non-trainable System: passage ranking • Trainable System: Naïve Bayes sentence classifier • Generative-CEM (using full Gaussians) • Precision/Recall with regard to gold-standard extract summaries • The fine print: • All systems used *similar* representation schemes, but not the same…
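Concretely, if S is the set of sentences a system extracts and G is the gold-standard extract for the same document, precision and recall reduce to set overlap (standard definitions, stated here for completeness):

```latex
\text{Precision} = \frac{|S \cap G|}{|S|}, \qquad \text{Recall} = \frac{|S \cap G|}{|G|}
```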
Baseline System: Sentence Ranking • Rank sentences using a TF-IDF similarity measure with query expansion (Sim2) • Blind relevance feedback from the top sentences • WordNet similarity thesaurus • Generic query created with the most frequent words in the training set.
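A rough Python sketch of this pipeline (the cosine TF-IDF scoring and the single feedback round are generic stand-ins; the paper's exact Sim2 formula and its WordNet-based expansion are not reproduced here):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_sentences(sentences, query, feedback_k=5):
    """Rank sentences by TF-IDF similarity to the query, with one
    round of blind relevance feedback from the top-scoring sentences."""
    vec = TfidfVectorizer()
    S = vec.fit_transform(sentences)
    scores = cosine_similarity(vec.transform([query]), S).ravel()
    # Blind relevance feedback: expand the query with the top-k sentences.
    top = scores.argsort()[::-1][:feedback_k]
    expanded = query + " " + " ".join(sentences[i] for i in top)
    return cosine_similarity(vec.transform([expanded]), S).ravel()
```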
Naïve Bayes Model: Sentence Classification Simple Naïve Bayes classifier trained on 5 features: • Sentence length < tlength {0,1} • Sentence contains ‘cue words’ {0,1} • Sentence query similarity (Sim2) > tsim {0,1} • Upper-case/Acronym features (count?) • Sentence/paragraph position in text {1, 2, 3}
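A minimal version of such a classifier over the four binary features (BernoulliNB and the toy rows below are illustrative assumptions, not the paper's data; the categorical position feature would need one-hot encoding):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Columns: [short_sentence, has_cue_word, query_similar, has_acronym]
X_train = np.array([[1, 1, 1, 0],
                    [0, 0, 0, 1],
                    [1, 0, 1, 0]])
y_train = np.array([1, 0, 1])   # 1 = sentence belongs in the summary

nb = BernoulliNB()
nb.fit(X_train, y_train)
print(nb.predict_proba([[1, 1, 0, 0]]))  # [P(not summary), P(summary)]
```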
Logistic-CEM: Sentence Representation Features Features used to train Logistic-CEM: • Normalized sentence length [0, 1] • Normalized ‘cue word’ frequency [0, 1] • Sentence Query Similarity (Sim2) [0, ∞) • Normalized acronym frequency [0, 1] • Sentence/paragraph position in text {1, 2, 3} (All of the binary features converted to continuous.)
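A sketch of how these continuous features might be computed per sentence (the normalizations and the sim_to_query helper are assumptions; the paper's exact definitions are not reproduced):

```python
def sentence_features(sent, max_sent_len, cue_words, sim_to_query, position):
    """Continuous feature vector mirroring the five features above.
    `sim_to_query` is an external Sim2-style scorer; `position`
    is 1, 2, or 3 (beginning / middle / end of the text)."""
    tokens = sent.split()
    n = max(len(tokens), 1)
    return [
        len(tokens) / max_sent_len,                           # norm. length
        sum(t.lower() in cue_words for t in tokens) / n,      # cue-word freq.
        sim_to_query(sent),                                   # query similarity
        sum(t.isupper() and len(t) > 1 for t in tokens) / n,  # acronym freq.
        position,                                             # {1, 2, 3}
    ]
```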