The use of unlabeled data to improve supervised learning for text summarization MR Amini, P Gallinari (SIGIR 2002) Slides prepared by Jon Elsas for the Semi-supervised NL Learning Reading Group
Presentation Outline • Overview of Document Summarization • Major contribution: Semi-Supervised Logistic Classification EM (CEM) for extract summaries • Evaluation • Baseline Systems • Results
Document Summarization • Motivation: [text volume] >> [user’s time] • Single Document Summarization: • Used for display of search results, automatic ‘abstracting’, browsing, etc. • Multi-Document Summarization: • Describe clusters & document collections, QA, etc. • Problem: What is the summary used for? Does a generic summary exist?
Document Summarization • Generative Summaries: • Synthetic text produced after analysis of high-level linguistic features: discourse, semantics, etc. • Hard. • Extract Summaries: • Text excerpts (usually sentences) composed together to create a summary • Boils down to a passage classification/ranking problem (sketched below)
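Framed as ranking, an extract summarizer needs only a sentence scorer plus a top-k selection step. A minimal Python sketch (the scorer here is a toy placeholder, not the paper's model):

```python
def extract_summary(sentences, score, k=3):
    """Generic extractive summarization: score every sentence,
    keep the top-k, and emit them in original document order."""
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:k])  # restore document order
    return " ".join(sentences[i] for i in chosen)

# Toy scorer: prefer longer sentences (stand-in for a learned classifier).
docs = ["Short one.", "A much longer, more informative sentence.",
        "A mid-length sentence here."]
print(extract_summary(docs, score=lambda s: len(s.split()), k=2))
```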
Major Contribution • Semi-supervised Logistic Classifying Expectation Maximization (CEM) for passage classification • Advantage over other methods: • Works on a small set of labeled data plus a large set of unlabeled data • No modeling assumptions for density estimation • Cons: • (probably) slow; no runtime numbers are given
Expectation Maximization (EM) • Finds maximum-likelihood estimates of parameters when the underlying distribution depends on unobserved latent variables. • Maximizes model fit to the data distribution • Criterion function:
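In standard mixture-model notation (assumed here: mixing proportions \pi_k, component densities f(\cdot;\theta_k), parameters \Theta), the EM criterion is the incomplete-data log-likelihood:

```latex
L(\Theta) \;=\; \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k \, f(x_i ; \theta_k)
```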
Classifying EM (CEM) • Like EM, with the addition of an indicator variable for component membership. • Maximizes ‘quality’ of clustering • Criterion function:
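With hard indicators t_{ik} \in \{0,1\} (t_{ik} = 1 iff x_i is assigned to component k), CEM maximizes the classification log-likelihood; this is the standard form, reconstructed here rather than copied from the slide:

```latex
L_C(t, \Theta) \;=\; \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik} \, \log\bigl(\pi_k \, f(x_i ; \theta_k)\bigr)
```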
Semi-supervised generative-CEM • Fix component membership for labeled data. • Criterion function (one term over the labeled data, one over the unlabeled data, as shown below):
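Writing t_{ik} for the indicators fixed by the labels (examples 1..n) and \tilde{t}_{ik} for the indicators re-estimated in the C-step (examples n+1..n+m), the criterion takes the form (a reconstruction in the same notation as above):

```latex
L_C(t, \Theta) \;=\;
\underbrace{\sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik} \log\bigl(\pi_k f(x_i;\theta_k)\bigr)}_{\text{labeled data}}
\;+\;
\underbrace{\sum_{i=n+1}^{n+m} \sum_{k=1}^{K} \tilde{t}_{ik} \log\bigl(\pi_k f(x_i;\theta_k)\bigr)}_{\text{unlabeled data}}
```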
Semi-supervised logistic-CEM • Use a discriminative classifier (logistic) instead of a generative model. • M-step: re-run gradient descent to estimate the β's. • The criterion again splits into a labeled-data term and an unlabeled-data term, as shown below.
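For the two-class case, with logistic output G_\beta(x) = 1 / (1 + e^{-\beta^\top x}) modeling the class posterior directly, the criterion takes the shape of a classification log-likelihood over both data sets (a reconstruction; the paper's exact notation may differ):

```latex
L_C(t, \beta) \;=\;
\sum_{i=1}^{n} \Bigl[\, t_i \log G_\beta(x_i) + (1 - t_i) \log\bigl(1 - G_\beta(x_i)\bigr) \Bigr]
\;+\;
\sum_{i=n+1}^{n+m} \Bigl[\, \tilde{t}_i \log G_\beta(x_i) + (1 - \tilde{t}_i) \log\bigl(1 - G_\beta(x_i)\bigr) \Bigr]
```

Procedurally, the algorithm alternates a hard C-step on the unlabeled pool with a logistic refit. A minimal Python sketch (scikit-learn's LogisticRegression stands in for the paper's gradient-descent M-step; the function name is an assumption):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def logistic_cem(X_lab, y_lab, X_unlab, max_iters=20):
    """Sketch of semi-supervised logistic-CEM: labeled indicators
    stay fixed; unlabeled indicators are re-assigned (C-step) and
    the logistic model is refit (M-step) until the partition is stable."""
    clf = LogisticRegression()
    clf.fit(X_lab, y_lab)              # initialize beta on labeled data only
    y_unlab = clf.predict(X_unlab)     # initial hard assignments
    for _ in range(max_iters):
        # M-step: refit on labeled + currently assigned unlabeled data.
        X = np.vstack([X_lab, X_unlab])
        y = np.concatenate([y_lab, y_unlab])
        clf.fit(X, y)
        # C-step: re-assign each unlabeled point to its most likely class.
        new_y = clf.predict(X_unlab)
        if np.array_equal(new_y, y_unlab):   # partition stable -> converged
            break
        y_unlab = new_y
    return clf
```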
Evaluation • Algorithm evaluated against 3 other single-document summarization algorithms • Non-trainable System: passage ranking • Trainable System: Naïve Bayes sentence classifier • Generative-CEM (using full Gaussians) • Precision/Recall with regard to gold-standard extract summaries • The fine print: • All systems used *similar* representation schemes, but not the same…
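Concretely, if S is the set of sentences a system extracts and G is the gold-standard extract for the same document, precision and recall reduce to set overlap (standard definitions, stated here for completeness):

```latex
\text{Precision} = \frac{|S \cap G|}{|S|}, \qquad \text{Recall} = \frac{|S \cap G|}{|G|}
```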
Baseline System: Sentence Ranking • Rank sentences using a TF-IDF similarity measure with query expansion (Sim2) • Blind relevance feedback from the top sentences • WordNet similarity thesaurus • Generic query created with the most frequent words in the training set.
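A rough Python sketch of this pipeline (the cosine TF-IDF scoring and the single feedback round are generic stand-ins; the paper's exact Sim2 formula and its WordNet-based expansion are not reproduced here):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_sentences(sentences, query, feedback_k=5):
    """Rank sentences by TF-IDF similarity to the query, with one
    round of blind relevance feedback from the top-scoring sentences."""
    vec = TfidfVectorizer()
    S = vec.fit_transform(sentences)
    scores = cosine_similarity(vec.transform([query]), S).ravel()
    # Blind relevance feedback: expand the query with the top-k sentences.
    top = scores.argsort()[::-1][:feedback_k]
    expanded = query + " " + " ".join(sentences[i] for i in top)
    return cosine_similarity(vec.transform([expanded]), S).ravel()
```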
Naïve Bayes Model: Sentence Classification Simple Naïve Bayes classifier trained on 5 features: • Sentence length < tlength {0,1} • Sentence contains ‘cue words’ {0,1} • Sentence query similarity (Sim2) > tsim {0,1} • Upper-case/Acronym features (count?) • Sentence/paragraph position in text {1, 2, 3}
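A minimal version of such a classifier over the four binary features (BernoulliNB and the toy rows below are illustrative assumptions, not the paper's data; the categorical position feature would need one-hot encoding):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Columns: [short_sentence, has_cue_word, query_similar, has_acronym]
X_train = np.array([[1, 1, 1, 0],
                    [0, 0, 0, 1],
                    [1, 0, 1, 0]])
y_train = np.array([1, 0, 1])   # 1 = sentence belongs in the summary

nb = BernoulliNB()
nb.fit(X_train, y_train)
print(nb.predict_proba([[1, 1, 0, 0]]))  # [P(not summary), P(summary)]
```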
Logistic-CEM: Sentence Representation Features Features used to train Logistic-CEM: • Normalized sentence length [0, 1] • Normalized ‘cue word’ frequency [0, 1] • Sentence Query Similarity (Sim2) [0, ∞) • Normalized acronym frequency [0, 1] • Sentence/paragraph position in text {1, 2, 3} (All of the binary features converted to continuous.)
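A sketch of how these continuous features might be computed per sentence (the normalizations and the sim_to_query helper are assumptions; the paper's exact definitions are not reproduced):

```python
def sentence_features(sent, max_sent_len, cue_words, sim_to_query, position):
    """Continuous feature vector mirroring the five features above.
    `sim_to_query` is an external Sim2-style scorer; `position`
    is 1, 2, or 3 (beginning / middle / end of the text)."""
    tokens = sent.split()
    n = max(len(tokens), 1)
    return [
        len(tokens) / max_sent_len,                           # norm. length
        sum(t.lower() in cue_words for t in tokens) / n,      # cue-word freq.
        sim_to_query(sent),                                   # query similarity
        sum(t.isupper() and len(t) > 1 for t in tokens) / n,  # acronym freq.
        position,                                             # {1, 2, 3}
    ]
```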