
Enhancing Single-document Summarization by Combining RankNet and Third-party Sources

  1. Enhancing Single-document Summarization by Combining RankNet and Third-party Sources Yu-Mei Chang Lab Name National Taiwan Normal University Krysta M. Svore, Lucy Vanderwende, Christopher J.C. Burges, “Enhancing Single-document Summarization by Combining RankNet and Third-party Sources,” in Proc. of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).

  2. Abstract • They present a new approach to automatic summarization based on neural nets, called NetSum. • Features: they extract a set of features from each sentence; novel features based on news search query logs and Wikipedia entities (third-party datasets) enhance the sentence features. • They use the RankNet learning algorithm: a pair-based sentence ranker (a neural-network ranking algorithm) that scores every sentence in the document and identifies the most important sentences. • They apply their system to documents gathered from CNN.com, where each document includes highlights and an article. • Their system significantly outperforms the standard baseline in the ROUGE-1 measure on over 70% of their document set.

  3. Their task (1) • Focus on single-document summarization of newswire documents. • Each document consists of three or four highlight sentences and the article text. • Each highlight sentence is human-generated but is based on the article. • The output of their system: purely extracted sentences; they do not perform any sentence compression or sentence generation. • Their data consists of 1365 news documents gathered from CNN.com, hand-collected from December 2006 to February 2007, with a maximum of 50 documents collected per day.

  4. CNN.com example (http://edition.cnn.com/2009/WORLD/asiapcf/08/14/typhoon.village/index.html) — screenshot of a CNN.com page showing the article text, the story highlights, and the timestamp.

  5. Their task (2) • They develop two separate problems based on their document set. • First: can they extract three sentences that best “match” the highlights as a whole? They concatenate the three sentences produced by their system into a single summary block and do the same with the three highlight sentences (this disregards sentence order). • Second: can they extract three sentences that best “match” the three highlights, such that ordering is preserved? The first sentence is compared against the first highlight, the second sentence against the second highlight, and the third sentence against the third highlight.

  6. The system • Goal: to extract three sentences from a single news document that best match various characteristics of the three document highlights. • A set of features is extracted from each sentence in the training and test sets, and the training set is used to train the system. • The system learns from the training set the distribution of features for the best sentences and outputs a ranked list of sentences for each document. • In this paper, they rank sentences using a neural network algorithm called RankNet (Burges et al., 2005), a pair-based neural network algorithm that ranks a set of inputs, in this case the set of sentences in a given document.

  7. Matching Extracted to Generated Sentences • Choosing the three sentences most similar to the three highlights is very challenging: the highlights include content that has been gathered across sentences and even paragraphs, and the highlights include vocabulary that may not be present in the text. • For 300 news articles, 19% of human-generated summary sentences contain no matching article sentence, and only 42% of the summary sentences match the content of a single article sentence, where there are still semantic and syntactic transformations between the summary sentence and the article sentence. • This is because each highlight is human-generated and does not exactly match any one sentence in the document.

  8. RankNet (1) • The system is trained on pairs of sentences (s_i, s_j), where s_i should be ranked higher than or equal to s_j. • Pairs are generated between sentences in a single document, not across documents. • Each pair is determined from the input labels: their sentences are labeled using ROUGE, and if the ROUGE score of s_i is greater than the ROUGE score of s_j, then (s_i, s_j) is one input pair. • The cost function for RankNet is the probabilistic cross-entropy cost function. • Their system, NetSum, is a two-layer neural net trained using RankNet. • To speed up training, RankNet is run in the framework of LambdaRank.
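As a rough illustration of the training setup described above (not the authors' code), the sketch below forms within-document sentence pairs from ROUGE labels and evaluates the pairwise cross-entropy cost for a pair of model scores; the names (ranknet_pair_cost, make_pairs, rouge_labels) are illustrative.

```python
import math
from itertools import combinations

def ranknet_pair_cost(score_i, score_j, target_p=1.0):
    # Cross-entropy cost for one pair: o_ij is the model's score difference,
    # target_p is the desired probability that sentence i outranks sentence j.
    o_ij = score_i - score_j
    return -target_p * o_ij + math.log(1.0 + math.exp(o_ij))

def make_pairs(rouge_labels):
    # Generate within-document training pairs (i, j) such that sentence i
    # has a strictly higher ROUGE label than sentence j.
    pairs = []
    for i, j in combinations(range(len(rouge_labels)), 2):
        if rouge_labels[i] > rouge_labels[j]:
            pairs.append((i, j))
        elif rouge_labels[j] > rouge_labels[i]:
            pairs.append((j, i))
    return pairs
```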

  9. RankNet (2) • They implement 4 versions of NetSum. • NetSum(b) is trained for their first summarization problem (b indicates block): they label each sentence s_i by the maximum ROUGE-1 score between s_i and each highlight h_n, for n = 1, 2, 3, i.e., label(s_i) = max_n ROUGE-1(s_i, h_n). • NetSum(n) is for the second problem of matching three sentences to the three highlights individually: they label each sentence s_i by the ROUGE-1 score between s_i and highlight h_n, i.e., label_n(s_i) = ROUGE-1(s_i, h_n). The ranker for highlight n, NetSum(n), is passed samples labeled using label_n. • Their reference summary is the set of human-generated highlights: the highlights as a block for NetSum(b), or a single highlight sentence for NetSum(n).
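A minimal sketch of the two labeling schemes, using a simplified unigram-recall stand-in for ROUGE-1 (the real ROUGE implementation differs); function names are illustrative.

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    # Simplified ROUGE-1 recall: fraction of reference unigrams that the
    # candidate sentence covers (both arguments are token lists).
    ref, cand = Counter(reference), Counter(candidate)
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / max(len(reference), 1)

def label_block(sentence, highlights):
    # NetSum(b)-style label: best ROUGE-1 score against any highlight.
    return max(rouge1_recall(sentence, h) for h in highlights)

def label_single(sentence, highlight_n):
    # NetSum(n)-style label: ROUGE-1 score against highlight n only.
    return rouge1_recall(sentence, highlight_n)
```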

  10. Features (1) • RankNet takes as input a set of samples, where each sample contains a label and a feature vector. • They generate 10 features for each sentence in each document, listed in Table 1. • They include a binary feature that equals 1 if s_i is the first sentence of the document: first(s_i) = δ(i, 1), where δ is the Kronecker delta function (δ(i, j) = 1 if i = j and 0 otherwise). • This feature is used only for NetSum(b) and NetSum(1).

  11. Features (2) • They include sentence position. They found in empirical studies that the sentence that best matches highlight h_1 is on average 10% down the article, the sentence that best matches h_2 is on average 20% down the article, and the sentence that best matches h_3 is on average 31% down the article. • The position of s_i in document D is pos(s_i) = i / |D|, where |D| is the number of sentences in D. • They include the SumBasic score of a sentence, which estimates the importance of a sentence based on word frequency. The SumBasic score of s_i in document D is SumBasic(s_i) = (1 / |s_i|) Σ_{w ∈ s_i} p(w), where p(w) is the probability of word w in D and |s_i| is the number of words in sentence s_i.
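A small sketch of how the position and SumBasic features just described might be computed from a tokenized document; as an assumption, p(w) here is simply the word's relative frequency within the document.

```python
from collections import Counter

def position_and_sumbasic(doc_sentences, index):
    # doc_sentences: list of tokenized sentences for one document.
    words = doc_sentences[index]
    all_words = [w for s in doc_sentences for w in s]
    counts, total = Counter(all_words), len(all_words)
    # Position: fraction of the way down the article (1-indexed).
    position = (index + 1) / len(doc_sentences)
    # SumBasic: average document-level word probability of the sentence's words.
    sumbasic = sum(counts[w] / total for w in words) / max(len(words), 1)
    return {"position": position, "sumbasic": sumbasic}
```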

  12. Features (3) • They also include the SumBasic score over bigrams, where the word probability p(w) above is replaced by the bigram probability and they normalize by the number of bigrams in s_i. • They compute the similarity of a sentence s_i in document D with the title T of D as the relative probability of title terms in s_i: for each title term t, the number of times t appears in s_i divided by the number of terms in s_i, summed over the title terms. • The remaining features they use are based on third-party data sources: they propose using news query logs and Wikipedia entities to enhance the features. • They base several features on query terms frequently issued to Microsoft’s news search engine (http://search.live.com/news) and on entities found in the online open-source encyclopedia Wikipedia (Wikipedia.org, 2007).
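A sketch, under the same illustrative assumptions as above, of the bigram SumBasic and title-similarity features described on this slide.

```python
from collections import Counter

def bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

def bigram_sumbasic(doc_sentences, index):
    # SumBasic over bigrams: average document-level bigram probability,
    # normalized by the number of bigrams in the sentence.
    all_bg = [b for s in doc_sentences for b in bigrams(s)]
    counts, total = Counter(all_bg), len(all_bg)
    sent_bg = bigrams(doc_sentences[index])
    if not sent_bg or total == 0:
        return 0.0
    return sum(counts[b] / total for b in sent_bg) / len(sent_bg)

def title_similarity(sentence, title):
    # Relative probability of title terms in the sentence: occurrences of
    # each title term in the sentence, normalized by sentence length.
    counts = Counter(sentence)
    return sum(counts[t] for t in set(title)) / max(len(sentence), 1)
```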

  13. Features (4) • They collected several hundred of the most frequently queried terms in February 2007 from the news query logs: the daily top 200 terms for 10 days. • They calculate the average probability of news query terms in s_i as (1 / |s_i|) Σ_{q ∈ s_i} p(q), where p(q) is the probability of a news query term q and |s_i| is the number of words in s_i; p(q) is the number of times term q appears in the query logs divided by the total number of news query terms. • They also include the sum of the news query term probabilities in s_i. • They include the relative probability of news query terms in s_i, i.e., the number of news query terms appearing in s_i divided by |s_i|.
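An illustrative sketch of the three news-query-log features; query_term_probs is an assumed precomputed dictionary built from the collected top query terms, and the exact normalizations in the paper may differ.

```python
def news_query_features(sentence, query_term_probs):
    # query_term_probs: assumed mapping from a frequently queried news term
    # to its probability in the collected query logs.
    hits = [w for w in sentence if w in query_term_probs]
    n = max(len(sentence), 1)
    return {
        "news_avg": sum(query_term_probs[w] for w in hits) / n,  # average query-term probability
        "news_sum": sum(query_term_probs[w] for w in hits),      # sum of query-term probabilities
        "news_rel": len(hits) / n,                                # relative frequency of query terms
    }
```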

  14. Features (5) • Sentences in a CNN document that contain Wikipedia entities that frequently appear in the CNN document are considered important. • They calculate the average Wikipedia entity score for s_i as (1 / |s_i|) Σ_{e ∈ s_i} p(e), where p(e) is the probability of entity e, given by p(e) = n(e, D) / N_D, where n(e, D) is the number of times entity e appears in CNN document D and N_D is the total number of entities in CNN document D. • They also include the sum of the Wikipedia entity probabilities in s_i. • All features are computed over sentences where every word has been lowercased and punctuation has been removed after sentence breaking.
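A sketch of the Wikipedia-entity features along the lines described above; entity detection itself is assumed to have been done already, and the names are illustrative.

```python
from collections import Counter

def wikipedia_entity_features(sentence_entities, doc_entities):
    # sentence_entities: Wikipedia entities found in the sentence;
    # doc_entities: every entity occurrence in the CNN document.
    # (How entities are matched against Wikipedia is outside this sketch.)
    doc_counts, total = Counter(doc_entities), len(doc_entities)
    if total == 0 or not sentence_entities:
        return {"wiki_avg": 0.0, "wiki_sum": 0.0}
    probs = [doc_counts[e] / total for e in sentence_entities]
    return {"wiki_avg": sum(probs) / len(sentence_entities),
            "wiki_sum": sum(probs)}
```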

  15. Evaluation • For the first summarization task, they compare against the baseline of choosing the first three sentences as the block summary. • For the second, highlights task, they compare NetSum(n) against the baseline of choosing sentence n (to match highlight n). • They perform all experiments using five-fold cross-validation on their dataset of 1365 documents: they divide the corpus into five random sets and train on three combined sets, validate on one set, and test on the remaining set. • They consider ROUGE-1 to be the measure of importance and thus train their model on ROUGE-1 (to optimize ROUGE-1 scores) and likewise evaluate their system on ROUGE-1.

  16. Results: Summarization(1) • Table 2 shows the average ROUGE-1 and ROUGE-2 scores obtained with NetSum(b) and the baseline. • NetSum(b) produces a higher quality block on average for ROUGE-1. • Table 3 lists the statistical tests performed and the significance of performance differences between NetSum and the baseline at 95% confidence.

  17. Results: Summarization (2) • Table 4 lists the sentences in the block produced by NetSum(b) and the baseline block, for the article shown in Figure 1. • The NetSum(b) summary achieves a ROUGE-1 score of 0.52, while the baseline summary scores only 0.36.

  18. Results: Highlights (1) • Their second task is to extract three sentences from a document that best match the three highlights in order. • They train NetSum(n) for each highlight h_n and compare NetSum(n) with the baseline of picking the n-th sentence of the document. • NetSum(1) produces a sentence with a ROUGE-1 score that is equal to or better than the baseline score for 93.26% of documents. • Under ROUGE-2, NetSum(1) performs equal to or better than the baseline on 94.21% of documents. • Table 5 shows the average ROUGE-1 and ROUGE-2 scores obtained with NetSum(1) and the baseline. • NetSum(1) produces a higher-quality sentence on average under ROUGE-1.

  19. Results: Highlights (2) • Table 6 gives the highlights produced by NetSum(n) and the highlights produced by the baseline, for the article shown in Figure 1. • The NetSum(n) highlights produce ROUGE-1 scores equal to or higher than the baseline ROUGE-1 scores. • They confirmed that the inclusion of news-based and Wikipedia-based features improves NetSum’s performance: they removed all news-based and Wikipedia-based features in NetSum(3), and the resulting performance moderately declined.

  20. Conclusions • They have presented a novel approach to automatic single-document summarization based on neural networks, called NetSum. • Their work is the first to use both neural networks for summarization and third-party datasets (Wikipedia and news query logs) for features. • They evaluated their system on two novel tasks: 1) producing a block of highlights and 2) producing three ordered highlight sentences. • Their experiments were run on previously unstudied data gathered from CNN.com. • Their system shows remarkable improvement over the baseline of choosing the first n sentences of the document, and the performance difference is statistically significant under ROUGE-1.

  21. RankNet C.J.C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. 2005. Learning to Rank using Gradient Descent. ICML, pages 89–96. ACM.

  22. A Probability Ranking Cost Function (1) • They consider models where the learning algorithm is given a set of pairs of samples [A, B] in R^d, together with target probabilities that sample A is to be ranked higher than sample B. • They consider models f : R^d → R such that the rank order of a set of test samples is specified by the real values that f takes; specifically, f(x_1) > f(x_2) is taken to mean that the model asserts that x_1 is ranked higher than x_2. • Denote the modeled posterior P(x_A ranked higher than x_B) by P_AB, and let P̄_AB be the desired target values for those posteriors. • Define o_i ≡ f(x_i) and o_AB ≡ f(x_A) − f(x_B). They will use the cross-entropy cost function C_AB ≡ C(o_AB) = −P̄_AB log P_AB − (1 − P̄_AB) log(1 − P_AB).

  23. A Probability Ranking Cost Function (2) • The map from outputs to probabilities is modeled using a logistic function: P_AB = e^{o_AB} / (1 + e^{o_AB}). • The cost C_AB then becomes C_AB = −P̄_AB o_AB + log(1 + e^{o_AB}). • C asymptotes to a linear function for large |o_AB|. • The target probability is likewise written in logistic form in terms of the desired outputs: P̄_AB = e^{ō_AB} / (1 + e^{ō_AB}).
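A tiny numeric sketch (illustrative only) of the logistic-mapped cost above, showing its gradient and the linear asymptote mentioned on the slide.

```python
import math

def p_model(o):
    # Logistic map from the score difference o = o_A - o_B to P_AB.
    return math.exp(o) / (1.0 + math.exp(o))

def cost(o, p_bar):
    # Cross-entropy cost after substituting the logistic map.
    return -p_bar * o + math.log(1.0 + math.exp(o))

def dcost_do(o, p_bar):
    # Gradient with respect to the score difference: P_AB - target.
    return p_model(o) - p_bar

# With p_bar = 1 the cost is ~0 when o is large and positive, and grows
# roughly linearly (~ -o) when o is large and negative: the linear asymptote.
for o in (-10.0, -5.0, 0.0, 5.0, 10.0):
    print(f"o={o:+.1f}  cost={cost(o, 1.0):.4f}  grad={dcost_do(o, 1.0):+.4f}")
```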

  24. RankNet: Learning to rank with neural nets (1) • The above cost function is quite general; here they explore using it in neural network models, as motivated above. • It is useful first to remind the reader of the back-prop equations for a two-layer net with q output nodes. • For the i-th training sample, denote the outputs of the net by o_i, the targets by t_i, let the transfer function of each node in the j-th layer of nodes be g^j, and let the cost function be Σ_i f(o_i, t_i). • If α_k are the parameters of the model, then a gradient descent step amounts to δα_k = −η_k ∂f/∂α_k, where the η_k are positive learning rates.

  25. RankNet: Learning to rank with neural nets (2) • The net embodies the function o_i = g^3( Σ_j w^{32}_{ij} g^2( Σ_k w^{21}_{jk} x_k + b^2_j ) + b^3_i ), where for the weights w and offsets b, the upper indices index the node layer, and the lower indices index the nodes within each corresponding layer.
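A minimal NumPy sketch of the two-layer forward function just written out; the transfer functions, hidden-layer size, and weight initialization are illustrative assumptions, and training (the gradient step δα_k = −η_k ∂f/∂α_k from the previous slide) is not shown.

```python
import numpy as np

def two_layer_forward(x, W21, b2, W32, b3, g2=np.tanh, g3=lambda z: z):
    # o = g3( W32 @ g2( W21 @ x + b2 ) + b3 ): the function the slide writes
    # with upper indices for the layer pair and lower indices for the nodes.
    hidden = g2(W21 @ x + b2)      # hidden-layer activations
    return g3(W32 @ hidden + b3)   # output-layer score(s)

# Illustrative shapes: 10 input features per sentence (as in NetSum),
# a small hidden layer, and one output score used to rank the sentence.
rng = np.random.default_rng(0)
W21, b2 = rng.normal(size=(5, 10)), np.zeros(5)
W32, b3 = rng.normal(size=(1, 5)), np.zeros(1)
score = two_layer_forward(rng.normal(size=10), W21, b2, W32, b3)
print(score)
```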
