Summarization and Personal Information Management Carolyn Penstein Rosé Language Technologies Institute/ Human-Computer Interaction Institute
Announcements • Questions? • Homework 2 assigned today and due in 1 week • Plan for Today • Hyland Chapter • Hidden Markov Modeling • Jing, 2002 paper
Getting into Technology • Problem: human behavior, the process of generating summaries • Solution: design a technology component • Hidden Markov Models: a technology tool for understanding the process of generating summaries (today's focus)
According to Hyland, what is the problem that abstracts solve?
What do we get from Hyland? • Methodology for understanding how humans write abstracts • Important: Acknowledgement of the social context in which abstracts are written • Mixed methods: interviews, rhetorical analysis, comparative analysis, interpretation • Not shown: collocational analysis
Hyland’s Coding Scheme • Note that the selected rhetorical strategy says something about what the writer assumes about the audience • What do you remember from that?
What’s your analysis: • 1 – Purpose • 2 – Introduction, Purpose+Method, Method • 3 – Product • 4 – Product • 5 – Product • 6 – Product, Conclusion
What’s your analysis: • 7 – Purpose, Introduction, Method • 8 – Purpose, Introduction, Conclusion • 9 – Purpose, Method, Introduction • 10 – Product, Introduction, Conclusion
Homework Two • Taking into account the feedback you received on Assignment 1, refine the focus of your term project • State the problem you are now trying to solve • Assignment 2 focuses on rhetorical analysis • Find some data to work with – for the assignment you’ll need 3 examples of what you are trying to summarize. This can be 3 documents or 3 collections of documents • Design a coding scheme as Hyland did and do a rhetorical analysis of your data. If you are working with 3 collections of documents, just analyze a sample; you don’t have to analyze all 3 collections in full • Now, based on your rhetorical analysis, “generate” by hand the summary you think you should get from your 3 examples • Finally, argue why you think this summary “solves” the problem you set out to solve
Hidden Markov Modeling • Different from typical Markov models because the states are not directly observable • From one sequence of observations, more than one sequence of states is possible • Viterbi search is used at decoding time to identify the most likely sequence of states (a sketch follows the example below)
Hidden Markov Modeling • Observation pattern • y1 y1 y1 y3 y4 • Possible state sequences • x1 x2 x1 x2 x1 • x1 x2 x1 x2 x3
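To make the decoding step concrete, here is a minimal Viterbi sketch in Python. The toy model (states x1–x3, observations y1–y4, and every probability value) is invented purely for illustration; only the algorithm itself is the point.

```python
# Minimal Viterbi decoder for a toy HMM. All probabilities below are
# made-up placeholders, not values from the lecture or any paper.

def viterbi(obs, states, start_p, trans_p, emit_p):
    # V[t][s] = probability of the best state sequence ending in s at time t
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states
            )
            V[t][s] = prob
            back[t][s] = prev
    # Trace back the most likely state sequence
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path

states = ["x1", "x2", "x3"]
start_p = {"x1": 0.6, "x2": 0.3, "x3": 0.1}
trans_p = {
    "x1": {"x1": 0.2, "x2": 0.6, "x3": 0.2},
    "x2": {"x1": 0.5, "x2": 0.2, "x3": 0.3},
    "x3": {"x1": 0.3, "x2": 0.3, "x3": 0.4},
}
emit_p = {
    "x1": {"y1": 0.5, "y2": 0.2, "y3": 0.2, "y4": 0.1},
    "x2": {"y1": 0.4, "y2": 0.3, "y3": 0.2, "y4": 0.1},
    "x3": {"y1": 0.1, "y2": 0.2, "y3": 0.3, "y4": 0.4},
}

# Decode the observation pattern from the slide above.
print(viterbi(["y1", "y1", "y1", "y3", "y4"], states, start_p, trans_p, emit_p))
```

Note how the same observation sequence could be explained by either of the state sequences on the slide; Viterbi simply returns the one with the highest joint probability under these parameters.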
Question from Nitin • From what I have understood, assigning probability values to the transition of states (P1-P6) is experimental.
Simplistic Summarization • Select a subset of sentences from the source document or documents • Present them in the same order in which they appeared in the source
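A minimal sketch of that simplistic pipeline. The word-frequency scorer here is an assumption for illustration; any sentence-importance measure could stand in for it.

```python
# Simplistic extractive summarization: score sentences, pick the top k,
# and present them in their original document order.
import re
from collections import Counter

def summarize(document, k=2):
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    freq = Counter(re.findall(r"\w+", document.lower()))

    # Score each sentence by the total frequency of its words
    # (a stand-in for whatever importance measure one prefers).
    def score(sent):
        return sum(freq[w] for w in re.findall(r"\w+", sent.lower()))

    top = sorted(range(len(sentences)),
                 key=lambda i: score(sentences[i]), reverse=True)[:k]
    return " ".join(sentences[i] for i in sorted(top))  # original order

doc = ("Summarization selects content. Summarization systems often extract "
       "sentences. Extracted sentences keep their original order. "
       "Some systems also paraphrase.")
print(summarize(doc, k=2))
```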
Less Simplistic Summarization • Select a subset of sentences from the source document or documents • Paraphrase those sentences • Present them in the same order as in the source, or in a different order
Advantages of Solving the Decomposition Problem • Gain insight into desirable generation techniques for summarization • They could have provided more analysis to this end • Automatically produce training data for extraction-based summarization approaches
Paraphrase Operations • Sentence reduction • Sentence combination • Syntactic transformation • Lexical paraphrasing • Generalization or specification
Student Quote from Last Time • They say "based on careful analysis of human-written summaries", which suggests that they sat in a room by themselves reading summaries and original texts, trying to figure out what human summarizers do. Why didn't they just go out and talk to some real people?
Sentence Reduction • Non-essential phrases are removed • What counts as non-essential?
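As a toy answer to that question, a sketch that strips two phrase types that are often (but certainly not always) non-essential: parentheticals and comma-delimited appositives. This is not Jing's reduction module, which relies on parse trees and corpus evidence rather than surface patterns.

```python
# Toy sentence reduction via surface patterns. Real reduction (as in
# Jing's work) uses syntactic structure and corpus statistics instead.
import re

def reduce_sentence(sentence):
    s = re.sub(r"\s*\([^)]*\)", "", sentence)   # drop (parentheticals)
    s = re.sub(r",\s[^,]+,", "", s, count=1)    # drop one ", appositive,"
    return re.sub(r"\s{2,}", " ", s).strip()

print(reduce_sentence(
    "The system, a prototype built last year, aligns summary "
    "words (see Fig. 1) with the source."
))
# -> "The system aligns summary words with the source."
```

The crudeness of the comma heuristic makes the slide's point: deciding what is genuinely non-essential is the hard part.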
Sentence Combination • Merge sentences, typically after reducing both • How you merge depends on overlap between sentences • When is it advantageous to merge?
Syntactic Transformation • Changing the syntactic structure • Which syntactic transformations are allowed? • Do these two sentences mean the same thing?
Lexical Paraphrasing • Replacing a phrase with something that means the same thing • “hits the nail on the head” versus “fit squarely into” • What counts as a lexical paraphrase?
Generalization or Specification • Similar to lexical paraphrasing, except the replacement phrase is more general or more specific than the original
Problem Formulation • Identify the most likely position in the document (if any) of each summary word • Then apply the decomposition operations
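A simplified sketch of that formulation, reusing the Viterbi machinery from earlier: treat each document position as an HMM state, let a summary word "emit" only from positions holding the same word, and favor transitions that continue a document phrase. The emission and transition weights below are invented placeholders, not the parameters Jing actually used.

```python
# Simplified word-position alignment in the spirit of Jing (2002).
# States are document positions; the weights are invented placeholders.

def align(summary_words, doc_words):
    positions = range(len(doc_words))

    def emit(word, pos):
        return 1.0 if doc_words[pos] == word else 1e-6

    def trans(prev, cur):
        if cur == prev + 1:
            return 0.8                   # continue the same document phrase
        return 0.2 / len(doc_words)      # jump anywhere else

    V = [{p: emit(summary_words[0], p) for p in positions}]
    back = [{}]
    for t, word in enumerate(summary_words[1:], start=1):
        V.append({})
        back.append({})
        for p in positions:
            prob, prev = max((V[t - 1][q] * trans(q, p), q) for q in positions)
            V[t][p] = prob * emit(word, p)
            back[t][p] = prev
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(summary_words) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path  # most likely document position for each summary word

doc = "the cat sat on the mat while the dog slept".split()
print(align("the dog slept".split(), doc))  # -> [7, 8, 9]
```

The phrase-continuation bonus is what lets the model prefer position 7 for "the" over positions 0 and 4: only from 7 can the remaining summary words be explained as one contiguous document phrase.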
Evaluations • Alignment • How accurately can this approach align summary sentences with document sentences? • Only tests the HMM • Decomposition • Humans judged whether the decomposition was correct • Only tests the decomposition operators • Portability evaluation – a test of generality
Alignment • Used 10 documents paired with human-written summaries • Other humans looked at the pairs and matched summary sentences to document sentences • Precision, Recall, and F-measure can be computed by comparing these manual extracts with the automatic ones (see the sketch below) • Error analysis: problems arise with creative rewordings or when irrelevant sentences contain summary words
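The comparison itself is straightforward to compute; a minimal sketch, assuming extracted sentences are identified by their index in the document (the IDs in the usage line are hypothetical):

```python
# Precision, recall, and F-measure over two sets of extracted sentence IDs.
def prf(human_extract, auto_extract):
    human, auto = set(human_extract), set(auto_extract)
    overlap = len(human & auto)
    p = overlap / len(auto) if auto else 0.0
    r = overlap / len(human) if human else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

print(prf({1, 4, 7, 9}, {1, 4, 8}))  # hypothetical IDs -> (0.667, 0.5, 0.571)
```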
Decomposition • 50 summaries from a telecommunications corpus • Ran the decomposition program • 93.8% of sentences were correctly decomposed • Seems like a weak definition of correct decomposition: • a correct pairing between sentences • correctly identifying where phrases came from
Portability • Test on a new type of data • Performed well
But what did we learn about how humans generate summaries? • Analyzed 300 human-written summaries • 19% of summary sentences had no matching document sentence • 42% matched a single sentence • Often along with sentence reduction • 36% were created by combining 2 or 3 sentences • 3% were created by combining more than that
What would be interesting next steps?
Idea from Nitin • Also, as we have seen from the Hyland chapter, abstracts tend to implicitly map the actual meta-discourse structure of the entire document (P-M-Pr etc.). We could use this structure in the heuristic to assign relevant probabilities according to the document position of a word, e.g. coming from the introduction section versus coming from the methods section. This would allow the HMM to realistically model the transition probabilities, accommodating the information about the discourse structure of the original document.