Summarization and Personal Information Management Carolyn Penstein Rosé Language Technologies Institute/ Human-Computer Interaction Institute
Announcements • Questions? • Homework 2 assigned today and due in 1 week • Plan for Today • Hyland Chapter • Hidden Markov Modeling • Jing, 2002 paper
Getting into Technology • Problem: human behavior, the process of generating summaries • Solution: design a technology component • Hidden Markov Models: a technology tool for understanding the process of generating summaries (today's focus)
According to Hyland, what is the problem that abstracts solve?
What do we get from Hyland? • Methodology for understanding how humans write abstracts • Important: Acknowledgement of the social context in which abstracts are written • Mixed methods: interviews, rhetorical analysis, comparative analysis, interpretation • Not shown: collocational analysis
Hyland’s Coding Scheme • Note that the selected rhetorical strategy says something about what the writer assumes about the audience • What do you remember from that?
What’s your analysis: • 1 – Purpose • 2 – Introduction, Purpose+Method, Method • 3 – Product • 4 – Product • 5 – Product • 6 – Product, Conclusion
What’s your analysis: • 7 – Purpose, Introduction, Method • 8 – Purpose, Introduction, Conclusion • 9 – Purpose, Method, Introduction • 10 – Product, Introduction, Conclusion
Homework Two • Taking into account the feedback you received on Assignment 1, refine the focus of your term project • State the problem you are now trying to solve • Assignment 2 focuses on rhetorical analysis • Find some data to work with – for the assignment you’ll need 3 examples of what you are trying to summarize. This can be 3 documents or 3 collections of documents • Design a coding scheme as Hyland did and do a rhetorical analysis of your data. If you are working with 3 collections of documents, just analyze a sample; you don’t have to analyze all 3 collections in full • Now, based on your rhetorical analysis, “generate” by hand the summary you think you should get from your 3 examples • Finally, argue why you think this summary “solves” the problem you set out to solve
Hidden Markov Modeling • Different from typical Markov models because the states are not directly observable • From one sequence of observations, more than one sequence of states is possible • Viterbi search is used at decoding time to identify the most likely sequence of states (a sketch follows the example below)
Hidden Markov Modeling • Observation pattern • y1 y1 y1 y3 y4 • Possible state sequences • x1 x2 x1 x2 x1 • x1 x2 x1 x2 x3
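To make the decoding step concrete, here is a minimal Viterbi sketch in Python. The toy model (states x1–x3, observations y1–y4, and every probability value) is invented purely for illustration; only the algorithm itself is the point.

```python
# Minimal Viterbi decoder for a toy HMM. All probabilities below are
# made-up placeholders, not values from the lecture or any paper.

def viterbi(obs, states, start_p, trans_p, emit_p):
    # V[t][s] = probability of the best state sequence ending in s at time t
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states
            )
            V[t][s] = prob
            back[t][s] = prev
    # Trace back the most likely state sequence
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path

states = ["x1", "x2", "x3"]
start_p = {"x1": 0.6, "x2": 0.3, "x3": 0.1}
trans_p = {
    "x1": {"x1": 0.2, "x2": 0.6, "x3": 0.2},
    "x2": {"x1": 0.5, "x2": 0.2, "x3": 0.3},
    "x3": {"x1": 0.3, "x2": 0.3, "x3": 0.4},
}
emit_p = {
    "x1": {"y1": 0.5, "y2": 0.2, "y3": 0.2, "y4": 0.1},
    "x2": {"y1": 0.4, "y2": 0.3, "y3": 0.2, "y4": 0.1},
    "x3": {"y1": 0.1, "y2": 0.2, "y3": 0.3, "y4": 0.4},
}

# Decode the observation pattern from the slide above.
print(viterbi(["y1", "y1", "y1", "y3", "y4"], states, start_p, trans_p, emit_p))
```

Note how the same observation sequence could be explained by either of the state sequences on the slide; Viterbi simply returns the one with the highest joint probability under these parameters.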
Question from Nitin • From what I have understood, assigning probability values to the transition of states (P1-P6) is experimental.
Simplistic Summarization • Select a subset of sentences from the source document or documents • Present them in the same order in which they appeared in the source
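A minimal sketch of that simplistic pipeline. The word-frequency scorer here is an assumption for illustration; any sentence-importance measure could stand in for it.

```python
# Simplistic extractive summarization: score sentences, pick the top k,
# and present them in their original document order.
import re
from collections import Counter

def summarize(document, k=2):
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    freq = Counter(re.findall(r"\w+", document.lower()))

    # Score each sentence by the total frequency of its words
    # (a stand-in for whatever importance measure one prefers).
    def score(sent):
        return sum(freq[w] for w in re.findall(r"\w+", sent.lower()))

    top = sorted(range(len(sentences)),
                 key=lambda i: score(sentences[i]), reverse=True)[:k]
    return " ".join(sentences[i] for i in sorted(top))  # original order

doc = ("Summarization selects content. Summarization systems often extract "
       "sentences. Extracted sentences keep their original order. "
       "Some systems also paraphrase.")
print(summarize(doc, k=2))
```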
Less Simplistic Summarization • Select a subset of sentences from the source document or documents • Paraphrase those sentences • Present them in the same order as in the source, or in a different order
Advantages of Solving the Decomposition Problem • Gain insight into desirable generation techniques for summarization • They could have provided more analysis to this end • Automatically produce training data for extraction-based summarization approaches
Paraphrase Operations • Sentence reduction • Sentence combination • Syntactic transformation • Lexical paraphrasing • Generalization or specification
Student Quote from Last Time • They say "based on careful analysis of human-written summaries", which suggests that they sat in a room by themselves reading summaries and original texts, trying to figure out what human summarizers do. Why didn't they just go out and talk to some real people?
Sentence Reduction • Non-essential phrases are removed • What counts as non-essential?
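As a toy answer to that question, a sketch that strips two phrase types that are often (but certainly not always) non-essential: parentheticals and comma-delimited appositives. This is not Jing's reduction module, which relies on parse trees and corpus evidence rather than surface patterns.

```python
# Toy sentence reduction via surface patterns. Real reduction (as in
# Jing's work) uses syntactic structure and corpus statistics instead.
import re

def reduce_sentence(sentence):
    s = re.sub(r"\s*\([^)]*\)", "", sentence)   # drop (parentheticals)
    s = re.sub(r",\s[^,]+,", "", s, count=1)    # drop one ", appositive,"
    return re.sub(r"\s{2,}", " ", s).strip()

print(reduce_sentence(
    "The system, a prototype built last year, aligns summary "
    "words (see Fig. 1) with the source."
))
# -> "The system aligns summary words with the source."
```

The crudeness of the comma heuristic makes the slide's point: deciding what is genuinely non-essential is the hard part.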
Sentence Combination • Merge sentences, typically after reducing both • How you merge depends on overlap between sentences • When is it advantageous to merge?
Syntactic Transformation • Changing the syntactic structure • Which syntactic transformations are allowed? • Do these two sentences mean the same thing?
Lexical Paraphrasing • Replacing a phrase with something that means the same thing • “hits the nail on the head” versus “fit squarely into” • What counts as a lexical paraphrase?
Generalization or Specification • Similar to lexical paraphrasing, except the replacement phrase is more general or more specific than the original
Problem Formulation • Identify the most likely position in the document (if any) of each summary word • Then apply the decomposition operations
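A simplified sketch of that formulation, reusing the Viterbi machinery from earlier: treat each document position as an HMM state, let a summary word "emit" only from positions holding the same word, and favor transitions that continue a document phrase. The emission and transition weights below are invented placeholders, not the parameters Jing actually used.

```python
# Simplified word-position alignment in the spirit of Jing (2002).
# States are document positions; the weights are invented placeholders.

def align(summary_words, doc_words):
    positions = range(len(doc_words))

    def emit(word, pos):
        return 1.0 if doc_words[pos] == word else 1e-6

    def trans(prev, cur):
        if cur == prev + 1:
            return 0.8                   # continue the same document phrase
        return 0.2 / len(doc_words)      # jump anywhere else

    V = [{p: emit(summary_words[0], p) for p in positions}]
    back = [{}]
    for t, word in enumerate(summary_words[1:], start=1):
        V.append({})
        back.append({})
        for p in positions:
            prob, prev = max((V[t - 1][q] * trans(q, p), q) for q in positions)
            V[t][p] = prob * emit(word, p)
            back[t][p] = prev
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(summary_words) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path  # most likely document position for each summary word

doc = "the cat sat on the mat while the dog slept".split()
print(align("the dog slept".split(), doc))  # -> [7, 8, 9]
```

The phrase-continuation bonus is what lets the model prefer position 7 for "the" over positions 0 and 4: only from 7 can the remaining summary words be explained as one contiguous document phrase.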
Evaluations • Alignment • How accurately can this approach align summary sentences with document sentences? • Only tests the HMM • Decomposition • Humans judged whether the decomposition was correct • Only tests the decomposition operators • Portability evaluation – a test of generality
Alignment • Used 10 documents paired with human-written summaries • Other humans looked at the pairs and matched summary sentences to document sentences • Precision, Recall, and F-measure can be computed by comparing these manual extracts with the automatic ones (see the sketch below) • Error analysis: problems arise with creative rewordings or when irrelevant sentences contain summary words
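The comparison itself is straightforward to compute; a minimal sketch, assuming extracted sentences are identified by their index in the document (the IDs in the usage line are hypothetical):

```python
# Precision, recall, and F-measure over two sets of extracted sentence IDs.
def prf(human_extract, auto_extract):
    human, auto = set(human_extract), set(auto_extract)
    overlap = len(human & auto)
    p = overlap / len(auto) if auto else 0.0
    r = overlap / len(human) if human else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

print(prf({1, 4, 7, 9}, {1, 4, 8}))  # hypothetical IDs -> (0.667, 0.5, 0.571)
```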
Decomposition • 50 summaries from a telecommunications corpus • Ran the decomposition program • 93.8% of sentences were correctly decomposed • Seems like a weak definition of correct decomposition: • a correct pairing between sentences • correctly identifying where phrases came from
Portability • Test on a new type of data • Performed well
But what did we learn about how humans generate summaries? • Analyzed 300 human-written summaries • 19% of summary sentences had no matching document sentence • 42% matched a single sentence • Often along with sentence reduction • 36% were created by combining 2 or 3 sentences • 3% were created by combining more than that
What would be interesting next steps?
Idea from Nitin • Also, as we have seen from the Hyland chapter, abstracts tend to implicitly map the actual meta-discourse structure of the entire document (P-M-Pr etc.). We could use this structure in the heuristic to assign relevant probabilities according to the document position of a word, e.g. coming from the introduction section versus coming from the methods section. This would allow the HMM to realistically model the transition probabilities, accommodating the information about the discourse structure of the original document.