
Summarization and Personal Information Management


Presentation Transcript


  1. Summarization and Personal Information Management Carolyn Penstein Rosé Language Technologies Institute/ Human-Computer Interaction Institute

  2. Announcements • Questions? • Announcement: Take home final and public poster session • Plan for Today • General Discussion about Evaluation • Example Evaluation and Supplementary discussion about chat analysis • Arguello & Rosé, 2005 • Readings for Today • Nenkova et al., 2007

  3. Experimental Design • Measure the effect of manipulating an independent variable on the value of a dependent variable • In our case, the independent variable indicates which system is being used • The dependent variable is the outcome measure you are examining • Types of Evaluations • Intrinsic (Corpus Based) • This is what we’re talking about today • Extrinsic (Task Based) • We’ll discuss this next

  4. Experimental Design • Procedure • Manipulating the Independent Variable • Pick a data set • Run each system on it • Measuring effect on Dependent variable • Design a gold standard • Design metrics that measure how well system output matches the Gold Standard • Evaluate significance of difference in measurement on Dependent variable, and effect size • Do an error analysis!!!!!!!!!!!!!!

  5. Experimental Design • Procedure • Manipulating the Independent Variable • Pick a data set • Run each system on it • Measuring effect on Dependent variable • Design a gold standard • Design metrics that measure how well system output matches the Gold Standard • Evaluate significance of difference in measurement on Dependent variable, and effect size • Do an error analysis!!!!!!!!!!!!!!

  6. Confidence Intervals • Performance on two different data sets will not be the same • Confidence intervals allow us to say that the probability of the real performance value being within a certain range of the observed value is 90% • [Figure: a 90% confidence interval drawn as a bracketed range on a number line from 0 to 40]

  7. Confidence Intervals • Confidence limits come from the normal distribution • Computed in terms of number of standard deviations from the mean • If the data is normally distributed, there is roughly a 16% chance of the real value being more than 1 standard deviation above the mean

  8. What is a significance test? • How likely is it that the difference you see occurred by chance? • How could the difference occur by chance? • [Figure: two overlapping confidence intervals drawn on a number line from 0 to 40] • If the mean of one distribution is within the confidence interval of another, the difference you observe could be by chance. • If you want p < .05, you need the 90% confidence intervals. Find the corresponding Z scores from a standard normal distribution table.
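To make the interval logic on slides 6-8 concrete, here is a minimal sketch (not from the slides) that computes a 90% confidence interval for one system's mean score over several data sets under a normal approximation, then checks whether a second system's mean falls inside it, following the overlap heuristic above. The per-dataset scores are invented for illustration.

```python
import math

def confidence_interval(scores, z=1.645):
    """90% CI for the mean under a normal approximation (z = 1.645 leaves 5% in each tail)."""
    n = len(scores)
    mean = sum(scores) / n
    sd = math.sqrt(sum((s - mean) ** 2 for s in scores) / (n - 1))  # sample standard deviation
    half_width = z * sd / math.sqrt(n)                              # z times the standard error
    return mean - half_width, mean + half_width

# Hypothetical per-dataset F-measures for two systems
system_a = [0.61, 0.58, 0.64, 0.60, 0.63]
system_b = [0.55, 0.52, 0.57, 0.54, 0.56]

low, high = confidence_interval(system_a)
mean_b = sum(system_b) / len(system_b)
print(f"90% CI for system A: ({low:.3f}, {high:.3f})")
print("B's mean inside A's interval -> difference could be chance:", low <= mean_b <= high)
```

A paired t-test over the same per-dataset scores is the more standard way to get a p-value, but the interval check above mirrors the reasoning on the slide.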

  9. What is the effect size? • Compute the difference in mean performance values between the two conditions • Divide by the standard deviation of the control condition • 1 standard deviation is good!
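A minimal sketch of the effect-size recipe on this slide, reusing the invented score lists from the previous example (the treatment/control naming is only for illustration):

```python
import statistics

def effect_size(treatment_scores, control_scores):
    """Difference in mean performance divided by the standard deviation of the control condition."""
    mean_diff = statistics.mean(treatment_scores) - statistics.mean(control_scores)
    return mean_diff / statistics.stdev(control_scores)

print(effect_size([0.61, 0.58, 0.64, 0.60, 0.63], [0.55, 0.52, 0.57, 0.54, 0.56]))
```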

  10. Independent Variables: Science versus Engineering • Clean manipulations: What are we trying to learn? • Go for generalizable knowledge rather than horse races • You can’t always win, but with a well designed study you can always learn something important • Think about what your results mean. • What is your research question? • What is your hypothesis?

  11. Constructing a Dependent Variable • Start with a notion of what you want to measure • What is summary quality? • How would you define summary quality for ROUGE and the Pyramid Method? • Consider also how the computation interacts with the gold standard • Consider also how the computation and gold standard interact with your Independent variable • Consider what you are not measuring • How could you have defined it differently?

  12. Validity • Face validity: are you really measuring what you say you’re measuring? • External validity: does your measure correlate with something else associated with what you want to measure? • Internal validity: are you controlling for random variance?

  13. Evaluation from Last Time • Data: 10 logs • Gold standard: human written “extractive” summary • Dependent measure: precision, recall, and f-measure • Not clear what the unit of analysis was or how overlap was determined automatically
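Slide 13 notes that the unit of analysis was unclear; as one plausible reading (my own, not the paper's), here is a sketch that treats each extracted line as the unit and scores a system summary against a single gold extractive summary. The line IDs are invented.

```python
def prf(system_units, gold_units):
    """Precision, recall, and F1 over extracted units (e.g., chat lines or sentences)."""
    system, gold = set(system_units), set(gold_units)
    correct = len(system & gold)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical line IDs selected by the system and by the human "extractive" gold summary
print(prf(system_units=[2, 5, 9, 14], gold_units=[2, 5, 11, 14, 20]))
```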

  14. Student Comments from Last Time • Their evaluation approach is not clear. How do they deal with it for example when the system summary has m topics but the gold summary can be split into n topics? • Also, the "Topic-Summ" evaluation is not well described, I am not clear on that either.

  15. More Student Comments • It would be worth investigating whether the fact that the [gold standard] summary was written by the user affected the performance of the machine generated summaries. • They manually changed gold standards, which are participant-written summaries, in order to simplify their evaluation process and decrease inconsistency in the future assessment. However, I wonder whether that's a good thing to do.

  16. More Student Comments • Also why did they only evaluate 10 chat logs? I thought they had 100 chat logs and used approximately 2/3 of them as training and 1/3 as test. How did they choose the 10 to do the evaluations on?

  17. My Complaints • Only one reference summary

  18. Museli: A Multi-Source Evidence Integration Approach to Topic Segmentation of Spontaneous Dialogue Jaime Arguello Carolyn Rosé Language Technologies Institute Carnegie Mellon University

  19. Things to think about • Operationalization of Independent Variables • Evaluation Metrics • Baselines • Hidden Variables: Effect of Unit of Analysis

  20. Overview of Single Evidence Source Approaches • Models based on lexical cohesion • TextTiling (Hearst, 1997) • Foltz (Foltz, 1998) • Olney & Cai (Olney & Cai, 2005) • Models relying on regularities in topic sequencing • Barzilay & Lee (Barzilay & Lee, 2004)

  21. TextTiling • Slide a window over the dialogue transcript and compare adjacent windows (W1, W2, W3, ...) • Term Frequency (TF): accentuates frequent terms in this window • Inverse Document Frequency (IDF): suppresses terms frequent in many windows • Adjacent windows are compared using TF.IDF weights; dips in the resulting depth scores mark candidate topic boundaries • [Figure: dialogue transcript divided into windows W1, W2, W3 with example scores (W1 = 0.15, W2 = 0.20, W3 = 0.45) and a plot of depth against contribution #]
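A minimal sketch of the windowed TF.IDF comparison described on this slide, not Hearst's exact TextTiling algorithm: adjacent windows of contributions become TF.IDF vectors and are compared by cosine similarity, with low-similarity points as candidate topic boundaries. The toy dialogue and helper names are mine.

```python
import math
from collections import Counter

def tfidf_vectors(windows):
    """One TF.IDF vector per window; IDF suppresses terms that occur in many windows."""
    n = len(windows)
    df = Counter(term for w in windows for term in set(w))
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(w).items()} for w in windows]

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Toy dialogue split into windows of tokenized contributions
windows = [
    "are you done reading the rankine cycle chapter".split(),
    "yes the rankine cycle section was short".split(),
    "ok now open the simulator and set the turbine pressure".split(),
]
vecs = tfidf_vectors(windows)
for i in range(len(vecs) - 1):
    print(f"similarity(W{i+1}, W{i+2}) = {cosine(vecs[i], vecs[i+1]):.3f}")
```

On this toy data the first pair of windows shares content terms while the second pair shares only stopword-like terms with zero IDF weight, so the similarity drops to zero exactly where a topic shift would be hypothesized.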

  22. Latent Semantic Analysis (LSA) Approaches • Foltz (Foltz, 1998): similar to TextTiling, except the LSA space is constructed using contributions from the corpus • Orthonormal Basis (Olney and Cai, 2005): supervised, using logistic regression; a more expressive model of discourse coherence than TextTiling and Foltz • Given two contributions with LSA vectors v1 and v2, six numbers are computed: cosine, informativity, and relevance, each with respect to the previous and the next window • [Figure: v1 and v2 in the LSA space, illustrating informativity and relevance]
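A rough sketch in the spirit of the Foltz-style approach only: build a reduced LSA space from the corpus contributions with truncated SVD and compare adjacent contributions by cosine in that space. This is not the Olney & Cai orthonormal-basis model; it assumes scikit-learn is available, and the toy contributions are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

contributions = [
    "are you done reading the rankine cycle chapter",
    "yes the rankine cycle section was short",
    "ok now open the simulator",
    "set the turbine inlet pressure in the simulator",
]

# LSA space constructed from the corpus contributions
tfidf = TfidfVectorizer().fit_transform(contributions)
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

# Low cosine between adjacent contributions suggests a candidate topic boundary
for i in range(len(contributions) - 1):
    sim = cosine_similarity(lsa[i:i + 1], lsa[i + 1:i + 2])[0, 0]
    print(f"cosine(contribution {i + 1}, {i + 2}) = {sim:.3f}")
```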

  23. Content Models (Barzilay and Lee, 2004) • Cluster contributions from the dialogue transcripts (Dialogue 1, Dialogue 2, Dialogue 3, ...) into clusters c1, c2, c3, ... • Build cluster-specific LMs • Train an HMM, with EM-like iterative re-assignment of contributions to states • Apply the HMM to the data: a state TRANSITION = a TOPIC SHIFT • [Figure: pipeline from dialogue transcripts through clustering and cluster-specific LMs to the HMM]
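Barzilay & Lee's full content model (cluster-specific language models plus an HMM trained with EM-like re-assignment) does not fit in a few lines, so the sketch below is only a heavily simplified stand-in for the clustering step: cluster contributions and treat a change of cluster between consecutive contributions as a candidate topic shift. The data and parameter choices are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

contributions = [
    "are you done reading the rankine cycle chapter",
    "yes the rankine cycle section was short",
    "ok now open the simulator",
    "set the turbine inlet pressure in the simulator",
]

# Cluster contributions (a crude stand-in for the content-model states c1, c2, c3, ...)
X = TfidfVectorizer().fit_transform(contributions)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# A change of cluster between consecutive contributions = candidate topic shift
for i in range(1, len(labels)):
    if labels[i] != labels[i - 1]:
        print(f"candidate topic shift before contribution {i + 1}")
```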

  24. MUSELI • Integrates multiple sources of evidence of topic shift • Features: • Lexical Cohesion (via cosine correlation) • Time lag between contributions • Unigrams (previous and current contribution) • Bigrams (previous and current cont.) • POS Bigrams (previous and current cont.) • Contribution Length • Previous/Current Speaker • Contribution of Content Words

  25. MUSELI (cont’d) • Separate the dialogue transcripts (training set) by role: tutor vs. student • For each role: feature selection (chi-squared), then a Naïve Bayes classifier • Segmentation (test set): if Agent = Student, use the prediction from the Student Model; if Agent = Tutor, use the prediction from the Tutor Model • [Figure: parallel tutor and student pipelines from the training transcripts through feature selection and Naïve Bayes to role-specific predictions]
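A schematic sketch of the per-role pipeline above using scikit-learn: chi-squared feature selection feeding a Naïve Bayes classifier, one model per role, with the prediction taken from the model that matches the current speaker. The features here are just unigram counts and the tiny training set is invented; the real MUSELI feature set is the one listed on the previous slide.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def train_role_model(texts, labels):
    """Unigram counts -> chi-squared feature selection -> Naive Bayes."""
    model = make_pipeline(CountVectorizer(), SelectKBest(chi2, k=5), MultinomialNB())
    return model.fit(texts, labels)

# Invented training data: 1 = contribution starts a new topic, 0 = it does not
tutor_model = train_role_model(
    ["ok your objective today is rankine cycles", "good job", "now look at the turbine", "that is right"],
    [1, 0, 1, 0],
)
student_model = train_role_model(
    ["ok I opened the simulator", "ya", "what about the condenser pressure", "yah done"],
    [1, 0, 1, 0],
)

def predict_shift(speaker, contribution):
    """Use the prediction from the model that matches the current speaker's role."""
    model = tutor_model if speaker == "tutor" else student_model
    return model.predict([contribution])[0]

print(predict_shift("tutor", "now look at the condenser"))
```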

  26. Experimental Corpora • Olney and Cai (2005) thermo corpus: student/tutor optimization problem, unrestricted interaction, virtually co-present • Our thermo corpus: • Is more terse! • Has fewer contributions! • Has more topics per dialogue! • Strict turn-taking not enforced! • (* p < .005)

  27. Baseline Degenerate Approaches • ALL: every contribution = NEW_TOPIC • EVEN: every nth contribution = NEW_TOPIC • NONE: no NEW_TOPIC
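The three degenerate baselines are easy to state in code; a sketch, with 1 = NEW_TOPIC and 0 = no new topic:

```python
def all_baseline(contributions):
    """ALL: every contribution starts a new topic."""
    return [1] * len(contributions)

def even_baseline(contributions, n):
    """EVEN: every nth contribution starts a new topic."""
    return [1 if i % n == 0 else 0 for i in range(len(contributions))]

def none_baseline(contributions):
    """NONE: no contribution starts a new topic."""
    return [0] * len(contributions)

print(even_baseline(range(10), n=3))  # [1, 0, 0, 1, 0, 0, 1, 0, 0, 1]
```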

  28. Two Evaluation Metrics • A metric commonly used to evaluate topic segmentation algorithms (Olney & Cai, 2005): F-measure, where Precision (P) = # correct predictions / # predictions and Recall (R) = # correct predictions / # boundaries • An additional metric designed specifically for segmentation problems (Beeferman et al., 1999): Pk = Pr(error | k), the probability that two contributions separated by k contributions are misclassified as being in the same vs. different topics • Effective if k = ½ the average topic length
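The F-measure above is the usual boundary-level precision/recall combination; Pk is less familiar, so here is a minimal sketch of it, assuming segmentations are given as boundary-indicator vectors (1 = a new topic starts at that contribution) and defaulting k to half the average reference segment length. The gold and predicted vectors are invented.

```python
def pk(reference, hypothesis, k=None):
    """Beeferman et al. (1999) style Pk: slide a window of width k over the text and count
    disagreements about whether the two window ends lie in the same topic segment."""
    if k is None:
        # k is usually set to half the average reference segment length
        k = max(1, round(len(reference) / (2 * max(1, sum(reference)))))
    errors = 0
    for i in range(len(reference) - k):
        same_ref = sum(reference[i + 1:i + k + 1]) == 0   # no boundary between the two ends
        same_hyp = sum(hypothesis[i + 1:i + k + 1]) == 0
        errors += same_ref != same_hyp
    return errors / (len(reference) - k)

# Hypothetical gold and predicted boundary vectors over 12 contributions
gold = [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
pred = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0]
print(f"Pk = {pk(gold, pred):.3f}")
```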

  29. Experimental Results • [Table: per-approach segmentation results, annotated with whether each approach significantly outperforms none, one, or all three of the degenerate baselines (p < .05)]

  30. Experimental Results • Museli > all other approaches in BOTH corpora (p < .05)

  31. Error Analysis: Lexical Cohesion • Why does lexical cohesion fail? • Lots of regions with zero term overlap • But it may still be useful as one source of evidence • [Figure: similarity plot with long stretches of zero term overlap]

  32. Error Analysis: Barzilay & Lee • Lack of content produces overly skewed cluster size distribution

  33. Dialogue Exchanges • A more coarse-grained discourse unit • Initiation, Response, Feedback (IRF) structure (Stubbs, 1983; Sinclair and Coulthard, 1975) • Example exchange, with each contribution labeled: • T: Hi [initiation] • T: Are you connected? [initiation] • S: Ya. [response] • T: Are you done reading? [initiation] • S: Yah [response] • T: ok [feedback] • T: Your objective today is to learn as much as you can about Rankin Cycles [initiation]

  34. Topic Clustering of Exchanges

  35. Perfect Dialogue Exchanges • [Table: segmentation results using perfect (hand-coded) exchanges compared with their contribution-level counterparts; differences marked at p < .05 and p < .1]

  36. Predicted Dialogue Exchanges • Museli approach for predicting exchange boundaries • F-measure: .5295 • Pk: .3358 • B&L with predicted exchanges: • F-measure: .3291 • Pk: .4154 • An improvement over B&L with contribution, P < .05

  37. Nenkova et al., 2007

  38. My Complaint • Assuming content selection is an important step in summarization, it is reasonable to have measures of that quality even if it is only part of the problem • But these measures can’t really be applied at a later stage – after the content has been transformed • Wouldn’t work at all for non-extractive summarization • Also: No error analysis!!!

  39. Focus of Discussion: Computing the Dependent Measure • Main contrast: ROUGE vs. Pyramid • Both look at overlap between a set of gold standard summaries and a system generated summary • Both weight material repeated in multiple gold standard summaries more highly than material mentioned in fewer • What is different in what is being measured? • Can you think of what would make it hard to see this difference in numbers?
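To make the contrast concrete, here is a simplified ROUGE-1-style recall against multiple reference summaries (not the official ROUGE package): unigram matches are clipped per reference and pooled over all references, so content that appears in several gold summaries naturally counts more. The toy summaries are invented.

```python
from collections import Counter

def rouge1_recall(system, references):
    """Simplified ROUGE-1-style recall: clipped unigram matches summed over all references,
    divided by the total unigram count in the references, so unigrams that appear in several
    gold summaries count more."""
    sys_counts = Counter(system.lower().split())
    matches = total = 0
    for ref in references:
        ref_counts = Counter(ref.lower().split())
        matches += sum(min(c, sys_counts[w]) for w, c in ref_counts.items())
        total += sum(ref_counts.values())
    return matches / total

system = "the tutor explained the rankine cycle"
references = [
    "the tutor explained the rankine cycle to the student",
    "the student learned about the rankine cycle",
    "tutor and student discussed turbine pressure",
]
print(f"{rouge1_recall(system, references):.3f}")
```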

  40. SCUs (Summary Content Units): Do these all mean the same thing?

  41. Computing the Score • Are all summaries of the same size with the same score equally good?
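A minimal sketch of a pyramid-style score, assuming SCU annotation has already been done: each SCU is weighted by how many model summaries express it, and a peer summary's total weight is normalized by the best total achievable with the same number of SCUs. The SCU IDs and weights are invented.

```python
def pyramid_score(peer_scus, scu_weights):
    """Sum of the weights of the SCUs the peer expresses, divided by the maximum weight
    achievable by any summary expressing the same number of SCUs."""
    observed = sum(scu_weights[scu] for scu in peer_scus)
    best_possible = sum(sorted(scu_weights.values(), reverse=True)[:len(peer_scus)])
    return observed / best_possible

# Invented SCU weights (= number of model summaries expressing each SCU)
scu_weights = {"scu1": 4, "scu2": 3, "scu3": 2, "scu4": 2, "scu5": 1}
print(pyramid_score(peer_scus=["scu1", "scu3", "scu5"], scu_weights=scu_weights))
```

Two peers expressing the same number of SCUs can tie on this score while covering different content, which is part of what the slide's question is probing.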

  42. What does this graph mean? Do you believe it?

  43. Which would you expect to be better? • What’s the rationale for the Pyramid method being better? • What’s the rationale for ROUGE being better?

  44. Questions?
