Detection at Dragon Systems • Paul van Mulbregt, Sheera Knecht, Jon Yamron • Dragon Systems
Outline • Data Sets • Interpolated Models • Targeting Against a Background • English & Mandarin • Word Stemming • Effect of Automatic Boundaries • 1999 System compared to 2000 Systems • Comments on CDet Metric • Conclusions
Data Sets • Two main data sets for experimentation • May/June English 1998 from TDT2. • Trained on January-April 1998 • April/May/June English 1998 from TDT2 (AMJ Data Set) • Trained on January-March 1998 • May/June has only 34 topics, whilst April/May/June has 70 topics (69 after removal of 1 topic all of whose documents are on multiple topics). AMJ has a smaller amount of training data but a larger number of topics, hopefully allowing more informed decisions to be made, so we have been using it for almost all recent experiments.
Interpolated Models For Tracking, interpolation of the unigram model with the background model had been an improvement over backing off to the background model.
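Below is a minimal sketch of the two smoothing strategies being compared, assuming dictionary word counts for the topic model and a precomputed background distribution; the function names and the mixing weight lam are illustrative placeholders, not Dragon's actual implementation.

```python
import math

def interpolated_log_prob(word, topic_counts, bg_probs, lam=0.5):
    """Interpolated model: P(w) = lam * P_topic(w) + (1 - lam) * P_bg(w).
    lam is an illustrative mixing weight, not Dragon's tuned value."""
    total = sum(topic_counts.values())
    p_topic = topic_counts.get(word, 0) / total if total else 0.0
    p_bg = bg_probs.get(word, 1e-9)          # small floor avoids log(0)
    return math.log(lam * p_topic + (1 - lam) * p_bg)

def backoff_log_prob(word, topic_counts, bg_probs):
    """Simplified backoff: use the topic estimate when the word was seen,
    otherwise fall back to the background model (no discounting shown)."""
    total = sum(topic_counts.values())
    if topic_counts.get(word, 0) > 0:
        return math.log(topic_counts[word] / total)
    return math.log(bg_probs.get(word, 1e-9))
```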
Interpolated Models vs Backoff Models • AMJ, English only, Manual Boundaries • Interpolated models appear to be a consistent win over backoff models.
Targeting Input Against Larger Background Corpus • The amount of data in a collection of TDT documents on a particular topic is not large: in Tracking, between 1 and 4 documents; in Detection, a cluster may contain as few as 1 document. • Idea: Target the collection of documents against a much bigger (background) collection of documents. Augment the statistics of the small collection with the statistics of the big collection, and build a model from that.
Targeting (Tracking) • For Tracking, we actually do this. Take the seed documents (from TDT3) and target them against the background collection of documents in TDT2. Each document in TDT2 is assigned a weight, and these weights are then used to construct new counts for the seed collection. • Prob(w) = Sum over all background documents d of weight(d) * Prob(w | d). • Linearly interpolate this distribution with the original distribution from the seed documents. (Interpolate again with the background to avoid the effect of zeros.)
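A rough sketch of this targeting step, assuming each background document is a word-count dictionary and that the document weights have already been computed and normalized to sum to 1; the interpolation weights alpha and beta are illustrative placeholders, not Dragon's tuned values.

```python
def target_against_background(seed_probs, background_docs, doc_weights,
                              alpha=0.5, beta=0.9, bg_probs=None):
    """Targeted distribution  P_t(w) = sum_d weight(d) * P(w | d)  over the
    background documents d, interpolated with the original seed distribution
    (weight alpha) and then with the global background model (weight beta)
    to avoid the effect of zeros."""
    targeted = {}
    for doc, weight in zip(background_docs, doc_weights):
        doc_total = sum(doc.values())               # doc: word -> count
        for w, c in doc.items():
            targeted[w] = targeted.get(w, 0.0) + weight * c / doc_total

    vocab = set(seed_probs) | set(targeted)
    mixed = {w: alpha * seed_probs.get(w, 0.0) + (1 - alpha) * targeted.get(w, 0.0)
             for w in vocab}
    if bg_probs is not None:
        mixed = {w: beta * p + (1 - beta) * bg_probs.get(w, 0.0)
                 for w, p in mixed.items()}
    return mixed
```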
Targeting (Detection) • For Detection, this would involve a large amount of work: every time a cluster changes, target against the background and rebuild the statistics... • Instead, target the incoming documents against the background just once. Interpolate the counts of the document with computed counts from the background corpus. Zeros don't matter, as this is done for the incoming documents. • The hope is that this targeting will bring in background documents with words that didn't occur in the original document, making it easier to pick up documents which discuss the topic. • Since these statistics are dumped into the clusters, it has the effect of providing smoothing to the clusters.
Targeting Data Mixed in with Actual Data using Various Weightings • AMJ, English only, Manual boundaries. • Didn't improve the best performance, but flattened out the graph. • Clearly 100% targeting is very sub-optimal, but a 15% mix was useful.
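A hedged sketch of the kind of mixing swept over here, again assuming word-count dictionaries; the 0.15 default mirrors the mix found useful above, but the function and its details are illustrative rather than Dragon's code.

```python
def mix_targeted_counts(doc_counts, targeted_counts, mix=0.15):
    """Blend targeted background counts into an incoming document's counts.
    mix = 0.0 uses the document alone, mix = 1.0 uses only targeted counts;
    the sweep above varied this fraction."""
    doc_total = sum(doc_counts.values()) or 1.0
    tgt_total = sum(targeted_counts.values()) or 1.0
    vocab = set(doc_counts) | set(targeted_counts)
    # Scale the targeted counts so the mixed "document" keeps its original mass.
    return {w: (1 - mix) * doc_counts.get(w, 0.0)
               + mix * doc_total * targeted_counts.get(w, 0.0) / tgt_total
            for w in vocab}
```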
English vs English & Mandarin • TDT 2000 DryRun. • Very noisy graph. Few topics in Mandarin, so not much in the way of conclusions to be drawn.
Manual vs Automatic Boundaries • AMJ, English Only • About a 20% degradation for using Automatic Boundaries.
Stemming vs no Stemming • AMJ, English only, Manual Boundaries. • Stemming may make the graph a little less noisy, but...
Manual vs Automatic, Stemmed vs No-Stemmed • AMJ, English, Stem and No-Stem, Manual and Automatic Boundaries. • Stemming may help for manual boundaries, but appears a little worse for Automatic Boundaries.
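For concreteness, a small sketch of what the stemmed vs. unstemmed tokenizations might look like, using NLTK's Porter stemmer as a stand-in for whatever stemmer was actually used in these experiments.

```python
# Illustrative stemming of a tokenized story; PorterStemmer is a stand-in
# for the stemmer actually used in the experiments above.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def tokenize(text, stem=False):
    tokens = text.lower().split()
    return [stemmer.stem(t) for t in tokens] if stem else tokens

print(tokenize("bombings reported at the embassies", stem=False))
# e.g. ['bombings', 'reported', 'at', 'the', 'embassies']
print(tokenize("bombings reported at the embassies", stem=True))
# e.g. ['bomb', 'report', 'at', 'the', 'embassi']
```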
1999 System vs 2000 Systems • AMJ, English only. 1999 system: TFKLB backoff. 2000 systems: Dragon1 = TFKLB interpolated, mixed 15% with the targeted background; Dragon2 = TFKLB interpolated. • Dragon1 and Dragon2 are both better than the 1999 system. Dragon1 is flatter than Dragon2, but not necessarily better.
2000 Results, Dragon2 System, Manual Boundaries, Interpolated. • Suffer a performance loss on English Detection by including Mandarin documents. • Performance on Newswire and BN seems comparable. • Reporting results on subsets doesn’t make much sense for Detection, especially language specific subsets. (For Tracking without adaptation, this should not be an issue.)
2000 Evaluation Numbers • [Table of evaluation scores with relative changes: 23%, 19%, -10%, 56%, 21%, -25%] • Performance is better on Mandarin with Automatic boundaries than with Manual boundaries. I don't actually believe that this should be the case! • About a 20% reduction in performance due to using Automatic boundaries.
Why so non-continuous? • One cluster can split into two clusters, or can lose half its documents. • A small change in the number of correct documents leads to a big change in the score. The de-emphasis on False Alarms in the evaluation measure means the smaller, "purer" cluster is regarded as being 9 times worse than the other cluster.
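A numerical illustration of how sharply the score can move when a cluster splits, assuming the standard TDT cost parameters (CMiss = 1.0, CFA = 0.1, Ptarget = 0.02); the story counts below are invented for illustration only.

```python
# Normalized CDet with invented story counts and standard TDT costs.
C_MISS, C_FA, P_TARGET = 1.0, 0.1, 0.02

def norm_cdet(n_on_topic_found, n_on_topic_total, n_false_alarms, n_off_topic_total):
    p_miss = 1.0 - n_on_topic_found / n_on_topic_total
    p_fa = n_false_alarms / n_off_topic_total
    cdet = C_MISS * p_miss * P_TARGET + C_FA * p_fa * (1 - P_TARGET)
    return cdet / min(C_MISS * P_TARGET, C_FA * (1 - P_TARGET))

# One cluster containing all 10 on-topic stories plus 20 spurious ones:
print(norm_cdet(10, 10, 20, 1000))   # ~0.098
# After a split, the mapped cluster keeps only 5 on-topic stories but is "purer":
print(norm_cdet(5, 10, 2, 1000))     # ~0.51
```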
YDZ Metric? • YDZ seems conceptually a reasonable way of measuring goodness of fit.
However... • In practice, however, it seems to have two problems: • No minimum value. In fact, it appears to be linear across a wide range of numbers of clusters. Even for choices of CMiss where it is not linear, the metric is not particularly discriminatory. (Assuming, of course, that our system is producing outputs that do in fact have some difference.) • The sign of the linear coefficient depends on the size of CMiss. Same issue as with CDet: what is a realistic use of the technology, and how to measure performance on that task? • Or one can just not spend time tuning for an evaluation -- just concentrate on improving the algorithm and lowering the whole graph.
Miss-False Alarms on a DET plot, with Level Curves of CDet • AMJ, English only, Manual Boundaries. • The number of generated clusters varies from 16 (at the far right) to 4640 (at the far left) with the level curve intersections corresponding to 633 and 2204 clusters.
Why so little change in CDet? • When sweeping over a wide range of thresholds, with the number of clusters changing by a factor of more than 10, why is there so little change in the value of CDet? • We find it hard to believe that 200 clusters are as useful as 2600 clusters. • Is it our (Dragon's) distance measure? Is this phenomenon restricted to one site, or does it occur across all sites? • The discontinuities lead to wondering whether reported score differences are actually significant.
Overall Conclusions • Interpolation is better than backoff as a smoothing method. • Mixing in targeted data is one approach to bringing in outside information; it helped to smooth out performance but not to improve it. • Stemming may also smooth out performance without providing any overall gain. • Questions about the CDet metric still remain. • Breaking out scores by subset does not make much sense for Detection.