UMass Amherst at TDT 2003
James Allan, Alvaro Bolivar, Margie Connell, Steve Cronen-Townsend, Ao Feng, FangFang Feng, Leah Larkey, Giridhar Kumaran, Victor Lavrenko, Ramesh Nallapati, and Hema Raghavan
Center for Intelligent Information Retrieval, Department of Computer Science, University of Massachusetts Amherst
What we did • Tasks • Story Link Detection • Topic Tracking • New Event Detection • Cluster Detection
Outline • Rule of Interpretation (ROI) classification • ROI-based vocabulary reduction • Cross-language techniques • Dictionary translation of Arabic stories • Native language comparisons • Adaptive tracking • Relevance models
ROI motivation • Analyzed vector space similarity measures • Failed to distinguish between similar topics • e.g., two “health care” stories from different topics • different locations and individuals • similarity dominated by “health care” terms • drugs, cost, coverage, plan, prescription • Possible solution: first categorize stories • different categories imply different topics (mostly true) • use within-category statistics • “health care” may be less confusing • Rules of Interpretation provide natural categories
ROI intuition • Each document in the corpus is classified into one of the ROI categories in the ROI-tagged corpus • Stories in different ROIs are less likely to be in the same topic • If two stories belong to different ROIs, we should trust their similarity less: simnew(s1,s2) < simold(s1,s2) when ROIs differ, simnew(s1,s2) = simold(s1,s2) when they match
ROI classifiers • Naïve Bayes • BoosTexter [Schapire and Singer, 2000] • Decision tree classifier • Generates and combines simple rules • Features are terms with tf as weights • Used the most likely single class • Also explored using the distribution over all classes • Unable to do so successfully
Training Data for Classification • Experiments: train on TDT-2, test on TDT-3 • Submissions: train on TDT-2 plus TDT-3 • Training data prepared the same way • Stories in each topic tagged with topic’s ROI • Remove duplicate stories (in topics with the same ROI) • Remove all stories with more than one ROI • Worst case: a single story relevant to… • “Chinese Labor Activists” with ROI Legal/Criminal Cases • “Blair Visits China in October” with ROI Political/Diplomatic Mtgs. • “China will not allow Opposition Parties” with ROI Miscellaneous • Experiments with removing named entities for training
Naïve Bayes vs. BoosTexter • Similar classification accuracy • Overall accuracy is the same • Errors are substantially different • Our training results (TDT-3) • BoosTexter beat Naïve Bayes for SLD and NED • BoosTexter used in most tasks for submission • Evaluation results • In Link Detection, Naïve Bayes proved more useful
ROI classes in link detection • Given a story pair and their estimated ROIs • If estimated ROIs are the same, leave score alone • If they are different, reduce score • Reduced to 1/3 of original value based on training runs • Used four different ROI classifiers • ROI-BT, ne: BoosTexter with named entities • ROI-BT, no-ne: BoosTexter without named entities • ROI-NB, ne: Naïve Bayes with named entities • ROI-NB, no-ne: Naïve Bayes without named entities
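The score-adjustment rule above is simple enough to sketch directly. A minimal version in Python, with the 1/3 penalty factor taken from the training runs mentioned on the slide (function and parameter names are illustrative, not from the paper):

```python
def adjust_link_score(score, roi_a, roi_b, penalty=1.0 / 3.0):
    """Reduce a story-pair similarity score when the two stories'
    estimated ROIs differ; leave it unchanged when they match.
    The 1/3 penalty follows the value tuned on training runs."""
    return score if roi_a == roi_b else score * penalty
```

In the submitted systems this adjustment was applied on top of each of the four ROI classifiers listed above.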
Training effectiveness (TDT-3) • Story Link Detection • Minimum normalized cost
Evaluation results • Story link detection
ROI for tracking • Compare story to centroid of topic • Built from training stories • If ROIs do not match, reduce the score based on how severe the mismatch is • Used the ROI-BT, ne classifier only
Training for tracking • Topic tracking on TDT-3 • Minimum normalized cost • ROI BoosTexter with named entities only
Evaluation results • Topic tracking on TDT-3 • Minimum normalized cost • ROI BoosTexter with named entities only
ROI-based vocabulary pruning • New Event Detection only • Create a “stop list” for each ROI • 300 most frequent terms in stories within that ROI • Obtained from the TDT-2 corpus • When a story is classified into an ROI… • Remove those terms from the story’s vector • ROI determined by the BoosTexter classifier
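The per-ROI stop lists described above can be sketched as follows (a sketch under assumed data shapes; function names and the `(roi, terms)` corpus format are illustrative):

```python
from collections import Counter

def build_roi_stoplists(corpus, top_n=300):
    """Build a per-ROI stop list: the top_n most frequent terms among
    the stories classified into that ROI (300 on the slide, from TDT-2).
    corpus: iterable of (roi_label, list_of_terms) pairs."""
    counts = {}
    for roi, terms in corpus:
        counts.setdefault(roi, Counter()).update(terms)
    return {roi: {t for t, _ in c.most_common(top_n)}
            for roi, c in counts.items()}

def prune_vector(terms, roi, stoplists):
    """Drop the terms on the story's ROI stop list from its vector."""
    stop = stoplists.get(roi, set())
    return [t for t in terms if t not in stop]
```

Pruning happens only after the story has been assigned an ROI, so the same term can survive in one category and be removed in another.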
New Event Detection approach • Cosine similarity measure • ROI-based vocabulary pruning • Score normalization • Incremental IDF • Remove short documents • Preprocessing • Train BoosTexter on TDT-2 & TDT-3 • Include named entities while training
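The cosine-with-incremental-IDF component of the NED approach can be sketched as below. This is a minimal illustration, not the submitted system: the class and smoothing constants are assumptions, and it omits the score normalization and short-document filtering listed above.

```python
import math
from collections import Counter

class IncrementalIdfCosine:
    """Cosine similarity over tf-idf vectors, where document frequencies
    are updated incrementally as stories arrive (a sketch)."""

    def __init__(self):
        self.df = Counter()   # document frequency per term
        self.n_docs = 0

    def add(self, terms):
        """Register one arriving story's terms in the IDF statistics."""
        self.n_docs += 1
        self.df.update(set(terms))

    def _vec(self, terms):
        tf = Counter(terms)
        # add-one smoothing so unseen terms do not divide by zero
        return {t: tf[t] * math.log((self.n_docs + 1) / (self.df[t] + 1))
                for t in tf}

    def cosine(self, a_terms, b_terms):
        a, b = self._vec(a_terms), self._vec(b_terms)
        dot = sum(a[t] * b.get(t, 0.0) for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0
```

Because the IDF is incremental, the same pair of stories can score differently depending on how much of the stream has been seen when they are compared.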
NED Results • TDT-3 • TDT-4
ROI Conclusions • Both uses of ROI helped in training • Score reduction for ROI mismatch • Tracking and link detection • Vocabulary pruning for new event detection • Score reduction failed in evaluation • Named entities important in ROI classifier • TDT-4 has a different set of entities (time gap) • Possible overfitting to TDT-3? • Preliminary work applying ROI to detection • Unsuccessful to date
Outline • Rule of Interpretation (ROI) classification • ROI-based vocabulary reduction • Cross-language techniques • Dictionary translation of Arabic stories • Native language comparisons • Adaptive tracking • Relevance models
Comparing multilingual stories • Baseline • All stories converted to English • Using provided machine translations • New approaches • Dictionary translation of Arabic stories • Native language comparisons • Adaptation in tracking
Dictionary Translation of Arabic • Probabilistic translation model • Each Arabic word has multiple English translations • Obtain P(e|a) from the UN Arabic–English parallel corpus • Forms an English pseudo-story representing the Arabic story • Can get large due to multiple translations per word • Keep the English words whose summed probabilities are the greatest
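The pseudo-story construction above can be sketched as: sum P(e|a) over the story's Arabic words, then keep the top-scoring English words. Function name, dictionary layout, and the `keep` cutoff are assumptions for illustration; the slide does not specify how many words were retained.

```python
def translate_story(arabic_terms, p_e_given_a, keep=100):
    """Form an English pseudo-story from an Arabic story.
    p_e_given_a: dict mapping arabic_word -> {english_word: P(e|a)},
    estimated from a parallel corpus (UN Arabic-English on the slide).
    Keeps the `keep` English words with the greatest summed probability."""
    scores = {}
    for a in arabic_terms:
        for e, p in p_e_given_a.get(a, {}).items():
            scores[e] = scores.get(e, 0.0) + p
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [e for e, _ in ranked[:keep]]
```

The truncation step is what keeps the pseudo-story from growing without bound when every Arabic word contributes several translations.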
Language-specific comparisons • Language representations: • Arabic: CP1256 encoding and light stemming • English: stopped and stemmed with kstem • Chinese: segmented if necessary, then represented as overlapping bigrams • Linking task: • If both stories are in the same language, compare in that language • All other comparisons done with all stories translated into English
Adaptation in tracking • Stories added to the topic when their similarity score is high • Establish a topic representation in each language as soon as a story in that language is added • Similarity of an Arabic story then compared to the Arabic topic representation, etc.
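A minimal sketch of the per-language adaptation above (names and the Counter-based topic representation are assumptions; the actual system's representation and threshold are not specified on the slide):

```python
from collections import Counter

def adapt_topic(topic_reps, story_terms, lang, score, threshold):
    """When a story scores above the adaptation threshold, add its terms
    to the topic representation for its language, creating that
    representation the first time a story in the language is added.
    topic_reps: dict mapping language -> Counter of topic terms."""
    if score > threshold:
        topic_reps.setdefault(lang, Counter()).update(story_terms)
    return topic_reps
```

Once an Arabic representation exists, later Arabic stories are scored against it rather than against the English one, matching the last bullet above.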
Cross-Lingual Link Detection Results Translation conditions: • 1DcosIDF: baseline; all stories in English using the provided translations • UDcosIDF: all stories in English, but using the dictionary translation of Arabic • 4DcosIDF: compare a pair of stories in their native language if both are in the same language; otherwise compare them in English using the dictionary translation of Arabic
Cross-Lingual Topic Tracking Results (required condition: Nt=1, bnman) Translation conditions: • 1DcosIDF: baseline • UDcosIDF: dictionary translation of Arabic • 4DcosIDF: comparing a pair of stories in their native language • ADcosIDF: baseline plus adaptation; add a story to the centroid vector if its similarity score exceeds the adaptation threshold; the vector is limited to its top 100 terms, and at most 100 stories can be added to the centroid
Cross-Lingual Topic Tracking Results (alternate condition: Nt=4, bnasr) • Translation conditions: • 1DcosIDF: baseline • UDcosIDF: dictionary translation of Arabic • 4DcosIDF: comparing a pair of stories in their native language • ADcosIDF: baseline plus adaptation
Outline • Rule of Interpretation (ROI) classification • ROI-based vocabulary reduction • Cross-language techniques • Dictionary translation of Arabic stories • Native language comparisons • Adaptive tracking • Relevance models
Relevance Models for SLD • Relevance Model (RM): “model of stories relevant to a query” • Algorithm: • Given stories A, B • compute “queries” QA and QB • estimate relevance models P(w|QA) and P(w|QB) • compute divergence between the relevance models
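The final step above compares two word distributions. The slide does not name the divergence used; KL divergence is the usual choice with relevance models, so a hedged sketch of that comparison (the smoothing constant `eps` is an assumption to guard against zero probabilities):

```python
import math

def kl_divergence(p, q, eps=1e-10):
    """KL(P || Q) between two word distributions given as
    {word: probability} dicts over a shared vocabulary.
    eps-smoothes words missing from either distribution."""
    vocab = set(p) | set(q)
    return sum(p.get(w, eps) * math.log(p.get(w, eps) / q.get(w, eps))
               for w in vocab)
```

A symmetric variant (averaging KL in both directions) is another common choice when neither story is privileged as the "query".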
Relevance Models for Tracking • Initialize: • set P(M|Q) = 1/Nt if M is a training doc • compute the relevance model as before • For each incoming story D: • score = divergence between P(w|D) and the RM • if score > threshold, add D to the training set and recompute the RM • allow no more than k adaptations
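The adaptive loop above can be sketched generically; the relevance-model estimation and scoring are passed in as callables since the slide gives only the control flow, and all names here are illustrative:

```python
def track_topic(training_docs, stream, score_fn, update_rm, threshold, k):
    """Adaptive tracking sketch: start from the Nt training docs, score
    each incoming story against the current relevance model, and adapt
    (re-estimate the RM) for at most k stories that exceed the threshold.
    score_fn(rm, doc) -> float; update_rm(docs) -> relevance model."""
    docs = list(training_docs)
    rm = update_rm(docs)
    adaptations = 0
    decisions = []
    for d in stream:
        on_topic = score_fn(rm, d) > threshold
        decisions.append((d, on_topic))
        if on_topic and adaptations < k:
            docs.append(d)          # add D to the training set
            rm = update_rm(docs)    # recompute the RM
            adaptations += 1
    return decisions
```

Capping adaptations at k limits both the cost of re-estimating the RM and the damage from adapting on a false alarm.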
Conclusions • Rule of Interpretation (ROI) classification • ROI-based vocabulary reduction • Cross-language techniques • Dictionary translation of Arabic stories • Native language comparisons • Adaptive tracking • Relevance models