Principles of Information Retrieval
Lecture 16: Filtering & TDT
Prof. Ray Larson, University of California, Berkeley, School of Information
Overview • Review • LSI • Filtering & Routing • TDT – Topic Detection and Tracking
How LSI Works • Start with a matrix of terms by documents • Analyze the matrix using SVD to derive a particular “latent semantic structure model” • Two-Mode factor analysis, unlike conventional factor analysis, permits an arbitrary rectangular matrix with different entities on the rows and columns • Such as Terms and Documents
How LSI Works • The rectangular matrix is decomposed by SVD into three other matrices of a special form • The resulting matrices contain “singular vectors” and “singular values” • The matrices show a breakdown of the original relationships into linearly independent components or factors • Many of these components are very small and can be ignored, leading to an approximate model that contains many fewer dimensions
How LSI Works Titles
C1: Human machine interface for LAB ABC computer applications
C2: A survey of user opinion of computer system response time
C3: The EPS user interface management system
C4: System and human system engineering testing of EPS
C5: Relation of user-perceived response time to error measurement
M1: The generation of random, binary, unordered trees
M2: The intersection graph of paths in trees
M3: Graph minors IV: Widths of trees and well-quasi-ordering
M4: Graph minors: A survey
Italicized words occur in multiple docs and are indexed.
How LSI Works
Term-document matrix (cell = number of occurrences of the term in the document):

            c1  c2  c3  c4  c5  m1  m2  m3  m4
Human        1   0   0   1   0   0   0   0   0
Interface    1   0   1   0   0   0   0   0   0
Computer     1   1   0   0   0   0   0   0   0
User         0   1   1   0   1   0   0   0   0
System       0   1   1   2   0   0   0   0   0
Response     0   1   0   0   1   0   0   0   0
Time         0   1   0   0   1   0   0   0   0
EPS          0   0   1   1   0   0   0   0   0
Survey       0   1   0   0   0   0   0   0   1
Trees        0   0   0   0   0   1   1   1   0
Graph        0   0   0   0   0   0   1   1   1
Minors       0   0   0   0   0   0   0   1   1
How LSI Works
[Figure: two-dimensional plot of the 12 terms and 9 documents after SVD to 2 dimensions. Blue dots are terms; red squares are documents; the blue square is the query “Human Computer Interaction”. A dotted cone marks cosine .9 from the query; even docs with no terms in common with it (c3 and c5) lie within the cone.]
How LSI Works
X = T0 S0 D0’
• X is the t × d term-document matrix (t terms on the rows, d documents on the columns)
• T0 (t × m) has orthogonal, unit-length columns (T0’ T0 = I)
• D0 (d × m) has orthogonal, unit-length columns (D0’ D0 = I)
• S0 is the m × m diagonal matrix of singular values
• m is the rank of X (m <= min(t, d))
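To make the decomposition concrete, here is a minimal sketch in NumPy (added for illustration; the slides themselves contain no code). It builds the term-document matrix above, truncates the SVD to two dimensions as in the plot, and folds the query “human computer interaction” in as a pseudo-document using the standard Deerwester folding-in form q’ T_k S_k^-1:

```python
# A minimal sketch of LSI with NumPy (illustrative, not from the slides).
# X is the 12 x 9 term-document matrix above; rows follow the term order
# Human ... Minors, columns are c1..c5, m1..m4.
import numpy as np

X = np.array([
    [1,0,0,1,0,0,0,0,0],   # human
    [1,0,1,0,0,0,0,0,0],   # interface
    [1,1,0,0,0,0,0,0,0],   # computer
    [0,1,1,0,1,0,0,0,0],   # user
    [0,1,1,2,0,0,0,0,0],   # system
    [0,1,0,0,1,0,0,0,0],   # response
    [0,1,0,0,1,0,0,0,0],   # time
    [0,0,1,1,0,0,0,0,0],   # EPS
    [0,1,0,0,0,0,0,0,1],   # survey
    [0,0,0,0,0,1,1,1,0],   # trees
    [0,0,0,0,0,0,1,1,1],   # graph
    [0,0,0,0,0,0,0,1,1],   # minors
], dtype=float)

# X = T0 S0 D0'; numpy returns the singular values S0 as a vector.
T0, S0, D0t = np.linalg.svd(X, full_matrices=False)

k = 2                                 # keep the 2 largest factors, as in the plot
Xk = T0[:, :k] @ np.diag(S0[:k]) @ D0t[:k, :]   # rank-2 approximation of X

docs_2d = D0t[:k, :].T                # each row: one document in the latent space

# Fold in the query "human computer interaction" as a pseudo-document
# ("interaction" is not in the vocabulary, so only two terms remain):
q = np.zeros(12)
q[[0, 2]] = 1.0                       # human, computer
q_2d = (q @ T0[:, :k]) / S0[:k]       # folding-in: q' T_k S_k^-1

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sims = [cos(q_2d, d) for d in docs_2d]  # c3 and c5 score high despite
                                        # sharing no terms with the query
```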
Overview • Review • LSI • Filtering & Routing • TDT – Topic Detection and Tracking
Filtering • Characteristics of Filtering systems: • Designed for unstructured or semi-structured data • Deal primarily with text information • Deal with large amounts of data • Involve streams of incoming data • Filtering is based on descriptions of individual or group preferences – profiles – which may be negative profiles (e.g. junk mail filters) • Filtering implies removing non-relevant material, as opposed to selecting relevant material
Filtering • Similar to IR, with some key differences • Similar to Routing – sending relevant incoming data to different individuals or groups is virtually identical to filtering – with multiple profiles • Similar to Categorization systems – attaching one or more predefined categories to incoming data objects – is also similar, but is more concerned with static categories (might be considered information extraction)
Structure of an IR System (adapted from Soergel, p. 19)
• Storage line: documents & data → indexing (descriptive and subject) → Store 2: document representations
• Search line: interest profiles & queries → formulating the query in terms of descriptors → Store 1: profiles/search requests
• Rules of the game = rules for subject indexing + thesaurus (which consists of a lead-in vocabulary and an indexing language)
• Comparison/matching of the two stores yields potentially relevant documents
Structure of a Filtering System (adapted from Soergel, p. 19)
• Incoming data stream: raw documents & data → indexing/categorization/extraction → document surrogate stream
• Individual or group users → interest profiles → formulating the query in terms of descriptors → Store 1: profiles/search requests
• Rules of the game = rules for subject indexing + thesaurus (which consists of a lead-in vocabulary and an indexing language)
• Comparison/filtering of the surrogate stream against the stored profiles yields potentially relevant documents
Major differences between IR and Filtering • IR concerned with single uses of the system • IR recognizes inherent faults of queries • Filtering assumes profiles can be better than IR queries • IR concerned with collection and organization of texts • Filtering is concerned with distribution of texts • IR is concerned with selection from a static database. • Filtering concerned with dynamic data stream • IR is concerned with single interaction sessions • Filtering concerned with long-term changes
Contextual Differences • In filtering the timeliness of the text is often of greatest significance • Filtering often has a less well-defined user community • Filtering often has privacy implications (how complete are user profiles? what do they contain?) • Filtering profiles can (should?) adapt to user feedback • Conceptually similar to Relevance feedback
Methods for Filtering • Adapted from IR • E.g. use a retrieval ranking algorithm against incoming documents. • Collaborative filtering • Individual and comparative profiles
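As a concrete illustration of adapting an IR ranking method to filtering, here is a hypothetical sketch that scores each incoming document against stored interest profiles with cosine similarity and delivers matches. The profile contents, names, and threshold are invented for the example:

```python
# Hypothetical sketch of profile-based filtering (not from the slides):
# score each incoming document against stored interest profiles and
# deliver it to users whose profile similarity clears a threshold.
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

profiles = {  # user -> term-frequency profile (illustrative values)
    "alice": Counter({"avalanche": 3, "rescue": 2, "france": 1}),
    "bob":   Counter({"graph": 2, "minors": 2, "trees": 1}),
}

THRESHOLD = 0.2  # assumed; would be tuned per profile in practice

def filter_stream(doc_stream):
    """doc_stream yields (doc_id, text) pairs from the incoming data."""
    for doc_id, text in doc_stream:
        doc_vec = Counter(text.lower().split())
        for user, profile in profiles.items():
            if cosine(doc_vec, profile) >= THRESHOLD:
                yield user, doc_id  # route/deliver this doc to this user
```

With multiple profiles this same loop is routing; with one profile per user it is filtering, matching the equivalence noted earlier.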
TREC Filtering Track • Original Filtering Track • Participants are given a starting query • They build a profile using the query and the training data • The test involves submitting the profile (which is not changed) and then running it against a new data stream • New Adaptive Filtering Track • Same, except the profile can be modified as each new relevant document is encountered. • Since streams are being processed, there is no ranking of documents
TREC-8 Filtering Track • Following Slides from the TREC-8 Overview by Ellen Voorhees • http://trec.nist.gov/presentations/TREC8/overview/index.htm
Overview • Review • LSI • Filtering & Routing • TDT – Topic Detection and Tracking
TDT: Topic Detection and Tracking • Intended to automatically identify new topics – events, etc. – from a stream of text and follow the development/further discussion of those topics
Topic Detection and Tracking
• Introduction and Overview
• The TDT3 R&D Challenge
• TDT3 Evaluation Methodology
Slides from “NIST Topic Detection and Tracking: Introduction and Overview” by G. Doddington
http://www.itl.nist.gov/iaui/894.01/tests/tdt/tdt99/presentations/index.htm
TDT Task Overview*
5 R&D Challenges: Story Segmentation, Topic Tracking, Topic Detection, First-Story Detection, Link Detection
TDT3 Corpus Characteristics:†
• Two types of sources: text and speech
• Two languages: English (30,000 stories) and Mandarin (10,000 stories)
• 11 different sources: 8 English (ABC, CNN, PRI, VOA, NBC, MNB, APW, NYT) and 3 Mandarin (VOA, XIN, ZBN)
* see http://www.itl.nist.gov/iaui/894.01/tdt3/tdt3.htm for details
† see http://morph.ldc.upenn.edu/Projects/TDT3/ for details
Preliminaries
A topic is … a seminal event or activity, along with all directly related events and activities.
A story is … a topically cohesive segment of news that includes two or more DECLARATIVE independent clauses about a single event.
Example Topic Title: Mountain Hikers Lost • WHAT: 35 or 40 young Mountain Hikers were lost in an avalanche in France around the 20th of January. • WHERE: Orres, France • WHEN: January 1998 • RULES OF INTERPRETATION: 5. Accidents
The Segmentation Task: to segment the source stream into its constituent stories, for all audio sources.
[Figure: a transcription (text words, for radio and TV only) divided into story and non-story segments.]
Story Segmentation Conditions • 1 Language Condition: • 3 Audio Source Conditions: • 3 Decision Deferral Conditions:
The Topic Tracking Task: to detect stories that discuss the target topic, in multiple source streams.
• Find all the stories that discuss a given target topic
• Training: given Nt sample stories that discuss a given target topic
• Test: find all subsequent stories that discuss the target topic
[Figure: timeline with Nt on-topic training stories followed by unknown test stories. New this year: the remaining training data is not guaranteed to be off-topic.]
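A simple baseline for this task (an assumption here, not the evaluated systems’ method) builds a centroid from the Nt training stories and makes a YES/NO decision by thresholding each test story’s similarity to it:

```python
# Baseline sketch of topic tracking (an assumed centroid approach):
# build a centroid from the Nt on-topic training stories, then flag
# each subsequent story whose similarity exceeds a threshold.
from collections import Counter
import math

def centroid(stories):
    c = Counter()
    for s in stories:      # each story: a Counter of term frequencies
        c.update(s)
    return c

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def track(training_stories, test_stream, threshold=0.15):  # threshold assumed
    topic = centroid(training_stories)
    for story_id, vec in test_stream:
        yield story_id, cosine(vec, topic) >= threshold     # YES/NO decision
```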
Topic Tracking Conditions • 9 Training Conditions: • 1 Language Test Condition: • 3 Source Conditions: • 2 Story Boundary Conditions:
The Topic Detection Task: to detect topics in terms of the (clusters of) stories that discuss them.
• Unsupervised topic training: a meta-definition of topic is required, independent of topic specifics.
• New topics must be detected as the incoming stories are processed.
• Input stories are then associated with one of the topics.
Topic Detection Conditions • 3 Language Conditions: • 3 Source Conditions: • Decision Deferral Conditions: • 2 Story Boundary Conditions:
The First-Story Detection Task: to detect the first story that discusses a topic, for all topics.
• There is no supervised topic training (like Topic Detection)
[Figure: timeline of stories on Topic 1 and Topic 2; the first story of each topic is marked, the rest are not first stories.]
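First-story detection is often approached with single-pass clustering, which also yields topic detection as a by-product: a story whose best similarity to all existing topic clusters falls below a threshold starts a new cluster and is flagged as a first story. A sketch under that assumption (the threshold and representation are illustrative):

```python
# Sketch of first-story detection via single-pass clustering (a common
# baseline; details are assumptions, not the evaluated systems' methods).
from collections import Counter
import math

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def first_story_detection(stream, threshold=0.1):  # threshold assumed
    clusters = []  # one centroid (Counter) per topic detected so far
    for story_id, vec in stream:
        best = max((cosine(vec, c) for c in clusters), default=0.0)
        if best < threshold:
            clusters.append(Counter(vec))   # new topic detected ...
            yield story_id, True            # ... so this is a first story
        else:
            i = max(range(len(clusters)),
                    key=lambda j: cosine(vec, clusters[j]))
            clusters[i].update(vec)         # fold story into nearest topic
            yield story_id, False
```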
First-Story Detection Conditions • 1 Language Condition: • 3 Source Conditions: • Decision Deferral Conditions: • 2 Story Boundary Conditions:
The Link Detection Task: to detect whether a pair of stories discuss the same topic. • The topic discussed is a free variable. • Topic definition and annotation are unnecessary. • The link detection task represents a basic functionality, needed to support all applications (including the TDT applications of topic detection and tracking). • The link detection task is related to the topic tracking task, with Nt = 1.
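Since the task reduces to a thresholded decision on a single story pair, a minimal sketch needs only a similarity measure; Jaccard overlap of term sets is used here purely for illustration, and the threshold is an assumption:

```python
# Minimal self-contained sketch of link detection: decide whether two
# stories discuss the same topic via term-set overlap (illustrative).
def linked(terms_a: set, terms_b: set, threshold: float = 0.2) -> bool:
    """YES if the story pair appears to discuss the same topic."""
    union = terms_a | terms_b
    if not union:
        return False
    return len(terms_a & terms_b) / len(union) >= threshold
```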
Link Detection Conditions • 1 Language Condition: • 3 Source Conditions: • Decision Deferral Conditions: • 1 Story Boundary Condition:
TDT3 Evaluation Methodology • All TDT3 tasks are cast as statistical detection (yes/no) tasks: • Story Segmentation: Is there a story boundary here? • Topic Tracking: Is this story on the given topic? • Topic Detection: Is this story in the correct topic-clustered set? • First-story Detection: Is this the first story on a topic? • Link Detection: Do these two stories discuss the same topic? • Performance is measured in terms of detection cost, a weighted sum of miss and false alarm probabilities:
CDet = CMiss · PMiss · Ptarget + CFA · PFA · (1 − Ptarget)
• Detection cost is normalized so that a perfect system scores 0 and the better of the two trivial systems (always “yes” or always “no”) scores 1:
(CDet)Norm = CDet / min{CMiss · Ptarget, CFA · (1 − Ptarget)}
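The cost formulas translate directly into code. The constants below (CMiss = 1.0, CFA = 0.1, Ptarget = 0.02) are the values commonly quoted for TDT evaluations; treat them as assumptions rather than something stated on this slide:

```python
# Direct computation of the TDT detection cost and its normalized form,
# per the formulas above. Constant values are assumptions (commonly
# cited for TDT evaluations, not given on this slide).
C_MISS, C_FA, P_TARGET = 1.0, 0.1, 0.02

def detection_cost(p_miss: float, p_fa: float) -> float:
    return C_MISS * p_miss * P_TARGET + C_FA * p_fa * (1 - P_TARGET)

def normalized_cost(p_miss: float, p_fa: float) -> float:
    # Divide by the better trivial system's cost, so "always no" (the
    # cheaper trivial system with these constants) scores exactly 1
    # and a perfect system scores 0.
    return detection_cost(p_miss, p_fa) / min(C_MISS * P_TARGET,
                                              C_FA * (1 - P_TARGET))

# Example: 20% misses, 1% false alarms
print(normalized_cost(0.20, 0.01))  # (0.004 + 0.00098) / 0.02 = 0.249
```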
Example Performance Measures
[Chart: normalized tracking cost (log scale, 0.01 to 1) for English and Mandarin; tracking results on newswire text (BBN).]
More on TDT • Some slides from James Allan from the HICSS meeting in January 2005