ÖZGÜR BAĞLIOĞLU, SÜLEYMAN KARDAŞ, H. ÇAĞDAŞ ÖCALAN, ERKAN UYAR
Bilkent Information Retrieval Group
Computer Engineering Department
Bilkent University
New Event Detection & Tracking
Outline
• Introduction
  • What is a new event detection and tracking system
  • Motivation
• Related Work
  • TDT
  • Google News
  • NewsInEssence
• Proposed System
  • Test Collection Preparation (TTracker)
  • Novelty Detection & Event Tracking
    • C3M concept
    • Design details
• Future Work
  • Named Entities with NED
• Conclusion
First Event Detection & Event Tracking
Introduction
• Event: bound to a specific time and place
• Topic: a seminal event or activity
• The difference
  • “Computer virus detected at British Telecom, March 3, 1993” is an event
  • “Computer virus outbreaks” is a topic
Introduction
• New event detection is the task of detecting stories about previously unseen events in a stream of news stories.
  • Examples: airplane crashes, earthquakes, government elections, etc.
• Properties of a new event
  • When the event occurred
  • Who was involved
  • Where it took place
  • How it happened
  • Impact, significance, or consequence of the event
Introduction
• An information filtering system uses a long-lived profile of a user’s request to identify relevant material in a stream of arriving documents.
• In contrast, new event detection has no knowledge of what events will happen in the news, so it must operate without a pre-specified query.
• Usage areas of new event detection and tracking (NEDT)
  • Categorization systems
  • People who need to know the latest news: government analysts, financial analysts, stock market traders
  • Distinguishing new mails from previous ones
Related Work
• Topic Detection and Tracking (TDT)
  • Research ongoing since 1997
  • Broadcast news: written and spoken news stories in multiple languages
• Research areas
  • Story Segmentation: detect changes between topically cohesive sections
  • Topic Tracking: keep track of stories similar to a set of example stories
  • Topic Detection: build clusters of stories that discuss the same topic
  • First Story Detection: detect whether a story is the first story of a new, unknown topic
  • Link Detection: detect whether or not two stories are topically linked
Related Work
• Google News
  • A novel approach to news
  • Uses 4,500 English news sources worldwide
  • Groups similar stories together
  • Displays them according to each reader's personalized interests
Related Work
• NewsInEssence
  • Running since 2001
  • Summarizes clusters of related news articles from multiple sources on the Web
  • Developed by the CLAIR group at the University of Michigan
  • Partially funded by the NSF under the ITR program, grant number ITR-0082884
Proposed System
• Handling of Test Data (Milliyet, TRT, Zaman, Haber7, Cnnturk)
  • Distribution of the data among collections
  • Processing the raw data
• Test Collection Preparation (TTracker)
  • Profiles and their properties
  • Sample profiles from the collection
• Novelty Detection & Event Tracking
  • C3M concept
  • Algorithm details
• Future Work
  • Named entities
  • System evaluation
• Conclusion
Handling of Test Data
• Data is collected from 5 different sources:
  • CNN Türk (http://www.cnnturk.com)
  • Haber 7 (http://www.haber7.com)
  • Milliyet Gazetesi (http://www.milliyet.com.tr)
  • TRT (http://www.trt.net.tr)
  • Zaman Gazetesi (http://www.zaman.com.tr)
• News stories from 2005 that carry time stamps (date and time) are crawled from these sources.
Handling of Test Data
• Each source represents a different point of view:
  • CNN Türk: international, American style
  • TRT: governmental, more restrictive
  • Milliyet Gazetesi: modern perspective
  • Zaman Gazetesi: conservative
  • Haber 7: provides variety
• Hence, the different perspectives provide a nice challenge when tracking the news.
Handling of Test Data
• Statistics about the sources:
News Source | No. of News | % of Total News | Average News Length (no. of words)
CNN Türk | 31,919 | 14.2 | 270.57
Haber 7 | 59,304 | 26.3 | 237.85
Milliyet Gazetesi | 72,506 | 32.1 | 218.34
TRT | 19,102 | 8.5 | 120.75
Zaman Gazetesi | 42,749 | 19.0 | 96.76
All | 225,580 | 100.0 | 199.56
• After crawling, HTML tags are stripped from the text using the HTMLParser library.
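The tag-cleaning step can be illustrated with a minimal sketch. The slides do not say which HTMLParser library or language was used, so this uses Python's standard-library html.parser.HTMLParser as a stand-in. The names TagStripper and strip_html are hypothetical.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects text nodes; drops tags and the contents of <script>/<style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Chunks are joined with a space, so a word split by inline markup
        # (e.g. "ha<b>ber</b>") would be separated; acceptable for a sketch.
        return " ".join(" ".join(self._chunks).split())


def strip_html(raw_html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TagStripper()
    parser.feed(raw_html)
    parser.close()
    return parser.text()
```

A real crawler would also have to deal with character encodings and site-specific navigation text, which this sketch ignores.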
Test Collection Preparation (TTracker)
• TTracker is a sub-component for collecting the test and training data semi-automatically.
• It is based on an information retrieval system.
• The system allows annotators to define profiles and their tracking news.
• The system also provides some statistical information about the profiles.
• The success of the system will also be compared with manual tracking.
Test Collection Preparation (TTracker)
• Profile contents are as follows:
  • Topic Title: a one- or two-word definition.
  • Seminal Event: a definition of at most two or three sentences.
  • What: what happened during the event.
  • Who: who was involved in the event.
  • When: when the event occurred.
  • Where: where the event occurred.
  • Topic Size: estimated number of tracking news.
  • Seed: seed document of the event.
  • Event Type: category of the event.
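As an illustration only, the profile fields listed above could be held in a small record type. The class name, field names, field types, and the example values below are all hypothetical; the slides do not describe how TTracker stores profiles.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EventProfile:
    """One TTracker-style profile; field names mirror the slide's list."""
    topic_title: str       # one- or two-word definition
    seminal_event: str     # at most two or three sentences
    what: str
    who: str
    when: str
    where: str
    topic_size: int        # estimated number of tracking news
    seed_doc_id: str       # identifier of the seed document (assumed to be a string)
    event_type: str
    tracking_doc_ids: List[str] = field(default_factory=list)


# Made-up placeholder values, not taken from the actual collection.
example_profile = EventProfile(
    topic_title="Example topic",
    seminal_event="A short description of the seminal event.",
    what="What happened",
    who="Who was involved",
    when="2005",
    where="Where it happened",
    topic_size=100,
    seed_doc_id="doc-000001",
    event_type="example-category",
)
```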
Test Collection Preparation (TTracker)
• The tracking news are defined in five stages:
  • Stage 1: Using the seed document as a query.
  • Stage 2: Using the event profile as a query.
  • Stage 3: Using the tracking news as a query.
  • Stage 4: Creative query searching.
  • Stage 5: Quality control of the profile.
• After these stages are completed, the quality of the profiles is also checked by administrators.
[Figure: flow diagram: Start, Create, Stage 1, Stage 2, Stage 3, Stage 4, Stage 5, Finish]
Test Collection Preparation (TTracker)
• In each stage, annotators can label a news story as “tracking”, “non-tracking”, “not-sure”, or “not-evaluated”.
• Annotators evaluate:
  • 200 documents in the 1st stage,
  • 300 documents in the 2nd stage,
  • 400 documents in the 3rd stage,
  • 200 documents for each query of the 4th stage.
Test Collection Preparation (TTracker)
 | Retrieved | Tracking | Non-Tracking | Not-Sure | Not-Evaluated | Time-Spent
Avg. | 546 | 89 | 378 | 1 | 77 | 130
Min. | 221 | 2 | 14 | 0 | 0 | 20
Max. | 1129 | 454 | 761 | 37 | 614 | 825
• So far, we have collected nearly 60 completed profiles with the valuable contribution of our friends.
• We take extra care to avoid bias in the collection: the number of profiles per person, the event types, and the profile lengths are all kept in balance.
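Test Collection Preparation (TTracker)
• Example profiles and their life-time statistics (columns n=100 to n=10 give the number of tracking news in n days):
Profile No. | News Title | No. of Tracking News | Life-Time (days) | n=100 | n=50 | n=25 | n=10
1 | Sahte Rakı (Counterfeit Rakı) | 329 | 304 | 298 | 273 | 244 | 185
2 | Papa 2. Jean Paul, hastalığı ve ölümü (Pope John Paul II, his illness and death) | 291 | 288 | 287 | 76 | 52 | 34
3 | Suriye’yi Lübnan’dan çıkaran suikast (The assassination that drove Syria out of Lebanon) | 318 | 330 | 221 | 166 | 58 | 6
4 | Kırgızistan’da kadifemsi “devrim” (The velvet-like “revolution” in Kyrgyzstan) | 179 | 270 | 166 | 147 | 138 | 110
5 | Live 8 konserlerinin G-8 zirvesine etkisi (Effect of the Live 8 concerts on the G-8 summit) | 110 | 241 | 94 | 36 | 1 | 0
6 | Fransa’nın AB anayasasını referanduma götürmesi (France putting the EU constitution to a referendum) | 329 | 353 | 99 | 52 | 26 | 14
7 | Özbekistan’da kanla bastırılan isyan (The uprising bloodily suppressed in Uzbekistan) | 231 | 241 | 206 | 188 | 172 | 33
8 | 2005 Eurovision Şarkı Yarışması (2005 Eurovision Song Contest) | 94 | 279 | 53 | 32 | 16 | 1
9 | Formula 1 Türkiye Grand Prix (Formula 1 Turkish Grand Prix) | 141 | 308 | 35 | 15 | 8 | 4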
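Life-time statistics like those in the table can be derived from the story time stamps. The sketch below shows one possible way, under two assumptions the slides do not confirm: life-time is the number of days between the first and last tracking story, and "in n days" means stories published fewer than n days after the first one. The function name and the toy dates are hypothetical.

```python
from datetime import date


def life_time_stats(story_dates, horizons=(100, 50, 25, 10)):
    """Return (life_time_in_days, {n: stories within n days of the first story})."""
    first, last = min(story_dates), max(story_dates)
    life_time = (last - first).days
    counts = {
        n: sum(1 for d in story_dates if (d - first).days < n)
        for n in horizons
    }
    return life_time, counts


# Toy dates for illustration; not taken from the collection.
toy_dates = [date(2005, 1, 1), date(2005, 1, 5), date(2005, 2, 20), date(2005, 6, 1)]
toy_life_time, toy_counts = life_time_stats(toy_dates)
```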
Test Collection Preparation (TTracker)
• Distribution of news over the year for two sample profiles generated with TTracker:
[Figure: two panels, “2005 Eurovision Şarkı Yarışması” (2005 Eurovision Song Contest) and “Sahte Rakı” (Counterfeit Rakı); x axis: days of 2005; y axis: news amount]
Test Collection Preparation (TTracker)
• TTracker is built on an information retrieval system and works semi-automatically.
• TTracker’s recall will be compared with the recall of the manual system (= 1).
• A t-test will be used to measure the correctness of the system.
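The slides do not say which form of the t-test is planned. One plausible reading is a one-sample test of the per-profile TTracker recall values against the manual recall of 1. The sketch below computes only the t statistic for that reading; the function name and the recall values are made up for illustration.

```python
import math
import statistics


def one_sample_t(values, mu0):
    """t statistic for H0: mean(values) == mu0, using the sample standard deviation."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # n - 1 in the denominator
    if sd == 0:
        raise ValueError("all observations are identical; t is undefined")
    return (mean - mu0) / (sd / math.sqrt(n))


# Hypothetical per-profile recall values, not measured results.
toy_recalls = [0.9, 0.8, 1.0, 0.9]
toy_t = one_sample_t(toy_recalls, mu0=1.0)  # compare with a t table at n - 1 = 3 d.o.f.
```

Because recall can never exceed 1, such values are bounded and skewed, so the normality assumption behind the t-test should be checked before relying on it.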
Proposed System
• Novelty Detection & Event Tracking
• Novelty detection is:
  • the identification of new data that a machine learning system was not aware of during training;
  • one of the fundamental requirements of a good classification or identification system.
Proposed System
• A special case of novelty detection...
[Figure: timeline from time 0 to now, showing old news, a first event, its tracking events, and the window]
Proposed System
Cover Coefficient Based Clustering Methodology (C3M) [Can and Ozkarahan, 1990]
• Single-pass seed algorithm
• Working principles:
  • Determining the number of clusters
  • Determining the cluster seeds
  • Assigning the other documents to the clusters initiated by the seeds
• A two-stage probability experiment is performed
Proposed System
• C3M concept
  • Example D (document-term) and C (cover coefficient) matrices
  • $C_{ij} = \alpha_i \sum_{k=1}^{m} d_{ik}\, \beta_k\, d_{jk}$
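The cover coefficient formula can be tried on a small matrix. The slide does not define α and β. The sketch below assumes, following Can and Ozkarahan (1990), that α_i is the reciprocal of the i-th row sum of D and β_k the reciprocal of the k-th column sum, and that the number of clusters is estimated by the sum of the diagonal entries C_ii. The toy matrix D is made up for illustration.

```python
def cover_coefficients(D):
    """Return the n x n matrix C with C[i][j] = alpha_i * sum_k D[i][k] * beta_k * D[j][k].

    D is an n x m document-term matrix with no all-zero row or column.
    """
    n, m = len(D), len(D[0])
    alpha = [1.0 / sum(row) for row in D]                            # 1 / row sum
    beta = [1.0 / sum(D[i][k] for i in range(n)) for k in range(m)]  # 1 / column sum
    return [
        [alpha[i] * sum(D[i][k] * beta[k] * D[j][k] for k in range(m)) for j in range(n)]
        for i in range(n)
    ]


# Toy binary document-term matrix: 3 documents, 3 terms.
D = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]
C = cover_coefficients(D)
# Assumed C3M estimate of the number of clusters: the sum of the diagonal.
n_c = sum(C[i][i] for i in range(len(C)))
```

For this D the first row of C is [0.75, 0.25, 0.0]. Each row of C sums to 1, which is a quick sanity check on any implementation.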
Proposed System
• NEDT using the C3M concept:
• The threshold value δW (for new event detection) depends on:
  • Window size
  • Cii of the incoming event
  • Cij of the incoming event to the other events in the window
• δG depends on:
  • Cluster centroid similarity (CIJ)
  • Cii of the incoming event
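The slides name the quantities the window threshold δW depends on but do not give the decision rule. The sketch below is one hypothetical rule, not the authors' method: a story is labeled a new event when its largest Cij to any story in the window is below δW times its own Cii. Documents are non-empty sets of terms, α and β are assumed to be reciprocal row and column sums, and all names and the toy stream are made up.

```python
from collections import Counter, deque


def cover_row(docs, i):
    """Return [C_i0, C_i1, ...] for binary documents given as non-empty term sets."""
    doc_freq = Counter(term for doc in docs for term in doc)  # column sums
    alpha_i = 1.0 / len(docs[i])                              # 1 / row sum
    return [
        alpha_i * sum(1.0 / doc_freq[t] for t in docs[i] & other)
        for other in docs
    ]


def detect_new_events(stream, window_size, delta_w):
    """Label each story "new" or "tracking" using a sliding window of recent stories."""
    window = deque(maxlen=window_size)
    labels = []
    for doc in stream:
        if not window:
            labels.append("new")  # nothing to compare against yet
        else:
            docs = list(window) + [doc]
            row = cover_row(docs, len(docs) - 1)
            c_ii, best_c_ij = row[-1], max(row[:-1])
            # Hypothetical rule: weak coupling to every windowed story means new event.
            labels.append("new" if best_c_ij < delta_w * c_ii else "tracking")
        window.append(doc)
    return labels


toy_stream = [
    {"deprem", "istanbul"},
    {"deprem", "istanbul", "hasar"},
    {"secim", "oy"},
]
toy_labels = detect_new_events(toy_stream, window_size=2, delta_w=0.4)
```

A fixed δW is used here for simplicity. On the slide the threshold itself varies with the window size and the coefficients, which this sketch does not attempt to model.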
Proposed System
• Two thresholds should be found:
  • In the window
  • In the collection
• A possible selection for the in-window threshold is complicated and was found intuitively through some experimental trials...
• Results are as follows:
Proposed System
• Some experiments will be conducted to improve the threshold using pattern recognition techniques such as:
  • Mixture of Gaussians
  • SVM
  • Decision trees
• Another problem with threshold finding:
  • The dataset is not large enough
  • Only 2 features are available
[Figure: scatter plot; blue dots: new events; green dots: tracking events; x axis: Cii; y axis: Cij]
Future Work
• Improving NED by using named entities
  • Topic-conditioned novelty detection (Yang et al., 2002)
  • A new similarity measure with semantic classes (Makkonen et al., 2002)
  • Modified similarity metrics (Kumaran and Allan, 2004)
  • Using names and topics (Kumaran and Allan, 2005)
Future Work
• Intuition behind named entities:
  • Who, Where, When
  • People, organizations, places, date and time
• How to embed named entities into NED:
  • A new similarity matrix
  • An additional similarity comparison with the extracted named entities
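One simple way to picture the "additional similarity comparison" is to blend a term-based similarity with the overlap of the two stories' named entity sets. This is only a sketch under stated assumptions: the entities are taken as already extracted, Jaccard overlap and the linear weighting are illustrative choices not named on the slides, and the function names and the weight parameter are hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap of two collections; 0.0 when both are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def combined_similarity(term_sim, entities_a, entities_b, weight=0.5):
    """Blend a term-based similarity in [0, 1] with named entity overlap."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    return (1.0 - weight) * term_sim + weight * jaccard(entities_a, entities_b)


# Toy entity sets (who / where) for two stories.
toy_score = combined_similarity(0.6, {"Papa", "Vatikan"}, {"Papa", "Roma"}, weight=0.5)
```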
Future Work
• Evaluation of the NED
  • Judging documents:
    • Random documents are selected from different categories
    • Annotators judge the documents
    • The same documents are processed by our system
  • Finally, the system is evaluated in terms of precision and recall against the annotators’ judgements
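Precision and recall against the annotators' judgements can be computed as in the minimal sketch below. The function name and the document identifiers are hypothetical, and returning 0.0 for an empty set is a convention chosen here, not something the slides specify.

```python
def precision_recall(system_ids, judged_relevant_ids):
    """Return (precision, recall) of the system output against annotator judgements."""
    system, relevant = set(system_ids), set(judged_relevant_ids)
    true_positives = len(system & relevant)
    precision = true_positives / len(system) if system else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Toy example: the system flags four documents, the annotators marked three.
toy_precision, toy_recall = precision_recall(
    ["d1", "d2", "d3", "d4"],
    ["d2", "d3", "d5"],
)
```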
Future Work
• Developing an effective, real-time Web application capable of:
  • detecting new events
  • tracking old ones
Conclusion
• We covered:
  • New event detection and tracking concepts
  • Test collection preparation
  • Details of the designed system
• Goals:
  • Carry out leading research on Turkish
  • Make dreams come true in information retrieval
  • “Rising like a sun in the science world” (Fazli Can)
References
• Can F. and Ozkarahan E. A. “Concepts and effectiveness of the cover coefficient based clustering methodology for text databases”. 1990.
• Kumaran G. and Allan J. “Text classification and named entities for new event detection”. 2004.
• Makkonen J., Ahonen-Myka H., and Salmenkivi M. “Applying semantic classes in event detection and tracking”. 2002.
• Yang Y., Zhang J., Carbonell J., and Jin C. “Topic-conditioned novelty detection”. 2002.
Questions?
Thanks for your patience... Any questions?