Geo-spatial Event Detection in the Twitter Stream
Michael Kaisser, AGT International
Berlin Buzzwords, June 3, 2013

Outline
• Introduction & Context
  • Social Media Analysis in a C2 Center
• The “Avalanche” event detection approach
  • Identify posting “hot spots”
  • Evaluate post clusters with a Machine Learning approach
• Evaluation
• Future work

Background: Social Data
• Social Media continuously creates massive amounts of data
  • E.g. 500 million tweets each day: ~300 GB raw data
• Nature of the data:
  • time-stamped
  • textual (many languages, lingo and slang; spelling mistakes are rife; only a few words per tweet)
  • links to pictures
  • links to newspaper articles (more text)
  • sometimes geo-spatial (contains coordinates)
• Creating real, actionable insights from this isn’t an easy problem
• This talk gives one specific example of how this can be done

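As a rough sanity check on the volume figures above, the following sketch derives the average raw size per tweet and the average arrival rate. It assumes 1 GB = 10^9 bytes and a uniform rate over the day; neither assumption comes from the talk.

```python
TWEETS_PER_DAY = 500_000_000
RAW_BYTES_PER_DAY = 300 * 10**9   # assumption: 1 GB = 10^9 bytes
SECONDS_PER_DAY = 24 * 60 * 60

# Average raw payload per tweet (text plus metadata)
bytes_per_tweet = RAW_BYTES_PER_DAY / TWEETS_PER_DAY

# Average arrival rate, assuming tweets are spread evenly over the day
tweets_per_second = TWEETS_PER_DAY / SECONDS_PER_DAY

print(f"{bytes_per_tweet:.0f} bytes per tweet")      # 600
print(f"{tweets_per_second:.0f} tweets per second")  # 5787
```
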
Use case: Urban Management & Public Safety
• Cities today are complex and need to be organized
• Administration is responsible for keeping the population safe
  • emergency services
  • health services
  • fire fighters
  • police
[Figure label: Command & Control Center]

Urban Management & Public Safety
• Why is Social Media relevant in this context?
  • “There's a plane in the Hudson. I'm on the ferry going to pick up the people. Crazy”
  • “De tering, wat een hel!!! 1,4 miljoen mensen op dat terrein! #loveparade” (Dutch; roughly: “Damn, what a hell!!! 1.4 million people on that site!”)
  • “#Hoboken is on fire. Building above Hoboken Farm Corporation at 300 Washington is all smoked out”
• Social Media can help create a situational awareness picture

How is it done?
• Two-step approach:
  • Identify locations with high tweet activity
    • Collect geo-spatial tweet clusters
  • Evaluate clusters with a Machine Learning approach
    • Do these clusters constitute a real-world event that the tweeters are witnessing first-hand?
• Work in progress:
  • Classify events according to type

Machine Learning – What is the task?
[Figure legend: each marker = one geo-located Social Media post (tweet)]

Machine Learning – What is the task?
Good:
• Suspicious package in #GrandCentral #NYC #bomb threat possibility not sure?? http://t.co/VwU7SP3X
• Suspicious package found in Grand Central Station... the 456 train..the trains are closed !! [pic]: http://t.co/9YPki4k2
• Something happened in the #456 #trainstation in #GrandCentral #NYC http://t.co/GGKvQura
• Accident on the #456train in #midtown #NYC http://t.co/fj2mJJmf
vs. Bad:
• RT @refinery29: This image of Madeleine Albright playing the drums will be the best thing you'll see today: http://t.co/rGwQ5RdG
• «@_PrettyPoison Guess ill fill out more job apps today» make punna fill out some 2!
• The Glamour & Glitz at the 2012 Emmy' s that we loved! http://t.co/CiTFszfL
• @IszwanieSyahira: i'm happy and i hope u feel the same too. weeeee ~.~
• How to prepare yourself for Friday's apocalypse http://cnet.co/lPU
We need to automatically determine which of the tweet clusters (tweets issued close to each other in a short time frame) represent real-world events and which are just random chatter.

Architecture
• We look for geo-spatial clusters of tweets (e.g. 3 or more tweets in a 200 m radius, posted within 30 minutes)
• These become “event candidates”
• Event candidates are evaluated with a Machine Learning scheme
• We currently use C4.5 decision trees

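The candidate search described above could be sketched roughly as follows. This is a naive illustration, not the implementation from the talk: the `Tweet` class, `haversine_m` and `event_candidates` are hypothetical names, and the rule "within 200 m of a seed tweet and at most 30 minutes after it" is only one possible reading of the slide's thresholds.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


@dataclass(frozen=True)
class Tweet:  # hypothetical minimal record, not a Twitter API type
    user: str
    text: str
    lat: float
    lon: float
    timestamp: float  # seconds since epoch


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in metres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def event_candidates(tweets, radius_m=200.0, window_s=30 * 60, min_tweets=3):
    """Return clusters of tweets near a seed tweet in space and time."""
    ordered = sorted(tweets, key=lambda t: t.timestamp)
    candidates = []
    for i, seed in enumerate(ordered):
        cluster = [seed]
        for other in ordered[i + 1:]:
            if other.timestamp - seed.timestamp > window_s:
                break  # tweets are time-ordered, so no later tweet can match
            if haversine_m(seed.lat, seed.lon, other.lat, other.lon) <= radius_m:
                cluster.append(other)
        if len(cluster) >= min_tweets:
            candidates.append(cluster)
    return candidates
```

The pairwise scan is quadratic in the worst case and can return overlapping candidates for the same incident; a production system would presumably use a spatial index and merge overlapping clusters.
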
Machine Learning - Features
Tweet cluster:
• Suspicious package in #GrandCentral #NYC #bomb threat possibility not sure?? http://t.co/VwU7SP3X
• Suspicious package found in Grand Central Station... the 456 train..the trains are closed !! [pic]: http://t.co/9YPki4k2
• Something happened in the #456 #trainstation in #GrandCentral #NYC http://t.co/GGKvQura
• Accident on the #456train in #midtown #NYC http://t.co/fj2mJJmf

Scalable Machine Learning … with Weka!
[Figure legend: blue = training, green = runtime]
In offline ML, we train once but may use the predictive model millions of times a day. Training does not have to be lightning fast, but at prediction time every CPU cycle can count.

Scalable Machine Learning … with Weka!
… which can be optimized further in various ways. See e.g. Nima Asadi, Jimmy Lin, Arjen P. de Vries. Runtime Optimizations for Tree-Based Machine Learning Models. IEEE Transactions on Knowledge and Data Engineering, 2013.

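To illustrate why prediction with a trained tree can be cheap, here is a minimal sketch that stores a decision tree in flat arrays and walks it in a loop. The tree itself is hand-written and hypothetical: the two features and the thresholds are illustrative only and are not the model trained in the talk, and this layout is just one simple option, not necessarily one of the optimizations from the cited paper.

```python
# Hypothetical two-feature tree; features = (unique_posters_score, common_theme_score).
# Node i tests features[FEATURE[i]] <= THRESHOLD[i]; FEATURE[i] == -1 marks a leaf.
FEATURE = [0, 1, -1, -1, -1]
THRESHOLD = [0.5, 0.3, 0.0, 0.0, 0.0]
LEFT = [2, 3, -1, -1, -1]    # next node if the test is true
RIGHT = [1, 4, -1, -1, -1]   # next node if the test is false
LABEL = [None, None, "no event", "no event", "event"]


def predict(features):
    """Walk from the root to a leaf; cost is one comparison per tree level."""
    node = 0
    while FEATURE[node] != -1:
        if features[FEATURE[node]] <= THRESHOLD[node]:
            node = LEFT[node]
        else:
            node = RIGHT[node]
    return LABEL[node]


print(predict((0.9, 0.8)))  # event
print(predict((0.2, 0.9)))  # no event
```

Training cost does not appear here at all: once the arrays exist, each prediction is a handful of comparisons and array lookups.
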
Machine Learning - Evaluation
• Evaluation setup:
  • 1,000 hand-labeled tweet clusters
  • 319 good, 681 bad
  • 10-fold cross validation

Machine Learning - Evaluation
[Figure: scatter plot of the labeled clusters; axes: Common Theme score and Unique Posters score, each from 0 to 1; blue: event, red: no event]

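The evaluation setup (1,000 clusters, 319 good, 681 bad, 10 folds) can be sketched as below. The sketch assumes stratified folds, which the talk does not specify, and omits the classifier itself; `stratified_folds` is a hypothetical helper. It also prints the accuracy of always predicting "bad", which is the floor any reported accuracy should be compared against.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Split indices into k folds, keeping class proportions roughly equal."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin into the folds
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


labels = ["good"] * 319 + ["bad"] * 681
folds = stratified_folds(labels)

for held_out in range(len(folds)):
    test_idx = folds[held_out]
    train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
    # Train on train_idx and evaluate on test_idx here (classifier omitted).
    assert len(train_idx) + len(test_idx) == len(labels)

majority_baseline = labels.count("bad") / len(labels)
print(f"always-'bad' baseline accuracy: {majority_baseline:.1%}")  # 68.1%
```
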
(Somewhat simplified) Summary
• If there are several tweets …
  • from roughly the same location
  • at roughly the same time
  • from different users
  • that nevertheless use the same words
• … chances are good that we have detected an event.

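The "different users" and "same words" criteria could be turned into numeric features along the following lines. The talk does not define its scores, so the definitions here are assumptions: `unique_posters_score` is the share of distinct users in a cluster, and `common_theme_score` is the mean pairwise Jaccard overlap of the tweets' token sets. A cluster is represented as a list of hypothetical `(user, text)` pairs.

```python
import re


def tokens(text):
    """Lower-case word set of a tweet, with URLs stripped."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    return set(re.findall(r"[a-z0-9]+", text))


def unique_posters_score(cluster):
    """Share of distinct users in the cluster (1.0 = every tweet from a different user)."""
    users = [user for user, _ in cluster]
    return len(set(users)) / len(users)


def common_theme_score(cluster):
    """Mean pairwise Jaccard overlap of token sets (0.0 = no shared words)."""
    sets = [tokens(text) for _, text in cluster]
    pairs = [(a, b) for i, a in enumerate(sets) for b in sets[i + 1:]]
    if not pairs:
        return 0.0
    overlaps = [len(a & b) / len(a | b) if (a | b) else 0.0 for a, b in pairs]
    return sum(overlaps) / len(overlaps)


cluster = [("a", "fire in hoboken"), ("b", "hoboken fire downtown")]
print(unique_posters_score(cluster))  # 1.0
print(common_theme_score(cluster))    # 0.5
```
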
Outlook – work in progress and future work
• Derive more coordinates
  • from shared pictures
  • from toponyms in posts
  • use image sharing sites directly
• Make use of posts without coordinates
  • and add them to already existing clusters
• Explore real-time TF-IDF
  • to get rid of the Kardashians & Beliebers
• Evaluate system with real-world data
  • because recall numbers are currently somewhat misleading

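One way the TF-IDF idea could look is sketched below: document frequencies are updated incrementally as tweets arrive, so terms that appear all over the stream receive a low weight. The `StreamingIdf` class is hypothetical, and the smoothed IDF formula ln((1 + N) / (1 + df)) + 1 is a common variant chosen here for illustration, not something the talk specifies.

```python
import math
from collections import Counter


class StreamingIdf:
    """Keeps running document frequencies over the tweets seen so far."""

    def __init__(self):
        self.doc_count = 0
        self.doc_freq = Counter()

    def observe(self, tweet_tokens):
        """Update the statistics with one more tweet."""
        self.doc_count += 1
        self.doc_freq.update(set(tweet_tokens))  # count each term once per tweet

    def idf(self, term):
        """Smoothed inverse document frequency; terms seen everywhere approach 1.0."""
        return math.log((1 + self.doc_count) / (1 + self.doc_freq[term])) + 1.0

    def tfidf(self, tweet_tokens):
        """Weight each term by its relative frequency times its IDF."""
        counts = Counter(tweet_tokens)
        total = sum(counts.values())
        return {term: (n / total) * self.idf(term) for term, n in counts.items()}


stats = StreamingIdf()
for tweet in (["bieber", "concert"], ["bieber", "love"], ["bieber", "fire"]):
    stats.observe(tweet)

weights = stats.tfidf(["bieber", "fire"])
print(weights["fire"] > weights["bieber"])  # True
```

A truly real-time variant might also need a sliding window or decay so that old counts fade out; that is left out of this sketch.
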
Machine Learning – Relevance Feedback (work in progress)
[Figure labels: Machine Learning Model; Good / Bad; Documents (e.g. tweets, post clusters); Users (journalists, C2 operators)]
• Users implicitly rate documents by how they interact with them
  • User performs follow-up actions → relevant
  • User clicks document away → irrelevant
• System learns to present more relevant documents
• System can adapt to changing needs over time

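The feedback loop above could be sketched as an online learner that is updated after each user interaction. The talk does not say which learner is used; logistic regression with stochastic gradient updates is swapped in here purely for illustration, and `FeedbackModel` and the two toy features are hypothetical.

```python
import math


class FeedbackModel:
    """Online logistic regression updated from implicit relevance feedback."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def score(self, features):
        """Estimated probability that a document is relevant."""
        z = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, features, relevant):
        """One gradient step; relevant=True for a follow-up action, False for a dismissal."""
        error = (1.0 if relevant else 0.0) - self.score(features)
        self.bias += self.learning_rate * error
        self.weights = [w + self.learning_rate * error * x
                        for w, x in zip(self.weights, features)]


model = FeedbackModel(n_features=2)
for _ in range(50):
    model.update([1.0, 0.0], relevant=True)    # user followed up on this kind of document
    model.update([0.0, 1.0], relevant=False)   # user clicked this kind of document away

print(model.score([1.0, 0.0]) > 0.5 > model.score([0.0, 1.0]))  # True
```

Because every interaction triggers a small update, the model keeps shifting as user behaviour changes, which is the "adapt to changing needs over time" property from the slide.
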
Image Analysis of shared pictures (work in progress)
[Figure: Example: explosion in an image. Tweet “OMG!!! http://t.co/maiAgHoh”; explosion detected with Image Analysis]
• Problem:
  • Not all tweets contain useful textual information
  • Shared text might be hard to analyze
• Solution:
  • ~35% of tweets contain linked images
  • Images provide a wealth of information that can be analyzed
    • Objects, events, persons
    • coordinates