Crime/Event Detection on Twitter Data Sciences Summer Institute 2011 Multimodal Information Access & Synthesis, University of Illinois at Urbana-Champaign
Our Team Team members: Elisee Habimana, Jicong Wang, Sridevi Maharaj, Ronald Doku, Mingjia Zhang, Tobias Kin Hou Lei, Ravi Khadiwala, Duber Gomez, Rui Yang Project leaders: Yizhou Sun, Rui Li
Motivation - why Twitter? Real Time Wide Coverage
Motivation - An Example • An earthquake happened in Chile at 03:34 local time, Sat Feb 27, 2010 • Traditional communication almost impossible for 2-3 hours, first video image available 6-7 hours after quake Source: <Information Credibility on Twitter>, by Carlos Castillo et al.
Motivation - Another Example • Tweet posted at 2:22pm, June 28th, 20 minutes after the shot, while first news report appears almost 3 hours later
Motivation • Twitter reshapes the way people spread and receive information • Its real-time nature makes Twitter a good source of breaking news • Official and verified accounts on Twitter provide reliable information • We propose to build a web application that provides reliable, real-time, crime-related information
Table of Contents • Major Challenges • Crime Focused Crawling • Tweet Classification • Event Extraction • Tweet Ranking • Clustering • Tools • Summary
Major Challenges • Most tweet contents are useless for us • Pointless babble – 40% • Conversational – 38% • Pass-along value – 9% • Self-promotion – 6% • Spam – 4% • News – 4% • Crime related - 0.005% • Roughly 10,000 crime related tweets each day • Information like location and time not always explicit • Display only the most important tweets • Present results in an organized fashion Source: <Twitter Study – August 2009> Kelly, Ryan, ed (August 12, 2009)
Crime Focused Crawling: crawling crime-related tweets from Twitter. Presented by Jicong Wang
A Snapshot of Twitter Data USERID 43893075 ID 68542312782905344 TEXT Break shooting scene 1 "No More" with @dindamanda @yuyayuyi http://lockerz.com/s/100883315 LOCATION GeoLocation latitude=-6.196612, longitude=106.829552 PLACE TIME Thu May 12 00:05:35 CDT 2011 URLS url=http://lockerz.com/s/100883315, MentionedEntities: 37623286 66072730 Hashtags: also number of Followers, number of Friends, name of User, etc
NOT ALL TWEETS ARE CRIME RELATED! ONLY about 0.005%!
Iteratively Refining Rules • Repeat the above procedures until an ideal rule is obtained
Problem However, there are STILL many "fake" crime tweets
Refine the Rules • Single Keyword, e.g. crime, kill, death, police, cop, shot • Combination of Keywords, e.g. found shot OR died OR injured OR body; armed OR unarmed robbery • Key Phrases, e.g. police on scene of
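The three rule tiers can be sketched as a small filter. This is only an illustration: the keywords come from the slide, but reading "found shot OR died OR injured OR body" as "found AND (shot OR died OR injured OR body)", and the names `match_level` and `tokenize`, are assumptions.

```python
import re

# Keywords copied from the slide; how the crawler combined them is an assumption.
SINGLE_KEYWORDS = {"crime", "kill", "death", "police", "cop", "shot"}
# Each combination: one required word plus alternatives, any one of which must appear.
COMBINATIONS = [
    ("found", {"shot", "died", "injured", "body"}),
    ("robbery", {"armed", "unarmed"}),
]
KEY_PHRASES = ["police on scene of"]


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def match_level(text):
    """Return the strictest rule tier the tweet satisfies, or None."""
    words = tokenize(text)
    tokens = set(words)
    normalized = " ".join(words)
    if any(phrase in normalized for phrase in KEY_PHRASES):
        return "phrase"
    for required, alternatives in COMBINATIONS:
        if required in tokens and tokens & alternatives:
            return "combination"
    if tokens & SINGLE_KEYWORDS:
        return "single"
    return None
```

A tweet such as "I shot a great photo today" only reaches the "single" tier, which is the kind of "fake" crime tweet that the stricter tiers are meant to filter out.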
Keyword type: proportion of crime related tweets • Single: < 5% • Combination: 50% among results from single keywords Result • Improved crawling result: about 25,000 crawled tweets per day • Over 13,000 users per day
Tweet Classification: determine whether a tweet describes a related event. Presented by Tobias Kin Hou Lei
Features Engineering - Basic features • Concept clusters • Natural disaster: {earthquake, tornado, ...} • Weapon: {weapon, weapons, gun, guns, gunshot, ...} • Injure: {...} • Burglar: {...} • ... • Non-Fire: {hilarious, weather, red, moon, sun, ..., musician, pizza, cook, music, dance, justin bieber} • Can generalize to unseen words, e.g. a model trained on "tornado warning" can still classify "earthquake warning".
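To see why concept clusters help with unseen words, here is a minimal sketch that replaces each word with the name of its cluster. The two clusters are abbreviated from the slide; the function name `cluster_features` and the choice to keep unclustered words as-is are assumptions.

```python
# Two clusters abbreviated from the slide; the real system used about 100.
CONCEPT_CLUSTERS = {
    "natural_disaster": {"earthquake", "tornado"},
    "weapon": {"weapon", "weapons", "gun", "guns", "gunshot"},
}


def cluster_features(text):
    """Map each word to its concept cluster name, keeping other words unchanged."""
    features = set()
    for raw in text.lower().split():
        token = raw.strip(".,!?:;\"'")
        if not token:
            continue
        for name, words in CONCEPT_CLUSTERS.items():
            if token in words:
                features.add(name)
                break
        else:
            # Word belongs to no cluster: keep it as its own feature.
            features.add(token)
    return features
```

Under this mapping "tornado warning" and "earthquake warning" produce the same feature set, so a classifier trained on one has effectively seen the other.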
Traditional Classification Features • Text-only classification • But tweets are short and noisy: • at most 140 characters • contain noisy words • contain URLs and tags
Features Engineering - Social Features • Special tags: • #hpd • #breaking news
Features Engineering - Social Features • User as a feature • List of verified police departments on Twitter • URL • Date • Number
Classification Model • Naive Bayes • A simple model that performs well for online classification. • With many meaningful features and enough training data, different classification models give similar results.
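As a rough illustration of why Naive Bayes suits online classification, the sketch below keeps only counts, so a new labeled tweet is absorbed with a cheap `update` call. It is a generic multinomial Naive Bayes with add-one smoothing, not the team's implementation; the class name, feature names, and toy training examples are hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over string features."""

    def __init__(self):
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocabulary = set()

    def update(self, features, label):
        """Absorb one labeled example; only counters change."""
        self.class_counts[label] += 1
        self.feature_counts[label].update(features)
        self.vocabulary.update(features)

    def log_scores(self, features):
        """Return log P(label) + sum of log P(feature | label) for each label."""
        total = sum(self.class_counts.values())
        scores = {}
        for label, count in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocabulary)
            score = math.log(count / total)
            for feature in features:
                score += math.log((counts[feature] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, features):
        scores = self.log_scores(features)
        return max(scores, key=scores.get)


# Toy, hand-made examples (hypothetical, not the team's labeled data).
model = NaiveBayes()
model.update(["weapon", "robbery", "police"], "positive")
model.update(["natural_disaster", "warning"], "positive")
model.update(["weapon", "injure"], "positive")
model.update(["pizza", "music"], "negative")
model.update(["weather", "dance"], "negative")
```

With these toy counts, `model.predict(["weapon", "police"])` returns `"positive"` and `model.predict(["pizza", "dance"])` returns `"negative"`.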
Training Data • Crawled from Twitter over different time periods • Manually labeled by our team • 2000 samples for training, among them: • 60% positive samples • 40% negative samples • 1000 samples for testing • 65% positive samples • 35% negative samples
Summary • About 100 concept clusters cover different areas of the feature space • Average accuracy on the test set is 83.788%
Event Extraction: extracting event information and grouping. Presented by Ravi Khadiwala
Event Extraction • Within the text of an individual tweet there may be information not previously found through data crawling • This information is often useful to the user • Allows user to visualize where crime occurred • Allows user to filter by category • Decreases the amount of raw tweets the user must read • This information is also useful to improve performance • Ranking • Clustering • Improves accuracy
Temporal/Spatial Information Five potential sources of locations, listed in descending order of perceived usefulness: • GPS tagged tweets: latitude=57.8433342, longitude=12.6506338 • 'Place' tagged tweets: (57.6190897,12.427637), (57.6190897,12.7635394), (57.8653997,12.7635394), (57.8653997,12.427637) • User location • Textual Location Extraction: Named Entity Recognition, Regular Expressions
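The priority order can be sketched as a simple fallback chain. The dictionary keys (`gps`, `place`, `user_location`, `text_location`) are hypothetical names, and reducing a 'Place' bounding box to its center point is an assumption made here for illustration.

```python
def bounding_box_center(corners):
    """Average the corner coordinates of a 'Place' bounding box."""
    lats = [lat for lat, _ in corners]
    lons = [lon for _, lon in corners]
    return (sum(lats) / len(lats), sum(lons) / len(lons))


def best_location(tweet):
    """Return (source, value) from the most useful source present, or None."""
    if tweet.get("gps"):
        return "gps", tweet["gps"]
    if tweet.get("place"):
        return "place", bounding_box_center(tweet["place"])
    if tweet.get("user_location"):
        return "user_location", tweet["user_location"]
    if tweet.get("text_location"):
        # Found by named entity recognition or a regular expression.
        return "text_location", tweet["text_location"]
    return None


# Bounding box corners copied from the slide.
PLACE_BOX = [
    (57.6190897, 12.427637),
    (57.6190897, 12.7635394),
    (57.8653997, 12.7635394),
    (57.8653997, 12.427637),
]
```

A tweet carrying both a GPS tag and a place box resolves to the GPS point; one with only the place box resolves to the center of the box.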
Temporal/Spatial Information • Location information hierarchically structured based on reliability • Use Named Entity Recognition • Succeeds on: "I just witnessed a robbery in Champaign" • Fails on: "Breaking and entering at 128 Maple St." • Use regular expressions to recognize common formatting of addresses, highways, etc. • Time is taken from the tweet's timestamp
Regex Example "[0-9]+ ([A-Z][A-Za-z]* )+ (ALLEE|ALLEY|ALLY|ALY|ANEX|ANNEX|ANNX|ANX|ARC|ARCADE|AV|AVE|AVEN|AVENU|AVENUE|AVN|AVNUE|BAYOO|BAYOU|BCH|BEACH|BEND|BND|BLF|BLUF|BLUFF|BLUFFS|BOT|BOTTM|BOTTOM|BTM|BLVD|BOUL|BOULEVARD|BOULV|BR|BRANCH|BRNCH|BRDGE|BRG|BRIDGE|BRK|BROOK|BROOKS|BURG|BURGS|BYP|BYPA|BYPAS|BYPASS|BYPS|CAMP|CMP|CP|CANYN|CANYON|CNYN|CYN|CAPE|CPE|CAUSEWAY|CAUSWAY|CSWY|CEN|CENT|CENTER|CENTR|CENTRE|CNTER|CNTR|CTR|CENTERS|CIR|CIRC|CIRCL|CIRCLE|CRCL|CRCLE|CIRCLES|CLF|CLIFF|CLFS|CLIFFS|CLB|CLUB|COMMON|COR|CORNER|CORNERS|CORS|COURSE|CRSE|COURT|CRT|CT|COURTS|CTS|COVE|CV|COVES|CK|CR|CREEK|CRK|CRECENT|CRES|CRESCENT|CRESENT|CRSCNT|CRSENT|CRSNT|CREST|CROSSING|CRSSING|CRSSNG|XING|CROSSROAD|CURVE|DALE|DL|DAM|DM|DIV|DIVIDE|DV|DVD|DR|DRIV|DRIVE|DRV|DRIVES|EST|ESTATE|ESTATES|ESTS|EXP|EXPR|EXPRESS|EXPRESSWAY|EXPW|EXPY|EXT|EXTENSION|EXTN|EXTNSN|EXTENSIONS|EXTS|FALL|FALLS|FLS|FERRY|FRRY|FRY|FIELD|FLD|FIELDS|FLDS|FLAT|FLT|FLATS|FLTS|FORD|FRD|FORDS|FOREST|FORESTS|FRST|FORG|FORGE|FRG|FORGES|FORK|FRK|FORKS|FRKS|FORT|FRT|FT|FREEWAY|FREEWY|FRWAY|FRWY|FWY|GARDEN|GARDN|GDN|GRDEN|GRDN|GARDENS|GDNS|GRDNS|GATEWAY|GATEWY|GATWAY|GTWAY|GTWY|GLEN|GLN|GLENS|GREEN|GRN|GREENS|GROV|GROVE|GRV|GROVES|HARB|HARBOR|HARBR|HBR|HRBOR|HARBORS|HAVEN|HAVN|HVN|HEIGHT|HEIGHTS|HGTS|HT|HTS|HIGHWAY|HIGHWY|HIWAY|HIWY|HWAY|HWY|HILL|HL|HILLS|HLS|HLLW|HOLLOW|HOLLOWS|HOLW|HOLWS|INLET|INLT|IS|ISLAND|ISLND|ISLANDS|ISLNDS|ISS|ISLE|ISLES|JCT|JCTION|JCTN|JUNCTION|JUNCTN|JUNCTON|JCTNS|JCTS|JUNCTIONS|KEY|KY|KEYS|KYS|KNL|KNOL|KNOLL|KNLS|KNOLLS|LAKE|LK|LAKES|LKS|LAND|LANDING|LNDG|LNDNG|LA|LANE|LANES|LN|LGT|LIGHT|LIGHTS|
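A shortened version of this pattern is easier to experiment with. The sketch below follows the same shape (house number, capitalized words, street suffix) with a small hand-picked suffix list. ST, STREET, RD and ROAD are assumed here, since the slide's list is cut off before those letters, and matching the suffix case-insensitively is also an assumption.

```python
import re

# Small sample of suffixes; longer forms come first so they are tried first.
SUFFIXES = ["STREET", "ST", "AVENUE", "AVE", "BLVD", "DRIVE", "DR",
            "ROAD", "RD", "LANE", "LN", "HWY"]

# House number, one or more capitalized words, then a street suffix.
ADDRESS = re.compile(
    r"\b[0-9]+ (?:[A-Z][A-Za-z]* )+(?i:" + "|".join(SUFFIXES) + r")\b"
)


def find_address(text):
    """Return the first street-address-like substring, or None."""
    match = ADDRESS.search(text)
    return match.group(0) if match else None
```

On the slide's own examples, `find_address("Breaking and entering at 128 Maple St.")` returns `"128 Maple St"`, while the sentence about a robbery in Champaign returns `None` and is left to named entity recognition.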
Location Disambiguation • Search extracted locations through a city-to-GPS lookup table • Many American city names are repeated (Atlanta, IL vs Atlanta, GA) • Check for well-formatted locations (city,state) • If not, resolve by selecting the matched city with the largest population • Give preference to other location sources (like user location and GPS) when there are multiple matches
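These disambiguation rules might look as follows in code. The lookup table, the function name `resolve`, and the `hint_state` parameter (standing in for a state learned from another source such as the user location) are all hypothetical; coordinates are approximate and the populations are placeholder magnitudes, not census figures.

```python
# Hypothetical lookup table: (city, state) -> (latitude, longitude, population).
# Coordinates are approximate; populations are placeholders, not census data.
GAZETTEER = {
    ("atlanta", "GA"): (33.75, -84.39, 400_000),
    ("atlanta", "IL"): (40.26, -89.23, 1_500),
    ("champaign", "IL"): (40.12, -88.24, 80_000),
}


def resolve(location, hint_state=None):
    """Resolve a location string to a (city, state) key, or None."""
    city, _, state = (part.strip() for part in location.partition(","))
    city, state = city.lower(), state.upper()
    if state:
        # Well-formatted "city,state": look it up directly.
        return (city, state) if (city, state) in GAZETTEER else None
    candidates = [key for key in GAZETTEER if key[0] == city]
    if not candidates:
        return None
    if hint_state:
        # Prefer a match that agrees with another location source.
        for key in candidates:
            if key[1] == hint_state:
                return key
    # Otherwise fall back to the most populous city of that name.
    return max(candidates, key=lambda key: GAZETTEER[key][2])
```

A bare "Atlanta" resolves to Georgia, while the same name with an Illinois hint from the user's profile resolves to Atlanta, IL.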
Categorization • Would like categories with finer granularity than crime or not crime • Based on keyword partitions corresponding to categories, ex: • Robbery/Theft: {robbed,robbery,burglar,theft...} • Natural Disaster: {tornado,typhoon,earthquake...} • Keyword based crawling guarantees presence of words that convey meaningful category information
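A minimal sketch of keyword-partition categorization, assuming a tweet is assigned to the category whose keyword set it overlaps most. The two partitions are the ones shown on the slide (truncated there); the tie-breaking and the name `categorize` are assumptions.

```python
import re

# Keyword partitions from the slide (the slide's lists are longer).
CATEGORY_KEYWORDS = {
    "Robbery/Theft": {"robbed", "robbery", "burglar", "theft"},
    "Natural Disaster": {"tornado", "typhoon", "earthquake"},
}


def categorize(text):
    """Return the category with the most keyword hits, or None if there are none."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    hits = {name: len(tokens & words) for name, words in CATEGORY_KEYWORDS.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None
```

Because the crawler already selects tweets by keyword, most crawled tweets should hit at least one partition.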
Ranking: scoring and ordering tweets based on importance. Presented by Ravi Khadiwala
Ranking • We only want to display best "n" tweets • Nature of twitter may result in an extremely variable amount of data • Serves as another way to filter non-crime tweets • May be able to highlight important events • Summarize the most important data points • Avoid overwhelming the user with results
Learning to Rank Goal: Learn a function f: X -> r, where X is a vector of features and r is an importance score. Strategy: Take a pointwise approach and use a sample of manually scored data to find the curve that fits our labeled data. We use linear regression with the simple least squares method to find weights such that r = w1x1 + w2x2 + w3x3 + ... + wnxn
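For reference, least squares weights can be computed by solving the normal equations (X^T X) w = X^T r. The sketch below does this in plain Python with Gauss-Jordan elimination; it is a textbook formulation rather than the team's code, and the function names and toy data are hypothetical.

```python
def solve(a, b):
    """Solve the square system a w = b by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Move the row with the largest entry in this column into pivot position.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def least_squares(xs, rs):
    """Return weights w minimizing the sum of squared errors (r - w . x)^2."""
    n = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(n)] for i in range(n)]
    xtr = [sum(x[i] * r for x, r in zip(xs, rs)) for i in range(n)]
    return solve(xtx, xtr)


# Toy data generated from r = 2*x1 - 1*x2, so the fit should recover [2, -1].
features = [[1, 0], [0, 1], [1, 1], [2, 1]]
scores = [2, -1, 1, 3]
weights = least_squares(features, scores)
```

If two features are perfectly collinear the system is singular and the division fails, which is one practical reason to prune redundant features.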
Determine Ranking Features • Selected from a large pool of potential features • Social • Number of hashtags,urls,@ (indicates a reply), retweet count • Contextual • Tweet length, category, mentioned locations • User Credibility • Age of user account, friends, followers, status count, verification • Classifier Confidence
Ranking Features and Weights • Labeled ~500 tweets with a ranking (integer from 1 to 5) • Linear regression on all features (normalized) • Examined correlation coefficients • Examined weights • Pruned features • Repeated until we had an adequate feature set with logical weights
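Two of the steps above, normalizing features and examining correlation coefficients, can be sketched as follows. The slide does not say which normalization was used, so min-max scaling is an assumption, and Pearson correlation is assumed as the coefficient.

```python
import math


def min_max(values):
    """Scale values linearly into [0, 1]; a constant feature maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def pearson(xs, ys):
    """Pearson correlation between a feature column and the labeled ranks."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)
```

A feature whose correlation with the labeled rank is near zero, or whose fitted weight has an implausible sign, would be a candidate for pruning in the next iteration.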
Ranking Features and Weights (feature: weight) • category: -0.996904004778 • account age: 2.87974471144 • favorites: 1.71671010105 • status count: 1.17242993534 • followers: 2.67005302808 • confidence: -3.97882564778
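Given these weights, scoring a tweet and keeping the best "n" is a weighted sum followed by a top-n selection. The weights are copied from the slide; the tweet records, their normalized feature values, and the function names are hypothetical.

```python
import heapq

# Weights as listed on the slide.
WEIGHTS = {
    "category": -0.996904004778,
    "account age": 2.87974471144,
    "favorites": 1.71671010105,
    "status count": 1.17242993534,
    "followers": 2.67005302808,
    "confidence": -3.97882564778,
}


def score(features):
    """Importance score r = w1*x1 + ... + wn*xn; missing features count as 0."""
    return sum(weight * features.get(name, 0.0) for name, weight in WEIGHTS.items())


def top_n(tweets, n):
    """Return the n highest-scoring tweets, best first."""
    return heapq.nlargest(n, tweets, key=lambda tweet: score(tweet["features"]))


# Hypothetical tweets with already-normalized feature values.
tweets = [
    {"id": "a", "features": {"followers": 1.0, "account age": 1.0}},
    {"id": "b", "features": {"confidence": 1.0}},
    {"id": "c", "features": {"favorites": 0.5}},
]
```

Here `top_n(tweets, 2)` keeps tweets "a" and "c" in that order.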
Clustering: geographical location as the determinant for grouping tweets together. Presented by Ronald Doku
Clustering tweets • Clustering of tweets means to group overlapping tweets found in the same location into one category.
Why is tweet clustering important? • Clustered tweets inform the user about where most events are happening at a particular time. • The sizes of the clusters also convey how relevant or important the tweets are. • E.g. a user may want to find out how far a wildfire is spreading or has spread. Clustered tweets about the wildfire on the map show the user where the fire is or has spread to.
How do we cluster tweets? By defining at which zoom levels each tweet should appear, we cluster the tweets to reduce the number shown at a time. We call this hierarchical clustering.
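One simple way to picture zoom-dependent clustering is a grid whose cells halve in size with each zoom level, with tweets in the same cell drawn as one cluster. The slide does not describe the actual algorithm, so the grid scheme, the cell-size formula, and the function names below are assumptions made purely for illustration.

```python
import math
from collections import defaultdict


def cell_size(zoom):
    """Grid cell size in degrees; assumed to halve with each zoom level."""
    return 360.0 / (2 ** zoom)


def cluster_by_zoom(points, zoom):
    """Group (lat, lon) points that fall into the same grid cell at this zoom."""
    size = cell_size(zoom)
    clusters = defaultdict(list)
    for lat, lon in points:
        key = (math.floor(lat / size), math.floor(lon / size))
        clusters[key].append((lat, lon))
    return list(clusters.values())


# Two nearby points and one far away (approximate, illustrative coordinates).
points = [(40.11, -88.24), (40.12, -88.21), (33.75, -84.39)]
```

Zoomed far out the three points collapse into a single marker, and zooming in splits them into progressively smaller clusters.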