Understanding RSM: Relief Social Media. William Murnane, Anand Karandikar. 15 September 2009.
Objective
• Build better sensors into emerging social media environments. These environments are increasingly important in Humanitarian Assistance and Disaster Relief (HADR) and Security, Stability, Transition and Reconstruction (SSTR) scenarios, where they provide real-time situational awareness.
• Deliver an analytic toolkit that can be integrated into the Human, Social, Cultural and Behavioral (HSCB) computational infrastructure.
Project Overview
• Joint venture by Lockheed Martin Advanced Technology Laboratories (LM ATL) and the University of Maryland, Baltimore County (UMBC)
• Team members
  • Prof. Finin, Prof. Joshi – Principal Faculty, UMBC, CS Dept
  • Dr. Brian Dennis – Staff Computer Scientist, LM ATL
  • William Murnane, Anand Karandikar – Graduate students, UMBC, CS Dept
Project Overview: What It's Like Today
• HADR/SSTR response has focused on highly centralized, tightly coordinated organization.
• Responders
  • Domestic: FEMA, DHS, National Guard, State and Local, NGOs
  • International: Army, Navy, USAID, NGOs
• Centralization slows response, throttles critical information, and limits situational awareness.
Adapted from Dr. Brian Dennis's slides
Project Overview: What's Changing
• Response at the edge
  • The affected populace is using the Internet/Web for communication
  • This assumes network availability
• Social media tools are being used for communication and coordination
• Example social media platforms: Twitter, Flickr, YouTube, open blogs
• Social visibility + coordination + content
Adapted from Dr. Brian Dennis's slides
Technical Approach
• Harvesting of data
  • Focus on social media like Twitter and Flickr
  • Capture data that has relief contexts
• Computational models
  • Generative model of social connections that can help build forecasting tools
• Building analytics toolkits
  • Capabilities to analyze and mine sentiment
  • Automated generation of appropriate confidence levels for extracted information
Twitter
• Lots and lots of data
• Lots and lots of stuff nobody cares about: "omg, when I get home I am so going to blog about your new haircut." --Nick Taylor
• ... but maybe some stuff someone might care about. People talk about getting sick, wildfires, floods, etc., so maybe we can track that.
Dataset #1: Twitter
• Nicely segmented into tables: users, locations, statuses.
• Referential integrity needs work:
    select count(*) from (select follower_id from user_relationships except select id from users) as missing_uids;
     count
    -------
     24201
• Fairly big: roughly 1.5M users, 150M statuses, 1M locations. 30 GB on disk.
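The integrity check on this slide can be tried on a toy database. The sketch below is only an illustration: it uses SQLite in memory rather than the project's PostgreSQL instance, the sample rows are made up, and the `followed_id` column is a hypothetical addition (the slide only names `follower_id` and `users.id`). It also shows one subtlety: `EXCEPT` returns distinct values, so the query counts distinct missing user IDs, not dangling relationship rows.

```python
import sqlite3

# Toy in-memory database; the real dataset lives in PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY);
    -- followed_id is a hypothetical column added for illustration.
    CREATE TABLE user_relationships (follower_id INTEGER, followed_id INTEGER);
    INSERT INTO users (id) VALUES (1), (2), (3);
    INSERT INTO user_relationships VALUES (1, 2), (2, 3), (7, 1), (7, 2), (9, 3);
""")

# Same shape as the query on the slide: follower IDs with no matching user.
missing_ids = conn.execute("""
    SELECT count(*) FROM (
        SELECT follower_id FROM user_relationships
        EXCEPT
        SELECT id FROM users
    ) AS missing_uids
""").fetchone()[0]

# Alternative: count the relationship rows that point at a missing user.
dangling_rows = conn.execute("""
    SELECT count(*) FROM user_relationships AS r
    WHERE NOT EXISTS (SELECT 1 FROM users AS u WHERE u.id = r.follower_id)
""").fetchone()[0]

print(missing_ids, dangling_rows)
```

In this toy data, users 7 and 9 are missing, and user 7 appears in two relationship rows, so the two counts differ.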
Current Progress
• Dataset loaded into PostgreSQL from MySQL
• Fixed corruption problems
• Tried full-text indexing on tweets in Postgres
  • Too slow: CREATE INDEX ran for 72 hours with no visible progress
  • May try again on new hardware
• Wrote a Lucene-based app to build and search indices
Current Progress: Status and Query Speed
• Pretty good performance: ~35k rows/second while creating the index on current hardware, and queries are quick
• Easy to write: 459 LOC counting the GUI, about half that without it.
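As a rough sanity check on that indexing rate, the arithmetic below estimates how long a full pass over the dataset would take. It assumes the ~35k rows/second figure is sustained across all of the roughly 150M statuses mentioned on the dataset slide; the slides do not say how many rows were actually indexed.

```python
# Back-of-the-envelope estimate; both inputs are the approximate figures
# quoted on the slides, and a constant indexing rate is an assumption.
statuses = 150_000_000      # "roughly 150M statuses"
rows_per_second = 35_000    # "~35k rows/second"

seconds = statuses / rows_per_second
minutes = seconds / 60

print(f"about {minutes:.0f} minutes")  # about 71 minutes
```

Under those assumptions the whole table would index in a little over an hour, compared with the 72 hours the Postgres CREATE INDEX attempt ran without finishing.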
Tweet index design
• Index only statuses: that's all we need to search quickly so far.
• Document ID: maps to the SQL primary key on statuses
• Text: analyzed into words; results are ordered by TF-IDF.
• UID: lets us filter by user at the query level rather than having to ask the database. We don't know if this will be useful, but it doesn't hurt.
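The three fields above can be sketched as a tiny in-memory inverted index. This is not the Lucene application described in the slides, and the scoring is a plain TF-IDF sum rather than Lucene's own formula; the class name `TweetIndex`, its methods, and the sample tweets are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


class TweetIndex:
    """Toy inverted index: status ID, analyzed text, and user ID."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {status_id: term frequency}
        self.uid_of = {}                   # status_id -> user ID

    @staticmethod
    def analyze(text):
        # Lowercase and split into word tokens (a stand-in for a real analyzer).
        return re.findall(r"[a-z0-9']+", text.lower())

    def add(self, status_id, uid, text):
        self.uid_of[status_id] = uid
        for term, tf in Counter(self.analyze(text)).items():
            self.postings[term][status_id] = tf

    def search(self, query, uid=None):
        """Return status IDs ordered by descending TF-IDF score."""
        n_docs = len(self.uid_of)
        scores = defaultdict(float)
        for term in set(self.analyze(query)):
            docs = self.postings.get(term, {})
            if not docs:
                continue
            idf = math.log(n_docs / len(docs)) + 1.0
            for status_id, tf in docs.items():
                # Filter by user inside the index, with no database round trip.
                if uid is not None and self.uid_of[status_id] != uid:
                    continue
                scores[status_id] += tf * idf
        return sorted(scores, key=lambda s: (-scores[s], s))


index = TweetIndex()
index.add(101, 7, "Wildfire smoke over the hills, wildfire spreading fast")
index.add(102, 8, "omg new haircut, so going to blog about it")
index.add(103, 7, "Evacuation ordered near the wildfire")

print(index.search("wildfire evacuation"))
print(index.search("wildfire", uid=7))
```

The returned IDs are the values that would map back to the primary key on the statuses table, so the database is only consulted to fetch the matching rows.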
Raw data for events of interest
The example chosen here is ‘California Wildfires’
• Twitter tweets for California wildfires
• Technorati search for California wildfire videos
• Yahoo! Pipes mashup for California wildfires using Flickr data
Twitter API methods
• Search - Returns tweets that match a specified query.
• statuses/public_timeline - Returns the 20 most recent statuses from users.
• statuses/show - Returns a single status, specified by the id parameter.
• Trends - Returns the top ten topics that are currently trending on Twitter.
• GeoLocation API expected from Twitter by October 2009
Facebook API methods
• Users.getStandardInfo – Returns the user's current location, timezone, etc.
• Stream.get – If a user ID is specified, it can return the last 50 posts from that user's profile stream.
• Status.get – Returns the user's current and most recent statuses.
YouTube Data API
• To search for videos, submit an HTTP GET request to the following URL: http://gdata.youtube.com/feeds/api/videos
• Example query: California Fires
• Other parameters, such as location and location-radius, can be added while building the query.
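A minimal sketch of building such a query URL follows. It only constructs the URL and does not send a request, since this GData-era endpoint has since been retired. The function name is hypothetical, `q` is assumed to be the search-term parameter, and the coordinate and radius formats shown are illustrative assumptions rather than values taken from the slides.

```python
from urllib.parse import urlencode

# Endpoint as given on the slide (no longer live).
BASE_URL = "http://gdata.youtube.com/feeds/api/videos"


def build_video_search_url(query, location=None, location_radius=None):
    """Build a video-search URL; parameter formats are assumptions."""
    params = {"q": query}  # assumed name of the search-term parameter
    if location is not None:
        # Assumed "latitude,longitude" format.
        params["location"] = "{:.4f},{:.4f}".format(*location)
    if location_radius is not None:
        params["location-radius"] = location_radius
    return BASE_URL + "?" + urlencode(params)


url = build_video_search_url(
    "California Fires",
    location=(34.05, -118.25),   # illustrative coordinates
    location_radius="100km",     # illustrative radius
)
print(url)
```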
GeoCoding API
• Geocoding is the process of converting an address like ‘1000 Hilltop Circle Baltimore MD’ into geographical coordinates, which can be used to mark that address on a map.
• Google Maps API: via the GClientGeocoder object. Use GClientGeocoder.getLatLng() to convert a string address into a latitude and longitude.
• Yahoo! Maps web service: Example: 701 First Ave Sunnyvale CA
Similar Initiatives: AirTwitter (started in August 2009)
• Designed to harvest user-generated content such as tweets, Delicious bookmarks, Flickr pictures and YouTube videos that are relevant to air quality (AQ).
• Uses Yahoo! Pipes for aggregated feed generation.
• When events are identified, the location will be harvested from contextual information in the feed, such as a place name or, as development evolves, the IP address of the tweet.
• To further automate event identification, Air Twitter feeds will be archived in order to conduct temporal trend analysis that can be used to separate the background noise from AQ events in the social media stream.
Similar Initiatives: Crisis Informatics
• ConnectivIT Research Group at the University of Colorado, Boulder
• Investigates the evolving role of information and communication technologies (ICT) in emergency and disaster situations.
• Particular focus on information dissemination and the implications of ICT-supported public participation for informal and formal crisis response.
To Do
• Index locations, too? Lucene or SQL?
• Better Analyzer: discard non-English (tricky!) and do stemming (simple!)
• Test on new hardware: SSD versus disk, for which parts?
• Higher-level abstractions: which tweets are similar? Build an ontology that things fit into, or search for particular things?
• Run a human classifier for a while, then train a machine classifier on that data.
• Geo-location in Twitter space
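The "better Analyzer" item could look roughly like the sketch below. It is a toy stand-in, not a Lucene Analyzer: the stemmer strips only a few suffixes (a real one would use something like the Porter algorithm), and the language check is a crude assumption that English tweets are written mostly in ASCII letters. All function names and the 0.9 threshold are hypothetical.

```python
import re

# Suffixes tried in order; deliberately minimal compared with Porter stemming.
SUFFIXES = ("ing", "ed", "s")


def crude_stem(word):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if suffix == "s" and word.endswith("ss"):
            continue  # leave words like "class" alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def looks_english(text, threshold=0.9):
    """Crude heuristic: most letters are ASCII. Threshold is an assumption."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    ascii_letters = sum(1 for ch in letters if ch.isascii())
    return ascii_letters / len(letters) >= threshold


def analyze(text):
    """Return stemmed tokens, or an empty list for text judged non-English."""
    if not looks_english(text):
        return []
    return [crude_stem(word) for word in re.findall(r"[a-z]+", text.lower())]


print(analyze("Fires spreading, roads flooded"))
print(analyze("山火事が広がっている"))
```

The heuristic shows why discarding non-English text is the tricky half: it drops text in non-Latin scripts but cannot tell English from Spanish or French written in plain ASCII.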