Inferring from the Crowd. L Venkata Subramaniam. Event Detection and 360-degree Profile Creation. Personal Attributes. Identifiers: name, address, age, gender, occupation… Interests: sports, pets, cuisine… Life Cycle Status: marital, parental. Intent
Inferring from the Crowd L Venkata Subramaniam
Event Detection and 360-degree Profile Creation
Social Media based 360-degree Consumer Profiles
• Personal Attributes
  • Identifiers: name, address, age, gender, occupation…
  • Interests: sports, pets, cuisine…
  • Life Cycle Status: marital, parental
• Intent
  • Sentiment on products, services, campaigns
  • Personal preferences of products
  • Product purchase history
  • Suggestions on products & services
• Relationships
  • Personal relationships: family, friends and roommates…
  • Business relationships: co-workers and work/interest network…
• Public Safety Events
  • Life-changing events: relocation, having a baby, getting married, getting divorced, buying a house…
Public Safety Event Detection
Social Media Data (example posts by category):
• Public Safety: "There is a large fire at Mantralaya"
• Intent: "I am going to the rally tomorrow at 10 am @JantarMantar"
• Sentiment: "Corruption is a major problem and it sucks that the govt isn't doing much about it"
• Personal Events: "Looks like we'll be moving to New Orleans sooner than I thought."
• Personal Attributes: "I am a engineer, mom, and wife"
• Relationships: "Ritwik and I are both part of the anti makerite movement"
360-degree Social Media Event and People Profiles
Diagram labels: Event Alerting, Mitigation and Management; Intelligence; Citizen Services; Intelligence Management; Investigative Management; Next Best Action; Citizen Intelligence; Entity Identification
Master Data on Troublemakers & Ringleaders (Internal + External)
Integrate social media people profiles with Govt and Security Databases: Passport Data, Immigration Data, Mobile Records, Police Records
Social Media based Micro-segmentation and Real-time Correlation
Value Proposition
• Construct a comprehensive view of entities of interest (e.g., people, companies, products, events)
• Identify actionable insights in real time
• From 10s to 100s of TB of social media data from sources such as Twitter, blogs, and forums
• Using unstructured data analytics, real-time analytics, and predictive analytics
Continuously analyze social media data from a wide range of sources to construct 360-degree profiles of entities and leverage them in timely decision-making
Entities & Relationships: Object-centric view
• Main Problem: Assemble an entity view of the domain, where each entity aggregates data from thousands of different documents
• Multiple stages of complex processing:
  • Information extraction
    • From each unstructured document, extract relevant structured records
  • Entity resolution
    • Link records (possibly across documents) that are about the same real-world "entity"
  • Entity population: mapping / fusion / aggregation
    • Collect all the facts about the same entity into one rich object with clean values and relationships to other entities
Diagram labels: Unstructured data sources; Crawl; Extract / Text Analytics (AQL); Entity Resolution; Map / Fuse / Aggregate; Entity Integration (HIL); Entity Views; Entity & Relationship Analytics; BigInsights / BigData Platform
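The three stages above can be sketched in a few lines of Python. This is only an illustration, not the AQL/HIL pipeline the slides describe: the regular expression, the exact-name resolution key, and the function names (`extract`, `resolve`, `populate`, `build_entity_view`) are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical extraction rule: "<First Last> is the/a <role>".
PATTERN = re.compile(r"([A-Z][a-z]+ [A-Z][a-z]+) is (?:the|a) (\w+)")


def extract(doc_id, text):
    """Information extraction: turn one document into structured records."""
    return [
        {"doc": doc_id, "name": m.group(1), "role": m.group(2)}
        for m in PATTERN.finditer(text)
    ]


def resolve(record):
    """Entity resolution (toy version): records with the same
    case-insensitive name are treated as the same real-world entity."""
    return record["name"].lower()


def populate(records):
    """Entity population: fuse all facts about one entity into one object."""
    entities = defaultdict(lambda: {"names": set(), "roles": set(), "docs": set()})
    for record in records:
        entity = entities[resolve(record)]
        entity["names"].add(record["name"])
        entity["roles"].add(record["role"])
        entity["docs"].add(record["doc"])
    return dict(entities)


def build_entity_view(docs):
    """Run extraction over every document, then resolve and aggregate."""
    records = [r for doc_id, text in docs.items() for r in extract(doc_id, text)]
    return populate(records)
```

A real resolver would tolerate name variations and missing identifiers; the exact-match key is used here only to keep the sketch short.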
360-degree Consumer Profiles from Social Media
Social Media based 360-degree Consumer Profiles
• Personal Attributes
  • Identifiers: name, address, age, gender, occupation…
  • Interests: sports, pets, cuisine…
  • Life Cycle Status: marital, parental
• Timely Insights
  • Intent to buy various products
  • Current location
  • Sentiment on products, services, campaigns
  • Incidents damaging reputation
  • Customer satisfaction/attrition
• Life Events
  • Life-changing events: relocation, having a baby, getting married, getting divorced, buying a house…
• Products Interests
  • Personal preferences of products
  • Product purchase history
  • Suggestions on products & services
• Relationships
  • Personal relationships: family, friends and roommates…
  • Business relationships: co-workers and work/interest network…
Example posts:
• Monetizable intent to buy products
  • I need a new digital camera for my food pictures, any recommendations around 300?
  • What should I buy?? A mini laptop with Windows 7 OR a Apple MacBook!??!
• Life Events
  • College: Off to Stanford for my MBA! Bbye chicago!
  • Looks like we'll be moving to New Orleans sooner than I thought.
• Intent to buy a house
  • I'm thinking about buying a home in Buckingham Estates per a recommendation. Anyone have advice on that area? #atx #austinrealestate #austin
• Location announcements
  • I'm at Starbucks Parque Tezontle http://4sq.com/fYReSj
Extraction: Loan Records from SEC Documents
• Extract and cleanse information from headers, tables, main content and signatures
• Example: loan document filed by Charles Schwab Corporation on Aug 6, 2009
Diagram labels: Loan Information; Loan Company Information
Person Information across Documents
Who Is James Dimon?
• Do these filings refer to the same person?
  • variability in the person's name, lack of a key identifier
  • supporting attributes vary depending on the context (form type)
• All these facts need to be linked and integrated
Diagram labels: Committee memberships; Signatures; Insider Transactions; Biographies
Entity Integration
Entity Integration: High-level rule language to specify entity integration
- SQL-like statements to populate, aggregate and relate entities
- Combines multiple stages of entity analytics into one framework
- HIL compiles into Jaql and Hadoop
• Entity Population Rules
  • Mapping and transformation, aggregation
  • Cleansing, conflict resolution
  • Entities can be indexed by multiple "dimensions"
  • Facilitate reuse and hierarchical construction of the master data
• Entity Resolution Rules
  • Create links between entities
  • Rules can incorporate:
    • similarity functions with thresholds
    • scoring
    • blocking for efficient execution
Diagram labels: External public data sources (e.g., SEC/FDIC, Twitter, Blogs, Facebook); External data subscriptions (e.g., Acxiom); Extract; Entity Resolution; Map; Fuse; Temporal Analyze; Master entities
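The three ingredients of an entity resolution rule (similarity function with a threshold, scoring, blocking) can be illustrated with a small Python sketch. It is not HIL; the blocking key, the 0.7/0.3 weights, the 0.8 threshold, and the `name`/`city` record fields are all assumptions chosen for the example.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def block_key(record):
    """Blocking: only records sharing the first letter of the last name
    are compared, which avoids comparing every pair of records."""
    return record["name"].split()[-1][0].lower()


def similarity(a, b):
    """String similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score(r1, r2):
    """Scoring: weighted combination of name similarity and city agreement.
    The weights are illustrative assumptions."""
    name_sim = similarity(r1["name"], r2["name"])
    city_sim = 1.0 if r1["city"] == r2["city"] else 0.0
    return 0.7 * name_sim + 0.3 * city_sim


def link(records, threshold=0.8):
    """Return index pairs of records whose score reaches the threshold."""
    blocks = defaultdict(list)
    for i, record in enumerate(records):
        blocks[block_key(record)].append(i)
    links = []
    for ids in blocks.values():
        for i, j in combinations(ids, 2):
            if score(records[i], records[j]) >= threshold:
                links.append((i, j))
    return links
```

Blocking trades recall for speed: two records for the same person that land in different blocks are never compared, so the blocking key has to be chosen with the expected name variations in mind.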
• Entries contain promotional messages, wishful thinking, questions, etc.
• Integration across social media sites
• For many of the attributes we need to extract, cleanse, normalize and categorize
Example Application: Lead Generation
• Real-time product intents enriched with consumer attributes
• Micro-segmentation of product intents by occupation
• Real-time tracking by micro-segmentation
• Micro-segmentation of consumers by hobbies
Social Networks and Communities
• A social network is a graph of individuals (nodes) tied by one or more specific types of interdependencies / interactions (edges).
• Social communities are collections of users that display a higher degree of relatedness among themselves than with the rest of the network.
Topic User Community Models (WWW 2012)
• Generative Bayesian models for extracting latent communities from a social network using the link structure as well as the content exchanged between users
• Community memberships are dependent on the topics of interest among users and their link relationships
• Users can belong to multiple communities
• Communities can be related to multiple topics (interests)
Visualizing Topics and Communities
Figure panels: (i) Topic proportions for a user, (ii) Community proportions for a user, (iii) Distribution of topics in community 4, (iv) Global distribution of topics within communities
Architecture for Public Master Entities
(A1) Unstructured Entity Integration: complex analytics to populate the master data set
• Text Analytics: rule language (AQL) for extracting entities, events, relationships from text and HTML documents
• Entity Integration: rule language (HIL) to express & customize the integration, cleansing, and aggregation of the master entities
(A2) Entity Repository (on MDM)
• BigInsights Bridge: generation of the MDM model for public master entities from the BigInsights model, and bulk-loading of master entities
• Query-based Application Development: supports the generation of custom queries for individual applications
Example query: select cik, Officers, Directors from Company where name = 'Citigroup'
Diagram labels: External public data sources (e.g., SEC/FDIC, Twitter, Blogs, Facebook); External data subscriptions (e.g., Acxiom); Text Analytics and Entity Integration (A1, BigInsights); Relational tables with public master entities (A2, MDM); Probabilistic Matching; Enterprise internal Master entities; Queries; Tooling based on entity model; Data services; DaaS; Applications and Views
Matching Twitter profiles with Internal source
• Current scenario focused on linking social media profiles with Employee database
• Similar approach to be taken for linking with Customer and Prospect databases
• Semantic Name Variations
  • Bill Chamberlin vs. Chamberlain, William H.
  • C. Mohan vs. Mohan Chandrasekaran (Mohan)
• Geo Proximity
  • Saratoga, CA vs. San Jose, CA
  • New Jersey vs. New York
• Job Role Disambiguation
  • "Software sales manager at IBM…" vs. "Managing SPSS Sales for Canada…"
• Current demo focused on Name and Location matching, as well as EmployeeOf information
• Choice of social media profile attributes for linking constrained by availability of IBM BluePage attributes
Diagram labels: Social media profiles (name, address, gender, age, employment, relationship, …); Employment filter; Social media profiles of IBM employees and their network; Name, work location, job description; Resolution
Event Detection: using sensors, crowd sensing, social media, etc.
Event data is uncertain, progressively changing
• Event 1 (12:10): traffic accident
• Event 2 (14:15): traffic jam
• Event 3 (14:25): unidentified object found at train station
• Event 4 (14:45): fire in commercial establishment
• Event 5 (15:05): warning, water pipe broken
• Event 6 (15:15): warning, excessive crowds
Event Profile
• December 2011, Magnitude 6.5 earthquake in Mexico kills 3 people
• Actual event time: Sunday, December 11, 2011 at 01:47:26 UTC
• Event Support: 1123 tweets
• WHAT
  • Methodology: Most frequent keywords extracted from the tweets in the event
  • #earthquake, Mexico, magnitude, USGS, #Acapulco
• WHO
  • Methodology: Named Entity Extractor used to extract people and organizations
  • People: guerrero
• WHEN
  • Methodology: Time and date of the first tweet in the event
  • Sunday, December 11, 2011 at 2:20:00 UTC
• WHERE
  • Methodology: Named Entity Extractor to extract location names from the tweets; reverse geocoding of the tweets; most frequent profile locations of the users who have published the tweets in the event
  • tuxpanguerrero, mexico city, acapulco, iguala, swmexico, mexico
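Three of the profile fields above (support, WHAT, WHEN) and the profile-location part of WHERE reduce to simple counting. The Python sketch below assumes each tweet is a dict with hypothetical `text`, `time`, and `user_location` keys and uses a small made-up stopword list; the named entity extraction and reverse geocoding steps are not covered.

```python
from collections import Counter

# Illustrative stopword list, not the one used in the original system.
STOPWORDS = {"a", "an", "the", "in", "of", "at", "is", "and", "to", "near"}


def event_profile(tweets, top_k=3):
    """Build a minimal event profile from a non-empty list of tweets."""
    keywords = Counter()
    for tweet in tweets:
        for token in tweet["text"].lower().split():
            token = token.strip(".,!?")
            if token and token not in STOPWORDS:
                keywords[token] += 1
    locations = Counter(tweet["user_location"] for tweet in tweets)
    return {
        # Event support: number of tweets grouped into the event.
        "support": len(tweets),
        # WHAT: most frequent keywords across the tweets.
        "what": [word for word, _ in keywords.most_common(top_k)],
        # WHEN: time of the first tweet in the event.
        "when": min(tweet["time"] for tweet in tweets),
        # WHERE: most frequent profile location of the posting users.
        "where": locations.most_common(1)[0][0],
    }
```

Because WHEN is taken from the first tweet, it lags the actual event time, as in the slide's example where the first tweet appears about half an hour after the earthquake.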
Event Profiles
• Events reflect aggregated data: to prevent overloading by large volume of crowd-source data and to reduce uncertainty by fusing multiple posts
• Events are progressive: they keep updating as more crowd-source data becomes available
• Uncertainty (confidence) built in: from the event description to the location
• Inter-event distance: events are 'close' if they share similar semantic meaning, location, time
Example events:
• (1) 10:10 river water surging, from accumulation of tweets (**)
• (2) 11:15 fast moving water, from accumulation of mobile messages (**)
• (3) 11:15 flood, major road blocked, from accumulation of mobile messages (**)
• (4) 12:30 flood, from accumulation of mobile messages (**)
• (5) 12:30 traffic accident, from accumulation of mobile messages (**)
(**) These are progressive events; they keep changing as more data becomes available and confidence changes
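One way to make the inter-event distance concrete is to average a semantic, a spatial, and a temporal component. The Python sketch below is an assumption, not the measure used in the original system: it uses Jaccard distance on keyword sets, great-circle distance capped at `max_km`, and time difference capped at `max_minutes`, with equal weights. The event fields (`keywords`, `loc`, `minute`) are hypothetical.

```python
import math


def jaccard_distance(a, b):
    """Semantic component: 1 - |A ∩ B| / |A ∪ B| on keyword sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def event_distance(e1, e2, max_km=10.0, max_minutes=120.0):
    """Distance in [0, 1]; smaller means the two events are 'closer'."""
    semantic = jaccard_distance(e1["keywords"], e2["keywords"])
    spatial = min(haversine_km(e1["loc"], e2["loc"]) / max_km, 1.0)
    temporal = min(abs(e1["minute"] - e2["minute"]) / max_minutes, 1.0)
    return (semantic + spatial + temporal) / 3.0
```

With such a measure, a new post can be fused into the nearest existing event when the distance falls below a chosen cutoff, which is one way to keep events progressive as data accumulates.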
Analytics and Optimization Under Uncertainty
• Observed data (sensor and crowd input) is uncertain and is not available for all points on the city network
• Data needs to be mathematically estimated for locations that do not have observed data
• The effect of other disturbances on the main event needs to be modeled, such as the effect of crowd accumulation, flood, etc., on traffic
• There is uncertainty in both the observed data and the modeled data
• Applications such as traffic control and evacuation planning need to do analytics and optimization under uncertainty
• Suppose segment A depends on segments B and C, and segment B is affected. The dependency can be such that the path from C to A is also affected, even though neither C nor A is directly affected.
• Based on real-time event detection, we can compute the "cascaded impact" from the dependencies. This essentially "projects" the "reduced capacities" onto the segments that are not directly affected.
• This in turn can be used for "Evacuation Plans" that adhere to several (source, destination, deadline) triples that one might want to satisfy. For example, (city, airport, short deadline) and (city, suburbs, long deadline), or vice versa depending on the need.
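The A/B/C example can be traced with a short Python sketch. The propagation rule is an assumption made for illustration: a segment's effective capacity is the minimum of its own capacity factor and the effective capacities of the segments it depends on, and a path's capacity is the minimum over its segments. The function names and the capacity-factor representation are hypothetical, and uncertainty in the inputs is not modeled here.

```python
def cascaded_capacity(own, depends_on):
    """Project reduced capacities through an acyclic dependency graph.

    own: segment -> capacity factor in [0, 1] from directly detected events
         (1.0 means unaffected).
    depends_on: segment -> list of segments it depends on.
    """
    effective = {}

    def visit(segment, stack=()):
        if segment in effective:
            return effective[segment]
        if segment in stack:
            raise ValueError("cyclic dependency at " + segment)
        factor = own.get(segment, 1.0)
        for dep in depends_on.get(segment, []):
            # Assumed rule: a segment is at most as usable as its dependencies.
            factor = min(factor, visit(dep, stack + (segment,)))
        effective[segment] = factor
        return factor

    for segment in set(own) | set(depends_on):
        visit(segment)
    return effective


def path_capacity(path, effective):
    """Capacity of a path is limited by its most constrained segment."""
    return min(effective[segment] for segment in path)
```

For example, with `own = {"A": 1.0, "B": 0.4, "C": 1.0}` and `depends_on = {"A": ["B", "C"]}`, segment A inherits B's reduced capacity, so the path from C to A is limited to 0.4 although neither C nor A is directly affected. An evacuation planner could then check each (source, destination, deadline) requirement against these projected capacities.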
The need for managing uncertainty at scale is widespread
Figure axes: Volume, Velocity, Variety; Uncertainty (1/veracity), ranging from "Precise, authoritative, well formed" to "Inconsistent, imprecise, uncertain, unverified, spontaneous, ambiguous, deceptive"
Figure regions: Traditional Data & Processing; Data Uncertainty at Scale
Figure labels: Homeland Security; Telco Profiles; Call Detail Records; Smart Grid; Smarter Cities; Text, Audio, Video; Weather Modeling; Sensor Data; Market Trends; Portfolio Risk; Contact Centers; Smarter Traffic; Smarter Water; Market Feeds; Credit Card Transactions; Retail Data; Medical Transcription; Disease Progression; Fraud; Social Network Data; Electronic Data Interchange; Patient Records; Services; Predictive Modeling of Outcomes; SWIFT; CRM; Account Management; Customer Records