Big Data – Analytics What makes it difficult. Kalapriya Kannan IBM Research Labs July, 2013. What is analytics?. Broadly refers to the methods of analysis Depends on what we want to learn from the data Method/Model used to make sense of the data Depends on the nature of the data
Big Data – Analytics What makes it difficult Kalapriya Kannan IBM Research Labs July, 2013
What is analytics? • Broadly refers to the methods of analysis • Depends on what we want to learn from the data • Method/Model used to make sense of the data • Depends on the nature of the data • Will “Chennai Express” enter the 100 crore club? • Historical data • Promotions • Star cast • Release date • Budget • Other events? • Analytics Methods:??????
An evergreen story • |INSERT MAJOR RETAILER NAME| found on |INSERT DAY OF THE WEEK| that beer and diaper sales were strongly correlated. Once noticed on |INSERT BI TOOL OF CHOICE|, it was found |PICK ONE|: • That diapers are too heavy for recently pregnant women, so they ask their husbands to pick them up coming home from work, and since hubby is off the clock and ready to get his drink on, he also picks up beer. • That a diaper emergency occurs fairly late in the evening and the husband is sent out while the new mother cares for the baby. Being annoyed, he also picks up a 12 pack to relax. • The brilliant analyst at |SAME MAJOR RETAILER AS ABOVE| intuits that a simple relocation of beer next to diapers will lead to more purchases of beer, and beer sales improve by |INSERT HIGHER %|.
Knowledge discovery from Big Data • Analyzing big data allows analysts, researchers, and business users to make better and faster decisions using data that was previously inaccessible or unusable. • Novelty discovery • Finding new, rare, one in a million (billion) (trillion) objects and events. • Class discovery • Finding new classes of objects and behaviors • Association discovery • Finding unusual (improbable) co-occurring associations
It starts with... • Discovering gold dust in a desert vs. gold in a mine
What matters when dealing with data (“Big Data”) ? • Smart Sampling of data • Reducing the original data while not losing the statistical properties of data • Finding similar terms • Efficient multi dimensional indexing • Incremental updating of models • (vs building models from scratch) • Crucial for streaming data • Distributed linear algebra • Dealing with large sparse matrices
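One way to picture "smart sampling" is reservoir sampling, which keeps a fixed-size uniform random sample of a stream in a single pass without knowing the stream's length in advance. The slide does not name a specific technique, so this is only an illustrative sketch; the function name `reservoir_sample` and its parameters are assumptions, not part of the original deck.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k items drawn uniformly at random from stream (Algorithm R)."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Hypothetical usage: sample 5 values from a stream of one million integers.
    print(reservoir_sample(range(1_000_000), 5, seed=42))
```

Memory use depends only on the sample size k, not on the size of the data, which is what makes this style of sampling attractive for large or streaming inputs.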
On top of that... • We perform the usual data mining/machine learning/statistics operations: • Supervised learning (classification, regression) • Unsupervised learning (clustering, different types of decompositions) • We are just more careful with the algorithms we choose (typically linear or sub-linear versions).
Meaningfulness of Analytics (1/2) • A risk with ‘big-data mining’ is that an analyst can ‘discover’ patterns that are not meaningful • Statisticians call it ‘Bonferroni’s principle’: • Roughly, if you look in more places for interesting patterns than your amount of data will support, you are bound to find lots of nonsense.
Meaningfulness of Analytics (2/2) • Example: • We want to find (unrelated) people who at least twice have stayed at the same hotel on the same day • 10^9 people being tracked • 1000 days • Each person stays in a hotel 1% of the time (1 day out of 100) • Hotels hold 100 people (so 10^5 hotels) • If everyone behaves randomly (i.e., no terrorists), will data mining detect anything suspicious? • Expected number of “suspicious” pairs of people: • 250,000 • Too many combinations to check: we need some additional evidence to find “suspicious” pairs of people in an efficient way.
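The 250,000 figure can be checked with a short calculation. The sketch below assumes, as the example does, that everyone picks days and hotels independently and uniformly at random; the variable names are illustrative and not from the original slides.

```python
from math import comb

people = 10**9   # people being tracked
days = 1000      # days observed
hotels = 10**5   # enough hotels to hold 1% of the people, 100 per hotel
p_stay = 0.01    # chance a given person is in a hotel on a given day

# Chance that two given people are both in a hotel on a given day,
# and that they happen to pick the same hotel.
p_same_hotel_one_day = p_stay * p_stay / hotels

# Chance that this coincidence happens on two given (distinct) days.
p_two_days = p_same_hotel_one_day ** 2

# Number of candidate (pair of people, pair of days) combinations,
# times the chance each one looks "suspicious".
expected_pairs = comb(people, 2) * comb(days, 2) * p_two_days

print(f"Expected suspicious pairs: about {expected_pairs:,.0f}")
```

The result is close to 250,000, which is the slide's number once the binomial coefficients are approximated by n²/2. Purely random behavior therefore produces hundreds of thousands of "suspicious" pairs.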
Very inefficient Early days • Early days: • Customized applications built on top of file systems • Drawbacks of using file systems to store data: • Data redundancy and inconsistency • Difficulty in accessing data • Atomicity of updates • Concurrency control • Security • Data isolation — multiple files and formats • Integrity problems
How is analytics done? • Prepare and clean data • Explore data set: Cubes and descriptive statistics • Model computation • Scoring and deployment
Application Demands on Big Data • Applications demand response times from sub-millisecond down to nanoseconds. • Example: Tableau Desktop, Financial Analysis. • Low-latency reads • Low-latency writes • Fault-tolerant • Scalable • Queries are ad hoc • What is the next best investment in stocks? • Query = function(all data)
Where is the problem….. • Two sample programs • Computing Average • JSON reader
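The two sample programs are not reproduced in this text. As a hypothetical illustration of why even an average becomes a design question at scale, the sketch below computes a mean in one pass with constant memory instead of loading every value first; the name `streaming_mean` and the toy input are assumptions, not the author's code.

```python
from typing import Iterable


def streaming_mean(values: Iterable[float]) -> float:
    """Compute the mean in a single pass, keeping only a count and a running mean."""
    count = 0
    mean = 0.0
    for x in values:
        count += 1
        # Incremental update: shift the running mean toward the new value.
        mean += (x - mean) / count
    if count == 0:
        raise ValueError("mean of an empty stream is undefined")
    return mean


if __name__ == "__main__":
    # A generator stands in for data that is too large to hold in memory.
    readings = (float(i % 10) for i in range(1_000_000))
    print(streaming_mean(readings))
```

The same update rule works when values arrive one at a time, so the average never has to be rebuilt from scratch.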
Big Data Is not about size • Finds insights from complex, noisy, heterogeneous, longitudinal and voluminous data. • It aims to answer questions that were previously unanswered. • The challenges include • capturing, storing, • searching, • sharing & • analyzing.
Digital Marketing Data is a Mess • The problem is exacerbated by: • Most (or all) metrics not being aligned with business objectives • Disparate data sources: website, social, mobile, CRM, etc. • How do we overcome them? • Ask the question: Do the numbers even matter? • Reasons why they might not: • Aesthetics • Brand Value • Overarching business strategy
Ten common Big Data Problems • Modeling true risk • Customer churn analysis • Recommendation engine • Ad targeting • PoS transaction analysis • Analyzing network data to predict failure • Threat analysis • Trade surveillance • Search quality • Data ‘sandbox’
Look at a Few • Isolate the metrics that matter, and make sure: • They are actionable • They can be commonly interpreted • The calculations are transparent and simple • The data is easily accessible and credible. • Aggregate value • Visualize it • A visualization can be worth a thousand metrics • Use best practices, but utilize unique visualizations • “Data visualizations will become the new interface to your computing experiences”
Hadoop – Harvesting Cheap computation from commodity machines