Analysing and Modelling Large-Scale Enterprise Data

Thore Graepel, Online Services and Advertising Group, Microsoft Research Cambridge

Overview
• Complex large-scale data in the enterprise
  • What kind of data is available?
  • What technologies are used?
  • Tasks and enterprise-specific challenges?
• Methodology:
  • Bayesian Inference in Factor Graph Models
  • PQL: Using SQL to describe probability models
• Applications:
  • Gamer Rating and Matchmaking: TrueSkill
  • Click-Through Rate Prediction: AdPredictor
  • Large-Scale Recommendations: Matchbox

Complex Data Joint work with Tom Minka & Phillip Trelford

Data Sources at Microsoft (External)
• Online Services Division
  • Web index
  • Search and Ad click logs (12-15 TB / day)
  • Hotmail, Instant messaging, Internet Explorer (100s of millions of users)
  • MSN portal and Bing maps
• Xbox Live Gaming Service
  • User transaction log data
  • Ranking and matchmaking data
  • Game instrumentation for user testing

Data Sources at Microsoft (Internal)
• Development and Software Instrumentation
  • Watson (customer feedback data)
  • Source depot (MS source code, e.g., Office, Windows)
  • Multilingual technical documentation
• Business
  • Customer databases
  • Sales and Marketing

Data-Intensive Tasks at Microsoft
• Prediction of user behaviour and preferences
  • Improve web search
  • Improve targeting for advertising
  • Spam filtering and content prioritisation
• Improve user experience
  • Matchmaking for games
  • Multi-modal user interfaces (Natal, speech)
• Improve software development process
  • Improve productivity of developers
  • Analyse software for defects

Technical Infrastructure
• Relational Databases/SQL
  • Great agility for analysis and reliability for business
  • Limited scalability
  • Need to import data into SQL
• Windows HPC
  • Complex computations / fine grained parallelism
  • Need to move data to HPC cluster
• Cosmos
  • Take the computation to the data
  • Super efficient stream based computations

Cosmos Architecture
[Figure: architecture diagram showing SCOPE, DryadLINQ, Sputnik, Dryad and Cosmos above four cluster machines, each holding a stream]

Enterprise/Online specific challenges
• Privacy
  • Privacy limits the ways in which data can be used
  • Interesting trade-offs (differential privacy)
• Incentives
  • Data produced by self-interested agents
  • Need to design incentive-compatible mechanisms
• Exploration/Exploitation
  • Results of inference feed back into the business process and determine future observations
  • Need to aim at long-term benefits

Factor Graphs

Factor Graphs / Trees
• Definition: Graphical representation of the product structure of a function (Wiberg, 1996)
• Nodes: factors and variables
• Edges: Dependencies of factors on variables
• Question:
  • What are the marginals of the function (all but one variable are summed out)?
  • What is the mode of the function?

Factor Graphs and Bayesian Inference
• Bayes’ law
• Factorising prior
• Factorising likelihood
• Sum out latent variables
[Figure: factor graph over the variables s1, s2, s, t1, t2, d and y]
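The formulas that accompanied these four bullets are not preserved in the transcript. As a hedged reconstruction in generic notation (parameters $\mathbf{s}$, latent variables $\mathbf{t}$, observation $y$; the symbols and the exact factorisation are assumptions, and the variable d from the figure is left out), the four steps might read:

```latex
\begin{align}
p(\mathbf{s} \mid y) &= \frac{p(y \mid \mathbf{s})\, p(\mathbf{s})}{p(y)}
  && \text{(Bayes' law)} \\
p(\mathbf{s}) &= \prod_i p(s_i)
  && \text{(factorising prior)} \\
p(y, \mathbf{t} \mid \mathbf{s}) &= p(y \mid \mathbf{t}) \prod_i p(t_i \mid s_i)
  && \text{(factorising likelihood)} \\
p(y \mid \mathbf{s}) &= \int p(y, \mathbf{t} \mid \mathbf{s})\, d\mathbf{t}
  && \text{(sum out latent variables)}
\end{align}
```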

Factor Trees: Separation
Observation: Sum of products becomes product of sums of all messages from neighbouring factors to variable!
[Figure: factor tree with variables v, w, x, y, z and factors f1(v,w), f2(w,x), f3(x,y), f4(x,z)]

Messages: From Factors To Variables
Observation: Factors only need to sum out all their local variables!
[Figure: factor tree with variables w, x, y, z and factors f2(w,x), f3(x,y), f4(x,z)]

Messages: From Variables To Factors
Observation: Variables pass on the product of all incoming messages!
[Figure: factor tree with variables x, y, z and factors f2(w,x), f3(x,y), f4(x,z)]
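The three observations can be checked numerically on the tree from the separation slide, with factors f1(v,w), f2(w,x), f3(x,y) and f4(x,z). The following is a minimal sketch, assuming discrete variables with K states each and randomly filled factor tables (both are illustrative assumptions, not part of the slides). It compares the brute-force marginal of x with the one obtained from local messages.

```python
import itertools
import random

K = 3  # hypothetical number of states per variable
rng = random.Random(0)


def random_table():
    """K x K table of positive factor values (illustrative data only)."""
    return [[rng.random() + 0.1 for _ in range(K)] for _ in range(K)]


f1 = random_table()  # f1[v][w]
f2 = random_table()  # f2[w][x]
f3 = random_table()  # f3[x][y]
f4 = random_table()  # f4[x][z]


def marginal_x_brute_force():
    """Sum the full product over every variable except x."""
    total = [0.0] * K
    for v, w, x, y, z in itertools.product(range(K), repeat=5):
        total[x] += f1[v][w] * f2[w][x] * f3[x][y] * f4[x][z]
    return total


def marginal_x_sum_product():
    """Same (unnormalised) marginal, computed from local messages."""
    # Factor -> variable: f1 sums out its other local variable v.
    m_f1_w = [sum(f1[v][w] for v in range(K)) for w in range(K)]
    # Variable -> factor: w passes on the product of its other incoming
    # messages, which here is just the single message from f1.
    m_w_f2 = m_f1_w
    # Factor -> variable: f2 sums out w, weighted by the message from w.
    m_f2_x = [sum(m_w_f2[w] * f2[w][x] for w in range(K)) for x in range(K)]
    # Leaves y and z send constant messages, so f3 and f4 just sum them out.
    m_f3_x = [sum(f3[x][y] for y in range(K)) for x in range(K)]
    m_f4_x = [sum(f4[x][z] for z in range(K)) for x in range(K)]
    # Marginal at x: product of all messages from neighbouring factors.
    return [m_f2_x[x] * m_f3_x[x] * m_f4_x[x] for x in range(K)]
```

The brute-force version touches K^5 terms, while the message-passing version only ever sums over one K x K table at a time, which is the saving the distributive law buys.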

The Sum-Product Algorithm
• Three update equations (Aji & McEliece, 1997)
• Update equations can be directly derived from the distributive law.
• Efficient for messages in the exponential family.
• Calculate all marginals at the same time.
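The three update equations themselves are not preserved in the transcript. In the standard formulation of the sum-product algorithm they are usually written as follows, where $F_x$ denotes the set of factors neighbouring variable $x$ and $x_1, \dots, x_n$ are the other variables of factor $f$ (this notation is an assumption, not taken from the slides):

```latex
\begin{align}
p(x) &\propto \prod_{f \in F_x} m_{f \to x}(x) \\
m_{f \to x}(x) &= \sum_{x_1, \dots, x_n} f(x, x_1, \dots, x_n)
                  \prod_{i=1}^{n} m_{x_i \to f}(x_i) \\
m_{x \to f}(x) &= \prod_{g \in F_x \setminus \{f\}} m_{g \to x}(x)
\end{align}
```

The first line gives the marginal, the second the factor-to-variable message and the third the variable-to-factor message.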

Approximate Message Passing
• Problem: The exact messages from factors to variables may not be closed under products.
• Solution: Approximate the marginal as well as possible in the sense of minimal KL divergence.
• Expectation Propagation (Minka, 2001): Approximate the marginal by moment matching, which yields the approximate factor-to-variable message.
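A minimal sketch of the moment-matching step, assuming a Gaussian incoming message N(x; mu, sigma^2) and a hard constraint factor 1[x > 0] (both chosen here purely for illustration). The exact marginal is a truncated Gaussian, which is not Gaussian; the approximation keeps its mean and variance, and the outgoing message is obtained by dividing the matched Gaussian by the incoming message. All function names are hypothetical.

```python
import math


def pdf(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def cdf(t):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def moment_match_positive(mu, sigma):
    """Mean and variance of N(x; mu, sigma^2) truncated to x > 0."""
    t = mu / sigma
    v = pdf(t) / cdf(t)
    mean = mu + sigma * v
    var = sigma * sigma * (1.0 - v * (v + t))
    return mean, var


def numeric_moments(mu, sigma, step=1e-4):
    """Midpoint-rule check of the same two moments."""
    upper = max(mu, 0.0) + 10.0 * sigma
    z = m1 = m2 = 0.0
    for i in range(int(upper / step)):
        x = (i + 0.5) * step
        p = pdf((x - mu) / sigma)
        z += p
        m1 += p * x
        m2 += p * x * x
    mean = m1 / z
    return mean, m2 / z - mean * mean


def divide_gaussians(mean_a, var_a, mean_b, var_b):
    """Gaussian a divided by Gaussian b, in natural parameters.

    Only yields a proper Gaussian when var_a < var_b.
    """
    precision = 1.0 / var_a - 1.0 / var_b
    precision_mean = mean_a / var_a - mean_b / var_b
    return precision_mean / precision, 1.0 / precision
```

For example, `divide_gaussians(*moment_match_positive(1.0, 2.0), 1.0, 4.0)` returns the mean and variance of the approximate message sent by the constraint factor when the incoming message is N(1, 4).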

Distributed Message Passing
• Map-Reduce for IID data
  • Map: Nodes compute messages $m_{f_i \to s}$ from data $y_i$ and $m_{s \to f_i}$
  • Reduce: Combine messages $m_{f_i \to s}$ into $p(s)$ by multiplication
• Caveats:
  • All approximate data factors need the incoming message $m_{s \to f_i}$!
  • All messages $m_{f_i \to s}$ need to be stored if the same data point is considered multiple times
[Figure: variable s connected to data y1, y2, y3]
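A minimal sketch of the map and reduce steps, assuming the simplest possible model: IID observations y_i drawn from a Gaussian with unknown mean s and known noise variance (the model, `NOISE_VAR` and all function names are assumptions for illustration). Gaussian messages are stored as natural parameters (precision, precision times mean), so multiplying messages is just adding pairs. Because these data factors are exact, they do not need the incoming message; the caveats on the slide concern approximate factors.

```python
from functools import reduce

NOISE_VAR = 4.0  # hypothetical known observation noise variance


def map_message(y_i):
    """Map step: message from data factor f_i to s for one data point."""
    return (1.0 / NOISE_VAR, y_i / NOISE_VAR)


def multiply(msg_a, msg_b):
    """Reduce step: a product of Gaussians adds their natural parameters."""
    return (msg_a[0] + msg_b[0], msg_a[1] + msg_b[1])


def posterior(prior_mean, prior_var, data):
    """Combine the prior with all data messages into p(s)."""
    prior = (1.0 / prior_var, prior_mean / prior_var)
    precision, precision_mean = reduce(multiply, map(map_message, data), prior)
    return precision_mean / precision, 1.0 / precision
```

Since addition is associative and commutative, the reduce step can be split across machines and merged in any order, which is what makes the scheme suitable for stream-based computation.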

PQL Joint work with Ralf Herbrich & Jurgen Van Gael

PQL as a Platform

PQL I – Augmenting Schemas
    People = AUGMENT DB.People ADD weight FLOAT
[Figure: DB.People mapped to People with the added weight column]

PQL II – Factor Types
[Figure: factor types illustrated on Table 1 and Table 2]

PQL III – Single Relation Factors
    FACTOR Normal(p.weight,75.0,25.0) FROM People p
[Figure: factor attached to the People table]

PQL IV – Cross Relation Factors
    FACTOR Normal(g.weight, p.weight, 1.0) FROM People p, DrVisit g WHERE p.PersonID = g.PersonID
[Figure: factors connecting the DrVisit and People tables]
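One way to read the two FACTOR statements is that each row of the query result contributes one factor to the graph. The sketch below illustrates that reading in plain Python; it is not PQL and not the authors' implementation. The table rows are made up, and it assumes that `Normal(x, m, v)` takes a mean and a variance and that `p.weight` is the latent quantity while `g.weight` is observed.

```python
from collections import defaultdict

# Hypothetical table contents, for illustration only.
PEOPLE = [{"PersonID": 1}, {"PersonID": 2}]
DR_VISITS = [
    {"PersonID": 1, "weight": 82.0},
    {"PersonID": 1, "weight": 80.0},
    {"PersonID": 2, "weight": 61.0},
]


def ground_factors(people, dr_visits):
    """Emit one Gaussian factor on p.weight per query result row.

    Each factor is (PersonID, mean, variance), with the variance
    interpretation of the third argument being an assumption.
    """
    factors = []
    # FACTOR Normal(p.weight, 75.0, 25.0) FROM People p
    for p in people:
        factors.append((p["PersonID"], 75.0, 25.0))
    # FACTOR Normal(g.weight, p.weight, 1.0)
    #   FROM People p, DrVisit g WHERE p.PersonID = g.PersonID
    for p in people:
        for g in dr_visits:
            if p["PersonID"] == g["PersonID"]:
                factors.append((p["PersonID"], g["weight"], 1.0))
    return factors


def infer_weights(factors):
    """Multiply the Gaussian factors touching each person's weight."""
    precision = defaultdict(float)
    precision_mean = defaultdict(float)
    for person_id, mean, var in factors:
        precision[person_id] += 1.0 / var
        precision_mean[person_id] += mean / var
    return {
        pid: (precision_mean[pid] / precision[pid], 1.0 / precision[pid])
        for pid in precision
    }
```

Calling `infer_weights(ground_factors(PEOPLE, DR_VISITS))` returns a posterior mean and variance per person, with the prior factor pulling each estimate slightly towards 75.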

PQL as a Unifying Platform

TrueSkill™ Joint work with Tom Minka & Phillip Trelford

TrueSkill™
Given:
• Match outcomes: Orderings among k teams consisting of n1, n2, ..., nk players, respectively
Questions:
• Skill si for each player such that
  • Global ranking among all players
  • Fair matches between teams of players

Efficient Approximate Inference
• Gaussian Prior Factors
• Ranking Likelihood Factors
• Fast and efficient approximate message passing using Expectation Propagation
[Figure: factor graph over the variables s1, s2, s3, s4, t1, t2, t3, y12 and y23]
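For the special case of two players and no draws, the message-passing schedule collapses to a closed-form update. The slides do not show it; the sketch below follows the update as given in the published TrueSkill paper, as best recalled here, and omits the skill dynamics. The defaults (prior mean 25, prior standard deviation 25/3, performance noise beta = 25/6) are the commonly quoted ones and should be treated as assumptions.

```python
import math

BETA = 25.0 / 6.0  # assumed performance noise standard deviation


def pdf(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def cdf(t):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def v(t):
    """Additive mean correction from the truncated Gaussian."""
    return pdf(t) / cdf(t)


def w(t):
    """Multiplicative variance correction from the truncated Gaussian."""
    return v(t) * (v(t) + t)


def update_two_players(winner, loser, beta=BETA):
    """Update (mu, sigma) of winner and loser after one game, no draws."""
    mu_w, sigma_w = winner
    mu_l, sigma_l = loser
    c = math.sqrt(2.0 * beta ** 2 + sigma_w ** 2 + sigma_l ** 2)
    t = (mu_w - mu_l) / c
    new_winner = (
        mu_w + sigma_w ** 2 / c * v(t),
        sigma_w * math.sqrt(1.0 - sigma_w ** 2 / c ** 2 * w(t)),
    )
    new_loser = (
        mu_l - sigma_l ** 2 / c * v(t),
        sigma_l * math.sqrt(1.0 - sigma_l ** 2 / c ** 2 * w(t)),
    )
    return new_winner, new_loser
```

An upset (a low-rated player beating a high-rated one) makes t negative, which increases v(t) and therefore produces a larger shift in both means than an expected result does.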

TrueSkill: Superfast convergence to True Skills
[Figure: Level (0 to 40) against games played (0 to 400) for the players char and SQLwildman, each shown under TrueSkill™ and under the Halo 2 Beta ranking]

Applications to Online Gaming
• Leaderboard
  • Global ranking of all players
• Matchmaking
  • For gamers: Most uncertain outcome
  • For inference: Most informative
  • Both are equivalent!
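The matchmaking criterion of the most uncertain outcome can be sketched as picking the opponent whose predicted win probability is closest to one half. This is a simplified proxy that ignores draws, not the match-quality formula used in production; the function names, the candidate pool and the value of beta are assumptions.

```python
import math

BETA = 25.0 / 6.0  # assumed performance noise standard deviation


def cdf(t):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def win_probability(player, opponent, beta=BETA):
    """Probability that player beats opponent, given (mu, sigma) skills."""
    mu_p, sigma_p = player
    mu_o, sigma_o = opponent
    c = math.sqrt(2.0 * beta ** 2 + sigma_p ** 2 + sigma_o ** 2)
    return cdf((mu_p - mu_o) / c)


def most_uncertain_opponent(player, candidates):
    """Name of the candidate whose match outcome is hardest to predict."""
    return min(
        candidates,
        key=lambda name: abs(win_probability(player, candidates[name]) - 0.5),
    )
```

A game whose outcome is close to a coin flip is also the one whose result carries the most information about the two skills, which is the equivalence the slide points out.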

TrueSkill in Xbox 360 and Halo 3

AdPredictor Joint work with Joaquin Quiñonero Candela, Onno Zoeter, Tom Borchert, Phillip Trelford

Why Predict Probability-of-Click?
Display (according to expected revenue)    Charge (per click)
$1.00 * 10% = $0.10                        $0.80
$2.00 * 4% = $0.08                         $1.25
$0.10 * 50% = $0.05                        $0.05
• Advantages of improved probability estimates:
  • Increase user satisfaction by better targeting
  • Fairer charges to advertisers
  • Increase revenue by showing ads with high click-through rate
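The numbers on this slide are consistent with ranking ads by bid times click probability and charging each ad the smallest per-click price that keeps its position, i.e. the next ad's expected revenue divided by its own click probability. The slide does not name the pricing rule, so this generalised second-price reading is an assumption, as are the ad names and the `reserve` price applied to the last slot.

```python
# (name, bid per click in dollars, predicted click probability);
# bids and probabilities are taken from the slide, names are made up.
ADS = [("A", 1.00, 0.10), ("B", 2.00, 0.04), ("C", 0.10, 0.50)]


def rank_and_charge(ads, reserve=0.05):
    """Rank ads by expected revenue and compute per-click charges.

    Returns (name, expected revenue, charge per click) in display order.
    The last ad has no competitor below it and pays the assumed reserve.
    """
    ranked = sorted(ads, key=lambda ad: ad[1] * ad[2], reverse=True)
    result = []
    for i, (name, bid, p_click) in enumerate(ranked):
        if i + 1 < len(ranked):
            _, next_bid, next_p = ranked[i + 1]
            charge = next_bid * next_p / p_click
        else:
            charge = reserve
        result.append((name, round(bid * p_click, 4), round(charge, 4)))
    return result
```

Under this rule the charge depends directly on the predicted click probabilities, so a miscalibrated estimate translates into over- or under-charging, which is why the slide lists fairer charges as a benefit of better prediction.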

adPredictor Details
[Figure: binary features for Client IP (102.34.12.201, 15.70.165.9, 221.98.2.187, 92.154.3.86), Match Type (Exact Match, Broad Match) and Position (ML-1, SB-1, SB-2), whose active weights are added (+) to give P(pClick)]

Training Algorithm in Action
[Figure: weights w1 and w2, their sum (+) s and the outcome c (No Click / Click)]
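A minimal sketch of one online training step, assuming a probit model with an independent Gaussian belief (mean, variance) per binary feature, in the style of the published adPredictor update. The slides do not give the equations; the class name, the prior, the value of beta and the feature strings are illustrative assumptions.

```python
import math


def pdf(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def cdf(t):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


class AdPredictorSketch:
    """Bayesian probit regression over sparse binary features."""

    def __init__(self, prior_mean=0.0, prior_var=1.0, beta=1.0):
        self.prior = (prior_mean, prior_var)
        self.beta = beta  # assumed noise standard deviation
        self.weights = {}  # feature name -> (mean, variance)

    def _totals(self, features):
        beliefs = [self.weights.get(f, self.prior) for f in features]
        total_mean = sum(m for m, _ in beliefs)
        total_var = self.beta ** 2 + sum(s2 for _, s2 in beliefs)
        return total_mean, total_var

    def predict(self, features):
        """Predicted click probability for a set of active features."""
        total_mean, total_var = self._totals(features)
        return cdf(total_mean / math.sqrt(total_var))

    def update(self, features, clicked):
        """Shift means and shrink variances of the active features."""
        y = 1.0 if clicked else -1.0
        total_mean, total_var = self._totals(features)
        total_sd = math.sqrt(total_var)
        t = y * total_mean / total_sd
        v = pdf(t) / cdf(t)
        w = v * (v + t)
        for f in features:
            mean, var = self.weights.get(f, self.prior)
            self.weights[f] = (
                mean + y * var / total_sd * v,
                var * (1.0 - var / total_var * w),
            )
```

Features with large variance receive larger updates, so rarely seen values (for example a new client IP) adapt quickly while well-established weights move only slightly.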

Client IP: Mean & Variance
[Figure: posterior mean and variance per client IP, with regions labelled Low clickers and High clickers]

UserAgent: Mean Posterior Effects

AdPredictor in Bing Search Engine
• AdPredictor now serves 100% of Paid Search traffic in Microsoft’s Bing Search Engine
• Relevance and Click-Through Rate of Ads improved
• Calibrated CTR prediction provides a solid foundation for further improvements
• AdPredictor is being explored for other tasks such as contextual and display advertising

Matchbox Joint work with David Stern and Ralf Herbrich

Collaborative Filtering
[Figure: matrix of Users (A, B, C, D) by Items (1 to 6) with unknown entries marked "?", annotated "Metadata?"]

Map Sparse Features To ‘Trait’ Space
[Figure: sparse user features (User ID, Gender: Male/Female, Country: UK/USA, Height: 1.2m) and item features (Item ID, Movie Genre: Horror/Drama/Comedy/Documentary) mapped to a shared trait space]

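Message Passing For Matchbox
[Figure: factor graph with user weights u11, u21, u12, u22, u01, u02, item weights v11, v21, v12, v22, traits s1, s2, t1, t2 and rating r, connected by sum (+) and product (*) factors]
Message update functions powered by Infer.NET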
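The structure of the graph (sums of feature weights feeding traits, products of user and item traits summed into a rating) can be illustrated with point estimates. The sketch below is a deterministic simplification: it ignores the uncertainty that message passing tracks, leaves out any bias terms, and uses made-up feature names and weight values.

```python
def trait_vector(weights, features):
    """Map sparse features to trait space: a weighted sum of weight rows.

    weights:  feature name -> list of K trait weights
    features: feature name -> feature value (1.0 for binary features)
    """
    k = len(next(iter(weights.values())))
    traits = [0.0] * k
    for name, value in features.items():
        for j in range(k):
            traits[j] += weights[name][j] * value
    return traits


def predict_rating(user_weights, item_weights, user_features, item_features):
    """Inner product of user traits s and item traits t."""
    s = trait_vector(user_weights, user_features)
    t = trait_vector(item_weights, item_features)
    return sum(s_k * t_k for s_k, t_k in zip(s, t))


# Hypothetical two-dimensional trait weights, for illustration only.
U = {"UserID=234566": [0.5, -0.2], "Gender=Male": [0.1, 0.3]}
V = {"ItemID=13456": [1.0, 0.0], "Genre=Horror": [0.2, 0.5]}
```

With the weights above, `predict_rating(U, V, {"UserID=234566": 1.0, "Gender=Male": 1.0}, {"ItemID=13456": 1.0, "Genre=Horror": 1.0})` combines the user traits [0.6, 0.1] with the item traits [1.2, 0.5]. Because metadata features share weights across users and items, a brand-new user or item still receives a prediction from its metadata alone.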

User/Item Taste Space ‘Preference Cone’ for user 145035

Applications

Conclusions

Conclusions
• Great variety of data sources and tasks
• Challenges: privacy, incentives, exploration
• Tools: SQL, NoSQL, HPC
• Modelling platform (Factor Graphs & PQL):
  • Represent uncertainty
  • Composable models
  • Distributed, data-centric computation
• Applications: TrueSkill, AdPredictor, Matchbox
• Thanks!
