Analysing and Modelling Large-Scale Enterprise Data. Thore Graepel, Online Services and Advertising Group, Microsoft Research Cambridge.
Overview • Complex large-scale data in the enterprise • What kind of data is available? • What technologies are used? • Tasks and enterprise-specific challenges? • Methodology: • Bayesian Inference in Factor Graph Models • PQL: Using SQL to describe probability models • Applications: • Gamer Rating and Matchmaking: TrueSkill • Click-Through Rate Prediction: AdPredictor • Large-Scale Recommendations: Matchbox
Complex Data Joint work with Tom Minka & Phillip Trelford
Data Sources at Microsoft (External) • Online Services Division • Web index • Search and Ad click logs (12-15 TB / day) • Hotmail, Instant messaging, Internet Explorer (100s million users) • MSN portal and Bing maps • Xbox Live Gaming Service • User transaction log data • Ranking and matchmaking data • Game instrumentation for user testing
Data Sources at Microsoft (Internal) • Development and Software Instrumentation • Watson (customer feedback data) • Source depot (MS source code, e.g., Office, Windows) • Multilingual technical documentation • Business • Customer databases • Sales and Marketing
Data-Intensive Tasks at Microsoft • Prediction of user behaviour and preferences • Improve web search • Improve targeting for advertising • Spam filtering and content prioritisation • Improve user experience • Matchmaking for games • Multi-modal user interfaces (Natal, speech) • Improve software development process • Improve productivity of developers • Analyse software for defects
Technical Infrastructure • Relational Databases/SQL • Great agility for analysis and reliability for business • Limited scalability • Need to import data into SQL • Windows HPC • Complex computations / fine grained parallelism • Need to move data to HPC cluster • Cosmos • Take the computation to the data • Super efficient stream based computations
Cosmos Architecture [Diagram: SCOPE, DryadLINQ, Sputnik, Dryad and Cosmos, running over four cluster machines, each holding a stream]
Enterprise/Online-specific challenges • Privacy • Privacy limits the ways in which data can be used • Interesting trade-offs (differential privacy) • Incentives • Data produced by self-interested agents • Need to design incentive-compatible mechanisms • Exploration/Exploitation • Results of inference feed back into the business process and determine future observations • Need to aim at long-term benefits
Factor Graphs / Trees • Definition: Graphical representation of the product structure of a function (Wiberg, 1996) • Nodes: factors and variables • Edges: Dependencies of factors on variables • Questions: • What are the marginals of the function (all but one variable are summed out)? • What is the mode of the function?
Factor Graphs and Bayesian Inference • Bayes’ law • Factorising prior • Factorising likelihood • Sum out latent variables [Factor graph with nodes s1, s2, s, t1, t2, d, y]
Factor Trees: Separation • Observation: The sum of products becomes a product of sums of all messages from neighbouring factors to a variable! [Factor tree: variables v, w, x, y, z; factors f1(v,w), f2(w,x), f3(x,y), f4(x,z)]
Messages: From Factors To Variables • Observation: Factors only need to sum out all their local variables! [Factor tree: variables w, x, y, z; factors f2(w,x), f3(x,y), f4(x,z)]
Messages: From Variables To Factors • Observation: Variables pass on the product of all incoming messages! [Factor tree: variables x, y, z; factors f2(w,x), f3(x,y), f4(x,z)]
The Sum-Product Algorithm • Three update equations (Aji & McEliece, 1997) • Update equations can be directly derived from the distributive law. • Efficient for messages in the exponential family. • Calculate all marginals at the same time.
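The observations on the preceding slides can be checked on the factor tree shown there. The following is a minimal sketch, assuming binary variables and arbitrary, hypothetical factor tables (the slides give no numbers); it computes the marginal of x by message passing and by brute-force summation, which agree by the distributive law.

```python
from itertools import product

STATES = (0, 1)

# Hypothetical factor tables over binary variables (values chosen arbitrarily).
f1 = [[1.0, 2.0], [3.0, 1.0]]  # f1[v][w]
f2 = [[2.0, 1.0], [1.0, 4.0]]  # f2[w][x]
f3 = [[1.0, 3.0], [2.0, 1.0]]  # f3[x][y]
f4 = [[5.0, 1.0], [1.0, 2.0]]  # f4[x][z]


def normalise(values):
    total = sum(values)
    return [value / total for value in values]


def marginal_x_sum_product():
    # Factor -> variable: f1 sums out its local variable v.
    m_f1_w = [sum(f1[v][w] for v in STATES) for w in STATES]
    # Variable -> factor: w passes on the product of its other incoming
    # messages; here there is only the one from f1.
    m_w_f2 = m_f1_w
    # Factor -> variable: f2 sums out w, weighted by the incoming message.
    m_f2_x = [sum(f2[w][x] * m_w_f2[w] for w in STATES) for x in STATES]
    # f3 and f4 sum out their leaf variables y and z.
    m_f3_x = [sum(f3[x][y] for y in STATES) for x in STATES]
    m_f4_x = [sum(f4[x][z] for z in STATES) for x in STATES]
    # Marginal of x: product of all incoming factor messages.
    return normalise([m_f2_x[x] * m_f3_x[x] * m_f4_x[x] for x in STATES])


def marginal_x_brute_force():
    # Sum the full product over all 2^5 joint configurations.
    totals = [0.0, 0.0]
    for v, w, x, y, z in product(STATES, repeat=5):
        totals[x] += f1[v][w] * f2[w][x] * f3[x][y] * f4[x][z]
    return normalise(totals)
```

The message-passing version touches each factor once, whereas the brute-force version enumerates every joint configuration; on a tree this gap grows exponentially with the number of variables.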
Approximate Message Passing • Problem: The exact messages from factors to variables may not be closed under products. • Solution: Approximate the marginal as well as possible in the sense of minimal KL divergence. • Expectation Propagation (Minka, 2001): Approximate the marginal by moment-matching resulting in
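The formula from the slide is not preserved in this transcript. As a hedged reconstruction, the update is commonly written as below, where the notation is an assumption: m_{f→x} is the exact message from factor f to variable x, m_{x→f} the incoming message, and Q the chosen exponential family (for a Gaussian Q, the minimisation amounts to matching mean and variance).

```latex
\hat{p}(x) = \operatorname*{arg\,min}_{q \in \mathcal{Q}}
  \mathrm{KL}\!\left( \frac{m_{f \to x}(x)\, m_{x \to f}(x)}{Z} \,\middle\|\, q(x) \right),
\qquad
\hat{m}_{f \to x}(x) = \frac{\hat{p}(x)}{m_{x \to f}(x)},
\qquad
Z = \int m_{f \to x}(x)\, m_{x \to f}(x)\, \mathrm{d}x .
```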
Distributed Message Passing • Map-Reduce for IID data • Map: Nodes compute messages m_{f_i→s} from data y_i and m_{f_i→s} • Reduce: Combine messages m_{f_i→s} into p(s) by multiplication • Caveats: • All approximate data factors need the incoming message m_{s→f_i}! • All messages m_{f_i→s} need to be stored if the same data point is considered multiple times [Factor graph: variable s connected to data y1, y2, y3]
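A minimal sketch of the map and reduce steps, assuming the simplest possible model: a Gaussian prior on s and IID observations y_i that are Gaussian around s with known noise variance. In this special case the data messages are exact and do not depend on the incoming message, so the caveats above do not bite; all function names are hypothetical.

```python
from functools import reduce


def to_natural(mean, variance):
    # Gaussian in natural parameters: (precision, precision * mean).
    return (1.0 / variance, mean / variance)


def multiply(a, b):
    # Multiplying Gaussian densities adds their natural parameters.
    return (a[0] + b[0], a[1] + b[1])


def map_message(y, noise_variance):
    # Map: the message from data factor f_i to s is a Gaussian in s
    # centred on the observation y_i.
    return to_natural(y, noise_variance)


def posterior(data, prior_mean, prior_variance, noise_variance):
    # Reduce: combine the prior and all data messages by multiplication.
    messages = (map_message(y, noise_variance) for y in data)
    precision, precision_mean = reduce(
        multiply, messages, to_natural(prior_mean, prior_variance)
    )
    return precision_mean / precision, 1.0 / precision
```

Because multiplication of messages is associative and commutative, the reduce step can be split across machines in any order, which is what makes the scheme fit a map-reduce system.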
PQL Joint work with Ralf Herbrich & Jurgen Van Gael
PQL I – Augmenting Schemas • People = AUGMENT DB.People ADD weight FLOAT [Diagram: DB.People becomes People with an added weight column]
PQL II – Factor Types [Diagram: factor types within a single table (Table 1) and across tables (Table 1, Table 2)]
PQL III – Single Relation Factors • FACTOR Normal(p.weight, 75.0, 25.0) FROM People p [Diagram: factor attached to People]
PQL IV – Cross Relation Factors • FACTOR Normal(g.weight, p.weight, 1.0) FROM People p, DrVisit g WHERE p.PersonID = g.PersonID [Diagram: factor linking DrVisit and People]
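One way to read the two FACTOR statements is as templates that are instantiated once per row of the query result. The sketch below illustrates that reading in Python rather than PQL, under loudly stated assumptions: the third argument of Normal is taken to be a variance, the rows are hypothetical, and the function name is invented for illustration. It is not the PQL engine.

```python
def posterior_weights(people, dr_visits,
                      prior_mean=75.0, prior_variance=25.0,
                      noise_variance=1.0):
    # Single relation factor: one Normal prior per row of People,
    # stored in natural parameters (precision, precision * mean).
    precision = {p["PersonID"]: 1.0 / prior_variance for p in people}
    precision_mean = {p["PersonID"]: prior_mean / prior_variance
                      for p in people}
    # Cross relation factor: one Normal factor per row of the join
    # People p, DrVisit g WHERE p.PersonID = g.PersonID.
    for g in dr_visits:
        pid = g["PersonID"]
        if pid in precision:
            precision[pid] += 1.0 / noise_variance
            precision_mean[pid] += g["weight"] / noise_variance
    # Convert back to (mean, variance) of the latent weight per person.
    return {pid: (precision_mean[pid] / precision[pid], 1.0 / precision[pid])
            for pid in precision}


# Hypothetical rows, for illustration only.
people = [{"PersonID": 1}, {"PersonID": 2}]
dr_visits = [{"PersonID": 1, "weight": 80.0},
             {"PersonID": 1, "weight": 82.0}]
result = posterior_weights(people, dr_visits)
```

A person without any doctor visits keeps the prior, while each matching visit row tightens the estimate of that person's latent weight.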
TrueSkill™ Joint work with Tom Minka & Phillip Trelford
TrueSkill™ • Given: Match outcomes: orderings among k teams consisting of n1, n2, ..., nk players, respectively • Questions: Skill si for each player such that there is • a global ranking among all players • fair matches between teams of players
Efficient Approximate Inference • Fast and efficient approximate message passing using Expectation Propagation [Factor graph: Gaussian prior factors on skills s1, s2, s3, s4; team performances t1, t2, t3; ranking likelihood factors on outcomes y12, y23]
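For the simplest case of two players and no draws, the moment-matching update has a well-known closed form. The sketch below assumes that case only; the value of beta (performance noise) and the starting skill of mean 25 and standard deviation 25/3 are commonly quoted defaults, used here as assumptions rather than taken from the slides.

```python
import math


def normal_pdf(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def normal_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def update_skills(winner, loser, beta=25.0 / 6.0):
    """Each skill is a (mean, variance) pair; returns the updated pairs."""
    mu_w, var_w = winner
    mu_l, var_l = loser
    # Total variance of the performance difference.
    c = math.sqrt(2.0 * beta ** 2 + var_w + var_l)
    t = (mu_w - mu_l) / c
    # Moment-matching corrections for a Gaussian truncated at zero.
    v = normal_pdf(t) / normal_cdf(t)
    w = v * (v + t)
    new_winner = (mu_w + var_w / c * v, var_w * (1.0 - var_w / c ** 2 * w))
    new_loser = (mu_l - var_l / c * v, var_l * (1.0 - var_l / c ** 2 * w))
    return new_winner, new_loser


# Two new players with the assumed default skill belief.
start = (25.0, (25.0 / 3.0) ** 2)
after_winner, after_loser = update_skills(start, start)
```

The winner's mean moves up and the loser's moves down by the same amount when both start from the same belief, and both variances shrink because the outcome is informative.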
TrueSkill: Superfast convergence to True Skills [Plot: level (0 to 40) against games played (0 to 400) for players char and SQLwildman, each under TrueSkill™ and under the Halo 2 Beta system]
Applications to Online Gaming • Leaderboard • Global ranking of all players • Matchmaking • For gamers: Most uncertain outcome • For inference: Most informative • Both are equivalent!
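One simple reading of "most uncertain outcome" is to pick the opponent whose predicted win probability is closest to 50%. The sketch below assumes Gaussian skill beliefs given as (mean, variance) pairs and a hypothetical performance noise beta; it is an illustration of the idea, not the production matchmaking criterion.

```python
import math


def win_probability(a, b, beta=25.0 / 6.0):
    # Probability that player a outperforms player b, with skills given
    # as (mean, variance) and Gaussian performance noise beta per player.
    mu_a, var_a = a
    mu_b, var_b = b
    total_variance = 2.0 * beta ** 2 + var_a + var_b
    return 0.5 * (1.0 + math.erf((mu_a - mu_b) / math.sqrt(2.0 * total_variance)))


def best_opponent(player, candidates):
    # Choose the candidate whose match outcome is most uncertain.
    return min(
        candidates,
        key=lambda name: abs(win_probability(player, candidates[name]) - 0.5),
    )


# Hypothetical skill beliefs, for illustration only.
player = (25.0, 9.0)
candidates = {"weak": (15.0, 9.0), "even": (26.0, 9.0), "strong": (40.0, 9.0)}
choice = best_opponent(player, candidates)
```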
AdPredictor Joint work with Joaquin Quiñonero Candela, Onno Zoeter, Tom Borchert, Phillip Trelford
Why Predict Probability-of-Click? • Display (according to expected revenue) and charge (per click): • $1.00 * 10% = $0.10, charge $0.80 • $2.00 * 4% = $0.08, charge $1.25 • $0.10 * 50% = $0.05, charge $0.05 • Advantages of improved probability estimates: • Increase user satisfaction by better targeting • Fairer charges to advertisers • Increase revenue by showing ads with high click-thru rate
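The numbers on this slide can be reproduced with a short sketch. The ranking rule (bid times click probability) is as stated on the slide; the pricing rule used here, the next ad's expected revenue divided by the ad's own click probability, is an inference that happens to match the first two charges and is not stated in the source. The reserve value and the ad names are hypothetical, with the reserve chosen so that the third charge also matches.

```python
def rank_and_price(ads, reserve_score):
    """ads: list of (name, bid, p_click). Returns (name, score, charge) rows."""
    # Display order: descending expected revenue per impression.
    ranked = sorted(ads, key=lambda ad: ad[1] * ad[2], reverse=True)
    scores = [bid * p_click for _, bid, p_click in ranked] + [reserve_score]
    # Assumed pricing: the smallest per-click price that keeps the position.
    return [
        (name, scores[i], scores[i + 1] / p_click)
        for i, (name, _, p_click) in enumerate(ranked)
    ]


ads = [("B", 2.00, 0.04), ("C", 0.10, 0.50), ("A", 1.00, 0.10)]
rows = rank_and_price(ads, reserve_score=0.025)
```

Under this rule a miscalibrated click probability changes both the display order and what each advertiser pays, which is why the slide stresses calibrated estimates.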
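adPredictor Details [Diagram: one weight per discrete feature value, for Client IP (102.34.12.201, 15.70.165.9, 221.98.2.187, 92.154.3.86), Match Type (Exact Match, Broad Match) and Position (ML-1, SB-1, SB-2); the weights of the active values are summed (+) to give pClick]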
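Training Algorithm in Action [Animation: weights w1 and w2 summed (+) into score s, with click outcome c; panels for No Click and Click]
Client IP: Mean & Variance [Plot: mean and variance of Client IP weights, ranging from low clickers to high clickers]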
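A minimal sketch of such an online update, assuming a probit model with an independent Gaussian belief (mean, variance) per active binary feature and a noise parameter beta. The feature names reuse values shown on the details slide; the starting beliefs and the value of beta are hypothetical.

```python
import math


def normal_pdf(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def normal_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def predict(weights, active, beta=1.0):
    # Predicted click probability from the summed weight beliefs.
    mean = sum(weights[f][0] for f in active)
    variance = beta ** 2 + sum(weights[f][1] for f in active)
    return normal_cdf(mean / math.sqrt(variance))


def update(weights, active, clicked, beta=1.0):
    # One online Bayesian update (moment matching) for one impression.
    y = 1.0 if clicked else -1.0
    total_variance = beta ** 2 + sum(weights[f][1] for f in active)
    total_sd = math.sqrt(total_variance)
    t = y * sum(weights[f][0] for f in active) / total_sd
    v = normal_pdf(t) / normal_cdf(t)
    w = v * (v + t)
    for f in active:
        mean, variance = weights[f]
        weights[f] = (
            mean + y * variance / total_sd * v,
            variance * (1.0 - variance / total_variance * w),
        )


active = ["ClientIP=15.70.165.9", "MatchType=Exact Match", "Position=ML-1"]
weights = {f: (0.0, 1.0) for f in active}
before = predict(weights, active)
update(weights, active, clicked=True)
after = predict(weights, active)
```

Features with larger variance receive a larger share of each update, so rarely seen values (such as a new client IP) adapt quickly while well-observed ones stay stable.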
AdPredictor in Bing Search Engine • AdPredictor now serves 100% of Paid Search traffic in Microsoft’s Bing Search Engine • Relevance and Click-Through Rate of ads improved • Calibrated CTR prediction provides a solid foundation for further improvements • AdPredictor is being explored for other tasks such as contextual and display advertising
Matchbox Joint work with David Stern and Ralf Herbrich
Collaborative Filtering [Figure: rating matrix of users A to D against items 1 to 6, with unknown entries (?) to be predicted; can metadata help?]
Map Sparse Features To ‘Trait’ Space [Figure: sparse features such as User ID, Item ID, Gender (Male, Female), Country (UK, USA), Height (1.2m) and Movie Genre (Horror, Drama, Comedy, Documentary) mapped into trait space]
Message Passing For Matchbox • Message update functions powered by Infer.NET [Factor graph over variables u11, u21, u01, u12, u22, u02, v11, v21, v12, v22, s1, s2, t1, t2 and r, joined by sum (+) and product (*) factors]
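The structure of the graph suggests a bilinear model: active user features are summed into a user trait vector s, active item features into an item trait vector t, and the rating r depends on their inner product. The sketch below is a point-estimate illustration of that reading only; it ignores the uncertainty and the message passing that the slide is about, and all feature names and trait values are hypothetical.

```python
def trait_vector(active_features, trait_table, num_traits):
    # Sum the trait contributions of all active sparse features.
    traits = [0.0] * num_traits
    for feature in active_features:
        for k, value in enumerate(trait_table[feature]):
            traits[k] += value
    return traits


def predicted_affinity(user_features, item_features,
                       user_traits, item_traits, num_traits=2):
    s = trait_vector(user_features, user_traits, num_traits)
    t = trait_vector(item_features, item_traits, num_traits)
    # Inner product of user and item positions in trait space.
    return sum(s_k * t_k for s_k, t_k in zip(s, t))


# Hypothetical learned traits, for illustration only.
user_traits = {"UserID=234566": [1.0, 0.0], "Gender=Male": [0.5, -1.0]}
item_traits = {"ItemID=13456": [2.0, 1.0], "Genre=Horror": [0.0, 1.0]}
affinity = predicted_affinity(
    ["UserID=234566", "Gender=Male"],
    ["ItemID=13456", "Genre=Horror"],
    user_traits, item_traits,
)
```

Because metadata features such as gender or genre contribute to the trait vectors alongside the IDs, a new user or item with no ratings still gets a prediction from its metadata alone.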
User/Item Taste Space ‘Preference Cone’ for user 145035
Conclusions • Great variety of data sources and tasks • Challenges: privacy, incentives, exploration • Tools: SQL, No-SQL, HPC • Modelling platform (Factor Graphs & PQL): • Represent uncertainty • Composable models • Distributed, data-centric computation • Applications: TrueSkill, AdPredictor, Matchbox • Thanks!