Graphs, Algorithms and Big Data: the Google AdWords Case study

Graphs, Algorithms and Big Data: the Google AdWords Case study GDG DevFest Central Italy 2013 Alessandro Epasto

Joint work with J. Feldman, S. Lattanzi, V. Mirrokni (Google Research), S. Leonardi (Sapienza U. Rome), H. Lynch (Google) and the AdWords team.

The AdWords Problem

The AdWords Problem ?

The AdWords Problem Soccer Shoes

Google Advertisement in Numbers • Over a billion queries a day. • A lot of advertisers. www.google.com/competition/howgooglesearchworks.html

Challenges • Several scientific and technological challenges. • How to find the best ads in real time? • How to price each ad? • How to suggest new queries to advertisers? • The solution to these problems involves some fundamental scientific results (e.g. a Nobel Prize-winning auction mechanism).

Google Advertisement in Numbers 2012 Revenues: 46 billion USD • 95% Advertisement: 43 billion USD. http://investor.google.com/financial/tables.html

Goals of the Project • Mine AdWords data to automatically identify each advertiser's main competitors and to suggest relevant queries to each advertiser. • Goals: • Useful business information. • Improve advertisement. • More relevant performance benchmarks.

Information Deluge Large advertisers (e.g. Amazon, Ask.com, etc.) compete in several market segments with very different advertisers.

Representing the data • How to represent the salient features of the data? • Relationships between advertisers and queries • Statistics: clicks, costs, etc. • Take into account the categories. • Efficient algorithms.

Graphs: the lingua franca of Big Data Mathematical objects studied well before the advent of computers. Königsberg’s bridges problem. Euler, 1735.

Graphs: the lingua franca of Big Data Graphs are everywhere! Technological Networks Social Networks Natural Networks

Graphs: the lingua franca of Big Data Formal definition • A set of nodes (A, B, C, D in the figure). • A set of edges. • The edges might have a weight (1, 2, 3, 4 in the figure).
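The formal definition above can be made concrete with a small data structure. The following is a minimal sketch in Python: the nodes A, B, C, D and the weights 1 to 4 come from the slide, but which edge carries which weight cannot be recovered from the figure, so the assignment below is an assumption made purely for illustration. The helper name `build_graph` is hypothetical.

```python
from collections import defaultdict


def build_graph(weighted_edges):
    """Return an adjacency map: node -> {neighbor: weight}."""
    adjacency = defaultdict(dict)
    for u, v, weight in weighted_edges:
        adjacency[u][v] = weight
        adjacency[v][u] = weight  # undirected: store both directions
    return dict(adjacency)


# Assumed edge-to-weight assignment (illustration only).
edges = [("A", "B", 1), ("B", "D", 2), ("A", "C", 3), ("C", "D", 4)]
graph = build_graph(edges)
nodes = sorted(graph)  # the set of nodes
```

An adjacency map keeps the neighbors of a node directly accessible, which is the access pattern random-walk algorithms rely on.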

AdWords data as a (Bipartite) Graph Hundreds of Labels A lot of Advertisers Billions of Queries
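A toy version of such a bipartite graph might look as follows. This is only a sketch: the record layout, the advertiser names, the click counts, and the choice of attaching one label to each query are all assumptions made for illustration, not details taken from the AdWords data.

```python
from collections import defaultdict

# Hypothetical records: (advertiser, query, clicks, label of the query).
records = [
    ("shoe_store", "soccer shoes", 120, "Sports"),
    ("shoe_store", "running shoes", 80, "Sports"),
    ("book_shop", "soccer history book", 15, "Books"),
    ("mega_retailer", "soccer shoes", 300, "Sports"),
    ("mega_retailer", "soccer history book", 40, "Books"),
]


def build_bipartite(records, allowed_labels=None):
    """Map advertiser -> {query: clicks}, keeping only allowed labels."""
    graph = defaultdict(dict)
    for advertiser, query, clicks, label in records:
        if allowed_labels is None or label in allowed_labels:
            graph[advertiser][query] = clicks
    return dict(graph)


full_graph = build_bipartite(records)
sports_only = build_bipartite(records, allowed_labels={"Sports"})
```

Restricting the graph to a subset of labels, as in `sports_only`, yields a different subgraph for every subset, so a large advertiser can be compared with different competitors in each market segment.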

Semi-Formal Problem Definition • A bipartite graph of advertisers and queries. • A given advertiser A and a set of labels. • Goal: find the nodes most “similar” to A.

How to Define Similarity? Several node similarity measures in the literature are based on the graph structure, random walks, etc. • What is the accuracy? • Can it scale to graphs with billions of nodes? • Can it be computed in real time?
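As a point of comparison, one of the simplest structure-based measures is the Jaccard coefficient of the two advertisers' query sets. The sketch below is not the measure used in the project; it is a standard baseline, and the advertiser names and query sets are hypothetical.

```python
def jaccard(queries_a, queries_b):
    """Fraction of shared queries among all queries of the two advertisers."""
    a, b = set(queries_a), set(queries_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical advertiser -> set of queries.
advertiser_queries = {
    "shoe_store": {"soccer shoes", "running shoes"},
    "mega_retailer": {"soccer shoes", "soccer history book"},
    "book_shop": {"soccer history book"},
}


def most_similar(target, advertiser_queries):
    """Rank all other advertisers by Jaccard similarity to the target."""
    scores = {
        other: jaccard(advertiser_queries[target], queries)
        for other, queries in advertiser_queries.items()
        if other != target
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = most_similar("shoe_store", advertiser_queries)
```

A measure like this only sees direct overlap: two advertisers with no query in common always score zero, even when they are linked through many intermediate nodes. Random-walk measures take such longer paths into account.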

The three ingredients of Big Data • A lot of data… • A sophisticated infrastructure: MapReduce • Efficient algorithms: Graph mining

MapReduce

MapReduce The work is spread in parallel across several machines connected by fast links.
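The programming model can be illustrated with a single-process simulation. This is a sketch of the map, shuffle, and reduce phases only, not of Google's MapReduce system: in a real deployment each phase runs on many machines, and all function names and log records here are hypothetical.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def clicks_mapper(record):
    advertiser, _query, clicks = record
    yield advertiser, clicks


def sum_reducer(_key, values):
    return sum(values)


# Hypothetical click log: (advertiser, query, clicks).
log = [
    ("shoe_store", "soccer shoes", 120),
    ("mega_retailer", "soccer shoes", 300),
    ("shoe_store", "running shoes", 80),
]
clicks_per_advertiser = reduce_phase(
    shuffle(map_phase(log, clicks_mapper)), sum_reducer
)
```

Because mappers handle records independently and reducers handle keys independently, both phases can be split across machines without coordination beyond the shuffle.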

Algorithms Personalized PageRank: • Random walks on the graph • Closely related to the celebrated Google PageRank™.

Personalized PageRank

Personalized PageRank • Idea: perform a very long random walk (starting from v). • Ranking nodes by probability of visit assigns a similarity score to each node w.r.t. node v. • Strong community bias (this can be formalized).
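The idea can be sketched with a few lines of power iteration. In the standard formulation of Personalized PageRank, the walk jumps back to v with a fixed restart probability at every step; the value `alpha=0.15`, the iteration count, the function name, and the toy graph are assumptions for illustration. The sketch also assumes every node has at least one edge.

```python
def personalized_pagerank(graph, source, alpha=0.15, iterations=100):
    """graph: node -> {neighbor: weight}. Returns node -> visit probability."""
    scores = {node: 0.0 for node in graph}
    scores[source] = 1.0  # the walk starts at the source
    for _ in range(iterations):
        updated = {node: 0.0 for node in graph}
        updated[source] = alpha  # restart: jump back to the source
        for node, mass in scores.items():
            total = sum(graph[node].values())
            for neighbor, weight in graph[node].items():
                # otherwise follow an edge, chosen proportionally to weight
                updated[neighbor] += (1 - alpha) * mass * weight / total
        scores = updated
    return scores


# Hypothetical undirected bipartite graph (advertisers and queries).
graph = {
    "shoe_store": {"soccer shoes": 1, "running shoes": 1},
    "mega_retailer": {"soccer shoes": 1, "soccer history book": 1},
    "book_shop": {"soccer history book": 1},
    "soccer shoes": {"shoe_store": 1, "mega_retailer": 1},
    "running shoes": {"shoe_store": 1},
    "soccer history book": {"mega_retailer": 1, "book_shop": 1},
}
scores = personalized_pagerank(graph, "shoe_store")
```

In this toy graph, `mega_retailer` shares a query with `shoe_store` and receives a higher score than `book_shop`, which is reachable only through a longer path.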

Personalized PageRank • Exact computation is infeasible, O(n^3), but it can be approximated very well. • Very efficient MapReduce algorithm scaling to large graphs (hundreds of millions of nodes). However…
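The cubic cost is plausibly that of solving the defining linear system directly. In the standard formulation, with notation that is assumed here rather than taken from the slides ($P$ the row-stochastic transition matrix of the graph, $e_v$ the indicator row vector of node $v$, $\alpha \in (0,1)$ the restart probability), the Personalized PageRank vector $\pi_v$ satisfies:

```latex
\begin{align*}
\pi_v &= \alpha\, e_v + (1-\alpha)\, \pi_v P
  && \text{(stationarity of the walk with restart)} \\
\pi_v &= \alpha\, e_v \bigl(I - (1-\alpha) P\bigr)^{-1}
  && \text{(exact solution: an } n \times n \text{ inverse)} \\
\pi^{(t+1)} &= \alpha\, e_v + (1-\alpha)\, \pi^{(t)} P,
  \quad \pi^{(0)} = e_v
  && \text{(iterative approximation)} \\
\bigl\lVert \pi^{(t)} - \pi_v \bigr\rVert_1 &\le 2\,(1-\alpha)^{t}
  && \text{(geometric convergence)}
\end{align*}
```

Each iteration costs time proportional to the number of edges, and the error shrinks by a factor of $1-\alpha$ per step, which is why a good approximation is far cheaper than the exact inverse.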

Algorithmic Bottleneck • Our graphs are simply too big (billions of nodes) even for large-scale systems. • MapReduce is not real-time. • We cannot precompute the results for all subsets of categories (exponential time!).

1st idea: Tackling Real Graph Structure • Data size is the main bottleneck. • Compressing the graph would speed up the computation.

1st idea: Tackling Real Graph Structure • Step 1: from the graph of advertisers and queries (advertisers A, B; queries a to g) to a graph with only advertisers (A, B). • Step 2: ranking of the entire graph (B, A, g, d, a, e, b, c, f in the figure).

1st idea: Tackling Real Graph Structure Theorem: the ranking computed is the correct Personalized PageRank on the entire graph. Based on results from the mathematical theory of Markov chain state aggregation (Simon and Ando, ’61; Meyer ’89, etc.).
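The slides do not give the construction behind the theorem. As a loose illustration of the state-aggregation idea only, the sketch below collapses a bipartite graph into an advertiser-only Markov chain by composing two steps of the walk (advertiser to query, then query to advertiser). It ignores the restart step and is not the authors' algorithm; the function name and the toy data are hypothetical.

```python
from collections import defaultdict


def two_step_advertiser_graph(adv_to_queries):
    """adv_to_queries: advertiser -> {query: weight}.

    Returns advertiser -> {advertiser: probability}, the transition
    probabilities of a walk observed only on the advertiser side.
    """
    # Invert the graph: query -> {advertiser: weight}.
    query_to_advs = defaultdict(dict)
    for advertiser, queries in adv_to_queries.items():
        for query, weight in queries.items():
            query_to_advs[query][advertiser] = weight

    compressed = {}
    for advertiser, queries in adv_to_queries.items():
        adv_total = sum(queries.values())
        row = defaultdict(float)
        for query, weight in queries.items():
            query_total = sum(query_to_advs[query].values())
            for other, back_weight in query_to_advs[query].items():
                # P(advertiser -> query) * P(query -> other)
                row[other] += (weight / adv_total) * (back_weight / query_total)
        compressed[advertiser] = dict(row)
    return compressed


# Hypothetical bipartite graph with unit weights.
adv_to_queries = {
    "shoe_store": {"soccer shoes": 1, "running shoes": 1},
    "mega_retailer": {"soccer shoes": 1, "soccer history book": 1},
    "book_shop": {"soccer history book": 1},
}
compressed = two_step_advertiser_graph(adv_to_queries)
```

The compressed chain has one state per advertiser, so with far fewer advertisers than queries it is much smaller than the original graph while still describing where the walk lands on the advertiser side.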

Algorithmic Bottleneck • Our graphs are too big (billions of nodes) even for large-scale systems. • MapReduce is not real-time. • We cannot precompute the results for all subsets of categories (exponential time!).
