Fast Nearest-neighbor Search in Disk-resident Graphs Presenter: 鲁轶奇
Outline • Introduction • Background & related works • Proposed Work • Experiments
Introduction-Motivation • Graphs are becoming enormous • Streaming algorithms must take passes over the entire dataset • Other approaches perform clever preprocessing that is tied to one specific similarity measure • This paper introduces analysis and algorithms that address the scalability problem in a generalizable way: not specific to one kind of graph partitioning nor to one specific proximity measure.
Introduction-Motivation(cont.) • Real-world graphs contain high-degree nodes • A node's value is computed by combining the values of its neighbors • Whenever a high-degree node is encountered, these algorithms have to examine a much larger neighborhood, leading to severely degraded performance.
Introduction-Motivation(cont.) • Algorithms can no longer assume that the entire graph can be stored in memory • Compression techniques still leave at least three settings where they might not work • social networks are far less compressible than Web graphs • decompression might lead to an unacceptable increase in query response time • even if a graph could be compressed down to a gigabyte, it might be undesirable to keep it in memory on a machine that is running other applications
Contribution • A simple transform of the graph (turning high-degree nodes into sinks) • A deterministic local algorithm guaranteed to return nearest neighbors under personalized PageRank from the disk-resident clustered graph • A fully external-memory clustering algorithm (RWDISK) that uses only sequential sweeps over data files
Background-Personalized Pagerank • A random walk starts at node a; at any step the walk is reset to the start node with probability α • PPV(a, j): the personalized PageRank entry from a to j • A large value indicates high similarity
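For concreteness, here is a minimal Monte Carlo sketch of this restart walk (the graph format and function name are illustrative assumptions, not part of the paper):

```python
import random

def personalized_pagerank_mc(graph, a, alpha=0.1, num_walks=10000):
    """Estimate PPV(a, .) by simulating alpha-discounted random walks from a.

    graph: dict mapping node -> list of out-neighbors (assumed format).
    Each walk restarts with probability alpha per step; the normalized visit
    counts converge to the personalized PageRank vector of node a.
    """
    visits = {}
    for _ in range(num_walks):
        node = a
        while True:
            visits[node] = visits.get(node, 0) + 1
            # Restart with probability alpha, or stop at a dangling node;
            # starting a fresh walk is equivalent to teleporting back to a.
            if random.random() < alpha or not graph.get(node):
                break
            node = random.choice(graph[node])
    total = sum(visits.values())
    return {j: count / total for j, count in visits.items()}
```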
Background-Clustering • Use random-walk-based approaches to compute a good-quality local graph partition near a given anchor node • Main intuition: • A random walk started inside a low-conductance cluster will mostly stay inside the cluster • Conductance: Φ_V(A) = E(A, V\A) / min(μ(A), μ(V\A)), where E(A, V\A) counts the edges crossing the cut and μ(A) = Σ_{i∈A} degree(i)
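A small sketch of the conductance computation under the definition above, assuming an undirected graph stored as a neighbor-list dict:

```python
def conductance(graph, A):
    """Conductance of a node set A in an undirected graph (assumed format:
    dict node -> list of neighbors).
    Phi(A) = cut(A, V\\A) / min(vol(A), vol(V\\A)),
    where vol(S) is the sum of the degrees of the nodes in S.
    """
    A = set(A)
    cut = sum(1 for u in A for v in graph[u] if v not in A)
    vol_A = sum(len(graph[u]) for u in A)
    vol_rest = sum(len(graph[u]) for u in graph if u not in A)
    denom = min(vol_A, vol_rest)
    return cut / denom if denom > 0 else 0.0
```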
Proposed Work • First problem: most local algorithms for computing nearest neighbors suffer from the presence of high degree nodes. • Second issue: computing proximity measures on large disk-resident graphs. • Third issue: Finding a good clustering
Effect of high degree nodes • High-degree nodes are a performance bottleneck • Effect on personalized PageRank • Main intuition: a very high-degree node passes on only a small fraction of its value to each out-neighbor, which might not be significant enough to invest computing resources on • Claim: stopping a random walk at a high-degree node barely changes the personalized PageRank values at other nodes of relatively smaller degree
Effect of high degree nodes • The error incurred in personalized PageRank is inversely proportional to the degree of the sink node
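The transform itself is simple; below is a sketch under the assumption that the graph is stored as an out-neighbor dict (degree thresholds such as 100 or 1000 are the ones used later in the experiments):

```python
def make_high_degree_sinks(graph, degree_threshold):
    """Turn every node whose out-degree exceeds the threshold into a sink
    by deleting its outgoing edges, so random walks are absorbed there.

    graph: dict mapping node -> list of out-neighbors (assumed format).
    Returns a new graph; the original is left untouched.
    """
    return {u: ([] if len(nbrs) > degree_threshold else list(nbrs))
            for u, nbrs in graph.items()}
```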
Effect of high degree nodes • f_α(i, j) is simply the probability of hitting node j for the first time from node i, in this α-discounted walk
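For intuition, hitting probabilities and PPV are connected by a standard identity for α-discounted walks (stated here as background, not quoted from the slides):

\[
\mathrm{PPV}(i, j) \;=\; f_\alpha(i, j)\,\cdot\,\mathrm{PPV}(j, j)
\]

i.e. the PPV of j with respect to i factors into the probability of reaching j before a restart, times the PPV of j from itself; this is why the error of stopping walks at sinks can be analyzed through f_α.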
Effect of high degree nodes • The error bound extends to introducing a whole set of sink nodes
Nearest-neighbors on clustered graphs • How to use the clusters for deterministic computation of nodes "close" to an arbitrary query • Use degree-normalized personalized PageRank • For a given node i, the PPV from j to it, i.e. PPV(j, i), can be decomposed into the mass of walks that stay inside j's cluster and the mass of walks that leave the cluster and re-enter through its boundary
Assume that j and i are in the same cluster S • During the power iteration we do not have access to the previous-iteration PPV values of neighbors k outside S, so we replace them with upper and lower bounds • Lower bound: 0, i.e. we pretend that S is completely disconnected from the rest of the graph • Upper bound: a random walk from outside S has to cross the boundary of S to hit node i
Since S is small, the power method suffices • At each iteration, maintain the upper and lower bounds for the nodes within S • To expand S: bring in the clusters of the external neighbors of S • Stop expanding once the global upper bound falls below a pre-specified small threshold γ • In practice an additive slack ε is used, i.e. the comparison is against (ub_{k+1} − ε)
Ranking Step • Return all nodes whose lower bound is greater than the (k+1)-th largest upper bound • Why this is correct: all nodes outside the cluster are guaranteed to have personalized PageRank smaller than the global upper bound, which is smaller than γ
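Putting the last few slides together, here is a highly simplified sketch of the expand-and-bound search. The data formats, the crude per-node upper bounds, and the outside-S bound are assumptions for illustration, not the paper's exact algorithm (which also ranks by degree-normalized PPV rather than raw PPV):

```python
def local_nn_search(graph, clusters, query, k, alpha=0.1, gamma=1e-3, iters=200):
    """Sketch: expand a working set S around the query, compute lower/upper
    bounds on PPV(., query) inside S, stop when nodes outside S cannot matter.

    graph:    dict node -> list of out-neighbors (assumed format).
    clusters: dict node -> iterable of nodes in that node's disk cluster.
    """
    # Reverse adjacency, needed to find which nodes in S are reachable from outside.
    rev = {u: set() for u in graph}
    for u, nbrs in graph.items():
        for v in nbrs:
            rev.setdefault(v, set()).add(u)

    S = set(clusters[query])
    while True:
        lb = {u: 0.0 for u in S}    # lower bound on PPV(u, query): S treated as disconnected
        esc = {u: 0.0 for u in S}   # alpha-discounted probability of leaving S from u
        for _ in range(iters):
            new_lb, new_esc = {}, {}
            for u in S:
                out = graph.get(u, [])
                base = alpha if u == query else 0.0
                if not out:                      # sink node: the walk is absorbed
                    new_lb[u], new_esc[u] = base, 0.0
                    continue
                stay = sum(lb[v] for v in out if v in S) / len(out)
                leave = sum(1.0 for v in out if v not in S) / len(out)
                inner = sum(esc[v] for v in out if v in S) / len(out)
                new_lb[u] = base + (1 - alpha) * stay
                new_esc[u] = (1 - alpha) * (leave + inner)
            lb, esc = new_lb, new_esc
        # Crude upper bound: escaped mass could in the worst case all reach the
        # query, plus the tail mass not yet propagated after `iters` steps.
        ub = {u: lb[u] + esc[u] + (1 - alpha) ** iters for u in S}
        # A walk from outside S must first enter S through a boundary node.
        boundary = [w for w in S if rev.get(w, set()) - S]
        global_ub = (1 - alpha) * max((ub[w] for w in boundary), default=0.0)
        if global_ub < gamma:
            break
        # Expand S by pulling in the clusters of its external neighbors.
        external = {v for u in S for v in graph.get(u, []) if v not in S}
        if not external:
            break
        for v in external:
            S |= set(clusters.get(v, [v]))
    # Ranking step: keep nodes whose lower bound beats the (k+1)-th largest
    # upper bound; nodes outside S are bounded by global_ub < gamma.
    kth_ub = sorted(ub.values(), reverse=True)[k] if len(ub) > k else 0.0
    winners = [u for u in S if lb[u] >= kth_ub]
    return sorted(winners, key=lambda u: -lb[u])[:k]
```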
Clustered Representation on Disk • Intuition: use a set of anchor nodes and assign each remaining node to its "closest" anchor • Use personalized PageRank as the measure of "closeness" • Algorithm: • Start with a random set of anchors • Iteratively add new anchors from the set of unreachable nodes, and then recompute the cluster assignments • Two properties: • New anchors are far away from the existing anchors • When the algorithm terminates, each node i is guaranteed to be assigned to its closest anchor
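A sketch of this anchor-based partitioning loop; the function name, the `ppv_from` callback, and the `min_ppv` reachability cutoff are illustrative assumptions, not the paper's interface:

```python
import random

def anchor_clustering(graph, num_initial_anchors, ppv_from, min_ppv=1e-4):
    """Assign each node to the anchor with the largest PPV to it; nodes that
    no anchor reaches (PPV below min_ppv) spawn new anchors.

    ppv_from(anchor) is assumed to return a dict of PPV values from `anchor`
    to other nodes (e.g. computed by RWDISK or by simulation).
    """
    nodes = list(graph)
    anchors = random.sample(nodes, num_initial_anchors)
    while True:
        best = {}  # node -> (anchor, ppv) with the largest ppv seen so far
        for a in anchors:
            for v, p in ppv_from(a).items():
                if p >= min_ppv and p > best.get(v, (None, 0.0))[1]:
                    best[v] = (a, p)
        unreachable = [v for v in nodes if v not in best]
        if not unreachable:
            break
        # New anchors are picked among nodes no existing anchor can reach,
        # so they are automatically far from the current anchors.
        anchors.append(random.choice(unreachable))
    return {v: a for v, (a, _) in best.items()}
```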
RWDISK • 4 kinds of files • Edge file: each line represents an edge by a triplet {src, dst, p}, with p = P(X_t = dst | X_{t−1} = src) • Last file: each line in Last is {src, anchor, value}, with value = P(X_{t−1} = src | X_0 = anchor) • Newt file: Newt contains x_t; each line is {src, anchor, value}, where value = P(X_t = src | X_0 = anchor) • Ans file: represents the values of v_t; each line in Ans is {src, anchor, value}, where value accumulates the α-discounted sum Σ_t α(1−α)^{t−1} P(X_t = src | X_0 = anchor) • Algorithm computes v_t by power iterations
RWDISK(cont.) • Newt is simply the matrix-vector product between the transition matrix stored in Edges and the vector stored in Last • Files are stored lexicographically sorted, so the product can be computed by a file-join-like algorithm • First step: join the two files and accumulate the probability values arriving at each node from its in-neighbors • Next step: the Newt file is sorted and compressed, in order to add up the values from different in-neighbors • The probabilities are multiplied by α(1−α)^{t−1} before being added to Ans • The number of iterations is fixed at maxiter
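A minimal in-memory emulation of this iteration may help; the real algorithm streams lexicographically sorted files from disk, and here plain dicts stand in for the Edge/Last/Newt/Ans files (an assumption for illustration only):

```python
from collections import defaultdict

def rwdisk_emulated(edges, anchors, alpha=0.1, maxiter=30):
    """edges: list of (src, dst, p) triplets with p = P(X_t=dst | X_(t-1)=src).
    Returns Ans as a dict {(node, anchor): accumulated PPV value}.
    """
    # Out-edges grouped by src, standing in for the sorted Edge file that
    # would be swept sequentially on disk.
    out = defaultdict(list)
    for src, dst, p in edges:
        out[src].append((dst, p))

    # Last file at t = 0: each walk starts at its anchor with probability 1.
    last = {(a, a): 1.0 for a in anchors}     # {(src, anchor): value}
    ans = defaultdict(float)                   # Ans file: accumulated PPV
    for t in range(1, maxiter + 1):
        # "Join" Edges with Last, then "sort and compress": accumulate the
        # probability mass arriving at each node from its in-neighbors.
        newt = defaultdict(float)
        for (src, anchor), value in last.items():
            for dst, p in out.get(src, []):
                newt[(dst, anchor)] += p * value
        # Add the alpha-discounted contribution of step t to Ans.
        for key, value in newt.items():
            ans[key] += alpha * (1 - alpha) ** (t - 1) * value
        last = newt
    return dict(ans)
```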
One major problem is that the intermediate files can become much larger than the edge file • In most real-world networks it is possible to reach a huge fraction of the whole graph within 4-5 steps, so the intermediate files blow up • Rounding is used to keep the file sizes small
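The rounding idea, sketched under the assumption that entries below a threshold are simply dropped (the paper's exact rounding scheme and error analysis are more careful):

```python
def compress_by_rounding(newt, epsilon):
    """Drop entries whose probability mass falls below epsilon, so the Newt
    file stays sparse at the cost of a bounded approximation error.

    newt: dict {(node, anchor): value}, as produced by one RWDISK iteration.
    """
    return {key: value for key, value in newt.items() if value >= epsilon}
```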
Experiments • Dataset
Experiments(cont.) • System Detail • On an off-the-shelf PC • Least-recently-used (LRU) page replacement scheme • Page size 4KB
Experiments(cont.)-Effect of high degree nodes • Three-fold advantages: • Speeds up external-memory clustering • Reduces the number of page faults in random-walk simulation • Effect on RWDISK
Experiments(cont.)-Deterministic vs. Simulations • Computing top-10 neighbors with approximation slack 0.005 for 500 randomly picked nodes • Citeseer: original graph • DBLP: nodes with degree above 1000 turned into sinks • LiveJournal: nodes with degree above 100 turned into sinks
Experiments(cont.)-RWDISK vs. METIS • maxiter = 30, α = 0.1 and ε = 0.001 for PPV • METIS as the baseline algorithm • Breaking DBLP into 50000 parts used 20GB of RAM • Breaking LiveJournal into 75000 parts used 50GB of RAM • In comparison, RWDISK can be executed on a standard 2-4 GB PC
Experiments(cont.)-RWDISK vs. METIS • Measure of cluster quality • A good disk-based clustering must satisfy: • Low conductance • Fit in disk-sized pages