Further Investigations on Heat Diffusion Models Haixuan Yang Supervisors: Prof Irwin King and Prof Michael R. Lyu Term Presentation 2006
Outline • Introduction • Input Improvement – Three candidate graphs • Outside Improvement – DiffusionRank • Inside Improvement – Volume-based heat diffusion model • Summary
Introduction [Diagram: PHDC, the model proposed last year, is extended in three directions: Input Improvement (HDM on Graphs), Outside Improvement (DiffusionRank), and Inside Improvement (Volume-based HDM).]
PHDC • PHDC is a classifier motivated by • Tenenbaum et al (Science 2000) • Approximate the manifold by a KNN graph • Reduce dimension by shortest paths • Kondor & Lafferty (NIPS 2002) • Construct a diffusion kernel on an undirected graph • Apply to a large margin classifier • Belkin & Niyogi (Neural Computation 2003) • Approximate the manifold by a KNN graph • Reduce dimension by heat kernels • Lafferty & Kondor (JMLR 2005) • Construct a diffusion kernel on a special manifold • Apply to SVM
PHDC • Ideas we inherit • Local information • relatively accurate on a nonlinear manifold. • Heat diffusion on a manifold • a generalization of the Gaussian density from Euclidean space to a manifold. • heat diffuses in the same way as the Gaussian density in the ideal case where the manifold is Euclidean space. • Ideas we develop differently • Establish the heat diffusion equation on a weighted directed graph. • This broader setting enables application to ranking Web pages. • Construct a classifier directly from the solution.
Heat Diffusion Model in PHDC • Notations • Solution • Classifier (the equations from the slide are sketched below) • G is the KNN graph: connect a directed edge (j,i) if j is one of the K nearest neighbors of i. • For each class k, f(i,0) is set to 1 if data point i is labeled as class k and 0 otherwise. • Assign data point j to class q if j receives the most heat from the data in class q.
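A minimal sketch of the equations omitted from this slide, written from the definitions above; this is the standard graph heat diffusion formulation, and the exact matrix entries used in PHDC may differ:

```latex
% Heat diffusion on the KNN graph G = (V, E): node i receives heat from its
% K nearest neighbors j along the directed edges (j, i).
\frac{d f(i,t)}{dt} \;=\; \gamma \sum_{j:\,(j,i)\in E} \bigl( f(j,t) - f(i,t) \bigr),
\qquad
\mathbf{f}(t) \;=\; e^{\gamma t H}\,\mathbf{f}(0),
% where H_{ij} = 1 if (j,i) \in E,  H_{ii} = -|\{ j : (j,i) \in E \}|,
% and H_{ij} = 0 otherwise; gamma is the thermal conductivity.
```

Under this sketch, the classifier runs the diffusion once per class k, starting from the labeled indicator vector f(0) described above, and assigns an unlabeled point to the class from which it receives the most heat.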
Input Improvement • Three candidate graphs (a small construction sketch follows this list) • KNN Graph • Connect a directed edge from j to i if j is one of the K nearest neighbors of i, measured by the Euclidean distance. • SKNN-Graph • Choose the K*n/2 shortest undirected edges, which amounts to K*n directed edges. • Minimum Spanning Tree • Choose the subgraph such that it is a tree connecting all vertices and the sum of its edge weights is minimum among all such trees.
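A toy construction sketch of the KNN graph described above (the deck lists C as the development tool, so C is used here; the dataset and names such as dist and adj are illustrative, not from the original implementation):

```c
/* Hypothetical sketch: building the directed KNN graph described above. */
#include <stdio.h>
#include <math.h>
#include <string.h>

#define N 5      /* number of data points (toy example) */
#define DIM 2    /* feature dimension */
#define K 2      /* neighbors per node */

static double dist(const double a[DIM], const double b[DIM]) {
    double s = 0.0;
    for (int d = 0; d < DIM; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
    return sqrt(s);
}

int main(void) {
    double x[N][DIM] = {{0,0},{1,0},{0,1},{5,5},{6,5}};
    int adj[N][N];                       /* adj[j][i] = 1 means directed edge (j,i) */
    memset(adj, 0, sizeof adj);

    for (int i = 0; i < N; i++) {
        /* pick the K nearest neighbors j of i and add edge (j,i) */
        for (int k = 0; k < K; k++) {
            int best = -1; double bestd = 1e300;
            for (int j = 0; j < N; j++) {
                if (j == i || adj[j][i]) continue;   /* skip self and chosen ones */
                double dj = dist(x[i], x[j]);
                if (dj < bestd) { bestd = dj; best = j; }
            }
            if (best >= 0) adj[best][i] = 1;
        }
    }
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            if (adj[j][i]) printf("edge (%d,%d)\n", j, i);
    return 0;
}
```

The SKNN-Graph and the Minimum Spanning Tree would instead be built from the global list of candidate edges sorted by length (e.g., Kruskal's algorithm for the MST).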
Input Improvement • Illustration [four panels: the Manifold, the KNN Graph, the SKNN-Graph, and the Minimum Spanning Tree]
Input Improvement • Advantages and disadvantages • KNN Graph • Democratic to each node • Resulting classifier is a generalization of KNN • May not be connected • Long edges may exist while short edges are removed • SKNN-Graph • Not democratic • May not be connected • Short edges are more important than long edges • Minimum Spanning Tree • Not democratic • Long edges may exist while short edges are removed • Connection is guaranteed • Fewer parameters • Faster in training and testing
Experiments • Experimental Setup • Experimental Environments • Hardware: Nix Dual Intel Xeon 2.2GHz • OS: Linux Kernel 2.4.18-27smp (RedHat 7.3) • Developing tool: C • Data Description • 3 artificial datasets and 6 datasets from UCI • Comparison • Algorithms: Parzen window, KNN, SVM, KNN-H, SKNN-H, MST-H • Results: average of ten-fold cross validation
Experiments • Results
Conclusions • KNN-H, SKNN-H and MST-H • Candidates for the Heat Diffusion Classifier on a Graph.
Application Improvement • PageRank • Tries to find the importance of a Web page based on the link structure. • The importance of a page i is defined recursively in terms of the pages that point to it (see the formulation below). • Two problems: • The incomplete information about the Web structure. • Web pages manipulated by people for commercial interests. • About 70% of all pages in the .biz domain are spam. • About 35% of the pages in the .us domain belong to the spam category.
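The recursive definition referred to above appeared as a formula on the original slide; one standard way to write PageRank (with damping factor α and n pages) is:

```latex
PR(i) \;=\; \frac{1-\alpha}{n} \;+\; \alpha \sum_{j:\,(j,i)\in E} \frac{PR(j)}{\mathrm{outdeg}(j)}
```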
Why is PageRank susceptible to web spam? • Two reasons • Over-democratic • All pages are born equal: each page has equal voting ability, so the sum of each column is equal to one. • Input-independent • For any given non-zero initial input, the iteration converges to the same stable distribution. • The Heat Diffusion Model is a natural way to avoid these two weaknesses of PageRank • Points are not equal: some points are born with high temperatures while others are born with low temperatures. • Different initial temperature distributions give rise to different temperature distributions after a fixed time period.
DiffusionRank • On an undirected graph • Assumption: the amount of heat flowing from j to i is proportional to the heat difference between i and j. • On a directed graph • Assumption: there is extra energy imposed on the link (j,i) such that heat flows only from j to i when there is no link (i,j). • On a random directed graph • Assumption: the heat flow is proportional to the probability of the link (j,i). • The corresponding solutions are sketched after this list.
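The solutions referred to above were formulas on the original slides; a sketch of their common form, consistent with the stated assumptions (the exact entries of the diffusion matrices in the paper may differ), is:

```latex
% Generic form on all three graph types:
\frac{d\mathbf{f}(t)}{dt} = \gamma H \mathbf{f}(t)
\quad\Longrightarrow\quad
\mathbf{f}(1) = e^{\gamma H}\,\mathbf{f}(0)
% Undirected graph: H_{ij} = 1 for an edge (i,j), H_{ii} = -d_i.
% Directed graph:   edges contribute only in the direction (j,i).
% Random directed graph: edge indicators are replaced by link probabilities
% p(j,i); the resulting matrix is often written R, giving
% f(1) = e^{\gamma R} f(0).
```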
DiffusionRank • On a random directed graph • Solution as sketched above. • The initial value f(i,0) in f(0) is set to 1 if page i is trusted and 0 otherwise, with the trusted pages chosen according to inverse PageRank.
Computation consideration • Approximation of the heat kernel (see the sketch below) • How large should N be? • When N >= 30, the real eigenvalues involved are less than 0.01; when N >= 100, they are less than 0.005. • We use N = 100 in the paper. • When N tends to infinity, the approximation converges to the exact heat kernel.
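A sketch of the computation, assuming the Euler-style expansion e^{γR} ≈ (I + (γ/N)R)^N, which converges to the exact heat kernel as N tends to infinity; applying it to f(0) needs only N matrix-vector products. The 3-node matrix below is a toy example, not data from the paper:

```c
/* Hypothetical sketch: approximating f = exp(gamma*R) f0 by
   f ~ (I + (gamma/N)*R)^N f0, i.e. N repeated updates f <- f + (gamma/N)*R*f. */
#include <stdio.h>

#define M 3            /* number of nodes (toy example) */
#define NSTEPS 100     /* N in the approximation */
#define GAMMA 1.0      /* thermal conductivity */

int main(void) {
    double R[M][M] = {{-1.0, 0.5, 0.0},
                      { 1.0,-1.0, 0.0},
                      { 0.0, 0.5, 0.0}};
    double f[M] = {1.0, 0.0, 0.0};   /* initial heat: node 0 trusted */
    double tmp[M];

    for (int step = 0; step < NSTEPS; step++) {
        for (int i = 0; i < M; i++) {
            double acc = 0.0;
            for (int j = 0; j < M; j++) acc += R[i][j] * f[j];
            tmp[i] = f[i] + (GAMMA / NSTEPS) * acc;   /* f <- (I + (gamma/N)R) f */
        }
        for (int i = 0; i < M; i++) f[i] = tmp[i];
    }
    for (int i = 0; i < M; i++) printf("f[%d] = %f\n", i, f[i]);
    return 0;
}
```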
Discussion of γ • γ can be understood as the thermal conductivity. • When γ = 0, the ranking value is most robust to manipulation since no heat is diffused, but the Web structure is completely ignored. • When γ = ∞, DiffusionRank becomes PageRank and can be manipulated easily. • When γ = 1, DiffusionRank works well in practice.
DiffusionRank • Advantages • Can detect group-group relations • Can cut graphs • Anti-manipulation (γ = 0.5 or 1)
DiffusionRank • Experiments • Data: • a toy graph (6 nodes) • a medium-size real-world graph (18,542 nodes) • a large real-world graph crawled from CUHK (607,170 nodes) • Compared with TrustRank and PageRank
Results • The tendency of DiffusionRank when γ becomes larger • On the toy graph
Stability: the order difference between the ranking results an algorithm produces before it is manipulated and those it produces after the manipulation (a small sketch of one such measure follows).
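One plausible way to instantiate this order-difference measure is to count the page pairs whose relative order flips between the two rankings; the paper's exact definition may differ, so the sketch below is illustrative only:

```c
/* Hypothetical sketch: count page pairs whose relative order flips between a
   ranking before manipulation and the ranking after it. */
#include <stdio.h>

/* rank_before[i] and rank_after[i] give the position of page i in each ranking */
static long order_difference(const int *rank_before, const int *rank_after, int n) {
    long flips = 0;
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++) {
            int before = rank_before[i] - rank_before[j];
            int after  = rank_after[i]  - rank_after[j];
            if ((long)before * after < 0) flips++;   /* relative order flipped */
        }
    return flips;
}

int main(void) {
    int before[] = {0, 1, 2, 3};   /* toy example: 4 pages */
    int after[]  = {0, 2, 1, 3};
    printf("order difference = %ld\n", order_difference(before, after, 4));
    return 0;
}
```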
Conclusions • The anti-manipulation feature enables DiffusionRank to be a candidate penicillin for Web spamming. • DiffusionRank is a generalization of PageRank (when γ = ∞). • DiffusionRank can be employed to detect group-group relations. • DiffusionRank can be used to cut graphs.
Inside Improvement • Motivations • The Finite Difference (FD) method is a possible way to solve the heat diffusion equation: • the discretization of time • the discretization of space and time (an example scheme is given below)
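As a concrete example of both discretizations, the textbook one-dimensional heat equation ∂f/∂t = α ∂²f/∂x² under a forward-difference scheme (this is the classical FD scheme, not a formula taken from the presentation):

```latex
% Forward Euler in time, central differences in space:
f_i^{\,n+1} \;=\; f_i^{\,n} \;+\; \frac{\alpha\,\Delta t}{(\Delta x)^2}
\left( f_{i+1}^{\,n} - 2 f_i^{\,n} + f_{i-1}^{\,n} \right)
```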
Motivation • Problems that prevent us from employing FD directly in real data analysis: • The constructed graph is irregular; • The density of the data varies, which also results in an irregular graph; • The manifold is unknown; • The expression of the differential equation is unknown even if the manifold is known.
Volume-based Heat Diffusion Model • Assumptions • There is a small patch SP[j] of space containing node j; • The volume of the small patch SP[j] is V(j), and the heat diffusion ability of the small patch is proportional to its volume. • The temperature in the small patch SP[j] at time t is almost equal to f(j,t) because every unseen node in the patch is near node j. • Solution: see the sketch below.
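The solution referred to above appeared as a formula on the original slide; one consistent way to write it under the stated assumption that a patch's diffusion ability scales with its volume V(j) is sketched below (the exact normalization used in the paper may differ):

```latex
\frac{d f(i,t)}{dt} \;=\; \gamma \sum_{j:\,(j,i)\in E} V(j)\,\bigl( f(j,t) - f(i,t) \bigr),
\qquad
\mathbf{f}(t) \;=\; e^{\gamma t H}\,\mathbf{f}(0)
```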
Volume Computation • Define V(i) to be the volume of the hypercube whose side length is the average distance between node i and its neighbors (written out below). • A maximum likelihood estimation.
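Written out, the definition above amounts to the following, where N(i) denotes the neighbors of node i and m the dimensionality of the data (m and the notation are assumptions made here for illustration):

```latex
V(i) \;=\; \bar{d}_i^{\;m},
\qquad
\bar{d}_i \;=\; \frac{1}{|N(i)|} \sum_{j \in N(i)} \lVert x_i - x_j \rVert
```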
Experiments • Legend: • K: KNN • P: Parzen window • U: UniverSVM • L: LightSVM • C: consistency method • VHD-v: by the best v • VHD: v is found by the estimation • HD: without volume consideration • C1: 1st variation of C • C2: 2nd variation of C
Conclusions • The proposed VHDM has the following advantages: • It can model the effect of unseen points by introducing the volume of a node; • It avoids the difficulty of finding the explicit expression for the unknown geometry by approximating the manifold by a finite neighborhood graph; • It has a closed form solution that describes the heat diffusion on a manifold; • VHDC is a generalization of both the Parzen Window Approach (when the window function is a multivariate normal kernel) and KNN.
Summary • The input improvement of PHDC provides us with more choices for the input graphs. • The outside improvement provides a possible penicillin for Web spamming, and a potentially useful tool for group-group discovery and graph cutting. • The inside improvement gives us a promising classifier.