
Further Investigations on Heat Diffusion Models

Haixuan Yang. Supervisors: Prof Irwin King and Prof Michael R. Lyu. Term Presentation 2006.


Presentation Transcript


  1. Further Investigations on Heat Diffusion Models Haixuan Yang Supervisors: Prof Irwin King and Prof Michael R. Lyu Term Presentation 2006

  2. Outline • Introduction • Input Improvement – Three candidate graphs • Outside Improvement – DiffusionRank • Inside Improvement – Volume-based heat diffusion model • Summary

  3. Introduction • PHDC: the model proposed last year • Input Improvement: HDM on Graphs • Outside Improvement: DiffusionRank • Inside Improvement: Volume-based HDM

  4. PHDC • PHDC is a classifier motivated by • Tenenbaum et al (Science 2000) • Approximate the manifold by a KNN graph • Reduce dimension by shortest paths • Kondor & Lafferty (NIPS 2002) • Construct a diffusion kernel on an undirected graph • Apply to a large margin classifier • Belkin & Niyogi (Neural Computation 2003) • Approximate the manifold by a KNN graph • Reduce dimension by heat kernels • Lafferty & Kondor (JMLR 2005) • Construct a diffusion kernel on a special manifold • Apply to SVM

  5. PHDC • Ideas we inherit • Local information • relatively accurate in a nonlinear manifold. • Heat diffusion on a manifold • a generalization of the Gaussian density from Euclidean space to a manifold. • heat diffuses in the same way as the Gaussian density in the ideal case when the manifold is the Euclidean space. • Where we differ • Establish the heat diffusion equation on a weighted directed graph. • The broader setting enables its application to ranking Web pages. • Construct a classifier directly from the solution.

  6. Heat Diffusion Model in PHDC • Notations • Solution • Classifier • G is the KNN graph: add a directed edge (j,i) if j is one of the K nearest neighbors of i. • For each class k, f(i,0) is set to 1 if data point i is labeled k and 0 otherwise. • Assign data point j to label q if j receives the most heat from the data in class q.
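
The classifier steps on this slide can be sketched in Python. The transcript does not preserve the slide's notation or solution, so the heat equation used here is an assumption: node i gains heat from each in-neighbor j at a rate proportional to exp(-d(i,j)^2 / beta) * (f_j - f_i), integrated with small Euler steps. The names `knn_in_edges`, `diffuse`, `classify`, `beta`, and `steps` are illustrative only.

```python
import math

def knn_in_edges(points, k):
    """For each node i, list its k nearest neighbors j (directed edges j -> i)."""
    n = len(points)
    return [sorted((j for j in range(n) if j != i),
                   key=lambda j: math.dist(points[i], points[j]))[:k]
            for i in range(n)]

def diffuse(points, neighbors, f0, gamma=1.0, beta=1.0, steps=100):
    """Euler steps of the ASSUMED model df_i/dt = gamma * sum_j w_ji (f_j - f_i)."""
    f = list(f0)
    for _ in range(steps):
        new_f = []
        for i, nbrs in enumerate(neighbors):
            flow = sum(math.exp(-math.dist(points[i], points[j]) ** 2 / beta)
                       * (f[j] - f[i]) for j in nbrs)
            new_f.append(f[i] + gamma / steps * flow)
        f = new_f
    return f

def classify(points, labels, k=2, **kw):
    """labels[i] is a class name or None; returns one predicted class per node."""
    neighbors = knn_in_edges(points, k)
    classes = sorted({c for c in labels if c is not None})
    # One diffusion per class: labeled points of that class start at 1, others at 0.
    heat = {c: diffuse(points, neighbors,
                       [1.0 if lab == c else 0.0 for lab in labels], **kw)
            for c in classes}
    # Each node takes the class from which it received the most heat.
    return [max(classes, key=lambda c: heat[c][i]) for i in range(len(points))]

points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
labels = ["a", None, None, "b", None, None]
predicted = classify(points, labels, k=2)
```

With one labeled point in each of two well-separated clusters, every node ends up with the label of its own cluster.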

  7. Input Improvement • Three candidate graphs • KNN Graph • Add a directed edge from j to i if j is one of the K nearest neighbors of i, measured by the Euclidean distance. • SKNN-Graph • Choose the K*n/2 shortest undirected edges, which amounts to K*n directed edges. • Minimum Spanning Tree • Choose the subgraph such that • it is a tree connecting all vertices; • the sum of its edge weights is minimum among all such trees.
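
A minimal Python sketch of the three graph constructions follows. It assumes points are coordinate tuples and that edge weights are Euclidean distances; the function names are illustrative, and Prim's algorithm is just one way to obtain the minimum spanning tree.

```python
import math
from itertools import combinations

def knn_graph(points, k):
    """Directed edges (j, i): j is one of the k nearest neighbors of i."""
    n = len(points)
    edges = set()
    for i in range(n):
        nearest = sorted((j for j in range(n) if j != i),
                         key=lambda j: math.dist(points[i], points[j]))[:k]
        edges.update((j, i) for j in nearest)
    return edges

def sknn_graph(points, k):
    """The k*n/2 shortest undirected edges, stored as k*n directed edges."""
    n = len(points)
    pairs = sorted(combinations(range(n), 2),
                   key=lambda e: math.dist(points[e[0]], points[e[1]]))
    edges = set()
    for i, j in pairs[: k * n // 2]:
        edges.add((i, j))
        edges.add((j, i))
    return edges

def mst_graph(points):
    """Minimum spanning tree by Prim's algorithm, both directions of each edge."""
    n = len(points)
    in_tree = {0}
    edges = set()
    while len(in_tree) < n:
        # Cheapest edge leaving the current tree.
        i, j = min(((a, b) for a in in_tree for b in range(n) if b not in in_tree),
                   key=lambda e: math.dist(points[e[0]], points[e[1]]))
        in_tree.add(j)
        edges.add((i, j))
        edges.add((j, i))
    return edges

pts = [(0, 0), (1, 0), (2, 0), (10, 0)]
knn = knn_graph(pts, 1)
sknn = sknn_graph(pts, 1)
mst = mst_graph(pts)
```

On these four points the SKNN graph leaves the outlier at (10, 0) isolated, while the spanning tree is forced to connect it through a long edge.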

  8. Input Improvement • Illustration • Manifold • KNN Graph • SKNN-Graph • Minimum Spanning Tree

  9. Input Improvement • Advantages and disadvantages • KNN Graph • Democratic to each node • Resulting classifier is a generalization of KNN • May not be connected • Long edges may exist while short edges are removed • SKNN-Graph • Not democratic • May not be connected • Short edges are more important than long edges • Minimum Spanning Tree • Not democratic • Long edges may exist while short edges are removed • Connection is guaranteed • Fewer parameters • Faster in training and testing

  10. Experiments • Experimental Setup • Experimental Environment • Hardware: Nix Dual Intel Xeon 2.2GHz • OS: Linux Kernel 2.4.18-27smp (RedHat 7.3) • Development tool: C • Data Description • 3 artificial datasets and 6 datasets from UCI • Comparison • Algorithms: Parzen window, KNN, SVM, KNN-H, SKNN-H, MST-H • Results: average of ten-fold cross-validation

  11. Experiments • Results

  12. Conclusions • KNN-H, SKNN-H and MST-H • Candidates for the Heat Diffusion Classifier on a Graph.

  13. Application Improvement • PageRank • Tries to find the importance of a Web page based on the link structure. • The importance of a page i is defined recursively in terms of the pages that point to it. • Two problems: • Incomplete information about the Web structure. • Web pages manipulated by people for commercial interests. • About 70% of all pages in the .biz domain are spam. • About 35% of the pages in the .us domain belong to the spam category.
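
The slide's formula for the recursive definition is not preserved in the transcript. For reference, the commonly used form of the PageRank recursion is given below. The symbols are assumptions introduced here, not the slide's notation: r_i is the importance of page i, n the number of pages, d_j the out-degree of page j, E the set of links, and α the damping factor.

```latex
r_i \;=\; \frac{1-\alpha}{n} \;+\; \alpha \sum_{j \,:\, (j,i) \in E} \frac{r_j}{d_j}
```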

  14. Why is PageRank susceptible to web spam? • Two reasons • Over-democratic • All pages are born equal: every page has equal voting ability, since the sum of each column is equal to one. • Input-independent • For any given non-zero initial input, the iteration converges to the same stable distribution. • Heat Diffusion Model: a natural way to avoid these two weaknesses of PageRank • Points are not equal, as some points are born with high temperatures while others are born with low temperatures. • Different initial temperature distributions give rise to different temperature distributions after a fixed time period.
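
The contrast on this slide can be checked numerically. The sketch below uses a small hand-made column-stochastic matrix `P` (an assumption for illustration, not data from the talk). It runs the PageRank-style iteration from two different inputs, then diffuses heat for a fixed time from the same two inputs, taking the heat model to be df/dt = γ(P - I)f, which is also an assumption.

```python
def mat_vec(P, x):
    """Return P x, with P given as a list of rows."""
    return [sum(p * xj for p, xj in zip(row, x)) for row in P]

def power_iteration(P, x, iters=200):
    """Repeated x <- P x, the iteration behind PageRank."""
    for _ in range(iters):
        x = mat_vec(P, x)
    return x

def heat_after(P, f, gamma=1.0, steps=100):
    """Temperatures after a fixed time, by Euler steps of df/dt = gamma (P - I) f."""
    for _ in range(steps):
        f = [fi + gamma / steps * (pfi - fi) for fi, pfi in zip(f, mat_vec(P, f))]
    return f

# Each column sums to one: every page has the same total voting ability.
P = [[0.1, 0.5, 0.3],
     [0.6, 0.2, 0.3],
     [0.3, 0.3, 0.4]]

start_a, start_b = [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]
rank_gap = max(abs(a - b) for a, b in
               zip(power_iteration(P, start_a), power_iteration(P, start_b)))
heat_gap = max(abs(a - b) for a, b in
               zip(heat_after(P, start_a), heat_after(P, start_b)))
```

Here `rank_gap` is numerically zero, since both inputs converge to the same distribution, while `heat_gap` stays clearly positive because the finite-time temperatures still depend on where the heat started.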

  15. DiffusionRank • On an undirected graph • Assumption: the amount of heat flowing from j to i is proportional to the heat difference between i and j. • Solution: • On a directed graph • Assumption: extra energy is imposed on the link (j,i) such that heat flows only from j to i if there is no link (i,j). • Solution: • On a random directed graph • Assumption: the heat flow is proportional to the probability of the link (j,i). • Solution:

  16. DiffusionRank • On a random directed graph • Solution: • The initial value f(i,0) in f(0) is set to 1 if i is trusted and 0 otherwise, with trust determined according to the inverse PageRank.
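
A hedged sketch of the random-directed-graph case follows. The transcript does not preserve the solution formulas, so this code assumes the form f(1) = e^{γR} f(0) with R = P - I, where P is a PageRank-style transition matrix with a uniform random jump. The function names, the damping value 0.85, and the toy link list are all illustrative. The trusted set is passed in directly rather than computed from the inverse PageRank.

```python
def transition_matrix(n, links, alpha=0.85):
    """P[i][j]: probability of moving from page j to page i (columns sum to 1).

    Assumes every page has at least one out-link and a uniform random jump.
    """
    out_deg = [0] * n
    for j, _ in links:
        out_deg[j] += 1
    P = [[(1 - alpha) / n] * n for _ in range(n)]
    for j, i in links:          # link (j, i) points from page j to page i
        P[i][j] += alpha / out_deg[j]
    return P

def diffusion_rank(P, trusted, gamma=1.0, steps=100):
    """Approximate e^{gamma (P - I)} f(0) by `steps` Euler steps."""
    n = len(P)
    f = [1.0 if i in trusted else 0.0 for i in range(n)]   # f(i,0)
    for _ in range(steps):
        Pf = [sum(P[i][j] * f[j] for j in range(n)) for i in range(n)]
        f = [fi + gamma / steps * (pfi - fi) for fi, pfi in zip(f, Pf)]
    return f

# Pages 0 and 1 link to each other; pages 2 and 3 form an isolated pair.
links = [(0, 1), (1, 0), (2, 3), (3, 2)]
heat = diffusion_rank(transition_matrix(4, links), trusted={0})
```

In this toy graph the page linked from the trusted page ends up warmer than the isolated pair, which only receives heat through the random jump.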

  17. Computational considerations • Approximation of the heat kernel • N = ? • When N ≥ 30, the real eigenvalues of [expression not preserved in the transcript] are less than 0.01; • when N ≥ 100, they are less than 0.005. • We use N = 100 in the paper. • When N tends to infinity: [expression not preserved in the transcript]
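
The effect of N can be illustrated with a small experiment. It assumes the approximation meant here is the discrete form (I + γ/N R)^N f(0) of e^{γR} f(0). The 3-by-3 matrix is made up for the example, and the reference value comes from a truncated Taylor series.

```python
def mat_vec(M, x):
    """Return M x, with M given as a list of rows."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

def heat_exact(R, f, gamma=1.0, terms=40):
    """e^{gamma R} f by its Taylor series (adequate for small gamma * ||R||)."""
    result, term = list(f), list(f)
    for k in range(1, terms):
        term = [gamma / k * t for t in mat_vec(R, term)]   # (gamma R)^k f / k!
        result = [r + t for r, t in zip(result, term)]
    return result

def heat_approx(R, f, N, gamma=1.0):
    """(I + gamma/N R)^N f, the discrete approximation."""
    for _ in range(N):
        f = [fi + gamma / N * ri for fi, ri in zip(f, mat_vec(R, f))]
    return f

P = [[0.1, 0.5, 0.3],
     [0.6, 0.2, 0.3],
     [0.3, 0.3, 0.4]]
R = [[P[i][j] - (1.0 if i == j else 0.0) for j in range(3)] for i in range(3)]
f0 = [1.0, 0.0, 0.0]

exact = heat_exact(R, f0)
errors = {N: max(abs(a - e) for a, e in zip(heat_approx(R, f0, N), exact))
          for N in (10, 30, 100)}
```

On this example the largest componentwise error shrinks roughly in proportion to 1/N.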

  18. Discussion of γ • γ can be understood as the thermal conductivity. • When γ = 0, the ranking value is most robust to manipulation since no heat is diffused, but the Web structure is completely ignored. • When γ = ∞, DiffusionRank becomes PageRank, which can be manipulated easily. • When γ = 1, DiffusionRank works well in practice.
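
The two limiting cases of γ can be checked on a toy example. As before, the model e^{γ(P - I)} f(0) and the matrix `P` are assumptions made for illustration. A large finite γ stands in for γ = ∞, and the number of Euler steps grows with γ to keep the iteration stable.

```python
def diffuse(P, f, gamma):
    """Approximate e^{gamma (P - I)} f with small Euler steps."""
    steps = max(100, int(100 * gamma))   # keep each step small as gamma grows
    n = len(f)
    for _ in range(steps):
        Pf = [sum(P[i][j] * f[j] for j in range(n)) for i in range(n)]
        f = [fi + gamma / steps * (pfi - fi) for fi, pfi in zip(f, Pf)]
    return f

def stationary(P, iters=500):
    """Stable distribution of the PageRank-style iteration x <- P x."""
    n = len(P)
    x = [1.0 / n] * n
    for _ in range(iters):
        x = [sum(P[i][j] * x[j] for j in range(n)) for i in range(n)]
    return x

P = [[0.1, 0.5, 0.3],
     [0.6, 0.2, 0.3],
     [0.3, 0.3, 0.4]]
f0 = [1.0, 0.0, 0.0]

no_diffusion = diffuse(P, f0, 0.0)    # gamma = 0: the input comes back unchanged
long_run = diffuse(P, f0, 50.0)       # large gamma: the initial input is forgotten
gap_to_pagerank = max(abs(a - b) for a, b in zip(long_run, stationary(P)))
```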

  19. DiffusionRank • Advantages • Can detect group-group relations • Can cut graphs • Anti-manipulation • [Figure labels: γ = 0.5 or 1; +1; -1]

  20. DiffusionRank • Experiments • Data: • a toy graph (6 nodes) • a medium-size real-world graph (18542 nodes) • a large-size real-world graph crawled from CUHK (607170 nodes) • Comparison with TrustRank and PageRank

  21. Results • The trend of DiffusionRank as γ becomes larger • On the toy graph

  22. Anti-manipulation on the toy graph

  23. Anti-manipulation on the medium graph and the large graph

  24. Stability: the order difference between an algorithm's ranking results before and after it is manipulated

  25. Conclusions • This anti-manipulation feature makes DiffusionRank a candidate penicillin for Web spamming. • DiffusionRank is a generalization of PageRank (when γ = ∞). • DiffusionRank can be employed to detect group-group relations. • DiffusionRank can be used to cut graphs.

  26. Inside Improvement • Motivations • Finite Difference Method is a possible way to solve the heat diffusion equation. • the discretization of time • the discretization of space and time
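
For readers unfamiliar with the finite difference method, a minimal sketch of the explicit scheme for the one-dimensional heat equation u_t = α u_xx is given below. It discretizes both space and time on a regular grid, which is exactly what an irregular data graph does not provide. The function name and the fixed-boundary choice are assumptions for illustration.

```python
def heat_fd(u, alpha, dx, dt, steps):
    """Explicit finite differences for u_t = alpha * u_xx with fixed end values."""
    r = alpha * dt / dx ** 2   # the scheme is stable only when r <= 0.5
    for _ in range(steps):
        interior = [u[i] + r * (u[i - 1] - 2 * u[i] + u[i + 1])
                    for i in range(1, len(u) - 1)]
        u = [u[0]] + interior + [u[-1]]
    return u

# A single hot point in the middle of a cold rod, advanced by one time step.
after_one_step = heat_fd([0.0, 0.0, 1.0, 0.0, 0.0],
                         alpha=1.0, dx=1.0, dt=0.25, steps=1)
```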

  27. Motivation • Reasons we cannot employ FD directly in real data analysis: • The graph constructed is irregular; • The density of data varies, which also results in an irregular graph; • The manifold is unknown; • The differential equation expression is unknown even if the manifold is known.

  28. Intuition

  29. Volume-based Heat Diffusion Model • Assumptions • There is a small patch SP[j] of space containing node j; • The volume of the small patch SP[j] is V(j), and the heat diffusion ability of the small patch is proportional to its volume. • The temperature in the small patch SP[j] at time t is almost equal to f(j,t) because every unseen node in the small patch is near node j. • Solution

  30. Volume Computation • Define V(i) to be the volume of the hypercube whose side length is the average distance between node i and its neighbors. • A maximum likelihood estimation
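
The volume definition can be written out directly. In this sketch the hypercube dimension `v` is passed in as a parameter. Treating `v` as the quantity obtained by the maximum likelihood estimation is an assumption, since the transcript does not say what is being estimated. The neighbor lists are also supplied by the caller.

```python
import math

def volumes(points, neighbors, v):
    """V(i) = side ** v, where side is the average distance from i to its neighbors."""
    result = []
    for i, nbrs in enumerate(neighbors):
        side = sum(math.dist(points[i], points[j]) for j in nbrs) / len(nbrs)
        result.append(side ** v)
    return result

points = [(0, 0), (3, 4), (6, 8)]
neighbors = [[1], [0, 2], [1]]     # neighbor indices for each node
vols = volumes(points, neighbors, v=2)
```

Nodes in sparse regions have distant neighbors and therefore large volumes, which is how the model accounts for varying data density.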

  31. Experiments • K: KNN • P: Parzen window • U: UniverSvm • L: LightSVM • C: consistency method • VHD-v: by the best v • VHD: v is found by the estimation • HD: without volume consideration • C1: 1st variation of C • C2: 2nd variation of C

  32. Conclusions • The proposed VHDM has the following advantages: • It can model the effect of unseen points by introducing the volume of a node; • It avoids the difficulty of finding the explicit expression for the unknown geometry by approximating the manifold by a finite neighborhood graph; • It has a closed form solution that describes the heat diffusion on a manifold; • VHDC is a generalization of both the Parzen Window Approach (when the window function is a multivariate normal kernel) and KNN.

  33. Summary • The input improvement of PHDC provides us with more choices for the input graphs. • The outside improvement provides us with a possible penicillin for Web spamming, and a potentially useful tool for group-group discovery and graph cut. • The inside improvement shows us a promising classifier.
