490 likes | 719 Views
K. I. L. N. I. G. I. N. M. N. Jerry Scripps. Overview. What is link mining? Motivation Preliminaries definitions metrics network types Link mining techniques. What is Link Mining?. Graph Theory. Statistics. Link Mining. Data Mining. Machine Learning.
E N D
K I L N I G I N M N Jerry Scripps
Overview • What is link mining? • Motivation • Preliminaries • definitions • metrics • network types • Link mining techniques
What is Link Mining? Graph Theory Statistics Link Mining Data Mining MachineLearning Social Network Analysis Database
What is Link Mining? Examples: • Discovering communities within collaboration networks • Finding authoritative web pages on a given topic • Selecting the most influential people in a social network
Link Mining – MotivationEmerging Data Sets • World wide web • Social networking • Collaboration databases • etc.
Link Mining – MotivationDirect Applications • What is the community around msu.edu? • What are the authoritative pages? • Who has the most influence? • Who is the likely member of terrorist cell? • Is this a news story about crime, politics or business?
Link Mining – MotivationIndirect Applications • Convert ordinary data sets into networks • Integrate link mining techniques into other techniques
Preliminaries • Definitions • Metrics • Network Types • Definitions • Metrics • Network Types
Definitions Community Node (vertex, point, object) Link (edge, arc)
Metrics Network • Characteristic path length • Clustering coefficient • Min-cut Node Pair • Graph distance • Min-cut • Common neighbors • Jaccard’s coef • Adamic/adar • Pref. attachment • Katz • Hitting time • Rooted pageRank • simRank • Bibliographic metrics Node • Degree • Closeness • Betweenness • Clustering coefficient
Network Types Watts & Strogatz Small World Random Regular
Networks – Scale-free • Barabasi & Bonabeau • Degree follows a power law ~ 1/kn • Can be found in a wide variety of real-world networks
Techniques • Link-Based Classification • Link Prediction • Ranking • Influential Nodes • Community Finding • Link Completion
Include features from linked objects: building a single model on all features Fusion of link and attribute models Link-Based Classification ?
Link-Based ClassificationChakrabarti, et al. • Copying data from neighboring web pages actually reduced accuracy • Using the label from neighboring page improved accuracy 111011 111011 ? B 101011 B 101011 010010 A 010010 A A 011110 011110 A
Link-Based ClassificationLu & Getoor • Define vectors for attributes and links • Attribute data OA(X) • Link data LD(X) constructed using • mode (single feature – class of plurality) • count (feature for each class – count for neighbors) • binary (feature for each class – 0/1 if exists) 111011 ? OA (attr) LD (link) 101011 B 2 1 0 … 1 1 0 … A … 111011 … 010010 A 011110 A Model 1 Model 2 Model
Link-Based ClassificationLu & Getoor • Define probabilities for both • Attribute • Link • Class estimation:
Link-Based ClassificationSummary • Using class of neighbors improves accuracy • Using separate models for attribute and link data further improves accuracy • Other considerations: • improvements are possible by using community information • knowledge of network type could also benefit classifier
Techniques • Link-Based Classification • Link Prediction • Ranking • Influential Nodes • Community Finding • Link Completion
Link PredictionLiben-Nowell and Kleinberg Tested node-pair metrics: • Graph distance • Common neighbors • Jaccards coefficient • Adamic/adar • Preferential attachment • Katz • Hitting time • Rooted PageRank • SimRank Neighborhood Ensemble of paths
Link Prediction – summary • There is room for growth – best predictor has accuracy of only around 9% • Predicting collaborations is difficult • Finding communities could help if most collaborations are intra-community • New problem could be to predict the direction of the link
Techniques • Link-Based Classification • Link Prediction • Ranking • Influential Nodes • Community Finding • Link Completion
Ranking – Markov Chain Based • Random-surfer analogy • Problem with cycles • PageRank uses random vector
Ranking – summary • Other methods such as HITS and SALSA also based on Markov chain • Ranking has been applied in other areas: • text summarization • anomaly detection
Techniques • Link-Based Classification • Link Prediction • Ranking • Influential Nodes • Community Finding • Link Completion
Maximizing influence model-based • Problem – finding the k best nodes to activate to maximize the number of nodes activated • Models: • independent cascade – when activated a node has a one-time change to activate neighbors with prob. pij • linear threshold – node becomes activated when the percent of its neighbors crosses a threshold
Maximizing influence model-based • Models: independent cascade & linear threshold • A function f:S→S*, can be created using either model • Functions use monte-carlo, hill-climbing solution • Submodular functions, where ST are proven in another work to be NP-C but by using a hill-climbing solution can get to within 1-1/e of optimum.
Maximizing influence – cost/benefit • Assumptions: • product x sells for $100 • a discount of 10% can be offered to various prospective customers • If customer purchases profit is: • 90 if discount is offered • 100 if discount is not offered • Expected lift in profit (ELP) from offering discount is: • 90*P(buy|discount) - 100*P(buy|no discount)
Maximizing influence – cost/benefit • Goal is to find M that maximizes global ELP • Three approximations used: • single pass • greedy • hill-climbing • Xi is the decision of customer i to buy • Y is vector of product attributes • M is vector of marketing decision • f is a function to set the ith element of M • r0 and r1 are revenue gained • c is the cost of marketing
Comparison of approaches • An extension would be to spread influence to the most number of communities • Improvements can be made in speed
Techniques • Link-Based Classification • Link Prediction • Ranking • Influential Nodes • Community Finding • Link Completion
Gibson, Kleinberg and Raghavan Query Search Engine Root Set Base Set: add forward and back links Use HITS to find top 10 hubs and authorities
Reddy and Kitsuregawa • Bipartite graph • Given an initial set of nodes T build I from the nodes pointed to from T • Repeat: • use relax_cocite to expand T and I • prune T and I using dense bipartite graph function DBPG(T,I,α,β) I T u v w
Flake, Lawrence and Giles • Uses Min-cut • Start with seed set • Add linked nodes • Find nodes from outgoing links • Create virtual source node • Add virtual sink linking it to all nodes • Find the min-cut of the virtual source and sink
Neville, Adler and Jensen A 0 1 1 0 • Distance based on links and attributes • If link exists score is number of common attributes zero otherwise • score(A,B)=2, score(A,C)=1,score(B,C)=0 • Used with 3 partitioning algorithms: • Karger’s Min-Cut • MajorClust • Spectral partitioning by Shi & Malik B 1 1 0 0 C 1 1 0 1
Communities - summary • There are many options for building communities around a small group of nodes • Possible future directions • finding communities in networks having different link types • impact of network type on community finding techniques
Techniques • Link-Based Classification • Link Prediction • Ranking • Influential Nodes • Community Finding • Link Completion
Goldenberg, Kubica and Komarek Problem: given a network and n-1 members of a community find the nth • random • counting • popular • NB • NN • cGraph • BayesNet • EBS and LR
Conclusions • Link mining is a young, dynamic field of study with problem areas that continue to emerge and morph as techniques continue to evolve • Opportunities for improvements exist in • using community knowledge • using network knowledge • We are the living links in a life force that moves and plays around and through us, binding the deepest soils with the farthest stars. • Alan Chadwick
14 5 3 9 2 15 6 9 4 Ranking • Based on Markov Chain • Rank is sum of node weights from incoming links • Breaks down when cycles exist
Ranking - continued • General approach • ap = authority score for p • Bp = backlinks of p • PageRank • HITS approach • ap = authority score for p • hp = hub score for p • Bp = backlinks of p • Normalize between iterations