This talk presents fundamental issues in social network analysis and key challenges in data management, igniting a desire for research in the field. The topics cover defining tasks achievable by data management systems, studying necessary system properties, and addressing problems driven by network structure and content. Some key problems discussed include centrality, link prediction, community detection, and information diffusion. Various centrality measures such as degree, betweenness, closeness, and clique overlap are explored. The presentation highlights the difficulties in computing centrality from a database perspective and introduces efficient algorithms for computing betweenness centrality. This engaging discussion uncovers the complexities and advancements in social network data management.
Data Management for Social Networking Sara Cohen Hebrew University of Jerusalem PODS 2016, San Francisco, USA
About this Talk • Focus: Breadth-wise • No background knowledge assumed • Take-aways: • Fundamental issues in social network analysis • Key challenges related to data management • Hopefully, the burning desire to do research on social network data management!
What is a Social Network? [Figure: seven networks with their audience sizes: 1.1 billion, 310 million, 255 million, 250 million, 120 million, 110 million and 100 million EUMV] EUMV = Estimated Unique Monthly Visitors (April 1, 2016)
What is a Social Network? • Examples: Research Collaborations, Emails, Diseases and Gene Associations, Disease Spread, Movies and Actors [Figure sources: ccsb.dfci.harvard.edu, aminer.org, cambridge-intelligence.com, thesisthomasdemoor.wordpress.com, ai.arizona.edu] • May Have/Be: • Typed Nodes • (Un)directed edges • Hyper-edges • Multi-partite • Rich attributes
Questions to Keep in Mind: Is Social Network Data Management … • the same as Graph Data Management? • the same as Social Network Analysis? • a problem completely solved by Industry Giants? I will try to convince you that the answer to all three questions is “No”!
Topics • Define tasks that must be effectively achievable by the data management system • Study necessary system properties to achieve the above tasks • Problems driven by the network structure • Problems driven by the network content and structure • Systems for social-network data management
Some Key Problems • Centrality: To what degree is a given node central to the network? • Link Prediction: Which edges not currently in the network are most likely to form? • Community Detection: How can the nodes be clustered into natural or useful groups? • Information Diffusion: How does information diffuse over the network?
Centrality [Katz 1953] [Sabidussi 1966] [Freeman 1979] [Borgatti+ 2006] • Centrality is a measure of the importance of a node, i.e., how central it is to the network • It can be measured in different ways, depending on context • In practice, one may want to combine several methods • It may require a (cheap) local computation or a (very expensive) global computation
Centrality Measures • Degree Centrality: Number of neighbors of u • Betweenness Centrality: Proportion of shortest paths between all pairs of nodes that traverse u
• Closeness Centrality: Distance of u to all other nodes • Clique Overlap Centrality: Number of maximal cliques (size ≥ 3) in which u participates • Others: Rooted PageRank, Katz, Eigenvector...
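As a rough illustration of the cheap/local versus expensive/global contrast, the sketch below computes degree centrality (a local lookup) and closeness centrality (one full BFS per node) on a toy undirected graph. The adjacency-dict representation, the function names and the normalization (number of reachable nodes divided by the sum of their distances, one common convention) are assumptions of this sketch, not something the slides prescribe.

```python
from collections import deque


def degree_centrality(adj, u):
    """Local measure: the number of neighbors of u."""
    return len(adj[u])


def closeness_centrality(adj, u):
    """Global measure: reachable nodes divided by the sum of BFS distances from u."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    total = sum(dist.values())
    # An isolated node reaches nobody; report 0.0 rather than dividing by zero.
    return (len(dist) - 1) / total if total else 0.0


# Hypothetical toy graph: the path a - b - c - d.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
print(degree_centrality(adj, "b"), closeness_centrality(adj, "b"))
```

Degree only touches the adjacency list of u, whereas closeness has to visit every reachable node, which is why the latter is the kind of computation that becomes costly on large networks.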
Difficulties from a Database Perspective • No clear winner: how should a user’s query be answered? • Computations can be global and expensive: contrast this with a typical SQL query! • A huge and dynamic network makes values change often: a small change in the graph can affect the scores of all nodes! • Many research challenges!! Some examples of work on computing centrality follow
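Computing Betweenness Centrality • Bellman criterion: a node $v$ lies on a shortest path between $s$ and $t$ if and only if $d(s,t) = d(s,v) + d(v,t)$ • Two steps for computing: • Compute length and number of shortest paths between all pairs • Sum over all related pairs • Time: $O(n^3)$ for the summation over all pairs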
Computing Betweenness Centrality • [Brandes 2001] Efficient algorithm for sparse graphs, with time $O(nm)$ on unweighted graphs ($O(nm + n^2 \log n)$ on weighted graphs). Avoids summing over all pairs, for each node, by observing that partial sums obey a recursive relation. • [Bader+2006, Edmonds+2010] Parallel, distributed computation • [Arge+2013] Efficient computation in external memory • [Hayashi+2015] Dynamic maintenance of betweenness centrality over massive networks
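A minimal sketch of the idea attributed to [Brandes 2001], for unweighted, undirected graphs: one BFS per source counts shortest paths, and dependencies are then accumulated in reverse BFS order instead of being summed over all pairs. The graph representation and function name are assumptions of this sketch, and the scores are left unnormalized (weighted path counts rather than a proportion).

```python
from collections import deque


def betweenness(adj):
    """Brandes-style betweenness for an unweighted, undirected graph
    given as a dict mapping each node to the set of its neighbors."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Phase 1: BFS from s, counting shortest paths (sigma) and predecessors.
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Phase 2: accumulate dependencies from the farthest nodes back to s.
        delta = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # In an undirected graph every pair is seen from both endpoints.
    return {v: score / 2 for v, score in bc.items()}


# Hypothetical toy graph: the path a - b - c - d.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
scores = betweenness(adj)
print(scores)
```

On the path graph the two inner nodes each lie on two shortest paths between other pairs, while the endpoints lie on none.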
Enumerating Cliques • Clique overlap centrality computes the number of maximal cliques in which a node participates • Cliques can also be used to find communities, and in many other applications • A lot of past research on algorithms for enumerating all maximal cliques [Bron&Kerbosch73, Johnson+88, Tomita+06] • Note that counting maximal cliques is #P-complete
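One way to make clique overlap centrality concrete is the sketch below: a Bron-Kerbosch enumeration with pivoting (in the spirit of the cited algorithms, not a reproduction of any one paper), followed by a count of the maximal cliques of size at least 3 that contain a node. The function names and the toy graph are assumptions. Worst-case running time is exponential, since a graph can have exponentially many maximal cliques.

```python
def maximal_cliques(adj):
    """Yield every maximal clique as a frozenset (Bron-Kerbosch with pivoting)."""

    def expand(r, p, x):
        # r: current clique, p: candidates that extend r, x: already-explored nodes.
        if not p and not x:
            yield frozenset(r)
            return
        # Pivot on the node with the most neighbors among the candidates.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    yield from expand(set(), set(adj), set())


def clique_overlap_centrality(adj, u, min_size=3):
    """Number of maximal cliques of size >= min_size that contain u."""
    return sum(1 for c in maximal_cliques(adj) if u in c and len(c) >= min_size)


# Hypothetical toy graph: two triangles {a, b, c} and {c, d, e} sharing node c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d", "e"},
    "d": {"c", "e"},
    "e": {"c", "d"},
}
print(clique_overlap_centrality(adj, "c"))
```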
Enumerating Clique Relaxations • In reality, cliques are often overly restrictive • Within a community, not all pairs are friends • Real-life friends may miss social-network links • Missing links due to measurement incompleteness/imprecision • Various relaxations of cliques have been proposed [Seidman+78, Pattillo+13], e.g., • k-plex (every node can be “missing” at most k neighbors), • s-cliques (nodes are at distance at most s from one another) • Recent work has focused on efficient enumeration of clique relaxations [Wu+07, C+2015]
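The k-plex relaxation can be checked directly from its usual formalization: a node set S is a k-plex if every node in S is adjacent to at least |S| - k nodes of S (the node itself counts as one of the k allowed misses, so a 1-plex is a clique). The helper below is a hypothetical illustration on a toy graph; it verifies a given set rather than enumerating k-plexes.

```python
def is_k_plex(adj, nodes, k):
    """True if every node in `nodes` is adjacent to at least len(nodes) - k of them."""
    s = set(nodes)
    return all(len(adj[v] & s) >= len(s) - k for v in s)


# Hypothetical toy graph: the 4-cycle a - b - c - d - a (no diagonals).
adj = {
    "a": {"b", "d"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"a", "c"},
}
print(is_k_plex(adj, "abcd", 1), is_k_plex(adj, "abcd", 2))
```

The 4-cycle is not a clique, because each node misses its diagonal partner, but it does qualify as a 2-plex.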
Some Key Problems • Centrality: To what degree is a given node central to the network? • Link Prediction: Which edges not currently in the network are most likely to form? • Community Detection: How can the nodes be clustered into natural or useful groups? • Information Diffusion: How does information diffuse over the network?
Link Prediction • Link prediction is the problem of determining, for a given node v, which nodes currently not connected to v are likely to form such a connection • As before, it can be measured in different ways, depending on context • In practice, one may want to combine several methods • As before, it may require a (cheap) local computation or a (very expensive) global computation
Link Prediction Functions • Common Neighbors: How many neighbors are common to u and v, versus total number of neighbors • Adamic-Adar: Normalize common neighbor value by their popularity
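To make the two functions concrete, here is a small sketch on a toy graph. It reads "common neighbors versus total number of neighbors" as the Jaccard ratio and uses the usual Adamic-Adar form, in which each common neighbor z contributes 1 / log(degree(z)); both readings, and all names, are assumptions of this sketch.

```python
import math


def common_neighbors_ratio(adj, u, v):
    """Shared neighbors of u and v divided by all neighbors of either (Jaccard)."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def adamic_adar(adj, u, v):
    """Sum over common neighbors z of 1 / log(degree(z)); popular z count less."""
    return sum(
        1 / math.log(len(adj[z]))
        for z in adj[u] & adj[v]
        if len(adj[z]) > 1  # guard against log(1) = 0
    )


# Hypothetical toy graph: u and v share neighbors x (degree 2) and y (degree 3).
adj = {
    "u": {"x", "y"},
    "v": {"x", "y"},
    "x": {"u", "v"},
    "y": {"u", "v", "z"},
    "z": {"y"},
}
print(common_neighbors_ratio(adj, "u", "v"), adamic_adar(adj, "u", "v"))
```

Both scores only inspect the neighborhoods of u and v, so they fall on the cheap, local side of the local/global divide.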
• Katz: Collect a score for each path from u to v, weighted by its length • Hitting Time: Expected time for a random walk starting at u to reach v.
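A truncated version of the Katz score can be sketched as follows: count the walks (paths that may revisit nodes) of each length from u to v and damp longer ones by a factor beta per edge. The parameter names beta and max_len, and the cut-off at max_len (the full measure sums over all lengths), are assumptions made to keep the example finite.

```python
def katz_score(adj, u, v, beta=0.1, max_len=5):
    """Sum of beta**length * (number of walks of that length from u to v),
    for lengths 1..max_len."""
    walks = {u: 1}  # number of walks of the current length from u to each node
    score = 0.0
    for length in range(1, max_len + 1):
        extended = {}
        for node, count in walks.items():
            for w in adj[node]:
                extended[w] = extended.get(w, 0) + count
        walks = extended
        score += beta ** length * walks.get(v, 0)
    return score


# Hypothetical toy graph: the triangle a - b - c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
print(katz_score(adj, "a", "b", beta=0.5, max_len=3))
```

Unlike the neighborhood-based scores, this one looks beyond the immediate neighbors of u and v, and the work grows with the number of walk lengths considered.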
• Reliability: Given independent link and node failures, what is the probability that a path from u to v exists? • Weighted distance: What is the shortest path from u to v?
Difficulties from a Database Perspective • Same as before, but at a larger scale; need to compute values • No clear winner: how should a user’s query be answered? • Computations can be global and expensive: contrast this with a typical SQL query! • A huge and dynamic network makes values change often: a small change in the graph can affect the scores of all nodes! • Many research challenges!! Some examples of work on choosing a link prediction metric follow
Choosing a Link Prediction Function: Machine Learning • [Liben-Nowell+2007] Comparison of effectiveness on a variety of domains • [Kashima+2009] Semi-supervised learning for link prediction • [Hasan+2006] Supervised learning for link prediction • [Li+2014] Deep learning for link prediction
Choosing a Link Prediction Function: Axiomatic Approach • The axiomatic approach defines the behavior of a function by axioms over “simple” instances • axioms are used to extrapolate behavior over general graphs • axioms thereby characterize functions • Goal is to gain understanding of the underlying principles that the function assumes • Used in the past for: • social choice (from preferences to ranking) • Web page ranking (from links to ranking)
Example Axiom Templates for Link Prediction [C+2015] • Pair graph axiom: f satisfies the Pair-graph axiom if for a graph G with only two nodes u, v: Strength of relationship depends on weight of edge and nodes • -sink (source) series axiom: f satisfies the -sink (source) series axiom if whenever G is decomposable into G1, G2 with a single shared node w that is a sink in G1 (source in G2): Strength of relationship can be determined by considering each of the sub-graphs on their own
Results of Axiomatization [C+2015] • Let G be a graph with weight 1 on vertices and on edges • Katz: there is a single link prediction function f satisfying the axioms PairGraph, -SameAlternatives, -sink-series, -source-series, In-Split, Same-Out-Split, path-Relevance • We prove characterization results for four link prediction functions: Katz, Hitting time, Weighted distance, Reliability.
Topics • Problems driven by the network structure • Problems driven by the network content and structure • Systems for social-network data management
Some Key Problems • Social search: How can a social network be leveraged to better search a corpus? (Example: Tom, AKA Man with Toothache, searches for “Dentist”) • Social Querying: How can we query a social network with a highly expressive language? (Example: “Dentist that treats at least 2 of my friends, but does not treat ‘Bad Teeth Bill’?”) • Team Formation: Given a set of skills, how can we find a group of people with the skills who can work well together? (Example: PODS PC 2016, http://research.microsoft.com/)
Social Search • Social search is the problem of leveraging a social network to improve the results of searching a corpus • Problem has many flavors, e.g., • Improving Web search • Search for a person who can answer a question • Finding people with a given name • Finding people for a given context and user
Social Search: Improving Document Search [Bao+2007] [Carmel+2009] [Yin+2010] [Figure: Bob searches for “San Francisco Attractions”; edges labeled created, liked, commented on] • Take into consideration • Relevance of document to query (standard IR) • Importance of document (standard IR) • Importance of document creator (centrality?) • Relationship of creator / those that like document to query issuer (link prediction?) • …
Social Search: Search for a Person Who Can Answer a Question [Horowitz+2010] [Figure: Bob asks about “San Francisco Attractions”] • Take into consideration • User-provided list of expertise • Information extracted from profile • Data provided by friends
Social Search: Finding People with a Given Name [Vieira+2007] [Figure: Bob searches for “Alice?”; the network contains several nodes named Alice] • Importance (centrality?) • Relevance (link prediction?) • Note that even ranking by shortest path is difficult in practice, as it is too expensive to pre-compute and store all-pairs-shortest-paths
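Since precomputing all-pairs shortest paths is too expensive, one simple alternative is to run a single BFS from the querying user at query time and rank the name matches by distance. The sketch below assumes hypothetical integer node ids and a names dictionary; it ignores importance and returns only reachable matches.

```python
from collections import deque


def rank_by_distance(adj, names, user, query_name):
    """Return ids of nodes named `query_name`, nearest to `user` first."""
    dist = {user: 0}
    queue = deque([user])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    matches = [v for v in dist if v != user and names[v] == query_name]
    # Ties are broken by node id so that the ranking is deterministic.
    return sorted(matches, key=lambda v: (dist[v], v))


# Hypothetical toy network: Bob (1) knows Carol (2) and one Alice (4);
# a second Alice (3) is a friend of Carol; a third Alice (5) is disconnected.
adj = {1: {2, 4}, 2: {1, 3}, 3: {2}, 4: {1}, 5: set()}
names = {1: "Bob", 2: "Carol", 3: "Alice", 4: "Alice", 5: "Alice"}
ranked = rank_by_distance(adj, names, 1, "Alice")
print(ranked)
```

A single BFS is linear in the size of the reachable graph, which is still a global computation on a large network but avoids storing quadratically many distances.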
Social Search: Finding People for Given User + Context [C+2013] [C+2015] • Example: Who should I collaborate with on: Frequent Subgraph Mining Of Probabilistically Extracted Data? • User issues a query for a person • Result should be relevant both to the query (IR methods?) and to the user (link prediction?) • Studied in two contexts: • Collaboration prediction • Email recipient prediction (only use neighbors in prediction!)
Some Key Problems • Social search: How can a social network be leveraged to better search a corpus? • Social Querying: How can we query a social network with a highly expressive language? • Team Formation: Given a set of skills, how can we find a group of people with the skills who can work well together?
Social Querying • Social querying differs from social search in the expressiveness of the user language. • Rich and expressive language as opposed to keyword search • Developing an appropriate language is a huge challenge! • as well as analyzing the expressive power • and then efficient evaluation for such a language…
Social Querying: Graph Query Languages [Mendelzon+1989] [Wood2012] [Barceló2013] [Libkin+2013] … Many more! [Figure: example graph pattern with nodes Bob and San Francisco, friend-of edges, and the path expression (lives-in|visiting).locatedIn*] • Often, graph patterns, with regular expressions over edges/variables • May have/allow • Inversion • Negation • Path finding • Aggregation • Also, XPath (extensions)
Social Querying: SoQL [Ronen+2009]
SELECT PATH, COUNT(PATH.nodes.*)
FROM PATH (Bob TO X)
WHERE X.attending = ‘PODS 2016’ and X.worksIn = ‘U. of California Berkeley’
  ATMOST 0 IN PATH.nodes SATISFY (worksIn = ‘U. of Alaska’)
  and COUNT(PATH.nodes.*) <= 4
• SQL-style • Returns people, paths, groups • Predicates/aggregation over paths/groups
Social Querying: SNQL, BiQL [Martín+2011] [Dries+2012] • SNQL: • Based on GraphLog + Second order tuple generating dependencies • Allows querying and creation • Pattern matching, negation, transitive closure… • BiQL: • Can define context over which query is evaluated • External calls to data mining primitives (e.g., clustering)
Social Querying: Natural Queries, Yet Difficult to Formulate • Filtered Link Prediction: For every PODS participant p, find the top-5 people currently not friends with p who are most likely to want to form such a friendship • Filtered Centrality: Find a SIGMOD participant who is an expert in crowdsourcing
Social Querying: Natural Queries, Yet Difficult to Formulate • Information Diffusion: Find 5 people who will be most effective in spreading information about an “opening for a PostDoc” • Team Formation + Community Detection: Find 10 people in the SIGMOD/PODS community to form a Think Tank
Social Querying: Natural Queries, Yet Difficult to Formulate • Natural? Based on standard notions of social network analysis • Difficult to Formulate? • Not clear which implementation of an SNA primitive to choose • Difficult (impossible?) to formulate primitives in standard query languages • Without built-in SNA primitives, even expressible queries will be very inefficient to implement
Social Querying: Vision of New Language Constructs [C+2013]
SELECT n, m
FROM NODE n, NODE m
WHERE n.at = ‘PODS 2016’ and m.at = ‘PODS 2016’ and not friends({n,m})
ORDER BY n, sim({n,m})
• Built-in social network functions, e.g., • imp(v): measures the importance of v • sim(V): measures the similarity of the nodes in V to one another, … • Machine learning to choose between implementations, and to tune parameters of implementations • Special constructs exist as database extensions for other contexts, e.g., spatial databases
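To give a feel for what such a query would compute, the sketch below evaluates the example by brute force in Python: all ordered pairs of nodes at the venue that are not yet friends, sorted by n and then by similarity. The choice of common-neighbor count for sim, the descending order on similarity, and all data are assumptions for illustration; the slide leaves the implementation of sim open.

```python
from itertools import permutations


def common_neighbor_count(adj, n, m):
    """Hypothetical stand-in for sim({n, m}): number of shared neighbors."""
    return len(adj[n] & adj[m])


def filtered_link_prediction(adj, attrs, venue, sim):
    """Pairs (n, m) that are both at `venue` and not friends,
    ordered by n and then by decreasing sim."""
    rows = [
        (n, m)
        for n, m in permutations(sorted(adj), 2)
        if attrs[n].get("at") == venue
        and attrs[m].get("at") == venue
        and m not in adj[n]
    ]
    rows.sort(key=lambda pair: (pair[0], -sim(adj, pair[0], pair[1])))
    return rows


# Hypothetical data: ann, bob and cat attend the venue; dan and eve do not.
adj = {
    "ann": {"dan", "eve"},
    "bob": {"dan", "eve"},
    "cat": {"dan"},
    "dan": {"ann", "bob", "cat"},
    "eve": {"ann", "bob"},
}
attrs = {
    "ann": {"at": "PODS 2016"},
    "bob": {"at": "PODS 2016"},
    "cat": {"at": "PODS 2016"},
    "dan": {"at": "elsewhere"},
    "eve": {"at": "elsewhere"},
}
result = filtered_link_prediction(adj, attrs, "PODS 2016", common_neighbor_count)
print(result)
```

The brute-force evaluation examines every pair of nodes, which hints at why built-in, optimized primitives would matter for queries of this kind.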
Challenges • Defining a query language that is expressive enough for natural social network queries • Comparing the expressive power of such a language with previously considered graph query languages • Query containment • Incremental maintenance • [FILL IN with all other standard QL problems]