
Data Management for Social Networking

Data Management for Social Networking. Sara Cohen, Hebrew University of Jerusalem. About this Talk. Focus: Breadth-wise. No background knowledge assumed. Take-aways: Fundamental issues in social network analysis; key challenges related to data management.


Presentation Transcript


  1. Data Management for Social Networking Sara Cohen Hebrew University of Jerusalem PODS 2016, San Francisco, USA

  2. About this Talk • Focus: Breadth-wise • No background knowledge assumed • Take-aways: • Fundamental issues in social network analysis • Key challenges related to data management • Hopefully, the burning desire to do research on social network data management!

  3. What is a Social Network? [Figure: logos of social networking sites with their audience sizes: 1.1 Billion, 310 Million, 255 Million, 250 Million, 120 Million, 110 Million, and 100 Million EUMV] EUMV = Estimated Unique Monthly Visitors (April 1, 2016)

  4. What is a Social Network? [Figure: example networks: Research Collaborations, Emails, Diseases and Gene Associations, Disease Spread, Movies and Actors; image sources: ccsb.dfci.harvard.edu, aminer.org, cambridge-intelligence.com, thesisthomasdemoor.wordpress.com, ai.arizona.edu] • May Have/Be: • Typed Nodes • (Un)directed edges • Hyper-edges • Multi-partite • Rich attributes

  5. Questions to Keep in Mind: Is Social Network Data Management … • the same as Graph Data Management? • the same as Social Network Analysis? • a problem completely solved by Industry Giants? I will try to convince you that the answer to all three questions is “No”!

  6. Topics • Define tasks that must be effectively achievable by the data management system • Study necessary system properties to achieve the above tasks • Problems driven by the network structure • Problems driven by the network content and structure • Systems for social-network data management

  7. Some Key Problems • Centrality: To what degree is a given node central to the network?

  8. Some Key Problems • Centrality: To what degree is a given node central to the network? • Link Prediction: Which edges not currently in the network are most likely to form?

  9. Some Key Problems • Centrality: To what degree is a given node central to the network? • Link Prediction: Which edges not currently in the network are most likely to form? • Community Detection: How can the nodes be clustered into natural or useful groups?

  10. Some Key Problems • Centrality: To what degree is a given node central to the network? • Link Prediction: Which edges not currently in the network are most likely to form? • Community Detection: How can the nodes be clustered into natural or useful groups? • Information Diffusion: How does information diffuse over the network?

  11. Some Key Problems • Centrality: To what degree is a given node central to the network? • Link Prediction: Which edges not currently in the network are most likely to form? • Community Detection: How can the nodes be clustered into natural or useful groups? • Information Diffusion: How does information diffuse over the network?

  12. Centrality [Katz 1953] [Sabidussi 1966] [Freeman 1979] [Borgatti+ 2006] • Centrality is a measure of the importance of a node, i.e., how central it is to the network • Can be measured in different ways, depending on context • In practice, one may want to combine several methods • May require a (cheap) local computation, or a (very expensive) global computation

  13. Centrality Measures • Degree Centrality: Number of neighbors of u • Betweenness Centrality: Proportion of shortest paths between all pairs of nodes that traverse u

  14. Closeness Centrality: Distance of u to all other nodes • Clique Overlap Centrality: Number of maximal cliques (size ≥ 3) in which u participates • Others: Rooted PageRank, Katz, Eigenvector...
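
The two cheapest measures above, degree and closeness, can be sketched in a few lines. This is only an illustration, not code from the talk: the graph is assumed to be unweighted and stored as a dict mapping each node to a set of neighbors, and closeness is normalized as the reciprocal of the average distance to reachable nodes, which is one common convention among several.

```python
from collections import deque


def degree_centrality(adj, u):
    """Degree centrality: the number of neighbors of u (a local computation)."""
    return len(adj[u])


def bfs_distances(adj, source):
    """Hop distances from source to every reachable node (unweighted BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def closeness_centrality(adj, u):
    """Closeness: reciprocal of the average distance from u to reachable nodes.

    Assumed normalization; other variants divide by the total number of nodes.
    """
    dist = bfs_distances(adj, u)
    total = sum(d for v, d in dist.items() if v != u)
    return (len(dist) - 1) / total if total > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical toy graph: the path a - b - c.
    path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
    print(degree_centrality(path, "b"), closeness_centrality(path, "b"))
```

Degree needs only the adjacency list of u, while closeness already requires a full traversal from u, which illustrates the local versus global cost gap mentioned on slide 12.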

  15. Difficulties from a Database Perspective • No clear winner • How to answer a user’s query? • Computations can be global and expensive • Contrast this with a typical SQL query! • A huge and dynamic network makes values change often • A small change in the graph can affect the scores of all nodes! • Many research challenges!! • Some examples of work on computing centrality

  16. Computing Betweenness Centrality • Bellman criterion: • Two steps for computing: • Compute length and number of shortest paths between all pairs • Sum all related pairs • Time:
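
The formulas on this slide are not preserved in the transcript. As a reminder of the standard definitions (taken from the general literature on betweenness, in particular [Brandes 2001], and not recovered from the slide itself), with d(s,t) the shortest-path distance and σ_st the number of shortest s-t paths:

```latex
% Bellman criterion: v lies on a shortest path from s to t if and only if
\[
  d(s,t) = d(s,v) + d(v,t)
\]
% Number of shortest s-t paths through v, and the resulting betweenness:
\[
  \sigma_{st}(v) =
  \begin{cases}
    \sigma_{sv}\,\sigma_{vt} & \text{if } d(s,t) = d(s,v) + d(v,t) \\
    0 & \text{otherwise}
  \end{cases}
  \qquad
  C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}
\]
```

Summing these pair dependencies explicitly for every node takes Θ(n³) time and Θ(n²) space for a graph with n nodes, which is the cost that the algorithms on the next slide try to avoid.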

  17. Computing Betweenness Centrality • [Brandes 2001] Efficient algorithm for sparse graphs with time Avoids summing up all pairs for each node by observing that partial sums obey a recursive relation. • [Bader+2006, Edmonds+2010] Parallel, distributed computation • [Arge+2013] Efficient computation in external memory • [Hayashi+2015] Dynamic maintenance of betweenness centrality over massive networks
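
A minimal sketch of the recursive accumulation idea from [Brandes 2001], for unweighted graphs only. It is an illustration written for this transcript, not the author's code: the graph is assumed to be a dict of adjacency lists, and scores are summed over ordered pairs (s, t), so on an undirected graph every unordered pair is counted twice. For unweighted graphs this approach runs in O(nm) time.

```python
from collections import deque


def brandes_betweenness(adj):
    """Betweenness of every node via one BFS plus back-propagation per source."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        order = []                       # nodes in order of non-decreasing distance
        preds = {v: [] for v in adj}     # predecessors on shortest paths from s
        sigma = {v: 0 for v in adj}      # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:          # w discovered for the first time
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:   # edge (v, w) lies on a shortest path
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back toward s:
        # delta[v] = sum over successors w of sigma[v]/sigma[w] * (1 + delta[w]).
        delta = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc


if __name__ == "__main__":
    # Hypothetical toy graph: the path a - b - c.
    print(brandes_betweenness({"a": ["b"], "b": ["a", "c"], "c": ["b"]}))
```

The inner accumulation replaces the explicit sum over all pairs, which is why no all-pairs distance matrix is ever materialized.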

  18. Enumerating Cliques • Clique overlap centrality computes the number of maximal cliques in which a node participates • Cliques can also be used to find communities, and have many other applications • A lot of past research on algorithms for enumerating all maximal cliques [Bron&Kerbosch73, Johnson+88, Tomita+06] • Note that counting maximal cliques is #P-complete
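
To make the enumeration concrete, here is a small sketch of the basic Bron-Kerbosch recursion (without the pivoting refinements of later work), followed by clique overlap centrality as defined on slide 14. The dict-of-sets graph representation and the function names are assumptions made for illustration.

```python
def maximal_cliques(adj):
    """Enumerate all maximal cliques with the basic Bron-Kerbosch recursion.

    adj maps each node to the set of its neighbors (undirected, no self-loops).
    """
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates that extend r, x: already explored.
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def clique_overlap_centrality(adj, u, min_size=3):
    """Number of maximal cliques of size >= min_size that contain u."""
    return sum(1 for c in maximal_cliques(adj) if u in c and len(c) >= min_size)


if __name__ == "__main__":
    # Hypothetical toy graph: two triangles sharing node c, plus a pendant edge e - f.
    g = {
        "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d", "e"},
        "d": {"c", "e"}, "e": {"c", "d", "f"}, "f": {"e"},
    }
    print(sorted(sorted(c) for c in maximal_cliques(g)))
    print(clique_overlap_centrality(g, "c"))
```

The number of maximal cliques can be exponential in the number of nodes, so this sketch is only suitable for small or sparse graphs.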

  19. Enumerating Clique Relaxations • In reality, cliques are often overly restrictive • Within a community, not all pairs are friends • Real-life friends may miss social-network links • Missing links due to measurement incompleteness/imprecision • Various relaxations of cliques have been proposed [Seidman+78, Pattillo+13], e.g., • k-plex (every node can be “missing” at most k neighbors), • s-cliques (nodes are at distance at most s from one another) • Recent work has focused on efficient enumeration of clique relaxations [Wu+07, C+2015]
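
As a small illustration of a clique relaxation, the following sketch checks whether a node set is a k-plex. It assumes the usual formal convention in which every member must be adjacent to at least |S| − k members of S, so that a clique is a 1-plex; the slide's informal wording counts "missing" neighbors slightly differently, and the graph representation is again a hypothetical dict of neighbor sets.

```python
def is_k_plex(adj, nodes, k):
    """True if every node in `nodes` has at least len(nodes) - k neighbors inside it."""
    nodes = set(nodes)
    return all(len(adj[v] & nodes) >= len(nodes) - k for v in nodes)


if __name__ == "__main__":
    # Hypothetical toy graph: the 4-cycle a - b - c - d - a.
    cycle = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "a"}}
    print(is_k_plex(cycle, cycle, 1), is_k_plex(cycle, cycle, 2))
```

The 4-cycle is not a clique, but it passes the check for k = 2 because each node lacks only one of the other three members.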

  20. Some Key Problems • Centrality: To what degree is a given node central to the network? • Link Prediction: Which edges not currently in the network are most likely to form? • Community Detection: How can the nodes be clustered into natural or useful groups? • Information Diffusion: How does information diffuse over the network?

  21. Link Prediction • Link prediction is the problem of determining, for a given node v, which nodes currently not connected to v are likely to form such a connection • As before, can be measured in different ways, depending on context • In practice, one may want to combine several methods • As before, may require a (cheap) local computation, or a (very expensive) global computation

  22. Link Prediction Functions • Common Neighbors: How many neighbors are common to u and v, versus the total number of neighbors • Adamic-Adar: Normalize the contribution of each common neighbor by its popularity
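
The local link prediction functions above can be sketched as follows. This is an illustration under assumptions, not the talk's definitions verbatim: the graph is a dict of neighbor sets, the "versus total number of neighbors" variant is interpreted as the Jaccard coefficient, and Adamic-Adar uses the common 1/log(degree) weighting.

```python
import math


def common_neighbors(adj, u, v):
    """Number of neighbors shared by u and v."""
    return len(adj[u] & adj[v])


def jaccard(adj, u, v):
    """Shared neighbors relative to all neighbors of u or v (assumed reading)."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def adamic_adar(adj, u, v):
    """Each shared neighbor z contributes 1 / log(degree(z)): popular nodes count less."""
    return sum(
        1.0 / math.log(len(adj[z]))
        for z in adj[u] & adj[v]
        if len(adj[z]) > 1  # guard against log(1) = 0
    )


if __name__ == "__main__":
    # Hypothetical toy graph.
    g = {
        "u": {"a", "b"}, "v": {"a", "b", "c"},
        "a": {"u", "v"}, "b": {"u", "v", "x"},
        "c": {"v"}, "x": {"b"},
    }
    print(common_neighbors(g, "u", "v"), jaccard(g, "u", "v"), adamic_adar(g, "u", "v"))
```

All three functions touch only the neighborhoods of u and v, so they fall on the cheap, local side of the local versus global distinction made on slide 21.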

  23. Katz: Collect score for each path from u to reach v weighted by length u u v v Hitting Time: Expected time for a random walk starting at u to reach v. PODS 2016, San Francisco, USA

  24. Reliability: Given independent link & node failures, what is the probability of a path from u to v? • Weighted distance: What is the length of the shortest path from u to v?
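
A hedged sketch of both functions follows. Weighted distance uses Dijkstra's algorithm; reliability is estimated by sampling and, as a simplification of the slide's definition, only models independent edge failures with a single assumed probability `fail_prob` (node failures are ignored). The graph is assumed to be undirected and stored as a dict mapping each node to a dict of neighbor weights.

```python
import heapq
import random


def weighted_distance(wadj, u, v):
    """Length of the shortest weighted path from u to v (Dijkstra), or inf if none."""
    dist = {u: 0.0}
    heap = [(0.0, u)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == v:
            return d
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nbr, w in wadj[node].items():
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return float("inf")


def reliability(wadj, u, v, fail_prob, trials=2000, seed=0):
    """Monte Carlo estimate of P(u and v stay connected) under independent edge failures."""
    rng = random.Random(seed)
    edges = sorted({tuple(sorted((a, b))) for a in wadj for b in wadj[a]})
    hits = 0
    for _ in range(trials):
        alive = {e for e in edges if rng.random() >= fail_prob}
        seen, stack = {u}, [u]
        while stack:  # depth-first search over surviving edges
            node = stack.pop()
            for nbr in wadj[node]:
                if nbr not in seen and tuple(sorted((node, nbr))) in alive:
                    seen.add(nbr)
                    stack.append(nbr)
        hits += v in seen
    return hits / trials


if __name__ == "__main__":
    # Hypothetical weighted triangle.
    g = {"a": {"b": 1, "c": 5}, "b": {"a": 1, "c": 1}, "c": {"a": 5, "b": 1}}
    print(weighted_distance(g, "a", "c"))
    print(reliability(g, "a", "c", fail_prob=0.3))
```

Exact reliability computation is hard in general, which is why the sketch falls back on sampling.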

  25. Difficulties from a Database Perspective • Same as before, but at a larger scale, need to compute values • No clear winner • How to answer a user’s query? • Computations can be global and expensive • Contrast this with a typical SQL query! • A huge and dynamic network makes values change often • A small change in the graph can affect the scores of all nodes! • Many research challenges!! • Some examples of work on choosing a link prediction metric

  26. Choosing a Link Prediction Function: Machine Learning • [Liben-Nowell+2007] Comparison of effectiveness on a variety of domains • [Kashima+2009] Semi-supervised learning for link prediction • [Hasan+2006] Supervised learning for link prediction • [Li+2014] Deep learning for link prediction

  27. Choosing a Link Prediction Function: Axiomatic Approach • The axiomatic approach defines the behavior of a function by axioms over “simple” instances • axioms are used to extrapolate behavior over general graphs • axioms thereby characterize functions • Goal is to gain understanding of the underlying principles that the function assumes • Used in the past for: • social choice (from preferences to ranking) • Web page ranking (from links to ranking)

  28. Example Axiom Templates for Link Prediction [C+2015] • Pair graph axiom: f satisfies the Pair-graph axiom if for graph G with only two nodes u, v: Strength of relationship depends on weight of edge and nodes • -sink (source) series axiom: f satisfies the -sink (source) series axiom if whenever G is decomposable to G1, G2 with a single shared node w that is a sink in G1 (source in G2): Strength of relationship can be determined by considering each of the sub-graphs on their own

  29. Results of Axiomatization [C+2015] Let G be a graph with weight 1 on vertices and β on edges • Katz: there is a single link prediction function f satisfying axioms PairGraph, -SameAlternatives, -sink-series, -source-series, In-Split, Same-Out-Split, path-Relevance • We prove characterization results for four link prediction functions: Katz, Hitting time, Weighted distance, Reliability.

  30. Topics • Problems driven by the network structure • Problems driven by the network content and structure • Systems for social-network data management

  31. Some Key Problems [Figure: Tom, AKA “Man with Toothache”, searching for “Dentist”] • Social search: How can a social network be leveraged to better search a corpus?

  32. Some Key Problems [Figure: Tom, AKA “Man with Toothache”, searching for “Dentist”; query: Dentist that treats at least 2 of my friends, but does not treat “Bad Teeth Bill”?] • Social search: How can a social network be leveraged to better search a corpus? • Social Querying: How can we query a social network with a highly expressive language?

  33. Some Key Problems [Figure: PODS PC 2016; image source: http://research.microsoft.com/] • Social search: How can a social network be leveraged to better search a corpus? • Social Querying: How can we query a social network with a highly expressive language? • Team Formation: Given a set of skills, how can we find a group of people with the skills who can work well together?

  34. Some Key Problems • Social search: How can a social network be leveraged to better search a corpus? • Social Querying: How can we query a social network with a highly expressive language? • Team Formation: Given a set of skills, how can we find a group of people with the skills who can work well together?

  35. Social Search • Social search is the problem of leveraging a social network to improve the results of searching a corpus • The problem has many flavors, e.g., • Improving Web search • Searching for a person who can answer a question • Finding people with a given name • Finding people for a given context and user

  36. Social Search: Improving Document Search [Bao+2007] [Carmel+2009] [Yin+2010] [Figure: document “San Francisco Attractions” and user Bob, with edges labeled created, liked, commented on] • Take into consideration • Relevance of document to query (standard IR) • Importance of document (standard IR) • Importance of document creator (centrality?) • Relationship of creator / those that like document to query issuer (link prediction?) • …

  37. Social Search: Search for a Person who Can Answer a Question [Horowitz+2010] [Figure: Bob asking about “San Francisco Attractions”] • Take into consideration • User-provided list of expertise • Information extracted from profile • Data provided by friends

  38. Social Search: Finding People with a Given Name [Vieira+2007] [Figure: Bob searching for “Alice”, with several users named Alice] • Importance (centrality?) • Relevance (link prediction?) • Note that even ranking by shortest path is difficult in practice, as it is too expensive to pre-compute and store all-pairs-shortest-paths

  39. Social Search: Finding People for Given User + Context [C+2013] [C+2015] Who should I collaborate with on: Frequent Subgraph Mining Of Probabilistically Extracted Data? • User issues a query for a person • Result should be relevant both to the query (IR methods?) and to the user (link prediction?) • Studied in two contexts: • Collaboration prediction • Email recipient prediction

  40. Social Search: Finding People for Given User + Context [C+2013] [C+2015] • User issues a query for a person • Result should be relevant both to the query (IR methods?) and to the user (link prediction?) • Studied in two contexts: • Collaboration prediction • Email recipient prediction (only use neighbors in prediction!)

  41. Some Key Problems • Social search: How can a social network be leveraged to better search a corpus? • Social Querying: How can we query a social network with a highly expressive language? • Team Formation: Given a set of skills, how can we find a group of people with the skills who can work well together?

  42. Social Querying • Social querying differs from social search in the expressiveness of the user language. • Rich and expressive language as opposed to keyword search • Developing an appropriate language is a huge challenge! • as well as analyzing the expressive power • and then efficient evaluation for such a language…

  43. Social Querying: Graph Query Languages [Mendelzon+1989] [Wood2012] [Barceló2013] [Libkin+2013] … Many more! [Figure: graph pattern starting at Bob with edges friend-of, ¬friend-of, friend-of, and the path expression (lives-in|visiting).locatedIn* leading to San Francisco] • Often, graph patterns, with regular expressions over edges/variables • May have/allow • Inversion • Negation • Path finding • Aggregation • Also, XPath (extensions)

  44. Social Querying: SoQL [Ronen+2009] SELECT PATH, COUNT(PATH.nodes.*) FROM PATH (Bob TO X) WHERE X.attending = ‘PODS 2016’ and X.worksIn = ‘U. of California Berkeley’ ATMOST 0 IN PATH.nodes SATISFY (worksIn = ‘U. of Alaska’) and COUNT(PATH.nodes.*) <= 4 • SQL-style • Return people, paths, groups • Predicates/Aggregation over paths/groups

  45. Social Querying: SNQL, BiQL [Martín+2011] [Dries+2012] • SNQL: • Based on GraphLog + Second order tuple generating dependencies • Allows querying and creation • Pattern matching, negation, transitive closure… • BiQL: • Can define context over which query is evaluated • External calls to data mining primitives (e.g., clustering)

  46. Social Querying: Natural Queries, Yet Difficult to Formulate • Filtered Link Prediction: For every PODS participant p, find the top-5 people currently not friends with p who are most likely to want to form such a friendship • Filtered Centrality: Find a SIGMOD participant who is an expert in crowdsourcing

  47. Social Querying: Natural Queries, Yet Difficult to Formulate • Information Diffusion: Find 5 people who will be most effective in spreading information about an “opening for a PostDoc” • Team Formation + Community Detection: Find 10 people in the SIGMOD/PODS community to form a Think Tank

  48. Social Querying: Natural Queries, Yet Difficult to Formulate • Natural? • Based on standard notions of social network analysis • Difficult to Formulate? • Not clear which implementation of an SNA primitive to choose • Difficult (impossible?) to formulate primitives in standard query languages • Without built-in SNA primitives, even expressible queries will be very inefficient to implement

  49. Social Querying: Vision of New Language Constructs [C+2013] SELECT n, m FROM NODE n, NODE m WHERE n.at = ‘PODS 2016’ and m.at = ‘PODS 2016’ and not friends({n,m}) ORDER BY n, sim({n,m}) • Built-in social network functions, e.g., • imp(v): measures the importance of v • sim(V): measures the similarity of the nodes in V to one another, … • Machine learning to choose between implementations, and to tune parameters of implementations • Special constructs exist as database extensions for other contexts, e.g., spatial databases
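
To give a feel for what such a query would compute, here is a rough procedural equivalent of the example query. It is a sketch under several labeled assumptions, not an implementation of the proposed language: `sim` is stood in for by the common-neighbors count, `friends` is a plain edge test, pairs with n = m are excluded, and within each n the results are ordered by decreasing similarity (the slide does not state a direction). The `at` attribute is a hypothetical dict from node to venue.

```python
from itertools import permutations


def friends(adj, pair):
    """Hypothetical stand-in for friends({n, m}): is there an edge between n and m?"""
    n, m = pair
    return m in adj[n]


def sim(adj, pair):
    """Hypothetical stand-in for sim({n, m}): number of common neighbors."""
    n, m = pair
    return len(adj[n] & adj[m])


def non_friend_pairs(adj, at, venue):
    """Ordered pairs (n, m) at `venue` that are not friends, sorted by n, then by sim descending."""
    rows = [
        (n, m)
        for n, m in permutations(sorted(adj), 2)
        if at.get(n) == venue and at.get(m) == venue and not friends(adj, (n, m))
    ]
    rows.sort(key=lambda row: (row[0], -sim(adj, row)))
    return rows


if __name__ == "__main__":
    # Hypothetical toy data.
    adj = {"a": {"c"}, "b": {"c"}, "c": {"a", "b"}, "d": set()}
    at = {"a": "PODS 2016", "b": "PODS 2016", "c": "SIGMOD 2016", "d": "PODS 2016"}
    print(non_friend_pairs(adj, at, "PODS 2016"))
```

Even this toy version evaluates `sim` for every candidate pair, which hints at why built-in, optimizer-aware social network functions are attractive compared with user-defined ones.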

  50. Challenges • Defining a query language that is expressive enough for natural social network queries • Comparing the expressive power of such a language with previously considered graph query languages • Query containment • Incremental maintenance • [FILL IN with all other standard QL problems]
