Graph OLAP: Towards Online Analytical Processing on Graphs

Chen Chen, Xifeng Yan, Feida Zhu, Jiawei Han, Philip S. Yu • University of Illinois at Urbana-Champaign • IBM T. J. Watson Research Center • University of Illinois at Chicago

Outline • Motivation • Framework • Efficient Computation • Experiments • Conclusion

Online Analytical Processing • Jim Gray, 1997 • OLAP as a powerful analytical tool

The Usefulness of OLAP • Multi-dimensional • Different perspectives • Multi-level • Different granularities • Can we offer roll-up/drill-down and slice/dice on graph data? • Traditional OLAP cannot handle this, because it ignores the links among data objects

The Prevalence of Graphs • Chemical compounds, computer vision objects, circuits, XML • Especially various information networks • Biological networks • Bibliographic networks • Social networks • World Wide Web (WWW)

Applications • WWW • >= 3 billion nodes, >= 50 billion arcs • Facebook • >= 100 million active users • Combining topological structures and node/edge attributes • Great challenge to view and analyze them • We propose Graph OLAP to tackle this issue

Scenario #1 • A bibliographic network • The collaboration patterns among researchers for SIGMOD 2004

Scenario #2

Outline • Motivation • Framework • Data Model • Two types of Graph OLAP • Dimension, Measure and OLAP operations • Efficient Computation • Experiments • Conclusion

Data Model • We have a collection of network snapshots $\mathcal{G} = \{\mathcal{G}_1, \mathcal{G}_2, \ldots, \mathcal{G}_N\}$ • Each snapshot $\mathcal{G}_i = (I_{1,i}, I_{2,i}, \ldots, I_{k,i}; G_i)$ • $I_{1,i}, I_{2,i}, \ldots, I_{k,i}$ are $k$ informational attributes describing the snapshot as a whole • $G_i = (V_i, E_i)$ is an attributed graph, with attributes attached to its nodes $V_i$ and edges $E_i$ • Since $G_1, G_2, \ldots, G_N$ only represent different observations of a network, $V_1, V_2, \ldots, V_N$ actually correspond to the same set of objects
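The data model above can be written down as a small data structure. This is a minimal sketch, not the authors' implementation: the class name `Snapshot`, the helper `make_snapshot`, the attribute names (`venue`, `year`, `institute`) and the toy values are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    """One network snapshot: informational attributes plus an attributed graph."""
    info: dict                                       # Info-Dims describing the whole snapshot
    node_attrs: dict = field(default_factory=dict)   # node -> attribute dict (Topo-Dims)
    edges: dict = field(default_factory=dict)        # frozenset({u, v}) -> edge attribute (a weight)


def make_snapshot(info, node_attrs, weighted_edges):
    """Build a snapshot from (u, v, weight) triples of an undirected graph."""
    edges = {frozenset((u, v)): w for u, v, w in weighted_edges}
    return Snapshot(info=info, node_attrs=node_attrs, edges=edges)


# All snapshots observe the same set of objects (hypothetical researchers A, B, C).
nodes = {"A": {"institute": "X"}, "B": {"institute": "X"}, "C": {"institute": "Y"}}
s1 = make_snapshot({"venue": "SIGMOD", "year": 2004}, nodes, [("A", "B", 2), ("A", "C", 1)])
s2 = make_snapshot({"venue": "VLDB", "year": 2004}, nodes, [("A", "B", 1), ("B", "C", 3)])
snapshots = [s1, s2]
```

Storing each undirected edge under a `frozenset` key avoids having to normalize the order of its endpoints.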

Two Types of OLAP • Informational OLAP (abbr. I-OLAP) • Topological OLAP (abbr. T-OLAP)

Informational OLAP • Dimensions come from informational attributes attached at the whole snapshot level, so-called Info-Dims • e.g., scenario #1

I-OLAP Characteristics • Overlay multiple pieces of information • Do not change the objects whose interactions are being looked at • In the underlying snapshots, each node is a researcher • In the summarized view, each node is still a researcher
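An I-OLAP roll-up can be sketched as an overlay of snapshots that agree on the retained Info-Dims. The function name `i_olap_rollup`, the choice of summed edge weights (collaboration frequency) as the measure, and the toy data are assumptions for illustration only.

```python
from collections import Counter, defaultdict


def i_olap_rollup(snapshots, group_by):
    """Overlay snapshots that share the same values on the `group_by` Info-Dims.

    Each snapshot is a pair (info, edges), where `edges` maps an edge to its
    weight. The nodes stay the same; only edge weights are summed.
    """
    cells = defaultdict(Counter)
    for info, edges in snapshots:
        key = tuple(info[dim] for dim in group_by)
        cells[key].update(edges)  # Counter.update adds weights edge by edge
    return {key: dict(graph) for key, graph in cells.items()}


snapshots = [
    ({"venue": "SIGMOD", "year": 2003}, {("A", "B"): 2, ("A", "C"): 1}),
    ({"venue": "SIGMOD", "year": 2004}, {("A", "B"): 1, ("B", "C"): 3}),
    ({"venue": "VLDB", "year": 2004}, {("A", "C"): 4}),
]

by_venue = i_olap_rollup(snapshots, ["venue"])  # cuboid [venue, all-years]
apex = i_olap_rollup(snapshots, [])             # cuboid [all-venues, all-years]
```

In both results every node is still a researcher; only the number of overlaid snapshots per cell changes.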

Topological OLAP • Dimensions come from the node/edge attributes inside individual networks, so-called Topo-Dims • e.g., scenario #2

T-OLAP Characteristics • Zoom in/Zoom out • Network topology changed: “generalized” nodes and “generalized” edges • In the underlying network, each node is a researcher • In the summarized view, each node becomes an institute that comprises multiple researchers
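A T-OLAP roll-up, by contrast, changes the node set. The sketch below merges researchers into institutes and sums the weights of the edges between them; the function name `t_olap_rollup`, the `keep_internal` flag and the toy mapping are hypothetical.

```python
from collections import Counter


def t_olap_rollup(edges, node_to_group, keep_internal=False):
    """Merge nodes into generalized nodes and edges into generalized edges.

    `edges` maps (u, v) to a weight; `node_to_group` maps each node to its
    value on the Topo-Dim being rolled up (here: researcher -> institute).
    """
    generalized = Counter()
    for (u, v), weight in edges.items():
        gu, gv = node_to_group[u], node_to_group[v]
        if gu == gv and not keep_internal:
            continue  # collaboration inside one institute is dropped
        generalized[tuple(sorted((gu, gv)))] += weight
    return dict(generalized)


edges = {("A", "B"): 2, ("A", "C"): 1, ("B", "C"): 3, ("C", "D"): 5}
institute = {"A": "X", "B": "X", "C": "Y", "D": "Y"}

between = t_olap_rollup(edges, institute)
with_internal = t_olap_rollup(edges, institute, keep_internal=True)
```

Whether intra-group edges are kept as self-loops or dropped is a design choice that the slides leave open, so it is exposed here as a parameter.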

Measures in Graph OLAP • Measure is an aggregated graph • I-aggregated graph • T-aggregated graph • Other measures like node count, average degree, etc. can be treated as derived • Graph plays a dual role • Data source • Aggregate measure
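To illustrate how numeric measures can be treated as derived from the aggregated graph, the following sketch computes node count and average degree from an edge-weight map. The function name `derived_measures` is hypothetical, and the sketch assumes that only nodes with at least one edge are counted.

```python
def derived_measures(edges):
    """Derive simple numeric measures from an aggregated undirected graph.

    `edges` maps (u, v) to a weight. Isolated nodes do not appear in the
    edge map, so they are not counted (an assumption of this sketch).
    """
    nodes = {node for edge in edges for node in edge}
    n, m = len(nodes), len(edges)
    return {
        "node_count": n,
        "edge_count": m,
        "average_degree": 2 * m / n if n else 0.0,
    }


aggregated = {("A", "B"): 3, ("A", "C"): 1, ("B", "C"): 3}
measures = derived_measures(aggregated)
```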

Generality of the Framework • Measures could be complex • e.g., maximum flow, shortest path, centrality • Combine I-OLAP and T-OLAP into a hybrid case

Graph OLAP Operations

Outline • Motivation • Framework • Efficient Computation • Measure classification • Optimizations • Constraint pushing • Experiments • Conclusion

Two Categories of Strategies • Top-down • Generalized cells later • How to combine and leverage intermediate results? • Bottom-up • Generalized cells first • How to early-stop?

Measure Classification • How to combine and leverage intermediate results? • Distributive • The computation of high-level cells can be directly built on low-level cells • Algebraic • Not distributive, but can be easily derived from several distributive measures • Holistic • Neither distributive nor algebraic

Examples • Distributive: collaboration frequency • Use distributiveness to drive computation up the cuboid lattice • Algebraic: maximum flow • Will prove later • Semi-distributive • Holistic: centrality • Need to go down to the raw data and start from scratch
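The distributive case can be checked on a toy example: the apex cell computed from the lower-level cuboid [venue, all-years] equals the apex cell computed from the raw snapshots. The helper names `overlay` and `by_venue_cells` and the data are assumptions for illustration.

```python
from collections import Counter


def overlay(graphs):
    """Sum edge weights over a collection of graphs (collaboration frequency)."""
    total = Counter()
    for graph in graphs:
        total.update(graph)
    return dict(total)


def by_venue_cells(raw):
    """Compute the cuboid [venue, all-years] from raw (venue, year) snapshots."""
    venues = {venue for venue, _year in raw}
    return {
        venue: overlay(g for (v, _y), g in raw.items() if v == venue)
        for venue in venues
    }


raw = {
    ("SIGMOD", 2003): {("A", "B"): 2, ("A", "C"): 1},
    ("SIGMOD", 2004): {("A", "B"): 1, ("B", "C"): 3},
    ("VLDB", 2003): {("A", "C"): 4},
    ("VLDB", 2004): {("A", "B"): 3, ("B", "C"): 1},
}

cells = by_venue_cells(raw)
apex_from_cells = overlay(cells.values())  # built on low-level cells
apex_from_raw = overlay(raw.values())      # built on raw snapshots
```

A holistic measure such as centrality offers no such shortcut: it would have to be recomputed on the overlaid graph itself.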

Optimizations • Special measures may have special properties that can help optimize the calculations • We discuss two of them here, with regard to I-OLAP • Localization • Attenuation

Localization • During computation, only a neighborhood of the networks needs to be consulted • e.g., the collaboration frequency of “R. Agrawal” and “R. Srikant” for [sigmod, all-years] only depends on their collaboration frequencies in each SIGMOD conference • Perfect (i.e., 0-neighborhood) localization • k-neighborhood is less ideal, but still useful • e.g., # of common friends shared by “R. Agrawal” and “R. Srikant”
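Both localization examples can be sketched with snapshots stored as adjacency maps, so that each query touches only the entries of the two nodes involved. The function names and the toy data are hypothetical; "common friends" is interpreted here as common neighbors in the overlaid network.

```python
def pair_frequency(snapshots, u, v):
    """0-neighborhood: only the (u, v) entry of each snapshot is consulted."""
    return sum(adj.get(u, {}).get(v, 0) for adj in snapshots)


def common_friends(snapshots, u, v):
    """1-neighborhood: only the neighbor lists of u and v are consulted."""
    friends_u, friends_v = set(), set()
    for adj in snapshots:
        friends_u.update(adj.get(u, {}))
        friends_v.update(adj.get(v, {}))
    return (friends_u & friends_v) - {u, v}


# Each snapshot maps node -> {neighbor: collaboration frequency}.
snapshots = [
    {"A": {"B": 2, "C": 1}, "B": {"A": 2}, "C": {"A": 1}},
    {"A": {"B": 1}, "B": {"A": 1, "C": 3}, "C": {"B": 3}},
]

freq_ab = pair_frequency(snapshots, "A", "B")
shared = common_friends(snapshots, "A", "B")
```

In the toy data, C is a common friend of A and B only in the overlay: the A-C edge comes from the first snapshot and the B-C edge from the second.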

Attenuation • Consider the transporting capability (i.e., maximum flow) from source S to destination T • Multiple transportation networks, each operated by a separate company • With regard to I-OLAP, each network is a “snapshot”, and overlaying more than one snapshot means sharing link capacities among companies

Attenuation • Data graph C • Node: cities • Edge: capacity of a link • Measure graph F • Node: cities • Edge: when maximum flow is transmitted, the quantity that passes through a link

Attenuation • Maximum flow is algebraic • F can be derived from C • Just run the maximum flow algorithm • The capacity graph C is obviously distributive • Lemma • Let $F$ be a flow in $C$ and let $C_F$ be its residual graph, where residual means that $C_F = C - F$; then $F'$ is a maximum flow in $C_F$ if and only if $F + F'$ is a maximum flow in $C$

Attenuation • Consider two snapshots that are overlaid • Maximum flows $F_1$, $F_2$ already calculated from $C_1$, $C_2$ • Without attenuation • Compute the overall maximum flow $F$ from $C_1 + C_2$ • With attenuation • Take $F_1 + F_2$ as the basis • Compute the residual maximum flow $F'$ from $(C_1 - F_1) + (C_2 - F_2)$, and augment it onto $F_1 + F_2$ • Thus, the input attenuates from $C_1 + C_2$ to $(C_1 + C_2) - (F_1 + F_2)$, which substantially reduces the computation
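The attenuation idea can be checked on a toy instance. The sketch below uses a plain Edmonds-Karp routine as a stand-in for whatever maximum flow algorithm is actually used; the names `combine` and `max_flow`, the integer capacities, and the two tiny networks are assumptions for illustration. Flows are stored as net flows (the flow on a reverse arc is the negative of the forward flow), so that the residual graph is literally C - F.

```python
from collections import defaultdict, deque


def combine(a, b, sign=1):
    """Return a + sign * b for arc-weighted graphs stored as {(u, v): value}."""
    out = defaultdict(int, a)
    for arc, x in b.items():
        out[arc] += sign * x
    return {arc: x for arc, x in out.items() if x != 0}


def max_flow(cap, s, t):
    """Edmonds-Karp; returns (value, net_flow) with net_flow[(v, u)] == -net_flow[(u, v)]."""
    res = defaultdict(int, cap)  # residual capacities, updated in place
    adj = defaultdict(set)
    for u, v in cap:
        adj[u].add(v)
        adj[v].add(u)
    value = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:  # BFS for a shortest augmenting path
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent and res[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            break
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(res[arc] for arc in path)
        for u, v in path:
            res[(u, v)] -= push
            res[(v, u)] += push
        value += push
    flow = {arc: cap.get(arc, 0) - res[arc] for arc in list(res)}
    return value, {arc: f for arc, f in flow.items() if f != 0}


# Two hypothetical transportation companies between source S and destination T.
C1 = {("S", "A"): 3, ("A", "T"): 2, ("A", "B"): 1}
C2 = {("S", "B"): 2, ("B", "T"): 3}

v1, F1 = max_flow(C1, "S", "T")
v2, F2 = max_flow(C2, "S", "T")

# Without attenuation: start from scratch on C1 + C2.
v_direct, _ = max_flow(combine(C1, C2), "S", "T")

# With attenuation: only the residual (C1 + C2) - (F1 + F2) is searched.
residual = combine(combine(C1, C2), combine(F1, F2), sign=-1)
v_extra, F_extra = max_flow(residual, "S", "T")
v_attenuated = v1 + v2 + v_extra
```

On this instance the per-company flows already carry 4 of the 5 units, so the residual run only has to find the single remaining path S → A → B → T, which becomes usable once the two companies share links.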

Constraint Pushing • Iceberg graph cube • Partial materialization • Satisfying some interestingness requirement • Push the constraints • Anti-monotone • e.g., maximum flow $|f| \geq \delta_{|f|}$ • Monotone • e.g., diameter $d \geq \delta_d$

OLAP a Bibliographic Network • We get the coauthorship data from DBLP • Measure • Information Centrality • Two Info-Dims • Area • Database (DB): PODS/SIGMOD/VLDB/ICDE/EDBT • Data Mining (DM): ICDM/SDM/KDD/PKDD • Information Retrieval (IR): SIGIR/WWW/CIKM • Time

OLAP a Bibliographic Network

Efficiency • A test that computes maximum flow as the measure • Synthetically generated flow networks • Details in the paper, with each “snapshot” representing an individual player in the transportation industry • Like the Multi-Way method, calculate low-level cells before merging them into high-level ones • Two variants are compared • One takes advantage of the attenuation heuristic • The other does not

Efficiency

Conclusion • We propose a Graph OLAP framework to perform multi-dimensional, multi-level analysis on network data • Measure is an aggregated graph • Informational/Topological dimensions lead to I-OLAP, T-OLAP

Conclusion • Mainly focusing on I-OLAP, we discuss how a graph cube can be efficiently computed and materialized • Measure classification: distributive, algebraic, holistic • Optimizations: localization, attenuation • Constraint pushing

Future Work • Technical issues for T-OLAP • Selective drilling and discovery-driven InfoNet-OLAP

Thank You!
