An Introduction of Big Data

An Introduction of Big Data WEB GROUP 2011.9.24

Outline • What is Big Data • The Framework of Big Data • The Applications of Big Data • The Challenges of Big Data • Research work related to Big Data • Conclusions

What is Big Data • Information Explosion • Data volume grows by 57% every year (IDC), i.e. it doubles about every 1.5 years • 988 EB (1 EB = 1024 PB) of data will be produced in 2010 (IDC), about 18 million times all the information in books • IT • 850 million photos & 8 million videos per day (Facebook) • 50 PB of web pages, 500 PB of logs (Baidu) • Telco (logs, multimedia data) • Enterprise Storage • Public Utilities • Health Care (medical images: photos) • Public Traffic (surveillance: videos)
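The two growth figures on this slide agree with each other, which a short calculation confirms. The sketch below assumes constant compound growth; the function name `doubling_time` is illustrative and not part of the original deck.

```python
import math


def doubling_time(annual_growth_rate: float) -> float:
    """Years needed for a quantity to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth_rate)


# 57% growth per year (the IDC figure quoted on the slide)
years = doubling_time(0.57)
print(f"{years:.2f} years")  # about 1.54 years, i.e. roughly every 1.5 years
```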

What is Big Data • Definition • Big data is the confluence of three trends: Big Transaction Data (structured and semi-structured data), Big Interaction Data (unstructured data) and Big Data Processing • Question: Big Data = Large-Scale Data (Massive Data)?

What is Big Data • The properties of Big Data • Huge • Distributed • Dispersed over many servers • Dynamic • Items added/deleted/modified continuously • Heterogeneous • Many agents access/update data • Noisy • Inherent • Unintentional • Malicious • Unstructured / semi-structured • No database schema • Complex interrelationships

The Framework of Big Data

The Applications of Big Data • Advertisement, finding communities, … • Inheritance, sequence of cancer, … • Celestial body, exobiology, … • Data mining, consuming habit, … • Changing router, … • SNA, finding communities, …

The Challenges of Big Data • Efficiency requirements for algorithms • Traditionally, “efficient” algorithms • Run in (small) polynomial time: O(n log n) • Use linear space: O(n) • For large data sets, efficient algorithms • Must run in linear or even sub-linear time: o(n) • Must use at most poly-logarithmic space: (log n)^2 • Mining Big Data • Association Rules and Frequent Patterns • Two parameters: support, confidence • Clustering • Distance measure (L1, L2, L∞, edit distance, etc.) • Graph structure • Social networks, degree distribution (heavy tail)
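The two association-rule parameters named above can be illustrated with a minimal sketch. It uses the standard definitions (support is the fraction of transactions containing an itemset; confidence of A → B is support(A ∪ B) / support(A)). The toy transactions and the function names are hypothetical and not taken from the deck.

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(transactions: List[FrozenSet[str]],
               antecedent: Iterable[str],
               consequent: Iterable[str]) -> float:
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    a = frozenset(antecedent)
    both = support(transactions, a | frozenset(consequent))
    ante = support(transactions, a)
    return both / ante if ante > 0 else 0.0


# Hypothetical toy data, for illustration only.
transactions = [frozenset(t) for t in (
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
)]

print(support(transactions, {"bread", "milk"}))       # 0.5
print(confidence(transactions, {"bread"}, {"milk"}))  # 0.5 / 0.75
```

Note that this scans every transaction for each query, so it is linear in the data size. It shows only what the two parameters mean, not a method that meets the sub-linear requirements listed on the slide.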

The Challenges of Big Data • Clean Big Data • Noise in data distorts • Computation results • Search results • Need automatic methods for “cleaning” the data • Duplicate elimination • Quality evaluation • Computing Model • Accuracy and Approximation • Efficiency

Computing Model of Big Data • Abstract Model of Computing • Data x (n is very large) → Computer Program → Approximation of f(x) • Examples: mean, parity • Approximation of f(x) is sufficient • Program can be randomized

Computing Model of Big Data • Random Sampling • Data (n is very large) → Computer Program → Approximation of f(x) • Query a few data items • Examples • Mean: O(1) queries • Parity: n queries
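The contrast between the two examples can be sketched as follows. This is an illustration only: it assumes the data values are bounded (here in [0, 1)), which is what makes a fixed number of random queries enough for the mean. The function names and the toy data are hypothetical.

```python
import random
from typing import Sequence


def sampled_mean(data: Sequence[float], k: int, rng: random.Random) -> float:
    """Estimate the mean from k uniformly random queries (with replacement)."""
    n = len(data)
    return sum(data[rng.randrange(n)] for _ in range(k)) / k


def parity(bits: Sequence[int]) -> int:
    """Exact parity needs every item: flipping any single bit flips the answer."""
    result = 0
    for b in bits:
        result ^= b & 1
    return result


rng = random.Random(0)                               # fixed seed for repeatability
data = [(i % 10) / 10 for i in range(100_000)]       # true mean is 0.45
estimate = sampled_mean(data, k=2_000, rng=rng)      # reads only 2,000 of 100,000 items
print(estimate)
print(parity([1, 0, 1, 1]))                          # 1
```

The number of queries for the mean depends on the desired accuracy, not on n, whereas parity cannot be approximated from a sample at all.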

Random Sampling • Advantages • Ultra-efficient • Sub-linear running time & space (could even be independent of data set size) • Disadvantages • May require random access • Doesn’t fit many problems

Computing Model of Big Data • Data Streams • Data (n is very large) → Computer Program → Approximation of f(x) • Stream through the data; use limited memory • Examples • Mean: O(1) memory • Parity: 1 bit of memory
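Both examples fit in a single sequential pass, as the following minimal sketch shows. It assumes the stream yields integers and takes "parity" to mean the parity of their sum; the function name is illustrative.

```python
from typing import Iterable, Tuple


def stream_mean_and_parity(stream: Iterable[int]) -> Tuple[float, int]:
    """One pass over the stream, keeping only a count, a running sum and one bit."""
    count = 0
    total = 0
    parity_bit = 0
    for x in stream:
        count += 1
        total += x            # O(1) state for the mean
        parity_bit ^= x & 1   # a single bit of state for the parity
    if count == 0:
        raise ValueError("empty stream")
    return total / count, parity_bit


# The generator is consumed item by item; the data is never held in memory.
mean, par = stream_mean_and_parity(i for i in range(1, 11))
print(mean, par)  # 5.5 1
```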

Data Streams • Advantages • Sequential access • Limited memory • Disadvantages • Running time is at least linear • Too restricted for some problems

Computing Model of Big Data • Sketching (n is very large) • Compress each data segment into a small “sketch” (Data1 → Sketch1, Data2 → Sketch2) • Compute over the sketches to approximate the result • Examples • Equality: O(1) size sketch • Hamming distance: O(1) size sketch • Lp distance (p > 2): on the order of n^(1-2/p) size sketch
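The equality example can be sketched with a randomized polynomial fingerprint. This is one common way to build a constant-size equality sketch, chosen here for illustration; the deck does not say which construction it has in mind. Both sides are assumed to share the random evaluation point `r`, and all names are hypothetical.

```python
import random
from typing import Tuple

P = (1 << 61) - 1  # a Mersenne prime used as the modulus


def sketch(data: bytes, r: int) -> Tuple[int, int]:
    """Constant-size sketch: (length, polynomial fingerprint of the bytes at point r)."""
    acc = 0
    for byte in data:
        acc = (acc * r + byte) % P  # Horner evaluation modulo P
    return len(data), acc


def probably_equal(sketch_a: Tuple[int, int], sketch_b: Tuple[int, int]) -> bool:
    """Equal data always gives equal sketches; unequal data rarely does."""
    return sketch_a == sketch_b


r = random.Random(1).randrange(1, P)  # shared random evaluation point
s1 = sketch(b"big data", r)
s2 = sketch(b"big data", r)
s3 = sketch(b"big date", r)
print(probably_equal(s1, s2))  # True
print(probably_equal(s1, s3))  # False
```

Two different segments of the same length n collide only when `r` is a root of their difference polynomial, which happens with probability below n / P, so only the two small sketches need to be compared rather than the segments themselves.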

Research work related to Big Data • Finding Maximal Cliques in Massive Networks by H*-Graph (SIGMOD 2010) • Large-Scale Collective Entity Matching (VLDB 2011) • Estimating Sizes of Social Networks via Biased Sampling (WWW 2011)

Finding Maximal Cliques in Massive Networks by H*-Graph • Massive graph data • Graphs are a powerful modeling tool for analyzing massive networks. • Graph data is everywhere (e.g. chemistry, biology, images, vision, social networks, the Web, etc.). • The outstanding property of graph data is its massive scale.

Finding Maximal Cliques in Massive Networks by H*-Graph • Motivation • Maximal Clique Enumeration (MCE) is very useful for analyzing massive graph data. • The best algorithms require memory space linear in the size of the input graph, which is clearly infeasible on a massive graph. • This has become a serious concern in view of the massive volume of today's fast-growing network graphs. • The Web graph has over 1 trillion webpages (Google) • Social networks have millions to billions of users (Facebook, LinkedIn) • How can MCE be performed on a massive graph?

Finding Maximal Cliques in Massive Networks by H*-Graph • Challenges & Methods • Clique: a subset of vertices such that every two vertices are connected. The clique problem is NP-complete. • Maximal clique: a clique to which no more vertices can be added. • The example graph has 5 maximal cliques • {1, 2, 5}, {2, 3}, {3, 4}, {4, 5} and {4, 6} • Because the graph is massive, the authors provide an external-memory algorithm for MCE (ExtMCE). • One critical problem must be handled. • What portion of the graph should be chosen at each recursive step, and how? • The H*-graph is a core of the graph that can stand in for the massive graph. Maximal cliques are computed only in the H*-graph, which can fit into memory. • The H*-graph is defined by the largest set of h vertices in G that each have degree at least h. • Therefore, the authors maintain and update the maximal cliques in the H*-graph.
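The two definitions on this slide can be checked on the small example. The sketch below is not ExtMCE: it is a plain in-memory Bron–Kerbosch enumeration (without pivoting) plus a direct computation of the h vertices. The edge list is an assumption, reconstructed from the five maximal cliques listed on the slide, and the function names are illustrative.

```python
from typing import Dict, List, Set


def build_adjacency(edges) -> Dict[int, Set[int]]:
    """Undirected adjacency sets from a list of (u, v) edges."""
    adj: Dict[int, Set[int]] = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def maximal_cliques(adj: Dict[int, Set[int]]) -> List[Set[int]]:
    """Bron-Kerbosch without pivoting; the whole graph must fit in memory."""
    cliques: List[Set[int]] = []

    def expand(r: Set[int], p: Set[int], x: Set[int]) -> None:
        if not p and not x:
            cliques.append(r)          # r cannot be extended: it is maximal
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}                # v is done as a candidate
            x = x | {v}                # and must be excluded from later cliques

    expand(set(), set(adj), set())
    return cliques


def h_vertices(adj: Dict[int, Set[int]]) -> Set[int]:
    """Largest set of h vertices that each have degree at least h."""
    by_degree = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
    h = 0
    for rank, v in enumerate(by_degree, start=1):
        if len(adj[v]) >= rank:
            h = rank
        else:
            break
    return set(by_degree[:h])


# Assumed edge list, inferred from the cliques on the slide.
edges = [(1, 2), (1, 5), (2, 5), (2, 3), (3, 4), (4, 5), (4, 6)]
adj = build_adjacency(edges)
print(sorted(sorted(c) for c in maximal_cliques(adj)))
print(sorted(h_vertices(adj)))  # [2, 4, 5], so h = 3
```

On this graph the enumeration returns exactly the five cliques listed above. The point of the H*-graph approach is that only a small core like this needs to be held in memory, while the rest of the graph stays on disk.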

Research work related to Big Data • Finding Maximal Cliques in Massive Networks by H*-Graph (SIGMOD 2010) • Large-Scale Collective Entity Matching (VLDB 2011) • Estimating Sizes of Social Networks via Biased Sampling (WWW 2011) • The Anatomy of a Large-Scale Social Search Engine (WWW 2010)

Large-Scale Collective Entity Matching • Motivation • Two kinds of approaches • Pair-wise Entity Matching • Labels pairs as match/non-match independently • Ignores the relational information • Low accuracy • Collective Entity Matching • Labels all pairs collectively • High accuracy • Often scales only to a few thousand entities • How can we scale Collective Entity Matching to millions of entities?

Large-Scale Collective Entity Matching • Method • The scalable EM framework consists of three key components • Modeling an entity matcher as a black box • Running multiple instances of the matcher on small subsets of entities • Using message passing across the instances to control the interaction between different runs of the matcher. [Figure: three Entity Matcher instances run over the subsets C1, C2 and C3 of author references (e.g. S. Jones, R. Smith, Mr. Smith, R.A. Smith), connected by message passing] Vibhor Rastogi, Nilesh Dalvi, Minos Garofalakis, Large-Scale Collective Entity Matching. Proceedings of the VLDB Endowment, Vol. 4, No. 4

Research work related to Big Data • Finding Maximal Cliques in Massive Networks by H*-Graph (SIGMOD 2010) • Large-Scale Collective Entity Matching (VLDB 2011) • Estimating Sizes of Social Networks via Biased Sampling (WWW 2011)

Estimating Sizes of Social Networks via Biased Sampling • Motivation • Social networks have become very large: • Facebook (650,000,000) • Qzone (200,000,000) • Twitter (175,000,000) • No public API for population size queries. • An exhaustive crawl is time / space / communication intensive and violates “politeness” • Goal: • Obtain estimates of population sizes in a social network with a limited number of public API calls.

Estimating Sizes of Social Networks via Biased Sampling • Method • Biased sampling: random walk on a directed graph • Construct 4 statistics: • C: the number of collisions. • C’: the number of non-unique elements. • The sum of the sampled nodes’ degrees. • The sum of the inverses of the sampled nodes’ degrees. • Two ways to estimate the number of nodes: • Collision-based estimator • Non-unique-element-based estimator • A minimum number of samples is needed to guarantee the accuracy of the estimate.

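Example
[Figure: example graph with a seed node and a random walk of six samples]
Step:                     1     2     3      4      5      6
Sampled node:             f     c     b      d      c      d
Degree:                   3     2     4      3      3      4
Collisions C:             0     0     0      0      1      2
Sum of degrees:           3     5     9      12     15     19
Sum of inverse degrees:   1/3   5/6   13/12  17/12  21/12  2
Size estimate:            -     -     -      -      13     9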
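The running totals in this example can be reproduced with a short sketch. The estimator formula did not survive in the transcript, so the code assumes the collision-based estimate is (sum of degrees × sum of inverse degrees) / (2 × C). That assumption is consistent with the example's values of about 13 after five samples and about 9 after six. The node labels and degrees are copied from the example as they appear, and the function name is hypothetical.

```python
from collections import Counter
from fractions import Fraction
from typing import List, Optional, Tuple


def collision_estimate(samples: List[Tuple[str, int]]) -> Optional[Fraction]:
    """Assumed collision-based size estimate from (node, degree) samples.

    Returns None while no collision has been observed.
    """
    counts = Counter(node for node, _ in samples)
    # A collision is a pair of samples that hit the same node.
    collisions = sum(c * (c - 1) // 2 for c in counts.values())
    if collisions == 0:
        return None
    sum_deg = sum(deg for _, deg in samples)
    sum_inv_deg = sum(Fraction(1, deg) for _, deg in samples)
    return sum_deg * sum_inv_deg / (2 * collisions)


# (node, degree) pairs as listed in the example slide.
samples = [("f", 3), ("c", 2), ("b", 4), ("d", 3), ("c", 3), ("d", 4)]

for k in range(1, len(samples) + 1):
    est = collision_estimate(samples[:k])
    print(k, "-" if est is None else float(est))
```

The first four steps print "-" because no node has repeated yet. Steps five and six give 13.125 and 9.5, matching the 13 and 9 shown in the example once the fractional part is dropped.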

Outline • What is Big Data • The Framework of Big Data • The Applications of Big Data • The Challenges of Big Data • Research work related to Big Data • Conclusions

Conclusions • Data at today's scale requires scientific and computational intelligence. • Big Data is both a challenge and an opportunity for us.

Thank You
