It’s Always been Big Data…!

It’s Always been Big Data…! MinosGarofalakis Technical University of Crete http://www.softnet.tuc.gr/~minos/

“Big” Depending on Context… • Grows by Moore’s Law… • 1st VLDB (1975): Big = millions of data points gathered by the US Census Bureau [Simonson, Alsbrooks, VLDB’75] • Things have changed since then… • In general, Big = data that cannot be handled using standalone, standard tools (on a desktop) • Today, this means using Hadoop/MR clusters, Cloud DBMSs, Supercomputers, …

The Big Data Pipeline • Several major pain points/ challenges at each step • Throwback to early batch computing of the 1960s! • No direct manipulation, interactivity, fast response • Processing is opaque, time consuming, costly • Typically, using a series of remote VMs • Different designs => VERY different temporal/financial implications

Data Analytics is Exploratory by Nature! • Can we support interactive exploration and rapid iteration over Big Data? • Mimic versatility of local file handling with tools like Excel and scripts (e.g., R) • One approach: Small footprint Synopses/Sketches for fast approximate answers and visualizations • Sampling already used (in ad-hoc manner) • Much relevant work on AQP and streaming • But, we must handle the Variety dimension • Both in data types and classes of analytics tasks! • Another important dimension: Distribution • LIFT/LEADS/FERARI projects and BD3 Workshop (this Friday!)

Optimization, Collaboration, Provenance • Can we help users to plan/monitor the monetary and time implications of their design decisions? • Again, this should be an interactive process • Can we enable users to collaborate around Big Data? • Share data sources, scripts, experiences, even data runs • Work on collaborative mashups/visualization, CSCW • Can we help users to explore and exploit the provenance and computation history of the data? • “Institutional memory” on data sources and analyses • Data synopses/approximation critical to all three…! • May just be my personal bias speaking…

A Grand Challenge Can we take a typical Excel/R user and empower them to become a Big Data Scientist? • For non-data-savvy “citizen scientists”, lack of statistical sophistication is a key problem • Can lead to poor decisions and results; more “play” than “science” • Support for fast interactive exploration, workflow optimization, collaboration, and provenance is critical • Relevant work exists in our community but still lots to be done…

A Happy Data Scientist is a Good Thing! 

It’s Always been Big Data…!

It’s Always been Big Data…!

Presentation Transcript

End User Applications

Interagency Retail Listeria monocytogenes Risk Assessment Data Sources, Data Review, and Data Gaps

Data Mining: Data Preparation

Motivation: Why data mining What is data mining Data Mining: On what kind of data Data mining functionality Classificati

Data Management

Data, Graphs, and Measurement

Three Flavors of Data

DATA

How well do you know your DATA?

Open Data

Lesson 2

Big Data – What is it?

Data and Assessment

From raw data to rich data

Using Data to Tell Your Story

Data Structures

Managing Data

Big Data