Using Spark and Shark for Fast Cycle Analysis on Diverse Data

Vaibhav Nivargi, 12.2.13

About ClearStory Data

Analysis in the New Data Landscape
New use cases seen in all industries.
• Live situational analysis requiring fast-cycle analysis across internal data and external data sources
• Multi-source analysis, with data and insights refreshing as the sources evolve
• Large-scale analysis of structured and unstructured data combined into integrated insights

Example: Interactive Multi-source Analysis
More data and more people change the analysis.
[Diagram: "Data Intelligence", interactive analysis on diverse internal & external data, fed by the sources below]
• News Coverage: Online, Print, Television
• Donations: New Members, Donations
• Facebook: Shares, Likes, Comments
• Website Traffic: Traffic, Referrals, Content
• Twitter: Followers, Tweets, Retweets
• Corporate Sponsors: Corporate Engagement, New Inquiries

Today’s Need is Speed, Scale & Ad Hoc Flexibility
With more sources, more data and more people.

Why Spark and Shark?
• RDDs
• Low latency & scale
• Iterative and interactive computation
• Lineage and fault tolerance
• Able to re-derive data
• Expressive power of Scala and SQL
• Operations beyond aggregations, joins, and statistical operators
• Advanced: ML, data mining, segmentation, approximate queries, graphs …
• Support for structured and semi-structured data
• BDAS Stack & AMPLab
• Tachyon, MLBase, BlinkDB, GraphX …
• Community and adoption
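The lineage and re-derivation points above can be illustrated with a toy model. This is a minimal sketch in plain Python, not Spark's API: `LazyDataset`, `evict`, and the `computations` counter are hypothetical names invented for illustration. It only shows the idea that a dataset which remembers how it was derived can be rebuilt after its in-memory copy is lost.

```python
from typing import Callable, Iterable, List, Optional


class LazyDataset:
    """Toy stand-in for an RDD (hypothetical): it records how to compute itself."""

    def __init__(self, compute: Callable[[], Iterable], lineage: List[str]):
        self._compute = compute          # recipe used to (re)build the data
        self.lineage = lineage           # human-readable chain of derivation steps
        self._cached: Optional[list] = None
        self._keep = False               # set by cache(): keep results in memory
        self.computations = 0            # how many times the recipe actually ran

    @classmethod
    def from_list(cls, data: list) -> "LazyDataset":
        return cls(lambda: list(data), ["source"])

    def map(self, fn: Callable) -> "LazyDataset":
        return LazyDataset(lambda: [fn(x) for x in self.collect()],
                           self.lineage + ["map"])

    def filter(self, pred: Callable) -> "LazyDataset":
        return LazyDataset(lambda: [x for x in self.collect() if pred(x)],
                           self.lineage + ["filter"])

    def cache(self) -> "LazyDataset":
        self._keep = True
        return self

    def evict(self) -> None:
        """Simulate losing the in-memory copy, e.g. after a node failure."""
        self._cached = None

    def collect(self) -> list:
        if self._cached is not None:
            return self._cached          # served from memory, no recomputation
        self.computations += 1
        result = list(self._compute())   # re-derive from lineage
        if self._keep:
            self._cached = result
        return result


# Usage: build, cache, lose the cached copy, and re-derive the same result.
squares = (LazyDataset.from_list([1, 2, 3, 4])
           .map(lambda x: x * x)
           .filter(lambda x: x % 2 == 0)
           .cache())
first = squares.collect()
squares.evict()
second = squares.collect()
```

After the eviction, the second `collect()` returns the same values as the first because the recipe is replayed rather than the data being restored from a replica.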

The ClearStory Solution
[Diagram: Data Sources, ClearStory Platform, ClearStory Application]
Components shown: Harmonization, In-Memory Data Units, Data Inference & Profiling, Visualization, Collaboration

Where do Spark & Shark fit?
[Architecture diagram, layers as listed]
• User Application
• ClearStory API
• Harmonization Engine and Blended Data Processing
• Spark Cluster + ClearStory IP
• Data Access, Inference and Lineage
• Data Source API
• Sources: Files, Hadoop, RDBMS, Web, Premium, Public

How we leverage Spark & Shark
User intent captured and translated to custom API
Harmonization-as-a-Service
• Manages Spark and Shark query execution
• Reads cached data from HDFS
• RESTful
• Merges datasets (RDDs) on the fly, on user request
• Supports conversion of user actions to backend queries
• Query optimizations
Performance optimizations
• Mixed-mode execution (sql2rdd & Spark native)
• Caching
• Pre-computation
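As a rough illustration of merging datasets on request, the sketch below performs an inner join of two keyed datasets in plain Python. It mirrors the shape of a Spark pair-RDD `join`, which yields `(key, (left_value, right_value))`, but it is not ClearStory's harmonization logic. The function name `join_by_key` and the sample records are made up for the example.

```python
from collections import defaultdict
from typing import Any, Hashable, List, Tuple

Pair = Tuple[Hashable, Any]


def join_by_key(left: List[Pair], right: List[Pair]) -> List[Tuple[Hashable, Tuple[Any, Any]]]:
    """Inner-join two lists of (key, value) pairs on their keys."""
    # Index the right side once so each left record is matched in O(1) on average.
    right_index = defaultdict(list)
    for key, value in right:
        right_index[key].append(value)
    # Emit one output record per matching (left, right) combination.
    return [(key, (lv, rv))
            for key, lv in left
            for rv in right_index.get(key, [])]


# Made-up sample data: daily donation totals and daily tweet counts.
donations = [("2013-11-01", 1200), ("2013-11-02", 800)]
tweets = [("2013-11-01", 45), ("2013-11-03", 10)]
merged = join_by_key(donations, tweets)
```

Only the date present in both inputs survives, which is the behaviour to expect from an inner join; an outer join would be the choice if unmatched days should be kept.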

How we leverage Spark & Shark
Query results returned to the application for scalable visualization and ClearStory-specific viz techniques
RDDs cached/un-cached and materialized at strategic points based on usage patterns and signals
Data updates automatically processed as source data changes
ClearStory’s own deployment, packaging, and integrated monitoring for operations at scale
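The slides do not say how usage patterns drive caching decisions. One simple possibility, shown purely as an assumption, is to keep the most frequently accessed datasets in memory and un-cache the rest. `UsageBasedCache`, its `capacity` parameter, and the dataset names are hypothetical.

```python
from collections import Counter
from typing import Set


class UsageBasedCache:
    """Hypothetical policy: cache the `capacity` most frequently accessed datasets."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.hits: Counter = Counter()   # access count per dataset id
        self.cached: Set[str] = set()    # dataset ids currently held in memory

    def record_access(self, dataset_id: str) -> None:
        self.hits[dataset_id] += 1
        self._rebalance()

    def _rebalance(self) -> None:
        # Rank by access count (descending); break ties by name for determinism.
        ranked = sorted(self.hits, key=lambda d: (-self.hits[d], d))
        self.cached = set(ranked[: self.capacity])


# Usage: three datasets compete for two cache slots.
policy = UsageBasedCache(capacity=2)
for dataset_id in ["tweets", "donations", "tweets", "traffic", "donations", "tweets"]:
    policy.record_access(dataset_id)
```

A real system would likely weigh more signals than raw frequency, such as recency, dataset size, and recomputation cost.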

Spark Developments: What We Like
Query cancellation, progress indication (0.8.1 and beyond)
More performance breakthroughs
Workload Management
BlinkDB, MLBase, Tachyon, GraphX

We’re Hiring!
Working with the community, giving back
Lots of exciting new developments
This is like the early days of Hadoop: massive momentum gathering
The First Spark Summit! More Meet-ups!
