Using Spark and Shark for Fast Cycle Analysis on Diverse Data
Vaibhav Nivargi
12.2.13
Analysis in the New Data Landscape
New use cases seen in all industries.
• Live situational analysis requiring fast-cycle analysis across internal data and sources of external data
• Multi-source analysis with data refreshing on new insights, as data from sources evolves
• Large-scale analysis of structured and unstructured data combined in integrated insights
Example: Interactive Multi-source Analysis
More data and more people change the analysis.
• News Coverage: Online, Print, Television
• Donations: New Members, Donations
• Data Intelligence
• Facebook: Shares, Likes, Comments
• Website Traffic: Traffic, Referrals, Content
• Twitter: Followers, Tweets, Retweets
• Corporate Sponsors: Corporate Engagement, New Inquiries
Interactive analysis on diverse internal & external data
Today’s Need is Speed, Scale & Ad Hoc Flexibility
With more sources, more data and more people.
Why Spark and Shark?
• RDDs
• Low latency & scale
• Iterative and interactive computation
• Lineage and fault tolerance
• Able to re-derive data
• Expressive power of Scala and SQL
• Operations beyond aggregations, joins, and statistical operators
• Advanced: ML, data mining, segmentation, approximate queries, graphs …
• Support for structured and semi-structured data
• BDAS Stack & AMPLab
• Tachyon, MLBase, BlinkDB, GraphX …
• Community and adoption
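The "lineage and fault tolerance" and "able to re-derive data" points can be illustrated with a toy model. The sketch below is plain Python, not Spark's API: `ToyRDD` and all of its methods are hypothetical names invented for this example. It only shows the idea that a dataset which records how it was derived can be rebuilt from its source after cached results are lost.

```python
# Toy illustration only: this is NOT Spark's API. All names are hypothetical.
from typing import Callable, Iterable, List, Optional


class ToyRDD:
    """A dataset that remembers how it was derived (its lineage)."""

    def __init__(self,
                 source: Optional[Callable[[], Iterable]] = None,
                 parent: Optional["ToyRDD"] = None,
                 transform: Optional[Callable[[Iterable], Iterable]] = None,
                 label: str = "source"):
        self.source = source          # only set on the root dataset
        self.parent = parent          # the dataset this one was derived from
        self.transform = transform    # how to derive this dataset from its parent
        self.label = label
        self._cache: Optional[List] = None

    def map(self, fn: Callable) -> "ToyRDD":
        return ToyRDD(parent=self, label="map",
                      transform=lambda rows: (fn(r) for r in rows))

    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(parent=self, label="filter",
                      transform=lambda rows: (r for r in rows if pred(r)))

    def cache(self) -> "ToyRDD":
        # Materialize the rows in memory so later reads skip recomputation.
        self._cache = list(self._compute())
        return self

    def evict(self) -> None:
        # Simulate losing the in-memory copy (for example, a failed node).
        self._cache = None

    def _compute(self) -> Iterable:
        if self._cache is not None:
            return iter(self._cache)
        if self.parent is None:
            return iter(self.source())
        # No cached copy: replay the recorded transformation on the parent.
        return self.transform(self.parent._compute())

    def collect(self) -> List:
        return list(self._compute())

    def lineage(self) -> List[str]:
        chain, node = [], self
        while node is not None:
            chain.append(node.label)
            node = node.parent
        return list(reversed(chain))


numbers = ToyRDD(source=lambda: range(1, 11))
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0).cache()

before = even_squares.collect()   # served from the cached copy
even_squares.evict()              # the cached copy is lost
after = even_squares.collect()    # re-derived by replaying the lineage
```

In this toy, `before` and `after` hold the same rows, and `even_squares.lineage()` lists the recorded steps from the source to the final dataset.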
The ClearStory Solution
[Diagram: Data Sources, ClearStory Platform, ClearStory Application. Component labels: Harmonization, In-Memory Data Units, Data Inference & Profiling, Visualization, Collaboration]
Where do Spark & Shark fit?
[Diagram layers, in the order given:]
• User Application
• ClearStory API
• Harmonization Engine and Blended Data Processing
• Spark Cluster + ClearStory IP
• Data Access, Inference and Lineage
• Data Source API
• Sources: Files, Hadoop, RDBMS, Web, Premium, Public
How we leverage Spark & Shark
User intent captured and translated to custom API
Harmonization-as-a-Service
• Manages Spark and Shark query execution
• Reads cached data from HDFS
• RESTful
• Merges datasets (RDDs) on the fly – on user request
• Supports conversion of user actions to backend queries
• Query optimizations
Performance optimizations
• Mixed-mode execution (sql2rdd & Spark native)
• Caching
• Pre-computation
How we leverage Spark & Shark
Query results returned to the application for scalable visualization and ClearStory-specific viz techniques
RDDs cached/un-cached and materialized at strategic points based on usage patterns and signals
Data updates automatically processed as source data changes
ClearStory’s own deployment, packaging, and integrated monitoring for operations at scale
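The slides do not say how "cached/un-cached ... based on usage patterns and signals" is decided. As one possible reading, the sketch below caches a dataset only after it has been requested a minimum number of times and un-caches the least-used entry when space runs out. `UsageBasedCache`, `min_hits`, and `capacity` are assumptions made up for this illustration, not ClearStory's design or any Spark API.

```python
# Hypothetical sketch: class name, threshold and capacity are assumptions,
# not ClearStory's actual caching policy.
from collections import Counter
from typing import Callable, Dict, List


class UsageBasedCache:
    def __init__(self, min_hits: int = 2, capacity: int = 2):
        self.min_hits = min_hits      # requests needed before a dataset is cached
        self.capacity = capacity      # maximum number of cached datasets
        self.hits: Counter = Counter()
        self.cached: Dict[str, List] = {}
        self.recomputations = 0

    def get(self, name: str, compute: Callable[[], List]) -> List:
        self.hits[name] += 1
        if name in self.cached:
            return self.cached[name]
        rows = compute()
        self.recomputations += 1
        # Cache only datasets that usage shows to be in demand.
        if self.hits[name] >= self.min_hits:
            self._admit(name, rows)
        return rows

    def _admit(self, name: str, rows: List) -> None:
        if len(self.cached) >= self.capacity:
            # Un-cache the cached dataset with the fewest requests so far.
            coldest = min(self.cached, key=lambda n: self.hits[n])
            del self.cached[coldest]
        self.cached[name] = rows


cache = UsageBasedCache(min_hits=2, capacity=1)
load = lambda: [1, 2, 3]          # stands in for an expensive query

cache.get("donations", load)      # first request: computed, not cached
cache.get("donations", load)      # second request: computed, then cached
cache.get("donations", load)      # third request: served from the cache
cache.get("tweets", load)         # first request for another dataset: computed only
```

A real policy would likely weigh more signals (dataset size, recency, cost of recomputation); request counts are used here only to keep the example small.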
Spark Developments – What We Like
Query cancellation, progress indication (0.8.1 and beyond)
More performance breakthroughs
Workload Management
BlinkDB
MLBase
Tachyon
GraphX
We’re Hiring!
Working with the community, giving back
Lots of exciting new developments
This is like the early days of Hadoop – massive momentum gathering
The First Spark Summit! More Meet-ups!