Big Data - Streaming
Kalapriya Kannan, IBM Research Labs
July 2013
Query = function(all data)
• Is there a general-purpose way to compute arbitrary functions in real time?
• Example query: total number of pageviews to a URL over a range of time.
Implementation
• Running the query function directly over all the data is too slow.
• The data is petabyte scale.
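A minimal sketch of the direct approach, to show why it does not scale: every query scans every record. The `Pageview` record, its field names, and the sample data are assumptions made for illustration; they are not from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pageview:
    url: str
    timestamp: int  # seconds since some epoch (assumed unit)


def total_pageviews(all_data, url, start, end):
    """Count pageviews of `url` with start <= timestamp < end.

    This is query = function(all data) taken literally: the cost of each
    query grows with the size of the whole dataset.
    """
    return sum(1 for pv in all_data if pv.url == url and start <= pv.timestamp < end)


# Hypothetical sample data.
all_data = [
    Pageview("example.com/a", 10),
    Pageview("example.com/a", 3700),
    Pageview("example.com/b", 20),
    Pageview("example.com/a", 7300),
]

print(total_pageviews(all_data, "example.com/a", 0, 7200))  # prints 2
```

At petabyte scale the full scan inside `total_pageviews` is the part that becomes too slow to run per query.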
Precomputation
[Diagram: instead of All Data → Query, use All Data → Precomputed Views → Query]
Example query
[Diagram: All Data (individual pageview records) → Precomputed view → Query; the value 1100 appears in the diagram]
Precomputation
[Diagram: All Data → function → Precomputed Views → function → Query]
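The two functions in the precomputation picture can be sketched for the pageview example as follows. The hourly bucket size, the `(url, timestamp)` record shape, and the names `build_view` and `query` are assumptions chosen for illustration.

```python
from collections import Counter

HOUR = 3600  # bucket width in seconds (an assumed granularity)


def build_view(all_data):
    """First function: (url, timestamp) records -> {(url, hour_bucket): count}."""
    view = Counter()
    for url, timestamp in all_data:
        view[(url, timestamp // HOUR)] += 1
    return view


def query(view, url, start_hour, end_hour):
    """Second function: sum the hourly counts for start_hour <= bucket < end_hour."""
    return sum(view.get((url, hour), 0) for hour in range(start_hour, end_hour))


# Hypothetical sample data.
all_data = [
    ("example.com/a", 10),
    ("example.com/a", 3700),
    ("example.com/b", 20),
    ("example.com/a", 7300),
]

view = build_view(all_data)                # expensive, done ahead of time
print(query(view, "example.com/a", 0, 2))  # cheap lookup, prints 2
```

The query now touches one entry per hour in the requested range rather than every raw record.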
Hadoop
• Great at computing arbitrary functions.
• Tools for expressing these functions:
  • Cascading, Pig, Cascalog
  • Scalding
  • Hive
Hadoop precomputation
[Diagram: All Data → MapReduce workflow → Batch view #1; All Data → MapReduce workflow → Batch view #2]
• Each workflow looks at all the data.
• Batch-mode DB: ElephantDB
Are we done?
• Not quite.
• A batch workflow is too slow.
• Views are out of date.
[Timeline: data absorbed in batch views | just the last few hours of data, not yet absorbed | now]
Last few hours of data… Storm
Application queries
[Diagram: Query = merge(Batch view, Real-time view)]
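A hedged sketch of the merge step for the pageview count. It assumes both views use the same `(url, hour_bucket) -> count` layout and that every pageview is counted in exactly one of the two views, so the counts can simply be added. The function name and the sample numbers are hypothetical.

```python
def merged_pageviews(batch_view, realtime_view, url, start_hour, end_hour):
    """Answer a query by adding the batch count and the real-time count per hour."""
    total = 0
    for hour in range(start_hour, end_hour):
        key = (url, hour)
        total += batch_view.get(key, 0) + realtime_view.get(key, 0)
    return total


# Hypothetical views: the batch view has absorbed hours 0 and 1,
# the real-time view holds the hours the batch workflow has not reached yet.
batch_view = {("example.com/a", 0): 40, ("example.com/a", 1): 25}
realtime_view = {("example.com/a", 2): 7}

print(merged_pageviews(batch_view, realtime_view, "example.com/a", 0, 3))  # prints 72
```

Addition works here because a count is easy to combine; other query functions would need their own merge logic.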
What does Storm do?
• Distributes code
• Robust process management
• Monitors topologies and reassigns failed tasks
• Provides reliability by tracking tuple trees
• Routing and partitioning of streams
• At-least-once message processing
• Horizontal scalability
• No intermediate queues
• Less operational overhead
• "Just works"
storm jar myapp.jar com.twitter.Mytopology demo
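To make "tracking tuple trees" concrete, here is a toy model, not Storm's code. It assumes an XOR-checksum scheme of the kind Storm's documentation describes for its acker: each tuple id is XORed into a per-root checksum once when the tuple is created and once when it is acked, so the checksum returns to zero only when every tuple in the tree has been acked. The class name, method names, and ids are hypothetical.

```python
class TupleTreeTracker:
    """Toy tracker holding one XOR checksum per root tuple."""

    def __init__(self):
        self.pending = {}  # root_id -> XOR of ids created but not yet acked

    def emit_root(self, root_id):
        """Register a new root tuple emitted by a source."""
        self.pending[root_id] = root_id

    def ack(self, root_id, tuple_id, children=()):
        """Ack `tuple_id` and register the child tuples it produced, in one step.

        Returns True when the whole tree rooted at `root_id` is processed.
        """
        checksum = self.pending[root_id] ^ tuple_id
        for child_id in children:
            checksum ^= child_id
        if checksum == 0:
            del self.pending[root_id]
            return True
        self.pending[root_id] = checksum
        return False

    def unfinished_roots(self):
        """Roots still pending; after a timeout these would be replayed,
        which is what makes processing at-least-once."""
        return list(self.pending)


# Fixed ids for readability; a real system would use random 64-bit ids.
ROOT, CHILD_A, CHILD_B = 0xA1, 0xB2, 0xC3

tracker = TupleTreeTracker()
tracker.emit_root(ROOT)
print(tracker.ack(ROOT, ROOT, children=[CHILD_A, CHILD_B]))  # prints False
print(tracker.ack(ROOT, CHILD_A))                            # prints False
print(tracker.ack(ROOT, CHILD_B))                            # prints True
print(tracker.unfinished_roots())                            # prints []
```

Acking a tuple and registering its children in the same call keeps the checksum from reaching zero while part of the tree is still in flight.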
Current use cases at Twitter
• Discovery of emerging topics/stories
• Online learning of tweet features for search result ranking
• Real-time analytics for ads
• Internal log processing
Topology isolation