Apache Samza*: Reliable Stream Processing atop Apache Kafka and YARN
Sriram Subramanian (LinkedIn; Twitter: @sriramsub1)
* Incubating
Agenda
• Why Stream Processing?
• What is Samza’s Design?
• How is Samza’s Design Implemented?
• How can you use Samza?
• Example usage at LinkedIn
Response latency
• RPC: synchronous (0 ms)
• Samza: milliseconds to minutes
• Later. Possibly much later.
• Newsfeed
• Ad Relevance
• Search Index
• Metrics and Monitoring
[Diagram: a job reads Stream A and Stream B and writes Stream C.]
[Diagram: Jobs 1 to 3 chained together through Streams A to G.]
Streams
[Diagram, animated over several frames: a stream divided into Partition 0 (messages 1 to 7), Partition 1 (messages 1 to 6), and Partition 2 (messages 1 to 5); the next append goes to the end of a partition.]
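The partitioned, append-only model on this slide can be illustrated with a small in-memory sketch. This is not Samza or Kafka code: the `Stream` class, its method names, and the CRC32-based partitioner are all hypothetical, chosen only to show that each partition is an ordered sequence, that new messages are always appended at the end, and that a message is addressed by its partition and offset. Offsets here start at 0, whereas the slide numbers messages from 1.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Stream:
    """A stream modelled as a fixed number of append-only partitions."""
    num_partitions: int
    partitions: list = field(init=False)

    def __post_init__(self):
        self.partitions = [[] for _ in range(self.num_partitions)]

    def partition_for(self, key: str) -> int:
        # Stable hash (an assumption of this sketch) so that the same key
        # always lands in the same partition.
        return zlib.crc32(key.encode("utf-8")) % self.num_partitions

    def append(self, key: str, value) -> tuple:
        """Append to the end of the key's partition; return (partition, offset)."""
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1

    def read(self, partition: int, offset: int) -> list:
        """Return every message in `partition` at or after `offset`."""
        return self.partitions[partition][offset:]


stream = Stream(num_partitions=3)
p1, o1 = stream.append("user-42", "view")
p2, o2 = stream.append("user-42", "click")
```

Because both messages share a key, they land in the same partition at consecutive offsets, so a reader of that partition sees them in the order they were written.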
Jobs
[Diagram: a job made up of Tasks 1 to 3 reads Stream A and Stream B and writes Stream C.]
[Diagram: the same job with concrete streams: Tasks 1 to 3 read AdViews and AdClicks and write AdClickThroughRate.]
Tasks
[Diagram, animated over several frames: AdViewsCounterTask consumes messages 1 to 4 from Ad Views partition 0, writes to the Output Count Stream (Partition 0 and Partition 1), and, in the later frames, writes to a Checkpoint Stream (Partition 1; the frame shows the value 2).]
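The interaction between a task, its input partition, and the checkpoint stream can be sketched as follows. This is a toy model under stated assumptions, not Samza's implementation: `CounterTask`, the `checkpoint_every` parameter, and the use of plain Python lists for the output and checkpoint streams are all hypothetical. The point it illustrates is that a task which restarts from its last checkpoint may reprocess messages handled after that checkpoint, which gives at-least-once delivery.

```python
class CounterTask:
    """Toy task: reads one input partition, emits to an output list,
    and records its position in a checkpoint list."""

    def __init__(self, input_partition, output, checkpoint_stream,
                 checkpoint_every=2):
        self.input = input_partition          # list of messages; index = offset
        self.output = output                  # stands in for the output stream
        self.checkpoints = checkpoint_stream  # stands in for the checkpoint stream
        self.checkpoint_every = checkpoint_every  # hypothetical interval

    def next_offset(self):
        # Resume from the last checkpointed position, or from the start.
        return self.checkpoints[-1] if self.checkpoints else 0

    def run(self, limit=None):
        """Process up to `limit` messages (None = all), checkpointing periodically."""
        offset = self.next_offset()
        processed = 0
        while offset < len(self.input) and (limit is None or processed < limit):
            self.output.append((offset, self.input[offset]))
            offset += 1
            processed += 1
            if offset % self.checkpoint_every == 0:
                self.checkpoints.append(offset)


views = ["ad-1", "ad-2", "ad-1", "ad-3"]
output, checkpoints = [], []

# The first attempt handles three messages, then "crashes" before the
# next checkpoint is written.
CounterTask(views, output, checkpoints).run(limit=3)

# A replacement task resumes from the last checkpoint (offset 2), so the
# message at offset 2 is emitted a second time.
CounterTask(views, output, checkpoints).run()
```

After both runs the checkpoint list holds offsets 2 and 4, and the output contains the message at offset 2 twice. Downstream consumers therefore need to tolerate duplicates.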
Dataflow
[Diagram: Jobs 1 to 3 wired together through Streams A to E; Stream B appears twice in the graph.]
Stateful Processing
• Windowed Aggregation
  • Counting the number of page views for each user per hour
• Stream-Stream Join
  • Join a stream of ad clicks to a stream of ad views to identify the view that led to the click
• Stream-Table Join
  • Join user region info to a stream of page views to create an augmented stream
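Two of these patterns are small enough to sketch directly. The code below is a plain-Python illustration, not Samza code; the function names, the event shapes (`(user_id, epoch_seconds)` and `(user_id, url)` tuples), and the sample data are assumptions made for the example. The first function keeps a count per user per hour, which is the state a windowed aggregation has to hold; the second looks up each page view's user in a region table, which is the state a stream-table join has to hold.

```python
from collections import defaultdict

SECONDS_PER_HOUR = 3600


def hourly_page_views(events):
    """Count page views per (user, hour) from (user_id, epoch_seconds) events."""
    counts = defaultdict(int)
    for user_id, ts in events:
        window_start = ts - ts % SECONDS_PER_HOUR  # truncate to the hour
        counts[(user_id, window_start)] += 1
    return dict(counts)


def enrich_with_region(page_views, user_regions):
    """Stream-table join: attach each user's region to their page views."""
    for user_id, url in page_views:
        # Users missing from the table get a region of None.
        yield {"user": user_id, "url": url, "region": user_regions.get(user_id)}


events = [("alice", 10), ("alice", 3599), ("alice", 3600), ("bob", 7200)]
hourly = hourly_page_views(events)

regions = {"alice": "EU"}
augmented = list(enrich_with_region([("alice", "/home"), ("bob", "/jobs")], regions))
```

In both cases the interesting question is where `counts` and `regions` live and how they survive a task restart.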
How do people do this?
• In-memory state with checkpointing
  • Periodically save out the task’s in-memory data
  • Becomes very expensive as state grows
  • Some implementations checkpoint only diffs, but this adds complexity
How do people do this?
• Using an external store
  • Push state to an external store
  • Performance suffers because of remote queries
  • Lack of isolation
  • Limited query capabilities
Stateful Tasks
[Diagram, animated over several frames: Tasks 1 to 3 consume Stream A and produce Stream B; the later frames add a Changelog Stream.]
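A minimal sketch of the changelog idea follows, assuming the usual reading of this diagram: each task keeps its state locally, mirrors every write to a changelog stream, and rebuilds the state after a failure by replaying that stream. The `ChangelogBackedStore` class and its methods are hypothetical and use a Python list in place of a real stream; `None` is used as a delete marker, so this toy store cannot hold `None` as a value.

```python
class ChangelogBackedStore:
    """Local key-value state whose writes are mirrored to a changelog."""

    def __init__(self, changelog):
        self.changelog = changelog  # list standing in for the changelog stream
        self.data = {}              # the task's local state

    def put(self, key, value):
        self.data[key] = value
        self.changelog.append((key, value))

    def delete(self, key):
        self.data.pop(key, None)
        self.changelog.append((key, None))  # None acts as a delete marker

    def get(self, key, default=None):
        return self.data.get(key, default)

    @classmethod
    def restore(cls, changelog):
        """Rebuild local state by replaying the changelog from the beginning."""
        store = cls(changelog)
        for key, value in changelog:
            if value is None:
                store.data.pop(key, None)
            else:
                store.data[key] = value
        return store


changelog = []
store = ChangelogBackedStore(changelog)
store.put("alice", 1)
store.put("alice", 2)
store.put("bob", 5)
store.delete("bob")

# Simulate losing the task: only the changelog survives.
recovered = ChangelogBackedStore.restore(changelog)
```

Reads stay local and fast, and recovery needs only the changelog. Since only the latest entry per key matters during replay, a changelog like this is a natural candidate for compaction.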