Apache Kafka


Presentation Transcript


  1. Apache Kafka A high-throughput distributed messaging system Johan Lundahl

  2. Agenda • Kafka overview • Main concepts and comparisons to other messaging systems • Features, strengths and tradeoffs • Message format and broker concepts • Partitioning, Keyed messages, Replication • Producer / Consumer APIs • Operation considerations • Kafka ecosystem If time permits: • Kafka as a real-time processing backbone • Brief intro to Storm • Kafka-Storm wordcount demo

  3. What is Apache Kafka? • Distributed, high-throughput, pub-sub messaging system • Fast, Scalable, Durable • Main use cases: • log aggregation, real-time processing, monitoring, queueing • Originally developed by LinkedIn • Implemented in Scala/Java • Top level Apache project since 2012: http://kafka.apache.org/

  4. Comparison to other messaging systems • Traditional: JMS, xxxMQ/AMQP • New gen: Kestrel, Scribe, Flume, Kafka [Diagram: systems placed on a spectrum from message queues (low throughput, low latency) to log aggregators (high throughput, high latency); systems shown: RabbitMQ, JMS, ActiveMQ, Qpid, Kestrel, Hedwig, Kafka, Flume, Scribe, batch jobs]

  5. Kafka concepts [Diagram: producers (services, frontends) push messages for Topic1, Topic2 and Topic3 to the Kafka broker; consumers (data warehouse, batch processing, monitoring, stream processing) pull the topics they subscribe to]

  6. Distributed model [Diagram: producers publish partitioned data to a cluster of brokers with intra cluster replication, coordinated through Zookeeper; the Topic1 and Topic2 consumer groups receive ordered subscriptions; producer persistence is marked KAFKA-156]

  8. Performance factors • Broker doesn’t track consumer state • Everything is distributed • Zero-copy (sendfile) reads/writes • Usage of page cache backed by sequential disk allocation • Like a distributed commit log • Low overhead protocol • Message batching (Producer & Consumer) • Compression (End to end) • Configurable ack levels From: http://queue.acm.org/detail.cfm?id=1563874

  9. Kafka features and strengths • Simple model, focused on high throughput and durability • O(1) time persistence on disk • Horizontally scalable by design (broker and consumers) • Push/pull model (producers push, consumers pull) => consumer burst tolerance • Replay messages • Multiple independent subscribers per topic • Configurable batching, compression, serialization • Online upgrades

  10. Tradeoffs • Not optimized for millisecond latencies • Have not beaten CAP • Simple messaging system, no processing • Zookeeper becomes a bottleneck when using too many topics/partitions (>>10000) • Not designed for very large payloads (full HD movie etc.) • Helps to know your data in advance

  12. Message/Log Format • Fields, in order: Message Length, Version, Checksum, Payload
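
The four fields on this slide can be illustrated with a small framing sketch. The field widths used here (4-byte length, 1-byte version, 4-byte CRC32 checksum) and the function names are assumptions made for the example; the slide gives only the field order, so this should not be read as Kafka's exact wire format.

```python
import struct
import zlib
from typing import Tuple

# Assumed layout (widths are illustrative, not taken from the slide):
#   4-byte big-endian length | 1-byte version | 4-byte CRC32 of payload | payload
HEADER = struct.Struct(">IBI")

def encode_message(payload: bytes, version: int = 0) -> bytes:
    """Frame a payload as length + version + checksum + payload."""
    checksum = zlib.crc32(payload) & 0xFFFFFFFF
    # The length counts everything that follows the length field itself.
    length = 1 + 4 + len(payload)
    return HEADER.pack(length, version, checksum) + payload

def decode_message(buf: bytes) -> Tuple[int, bytes]:
    """Parse one framed message and verify its checksum."""
    length, version, checksum = HEADER.unpack_from(buf, 0)
    payload = buf[HEADER.size:4 + length]
    if zlib.crc32(payload) & 0xFFFFFFFF != checksum:
        raise ValueError("checksum mismatch: message is corrupt")
    return version, payload

framed = encode_message(b"hello")
version, payload = decode_message(framed)
```

The length prefix lets a reader skip from one message to the next in a log file without parsing payloads, and the checksum catches corruption on disk or on the wire.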

  13. Log based queue (Simplified model) • Producer API used directly by application or through one of the contributed implementations, e.g. log4j/logback appender • Batching • Compression • Serialization [Diagram: Producer1 and Producer2 append Message1 to Message10 to the logs of Topic1 and Topic2 on the broker; Consumer1, Consumer2 and the members of ConsumerGroup1 (labeled Consumer3) read from the logs]
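
A minimal sketch of the log-based model, assuming an in-memory list in place of the on-disk log. `TopicLog` and `Consumer` are hypothetical names for this illustration, not Kafka classes. The point it shows is that each consumer keeps its own offset, so the log itself tracks no consumer state and any consumer can replay.

```python
from typing import List

class TopicLog:
    """Append-only message log for one topic (hypothetical name)."""
    def __init__(self) -> None:
        self._messages: List[bytes] = []

    def append(self, message: bytes) -> int:
        self._messages.append(message)
        return len(self._messages) - 1  # offset of the new message

    def read(self, offset: int, max_messages: int = 10) -> List[bytes]:
        return self._messages[offset:offset + max_messages]

class Consumer:
    """Each consumer remembers its own position; the log does not."""
    def __init__(self, log: TopicLog) -> None:
        self.log = log
        self.offset = 0

    def poll(self, max_messages: int = 10) -> List[bytes]:
        batch = self.log.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch

    def rewind(self, offset: int = 0) -> None:
        self.offset = offset

log = TopicLog()
for i in range(1, 6):
    log.append(f"Message{i}".encode())

fast, slow = Consumer(log), Consumer(log)
fast.poll()    # reads all five messages
slow.poll(2)   # reads only the first two
fast.rewind()  # replay from the beginning
```

Because reading does not remove anything from the log, a slow consumer never blocks a fast one.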

  14. Partitioning [Diagram: producers write to the partitions of Topic1 and Topic2 on a broker; consumers in Group1, Group2 and Group3 are each assigned partitions; one consumer is annotated "No partition for this guy" because its group has more consumers than the topic has partitions]
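
The "No partition for this guy" case can be reproduced with a small assignment sketch. The round-robin rule and the function name `assign_partitions` are assumptions chosen for brevity, not Kafka's actual assignment algorithm; the sketch only shows that each partition goes to exactly one consumer in a group.

```python
from typing import Dict, List

def assign_partitions(partitions: List[int],
                      consumers: List[str]) -> Dict[str, List[int]]:
    """Give each partition to exactly one consumer in the group, round-robin."""
    if not consumers:
        return {}
    assignment: Dict[str, List[int]] = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment

# Three partitions, four consumers: the last consumer stays idle.
group = assign_partitions([0, 1, 2], ["c1", "c2", "c3", "c4"])
```

Here `group["c4"]` is an empty list, which is why adding consumers beyond the partition count does not add read parallelism.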

  15. Keyed messages • #partitions = 3 • partition = hash(key) % #partitions [Diagram: a producer sends Message1 to Message18 to Topic1, whose three partitions are hosted on BrokerId=1, BrokerId=2 and BrokerId=3; each message lands on the partition selected by its key]
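
The rule hash(key) % #partitions can be sketched as follows. `zlib.crc32` stands in for the hash function here, which is an assumption: the slide does not say which hash is used. The function name and sample keys are likewise invented for the example.

```python
import zlib
from typing import List

def partition_for(key: bytes, num_partitions: int) -> int:
    # zlib.crc32 is a stand-in hash (an assumption); unlike Python's
    # built-in hash() it gives the same result in every process.
    return zlib.crc32(key) % num_partitions

keys: List[bytes] = [b"user-1", b"user-2", b"user-1", b"user-3", b"user-1"]
placement = [partition_for(k, 3) for k in keys]
```

All messages with the key `user-1` map to the same partition, so their relative order is preserved for whichever consumer reads that partition. Changing the number of partitions changes the mapping, which is one reason it helps to know your data in advance.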

  16. Intra cluster replication • Replication factor = 3 • InSyncReplicas (ISR) Follower fails: • Follower dropped from ISR • When follower comes online again: fetch data from leader, then ISR gets updated Leader fails: • Detected via Zookeeper from ISR • New leader gets elected 3 commit modes: [Diagram: Topic1 is replicated on Broker1 (follower), Broker2 (leader) and Broker3 (follower), each holding Message1 to Message10; the producer receives acks]
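
The two failure cases on this slide can be walked through with a toy model. `PartitionReplicas` is a hypothetical class, and choosing the first remaining in-sync replica as the new leader is an assumption made for the sketch; the real election is coordinated through Zookeeper as the slide notes.

```python
from typing import List, Optional

class PartitionReplicas:
    """Toy model of one partition's replica set (hypothetical class)."""
    def __init__(self, leader: int, followers: List[int]) -> None:
        self.leader: Optional[int] = leader
        self.isr: List[int] = [leader] + followers  # in-sync replicas

    def follower_failed(self, broker: int) -> None:
        # A failed follower is dropped from the ISR.
        if broker != self.leader and broker in self.isr:
            self.isr.remove(broker)

    def follower_caught_up(self, broker: int) -> None:
        # Called once the follower has fetched the missing data from the leader.
        if broker not in self.isr:
            self.isr.append(broker)

    def leader_failed(self) -> Optional[int]:
        if self.leader in self.isr:
            self.isr.remove(self.leader)
        # Assumption: pick the first remaining in-sync replica.
        self.leader = self.isr[0] if self.isr else None
        return self.leader

replicas = PartitionReplicas(leader=2, followers=[1, 3])
replicas.follower_failed(3)      # ISR shrinks to brokers 2 and 1
replicas.follower_caught_up(3)   # broker 3 rejoins the ISR
new_leader = replicas.leader_failed()
```

Electing only from the ISR means the new leader already holds every message the old leader had replicated to its in-sync followers.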

  18. Producer API …or for log aggregation: Configuration parameters: • ProducerType (sync/async) • CompressionCodec (none/snappy/gzip) • BatchSize • EnqueueSize/Time • Encoder/Serializer • Partitioner • #Retries • MaxMessageSize • …
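
To make the BatchSize, CompressionCodec and Encoder/Serializer parameters concrete, here is a library-free sketch of what an async producer does with them. `BatchingProducer` is a hypothetical class and is not the Kafka client API; JSON and gzip are arbitrary choices from the Python standard library.

```python
import gzip
import json
from typing import Any, Callable, List

class BatchingProducer:
    """Hypothetical async-style producer: buffers, then sends compressed batches."""
    def __init__(self, send: Callable[[bytes], None], batch_size: int = 3,
                 compress: bool = True) -> None:
        self.send = send
        self.batch_size = batch_size
        self.compress = compress
        self._buffer: List[bytes] = []

    def produce(self, record: Any) -> None:
        # Serializer step: JSON is an arbitrary choice for this sketch.
        self._buffer.append(json.dumps(record).encode("utf-8"))
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if not self._buffer:
            return
        batch = b"\n".join(self._buffer)
        self._buffer.clear()
        self.send(gzip.compress(batch) if self.compress else batch)

sent: List[bytes] = []
producer = BatchingProducer(send=sent.append, batch_size=3)
for n in range(7):
    producer.produce({"event": n})
producer.flush()  # push out the final partial batch
```

Seven records with a batch size of three result in three network sends instead of seven, and compressing a whole batch usually works better than compressing messages one at a time.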

  19. Consumer API(s) • High-level (consumer group, auto-commit) • Low-level (simple consumer, manual commit)
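
The practical difference between auto-commit and manual commit shows up when a consumer crashes mid-message. The sketch below is a deliberate simplification: auto-commit is modeled as committing the offset before the handler runs, and `OffsetStore`, `consume` and `run` are hypothetical names, not part of either consumer API.

```python
from typing import Callable, List

class OffsetStore:
    """Stands in for wherever committed offsets are kept (hypothetical)."""
    def __init__(self) -> None:
        self.committed = 0

def consume(log: List[str], store: OffsetStore,
            handle: Callable[[str], None], auto_commit: bool) -> None:
    """Process messages from the last committed offset onward."""
    offset = store.committed
    while offset < len(log):
        if auto_commit:
            store.committed = offset + 1  # committed before processing
        handle(log[offset])               # may raise
        if not auto_commit:
            store.committed = offset + 1  # committed only after success
        offset += 1

def run(auto_commit: bool) -> List[str]:
    log = ["m1", "m2", "m3"]
    store, seen, crashed = OffsetStore(), [], []

    def handle(msg: str) -> None:
        if msg == "m2" and not crashed:
            crashed.append(True)
            raise RuntimeError("consumer crashed")
        seen.append(msg)

    for _ in range(2):  # the first run crashes, the second run resumes
        try:
            consume(log, store, handle, auto_commit)
        except RuntimeError:
            pass
    return seen

auto_result = run(auto_commit=True)
manual_result = run(auto_commit=False)
```

In this model the auto-commit run never processes `m2` after the restart, while the manual-commit run retries it, which is the usual reason to accept the extra work of the low-level API.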

  21. Broker Protips • Reasonable number of partitions – will affect performance • Reasonable number of topics – will affect performance • Performance decreases with larger Zookeeper ensembles • Disk flush rate settings • message.max.bytes – max accepted message size, should be smaller than the heap • socket.request.max.bytes – max fetch size, should be smaller than the heap • log.retention.bytes – don’t want to run out of disk space… • Keep Zookeeper logs under control for the same reason as above • Kafka brokers have been tested on Linux and Solaris

  22. Operating Kafka • Zookeeper usage: Producer loadbalancing, Broker ISR, Consumer tracking • Monitoring: JMX, Audit trail/console in the making • Distribution Tools: Controlled shutdown tool, Preferred replica leader election tool, List topic tool, Create topic tool, Add partition tool, Reassign partitions tool, MirrorMaker

  23. Multi-datacenter replication

  25. Ecosystem
  Producers: • Java (in standard dist) • Scala (in standard dist) • Log4j (in standard dist) • Logback: logback-kafka • Udp-kafka-bridge • Python: kafka-python • Python: pykafka • Python: samsa • Python: pykafkap • Python: brod • Go: Sarama • Go: kafka.go • C: librdkafka • C/C++: libkafka • Clojure: clj-kafka • Clojure: kafka-clj • Ruby: Poseidon • Ruby: kafka-rb • Ruby: em-kafka • PHP: kafka-php(1) • PHP: kafka-php(2) • PHP: log4php • Node.js: Prozess • Node.js: node-kafka • Node.js: franz-kafka • Erlang: erlkafka
  Consumers: • Java (in standard dist) • Scala (in standard dist) • Python: kafka-python • Python: samsa • Python: brod • Go: Sarama • Go: nuance • Go: kafka.go • C/C++: libkafka • Clojure: clj-kafka • Clojure: kafka-clj • Ruby: Poseidon • Ruby: kafka-rb • Ruby: Kafkaesque • Jruby::Kafka • PHP: kafka-php(1) • PHP: kafka-php(2) • Node.js: Prozess • Node.js: node-kafka • Node.js: franz-kafka • Erlang: erlkafka • Erlang: kafka-erlang
  Common integration points:
  Stream Processing: • Storm - A stream-processing framework. • Samza - A YARN-based stream processing framework.
  Hadoop Integration: • Camus - LinkedIn's Kafka=>HDFS pipeline. This one is used for all data at LinkedIn, and works great. • Kafka Hadoop Loader - A different take on Hadoop loading functionality from what is included in the main distribution.
  AWS Integration: • Automated AWS deployment • Kafka->S3 Mirroring
  Logging: • klogd - A python syslog publisher • klogd2 - A java syslog publisher • Tail2Kafka - A simple log tailing utility • Fluentd plugin - Integration with Fluentd • Flume Kafka Plugin - Integration with Flume • Remote log viewer • LogStash integration - Integration with LogStash and Fluentd • Official logstash integration
  Metrics: • Mozilla Metrics Service - A Kafka and Protocol Buffers based metrics and logging system • Ganglia Integration
  Packing and Deployment: • RPM packaging • Debian packaging (https://github.com/tomdz/kafka-deb-packaging) • Puppet integration • Dropwizard packaging
  Misc.: • Kafka Mirror - An alternative to the built-in mirroring tool • Ruby Demo App • Apache Camel Integration • Infobright integration

  26. What’s in the future? • Topic and transient consumer garbage collection (KAFKA-560/KAFKA-559) • Producer side persistence (KAFKA-156/KAFKA-789) • Exact mirroring (KAFKA-658) • Quotas (KAFKA-656) • YARN integration (KAFKA-949) • RESTful proxy (KAFKA-639) • New build system? (KAFKA-855) • More tooling (Console, Audit trail) (KAFKA-266/KAFKA-260) • Client API rewrite (Proposal) • Application level security (Proposal)

  28. Stream processing • Kafka as a processing pipeline backbone [Diagram: producers feed Kafka topic1, Process1 instances sit between Kafka topic1 and Kafka topic2, and Process2 instances consume from Kafka topic2; the stages span System1 and System2]

  29. What is Storm? • Distributed real-time computation system with design goals: • Guaranteed processing • No orphaned tasks • Horizontally scalable • Fault tolerant • Fast • Use cases: Stream processing, DRPC, Continuous computation • 4 basic concepts: streams, spouts, bolts, topologies • In Apache incubator • Implemented in Clojure

  30. Streams: an [infinite] sequence of tuples, e.g. (timestamp, sessionid, exceptionstacktrace): (t1,s1,e1) (t2,s1,e2) (t3,s3) (t4,s2,e2) Spouts: a source of streams • Connects to queues, logs, API calls, event data • Some features, like transactional topologies (which give exactly-once messaging semantics), are only possible using the Kafka-TransactionalSpout-consumer

  31. Bolts • Filters • Transformations • Apply functions • Aggregations • Access DB, APIs etc. • Emitting new streams • Trident = a high level abstraction on top of Storm [Diagram: tuples such as (t1,s1,h1), (t2,s1,h2), (t3,s3), (t4,s2,e2) and (t5,s4) flowing through a bolt]

  32. Topologies [Diagram: a network of spouts and bolts with tuples (t1,s1,h1), (t2,s1,h2), (t3,s3), (t4,s2,e2), (t5,s4), (t6,s6), (t7,s7) and (t8,s8) flowing between them]
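
The spout, bolt and topology concepts can be illustrated with a word count written in plain Python. This is not the Storm API: the generator functions and their names are assumptions for the sketch, and a real count bolt would emit running counts on an unbounded stream rather than return a final dictionary.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

def sentence_spout(lines: Iterable[str]) -> Iterator[Tuple[str]]:
    """Spout: a source of a stream, emitting one-field tuples."""
    for line in lines:
        yield (line,)

def split_bolt(stream: Iterable[Tuple[str]]) -> Iterator[Tuple[str]]:
    """Bolt: transforms each sentence tuple into one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.lower().split():
            yield (word,)

def count_bolt(stream: Iterable[Tuple[str]]) -> Dict[str, int]:
    """Bolt: aggregates word tuples into counts."""
    counts: Counter = Counter()
    for (word,) in stream:
        counts[word] += 1
    return dict(counts)

# The "topology" is the wiring of the spout into the two bolts.
lines = ["kafka feeds storm", "storm counts words"]
word_counts = count_bolt(split_bolt(sentence_spout(lines)))
```

In a Kafka-backed setup the spout would read its sentences from a Kafka topic instead of an in-memory list.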

  33. Storm cluster • Topologies are deployed to Nimbus • Compare with Hadoop: Nimbus (JobTracker), Supervisors (TaskTrackers) • Zookeeper coordinates between Nimbus and the Supervisors • Mesos/YARN

  34. Links Apache Kafka: • Papers and presentations • Main project page • Small Mediawiki case study Storm: • Introductory article • Realtime discussing blog post • Kafka+Storm for realtime BigData • Trifecta blog post: Kafka+Storm+Cassandra • IBM developer article • Kafka+Storm@Twitter • BigDataQuadfecta blog post
