
Spark Streaming: Monitoring and Optimization






Presentation Transcript


  1. Spark Streaming: Monitoring and Optimization. Presenter: 栾学东 (Luan Xuedong)

  2. What is Spark? Apache Spark is a fast and general engine for large-scale data processing. Key qualities: speed, ease of use, generality, and integration with Hadoop.

  3. Spark Ecosystem

  4. What is Spark Streaming? Spark Streaming is a sub-project of Apache Spark. Spark is a batch-processing platform similar to Apache Hadoop, and Spark Streaming is a real-time processing tool that runs on top of the Spark engine.

  5. Why Spark Streaming? Low latency, high throughput, and fault tolerance. A DStream is a sequence of micro-batches of RDDs: its operations are similar to RDD operations, and RDD lineage provides fault tolerance. It supports sources such as Flume, Kafka, Twitter, and Kinesis, is built on the Spark Core execution engine and API, and runs as a long-lived Spark application.
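
The micro-batch idea on this slide can be illustrated without Spark itself; the following sketch (the function name `micro_batches` is my own, not a Spark API) slices a record stream into fixed-size chunks, each of which would correspond to one small RDD in a DStream:

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Group a record stream into fixed-size micro-batches,
    mimicking how a DStream is a sequence of small RDDs."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Each micro-batch can then be processed with ordinary batch operations.
print(list(micro_batches(range(7), batch_size=3)))
# [[0, 1, 2], [3, 4, 5], [6]]
```

In real Spark Streaming the batches are cut by time (the batch interval) rather than by count, but the processing model is the same: batch-style operations applied to each small chunk in turn.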

  6. Spark Streaming: receiving data

  7. Spark Streaming: execution model

  8. Spark Streaming API. Transformations on DStreams: a rich, expressive API based on the core Spark API, including window operations. Output operations on DStreams: these push a DStream's data out to external systems such as a database or a file system.
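
The window operations mentioned above can be sketched without Spark: a window of length `window_len` batch intervals, sliding by `slide` intervals, simply concatenates the last few micro-batches each time it fires. The function name `windowed` is my own; it is a toy model of `DStream.window()`, not the real API:

```python
def windowed(batches, window_len, slide):
    """Emit overlapping windows over a list of micro-batches, in the
    spirit of DStream.window(windowLength, slideInterval), with both
    lengths expressed in units of the batch interval."""
    out = []
    for end in range(window_len, len(batches) + 1, slide):
        # Concatenate the window_len most recent batches into one window.
        window = [x for b in batches[end - window_len:end] for x in b]
        out.append(window)
    return out

batches = [[1], [2], [3], [4]]
print(windowed(batches, window_len=2, slide=1))
# [[1, 2], [2, 3], [3, 4]]
```

Any ordinary transformation (count, reduce, etc.) can then be applied to each window, which is exactly how windowed counts and aggregations work on DStreams.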

  9. Spark Streaming at VIP: 数据彩超, real-time analysis of recommendation logs, real-time analysis of service logs

  10. Spark Monitoring: using Zabbix to monitor the Spark cluster

  11. Spark Monitoring: using Zabbix to monitor the Spark cluster

  12. Spark Monitoring: the Spark Web UI

  13. Spark Streaming tuning: achieving a stable configuration, then achieving lower latency

  14. Tuning - stable configuration. How to verify whether the system is stable? Look for "Total delay" in the Spark driver's log4j logs. If the delay stays comparable to the batch interval, the system is stable. If the delay keeps increasing, the system cannot process data as fast as it receives it, and is therefore unstable.
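
The stability check above can be turned into a small heuristic. This sketch (the function name, the 5-sample window, and the 2x "comparable" factor are my own choices, not anything Spark-defined) flags an app as unstable when the recent total delays keep climbing or have drifted far above the batch interval:

```python
def is_stable(total_delays_ms, batch_interval_ms, factor=2.0):
    """Heuristic version of the slide's rule: stable if 'Total delay'
    stays comparable to the batch interval instead of growing."""
    recent = total_delays_ms[-5:]
    # Monotonically increasing delays mean the app is falling behind.
    growing = all(b > a for a, b in zip(recent, recent[1:]))
    # "Comparable" taken here as within `factor` of the batch interval.
    bounded = recent[-1] <= factor * batch_interval_ms
    return bounded and not growing

# Delays hovering near a 5 s batch interval: stable.
print(is_stable([5100, 4900, 5200, 5000, 5100], 5000))  # True
# Delays climbing batch after batch: falling behind.
print(is_stable([5000, 9000, 14000, 20000, 27000], 5000))  # False
```

In practice one would scrape these delay values from the driver logs or the streaming tab of the web UI and alert on the unstable case.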

  15. Tuning - stable configuration. How to find a stable configuration? At the beginning of tuning, test the app with a conservative batch interval (say, 5-10 seconds) and a low data rate. Then increase the data rate and find the bottleneck in the web UI. If the first stage, on the raw data, dominates the job's processing time: create multiple DStreams and union them together into a single DStream, or use the repartition() function to spread the raw data across more partitions. If any of the subsequent stages dominates the processing time: increase the number of reducers, or add more workers.
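
The effect of repartitioning the raw data can be sketched without Spark. This toy version spreads one receiver's batch round-robin across more partitions so more tasks can run in parallel (real `DStream.repartition(n)` does a hash-based shuffle rather than round-robin; the distribution scheme here is only illustrative):

```python
def repartition(records, num_partitions):
    """Spread records from one input partition across num_partitions,
    illustrating the effect of DStream.repartition(n) on data that
    arrived through a single receiver."""
    parts = [[] for _ in range(num_partitions)]
    for i, rec in enumerate(records):
        parts[i % num_partitions].append(rec)
    return parts

# One receiver's batch spread over 4 partitions, so 4 tasks can run at once.
print(repartition(list(range(8)), 4))
# [[0, 4], [1, 5], [2, 6], [3, 7]]
```

Unioning several DStreams attacks the same bottleneck from the other side: instead of splitting one receiver's output, it adds more receivers so the data already arrives in more partitions.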

  16. Tuning - lower latency. Partition tuning: commonly between 100 and 10,000 partitions. Lower bound: at least ~2x the number of cores in the cluster. Upper bound: ensure tasks take at least ~100 ms each.
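
Those two rules of thumb can be combined into a quick back-of-the-envelope calculation. The upper bound below (batch interval divided by the 100 ms minimum task time) is my own way of operationalizing "tasks should take at least 100 ms" for a streaming batch, not a formula from the slides:

```python
def partition_bounds(total_cores, batch_ms, min_task_ms=100):
    """Rules of thumb: at least ~2x the cluster's cores, and few enough
    partitions that each task still runs for ~100 ms or more."""
    lower = 2 * total_cores
    # More partitions than batch_ms / min_task_ms would make tasks so
    # short that scheduling overhead dominates.
    upper = batch_ms // min_task_ms
    return lower, max(lower, upper)

# 16-core cluster, 5 s batches: roughly 32 to 50 partitions.
print(partition_bounds(16, 5000))  # (32, 50)
```

The final choice inside that range still has to be validated empirically against the "Total delay" stability check from slide 14.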

  17. Tuning - lower latency. Memory tuning: use Kryo serialization, use the concurrent mark-and-sweep (CMS) garbage collector, and set "spark.storage.memoryFraction" to a reasonable value.
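
Assuming a Spark 1.x deployment (which is the era these slides describe), the three memory settings might look like the following spark-defaults.conf fragment; the 0.4 value is only an example, not a recommendation from the slides:

```properties
# Hypothetical spark-defaults.conf fragment for a Spark 1.x streaming app.
# Kryo is faster and more compact than Java serialization.
spark.serializer                 org.apache.spark.serializer.KryoSerializer
# CMS keeps GC pauses short, which matters for latency-sensitive batches.
spark.executor.extraJavaOptions  -XX:+UseConcMarkSweepGC
# Example value: shrink the RDD cache to leave more room for execution.
spark.storage.memoryFraction     0.4
```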

  18. Tuning - lower latency. Shuffle tuning: before Spark 1.1, set "spark.shuffle.consolidateFiles" to true; it can improve filesystem performance for shuffles with large numbers of reduce tasks. Since Spark 1.1, use the sort-based shuffle manager. Set "spark.local.dir" to a comma-separated list of directories on different disks.
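
As a spark-defaults.conf fragment, the shuffle settings above might look like this (the directory paths are placeholders for whatever disks the cluster actually has):

```properties
# Hypothetical spark-defaults.conf fragment; paths are examples only.
# Pre-1.1: merge shuffle map outputs into fewer on-disk files.
spark.shuffle.consolidateFiles  true
# 1.1 and later: switch to the sort-based shuffle manager.
spark.shuffle.manager           sort
# Spread shuffle and spill files across several physical disks.
spark.local.dir                 /mnt/disk1/spark,/mnt/disk2/spark
```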

  19. Thanks!
