Spark Streaming monitoring and tuning. Presenter: Luan Xuedong
What is Spark? Apache Spark is a fast and general engine for large-scale data processing: speed, ease of use, generality, and integration with Hadoop.
What is Spark Streaming? Spark Streaming is a sub-project of Apache Spark. Spark is a batch-processing platform similar to Apache Hadoop, and Spark Streaming is a real-time processing tool that runs on top of the Spark engine.
Why Spark Streaming? Low latency, high throughput, and fault tolerance. A DStream is a sequence of micro-batches of RDDs: operations are similar to those on RDDs, and lineage provides fault tolerance. Supports Flume, Kafka, Twitter, Kinesis, etc. Built on the Spark Core execution engine and API, and runs as a long-running Spark application.
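A minimal sketch of such a pipeline; the socket source, host/port, and word-count logic are illustrative assumptions, not from the slides:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._ // pair-DStream implicits (needed on Spark < 1.3)

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-sketch")
    // Each 5-second batch interval becomes one RDD in the DStream.
    val ssc = new StreamingContext(conf, Seconds(5))

    // Illustrative socket source; Kafka, Flume, or Kinesis receivers plug in the same way.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Operations mirror the core RDD API; lineage provides fault tolerance.
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()            // the job runs as a long-lived Spark application
    ssc.awaitTermination()
  }
}
```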
Spark Streaming execution model
Spark Streaming API. Transformations on DStreams: a rich, expressive API based on the core Spark API, including window operations. Output operations on DStreams: these push a DStream's data out to external systems such as a database or a file system.
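A spark-shell style sketch of all three API families; the source, durations, and the printing sink are illustrative assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._ // pair-DStream implicits (needed on Spark < 1.3)

val conf = new SparkConf().setAppName("dstream-api-sketch")
val ssc = new StreamingContext(conf, Seconds(5))

// Transformations: same shape as the core Spark API.
val pairs = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map((_, 1))

// Window operation: counts over the last 30s, recomputed every 10s.
// Window and slide durations must be multiples of the batch interval.
val windowedCounts = pairs.reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))

// Output operation: push each micro-batch out to an external system.
windowedCounts.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // In practice: open one DB/file connection per partition, write, close.
    records.foreach(println) // stand-in for a real external write
  }
}
```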
Spark Streaming at VIP: the 数据彩超 internal data-monitoring product, real-time log analysis for recommendations, and real-time analysis of service logs.
Spark Monitoring: using Zabbix to monitor the Spark cluster.
Spark Monitoring: the Spark web UI.
Spark Streaming tuning: achieving a stable configuration, and achieving lower latency.
Tuning - stable configuration. How to verify whether the system is stable? Look for "Total delay" in the Spark driver's log4j logs. If the delay stays comparable to the batch interval, the system is stable; if the delay keeps increasing, the system cannot process data as fast as it is receiving it, and is therefore unstable.
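The same delay figure is also exposed programmatically through the StreamingListener API, so an alerting hook (e.g. one feeding Zabbix) does not have to scrape logs. A minimal sketch:

```scala
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Reports each batch's total delay (scheduling delay + processing time).
class DelayListener extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val info = batch.batchInfo
    info.totalDelay.foreach { ms =>
      // A steadily growing value here means the app cannot keep up.
      println(s"batch ${info.batchTime}: total delay $ms ms")
    }
  }
}

// Register on the StreamingContext before ssc.start():
// ssc.addStreamingListener(new DelayListener())
```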
Tuning - stable configuration. How to find a stable configuration? At the beginning of tuning, test the app with a conservative batch interval (say, 5-10 seconds) and a low data rate. Then increase the data rate and find the bottleneck on the web UI. If the first stage, over the raw data, takes most of the job processing time: create multiple DStreams and union them into a single DStream, or use repartition() to spread the raw data across more partitions. If any of the subsequent stages dominate the processing time: increase the number of reducers, or add workers.
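A spark-shell style sketch of both ingest-side fixes; the socket source, receiver count, and partition count are assumptions for illustration:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("ingest-parallelism-sketch")
val ssc = new StreamingContext(conf, Seconds(5))

// Several receivers ingest in parallel; union merges them into one DStream.
val numReceivers = 4
val inputs = (1 to numReceivers).map(_ => ssc.socketTextStream("localhost", 9999))
val unioned = ssc.union(inputs)

// Spread the raw data over more partitions before the heavy first stage.
val repartitioned = unioned.repartition(32)
```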
Tuning - lower latency. Partition tuning: commonly between 100 and 10,000 partitions. Lower bound: at least ~2x the number of cores in the cluster. Upper bound: ensure tasks take at least 100 ms.
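One way to apply those bounds, assuming a hypothetical 32-core cluster:

```scala
import org.apache.spark.SparkConf

// Assumed cluster size: 8 executors x 4 cores; adjust to your deployment.
val totalCores = 32
val conf = new SparkConf()
  // Start near the lower bound (~2-3x cores) and raise it only while
  // tasks on the web UI still take at least ~100 ms each.
  .set("spark.default.parallelism", (totalCores * 3).toString)
```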
Tuning - lower latency. Memory tuning: use Kryo serialization, use the concurrent mark-and-sweep (CMS) garbage collector, and set "spark.storage.memoryFraction" to a reasonable value.
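The three settings as Spark 1.x configuration; the fraction value is an illustrative starting point, not a recommendation from the slides:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Kryo is faster and more compact than the default Java serialization.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // CMS trades some throughput for shorter, more predictable GC pauses.
  .set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC")
  // Fraction of executor memory reserved for cached blocks (Spark 1.x knob).
  .set("spark.storage.memoryFraction", "0.4")
```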
Tuning - lower latency. Shuffle tuning: before Spark 1.1, set "spark.shuffle.consolidateFiles" to true; it can improve filesystem performance for shuffles with large numbers of reduce tasks. Since Spark 1.1, use the sort-based shuffle manager. Also set "spark.local.dir" to a comma-separated list of directories on different disks.
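The same advice as configuration; the directory paths are placeholders:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Spark < 1.1: merge shuffle outputs into fewer files per core.
  .set("spark.shuffle.consolidateFiles", "true")
  // Spark >= 1.1: sort-based shuffle (the default from 1.2 onward).
  .set("spark.shuffle.manager", "sort")
  // Spill shuffle data across independent physical disks.
  .set("spark.local.dir", "/mnt/disk1/spark,/mnt/disk2/spark")
```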