Spark Streaming monitoring and tuning. Presenter: Luan Xuedong
What is Spark? Apache Spark is a fast and general engine for large-scale data processing: speed, ease of use, generality, and integration with Hadoop.
What is Spark Streaming? Spark Streaming is a sub-project of Apache Spark. Spark is a batch-processing platform similar to Apache Hadoop, and Spark Streaming is a real-time processing tool that runs on top of the Spark engine.
Why Spark Streaming? Low latency, high throughput, and fault tolerance. A DStream is a sequence of micro-batches of RDDs: operations are similar to those on RDDs, and lineage provides fault tolerance. Supports Flume, Kafka, Twitter, Kinesis, etc. Built on the Spark Core execution engine and API, and runs as a long-running Spark application.
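A minimal sketch of such a pipeline; the socket source, host/port, and word-count logic are illustrative assumptions, not from the slides:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._ // pair-DStream implicits (needed on Spark < 1.3)

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-sketch")
    // Each 5-second batch interval becomes one RDD in the DStream.
    val ssc = new StreamingContext(conf, Seconds(5))

    // Illustrative socket source; Kafka, Flume, or Kinesis receivers plug in the same way.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Operations mirror the core RDD API; lineage provides fault tolerance.
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()            // the job runs as a long-lived Spark application
    ssc.awaitTermination()
  }
}
```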
Spark Streaming execution model
Spark Streaming API. Transformations on DStreams: a rich, expressive API based on the core Spark API, including window operations. Output operations on DStreams: these push a DStream's data out to external systems such as a database or a file system.
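A spark-shell style sketch of all three API families; the source, durations, and the printing sink are illustrative assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._ // pair-DStream implicits (needed on Spark < 1.3)

val conf = new SparkConf().setAppName("dstream-api-sketch")
val ssc = new StreamingContext(conf, Seconds(5))

// Transformations: same shape as the core Spark API.
val pairs = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map((_, 1))

// Window operation: counts over the last 30s, recomputed every 10s.
// Window and slide durations must be multiples of the batch interval.
val windowedCounts = pairs.reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))

// Output operation: push each micro-batch out to an external system.
windowedCounts.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // In practice: open one DB/file connection per partition, write, close.
    records.foreach(println) // stand-in for a real external write
  }
}
```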
Spark Streaming at VIP: the 数据彩超 internal data-monitoring product, real-time log analysis for recommendations, and real-time analysis of service logs.
Spark Monitoring: using Zabbix to monitor the Spark cluster.
Spark Monitoring: the Spark web UI.
Spark Streaming tuning: achieving a stable configuration, and achieving lower latency.
Tuning - stable configuration. How to verify whether the system is stable? Look for "Total delay" in the Spark driver's log4j logs. If the delay stays comparable to the batch interval, the system is stable; if the delay keeps increasing, the system cannot process data as fast as it is receiving it, and is therefore unstable.
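The same delay figure is also exposed programmatically through the StreamingListener API, so an alerting hook (e.g. one feeding Zabbix) does not have to scrape logs. A minimal sketch:

```scala
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Reports each batch's total delay (scheduling delay + processing time).
class DelayListener extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val info = batch.batchInfo
    info.totalDelay.foreach { ms =>
      // A steadily growing value here means the app cannot keep up.
      println(s"batch ${info.batchTime}: total delay $ms ms")
    }
  }
}

// Register on the StreamingContext before ssc.start():
// ssc.addStreamingListener(new DelayListener())
```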
Tuning - stable configuration. How to find a stable configuration? At the beginning of tuning, test the app with a conservative batch interval (say, 5-10 seconds) and a low data rate. Then increase the data rate and find the bottleneck on the web UI. If the first stage, over the raw data, takes most of the job processing time: create multiple DStreams and union them into a single DStream, or use repartition() to spread the raw data across more partitions. If any of the subsequent stages dominate the processing time: increase the number of reducers, or add workers.
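A spark-shell style sketch of both ingest-side fixes; the socket source, receiver count, and partition count are assumptions for illustration:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("ingest-parallelism-sketch")
val ssc = new StreamingContext(conf, Seconds(5))

// Several receivers ingest in parallel; union merges them into one DStream.
val numReceivers = 4
val inputs = (1 to numReceivers).map(_ => ssc.socketTextStream("localhost", 9999))
val unioned = ssc.union(inputs)

// Spread the raw data over more partitions before the heavy first stage.
val repartitioned = unioned.repartition(32)
```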
Tuning - lower latency. Partition tuning: commonly between 100 and 10,000 partitions. Lower bound: at least ~2x the number of cores in the cluster. Upper bound: ensure tasks take at least 100 ms.
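One way to apply those bounds, assuming a hypothetical 32-core cluster:

```scala
import org.apache.spark.SparkConf

// Assumed cluster size: 8 executors x 4 cores; adjust to your deployment.
val totalCores = 32
val conf = new SparkConf()
  // Start near the lower bound (~2-3x cores) and raise it only while
  // tasks on the web UI still take at least ~100 ms each.
  .set("spark.default.parallelism", (totalCores * 3).toString)
```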
Tuning - lower latency. Memory tuning: use Kryo serialization, use the concurrent mark-and-sweep (CMS) garbage collector, and set "spark.storage.memoryFraction" to a reasonable value.
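The three settings as Spark 1.x configuration; the fraction value is an illustrative starting point, not a recommendation from the slides:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Kryo is faster and more compact than the default Java serialization.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // CMS trades some throughput for shorter, more predictable GC pauses.
  .set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC")
  // Fraction of executor memory reserved for cached blocks (Spark 1.x knob).
  .set("spark.storage.memoryFraction", "0.4")
```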
Tuning - lower latency. Shuffle tuning: before Spark 1.1, set "spark.shuffle.consolidateFiles" to true; it can improve filesystem performance for shuffles with large numbers of reduce tasks. Since Spark 1.1, use the sort-based shuffle manager. Also set "spark.local.dir" to a comma-separated list of directories on different disks.
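The same advice as configuration; the directory paths are placeholders:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Spark < 1.1: merge shuffle outputs into fewer files per core.
  .set("spark.shuffle.consolidateFiles", "true")
  // Spark >= 1.1: sort-based shuffle (the default from 1.2 onward).
  .set("spark.shuffle.manager", "sort")
  // Spill shuffle data across independent physical disks.
  .set("spark.local.dir", "/mnt/disk1/spark,/mnt/disk2/spark")
```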