Arthur: The Spark Debugger
Ankur Dave, Matei Zaharia, Murphy McCauley, Scott Shenker, Ion Stoica
UC Berkeley

Motivation
Debugging large parallel jobs is hard.
Current approaches to debugging:
• Repeatedly modify and rerun the program
• Run isolated code in the Spark shell

Introducing Arthur
Interactive replay debugger for Spark programs
• Reconstruct and query intermediate datasets
• Visualize the program's data flow
• Rerun any task in a single-process debugger
• Trace records across transformations
• Aggregate exceptions at the master

Spark Programming Model
Example: Find how many Wikipedia articles match a search term
HDFS file → map(_.split('\t')(3)) → articles → filter(_.contains("Berkeley")) → matches → count() → 10,000
Diagram labels: articles and matches are Resilient Distributed Datasets (RDDs); map and filter are deterministic transformations.
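
The pipeline on this slide can be sketched without a cluster. The version below is only an illustration: plain Scala collections stand in for RDDs, and the `WikipediaSearch` object, the `countMatches` helper, the sample records, and the assumption that the article text sits in tab-separated field 3 are all hypothetical. With Spark, `lines` would instead be read from HDFS.

```scala
// Plain Scala collections stand in for RDDs; nothing here calls Spark.
object WikipediaSearch {
  // Assumed layout: tab-separated records with the article text in field 3.
  def countMatches(lines: Seq[String], term: String): Long = {
    val articles = lines.map(_.split('\t')(3))       // keep only the text column
    val matches  = articles.filter(_.contains(term)) // deterministic filter
    matches.size.toLong                              // plays the role of count()
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample records, not taken from the Wikipedia dump.
    val sample = Seq(
      "1\tUC Berkeley\t2012\tBerkeley is a city in California",
      "2\tSpark\t2012\tA cluster computing framework",
      "3\tAMPLab\t2012\tA lab at UC Berkeley"
    )
    println(countMatches(sample, "Berkeley"))
  }
}
```

A record with fewer than four fields makes the `map` step throw, which is the kind of per-task failure a debugger would need to surface.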

Approach
Diagram: Master, Workers, and Log. Edge labels: tasks; results, checksums, events; lineage, checksums, events.

Approach
Diagram: Master, Workers, and Log, with user input. Edge labels: lineage; tasks; results, checksums.

Detecting Nondeterministic Transformations
Re-running a nondeterministic transformation may yield different results.
Arthur checksums RDD contents and alerts the user when a checksum does not match.
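
A minimal sketch of the checksum idea, under stated assumptions: the slides do not say which checksum Arthur uses or how it compares them, so CRC32 from the Java standard library and the "run twice and compare" check below are stand-ins, and `ChecksumCheck`, `checksum`, and `isDeterministic` are hypothetical names.

```scala
import java.util.zip.CRC32

object ChecksumCheck {
  // Order-sensitive CRC32 over the UTF-8 bytes of each record.
  // A zero byte after each record keeps ("ab", "c") distinct from ("a", "bc").
  def checksum(records: Seq[String]): Long = {
    val crc = new CRC32()
    records.foreach { r =>
      crc.update(r.getBytes("UTF-8"))
      crc.update(0)
    }
    crc.getValue
  }

  // Applies the transformation twice to the same input and compares checksums.
  // A mismatch proves nondeterminism; a match does not prove determinism.
  def isDeterministic(input: Seq[String])(f: String => String): Boolean =
    checksum(input.map(f)) == checksum(input.map(f))
}
```

For example, `ChecksumCheck.isDeterministic(Seq("a", "b"))(_.toUpperCase)` holds, while a function that appends a changing counter to each record fails the check.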

Demo
Example dataset: 1 GB partial Wikipedia dump
• Reconstruct and query intermediate datasets
• Visualize the program's data flow
• Rerun any task in a single-process debugger

Record Tracing
Example: query a database of users and groups
HDFS file A → map(_.split('\t')) → users
HDFS file B → map(_.split('\t')) → groups
users, groups → join() → groupCounts
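
One way to picture forward tracing on this example is to push a single input record through the same steps and see which outputs it reaches. The sketch below does this with plain Scala collections. It is not Arthur's mechanism, and the file layouts (`userName\tgroupId` and `groupId\tgroupName`), the `RecordTracing` object, and all helper names are assumptions made for illustration.

```scala
object RecordTracing {
  type Joined = (String, (String, String)) // (groupId, (userName, groupName))

  // Assumed layout "userName\tgroupId", keyed by groupId for the join.
  def parseUsers(lines: Seq[String]): Seq[(String, String)] =
    lines.map(_.split('\t')).map(f => (f(1), f(0)))

  // Assumed layout "groupId\tgroupName".
  def parseGroups(lines: Seq[String]): Seq[(String, String)] =
    lines.map(_.split('\t')).map(f => (f(0), f(1)))

  // Inner join on the key, standing in for join() on keyed datasets.
  def joinByKey(users: Seq[(String, String)], groups: Seq[(String, String)]): Seq[Joined] =
    users.flatMap { case (k, user) =>
      groups.filter(_._1 == k).map { case (_, group) => (k, (user, group)) }
    }

  // Number of users per group name.
  def groupCounts(joined: Seq[Joined]): Map[String, Int] =
    joined.groupBy { case (_, (_, group)) => group }
      .map { case (group, rows) => (group, rows.size) }

  // Forward trace: rerun the pipeline on one user record to find the joined
  // rows it produces and the groupCounts entries it contributes to.
  def traceForward(userLine: String, groupLines: Seq[String]): (Seq[Joined], Set[String]) = {
    val joined = joinByKey(parseUsers(Seq(userLine)), parseGroups(groupLines))
    (joined, groupCounts(joined).keySet)
  }
}
```

Rerunning on one record works here because map and join treat each record independently; for the aggregation step the trace only identifies which output keys the record feeds, not the final counts.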

Performance
Event logging introduces minimal overhead.

Future Plans
• More analyses like backward tracing and culprit detection
• Profiling tools for GC and memory
• Real bugs

Arthur is in development at https://github.com/mesos/spark, branch arthur.
Documentation: https://github.com/mesos/spark/wiki/Spark-Debugger
Ankur Dave
ankurd@eecs.berkeley.edu
http://ankurdave.com