The Spark Debugger

Arthur: The Spark Debugger. Ankur Dave, Matei Zaharia, Murphy McCauley, Scott Shenker, Ion Stoica. UC Berkeley. Motivation: debugging large parallel jobs is hard. Current approaches to debugging are to repeatedly modify and rerun the program, or to run isolated code in the Spark shell.

Presentation Transcript


  1. Arthur: The Spark Debugger
     Ankur Dave, Matei Zaharia, Murphy McCauley, Scott Shenker, Ion Stoica
     UC Berkeley

  2. Motivation
     Debugging large parallel jobs is hard.
     Current approaches to debugging:
     • Repeatedly modify and rerun the program
     • Run isolated code in the Spark shell

  3. Introducing Arthur
     Interactive replay debugger for Spark programs:
     • Reconstruct and query intermediate datasets
     • Visualize the program's data flow
     • Rerun any task in a single-process debugger
     • Trace records across transformations
     • Aggregate exceptions at the master

  4. Spark Programming Model
     Example: find how many Wikipedia articles match a search term.
     Programs are built from Resilient Distributed Datasets (RDDs) and deterministic transformations.
     Data flow: HDFS file → map(_.split('\t')(3)) → articles → filter(_.contains("Berkeley")) → matches → count() → 10,000
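The data flow on this slide can be sketched without a cluster. The following is a minimal illustration that uses plain Scala collections in place of RDDs; the object name `ArticleSearch`, the helper `countMatches`, and the assumption that column 3 of each tab-separated line holds the article text are hypothetical and not taken from the slides.

```scala
// Plain Scala collections stand in for RDDs here; no Spark dependency is needed.
object ArticleSearch {
  // Hypothetical helper: takes column 3 of each tab-separated line
  // (assumed to hold the article text) and counts lines containing the term.
  def countMatches(lines: Seq[String], term: String): Int = {
    val articles = lines.map(_.split('\t')(3))        // map(_.split('\t')(3))
    val matches  = articles.filter(_.contains(term))  // filter(_.contains("Berkeley"))
    matches.size                                      // count()
  }
}
```

Each intermediate value (`articles`, `matches`) corresponds to one of the named datasets on the slide, which is what a replay debugger would let the user reconstruct and query.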

  5. Approach
     [Diagram: Master, Workers, and Log. Labels: "tasks"; "results, checksums, events"; "lineage, checksums, events".]

  6. Approach
     [Diagram: Master, Workers, and Log. Labels: "user input"; "lineage"; "tasks"; "results, checksums".]

  7. Detecting Nondeterministic Transformations
     Re-running a nondeterministic transformation may yield different results.
     Arthur checksums RDD contents and alerts the user if necessary.
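One way to picture the checksum comparison is the sketch below. It is not Arthur's implementation: the object `ChecksumCheck`, the choice of CRC32, and the order-independent sum of per-record checksums are all assumptions made for illustration.

```scala
import java.util.zip.CRC32

object ChecksumCheck {
  // Assumed scheme: sum the CRC32 of every record, so the result
  // does not depend on the order in which records are produced.
  def checksum(records: Seq[String]): Long =
    records.map { r =>
      val crc = new CRC32()
      crc.update(r.getBytes("UTF-8"))
      crc.getValue
    }.sum

  // Applies the transformation twice and reports whether both runs
  // produced the same checksum.
  def isDeterministic[A](input: Seq[A])(f: A => String): Boolean =
    checksum(input.map(f)) == checksum(input.map(f))
}
```

A transformation that reads hidden mutable state, the clock, or a random number generator would typically fail this check, which is the situation in which a debugger relying on replay needs to warn the user.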

  8. Demo
     Example dataset: 1 GB partial Wikipedia dump
     • Reconstruct and query intermediate datasets
     • Visualize the program's data flow
     • Rerun any task in a single-process debugger

  9. Record Tracing
     Example: query a database of users and groups.
     Data flow:
     HDFS file A → map(_.split('\t')) → users
     HDFS file B → map(_.split('\t')) → groups
     users, groups → join() → groupCounts
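Tracing a record across a join can be illustrated by carrying provenance alongside each value. The sketch below uses plain Scala collections; the `Traced` wrapper, the `file:lineIndex` identifiers, and the two-column input layout are hypothetical choices for illustration and do not describe how Arthur itself tracks records.

```scala
object RecordTrace {
  // A value paired with the identifiers of the source lines it derives from.
  final case class Traced[A](value: A, sources: Set[String])

  // Parses tab-separated lines into (key, value) pairs and tags each
  // with an assumed "file:lineIndex" identifier.
  def load(file: String, lines: Seq[String]): Seq[Traced[(String, String)]] =
    lines.zipWithIndex.map { case (line, i) =>
      val fields = line.split('\t')
      Traced((fields(0), fields(1)), Set(s"$file:$i"))
    }

  // Inner join on the key; each output record inherits the sources of both inputs.
  def join(
      left: Seq[Traced[(String, String)]],
      right: Seq[Traced[(String, String)]]
  ): Seq[Traced[(String, (String, String))]] =
    for {
      l <- left
      r <- right
      if l.value._1 == r.value._1
    } yield Traced((l.value._1, (l.value._2, r.value._2)), l.sources ++ r.sources)
}
```

With this representation, asking where an output record came from reduces to reading its `sources` set, at the cost of extra memory per record.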

  10. Performance
      Event logging introduces minimal overhead.

  11. Future Plans
      • More analyses like backward tracing and culprit detection
      • Profiling tools for GC and memory
      • Real bugs

  12. Arthur is in development at https://github.com/mesos/spark, branch arthur.
      Documentation: https://github.com/mesos/spark/wiki/Spark-Debugger
      Ankur Dave, ankurd@eecs.berkeley.edu, http://ankurdave.com
