
lbsh: Breadcrumbs as You Work





  1. lbsh: Breadcrumbs as You Work Eric Osterweil

  2. Problem • Measurement studies, simulations, and many other investigations require a lot of data work • Data processing (or experimentation) can be ad-hoc • Given some raw data (measurements, observations, etc.) we often “try a number of things” • Simulations are often tweaked and re-run numerous times • In this mode, experiments can recursively lead to subsequent experiments • How can a researcher always remember the exact provenance of their results?

  3. What is Data Provenance? • The concept of data provenance is a lot like “chain of custody” • More specifically, we borrow a definition from [1]: • “… We defined data provenance as information that helps determine the derivation history of a data product, starting from its original sources.”

  4. Why Does Provenance Matter? • Do we need to be able to remember exactly how we got results? • Setting: • Student: does a lot of processing, gets “compelling” results • Advisor: wants to re-run with new data • Student: panics (silently of course) • Reviewers: hope that results are reproducible

  5. Sharing Work With Others? • How many people have had to re-implement someone else’s algos for a paper? • How about getting a tarball of crufty scripts from an author and trying to get them to work? • What if you could get a tarball that was totally self-descriptive? • What if the tarball could totally describe the work that led to that user’s results? • What if it could allow you to re-run the whole thing?

  6. Example • sort data2.out > sorted-data2.out • awk '{print $1 "\t" $5}' sorted-data2.out > dook • sort data3.out > sorted-data3.out • join -1 1 -2 1 dook sorted-data3.out > blah • vi script.sh • script.sh dook • awk '{tot+=$1*$1}END{print tot}' blah > day1.out • vi blah.pl • blah.pl data1.out data2.out > day2.out • sort data1.out > sorted-data1.out • join -1 1 -2 1 dook sorted-data1.out > blah • vi blindluck.awk • blindluck.awk blah > day3.out

  7. Results? • What if it turns out that “day3.out” has the results I wanted? • Can anyone recall what the commands were for that? • Were any of the files overwritten?

  8. Outline • Inspiration • Goal • lbsh (Pound-Shell) • Usage • Contribution • Future

  9. Inspiration • Computer Scientists cannot be the first group to have this problem • In fact, we’re not • Science is predicated on reproducibility, so how do (for example) biologists deal with this? • They have lab-books, and they take notes

  10. Can We Do the Same? • A biologist may make a few notes and then spend several days conducting experiments • Conversely, we process data as fast as we can type, and block on I/O occasionally • Note taking is a small task in proportion to a biologist’s experiment • Note taking is a large task in proportion to our fast-fingers • Even then, a lab-book can look like a dictionary (too full of noise to use)

  11. What Else Do People Do? • Scientific Workflow • Design experiments in workflow environments • Makes each experiment transparent and re-runnable • Lower level of noise • Of course, users must do all work in a foreign, and oftentimes restrictive, environment

  12. Observation • We can’t always (ever?) know what experiments will be fruitful before we run them • So, we may not want to set up a large experiment and design a workflow every time we try something • Corollary: We may not realize our results are good until some time after we first examine them

  13. What Holds Us Back? • A lack of motivation? • Shouldn’t a solution be • Easy • Supported by automation that makes it worth doing? Why bother if it isn’t directly useful?

  14. Goal • What we really want is to know how “day3.out” was generated because: • We need to be sure we did it right • We need to be able to show our collaborators that we aren’t smoking crack • We often want to re-run our analysis with new data • More? Let’s stop here for now…

  15. How COULD We Do This? • Keep a manual lab-book file of all commands run • This is feasible, but very prone to both bloat and stale/missing/mistaken info • It’s a very manual process and a pain. You can’t copy-and-paste w/o stripping the prompts, etc. • Look at the history file • Multiple shells will cause holes in the history • What about commands issued in: R, gnuplot, etc? • An ideal solution… • Automatic, just specify start and stop points. • Wasted experiments are not a factor

  16. Meaningless Eye Candy

  17. lbsh (Pound-Shell) • Let’s provide lab-book support on the command line! • While typing, we should be able to just “start an experiment”, do some work, and then “stop” it • In addition, we should keep track of what files were accessed and modified during this • Goal: provide provenance for files based on lab-book entries

  18. Level-Setting • lbsh is in alpha • The code works well, but there are certainly bugs • The features that are there are a starting point • Feedback is welcome • Tell me about bugs, tell me what you like, tell me what you dislike, etc. • The page is hosted here, but there are links to SourceForge for bug tracking and feature requests: http://lbsh.cs.ucla.edu/

  19. How Does it Work? • lbsh is a monitor that spawns a worker shell and passes commands to it • When a user “starts an experiment”, lbsh starts recording • The experiments are entered as separate lab-book entries
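
The monitor/worker split described on this slide can be pictured with a small shell sketch. This is not lbsh's code: the `LABBOOK` path, the function name `record_and_forward`, and the literal `start`/`stop` lines are all assumptions made for illustration (the slides say lbsh toggles recording with ctrl-b).

```shell
# Hypothetical sketch of a recording monitor; NOT lbsh's actual code.
# Assumptions: the LABBOOK path and the literal "start"/"stop" lines are
# invented for illustration only.

LABBOOK=${LABBOOK:-labbook.log}

# Read command lines on stdin, log the ones entered while an experiment
# is running, and forward every command to stdout for a worker shell.
# The ( ... ) body runs in a subshell, so its variables stay local.
record_and_forward() (
    recording=0
    exp_id=0
    while IFS= read -r line; do
        case $line in
            start)
                recording=1
                exp_id=$((exp_id + 1))
                printf '# experiment %d begin\n' "$exp_id" >> "$LABBOOK"
                ;;
            stop)
                if [ "$recording" -eq 1 ]; then
                    printf '# experiment %d end\n' "$exp_id" >> "$LABBOOK"
                fi
                recording=0
                ;;
            *)
                if [ "$recording" -eq 1 ]; then
                    printf '%s\n' "$line" >> "$LABBOOK"
                fi
                # Pass the command on to the worker shell unchanged.
                printf '%s\n' "$line"
                ;;
        esac
    done
)
```

Under these assumptions, running `record_and_forward | sh` gives a single worker shell that executes everything typed, while only the commands between `start` and `stop` land in the lab-book. A real monitor would also have to handle interactive programs that read the terminal themselves, which this sketch ignores.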

  20. Specifically… • lbsh uses a user config file ($HOME/.lbshrc) • Records commands (even in R, etc.) • Stats files in a user-specified directory (atime/mtime) • Can repeat experiments • Is able to avoid repeating editor sessions (vi, emacs, etc.) • Can report the experimental provenance of individual files • i.e. “How did I get ‘day3.out’?”
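
The "stats files in a user-specified directory" step can be approximated with a timestamp marker and `find -newer`. The helper names `begin_experiment` and `files_modified_since` are hypothetical, and the slides do not say how lbsh actually does this bookkeeping; the sketch covers modification times only.

```shell
# Hypothetical helpers, not lbsh's real bookkeeping.
# Assumption: the marker file lives OUTSIDE the watched directory,
# so it never shows up in its own listing.

# Create (or refresh) a timestamp marker when an experiment starts.
begin_experiment() {   # $1 = marker file
    touch "$1"
}

# List regular files under a directory whose mtime is newer than the marker.
files_modified_since() {   # $1 = directory, $2 = marker file
    find "$1" -type f -newer "$2" | sort
}
```

A session would call `begin_experiment /tmp/exp.marker` at the start and `files_modified_since ~/data /tmp/exp.marker` at the stop point (both paths are placeholders). Files touched within the same timestamp tick as the marker can be missed, and tracking accessed files would depend on atime, which many mounts update lazily or not at all.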

  21. Usage • To use lbsh, just launch it • To start/stop an experiment: • ctrl-b • To tell if lbsh is running, or if an experiment is running: • lbshrunning.sh -v • exprunning.sh -v • To find a file’s provenance: • file-provenance.pl • To re-run an old experiment: • exeggutor.pl <experiment ID>

  22. Revisiting Example • sort data2.out > sorted-data2.out • awk '{print $1 "\t" $5}' sorted-data2.out > dook • sort data3.out > sorted-data3.out • join -1 1 -2 1 dook sorted-data3.out > blah • vi script.sh • script.sh dook • awk '{tot+=$1*$1}END{print tot}' blah > day1.out • vi blah.pl • blah.pl data1.out data2.out > day2.out • sort data1.out > sorted-data1.out • join -1 1 -2 1 dook sorted-data1.out > blah • vi blindluck.awk • blindluck.awk blah > day3.out
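
One way to picture answering “How did I get ‘day3.out’?” for the commands above is a backward walk over a lab-book. The sketch below assumes a made-up lab-book format, one line per command written as `output|space-separated inputs|command`; this is not lbsh's file format, and `trace_provenance` is a hypothetical name. Walking from the last entry to the first means an overwritten file (such as `blah`, which is written twice) is attributed only to the write that was current when the target was produced.

```shell
# Hypothetical sketch; the "output|inputs|command" lab-book format is an
# assumption for illustration, not lbsh's format. Commands that themselves
# contain "|" are not handled.

# Print, in original order, the commands that contributed to a target file.
trace_provenance() {   # $1 = target file, $2 = lab-book file
    awk -F'|' -v target="$1" '
        { out[NR] = $1; ins[NR] = $2; cmd[NR] = $3 }
        END {
            want[target] = 1
            # Walk backward so only the latest write to each file counts.
            for (i = NR; i >= 1; i--) {
                if (!(out[i] in want)) continue
                delete want[out[i]]
                keep[i] = 1
                # The inputs of a kept command become files to explain.
                n = split(ins[i], f, " ")
                for (j = 1; j <= n; j++) want[f[j]] = 1
            }
            for (i = 1; i <= NR; i++) if (i in keep) print cmd[i]
        }' "$2"
}
```

For the thirteen commands on this slide, annotated by hand in that format, `trace_provenance day3.out labbook.txt` would report the two sorts of `data2.out` and `data1.out`, the awk step producing `dook`, the second `join`, the `vi blindluck.awk` edit, and the final `blindluck.awk` run, while leaving out the `data3.out` branch whose `blah` was overwritten.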

  23. Real Experiments • This example is too simple to be interesting • Though simple is good • Let’s see the result of some real usage from a paper submission:

  24. Contribution • What we want is to make reproducibility a foregone conclusion, not a pipe dream • Can we do it? • lbsh is a simple tool that is NOT fool-proof • Evidence: I’ve already found ways to trick it • lbsh is just a useful tool that makes it easier for each of us to be more diligent • What lbsh really contributes is: • An automation framework for us to be more efficient, and more secure in our work (reproducing data, etc.) • An enabling technology for us to do better

  25. Future • In addition to tending our own farm, can we build on someone else’s work now? • Ex: IMC requires datasets to be made public to be considered for best-paper • From public data, can I automatically see how someone got their results and try to do follow-on work? • Feature requests: • SVN support: version control some files • File cleanup • Fix NFS support

  26. http://lbsh.cs.ucla.edu/

  27. References [1] Y. L. Simmhan, B. Plale, and D. Gannon. A survey of data provenance in e-science. SIGMOD Rec., 34(3):31–36, 2005.

  28. Thanks! Questions? Ideas?
