David P. Anderson Space Sciences Laboratory University of California – Berkeley


Presentation Transcript


  1. David P. Anderson Space Sciences Laboratory University of California – Berkeley Public Distributed Computing with BOINC

  2. Public-resource computing • 1 billion Internet-connected PCs in 2010 • >50% of PCs are privately owned • Assume 100M participants • At least 100 PetaFLOPs • At least 1 Exabyte (10^18 bytes) of storage • Problems • incentives, security, failures, ...
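The aggregate figures on this slide follow from per-host averages. Assuming roughly 1 GFLOPS of compute and 10 GB of donated disk per participating host (per-host values chosen here to reproduce the slide's totals; the slide does not state them), the arithmetic is:

```python
# Back-of-envelope check of the slide's aggregate figures.
# ASSUMPTION: per-host averages of ~1 GFLOPS and ~10 GB of donated disk.
participants = 100_000_000   # 100M participants (from the slide)
flops_per_host = 1e9         # assumed ~1 GFLOPS per host
disk_per_host = 10e9         # assumed ~10 GB of donated disk per host

total_flops = participants * flops_per_host   # 1e17 FLOPS
total_bytes = participants * disk_per_host    # 1e18 bytes

print(f"{total_flops / 1e15:.0f} PetaFLOPs")
print(f"{total_bytes / 1e18:.0f} Exabyte(s)")
```

Since the slide says "at least", these per-host numbers should be read as rough lower-end averages.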

  3. SETI@home • Started May 1999 • ~600,000 active participants • ~60 TeraFLOPs • Problems with current software • hard to change/add algorithms • can't share participants with other projects • inflexible data architecture

  4. SETI@home data architecture [Diagram contrasting the current and ideal data architectures. Labels: tapes; Berkeley; 50 Mbps commercial Internet; Internet2 (free); Berkeley, Stanford, USC; participants.]

  5. BOINC: Berkeley Open Infrastructure for Network Computing • Multiple projects • easy to develop and operate • independent • Support wide range of tasks • computation/storage • task “topologies” • Participant features • can choose projects, resource allocation • configurable; invisible on participant hosts • many platforms supported
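Slide 25's test gives two projects resource shares of 1 and 5. Assuming a host's resources are divided in proportion to each attached project's share (an assumption here; this slide says only that participants choose a resource allocation), the split can be computed exactly with fractions:

```python
from fractions import Fraction

def allocate(resource_shares):
    """Divide a host's resources among attached projects in proportion to
    their integer resource shares (proportional division is an assumption)."""
    total = sum(resource_shares.values())
    return {project: Fraction(share, total)
            for project, share in resource_shares.items()}

print(allocate({"test_1sec_0": 1, "test_1sec_1": 5}))
```

With shares 1 and 5, the second project would receive five times the resources of the first.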

  6. BOINC server architecture [Diagram. Components: work generator; timeout/retry; validator; assimilator; file deleter; project DB; BOINC DB; multiple data servers; scheduling server; Web interfaces (PHP).]

  7. BOINC client architecture [Diagram. Components: applications, each linked with the BOINC library; files and shared memory; screensaver messages; BOINC core client; schedulers and data servers.]

  8. Data architecture • Files • immutable, replicated • may originate on client or project • may remain resident on client • Persistent, non-intrusive file transfers • XML descriptor:

        <file_info>
            <name>arecibo_3392474_jun_23_01</name>
            <url>http://ds.ssl.berkeley.edu/a3392474</url>
            <url>http://dt.ssl.berkeley.edu/a3392474</url>
            <md5_cksum>uwi7eyufiw8e972h8f9w7</md5_cksum>
            <nbytes>10000000</nbytes>
        </file_info>
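Because files are immutable, a client can check a download against its descriptor before using it. A minimal sketch with Python's standard library, assuming the client compares both the byte count and the MD5 digest; the helper names are hypothetical, and the checksum on the slide is a placeholder rather than a real 32-digit hex MD5 digest, so real data would never match it:

```python
import hashlib
import xml.etree.ElementTree as ET

# The <file_info> descriptor from the slide (checksum value is a placeholder).
DESCRIPTOR = """
<file_info>
    <name>arecibo_3392474_jun_23_01</name>
    <url>http://ds.ssl.berkeley.edu/a3392474</url>
    <url>http://dt.ssl.berkeley.edu/a3392474</url>
    <md5_cksum>uwi7eyufiw8e972h8f9w7</md5_cksum>
    <nbytes>10000000</nbytes>
</file_info>
"""

def parse_file_info(xml_text):
    """Return the descriptor fields as a dict; a file may list several replica URLs."""
    root = ET.fromstring(xml_text.strip())
    return {
        "name": root.findtext("name"),
        "urls": [u.text for u in root.findall("url")],
        "md5_cksum": root.findtext("md5_cksum"),
        "nbytes": int(root.findtext("nbytes")),
    }

def verify(data, info):
    """Accept downloaded bytes only if both size and MD5 digest match the descriptor."""
    return (len(data) == info["nbytes"]
            and hashlib.md5(data).hexdigest() == info["md5_cksum"])

info = parse_file_info(DESCRIPTOR)
```

A client could try the replica URLs in order until one yields data that passes `verify`.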

  9. BOINC applications • Any language (C, C++, Fortran) • BOINC API • filename translation • checkpoint/restart, % done, CPU time • graphics (based on OpenGL, GLUT)
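The real API is a library linked into the application; the Python sketch below only illustrates the checkpoint/restart and %-done pattern an application would follow, with hypothetical function names. It writes each checkpoint atomically (temporary file plus rename), a common technique rather than something the slide specifies:

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the old checkpoint,
    so a crash mid-write never leaves a corrupt checkpoint."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"i": 0, "acc": 0}

def run(path, niter, checkpoint_every=100, report=None, stop_after=None):
    """Sum 0..niter-1, checkpointing every `checkpoint_every` steps.

    `report` (optional) receives the fraction done at each checkpoint.
    `stop_after` simulates the app being killed after that many steps in this
    session; work since the last checkpoint is lost and redone on restart.
    """
    state = load_checkpoint(path)
    steps = 0
    while state["i"] < niter:
        if stop_after is not None and steps == stop_after:
            return None
        state["acc"] += state["i"]
        state["i"] += 1
        steps += 1
        if state["i"] % checkpoint_every == 0:
            save_checkpoint(path, state)
            if report:
                report(state["i"] / niter)
    save_checkpoint(path, state)
    return state["acc"]
```

After an interruption, calling `run` again with the same path resumes from the last checkpoint instead of starting over.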

  10. Work units • Template for a computation • Resource estimates • Integer, FP ops; memory; disk space • Delay bound • determines retry, client abort

        <file_info>
            <name>arecibo_3392474_jun_23_01</name>
            ...
        </file_info>
        <workunit>
            <name>ar_13323313</name>
            <file_ref>
                <name>arecibo_3392474_jun_23_01</name>
                <open_name>input_file</open_name>
            </file_ref>
            <command_line>-niter 1000</command_line>
        </workunit>
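The `<file_ref>` element is what makes filename translation possible: the application opens the logical name (`input_file`), and the runtime maps it to the physical file. A Python sketch of that mapping, plus splitting the command line; `parse_workunit` and `resolve_filename` are hypothetical helpers, not the BOINC API's own routine:

```python
import shlex
import xml.etree.ElementTree as ET

WORKUNIT = """
<workunit>
    <name>ar_13323313</name>
    <file_ref>
        <name>arecibo_3392474_jun_23_01</name>
        <open_name>input_file</open_name>
    </file_ref>
    <command_line>-niter 1000</command_line>
</workunit>
"""

def parse_workunit(xml_text):
    """Return (workunit name, {open_name: physical name}, command-line args)."""
    root = ET.fromstring(xml_text.strip())
    open_names = {ref.findtext("open_name"): ref.findtext("name")
                  for ref in root.findall("file_ref")}
    args = shlex.split(root.findtext("command_line") or "")
    return root.findtext("name"), open_names, args

def resolve_filename(open_names, logical):
    """Hypothetical analogue of filename translation: logical -> physical name."""
    return open_names[logical]

name, open_names, args = parse_workunit(WORKUNIT)
```

Because the application only ever sees `input_file`, the project can rename or relocate the physical file without changing application code.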

  11. Results • An instance of a computation (completed or not) • Includes: host ID, claimed/granted credit

        <file_info>
            <name>arecibo_3392474_jun_23_01.out</name>
            ...
        </file_info>
        <result>
            <workunit_name>ar_13323313</workunit_name>
            <file_ref>
                <name>arecibo_3392474_jun_23_01.out</name>
                <open_name>output_file</open_name>
            </file_ref>
        </result>
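A result descriptor can be generated mechanically from its workunit. This sketch builds the `<result>` element shown above with ElementTree; `make_result` is a hypothetical helper, and it emits only the fields on the slide (not host ID or credit):

```python
import xml.etree.ElementTree as ET

def make_result(workunit_name, output_name, open_name="output_file"):
    """Build a <result> element that references one output file."""
    result = ET.Element("result")
    ET.SubElement(result, "workunit_name").text = workunit_name
    ref = ET.SubElement(result, "file_ref")
    ET.SubElement(ref, "name").text = output_name
    ET.SubElement(ref, "open_name").text = open_name
    return result

result_xml = ET.tostring(
    make_result("ar_13323313", "arecibo_3392474_jun_23_01.out"),
    encoding="unicode",
)
print(result_xml)
```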

  12. Scheduling • Work buffering on client • upper, lower bounds • Host attributes • FP/int/mem speeds, disk/memory sizes • network bandwidth up/down • fraction of time connected, computing • Scheduler policy: • send as much work as requested, subject to feasibility, WU deadlines
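A minimal sketch of both halves of this policy. The client asks for work only when its buffer falls below the lower bound, and then asks for enough to reach the upper bound; the server sends units until the request is covered, skipping any that would not fit on disk or would finish after their delay bound. The host model is simplified to FP speed, fraction of time computing, and free disk (an assumption; the slide lists a fuller set of attributes), and all names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Host:
    fp_ops_per_sec: float   # benchmarked floating-point speed
    computing_frac: float   # fraction of wall-clock time spent computing
    disk_free: float        # bytes available

@dataclass
class WorkUnit:
    name: str
    fp_ops: float           # estimated floating-point operations
    disk_bytes: float       # estimated disk space
    delay_bound: float      # seconds allowed before the result is overdue

def work_request(buffered_secs, lower, upper):
    """Client side: request work only when the buffer is below the lower bound."""
    return upper - buffered_secs if buffered_secs < lower else 0.0

def expected_runtime(host, wu):
    """Wall-clock estimate, scaled by how much of the time the host computes."""
    return wu.fp_ops / (host.fp_ops_per_sec * host.computing_frac)

def choose_work(host, candidates, requested_secs, queued_secs=0.0):
    """Server side: send work until the request is met, skipping infeasible units."""
    sent, total, disk = [], 0.0, host.disk_free
    for wu in candidates:
        if total >= requested_secs:
            break
        runtime = expected_runtime(host, wu)
        finish = queued_secs + total + runtime
        if wu.disk_bytes > disk or finish > wu.delay_bound:
            continue  # would not fit on disk, or would miss its deadline
        sent.append(wu.name)
        total += runtime
        disk -= wu.disk_bytes
    return sent
```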

  13. Client/server protocol (XML-RPC) • Request • Authentication • Host description • Persistent file descriptions • Result descriptions • Duration of work requested • Reply • Application, workunit, result descriptors • Result acknowledgements • Preferences • Control messages (redirect, back off, etc.)
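The slide lists categories rather than a wire format. To show the shape of a request, the sketch below serializes a hypothetical request with Python's standard-library XML-RPC marshaller; every field name and the method name are invented for illustration, and whether BOINC's actual messages follow this exact encoding is not claimed here:

```python
import xmlrpc.client

# Hypothetical request covering the slide's categories; all field names are made up.
request = {
    "account_key": "example-account-key",                    # authentication
    "host": {"fp_ops_per_sec": 1.0e9, "ram_bytes": 512 * 2**20},  # host description
    "results": [{"name": "ar_13323313_0", "state": "done"}],  # result descriptions
    "work_req_seconds": 86300.0,                              # duration of work requested
}

payload = xmlrpc.client.dumps((request,), methodname="scheduler_request")
params, method = xmlrpc.client.loads(payload)
```

Round-tripping through `dumps` and `loads` recovers the same nested structure, which is the property a request/reply protocol needs.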

  14. Work sequences • Handle long (weeks or months) computations with large local state • Sequence normally stays on one host; moves to a different host on failure • Scheduling and redundancy checking are tricky [Diagram labels: Upload state; Check for abort]
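The slide leaves the mechanics open. One possible reading of "Upload state" and "Check for abort" is sketched below: the assigned host periodically uploads its state; if it stops doing so for longer than a timeout, the sequence moves to a spare host, which resumes from the last uploaded state; and a host that is no longer assigned learns to abort when its upload is refused. The class and its fields are hypothetical:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class WorkSequence:
    name: str
    host_id: Optional[int] = None
    state_step: int = 0          # step of the most recently uploaded state
    last_upload: float = 0.0     # time of that upload

    def upload_state(self, host_id: int, step: int, now: float) -> bool:
        """Record a state upload. Returns False if this host is no longer
        assigned, which it should treat as a signal to abort."""
        if host_id != self.host_id:
            return False
        self.state_step, self.last_upload = step, now
        return True

    def check_failure(self, now: float, timeout: float, spare_hosts: List[int]) -> bool:
        """Move the sequence to a spare host if the current one stops reporting."""
        if self.host_id is not None and now - self.last_upload > timeout and spare_hosts:
            self.host_id = spare_hosts.pop(0)
            self.last_upload = now   # the new host resumes from state_step
            return True
        return False
```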

  15. Redundant computing • Create several results per workunit • Find “canonical result” with project-specific consensus policy • Generate additional copies as needed, up to error thresholds • One result per WU per user
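The consensus policy is project-specific, so this sketch takes it as a comparison function. It assumes a policy of "at least `quorum` results from distinct users agree under the project's equivalence test" (the configuration on slide 23 passes `-quorum 3` to its validator, but this particular rule is an assumption), and enforces one result per workunit per user by ignoring a user's later results:

```python
def find_canonical(results, quorum, equivalent):
    """Return a canonical result once `quorum` results from distinct users agree.

    results: list of dicts with "user_id" and "output" keys.
    equivalent: project-specific test of whether two outputs agree.
    Returns None if no candidate has enough agreeing results yet.
    """
    # One result per workunit per user: keep only each user's first result.
    seen_users, unique = set(), []
    for r in results:
        if r["user_id"] not in seen_users:
            seen_users.add(r["user_id"])
            unique.append(r)
    for candidate in unique:
        agreeing = [r for r in unique if equivalent(candidate["output"], r["output"])]
        if len(agreeing) >= quorum:
            return candidate
    return None
```

A floating-point project might pass something like `lambda a, b: abs(a - b) < 1e-6` as `equivalent`, since different platforms can produce slightly different numeric results.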

  16. Participant Credit • Goals: • credit for work actually done (CPU, network, storage) • don't know workunit size in advance • cheat-proof • Integration with redundancy • claimed credit = benchmark * CPU time • granted credit = minimum claimed credit • Handling graphics coprocessors • project-specific benchmarks
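The two credit formulas on the slide translate directly into code. In this sketch the benchmark units and the function names are assumptions; the rules themselves (claimed = benchmark × CPU time, granted = minimum claim among the validated results) are as stated above:

```python
def claimed_credit(benchmark, cpu_seconds):
    """Claimed credit = benchmark * CPU time (benchmark units are project-defined)."""
    return benchmark * cpu_seconds

def grant_credit(claims):
    """Grant every validated result the minimum claimed credit among them."""
    granted = min(claims.values())
    return {result_id: granted for result_id in claims}

claims = {
    "r1": claimed_credit(2.0, 100.0),   # 200.0
    "r2": claimed_credit(3.0, 100.0),   # 300.0
    "r3": claimed_credit(2.5, 90.0),    # 225.0
}
granted = grant_credit(claims)
```

Taking the minimum is what makes this cheat-resistant: a host that inflates its benchmark gains nothing unless every other result in the quorum inflates its claim too.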

  17. Work unit lifecycle • Work generator: create WU, N results • Timeout check • create new results if needed • detect too many errors, too many results without consensus • Validator • find canonical result; grant credit • Assimilator • merge canonical result into project DB • File deleter • delete input and output files when no longer needed
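The timeout check's decision can be written as a small pure function. The parameter names are hypothetical; the configuration on slide 23 passes similarly named options (`-nerror`, `-ndet`, `-nredundancy`) to `timeout_check`, but mapping them onto these parameters is an assumption:

```python
def timeout_check(n_success, n_error, n_in_progress, canonical_found,
                  target_redundancy, max_errors, max_success_no_consensus):
    """Decide what to do with one workunit.

    Returns (action, number_of_new_results_to_create).
    """
    if canonical_found:
        return ("done", 0)
    if n_error >= max_errors:
        return ("error: too many errors", 0)
    if n_success >= max_success_no_consensus:
        return ("error: too many results without consensus", 0)
    # Keep enough results finished or in flight to reach the target redundancy.
    needed = max(target_redundancy - (n_success + n_in_progress), 0)
    if needed == 0 and n_in_progress == 0:
        needed = 1   # finished results disagree: issue one more copy
    return ("create", needed) if needed else ("wait", 0)
```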

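  18. Participating in a BOINC project [Diagram of the steps between the user, the project web site, and the core client: create account; account ID sent by email; download core client; enter account ID and project URL into the core client; core client gets list of scheduling servers; scheduler RPC.]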

  19. Windows GUI • Multi-language • Operations: suspend/resume, attach/detach projects, etc.

  20. Participant preferences

  21. Project-specific preferences

  22. User-visible web features • User profiles • user of the day • Forums • Self-moderating FAQs • Teams • XML data export (3rd party statistics reporting)

  23. Project configuration file

        <boinc>
            <config>
                <db_name>ap</db_name>
                <db_passwd></db_passwd>
                <shmem_key>0x35740417</shmem_key>
                <key_dir>/mydisks/a/users/boincadm/keys</key_dir>
                <upload_url>http://setiboinc.ssl.berkeley.edu/ap_cgi/file_upload_handler</upload_url>
                <upload_dir>/mydisks/a/users/boincadm/projects/AstroPulse_Beta/upload</upload_dir>
                <cgi_url>http://setiboinc.ssl.berkeley.edu/ap_cgi</cgi_url>
                <log_dir>/mydisks/a/users/boincadm/projects/AstroPulse_Beta/log</log_dir>
                <disable_account_creation/>
            </config>
            <daemons>
                <daemon><cmd>feeder -d 1</cmd></daemon>
                <daemon><cmd>validate_test -d 2 -app AstroPulse -quorum 3</cmd></daemon>
                <daemon><cmd>timeout_check -d 2 -app AstroPulse -nerror 10 -ndet 10 -nredundancy 3</cmd></daemon>
                <daemon><cmd>assimilator -d 2 -app AstroPulse</cmd></daemon>
                <daemon><cmd>file_deleter -d 2</cmd></daemon>
            </daemons>
            <tasks>
                <task><cmd>update_stats -update_users -update_hosts -update_teams</cmd><period>1 hour</period></task>
                <task><cmd>get_load</cmd><period>5 min</period></task>
                <task><cmd>db_count "user"</cmd><output>count_users.out</output><period>5 min</period></task>
                <task><cmd>db_count "result"</cmd><output>count_results_all.out</output><period>5 min</period></task>
            </tasks>
        </boinc>
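A control program has to read this file to know which daemons to start and how often to run each task. A minimal Python reader using the standard-library ElementTree parser; the function names are hypothetical, and the set of accepted period units beyond "min" and "hour" is an assumption:

```python
import re
import xml.etree.ElementTree as ET

# Period units accepted by this sketch (assumed; the slide shows only "min" and "hour").
UNIT_SECONDS = {"sec": 1, "min": 60, "hour": 3600, "hours": 3600, "day": 86400, "days": 86400}

def parse_period(text):
    """Convert a period such as '5 min' or '1 hour' to seconds."""
    match = re.fullmatch(r"\s*(\d+)\s*([a-z]+)\s*", text)
    if not match or match.group(2) not in UNIT_SECONDS:
        raise ValueError(f"unrecognized period: {text!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]

def load_config(xml_text):
    """Return (daemon commands, [(task command, period in seconds), ...])."""
    root = ET.fromstring(xml_text.strip())
    daemons = [d.findtext("cmd") for d in root.iter("daemon")]
    tasks = [(t.findtext("cmd"), parse_period(t.findtext("period")))
             for t in root.iter("task")]
    return daemons, tasks

# Abbreviated excerpt of the configuration above.
SAMPLE = """
<boinc>
  <daemons>
    <daemon><cmd>feeder -d 1</cmd></daemon>
    <daemon><cmd>file_deleter -d 2</cmd></daemon>
  </daemons>
  <tasks>
    <task><cmd>update_stats -update_users -update_hosts -update_teams</cmd><period>1 hour</period></task>
    <task><cmd>get_load</cmd><period>5 min</period></task>
  </tasks>
</boinc>
"""

daemons, tasks = load_config(SAMPLE)
```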

  24. Project control • Single control program • enable, disable • cron • status • uses PID files to keep track of daemons • uses timestamp files for periodic tasks • uses lockfiles for mutual exclusion
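The slide names the mechanisms but not their implementation. A sketch of two of them, assuming exclusive file creation (`O_CREAT | O_EXCL`) for the lockfile and the timestamp file's modification time as the record of a task's last run; all function names are hypothetical:

```python
import os
import time

def acquire_lock(path):
    """Create the lockfile atomically; return False if another process holds it."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.write(fd, str(os.getpid()).encode())   # record the holder's PID
    os.close(fd)
    return True

def release_lock(path):
    os.remove(path)

def task_due(timestamp_path, period_secs, now=None):
    """A periodic task is due if its timestamp file is missing or older than its period."""
    now = time.time() if now is None else now
    try:
        return now - os.path.getmtime(timestamp_path) >= period_secs
    except FileNotFoundError:
        return True

def mark_run(timestamp_path, now=None):
    """Touch the timestamp file after the task has run."""
    with open(timestamp_path, "a"):
        pass
    os.utime(timestamp_path, None if now is None else (now, now))
```

Run from cron, the control program would take the lock, run each task for which `task_due` is true, call `mark_run` for it, and release the lock.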

  25. Python-based testing system • Create objects representing projects, hosts, applications, work, etc. • Activate objects to realize them (create databases and directories, run servers and clients) • Simulate various types of failures • Check correctness of final system state (database, result files, etc.)

        # Two test projects sharing one user and one host,
        # with resource shares of 1 and 5 respectively.
        host = Host()
        user = UserUC()
        for i in range(2):
            ProjectUC(users=[user], hosts=[host], redundancy=5,
                      short_name="test_1sec_%d" % i,
                      resource_share=[1, 5][i])
        run_check_all()  # run everything, then check the final system state

  26. Monitoring/debugging tools • All backend processes create log files • web/grep tool for tracking particular WU/result • Database browsing tools • summary of activity; entry point for browsing • Strip charts • record, graph measures of system health • Watchdogs • detect system failures; ring pager

  27. Summary and status • BOINC is funded by a 3-year NSF grant • Computing projects at Space Sciences Lab • Astropulse (in beta test) • SETI@home (original, Australian) • Other projects • Folding@home • Climateprediction.net • Source code is free for noncommercial use
