Public Distributed Computing with BOINC • David P. Anderson, Space Sciences Laboratory, University of California – Berkeley
Public-resource computing • 1 billion Internet-connected PCs in 2010 • >50% of PCs are privately owned • Assume 100M participants • At least 100 PetaFLOPs • At least 1 Exabyte (10^18) storage • Problems • incentive, security, failures, ...
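The aggregate figures on this slide imply modest per-host contributions. A quick arithmetic check, assuming (for illustration only) that the totals are split evenly across the 100M participants:

```python
# Totals taken from the slide; the even split across hosts is an assumption.
PARTICIPANTS = 100 * 10**6          # 100M participants
TOTAL_FLOPS = 100 * 10**15          # 100 PetaFLOPs
TOTAL_STORAGE_BYTES = 10**18        # 1 Exabyte

def per_host(total, participants=PARTICIPANTS):
    """Return the share of a total resource contributed by one host."""
    return total // participants

flops_each = per_host(TOTAL_FLOPS)
bytes_each = per_host(TOTAL_STORAGE_BYTES)

print(f"{flops_each // 10**9} GFLOPS per host")
print(f"{bytes_each // 10**9} GB per host")
```

Under that assumption each host only needs to supply about 1 GFLOPS and 10 GB for the totals to hold.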
SETI@home • Started May 1999 • ~600,000 active participants • ~60 TeraFLOPs • Problems with current software • hard to change/add algorithms • can't share participants w/ other projects • inflexible data architecture
SETI@home data architecture [Diagram comparing the "ideal" and "current" data paths. Labels, in extraction order: tapes; Internet2 (free); Berkeley; Berkeley; Stanford; USC; 50 Mbps; commercial Internet (twice); participants (twice).]
BOINC: Berkeley Open Infrastructure for Network Computing • Multiple projects • easy to develop and operate • independent • Support wide range of tasks • computation/storage • task “topologies” • Participant features • can choose projects, resource allocation • configurable; invisible on participant hosts • many platforms supported
BOINC server architecture [Diagram. Components: work generator; timeout/retry; project DB; BOINC DB; validator; assimilator; file deleter; data servers; scheduling server; Web interfaces (PHP).]
BOINC client architecture [Diagram. Components: applications, each linked with the BOINC library; files, shared memory; screensaver; messages; BOINC core client; schedulers, data servers.]
Data architecture • Files • immutable, replicated • may originate on client or project • may remain resident on client • Persistent, non-intrusive file transfers • XML descriptor:
    <file_info>
        <name>arecibo_3392474_jun_23_01</name>
        <url>http://ds.ssl.berkeley.edu/a3392474</url>
        <url>http://dt.ssl.berkeley.edu/a3392474</url>
        <md5_cksum>uwi7eyufiw8e972h8f9w7</md5_cksum>
        <nbytes>10000000</nbytes>
    </file_info>
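A client can use such a descriptor to pick a replica URL and to check a completed transfer. The following is a minimal sketch, not BOINC's own code. It assumes that `<md5_cksum>` holds the hexadecimal MD5 digest of the file contents (the value on the slide is a placeholder, not valid hex), and the sample file name and `example.org` URLs are made up for illustration.

```python
import hashlib
import xml.etree.ElementTree as ET

def parse_file_info(xml_text):
    """Extract the fields of a <file_info> descriptor into a dict."""
    elem = ET.fromstring(xml_text)
    return {
        "name": elem.findtext("name"),
        "urls": [u.text for u in elem.findall("url")],  # replicas, in order
        "md5": elem.findtext("md5_cksum"),
        "nbytes": int(elem.findtext("nbytes")),
    }

def verify(info, data):
    """Check downloaded bytes against the size and checksum in the descriptor."""
    return (len(data) == info["nbytes"]
            and hashlib.md5(data).hexdigest() == info["md5"])

# Hypothetical descriptor built around a small in-memory payload.
payload = b"example payload"
descriptor = f"""<file_info>
  <name>example_file</name>
  <url>http://a.example.org/example_file</url>
  <url>http://b.example.org/example_file</url>
  <md5_cksum>{hashlib.md5(payload).hexdigest()}</md5_cksum>
  <nbytes>{len(payload)}</nbytes>
</file_info>"""

info = parse_file_info(descriptor)
print(info["urls"], verify(info, payload))
```

Because files are immutable, a checksum mismatch can only mean a bad transfer, so the client can retry from the next URL in the list.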
BOINC applications • Any language (C, C++, Fortran) • BOINC API • filename translation • checkpoint/restart, % done, CPU time • graphics (based on OpenGL, GLUT)
Work units • Template for a computation • Resource estimates • Integer, FP ops; memory; disk space • Delay bound • determines retry, client abort
    <file_info>
        <name>arecibo_3392474_jun_23_01</name>
        ...
    </file_info>
    <workunit>
        <name>ar_13323313</name>
        <file_ref>
            <name>arecibo_3392474_jun_23_01</name>
            <open_name>input_file</open_name>
        </file_ref>
        <command_line>-niter 1000</command_line>
    </workunit>
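The `<file_ref>` element is what makes filename translation possible: the application opens a fixed logical name (`input_file`) and the physical file name is looked up from the workunit. A minimal sketch of that lookup, using the workunit from the slide; the helper names `open_name_map` and `resolve` are hypothetical and are not part of the BOINC API.

```python
import xml.etree.ElementTree as ET

WORKUNIT_XML = """<workunit>
  <name>ar_13323313</name>
  <file_ref>
    <name>arecibo_3392474_jun_23_01</name>
    <open_name>input_file</open_name>
  </file_ref>
  <command_line>-niter 1000</command_line>
</workunit>"""

def open_name_map(workunit_xml):
    """Map each logical open_name to the physical file name it refers to."""
    wu = ET.fromstring(workunit_xml)
    return {ref.findtext("open_name"): ref.findtext("name")
            for ref in wu.findall("file_ref")}

def resolve(logical_name, mapping):
    """Translate a logical name; names with no mapping pass through unchanged."""
    return mapping.get(logical_name, logical_name)

mapping = open_name_map(WORKUNIT_XML)
print(resolve("input_file", mapping))
```

The same application binary can then be run against any workunit without knowing the per-workunit file names.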
Results • An instance of a computation (completed or not) • Includes: host ID, claimed/granted credit
    <file_info>
        <name>arecibo_3392474_jun_23_01.out</name>
        ...
    </file_info>
    <result>
        <workunit_name>ar_13323313</workunit_name>
        <file_ref>
            <name>arecibo_3392474_jun_23_01.out</name>
            <open_name>output_file</open_name>
        </file_ref>
    </result>
Scheduling • Work buffering on client • upper, lower bounds • Host attributes • FP/int/mem speeds, disk/memory sizes • network bandwidth up/down • fraction of time connected, computing • Scheduler policy: • send as much work as requested, subject to feasibility, WU deadlines
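The scheduler policy can be sketched as a greedy loop: keep adding workunits until the requested duration is covered, skipping any that do not fit on the host or cannot finish within their delay bound. This is an illustrative sketch only; the field names, the way run time is estimated, and the greedy order are all assumptions, not the actual BOINC scheduler.

```python
from dataclasses import dataclass

@dataclass
class Host:
    flops: float          # measured floating-point speed, ops/second
    disk_free: float      # free disk space, bytes
    on_fraction: float    # fraction of time the host is computing

@dataclass
class WorkUnit:
    name: str
    fp_ops: float         # estimated floating-point operations
    disk_bytes: float     # estimated disk usage, bytes
    delay_bound: float    # seconds allowed before the result is due

def estimated_seconds(wu, host):
    """Wall-clock estimate, scaled by how often the host actually computes."""
    return wu.fp_ops / (host.flops * host.on_fraction)

def choose_work(host, candidates, seconds_requested):
    """Pick workunits until the request is covered, skipping infeasible ones."""
    sent, queued, disk_left = [], 0.0, host.disk_free
    for wu in candidates:
        if queued >= seconds_requested:
            break
        duration = estimated_seconds(wu, host)
        if wu.disk_bytes > disk_left:
            continue                      # does not fit on disk
        if queued + duration > wu.delay_bound:
            continue                      # would miss its deadline
        sent.append(wu)
        queued += duration
        disk_left -= wu.disk_bytes
    return sent

host = Host(flops=1e9, disk_free=1e9, on_fraction=0.5)
candidates = [
    WorkUnit("a", fp_ops=1.8e12, disk_bytes=1e8, delay_bound=86400),
    WorkUnit("b", fp_ops=1.8e12, disk_bytes=5e9, delay_bound=86400),  # too big
    WorkUnit("c", fp_ops=1.8e12, disk_bytes=1e8, delay_bound=1000),   # too slow
    WorkUnit("d", fp_ops=1.8e12, disk_bytes=1e8, delay_bound=86400),
]
chosen = [wu.name for wu in choose_work(host, candidates, seconds_requested=7000)]
print(chosen)
```

Scaling the estimate by `on_fraction` is one way to use the "fraction of time computing" host attribute when checking deadlines.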
Client/server protocol (XML-RPC) • Request • Authentication • Host description • Persistent file descriptions • Result descriptions • Duration of work requested • Reply • Application, workunit, result descriptors • Result acknowledgements • Preferences • Control messages (redirect, back off, etc.)
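To make the request contents concrete, the sketch below assembles one XML document carrying the items listed above. Every element name here (`scheduler_request`, `authenticator`, `host_info`, `work_req_seconds`, and so on) is a hypothetical choice for illustration; the slide does not give the real message schema.

```python
import xml.etree.ElementTree as ET

def build_request(authenticator, host, file_names, result_names, work_req_seconds):
    """Assemble a scheduler request as an XML string (hypothetical schema)."""
    req = ET.Element("scheduler_request")
    ET.SubElement(req, "authenticator").text = authenticator      # authentication
    host_el = ET.SubElement(req, "host_info")                     # host description
    for key, value in host.items():
        ET.SubElement(host_el, key).text = str(value)
    for name in file_names:                                       # persistent files
        ET.SubElement(ET.SubElement(req, "file_info"), "name").text = name
    for name in result_names:                                     # result descriptions
        ET.SubElement(ET.SubElement(req, "result"), "name").text = name
    ET.SubElement(req, "work_req_seconds").text = str(work_req_seconds)
    return ET.tostring(req, encoding="unicode")

request_xml = build_request(
    authenticator="abc123",
    host={"fpops": 1000000000, "disk_free": 5000000000},
    file_names=["arecibo_3392474_jun_23_01"],
    result_names=["r1", "r2"],
    work_req_seconds=3600,
)
print(request_xml)
```

The reply would carry the corresponding descriptors and acknowledgements in the same style.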
Work sequences • Handle long (weeks or months) computations with large local state • Sequence normally stays on one host; move to different host if failure • Scheduling, redundancy checking are tricky [Diagram labels: Upload state; Check for abort]
Redundant computing • Create several results per workunit • Find “canonical result” with project-specific consensus policy • Generate additional copies as needed, up to error thresholds • One result per WU per user
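The consensus policy is project-specific. As one possible example, the sketch below treats a result as canonical when at least `quorum` results returned exactly the same output; the exact-match rule and the function name are assumptions, and a real project might compare outputs within a numerical tolerance instead.

```python
from collections import Counter

def find_canonical(outputs, quorum):
    """Return (canonical_output, agreeing_result_ids), or None if no consensus.

    outputs maps a result ID to that result's (hashable) output.
    """
    if not outputs:
        return None
    value, count = Counter(outputs.values()).most_common(1)[0]
    if count < quorum:
        return None          # caller would generate additional results
    agreeing = sorted(rid for rid, out in outputs.items() if out == value)
    return value, agreeing

outputs = {"r1": "A", "r2": "B", "r3": "A"}
print(find_canonical(outputs, quorum=2))
print(find_canonical(outputs, quorum=3))
```

When no consensus is found, more results are generated until either a quorum agrees or an error threshold is reached.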
Participant Credit • Goals: • credit for work actually done (CPU, network, storage) • don't know workunit size in advance • cheat-proof • Integration with redundancy • claimed credit = benchmark * CPU time • granted credit = minimum claimed credit • Handling graphics coprocessors • project-specific benchmarks
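The two credit formulas combine with redundancy as follows. This is a minimal sketch with made-up benchmark and CPU-time numbers; the units are arbitrary.

```python
def claimed_credit(benchmark, cpu_seconds):
    """Credit a host claims for one result: benchmark score times CPU time."""
    return benchmark * cpu_seconds

def granted_credit(claims):
    """Credit granted to every valid result of a workunit: the minimum claim."""
    return min(claims)

# Three hosts return results for the same workunit; the third inflates its claim.
claims = [
    claimed_credit(benchmark=2.0, cpu_seconds=100),   # 200.0
    claimed_credit(benchmark=1.5, cpu_seconds=120),   # 180.0
    claimed_credit(benchmark=3.0, cpu_seconds=500),   # 1500.0
]
print(granted_credit(claims))
```

Taking the minimum means an inflated claim never raises the credit granted, which is how the scheme addresses the cheat-proof goal.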
Work unit lifecycle • Work generator: create WU, N results • Timeout check • create new results if needed • detect too many errors, too many results without consensus • Validator • find canonical result; grant credit • Assimilator • merge canonical result into project DB • File deleter • delete input and output files when no longer needed
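The timeout check is the stage with the most branching, so a sketch of its decision logic may help. The state names, the parameter names, and the choice to count timed-out results as errors are all assumptions made for illustration; they are not taken from the BOINC source.

```python
from enum import Enum

class State(Enum):
    IN_PROGRESS = "in_progress"
    SUCCESS = "success"
    ERROR = "error"
    TIMED_OUT = "timed_out"

def timeout_check(states, has_canonical, redundancy, max_errors, max_successes):
    """Decide what to do with one workunit; returns (action, new_results)."""
    if has_canonical:
        return ("done", 0)
    errors = sum(s in (State.ERROR, State.TIMED_OUT) for s in states)
    successes = states.count(State.SUCCESS)
    if errors > max_errors:
        return ("fail_too_many_errors", 0)
    if successes > max_successes:
        return ("fail_no_consensus", 0)      # many results, still no agreement
    # Results that may still contribute to a consensus.
    live = successes + states.count(State.IN_PROGRESS)
    missing = max(0, redundancy - live)
    return ("create_results", missing) if missing else ("wait", 0)

states = [State.SUCCESS, State.ERROR, State.IN_PROGRESS]
print(timeout_check(states, False, redundancy=3, max_errors=10, max_successes=10))
```

Here one result has failed, so one replacement is created to bring the number of live results back up to the redundancy level.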
Participating in a BOINC project [Diagram: interaction among the user, the project web site, and the core client. Labeled steps: create account; email account ID; download core client; enter account ID, project URL; get list of scheduling servers; scheduler RPC.]
Windows GUI • Multi-language • Operations: suspend/resume, attach/detach projects, etc.
User-visible web features • User profiles • user of the day • Forums • Self-moderating FAQs • Teams • XML data export (3rd party statistics reporting)
Project configuration file
    <boinc>
      <config>
        <db_name>ap</db_name>
        <db_passwd></db_passwd>
        <shmem_key>0x35740417</shmem_key>
        <key_dir>/mydisks/a/users/boincadm/keys</key_dir>
        <upload_url>http://setiboinc.ssl.berkeley.edu/ap_cgi/file_upload_handler</upload_url>
        <upload_dir>/mydisks/a/users/boincadm/projects/AstroPulse_Beta/upload</upload_dir>
        <cgi_url>http://setiboinc.ssl.berkeley.edu/ap_cgi</cgi_url>
        <log_dir>/mydisks/a/users/boincadm/projects/AstroPulse_Beta/log</log_dir>
        <disable_account_creation/>
      </config>
      <daemons>
        <daemon><cmd>feeder -d 1</cmd></daemon>
        <daemon><cmd>validate_test -d 2 -app AstroPulse -quorum 3</cmd></daemon>
        <daemon><cmd>timeout_check -d 2 -app AstroPulse -nerror 10 -ndet 10 -nredundancy 3</cmd></daemon>
        <daemon><cmd>assimilator -d 2 -app AstroPulse</cmd></daemon>
        <daemon><cmd>file_deleter -d 2</cmd></daemon>
      </daemons>
      <tasks>
        <task><cmd>update_stats -update_users -update_hosts -update_teams</cmd><period>1 hour</period></task>
        <task><cmd>get_load</cmd><period>5 min</period></task>
        <task><cmd>db_count "user"</cmd><output>count_users.out</output><period>5 min</period></task>
        <task><cmd>db_count "result"</cmd><output>count_results_all.out</output><period>5 min</period></task>
      </tasks>
    </boinc>
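A control program could read this file to learn which daemons to start and how often to run each task. The sketch below parses a trimmed copy of the configuration above; the function names are hypothetical, and the period parser assumes only the two units that appear on the slide ("min" and "hour").

```python
import xml.etree.ElementTree as ET

CONFIG = """<boinc>
  <config><db_name>ap</db_name><disable_account_creation/></config>
  <daemons>
    <daemon><cmd>feeder -d 1</cmd></daemon>
    <daemon><cmd>file_deleter -d 2</cmd></daemon>
  </daemons>
  <tasks>
    <task><cmd>get_load</cmd><period>5 min</period></task>
    <task><cmd>update_stats -update_users -update_hosts -update_teams</cmd><period>1 hour</period></task>
  </tasks>
</boinc>"""

UNIT_SECONDS = {"min": 60, "hour": 3600}   # only the units seen on the slide

def period_seconds(text):
    """Convert a period such as '5 min' or '1 hour' to seconds."""
    count, unit = text.split()
    return int(count) * UNIT_SECONDS[unit]

def load_config(xml_text):
    """Return (daemon commands, [(task command, period in seconds)], flags)."""
    root = ET.fromstring(xml_text)
    daemons = [d.findtext("cmd") for d in root.iter("daemon")]
    tasks = [(t.findtext("cmd"), period_seconds(t.findtext("period")))
             for t in root.iter("task")]
    # An empty element such as <disable_account_creation/> acts as a boolean flag.
    flags = {"disable_account_creation":
             root.find("config/disable_account_creation") is not None}
    return daemons, tasks, flags

daemons, tasks, flags = load_config(CONFIG)
print(daemons, tasks, flags)
```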
Project control • Single control program • enable, disable • cron • status • uses PID files to keep track of daemons • uses timestamp file for periodic tasks • uses lockfiles for mutual exclusion
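The timestamp-file and lockfile mechanisms can be sketched with standard library calls alone. This is an illustration of the general technique, not the actual control program; the function names and file layout are assumptions.

```python
import os
import tempfile
import time

def task_is_due(stamp_path, period_seconds, now=None):
    """A task is due if it has never run or its period has elapsed."""
    now = time.time() if now is None else now
    try:
        last_run = os.path.getmtime(stamp_path)
    except FileNotFoundError:
        return True
    return now - last_run >= period_seconds

def mark_run(stamp_path, now=None):
    """Record a run by setting the timestamp file's modification time."""
    now = time.time() if now is None else now
    with open(stamp_path, "a"):
        pass                                  # create the file if missing
    os.utime(stamp_path, (now, now))

def acquire_lock(lock_path):
    """Atomically create the lockfile; return False if it already exists."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))             # record the owner, as a PID file does
    return True

workdir = tempfile.mkdtemp()
stamp = os.path.join(workdir, "get_load.stamp")
lock = os.path.join(workdir, "control.lock")

if acquire_lock(lock) and task_is_due(stamp, period_seconds=300):
    mark_run(stamp)                           # the task itself would run here
os.remove(lock)                               # release the lock
```

Creating the lockfile with `O_CREAT | O_EXCL` makes the check-and-create step atomic, so two control program instances started by cron at the same time cannot both proceed.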
Python-based testing system • Create objects representing projects, hosts, applications, work, etc. • Activate objects to realize (create databases and directories, run servers and clients) • Simulate various types of failures • Check correctness of final system state (database, result files, etc.)
    host = Host()
    user = UserUC()
    for i in range(2):
        ProjectUC(users=[user], hosts=[host], redundancy=5,
                  short_name="test_1sec_%d" % i,
                  resource_share=[1, 5][i])
    run_check_all()
Monitoring/debugging tools • All backend processes create log files • web/grep tool for tracking particular WU/result • Database browsing tools • summary of activity; entry point for browsing • Strip charts • record, graph measures of system health • Watchdogs • detect system failures; ring pager
Summary and status • BOINC is funded by a 3-year NSF grant • Computing projects at Space Sciences Lab • Astropulse (in beta test) • SETI@home (original, Australian) • Other projects • Folding@home • Climateprediction.net • Source code is free for noncommercial use