Last Time • Parallelism within a machine: • Run lots of programs • Break into multiple programs or multiple data sets • Fork multiple processes → fork() • Spawn multiple threads → pthreads • Parallelism across machines: • MPI • Queuing system • Scheduler
Lecture Overview • Lecture: • More on Clusters, Grids & Schedulers • Distributed Filesystems • Hadoop Building Blocks • Hands On: • Clusters/Grids • Sun Grid Engine • Distributed Filesystems • Lustre
Next Week • From Grids to Clouds: • Hadoop • Hadoop Streaming • Map/Reduce • HDFS
News • Cluster is now accessible from home! • ssh -p 60000 student#@jordan-test.genome.ucla.edu • # corresponds to your machine in lab • NOTE: • See me to set up password for outside access! • Don't store important files on it, as it is not backed up! • It is now larger, and has more queues! • Lab 1 will be discussed later. It is due on April 27. • Now in the syllabus • UCLA Extension has agreed to offer advanced cloud courses • Learn to setup, tune and administer the tools from this class • Will likely be offered in 2-3 quarters, after we have more students • Feedback is desired to shape the additional courses • More details as they are available • The syllabus is being updated as we go along with more references
Queue Selection • How does SGE select a queue? • Job submitted to one or more queues • Queues have sequence number (seq_no) • qconf -mq all.q • Users are only authorized for certain queues • Does the desired queue have slots available, or do we move on to the next? • Examples (expanded below) • echo sleep 100 | qsub -N all • echo sleep 100 | qsub -N all -q small.q • echo sleep 100 | qsub -N both -q all.q,small.q
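A quick way to see what the scheduler is working with (a sketch; all.q and small.q are the lab queues named above, and with seq_no-based sorting, lower numbers are typically preferred):

  # Each queue's sequence number
  qconf -sq all.q | grep seq_no
  qconf -sq small.q | grep seq_no

  # Per-queue summary of used/available slots across the cluster
  qstat -g c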
Queues: Advanced • Subordinate Queues • Maximum Job Length • Do we kill cheaters? • Queues contain hosts • qsub -q q1@compute-0,q2@compute-1,q3 • Host Groups (a group of hosts) • qconf -ahgrp • qconf -shgrpl • echo sleep 1000 | qsub -q all.q@@highmem2 • Run in any @highmem2 node, within the all.q queue
Scheduling • How do we efficiently map jobs to nodes? • Parallel Environments • Multiple Slots/Cores • qsub -pe serial N • Consumable Resources (Memory, IB, etc) • Setup • qconf -sc (show) • qconf -mc (setup) • qconf -me compute-1-1 • complex_values virtual_free=32.409G • Usage • qsub -l vf=2G • Load-based scheduling • How do we handle cheaters?
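A minimal sketch of wiring up virtual_free as a consumable resource; the column layout follows the standard SGE complex definition, and the host name and memory values are only examples taken from the slide:

  # Inside the editor opened by qconf -mc, the relevant line looks roughly like:
  # name          shortcut  type    relop  requestable  consumable  default  urgency
    virtual_free  vf        MEMORY  <=     YES          YES         0        0

  # Record how much memory a host actually has (qconf -me compute-1-1):
  #   complex_values virtual_free=32.409G

  # Ask for 2G of that consumable at submit time; the scheduler subtracts it from the host
  echo sleep 100 | qsub -l vf=2G -N memjob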
Fair Scheduling • Which user/project/department gets the next resource? • FIFO (First In First Out) • Tickets • Functional • Fair usage right now, history irrelevant • Share-based • Fair usage over time, with amortized half-life • Policy, Politics • Should power users have low priority? • If I run 1 job a month, does that mean it is important? • How long am I penalized for after heavy usage? • Does the Lead Developer get more shares than me? What about the Pipeline? • Priorities • Are priorities legitimate between users, or just within? • How do we weigh priority, tickets, wait time, etc?
Fair Scheduling • How do we prevent a user/group/dept from dominating? • Quotas • qconf -srqsl (list), qconf -arqs (add); an example resource quota set is laid out below: limit users {guest,test} to slots=4, limit users * hosts @mainframes to slots=1, limit users * hosts to slots=100 • Once scheduled, a job can still run forever (or do nothing)! • Do we kill it or put it to sleep? Do we have enough swap? • Subordinate Queues, Time Limits
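The same quota rules, laid out as they would appear in the qconf -arqs editor (a sketch reusing the limits from the slide):

  {
    name         limitUsers
    description  "rules to avoid user domination"
    enabled      TRUE
    limit        users {guest,test} to slots=4
    limit        users * hosts @mainframes to slots=1
    limit        users * hosts to slots=100
  }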
Advanced Jobs • Array jobs (a cleaned-up version is sketched below) • Input File (array.jobs.in) /home/jordan/file1 /home/jordan/file2 • Execution Script (runner.sh) #!/bin/bash INPUT_FILE=/home/jordan/array.jobs.in LINE=$(head -n $SGE_TASK_ID $INPUT_FILE | tail -n 1) gzip $LINE • Submission • qsub -N array -t 1-`wc -l < /home/jordan/array.jobs.in` ~/runner.sh • Job Dependencies • JOB1_ID=`qsub -N job1 ./job1.sh | awk '{print $3}'` • qsub -N job2 -hold_jid $JOB1_ID ./job2.sh
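A cleaned-up sketch of the array job pattern above (same paths as the slide; the awk field assumes qsub's usual "Your job N ..." output):

  # /home/jordan/array.jobs.in -- one input path per line
  /home/jordan/file1
  /home/jordan/file2

  # ~/runner.sh -- each array task gzips the line matching its task ID
  #!/bin/bash
  INPUT_FILE=/home/jordan/array.jobs.in
  LINE=$(head -n $SGE_TASK_ID $INPUT_FILE | tail -n 1)   # SGE_TASK_ID is set by SGE per task
  gzip "$LINE"

  # Submit one array task per line of the input file
  qsub -N array -t 1-$(wc -l < /home/jordan/array.jobs.in) ~/runner.sh

  # Job dependencies: job2 is held until job1 completes
  JOB1_ID=$(qsub -N job1 ./job1.sh | awk '{print $3}')
  qsub -N job2 -hold_jid $JOB1_ID ./job2.sh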
Advanced Jobs • Email notification • qsub -m e -M jmendler@ucla.edu /tmp/foo.sh • Job length • Advanced Reservations • Lets us preallocate execution hosts • Too advanced for now
Scaling Storage • As we add computers, what bottlenecks? • Network • Depends on application • Infiniband, Myrinet • Storage • Direct Attached Storage • Network Attached Storage (NAS) • Storage Area Network (SAN)
Scaling Storage • Direct Attached Storage • Not shared • Network Attached Storage (NAS) • Head node bottlenecks • Storage Area Network (SAN) • Every node talks to every disk • Doesn't scale as well and is very expensive • Clustered NAS, NAS + SAN, etc. • Clients are load-balanced amongst NFS servers • A single request goes to one server at a time • All head nodes must present the same storage
Distributed Filesystems • POSIX Compliant (sometimes) • Appears to the client the same way a local disk or NFS volume would • Global • A single namespace across multiple storage servers/volumes • All clients see a single volume regardless of the back-end • Distributed • Files are distributed across storage servers for load balancing • Storage servers talk to one or more sets of disks at a time • Parallel • A single file can be distributed across one or more servers • Client talks to one or more fileservers at a time
Distributed Filesystems • Expandable • Add storage servers or disk arrays to grow (online) • Multi-Petabyte installations • Fast and scalable • Support tens of thousands of clients • Support hundreds of gigabytes per second • Reliability • Automatic failover when a server or disk dies
Lustre • Lustre is a POSIX-compliant global, distributed, parallel filesystem • Lustre is fast, scalable and live expandable • Lustre is licensed under GPL • Lustre was acquired by Sun/Oracle
Lustre: Flagship Deployment • Oak Ridge National Lab (2009) • 1 center-wide Lustre-based storage volume • Over 10 PB of RAID6 storage (13,400 SATA disks!) • Over 200GB/s of throughput (240GB/s theoretical) • 192 Lustre servers over Infiniband • Over 26,000 clients simultaneously performing I/O • ORNL 2012 Projections • 1.5 TB/s aggregate disk bandwidth • 244 PB of SATA disk storage or 61 PB of SAS • 100,000 clients Source: http://wiki.lustre.org/images/a/a8/Gshipman_lug_2009.pdf
Lustre: Downsides • Complexity • Reliability is improving but not quite there • Unexplained slowdowns, hangs, weird debug messages • Occasional corruption bugs pop up on the mailing list • Fast scratch space at best • Copy raw data to Lustre, process, copy results back • No High Availability at the Lustre level (yet) • Regardless, Lustre is surprisingly robust to failures • Reboot any number of OSSes and/or the MDS during a read/write • The client simply waits around for the target to return • When the cluster comes back online, I/O generally resumes cleanly • Client timeouts are tunable to wait or return a file-unavailable error
Lustre Components: Servers • Metadata Server (MDS) • Manages filesystem metadata, but stores no actual data • Ideally has enough RAM to fit all of the metadata in memory • Object Storage Servers (OSS) • Analogous to head node(s) for each storage server • Performs the disk I/O when prompted by a client • Server-side caching • Management Server (MGS) • Stores configuration information about the filesystem • Servers require a custom kernel
Lustre Components • Metadata Target (MDT) • Disk back-end to the MDS, tuned for small files • Object Storage Target (OST) • One or more per OSS, each a disk or array that stores the actual files • ldiskfs (modified ext3/ext4), with a port to ZFS in progress • Clients • The Lustre client runs as a kernel module to direct-mount Lustre • Client asks the MDS where to read/write a file or directory • Client makes requests directly to the OSS(s) (striping example below) • OSS talks to the appropriate OST(s) • Clients cache when possible
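How a client steers striping with the lfs tool (a sketch; the /mnt/lustre paths and the stripe count are assumptions, not the class cluster's layout):

  # Stripe new files in this directory across 4 OSTs
  lfs setstripe -c 4 /mnt/lustre/scratch/jordan

  # Write a file, then see which OSTs hold its stripes
  dd if=/dev/zero of=/mnt/lustre/scratch/jordan/test.dat bs=1M count=64
  lfs getstripe /mnt/lustre/scratch/jordan/test.dat

  # Space usage broken down per MDT/OST
  lfs df -h /mnt/lustre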
Lab time • Cluster is now accessible from home! • ssh -p 60000 student#@jordan-test.genome.ucla.edu • # corresponds to your machine number in lab • In lab you can ssh directly in. From home you need a password! • New cluster is bigger, and has more bells and whistles • It is not backed up, so do not store important files on it! • Problems: • Forking and Threading (from Last Week) • http://genome.ucla.edu/~jordan/teaching/spring2010/LinuxCloudComputing/lecture2/lab_problems.txt • Sun Grid Engine • http://genome.ucla.edu/~jordan/teaching/spring2010/LinuxCloudComputing/lecture3/lab_problems.txt • The two combined will make up Lab 1, due April 27
Introduction to Hadoop • Sun Grid Engine + Lustre • Job runs in Rack1, writes to storage in Rack5 • All writes go across the network, possibly far away • SGE disks do nothing, Lustre CPUs do nothing • Wasted resources, and need to grow these systems independently • Integration points? • Combine storage servers and compute nodes • CPUs for computation, disks for storage • Minimize network traffic by writing to our local disk when possible • Each added server speeds up both processing and data throughput/capacity • Combine job scheduler and filesystem metadata (data locality) • Run jobs on the node, or rack that has the desired input files • Cheaper to move computation than data! • Stripe across the local rack, not across uplinks!
Introduction to Hadoop • How else can we optimize? • Run duplicate computation on empty nodes • Amongst 3 computers, 1 is likely to be a little faster and 1 may fail • Replicate data • Copies on different racks to improve read speeds • But no reason to copy intermediate temp files • Also safer, so we can use cheap/commodity hardware • Compression • CPUs are faster than disks and networks • How else can we simplify? • Automate and optimize splits and merges • Integrate the whole system, so the user doesn't worry about internals • How can we hide all this detail from the user? • An API providing simple functions/data structures that the system can scale
Introduction to MapReduce • What primitives must this API provide? • Get Input and Split (InputReader) • Efficiently read in 1 dataset with 1,000,000 records • Split into N groups for N nodes • Computation (Map) • Take a group of data from a split or a prior computation • Run some algorithm on that data • Output result(s) for each computation • Merges (Reduce) • Take some group of data and combine it into 1 or more values • Store Results (OutputWriter) • Take our result and efficiently/safely write it to storage • A shell sketch of these primitives follows below
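To make the primitives concrete, here is a hedged word-count sketch as a plain shell pipeline: cat plays the InputReader, the first script is the Map, sort stands in for the shuffle between Map and Reduce, and the second script is the Reduce. Hadoop Streaming, covered next week, plugs scripts like these into the real framework; input.txt and the script names here are placeholders.

  # map.sh -- emit "word <TAB> 1" for every word read from stdin
  #!/bin/bash
  tr -s ' \t' '\n\n' | awk 'NF { print $0 "\t" 1 }'

  # reduce.sh -- input arrives sorted by key; sum the counts per word
  #!/bin/bash
  awk -F'\t' '{ sum[$1] += $2 } END { for (w in sum) print w "\t" sum[w] }'

  # Local simulation of the flow: read -> map -> shuffle (sort) -> reduce
  cat input.txt | ./map.sh | sort | ./reduce.sh > word_counts.txt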