Session D: Tashi
Michael Ryan, Intel
Agenda
• Introduction 8.30-9.00
• Hadoop 9.00-10.45
• Break 10.45-11.00
• Pig 11.00-12.00
• Lunch 12.00-1.00
• Tashi 1.00-3.00
• Break 3.00-3.15
• PRS 3.15-5.00
Tashi session outline: Overview, User view, Administration, Installation, Internals, Summary
Tashi
An infrastructure through which service providers can build applications that harness cluster computing resources to efficiently access repositories of Big Data.
Cluster Computing: A User's Perspective
The job-submission spectrum runs from tight to loose environment coupling:
• Runtime-specific (e.g. Hadoop)
• Queue-based (e.g. Condor or Torque)
• Virtual machine-based (e.g. EC2 or COD…)
Tashi System Requirements
• Provide high-performance execution over Big Data repositories
  Many spindles, many CPUs, co-location
• Enable multiple services to access a repository concurrently
• Enable low-latency scaling of services
• Enable each service to leverage its own software stack
  Virtualization, file-system protections
• Enable slow resource scaling for growth
• Enable rapid resource scaling for power/demand
  Scaling-aware storage
Tashi High-Level Architecture
[Diagram: remote cluster users and remote cluster owners interact with the Cluster Manager, which provisions logical clusters on top of distributed storage system(s).]
Note: the Tashi runtime and the distributed storage systems do not necessarily run on the same physical nodes as the logical clusters.
Tashi Components
[Diagram: Cluster Manager, Scheduler, Virtualization Service, and Storage Service layered over the cluster nodes.]
• Services are instantiated through virtual machines
• Most decisions happen in the Scheduler, which manages compute and storage in concert
• Data location information is exposed to the Scheduler and to services
• The Cluster Manager (CM) maintains databases and routes messages; its decision logic is limited
• Cluster nodes are assumed to be commodity machines
Tashi Operation
Example: answers.opencirrus.net, a web server running in one VM.
1. A query arrives from Alice; the web server converts the query into a parallel data processing request.
2. Acting as a Tashi client, the web server submits a request for additional VMs.
3. The request is forwarded to the Scheduler, which receives the file mapping information from the Storage Service.
4. VMs are requested on the appropriate nodes via the Cluster Manager and Virtualization Service (e.g. create 4 VMs to handle files 5, 13, 17, and 26).
5. After the data objects are processed, the results are collected and forwarded to Alice; the VMs can then be destroyed.
Why Virtualization?
• Ease of deployment
  Boot 100 copies of an operating system in 2 minutes
• Cluster lubrication
  Machines can be migrated or even restarted very easily in a different location
• Overheads are going down
  Even workloads that tax the virtual memory subsystem can now run with very small overhead
  I/O-intensive workloads have improved dramatically, but still have some room for improvement
Tashi in a Nutshell
• Tashi is primarily a system for managing virtual machines (VMs)
• Virtual machines are software containers that provide the illusion of real hardware, enabling:
  Physical resource sharing
  OS-level isolation
  User specification of custom software environments
  Rapid provisioning of services
• Users use Tashi to request the creation, destruction, and manipulation of VMs
Tashi Native Interface
• Users invoke Tashi actions through a Tashi client
• The client will have been configured by an administrator to communicate with the Tashi Cluster Manager
• Example client actions:
  tashi createVm
  tashi destroyVm
  tashi createMany
  etc.
Tashi AWS-compatibility
• Tashi also has a client interface that is compatible with a subset of Amazon Web Services*
  Parts of the SOAP and QUERY interfaces
Tashi AWS-compatibility
[Diagram: ElasticFox and ec2-api-tools clients speak QUERY and SOAP to an Apache cgi-bin front end that translates QUERY to SOAP for a Tashi agent; the agent drives the Cluster Manager (CM), which maintains the VM instance and Node Manager databases.]
Tashi Organization
• Each cluster contains one Tashi Cluster Manager (CM)
• The CM maintains a database of (sketched after this list):
  Available physical resources (nodes)
  Active virtual machines
  Pending requests for virtual machines
  Virtual networks
• Users submit requests to the CM through a Tashi client
• The Tashi Scheduler uses the CM databases to invoke actions, such as VM creation, through the CM
• Each node runs a Node Manager that carries out actions, such as invoking the local Virtual Machine Manager (VMM) to create a new VM, and monitors the performance of VMs
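To make the CM's bookkeeping concrete, here is a minimal Python sketch of the kinds of records it tracks. The class and field names are illustrative assumptions, not Tashi's actual schema.

  # Illustrative sketch of Cluster Manager state; names are assumptions,
  # not Tashi's real data model.
  from dataclasses import dataclass, field
  from typing import List

  @dataclass
  class Host:                # an available physical resource (node)
      name: str
      cores: int
      memory_mb: int

  @dataclass
  class Instance:            # an active or pending virtual machine
      name: str
      host: str              # empty while the request is still pending
      cores: int
      memory_mb: int
      disk_image: str

  @dataclass
  class Network:             # a virtual network known to the CM
      vlan_id: int
      name: str

  @dataclass
  class ClusterState:
      hosts: List[Host] = field(default_factory=list)
      instances: List[Instance] = field(default_factory=list)
      networks: List[Network] = field(default_factory=list)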
Tashi Software Architecture
[Diagram: centralized cluster administration is provided by the Cluster Manager (CM) with its VM instance and Node Manager databases, driven by the client API and by a scheduling agent with site-specific plugins. The CM talks over the CM-NM API to a Node Manager (NM) on each compute node, which hosts resource controller plugins (VMM, DFS, power, etc.) and sensor plugins (e.g. Ganglia) and runs the VMs. Non-Tashi system software on the compute node includes the VMM, DFS, sshd, iptables/vlan, and nmd; a DFS proxy and DFS metadata server complete the storage path.]
Tashi Native Client Interface (I)
• VM creation/destruction calls (single version):
  createVm [--userId <value>] --name <value> [--cores <value>] [--memory <value>] --disks <value> [--nics <value>] [--hints <value>]
  destroyVm --instance <value>
  shutdownVm --instance <value>
• VM creation/destruction calls (multiple version):
  createMany [--userId <value>] --basename <value> [--cores <value>] [--memory <value>] --disks <value> [--nics <value>] [--hints <value>] --count <value>
  destroyMany --basename <value>
Creating a VM
  tashi createVm --name mikes-vm --cores 4 --memory 1024 --disks hardy.qcow2
• --name specifies the DNS name to be created
• --disks specifies the disk image
• Advanced: [--nics <value>] [--hints <value>]
Tashi: Instances
• An instance is a running VM
• Each disk image may be used for multiple VMs if the 'persistent' bit is not set
• A VM may be booted in persistent mode to make modifications without building an entirely new disk image
getMyInstances Explained
  tashi getMyInstances
• This lists all VMs belonging to your userId
• This is a good way to see what you're currently using
getVmLayout Explained
  tashi getVmLayout
• This command displays the layout of currently running VMs across the nodes in the cluster
  id  name     state    instances                   usedMemory  memory  usedCores  cores
  ---------------------------------------------------------------------------------------
  126 r3r2u42  Normal   ['bfly3', 'bfly4']          14000       16070   16         16
  127 r3r2u40  Normal   ['mpa-00']                  15360       16070   8          16
  128 r3r2u38  Normal   ['xren1', 'jpan-vm2']       15480       16070   16         16
  129 r3r2u36  Normal   ['xren3', 'collab-00']      14800       16070   16         16
  130 r3r2u34  Normal   ['collab-02', 'collab-03']  14000       16070   16         16
  131 r3r2u32  Drained  []                          0           16068   0          16
  132 r3r2u30  Normal   ['collab-04', 'collab-05']  14000       16070   16         16
  133 r3r2u28  Normal   ['collab-06', 'collab-07']  14000       16070   16         16
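As a small illustration, the Python sketch below scans getVmLayout output for hosts that still have spare cores. It assumes the whitespace-separated columns shown above and is not part of the Tashi distribution.

  # Illustrative helper: find hosts with spare cores in `tashi getVmLayout`
  # output. Assumes the column layout shown above; not part of Tashi itself.
  import subprocess

  def hosts_with_free_cores():
      out = subprocess.run(["tashi", "getVmLayout"],
                           capture_output=True, text=True, check=True).stdout
      free = []
      for line in out.splitlines():
          parts = line.split()
          # Data rows start with a numeric host id; skip the header and rule.
          if not parts or not parts[0].isdigit():
              continue
          name, state = parts[1], parts[2]
          used_cores, cores = int(parts[-2]), int(parts[-1])
          if state == "Normal" and used_cores < cores:
              free.append((name, cores - used_cores))
      return free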
Tashi Native Client Interface (II)
• VM management calls:
  suspendVm --instance <value>
  resumeVm --instance <value>
  pauseVm --instance <value>
  unpauseVm --instance <value>
  migrateVm --instance <value> --targetHostId <value>
  vmmSpecificCall --instance <value> --arg <value>
Tashi Native Client Interface (III)
• Bookkeeping calls:
  getMyInstances
  getInstances
  getVmLayout
  getUsers
  getNetworks
  getHosts
Creating Multiple VMs
  tashi createMany --count 10 --basename mikes-vm --cores 4 --memory 1024 --disks hardy.qcow2
• --basename specifies the base DNS name for the created VMs
• --disks specifies the disk image
• Advanced: [--nics <value>] [--hints <value>]
A scripted create/destroy cycle built on these commands is sketched below.
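As a rough illustration only, the Python sketch below wraps the documented createMany/destroyMany calls to scale a set of worker VMs up and back down. The basename, image name, and the placeholder work step are assumptions, not part of Tashi.

  # Illustrative wrapper around the documented tashi CLI calls.
  # Basename, image name, and the "do work" placeholder are assumptions.
  import subprocess

  def create_workers(basename, count, image, cores=4, memory=1024):
      subprocess.run(
          ["tashi", "createMany", "--basename", basename, "--count", str(count),
           "--cores", str(cores), "--memory", str(memory), "--disks", image],
          check=True)

  def destroy_workers(basename):
      subprocess.run(["tashi", "destroyMany", "--basename", basename], check=True)

  if __name__ == "__main__":
      create_workers("mikes-vm", 10, "hardy.qcow2")
      try:
          pass  # hypothetical: submit jobs to the freshly created VMs here
      finally:
          destroy_workers("mikes-vm")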
Example cluster: Maui/Torque
• Configure a base disk image from an existing Maui/Torque cluster (or set up a new one)
  We've done this: amd64-torque_node.qcow2
• Ask the Cluster Manager (CM) to create <N> VMs using this image
• Have one VM preconfigured to be the scheduler and queue manager
  Or set it up once the VMs have booted
  Or have a separate image for it
Example cluster: Web Service
• Configure a base image for a web server, and for whatever other tiers (database, etc.) your service needs
• Variable numbers of each can be created by requesting them from the CM
• This is the conventional architecture for a web service
Example cluster: Hadoop
• Configure a base image including Hadoop
• Ask the CM to create instances
  Note: Hadoop wants memory
• Two options for storage:
  Let HDFS reside in the VMs (not ideal for availability/persistence)
  Use HDFS from the hosts (upcoming topic)
Appliances
• Not surprisingly, this set of examples brings VM appliances to mind
• Certainly not a new concept
• We've built several of these from the software configuration of common systems at our site
  Configurations of old physical nodes
  Clean images after an OS install (Ubuntu)
Where are we today?
• Tashi can reliably manage virtual machines spread across a cluster
• In production use for over a year
• Still some opportunities to add features:
  Security
  Intelligent scheduling
• Additional opportunities for research:
  Power management
  Alternative distributed file systems
  Other
Where are we today? (cont.)
• Our deployment of Tashi has managed ~500 VMs across ~150 hosts
• It is the primary access mechanism for the Big Data cluster
• Maui/Torque and Hadoop have been pulled into VMs and are running on top of Tashi
Tashi Deployment: Intel Labs Pittsburgh
• Tashi is used on the Open Cirrus site at ILP
  Majority of the cluster
  Some nodes run Maui/Torque, Hadoop
• Primary source of computational power for the lab
  Mix of preexisting batch users, HPC workloads, Open Cirrus customers, and others
Storing the Data: Choices
• Model 1: Separate compute servers and storage servers
  Compute and storage can scale independently
  Many opportunities for reliability
• Model 2: Co-located compute/storage servers
  No compute resources are under-utilized
  Potential for higher throughput
How is this done currently?
[Diagram: systems arranged by compute/storage placement (separate vs. co-located) and by cluster usage (single vs. multiple cluster users).]
  HPC: fine-grained parallelism, separate compute and storage (tasks run on compute nodes, data on storage nodes)
  Amazon EC2/S3: virtualized compute, separate compute and storage
  Hadoop/Google: coarse-grained parallelism, co-located compute/storage
  Tashi: co-located compute/storage for multiple cluster users
See also: Usher, CoD, Eucalyptus, SnowFlock, …
Example cluster hardware
[Diagram: racks connected by 48-port Gbps switches with 4/8 Gbps uplinks.]
  1U rack: 30 servers, 2 disks/server
  Blade rack: 40 servers, 2 disks/server
  2U rack: 15 servers, 6 disks/server
Far vs Near
• With co-located compute/storage:
  Near: data is consumed on the node where it is stored
  Far: data is consumed across the network
• System software must enable near access for good performance
• MapReduce provides near access
• HPC typically provides far access, unless function shipping is used
Far vs Near Analysis: Methodology
• Assume an I/O-bound (scan) application
• One task per spindle, no CPU load
• In the far system, data is consumed on a randomly selected node
• In the near system, data is consumed on the node where it is stored
• Average throughput is computed; no queueing model
• Scenario 1: 11 racks @ 4 Gbps
• Scenario 2: 5 racks @ 8 Gbps
• Scenario 3: 5 pods @ 8 Gbps, each of 11 racks @ 4 Gbps
A toy version of this model is sketched below.
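The sketch below captures the spirit of the analysis in Python: near access is bounded only by aggregate disk bandwidth, while far access sends most traffic over the rack uplinks. All numeric parameters (disk bandwidth, rack size, uplink speed) are assumptions for illustration, not the figures used in the original analysis.

  # Toy far-vs-near throughput model; average throughput, no queueing.
  # All numbers are illustrative assumptions, not the original analysis.

  def near_throughput(racks, servers_per_rack, disks_per_server, disk_gbps):
      # Near: every task reads a local disk, so the aggregate is simply
      # the total disk bandwidth; no network traffic is generated.
      return racks * servers_per_rack * disks_per_server * disk_gbps

  def far_throughput(racks, servers_per_rack, disks_per_server, disk_gbps,
                     rack_uplink_gbps):
      # Far: each task reads from a randomly selected node, so a fraction
      # (racks - 1) / racks of the traffic must cross the rack uplinks.
      total_disk = racks * servers_per_rack * disks_per_server * disk_gbps
      cross_rack_fraction = (racks - 1) / racks
      uplink_capacity = racks * rack_uplink_gbps
      cross_rack = min(total_disk * cross_rack_fraction, uplink_capacity)
      in_rack = total_disk * (1 - cross_rack_fraction)
      return in_rack + cross_rack

  if __name__ == "__main__":
      # Shaped like Scenario 1: 11 racks @ 4 Gbps uplinks (other values assumed).
      near = near_throughput(11, 30, 2, disk_gbps=0.8)
      far = far_throughput(11, 30, 2, disk_gbps=0.8, rack_uplink_gbps=4)
      print(f"near = {near:.0f} Gbps, far = {far:.0f} Gbps, ratio = {near/far:.1f}x")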
Far vs Near Access Throughput
[Chart: far vs near access throughput for the three scenarios; the original figure shows values of 396, 264, and 352 together with ratios of 8.1x/11.3x/10.3x, 5.0x/6.0x/5.8x, and 2.4x/2.8x/2.8x.]
Storage Service
• Many options possible: HDFS, PVFS, pNFS, Lustre, JBOD, etc.
• A standard interface is needed to expose data location information
Data Location Service
  struct blockInfo {
      encodingType type;
      byteRange range;
      list<hostId> nodeList;
  };

  list<blockInfo> getBlockInfoByteRange(fileId f, byteRange r);
How do we know which data server is the best?
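For readers more comfortable with Python than with the IDL-style declaration above, a roughly equivalent sketch follows. It mirrors the fields shown on the slide; the names and the stub body are illustrative, not Tashi's actual implementation.

  # Python rendering of the data location interface above; illustrative only.
  from dataclasses import dataclass
  from typing import List, Tuple

  ByteRange = Tuple[int, int]   # (start offset, end offset)

  @dataclass
  class BlockInfo:
      encoding: str             # e.g. "replicated" or an erasure-coding scheme
      range: ByteRange          # the byte range this block covers
      node_list: List[str]      # hosts holding a copy of the block

  def get_block_info_byte_range(file_id: str, r: ByteRange) -> List[BlockInfo]:
      """Ask the storage service which hosts hold each block in the range.
      A real implementation would query HDFS, PVFS, etc.; this stub only
      shows the shape of the call."""
      raise NotImplementedError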
Resource Telemetry Service
  typedef double metricValue;

  metricValue getMetric(hostId from, hostId to, metricType t);
  list< list<metricValue> > getAllMetrics(list<hostId> fromList, list<hostId> toList, metricType t);
• Example metrics include latency, bandwidth, switch count, fault-tolerance domain, …
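Putting the two services together, a consumer can answer the "which data server is best?" question by scoring each replica with a telemetry metric. The sketch below is a minimal illustration that reuses the stubs above; the function names are assumptions and it is not part of Tashi.

  # Illustrative replica selection: for each block, pick the replica host
  # with the best telemetry metric (here, highest bandwidth to the consumer).
  # Assumes the get_block_info_byte_range stub above and a get_metric stub.

  def get_metric(from_host: str, to_host: str, metric: str) -> float:
      """Stand-in for the telemetry service's getMetric call."""
      raise NotImplementedError

  def best_replicas(consumer: str, file_id: str, byte_range, metric="bandwidth"):
      placements = []
      for block in get_block_info_byte_range(file_id, byte_range):
          # Score every host that holds this block from the consumer's viewpoint.
          best = max(block.node_list,
                     key=lambda host: get_metric(consumer, host, metric))
          placements.append((block.range, best))
      return placements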
Putting the Pieces Together
[Diagram: location-aware (LA) applications and runtimes consult the Data Location Service and the Resource Telemetry Service in two configurations: (a) non-virtualized, where the LA runtime and DFS run directly on the node OS, and (b) virtualized, where the LA application and runtime run in a guest OS inside a virtual machine, with the DFS and VM runtime on the host OS beneath the VMM.]
Key Configuration Options
• Tashi uses a series of configuration files
• TashiDefaults.cfg is the most basic and is included in the source tree
• Tashi.cfg overrides it with site-specific settings
• Agent.cfg, NodeManager.cfg, ClusterManager.cfg, and Client.cfg override those settings depending on which application is launched
• Files in ~/.tashi/ override everything else
A sketch of how this layering can be resolved follows.
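The layering can be pictured with Python's configparser, which lets later files override earlier ones. The exact search order and paths below are assumptions for illustration, not Tashi's precise lookup logic.

  # Illustrative only: layered INI configuration where later files override
  # earlier ones. Ordering and paths are assumptions, not Tashi's exact logic.
  import os
  from configparser import ConfigParser

  def load_config(app_cfg="ClusterManager.cfg"):
      parser = ConfigParser()
      # Later entries in this list win when the same option appears twice.
      parser.read([
          "TashiDefaults.cfg",                       # shipped defaults
          "Tashi.cfg",                               # site-specific overrides
          app_cfg,                                   # per-application overrides
          os.path.expanduser("~/.tashi/Tashi.cfg"),  # per-user overrides
      ])
      return parser

  if __name__ == "__main__":
      cfg = load_config()
      print(cfg.get("Client", "clusterManagerHost", fallback="localhost"))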
Key Configuration Options (CM hostname)
• You need to set the hostname used for the CM by the Node Managers
• Example settings (Tashi.cfg):
  [Client]
  clusterManagerHost = merkabah

  [NodeManagerService]
  clusterManagerHost = merkabah
Key Configuration Options (VFS)
• You need to set the directory that serves disk images
• We're using NFS for this at the moment
• Example settings (Tashi.cfg):
  [Vfs]
  prefix = /mnt/merkabah/tashi/
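As an assumption-laden illustration of how the Vfs prefix and the --disks argument fit together (the real resolution logic may differ), an image name such as hardy.qcow2 would be looked up under the configured prefix:

  # Assumption for illustration: a node resolves a --disks image name by
  # joining it with the [Vfs] prefix. The actual Tashi logic may differ.
  import os

  VFS_PREFIX = "/mnt/merkabah/tashi/"

  def resolve_image(image_name: str) -> str:
      path = os.path.join(VFS_PREFIX, image_name)
      if not os.path.exists(path):
          raise FileNotFoundError(f"disk image not found: {path}")
      return path

  # e.g. resolve_image("hardy.qcow2") -> "/mnt/merkabah/tashi/hardy.qcow2"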