Distributed Data Mining


  1. ACAI’05/SEKT’05 ADVANCED COURSE ON KNOWLEDGE DISCOVERY Distributed Data Mining Dr. Giuseppe Di Fatta University of Konstanz (Germany) and ICAR-CNR, Palermo (Italy) 5 July, 2005 Email: fatta@inf.uni-konstanz.de, difatta@pa.icar.cnr.it

  2. Tutorial Outline • Part 1: Overview of High-Performance Computing • Technology trends • Parallel and Distributed Computing architectures • Programming paradigms • Part 2: Distributed Data Mining • Classification • Clustering • Association Rules • Graph Mining • Conclusions

  3. Tutorial Outline • Part 1: Overview of High-Performance Computing • Technology trends • Moore’s law • Processing • Memory • Communication • Supercomputers

  4. Units of HPC • Processing • 1 Mflop/s 1 Megaflop/s 10^6 Flop/sec • 1 Gflop/s 1 Gigaflop/s 10^9 Flop/sec • 1 Tflop/s 1 Teraflop/s 10^12 Flop/sec • 1 Pflop/s 1 Petaflop/s 10^15 Flop/sec • Memory • 1 MB 1 Megabyte 10^6 Bytes • 1 GB 1 Gigabyte 10^9 Bytes • 1 TB 1 Terabyte 10^12 Bytes • 1 PB 1 Petabyte 10^15 Bytes

  5. How far did we go?

  6. Technology Limits • Consider a 1 Tflop/s sequential machine: • data must travel some distance, r, to get from memory to CPU • to get 1 data element per cycle, this must happen 10^12 times per second at the speed of light, c = 3x10^8 m/s • so r < c/10^12 = 0.3 mm • Now put 1 TB of storage in a 0.3 mm^2 area: • each word occupies about 3 Angstroms^2, the size of a small atom [Figure: 1 Tflop/s, 1 TB sequential machine with r = 0.3 mm]
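
A quick back-of-the-envelope check of this argument (a minimal Python sketch; the r x r square layout is an assumption used only to show that the per-word footprint ends up at atomic scale):

```python
# Speed-of-light limit for a hypothetical 1 Tflop/s sequential machine.
c = 3e8        # speed of light, m/s
rate = 1e12    # one memory access per cycle at 1 Tflop/s

r = c / rate   # maximum memory-to-CPU distance, in metres
print(f"r < {r * 1e3:.1f} mm")                               # ~0.3 mm

# Assume the 10^12 words of a 1 TB memory are laid out in an r x r square:
side = r / 1e6                                               # 10^6 words along each side
print(f"~{side / 1e-10:.0f} Angstroms per word per side")    # ~3 Angstroms: atomic scale
```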

  7. Moore’s Law (1965) Gordon Moore (co-founder of Intel): “The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years. That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000.”

  8. Moore’s Law (1975) • In 1975, Moore refined his law: • circuit complexity doubles every 18 months. • So far it holds for CPUs and DRAMs! • Extrapolation for computing power at a given cost and semiconductor revenues.

  9. Technology Trend

  10. Technology Trend

  11. Technology Trend • Processors issue instructions roughly every nanosecond. • DRAM can be accessed roughly every 100 nanoseconds. • DRAM cannot keep processors busy! And the gap is growing: • processors are getting faster by 60% per year. • DRAM is getting faster by 7% per year.

  12. Memory Hierarchy • Most programs have a high degree of locality in their accesses • spatial locality: accessing things nearby previous accesses • temporal locality: reusing an item that was previously accessed • Memory hierarchy tries to exploit locality.
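
As a concrete illustration of spatial locality (a hypothetical micro-benchmark, not part of the original slides), traversing a 2-D array along its memory layout is much friendlier to the cache than traversing it across the layout:

```python
# Spatial locality: row-wise traversal follows the memory layout (contiguous accesses),
# column-wise traversal strides through memory and touches a new cache line almost every time.
import time
import numpy as np

a = np.zeros((4000, 4000))

start = time.perf_counter()
total = sum(a[i, :].sum() for i in range(4000))   # row-wise: good spatial locality
t_rows = time.perf_counter() - start

start = time.perf_counter()
total = sum(a[:, j].sum() for j in range(4000))   # column-wise: poor spatial locality
t_cols = time.perf_counter() - start

print(f"row-wise: {t_rows:.3f}s   column-wise: {t_cols:.3f}s")
```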

  13. Memory Latency • Hiding memory latency: • temporal and spatial locality (caching) • multithreading • prefetching

  14. Communication • Topology The manner in which the nodes are connected. Best choice would be a fully connected network (every processor to every other). Unfeasible for cost and scaling reasons. Instead, processors are arranged in some variation of a bus, grid, torus, or hypercube. • Latency How long does it take to start sending a "message"? Measured in microseconds. (Also in processors: how long does it take to output results of some operations, such as floating point add, divide etc., which are pipelined?) • Bandwidth What data rate can be sustained once the message is started? Measured in Mbytes/sec.
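
Latency and bandwidth are often combined into a simple linear cost model for one message; the sketch below uses made-up values for the two parameters:

```python
# Time to send one message = startup latency + message size / sustained bandwidth.
def transfer_time(message_bytes, latency_s=20e-6, bandwidth_Bps=100e6):
    return latency_s + message_bytes / bandwidth_Bps   # bandwidth in bytes/sec

for size in (1, 1_000, 1_000_000):
    print(f"{size:>9} bytes -> {transfer_time(size) * 1e6:9.1f} microseconds")
# Small messages are dominated by latency, large messages by bandwidth.
```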

  15. Networking Trend • System interconnection network: • bus, crossbar, array, mesh, tree • static, dynamic • LAN/WAN

  16. LAN/WAN • 1st network connection in 1969: 50 kbps • at about 10:30 PM on October 29, 1969, the first ARPANET connection was established between UCLA and SRI over a 50 kbps line provided by the AT&T telephone company. • “At the UCLA end, they typed in the 'l' and asked SRI if they received it; 'got the l' came the voice reply. UCLA typed in the 'o', asked if they got it, and received 'got the o'. UCLA then typed in the 'g' and the darned system CRASHED! Quite a beginning. On the second attempt, it worked fine!” (Leonard Kleinrock) • 10Base5 Ethernet in 1976 by Bob Metcalfe and David Boggs • end of the ‘90s: 100 Mbps (Fast Ethernet) and 1 Gbps Bandwidth is not the whole story! Do not forget to consider delay and latency.

  17. Delay in Packet-Switched Networks • (1) Nodal processing: check bit errors; determine the output link. • (2) Queuing: time waiting at the output link for transmission; depends on the congestion level of the router. • (3) Transmission delay: R = link bandwidth (bps), L = packet length (bits); time to send the bits into the link = L/R. • (4) Propagation delay: d = length of the physical link, s = propagation speed in the medium (~2x10^8 m/sec); propagation delay = d/s. • Note: s and R are very different quantities! [Figure: the four delay components along the path from Source to Destination]
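
Putting the four components together for a single hop (a small sketch with illustrative numbers, not measurements):

```python
# Per-hop delay = nodal processing + queuing + transmission (L/R) + propagation (d/s).
def nodal_delay(L_bits, R_bps, d_m, s_mps=2e8, processing_s=1e-6, queuing_s=0.0):
    transmission = L_bits / R_bps      # time to push the packet onto the link
    propagation = d_m / s_mps          # time for the bits to travel the link
    return processing_s + queuing_s + transmission + propagation

# A 1500-byte packet over a 100 Mbps, 1000 km link:
t = nodal_delay(L_bits=1500 * 8, R_bps=100e6, d_m=1_000_000)
print(f"{t * 1e3:.2f} ms")   # propagation (~5 ms) dominates transmission (~0.12 ms)
```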

  18. Latency How long does it take to start sending a "message"? Latency may be critical for parallel computing. Some LAN technologies provide high bandwidth and low latency, at a price (€).

  19. HPC Trend • ~20 years ago: Mflop/s (1x10^6 floating point ops/sec) • scalar based • ~10 years ago: Gflop/s (1x10^9 floating point ops/sec) • vector & shared memory computing, bandwidth aware • block partitioned, latency tolerant • ~Today: Tflop/s (1x10^12 floating point ops/sec) • highly parallel, distributed processing, message passing, network based • data decomposition, communication/computation • ~5 years away: Pflop/s (1x10^15 floating point ops/sec) • many more levels of memory hierarchy, combination of grids & HPC • more adaptive, latency tolerant and bandwidth aware, fault tolerant, extended precision, attention to SMP nodes

  20. TOP500 SuperComputers

  21. TOP500 SuperComputers

  22. IBM BlueGene/L

  23. Tutorial Outline • Part 1: Overview of High-Performance Computing • Technology trends • Parallel and Distributed Computing architectures • Programming paradigms

  24. Parallel and Distributed Systems

  25. Different Architectures • Parallel computing • single systems with many processors working on the same problem • Distributed computing • many systems loosely coupled by a scheduler to work on related problems • Grid Computing (MetaComputing) • many systems tightly coupled by software, perhaps geographically distributed, to work together on single problems or on related problems • Massively Parallel Processors (MPPs) continue to account for more than half of all installed high-performance computers worldwide (Top500 list). • Microprocessor-based supercomputers have brought a major change in accessibility and affordability. • Nowadays, cluster systems are the fastest-growing segment.

  26. Classification: Control Model Flynn’s Classical Taxonomy (1966) • Flynn's taxonomy distinguishes multi-processor computer architectures according to how they can be classified along the two independent dimensions of Instruction and Data. Each of these dimensions can have only one of two possible states: Single or Multiple.

  27. SISD Von Neumann Machine Single Instruction, Single Data • A serial (non-parallel) computer • Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle • Single data: only one data stream is being used as input during any one clock cycle • Deterministic execution • This is the oldest and, until recently, the most prevalent form of computer • Examples: most PCs, single-CPU workstations and mainframes

  28. SIMD Single Instruction, Multiple Data • Single instruction: all processing units execute the same instruction at any given clock cycle. • Multiple data: each processing unit can operate on a different data element. • This type of machine typically has an instruction dispatcher, a very high-bandwidth internal network, and a very large array of very small-capacity instruction units. • Best suited for specialized problems characterized by a high degree of regularity, such as image processing. • Synchronous (lockstep) and deterministic execution • Two varieties: Processor Arrays and Vector Pipelines • Examples: • Processor Arrays: Connection Machine CM-2, Maspar MP-1, MP-2 • Vector Pipelines: IBM 9000, Cray C90, Fujitsu VP, NEC SX-2, Hitachi S820

  29. MISD Multiple Instruction, Single Data • Few actual examples of this class of parallel computer have ever existed. • Some conceivable examples might be: • multiple frequency filters operating on a single signal stream • multiple cryptography algorithms attempting to crack a single coded message.

  30. MIMD Multiple Instruction, Multiple Data • Currently, the most common type of parallel computer • Multiple Instruction: every processor may be executing a different instruction stream. • Multiple Data: every processor may be working with a different data stream. • Execution can be synchronous or asynchronous, deterministic or non-deterministic. • Examples: most current supercomputers, networked parallel computer "grids" and multi-processor SMP computers - including some types of PCs.

  31. Classification: Communication Model Shared vs. Distributed Memory systems

  32. Shared Memory: UMA vs. NUMA

  33. Distributed Memory: MPPs vs. Clusters • Processor-memory nodes are connected by some type of interconnect network • Massively Parallel Processor (MPP): tightly integrated, single system image. • Cluster: individual computers connected by SW [Figure: interconnect network linking processor (P) and memory (M) nodes]

  34. Distributed Shared-Memory • Virtual shared memory (shared address space) • on hardware level • on software level • Global address space spanning all of the memory in the system. • E.g., HPF, TreadMarks, SW for NoW (networks of workstations): JavaParty, Manta, Jackal

  35. Parallel vs. Distributed Computing • Parallel computing usually considers dedicated homogeneous HPC systems to solve parallel problems. • Distributed computing extends the parallel approach to heterogeneous general-purpose systems. • Both look at the parallel formulation of a problem. • Reliability, security and heterogeneity are usually not considered in parallel computing, but they are considered in Grid computing. • “A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.” (Leslie Lamport)

  36. Parallel and Distributed Computing • Parallel computing: • Shared-Memory SIMD • Distributed-Memory SIMD • Shared-Memory MIMD • Distributed-Memory MIMD • Behind DM-MIMD: • Distributed computing and Clusters • Behind parallel and distributed computing: • Metacomputing [Figure label: SCALABILITY]

  37. Tutorial Outline • Part 1: Overview of High-Performance Computing • Technology trends • Parallel and Distributed Computing architectures • Programming paradigms • Programming models • Problem decomposition • Parallel programming issues

  38. Programming Paradigms Parallel Programming Models • Control • how is parallelism created • what orderings exist between operations • how do different threads of control synchronize • Naming • what data is private vs. shared • how logically shared data is accessed or communicated • Set of operations • what are the basic operations • what operations are considered to be atomic • Cost • how do we account for the cost of each of the above

  39. Model 1: Shared Address Space • Program consists of a collection of threads of control, • each with a set of private variables • e.g., local variables on the stack • collectively with a set of shared variables • e.g., static variables, shared common blocks, global heap • Threads communicate implicitly by writing and reading shared variables • Threads coordinate explicitly by synchronization operations on shared variables • writing and reading flags • locks, semaphores • Like concurrent programming on a uniprocessor
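
A minimal sketch of this model in Python (illustrative only; it relies on the standard threading module, and CPython's GIL limits true parallelism for CPU-bound code):

```python
import threading

total = 0                      # logically shared variable
lock = threading.Lock()        # explicit synchronization on shared state

def worker(chunk):
    global total
    local_sum = sum(chunk)     # private/local work on this thread's stack
    with lock:                 # explicit coordination before the shared write
        total += local_sum

data = list(range(1000))
threads = [threading.Thread(target=worker, args=(data[i::4],)) for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(total)                   # 499500: threads communicated through shared memory
```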

  40. Model 2: Message Passing • Program consists of a collection of named processes • thread of control plus local address space • local variables, static variables, common blocks, heap • Processes communicate by explicit data transfers • matching pair of send & receive by source and dest. proc. • Coordination is implicit in every communication event • Logically shared data is partitioned over local processes • Like distributed programming • Program with standard libraries: MPI, PVM • aka shared nothing architecture, or a multicomputer
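
The same reduction in message passing style, sketched with mpi4py (an assumption: MPI and mpi4py are installed; run with something like `mpirun -n 4 python sum_mpi.py`):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Logically shared data is partitioned: each named process owns a local slice.
local_data = np.arange(rank, 1000, size)
local_sum = local_data.sum()

# Explicit communication: a reduction gathers and combines the partial sums on rank 0.
total = comm.reduce(local_sum, op=MPI.SUM, root=0)
if rank == 0:
    print(total)   # 499500
```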

  41. Model 3: Data Parallel • Single sequential thread of control consisting of parallel operations • Parallel operations applied to all (or defined subset) of a data structure • Communication is implicit in parallel operators and “shifted” data structures • Elegant and easy to understand • Not all problems fit this model • Vector computing
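
In data parallel style the program stays a single sequential thread of control and the parallelism lives inside whole-array operations, as in this NumPy sketch (NumPy standing in for a data parallel language such as HPF):

```python
import numpy as np

a = np.arange(1_000_000, dtype=np.float64)

b = np.sqrt(a)                       # one operation applied to the whole structure
c = np.where(a % 2 == 0, b, 0.0)     # operation applied to a defined subset
print(c.sum())                       # communication/implementation details are implicit
```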

  42. SIMD Machine • An SIMD (Single Instruction Multiple Data) machine • a large number of small processors • a single “control processor” issues each instruction • each processor executes the same instruction • some processors may be turned off on any instruction • Machines are not popular (CM-2), but the programming model is • implemented by mapping n-fold parallelism to p processors • mostly done in the compilers (HPF = High Performance Fortran) [Figure: control processor driving a processor array through an interconnect]

  43. Model 4: Hybrid • Shared memory machines (SMPs) are the fastest commodity machines. Why not build a larger machine by connecting many of them with a network? • CLUMP = Cluster of SMPs • Shared memory within one SMP, message passing outside • Examples: clusters, ASCI Red (Intel), ... • Programming model? • Treat the machine as “flat” and always use message passing, even within an SMP (simple, but ignores an important part of the memory hierarchy) • Expose two layers: shared memory (OpenMP) and message passing (MPI): higher performance, but ugly to program.
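
A two-layer sketch of the hybrid model (assumptions: mpi4py is available and one MPI process runs per SMP node; Python threads play the role of the shared memory layer):

```python
from concurrent.futures import ThreadPoolExecutor
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

local = np.arange(rank, 1_000_000, size)       # message passing layer: data split across nodes

def chunk_sum(i, nthreads=4):                  # shared memory layer: threads inside one process
    return local[i::nthreads].sum()

with ThreadPoolExecutor(max_workers=4) as pool:
    local_sum = sum(pool.map(chunk_sum, range(4)))

total = comm.reduce(local_sum, op=MPI.SUM, root=0)
if rank == 0:
    print(total)
```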

  44. Hybrid Systems

  45. Model 5: BSP • Bulk Synchronous Processing (BSP) (L. Valiant, 1990) • Used within the message passing or shared memory models as a programming convention • Phases separated by global barriers • Compute phases: all operate on local data (in distributed memory) • or read access to global data (in shared memory) • Communication phases: all participate in rearrangement or reduction of global data • Generally all doing the “same thing” in a phase • all do f, but may all do different things within f • Simplicity of data parallelism without restrictions [Figure: a BSP superstep]
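
A toy BSP superstep with Python threads and a global barrier (purely illustrative; a real BSP program would typically sit on top of message passing):

```python
import threading

P = 4
data = [list(range(i, 100, P)) for i in range(P)]   # local data per "processor"
partial = [0] * P
result = [0] * P
barrier = threading.Barrier(P)                      # global barrier between phases

def superstep(pid):
    partial[pid] = sum(data[pid])    # compute phase: local data only
    barrier.wait()                   # end of superstep: global synchronization
    result[pid] = sum(partial)       # next phase: reduction over the global data
    barrier.wait()

threads = [threading.Thread(target=superstep, args=(p,)) for p in range(P)]
for t in threads: t.start()
for t in threads: t.join()
print(result)                        # every "processor" ends up with 4950
```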

  46. Problem Decomposition • Domain decomposition → data parallel • Functional decomposition → task parallel
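
The two styles on a toy example (hypothetical names, Python sketch): domain decomposition gives every worker the same task on a different chunk of the data, functional decomposition gives each worker a different task in a pipeline.

```python
import numpy as np

records = np.random.rand(1_000_000)

# Domain decomposition (data parallel): same operation, different chunks.
chunks = np.array_split(records, 4)            # one chunk per worker
partial_sums = [c.sum() for c in chunks]       # each would run on its own processor
print(sum(partial_sums))

# Functional decomposition (task parallel): different stages on the same data stream.
def clean(x):     return x[np.isfinite(x)]
def transform(x): return np.log1p(x)
def aggregate(x): return x.sum()

print(aggregate(transform(clean(records))))    # stages could run as a pipeline
```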

  47. Parallel Programming • directives-based data-parallel language • Such as High Performance Fortran (HPF) or OpenMP • Serial code is made parallel by adding directives (which appear as comments in the serial code) that tell the compiler how to distribute data and work across the processors. • The details of how data distribution, computation, and communications are to be done are left to the compiler. • Usually implemented on shared-memory architectures. • Message Passing (e.g. MPI, PVM) • very flexible approach based on explicit message passing via library calls from standard programming languages • It is left up to the programmer to explicitly divide data and work across the processors as well as manage the communications among them. • Multi-threading in distributed environments • Parallelism is transparent to the programmer • Shared-memory or distributed shared-memory systems

  48. Parallel Programming Issues • The main goal of a parallel program is to get better performance over the serial version. • Performance evaluation • Important issues to take into account: • Load balancing • Minimizing communication • Overlapping communication and computation

  49. Speedup • Serial fraction: fs = (time spent in the serial part) / Ts • Parallel fraction: fp = 1 - fs • Speedup: S(p) = Ts / Tp • Superlinear speedup is, in general, impossible; but it may arise in two cases: • memory hierarchy phenomena • search algorithms [Figure: serial time Ts split into fs and fp, parallel time Tp on processors P1–P4; speedup curves vs. p: superlinear, linear, sublinear]
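
A small sketch of how these quantities would be computed from measured run times (the timings below are made up):

```python
def speedup(Ts, Tp):
    return Ts / Tp             # serial time over parallel time on p processors

Ts = 120.0                     # hypothetical serial run time in seconds
for p, Tp in [(2, 65.0), (4, 36.0), (8, 22.0)]:
    S = speedup(Ts, Tp)
    kind = "superlinear" if S > p else "linear" if S == p else "sublinear"
    print(f"p={p}: speedup {S:.2f} ({kind})")
```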

  50. Maximum Speedup • Amdahl’s Law states that potential program speedup is defined by the fraction of code (fp) which can be parallelized.
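
The usual closed form of Amdahl's law, S(p) = 1 / (fs + fp/p) with fs = 1 - fp, makes the ceiling explicit: as p grows, the speedup approaches 1/fs. A quick numeric check:

```python
def amdahl_speedup(fp, p):
    fs = 1.0 - fp                      # serial fraction
    return 1.0 / (fs + fp / p)

for fp in (0.5, 0.9, 0.99):
    values = ", ".join(f"p={p}: {amdahl_speedup(fp, p):.1f}" for p in (2, 16, 1024))
    print(f"fp={fp}: {values}   (limit {1.0 / (1.0 - fp):.0f}x)")
# Even a 99% parallel code tops out near 100x, no matter how many processors are used.
```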
