BIG DATA Algorithms
Big data • everyone talks about it, • nobody really knows how to do it, • everyone thinks everyone else is doing it, • so everyone claims they are doing it...
Is there anything fundamentally new? • Massive Data vs Big Data • The 3 V’s • Volume • Velocity • Variety
Big Data Algorithms • Distributed algorithms • Data stream algorithms • External memory algorithms • Parallel algorithms
Computational models for big data • "All models are wrong, but some are useful." (George E. P. Box)
What’s the Bottleneck? • CPU speed approaching limit • Does it matter? • From CPU-intensive computing to data-intensive computing • Algorithm has to be near-linear, linear, or even sub-linear! • Data movement, i.e., communication is the bottleneck!
Random Access Machine Model • Standard theoretical model of computation: • Unlimited memory • Uniform access cost • Simple model crucial for success of computer industry
Hierarchical Memory • Modern machines have a complicated memory hierarchy • Levels get larger and slower further away from the CPU • Data is moved between levels in large blocks (figure: memory hierarchy with L1, L2, RAM)
Slow I/O • Disk access is 10^6 times slower than main memory access • “The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk.” (D. Comer) • Disk systems try to amortize the large access time by transferring large contiguous blocks of data (8-16 KB) • Important to store/access data so as to take advantage of blocks (locality) (figure: disk with read/write arm, track, magnetic surface)
Scalability Problems • Most programs are developed in the RAM model • They run on large datasets because the OS moves blocks as needed • Modern operating systems use sophisticated paging and prefetching strategies • But if a program makes scattered accesses, even a good OS cannot take advantage of block access ⇒ scalability problems! (figure: running time vs. data size)
External Memory Model • N = # of items in the problem instance • B = # of items per disk block • M = # of items that fit in main memory • I/O: # of blocks moved between memory and disk • CPU time is ignored • Successful model used extensively in the massive data algorithms and database communities (figure: processor P, main memory M, disk D, block I/O)
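Fundamental Bounds (internal vs. external) • Scanning: N vs. N/B • Sorting: N log N vs. (N/B) log_{M/B}(N/B) • Permuting: N vs. min{N, (N/B) log_{M/B}(N/B)} • Searching: log_2 N vs. log_B N • Note: • Linear I/O: O(N/B) • Permuting not linear • Permuting and sorting bounds are equal in all practical cases • B factor VERY important: N/B < (N/B) log_{M/B}(N/B) << N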
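To get a feel for how far apart these bounds are, the small sketch below evaluates them for one concrete machine. The values of N, M and B are hypothetical and chosen only to make the logarithms come out as round numbers; constant factors are dropped, so the results are orders of magnitude, not measured I/O counts.

```python
import math

def log_base(x, base):
    """log_base(x), computed via log2 so powers of two stay exact."""
    return math.log2(x) / math.log2(base)

def scan_ios(N, B):
    """Linear I/O: blocks touched when reading N items sequentially."""
    return math.ceil(N / B)

def sort_ios(N, M, B):
    """(N/B) * log_{M/B}(N/B), with the log clamped to at least one pass."""
    n_blocks = N / B
    return n_blocks * max(1.0, log_base(n_blocks, M / B))

def permute_ios(N, M, B):
    """min{N, sorting bound}: move items one by one, or sort them."""
    return min(N, sort_ios(N, M, B))

def search_ios(N, B):
    """log_B N: height of a search tree with fan-out B."""
    return log_base(N, B)

# Hypothetical sizes (in items): 2^30 items, 2^20 fit in memory, 2^10 per block.
N, M, B = 2**30, 2**20, 2**10
print("scan   :", scan_ios(N, B))
print("sort   :", sort_ios(N, M, B))
print("permute:", permute_ios(N, M, B))
print("search :", search_ios(N, B))
print("one I/O per item:", N)
```

With these assumed sizes the sorting bound is only a factor of two above a scan, while one I/O per item is about a thousand times worse, which is the point of the "B factor" remark.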
Queues and Stacks • Queue: maintain one push block and one pop block in main memory ⇒ O(1/B) I/Os per operation (amortized) • Stack: maintain one push/pop block in main memory ⇒ O(1/B) I/Os per operation (amortized)
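The stack idea can be sketched as follows. This is a simulation, not a real disk-backed structure: the "disk" is a Python list of blocks and `ios` is a counter. One detail is an assumption beyond the slide: the in-memory buffer is allowed to hold up to 2B items, so that alternating pushes and pops at a block boundary cannot trigger an I/O on every operation.

```python
class ExternalStack:
    """Simulated external-memory stack that counts block transfers."""

    def __init__(self, B):
        self.B = B
        self.buffer = []  # in-memory part of the stack, top at the end
        self.disk = []    # simulated disk: list of full blocks of B items
        self.ios = 0      # number of block reads and writes so far

    def push(self, x):
        if len(self.buffer) == 2 * self.B:
            # Buffer full: write the older half to disk as one block.
            self.disk.append(self.buffer[:self.B])
            self.buffer = self.buffer[self.B:]
            self.ios += 1
        self.buffer.append(x)

    def pop(self):
        if not self.buffer:
            if not self.disk:
                raise IndexError("pop from empty stack")
            # Buffer empty: read the most recently written block back.
            self.buffer = self.disk.pop()
            self.ios += 1
        return self.buffer.pop()


stack = ExternalStack(B=10)
for i in range(1000):
    stack.push(i)
popped = [stack.pop() for _ in range(1000)]
print("I/Os for 2000 operations:", stack.ios)
```

Every block transfer is preceded by at least B operations that touch only the buffer, which is where the amortized O(1/B) bound comes from.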
Sorting • Merge sort: • Create N/M memory-sized sorted lists • Repeatedly merge lists together, Θ(M/B) at a time ⇒ O(log_{M/B}(N/M)) phases using O(N/B) I/Os each ⇒ O((N/B) log_{M/B}(N/B)) I/Os
Sorting • Fewer than M/B sorted lists (queues) can be merged in O(N/B) I/Os • One block per list is kept in main memory (M/B blocks in total) • The M/B head elements are kept in a heap in main memory
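A minimal in-memory sketch of the two sorting slides: form runs of M items, then repeatedly merge about M/B runs at a time, keeping the head element of each run in a heap. Lists stand in for disk-resident runs, and the fan-in of M/B - 1 (one block reserved for output) is an assumption made for this illustration.

```python
import heapq

def merge_runs(runs):
    """Merge sorted runs using a heap that holds one head element per run."""
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        value, i, pos = heapq.heappop(heap)
        out.append(value)
        if pos + 1 < len(runs[i]):
            # Replace the consumed head by the next element of the same run.
            heapq.heappush(heap, (runs[i][pos + 1], i, pos + 1))
    return out

def external_merge_sort(data, M, B):
    """Return (sorted list, number of merge phases)."""
    # Phase 0: create memory-sized sorted runs.
    runs = [sorted(data[i:i + M]) for i in range(0, len(data), M)]
    fan_in = max(2, M // B - 1)  # assumed: one block kept free for output
    phases = 0
    while len(runs) > 1:
        runs = [merge_runs(runs[i:i + fan_in])
                for i in range(0, len(runs), fan_in)]
        phases += 1
    return (runs[0] if runs else []), phases


# 1000 items, M = 64, B = 8: 16 initial runs, merged 7 at a time.
data = [(i * 7919) % 1000 for i in range(1000)]
result, phases = external_merge_sort(data, M=64, B=8)
print(result[:5], phases)
```

Each phase reads and writes every item once, so it costs O(N/B) I/Os in the external memory model; the number of phases is the logarithmic factor in the sorting bound.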
Toy Experiment: Permuting • Problem: • Input: N elements out of order: 6, 7, 1, 3, 2, 5, 10, 9, 4, 8 • Each element knows its correct position • Output: store them on disk in the right order • Internal memory solution: • Just scan the original sequence and move every element to the right place! • O(N) time, O(N) I/Os • External memory solution: • Use sorting • O(N log N) time, O((N/B) log_{M/B}(N/B)) I/Os
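The following sketch counts output-block loads for the toy input above when elements are placed one at a time. It assumes, purely for illustration, that only the most recently used output block stays in memory and that B = 2; a real system would cache more blocks, but scattered targets defeat any small cache in the same way.

```python
def placement_ios(targets, B):
    """Count output-block loads when items are written in the given order.

    targets: 1-based final position of each item, in arrival order.
    Assumption: only the last used output block is held in memory.
    """
    ios = 0
    current_block = None
    for t in targets:
        block = (t - 1) // B
        if block != current_block:
            ios += 1
            current_block = block
    return ios


toy = [6, 7, 1, 3, 2, 5, 10, 9, 4, 8]
print("scattered placement:", placement_ios(toy, B=2))
print("after sorting      :", placement_ios(sorted(toy), B=2))
```

Once the elements arrive in sorted order, each output block is written exactly once, so the final write costs N/B I/Os; the price of getting there is the sorting bound.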
Searching in External Memory • Store N elements in a data structure such that • Given a query element x, find it or its predecessor
B-trees • BFS-blocking naturally corresponds to a tree with fan-out Θ(B) • B-trees are balanced by allowing the node degree to vary • Rebalancing is performed by splitting and merging nodes
(a,b)-tree • T is an (a,b)-tree (a ≥ 2 and b ≥ 2a-1) if: • All leaves are on the same level (and contain between a and b elements) • Except for the root, all nodes have degree between a and b • Root has degree between 2 and b • An (a,b)-tree uses linear space and has height O(log_a N) • Choosing a,b = Θ(B), each node/leaf is stored in one disk block ⇒ O(N/B) space and O(log_B N) query (figure: a (2,4)-tree)
(a,b)-Tree Insert • Insert: search and insert element in leaf v, then rebalance:
    WHILE v has b+1 elements/children DO
        Split v: make nodes v’ and v’’ with ⌊(b+1)/2⌋ and ⌈(b+1)/2⌉ elements
        Insert element (ref) in parent(v) (make new root if necessary)
        v = parent(v)
• Insert touches O(log_a N) nodes
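The insert procedure can be sketched in memory as below. This is an illustrative sketch, not the deck's implementation: the class and method names are made up, elements live only in the leaves, internal nodes store routing keys (the smallest key of the subtree to their right), duplicates are not treated specially, and no disk blocks are involved.

```python
import bisect

class Node:
    def __init__(self, leaf):
        self.leaf = leaf
        self.keys = []      # leaf: elements; internal: routing keys
        self.children = []  # internal nodes only

class ABTree:
    def __init__(self, a=2, b=4):
        assert a >= 2 and b >= 2 * a - 1
        self.a, self.b = a, b
        self.root = Node(leaf=True)

    def _size(self, v):
        return len(v.keys) if v.leaf else len(v.children)

    def _split(self, v):
        """Split an overfull node; return (new right node, separator key)."""
        right = Node(leaf=v.leaf)
        if v.leaf:
            mid = len(v.keys) // 2  # floor((b+1)/2) elements stay in v
            right.keys, v.keys = v.keys[mid:], v.keys[:mid]
            return right, right.keys[0]
        mid = len(v.children) // 2
        sep = v.keys[mid - 1]       # key between the two halves moves up
        right.children, v.children = v.children[mid:], v.children[:mid]
        right.keys, v.keys = v.keys[mid:], v.keys[:mid - 1]
        return right, sep

    def insert(self, x):
        # Search, remembering the path (node, child index) to the leaf.
        path, v = [], self.root
        while not v.leaf:
            i = bisect.bisect_right(v.keys, x)
            path.append((v, i))
            v = v.children[i]
        bisect.insort(v.keys, x)
        # Rebalance bottom-up while v has b+1 elements/children.
        while self._size(v) > self.b:
            right, sep = self._split(v)
            if not path:            # v was the root: make a new root
                self.root = Node(leaf=False)
                self.root.keys = [sep]
                self.root.children = [v, right]
                break
            parent, i = path.pop()
            parent.keys.insert(i, sep)
            parent.children.insert(i + 1, right)
            v = parent

    def height(self):
        h, v = 1, self.root
        while not v.leaf:
            v, h = v.children[0], h + 1
        return h

    def items(self):
        """All elements in left-to-right leaf order."""
        out, stack = [], [self.root]
        while stack:
            v = stack.pop()
            if v.leaf:
                out.extend(v.keys)
            else:
                stack.extend(reversed(v.children))
        return out


tree = ABTree(a=2, b=4)
for x in [(i * 37) % 200 for i in range(200)]:
    tree.insert(x)
print(tree.height(), tree.items()[:10])
```

The loop only ever walks from the leaf toward the root, so an insert touches at most one node per level, matching the O(log_a N) bound.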
(a,b)-Tree Delete • Delete: search and delete element from leaf v, then rebalance:
    WHILE v has a-1 elements/children DO
        Fuse v with sibling v’: move children of v’ to v
        Delete element (ref) from parent(v) (delete root if necessary)
        IF v has >b (and ≤ a+b-1 < 2b) children THEN split v
        v = parent(v)
• Delete touches O(log_a N) nodes
(a,b)-Tree • (a,b)-tree properties: • Every update can cause O(log_a N) rebalancing operations • If b > 2a, O(1) rebalancing operations amortized • Why? (figure: a (2,3)-tree under insert and delete)
Summary/Conclusion: B-tree • B-trees: (a,b)-trees with a,b = Θ(B) • O(N/B) space • O(log_B N) query • O(log_B N) update • B-trees with elements in the leaves are sometimes called B+-trees • Nowadays B-tree and B+-tree are used as synonyms • Construction in O((N/B) log_{M/B}(N/B)) I/Os: • Sort elements and construct leaves • Build tree level-by-level bottom-up
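The bottom-up construction can be sketched as follows, assuming the keys are already sorted (for example by an external merge sort). The function name and the node representation, a plain list of at most B keys per node with upper levels storing the smallest key of each child, are choices made for this illustration. The last node on a level may end up underfull, which a full implementation would have to repair.

```python
def build_bottom_up(sorted_keys, B):
    """Return the levels of a B-tree-like index, leaves first, root last."""
    # Leaves: consecutive groups of B keys.
    level = [sorted_keys[i:i + B] for i in range(0, len(sorted_keys), B)]
    levels = [level]
    while len(level) > 1:
        # One routing key per child, then group B children under each parent.
        mins = [node[0] for node in level]
        level = [mins[i:i + B] for i in range(0, len(mins), B)]
        levels.append(level)
    return levels


levels = build_bottom_up(list(range(1000)), B=10)
print([len(level) for level in levels])
```

Each level is produced by one sequential scan of the level below, and the level sizes shrink geometrically, so after sorting the build itself costs only O(N/B) I/Os.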
Internal Priority Queues • Operations: • Required: Insert, DeleteMax, Max • Optional: Delete, Update • Implementation: • Binary tree • Heap (figure: heap animation, an Insertion of 65 followed by a DeleteMax that removes 100)
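For reference, the internal binary max-heap behind these slides can be sketched as an array-based structure with sift-up on Insert and sift-down on DeleteMax. The keys used in the example are the ones visible in the slide figure; the class itself is an illustrative sketch.

```python
class MaxHeap:
    """Array-based binary max-heap: parent of index i is (i - 1) // 2."""

    def __init__(self):
        self.a = []

    def insert(self, x):
        a = self.a
        a.append(x)
        i = len(a) - 1
        # Sift up: swap with the parent while the parent is smaller.
        while i > 0 and a[(i - 1) // 2] < a[i]:
            p = (i - 1) // 2
            a[i], a[p] = a[p], a[i]
            i = p

    def max(self):
        return self.a[0]

    def delete_max(self):
        a = self.a
        top = a[0]
        last = a.pop()
        if a:
            # Move the last element to the root and sift it down.
            a[0] = last
            i, n = 0, len(a)
            while True:
                left, right, big = 2 * i + 1, 2 * i + 2, i
                if left < n and a[left] > a[big]:
                    big = left
                if right < n and a[right] > a[big]:
                    big = right
                if big == i:
                    break
                a[i], a[big] = a[big], a[i]
                i = big
        return top


heap = MaxHeap()
for key in [100, 40, 90, 40, 30, 50, 29, 23, 15]:
    heap.insert(key)
heap.insert(65)
print(heap.delete_max(), heap.max())
```

Both operations follow a single root-to-leaf path, but consecutive levels of that path sit far apart in the array, which is why a plain binary heap is not I/O-efficient.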
How to Make the Heap I/O-Efficient • I/O Technique 1: Make it many-way • I/O Technique 2: Buffering!
External Heap • Insert buffer is kept in main memory • Heap has fan-out Θ(M/B) • Each node has Θ(M/B) blocks (a node may not be half-full) • Heap property: all elements in a child are smaller than those in its parent
External Heap: Insert • Same structure as above (figure: animation of an insert, with steps labeled sift-up and swap)
External Heap: DeleteMax • Same structure as above (figure: animation of a DeleteMax, with steps labeled refill and merge)
External Heap: I/O Analysis • What is the I/O cost for a sequence of N mixed insertions / deletemax operations? (the analysis in the paper is too complicated) • Height of heap: Θ(log_{M/B}(N/B)) • Insertions: • Wait until insert buffer is full (served at least Ω(M) inserts) • Then do one (occasionally two) bottom-up chains of sift-ups • Cost: O((M/B) ∙ log_{M/B}(N/B)) • Amortized cost per insert: O((1/B) ∙ log_{M/B}(N/B)) • DeleteMax: • Wait until root is below half full (served at least Ω(M) deletemax) • Then do one, two, sometimes a lot of refills… (bounding them per operation is a dead end) • Do one sift-up: this is easy
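As a rough numeric illustration of the amortized insert bound, the sketch below evaluates (1/B) ∙ log_{M/B}(N/B) and, for contrast, the O(log_B N) update cost of a B-tree from the earlier summary slide. The sizes are hypothetical and constants are dropped, so only the ratio between the two numbers is meaningful.

```python
import math

def log_base(x, base):
    """log_base(x), computed via log2 so powers of two stay exact."""
    return math.log2(x) / math.log2(base)

def heap_insert_ios(N, M, B):
    """Amortized I/Os per insert in the external heap, constants dropped."""
    return log_base(N / B, M / B) / B

def btree_update_ios(N, B):
    """I/Os per update in a B-tree, constants dropped."""
    return log_base(N, B)


# Hypothetical sizes (in items).
N, M, B = 2**30, 2**20, 2**10
print("external heap insert:", heap_insert_ios(N, M, B))
print("B-tree update       :", btree_update_ios(N, B))
```

Under these assumptions an insert costs a small fraction of one I/O on average, because a full buffer of Ω(M) inserts pays for each chain of sift-ups.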
External Heap: I/O Analysis • Cost of all refills: • Need a global argument • Idea: trace individual elements • Total amount of “work”: O(N log_{M/B}(N/B)) • One unit of work: move one element up one level • Refills do positive work • Sift-ups do both positive and negative work • |positive work done by refills| + |positive work done by sift-ups| - |negative work done by sift-ups| = O(N log_{M/B}(N/B)) • But note: |positive work done by sift-ups| > |negative work done by sift-ups| • So |positive work done by refills| = O(N log_{M/B}(N/B))