IR — Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
Paradigm shift: Web 2.0 is about the many
Do big DATA need big PCs ?? (an Italian ad of the '80s about a BIG brush or a brush BIG...)
big DATA ⇒ big PC ? • We have three types of algorithms: T1(n) = n, T2(n) = n², T3(n) = 2ⁿ ... and assume that 1 step = 1 time unit • How many input data n can each algorithm process within t time units? n1 = t, n2 = √t, n3 = log2 t • What about a k-times faster processor? ...or, what is n when the available time is k·t? n1 = k·t, n2 = √k · √t, n3 = log2(kt) = log2 k + log2 t
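The point above can be checked numerically; a minimal sketch (the time budget t and speed-up k below are illustrative values, not from the slide):

```python
import math

def max_input(t):
    """Largest input size n each algorithm can process in t time units."""
    return {
        "T1(n)=n":   t,                  # n1 = t
        "T2(n)=n^2": math.isqrt(t),      # n2 = floor(sqrt(t))
        "T3(n)=2^n": int(math.log2(t)),  # n3 = floor(log2 t)
    }

t, k = 1_000_000, 100  # a k-times faster processor gives budget k*t
before, after = max_input(t), max_input(k * t)
for name in before:
    print(name, before[name], "->", after[name])
```

The linear algorithm gains a full factor k, the quadratic one only √k, and the exponential one a mere additive log2 k: faster hardware barely helps super-polynomial algorithms.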
A new scenario for Algorithmics • Data are more available than ever before: n ➜ ∞ ... is more than a theoretical assumption • The RAM model is too simple: the step cost is ω(1)
The memory hierarchy: CPU (registers) → L1/L2 cache → RAM → HD → net • Cache: few MBs, some nanosecs, few words fetched • RAM: few GBs, tens of nanosecs, some words fetched • HD: few TBs, few millisecs, B = 32K page • Net: many TBs, even secs, packets
Does Virtual Memory help ? • M = memory size, N = problem size • p = prob. of a memory access [0.3–0.4 (Hennessy–Patterson)] • C = cost of an I/O [10⁵–10⁶ (Hennessy–Patterson)] If N ≤ M, then the cost per step is 1. If N = (1+ε)·M, then the avg cost per step is 1 + C·p·ε/(1+ε), which is at least 10⁴ · ε/(1+ε). If ε = 1/1000 (e.g. M = 1GB, N = 1GB + 1MB), the avg step-cost is > 20.
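The formula above is easy to check numerically; a quick sketch (p and C are taken at the low end of the slide's Hennessy–Patterson ranges):

```python
def avg_step_cost(p, C, eps):
    """Average cost per step when N = (1+eps)*M:
    a fraction p*eps/(1+eps) of the steps touches out-of-core data, paying C."""
    return 1 + C * p * eps / (1 + eps)

# e.g. M = 1 GB, N = 1 GB + 1 MB  =>  eps = 1/1000
cost = avg_step_cost(p=0.3, C=10**5, eps=1/1000)
print(round(cost, 1))  # ~31: every "unit-cost" step now costs ~30x on average
```

Even a 0.1% overflow of main memory blows the average step cost up by more than an order of magnitude, which is why the RAM model's unit-cost assumption breaks.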
The I/O-model [figure: a hard disk — read/write head, arm, tracks on a magnetic surface — connected to RAM and CPU] • Count I/Os: each I/O moves one block of B items • Spatial locality or temporal locality ⇒ fewer and faster I/Os via caching “The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk.” (D. Comer)
Other issues ⇒ other models • Random vs sequential I/Os: scanning is better than jumping ⇒ streaming algorithms • Not just one CPU: many PCs, multi-core CPUs or even GPUs ⇒ parallel or distributed algorithms • Parameter-free algorithms: anywhere, anytime, anyway... optimal !! ⇒ cache-oblivious algorithms
What about energy consumption ? [Leventhal, CACM 2008] Disks deliver ≈10 I/Os per second per watt; flash delivers ≈6000 I/Os per second per watt.
Our topics, on an example [figure: search-engine architecture] • Crawler: which pages to visit next? • Page archive + Page analyzer: hashing, data compression • Indexer: sorting, dictionaries (text + auxiliary structure) • Query resolver + Ranker: clustering, classification, linear algebra
Warm up... • Take Wikipedia in Italian, and compute word frequencies: • Few GBs ⇒ n ≈ 10⁹ words • How do you proceed ?? • Tokenize into a sequence of strings • Sort the strings • Create tuples ⟨word, freq⟩
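The three steps above can be sketched in-memory (the sample text is illustrative; at Wikipedia scale the sort would have to be external, as the next slides discuss):

```python
import re
from itertools import groupby

def word_freq(text):
    # 1. Tokenize into a sequence of strings
    words = re.findall(r"[a-z]+", text.lower())
    # 2. Sort the strings, so that equal words become adjacent
    words.sort()
    # 3. Scan the sorted run and emit <word, freq> tuples
    return [(w, len(list(g))) for w, g in groupby(words)]

print(word_freq("the cat and the dog and the bird"))
# [('and', 2), ('bird', 1), ('cat', 1), ('dog', 1), ('the', 3)]
```

Sorting replaces random-access hashing with a sequential scan over adjacent duplicates, which is exactly the access pattern that the I/O-model rewards.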
Binary Merge-Sort Merge-Sort(A,i,j) 01 if (i < j) then // Divide 02 m = (i+j)/2; 03 Merge-Sort(A,i,m); // Conquer 04 Merge-Sort(A,m+1,j); 05 Merge(A,i,m,j) // Combine — Merge is linear in the #items to be merged
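A runnable version of the pseudocode above (a minimal in-memory sketch; the I/O-aware variants come in the next slides):

```python
def merge(A, i, m, j):
    """Merge the sorted runs A[i..m] and A[m+1..j]: linear in the #items merged."""
    left, right = A[i:m + 1], A[m + 1:j + 1]
    a = b = 0
    for k in range(i, j + 1):
        if b >= len(right) or (a < len(left) and left[a] <= right[b]):
            A[k] = left[a]; a += 1
        else:
            A[k] = right[b]; b += 1

def merge_sort(A, i, j):
    """Sort A[i..j] in place, mirroring Merge-Sort(A,i,j) above."""
    if i < j:                    # Divide
        m = (i + j) // 2
        merge_sort(A, i, m)      # Conquer
        merge_sort(A, m + 1, j)
        merge(A, i, m, j)        # Combine

A = [19, 12, 7, 15, 4, 8, 3, 13]
merge_sort(A, 0, len(A) - 1)
print(A)  # [3, 4, 7, 8, 12, 13, 15, 19]
```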
But... a few key observations: • Items = (short) strings = atomic... • Θ(n log n) memory accesses (I/Os ??) • [5ms] · n log2 n ≈ 3 years In practice it is faster. Why?
Implicit caching... • First, N/M runs, each sorted in internal memory (no extra I/Os) • Then, log2(N/M) merge levels; each level makes 2 passes over the data (one read, one write) = 2·(N/B) I/Os ⇒ the I/O-cost of binary merge-sort is ≈ 2·(N/B)·log2(N/M)
A key inefficiency After a few steps, every run is longer than B !!! Binary merging uses only 3 pages (2 input buffers + 1 output buffer), but memory contains M/B pages ≈ 2³⁰/2¹⁵ = 2¹⁵
Multi-way Merge-Sort • Sort N items with main memory M and disk pages of B items: • Pass 1: produce N/M sorted runs • Pass i: merge X = M/B − 1 runs at a time ⇒ log_X (N/M) passes [figure: X input buffer pages, one per run, plus one output page in main memory, streaming from/to disk]
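One merge step of the multi-way scheme can be sketched with a min-heap over the X run heads (here the runs are in-memory lists; on disk, each run would be refilled one B-page at a time through its input buffer):

```python
import heapq

def multiway_merge(runs):
    """Merge X sorted runs into one sorted output.
    Each pop/push costs O(log X), so merging n items costs O(n log X)."""
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, pos = heapq.heappop(heap)
        out.append(val)
        if pos + 1 < len(runs[i]):  # refill from the same run
            heapq.heappush(heap, (runs[i][pos + 1], i, pos + 1))
    return out

runs = [[1, 5, 9], [2, 4, 11], [3, 7, 8]]
print(multiway_merge(runs))  # [1, 2, 3, 4, 5, 7, 8, 9, 11]
```

Because all X runs are consumed in a single scan, one pass now halves the run count by a factor X = M/B − 1 instead of 2, which is where the log_{M/B} in the I/O bound comes from.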
Cost of Multi-way Merge-Sort • Number of passes = log_X (N/M) ≈ log_{M/B} (N/M) • Total I/O-cost is Θ( (N/B) · log_{M/B} (N/M) ) I/Os • Note: log_{M/B} M = log_{M/B} [(M/B)·B] = 1 + log_{M/B} B, so up to constants the M inside the log can be replaced by B • In practice, M/B ≈ 10⁵ ⇒ #passes = 1 ⇒ few mins (tuning depends on disk features) • A large fan-out (M/B) decreases #passes • Compression would decrease the cost of a pass!
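Plugging in concrete values shows why one pass suffices in practice (M, B, N below are illustrative assumptions in the spirit of the slide's M/B ≈ 10⁵):

```python
import math

def merge_passes(N, M, B):
    """Merge passes of multi-way merge-sort after run formation:
    ceil(log_{M/B}(#runs)) with #runs = ceil(N/M)."""
    fan_out = M // B
    runs = math.ceil(N / M)
    if runs <= 1:
        return 0                      # everything fits in memory
    return math.ceil(math.log(runs, fan_out))

# M = 2^30 items of memory, B = 2^15 items per page, N = 2^40 items on disk
print(merge_passes(2**40, 2**30, 2**15))  # fan-out 2^15 covers the 2^10 runs: 1 pass
```

With fan-out 2¹⁵, even 2¹⁰ initial runs are merged in a single pass, whereas binary merging would need log2(2¹⁰) = 10 passes over the whole dataset.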
I/O-lower bound for Sorting • Every I/O fetches B items into a memory of size M • Decision tree with fan-out (M choose B): the B fetched items can interleave with the M items in memory in at most (M choose B) ways • In the N/B steps that read a block for the first time, there is an extra B! factor of comparison outcomes • Find t > N/B such that (M choose B)^t · (B!)^{N/B} ≥ N! • We get t = Ω( (N/B) · log_{M/B} (N/B) ) I/Os
Keep attention... Indirect sort If sorting needs to manage arbitrarily long strings, a few key observations: • Array A is an “array of pointers to objects”: the strings live elsewhere in memory • Each object-to-object comparison A[i] vs A[j] makes 2 random accesses to 2 memory locations • Θ(n log n) random memory accesses (I/Os ??) Again, caching helps, but it may be less effective than before
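Indirect sorting can be sketched by sorting an array of indices (stand-ins for the pointers), where every comparison must dereference into the string area:

```python
def indirect_sort(strings):
    """Sort an array of 'pointers' (indices) instead of the strings themselves.
    Each comparison follows two pointers into the string area:
    two random memory accesses per comparison, Theta(n log n) in total."""
    A = list(range(len(strings)))     # A = array of pointers to objects
    A.sort(key=lambda i: strings[i])  # dereference on every comparison
    return A

S = ["merge", "sort", "binary", "disk", "cache"]
order = indirect_sort(S)
print([S[i] for i in order])  # ['binary', 'cache', 'disk', 'merge', 'sort']
```

The permutation moves only small fixed-size pointers, but the comparisons jump unpredictably around memory, which is why caching pays off less here than on the contiguous runs of merge-sort.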