Algoritmi per IR: Prologue
References
• Managing Gigabytes, A. Moffat, T. Bell and I. Witten, Morgan Kaufmann Publishers, 1999.
• Mining the Web: Discovering Knowledge from Hypertext Data, S. Chakrabarti, Morgan Kaufmann Publishers, 2003.
• A bunch of scientific papers available on the course site!!
About this course
• It is a mix of algorithms for:
  • data compression
  • data indexing
  • data streaming (and sketching)
  • data searching
  • data mining
Massive data!!
Paradigm shift: Web 2.0 is about the many...
Big DATA vs. Big PC?
• We have three types of algorithms: T1(n) = n, T2(n) = n², T3(n) = 2ⁿ, and assume that 1 step = 1 time unit
• How large an input n can each algorithm process within t time units?
  • n1 = t, n2 = √t, n3 = log2 t
• What about a k-times faster processor? ...or, what is n when the available time is k·t?
  • n1 = k·t, n2 = √(k·t) = √k · √t, n3 = log2 (k·t) = log2 k + log2 t
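A minimal sketch of this back-of-the-envelope computation; the time budget t and speed-up k below are illustrative values, not from the slides:

    import math

    def max_input(t):
        """Largest n each algorithm can process in t time units (1 step = 1 unit)."""
        return {
            "T1(n)=n":   t,                    # linear: n = t
            "T2(n)=n^2": math.isqrt(t),        # quadratic: n = sqrt(t)
            "T3(n)=2^n": int(math.log2(t)),    # exponential: n = log2(t)
        }

    t, k = 10**6, 100                          # hypothetical time budget and speed-up
    before, after = max_input(t), max_input(k * t)
    for name in before:
        print(f"{name}: {before[name]} -> {after[name]}")
    # A 100x faster machine helps the linear algorithm 100x, the quadratic
    # one only 10x, and the exponential one by a mere log2(100) ~ 6.6 items.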
A new scenario
• Data are more available than ever before: n ➜ ∞ is more than a theoretical assumption
• The RAM model is too simple: a step costs Ω(1) time, and the constant varies wildly across the memory hierarchy
[Figure: the memory hierarchy]
• CPU registers / L1 / L2 cache: few Mbs, some nanosecs, few words fetched per access
• RAM: few Gbs, tens of nanosecs, some words fetched per access
• Hard disk (HD): few Tbs, few millisecs, B = 32K page fetched per access
• Network (net): many Tbs, even secs, packets
Not just MIN #steps: you should be "??-aware programmers"
I/O-conscious Algorithms
[Figure: disk anatomy: read/write head and arm, tracks, magnetic surface]
Spatial locality vs. temporal locality.
"The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one's desk or by taking an airplane to the other side of the world and using a sharpener on someone else's desk." (D. Comer)
The space issue
• M = memory size, N = problem size
• T(n) = time complexity of an algorithm using linear space
• p = fraction of memory accesses [0.3-0.4 (Hennessy-Patterson)]
• C = cost of an I/O [10^5-10^6 steps (Hennessy-Patterson)]
If N = (1+f)M, then the disk-average cost per step is C · p · f/(1+f). This is at least 10^4 · f/(1+f).
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of them: (1/B) · (p · f/(1+f) · C) ≈ 30 · f/(1+f)
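A quick numeric sanity check of the formula above; p, C, B and f below are one plausible choice within the slide's ballpark ranges, not its exact constants:

    # Average disk cost per step when the input is (1+f) times the memory size M.
    p, C, B = 0.35, 10**6, 4096    # access fraction, I/O cost (in steps), page size
    f = 1.0                        # N = 2M, i.e. half of the data lives on disk

    cost_per_step = p * C * f / (1 + f)
    print(cost_per_step)           # 175000.0 -> each step pays ~10^5 extra steps

    # If every fetched page of B items is fully used, the I/O cost is amortized:
    print(cost_per_step / B)       # ~42.7 -> locality buys roughly a B-fold saving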
Space-conscious Algorithms
Compressed data structures that support search and access operations while saving I/Os.
Streaming Algorithms
Data arrive continuously, or we wish to make FEW scans.
• Streaming algorithms:
  • Use few scans
  • Handle each element fast
  • Use small space
Cache-Oblivious Algorithms
Unknown and/or changing devices.
• Block access is important on all levels of the memory hierarchy
• But memory hierarchies are very diverse
• Cache-oblivious algorithms:
  • Explicitly, they do not assume any model parameters
  • Implicitly, they use blocks efficiently on all memory levels
Toy problem #1: Max Subarray
• Goal: Given a stock and its daily performance over time, find the time window in which it achieved the best "market performance".
• Math Problem: Find the subarray of maximum sum.
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
An optimal solution (we assume every subsum ≠ 0)
Algorithm:
  sum = 0; max = -1;
  for i = 1, ..., n do
    if (sum + A[i] ≤ 0) then sum = 0;
    else { sum += A[i]; max = MAX{max, sum}; }
  return max;
A = 2 -5 6 1 -2 4 3 -13 9 -6 7  (optimum: 6 1 -2 4 3, sum = 12)
• Note:
  • sum < 0 right before OPT starts, so it gets reset to 0;
  • sum > 0 within OPT.
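A runnable counterpart of the slide's one-pass scan; like the slide's version, it assumes the array contains at least one positive element:

    def max_subarray_sum(A):
        """One pass: reset the running sum whenever it would drop to <= 0."""
        best, running = float("-inf"), 0
        for x in A:
            if running + x <= 0:
                running = 0          # a window ending here cannot help any extension
            else:
                running += x
                best = max(best, running)
        return best

    A = [2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]
    print(max_subarray_sum(A))       # 12, achieved by the window 6 1 -2 4 3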
Toy problem #2: sorting
• How to sort tuples (objects) on disk?
• Key observation:
  • Array A is an "array of pointers to objects"
  • Each object-to-object comparison A[i] vs A[j] costs 2 random accesses to the memory locations pointed to by A[i] and A[j]
  • MergeSort therefore makes Θ(n log n) random memory accesses (random I/Os??)
[Figure: array A of pointers into the memory containing the tuples]
What about listing tuples in order?
B-trees for sorting? Using a well-tuned B-tree library (Berkeley DB): n insertions...
[Figure: B-tree internal nodes, B-tree leaves ("tuple pointers"), tuples]
Data get distributed arbitrarily!!! Possibly 10^9 random I/Os = 10^9 × 5ms ≈ 2 months.
Binary Merge-Sort
  Merge-Sort(A, i, j)
    if (i < j) then
      m = (i + j) / 2;             // Divide
      Merge-Sort(A, i, m);         // Conquer
      Merge-Sort(A, m + 1, j);     // Conquer
      Merge(A, i, m, j)            // Combine
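A runnable sketch of the pseudocode; the Merge routine, elided on the slide, is filled in here as the standard two-pointer merge:

    def merge_sort(A, i, j):
        """Sort A[i..j] in place, mirroring the slide's recursion."""
        if i < j:
            m = (i + j) // 2                         # Divide
            merge_sort(A, i, m)                      # Conquer left half
            merge_sort(A, m + 1, j)                  # Conquer right half
            A[i:j+1] = merge(A[i:m+1], A[m+1:j+1])   # Combine

    def merge(left, right):
        """Standard linear merge of two sorted lists."""
        out, li, ri = [], 0, 0
        while li < len(left) and ri < len(right):
            if left[li] <= right[ri]:
                out.append(left[li]); li += 1
            else:
                out.append(right[ri]); ri += 1
        return out + left[li:] + right[ri:]

    A = [19, 12, 7, 15, 4, 8, 3, 13]
    merge_sort(A, 0, len(A) - 1)
    print(A)                         # [3, 4, 7, 8, 12, 13, 15, 19]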
Cost of Merge-Sort on large data
• Take Wikipedia in Italian and compute word frequencies: n = 10^9 tuples, i.e. few Gbs
• Typical disk (Seagate Cheetah, 150Gb): seek time ≈ 5ms
• Analysis of Merge-Sort on disk:
  • It is an indirect sort: Θ(n log2 n) random I/Os
  • [5ms] × n log2 n ≈ 1.5 years
In practice it is faster, because of caching...
Merge-Sort Recursion Tree
[Figure: recursion tree of Merge-Sort over 16 keys, showing the sorted runs produced at each level]
• If the run-size is larger than B (i.e. after the first step!!), fetching all of it in memory for merging does not help
• Each level costs 2 passes (R/W) over the data, and there are log2 N levels
How do we deploy the disk/mem features? Sort N/M runs in internal memory (no I/Os); the I/O-cost for merging them is then ≈ 2 (N/B) log2 (N/M).
Multi-way Merge-Sort
• The key is to balance run-size and #runs to merge
• Sort N items with main memory M and disk pages of B items:
  • Pass 1: produce N/M sorted runs of size M each
  • Pass i: merge X = M/B runs at a time ⟹ log_{M/B} (N/M) merging passes
[Figure: X input buffers and one output buffer, each of B items, in main memory, streaming the runs from and to disk]
Multiway Merging
[Figure: X = M/B sorted runs merged into one output run]
Keep one buffer page Bf_i, with a pointer p_i, per run, plus an output buffer Bf_o. Repeatedly move min(Bf_1[p_1], Bf_2[p_2], …, Bf_X[p_X]) to Bf_o; fetch the next page of run i when p_i = B (until EOF), and flush Bf_o to the merged run on disk whenever it is full.
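A minimal in-memory sketch of the multiway merge, where a heap plays the role of the repeated min(...) selection; a real external version would read and write B-sized pages as in the figure:

    import heapq

    def multiway_merge(runs):
        """Merge X sorted runs into one sorted output, one min-extraction at a time."""
        heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
        heapq.heapify(heap)                 # holds the current head of each run
        out = []
        while heap:
            val, i, pos = heapq.heappop(heap)
            out.append(val)                 # external version: append to Bf_o and
                                            # flush it to disk whenever it fills up
            if pos + 1 < len(runs[i]):      # external version: fetch the next page
                heapq.heappush(heap, (runs[i][pos + 1], i, pos + 1))  # when p_i = B
            # otherwise run i is exhausted (EOF)
        return out

    runs = [[1, 5, 7, 9], [2, 10, 13, 19], [3, 4, 8, 15], [6, 11, 12, 17]]
    print(multiway_merge(runs))             # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ...]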
Cost of Multi-way Merge-Sort
• Number of passes = log_{M/B} #runs = log_{M/B} (N/M)
• Optimal cost = Θ((N/B) log_{M/B} (N/M)) I/Os
• In practice:
  • M/B ≈ 1000 ⟹ #passes = log_{M/B} (N/M) ≈ 1
  • One multiway merge ⟹ 2 passes (R/W) = few mins
Tuning depends on disk features:
• A large fan-out (M/B) decreases #passes
• Compression would decrease the cost of a pass!
Can compression help?
• Goal: enlarge M (keep more data in memory) and reduce N (move fewer bytes)
• #passes = O(log_{M/B} (N/M))
• Cost of a pass = O(N/B)
Part of Vitter's paper...
It also addresses issues related to:
• Disk striping: sorting easily on D disks
• Distribution sort: top-down sorting
• Lower bounds: how far we can go
Toy problem #3: Top-freq elements
• Goal: top queries over a stream of N items (N large).
• Math Problem: find the item y whose frequency is > N/2, using the smallest space (i.e. assuming the mode occurs > N/2; there are problems if it occurs ≤ N/2).
Algorithm
• Use a pair of variables <X, C>, initialized to <first item, 1>
• For each subsequent item s of the stream:
  if (X == s) then C++ else { C--; if (C == 0) { X = s; C = 1; } }
• Return X
Example: A = b a c c c d c b a a a c c b c c c
States <X,C>: <b,1> <a,1> <c,1> <c,2> <c,3> <c,2> <c,3> <c,2> <c,1> <a,1> <a,2> <a,1> <c,1> <b,1> <c,1> <c,2> <c,3>
Proof sketch: every decrement of C pairs one occurrence of the current X with one occurrence of a different item. If the algorithm ended with X ≠ y, every occurrence of y would have such a "negative" mate, so the mates would number ≥ #occ(y) and hence 2 · #occ(y) ≤ N, contradicting #occ(y) > N/2.
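A runnable sketch of the slide's one-pair algorithm (the "adopt on zero" variant traced above, a cousin of the classic Boyer-Moore majority vote):

    def majority_candidate(stream):
        """One pass, O(1) space: returns the majority item if one occurs > N/2 times."""
        it = iter(stream)
        X, C = next(it), 1                  # <X, C> seeded with the first item
        for s in it:
            if s == X:
                C += 1                      # another occurrence of the candidate
            else:
                C -= 1                      # pair this item with one occurrence of X
                if C == 0:
                    X, C = s, 1             # adopt s as the new candidate
        return X                            # correct only if the mode occurs > N/2

    A = list("bacccdcbaaaccbccc")
    print(majority_candidate(A))            # 'c' (9 occurrences out of 17 > N/2)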
Toy problem #4: Indexing
• Consider the following TREC collection:
  • N = 6 × 10^9 characters ⟹ size ≈ 6Gb
  • n = 10^6 documents
  • TotT = 10^9 total term occurrences (avg term length is 6 chars)
  • t = 5 × 10^5 distinct terms
• What kind of data structure should we build to support word-based searches?
Solution 1: Term-Doc matrix
A t × n binary matrix (t = 500K terms, n = 1 million documents): entry (w, d) is 1 if document d contains word w, 0 otherwise.
Space is 500Gb!
Solution 2: Inverted index
  Brutus    ➜ 2 4 8 16 32 64 128
  Caesar    ➜ 1 2 3 5 8 13 21 34
  Calpurnia ➜ 13 16
• Typically an entry <doc, pos, rankinfo> uses about 12 bytes
• We have 10^9 total term occurrences ⟹ at least 12Gb of space
• Compressing the 6Gb of documents gets 1.5Gb of data
• A better index, but it is still roughly 10 times the compressed text!!!!
We can do still better: i.e. 30-50% of the original text.
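A toy sketch of the structure; the document IDs and texts below are made up, and a real index would also store positions and rank info per posting, as in the <doc, pos, rankinfo> triple above:

    from collections import defaultdict

    def build_inverted_index(docs):
        """Map each term to the sorted list of IDs of the documents containing it."""
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for term in text.lower().split():
                index[term].add(doc_id)
        return {term: sorted(ids) for term, ids in index.items()}

    # Hypothetical three-document collection, just to show the shape of the postings.
    docs = {1: "Caesar and Brutus", 2: "Brutus and Calpurnia", 13: "Calpurnia"}
    index = build_inverted_index(docs)
    print(index["brutus"])        # [1, 2] -> the posting list of 'brutus'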
Please!! Do not underestimate the features of disks in algorithmic design.