Multi-way Algorithm for Cube Computation CPS 196.03 Notes 8
First Programming Project • Individual project, 15 Points in final grade • Sales(customer_id, item_id, item_group, item_price, purchase_date) • Will be provided as a file during demo and for generating performance numbers for project report • Task 1: 5 Points • Interface to enter MIN_SUPPORT (% of customers) • Find frequent itemsets (sets of item_ids) using Apriori • Task 2: 5 Points (Section 5.5 in the textbook) • Interface to enter two constraint types (e.g., SUM(item_price) op const) • Use the constraints in Apriori as effectively as possible; study and demonstrate the performance improvement • Task 3: 5 Points • Extension of your choice. Examples include (i) association rules, (ii) complex constraints, (iii) sequential patterns, (iv) variants of Apriori, (v) FP-growth
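As a rough illustration of Task 1, the sketch below runs a plain Apriori loop (join, prune, count) over per-customer baskets. It assumes each customer's purchases have already been grouped into a set of item_ids and that MIN_SUPPORT is given as a fraction of customers; the function name `apriori` and the `baskets` layout are hypothetical, not part of the project specification.

```python
from itertools import combinations


def apriori(baskets, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets.

    baskets: dict mapping customer_id -> set of item_ids (assumed layout).
    min_support: fraction of customers, e.g. 0.5 for 50%.
    """
    min_count = min_support * len(baskets)

    # Level 1: count single items.
    counts = {}
    for items in baskets.values():
        for item in items:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)

    k = 2
    while frequent:
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        prev = list(frequent)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) != k:
                    continue
                # Prune step: every (k-1)-subset must itself be frequent.
                if all(frozenset(sub) in frequent
                       for sub in combinations(union, k - 1)):
                    candidates.add(union)

        # Count step: one scan over the baskets.
        counts = {c: 0 for c in candidates}
        for items in baskets.values():
            for c in candidates:
                if c <= items:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
        k += 1
    return result
```

A constraint from Task 2 could be checked inside the prune step, which is one place where pushing constraints into Apriori can reduce the number of candidates that reach the count step.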
File Format • 10,123,3,54,4/4/2008 • 10,12,4,101,4/5/2008 • 14,123,3,54,8/4/2008 • … • Caveats: • Customer Vs. Item • Three datasets: Toy, Medium, and Large • Comma-separated file, one purchase per line in file, no header in file • Integers for simplicity • Note date format
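A minimal sketch of reading this format with Python's standard `csv` module, grouping purchases into one basket per customer. The function name `load_baskets` is hypothetical. The date is kept as a string because the slide does not say whether 8/4/2008 means month/day or day/month.

```python
import csv
from collections import defaultdict


def load_baskets(lines):
    """Group purchases by customer.

    lines: any iterable of text lines (an open file works too).
    Returns {customer_id: set of item_ids}.
    """
    baskets = defaultdict(set)
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        customer_id, item_id, item_group, item_price, purchase_date = row
        # purchase_date is left unparsed: the date convention is not stated.
        baskets[int(customer_id)].add(int(item_id))
    return dict(baskets)


# The three sample lines from the slide (no header line in the file).
sample = [
    "10,123,3,54,4/4/2008",
    "10,12,4,101,4/5/2008",
    "14,123,3,54,8/4/2008",
]
print(load_baskets(sample))
```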
First Programming Project: Milestones • Feb 3: Project announced • Feb 17: Mid-project report due • Describe progress and planned extensions • Describe detailed algorithms for all three tasks • Feb 17: Sample data file will be provided for generating performance results for project report • March 2: Submit code, README file to run code, code documentation, and final project report • March 2-4: Project demos (random assignment) • March 6: Spring break. Second project announced
Finalized Grading Criteria for Class • Homeworks: 15 points • Programming projects: 40 points • Midterm: 20 points • Note: Midterm is on Feb 19 (Thu) in class • Final: 25 Points
ROLAP Server • Relational OLAP Server • Special indices, tuning • Schema is “denormalized” • [Figure: tools and utilities on top of a ROLAP server, which sits on a relational DBMS]
MOLAP Server • Multi-Dimensional OLAP Server • Could also sit on relational DBMS • [Figure: M.D. tools and utilities on top of a multi-dimensional server; Sales cube with dimensions City (A, B), Product (milk, soda, eggs, soap), and Date (1, 2, 3, 4)]
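MOLAP • [Figure: data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), and Country (U.S.A, Canada, Mexico, sum); one marked cell is the total annual sales of TV in U.S.A., and the corner cell is (All, All, All)]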
MOLAP • [Figure: 3D array with dimensions A (a0 to a3), B (b0 to b3), and C (c0 to c3), partitioned into 64 chunks numbered 1 to 64]
Challenges in MOLAP • Storing large arrays for efficient access • Row-major, column-major • Chunking • Compressing sparse arrays • Creating array data from data in tables • Efficient techniques for Cube computation • These topics are discussed in the assigned paper
ROLAP Vs. MOLAP • What do the authors say? • What can you do in MOLAP that you cannot do in ROLAP? • Can the algorithm in this paper be used in ROLAP?
Array Storage • Chunks • Compression • Chunk-offset compression Vs. LZW
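To make chunk-offset compression concrete, here is a small sketch under the assumption that a sparse chunk is stored as (offset within chunk, value) pairs for its valid cells only. The function names and the use of `None` for an empty cell are illustrative choices, not taken from the paper.

```python
def compress_chunk(dense, empty=None):
    """Keep only the valid cells of a flattened chunk as (offset, value) pairs."""
    return [(offset, value)
            for offset, value in enumerate(dense)
            if value != empty]


def decompress_chunk(pairs, size, empty=None):
    """Rebuild the flattened chunk of the given size from (offset, value) pairs."""
    dense = [empty] * size
    for offset, value in pairs:
        dense[offset] = value
    return dense


# A chunk with 6 cells, 2 of them valid.
chunk = [None, 5, None, None, 7, None]
pairs = compress_chunk(chunk)
assert decompress_chunk(pairs, len(chunk)) == chunk
```

Unlike a general-purpose scheme such as LZW, this layout lets an aggregation loop walk the valid cells directly without decompressing the chunk first.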
Loading Arrays from Tables • The easy case: array fits in memory • Else: • Partitions
Loading Arrays from Tables • Suppose there are 1000 chunks and 10 chunks can fit in memory, so the partition size is 10 chunks • [Figure: the table is split into 100 partitions of 10 chunks each]
Basic Array Cubing Algo • First find minimum spanning tree • Hierarchy of aggregates • Compute each (k-1) dimensional aggregate from its best k dimensional aggregate • One pass through the array in the right order Let us look at some basics first
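One way to read "best k dimensional aggregate" is "the smallest parent group-by". The sketch below builds such a tree by picking, for every group-by, the parent with the fewest cells. This is an assumption about the selection rule; the dimension sizes in the usage line are illustrative, and `smallest_parent_tree` is a hypothetical name.

```python
from itertools import combinations
from math import prod


def smallest_parent_tree(sizes):
    """Map each group-by to the smallest group-by one level above it.

    sizes: dict mapping dimension name -> number of distinct values.
    Group-bys are represented as sorted tuples of dimension names;
    the empty tuple is the grand total.
    """
    dims = sorted(sizes)
    tree = {}
    for k in range(len(dims) - 1, -1, -1):
        for child in combinations(dims, k):
            # Every parent adds exactly one dimension to the child.
            parents = [tuple(sorted(child + (d,)))
                       for d in dims if d not in child]
            # Choose the parent with the fewest cells.
            tree[child] = min(parents,
                              key=lambda p: prod(sizes[d] for d in p))
    return tree


tree = smallest_parent_tree({"A": 40, "B": 400, "C": 4000})
for child, parent in tree.items():
    print(child, "<-", parent)
```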
Chunked 3D Array • Dimension order CBA • [Figure: 4 x 4 x 4 chunked array over dimensions A, B, and C, with chunks numbered 1 to 64]
“a0b0” chunk • [Figure: chunks are read in the order a0b0c0 c1 c2 c3, a0b1c0 c1 c2 c3, a0b2c0 c1 c2 c3, a0b3c0 c1 c2 c3, …; the a0b0 row (c0 to c3) is being read]
a0b1 chunk • Done with a0b0 • [Figure: same read order; the a0b1 row (c0 to c3) is being read]
a0b2 chunk • Done with a0b1 • [Figure: same read order; the a0b2 row (c0 to c3) is being read]
a0b3 chunk • Done with a0b2 • [Figure: same read order; the a0b3 row (c0 to c3) is being read]
a1b0 chunk • Done with a0b3 • Done with a0c* • [Figure: read order continues with a1b0c0 c1 c2 c3, a1b1c0 c1 c2 c3, a1b2c0 c1 c2 c3, a1b3c0 c1 c2 c3, …]
a3b3 chunk (last one) • Done with a0b3 • Done with a0c* • Done with b*c* • [Figure: read order ends with a3b0c0 c1 c2 c3, a3b1c0 c1 c2 c3, a3b2c0 c1 c2 c3, a3b3c0 c1 c2 c3] • Finish
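The walkthrough above can be replayed in a few lines. This sketch scans a 4 x 4 x 4 grid of chunks with C varying fastest, then B, then A, and records how many partially aggregated chunks of each 2D plane must be held at once. The function name `scan` and the bookkeeping are illustrative assumptions, not the paper's implementation.

```python
from itertools import product


def scan(n=4):
    """Scan n x n x n chunks with C fastest, then B, then A.

    Returns the peak number of in-progress chunks per 2D plane.
    """
    live = {"AB": set(), "AC": set(), "BC": set()}
    peak = {"AB": 0, "AC": 0, "BC": 0}
    for a, b, c in product(range(n), repeat=3):  # last index varies fastest
        # Reading chunk (a, b, c) contributes to one chunk in each plane.
        live["AB"].add((a, b))
        live["AC"].add((a, c))
        live["BC"].add((b, c))
        for plane in live:
            peak[plane] = max(peak[plane], len(live[plane]))
        # A plane chunk is complete once the dimension being aggregated
        # away has reached its last chunk; it can then be written out.
        if c == n - 1:
            live["AB"].discard((a, b))  # "done with a_i b_j"
        if b == n - 1:
            live["AC"].discard((a, c))  # "done with a_i c*" after all c
        if a == n - 1:
            live["BC"].discard((b, c))  # "done with b*c*" at the very end
    return peak


print(scan())  # {'AB': 1, 'AC': 4, 'BC': 16}
```

The AB plane only ever needs one chunk in memory, while the BC plane cannot release anything until the last value of A is reached.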
Memory Used • A: 40 distinct values • B: 400 distinct values • C: 4000 distinct values • CBA: Dimension Order • Plane AB: Need 1 chunk (10 * 100 * 1) • Plane AC: Need 4 chunks (10 * 1000 * 4) • Plane BC: Need 16 chunks (100 * 1000 * 16) • Total memory: 1,641,000
Memory Used • A: 40 distinct values • B: 400 distinct values • C: 4000 distinct values • ABC: Dimension Order • Plane BC: Need 1 chunk (1000 * 100 * 1) • Plane AC: Need 4 chunks (1000 * 10 * 4) • Plane AB: Need 16 chunks (100 * 10 * 16) • Total memory: 156,000
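The two totals can be reproduced with a short calculation. The sketch assumes three dimensions, four chunks per dimension (so chunk lengths of 10, 100, and 1000 for A, B, and C), and that the order string lists the fastest-varying dimension first, which is how the figures for CBA and ABC on these slides work out. `memory_for_order` is a hypothetical helper name.

```python
from itertools import permutations


def memory_for_order(order, chunk_len, chunks_per_dim=4):
    """Cells of memory needed to compute the three 2D planes in one pass.

    order: three dimension names, fastest-varying first.
    chunk_len: dict mapping dimension name -> chunk length in that dimension.
    """
    fast, mid, slow = order
    # Number of plane chunks that must be held simultaneously:
    chunks_needed = {
        (mid, slow): 1,                     # aggregates away the fastest dim
        (fast, slow): chunks_per_dim,       # aggregates away the middle dim
        (fast, mid): chunks_per_dim ** 2,   # aggregates away the slowest dim
    }
    return sum(chunk_len[x] * chunk_len[y] * n
               for (x, y), n in chunks_needed.items())


chunk_len = {"A": 10, "B": 100, "C": 1000}
print(memory_for_order("CBA", chunk_len))  # 1641000
print(memory_for_order("ABC", chunk_len))  # 156000

# Trying all six orders shows which one needs the least memory here.
best = min(permutations("ABC"), key=lambda o: memory_for_order(o, chunk_len))
print("".join(best))  # ABC
```

For these sizes the cheapest order puts the smallest dimension first, so the plane that needs 16 chunks is the one with the smallest chunks.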
Basic Array Cubing Algo • First find minimum spanning tree • Hierarchy of aggregates • Compute each (k-1) dimensional aggregate from its best k dimensional aggregate • One pass through the array in the right order • What are the advantages and disadvantages of this algorithm?
Multi-way Array Cubing Algo • What is the main idea? • Rule 1 on Page 163 • Minimum memory spanning tree • Figure 2 • Figures 3 and 4 • Theorem 1 • Basic idea of multi-pass algorithm • Tradeoff between memory usage and number of passes