CPS 216: Data-intensive Computing Systems
Information about Project 1
Shivnath Babu
Project 1: Overview
• Project 1 (Sept to late Nov):
  • Processing collections of records: Systems like Pig, Hive, Jaql, Cascading, Cascalog, HadoopDB
  • Matrix and graph computations: Systems like Rhipe, Ricardo, SystemML, Mahout, Pregel, Hama
  • Data stream processing: Systems like Flume, FlumeJava, S4, STREAM, Scribe, STORM
  • Data serving systems: Systems like BigTable/HBase, Dynamo/Cassandra, CouchDB, MongoDB, Riak, VoltDB
• Project 1 will have regular milestones. The final report will include:
  1. What are properties of the data encountered?
  2. What are concrete examples of workloads that are run? Develop a benchmark workload that you will implement and use in Step 5.
  3. What are typical goals and requirements?
  4. What are typical systems used, and how do they compare with each other?
  5. Install some of these systems and do an experimental evaluation of 1, 2, 3, & 4
• Project 2 (late Nov to end of class): Of your own choosing. Could be a significant new feature added to Project 1
Group 1: Processing Collections of Records
• Workloads:
  • See “The Case for Evaluating MapReduce Performance Using Workload Suites” for pointers to a number of possible MapReduce workloads: http://www.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-21.html
  • Citation 12 in that paper: Pavlo, Paulson, and others (comes with data)
  • TPC-H: http://www.tpc.org/tpch/ (comes with data)
  • If things work out: a real Hadoop+HBase workload that Akamai uses
• Systems:
  • Hadoop
  • Pig
  • Hive
  • A hybrid system like HadoopDB
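To make "processing collections of records" concrete, here is a minimal single-process sketch of a grouped-sum query of the sort that Pig or Hive would compile into a MapReduce job. It only imitates the map, shuffle, and reduce phases in plain Python; it does not use Hadoop. The record layout (`source_ip`, `ad_revenue`) and the sample values are illustrative assumptions, not data from any of the workloads listed above.

```python
from collections import defaultdict

# Hypothetical visit records: (source_ip, ad_revenue).
# Field names and values are illustrative assumptions, not benchmark data.
RECORDS = [
    ("10.0.0.1", 0.50),
    ("10.0.0.2", 1.25),
    ("10.0.0.1", 2.00),
    ("10.0.0.3", 0.75),
    ("10.0.0.2", 0.25),
]

def map_phase(record):
    """Emit one (key, value) pair per input record."""
    source_ip, ad_revenue = record
    yield source_ip, ad_revenue

def shuffle(pairs):
    """Group values by key, as a MapReduce framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Sum the revenue collected for one key."""
    return key, sum(values)

def run_job(records):
    """Run map, shuffle, and reduce in sequence and return {key: total}."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

totals = run_job(RECORDS)
```

A real workload would differ mainly in scale and in where the shuffle happens: the framework partitions keys across machines, which is usually where the performance differences between these systems show up.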
Group 2: Matrix and Graph Computations
• Workloads:
  • Matrix computations, e.g., PLSA
  • Graph computations, e.g., PageRank
  • Machine-learning workloads (of interest to both Groups 1 and 2)
• Systems:
  • Hadoop
  • Spark / Twister
  • RHIPE
  • (Mahout)
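As a reference point for the PageRank workload, the following is a small in-memory sketch of PageRank by power iteration. It is a textbook formulation, not the implementation used by any of the systems above; the damping factor of 0.85, the fixed iteration count, and the four-node example graph are assumptions chosen for illustration.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping node -> list of out-neighbors.

    Every neighbor must also appear as a key of `graph`.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no out-links is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not graph[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for node in nodes:
            out = graph[node]
            if out:
                # Each node passes an equal share of its damped rank to its neighbors.
                share = damping * rank[node] / len(out)
                for target in out:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical example graph: "c" has the most in-links, "d" has none.
GRAPH = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(GRAPH)
```

Each iteration is one pass over all edges, which is why the workload maps naturally onto one MapReduce job per iteration on Hadoop, and why systems that keep the graph in memory between iterations are interesting to compare against.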
Group 3: Data Stream Processing
• Workloads:
  • Behavioral Targeting: http://research.microsoft.com/apps/pubs/default.aspx?id=150002
  • Linear Road Benchmark: http://pages.cs.brandeis.edu/~linearroad/
• Systems:
  • Hadoop
  • Flume and FlumeBase
  • Hadoop + HBase
Group 4: Data Serving Systems
• Workloads:
  • YCSB: https://github.com/brianfrankcooper/YCSB/wiki
  • YCSB++
• Systems (no need to do them all):
  • HDFS (not the full Hadoop) or MapR
  • HBase (Original design comes from Google BigTable)
  • Cassandra / Riak (Original design comes from Amazon Dynamo)
  • VoltDB (Parallel in-memory database)
  • CouchDB / MongoDB (Document Stores)
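A serving workload of the kind YCSB generates is essentially a stream of key-value operations drawn from a fixed mix. The sketch below imitates that idea against a plain Python dict; it is not YCSB's API or client code. The function name `run_mix`, the key names, and the uniform key choice are assumptions made for illustration (YCSB itself offers skewed request distributions), while the 95/5 read/update split mirrors the read-heavy mix of YCSB's core workload B.

```python
import random

def run_mix(store, keys, operations, read_fraction=0.95, seed=42):
    """Apply a random read/update mix to a dict-like store and count each operation."""
    rng = random.Random(seed)  # fixed seed so repeated runs issue the same requests
    counts = {"read": 0, "update": 0}
    for i in range(operations):
        key = rng.choice(keys)  # uniform key choice (an assumption of this sketch)
        if rng.random() < read_fraction:
            store.get(key)
            counts["read"] += 1
        else:
            store[key] = f"value-{i}"
            counts["update"] += 1
    return counts

# Hypothetical key space, pre-loaded before the measured phase.
KEYS = [f"user{i}" for i in range(100)]
store = {key: "initial" for key in KEYS}
counts = run_mix(store, KEYS, operations=1000)
```

Swapping the dict for a thin client wrapper around HBase, Cassandra, or another store would let the same operation stream be replayed against each system, provided the wrapper exposes `get` and item assignment.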
Upcoming Milestones
1. Read about the workloads, performance goals, etc. Discuss within your group. Pick one workload or come up with your own. Write a report by Sept 23. You can do it as part of a group or on your own.
2. One part of programming assignment 2 will involve writing and running the workload using Hadoop/HDFS/MapR. This assignment will be done on Amazon EC2. Done individually. Group discussion is fine.
3. As part of Project 1 later on, you will compare the performance seen on Hadoop/HDFS/MapR in Step 2 with that of the other systems you will use.
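For the comparison in Step 3, one simple approach is to run the same workload several times on each system and compare median wall-clock times. The harness below is a sketch under that assumption; the function names and the two stand-in workloads are hypothetical, and a real comparison would replace the lambdas with calls that submit the workload to each system.

```python
import statistics
import time

def time_workload(run, repeats=5):
    """Return the median wall-clock time, in seconds, of calling run() `repeats` times."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

def compare(systems, repeats=5):
    """Time each named runner; return {name: (median_seconds, slowdown_vs_fastest)}."""
    medians = {name: time_workload(run, repeats) for name, run in systems.items()}
    fastest = min(medians.values())
    return {name: (seconds, seconds / fastest) for name, seconds in medians.items()}

# Stand-in workloads (assumptions): replace with real job submissions.
SYSTEMS = {
    "baseline": lambda: sum(range(200_000)),
    "candidate": lambda: sum(i for i in range(200_000)),
}
report = compare(SYSTEMS)
```

Using the median rather than the mean keeps a single slow run, for example one caused by a cold cache or a noisy EC2 neighbor, from dominating the result.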