Extending Map-Reduce for Efficient Predicate-Based Sampling
Raman Grover, Dept. of Computer Science, University of California, Irvine
Problem Statement
• Collect and process as much as you can!
• Collected data may accumulate to tera/petabytes
• Sampling!
• Sampled data is additionally required to satisfy a given set of predicates: "Predicate-Based Sampling"

Similar sampling queries are common at Facebook:

SELECT age, profession, salary
FROM CENSUS
WHERE AGE > 25 AND AGE <= 30
  AND GENDER = 'FEMALE'
  AND STATE = 'CALIFORNIA'
LIMIT 1000

We needed:
• A pseudo-random sample
• Response time that is not a function of the size of the input

What were the challenges?
• Absence of indexes
• Wide variety of predicates
• Size of the data: tera/petabytes
What would Hadoop do?
K = required sample size, N = number of map tasks.
• Input data resides in HDFS and is split across map tasks Map 1, Map 2, Map 3, ..., Map N.
• Each map task evaluates the predicate(s) on the (key, value) pairs in its input and collects the first K pairs that satisfy them.
• With N mappers, the map output contains at most N*K (key, value) pairs.
• The reduce step then selects the first K of these.
We are processing the whole of the input data!
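The slides describe this baseline job only at the block-diagram level. A minimal sketch in classic Hadoop MapReduce (Java), assuming the CENSUS rows arrive as comma-separated text with age, gender, and state at fixed positions (the record layout, field indices, and class names here are assumptions for illustration):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PredicateSampling {
    static final int K = 1000; // required sample size

    // Each map task evaluates the predicate and emits at most K matching rows.
    public static class SampleMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {
        private int emitted = 0;

        @Override
        protected void map(LongWritable offset, Text row, Context context)
                throws IOException, InterruptedException {
            if (emitted >= K) {
                return; // this mapper already collected its first K pairs
            }
            String[] f = row.toString().split(",");  // assumed layout:
            int age = Integer.parseInt(f[0].trim()); // age, ..., gender, state
            String gender = f[3].trim();
            String state = f[4].trim();
            if (age > 25 && age <= 30
                    && gender.equals("FEMALE") && state.equals("CALIFORNIA")) {
                context.write(NullWritable.get(), row);
                emitted++;
            }
        }
    }

    // A single reducer sees at most N*K pairs and keeps the first K.
    public static class SampleReducer
            extends Reducer<NullWritable, Text, NullWritable, Text> {
        @Override
        protected void reduce(NullWritable key, Iterable<Text> rows,
                              Context context)
                throws IOException, InterruptedException {
            int kept = 0;
            for (Text row : rows) {
                if (kept++ >= K) {
                    break;
                }
                context.write(NullWritable.get(), row);
            }
        }
    }
}

Note that even though each mapper stops emitting after K matches, every input split is still read end to end, which is exactly the problem the next slides address.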
What happens at runtime?
Input splits flow into the map tasks; the map outputs are fed to the reduce step, and the reduce output is the desired sample: K records, each of which satisfies the given predicate(s).
Did we really need to process the whole of the input to produce the desired sample? The input data could be in the range of tera/petabytes.
Hadoop with Incremental Processing
• A configurable Input Provider decides at each step: do we need to process more input?
• If yes, it adds more input to the job; running map tasks report their output back to it.
• If no ("We are good!"), no further input is added.
The job produces the desired output, but processes less input, does less work, and finishes earlier.
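The slides give only the control flow, not the actual interface of the modified Hadoop, so the following is a hypothetical sketch of the loop an input provider implements (all names are invented for illustration): splits are released in small batches, and intake stops once the reported output reaches the required sample size K.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

interface InputProvider {
    // Called at each evaluation interval with the output produced so far;
    // an empty result means "No! We are good!" and intake stops.
    List<String> nextSplits(long recordsProducedSoFar);
}

class IncrementalInputProvider implements InputProvider {
    private final Queue<String> pendingSplits;
    private final long requiredSampleSize; // K
    private final int grabLimit;           // splits added per step (set by policy)

    IncrementalInputProvider(List<String> splits, long k, int grabLimit) {
        this.pendingSplits = new ArrayDeque<>(splits);
        this.requiredSampleSize = k;
        this.grabLimit = grabLimit;
    }

    @Override
    public List<String> nextSplits(long recordsProducedSoFar) {
        List<String> batch = new ArrayList<>();
        if (recordsProducedSoFar >= requiredSampleSize) {
            return batch; // desired sample reached: add no more input
        }
        for (int i = 0; i < grabLimit && !pendingSplits.isEmpty(); i++) {
            batch.add(pendingSplits.poll()); // "Add Input"
        }
        return batch;
    }
}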
A Mechanism Needs a Policy
In controlling the intake of input, decisions need to take into account:
• The capacity of the cluster
• The current (or expected) load on the cluster
• The priority of the job (acceptable response time and resource consumption)
Defining a policy: three knobs that can be varied to form different policies:
• Grab Limit
• Work Threshold
• Evaluation Interval
Two extremes:
• A conservative approach: add minimal input at each step, minimize resource usage, leave more for others!
• An aggressive approach: add all input upfront (Grab Limit = Infinity). This is the existing Hadoop model.
Experimental Evaluation with Policies
Multiuser workload. The policies, in decreasing order of aggressiveness (decreasing Grab Limit):

Policy                    Grab Limit
Hadoop (Hadoop Default)   Infinity
HA (Highly Aggressive)    max(0.5*TS, AS)
MA (Mid Aggressive)       AS != 0 ? 0.5*AS : 0.2*TS
LA (Less Aggressive)      AS != 0 ? 0.2*AS : 0.1*TS
C (Conservative)          0.1*AS
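Transcribing the Grab Limit column directly, the policies reduce to one expression each. The slides do not expand TS and AS; reading them as total task slots and currently available slots in the cluster is an assumption made here:

enum Policy { HADOOP, HIGHLY_AGGRESSIVE, MID_AGGRESSIVE, LESS_AGGRESSIVE, CONSERVATIVE }

final class GrabLimits {
    // ts = total task slots, as = currently available slots (assumed meanings).
    static double grabLimit(Policy p, double ts, double as) {
        switch (p) {
            case HADOOP:            return Double.POSITIVE_INFINITY; // add all input upfront
            case HIGHLY_AGGRESSIVE: return Math.max(0.5 * ts, as);
            case MID_AGGRESSIVE:    return as != 0 ? 0.5 * as : 0.2 * ts;
            case LESS_AGGRESSIVE:   return as != 0 ? 0.2 * as : 0.1 * ts;
            case CONSERVATIVE:      return 0.1 * as;
            default: throw new IllegalArgumentException("unknown policy: " + p);
        }
    }
}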
My other activities at UCI
Past
• Built the Hadoop Compatibility Layer for Hyracks
• Incremental Processing in Hadoop (work in the process of being incorporated into the Hadoop system at Facebook)
Current
• Transactions in Asterix
Future
• Building support for processing data feeds/streams in Asterix
Reachable at: ramang@uci DOT edu
Homepage: www.ics.uci.edu/~ramang