
Extending Map-Reduce for Efficient Predicate-Based Sampling

Raman Grover, Dept. of Computer Science, University of California, Irvine.


Presentation Transcript


  1. Extending Map-Reduce for Efficient Predicate-Based Sampling
     Raman Grover, Dept. of Computer Science, University of California, Irvine

  2. Problem Statement
     • Collect and process as much as you can!
     • Collected data may accumulate to tera/petabytes.
     • Sampling!
     • Sampled data is additionally required to satisfy a given set of predicates: "Predicate-Based Sampling".
     Similar sampling queries are common at Facebook:
       SELECT age, profession, salary
       FROM CENSUS
       WHERE AGE > 25 AND AGE <= 30
         AND GENDER = 'FEMALE'
         AND STATE = 'CALIFORNIA'
       LIMIT 1000
     We needed:
     • A pseudo-random sample
     • Response time that is not a function of the size of the input
     What were the challenges?
     • Absence of indexes
     • Wide variety of predicates
     • Size of the data (tera/petabytes)

  3. What would Hadoop do?
     K = required sample size; N = number of map tasks.
     • Input data (HDFS) is read by Map 1, Map 2, Map 3, ..., Map N.
     • Map: evaluate the predicate(s) on the (key, value) pairs in the input and collect the first K pairs that satisfy them.
     • With N mappers, the map output contains at most N*K (key, value) pairs.
     • Reduce: select the first K.
     We are processing the whole of the input data!
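The flow on this slide can be sketched in plain Python. This is a minimal, single-process illustration rather than Hadoop code: the names `map_task`, `reduce_task`, and `baseline_sample`, and the representation of a split as a list of (key, value) pairs, are assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, Dict[str, object]]      # hypothetical (key, value) record
Predicate = Callable[[str, Dict[str, object]], bool]


def map_task(split: List[Pair], predicate: Predicate, k: int) -> List[Pair]:
    """Collect the first k pairs of one split that satisfy the predicate."""
    out: List[Pair] = []
    for key, value in split:
        if predicate(key, value):
            out.append((key, value))
            if len(out) == k:
                break
    return out


def reduce_task(map_outputs: List[List[Pair]], k: int) -> List[Pair]:
    """Select the first k pairs from at most N*k candidate pairs."""
    sample: List[Pair] = []
    for output in map_outputs:
        for pair in output:
            if len(sample) == k:
                return sample
            sample.append(pair)
    return sample


def baseline_sample(splits: List[List[Pair]], predicate: Predicate, k: int) -> List[Pair]:
    # Every split gets a map task, even if the first few already hold k matches.
    map_outputs = [map_task(split, predicate, k) for split in splits]
    assert sum(len(o) for o in map_outputs) <= len(splits) * k
    return reduce_task(map_outputs, k)


if __name__ == "__main__":
    splits = [
        [("a", {"age": 26}), ("b", {"age": 40})],
        [("c", {"age": 28})],
        [("d", {"age": 27})],
    ]
    in_range = lambda _key, value: 25 < value["age"] <= 30
    print(baseline_sample(splits, in_range, k=2))
```

The point of the sketch is the list comprehension in `baseline_sample`: a map task runs for every split regardless of how early the K matching pairs were found.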

  4. What happens at runtime?
     • Input splits → map outputs → Reduce → output
     • Desired sample: K records, each of which satisfies the given predicate(s).
     Did we really need to process the whole of the input to produce the desired sample? The input data could be in the range of tera/petabytes.

  5. Hadoop with Incremental Processing
     • Input Provider (configurable): "Do we need to process more input?" Either "Add input" or "No! We are good!"
     • Map tasks report their output; the map outputs feed the Reduce.
     The job produces the desired output but processes less input, does less work, and finishes earlier.
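The control loop on this slide might look roughly like the following single-process Python sketch. It is not the Hadoop extension itself: the function name `incremental_sample`, the use of a fixed integer `grab_limit` as the number of splits added per step, and the list-of-pairs representation of a split are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, Dict[str, object]]      # hypothetical (key, value) record
Predicate = Callable[[str, Dict[str, object]], bool]


def incremental_sample(
    splits: List[List[Pair]],
    predicate: Predicate,
    k: int,
    grab_limit: int,
) -> Tuple[List[Pair], int]:
    """Add input in steps of at most grab_limit splits until k matches are reported.

    Returns the sample and the number of splits that were actually processed.
    """
    if grab_limit < 1:
        raise ValueError("grab_limit must be at least 1")

    pending = list(splits)
    matches: List[Pair] = []
    splits_processed = 0

    # "Do we need to process more input?"
    while pending and len(matches) < k:
        # "Add input": hand the next batch of splits to map tasks.
        batch, pending = pending[:grab_limit], pending[grab_limit:]
        for split in batch:
            splits_processed += 1
            # Each map task reports the pairs that satisfy the predicate.
            matches.extend(pair for pair in split if predicate(*pair))

    # "No! We are good!": the reduce step keeps the first k reported pairs.
    return matches[:k], splits_processed


if __name__ == "__main__":
    splits = [[(f"s{i}r{j}", {"age": 27}) for j in range(5)] for i in range(10)]
    in_range = lambda _key, value: 25 < value["age"] <= 30
    sample, used = incremental_sample(splits, in_range, k=5, grab_limit=2)
    print(len(sample), "records from", used, "of", len(splits), "splits")
```

With a small `grab_limit` the loop stops after the first batch that yields enough matches; setting `grab_limit` to the total number of splits reproduces the behaviour of processing all input upfront.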

  6. A Mechanism Needs a Policy
     In controlling the intake of input, decisions need to take into account:
     • The capacity of the cluster
     • The current (or expected) load on the cluster
     • The priority of the job (acceptable response time and resource consumption)
     Defining a policy (these parameters can be varied to form different policies):
     • Grab Limit
     • Work Threshold
     • Evaluation Interval
     Two extremes:
     • A conservative approach: add minimal input at each step, minimize resource usage, leave more for others!
     • An aggressive approach: add all input upfront (Grab Limit = Infinity). This is the existing Hadoop model.

  7. Experimental Evaluation with Policies
     Multiuser workload.
       Policy                      Grab Limit
       Hadoop: Hadoop Default      Infinity
       HA: Highly Aggressive       max(0.5 * TS, AS)
       MA: Mid Aggressive          AS != 0 ? 0.5 * AS : 0.2 * TS
       LA: Less Aggressive         AS != 0 ? 0.2 * AS : 0.1 * TS
       C: Conservative             0.1 * AS
     The Grab Limit and the degree of aggressiveness decrease from the top of the table to the bottom.
     [Figure: experimental results chart; axis values not recoverable from the transcript]
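The Grab Limit expressions in the policy table translate directly into code. The sketch below is a plain transcription in Python; the function name `grab_limit` is hypothetical, and because the transcript does not define TS and AS, they are treated here simply as two non-negative numeric inputs (`ts` and `as_`, the latter renamed because `as` is a Python keyword).

```python
import math


def grab_limit(policy: str, ts: float, as_: float) -> float:
    """Evaluate the Grab Limit expression for one of the policies in the table.

    ts and as_ stand for the TS and AS quantities on the slide, whose
    definitions are not given in the transcript.
    """
    if policy == "Hadoop":          # Hadoop default: all input upfront
        return math.inf
    if policy == "HA":              # Highly Aggressive
        return max(0.5 * ts, as_)
    if policy == "MA":              # Mid Aggressive
        return 0.5 * as_ if as_ != 0 else 0.2 * ts
    if policy == "LA":              # Less Aggressive
        return 0.2 * as_ if as_ != 0 else 0.1 * ts
    if policy == "C":               # Conservative
        return 0.1 * as_
    raise ValueError(f"unknown policy: {policy!r}")


if __name__ == "__main__":
    for name in ("Hadoop", "HA", "MA", "LA", "C"):
        print(name, grab_limit(name, ts=100, as_=20))
```

For any fixed pair of inputs the returned limits are non-increasing in the order Hadoop, HA, MA, LA, C, which matches the ordering of the table.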

  8. My other activities at UCI
     Past
     • Built the Hadoop Compatibility Layer for Hyracks
     • Incremental Processing in Hadoop (work in the process of being incorporated into the Hadoop system at Facebook)
     Current
     • Transactions in Asterix
     Future
     • Building support for processing data feeds/streams in Asterix
     Reachable at: ramang@uci DOT edu
     Homepage: www.ics.uci.edu/~ramang
