
Code Gen of Expr Eval in Shark

Code Gen of Expr Eval in Shark. hao.cheng@intel.com. Outline: CG examples; performance comparison (CG Expr Eval vs. Hive Expr Eval); CG design and major class diagram; implemented UDFs/Generic UDFs; future work.


Presentation Transcript


  1. Code Gen of ExprEval in Shark hao.cheng@intel.com

  2. Outline
     • CG examples
     • Performance Comparison (CG ExprEval vs. Hive ExprEval)
     • CG Design & Major Class Diagram
     • Implemented UDFs/Generic UDFs
     • Future Work

  3. CG Examples: set shark.expr.cg to true or false in hive-site.xml to enable or disable the feature; the default is true.
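As a configuration sketch, the switch named on this slide would sit in hive-site.xml as an ordinary property entry. Only the property name and its default come from the slide; the surrounding layout is the standard Hadoop/Hive configuration file format, and the description text is an illustrative assumption.

```xml
<configuration>
  <property>
    <!-- Property name taken from the slide; "true" is its stated default. -->
    <name>shark.expr.cg</name>
    <value>true</value>
    <!-- Description wording is illustrative, not from the Shark sources. -->
    <description>Enable code generation for expression evaluation.</description>
  </property>
</configuration>
```

Setting the value to false should fall back to the stock Hive expression evaluators, per the slide.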

  4. Performance Comparison (CG ExprEval vs. Hive ExprEval): 747,747,840 records / 66,909,023,675 bytes / RCFile (with LzoCodec) on 4 slave machines

  5. Performance Comparison (CG ExprEval vs. Hive ExprEval) (2)
     Why is CG ExprEval faster than Hive ExprEval? Hive ExprEval:
     • Keeps re-evaluating common sub-expressions
       • e.g. in the expression concat(year(date_add(visitDate,7)), '/', month(date_add(visitDate,7)), '/', day(date_add(visitDate,7))), "date_add(visitDate,7)" is evaluated 3 times.
     • Keeps checking data types at runtime
       • The parameter types of the "evaluate" method in GenericUDFs are unknown until runtime, so Hive ExprEval has to keep checking the value types inside each evaluation, e.g. GenericUDFOPGreaterThan.evaluate, GenericUDFPrintf.evaluate, etc.
     • Performs unnecessary type conversions
       • e.g. in the expression (duration + 1.03), the variable "duration" is first converted into a new FloatWritable object in Hive ExprEval, which creates lots of small temporary objects (GenericUDFBridge.conversionHelper).
     • Makes a large number of virtual function calls at runtime
       • Hive ExprEval always works through base-class references, particularly for the UDF objects and the field value objects.
     • Uses Java reflection to call the UDF evaluate() method
       • Hive ExprEval invokes the UDF (in class GenericUDFBridge) through the Java Reflection API, which adds further overhead (http://docs.oracle.com/javase/tutorial/reflect/index.html).
     CG ExprEval instead generates source code with concrete objects and concrete execution branches.
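The first point on this slide (re-evaluating common sub-expressions) can be illustrated with a small, self-contained Java sketch. It is not Hive or Shark code: the class `CommonSubexprSketch`, its methods, and the use of `java.time.LocalDate` in place of Hive's date handling are all assumptions made for illustration. The "interpreted" method mimics a tree in which each branch owns its own copy of date_add(visitDate, 7); the "generated" method mimics straight-line generated code that computes it once.

```java
import java.time.LocalDate;
import java.util.function.Function;

final class CommonSubexprSketch {
    // Counts how often the shared sub-expression date_add(visitDate, 7) runs.
    static int dateAddCalls = 0;

    // Hypothetical stand-in for Hive's date_add UDF.
    static LocalDate dateAdd(LocalDate d, int days) {
        dateAddCalls++;
        return d.plusDays(days);
    }

    // Interpreter style: year, month and day each re-evaluate date_add.
    static String interpreted(LocalDate visitDate) {
        Function<LocalDate, LocalDate> shifted = d -> dateAdd(d, 7);
        Function<LocalDate, Integer> year = d -> shifted.apply(d).getYear();
        Function<LocalDate, Integer> month = d -> shifted.apply(d).getMonthValue();
        Function<LocalDate, Integer> day = d -> shifted.apply(d).getDayOfMonth();
        return year.apply(visitDate) + "/" + month.apply(visitDate) + "/" + day.apply(visitDate);
    }

    // Generated style: the shared sub-expression is hoisted into one local.
    static String generated(LocalDate visitDate) {
        LocalDate shifted = dateAdd(visitDate, 7);
        return shifted.getYear() + "/" + shifted.getMonthValue() + "/" + shifted.getDayOfMonth();
    }

    public static void main(String[] args) {
        LocalDate visitDate = LocalDate.of(2013, 12, 28);

        dateAddCalls = 0;
        String a = interpreted(visitDate);
        System.out.println(a + " with " + dateAddCalls + " date_add calls");

        dateAddCalls = 0;
        String b = generated(visitDate);
        System.out.println(b + " with " + dateAddCalls + " date_add calls");
    }
}
```

Both methods return the same string, but the interpreted variant calls `dateAdd` three times per row and the generated variant once. The same straight-line style also sidesteps the reflection and per-call type checks listed above, since the concrete types are fixed when the source is generated.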

  6. CG Design & Major Class Diagram

  7. CG Design & Major Class Diagram (2)
     • Why not generate the bytecode directly?
       • The generated content is quite complicated; source code is much easier to debug and troubleshoot.
       • The Java compiler can apply further optimizations when it compiles the source code.
     • Why generate the evaluation source code from the ExprNodeDesc tree rather than the Hive ExprNodeEvaluator tree?
       • The ExprNodeEvaluator tree loses some information that may be helpful for further optimization (e.g. evaluating common sub-expressions).
       • Extracting information from the ExprNodeEvaluator tree is difficult, as most of the variables in ExprNodeEvaluator are protected or private.
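To make the design choice concrete, here is a minimal sketch of generating source text from a descriptor-style tree while emitting each distinct sub-expression only once. It does not use Hive's ExprNodeDesc classes and is not Shark's actual generator; `SourceGenSketch`, `Node`, the `v0`, `v1`, ... naming scheme, and the untyped `Object` locals are all hypothetical simplifications.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SourceGenSketch {
    /** Hypothetical descriptor node: a leaf (column or literal) or a function call. */
    static final class Node {
        final String name;
        final List<Node> args;

        Node(String name, Node... args) {
            this.name = name;
            this.args = Arrays.asList(args);
        }

        /** Canonical text used to detect structurally identical sub-expressions. */
        String key() {
            if (args.isEmpty()) return name;
            List<String> parts = new ArrayList<>();
            for (Node a : args) parts.add(a.key());
            return name + "(" + String.join(", ", parts) + ")";
        }
    }

    /** Returns a generated method body with one local per distinct function call. */
    static String generate(Node root) {
        Map<String, String> locals = new LinkedHashMap<>(); // expression key -> local name
        StringBuilder body = new StringBuilder();
        String result = emit(root, locals, body);
        body.append("return ").append(result).append(";\n");
        return body.toString();
    }

    private static String emit(Node n, Map<String, String> locals, StringBuilder body) {
        if (n.args.isEmpty()) return n.name;      // leaves are referenced directly
        String key = n.key();
        String existing = locals.get(key);
        if (existing != null) return existing;    // already emitted: reuse the local
        List<String> argNames = new ArrayList<>();
        for (Node a : n.args) argNames.add(emit(a, locals, body));
        String local = "v" + locals.size();
        locals.put(key, local);
        body.append("Object ").append(local).append(" = ").append(n.name)
            .append("(").append(String.join(", ", argNames)).append(");\n");
        return local;
    }

    public static void main(String[] args) {
        Node visitDate = new Node("visitDate");
        Node expr = new Node("concat",
            new Node("year", new Node("date_add", visitDate, new Node("7"))),
            new Node("month", new Node("date_add", visitDate, new Node("7"))));
        System.out.print(generate(expr));
    }
}
```

Because the output is plain source text, it can be printed and read while debugging, which is the trade-off the slide argues for over emitting bytecode directly. A real generator would also need concrete types and null handling, which this sketch leaves out.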

  8. Implemented UDFs/Generic UDFs
     • Supported Features:
       • Relational Operators (=, !=, <, <=, etc.)
       • Arithmetic Operators (+, -, *, /, %, etc.)
       • Logical Operators (AND, OR, NOT, etc.)
       • Built-in Functions (UDF) and existing User-Defined Functions
       • A subset of the generic UDFs
         • GenericUDFBetween
         • GenericUDFPrintf
         • GenericUDFInstr
         • GenericUDFBridge
     • Unsupported Features:
       • Conditional Functions (if/case/when, etc.)
       • Map/Array
       • UDAF
       • UDTF
       • Misc. Functions (java_method/reflect/hash, etc.)

  9. Future Work
     • Compile the generated Java source once and distribute it across the cluster
     • Reuse the generated .class files for identical queries
     • Support more Generic UDFs (case/when/if, etc.)
     • Support collection types (Array/Map, etc.)
     • Code generation for aggregations
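The "reuse the generated .class" item could, for example, take the shape of a cache keyed by the expression text, so that a repeated query skips generation and compilation. The slides do not describe a design for this, so everything in the sketch below is a hypothetical illustration: `CompiledExprCacheSketch` is an invented name, and the `compiler` function stands in for whatever step turns an expression into a loaded evaluator.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical cache: expression text -> compiled evaluator of type T. */
final class CompiledExprCacheSketch<T> {
    private final Map<String, T> cache = new ConcurrentHashMap<>();
    private final Function<String, T> compiler;

    CompiledExprCacheSketch(Function<String, T> compiler) {
        this.compiler = compiler;
    }

    /** Compiles the expression on first use and returns the cached result afterwards. */
    T get(String exprText) {
        return cache.computeIfAbsent(exprText, compiler);
    }

    /** Number of distinct expressions compiled so far. */
    int size() {
        return cache.size();
    }
}
```

A caller would construct it with the compile step, e.g. `new CompiledExprCacheSketch<>(src -> compile(src))` where `compile` is the caller's own function, and call `get` once per query expression.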
