Code Gen of ExprEval in Shark
hao.cheng@intel.com
Outline
• CG examples
• Performance Comparison (CG ExprEval vs. Hive ExprEval)
• CG Design & Major Class Diagram
• Implemented UDFs/Generic UDFs
• Future Work
CG Examples
Set shark.expr.cg=true/false in hive-site.xml to enable/disable the feature; the default is true.
Performance Comparison (CG ExprEval vs. Hive ExprEval)
Data set: 747,747,840 records / 66,909,023,675 bytes / RCFile (with LzoCodec) on 4 slave machines
Performance Comparison (CG ExprEval vs. Hive ExprEval) (2)
• Why is CG ExprEval faster than Hive ExprEval? Hive ExprEval:
  • Repeatedly re-evaluates common sub-expressions
    • e.g. in the expression concat(year(date_add(visitDate,7)), '/', month(date_add(visitDate,7)), '/', day(date_add(visitDate,7))), “date_add(visitDate,7)” is evaluated 3 times.
  • Keeps checking data types at runtime
    • The parameter types of the “evaluate” method in GenericUDFs are unknown until runtime, so Hive ExprEval has to check the value types inside every evaluation, e.g. GenericUDFOPGreaterThan.evaluate, GenericUDFPrintf.evaluate.
  • Performs unnecessary type conversions
    • e.g. in the expression (duration + 1.03), the variable “duration” is first converted into a new FloatWritable object in Hive ExprEval, which creates many small temporary objects (GenericUDFBridge.conversionHelper).
  • Makes a large number of virtual function calls at runtime
    • Hive ExprEval always works through base class objects, particularly for the UDF objects and the field value objects.
  • Uses Java reflection to call the UDF evaluate() method
    • Hive ExprEval invokes the UDF (in class GenericUDFBridge) through the Java Reflection API, which adds further overhead (http://docs.oracle.com/javase/tutorial/reflect/index.html)
• CG ExprEval generates source code with concrete object types and concrete execution branches.
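The first point above (repeated evaluation of common sub-expressions) can be illustrated in plain Java. This is only a sketch under stated assumptions, not the code Shark actually generates: the class GeneratedEvalSketch, its method names, and the use of java.time.LocalDate in place of Hive's date handling are all hypothetical. It contrasts a tree-walking style, which computes date_add(visitDate,7) once per branch, with a generated style that hoists the shared sub-expression into one typed local variable.

```java
import java.time.LocalDate;

// Hypothetical illustration; none of these names come from Shark or Hive.
final class GeneratedEvalSketch {
    // Counts how often the shared sub-expression is computed.
    static int dateAddCalls = 0;

    // Stand-in for date_add(visitDate, days).
    static LocalDate dateAdd(LocalDate d, int days) {
        dateAddCalls++;
        return d.plusDays(days);
    }

    // Tree-walking style: every branch re-evaluates date_add(visitDate, 7).
    static String interpreted(LocalDate visitDate) {
        return dateAdd(visitDate, 7).getYear() + "/"
             + dateAdd(visitDate, 7).getMonthValue() + "/"
             + dateAdd(visitDate, 7).getDayOfMonth();
    }

    // Generated style: the common sub-expression is computed once
    // and kept in a local variable with a concrete type.
    static String generated(LocalDate visitDate) {
        LocalDate shifted = dateAdd(visitDate, 7);
        return shifted.getYear() + "/"
             + shifted.getMonthValue() + "/"
             + shifted.getDayOfMonth();
    }

    public static void main(String[] args) {
        LocalDate visitDate = LocalDate.of(2013, 12, 28);

        dateAddCalls = 0;
        String a = interpreted(visitDate);
        System.out.println(a + " " + dateAddCalls + " call(s)"); // 2014/1/4 3 call(s)

        dateAddCalls = 0;
        String b = generated(visitDate);
        System.out.println(b + " " + dateAddCalls + " call(s)"); // 2014/1/4 1 call(s)
    }
}
```

Both methods return the same string; only the number of date_add evaluations differs, which is the saving the slide attributes to CG ExprEval.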
CG Design & Major Class Diagram (2)
• Why not generate the bytecode directly?
  • The generated content is quite complicated; source code is much easier to debug and troubleshoot.
  • The Java compiler can apply further optimizations when it compiles the source code.
• Why generate the evaluation source code from the ExprNodeDesc tree rather than from the Hive ExprNodeEvaluator tree?
  • The ExprNodeEvaluator tree loses some information that may be helpful for further optimization (e.g. evaluating common sub-expressions only once).
  • Extracting the information from the ExprNodeEvaluator tree is difficult, as most of the variables in ExprNodeEvaluator are protected / private.
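A minimal sketch of the idea behind these design choices: walk a descriptor-style expression tree and emit Java source text, reusing one local variable for structurally identical sub-trees. Everything here is hypothetical and for illustration only. The Node class is a simplified stand-in, not Hive's ExprNodeDesc, and the emitted statements use `var` instead of the concrete types a real generator would pick.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not Shark's generator and not Hive's ExprNodeDesc.
final class SourceGenSketch {
    // Simplified descriptor node: a function name with children,
    // or a leaf (column or constant) when there are no children.
    static final class Node {
        final String name;
        final List<Node> children;

        Node(String name, Node... children) {
            this.name = name;
            this.children = List.of(children);
        }

        // Structural key: identical sub-trees produce identical keys.
        String key() {
            if (children.isEmpty()) return name;
            StringBuilder sb = new StringBuilder(name).append('(');
            for (int i = 0; i < children.size(); i++) {
                if (i > 0) sb.append(',');
                sb.append(children.get(i).key());
            }
            return sb.append(')').toString();
        }
    }

    // Maps a sub-tree key to the local variable that already holds its value.
    private final Map<String, String> locals = new LinkedHashMap<>();
    private final StringBuilder body = new StringBuilder();

    // Emits one statement per distinct function call and returns
    // the text (leaf name or local variable) that refers to the result.
    String emit(Node n) {
        if (n.children.isEmpty()) return n.name;
        String key = n.key();
        String existing = locals.get(key);
        if (existing != null) return existing; // common sub-expression: reuse it
        List<String> args = new ArrayList<>();
        for (Node c : n.children) args.add(emit(c));
        String local = "t" + locals.size();
        locals.put(key, local);
        body.append("var ").append(local).append(" = ").append(n.name)
            .append('(').append(String.join(", ", args)).append(");\n");
        return local;
    }

    static String generate(Node root) {
        SourceGenSketch g = new SourceGenSketch();
        String result = g.emit(root);
        return g.body + "return " + result + ";\n";
    }

    public static void main(String[] args) {
        // Two separately built but structurally identical date_add sub-trees.
        Node d1 = new Node("date_add", new Node("visitDate"), new Node("7"));
        Node d2 = new Node("date_add", new Node("visitDate"), new Node("7"));
        Node root = new Node("concat",
                new Node("year", d1), new Node("'/'"), new Node("month", d2));
        System.out.print(generate(root));
    }
}
```

For the tree built in main, date_add(visitDate, 7) is emitted once as t0 and both year and month read from it. Because the output is ordinary source text, it can be inspected or stepped through before compilation, which is the debugging advantage the slide cites over emitting bytecode directly.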
Implemented UDFs/Generic UDFs
• Supported Features:
  • Relational Operators (=, !=, <, <= etc.)
  • Arithmetic Operators (+, -, *, /, % etc.)
  • Logical Operators (AND, OR, NOT etc.)
  • Built-in Functions (UDFs) and existing User-Defined Functions
  • Some of the Generic UDFs
    • GenericUDFBetween
    • GenericUDFPrintf
    • GenericUDFInstr
    • GenericUDFBridge
• Unsupported Features:
  • Conditional Functions (if/case/when etc.)
  • Map/Array
  • UDAF
  • UDTF
  • Misc. Functions (java_method/reflect/hash etc.)
Future Work
• Compile the generated Java source once and distribute it across the cluster
• Reuse the generated .class for identical queries
• Support more Generic UDFs (case/when/if etc.)
• Support collection types (Array/Map etc.)
• Code Gen in aggregations
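One possible shape for the "reuse" item above, offered as a sketch only since the slides do not describe a design: cache the compiled evaluator under the expression text, so that a repeated query skips code generation and compilation. The class EvaluatorCacheSketch and the compiler function passed to it are hypothetical; a real implementation would also need a cache key that accounts for column types and a way to ship the class to the workers.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical illustration; not part of Shark.
final class EvaluatorCacheSketch<E> {
    private final Map<String, E> cache = new ConcurrentHashMap<>();
    // Stand-in for "generate source and compile it"; supplied by the caller.
    private final Function<String, E> compiler;
    // Counts how many times the expensive compile step actually ran.
    final AtomicInteger compilations = new AtomicInteger();

    EvaluatorCacheSketch(Function<String, E> compiler) {
        this.compiler = compiler;
    }

    // Returns the cached evaluator, compiling only on the first request.
    E get(String expressionText) {
        return cache.computeIfAbsent(expressionText, text -> {
            compilations.incrementAndGet();
            return compiler.apply(text);
        });
    }

    public static void main(String[] args) {
        // The "compiler" here just returns the text length as a placeholder result.
        EvaluatorCacheSketch<Integer> c = new EvaluatorCacheSketch<>(String::length);
        c.get("duration + 1.03");
        c.get("duration + 1.03");
        System.out.println(c.compilations.get() + " compilation(s)"); // 1 compilation(s)
    }
}
```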