This presentation gives an overview of the Apache Tez project. It explains Tez as a processing system built on Hadoop YARN and compares it with MapReduce.

Links for further information and connecting:
http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
https://nz.linkedin.com/pub/mike-frampton/20/630/385
https://open-source-systems.blogspot.com/

Music: "Little Planet", composed and performed by Bensound, from http://www.bensound.com/
What Is Apache Tez?
● An application framework
● Built on top of Apache Hadoop YARN
● Uses directed acyclic graphs (DAGs)
● Open source / Apache 2.0 license
● Scalable
● Performant
Tez DAG
● Tez directed acyclic graphs (DAGs)
● Distributed data processing
● Vertices represent data transformation
● Edges represent data movement
● For data processing applications
● Tez is an execution engine
● Built on top of YARN
Tez Performance
● Performance improvement compared to MapReduce
  – No need for HDFS storage between MR jobs
  – Better execution performance
● Expressive dataflow API for DAGs
  – Visualise what you wish to construct
  – Add processor vertices to the graph
  – Add data movement edges to the graph
  – To build the computational DAG that you require
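The "add vertices, add edges" idea above can be sketched in a few lines of plain Python. This is only a toy model to make the shape of a dataflow DAG concrete: the `Dag` class, its method names, and the `word_count` example are all hypothetical and are not the Tez API, which is a Java library.

```python
from collections import defaultdict, deque


class Dag:
    """Toy dataflow graph: vertices are processing steps, edges move data.

    Hypothetical illustration only; this is not the Tez API.
    """

    def __init__(self, name):
        self.name = name
        self.vertices = {}              # vertex name -> processing function
        self.edges = defaultdict(list)  # producer name -> consumer names

    def add_vertex(self, name, processor):
        if name in self.vertices:
            raise ValueError(f"duplicate vertex: {name}")
        self.vertices[name] = processor
        return self  # returning self allows chained calls

    def add_edge(self, producer, consumer):
        for vertex in (producer, consumer):
            if vertex not in self.vertices:
                raise KeyError(f"unknown vertex: {vertex}")
        self.edges[producer].append(consumer)
        return self

    def execution_order(self):
        """Return a valid run order (Kahn's algorithm); reject cycles."""
        indegree = {v: 0 for v in self.vertices}
        for consumers in self.edges.values():
            for consumer in consumers:
                indegree[consumer] += 1
        ready = deque(v for v, d in indegree.items() if d == 0)
        order = []
        while ready:
            vertex = ready.popleft()
            order.append(vertex)
            for consumer in self.edges.get(vertex, []):
                indegree[consumer] -= 1
                if indegree[consumer] == 0:
                    ready.append(consumer)
        if len(order) != len(self.vertices):
            raise ValueError("graph has a cycle, so it is not a DAG")
        return order


# Build a two-step graph: tokenize feeds count.
dag = (
    Dag("word_count")
    .add_vertex("tokenize", str.split)
    .add_vertex("count", len)
    .add_edge("tokenize", "count")
)
order = dag.execution_order()
```

Because the whole computation is described as one graph, an engine can pass data from `tokenize` to `count` directly instead of treating each step as a separate job with stored output in between, which is the contrast with chained MR jobs drawn on this slide.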
Tez Deployment
● Tez is client side
● Install Tez client locally
● Build task DAG
● Load DAG/Tez libraries to HDFS
● Execute YARN based job
  – From Tez client
  – Using HDFS based DAG library
Tez Existing MR Tasks
● Tez can process existing MapReduce (MR) tasks
● No need for any modification
● Allows for phased migration
  – Of existing MR jobs to DAGs
● Allows for near real time task types
● Rather than just MR tasks, which are
  – Batch oriented
  – Iterative
  – Resource intensive
Tez API
● Tez DAG defines the job
● Vertex defines one DAG job step
  – Requires user logic and resources for the step
● Edge defines one DAG data movement step
  – From producer to consumer
  – Edge properties define movement
    ● How data moves
    ● When the consumer is scheduled relative to the producer
    ● How durable the data is
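The three edge properties listed above (how data moves, when it is scheduled, how durable it is) can be illustrated with a small plain-Python sketch. Everything here is an assumption made for illustration: the enum members follow the general vocabulary of data movement, scheduling, and durability rather than quoting the Tez Java API, and the `Edge` class and `route` function are hypothetical helpers, so check names against the Tez documentation before relying on them.

```python
from dataclasses import dataclass
from enum import Enum


class DataMovement(Enum):          # how data moves (illustrative names)
    ONE_TO_ONE = "one_to_one"          # producer task i feeds consumer task i
    BROADCAST = "broadcast"            # every consumer task sees all output
    SCATTER_GATHER = "scatter_gather"  # output is partitioned across consumers


class Scheduling(Enum):            # when the consumer runs (assumed meanings)
    SEQUENTIAL = "sequential"          # consumer scheduled after the producer
    CONCURRENT = "concurrent"          # consumer scheduled alongside the producer


class Durability(Enum):            # how long output survives (assumed meanings)
    PERSISTED = "persisted"            # kept after the producing task exits
    PERSISTED_RELIABLE = "reliable"    # kept on reliable storage
    EPHEMERAL = "ephemeral"            # available only while the producer runs


@dataclass(frozen=True)
class Edge:
    """Hypothetical edge description, not the Tez Edge class."""
    producer: str
    consumer: str
    movement: DataMovement
    scheduling: Scheduling = Scheduling.SEQUENTIAL
    durability: Durability = Durability.PERSISTED


def route(movement, producer_outputs, num_consumers, partition=hash):
    """Return the records each consumer task receives, one list per task."""
    if movement is DataMovement.ONE_TO_ONE:
        if len(producer_outputs) != num_consumers:
            raise ValueError("one-to-one needs equal producer and consumer task counts")
        return [list(out) for out in producer_outputs]
    everything = [rec for out in producer_outputs for rec in out]
    if movement is DataMovement.BROADCAST:
        return [list(everything) for _ in range(num_consumers)]
    # SCATTER_GATHER: each record goes to exactly one consumer partition.
    buckets = [[] for _ in range(num_consumers)]
    for rec in everything:
        buckets[partition(rec) % num_consumers].append(rec)
    return buckets


edge = Edge("tokenize", "count", DataMovement.SCATTER_GATHER)
outputs = [[1, 2], [3, 4]]  # records emitted by two producer tasks
shuffled = route(edge.movement, outputs, num_consumers=2)
```

With integer records and the default `hash` partitioner, even values land in the first bucket and odd values in the second, which is the familiar shuffle pattern between a map-like step and a reduce-like step.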
Tez Hive
● Increased performance
  – Compared to MapReduce usage
● No need to use HDFS for intermediate steps
● Greater parallelism via DAGs
● Less complex steps in DAG compared to MR
● Reduced latency
● Higher throughput
● Better speed
Available Books
● See “Big Data Made Easy”
  – Apress Jan 2015
● See “Mastering Apache Spark”
  – Packt Oct 2015
● See “Complete Guide to Open Source Big Data Stack”
  – Apress Jan 2018
● Find the author on Amazon
  – www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
● Connect on LinkedIn
  – www.linkedin.com/in/mike-frampton-38563020
Connect
● Feel free to connect on LinkedIn
  – www.linkedin.com/in/mike-frampton-38563020
● See my open source blog at
  – open-source-systems.blogspot.com/
● I am always interested in
  – New technology
  – Opportunities
  – Technology based issues
  – Big data integration