This presentation on the PySpark tutorial will help you understand what PySpark is, the different features of PySpark, and how Spark with Python compares to Spark with Scala. You will then learn the various PySpark contents - SparkConf, SparkContext, SparkFiles, RDD, StorageLevel, DataFrames, Broadcast, and Accumulator - and get an idea of the various subpackages in PySpark. Finally, you will look at a demo that uses PySpark SQL to analyze Walmart stock data. Now, let's dive into learning PySpark in detail.

This Apache Spark and Scala certification training is designed to advance your expertise in working with the Big Data Hadoop ecosystem. You will master essential skills of the Apache Spark open-source framework and the Scala programming language, including Spark Streaming, Spark SQL, machine learning programming, GraphX programming, and Spark shell scripting. This Scala certification course will give you vital skill sets and a competitive advantage for an exciting career as a Hadoop developer.

What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.

What are the course objectives?
Simplilearn's Apache Spark and Scala certification training is designed to:
1. Advance your expertise in the Big Data Hadoop ecosystem
2. Help you master essential Apache Spark skills, such as Spark Streaming, Spark SQL, machine learning programming, GraphX programming, and Spark shell scripting
3. Help you land a Hadoop developer job requiring Apache Spark expertise by giving you a real-life industry project coupled with 30 demos

What skills will you learn?
By completing this Apache Spark and Scala course, you will be able to:
1. Understand the limitations of MapReduce and the role of Spark in overcoming these limitations
2. Understand the fundamentals of the Scala programming language and its features
3. Explain and master the process of installing Spark as a standalone cluster
4. Develop expertise in using Resilient Distributed Datasets (RDDs) for creating applications in Spark
5. Master Structured Query Language (SQL) using Spark SQL
6. Gain a thorough understanding of Spark Streaming features
7. Master and describe the features of Spark ML programming and GraphX programming

Who should take this Scala course?
1. Professionals aspiring to a career in real-time big data analytics
2. Analytics professionals
3. Research professionals
4. IT developers and testers
5. Data scientists
6. BI and reporting professionals
7. Students who wish to gain a thorough understanding of Apache Spark

Learn more at: https://bit.ly/2WtRzQL
What’s in it for you?
• What is PySpark?
• PySpark features
• Spark with Python and Scala
• PySpark contents
• PySpark subpackages
• Companies using PySpark
• Demo using PySpark
What is PySpark?
PySpark is the Python API to support Apache Spark (Python + Spark = PySpark).
PySpark Features
• Caching and disk persistence
• Real-time analysis
• Fast processing
• Polyglot
Spark with Python and Scala
Performance: Spark is written in Scala, so Scala integrates well with Spark and is faster; Python is slower than Scala when used with Spark.
Learning curve: Python has a simple syntax and, being a high-level language, is easy to learn; Scala has a more complex syntax and is harder to learn.
Code readability: Readability, maintenance, and familiarity of code are better with the Python API; Scala is a sophisticated language, and developers need to pay close attention to the readability of the code.
Data science libraries: Python provides a rich set of libraries for data visualization and model building; Scala lacks data science libraries and tools for data visualization.
PySpark Contents
• SparkConf
• SparkContext
• SparkFiles
• RDD
• StorageLevel
• DataFrames
• Broadcast & Accumulator
PySpark – SparkConf
SparkConf provides the configuration for running a Spark application.
The following code block shows the details of the SparkConf class for PySpark:

class pyspark.SparkConf(
    loadDefaults = True,
    _jvm = None,
    _jconf = None
)

The following are some of the most commonly used attributes of SparkConf:
• set(key, value) – to set a configuration property
• setMaster(value) – to set the master URL
• setAppName(value) – to set the application name
• get(key, defaultValue=None) – to get the configuration value of a key
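As an illustration, here is a minimal sketch of how these attributes might be used to build a configuration and hand it to a SparkContext; the master URL, application name, and memory setting are arbitrary example values, not part of the original slides.

from pyspark import SparkConf, SparkContext

# Build a configuration object and set a few common properties
# ("local[2]" and "ConfDemo" are illustrative values).
conf = SparkConf() \
    .setMaster("local[2]") \
    .setAppName("ConfDemo") \
    .set("spark.executor.memory", "1g")

# Read a property back; the default is returned when the key is unset.
print(conf.get("spark.executor.memory", "512m"))

# Create the SparkContext from the configuration.
sc = SparkContext(conf=conf)
print(sc.appName)
sc.stop()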
PySpark – SparkContext
SparkContext is the main entry point of any Spark program.
[Data flow diagram: the Python SparkContext drives a SparkContext in the local JVM through Py4J over a local socket; the JVM SparkContext coordinates the Spark workers in the cluster, and each worker runs Python subprocesses connected through pipes.]
The code below shows the details of the PySpark SparkContext class along with the parameters it can take:

class pyspark.SparkContext (
    master = None,
    appName = None,
    sparkHome = None,
    pyFiles = None,
    environment = None,
    batchSize = 0,
    serializer = PickleSerializer(),
    conf = None,
    gateway = None,
    jsc = None,
    profiler_cls = <class 'pyspark.profiler.BasicProfiler'>
)
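Below is a minimal sketch of creating a SparkContext directly with a master URL and an application name; the values and the small computation are invented for illustration.

from pyspark import SparkContext

# "local" runs Spark in-process; "FirstApp" is an arbitrary app name.
sc = SparkContext("local", "FirstApp")

# Use the context to build an RDD and run a simple action.
nums = sc.parallelize([1, 2, 3, 4, 5])
print("Sum -> %i" % nums.sum())

sc.stop()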
PySpark – SparkFiles
SparkFiles allows you to upload your files using sc.addFile and get their path on a worker using SparkFiles.get.
SparkFiles contains the following classmethods:
• get(filename)
• getRootDirectory()
getRootDirectory() returns the path to the root directory that contains the files added through SparkContext.addFile().

from pyspark import SparkContext
from pyspark import SparkFiles

finddistance = "/home/Hadoop/examples/finddistance.R"
finddistancename = "finddistance.R"
sc = SparkContext("local", "SparkFile App")
sc.addFile(finddistance)
print("Absolute path -> %s" % SparkFiles.get(finddistancename))
PySpark – RDD
A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel.
RDDs support two kinds of operations:
• Transformations – operations (such as map, filter, join, union) performed on an RDD that yield a new RDD containing the result
• Actions – operations (such as reduce, first, count) that return a value after running a computation on an RDD

class pyspark.RDD (
    jrdd,
    ctx,
    jrdd_deserializer = AutoBatchedSerializer(PickleSerializer())
)

Creating a PySpark RDD and returning the number of elements in it:

from pyspark import SparkContext

sc = SparkContext("local", "count app")
words = sc.parallelize(
    ["scala", "java", "hadoop", "spark", "akka",
     "spark vs hadoop", "pyspark", "pyspark and spark"]
)
counts = words.count()
print("Number of elements in RDD -> %i" % counts)
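To make the transformation/action distinction concrete, here is a small sketch; the data and lambda functions are made up for illustration.

from pyspark import SparkContext

sc = SparkContext("local", "transformation demo")

nums = sc.parallelize([1, 2, 3, 4, 5, 6])

# Transformations are lazy: they only describe a new RDD.
evens = nums.filter(lambda x: x % 2 == 0)   # keep even numbers
squares = evens.map(lambda x: x * x)        # square each element

# Actions trigger the actual computation.
print(squares.collect())                    # [4, 16, 36]
print(squares.reduce(lambda a, b: a + b))   # 56

sc.stop()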
PySpark – StorageLevel
StorageLevel decides whether an RDD should be stored in memory, on disk, or both.

class pyspark.StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication=1)

from pyspark import SparkContext
import pyspark

sc = SparkContext("local", "storagelevel app")
rdd1 = sc.parallelize([1, 2])
rdd1.persist(pyspark.StorageLevel.MEMORY_AND_DISK_2)
print(rdd1.getStorageLevel())

Output: Disk Memory Serialized 2x Replicated
PySpark – DataFrames
A DataFrame in PySpark is a distributed collection of rows with named columns.
Characteristics shared with RDDs:
• Immutable in nature
• Lazy evaluation
• Distributed
Ways to create a DataFrame in Spark (see the sketch below):
• Using different data formats
• Loading data from an existing RDD
• Programmatically specifying a schema
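As a rough sketch of the "from an existing RDD" route, a DataFrame can be built as follows; the column names and rows are invented for the example.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("df demo").getOrCreate()

# Build a DataFrame from an existing RDD of tuples, naming the columns.
rdd = spark.sparkContext.parallelize([("Alice", 34), ("Bob", 45)])
df = spark.createDataFrame(rdd, ["name", "age"])

df.show()
df.filter(df.age > 40).show()   # DataFrame transformations are lazy too

spark.stop()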
PySpark – Broadcast and Accumulator
A broadcast variable allows the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
A broadcast variable is created with SparkContext.broadcast():

>>> from pyspark.context import SparkContext
>>> sc = SparkContext('local', 'test')
>>> b = sc.broadcast([1, 2, 3, 4, 5])
>>> b.value
[1, 2, 3, 4, 5]
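To show why caching the variable on each machine matters, here is a small sketch where a lookup table is broadcast once and then referenced inside a transformation; the table contents are invented for illustration.

from pyspark import SparkContext

sc = SparkContext("local", "broadcast demo")

# Broadcast a lookup table once instead of shipping it with every task.
country_names = sc.broadcast({"IN": "India", "US": "United States"})

codes = sc.parallelize(["IN", "US", "IN"])
names = codes.map(lambda c: country_names.value.get(c, "Unknown"))

print(names.collect())   # ['India', 'United States', 'India']

sc.stop()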
PySpark – Broadcast and Accumulator
Accumulators are variables that are only added to through an associative and commutative operation.

class pyspark.Accumulator(aid, value, accum_param)

from pyspark import SparkContext

sc = SparkContext("local", "Accumulator app")
num = sc.accumulator(10)

def f(x):
    global num
    num += x

rdd = sc.parallelize([20, 30, 40, 50])
rdd.foreach(f)
final = num.value
print("Accumulated value is -> %i" % final)

Output: Accumulated value is -> 150
Subpackages in PySpark
• pyspark.sql module (SQL)
• pyspark.streaming module (Streaming)
• pyspark.ml package (ML)
• pyspark.mllib package (MLlib)
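The description above mentions a demo that uses PySpark SQL to analyze Walmart stock data. The sketch below is only an illustrative approximation of that kind of analysis using the pyspark.sql module; the CSV path and the column names (Date, Close) are hypothetical and should be adjusted to the actual dataset.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("walmart stock demo").getOrCreate()

# Hypothetical input file and schema; adjust to the real dataset.
df = spark.read.csv("walmart_stock.csv", header=True, inferSchema=True)
df.printSchema()

# Register the DataFrame as a temporary view and query it with SQL.
df.createOrReplaceTempView("stocks")
spark.sql("""
    SELECT Date, Close
    FROM stocks
    WHERE Close > 60
    ORDER BY Close DESC
""").show(5)

spark.stop()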