ITCS-3190. Overview. It's a fast general-purpose engine for large-scale data processing. Speed. Ease of use. Generality. Runs everywhere.
Overview
• Spark is a fast, general-purpose engine for large-scale data processing.
• Speed
• Ease of use
• Generality
• Runs everywhere
Overview
• MLlib is a machine learning library built on top of Spark, with support for many machine learning algorithms. Its main difference from MapReduce-based alternatives is speed: it can run up to 100 times faster than MapReduce.
• Spark has its own graph computation engine, called GraphX.
• The Spark Core Engine lets you write and launch raw Spark programs in Scala or Java. All of these programs are executed by the Spark Core Engine.
• Spark SQL queries structured data via SQL and the Hive Query Language (HQL).
• Spark Streaming lets you build analytical and interactive applications for live streaming data. Data is streamed in, and Spark runs its operations directly on the streamed data.
Downloading
• Download a recent release of Spark from http://spark.apache.org/downloads.html
• Select the package type “Pre-built for Hadoop 2.4 and later”
• Click Direct Download, which downloads a compressed TAR file
• Unpack the file using either:
  - the tar command-line tool that comes with most Unix and Linux variants, including Mac OS X
  - a free TAR extractor
Getting Started
• Spark comes with shells that work much like operating system shells such as Bash or the Windows Command Prompt. The difference is that Spark shells let you manipulate data that is distributed across many machines.
• To open a Spark shell, go to your Spark directory and type:
  - bin/pyspark for the Python version (bin\pyspark on Windows)
  - bin/spark-shell for the Scala version
• In Spark, computation is expressed as operations on distributed collections called RDDs (resilient distributed datasets).
• For example, a variable named lines can hold an RDD created from a text file. We can then run various operations on that RDD, such as counting the lines of text or printing the first one.
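The shell session described above might look like the following minimal sketch. It assumes the SparkContext that the bin/pyspark shell makes available as sc; the function name explore_lines and the default file name README.md are illustrative assumptions, not part of the slides.

```python
def explore_lines(sc, path="README.md"):
    """Create an RDD from a text file, then count its lines and fetch the first one.

    `sc` is the SparkContext provided by the bin/pyspark shell.
    `path` is an assumed example file; any text file will do.
    """
    lines = sc.textFile(path)      # RDD with one element per line of text
    return lines.count(), lines.first()
```

Inside the shell, calling explore_lines(sc) returns the line count and the first line as a tuple.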
Spark Application
• Like MapReduce applications, each Spark application is a self-contained computation that runs user-supplied code to compute a result.
• In Spark, the highest-level unit of computation is an application.
• A Spark application can be used for a single batch job, an interactive session with multiple jobs, or a long-lived server continually satisfying requests.
• A Spark job can consist of more than just a single map and reduce.
• A Spark application can have processes running on its behalf even when it is not running a job.
• Multiple tasks can run within the same executor.
• These last two properties enable extremely fast task startup as well as in-memory data storage, which can make performance orders of magnitude faster.
Simple Spark App: Pi Estimation
• Spark can be used for compute-intensive tasks.
• This example estimates pi by “throwing darts” at a circle:
  - pick random points in the unit square ((0,0) to (1,1))
  - check how many fall inside the unit circle
• The fraction that falls inside should be about pi/4.
• Multiplying that fraction by 4 gives an estimate of pi.
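The dart-throwing estimate can be sketched in PySpark roughly as follows. This is an illustrative sketch, not the code from the original slide: the function names and the default sample count are assumptions, and sc is assumed to be an existing SparkContext.

```python
import random

def inside(_):
    """Throw one dart at the unit square; report whether it lands inside the unit circle."""
    x, y = random.random(), random.random()
    return x * x + y * y < 1.0

def estimate_pi(sc, num_samples=100000):
    """Estimate pi as 4 times the fraction of darts that land inside the circle."""
    # Distribute the sample indices, keep only the hits, and count them.
    hits = sc.parallelize(range(num_samples)).filter(inside).count()
    return 4.0 * hits / num_samples
```

More samples give a tighter estimate, and because each dart is independent the work splits cleanly across machines.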
Simple Spark App: Word Count
• Creates a SparkConf and a SparkContext. A Spark application corresponds to an instance of the SparkContext class.
• Gets a word frequency threshold.
• Reads an input set of text documents.
• Counts the number of times each word appears.
• Filters out all words that appear fewer times than the threshold.
• For the remaining words, counts the number of times each letter occurs.
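The steps listed above could be written in PySpark along these lines. This is a sketch under assumptions: the function name is hypothetical, sc is an already created SparkContext, and each letter is counted once per distinct surviving word rather than once per occurrence.

```python
def count_letters_in_frequent_words(sc, path, threshold):
    """Count letters across the words that appear at least `threshold` times."""
    # Read the input documents and split each line into words.
    words = sc.textFile(path).flatMap(lambda line: line.split())
    # Count the number of times each word appears.
    word_counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
    # Filter out words that appear fewer times than the threshold.
    frequent = word_counts.filter(lambda pair: pair[1] >= threshold)
    # For the remaining words, count how often each letter occurs.
    letter_counts = (frequent
                     .flatMap(lambda pair: list(pair[0]))
                     .map(lambda c: (c, 1))
                     .reduceByKey(lambda a, b: a + b))
    return letter_counts.collect()
```

In a standalone program the context would be created first, for example with SparkContext(conf=SparkConf().setAppName("WordCount")), where the application name is a placeholder.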
Simple Spark Apps
• In MapReduce, the word count application above requires two MapReduce applications, as well as persisting the intermediate data to HDFS between them.
• In Spark, this application requires about 90 percent fewer lines of code than one developed using the MapReduce API.
Introduction to Scala
• A scalable programming language, influenced by Haskell and Java
• Can use any Java code in Scala, making it almost as fast as Java but with much shorter code
• Helps reduce errors: idioms such as Option make null pointer errors far less likely
• More flexible: every Scala function is a value, and every value is an object
• The Scala interpreter is an interactive shell for writing expressions:
  $ scala              starts the interpreter
  scala> 3 + 5         expression to be evaluated by the interpreter
  unnamed0: Int = 8    result of the evaluation
  scala> :quit         quits the interpreter
Scala
• A scalable language
• Object-oriented language
• Every value is an object
• Functional language concepts
• Start using it like Java
• Gradually use more functional-style syntax
• Runs on the JVM
• Many design patterns are already natively supported
• Java: Pair<Integer, String> p = new Pair<Integer, String>(1, "Scala");
• Scala: val p = new MyPair(1, "scala")
PySpark
• Python is dynamically typed, so RDDs can hold objects of multiple types.
• PySpark does not yet support a few API calls, such as lookup and non-text input files.
• PySpark requires Python 2.6 or higher. PySpark applications are executed using a standard CPython interpreter in order to support Python modules that use C extensions.
• PySpark has not been tested with Python 3 or with alternative Python interpreters, such as PyPy or Jython.
Python API: Key Differences
• In PySpark, RDDs support the same methods as their Scala counterparts but take Python functions and return Python collection types.
• Short functions can be passed to RDD methods using Python's lambda syntax.
• You can also pass functions that are defined with the def keyword; this is useful for longer functions that can’t be expressed using lambda.
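Both ways of passing functions can be seen side by side in the sketch below. The sample data and the names is_error and line_lengths_and_errors are made up for illustration; sc is assumed to be an existing SparkContext.

```python
def is_error(line):
    """A longer predicate, written with def because it does not fit neatly in a lambda."""
    text = line.strip().lower()
    if not text:
        return False
    return text.startswith("error") or " error " in text

def line_lengths_and_errors(sc):
    """Use a lambda for short logic and a def-defined function for longer logic."""
    lines = sc.parallelize(["ERROR disk full", "ok", "  ", "warn: error ahead"])
    lengths = lines.map(lambda s: len(s)).collect()   # collect() returns a Python list
    errors = lines.filter(is_error).collect()         # named function passed by reference
    return lengths, errors
```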
Python API: Key Differences
• Functions can access objects in enclosing scopes, although modifications to those objects within RDD methods will not be propagated back to the driver program.
• PySpark automatically ships these functions to workers, along with any objects that they reference. Instances of classes are serialized and shipped to workers by PySpark, but the classes themselves cannot be automatically distributed to workers.
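Why modifications are not propagated back can be illustrated without a cluster. The following is a plain-Python analogy, not PySpark code: it assumes that a pickle round trip is a fair stand-in for shipping an object to a worker, and the helper name ship_to_worker is hypothetical.

```python
import pickle

def ship_to_worker(obj):
    """Mimic shipping: the worker receives a deserialized copy, not the original object."""
    return pickle.loads(pickle.dumps(obj))

driver_counts = {"errors": 0}                  # object living in the driver program
worker_counts = ship_to_worker(driver_counts)  # copy that a worker would operate on
worker_counts["errors"] += 1                   # the modification happens on the copy only

print(driver_counts["errors"])  # prints 0: the driver's object is untouched
print(worker_counts["errors"])  # prints 1
```

Results therefore need to come back through RDD operations such as count or collect rather than through side effects on captured objects.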
Interactive Use
• The bin/pyspark script launches a Python interpreter that is configured to run PySpark applications.
• The Python shell can be used to explore data interactively and is a simple way to learn the API.
• By default, the bin/pyspark shell creates a SparkContext that runs applications locally on a single core. To connect to a non-local cluster, or to use multiple cores, set the MASTER environment variable (for example, MASTER=local[4] to use four local cores).
• PySpark can also be used from standalone Python scripts by creating a SparkContext in your script and running the script using bin/pyspark.
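A standalone script might be organized like the sketch below. Reading MASTER inside the script is a design choice made here for illustration, and the helper name choose_master and the application name "StandaloneExample" are placeholders.

```python
import os

def choose_master(default="local"):
    """Return the master URL from the MASTER environment variable, else a local default."""
    return os.environ.get("MASTER", default)

def main():
    # Imported here so that choose_master() stays usable without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext(choose_master(), "StandaloneExample")  # placeholder app name
    try:
        squares = sc.parallelize([1, 2, 3, 4]).map(lambda x: x * x).collect()
        print(squares)
    finally:
        sc.stop()  # release the context even if the job fails
```

To use it, add a call to main() at the bottom of the script and run the script with bin/pyspark as described above.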
Spark Code Examples
• Text search of error messages in a log file
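The code for this slide did not survive in the transcript, so the following is only a sketch of what such a text search could look like. It assumes that error lines start with the string "ERROR", that sc is an existing SparkContext, and that the function name find_errors is free to choose.

```python
def find_errors(sc, path, keyword):
    """Return the error lines of a log file that mention `keyword`."""
    lines = sc.textFile(path)
    # Assumed log format: error messages begin with "ERROR".
    errors = lines.filter(lambda line: line.startswith("ERROR"))
    errors.cache()  # keep the error lines in memory for repeated searches
    return errors.filter(lambda line: keyword in line).collect()
```

Caching the filtered RDD pays off when several keywords are searched in turn, since the log file is then read from disk only once.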