An introduction to Apache Spark: what it is, how it works, why you might use it, and some examples of its use.
Apache Spark
• What is it?
• How does it work?
• Benefits
• Tuning
• Examples
www.semtech-solutions.co.nz info@semtech-solutions.co.nz
Spark – What is it?
• Open source
• Alternative to MapReduce for certain applications
• A low-latency cluster computing system
• For very large data sets
• Can be up to 100 times faster than MapReduce for
  • Iterative algorithms
  • Interactive data mining
• Used with Hadoop / HDFS
• Released under the BSD License
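The point about iterative algorithms can be illustrated with a small sketch. It assumes a later Spark release in which the classes live under `org.apache.spark` (the `SparkConf` class and the `local[2]` master setting are part of that assumption), and the object name `IterativeSketch`, the function `fitSlope` and the toy data set are hypothetical. The data is cached once and then scanned on every pass, which is the pattern where keeping data in memory helps most.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeSketch {
  // Fits the slope w of y = w * x by gradient descent.
  // The cached data set is rescanned on every iteration.
  def fitSlope(sc: SparkContext, iterations: Int): Double = {
    // Toy data: 1000 points lying exactly on y = 3x (illustrative only)
    val data = (1 to 1000).map { i =>
      val x = i / 1000.0
      (x, 3.0 * x)
    }
    val points = sc.parallelize(data).cache() // keep the points in memory
    val n = points.count()

    var w = 0.0
    val rate = 1.0
    for (_ <- 1 to iterations) {
      val current = w // copy into a val so the closure captures a plain value
      val gradient = points
        .map { case (x, y) => (current * x - y) * x }
        .reduce(_ + _) / n
      w -= rate * gradient // each step moves w toward the true slope 3.0
    }
    w
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("Iterative Sketch")
    val sc = new SparkContext(conf)
    try {
      println("Fitted slope: " + fitSlope(sc, 50))
    } finally {
      sc.stop()
    }
  }
}
```

Without the `cache()` call each iteration would rebuild the data set from its source, which is closer to how a chain of MapReduce jobs behaves.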
Spark – How does it work?
• Uses in-memory cluster computing
• Memory access is faster than disk access
• Provides APIs in
  • Scala
  • Java
  • Python
• Can be accessed from Scala and Python shells
• Currently an Apache incubator project
Spark – Benefits
• Scales to very large clusters
• Uses in-memory processing for increased speed
• High-level APIs
  • Java, Scala, Python
• Low-latency shell access
Spark – Tuning
• Bottlenecks can occur in the cluster via
  • CPU, memory or network bandwidth
• Tune the data serialization method, e.g.
  • Java ObjectOutputStream vs Kryo
• Memory tuning
  • Use primitive types
  • Set JVM flags
• Store objects in serialized form, e.g.
  • RDD persistence
  • MEMORY_ONLY_SER
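Two of the tuning points above, switching to Kryo and storing an RDD in serialized form, can be sketched as follows. This is a minimal sketch that assumes a later Spark release with the `org.apache.spark` package layout and the `SparkConf` class; the Scala example in this deck uses the older `spark` package, where the class names differ. The object name `TuningSketch`, the helper `buildConf` and the sample data are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object TuningSketch {
  // Builds a configuration that uses Kryo instead of the default
  // Java serialization (ObjectOutputStream).
  def buildConf(): SparkConf =
    new SparkConf()
      .setMaster("local[2]") // assumption: local test run with two threads
      .setAppName("Tuning Sketch")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(buildConf())
    try {
      // Primitive Long values rather than boxed objects or nested collections
      val squares = sc.parallelize(1 to 100000)
        .map(n => n.toLong * n)
        // Keep each partition in memory as serialized bytes:
        // smaller footprint, at the cost of extra CPU when reading
        .persist(StorageLevel.MEMORY_ONLY_SER)

      // Both actions reuse the same persisted data
      val total = squares.reduce(_ + _)
      val evens = squares.filter(_ % 2 == 0).count()
      println("Sum of squares: " + total + ", even squares: " + evens)
    } finally {
      sc.stop()
    }
  }
}
```

Whether serialized storage pays off depends on the workload: it trades CPU time for memory, so it is worth measuring both settings on real data before committing to one.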
Spark – Examples
Example from spark-project.org: a Spark job in Scala that counts the lines containing "a" and the lines containing "b" in a system log.

    /*** SimpleJob.scala ***/
    import spark.SparkContext
    import SparkContext._

    object SimpleJob {
      def main(args: Array[String]) {
        val logFile = "/var/log/syslog" // Should be some file on your system
        val sc = new SparkContext("local", "Simple Job", "$YOUR_SPARK_HOME",
          List("target/scala-2.9.3/simple-project_2.9.3-1.0.jar"))
        // Read the log in 2 slices and cache it, because it is scanned twice below
        val logData = sc.textFile(logFile, 2).cache()
        // Count the lines that contain "a" and the lines that contain "b"
        val numAs = logData.filter(line => line.contains("a")).count()
        val numBs = logData.filter(line => line.contains("b")).count()
        println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
      }
    }
Contact Us
• Feel free to contact us at
  • www.semtech-solutions.co.nz
  • info@semtech-solutions.co.nz
• We offer IT project consultancy
• We are happy to hear about your problems
• You pay only for the hours you need to solve your problems