Scala/Spark Review (6/27/2014): Small Scale Scala Spark Demo, DataRepos, Spark Executors
Scala 2.9.3: partition
// withoutTen({1, 10, 10, 2}) → {1, 2, 0, 0}
// withoutTen({10, 2, 10}) → {2, 0, 0}
// withoutTen({1, 99, 10}) → {1, 99, 0}
def withoutTen(nums: Array[Int]): Array[Int] = {
  // can't get span and dropWhile/takeWhile to work correctly in 2.9.3
  nums.partition(_ == 10)._2.padTo(nums.size, 0)
}
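A simpler alternative not shown on the slide (a sketch; withoutTenAlt is my own name) is filter, which drops the 10s directly and also works in 2.9.3:

// filter keeps everything that is not 10, padTo restores the original length with zeros
def withoutTenAlt(nums: Array[Int]): Array[Int] =
  nums.filter(_ != 10).padTo(nums.size, 0)

// withoutTenAlt(Array(1, 10, 10, 2)) == Array(1, 2, 0, 0)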
partition splits into two arrays/lists
scala> a
res37: Array[Int] = Array(1, 2, 3, 10, 10, 10, 1)
scala> a.partition(_ == 10)
res38: (Array[Int], Array[Int]) = (Array(10, 10, 10), Array(1, 2, 3, 1))
span looks like it should do the same thing, but doesn't
• scala> a.span(_ == 10)
• res44: (Array[Int], Array[Int]) = (Array(), Array(1, 2, 3, 10, 10, 10, 1))
• span only splits off a leading run that matches the predicate, so it only "works" when the matching elements come first
• scala> a.span(_ == 1)
• res45: (Array[Int], Array[Int]) = (Array(1), Array(2, 3, 10, 10, 10, 1))
• takeWhile/dropWhile behave the same way (see the sketch below)
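A short sketch of why these methods only handle a leading run (expected results shown as comments; they follow the standard library's documented behavior, not output from the original slides):

val a = Array(1, 2, 3, 10, 10, 10, 1)
// span/takeWhile/dropWhile stop at the FIRST element that fails the predicate,
// unlike partition, which scans the whole collection
a.takeWhile(_ != 10)  // Array(1, 2, 3)
a.dropWhile(_ != 10)  // Array(10, 10, 10, 1)
a.span(_ != 10)       // (Array(1, 2, 3), Array(10, 10, 10, 1))
// a.span(_ == 10) gives an empty prefix because a(0) is not 10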
Using fold/reduce for accumulation
• foldLeft/foldRight take a binary operator; use a named add function instead of ( _ + _ ) so we can add logging and logic
• def add(res: Int, x: Int) = { println("res:" + res + " x:" + x); res + x }
• val a = Array(1, 2, 3, 10, 10, 10, 1)
• Add an if statement so only even values are accumulated:
• scala> def add(res: Int, x: Int) = { println("res:" + res + " x:" + x); if (x % 2 == 0) res + x else res }
• add: (res: Int, x: Int)Int
• a.foldLeft(0)(add)
foldLeft
scala> a.foldLeft(0)(add)
res:0 x:1
res:0 x:2
res:2 x:3
res:2 x:10
res:12 x:10
res:22 x:10
res:32 x:1
res51: Int = 32
reduceLeft
scala> a.reduceLeft(add)
res:1 x:2
res:3 x:3
res:3 x:10
res:13 x:10
res:23 x:10
res:33 x:1
res49: Int = 33
Notes on embedded functions
• Can't add extra logic to _ + _ inline; you need a named function
• The other collection methods are limited to functions that return a boolean, a count, or another collection
• reduceRight/foldRight supply the arguments in the opposite order (element first, accumulator second), so the add parameters have to be swapped (see the sketch below)
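A minimal sketch of the swapped argument order for foldRight (addRight and its parameter names are mine, not from the slide):

def addRight(x: Int, acc: Int): Int = {
  // foldRight passes the element first and the accumulator second
  println("x:" + x + " acc:" + acc)
  if (x % 2 == 0) acc + x else acc
}
a.foldRight(0)(addRight)  // also 32: the same even values are accumulated, just right to left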
Spark
• CDK, MapReduce parallelism vs. Spark executors on Mesos/YARN
• Spark Job Server demo
• Change Dependencies.scala:
lazy val commonDeps = Seq(...
  "org.apache.hadoop" % "hadoop-common" % "2.3.0",
  "org.apache.hadoop" % "hadoop-client" % "2.3.0",
  "org.apache.hadoop" % "hadoop-hdfs" % "2.3.0"
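A sketch of how the edited commonDeps might read once the Hadoop entries are added (the entries elided as ... above stay as they were; the Hadoop version should match the cluster, which is what the IPC error on a later slide is about):

lazy val commonDeps = Seq(
  // ...existing job server dependencies, unchanged...
  "org.apache.hadoop" % "hadoop-common" % "2.3.0",
  "org.apache.hadoop" % "hadoop-client" % "2.3.0",
  "org.apache.hadoop" % "hadoop-hdfs"   % "2.3.0"
)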
Test HDFS access: HelloWorld.scala

import com.typesafe.config.ConfigFactory
import org.apache.spark.SparkContext
import spark.jobserver.SparkJob

object HelloWorld extends SparkJob {
  def main(args: Array[String]) {
    println("asdf") // won't see this when the job runs through the job server
    val sc = new SparkContext("local[2]", "HelloWorld")
    val config = ConfigFactory.parseString("")
    val results = runJob(sc, config)
    println("results:" + results)
  }
IPC error

Caused by: org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4
  at org.apache.hadoop.ipc.Client.call(Client.java:1113)

Wrong version of the Hadoop client libs: the job server was built against a Hadoop 1.x client, which cannot talk to a Hadoop 2.x NameNode. Fix the hadoop-* versions in Dependencies.scala (earlier slide).
Validate, runJob

def validate(sc: SparkContext, config: Config): SparkJobValidation = {
  Try(config.getString("input.string")).map(x => SparkJobValid).getOrElse(SparkJ$
}

override def runJob(sc: SparkContext, config: Config): Any = {
  val dd = sc.textFile("hdfs://localhost:8020/user/dc/books")
  dd.count()
}
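The validate line above is cut off on the slide; a complete sketch, assuming the usual spark-jobserver 0.3.x API (SparkJobValid / SparkJobInvalid; the error message and the import block placed at the top of HelloWorld.scala are my own additions):

import scala.util.Try
import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{SparkJob, SparkJobValid, SparkJobInvalid, SparkJobValidation}

def validate(sc: SparkContext, config: Config): SparkJobValidation = {
  // valid only if the caller supplied an input.string value (message wording is an assumption)
  Try(config.getString("input.string"))
    .map(_ => SparkJobValid)
    .getOrElse(SparkJobInvalid("No input.string config param"))
}

override def runJob(sc: SparkContext, config: Config): Any = {
  // count the lines of the HDFS books file
  sc.textFile("hdfs://localhost:8020/user/dc/books").count()
}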
Results
Test in spark-shell first: count the number of lines in the HDFS books file (see the sketch below).
1) sbt package to create a jar
2) Start the Spark Job Server: > re-start
   Verify you see a UI at localhost:8090
3) Load the jar you packaged in 1):
[dc@localhost spark-jobserver-master]$ curl --data-binary @job-server-tests/target/job-server-tests-0.3.1.jar localhost:8090/jars/test
OK
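A sketch of the spark-shell check mentioned in the first step (sc is already provided by the shell; the count should match the result the job server returns later):

val dd = sc.textFile("hdfs://localhost:8020/user/dc/books")
dd.count()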
Jobserver Hadoop HelloWorld
4) Run the jar on Spark through the job server:
[dc@localhost spark-jobserver-master]$ curl -d "input.string = a a a a a a a b b" 'localhost:8090/jobs?appName=test&classPath=spark.jobserver.HelloWorld'
{
  "status": "STARTED",
  "result": {
    "jobId": "ce208815-f445-4a77-866c-0be46fdd5df9",
    "context": "70b92cb1-spark.jobserver.HelloWorld"
  }
}
Query the job server for results
[dc@localhost spark-jobserver-master]$ curl localhost:8090/jobs/ce208815-f445-4a77-866c-0be46fdd5df9
{
  "status": "OK",
  "result": 5
}