
Real-time PMML Scoring over Spark Streaming and Storm

Explore real-time PMML scoring techniques with Spark Streaming and Storm in big data analytics. Learn about Naïve Bayes models and practical use cases for internet traffic analysis and arrhythmia detection.


Presentation Transcript


  1. Real-time PMML Scoring over Spark Streaming and Storm. Dr. Vijay Srinivas Agneeswaran, Director and Head, Big-data R&D, Innovation Labs, Impetus

  2. Contents

  3. Big Data Computations. References: [1] National Research Council. Frontiers in Massive Data Analysis. Washington, DC: The National Academies Press, 2013. [2] Richter, Yossi; Yom-Tov, Elad; Slonim, Noam: Predicting Customer Churn in Mobile Networks through Analysis of Social Groups. In: Proceedings of the SIAM International Conference on Data Mining, 2010, pp. 732-741.

  4. Berkeley Big-data Analytics Stack (BDAS)

  5. BDAS: Spark. [MZ12] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation (NSDI'12). USENIX Association, Berkeley, CA, USA, 2-2.

  6. BDAS: Discretized Streams
      pageViews = readStream("http://...", "1s")
      ones = pageViews.map(event => (event.url, 1))
      counts = ones.runningReduce((a, b) => a + b)

  7. BDAS: D-Streams Streaming Operators
      words = sentences.flatMap(s => s.split(" "))
      pairs = words.map(w => (w, 1))
      counts = pairs.reduceByKey((a, b) => a + b)
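
  As a minimal sketch of how these three operators fit into a runnable Spark Streaming job (the local[2] master, the socket source on port 9999 and the 1-second batch interval are illustrative assumptions, not taken from the slides):

      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import org.apache.spark.streaming.StreamingContext._   // pair-DStream operations such as reduceByKey (needed on older Spark versions)

      object StreamingWordCount {
        def main(args: Array[String]): Unit = {
          // 1-second micro-batches, mirroring the "1s" interval on the D-Streams slide
          val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
          val ssc = new StreamingContext(conf, Seconds(1))

          // Illustrative source: lines of text arriving on a local TCP socket
          val sentences = ssc.socketTextStream("localhost", 9999)

          // The three operators from the slide
          val words = sentences.flatMap(s => s.split(" "))
          val pairs = words.map(w => (w, 1))
          val counts = pairs.reduceByKey((a, b) => a + b)

          counts.print()        // word counts computed for each micro-batch
          ssc.start()
          ssc.awaitTermination()
        }
      }

  Note that reduceByKey aggregates within each micro-batch; keeping a running total across batches, as the runningReduce on the previous slide does, requires a stateful operator such as updateStateByKey.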

  8. BDAS: Use Cases

  9. Real-time Analytics: R over Storm

  10. Real-time Analytics UC 1: Internet Traffic Analysis

  11. Real-time Analysis UC2: Arrhythmia Detection

  12. PMML Primer

  13. Naïve Bayes Primer. Bayes' rule: P(Class | Data) = P(Data | Class) · P(Class) / P(Data), where P(Data | Class) is the likelihood, P(Class) the prior, and P(Data) the normalization constant.
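
  For reference, the rule behind these labels is the standard Naïve Bayes formulation (written out here; only the three labels appear in the transcript), with the conditional-independence assumption over the predictors V1, ..., Vn:

      \[
        P(C \mid V_1, \dots, V_n)
          = \frac{P(V_1, \dots, V_n \mid C)\, P(C)}{P(V_1, \dots, V_n)}
          \;\propto\; P(C) \prod_{i=1}^{n} P(V_i \mid C),
        \qquad
        \hat{C} = \arg\max_{C} \; P(C) \prod_{i=1}^{n} P(V_i \mid C)
      \]

  Here the numerator terms are the likelihood and the prior, and P(V_1, ..., V_n) is the normalization constant, matching the labels on the slide.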

  14. PMML Scoring for Naïve Bayes

  15. PMML Scoring for Naïve Bayes
      <DataDictionary numberOfFields="4">
        <DataField name="Class" optype="categorical" dataType="string">
          <Value value="democrat"/>
          <Value value="republican"/>
        </DataField>
        <DataField name="V1" optype="categorical" dataType="string">
          <Value value="n"/>
          <Value value="y"/>
        </DataField>
        <DataField name="V2" optype="categorical" dataType="string">
          <Value value="n"/>
          <Value value="y"/>
        </DataField>
        <DataField name="V3" optype="categorical" dataType="string">
          <Value value="n"/>
          <Value value="y"/>
        </DataField>
      </DataDictionary>
      (ctd. on the next slide)

  16. PMML Scoring for Naïve Bayes
      <NaiveBayesModel modelName="naiveBayes_Model" functionName="classification" threshold="0.003">
        <MiningSchema>
          <MiningField name="Class" usageType="predicted"/>
          <MiningField name="V1" usageType="active"/>
          <MiningField name="V2" usageType="active"/>
          <MiningField name="V3" usageType="active"/>
        </MiningSchema>
        <Output>
          <OutputField name="Predicted_Class" feature="predictedValue"/>
          <OutputField name="Probability_democrat" optype="continuous" dataType="double" feature="probability" value="democrat"/>
          <OutputField name="Probability_republican" optype="continuous" dataType="double" feature="probability" value="republican"/>
        </Output>
        <BayesInputs>
      (ctd. on the next slide)

  17. PMML Scoring for Naïve Bayes
      <BayesInputs>
        <BayesInput fieldName="V1">
          <PairCounts value="n">
            <TargetValueCounts>
              <TargetValueCount value="democrat" count="51"/>
              <TargetValueCount value="republican" count="85"/>
            </TargetValueCounts>
          </PairCounts>
          <PairCounts value="y">
            <TargetValueCounts>
              <TargetValueCount value="democrat" count="73"/>
              <TargetValueCount value="republican" count="23"/>
            </TargetValueCounts>
          </PairCounts>
        </BayesInput>
        <BayesInput fieldName="V2"> *
        <BayesInput fieldName="V3"> *
      </BayesInputs>
      <BayesOutput fieldName="Class">
        <TargetValueCounts>
          <TargetValueCount value="democrat" count="124"/>
          <TargetValueCount value="republican" count="108"/>
        </TargetValueCounts>
      </BayesOutput>

  18. PMML Scoring for Naïve Bayes. Definition of elements:
      • DataDictionary: definitions of the fields used in the mining model (Class, V1, V2, V3).
      • NaiveBayesModel: indicates that this is a Naïve Bayes PMML model.
      • MiningSchema: lists the fields used in the model; Class is the "predicted" field, while V1, V2 and V3 are "active" predictor fields.
      • Output: describes the set of result values that can be returned from the model.

  19. PMML Scoring for Naïve Bayes. Definition of elements (ctd.):
      • BayesInputs: for each input field, contains the counts of the target values observed with each input value.
      • BayesOutput: contains the counts associated with the values of the target field.

  20. PMML Scoring for Naïve Bayes. Sample input:
      Eg1: n y y n y y n n n n n n y y y y
      Eg2: n y n y y y n n n n n y y y n y
      • 1st, 2nd and 3rd columns: predictor variables (attribute "name" in element MiningField).
      • Using these we predict whether the output is democrat or republican (PMML element BayesOutput).
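
  To make the arithmetic behind this prediction concrete, here is a minimal, self-contained Scala sketch that turns the counts from the PMML above into class probabilities. It uses only the numbers actually shown on the slides (the BayesOutput class counts and the V1 PairCounts); the V2/V3 counts are elided there, so a full scorer would multiply in their likelihoods the same way and apply the model's threshold to zero counts. This illustrates the computation, not the PMML engine used in the talk.

      object NaiveBayesCountScoring {
        // BayesOutput counts (class priors): democrat = 124, republican = 108
        val classCounts = Map("democrat" -> 124.0, "republican" -> 108.0)

        // BayesInput counts for V1: PairCounts for values "n" and "y"
        val v1Counts: Map[String, Map[String, Double]] = Map(
          "n" -> Map("democrat" -> 51.0, "republican" -> 85.0),
          "y" -> Map("democrat" -> 73.0, "republican" -> 23.0)
        )

        // P(V1 = value | class), estimated from the counts
        def likelihood(value: String, clazz: String): Double =
          v1Counts(value)(clazz) / classCounts(clazz)

        // Posterior over classes given V1 only (V2/V3 would be multiplied in the same way)
        def score(v1: String): Map[String, Double] = {
          val total = classCounts.values.sum
          val joint = classCounts.map { case (c, n) => c -> (n / total) * likelihood(v1, c) }
          val z = joint.values.sum                         // normalization constant
          joint.map { case (c, p) => c -> p / z }
        }

        def main(args: Array[String]): Unit = {
          val probs = score("n")                           // first column of sample input Eg1 is "n"
          println(probs)
          println("Predicted_Class = " + probs.maxBy(_._2)._1)
        }
      }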

  21. PMML Scoring for Naïve Bayes • 3-node Xeon Storm cluster (8 quad-core CPUs, 32 GB RAM, 32 GB swap space; 1 Nimbus, 2 Supervisors)

  22. PMML Scoring for Naïve Bayes • 3-node Xeon Spark cluster (8 quad-core CPUs, 32 GB RAM and 32 GB swap space)

  23. Thank You!

  24. Backup slides

  25. Representation of an RDD

  26. Logistic Regression: Spark vs. Hadoop (source: http://spark-project.org)

  27. Some Spark(ling) examples. Scala code (serial):
      var count = 0
      for (i <- 1 to 100000) {
        val x = Math.random * 2 - 1
        val y = Math.random * 2 - 1
        if (x*x + y*y < 1) count += 1
      }
      println("Pi is roughly " + 4 * count / 100000.0)
      Sample random points in the unit square and count how many fall inside the unit circle (roughly π/4 of them); this yields an approximate value for π. Since PS/PC ≈ AS/AC = 4/π (points in, and areas of, the square vs. the circle), π ≈ 4 * (PC/PS).

  28. Some Spark(ling) examples. Spark code (parallel):
      val spark = new SparkContext(<Mesos master>)
      var count = spark.accumulator(0)
      for (i <- spark.parallelize(1 to 100000, 12)) {
        val x = Math.random * 2 - 1
        val y = Math.random * 2 - 1
        if (x*x + y*y < 1) count += 1
      }
      println("Pi is roughly " + 4 * count.value / 100000.0)
      Notable points:
      • A Spark context is created; it talks to the Mesos [1] master.
      • count becomes a shared variable (an accumulator).
      • The for loop runs over an RDD: parallelize breaks the Scala range object (1 to 100000) into 12 slices.
      • The for comprehension over the RDD desugars to a call to the RDD's foreach method.
      [1] Mesos is an Apache-incubated clustering system: http://mesosproject.org

  29. Logistic Regression in Spark: Serial Code
      // Read data file and convert it into Point objects
      val lines = scala.io.Source.fromFile("data.txt").getLines()
      val points = lines.map(x => parsePoint(x))
      // Run logistic regression
      var w = Vector.random(D)
      for (i <- 1 to ITERATIONS) {
        val gradient = Vector.zeros(D)
        for (p <- points) {
          val scale = (1/(1+Math.exp(-p.y*(w dot p.x)))-1)*p.y
          gradient += scale * p.x
        }
        w -= gradient
      }
      println("Result: " + w)

  30. Logistic Regression in Spark
      // Read data file and transform it into Point objects
      val spark = new SparkContext(<Mesos master>)
      val lines = spark.hdfsTextFile("hdfs://.../data.txt")
      val points = lines.map(x => parsePoint(x)).cache()
      // Run logistic regression
      var w = Vector.random(D)
      for (i <- 1 to ITERATIONS) {
        val gradient = spark.accumulator(Vector.zeros(D))
        for (p <- points) {
          val scale = (1/(1+Math.exp(-p.y*(w dot p.x)))-1)*p.y
          gradient += scale * p.x
        }
        w -= gradient.value
      }
      println("Result: " + w)
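
  For readers tracing the scale expression in the two code versions above: both implement batch gradient descent on the logistic loss with labels y_p in {-1, +1}. The update they perform is (a standard derivation, not spelled out on the slides):

      \[
        L(w) = \sum_{p} \log\!\left(1 + e^{-y_p (w \cdot x_p)}\right),
        \qquad
        \nabla_w L = \sum_{p} \left(\frac{1}{1 + e^{-y_p (w \cdot x_p)}} - 1\right) y_p\, x_p,
        \qquad
        w \leftarrow w - \nabla_w L .
      \]

  The inner loop accumulates exactly this gradient (scale * p.x), and w -= gradient applies the descent step; the Spark version simply distributes the inner loop over the cached points RDD and sums the per-partition contributions with an accumulator.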
