Real-time PMML Scoring over Spark Streaming and Storm
Dr. Vijay Srinivas Agneeswaran, Director and Head, Big Data R&D, Innovation Labs, Impetus
Big Data Computations
[1] National Research Council. Frontiers in Massive Data Analysis. Washington, DC: The National Academies Press, 2013.
[2] Richter, Yossi; Yom-Tov, Elad; Slonim, Noam. Predicting Customer Churn in Mobile Networks through Analysis of Social Groups. In: Proceedings of the SIAM International Conference on Data Mining, 2010, pp. 732-741.
BDAS: Spark
[MZ12] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI'12). USENIX Association, Berkeley, CA, USA, 2-2.
BDAS: Discretized Streams

pageViews = readStream("http://...", "1s")
ones = pageViews.map(event => (event.url, 1))
counts = ones.runningReduce((a, b) => a + b)
BDAS: D-Streams Streaming Operators

words = sentences.flatMap(s => s.split(" "))
pairs = words.map(w => (w, 1))
counts = pairs.reduceByKey((a, b) => a + b)
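The two snippets above are pseudocode from the D-Streams design. For reference, here is a minimal runnable word-count sketch against the Spark Streaming API; the socket source on localhost:9999, the local master, and the checkpoint path are illustrative assumptions, not details from the talk:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))  // 1-second batches, matching "1s" above
    ssc.checkpoint("/tmp/checkpoint")                  // stateful operators need a checkpoint dir

    val sentences = ssc.socketTextStream("localhost", 9999)
    val words = sentences.flatMap(s => s.split(" "))
    val pairs = words.map(w => (w, 1))
    // running count across batches, analogous to runningReduce in the D-Streams pseudocode
    val counts = pairs.updateStateByKey[Int] { (batch: Seq[Int], state: Option[Int]) =>
      Some(batch.sum + state.getOrElse(0))
    }
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Here updateStateByKey plays the role of runningReduce: it folds each batch's per-key counts into state carried across batches, which is why a checkpoint directory must be configured.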
Naïve Bayes Primer

P(C | X) = P(X | C) · P(C) / P(X)

where P(X | C) is the likelihood, P(C) is the prior, and P(X) is the normalization constant.
PMML Scoring for Naïve Bayes

<DataDictionary numberOfFields="4">
  <DataField name="Class" optype="categorical" dataType="string">
    <Value value="democrat"/>
    <Value value="republican"/>
  </DataField>
  <DataField name="V1" optype="categorical" dataType="string">
    <Value value="n"/>
    <Value value="y"/>
  </DataField>
  <DataField name="V2" optype="categorical" dataType="string">
    <Value value="n"/>
    <Value value="y"/>
  </DataField>
  <DataField name="V3" optype="categorical" dataType="string">
    <Value value="n"/>
    <Value value="y"/>
  </DataField>
</DataDictionary>

(continued on the next slide)
PMML Scoring for Naïve Bayes

<NaiveBayesModel modelName="naiveBayes_Model" functionName="classification" threshold="0.003">
  <MiningSchema>
    <MiningField name="Class" usageType="predicted"/>
    <MiningField name="V1" usageType="active"/>
    <MiningField name="V2" usageType="active"/>
    <MiningField name="V3" usageType="active"/>
  </MiningSchema>
  <Output>
    <OutputField name="Predicted_Class" feature="predictedValue"/>
    <OutputField name="Probability_democrat" optype="continuous" dataType="double" feature="probability" value="democrat"/>
    <OutputField name="Probability_republican" optype="continuous" dataType="double" feature="probability" value="republican"/>
  </Output>

(continued on the next slide)
PMML Scoring for Naïve Bayes

  <BayesInputs>
    <BayesInput fieldName="V1">
      <PairCounts value="n">
        <TargetValueCounts>
          <TargetValueCount value="democrat" count="51"/>
          <TargetValueCount value="republican" count="85"/>
        </TargetValueCounts>
      </PairCounts>
      <PairCounts value="y">
        <TargetValueCounts>
          <TargetValueCount value="democrat" count="73"/>
          <TargetValueCount value="republican" count="23"/>
        </TargetValueCounts>
      </PairCounts>
    </BayesInput>
    <BayesInput fieldName="V2"> *
    <BayesInput fieldName="V3"> *
  </BayesInputs>
  <BayesOutput fieldName="Class">
    <TargetValueCounts>
      <TargetValueCount value="democrat" count="124"/>
      <TargetValueCount value="republican" count="108"/>
    </TargetValueCounts>
  </BayesOutput>
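Since the model above is plain XML, its counts can be extracted with ordinary tooling. Below is a minimal sketch using Scala's scala-xml module; the file name is hypothetical, and the talk's implementation presumably used a dedicated PMML library rather than hand parsing:

import scala.xml.XML

object PmmlCounts {
  def main(args: Array[String]): Unit = {
    // hypothetical file name for the model shown on these slides
    val pmml = XML.loadFile("naive_bayes_model.pmml")

    // class counts from BayesOutput, e.g. Map(democrat -> 124.0, republican -> 108.0)
    val classCounts = (pmml \\ "BayesOutput" \\ "TargetValueCount").map { n =>
      (n \ "@value").text -> (n \ "@count").text.toDouble
    }.toMap

    // per-field conditional counts from BayesInputs: field -> value -> class -> count
    val pairCounts = (pmml \\ "BayesInput").map { input =>
      val field = (input \ "@fieldName").text
      val byValue = (input \ "PairCounts").map { pc =>
        (pc \ "@value").text -> (pc \\ "TargetValueCount").map { n =>
          (n \ "@value").text -> (n \ "@count").text.toDouble
        }.toMap
      }.toMap
      field -> byValue
    }.toMap

    println(classCounts)
    println(pairCounts)
  }
}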
PMML Scoring for Naïve Bayes
Definition of elements:
• DataDictionary: definitions for the fields used in mining models (Class, V1, V2, V3)
• NaiveBayesModel: indicates that this PMML document describes a Naïve Bayes model
• MiningSchema: lists the fields used in the model; Class is the "predicted" field, and V1, V2, V3 are "active" predictor fields
• Output: describes the set of result values that can be returned from the model
PMML Scoring for Naïve Bayes
Definition of elements (ctd.):
• BayesInputs: for each value of each input field, contains the counts of each target value
• BayesOutput: contains the counts associated with the values of the target field
PMML Scoring for Naïve Bayes
Sample Input
Eg1 - n y y n y y n n n n n n y y y y
Eg2 - n y n y y y n n n n n y y y n y
• 1st, 2nd and 3rd columns: predictor variables (attribute "name" in element MiningField)
• Using these, we predict whether the output is democrat or republican (PMML element BayesOutput); a scoring sketch follows below
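To make the scoring step concrete, here is a minimal plain-Scala sketch (not the talk's Storm/Spark implementation) that applies the primer's formula to the counts in the PMML above. Because the V2 and V3 counts are elided on the slide, the sketch scores on V1 alone, using the first predictor of Eg1 (V1 = n); the PMML threshold attribute, which guards against zero counts, is also ignored:

object NaiveBayesScoring {
  // class counts from BayesOutput
  val classCounts: Map[String, Double] = Map("democrat" -> 124.0, "republican" -> 108.0)

  // conditional counts from BayesInputs (only V1 is given in full on the slide)
  val pairCounts: Map[String, Map[String, Map[String, Double]]] = Map(
    "V1" -> Map(
      "n" -> Map("democrat" -> 51.0, "republican" -> 85.0),
      "y" -> Map("democrat" -> 73.0, "republican" -> 23.0)
    )
  )

  def score(input: Map[String, String]): (String, Map[String, Double]) = {
    val total = classCounts.values.sum
    val unnormalized = classCounts.map { case (cls, clsCount) =>
      val prior = clsCount / total                       // P(C)
      val likelihood = input.foldLeft(1.0) { case (acc, (field, value)) =>
        acc * pairCounts(field)(value)(cls) / clsCount   // product of P(value | C)
      }
      cls -> prior * likelihood
    }
    val z = unnormalized.values.sum                      // normalization constant P(X)
    val posterior = unnormalized.map { case (c, p) => c -> p / z }
    (posterior.maxBy(_._2)._1, posterior)
  }

  def main(args: Array[String]): Unit = {
    val (label, probs) = score(Map("V1" -> "n"))         // first predictor of Eg1
    println(s"Predicted: $label, probabilities: $probs")
  }
}

With V1 = n this predicts republican (85/136 ≈ 0.625 vs. 51/136 ≈ 0.375), matching the counts in the PairCounts elements.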
PMML Scoring for Naïve Bayes
• 3-node Xeon Storm cluster (8 quad-core CPUs, 32 GB RAM, 32 GB swap space; 1 Nimbus, 2 Supervisors)
PMML Scoring for Naïve Bayes
• 3-node Xeon Spark cluster (8 quad-core CPUs, 32 GB RAM and 32 GB swap space)
Logistic Regression: Spark vs. Hadoop (source: http://spark-project.org)
Some Spark(ling) examples
Scala code (serial)

var count = 0
for (i <- 1 to 100000) {
  val x = Math.random * 2 - 1
  val y = Math.random * 2 - 1
  if (x*x + y*y < 1) count += 1
}
println("Pi is roughly " + 4 * count / 100000.0)

Sample random points in the 2x2 square and count how many fall inside the unit circle (roughly π/4 of them). Since points-in-circle / points-in-square ≈ area-of-circle / area-of-square = π/4, we get π ≈ 4 * (count / total).
Some Spark(ling) examples
Spark code (parallel)

val spark = new SparkContext(<Mesos master>)
val count = spark.accumulator(0)
for (i <- spark.parallelize(1 to 100000, 12)) {
  val x = Math.random * 2 - 1
  val y = Math.random * 2 - 1
  if (x*x + y*y < 1) count += 1
}
println("Pi is roughly " + 4 * count.value / 100000.0)

Notable points:
• A Spark context is created; it talks to the Mesos1 master.
• count becomes a shared variable: an accumulator, read on the driver via count.value.
• The for loop runs over an RDD: parallelize breaks the Scala range object (1 to 100000) into 12 slices.
• The for comprehension invokes the foreach method of the RDD.
1 Mesos is an Apache-incubated cluster manager: http://mesosproject.org
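The snippet above reflects Spark's early Mesos-era API. As a point of comparison, here is a sketch of the same computation against the current Spark API, replacing the accumulator with a map/reduce step; the local master URL and app name are illustrative:

import org.apache.spark.{SparkConf, SparkContext}

object SparkPi {
  def main(args: Array[String]): Unit = {
    // illustrative master/app name; on a cluster these come from spark-submit
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("SparkPi"))
    val n = 100000
    // 12 slices, as in the slide's parallelize call
    val count = sc.parallelize(1 to n, 12).map { _ =>
      val x = scala.util.Random.nextDouble() * 2 - 1
      val y = scala.util.Random.nextDouble() * 2 - 1
      if (x * x + y * y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
    sc.stop()
  }
}

Using reduce keeps the computation purely functional; the accumulator version also works, since foreach is an action, but reading the result then requires .value on the driver.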
Logistic Regression in Spark: Serial Code

// Read data file and convert it into Point objects
val lines = scala.io.Source.fromFile("data.txt").getLines()
val points = lines.map(x => parsePoint(x)).toList  // materialize: the iterator would be exhausted after one pass
// Run logistic regression
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  var gradient = Vector.zeros(D)
  for (p <- points) {
    val scale = (1 / (1 + Math.exp(-p.y * (w dot p.x))) - 1) * p.y
    gradient += scale * p.x
  }
  w -= gradient
}
println("Result: " + w)
Logistic Regression in Spark

// Read data file and transform it into Point objects
val spark = new SparkContext(<Mesos master>)
val lines = spark.hdfsTextFile("hdfs://.../data.txt")
val points = lines.map(x => parsePoint(x)).cache()  // keep points in memory across iterations
// Run logistic regression
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  val gradient = spark.accumulator(Vector.zeros(D))
  for (p <- points) {
    val scale = (1 / (1 + Math.exp(-p.y * (w dot p.x))) - 1) * p.y
    gradient += scale * p.x
  }
  w -= gradient.value
}
println("Result: " + w)