NLP in Scala with Breeze and Epic
David Hall, UC Berkeley
ScalaNLP Ecosystem
• Breeze ≈ Numpy/Scipy: linear algebra, scientific computing, optimization
• Epic ≈ PyStruct/NLTK: natural language processing, structured prediction
• Puck: super-fast GPU parser for English
Natural Language Processing
[parse tree with S, VP, NP, and PP nodes over the first sentence]
“Some fruit visionaries say the Fuji could someday tumble the Red Delicious from the top of America's apple heap. It certainly won’t get there on looks.”
Epic for NLP
[parse tree over: “Some fruit visionaries say the Fuji could someday tumble the Red Delicious from the top of America's apple heap.”]
Named Entity Recognition
Labels: Person, Organization, Location, Misc
NER with Epic

    import epic.models.NerSelector

    val nerModel = NerSelector.loadNer("en").get
    val tokens = epic.preprocess.tokenize("Almost 20 years ago, Bill Watterson walked away from \"Calvin & Hobbes.\"")
    println(nerModel.bestSequence(tokens).render("O"))
    // Almost 20 years ago , [PER:Bill Watterson] walked away from `` [LOC:Calvin & Hobbes] . ''
    //                                                                 ^^^ Not a location!
Building an NER system

    val data: IndexedSeq[Segmentation[Label, String]] = ???
    val system = SemiCRF.buildSimple(data, startLabel, outsideLabel)
    println(system.bestSequence(tokens).render("O"))
    // Almost 20 years ago , [PER:Bill Watterson] walked away from `` [MISC:Calvin & Hobbes] . ''
Gazetteers
http://en.wikipedia.org/wiki/List_of_newspaper_comic_strips_A%E2%80%93F
Using your own gazetteer

    val data: IndexedSeq[Segmentation[Label, String]] = ???
    val myGazetteer = ???
    val system = SemiCRF.buildSimple(data, startLabel, outsideLabel, gaz = myGazetteer)
Gazetteers
• Careful with gazetteers!
• If one is built from the training data, the system will rely on it, and only it, to make predictions.
• So only forms already listed in the gazetteer will be detected.
• Still, they can be very useful… (a toy loading sketch follows)
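For illustration, here is a minimal sketch of reading phrase/label pairs into a gazetteer. The file name, the tab-separated format, and the plain Map representation are assumptions for this sketch; Epic's actual gazetteer type and loading API may differ.

    import scala.io.Source

    // Hypothetical sketch: a gazetteer as a map from known phrases to labels,
    // loaded from a file of "phrase<TAB>label" lines. Not Epic's actual API.
    val myGazetteer: Map[IndexedSeq[String], String] =
      Source.fromFile("comic_strips.tsv").getLines().map { line =>
        val Array(phrase, label) = line.split("\t")
        (phrase.split(" ").toIndexedSeq, label)
      }.toMap

    myGazetteer.get(IndexedSeq("Calvin", "&", "Hobbes"))  // Some("MISC"), if listed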
Semi-CR-What?
• Semi-Markov Conditional Random Field
• Don’t worry about the name.
Semi-CRFs
A semi-CRF scores a segmentation as a sum of per-segment scores:

      score(Chez Panisse)
    + score(Berkeley, CA)
    + score(- A bowl of)
    + score(Churchill-Brenneis Orchards)
    + score(Page mandarins and medjool dates)
Features
Each segment’s score is, in turn, the sum of the weights of its active features:

    score(Chez Panisse) = w(starts-with-Chez)
                        + w(starts-with-C…)
                        + w(ends-with-P…)
                        + w(starts-sentence)
                        + w(shape:Xxx Xxx)
                        + w(two-words)
                        + w(in-gazetteer)
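To make that concrete, a toy sketch of scoring a segment by summing its feature weights. The feature names and weight values here are invented for illustration; Epic actually indexes features and uses a weight vector rather than a Map.

    // Toy sketch: a segment's score = sum of the weights of its features.
    // Feature names and weights are made up for illustration.
    val w = Map(
      "starts-with-Chez" -> 1.2,
      "shape:Xxx Xxx"    -> 0.8,
      "two-words"        -> 0.3
    ).withDefaultValue(0.0)

    def segmentScore(features: Seq[String]): Double = features.map(w).sum

    segmentScore(Seq("starts-with-Chez", "shape:Xxx Xxx", "two-words"))  // 2.3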
Building your own features

    val dsl = new WordFeaturizer.DSL[L](counts) with SurfaceFeaturizer.DSL
    import dsl._

    word(begin)              // word at the beginning of the span
      + word(end - 1)        // end of the span
      + word(begin - 1)      // before (gets things like Mr.)
      + word(end)            // one past the end
      + prefixes(begin)      // prefixes up to some length
      + suffixes(begin)
      + length(begin, end)   // names tend to be 1-3 words
      + gazetteer(begin, end)
Using your own featurizer

    val data: IndexedSeq[Segmentation[Label, String]] = ???
    val myFeaturizer = ???
    val system = SemiCRF.buildSimple(data, startLabel, outsideLabel, featurizer = myFeaturizer)
Features
• So far, we’ve been able to do everything with (nearly) no math.
• To understand more, we need to do some math.
Machine Learning Primer
• Training example (x, y)
  • x: a sentence of some sort
  • y: its labeled version
• Goal: we want score(x, y) > score(x, y′) for all y′ ≠ y.
Machine Learning Primer
The score is a linear function of a feature vector:

    score(x, y) = wᵀ f(x, y)      // in math
    score(x, y) = w.t * f(x, y)   // in Breeze
    score(x, y) = w dot f(x, y)   // equivalently
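With Breeze vectors, that score is a one-liner. The weights and features below are made-up values for illustration:

    import breeze.linalg._

    val w = DenseVector(0.5, -1.0, 2.0)  // weights (made-up values)
    val f = DenseVector(1.0,  0.0, 1.0)  // feature vector f(x, y)

    w dot f   // 2.5
    w.t * f   // 2.5, same thing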
Machine Learning Primer
Restating the goal in terms of features and weights:

    score(x, y) ≥ score(x, y′)
    w dot f(x, y) ≥ w dot f(x, y′)
Machine Learning Primer
[vector diagram: the perceptron update moves w toward f(x, y) and away from f(x, y′)]

    (w + f(x, y)) dot f(x, y) ≥ w dot f(x, y)
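A quick numeric check of that inequality: adding f(x, y) to w raises the score of y by f(x, y) dot f(x, y) = ‖f(x, y)‖², which is never negative. The vectors below are made-up values:

    import breeze.linalg._

    val w  = DenseVector(0.0, 1.0)  // current weights (made up)
    val fy = DenseVector(1.0, 2.0)  // f(x, y) for the correct label

    w dot fy         // 2.0
    (w + fy) dot fy  // 7.0 == 2.0 + (fy dot fy), never smaller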
The Perceptron

    val featureIndex: Index[Feature] = ???
    val labelIndex: Index[Label] = ???
    val weights = DenseVector.rand[Double](featureIndex.size)
    for ( epoch <- 0 until numEpochs; (x, y) <- data ) {
      val labelScores = DV.tabulate(labelIndex.size) { yy =>
        val features = featuresFor(x, yy)
        weights.t * new FeatureVector(features)
        // or: weights dot new FeatureVector(features)
      }
      …
    }
The Perceptron (cont’d)

    val featureIndex: Index[Feature] = ???
    val labelIndex: Index[Label] = ???
    val weights = DenseVector.rand[Double](featureIndex.size)
    for ( epoch <- 0 until numEpochs; (x, y) <- data ) {
      val labelScores = ...
      val y_best = argmax(labelScores)
      if (y != y_best) {
        weights += new FeatureVector(featuresFor(x, y))
        weights -= new FeatureVector(featuresFor(x, y_best))
      }
    }
Structured Perceptron
• Can’t enumerate all segmentations! (≈ L · 2ⁿ of them)
• But dynamic programs exist to max or sum (see the sketch below)
• … if the feature function has a nice form.
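A minimal sketch of the kind of dynamic program involved, not Epic's implementation: because the score decomposes over segments, the best segmentation of a prefix depends only on the best segmentations of shorter prefixes, giving O(n² · L) work instead of exponential enumeration. The segScore function is assumed given.

    // Sketch of semi-Markov Viterbi: best(i) = best score of any labeled
    // segmentation of words [0, i).
    def bestScore(n: Int, labels: Seq[String],
                  segScore: (Int, Int, String) => Double): Double = {
      val best = Array.fill(n + 1)(Double.NegativeInfinity)
      best(0) = 0.0
      for (end <- 1 to n; begin <- 0 until end; label <- labels)
        best(end) = best(end) max (best(begin) + segScore(begin, end, label))
      best(n)
    }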
Structured Perceptron

    val featureIndex: Index[Feature] = ???
    val labelIndex: Index[Label] = ???
    val weights = DenseVector.rand[Double](featureIndex.size)
    for ( epoch <- 0 until numEpochs; (x, y) <- data ) {
      val y_best = bestStructure(weights, x)
      if (y != y_best) {
        weights += featuresFor(x, y)
        weights -= featuresFor(x, y_best)
      }
    }
Constituency Parsing
[parse tree over: “Some fruit visionaries say the Fuji could someday tumble the Red Delicious from the top of America's apple heap.”]
Multilingual Parser
“Berkeley” parser: [Petrov & Klein, 2007]; Epic: [Hall, Durrett, and Klein, 2014]
Epic Pre-built Models
• Parsing: English, Basque, French, German, Swedish, Polish, Korean (working on Arabic, Chinese, Spanish)
• Part-of-Speech Tagging: English, Basque, French, German, Swedish, Polish
• Named Entity Recognition: English
• Sentence segmentation: English (ok support for others)
• Tokenization: all of the above languages
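Loading a pre-built parser mirrors the NER example earlier. This sketch assumes a ParserSelector analogous to the NerSelector used above; verify the exact name against Epic's documentation.

    import epic.models.ParserSelector

    // Assumed analogous to NerSelector.loadNer("en") above.
    val parser = ParserSelector.loadParser("en").get
    val tokens = epic.preprocess.tokenize("The Fuji could someday tumble the Red Delicious.")
    println(parser(tokens))  // prints the parse tree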
What Is Breeze?
• Dense Vectors, Matrices, Sparse Vectors, Counters, Matrix Decompositions
What Is Breeze?
• Nonlinear Optimization, Probability Distributions
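As a small taste of those two pieces, a sketch that minimizes f(x) = ‖x − 3‖² with Breeze's LBFGS and samples from a Gaussian. The objective and the numbers are invented for illustration:

    import breeze.linalg._
    import breeze.optimize._
    import breeze.stats.distributions._

    // f(x) = ||x - 3||^2, with gradient 2 (x - 3)
    val f = new DiffFunction[DenseVector[Double]] {
      def calculate(x: DenseVector[Double]) = {
        val diff = x - 3.0
        (diff dot diff, diff * 2.0)
      }
    }

    val lbfgs = new LBFGS[DenseVector[Double]](maxIter = 100, m = 3)
    lbfgs.minimize(f, DenseVector.zeros[Double](3))
    // ≈ DenseVector(3.0, 3.0, 3.0)

    Gaussian(0.0, 1.0).sample(3)  // three draws from a standard normal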
Getting started

    libraryDependencies ++= Seq(
      "org.scalanlp" %% "breeze" % "0.8.1",
      // optional: native linear algebra libraries; bulky, but faster
      "org.scalanlp" %% "breeze-natives" % "0.8.1"
    )

    scalaVersion := "2.11.1"
Linear Algebra

    import breeze.linalg._

    val x = DenseVector.zeros[Int](5)     // DenseVector(0, 0, 0, 0, 0)
    val m = DenseMatrix.zeros[Int](5, 5)
    val r = DenseMatrix.rand(5, 5)

    m.t     // transpose
    x + x   // addition
    m * x   // multiplication by vector
    m * 3   // by scalar
    m * m   // by matrix
    m :* m  // element-wise mult, Matlab .*
Return Type Selection

    import breeze.linalg.{DenseVector => DV}

    val dv = DV(1.0, 2.0)
    val sv = SparseVector.zeros[Double](2)

    dv + sv  // res: DenseVector[Double] = DenseVector(1.0, 2.0)
    sv + sv  // res: SparseVector[Double] = SparseVector()
Return Type Selection

    import breeze.linalg.{DenseVector => DV}

    val dv = DV(1.0, 2.0)
    val sv = SparseVector.zeros[Double](2)

    (dv: Vector[Double]) + (sv: Vector[Double])
    // res: Vector[Double] = DenseVector(1.0, 2.0)  (static: Vector; dynamic: Dense)

    (sv: Vector[Double]) + (sv: Vector[Double])
    // res: Vector[Double] = SparseVector()         (static: Vector; dynamic: Sparse)
Linear Algebra: Slices

    val m = DenseMatrix.zeros[Int](5, 5)

    m(::, 1)   // slice a column: DV(0, 0, 0, 0, 0)
    m(4, ::)   // slice a row
    m(4, ::) := DV(1, 2, 3, 4, 5).t

    m.toString
    // 0 0 0 0 0
    // 0 0 0 0 0
    // 0 0 0 0 0
    // 0 0 0 0 0
    // 1 2 3 4 5
Linear Algebra: Slices

    m(3 to 4, 1 to 2).toString
    // 0 0
    // 2 3

    m(IndexedSeq(3, 1, 4, 2), IndexedSeq(4, 4, 3, 1)).toString
    // 0 0 0 0
    // 0 0 0 0
    // 5 5 4 2
    // 0 0 0 0
Universal Functions

    import breeze.numerics._

    log(DenseVector(1.0, 2.0, 3.0, 4.0))
    // DenseVector(0.0, 0.6931471805599453,
    //             1.0986122886681098, 1.3862943611198906)

    exp(DenseMatrix((1.0, 2.0),
                    (3.0, 4.0)))

    sin(Array(2.0, 3.0, 4.0, 42.0))

    // also cos, sqrt, asin, floor, round, digamma, trigamma, ...
UFuncs: In-Place

    import breeze.numerics._

    val v = DV(1.0, 2.0, 3.0, 4.0)
    log.inPlace(v)
    // v == DenseVector(0.0, 0.6931471805599453,
    //                  1.0986122886681098, 1.3862943611198906)
UFuncs: Reductions

    sum(DenseVector(1.0, 2.0, 3.0))   // 6.0
    sum(DenseVector(1, 2, 3))         // 6
    mean(DenseVector(1.0, 2.0, 3.0))  // 2.0
    mean(DenseMatrix((1.0, 2.0, 3.0),
                     (4.0, 5.0, 6.0)))  // 3.5
UFuncs: Reductions

    val m = DenseMatrix((1.0, 3.0),
                        (4.0, 4.0))

    sum(m)          // 12.0
    mean(m)         // 3.0
    sum(m(*, ::))   // DenseVector(4.0, 8.0)    (row sums)
    sum(m(::, *))   // DenseMatrix((5.0, 7.0))  (column sums)
    mean(m(*, ::))  // DenseVector(2.0, 4.0)
    mean(m(::, *))  // DenseMatrix((2.5, 3.5))
Broadcasting

    val dm = DenseMatrix((1.0, 2.0, 3.0),
                         (4.0, 5.0, 6.0))

    sum(dm(::, *))  // DenseMatrix((5.0, 7.0, 9.0))

    dm(::, *) + DenseVector(3.0, 4.0)
    // DenseMatrix((4.0, 5.0, 6.0), (8.0, 9.0, 10.0))

    dm(::, *) := DenseVector(3.0, 4.0)
    // DenseMatrix((3.0, 3.0, 3.0), (4.0, 4.0, 4.0))
UFuncs: Implementation

    object log extends UFunc

    implicit object logDouble extends log.Impl[Double, Double] {
      def apply(x: Double) = scala.math.log(x)
    }

    log(1.0)  // == 0.0

    // elsewhere: magic to automatically make this work:
    log(DenseVector(1.0))   // == DenseVector(0.0)
    log(DenseMatrix(1.0))   // == DenseMatrix(0.0)
    log(SparseVector(1.0))  // == SparseVector()
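That "magic" lifting is what Breeze's MappingUFunc mixin provides, so a custom element-wise function gets it too. A minimal sketch; relu is my own name for a hypothetical ufunc here, not something this version of Breeze ships:

    import breeze.generic.{MappingUFunc, UFunc}
    import breeze.linalg._

    // Hypothetical custom ufunc: mixing in MappingUFunc lifts the scalar
    // implementation over vectors and matrices automatically.
    object relu extends UFunc with MappingUFunc {
      implicit object reluDouble extends Impl[Double, Double] {
        def apply(x: Double) = math.max(0.0, x)
      }
    }

    relu(-1.5)                    // 0.0
    relu(DenseVector(-1.0, 2.0))  // DenseVector(0.0, 2.0)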