NLP in Scala with Breeze and Epic
David Hall, UC Berkeley
ScalaNLP Ecosystem
• Breeze ≈ Numpy/Scipy: linear algebra, scientific computing, optimization
• Epic ≈ PyStruct/NLTK: natural language processing, structured prediction
• Puck: super-fast GPU parser for English
Natural Language Processing
[parse tree with S, VP, NP, and PP nodes over the first sentence]
“Some fruit visionaries say the Fuji could someday tumble the Red Delicious from the top of America's apple heap. It certainly won’t get there on looks.”
Epic for NLP
[parse tree over: “Some fruit visionaries say the Fuji could someday tumble the Red Delicious from the top of America's apple heap.”]
Named Entity Recognition
Labels: Person, Organization, Location, Misc
NER with Epic

    import epic.models.NerSelector

    val nerModel = NerSelector.loadNer("en").get
    val tokens = epic.preprocess.tokenize("Almost 20 years ago, Bill Watterson walked away from \"Calvin & Hobbes.\"")
    println(nerModel.bestSequence(tokens).render("O"))
    // Almost 20 years ago , [PER:Bill Watterson] walked away from `` [LOC:Calvin & Hobbes] . ''
    //                                                                 ^^^ Not a location!
Building an NER system

    val data: IndexedSeq[Segmentation[Label, String]] = ???
    val system = SemiCRF.buildSimple(data, startLabel, outsideLabel)
    println(system.bestSequence(tokens).render("O"))
    // Almost 20 years ago , [PER:Bill Watterson] walked away from `` [MISC:Calvin & Hobbes] . ''
Gazetteers
http://en.wikipedia.org/wiki/List_of_newspaper_comic_strips_A%E2%80%93F
Using your own gazetteer

    val data: IndexedSeq[Segmentation[Label, String]] = ???
    val myGazetteer = ???
    val system = SemiCRF.buildSimple(data, startLabel, outsideLabel, gaz = myGazetteer)
Gazetteers
• Careful with gazetteers!
• If one is built from the training data, the system will rely on it, and only it, to make predictions.
• So only forms already listed in the gazetteer will be detected.
• Still, they can be very useful… (a toy loading sketch follows)
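For illustration, here is a minimal sketch of reading phrase/label pairs into a gazetteer. The file name, the tab-separated format, and the plain Map representation are assumptions for this sketch; Epic's actual gazetteer type and loading API may differ.

    import scala.io.Source

    // Hypothetical sketch: a gazetteer as a map from known phrases to labels,
    // loaded from a file of "phrase<TAB>label" lines. Not Epic's actual API.
    val myGazetteer: Map[IndexedSeq[String], String] =
      Source.fromFile("comic_strips.tsv").getLines().map { line =>
        val Array(phrase, label) = line.split("\t")
        (phrase.split(" ").toIndexedSeq, label)
      }.toMap

    myGazetteer.get(IndexedSeq("Calvin", "&", "Hobbes"))  // Some("MISC"), if listed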
Semi-CR-What?
• Semi-Markov Conditional Random Field
• Don’t worry about the name.
Semi-CRFs
A semi-CRF scores a segmentation as a sum of per-segment scores:

      score(Chez Panisse)
    + score(Berkeley, CA)
    + score(- A bowl of)
    + score(Churchill-Brenneis Orchards)
    + score(Page mandarins and medjool dates)
Features
Each segment’s score is, in turn, the sum of the weights of its active features:

    score(Chez Panisse) = w(starts-with-Chez)
                        + w(starts-with-C…)
                        + w(ends-with-P…)
                        + w(starts-sentence)
                        + w(shape:Xxx Xxx)
                        + w(two-words)
                        + w(in-gazetteer)
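To make that concrete, a toy sketch of scoring a segment by summing its feature weights. The feature names and weight values here are invented for illustration; Epic actually indexes features and uses a weight vector rather than a Map.

    // Toy sketch: a segment's score = sum of the weights of its features.
    // Feature names and weights are made up for illustration.
    val w = Map(
      "starts-with-Chez" -> 1.2,
      "shape:Xxx Xxx"    -> 0.8,
      "two-words"        -> 0.3
    ).withDefaultValue(0.0)

    def segmentScore(features: Seq[String]): Double = features.map(w).sum

    segmentScore(Seq("starts-with-Chez", "shape:Xxx Xxx", "two-words"))  // 2.3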
Building your own features

    val dsl = new WordFeaturizer.DSL[L](counts) with SurfaceFeaturizer.DSL
    import dsl._

    word(begin)              // word at the beginning of the span
      + word(end - 1)        // end of the span
      + word(begin - 1)      // before (gets things like Mr.)
      + word(end)            // one past the end
      + prefixes(begin)      // prefixes up to some length
      + suffixes(begin)
      + length(begin, end)   // names tend to be 1-3 words
      + gazetteer(begin, end)
Using your own featurizer

    val data: IndexedSeq[Segmentation[Label, String]] = ???
    val myFeaturizer = ???
    val system = SemiCRF.buildSimple(data, startLabel, outsideLabel, featurizer = myFeaturizer)
Features
• So far, we’ve been able to do everything with (nearly) no math.
• To understand more, we need to do some math.
Machine Learning Primer
• Training example (x, y)
  • x: a sentence of some sort
  • y: its labeled version
• Goal: we want score(x, y) > score(x, y′) for all y′ ≠ y.
Machine Learning Primer
The score is a linear function of a feature vector:

    score(x, y) = wᵀ f(x, y)      // in math
    score(x, y) = w.t * f(x, y)   // in Breeze
    score(x, y) = w dot f(x, y)   // equivalently
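With Breeze vectors, that score is a one-liner. The weights and features below are made-up values for illustration:

    import breeze.linalg._

    val w = DenseVector(0.5, -1.0, 2.0)  // weights (made-up values)
    val f = DenseVector(1.0,  0.0, 1.0)  // feature vector f(x, y)

    w dot f   // 2.5
    w.t * f   // 2.5, same thing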
Machine Learning Primer
Restating the goal in terms of features and weights:

    score(x, y) ≥ score(x, y′)
    w dot f(x, y) ≥ w dot f(x, y′)
Machine Learning Primer
[vector diagram: the perceptron update moves w toward f(x, y) and away from f(x, y′)]

    (w + f(x, y)) dot f(x, y) ≥ w dot f(x, y)
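A quick numeric check of that inequality: adding f(x, y) to w raises the score of y by f(x, y) dot f(x, y) = ‖f(x, y)‖², which is never negative. The vectors below are made-up values:

    import breeze.linalg._

    val w  = DenseVector(0.0, 1.0)  // current weights (made up)
    val fy = DenseVector(1.0, 2.0)  // f(x, y) for the correct label

    w dot fy         // 2.0
    (w + fy) dot fy  // 7.0 == 2.0 + (fy dot fy), never smaller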
The Perceptron

    val featureIndex: Index[Feature] = ???
    val labelIndex: Index[Label] = ???
    val weights = DenseVector.rand[Double](featureIndex.size)
    for ( epoch <- 0 until numEpochs; (x, y) <- data ) {
      val labelScores = DV.tabulate(labelIndex.size) { yy =>
        val features = featuresFor(x, yy)
        weights.t * new FeatureVector(features)
        // or: weights dot new FeatureVector(features)
      }
      …
    }
The Perceptron (cont’d)

    val featureIndex: Index[Feature] = ???
    val labelIndex: Index[Label] = ???
    val weights = DenseVector.rand[Double](featureIndex.size)
    for ( epoch <- 0 until numEpochs; (x, y) <- data ) {
      val labelScores = ...
      val y_best = argmax(labelScores)
      if (y != y_best) {
        weights += new FeatureVector(featuresFor(x, y))
        weights -= new FeatureVector(featuresFor(x, y_best))
      }
    }
Structured Perceptron
• Can’t enumerate all segmentations! (≈ L · 2ⁿ of them)
• But dynamic programs exist to max or sum (see the sketch below)
• … if the feature function has a nice form.
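A minimal sketch of the kind of dynamic program involved, not Epic's implementation: because the score decomposes over segments, the best segmentation of a prefix depends only on the best segmentations of shorter prefixes, giving O(n² · L) work instead of exponential enumeration. The segScore function is assumed given.

    // Sketch of semi-Markov Viterbi: best(i) = best score of any labeled
    // segmentation of words [0, i).
    def bestScore(n: Int, labels: Seq[String],
                  segScore: (Int, Int, String) => Double): Double = {
      val best = Array.fill(n + 1)(Double.NegativeInfinity)
      best(0) = 0.0
      for (end <- 1 to n; begin <- 0 until end; label <- labels)
        best(end) = best(end) max (best(begin) + segScore(begin, end, label))
      best(n)
    }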
Structured Perceptron

    val featureIndex: Index[Feature] = ???
    val labelIndex: Index[Label] = ???
    val weights = DenseVector.rand[Double](featureIndex.size)
    for ( epoch <- 0 until numEpochs; (x, y) <- data ) {
      val y_best = bestStructure(weights, x)
      if (y != y_best) {
        weights += featuresFor(x, y)
        weights -= featuresFor(x, y_best)
      }
    }
Constituency Parsing
[parse tree over: “Some fruit visionaries say the Fuji could someday tumble the Red Delicious from the top of America's apple heap.”]
Multilingual Parser
“Berkeley” parser: [Petrov & Klein, 2007]; Epic: [Hall, Durrett, and Klein, 2014]
Epic Pre-built Models
• Parsing: English, Basque, French, German, Swedish, Polish, Korean (working on Arabic, Chinese, Spanish)
• Part-of-Speech Tagging: English, Basque, French, German, Swedish, Polish
• Named Entity Recognition: English
• Sentence segmentation: English (ok support for others)
• Tokenization: all of the above languages
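Loading a pre-built parser mirrors the NER example earlier. This sketch assumes a ParserSelector analogous to the NerSelector used above; verify the exact name against Epic's documentation.

    import epic.models.ParserSelector

    // Assumed analogous to NerSelector.loadNer("en") above.
    val parser = ParserSelector.loadParser("en").get
    val tokens = epic.preprocess.tokenize("The Fuji could someday tumble the Red Delicious.")
    println(parser(tokens))  // prints the parse tree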
What Is Breeze?
• Dense Vectors, Matrices, Sparse Vectors, Counters, Matrix Decompositions
What Is Breeze?
• Nonlinear Optimization, Probability Distributions
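As a small taste of those two pieces, a sketch that minimizes f(x) = ‖x − 3‖² with Breeze's LBFGS and samples from a Gaussian. The objective and the numbers are invented for illustration:

    import breeze.linalg._
    import breeze.optimize._
    import breeze.stats.distributions._

    // f(x) = ||x - 3||^2, with gradient 2 (x - 3)
    val f = new DiffFunction[DenseVector[Double]] {
      def calculate(x: DenseVector[Double]) = {
        val diff = x - 3.0
        (diff dot diff, diff * 2.0)
      }
    }

    val lbfgs = new LBFGS[DenseVector[Double]](maxIter = 100, m = 3)
    lbfgs.minimize(f, DenseVector.zeros[Double](3))
    // ≈ DenseVector(3.0, 3.0, 3.0)

    Gaussian(0.0, 1.0).sample(3)  // three draws from a standard normal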
Getting started

    libraryDependencies ++= Seq(
      "org.scalanlp" %% "breeze" % "0.8.1",
      // optional: native linear algebra libraries; bulky, but faster
      "org.scalanlp" %% "breeze-natives" % "0.8.1"
    )

    scalaVersion := "2.11.1"
Linear Algebra

    import breeze.linalg._

    val x = DenseVector.zeros[Int](5)     // DenseVector(0, 0, 0, 0, 0)
    val m = DenseMatrix.zeros[Int](5, 5)
    val r = DenseMatrix.rand(5, 5)

    m.t     // transpose
    x + x   // addition
    m * x   // multiplication by vector
    m * 3   // by scalar
    m * m   // by matrix
    m :* m  // element-wise mult, Matlab .*
Return Type Selection

    import breeze.linalg.{DenseVector => DV}

    val dv = DV(1.0, 2.0)
    val sv = SparseVector.zeros[Double](2)

    dv + sv  // res: DenseVector[Double] = DenseVector(1.0, 2.0)
    sv + sv  // res: SparseVector[Double] = SparseVector()
Return Type Selection

    import breeze.linalg.{DenseVector => DV}

    val dv = DV(1.0, 2.0)
    val sv = SparseVector.zeros[Double](2)

    (dv: Vector[Double]) + (sv: Vector[Double])
    // res: Vector[Double] = DenseVector(1.0, 2.0)  (static: Vector; dynamic: Dense)

    (sv: Vector[Double]) + (sv: Vector[Double])
    // res: Vector[Double] = SparseVector()         (static: Vector; dynamic: Sparse)
Linear Algebra: Slices

    val m = DenseMatrix.zeros[Int](5, 5)

    m(::, 1)   // slice a column: DV(0, 0, 0, 0, 0)
    m(4, ::)   // slice a row
    m(4, ::) := DV(1, 2, 3, 4, 5).t

    m.toString
    // 0 0 0 0 0
    // 0 0 0 0 0
    // 0 0 0 0 0
    // 0 0 0 0 0
    // 1 2 3 4 5
Linear Algebra: Slices

    m(3 to 4, 1 to 2).toString
    // 0 0
    // 2 3

    m(IndexedSeq(3, 1, 4, 2), IndexedSeq(4, 4, 3, 1)).toString
    // 0 0 0 0
    // 0 0 0 0
    // 5 5 4 2
    // 0 0 0 0
Universal Functions

    import breeze.numerics._

    log(DenseVector(1.0, 2.0, 3.0, 4.0))
    // DenseVector(0.0, 0.6931471805599453,
    //             1.0986122886681098, 1.3862943611198906)

    exp(DenseMatrix((1.0, 2.0),
                    (3.0, 4.0)))

    sin(Array(2.0, 3.0, 4.0, 42.0))

    // also cos, sqrt, asin, floor, round, digamma, trigamma, ...
UFuncs: In-Place

    import breeze.numerics._

    val v = DV(1.0, 2.0, 3.0, 4.0)
    log.inPlace(v)
    // v == DenseVector(0.0, 0.6931471805599453,
    //                  1.0986122886681098, 1.3862943611198906)
UFuncs: Reductions

    sum(DenseVector(1.0, 2.0, 3.0))   // 6.0
    sum(DenseVector(1, 2, 3))         // 6
    mean(DenseVector(1.0, 2.0, 3.0))  // 2.0
    mean(DenseMatrix((1.0, 2.0, 3.0),
                     (4.0, 5.0, 6.0)))  // 3.5
UFuncs: Reductions

    val m = DenseMatrix((1.0, 3.0),
                        (4.0, 4.0))

    sum(m)          // 12.0
    mean(m)         // 3.0
    sum(m(*, ::))   // DenseVector(4.0, 8.0)    (row sums)
    sum(m(::, *))   // DenseMatrix((5.0, 7.0))  (column sums)
    mean(m(*, ::))  // DenseVector(2.0, 4.0)
    mean(m(::, *))  // DenseMatrix((2.5, 3.5))
Broadcasting

    val dm = DenseMatrix((1.0, 2.0, 3.0),
                         (4.0, 5.0, 6.0))

    sum(dm(::, *))  // DenseMatrix((5.0, 7.0, 9.0))

    dm(::, *) + DenseVector(3.0, 4.0)
    // DenseMatrix((4.0, 5.0, 6.0), (8.0, 9.0, 10.0))

    dm(::, *) := DenseVector(3.0, 4.0)
    // DenseMatrix((3.0, 3.0, 3.0), (4.0, 4.0, 4.0))
UFuncs: Implementation

    object log extends UFunc

    implicit object logDouble extends log.Impl[Double, Double] {
      def apply(x: Double) = scala.math.log(x)
    }

    log(1.0)  // == 0.0

    // elsewhere: magic to automatically make this work:
    log(DenseVector(1.0))   // == DenseVector(0.0)
    log(DenseMatrix(1.0))   // == DenseMatrix(0.0)
    log(SparseVector(1.0))  // == SparseVector()
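That "magic" lifting is what Breeze's MappingUFunc mixin provides, so a custom element-wise function gets it too. A minimal sketch; relu is my own name for a hypothetical ufunc here, not something this version of Breeze ships:

    import breeze.generic.{MappingUFunc, UFunc}
    import breeze.linalg._

    // Hypothetical custom ufunc: mixing in MappingUFunc lifts the scalar
    // implementation over vectors and matrices automatically.
    object relu extends UFunc with MappingUFunc {
      implicit object reluDouble extends Impl[Double, Double] {
        def apply(x: Double) = math.max(0.0, x)
      }
    }

    relu(-1.5)                    // 0.0
    relu(DenseVector(-1.0, 2.0))  // DenseVector(0.0, 2.0)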