Machine Learning for Natural Language Processing. Seminar Information Extraktion. Wednesday, 6th November 2013. Robin Tibor Schirrmeister
Outline • Informal definition with example • Five machine learning algorithms • Motivation and main idea • Example • Training and classification • Usage and assumptions • Types of machine learning • Improvement of machine learning systems
Informal Definition • Machine learning means a program gets better by automatically learning from data • Let's see what this means by looking at an example
Named Entity Recognition Example • How do we recognize mentions of persons in a text? • Angela Merkel made a decision. • Tiger Woods played golf. • Tiger ran into the woods. • We could use handwritten rules, but it might take a lot of time to define all rules and their interactions… So how can we learn the rules and their interactions?
Person Name Features • Create features for your ML algorithm • For a machine to learn whether a word is a name, you have to define attributes or features that might indicate a name • The word itself • The word is a known noun like hammer, boat, tiger • The word is capitalized or not • Part of speech (verb, noun etc.) • … • A human has to design these features!
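A minimal sketch of such hand-designed features in Python (the noun list and the part-of-speech input are hypothetical stand-ins for a real lexicon and tagger):

```python
# Hypothetical feature extractor for the features listed above.
KNOWN_NOUNS = {"hammer", "boat", "tiger"}  # stand-in for a real lexicon

def extract_features(word, pos_tag):
    return {
        "word": word.lower(),                          # the word itself
        "is_known_noun": word.lower() in KNOWN_NOUNS,  # known noun?
        "is_capitalized": word[0].isupper(),           # capitalized or not
        "pos": pos_tag,                                # part of speech, from an external tagger
    }

print(extract_features("Tiger", "NNP"))
# {'word': 'tiger', 'is_known_noun': True, 'is_capitalized': True, 'pos': 'NNP'}
```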
Person Name Features • The ML algorithm learns typical features • What values of these features are typical for a person name? • Michael, Ahmed, Ina • The word is not a known noun • The word is capitalized • Proper noun • The machine should learn these typical values automatically!
Person Name Training Set • Create a training set • It contains words that are person names and words that are not person names
Naive Bayes Training • For each feature value, count how frequently it occurs in person names and in other words. • This tells you how likely it is that a random person name is capitalized. How could a person name not be capitalized? Think of noisy text like tweets.
Naive Bayes Training • Count the total frequency of the classes • Count how frequent person names are in general: how many words are part of person names and how many are not • Do the whole training the same way for words that are not person names
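As a rough sketch, the counting step could look like this (toy data; the Laplace smoothing is an added assumption to avoid zero probabilities, not something from the slides):

```python
from collections import Counter, defaultdict

# Toy training set: (features, class) pairs with classes "name" / "other".
training_data = [
    ({"is_capitalized": True,  "is_known_noun": False}, "name"),
    ({"is_capitalized": True,  "is_known_noun": True},  "other"),
    ({"is_capitalized": False, "is_known_noun": True},  "other"),
]

class_counts = Counter()               # how frequent each class is in general
feature_counts = defaultdict(Counter)  # (class, feature) -> counts per value

for features, label in training_data:
    class_counts[label] += 1
    for feature, value in features.items():
        feature_counts[(label, feature)][value] += 1

def p_class(label):
    return class_counts[label] / sum(class_counts.values())

def p_value_given_class(feature, value, label, smoothing=1.0):
    counts = feature_counts[(label, feature)]
    # Laplace smoothing over the two boolean values (added assumption)
    return (counts[value] + smoothing) / (sum(counts.values()) + 2 * smoothing)
```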
Naive Bayes Classification Example • Let's try to classify Tiger in Tiger Woods played golf. • Assume 5% of words are parts of person names
Naive Bayes Classification • Classify a word by multiplying the class probability with the feature value probabilities: $\text{score}(c) = P(c) \cdot \prod_i P(f_i = v_i \mid c)$, where $v_i$ is the value of the i-th feature • The higher score wins!
Naive Bayes Overview • Training • For all classes • For all features and all possible feature values • Compute $P(f_i = v \mid c)$ (e.g. the chance that a word is capitalized if it's a person name) • Compute the total class probability $P(c)$ • Classification • For all classes, compute the class score as $P(c) \cdot \prod_i P(f_i = v_i \mid c)$ • A data point is classified by the class with the highest score
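Continuing the counting sketch from above, classification then multiplies the probabilities per class and picks the winner:

```python
def classify(features):
    best_label, best_score = None, 0.0
    for label in class_counts:
        score = p_class(label)                 # total class probability
        for feature, value in features.items():
            score *= p_value_given_class(feature, value, label)
        if score > best_score:                 # highest score wins
            best_label, best_score = label, score
    return best_label

# "Tiger" in "Tiger Woods played golf.": capitalized and a known noun
print(classify({"is_capitalized": True, "is_known_noun": True}))
```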
Naive Bayes Assumptions • The probability of one feature is independent of another feature when we know the class • When we know a word is part of a person name, the probability of capitalization is independent of the probability that the word is a known noun • This is not completely true: if the word is tiger, we already know it is a known noun • That's why it's called Naive Bayes • Naive Bayes often classifies well even if the assumption is violated!
Evaluation • Evaluate performance on a test set • Correct classification rate on a test set of words that were not used in training • The correct classification rate is not necessarily the most informative measure… • If we classify every word as Other and only 5% of the words are person names, we get a 95% classification rate! • More informative measures exist • Words correctly and incorrectly classified as person names (true positives, false positives) • and as others (true negatives, false negatives)
Evaluation Metric • The best performance measure is in part subjective • Recall: maybe you want to capture all persons occurring in the text, even at the cost of some non-persons • E.g. if you want to capture all persons mentioned in connection with a crime • Precision: you only want to capture words that are definitely persons • E.g. if you want to build a reliable list of talked-about persons
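The standard definitions behind these two measures (not spelled out on the slide): precision = TP / (TP + FP) and recall = TP / (TP + FN). As a tiny sketch with made-up counts:

```python
def precision(tp, fp):
    # Of all words we predicted as person names, how many really are?
    return tp / (tp + fp)

def recall(tp, fn):
    # Of all true person names, how many did we find?
    return tp / (tp + fn)

print(precision(tp=40, fp=10))  # 0.8
print(recall(tp=40, fn=20))     # ~0.67
```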
Interpretation • The feature value probabilities show each feature's contribution to the classification • Comparing the trained values $P(f_i = v \mid \text{class}_1)$ and $P(f_i = v \mid \text{class}_2)$ tells you whether this feature value is more likely for class 1 or class 2 • $P(\text{capitalized} \mid \text{name}) > P(\text{capitalized} \mid \text{other})$ means capitalized words are more likely to be parts of person names • You can look at each feature independently
Machine Learning System Overview • Data acquisition (build a training set and a test set): important to get data similar to the data you will classify! • Data representation as features: important to have the information in the features that allows you to classify • Machine learning algorithm training: important to use an algorithm whose assumptions fit the data well enough • Performance evaluation (evaluate on the test set): important to know what measure of quality you are interested in
Logistic Regression Motivation • Correlated features might disturb our classification • Tiger is always a known noun • Both features (known noun, word tiger) indicate that it's not a name • Since Naive Bayes ignores that the word tiger already determines that it is a known noun, it will underestimate the chance of Tiger being a name • Modelling the relation from combinations of feature values to the class more directly might help
Logistic Regression Idea • Idea: learn the weights together, not separately • Make all features numerical • Instead of part of speech = verb, noun etc.: • one feature for verb which is 0 or 1 • one feature for noun which is 0 or 1, etc. • Then you can take sums of these feature values * weights and learn the weights • The sum should be very high for person names and very small for non-person names • The weights will indicate how strongly a feature value indicates a person name • Correlated features can get appropriate, not too high weights, because they are learned together!
Logistic Regression • Estimate the probability for a class • Use the sum of a linear function chained to a link function: $P(\text{name} \mid x) = \frac{1}{1 + e^{-(w_0 + \sum_i w_i x_i)}}$ • The link function (an S-shaped curve between 0 and 1) rises sharply around the class boundary
Example • Let's look at Tiger Woods played golf. • Assume we learned weights such that the weighted feature sum for Tiger, passed through the link function, gives 0.62 • 0.62 > 0.5 => looks more like a name • 0.62 can be interpreted as a 62% probability that it's a name
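A minimal sketch of this scoring step; the weights here are made up (the slide's actual learned weights are not preserved), chosen only so the output matches the 0.62 above:

```python
import math

# Hypothetical learned weights for a few features of "Tiger"
weights = {"bias": -1.0, "is_capitalized": 2.0,
           "is_known_noun": -1.5, "next_word_capitalized": 1.0}

def p_name(features):
    # Sum of weights * feature values, chained into the logistic link function
    z = weights["bias"] + sum(weights[f] * v for f, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# "Tiger" in "Tiger Woods played golf.": capitalized, a known noun,
# and followed by another capitalized word ("Woods")
print(round(p_name({"is_capitalized": 1, "is_known_noun": 1,
                    "next_word_capitalized": 1}), 2))  # 0.62
```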
Training • Solve the equation system iteratively • From our training examples we get a system of equations, using the known labels: • $P(\text{name} \mid x^{(1)}) = 0$ • $P(\text{name} \mid x^{(2)}) = 1$ • … • The best fit cannot be computed directly; it is solved by iterative procedures (not our topic here) • You just have to know that the weights are estimated together!
Interpretation • Higher weights mean the probability of yes (1) is increased by the corresponding feature • Weights have to be interpreted together (no conditional independence is assumed) • Suppose we have the feature word preceded by Dr. and another feature word preceded by Prof. Dr. • If in all our texts there is only Prof. Dr., both features will always have the same value! • Then, e.g., weights of 2 and 0 lead to the same predictions as weights of 0 and 2 • Also, the weights are affected by how big and how small the feature value range is
Naive Bayes vs. Logistic Regression • Ng and Jordan (2002) • Logistic regression is better with more data, Naive Bayes is better with less data • Naive Bayes reaches its optimum faster • Logistic regression has better optimal classification
Support Vector Machines • (Figure: data points plotted by two features, animal and plant words in sentence vs. sports references in sentence, with Tiger as Animal and Tiger as Name examples) • A Support Vector Machine tries to separate true and false examples by a big boundary
Training • Soft margin for inseparable data • In practice, examples are usually not perfectly separable • A soft margin allows for wrong classifications • A parameter adjusts the tradeoff between: • data points should be on the correct side of the boundary and outside of the margin • the margin should be big • Specialized optimization algorithms for SVMs exist
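A short sketch with scikit-learn (assuming it is available; the feature vectors and labels are toy stand-ins). The parameter C is the soft-margin tradeoff from this slide: a small C favors a big margin, a large C favors putting points on the correct side:

```python
from sklearn.svm import SVC

# Toy features: [animal/plant words in sentence, sports references in sentence]
X = [[3, 0], [2, 1], [0, 3], [1, 2]]
y = ["animal", "animal", "name", "name"]

clf = SVC(kernel="linear", C=1.0)  # C adjusts the soft-margin tradeoff
clf.fit(X, y)

print(clf.predict([[0, 4]]))       # a sentence full of sports references
print(clf.support_vectors_)        # only these points define the boundary
```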
Usage • SVMs are often used very successfully, very popular • Very robust and fast; only the support vectors are needed for the classification • => robust against missing data
Interpretation • (Figure: the same feature plot with the separating hyperplane drawn in) • The hyperplane can tell you which features matter more for the classification
Decision Trees • (Figure: example tree that first splits on the word (tiger, spears, michael), then on sports references < 2 / >= 2 and on capitalization yes/no; leaves are labeled Name or Other, one leaf 90% Name) • Recursively separate the data by features that split it well into the different classes
Decision Trees Training • Start with all training examples at the root node • Pick a feature to split the training examples into the next subtrees • Pick a feature such that the training examples in each subtree are mostly from one class • Recursively repeat the procedure on the subtrees • Finished when a subtree only contains examples from one class (convert it to a leaf, e.g. name) • or mostly examples from one class (using some predefined threshold)
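A short scikit-learn sketch of this training procedure (toy data; the library picks the splitting feature by an impurity criterion such as Gini, which is one concrete version of "mostly from one class"):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy features: [word length, is capitalized, sports references in sentence]
X = [[5, 1, 2], [6, 1, 3], [5, 0, 0], [11, 1, 0]]
y = ["name", "name", "other", "other"]

tree = DecisionTreeClassifier(max_depth=3)
tree.fit(X, y)

# Print the learned splits, which is what makes single trees easy to interpret
print(export_text(tree, feature_names=["length", "capitalized", "sports_refs"]))
```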
Decision Trees Usage • Useful especially if you assume some feature interactions • Also useful for some non-linear relationships of features to the classification: • word shorter than 3 characters: unlikely to be a name • word between 3 and 10 characters: might be a name • word longer than 10 characters: unlikely to be a name • Often many trees are used together as forests (ensemble methods) • The learning of a single tree is very clear to interpret • For forests, methods exist to determine feature importance
Conditional Random Fields Motivation • Compare: • Tiger Woods played golf. • Tiger ran into the woods. • We still want to know for both occurrences of Tiger whether it is a name. There is one helpful characteristic of these sentences we did not use. Can you guess what it is? … • Tiger and Woods could both be names, and two parts of a name standing together are more likely than one part of a name by itself.
Conditional Random Fields Idea • To classify a data point, use the surrounding classifications and data points • E.g. use the fact that names often stand together • (Sequential) input -> sequential output • We only use neighbouring classifications (linear-chain CRFs)
Conditional Random Fields Sketch • (Figure: a linear chain of output class nodes above the input words; feature functions produce a score for each value of an output node) • Use feature functions to determine the probability of a class sequence
Feature Functions • Feature functions for linear-chain CRFs can use: • the complete sentence • the current position (word) in the sentence • the class of the output node before • the class of the current output node • They return a real value • Each feature function is multiplied by a weight that needs to be learned
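A sketch of two such feature functions in Python (the signatures are illustrative, not taken from a specific CRF library):

```python
# Each feature function sees the complete sentence, the current position,
# the class of the output node before and the class of the current output
# node, and returns a real value (here simply 0 or 1).

def f_capitalized_name(sentence, i, prev_class, curr_class):
    # Fires if the current word is capitalized and tagged as a name
    return 1.0 if sentence[i][0].isupper() and curr_class == "NAME" else 0.0

def f_name_follows_name(sentence, i, prev_class, curr_class):
    # Fires if two name parts stand together, as in "Tiger Woods"
    return 1.0 if prev_class == "NAME" and curr_class == "NAME" else 0.0

sentence = ["Tiger", "Woods", "played", "golf"]
print(f_name_follows_name(sentence, 1, "NAME", "NAME"))  # 1.0
```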
Examples • Washington Post wrote about Mark Post. • City + Post is usually a newspaper; first name + Post is more likely to be a name • Dr. Woods met his client. • A salutation (Mr./Dr. etc.) is usually followed by a name • A feature function does not have to use all inputs • E.g. a feature function can just look at whether the word is capitalized, what the part of speech of the next word is, etc.
Usage • Define feature functions • Learn weights for the feature functions • Classify: • find the sequence that maximizes the sum of weights * feature functions • this can be done in polynomial time with dynamic programming • Used a lot for NLP tasks like named entity recognition and part-of-speech tagging in noisy text
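A minimal sketch of the dynamic-programming search (Viterbi), assuming the weighted feature functions have already been collapsed into per-position scores and transition scores (a simplification to keep the example short):

```python
def viterbi(scores, trans, classes):
    # scores[i][c]: summed weight * feature function values for class c at word i
    # trans[(a, b)]: score for class a being followed by class b
    best = {c: (scores[0][c], [c]) for c in classes}  # score and path per class
    for i in range(1, len(scores)):
        layer = {}
        for c in classes:
            # Best previous class to transition from, by total score
            prev_c, (prev_score, path) = max(
                best.items(), key=lambda kv: kv[1][0] + trans[(kv[0], c)])
            layer[c] = (prev_score + trans[(prev_c, c)] + scores[i][c],
                        path + [c])
        best = layer
    return max(best.values(), key=lambda sp: sp[0])[1]  # best class sequence

classes = ["NAME", "OTHER"]
scores = [{"NAME": 2.0, "OTHER": 0.5}, {"NAME": 1.0, "OTHER": 1.2}]
trans = {("NAME", "NAME"): 1.0, ("NAME", "OTHER"): 0.0,
         ("OTHER", "NAME"): 0.0, ("OTHER", "OTHER"): 0.5}
print(viterbi(scores, trans, classes))  # ['NAME', 'NAME']
```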
Algorithm Characteristics • Assumptions on the data • linear / non-linear relationship to the output • Interpretability of the learning • meaning of feature weights etc. • Computational time and space • Type of input and output • categorical, numerical • single data points, sequences
Supervised/Unsupervised • Unsupervised algorithms are for learning without known classes • The algorithms so far were supervised algorithms • We had preclassified training data • Sometimes we might need unsupervised algorithms • E.g.: what kinds of topics are covered in news articles right now? • They often work by clustering • Similar data gets assigned the same class • E.g. texts with similar words may refer to the same news topic
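A short clustering sketch with scikit-learn (toy texts; the number of clusters is an assumption you have to pick):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

texts = ["election vote party", "party wins election",
         "match goal striker", "striker scores goal"]
X = CountVectorizer().fit_transform(texts)  # represent texts as word counts

# Cluster into 2 topics: texts with similar words land in the same cluster
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
print(kmeans.fit_predict(X))  # e.g. [0 0 1 1]: politics vs. sports
```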
Semi-Supervised • Create more training data automatically • A big amount of training data is important for good classification • Creating training data by hand is time-demanding • Your unsupervised algorithm already gives you data points with classes • Other simple rules can also give you training data • E.g. Dr. is almost always followed by a name • New data you classified with high confidence can also be used as training data
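A self-training sketch of the last point (the helper assumes an sklearn-style classifier with predict_proba; the 0.9 confidence threshold is an arbitrary assumption):

```python
def self_train(clf, X_train, y_train, X_unlabeled, threshold=0.9):
    # Train on the hand-labeled data first
    clf.fit(X_train, y_train)
    # Add high-confidence classifications as new training data
    for x, probs in zip(X_unlabeled, clf.predict_proba(X_unlabeled)):
        if probs.max() >= threshold:
            X_train.append(x)
            y_train.append(clf.classes_[probs.argmax()])
    # Retrain on the enlarged training set
    return clf.fit(X_train, y_train)
```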
Improvement of an ML System • It is important to know how, and whether, you can improve your machine learning system • Maybe in your overall (NLP) system there are bigger sources of error than your ML system • Maybe it is impossible to learn more from the current data than your algorithm does • You can try to: • get more data • use different features; maybe also preprocess more • use different algorithms; also different combinations
Machine Learning in NLP • Very widely used • Makes it easier to create systems that deal with new/noisy text • For example tweets, or free text in medical records • It can be easier to specify features that may be important and learn the classification automatically than to write all rules by hand
Summary • Typical machine learning consists of data acquisition, feature design, algorithm training and performance evaluation • Many algorithms exist, with different assumptions on the data • It is important to know whether your assumptions match your data • It is important to know what the goal of your overall system is
Helpful Resources • Wikipedia • Course explaining ML, including logistic regression and SVMs • Another similar one, slides freely available • Lecture about Naive Bayes for document classification • Intuitive introduction to CRFs • Guide to choosing an ML classifier
References • Ng, Andrew Y., and Michael I. Jordan. "On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes." Advances in Neural Information Processing Systems 14 (2002).