Machine Learning for Natural Language Processing. Seminar Information Extraktion. Wednesday, 6th November 2013. Robin Tibor Schirrmeister
Outline • Informal definition with example • Five machine learning algorithms • Motivation and main idea • Example • Training and classification • Usage and assumptions • Types of machine learning • Improvement of machine learning systems
Informal Definition • Machine learning means a program gets better by automatically learning from data • Let's see what this means by looking at an example
Named Entity Recognition Example • How do we recognize mentions of persons in a text? • Angela Merkel made a decision. • Tiger Woods played golf. • Tiger ran into the woods. • We could use handwritten rules, but it might take a lot of time to define all rules and their interactions… So how can we learn the rules and their interactions?
Person Name Features • Create features for your ML algorithm • For a machine to learn whether a word is a name, you have to define attributes or features that might indicate a name • The word itself • The word is a known noun like hammer, boat, tiger • The word is capitalized or not • Part of speech (verb, noun etc.) • … • A human has to design these features!
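A minimal sketch of such hand-designed features in Python (the noun list and the part-of-speech input are hypothetical stand-ins for a real lexicon and tagger):

```python
# Hypothetical feature extractor for the features listed above.
KNOWN_NOUNS = {"hammer", "boat", "tiger"}  # stand-in for a real lexicon

def extract_features(word, pos_tag):
    return {
        "word": word.lower(),                          # the word itself
        "is_known_noun": word.lower() in KNOWN_NOUNS,  # known noun?
        "is_capitalized": word[0].isupper(),           # capitalized or not
        "pos": pos_tag,                                # part of speech, from an external tagger
    }

print(extract_features("Tiger", "NNP"))
# {'word': 'tiger', 'is_known_noun': True, 'is_capitalized': True, 'pos': 'NNP'}
```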
Person Name Features • The ML algorithm learns typical features • What values of these features are typical for a person name? • Michael, Ahmed, Ina • The word is not a known noun • The word is capitalized • Proper noun • The machine should learn these typical values automatically!
Person Name Training Set • Create a training set • It contains words that are person names and words that are not person names
Naive Bayes Training • For each feature value, count how frequently it occurs in person names and in other words. • This tells you how likely it is that a random person name is capitalized. How could a person name not be capitalized? Think of noisy text like tweets.
Naive Bayes Training • Count the total frequency of the classes • Count how frequent person names are in general: how many words are part of person names and how many are not • Do the whole training the same way for words that are not person names
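As a rough sketch, the counting step could look like this (toy data; the Laplace smoothing is an added assumption to avoid zero probabilities, not something from the slides):

```python
from collections import Counter, defaultdict

# Toy training set: (features, class) pairs with classes "name" / "other".
training_data = [
    ({"is_capitalized": True,  "is_known_noun": False}, "name"),
    ({"is_capitalized": True,  "is_known_noun": True},  "other"),
    ({"is_capitalized": False, "is_known_noun": True},  "other"),
]

class_counts = Counter()               # how frequent each class is in general
feature_counts = defaultdict(Counter)  # (class, feature) -> counts per value

for features, label in training_data:
    class_counts[label] += 1
    for feature, value in features.items():
        feature_counts[(label, feature)][value] += 1

def p_class(label):
    return class_counts[label] / sum(class_counts.values())

def p_value_given_class(feature, value, label, smoothing=1.0):
    counts = feature_counts[(label, feature)]
    # Laplace smoothing over the two boolean values (added assumption)
    return (counts[value] + smoothing) / (sum(counts.values()) + 2 * smoothing)
```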
Naive Bayes Classification Example • Let's try to classify Tiger in Tiger Woods played golf. • Assume 5% of words are parts of person names
Naive Bayes Classification • Classify a word by multiplying the class probability with the feature value probabilities: $\text{score}(c) = P(c) \cdot \prod_i P(f_i = v_i \mid c)$, where $v_i$ is the value of the i-th feature • The higher score wins!
Naive Bayes Overview • Training • For all classes • For all features and all possible feature values • Compute $P(f_i = v \mid c)$ (e.g. the chance that a word is capitalized if it's a person name) • Compute the total class probability $P(c)$ • Classification • For all classes, compute the class score as $P(c) \cdot \prod_i P(f_i = v_i \mid c)$ • A data point is classified by the class with the highest score
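Continuing the counting sketch from above, classification then multiplies the probabilities per class and picks the winner:

```python
def classify(features):
    best_label, best_score = None, 0.0
    for label in class_counts:
        score = p_class(label)                 # total class probability
        for feature, value in features.items():
            score *= p_value_given_class(feature, value, label)
        if score > best_score:                 # highest score wins
            best_label, best_score = label, score
    return best_label

# "Tiger" in "Tiger Woods played golf.": capitalized and a known noun
print(classify({"is_capitalized": True, "is_known_noun": True}))
```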
Naive Bayes Assumptions • The probability of one feature is independent of another feature when we know the class • When we know a word is part of a person name, the probability of capitalization is independent of the probability that the word is a known noun • This is not completely true: if the word is tiger, we already know it is a known noun • That's why it's called Naive Bayes • Naive Bayes often classifies well even if the assumption is violated!
Evaluation • Evaluate performance on a test set • Correct classification rate on a test set of words that were not used in training • The correct classification rate is not necessarily the most informative measure… • If we classify every word as Other and only 5% of the words are person names, we get a 95% classification rate! • More informative measures exist • Words correctly and incorrectly classified as person names (true positives, false positives) • and as others (true negatives, false negatives)
Evaluation Metric • The best performance measure is in part subjective • Recall: maybe you want to capture all persons occurring in the text, even at the cost of some non-persons • E.g. if you want to capture all persons mentioned in connection with a crime • Precision: you only want to capture words that are definitely persons • E.g. if you want to build a reliable list of talked-about persons
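The standard definitions behind these two measures (not spelled out on the slide): precision = TP / (TP + FP) and recall = TP / (TP + FN). As a tiny sketch with made-up counts:

```python
def precision(tp, fp):
    # Of all words we predicted as person names, how many really are?
    return tp / (tp + fp)

def recall(tp, fn):
    # Of all true person names, how many did we find?
    return tp / (tp + fn)

print(precision(tp=40, fp=10))  # 0.8
print(recall(tp=40, fn=20))     # ~0.67
```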
Interpretation • The feature value probabilities show each feature's contribution to the classification • Comparing the trained values $P(f_i = v \mid \text{class}_1)$ and $P(f_i = v \mid \text{class}_2)$ tells you whether this feature value is more likely for class 1 or class 2 • $P(\text{capitalized} \mid \text{name}) > P(\text{capitalized} \mid \text{other})$ means capitalized words are more likely to be parts of person names • You can look at each feature independently
Machine Learning System Overview • Data acquisition (build a training set and a test set): important to get data similar to the data you will classify! • Data representation as features: important to have the information in the features that allows you to classify • Machine learning algorithm training: important to use an algorithm whose assumptions fit the data well enough • Performance evaluation (evaluate on the test set): important to know what measure of quality you are interested in
Logistic Regression Motivation • Correlated features might disturb our classification • Tiger is always a known noun • Both features (known noun, word tiger) indicate that it's not a name • Since Naive Bayes ignores that the word tiger already determines that it is a known noun, it will underestimate the chance of Tiger being a name • Modelling the relation from combinations of feature values to the class more directly might help
Logistic Regression Idea • Idea: learn the weights together, not separately • Make all features numerical • Instead of part of speech = verb, noun etc.: • one feature for verb which is 0 or 1 • one feature for noun which is 0 or 1, etc. • Then you can take sums of these feature values * weights and learn the weights • The sum should be very high for person names and very small for non-person names • The weights will indicate how strongly a feature value indicates a person name • Correlated features can get appropriate, not too high weights, because they are learned together!
Logistic Regression • Estimate the probability for a class • Use the sum of a linear function chained to a link function: $P(\text{name} \mid x) = \frac{1}{1 + e^{-(w_0 + \sum_i w_i x_i)}}$ • The link function (an S-shaped curve between 0 and 1) rises sharply around the class boundary
Example • Let's look at Tiger Woods played golf. • Assume we learned weights such that the weighted feature sum for Tiger, passed through the link function, gives 0.62 • 0.62 > 0.5 => looks more like a name • 0.62 can be interpreted as a 62% probability that it's a name
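A minimal sketch of this scoring step; the weights here are made up (the slide's actual learned weights are not preserved), chosen only so the output matches the 0.62 above:

```python
import math

# Hypothetical learned weights for a few features of "Tiger"
weights = {"bias": -1.0, "is_capitalized": 2.0,
           "is_known_noun": -1.5, "next_word_capitalized": 1.0}

def p_name(features):
    # Sum of weights * feature values, chained into the logistic link function
    z = weights["bias"] + sum(weights[f] * v for f, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# "Tiger" in "Tiger Woods played golf.": capitalized, a known noun,
# and followed by another capitalized word ("Woods")
print(round(p_name({"is_capitalized": 1, "is_known_noun": 1,
                    "next_word_capitalized": 1}), 2))  # 0.62
```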
Training • Solve the equation system iteratively • From our training examples we get a system of equations, using the known labels: • $P(\text{name} \mid x^{(1)}) = 0$ • $P(\text{name} \mid x^{(2)}) = 1$ • … • The best fit cannot be computed directly; it is solved by iterative procedures (not our topic here) • You just have to know that the weights are estimated together!
Interpretation • Higher weights mean the probability of yes (1) is increased by the corresponding feature • Weights have to be interpreted together (no conditional independence is assumed) • Suppose we have the feature word preceded by Dr. and another feature word preceded by Prof. Dr. • If in all our texts there is only Prof. Dr., both features will always have the same value! • Then, e.g., weights of 2 and 0 lead to the same predictions as weights of 0 and 2 • Also, the weights are affected by how big and how small the feature value range is
Naive Bayes vs. Logistic Regression • Ng and Jordan (2002) • Logistic regression is better with more data, Naive Bayes is better with less data • Naive Bayes reaches its optimum faster • Logistic regression has better optimal classification
Support Vector Machines • (Figure: data points plotted by two features, animal and plant words in sentence vs. sports references in sentence, with Tiger as Animal and Tiger as Name examples) • A Support Vector Machine tries to separate true and false examples by a big boundary
Training • Soft margin for inseparable data • In practice, examples are usually not perfectly separable • A soft margin allows for wrong classifications • A parameter adjusts the tradeoff between: • data points should be on the correct side of the boundary and outside of the margin • the margin should be big • Specialized optimization algorithms for SVMs exist
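A short sketch with scikit-learn (assuming it is available; the feature vectors and labels are toy stand-ins). The parameter C is the soft-margin tradeoff from this slide: a small C favors a big margin, a large C favors putting points on the correct side:

```python
from sklearn.svm import SVC

# Toy features: [animal/plant words in sentence, sports references in sentence]
X = [[3, 0], [2, 1], [0, 3], [1, 2]]
y = ["animal", "animal", "name", "name"]

clf = SVC(kernel="linear", C=1.0)  # C adjusts the soft-margin tradeoff
clf.fit(X, y)

print(clf.predict([[0, 4]]))       # a sentence full of sports references
print(clf.support_vectors_)        # only these points define the boundary
```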
Usage • SVMs are often used very successfully, very popular • Very robust and fast; only the support vectors are needed for the classification • => robust against missing data
Interpretation • (Figure: the same feature plot with the separating hyperplane drawn in) • The hyperplane can tell you which features matter more for the classification
Decision Trees • (Figure: example tree that first splits on the word (tiger, spears, michael), then on sports references < 2 / >= 2 and on capitalization yes/no; leaves are labeled Name or Other, one leaf 90% Name) • Recursively separate the data by features that split it well into the different classes
Decision Trees Training • Start with all training examples at the root node • Pick a feature to split the training examples into the next subtrees • Pick a feature such that the training examples in each subtree are mostly from one class • Recursively repeat the procedure on the subtrees • Finished when a subtree only contains examples from one class (convert it to a leaf, e.g. name) • or mostly examples from one class (using some predefined threshold)
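A short scikit-learn sketch of this training procedure (toy data; the library picks the splitting feature by an impurity criterion such as Gini, which is one concrete version of "mostly from one class"):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy features: [word length, is capitalized, sports references in sentence]
X = [[5, 1, 2], [6, 1, 3], [5, 0, 0], [11, 1, 0]]
y = ["name", "name", "other", "other"]

tree = DecisionTreeClassifier(max_depth=3)
tree.fit(X, y)

# Print the learned splits, which is what makes single trees easy to interpret
print(export_text(tree, feature_names=["length", "capitalized", "sports_refs"]))
```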
Decision Trees Usage • Useful especially if you assume some feature interactions • Also useful for some non-linear relationships of features to the classification: • word shorter than 3 characters: unlikely to be a name • word between 3 and 10 characters: might be a name • word longer than 10 characters: unlikely to be a name • Often many trees are used together as forests (ensemble methods) • The learning of a single tree is very clear to interpret • For forests, methods exist to determine feature importance
Conditional Random Fields Motivation • Compare: • Tiger Woods played golf. • Tiger ran into the woods. • We still want to know for both occurrences of Tiger whether it is a name. There is one helpful characteristic of these sentences we did not use. Can you guess what it is? … • Tiger and Woods could both be names, and two parts of a name standing together are more likely than one part of a name by itself.
Conditional Random Fields Idea • To classify a data point, use the surrounding classifications and data points • E.g. use the fact that names often stand together • (Sequential) input -> sequential output • We only use neighbouring classifications (linear-chain CRFs)
Conditional Random Fields Sketch • (Figure: a linear chain of output class nodes above the input words; feature functions produce a score for each value of an output node) • Use feature functions to determine the probability of a class sequence
Feature Functions • Feature functions for linear-chain CRFs can use: • the complete sentence • the current position (word) in the sentence • the class of the output node before • the class of the current output node • They return a real value • Each feature function is multiplied by a weight that needs to be learned
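A sketch of two such feature functions in Python (the signatures are illustrative, not taken from a specific CRF library):

```python
# Each feature function sees the complete sentence, the current position,
# the class of the output node before and the class of the current output
# node, and returns a real value (here simply 0 or 1).

def f_capitalized_name(sentence, i, prev_class, curr_class):
    # Fires if the current word is capitalized and tagged as a name
    return 1.0 if sentence[i][0].isupper() and curr_class == "NAME" else 0.0

def f_name_follows_name(sentence, i, prev_class, curr_class):
    # Fires if two name parts stand together, as in "Tiger Woods"
    return 1.0 if prev_class == "NAME" and curr_class == "NAME" else 0.0

sentence = ["Tiger", "Woods", "played", "golf"]
print(f_name_follows_name(sentence, 1, "NAME", "NAME"))  # 1.0
```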
Examples • Washington Post wrote about Mark Post. • City + Post is usually a newspaper; first name + Post is more likely to be a name • Dr. Woods met his client. • A salutation (Mr./Dr. etc.) is usually followed by a name • A feature function does not have to use all inputs • E.g. a feature function can just look at whether the word is capitalized, what the part of speech of the next word is, etc.
Usage • Define feature functions • Learn weights for the feature functions • Classify: • find the sequence that maximizes the sum of weights * feature functions • this can be done in polynomial time with dynamic programming • Used a lot for NLP tasks like named entity recognition and part-of-speech tagging in noisy text
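A minimal sketch of the dynamic-programming search (Viterbi), assuming the weighted feature functions have already been collapsed into per-position scores and transition scores (a simplification to keep the example short):

```python
def viterbi(scores, trans, classes):
    # scores[i][c]: summed weight * feature function values for class c at word i
    # trans[(a, b)]: score for class a being followed by class b
    best = {c: (scores[0][c], [c]) for c in classes}  # score and path per class
    for i in range(1, len(scores)):
        layer = {}
        for c in classes:
            # Best previous class to transition from, by total score
            prev_c, (prev_score, path) = max(
                best.items(), key=lambda kv: kv[1][0] + trans[(kv[0], c)])
            layer[c] = (prev_score + trans[(prev_c, c)] + scores[i][c],
                        path + [c])
        best = layer
    return max(best.values(), key=lambda sp: sp[0])[1]  # best class sequence

classes = ["NAME", "OTHER"]
scores = [{"NAME": 2.0, "OTHER": 0.5}, {"NAME": 1.0, "OTHER": 1.2}]
trans = {("NAME", "NAME"): 1.0, ("NAME", "OTHER"): 0.0,
         ("OTHER", "NAME"): 0.0, ("OTHER", "OTHER"): 0.5}
print(viterbi(scores, trans, classes))  # ['NAME', 'NAME']
```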
Algorithm Characteristics • Assumptions on the data • linear / non-linear relationship to the output • Interpretability of the learning • meaning of feature weights etc. • Computational time and space • Type of input and output • categorical, numerical • single data points, sequences
Supervised/Unsupervised • Unsupervised algorithms are for learning without known classes • The algorithms so far were supervised algorithms • We had preclassified training data • Sometimes we might need unsupervised algorithms • E.g.: what kinds of topics are covered in news articles right now? • They often work by clustering • Similar data gets assigned the same class • E.g. texts with similar words may refer to the same news topic
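A short clustering sketch with scikit-learn (toy texts; the number of clusters is an assumption you have to pick):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

texts = ["election vote party", "party wins election",
         "match goal striker", "striker scores goal"]
X = CountVectorizer().fit_transform(texts)  # represent texts as word counts

# Cluster into 2 topics: texts with similar words land in the same cluster
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
print(kmeans.fit_predict(X))  # e.g. [0 0 1 1]: politics vs. sports
```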
Semi-Supervised • Create more training data automatically • A big amount of training data is important for good classification • Creating training data by hand is time-demanding • Your unsupervised algorithm already gives you data points with classes • Other simple rules can also give you training data • E.g. Dr. is almost always followed by a name • New data you classified with high confidence can also be used as training data
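A self-training sketch of the last point (the helper assumes an sklearn-style classifier with predict_proba; the 0.9 confidence threshold is an arbitrary assumption):

```python
def self_train(clf, X_train, y_train, X_unlabeled, threshold=0.9):
    # Train on the hand-labeled data first
    clf.fit(X_train, y_train)
    # Add high-confidence classifications as new training data
    for x, probs in zip(X_unlabeled, clf.predict_proba(X_unlabeled)):
        if probs.max() >= threshold:
            X_train.append(x)
            y_train.append(clf.classes_[probs.argmax()])
    # Retrain on the enlarged training set
    return clf.fit(X_train, y_train)
```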
Improvement of an ML System • It is important to know how, and whether, you can improve your machine learning system • Maybe in your overall (NLP) system there are bigger sources of error than your ML system • Maybe it is impossible to learn more from the current data than your algorithm does • You can try to: • get more data • use different features; maybe also preprocess more • use different algorithms; also different combinations
Machine Learning in NLP • Very widely used • Makes it easier to create systems that deal with new/noisy text • For example tweets, or free text in medical records • It can be easier to specify features that may be important and learn the classification automatically than to write all rules by hand
Summary • Typical machine learning consists of data acquisition, feature design, algorithm training and performance evaluation • Many algorithms exist, with different assumptions on the data • It is important to know whether your assumptions match your data • It is important to know what the goal of your overall system is
Helpful Resources • Wikipedia • Course explaining ML, including logistic regression and SVMs • Another similar one, slides freely available • Lecture about Naive Bayes for document classification • Intuitive introduction to CRFs • Guide to choosing an ML classifier
References • Ng, Andrew Y., and Michael I. Jordan. "On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes." Advances in Neural Information Processing Systems 14 (2002).