
V9: Parameter Estimation for Bayesian Networks

Today, we assume that the network structure of a BN is given and that our data set D consists of fully observed instances of the network.

aviva

Presentation Transcript


  1. V9: Parameter Estimation for Bayesian Networks. Today, we assume that the network structure of a BN is given and that our data set $D$ consists of fully observed instances of the network. We want to "learn" from the data set $D$ the parameters of the probability distributions that describe how the network variables affect each other. There exist 2 main approaches to the parameter-estimation task:
     • One is based on maximum-likelihood estimation (MLE).
     • The other one uses Bayesian approaches.
     Today we will focus on the MLE approach. Mathematics of Biological Networks

  2. Thumbtack example. Imagine that we have a thumbtack (German: Heftzwecke) and we conduct an experiment in which we flip the thumbtack in the air. It lands as either heads or tails. By tossing the thumbtack several times, we obtain a data set $x[1], \dots, x[n]$ of heads or tails outcomes. Based on this data set, we want to estimate the probability with which the next flip will land heads or tails.

  3. Thumbtack example. We assume implicitly that the thumbtack tosses are controlled by an (unknown) parameter $\theta$, which describes the frequency of heads. We also assume that the data instances are independent and identically distributed. This assumption is later abbreviated as IID. One way of evaluating $\theta$ is by how well it predicts the data. Suppose we observe the sequence of outcomes H, T, T, H, H. The probability of the first toss is $P(X[1] = H) = \theta$. The probability of the second toss is $P(X[2] = T \mid X[1] = H)$. Since we assume that the tosses are independent, we can simplify this to $P(X[2] = T) = 1 - \theta$.

  4. Thumbtack example. And so on. Thus, the probability of the full sequence is $P(H, T, T, H, H : \theta) = \theta (1 - \theta)(1 - \theta)\,\theta\,\theta = \theta^3 (1 - \theta)^2$. As expected, this probability depends on the particular value of $\theta$. As we consider different values of $\theta$, we get different probabilities of the sequence. Let us examine how the probability of the data changes as a function of $\theta$. We can define the likelihood function (cf. fig. 17.2) to be $L(\theta : H, T, T, H, H) = P(H, T, T, H, H : \theta) = \theta^3 (1 - \theta)^2$.

  5. Maximum likelihood estimator. Parameter values with higher likelihood are more likely to generate the observed sequences. Thus, we can use the likelihood function as our measure of quality for different parameter values and select the parameter value that maximizes the likelihood. This value is called the maximum likelihood estimator (MLE). From the figure, we see that $\hat{\theta} = 0.6 = 3/5$ maximizes the likelihood for the sequence H, T, T, H, H.
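The maximum read off the figure can also be checked numerically. The following is a minimal sketch, assuming a simple grid search over $\theta$; the function name `thumbtack_likelihood` and the grid resolution are illustrative choices, not part of the lecture.

```python
def thumbtack_likelihood(theta, outcomes):
    """Probability of an IID sequence of 'H'/'T' outcomes when P(heads) = theta."""
    likelihood = 1.0
    for outcome in outcomes:
        likelihood *= theta if outcome == "H" else 1.0 - theta
    return likelihood

outcomes = ["H", "T", "T", "H", "H"]

# Evaluate the likelihood on a grid of candidate values 0.00, 0.01, ..., 1.00
# and keep the candidate with the highest likelihood.
grid = [i / 100 for i in range(101)]
best_theta = max(grid, key=lambda t: thumbtack_likelihood(t, outcomes))
print(best_theta)
```

A grid search only approximates the maximizer up to the grid spacing; here the true maximum 3/5 happens to lie on the grid.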

  6. Determining MLE. How can we find the MLE for the general case? Assume that our data set $D$ of observations contains $M[1]$ heads and $M[0]$ tails. The likelihood function for this is $L(\theta : D) = \theta^{M[1]} (1 - \theta)^{M[0]}$. It turns out that it is easier to maximize the logarithm of the likelihood function. The log-likelihood function is $\ell(\theta : D) = M[1] \log \theta + M[0] \log (1 - \theta)$. The log-likelihood is monotonically related to the likelihood, so maximizing one is equivalent to maximizing the other.

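  7. Determining MLE. In order to determine the value $\hat{\theta}$ that maximizes the log-likelihood, we take the derivative $\frac{d\ell}{d\theta} = \frac{M[1]}{\theta} - \frac{M[0]}{1 - \theta}$, set it equal to zero, and solve for $\theta$. This gives $\hat{\theta} = \frac{M[1]}{M[1] + M[0]}$.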
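The closed-form result, the fraction of heads $M[1] / (M[1] + M[0])$, can be compared against neighbouring parameter values. This is a minimal sketch; the function names are illustrative, and the counts 3 and 2 are taken from the sequence H, T, T, H, H used earlier.

```python
import math

def log_likelihood(theta, heads, tails):
    """Log-likelihood M[1] * log(theta) + M[0] * log(1 - theta), for 0 < theta < 1."""
    return heads * math.log(theta) + tails * math.log(1.0 - theta)

def thumbtack_mle(heads, tails):
    """Closed-form maximizer: the observed fraction of heads."""
    return heads / (heads + tails)

theta_hat = thumbtack_mle(3, 2)

# The log-likelihood at theta_hat should be at least as large as at nearby values.
for candidate in (0.5, 0.55, 0.65, 0.7):
    assert log_likelihood(theta_hat, 3, 2) >= log_likelihood(candidate, 3, 2)
```

Note that `log_likelihood` is undefined at the boundary values 0 and 1, which is where the estimate lands if only heads or only tails are observed.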

  8. The ML principle. We now consider how to apply the ML principle to Bayesian networks. Assume that we observe several IID samples of a set of random variables $X$ from an unknown distribution $P^*(X)$. We assume that we know in advance the sample space we are dealing with (i.e. which random variables there are and what values they can take). However, we do not make any additional assumptions about $P^*$. We denote the training set of samples as $D$ and assume that it consists of $M$ instances of $X$: $\xi[1], \dots, \xi[M]$. Now we need to consider what exactly we want to learn.

  9. The ML principle. We assume we are given a parametric model, defined by a function $P(\xi : \theta)$, for which we wish to estimate parameters. Given a particular set of parameter values $\theta$ and an instance $\xi$ of $X$, the model assigns a probability (or density) to $\xi$. We require that for each choice of parameters $\theta$, $P(\xi : \theta)$ is a legal distribution; that is, it is nonnegative and $\sum_{\xi} P(\xi : \theta) = 1$. In general, for each model, not all parameter values are legal. Thus, we need to define the parameter space $\Theta$, which is the set of allowable parameters.

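  10. The ML principle. As an example, the thumbtack model we considered before has parameter space $\Theta_{\text{thumbtack}} = [0, 1]$ and is defined as $P_{\text{thumbtack}}(x : \theta) = \theta$ if $x = H$ and $P_{\text{thumbtack}}(x : \theta) = 1 - \theta$ if $x = T$.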

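  11. Another example. Suppose that $X$ is a multinomial variable that can take values $x^1, \dots, x^K$. The simplest representation of a multinomial distribution is a vector $\theta \in \mathbb{R}^K$ such that $P_{\text{multinomial}}(x : \theta) = \theta_k$ if $x = x^k$. The parameter space of this model is $\Theta_{\text{multinomial}} = \{ \theta \in [0, 1]^K : \sum_{k} \theta_k = 1 \}$.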

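  12. Another example. Suppose that $X$ is a continuous variable that can take values on the real line. A Gaussian model for $X$ is $P_{\text{Gaussian}}(x : \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$ where $\theta = \langle \mu, \sigma \rangle$. The parameter space for this model is $\Theta_{\text{Gaussian}} = \mathbb{R} \times \mathbb{R}^{+}$ (we allow any real value of $\mu$ and any positive real value of $\sigma$).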

  13. Likelihood function. The next step in maximum likelihood estimation is defining the likelihood function. For a given choice of parameters $\theta$, the likelihood function is the probability (or density) the model assigns to the training data: $L(\theta : D) = \prod_{m} P(\xi[m] : \theta)$. In the thumbtack example, we saw that we can write the likelihood function using simpler terms with the counts $M[1]$ and $M[0]$. The order of tosses was irrelevant. The counts $M[1]$ and $M[0]$ were sufficient statistics for the thumbtack learning problem.

  14. Sufficient statistics. Definition: A function $\tau(\xi)$ from instances of $X$ to $\mathbb{R}^{\ell}$ (for some $\ell$) is a sufficient statistic if, for any two data sets $D$ and $D'$ and any $\theta \in \Theta$, we have that $\sum_{\xi[m] \in D} \tau(\xi[m]) = \sum_{\xi'[m] \in D'} \tau(\xi'[m]) \Rightarrow L(\theta : D) = L(\theta : D')$. For the multinomial model, a sufficient statistic for the data is the tuple of counts $\langle M[1], \dots, M[K] \rangle$ such that $M[k]$ is the number of times the value $x^k$ appears in the training data. To obtain these counts by summing instance-level statistics, we define $\tau(x)$ to be a tuple of dimension $K$ such that $\tau(x)$ has a 0 in every position, except at the position $k$ for which $x = x^k$, where its value is 1. Given the vector of counts, we can write the likelihood function as $L(\theta : D) = \prod_{k} \theta_k^{M[k]}$.
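The construction of the counts from instance-level statistics can be sketched as follows. This is an illustrative example, not part of the lecture: the names `tau`, `sufficient_statistics` and `likelihood_from_counts`, the value labels and the parameter vector are all assumptions chosen for the demonstration.

```python
def tau(x, values):
    """One-hot tuple: 1 at the position of x in `values`, 0 elsewhere."""
    return [1 if x == v else 0 for v in values]

def sufficient_statistics(data, values):
    """Sum the instance-level statistics tau(x[m]) to obtain the counts M[1..K]."""
    counts = [0] * len(values)
    for x in data:
        counts = [c + t for c, t in zip(counts, tau(x, values))]
    return counts

def likelihood_from_counts(theta, counts):
    """Multinomial likelihood: product over k of theta_k ** M[k]."""
    result = 1.0
    for theta_k, m_k in zip(theta, counts):
        result *= theta_k ** m_k
    return result

values = ["a", "b", "c"]            # hypothetical values x^1, x^2, x^3
theta = [0.5, 0.3, 0.2]             # hypothetical legal parameter vector
data_1 = ["a", "b", "a", "c"]
data_2 = ["c", "a", "a", "b"]       # same outcomes in a different order

counts_1 = sufficient_statistics(data_1, values)
counts_2 = sufficient_statistics(data_2, values)
```

Both data sets yield the same counts and therefore the same likelihood for every choice of `theta`, which is what the definition of a sufficient statistic requires.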

  15. Likelihood function. The likelihood function measures the effect of the choice of parameters on the training data. If we have 2 sets of parameters $\theta$ and $\theta'$ such that $L(\theta : D) = L(\theta' : D)$, then we cannot, given only the data, distinguish between the 2 choices of parameters. If $L(\theta : D) = L(\theta' : D)$ for all possible choices of $D$, then the 2 parameters are indistinguishable for any outcome. In such a situation, we can say in advance (i.e. before seeing the data) that some distinctions cannot be resolved based on the data alone.

  16. Likelihood function. Secondly, since we are maximizing the likelihood function, we usually want it to be a continuous (and preferably smooth) function of $\theta$. To ensure these properties, most of the theory of statistical estimation requires that $P(\xi : \theta)$ is a continuous and differentiable function of $\theta$, and moreover that $\Theta$ is a continuous set of points (which is often assumed to be convex).

  17. Likelihood function. Once we have defined the likelihood function, we can use maximum likelihood estimation to choose the parameter values. This can be stated formally as follows. Maximum Likelihood Estimation: Given a data set $D$, choose parameters $\hat{\theta}$ that satisfy $L(\hat{\theta} : D) = \max_{\theta \in \Theta} L(\theta : D)$. For the multinomial distribution, the maximum likelihood is attained when $\hat{\theta}_k = \frac{M[k]}{M}$, i.e. the probability of each value of $X$ corresponds to its frequency in the training data.
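The multinomial MLE is a relative-frequency computation. A minimal sketch, assuming the data is a plain list of observed values; the function name `multinomial_mle` and the sample data are illustrative.

```python
from collections import Counter

def multinomial_mle(data):
    """Estimate theta_k = M[k] / M as the relative frequency of each observed value."""
    counts = Counter(data)          # M[k] for every value that occurs in the data
    total = len(data)               # M
    return {value: count / total for value, count in counts.items()}

theta_hat = multinomial_mle(["a", "b", "a", "c"])
print(theta_hat)
```

Values that never occur in the data do not appear in the returned dictionary; under MLE their estimated probability is 0.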

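  18. Likelihood function. For the Gaussian distribution, the maximum is attained when $\mu$ and $\sigma^2$ correspond to the empirical mean and variance of the training data: $\hat{\mu} = \frac{1}{M} \sum_{m} x[m]$ and $\hat{\sigma}^2 = \frac{1}{M} \sum_{m} (x[m] - \hat{\mu})^2$.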
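For the Gaussian case, the estimates are the empirical mean and the empirical variance with divisor $M$. A minimal sketch with made-up sample values; the function name `gaussian_mle` is illustrative.

```python
import math

def gaussian_mle(samples):
    """Return (mu_hat, sigma_hat): empirical mean and standard deviation (divisor M)."""
    m = len(samples)
    mu_hat = sum(samples) / m
    variance_hat = sum((x - mu_hat) ** 2 for x in samples) / m
    return mu_hat, math.sqrt(variance_hat)

samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # hypothetical observations
mu_hat, sigma_hat = gaussian_mle(samples)
```

The MLE divides by $M$, not by $M - 1$ as the unbiased sample variance does, so the two estimates differ slightly for small data sets.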

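  19. MLE for Bayesian networks. The simplest example of a nontrivial network structure is a network consisting of 2 binary variables, say $X$ and $Y$, with an arc $X \to Y$. This network is parametrized by a parameter vector $\theta$ which defines the set of parameters for all the CPDs in the network. In this case, $\theta_{x^1}$ and $\theta_{x^0}$ specify the probability of the 2 values of $X$; $\theta_{y^1 \mid x^1}$ and $\theta_{y^0 \mid x^1}$ specify the probability of $Y$ given that $X = x^1$; and $\theta_{y^1 \mid x^0}$ and $\theta_{y^0 \mid x^0}$ specify the probability of $Y$ given that $X = x^0$. For brevity, we use the shorthand $\theta_{Y \mid x^0}$ to refer to the set $\{ \theta_{y^1 \mid x^0}, \theta_{y^0 \mid x^0} \}$ and $\theta_{Y \mid X}$ to refer to $\theta_{Y \mid x^1} \cup \theta_{Y \mid x^0}$.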

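  20. MLE for Bayesian networks. In this example, every training instance is a tuple $\langle x[m], y[m] \rangle$ that describes a particular assignment to $X$ and $Y$. Our likelihood function is $L(\theta : D) = \prod_{m=1}^{M} P(x[m], y[m] : \theta)$. Our network model $X \to Y$ specifies that $P(X, Y : \theta)$ has a product form. Thus we can write $L(\theta : D) = \prod_{m} P(x[m] : \theta)\, P(y[m] \mid x[m] : \theta)$.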

  21. MLE for Bayesian networks. Exchanging the order of multiplication, we can equivalently write this term as $L(\theta : D) = \left( \prod_{m} P(x[m] : \theta) \right) \left( \prod_{m} P(y[m] \mid x[m] : \theta) \right)$. That is, the likelihood decomposes into 2 separate terms, one for each variable. Each of these terms is a local likelihood function that measures how well the variable is predicted given its parents. Each term depends only on the parameters for that variable's CPD. The first term, $\prod_{m} P(x[m] : \theta_X)$, is identical to the multinomial likelihood function we discussed earlier.

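  22. MLE for Bayesian networks. The second term can be decomposed further: $\prod_{m} P(y[m] \mid x[m] : \theta_{Y \mid X}) = \prod_{m : x[m] = x^0} P(y[m] \mid x[m] : \theta_{Y \mid x^0}) \cdot \prod_{m : x[m] = x^1} P(y[m] \mid x[m] : \theta_{Y \mid x^1})$. Thus, in this example, the likelihood function decomposes into a product of terms, one for each group of parameters in $\theta$. This property is called the decomposability of the likelihood function.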
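The decomposition can be verified numerically for the two-variable network $X \to Y$. The sketch below uses made-up parameter values and data (both are assumptions for illustration, with values encoded as 0 and 1) and compares the instance-by-instance likelihood with the product of the three group terms: one for $X$, one for $Y$ given $x^0$, and one for $Y$ given $x^1$.

```python
import math

# Hypothetical parameters for the network X -> Y; any legal values would do.
theta_x = {0: 0.3, 1: 0.7}                       # P(X = x)
theta_y_given_x = {0: {0: 0.8, 1: 0.2},          # P(Y = y | X = 0)
                   1: {0: 0.4, 1: 0.6}}          # P(Y = y | X = 1)

# Hypothetical fully observed data set of (x, y) tuples.
data = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 1), (1, 0), (0, 0)]

def joint_likelihood(data):
    """Product over instances of P(x[m]) * P(y[m] | x[m])."""
    return math.prod(theta_x[x] * theta_y_given_x[x][y] for x, y in data)

def likelihood_x(data):
    """Local likelihood of X: product over instances of P(x[m])."""
    return math.prod(theta_x[x] for x, _ in data)

def likelihood_y_given(data, x_value):
    """Term for Y restricted to the instances with x[m] == x_value."""
    return math.prod(theta_y_given_x[x][y] for x, y in data if x == x_value)

full = joint_likelihood(data)
decomposed = (likelihood_x(data)
              * likelihood_y_given(data, 0)
              * likelihood_y_given(data, 1))
```

Up to floating-point rounding, `full` and `decomposed` agree, because the two expressions only differ in the order of multiplication.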

  23. MLE for Bayesian networks. We can do one more simplification by using the notion of sufficient statistics. Let us consider one term in this expression: $\prod_{m : x[m] = x^0} P(y[m] \mid x[m] : \theta_{Y \mid x^0})$. Each of the individual terms $P(y[m] \mid x[m] : \theta_{Y \mid x^0})$ can take one of 2 values, depending on the value of $y[m]$. If $y[m] = y^1$, it is equal to $\theta_{y^1 \mid x^0}$. If $y[m] = y^0$, it is equal to $\theta_{y^0 \mid x^0}$. How many cases of each type do we get?

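  24. MLE for Bayesian networks. Let us restrict attention to those data cases where $x[m] = x^0$. These again partition into 2 categories. We get $\theta_{y^1 \mid x^0}$ in those cases where $x[m] = x^0$ and $y[m] = y^1$. We use $M[x^0, y^1]$ to denote their number. We get $\theta_{y^0 \mid x^0}$ in those cases where $x[m] = x^0$ and $y[m] = y^0$. We use $M[x^0, y^0]$ to denote their number. Thus the term in the equation is equal to $\prod_{m : x[m] = x^0} P(y[m] \mid x[m] : \theta_{Y \mid x^0}) = \theta_{y^1 \mid x^0}^{M[x^0, y^1]} \cdot \theta_{y^0 \mid x^0}^{M[x^0, y^0]}$.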

  25. MLE for Bayesian networks. Based on our discussion of the multinomial likelihood, we know that we maximize this term by setting $\hat{\theta}_{y^1 \mid x^0} = \frac{M[x^0, y^1]}{M[x^0, y^1] + M[x^0, y^0]} = \frac{M[x^0, y^1]}{M[x^0]}$, and similarly for $\hat{\theta}_{y^0 \mid x^0}$. Thus, we can find the maximum likelihood parameters in this CPD by simply counting how many times each of the possible assignments of $X$ and $Y$ appears in the training data. It turns out that these counts of the various assignments for some set of variables are generally useful.

  26. MLE for Bayesian networks. Definition: Let $Z$ be some set of random variables, and $z$ be some instantiation of these random variables. Let $D$ be a data set. We define $M[z]$ to be the number of entries in $D$ that have $Z[m] = z$. This approach can be extended to general Bayesian networks.
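For the two-variable network $X \to Y$, estimation by counting can be sketched as follows. The data set is made up for illustration, values are encoded as 0 and 1, and the function name `mle_x_to_y` is an assumption; the counts correspond to $M[x, y]$, $M[x]$ and $M$.

```python
from collections import Counter

def mle_x_to_y(data):
    """MLE for the network X -> Y from a list of fully observed (x, y) tuples."""
    joint_counts = Counter(data)                   # M[x, y]
    x_counts = Counter(x for x, _ in data)         # M[x]
    total = len(data)                              # M
    theta_x = {x: m / total for x, m in x_counts.items()}
    theta_y_given_x = {}
    for (x, y), m in joint_counts.items():
        # theta_{y|x} = M[x, y] / M[x]
        theta_y_given_x.setdefault(x, {})[y] = m / x_counts[x]
    return theta_x, theta_y_given_x

# Hypothetical data set: M[0,0] = 3, M[0,1] = 1, M[1,0] = 1, M[1,1] = 3.
data = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 1), (1, 0), (0, 0)]
theta_x, theta_y_given_x = mle_x_to_y(data)
```

Assignments that never occur in the data get no entry in `theta_y_given_x`; their MLE is 0 when the parent value was observed and undefined when it was not.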

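  27. Global likelihood decomposition. We start by examining the likelihood function of a Bayesian network. Suppose we want to learn the parameters for a BN with structure $G$ and parameters $\theta$. We are also given a data set $D$ consisting of samples $\xi[1], \dots, \xi[M]$. Writing the likelihood and repeating the previous steps gives $L(\theta : D) = \prod_{m} P_G(\xi[m] : \theta) = \prod_{m} \prod_{i} P(x_i[m] \mid pa_{X_i}[m] : \theta) = \prod_{i} \left[ \prod_{m} P(x_i[m] \mid pa_{X_i}[m] : \theta) \right]$.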

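  28. Global likelihood decomposition. Each of the terms in the square brackets refers to the conditional likelihood of a particular variable given its parents in the network. We use $\theta_{X_i \mid Pa_{X_i}}$ to denote the subset of parameters that determines $P(X_i \mid Pa_{X_i})$ in our model. Then, we can write $L(\theta : D) = \prod_{i} L_i(\theta_{X_i \mid Pa_{X_i}} : D)$, where the local likelihood function for $X_i$ is $L_i(\theta_{X_i \mid Pa_{X_i}} : D) = \prod_{m} P(x_i[m] \mid pa_{X_i}[m] : \theta_{X_i \mid Pa_{X_i}})$. This form is particularly useful when the parameters are disjoint. That is, each CPD is parametrized by a separate set of parameters that do not overlap.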

  29. Global likelihood decomposition. The previous analysis showed that the likelihood decomposes as a product of independent terms, one for each CPD in the network. This important property is called the global decomposition of the likelihood function. Proposition: Let $D$ be a complete data set for $X_1, \dots, X_n$, let $G$ be a network structure over these variables, and suppose that the parameters $\theta_{X_i \mid Pa_{X_i}}$ are disjoint from $\theta_{X_j \mid Pa_{X_j}}$ for all $j \neq i$. Let $\hat{\theta}_{X_i \mid Pa_{X_i}}$ be the parameters that maximize $L_i(\theta_{X_i \mid Pa_{X_i}} : D)$. Then $\hat{\theta} = \langle \hat{\theta}_{X_1 \mid Pa_{X_1}}, \dots, \hat{\theta}_{X_n \mid Pa_{X_n}} \rangle$ maximizes $L(\theta : D)$. In other words, we can maximize each local likelihood function independently of the rest of the network, and then combine the solutions to get an MLE solution.

  30. Table-CPDs. We now consider the simplest parametrization of the CPDs: a table-CPD. Suppose we have a variable $X$ with parents $U$. If we represent that CPD $P(X \mid U)$ as a table, then we will have a parameter $\theta_{x \mid u}$ for each combination of $x \in Val(X)$ and $u \in Val(U)$. In this case we can write the local likelihood function as $L_X(\theta_{X \mid U} : D) = \prod_{m} \theta_{x[m] \mid u[m]} = \prod_{u \in Val(U)} \left[ \prod_{x \in Val(X)} \theta_{x \mid u}^{M[u, x]} \right]$, where $M[u, x]$ is the number of times $x[m] = x$ and $u[m] = u$.

  31. Table-CPDs. We need to maximize this term under the constraints that, for each choice of value for the parents $U$, the conditional probability is legal, that is, $\sum_{x} \theta_{x \mid u} = 1$ for all $u$. We can maximize each of the terms in square brackets in the previous equation independently. We can further decompose the local likelihood function for a tabular CPD into a product of simple likelihood functions that are each multinomial likelihoods. The counts in the data for the different outcomes $x$ are $\{ M[u, x] : x \in Val(X) \}$. The MLE parameters turn out to be $\hat{\theta}_{x \mid u} = \frac{M[u, x]}{M[u]}$, where $M[u] = \sum_{x} M[u, x]$.
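The table-CPD estimate $M[u, x] / M[u]$ generalizes the earlier counting to any number of parents. The following is a minimal sketch, assuming each data instance is a dictionary from variable name to value; the function name `table_cpd_mle`, the variable names A, B, C and the records are all hypothetical.

```python
from collections import Counter

def table_cpd_mle(records, child, parents):
    """Estimate theta_{x|u} = M[u, x] / M[u] for one table-CPD.

    Returns a dict keyed by (u, x), where u is the tuple of parent values.
    """
    joint_counts = Counter()     # M[u, x]
    parent_counts = Counter()    # M[u]
    for record in records:
        u = tuple(record[p] for p in parents)
        joint_counts[(u, record[child])] += 1
        parent_counts[u] += 1
    return {(u, x): m / parent_counts[u] for (u, x), m in joint_counts.items()}

# Hypothetical complete data set for a variable C with parents A and B.
records = [
    {"A": 0, "B": 0, "C": 1},
    {"A": 0, "B": 0, "C": 0},
    {"A": 0, "B": 0, "C": 1},
    {"A": 1, "B": 0, "C": 1},
    {"A": 1, "B": 1, "C": 0},
    {"A": 1, "B": 1, "C": 0},
]
cpd_c = table_cpd_mle(records, child="C", parents=["A", "B"])

# A variable without parents is handled by passing an empty parent list.
cpd_a = table_cpd_mle(records, child="A", parents=[])
```

By the global decomposition, calling such a function once per variable yields an MLE for the whole network when the data is complete. Parent configurations that never occur in the data have $M[u] = 0$, so this sketch returns no estimate for them.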
