V9: Parameter Estimation for Bayesian Networks

Today, we assume that the network structure of a Bayesian network (BN) is given and that our data set D consists of fully observed instances of the network variables.
We want to "learn" from D the parameters of the probability distributions that describe how the network variables affect each other.
There are 2 main approaches to the parameter-estimation task:
• One is based on maximum-likelihood estimation (MLE).
• The other uses Bayesian approaches.
Today we focus on the MLE approach.
Mathematics of Biological Networks

Thumbtack example
Imagine that we have a thumbtack and conduct an experiment in which we flip the thumbtack in the air. It comes to land as either heads or tails.
By tossing the thumbtack several times, we obtain a data set x[1], …, x[n] of heads or tails outcomes.
Based on this data set, we want to estimate the probability with which the next flip will land heads or tails.

Thumbtack example
We assume implicitly that the thumbtack tosses are controlled by an (unknown) parameter $\theta$, which describes the frequency of heads.
We also assume that the data instances are independent and identically distributed. This assumption is abbreviated below as IID.
One way of evaluating $\theta$ is by how well it predicts the data.
Suppose we observe the sequence of outcomes H, T, T, H, H.
The probability of the first toss is $P(X[1] = H) = \theta$.
The probability of the second toss is $P(X[2] = T \mid X[1] = H)$. Since we assume that the tosses are independent, we can simplify this to $P(X[2] = T) = 1 - \theta$.

Thumbtack example
And so on. Thus, the probability of the full sequence is
$$P(H, T, T, H, H : \theta) = \theta (1 - \theta)(1 - \theta)\,\theta\,\theta = \theta^3 (1 - \theta)^2.$$
As expected, this probability depends on the particular value of $\theta$. As we consider different values of $\theta$, we get different probabilities of the sequence.
Let us examine how the probability of the data changes as a function of $\theta$. We can define the likelihood function to be
$$L(\theta : H, T, T, H, H) = P(H, T, T, H, H : \theta) = \theta^3 (1 - \theta)^2.$$
[Figure placeholder: fig. 17.2, plot of the likelihood function.]

Maximum likelihood estimator
Parameter values with higher likelihood are more likely to generate the observed sequence.
Thus, we can use the likelihood function as our measure of quality for different parameter values and select the parameter value that maximizes the likelihood.
This value is called the maximum likelihood estimator (MLE).
From the figure, we see that $\hat{\theta} = 0.6 = 3/5$ maximizes the likelihood for the sequence H, T, T, H, H.
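
The maximum read off the figure can also be checked numerically. The following is a minimal sketch that evaluates the thumbtack likelihood on a grid of candidate values; the function name `likelihood` and the grid resolution of 0.01 are assumptions made for illustration.

```python
# Evaluate the thumbtack likelihood L(theta) = theta^3 * (1 - theta)^2 on a grid.
def likelihood(theta, heads=3, tails=2):
    """Probability of an IID sequence with the given numbers of heads and tails."""
    return theta ** heads * (1.0 - theta) ** tails

# Candidate parameter values 0.00, 0.01, ..., 1.00 (the resolution is an assumption).
grid = [i / 100 for i in range(101)]

# Pick the grid point with the highest likelihood.
best_theta = max(grid, key=likelihood)

print(best_theta)              # 0.6
print(likelihood(best_theta))  # approximately 0.03456
```

A grid search only works for a single bounded parameter; it is shown here to make the shape of the likelihood function concrete, not as a general estimation method.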

Determining MLE
How can we find the MLE for the general case?
Assume that our data set D of observations contains M[1] heads and M[0] tails. The likelihood function for this is
$$L(\theta : D) = \theta^{M[1]} (1 - \theta)^{M[0]}.$$
It turns out that it is easier to maximize the logarithm of the likelihood function. The log-likelihood function is
$$\ell(\theta : D) = M[1] \log \theta + M[0] \log (1 - \theta).$$
The log-likelihood is monotonically related to the likelihood, so maximizing one is equivalent to maximizing the other.

Determining MLE
To determine the value $\hat{\theta}$ that maximizes the log-likelihood, we take the derivative with respect to $\theta$, set it equal to zero, and solve for $\theta$. This gives
$$\hat{\theta} = \frac{M[1]}{M[1] + M[0]}.$$
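
The intermediate steps of this maximization can be written out as follows. This is a worked sketch for a data set with M[1] heads and M[0] tails; it assumes $0 < \theta < 1$ so that both logarithms are defined.

```latex
\begin{align*}
\ell(\theta : D) &= M[1] \log \theta + M[0] \log (1 - \theta) \\
\frac{d\ell}{d\theta} &= \frac{M[1]}{\theta} - \frac{M[0]}{1 - \theta} = 0 \\
M[1]\,(1 - \theta) &= M[0]\,\theta \\
\hat{\theta} &= \frac{M[1]}{M[1] + M[0]}
\end{align*}
```

For the sequence H, T, T, H, H we have M[1] = 3 and M[0] = 2, which gives $\hat{\theta} = 3/5 = 0.6$.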

The ML principle
We now consider how to apply the ML principle to Bayesian networks.
Assume that we observe several IID samples of a set of random variables X from an unknown distribution $P^*(X)$.
We assume that we know in advance the sample space we are dealing with (i.e. which random variables there are and what values they can take). However, we do not make any additional assumptions about $P^*$.
We denote the training set of samples as D and assume that it consists of M instances of X: $\xi[1], \dots, \xi[M]$.
Now we need to consider what exactly we want to learn.

The ML principle
We assume we are given a parametric model, defined by a function $P(\xi : \theta)$, for which we wish to estimate parameters.
Given a particular set of parameter values $\theta$ and an instance $\xi$ of X, the model assigns a probability (or density) to $\xi$.
We require that for each choice of parameters $\theta$, $P(\xi : \theta)$ is a legal distribution; that is, it is nonnegative and
$$\sum_{\xi} P(\xi : \theta) = 1.$$
In general, for each model, not all parameter values are legal. Thus, we need to define the parameter space $\Theta$, which is the set of allowable parameters.

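The ML principle
As an example, the thumbtack model we considered before has parameter space
$$\Theta_{\text{thumbtack}} = [0, 1]$$
and is defined as
$$P_{\text{thumbtack}}(x : \theta) = \begin{cases} \theta & \text{if } x = H \\ 1 - \theta & \text{if } x = T \end{cases}$$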

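Another example
Suppose that X is a multinomial variable that can take values $x^1, \dots, x^K$.
The simplest representation of a multinomial distribution is a vector $\theta \in \mathbb{R}^K$ such that
$$P_{\text{multinomial}}(x : \theta) = \theta_k \quad \text{if } x = x^k.$$
The parameter space of this model is
$$\Theta_{\text{multinomial}} = \Big\{ \theta \in [0, 1]^K : \sum_i \theta_i = 1 \Big\}.$$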

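Another example
Suppose that X is a continuous variable that can take values on the real line. A Gaussian model for X is
$$P_{\text{Gaussian}}(x : \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\Big( -\frac{(x - \mu)^2}{2\sigma^2} \Big),$$
where $\theta = \langle \mu, \sigma \rangle$.
The parameter space for this model is $\Theta_{\text{Gaussian}} = \mathbb{R} \times \mathbb{R}^+$ (we allow any real value of $\mu$ and any positive real value of $\sigma$).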

Likelihood function
The next step in maximum likelihood estimation is defining the likelihood function.
For a given choice of parameters $\theta$, the likelihood function is the probability (or density) the model assigns to the training data:
$$L(\theta : D) = \prod_m P(\xi[m] : \theta).$$
In the thumbtack example, we saw that we can write the likelihood function using simpler terms with the counts M[1] and M[0]. The order of tosses was irrelevant.
The counts M[1] and M[0] were sufficient statistics for the thumbtack learning problem.

Sufficient statistics
Definition: A function $\tau(\xi)$ from instances of X to $\mathbb{R}^\ell$ (for some $\ell$) is a sufficient statistic if, for any two data sets D and D' and any $\theta \in \Theta$, we have that
$$\sum_{\xi[m] \in D} \tau(\xi[m]) = \sum_{\xi'[m] \in D'} \tau(\xi'[m]) \;\Longrightarrow\; L(\theta : D) = L(\theta : D').$$
For the multinomial model, a sufficient statistic for the data is the tuple of counts $\langle M[1], \dots, M[K] \rangle$ such that M[k] is the number of times the value $x^k$ appears in the training data.
To obtain these counts by summing instance-level statistics, we define $\tau(x)$ to be a tuple of dimension K such that $\tau(x)$ has a 0 in every position except at the position k for which $x = x^k$, where its value is 1.
Given the vector of counts, we can write the likelihood function as
$$L(\theta : D) = \prod_k \theta_k^{M[k]}.$$
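
The role of the counts as a sufficient statistic can be illustrated with a short sketch. The value set, the parameter vector, and the two data sets below are made up for illustration, and all function names are assumptions; the code only shows that two data sets with the same summed statistics receive the same likelihood.

```python
from math import isclose, prod

def tau(x, values):
    """One-hot instance-level statistic: 1 at the position of x, 0 elsewhere."""
    return tuple(1 if x == v else 0 for v in values)

def counts(data, values):
    """Sum the instance-level statistics to obtain the counts M[1], ..., M[K]."""
    totals = [0] * len(values)
    for x in data:
        for k, t in enumerate(tau(x, values)):
            totals[k] += t
    return tuple(totals)

def likelihood_from_counts(theta, m):
    """L(theta : D) = prod_k theta_k ** M[k]."""
    return prod(t ** c for t, c in zip(theta, m))

def likelihood_from_data(theta, data, values):
    """L(theta : D) = prod_m P(x[m] : theta), evaluated instance by instance."""
    return prod(theta[values.index(x)] for x in data)

values = ("a", "b", "c")         # hypothetical values x^1, x^2, x^3
theta = (0.5, 0.3, 0.2)          # hypothetical parameter vector
d1 = ["a", "b", "a", "c", "a"]
d2 = ["c", "a", "a", "a", "b"]   # same values as d1 in a different order

print(counts(d1, values), counts(d2, values))   # (3, 1, 1) (3, 1, 1)
print(isclose(likelihood_from_data(theta, d1, values),
              likelihood_from_data(theta, d2, values)))   # True
```

Because the likelihood depends on the data only through the counts, a learning procedure can discard the individual instances once the counts have been collected.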

Likelihood function
The likelihood function measures the effect of the choice of parameters on the training data.
If we have 2 sets of parameters $\theta$ and $\theta'$ such that $L(\theta : D) = L(\theta' : D)$, then we cannot, given only the data, distinguish between the 2 choices of parameters.
If $L(\theta : D) = L(\theta' : D)$ for all possible choices of D, then the 2 parameter sets are indistinguishable for any outcome.
In such a situation, we can say in advance (i.e. before seeing the data) that some distinctions cannot be resolved based on the data alone.

Likelihood function
Secondly, since we are maximizing the likelihood function, we usually want it to be a continuous (and preferably smooth) function of $\theta$.
To ensure these properties, most of the theory of statistical estimation requires that $P(\xi : \theta)$ is a continuous and differentiable function of $\theta$, and moreover that $\Theta$ is a continuous set of points (which is often assumed to be convex).

Likelihood function
Once we have defined the likelihood function, we can use maximum likelihood estimation to choose the parameter values. This can be stated formally as
Maximum Likelihood Estimation: Given a data set D, choose parameters $\hat{\theta}$ that satisfy
$$L(\hat{\theta} : D) = \max_{\theta \in \Theta} L(\theta : D).$$
For the multinomial distribution, the maximum likelihood is attained when
$$\hat{\theta}_k = \frac{M[k]}{M},$$
i.e. the probability of each value of X corresponds to its frequency in the training data.
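
As a small illustration of the multinomial case, the sketch below estimates the parameters as relative frequencies and compares the resulting log-likelihood with that of one other legal parameter vector. The data, the alternative parameters, and the function names are assumptions chosen for the example.

```python
from collections import Counter
from math import log

def multinomial_mle(data, values):
    """MLE for a multinomial: theta_k = M[k] / M (relative frequencies)."""
    m_k = Counter(data)
    m = len(data)
    return {v: m_k[v] / m for v in values}

def log_likelihood(theta, data):
    """Sum over instances of log P(x[m] : theta)."""
    return sum(log(theta[x]) for x in data)

# Made-up data set with counts a: 5, b: 3, c: 2.
data = ["a", "b", "a", "c", "a", "b", "a", "c", "a", "b"]
theta_hat = multinomial_mle(data, ("a", "b", "c"))

# Another legal parameter vector (sums to 1), used only for comparison.
other = {"a": 0.4, "b": 0.4, "c": 0.2}

print(theta_hat)   # {'a': 0.5, 'b': 0.3, 'c': 0.2}
print(log_likelihood(theta_hat, data) > log_likelihood(other, data))   # True
```

Comparing against a single alternative does not prove optimality; it only shows the expected ordering for this example.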

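Likelihood function
For the Gaussian distribution, the maximum is attained when $\mu$ and $\sigma$ correspond to the empirical mean and variance of the training data:
$$\hat{\mu} = \frac{1}{M} \sum_m x[m], \qquad \hat{\sigma} = \sqrt{\frac{1}{M} \sum_m \big( x[m] - \hat{\mu} \big)^2}.$$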
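
A minimal sketch of the Gaussian estimates follows. The sample values and the function name `gaussian_mle` are made up for illustration. Note that the maximum likelihood estimate of the variance divides by M, not by M - 1 as the unbiased sample variance does.

```python
from math import sqrt

def gaussian_mle(xs):
    """Return the ML estimates (mu, sigma) for IID samples from a Gaussian."""
    m = len(xs)
    mu = sum(xs) / m                              # empirical mean
    var = sum((x - mu) ** 2 for x in xs) / m      # empirical variance (divide by M)
    return mu, sqrt(var)

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]     # made-up samples
mu_hat, sigma_hat = gaussian_mle(xs)

print(mu_hat, sigma_hat)   # 5.0 2.0
```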

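MLE for Bayesian networks
The simplest example of a nontrivial network structure is a network consisting of 2 binary variables, say X and Y, with an arc X → Y.
This network is parametrized by a parameter vector $\theta$, which defines the set of parameters for all the conditional probability distributions (CPDs) in the network.
In this case, $\theta_{x^1}$ and $\theta_{x^0}$ specify the probability of the 2 values of X; $\theta_{y^1|x^1}$ and $\theta_{y^0|x^1}$ specify the probability of Y given that $X = x^1$; and $\theta_{y^1|x^0}$ and $\theta_{y^0|x^0}$ specify the probability of Y given that $X = x^0$.
For brevity, we use the shorthand $\theta_{Y|x^0}$ to refer to the set $\{\theta_{y^1|x^0}, \theta_{y^0|x^0}\}$ and $\theta_{Y|X}$ to refer to $\theta_{Y|x^1} \cup \theta_{Y|x^0}$.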

MLE for Bayesian networks
In this example, every training instance is a tuple $\langle x[m], y[m] \rangle$ that describes a particular assignment to X and Y. Our likelihood function is
$$L(\theta : D) = \prod_m P(x[m], y[m] : \theta).$$
Our network model X → Y specifies that $P(X, Y : \theta)$ has a product form. Thus we can write
$$L(\theta : D) = \prod_m P(x[m] : \theta)\, P(y[m] \mid x[m] : \theta).$$

MLE for Bayesian networks
Exchanging the order of multiplication, we can equivalently write this term as
$$L(\theta : D) = \Big( \prod_m P(x[m] : \theta_X) \Big) \Big( \prod_m P(y[m] \mid x[m] : \theta_{Y|X}) \Big),$$
where $\theta_X = \{\theta_{x^1}, \theta_{x^0}\}$.
That is, the likelihood decomposes into 2 separate terms, one for each variable.
Each of these terms is a local likelihood function that measures how well the variable is predicted given its parents.
Each term depends only on the parameters for that variable's CPD.
The first term, $\prod_m P(x[m] : \theta_X)$, is identical to the multinomial likelihood we discussed earlier.

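MLE for Bayesian networks
The second term can be decomposed further:
$$\prod_m P(y[m] \mid x[m] : \theta_{Y|X}) = \prod_{m : x[m] = x^0} P(y[m] \mid x[m] : \theta_{Y|x^0}) \cdot \prod_{m : x[m] = x^1} P(y[m] \mid x[m] : \theta_{Y|x^1}).$$
Thus, in this example, the likelihood function decomposes into a product of terms, one for each group of parameters in $\theta$.
This property is called the decomposability of the likelihood function.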
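
The decomposition can be checked numerically on a small example. The sketch below uses made-up parameters and data for the network X → Y, with the binary values coded as 0 and 1; all names are assumptions. It compares the joint likelihood with the product of the three local terms.

```python
from math import isclose, prod

# Hypothetical parameters for the network X -> Y (values coded as 0 and 1).
theta_x = {0: 0.3, 1: 0.7}                  # P(X = x)
theta_y_given_x = {0: {0: 0.9, 1: 0.1},     # P(Y = y | X = 0)
                   1: {0: 0.4, 1: 0.6}}     # P(Y = y | X = 1)

# Made-up training instances (x[m], y[m]).
data = [(0, 0), (1, 1), (1, 0), (0, 0), (1, 1), (0, 1)]

# Joint likelihood: product over instances of P(x[m]) * P(y[m] | x[m]).
joint = prod(theta_x[x] * theta_y_given_x[x][y] for x, y in data)

# Local terms: one for X, and one for Y per value of its parent X.
local_x = prod(theta_x[x] for x, _ in data)
local_y_x0 = prod(theta_y_given_x[0][y] for x, y in data if x == 0)
local_y_x1 = prod(theta_y_given_x[1][y] for x, y in data if x == 1)

print(isclose(joint, local_x * local_y_x0 * local_y_x1))   # True
```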

MLE for Bayesian networks
We can make one more simplification by using the notion of sufficient statistics.
Let us consider one term in this expression:
$$\prod_{m : x[m] = x^0} P(y[m] \mid x[m] : \theta_{Y|x^0}).$$
Each of the individual terms can take one of 2 values, depending on the value of y[m].
If $y[m] = y^1$, it is equal to $\theta_{y^1|x^0}$. If $y[m] = y^0$, it is equal to $\theta_{y^0|x^0}$.
How many cases of each type do we get?

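MLE for Bayesian networks
Let us restrict attention to those data cases where $x[m] = x^0$. These again partition into 2 categories.
We get $\theta_{y^1|x^0}$ in those cases where $x[m] = x^0$ and $y[m] = y^1$. We use $M[x^0, y^1]$ to denote their number.
We get $\theta_{y^0|x^0}$ in those cases where $x[m] = x^0$ and $y[m] = y^0$. We use $M[x^0, y^0]$ to denote their number.
Thus the term in the equation is equal to
$$\prod_{m : x[m] = x^0} P(y[m] \mid x[m] : \theta_{Y|x^0}) = \theta_{y^1|x^0}^{M[x^0, y^1]} \cdot \theta_{y^0|x^0}^{M[x^0, y^0]}.$$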

MLE for Bayesian networks
Based on our discussion of the multinomial likelihood, we know that we maximize $\theta_{y^1|x^0}^{M[x^0, y^1]} \cdot \theta_{y^0|x^0}^{M[x^0, y^0]}$ by setting
$$\hat{\theta}_{y^1|x^0} = \frac{M[x^0, y^1]}{M[x^0, y^1] + M[x^0, y^0]} = \frac{M[x^0, y^1]}{M[x^0]}$$
and similarly for $\theta_{y^0|x^0}$.
Thus, we can find the maximum likelihood parameters in this CPD by simply counting how many times each of the possible assignments of X and Y appears in the training data.
It turns out that these counts of the various assignments for some set of variables are generally useful.
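
Estimation by counting can be sketched for the two-variable network X → Y as follows. The data set is made up, the binary values are coded as 0 and 1, and the function name `mle_x_to_y` is an assumption. The sketch also assumes that every value of X occurs at least once; otherwise M[x] = 0 and the ratio is undefined.

```python
from collections import Counter

def mle_x_to_y(data, values=(0, 1)):
    """ML estimates for the network X -> Y from a list of (x, y) tuples."""
    m = len(data)
    m_x = Counter(x for x, _ in data)    # counts M[x]
    m_xy = Counter(data)                 # counts M[x, y]
    theta_x = {x: m_x[x] / m for x in values}
    # Keyed as (y, x), read "y given x": theta_{y|x} = M[x, y] / M[x].
    theta_y_given_x = {
        (y, x): m_xy[(x, y)] / m_x[x]
        for x in values for y in values
    }
    return theta_x, theta_y_given_x

# Made-up data: M[x0] = 4 with M[x0, y1] = 3, and M[x1] = 4 with M[x1, y1] = 1.
data = [(0, 0), (0, 1), (0, 1), (0, 1), (1, 0), (1, 0), (1, 1), (1, 0)]
theta_x, theta_y_given_x = mle_x_to_y(data)

print(theta_x)                    # {0: 0.5, 1: 0.5}
print(theta_y_given_x[(1, 0)])    # 0.75, the estimate of theta_{y1|x0}
print(theta_y_given_x[(1, 1)])    # 0.25, the estimate of theta_{y1|x1}
```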

MLE for Bayesian networks
Definition: Let Z be some set of random variables, and z be some instantiation of these random variables. Let D be a data set.
We define M[z] to be the number of entries in D that have Z[m] = z.
This approach can be extended to general Bayesian networks.
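
The count M[z] can be computed with a few lines of code. In this sketch each data instance is represented as a dictionary from variable names to values, and z is a partial assignment in the same form; this representation and the function name `count` are assumptions made for illustration.

```python
def count(data, assignment):
    """M[z]: number of instances in D that agree with the assignment z."""
    return sum(
        all(row[var] == val for var, val in assignment.items())
        for row in data
    )

# Made-up data set over two variables X and Y.
data = [
    {"X": 0, "Y": 1},
    {"X": 0, "Y": 0},
    {"X": 1, "Y": 1},
    {"X": 0, "Y": 1},
]

print(count(data, {"X": 0}))           # 3
print(count(data, {"X": 0, "Y": 1}))   # 2
```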

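Global likelihood decomposition
We start by examining the likelihood function of a Bayesian network.
Suppose we want to learn the parameters for a BN with structure G and parameters $\theta$. We are also given a data set D consisting of samples $\xi[1], \dots, \xi[M]$.
Writing the likelihood and repeating the previous steps gives
$$L(\theta : D) = \prod_m P_G(\xi[m] : \theta) = \prod_m \prod_i P(x_i[m] \mid pa_{X_i}[m] : \theta) = \prod_i \Big[ \prod_m P(x_i[m] \mid pa_{X_i}[m] : \theta) \Big].$$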

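Global likelihood decomposition
Each of the terms in the square brackets refers to the conditional likelihood of a particular variable given its parents in the network.
We use $\theta_{X_i | Pa_{X_i}}$ to denote the subset of parameters that determines $P(X_i \mid Pa_{X_i})$ in our model. Then we can write
$$L(\theta : D) = \prod_i L_i(\theta_{X_i | Pa_{X_i}} : D),$$
where the local likelihood function for $X_i$ is
$$L_i(\theta_{X_i | Pa_{X_i}} : D) = \prod_m P(x_i[m] \mid pa_{X_i}[m] : \theta_{X_i | Pa_{X_i}}).$$
This form is particularly useful when the parameters are disjoint, that is, when each CPD is parametrized by a separate set of parameters that do not overlap.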

Global likelihood decomposition
The previous analysis showed that the likelihood decomposes as a product of independent terms, one for each CPD in the network.
This important property is called the global decomposition of the likelihood function.
Proposition: Let D be a complete data set for $X_1, \dots, X_n$, let G be a network structure over these variables, and suppose that the parameters $\theta_{X_i | Pa_{X_i}}$ are disjoint from $\theta_{X_j | Pa_{X_j}}$ for all $j \neq i$.
Let $\hat{\theta}_{X_i | Pa_{X_i}}$ be the parameters that maximize $L_i(\theta_{X_i | Pa_{X_i}} : D)$. Then $\hat{\theta} = \langle \hat{\theta}_{X_1 | Pa_{X_1}}, \dots, \hat{\theta}_{X_n | Pa_{X_n}} \rangle$ maximizes $L(\theta : D)$.
In other words, we can maximize each local likelihood function independently of the rest of the network, and then combine the solutions to get an MLE solution.

Table-CPDs
We now consider the simplest parametrization of the CPDs: a table-CPD.
Suppose we have a variable X with parents U. If we represent the CPD $P(X \mid U)$ as a table, then we will have a parameter $\theta_{x|u}$ for each combination of $x \in Val(X)$ and $u \in Val(U)$.
In this case we can write the local likelihood function as
$$L_X(\theta_{X|U} : D) = \prod_m \theta_{x[m] | u[m]} = \prod_{u \in Val(U)} \Big[ \prod_{x \in Val(X)} \theta_{x|u}^{M[u, x]} \Big],$$
where M[u, x] is the number of times x[m] = x and u[m] = u.

Table-CPDs
We need to maximize this term under the constraints that, for each choice of value for the parents U, the conditional probability is legal, that is
$$\sum_{x} \theta_{x|u} = 1 \quad \text{for all } u.$$
We can maximize each of the terms in square brackets in the previous equation independently.
We can therefore further decompose the local likelihood function for a tabular CPD into a product of simple likelihood functions that are each multinomial likelihoods.
The counts in the data for the different outcomes x are $[M[u, x] : x \in Val(X)]$.
The MLE parameters turn out to be
$$\hat{\theta}_{x|u} = \frac{M[u, x]}{M[u]}, \quad \text{where } M[u] = \sum_x M[u, x].$$
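
The table-CPD estimate can be sketched for a variable with an arbitrary set of parents. In this sketch each data instance is a dictionary from variable names to values; the function name `table_cpd_mle`, the variable names A, B, C, and the data are assumptions made for illustration. Parent assignments u that never occur in the data have M[u] = 0 and are simply left out of the result.

```python
from collections import Counter

def table_cpd_mle(data, child, parents):
    """ML estimates theta_{x|u} = M[u, x] / M[u] for a table-CPD P(child | parents)."""
    m_ux = Counter()   # counts M[u, x]
    m_u = Counter()    # counts M[u]
    for row in data:
        u = tuple(row[p] for p in parents)
        m_ux[(u, row[child])] += 1
        m_u[u] += 1
    # Keyed as (x, u), read "x given u".
    return {(x, u): c / m_u[u] for (u, x), c in m_ux.items()}

# Made-up data for a variable C with parents A and B.
data = [
    {"A": 0, "B": 0, "C": 1},
    {"A": 0, "B": 0, "C": 0},
    {"A": 0, "B": 0, "C": 1},
    {"A": 0, "B": 0, "C": 1},
    {"A": 1, "B": 0, "C": 0},
    {"A": 1, "B": 0, "C": 0},
]
cpd = table_cpd_mle(data, child="C", parents=("A", "B"))

print(cpd[(1, (0, 0))])   # 0.75
print(cpd[(0, (0, 0))])   # 0.25
print(cpd[(0, (1, 0))])   # 1.0
```

By the global decomposition of the likelihood, calling such a function once per variable, each with its own parent set, yields ML estimates for all CPDs of a network with table-CPDs and disjoint parameters.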
