Personalisation
Seminar on Unlocking the Secrets of the Past: Text Mining for Historical Documents
Sven Steudter
Domain • Museums offer a vast amount of information • But: visitors' receptivity and time are limited • Challenge: selecting (subjectively) interesting exhibits • Idea: a mobile, electronic handheld device, such as a PDA, assists the visitor by: • Delivering content based on observations of the visit • Recommending exhibits • Non-intrusive, adaptive user-modelling technologies are used
Prediction Stimuli • Different stimuli: • Physical proximity of exhibits • Conceptual similarity (based on the textual description of an exhibit) • Relative sequence in which other visitors visited the exhibits (popularity) • To evaluate the relative impact of the different factors, the stimuli are treated separately • Language-based models simulate the visitor's thought process
Experimental Setup • Melbourne Museum, Australia • Largest museum in the Southern Hemisphere • Restricted to the Australia Gallery collection, which presents the history of the city of Melbourne: • Phar Lap • CSIRAC • Variation of exhibits: they cannot be classified into a single category
Experimental Setup • Wide range of modalities: • Information plaques • Audio-visual enhancements • Multiple displays interacting with the visitor • Here: NO differentiation between exhibit types or modalities • The Australia Gallery collection consists of 53 exhibits • Topology of the floor: open-plan design => no sequence predetermined by the architecture
Resources • Floor plan of the exhibition, located on the 2nd floor • Physical distances between the exhibits • The Melbourne Museum website provides a corresponding web page for every exhibit • Dataset of 60 visitor paths through the gallery, used for: • Training (machine learning) • Evaluation
Predictions Based on Proximity and Popularity • Proximity-based predictions: • Exhibits ranked in order of physical distance • Prediction: the closest not-yet-visited exhibit to the visitor's current location • Serves as the baseline in the evaluation • Popularity-based predictions: • Visitor paths provided by the Melbourne Museum • Paths converted into a matrix of transition probabilities • Zero probabilities removed with Laplacian smoothing • Markov model
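The popularity-based model above can be sketched as a first-order Markov model over exhibit transitions, with Laplace (add-one) smoothing to remove zero probabilities. This is a minimal illustration, not the authors' implementation; the function names and the `alpha` parameter are assumptions.

```python
from collections import defaultdict

def transition_probabilities(paths, exhibits, alpha=1.0):
    """Build a first-order Markov model of exhibit-to-exhibit transitions.

    Laplace (add-alpha) smoothing ensures that transitions never seen in
    the training paths still receive a small non-zero probability.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for path in paths:
        for src, dst in zip(path, path[1:]):
            counts[src][dst] += 1
    probs = {}
    for src in exhibits:
        total = sum(counts[src].values()) + alpha * len(exhibits)
        probs[src] = {dst: (counts[src][dst] + alpha) / total
                      for dst in exhibits}
    return probs

def predict_next(probs, current, visited):
    """Most probable not-yet-visited successor of the current exhibit."""
    candidates = {e: p for e, p in probs[current].items() if e not in visited}
    return max(candidates, key=candidates.get)
```

Each row of the smoothed matrix sums to 1, and prediction simply picks the highest-probability exhibit the visitor has not yet seen.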
Text-based Prediction • Exhibits are related to each other by their information content • Every exhibit's web page consists of: • A body of text describing the exhibit • A set of attribute keywords • Prediction of the most similar exhibit: • Keywords used as queries • Web pages as the document space • Simple term frequency–inverse document frequency (tf-idf) • The score of each query over each document is normalised
WSD • Why do visitors make connections between exhibits? • Multiple similarities between exhibits are possible • Use of Word Sense Disambiguation: • The path of a visitor is treated as a sentence of exhibits • Each exhibit in the sentence has an associated meaning • Determine the meaning of the next exhibit • For each word in the keyword set of each exhibit: • The WordNet similarity is calculated against each word in the other exhibits' keyword sets
WordNet Similarity • Similarity methods used: • Lin (measures the difference in information content of two terms as a function of the probability of occurrence in a corpus) • Leacock-Chodorow (edge counting: a function of the length of the path linking the terms and the position of the terms in the taxonomy) • Banerjee-Pedersen (Lesk algorithm) • The similarity between two exhibits is the sum of WordNet similarities between each pair of keywords • The visitor's history may be important for prediction • The most recently visited exhibits have a higher impact on the visitor than the first visited exhibits
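The exhibit-level similarity and the recency-weighted history can be sketched as follows. `word_similarity` stands in for a WordNet measure (Lin, Leacock-Chodorow, or Lesk), which would come from a WordNet library; the geometric decay scheme is an assumption for illustration, not taken from the talk.

```python
def exhibit_similarity(keywords_a, keywords_b, word_similarity):
    """Sum of pairwise word similarities between two exhibits' keyword sets.

    `word_similarity(w1, w2)` is a placeholder for a WordNet measure
    such as Lin, Leacock-Chodorow, or Banerjee-Pedersen (Lesk).
    """
    return sum(word_similarity(a, b) for a in keywords_a for b in keywords_b)

def history_score(candidate_keywords, visited_keyword_sets,
                  word_similarity, decay=0.5):
    """Recency-weighted score of a candidate exhibit against the visit history.

    The most recently visited exhibit gets weight 1, the one before it
    `decay`, then `decay**2`, and so on (an assumed weighting scheme).
    """
    score = 0.0
    for age, visited in enumerate(reversed(visited_keyword_sets)):
        score += decay ** age * exhibit_similarity(
            candidate_keywords, visited, word_similarity)
    return score
```

With any concrete `word_similarity`, recent exhibits dominate the score while early ones are discounted, matching the intuition on the slide.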
Evaluation: Method • For each method, two tests: • Predict the next exhibit in the visitor's path • Restrict predictions: predict only if the prediction is above a threshold • Evaluation data: the aforementioned 60 visitor paths • 60-fold cross-validation is used; for the popularity model: • 59 visitor paths as training data • The 1 remaining path is used for evaluation • This is repeated for all 60 paths • The results are combined into a single estimate (e.g. an average)
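The 60-fold scheme above is leave-one-out cross-validation over visitor paths. A minimal sketch, with `train` and `evaluate` as placeholders for the prediction methods described earlier:

```python
def leave_one_out(paths, train, evaluate):
    """Leave-one-out (here: 60-fold) cross-validation over visitor paths.

    `train(training_paths)` returns a fitted model;
    `evaluate(model, held_out_path)` returns a score in [0, 1].
    Both are placeholders for the prediction methods under test.
    """
    scores = []
    for i in range(len(paths)):
        training = paths[:i] + paths[i + 1:]   # 59 paths for training
        held_out = paths[i]                    # 1 path for evaluation
        model = train(training)
        scores.append(evaluate(model, held_out))
    return sum(scores) / len(scores)           # single combined estimate
```

Each path serves exactly once as the held-out test case, and the per-fold scores are averaged into one estimate.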
Evaluation • Accuracy: percentage of times the event that occurred was predicted with the highest probability • BOE (Bag of Exhibits): percentage of exhibits visited by the visitor, not necessarily in the order of recommendation • BOE is, in this case, identical to precision • Single-exhibit history
Evaluation • Single-exhibit history (results table: without threshold vs. with threshold)
Evaluation • Visitor-history-enhanced vs. single-exhibit history (results table)
Conclusion • Best-performing method: popularity-based prediction • The history-enhanced models performed poorly; possible reasons: • Visitors had no preconceived task in mind • They moved from one impressive exhibit to the next • History is not relevant here; the current location is more important • Keep in mind: • Small dataset • The Melbourne Gallery (history of the city) was perhaps not a good choice
tf-idf • Term frequency – inverse document frequency • Term count: the number of times a given term appears in a document • n_{i,j}: the number of occurrences of term t_i in document d_j • In larger documents a term is more likely to occur, therefore normalise: tf_{i,j} = n_{i,j} / Σ_k n_{k,j} • The inverse document frequency, idf, measures the general importance of a term: • The total number of documents, • divided by the number of documents containing the term: idf_i = log(|D| / |{d : t_i ∈ d}|)
tf-idf: Similarity • A vector space model is used • Documents and queries are represented as vectors • Each dimension corresponds to a term • tf-idf is used for weighting • The angle between the query and the document vectors is compared (cosine similarity)
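The vector space model above can be sketched in a few lines: length-normalised term frequency times log inverse document frequency, compared via the cosine of the angle between sparse vectors. A minimal illustration, not the talk's implementation.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """tf-idf weighting over tokenised documents (lists of terms).

    tf is normalised by document length; idf is log(N / df).
    """
    n_docs = len(documents)
    df = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Documents sharing weighted terms score close to 1; documents with disjoint vocabularies score 0. Note that a term occurring in every document gets idf = 0 and drops out entirely, which a real system might smooth.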
WordNet Similarities • Lin: • A method to compute the semantic relatedness of word senses using the information content of the concepts in WordNet and the 'Similarity Theorem' • Leacock-Chodorow: • Counts the number of edges between the senses in the 'is-a' hierarchy of WordNet • The value is then scaled by the maximum depth of the WordNet 'is-a' hierarchy • Banerjee-Pedersen (Lesk): • Chooses pairs of ambiguous words within a neighbourhood • Checks their definitions in a dictionary • Chooses the senses so as to maximise the number of common terms in the definitions of the chosen words
Precision, Recall Precision: percentage of retrieved documents that are relevant, i.e. relevant documents retrieved with respect to the total number of documents retrieved. Recall: percentage of relevant documents retrieved with respect to the total number of relevant documents in the data space.
F-Score • The F-score combines precision and recall • It is the harmonic mean of precision and recall: F = 2 · P · R / (P + R)
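The three metrics above can be computed together in one small function; a minimal sketch over sets of retrieved and relevant items:

```python
def precision_recall_f1(retrieved, relevant):
    """Precision, recall, and their harmonic mean (F1) for a retrieval result."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Being a harmonic mean, F1 is dragged toward the smaller of the two values, so a system cannot score well by maximising only precision or only recall.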