Attempting to Use Wikipedia Categories to Improve Retrieval
INEX Linked Data Ad Hoc Track 2012
David Massey
ABI Kveik, 1st March 2013
Task Description
• 3.2M documents from English-language Wikipedia
• 140 queries
• Return a ranked list of 1000 documents for each query
• Use Linked Data
Document Collection
• Approx. 30% of the files describe deleted files, images, etc.
• XML-like documents - Regex
• Missing documents
• Each document consists of three parts:
  • Wikipedia article
  • DBpedia properties
  • Yago properties
The_Deer_Hunter
<lodxml xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance' xml:lang='en' xmlns:xhtml='http://www.w3.org/1999/xhtml' encoding='UTF-8'>
  <article title='73425'>
    <wikipedia>
      <paragraph>
        <template type='Metadata'>
          <arg></arg>
          <tag name='id'>73425</tag>
          <tag name='title'>The_Deer_Hunter</tag>
        </template>
        <template type='Otheruses'>
          <arg></arg>
          <arg>Deer Hunter (disambiguation)</arg>
        </template>
        <infobox type='film'>
          <tag name='name'>The Deer Hunter</tag>
          ...
          <tag name='director'>
            <link>
              <wikilink href='./f4/05/522346.xml'>Michael Cimino</wikilink>
              <dbpedia href='http://dbpedia.org/resource/Michael_Cimino'></dbpedia>
              <yago ref='Michael_Cimino'></yago>
            </link>
          </tag>
    <dbpediaproperties>
      <property name='http://dbpedia.org/ontology/thumbnail'>
        <object name='http://upload.wikimedia.org/wikipedia/commons/thumb/5/57/The_Deer_Hunter_poster.jpg/200px-The_Deer_Hunter_poster.jpg'></object>
      </property>
      ...
    <yagoproperties>
      <property name='hasDuration'>
        <object name='10920.0#s'></object>
      </property>
      <property name='isCalled'>
        <object name='A szarvasvad\u00e1sz'></object>
      </property>
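Because the files are only XML-like, a regular expression can pull out fields without a strict XML parser. The following is a minimal sketch, assuming the `<tag name='...'>` layout shown in the sample document; the function name `extract_metadata` is hypothetical and the talk does not show the actual expressions that were used.

```python
import re

# Matches <tag name='id'>...</tag> and <tag name='title'>...</tag> as they
# appear in the sample document (single-quoted attribute, no nested markup).
TAG_RE = re.compile(r"<tag name='(id|title)'>([^<]*)</tag>")


def extract_metadata(text):
    """Return the first id and title found in an XML-like document string."""
    fields = {}
    for name, value in TAG_RE.findall(text):
        # Keep only the first occurrence of each field.
        fields.setdefault(name, value.strip())
    return fields


sample = (
    "<template type='Metadata'> <arg></arg> "
    "<tag name='id'>73425</tag> "
    "<tag name='title'>The_Deer_Hunter</tag> </template>"
)
print(extract_metadata(sample))
```

A regex of this kind tolerates unclosed or missing elements, which a conforming XML parser would reject.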
Queries
guitar chord tuning
guitar chord minor
guitar classical flamenco
guitar classical bach
guitar origin Russia
guitar origin blues
tango culture movies
tango culture countries
tango music composers
tango music instruments
tango dance styles
tango dance history
vietnam war movie
vietnam war facts
vietnam food recipes
vietnamese food blog
vietnam travel national park
vietnam travel airports
bicycle sport races
bicycle sport disciplines
bicycle holiday nature
bicycle benefits health
bicycle benefits environment
female rock singers
south korean girl groups
electronic music genres
digital music notation formats
music conferences
intellectual property rights lobby
Two-stage approach
• Traditional retrieval
• Improve by:
  • Using links between documents
  • Using categories
  • Using Linked Data
Stage One
• Extract title, headings and categories from documents
• Index using Indri: Krovetz stemming, stopword list
• Weighted search: Title (10), Category (5), H2 (2), H3 (1)
• Smoothing (ask Michael)
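One way to express the weighted field search is with the `#weight` and `#combine` operators of the Indri query language, restricting each term to a field with the `term.field` form. This is a sketch under assumptions: the slides give only the weights, so the field names (`title`, `category`, `h2`, `h3`), the helper `build_weighted_query`, and the choice of operators are all hypothetical.

```python
# Field weights from the Stage One slide; the field names are assumed.
FIELD_WEIGHTS = {"title": 10, "category": 5, "h2": 2, "h3": 1}


def build_weighted_query(query, weights=FIELD_WEIGHTS):
    """Build an Indri-style query string that weights each field separately."""
    terms = query.lower().split()
    parts = []
    for field, weight in weights.items():
        # Restrict every query term to the current field, e.g. "war.title".
        restricted = " ".join(f"{term}.{field}" for term in terms)
        parts.append(f"{weight:.1f} #combine({restricted})")
    return "#weight(" + " ".join(parts) + ")"


print(build_weighted_query("vietnam war movie"))
```

Generating the string in code keeps the weights in one place, so they can be tuned without editing each query by hand.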
Result After Stage One
Query: Vietnam War Movie
Vietnam_War_Crimes_Working_Group
Vietnam_War_in_film
Operation_Sunrise_(Vietnam_War)
Vietnam_War_Story_II
Book:Vietnam_War
Vietnam_during_World_War_I
Vietnam_War_casualties
Vietnam_War_Crimes_Working_Group_Files
Puerto_Ricans_Missing_in_Action_in_the_Vietnam_War
Star_Wars_Mini_Movie_Awards
Vietnam_War_Memorial,_Hanoi
17th_Parallel:_Vietnam_in_War
Vietnam:_The_Camera_At_War
March_Against_the_Vietnam_War
1960_in_the_Vietnam_War
1961_in_the_Vietnam_War
List_of_Vietnam_War_flying_aces
List_of_wars_involving_Vietnam
Outline_of_the_Vietnam_War
The_War_Within:_America's_Battle_over_Vietnam
Vietnam:_The_Ten_Thousand_Day_War
Protests_against_the_Vietnam_War
Matterhorn:_A_Novel_of_the_Vietnam_War
List_of_bombs_in_the_Vietnam_War
Puerto_Ricans_in_the_Vietnam_War
Military_history_of_Australia_during_the_Vietnam_War
Stage Two
• Links between documents
• Categories
• Linked Data
• …
Expand Query with WordNet Synonyms
Original query: vietnam war movie
vietnam -> annam
war -> warfare
movie -> film flick pic picture
Expanded query: vietnam annam war warfare film flick movie pic picture
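The expansion step can be sketched as follows. To stay self-contained, the sketch uses a hard-coded synonym table copied from the slide instead of querying WordNet; `SYNONYMS` and `expand_query` are hypothetical names, and a real run would look the synonyms up in WordNet.

```python
# Synonyms copied from the slide; a real system would look them up in WordNet.
SYNONYMS = {
    "vietnam": ["annam"],
    "war": ["warfare"],
    "movie": ["film", "flick", "pic", "picture"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Return the query terms followed by their synonyms, without duplicates."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term] + synonyms.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(" ".join(expand_query("vietnam war movie")))
# vietnam annam war warfare movie film flick pic picture
```

The output contains the same nine terms as the expanded query on the slide, in a slightly different order.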
Calculate Text Similarity between Expanded Query and Category Name
Levenshtein distance: "The smallest number of insertions, deletions, and substitutions required to change one string or tree into another." NIST (http://xlinux.nist.gov/dads/HTML/Levenshtein.html)
Original query: Vietnam War Movie
Expanded query: Vietnam Annam War Warfare Film Flick Movie Pic Picture
Category: Vietnam War Films = (1 + (0*0 + 0*0 + 1*1)) / 3 = 2 / 3 ≈ 0.66
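The slides give worked examples rather than a formula. The sketch below assumes the score is one plus the sum of squared distances, divided by the number of words in the category name, where each word's distance is its smallest Levenshtein distance to any expanded query term (case-insensitive). This reading is inferred from the worked examples, and the names `levenshtein` and `category_score` are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance using insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]


def category_score(category, expanded_terms):
    """Lower is better: (1 + sum of squared minimum distances) / word count."""
    words = category.lower().split()
    terms = [term.lower() for term in expanded_terms]
    total = sum(min(levenshtein(w, t) for t in terms) ** 2 for w in words)
    return (1 + total) / len(words)


EXPANDED = ["vietnam", "annam", "war", "warfare", "film",
            "flick", "movie", "pic", "picture"]
print(category_score("Vietnam War films", EXPANDED))
print(category_score("Star Wars", EXPANDED))
```

Under this assumption the sketch gives 2/3 for "Vietnam War films" (shown truncated as 0.66 on the slide) and 3 for "Star Wars", matching the values in the talk.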
Categories ranked by similarity to expanded query
Query: Vietnam War Movie
Threshold < 1
0.66  Vietnam War films
1     War films
2.33  Star Wars films
2.75  Star Wars fan films
3     Fan films
3     Punic Wars
3     Star Wars
3.66  Gulf War films
3.75  World War I films
5.5   Barbary Wars
5.5   Boer Wars
5.5   Civil wars
5.5   Guild Wars
5.5   Opium Wars
5.66  Vietnam War books
5.66  Vietnam War novels
5.66  Vietnam War sites
5.75  World War II media
6.33  Flags of Vietnam
6.33  Laws of war
6.33  Media of Vietnam
6.33  MTV Movie Awards
6.8   Women in World War I
7     Floods in Vietnam
7     Songs of the Vietnam War
7.33  Star Wars books
7.33  Star Wars comics
7.5   War crimes in Vietnam
7.5   World War I games
7.5   World War II comics
Problems
• Expanded Query: Vietnam Annam War Warfare Film Flick Movie Pic Picture
  Category: Star Wars = (1 + (2*2 + 1*1)) / 2 = 6 / 2 = 3
• Frosker (frogs) = Forsker (researcher)?
• Homonyms: Vietnam War Picture
• Missing categories
Result After Stage Two
Query: Vietnam War Movie
We_Were_Soldiers
A_Better_Tomorrow_3
The_War_(film)
Faith_of_My_Fathers_(film)
Combat_Shock
The_Killing_Fields_(film)
A_Bright_Shining_Lie
Apocalypse_Now_Redux
Flight_of_the_Intruder
Dead_Presidents
R-Point
The_Last_Hunter
There_Is_No_13
Deceit_(2009_film)
The_Ballad_of_Andy_Crocker
Some_Kind_of_Hero
The_Deer_Hunter
A_Rumor_of_War_(miniseries)
Platoon_(film)
The_Crazy_World_of_Julius_Vrooder
Thou_Shalt_Not_Kill..._Except
1969_(film)
A.W.O.L._(2006_film)
The_Siege_of_Firebase_Gloria
Alamo_Bay
Rolling_Thunder_(film)
Original Query                        Result Stage 1   Result Stage 2
vietnam war movie                     '11-1------'     '11--------'
vietnam war facts                     '11--11111-'     '11--11111-'
vietnam food recipes                  '------1---'     '------1---'
vietnamese food blog                  '----------'     '----------'
vietnam travel national park          '1111111111'     '1111111111'
vietnam travel airports               '-1-111----'     '-1-11-1---'
guitar chord tuning                   '111111111-'     '111111111-'
guitar chord minor                    '11111--11-'     '11111--11-'
guitar classical flamenco             '----1-----'     '----1-----'
guitar classical bach                 '1-11--11--'     '1-11--11--'
guitar origin Russia                  '----------'     '----------'
guitar origin blues                   '1-1-------'     '1-1-------'
tango culture movies                  '---1--1---'     '---1--1---'
tango culture countries               '---1-1---1'     '---1-1---1'
tango music composers                 '-1--------'     '---1-1---1'
tango music instruments               '----------'     '----------'
tango dance styles                    '11--------'     '----------'
tango dance history                   '111-------'     '111-------'
bicycle sport races                   '111-1--1--'     '---1111-1-'
bicycle sport disciplines             '----1-----'     '----1-----'
bicycle holiday nature                '----------'     '----------'
bicycle benefits health               '------1---'     '------1---'
bicycle benefits environment          '---------1'     '---------1'
female rock singers                   '1-------1-'     '1-1--1--1-'
south korean girl groups              '----------'     '111111-111'
electronic music genres               '1-1-1-----'     '-1-1------'
digital music notation formats        '-111-1111-'     '-111-1111-'
music conferences                     '11111-1---'     '--11111-1-'
intellectual property rights lobby    '111-1-1111'     '111-1-1111'
Literature
Kaptein, R., Koolen, M., & Kamps, J. (2009, July). Using Wikipedia categories for ad hoc search. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval (pp. 824-825). ACM.
Vercoustre, A. M., Pehcevski, J., & Thom, J. (2008). Using Wikipedia categories and links in entity ranking. Focused Access to XML Documents, 321-335.
Illustration
http://www.flickr.com/photos/pasukaru76/6196321318/