Inducing Ontologies from Folksonomies using Natural Language Understanding Marta Tatu, Dan Moldovan Lymba Corporation Presenter: Chris Irwin Davis
Overview (diagram labels: Folksonomy, NLP, Ontology)
• lexical normalization of tags • semantic consistency • tag-tag relations • folksonomy-based applications • reasoning applications
• typographical errors, spelling variations • singular/plural forms, lower case • space/punctuation used as delimiters • same tag in different contexts • tag synonymy • social annotations (author vs. user) • browse/search bookmarks • resource discovery (recommendations) • collaborative tagging (across folksonomies)
LREC 2010 May 19th, 2010
Semantic Approach • Folksonomy semantic representation • Tag understanding • Lexical: language identification, tokenization and spelling corrections, capitalization restoration • Syntactic: part-of-speech tagging, syntactic parsing • Semantic: acronym understanding, word sense disambiguation, named entity recognition, semantic parsing • Deriving the ontological structure • Semantic relations between tags • Sources of information • Tag text semantics • Social bookmarking annotations • Machine understanding of bookmark content
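The lexical step (tokenization, delimiter handling, lower-casing) can be illustrated with a minimal sketch. The function name `split_tag` and the regular expressions are assumptions made for illustration; the slides do not describe the actual tokenizer. Tags written without any delimiter, such as americanhistory, would need a dictionary-based segmenter, which this sketch does not attempt.

```python
import re
from typing import List


def split_tag(tag: str) -> List[str]:
    """Split a raw tag on delimiters and camelCase boundaries (illustrative only)."""
    # Insert a space at each lower-to-upper transition: "AfterEffects" -> "After Effects"
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", tag)
    # Treat any run of non-alphanumeric characters as a delimiter
    parts = re.split(r"[^A-Za-z0-9]+", spaced)
    # Lower-case the pieces and drop empty strings
    return [p.lower() for p in parts if p]


print(split_tag("read-now"))
print(split_tag("AfterEffects"))
print(split_tag("operating.system"))
```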
Representing Folksonomies • knowledge → knowledge[NN]1 • advertisign → advertising[NN]1 • americanhistory → American[JJ]1 TOPIC history[NN]2 • read-now → now[RB]3 TEMPORAL read[VB]1
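One way to hold such a representation in code is sketched below. The class names `Token` and `TagRepresentation` and the `render` method are hypothetical; only the lemma[POS]sense notation and the TOPIC/TEMPORAL relation labels come from the slide.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Token:
    lemma: str
    pos: str    # part-of-speech label as shown on the slide, e.g. NN, JJ, VB, RB
    sense: int  # word sense number

    def __str__(self) -> str:
        return f"{self.lemma}[{self.pos}]{self.sense}"


@dataclass
class TagRepresentation:
    raw: str                 # the tag as typed by the user
    tokens: List[Token]      # tokens after spelling correction and segmentation
    # Each relation is (label, index of first token, index of second token)
    relations: List[Tuple[str, int, int]] = field(default_factory=list)

    def render(self) -> str:
        if not self.relations:
            return " ".join(str(t) for t in self.tokens)
        return "; ".join(
            f"{self.tokens[a]} {rel} {self.tokens[b]}" for rel, a, b in self.relations
        )


history = TagRepresentation(
    raw="americanhistory",
    tokens=[Token("American", "JJ", 1), Token("history", "NN", 2)],
    relations=[("TOPIC", 0, 1)],
)
print(history.render())
```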
Representing Folksonomies Associated (user, document) pairs
Representing Folksonomies
System Architecture
Tag Understanding
Acronym/Abbreviation Understanding • Abbreviation dictionary: (abbreviation - expansion - domain of usage) • 118,055 distinct abbreviations • 137 domains: Law, Music, TV/Radio Stations, Countries, Airport, Domain Names, Chat, Emoticons, etc. • 25% of the abbreviations have more than one definition • (unambiguous) Zip codes – (76012 : Arlington, TX) • (ambiguous) SS : 192 definitions in 66 domains • Social Security – Business and US Government, Screen Saver – File Extensions, Stainless Steel – Housing and Products, Subtropical Storm – Meteorology, Style Sheet – Software • Check whether the tag is in the abbreviation dictionary • Use lexical chains to link document content to abbreviation domain • Use co-occurring tags to identify correct expansion • Use text alignment to find new abbreviation definitions within document content
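The last step, finding new abbreviation definitions in document content, might look roughly like the sketch below. It assumes a very simple form of alignment (consecutive words whose initials spell the abbreviation); the slides do not specify the alignment method, and `find_expansions` is a hypothetical name. A real system would need to filter coincidental matches.

```python
import re
from typing import List


def find_expansions(abbreviation: str, content: str) -> List[str]:
    """Return phrases in content whose word initials spell the abbreviation."""
    letters = abbreviation.lower()
    words = re.findall(r"[A-Za-z]+", content)
    n = len(letters)
    found: List[str] = []
    # Slide a window of n words over the text and compare initials
    for i in range(len(words) - n + 1):
        window = words[i:i + n]
        if all(w[0].lower() == c for w, c in zip(window, letters)):
            phrase = " ".join(window).lower()
            if phrase not in found:
                found.append(phrase)
    return found


# Hypothetical document text, for illustration only
text = "Our public relations team issued a statement. Public Relations matters."
print(find_expansions("PR", text))
```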
Acronym/Abbreviation Understanding • “PR” ~ 1409 documents • 87 definitions for PR • Press Release, Public Relations, Puerto Rico, Page Rank, Public Radio, Permanent Resident/Residency, etc. • http://prsarahevans.com/2009/06/do-you-have-a-strategy-for-online-comments • “PR” = “public relations” (6 times in document content) • Other tags of the bookmark: “public”, “relations”, “media”, “strategy” • http://www.bbc.co.uk/pressoffice/pressreleases/category/new_media_index.shtml • “PR” = “press releases” (in document content) • http://escape.topuertorico.com • “PR” = “Puerto Rico” (in document content)
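The “PR” example combines two signals: mentions of a candidate expansion in the document content and overlap with the bookmark's other tags. A minimal sketch of that idea follows. The equal weighting of the two signals, the function names, and the sample document text are assumptions, not the system's actual scoring.

```python
from typing import Dict, Iterable, Optional


def score_expansions(
    expansions: Iterable[str], content: str, other_tags: Iterable[str]
) -> Dict[str, int]:
    """Score each candidate by content mentions plus overlap with co-occurring tags."""
    text = content.lower()
    tags = {t.lower() for t in other_tags}
    scores: Dict[str, int] = {}
    for expansion in expansions:
        phrase = expansion.lower()
        mentions = text.count(phrase)                             # occurrences in content
        tag_hits = sum(1 for w in phrase.split() if w in tags)    # words matching other tags
        scores[expansion] = mentions + tag_hits                   # equal weights: an assumption
    return scores


def best_expansion(scores: Dict[str, int]) -> Optional[str]:
    """Return the top-scoring expansion, or None when there is no evidence at all."""
    if not scores or max(scores.values()) == 0:
        return None
    return max(scores, key=scores.get)


candidates = ["Press Release", "Public Relations", "Puerto Rico", "Page Rank", "Public Radio"]
# Hypothetical content; the tags are the ones listed on the slide
content = "Public relations teams need a strategy for online comments. Good public relations starts with listening."
scores = score_expansions(candidates, content, ["public", "relations", "media", "strategy"])
print(best_expansion(scores))
```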
Evaluation • Experimental data • ~150,000 (user, document, tag) triples from del.icio.us • 8,460 tags; 83,827 documents; 58,198 users • Main error source: tag cannot be identified within document • Lack of document content (images, non-EN content, etc.) • Errors propagate from initial processing steps to later ones • Bad capitalization leads to bad named entity recognition
Ontological Tag-Tag Relations • EQUALITY relations • same lemma, part-of-speech, and sense number • EQ(activity, activities), EQ(after-effects, AfterEffects), EQ(opinion, Opnion), etc. • SYNONYMY clusters • Same synset id • SYN(OS, operating.system), SYN(LA, losangeles), SYN(nyt, nytimes) • ISA relations between named entities and type tags • ISA(OracleCorporation, organization), ISA(davidfosterwallace, person) • WordNet relations between tags • ISA(vegan, vegetarian), ANTONYMY(peace, war), PART_WHOLE(Businesses, markets), ENTAIL(proofreading, +read), SIMILARITY(important, general), DOMAIN(light, physics)
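The EQ and SYN definitions amount to grouping analyzed tags by two different keys. The sketch below assumes each tag has already been analyzed into a lemma, part of speech, sense number, and synset id; the `Analyzed` record and all field values (including the synset identifiers) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Analyzed(NamedTuple):
    tag: str
    lemma: str
    pos: str
    sense: int
    synset: str  # hypothetical synset identifier


def eq_clusters(tags: Iterable[Analyzed]) -> Dict[Tuple[str, str, int], List[str]]:
    """EQ: tags sharing lemma, part of speech, and sense number."""
    clusters: Dict[Tuple[str, str, int], List[str]] = defaultdict(list)
    for t in tags:
        clusters[(t.lemma, t.pos, t.sense)].append(t.tag)
    return dict(clusters)


def syn_clusters(tags: Iterable[Analyzed]) -> Dict[str, List[str]]:
    """SYN: tags sharing a synset id."""
    clusters: Dict[str, List[str]] = defaultdict(list)
    for t in tags:
        clusters[t.synset].append(t.tag)
    return dict(clusters)


# Illustrative analyses; the values are made up for the example
analyzed = [
    Analyzed("activity", "activity", "NN", 1, "syn-activity"),
    Analyzed("activities", "activity", "NN", 1, "syn-activity"),
    Analyzed("OS", "os", "NN", 1, "syn-operating-system"),
    Analyzed("operating.system", "operating_system", "NN", 1, "syn-operating-system"),
]
print(eq_clusters(analyzed))
print(syn_clusters(analyzed))
```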
Ontological Tag-Tag Relations • Lexical chains of size 2 and Semantic calculus • tag1 rel1 synset rel2 tag2 • rel1 & rel2 ⇒ rel3 • rel3(tag1, tag2) is added to the ontology • ISA(integration, events) ⇐ ISA(integration, group_action/NN/1) and ISA(group_action/NN/1, events) • PART_WHOLE(lobby, hotels) ⇐ PART_WHOLE(lobby, building/NN/1) and ISA(building/NN/1, hotels) • ISA relations between “modifier head” and “head” tags • ISA(book-cover, covers) • ISA(theoryofmind, theory) • ISA(photoshoptutorials, tutorials)
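The size-2 lexical chain rule can be sketched as a lookup over pairs of relations that meet at an intermediate synset. The `COMPOSE` table below contains only the two combinations shown in the slide's examples; it is an illustrative assumption and not the full set of semantic calculus rules. The name `derive_relations` is also hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (relation, left argument, right argument)

# Only the two compositions illustrated on the slide; the real rule set is larger
COMPOSE: Dict[Tuple[str, str], str] = {
    ("ISA", "ISA"): "ISA",
    ("PART_WHOLE", "ISA"): "PART_WHOLE",
}


def derive_relations(triples: Iterable[Triple], tags: Set[str]) -> Set[Triple]:
    """Derive rel3(tag1, tag2) from rel1(tag1, synset) and rel2(synset, tag2)."""
    known = list(triples)
    derived: Set[Triple] = set()
    for rel1, left, mid1 in known:
        if left not in tags:
            continue
        for rel2, mid2, right in known:
            # The chain must pass through a shared non-tag synset and end at a tag
            if mid1 != mid2 or mid1 in tags or right not in tags:
                continue
            rel3 = COMPOSE.get((rel1, rel2))
            if rel3 is not None:
                derived.add((rel3, left, right))
    return derived


chains = [
    ("ISA", "integration", "group_action/NN/1"),
    ("ISA", "group_action/NN/1", "events"),
    ("PART_WHOLE", "lobby", "building/NN/1"),
    ("ISA", "building/NN/1", "hotels"),
]
result = derive_relations(chains, {"integration", "events", "lobby", "hotels"})
print(sorted(result))
```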
Ontological Tag-Tag Relations • Relations between “modifier_i head_i” tags (i=1,2) • ISA(build-solar-panel, create-solar-panel) • SIMILARITY(socialnetworks, socialweb) • Diagram: between “modifier1 head1” and “modifier2 head2”, the component relations ISA & ISA, ISA & SYN, or SYN & ISA ⇒ ISA • Second diagram (labels: modifier2, head2, REL)
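The ISA rule for compound tags can be written as a small function over the relation holding between the two modifiers and the relation holding between the two heads. This is a sketch of the three combinations listed in the diagram only; the function name and the decision to return None for every other combination are assumptions.

```python
from typing import Optional


def compound_relation(first_pair_rel: str, second_pair_rel: str) -> Optional[str]:
    """Combine the relations between corresponding parts of two compound tags.

    Covers ISA & ISA, ISA & SYN, and SYN & ISA, each of which yields ISA.
    Any other combination is left undecided (None) in this sketch.
    """
    pair = (first_pair_rel, second_pair_rel)
    if pair in {("ISA", "ISA"), ("ISA", "SYN"), ("SYN", "ISA")}:
        return "ISA"
    return None


# build-solar-panel vs. create-solar-panel: one component pair is related by ISA
# (build, create), the other is the same concept (treated here as SYN)
print(compound_relation("ISA", "SYN"))
```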
Evaluation • 9,820 EQ clusters for the 8,460 unique tags • Same abbreviation expanded to different definitions • EQ: tutorial, tutorials, tutorials, • 8,801 SYN clusters • Largest cluster (133 bookmarks): car, automobiles, auto, autos, cars, automobile • 17% of tags placed into incorrect SYN cluster • Errors caused by imperfect word sense disambiguation • 5,439 ontological tag-tag relations • 3,869 ISA, 601 SIMILARITY, 429 PART_WHOLE, etc. • 1,778 relations derived using WordNet’s lexical chains and Lymba’s semantic calculus rules
Folksonomic Ontology • Portion of ontology generated from experimental folksonomy
Thank you! For questions: email marta@lymba.com