Inducing Ontologies from Folksonomies using Natural Language Understanding

Marta Tatu, Dan Moldovan (Lymba Corporation). Presenter: Chris Irwin Davis.

Presentation Transcript


  1. Inducing Ontologies from Folksonomies using Natural Language Understanding
     Marta Tatu, Dan Moldovan (Lymba Corporation)
     Presenter: Chris Irwin Davis

  2. Overview
     Folksonomy
     • lexical normalization of tags
     • semantic consistency
     • tag-tag relations
     • folksonomy-based applications
     • reasoning applications
     NLP
     • typographical errors, spelling variations
     • singular/plural forms, lower case
     • space/punctuation used as delimiters
     • same tag in different contexts
     • tag synonymy
     • social annotations (author vs. user)
     • browse/search bookmarks
     • resource discovery (recommendations)
     • collaborative tagging (across folksonomies)
     Ontology
     LREC 2010, May 19th, 2010

  3. Semantic Approach
     • Folksonomy semantic representation
     • Tag understanding
       • Lexical: language identification, tokenization and spelling corrections, capitalization restoration
       • Syntactic: part-of-speech tagging, syntactic parsing
       • Semantic: acronym understanding, word sense disambiguation, named entity recognition, semantic parsing
     • Deriving the ontological structure
       • Semantic relations between tags
     • Sources of information
       • Tag text semantics
       • Social bookmarking annotations
       • Machine understanding of bookmark content

  4. Representing Folksonomies
     • knowledge → knowledge[NN]1
     • advertisign → advertising[NN]1
     • americanhistory → American[JJ]1 TOPIC history[NN]2
     • read-now → now[RB]3 TEMPORAL read[VB]1
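The word[POS]sense notation on this slide can be held in a small data structure. The following is a minimal sketch only: the class names `Token` and `TagRepresentation`, the helper `describe`, and the argument order inside the relation are all assumptions made for illustration, not Lymba's actual representation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Token:
    """One analysed word of a tag (hypothetical structure)."""
    lemma: str
    pos: str    # part-of-speech label as shown on the slide, e.g. "NN", "JJ"
    sense: int  # word sense number

    def render(self) -> str:
        # Mirrors the slide notation, e.g. history[NN]2
        return f"{self.lemma}[{self.pos}]{self.sense}"


@dataclass
class TagRepresentation:
    """A raw tag plus its tokens and the semantic relations between them."""
    raw_tag: str
    tokens: list
    relations: list = field(default_factory=list)  # (relation, left, right)


def describe(rep: TagRepresentation) -> str:
    """Render relations if present, otherwise just the tokens."""
    if not rep.relations:
        return " ".join(t.render() for t in rep.tokens)
    return "; ".join(
        f"{rel}({a.render()}, {b.render()})" for rel, a, b in rep.relations
    )


american = Token("American", "JJ", 1)
history = Token("history", "NN", 2)
tag = TagRepresentation(
    "americanhistory", [american, history], [("TOPIC", american, history)]
)
print(describe(tag))
```

Keeping the raw tag next to its analysis means that misspelled or concatenated tags such as "advertisign" or "americanhistory" stay traceable after normalization.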

  5. Representing Folksonomies
     Associated (user, document) pairs

  6. Representing Folksonomies

  7. System Architecture

  8. Tag Understanding

  9. Acronym/Abbreviation Understanding
     • Abbreviation dictionary: (abbreviation - expansion - domain of usage)
       • 118,055 distinct abbreviations
       • 137 domains: Law, Music, TV/Radio Stations, Countries, Airport, Domain Names, Chat, Emoticons, etc.
       • 25% of the abbreviations have more than one definition
         • (unambiguous) Zip codes – (76012 : Arlington, TX)
         • (ambiguous) SS : 192 definitions in 66 domains
           • Social Security – Business and US Government, Screen Saver – File Extensions, Stainless Steel – Housing and Products, Subtropical Storm – Meteorology, Style Sheet – Software
     • Check whether the tag is in the abbreviation dictionary
     • Use lexical chains to link document content to abbreviation domain
     • Use co-occurring tags to identify the correct expansion
     • Use text alignment to find new abbreviation definitions within document content

  10. Acronym/Abbreviation Understanding
      • “PR”: ~1,409 documents
      • 87 definitions for PR
        • Press Release, Public Relations, Puerto Rico, Page Rank, Public Radio, Permanent Resident/Residency, etc.
      • http://prsarahevans.com/2009/06/do-you-have-a-strategy-for-online-comments
        • “PR” = “public relations” (6 times in document content)
        • Other tags of the bookmark: “public”, “relations”, “media”, “strategy”
      • http://www.bbc.co.uk/pressoffice/pressreleases/category/new_media_index.shtml
        • “PR” = “press releases” (in document content)
      • http://escape.topuertorico.com
        • “PR” = “Puerto Rico” (in document content)
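The “PR” example suggests how document content and co-occurring tags can pick one expansion among many. The sketch below illustrates that idea under loud assumptions: the tiny `ABBREVIATIONS` table, the function name `expand_abbreviation`, and the additive scoring (mentions in the text plus word overlap with the other tags) are hypothetical, and the lexical-chain and text-alignment steps from the slides are not modelled.

```python
import re

# Hypothetical miniature dictionary; the expansions are taken from the slide,
# whereas the real resource lists 87 definitions for "PR".
ABBREVIATIONS = {
    "pr": [
        "press release",
        "public relations",
        "puerto rico",
        "page rank",
        "public radio",
    ],
}


def expand_abbreviation(tag, document_text, other_tags):
    """Return the best-supported expansion of `tag`, or None if unsupported."""
    candidates = ABBREVIATIONS.get(tag.lower())
    if not candidates:
        return None
    text = document_text.lower()
    tag_words = {t.lower() for t in other_tags}
    best, best_score = None, 0
    for expansion in candidates:
        # Count mentions in the document, allowing a simple plural "s".
        pattern = r"\b" + re.escape(expansion) + r"s?\b"
        mentions = len(re.findall(pattern, text))
        # Count words of the expansion that also appear as co-occurring tags.
        overlap = sum(1 for word in expansion.split() if word in tag_words)
        score = mentions + overlap
        if score > best_score:
            best, best_score = expansion, score
    return best


print(expand_abbreviation(
    "PR",
    "Public relations pros need a plan. Good public relations means listening.",
    ["public", "relations", "media", "strategy"],
))
```

Returning `None` when no candidate gains any support matches the main error source noted in the evaluation: a tag that cannot be identified within the document.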

  11. Evaluation
      • Experimental data
        • ~150,000 (user, document, tag) triples from del.icio.us
        • 8,460 tags; 83,827 documents; 58,198 users
      • Main error source: the tag cannot be identified within the document
        • Lack of document content (images, non-English content, etc.)
      • Errors propagate from initial processing steps to later ones
        • Bad capitalization leads to bad named entity recognition

  12. Ontological Tag-Tag Relations
      • EQUALITY relations
        • same lemma, part-of-speech, and sense number
        • EQ(activity, activities), EQ(after-effects, AfterEffects), EQ(opinion, Opnion), etc.
      • SYNONYMY clusters
        • same synset ID
        • SYN(OS, operating.system), SYN(LA, losangeles), SYN(nyt, nytimes)
      • ISA relations between named entities and type tags
        • ISA(OracleCorporation, organization), ISA(davidfosterwallace, person)
      • WordNet relations between tags
        • ISA(vegan, vegetarian), ANTONYMY(peace, war), PART_WHOLE(Businesses, markets), ENTAIL(proofreading, +read), SIMILARITY(important, general), DOMAIN(light, physics)
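The EQUALITY criterion (same lemma, part-of-speech, and sense number) amounts to grouping tags by an analysis key. A minimal sketch, assuming the tag-understanding step has already produced a (lemma, POS, sense) triple per tag; the function name `eq_clusters` and the sense numbers in the sample data are made up for illustration.

```python
from collections import defaultdict


def eq_clusters(analyses):
    """Group raw tags whose (lemma, pos, sense) analysis is identical."""
    clusters = defaultdict(list)
    for raw_tag, key in analyses.items():
        clusters[key].append(raw_tag)
    # Sort members so that the output is deterministic.
    return {key: sorted(tags) for key, tags in clusters.items()}


# Hypothetical output of the tag-understanding step.
analyses = {
    "activity": ("activity", "NN", 1),
    "activities": ("activity", "NN", 1),
    "opinion": ("opinion", "NN", 1),
    "Opnion": ("opinion", "NN", 1),  # misspelled tag, corrected upstream
    "war": ("war", "NN", 1),
}

for key, tags in eq_clusters(analyses).items():
    print(key, tags)
```

Singleton groups are kept on purpose, so every analysed tag belongs to some cluster.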

  13. Ontological Tag-Tag Relations
      • Lexical chains of size 2 and semantic calculus
        • tag1 –rel1→ synset –rel2→ tag2
        • rel1 & rel2 → rel3
        • rel3(tag1, tag2) is added to the ontology
        • ISA(integration, events,) ← ISA(integration, group_action/NN/1) and ISA(group_action/NN/1, events,)
        • PART_WHOLE(lobby, hotels) ← PART_WHOLE(lobby, building/NN/1) and ISA(building/NN/1, hotels)
      • ISA relations between “modifier head” and “head” tags
        • ISA(book-cover, covers)
        • ISA(theoryofmind, theory)
        • ISA(photoshoptutorials, tutorials,)
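The chain-of-size-2 rule can be sketched as a lookup in a composition table. This is an illustration under stated assumptions: the `COMPOSITION` table contains only the two combinations visible in the slide's examples, not the full set of semantic calculus rules, and `derive_relations` is a hypothetical name.

```python
# Only the two compositions shown in the slide's examples (assumed subset).
COMPOSITION = {
    ("ISA", "ISA"): "ISA",
    ("PART_WHOLE", "ISA"): "PART_WHOLE",
}


def derive_relations(edges, tags):
    """Derive rel3(tag1, tag2) from tag1 -rel1-> synset -rel2-> tag2."""
    derived = set()
    for rel1, tag1, middle in edges:
        if tag1 not in tags:
            continue
        for rel2, source, tag2 in edges:
            # The second link must start where the first one ended.
            if source != middle or tag2 not in tags or tag2 == tag1:
                continue
            rel3 = COMPOSITION.get((rel1, rel2))
            if rel3 is not None:
                derived.add((rel3, tag1, tag2))
    return derived


# Edges are written as (relation, left argument, right argument).
edges = [
    ("ISA", "integration", "group_action/NN/1"),
    ("ISA", "group_action/NN/1", "events,"),
    ("PART_WHOLE", "lobby", "building/NN/1"),
    ("ISA", "building/NN/1", "hotels"),
]
tags = {"integration", "events,", "lobby", "hotels"}

for relation in sorted(derive_relations(edges, tags)):
    print(relation)
```

Relation pairs missing from the table yield nothing, which keeps the sketch from inventing relations the rules do not license.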

  14. Ontological Tag-Tag Relations
      • Relations between “modifier_i head_i” tags (i = 1, 2)
        • ISA(build-solar-panel, create-solar-panel)
        • SIMILARITY(socialnetworks, socialweb)
      [Diagram: tag “modifier1 head1” linked to tag “modifier2 head2” through ISA & ISA, ISA & SYN, or SYN & ISA; head2 REL modifier2 ⇒ ISA]

  15. Evaluation
      • 9,820 EQ clusters for the 8,460 unique tags
        • Same abbreviation expanded to different definitions
        • EQ: tutorial, tutorials, tutorials,
      • 8,801 SYN clusters
        • Largest cluster (133 bookmarks): car, automobiles, auto, autos, cars, automobile
        • 17% of tags placed into incorrect SYN cluster
        • Errors caused by imperfect word sense disambiguation
      • 5,439 ontological tag-tag relations
        • 3,869 ISA, 601 SIMILARITY, 429 PART_WHOLE, etc.
        • 1,778 relations derived using WordNet’s lexical chains and Lymba’s semantic calculus rules

  16. Folksonomic Ontology
      Portion of ontology generated from experimental folksonomy

  17. Folksonomic Ontology
      Portion of ontology generated from experimental folksonomy

  18. Thank you! For questions: email marta@lymba.com
