120 likes | 236 Views
A Semantic Approach to Tag Redundancy in Folksonomies Moving Forward. Tom Schimoler 11/23/2009 CSC 549. The Story So Far…. The Problem: Redundancy of Tags The Domain: Last.fm -- a music annotation system The Data: 800K users 1 million tags 3.6 million resources 2.4 million songs
E N D
A Semantic Approach to Tag Redundancy in FolksonomiesMoving Forward Tom Schimoler 11/23/2009 CSC 549
The Story So Far… • The Problem: Redundancy of Tags • The Domain: Last.fm -- a music annotation system • The Data: • 800K users • 1 million tags • 3.6 million resources • 2.4 million songs • 800K artists • 400K albums • 18K wiki articles (14K artist-based, 4K tag-based)
The Story So Far… • I define redundancy as a binary semantic relation between 2 tags in the context of a single resource • Objective is to automatically predict this relation, based on previously observed relations • Other relevant relations: spelling similarity of tags, both tags have same relation to a given resource, etc.
Basic Proposed Solution • The task is broken into 2 distinct phases: • Phase 1: find observed relations among tags and resources • Phase 2: learn a set of rules which predicts the redundancy relationship given the observed data from Phase 1 and a training set (in which redundancy has been determined) • The larger picture: • a framework in which relations are general connections between tags, resources, and users • rule learning is largely a data-mining task • Hypothesis: • The relations learned from NLP techniques will improve the learning of redundancy prediction rules, compared with learning from semantically-unaware techniques alone.
Phase 1: Issues Too many possible relations to find them all Many relations in this domain are not objectively consistent Keep goal in mind: objective is NOT to build a comprehensive ontology KISS: focus on 3 observed relations which might be extracted from the data through NLP techniques
Observed Relations: Alt.Spelling(1) • Alternate spelling is the most noticeable indicator of redundancy in 2 tags • A lexical database, such as WordNet, can help find some alternate spellings; tag systems such as Last.fm contain many tags which are not part of the English vocabulary and therefore will not be found in a general database • Noun • S: (n) heavy metal, heavy metal music (loud and harsh sounding rock music with a strong beat; lyrics usually involve violent or fantastic imagery) • direct hypernym / inherited hypernym / sister term • S: (n) rock 'n' roll, rock'n'roll, rock-and-roll, rock and roll, rock, rock music (a genre of popular music originating in the 1950s; a blend of black rhythm-and-blues with white country-and-western) "rock is a generic term for the range of styles that evolved out of rock'n'roll." • S: (n) heavy metal, heavy metal music (loud and harsh sounding rock music with a strong beat; lyrics usually involve violent or fantastic imagery) • S: (n) progressive rock, art rock (a style of rock music that emerged in the 1970s; associated with attempts to combine rock with jazz and other forms; intended for listening and not dancing) • S: (n) psychedelic rock, acid rock (a musical style that emerged in the 1960s; rock music inspired by or related to drug-induced experience) • S: (n) punk rock, punk (rock music with deliberately offensive lyrics expressing anger and social alienation; in part a reaction against progressive rock)
Observed Relations: Alt.Spelling(2) • As discussed previously, tags (in this domain, at least) seem to follow certain morphological rules which can help determine alternate spellings • Learning these rules is dependent on finding the “information-free” portions of each tag • A common connector string (and, -, ’n, etc) • I18n replacements (“ou” for “o”, “f” for “ph”, etc) • Some prefixes/suffixes are so common that it is questionable how much information they provide (-“rock”)
Observed Relations: Genre and Locale (1) • Some semantic relations are potentially easier to extract from text than others • 2 features of an artist which are almost always present in that artist’s wiki page are the genre and location of the artist. • A consistent pattern of language usually ties these features together with the artist in a single sentence • The Ramones were a punk rock band which formed in Queens, New York, United States in 1974. Coldplay is a British alternative rock band, formed in London, United Kingdom in 1997. • Radiohead are an English alternative rock band from Abingdon, Oxfordshire. • Stefani Joanne Angelina Germanotta (born March 28, 1986), who performs under the stage name Lady Gaga, is a singer, songwriter and musician from New York, U.S.A.
Observed Relations: Genre and Locale (2) “<Artist> (is/was/are/were) a <Location>? <Genre>+ (musician/band) (from/formed in) <Location>,<Location>…” begins many artist wiki entries. With supervised learning, this pattern can be further refined so that location and genre can be predicted, given a set of tagged data (wiki text in which n-grams are mapped to known tags/resources). Presence of tag/resource data avoids the need for deeper NER for this limited task.
Evaluating Phase 1 results (1) As Phase 2 (determining redundancy rules) depends on the accuracy of the semantic relations observed in Phase 1, we need some ground truth by which to judge (alt.spelling evaluation is deferred to Phase 2) An outside ontology which already contains (some of) these relations can be held as a “gold-standard” for purposes of evaluation In the music domain, some ontology projects (MusicBrainz, for example) have emerged which may suit our purposes
Evaluating Phase 1 results (2) • The evaluation methodology essentially becomes a classification problem • For artist A we have some knowledge a priori: • G(A) = { g | Genre(A,g)} • L(A) = { l | Location(A,l)} • From the text data, we extract genre and location information: • G’(A) and L’(A) • For each relation over each test Artist, we compute standard precision/recall measures of success
Evaluating Phase 2 Determining how well our system correctly identifies tag redundancy depends on having a consistent set of test-cases Although our formal definition of redundancy has informed the overall approach, it is not something for which a metric is known Therefore, we will rely on manual classification in order to define the gold-standard Logistics of such a project are non-trivial