Scalable, Continuous Tracking of Tag Co-Occurrences between Short Sets using (Almost) Disjoint Tag Partitions
DBSocial 2013, New York
Motivation: enBlogue (1)
• enBlogue: identifies emergent topics
• Input: a stream of documents annotated with hash-tags (e.g. tweets)
• Restricts the focus to the most recent documents using a sliding time window
Motivation: enBlogue (2)
• Tracks the correlation of co-occurring hash-tags over time
• Reports on unexpected changes in the correlation
[Figure: correlation plotted against time]
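Jaccard Coefficient
• $T$: the set of ids of the documents annotated with tag $t$
• Pair of tags: $J(t_1, t_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$
• Set of n tags: $J(t_1, \dots, t_n) = \frac{|T_1 \cap \dots \cap T_n|}{|T_1 \cup \dots \cup T_n|}$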
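As a small illustration of the definition above, the following sketch computes the Jaccard coefficient directly from explicit sets of document ids. The function name `jaccard` and the sample tags and ids are made up for the example; a streaming system would work from counters rather than from materialised sets.

```python
from functools import reduce


def jaccard(*doc_sets):
    """Jaccard coefficient of one or more sets of document ids."""
    if not doc_sets:
        raise ValueError("need at least one set")
    intersection = reduce(set.intersection, doc_sets)
    union = reduce(set.union, doc_sets)
    # Convention chosen for this sketch: empty union gives 0.0.
    return len(intersection) / len(union) if union else 0.0


# Hypothetical data: tag -> ids of the documents annotated with it.
docs_with = {
    "#nyc": {1, 2, 3, 4},
    "#sandy": {3, 4, 5},
    "#storm": {4, 5, 6},
}

pair = jaccard(docs_with["#nyc"], docs_with["#sandy"])  # |{3,4}| / |{1..5}|
triple = jaccard(*docs_with.values())                   # |{4}| / |{1..6}|
```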
Jaccard Coefficient Computation
• Maintain counters for all subsets of co-occurring tags
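One way to picture the counters is the sketch below: each arriving document increments a counter for every non-empty subset of its tags, so the counter for a subset equals the number of documents annotated with all tags in that subset. The name `update_counters` and the `sign` parameter (used here to undo a document when it leaves the sliding window) are assumptions for illustration, not the presenters' implementation. Enumerating all subsets is only affordable because tag sets are short.

```python
from collections import Counter
from itertools import combinations


def update_counters(counters, doc_tags, sign=+1):
    """Add (sign=+1) or remove (sign=-1) one document from the counters."""
    tags = sorted(set(doc_tags))  # canonical order so subsets match as keys
    for size in range(1, len(tags) + 1):
        for subset in combinations(tags, size):
            counters[subset] += sign


counters = Counter()
update_counters(counters, ["#nyc", "#sandy"])
update_counters(counters, ["#nyc", "#sandy", "#storm"])
update_counters(counters, ["#nyc"])
# Assumed expiry step: the first document drops out of the time window.
update_counters(counters, ["#nyc", "#sandy"], sign=-1)
```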
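Inclusion–Exclusion Principle
• Compute the cardinality of the union of n sets from the cardinalities of the intersections of all non-empty subsets of those sets:
$\left|\bigcup_{i=1}^{n} T_i\right| = \sum_{\emptyset \neq S \subseteq \{1,\dots,n\}} (-1)^{|S|+1} \left|\bigcap_{i \in S} T_i\right|$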
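The principle can be sketched in a few lines: given only intersection counts, the union size follows by alternating signs over subset sizes, and the Jaccard coefficient is then the full intersection divided by that union. The helper names and the sample counts are hypothetical; the counts correspond to three sets {1,2,3,4}, {3,4,5} and {4,5,6}.

```python
from itertools import combinations


def union_size(tags, intersection_count):
    """Size of the union of the tags' document sets via inclusion-exclusion."""
    tags = sorted(tags)
    total = 0
    for size in range(1, len(tags) + 1):
        sign = 1 if size % 2 == 1 else -1  # + for odd sizes, - for even sizes
        for subset in combinations(tags, size):
            total += sign * intersection_count.get(subset, 0)
    return total


def jaccard_from_counts(tags, intersection_count):
    union = union_size(tags, intersection_count)
    full = intersection_count.get(tuple(sorted(tags)), 0)
    return full / union if union else 0.0


# Hypothetical intersection counts, keyed by sorted tuples of tags.
counts = {
    ("a",): 4, ("b",): 3, ("c",): 3,
    ("a", "b"): 2, ("a", "c"): 1, ("b", "c"): 2,
    ("a", "b", "c"): 1,
}

union_abc = union_size(["a", "b", "c"], counts)   # 10 - 5 + 1
union_ab = union_size(["a", "b"], counts)         # 4 + 3 - 2
```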
Inclusion–Exclusion Principle: Advantages
• Needs fewer counters
• Adapts more easily to changes in the load
Problem
• For each subset of co-occurring tags, we need:
  • the number of documents annotated with each tag
  • the number of documents annotated with all tags
• There is a large number of co-occurring tag sets
• New documents arrive quickly, so the counts change constantly
Solution: let multiple nodes compute the Jaccard coefficient for different tag sets
Outline
• Motivation
• enBlogue
• Jaccard Coefficient
• Inclusion–Exclusion Principle
• Problem
• Idea
• Architecture
• Partition Tags
• Updating Counters
• Results
• Theoretical Results
• Experimental Results
• Conclusion
Architecture
• Nodes computing the partitions
• Nodes computing the Jaccard coefficients
Partition Tags: Requirements
• Treat tag-sets as inseparable units
• Minimise the overlap of single tags tracked by different nodes
Partition Tags: Algorithm
• Phase 1: Create an initial assignment of the tags to the nodes
  • Max-k cover: selects k out of n sets that cover the maximum number of elements
• Phase 2: Make sure all sets of tags are assigned to some node
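The two phases can be sketched roughly as follows. The slides do not spell out the details, so everything specific here is an assumption: Phase 1 uses the standard greedy approximation for max-k cover to pick one seed tag-set per node, and Phase 2 places each remaining tag-set on the node where it introduces the fewest new tags, breaking ties by the smaller load. The names `greedy_max_k_cover` and `partition` are hypothetical.

```python
def greedy_max_k_cover(tag_sets, k):
    """Greedy approximation: pick k sets, each adding the most uncovered tags."""
    covered, chosen = set(), []
    remaining = list(range(len(tag_sets)))
    for _ in range(min(k, len(tag_sets))):
        best = max(remaining, key=lambda i: len(tag_sets[i] - covered))
        chosen.append(best)
        covered |= tag_sets[best]
        remaining.remove(best)
    return chosen


def partition(tag_sets, k):
    """Assign every tag-set (as a whole) to one of k nodes."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Phase 1: seed each node with one tag-set from the greedy max-k cover.
    seeds = greedy_max_k_cover(tag_sets, k)
    node_tags = [set(tag_sets[i]) for i in seeds]
    node_sets = [[i] for i in seeds]
    # Phase 2 (assumed heuristic): fewest newly replicated tags, then lowest load.
    for i, tags in enumerate(tag_sets):
        if i in seeds:
            continue
        node = min(
            range(len(node_tags)),
            key=lambda n: (len(tags - node_tags[n]), len(node_sets[n])),
        )
        node_tags[node] |= tags
        node_sets[node].append(i)
    return node_sets, node_tags


# Hypothetical tag-sets, partitioned over 2 nodes.
tag_sets = [{"a", "b", "c"}, {"a", "b"}, {"d", "e"}, {"c", "d"}, {"e", "f"}]
node_sets, node_tags = partition(tag_sets, k=2)
```

In this toy run only one tag ends up tracked by both nodes, which is the kind of overlap the requirements aim to keep small.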
Partition Tags: Example
[Figure: worked example. Phase 1: max-2 cover. Phase 2: assigning the remaining sets.]
Finding Nodes
• Inverted index
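A minimal sketch of how an inverted index could be used to find nodes, assuming it maps each tag to the ids of the nodes that track it, so an incoming document is forwarded only to nodes interested in at least one of its tags. The function names and sample data are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(node_tags):
    """Map each tag to the set of node ids that track it."""
    index = defaultdict(set)
    for node_id, tags in enumerate(node_tags):
        for tag in tags:
            index[tag].add(node_id)
    return index


def nodes_for_document(index, doc_tags):
    """Nodes that must see a document with the given tags."""
    nodes = set()
    for tag in doc_tags:
        nodes |= index.get(tag, set())  # .get avoids creating empty entries
    return nodes


# Hypothetical partition: node 0 tracks {a, b, c}, node 1 tracks {c, d, e, f}.
index = build_inverted_index([{"a", "b", "c"}, {"c", "d", "e", "f"}])
both = nodes_for_document(index, ["a", "c"])
only_second = nodes_for_document(index, ["d"])
nobody = nodes_for_document(index, ["z"])
```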
Theoretical Expectation
• k partitions
• v total tags (vocabulary)
• m randomly selected tags per set
• n total tag-sets
Real Data Experiments
• Dataset: tweets from 15 March 2013
• Partitions: 10
Conclusion
• An algorithm to compute the Jaccard coefficient for tag-sets in a massive data stream
• Applicable to all measures that use intersections and/or unions of sets (e.g. Dice)
• Results show little replication of tags across nodes
• The load is distributed evenly across the nodes