DBSocial 2013, New York

Scalable, Continuous Tracking of Tag Co-Occurrences between Short Sets using (Almost) Disjoint Tag Partitions

Motivation: enBlogue (1) • enBlogue: Identifies emergent topics • Input: A stream of documents annotated with hash-tags (e.g. Tweets) • Restricts the focus to the most recent documents using a sliding time window

Motivation: enBlogue (2) • Tracks the correlation of co-occurring hash-tags over time • Reports unexpected changes in the correlation • [Plot: correlation over time]

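Jaccard Coefficient • $T_i$: the set of ids of the documents annotated with tag $t_i$ • Pair of tags: $J(t_1, t_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$ • Set of n tags: $J(t_1, \dots, t_n) = \frac{|T_1 \cap \dots \cap T_n|}{|T_1 \cup \dots \cup T_n|}$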
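
As a minimal sketch of the definition above, the coefficient can be computed directly from the document-id sets. The function name `jaccard`, the toy tags, and the convention of returning 0.0 for an empty union are assumptions made for illustration; the slides do not prescribe them.

```python
def jaccard(doc_sets):
    """Jaccard coefficient of one or more document-id sets: |intersection| / |union|."""
    sets = [set(s) for s in doc_sets]
    if not sets:
        raise ValueError("need at least one set")
    union = set().union(*sets)
    if not union:
        return 0.0  # assumed convention when no document carries any of the tags
    inter = set.intersection(*sets)
    return len(inter) / len(union)


# Hypothetical toy data: tag -> ids of the documents annotated with it.
docs_by_tag = {
    "a": {1, 2, 3, 4},
    "b": {3, 4, 5},
    "c": {4, 5, 6},
}

pair = jaccard([docs_by_tag["a"], docs_by_tag["b"]])  # 2 shared ids out of 5
triple = jaccard(docs_by_tag.values())                # 1 shared id out of 6
```

Materialising the id sets like this is only practical for small inputs; it serves here as a reference point for the counter-based computation.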

Jaccard Coefficient Computation • Maintain counters for all subsets of co-occurring tags
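
The slide does not say which counters are kept exactly. One possible reading, sketched below, counts for every non-empty subset of a document's tags the number of documents that carry all tags in that subset. The name `update_counters` and the `delta` parameter (set to -1 when a document leaves the sliding window) are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def update_counters(counters, tags, delta=1):
    """Add delta to the counter of every non-empty subset of a document's tags."""
    tags = sorted(set(tags))  # sorted tuples give each subset one canonical key
    for size in range(1, len(tags) + 1):
        for subset in combinations(tags, size):
            counters[subset] += delta


counters = Counter()
update_counters(counters, ["a", "b"])        # document 1 arrives
update_counters(counters, ["a", "b", "c"])   # document 2 arrives
update_counters(counters, ["a"])             # document 3 arrives
update_counters(counters, ["a"], delta=-1)   # document 3 leaves the window
```

A document with m tags touches 2^m - 1 counters, which is affordable only because the tag sets are short.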

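Inclusion – Exclusion Principle • Compute the cardinality of the union of n sets $T_1, \dots, T_n$ using the cardinalities of the intersections of all their non-empty subsets: $\left|\bigcup_{i=1}^{n} T_i\right| = \sum_{\emptyset \neq S \subseteq \{1, \dots, n\}} (-1)^{|S|+1} \left|\bigcap_{i \in S} T_i\right|$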
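
The principle can be sketched in a few lines: given only intersection cardinalities, the union cardinality and hence the Jaccard coefficient follow. The dictionary layout (sorted tag tuples as keys) and the function names are assumptions for illustration, not the authors' data structures.

```python
from itertools import combinations


def union_size(tags, inter_count):
    """Size of the union of the tags' document sets, via inclusion-exclusion."""
    tags = sorted(tags)
    total = 0
    for size in range(1, len(tags) + 1):
        sign = 1 if size % 2 == 1 else -1  # odd-sized subsets add, even-sized subtract
        for subset in combinations(tags, size):
            total += sign * inter_count.get(subset, 0)
    return total


def jaccard_from_counters(tags, inter_count):
    """Jaccard coefficient computed from intersection counters only."""
    union = union_size(tags, inter_count)
    inter = inter_count.get(tuple(sorted(tags)), 0)
    return inter / union if union else 0.0


# Hypothetical counters: number of documents annotated with all tags in the key.
inter_count = {
    ("a",): 4, ("b",): 3, ("c",): 3,
    ("a", "b"): 2, ("a", "c"): 1, ("b", "c"): 2,
    ("a", "b", "c"): 1,
}

union_abc = union_size(["a", "b", "c"], inter_count)          # 4+3+3 - 2-1-2 + 1
j_ab = jaccard_from_counters(["a", "b"], inter_count)         # 2 / (4+3-2)
```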

Inclusion – Exclusion Principle: Advantages • Needs to maintain fewer counters • Adapts more easily to changes in the load

Problem • For each subset of co-occurring tags, two kinds of counts are needed: the number of documents annotated with each tag, and the number of documents annotated with all tags • The number of co-occurring tag sets is large • New documents arrive quickly, changing the counts • Solution: Let multiple nodes compute the Jaccard coefficient for different tag sets

Outline • Motivation • enBlogue • Jaccard Coefficient • Inclusion – Exclusion Principle • Problem • Idea • Architecture • Partition Tags • Updating Counters • Results • Theoretical Results • Experimental Results • Conclusion

Architecture • Nodes computing the partitions • Nodes computing the Jaccard coefficients

Partition Tags: Requisites • Treat tag-sets as inseparable units • Minimise the overlap of single tags tracked by different nodes

Partition Tags: Algorithm • Phase 1: Create an initial assignment of the tags to the nodes • Max-k cover: Selects k out of n sets that cover the maximum number of elements • Phase 2: Make sure all sets of tags are assigned to some node

Partition Tags: Example • Phase 1: Max-2 cover • Phase 2: Assigning remaining sets
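
The two phases can be sketched as follows. This is not the authors' exact procedure. Phase 1 uses the standard greedy approximation of max-k cover, and the Phase 2 rule (place each remaining tag-set on the node where it adds the fewest new tags, breaking ties by the number of tag-sets already held) is an assumption chosen to reflect the two requisites. All names are hypothetical.

```python
def partition_tag_sets(tag_sets, k):
    """Assign every tag-set to one of k nodes; returns (node_tags, node_sets)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    remaining = [frozenset(s) for s in tag_sets]
    node_tags, node_sets = [], []
    covered = set()

    # Phase 1: greedy max-k cover. Each chosen tag-set seeds one node.
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=lambda s: len(s - covered))
        remaining.remove(best)
        covered |= best
        node_tags.append(set(best))
        node_sets.append([best])

    # Phase 2 (assumed rule): put each remaining tag-set on the node where it
    # introduces the fewest new tags; ties go to the node with fewer tag-sets.
    for s in remaining:
        i = min(range(len(node_tags)),
                key=lambda j: (len(s - node_tags[j]), len(node_sets[j])))
        node_tags[i] |= s
        node_sets[i].append(s)
    return node_tags, node_sets


# Hypothetical tag-sets, split over two nodes.
tag_sets = [{"a", "b"}, {"b", "c"}, {"d", "e"}, {"a", "c"}, {"d", "f"}]
node_tags, node_sets = partition_tag_sets(tag_sets, k=2)
```

Tag-sets are never split, so a tag shared by tag-sets on different nodes is replicated; the Phase 2 rule tries to keep that replication low.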

Update Counters

Finding nodes • Inverted index
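
A minimal sketch of such an inverted index, assuming it maps each tag to the ids of the nodes that track it, so that an incoming document is forwarded only to the nodes holding at least one of its tags. The function names and the routing result format are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(node_tags):
    """Map each tag to the set of node ids that track it."""
    index = defaultdict(set)
    for node_id, tags in enumerate(node_tags):
        for tag in tags:
            index[tag].add(node_id)
    return index


def route(doc_tags, index):
    """Map node id -> the subset of the document's tags that the node tracks."""
    targets = defaultdict(set)
    for tag in set(doc_tags):
        for node_id in index.get(tag, ()):  # unknown tags reach no node
            targets[node_id].add(tag)
    return dict(targets)


# Hypothetical partition in which tag "c" is replicated on both nodes.
node_tags = [{"a", "b", "c"}, {"c", "d"}]
index = build_inverted_index(node_tags)
routes = route(["c", "d", "x"], index)
```

Each lookup costs time proportional to the number of tags in the document plus the number of nodes tracking them, independent of the vocabulary size.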

Theoretical expectation • k partitions • v total tags (vocabulary) • m randomly selected tags per set • n total tag-sets

Theoretical Results

Real Data Experiments • Dataset: Tweets of 15th March 2013 • Partitions: 10

Conclusion • An algorithm to compute the Jaccard coefficient for tag-sets in a massive data stream • Applicable to all measures using intersections and/or unions of sets (e.g. Dice) • Results show little replication • Load is distributed evenly across the nodes

Thank you!
