
DBSocial 2013, New York

Scalable, Continuous Tracking of Tag Co-Occurrences between Short Sets using (Almost) Disjoint Tag Partitions. Motivation: enBlogue identifies emergent topics. Input: a stream of documents annotated with hash-tags (e.g. Tweets).


Presentation Transcript


  1. Scalable, Continuous Tracking of Tag Co-Occurrences between Short Sets using (Almost) Disjoint Tag Partitions DBSocial 2013, New York

  2. Motivation: enBlogue (1) • enBlogue: Identifies emergent topics • Input: A stream of documents annotated with hash-tags (e.g. Tweets) • Restricts the focus to the most recent documents using a sliding time window

  3. Motivation: enBlogue (2) • Tracks the correlation of co-occurring hash-tags over time • Reports on unexpected changes in the correlation [Figure: correlation plotted over time]

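  4. Jaccard Coefficient • $T_i$: The set of document ids annotated with tag $t_i$ • Pair of tags: $J(t_1, t_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$ • Set of n tags: $J(t_1, \dots, t_n) = \frac{|T_1 \cap \dots \cap T_n|}{|T_1 \cup \dots \cup T_n|}$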
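A minimal Python sketch of the definition above, assuming each tag's document ids are held in memory as a set. The function name `jaccard` and the sample tags and ids are illustrative and do not come from the presentation.

```python
def jaccard(doc_sets):
    """Jaccard coefficient of n tags, given one set of document ids per tag."""
    sets = [set(s) for s in doc_sets]
    union = set().union(*sets)
    if not union:
        # No documents at all: define the coefficient as 0 to avoid dividing by zero.
        return 0.0
    intersection = set.intersection(*sets)
    return len(intersection) / len(union)


# Hypothetical sample data: document ids annotated with each tag.
docs_by_tag = {
    "nyc": {1, 2, 3, 4},
    "sandy": {2, 3, 4, 5},
    "storm": {3, 4, 6},
}

pair = jaccard([docs_by_tag["nyc"], docs_by_tag["sandy"]])  # 3 shared ids out of 5
triple = jaccard(docs_by_tag.values())                      # 2 shared ids out of 6
```

Storing the full id sets is only practical for small examples; the following slides replace the sets with counters.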

  5. Jaccard Coefficient Computation • Maintain counters for all subsets of co-occurring tags
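One way to picture this counter maintenance, as a sketch under the assumption that every arriving document carries a short tag list: each non-empty subset of the document's tags gets its counter incremented. The name `update_counters` and the sample documents are illustrative only.

```python
from collections import Counter
from itertools import combinations


def update_counters(counters, tags):
    """Increment the counter of every non-empty subset of a document's tags."""
    tags = sorted(set(tags))  # canonical order so each subset has one key
    for size in range(1, len(tags) + 1):
        for subset in combinations(tags, size):
            counters[subset] += 1


counters = Counter()
# Hypothetical stream of three documents and their tags.
for doc_tags in (["a", "b"], ["a", "b", "c"], ["a"]):
    update_counters(counters, doc_tags)
```

A document with m tags touches 2^m - 1 counters, which stays manageable only because the tag sets are short.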

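  6. Inclusion – Exclusion Principle • Compute the cardinality of the union of n sets using the cardinalities of the intersections of all non-empty subsets of these sets: $\left|\bigcup_{i=1}^{n} T_i\right| = \sum_{\emptyset \neq S \subseteq \{1, \dots, n\}} (-1)^{|S|+1} \left|\bigcap_{i \in S} T_i\right|$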
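The principle can be sketched in Python as follows, assuming the intersection cardinalities are available as a dictionary keyed by sorted tag tuples. The names `union_size`, `jaccard_from_counters`, and the sample counts are hypothetical.

```python
from itertools import combinations


def union_size(tags, intersection_count):
    """Size of the union of the tags' document sets via inclusion-exclusion."""
    tags = sorted(set(tags))
    total = 0
    for size in range(1, len(tags) + 1):
        sign = 1 if size % 2 == 1 else -1  # odd-sized subsets add, even-sized subtract
        for subset in combinations(tags, size):
            total += sign * intersection_count.get(subset, 0)
    return total


def jaccard_from_counters(tags, intersection_count):
    """Jaccard coefficient computed from intersection counters only."""
    key = tuple(sorted(set(tags)))
    union = union_size(tags, intersection_count)
    return intersection_count.get(key, 0) / union if union else 0.0


# Hypothetical counters for three documents tagged {a, b}, {a, b, c} and {a}.
counts = {
    ("a",): 3, ("b",): 2, ("c",): 1,
    ("a", "b"): 2, ("a", "c"): 1, ("b", "c"): 1,
    ("a", "b", "c"): 1,
}

total_docs = union_size(["a", "b", "c"], counts)   # 6 - 4 + 1 = 3
j_ab = jaccard_from_counters(["a", "b"], counts)   # 2 / (3 + 2 - 2)
```

Only intersection counters are needed here; no union counter has to be stored or updated.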

  7. Inclusion – Exclusion Principle: Advantages • Needs to maintain fewer counters • Adapts more easily to changes in the load

  8. Problem • For each subset of co-occurring tags we need: • The number of documents annotated with each tag • The number of documents annotated with all tags • There is a large number of co-occurring tag sets • New documents arrive fast, changing the numbers • Solution: Let multiple nodes compute the Jaccard coefficient for different tag sets

  9. Outline • Motivation • enBlogue • Jaccard Coefficient • Inclusion – Exclusion Principle • Problem • Idea • Architecture • Partition Tags • Updating Counters • Results • Theoretical Results • Experimental Results • Conclusion

  10. Architecture • Nodes computing the partitions • Nodes computing the Jaccard coefficients

  11. Partition Tags: Requisites • Treat tag-sets as inseparable units • Minimise the overlap of single tags tracked by different nodes

  12. Partition Tags: Algorithm • Phase 1: Create an initial assignment of the tags to the nodes • Max-k cover: Selects k out of n sets that cover the maximum number of elements • Phase 2: Make sure all sets of tags are assigned to some node
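The two phases could look roughly like the Python sketch below. It rests on assumptions the slides do not state: Phase 1 uses the standard greedy approximation of max-k cover, and Phase 2 places each remaining tag-set on the node where it adds the fewest new tags, breaking ties by the number of tag-sets already assigned. The name `partition_tag_sets` and the sample tag-sets are hypothetical.

```python
def partition_tag_sets(tag_sets, k):
    """Assign whole tag-sets to k nodes while keeping tag overlap small."""
    if k < 1:
        raise ValueError("k must be at least 1")
    remaining = [frozenset(s) for s in tag_sets]
    nodes = []       # one dict per node: the tags it tracks and its tag-sets
    covered = set()  # tags covered by the sets picked so far

    # Phase 1 (assumed greedy max-k cover): repeatedly pick the tag-set
    # that covers the most tags not yet covered; each pick seeds one node.
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=lambda s: len(s - covered))
        remaining.remove(best)
        covered |= best
        nodes.append({"tags": set(best), "sets": [best]})

    # Phase 2 (assumed heuristic): give every remaining tag-set to the node
    # where it introduces the fewest new tags, then to the least loaded node.
    for s in remaining:
        target = min(nodes, key=lambda n: (len(s - n["tags"]), len(n["sets"])))
        target["tags"] |= s
        target["sets"].append(s)
    return nodes


# Hypothetical input: five short tag-sets split across two nodes.
tag_sets = [{"a", "b"}, {"b", "c"}, {"d", "e"}, {"a", "c"}, {"e", "f"}]
nodes = partition_tag_sets(tag_sets, k=2)
```

In this small example the two nodes end up tracking disjoint tags; in general a tag-set may force a tag to be replicated on more than one node, which is the "almost" in "(almost) disjoint".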

  13. Partition Tags: Example • Phase 1: Max-2 cover • Phase 2: Assigning remaining sets

  14. Update Counters

  15. Finding nodes: Inverted Index
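A sketch of how an inverted index might be used to find the nodes for an incoming document, assuming the index maps each tag to the ids of the nodes tracking it. The names `build_inverted_index`, `route`, and the node assignment are illustrative and not taken from the presentation.

```python
from collections import defaultdict


def build_inverted_index(node_tags):
    """Map each tag to the set of node ids that track it."""
    index = defaultdict(set)
    for node_id, tags in node_tags.items():
        for tag in tags:
            index[tag].add(node_id)
    return index


def route(doc_tags, index):
    """For each interested node, collect the document's tags that it tracks."""
    targets = defaultdict(set)
    for tag in doc_tags:
        for node_id in index.get(tag, ()):
            targets[node_id].add(tag)
    return dict(targets)


# Hypothetical assignment in which tag "c" is replicated on both nodes.
node_tags = {0: {"a", "b", "c"}, 1: {"c", "d"}}
index = build_inverted_index(node_tags)
routes = route(["a", "c", "x"], index)
```

Here node 0 receives tags `a` and `c`, node 1 receives only `c`, and the untracked tag `x` is sent nowhere.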

  16. Outline • Motivation • enBlogue • Jaccard Coefficient • Inclusion – Exclusion Principle • Problem • Idea • Architecture • Distributing Tags • Updating Counters • Results • Theoretical Results • Experimental Results • Conclusion

  17. Theoretical expectation • k partitions • v total tags (vocabulary) • m randomly selected tags per set • n total tag-sets

  18. Theoretical Results

  19. Real Data Experiments • Dataset: Tweets of 15th March 2013 • Partitions: 10

  20. Outline • Motivation • enBlogue • Jaccard Coefficient • Inclusion – Exclusion Principle • Problem • Idea • Architecture • Distributing Tags • Updating Counters • Results • Theoretical Results • Experimental Results • Conclusion

  21. Conclusion • An algorithm to compute the Jaccard coefficient for tag-sets in a massive data stream • Applicable to all measures using intersections and/or unions of sets (e.g. Dice) • Results show little replication of tags across nodes • Load is equally distributed across the nodes

  22. Thank you!
