ACM DEBS 2016 Grand Challenge SOSTREAM Project Giacomo Marciani – Marco Piu – Michele Porretta – Matteo Nardelli – Valeria Cardellini
01 Our Goal
To see how interesting recent problems can be solved with the latest computing technologies.
To analyze an application that processes large volumes of data, built on a Data Stream Processing framework.
In detail: solving the ACM DEBS 2016 Grand Challenge.
Case study: real-time analysis of a social network (e.g. Facebook, Twitter…).
Purpose: identifying the activity that is currently trending in the social network.
02 Association for Computing Machinery
ACM is the world's largest educational and scientific computing society, uniting computing educators, researchers and professionals to inspire dialogue and share computer science resources.
ACM aims to popularize and promote science, research and computer science education.
ACM was founded in New York on September 15, 1947 and now has over 100,000 professional and student members worldwide.
ACM cooperates with more than 170 technical meetings annually.
www.acm.org
02 Distributed and Event-Based Systems Conference 2016
10th ACM International Conference on Distributed and Event-Based Systems, Irvine, CA, 20–24 June 2016
The ACM DEBS conference provides a forum dedicated to the dissemination of original research, the discussion of practical insights, and the reporting of experiences relevant to event-based computing.
STRUCTURE: DEBS 2016 is organized into six tracks
• The Research Track
• The Industry and Experience Reports Track
• The Tutorial Track
• The Poster and Demo Track
• The Doctoral Symposium Track
• The Grand Challenge Track
www.debs2016.org
03 DEBS 2016 “Grand Challenge”
The ACM DEBS 2016 Grand Challenge is the sixth in a series of challenges that seek to provide a common ground and uniform evaluation criteria for a competition aimed at both research and industrial event-based systems.
The goal of the 2016 Grand Challenge: real-time analysis of a dynamically evolving social network
• Query 1: identify the posts that currently trigger the most activity in the social network
• Query 2: identify large communities that are currently involved in a topic
Both queries require continuous analysis of a dynamic graph, considering multiple streams that reflect graph updates.
Towards the design of a solution: Recap of EBS and DSP • Recap of SNA • Query 1 • Query 2 • Future Works and Conclusions
05 Event-Based Systems: General overview
Definitions
• An Event-Based System is a system in which the integrated components communicate by generating and receiving events or event notifications.
• An Event is a happening that is relevant to the system, e.g. a state change in some component.
• A Notification is the reification of an event within the system, and provides its description.
[Figure: Event type A, Event type B]
06 In our problem An Event is a new post (e.g. a new Tweet), a new like or comment, a new friendship relationship… …but an event causes the transfer of data between operators in a system. What is an operator?
07 Event Processing Two general styles of event processing • Complex Event Processing • Data Stream Processing
08 Complex Event Processing
Example: DATACENTER COOLING
• Multiple racks in a datacenter
• Temperature and power consumption are monitored
• Overheating or excessive power consumption raises an alert
• Intervention threshold: two consecutive alerts of the same type
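The intervention rule of the datacenter example can be sketched in a few lines. This is only an illustration, not the deck's implementation: the reading format `(rack, kind, value)`, the threshold values and the function name `detect_interventions` are assumptions introduced here.

```python
from collections import defaultdict


def detect_interventions(readings, thresholds):
    """Yield (rack, kind) when two consecutive readings of the same kind
    on the same rack both exceed the threshold for that kind."""
    previous_was_alert = defaultdict(bool)  # (rack, kind) -> was the last reading an alert?
    for rack, kind, value in readings:
        is_alert = value > thresholds[kind]
        if is_alert and previous_was_alert[(rack, kind)]:
            yield (rack, kind)
        previous_was_alert[(rack, kind)] = is_alert


# Hypothetical thresholds and readings, in arrival order.
thresholds = {"temperature": 75, "power": 50}
readings = [
    ("rack-1", "temperature", 80),  # first alert, no intervention yet
    ("rack-1", "power", 10),        # a different type does not break the sequence
    ("rack-1", "temperature", 85),  # second consecutive alert: intervention
    ("rack-1", "temperature", 60),  # back to normal
    ("rack-1", "temperature", 90),  # isolated alert
]
interventions = list(detect_interventions(readings, thresholds))
```

Note that a run of three alerts would trigger two interventions with this rule; whether that is desirable depends on the monitoring policy.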
09 Data Stream Processing: General overview
Definition: Data Stream Processing (DSP) is a paradigm for the fast processing of large volumes of data
• Near real-time extraction of information from large continuous data flows (streams)
• Low computation time
• Output streams are composed by processing the input streams
10 DSP model
A DSP application is a network of operators connected by streams, with at least one data source.
Sources are usually multiple and distributed, and generate a continuous stream of data (input).
Operators (or Processing Elements)
• Are distributed and designed to work in parallel
• Perform well-defined operations
• Transform one or more input streams into another stream
• Are stateful or stateless
The DSP application model is represented by a directed graph where the vertices are operators and the edges are streams ...the graph can be cyclic!
11 Example: best k words from a social network's posts
Words Source → Data: (Word) → Words Counter → Data: (Word, Count) → Intermediate Ranking → Data: (Best k Words) → Top Ranking → Data: (Top k Words)
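The four operators of this example can be mimicked with plain Python functions. This is a batch-style sketch under stated assumptions, not Flink code: the function names mirror the operator names, the parallelism of the counter is simulated with a list of `Counter` objects, and ties are broken alphabetically only to make the output deterministic.

```python
import heapq
import zlib
from collections import Counter


def words_source(posts):
    """Words Source: emit one (Word) per token."""
    for post in posts:
        for word in post.lower().split():
            yield word


def words_counter(words, parallelism):
    """Words Counter: each simulated instance counts only the words routed to it."""
    instances = [Counter() for _ in range(parallelism)]
    for word in words:
        # Partition by key, so a given word is always counted by the same instance.
        instances[zlib.crc32(word.encode()) % parallelism][word] += 1
    return instances


def intermediate_ranking(counter, k):
    """Intermediate Ranking: best k (Word, Count) pairs of one instance."""
    return heapq.nlargest(k, counter.items(), key=lambda item: (item[1], item[0]))


def top_ranking(partial_rankings, k):
    """Top Ranking: merge the partial rankings into the top k words."""
    merged = [item for ranking in partial_rankings for item in ranking]
    return heapq.nlargest(k, merged, key=lambda item: (item[1], item[0]))


posts = ["flink storm flink", "storm flink kinesis"]
instances = words_counter(words_source(posts), parallelism=2)
top_words = top_ranking([intermediate_ranking(c, 2) for c in instances], 2)
```

Because every word lives in exactly one counter instance, the global top k is always contained in the union of the local top k lists, which is what makes the two-level ranking correct.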
12 Infrastructures and frameworks for DSP
Dedicated Cluster
• Homogeneous nodes at "short distance"
• Number of nodes usually statically defined
Cloud and Distributed Clouds
• Dynamic allocation of nodes
• Geographically distributed nodes
• Of growing interest for DSP, but with new problems (latency, SLA, etc.)
Hybrid solutions
• Static set of nodes (extensible with on-demand resources in the cloud)
• Trade-off between performance and cost
Frameworks: Apache Storm, Apache Flink, Amazon Kinesis
13 In our problem
Input streams: Posts, Comments, Likes, Friendships
Query 1 outputs the top posts; Query 2 outputs the major community.
The infrastructure is a dedicated cluster consisting of a single machine with 8 GB of RAM, 4 cores and 20 GB of storage.
As framework we use Apache Flink.
14 Social Network and Social Network Analysis General overview Definitions Social Network (SN): A social structure composed of individuals (or organizations) interconnected by one or more specific types of interdependencies such as friendship, kinship, financial exchanges, communication exchanges, etc Social Network Analysis (SNA): The application of graph theory to understand, categorize and quantify relationships in a social network
15 Representation A Social Network is represented by an undirected graph, where individuals (or organizations) are the vertices and a relationship between two individuals is represented by the adjacency of the corresponding vertices
16 Why SNA? SNA focuses not on the attributes of the individual but on the attributes of the links (relations) between pairs or communities of individuals. Old focus: individual or company. New focus: large social networks. Web information is already data, you only need to measure it! A person's links are replicated in the social networks of the Internet (social media).
17 Example: Michele Porretta's LinkedIn Network
Marco, Student at Uniroma2
Giuseppe, Freelance Web Designer
Giacomo, Student at Uniroma2
18 Advantages of SNA on social media
• Develop a taxonomy based on the behavior of individuals or recipients of an action
• Find influencers (opinion leaders) or simple followers
• Implement methodologies for opinion mining and sentiment analysis
• Implement methodologies for decision-making purposes
...in various fields of application
• marketing strategies
• voting intentions
• analysis of social phenomena
• …
Query 1 ROADMAP Goal Dataset Challenges Design Evaluation
01 Goal: Vision and track of the first query
Identify posts that trigger the most activity in an evolving social network graph
• ENTITIES: Definition of posts and comments
• SCORING: Post score encapsulates the relevance of activity
• RANKING: Post ranking discipline and update state
• TRACK: DEBS 2016 Grand Challenge briefing
02 Input: Posts
A post is a tuple defined as follows: (ts, post_id, user_id, post, user)
2016-06-10T12:30:30.525+0000|100|1|My Post|User A
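03 Input: Comments
A comment is a tuple defined as follows: (ts, cmnt_id, user_id, cmnt, user, replied, commented), where replied is the id of the comment being replied to and commented is the id of the post being commented; only one of the two is set.
2016-06-10T12:31:30.500+0000|200|2|My Comment|User B||100
2016-06-10T12:32:30.525+0000|300|3|My Reply|User C|200|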
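A minimal parsing sketch for the two input formats follows. The class and function names are assumptions introduced for illustration, and the sketch assumes that free-text fields never contain the `|` separator.

```python
from datetime import datetime
from typing import NamedTuple, Optional

TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


class Post(NamedTuple):
    ts: datetime
    post_id: int
    user_id: int
    post: str
    user: str


class Comment(NamedTuple):
    ts: datetime
    cmnt_id: int
    user_id: int
    cmnt: str
    user: str
    replied: Optional[int]    # id of the comment being replied to, if any
    commented: Optional[int]  # id of the post being commented, if any


def _optional_id(field: str) -> Optional[int]:
    """An empty field means that the reference is absent."""
    return int(field) if field else None


def parse_post(line: str) -> Post:
    ts, post_id, user_id, post, user = line.split("|")
    return Post(datetime.strptime(ts, TS_FORMAT), int(post_id), int(user_id), post, user)


def parse_comment(line: str) -> Comment:
    ts, cmnt_id, user_id, cmnt, user, replied, commented = line.split("|")
    return Comment(datetime.strptime(ts, TS_FORMAT), int(cmnt_id), int(user_id),
                   cmnt, user, _optional_id(replied), _optional_id(commented))


comment = parse_comment("2016-06-10T12:31:30.500+0000|200|2|My Comment|User B||100")
reply = parse_comment("2016-06-10T12:32:30.525+0000|300|3|My Reply|User C|200|")
```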
04 Output: Rankings
A ranking is a tuple defined as follows: (ts, top1_post_id, top1_post_user, top1_post_score, top1_post_commenters, top2_post_id, top2_post_user, top2_post_score, top2_post_commenters, top3_post_id, top3_post_user, top3_post_score, top3_post_commenters)
2016-06-10T12:31:30.500+0000|101|User A|10|8|-|-|-|-|-|-|-|-
2016-06-10T12:31:30.500+0000|101|User A|10|8|102|User B|9|6|-|-|-|-
2016-06-10T12:31:30.500+0000|101|User A|10|8|102|User B|9|6|103|User C|8|4
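The output format, including the `-` placeholders for missing positions, can be produced as sketched below. The function name `format_ranking` and the representation of an entry as a `(post_id, user, score, commenters)` tuple are assumptions made for this illustration.

```python
def format_ranking(ts, entries, size=3):
    """Serialize a ranking line; positions without a post are filled with '-'."""
    fields = [ts]
    for position in range(size):
        if position < len(entries):
            fields.extend(str(value) for value in entries[position])
        else:
            fields.extend(["-"] * 4)  # post id, user, score, commenters
    return "|".join(fields)


line = format_ranking("2016-06-10T12:31:30.500+0000",
                      [(101, "User A", 10, 8), (102, "User B", 9, 6)])
```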
05 Scoring A new post [comment] has an initial own score of 10. The score of a post [comment] decreases by 1 every 24h since its creation. The total score of a post is the sum of its own score and the scores of all its comments.
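The scoring rule can be written down directly. This is a sketch with one labeled assumption: the slide does not say what happens after ten days, so the own score is floored at zero here. The function names are also illustrative.

```python
from datetime import datetime, timedelta

INITIAL_SCORE = 10
DECAY_PERIOD = timedelta(hours=24)


def own_score(created, now):
    """Score of a single post or comment at time `now`."""
    elapsed = max(timedelta(0), now - created)
    elapsed_periods = elapsed // DECAY_PERIOD  # whole 24h periods since creation
    return max(0, INITIAL_SCORE - elapsed_periods)  # assumption: never negative


def total_score(post_created, comments_created, now):
    """Own score of the post plus the own scores of all its comments."""
    return own_score(post_created, now) + sum(own_score(c, now) for c in comments_created)


post_created = datetime(2016, 6, 10, 12, 30, 30)
comment_created = post_created + timedelta(hours=12)
now = post_created + timedelta(hours=25)
score = total_score(post_created, [comment_created], now)  # 9 for the post, 10 for the comment
```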
06 Ranking Posts are ranked by total score. On score equality, they are ranked by timestamp. On timestamp equality, they are ranked by last comment’s timestamp. A rank is considered updated when its rank positions change.
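The ranking discipline maps naturally onto a composite sort key. The slide does not state the direction of the tie-breaks, so this sketch assumes that the more recent timestamp wins; the `RankedPost` type, its integer timestamps and the function names are likewise assumptions.

```python
from typing import NamedTuple


class RankedPost(NamedTuple):
    post_id: int
    score: int
    ts: int               # creation timestamp, e.g. epoch milliseconds
    last_comment_ts: int  # timestamp of the last comment (creation ts if none)


def top3(posts):
    """Order by total score, then post timestamp, then last comment timestamp."""
    ordered = sorted(posts, key=lambda p: (p.score, p.ts, p.last_comment_ts), reverse=True)
    return ordered[:3]


def rank_updated(old_rank, new_rank):
    """A rank is updated only when the posts occupying its positions change."""
    return [p.post_id for p in old_rank] != [p.post_id for p in new_rank]


posts = [
    RankedPost(1, 10, 100, 100),
    RankedPost(2, 10, 200, 200),
    RankedPost(3, 9, 300, 300),
    RankedPost(4, 10, 200, 250),
]
rank = top3(posts)
```

With this definition a score change that leaves the three positions untouched does not count as an update, so no new ranking line would be produced for it.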
07 Track Compute the top-3 scoring active posts. Produce a rank on every rank update. Calculate the result in a streaming fashion (no pre-calculations allowed, continuous update of the result stream).
08 Dataset: The “Big” dataset of the Grand Challenge
[Chart: 56M social events across four streams (posts, comments, likes, friendships); segment sizes 24M, 22M, 9M, 1M; segment shares 44%, 39%, 15%, 2%]
09 Challenges: A step-by-step approach towards the design
• TIMING: Operators must be synchronized on a common notion of time.
• EVENT INTERLACING: Streams of events entering the system must be globally totally ordered by time.
• GRAPH BUILDING: Every comment and reply must efficiently retrieve its root post.
• MEMORY: Memory usage must be limited to valuable data only.
• COMMUNICATION: Transmissions must be limited to valuable data only.
• SCORE/RANKING COMP.: Data structures and algorithms must be efficient and distributable.
10 Challenges: A step-by-step approach towards the design
Timing: Operators must be synchronized on a common notion of time. Time is simulated, so parallel operators learn about time only when they consume events.
Timestamp Broadcasting: The timestamp of every event entering a parallel operator is broadcast to every instance of the operator.
11 Event Interlacing
Streams of events entering the system must be globally ordered by time. Two parallel sources produce streams of events that are individually ordered by time, but the system has no guarantee about the order in which it consumes events.
Event dispatcher: A centralized operator consumes two parallel streams of events, enqueueing them according to their timestamp, thus producing a single stream of mixed events ordered by time.
[Figure: two sources emitting events 1, 3, 5 and 2, 4, 6 are merged into the single ordered stream 1, 2, 3, 4, 5, 6]
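The dispatcher amounts to a timestamp-ordered merge of streams that are already sorted individually. A minimal sketch with the standard library follows; the event representation `(ts, payload)` and the name `dispatch` are assumptions, and the real operator would of course work on unbounded streams.

```python
import heapq


def dispatch(*streams):
    """Lazily merge time-ordered streams into one time-ordered stream."""
    return heapq.merge(*streams, key=lambda event: event[0])


posts = [(1, "post"), (3, "post"), (5, "post")]
comments = [(2, "comment"), (4, "comment"), (6, "comment")]
ordered = list(dispatch(posts, comments))
```

`heapq.merge` only looks at the head of each input, so it never needs to buffer a whole stream, which matches the enqueue-by-timestamp behaviour described above.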
12 Graph Building
Every comment and reply must efficiently retrieve its root post. Comments come with a post reference, but replies come with only a comment reference.
Comment Mapper: A centralized operator consumes the streams of posts and comments, producing a stream of posts and of comments mapped to their root posts.
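One way to resolve the root post in constant time is to remember, for each comment already seen, the post it ultimately belongs to. The sketch below illustrates the idea and is not the project's operator; the class and method names are assumptions.

```python
class CommentMapper:
    """Maps every comment or reply to the id of its root post."""

    def __init__(self):
        self._root_of = {}  # comment id -> root post id

    def map(self, cmnt_id, replied, commented):
        """Return the root post id, or None if the parent is unknown."""
        if commented is not None:
            root = commented                  # direct comment on a post
        else:
            root = self._root_of.get(replied)  # reply: inherit the parent's root
        if root is not None:
            self._root_of[cmnt_id] = root
        return root


mapper = CommentMapper()
root_of_comment = mapper.map(200, None, 100)         # comment on post 100
root_of_reply = mapper.map(300, 200, None)           # reply to comment 200
root_of_nested_reply = mapper.map(400, 300, None)    # reply to a reply
```

Since events arrive in timestamp order, the parent of a reply has always been mapped before the reply itself, so a single dictionary lookup is enough regardless of the depth of the thread.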
13 Memory
Memory usage must be limited to valuable data only. Expired posts and their comments are potentially unbounded in number, and they are no longer valuable data.
Feedback Stream: Expired posts are removed by feeding their id back to the operator that stores them.
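The effect of the feedback stream on the storing operator can be sketched as follows. The `PostStore` class and its methods are hypothetical names used only to illustrate the idea: the id of an expired post comes back, and the post is dropped together with everything attached to it.

```python
class PostStore:
    """Hypothetical storing operator: keeps only active posts and their comments."""

    def __init__(self):
        self.comments_of = {}  # post id -> ids of the comments mapped to it

    def add_post(self, post_id):
        self.comments_of[post_id] = set()

    def add_comment(self, cmnt_id, post_id):
        """Store the comment only if its root post is still active."""
        if post_id not in self.comments_of:
            return False
        self.comments_of[post_id].add(cmnt_id)
        return True

    def on_feedback(self, expired_post_id):
        """Feedback stream handler: drop an expired post and its comments."""
        self.comments_of.pop(expired_post_id, None)

    def size(self):
        return len(self.comments_of) + sum(len(c) for c in self.comments_of.values())


store = PostStore()
store.add_post(100)
store.add_comment(200, 100)
size_before = store.size()
store.on_feedback(100)       # the ranking stage reports that post 100 expired
size_after = store.size()
```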
14 Communication
Transmissions must be limited to valuable data only. A post [comment] tuple contains potentially large fields not involved in the rank computation.
Post: (ts, post_id, user_id, post, user), of which only (ts, post_id, user) is needed
Comment: (ts, cmnt_id, user_id, cmnt, user, replied, commented), of which only (ts, cmnt_id, user_id, replied, commented) is needed
Compact Tuples: Provide one interpretable tuple type, containing only the attributes involved in the rank computation: (ts, event_id, user_id, user, commented)
15 Score/Rank Computation
Data structures and algorithms must be efficient and distributable.
2-Step Filtered Ranking: A parallel operator produces partial ranks, which are then merged by a centralized operator into a global rank. Only posts that could induce a local rank update are consumed.
Expirations Counter: Score components are stored in a circular buffer that computes the total score, decreased once every 24h.
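The two-step ranking can be sketched with two small classes: parallel instances emit a partial rank only when it changes, and a central merger combines the latest partial ranks. This is an illustration under assumptions: the class names are invented here, entries are reduced to `(post_id, score)` pairs, and ties are broken by post id rather than by the timestamp rules of the ranking slide.

```python
import heapq


def top_k(scored_posts, k):
    """Best k (post_id, score) pairs, highest score first."""
    return heapq.nlargest(k, scored_posts, key=lambda entry: (entry[1], entry[0]))


class PartialRanker:
    """One parallel instance: ranks only the posts routed to it."""

    def __init__(self, k=3):
        self.k, self.scores, self.rank = k, {}, []

    def update(self, post_id, score):
        """Return the new partial rank, or None if it did not change."""
        self.scores[post_id] = score
        new_rank = top_k(self.scores.items(), self.k)
        if new_rank == self.rank:
            return None  # filtered: nothing to send downstream
        self.rank = new_rank
        return new_rank


class GlobalRanker:
    """Centralized operator: merges the latest partial rank of every instance."""

    def __init__(self, k=3):
        self.k, self.partials, self.rank = k, {}, []

    def merge(self, instance, partial_rank):
        """Return the new global rank, or None if it did not change."""
        self.partials[instance] = partial_rank
        merged = [entry for rank in self.partials.values() for entry in rank]
        new_rank = top_k(merged, self.k)
        if new_rank == self.rank:
            return None
        self.rank = new_rank
        return new_rank


p0, p1, merger = PartialRanker(), PartialRanker(), GlobalRanker()
first = merger.merge(0, p0.update(1, 10))
second = merger.merge(1, p1.update(2, 20))
third = merger.merge(0, p0.update(3, 5))
```

The filtering keeps traffic towards the centralized operator proportional to the number of rank changes rather than to the number of events.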
16 Design The architecture of the first query
17 Evaluation: Latency In the steady state, the average per-tuple latency follows a sub-linear trend, thanks to the parallelization of the score/rank operators and the restricted ranking calculations. The latency could be further decreased by means of some planned key.
18 Evaluation Memory In the steady state, the average memory usage is kept consistently below a satisfactory threshold by the feedback stream and the minimized state of operators.
Query 2 Track Architecture Solution Bron-Kerbosch Algorithm Evaluation Conclusions
Query 2 Track
01 Query 2 Track
• Scope
• To find trends and topics shared between entire communities of people.
• Why? To target a group with specific solutions which may be of interest to it (e.g., advertisements)
• Track
• To address the change of interests within large communities.
• Focus on the dynamic change of query results over time, i.e., the query calls for a continuous evaluation of the results.
• Goal
• Given an integer k and a duration d (in seconds), find the k comments with the largest range, where the range of a comment is the set of nodes defined by persons who
• have liked that comment (see likes, comments)
• where the comment was created not more than d seconds ago
• know each other.
02 Query 2 Track Clique • A clique is a subset of vertices of an undirected graph such that its induced subgraph is complete; that is, every two distinct vertices in the clique are adjacent
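The roadmap of this section names the Bron-Kerbosch algorithm, which enumerates the maximal cliques of an undirected graph. The sketch below shows the classic variant with pivoting and uses it to measure the largest group of likers who all know each other. It is not the project's code: the function names, the input shapes (a collection of user ids and a list of friendship pairs) and the choice to return the clique size are assumptions.

```python
def bron_kerbosch(r, p, x, adj):
    """Yield every maximal clique that extends r using candidates p, excluding x."""
    if not p and not x:
        yield r
        return
    # Pivot on the vertex with most neighbours among the candidates to prune branches.
    pivot = max(p | x, key=lambda v: len(adj[v] & p))
    for v in list(p - adj[pivot]):
        yield from bron_kerbosch(r | {v}, p & adj[v], x & adj[v], adj)
        p = p - {v}
        x = x | {v}


def comment_range(likers, friendships):
    """Size of the largest clique among the persons who liked a comment."""
    likers = set(likers)
    adj = {user: set() for user in likers}
    for a, b in friendships:
        if a in likers and b in likers and a != b:
            adj[a].add(b)  # friendship is symmetric: undirected edge
            adj[b].add(a)
    cliques = bron_kerbosch(set(), set(likers), set(), adj)
    return max((len(clique) for clique in cliques), default=0)


likers = [1, 2, 3, 4, 5]
friendships = [(1, 2), (2, 3), (1, 3), (3, 4), (6, 1)]  # user 6 did not like the comment
largest = comment_range(likers, friendships)
```

Here users 1, 2 and 3 are pairwise friends, so they form the largest clique; user 4 only knows user 3, and user 5 knows nobody among the likers.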
03 Query 2 Track
Query 2 Architecture
05 Query 2 Architecture
06 Redis Database: NoSQL, in-memory, key-value
Query 2 Solution
07 Query 2 Solution