A Grid-Based Middleware’s Support for Processing Distributed Data Streams
Liang Chen
Advisor: Gagan Agrawal
Computer Science & Engineering
Introduction-Motivation
• Data stream processing and analysis
  • Data stream: data arrive continuously and need to be processed in real time
• Data stream applications:
  • Online network intrusion detection
  • Sensor networks
  • Network fault management system for telecommunication network elements
  • Computer-vision-based surveillance
• Common features of data streams
  • Continuous arrival
  • Enormous volume
  • Real-time constraints
  • Data sources could be distributed
Introduction-Motivation
• Network Fault Management System analyzing alarm message streams
[Figure: a switch network (one element marked with an X) and the Network Fault Management System]
Introduction-Motivation Computer Vision Based Surveillance
Introduction-Motivation
• Challenges and possible solutions
  • Challenge 1: data- and/or computation-intensive
    • Solution: Grid computing technologies
  • Challenge 2: real-time analysis is required
    • Solution: self-adaptation functionality is desired
Introduction-Motivation
• From the point of view of developers who are interested in data stream applications
  • They would like to concentrate on the applications themselves
  • They would rather not spend effort on
    • Grid computing
    • Adaptation functionality
Introduction-Our Approach
• A middleware that is based on Grid standards and tools and provides self-adaptation functionality
• The middleware is referred to as GATES (Grid-based AdapTive Execution on Streams)
• Applications are automatically distributed to proper computing nodes
• Applications adapt automatically to a varying environment, without developers implementing the adaptation algorithms themselves
System Architecture and Design (From Application Perspective)
• Break a task down into several sub-tasks so that the sub-tasks form a pipeline
• Implement each sub-task in Java
• Write an XML configuration file so that the sub-tasks can be deployed automatically, i.e.,
  • specify how many stages (sub-tasks) the pipeline has
  • specify where the code implementing the sub-tasks resides
• Launch the application by running a Java program (StreamClient.class) provided by GATES
System Architecture and Design (Architecture)
[Figure: a pipeline of Stage A, Stage B, and Stage C. Legend: buffers for applications; queues between Grid services; Grid services of GATES; stages of an application]
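The pipeline in the architecture figure can be pictured with a small stand-alone Java sketch. It is not GATES code: the class `PipelineSketch`, the `END` marker, and the three arithmetic "stages" are all assumptions made for illustration, and plain in-process queues stand in for the queues between Grid services.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.UnaryOperator;

public class PipelineSketch {
    // Hypothetical end-of-stream marker; a real data stream has no end.
    private static final Integer END = Integer.MIN_VALUE;

    // Starts a thread that takes items from `in`, applies `work`, and puts results on `out`.
    static Thread startStage(BlockingQueue<Integer> in, BlockingQueue<Integer> out,
                             UnaryOperator<Integer> work) {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    Integer item = in.take();
                    if (item.equals(END)) {
                        out.put(END); // pass the marker downstream and stop
                        return;
                    }
                    out.put(work.apply(item));
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.start();
        return t;
    }

    // Pushes `source` through stages A, B, C and collects what leaves stage C.
    public static List<Integer> run(List<Integer> source) {
        try {
            BlockingQueue<Integer> toA = new LinkedBlockingQueue<>();
            BlockingQueue<Integer> aToB = new LinkedBlockingQueue<>();
            BlockingQueue<Integer> bToC = new LinkedBlockingQueue<>();
            BlockingQueue<Integer> fromC = new LinkedBlockingQueue<>();
            startStage(toA, aToB, x -> x + 1);   // Stage A (made-up work)
            startStage(aToB, bToC, x -> x * 2);  // Stage B (made-up work)
            startStage(bToC, fromC, x -> x - 3); // Stage C (made-up work)
            for (Integer x : source) {
                toA.put(x);
            }
            toA.put(END);
            List<Integer> result = new ArrayList<>();
            while (true) {
                Integer item = fromC.take();
                if (item.equals(END)) {
                    return result;
                }
                result.add(item);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run(List.of(1, 2, 3))); // prints [1, 3, 5]
    }
}
```

Each stage only sees its input and output queue, which is what lets a middleware place the stages on different nodes without changing the stage code.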
System Architecture and Design (Example)
    public class Sampling-Stage implements StreamProcessing {
        …
        void init() { … }
        …
        void work(buffer in, buffer out) {
            …
            while (true) {
                Image img = get-from-buffer-in-GATES(in);
                Image img-sample = Sampling(img, sampling-ratio);
                put-to-buffer-in-GATES(img-sample, out);
            }
            …
        }
    }
Callouts on the slide:
    GATES.Information-About-Adjustment-Parameter(min, max, 1)
    sampling-ratio = GATES.getSuggestedParameter();
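The slide's example uses descriptive placeholder names rather than valid Java identifiers, so it cannot be compiled as shown. The following is a minimal, self-contained sketch of the same idea, assuming a stage that re-reads a suggested sampling ratio before processing each image. Everything here is hypothetical: `SamplingStageSketch`, the `int[]` stand-in for an image, and the `DoubleSupplier` that plays the role of `GATES.getSuggestedParameter()` are not part of the GATES API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.DoubleSupplier;

public class SamplingStageSketch {
    // Keeps about `ratio` of the pixels, taken at evenly spaced positions.
    // An int[] stands in for the slide's Image type (assumption).
    public static int[] sampling(int[] image, double ratio) {
        if (image.length == 0) {
            return new int[0];
        }
        double clamped = Math.max(0.0, Math.min(1.0, ratio));
        int kept = Math.max(1, (int) Math.round(image.length * clamped));
        int[] out = new int[kept];
        for (int i = 0; i < kept; i++) {
            out[i] = image[(int) ((long) i * image.length / kept)];
        }
        return out;
    }

    // Processes every image with the ratio suggested at that moment.
    // `suggested` is a stand-in for the middleware's suggested parameter value.
    public static List<int[]> work(List<int[]> in, DoubleSupplier suggested) {
        List<int[]> out = new ArrayList<>();
        for (int[] img : in) {
            double samplingRatio = suggested.getAsDouble();
            out.add(sampling(img, samplingRatio));
        }
        return out;
    }

    public static void main(String[] args) {
        int[] img = {0, 1, 2, 3, 4, 5, 6, 7};
        int[] half = sampling(img, 0.5);
        System.out.println(java.util.Arrays.toString(half)); // prints [0, 2, 4, 6]
    }
}
```

Because the ratio is read inside the loop, the stage picks up a changed value on the next image without any adaptation logic of its own.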
Self-adaptation Algorithm
• Given a queue’s long-term factor at each stage, we want to improve the method of adjusting the value of an adaptation parameter
  • Should the adaptation parameter be modified, and if so, in which direction?
  • How do we find a new value for (update) the adaptation parameter?
Enhanced Self-adaptation Algorithm
• Should the adaptation parameter be modified, and if so, in which direction?
  • The answer depends on the load status of the queues at two consecutive stages
Enhanced Self-adaptation Algorithm
[Figure: pipelines of stages A, B, and C in different load states, labeled with the performance parameter and grouped into convergent states and non-convergent states]
Enhanced Self-adaptation Algorithm Summary of Load States
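The table of load states did not survive in this transcript, so the actual GATES rule is not reproduced here. As a rough illustration only, the sketch below assumes three load states derived from a queue's long-term factor and an invented decision rule over the queues of two consecutive stages. The thresholds, the enum names, and the rule itself are assumptions, not the deck's algorithm.

```java
public class DirectionSketch {
    public enum Load { UNDER, NORMAL, OVER }
    public enum Direction { DECREASE, KEEP, INCREASE }

    // Maps a long-term load factor to a load state using two assumed thresholds.
    public static Load classify(double factor, double low, double high) {
        if (factor < low) {
            return Load.UNDER;
        }
        if (factor > high) {
            return Load.OVER;
        }
        return Load.NORMAL;
    }

    // Invented rule: back off if either queue is overloaded,
    // push harder only if both are underloaded, otherwise leave the value alone.
    public static Direction decide(Load current, Load next) {
        if (current == Load.OVER || next == Load.OVER) {
            return Direction.DECREASE;
        }
        if (current == Load.UNDER && next == Load.UNDER) {
            return Direction.INCREASE;
        }
        return Direction.KEEP;
    }

    public static void main(String[] args) {
        Load here = classify(0.9, 0.2, 0.8);  // OVER
        Load there = classify(0.5, 0.2, 0.8); // NORMAL
        System.out.println(decide(here, there)); // prints DECREASE
    }
}
```

The point of looking at two queues rather than one is that a stage can appear idle only because the stage after it is the bottleneck.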
Enhanced Self-adaptation Algorithm
• How to determine the new value for the adaptation parameter
  • Linear update: increase or decrease by a fixed value
    • Hard to find a proper fixed value
    • Previous method
  • Binary tree search
Enhanced Self-adaptation Algorithm
[Figure: the range of the adaptation parameter before and after an update, marked with the left border, current value, new value, and right border]
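A binary-search style update with a left border, a current value, and a right border could look like the sketch below. It assumes the new value is the midpoint between the current value and the border in the chosen direction, and that the current value then becomes the opposite border. The class name `BinaryUpdateSketch` and these exact update steps are assumptions; the slides do not spell them out.

```java
public class BinaryUpdateSketch {
    private double left;    // smallest value still considered
    private double right;   // largest value still considered
    private double current; // value currently in use

    public BinaryUpdateSketch(double min, double max, double initial) {
        this.left = min;
        this.right = max;
        this.current = initial;
    }

    // direction > 0 moves toward the right border, direction < 0 toward the left,
    // and 0 keeps the current value. Returns the value to use next.
    public double update(int direction) {
        if (direction > 0) {
            left = current;                   // values below the old current are ruled out
            current = (current + right) / 2.0;
        } else if (direction < 0) {
            right = current;                  // values above the old current are ruled out
            current = (current + left) / 2.0;
        }
        return current;
    }

    public static void main(String[] args) {
        BinaryUpdateSketch p = new BinaryUpdateSketch(0.0, 1.0, 0.5);
        System.out.println(p.update(1));  // prints 0.75
        System.out.println(p.update(-1)); // prints 0.625
    }
}
```

Compared with a fixed-step linear update, the step size here shrinks as the interval narrows, so no step size has to be chosen in advance. A long-running system would also need some way to widen the borders again when conditions change, which this sketch leaves out.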
Data Mining Applications & System Evaluation
• Two data mining applications
  • CluStream: clustering data arriving in data streams
  • Dist-Freq-Counting: finding frequent itemsets from distributed streams
Resource Allocation Schemes
• Problem definition
  • Grid resource scheduling for pipelined processing and real-time distributed streaming applications
  • Mapping workflows onto a Grid is an NP-complete problem
  • Static part: the resource allocation problem for GATES is to determine a deployment configuration
  • Dynamic part
Static Allocation Scheme
• Static allocation problem: determining a deployment configuration
• Objective: automatically generate a deployment configuration from the information about available resources
  • The number of data sources and their locations
  • The destination
  • The number of stages that make up the pipeline
  • The number of instances of each stage
  • How the instances connect to each other
  • The node where each instance is placed
Static Allocation Scheme Examples of deployment configurations
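The example configurations from this slide are not preserved in the transcript. As a stand-in, the sketch below shows one way the items listed for a deployment configuration could be held in a data structure. The class and field names and the host names are invented for illustration and do not reflect the GATES configuration format.

```java
import java.util.List;

public class DeploymentConfigSketch {
    // One running copy of a stage, placed on a node and wired to downstream instances.
    public static final class Instance {
        final int stage;                // index of the stage this instance runs
        final String node;              // node where the instance is placed
        final List<Integer> downstream; // indices (into Config.instances) it sends to

        Instance(int stage, String node, List<Integer> downstream) {
            this.stage = stage;
            this.node = node;
            this.downstream = downstream;
        }
    }

    public static final class Config {
        final List<String> sources;     // locations of the data sources
        final String destination;       // where final results are delivered
        final int stages;               // number of stages in the pipeline
        final List<Instance> instances; // all instances of all stages

        Config(List<String> sources, String destination, int stages, List<Instance> instances) {
            this.sources = sources;
            this.destination = destination;
            this.stages = stages;
            this.instances = instances;
        }

        // Counts how many instances were created for the given stage.
        public int instancesOf(int stage) {
            int count = 0;
            for (Instance i : instances) {
                if (i.stage == stage) {
                    count++;
                }
            }
            return count;
        }
    }

    // Made-up example: two sources, one first-stage instance near each source,
    // and a single second-stage instance at the destination.
    public static Config example() {
        List<Instance> instances = List.of(
            new Instance(0, "node-near-source-1", List.of(2)),
            new Instance(0, "node-near-source-2", List.of(2)),
            new Instance(1, "node-at-destination", List.of()));
        return new Config(List.of("source-1", "source-2"), "destination", 2, instances);
    }

    public static void main(String[] args) {
        Config c = example();
        System.out.println(c.instancesOf(0) + " " + c.instancesOf(1)); // prints 2 1
    }
}
```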
Related Work
• Grid resource allocation
  • Condor
  • Realtor
  • ACDS, etc.
  • Main difference: our work focuses on Grid resource allocation for workflow applications
• Adaptation through a middleware
  • Cheng et al.’s adaptation framework
  • SWiFT
  • Conductor
  • DART
  • ROAM
  • Main difference: our work focuses on general support for adaptation at run time
Summary
• Grid computing could be an effective solution for distributed data stream processing
• GATES
  • Distributed processing
  • Exploits Grid web services
  • Self-adaptation to meet real-time constraints
  • Grid resource allocation schemes