PAPER PRESENTATION on An Efficient and Resilient Approach to Filtering & Disseminating Streaming Data CMPE 521 Database Systems Prepared by: Mürsel Taşgın Onur Kardeş
Introduction • The internet and the web are increasingly used to disseminate fast-changing data. • Examples of fast-changing data: • sensors, • traffic and weather information, • stock prices, • sports scores, • health monitoring information
Introduction • The properties of this data: • highly dynamic, • streaming, • aperiodic. • Users are interested not only in monitoring streaming data but also in using it for on-line decision making.
Introduction Replicating the Source [Figure: a SOURCE replicated at Repository 1, Repository 2, and Repository 3]
Introduction • Services like Akamai.net and IBM’s edge server technology are examples of such networks of repositories, which aim to provide better services by shifting most of the work to the edge of the network (closer to the end users). • Although such systems scale quite well, if the data changes at a fast rate, the quality of service at a repository farther from the data source deteriorates.
Introduction • In general: • Replication can reduce the load on the sources. • But replication of time-varying data introduces new challenges: • coherency, • delays and scalability.
Introduction • Coherency requirement (cr): users specify the bound on the tolerable imprecision associated with each requested data item. [Figure: the SOURCE holds Microsoft at $60.85 (time 11:43); Repository 1, serving USER 1, holds Microsoft at $60.89 (time 11:36); Repository 2, serving USER 2, holds Microsoft at $60.86 (time 11:41)]
Introduction • A coherency-preserving system: • must deliver data that preserves the associated coherency requirements, • must be resilient to failures, • must be efficient. • Necessary changes are pushed to the users, instead of the users polling the source independently.
Introduction Construction of an effective dissemination network of repositories • A logical overlay network of repositories is created according to: • the coherency needs of the users attached to each repository, • the expected delays at each repository. • This network is called the dynamic data dissemination graph (d3g).
Introduction Construction of an effective dissemination network of repositories • The previous algorithm for building a d3g, LeLA, was unable to cope with a large number of data items. • A new algorithm (DiTA) is introduced to build dissemination networks that are scalable and resilient.
Introduction Construction of an effective dissemination network of repositories • In DiTA, repositories with more stringent coherency requirements are placed closer to the source in the network, as they are likely to get more updates than the ones with looser coherency requirements. • In DiTA, a dynamic data dissemination tree, d3t, is created for each data item x.
Introduction Construction of an effective dissemination network of repositories [Figure: example dissemination tree rooted at SOURCE, with Repository 1 (c = 0.2), Repository 2 (c = 0.3), Repository 3 (c = 0.8), Repository 4 (c = 0.7), Repository 5 (c = 0.9), and Repository 6 (c = 0.7)]
Introduction Provision for the dissemination of dynamic data in spite of failures in the overlay network • To handle repository and communication link failures, back-up parents are used. • A back-up parent is asked to deliver data with a coherency that is less stringent than that associated with the parent.
Introduction Provision for the dissemination of dynamic data in spite of failures in the overlay network [Figure: a repository served by a Parent and a Back-up Parent; nodes are labeled with sets of data items (x,y,z,t; a,b,c,x; x,t; y,z,t; z)]
Introduction Efficient filtering and scheduling techniques for repositories • Normally, a repository receives updates and selectively disseminates them to its downstream dependents. • It is not always necessary to disseminate the exact values of the most recent updates, as long as the values presented preserve the coherency of the data.
The Basic Framework: Data Coherency and Overlay Network • A coherency requirement (c) is associated with a data item x to denote the maximum permissible deviation of the user’s view from the value of x at the source. • c can be specified in terms of: • time (e.g., values should never be out of sync by more than 5 s), • value (e.g., weather information where the temperature value should never be out of sync by more than 2 degrees).
The Basic Framework: Data Coherency and Overlay Network • \( |U_x(t) - S_x(t)| \le c_1 \) and \( |P_x(t) - S_x(t)| \le c_2 \), where \( U_x(t) \), \( P_x(t) \), and \( S_x(t) \) are the values of x at the user, the repository, and the source at time t. • Each data item in the repository from which a user obtains data must be refreshed in such a way that the user-specified coherency requirements are maintained. • The fidelity f observed by a user can be defined as the total length of time for which the above inequality holds.
The Basic Framework: Data Coherency and Overlay Network • Assume x is served by a single source. • Repositories R1, ..., Rn are interested in x. • The source directly serves some of these repositories. • These repositories in turn serve a subset of the remaining repositories, such that the resulting network is in the form of a tree rooted at the source and consisting of repositories R1, ..., Rn. • This defines a parent-dependent relationship between repositories.
The Basic Framework: Data Coherency and Overlay Network • Since a repository disseminates updates to its users and dependents, its coherency requirement should be the most stringent requirement that it has to serve. • When a data change occurs at the source, the source checks which of its direct and indirect dependents are interested in the change and pushes the change to them.
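The "most stringent requirement" rule can be illustrated with a minimal sketch. The tree layout, the `own_cr` values, and the function name `effective_cr` are hypothetical and chosen only for illustration; a smaller c is taken to mean a more stringent requirement.

```python
def effective_cr(tree, own_cr, node):
    """Return the most stringent (smallest) c that `node` has to serve.

    tree   : dict mapping a repository to its direct dependents
    own_cr : dict mapping a repository to the tightest c of its own users
    """
    candidates = [own_cr[node]]
    for child in tree.get(node, []):
        # A parent must satisfy everything its dependents need, recursively.
        candidates.append(effective_cr(tree, own_cr, child))
    return min(candidates)


# Hypothetical example: R1 serves R2 and R3, and R2 serves R4.
tree = {"R1": ["R2", "R3"], "R2": ["R4"]}
own_cr = {"R1": 0.5, "R2": 0.7, "R3": 0.3, "R4": 0.1}
```

In this example R2's own users tolerate 0.7, but R2 has to receive x at 0.1 because its dependent R4 needs it that tightly, and the same requirement propagates up to R1.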
Building a d3t • Start with a physical layout of the communication network in the form of a graph, where the graph consists of a set of sources, repositories, and the underlying network. • Try to build a d3t for a data item x. • The root of the d3t is the source that serves x. • A repository P serving repository Q with data item x is called the parent of Q, and Q is called the dependent of P for x.
Building a d3t [Figure: a d3t for data item x, with the source at Level 0 and repositories at Levels 1 and 2 (R1 as a parent, R2 as its dependent); users are attached to the repositories]
Building a d3t • A repository should ideally serve at least as many unique (dependent, data item) pairs as the number of data items served to it. • If a repository is currently serving fewer than this fixed number, then we say that the repository has the resources to serve a new dependent. [Table: pairs served by R1 as (dependent, data item): (R7, x), (R11, y), (R18, x), (R9, z), (R10, t); candidate pair: (R21, x)?]
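The resource test described above can be sketched as a simple count. The function name and the set of items served to R1 are assumptions made for illustration; the slide does not say which items R1 receives, so the answer printed here is not the slide's answer.

```python
def has_resources(served_pairs, items_received):
    """True if the repository serves fewer unique (dependent, item) pairs
    than the number of data items served to it (the fixed limit)."""
    return len(set(served_pairs)) < len(set(items_received))


# Pairs taken from the slide's table for R1.
r1_pairs = [("R7", "x"), ("R11", "y"), ("R18", "x"), ("R9", "z"), ("R10", "t")]

# Hypothetical: assume six data items are served to R1.
r1_items = {"x", "y", "z", "t", "u", "v"}

can_accept_r21 = has_resources(r1_pairs, r1_items)
```

With five pairs served and six items received, `can_accept_r21` is true under this assumption; with only four items received it would be false.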
Building a d3t [Figure: DiTA insertion example. The tree rooted at SOURCE contains R4 (c = 0.1), R5 (c = 0.4), R6 (c = 0.5), R7 (c = 0.8), R8 (c = 0.6), and R9 (c = 0.7), with Max(c) labels on the subtrees. The request of R10 (c = 0.3) is passed down while each visited node is asked "Enough resources?". Because cR6 > cR10, R10 takes the place of R6 and R6 is pushed down.]
Building a d3t • This algorithm is called the Data-item-at-a-Time Algorithm (DiTA). [Figure: the tree after the insertion, with R10 (c = 0.3) in the former position of R6 and R6 (c = 0.5) one level lower, next to R7 (c = 0.8), R8 (c = 0.6), and R9 (c = 0.7)]
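The insertion step shown in the figures can be sketched roughly as follows. This is a simplified illustration, not the algorithm as published: the fixed `capacity` per node stands in for the "enough resources?" test, and forwarding the request to the smallest child subtree is an assumption that replaces whatever subtree-selection rule the paper uses.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Repo:
    name: str
    c: float                 # coherency requirement for x; smaller = more stringent
    capacity: int = 2        # hypothetical cap on direct dependents ("resources")
    children: list = field(default_factory=list)


def subtree_size(node):
    return 1 + sum(subtree_size(ch) for ch in node.children)


def insert(parent, new):
    """Place `new` somewhere below `parent` (requires parent.c <= new.c)."""
    if len(parent.children) < parent.capacity:       # enough resources?
        parent.children.append(new)
        return
    # Assumption: forward the request to the smallest child subtree.
    child = min(parent.children, key=subtree_size)
    if new.c < child.c:
        # `new` is more stringent: it takes the child's place and the
        # child is pushed down below it.
        parent.children[parent.children.index(child)] = new
        new.children, child.children = child.children, []
        insert(new, child)
    else:
        insert(child, new)


def stringent_first(node):
    """True if no dependent is more stringent than its parent."""
    return all(node.c <= ch.c and stringent_first(ch) for ch in node.children)


# The source is modelled as a node with c = 0 (it holds the exact value).
source = Repo("SOURCE", 0.0)
for name, c in [("R7", 0.8), ("R8", 0.6), ("R6", 0.5), ("R9", 0.7),
                ("R5", 0.4), ("R4", 0.1), ("R10", 0.3)]:
    insert(source, Repo(name, c))
```

Whatever the insertion order, repositories with tighter requirements end up closer to the source, which is the property the slides attribute to DiTA.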
Building a d3t Traces – Collection procedure and characteristics • Real-world stock price streams from http://finance.yahoo.com are used. • 10,000 values are polled during 1,000 traces; a new data value is obtained approximately once per second.
Building a d3t Repositories – Data, Coherency and Cooperation characteristics • A coherency requirement c is associated with each of the chosen data items. • The c’s associated with data in a repository are a mix of stringent tolerances (varying from $0.01 to $0.05) and less stringent tolerances (varying from $0.5 to $0.99). • T% of the data items have stringent coherency requirements at each repository (the remaining (100 – T)% of data items have less stringent coherency requirements).
Building a d3t Physical Network – topology and delays • The router topology was generated using BRITE (http://www.cs.bu.edu/brite). • The repositories and the sources are selected randomly. • Node-to-node communication delays are derived from a Pareto distribution: \( x \mapsto \frac{1}{x^{1/\alpha}} + x_1 \), where \( \alpha = \frac{x'}{x' - 1} \).
Building a d3t Physical Network – topology and delays • x' is the mean delay and x1 is the minimum delay a link can have. • In the experiments, x' = 15 ms and x1 = 2 ms. • The computational delay for dissemination is taken to be 12.5 ms.
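One way to read the delay formula is as inverse-transform sampling, with x drawn uniformly from (0, 1]. That reading is an assumption, as is the function name `link_delay_ms`; the sketch below only shows how delays could be generated under it.

```python
import random


def link_delay_ms(mean_ms=15.0, min_ms=2.0, rng=random):
    """Draw one link delay as 1 / u**(1/alpha) + x1, with u uniform in (0, 1]."""
    alpha = mean_ms / (mean_ms - 1.0)      # alpha = x' / (x' - 1)
    u = 1.0 - rng.random()                 # rng.random() is in [0, 1), so u is in (0, 1]
    return 1.0 / u ** (1.0 / alpha) + min_ms


rng = random.Random(0)                     # seeded only to make the example repeatable
delays = [link_delay_ms(rng=rng) for _ in range(1000)]
```

Because alpha is only slightly above 1 for x' = 15, the distribution is heavy-tailed: most delays are small, with occasional very large values.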
Building a d3t Metrics • The key metric is the loss in fidelity of the data. • Fidelity is the total length of time for which the inequality \( |P(t) - S(t)| \le c \) holds, expressed as a percentage of the observation time. • The fidelity of a repository is the mean over all data items stored in that repository. • The fidelity of the system is the mean fidelity of all repositories. • The loss of fidelity is (100% - fidelity). • Another metric is the number of messages in the system (system load).
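The metrics above can be computed from sampled traces as in this minimal sketch. It assumes equally spaced samples, so that "length of time" becomes a sample count; the function names and the numbers are made up for illustration.

```python
def fidelity(source_values, repo_values, c):
    """Percentage of samples for which |P(t) - S(t)| <= c holds."""
    ok = sum(1 for s, p in zip(source_values, repo_values) if abs(p - s) <= c)
    return 100.0 * ok / len(source_values)


def mean(values):
    return sum(values) / len(values)


# Hypothetical trace: the repository lags behind the source at the third sample.
s_x = [10.0, 10.2, 10.5, 10.4]
p_x = [10.0, 10.0, 10.0, 10.4]

item_fidelity = fidelity(s_x, p_x, c=0.3)            # one data item
repo_fidelity = mean([item_fidelity, 100.0])         # mean over the repository's items
loss_of_fidelity = 100.0 - repo_fidelity
```

The system fidelity would be obtained the same way, by applying `mean` to the per-repository values.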
Building a d3t Performance Evaluation • For the base performance measurement, 600 routers, 100 repositories, and 4 servers were used. • The total number of data items served by the servers was varied from 100 to 1000. • The T parameter was varied from 20 to 80. • A previous algorithm, LeLA, was used as a benchmark.
Building a d3t Performance Evaluation • Each node in DiTA does less work than in LeLA. • Thus, the dissemination tree is taller in DiTA. • So, when computational delays are low but link delays are large, LeLA may perform better. • But this happens only for negligible computational delays (0.5 ms) and very high link delays (110 ms).
Enhancing the Resiliency of the Repository Network • Active backups vs. passive backups • Passive backups may increase the load, which causes a loss in fidelity. • So active backup parents are used. • A backup parent B serves data to a dependent Q with a coherency cB > c, where c is the coherency at which Q is served by its parent P.
Enhancing the Resiliency of the Repository Network • If all changes are less than cB, the dependent cannot tell when parent P fails, so P should send periodic “I’m alive” messages. • Once P fails, Q requests B to serve it the data at c. When P recovers from the failure, Q requests B to serve the data item at cB again. • In this approach, there is no backup for backups, so when both P and B fail, Q cannot get any updates.
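The switch between c and cB can be sketched as a small state holder on the dependent Q. The class name, the `timeout` parameter, and the use of a timeout to detect missing "I'm alive" messages are assumptions; the slides only say that the messages are periodic.

```python
class Dependent:
    """Tracks which coherency Q currently requests from its backup parent B."""

    def __init__(self, c, c_backup, timeout):
        self.c = c                    # coherency served by parent P
        self.c_backup = c_backup      # looser coherency cB > c served by B
        self.timeout = timeout        # hypothetical silence threshold
        self.last_heard = 0.0
        self.parent_alive = True
        self.backup_cr = c_backup     # what Q currently asks B for

    def on_parent_message(self, now):
        """Called for every update or "I'm alive" message from P."""
        self.last_heard = now
        if not self.parent_alive:
            self.parent_alive = True
            self.backup_cr = self.c_backup    # P recovered: relax B back to cB

    def tick(self, now):
        """Called periodically to detect a silent parent."""
        if self.parent_alive and now - self.last_heard > self.timeout:
            self.parent_alive = False
            self.backup_cr = self.c           # P failed: ask B to serve at c


q = Dependent(c=0.1, c_backup=0.2, timeout=5.0)
q.on_parent_message(1.0)
q.tick(3.0)
state_while_alive = q.backup_cr
q.tick(7.0)
state_after_failure = q.backup_cr
q.on_parent_message(8.0)
state_after_recovery = q.backup_cr
```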
Enhancing the Resiliency of the Repository Network Choice of cB Using a Probabilistic Model • For the sake of simplicity, cB = k * c. • Here, the choice of k is important: • if k is small, the backup will send updates frequently, which incurs high computational and communication overheads; • if k is large, the dependent will miss a large number of changes during a failure of the parent.
Enhancing the Resiliency of the Repository Network Choice of cB Using a Probabilistic Model • Assuming that the data values change with uniform probability, and using a Markov chain model: # Misses = \( 2k^2 - 2 \) • \( 2k^2 - 2 \) is the number of updates a dependent will miss before it detects that there is a failure. • According to the experiments, this number is rather pessimistic; it is nearly an upper limit.
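As a quick worked example of the trade-off, the estimate can be tabulated for a few values of k. The function name and the base coherency of 0.1 are hypothetical; the formula is the one stated on the slide.

```python
def expected_misses(k):
    """Estimated number of updates missed before a failure is detected."""
    return 2 * k ** 2 - 2


c = 0.1                                    # hypothetical coherency served by the parent
table = [(k, round(k * c, 2), expected_misses(k)) for k in (1.5, 2, 3)]
# Each row is (k, cB = k * c, estimated misses).
```

For k = 2, the value used later in the experiments, the estimate is 6 missed updates; raising k to 3 makes the backup cheaper but raises the estimate to 16.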
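Enhancing the Resiliency of the Repository Network Choice of backup parents [Figure: choosing a backup parent for Q, a dependent of P. R, the parent of P, checks whether P has any siblings (here B and C); if yes, one of them is chosen randomly as the backup parent of Q]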
Enhancing the Resiliency of the Repository Network Choice of backup parents • If the coherency at which Q wants x from B is more stringent than the coherency at which B wants x, the parent of B is asked to serve x to Q with the required tighter coherency. • An advantage of choosing a sibling is that the change in coherency requirement is not percolated all the way to the source. • However, if an ancestor of P and B is heavily loaded, then the delay due to the load will be reflected in the updates of both P and B. This might result in additional loss in fidelity.
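The sibling rule and the "ask B's parent" fallback can be sketched as two small functions. The dictionaries, the function names, and returning `None` when P has no siblings are assumptions; the slides do not say what happens in that case.

```python
import random


def choose_backup(parent_of, children_of, p, rng=random):
    """Pick a random sibling of P as the backup parent; None if P has no sibling."""
    r = parent_of.get(p)
    if r is None:
        return None
    siblings = [n for n in children_of[r] if n != p]
    return rng.choice(siblings) if siblings else None


def backup_server(b, cr_of, parent_of, wanted_cr):
    """Return who actually serves Q: B itself, or B's parent when Q needs
    the item more tightly than B receives it (smaller c = tighter)."""
    return parent_of[b] if wanted_cr < cr_of[b] else b


# Hypothetical topology matching the slide's labels: R is the parent of P, B, and C.
parent_of = {"P": "R", "B": "R", "C": "R", "Q": "P"}
children_of = {"R": ["P", "B", "C"], "P": ["Q"]}

backup = choose_backup(parent_of, children_of, "P", rng=random.Random(1))
```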
Enhancing the Resiliency of the Repository Network Effect of Repository Failures on Loss of Fidelity • Because these failures are memoryless, an exponential probability distribution is used to simulate them: \( \Pr(X > t) = e^{-\lambda t} \) • λ = λ1 for the time to failure • λ = λ2 for the time to recover (a small λ2 means slow recovery, a large λ2 means fast recovery) • Link failures are not taken into account in this approach, so the model is incomplete.
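A failure trace for one repository under this model can be generated with exponentially distributed up and down times. The function name, the rates, and the time horizon below are hypothetical values chosen so that the example produces a few failures; the slides do not state the time unit.

```python
import random


def failure_intervals(lam_fail, lam_recover, horizon, rng):
    """Return a list of (down_start, down_end) intervals within [0, horizon]."""
    t, intervals = 0.0, []
    while True:
        t += rng.expovariate(lam_fail)          # time to the next failure (rate lambda1)
        if t >= horizon:
            break
        down = rng.expovariate(lam_recover)     # time to recover (rate lambda2)
        intervals.append((t, min(t + down, horizon)))
        t += down
        if t >= horizon:
            break
    return intervals


rng = random.Random(42)                          # seeded only for repeatability
intervals = failure_intervals(0.01, 0.1, 10000.0, rng)
```

Lowering `lam_recover` lengthens the down intervals, which corresponds to the slow-recovery end of the λ2 range.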
Enhancing the Resiliency of the Repository Network Performance Evaluation • The effect of adding resiliency is shown. • k = 2 is used. • When 100 data items are used, 23% of the updates sent by backups are disseminated. • Some updates sent by backups arrived before those sent by the parents.
Enhancing the Resiliency of the Repository Network Performance Evaluation • But when backup parents are loaded (more than 400 data items), their updates are of no use and increase the loss of fidelity. • The dependent should control them by checking the time stamps of the updates.
Enhancing the Resiliency of the Repository Network Performance Evaluation • During the experiment, about 80-90% of the repositories experienced at least one failure. • The maximum number of failures in the system at any given time for λ2 = 0.001 was around 12. • For λ2 = 0.01, the maximum number of failures was 5, and for λ2 = 0.1 it was 2.
Enhancing the Resiliency of the Repository Network Performance Evaluation • The effect of quick recovery is shown. • λ1 = 0.0001 and λ2 = 2 • For high coherency requirements, resiliency improves fidelity even for transient failures.
Enhancing the Resiliency of the Repository Network Performance Evaluation • However, with resiliency and a very large number of data items, e.g., 1000, fidelity drops. • This is because, at this point, the cost of resiliency exceeds the benefits obtained from it, which increases the loss in fidelity.
Reducing the Delay at a Repository Delays • Queueing delay: the time between the arrival of an update and the start of its processing. • Processing delay: check delay (deciding whether the update should be processed) + computation delay (computing the update and pushing the data to the dependents). [Figure: updates of x and y arrive and are queued; each update is checked to see whether it is needed, then processed and disseminated]
Reducing the Delay at a Repository Question: How can we reduce the average delays to improve fidelity? This can be done by: • Better filtering, i.e., reducing the processing delay in determining whether an update needs to be disseminated to one or more dependents • Better scheduling of disseminations
Reducing the Delay at a Repository Better Filtering • For each dependent, a repository maintains the coherency requirement (cr) and the last value pushed to it. • Upper bound = last pushed value + cr • Lower bound = last pushed value - cr • Algorithm to find the dependents to disseminate data to: the cr values of the dependents reside at the repository in sorted order (C1 = 0.7, C2 = 0.6, C3 = 0.5, C4 = 0.3, C5 = 0.1, C6 = 0.05); find the dependent with the first (largest) cr that needs the update. • Dependent ordering: a rule on the windows that is valid for every window. • If an update violates this rule, a pseudo value is generated in place of the actual value.
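The window test can be sketched as below. This is only the per-dependent check: it tests every dependent and does not implement the sorted-order shortcut or the pseudo values mentioned on the slide. The dependent names D1 to D6 and the last pushed value of 10.0 are hypothetical; the cr values are the ones from the slide.

```python
def dependents_to_update(dependents, new_value):
    """Return the dependents whose window [last - cr, last + cr] is left by new_value.

    dependents: dict mapping a dependent name to (cr, last_pushed_value)
    """
    selected = []
    for name, (cr, last) in dependents.items():
        lower, upper = last - cr, last + cr
        if not (lower <= new_value <= upper):
            selected.append(name)
    return sorted(selected)


def push(dependents, names, new_value):
    """Record new_value as the last value pushed to the selected dependents."""
    for name in names:
        cr, _ = dependents[name]
        dependents[name] = (cr, new_value)


deps = {"D1": (0.7, 10.0), "D2": (0.6, 10.0), "D3": (0.5, 10.0),
        "D4": (0.3, 10.0), "D5": (0.1, 10.0), "D6": (0.05, 10.0)}
to_send = dependents_to_update(deps, 10.4)
push(deps, to_send, 10.4)
```

Here a change from 10.0 to 10.4 only concerns the dependents with cr below 0.4; the others keep their old windows.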
Reducing the Delay at a Repository Better Filtering • Better filtering provides: • Updates of dynamic data are sent only to the end users who are actually interested in them. • No unnecessary data flows over the network (no flooding of data), which improves communication time and gives better response times. • With filtering, a more scalable system can be built that withstands unexpected heavy loads.
Reducing the Delay at a Repository Better scheduling of disseminations [Figure: two queued updates u1 and u2, each with a cost C(u) (the delay of the update) and a benefit b(u) (the beneficiaries of the update); the total delay of processing ui is shown] • Approach: • Instead of processing update requests in standard queue order, prioritizing them gives better performance. • SCORING: b(u)/C(u) • Each update request is scheduled according to this score. b(u) is the number of dependents that will receive the update; C(u) is the cost of disseminating it to all dependents. The b(u) values are stored at each repository, so they are precomputed. • Advantages: • Update requests that are important to many dependents are processed earlier (BUSINESS IMPORTANCE). • Updates with a low ratio are delayed, and if a new update arrives the older ones are dropped, which improves performance especially in heavily loaded environments (SCALABILITY).
Reducing the Delay at a Repository Better scheduling of disseminations Scheduling provides: • A priority scheme based on business importance that achieves better results. • As with filtering, improved scalability: some out-of-date update requests are discarded from the queue, which saves unnecessary computation and queueing delay.
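Score-based scheduling with dropping of stale updates can be sketched with a priority queue. The class and method names are hypothetical, and the sketch assumes that "older ones are dropped" refers to pending updates of the same data item.

```python
import heapq


class UpdateScheduler:
    """Processes pending updates in decreasing order of b(u) / C(u)."""

    def __init__(self):
        self._heap = []
        self._latest = {}     # data item -> sequence number of its newest pending update
        self._seq = 0

    def submit(self, item, value, benefit, cost):
        self._seq += 1
        self._latest[item] = self._seq             # supersedes older updates of `item`
        # heapq is a min-heap, so the score is negated to pop the best first.
        heapq.heappush(self._heap, (-benefit / cost, self._seq, item, value))

    def next_update(self):
        while self._heap:
            _, seq, item, value = heapq.heappop(self._heap)
            if self._latest.get(item) == seq:      # skip entries made stale by a newer update
                del self._latest[item]
                return item, value
        return None


s = UpdateScheduler()
s.submit("x", 1.0, benefit=2, cost=1.0)    # score 2.0
s.submit("y", 5.0, benefit=6, cost=2.0)    # score 3.0
s.submit("x", 1.1, benefit=2, cost=1.0)    # newer update of x replaces the first one
first, second, third = s.next_update(), s.next_update(), s.next_update()
```

The update of y is served first because of its higher score, and only the newest value of x is disseminated.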