Unit II: Mathematical Foundation of Big Data • Probability theory • Tail bounds with applications • Markov chains and random walks • Pairwise independence and universal hashing • Approximate counting • Approximate median • The streaming models • Flajolet–Martin, distance sampling • Bloom filters • Local search and testing connectivity • Boolean functions
Approximate median • The median and percentiles are commonly used data statistics; they also appear inside other data-analysis algorithms. • Two approximate functions: • APPROXIMATE_MEDIAN • APPROXIMATE_PERCENTILE • These functions scale well with big data and are several orders of magnitude faster than the exact median and percentile functions. • APPROXIMATE_MEDIAN starts with a binary-search-tree sorting algorithm implemented in a distributed fashion, which makes the function very fast.
Approximate Median Algorithm • The algorithm approximates the median with good precision and executes very quickly. • It is fast: it needs fewer than (4/3)n comparisons and (1/3)n exchanges on average, and fewer than (3/2)n comparisons and (1/2)n exchanges in the worst case. • In addition to its sequential efficiency, it is easily parallelized. It is used, for example, in median filtering in image processing. • The size of the input is a power of 3: n = 3^r. • Let n = 3^r be the size of the input array, with r an integer. The algorithm proceeds in r stages. At each stage it divides its input into subsets of three elements and computes the median of each such triplet.
The procedure Triplet Adjust finds the median of a triplet whose elements are indexed by two parameters: the first, i, denotes the position of the leftmost element of the triplet in the array; the second, Step, is the relative distance between the triplet's elements.
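The staged median-of-triplets reduction described above can be sketched as follows (a minimal, non-parallel version; the helper name `triplet_median` is an illustrative stand-in for the Triplet Adjust step, and the input is assumed to have size n = 3^r as the slides require):

```python
def triplet_median(a, b, c):
    # Median of three values using at most three comparisons.
    if a > b:
        a, b = b, a          # now a <= b
    return b if b < c else max(a, c)

def approximate_median(values):
    """Approximate median by r stages of median-of-triplets reduction.

    Assumes len(values) is a power of 3 (n = 3**r), as in the slides.
    Each stage shrinks the array by a factor of 3; after r stages one
    element remains, which approximates the median.
    """
    level = list(values)
    while len(level) > 1:
        level = [triplet_median(level[i], level[i + 1], level[i + 2])
                 for i in range(0, len(level), 3)]
    return level[0]
```

On the sorted input 0..26 (n = 3^3) the reduction returns 13, which here happens to be the exact median; on general inputs the result is only guaranteed to lie near the middle of the order statistics.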
Streaming Model • Big data streaming is a process in which big data is processed quickly, operating on data in motion. It is a speed-focused approach wherein a continuous stream of data is processed. • The streaming model includes three window models: 1. Landmark Window Model 2. Damped Window Model 3. Sliding Window Model
1. Landmark Window Model • A landmark window model considers the data in the stream from a fixed point in time (the so-called landmark) until now. • This model uses the lossy-counting approximation algorithm introduced by Manku and Motwani. It determines an approximate set of frequent item sets over the whole stream. • It processes transactions in batches; for each item set it maintains the item set itself, its occurrence count, and an error bound.
The algorithm uses three modules: Buffer, Trie, and SetGen. • The Buffer module continuously fills memory with incoming transactions and computes the frequency of each item in the current batch. • The Trie module maintains a forest of prefix trees, stored as an array of tuples (X, freq(X), err(X), level). • The SetGen module generates the item sets of the current batch, implemented using a heap (priority queue). During each call, old sets are copied into the old Trie array while, at the same time, new sets are inserted into the new Trie array.
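The core lossy-counting idea can be sketched for the simpler single-item case (the slides describe the item-set variant with its Buffer/Trie/SetGen modules; this sketch keeps only the counting-and-pruning logic, with the standard (item, frequency, error) bookkeeping):

```python
from math import ceil

def lossy_count(stream, epsilon):
    """Sketch of Manku–Motwani lossy counting for frequent items.

    Keeps, per tracked item, a frequency count and a maximum
    undercount (error bound). At every bucket boundary, items whose
    count plus error falls below the current bucket id are pruned,
    so memory stays bounded while frequent items survive.
    """
    width = ceil(1 / epsilon)          # bucket width w = ceil(1/eps)
    counts, deltas = {}, {}
    for n, item in enumerate(stream, start=1):
        bucket = ceil(n / width)       # current bucket id
        if item in counts:
            counts[item] += 1
        else:
            counts[item] = 1
            deltas[item] = bucket - 1  # max possible undercount so far
        if n % width == 0:             # prune at bucket boundary
            doomed = [x for x in counts
                      if counts[x] + deltas[x] <= bucket]
            for x in doomed:
                del counts[x], deltas[x]
    return counts, deltas
```

Frequent items such as `'a'` and `'b'` below survive with exact counts, while a rare item seen once is pruned at the next bucket boundary.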
2. Damped Window Model • A damped window model associates weights with the data in the stream rather than making a binary decision on whether to include a point. • It gives higher weights to recent data than to data in the past; damped windows are used for finding recently frequent item sets. • This model uses the estDec algorithm, which reduces the influence of older transactions on the stream-mining result by applying a decay rate. • estDec maintains recent frequent item sets approximately and uses these estimates to report the frequent item sets.
3. Sliding Window Model • A sliding window model, on the other hand, considers the data from now back to a certain range in the past. • It processes the items in the window and maintains only the frequent item sets. • The size of the sliding window depends on the application and on system resources. • This model uses an in-memory prefix-tree algorithm introduced by Chi et al. to incrementally update the set of frequent item sets over the sliding window.
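A minimal sliding-window frequency sketch, again at the single-item level (a simplified stand-in for the prefix-tree approach on item sets; `window` and `threshold` are assumed parameter names):

```python
from collections import Counter, deque

def sliding_window_frequent(stream, window, threshold):
    """Maintain item counts over the last `window` arrivals and report,
    after each arrival, the items whose in-window count meets the
    threshold. Counts are updated incrementally as items enter and
    leave the window, mirroring the incremental-update idea.
    """
    win, counts, reports = deque(), Counter(), []
    for item in stream:
        win.append(item)
        counts[item] += 1
        if len(win) > window:          # expire the oldest item
            old = win.popleft()
            counts[old] -= 1
            if counts[old] == 0:
                del counts[old]
        reports.append({x for x, c in counts.items() if c >= threshold})
    return reports
```

Note how the reported frequent set shifts from `{'a'}` to `{'b'}` as the window slides past the early `'a'`s.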
Flajolet–Martin, Distance Sampling • Data sampling is a statistical analysis technique used to select, manipulate, and analyze a representative subset of data points in order to identify patterns and trends in the larger data set being examined. • The distance between a pair of vertices in a graph X is the length of the shortest path between them. The distance distribution of X specifies how many vertex pairs are at distance v. • The Eppstein–Wang algorithm estimates this distribution by sampling through breadth-first searches: it starts from randomly chosen vertices and records the resulting distances.
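The sampling-through-BFS idea can be sketched as follows (a hedged sketch of the approach attributed to Eppstein and Wang above, not their exact estimator; `adj` is an adjacency-list dict and `samples` the number of sampled sources):

```python
import random
from collections import deque

def sampled_distance_histogram(adj, samples, rng=random):
    """Estimate a graph's distance distribution by running BFS from a
    few randomly sampled source vertices and tallying the distances
    found, instead of running all-pairs shortest paths."""
    hist = {}
    vertices = list(adj)
    for _ in range(samples):
        src = rng.choice(vertices)
        dist = {src: 0}
        q = deque([src])
        while q:                        # standard BFS from src
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        for d in dist.values():
            if d > 0:
                hist[d] = hist.get(d, 0) + 1
    return hist
```

On a triangle graph every BFS finds two vertices at distance 1, so five samples tally ten distance-1 pairs; on large graphs the per-sample tallies are scaled into an estimate of the full distribution.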
Boolean Functions: Representations • Boolean algebra deals with binary variables and logic operations. A Boolean function is described by an algebraic expression, called a Boolean expression, which consists of binary variables, the constants 0 and 1, and logic operation symbols. Common representations: (1) Value tables (2) Branching programs (3) Circuits (4) Decision trees (5) Formulas (6) Conjunctive normal forms (7) Disjunctive normal forms
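Two of the listed representations, the value (truth) table and the disjunctive normal form, can be illustrated concretely (variable names `x0, x1, ...` are an assumed convention; the 3-input majority function serves as the example):

```python
from itertools import product

def truth_table(f, n):
    """Value-table representation: map every assignment of n binary
    variables to the function's output."""
    return {bits: f(*bits) for bits in product((0, 1), repeat=n)}

def dnf(table):
    """Disjunctive normal form read off the table: one AND-term per
    row where the function is 1, OR-ed together."""
    terms = []
    for bits, val in sorted(table.items()):
        if val:
            terms.append(' & '.join(f'x{i}' if b else f'~x{i}'
                                    for i, b in enumerate(bits)))
    return ' | '.join(f'({t})' for t in terms)

# 3-input majority: outputs 1 when at least two inputs are 1.
maj = lambda a, b, c: int(a + b + c >= 2)
```

Majority on three inputs is 1 on exactly four of the eight rows, so its DNF has four terms; the other representations (circuits, decision trees, branching programs) encode the same function with different size/structure trade-offs.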
Bloom Filters • A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. • False positive matches are possible, but false negatives are not; in other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed (though this can be addressed with a "counting" filter); the more elements that are added to the set, the larger the probability of false positives. • A Bloom filter consists of an array of x bits, Z[0] to Z[x−1], and uses m hash functions h1, …, hm, each with range {0, …, x−1}. • Hashing is used for indexing and locating items in databases because it is easier to find the short hash value than the longer string; hashing is also used in encryption.
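A minimal sketch of the structure just described, with concrete parameter names assumed (m-bit array, k hash functions derived here from salted SHA-256 for simplicity):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch.

    `add` sets k bits per element; membership reports "possibly in
    set" only if all k probed bits are set, so added elements are
    never missed (no false negatives), while unrelated elements may
    collide (false positives).
    """
    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m)        # the bit array Z[0]..Z[m-1]

    def _indexes(self, item):
        # Derive k indexes by salting the item with the hash number.
        for i in range(self.k):
            h = hashlib.sha256(f'{i}:{item}'.encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for j in self._indexes(item):
            self.bits[j] = 1

    def __contains__(self, item):
        return all(self.bits[j] for j in self._indexes(item))
```

Every added element is guaranteed to test positive, and each insertion sets at most k bits, which is why the false-positive rate grows as the array fills.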
Local Search and Testing Connectivity • Local search • In computer science, local search is a heuristic method for solving computationally hard optimization problems. • Local search can be used on problems that can be formulated as finding a solution maximizing a criterion among a number of candidate solutions. • Local search algorithms are widely applied to numerous hard computational problems in computer science (particularly artificial intelligence), mathematics, operations research, engineering, and bioinformatics.
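The candidate-solutions-and-criterion formulation above can be sketched as generic hill climbing (a minimal illustrative version; `score`, `neighbors`, and `start` are assumed parameter names, and the toy objective below is for demonstration only):

```python
def hill_climb(score, neighbors, start, steps=1000):
    """Generic local search: from the current candidate, move to the
    best-scoring neighbor as long as it improves the criterion; stop
    at a local optimum (no neighbor scores higher)."""
    current = start
    for _ in range(steps):
        best = max(neighbors(current), key=score, default=current)
        if score(best) <= score(current):
            return current              # local optimum reached
        current = best
    return current

# Toy objective: maximize f(x) = -(x - 3)**2 over the integers,
# with +/-1 moves; the search climbs to the maximizer x = 3.
```

For this single-peaked objective the local optimum is also the global one; in general, hill climbing can stop at a local optimum, which is why restarts and randomized variants are common.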