The Role of Agents in Distributed Data Mining: Issues and Benefits Josenildo Costa da Silva 1, Matthias Klusch 1, Stefano Lodi 2, Gianluca Moro 2, Claudio Sartori 2 1 Deduction and Multiagent Systems, German Research Center for Artificial Intelligence, Saarbruecken, Germany 2 Department of Electronics, Computer Science and Systems, University of Bologna, Bologna, Italy
Distributed Data Mining (DDM) • Data sets • Massive • Inherently distributed • Networks • Limited bandwidth • Limited computing resources at nodes • Privacy and security • Sensitive data • Share goals, not data AgentLink III: TFG1 IIA4WE, Roma
Centralized solution • Apply traditional DM algorithms to data retrieved from different sources and stored in a data warehouse • May be impractical or even impossible for some business settings • Autonomy of data sources • Data privacy • Scalability (~TB/d)
Agents and DDM • DDM exploits distributed processing and problem decomposability • Is there any real added value of using concepts from agent technology in DDM? • Few DDM algorithms use agents • Evidence that cooperation among distributed DM processes may allow effective mining even without centralized control • Autonomy, adaptivity, deliberative reasoning naturally fit into the DDM framework
State of the Art • BODHI • Mobile agent platform/Framework for collective DM on heterogeneous sites • PADMA • Clustering homogeneous sites • Agent based text classification/visualization • JAM • Metalearning, classifiers • Papyrus • Wide area DDM over clusters • Move data/models/results to minimize network load
Agents for DDM (pros) • Autonomy of data sources • Scalability of DM to massive distributed data • Multi-strategy DDM • Collaborative DM
Agents for DDM (against) • Need to enforce minimal privileges at a data source • Unsolicited access to sensitive data • Eavesdropping • Data tampering • Denial of service attacks
The Inference Problem • Work in statistical DB (mid 70’s) • Integration/aggregation at the summary level is inherent in DDM • Infer sensitive data even from partial integration to a certain extent and with some probability (inference problem) • Existing DDM systems are not capable of coping with the inference problem
Data Clustering • Popular problem • Statistics (cluster analysis) • Pattern Recognition • Data Mining • Decompose multivariate data set into groups of objects • Homogeneity within groups • Separation between groups
DE-clustering • Clustering based on non-parametric density estimation • Construct an estimate of the probability density function from the data set • Objects “attracted” by a local maximum of the estimate belong to the same cluster
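A minimal 1-D sketch of the DE-clustering idea, assuming a Gaussian kernel and a fixed-step hill climb. The function names, the step size, and the rule that merges climbs ending within h of each other are illustrative choices made here, not taken from the slides.

```python
import math

def density(x, data, h):
    """Unnormalised Gaussian kernel density estimate at x (assumed kernel)."""
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)

def climb(x, data, h, step, max_iter=10_000):
    """Walk uphill in fixed steps until neither neighbour is higher."""
    for _ in range(max_iter):
        here = density(x, data, h)
        left = density(x - step, data, h)
        right = density(x + step, data, h)
        if left <= here and right <= here:
            break
        x = x - step if left > right else x + step
    return x

def de_cluster(data, h, step=0.05):
    """Group objects whose climbs end at (nearly) the same local maximum."""
    ends = sorted((climb(xi, data, h, step), xi) for xi in data)
    clusters = [[ends[0][1]]]
    for (prev_end, _), (end, xi) in zip(ends, ends[1:]):
        if end - prev_end > h:      # heuristic: a gap this wide means a new maximum
            clusters.append([])
        clusters[-1].append(xi)
    return clusters

clusters = de_cluster([1.0, 1.2, 0.9, 5.0, 5.1, 4.8], h=0.5)
```

With these six objects the climbs started near 1 and near 5 end at two different maxima, so two clusters are returned.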
Kernel Density Estimation • The higher the number of data objects in the neighbourhood of x, the higher the density at x • A data object xi exerts more influence on the value of the estimate at x than any data object farther from x than xi • The influence of data objects is radial
Formalizing Density Estimators • The density estimate at a space object x is proportional to a sum of weights • The sum consists of one weight for every data object • Weight is a monotonically decreasing function (kernel) of the distance between x and xi scaled by a factor h (window width)
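Written out, these three bullets correspond to a formula of the following shape. The symbols N (number of data objects), K (kernel), d (distance) and the hat notation are choices made here, since the slide's own formula is not preserved; any normalising constant is absorbed by the proportionality.

```latex
\hat{\varphi}(x) \;\propto\; \sum_{i=1}^{N} K\!\left(\frac{d(x, x_i)}{h}\right)
```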
Kernel Functions • Uniform kernel • Triangular kernel • Epanechnikov’s kernel • Gaussian kernel
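The four kernels can be sketched as follows. The normalising constants are the usual one-dimensional textbook ones and are an assumption here, because the formulas on the slides are not preserved; `estimate` is a hypothetical helper that sums the kernel weights of the scaled distances.

```python
import math

def uniform(u):
    """Square pulse: constant weight inside the window, zero outside."""
    return 0.5 if abs(u) <= 1 else 0.0

def triangular(u):
    """Weight falls linearly to zero at the window edge."""
    return 1.0 - abs(u) if abs(u) <= 1 else 0.0

def epanechnikov(u):
    """Parabolic weight, zero outside the window."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1 else 0.0

def gaussian(u):
    """Smooth weight with unbounded support."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def estimate(x, data, h, kernel):
    """Sum of kernel weights of the scaled distances |x - xi| / h (1-D)."""
    return sum(kernel(abs(x - xi) / h) for xi in data)
```

The first three kernels have bounded support but a discontinuity in the function or its derivative; the Gaussian is infinitely differentiable but never reaches zero.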
Example (1/2) • Uniform kernel, h=250
Example (2/2) • Gaussian kernel, h=250
Distributed Data Clustering (1/2) • Clustering algorithm A( ) • Homogeneous distributed data clustering problem for A: • Data set S • Sites Lj • Lj stores data set Dj with $S = \bigcup_j D_j$
Distributed Data Clustering (2/2) • Problem: find clustering Cj in the data space of Lj such that: • Cj agrees with A(S) (correctness requirement) • Time/communication costs are minimized (efficiency requirement) • The size of data transferred out of the data space of any Lj is minimized (privacy requirement)
Traditional (centralized) solution • Gather all local data sets into one centralized repository (e.g., a data warehouse) • Run A( ) on the centralized data set • Unsatisfied privacy requirement • Unsatisfied efficiency requirement for some A( )
Sampling • Goal: given some class of functions $f$, represent every member as a sampling series $f(x) = \sum_k f(s_k)\,\psi_k(x)$ where: • $\{s_k\}$ is a collection of points of the domain of $f$ • $\{\psi_k\}$ is some set of suitable expansion functions
Example • The class of polynomials of degree 1 • Sampling points • Expansion functions • Finite sum
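A small sketch of this example, assuming the two sampling points 0 and 1 and the Lagrange basis as expansion functions. The slide's actual choices are not preserved, so both are illustrative.

```python
def reconstruct_linear(samples, x):
    """Rebuild a degree-1 polynomial at x from two (point, value) samples.

    The expansion functions are the two Lagrange basis polynomials, so the
    sampling series is a finite sum with exactly two terms.
    """
    (s0, f0), (s1, f1) = samples
    psi0 = (x - s1) / (s0 - s1)   # equals 1 at s0 and 0 at s1
    psi1 = (x - s0) / (s1 - s0)   # equals 0 at s0 and 1 at s1
    return f0 * psi0 + f1 * psi1

def f(x):
    return 3.0 * x - 2.0          # any polynomial of degree 1 would do

samples = [(0.0, f(0.0)), (1.0, f(1.0))]
value = reconstruct_linear(samples, 2.5)
```

Two samples are enough because a polynomial of degree 1 has two free coefficients.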
Band-limited Functions • Function f of one real variable • Range of frequencies of a function f: the support of the Fourier transform of f • Any function whose range of frequencies is confined to a bounded set B is called band-limited to B (the band-region)
Example: sinc function
Sampling Theorem • If f is band-limited with band-region then
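The theorem can be illustrated numerically. The sketch below assumes one common convention, which may differ from the one on the slide: frequencies are measured in cycles per unit length, the band-region is [-W, W], samples are taken every T = 1/(2W), and the expansion functions are shifted copies of the normalised sinc. The test function sinc(x)^2 has W = 1, and the infinite series is truncated to |k| <= K.

```python
import math

def sinc(u):
    """Normalised sinc: sin(pi u) / (pi u), with sinc(0) = 1."""
    return 1.0 if u == 0 else math.sin(math.pi * u) / (math.pi * u)

def reconstruct(f, x, T, K):
    """Truncated sampling series using the samples f(kT), |k| <= K."""
    return sum(f(k * T) * sinc((x - k * T) / T) for k in range(-K, K + 1))

def f(x):
    return sinc(x) ** 2           # band-limited to [-1, 1] cycles per unit

approx = reconstruct(f, 0.3, T=0.5, K=50)
exact = f(0.3)
```

The two values agree only approximately, because the series has been truncated.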
Sampling Theorem (scaled multidimensional version) • Let where is the -th component of a vector • If f is band-limited to B then
Sampling Density Estimates (1/4) • Additivity of density estimates of a distributed data set
Sampling Density Estimates (2/4) • The sampling series of the density estimate is also additive where
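Additivity is easy to check numerically. The sketch below assumes an unnormalised Gaussian kernel, two hypothetical local data sets, and a shared 1-D sampling grid: summing the local samples position by position gives the samples of the estimate built from the union of the data.

```python
import math

def kde(x, data, h):
    """Unnormalised Gaussian kernel density estimate at x (assumed kernel)."""
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)

def sample(data, h, grid):
    """Samples of the estimate at every grid point."""
    return [kde(g, data, h) for g in grid]

d1, d2 = [1.0, 1.5, 2.0], [6.0, 6.5]          # hypothetical local data sets
grid = [0.5 * k for k in range(17)]           # shared sampling points 0 .. 8

local = [sample(d, 1.0, grid) for d in (d1, d2)]
summed = [sum(column) for column in zip(*local)]
global_samples = sample(d1 + d2, 1.0, grid)
```

Up to floating-point rounding, `summed` and `global_samples` hold the same values.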
Sampling Density Estimates (3/4) • Truncation errors • The support of a kernel function is not bounded in general • Aliasing errors • The support of the Fourier transform of a kernel function is not bounded in general, so kernel functions are not band-limited
Sampling Density Estimates (4/4) • The sampling series of a density estimate can only be approximated • Trade-off between the number of samples and accuracy • Define a minimal multidimensional rectangle outside which samples are negligible • Define a vector of sampling intervals such that the aliasing error is negligible
The KDEC scheme • Every site Lj: • Computes a local density estimate of its data Dj • Samples the local estimate • Sends the samples to H • Waits for the samples of the global density estimate • Reconstructs the global density estimate from its samples • Applies DE-clustering to Dj and the reconstructed global estimate • Helper H: • Waits for the samples of local density estimates • Sums the samples in order • Sends the summation back to each Lj
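The steps can be sketched end to end in one dimension. Everything concrete below is an assumption made for illustration: the Gaussian kernel, the sampling interval h/2, the sampling region [lo, hi], the fixed-step hill climb, and all function names. Message passing between sites and helper is replaced by plain function calls.

```python
import math

def sinc(u):
    return 1.0 if u == 0 else math.sin(math.pi * u) / (math.pi * u)

def local_samples(data, h, grid):
    """Site step: sample the local Gaussian density estimate on the grid."""
    return [sum(math.exp(-0.5 * ((g - x) / h) ** 2) for x in data) for g in grid]

def helper_sum(all_samples):
    """Helper step: add the sites' samples position by position."""
    return [sum(column) for column in zip(*all_samples)]

def reconstruct(samples, grid, tau):
    """Site step: truncated sinc series through the summed samples."""
    return lambda x: sum(s * sinc((x - g) / tau) for s, g in zip(samples, grid))

def climb(x, f, step, max_iter=1000):
    """Walk uphill on f in fixed steps until neither neighbour is higher."""
    for _ in range(max_iter):
        here, left, right = f(x), f(x - step), f(x + step)
        if left <= here and right <= here:
            break
        x = x - step if left > right else x + step
    return x

def cluster_site(data, estimate, h):
    """Site step: group local objects by the maximum they climb to."""
    ends = sorted((climb(x, estimate, h / 10), x) for x in data)
    clusters = [[ends[0][1]]]
    for (prev_end, _), (end, x) in zip(ends, ends[1:]):
        if end - prev_end > h:     # heuristic gap between distinct maxima
            clusters.append([])
        clusters[-1].append(x)
    return clusters

def kdec(sites, h, lo, hi):
    tau = h / 2                    # assumed sampling interval
    n = int(round((hi - lo) / tau)) + 1
    grid = [lo + k * tau for k in range(n)]
    total = helper_sum([local_samples(d, h, grid) for d in sites])
    estimate = reconstruct(total, grid, tau)
    return estimate, [cluster_site(d, estimate, h) for d in sites]

estimate, clusters = kdec([[1.0, 1.2, 5.0], [0.9, 5.1, 4.8]], h=0.5, lo=-2.0, hi=8.0)
```

Only the sample lists cross site boundaries in this sketch; each site clusters its own objects against the reconstructed global estimate.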
The KDEC scheme (figure: Helper, Site1, Site2)
Properties of the approach • Communication complexity depends only on the number of samples • Data objects are never transmitted over the network • Local clusters are close to global clusters which can be obtained using DE-cluster • Time complexity does not exceed the time complexity of centralized DE-clustering
Window width and sampling frequency • Good estimates when h is not less than a small multiple of the smallest distance between objects • As , the number of samples rarely exceeds the number of data points
Complexity • Site j • Sampling: O(q(N) Sam) • DE-cluster: O(|Dj|q(Dj)) • Helper • Summation of samples: O(Sam) • Communication • Time: O(Sam) • Volume: O(M Sam)
Complexity (centralized approach) • Site j • Transmission/Reception of data objects: O(|Dj|) • Helper • Global DE-clustering: O(N q(N)) • Communication: • Time: O(N) • Volume: O(N)
Stationary agent-based KDEC • The helper engages site agents to agree on: • Kernel function • Window width • Sampling frequencies • Sampling region • The global sampled form of the estimate is computed in a single stage
Mobile agent-based KDEC • At site Ln the visiting agent: • Negotiates kernel function, window width, sampling frequencies, sampling region • Carries the sum of samples collected at Lm, m<n, in its data space • The global sampled form of the estimate is returned to the interested agents
A Hierarchical Scheme • Additivity allows the scheme to be extended to trees of arbitrary arity • Local sampled density estimates are propagated upwards in partial sums, until the global sampled DE is computed at the root and returned to the leaves • May provide more protection against disclosure of DEs
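The upward propagation of partial sums can be sketched with a recursive function. The nested-dictionary tree layout and the key names `samples` and `children` are hypothetical; leaves stand for sites holding their local sampled estimates.

```python
def aggregate(node):
    """Return the position-wise sum of all sample lists below this node."""
    if "samples" in node:                        # leaf: a site's local samples
        return node["samples"]
    partial = [aggregate(child) for child in node["children"]]
    return [sum(column) for column in zip(*partial)]

tree = {"children": [
    {"samples": [1.0, 2.0]},
    {"children": [{"samples": [3.0, 4.0]}, {"samples": [5.0, 6.0]}]},
]}
global_samples = aggregate(tree)
```

An inner node only ever sees the partial sums of its subtrees, never the individual estimates of sites further down.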
Inference and Trustworthiness • Inference problem for kernel density estimates • Goal of inference attacks: exploit information contained in a density estimate to infer the data objects • Trustworthiness of helpers • Trustworthy helper: no bit of information written to memory by a process for the Helper procedure is sent to a system peripheral by a different process
Inference Attacks on Kernel Density Estimates • Let g be extensionally equal to a density estimate • For example, g is the reconstructed density estimate (sampling series)
Inference Attacks on Kernel Density Estimates • Simple strategy: Search the density estimate or its derivatives for discontinuities • Example: The kernel is the square pulse • For each pair of projections of objects on an axis there is a pair of projections of discontinuities on that axis having the same distance as the objects’ projections • If h is known then the objects can be inferred easily • If the kernel has discontinuous derivatives, then the same technique applies to the derivatives
Inference Attacks on Kernel Density Estimates • If g is not continuous at x, an object lies at distance h from x (example: h=250)
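A 1-D sketch of the discontinuity search for the square pulse, assuming the attacker knows h, can evaluate g anywhere, and scans with a spacing finer than the gaps between objects. The data values, the scan range, and the function names are hypothetical.

```python
def make_estimate(data, h):
    """Square-pulse estimate: the number of objects within h of x."""
    return lambda x: sum(1 for xi in data if abs(x - xi) <= h)

def infer_objects(g, lo, hi, h, eps):
    """Scan g with spacing eps; an upward jump at x marks an object near x + h."""
    found = []
    prev = g(lo)
    steps = int(round((hi - lo) / eps))
    for k in range(1, steps + 1):
        x = lo + k * eps
        cur = g(x)
        if cur > prev:                        # left edge of one or more pulses
            found.extend([x + h] * (cur - prev))
        prev = cur
    return found

g = make_estimate([300.0, 420.0, 1000.0], h=250.0)
recovered = infer_objects(g, lo=0.0, hi=2000.0, h=250.0, eps=1.0)
```

Each object is located to within the scan spacing `eps`; downward jumps carry the same information and are ignored here for brevity.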
Inference Attacks on Kernel Density Estimates • If the kernel is infinitely differentiable the problem is more difficult • Select space objects and attempt to solve a nonlinear system of equations
Attack Scenarios • Single-site attack • One of the sites attempts to infer the data objects from the global density estimate • Unable to associate a specific data object to a specific site • Site coalition attack • A coalition computes the sum of the density estimates of all the other sites as a difference • Special case: the coalition includes all sites but one, so the attack potentially reveals the data objects at that site
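The special case can be sketched directly from additivity. The code assumes an unnormalised Gaussian kernel, a shared 1-D grid, and three hypothetical sites, two of which form the coalition: subtracting the coalition's own samples from the global samples leaves the sampled estimate of the remaining site.

```python
import math

def samples(data, h, grid):
    """Sampled (unnormalised) Gaussian density estimate of one site."""
    return [sum(math.exp(-0.5 * ((g - x) / h) ** 2) for x in data) for g in grid]

def others_estimate(global_samples, coalition_samples):
    """Global samples minus the coalition's own contribution."""
    own = [sum(column) for column in zip(*coalition_samples)]
    return [g - o for g, o in zip(global_samples, own)]

grid = [0.5 * k for k in range(21)]
site_a, site_b, victim = [1.0, 2.0], [4.0], [7.0, 7.5]   # hypothetical data
sa, sb, sv = (samples(d, 1.0, grid) for d in (site_a, site_b, victim))

global_samples = [sum(column) for column in zip(sa, sb, sv)]
recovered = others_estimate(global_samples, [sa, sb])
```

The recovered samples describe only the victim's density estimate; inferring the objects themselves still requires one of the inference attacks on that estimate.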