210 likes | 363 Views
On Community Outliers and their Efficient Detection in Information Networks. Jing Gao 1 , Feng Liang 1 , Wei Fan 2 , Chi Wang 1 , Yizhou Sun 1 , Jiawei Han 1 University of Illinois, IBM TJ Watson Debapriya Basu. OBJECTIVES. Determine outliers in information networks
E N D
On Community Outliers and their Efficient Detection in Information Networks Jing Gao1, Feng Liang1, Wei Fan2, Chi Wang1, Yizhou Sun1, Jiawei Han1 University of Illinois, IBM TJ Watson DebapriyaBasu
OBJECTIVES • Determine outliers in information networks • Compare various algorithms which does the same
CHARACTERISTICS OF INFORMATION NETWORK • Eg Internet, Social Networking Sites • Nodes – characterized by feature values • Links - representative of relation between nodes
OUTLIERS • Outliers – anomalies, novelties • Different kinds of outliers • Global • Contextual
OUTLIERS IN INFORMATION NETWORKS
COMMUNITY OUTLIER • Unified model considering both nodes and links • Community discovery and outlier detection are related processes
SUMMARY OF THE APPROACH • Treat each object as a multivariate data point • Use K components to describe normal community behavior and one component to denote outliers • Induce a hidden variable zi at each object indicating community • Treat network information as a graph • Model the graph as a Hidden Markov Random Field on zi • Find the local minimum of the posterior probability potential energy of the model.
UNIFIED PROBABILISTIC MODEL outlier community label Z node feature X link structure W model parameters high-income: mean: 116k std: 35k low-income: mean: 20k std: 12k K: number of communities
THE MODEL • Set of R.Vs X are conditionally independent given their labels P(X=S|Z) = ΠP(xi=si|zi) • Kth normal community is characterized by a set of parameters P(xi=si|zi=k) = P(xi=si|Θk) • Outliers are characterized by uniform distribution • P(xi=si|zi=0) = ρ0 • Markov random field is defined over hidden variable Z • P(zi|zI-{i}) = P(zi|zNi) • The equivalent Gibbs distribution is P(Z) = exp(-U(Z))*1/H1 H1 = normalizing constant, U(Z) = sum of clique potentials. • Goal is to find the configuration of z that maximizes P(X=S|Z)P(Z) for a given Θ
MODELING DATA • Continuous Data • Is modeled as Gaussian distribution • Model parameters: mean, standard deviation • Text Data • Is modeled as Multinomial distribution • Model parameters: probability of a word appearing in a community
COMMUNITY OUTLIER DETECTION ALGORITHM • Θ: model parameters • Z: community labels Initialize Z Given Z, find Θ that maximizes P(X|Z) PARAMETER ESTIMATION Given Θ, find Z that maximizes P(Z|X) INFERENCE
PARAMETER ESTIMATION • Calculate model parameters • maximum likelihood estimation • Continuous • mean: sample mean of the community • standard deviation: square root of the sample variance of the community • Text • probability of a word appearing in the community: empirical probability
INFERENCES • Calculate Zi values • Given Model parameters, • Iteratively update the community labels of nodes at each timestep • Select the label that maximizes P(Z|X,ZN) • Calculate P(Z|X,ZN) values • Both the node features and community labels of neighbors if Z indicates a normal community • If the probability of a node belonging to any community is low enough, label it as an outlier
DISCUSSIONS • Setting Hyper parameters • a0 = threshold • Λ = confidence in the network • K = number of communities • Initialization • Group outliers in clusters. • It will eventually get corrected.
SIMULATED EXPERIMENTS • Data Generation • Generate continuous data based on Gaussian distributions and generate labels according to the model • Define r: percentage of outliers, K: number of communities • Baseline models • GLODA: global outlier detection (based on node features only) • DNODA: local outlier detection (check the feature values of direct neighbors) • CNA: partition data into communities based on links and then conduct outlier detection in each community
EXPERIMENTS ON DBLP • Communities • data mining, artificial intelligence, database, information analysis • Sub network of Conferences • Links: percentage of common authors among two conferences • Node features: publication titles in the conference • Sub network of Authors • Links: co-authorship relationship • Node features: titles of publications by an author
RESULTS Community outliers: CVPR CIKM
CONCLUSIONS • Community Outliers • Community Outlier Detection QUESTIONS
REFERENCES • On Community Outliers and their Efficient Detection in Information Networks – Gao, Liang, Fan, Wang, Sun, Han • Outlier detection – Irad Ben-Gal • Automated detection of outliers in real-world data – Last, Kandel • Outlier Detection for High Dimensional Data – Aggarwal, Yu