
On Community Outliers and their Efficient Detection in Information Networks

Jing Gao (1), Feng Liang (1), Wei Fan (2), Chi Wang (1), Yizhou Sun (1), Jiawei Han (1). (1) University of Illinois, (2) IBM TJ Watson. Debapriya Basu.


Presentation Transcript


  1. On Community Outliers and their Efficient Detection in Information Networks • Jing Gao (1), Feng Liang (1), Wei Fan (2), Chi Wang (1), Yizhou Sun (1), Jiawei Han (1) • (1) University of Illinois, (2) IBM TJ Watson • Debapriya Basu

  2. OBJECTIVES • Determine outliers in information networks • Compare various algorithms that do the same

  3. CHARACTERISTICS OF INFORMATION NETWORKS • E.g., the Internet, social networking sites • Nodes: characterized by feature values • Links: represent relations between nodes

  4. OUTLIERS • Outliers – anomalies, novelties • Different kinds of outliers • Global • Contextual

  5. OUTLIERS IN INFORMATION NETWORKS

  6. COMMUNITY OUTLIER • Unified model considering both nodes and links • Community discovery and outlier detection are related processes

  7. SUMMARY OF THE APPROACH • Treat each object as a multivariate data point • Use K components to describe normal community behavior and one component to denote outliers • Introduce a hidden variable $z_i$ at each object indicating its community • Treat the network information as a graph • Model the graph as a hidden Markov random field on the $z_i$ • Find a local minimum of the posterior energy function of the model (equivalently, a local maximum of the posterior probability)

  8. UNIFIED PROBABILISTIC MODEL • Figure labels: outlier; community label Z; node features X; link structure W; model parameters • Example model parameters: high-income community (mean 116k, std 35k), low-income community (mean 20k, std 12k) • K: number of communities

  9. SYMBOLS IN THE MODEL

  10. THE MODEL • The random variables X are conditionally independent given their labels: $P(X = S \mid Z) = \prod_i P(x_i = s_i \mid z_i)$ • The k-th normal community is characterized by a set of parameters $\Theta_k$: $P(x_i = s_i \mid z_i = k) = P(x_i = s_i \mid \Theta_k)$ • Outliers are characterized by a uniform distribution: $P(x_i = s_i \mid z_i = 0) = \rho_0$ • A Markov random field is defined over the hidden variables Z: $P(z_i \mid z_{I \setminus \{i\}}) = P(z_i \mid z_{N_i})$, where $N_i$ is the set of neighbors of node i • The equivalent Gibbs distribution is $P(Z) = \frac{1}{H_1} \exp(-U(Z))$, where $H_1$ is a normalizing constant and $U(Z)$ is the sum of clique potentials • The goal is to find the configuration of Z that maximizes $P(X = S \mid Z)\,P(Z)$ for a given Θ
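
Because the normalizing constant $H_1$ does not depend on Z, the maximization in the last bullet can be restated as an energy minimization by taking the negative logarithm. The rewriting below is a standard step and uses only the symbols defined on this slide; the specific form of the clique potentials inside $U(Z)$ is not given here and is left abstract.

```latex
\begin{aligned}
\hat{Z} &= \arg\max_{Z} \; P(X = S \mid Z)\, P(Z) \\
        &= \arg\max_{Z} \; \Big[ \prod_i P(x_i = s_i \mid z_i) \Big] \cdot \frac{1}{H_1} \exp\big(-U(Z)\big) \\
        &= \arg\min_{Z} \; \Big[ -\sum_i \log P(x_i = s_i \mid z_i) + U(Z) \Big]
\end{aligned}
```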

  11. MODELING DATA • Continuous Data • Is modeled as Gaussian distribution • Model parameters: mean, standard deviation • Text Data • Is modeled as Multinomial distribution • Model parameters: probability of a word appearing in a community
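
The two data models can be illustrated with a minimal Python sketch of their log-likelihoods. The function names are hypothetical, the Gaussian is taken to be univariate for simplicity, and the text model assumes every word has a nonzero probability in `word_probs`; none of these details are stated in the slides.

```python
import math
from collections import Counter


def gaussian_log_density(x, mean, std):
    """Log of the univariate normal density with the given mean and std."""
    var = std * std
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)


def multinomial_log_likelihood(words, word_probs):
    """Log-likelihood of a bag of words under one community's word
    probabilities, omitting the multinomial coefficient (it is the same
    for every community, so it does not affect comparisons)."""
    counts = Counter(words)
    return sum(n * math.log(word_probs[w]) for w, n in counts.items())


# Toy usage with the income figures shown on slide 8 (values in thousands).
high = gaussian_log_density(100.0, 116.0, 35.0)
low = gaussian_log_density(100.0, 20.0, 12.0)
print(high > low)  # an income of 100k fits the high-income community better
```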

  12. COMMUNITY OUTLIER DETECTION ALGORITHM • Θ: model parameters • Z: community labels • Initialize Z • Parameter estimation: given Z, find the Θ that maximizes P(X|Z) • Inference: given Θ, find the Z that maximizes P(Z|X) • Alternate between the two steps
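
The alternation between parameter estimation and inference can be sketched as a small driver loop. This is an illustration only: `alternate`, `estimate`, and `infer` are hypothetical names, and stopping when the labels no longer change (or after `max_iters` rounds) is an assumed stopping rule that the slides do not specify.

```python
def alternate(labels, estimate, infer, max_iters=100):
    """Alternate parameter estimation and inference.

    estimate(labels) -> params         (parameter estimation step)
    infer(params, labels) -> labels    (inference step)
    Stops when the labels stop changing (assumed criterion).
    """
    params = estimate(labels)
    for _ in range(max_iters):
        new_labels = infer(params, labels)
        if new_labels == labels:
            break
        labels = new_labels
        params = estimate(labels)
    return labels, params


# Toy usage with stand-in steps, only to show the control flow.
final_labels, final_params = alternate(
    [0, 1],
    estimate=lambda labels: sum(labels),
    infer=lambda params, labels: [min(z + 1, 3) for z in labels],
)
print(final_labels, final_params)
```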

  13. PARAMETER ESTIMATION • Calculate model parameters • maximum likelihood estimation • Continuous • mean: sample mean of the community • standard deviation: square root of the sample variance of the community • Text • probability of a word appearing in the community: empirical probability
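
A minimal sketch of these estimates in Python follows. It assumes univariate continuous features, treats label 0 as the outlier label and leaves outliers out of the estimates, and divides by n in the variance (the maximum-likelihood form); the slides do not say whether n or n - 1 is used. The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def estimate_gaussian_params(values, labels):
    """Per-community (mean, std) from the nodes currently assigned to it.
    Nodes labeled 0 (outliers, by assumption) are skipped."""
    groups = defaultdict(list)
    for x, z in zip(values, labels):
        if z != 0:
            groups[z].append(x)
    params = {}
    for k, xs in groups.items():
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs)  # divide by n (MLE)
        params[k] = (mean, math.sqrt(var))
    return params


def estimate_word_probs(documents, labels):
    """Per-community empirical word probabilities (count / total count)."""
    counts = defaultdict(Counter)
    for words, z in zip(documents, labels):
        if z != 0:
            counts[z].update(words)
    probs = {}
    for k, c in counts.items():
        total = sum(c.values())
        probs[k] = {w: n / total for w, n in c.items()}
    return probs


params = estimate_gaussian_params([10.0, 20.0, 30.0, 1000.0], [1, 1, 1, 0])
print(params[1])
```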

  14. INFERENCE • Calculate the $z_i$ values • Given the model parameters, iteratively update the community labels of the nodes at each time step • Select the label that maximizes $P(z_i \mid x_i, z_{N_i})$ • Calculating $P(z_i \mid x_i, z_{N_i})$: if $z_i$ indicates a normal community, the value depends on both the node's features and the community labels of its neighbors • If the probability of a node belonging to any community is low enough, label it as an outlier
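
The label update described above might look like the following sketch. Several details are assumptions rather than the authors' stated formulas: features are univariate Gaussian, the neighbor term is a weighted count of neighbors that currently share the candidate label (scaled by a hypothetical `lam`), and the outlier label 0 receives the fixed score `-a0`, so a node becomes an outlier when no community scores above that threshold.

```python
import math


def gaussian_log_density(x, mean, std):
    var = std * std
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)


def update_label(i, values, labels, neighbors, params, lam, a0):
    """Pick the best label for node i given its neighbors' current labels."""
    best_label, best_score = 0, -a0  # label 0 = outlier, fixed score (assumed)
    for k, (mean, std) in params.items():
        # Total weight of neighbors currently in community k (assumed form).
        agreement = sum(w for j, w in neighbors[i] if labels[j] == k)
        score = gaussian_log_density(values[i], mean, std) + lam * agreement
        if score > best_score:
            best_label, best_score = k, score
    return best_label


def sweep(values, labels, neighbors, params, lam, a0):
    """One pass over all nodes, updating labels one at a time."""
    labels = list(labels)
    for i in range(len(values)):
        labels[i] = update_label(i, values, labels, neighbors, params, lam, a0)
    return labels


# Toy network: nodes 0-1 are low-income and linked to each other; nodes 2-3
# are high-income and linked; node 4 has a low income but links only to 2 and 3.
values = [20.0, 22.0, 110.0, 115.0, 21.0]
neighbors = {0: [(1, 1.0)], 1: [(0, 1.0)], 2: [(3, 1.0), (4, 1.0)],
             3: [(2, 1.0), (4, 1.0)], 4: [(2, 1.0), (3, 1.0)]}
params = {1: (20.0, 12.0), 2: (116.0, 35.0)}  # income figures from slide 8
print(sweep(values, [1, 1, 2, 2, 2], neighbors, params, lam=1.0, a0=3.0))
```

With these toy values, node 4 matches community 1 by its features and community 2 by its links, and neither score clears the threshold, so it ends up with label 0.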

  15. DISCUSSION • Setting hyperparameters • a0: outlier threshold • Λ: confidence in the network • K: number of communities • Initialization • Outliers are initially grouped into clusters • These assignments are eventually corrected in later iterations

  16. SIMULATED EXPERIMENTS • Data Generation • Generate continuous data based on Gaussian distributions and generate labels according to the model • Define r: percentage of outliers, K: number of communities • Baseline models • GLODA: global outlier detection (based on node features only) • DNODA: local outlier detection (check the feature values of direct neighbors) • CNA: partition data into communities based on links and then conduct outlier detection in each community
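
A simplified version of the data generation step is sketched below. It is not the authors' generator: labels are drawn uniformly at random instead of from the network model, no links are produced, outlier features are drawn uniformly from a hypothetical `value_range`, and `r` is passed as a fraction (0.05 for 5%).

```python
import random


def generate_nodes(n, community_params, r, value_range, seed=0):
    """Generate n univariate feature values with ground-truth labels.

    community_params: list of (mean, std), one per community (labels 1..K).
    r: fraction of nodes that are outliers (label 0).
    value_range: (low, high) for the uniform outlier features (assumed).
    """
    rng = random.Random(seed)
    n_outliers = round(n * r)
    k = len(community_params)
    labels = [0] * n_outliers + [rng.randint(1, k) for _ in range(n - n_outliers)]
    rng.shuffle(labels)
    values = []
    for z in labels:
        if z == 0:
            values.append(rng.uniform(*value_range))
        else:
            mean, std = community_params[z - 1]
            values.append(rng.gauss(mean, std))
    return values, labels


values, labels = generate_nodes(100, [(20.0, 12.0), (116.0, 35.0)], 0.05, (0.0, 300.0))
print(labels.count(0), "outliers out of", len(values))
```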

  17. TRUE POSITIVE RATE (top r% as outliers)
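
One way to compute this metric is sketched below, assuming each method outputs an outlier score per node (higher means more outlying) and that the top r fraction of nodes by score is flagged. The function name and the score convention are assumptions, not details from the slides.

```python
def true_positive_rate(scores, is_outlier, r):
    """Fraction of the true outliers found among the top r fraction of nodes."""
    n_top = round(len(scores) * r)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    flagged = ranked[:n_top]
    n_true = sum(is_outlier)
    hits = sum(1 for i in flagged if is_outlier[i])
    return hits / n_true if n_true else 0.0


scores = [0.9, 0.1, 0.8, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.05]
truth = [True, False, False, True, False, False, False, False, False, False]
print(true_positive_rate(scores, truth, 0.2))
```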

  18. EXPERIMENTS ON DBLP • Communities • data mining, artificial intelligence, database, information analysis • Subnetwork of conferences • Links: percentage of common authors between two conferences • Node features: publication titles in the conference • Subnetwork of authors • Links: co-authorship relationships • Node features: titles of publications by an author
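
The conference link weight can be illustrated with a short sketch. The slides do not say how the percentage is normalized; the version below assumes the number of shared authors divided by the number of distinct authors across both conferences (a Jaccard ratio). The function name and the author names are hypothetical.

```python
def common_author_fraction(authors_a, authors_b):
    """Shared authors divided by all distinct authors of the two conferences
    (assumed normalization)."""
    a, b = set(authors_a), set(authors_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


conf_a = ["author_x", "author_y", "author_z"]
conf_b = ["author_y", "author_z", "author_w"]
print(common_author_fraction(conf_a, conf_b))
```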

  19. RESULTS • Community outliers: CVPR, CIKM

  20. CONCLUSIONS • Community outliers • Community outlier detection • Questions

  21. REFERENCES • On Community Outliers and their Efficient Detection in Information Networks – Gao, Liang, Fan, Wang, Sun, Han • Outlier detection – Irad Ben-Gal • Automated detection of outliers in real-world data – Last, Kandel • Outlier Detection for High Dimensional Data – Aggarwal, Yu
