
Community Prediction




  1. Community Prediction DISTRIBUTED Community Prediction on Social Graphs based on the Louvain algorithm.

  2. Interesting real-life facts Sources: • https://www.forbes.com/sites/bernardmarr/2015/09/30/big-data-20-mind-boggling-facts-everyone-must-read/#a5c9d6517b1e • https://waterfordtechnologies.com/big-data-interesting-facts/ • https://www.brandwatch.com/blog/amazing-social-media-statistics-and-facts/ • https://www.smartinsights.com/social-media-marketing/social-media-strategy/new-global-social-media-research/ • https://learn.g2crowd.com/social-media-statistics#social-media-statistics

  3. Big data facts (1) • Data volumes are exploding: more data has been created in the past two years than in the entire previous history of the human race. • 4.4 zettabytes of data exist in the digital universe today. • Every second more than 40,000 search queries are performed on Google alone, which makes roughly 3.5 billion searches per day and 1.2 trillion searches per year. • Every minute: up to 300 hours of video are uploaded to YouTube alone, Facebook users send on average 31.25 million messages, and Facebook users view 2.77 million videos. • Decoding the human genome originally took 10 years; now it can be achieved in one week.

  4. Big data facts (2) • Data is growing faster than ever before and by the year 2020: • 1.7 megabytes of new information will be created every second for every human being on the planet. • The accumulated digital universe of data will grow from 4.4 zettabytes today to around 44 zettabytes, or 44 trillion gigabytes. • At least a third of all data will pass through the cloud. • Within five years there will be over 50 billion smart connected devices in the world, all developed to collect, analyze and share data.

  5. Social media facts (1) • There are over 3 billion people using social media, and the number increases every year (Smart Insights, 2018). • Every second, 11 new people use social media for the first time (Skyword, 2018). • Users spend an average of 2 hours and 22 minutes per day on social networks and messaging. • 73% of marketers believe that social media marketing has been “somewhat effective” or “very effective” for their business. • 49% of consumers depend on influencer recommendations on social media. • 48% of Americans have interacted with companies and institutions through social media at least once (Fotor, 2019).

  6. Social media facts (2)

  7. Community detection

  8. Community detection • Biology, chemistry, sociology, marketing and computer science have embraced information networks, since they are extremely suitable for hierarchical data representation. • Social media information in particular can be naturally modeled with graphs, where each node represents a unique user and each connection expresses some kind of social interaction. • Community Detection: due to the natural human tendency to associate and mainly interact with peers of similar interests, aka homophily, the formation of virtual clusters and communities is a natural consequence. • https://en.wikipedia.org/wiki/Homophily • https://en.wikipedia.org/wiki/Network_homophily

  9. Modularity Function • The elementary objective for graph clustering optimization: a function that quantifies the average connectivity degree of a given community. • It intuitively reflects the concentration of edges within communities compared to a random distribution of links between all nodes of the original graph. • Modularity optimization is inherently an NP-complete problem, and typical solutions have high polynomial computational complexity. Nonetheless, all the proposed methodologies can be roughly distinguished as divisive algorithms, agglomerative algorithms, transformation methods and other approaches.
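The quantity described above can be computed directly from the standard definition Q = (1/2m) Σ_ij [A_ij − k_i·k_j/(2m)] δ(c_i, c_j). A minimal plain-Python sketch follows; the adjacency-dict representation and the toy two-triangle graph are illustrative assumptions, not part of the slides:

```python
def modularity(adj, communities):
    """Q = (1/2m) * sum_ij [A_ij - k_i*k_j/(2m)] * delta(c_i, c_j)
    for an undirected, unweighted graph given as an adjacency dict."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2  # number of edges
    comm_of = {n: c for c, nodes in enumerate(communities) for n in nodes}
    q = 0.0
    for i in adj:
        for j in adj:
            if comm_of[i] == comm_of[j]:
                a_ij = 1.0 if j in adj[i] else 0.0
                q += a_ij - len(adj[i]) * len(adj[j]) / (2 * m)
    return q / (2 * m)

# Toy graph: two triangles joined by the single bridge edge (2, 3).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
q_split = modularity(adj, [{0, 1, 2}, {3, 4, 5}])  # the "natural" split
```

Splitting at the bridge gives Q = 5/14 ≈ 0.357, while putting all six nodes in one community gives Q = 0, illustrating how the function rewards concentrating edges inside communities.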

  10. Divisive algorithms (1) • The fundamental idea is to identify and remove all the edges that interconnect nodes belonging to different communities by applying an iterative process. • At the beginning, the original graph is considered to compose a single community. • During each iteration step, a set of edges that satisfy certain criteria and appear to interconnect different communities is removed. • The process is repeated until the generated network structure converges to a stable state where no additional edge removal can be applied. • It is worth mentioning that in a few extreme cases the removal of a whole subgraph may be required in terms of community structure validity and efficiency.

  11. Divisive algorithms (2) – Girvan & Newman • The most representative divisive algorithm is the one proposed by Girvan and Newman, in which the removal criterion depends on the edge betweenness measure. • Edge betweenness counts the number of shortest paths, between all possible pairs of nodes, that run along the edge. Alternatives: • geodesic edge betweenness, • current-flow edge betweenness, • random-walk edge betweenness
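The removal criterion can be sketched in plain Python: enumerate all shortest paths per node pair via BFS and split each pair's unit of credit equally among its shortest paths (the Girvan–Newman convention). The helper names and the toy graph are illustrative assumptions; this brute-force version is only suitable for tiny graphs, unlike the optimized Brandes-style implementations used in practice:

```python
from collections import defaultdict, deque
from itertools import combinations

def shortest_paths(adj, s, t):
    """Enumerate all shortest paths from s to t via BFS predecessors."""
    dist, preds, q = {s: 0}, defaultdict(list), deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
            if dist[v] == dist[u] + 1:
                preds[v].append(u)
    paths = []
    def back(v, path):
        if v == s:
            paths.append(path[::-1])
        else:
            for p in preds[v]:
                back(p, path + [p])
    if t in dist:
        back(t, [t])
    return paths

def edge_betweenness(adj):
    """Each pair's credit is split equally among its shortest paths."""
    score = defaultdict(float)
    for s, t in combinations(sorted(adj), 2):
        paths = shortest_paths(adj, s, t)
        for path in paths:
            for u, v in zip(path, path[1:]):
                score[frozenset((u, v))] += 1.0 / len(paths)
    return score

# Two triangles joined by the bridge (2, 3): every cross-community
# shortest path runs along the bridge, so it would be removed first.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
scores = edge_betweenness(adj)
```

Here the bridge scores 9.0 (all nine cross pairs pass through it), far above any intra-triangle edge, which is exactly why removing the highest-betweenness edge first splits the graph into its two communities.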

  12. Agglomerative algorithms • Contrary to divisive, agglomerative algorithms are considered bottom-up approaches. • At first, each node is considered a separate community (singleton); then a number of iterations is applied, in each of which the distinct communities are iteratively merged according to the calculated result of a well-defined similarity function. • The most representative, introduced by the University of Louvain, is the most widely accepted and is considered the state of the art in hierarchical community detection. • Hierarchy structure extraction is a strong asset.
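The bottom-up pattern can be illustrated with a greedy merge sketch: start from singletons and repeatedly apply the merge with the best modularity gain. This is not the full Louvain algorithm (which adds a local-moving phase and graph coarsening) but shows the agglomerative idea the slide describes; the function names and toy graph are my own assumptions:

```python
def greedy_agglomerative(adj):
    """Bottom-up sketch: merge the community pair with the best modularity
    gain dQ = E_ab/m - S_a*S_b/(2*m*m), where E_ab is the number of edges
    between communities a and b, and S_x is the total degree of x."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    comms = {n: {n} for n in adj}           # community id -> member set
    sig = {n: len(adj[n]) for n in adj}     # total degree per community
    while True:
        best_dq, best_pair = 0.0, None
        ids = list(comms)
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                e_ab = sum(1 for u in comms[a] for v in adj[u] if v in comms[b])
                dq = e_ab / m - sig[a] * sig[b] / (2 * m * m)
                if dq > best_dq:
                    best_dq, best_pair = dq, (a, b)
        if best_pair is None:               # no merge improves modularity
            return list(comms.values())
        a, b = best_pair
        comms[a] |= comms.pop(b)
        sig[a] += sig.pop(b)

# Two triangles joined by a bridge: merging stops once each triangle is
# a community, because merging them would decrease modularity.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
parts = greedy_agglomerative(adj)
```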

  13. Transformation methods • Transformation methods map the original network structure identification problem to a different solution space. • The most characteristic are spectral clustering techniques and the adoption of genetic algorithms. • In spectral clustering, each node is represented by a point in space whose coordinates are elements of eigenvectors; community detection then proceeds with simple clustering techniques such as k-means. • In genetic algorithms, the original graph is randomly initialized and each node is assigned to a community. At each iteration, a modularity score is assigned to each of the individual communities based on a quality function; according to this, some of the current-generation communities are selected to enter the crossover and mutation process in order to produce the next-generation individuals.

  14. Other approaches & methodologies • Leverage different modalities of the network information by combining the graph's topological structure with the critical user content information. • Typical examples: • Simulated Annealing, • Information Diffusion Graph Clustering, • Graph clustering based on link prediction methodology.

  15. Community prediction

  16. Introduction • With more than 3 billion active social media users, the application of classic community detection algorithms/methods seems absolutely pointless on real-life graphs. • Only scalable, distributed solutions that leverage the topological insights retrieved from a small but representative part of the original graph could rise to the expectations. • Community Prediction methodology based on LIME (Local Interpretable Model-agnostic Explanations): • Graph Statistical Analysis & Feature Enrichment • Representative Subgraph Extraction • Community Prediction Model Training & Application

  17. Graph Statistical Analysis & Feature Enrichment (1) Graph Statistical Analysis • Mandatory for ensuring that the extracted subgraph is representative of the original graph. • Calculated metrics: • the average node degree: the average number of connections per node, • the graph's degree distribution: the probability distribution of node degrees over the whole network.
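Both metrics are straightforward to compute from an adjacency representation. A minimal sketch, assuming an adjacency-dict graph (the representation and toy graph are illustrative, not from the slides):

```python
from collections import Counter

def degree_stats(adj):
    """Average node degree and the empirical degree distribution
    (probability of each degree value over the whole network)."""
    degrees = [len(nbrs) for nbrs in adj.values()]
    avg = sum(degrees) / len(degrees)
    dist = {d: c / len(degrees) for d, c in sorted(Counter(degrees).items())}
    return avg, dist

# Toy graph: two triangles joined by a bridge edge (2, 3).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
avg, dist = degree_stats(adj)  # avg = 14/6; dist = {2: 4/6, 3: 2/6}
```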

  18. Graph Statistical Analysis & Feature Enrichment (2) Feature Enrichment • The information extracted from an edge and its adjacent nodes alone is not sufficient, so calculating the metrics up to a depth k sounds promising. • Empirically: • For depths greater than 5, feature enrichment brings no benefit, since the calculated values tend to converge to the graph average. • The prediction model maximized its accuracy with feature enrichment up to depth 3.

  19. Graph Statistical Analysis & Feature Enrichment (3) Feature Enrichment • The only node-oriented feature selected is the number of k-depth nodes for each node. • The edge-oriented features are: • the loose similarity: the fraction of common over the combined total number of the nodes' sets. • the dissimilarity: the fraction of uncommon over the combined total number of the nodes' sets. • the edge balance: the fraction of the absolute difference between the adjacent peers over the combined total number of nodes. • the edge information: considering that the corresponding element of a graph's k-th power adjacency matrix indicates the number of distinct k-length paths between the incident vertices, this feature reveals the pair's interconnection strength.
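The slide gives only verbal definitions, so the sketch below is one plausible reading: "combined total number of the nodes' sets" is interpreted as the size of the union of the two endpoints' k-hop neighbourhoods. The function names and the toy graph are my own assumptions:

```python
def khop_neighbours(adj, s, k):
    """All nodes reachable from s within k hops (excluding s itself)."""
    seen, frontier = {s}, {s}
    for _ in range(k):
        frontier = {w for u in frontier for w in adj[u]} - seen
        seen |= frontier
    return seen - {s}

def edge_features(adj, u, v, k=1):
    """Loose similarity, dissimilarity and edge balance for edge (u, v),
    assuming the union of the two neighbourhoods as the denominator."""
    nu, nv = khop_neighbours(adj, u, k), khop_neighbours(adj, v, k)
    union, common = nu | nv, nu & nv
    loose_similarity = len(common) / len(union)
    dissimilarity = len(union - common) / len(union)
    edge_balance = abs(len(nu) - len(nv)) / len(union)
    return loose_similarity, dissimilarity, edge_balance

# Toy graph: two triangles joined by a bridge edge (2, 3).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
ls, ds, eb = edge_features(adj, 0, 1, k=1)
```

For the intra-triangle edge (0, 1) the neighbourhood sets are {1, 2} and {0, 2}, so the union has 3 elements and the single common node 2 yields a loose similarity of 1/3.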

  20. Representative Subgraph Extraction (1) • The Tiny Sample Extractor algorithm has been selected for the subgraph extraction process due to its efficiency and high scalability. • Algorithm: • Considering the already calculated average node degree and the graph's degree distribution, the original graph's degree exponent, which is the slope of the log-log complementary cumulative distribution function (CCDF), is calculated. Bear in mind that the CCDF, for a given degree value d, returns the fraction of nodes that have degree greater than d. • Starting from a random node, 3 biased random walks are performed to define: • the minimum degree exponent value from the first one, • the maximum degree exponent value from the second one, • the resulting subgraph from the last one. • In the extreme scenario where the original graph is immensely big, the representative subgraph extraction can be applied to a known/analysed subpart of the original graph.
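The CCDF and the degree exponent mentioned in the first step can be sketched as follows; the least-squares fit of the log-log slope is one standard way to estimate the exponent (the function names and the synthetic degree sample are my own assumptions, not the Tiny Sample Extractor's internals):

```python
import math
from collections import Counter

def ccdf(degrees):
    """P(D > d): fraction of nodes with degree strictly greater than d."""
    n, counts = len(degrees), Counter(degrees)
    out, greater = {}, n
    for d in sorted(counts):
        greater -= counts[d]
        out[d] = greater / n
    return out

def degree_exponent(degrees):
    """Slope of the log-log CCDF, via ordinary least squares."""
    pts = [(math.log(d), math.log(p))
           for d, p in ccdf(degrees).items() if d > 0 and p > 0]
    n = len(pts)
    sx = sum(x for x, _ in pts)
    sy = sum(y for _, y in pts)
    sxx = sum(x * x for x, _ in pts)
    sxy = sum(x * y for x, y in pts)
    return (n * sxy - sx * sy) / (n * sxx - sx * sx)

# Synthetic heavy-tailed degree sample: halving counts as degree doubles.
degrees = [1] * 8 + [2] * 4 + [4] * 2 + [8]
c = ccdf(degrees)          # e.g. c[1] == 7/15
slope = degree_exponent(degrees)  # negative for a heavy-tailed sample
```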

  21. Representative Subgraph Extraction (2) • Even if the extraction of a representative subgraph remains controversial, since real networks exhibit a degree distribution that follows a power law, the extracted graph's degree distribution remains remarkably similar to that of the original graph, as shown in the following figure.

  22. Community Prediction Model Training & Application (1) Model Selection • Despite the intrinsic accuracy drawbacks of linear models, it is frequently proven that on real-world problems they are surprisingly competitive with non-linear ones, thanks to their low variance. • This fact actually follows Occam's Razor. • The most suitable model in terms of big data efficiency and interpretability is logistic regression. • Projected onto community detection, the ultimate purpose of a logistic regression model is to predict whether a given edge connects nodes of the same community or not, depending on the previously calculated graph topology features.
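The edge classifier can be sketched with plain gradient-descent logistic regression. This is a minimal single-machine stand-in, not the distributed Spark implementation the slides refer to; the single "loose similarity" feature and the tiny synthetic training set are illustrative assumptions:

```python
import math

def train_logistic(X, y, lr=0.5, epochs=500):
    """Logistic regression via stochastic gradient descent (sketch)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - yi                      # gradient of the log-loss
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    """1 = 'edge connects nodes of the same community', 0 = otherwise."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0

# Hypothetical training set: one feature (loose similarity) per edge,
# label 1 for intra-community edges, 0 for cross-community edges.
X = [[0.9], [0.8], [0.7], [0.2], [0.1], [0.3]]
y = [1, 1, 1, 0, 0, 0]
w, b = train_logistic(X, y)
```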

  23. Community Prediction Model Training & Application (2) Automatic Edge Labeling • To train the community prediction model properly, applying the Louvain algorithm on the extracted representative subgraph is considered fundamental. • However, as is broadly known, the community structure generated by Louvain is heavily affected by the merging fashion (BFS, DFS) and the order in which candidates are checked, so its re-execution seems essential. • Even if this repetition might sound excessively demanding, the size of the extracted subgraph effectively limits the required calculations.

  24. Community Prediction Model Training & Application (3) Feature Selection (L1 Regularization) • The curse of dimensionality (irrelevant input features included in prediction model training) and the collinearity effect (two or more predictors correlated with one another) introduce unnecessary complexity and reduce interpretability, so the training process should perform automatic feature selection by applying L1 regularization, which suppresses the effect of redundant features. • Hence, an L1-regularized distributed logistic regression classifier is trained on the representative subgraph and applied to the original one to reveal the underlying community structure.
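Concretely, L1 regularization adds a penalty term to the standard logistic log-loss; this is the usual formulation (with σ the sigmoid and λ the regularization strength), under which redundant coefficients are driven exactly to zero, effecting the automatic feature selection described above:

```latex
\min_{w,\,b}\; -\sum_{i=1}^{N}\Big[y_i\log\sigma(w^\top x_i + b)
  + (1-y_i)\log\big(1-\sigma(w^\top x_i + b)\big)\Big]
  + \lambda\,\lVert w \rVert_1
```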

  25. Experiments

  26. Analyzed data sources • For this experimentation, the following widely used and extensively analyzed social graphs are used. • Especially the Zachary Karate Club and the Dolphins datasets, despite their tiny size, are generally considered reference datasets, since they are historically two of the first social graphs ever collected.

  27. System details & KPI explanation • All the experiments were executed on an eight-node Spark 2.2.0 cluster, with 4 GB of RAM and 1 virtual core per virtual machine. Even if processing performance is out of scope, it is worth mentioning that not only the processing time but also the memory requirements were constantly lower compared to those of the distributed Louvain implementation. • In terms of completeness, all the elementary classification performance metrics have been calculated, with "community" as the positive class. Specifically: • The accuracy: the number of edges correctly classified as either "community" or "non-community" over all the predictions made. • The precision: the number of edges correctly classified as "community" over the total number of "community" predictions. • The recall/sensitivity: the number of edges correctly classified as "community" over the total number of truly "community" edges. • The specificity: the number of edges correctly classified as "non-community" over the total number of truly "non-community" edges.
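With the standard definitions and "community" as the positive class, the four KPIs reduce to simple counts over the confusion matrix. A minimal sketch (the function name and toy labels are illustrative assumptions):

```python
def classification_metrics(y_true, y_pred):
    """Edge-level metrics; 1 = 'community' edge (the positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Toy labels: 5 edges, one community edge missed, one false alarm.
metrics = classification_metrics([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
```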

  28. Results (1) • Focusing on the outstanding recall and precision statistics, it is more than obvious that the proposed implementation impressively identifies the community edges. • The graph’s consistency is preserved by retaining the social graph’s true connections and avoiding graph fragmentation. • However, the mediocre specificity statistic shows that the trained model misclassifies several non-community edges as community ones.

  29. Results (2)

  30. Future work

  31. Next steps • Despite the encouraging results, it is obvious that the methodology’s community prediction can be substantially improved by: • Applying a graph cleansing preprocessing step, where outlier removal will be applied. • Combining the collinear variables into a single predictor to tackle the multi-collinearity effect. • Applying less interpretable but more accurate models, e.g. non-linear prediction models.
