Sampling online communities: using triplets as basis for a (semi-)automated hyperlink web crawler
Yoann VENY, Université Libre de Bruxelles (ULB) - GERME, yoann.veny@ulb.ac.be
This research is funded by the FRS-FNRS. Paper presented at the 15th General Online Research Conference, 4-6 March, Mannheim.
Online communities – a theoretical definition
• What is an online community?
• "social aggregations that emerge from the Net when enough people carry on those public discussions long enough, with sufficient human feeling, to form webs of personal relationship in cyberspace" (Rheingold 2000)
• long-term involvement (Jones 2006)
• sense of community (Blanchard 2008)
• temporal perspective (Lin et al 2006)
• Probably important … but the first operation should be to take the 'hyperlink environment' into account: a graph analysis / SNA issue.
Online communities – a graphical definition (1)
• Community = more ties among members than with non-members
• Three general classes of 'community' in graph partitioning algorithms (Fortunato 2010):
• a local definition: focus on sub-graphs (e.g. cliques and n-cliques (Luce 1950), k-plexes (Seidman & Foster 1978), lambda sets (Borgatti et al 1990), …)
• a global definition: focus on the graph as a whole (is the observed graph significantly different from a random graph, e.g. an Erdős-Rényi graph?)
• vertex similarity: focus on actors (e.g. Euclidean distance with hierarchical clustering, max-flow/min-cut (Elias et al 1956; Flake et al 2000))
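As a rough illustration of the 'global definition', the sketch below checks whether a candidate vertex set is internally denser than an Erdős-Rényi graph with the same overall edge probability. It is written in R (the language the conclusion names for the samplers); all function and variable names are illustrative assumptions, not the paper's code.

```r
# Sketch: is a candidate community C internally denser than chance?
internal_density <- function(Y, C) {
  # Y: binary adjacency matrix with zero diagonal; C: vertex indices of the candidate
  sub <- Y[C, C, drop = FALSE]
  sum(sub) / (length(C) * (length(C) - 1))   # directed density within C
}

looks_like_community <- function(Y, C) {
  n <- nrow(Y)
  p <- sum(Y) / (n * (n - 1))                # whole-graph density = Erdos-Renyi tie probability
  internal_density(Y, C) > p                 # denser inside than expected by chance?
}

# Usage:
# Y <- matrix(rbinom(100, 1, 0.1), 10, 10); diag(Y) <- 0
# looks_like_community(Y, c(1, 2, 3))
```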
Online communities – a graphical definition (2)
• Two main problems of graph partitioning in a hyperlink environment:
• 1) network size and form (e.g. tree structure)
• 2) edge direction
• Better to discover communities with an efficient web crawler
Web crawling – generalities
• The general idea of the web crawling process (see the sketch below):
• We have a number of starting blogs (seeds)
• All hyperlinks are retrieved from these seed blogs
• For each newly discovered website, decide whether this new site is accepted or refused
• If the site is accepted, it becomes a seed and the process is repeated on this site
Source: Jacomi & Ghitalla (2007)
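A minimal sketch of this loop, assuming two hypothetical helpers the slides do not specify: get_outlinks(site), which fetches a page and returns the sites it links to, and accept(site, corpus), the inclusion decision.

```r
# Sketch of the general crawl loop: expand from seeds until closure.
crawl <- function(seeds, accept, get_outlinks, max_iter = 10) {
  corpus   <- seeds
  frontier <- seeds
  for (it in seq_len(max_iter)) {
    # retrieve all hyperlinks from the current frontier and drop known sites
    candidates <- setdiff(unique(unlist(lapply(frontier, get_outlinks))), corpus)
    accepted   <- Filter(function(s) accept(s, corpus), candidates)
    if (length(accepted) == 0) break   # closure: nothing new is accepted
    corpus   <- c(corpus, accepted)    # accepted sites become seeds ...
    frontier <- accepted               # ... and are crawled in the next pass
  }
  corpus
}
```

The accept() argument is where the constraints discussed next come in: an unsupervised crawler accepts everything, while a constrained crawler tests for a local structure.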
Web crawling – a constraint-based web crawler (1)
• Two problems with a manual crawler:
• number and quality of decisions
• closure?
• A solution: taking advantage of local structural properties of a network. Assume that a network is the outcome of the aggregation of local social processes.
• Examples in SNA:
• the general philosophy of ERG models (see e.g. Robins et al 2007)
• the local clustering coefficient (see e.g. Watts & Strogatz 1998)
• Constrain the crawler to identify local social structures (e.g. triangles, mutual dyads, transitive triads, …); a sketch of one such rule follows below.
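A hedged sketch of one such constraint: accept a candidate only if it closes at least one triangle with two already-accepted sites that are themselves tied. Here links is an assumed edge list (a data frame with columns from and to); none of these names come from the paper.

```r
# Triangle constraint: does the candidate close a triangle with the corpus?
closes_triangle <- function(site, corpus, links) {
  # corpus members tied to the candidate, ignoring tie direction
  nbrs <- intersect(union(links$to[links$from == site],
                          links$from[links$to == site]),
                    corpus)
  if (length(nbrs) < 2) return(FALSE)
  # is any pair of those neighbours itself connected (in either direction)?
  for (a in nbrs) for (b in setdiff(nbrs, a)) {
    if (any(links$from == a & links$to == b)) return(TRUE)
  }
  FALSE
}
```

Wrapped as function(s, corpus) closes_triangle(s, corpus, links), this rule plugs directly into the crawl() sketch above as its accept() argument.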
Web crawling – a constraint-based web crawler (2)
[Figure: an example of a constrained web crawler based on the identification of triangles, and its generalisation]
Experimental results – method
Y is the n × n adjacency matrix of a binary network, with elements $y_{ij} \in \{0, 1\}$. Each crawler constrains inclusion through one of the following local statistics:
• Undirected dyadic: # edges $= \sum_{i<j} y_{ij}$ (unsupervised crawler)
• Directed dyadic: # mutual dyads $= \sum_{i<j} y_{ij}\, y_{ji}$ (mutuality crawler)
• Undirected triadic: # triangles $= \sum_{i<j<k} y_{ij}\, y_{jk}\, y_{ik}$ (triangle crawler)
• Directed triadic: # transitive triplets $= \sum_{i \neq j} y_{ij}\, L2_{ij}$ (triplet crawler)
where $L2_{ij} = \sum_k y_{ik}\, y_{kj}$ is the number of "two-paths" connecting i and j or j and i.
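The same four statistics, computed from a binary adjacency matrix Y with a zero diagonal; a sketch with function names of my own, not the paper's code.

```r
# The four constraint statistics, one per crawler variant.
n_edges <- function(Y) {                     # undirected dyadic: # edges
  U <- (Y | t(Y)) * 1                        # symmetrise: a tie in either direction counts
  sum(U) / 2
}
n_mutuals <- function(Y) sum(Y * t(Y)) / 2   # directed dyadic: # mutual dyads
n_triangles <- function(Y) {                 # undirected triadic: # triangles
  U <- (Y | t(Y)) * 1
  sum(diag(U %*% U %*% U)) / 6               # trace of U^3 counts each triangle six times
}
n_triplets <- function(Y) {                  # directed triadic: # transitive triplets
  sum(Y * (Y %*% Y))                         # y_ij times the number of two-paths from i to j
}
```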
Experimental results – results (1)
Starting set: 6 'political ecology' blogs.
Remarks: the dyad and triplet samplers reached closure; the unsupervised and triangle samplers had to be stopped manually.
Experimental results – results (2)
[Figure: sampled networks for the triangle, dyad, and triplet crawlers]
• The unsupervised crawler is not manageable (more than 20,000 actors after 4 iterations!)
• Dyads: did not select 'authoritative' sources, and possibly sensitive to the number of seeds?
• Triplets seem to be the best solution: they take tie direction into account, profit from authoritative sources, and are conservative.
• Triangles: problem of network size … but the sampled network can have interesting properties.
Conclusion and further research
• Pitfalls to avoid:
• Not all relevant information is necessarily in the core: there is a lot of information in the periphery of this core.
• Based on human behaviour patterns: not adapted at all to other kinds of networks (word co-occurrences, protein chains, …).
• Do not throw away more classical graph partitioning methods.
• Always question your results. How to assess the efficiency of a crawler? Should communities in a web graph always be topic-centred?
• Further research:
• analysis and detection of 'multi-core' networks
• 'random walks' in complete networks to find recursive patterns using T.C. assumptions
• code of the samplers in 'R'