Accessing Web Data for Analysing Churn and Other Phenomena
Marcel Karnstedt
DERI, NUI Galway
Clique: Graph & Network Analysis Cluster
Data Access
• Virtuoso: triple store for RDF data
• Query language: SPARQL
• If needed, raw data (triples or RDF/XML) can be provided
• http://virtuoso.deri.ie/
• Currently, public access to DBpedia data
• Access to boards.ie data will be restricted, actual URI TBA
  • License agreement required (by mail)
  • Login credentials are sent in response
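A minimal sketch of how a SPARQL query could be sent to such a store over HTTP, assuming the endpoint lives at Virtuoso's default path `/sparql` on the host named above (the actual URI for boards.ie is still TBA). The helper name `build_sparql_request` is hypothetical; the request is only built, not sent, and login handling is left out because the access scheme is not yet known.

```python
from urllib.parse import urlencode, urlsplit, parse_qs
from urllib.request import Request

# Assumption: Virtuoso's default endpoint path "/sparql" on the host from the slide.
ENDPOINT = "http://virtuoso.deri.ie/sparql"


def build_sparql_request(query, endpoint=ENDPOINT, default_graph=None):
    """Return a urllib Request for a SPARQL protocol GET query (nothing is sent)."""
    params = {"query": query}
    if default_graph is not None:
        # "default-graph-uri" is the SPARQL protocol parameter for the default graph.
        params["default-graph-uri"] = default_graph
    url = endpoint + "?" + urlencode(params)
    # Ask for the standard SPARQL JSON results format via content negotiation.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(
    "select ?s where { ?s ?p ?o } limit 5",
    default_graph="http://boards.ie",
)
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` would execute the query once an endpoint and credentials are available.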
General Information
• http://wiki.sioc-project.org/index.php/Data/Boards.ie/Structure
• URIs containing specific prefixes (‘sioc:’, ‘dc:’, ‘dcterms:’, ‘foaf:’)
• Numbers
  • Over 10 years of discussion
  • 156,035 persons (138,139 user accounts)
  • 17,896 buddy edges
  • 65,635 (sub-)forums
  • 826,487 threads
  • About 8 million posts
  • 14 different types of edges
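The prefixes above abbreviate full namespace URIs, and queries against the store need the expanded form. The sketch below maps each prefix to the namespace published by its vocabulary. That the boards.ie export uses exactly these namespaces is an assumption, and `expand` is a hypothetical helper name.

```python
# Namespace URIs as published by the SIOC, Dublin Core and FOAF vocabularies.
PREFIXES = {
    "sioc": "http://rdfs.org/sioc/ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}


def expand(prefixed_name):
    """Turn a name like 'foaf:knows' into its full URI.

    Raises KeyError if the prefix is not in PREFIXES.
    """
    prefix, _, local = prefixed_name.partition(":")
    return PREFIXES[prefix] + local


for name in ("foaf:knows", "foaf:maker", "dcterms:created"):
    print(name, "->", "<" + expand(name) + ">")
```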
Graph Structure
[Figure: schema of the boards.ie graph. Node types: forum, thread, post, person, user. Example node identifiers: f, t, u#person, u#user. Labels in the diagram: title, description, has_parent, parent_of, container_of, seeAlso, id, created, numviews, reply_of, links_to, content, content:encoded, next_by_date, previous_by_date, maker, has_creator, creator_of, knows, nick, name, depiction, holds_account, accountName, role, has_function.]
SPARQL
• Select and Where clauses are mandatory
• From clause is optional
  • Should be used to restrict queries to the boards.ie graph
• http://docs.openlinksw.com/virtuoso/rdfsparql.html

    select ?s ?o ?v
    [from <http://some/graph>]
    where {
      ?s ?p ?o .
      filter (?o > 20) .
      ?s <http://some/pred> 'Marcel' .
      ?o ?p2 ?v
    }
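The clause structure described on this slide can be sketched as a small query builder: the select and where parts are always emitted, the from part only when a graph is given. `build_query` is a hypothetical helper for illustration. It does plain string assembly with no escaping or validation.

```python
def build_query(variables, patterns, graph=None):
    """Assemble a SELECT query; the FROM clause appears only if graph is given."""
    select = "select " + " ".join("?" + v for v in variables)
    from_clause = f" from <{graph}>" if graph else ""
    where = " where { " + " . ".join(patterns) + " }"
    return select + from_clause + where


# Without a graph the query runs against the store's default dataset.
print(build_query(["s", "o"], ["?s ?p ?o", "filter (?o > 20)"]))
# With a graph the query is restricted to it.
print(build_query(["s", "o"], ["?s ?p ?o"], graph="http://boards.ie"))
```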
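SPARQL Example

    select ?u ?u1 ?u2
    from <http://boards.ie>
    where {
      # items whose creation date contains "1995"
      ?p <http://purl.org/dc/terms/created> ?d .
      # str() keeps regex applicable if ?d is a typed literal
      filter (regex(str(?d), '.*1995.*')) .
      # the maker of the item, whom the maker knows, and whom they know
      ?p <http://xmlns.com/foaf/0.1/maker> ?u .
      ?u <http://xmlns.com/foaf/0.1/knows> ?u1 .
      ?u1 <http://xmlns.com/foaf/0.1/knows> ?u2
    }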
SPARQL Example /2

    select ?u ?cnt ?u1 ?u2
    from <http://boards.ie>
    where {
      ?p <http://purl.org/dc/terms/created> ?d .
      filter (regex(str(?d), '.*2005.*')) .
      ?p <http://xmlns.com/foaf/0.1/maker> ?u .
      ?u <http://xmlns.com/foaf/0.1/knows> ?u1 .
      ?u1 <http://xmlns.com/foaf/0.1/knows> ?u2 .
      # subquery: number of items made by each user
      {
        select ?u (count(?x) as ?cnt)
        where { ?x <http://xmlns.com/foaf/0.1/maker> ?u }
        group by ?u
      }
    }
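Queries like the examples above return one row per (?u, ?u1, ?u2) match. A rough sketch of how such rows could be read from the W3C SPARQL JSON results format and turned into a two-hop neighbourhood per user follows. The sample response and its example.org user URIs are made up for illustration, and the function names are hypothetical.

```python
import json
from collections import defaultdict

# Hypothetical response in the SPARQL JSON results format; the URIs are invented.
SAMPLE = json.dumps({
    "head": {"vars": ["u", "u1", "u2"]},
    "results": {"bindings": [
        {"u": {"type": "uri", "value": "http://example.org/user/1"},
         "u1": {"type": "uri", "value": "http://example.org/user/2"},
         "u2": {"type": "uri", "value": "http://example.org/user/3"}},
        {"u": {"type": "uri", "value": "http://example.org/user/1"},
         "u1": {"type": "uri", "value": "http://example.org/user/2"},
         "u2": {"type": "uri", "value": "http://example.org/user/1"}},
        {"u": {"type": "uri", "value": "http://example.org/user/1"},
         "u1": {"type": "uri", "value": "http://example.org/user/4"},
         "u2": {"type": "uri", "value": "http://example.org/user/5"}},
    ]},
})


def rows(response_text):
    """Yield one tuple of values per result row, in the order given by head.vars."""
    data = json.loads(response_text)
    names = data["head"]["vars"]
    for binding in data["results"]["bindings"]:
        # Unbound variables are absent from a binding, hence .get().
        yield tuple(binding.get(name, {}).get("value") for name in names)


def two_hop_neighbours(response_text):
    """Map each user to the set of users reachable over two 'knows' edges."""
    reach = defaultdict(set)
    for u, _u1, u2 in rows(response_text):
        if u2 != u:  # skip paths that lead back to the starting user
            reach[u].add(u2)
    return dict(reach)


print(two_hop_neighbours(SAMPLE))
```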
Evolution
• http://www.slideshare.net/Cloud/the-evolution-of-boardsie
• Posts, users, threads grew almost exponentially
• Forums grew almost linearly
• August 2007: 43.65% bounce rate, 41.18% new visits
[Figure: timeline from 1998 to 2008 with the milestones "1 forum, 1 user", "20 forums, 1500 users" and "680 forums, 115K users", plus a per-year percentage breakdown for the categories Games, Rec and Topics.]
More Data
• What else could be provided
  • Data from IBM: DogEar, …
  • Web data:
    • DBpedia / DBpedia+delicious+Wikipedia
    • LOD cloud
    • Billion Triple Challenge / Sindice dump
    • Blog posts
    • 6 degrees of TBL (foaf)
    • Webcrawls
    • Twitter
  • Planned: Wikipedia edit data
• Raw data, or possibly a triple store, or would SQL access be better?
• What (else) is required?!
Coming Next
• Distribution fitting on Idiro data
  • Aim: identify distribution of degrees, duration, #calls, #messages, …
  • Power law? Log-normal? DPLN?, …
  • Changes over time?
• Test Kronecker graphs, maybe other generators
  • Aim: privacy and sampling
  • Does this method keep its promises?
  • Comparison to other generators
• Activity measures in social networks
  • Nice tool for churn analysis
• Cross-community effects in scientific communities
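The distribution-fitting step listed above could start along these lines. This is a rough sketch, not the planned method: it uses synthetic data in place of the Idiro data, treats values as continuous, assumes the lower cut-off `xmin` is known, and compares a power law with an untruncated log-normal by raw log-likelihood only. DPLN is not covered, and all function names are hypothetical.

```python
import math
import random


def fit_power_law(xs, xmin):
    """MLE of alpha for the continuous power law p(x) ~ x^-alpha on x >= xmin."""
    logs = [math.log(x / xmin) for x in xs]
    n = len(logs)
    alpha = 1.0 + n / sum(logs)
    loglik = n * math.log((alpha - 1.0) / xmin) - alpha * sum(logs)
    return alpha, loglik


def fit_log_normal(xs):
    """MLE of mu and sigma for a log-normal, plus the maximised log-likelihood."""
    logs = [math.log(x) for x in xs]
    n = len(logs)
    mu = sum(logs) / n
    sigma = math.sqrt(sum((y - mu) ** 2 for y in logs) / n)
    # At the MLE the quadratic term of the log-likelihood sums to n / 2.
    loglik = -sum(logs) - n * math.log(sigma * math.sqrt(2 * math.pi)) - n / 2
    return mu, sigma, loglik


def sample_power_law(n, alpha, xmin, rng):
    """Draw n values by inverse-transform sampling from the power law above."""
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


# Synthetic stand-in for a degree or call-count sample.
rng = random.Random(42)
data = sample_power_law(20000, alpha=2.5, xmin=1.0, rng=rng)

alpha_hat, ll_pl = fit_power_law(data, xmin=1.0)
mu_hat, sigma_hat, ll_ln = fit_log_normal(data)
print(f"power law:  alpha={alpha_hat:.3f}  loglik={ll_pl:.1f}")
print(f"log-normal: mu={mu_hat:.3f} sigma={sigma_hat:.3f}  loglik={ll_ln:.1f}")
```

On real data the cut-off would have to be estimated as well, and a proper model comparison would need a likelihood-ratio test rather than a bare comparison of the two values.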
Discussion
Thank You!
Open for discussion!