Accessing Web Data for Analysing Churn and Other Phenomena Marcel Karnstedt DERI, NUI Galway


Presentation Transcript


  1. Accessing Web Data for Analysing Churn and Other Phenomena • Marcel Karnstedt • DERI, NUI Galway • Clique: Graph & Network Analysis Cluster

  2. Data Access • Virtuoso: triple store for RDF data • Query language: SPARQL • If needed, raw data (triples or RDF/XML) can be provided • http://virtuoso.deri.ie/ • Currently, public access to DBpedia data • Access to boards.ie data will be restricted; the actual URI is TBA • License agreement required (by mail) • Login credentials are sent in response

  3. General Information • http://wiki.sioc-project.org/index.php/Data/Boards.ie/Structure • URIs contain specific prefixes ('sioc:', 'dc:', 'dcterms:', 'foaf:') • Numbers: • Over 10 years of discussion • 156,035 persons (138,139 user accounts) • 17,896 buddy edges • 65,635 (sub-)forums • 826,487 threads • About 8 million posts • 14 different types of edges

  4. Graph Structure • [Figure: schema graph of the boards.ie data] • Node labels in the diagram: f (forum), t (thread), post, person (u#person), user (u#user), role • Edge and property labels in the diagram: title, description, created, numviews, id, has_parent, parent_of, container_of, seeAlso, reply_of, links_to, next_by_date, previous_by_date, content, content:encoded, maker, has_creator, creator_of, knows, nick, name, depiction, holds_account, accountName, has_function

  5. SPARQL • Select and Where clauses are mandatory • From clause is optional; it should be used to restrict queries to boards.ie • http://docs.openlinksw.com/virtuoso/rdfsparql.html • General form:
     select ?s, ?o, ?v
     [from <http://some/graph>]
     where {
       ?s ?p ?o .
       filter (?o > 20) .
       ?s <http://some/pred> 'Marcel' .
       ?o ?p2 ?v
     }
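A query of this shape can be sent to a SPARQL endpoint as an HTTP GET request. The following is a minimal sketch, not the access method prescribed by the slides: it only builds the request URL and does not contact any server. The `/sparql` path is an assumption (it is Virtuoso's usual default and is not stated on the slides), `build_sparql_url` is a hypothetical helper, and the query is written without commas in the select list, as standard SPARQL expects. The parameter names `query` and `default-graph-uri` come from the SPARQL protocol.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumption: "/sparql" is Virtuoso's usual default path; the slides only give the host.
ENDPOINT = "http://virtuoso.deri.ie/sparql"

# The general query form from the slide, with the optional graph passed separately.
QUERY = """
select ?s ?o ?v
where {
  ?s ?p ?o .
  filter (?o > 20) .
  ?s <http://some/pred> 'Marcel' .
  ?o ?p2 ?v
}
"""

def build_sparql_url(endpoint, query, default_graph=None):
    """Hypothetical helper: encode a query as a SPARQL-protocol GET URL."""
    params = [("query", query.strip())]
    if default_graph is not None:
        # Plays the role of the optional "from <graph>" clause.
        params.append(("default-graph-uri", default_graph))
    return endpoint + "?" + urlencode(params)

url = build_sparql_url(ENDPOINT, QUERY, default_graph="http://some/graph")
print(url)
```

Passing the graph as `default-graph-uri` keeps the query text itself reusable across graphs; writing `from <http://some/graph>` inside the query is the equivalent in-query form.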

  6. SPARQL Example
     select ?u, ?u1, ?u2
     from <http://boards.ie>
     where {
       ?p <http://purl.org/dc/terms/created> ?d .
       filter (regex(?d, '.*1995.*')) .
       ?p <http://xmlns.com/foaf/0.1/maker> ?u .
       ?u <http://xmlns.com/foaf/0.1/knows> ?u1 .
       ?u1 <http://xmlns.com/foaf/0.1/knows> ?u2
     }

  7. SPARQL Example /2
     select ?u, ?cnt, ?u1, ?u2
     from <http://boards.ie>
     where {
       ?p <http://purl.org/dc/terms/created> ?d .
       filter (regex(?d, '.*2005.*')) .
       ?p <http://xmlns.com/foaf/0.1/maker> ?u .
       ?u <http://xmlns.com/foaf/0.1/knows> ?u1 .
       ?u1 <http://xmlns.com/foaf/0.1/knows> ?u2 .
       {
         select ?u, count(?x) as ?cnt
         from <http://boards.ie>
         where { ?x <http://xmlns.com/foaf/0.1/maker> ?u }
       }
     }
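To make clear what the two parts of this query compute, here is a small in-memory sketch in Python. It is an illustration only: the triples are toy data with hypothetical identifiers (not real boards.ie content), the function names are invented for this example, and the date filter on `dcterms:created` is left out for brevity.

```python
from collections import Counter

MAKER = "http://xmlns.com/foaf/0.1/maker"
KNOWS = "http://xmlns.com/foaf/0.1/knows"

# Toy triples (hypothetical identifiers, not real boards.ie data).
triples = [
    ("post1", MAKER, "alice"),
    ("post2", MAKER, "alice"),
    ("post3", MAKER, "bob"),
    ("alice", KNOWS, "bob"),
    ("bob", KNOWS, "carol"),
]

def posts_per_user(triples):
    """Mirrors the inner select: count ?x per ?u over '?x foaf:maker ?u'."""
    return Counter(o for _, p, o in triples if p == MAKER)

def two_hop_contacts(triples, user):
    """Mirrors '?u knows ?u1 . ?u1 knows ?u2' for one fixed ?u."""
    knows = {}
    for s, p, o in triples:
        if p == KNOWS:
            knows.setdefault(s, []).append(o)
    return [(u1, u2) for u1 in knows.get(user, []) for u2 in knows.get(u1, [])]

counts = posts_per_user(triples)
print(counts["alice"], two_hop_contacts(triples, "alice"))
```

In the SPARQL version the outer pattern and the subquery are joined on the shared variable ?u, so every (?u, ?u1, ?u2) row carries that user's total post count.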

  8. Evolution • http://www.slideshare.net/Cloud/the-evolution-of-boardsie • Posts, users, threads grew almost exponentially • Forums grew almost linearly • August 2007: 43.65% bounce rate, 41.18% new visits • [Figure: timeline 1998 to 2008 with milestones 1 forum / 1 user, 20 forums / 1500 users, 680 forums / 115K users, and per-year rows labelled Games, Rec and Topics]

  9. More Data • What else could be provided • Data from IBM: DogEar, … • Web data: • DBPedia / DBPedia+delicious+Wikipedia • LOD cloud • Billion Triple Challenge / Sindice dump • Blog posts • 6 degrees of TBL (foaf) • Webcrawls • Twitter • Planned: Wikipedia edit data • Raw data or probably triple store, better SQL access? • What (else) is required?!

  10. Coming Next • Distribution fitting on Idiro data • Aim: identify distribution of degrees, duration, #calls, #messages, … • Power law? Log-normal? DPLN? … • Changes over time? • Test Kronecker graphs, maybe other generators • Aim: privacy and sampling • Does this method hold its promises? • Comparison to other generators • Activity measures in social networks • Nice tool for churn analysis • Cross-community effects in scientific communities
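The distribution-fitting step mentioned above could be approached as in the following sketch, which fits a continuous power law and a log-normal by maximum likelihood and compares their log-likelihoods. This is an assumption about one possible procedure, not the method used on the Idiro data: the sample is synthetic, `xmin` is taken as known, DPLN is not covered, and the log-normal is fitted without truncation at `xmin`, so the comparison is only a rough indication.

```python
import math
import random

def fit_power_law(xs, xmin):
    """MLE for a continuous power law p(x) ~ x^(-alpha) on x >= xmin."""
    logs = [math.log(x / xmin) for x in xs]
    n = len(xs)
    alpha = 1.0 + n / sum(logs)
    loglik = n * math.log((alpha - 1.0) / xmin) - alpha * sum(logs)
    return alpha, loglik

def fit_log_normal(xs):
    """MLE for a log-normal: mean and std of ln(x), plus the log-likelihood."""
    logs = [math.log(x) for x in xs]
    n = len(xs)
    mu = sum(logs) / n
    sigma = math.sqrt(sum((l - mu) ** 2 for l in logs) / n)
    loglik = sum(
        -l - math.log(sigma) - 0.5 * math.log(2 * math.pi)
        - (l - mu) ** 2 / (2 * sigma ** 2)
        for l in logs
    )
    return mu, sigma, loglik

# Synthetic power-law sample via inverse transform sampling (not real call data).
random.seed(0)
TRUE_ALPHA, XMIN, N = 2.5, 1.0, 5000
sample = [XMIN * (1.0 - random.random()) ** (-1.0 / (TRUE_ALPHA - 1.0)) for _ in range(N)]

alpha_hat, ll_power = fit_power_law(sample, XMIN)
mu_hat, sigma_hat, ll_lognormal = fit_log_normal(sample)
print(f"alpha={alpha_hat:.3f}  ll_power={ll_power:.1f}  ll_lognormal={ll_lognormal:.1f}")
```

On real degree or call-count data, `xmin` would itself have to be estimated, and a likelihood-ratio test would be a more careful way to decide between candidate distributions than comparing raw log-likelihoods.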

  11. Discussion Thank You! Open for discussion!
