1st Gossple Workshop on Social Networking (December 2010) Large-scale Data Sharing by Exploiting Gossiping Esther Pacitti Saphir SOPHIA ANTIPOLIS - MéDITERRANéE
Context: P2P Data Sharing • We consider P2P online communities whose participants can be • Professionals (researchers, engineers, support staff, etc.) who use web-scale collaboration in their workplace • Large numbers of users and data (clouds, grids, Internet) • Example applications: • P2P recommendation systems • Useful for processing scientific workflows among participants' peers • P2P query reformulation • Clinical case sharing among doctors or physicians • P2P CDN • Projects: • ANR DataRing (2009-2012, P2P online communities) • Datluge (2010-2012, with UFRJ, Brazil, on P2P scientific workflows)
MOTIVATIONS Bioinformatics Chemistry, Materials Science and Physics Computer Science
P2PRec: document recommender • Huge graph: G = (D, U, E, T), where • D is the set of shared documents • U is the set of users in the system • E is the set of edges between users, such that there is an edge e(u,v) if users u and v are friends • T is the set of users' topics of interest • Problem: given a query, recommend the most relevant documents • Our approach • Reduce the search space by identifying relevant users • Identify relevant users • Users that store/download enough high-quality documents and become a kind of provider in specific topics • Recommended by trusted friends • P2P overlay: semantic gossiping • Disseminates relevant users and their topics of interest
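As an illustration only (the dictionary/set layout below is an assumption, not the authors' data model), the graph G = (D, U, E, T) can be encoded as:

```python
# Illustrative encoding of G = (D, U, E, T); names and contents are made up.
G = {
    "D": {"d1", "d2", "d3"},                 # shared documents
    "U": {"u1", "u2", "u5"},                 # users
    "E": {("u1", "u5"), ("u2", "u5")},       # friendship edges e(u, v)
    "T": {"u1": {"t1", "t2"}, "u5": {"t1"}}, # users' topics of interest
}

def friends_of(user, G):
    """Friends of a user: neighbors in the (undirected) friendship edge set E."""
    return {v for (a, b) in G["E"] for v in (a, b)
            if user in (a, b) and v != user}

print(friends_of("u5", G))   # {'u1', 'u2'}
```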
P2PRec*: document recommender • Topics of interest • With respect to the documents a user stores • Extracted automatically • Friendship network • Explicit friendship (may be leveraged with implicit friendship) • Expresses users' trust • Implemented as FOAF files (friend-of-a-friend files, a machine-readable vocabulary serialized in RDF/XML) • Keyword queries • Mapped to topics • Mostly related to the user's topics of interest • Measures to • Check the similarity of users wrt their topics (Dice coefficient) • Assess the relevance of a user *Joint work with F. Draidi, P. Valduriez, B. Kemme, to appear as an Inria report
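The Dice coefficient mentioned above can be computed directly over two users' topic sets; a minimal sketch (the function name and example topics are illustrative, not from the slides):

```python
def dice_similarity(topics_u: set, topics_v: set) -> float:
    """Dice coefficient between two topic sets: 2*|A ∩ B| / (|A| + |B|)."""
    denom = len(topics_u) + len(topics_v)
    return 2 * len(topics_u & topics_v) / denom if denom else 0.0

# Example: two users sharing one of their two topics of interest each.
print(dice_similarity({"t1", "t2"}, {"t2", "t3"}))   # 0.5
```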
Semantic-Gossiping • Figure: u1's and u5's local views before and after a gossip exchange (peers u1-u6 with topics t1, t2, t3) • u1's FOAF file records its topics (t1, t2) and a friend link to u5's FOAF file and topics • While gossiping, peers compare topics using the Dice coefficient; if the Dice similarity between u and v exceeds a threshold τ, u asks for friendship • If the friendship is accepted, v is added to u's FOAF file
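A rough sketch of one semantic-gossip round under these assumptions (the data layout, the threshold value τ, and the view-truncation policy are all illustrative; the slide does not give the exact protocol):

```python
TAU = 0.4  # assumed friendship threshold on the Dice similarity

def dice(a: set, b: set) -> float:
    denom = len(a) + len(b)
    return 2 * len(a & b) / denom if denom else 0.0

def gossip_round(me, partner, view_size=4):
    """Merge the two local views, then ask for friendship with similar-enough peers."""
    merged = {**me["view"], **partner["view"]}
    for peer_id, topics in merged.items():
        if peer_id == me["id"] or peer_id in me["foaf"]:
            continue
        if dice(me["topics"], topics) > TAU:
            me["foaf"].add(peer_id)   # friendship requested and (here) assumed accepted
    me["view"] = dict(list(merged.items())[:view_size])   # keep a bounded local view

u1 = {"id": "u1", "topics": {"t1", "t2"}, "foaf": set(), "view": {"u2": {"t3"}}}
u5 = {"id": "u5", "topics": {"t1", "t2"}, "foaf": set(),
      "view": {"u5": {"t1", "t2"}, "u4": {"t1"}}}
gossip_round(u1, u5)
print(u1["foaf"])   # u5 and u4 are similar enough to become FOAF friends
```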
Relevant Users • Users' topics of interest are automatically extracted using LDA* • by inspecting the documents' topic vectors • A user is considered relevant on a topic t ∈ Tu if a percentage of its documents have high quality in topic t • Each document doc at user u has • A rating given to doc: rate_doc • A topic vector for doc (extracted using LDA): V_doc = {w_doc,t1, ..., w_doc,td} • doc is considered high quality in a topic t, quality_t(doc, u), if w_doc,t * rate_doc > a threshold value • A user can be relevant in more than one topic *Latent Dirichlet Allocation (topic classifier)
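A minimal sketch of the high-quality-document test and the per-topic relevance check described above (the threshold values, field names, and the exact "percentage of documents" rule are assumptions for illustration):

```python
QUALITY_THRESHOLD = 0.5    # assumed bound on w_doc,t * rate_doc
RELEVANCE_FRACTION = 0.3   # assumed fraction of high-quality docs needed per topic

def is_high_quality(doc, topic):
    """doc is high quality in `topic` if topic weight * rating exceeds the threshold."""
    return doc["topic_vector"].get(topic, 0.0) * doc["rating"] > QUALITY_THRESHOLD

def relevant_topics(user_docs, topics):
    """Topics in which a user is relevant: enough of its docs are high quality."""
    relevant = []
    for t in topics:
        hq = sum(1 for d in user_docs if is_high_quality(d, t))
        if user_docs and hq / len(user_docs) >= RELEVANCE_FRACTION:
            relevant.append(t)
    return relevant

docs = [
    {"rating": 0.9, "topic_vector": {"t1": 0.8, "t2": 0.1}},
    {"rating": 0.4, "topic_vector": {"t1": 0.6, "t2": 0.3}},
]
print(relevant_topics(docs, ["t1", "t2"]))   # ['t1']
```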
Query Processing • Implements recommendation • Input: keywords • Output: • Links to a set of good-quality documents; may include links to documents on the topics of interest of a friend (query expansion) • Popularity and similarity info • Example: doctors studying the behavior of a gene X may be glad to learn about the diseases it can cause and to check some experimental data sets
Query Processing • Figure: a requester issues query q (q.t = t1, q.TTL = 2) at u1; each peer receiving q computes sim(doc, q) against its summary of document similarity and classification info, returns recommended docs, and forwards q to friends with a decremented TTL (e.g. u1 → u2 and u7 with q.TTL = 1, then onward until q.TTL = 0); u1's FOAF file holds its topics of interest and a friend link to u5's FOAF file and topics • 1) Query q is mapped to a topic or topics Tq • 2) Select the top-k friends in the FOAF file wrt the query topics (cosine similarity) • 3) Redirect the query • 4) Repeat 2) and 3) recursively until the TTL expires
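The forwarding steps above can be sketched as follows (the Peer class, its fields, and the similarity thresholds are assumptions; the slides do not show the actual implementation):

```python
import math
from dataclasses import dataclass, field

def cosine(v1: dict, v2: dict) -> float:
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

@dataclass
class Peer:
    name: str
    topics: dict                                  # topic -> weight (e.g. from LDA)
    docs: list = field(default_factory=list)      # (doc_id, topic_vector) pairs
    friends: list = field(default_factory=list)   # FOAF friends (other Peer objects)

    def recommend_local(self, q_topics, min_sim=0.1):
        """Local documents whose topic vector is similar enough to the query topics."""
        scored = [(doc_id, cosine(vec, q_topics)) for doc_id, vec in self.docs]
        return [(doc_id, s) for doc_id, s in scored if s > min_sim]

def process_query(peer, q_topics, ttl, k=2):
    """Collect local recommendations and forward to the top-k friends until TTL expires."""
    results = peer.recommend_local(q_topics)
    if ttl > 0:
        best = sorted(peer.friends,
                      key=lambda f: cosine(q_topics, f.topics),
                      reverse=True)[:k]
        for f in best:
            results += process_query(f, q_topics, ttl - 1, k)
    return results

u5 = Peer("u5", {"t1": 1.0}, docs=[("d7", {"t1": 0.9})])
u1 = Peer("u1", {"t1": 0.7, "t2": 0.3}, docs=[("d1", {"t1": 0.8})], friends=[u5])
print(process_query(u1, {"t1": 1.0}, ttl=1))   # recommendations from u1 and u5
```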
Conclusions P2PRec • P2PRec (BDA 2010) • Finds friends (relevant users on similar topics) while gossiping • Query processing exploits relevant users wrt the query topics, recursively (FOAF friends) • Performance evaluation • Recall x precision x response times • Limitation of LDA: needs some centralization for training, but good to validate our general approach • However, there are other possibilities: • Ontology-based automatic annotation • This exists for biomedical documents
P2P Query Reformulation* • P2P Data Management System (PDMS) • Each peer has: • Its own schema (and data) • 1 or more mapping acquaintances to/from which at least 1 mapping rule exists • Goal: given a query, exploit mapping acquaintances as much as possible to enhance query responses • Figure: query ?= Hospital(x, "San Francisco") over two peers A and B, each with its own schema and data, connected by mapping Mb,a *Joint work with A. Bonifati, G. Summa, P. Valduriez, to appear as an Inria report
Concepts • Figure: query ?= Q, Hospital($X, "San Francisco"), posed at peer B is rewritten ALONG mapping Mb,a into ?= Q', HealthCareInst($X, "San Francisco", $Z), at peer A • Source schema (B): Hospital [0..*] (name, location), Grant [0..*] (amount, institution, manager), Doctor [0..*] (name, salary) • Target schema (A): HealthCareInst [0..*] (name, city, id), Grant [0..*] (amount, scientist) • Mapping rule Mb,a: Hospital(x, y) ⇢ HealthCareInst(x, y, z), where Hospital(x, y) is the BODY and HealthCareInst(x, y, z) is the HEAD; both sides are made of atoms
Mapping Relevance • Each time a query gets translated by exploiting a mapping, we get a relevant rewriting • The relevance can be forward (along) or backward (against), depending on the matched side of the mapping • Goal: • Collect as many rewritings as possible • Find the most interesting paths to take (avoid useless paths) • Example: ?= Hospital(x, "San Francisco") • M1 Hospital(x, y) ⇢ HealthCareInst(x, y, z) • M2 Institution(x, y, z) ⇢ Hospital(x, y)
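A tiny sketch of the forward/backward distinction (representing a mapping as lists of (label, arity) atoms is an assumption used only for illustration): a query atom that matches the body of a mapping gives a forward (along) rewriting, while one that matches the head gives a backward (against) rewriting.

```python
def match_side(query_atom, mapping):
    """Return 'forward' if the query atom's label appears in the body,
    'backward' if it appears in the head, else None."""
    label = query_atom[0]
    if any(a[0] == label for a in mapping["body"]):
        return "forward"
    if any(a[0] == label for a in mapping["head"]):
        return "backward"
    return None

M1 = {"body": [("Hospital", 2)], "head": [("HealthCareInst", 3)]}
M2 = {"body": [("Institution", 3)], "head": [("Hospital", 2)]}
print(match_side(("Hospital", 2), M1))   # forward (along)
print(match_side(("Hospital", 2), M2))   # backward (against)
```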
Problem • Figure: a query ?= Q can be reformulated AGAINST mapping Mc,a (yielding ?= Q') or ALONG mapping Mb,a, and further ALONG Mb,d, Mb,z, ... across peers A, B, C, D, Z, L, H, yielding further rewritings (?= Q'', ...) • 1) How to choose the most relevant paths to undertake in the reformulation task? • 2) Are there other peers in the network which can be contacted?
Acquaintances • Gossiping acquaintances • Potential friends that dynamically appear in the local semantic view (LSV) • Mapping acquaintance • There is at least 1 direct mapping towards it (friend) • Established manually • Social acquaintance (FOAF friend) • No direct mapping is needed towards it • There are some common interests • Established explicitly
Our Approach • Gossip to disseminate mapping rule information and to find friends • Users' topics of interest • are expressed according to the schema information or the topics of past queries • Measures to • Compute the relevance of a mapping wrt a query • Compute the similarity between users • Exploits recursively (to translate a query) • Mapping acquaintances • Social acquaintances
Social Acquaintances • Friend • Shares common topics of interest • Interests • Formulated by queries • Elements of the peer's schema • Approach: use the semantic view to discover friends • Example queries over a peer's schema: • ?= Hospital(x, "San Francisco") • ?= State(y, z, "California") • ?= Doctor(w, k) • ?= Pathology("heart", x) • ...
Compute Relevance • Goal: given a query and a mapping rule, determine whether the mapping is relevant to the query • Method (standard match semantics): • Atom label matching • Parameter compatibility • Example: ?= Hospital(x, "San Francisco") • M1 Hospital(x, y) AND State(x, z) ⇢ HealthCareInst(x, y, z) • M2 Hospital(x, y, w) AND State(x, z) ⇢ HealthCareInst(x, y, z) • M3 Ospedale(x, y) AND State(x, z) ⇢ HealthCareInst(x, y, z)
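Under the standard match semantics above, a minimal check might look like this (representing an atom as a (label, arity) pair and reducing "parameter compatibility" to arity equality is a simplifying assumption):

```python
def atoms_match(query_atom, mapping_atom):
    """Atom label matching plus a simple parameter-compatibility check (same arity)."""
    q_label, q_arity = query_atom
    m_label, m_arity = mapping_atom
    return q_label == m_label and q_arity == m_arity

query = ("Hospital", 2)                      # Hospital(x, "San Francisco")
print(atoms_match(query, ("Hospital", 2)))   # M1: True
print(atoms_match(query, ("Hospital", 3)))   # M2: False, incompatible parameters
print(atoms_match(query, ("Ospedale", 2)))   # M3: False, label mismatch
```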
Compute Relevance • AF-IMF measure, inspired by TF-IDF* • AF (Atom Frequency) • Local measure, establishing the importance of the query atom in the current mapping • IMF (Inverse Mapping Frequency) • Distributed measure, establishing the overall importance of the query atom • The relevance of a mapping wrt q is AF * IMF *term frequency-inverse document frequency
Compute Relevance (AF) • About the applied measure • To increase the effectiveness of the measure we distinguish, again, forward/backward relevance • Forward measure: computed over the body; backward measure: computed over the head • Example: ?= Hospital(x, "San Francisco") • M1 Hospital(x, y) AND State(x, z) ⇢ HealthCareInst(x, y, z): AF = 1/2 (forward, over the body) • M2 Institution(x, y, z) ⇢ Hospital(x, y): AF = 1 (backward, over the head)
Compute Relevance (IMF) • IMF requires a way to estimate • The total number of mappings • The total number of mappings containing that atom • To do that, we can inspect the semantic view of the peer • Also by sending inquiries to peers in the FOAF file
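By analogy with TF-IDF, a plausible reading of AF-IMF is sketched below; the slides do not give the exact formulas, so the definitions here (AF as the atom's frequency within the matched side, IMF as a log-scaled inverse frequency over the mappings visible to the peer) are assumptions:

```python
import math

def af(query_label, side_atoms):
    """Atom frequency of the query atom within one side (body or head) of a mapping."""
    if not side_atoms:
        return 0.0
    return sum(1 for a in side_atoms if a == query_label) / len(side_atoms)

def imf(query_label, known_mappings):
    """Inverse mapping frequency over the mappings visible to the peer
    (its local semantic view, possibly enriched by inquiries to FOAF friends)."""
    containing = sum(1 for m in known_mappings
                     if query_label in m["body"] + m["head"])
    if containing == 0:
        return 0.0
    return math.log(len(known_mappings) / containing)

def relevance(query_label, mapping, known_mappings, side="body"):
    return af(query_label, mapping[side]) * imf(query_label, known_mappings)

M1 = {"body": ["Hospital", "State"], "head": ["HealthCareInst"]}
M2 = {"body": ["Institution"], "head": ["Hospital"]}
M3 = {"body": ["Institution"], "head": ["ResearchCenter"]}   # hypothetical extra mapping
mappings = [M1, M2, M3]

print(af("Hospital", M1["body"]))                    # 0.5  (forward AF, as on the AF slide)
print(af("Hospital", M2["head"]))                    # 1.0  (backward AF)
print(relevance("Hospital", M1, mappings, "body"))   # 0.5 * log(3/2) ≈ 0.20
```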
Translate-Query • Compute the relevance of local mappings wrt Q • Choose the top-k mappings • Apply the translation semantics, along/against the mapping direction • Trigger Translate-Query on the mapping acquaintance, recursively (until TTL) • Select FOAF friends to be contacted • By looking at the best mapping summaries wrt Q • Trigger Translate-Query on the social acquaintance, recursively (until TTL)
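A self-contained skeleton of the Translate-Query loop above (the classes, the placeholder relevance and rewrite functions, and the duplicate-suppression via a "seen" set are all illustrative assumptions, not the authors' code):

```python
from dataclasses import dataclass, field

@dataclass
class Mapping:
    body: str                # predicate on the local side, e.g. "Hospital"
    head: str                # predicate on the other side, e.g. "HealthCareInst"
    acquaintance: "QPeer"    # mapping acquaintance owning the other schema

@dataclass
class QPeer:
    name: str
    mappings: list = field(default_factory=list)
    foaf: list = field(default_factory=list)

    def relevance(self, query, m):
        # stand-in for the AF-IMF measure of the previous slides
        return 1.0 if query in (m.body, m.head) else 0.0

    def rewrite(self, query, m):
        # translate along (body -> head) or against (head -> body) the mapping
        return m.head if query == m.body else m.body

def translate_query(peer, query, ttl, k=2, seen=None):
    """Rank local mappings, rewrite the query, and recurse on mapping and FOAF acquaintances."""
    seen = set() if seen is None else seen
    if ttl == 0 or (peer.name, query) in seen:
        return []
    seen.add((peer.name, query))
    answers = [(peer.name, query)]          # rewritings collected so far
    ranked = sorted(peer.mappings, key=lambda m: peer.relevance(query, m),
                    reverse=True)[:k]
    for m in ranked:
        if peer.relevance(query, m) > 0:
            answers += translate_query(m.acquaintance, peer.rewrite(query, m),
                                       ttl - 1, k, seen)
    for friend in peer.foaf[:k]:            # social acquaintances chosen from summaries
        answers += translate_query(friend, query, ttl - 1, k, seen)
    return answers

b = QPeer("B")
a = QPeer("A", mappings=[Mapping("Hospital", "HealthCareInst", b)])
print(translate_query(a, "Hospital", ttl=2))
# [('A', 'Hospital'), ('B', 'HealthCareInst')]
```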
Performance Evaluation • Baseline • No gossiping, original query propagated • Baseline+ • No gossiping, translated query propagated • Baseline# • No gossiping, translated query propagated, local measure to sort mappings (by using AF only) • Full- • Gossiping, translated query propagated, AF-IMF measure to sort mappings, no FOAF links (only local mappings) • Full (P2PRec) • Gossiping, translated query propagated, AF-IMF measure to sort mappings, FOAF links exploited Effectiveness of AF-IMF, LSV and gossiping
Conclusions • P2P Query Reformulation • Gossiping is used to disseminate mapping rule information • Exploits relevant mappings recursively • Mapping acquaintances • Social acquaintances • Initial performance results: • Very good recall results (over 90%) • Linear scale-up • Trade-off between recall and response times • Previous work uses • DHTs or a centralized mediation model
About Montpellier Best quality of life in France Important laboratories (LIRMM) and research institutes (INRA, CIRAD, etc.) The University of Montpellier is part of the « opération campus » Soon we will have a direct TGV line to Barcelona (1 hour)