MAPPING DATA IN PEER-TO-PEER SYSTEMS: SEMANTICS AND ALGORITHMIC ISSUES
Anastasios Kementsietsidis & Marcelo Arenas & Renee J. Miller
Department of Computer Science, University of Toronto
Presented by Ahmet OLGUN & Suzan BAYHAN
OUTLINE
1-ABSTRACT
2-INTRODUCTION
3-MOTIVATING EXAMPLE
4-MAPPING TABLES
5-MAPPING AS CONSTRAINTS
6-CONSISTENCY AND INFERENCE
7-THE ALGORITHM
8-EXPERIMENTAL RESULTS
9-CONCLUSIONS
ABSTRACT
• PROBLEM OF MAPPING DATA IN PEER-TO-PEER DATA SHARING SYSTEMS (PPDSS)
• MAPPING TABLES LISTING CORRESPONDING VALUES IN A PPDSS
• WHY TABLES ARE APPROPRIATE
• A LANGUAGE TO SPECIFY MAPPING TABLES UNDER DIFFERENT SEMANTICS
• COMPLEXITY OF THE PROBLEM
• AN EFFICIENT ALGORITHM FOR ITS SOLUTION
• IMPLEMENTATION WITH EXPERIMENTAL RESULTS
• HYPERION PROJECT
INTRODUCTION
• Traditionally, data integration and exchange between heterogeneous data sources is provided mainly through the use of views, i.e., queries
• Sources share their schemas and cooperate
• BUT IN OUR WORK SUCH CLOSE COOPERATION IS
  • Not desirable (PRIVACY)
  • Not feasible (maybe due to resource limitations)
SIMILARITY WITH FILE-SHARING SYSTEMS
• TO FIND DATA WHEN THERE IS NO AGREEMENT ON THE LOGICAL DESIGN OF DATA, FOCUS ON VALUES AND HOW THEY CORRESPOND
• IN FILE-SHARING SYSTEMS LIKE NAPSTER AND GNUTELLA, QUERYING IS DONE BY SIMPLE VALUE SEARCH ON FILE NAMES
• QUERIES ARE OF THE FORM: “RETRIEVE ALL FILES NAMED X”
• EASY BECAUSE THERE IS A CONSENSUS ON NAMES
WHAT IF THERE IS NO ACCEPTED NAMING STANDARD?
• Each peer has to develop its own naming standard
• Conforming to external standards is time-consuming and expensive
• So, to search data in such environments, use MAPPING TABLES that store the correspondence between values
• At their simplest, mapping tables are binary tables relating identifiers from two different sources
• Mapping tables represent EXPERT KNOWLEDGE
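As a minimal sketch of the simplest case on this slide, a binary mapping table can be held as a set of identifier pairs. The identifiers below are invented placeholders, and `lookup` is a hypothetical helper for illustration, not part of the Hyperion project.

```python
# A binary mapping table: pairs (identifier in source 1, identifier in source 2).
# All identifiers are invented placeholders.
mapping_table = {
    ("g1", "p1"),
    ("g1", "p2"),  # one value may correspond to several values
    ("g2", "p3"),
}


def lookup(table, x):
    """Return every value that the table associates with x."""
    return {y for (a, y) in table if a == x}


print(sorted(lookup(mapping_table, "g1")))
```

A set of pairs rather than a dictionary is used here because a value may correspond to more than one value in the other source.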
MOTIVATING EXAMPLE
• DOMAIN: BIOLOGICAL DATABASES
  * GENE DATABASE: GDB
  * PROTEIN DATABASE: SwissProt
  * GENETIC DISORDERS AND RELATED GENES DATABASE: MIM
EXAMPLE (CONTD)
• Integration of these resources is extremely desirable so that scientists have uniform access, BUT IT SEEMS UNATTAINABLE for political, financial and technical reasons
• Among the technical reasons is the heterogeneity of the sources, which include formatted files, spreadsheets and relational databases
MAIN CHARACTERISTICS AND USE OF MAPPING TABLES
• Associations within and across domains
• Peer autonomy
• Semantics
• Automated discovery of mappings
Associations within and Across Domains
• A mapping table is not necessarily a function
• With mapping tables we can associate seemingly unconnected databases
• Disjoint worlds can be associated because the corresponding worlds are semantically close to each other
Peer Autonomy
• Autonomy is highly important in peer-to-peer systems
• Mapping tables do not restrict the operation of peers in any way beyond the agreement on values expressed in the tables
Mapping Table 1
[Figure 1]
Semantics
• Experts have varying degrees of expertise, so mapping tables should indicate the confidence level of their contents
• Consider a tuple (X, Y)
• If an X value appearing in a mapping table follows open-world semantics, it can be associated with any Y value: the table holds only partial information about X
Closed World
• If X follows closed-world semantics, the values in the table can be associated only with the specified Y values
• 4 alternatives
  1-OO (no specific information, no practical interest)
  2-OC (partial knowledge)
  3-CO (partial knowledge)
  4-CC (complete knowledge)
Open/Closed World
[Table 1: Alternative open/closed-world semantics]
Automated Discovery
• Given a semantics for mapping tables, reason about them by treating mapping tables as constraints on the exchange of information
• Simplest way to combine tables: CONJUNCTION
MAPPING TABLES
• A, B, C, D: individual attributes
• dom(A): domain of A, such as integers or characters
• U, X, Y: sets of attributes
• R: a relational schema
• R[U]: attributes of a schema
• r: a relation instance
• t: a tuple
MAPPING TABLES (contd)
• t[X]: the values of tuple t in the attributes of X
• If X = {A1, A2, ..., Ak}, then dom(X) = dom(A1) × dom(A2) × ... × dom(Ak)
• To represent the different semantics of mapping tables, it is necessary to introduce variables
• V: a set of variables, where V ∩ dom(A) = ∅ for each attribute A
DEFINITION 1
• Given a set of attributes U, t is a mapping over U if, for each A ∈ U, t[A] is either a constant in dom(A), a variable in V, or an expression of the form v − S, where v ∈ V and S is a finite subset of dom(A)
DEFINITION 2
• Let X and Y be nonempty, disjoint sets of attributes. A mapping table m from X to Y is a finite set of mappings over X ∪ Y such that each variable appears in at most one mapping
DEFINITION 2 (contd)
• Set of mappings → “mapping table”
• Table → relation containing variables
• RESTRICTION: each variable appears in at most one mapping
• TWO DIFFERENT MAPPINGS ARE COMPLETELY INDEPENDENT
DEFINITION 3
• A valuation ρ over a mapping table m is a function that maps each constant value in m to itself and each variable v of m to a value in the intersection of the domains of the attributes where v appears
• Furthermore, if v appears in an expression of the form v − S, then ρ(v) is not an element of S
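Definitions 1 to 3 can be sketched in Python as follows. This is only an illustration under stated assumptions: the class `Var`, the functions `matches` and `table_matches`, and all attribute names and values are hypothetical, and every value is assumed to lie in the domain of its attribute, so domain membership is not checked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """A variable v, or an expression v - S when `excluded` is non-empty."""
    name: str
    excluded: frozenset = frozenset()


def matches(mapping, row):
    """True if some valuation rho turns `mapping` into the concrete `row`.

    Both arguments are dicts keyed by attribute name. Cells of `mapping`
    are either constants or Var objects.
    """
    binding = {}  # variable name -> value chosen by rho
    for attr, cell in mapping.items():
        value = row[attr]
        if isinstance(cell, Var):
            if value in cell.excluded:  # rho(v) must not be an element of S
                return False
            if binding.setdefault(cell.name, value) != value:
                return False  # one variable takes one value everywhere
        elif cell != value:  # rho maps each constant to itself
            return False
    return True


def table_matches(table, row):
    """True if `row` is produced by some mapping of the table."""
    return any(matches(t, row) for t in table)


# Hypothetical table from X = {A} to Y = {B}; each variable is used in one mapping only.
m = [
    {"A": "a1", "B": "b1"},
    {"A": Var("v", frozenset({"a1"})), "B": Var("w")},
]
print(table_matches(m, {"A": "a1", "B": "b1"}))
print(table_matches(m, {"A": "a1", "B": "b2"}))
print(table_matches(m, {"A": "a2", "B": "b9"}))
```

In this example the second mapping uses v − {a1}, so the value a1 is tied to b1 only, while any other value of A may be paired with any value of B.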
MAPPING AS CONSTRAINTS
• View mapping tables as constraints on the exchange of information between sources
• Given a set of mapping constraints, we can infer new mapping constraints and check the consistency of the constraints
CONSISTENCY & INFERENCE
• Infer new mapping tables: combine the knowledge from the mapping tables available in a network of peers
• Determine the consistency of mapping tables: automated inference and consistency checks help a curator see whether the semantics are valid
Problem Definition
• Given a mapping constraint formula (MCF) Φ over a set of attributes U, Φ is consistent if there exists a nonempty relation r over U satisfying Φ
• The inference problem is the problem of verifying whether a set of MCFs implies another MCF
Theorems
• Theorem: The consistency problem for conjunctions of mapping constraints is NP-complete.
• Theorem: If the length of the paths or the number of mapping constraints is fixed, then the consistency problem for conjunctions of mapping constraints is NP-complete.
Assumptions
• Assumptions made to solve the consistency problem:
  • The number of mapping constraints per peer is small
  • The length of paths is small
• For example, in Gnutella paths have a maximum length of 7
THE ALGORITHM
• θ = P1, P2, ..., Pn: a path of peers
• Ui: the set of attributes at peer Pi
• Σ: the set of constraints over path θ
• μ: X → Y: a mapping constraint with mapping table m
• ext(μ) = {ρ(t) | t ∈ m and ρ is a valuation over m}
THE ALGORITHM (contd)
With μ: X → Y a cover of Σ:
1- Σ is consistent iff there exists t ∈ ext(μ)
2- For μ′: X → Y, Σ ⊨ μ′ iff ext(μ) ⊆ ext(μ′)
• For inference: check 2 to decide whether Σ ⊨ μ′
• For consistency: check 1
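The two checks on this slide can be sketched as set operations once ext is enumerated. This is an illustration only, under loud assumptions: μ is taken to be a cover of Σ, the domains are finite and tiny so that ext can be listed explicitly, and the only kind of variable supported is a fresh variable with nothing excluded, written with the hypothetical marker `ANY`. None of the names below come from the paper.

```python
from itertools import product

ANY = object()  # hypothetical marker: a fresh variable with no excluded values


def ext(table, domains):
    """Enumerate the extension of a table over small finite domains.

    `table` is a list of tuples whose cells are constants or ANY;
    `domains` gives the finite domain of each column, in order.
    """
    rows = set()
    for t in table:
        # A constant contributes itself, a variable contributes its whole domain.
        choices = [dom if cell is ANY else [cell] for cell, dom in zip(t, domains)]
        rows.update(product(*choices))
    return rows


def is_consistent(cover_ext):
    """Check 1: consistent iff the extension of the cover is nonempty."""
    return bool(cover_ext)


def implies(cover_ext, candidate_ext):
    """Check 2: the constraints imply the candidate iff ext(cover) is a subset."""
    return cover_ext <= candidate_ext


domains = [["a1", "a2"], ["b1", "b2"]]
cover = [("a1", "b1")]
candidate = [("a1", ANY)]
print(implies(ext(cover, domains), ext(candidate, domains)))
```

Explicit enumeration is used here only to keep the example short; with large or infinite domains it would not be practical.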
Algorithm for computing the cover
• P1 sends all its mapping constraints to P2
• P2 uses those constraints together with its own to create a cover between P1 and P3
• P2 forwards the cover to P3
• P3 does the same to create a cover between P1 and P4
• P3 sends the computed cover back to P1
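The forwarding steps above can be sketched for a heavily simplified case. The assumptions are mine, not the slides': each pair of adjacent peers shares one binary table containing constants only, so combining two tables reduces to a relational join on the shared attribute followed by a projection. The function names and all values are hypothetical.

```python
def compose(left, right):
    """Join two binary tables on the shared middle attribute and project it away."""
    return {(x, z) for (x, y) in left for (y2, z) in right if y == y2}


def forward_cover(tables):
    """Fold the tables along the path P1, P2, ..., Pn from left to right.

    tables[i] relates values of peer P(i+1) to values of peer P(i+2).
    """
    cover = tables[0]
    for table in tables[1:]:
        cover = compose(cover, table)  # each peer extends the cover by one hop
    return cover


t12 = {("a", "b"), ("a2", "b2")}  # between P1 and P2
t23 = {("b", "c")}                # between P2 and P3
t34 = {("c", "d")}                # between P3 and P4
print(forward_cover([t12, t23, t34]))
```

Note that the result only becomes available after the last composition, which is the waiting problem the next slide points out.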
Problems
• Unnecessary computation: the cover involving A6 can be computed locally
• Does not work in a streaming fashion: P1 has to wait for the whole computation to finish before it gets the cover between itself and P4
• So?
Partitions
[Figure: partitions π1 to π9 at peers P1, P2 and P3]
Description of the Algorithm
• Two phases:
  • Information gathering
  • Computation
Information Gathering
• P1 sends P2 the set of attributes in each partition, BUT NO MAPPINGS
• P2 computes the inferred partitions
• Inferred partitions reveal the interdependencies, or lack thereof, between partitions
• The computation phase follows
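One way to picture the inferred partitions is as the result of merging attribute sets that share an attribute. The slide does not give the exact construction, so this is an assumed reading: two partitions are treated as interdependent when they have at least one attribute in common. The function name and the attribute sets are hypothetical.

```python
def inferred_partitions(partitions):
    """Merge attribute sets that share at least one attribute."""
    merged = []
    for attrs in partitions:
        attrs = set(attrs)
        # Absorb every group built so far that overlaps the incoming set.
        overlapping = [group for group in merged if group & attrs]
        for group in overlapping:
            attrs |= group
            merged.remove(group)
        merged.append(attrs)
    return merged


# Attribute sets only, no mappings, as in the information gathering phase.
from_p1 = [{"A1", "A2"}, {"A3"}]
at_p2 = [{"A2", "A4"}, {"A5", "A6"}]
result = inferred_partitions(from_p1 + at_p2)
print(sorted(sorted(group) for group in result))
```

Groups that stay separate, such as the one holding A5 and A6 in this example, have no dependency on the other peer and could be handled locally.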
Inferred Partitions
[Figure: inferred partitions at peers P1 and P2]
Computation Phase
• The computation starts at the penultimate peer
• The cover between P3 and P4 is computed and sent to P2
• The cover between P2 and P4 is computed and streamed to P1
• The cover between P1 and P4 is computed
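The right-to-left order and the streaming behaviour can be sketched with Python generators. As before, this assumes constants-only binary tables between adjacent peers, which is a simplification of mine; `stream_compose`, `backward_cover` and the values are hypothetical names for illustration.

```python
def stream_compose(left, right_stream):
    """Yield pairs of the composed table as pairs of the right-hand cover arrive."""
    by_key = {}
    for x, y in left:  # index the local table on the shared attribute
        by_key.setdefault(y, []).append(x)
    seen = set()
    for y, z in right_stream:
        for x in by_key.get(y, []):
            if (x, z) not in seen:  # emit each result pair once
                seen.add((x, z))
                yield (x, z)


def backward_cover(tables):
    """Start at the penultimate peer and stream covers back toward P1."""
    stream = iter(tables[-1])  # cover between P(n-1) and Pn
    for table in reversed(tables[:-1]):
        stream = stream_compose(table, stream)
    return stream


t12 = {("a", "b"), ("a2", "b2")}  # between P1 and P2
t23 = {("b", "c")}                # between P2 and P3
t34 = {("c", "d"), ("c", "d2")}   # between P3 and P4
for pair in backward_cover([t12, t23, t34]):
    print(pair)  # pairs reach P1 one at a time
```

Because each stage is a generator, P1 can start consuming pairs of the cover between itself and P4 before the upstream peers have finished.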
EXPERIMENTAL RESULTS
• Do our solutions provide added value for communities that already use mapping tables extensively?
• Are the characteristics of our algorithm appropriate and effective in a peer-to-peer environment?
Implementation
• Geographically distributed machines with one peer per machine
• Each peer has 2 modules:
  • The first module interacts with the storage manager to retrieve mappings and compute covers
  • The second module implements the peer-to-peer networking protocol
Implementation (contd)
• Each peer decides how much cache to use
• Biology domain: 6 biological databases used (GDB, MIM, SwissProt, Hugo, Locus, Unigene)
• Table sizes range from 7000 to 28000 mappings, with an average of 13000
• B2B domain: business-to-business setting
Results
• Cache sizes from 64 to 128 mappings give the best running times for data with these characteristics
• B2B: complex semantics for tables, but new mappings are still computed efficiently
• Total execution time scales linearly with the number of computed mappings
CONCLUSION
• Problem of managing collections of mapping tables
• Alternative semantics for tables
• A language that allows specification of mapping tables under different semantics
• Complexity of inference and consistency
• An algorithm to solve the problem
ANY QUESTIONS? THANK YOU...