OUTLINE

MAPPING DATA IN PEER-TO-PEER SYSTEMS: SEMANTICS AND ALGORITHMIC ISSUES • Anastasios Kementsietsidis & Marcelo Arenas & Renee J. Miller • Department of Computer Science, University of Toronto • Presented by Ahmet OLGUN & Suzan BAYHAN

OUTLINE 1-ABSTRACT 2-INTRODUCTION 3-MOTIVATING EXAMPLE 4-MAPPING TABLES 5-MAPPING AS CONSTRAINTS 6-CONSISTENCY AND INFERENCE 7-THE ALGORITHM 8-EXPERIMENTAL RESULTS 9-CONCLUSIONS

ABSTRACT • PROBLEM OF MAPPING DATA IN PEER-TO-PEER DATA SHARING SYSTEMS (PPDSS) • MAPPING TABLES LISTING CORRESPONDING VALUES IN A PPDSS • WHY TABLES ARE APPROPRIATE • A LANGUAGE TO SPECIFY MAPPING TABLES UNDER DIFFERENT SEMANTICS • COMPLEXITY OF THE PROBLEM • AN EFFICIENT ALGORITHM FOR ITS SOLUTION • IMPLEMENTATION WITH EXPERIMENTAL RESULTS • HYPERION PROJECT

INTRODUCTION • Traditionally, data integration and exchange between heterogeneous data sources is provided mainly through the use of views, i.e., queries • Sources share their schemas and cooperate • BUT IN OUR WORK SUCH CLOSE COOPERATION IS • Not desirable (PRIVACY) • Not feasible (maybe due to resource limitations)

SIMILARITY WITH FILE-SHARING SYSTEMS • TO FIND DATA WHEN THERE IS NO AGREEMENT ON THE LOGICAL DESIGN OF DATA, FOCUS ON VALUES AND HOW THEY CORRESPOND • IN FILE-SHARING SYSTEMS LIKE NAPSTER AND GNUTELLA, QUERYING IS A SIMPLE VALUE SEARCH ON FILE NAMES • QUERIES ARE OF THE FORM: “RETRIEVE ALL FILES NAMED X” • EASY BECAUSE THERE IS A CONSENSUS ON NAMES

WHAT IF THERE IS NO ACCEPTED NAMING STANDARD? • Each peer has to develop its own naming standard • Conforming to external standards is time-consuming and expensive • So, to search data in such environments, use MAPPING TABLES that store the correspondence between values • At their simplest, mapping tables are binary tables that associate identifiers from two different sources • Mapping tables represent EXPERT KNOWLEDGE

MOTIVATING EXAMPLE • DOMAIN: BIOLOGICAL DATABASES * GENE DATABASE: GDB * PROTEIN DATABASE: SwissProt * GENETIC DISORDERS AND RELATED GENES DATABASE: MIM

EXAMPLE (CONTD) • Integration of these resources is extremely desirable so that scientists have uniform access, BUT SEEMS UNATTAINABLE for political, financial and technical reasons • Among the technical reasons: heterogeneity of the sources, such as formatted files, spreadsheets and relational databases

MAIN CHARACTERISTICS AND USE OF MAPPING TABLES • Associations within and Across Domains • Peer Autonomy • Semantics • Automated discovery of mappings

Association within and Across Domains • A mapping table is not necessarily a function • With mapping tables we can associate seemingly unconnected databases • Disjoint worlds can be associated since the corresponding worlds are semantically close to each other

Peer Autonomy • Autonomy has high importance in peer-to-peer systems. • Mapping tables do not restrict the operation of peers in any way beyond the agreement on values expressed in the tables.

Mapping Table 1 Figure 1

Semantics • Experts have varying degrees of expertise, so mapping tables should show the confidence level of their contents • Consider a tuple (X, Y) • If an X value appearing in a mapping table follows open-world semantics, then it can be associated with any Y value: the table holds only partial information about X

Closed World • If X follows closed-world semantics, then the X values in the table can only be associated with the specified Y values • 4 alternatives: 1-OO (no specific information, no practical interest) 2-OC (partial knowledge) 3-CO (partial knowledge) 4-CC (complete knowledge)

Open/Closed World • Table 1: Alternative open/closed world semantics

Automated Discovery • Given a semantics for mapping tables, we can reason about them by treating mapping tables as constraints on the exchange of information • Simplest way to combine tables: CONJUNCTION

Example Mapping Tables

MAPPING TABLES • A, B, C, D: individual attributes • dom(A): domain of A, e.g. integers or characters • U, X, Y: sets of attributes • R: a relational schema • R[U]: the attributes U of a schema R • r: a relation instance • t: a tuple

MAPPING TABLES (contd) • t[X]: values of tuple t in the attributes of X • X = {A1, A2, ..., Ak} • dom(X) = dom(A1) × dom(A2) × ... × dom(Ak) • To represent the different semantics of mapping tables, it is necessary to introduce variables • V: a set of variables, where V ∩ dom(A) = ∅ for each attribute A

DEFINITION 1 • Given a set of attributes U, t is a mapping over U if for each A ∈ U, t[A] is either a constant in dom(A), a variable in V, or an expression of the form v − S, where v ∈ V and S is a finite subset of dom(A)

DEFINITION 2 • Let X and Y be nonempty disjoint sets of attributes. A mapping table m from X to Y is a finite set of mappings over X ∪ Y such that each variable appears in at most one mapping

DEFINITION 2 (contd) • Set of mappings → “mapping table” • Table → relations containing variables • RESTRICTION: each variable appears in at most one mapping • TWO DIFFERENT MAPPINGS ARE COMPLETELY INDEPENDENT

DEFINITION 3 • A valuation ρ over a mapping table m is a function that maps each constant value in m to itself and each variable v of m to a value in the intersection of the domains of the attributes where v appears. Furthermore, if v appears in an expression of the form v − S, then ρ(v) is not an element of S.
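Definitions 1 to 3 can be illustrated with a small Python sketch. It is not the authors' implementation: the `Var` class, the function names, and the toy values `g1`, `p1` are made up for illustration, and attribute domains are not modelled (every value is assumed to belong to every domain). The sketch checks whether some valuation ρ turns a mapping into a given ground tuple.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """A variable v; a non-empty `excluded` set S models the expression v - S."""
    name: str
    excluded: frozenset = frozenset()


def matches(mapping, row):
    """Return True if some valuation rho sends `mapping` to the ground tuple `row`.

    Both arguments are dicts keyed by attribute name (hypothetical layout).
    """
    rho = {}  # variable name -> value chosen by the valuation
    for attr, entry in mapping.items():
        value = row[attr]
        if isinstance(entry, Var):
            if value in entry.excluded:
                return False  # rho(v) must not be an element of S
            if rho.setdefault(entry.name, value) != value:
                return False  # a variable takes a single value
        elif entry != value:
            return False      # constants are mapped to themselves
    return True


def in_extension(table, row):
    """True if `row` is produced by some mapping of the table under some valuation."""
    return any(matches(m, row) for m in table)


# Toy table: g1 maps only to p1; any other X value may map to anything.
table = [
    {"X": "g1", "Y": "p1"},
    {"X": Var("v", frozenset({"g1"})), "Y": Var("w")},
]
```

With this table, `in_extension(table, {"X": "g1", "Y": "p2"})` is false, because `g1` is excluded from the variable row and the constant row only allows `p1`.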

MAPPING AS CONSTRAINTS • View mapping tables as constraints on the exchange of information between sources • Given a set of mapping constraints, we are able to infer new mapping constraints and check the consistency of the constraints

CONSISTENCY & INFERENCE • Infer new mapping tables: combine the knowledge from mapping tables available in a network of peers • Determine consistency of mapping tables: automated inference and consistency checks will help a curator to see whether semantics are valid

Problem Definition • Given a mapping constraint formula (MCF) Φ over a set of attributes U, Φ is consistent if there exists a nonempty relation r over U satisfying Φ • The inference problem is the problem of verifying whether a set of MCFs implies another MCF

Theorems • Theorem: The consistency problem for conjunctions of mapping constraints is NP-complete. • Theorem: If the length of the paths or number of mapping constraints is fixed then the consistency problem for the conjunctions of mapping constraints is NP-complete.

Assumptions • Assumptions made to solve the consistency problem: • The number of mapping constraints per peer is small • The length of paths is small • For example, in Gnutella paths have a maximum length of 7

THE ALGORITHM • θ = P1, P2, ..., Pn: a path of peers • Ui: set of attributes at peer Pi • Σ: set of constraints over path θ • μ: X → Y: a mapping constraint with mapping table m • ext(μ) = {ρ(t) | t ∈ m and ρ is a valuation over m}

THE ALGORITHM • 1- Σ is consistent iff there exists t ∈ ext(μ) • 2- For μ′: X → Y, Σ ⊨ μ′ iff ext(μ) ⊆ ext(μ′) • For inference: use check 2 to decide whether Σ ⊨ μ′ • For consistency: use check 1
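The two checks can be sketched in Python under strong simplifying assumptions: extensions are finite sets of ground tuples (no variables), and the inference test is read as set inclusion of extensions. The function names and the sample tuples are hypothetical.

```python
def is_consistent(ext_mu):
    """Check 1: the constraint set is consistent iff ext(mu) has at least one tuple."""
    return len(ext_mu) > 0


def implies(ext_mu, ext_mu_prime):
    """Check 2: mu' is implied iff every tuple of ext(mu) is also in ext(mu')."""
    return ext_mu <= ext_mu_prime


# Toy extensions over (X, Y) pairs.
ext_mu = {("g1", "p1"), ("g2", "p2")}
ext_mu_prime = {("g1", "p1"), ("g2", "p2"), ("g3", "p3")}
```

Here `implies(ext_mu, ext_mu_prime)` holds, while the converse does not, because `("g3", "p3")` is missing from `ext_mu`.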

Design Decisions • Path: P1, P2, P3, P4

Algorithm for computing the cover • P1 sends all its mapping constraints to P2 • P2 uses those constraints together with its own to create a cover between P1 and P3 • P2 forwards the cover to P3 • P3 does the same to create a cover between P1 and P4 • P3 sends the computed cover back to P1
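The forward pass described on this slide can be sketched as repeated composition of tables. This is a simplification, not the algorithm from the talk: it assumes one ground binary table per pair of neighbouring peers, no variables, and that a cover is obtained by joining on the shared middle value and projecting it out. All names are hypothetical.

```python
def compose(left, right):
    """Join two binary tables on the shared middle value and project it out."""
    by_key = {}
    for y, z in right:
        by_key.setdefault(y, set()).add(z)
    return {(x, z) for x, y in left for z in by_key.get(y, ())}


def naive_cover(tables):
    """tables[i] relates values at peer P(i+1) to values at peer P(i+2).

    The running cover always starts at P1 and is extended one peer at a time,
    mirroring the forwarding from P1 to P2 to P3.
    """
    cover = tables[0]
    for table in tables[1:]:
        cover = compose(cover, table)
    return cover


t12 = {("a", "b"), ("a2", "b2")}
t23 = {("b", "c"), ("b", "c2")}
t34 = {("c", "d")}
cover_p1_p4 = naive_cover([t12, t23, t34])
```

In this toy run only the chain a, b, c, d survives, so the cover between P1 and P4 contains the single pair `("a", "d")`.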

Problems • Unnecessary computation: the cover involving A6 can be computed locally • Does not work in a streaming fashion: P1 has to wait for the whole computation to finish to get the cover between itself and P4 • So?

Partitions • [Figure: partitions π1 to π9 of the mapping constraints at peers P1, P2 and P3]

Description of the Algorithm Two phases: • Information gathering • Computation

Information Gathering • P1 sends to P2 the set of attributes in each partition, BUT NO MAPPINGS • P2 computes the inferred partitions • Inferred partitions are used to discover interdependencies, or the lack thereof, between partitions • Then the computation phase starts
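One plausible reading of the inferred partitions is that attribute sets sharing at least one attribute are merged into a single group, so that independent groups can be processed separately. The sketch below follows that reading; it is an assumption, not the procedure from the talk, and the attribute names are invented.

```python
def inferred_partitions(attribute_sets):
    """Merge attribute sets that share at least one attribute.

    Only attribute names are exchanged here, no mappings, in the spirit of
    the information gathering phase. Returns a list of disjoint sets.
    """
    merged = []
    for attrs in attribute_sets:
        current = set(attrs)
        rest = []
        for group in merged:
            if group & current:
                current |= group    # overlapping groups become one group
            else:
                rest.append(group)  # untouched groups stay as they are
        rest.append(current)
        merged = rest
    return merged


# Attribute sets of the partitions at two neighbouring peers (hypothetical).
sets_p1 = [{"A1", "A2"}, {"A3"}, {"A6"}]
sets_p2 = [{"A2", "B1"}, {"A3", "B2"}]
groups = inferred_partitions(sets_p1 + sets_p2)
```

In this example `{"A6"}` stays on its own, which signals that nothing at the next peer depends on it.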

Inferred Partitions • [Figure: inferred partitions at peers P1 and P2]

Computation Phase • The computation starts at the penultimate peer • The cover between P3 and P4 is computed and sent to P2 • The cover between P2 and P4 is computed and streamed to P1 • The cover between P1 and P4 is computed
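The backward, streaming order of this phase can be sketched with Python generators. As before this is an illustration under assumptions: ground binary tables, one per pair of neighbouring peers, with hypothetical names. Each peer emits a tuple of its cover as soon as an incoming tuple joins with a local one, so the peer upstream does not wait for the full result.

```python
def stream_compose(left, right_stream):
    """Yield (x, z) as soon as an incoming (y, z) joins with a local (x, y)."""
    by_y = {}
    for x, y in left:
        by_y.setdefault(y, []).append(x)
    seen = set()  # avoid emitting the same cover tuple twice
    for y, z in right_stream:
        for x in by_y.get(y, ()):
            if (x, z) not in seen:
                seen.add((x, z))
                yield (x, z)


def backward_cover(tables):
    """Start from the last table (penultimate peer to last peer) and move left."""
    stream = iter(tables[-1])
    for table in reversed(tables[:-1]):
        stream = stream_compose(table, stream)
    return stream


t12 = [("a", "b")]
t23 = [("b", "c"), ("b", "c2")]
t34 = [("c", "d"), ("c2", "d")]
cover = list(backward_cover([t12, t23, t34]))
```

Because the result is a generator chain, P1 can consume the first pairs of its cover with P4 while the peers further down the path are still producing tuples.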

EXPERIMENTAL RESULTS • Do our solutions provide added value for communities that already use mapping tables extensively? • Are the characteristics of our algorithm appropriate and effective in a peer-to-peer environment?

Implementation • Geographically distributed machines with one peer per machine • Each peer has 2 modules: • The first module interacts with the storage manager to retrieve mappings and compute covers • The second module implements the peer-to-peer networking protocol

Implementation • Each peer decides how much cache to use • Biology domain: 6 biological databases used: GDB, MIM, SwissProt, Hugo, Locus, Unigene • Table sizes range from 7000 to 28000 mappings, with an average of 13000 • B2B domain: business-to-business setting

Results • Cache sizes from 64 to 128 mappings give the best running times for data with these characteristics • B2B: complex semantics for tables, but new mappings are still computed efficiently • Total execution time scales linearly with the number of computed mappings

CONCLUSION • Problem of managing collections of mapping tables • Alternative semantics for tables • A language that allows specification of mapping tables under different semantics • Complexity of inference and consistency • An algorithm to solve the problem

ANY QUESTIONS? THANK YOU...
