Today's talk covers aspects of distribution, scalability, and security of provenance data structures. Learn about logical and physical distribution patterns, querying processes, and applications' requirements. Understand the importance of links, views, and references in implementing distributed queries and supporting large data sets. Discover innovative solutions for handling distributed queries and managing large data volumes in provenance systems.
Overview of Today’s Talks • Provenance Data Structures • Recording and Querying Provenance • Break (30 minutes) • Distribution and Scalability • Security • Methodology
Distribution and Scalability by Paul Groth (pg03r@ecs.soton.ac.uk)
Applications require scalability • Applications may have millions of interactions • These interactions may be simultaneous • Applications may have large amounts of data
These Issues & Provenance • Because applications are distributed and need scalability, a provenance system must support these requirements • Provenance Systems have their own requirements in these areas • Large numbers of p-assertions. • Scalability in terms of querying and recording
Provenance Store Distribution Recording Patterns • Bandwidth • Access Control • Storage • Legal • Multiple physical Provenance Stores per site [Diagram: six Provenance Stores (PS)]
Logical Distribution • Provenance Stores are both physical and logical entities • Single physical store could have multiple logical stores • Logical Provenance Stores provide bounds to process documentation • Could be organisational, experimental, or individual
Logical Provenance Stores [Diagram: one Physical Provenance Store containing the logical stores Hospital Payroll Store, Paul’s Store, Donor Data Collector Store, and Surgery Ward Store]
Provenance Store Usage • Combinations of logical and physical Provenance Stores can be adopted depending on the application’s needs • In terms of: • Scalability • Regulatory / Legal • Information partitioning
Distributed Query? • Process Documentation is in multiple stores • How do we get the provenance of a data item in this case? • Solution: connections embedded in process documentation • Shared Context • Links
Shared Context revisited [Diagram: the Querier sends "Query PAs with IK 1" to both PS 1 and PS 2; each store holds a p-assertion with IK 1]
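The shared-context approach above can be sketched as a querier asking every store for p-assertions that carry the same interaction key. This is a minimal illustration, not the actual provenance store API: the `PAssertion` and `ProvenanceStore` classes, `query_by_interaction_key`, and `query_all` are all hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PAssertion:
    interaction_key: str  # the shared context recorded by both parties
    asserter: str
    content: str


class ProvenanceStore:
    """Hypothetical in-memory stand-in for a provenance store."""

    def __init__(self, name):
        self.name = name
        self._assertions = []

    def record(self, p_assertion):
        self._assertions.append(p_assertion)

    def query_by_interaction_key(self, key):
        return [pa for pa in self._assertions if pa.interaction_key == key]


def query_all(stores, key):
    """Send the same interaction-key query to every store and merge the results."""
    results = []
    for store in stores:
        results.extend(store.query_by_interaction_key(key))
    return results


ps1 = ProvenanceStore("PS 1")
ps2 = ProvenanceStore("PS 2")
ps1.record(PAssertion("IK 1", "sender", "sent request"))
ps2.record(PAssertion("IK 1", "receiver", "received request"))
ps2.record(PAssertion("IK 2", "receiver", "unrelated interaction"))

found = query_all([ps1, ps2], "IK 1")
```

The querier must already know which stores to ask, which is the limitation that links address.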
Links • Links are unidirectional pointers to provenance stores • Links connect provenance stores • Links are recorded by actors as part of p-assertions • Links are transferred between actors using interaction contexts • There are two kinds of links • View Links • Object Links
Views revisited • A view is the set of p-assertions made by one actor about one interaction • A view contains: • An actor identity • A set of p-assertions • A view is one of two view kinds: sender or receiver [Diagram: User Interface and Donor Data Collector]
View Links • A view link points to the provenance store containing the opposite view of the interaction • View Links are transferred in p-headers or interaction contexts [Diagram: Sender and Receiver inform each other of which store (PS 1 or PS 2) they use; each will record its p-assertions in its own store and records a link to the other party's store]
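A minimal sketch of the view-link exchange, assuming for illustration that the sender records in PS 1 and the receiver in PS 2 (the slide does not fix this assignment). The `View` class and `exchange_view_links` function are hypothetical names, not part of any real API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class View:
    actor: str
    view_kind: str                    # "sender" or "receiver"
    store: str                        # store where this view's p-assertions are recorded
    p_assertions: List[str] = field(default_factory=list)
    view_link: Optional[str] = None   # store holding the opposite view


def exchange_view_links(sender_view, receiver_view):
    """Each party learns the other's store and records a link to it."""
    sender_view.view_link = receiver_view.store
    receiver_view.view_link = sender_view.store


# Assumption: sender uses PS 1, receiver uses PS 2.
sender = View(actor="User Interface", view_kind="sender", store="PS 1")
receiver = View(actor="Donor Data Collector", view_kind="receiver", store="PS 2")
exchange_view_links(sender, receiver)
```

Because each link only points outward, neither store needs to know who links to it.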
Object Links • A pointer to the provenance store where the object of a relationship is stored • This allows for distributed provenance queries [Diagram: PS 1, PS 2, and PS 3]
Implementing Distributed Queries • Querying actor centric (thick client) • The querying actor follows links • Provenance Store centric (thin client) • Provenance Stores follow links
Querying Actor Centric [Diagram: the Querying Actor issues a query to PS 1, receives the result, processes the results for links, then issues a query to PS 2 and receives its result]
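The querying-actor-centric (thick client) approach can be sketched as a client-side traversal: query a store, inspect the results for links, and follow each unseen link. The store layout (`STORES`), `query_store`, and `client_side_query` below are illustrative assumptions, not a real provenance store interface.

```python
from collections import deque

# Hypothetical stores: each holds some p-assertions plus links to other stores.
STORES = {
    "PS 1": {"p_assertions": ["pa-1"], "links": ["PS 2"]},
    "PS 2": {"p_assertions": ["pa-2"], "links": ["PS 3", "PS 1"]},
    "PS 3": {"p_assertions": ["pa-3"], "links": []},
}


def query_store(name):
    """Stand-in for a remote query; returns (p-assertions, links)."""
    store = STORES[name]
    return store["p_assertions"], store["links"]


def client_side_query(start):
    """The querying actor follows links itself, visiting each store once."""
    seen = {start}
    pending = deque([start])
    collected = []
    while pending:
        name = pending.popleft()
        p_assertions, links = query_store(name)
        collected.extend(p_assertions)
        for target in links:          # process results for links
            if target not in seen:
                seen.add(target)
                pending.append(target)
    return collected


result = client_side_query("PS 1")
```

The `seen` set guards against link cycles, such as PS 2 linking back to PS 1 here.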
Provenance Store centric [Diagram: the Querying Actor issues a query to PS 1; PS 1 processes its internal results for links, issues queries to PS 2 and PS 3, receives their results, and collates them; the Querying Actor receives the collated results]
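In the provenance-store-centric (thin client) variant, the store itself follows links and collates the results, so the client issues a single query. A minimal sketch under assumed names (`ProvenanceStore`, `linked_stores`, and `query` are hypothetical):

```python
class ProvenanceStore:
    """Hypothetical store that follows its own links on behalf of the client."""

    def __init__(self, name, p_assertions):
        self.name = name
        self.p_assertions = list(p_assertions)
        self.linked_stores = []

    def query(self, visited=None):
        visited = set() if visited is None else visited
        if self.name in visited:       # already answered in this query
            return []
        visited.add(self.name)
        results = list(self.p_assertions)
        for store in self.linked_stores:
            # Issue the query to the linked store and collate its results.
            results.extend(store.query(visited))
        return results


ps1 = ProvenanceStore("PS 1", ["pa-1"])
ps2 = ProvenanceStore("PS 2", ["pa-2"])
ps3 = ProvenanceStore("PS 3", ["pa-3"])
ps1.linked_stores = [ps2, ps3]
ps3.linked_stores = [ps1]              # a cycle, to show the visited check

collated = ps1.query()                 # the client makes one call
```

The trade-off is that stores now carry the traversal cost and must be able to reach one another.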
Analysis of Links • Links are unidirectional like the Web • This approach should be fairly scalable • Maintain autonomy of application actors • There is no need for synchronization between actors • Like the web, queriers must traverse the link structure to find content of interest • Two mechanisms for implementing distributed queries using links.
Supporting Large Data • Depending on the size of the data involved, the provenance store may not: • Be able to store the data immediately • Solution: asynchronous recording • Be able to store the data at all • Solution: references • References to data, instead of the data itself • Support for three kinds • Application • Internal • External
Application References • The application already transfers references in its application messages • Nothing to do. Record p-assertions as is • Inform querying actors of how to resolve these application specific references http://datastore/pr#1234
External References • Application transfers a large message • Stores all or part of the message in some data repository • The p-assertion records a reference to this external data repository • Burden is placed on the data repository to maintain the data for as long as the process documentation is kept
External References cont. [Diagram: the Large Patient Record is stored in the Data Repository; the PS holds a p-assertion with DocStyle: Reference pointing to http://DataRepository/#LPR1]
External References cont.
<soap:envelope>
  <soap:header>…</soap:header>
  <soap:body>
    <echrs:store>
      <echrs:patientRecord>
        <pid>1</pid>
        <xray>j8ladfhaufjalkdjkfaslalkfdjaljfafjaljajfdlja adfhaldfjhaslfjdasldfjaslfj….</xray>
      </echrs:patientRecord>
    </echrs:store>
  </soap:body>
</soap:envelope>
Styled Reference P-Assertion
<ps:interactionPAssertion>
  <ps:localPAssertionId>1</ps:localPAssertionId>
  <ps:documentationStyle>http://www.pasoa.org/.../styles#Reference</ps:documentationStyle>
  <ps:content>
    <soap:envelope>
      <soap:header>…</soap:header>
      <soap:body>
        <echrs:store>
          <echrs:ref>http://DataRepository/#LPR1</echrs:ref>
        </echrs:store>
      </soap:body>
    </soap:envelope>
  </ps:content>
</ps:interactionPAssertion>
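The external-reference idea can be sketched as follows: when a payload is too large, it is placed in a data repository and the p-assertion carries only a reference, tagged with a reference documentation style. Everything here is an illustrative assumption: the `DataRepository` class, the `make_p_assertion` helper, the size threshold, and the "Verbatim" style label do not come from the slides.

```python
class DataRepository:
    """Hypothetical external repository that must keep data resolvable."""

    def __init__(self, base_url):
        self.base_url = base_url
        self._items = {}

    def put(self, key, data):
        self._items[key] = data
        return f"{self.base_url}#{key}"

    def resolve(self, reference):
        return self._items[reference.split("#", 1)[1]]


def make_p_assertion(local_id, payload, repository, max_inline_bytes=1024):
    """Inline small payloads; replace large ones with a styled reference."""
    if len(payload) <= max_inline_bytes:
        # "Verbatim" is an assumed label for an unmodified payload.
        return {"localPAssertionId": local_id,
                "documentationStyle": "Verbatim",
                "content": payload}
    reference = repository.put(f"LPR{local_id}", payload)
    return {"localPAssertionId": local_id,
            "documentationStyle": "Reference",
            "content": reference}


repo = DataRepository("http://DataRepository/")
large_record = b"x" * 5000            # stand-in for a large patient record
referenced = make_p_assertion(1, large_record, repo)
inlined = make_p_assertion(2, b"small message", repo)
```

A querier that sees the reference style resolves the content through the repository, which is why the repository has to retain the data for as long as the documentation is kept.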
Internal References • Same as External References • However, the reference is to data already stored inside the provenance store • This is made possible by the unique addressability of p-assertions • Useful for the case of large actor state p-assertions that are recorded several times • Example: System Configuration Information
Internal References cont. [Diagram: within the PS, one Actor State P-Assertion holds lots of configuration information; Actor State P-Assertions 1 to 4 reference it]
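A minimal sketch of internal references, assuming the store detects repeated actor-state content and records a pointer to the first copy instead of storing it again. The `ProvenanceStore` class, its method names, and the style labels are hypothetical; the slides only state that p-assertions are uniquely addressable.

```python
class ProvenanceStore:
    """Hypothetical store that keeps one copy of repeated actor-state content."""

    def __init__(self):
        self._by_id = {}
        self._first_id_for_content = {}

    def record_actor_state(self, p_assertion_id, content):
        existing_id = self._first_id_for_content.get(content)
        if existing_id is None:
            # First occurrence: store the content itself.
            self._first_id_for_content[content] = p_assertion_id
            self._by_id[p_assertion_id] = {"style": "Verbatim", "content": content}
        else:
            # Repeat: store only a reference to the earlier p-assertion.
            self._by_id[p_assertion_id] = {"style": "InternalReference",
                                           "content": existing_id}

    def stored(self, p_assertion_id):
        return self._by_id[p_assertion_id]

    def resolve(self, p_assertion_id):
        entry = self._by_id[p_assertion_id]
        if entry["style"] == "InternalReference":
            return self._by_id[entry["content"]]["content"]
        return entry["content"]


config = "lots of configuration information"
store = ProvenanceStore()
for p_assertion_id in (1, 2, 3, 4):
    store.record_actor_state(p_assertion_id, config)
```

Only p-assertion 1 holds the configuration text; the other three resolve to it on demand.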
Provenance Query Results Scalability • Provenance Query result sets are scalable • Return pointers to p-assertions not the assertions themselves
<ps:interactionPAssertion>
  <ps:localPAssertionId>1</ps:localPAssertionId>
  <ps:documentationStyle>http://www.pasoa.org/.../styles#Reference</ps:documentationStyle>
  <ps:content>
    <soap:envelope>
      <soap:header>…</soap:header>
      <soap:body>
        <echrs:store>
          <echrs:ref>http://DataRepository/#LPR1</echrs:ref>
        </echrs:store>
      </soap:body>
    </soap:envelope>
  </ps:content>
</ps:interactionPAssertion>
<ps:interactionPAssertion>
  <ps:localPAssertionId>2</ps:localPAssertionId>
  <ps:documentationStyle>http://www.pasoa.org/.../styles#Reference</ps:documentationStyle>
  …
<psdid>
  <interactionKey>
    <sender>donerdatacollector</sender>
    <receiver>echr</receiver>
    <id>12233</id>
  </interactionKey>
  <viewkind>sender</viewkind>
  <localPAssertionId>1</localPAssertionId>
</psdid>
<psdid>
  <interactionKey>
    <sender>donerdatacollector</sender>
    <receiver>echr</receiver>
    <id>1224</id>
  </interactionKey>
  <viewkind>sender</viewkind>
  <localPAssertionId>5</localPAssertionId>
</psdid>
<psdid>
Provenance Query Results Scalability • Provenance query results are scalable • Return pointers to p-assertions not the assertions themselves • Scoping means provenance query results are only what is necessary for the querier
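Returning pointers rather than p-assertions can be sketched like this: the query yields small identifiers modelled on the fields of the `psdid` element, and the querier fetches only the p-assertions it needs. The `PAssertionPointer` tuple, the `ProvenanceStore` class, and the `query`/`fetch` methods are hypothetical illustrations, not a real interface.

```python
from typing import NamedTuple


class PAssertionPointer(NamedTuple):
    """Fields mirror the psdid element: interaction key, view kind, local id."""
    sender: str
    receiver: str
    interaction_id: str
    view_kind: str
    local_p_assertion_id: int


class ProvenanceStore:
    def __init__(self):
        self._p_assertions = {}

    def record(self, pointer, content):
        self._p_assertions[pointer] = content

    def query(self, sender):
        """Return pointers only; the potentially large content stays in the store."""
        return [p for p in self._p_assertions if p.sender == sender]

    def fetch(self, pointer):
        return self._p_assertions[pointer]


store = ProvenanceStore()
first = PAssertionPointer("donerdatacollector", "echr", "12233", "sender", 1)
second = PAssertionPointer("donerdatacollector", "echr", "1224", "sender", 5)
store.record(first, "<large SOAP envelope 1>")
store.record(second, "<large SOAP envelope 2>")

pointers = store.query(sender="donerdatacollector")
wanted = store.fetch(pointers[0])      # dereference only what is needed
```

The result set grows with the number of matches, not with the size of each p-assertion.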
Iterative Query Results • Return iterators over results from process documentation or provenance query results [Diagram: the Querying Actor issues a query to the PS and receives a Results Iterator offering getNextRes() and getNextXRes(int x)]
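A minimal sketch of a results iterator with the two operations named on the slide, rendered here as `get_next_res` and `get_next_x_res`. The class is a hypothetical local stand-in and says nothing about how the planned OGSA-DAI or WSRF based implementation would work.

```python
from itertools import islice


class ResultsIterator:
    """Hands out query results one at a time or in batches of up to x."""

    def __init__(self, results):
        self._results = iter(results)

    def get_next_res(self):
        """Return the next result, or None when the results are exhausted."""
        return next(self._results, None)

    def get_next_x_res(self, x):
        """Return up to x further results (fewer if the results run out)."""
        return list(islice(self._results, x))


iterator = ResultsIterator(f"pa-{i}" for i in range(1, 6))
first = iterator.get_next_res()
batch = iterator.get_next_x_res(3)
rest = iterator.get_next_x_res(3)
done = iterator.get_next_res()
```

Because results are pulled lazily, the querier never has to hold the full result set in memory.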
Iterative Query Results • Return iterators over results from process documentation or provenance query results • This functionality is planned for future implementations • The planned implementation makes use of • OGSA-DAI • WSRF
Summary • Discussed both Distribution and Scalability • Introduced links for connecting distributed provenance stores • Two ways of implementing distributed queries • Large data support through asynchronous recording and references • Query Scalability • Provenance Query Results • Iterative Query Results
Questions? Paul Groth pg03r@ecs.soton.ac.uk