240 likes | 378 Views
Enabling Privacy in Provenance-Aware Workflow Systems. Susan B. Davidson 1 Joint work with Sanjeev Khanna , Sudeepa Roy, Julia Stoyaonovich , Val Tannen 1 Yi Chen 2 Tova Milo 3 1 University of Pennsylvania 2 Arizona State University 3 Tel Aviv University.
E N D
Enabling Privacy in Provenance-Aware Workflow Systems Susan B. Davidson1 Joint work with SanjeevKhanna,Sudeepa Roy, Julia Stoyaonovich,Val Tannen1 Yi Chen 2 Tova Milo3 1University of Pennsylvania 2Arizona State University 3Tel AvivUniversity
Science has been revolutionized • High-throughput technologies generatefloods of data, which must be analyzed to create knowledge. • Scientists are turning to scientific workflow systems to help manage the analysis.
Scientific Workflow TGCCGTGTGGCTAAATG… CTGTGC … CTAAATGTCTGTGC… GGCTAAATGTCTG TGCCGTGTGGCGTC… ATCCGTGTGGCTA.. Split Entries • Graphical representation of a sequence of actions to perform a task (e.g., a biological experiment) • Vertex ≡ Module (program) • Edge ≡ Dataflow • Run:An execution of the workflow • Actual data appears on the edges • . Align Sequences Format-1 Functional Data Curate Annotations Format-3 Format-2 Construct Trees
… Need for provenance TGCCGTGTGGCTAAATGTCTGTGC … CCCTTTCCGTGTGGCTAAATGTCTGTGC … Need for privacy s TGCCGTGTGGCTAAATGTCTGTGC GTCTGTGC… TGCCGTGTGGCTAAATGTCTGTGC GTCTGTGC… TGCCGTGTGGCTAAATGTCTGTGC… ATGGCCGTGTGGTCTGTGCCTAACTAACTAA… Split Entries Align Sequences Format How has this tree been generated? Curate Annotations Functional Data ? Format Format Which sequenceshave been used to produce this tree? Construct Trees t Biologist’s workspace
Outline • The need: Workflow provenance repositories • The challenge of privacy • Privacy-aware search and query
Current situation: Workflow repositories • To enable sharing and reuse, repositories of workflow specifications are being created • e.g. myExperiment.org • Keyword search is used to find specifications of interest • Several workflow systems are storing provenance information • Module executions, input parameters, input/output data • “Input-only”
The Vision • “Workflow Provenance” repositories will store specifications as well as executions (i.e. provenance information) • Searchable • Queryable • Privacy protecting • Searching/querying these repositories can be used to • Find/reuse workflows • Understand meaning of a workflow • Correct/debug erroneous specifications • Find “bad” data and what its downstream effect might be
“You are better off designing in security and privacy… from the start, rather than trying to add them later.” (Shapiro, CACM 53:6, June 2010)
Privacy Concerns in Scientific Workflows Data, module, structural
Example: Module Privacy Split entries Patient record: Gender, smoking habits, Familial environment, blood pressure, blood test report, … Module functionalityshouldbekept secret (From patient’s standpoint): output should not be guessed given input data values (From module owner’s standpoint): no one should be able to simulate the module and use it elsewhere. Check for Cancer Check for Infectious disease P: (X1, X2, X3, X4, X5) (X1, X2, X3) (X1, X2, X4, X5) Create Report P has cancer? P has an infectious disease? If X1 > 60 OR (X2 < 800 AND X5=1) AND …. report
Example: Structural Privacy Protein + Functional annotation M2 M1 Protein M2 compares domains of proteins (more precise but more time consuming) M1 compares the entire protein against already annotated genomes Relationshipsbetween certain data/module pairs shouldbekept secret
Privacy concerns at a glance smoking habits, blood pressure, blood test report, …… • Data Privacy • Data items are private • Module Privacy • Module functionality is private (x, f(x)) • Structural Privacy • Execution paths between certain data is private Split entries P: (X1, X2, X3, X4) (X1, X2, X3) (X1, X2, X4) Check for infectious disease Check for cancer Check for cancer DB P has infectious disease? P has cancer? Create Report report
Module Privacy (a hint) • A module f = a function • For every input x to f, f(x) value should not be revealed • Enoughequivalent possible f(x) values w.r.t. visible information According to required privacy guarantee • There is a knife, a fork and a spoon in this figure arxiv.org/abs/1005.5543
Workflow model, revisited • Vertex ≡ Module • Atomic or composite (subworkflow) • Edge ≡ Dataflow W2 W4 Workflow hierarchy W1 W3 W4 W2 W1 Generate Database Queries M5 Expand SNP Set lifestyle, family history, physical symptoms M3 I query query SNPs, ethnicity Query OMIM Query PubMed SNPs M6 M7 Determine Genetic Susceptibility M1 M4 Consult External Databases disorders disorders M8 Combine Disorder Sets disorders W3 M2 Evaluate Disorder Risk prognosis …. O
Access Views W2 W4 W1 W2 W4 • Prefixes of workflow hierarchy can be used to define the finest level of granularity the user can “see” W1 W1 W2 W3 W1: I M1 M2 O
Search • Keyword search: “Database, disorder risk” • Result should be concise and informative W1 W2 W4 W4 W2 W1 Generate DatabaseQueries M5 Expand SNP Set lifestyle, family history, physical symptoms M3 I query query SNPs, ethnicity Query OMIM Query PubMed SNPs M6 M7 Determine Genetic Susceptibility M1 M4 Consult ExternalDatabases disorders disorders M8 Combine Disorder Sets disorders M2 Evaluate DisorderRisk prognosis “WISE: a Workflow Information Search Engine”, (ICDE 2009) O
Result of “database, disorder risk” lifestyle, family history, physical symptoms M3 Expand SNP Set SNPs, ethnicity I W1 W2 W4 SNPs M5 Generate DatabaseQueries query query Query OMIM M6 Query PubMed M7 disorders disorders M2 M8 Evaluate DisorderRisk Combine Disorder Sets disorders prognosis O
Result of “database, disorder risk” W1 W2 lifestyle, family history, physical symptoms M3 W2 W1 Expand SNP Set I Expand SNP Set lifestyle, family history, physical symptoms M3 I SNPs, ethnicity SNPs, ethnicity SNPs SNPs Determine Genetic Susceptibility M1 M4 Consult ExternalDatabases M4 Consult ExternalDatabases disorders disorders M2 Evaluate DisorderRisk M2 Evaluate DisorderRisk prognosis O prognosis O
Queries • “Workflows in which Expand SNP is performed before Query PubMed.” • “Data items that were used to produce d19.” • … • Shares with search the need to match certain terms and give a view of the result and preserve privacy.
Research Challenges • More ways of enabling scientists to construct workflows • Formalizing privacy notions • Provenance query language? • Efficiently identifying data that users can access • users may have different privileges, yielding many different access views. • …
Acknowledgments Members of the PennDB group Susan Davidson SanjeevKhanna Sudeepa Roy Julia Stoyanovich Val Tannen Friend of PennDB and Tel Aviv faculty Tova Milo PennDB alum and ASU faculty Yi Chen Bioinformatics collaborators Sarah Cohen-Boulakia This work was supported in part by NSF IIS-0803524, IIS-0629846, IIS-0915438, CCF-0635084, and IIS-0904314; NSF CAREER award IIS-0845647; and CRA 0937060