An Approach to Evaluate Data Trustworthiness Based on Data Provenance Department of Computer Science Purdue University
Outline • Motivation • Application Scenario and Problem Definition • A Trust Model for Data Provenance • Performance Study • Conclusion
Motivation • Data integrity is critical for making effective decisions • Evaluating the trustworthiness of data provenance is essential for data integrity • Few efforts have been devoted to investigating approaches for assessing how trustworthy the data are • No existing techniques try to protect data against data deception
Motivation • To evaluate the trustworthiness of data provenance, we need to answer questions: • Where did the data come from? • How trustworthy is the original data source? • Who handled the data? • Are the data managers trustworthy?
Application Scenario and Problem Definition • In our scenario, parties are characterized as • Data source providers (sensor nodes or agents that collect data items) • Intermediate agents (computers that pass on the data items or generate knowledge items) • Data users (people or computers that use items to make decisions). • Items (data items and knowledge items) describe the properties of certain entities or events. • Data items are generated or collected by data source providers. • Knowledge items are the new information generated by an intermediate agent using inference techniques.
Application Scenario and Problem Definition • Our goal is to evaluate the trustworthiness of data items, knowledge items, source providers and intermediate agents. • Aspects to be considered include: • Data similarity: two items that are similar to each other can be considered to support each other. • Path similarity: two items that come from different paths (source nodes) can be considered more trustworthy. • Data conflict: two items that contradict each other according to prior knowledge defined by the users. • Data deduction: knowledge deduced by the intermediate agents from the items they received.
Application Scenario and Problem Definition • We model an item (denoted as r) as a row in a relational table; each item has k attributes A1, ..., Ak. • As shown in the table, there are five items, each of which has seven attributes: RID, SSN, Name, Gender, Age, Location, and Date. RID is the identifier of each item. Each item records the location of a person at a certain time.
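As a minimal sketch of this item model, one row can be written as an immutable record with the seven attributes named on the slide. The Python field types and the sample values are assumptions made for illustration; they are not taken from the slide's table.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Item:
    """One row r of the relational table (attribute names from the slide)."""
    rid: int        # RID: identifier of the item
    ssn: str        # SSN of the person described
    name: str
    gender: str
    age: int
    location: str   # where the person was observed
    date: date      # when the person was observed


# Made-up values for illustration only; not a row from the original table.
example = Item(rid=1, ssn="000-00-0000", name="Alice", gender="F",
               age=30, location="Lafayette", date=date(2008, 1, 1))
```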
A Trust Model for Data Provenance • How to Compute Data Similarity • Employ a clustering algorithm to group items describing the same event. • The purpose of the clustering is to eliminate minor errors such as typos. • After clustering, we obtain sets of items, and each set represents a single event. • For each item r, the effect of data similarity on its trust score, denoted as sim(r), is determined by the number of items in the same cluster and the size of the cluster. • Formal definition: sim(r) is defined in terms of the diameter of the cluster and the number of items in the cluster.
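The two quantities that sim(r) depends on can be computed per cluster as in the sketch below. It assumes that "diameter" means the largest pairwise distance between items in the cluster, and it takes the distance function as a parameter because the slide does not name one. The function name `cluster_stats` is hypothetical, and the sketch deliberately stops short of combining the two quantities into sim(r).

```python
from itertools import combinations
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def cluster_stats(cluster: Sequence[T],
                  dist: Callable[[T, T], float]) -> Tuple[int, float]:
    """Return (number of items, diameter) for one cluster.

    Assumption: the diameter is the maximum pairwise distance; a cluster
    with fewer than two items has diameter 0.
    """
    diameter = max((dist(a, b) for a, b in combinations(cluster, 2)),
                   default=0.0)
    return len(cluster), diameter


# Toy usage with numbers standing in for items and |a - b| as the distance.
size, diameter = cluster_stats([1, 4, 6], lambda a, b: abs(a - b))
```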
A Trust Model for Data Provenance • Path Similarity • Given two items r1 and r2, suppose their paths are P1 and P2, respectively. • The path similarity between P1 and P2 is defined as the edit distance between their identifiers. • Formal definition: the resulting path-similarity factor is a parameter whose maximum value is 1. When no two items share a path, it equals 1; when all items share one path, it equals its minimum value.
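The edit distance between two paths can be sketched with the standard Levenshtein dynamic program. The assumption here is that a path identifier is the ordered sequence of node IDs an item passed through; the slide does not say whether the distance is taken over node sequences or over identifier strings, and the same function handles both. The name `path_edit_distance` and the node IDs are hypothetical.

```python
from typing import Sequence


def path_edit_distance(p1: Sequence[str], p2: Sequence[str]) -> int:
    """Levenshtein distance between two paths.

    Counts the minimum number of node insertions, deletions and
    substitutions needed to turn p1 into p2.
    """
    # prev[j] holds the distance between the processed prefix of p1 and p2[:j].
    prev = list(range(len(p2) + 1))
    for i, a in enumerate(p1, start=1):
        curr = [i]
        for j, b in enumerate(p2, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # delete a
                            curr[j - 1] + 1,      # insert b
                            prev[j - 1] + cost))  # substitute a -> b
        prev = curr
    return prev[-1]


# Two paths that differ in one intermediate agent.
d = path_edit_distance(["s1", "a1", "a2"], ["s1", "a3", "a2"])
```

A distance of 0 means the two items travelled the same path, so they offer less independent support for each other than items with distant paths.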
A Trust Model for Data Provenance • Data conflict • Refers to inconsistent descriptions or information about the same entity or event. A simple example of a data conflict is the same person appearing at different locations during the same time period. • Prior knowledge is used to define the data conflict. • The data conflict score of one cluster against another cluster is determined by the distance between the two clusters and the number of items in the second cluster, taking path similarity into account. • Formal definition: the score is expressed in terms of the distance between the two clusters.
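The slide's example of prior knowledge, that one person cannot be at two locations in the same time period, can be written as a small predicate. This is only a sketch: it assumes a person is identified by SSN and that "the same time period" means the same date, and the `Sighting` type and sample values are hypothetical.

```python
from typing import NamedTuple


class Sighting(NamedTuple):
    ssn: str       # identifies the person
    location: str
    date: str      # ISO date string, e.g. "2008-01-01"


def in_conflict(a: Sighting, b: Sighting) -> bool:
    """True if a and b place the same person in different locations
    on the same date (the prior-knowledge rule from the slide's example)."""
    return a.ssn == b.ssn and a.date == b.date and a.location != b.location


# Made-up sightings for illustration.
x = Sighting("000-00-0000", "Lafayette", "2008-01-01")
y = Sighting("000-00-0000", "Chicago", "2008-01-01")
z = Sighting("000-00-0000", "Chicago", "2008-01-02")
```

Here `in_conflict(x, y)` holds, while `x` and `z` do not conflict because their dates differ.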
A Trust Model for Data Provenance • Data Deduction • The data deduction score of a knowledge item k is computed from all of its input items and the inference techniques used by the intermediate agent. • A weighted function is used to compute the score. Here, the weighting parameter is based on the operation the intermediate agent performs and its impact on the trustworthiness of knowledge item k, t(a) is the trustworthiness of agent a, and t(rj) is the trustworthiness of the input items rj.
A Trust Model for Data Provenance • Computing trust scores • We compute the trust score of a data item by taking the above four aspects into account. • The equation is based on probability theory, where t(f) is the probability of fact f being true and t(r) is the probability of item r being true; f and r belong to the same cluster. • A similarity term in the equation takes the similarity between two items into account. • The more similar two items are, the more likely they represent the same event.
A Trust Model for Data Provenance • Computing trust scores (cont'd) • Similar equations are used to take the conflict between items into account. • The trustworthiness of each intermediate agent and source node is computed as the average of the trust scores of the items belonging to it. • The complexity of our algorithm is dominated by the cost of computing the data similarity, path similarity and data conflict, which are all O(n^2). • An overview of our algorithm is given on the next slide.
A Trust Model for Data Provenance • 1. cluster data facts and knowledge items • 2. for each cluster • 3. compute data similarity • 4. compute path similarity • 5. compute data conflict • 6. assign initial trust scores to all the source providers and intermediate agents • 7. repeat • 8. for each data fact and knowledge item • 9. compute its trust score • 10. for each knowledge item • 11. compute data deduction • 12. recompute the trust score of the knowledge item by combining the effect of data deduction • 13. compute trust scores for all the source providers and intermediate agents • 14. until the change in trust scores is negligible
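The repeat-until loop in steps 7 to 14 can be sketched as a fixed-point iteration. Party scores are recomputed as the average of the scores of the items they handled, as stated on the slides. The item scoring function is passed in as a parameter, and the `toy_score` used below is a placeholder assumption, not the slides' equation; it stands in for steps 8 to 12, including data deduction. All names (`average_party_trust`, `iterate_trust`, `handled_by`, `prior`) are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Scores = Dict[str, float]


def average_party_trust(items: Scores, handled_by: Dict[str, List[str]]) -> Scores:
    """Trust of each source provider or agent = mean score of its items."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for item_id, parties in handled_by.items():
        for p in parties:
            totals[p] += items[item_id]
            counts[p] += 1
    return {p: totals[p] / counts[p] for p in totals}


def iterate_trust(initial_party: Scores,
                  handled_by: Dict[str, List[str]],
                  score_item: Callable[[str, Scores], float],
                  eps: float = 1e-6,
                  max_rounds: int = 1000) -> Tuple[Scores, Scores]:
    """Alternate item and party updates until scores change by less than eps."""
    party, items = dict(initial_party), {}
    for _ in range(max_rounds):
        new_items = {i: score_item(i, party) for i in handled_by}
        new_party = average_party_trust(new_items, handled_by)
        change = max(
            [abs(new_items[i] - items.get(i, 0.0)) for i in new_items]
            + [abs(new_party[p] - party.get(p, 0.0)) for p in new_party],
            default=0.0,
        )
        items, party = new_items, new_party
        if change < eps:  # "until the change in trust scores is negligible"
            break
    return items, party


# Toy setup: two items, each handled by one source and a shared agent.
handled_by = {"r1": ["s1", "a1"], "r2": ["s2", "a1"]}
prior = {"r1": 0.9, "r2": 0.3}  # made-up per-item evidence


def toy_score(item_id: str, party: Scores) -> float:
    # Placeholder rule: blend the item's prior with the mean trust of its path.
    path = handled_by[item_id]
    return 0.5 * prior[item_id] + 0.5 * sum(party[p] for p in path) / len(path)


items, parties = iterate_trust({"s1": 0.5, "s2": 0.5, "a1": 0.5},
                               handled_by, toy_score)
```

With this placeholder rule each round halves the remaining error, so the loop stops after a few dozen rounds; convergence for other scoring rules is not guaranteed by the sketch, which is why `max_rounds` caps the loop.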
Performance Study • In the performance study, we simulate a network containing 100 source providers and 100 intermediate agents. • As shown in Figure (a), the running time of the initialization phase increases as the dataset size grows. • This is because, in the worst case, the clustering algorithm and the computation of data similarity, path similarity and data conflict are all O(n^2).
Performance Study • Compared to the initialization phase, the iteration phase is much faster (see Figure (d)). • This is because the iteration phase simply computes score functions based on the results obtained from the initialization phase, and the trust scores converge to stable values in a short time.
Performance Study • As shown in Figures (c) and (f), the running time of both phases increases with the path length.
Conclusion • Formulated and introduced the problem of evaluating the trustworthiness of data provenance. • Proposed a trust model that takes into account four important factors that influence trustworthiness. • Evaluated the efficiency of our approach. • Our proposed method can deal with both unintentional errors and malicious attacks without collusion.
Future Work • Develop an approach to estimate the confidence of query results. • Develop a policy language to specify the minimum confidence level that a query result must have for use by users in certain roles. • Dynamically adjust our trust model when information keeps streaming into the system. • Certify data provenance so as to achieve a certified data lineage.