K-relevance Measuring source relevance in data integration query
Queries, relations and sources • K-relevance is defined for queries over one or more relations. • Every relation is based on data extracted from one or more external sources. • The data in a relation may be out of date (the data from some sources may have been extracted from previous versions of those sources).
Relations and sources • Every tuple in a relation is based on exactly one source and has a column that contains a reference to that source. • Example:
Relations and sources • One source may be used by more than one relation. • Example: positiveNums, negativeNums
Source information is needed • If a user thinks that there is a mistake in the query results, knowing which sources the results are based on may help in finding the origin of the mistake. • If a source can never contribute to the query results, there is no need to extract data from it. • If a source can contribute to the query results regardless of the other sources, its data may need to be extracted more frequently.
Query results and sources • Every tuple in the query results is a join of tuples, one tuple from each relation. • The sources of a result tuple are the union of the sources of the joined tuples.
0-relevance – the actual data sources • The union of the sources of all the tuples in the query results is called the 0-relevant sources. • If the query result is empty, there are no tuples in the results, so there are no 0-relevant sources.
0-relevance - example SELECT allNums.n FROM allNums, evenNums WHERE allNums.n <= evenNums.n The 0-relevant sources: {nums1,nums2} ∪ {nums2} = {nums1,nums2}
0-relevance via relation • For a relation R, if one of its tuples with source S has joined to create a result tuple, then S is 0-relevant via R. • Example: {nums1,nums2} are 0-relevant via allNums. {nums2} is 0-relevant via evenNums.
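The 0-relevance computation above can be sketched in a few lines of Python. The table contents below are hypothetical (the original example tables are not reproduced in this text); they were chosen only so that the result matches the sets stated in the example. The function name `zero_relevant` is likewise an illustration, not part of the original material.

```python
# Hypothetical data: each tuple is (n, source). These rows are assumptions
# chosen to reproduce the sets given in the example, not the original tables.
all_nums = [(1, "nums1"), (2, "nums2"), (5, "nums1")]
even_nums = [(2, "nums2")]


def zero_relevant(all_nums, even_nums):
    """Return the 0-relevant sources via each relation for
    SELECT allNums.n FROM allNums, evenNums WHERE allNums.n <= evenNums.n
    """
    via_all, via_even = set(), set()
    for n_a, src_a in all_nums:
        for n_e, src_e in even_nums:
            if n_a <= n_e:  # the query's join condition
                via_all.add(src_a)    # src_a contributed through allNums
                via_even.add(src_e)   # src_e contributed through evenNums
    return via_all, via_even


via_all, via_even = zero_relevant(all_nums, even_nums)
# The 0-relevant sources of the query are the union over all relations.
relevant = via_all | via_even
print(via_all, via_even, relevant)
```

Only real tuples are examined here, which is what distinguishes 0-relevance from the k-relevance and ∞-relevance notions defined later.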
Definition: Potential tuple • A “potential tuple” for a relation is any tuple that fits the schema of the relation (it may or may not actually exist in the relation). • For example, for the relation R(string, int), every tuple of the form (string s, int i) is a potential tuple. • For a relation that contains a source column, every potential tuple that has S in this column is called a potential tuple from S. • Note: every “real” tuple in R is also a potential tuple, because it fits the schema of R.
∞-relevance via relation • If there are • a potential tuple from the source S for the relation R • and potential tuples for the other relations in the query • that can join to satisfy the query and create a result tuple, then S is called an ∞-relevant source via R.
∞-relevance • The union of the ∞-relevant sources via the relations in the query is the set of ∞-relevant sources of the query. • Note: the ∞-relevant sources are independent of the data in the relations, and depend only on the query and on the sources of the queried relations.
∞-relevance • Every source of the relations is ∞-relevant, unless the query has constraints on the source column. • Note: the data sources of the relations are shared: if S is a source of R1, it is also a source of R2. • Therefore, if at least one of the relations has no constraints on its source column, all of the sources are ∞-relevant.
∞-relevance - example • For example, if the data sources are {src1.html,src2.html}, in the query SELECT A.x FROM A, B WHERE A.source != 'src1.html' AND A.x < B.x • There is no potential tuple for A from src1 that can satisfy the query. • There are • a potential tuple for A from src2 (for example, {x=1,src=src2}) • and a potential tuple for B (for example, {x=2,src=src1}) • that satisfy the query and create the result tuple (1). • Therefore, src2 is ∞-relevant via A.
∞-relevance - example • The data sources are {src1.html,src2.html} SELECT A.x FROM A, B WHERE A.source != 'src1.html' AND A.x < B.x • There are • a potential tuple for B from src1 (for example, {x=2,src=src1}) • and a potential tuple for A (for example, {x=1,src=src2}) • that satisfy the query and create the result tuple (1). • Therefore, src1 is ∞-relevant via B. • There are • a potential tuple for B from src2 (for example, {x=3,src=src2}) • and a potential tuple for A (for example, {x=2,src=src2}) • that satisfy the query and create the result tuple (2). • Therefore, src2 is ∞-relevant via B.
∞-relevance - example • {src2} is ∞-relevant via A. • {src1,src2} are ∞-relevant via B. • {src2} ∪ {src1,src2} = {src1,src2} are the ∞-relevant sources of the query.
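The ∞-relevance example can be checked by brute force over a small pool of potential tuples. This is only a sketch: the finite value range `X_VALUES` is an assumption (the schema allows any integer), and the helper names are illustrative. A finite pool can only confirm relevance, not rule it out in general, but here it is enough to reproduce the sets above.

```python
from itertools import product

SOURCES = ["src1.html", "src2.html"]
X_VALUES = range(4)  # assumed small domain for x; any range with >= 2 values works

# Every potential tuple (x, source). A and B share this schema in the example.
POTENTIAL = list(product(X_VALUES, SOURCES))


def satisfies(a, b):
    """WHERE A.source != 'src1.html' AND A.x < B.x"""
    return a[1] != "src1.html" and a[0] < b[0]


def inf_relevant_via(relation):
    """Sources S that have a potential tuple in `relation` which joins
    with some potential tuple of the other relation."""
    found = set()
    for a, b in product(POTENTIAL, POTENTIAL):
        if satisfies(a, b):
            found.add(a[1] if relation == "A" else b[1])
    return found


via_a = inf_relevant_via("A")
via_b = inf_relevant_via("B")
inf_relevant = via_a | via_b
print(via_a, via_b, inf_relevant)
```

No real tuples appear anywhere in this sketch, which reflects the note that ∞-relevance depends only on the query and the sources, not on the data.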
k-relevance • Assume the query is over m relations. • If there are • a potential tuple from the source S for the relation R • and at most k-1 other potential tuples for at most k-1 other relations (one tuple per relation) • and real tuples for each of the remaining relations in the query • that can join to create a result tuple of the query, then S is called a k-relevant source via R.
K-relevance • The union of the k-relevant sources via all relations in the query is called the k-relevant sources of the query. • Note: if k is greater than or equal to m (the number of queried relations), k-relevance is by definition equal to ∞-relevance, because all of the joining tuples may be potential tuples, and there is no need to join with real tuples.
K-relevance - notes • If S is k-relevant, it means that k potential tuples (one of them from S) can join with m-k real tuples to satisfy the query. • Then k+1 potential tuples can also join with m-k-1 real tuples, because a real tuple is also a potential tuple by definition. • Therefore, k-relevance is monotone: every k-relevant source is also a (k+1)-relevant source.
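The monotonicity note, together with the earlier remark that k-relevance coincides with ∞-relevance once k ≥ m, can be written as a single chain of inclusions. The notation Rel_k(Q) for the set of k-relevant sources of a query Q over m relations is introduced here only for illustration; it does not appear in the original slides.

```latex
\[
\mathrm{Rel}_0(Q) \subseteq \mathrm{Rel}_1(Q) \subseteq \cdots \subseteq
\mathrm{Rel}_{m-1}(Q) \subseteq \mathrm{Rel}_m(Q) = \mathrm{Rel}_{m+1}(Q) = \cdots = \mathrm{Rel}_\infty(Q)
\]
```

The first inclusion holds for the same reason as the others: a real tuple from S that took part in an actual join is also a potential tuple from S.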
K-relevance - example • The sources are {sigcomm.html,sigmetrics.html} • The query is: SELECT Papers.title FROM Authors, Papers WHERE Papers.author = Authors.name AND Authors.org = 'MIT' AND Papers.title LIKE '%Ubiquitous%' AND Papers.src = Authors.src
K-relevance - example SELECT Papers.title FROM Authors, Papers WHERE Papers.author = Authors.name AND Authors.org = 'MIT' AND Papers.title LIKE '%Ubiquitous%' AND Papers.src = Authors.src • The relations are: • The query result is empty, because there is no tuple in Authors with org = 'MIT'. • Therefore, there are no 0-relevant sources. • Moreover, whichever tuple a source adds to Papers, the result stays empty, because the new tuple cannot join with any tuple in Authors. • Therefore, there are no 1-relevant sources via Papers.
K-relevance - example SELECT Papers.title FROM Authors, Papers WHERE Papers.author = Authors.name AND Authors.org = 'MIT' AND Papers.title LIKE '%Ubiquitous%' AND Papers.src = Authors.src • If sigcomm.html adds the tuple (sigcomm.html, John, MIT, john@google.com) to Authors, it can join with the first tuple from Papers. Therefore, sigcomm.html is 1-relevant via Authors. • However, no tuple from sigmetrics.html, not even (sigmetrics.html, John, MIT, john@google.com), can join with any tuple from Papers, because all the tuples in Papers have sigcomm.html in the source column. • Therefore, the 1-relevant sources of the query are {sigcomm.html}.
K-relevance - example SELECT Papers.title FROM Authors, Papers WHERE Papers.author = Authors.name AND Authors.org = 'MIT' AND Papers.title LIKE '%Ubiquitous%' AND Papers.src = Authors.src • The potential tuples: • (sigmetrics.html, Todd, MIT, todd@msn.com) from sigmetrics.html in Authors • and (sigmetrics.html, Todd, Boost Ubiquitous Access) in Papers • can join to create the result tuple (Boost Ubiquitous Access). • Therefore, sigmetrics.html is a 2-relevant source via Authors.
K-relevance - example SELECT Papers.title FROM Authors, Papers WHERE Papers.author = Authors.name AND Authors.org = 'MIT' AND Papers.title LIKE '%Ubiquitous%' AND Papers.src = Authors.src • sigmetrics.html is also a 2-relevant source via Papers: • The potential tuples: • (sigmetrics.html, Todd, Boost Ubiquitous Access) from sigmetrics.html in Papers • and (sigmetrics.html, Todd, MIT, todd@msn.com) in Authors • can join to create the result tuple (Boost Ubiquitous Access). • Therefore, sigmetrics.html is a 2-relevant source of the query. • sigcomm.html is also a 2-relevant source of the query, because it is a 1-relevant source and k-relevance is monotone.
K-relevance – example - conclusion • There are no 0-relevant sources. • The only 1-relevant source is sigcomm.html. • The 2-relevant sources are {sigcomm.html,sigmetrics.html}. • The query is over only 2 relations, therefore the ∞-relevant sources are also {sigcomm.html,sigmetrics.html}.
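The conclusion above can be reproduced with a brute-force sketch of the k-relevance definition. Everything concrete in it is an assumption made for illustration: the real rows in `REAL` are hypothetical stand-ins for the tables that are not reproduced in this text (chosen so that Authors has no MIT tuple and all Papers tuples come from sigcomm.html), the pools in `POTENTIAL` are small finite samples of the potential tuples, and `LIKE '%Ubiquitous%'` is approximated by a case-sensitive substring test.

```python
from itertools import combinations, product

SOURCES = ["sigcomm.html", "sigmetrics.html"]

# Hypothetical real tuples (source is always column 0); not the original tables.
REAL = {
    "Authors": [("sigcomm.html", "John", "Google", "john@google.com")],
    "Papers": [("sigcomm.html", "John", "Ubiquitous Networks"),
               ("sigcomm.html", "Todd", "Fast Routing")],
}

# Assumed finite pools of potential tuples for each relation.
POTENTIAL = {
    "Authors": [(s, n, o, n.lower() + "@example.org")
                for s, n, o in product(SOURCES, ["John", "Todd"], ["MIT", "Google"])],
    "Papers": [(s, a, t)
               for s, a, t in product(SOURCES, ["John", "Todd"],
                                      ["Boost Ubiquitous Access", "Fast Routing"])],
}


def satisfies(row):
    """WHERE clause of the example query; row maps relation name -> tuple."""
    a, p = row["Authors"], row["Papers"]
    return (p[1] == a[1] and a[2] == "MIT"
            and "Ubiquitous" in p[2] and p[0] == a[0])


def k_relevant(k):
    """Sources that are k-relevant for the query, by exhaustive search."""
    names = list(REAL)
    found = set()
    if k == 0:
        # 0-relevance: only joins of real tuples count.
        for combo in product(*(REAL[n] for n in names)):
            if satisfies(dict(zip(names, combo))):
                found.update(t[0] for t in combo)
        return found
    for r in names:  # relation R that holds the potential tuple from S
        others = [n for n in names if n != r]
        # Up to k-1 of the other relations may also use potential tuples.
        for size in range(min(k - 1, len(others)) + 1):
            for free in combinations(others, size):
                pools = [POTENTIAL[n] if n == r or n in free else REAL[n]
                         for n in names]
                for combo in product(*pools):
                    row = dict(zip(names, combo))
                    if satisfies(row):
                        found.add(row[r][0])  # source of R's potential tuple
    return found


for k in range(4):
    print(k, sorted(k_relevant(k)))
```

With these assumed rows, the search yields no sources for k = 0, only sigcomm.html for k = 1, and both sources for k ≥ 2, matching the conclusion slide. The search is exponential in the number of relations and is meant only to make the definition concrete.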
K-relevance - summary • A source is 0-relevant if a tuple extracted from it into one or more of the queried relations has joined to create a tuple in the query results. • A source is ∞-relevant if a potential tuple from it, in one of the relations, can join with potential tuples in the other relations to satisfy the query and create a tuple in the results. • A source is k-relevant if a potential tuple from it, in one of the relations, can join with potential tuples in at most (k-1) of the other relations, and with real tuples in the remaining relations, to satisfy the query and create a tuple in the results.