Inconsistent Data on the Semantic Web: A Theoretical Approach
Brian Goodrich
The Problem • A computer application produces a set of outputs from a set of inputs according to its internal logic. • If an application is given input data that puts it in a conflicted state when deciding its output, it will fail without some kind of logic for resolving that conflict. • The Semantic Web is based on being able to parse human intent from structured, semi-structured, and unstructured data on the Web. • Human intent is frequently conflicting.
Conflicting Data Sources • Malicious – deceptive or rerouting attempts, or simply ignorantly incorrect information • Incomplete Information – insufficient context or simply unfinished data • Humor – especially sarcasm, satire, and exaggeration (e.g. political cartoons) • Time – what once was one thing is now another (e.g. quality of service, price, etc.) • Ontological Deficiency – when the extraction ontology lacks sufficient vividness to separate data appropriately.
Thesis • To propose a method for simplifying the task of dealing with conflicting data on the Semantic Web in a fast, accurate, and dynamic way by supplying each web source with a derived indicator of its communal usage, called a Consensual Reliability Score (CRS).
Methods (a*z) + (b*y) + (c*x) + (d*w) = CRS(f) • Formula for deriving the CRS of a web source f from inputs a, b, c, and d. • With weighted constants z, y, x, and w.
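Below is a minimal Python sketch of this formula. The equal default weights are an assumption made for illustration; the deck leaves z, y, x, and w to be calibrated experimentally.

```python
def crs(a, b, c, d, z=0.25, y=0.25, x=0.25, w=0.25):
    """Consensual Reliability Score as the weighted sum from the deck.

    a: site-type score, b: incoming-index score, c: usage score,
    d: direct-survey score. The equal weights z, y, x, w are
    placeholders; the deck leaves their calibration to experiment.
    """
    return (a * z) + (b * y) + (c * x) + (d * w)

# Example: a source scoring well on links and usage but unsurveyed.
print(crs(a=0.8, b=0.9, c=0.7, d=0.0))  # 0.6 with equal weights
```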
Site Type Mining (a * z)… Five types of Web Pages • Head Pages • Navigation Pages • Content Pages • Look-up Pages • Personal Pages
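A hypothetical illustration of how the five page types might feed the a component. The deck does not assign numeric scores to the types, so every value below is an assumption.

```python
# Hypothetical mapping from the five page types to the site-type
# component a; the deck does not specify these values, so they are
# assumptions for illustration only.
SITE_TYPE_SCORES = {
    "head": 0.9,        # entry points of established sites
    "navigation": 0.6,
    "content": 0.8,
    "lookup": 0.7,
    "personal": 0.3,    # least communal vetting
}

def site_type_score(page_type: str) -> float:
    # Unknown types fall back to a neutral midpoint.
    return SITE_TYPE_SCORES.get(page_type, 0.5)
```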
Incoming Index …(b * y)… • A distributed web crawler counts hyperlinks, then traverses the unique hyperlink paths looking for additional links. • Link counts are stored in a hash indexed by the destination of each hyperlink. • Provides a dynamic count of how often the Web as a whole points to a given source, and therefore an indication of how often people use it. • Excludes orphan sites (mostly personal sites and spam pop-ups). • Based on the success of the Google search engine.
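A minimal single-process sketch of the link-counting idea, using only the Python standard library. The actual design is a distributed crawler; the regex-based link extraction and the page cap here are simplifying assumptions.

```python
import re
import urllib.request
from collections import defaultdict, deque

HREF_RE = re.compile(r'href="(https?://[^"#]+)"')

def incoming_index(seed_urls, max_pages=100):
    """Count hyperlinks into each destination, keyed by destination URL.

    A single-process stand-in for the distributed crawler: it traverses
    unique hyperlink paths and stores link counts in a hash (here, a
    dict) indexed by the hyperlink's destination.
    """
    counts = defaultdict(int)          # destination URL -> inbound links
    seen, frontier = set(seed_urls), deque(seed_urls)
    while frontier and len(seen) <= max_pages:
        url = frontier.popleft()
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue                    # unreachable pages are skipped
        for dest in HREF_RE.findall(html):
            counts[dest] += 1           # every inbound mention counts
            if dest not in seen:        # traverse only unique paths
                seen.add(dest)
                frontier.append(dest)
    return counts
```

Note that orphan sites fall out of this scheme for free: a page nobody links to simply never appears as a key in the counts hash.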
Usage Mining …(c * x)… • The most straightforward way to test how often people use a web source: query the site's number of hits, i.e. how many people have seen it. • Problem: unlike the Incoming Index method, it does not exclude orphan sites. • Further experimentation is needed to determine x's weight.
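One way the raw hit count might be normalized into a bounded c component. The log scaling and the saturation point below are assumptions; the deck only says to query the site's hit count.

```python
import math

def usage_score(hits: int, saturation: int = 1_000_000) -> float:
    """Squash a raw hit count into [0, 1] for the c component.

    Log scaling is an assumption: the deck does not say how to
    normalize hit counts, and notes that x's weight itself still
    needs experimentation.
    """
    if hits <= 0:
        return 0.0
    return min(1.0, math.log10(hits) / math.log10(saturation))
```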
Direct Survey …(d * w)… • The most reliable method of determining reliability: manually query users directly. • Too slow and costly to be considered a whole solution, but it can assist in CRS derivation, hopefully offsetting frequently visited sites with no true information (onion.com, humor sites, etc.). • More experimentation is needed to determine w's weight.
Review (a*z) + (b*y) + (c*x) + (d*w) = CRS(f)
“Classical content data mining is not applicable in this case (CRS derivation) because it is the content of the web sources that is in question.” -Brian Goodrich
Storage • Global Index – fast access; centralized storage for CRSBot; but a centralized vulnerability: a vital non-distributed resource in a distributed system. • Local Storage – no centralized vulnerability, but a non-unified derivation formula (disrupts the trust algorithm). • Local Derivation – too slow to be useful (the problem size is too large).
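A minimal sketch of the Global Index option: a single centralized map from source URL to CRS, which is exactly the non-distributed vulnerability the slide flags. The interface and the neutral default are hypothetical.

```python
class GlobalIndex:
    """Centralized CRS store for CRSBot: fast lookups, but a single
    non-distributed resource, as the trade-offs above note."""

    def __init__(self):
        self._scores = {}               # source URL -> CRS

    def put(self, source_url: str, crs: float) -> None:
        self._scores[source_url] = crs

    def get(self, source_url: str, default: float = 0.0) -> float:
        # Unscored sources get a neutral default; the deck does not
        # specify this fallback, so it is an assumption.
        return self._scores.get(source_url, default)
```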
Related Work • Tim Berners-Lee: "There is a choice here, and I am not sure right now which appeals to me most. One is to say precisely, 'whatever any document says of the form xxxx is a member of W3C so long as it is signed with key 32457934759432'. The other is to say, 'whatever is of form xxxx and can be inferred from information signed with key 32457934759432'." • There are problems with both choices, but both use static references in a dynamic environment (the Web).
Contributions • The CRS provides a fast and accurate measure of community consensus on the Web. • It enables reliable decisions between conflicting data on the Web, fine-tuning the results from the Semantic Web.
Limitations • Totally reliant on usage patterns of the Internet, which may not always reflect which data is more correct. • Reflects only consensus about a data source, not the actual data it contains. • Cannot express complex or compound relationships or extract partial truths.