Thresher: Automating the Unwrapping of Semantic Content from the World Wide Web
Andrew Hogue
Google, MIT CSAIL
WWW 2005 -- Chiba, Japan

Acknowledgments
• David Karger (karger@csail.mit.edu)
• Haystack Group (http://haystack.csail.mit.edu)

Agenda
• Overview
• Demo
• Details
  • Induction
  • Matching
  • Semantics
  • Heuristics

Unwrapping the Web
• Majority of semantic content in “deep web”
• Transformed into human-readable HTML by scripts
• HTML is difficult for automated agents to understand
• Little incentive for content providers to provide RDF markup
• How to “unwrap” this content?

Thresher
• Simple UI for wrapper induction on structured web content
• “Demonstrate” examples of objects
• Induce wrapper, or pattern, based on DOM
• User may also label properties with RDF

Thresher
• Built on Haystack Semantic Web client
• Everything is RDF
• Everything has context menus
• Thresher brings RDF into the web browser
• Wrappers reify web objects for full interaction

Thresher
• Underlying wrapper algorithm based on tree edit distance
• Align user’s examples
• Keep aligned nodes (layout elements)
• Wildcard non-aligned nodes (content)
• Pattern matching is also alignment

Wrapper Induction
• Wrapper: pattern created from examples
• User provides positive examples
• Generalize examples into reusable pattern
• Existing techniques:
  • head-left-right-tail (HLRT) descriptors
  • Hidden Markov models
  • Support Vector Machines
  • Other Machine Learning

Wrapper Induction
• Our approach: take advantage of hierarchical structure of HTML
• Each example picks out a subtree of DOM
• Calculate tree edit distance between examples
• Least-cost edit distance gives best mapping
• Remove unmapped nodes to make pattern

Tree Edit Distance
• Calculate the cost of the sequence of operations that transforms one tree into the other
• Operations: insert, delete, change a node
• Cost of an operation = size of subtree it affects
• Least-cost set of operations gives best mapping between elements
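
The cost model above can be sketched in a few lines. This is a simplified top-down variant in which the two roots are always mapped to each other and only children are inserted or deleted; the slides do not say which tree edit distance algorithm Thresher uses, so the `Node` class, the relabel cost of 1, and the function names are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A DOM-like node; `children` is a tuple of Node (hypothetical type)."""
    label: str
    children: tuple = ()


def size(tree):
    """Number of nodes in the subtree rooted at `tree`."""
    return 1 + sum(size(child) for child in tree.children)


def tree_distance(a, b):
    """Edit distance between two trees whose roots are mapped to each other.

    Assumption: changing a node's label costs 1; inserting or deleting a
    child costs the size of the subtree it removes or adds.
    """
    relabel = 0 if a.label == b.label else 1
    xs, ys = a.children, b.children
    m, n = len(xs), len(ys)
    # d[i][j] = cheapest way to turn the first i children of a
    # into the first j children of b.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(xs[i - 1])      # delete subtree
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(ys[j - 1])      # insert subtree
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(xs[i - 1]),                        # delete
                d[i][j - 1] + size(ys[j - 1]),                        # insert
                d[i - 1][j - 1] + tree_distance(xs[i - 1], ys[j - 1]),  # map
            )
    return relabel + d[m][n]


# Two table rows that differ by one extra <b> element.
row_a = Node("tr", (Node("td", (Node("b"),)), Node("td")))
row_b = Node("tr", (Node("td"), Node("td")))
print(tree_distance(row_a, row_b))
```

The least-cost path through the table `d` is what identifies which children of one example correspond to which children of the other.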

Mapping Examples (figure)
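
A rough sketch of how a least-cost mapping between two examples could be turned into a pattern: nodes that map to each other with the same label are kept, and everything else becomes a wildcard. The `Node` type, the `"*"` wildcard label, the choice to treat differing text nodes as a label mismatch, and the simplified top-down alignment are assumptions for illustration, not a description of Thresher's implementation.

```python
from dataclasses import dataclass

WILDCARD = "*"


@dataclass(frozen=True)
class Node:
    label: str
    children: tuple = ()


def size(tree):
    return 1 + sum(size(child) for child in tree.children)


def prepend(node, kids):
    """Put `node` in front of `kids`, merging adjacent wildcards."""
    if node.label == WILDCARD and kids and kids[0].label == WILDCARD:
        return kids
    return (node,) + kids


def generalize(a, b):
    """Return (cost, pattern) for two example subtrees with mapped roots."""
    if a.label != b.label:
        # Non-aligned content: replace both subtrees by one wildcard.
        return size(a) + size(b), Node(WILDCARD)
    xs, ys = a.children, b.children
    m, n = len(xs), len(ys)
    # best[i][j] = (cost, pattern children) for aligning xs[i:] with ys[j:]
    best = [[None] * (n + 1) for _ in range(m + 1)]
    best[m][n] = (0, ())
    for i in range(m, -1, -1):
        for j in range(n, -1, -1):
            if i == m and j == n:
                continue
            options = []
            if i < m:   # child of a has no partner
                cost, kids = best[i + 1][j]
                options.append((cost + size(xs[i]), prepend(Node(WILDCARD), kids)))
            if j < n:   # child of b has no partner
                cost, kids = best[i][j + 1]
                options.append((cost + size(ys[j]), prepend(Node(WILDCARD), kids)))
            if i < m and j < n:   # map the two children to each other
                cost, kids = best[i + 1][j + 1]
                sub_cost, sub_pattern = generalize(xs[i], ys[j])
                options.append((cost + sub_cost, prepend(sub_pattern, kids)))
            best[i][j] = min(options, key=lambda option: option[0])
    cost, kids = best[0][0]
    return cost, Node(a.label, kids)


# Two hypothetical rows: same layout, different text, one extra <br>.
ex1 = Node("tr", (Node("td", (Node("#text:3:30 PM"),)),
                  Node("td", (Node("b", (Node("#text:Talk A"),)),))))
ex2 = Node("tr", (Node("td", (Node("#text:4:00 PM"),)),
                  Node("td", (Node("b", (Node("#text:Talk B"),)), Node("br")))))
cost, pattern = generalize(ex1, ex2)
print(cost, pattern)
```

In this example the layout elements (`tr`, `td`, `b`) survive in the pattern, while the differing text and the extra `br` collapse into wildcards.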

Pattern Matching
• Look for document subtrees with similar structure
• Find alignments of wrapper in tree
• Require every node in wrapper be mapped to some node in document subtree
• Wildcards match zero or more times
• Each valid alignment is a match

Matching Example (figure)
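
The matching rules above can be illustrated with a small recursive matcher. It is a sketch under assumptions: the `Node` type and the `"*"` wildcard label are hypothetical, and a wildcard is taken to absorb zero or more consecutive sibling subtrees, while every other wrapper node must map to exactly one document node with the same label.

```python
from dataclasses import dataclass

WILDCARD = "*"


@dataclass(frozen=True)
class Node:
    label: str
    children: tuple = ()


def matches(pattern, node):
    """True if the wrapper `pattern` aligns with the subtree at `node`."""
    if pattern.label == WILDCARD:
        return True
    if pattern.label != node.label:
        return False
    return match_children(pattern.children, node.children)


def match_children(pats, kids):
    """Align a sequence of pattern children with a sequence of DOM children."""
    if not pats:
        return not kids          # every document child must be accounted for
    head, rest = pats[0], pats[1:]
    if head.label == WILDCARD:
        # The wildcard absorbs zero or more leading children.
        return any(match_children(rest, kids[k:]) for k in range(len(kids) + 1))
    return bool(kids) and matches(head, kids[0]) and match_children(rest, kids[1:])


def find_matches(pattern, root):
    """Collect every subtree of `root` that the pattern aligns with."""
    found = [root] if matches(pattern, root) else []
    for child in root.children:
        found.extend(find_matches(pattern, child))
    return found


pattern = Node("tr", (
    Node("td", (Node(WILDCARD),)),
    Node("td", (Node("b", (Node(WILDCARD),)), Node(WILDCARD))),
))
text = Node("#text")
page = Node("table", (
    Node("tr", (Node("th", (text,)),)),
    Node("tr", (Node("td", (text,)), Node("td", (Node("b", (text,)),)))),
    Node("tr", (Node("td", (text,)), Node("td", (Node("b", (text,)), Node("br"), text)))),
))
print(len(find_matches(pattern, page)))
```

The header row is rejected because the wrapper's `td` nodes have nothing to map to, while both data rows align despite their differing content.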

Adding Semantics
• How to tie wrappers to semantic content?
• Assert RDF statements about unwrapped objects
• Tied to wrapper structure
• Classes bound to wrappers
• Properties bound to wildcards

Semantic Labels (figure)

Semantic Matching
[ <rdf:type> <TalkAnnouncement> ;
  <series> “Dertouzos Lect…” ;
  <dc:title> “Distributed Hash…” ;
  <time> “3:30 PM” ]
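
One way to picture the binding of classes to wrappers and properties to wildcards: during matching, each labeled wildcard records the text it absorbed, and the bindings are emitted as statements about a new object. The sketch below assumes a hypothetical `Node` type with `text` and `prop` fields, plain tuples for triples, and a made-up talk title; it does not use Haystack's actual RDF API.

```python
from dataclasses import dataclass

WILDCARD = "*"


@dataclass(frozen=True)
class Node:
    label: str
    children: tuple = ()
    text: str = ""   # content of "#text" nodes
    prop: str = ""   # on wildcards: the property bound to this slot


def text_of(node):
    return node.text + "".join(text_of(child) for child in node.children)


def bind(pattern, node):
    """Return {property: text} if the pattern matches `node`, else None."""
    if pattern.label != node.label:
        return None
    return bind_children(pattern.children, node.children)


def bind_children(pats, kids):
    if not pats:
        return {} if not kids else None
    head, rest = pats[0], pats[1:]
    if head.label == WILDCARD:
        for k in range(len(kids) + 1):      # wildcard absorbs kids[:k]
            tail = bind_children(rest, kids[k:])
            if tail is not None:
                value = "".join(text_of(kid) for kid in kids[:k])
                return {head.prop: value, **tail} if head.prop else tail
        return None
    if not kids:
        return None
    first = bind(head, kids[0])
    tail = bind_children(rest, kids[1:]) if first is not None else None
    return None if tail is None else {**first, **tail}


def to_triples(subject, rdf_class, bindings):
    """Class comes from the wrapper, properties from its wildcards."""
    triples = [(subject, "rdf:type", rdf_class)]
    triples += [(subject, prop, value) for prop, value in bindings.items()]
    return triples


wrapper = Node("tr", (
    Node("td", (Node(WILDCARD, prop="time"),)),
    Node("td", (Node("b", (Node(WILDCARD, prop="dc:title"),)),)),
))
row = Node("tr", (
    Node("td", (Node("#text", text="3:30 PM"),)),
    Node("td", (Node("b", (Node("#text", text="Example talk title"),)),)),
))
bindings = bind(wrapper, row)
for triple in to_triples("_:talk1", "TalkAnnouncement", bindings):
    print(triple)
```

The subject `_:talk1` stands in for the blank node written with square brackets in the slide above.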

Automatically Adding Examples
• Find additional examples automatically
• Consider nodes neighboring the example
• Require low normalized cost
• Often allows us to create wrappers with a single example

Automatically Adding Examples (figure; nodes labeled T and TR)
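
The normalization formula is not preserved in this transcript, so the sketch below picks one plausible choice, edit distance divided by the combined size of the two subtrees, purely as an assumption. The `Node` type, the simplified top-down distance, the restriction to siblings as "neighbors", and the threshold of 0.3 are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    label: str
    children: tuple = ()


def size(tree):
    return 1 + sum(size(child) for child in tree.children)


def tree_distance(a, b):
    """Simplified top-down edit distance (roots always mapped)."""
    xs, ys = a.children, b.children
    m, n = len(xs), len(ys)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(xs[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(ys[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(xs[i - 1]),
                d[i][j - 1] + size(ys[j - 1]),
                d[i - 1][j - 1] + tree_distance(xs[i - 1], ys[j - 1]),
            )
    return (0 if a.label == b.label else 1) + d[m][n]


def normalized_cost(a, b):
    """Assumed normalization: distance relative to combined subtree size."""
    return tree_distance(a, b) / (size(a) + size(b))


def similar_siblings(parent, example, threshold=0.3):
    """Siblings of `example` that look enough like it to serve as examples."""
    return [child for child in parent.children
            if child is not example
            and normalized_cost(example, child) <= threshold]


text = Node("#text")
header = Node("tr", (Node("th", (text,)),))
row1 = Node("tr", (Node("td", (text,)), Node("td", (text,))))
row2 = Node("tr", (Node("td", (text,)), Node("td", (text,))))
table = Node("table", (header, row1, row2))
print(similar_siblings(table, row1))
```

With a single user-selected row, the structurally identical sibling is picked up as a second example and the header row is left out.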

List Collapse
• Current wrappers generalize well for single elements
• Will not recognize variable length lists
• Collapse neighboring nodes with low normalized cost
• For matching, allow nodes to match more than once
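
A minimal sketch of list collapse, assuming a hypothetical `Node` type with a `repeat` flag. For brevity, exact structural equality stands in for "low normalized cost"; a cost-based predicate could be passed as `similar` instead. A collapsed node may then match one or more consecutive siblings.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Node:
    label: str
    children: tuple = ()
    repeat: bool = False   # True: may match one or more consecutive siblings


def collapse(node, similar=lambda a, b: a == b):
    """Merge runs of similar neighboring children into one repeatable node."""
    kids = []
    for child in (collapse(c, similar) for c in node.children):
        if kids and similar(replace(kids[-1], repeat=False), child):
            kids[-1] = replace(kids[-1], repeat=True)
        else:
            kids.append(child)
    return replace(node, children=tuple(kids))


def matches(pattern, node):
    return (pattern.label == node.label
            and match_children(pattern.children, node.children))


def match_children(pats, kids):
    if not pats:
        return not kids
    head, rest = pats[0], pats[1:]
    if not kids or not matches(head, kids[0]):
        return False
    if match_children(rest, kids[1:]):
        return True
    # A repeatable node may be used again for the next sibling.
    return head.repeat and match_children(pats, kids[1:])


item = Node("li", (Node("a"),))
example = Node("ul", (item, item, item))
wrapper = collapse(example)
print(wrapper)
print(matches(wrapper, Node("ul", (item,) * 5)))
```

Collapsing the three-item example yields a wrapper with a single repeatable `li`, so lists of any positive length with the same item layout are recognized.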

Wrapper Wrap-up
• Gather user example(s)
• Automatically find additional examples
• Generalize examples using best mapping
• Add semantic labels
• Match by finding alignments
• Overlay objects on the page for interaction

Additional Tools
• Wrapper Sharing
• RSS
• Web Operations

Our Contributions
• End-user wrapper induction
• Few examples required
• Bring object interaction into the browser
• Wrappers bridge syntactic-semantic gap

Future Work and Applications
• Document-level classes
• Page reformatting
• Autonomous agent interaction
• Negative examples
• Automatic wrapper induction

ahogue@google.com
http://haystack.csail.mit.edu