Wrapper Induction for End-User Semantic Content Development
Andrew Hogue
MIT CSAIL
Interaction Design and the Semantic Web
Acknowledgments
• David Karger (karger@csail.mit.edu)
• Haystack Group (http://haystack.csail.mit.edu)
Labeling the Semantic Web
• Semantic Web requires RDF labeling of semantic data
• Most existing labeling methods geared towards content providers
• End-user tools require knowledge of underlying HTML of page
• Goal: easy interface for non-technical end-users
Labeling the Semantic Web
• Our approach: create patterns for existing semantic content
• User provides examples of semantic content in the browser
• Induce patterns from examples
• Pattern matches provide content-specific context menus
Labeling the Semantic Web
• Extends Haystack information management client
• Provides context-sensitive menus
• Matched patterns overlay semantic context on Web documents
Demo
Wrapper Induction
• Wrapper: pattern created from examples
• User provides positive examples
• Generalize examples into reusable pattern
• Existing techniques:
  • head-left-right-tail (HLRT) descriptors
  • Hidden Markov models
  • Support Vector Machines
  • Other Machine Learning
Wrapper Induction
• Our approach: take advantage of hierarchical structure of HTML
• Each example picks out a subtree of DOM
• Calculate tree edit distance between examples
• Least-cost edit distance gives best mapping
• Remove unmapped nodes to make pattern
Edit Distance
• Least-cost sequence of operations to transform one tree into the other
• Operations: insert, delete, change a node
• Cost of an operation = size of subtree it affects
• Byproduct: best mapping between elements
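One way to make the cost model above concrete is the small Python sketch below. It is a simplified top-down variant of tree edit distance (the two roots are always mapped to each other, and only whole subtrees are inserted or deleted), not necessarily the algorithm used in this work. The `Node` class, the function names, and the choice to charge 1 for changing a node's tag are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical minimal DOM node: a tag plus ordered children."""
    tag: str
    children: tuple = ()


def size(n):
    """Number of nodes in the subtree rooted at n."""
    return 1 + sum(size(c) for c in n.children)


def edit_distance(a, b):
    """Top-down edit distance between two ordered trees.

    Inserting or deleting a child costs the size of its subtree;
    changing a tag costs 1 (an assumption of this sketch).
    """
    cost = 0 if a.tag == b.tag else 1
    xs, ys = a.children, b.children
    # d[i][j] = cheapest way to turn the first i children of a
    # into the first j children of b.
    d = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i in range(1, len(xs) + 1):
        d[i][0] = d[i - 1][0] + size(xs[i - 1])       # delete subtree
    for j in range(1, len(ys) + 1):
        d[0][j] = d[0][j - 1] + size(ys[j - 1])       # insert subtree
    for i in range(1, len(xs) + 1):
        for j in range(1, len(ys) + 1):
            d[i][j] = min(
                d[i - 1][j] + size(xs[i - 1]),                      # delete
                d[i][j - 1] + size(ys[j - 1]),                      # insert
                d[i - 1][j - 1] + edit_distance(xs[i - 1], ys[j - 1]),  # map
            )
    return cost + d[-1][-1]


# Two made-up table rows that differ only by one extra <b> element.
row_a = Node("tr", (Node("td", (Node("b"),)), Node("td")))
row_b = Node("tr", (Node("td"), Node("td")))
print(edit_distance(row_a, row_b))  # 1
```

Recording which of the three choices wins at each cell of `d` would yield the node mapping that the slide calls a byproduct; the sketch omits that bookkeeping to stay short.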
Mapping Examples
Underlying Structure
• Each example is built with similar HTML
• Only text is different
• Tree edit distance provides us with a mapping
• Create general pattern by removing unmapped nodes
• Replace with wildcards
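The generalization step can be sketched in Python as follows. This is an illustration under stated assumptions, not the talk's implementation: trees are plain `(tag, [children])` tuples with strings for text nodes, and the child mapping comes from `difflib.SequenceMatcher` over child tags as a stand-in for the tree-edit-distance mapping. The names `generalize`, `label`, and `WILDCARD`, and the sample data, are hypothetical.

```python
import difflib

WILDCARD = "*"


def label(node):
    """Text nodes are plain strings; elements are (tag, children) tuples."""
    return "#text" if isinstance(node, str) else node[0]


def generalize(a, b):
    """Merge two example subtrees into one pattern.

    Mapped nodes are kept, differing text becomes a wildcard, and
    runs of unmapped children collapse into a single wildcard.
    """
    if isinstance(a, str) or isinstance(b, str):
        return a if a == b else WILDCARD
    if a[0] != b[0]:
        return WILDCARD
    kids_a, kids_b = a[1], b[1]
    matcher = difflib.SequenceMatcher(
        None,
        [label(c) for c in kids_a],
        [label(c) for c in kids_b],
        autojunk=False,
    )
    merged = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            # Children mapped to each other: generalize pairwise.
            for child_a, child_b in zip(kids_a[i1:i2], kids_b[j1:j2]):
                merged.append(generalize(child_a, child_b))
        elif not merged or merged[-1] != WILDCARD:
            # Unmapped children on either side: replace with one wildcard.
            merged.append(WILDCARD)
    return (a[0], merged)


# Two made-up examples a user might have selected.
ex1 = ("li", [("b", ["Talk A"]), ("i", ["3:30 PM"])])
ex2 = ("li", [("b", ["Talk B"]), ("span", ["new"]), ("i", ["4:00 PM"])])
pattern = generalize(ex1, ex2)
print(pattern)
```

Here the shared `li`, `b`, and `i` elements survive into the pattern, the differing text becomes a wildcard, and the `span` that appears in only one example is replaced by a wildcard.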
Mapping Examples
Mapping Examples
Pattern Matching
• Look for document subtrees with similar structure
• Find alignments of wrapper in tree
• Require every node in wrapper be mapped to some node in document subtree
• Wildcards match zero or more times
• Each valid alignment is a match
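The matching rules above can be sketched in Python as follows. This is a minimal illustration, assuming trees are `(tag, [children])` tuples with strings for text nodes and that `"*"` marks a wildcard. It is stricter than the slide may intend: extra document nodes are tolerated only where the pattern has a wildcard. The function names and sample document are hypothetical.

```python
WILDCARD = "*"


def match_node(pattern, node):
    """True when the document node can be aligned with the pattern node."""
    if pattern == WILDCARD:
        return True
    if isinstance(pattern, str) or isinstance(node, str):
        return pattern == node
    return pattern[0] == node[0] and match_children(pattern[1], node[1])


def match_children(pats, nodes):
    """Align a list of pattern children with a list of document children."""
    if not pats:
        return not nodes
    head, rest = pats[0], pats[1:]
    if head == WILDCARD:
        # A wildcard may absorb zero or more consecutive siblings.
        return any(match_children(rest, nodes[k:])
                   for k in range(len(nodes) + 1))
    return (bool(nodes)
            and match_node(head, nodes[0])
            and match_children(rest, nodes[1:]))


def find_matches(pattern, doc):
    """Collect every document subtree that the pattern aligns with."""
    found = [doc] if match_node(pattern, doc) else []
    if not isinstance(doc, str):
        for child in doc[1]:
            found.extend(find_matches(pattern, child))
    return found


# Made-up pattern and document for illustration.
pattern = ("li", [("b", ["*"]), "*", ("i", ["*"])])
doc = ("ul", [
    ("li", [("b", ["Talk A"]), ("i", ["3:30 PM"])]),
    ("li", [("b", ["Talk B"]), ("span", ["new"]), ("i", ["4:00 PM"])]),
    ("li", ["no talk today"]),
])
matches = find_matches(pattern, doc)
print(len(matches))  # 2
```

The first two list items match (the wildcard absorbs zero siblings in one and the `span` in the other); the third has no `b` or `i` element, so the wrapper cannot be fully mapped onto it.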
Matching Example
Matching Example
Adding Semantics
• How to tie wrappers to semantic content?
• Assert RDF statements
• Tied to wrapper structure
• Classes bound to wrappers
• Properties bound to wildcards
Semantic Labels
Semantic Matching
Semantic Matching
Semantic Matching
[ <rdf:type> <TalkAnnouncement> ;
  <series> "Dertouzos Lect…" ;
  <dc:title> "Distributed Hash…" ;
  <time> "3:30 PM" ]
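How a class bound to a wrapper and properties bound to wildcards could yield statements of this shape is sketched below in Python. Everything here is an assumption for illustration: wildcards are written as `"?property"` strings, each one captures exactly one text node (simpler than the zero-or-more wildcards used for matching), statements are plain tuples rather than objects from an RDF library, and the sample text values are made up.

```python
def bind(pattern, node, out):
    """Align node with pattern, recording {property: text} in `out`.

    Returns True when the node fits the pattern.
    """
    if isinstance(pattern, str):
        if pattern.startswith("?"):
            # Wildcard bound to a property: capture the text it covers.
            out[pattern[1:]] = node if isinstance(node, str) else ""
            return True
        return pattern == node
    if isinstance(node, str) or pattern[0] != node[0]:
        return False
    if len(pattern[1]) != len(node[1]):
        return False
    return all(bind(p, n, out) for p, n in zip(pattern[1], node[1]))


def to_statements(rdf_class, pattern, node, subject="_:match"):
    """Return (subject, predicate, object) tuples, or None if no match."""
    props = {}
    if not bind(pattern, node, props):
        return None
    statements = [(subject, "rdf:type", rdf_class)]
    statements += [(subject, prop, text) for prop, text in props.items()]
    return statements


# Hypothetical wrapper: the class is bound to the whole pattern,
# the properties to its wildcards.
wrapper = ("li", [("b", ["?dc:title"]), ("i", ["?time"])])
node = ("li", [("b", ["Talk A"]), ("i", ["3:30 PM"])])
for statement in to_statements("TalkAnnouncement", wrapper, node):
    print(statement)
```

Keeping the class on the wrapper and the properties on the wildcards means a single pattern match is enough to assert one typed resource with all of its property values.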
Additional Heuristics
• Allow us to create more flexible, reusable patterns with as few as a single example
• List Collapse
• Context
• Automatic additional examples
• URL Prefixes
Our Contributions
• Ease-of-use
• Few examples required
• Wrappers bridge syntactic-semantic gap
Future Work and Applications
• Document-level classes
• Mozilla port
• “Push” wrappers
• Page reformatting
• Autonomous agent interaction
• Wrapper sharing
• Automatic wrapper induction
ahogue@csail.mit.edu
http://haystack.csail.mit.edu