The DIADEM Ontology
Yiyang Bao (2), Xiaonan Guo (2), Giorgio Orsi (1,2), Christian Schallhart (2), Cheng Wang (2)
(1) Institute for the Future of Computing, University of Oxford
(2) Department of Computer Science, University of Oxford
DIADEM 1.0
The languages of the web
• HTML objects provide the data model of a web page.
  <html> <head> <title> </title> </head> <body> <div> … </div> </body> </html>
• CSS boxes and properties provide the layout.
• JavaScript provides web dynamics.
  this.value.toLowerCase();
• … ?
• RDF annotations provide the conceptualization of the domain.
[Figure: the Web and the Real World, with nodes ox:Property, ox:address, xsd:string]
Why ontology?
• Ontologies provide a conceptualization of a domain of interest (Gruber '93).
• But we do not only want to model the application domain.
• We must also model the domain of its web representations, i.e., its phenomenology.
• In the end, that is also an ontology.
[Figure labels: ox:Property, ox:partOf, ox:priceSegment, ox:minPrice, ox:address, xsd:string]
Why ontology?
• Can be used to complete an incomplete model.
• Can be used to verify a model.
• Must tolerate uncertainty and inconsistency.
A logical model for web extraction
• Logical model for web entities
  • input and refinement forms
  • result pages
  • page blocks (e.g., ads)
  • …
• Phenomenological model
  • how logical entities are concretely represented
The building blocks
• HTML entities
  • labels
  • fields (including links)
  • text nodes and text attributes
• Logical entities
  • constructs of our data model
• Rules
  • describe the phenomenology
Example markup (a form with labels and fields):
  <form>
    <label for="male">Male</label> <input type="radio" name="sex" id="male" />
    <label for="female">Female</label> <input type="radio" name="sex" id="female" />
  </form>
Example markup (rendered as "Price: £ 250"):
  <div> <span> Price: </span> <span> £ 250 </span> </div>
The form model • Goal: model web form phenomenology
The form model
• Areas:
  • button
  • location
  • price
  • room
  • type
  • buy/rent
  • order-by
  • display
• Root entity:
  • RealEstateForm
• Properties:
  • partOf hierarchical structures
The form model: elements
• price
  • type = {min, max}
  • purpose = {buy, rent}
  • currency
• geographic
  • location
  • area/branch
  • granularity = {area, branch}
  • area/branch input
  • area/branch select
  • address PO
  • radius
• room
  • category = {bathroom, bedroom, …}
  • type = {min, max}
The form model: elements
• property type
• order-by
• button
  • submit
  • reset
  • map search
  • advance submit
  • link button
• display
• per page
• add-in-time
• new/resale
• SSTC
• buy
• rent
• buy/rent
• other
The form model: phenomenology
• Based on linguistic annotations and (visual) heuristics.

buyElement(X,F) :-
    visibleField(X),
    hasAnnotationFeature(X,"majorType","reform.label"),
    hasAnnotationFeature(X,"minorType","buy"),
    not hasAnnotationFeature(X,"minorType","rent"),
    not hasAnnotationFeature(X,"minorType","includeSSTC"),
    group(Ns,_,_,F), #member(X,Ns).

radiusElement(X,F) :-
    visibleField(X),
    hasAnnotationFeature(X,"majorType","reform.label"),
    hasAnnotationFeature(X,"minorType","radius"),
    group(Ns,_,_,F), #member(X,Ns).
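
The buyElement rule can be read as a conjunctive filter with two negated conditions. The Python sketch below paraphrases that reading only; it is not how DIADEM evaluates its rules. The `Field` class, the `groups` mapping (one group of fields per form, a simplification of `group(Ns,_,_,F)`), and the sample data are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a form field and its annotations."""
    name: str
    visible: bool
    # (feature, value) pairs, mirroring hasAnnotationFeature(X, Feature, Value).
    annotations: frozenset = field(default_factory=frozenset)

    def has(self, feature: str, value: str) -> bool:
        return (feature, value) in self.annotations


def buy_elements(groups: dict[str, list[Field]]) -> set[tuple[str, str]]:
    """Return (field name, form) pairs that satisfy the buyElement rule body."""
    result = set()
    for form, members in groups.items():       # group(Ns,_,_,F), simplified
        for x in members:                       # #member(X,Ns)
            if (x.visible                       # visibleField(X)
                    and x.has("majorType", "reform.label")
                    and x.has("minorType", "buy")
                    and not x.has("minorType", "rent")
                    and not x.has("minorType", "includeSSTC")):
                result.add((x.name, form))
    return result


# Hypothetical sample data: three fields grouped under one form.
LABEL = ("majorType", "reform.label")
groups = {
    "searchForm": [
        Field("for_sale", True, frozenset({LABEL, ("minorType", "buy")})),
        Field("buy_or_rent", True,
              frozenset({LABEL, ("minorType", "buy"), ("minorType", "rent")})),
        Field("hidden_buy", False, frozenset({LABEL, ("minorType", "buy")})),
    ]
}
matches = buy_elements(groups)
```

Only `for_sale` qualifies here: `buy_or_rent` is excluded by the negated "rent" condition and `hidden_buy` by the visibility check.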
The form model: segments
• Segments
  • buttons
  • geographic
  • price
  • room
  • property type
  • buy/rent
  • order-by
  • display
  • per page
  • add in time
  • new/resale
  • SSTC
• A segment is:
  • a single element
  • a group of elements
  • a group of segments
  • a pair <segment, label>
• Form
  • real-estate
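
The recursive definition of a segment maps naturally onto a small tree type. The sketch below is one possible encoding, not DIADEM's own data model: `Element`, `Segment`, `leaf_kinds`, and the sample form are hypothetical names chosen for illustration. An optional `label` on a segment covers the pair <segment, label> case.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Element:
    """A single form element, e.g. a min-price field (hypothetical)."""
    kind: str


@dataclass
class Segment:
    """A segment groups elements and/or other segments, optionally labelled."""
    kind: str
    children: list[Union[Element, "Segment"]]
    label: Optional[str] = None   # set when the segment is paired with a label


def leaf_kinds(node: Union[Element, Segment]) -> list[str]:
    """Collect the kinds of all elements below a node, in document order."""
    if isinstance(node, Element):
        return [node.kind]
    kinds: list[str] = []
    for child in node.children:
        kinds.extend(leaf_kinds(child))
    return kinds


# Hypothetical real-estate form built from the four cases of the definition.
price = Segment("price", [Element("min_price"), Element("max_price")],
                label="Price range")            # group of elements + label
radius = Segment("geographic", [Element("radius")])   # single element
form = Segment("real-estate", [price, radius])        # group of segments
kinds = leaf_kinds(form)
```

Keeping elements and segments as separate types makes the recursion explicit: a traversal stops at an `Element` and descends through a `Segment`.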
The result-page model • Goal: model result-pages phenomenology
The result-page model
• Attributes and values
  • e.g., < price, £ 250,000 >
• Record
  • a group of < attribute, value > pairs
• Data area
  • a group of records
• Mandatory attribute(s)
  • must be present in every record
  • used for sanity checks
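
A minimal sketch of how mandatory attributes could serve as a sanity check on extracted records. It assumes a record is a plain mapping from attribute to value and a data area is a list of records; the choice of `price` as the mandatory attribute, the function name, and the sample values are hypothetical.

```python
# A record is a group of <attribute, value> pairs; a data area is a group of records.
Record = dict[str, str]

# Hypothetical choice of mandatory attribute for a real-estate result page.
MANDATORY = frozenset({"price"})


def sane_records(data_area: list[Record],
                 mandatory: frozenset = MANDATORY) -> list[Record]:
    """Keep only the records that contain every mandatory attribute."""
    return [record for record in data_area if mandatory.issubset(record)]


# Hypothetical data area: the second candidate lacks a price, so it is
# likely not a real record (for example, an ad block).
data_area = [
    {"price": "£ 250,000", "location": "Oxford"},
    {"location": "Oxford"},
]
records = sane_records(data_area)
```

Filtering is only one way to use the check; rejected candidates could equally be flagged for re-analysis rather than dropped.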
A Conceptual Model for Data Extraction
• Conceptual modelling on the Web
  • Software modelling, e.g., UML and stereotypes
  • Ad hoc languages, e.g., WebML
DIADEM Ontology: discussion
• Adaptability
  • the result-page model is substantially domain independent
  • the form model is domain dependent (entity types)
  • the number of entities is limited
• Expressive power
  • safe nr-datalog with stratified negation and aggregation
  • pros: easy to compute
  • cons: not robust to uncertainty and inconsistencies
Uncertainty, Vagueness and Inconsistencies
• Origin
  • annotations are noisy
  • entity types are uncertain
• Multiple models
  • probabilistic models
    • Markov Logic Networks (Lukasiewicz and Simari)
    • C-tables, Bayesian Networks (Olteanu)
  • ASP
    • disjunctive models
    • weak constraints