
Structured Data


willem

Presentation Transcript


  1. Structured Data
     • HTML
     • XML
     • XHTML
     • JSON
     • XML Schema

  2. Structured Data
     • Machine-processable data needs to be structured
     • There are many examples
     • Properties files:

         host=example.com
         port=8080
         protocol=https

     • Comma Separated Values:

         host,port,protocol
         example.com,8080,https

     • These are examples of ‘flat files’
     • Hard to model composite structures
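
To see why these formats count as flat, here is a minimal Python sketch that reads both examples from the slide. The function names `parse_properties` and `parse_csv` are made up for this illustration. Real properties files also allow comments and escapes, which this sketch ignores.

```python
import csv
import io

PROPERTIES_TEXT = """\
host=example.com
port=8080
protocol=https
"""

CSV_TEXT = """\
host,port,protocol
example.com,8080,https
"""


def parse_properties(text: str) -> dict:
    """Parse key=value lines into a flat dict (blank lines are skipped)."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition("=")
        result[key.strip()] = value.strip()
    return result


def parse_csv(text: str) -> list:
    """Parse CSV text with a header row into a list of flat records."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


props = parse_properties(PROPERTIES_TEXT)
rows = parse_csv(CSV_TEXT)
print(props)
print(rows)
```

Both parsers end up with a single level of string keys and string values. Neither format has a natural place for a value that is itself a record, which is the composite-structure problem the slide points at.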

  3. HTML and XML
     • Derivatives of Standard Generalized Markup Language (SGML).
     • Offer a machine-readable yet machine-independent means of conveying information
     • Use the angle bracket syntax (<>) to structure the document.
     • Based on a tree structure:

         <html>
           <head> </head>
           <body>
             <p> hello world </p>
           </body>
         </html>

       Here <html> is the root; <head> and <body> are its children, and siblings of each other.
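
The tree structure can be made visible by parsing the document and walking it. This is a small sketch using Python's standard `xml.etree.ElementTree`; the helper name `outline` is an assumption for this example, and the document is the slide's example with whitespace removed.

```python
import xml.etree.ElementTree as ET

DOC = "<html><head></head><body><p>hello world</p></body></html>"


def outline(element, depth=0):
    """Return one indented line per element, visiting the tree depth-first."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(DOC)
print("\n".join(outline(root)))
```

Iterating over an element yields its direct children in document order, so the indentation in the output mirrors the root/child/sibling relationships.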

  4. Elements and Attributes
     • Elements are structural
     • Attributes qualify elements

         <html>
           <head> </head>
           <body bgcolor="red">
             <p> hello world </p>
           </body>
         </html>

       Here <body> is an element and bgcolor="red" is an attribute.

  5. Hypertext Markup Language (HTML)
     • Its primary purpose is to convey information to a browser for human consumption:
       <p>, <b>, <i>, <pre> etc.
     • It does contain other tags that are not presentational.
     • Like one for metadata: <meta>
     • And ones that are structural: e.g. <head>, <body>, <div>, <span>
     • And some that are sort of in between: e.g. <ol>, <ul>, <h1>, <title>
     • HTML can embed information: e.g. <img>, <object>
     • It can also contain style and script content in the head: <style>, <script>
     • Most importantly, it can link to other resources via the anchor tag and href attribute:
       e.g. <a href="http://example.com/otherpage.html">
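
Links are the part of HTML that machines use most. As a rough sketch, the standard-library `html.parser` module can pull the `href` values out of a page. The class name `LinkCollector`, the function `extract_links` and the sample `PAGE` string are assumptions made for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links


PAGE = '<p>See the <a href="http://example.com/otherpage.html">other page</a>.</p>'
print(extract_links(PAGE))
```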

  6. HTML
     • HTML example describing a book:

         <h1>The Cat in the Hat</h1><br>
         <p>by Dr Seuss</p>
         <ul>
           <li>Publisher: HarperCollins</li>
           <li>Genre: Children’s Fiction</li>
           <li>Year: 2003</li>
           <li>ISBN: 0-00-715853</li>
         </ul>
         <br>visit the website <a href="http://harp.co.uk">here</a>

  7. HTML
     • The main limitations of HTML are:
     • Fixed set of tags
     • Focus on presentation
     • Like the Web, it is primarily for human consumption
     • Not all HTML is ‘well-formed’, i.e. it breaks the tree structure
     • The classic case is orphan <br> tags. Strictly speaking, every opening tag must have a matching closing tag, or the element must be written as an empty tag (<br/>).
     • During the browser wars, mostly between Microsoft and Netscape, browsers became very forgiving of invalid markup to recruit users.
     • This is just about OK when dealing with a fixed set of presentational tags, free market economics permitting
     • But it is not sustainable and not good for machine parsing

  8. Extensible Markup Language (XML)
     • XML is (e)xtensible.
     • You can create your own tags, which means
     • Tags can be understood in semantic terms: e.g. <book> contains <author>
     • XML MUST be well-formed (no structural inconsistencies like an unclosed <br>)
     • Validation against a Document Type Definition (DTD), XML Schema or RELAX NG document is easier because it is well-formed.
     • These define what a particular document can contain, e.g. a book element MUST contain one or more author elements
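
Well-formedness is easy to test mechanically: an XML parser either accepts the text or rejects it. This sketch uses Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` is an assumption. It checks well-formedness only and does not validate against a DTD or schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Empty-element form: every tag is closed.
print(is_well_formed("<p>line one<br/>line two</p>"))
# Orphan <br>: the parser meets </p> while <br> is still open.
print(is_well_formed("<p>line one<br>line two</p>"))
```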

  9. XML
     • XML example of a book:

         <?xml version="1.0"?>
         <book>
           <title>The Cat in the Hat</title>
           <author>Dr Seuss</author>
           <isbn>0-00-715853</isbn>
           <genre>Children’s Fiction</genre>
           <published>2003</published>
           <publisher>
             <name>HarperCollins</name>
             <url>http://harp.co.uk</url>
           </publisher>
         </book>
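
Unlike a flat file, the book document nests a publisher record inside the book record. The sketch below turns the document (written with a closing </isbn> tag so that it is well-formed) into nested Python dictionaries. The helper `to_dict` is an assumption for this example and is deliberately naive: repeated child tags, attributes and mixed content are not handled.

```python
import xml.etree.ElementTree as ET

BOOK_XML = """<?xml version="1.0"?>
<book>
  <title>The Cat in the Hat</title>
  <author>Dr Seuss</author>
  <isbn>0-00-715853</isbn>
  <genre>Children's Fiction</genre>
  <published>2003</published>
  <publisher>
    <name>HarperCollins</name>
    <url>http://harp.co.uk</url>
  </publisher>
</book>
"""


def to_dict(element):
    """Leaf elements become strings; elements with children become dicts."""
    children = list(element)
    if not children:
        return (element.text or "").strip()
    # Naive: a repeated child tag would overwrite the earlier one.
    return {child.tag: to_dict(child) for child in children}


book = to_dict(ET.fromstring(BOOK_XML))
print(book["title"])
print(book["publisher"]["name"])
```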

  10. XML Pros
      • Plain text
      • Human readable
      • Create/edit in standard text editor (if you really want to)
      • Self-describing, structured data
      • Extensible tag language
      • Machine readable
      • Can be validated against DTDs and Schemas
      • Presentation independent
      • Unlike HTML
      • Format to other languages using transformations (e.g. XSLT)
      • Programming language independent
      • Java, C, C++, Visual Basic, Perl…
      • Simple to parse
      • Widely used in many domains and for many purposes

  11. XML Cons
      • The main limitations of XML are:
      • Verbose way of describing data
      • How do you include binary data (e.g. images)?
      • (work in progress and not ubiquitously supported)
      • A proliferation of DTD and Schema types because anyone can create their own tags
      • Lots of processing time for each new XML doc and DTD/Schema you come across
      • New software components to understand the new XML docs (their semantics, not their structure)
      • How do I know if your <author> tag means the same as my <author> tag?

  12. XML Namespaces
      • This last issue is addressed through namespaces
      • Allows a tag to be qualified by a URI:

          <a:author xmlns:a="http://andrew/namespace">
          <s:author xmlns:s="http://sue/namespace">

        Here a: and s: are prefixes, the xmlns:… attribute is the binding, and the URI is the namespace.
      • Now I can tell the difference between the two author tags :-)
      • But the XML is more complicated :-(
      • And what happens if I change the definition of my author tag?
      • I suppose I better change the namespace:

          <a:author xmlns:a="http://andrew/namespace/v1">

      • That’s better :-)
      • But now every client that understood the previous namespace is broken :-(
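
A parser sees the namespace URI, not the prefix. The sketch below uses Python's standard `xml.etree.ElementTree` with the two namespace URIs from the slide. The `<people>` wrapper element and the text content are made up so that both tags can sit in one document.

```python
import xml.etree.ElementTree as ET

DOC = """
<people>
  <a:author xmlns:a="http://andrew/namespace">Andrew's author</a:author>
  <s:author xmlns:s="http://sue/namespace">Sue's author</s:author>
</people>
"""

root = ET.fromstring(DOC)

# ElementTree expands each prefix to {namespace-uri}localname.
tags = [child.tag for child in root]
print(tags)

# A client looks elements up by namespace URI; the prefix it uses is its own choice.
andrew = root.find("x:author", {"x": "http://andrew/namespace"})
print(andrew.text)

# A client that expects a different URI (here the versioned one) finds nothing.
missing = root.find("x:author", {"x": "http://andrew/namespace/v1"})
print(missing)
```

The last lookup shows the versioning problem from the slide: to a parser, a changed URI is simply a different namespace, so lookups written against one URI find nothing under the other.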

  13. RDF XML example

          <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                   xmlns:foaf="http://xmlns.com/foaf/0.1/">
            <foaf:Person rdf:about="#AL">
              <foaf:name>Archibald Leach</foaf:name>
              <foaf:mbox_sha1sum>cf2342293...</foaf:mbox_sha1sum>
              <foaf:knows>
                <foaf:Person>
                  <foaf:name>Katharine Hepburn</foaf:name>
                </foaf:Person>
              </foaf:knows>
            </foaf:Person>
          </rdf:RDF>

  14. XHTML
      • In between HTML and XML
      • It is valid HTML and valid XML
      • MUST be well-formed.
      • Fixed set of tags
      • Makes use of HTML non-presentational tags.
      • Defers presentational concerns completely to Cascading Style Sheets (CSS)
      • Instead uses element attributes, such as the class attribute, to inject presentational hints to the CSS:

          <div class="my-important-type">I’m important</div>

  15. Cascading Style Sheets (CSS)
      • A rendering language that goes in the head of an HTML page
      • Property based:

          element-type {presentation-key: value}

      • CSS allows for extensibility!
      • I can define a class, and define rendering hints to the browser for that class:

          <style type="text/css">
            .my-important-type {color: red}
          </style>

        And in the document:

          <div class="my-important-type">Hey wait!</div>

      • Hey, wait! At the same time as defining rendering hints to the browser, I’m also classifying an element in the document.
      • Perhaps I can use this to support semantic information, not just rendering information
      • So I could call my class .book and have elements inside it like .title and .author. Hmm…

  16. XHTML example

          <head>
            <title>My Book</title>
          </head>
          <body>
            <div class="book">
              <h1 class="title">The Cat in the Hat</h1>
              <p>by <span class="author">Dr Seuss</span></p>
              <ul>
                <li>Publisher: <span class="pub">HarperCollins</span></li>
                <li>Genre: <span class="genre">Children’s Fiction</span></li>
                <li>Year: <span class="year">2003</span></li>
                <li>ISBN: <span class="isbn">0-00-715853</span></li>
              </ul>
            </div>
            <p>visit the website at <a href="http://harp.co.uk" class="url" title="http://harp.co.uk">here</a></p>
          </body>

  17. XHTML with some CSS
      • Here’s what it looks like in a browser with a bit of CSS in the head of the HTML page:

        [Screenshot of the rendered book page]

      • The important thing to take away here is that the data has not been lost through rendering. It looks nice for a human, but a machine can still extract the book properties
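
Because XHTML is well-formed XML, an ordinary XML parser is enough to get the book properties back out. This is a minimal sketch, assuming the class names used in the XHTML example (`book`, `title`, `author`, `pub`, `genre`, `year`, `isbn`). The function name `extract_book` is made up for illustration, and only the body of the page is reproduced.

```python
import xml.etree.ElementTree as ET

XHTML = """
<body>
  <div class="book">
    <h1 class="title">The Cat in the Hat</h1>
    <p>by <span class="author">Dr Seuss</span></p>
    <ul>
      <li>Publisher: <span class="pub">HarperCollins</span></li>
      <li>Genre: <span class="genre">Children's Fiction</span></li>
      <li>Year: <span class="year">2003</span></li>
      <li>ISBN: <span class="isbn">0-00-715853</span></li>
    </ul>
  </div>
</body>
"""


def extract_book(xhtml_text):
    """Map each class name inside the .book element to that element's text."""
    root = ET.fromstring(xhtml_text)
    book = root.find(".//div[@class='book']")
    return {
        el.get("class"): (el.text or "").strip()
        for el in book.iter()
        if el is not book and el.get("class")
    }


properties = extract_book(XHTML)
print(properties)
```

The same class attributes that a stylesheet would use for rendering act here as property names, so presentation and data extraction share one document.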

  18. HTML 5
      • Builds on HTML 4
      • A set of features, rather than a monolithic spec.
      • Not all browsers support all features yet.
      • HTML 5 does not have to be well-formed XML; only its XML serialization (XHTML5) MUST be well-formed
      • Some core features:
      • Canvas – drawing area
      • Video – embed directly – no need for plugins
      • Local storage
      • Multi-threaded JavaScript (Web Workers)
      • Geolocation
      • Semantic tags – section, header, footer etc.
      • Microdata – embedded semantic metadata, e.g. licencing, vCards and your own vocabs.

  19. HTML 5
      • Microdata – embedded semantic metadata, e.g. licencing, vCards and your own vocabs.
      • You can create scopes on a tag:

          <section itemscope itemtype="http://data-vocabulary.org/Person">

      • Then mark up elements within the scope:

          <img itemprop="photo" src="…"/>
          <p itemprop="name">Andrew</p>

      • Then publish your vocabulary so people can use it. Publish it in human-readable form, and in RDF for machine processing.
      • See http://html5demos.com/
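
A machine can read microdata by looking for `itemprop` attributes. The sketch below uses the standard-library `html.parser`. The class name `ItempropCollector`, the function `extract_itemprops` and the image file name `andrew.jpg` are assumptions (the slide elides the `src` value). It is far from a full microdata processor: nested scopes and nested markup inside a property are not handled.

```python
from html.parser import HTMLParser

SNIPPET = """
<section itemscope itemtype="http://data-vocabulary.org/Person">
  <img itemprop="photo" src="andrew.jpg"/>
  <p itemprop="name">Andrew</p>
</section>
"""


class ItempropCollector(HTMLParser):
    """Collect itemprop values: the src attribute if present, else the text."""

    def __init__(self):
        super().__init__()
        self.properties = {}
        self._current = None  # itemprop whose text is being collected

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("itemprop")
        if name is None:
            return
        if "src" in attrs:
            self.properties[name] = attrs["src"]
        else:
            self._current = name
            self.properties[name] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.properties[self._current] += data

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current property.
        self._current = None


def extract_itemprops(html_text):
    collector = ItempropCollector()
    collector.feed(html_text)
    collector.close()
    return collector.properties


print(extract_itemprops(SNIPPET))
```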

  20. JavaScript Object Notation (JSON)
      • Another structured document type, not based on XML.
      • Instead uses properties and nested curly braces to describe data:

          {"location": {
            "id": "WashingtonDC",
            "city": "Washington DC",
            "venue": "Hilton Hotel, Tysons Corner",
            "address": "7920 Jones Branch Drive"
          }}

      • Essentially a dictionary
      • Supports number, string, boolean, null, array (list) and object (map)
      • JSON can be parsed into a JavaScript object using JSON.parse(string). eval(string) also works, but it executes any code in the string, so it is unsafe for untrusted input.
      • Popular because it is simpler than XML and natively understood by browsers.
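
JSON maps directly onto the built-in data types of most languages. As a brief sketch, Python's standard `json` module turns the location example into nested dictionaries. The variable names and the second `MIXED` sample are assumptions for this illustration.

```python
import json

TEXT = """
{"location": {"id": "WashingtonDC",
              "city": "Washington DC",
              "venue": "Hilton Hotel, Tysons Corner",
              "address": "7920 Jones Branch Drive"}}
"""

data = json.loads(TEXT)        # JSON object -> Python dict
location = data["location"]    # nested object -> nested dict
print(location["city"])

# Each JSON type maps onto a native Python type.
MIXED = '[1, 2.5, "text", true, null, {"key": [1, 2]}]'
values = json.loads(MIXED)
print(values)

# Serialising and parsing again gives back an equal structure.
round_trip = json.loads(json.dumps(data))
```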

  21. XML Schema
      • XML syntax for describing how XML documents should be structured.
      • Has some built-in data types
      • Allows for validation of an XML document
      • Allows for code generation
      • Create objects in your favorite programming language to manipulate XML documents

  22. XML Schema example

          <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
                      targetNamespace="urn:book"
                      xmlns:bk="urn:book">
            <xsd:element name="book" type="bk:Book"/>
            <xsd:complexType name="Book">
              <xsd:sequence>
                <xsd:element name="title" type="xsd:string"/>
                <xsd:element name="author" type="xsd:string"/>
                <xsd:element name="isbn" type="xsd:string"/>
                <xsd:element name="genre" type="xsd:string"/>
                <xsd:element name="published" type="xsd:date"/>
                <xsd:element name="publisher" type="bk:Publisher"/>
              </xsd:sequence>
            </xsd:complexType>
            <xsd:complexType name="Publisher">
              <xsd:sequence>
                <xsd:element name="name" type="xsd:string"/>
                <xsd:element name="url" type="xsd:anyURI"/>
              </xsd:sequence>
            </xsd:complexType>
          </xsd:schema>
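
Code generation means turning the schema's complex types into classes. The sketch below is written by hand and is not the output of any particular tool. It shows roughly what classes for the Book and Publisher types might look like in Python. The `from_xml` method, the all-string field types and the sample document are assumptions, and the sketch reads un-namespaced elements for brevity even though the schema declares a target namespace.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

BOOK_XML = """
<book>
  <title>The Cat in the Hat</title>
  <author>Dr Seuss</author>
  <isbn>0-00-715853</isbn>
  <genre>Children's Fiction</genre>
  <published>2003</published>
  <publisher>
    <name>HarperCollins</name>
    <url>http://harp.co.uk</url>
  </publisher>
</book>
"""


@dataclass
class Publisher:
    """Mirrors the Publisher complex type: name and url."""
    name: str
    url: str


@dataclass
class Book:
    """Mirrors the Book complex type; every field is kept as a string here."""
    title: str
    author: str
    isbn: str
    genre: str
    published: str
    publisher: Publisher

    @classmethod
    def from_xml(cls, text):
        root = ET.fromstring(text)
        pub = root.find("publisher")
        return cls(
            title=root.findtext("title"),
            author=root.findtext("author"),
            isbn=root.findtext("isbn"),
            genre=root.findtext("genre"),
            published=root.findtext("published"),
            publisher=Publisher(pub.findtext("name"), pub.findtext("url")),
        )


book = Book.from_xml(BOOK_XML)
print(book.title, "published by", book.publisher.name)
```

Application code then works with `book.publisher.name` rather than with element lookups, which is the convenience that schema-driven code generation aims for.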

  23. Structured Data
      • Why use structured data?
      • Understand how structured data encapsulates information
      • What are the strengths/weaknesses of different types of structured data?
