Structured Data • HTML • XML • XHTML • JSON • XMLSchema
Structured Data
• Machine processable data needs to be structured
• There are many examples
• Properties files:
    host=example.com
    port=8080
    protocol=https
• Comma Separated Values:
    host,port,protocol
    example.com,8080,https
• These are examples of ‘flat files’
• Flat files make it hard to model composite structures
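To make the flat-file point concrete, here is a minimal Python sketch that reads the two examples above. The helper names (parse_properties, parse_csv) are illustrative, not from the slides, and the properties parser assumes simple key=value lines only.

```python
import csv
import io

PROPERTIES_TEXT = """host=example.com
port=8080
protocol=https
"""

CSV_TEXT = """host,port,protocol
example.com,8080,https
"""


def parse_properties(text):
    """Parse key=value lines into a dict; blank lines and # comments are skipped."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        result[key.strip()] = value.strip()
    return result


def parse_csv(text):
    """Parse CSV text whose first row is a header into a list of dicts."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


props = parse_properties(PROPERTIES_TEXT)
rows = parse_csv(CSV_TEXT)
print(props)
print(rows[0])
```

Both formats yield the same single-level mapping of strings. Neither has a natural way to say "a publisher has a name and a URL", which is the limitation the slide points at.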
HTML and XML
• Derivatives of Standard Generalized Markup Language (SGML).
• Offer a machine-readable, yet machine-independent, means of conveying information
• Use the angle bracket syntax (<>) to structure the document.
• Based on a tree structure:
    <html>                    (root)
      <head> </head>          (child of the root; sibling of <body>)
      <body>                  (child of the root; sibling of <head>)
        <p> hello world </p>
      </body>
    </html>
Elements and Attributes
• Elements are structural
• Attributes qualify elements
    <html>
      <head> </head>
      <body bgcolor="red">    (body is an element; bgcolor is an attribute)
        <p> hello world </p>
      </body>
    </html>
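The tree, element and attribute ideas can be explored with Python's standard xml.etree.ElementTree module. This is a small sketch using the slide's own document (with plain double quotes); the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

DOC = """<html>
  <head></head>
  <body bgcolor="red">
    <p>hello world</p>
  </body>
</html>"""

root = ET.fromstring(DOC)

# Iterating over an element yields its direct children, in document order.
children = [child.tag for child in root]

# Attributes qualify an element and are exposed as a dict.
body = root.find("body")
paragraph = body.find("p")

print(root.tag)        # the root element
print(children)        # head and body are siblings
print(body.attrib)     # attributes of <body>
print(paragraph.text)  # text content of <p>
```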
Hypertext Markup Language (HTML) • Its primary purpose is to convey information to a browser for human consumption: • <p>, <b>, <i>, <pre> etc. • It also contains tags that are not presentational. • One for metadata: • <meta> • And ones that are structural: • e.g. <head>, <body>, <div>, <span> • And some that are somewhere in between: • e.g. <ol>, <ul>, <h1>, <title> • HTML can embed information: • e.g. <img>, <object> • It can also contain style and script content in the head: • <style>, <script> • Most importantly, it can link to other resources via the anchor tag and href attribute: • e.g. <a href="http://example.com/otherpage.html">
HTML
• HTML example describing a book
    <h1>The Cat in the Hat</h1><br>
    <p>by Dr Seuss</p>
    <ul>
      <li>Publisher: HarperCollins</li>
      <li>Genre: Children’s Fiction</li>
      <li>Year: 2003</li>
      <li>ISBN: 0-00-715853</li>
    </ul>
    <br>visit the website <a href="http://harp.co.uk">here</a>
HTML • The main limitations of HTML are: • Fixed set of tags • Focus on presentation • Like the Web, it is primarily for human consumption • Not all HTML is ‘well-formed’, i.e. it can break the tree structure • The classic case is orphan <br> tags. Strictly speaking, an element must either have a matching closing tag or be written as an empty element (<br/>). • During the browser wars, mostly between Microsoft and Netscape, browsers became very forgiving of invalid markup in order to attract users. • This is just about OK when dealing with a fixed set of presentational tags, free market economics permitting • But it is not sustainable and not good for machine parsing
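The difference between a forgiving HTML parser and a strict XML parser can be seen with Python's standard library. In this sketch, html.parser stands in for a tolerant browser and xml.etree.ElementTree for a strict XML consumer; the snippet and the names TagCollector and is_well_formed are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

SNIPPET = "<p>line one<br>line two</p>"         # orphan <br>
FIXED_SNIPPET = "<p>line one<br/>line two</p>"  # empty-element form


class TagCollector(HTMLParser):
    """Records every start tag; never complains about unclosed tags."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


def is_well_formed(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


collector = TagCollector()
collector.feed(SNIPPET)
collector.close()

print(collector.start_tags)           # the HTML parser carries on regardless
print(is_well_formed(SNIPPET))        # the XML parser rejects the orphan <br>
print(is_well_formed(FIXED_SNIPPET))  # and accepts <br/>
```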
Extensible Markup Language (XML) • XML is extensible. • You can create your own tags, which means • Tags can be understood in semantic terms: • e.g. <book> contains <author> • XML MUST be well-formed (no structural inconsistencies like an unclosed <br>) • Validation against a Document Type Definition (DTD), XML Schema or RELAX NG document is easier because the XML is well-formed. • These define what a particular document can contain, e.g. a book element MUST contain at least one author element
XML
• XML example of a book
    <?xml version="1.0"?>
    <book>
      <title>The Cat in the Hat</title>
      <author>Dr Seuss</author>
      <isbn>0-00-715853</isbn>
      <genre>Children’s Fiction</genre>
      <published>2003</published>
      <publisher>
        <name>HarperCollins</name>
        <url>http://harp.co.uk</url>
      </publisher>
    </book>
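Because the book document is a well-formed tree, a machine can turn it into a nested data structure in a few lines. The sketch below uses Python's xml.etree.ElementTree and assumes every element is properly closed; element_to_dict is an illustrative helper, not part of any library.

```python
import xml.etree.ElementTree as ET

BOOK_XML = """<?xml version="1.0"?>
<book>
  <title>The Cat in the Hat</title>
  <author>Dr Seuss</author>
  <isbn>0-00-715853</isbn>
  <genre>Children's Fiction</genre>
  <published>2003</published>
  <publisher>
    <name>HarperCollins</name>
    <url>http://harp.co.uk</url>
  </publisher>
</book>"""


def element_to_dict(element):
    """Recursively convert an element into nested dicts; leaves become strings.

    Simplification: repeated child tags (e.g. two <author> elements) would
    overwrite each other, and attributes are ignored.
    """
    children = list(element)
    if not children:
        return (element.text or "").strip()
    return {child.tag: element_to_dict(child) for child in children}


book = element_to_dict(ET.fromstring(BOOK_XML))
print(book["title"])
print(book["publisher"]["name"])
```

Unlike the flat-file examples, the composite publisher structure survives as a nested mapping.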
XML Pros • Plain text • Human readable • Create/edit in standard text editor (if you really want to) • Self-Describing, Structured Data • Extensible tag language • Machine readable • Can be validated against DTDs and Schema • Presentation independent • Unlike HTML • Format to other languages using transformations (e.g. XSLT) • Programming language independent • Java, C, C++, Visual Basic, Perl… • Simple to parse • Widely used in many domains and for many purposes
XML Cons • The main limitations of XML are: • Verbose way of describing data • How do you include binary data (e.g. images)? • (work in progress and not ubiquitously supported) • A proliferation of DTD and Schema types because anyone can create their own tags • Lots of processing time for each new XML doc and DTD/Schema you come across • New software components to understand the new XML docs (their semantics not structure) • How do I know if your <author> tag means the same as my <author> tag?
XML Namespaces
• This last issue is addressed through namespaces
• Allows a tag to be qualified by a URI:
    <a:author xmlns:a="http://andrew/namespace">
    <s:author xmlns:s="http://sue/namespace">
  (s is the prefix, xmlns:s="…" is the binding, and http://sue/namespace is the namespace)
• Now I can tell the difference between the two author tags :-)
• But the XML is more complicated :-(
• And what happens if I change the definition of my author tag?
• I suppose I had better change the namespace:
    <a:author xmlns:a="http://andrew/namespace/v1">
• That’s better :-)
• But now every client that understood the previous namespace is broken :-(
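A short Python sketch shows how a parser sees the two author tags, and why changing the namespace breaks existing clients. The wrapping library element and the text content are invented for the example; the namespace URIs are the ones from the slide.

```python
import xml.etree.ElementTree as ET

ANDREW = "http://andrew/namespace"
SUE = "http://sue/namespace"

# <library> is a hypothetical wrapper so both authors fit in one document.
DOC = """<library xmlns:a="http://andrew/namespace"
         xmlns:s="http://sue/namespace">
  <a:author>Andrew's author</a:author>
  <s:author>Sue's author</s:author>
</library>"""

root = ET.fromstring(DOC)

# ElementTree expands each prefix into its namespace URI: {uri}localname.
tags = [child.tag for child in root]

# Lookups must name the namespace, either inline or through a prefix map.
andrew_author = root.find("{%s}author" % ANDREW)
sue_author = root.find("x:author", {"x": SUE})

# A client looking for a different namespace URI finds nothing.
v1_author = root.find("{http://andrew/namespace/v1}author")

print(tags)
print(andrew_author.text, "|", sue_author.text)
print(v1_author)
```

The prefix itself is arbitrary (the lookup above uses x for Sue's namespace); only the URI matters, which is why changing the URI is a breaking change.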
RDF XML example
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:foaf="http://xmlns.com/foaf/0.1/">
      <foaf:Person rdf:about="#AL">
        <foaf:name>Archibald Leach</foaf:name>
        <foaf:mbox_sha1sum>cf2342293...</foaf:mbox_sha1sum>
        <foaf:knows>
          <foaf:Person>
            <foaf:name>Katharine Hepburn</foaf:name>
          </foaf:Person>
        </foaf:knows>
      </foaf:Person>
    </rdf:RDF>
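As a rough illustration of reading this document by machine, the sketch below treats the RDF/XML purely as namespaced XML with Python's xml.etree.ElementTree. It is not an RDF parser, and the truncated mbox_sha1sum element is left out of the sample string.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
}

RDF_XML = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person rdf:about="#AL">
    <foaf:name>Archibald Leach</foaf:name>
    <foaf:knows>
      <foaf:Person>
        <foaf:name>Katharine Hepburn</foaf:name>
      </foaf:Person>
    </foaf:knows>
  </foaf:Person>
</rdf:RDF>"""

root = ET.fromstring(RDF_XML)

# The top-level person and its namespaced rdf:about attribute.
person = root.find("foaf:Person", NS)
about = person.get("{%s}about" % NS["rdf"])
name = person.find("foaf:name", NS).text

# Everyone this person foaf:knows.
known = [
    p.find("foaf:name", NS).text
    for p in person.findall("foaf:knows/foaf:Person", NS)
]

print(about, name, known)
```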
XHTML
• In between HTML and XML
• It is valid HTML and valid XML
• MUST be well-formed.
• Fixed set of tags
• Makes use of HTML non-presentational tags.
• Defers presentational concerns completely to Cascading Style Sheets (CSS)
• Instead uses element attributes, such as class, to give the CSS something to attach presentation to:
    <div class="my-important-type">I’m important</div>    (class attribute)
Cascading Style Sheets (CSS)
• A rendering language that can go in the head of an HTML page
• Property based
    element-type {presentation-key: value}
• CSS allows for extensibility!
• I can define a class, and define rendering hints to the browser for that class:
    <style type="text/css">
      .my-important-type {color: red}
    </style>
  And in the document:
    <div class="my-important-type">Hey wait!</div>
• Hey, wait!
• At the same time as defining rendering hints to the browser, I’m also classifying an element in the document.
• Perhaps I can use this to support semantic information, not just rendering information
• So I could call my class .book and have elements inside it like .title and .author. Hmm…
XHTML example
    <html>
      <head>
        <title>My Book</title>
      </head>
      <body>
        <div class="book">
          <h1 class="title">The Cat in the Hat</h1>
          <p>by <span class="author">Dr Seuss</span></p>
          <ul>
            <li>Publisher: <span class="pub">HarperCollins</span></li>
            <li>Genre: <span class="genre">Children’s Fiction</span></li>
            <li>Year: <span class="year">2003</span></li>
            <li>ISBN: <span class="isbn">0-00-715853</span></li>
          </ul>
        </div>
        <p>visit the website at <a href="http://harp.co.uk" class="url" title="http://harp.co.uk">here</a></p>
      </body>
    </html>
XHTML with some CSS • Here’s what it looks like in a browser with a bit of CSS in the head of the HTML page: The important thing to take away here is that the data has not been lost through rendering. It looks nice for a human, but a machine can still extract the book properties
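One way a machine could pull the book properties back out of the XHTML is to read the class attributes. The following Python sketch assumes the page is well-formed (every span and p closed) and treats each class name as a property name; extract_book is an illustrative helper, and taking the href as the value for the link is a choice made here, not something the slides specify.

```python
import xml.etree.ElementTree as ET

XHTML = """<html>
  <head><title>My Book</title></head>
  <body>
    <div class="book">
      <h1 class="title">The Cat in the Hat</h1>
      <p>by <span class="author">Dr Seuss</span></p>
      <ul>
        <li>Publisher: <span class="pub">HarperCollins</span></li>
        <li>Genre: <span class="genre">Children's Fiction</span></li>
        <li>Year: <span class="year">2003</span></li>
        <li>ISBN: <span class="isbn">0-00-715853</span></li>
      </ul>
    </div>
    <p>visit the website at
      <a href="http://harp.co.uk" class="url" title="http://harp.co.uk">here</a></p>
  </body>
</html>"""


def extract_book(xhtml):
    """Map each class name to the element's text (or href for links)."""
    root = ET.fromstring(xhtml)
    book = {}
    for element in root.iter():
        css_class = element.get("class")
        if css_class is None or css_class == "book":
            continue  # unclassified elements and the container itself
        if element.tag == "a":
            book[css_class] = element.get("href")
        else:
            book[css_class] = (element.text or "").strip()
    return book


book = extract_book(XHTML)
print(book)
```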
HTML 5 • Builds on HTML 4 • A set of features, rather than a monolithic spec. • Not all browsers support all features yet. • HTML 5 can be written in a forgiving HTML syntax or in an XML syntax (XHTML), which MUST be well-formed • Some core features: • Canvas – drawing area • Video – embed directly, no need for plugins • Local storage • Multi-threaded JavaScript (Web Workers) • Geolocation • Semantic tags – section, header, footer etc. • Microdata – embedded semantic metadata, e.g. licensing, vCards and your own vocabs.
HTML 5
• Microdata – embedded semantic metadata, e.g. licensing, vCards and your own vocabs.
• You can create scopes on a tag:
    <section itemscope itemtype="http://data-vocabulary.org/Person">
• Then mark up elements within the scope:
    <img itemprop="photo" src="…"/>
    <p itemprop="name">Andrew</p>
• Then publish your vocabulary so people can use it.
• Publish it in human readable form, and as RDF for machine processing.
• See http://html5demos.com/
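To show that microdata really is machine-extractable, here is a minimal Python sketch built on the standard html.parser module. It handles only a single, un-nested itemscope, and the image file name photo.jpg is a hypothetical placeholder for the elided src value; MicrodataParser is an illustrative class, not a real microdata library.

```python
from html.parser import HTMLParser

# photo.jpg is a made-up file name standing in for the elided src.
MICRODATA = """<section itemscope itemtype="http://data-vocabulary.org/Person">
  <img itemprop="photo" src="photo.jpg"/>
  <p itemprop="name">Andrew</p>
</section>"""


class MicrodataParser(HTMLParser):
    """Collects itemprop values from one flat itemscope (no nesting)."""

    def __init__(self):
        super().__init__()
        self.item_type = None
        self.properties = {}
        self._pending = None  # itemprop still waiting for its text content

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "itemscope" in attributes:
            self.item_type = attributes.get("itemtype")
        prop = attributes.get("itemprop")
        if prop is None:
            return
        if tag == "img":
            # For images the property value is the src URL.
            self.properties[prop] = attributes.get("src")
        else:
            self._pending = prop

    def handle_data(self, data):
        if self._pending is not None and data.strip():
            self.properties[self._pending] = data.strip()
            self._pending = None


parser = MicrodataParser()
parser.feed(MICRODATA)
parser.close()
print(parser.item_type)
print(parser.properties)
```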
JavaScript Object Notation (JSON)
• Another structured document type, not based on XML.
• Instead uses properties and nested curly braces to describe data:
    {"location":
      {"id": "WashingtonDC",
       "city": "Washington DC",
       "venue": "Hilton Hotel, Tysons Corner",
       "address": "7920 Jones Branch Drive"
      }
    }
• Essentially a dictionary
• Supports number, string, boolean, null, array (list) and object (map)
• JSON can be parsed into a JavaScript object with JSON.parse(string); eval(string) also works, but it executes whatever code the string contains, so it is unsafe for untrusted input.
• Popular because it is simpler than XML and natively understood by browsers.
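JSON is not tied to JavaScript. As a small sketch, Python's standard json module parses the location example into ordinary dictionaries and lists; the second string, SAMPLE_TYPES, is invented here to show how each JSON type maps across.

```python
import json

LOCATION_JSON = """{"location": {"id": "WashingtonDC",
  "city": "Washington DC",
  "venue": "Hilton Hotel, Tysons Corner",
  "address": "7920 Jones Branch Drive"}}"""

# Parsing yields nested dicts, i.e. "essentially a dictionary".
data = json.loads(LOCATION_JSON)
city = data["location"]["city"]

# Each JSON type has a direct Python counterpart.
SAMPLE_TYPES = '[1, 2.5, "text", true, null, {"key": []}]'
values = json.loads(SAMPLE_TYPES)

# Serialising and re-parsing gives back an equal structure.
round_trip = json.loads(json.dumps(data))

print(city)
print(values)
print(round_trip == data)
```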
XML Schema • XML Syntax for describing how XML documents should be structured. • Has some built-in data types • Allows for validation of an XML document • Allows for code generation • Create objects in your favorite programming language to manipulate XML documents
    <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
                targetNamespace="urn:book"
                xmlns:bk="urn:book">
      <xsd:element name="book" type="bk:Book"/>
      <xsd:complexType name="Book">
        <xsd:sequence>
          <xsd:element name="title" type="xsd:string"/>
          <xsd:element name="author" type="xsd:string"/>
          <xsd:element name="isbn" type="xsd:string"/>
          <xsd:element name="genre" type="xsd:string"/>
          <xsd:element name="published" type="xsd:gYear"/>
          <xsd:element name="publisher" type="bk:Publisher"/>
        </xsd:sequence>
      </xsd:complexType>
      <xsd:complexType name="Publisher">
        <xsd:sequence>
          <xsd:element name="name" type="xsd:string"/>
          <xsd:element name="url" type="xsd:anyURI"/>
        </xsd:sequence>
      </xsd:complexType>
    </xsd:schema>
Structured Data • Why use structured data? • Understand how structured data encapsulates information • What are the strengths/weaknesses of different types of structured data?