XML – eXtensible Markup Language
The World Wide Web and What We Would Like to Do with It • XML has a lot of hype surrounding it • This week we discuss: • Why XML is needed • Basic technologies used together with XML • In the next few weeks: challenges in using XML
XML in One Slide • Basically, XML looks like HTML. • However, in XML, you can use any tag names that you want • Example: <person> <name> Lisa Simpson</name> <tel> 02-828-1234 </tel> <tel> 054-470-777 </tel> <email> lisa@cs.huji.ac.il </email> </person> Is that all? Big Deal?!
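To see why free tag names matter, here is a minimal sketch (assuming Python and its standard `xml.etree.ElementTree` module, which the slides do not prescribe) of a program that reads the record above. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# The person record from the slide, as a string.
PERSON_XML = """
<person>
  <name>Lisa Simpson</name>
  <tel>02-828-1234</tel>
  <tel>054-470-777</tel>
  <email>lisa@cs.huji.ac.il</email>
</person>
"""

person = ET.fromstring(PERSON_XML)

# The tag names tell the program what each piece of text means,
# so no guessing from layout or fonts is needed.
name = person.findtext("name")
phones = [tel.text for tel in person.findall("tel")]
email = person.findtext("email")

print(name, phones, email)
```

The program asks for data by meaning ("name", "tel", "email"), not by position on a page.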
Example 1: A Homepage on the Web [Screenshot: "Tom Sawyer's Homepage", with a "Tom's Hobbies" section (Boating on the Mississippi River, Chewing Gum, Painting the Fence) and a "Tom's Friends" section]
Web Pages are Written in HTML • HTML is a markup language • An HTML page consists of tags with attributes and data • HTML describes the style of the page (e.g., color, font type, etc.)
<html> <body> <h1>Tom Sawyer's Homepage</h1> <img src="tom.jpg"> Hi'ya all. Did you know that my best friend is <b>Huckleberry Finn</b>? Sometimes, I like <b>Becky Thatcher</b>? <p> <font color = "red"> Here are some of my hobbies: <ul> <li> Boating on the Mississippi River <li> Chewing gum <li> Painting the fence </ul> </font> If you want to discuss common interests, contact me at <a href="mailto:tom@mark.twain">tom@mark.twain</a> </body></html>
Automatically Using Information • Tom Sawyer has a homepage. So do a lot of other people. It would be nice to be able to do the following things automatically (via a computer program) • Querying the Page: Find Tom Sawyer's email address and the names of his friends • Querying Similar Pages: Find people who have interests in common with Tom Sawyer
Automatically Using Information • Site Personalization: Tom Sawyer's interests should be automatically recognized by sites • When Tom Sawyer enters Amazon, he should get "book recommendations" that match his interests • When Tom Sawyer enters a site that sells food, he should be told about sales on gum • This should all happen without Tom having to tell every site about his interests
Can We Automatically Use the Information? • In order to perform the tasks described before, we have to: • Find web pages that describe people • Extract the relevant information • Problems: • How can we know if a page describes a person? • How can we know what to extract? (Everyone has their own style for their homepage...) • How can we "understand" the extracted information? (What parts of the page describe which information?)
Example 2: Weather Forecasting [Screenshot: National Weather Service site, "Weather Forecasting and Weather Alerts", showing "Flood Alerts in Mississippi"]
Wouldn't it be great if… Wouldn't it be great if Tom could get automatic updates of weather problems in Mississippi? It is dangerous to go boating if there are floods…
Example 3: News Alerts [Screenshot: Yahoo News, headline "Traffic Jam in the Mississippi River"]
Wouldn't it be great if… Wouldn't it be great if Tom could get automatic updates of important news related to Mississippi? He might want to choose a different river to go boating…
Can these things be done? It is difficult (perhaps impossible) to perform these tasks • Once again, we need to FIND the relevant pages and EXTRACT the relevant data • HTML pages are constantly changing • How can we figure out automatically what data is relevant and what the data is talking about (even when the page changes)? • HTML describes only style and not meaning (or semantics)
Two Basic Approaches • If the information on the Web were neatly organized in a huge database, these problems could be solved. But it's not. What should we do? • AI, NLP Approach: Use smart techniques to recognize information, e.g., recognize patterns in how things are written • DB Approach: Turn the Web into a “database”, by writing it in XML
The Semantic Web • The Semantic Web is a machine-understandable Web • The meaning of data (i.e., the semantics of data) should be encoded together with the data • Tim Berners-Lee, the inventor of the Web (by putting together the ideas of hypertext, TCP/IP, DNS), is one of the main people behind the Semantic Web
Main Technologies Needed • XML: The syntax for marking up text with meaning • RDF: Defines objects and relationships between them • OWL: Defines ontologies which connect different concepts (e.g., a car is an automobile, a car is a type of vehicle) • Web Services: Allow services given online to be accessed programmatically • Here is a simplified version of how it could work
<Person> <name>Thomas Sawyer</name> <gender>Male</gender> <mbox resource="mailto:tom@mark.twain"/> <picture resource="http://www.cs.huji.ac.il/~sarina/tom.jpg"/> <speaks>English</speaks> <interest resource="Boating on the Mississippi"/> <interest resource="Chewing Gum" /> <knows> <Person> <name>Huckleberry Finn</name> <mbox resource="mailto:finn@mark.twain"/> </Person> </knows> </Person> Simplified version of the FOAF standard
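With a record like this, the "Querying the Page" task from earlier becomes a few lookups. The sketch below assumes Python's standard `xml.etree.ElementTree` and a well-formed, shortened copy of the record (the nested `Person` is closed with `</Person>`); it is an illustration, not part of FOAF.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed copy of the simplified FOAF-style record.
FOAF_LIKE = """
<Person>
  <name>Thomas Sawyer</name>
  <mbox resource="mailto:tom@mark.twain"/>
  <interest resource="Boating on the Mississippi"/>
  <interest resource="Chewing Gum"/>
  <knows>
    <Person>
      <name>Huckleberry Finn</name>
      <mbox resource="mailto:finn@mark.twain"/>
    </Person>
  </knows>
</Person>
"""

tom = ET.fromstring(FOAF_LIKE)

# "mbox" matches only direct children, so this is Tom's address, not Huck's.
email = tom.find("mbox").get("resource").split(":", 1)[1]

# Friends are the Person elements nested inside <knows>.
friends = [p.findtext("name") for p in tom.findall("knows/Person")]

# Interests are stored as attribute values.
interests = [i.get("resource") for i in tom.findall("interest")]

print(email, friends, interests)
```

The same three lookups would work on any other page that used the same tag names, which is what makes "Querying Similar Pages" feasible.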
Is there XML on the Web? (1) • The weather forecasting site exports its forecasts as RSS (a standard for marking up news) - this data can easily be used by a program
Is there XML on the Web? (2) • Yahoo News (seen before) exports its news as RSS - this data can easily be used by a program
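A program that wants Tom's alerts could filter such a feed by keyword. The sketch below uses Python's standard `xml.etree.ElementTree` on a small hand-written feed: the RSS 2.0 element names (`rss`, `channel`, `item`, `title`, `link`) are standard, but the feed contents, the `example.org` links, and the helper name `alerts_about` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed; titles echo the slides, links are invented.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Alerts</title>
    <item>
      <title>Flood Alerts in Mississippi</title>
      <link>http://example.org/alerts/1</link>
    </item>
    <item>
      <title>Sunny Weekend in Ohio</title>
      <link>http://example.org/alerts/2</link>
    </item>
    <item>
      <title>Traffic Jam in the Mississippi River</title>
      <link>http://example.org/news/3</link>
    </item>
  </channel>
</rss>
"""

def alerts_about(feed_xml, keyword):
    """Return (title, link) pairs for items whose title mentions keyword."""
    root = ET.fromstring(feed_xml)
    matches = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        if keyword.lower() in title.lower():
            matches.append((title, item.findtext("link")))
    return matches

for title, link in alerts_about(SAMPLE_RSS, "Mississippi"):
    print(title, link)
```

Because every feed marks a headline as `<title>` inside `<item>`, the same function works whether the feed comes from a weather site or a news site.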
The Sky's The Limit: Doctor's appointment [Figure: Lucy's Agent and Pete's Agent arrange a treatment for Mom by consulting the Physician's Agent, the Insurance Co., rating sites and provider sites (required treatment, in-plan? close-by? specialist?), then schedule the appointment and a driving schedule] Source: “The Semantic Web”, Scientific American, May 2001
Exchanging Data • Problem: Many data sources, each of a different type (different vendor), with a different schema. • How can the data be combined and used together? • How can different companies collaborate on their data? • What (proprietary?) format should be used to exchange the data?
Usage Scenario: Company Collaboration • Several companies want to collaborate • Need to share data • Each company has a different type of database system with a different schema • Solution: Agree on an XML schema for exchange. Import to and export from this schema
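The import/export idea can be sketched as follows. Everything here is hypothetical: the two in-house record layouts, the agreed `<customers>` format, and the function names are assumptions made for the example, written with Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Two hypothetical in-house layouts for the same kind of data.
COMPANY_A_ROWS = [{"cust_name": "Tom Sawyer", "phone_no": "555-0101"}]
COMPANY_B_ROWS = [("Finn, Huckleberry", "555-0102")]  # (name, telephone)

def to_exchange_xml(records):
    """Export (name, tel) pairs into the agreed <customers> document."""
    root = ET.Element("customers")
    for name, tel in records:
        customer = ET.SubElement(root, "customer")
        ET.SubElement(customer, "name").text = name
        ET.SubElement(customer, "tel").text = tel
    return ET.tostring(root, encoding="unicode")

def from_exchange_xml(xml_text):
    """Import the agreed document back into (name, tel) pairs."""
    root = ET.fromstring(xml_text)
    return [(c.findtext("name"), c.findtext("tel"))
            for c in root.findall("customer")]

# Each company writes one adapter between its own schema and the shared one.
export_a = to_exchange_xml((r["cust_name"], r["phone_no"]) for r in COMPANY_A_ROWS)
export_b = to_exchange_xml(COMPANY_B_ROWS)

combined = from_exchange_xml(export_a) + from_exchange_xml(export_b)
print(combined)
```

With n companies, each one needs a single adapter to the shared format instead of a separate converter for every partner.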
Web Site Development • Web sites develop over time • Important to separate style from data in order to allow changes to the site structure and appearance • CSS separates style from data only in a limited way – HTML will still have tables, lists, etc • Using XML, we can store data alone • Using XSL, this data can be translated into HTML • The data can be translated differently as the site develops
Write Once Use Everywhere [Figure: XML stock data is translated by three XSL stylesheets into WML (hand-held devices), HTML (web browser) and TEXT (Excel)]
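The sketch below illustrates the "write once, use everywhere" idea with plain Python functions standing in for XSL stylesheets. It is not XSLT: it only shows one XML source rendered into two targets. The stock data, element names and function names are assumptions for the example.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical stock data; element and attribute names are assumptions.
STOCK_XML = """
<stocks>
  <stock symbol="AAA"><price>10.50</price></stock>
  <stock symbol="BBB"><price>7.25</price></stock>
</stocks>
"""

def rows(xml_text):
    """Extract (symbol, price) pairs once; every renderer reuses them."""
    root = ET.fromstring(xml_text)
    return [(s.get("symbol"), s.findtext("price")) for s in root.findall("stock")]

def to_html(xml_text):
    """Render the data as an HTML table for a web browser."""
    cells = "".join(
        f"<tr><td>{escape(sym)}</td><td>{escape(price)}</td></tr>"
        for sym, price in rows(xml_text)
    )
    return f"<table>{cells}</table>"

def to_text(xml_text):
    """Render the data as tab-separated text for a spreadsheet."""
    return "\n".join(f"{sym}\t{price}" for sym, price in rows(xml_text))

print(to_html(STOCK_XML))
print(to_text(STOCK_XML))
```

Changing the site's appearance means changing a renderer (or, in the XSL setting, a stylesheet), while the stored data stays untouched.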
HTML • Used for publishing hypertext on the World-Wide Web • Designed to describe how a Web browser should arrange text, images and push-buttons on a page • Easy to learn, but does not convey structure • Fixed tag set
HTML Example • <HTML> • <HEAD><TITLE>Welcome to the DBI course</TITLE></HEAD> • <BODY> • <H1>Introduction</H1> • <IMG SRC="dragon.gif" WIDTH="200" HEIGHT="150"> • </BODY> • </HTML> [Slide annotations: opening tag, text (PCDATA), “bachelor” tag, attribute name, attribute value, closing tag]
XML Vs. HTML • XML and HTML are “brothers”. They are both special cases of SGML. • HTML has specific tag and attribute names. These are associated with a specific meaning • XML can have any tag and attribute name. These are not associated with any meaning • HTML is used to specify visual style • XML is used to specify meaning [Diagram: XML and HTML both derive from SGML]
Terminology • The segment of an XML document between an opening and a corresponding closing tag is called an element <person> <name>Bart Simpson</name> <tel>02 – 444 7777</tel> <tel>051 – 011 022</tel> <email>bart@tau.ac.il</email> </person> [Slide annotations on the example: “element”; “element, a sub-element of”; “not an element”]
XML Document is a Tree [Figure: tree with root person and children name (Bart Simpson), tel (02 – 444 7777), tel (051 – 011 022), email (bart@tau.ac.il)] • XML documents are abstractly modeled as trees, as reflected by their nesting • Sometimes, XML documents are graphs (by using IDs and IDREFs)
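The tree view can be made concrete with a short recursive walk. This is a sketch using Python's standard `xml.etree.ElementTree`; the function name `outline` is made up, and the phone numbers are typed with plain hyphens.

```python
import xml.etree.ElementTree as ET

PERSON_XML = """
<person>
  <name>Bart Simpson</name>
  <tel>02 - 444 7777</tel>
  <tel>051 - 011 022</tel>
  <email>bart@tau.ac.il</email>
</person>
"""

def outline(element, depth=0):
    """Return one indented line per element; leaves are shown with their text."""
    text = (element.text or "").strip()
    label = f"{element.tag}: {text}" if text else element.tag
    lines = ["  " * depth + label]
    for child in element:  # children are visited in document order
        lines.extend(outline(child, depth + 1))
    return lines

print("\n".join(outline(ET.fromstring(PERSON_XML))))
```

Each level of indentation in the output corresponds to one level of nesting in the document.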
Example XML Fragment <addresses> <person> <name> Donald Duck</name> <tel> 04-828-1345 </tel> <tel> 04-828-1374 </tel> <email> donald@cs.technion.ac.il </email> </person> <person> <name> Miki Mouse</name> <tel> 03-426-1142 </tel> </person> </addresses>
Another Example An element may contain a mixture of sub-elements and PCDATA <airline> <name>British Airways</name> <motto> World’s<dubious>favorite</dubious> airline </motto> </airline>
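Mixed content shows up in a parser as text that sits before, inside and after a sub-element. The sketch below assumes Python's standard `xml.etree.ElementTree`, which keeps the text after a child's closing tag in that child's `tail`. The motto is typed with a straight apostrophe and a space before `<dubious>` for readability.

```python
import xml.etree.ElementTree as ET

AIRLINE_XML = (
    "<airline><name>British Airways</name>"
    "<motto>World's <dubious>favorite</dubious> airline</motto></airline>"
)

motto = ET.fromstring(AIRLINE_XML).find("motto")
dubious = motto.find("dubious")

# ElementTree splits mixed content into three pieces:
before = motto.text      # PCDATA before the sub-element
inside = dubious.text    # PCDATA inside the sub-element
after = dubious.tail     # PCDATA after the sub-element's closing tag

# itertext() walks all of the pieces in document order.
full_text = "".join(motto.itertext())

print(repr(before), repr(inside), repr(after))
print(full_text)
```

A program that reads only `motto.text` would silently lose part of the sentence, which is one reason mixed content needs care.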
A Complete XML Document <?xml version="1.0" encoding="UTF-8" standalone="no"?> <!DOCTYPE addresses SYSTEM "http://www.addbook.com/addresses.dtd"> <addresses> <person> <name>Lisa Simpson</name> <tel> 02-828-1234 </tel> <tel> 054-470-777 </tel> <email> lisa@cs.huji.ac.il </email> </person> </addresses> [Slide annotations mark parts of the document as “Required” or “Optional”]
Attributes • An opening tag may contain attributes • These are typically used to describe the contents of an element <entry> <word language="en">cheese</word> <word language="fr">fromage</word> <word language="ro">branza</word> <meaning>A food made …</meaning> </entry>
When to Use Attributes • It’s not always clear when to use attributes. Compare: <person> <ssno>123 4589</ssno> <name> L. Simpson </name> <email> lisa@cs.huji.ac.il </email> ... </person> with <person ssno="123 4589"> <name> L. Simpson </name> <email> lisa@cs.huji.ac.il </email> ... </person>
When to Use Attributes • General Rule: Use an element if you need to nest data. Use an attribute for “IDs”, i.e., identifying data • More on this soon…
Rules for XML (1) • XML is order-sensitive, i.e., the following are different: <entry> <word language="en">cheese</word> <word language="fr">fromage</word> </entry> and <entry> <word language="fr">fromage</word> <word language="en">cheese</word> </entry> • XML is case-sensitive, i.e., the following are different: <person>, <Person>, <PERSON>
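Both rules are easy to observe with a parser. The sketch below assumes Python's standard `xml.etree.ElementTree`; the helper name `words` and the `<people>` wrapper element are made up for the example.

```python
import xml.etree.ElementTree as ET

EN_FIRST = '<entry><word language="en">cheese</word><word language="fr">fromage</word></entry>'
FR_FIRST = '<entry><word language="fr">fromage</word><word language="en">cheese</word></entry>'

def words(xml_text):
    """Return (language, word) pairs in document order."""
    return [(w.get("language"), w.text)
            for w in ET.fromstring(xml_text).findall("word")]

# Order sensitivity: same elements, different sequence, different document.
same_order = words(EN_FIRST) == words(FR_FIRST)
same_content = sorted(words(EN_FIRST)) == sorted(words(FR_FIRST))

# Case sensitivity: three distinct tag names, only one matches "person".
doc = ET.fromstring("<people><person/><Person/><PERSON/></people>")
tags = [child.tag for child in doc]
lowercase_matches = doc.findall("person")

print(same_order, same_content, tags, len(lowercase_matches))
```

The two entries hold the same words, yet a program reading them in order sees two different documents.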
Rules for XML (2) • Tags come in pairs: <date> ... </date> • They must be properly nested. Which of the following are good? • <date> ... <day> ... </day> ... </date> • <date> ... <day> ... </date> ... </day> • <date> ... <day> ... </day> </Date> • There is a special shortcut for tags that have no text in between them (bachelor tags): • <person fname="Sam" lname="Iam" /> • <person fname="Sam" lname="Iam"></person>
Rules for XML (3) • There should be exactly one top-level element. This element is also called the root element • Which of the following is legal? • <?xml version="1.0"?> <Question> Is this legal? </Question> • <?xml version="1.0"?> <Question> Is this legal? </Question> <Answer> You tell me. </Answer>
Well-Formed Documents • A document is well-formed if it • obeys all the above rules, and in addition • does not repeat an attribute within a tag, i.e., the following is illegal: <a val='12' val='13'> … </a>
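One way to check your answers to the questions on the last few slides is to hand the candidates to a parser. The sketch below assumes Python's standard `xml.etree.ElementTree`, whose parser rejects input that is not well-formed by raising `ParseError`. The helper name `is_well_formed` and the dictionary labels are made up, and the "..." placeholders from the slides are replaced by short text.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False on a ParseError."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Candidates modelled on the examples in the slides.
CANDIDATES = {
    "properly nested": "<date><day>1</day></date>",
    "crossed tags": "<date><day>1</date></day>",
    "case mismatch": "<date><day>1</day></Date>",
    "bachelor tag": '<person fname="Sam" lname="Iam"/>',
    "one root": "<Question>Is this legal?</Question>",
    "two roots": "<Question>Is this legal?</Question><Answer>You tell me.</Answer>",
    "repeated attribute": "<a val='12' val='13'>x</a>",
}

for label, text in CANDIDATES.items():
    print(f"{label}: {is_well_formed(text)}")
```

This checks well-formedness only. It says nothing about whether a document follows a particular DTD.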
Tables Versus XML • Can you easily represent the contents of a table in XML? • Example: Projects(title, budget, managedBy), Employees(name, age, ssn) • Can you easily represent the contents of an XML document in a table? • Example: Remember the phone book
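Both directions can be tried out in a few lines. The sketch below assumes Python's standard `xml.etree.ElementTree`; the project rows, the element names chosen for the table, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical rows for Projects(title, budget, managedBy).
COLUMNS = ("title", "budget", "managedBy")
PROJECTS = [("Apollo", "1000", "Lisa"), ("Zeus", "250", "Bart")]

def table_to_xml(root_tag, row_tag, columns, rows):
    """Table to XML: one element per row, one sub-element per column."""
    root = ET.Element(root_tag)
    for row in rows:
        row_el = ET.SubElement(root, row_tag)
        for column, value in zip(columns, row):
            ET.SubElement(row_el, column).text = value
    return ET.tostring(root, encoding="unicode")

# The phone book: repeated <tel> elements and an optional <email>.
ADDRESSES_XML = """<addresses>
  <person><name>Donald Duck</name><tel>04-828-1345</tel><tel>04-828-1374</tel>
          <email>donald@cs.technion.ac.il</email></person>
  <person><name>Miki Mouse</name><tel>03-426-1142</tel></person>
</addresses>"""

def phone_rows(xml_text):
    """XML to table: one (name, tel, email) row per <tel>; None if no email."""
    rows = []
    for person in ET.fromstring(xml_text).findall("person"):
        name = person.findtext("name")
        email = person.findtext("email")
        for tel in person.findall("tel"):
            rows.append((name, tel.text, email))
    return rows

print(table_to_xml("projects", "project", COLUMNS, PROJECTS))
for row in phone_rows(ADDRESSES_XML):
    print(row)
```

The first direction is mechanical. The second forces choices: Donald's name and email are repeated on two rows, and Miki's missing email becomes an empty cell, so a single flat table fits the phone book only awkwardly.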