Semi-Structured Data and XML Jacob (Jack) Gryn - Presented November 28, 2002
Agenda • Semi-Structured Data • XML
Semi-Structured Data: an Introduction • What is structured data? • What is non-structured data? • What is semi-structured data? • How is semi-structured data represented? • What can we do with semi-structured data?
What is Structured Data? • Strongly typed variables/attributes (e.g. int, float, string[20]) • Every attribute in a relation is defined for all records • Data is represented in some organized fashion
An Example of Structured Data A relational database can be considered structured data
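As a rough illustration of the structured case, the sketch below stores two people in a relational table using Python's built-in sqlite3 module. The table name, column names, and types are assumptions made for this example, not part of the original slides. Note that every row carries every column, with NULL (None in Python) standing in for unknown values. SQLite itself enforces column types only loosely, so treat the declared types as indicative.

```python
import sqlite3

# Hypothetical schema: names and types are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person ("
    " name TEXT PRIMARY KEY,"
    " birth_year INTEGER,"
    " birth_month TEXT,"
    " birth_day INTEGER,"
    " salary INTEGER)"
)

# Every record supplies a value for every attribute; unknowns become NULL.
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?, ?, ?)",
    [
        ("Bob", 1949, "August", 13, 52000),
        ("Bill", 1967, "April", None, None),
    ],
)

rows = conn.execute(
    "SELECT name, birth_year, birth_month, birth_day, salary "
    "FROM person ORDER BY name"
).fetchall()
for row in rows:
    print(row)
```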
What is Non-Structured Data? • Data that has no type definitions • Data is not organized according to any pattern • No concept of variables or attributes
An Example of Non-Structured Data “Bob was born sometime in August of 1949. He has a reasonable salary of 52000. Someone else was born on the 12th of a different month, his name is Bill. By the way, Bob was born on the 13th of August.” As you can see, it would be almost impossible for a computer to parse such data automatically.
Then what is Semi-Structured Data? Anything in between structured and non-structured data!
Then what is Semi-Structured Data? • Everything in between structured and non-structured data • Variables are loosely typed (x=1 is valid, and so is x="hello") • A record does not need to have all attributes defined (e.g. in a database of cars, if we don't know the engine type, we can choose not to define the field for that particular record, whereas in a structured database the attribute would be defined but set to NULL) • An attribute of a record could be another record • It does not necessarily have to differentiate between an identifier and a value
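The properties listed above can be sketched with plain Python dictionaries. The car records, field names, and the helper engine_type below are hypothetical and chosen only to mirror the car example: one record nests another record as an attribute, and the other simply omits the engine field instead of storing NULL.

```python
# Hypothetical car records, loosely following the car example above.
cars = [
    # An attribute ("engine") whose value is itself a record.
    {"make": "Honda", "year": 1998, "engine": {"type": "V6", "litres": 3.0}},
    # Loosely typed: "year" holds a string here, and "engine" is not defined at all.
    {"make": "Ford", "year": "unknown"},
]


def engine_type(car):
    """Return the engine type, or None when the record has no engine field."""
    engine = car.get("engine")
    return engine["type"] if engine else None


for car in cars:
    print(car["make"], engine_type(car))
```

A consumer of such data has to check for missing fields and unexpected types itself, which is the price of the looser structure.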
So how is semi-structured data represented? Semi-Structured data can be represented as a tree
So how is semi-structured data represented? Semi-Structured data can be represented in the form of indented text:
Bob
  Birthday
    1949
    August
    13
  Salary
    $52,000
Bill
  Birthday
    1967
    April
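To connect the indented-text form with the tree form, here is a minimal sketch that turns indented lines into a nested tree. The two-space indentation, the (label, children) node shape, and the function name parse_indented are assumptions made for this example; the slides do not specify a concrete layout.

```python
TEXT = """\
Bob
  Birthday
    1949
    August
    13
  Salary
    $52,000
Bill
  Birthday
    1967
    April
"""


def parse_indented(text, indent=2):
    """Build a nested (label, children) tree from indented lines."""
    root = ("root", [])
    stack = [(-1, root)]  # pairs of (depth, node); root sits above depth 0
    for line in text.splitlines():
        if not line.strip():
            continue
        depth = (len(line) - len(line.lstrip(" "))) // indent
        node = (line.strip(), [])
        # Pop back up until the top of the stack is this node's parent.
        while stack[-1][0] >= depth:
            stack.pop()
        stack[-1][1][1].append(node)
        stack.append((depth, node))
    return root


tree = parse_indented(TEXT)
people = [label for label, _ in tree[1]]
print(people)
```

Note that Bill's record has no Salary branch and no day of birth, which the tree tolerates without any placeholder.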
So how is semi-structured data represented? Semi-Structured data can be represented as a markup language (e.g. HTML, XML, LISP, AceDB, Tsimmis):
<employee id="3">
  <name>Bob</name>
  <extension>5513</extension>
  <department>Sales</department>
  <salary>45000</salary>
</employee>
<employee id="1">
  <name>Ed</name>
  <extension>6766</extension>
  <office>312</office>
  <department>Executive</department>
  <salary>Confidential</salary>
</employee>
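As a sketch of how such markup can be read by a program, the example below parses the employee records with Python's standard xml.etree.ElementTree module. The enclosing employees element is an assumption added here, since an XML document needs a single root element; the slide shows only the two employee fragments.

```python
import xml.etree.ElementTree as ET

# The <employees> wrapper is an assumption: XML requires one root element.
XML = """\
<employees>
  <employee id="3">
    <name>Bob</name>
    <extension>5513</extension>
    <department>Sales</department>
    <salary>45000</salary>
  </employee>
  <employee id="1">
    <name>Ed</name>
    <extension>6766</extension>
    <office>312</office>
    <department>Executive</department>
    <salary>Confidential</salary>
  </employee>
</employees>
"""

root = ET.fromstring(XML)
records = []
for employee in root.findall("employee"):
    record = {"id": employee.get("id")}
    # Each record keeps only the fields that are actually present.
    for child in employee:
        record[child.tag] = child.text
    records.append(record)

for record in records:
    print(record)
```

Bob's record has no office key, and Ed's salary is the string "Confidential" where Bob's holds digits, which is what makes the data semi-structured.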
Overview • Semi-Structured data is not necessarily created with the intention of being processed. • e.g. Web pages are not necessarily intended to be queried by a language like SQL; a web designer who does not take this into consideration may not make it easy for a machine to process the data.
What can we do with Semi-Structured Data? • Since there is some structure, it can be scanned and parsed • Once the data is parsed, we can query it using specialized query languages such as UnQL, GEXT and Lorel • We can “clean it up” to be placed into a structured relational database
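The "clean it up" step can be sketched as follows: parse the semi-structured records, then load them into a relational table, mapping missing or non-numeric values to NULL. The employees root element, the table schema, and the helper to_int are assumptions made for this illustration; how to treat a value like "Confidential" is a design decision, and dropping it to NULL is only one option.

```python
import sqlite3
import xml.etree.ElementTree as ET

# The <employees> root element is an assumption added for well-formedness.
XML = """\
<employees>
  <employee id="3"><name>Bob</name><extension>5513</extension>
    <department>Sales</department><salary>45000</salary></employee>
  <employee id="1"><name>Ed</name><extension>6766</extension><office>312</office>
    <department>Executive</department><salary>Confidential</salary></employee>
</employees>
"""


def to_int(text):
    """Return an int, or None when the text is missing or not numeric."""
    return int(text) if text is not None and text.isdigit() else None


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, extension INTEGER,"
    " office INTEGER, department TEXT, salary INTEGER)"
)
for emp in ET.fromstring(XML).findall("employee"):
    conn.execute(
        "INSERT INTO employee VALUES (?, ?, ?, ?, ?, ?)",
        (
            int(emp.get("id")),
            emp.findtext("name"),
            to_int(emp.findtext("extension")),
            to_int(emp.findtext("office")),  # missing field becomes NULL
            emp.findtext("department"),
            to_int(emp.findtext("salary")),  # "Confidential" becomes NULL
        ),
    )

rows = conn.execute(
    "SELECT id, name, office, salary FROM employee ORDER BY id"
).fetchall()
print(rows)
```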
XML: an Introduction • What is XML? • What does it offer to creators of DBs? • How can XML be used as a DB? • Representations of XML • Other features of XML • Disadvantages of XML
Summary / Key Points of Semi-Structured data • In between structured and non-structured data • Loosely typed attributes • Not all attributes need to be defined for every record • Can be parsed and queried
What is XML? • XML stands for eXtensible Markup Language • Based on tags similar to HTML • Actually, XHTML is a form of XML • Used to define markup languages
What does XML offer to database designers? • Readable by humans, since it is plain Unicode or ASCII text • Easy for computers to parse • Can easily be used as a ‘back-end’ for web sites
How can XML be used as a database? Consider the following employee data. It can be written in XML as follows:
<employee id="3">
  <name>Bob</name>
  <extension>5513</extension>
  <department>Sales</department>
  <salary>45000</salary>
</employee>
<employee id="1">
  <name>Ed</name>
  <extension>6766</extension>
  <office>312</office>
  <department>Executive</department>
  <salary>Confidential</salary>
</employee>
Notice that this is semi-structured data, since not all the fields are filled in and because the values are loosely typed.
In XML, there are few restrictions on how data can be laid out • The tag names can represent either attribute names or data itself • Tag names can be almost anything the creator wishes
But, there are still a few restrictions • Every tag that is opened must be closed, e.g. <name>Bob</name> • A separate close tag is not needed for an empty element, which can be written as <myelement/> • If one tag is opened inside another, it must be closed before the outer tag is closed: <employee><name>Bob</employee></name> is invalid, while <employee><name>Bob</name></employee> is valid • Tags are case sensitive
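These restrictions can be checked mechanically. The sketch below uses the parser in Python's standard xml.etree.ElementTree module, which raises ParseError on input that is not well-formed; the helper name is_well_formed is an assumption made for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


samples = [
    "<name>Bob</name>",                       # opened and closed
    "<myelement/>",                           # empty element, no separate close tag
    "<employee><name>Bob</name></employee>",  # properly nested
    "<employee><name>Bob</employee></name>",  # inner tag closed after outer tag
    "<name>Bob</Name>",                       # case mismatch between open and close
    "<name>Bob",                              # never closed
]
for sample in samples:
    print(is_well_formed(sample), sample)
```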
How can XML be represented? • As a tree structure • As text/markup tags
How can XML be represented? As a tree structure: Take our previous example: • Leaf nodes generally, but not necessarily, store the data • Recent web browsers will show this structure
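A minimal sketch of the tree view, assuming Python's standard xml.etree.ElementTree module: the hypothetical helper outline walks the parsed elements and emits one indented line per node, with the data appearing at the leaves.

```python
import xml.etree.ElementTree as ET


def outline(element, depth=0):
    """Return indented lines, one per element, showing any text it holds."""
    text = (element.text or "").strip()
    label = element.tag + (": " + text if text else "")
    lines = ["  " * depth + label]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


employee = ET.fromstring(
    '<employee id="3"><name>Bob</name><salary>45000</salary></employee>'
)
lines = outline(employee)
print("\n".join(lines))
```

For brevity this sketch ignores attributes such as id; a fuller version could list element.attrib alongside each tag.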
How can XML be represented? As a text/markup language: Take our previous example:
<employee id="3">
  <name>Bob</name>
  <extension>5513</extension>
  <department>Sales</department>
  <salary>45000</salary>
</employee>
<employee id="1">
  <name>Ed</name>
  <extension>6766</extension>
  <office>312</office>
  <department>Executive</department>
  <salary>Confidential</salary>
</employee>
Other features of XML • It is easy to parse • It can be queried like a database • It can be used with XSL templates to easily generate web pages from data • It can be used with a DTD (Document Type Definition) to run as a fully structured database
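To give a feel for "queried like a database", the sketch below uses the limited XPath subset supported by Python's standard xml.etree.ElementTree module. This is a stand-in chosen for illustration, not one of the dedicated query languages named earlier in the deck, and the employees root element is again an assumption.

```python
import xml.etree.ElementTree as ET

# The <employees> root element is an assumption added for well-formedness.
XML = """\
<employees>
  <employee id="3"><name>Bob</name><department>Sales</department></employee>
  <employee id="1"><name>Ed</name><office>312</office>
    <department>Executive</department></employee>
</employees>
"""
root = ET.fromstring(XML)

# Roughly: SELECT name FROM employee WHERE department = 'Sales'
sales = [e.findtext("name") for e in root.findall("employee[department='Sales']")]

# Roughly: SELECT name FROM employee WHERE office IS NOT NULL
with_office = [e.findtext("name") for e in root.findall("employee[office]")]

# Roughly: SELECT name FROM employee WHERE id = 1 (matched on the id attribute)
by_id = root.find("employee[@id='1']").findtext("name")

print(sales, with_office, by_id)
```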
Disadvantages of XML • Difficult to create indexes on • Difficult to optimize queries • Requires additional disk space (text format; redundant data in tags) • No single standard for how data should be stored in XML
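The disk-space point can be made concrete with a small, unscientific comparison: the sketch below serializes the same records as XML and as CSV and compares the byte counts. The records and the choice of CSV as the baseline are assumptions made for this example, and here id is written as a child element instead of an attribute to keep the code short.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical records with identical fields, so both formats hold the same data.
rows = [
    {"id": "3", "name": "Bob", "extension": "5513", "department": "Sales"},
    {"id": "1", "name": "Ed", "extension": "6766", "department": "Executive"},
]

# XML: every value is wrapped in an opening and a closing tag.
root = ET.Element("employees")
for row in rows:
    employee = ET.SubElement(root, "employee")
    for field, value in row.items():
        ET.SubElement(employee, field).text = value
xml_bytes = ET.tostring(root)

# CSV: field names appear once, in the header row.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
csv_bytes = buffer.getvalue().encode("utf-8")

overhead = len(xml_bytes) / len(csv_bytes)
print(len(xml_bytes), len(csv_bytes), round(overhead, 2))
```

The gap comes from repeating each field name twice per value; compression narrows it on disk, so the ratio printed here is only indicative.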
Summary / Key points of XML • Data stored using text-based markup language • Can also be represented in tree format • Can store structured and semi-structured data • Easy to parse and query, but inefficient
Where to Get More Information • Search the web, you’ll find something!