Using XML Parsers and Unicode. Ellen Pearlman Eileen Mullin Programming the Web Using XML. Learning Objectives. Understanding what an XML parser does Working with the basic Microsoft parser Differentiating between valid documents in different parsers and the way they define error statements
Using XML Parsers and Unicode Ellen Pearlman Eileen Mullin Programming the Web Using XML
Learning Objectives • Understanding what an XML parser does • Working with the basic Microsoft parser • Differentiating between valid documents in different parsers and the way they define error statements • Learning about Unicode and UTF-8, UTF-16 and UTF-32 • Investigating different character sets and typefaces for Unicode
Introduction • A parser is a grammar and syntax checker for markup and other programming languages. • A validating parser compares an XML document against the grammar in its DTD. This process, called validation, ensures there are no mistakes that could confuse the XML applications that access your content. • If a document follows the rules listed in its DTD, it is said to be valid. If the document has markup errors that contradict the rules of the DTD, it is labeled invalid.
What is Unicode? • The Unicode Consortium was founded to foster a character encoding that encompasses all major scripts in the world. • At the time of writing, Unicode defines a little under 50,000 characters, all within the 65,536 code points that 16 bits can address. A large share of the encoded characters are Han ideographs. • More scripts are on the way, so the Unicode code space extends beyond 16 bits, up to U+10FFFF (1,114,112 code points); UTF-32 stores each character in 32 bits. • Because XML uses Unicode as its character set, a single XML document can mix text from any script that Unicode encodes.
Parsers • When XML parsers first became commonly used, they consisted of basic text editors like Microsoft's Notepad and WordPad and Apple's SimpleText, and not much else. These basic text editors could not support Unicode. • Now parsers are divided into three categories: basic text editors, graphical text editors, and integrated development environments.
How Parsers Work • In general, a parser looks for certain markers, such as the beginning of an XML declaration (<?xml version=), parentheses (), the percent sign %, and so on. • Just as we look for a period (.) to end a sentence, a parser looks for pre-established XML grammatical conventions to know that a statement is correctly formed.
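The idea of a parser checking grammatical conventions can be sketched with the parser in Python's standard library. This is only an illustration, not the tool the slides use: the helper name `is_well_formed` and the two sample strings are made up for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as well-formed XML, else False."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        # The parser met markup that breaks XML's grammar.
        return False
    return True


# Hypothetical sample documents, invented for this sketch.
good = '<?xml version="1.0"?><note><to>Reader</to></note>'
bad = '<?xml version="1.0"?><note><to>Reader</note>'  # <to> is never closed

print(is_well_formed(good))
print(is_well_formed(bad))
```

The second string fails because the end tag `</note>` arrives while `<to>` is still open, which is the XML equivalent of a sentence with no period.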
Differences Between an XML Parser and an HTML Parser • With HTML, a pre-set standard already tells the Web browser how to render the information visually. • An XML editor or parser has no predetermined definition of your documents' element and attribute names. It knows only the basic rules that separate valid markup from invalid markup, and otherwise sees your content as pure character strings.
The Basic Microsoft Parser • MSXML, Microsoft's basic XML parser, is a good, free parser that is built into the Internet Explorer browser. • MSXML itself is a parser component rather than an editor. Internet Explorer uses it to display an XML file as a collapsible tree, so there are no implied statements and everything is displayed on the screen.
Creating Your Own Valid Document: validatortest.xml document
A Word About Errors • Most parsers sort problems in XML into two classes: errors and fatal errors. • A basic error is a violation of the rules in whatever specification the code is being checked against (e.g., XSLT or plain XML). The parser points out the error and may continue processing. • A fatal error stops the parser from processing the document normally. A violation of XML's well-formedness rules is a fatal error.
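A fatal error can be observed with Python's standard-library parser, which stops at the first well-formedness violation and reports where it occurred. This is a minimal sketch: the helper `first_fatal_error` is a hypothetical name, and because this parser is non-validating it only ever reports fatal errors, not validity errors.

```python
import xml.etree.ElementTree as ET


def first_fatal_error(text: str):
    """Return (line, column, message) for the first fatal error, or None."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        # ParseError.position is a (line, column) tuple supplied by the parser.
        line, column = err.position
        return line, column, str(err)
    return None


print(first_fatal_error("<a><b>text</b></a>"))  # well-formed input
print(first_fatal_error("<a><b>text</a>"))      # <b> is never closed
```

Only the first problem is reported, because the parser does not continue past a fatal error.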
Using XML Spy • XML Spy can be thought of as an IDE because it has not only a text and code editor, but also a compiler, a debugger, and an intuitive GUI. • With XML Spy a developer could actually build a sophisticated project. • There are two basic views: the Text view, which resembles any text editor, and the Enhanced Grid View, which shows more of the schema of the document.
Initial Code Listing: validatortest.xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- This is good to use as a test -->
<!DOCTYPE scribble [
  <!ELEMENT scribble (first, second, third, fourth)>
  <!ELEMENT first (#PCDATA)>
  <!ELEMENT second (#PCDATA)>
  <!ELEMENT third (#PCDATA)>
  <!ELEMENT forth (#PCDATA)>
]>
<scribble>
  <first>Our first line</first>
  <second>Our second line</second>
  <third>Our third line</third>
  <fourth>Our fourth line</fourth>
</scribble>
Corrected Version: validatortest.xml
<!-- This is good to use as a test -->
<!DOCTYPE scribble [
  <!ELEMENT scribble (first, second, third, fourth)>
  <!ELEMENT first (#PCDATA)>
  <!ELEMENT second (#PCDATA)>
  <!ELEMENT third (#PCDATA)>
  <!ELEMENT fourth (#PCDATA)>
]>
<scribble>
  <first>Our first line</first>
  <second>Our second line</second>
  <third>Our third line</third>
  <fourth>Our fourth line</fourth>
</scribble>
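The difference between the two listings is a single misspelled declaration (`forth` instead of `fourth`). Python's standard library has no validating parser, so the sketch below does not perform full DTD validation. It only uses expat's declaration callbacks to compare the element names declared in the internal DTD with the names actually used. The function name `undeclared_elements` is hypothetical.

```python
import xml.parsers.expat


def undeclared_elements(document: str) -> set:
    """Return element names used in `document` that its internal DTD never declares."""
    declared, used = set(), set()
    parser = xml.parsers.expat.ParserCreate()
    # Called once for each <!ELEMENT ...> declaration in the DTD.
    parser.ElementDeclHandler = lambda name, model: declared.add(name)
    # Called once for each start tag in the document body.
    parser.StartElementHandler = lambda name, attrs: used.add(name)
    parser.Parse(document, True)
    return used - declared


BODY = """<scribble>
  <first>Our first line</first>
  <second>Our second line</second>
  <third>Our third line</third>
  <fourth>Our fourth line</fourth>
</scribble>"""

DTD_TEMPLATE = """<!DOCTYPE scribble [
  <!ELEMENT scribble (first, second, third, fourth)>
  <!ELEMENT first (#PCDATA)>
  <!ELEMENT second (#PCDATA)>
  <!ELEMENT third (#PCDATA)>
  <!ELEMENT {last} (#PCDATA)>
]>
"""

initial = DTD_TEMPLATE.format(last="forth") + BODY     # misspelled declaration
corrected = DTD_TEMPLATE.format(last="fourth") + BODY  # fixed declaration

print(undeclared_elements(initial))
print(undeclared_elements(corrected))
```

A real validating parser would also check element order and content models; this sketch only catches the kind of naming mismatch shown in the initial listing.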
Other XML Editors: Viewing validatortest.xml in XML Edit Pro
The Development of a Global Standard: Introducing ASCII • ASCII is a subset of other character sets that contain 256 characters. • ASCII is a 7-bit coding system with a limited range of 128 characters. To increase that range, an 8-bit coding system was developed, Latin-1 (ISO 8859-1), which codes 256 characters. • It became the character set of choice for the Internet, e-mail, Gopher, and FTP sites. However, it did not cover the characters of languages that are not based on the Latin alphabet.
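The limits of the 7-bit and 8-bit character sets are easy to see by trying to encode a few strings. In this sketch the helper `can_encode` and the sample words are illustrative choices, not part of the original material.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


samples = {
    "plain": "cafe",                   # only unaccented Latin letters
    "accented": "caf\u00e9",           # e with acute accent, U+00E9
    "cyrillic": "\u043c\u0438\u0440",  # a Russian word in Cyrillic script
}

for label, text in samples.items():
    print(label, can_encode(text, "ascii"), can_encode(text, "latin-1"))
```

ASCII handles only the first sample, Latin-1 adds the accented letter, and neither can represent the Cyrillic word, which is the gap Unicode was designed to close.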
The Development of a Global Standard: Unicode • To expand the range of permissible characters, ISO 10646 was developed in 1983. It used 32 bits per character and could code about 4 billion different characters. However, four bytes per character made text too large and consumed too much bandwidth. • Unicode, begun in 1987 by engineers at Xerox and Apple and maintained since 1991 by the Unicode Consortium, halved the code width to 16 bits, making it a workable solution because it could handle more characters using less bandwidth.
The Adoption of Unicode • Unicode provides a unique number for every character, no matter what platform, program, or language it is viewed on. • Every major vendor, standards body, operating system, and browser, along with a host of other products, has adopted the standard. • Another standard, ISO/IEC 10646-1:1993, is also used on the Web. For all practical purposes, Unicode can be treated as a subset of that ISO standard, since the two assign the same code points to the same characters.
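The "unique number" assigned to each character is its code point, conventionally written as `U+` followed by hexadecimal digits. A minimal sketch, with a hypothetical helper `code_point` and arbitrarily chosen sample characters:

```python
def code_point(ch: str) -> str:
    """Format the Unicode code point of a single character as U+XXXX."""
    return f"U+{ord(ch):04X}"


# Sample characters: Latin A, n with tilde, and a Han ideograph.
for ch in ("A", "\u00f1", "\u4e2d"):
    print(ch, code_point(ch))
```

The same number identifies the character on every platform, whatever encoding is later used to store it.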
Unicode-Enabled Operating Systems • Below is a list of operating systems that are Unicode-enabled: • Apple Mac OS 9.2, Mac OS X 10.1, Mac OS X Server, ATSUI • Bell Labs Plan 9 • Compaq's Tru64 UNIX, OpenVMS • GNU/Linux with glibc 2.2.2 or newer • IBM AIX, AS/400, OS/2 • Inferno by Vita Nuova • Java • Microsoft Windows CE, Windows NT, Windows 2000, and Windows XP • SCO UnixWare 7.1.0 • Sun Solaris • Symbian Platform
The xml:lang Attribute • One of the most important attributes used in combination with XML and Unicode is the xml:lang attribute. Its value is a language code. • This attribute identifies the natural language in which an element's content is written, so that XML software can process that content appropriately (for example, for spell checking, searching, or font selection). • An example, coded in an XML statement: <spanishtext xml:lang="es">Hola amigo</spanishtext>
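Reading the language code back out of a document might look like the sketch below. It uses Python's standard-library parser, which exposes the reserved `xml:` prefix under the namespace `http://www.w3.org/XML/1998/namespace`. The attribute value is written in quotes, as XML requires for every attribute value; the constant name `XML_LANG` is just a convenience for this example.

```python
import xml.etree.ElementTree as ET

# ElementTree spells xml:lang with the full XML namespace in braces.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

doc = '<spanishtext xml:lang="es">Hola amigo</spanishtext>'
element = ET.fromstring(doc)
language = element.get(XML_LANG)

print(language, element.text)
```

An application could use the returned code to choose a spell checker, a hyphenation dictionary, or a font for the element's text.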
Pull-down Menu Structure in XML Spy to Add Elements and Attributes
UTF-8 and Beyond • UTF, which stands for Unicode Transformation Format (also expanded as UCS Transformation Format), allows Unicode to be broken into 8-, 16-, or 32-bit units for use in e-mail and on the Internet. An XML document names its encoding in the XML declaration: <?xml version="1.0" encoding="UTF-8"?> • Unicode encodes text by the script used (e.g., Latin, Cyrillic), not by the language, an important distinction that avoids unnecessary duplication of letters.
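How the three transformation formats differ in practice can be seen by counting the bytes each one needs for the same character. This is a small sketch: the helper `encoded_sizes` is hypothetical, and the big-endian codec variants are used only so that no byte-order mark is added to the counts.

```python
def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes `ch` occupies in UTF-8, UTF-16, and UTF-32."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")}


samples = {
    "Latin A": "A",                  # U+0041
    "e acute": "\u00e9",             # U+00E9
    "euro sign": "\u20ac",           # U+20AC
    "musical G clef": "\U0001D11E",  # U+1D11E, beyond the 16-bit range
}

for label, ch in samples.items():
    print(label, encoded_sizes(ch))
```

UTF-8 grows from one to four bytes as code points get larger, UTF-16 uses two bytes or a four-byte surrogate pair, and UTF-32 always uses four bytes.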
Character Sets and Typeface • Character sets do not refer to display formats, colors or typefaces. Unicode characters become visible to the user through a special rendering process that maps characters into glyphs. • A glyph is the specific shape of a given character as it is displayed. The character "A" is an abstract letter; its glyph might be a plain "A", an italic "A", or an ornate script "A", depending on the typeface. • Many things affect this rendering process, such as the operating system, language settings, keyboard and display software, word processing software, the type rasterizer, and input and output hardware.
Character Sets and Typeface (2) • In ASCII there is effectively a one-to-one correspondence between a character and its glyph. ASCII text is usually rendered as basic, unadorned text, which to most of us resembles plain Courier. • This is not true for Unicode, whose text may be rendered in elaborate scripts in which one character can take several glyph shapes. Different standards bodies have been set up to make sure languages and scripts coordinate.