Introduction
• XML stands for eXtensible Markup Language.
• It is designed to transport and store data, not to display it.
• XML is similar to HTML, but its tags are not predefined.
• Tags are defined by users.
• XML is a W3C recommendation.
The main idea is to compress well-formed XML files that an application generates from database queries.
XML file structure
<!header…>                        (XML file header)
<Main Tag>                        (main XML element)
    <Row1 Tag>                    (XML element per query row)
        <Col1 Tag>Data</Col1 Tag> (XML element per query column)
        …
        <Coln Tag>Data</Coln Tag>
    </Row1 Tag>
    …
</Main Tag>
Algorithm
• The algorithm takes advantage of the well-defined structure of the XML files.
• It also exploits how often the same value repeats in a row's columns. This is the key idea of the algorithm!
• The compression strategy is similar to a static dictionary: XML tags and "DataKeys" are replaced by unused ASCII characters.
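The "unused ASCII characters" idea can be sketched as follows. This is only an illustration, not the original implementation: the function name `unused_ascii_chars` is hypothetical, and restricting the candidates to printable punctuation, digits and letters is an assumption made here so the result stays readable.

```python
import string


def unused_ascii_chars(text: str) -> list[str]:
    """Return candidate ASCII characters that never occur in `text`.

    Assumption: only printable punctuation, digits and letters are
    considered; the slides do not say which ASCII range is scanned.
    """
    candidates = string.punctuation + string.digits + string.ascii_letters
    present = set(text)
    return [c for c in candidates if c not in present]


sample = "<CATALOG><CD><TITLE>Thriller</TITLE></CD></CATALOG>"
free_chars = unused_ascii_chars(sample)
```

Any character returned here can safely stand in for a tag or a DataKey, because it cannot be confused with data already in the file.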
Description Compression Algorithm
• The file is processed in two (2) phases.
• Phase One identifies the XML tags, the available ASCII characters, and the DataKeys.
• DataKeys are sorted by the following score: Length(DataKey) * frequency – (Length(DataKey) + frequency). Any DataKey beyond the number of available characters is discarded.
Example:
Key len = 20, frequency = 10: 30 characters instead of 200, saving 170
Key len = 15, frequency = 10: 25 characters instead of 150, saving 125
Key len = 30, frequency = 5: 35 characters instead of 150, saving 115
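The ranking rule can be written out as a short sketch. The function names `datakey_score` and `select_datakeys` are hypothetical, and counting frequencies with a `Counter` is an assumption about how Phase One gathers them.

```python
from collections import Counter


def datakey_score(length: int, frequency: int) -> int:
    """Score from the slide: Length * frequency - (Length + frequency)."""
    return length * frequency - (length + frequency)


def select_datakeys(values: list[str], available: int) -> list[str]:
    """Rank distinct column values by score and keep at most `available`.

    `available` is the number of unused ASCII characters left for DataKeys
    (an assumed parameter; lower-ranked DataKeys are discarded).
    """
    counts = Counter(values)
    ranked = sorted(
        counts,
        key=lambda v: datakey_score(len(v), counts[v]),
        reverse=True,
    )
    return ranked[:available]


print(datakey_score(20, 10))  # 170
print(datakey_score(15, 10))  # 125
print(datakey_score(30, 5))   # 115
```

Note that a long key with a low frequency can rank below a shorter key that occurs more often, which is what the third example line shows.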
Description Compression Algorithm
• Phase Two reads the XML file again and creates a new file. The new file starts with a header, built from the information gathered in Phase One and used to reconstruct the XML file (its layout is shown later). In the rest of the file, Tags/DataKeys are replaced by the available ASCII characters.
Description Compression Algorithm
• Rules to replace Tags/DataKeys
    • Main Tag: skipped.
    • Row Tag: an ASCII char is assigned.
    • Column Tag: an ASCII char is assigned.
    • If the Column Data is a DataKey
        • If an ASCII char is assigned to it: write just the assigned ASCII char
        • Else: write the assigned Column char + Column Data
    • Else
        • Write the assigned Column char + Column Data
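A minimal sketch of these replacement rules for a single row follows. The function name, the data structures, and the "+" character used for the COUNTRY column are all hypothetical; joining the pieces with spaces only mirrors how the example slide prints the compressed rows and is not a statement about the real file layout.

```python
def encode_row(
    row: list[str],
    row_char: str,
    column_chars: list[str],
    datakey_chars: dict[tuple[int, str], str],
) -> str:
    """Encode one row following the replacement rules.

    row           -- column values, in column order
    row_char      -- ASCII char assigned to the Row Tag
    column_chars  -- ASCII char assigned to each Column Tag
    datakey_chars -- (column index, value) -> char, only for DataKeys
                     that actually received an ASCII char
    """
    parts = [row_char]
    for col, data in enumerate(row):
        key_char = datakey_chars.get((col, data))
        if key_char is not None:
            # DataKey with an assigned char: write only that char.
            parts.append(key_char)
        else:
            # Otherwise: assigned Column char followed by the data.
            parts.append(column_chars[col] + data)
    return " ".join(parts)


column_chars = ["&", "!", "+", "*", "$", "#"]  # "+" is a made-up choice
datakey_chars = {(2, "USA"): "%", (2, "UK"): "~", (2, "Colombia"): "^"}
row = ["Empire Burlesque", "Bob Dylan", "USA", "Columbia", "10.90", "1985"]
encoded = encode_row(row, "@", column_chars, datakey_chars)
print(encoded)  # @ &Empire Burlesque !Bob Dylan % *Columbia $10.90 #1985
```

Both "Else" branches of the rules produce the same output, so the sketch only needs one lookup: a value is shortened only when it is a DataKey that was given its own character.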
Description Decompression Algorithm
• Read the file header.
• The first four (4) characters mean:
    • Char 1: number of BitWise characters (the used ASCII chars).
    • Char 2: first used ASCII char.
    • Char 3: number of Element tags.
    • Char 4: number of DataKey sets.
• According to Char 4, read the Col/Num pairs.
• According to Char 1, read the BitWise characters.
• According to Char 3, read the Element strings.
• According to the total Num from the pairs, read the DataKeys (DK).
• Read the rest of the file, replacing each assigned ASCII char.
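The last step, replacing the assigned ASCII chars, might look like the sketch below. It assumes the header has already been parsed into plain Python mappings (the slides do not give a byte-level header layout, so that part is left out), and every name here is hypothetical. It also assumes data values carry no leading or trailing whitespace.

```python
def decode_body(
    body: str,
    main_tag: str,
    row_char: str,
    row_tag: str,
    column_tags: dict[str, str],
    datakeys: dict[str, tuple[str, str]],
) -> str:
    """Rebuild the XML elements from a compressed body.

    column_tags -- assigned char -> column tag name
    datakeys    -- assigned char -> (column tag name, original value)
    """
    assigned = {row_char} | set(column_tags) | set(datakeys)
    out = [f"<{main_tag}>"]  # the Main Tag was skipped, so restore it
    row_open = False
    i = 0
    while i < len(body):
        ch = body[i]
        i += 1
        if ch == row_char:
            if row_open:
                out.append(f"</{row_tag}>")
            out.append(f"<{row_tag}>")
            row_open = True
        elif ch in datakeys:
            tag, value = datakeys[ch]
            out.append(f"<{tag}>{value}</{tag}>")
        elif ch in column_tags:
            # Data runs until the next assigned char (or end of body).
            start = i
            while i < len(body) and body[i] not in assigned:
                i += 1
            data = body[start:i].strip()
            tag = column_tags[ch]
            out.append(f"<{tag}>{data}</{tag}>")
        # Any other char here is treated as a separator and skipped.
    if row_open:
        out.append(f"</{row_tag}>")
    out.append(f"</{main_tag}>")
    return "".join(out)


xml = decode_body(
    "@ &Thriller % #1985",
    main_tag="CATALOG",
    row_char="@",
    row_tag="CD",
    column_tags={"&": "TITLE", "#": "YEAR"},
    datakeys={"%": ("COUNTRY", "USA")},
)
```

Because the assigned characters never occur in the original data, a single left-to-right scan is enough to tell tags from data.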
Application Syntax
xmlzip [-c filename.xml [-k column_1 … column_n]] | [-d filename.xzp]
Where
-c: Compressing
-k: Column numbers to be used as DataKeys
-d: Decompressing
Note that the header length is proportional to the number of characters found in the XML file, the XML file's Elements, and the DataKeys found in the XML file:
\[
H = 4 + 2 \cdot \mathrm{DATAKEYNUM} + \mathrm{NUMBITWISE} + \sum_{i=1}^{\mathrm{NUMELEMENT}} \bigl[\mathrm{length}(\mathrm{ELEMENTSTR}_i) + 1\bigr] + \sum_{j=1}^{\mathrm{SUBDATAKEY}} \bigl[\mathrm{length}(\mathrm{DATAKEYSTR}_j) + 1\bigr] + 1
\]
In this case, the file HEADER length is:
H = 4 + 2 * 1 + 12 + 8 + 3 + 6 + 7 + 8 + 8 + 6 + 5 + 4 + 3 + 9 + 1 = 86
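The header-length formula can be checked with a few lines of code. This is a sketch only: the function and parameter names are hypothetical, and the inputs below are read off the CATALOG example (12 BitWise characters, one DataKey column, eight element names, three DataKey strings).

```python
def header_length(
    num_bitwise: int,
    datakey_num: int,
    element_names: list[str],
    datakey_strings: list[str],
) -> int:
    """Evaluate the header-length formula term by term."""
    return (
        4                                             # four leading characters
        + 2 * datakey_num                             # Col/Num pairs
        + num_bitwise                                 # used ASCII chars
        + sum(len(name) + 1 for name in element_names)
        + sum(len(key) + 1 for key in datakey_strings)
        + 1                                           # trailing constant term
    )


elements = ["CATALOG", "CD", "TITLE", "ARTIST",
            "COUNTRY", "COMPANY", "PRICE", "YEAR"]
keys = ["USA", "UK", "Colombia"]
h = header_length(12, 1, elements, keys)
print(h)
```

Each element name and each DataKey string contributes its length plus one, which matches the per-term values listed for the example.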
Original file:
<?xml version="1.0" encoding="ISO-8859-1"?>
<CATALOG>
    <CD>
        <TITLE>Empire Burlesque</TITLE>
        <ARTIST>Bob Dylan</ARTIST>
        <COUNTRY>USA</COUNTRY>
        <COMPANY>Columbia</COMPANY>
        <PRICE>10.90</PRICE>
        <YEAR>1985</YEAR>
    </CD>
    <CD>
        <TITLE>Hide your heart</TITLE>
        <ARTIST>Bonnie Tyler</ARTIST>
        <COUNTRY>UK</COUNTRY>
        <COMPANY>CBS Records</COMPANY>
        <PRICE>9.90</PRICE>
        <YEAR>1988</YEAR>
    </CD>
    <CD>
        <TITLE>Thriller</TITLE>
        <ARTIST>Michael Jackson</ARTIST>
        <COUNTRY>USA</COUNTRY>
        <COMPANY>Columbia</COMPANY>
        <PRICE>11.90</PRICE>
        <YEAR>1985</YEAR>
    </CD>
    <CD>
        <TITLE>Love Songs</TITLE>
        <ARTIST>Bee Gee</ARTIST>
        <COUNTRY>UK</COUNTRY>
        <COMPANY>Records</COMPANY>
        <PRICE>12.00</PRICE>
        <YEAR>1980</YEAR>
    </CD>
    <CD>
        <TITLE>Oral Fixation</TITLE>
        <ARTIST>Shaquira</ARTIST>
        <COUNTRY>Colombia</COUNTRY>
        <COMPANY>Epic</COMPANY>
        <PRICE>18.70</PRICE>
        <YEAR>2006</YEAR>
    </CD>
</CATALOG>

Compressed file:
HEADER
<?xml version="1.0" encoding="ISO-8859-1"?>
@ &Empire Burlesque !Bob Dylan % *Columbia $10.90 #1985
@ &Hide your heart !Bonnie Tyler ~ *CBS Records $9.90 #1988
@ &Thriller !Michael Jackson % *Columbia $11.90 #1985
@ &Love Songs !Bee Gee ~ *Records $12.00 #1980
@ &Oral Fixation !Shaquira ^ *Epic $18.70 #2006
Next
• The next step is to make the algorithm generic, in particular the feature that takes advantage of column value frequency.
• That feature could be driven by tag name instead of column number. I did not implement this for lack of time, but it would avoid any conflict caused by column order.
• XML Attribute recognition also needs to be implemented. It is almost done, but I stopped because of time constraints. It would be useful to let the user specify, through parameters, which Attributes should be taken into account. For example, an Element tag and an Attribute could share the same name even though they hold different types of data.
• Last but not least, complete the implementation of a modified PPM algorithm. The first task would be to add to the HEADER those DataKeys beyond the available ASCII chars that satisfy the condition Length(DataKey) > Largest Context and frequency > 1 (at least). They would then be added to a "temporary" count array, where the size of the DataKey does not matter.