
Reading Microsoft Word XML files with SAS August 25, 2005




  1. Reading Microsoft Word XML files with SAS. August 25, 2005. Larry Hoyle, Policy Research Institute, University of Kansas. Revised 8/18/2005.

  2. 3 scenarios • Extracting text along with associated properties (styles and attributes) • Extracting all data from tables • Extracting coordinates of objects in drawings

  3. XML - syntax • Must begin with this prolog tag • Paired tags, must have 1 root tag, case sensitive • Empty tags end with /> • Tags and content are called an "element" • Tags can be qualified by attributes • Elements can be nested, and must start and end in the same parent • Example: <?xml version="1.0" ?> <LarryRootTag> <EmptyTag/> <nestedTag> Some content </nestedTag> <nestedTag anAttribute="wha"> Other content </nestedTag> </LarryRootTag>
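
The syntax rules on this slide can be checked mechanically. The following is a minimal sketch in Python (not part of the original SAS material) that parses the slide's sample with the standard library; the helper `is_well_formed` and the `bad_case` string are hypothetical names introduced only for illustration.

```python
import xml.etree.ElementTree as ET

# The sample document from the slide, prolog first.
SAMPLE = """<?xml version="1.0" ?>
<LarryRootTag>
  <EmptyTag/>
  <nestedTag> Some content </nestedTag>
  <nestedTag anAttribute="wha"> Other content </nestedTag>
</LarryRootTag>"""

root = ET.fromstring(SAMPLE)

# One root tag; its children are the nested elements in document order.
child_tags = [child.tag for child in root]

# The second nestedTag is qualified by an attribute.
nested = root.findall("nestedTag")
attr = nested[1].get("anAttribute")


def is_well_formed(text):
    """Hypothetical helper: True if the parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Tags are case sensitive, so <Tag> cannot be closed by </tag>.
bad_case = "<Root><Tag>text</tag></Root>"

print(root.tag, child_tags, attr)
print(is_well_formed(SAMPLE), is_well_formed(bad_case))
```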

  4. Word XML

  5. Word XML • Body • Section • Paragraph • Run • Text • Properties

  6. Extracting text and properties • SAS XML Engine • Needs XMLMAP file • Can use XML Mapper to generate XMLMAP • Only needs to be generated once for each type of extract

  7. Example Document I have never been so humiliated in my life. That was very rude treatment. What a pleasant experience. Your staff was both quick and pleasant. It took about the time I expected to reach someone. I have nothing to say. The sky is blue and the sea is green. You are the worst organization in the world. I love you guys.

  8. XML - Example Document I have never been so humiliated in my life. That was very rude treatment. What a pleasant experience. Your staff was both quick and pleasant. It took about the time I expected to reach someone. I have nothing to say. The sky is blue and the sea is green. You are the worst organization in the world. I love you guys. • Paragraph property: /w:wordDocument/w:body/wx:sect/w:p/w:pPr • Run property: /w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr

  9. Rows • The XMLMap has to describe a path that delineates rows: • In this case it’s each text element in a run (in a paragraph…) <TABLE-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</TABLE-PATH>

  10. Columns – the text • The XMLMap has to describe a path that delineates each column: • The text itself is: <COLUMN name="t"> <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</PATH>
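
To make the row and column paths concrete, here is a rough Python sketch (an illustration outside SAS, not the author's method) that produces one row per w:t element from a tiny hand-written document. The namespace URIs are the Word 2003 ones as I understand them; treat them, and the two-paragraph sample, as assumptions to check against a real file.

```python
import xml.etree.ElementTree as ET

# Assumed Word 2003 namespace URIs; verify against your own document.
NS = {
    "w": "http://schemas.microsoft.com/office/word/2003/wordml",
    "wx": "http://schemas.microsoft.com/office/word/2003/auxHint",
}

# Hypothetical two-paragraph sample; the second paragraph has two runs.
DOC = """<w:wordDocument
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint">
  <w:body>
    <wx:sect>
      <w:p>
        <w:r><w:t>I have nothing to say.</w:t></w:r>
      </w:p>
      <w:p>
        <w:r><w:t>I love </w:t></w:r>
        <w:r><w:t>you guys.</w:t></w:r>
      </w:p>
    </wx:sect>
  </w:body>
</w:wordDocument>"""

root = ET.fromstring(DOC)

# root is already w:wordDocument, so the path is written relative to it.
# Each matching w:t becomes one row with a single column "t".
rows = [{"t": t.text} for t in root.findall("w:body/wx:sect/w:p/w:r/w:t", NS)]

for row in rows:
    print(row)
```

Because rows follow w:t rather than w:p, a paragraph split into several runs yields several rows, which is why the later slides add paragraph and text counters.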

  11. Columns – the text element number • A sequential number for the text element is: <COLUMN name="tNum" ordinal="YES" retain="YES"> <INCREMENT-PATH beginend="BEGIN" syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</INCREMENT-PATH>

  12. Columns – the paragraph number • A sequential number for the paragraph is: <COLUMN name="pNum" ordinal="YES" retain="YES"> <INCREMENT-PATH beginend="BEGIN" syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p</INCREMENT-PATH>

  13. Columns – paragraph color <COLUMN name="PColorVal" retain="YES"> <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:pPr/w:rPr/w:color/@val</PATH>

  14. Columns – run color <COLUMN name="RColorVal" retain="YES"> <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr/w:color/@val</PATH>
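
The counter and color columns from the last four slides can be emulated in a few lines. This Python sketch is only an approximation of what the XML engine does: it assumes that an ordinal column increments each time its path begins, and that retain="YES" keeps the last value seen until a new one arrives. The function `extract_rows`, the helper `qname`, and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed Word 2003 namespace URIs.
W = "http://schemas.microsoft.com/office/word/2003/wordml"
WX = "http://schemas.microsoft.com/office/word/2003/auxHint"
PREFIX = {W: "w", WX: "wx"}

BASE = "/w:wordDocument/w:body/wx:sect/w:p"
P_PATH = BASE                               # increments pNum
T_PATH = BASE + "/w:r/w:t"                  # increments tNum, emits a row
P_COLOR = BASE + "/w:pPr/w:rPr/w:color"     # paragraph color
R_COLOR = BASE + "/w:r/w:rPr/w:color"       # run color

DOC = f"""<w:wordDocument xmlns:w="{W}" xmlns:wx="{WX}">
  <w:body><wx:sect>
    <w:p>
      <w:pPr><w:rPr><w:color w:val="FF0000"/></w:rPr></w:pPr>
      <w:r><w:rPr><w:color w:val="FF0000"/></w:rPr>
           <w:t>That was very rude treatment.</w:t></w:r>
    </w:p>
    <w:p>
      <w:r><w:rPr><w:color w:val="0000FF"/></w:rPr><w:t>I love </w:t></w:r>
      <w:r><w:t>you guys.</w:t></w:r>
    </w:p>
  </wx:sect></w:body>
</w:wordDocument>"""


def qname(tag):
    """Turn '{uri}local' into 'prefix:local' so paths read like the XMLMap."""
    if not tag.startswith("{"):
        return tag
    uri, local = tag[1:].split("}")
    return f"{PREFIX.get(uri, uri)}:{local}"


def extract_rows(root):
    """Walk the tree in document order, updating retained state as we go."""
    state = {"pNum": 0, "tNum": 0, "PColorVal": None, "RColorVal": None}
    rows = []

    def walk(elem, parent_path):
        path = f"{parent_path}/{qname(elem.tag)}"
        if path == P_PATH:
            state["pNum"] += 1
        elif path == T_PATH:
            state["tNum"] += 1
            rows.append({"t": elem.text, **state})  # snapshot of the state
        elif path == P_COLOR:
            state["PColorVal"] = elem.get(f"{{{W}}}val")
        elif path == R_COLOR:
            state["RColorVal"] = elem.get(f"{{{W}}}val")
        for child in elem:
            walk(child, path)

    walk(root, "")
    return rows


rows = extract_rows(ET.fromstring(DOC))
for row in rows:
    print(row)
```

In this sketch the uncolored last run still reports the previous run's color, and the second paragraph still reports the first paragraph's color. If the engine retains values the same way, a missing property in the document shows up as a carried-over value in the dataset, which is worth checking before relying on the color columns.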

  15. Our dataset

  16. Tables

  17. All Tables Into One Dataset

  18. Tables – Word XML

  19. Tables - DataSet Rows <TABLE-PATH syntax="XPath"> /w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t </TABLE-PATH>

  20. Tables – Table Number <COLUMN name="tblNum" ordinal="YES" retain="YES"> <INCREMENT-PATH beginend="BEGIN" syntax="XPath"> /w:wordDocument/w:body/wx:sect/w:tbl </INCREMENT-PATH>

  21. Tables – Row Number <COLUMN name="trNum" ordinal="YES" retain="YES"> <INCREMENT-PATH beginend="BEGIN" syntax="XPath"> /w:wordDocument/w:body/wx:sect/w:tbl/w:tr </INCREMENT-PATH>
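
The table and row counters can be pictured with nested loops. The Python sketch below is illustrative only: the two-table sample is made up, the namespace URIs are the assumed Word 2003 ones, and the row counter is assumed to keep counting across tables rather than restarting at each new table.

```python
import xml.etree.ElementTree as ET

# Assumed Word 2003 namespace URIs.
W = "http://schemas.microsoft.com/office/word/2003/wordml"
WX = "http://schemas.microsoft.com/office/word/2003/auxHint"
NS = {"w": W, "wx": WX}


def cell(text):
    """Hypothetical helper: one table cell holding one run of text."""
    return f"<w:tc><w:p><w:r><w:t>{text}</w:t></w:r></w:p></w:tc>"


DOC = f"""<w:wordDocument xmlns:w="{W}" xmlns:wx="{WX}">
  <w:body><wx:sect>
    <w:tbl>
      <w:tr>{cell("a1")}{cell("b1")}</w:tr>
      <w:tr>{cell("a2")}{cell("b2")}</w:tr>
    </w:tbl>
    <w:tbl>
      <w:tr>{cell("x1")}</w:tr>
    </w:tbl>
  </wx:sect></w:body>
</w:wordDocument>"""

root = ET.fromstring(DOC)

rows = []
tbl_num = 0
tr_num = 0  # assumption: not reset when a new table starts
for tbl in root.findall("w:body/wx:sect/w:tbl", NS):
    tbl_num += 1
    for tr in tbl.findall("w:tr", NS):
        tr_num += 1
        # One dataset row per text element inside the table row.
        for t in tr.findall("w:tc/w:p/w:r/w:t", NS):
            rows.append({"tblNum": tbl_num, "trNum": tr_num, "t": t.text})

for row in rows:
    print(row)
```

All tables land in one list, and tblNum is what lets them be split apart again afterwards.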

  22. We Could Add Properties if Needed

  23. Nested tables

  24. Nested Tables – Absolute Path for Rows <TABLE-PATH syntax="XPath"> /w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t </TABLE-PATH>

  25. Nested Tables – Rootless Path for Rows <TABLE-PATH syntax="XPath"> w:tbl/w:tr/w:tc/w:p/w:r/w:t </TABLE-PATH>
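
The difference between the two paths can be sketched in Python, under the assumption that a rootless path matches the same element sequence at any depth (approximated here with ElementTree's `.//` descendant search). The nested sample and the namespace URIs are assumptions, not taken from the slides.

```python
import xml.etree.ElementTree as ET

# Assumed Word 2003 namespace URIs.
W = "http://schemas.microsoft.com/office/word/2003/wordml"
WX = "http://schemas.microsoft.com/office/word/2003/auxHint"
NS = {"w": W, "wx": WX}

# An outer table whose second cell contains a nested table.
DOC = f"""<w:wordDocument xmlns:w="{W}" xmlns:wx="{WX}">
  <w:body><wx:sect>
    <w:tbl>
      <w:tr>
        <w:tc><w:p><w:r><w:t>Outer A</w:t></w:r></w:p></w:tc>
        <w:tc>
          <w:tbl>
            <w:tr>
              <w:tc><w:p><w:r><w:t>Inner 1</w:t></w:r></w:p></w:tc>
            </w:tr>
          </w:tbl>
          <w:p><w:r><w:t>Outer B</w:t></w:r></w:p>
        </w:tc>
      </w:tr>
    </w:tbl>
  </wx:sect></w:body>
</w:wordDocument>"""

root = ET.fromstring(DOC)

# Absolute path: only text whose ancestors match exactly from the root.
absolute = [
    t.text
    for t in root.findall("w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t", NS)
]

# Descendant search: the same tail of the path, starting at any depth.
rootless = [t.text for t in root.findall(".//w:tbl/w:tr/w:tc/w:p/w:r/w:t", NS)]

print("absolute:", absolute)
print("rootless:", sorted(rootless))
```

The absolute path misses the text inside the nested table because that text sits under a second w:tbl/w:tr/w:tc level, while the depth-independent search picks up both levels.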

  26. Drawing Objects: VML – Vector Markup Language • Drawings in Word get stored as XML also • We’ll just look at lines

  27. VML – Vector Markup Language

  28. Dataset – One Row for Each Line <TABLE-PATH syntax="XPath"> /w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line </TABLE-PATH>

  29. Dataset – Column: From <COLUMN name="from"> <PATH syntax="XPath"> /w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@from </PATH>

  30. Dataset – Column: To <COLUMN name="to"> <PATH syntax="XPath"> /w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@to </PATH>

  31. Dataset – Column: StrokeColor <COLUMN name="strokecolor"> <PATH syntax="XPath"> /w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@strokecolor </PATH>
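
As a rough picture of the one-row-per-line dataset, the Python sketch below reads the from, to, and strokecolor attributes of each v:line in a small made-up drawing. The VML namespace URI (`urn:schemas-microsoft-com:vml`) and the Word 2003 URIs are stated from memory, and the sample coordinates are invented.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs for Word 2003 XML and VML.
W = "http://schemas.microsoft.com/office/word/2003/wordml"
WX = "http://schemas.microsoft.com/office/word/2003/auxHint"
V = "urn:schemas-microsoft-com:vml"
NS = {"w": W, "wx": WX, "v": V}

DOC = f"""<w:wordDocument xmlns:w="{W}" xmlns:wx="{WX}" xmlns:v="{V}">
  <w:body><wx:sect>
    <w:p><w:r><w:pict>
      <v:group>
        <v:line from="0,0" to="100,50" strokecolor="red"/>
        <v:line style="flip:y" from="100,50" to="200,0" strokecolor="blue"/>
      </v:group>
    </w:pict></w:r></w:p>
  </wx:sect></w:body>
</w:wordDocument>"""

root = ET.fromstring(DOC)

# One row per v:line; the attributes carry no namespace prefix.
LINE_PATH = "w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line"
lines = [
    {
        "from": ln.get("from"),
        "to": ln.get("to"),
        "strokecolor": ln.get("strokecolor"),
        "style": ln.get("style", ""),
    }
    for ln in root.findall(LINE_PATH, NS)
]

for line in lines:
    print(line)
```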

  32. The Dataset • Trick: "Flip" indicates that the coordinates are swapped

  33. Usage Example: Annotate dataset if prxmatch(xyPattern, from) then do; function='move'; x = input(prxposn(xyPattern, 1, from), 10.); if prxmatch('/flip:y/', style) then y = -1 * input(prxposn(xyPattern, 2, to), 10.); else y = -1 * input(prxposn(xyPattern, 2, from), 10.); output; end;
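
The same coordinate logic can be sketched in Python for readers who want to test it outside SAS. The slide does not show `xyPattern`, so the regular expression below is a guess that assumes plain "x,y" numbers without units; `XY_PATTERN` and `move_point` are hypothetical names. Unlike the SAS fragment, this version matches `from` and `to` separately.

```python
import re

# Assumed coordinate format: two plain numbers separated by a comma.
XY_PATTERN = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*$")


def move_point(line):
    """Return the annotate 'move' point for one v:line row, or None."""
    m_from = XY_PATTERN.match(line["from"])
    m_to = XY_PATTERN.match(line["to"])
    if not (m_from and m_to):
        return None
    x = float(m_from.group(1))
    # With flip:y in the style, the starting y comes from the "to" attribute.
    y_source = m_to if "flip:y" in line.get("style", "") else m_from
    # Negate y, as the slide does, so the plot is not drawn upside down.
    y = -1 * float(y_source.group(2))
    return {"function": "move", "x": x, "y": y}


plain = move_point({"from": "100,200", "to": "300,400", "style": ""})
flipped = move_point(
    {"from": "100,200", "to": "300,400", "style": "position:absolute;flip:y"}
)
invalid = move_point({"from": "n/a", "to": "300,400", "style": ""})

print(plain)
print(flipped)
print(invalid)
```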

  34. Plotted in SAS

  35. Contact Information Larry Hoyle Policy Research Institute, University of Kansas LarryHoyle@ku.edu http://www.ku.edu/pri/ksdata/sashttp/sugi31
