Reading Microsoft Word XML files with SAS
August 25, 2005
Larry Hoyle, Policy Research Institute, University of Kansas
Revised 8/18/2005
3 scenarios
• Extracting text along with associated properties (styles and attributes)
• Extracting all data from tables
• Extracting coordinates of objects in drawings
XML - syntax
• A document begins with the prolog (XML declaration): <?xml version="1.0" ?>
• Tags are paired and case sensitive; there must be exactly 1 root tag
• Empty tags end with />
• A pair of tags plus its content is called an "element"
• Tags can be qualified by attributes
• Elements can be nested, but each must start and end in the same parent
<?xml version="1.0" ?>
<LarryRootTag>
  <EmptyTag/>
  <nestedTag> Some content </nestedTag >
  <nestedTag anAttribute="wha"> Other content </nestedTag >
</LarryRootTag>
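The syntax rules above can be checked with any XML parser. The following is a minimal sketch in Python, which is not part of the talk and is used here only as a convenient stand-in. It parses the slide's sample document and shows that mismatched case in a tag pair is rejected. The helper names `describe` and `is_well_formed` are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# The sample document from the slide.
SAMPLE = """<?xml version="1.0" ?>
<LarryRootTag>
  <EmptyTag/>
  <nestedTag> Some content </nestedTag >
  <nestedTag anAttribute="wha"> Other content </nestedTag >
</LarryRootTag>"""


def describe(xml_text):
    """Return (tag, attributes, stripped text) for each child of the root."""
    root = ET.fromstring(xml_text)
    return [(child.tag, dict(child.attrib), (child.text or "").strip())
            for child in root]


def is_well_formed(xml_text):
    """True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


for row in describe(SAMPLE):
    print(row)
```

Calling `is_well_formed("<a><B></b></a>")` returns False, because `<B>` and `</b>` differ in case and so do not form a pair.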
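Word XML
• Body (w:body) contains Sections (wx:sect)
• Sections contain Paragraphs (w:p)
• Paragraphs contain Runs (w:r)
• Runs contain Text (w:t)
• Paragraphs and Runs can carry Properties (w:pPr, w:rPr)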
Extracting text and properties
• SAS XML Engine
• Needs an XMLMAP file
• Can use XML Mapper to generate the XMLMAP
• The XMLMAP only needs to be generated once for each type of extract
Example Document
I have never been so humiliated in my life.
That was very rude treatment.
What a pleasant experience.
Your staff was both quick and pleasant.
It took about the time I expected to reach someone.
I have nothing to say.
The sky is blue and the sea is green.
You are the worst organization in the world.
I love you guys.
XML - Example Document
• Paragraph property: /w:wordDocument/w:body/wx:sect/w:p/w:pPr
• Run property: /w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr
Rows
• The XMLMap has to describe a path that delineates rows
• In this case it is each text element in a run (in a paragraph…)
<TABLE-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</TABLE-PATH>
Columns – the text
• The XMLMap has to describe a path that delineates each column
• The text itself is:
<COLUMN name="t">
  <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</PATH>
Columns – the text element number
• A sequential number for the text element is:
<COLUMN name="tNum" ordinal="YES" retain="YES">
  <INCREMENT-PATH beginend="BEGIN" syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</INCREMENT-PATH>
Columns – the paragraph number
• A sequential number for the paragraph is:
<COLUMN name="pNum" ordinal="YES" retain="YES">
  <INCREMENT-PATH beginend="BEGIN" syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p</INCREMENT-PATH>
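To see what the row path and the two counter columns select, here is a small Python sketch (not SAS, and not the author's code) that walks a tiny hand-made Word 2003 XML document and builds the same three columns: `t`, `tNum` and `pNum`. The sample document and the function name `text_rows` are assumptions for illustration. Python's ElementTree does not accept absolute paths, so the paths are written relative to the `w:wordDocument` root.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by Word 2003 XML documents.
NS = {
    "w": "http://schemas.microsoft.com/office/word/2003/wordml",
    "wx": "http://schemas.microsoft.com/office/word/2003/auxHint",
}

# Hypothetical two-paragraph document; the first paragraph has two runs.
SAMPLE = """<w:wordDocument
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint">
  <w:body><wx:sect>
    <w:p><w:r><w:t>What a pleasant </w:t></w:r><w:r><w:t>experience.</w:t></w:r></w:p>
    <w:p><w:r><w:t>I have nothing to say.</w:t></w:r></w:p>
  </wx:sect></w:body>
</w:wordDocument>"""


def text_rows(xml_text):
    """One row per w:t element, with running text and paragraph counters."""
    root = ET.fromstring(xml_text)
    rows = []
    t_num = 0
    # pNum goes up at the start of every paragraph.
    for p_num, p in enumerate(root.findall("w:body/wx:sect/w:p", NS), start=1):
        # tNum goes up at the start of every text element.
        for t in p.findall("w:r/w:t", NS):
            t_num += 1
            rows.append({"pNum": p_num, "tNum": t_num, "t": t.text})
    return rows


for row in text_rows(SAMPLE):
    print(row)
```

The first paragraph yields two rows that share the same `pNum`, which is what lets the text pieces of one paragraph be regrouped later.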
Columns – paragraph color
<COLUMN name="PColorVal" retain="YES">
  <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:pPr/w:rPr/w:color/@val</PATH>
Columns – run color
<COLUMN name="RColorVal" retain="YES">
  <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr/w:color/@val</PATH>
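The effect of retain="YES" on the two color columns can be imitated in a few lines of Python. This is a sketch under stated assumptions, not the SAS engine's actual algorithm: it assumes a retained value is carried forward from row to row until a new value is read, and it assumes the color value is stored in the namespaced `w:val` attribute of the hand-made sample below. The function name `color_rows` is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "w": "http://schemas.microsoft.com/office/word/2003/wordml",
    "wx": "http://schemas.microsoft.com/office/word/2003/auxHint",
}
# ElementTree spells a namespaced attribute as {uri}name.
VAL = "{%s}val" % NS["w"]

# Hypothetical document: red paragraph, unstyled paragraph, green run.
SAMPLE = """<w:wordDocument
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint">
  <w:body><wx:sect>
    <w:p><w:pPr><w:rPr><w:color w:val="FF0000"/></w:rPr></w:pPr>
      <w:r><w:rPr><w:color w:val="FF0000"/></w:rPr><w:t>That was very rude treatment.</w:t></w:r></w:p>
    <w:p><w:r><w:t>I have nothing to say.</w:t></w:r></w:p>
    <w:p><w:r><w:rPr><w:color w:val="008000"/></w:rPr><w:t>I love you guys.</w:t></w:r></w:p>
  </wx:sect></w:body>
</w:wordDocument>"""


def color_rows(xml_text):
    """Return (text, PColorVal, RColorVal) per w:t, carrying values forward."""
    root = ET.fromstring(xml_text)
    p_color = r_color = None  # retained across rows, never reset
    rows = []
    for p in root.findall("w:body/wx:sect/w:p", NS):
        color = p.find("w:pPr/w:rPr/w:color", NS)
        if color is not None:
            p_color = color.get(VAL)
        for r in p.findall("w:r", NS):
            color = r.find("w:rPr/w:color", NS)
            if color is not None:
                r_color = color.get(VAL)
            for t in r.findall("w:t", NS):
                rows.append((t.text, p_color, r_color))
    return rows


for row in color_rows(SAMPLE):
    print(row)
```

Under that carry-forward assumption, the unstyled second paragraph inherits the colors of the first, so a retained property is best read together with `pNum` and `tNum` when deciding which text it really belongs to.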
Tables – Dataset Rows
<TABLE-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t</TABLE-PATH>
Tables – Table Number
<COLUMN name="tblNum" ordinal="YES" retain="YES">
  <INCREMENT-PATH beginend="BEGIN" syntax="XPath">/w:wordDocument/w:body/wx:sect/w:tbl</INCREMENT-PATH>
Tables – Row Number
<COLUMN name="trNum" ordinal="YES" retain="YES">
  <INCREMENT-PATH beginend="BEGIN" syntax="XPath">/w:wordDocument/w:body/wx:sect/w:tbl/w:tr</INCREMENT-PATH>
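The table extract follows the same pattern as the text extract: one row per text element, plus counters that say which table and which table row the text came from. The Python sketch below (a stand-in for illustration, not SAS) builds `tblNum` and `trNum` for a hand-made document. It assumes the row counter keeps counting across tables rather than restarting at each one; the sample data and the name `table_rows` are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "w": "http://schemas.microsoft.com/office/word/2003/wordml",
    "wx": "http://schemas.microsoft.com/office/word/2003/auxHint",
}


def cell(text):
    """Build the XML for one table cell holding a single text element."""
    return "<w:tc><w:p><w:r><w:t>%s</w:t></w:r></w:p></w:tc>" % text


# Hypothetical document: a 2x2 table followed by a 1x1 table.
SAMPLE = (
    '<w:wordDocument'
    ' xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"'
    ' xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint">'
    "<w:body><wx:sect>"
    "<w:tbl>"
    "<w:tr>" + cell("Name") + cell("Score") + "</w:tr>"
    "<w:tr>" + cell("Ann") + cell("5") + "</w:tr>"
    "</w:tbl>"
    "<w:tbl><w:tr>" + cell("Total") + "</w:tr></w:tbl>"
    "</wx:sect></w:body></w:wordDocument>"
)


def table_rows(xml_text):
    """Return (tblNum, trNum, text) for every text element inside a table."""
    root = ET.fromstring(xml_text)
    rows = []
    tbl_num = tr_num = 0
    for tbl in root.findall("w:body/wx:sect/w:tbl", NS):
        tbl_num += 1          # counts each table start
        for tr in tbl.findall("w:tr", NS):
            tr_num += 1       # counts each row start, across all tables
            for t in tr.findall("w:tc/w:p/w:r/w:t", NS):
                rows.append((tbl_num, tr_num, t.text))
    return rows


for row in table_rows(SAMPLE):
    print(row)
```

With both counters on every record, the long list of cell texts can be reshaped back into one dataset per table and one observation per table row.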
Nested Tables – Absolute Path for Rows
<TABLE-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t</TABLE-PATH>
Nested Tables – Rootless Path for Rows
<TABLE-PATH syntax="XPath">w:tbl/w:tr/w:tc/w:p/w:r/w:t</TABLE-PATH>
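The difference between the two row paths shows up as soon as a table sits inside a cell of another table. The Python sketch below is an analogy only: it assumes that a rootless XMLMap path behaves like a "match at any depth" search, which ElementTree writes as `.//`. The sample document and the names `ABSOLUTE`, `ROOTLESS` and `texts` are made up for this illustration.

```python
import xml.etree.ElementTree as ET

NS = {
    "w": "http://schemas.microsoft.com/office/word/2003/wordml",
    "wx": "http://schemas.microsoft.com/office/word/2003/auxHint",
}

# Hypothetical document: an outer table whose only cell holds a paragraph
# and a nested one-cell table.
SAMPLE = """<w:wordDocument
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint">
  <w:body><wx:sect>
    <w:tbl><w:tr><w:tc>
      <w:p><w:r><w:t>outer</w:t></w:r></w:p>
      <w:tbl><w:tr><w:tc><w:p><w:r><w:t>inner</w:t></w:r></w:p></w:tc></w:tr></w:tbl>
    </w:tc></w:tr></w:tbl>
  </wx:sect></w:body>
</w:wordDocument>"""

# Path anchored at the document root: only tables directly under the section.
ABSOLUTE = "w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t"
# Stand-in for the rootless path: a w:tbl at any depth.
ROOTLESS = ".//w:tbl/w:tr/w:tc/w:p/w:r/w:t"


def texts(xml_text, path):
    """Return the text of every element matched by the path."""
    root = ET.fromstring(xml_text)
    return [t.text for t in root.findall(path, NS)]


print(texts(SAMPLE, ABSOLUTE))
print(texts(SAMPLE, ROOTLESS))
```

The anchored path misses the text in the nested table, while the any-depth path picks up both, which is the reason for preferring the rootless form when tables may be nested.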
Drawing Objects: VML – Vector Markup Language
• Drawings in Word are also stored as XML
• We'll just look at lines
Dataset – One Row for Each Line
<TABLE-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line</TABLE-PATH>
Dataset – Column: From
<COLUMN name="from">
  <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@from</PATH>
Dataset – Column: To
<COLUMN name="to">
  <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@to</PATH>
Dataset – Column: StrokeColor
<COLUMN name="strokecolor">
  <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@strokecolor</PATH>
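The line extract reads attributes rather than element text: each `v:line` becomes one row, and each attribute becomes a column. The Python sketch below (an illustration, not SAS) does the same on a hand-made drawing. The sample coordinates and the function name `line_rows` are assumptions, and the `style` column is included only because the annotate code later in the deck reads a `style` variable.

```python
import xml.etree.ElementTree as ET

NS = {
    "w": "http://schemas.microsoft.com/office/word/2003/wordml",
    "wx": "http://schemas.microsoft.com/office/word/2003/auxHint",
    "v": "urn:schemas-microsoft-com:vml",
}

# Hypothetical drawing with two lines; the second one is flipped vertically.
SAMPLE = """<w:wordDocument
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint"
    xmlns:v="urn:schemas-microsoft-com:vml">
  <w:body><wx:sect><w:p><w:r><w:pict><v:group>
    <v:line from="1800,2520" to="3600,2520" strokecolor="red"/>
    <v:line style="flip:y" from="1800,2520" to="3600,4320" strokecolor="blue"/>
  </v:group></w:pict></w:r></w:p></wx:sect></w:body>
</w:wordDocument>"""

LINE_PATH = "w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line"


def line_rows(xml_text):
    """Return one dict per v:line with its from, to, strokecolor and style."""
    root = ET.fromstring(xml_text)
    rows = []
    for line in root.findall(LINE_PATH, NS):
        rows.append({
            "from": line.get("from"),
            "to": line.get("to"),
            "strokecolor": line.get("strokecolor"),
            "style": line.get("style", ""),  # empty when the attribute is absent
        })
    return rows


for row in line_rows(SAMPLE):
    print(row)
```

Note that `from` and `to` arrive as "x,y" strings, so a later step still has to split them into numeric coordinates.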
The Dataset
• Trick: "Flip" indicates coordinates are swapped
Usage Example: Annotate dataset
if prxmatch(xyPattern, from) then do;
   /* start point of the line: move the pen there */
   function = 'move';
   x = input(prxposn(xyPattern, 1, from), 10.);
   /* a flipped line takes its starting y from the "to" point */
   if prxmatch('/flip:y/', style) then
      y = -1 * input(prxposn(xyPattern, 2, to), 10.);
   else
      y = -1 * input(prxposn(xyPattern, 2, from), 10.);
   output;
end;
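The coordinate handling in the annotate step can be tried outside SAS with a short Python sketch. The deck does not show the definition of `xyPattern`, so the regular expression below is an assumption: two numbers separated by a comma, each optionally followed by a unit suffix such as `pt`. Unlike the SAS fragment, the sketch matches the `to` string on its own before reading its y value. The name `move_point` is hypothetical.

```python
import re

# Assumed stand-in for xyPattern: "x,y", with optional unit letters after x.
XY = re.compile(r"\s*(-?\d+(?:\.\d+)?)[a-z]*\s*,\s*(-?\d+(?:\.\d+)?)")


def move_point(from_, to, style):
    """Return the 'move' record for one line, or None if parsing fails.

    x is taken from the "from" point. y is negated, and for a line whose
    style contains flip:y it is taken from the "to" point instead.
    """
    m_from = XY.match(from_)
    if m_from is None:
        return None
    x = float(m_from.group(1))
    if "flip:y" in style:
        m_to = XY.match(to)
        if m_to is None:
            return None
        y = -float(m_to.group(2))
    else:
        y = -float(m_from.group(2))
    return {"function": "move", "x": x, "y": y}


print(move_point("1800,2520", "3600,4320", ""))
print(move_point("1800,2520", "3600,4320", "flip:y"))
```

The sign change on y follows the SAS fragment; a plausible reason is that the drawing's y axis points down while the plotting coordinate system's y axis points up, but the deck does not say so.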
Contact Information
Larry Hoyle
Policy Research Institute, University of Kansas
LarryHoyle@ku.edu
http://www.ku.edu/pri/ksdata/sashttp/sugi31