On Embedding Machine-Processable Semantics into Documents. Krishnaprasad Thirunarayan, Department of Computer Science & Engineering, Wright State University, Dayton, OH-45435, USA. Talk Outline: Background and Motivation (Why?), Goals (What?), Details (How?), Conclusions.
Content Extraction: Formalize the document using a controlled vocabulary [Figure labels: Heterogeneous Doc., Spec., Defn., Rep.]
Problems with this approach to content extraction • Archiving the spec (for human comprehension) separately from its formalization is not conducive to traceability. • Manual extraction from the spec (from scratch) for each use is labor-intensive, time-consuming, and prone to typographical errors.
Observation • Conceptually, every piece of information in an extraction owes its existence to a phrase in the spec and, possibly, to the controlled vocabulary. • So, explore techniques to maintain the correspondence between a spec fragment and its formalization.
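One way to picture the correspondence between a spec fragment and its formalization is a record that stores the character span of the source phrase next to the formal term and value. This is a minimal sketch, not the method used in the talk: the `Annotation` class, its field names, and the sample sentence are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical record tying one formalized fact to its source phrase."""
    start: int   # offset of the first character of the phrase in the spec text
    end: int     # offset one past the last character of the phrase
    term: str    # controlled-vocabulary term (illustrative name)
    value: str   # formalized value taken from the phrase

    def source_phrase(self, spec_text: str) -> str:
        # Recover the exact spec fragment this fact was derived from.
        return spec_text[self.start:self.end]


# Invented sample sentence; not quoted from any real spec.
spec_text = "Tensile strength shall be 165 ksi for thickness 0.50 mm and under."
phrase = "165 ksi"
start = spec_text.index(phrase)
ann = Annotation(start, start + len(phrase), "strength.tensile", "165")
```

Because each fact carries its span, a reviewer can trace a formalized value back to the wording that justified it.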
General Problem • Embed domain-specific mark-up (annotations) into a human-sensible document • to make explicit the semantics of “content” text and complex data, and • to augment an interpretation in a modular fashion. • Document text: Human comprehensible • Semantic Mark-up: Machine processable
Nature of Specs • Semi-structured • Heterogeneous • Text • Tables • Images • Constrained technical vocabulary • Available as MS Word document
Pre-processing Spec • Abstract content from the spec document by removing display-oriented information • Save text • Save tabular data, preserving grid layout • Retain links to images • … • Note: the “Save As Text” option in MS Word is inadequate
Annotating Pre-processed Spec • Embedding Machine Processable Semantics • Recognizing and tagging text using controlled vocabulary • By-product of: Document Indexing and Semantic Search • Tagging tabular data to make explicit its semantics: Same grid layout, but different interpretation and dependencies based on headings • Explore: XML-based programming language Water for defining data and its behavior (semantics)
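Recognizing and tagging text with a controlled vocabulary can be sketched as a dictionary-driven scan over the pre-processed text. The vocabulary entries below reuse the concept names from the table example, but the `<term concept="...">` wrapper and the function name `tag_terms` are assumptions made for this illustration only.

```python
import re

# Illustrative controlled vocabulary: surface phrase -> concept name.
VOCABULARY = {
    "tensile strength": "strength.tensile",
    "yield strength": "strength.yield",
    "thickness": "thickness",
}


def tag_terms(text, vocabulary=VOCABULARY):
    """Wrap each vocabulary phrase found in text in a hypothetical <term> tag."""
    # Try longer phrases first so multi-word terms win over their substrings.
    terms = sorted(vocabulary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )

    def wrap(match):
        concept = vocabulary[match.group(1).lower()]
        return f'<term concept="{concept}">{match.group(1)}</term>'

    return pattern.sub(wrap, text)
```

The original wording stays inside the tag, so the tagged text remains readable while the concept attribute is available to an indexer or search tool.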
Example of Tagged Table
Thickness (mm) | Tensile Strength (ksi) | Yield Strength (ksi)
table.<setHeading thickness strength.tensile strength.yield/>
0.50 and under | 165 | 155
table.<addRow 0 0.50 165 155 />
0.50 - 1.00 | 160 | 150
table.<addRow 0.50 1.00 160 150 />
1.00 - 1.50 | 155 | 145
table.<addRow 1.00 1.50 155 145 />
...
Example of Processing Code
<defclass table rows=required=vector heading=optional=vector>
  <defmethod setHeading t=required ts=required ys=required>
    <set heading=<vector t ts ys/>/>
  </>
  <defmethod addRow smin smax ts ys>
    <set rows= table.rows.<insert <vector smin smax ts ys/>/>/>
  </>
  <defmethod computeYieldStrength> … </>
  <defmethod computeTensileStrength> … </>
  …
</>
(cont’d)
<defclass table rows=required=vector heading=optional=vector>
  …
  <defmethod computeTensileStrength>
    <set temp=fluid.Thickness/>
    <set i=0/>
    <do>
      <until <and temp.<less table.rows.<get i/>.1/>
                  temp.<more_or_equal table.rows.<get i/>.0/> /> >
        table.rows.<get i/>.2
      </until>
      <set i=i.<plus 1/>/>
    </do>
  </>
</>
(cont’d)
<defclass table rows=required=vector heading=optional=vector> … </>
fluid.<set Thickness=0.60/>
<try <set TensileStrength=table.<computeTensileStrength/>/> TensileStrength >
  "TABLE: out of range error occurred"
</try>
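For readers unfamiliar with Water, the lookup done by computeTensileStrength and the surrounding try can be paraphrased in Python. This is a sketch of the same logic, not a translation checked against a Water interpreter. It contains only the three rows shown on the slide (the elided rows stay elided), and the function names are illustrative.

```python
# Rows as (min thickness, max thickness, tensile strength, yield strength),
# copied from the three addRow calls shown in the tagged table.
ROWS = [
    (0.0, 0.50, 165, 155),
    (0.50, 1.00, 160, 150),
    (1.00, 1.50, 155, 145),
]


def lookup_strengths(thickness, rows=ROWS):
    """Return (tensile, yield) for the row whose range contains thickness."""
    for low, high, tensile, yield_strength in rows:
        # Same test as the Water code: thickness >= min and thickness < max.
        if low <= thickness < high:
            return tensile, yield_strength
    raise ValueError("TABLE: out of range error occurred")


def compute_tensile_strength(thickness):
    return lookup_strengths(thickness)[0]


def compute_yield_strength(thickness):
    return lookup_strengths(thickness)[1]
```

With the comparison written this way, a thickness of exactly 0.50 falls in the second row, although the first row is labelled "0.50 and under". The sketch keeps the behavior of the slide's code and does not resolve that discrepancy.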
Water • XML-based OO Scripting Language • Facilitates creating Web Services • Run methods remotely via web-browser • Generalizes dynamic typing to constraint checking • Conformance of actuals to formals
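The idea of checking the conformance of actuals to formals can be illustrated outside Water with a small Python sketch in which each formal parameter carries a predicate instead of only a type. This shows the general idea, not Water's actual mechanism. `check_call`, `ADD_ROW_FORMALS`, and the specific constraints are assumptions invented for this example.

```python
def check_call(formals, actuals):
    """Return a list of conformance errors (an empty list means the call conforms).

    formals: dict mapping parameter name -> predicate on the argument value
    actuals: dict mapping parameter name -> supplied argument value
    """
    errors = []
    for name, constraint in formals.items():
        if name not in actuals:
            errors.append(f"missing argument: {name}")
        elif not constraint(actuals[name]):
            errors.append(f"constraint violated: {name}={actuals[name]!r}")
    return errors


def _non_negative_number(value):
    return isinstance(value, (int, float)) and value >= 0


# Hypothetical constraints for the addRow method of the table example.
ADD_ROW_FORMALS = {
    "smin": _non_negative_number,
    "smax": _non_negative_number,
    "ts": _non_negative_number,
    "ys": _non_negative_number,
}
```

A plain type check is the special case where every predicate is an `isinstance` test. Richer predicates, such as ranges or units, use the same interface.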
Pros and cons • Encoding Improvement • Amount of tagging can be controlled by suitably delimiting table data and annotating it with a corresponding “string-processing” method • Master Copy Update • Changes to the spec require manual modification of the archived annotated version. • Irregular Tables in Specs • Different units, etc.
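The “string-processing” method mentioned above can be sketched as a parser that takes the delimited table text as a whole, so that individual rows need no tags. The pipe delimiter, the function names, and the choice to read "0.50 and under" as the range 0 to 0.50 (as in the tagged table's first addRow) are assumptions for this illustration.

```python
import re

# "0.50 - 1.00" style cell: explicit lower and upper bounds.
RANGE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*$")
# "0.50 and under" style cell: upper bound only.
UNDER = re.compile(r"^\s*(\d+(?:\.\d+)?)\s+and under\s*$")


def parse_thickness(cell):
    """Turn a thickness cell into a (low, high) pair of floats."""
    match = UNDER.match(cell)
    if match:
        return 0.0, float(match.group(1))
    match = RANGE.match(cell)
    if match:
        return float(match.group(1)), float(match.group(2))
    raise ValueError(f"unrecognized thickness cell: {cell!r}")


def parse_table(block):
    """Parse pipe-delimited rows into (low, high, tensile, yield) tuples."""
    rows = []
    for line in block.strip().splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        low, high = parse_thickness(cells[0])
        rows.append((low, high, float(cells[1]), float(cells[2])))
    return rows


TABLE_TEXT = """
0.50 and under | 165 | 155
0.50 - 1.00 | 160 | 150
1.00 - 1.50 | 155 | 145
"""
```

Irregular cells, such as a different unit, raise an error here instead of being guessed at, which matches the caveat about irregular tables.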
Some Related Work • Microsoft Smart Tags • Recognize “controlled” words in Office 2003 documents and associate predefined list of actions with each occurrence • SHOE • Table data in a declarative (logic) language
Prolog rendition
strengthTableRow(0,    0.50, 165, 155).
strengthTableRow(0.50, 1.00, 160, 150).
strengthTableRow(1.00, 1.50, 155, 145).
...
strengthTable(Thickness, TensileStrength, YieldStrength) :-
    strengthTableRow(L, U, TensileStrength, YieldStrength),
    L =< Thickness, U > Thickness.
thicknessToTensileStrength(Thickness, TensileStrength) :-
    strengthTable(Thickness, TensileStrength, _).
thicknessToYieldStrength(Thickness, YieldStrength) :-
    strengthTable(Thickness, _, YieldStrength).
?- thicknessToYieldStrength(0.6,YS).
A Step towards the Holy Grail • Ultimately enable authoring and/or extracting the human-comprehensible and machine-processable parts of a document “hand in hand”, and keep them “side by side”.