Document Engineering of Complex Software Specifications
Mehrdad Nojoumian
Supervisor: Professor T. C. Lethbridge
University of Ottawa, School of Information Technology and Engineering
June 4, 2007, MSc Thesis in Computer Science
Motivation and Goal
Problems triggering our motivation. Software specifications:
- are dense and intricate (numerous materials)
- have complicated structures (lots of tables, figures, lists, code, etc.)
- are difficult to browse and navigate
- are mostly available in PDF format or as a single hypertext page
Major goals:
- Re-engineer PDF-based documents (specifications, conference proceedings, e-books, etc.)
- Illustrate how to produce a more usable version of such documents
Data Analyses
Headings and the document index carry the most important words in a document.
Case studies (the UML Superstructure Specification and other OMG specifications) examined:
- the most frequent words among headings
- the frequency of those words across the entire document
- the most frequent words in the document index
Method:
- Sorted document tokens and heading tokens by frequency in two separate lists
- Determined the position of each heading token among the document tokens: P1, P2, ..., PN
- MP: mean of [P1...PN]; NDT: total number of document tokens
- Percentage = (MP * 100) / NDT
Result: the most frequent headings (more than two occurrences) are among the most frequent words in the entire document.
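To make the metric concrete, here is a minimal Java sketch of the computation above. It assumes the document and heading tokens have already been extracted and normalized, and it treats NDT as the number of distinct document tokens (whether the thesis counted distinct tokens or all occurrences is an assumption here); all names are illustrative, not taken from the thesis code.

```java
import java.util.*;

// Minimal sketch of the heading-position metric described above.
public class HeadingPositionMetric {

    public static double percentage(List<String> documentTokens, List<String> headingTokens) {
        // Count token frequencies in the whole document.
        Map<String, Integer> freq = new HashMap<>();
        for (String t : documentTokens) {
            freq.merge(t, 1, Integer::sum);
        }

        // Sort distinct document tokens by descending frequency.
        List<String> ranked = new ArrayList<>(freq.keySet());
        ranked.sort((a, b) -> freq.get(b) - freq.get(a));

        // P1..PN: 1-based rank of each distinct heading token in that list.
        List<Integer> positions = new ArrayList<>();
        for (String h : new LinkedHashSet<>(headingTokens)) {
            int p = ranked.indexOf(h);
            if (p >= 0) {
                positions.add(p + 1);
            }
        }

        // MP: mean of P1..PN; NDT: number of (distinct) document tokens.
        double mp = positions.stream().mapToInt(Integer::intValue).average().orElse(0);
        int ndt = ranked.size();   // alternatively documentTokens.size()
        return (mp * 100) / ndt;   // Percentage = (MP * 100) / NDT
    }
}
```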
Document Transformation
- Transforming the raw input into a format more amenable to analysis (XML)
- Extracting and refining the structure
Conversion experiments:
- Tools: Adobe Acrobat Professional 7.8, Microsoft Word 2003, Stylus Studio XML Enterprise Suite, ABBYY PDF Transformer 1.0
- Criteria: generality, low volume, clean and understandable output, similarity to XML, having good clues
Logical Structure Extraction
Java parsers:
- Solved the mis-tagging problem introduced during the previous phase
- Extracted all headings present in the document bookmark
- Removed extraneous information and XML tags
- Formed the document's logical structure in a clean XML format
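As a rough illustration of what such a parser does, the sketch below reads the imperfectly tagged XML produced by the conversion phase, keeps only heading elements, and nests them into a clean structure. The element names (H1/H2/H3, Section, Document) and file names are assumptions made for this example, not the schema actually used in the thesis.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.*;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.*;
import java.io.File;

// Sketch: turn flat heading tags from the converted XML into nested <Section>s.
public class LogicalStructureBuilder {

    public static void main(String[] args) throws Exception {
        Document in = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new File("converted.xml"));
        Document out = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();

        Element root = out.createElement("Document");
        out.appendChild(root);

        // Currently open section at each heading level (1..3).
        Element[] open = new Element[4];
        open[0] = root;

        NodeList nodes = in.getDocumentElement().getChildNodes();
        for (int i = 0; i < nodes.getLength(); i++) {
            if (!(nodes.item(i) instanceof Element)) continue;
            Element e = (Element) nodes.item(i);
            if (!e.getTagName().matches("H[1-3]")) continue;   // drop non-heading content

            int level = e.getTagName().charAt(1) - '0';
            Element section = out.createElement("Section");
            section.setAttribute("title", e.getTextContent().trim());

            // Attach under the nearest open section of a shallower level.
            Element parent = open[level - 1] != null ? open[level - 1] : root;
            parent.appendChild(section);
            open[level] = section;
            for (int l = level + 1; l < open.length; l++) open[l] = null;
        }

        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.INDENT, "yes");
        t.transform(new DOMSource(out), new StreamResult(new File("structure.xml")));
    }
}
```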
Hypertext Pages & Text Extraction
- Produced a separate output for each chapter, section, subsection, etc. (1.html, 2.html, 2.1.html, etc.)
- Generated a table of contents from the headings (used as a navigation frame)
- Connected the hypertext outputs sequentially
XPath expressions and a programming approach formed the major document elements:
- Anchors in long pages
- Figures and their captions
- Simple and nested lists
- Dynamic tables
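The sketch below shows, in Java, the kind of XPath-driven selection this step relies on: iterate over sections of the extracted structure and collect their figures and captions, with one hypertext page per section. The element and attribute names (Section, Figure, Caption, title) are assumed for illustration only.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.*;
import org.w3c.dom.*;
import java.io.File;

// Sketch: select sections and their figure captions with XPath.
public class TextExtractionSketch {

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new File("structure.xml"));
        XPath xpath = XPathFactory.newInstance().newXPath();

        // One hypertext page per section: iterate over all sections.
        NodeList sections = (NodeList) xpath.evaluate(
                "//Section", doc, XPathConstants.NODESET);

        for (int i = 0; i < sections.getLength(); i++) {
            Element section = (Element) sections.item(i);
            String title = section.getAttribute("title");

            // Figures and their captions inside this section.
            NodeList captions = (NodeList) xpath.evaluate(
                    "Figure/Caption", section, XPathConstants.NODESET);

            System.out.println(title + ": " + captions.getLength() + " figure(s)");
            // A real implementation would serialize the section's content
            // into its own HTML file here (e.g. 2.1.html).
        }
    }
}
```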
Concept Extraction
UML Superstructure Specification: extraction of UML class and package hierarchies.
Rule: if the first child of a <Section> element contains the string 'Class Descriptions', then UML classes and packages can be detected in the grandchildren of that <Section> element.
Other specifications:
- Common Warehouse Metamodel (CWM)
- UML Infrastructure (UML Inf.)
- Meta Object Facility (MOF)
Open question: how can we detect such a logical relation among heading elements automatically?
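One possible way to express the rule above is a single XPath query. The following Java sketch assumes the nested <Section> structure with a title attribute from the earlier examples, which may differ from the thesis schema.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.*;
import org.w3c.dom.*;
import java.io.File;

// Sketch of the concept-extraction rule: sections whose first child section
// is titled 'Class Descriptions' have UML classes/packages as grandchildren.
public class ConceptExtractionSketch {

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new File("structure.xml"));
        XPath xpath = XPathFactory.newInstance().newXPath();

        NodeList candidates = (NodeList) xpath.evaluate(
                "//Section[Section[1][contains(@title, 'Class Descriptions')]]"
                        + "/Section/Section",
                doc, XPathConstants.NODESET);

        for (int i = 0; i < candidates.getLength(); i++) {
            Element grandchild = (Element) candidates.item(i);
            System.out.println("Detected class/package: "
                    + grandchild.getAttribute("title"));
        }
    }
}
```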
Cross Referencing
- Developed an XSLT program to extract heading phrases and their corresponding hyperlinks
- Filtered out phrases with common substrings, such as Association and AssociationClass
- Removed phrases that had many independent hypertext pages (different entries in the user interfaces)
- Applied package names as cross-referencing anchors, only for the UML Superstructure Specification
- Finally, developed a Java program to replace the hyperlinks in the generated HTML pages
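The final replacement pass might look like the following Java sketch: given a map from heading phrases to their hypertext pages (as produced by the XSLT step), rewrite each generated HTML file so that occurrences of a phrase link to its page. The file names are illustrative, and phrases with common substrings (Association vs. AssociationClass) are assumed to have been filtered out beforehand, as noted above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the cross-referencing pass. A real implementation would restrict
// replacement to whole words inside text nodes rather than raw HTML strings.
public class CrossReferencer {

    public static void main(String[] args) throws IOException {
        // Heading phrase -> target page (file names are illustrative).
        Map<String, String> links = new LinkedHashMap<>();
        links.put("Classifier", "classifier.html");
        links.put("Operation", "operation.html");

        try (DirectoryStream<Path> pages =
                     Files.newDirectoryStream(Paths.get("html"), "*.html")) {
            for (Path page : pages) {
                String content = new String(Files.readAllBytes(page), StandardCharsets.UTF_8);
                for (Map.Entry<String, String> e : links.entrySet()) {
                    content = content.replace(e.getKey(),
                            "<a href=\"" + e.getValue() + "\">" + e.getKey() + "</a>");
                }
                Files.write(page, content.getBytes(StandardCharsets.UTF_8));
            }
        }
    }
}
```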
Usability of User Interfaces
Reasons for generating small hypertext pages:
- A better sense of location (navigating)
- Less chance of getting lost (scrolling)
- A less overwhelming sensation (learning)
- Statistical analysis (interesting topics)
- Faster downloading (no need to fetch the entire document)
- Easier printing, cross referencing among diverse specifications, etc.
User Interfaces Demo
Contributions
- A generic approach to re-engineer complex documents
- A data analysis showing that words in headings provide a sufficient basis for document re-engineering
- Extraction of the document's logical structure in XML format
- Various techniques for text and concept extraction using W3C technologies
- Major software components for an "Integrated Document Engineering Tool"
Engineering Lessons & Challenges
Engineering lessons:
- Generating a clean XML file from PDF images requires sophisticated features to recognize each document element correctly and to deal with mis-tagging, page boundaries, etc.
- The latest technologies play a remarkable role in engineering tasks: e.g. XPath 2.0 versus parsing packages, offering a high-level interaction close to human language
- Comprehensive data analysis can facilitate the document engineering process, build a better understanding, and yield robust rules for such processing
Low-level challenges:
- Generating multiple hypertext pages with Saxon
- Detecting errors in XSLT programming
- Creating complicated XPath expressions, etc.
Future Work
- Extracting the initial XML document independently of Adobe Acrobat
- Automating the concept extraction procedure or creating HCI features for it
- Developing an automatic document analyzer for comprehensive data analyses
- Investigating the usability of the current user interfaces to discover users' demands
- Generating interaction features in the UIs: online query submission to the XML files
Publications
Refereed conference paper: M. Nojoumian and T. C. Lethbridge, "Extracting document structure to facilitate a KB creation for UML specifications", in Proceedings of the 4th IEEE International Conference on Information Technology: New Generations (ITNG), pp. 393-400, Las Vegas, USA, 2007.
Invited for publication in the Journal of Computers (JOC), Academy Publisher: M. Nojoumian and T. C. Lethbridge, "Document engineering of complex software specifications".
Thank you very much. Questions?