Master Thesis

Document Logical Structure Extraction
Mehrdad Nojoumian
Supervisor: Professor T. C. Lethbridge
University of Ottawa, School of Information Technology and Engineering
October 27, 2006

Contents
- Motivation and Goal
- Document Properties (UML Superstructure Specification)
- Document Transformation
- Logical Structure Extraction
- Summary and Future Work

Motivation and Goal
Problems:
- Specifications are:
  - Dense, repetitive and difficult to use
  - Written primarily in semi-structured text, but the structure must be maintained manually, resulting in inconsistency
- End users cannot use them efficiently due to:
  - Duplications
  - Numerous concepts connected only implicitly
  - General complexity of the document
High-level goal: enable easier browsing and editing of specifications
To achieve this we have the following lower-level goals:
- Extract the document's logical structure
- Generate a knowledge base for the UML specification

Definitions
- Document analysis: extraction of the geometric structure, which refers to the pages, blocks, lines, and words
- Document understanding: mapping the physical structure into the logical structure, which refers to the chapters, sections, subsections, etc.
- Knowledge acquisition: extracting concepts embedded in the document structure (physical or logical)
- Unstructured document: plain text in natural language
- Semi-structured document: a document with tags dividing it into paragraphs, headings and sections, such as a web page
- Structured document: a document in which all the elements are marked with meta-tags, typically using XML

Quick Literature Review
Document analysis:
WISDOM (Wise System for Document Management) is a document processing system that operates in five steps:
1. Document analysis (physical)
2. Document classification
3. Document understanding (mapping physical → logical)
4. Text recognition with OCR
5. Text transformation into XML format
MKB (Mathematical Knowledge Browser): using this browser,
1. Printed mathematical documents can be scanned and recognized by OCR
2. The meta-information (e.g. title, author, abstract, etc.) can be extracted
3. The logical structure (e.g. theorem, lemma, proof, etc.) can be extracted
OCR: Optical Character Recognition

Document Properties
UML Superstructure Specification (version 2.1):
- Is a large specification in PDF format
- Has 771 pages
- Contains almost 2200 headings with many nested lists, hyperlinks, figures, tables, etc.
Reasons for choosing the PDF format:
- People often do not have access to the original word-processor formats
- PDF has some useful features, such as bookmarks, that make it semi-structured
- When documents are published, PDF is the best choice to guarantee that everyone can read them

Document Transformation
- Transforming the raw input into a format more amenable to analysis
- Extracting and refining the structure
Conversion experiments:
- We performed various conversions using a similar sample file
- We applied different tools, such as:
  - Adobe Acrobat Professional 7.8
  - Microsoft Word 2003
  - Stylus Studio 2006 XML Enterprise Suite
  - ABBYY PDF Transformer 1.0

Document Transformation (Cont)
Criteria: to select the best conversion, we defined a set of criteria
- Generality: the format should enable the design of a general extraction algorithm for processing other electronic documents
- Low volume: we should avoid a format that contains a lot of extra material unrelated to the document content
- Clean and understandable: even if the output is small, it should be clean and understandable, e.g. formats that mark constructs such as paragraphs
- Similarity to XML: we prefer a format with a structure similar to XML because our final goal is to extract the logical structure in this style
- Having good clues: the format should use markers that provide accurate clues for finding the logical structure, e.g. meaningful keywords with respect to the headings: “LinkTarget”, “DIV”, “Sect”, “Part”, etc.

Document Transformation (Cont)
First stage of evaluation:
- DOC & RTF: they are messy and even encode figures within the document content. In addition, they store font, size, style and similar information for each heading, paragraph, sentence and even word
- HTML/XML: if we extract HTML/XML from DOC/RTF, the results tend to have the same properties
- TXT: it is very simple but gives no clues for processing; it may not even be possible to find the beginning of chapters, headings, tables, etc.
- PDF: it is complex itself, but after conversion into HTML/XML by Adobe Acrobat Professional 7.8 the result is much cleaner, especially for PDF files that have bookmarks

Document Transformation (Cont)
Second stage of evaluation:
- Our finalist candidates are the HTML and XML formats extracted by Adobe Acrobat Professional 7.8 from the PDF file with bookmarks
- We analyzed the following sample parts using the two finalist candidates:
  - Sample paragraphs
  - Sample figures (e.g. figure 7.25)
  - Sample tables (e.g. table 2.1)
  - Complex tables that have phrases, figures and hyperlinks in their cells
  - Complex nested lists with complicated hierarchical structures
- After many assessments, we found that XML is the best candidate for processing

Logical Structure Extraction
First refinement approach: grammars
- Applied various parsing packages
- Tried to write a comprehensive grammar to parse the XML document
Sample headings (labelled on the slide with the tokens Start, Number, Period, Space, Word, Next Line):
  7 Classes\n
  7.1 Overview\n
  7.2 Abstract Syntax\n
  7.3.1 Abstraction\n
Encountered too many exceptions, resulting in the need for:
1. Too many rules
2. Context-sensitive parsing

Logical Structure Extraction (Cont)
Second refinement approach: stack-based parsing written in Java
- We turned to writing a simple Java program to match the major tags, such as <Part>, <Sect> and <Div>, which Acrobat used to open and close each part, chapter, section, etc. of the document
- Using a straightforward stack-based parsing approach
Example (the slide shows two nestings side by side):
  <Sect name="Generalization">
    <Sect name="Class-Ref">
      <Sect name="Name"> </Sect>
      <Sect name="Package-Ref"> </Sect>
    </Sect>
  </Sect>

  <Generalization>
    <Class-Ref>
      <Name> </Name>
      <Package-Ref> </Package-Ref>
    </Class-Ref>
  </Generalization>
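The stack-based matching described on this slide can be illustrated with a minimal sketch. It is written in Python rather than the authors' Java, and it is not their program. It assumes the slide's example is a mapping from generic <Sect name="..."> tags to named tags, and the function name rename_sections is hypothetical.

```python
import re

# Matches either an opening <Sect name="..."> (group 1 holds the name) or a closing </Sect>.
TOKEN = re.compile(r'<Sect name="([^"]+)">|</Sect>')

def rename_sections(xml_text):
    """Rewrite generic Sect tags as named tags, pairing each close with its open via a stack."""
    stack, out = [], []
    for match in TOKEN.finditer(xml_text):
        name = match.group(1)
        if name is not None:          # opening tag: remember its name
            stack.append(name)
            out.append(f"<{name}>")
        else:                         # closing tag: it closes the most recent open tag
            if not stack:
                raise ValueError("unmatched </Sect>")
            out.append(f"</{stack.pop()}>")
    if stack:
        raise ValueError(f"unclosed sections: {stack}")
    return out

source = (
    '<Sect name="Generalization"><Sect name="Class-Ref">'
    '<Sect name="Name"></Sect><Sect name="Package-Ref"></Sect>'
    '</Sect></Sect>'
)
renamed = rename_sections(source)
```

This only works when every closing tag sits in the right place, which is the assumption that the converted document turned out to violate.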

Logical Structure Extraction (Cont)
Second refinement approach: stack-based parsing written in Java
- The program failed when we ran it on various chapters and on the whole document
- The conversion tool opened each part, chapter, section, etc. with <Sect> in the proper place, but it closed these tags with </Sect> in the wrong places
- The problem was more severe when we processed the whole document at once, because the mis-tagging accumulated
Example:
  <Sect number="7.3">
    <Sect number="7.3.1"> </Sect>
    <Sect number="7.3.2"> </Sect>
                                    ← correct place for closing <Sect number="7.3">
    <Sect number="7.4"> </Sect>
  </Sect>                           ← wrong place

Logical Structure Extraction (Cont)
Third implementation approach: leveraging the bookmarks
- We wrote a Java-based parser that focused on one keyword: LinkTarget
- It corresponds to the bookmark elements created in the transformation phase
- It is attached to each heading in the bookmarks, e.g.: <P id="LinkTarget_111914">7 Classes</P>
- We extracted all the lines containing LinkTarget and put them in a queue
- We also defined the different types of headings in our document
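The LinkTarget filtering step could look roughly like the following Python sketch (the original parser was written in Java and is not shown in the slides). The regular expression assumes every heading line has the same shape as the slide's single example. The id LinkTarget_111915 and both function names are hypothetical.

```python
import re
from collections import deque

# Assumed line shape, taken from the slide's example:
#   <P id="LinkTarget_111914">7 Classes</P>
LINK_TARGET = re.compile(r'<P\s+id="LinkTarget_\d+">(.*?)</P>')

def collect_link_target_lines(lines):
    """Queue every line that carries the LinkTarget keyword, in document order."""
    return deque(line for line in lines if "LinkTarget" in line)

def extract_heading(line):
    """Return the heading text of a LinkTarget line, or None if the line does not match."""
    match = LINK_TARGET.search(line)
    return match.group(1).strip() if match else None

sample = [
    '<P id="LinkTarget_111914">7 Classes</P>',
    '<P>Ordinary paragraph text.</P>',
    '<P id="LinkTarget_111915">7.1 Overview</P>',  # hypothetical id
]
queue = collect_link_target_lines(sample)
headings = [extract_heading(line) for line in queue]
```

A queue preserves document order, which matters because the heading sequence alone is later used to decide where each tag opens and closes.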

Logical Structure Extraction (Cont)
Heading types and the tags they open:
  Part I              T_Part = 1        → <Part I>
  1 Classes           T_Chapter = 2     → <Chapter 1>
  1.1 Description     T_Section = 3     → <Section 1.1>
  1.1.1 Abstraction   T_Subsection = 4  → <Subsection 1.1.1>

Example queue Q: Part I, 1 Classes, 1.1 Description, 1.1.1 Abstraction

Procedure DocumentStructureAnalysis(LinkTargetQueue)
  T of the last member of the HeadingStack = 0, HeadingStack = empty
  While (LinkTargetQueue != empty) do
    Get "L" from the LinkTargetQueue          // L: Line, e.g. <P id="LinkTarget_111914">7 Classes</P>
    Extract the heading "H" from the "L"      // H: Heading, e.g. 7 Classes
    Define heading's type: "T"                // T: Type, e.g. for the chapters, T_Chapter = 2
    While (T <= T of the last member of the HeadingStack) do
      Pop "H" and "T" from the HeadingStack
      Close the suitable tag w.r.t. the popped "T"
      If (HeadingStack == empty)
        Break this while loop
      End if
    End while
    Push the new "H" and "T" in the HeadingStack
    Open new tags w.r.t. the pushed "H" & "T"
  End while
  While (HeadingStack != empty) do
    Pop "H" and "T" from the HeadingStack
    Close the suitable tag w.r.t. the popped "T"
  End while
  Return "F"
End procedure

Tags closed as the stack is emptied:
  1.1.1 Abstraction   T_Subsection = 4  → </Subsection 1.1.1>
  1.1 Description     T_Section = 3     → </Section 1.1>
  1 Classes           T_Chapter = 2     → </Chapter 1>
  Part I              T_Part = 1        → </Part I>
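The procedure above translates fairly directly into code. The following is a minimal Python sketch, not the authors' Java implementation. The rule in heading_type, which counts the dots in the heading number, is an assumption inferred from the slide's four examples. The headings "1.2 Diagrams" and "2 Components" are hypothetical test data. Tags are emitted in the slide's notation (e.g. <Chapter 1>), which a real XML writer would need to turn into element names with attributes.

```python
# Heading types as defined on the slide.
TAG_NAMES = {1: "Part", 2: "Chapter", 3: "Section", 4: "Subsection"}

def heading_type(heading):
    """Assumed rule: 'Part ...' is type 1; otherwise 2 plus the number of dots in the number."""
    if heading.startswith("Part "):
        return 1
    return 2 + heading.split()[0].count(".")   # "1" -> 2, "1.1" -> 3, "1.1.1" -> 4

def tag(heading, t, closing=False):
    """Render a tag in the slide's notation, e.g. <Section 1.1> or </Section 1.1>."""
    number = heading.split()[1] if t == 1 else heading.split()[0]
    name = TAG_NAMES.get(t, f"Level{t}")       # deeper levels are not defined on the slide
    return f"<{'/' if closing else ''}{name} {number}>"

def document_structure_analysis(headings):
    """Open and close tags from the heading sequence alone, using a stack of open headings."""
    out, stack = [], []
    for heading in headings:
        t = heading_type(heading)
        # Close every open heading at the same or a deeper level than the new one.
        while stack and t <= stack[-1][1]:
            out.append(tag(*stack.pop(), closing=True))
        stack.append((heading, t))
        out.append(tag(heading, t))
    # End of document: close whatever is still open, innermost first.
    while stack:
        out.append(tag(*stack.pop(), closing=True))
    return out

tags = document_structure_analysis([
    "Part I", "1 Classes", "1.1 Description", "1.1.1 Abstraction",
    "1.2 Diagrams", "2 Components",            # hypothetical headings
])
```

Because closing positions are derived from heading levels rather than from the converter's </Sect> tags, this approach does not depend on where the converter placed its closing tags.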

Logical Structure Extraction (Cont)
[Figure: logical structure in XML format]
- We extracted 2191 headings from the UML Superstructure Specification (version 2.1)
- We also tested other specifications, such as the UML Infrastructure Specification (version 2.0)
- Extraction succeeded in all cases with 100% accuracy
- We also imported the new XML file into Protégé
[Figure: logical structure model in Protégé]

Summary
Goal: make specifications more usable
Main task in this stage of our work: extract a clean structure from a published PDF specification
Document transformation:
- Took the raw PDF version of a published specification
- Experimented with tools to convert it to other formats: DOC, RTF, TXT, XML, etc.
Logical structure extraction:
- Best format: XML file extracted from PDF with bookmarks
- Key challenge: dealing with mis-tagging
  - We needed to write a procedural program
  - Declarative grammars with a parsing package did not work well

Future Work
- We extracted the document's logical structure (document entity)
- We intend to focus on the hidden concepts in the document (UML entity)
- We are interested in what knowledge could be captured:
  - Capture the list of all words, bi-grams, tri-grams and quad-grams
  - Calculate their frequency of occurrence
  - Gain a sense of the terminology and concepts from these frequencies
- We would also like to perform related-phrase analysis: “X is a kind of Y”, “X has a Y”
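The planned n-gram frequency count could be sketched as follows in Python. The tokenization rule (lowercase runs of letters and digits), the function name ngram_counts and the sample sentence are all assumptions made for illustration; nothing here comes from the thesis itself.

```python
import re
from collections import Counter

def ngram_counts(text, n):
    """Count word n-grams (n=1 words, n=2 bi-grams, ...) using a simple lowercase tokenizer."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    # zip over n shifted copies of the word list yields each run of n consecutive words.
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)

# Hypothetical sample text, not taken from the UML specification.
sample = "A class is a kind of classifier. An interface is a kind of classifier."
bigrams = ngram_counts(sample, 2)
quadgrams = ngram_counts(sample, 4)
```

Frequent longer n-grams such as "is a kind of" would also point to the related-phrase patterns mentioned on the slide.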


Thanks
Questions?
