Information Extraction from Documents for Automating Software Testing by Patricia Lutsky

Information Extraction from Documents for Automating Software Testing by Patricia Lutsky • Presented by Ramiro Lopez

Outline • Why a natural-language-based system for extracting information from documents is needed • Alternative ways of extracting information from documents • System design and implementation details • Experimental results

Motivation for SIFT • What is SIFT? SIFT stands for Specification Information From Text. • Many software engineering documents are written in natural language. • Examples: requirements and specification documents, user manuals. • Software engineering documents tend to follow a fixed layout of sections and subsections, i.e., they are semi-structured.

What does SIFT do? • SIFT is essentially an automated testing tool • It extracts specification-level information, generates tests from that information, and adds them to the set of existing test cases • The tests are then run to check that the system conforms to the documentation

Alternative ways of extracting information from documents • Use a controlled language for requirements specifications • Parse natural-language texts about testing in full and generate test scripts • Extract facts about system specifications, but not specifically testable facts

What is unique about SIFT? • Extracts specific testable facts from semi-structured documents • Uses XML, which separates content information from presentation formats, to give the document a consistent structure • Does not pursue full-text understanding, thus avoiding issues related to the endless ways of saying the same thing

How to use SIFT • Identify concepts that can be extracted for testing • Examine a document to find out how it is organized and to find the different sentence types • Encode sentence types in a grammar • Create XML tags to give the document a consistent structure

XML tag examples
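As an illustration of what tagged input might look like, here is a minimal sketch. The tag names (routine, argument, description) and the routine name CREATE_BUFFER are hypothetical, chosen only for this example; they are not the tags SIFT uses. The sketch shows how consistent tags make it easy to locate the sentences worth parsing.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: these tag names and the routine name are
# illustrative assumptions, not taken from SIFT.
DOC = """
<routine name="CREATE_BUFFER">
  <argument name="BUFQUO">
    <description>The maximum value you can specify with the BUFQUO argument is 65355.</description>
  </argument>
</routine>
"""

def argument_sentences(xml_text):
    """Return (routine, argument, sentence) triples from the tagged text."""
    root = ET.fromstring(xml_text.strip())
    triples = []
    for arg in root.iter("argument"):
        for desc in arg.iter("description"):
            triples.append((root.get("name"), arg.get("name"), desc.text.strip()))
    return triples

for routine, arg, sentence in argument_sentences(DOC):
    print(routine, arg, sentence)
```

Because the tags carry the structure, the extraction code never has to guess which section a sentence came from.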

Example of how a sentence is processed • Natural-language specification: The maximum value you can specify with the BUFQUO argument is 65355 • The parser first reduces this to a canonical sentence, "The maximum value for BUFQUO is 65355", and then to a canonical form: (maximum_value BUFQUO 65355) • The canonical form (maximum_value BUFQUO 65355) is then mechanically converted into the code for a test case, which is added to the existing tests
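The processing steps on this slide can be sketched roughly as follows. This is a simplified stand-in, not SIFT's parser: it uses a single regular expression for one assumed sentence type, and the generated checks (accept the limit, reject the limit plus one) are an assumption about what a maximum-value test might look like. The `call` parameter is a hypothetical hook into the system under test.

```python
import re

# Assumed sentence type: "The maximum value ... <ARG> argument is <N>".
MAX_VALUE = re.compile(
    r"The maximum value .*?\b(?P<arg>[A-Z][A-Z0-9_]*) argument is (?P<value>\d+)"
)

def to_canonical(sentence):
    """Return ("maximum_value", ARG, N) for a matching sentence, else None."""
    m = MAX_VALUE.search(sentence)
    if m is None:
        return None
    return ("maximum_value", m.group("arg"), int(m.group("value")))

def make_tests(fact, call):
    """Build two boundary checks from a maximum_value fact.

    `call(arg, value)` is a hypothetical hook that sets the argument on the
    system under test and returns True if the value was accepted.
    """
    kind, arg, limit = fact
    if kind != "maximum_value":
        raise ValueError(f"unsupported fact type: {kind}")

    def at_limit():
        return call(arg, limit) is True       # the documented maximum is accepted

    def above_limit():
        return call(arg, limit + 1) is False  # one past the maximum is rejected

    return [at_limit, above_limit]

sentence = "The maximum value you can specify with the BUFQUO argument is 65355"
fact = to_canonical(sentence)
print(fact)
```

A real grammar would cover many sentence types; the point here is only that once the fact is in canonical form, turning it into a test is mechanical.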

Example of a rule in a grammar • Suppose you have two structurally equivalent sentences: The box is on the counter. The glass is under the counter. • They would be translated into a rule in a grammar as follows: NounPhrase is Preposition NounPhrase
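To make the rule concrete, here is a small sketch of matching both example sentences against the pattern NounPhrase "is" Preposition NounPhrase. The lexicon and the Determiner-Noun definition of a noun phrase are assumptions made for this example; they are not taken from SIFT's grammar.

```python
# Tiny hypothetical lexicon; a real grammar would cover the manual's vocabulary.
DETERMINERS = {"the", "a", "an"}
NOUNS = {"box", "glass", "counter"}
PREPOSITIONS = {"on", "under"}

def parse_noun_phrase(words, i):
    """Match Determiner Noun at index i; return (noun, next_index) or None."""
    if i + 1 < len(words) and words[i] in DETERMINERS and words[i + 1] in NOUNS:
        return words[i + 1], i + 2
    return None

def match_rule(sentence):
    """Apply NounPhrase 'is' Preposition NounPhrase to one sentence."""
    words = sentence.lower().rstrip(".").split()
    first = parse_noun_phrase(words, 0)
    if first is None:
        return None
    subject, i = first
    if i + 1 >= len(words) or words[i] != "is" or words[i + 1] not in PREPOSITIONS:
        return None
    prep = words[i + 1]
    second = parse_noun_phrase(words, i + 2)
    if second is None or second[1] != len(words):
        return None
    return (prep, subject, second[0])

for s in ["The box is on the counter.", "The glass is under the counter."]:
    print(match_rule(s))
```

Both sentences match the same rule and differ only in the words that fill its slots, which is why one grammar rule can cover a whole sentence type.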

When can SIFT be used? • On long-term projects where documentation will go through many versions • On semi-structured documents that are organized in a predefined way • On documents written in a consistent style • In domains that have many similar semantic entities (example: methods that have arguments)

Experimental Results • SIFT was used to extract information from an operating system’s reference manual • The developers had identified a total of 174 tests • SIFT found 25 of the 174, or about 14%

Final thoughts • It is only a proof-of-concept testing tool, but it has potential to save developers time on trivial test cases • I think the natural-language approach is error-prone and costly because people may not follow a consistent writing style • Deciding on a standard template that limits the choices of structure in a document might be more useful, since people will be forced to follow the standard and it is less likely that tests will be missed because of an inconsistent writing style
