VisualText Hongyu Wang
What is VisualText? • An IDE (Integrated Development Environment) for creating text analyzers • Supports both shallow and deep NLP (Natural Language Processing) • Comes with the integrated NLP++ language and a knowledge base management system (KBMS)
Why an IDE? • An IDE supports, organizes, and dramatically accelerates the construction of accurate, fast, extensible, and maintainable text analyzers. • It helps to integrate knowledge bases, expert systems, experts, planners, and other artificial intelligence systems with the overall framework.
What is a text analyzer? • Any program that takes text as input and produces an output result • Examples include: • programs that extract information to populate structured business databases • web page categorizers • email routers and autoresponders • chat managers • analyzers for text-from-speech • grammar checkers and spell checkers
Shallow processing • Primary technologies are: • statistical NLP (e.g., Bayesian methods) • probabilistic methods (e.g., Hidden Markov Models) • neural networks (NN) • machine learning (ML) • These technologies enable systems to be constructed automatically, or with a simpler setup than handcrafting an NLP system requires.
Limitations of shallow processing • Cannot support: • high precision and recall in many applications • deep, accurate, and complete analysis and understanding of text • Implementation is relatively cut-and-dried, and therefore competition is fierce among companies
Shallow vs. Deep • With shallow processing, a person still has to read every text of interest. • Deep processing can convert text to a database or other structured representation, so that people need not read the voluminous text. Instead, they can query a database to get precise answers immediately, for example "list all the acquisitions in the last quarter valued at between $10M and $50M."
What is the NLP++ language? • A general purpose programming language that integrates extensions for NLP. • Supports Concept Oriented Programming (COP), enabling you to focus on heuristics, the domain, and the task, rather than the underlying data structures. • Evolving, but already rich and robust.
About KBMS • The Knowledge Base Management System (KBMS) provides a permanent store for linguistic, conceptual, and domain knowledge. • The KBMS serves as a flexible framework in which users may implement arbitrary representation schemes.
Its uses • Useful to anyone who needs to process documents, reports, web pages, email, chats, and other communications. • VisualText is ideal for text analysis applications to combat terrorism, narcotics, espionage, and nuclear proliferation.
Analyzer Sequence • The analyzer sequence is a series of steps or passes, each containing its own pass algorithm. • There are two types of passes: • System Passes • User Passes
Passes • As the analyzer is run over the text, each pass is taken in the order it occurs in the sequence and executes the code and rules that are contained in it. • Analyzer passes use and modify a common data structure, called a parse tree.
Pass Types • Pattern - For each node in the currently selected phrase, each rule within the pass file is tried in turn. When a rule matches, the pattern algorithm executes its associated code and actions, then moves on to the node following those that matched the current rule. • Recursive - The nodes of the current phrase are in effect traversed repeatedly, as long as a rule in the current pass file matches. When no rules match in a phrase, the next phrase in the parse tree is selected, until no more phrases remain to select.
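The two pass algorithms described above can be approximated in Python as follows. This is a rough model under stated assumptions, not the VisualText implementation: a phrase is a flat list of node names, a rule is a hypothetical `(suggested, elements)` pair, and the function names `pattern_pass` and `recursive_pass` are invented for illustration.

```python
def pattern_pass(phrase, rules):
    """One left-to-right sweep: at each node, try each rule in turn."""
    out, i = [], 0
    while i < len(phrase):
        for sugg, elts in rules:
            # A rule matches when its elements equal the nodes starting at i.
            if elts and phrase[i:i + len(elts)] == elts:
                out.append(sugg)      # reduce the matched nodes to one node
                i += len(elts)        # continue after the matched nodes
                break
        else:
            out.append(phrase[i])     # no rule matched here; keep the node
            i += 1
    return out


def recursive_pass(phrase, rules):
    """Repeat the sweep for as long as some rule still matches.

    Assumption: the rule set is not cyclic (e.g. A <- B and B <- A),
    otherwise this simple loop would never terminate.
    """
    while True:
        reduced = pattern_pass(phrase, rules)
        if reduced == phrase:
            return phrase
        phrase = reduced


# Hypothetical rules: _np <- _adj _np @@   and   _np <- _noun @@
rules = [("_np", ["_adj", "_np"]), ("_np", ["_noun"])]
phrase = ["_adj", "_adj", "_noun"]
once = pattern_pass(phrase, rules)
repeated = recursive_pass(phrase, rules)
```

With these rules a single pattern sweep only reduces `_noun`, while the recursive variant keeps sweeping until the whole phrase collapses into one `_np` node.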
Parse Tree • The parse tree is a data structure that tracks the patterns that have matched within the input text. • By default, the first pass in the analyzer sequence is tokenize, a system pass that converts a stream of characters to a parse tree in which alphabetic, numeric, whitespace, and punctuation characters are grouped into units called nodes.
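A minimal Python sketch of what such a tokenize pass might produce. The `Node` class, the `kind` labels, and the choice to keep each whitespace or punctuation character as its own node are assumptions made for illustration; the actual VisualText tokenizer may group characters differently.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str                     # "_ROOT" or the token text
    kind: str = "nonliteral"      # hypothetical labels: alpha, num, white, punct
    children: list = field(default_factory=list)


def char_class(ch):
    """Classify a single character into one of four token kinds."""
    if ch.isalpha():
        return "alpha"
    if ch.isdigit():
        return "num"
    if ch.isspace():
        return "white"
    return "punct"


def tokenize(text):
    """Build a flat parse tree: _ROOT with one child node per token."""
    root = Node("_ROOT")
    i = 0
    while i < len(text):
        kind = char_class(text[i])
        j = i + 1
        # Assumption: only runs of letters or digits are merged into one token.
        if kind in ("alpha", "num"):
            while j < len(text) and char_class(text[j]) == kind:
                j += 1
        root.children.append(Node(text[i:j], kind))
        i = j
    return root


tree = tokenize("Call 555-1234 now")
tokens = [child.name for child in tree.children]
```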
Parse Tree Terminology • The root is the top-level node of the parse tree and is named _ROOT. The nodes directly below it are children of the root, and the root is their parent. • Levels of nodes in the parse tree are indicated by indentation. • The nodes in a phrase, or list of children, are all siblings of each other. • A token, literal, leaf, or terminal node represents literal text. A nonterminal (or abstract) node is one with descendants. We say that a nonterminal node dominates its descendant nodes.
How to add Passes? • Automatically Generating Passes: Automated RUle Generation (RUG) creates passes and rules for you automatically, so that you don't have to worry about the details of rule writing. • Hand Building Passes: Write NLP++ code and rules in the pass file used by the pass.
About RUG • Best used for word and phrase-level constructs that are modular and well defined, e.g., telephone numbers. • Less useful for completely open ended or nonproductive constructs. For example, RUG might not be very useful for storing and identifying many instances of full names, such as "John Smith."
Pass Files • A pass file holds the NLP++ code and rules associated with a Pattern or Recursive pass type. Five zones divide a pass file and are ordered as follows: • Declare Zone • Code Zone • Context Zone • Minipass Zone • Grammar Zone
Declare Zone • Contains user-defined NLP++ function definitions • Consists of a single region, the DECL Region • Delimited by a begin mark @DECL and optionally by a terminal mark @@DECL
Code Zone • Contains NLP++ code that is independent of rule matching. • May optionally contain a single CODE Region. • Delimited by @CODE and optionally by @@CODE
CODE Region • NLP++ code in the CODE Region is executed prior to any rule matching for the current pass • exitpass() function is useful for conditionally and immediately terminating the current pass, without executing any rules.
Context Zone • Selects nodes of the parse tree. Rules will match only against the phrases immediately under the selected nodes. • Each selected node, or context node, serves as a locus for rule matching. • A single SELECT Region may be specified. • Delimited by @SELECT and optionally by @@SELECT.
SELECT Region • The SELECT Region may include at most one selector (@NODES, @PATH, or @MULTI). • With the selectors @NODES or @PATH, the phrase immediately under the selected node is subjected to rule matching. • With the @MULTI selector, every phrase in the subtree of the selected node is subjected to rule matching.
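The difference between selecting only the phrase immediately under a context node and the @MULTI behaviour can be sketched in Python. The `Node` class and the `select` helper are hypothetical and only model the traversal described above; they are not part of NLP++.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)


def find(node, name):
    """Yield every node in the tree with the given name (depth-first)."""
    if node.name == name:
        yield node
    for child in node.children:
        yield from find(child, name)


def phrases_under(node):
    """Yield every phrase (non-empty list of children) in node's subtree."""
    if node.children:
        yield [c.name for c in node.children]
    for child in node.children:
        yield from phrases_under(child)


def select(root, name, multi=False):
    """Return the phrases that would be subjected to rule matching."""
    result = []
    for ctx in find(root, name):
        if multi:
            result.extend(phrases_under(ctx))      # whole subtree
        elif ctx.children:
            result.append([c.name for c in ctx.children])  # one level only
    return result


tree = Node("_ROOT", [
    Node("_LINE", [
        Node("_name", [Node("John"), Node("Smith")]),
        Node("called"),
    ]),
])
single = select(tree, "_LINE")
multi = select(tree, "_LINE", multi=True)
```

Here `single` holds only the phrase directly under `_LINE`, while `multi` also includes the phrase nested under `_name`.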
Minipass Zone • Contains nested minipasses called Recurse Regions. • Each Recurse Region is named and contains a Grammar Region. • Rule elements in the Grammar Zone can invoke these named rule sets.
Grammar Zone • Contains the main Grammar Region of the current pass file. • Place for the main rules. • Contains zero or more Grammar Regions • Does not have a distinguishing marker
Grammar Region • Consists of one or more @RULES markers with associated rules and optionally with associated action regions. • Action Regions are: • @PRE - Actions that represent additional conditions on the matching of individual rule elements. NOTE: Does not allow general NLP++ code. • @CHECK - When a rule has matched, NLP++ code in this region checks for self-consistency and/or builds semantic data. A rule match may still be rejected from this region. • @POST - NLP++ code that operates once a rule match has been accepted.
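The order in which the three action regions take effect can be modelled roughly in Python. Everything here is an assumption made for illustration: the dictionary layout of a rule, the `apply_rule` name, and the use of plain strings as nodes. It only mirrors the sequence described above (per-element conditions, then a check that may still reject the match, then post actions).

```python
def apply_rule(nodes, rule):
    """Try one rule at the start of `nodes`; return the new node or None."""
    elts = rule["phrase"]
    if len(nodes) < len(elts):
        return None
    matched = nodes[:len(elts)]

    # Element matching plus @PRE-style extra conditions on single elements.
    for pos, (node, elt) in enumerate(zip(matched, elts)):
        if not elt(node):
            return None
        pre = rule.get("pre", {}).get(pos)
        if pre and not pre(node):
            return None

    # @CHECK-style step: the match as a whole may still be rejected.
    check = rule.get("check")
    if check and not check(matched):
        return None

    # @POST-style step: runs only once the match has been accepted.
    sugg = {"name": rule["sugg"], "children": matched}
    post = rule.get("post")
    if post:
        post(sugg, matched)
    return sugg


# Hypothetical phone-number rule: three digits, a dash, four digits.
phone_rule = {
    "sugg": "_phone",
    "phrase": [str.isdigit, lambda t: t == "-", str.isdigit],
    "pre": {0: lambda t: len(t) == 3, 2: lambda t: len(t) == 4},
    "check": lambda m: m[0] != "000",
    "post": lambda s, m: s.update(text="".join(m)),
}

accepted = apply_rule(["555", "-", "1234"], phone_rule)
rejected_by_pre = apply_rule(["55", "-", "1234"], phone_rule)
rejected_by_check = apply_rule(["000", "-", "1234"], phone_rule)
```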
Rule Syntax • All rules are written in the RULES Region of a pass file. • Any number of rules can occur. • The syntax for a rule is _SUGG <- _PHRASE @@
Rule Elements • _SUGG • referred to as the suggested rule element. • It is the node that is created if a rule matches. • often referred to as the left-hand side of the rule.
Rule Elements (contd.) • _PHRASE • the phrase of elements that the rule matcher is trying to match. • If the rule matcher finds a match for the listed phrase, the matched nodes are reduced to the node specified in _SUGG. • often referred to as the right-hand side of the rule.
Rule Elements (contd.) • <- is the rule arrow. It divides the rule between the suggested element and the phrase of rule elements. • @@ indicates the end of a rule. Rules must always end in the @@ marker.
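As a small illustration of the rule layout just described, the following Python sketch splits a rule string into its suggested element and its phrase. The `parse_rule` helper and the example rule are hypothetical, and the sketch ignores anything beyond the bare `_SUGG <- _PHRASE @@` form.

```python
def parse_rule(text):
    """Split '_SUGG <- elt elt ... @@' into (suggested, [elements])."""
    text = text.strip()
    if not text.endswith("@@"):
        raise ValueError("a rule must end with the @@ marker")
    body = text[:-2]
    if "<-" not in body:
        raise ValueError("a rule needs the <- rule arrow")
    sugg, phrase = body.split("<-", 1)
    sugg = sugg.strip()
    if not sugg:
        raise ValueError("missing suggested element")
    return sugg, phrase.split()


# Hypothetical rule, used only to exercise the parser.
suggested, elements = parse_rule("_fullName <- _firstName _lastName @@")
```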
Rule Element Types • Each rule element, including the suggested element, is either a literal or a nonliteral. What gets matched for each is different: • Literals match tokens, or "real text". • Nonliterals match nodes whose names begin with an underscore. • A literal is a terminal node (or leaf node) in the parse tree. It represents words, numbers, and punctuation. • A nonliteral is a nonterminal node (also called an internal node). It represents a node which dominates other nodes.
Special Rule Elements • NLP++ has a set of predefined rule elements that make writing rules easier. • These special rule elements match various types of tokens such as alpha characters, punctuation, wildcards, etc.
Special Rule Elements (contd.) • _xWILD Match anything. • _xANY Match any single node. • _xNIL Match nothing. It designates a suggested element when the rule performs a special action, such as removing the matched nodes from the parse tree. • _xALPHA Match an alphabetic token, including accented and other extended ANSI characters.
Special Rule Elements (contd.) • _xCTRL Match control and nonalphabetic extended ANSI characters. (See _xALPHA.) • _xNUM Match a numeric token. • _xPUNCT Match a punctuation token. • _xWHITE Match a whitespace token, including newline. • _xBLANK Match a whitespace token, excluding newline.
Special Rule Elements (contd.) • _xCAP Match an alphabetic token whose first letter is uppercase. • _xEOF Match the end of file. • _xSTART Match if at the start of a phrase (or "segment"). • _xEND Match if at the end of a phrase (or "segment").
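A few of the special rule elements listed above can be approximated as predicates over a token's text. This Python sketch is an assumption-laden model, not the NLP++ matcher: the `MATCHERS` table and `element_matches` function are invented here, Python's Unicode character classes stand in for the ANSI classes, and position-dependent elements (_xWILD, _xSTART, _xEND, _xEOF, _xNIL) are left out.

```python
# Hypothetical predicate table: special element name -> test on token text.
MATCHERS = {
    "_xANY":   lambda t: True,
    "_xALPHA": lambda t: t.isalpha(),
    "_xNUM":   lambda t: t.isdigit(),
    "_xPUNCT": lambda t: len(t) == 1 and not t.isalnum() and not t.isspace(),
    "_xWHITE": lambda t: t.isspace(),                    # newline included
    "_xBLANK": lambda t: t.isspace() and "\n" not in t,  # newline excluded
    "_xCAP":   lambda t: t.isalpha() and t[0].isupper(),
}


def element_matches(element, token):
    """Special elements use their predicate; anything else is a literal."""
    matcher = MATCHERS.get(element)
    if matcher is not None:
        return matcher(token)
    return element == token


checks = [
    element_matches("_xCAP", "John"),
    element_matches("_xCAP", "john"),
    element_matches("_xWHITE", "\n"),
    element_matches("_xBLANK", "\n"),
    element_matches("the", "the"),
]
```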
Variables • There are two broad classes of variables in NLP++: • general variables • parse tree node variables
General variables • Global variables • G(varname) • Local variables • L(varname) • used to specify parameter lists; they appear within user-defined NLP++ function definitions in the @DECL Region
Parse tree node variables • Attached to particular nodes in the parse tree, and serve to decorate the tree with semantic information. • The special functions N, X, and S refer to parse tree nodes in the context of a matched rule
Parse tree node variables (contd.) • S(varname) - variable belonging to the suggested concept of a rule. • X(varname [,num]) - variable belonging to the num-th context node, counting from the root of the parse tree. • N(varname [,num]) - variable belonging to a node that matched the num-th element of a rule phrase. • When num is omitted, N(varname) and X(varname) refer to the last element of the phrase and the last context node, respectively.
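The lookup behaviour of S, X, and N can be mimicked in Python to make the defaults concrete. The `RuleMatch` class is hypothetical, each node is modelled as a plain dictionary of variables, and 1-based numbering for `num` is an assumption of this sketch.

```python
class RuleMatch:
    """Hypothetical container for one matched rule."""

    def __init__(self, context_path, matched, suggested):
        self.context_path = context_path  # context nodes, root first
        self.matched = matched            # nodes matching the phrase, in order
        self.suggested = suggested        # node created for the suggested element

    def S(self, var):
        """Variable on the suggested node."""
        return self.suggested.get(var)

    def X(self, var, num=None):
        """Variable on the num-th context node (default: the last one)."""
        node = self.context_path[-1] if num is None else self.context_path[num - 1]
        return node.get(var)

    def N(self, var, num=None):
        """Variable on the node matching the num-th element (default: last)."""
        node = self.matched[-1] if num is None else self.matched[num - 1]
        return node.get(var)


# Illustrative data only; the variable names are made up.
m = RuleMatch(
    context_path=[{"label": "_ROOT"}, {"label": "_LINE", "lineno": 3}],
    matched=[{"text": "John"}, {"text": "Smith"}],
    suggested={"label": "_fullName"},
)
first, last = m.N("text", 1), m.N("text")
```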
Reference • http://www.textanalysis.com/help/help.htm • http://www.textanalysis.com/TAI-IDE-WP.pdf