ANNIC is a full-featured annotation indexing and search engine developed as part of GATE and powered by Apache Lucene. It allows flexible querying of linguistic metadata and document content, and supports indexing and extraction of information from overlapping annotations and features. ANNIC can be used for rule development in NLP systems, where it enables the discovery and testing of patterns in corpora.
ANNIC: ANNotations In Context GATE Training Course, 27–28 April 2006 Niraj Aswani
Motivation - I • The need for efficient corpus indexing and querying arises frequently in both machine-learning-based and human-engineered NLP systems. • Language engineers use their intuition when writing patterns, trying to strike the ideal balance between specificity and coverage. This requires a series of informed guesses, which are then validated by testing the resulting rule set over a corpus. (Isn’t it painful?)
Motivation - II • We need a system that allows querying the information contained in a corpus in more flexible ways than simple full-text search (e.g. identifying share movements like “BT shares ended up 36p”). • Required: a system that can index and query both linguistic metadata and document content in a flexible way, and that also allows the derived rule set to be validated with minimal effort.
ANNIC - ANNotations In Context • Description: Full-featured annotation indexing and search engine, developed as part of GATE • Powered by? Apache Lucene technology • What can be indexed? Documents in any format supported by GATE (e.g. XML, HTML, RTF, e-mail, plain text) • Indexing of linguistic metadata: Extensive indexing of document content and the linguistic information (annotations and features) associated with it, independent of document format
ANNIC - ANNotations In Context • What is special? Indexing and extraction of information from overlapping annotations and features • Result? Matching texts in the corpus, displayed within the context of linguistic annotations (and not just text, as is customary for KWIC systems) • Interface? An advanced GUI provides a graphical view of annotation mark-up over the text, along with the ability to build new queries interactively • Where to use? Can be used as a first step in rule development for NLP systems, as it enables the discovery and testing of patterns in corpora
GATE Documents • The format of a document is analysed and converted into a single unified model of annotations. • Documents and corpora are encoded in the form of annotations. • The annotations associated with each document are a structure central to GATE. • Each annotation consists of: - a start offset - an end offset - a set of features, each with a name and an associated value • Various processing resources are available to annotate documents.
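The annotation model on this slide can be sketched in a few lines of Python. This is an illustration only, not GATE's Java API: the class name `Annotation`, the helpers `covered_text` and `overlaps`, and the sample offsets are all hypothetical. The `type` field is not listed on the slide; it is assumed here because patterns such as {Person} refer to annotations by type.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """Hypothetical stand-in for an annotation: offsets plus named features."""
    type: str                      # assumed field, e.g. "Token" or "Organization"
    start: int                     # start offset into the document text
    end: int                       # end offset (exclusive)
    features: dict = field(default_factory=dict)  # feature name -> value


def covered_text(text, ann):
    """Return the span of document text that the annotation covers."""
    return text[ann.start:ann.end]


def overlaps(a, b):
    """True if two annotations share at least one character position."""
    return a.start < b.end and b.start < a.end


text = "BT shares ended up 36p"
annotations = [
    Annotation("Token", 0, 2, {"string": "BT"}),
    Annotation("Organization", 0, 2),
    Annotation("Token", 3, 9, {"string": "shares"}),
]

for ann in annotations:
    print(ann.type, repr(covered_text(text, ann)), ann.features)
```

Because annotations refer to the text only through offsets, several of them (here a Token and an Organization) can cover the same span without interfering with each other.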
The Pattern Syntax • JAPE – Java Annotation Patterns Engine in GATE - It executes JAPE grammar phases; each phase consists of regular-expression pattern/action rules over annotations - The LHS represents an annotation pattern, e.g. {Title}{Token.orth == "upperInitial"} - The RHS describes the action to be taken when the pattern is found, e.g. annotate the matched text as a Person • ANNIC allows documents to be indexed with their annotations and features, and lets users issue queries that consist of the LHS part of a JAPE pattern/action rule, e.g. {Person} {Token.string == "from"} {Organization}
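To make the idea of an LHS-style query concrete, here is a minimal Python sketch that matches a sequence of constraints against a list of annotations. It is not how JAPE or ANNIC is implemented. The dictionary layout, the function names `satisfies` and `find_matches`, the sample sentence, and the adjacency rule (the next annotation must start at the next base-token boundary) are all simplifying assumptions.

```python
def satisfies(ann, constraint):
    """A constraint is (type, feature, value); feature may be None."""
    ann_type, feature, value = constraint
    if ann["type"] != ann_type:
        return False
    return feature is None or ann["features"].get(feature) == value


def find_matches(annotations, pattern, base_type="Token"):
    """Return every list of consecutive annotations that satisfies the pattern."""
    token_starts = sorted(a["start"] for a in annotations if a["type"] == base_type)

    def next_position(offset):
        # First base-token start at or after the given offset, if any.
        for start in token_starts:
            if start >= offset:
                return start
        return None

    def extend(position, remaining):
        if not remaining:
            return [[]]            # whole pattern consumed: one (empty) tail
        if position is None:
            return []              # ran out of text before the pattern ended
        results = []
        for ann in annotations:
            if ann["start"] == position and satisfies(ann, remaining[0]):
                for tail in extend(next_position(ann["end"]), remaining[1:]):
                    results.append([ann] + tail)
        return results

    matches = []
    for start in token_starts:
        matches.extend(extend(start, pattern))
    return matches


def ann(type_, start, end, **features):
    return {"type": type_, "start": start, "end": end, "features": features}


# Hypothetical sentence: "John Smith from Acme Corp"
annotations = [
    ann("Token", 0, 4, string="John"), ann("Token", 5, 10, string="Smith"),
    ann("Token", 11, 15, string="from"), ann("Token", 16, 20, string="Acme"),
    ann("Token", 21, 25, string="Corp"),
    ann("Person", 0, 10), ann("Organization", 16, 25),
]

# Corresponds to the query {Person} {Token.string == "from"} {Organization}
pattern = [("Person", None, None), ("Token", "string", "from"), ("Organization", None, None)]
matches = find_matches(annotations, pattern)
print([[a["type"] for a in m] for m in matches])
```

Note that the Person and Organization annotations each span two tokens, which is why matching works on offsets rather than on token positions alone.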
Kleene Operators • ANNIC supports two Kleene operators, “+” and “*”, each followed by an upper bound n • ({A})+n matches one to n occurrences of annotation {A} • ({A})*n matches zero to n occurrences of annotation {A} • Also supports the | (OR) operator • {A}({B} | {C}) is equivalent to {A}{B} | {A}{C} • {A}({B} | {C})+2 is equivalent to ({A}({B} | {C})) | ({A}({B} | {C})({B} | {C})), which expands to ({A}{B}) | ({A}{C}) | ({A}{B}{B}) | ({A}{B}{C}) | ({A}{C}{B}) | ({A}{C}{C})
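The expansion of bounded Kleene operators and OR into plain alternatives can be reproduced with a short Python sketch. The helper names (`lit`, `alt`, `seq`, `repeat`, `render`) are hypothetical, and the code only illustrates the equivalences on this slide; it makes no claim about how ANNIC expands queries internally.

```python
def lit(name):
    """A single annotation, as a set holding one one-element sequence."""
    return {(name,)}


def alt(*options):
    """OR: the union of all alternative sequences."""
    combined = set()
    for option in options:
        combined |= option
    return combined


def seq(*parts):
    """Concatenation: every way of picking one sequence from each part."""
    result = {()}
    for part in parts:
        result = {left + right for left in result for right in part}
    return result


def repeat(part, n, allow_zero=False):
    """+n (one to n occurrences), or *n (zero to n) when allow_zero is True."""
    expanded = {()} if allow_zero else set()
    current = {()}
    for _ in range(n):
        current = seq(current, part)
        expanded |= current
    return expanded


def render(sequences):
    """Format each sequence in JAPE-like notation, sorted for stable output."""
    return sorted("".join("{%s}" % name for name in s) for s in sequences)


# {A}({B} | {C})+2
expansion = seq(lit("A"), repeat(alt(lit("B"), lit("C")), 2))
print(render(expansion))
# ['{A}{B}', '{A}{B}{B}', '{A}{B}{C}', '{A}{C}', '{A}{C}{B}', '{A}{C}{C}']
```

The six alternatives printed are the same six listed on the slide. Passing `allow_zero=True` gives the `*n` behaviour, where the empty repetition is also allowed.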
ANNIC PRs • ANNIC Index PR • Allows indexing document content and metadata from a given corpus • Parameters • Corpus (serialized corpus) • Base token annotation type (e.g. Token) • Annotation features to be excluded (e.g. SpaceToken) • Index location
ANNIC PRs • ANNIC Search PR • Allows searching over indexed documents • Parameters • Corpus (serialized corpus) OR one or more index locations • Limit (maximum number of patterns) • Context window (number of base tokens to show as context on each side, left and right) • Query (JAPE LHS pattern)
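The roles of the limit and context window parameters can be illustrated with a small Python sketch. It uses plain token matching over a list of strings, not ANNIC's Lucene-backed search; the functions `in_context` and `search` and the sample sentence are hypothetical.

```python
def in_context(tokens, start, end, window):
    """Split a hit into (left context, match, right context) by base tokens."""
    left = tokens[max(0, start - window):start]
    match = tokens[start:end]
    right = tokens[end:end + window]
    return left, match, right


def search(tokens, phrase, window=2, limit=10):
    """Find up to `limit` occurrences of `phrase`, each with its context."""
    hits = []
    size = len(phrase)
    for i in range(len(tokens) - size + 1):
        if tokens[i:i + size] == phrase:
            hits.append(in_context(tokens, i, i + size, window))
            if len(hits) >= limit:
                break
    return hits


tokens = "BT shares ended up 36p in late trading".split()
for left, match, right in search(tokens, ["ended", "up"], window=2, limit=1):
    print(" ".join(left), "[" + " ".join(match) + "]", " ".join(right))
# BT shares [ended up] 36p in
```

The context is counted in base tokens on each side of the match, and it is simply truncated when the match sits near the start or end of the text.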
ANNIC • DEMO • QUESTIONS
Thank You! This talk: http://gate.ac.uk/sale/talks/gate-course-apr06/annic.ppt