ATLAS: Architecture and Tools for Linguistic Analysis Systems
What is ATLAS? The ATLAS framework provides an architecture targeted at facilitating the development of linguistic applications. The principal goal of ATLAS is to provide an abstraction over the diversity of linguistic annotations.
Why do you need it? “Annotated corpora are a central component of research in human language technology. As corpora have proliferated across languages, disciplines, and technologies, the lack of common exchange and storage formats has become a critical problem. This profusion of formats has made reusing annotated data or adapting existing tools for new annotation tasks significantly more difficult.” (Quote from “A Practical Introduction to ATLAS”)
Introduction This problem is the motivation for the creation of ATLAS. Existing tools generally implement a two-level architecture consisting of an application level and a physical level, both of which are task-specific.
Introduction ATLAS has been created to add an intermediate logical level, independent of the application and the physical level. The logical level is based around the notion of an “annotation graph”.
How it began ATLAS began as a collaboration between NIST, MITRE and LDC in 1999 following Bird and Liberman’s work on Annotation Graphs.
NIST The National Institute of Standards and Technology. NIST's mission is to develop and promote measurement, standards, and technology to enhance productivity, and facilitate trade.
NIST For several years, NIST has been organizing and implementing language technology evaluations for several research domains. NIST realized the need for a more abstract, open-ended transcription format that could accommodate unforeseen changes: one that is both domain-independent and permits any conceivable extension.
MITRE MITRE is a not-for-profit national resource that provides systems engineering, research and development, and information technology support to the government.
MITRE MITRE has experience in language technology development and annotation and visualization using the Alembic Workbench.
LDC The Linguistic Data Consortium supports language-related education, research and technology development by creating and sharing linguistic resources: data, tools and standards.
LDC LDC projects • AGTK: Annotation Graph Toolkit • TalkBank: research in the study of human and animal communication • DASL: investigates best practices in the use of digital speech corpora in the study of language variation • English Gigaword: 12 gigabytes of data, over a billion words.
Annotation Graphs LDC moved on to implement Annotation Graphs – also called AGs or ATLAS level 0 – to address immediate needs in annotation infrastructure for linear signals.
Annotation Graphs Annotation graphs are a “formal framework for representing linguistic annotations of time series data. Annotation graphs abstract away from file formats, coding schemes and user interfaces, providing a logical layer for annotation systems” (http://agtk.sourceforge.net/)
Annotation Graphs Generalized Model The Annotation Graph model addresses annotation for one-dimensional signals (text, audio). It consists of: • intervals specified with start and end nodes • annotations specified as labeled arcs between nodes • collection of annotations => annotation graph
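The three ingredients listed above (nodes, labeled arcs, and the collection of arcs) can be sketched in a few lines of Python. This is only an illustrative model, not AGTK's actual API: the class names `Node`, `Arc`, and `AnnotationGraph`, the `covering` helper, and the time offsets are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """A point on the signal's timeline (hypothetical: offset in seconds)."""
    node_id: str
    offset: float


@dataclass(frozen=True)
class Arc:
    """An annotation: a labeled arc from a start node to an end node."""
    start: Node
    end: Node
    kind: str   # e.g. "word" or "sentence"
    label: str  # the annotation content


@dataclass
class AnnotationGraph:
    """A collection of annotations over one linear signal."""
    arcs: list = field(default_factory=list)

    def annotate(self, start, end, kind, label):
        if end.offset < start.offset:
            raise ValueError("end node precedes start node")
        arc = Arc(start, end, kind, label)
        self.arcs.append(arc)
        return arc

    def covering(self, offset):
        """Return the arcs whose interval contains the given offset."""
        return [a for a in self.arcs if a.start.offset <= offset <= a.end.offset]


# Made-up offsets for a two-word utterance.
n0, n1, n2 = Node("n0", 0.00), Node("n1", 0.32), Node("n2", 0.58)
graph = AnnotationGraph()
graph.annotate(n0, n1, "word", "she")
graph.annotate(n1, n2, "word", "had")
graph.annotate(n0, n2, "sentence", "she had")
labels_at_0_4 = [a.label for a in graph.covering(0.4)]
```

Because the word arcs share the node `n1`, the graph records that "she" ends exactly where "had" begins, and the sentence arc spans both words without duplicating any timing information.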
Annotation Graph Example XML example
ATLAS The current ATLAS is like AGs but supports the annotation of a greater variety of data. ATLAS consists of the following four components: • A data model • Application Programming Interface (API) • ATLAS Interchange Format (AIF) • Meta-Annotation Infrastructure for ATLAS (MAIA)
ATLAS - 3 Levels ATLAS, as briefly mentioned earlier, consists of three levels: application, logical and physical.
ATLAS - Logical The logical layer consists of a linguistic formalism and an API. • The formalism is the annotation graph model and its generalization to higher-dimensional cases. • The API defines a set of procedures for creating, modifying, searching and storing well-formed annotation sets.
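To make the four kinds of API procedures concrete, here is a minimal Python sketch of an in-memory annotation set. It is not the ATLAS API: the class `AnnotationSet`, its method names, the dictionary layout, and the JSON output (standing in for a real storage back end) are all assumptions chosen for illustration.

```python
import json


class AnnotationSet:
    """Hypothetical in-memory store with create / modify / search / store."""

    def __init__(self):
        self._annotations = {}
        self._next_id = 1

    def create(self, kind, start, end, content):
        # Reject annotations that are not well-formed.
        if start > end:
            raise ValueError("not well-formed: start is after end")
        ann_id = f"a{self._next_id}"
        self._next_id += 1
        self._annotations[ann_id] = {
            "type": kind, "start": start, "end": end, "content": content,
        }
        return ann_id

    def modify(self, ann_id, **changes):
        # Validate the updated record before committing it.
        updated = {**self._annotations[ann_id], **changes}
        if updated["start"] > updated["end"]:
            raise ValueError("not well-formed: start is after end")
        self._annotations[ann_id] = updated

    def search(self, kind):
        """Return the ids of all annotations of the given type."""
        return sorted(i for i, a in self._annotations.items() if a["type"] == kind)

    def store(self):
        """Serialize the set; JSON is only a stand-in for a storage format."""
        return json.dumps(self._annotations, sort_keys=True)


aset = AnnotationSet()
first = aset.create("word", 453, 497, "she")
aset.create("word", 497, 530, "had")
aset.modify(first, content="She")
word_ids = aset.search("word")
serialized = aset.store()
```

The point of the sketch is the separation of concerns: callers only see the four operations, so the serialization inside `store` could be swapped for another back end without touching application code.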
ATLAS - Physical The API specification will allow various physical storage implementations that applications are free to access in multiple ways. There are two dominant storage strategies: • ATLAS Interchange Format (AIF), an XML Interchange Format • RDBMS (Relational Database Management Systems) accessible from ODBC-compliant calls
ATLAS - Application This part is left up to the developer. Any application that can read, manipulate, or annotate ATLAS data would fall under this category.
Graphical representation of the three layers • Applications: Visualization and Exploration, Extraction Systems, Annotation Tools, Query Systems, Automatic Aligners, Evaluation Software, Conversion Tools • ATLAS Logical Level: ATLAS Core, ATLAS API • ATLAS Physical Level: RDB, AIF files
ATLAS Generalized Model The generalized model has been designed to accommodate non-linear signals such as images, whereas the Annotation Graph model focuses on one-dimensional data sources.
ATLAS Constructs The ATLAS model includes five primary constructs. • Signal • Annotation • Region • Content • Anchor
ATLAS Constructs Annotation sources are represented in ATLAS by the Signal construct. An ATLAS Signal is an immutable, N-dimensional space containing phenomena that are the target of Annotations.
ATLAS Constructs Groups of related Signals can be formed to create targets for Annotations spanning several signals. The SignalGroup construct is used to model this grouping.
ATLAS Constructs Signals of a SignalGroup must share common dimensions. Example: creating a signal group of a left and right audio track, which are each of the same length.
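The shared-dimensions rule can be illustrated with a small Python sketch. The `Signal` and `SignalGroup` classes below are hypothetical stand-ins, not the jATLAS classes, and the way dimensions are encoded (a tuple of name/unit pairs) is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    """Immutable description of an annotation source (hypothetical fields)."""
    signal_id: str
    dimensions: tuple  # e.g. (("time", "msec"),)
    extent: tuple      # size along each dimension, e.g. (60000,)


class SignalGroup:
    """Groups Signals, refusing members whose dimensions differ."""

    def __init__(self, signals):
        signals = list(signals)
        if not signals:
            raise ValueError("a SignalGroup needs at least one Signal")
        first = signals[0]
        for s in signals[1:]:
            if s.dimensions != first.dimensions:
                raise ValueError(
                    f"{s.signal_id} does not share dimensions with {first.signal_id}"
                )
        self.signals = tuple(signals)
        self.dimensions = first.dimensions


# Left and right audio tracks of the same length share one time dimension.
time_axis = (("time", "msec"),)
left = Signal("audio-left", time_axis, (60000,))
right = Signal("audio-right", time_axis, (60000,))
stereo = SignalGroup([left, right])

# A two-dimensional page image cannot join a group of audio tracks.
page = Signal("page-image", (("x", "pixel"), ("y", "pixel")), (1700, 2200))
try:
    SignalGroup([left, page])
    mixed_ok = True
except ValueError:
    mixed_ok = False
```

Checking the rule in the constructor means an invalid group can never exist, so code that annotates across a group can rely on one common coordinate space.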
ATLAS Constructs Definition: An annotation is the fundamental act of associating some content with a region in a signal. An Annotation references a Region in a Signal and associates it with a Content element.
ATLAS Constructs A Region is an abstraction for identifying an area of the Signal space. Regions are delimited by a set of coordinates that mark specific areas of interest.
ATLAS Constructs Content elements contain information about the Regions that they point to. They can point to other data sources or describe the Regions with text.
ATLAS Constructs Anchors are used to mark the areas of interest in the specified source. Anchors are the only ties that annotations have to the physical structure of the signal. Both Region and Content constructs use Anchors.
ATLAS Constructs The Children construct is used to create relationships between annotations. Children constructs maintain a list of references to Annotations that are descendants of a parent Annotation.
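The relationships among these constructs can be sketched as plain Python data classes. This is a simplified illustration under stated assumptions, not the ATLAS object model itself: the field names are invented, Content is reduced to a text string (in ATLAS, Content can also use Anchors), and the millisecond values are illustrative.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Signal:
    signal_id: str
    mime_class: str


@dataclass(frozen=True)
class Anchor:
    """The only tie to the physical signal: a coordinate in that signal."""
    signal: Signal
    value: float
    unit: str


@dataclass(frozen=True)
class Region:
    """An area of the signal space, delimited by Anchors."""
    anchors: tuple


@dataclass(frozen=True)
class Content:
    """Simplified here to a text description of the Region."""
    text: str


@dataclass
class Annotation:
    kind: str
    region: Region
    content: Content
    children: list = field(default_factory=list)

    def descendants(self):
        """Yield every Annotation below this one, depth first."""
        for child in self.children:
            yield child
            yield from child.descendants()


audio = Signal("Audio1", "AUDIO")


def word(text, start, end):
    region = Region((Anchor(audio, start, "msec"), Anchor(audio, end, "msec")))
    return Annotation("word", region, Content(text))


she, had = word("She", 453, 497), word("had", 497, 530)
sentence = Annotation(
    "sentence",
    Region((she.region.anchors[0], had.region.anchors[1])),
    Content("She had"),
    children=[she, had],
)
child_texts = [a.content.text for a in sentence.descendants()]
```

The sentence Annotation reuses the first and last Anchors of its words, which shows how a parent's Region can be expressed in terms of the same signal coordinates as its Children.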
ATLAS Generalized Model • Annotation elements describe regions within signals with signal pointer(s) and content-bearing attributes:
<Annotation>
  <Source> <Region> … </Region> </Source>
  <Content> … </Content>
</Annotation>
(Diagram: an Annotation links a Region of a Signal to its Content.)
ATLAS Object Model The difference between the Annotation Graph object model and the ATLAS object model is the ‘region’ object. Multiple regions give us the ability to reference multiple dimensions. (Figure: simplified charts of the two object models, showing the primary objects and their relationships.)
Single Dimension Cases Annotation Graphs focus on single dimension data sources such as audio. ATLAS was created to support Annotation Graph data, but many of the restrictions defined in the AG data model had to be removed in order to fully generalize the ATLAS data model.
Single Dimension Cases In the one-dimensional case, regions are simply intervals, and there are various ways to specify these (endpoints, start-point plus offset, midpoint plus radius).
Single Dimension Example (Figure: a Sentence Annotation whose Children are Word Annotations with Content “She” and “had”; each Word Annotation has an Interval Region bounded by two Offset Anchors into the audio Signal.)
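The three ways of specifying an interval are interchangeable, which a short Python sketch can make explicit by normalizing each of them to a start and an end. The `Interval` class and its constructor names are assumptions for illustration, and the numbers are arbitrary millisecond values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A one-dimensional region, stored canonically as start and end."""
    start: float
    end: float

    def __post_init__(self):
        if self.end < self.start:
            raise ValueError("interval end precedes start")

    @classmethod
    def from_endpoints(cls, start, end):
        return cls(start, end)

    @classmethod
    def from_start_and_offset(cls, start, offset):
        return cls(start, start + offset)

    @classmethod
    def from_midpoint_and_radius(cls, midpoint, radius):
        return cls(midpoint - radius, midpoint + radius)


# Three descriptions of the same stretch of signal.
a = Interval.from_endpoints(453, 497)
b = Interval.from_start_and_offset(453, 44)
c = Interval.from_midpoint_and_radius(475, 22)
```

Storing one canonical form means two annotations can be compared or searched by position regardless of how their regions were originally written down.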
Higher Dimensional Cases There are many signals with more than one dimension that could benefit from linguistic modeling techniques. Examples: • OCR (Optical Character Recognition) • Sign language video • Lip reading in television broadcasts
Higher Dimensional Cases ATLAS strives to model the annotations required for all of these examples as well as all well-formed combinations of them. Annotation graphs, however, do not provide a natural framework for identifying regions of a signal having more than one dimension.
Higher Dimensional Cases Sets of annotations for higher-dimensional cases are called Cartesian Annotation Sets because region boundaries are usually specified in the Cartesian coordinate system.
Multi-Dimension Example (Figure: a Forearm Gesture Annotation whose Region combines an Interval bounded by two Frame Anchors and a 3DSegment bounded by two XYZ Anchors.)
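As a rough illustration of a region bounded by Cartesian coordinates, the Python sketch below models an axis-aligned box in any number of dimensions. The `BoxRegion` class is hypothetical and much simpler than what ATLAS regions allow; the pixel and frame values are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoxRegion:
    """Axis-aligned region given by two opposite corner points."""
    low: tuple
    high: tuple

    def __post_init__(self):
        if len(self.low) != len(self.high):
            raise ValueError("corners must have the same number of dimensions")
        if any(lo > hi for lo, hi in zip(self.low, self.high)):
            raise ValueError("each low coordinate must not exceed the high one")

    @property
    def dimensions(self):
        return len(self.low)

    def contains(self, point):
        """True if the point lies inside the box (boundaries included)."""
        return len(point) == self.dimensions and all(
            lo <= p <= hi for lo, p, hi in zip(self.low, point, self.high)
        )


# A word on a scanned page (x, y in pixels), as in the OCR case.
word_box = BoxRegion(low=(120, 40), high=(310, 72))
# A gesture in video: frame number plus x, y pixel coordinates.
gesture_box = BoxRegion(low=(100, 200, 150), high=(160, 420, 380))
```

With one dimension the same class reduces to an interval, which mirrors how the generalized model subsumes the annotation graph case.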
ATLAS APIs The core API provides means to create, edit, and delete annotations. It also enables users and ATLAS developers to manipulate annotation components at the lowest level of abstraction.
ATLAS APIs Components will also be developed which provide higher-level services to ATLAS-compliant tools, easing the task of creating those tools. These components will facilitate visualizing and editing annotations and will support a query interface.
ATLAS Interchange Format The ATLAS Interchange Format, also called AIF, is an XML interchange format for ATLAS Annotation Graphs.
AIF Example
<AnnotationSet id="http://ace.program/ocr/9801.10/9801.10.omni.xml">
  <Signal mime-class="AUDIO" mime-type="wav" encoding="wav" ID="Audio1"/>
  <Signal mime-class="TEXT" mime-type="PLAIN" encoding="UTF8" ID="Text1"/>
  <Annotation id="a1" type="transcription">
    <Source>
      <Region Signal="Audio1" type="interval">
        <Value type="integer" role="start" unit="msec">453</Value>
        <Value type="integer" role="end" unit="msec">497</Value>
      </Region>
    </Source>
    <Content>
      <Region Signal="Text1" type="interval">
        <Value type="integer" role="start" unit="char">25</Value>
        <Value type="integer" role="end" unit="char">29</Value>
      </Region>
    </Content>
  </Annotation>
  <Annotation id="a2" type="transcription"> … </Annotation>
  …
</AnnotationSet>
AIF Currently, jATLAS (the Java implementation of the ATLAS API) is the only resource for creating AIF files.
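Since AIF is XML, a file shaped like the AIF example can be read with any XML parser. The Python sketch below uses the standard library's `xml.etree.ElementTree` on a cut-down copy of that example. It assumes the Signal elements are empty (self-closed) and that no XML namespace is in use; the helper names `read_region` and `read_annotations` are invented, and real AIF files may need more handling than this.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of the slide's example, with Signal elements self-closed
# (an assumption) so that the text is well-formed XML.
AIF_SAMPLE = """\
<AnnotationSet id="http://ace.program/ocr/9801.10/9801.10.omni.xml">
  <Signal mime-class="AUDIO" mime-type="wav" encoding="wav" ID="Audio1"/>
  <Signal mime-class="TEXT" mime-type="PLAIN" encoding="UTF8" ID="Text1"/>
  <Annotation id="a1" type="transcription">
    <Source>
      <Region Signal="Audio1" type="interval">
        <Value type="integer" role="start" unit="msec">453</Value>
        <Value type="integer" role="end" unit="msec">497</Value>
      </Region>
    </Source>
    <Content>
      <Region Signal="Text1" type="interval">
        <Value type="integer" role="start" unit="char">25</Value>
        <Value type="integer" role="end" unit="char">29</Value>
      </Region>
    </Content>
  </Annotation>
</AnnotationSet>
"""


def read_region(region):
    """Return (signal id, {role: (value, unit)}) for one Region element."""
    values = {
        v.get("role"): (int(v.text), v.get("unit"))
        for v in region.findall("Value")
    }
    return region.get("Signal"), values


def read_annotations(xml_text):
    """Map each Annotation id to its type, source region and content region."""
    root = ET.fromstring(xml_text)
    result = {}
    for ann in root.findall("Annotation"):
        result[ann.get("id")] = {
            "type": ann.get("type"),
            "source": read_region(ann.find("Source/Region")),
            "content": read_region(ann.find("Content/Region")),
        }
    return result


annotations = read_annotations(AIF_SAMPLE)
```

Read this way, annotation `a1` links milliseconds 453 to 497 of the audio signal to characters 25 to 29 of the text signal, which is exactly the source-to-content pairing the example is meant to show.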
ATLASTypes An ATLASType is a piece of metadata associated with an ATLAS construct to describe the attributes of, and permitted operations for, a specific annotation element.