CIS 530 Orientation, November 2001
Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA 19104

Motivation
• There are several thousand languages; over 320 of them have more than 1,000,000 speakers.
• The ability to process foreign languages supports the global economy, internationalization of business, software localization, military roles, intelligence gathering, humanitarian efforts and foreign policy.
• Developing language technology requires large amounts of data, appropriately selected, sampled, organized and annotated in corpora.
• Corpus creation requires special equipment, unique legal arrangements and business models, and specialized skills not usually taught in the programs of users of language data.
• LDC exists to make language data broadly available for linguistic education, research and technology development.

LDC Role
• LDC began in 1993 as a specialized publisher of language data; the data was typically produced elsewhere.
  - Distributed over 14,000 copies of 196 corpora to more than 1,000 organizations worldwide
• LDC gradually developed the ability to create language resources locally
  - newswire/text collection, collection of conversational data via telephone, broadcast news collection
  - transcription, time-alignment, topic relevance annotation, named entity annotation, phonological/morphological resources
• LDC more recently extended its research program
  - TalkBank & Linguistic Exploration, Open Languages Archives, African Language Lexicons, DASL
• Linguistic technologies
  - Information Detection, Extraction and Summarization
  - Speech Recognition and Speech Synthesis
  - Machine Translation
  - Language and Speaker Identification
  - Language Teaching, Linguistics

Annotating LDC Corpora: TDT
• Topic Detection & Tracking (TDT) Corpora
• TDT4 Corpus (most recent) contains 9 months of data in 6 languages
• Subset of 4 months of English, Chinese, Arabic for annotation
• Topics selected and defined from all sources
• A topic is a specific event or activity along with all directly related events (e.g., Hurricane Mitch)
• Multiple levels of annotation
  - segmentation of audio signal into individual stories
  - topic-story relevance judgements
  - first story identification
  - story-link identification
• Millions of annotation decisions
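To make the last two annotation levels concrete, here is a minimal sketch of how first-story and story-link labels could be derived once on-topic stories are known. It assumes (this is not stated on the slide) that the first story of a topic is simply the earliest on-topic story and that two stories are linked when they share a topic. The function names, the integer timestamps and the story IDs are all invented for illustration.

```python
from itertools import combinations


def first_stories(on_topic):
    """Return {topic: story_id} for the earliest on-topic story of each topic.

    on_topic maps a topic name to a list of (timestamp, story_id) pairs.
    Timestamps only need to be comparable; plain integers are used here.
    """
    return {topic: min(stories)[1] for topic, stories in on_topic.items() if stories}


def linked_pairs(on_topic):
    """Return the set of story-ID pairs that discuss the same topic."""
    pairs = set()
    for stories in on_topic.values():
        ids = sorted(story_id for _, story_id in stories)
        pairs.update(combinations(ids, 2))
    return pairs


# Toy data: timestamps and story IDs are made up, not taken from TDT4.
example = {
    "Hurricane Mitch": [(2, "s2"), (1, "s1"), (3, "s3")],
    "other topic": [(4, "s4")],
}
print(first_stories(example))
print(sorted(linked_pairs(example)))
```

In the corpus itself these labels come from human judgements; the sketch only shows how the two label types relate to the relevance annotations.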

Audio Segmentation
• Using commercial transcripts or closed captions, annotators
  - assess existing story boundaries
  - add, delete, move boundaries as needed
  - classify units as “news” or “not news” (commercials, etc.)
  - set and confirm timestamps for all story boundaries
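The boundary operations listed above can be sketched as edits on a list of time-stamped units. This is an illustrative model only: the `Story` class, its field names and the three helper functions are assumptions, not the format or tooling LDC used.

```python
from dataclasses import dataclass


@dataclass
class Story:
    start: float  # seconds from the start of the broadcast
    end: float
    label: str    # "news" or "not news"


def add_boundary(stories, index, at, right_label):
    """Split stories[index] at time `at`; the right half gets `right_label`."""
    s = stories[index]
    if not s.start < at < s.end:
        raise ValueError("new boundary must fall inside the unit")
    halves = [Story(s.start, at, s.label), Story(at, s.end, right_label)]
    return stories[:index] + halves + stories[index + 1:]


def delete_boundary(stories, index):
    """Merge stories[index] with the next unit, keeping the left label."""
    a, b = stories[index], stories[index + 1]
    return stories[:index] + [Story(a.start, b.end, a.label)] + stories[index + 2:]


def move_boundary(stories, index, new_time):
    """Move the boundary between stories[index] and stories[index + 1]."""
    a, b = stories[index], stories[index + 1]
    if not a.start < new_time < b.end:
        raise ValueError("boundary must stay between its two neighbours")
    moved = [Story(a.start, new_time, a.label), Story(new_time, b.end, b.label)]
    return stories[:index] + moved + stories[index + 2:]


def is_contiguous(stories):
    """Check that every unit starts exactly where the previous one ends."""
    return all(a.end == b.start for a, b in zip(stories, stories[1:]))
```

Each helper returns a new list rather than editing in place, which keeps the original segmentation available for comparison when a boundary is confirmed.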

Topic-Story Annotation
• Annotators read and evaluate news stories against the topic list
• Each story is classified as directly, briefly or not at all related to a target topic
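One way to picture the result of this step is a table keyed by topic and story, with one of three relevance values per cell. The sketch below is a hypothetical representation: the enum member names `YES`, `BRIEF` and `NO` and both function names are assumptions chosen for the example, not the corpus file format.

```python
from collections import defaultdict
from enum import Enum


class Relevance(Enum):
    YES = "directly related"
    BRIEF = "briefly related"
    NO = "not related"


def build_relevance_table(judgements):
    """Collect (topic, story_id, relevance) triples into {topic: {story_id: relevance}}.

    Raises ValueError if the same topic-story pair receives two different labels.
    """
    table = defaultdict(dict)
    for topic, story_id, relevance in judgements:
        prior = table[topic].get(story_id)
        if prior is not None and prior is not relevance:
            raise ValueError(f"conflicting judgements for {topic!r} / {story_id!r}")
        table[topic][story_id] = relevance
    return dict(table)


def on_topic_stories(table, topic, include_brief=False):
    """List story IDs judged on topic, optionally counting brief mentions."""
    keep = {Relevance.YES}
    if include_brief:
        keep.add(Relevance.BRIEF)
    return sorted(s for s, r in table.get(topic, {}).items() if r in keep)
```

Keeping "briefly related" as its own value, rather than folding it into yes or no, lets later users decide how to treat passing mentions.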

Annotating LDC Corpora: ACE
• Automatic Content Extraction Project (ACE)
  - Develop technology to support automatic processing of human language in text form
  - Classification, filtering, representing language content
• Four annotation tasks
  - Identify all nominal entities in news story
  - Categorize according to type
    - Persons, organizations, GPE (geo-political entity), location, facility
    - Name, nominal, pronominal
  - Co-index all mentions of single entity within story
  - Classify relations among entities
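The first three tasks can be illustrated with a small data structure: each mention carries an entity type, a mention level, and an entity ID shared by all mentions of the same entity. This is a sketch under assumptions; the `Mention` class, its field names, the lowercase type labels and the example sentence are invented here and do not reproduce the ACE annotation format.

```python
from collections import defaultdict
from dataclasses import dataclass

ENTITY_TYPES = {"person", "organization", "GPE", "location", "facility"}
MENTION_LEVELS = {"name", "nominal", "pronominal"}


@dataclass(frozen=True)
class Mention:
    text: str
    start: int        # character offset of the mention in the story
    entity_type: str  # one of ENTITY_TYPES
    level: str        # one of MENTION_LEVELS
    entity_id: str    # co-index: identical for all mentions of one entity


def validate(mention, story):
    """Check the labels and that the offset really points at the mention text."""
    if mention.entity_type not in ENTITY_TYPES:
        raise ValueError(f"unknown entity type: {mention.entity_type}")
    if mention.level not in MENTION_LEVELS:
        raise ValueError(f"unknown mention level: {mention.level}")
    if story[mention.start:mention.start + len(mention.text)] != mention.text:
        raise ValueError(f"offset mismatch for {mention.text!r}")


def coreference_chains(mentions):
    """Group mention texts by entity ID, in order of appearance."""
    chains = defaultdict(list)
    for m in sorted(mentions, key=lambda m: m.start):
        chains[m.entity_id].append(m.text)
    return dict(chains)


# Invented example sentence and annotations.
story = "The mayor said she would visit Philadelphia."
mentions = [
    Mention("The mayor", 0, "person", "nominal", "E1"),
    Mention("she", 15, "person", "pronominal", "E1"),
    Mention("Philadelphia", 31, "GPE", "name", "E2"),
]
for m in mentions:
    validate(m, story)
print(coreference_chains(mentions))
```

Relations among entities (the fourth task) would then be expressed as links between entity IDs rather than between individual mentions.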

Nominal Entity Tagging

Best practices in use of large-scale corpora in study of linguistic variation
• Focus on -t/d deletion in American English (well-known variable)
• Four LDC Corpora, all created for linguistic technology development
• All data already transcribed, segmented to provide fine-grained access
• Basic demographic information available (gender, age, education, region, race/ethnicity)

DASL Technology
• Create concordance
  - regular expression search of corpus
• Create tag set
  - specify which factors to code
• Create annotation file
  - combines data with tag set
• Annotate using web browser
  - play each example; tool supports common audio formats
  - code factors in each factor group, adding comments when needed
  - demographic information displayed
• Save results and output to text file
  - can be exported to Excel spreadsheet, statistical analysis package
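The workflow above (concordance, tag set, annotation file, text export) can be sketched end to end in a few functions. Everything here is an assumption made for illustration: the regular expression is a rough orthographic stand-in for -t/d deletion candidates (a word ending in a consonant letter plus t or d), and the tag set, factor names and function names are hypothetical, not those of the DASL tool.

```python
import csv
import io
import re

# Rough orthographic proxy: word ending in a consonant letter followed by t or d.
TD_CANDIDATE = r"\b\w+[bcdfghjklmnpqrstvwxz][td]\b"

# Hypothetical tag set: factor group -> allowed codes.
TAG_SET = {
    "deleted": {"yes", "no"},
    "following_segment": {"consonant", "vowel", "pause"},
}


def concordance(utterances, pattern, width=20):
    """Yield one row per regex hit, with left and right context."""
    rx = re.compile(pattern, re.IGNORECASE)
    for utt_id, text in utterances:
        for m in rx.finditer(text):
            yield {
                "utt_id": utt_id,
                "left": text[max(0, m.start() - width):m.start()],
                "match": m.group(0),
                "right": text[m.end():m.end() + width],
            }


def annotate(row, codes, comment=""):
    """Attach one code per factor group to a concordance row."""
    for group, value in codes.items():
        if value not in TAG_SET.get(group, ()):
            raise ValueError(f"{value!r} is not a valid code for {group!r}")
    return {**row, **codes, "comment": comment}


def to_tsv(rows):
    """Serialize annotated rows as tab-separated text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]), delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Tab-separated output is used because it opens directly in a spreadsheet and is easy to load into a statistics package, matching the export step on the slide.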

TDT Overview

Transcripts

ASR Output

Boundary Table

Relevance Table

Story Links
