CIS 530 Orientation November 2001 Linguistic Data Consortium University of Pennsylvania Philadelphia, PA 19104
Motivation • There are several thousand languages; over 320 are spoken by more than 1,000,000 speakers. • The ability to process foreign languages supports the global economy, internationalization of business, software localization, military roles, intelligence gathering, humanitarian efforts and foreign policy. • Developing language technology requires large amounts of data, appropriately selected, sampled, organized and annotated in corpora. • Corpus creation requires special equipment, unique legal arrangements and business models, and specialized skills not usually taught in the programs of users of language data. • LDC exists to make language data broadly available for linguistic education, research and technology development.
LDC Role • LDC began in 1993 as a specialized publisher of language data, which was typically produced elsewhere. • It has distributed over 14,000 copies of 196 corpora to more than 1,000 organizations worldwide. • LDC gradually developed the ability to create language resources locally: newswire/text collection; collection of conversational data via telephone; broadcast news collection; transcription, time-alignment, topic relevance annotation and named entity annotation; phonological/morphological resources. • LDC more recently extended its research program: TalkBank & Linguistic Exploration, Open Language Archives, African Language Lexicons, DASL. • Linguistic technologies: Information Detection, Extraction and Summarization; Speech Recognition and Speech Synthesis; Machine Translation; Language and Speaker Identification; Language Teaching, Linguistics.
Annotating LDC Corpora: TDT • Topic Detection & Tracking (TDT) Corpora • TDT4 Corpus (most recent) contains 9 months of data in 6 languages • Subset of 4 months of English, Chinese, Arabic for annotation • Topics selected and defined from all sources • Topic is a specific event or activity along with all directly related events (e.g., Hurricane Mitch) • Multiple levels of annotation • segmentation of audio signal into individual stories • topic-story relevance judgements • first story identification • story-link identification • Millions of annotation decisions
Audio Segmentation • Using commercial transcripts or closed captions, annotators: • assess existing story boundaries • add, delete or move boundaries as needed • classify units as “news” or “not news” (commercials, etc.) • set and confirm timestamps for all story boundaries
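The boundary operations on this slide can be sketched in a few lines of Python. This is only an illustration, not the LDC annotation tool: the `Segment` class, the function names and the use of seconds as the time unit are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One unit of a broadcast (hypothetical representation)."""
    start: float  # seconds from the start of the recording (assumed unit)
    end: float
    kind: str     # "news" or "not news"


def move_boundary(segments, index, new_time):
    """Move the boundary between segments[index] and segments[index + 1]."""
    left, right = segments[index], segments[index + 1]
    if not (left.start < new_time < right.end):
        raise ValueError("boundary must stay inside the two adjacent segments")
    left.end = new_time
    right.start = new_time


def add_boundary(segments, index, time, right_kind):
    """Split segments[index] at `time`; the new right-hand part gets `right_kind`."""
    seg = segments[index]
    if not (seg.start < time < seg.end):
        raise ValueError("new boundary must fall inside the segment")
    segments[index:index + 1] = [
        Segment(seg.start, time, seg.kind),
        Segment(time, seg.end, right_kind),
    ]


def delete_boundary(segments, index):
    """Merge segments[index] and segments[index + 1], keeping the left label."""
    left, right = segments[index], segments[index + 1]
    segments[index:index + 2] = [Segment(left.start, right.end, left.kind)]


def is_contiguous(segments):
    """Confirm timestamps: each segment must start where the previous one ends."""
    return all(a.end == b.start for a, b in zip(segments, segments[1:]))
```

Keeping the segments in one ordered list means that every add, delete or move can be followed by an `is_contiguous` check, which mirrors the "set and confirm timestamps" step.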
Topic-Story Annotation • Annotators read and evaluate news stories against topic list • Classify story as directly, briefly or not at all related to a target topic
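A minimal sketch of how the three-way relevance judgements might be stored and queried. The label names, the `(topic, story)` keying and the IDs in the usage lines are hypothetical; the actual TDT annotation file format is not described in these slides.

```python
from enum import Enum


class Relevance(Enum):
    """Three-way judgement from the slide (label names are assumptions)."""
    DIRECT = "directly related"
    BRIEF = "briefly related"
    NONE = "not related"


def on_topic_stories(judgements, topic, include_brief=False):
    """Return the sorted story IDs judged relevant to `topic`.

    `judgements` maps (topic_id, story_id) pairs to a Relevance label.
    """
    wanted = {Relevance.DIRECT}
    if include_brief:
        wanted.add(Relevance.BRIEF)
    return sorted(
        story
        for (t, story), label in judgements.items()
        if t == topic and label in wanted
    )


# Hypothetical topic and story IDs, for illustration only.
judgements = {
    ("T1", "s1"): Relevance.DIRECT,
    ("T1", "s2"): Relevance.BRIEF,
    ("T1", "s3"): Relevance.NONE,
}
direct_only = on_topic_stories(judgements, "T1")
with_brief = on_topic_stories(judgements, "T1", include_brief=True)
```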
Annotating LDC Corpora: ACE • Automatic Content Extraction Project (ACE) • Develop technology to support automatic processing of human language in text form • Classification, filtering, representing language content • Four annotation tasks • Identify all nominal entities in news story • Categorize according to type • Persons, organizations, GPE, location, facility • Name, nominal, pronominal • Co-index all mentions of single entity within story • Classify relations among entities
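The entity tasks above can be pictured with a small data structure. This is a sketch under stated assumptions: the `Mention` class, its field names and the character-offset convention are invented for illustration and are not the ACE file format. Only the type lists come from the slide.

```python
from collections import defaultdict
from dataclasses import dataclass

# Categories listed on the slide.
ENTITY_TYPES = {"person", "organization", "GPE", "location", "facility"}
MENTION_TYPES = {"name", "nominal", "pronominal"}


@dataclass(frozen=True)
class Mention:
    """One mention of an entity (hypothetical representation)."""
    entity_id: str     # shared by all co-indexed mentions of one entity
    entity_type: str   # one of ENTITY_TYPES
    mention_type: str  # one of MENTION_TYPES
    start: int         # character offset into the story, inclusive
    end: int           # character offset into the story, exclusive

    def __post_init__(self):
        if self.entity_type not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {self.entity_type}")
        if self.mention_type not in MENTION_TYPES:
            raise ValueError(f"unknown mention type: {self.mention_type}")


def coreference_chains(text, mentions):
    """Group mention strings by entity ID, in order of appearance."""
    chains = defaultdict(list)
    for m in sorted(mentions, key=lambda m: m.start):
        chains[m.entity_id].append(text[m.start:m.end])
    return dict(chains)
```

Co-indexing then amounts to giving every mention of the same entity the same `entity_id`; relations among entities could be added as pairs of entity IDs with a relation label.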
Best practices in use of large-scale corpora in study of linguistic variation • Focus on -t/d deletion in American English (well-known variable) • Four LDC Corpora, all created for linguistic technology development • All data already transcribed, segmented to provide fine-grained access • Basic demographic information available (gender, age, education, region, race/ethnicity)
DASL Technology • Create concordance: regular expression search of corpus • Create tag set: specify which factors to code • Create annotation file: combines data with tag set • Annotate using web browser: play each example (tool supports common audio formats); code factors in each factor group, adding comments when needed; demographic information displayed • Save results and output to text file: can be exported to an Excel spreadsheet or a statistical analysis package
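The concordance step can be sketched as a regular expression search over transcribed utterances. This is not the DASL tool itself: the `concordance` function, the utterance IDs and the example pattern are assumptions. The pattern is only a rough orthographic stand-in for word-final consonant + t/d clusters, so a real study would need a phonologically informed search.

```python
import re


def concordance(utterances, pattern, width=30):
    """Return keyword-in-context hits for `pattern`.

    `utterances` is an iterable of (utterance_id, transcript_text) pairs.
    Each hit records the match and up to `width` characters on either side.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for utt_id, text in utterances:
        for m in regex.finditer(text):
            hits.append({
                "utterance": utt_id,
                "match": m.group(0),
                "left": text[max(0, m.start() - width):m.start()],
                "right": text[m.end():m.end() + width],
            })
    return hits


# Hypothetical pattern: words spelled with a final -st, -nd, -ld or -ft.
TD_CANDIDATES = r"\b\w+(?:st|nd|ld|ft)\b"

# Hypothetical transcripts, for illustration only.
sample = [
    ("u1", "we went out west and just kept going"),
    ("u2", "I told him"),
]
hits = concordance(sample, TD_CANDIDATES)
```

Each hit could then be paired with the tag set to form an annotation file, one row per example, ready for coding.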