New Resources for Document Classification, Analysis and Translation Technologies

New Resources for Document Classification, Analysis and Translation Technologies. Stephanie Strassel, Lauren Friedman, Safa Ismael, David Lee, Kazuaki Maeda, Linda Brandschain {strassel, lf, safa, david4, maeda, brndschn}@ldc.upenn.edu Linguistic Data Consortium


Presentation Transcript


  1. New Resources for Document Classification, Analysis and Translation Technologies Stephanie Strassel, Lauren Friedman, Safa Ismael, David Lee, Kazuaki Maeda, Linda Brandschain {strassel, lf, safa, david4, maeda, brndschn}@ldc.upenn.edu Linguistic Data Consortium http://projects.ldc.upenn.edu/MADCAT

  2. Presentation Outline • MADCAT Program Overview • Technology Challenges • Roadmap • Data Creation • Phase 1 Data Profile • Processing • Collection • Annotation • Data Format • Evaluation • Conclusions and Future Work

  3. MADCAT Overview • MADCAT: Multilingual Document Classification Analysis and Translation • A 5-year DARPA program • MADCAT technologies will convert foreign language document images into English text, enabling English speakers to extract, assess, and respond to information in a timely manner • Multiple input types and domains • Hard-copy, PDF, camera-captured • Newspapers, letters, signs, graffiti, how-to manuals, memos, postcards, forms, diaries, ledgers, etc.

  4. Technology Challenges • Extract relevant metadata about the document structure • Integrate and optimize page segmentation, metadata extraction, OCR and translation technologies • Create end-to-end system for deployment at program’s end with over 90% accuracy • Current baseline is ~2% • Primary evaluation metric is edit distance: HTER • Same protocols as used in the GALE program • Limited focus in Phase 1 • Arabic > English • High resolution (600 dpi) images of handwritten newspaper and web text • Topics primarily news, current events and commentary • Manual segmentation provided

  5. Roadmap • Phases: Pre-MADCAT (State of the Art); Phase 1 (Add handwriting); Phase 2-3 (New data types); Phase 4-5 (New genres, topics, quality conditions) • Genre: Newswire, Broadcast, Talk Shows, Weblogs, Letters, Diaries, Forms, Calendars, Ledgers, Instructions, Maps, Poems, Books, Verdicts, Personal Identification, Training Manuals • Topic: News, Commentary, Engineering, Science, Personal, Military, Religious, Other • Medium: Printed, Handwritten • Source Data Quality: Controlled, Uncontrolled

  6. Phase 1 Data Profile • In Phase 1, data drawn from DARPA GALE program • New collection to acquire handwritten versions • Genres: Formal text (newswire) and informal text (weblogs) • Benefits • Eliminates domain mismatch between GALE state of the art MT models and MADCAT test sets • Allows developers to focus on primary challenge: handwriting • Data characteristics well understood, cost and time factors are reasonably well known • Training data costs controlled since translations exist • Production begins immediately, training data available sooner • Provides controlled test sets for evaluation across programs • Subsequent phases will add new data types, genres and other challenge elements

  7. Training and DevTest • Training • Minimum 2000 unique pages • Half formal (newswire), half informal (web text) • 100-250 words per page • Minimum 100 unique scribes in training pool • 5 scribes per page • At minimum 10,000 manuscripts (scribe-pages) in Phase 1 training set • DevTest • 320 unique pages • Half formal (newswire), half informal (web text) • 125 words/page • 50 scribes in devtest pool • 25 from training, 25 previously unseen • 2 scribes per page, ~7 pages per scribe • Total of 640 manuscripts; 80,000 words

  8. Evaluation Data • 320 unique pages from GALE P3 Eval set • Half formal (newswire), half informal (web text) • 125 words/page • 50 scribes in eval partition • 25 from training, 25 previously unseen • 6 scribes per page, ~40 pages per scribe • Total of 1920 manuscripts, 240,000 words • Subset of eval set designated for pilot evaluation in September 2008
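
The manuscript and word totals quoted for the training, devtest and eval partitions follow from multiplying unique pages by scribes per page (and by words per page). A minimal sketch of that arithmetic, assuming each unique page is copied by exactly the stated number of scribes; the `Partition` class and its field names are illustrative and not part of any LDC tooling:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    """Illustrative container for one data partition (hypothetical names)."""
    name: str
    unique_pages: int
    scribes_per_page: int

    @property
    def manuscripts(self) -> int:
        # One manuscript (scribe-page) per pairing of a page with a scribe.
        return self.unique_pages * self.scribes_per_page

    def words(self, words_per_page: int) -> int:
        # Total handwritten words, given a fixed page length.
        return self.manuscripts * words_per_page


# Figures taken from the slides; training uses the stated minimums.
training = Partition("training", unique_pages=2000, scribes_per_page=5)
devtest = Partition("devtest", unique_pages=320, scribes_per_page=2)
evaluation = Partition("eval", unique_pages=320, scribes_per_page=6)

print(training.manuscripts)                              # minimum training manuscripts
print(devtest.manuscripts, devtest.words(125))           # devtest totals
print(evaluation.manuscripts, evaluation.words(125))     # eval totals
```

Training page length varies from 100 to 250 words, so only the manuscript count is computed for that partition.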

  9. Data Preparation • Start with electronic text from GALE • Whole documents collected from newswire or web • Segmented into SUs (semantic/sentence units) • Each segment manually translated • Pre-processing prior to handwriting • Tokenization to words for later stages • Segments reordered and formatting added to create optimal pages for handwriting assignment • Roughly 5 words/line to avoid line wrapping • No more than 25 lines/page to avoid page breaks • After handwriting, images scanned at high resolution (600 dpi, greyscale) • Images are ground truth annotated at line, word level • Major challenge is logical storage of many layers of information across multiple versions of the same data
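
The page-formatting constraints above (roughly 5 words per line, no more than 25 lines per page) can be sketched as a simple chunking step. This is a simplified illustration only: the function name `layout_pages` is hypothetical, and the sketch ignores the segment reordering that the actual pre-processing performs to produce optimal pages.

```python
from typing import List


def layout_pages(tokens: List[str],
                 words_per_line: int = 5,
                 max_lines: int = 25) -> List[List[str]]:
    """Group tokens into short lines, then group lines into pages."""
    # Short lines keep scribes from having to wrap a line by hand.
    lines = [
        " ".join(tokens[i:i + words_per_line])
        for i in range(0, len(tokens), words_per_line)
    ]
    # Capping lines per page keeps each source page on one sheet.
    return [lines[i:i + max_lines] for i in range(0, len(lines), max_lines)]


# 130 placeholder tokens: 26 lines, so one full page plus one extra line.
demo_tokens = [f"w{i}" for i in range(130)]
demo_pages = layout_pages(demo_tokens)
print(len(demo_pages), len(demo_pages[0]), len(demo_pages[1]))
```

With these defaults a full page holds 5 x 25 = 125 words, which matches the 125 words/page figure given for the devtest and eval sets.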

  10. Collection • New human subjects collection required to produce handwritten versions of existing data • Pilot collection currently underway at LDC in Philadelphia • LDC Arabic staff and recent Iraqi immigrants in Philadelphia • Additional collections planned with partner sites in Lebanon, Morocco and possibly Egypt • Regional variety necessary to capture stylistic writing differences • E.g. use of Indic vs. Arabic numerals • Assignment and tracking of data and scribes controlled through centralized LDC database and assignment protocol • Scribe partition (train only, test only, both) • Writing conditions • Regional variation • Genre, topic and source balance

  11. Writing Conditions • Implement • 90% ballpoint pen (I) • 10% pencil (P) • Paper • 75% unlined white paper (U) • 25% lined paper (L) • Writing speed • 90% normal (N) • 5% fast (F) • 5% careful (C)
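
One way to assign writing conditions at these rates is weighted random sampling per page. The sketch below assumes the three factors are drawn independently, which the slides do not state; the names `CONDITIONS` and `sample_conditions` are hypothetical, and the actual LDC assignment protocol may balance conditions differently.

```python
import random
from typing import Dict

# Target proportions from the slide, keyed by the single-letter codes.
CONDITIONS: Dict[str, Dict[str, float]] = {
    "implement": {"I": 0.90, "P": 0.10},        # ballpoint pen, pencil
    "paper": {"U": 0.75, "L": 0.25},            # unlined, lined
    "speed": {"N": 0.90, "F": 0.05, "C": 0.05}  # normal, fast, careful
}


def sample_conditions(rng: random.Random) -> Dict[str, str]:
    """Draw one code per factor, weighted by its target proportion."""
    return {
        factor: rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        for factor, dist in CONDITIONS.items()
    }


# A fixed seed makes the assignment reproducible for a 50-page kit.
rng = random.Random(0)
page_conditions = [sample_conditions(rng) for _ in range(50)]
print(page_conditions[0])
```

Random sampling only approximates the target rates on a small kit; a stratified allocation would be needed to hit them exactly.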

  12. Collection Workflow • LDC selects source data • LDC generates kits (documents + writing conditions) • LDC delivers data kits to collection sites • Sites publicize study and recruit participants • Scribe visits public URL, contacts site coordinator • Site coordinator verifies scribe eligibility • Site coordinator schedules appointment • Scribe comes in, takes writing sample test • Site coordinator logs in to secure website via login page • Scribe completes registration via registration page • Scribe verifies info via confirmation page • Site coordinator prints out subject ID and instructions for subject via assignment page • Coordinator pulls kit for this subject ID • Scribe leaves with kit and instructions • Scribe returns completed kit to site • Coordinator verifies kit completeness and arranges payment • Coordinator files completed kit for scanning/delivery • Site scans completed kit(s) as safeguard • Site uploads image file to LDC • LDC processes completed kits for subsequent tasks • Site ships completed paper kit(s) to LDC for archiving

  13. Scribe Demographics • Scribes register in person at collection site and take writing test • To assess literacy and ability to follow instructions • Enter demographic info on LDC's secure server • Name, address (for payment purposes only) • Age, gender, level of education, occupation • Where born, where raised • Primary language of educational instruction • Handedness • After registration, scribes receive brief tutorial • No line wrapping, no page breaks • Copy text exactly: no omissions or insertions, no corrections to source text

  14. Scribe Assignments • Assignments are in the form of printed "kits" • 50 printed pages to be copied plus assignment table • Assignment table specifies page order and writing conditions • Multiple scribes/kit, so conditions and order vary • Printed pages labeled with page and kit ID • Scribes affix label with scribe, page and kit ID to back of completed manuscript • To facilitate data tracking during scanning and post-processing • Scribes supply paper and writing instrument • To sample natural variation • Payment per completed kit • Exhaustive check on first assignment (completeness and accuracy) • Spot check on remainder of assignments

  15. Ground Truthing • Zones created at word level only for Phase 1 • Lines can be extrapolated from annotation • Other zone types possible in future phases • Structural elements (e.g. signature block) • Explicit reading order preserved • Locations are polygons • Restricted to upright rectangles in the first phase • Each zone contains a unique ID, the contents, location (coordinates) • Status tags to accommodate scribe mistakes • extra, missing, typo • nextZoneID tag to indicate reading order • In Phase 1, ground truthing primarily by partner site (Applied Media Analysis)
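
The zone description above (unique ID, contents, location, status tag, nextZoneID) maps naturally onto a linked record structure. The following is a minimal sketch under assumed names; the `Zone` class, its fields and `reading_order` are illustrative and do not reproduce the actual GEDI or MADCAT schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Zone:
    zone_id: str
    contents: str
    # Upright rectangle as (left, top, right, bottom) pixel coordinates,
    # matching the Phase 1 restriction on polygon shape.
    box: Tuple[int, int, int, int]
    status: Optional[str] = None        # "extra", "missing", "typo", or None
    next_zone_id: Optional[str] = None  # link to the next zone in reading order


def reading_order(zones: List[Zone], first_id: str) -> List[Zone]:
    """Follow next_zone_id links from the first zone to recover reading order."""
    by_id = {z.zone_id: z for z in zones}
    ordered: List[Zone] = []
    seen = set()
    current: Optional[str] = first_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in reading order at zone {current}")
        seen.add(current)
        zone = by_id[current]
        ordered.append(zone)
        current = zone.next_zone_id
    return ordered


# Zones stored out of order; the links, not list position, define the order.
demo_zones = [
    Zone("z2", "word2", (300, 40, 480, 90), status="typo", next_zone_id="z3"),
    Zone("z3", "word3", (100, 40, 280, 90)),
    Zone("z1", "word1", (500, 40, 680, 90), next_zone_id="z2"),
]
print([z.zone_id for z in reading_order(demo_zones, "z1")])
```

Storing the order as explicit links keeps it independent of zone coordinates, which matters for right-to-left scripts such as Arabic.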

  16. GEDI Toolkit • GroundTruth - Editor and Document Interface (GEDI) created by Applied Media Analysis (AMA)

  17. Data Format • MADCATUnifier process takes multiple data streams and generates a single XML output file that contains all required information • 1) Text layer: source text, tokenization, SU segmentation, translation • 2) Image layer: zone bounding boxes • 3) Scribe demographics • 4) Document metadata
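
To illustrate the idea of merging several data streams into one XML file, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. Every element and attribute name (`document`, `text_layer`, `image_layer`, `zone`, and so on) is an assumption made for illustration; the actual MADCAT unified format is not specified in these slides.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def unify(doc_id: str, segments: List[dict], zones: List[dict],
          scribe: Dict[str, str], metadata: Dict[str, str]) -> str:
    """Merge text, image, scribe and metadata streams into one XML string."""
    root = ET.Element("document", {"id": doc_id})

    # Document metadata and scribe demographics as simple name/value fields.
    for tag, fields in (("metadata", metadata), ("scribe", scribe)):
        parent = ET.SubElement(root, tag)
        for name, value in fields.items():
            ET.SubElement(parent, "field", {"name": name}).text = str(value)

    # Text layer: source text, whitespace tokens, translation per segment.
    text_el = ET.SubElement(root, "text_layer")
    for seg in segments:
        seg_el = ET.SubElement(text_el, "segment", {"id": seg["id"]})
        ET.SubElement(seg_el, "source").text = seg["source"]
        for i, tok in enumerate(seg["source"].split()):
            ET.SubElement(seg_el, "token", {"id": f"{seg['id']}-t{i}"}).text = tok
        ET.SubElement(seg_el, "translation").text = seg["translation"]

    # Image layer: one bounding box per zone, linked back to a token ID.
    image_el = ET.SubElement(root, "image_layer")
    for z in zones:
        left, top, right, bottom = z["box"]
        ET.SubElement(image_el, "zone", {
            "id": z["id"], "token": z["token"],
            "left": str(left), "top": str(top),
            "right": str(right), "bottom": str(bottom),
        })
    return ET.tostring(root, encoding="unicode")


xml_text = unify(
    "doc001",
    segments=[{"id": "s0", "source": "tok_a tok_b", "translation": "example text"}],
    zones=[{"id": "z0", "token": "s0-t0", "box": (500, 40, 680, 90)},
           {"id": "z1", "token": "s0-t1", "box": (300, 40, 480, 90)}],
    scribe={"handedness": "right"},
    metadata={"genre": "newswire"},
)
print(xml_text)
```

Linking each zone to a token ID is one way to keep the image layer aligned with the text layer without duplicating the text.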

  18. Evaluation • Input: (segmented) Arabic handwritten image • Output: segmented English text • HTER is primary evaluation metric (edit distance) • Manual post-editing task corrects MT output one segment at a time until it has the same meaning as the reference translation, making as few edits as possible • NIST-developed MTPostEditor GUI • Editors review segment-aligned MT and gold standard translation • No access to original Arabic text or handwritten image file • No official separate evaluation of OCR or processing components
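
HTER scores MT output by the number of edits a human post-editor needs to make it match the meaning of the reference, normalized by length. The sketch below is a simplification: it uses plain word-level Levenshtein distance (insertions, deletions, substitutions) and omits the block-shift operation that TER-style scoring also counts. It also assumes normalization by the length of the post-edited segment. The function names are illustrative and unrelated to the NIST tooling.

```python
from typing import List


def word_edit_distance(hyp: List[str], ref: List[str]) -> int:
    """Word-level Levenshtein distance (no shift operation)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def approx_hter(mt_output: str, post_edited: str) -> float:
    """Edits from MT output to its post-edited form, per post-edited word."""
    hyp, ref = mt_output.split(), post_edited.split()
    if not ref:
        raise ValueError("post-edited segment must not be empty")
    return word_edit_distance(hyp, ref) / len(ref)


# One substitution ("visit" to "visited") over six words.
score = approx_hter("the minister visit the city yesterday",
                    "the minister visited the city yesterday")
print(round(score, 4))
```

Because shifts are not modeled, this sketch can overestimate the score when the post-editor merely reorders a phrase.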

  19. Conclusions; Future Work • LDC is creating a set of new linguistic resources for image processing, document classification and translation on a scale not previously available • Phase 1: Large collection of Arabic handwritten, translated, segmented, ground truthed text • Infrastructure for collection, annotation and data management • Including a unified, extensible data format • Extended to new data types, domains, languages, annotations in future phases • Resources will be available through LDC

  20. Acknowledgements • This work was supported in part by the Defense Advanced Research Projects Agency, MADCAT Program Grant No. HR0011-08-1-004. The content of this paper does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred. • Thank you to Audrey Le and Mark Przybocki at NIST for helping to define data and format requirements
