1 / 16

ListReader : Wrapper Induction for Lists in OCRed Documents

ListReader : Wrapper Induction for Lists in OCRed Documents. Thomas Packer BYU CS DEG 2012.03.17. We Love Data (In Digital Form). Lots of Paper Documents in the World. Lots of Text in Paper Documents. Lots of Lists in Text Lots of Data in Lists. Manual Data Entry.

otylia
Download Presentation

ListReader : Wrapper Induction for Lists in OCRed Documents

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. ListReader:Wrapper Induction for Lists in OCRed Documents Thomas Packer BYU CS DEG 2012.03.17

  2. We Love Data (In Digital Form)

  3. Lots of Paper Documents in the World

  4. Lots of Text in Paper Documents

  5. Lots of Lists in TextLots of Data in Lists

  6. Manual Data Entry

  7. Wrappers: Individualized Extraction Rules

  8. Semi-Automatic Wrapper Induction

  9. Weakly-supervisedWrapper Induction Semi-supervised Wrapper Induction

  10. Data: Image

  11. Data: OCR

  12. Data: Hand-Labeled OCR

  13. Induced Regex Wrappers Single-Specific Character Classes (i)(\. )(Lydia)( )(Lewis)(\*, )(b\. ) … [a-z] [A-Z] [0-9] [ \t\r\n] <each punct.> Single-General ([a-z]{1,5})([ \t\r\n\.]{1,6})([a-zA-Z]{3,9})([ \t\r\n]{1,5}) ([a-zA-Z]{3,9})([ \t\r\n\*,]{1,7})([ \t\r\n\.a-z]{1,7}) … Multi-General ([iv]{1,3})([ \.]{2,2})([ACHJLacdehilmnrstuvy]{4,6})([ ]{1,1}) ([Leisw]{5,6})([ \*,]{2,3})([ \.abp]{3,5}) …

  14. Preliminary Results(Field Label F-measure, Small Dataset)

  15. Conclusions and Future Work

  16. All done. Suggestions?

More Related