
The task of cleaning and enriching large collections

The task of cleaning and enriching large collections: what aspects can we share? Contributing to this work: UIUC English: Ted Underwood, Jordan Sellers, Mike Black. UIUC Library: Harriett Green. I3: Loretta Auvil, Boris Capitanu. Andrew W. Mellon Foundation.


Presentation Transcript


  1. The task of cleaning and enriching large collections: what aspects can we share?

  2. Contributing to this work. UIUC English: Ted Underwood, Jordan Sellers, Mike Black. UIUC Library: Harriett Green. I3: Loretta Auvil, Boris Capitanu. Andrew W. Mellon Foundation.

  3. “Enrich” as well as “clean.”

  4. Yearly values of a ratio between two wordlists in three different genres. 4,275 volumes. 1700-1899.
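
A minimal sketch of how a yearly ratio between two wordlists might be computed. The function name, the `(year, tokens)` input shape, and the toy wordlists are all assumptions for illustration; the slide does not say which lists or which tokenization were used.

```python
from collections import defaultdict

def yearly_wordlist_ratio(volumes, list_a, list_b):
    """Return {year: hits(list_a) / hits(list_b)}, pooled over each year's volumes.

    `volumes` is an iterable of (year, tokens) pairs. Years in which no word
    from list_b occurs are skipped rather than divided by zero.
    """
    set_a, set_b = set(list_a), set(list_b)
    hits_a, hits_b = defaultdict(int), defaultdict(int)
    for year, tokens in volumes:
        for tok in tokens:
            tok = tok.lower()
            if tok in set_a:
                hits_a[year] += 1
            if tok in set_b:
                hits_b[year] += 1
    return {y: hits_a[y] / hits_b[y] for y in sorted(hits_b) if hits_b[y] > 0}

# Toy data (hypothetical, not drawn from the 4,275-volume collection).
volumes = [
    (1750, "the house and the garden were pleasant".split()),
    (1750, "his residence was elegant".split()),
    (1850, "the house the home the hearth".split()),
]
ratios = yearly_wordlist_ratio(
    volumes, ["house", "home", "hearth"], ["residence", "elegant"]
)
```

To compare genres, the same function could be called once per genre, or the grouping key could be widened to a `(genre, year)` pair.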

  5. “representative?”

  6. analyzing the data / cleaning the data

  7. “clean” is relative

  8. Different projects will strike a different balance between precision and recall, which makes it tricky to share resources.

  9. Cleaning the data
     • Clean up the OCR / assess error.
     • Identify parts of a volume (e.g., articles in a serial, poetry/prose).
     • Remove library bookplates and running headers, after using them for (3).
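
One way running headers could be detected, sketched under assumptions: a header is taken to be a first line that recurs (ignoring page numbers) on a large share of pages. The function name, the `min_share` threshold, and the page representation are hypothetical. The detected headers are returned as well as removed, so they remain available as clues before being discarded.

```python
from collections import Counter

def strip_running_headers(pages, min_share=0.5):
    """Remove first lines that recur on at least `min_share` of the pages.

    `pages` is a list of pages, each a list of text lines. Returns the
    cleaned pages and the set of normalized header strings that were removed.
    """
    def normalize(line):
        # Drop digits so "12 THE MONK" and "THE MONK 13" count as one header.
        return "".join(ch for ch in line if not ch.isdigit()).strip().lower()

    firsts = Counter(normalize(p[0]) for p in pages if p)
    headers = {h for h, n in firsts.items() if h and n / len(pages) >= min_share}
    cleaned = [p[1:] if p and normalize(p[0]) in headers else p for p in pages]
    return cleaned, headers

# Toy volume (hypothetical page contents).
pages = [
    ["12 THE MONK", "text one"],
    ["THE MONK 13", "text two"],
    ["CHAPTER II", "text three"],
]
cleaned, headers = strip_running_headers(pages)
```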

  10. Correction rules
     • period-specific lexica, incl. foreign lang.
     • context of an individual doc: It is furely a mortal fin to ...
     • collection-level observations: proper nouns, words that appear mainly in dirty docs
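
A rough sketch of how a lexicon and document context might combine into correction rules for the long-s example above. Everything here is an assumption for illustration: the two-rule design, the function names, the tiny lexicon, and the bigram set standing in for "context". A non-word such as "furely" is fixed by lexicon lookup alone; a real word such as "fin" is changed only when the preceding word supports it.

```python
from itertools import product

def long_s_candidates(token):
    """All spellings obtained by turning any subset of 'f' into 's'."""
    options = [("f", "s") if ch == "f" else (ch,) for ch in token]
    return {"".join(chars) for chars in product(*options)} - {token}

def correct_token(token, prev_token, lexicon, bigrams):
    """Apply lexicon lookup first, then a left-context bigram check."""
    candidates = [c for c in long_s_candidates(token) if c in lexicon]
    if token not in lexicon:
        # Rule 1: not a known word, so accept an unambiguous candidate.
        return candidates[0] if len(candidates) == 1 else token
    # Rule 2: a real word ("fin"); only context can justify a change.
    for cand in candidates:
        if (prev_token, cand) in bigrams and (prev_token, token) not in bigrams:
            return cand
    return token

def correct_line(line, lexicon, bigrams):
    out, prev = [], None
    for tok in line.lower().split():
        fixed = correct_token(tok, prev, lexicon, bigrams)
        out.append(fixed)
        prev = fixed
    return " ".join(out)

# Hypothetical miniature resources; real ones would be period-specific.
lexicon = {"it", "is", "surely", "a", "mortal", "sin", "fin", "to"}
bigrams = {("mortal", "sin")}
corrected = correct_line("It is furely a mortal fin to", lexicon, bigrams)
```

How aggressive Rule 2 should be is exactly the precision/recall balance that differs between projects.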

  11. Cleaning/enriching the metadata
     • “18??”
     • Discard duplicate volumes / select early editions?
     • Add metadata that you need for interpretive purposes, like
        - gender (see Ben Schmidt’s technique),
        - genre.
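
A small sketch of two of the metadata steps: turning an uncertain date such as “18??” into a range, and keeping one early edition per work. The record fields (`author`, `title`, `date`) and the tie-breaking policy (exact dates preferred over uncertain ones) are assumptions, not the project's actual rules.

```python
import re

def parse_year_range(raw):
    """Map a catalog date to an inclusive (earliest, latest) year range.

    "1843" -> (1843, 1843); "184?" -> (1840, 1849); "18??" -> (1800, 1899).
    Returns None unless the string is four characters: digits followed by
    zero to three trailing '?' marks.
    """
    raw = raw.strip()
    if len(raw) != 4 or not re.fullmatch(r"\d{1,4}\?{0,3}", raw):
        return None
    return int(raw.replace("?", "0")), int(raw.replace("?", "9"))

def earliest_editions(records):
    """Keep one record per (author, title), preferring exact, early dates."""
    best = {}
    for rec in records:
        span = parse_year_range(rec["date"])
        if span is None:
            continue  # undatable records are left out of this sketch
        key = (rec["author"].lower(), rec["title"].lower())
        # Exact dates sort before uncertain ranges, then by earliest year.
        rank = (span[0] != span[1], span[0])
        if key not in best or rank < best[key][0]:
            best[key] = (rank, rec)
    return [rec for _, rec in best.values()]

# Toy records (hypothetical catalog entries).
records = [
    {"author": "Shelley, Mary", "title": "Frankenstein", "date": "1831"},
    {"author": "Shelley, Mary", "title": "Frankenstein", "date": "1818"},
    {"author": "Shelley, Mary", "title": "Frankenstein", "date": "18??"},
]
kept = earliest_editions(records)
```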

  12. first stab at genre – naive Bayes
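
For readers unfamiliar with the method, a self-contained sketch of a multinomial naive Bayes classifier with add-one smoothing. The class, the genre labels, and the four toy training documents are hypothetical; the slide does not describe the features or training data actually used.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def log_scores(self, tokens):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokens:
                if tok in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[tok] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)

# Toy training set (hypothetical).
train_docs = [
    "she said he replied she smiled".split(),
    "he said and she laughed".split(),
    "o thou wild west wind thy breath".split(),
    "thy sweet love and thou".split(),
]
train_labels = ["fiction", "fiction", "poetry", "poetry"]
model = NaiveBayes().fit(train_docs, train_labels)
guess_1 = model.predict("he said she replied".split())
guess_2 = model.predict("thou wild wind".split())
```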

  13. Things we could share
     • period lexicons / variant spellings
     • gazetteers of proper nouns
     • OCR correction rules for a period
     • document segmentation and/or cleaned and segmented text
     • ferberization
     • cleaned / enriched metadata
     • code to do all of the above

  14. get clues from metadata; break vols into parts; ensemble / boosting; active learning

  15. active learning: documents classified as “fiction,” plotted by confidence in classification (y axis). Red points are misclassified.
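
The low-confidence region of such a plot is where hand labeling is likely to pay off most. A minimal sketch of that selection step, assuming uncertainty sampling (one common active-learning strategy; the slide does not say which was used) and hypothetical volume ids and confidence values:

```python
def pick_for_labeling(predictions, k):
    """Return the ids of the k documents the classifier is least sure about.

    `predictions` maps a document id to the probability the classifier
    assigned to its predicted class (for two classes, 0.5 is a coin flip
    and 1.0 is certainty).
    """
    ranked = sorted(predictions, key=lambda doc_id: predictions[doc_id])
    return ranked[:k]

# Hypothetical confidences for four volumes classified as "fiction".
predictions = {"vol_a": 0.99, "vol_b": 0.55, "vol_c": 0.91, "vol_d": 0.62}
to_label = pick_for_labeling(predictions, k=2)
```

The selected volumes would be labeled by hand, added to the training set, and the classifier retrained before the next round.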
