A TaLISMAN: Automatic Text and LIne Segmentation of historical MANuscripts

Ruggero Pintus (1), Ying Yang (2), Enrico Gobbetti (1) and Holly Rushmeier (2)
(1) CRS4, (2) Yale University
ruggero.pintus@crs4.it, ying.yang.yy368@yale.edu, enrico.gobbetti@crs4.it, holly.rushmeier@yale.edu

Given a book, we extract the text leading and image features of each page. We select the most salient pages and image descriptors and compute a rough text segmentation, which we use to train an SVM classifier. We then rerun the prediction on all of the original features to obtain a fine segmentation. We convert these sparse text positions into a dense text-region representation, and finally extract text blocks and lines.

Evaluated on a heterogeneous corpus: ~3K pages, ~4K blocks, ~66K lines.

Robust to:
- Different writing styles
- High layout variability
- One, two or more columns, marginalia, calendars
- Presence of capital letters, portraits, ornamental bands, graphical contents
- Aging: holes, spots, ink bleed-through, fading, missing parts, damage

Figure panels: Original, Text regions, Text blocks, Text lines.
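The final step, going from a dense text-region representation to individual text lines, can be illustrated with a small sketch. This is not the authors' method: it swaps in a plain horizontal projection profile, and the function name split_text_lines, the min_fill threshold, and the toy mask are all hypothetical choices made for illustration.

```python
def split_text_lines(mask, min_fill=0.2):
    """Split a binary text mask into lines using a row projection profile.

    mask     -- list of rows, each a list of 0/1 values (1 = text pixel).
    min_fill -- assumed threshold: fraction of text pixels a row needs
                to count as part of a text line.
    Returns a list of (top_row, bottom_row) pairs, both inclusive.
    """
    lines = []
    start = None  # top row of the line currently being tracked
    for y, row in enumerate(mask):
        fill = sum(row) / len(row) if row else 0.0
        if fill >= min_fill:
            if start is None:
                start = y  # a new text line begins here
        elif start is not None:
            lines.append((start, y - 1))  # the line ended on the previous row
            start = None
    if start is not None:
        lines.append((start, len(mask) - 1))  # line runs to the last row
    return lines


# Toy 7x6 mask with two text lines separated by an empty row.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
print(split_text_lines(page))  # [(1, 2), (4, 5)]
```

A projection profile like this only works when lines are roughly horizontal and well separated. Skewed or touching lines, which are common in damaged manuscripts, would need something more robust.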
