A System for Understanding Imaged Infographics and Its Applications Weihua Huang, Chew Lim Tan School of Computing National University of Singapore
Outline • Introduction • Syntactic and semantic information in scientific charts • Chart recognition • Chart interpretation • Applications • Experiment results • Conclusion
Introduction • Information graphics (infographics) are frequently used in many kinds of documents. • Recognizing and interpreting infographics is important for automatic document processing and information retrieval. • Recognition task: what are the elements/components in an infographic? • Interpretation task: what is an infographic trying to tell? • This paper focuses on one type of infographic: scientific charts.
Introduction • Imaged infographics are harder to recognize and interpret because everything is in pixels!
Scientific Charts • Syntactic elements (as labeled in the figure): chart title, origin, X-axis end, Y-axis end, X-axis ticks, Y-axis ticks, X-axis label, Y-axis label, X-axis title, Y-axis unit, data components.
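Scientific Charts • Semantic information: tabular data and an intended message (comparison, trend, distribution, etc.) are conveyed through a graphical representation. • Recognition and interpretation are the reverse process: they start from the graphical representation and recover the tabular data and the intended message.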
Chart Recognition • Preprocessing • Text/graphics separation (connected component analysis): splits the original image into text components and a graphical image • Edge detection (Canny edge detector): turns the graphical image into an edge map
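A minimal sketch of text/graphics separation by connected component analysis. The slide does not say how components are classified, so the bounding-box size rule and the `max_text_size` threshold below are assumptions for illustration only; the tiny image is hypothetical.

```python
from collections import deque


def connected_components(grid):
    """Return the 8-connected foreground components as lists of (row, col)."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if not grid[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue, pixels = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def separate_text_graphics(grid, max_text_size=3):
    """Assumed heuristic: small bounding boxes are text, large ones graphics."""
    text, graphics = [], []
    for pixels in connected_components(grid):
        ys = [p[0] for p in pixels]
        xs = [p[1] for p in pixels]
        height = max(ys) - min(ys) + 1
        width = max(xs) - min(xs) + 1
        target = text if max(height, width) <= max_text_size else graphics
        target.append(pixels)
    return text, graphics


# Hypothetical binary image: an L-shaped axis plus one small character-like blob.
IMAGE = [
    "#.......",
    "#..##...",
    "#..##...",
    "#.......",
    "########",
]
GRID = [[ch == "#" for ch in row] for row in IMAGE]
TEXT, GRAPHICS = separate_text_graphics(GRID)
```

Here the 2x2 blob ends up in `TEXT` and the axis shape in `GRAPHICS`. A real system would tune the size rule to the image resolution.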
Chart Recognition • Graphical symbol construction • Vectorization of the edge map: DSCC yields straight segments; ellipse fitting yields circular arcs and elliptic arcs • Detection of coordinate lines, based on: • Geometric constraint between candidate lines • Coverage of other lines in the candidate plot area • Attachment of text blocks
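One way the first two criteria for coordinate line detection could be combined is sketched below. It is an assumption, not the authors' procedure: a horizontal and a vertical segment must share an endpoint (the geometric constraint), and the pair whose plot area covers the most other segments wins. Text-block attachment is left out, and all coordinates are hypothetical.

```python
def is_horizontal(seg, tol=5):
    (x1, y1), (x2, y2) = seg
    return abs(y1 - y2) <= tol < abs(x1 - x2)


def is_vertical(seg, tol=5):
    (x1, y1), (x2, y2) = seg
    return abs(x1 - x2) <= tol < abs(y1 - y2)


def share_endpoint(a, b, tol=5):
    return any(abs(p[0] - q[0]) <= tol and abs(p[1] - q[1]) <= tol
               for p in a for q in b)


def coverage(h, v, others, tol=5):
    """Count segments lying inside the plot area spanned by h and v."""
    x0, x1 = sorted(p[0] for p in h)
    y0, y1 = sorted(p[1] for p in v)

    def inside(p):
        return x0 - tol <= p[0] <= x1 + tol and y0 - tol <= p[1] <= y1 + tol

    return sum(1 for s in others if inside(s[0]) and inside(s[1]))


def find_axes(segments, tol=5):
    """Return the (x_axis, y_axis) pair with the best coverage, or None."""
    best = None
    for h in (s for s in segments if is_horizontal(s, tol)):
        for v in (s for s in segments if is_vertical(s, tol)):
            if not share_endpoint(h, v, tol):
                continue
            others = [s for s in segments if s != h and s != v]
            score = coverage(h, v, others, tol)
            if best is None or score > best[0]:
                best = (score, h, v)
    return None if best is None else (best[1], best[2])


# Hypothetical vectorized segments: two axes and the three edges of one bar.
X_AXIS = ((10, 100), (200, 100))
Y_AXIS = ((10, 100), (10, 10))
SEGMENTS = [
    X_AXIS,
    Y_AXIS,
    ((30, 100), (30, 60)),
    ((30, 60), (50, 60)),
    ((50, 100), (50, 60)),
]
AXES = find_axes(SEGMENTS)
```

The bar's top and side edges also form horizontal/vertical pairs, but they cover fewer segments than the true axes, so `AXES` is `(X_AXIS, Y_AXIS)`.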
Chart Recognition • Graphical symbol construction (cont.) • Construction of data components • Bottom-up process using the vectorized edges and their intersections • Model-based parsing rules using domain knowledge
• Example: BarChart = {x-axis, y-axis, BarSet}, where BarSet = {Bar} with number of elements ≥ 2, and Bar = {l1, l2, l3 | l1 ⊥ l3, l2 ⊥ l3, l3 || x-axis, CE(l1, l3), CE(l2, l3), EL(l1, x-axis), EL(l2, x-axis)}
• Constraints: a || b: line a is parallel to line b. a ⊥ b: line a is perpendicular to line b. CE(a, b): shapes a and b share one common endpoint. EL(a, b): one endpoint of shape a lies on shape b.
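The Bar rule can be checked mechanically once the constraints are given tolerances. The sketch below is illustrative: the tolerance values and helper names are assumptions, while the coordinates are those of bar1 and the X axis in the sample result slide.

```python
import math


def angle_between(a, b):
    """Unsigned angle in degrees (0 to 90) between the lines through a and b."""
    ax, ay = a[1][0] - a[0][0], a[1][1] - a[0][1]
    bx, by = b[1][0] - b[0][0], b[1][1] - b[0][1]
    cos = abs(ax * bx + ay * by) / (math.hypot(ax, ay) * math.hypot(bx, by))
    return math.degrees(math.acos(min(1.0, cos)))


def parallel(a, b, tol_deg=3.0):
    return angle_between(a, b) <= tol_deg


def perpendicular(a, b, tol_deg=3.0):
    return angle_between(a, b) >= 90.0 - tol_deg


def point_to_segment(p, seg):
    """Distance from point p to the closest point on segment seg."""
    (x1, y1), (x2, y2) = seg
    dx, dy = x2 - x1, y2 - y1
    t = ((p[0] - x1) * dx + (p[1] - y1) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.dist(p, (x1 + t * dx, y1 + t * dy))


def common_endpoint(a, b, tol=3.0):
    """CE(a, b): a and b share one common endpoint."""
    return any(math.dist(p, q) <= tol for p in a for q in b)


def endpoint_on(a, b, tol=3.0):
    """EL(a, b): one endpoint of a lies on b."""
    return any(point_to_segment(p, b) <= tol for p in a)


def is_bar(l1, l2, l3, x_axis):
    """Bar = {l1, l2, l3} with the seven constraints from the slide."""
    return (perpendicular(l1, l3) and perpendicular(l2, l3)
            and parallel(l3, x_axis)
            and common_endpoint(l1, l3) and common_endpoint(l2, l3)
            and endpoint_on(l1, x_axis) and endpoint_on(l2, x_axis))


X_AXIS = ((239, 304), (994, 290))
L1 = ((281, 302), (281, 249))   # left edge of bar1
L2 = ((346, 301), (345, 248))   # right edge of bar1
L3 = ((281, 249), (345, 248))   # top edge of bar1
BAR1_OK = is_bar(L1, L2, L3, X_AXIS)
SLANTED_OK = is_bar(L1, L2, ((281, 249), (345, 200)), X_AXIS)
```

With these tolerances bar1 satisfies the rule, while a slanted top edge does not. A BarChart would additionally require the two axes and at least two such bars.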
Chart Recognition • Text grouping • Yuan’s method is used to group connected components • Text recognition • ScanSoft OmniPage Capture SDK 12.0 • Recognition errors are corrected manually.
Chart Recognition • Sample result:
Type: bar chart
Green: bars
Bar1: (281,249), (345,248), (346,301), (281,302)
Bar2: (430,109), (494,108), (499,298), (435,299)
Bar3: (581,134), (645,132), (648,296), (585,298)
……
Red: axes
X: (239,304) to (994,290)
Y: (239,304) to (236,100)
Chart Interpretation • Associating text with graphics • Assign a syntactic role to each text block • Label graphical symbols using the text blocks • 11 roles of text in scientific charts are identified • The problem is modeled as classification of text blocks
Chart Interpretation • Associating text with graphics (cont.) • To train the classifier and to classify a new text block, the following features are defined: • Distance to the nearest graphical symbol • Type of the nearest graphical symbol • Relative position of the text block and the graphical symbol • Type of the text string itself • Centricity of the text block • The C4.5 learning algorithm is used to build the decision tree.
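C4.5 chooses the attribute to split on by gain ratio (information gain divided by split information). The sketch below computes it for categorical features only; the feature names and the four training rows are hypothetical and are not taken from the authors' data.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (bits) of a list of categorical values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(values).values())


def gain_ratio(rows, feature, target):
    """Gain ratio of splitting rows on a categorical feature."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    total = len(rows)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy([row[target] for row in rows]) - remainder
    split_info = entropy([row[feature] for row in rows])
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical text blocks described by two of the features on the slide.
ROWS = [
    {"nearest_symbol": "x-axis", "string_type": "text", "role": "x-axis label"},
    {"nearest_symbol": "x-axis", "string_type": "text", "role": "x-axis label"},
    {"nearest_symbol": "y-axis", "string_type": "numeric", "role": "y-axis label"},
    {"nearest_symbol": "y-axis", "string_type": "text", "role": "y-axis label"},
]
BY_SYMBOL = gain_ratio(ROWS, "nearest_symbol", "role")
BY_STRING = gain_ratio(ROWS, "string_type", "role")
```

On this toy data the type of the nearest symbol separates the two roles perfectly, so it has the higher gain ratio and would be chosen for the root split. Numeric features such as distance need a threshold search, which is omitted here.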
Chart Interpretation • Obtaining the tabular data • Assign a label to each data entry if its label is not directly presented.
D1: distance to the nearest label on the left (L1). D2: distance to the nearest label on the right (L2).
If (D1 < D2) label = L1
Else if (D1 > D2) label = L2
Else label = L1 + L2
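The label assignment rule translates directly into code. In this sketch the horizontal positions, the handling of a missing neighbor, and the reading of "L1 + L2" as the two labels joined by a space are assumptions; the label strings and coordinates are hypothetical.

```python
import math


def assign_label(entry_x, labels):
    """Pick a label for a data entry from (x_center, text) label positions.

    D1/D2 are the distances to the nearest label on the left/right.
    """
    left = [(entry_x - x, text) for x, text in labels if x <= entry_x]
    right = [(x - entry_x, text) for x, text in labels if x > entry_x]
    d1, l1 = min(left, default=(math.inf, ""))
    d2, l2 = min(right, default=(math.inf, ""))
    if d1 < d2:
        return l1
    if d1 > d2:
        return l2
    # Equal distances: combine both labels (assumed meaning of "L1 + L2").
    return f"{l1} {l2}".strip()


LABELS = [(100, "1984"), (200, "1986")]
NEAR_LEFT = assign_label(120, LABELS)
NEAR_RIGHT = assign_label(180, LABELS)
MIDDLE = assign_label(150, LABELS)
```

An entry exactly halfway between two labels receives both, which matches the final `Else` branch on the slide.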
Chart Interpretation • Obtaining the tabular data (cont.) • Calculate the value of each data entry if its value is not directly presented.
H1: data height. H2: unit height.
Example: if the value per unit height is 30, the data value is H1 * 30 / H2.
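The value calculation is a simple proportion between pixel heights and axis units. A minimal sketch follows, with hypothetical pixel heights; only the value of 30 per unit height comes from the slide.

```python
def data_value(data_height, unit_height, value_per_unit):
    """Convert a pixel height H1 to a data value, given unit height H2."""
    if unit_height <= 0:
        raise ValueError("unit height must be positive")
    return data_height * value_per_unit / unit_height


# Hypothetical measurements: a bar 120 px tall, one axis unit 40 px tall.
VALUE = data_value(data_height=120, unit_height=40, value_per_unit=30)
```

Here the bar spans three unit heights, so `VALUE` is 90.0.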
Chart Interpretation • Generating chart description • XML format description • Keeps the data in tabular form • Good for querying on data values or labels • Natural language description • Fact-based sentences generated from templates • Good for factoid questions
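Both description formats can be produced from the same table. The slides do not show the actual XML schema or sentence templates, so the element names, attribute names, template wording, and data in this sketch are all hypothetical.

```python
import xml.etree.ElementTree as ET


def chart_to_xml(chart_type, title, rows):
    """Serialize (label, value) rows into an assumed XML layout."""
    chart = ET.Element("chart", {"type": chart_type})
    ET.SubElement(chart, "title").text = title
    data = ET.SubElement(chart, "data")
    for label, value in rows:
        entry = ET.SubElement(data, "entry", {"label": label})
        entry.text = str(value)
    return ET.tostring(chart, encoding="unicode")


def describe(title, rows,
             template="In {title}, the value for {label} is {value}."):
    """Generate one fact-based sentence per row from a template."""
    return [template.format(title=title, label=label, value=value)
            for label, value in rows]


ROWS = [("1984", 90), ("1985", 120)]
XML = chart_to_xml("bar", "Fatalities by year", ROWS)
SENTENCES = describe("Fatalities by year", ROWS)
```

The XML keeps labels and values addressable for queries, while the generated sentences can be added to the document text for factoid question answering.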
Applications • Enriching OCR output • Traditional OCR output: text (electronic) + figures (image format) • The information in figures is not extracted • The proposed system helps to extract more information • The tabular data obtained can be used to reproduce the document in machine-readable form.
Applications • Enriching OCR output (cont.) • Approach: the scanned document is segmented into imaged text, imaged infographics, and layout information; OCR turns the imaged text into electronic text; the proposed system turns each imaged infographic into an XML description; the document is then reproduced from these outputs. • Question: where to insert the infographics? Clue: look for the figure number in the text.
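Looking for the figure number in the text can be done with a regular expression. This is a sketch under assumptions: the pattern covers only references written as "Figure N" or "Fig. N", the insertion point is taken to be the first paragraph that mentions the figure, and the paragraphs are hypothetical.

```python
import re

# Matches "Figure 3", "Fig. 3", and "Fig 3", case-insensitively.
FIGURE_REF = re.compile(r"\bFig(?:ure|\.)?\s*(\d+)", re.IGNORECASE)


def find_figure_references(paragraphs):
    """Map each figure number to the index of the first paragraph citing it."""
    first_mention = {}
    for index, paragraph in enumerate(paragraphs):
        for match in FIGURE_REF.finditer(paragraph):
            first_mention.setdefault(int(match.group(1)), index)
    return first_mention


PARAGRAPHS = [
    "This section introduces the data set.",
    "As shown in Figure 1, fatalities peaked in the mid-1980s.",
    "See Fig. 2 for a breakdown; Figure 1 gives the totals.",
]
REFERENCES = find_figure_references(PARAGRAPHS)
```

Each reproduced infographic could then be placed after the paragraph index recorded for its figure number.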
Applications • Assisting QA systems • Question type 1: factoid question • Example: “How many fatalities were there in the year 1984?” • Solution: add the NL description of the infographic to the original text • Question parsing and answer extraction: Cui et al.’s method based on soft pattern matching
Applications • Assisting QA systems (cont.) • Question type 2: query-like question • Example: “What is the maximum number of fatalities among all years?” • Solution: translate the question into one of the pre-defined queries • Question translation: semantic parser proposed by Mooney et al.
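A pre-defined query is just a function over the recovered table. In the sketch below a plain keyword lookup stands in for the semantic parser, which is a deliberate simplification and not the method named on the slide; the query names and the table values are hypothetical.

```python
# Pre-defined queries over a {label: value} table (names are assumptions).
QUERIES = {
    "maximum": lambda table: max(table.items(), key=lambda item: item[1]),
    "minimum": lambda table: min(table.items(), key=lambda item: item[1]),
}


def answer(question, table):
    """Keyword stand-in for question translation; returns (label, value)."""
    text = question.lower()
    for keyword, query in QUERIES.items():
        if keyword in text:
            return query(table)
    return None  # no pre-defined query matches the question


TABLE = {"1984": 90, "1985": 120, "1986": 60}
MAX_ANSWER = answer(
    "What is the maximum number of fatalities among all years?", TABLE)
```

For the example question this returns the label and value of the largest entry. Questions that match no pre-defined query return `None`.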
Experiment Results • Chart recognition and classification: using 200 collected scientific chart images
Experiment Results • Text block classification: using 200 collected scientific chart images
Experiment Results • Question answering: using 10 scanned document pages from the UW database I
Conclusion • A system for recognizing and interpreting imaged infographics is introduced. • The current focus is on scientific charts, a commonly used type of infographic. • The system can be generalized to handle a wider variety of infographics. • The system can be enhanced to handle more complex layouts and special effects.
Thank you! Questions?