OCR Workshop Loretta Auvil UIUC October 18, 2011

Pearson Correlation Algorithm Correlation-Ngram Viewer

Correlation-Ngram Viewer • New version of the Google Ngram Viewer (for 1-grams) • Addresses case sensitivity, period spellings, past-tense syncope ('d), and f/s substitution, as well as other OCR issues • Searches within already stored Pearson correlation results for the top 10K ngrams • Computes Pearson correlation results for a given word against the top 1K ngrams
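
As an illustration of the Pearson step, here is a minimal sketch in Python. It is not the viewer's actual code: the function names, the toy series, and the choice to treat a flat series as uncorrelated are all assumptions made for this example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant series is treated as uncorrelated
    return cov / sqrt(var_x * var_y)

def top_correlated(word, series_by_word, k=5):
    """Rank the other words by correlation with `word`, highest first."""
    target = series_by_word[word]
    scored = [(other, pearson(target, series))
              for other, series in series_by_word.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Made-up yearly counts for three placeholder words (not real ngram data).
toy_series = {
    "word_a": [5, 4, 3, 2],
    "word_b": [10, 8, 6, 4],
    "word_c": [1, 2, 3, 4],
}
best_match = top_correlated("word_a", toy_series, k=1)[0][0]
```

Precomputing and storing these scores for a fixed vocabulary, as the slide describes for the top 10K ngrams, turns a query into a lookup; computing against the top 1K on demand trades coverage for response time.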

OCR Correction • HTRC example: one of the worst pages of text, based on the number of corrections per word (rate = 0.1994)
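
One way to read the corrections-per-word rate is as the fraction of tokens on a page that the correction step changed. The sketch below assumes that reading and assumes the original and corrected pages are aligned token for token; the page names and tokens are invented for illustration.

```python
def correction_rate(original_tokens, corrected_tokens):
    """Fraction of tokens changed by correction (assumes aligned token lists)."""
    if len(original_tokens) != len(corrected_tokens):
        raise ValueError("token lists must be aligned one-to-one")
    if not original_tokens:
        return 0.0
    changed = sum(1 for before, after in zip(original_tokens, corrected_tokens)
                  if before != after)
    return changed / len(original_tokens)

# Hypothetical pages: (original tokens, corrected tokens).
pages = {
    "page_1": (["tbe", "cat", "fat"], ["the", "cat", "sat"]),
    "page_2": (["the", "dog", "ran"], ["the", "dog", "ran"]),
}
rates = {name: correction_rate(before, after)
         for name, (before, after) in pages.items()}
worst_page = max(rates, key=rates.get)
```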

Worst Page

Corrected Page

Some Stats

Spellcheck Component • Wrapped existing spell checker from com.swabunga.spell • Input • Dictionary: defines the correct words • Transformations: a set of rules to try on misspelled words before taking the spell checker's suggestions • Token counts: a set of counts used to choose a word when the spell checker suggests several • Output • Replacement Rules: the transformation rules for misspelled words • Replacements: suggestions for misspelled words • Corrected Text: the original text with corrections applied • Uncorrected Misspellings: the list of words for which a correction/replacement could not be found
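
The inputs and outputs listed above can be sketched as a single function. This is a hedged Python illustration, not the component's Java code and not the com.swabunga.spell API: `correct_tokens`, `toy_transform`, and `toy_suggest` are hypothetical names, and the spell checker is replaced by a stub.

```python
def correct_tokens(tokens, dictionary, transform, suggest, token_counts):
    """Try transformation rules first, then suggestions; pick by token count."""
    corrected, replacement_rules, replacements, uncorrected = [], {}, {}, []

    def count_of(word):
        return token_counts.get(word, 0)

    for token in tokens:
        if token in dictionary:
            corrected.append(token)
            continue
        # Transformation rules are tried before the spell checker's suggestions.
        hits = [(cand, rule) for cand, rule in transform(token) if cand in dictionary]
        if hits:
            best, rule = max(hits, key=lambda hit: count_of(hit[0]))
            replacement_rules[token] = (rule, best)
        else:
            options = [s for s in suggest(token) if s in dictionary]
            if not options:
                uncorrected.append(token)
                corrected.append(token)  # leave the word as it was
                continue
            best = max(options, key=count_of)
            replacements[token] = best
        corrected.append(best)
    return {
        "replacement_rules": replacement_rules,
        "replacements": replacements,
        "corrected_text": corrected,
        "uncorrected_misspellings": uncorrected,
    }

def toy_transform(word):
    """Stand-in rule set: only the s=f rule (an intended s read as f)."""
    return [(word[:i] + "s" + word[i + 1:], "s=f")
            for i, ch in enumerate(word) if ch == "f"]

def toy_suggest(word):
    """Stub standing in for the wrapped spell checker."""
    return {"czt": ["cot", "cat"]}.get(word, [])

result = correct_tokens(
    ["the", "faid", "czt", "xqz"],
    dictionary={"the", "said", "cat", "cot"},
    transform=toy_transform,
    suggest=toy_suggest,
    token_counts={"said": 120, "cat": 50, "cot": 3},
)
```

Passing the transformation and suggestion steps in as callables keeps the selection logic separate from whichever rule set or spell checker is plugged in.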

Adding Levenshtein • Use the Levenshtein algorithm to filter the list of suggestions considered • The Levenshtein distance is a metric for measuring the amount of difference between two sequences. The value of this property is expressed as a percentage that will depend on the length of the misspelled word • Example:
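
The slide's own example is not reproduced here. As a separate illustration, the following Python sketch filters suggestions by edit distance, assuming the percentage is applied to the length of the misspelled word to get the maximum allowed distance. The function names and the 25% threshold are choices made for this example.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # delete char_a
                current[j - 1] + 1,                     # insert char_b
                previous[j - 1] + (char_a != char_b),   # substitute or match
            ))
        previous = current
    return previous[-1]

def filter_suggestions(word, suggestions, max_fraction=0.25):
    """Keep suggestions within max_fraction * len(word) edits of `word`."""
    limit = max_fraction * len(word)
    return [s for s in suggestions if levenshtein(word, s) <= limit]

kept = filter_suggestions("faid", ["said", "fade", "paid", "sad"])
```

With a four-letter word and a 25% threshold, only suggestions one edit away survive, so longer misspellings tolerate proportionally more edits.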

Transformation Rules Complete List • o=0; i=1; l=1; z=2; o=3; e=3; s=3; d=3; t=4; e=4; l=4; s=o; s=5; c=6; e=6; fi=6; o=6; l=7; z=7; y=7; j=8; g=8; s=8; a=9; c=9; g=9; o=9; ti=9; b={h,o}; c={e,o,q}; cl={ct,d}; ct={cl,d,dl,dt,ft}; d={cl,ct}; dl=ct; dt=ct; e=c; fl={ss,st}; ft=ct; h={li,b,ii,ll}; i=l; j=y; l=i; li=h; m={rn,lll}; n={ll,il,h}; oe=ce; r=ll; rn=m; s=f; sh={fli,ih,jb,jh,m,sb}; ss=fl; st=fl; tb=th; th=tb; v=y; u={ll,n,ti}; y={j,v};
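
A rule list in this notation can be parsed and applied mechanically. The Python sketch below assumes that the left side of each rule is the intended text and the right side is what OCR produced (so s=f means an intended s was read as f); that direction is an inference from entries such as o=0 and m={rn,lll}, not something the slide states. It applies one substitution at a time and uses only a few rules taken from the list.

```python
def parse_rules(text):
    """Parse 'a=b; c={d,e};' into {ocr_form: [intended_form, ...]}."""
    rules = {}
    for part in text.split(";"):
        part = part.strip()
        if not part:
            continue
        intended, _, ocr_forms = part.partition("=")
        for ocr in ocr_forms.strip("{} ").split(","):
            rules.setdefault(ocr.strip(), []).append(intended.strip())
    return rules

def candidates(word, rules):
    """All words reachable from `word` by undoing one OCR substitution."""
    found = set()
    for ocr, intended_forms in rules.items():
        start = word.find(ocr)
        while start != -1:
            for intended in intended_forms:
                found.add(word[:start] + intended + word[start + len(ocr):])
            start = word.find(ocr, start + 1)
    found.discard(word)
    return found

# A subset of the rules listed above.
rules = parse_rules("s=f; m={rn,lll}; th=tb; i=1; o=0;")
from_faid = candidates("faid", rules)
from_tbe = candidates("tbe", rules)
from_rnodern = candidates("rnodern", rules)
```

Candidates produced this way would still need to be checked against the dictionary, since a substitution can yield a non-word (here "rnodern" produces both "modern" and "rnodem").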

Mashup Framework [architecture diagram; labels: Visualizations, User Interfaces, Apps, Plugins, Web Apps, Services, Meandre Workbench, Repositories, Data Analysis, Components, Flows, Meandre Data-Intensive Flows, Components, Developer Tools, Data Analytics, Visualization, Component Repository, Component Discovery, Meandre Infrastructure, Virtualization Infrastructure, Computational Resources]

Meandre for Mashups • Major Capabilities • Dataflow execution • Semantic technology (using RDF for storing meta info) • Web-Oriented • Supports publishing services for data, analytics and visualization • Modular components • Encapsulation and execution mechanism • Promotes reuse, sharing, and collaboration • Cloud-friendly infrastructure • Implements MapReduce for parallelization • Open source • Note: Trading off some performance for reuse, flexibility and modular components… with option to parallelize components to improve performance
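
To make the dataflow idea concrete, here is a small Python sketch in which a component runs as soon as all of its inputs are available. It does not use Meandre's API; `run_flow`, the wiring format, and the three toy components are hypothetical.

```python
from collections import Counter

def run_flow(components, wiring, sources):
    """Run each component once all of its upstream outputs exist.

    components: {name: function taking keyword inputs}
    wiring:     {name: {input_port: upstream_name}}
    sources:    {name: value} for data supplied from outside the flow
    """
    outputs = dict(sources)
    pending = [name for name in components if name not in outputs]
    while pending:
        ready = [name for name in pending
                 if all(up in outputs for up in wiring.get(name, {}).values())]
        if not ready:
            raise ValueError("flow has a cycle or a missing input")
        for name in ready:
            inputs = {port: outputs[up] for port, up in wiring.get(name, {}).items()}
            outputs[name] = components[name](**inputs)
            pending.remove(name)
    return outputs

# Three small, reusable components wired into one flow.
components = {
    "tokenize": lambda text: text.split(),
    "count": lambda tokens: Counter(tokens),
    "top": lambda counts: counts.most_common(1)[0][0],
}
wiring = {
    "tokenize": {"text": "text"},
    "count": {"tokens": "tokenize"},
    "top": {"counts": "count"},
}
flow_outputs = run_flow(components, wiring, {"text": "the cat the dog"})
```

Because each component only sees its named inputs, the same component can be reused in other flows by changing the wiring, which is the reuse-for-performance trade-off the note above refers to.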

Meandre Workbench • Web-based UI (GWT) • Components and flows are retrieved from server • Additional locations of components and flows can be added to server • Create flow using a graphical drag and drop interface • Change property values • Execute the flow [screenshot labels: Components, Flows, Locations]

Spellcheck Flow

Knowledge Discovery Infrastructure Benefits • Provides access to data management tools • Selecting/Loading data from databases, flat files or repositories • Integrates data mining algorithms • Supports an extensible interface for creating one’s own algorithms • Provides means for building and applying models • Provides integrated visualization components • Provides capability to build custom applications • Provides access for local or distributed computation • Provides the ability to share components and applications

From Silos to Mashups • Definition: A mashup is a web page or application that uses and combines data, presentation or functionality from two or more sources to create new services • Why do we want this? • Enable our services in many applications and on a variety of devices (laptop, high-res display wall, iPad, iPhone or others) • Sharing and reuse are good things • Reach communities with our tools and their data! • What can we do to change this? • We can think about and create data-driven solutions so that they can be mashed up with other tools. • We can build web services that can be deployed or accessed. • We can create APIs to be used.

Components • Analytics • Unsupervised Learning • Clustering • Frequent Pattern Analysis • Topic Modeling (Mallet) • Concept Mapping • Supervised Learning • Naïve Bayesian • Support Vector Machines (Weka) • Decision Trees (C4.5) • Optimization Approaches • Genetic Algorithm • Text Analysis (NLP, Entity Extraction) • OpenNLP • Stanford NER • Spellcheck • OpenMary (NLP, Text-to-Speech) • Visualization • Geographic (Google Maps) • Temporal (Simile) • Network Graphs – Link Nodes and Arcs (Protovis) • Line Charts (D3) • Parallel Coordinates (Protovis) • Stacked Area Chart (Flare) • Tag Cloud Maker • Decision Tree (Applet D2K) • Naïve Bayes (Applet D2K) • Rule Association (Applet) • Dendrogram (GWT)

Topic Modeling • Uses Mallet Topic Modeling to cluster nouns from over 4000 documents from the 19th century, with 10 segments per document • Top 10 topics, showing at most 200 keywords for each topic
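
The slide does not say how each document was cut into 10 segments. One plausible preprocessing step, sketched below in Python, splits a document's tokens into 10 contiguous segments of nearly equal size; the equal-size choice and the function name are assumptions, and the topic modeling itself (done in Mallet) is not shown.

```python
def split_into_segments(tokens, n_segments=10):
    """Split tokens into n contiguous segments whose sizes differ by at most 1."""
    base, extra = divmod(len(tokens), n_segments)
    segments, start = [], 0
    for index in range(n_segments):
        size = base + (1 if index < extra else 0)  # first `extra` segments get one more
        segments.append(tokens[start:start + size])
        start += size
    return segments

# Stand-in for a document's noun tokens.
toy_tokens = ["noun%d" % i for i in range(25)]
segments = split_into_segments(toy_tokens)
segment_sizes = [len(segment) for segment in segments]
```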

Concept Mapping • Sentiment Analysis • six core emotions (Love, Joy, Surprise, Anger, Sadness, Fear)

Thanks • Xavier Llora, lead developer, now at Google • Boris Capitanu, developer of Workbench and now lead developer • Other team members

Links • www.seasr.org • www.seasr.org/meandre
