OCR Workshop Loretta Auvil UIUC October 18, 2011
Pearson Correlation Algorithm Correlation-Ngram Viewer
Correlation-Ngram Viewer • New version of the Google ngrams viewer (for 1-grams) • Addresses case sensitivity, period spellings, past-tense syncope ('d), and f/s substitution, as well as other OCR issues • Searches within already stored Pearson correlation results for the top 10K ngrams • Computes Pearson correlation results for a given word against the top 1K ngrams
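The correlation step can be illustrated with a minimal sketch. It assumes each ngram is represented by a list of per-year frequencies of equal length; the function names `pearson` and `top_correlated` and the sample data are hypothetical and are not taken from the viewer itself.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a constant series; this sketch
        # treats it as "no correlation" (an assumption).
        return 0.0
    return cov / sqrt(var_x * var_y)


def top_correlated(word, series, k=5):
    """Rank every other ngram by its correlation with `word`."""
    target = series[word]
    scores = [(other, pearson(target, values))
              for other, values in series.items() if other != word]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Hypothetical yearly frequencies for three 1-grams.
series = {
    "steam":   [1.0, 2.0, 3.0, 4.0],
    "railway": [2.0, 4.0, 6.0, 8.0],
    "sail":    [4.0, 3.0, 2.0, 1.0],
}
ranking = top_correlated("steam", series, k=2)
```

Here `ranking` lists "railway" first (its series rises with "steam") and "sail" last (it falls as "steam" rises). Precomputing these scores for a fixed vocabulary is one way to support searching stored results, as the slide describes.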
OCR Correction • HTRC example: one of the worst pages of text, as measured by corrections per word (rate = 0.1994)
Spellcheck Component • Wrapped existing spellchecker from com.swabunga.spell • Input • Dictionary defines the correct words • Transformations are a set of rules that should be tried on misspelled words before taking the spell checker's suggestions • Token counts are a set of counts that can be used to choose a word when the spell checker suggests several • Output • Replacement Rules are the transformation rules for misspelled words • Replacements are suggestions for misspelled words • Corrected Text is the original text with corrections applied • Uncorrected Misspellings is the list of words for which a correction/replacement could not be found
Adding Levenshtein • Use the Levenshtein algorithm to filter the list of suggestions considered • The Levenshtein distance is a metric for the amount of difference between two sequences: the minimum number of single-character insertions, deletions, and substitutions needed to turn one into the other • The threshold property is expressed as a percentage, so the allowed distance depends on the length of the misspelled word
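A minimal sketch of length-relative filtering follows. It is an illustration only: the function names, the default of 0.3, and the floor of one edit are assumptions, not the component's actual settings.

```python
def levenshtein(a, b):
    """Edit distance counting single-character inserts, deletes, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def filter_suggestions(misspelled, suggestions, max_fraction=0.3):
    """Keep suggestions whose distance is within a fraction of the word length.

    `max_fraction` is a hypothetical parameter name and default. Allowing
    at least one edit for very short words is also an assumption.
    """
    limit = max(1, int(len(misspelled) * max_fraction))
    return [s for s in suggestions if levenshtein(misspelled, s) <= limit]


kept = filter_suggestions("bcok", ["book", "back", "brook", "cook"])
```

With a four-letter word and a 30% threshold the limit is one edit, so only "book" survives. A longer misspelled word would admit suggestions that are proportionally further away.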
Transformation Rules Complete List • o=0; i=1; l=1; z=2; o=3; e=3; s=3; d=3; t=4; e=4; l=4; s=o; s=5; c=6; e=6; fi=6; o=6; l=7; z=7; y=7; j=8; g=8; s=8; a=9; c=9; g=9; o=9; ti=9; b={h,o}; c={e,o,q}; cl={ct,d}; ct={cl,d,dl,dt,ft}; d={cl,ct}; dl=ct; dt=ct; e=c; fl={ss,st}; ft=ct; h={li,b,ii,ll}; i=l; j=y; l=i; li=h; m={rn,lll}; n={ll,il,h}; oe=ce; r=ll; rn=m; s=f; sh={fli,ih,jb,jh,m,sb}; ss=fl; st=fl; tb=th; th=tb; v=y; u={ll,n,ti}; y={j,v};
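The sketch below shows one way rules in this notation could be parsed and tried on a misspelled word. It rests on an assumption the slides do not confirm: that the left side of each rule is the intended text and the right side lists forms OCR may have produced in its place (so `o=0` turns "b0ok" into "book"). The function names are hypothetical, and only one substitution per word is attempted.

```python
def parse_rules(text):
    """Parse 'a=b; c={d,e};' into {intended: [misread, ...]}."""
    rules = {}
    for part in text.split(";"):
        part = part.strip()
        if not part:
            continue
        intended, misread = part.split("=", 1)
        values = [v.strip() for v in misread.strip().strip("{}").split(",")]
        # The same left side can appear in several rules (e.g. o=0 and o=3).
        rules.setdefault(intended.strip(), []).extend(v for v in values if v)
    return rules


def candidates(word, rules):
    """Every word obtained by undoing one assumed misreading at one position."""
    results = set()
    for intended, misreads in rules.items():
        for misread in misreads:
            start = word.find(misread)
            while start != -1:
                results.add(word[:start] + intended + word[start + len(misread):])
                start = word.find(misread, start + 1)
    return results


def correct(word, rules, dictionary, token_counts):
    """Return the dictionary candidate with the highest token count, or None."""
    if word in dictionary:
        return word
    hits = sorted(c for c in candidates(word, rules) if c in dictionary)
    if not hits:
        return None
    return max(hits, key=lambda c: token_counts.get(c, 0))


rules = parse_rules("o=0; s=f; m={rn,lll}; rn=m;")
dictionary = {"book", "same", "modern"}
token_counts = {"same": 10, "modern": 4}
```

With these inputs, `correct("b0ok", ...)` yields "book", `correct("fame", ...)` yields "same", and `correct("modem", ...)` yields "modern". A word with no dictionary candidate returns `None`, which corresponds to the point where the spell checker's own suggestions would be consulted.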
Mashup Framework • Diagram labels: Visualizations; User Interfaces; Apps; Plugins; Web Apps; Services; Meandre Workbench; Repositories; Data; Analysis; Components; Flows; Meandre Data-Intensive Flows; Developer Tools; Data Analytics; Visualization; Component Repository; Component Discovery; Meandre Infrastructure; Virtualization Infrastructure; Computational Resources
Meandre for Mashups • Major Capabilities • Dataflow execution • Semantic technology (using RDF for storing meta info) • Web-Oriented • Supports publishing services for data, analytics and visualization • Modular components • Encapsulation and execution mechanism • Promotes reuse, sharing, and collaboration • Cloud-friendly infrastructure • Implements MapReduce for parallelization • Open source • Note: Trading off some performance for reuse, flexibility and modular components… with option to parallelize components to improve performance
Meandre Workbench • Web-based UI (GWT) • Components and flows are retrieved from server • Additional locations of components and flows can be added to server • Create flow using a graphical drag and drop interface • Change property values • Execute the flow • Screenshot labels: Components, Flows, Locations
Knowledge Discovery Infrastructure Benefits • Provides access to data management tools • Selecting/Loading data from databases, flat files or repositories • Integrates data mining algorithms • Supports an extensible interface for creating one’s own algorithms • Provides means for building and applying models • Provides integrated visualization components • Provides capability to build custom applications • Provides access for local or distributed computation • Provides the ability to share components and applications
From Silos to Mashups • Definition: A mashup is a web page or application that uses and combines data, presentation or functionality from two or more sources to create new services • Why do we want this? • Enable our services in many applications and on a variety of devices (laptop, high-res display wall, iPad, iPhone, or others) • Sharing and reuse are good things • Reach communities with our tools and their data! • What can we do to change this? • We can think about and create data-driven solutions so that they can be mashed up with other tools. • We can build web services that can be deployed or accessed. • We can create APIs to be used.
Components • Analytics • Unsupervised Learning • Clustering • Frequent Pattern Analysis • Topic Modeling (Mallet) • Concept Mapping • Supervised Learning • Naïve Bayesian • Support Vector Machines (Weka) • Decision Trees (C4.5) • Optimization Approaches • Genetic Algorithm • Text Analysis (NLP, Entity Extraction) • OpenNLP • Stanford NER • Spellcheck • OpenMary (NLP, Text-Speech) • Visualization • Geographic (Google Maps) • Temporal (Simile) • Network Graphs – Link Nodes and Arcs (Protovis) • Line Charts (D3) • Parallel Coordinates (Protovis) • Stacked Area Chart (Flare) • Tag Cloud Maker • Decision Tree (Applet D2K) • Naïve Bayes (Applet D2K) • Rule Association (Applet) • Dendrogram (GWT)
Topic Modeling • Uses Mallet Topic Modeling to cluster nouns from over 4000 documents from the 19th century, with 10 segments per document • Shows the top 10 topics, with at most 200 keywords per topic
Concept Mapping • Sentiment Analysis • six core emotions (Love, Joy, Surprise, Anger, Sadness, Fear)
Thanks • Xavier Llora, lead developer, now at Google • Boris Capitanu, developer of Workbench and now lead developer • Other team members
Links • www.seasr.org • www.seasr.org/meandre