Handwriting Synthesis for Human Interactive Proof in Web Services CSE 717 Project – Progress Report Gabriel Terejanu
Outline • HIP Introduction • CAPTCHA Challenges • Guidelines • Previous work in Handwritten CAPTCHA • Proposal • Problem Approach • Conclusion
Human Interactive Proof (HIP) • “Defend services from malicious attacks differentiating bots from human users”. [Zhang04] • Turing Test: 1950, Alan M. Turing • First idea for the web: 1996, Moni Naor • First web implementation: 1997, AltaVista
Human Interactive Proof (HIP) • Text-based: • Printed text: Gimpy (captcha.net) • Handwritten text: Handwritten CAPTCHA [Rusu04] • Non-text-based: • Clock face (exploits common knowledge) - broken • Picture-based (exploits understanding): spot the person, count cars, tell the weather • http://hotcaptcha.com/captcha :)
CAPTCHA • Completely Automated Public Turing test to Tell Computers and Humans Apart • Manuel Blum - Carnegie Mellon University • Exploits human ability to read corrupted text
CAPTCHA : Applications • Portals • Weblogs • IM, message boards • Social networking • Spam-filtering • Banking • Web-Ticket • Web-vote • Very business-oriented field
CAPTCHA : Challenges • W3C : Inaccessibility of CAPTCHA (2005) • Sometimes very frustrating for “normal” people • Accessibility: blind, low vision, dyslexia, color blindness (lose business) • Increased cost of business to compensate for accessibility (hot lines) • False security: • Paid human operator – “Borrow (or rent) someone’s eyes” • Breaking a Visual CAPTCHA (EZ-Gimpy 92%, Gimpy 33%) [Greg Mori - UC Berkeley]
CAPTCHA : Challenges (2) • Cognitive disabilities • Foreign languages • Static CAPTCHA dangerous • Variable font (http://sam.zoy.org/pwntcha)
Dyslexia (www.dyslexia-teacher.com) • “dyslexia” comes from the Greek meaning “difficulty with words” • difficulties with spelling • confusion over left and right (b <-> d) • confusion over up and down ( p <-> 9) • writing letters or numbers backwards • difficulties with math/s • difficulty following 2- or 3-step instructions • 10-15% of the US population has dyslexia • (http://www.dyslexia-add.org) • 8-12% of males of European origin have color blindness
“An explicitly inaccessible access control mechanism should not be promoted as a solution, especially when other systems exist that are not only more accessible, but may be more effective, as well.” [W3C05]
Guidelines • Redefine the ability to read corrupted text • Easy to use • Low cost (small site mass usage) • Hard to solve (out of context) by a third person • Use understanding of the 1st grade • Very clean and well spelled text • Use of very light deformations allowed
Importance of Handwriting Generation • CAPTCHA project • Once the writer is identified, makes handwriting recognition easier • Error correction for handwritten text • Adds a personal touch to communication • Create customized fonts (My Font Tool for Tablet PC, fontifier.com)
Handwriting Synthesis • Movement simulation technique • Based on motor models • Usually accompanied by on-line acquisition • Shape simulation techniques • More practical when the dynamics of handwriting are not available • Off-line acquisition (easy to collect samples)
Why Handwritten CAPTCHA ? • “As agreed by most researchers, it is impossible to achieve a correct ratio of 100% for handwriting recognition and segmentation.” [J.Wang2004]
Handwritten CAPTCHA • Rusu and Govindaraju - HIP 2005 • Collect handwritten words and glue them together • Original images are public knowledge (city names from postal applications) • Gestalt laws of perception • Closure, similarity, proximity, symmetry, continuity, familiarity, figure and ground, memory • Randomly selected transformations • Overlaps, occlusion, extra strokes, orientation
H-CAPTCHA (Room for improvement) • Accessibility • Some instances hard to read for “normal” people • Paid eye problem, easy to solve out of context
Sequences of CAPTCHAs • Leveraging the CAPTCHA Problem by Daniel Lopresti • Relies on digital libraries and transcripts • n CAPTCHA challenges • One decision CAPTCHA permits/denies access • The rest are used to label the challenges (assuming we have a human) -> promoted to decision CAPTCHAs • No intermediate results are provided
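The bookkeeping behind this scheme can be sketched in a few lines. This is only an illustration of the idea on the slide, not Lopresti's actual algorithm: the class name, the `promote_after` vote threshold, and the lowercase answer normalization are all assumptions made for the example.

```python
from collections import Counter

class ChallengePool:
    """Hypothetical pool mixing decision CAPTCHAs with unlabeled challenges."""

    def __init__(self, promote_after=3):
        self.labeled = {}   # challenge id -> known answer, stored lowercase
        self.votes = {}     # challenge id -> Counter of answers from trusted sessions
        self.promote_after = promote_after  # assumed threshold, not from the slides

    def submit(self, responses):
        """responses maps challenge id -> typed answer; returns True to permit access."""
        decision_ids = [cid for cid in responses if cid in self.labeled]
        # Access depends only on the decision CAPTCHAs; the user is never told
        # which challenges those were (no intermediate results).
        if not decision_ids:
            return False
        for cid in decision_ids:
            if responses[cid].strip().lower() != self.labeled[cid]:
                return False
        # The session is assumed human, so its other answers count as label votes.
        for cid, answer in responses.items():
            if cid in self.labeled:
                continue
            counter = self.votes.setdefault(cid, Counter())
            counter[answer.strip().lower()] += 1
            top, count = counter.most_common(1)[0]
            if count >= self.promote_after:
                self.labeled[cid] = top   # promote to a decision CAPTCHA
                del self.votes[cid]
        return True
```

A session that fails the decision CAPTCHA contributes no votes, which is what keeps bot answers out of the labels.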
Sequences of CAPTCHAs (Room for improvement) • Time consuming • Might be complicated for people with dyslexia • Paid eye problem, easy to solve out of context • May be too static / predictable
CAPTCHA Proposal • Combine the ability to read handwritten text with 1st grade understanding of the text • Moderate complexity for each task (human) => very difficult to solve for machines • Single CAPTCHA image with at most 2 lines of handwritten text • Synthetic handwriting • vast number of handwriting styles • Ligature generator • Controlled randomness • Sentence / QA generator • answer not necessarily in the CAPTCHA image • Include prior information • available in the web form fields • “Light” deformations – easily tolerated by humans
Example Proposal • English • terejanu@buffalo.edu
Sample Collection • Depends on the method • I.Guyon – glyphs, e.g.: port, sid, wil – 1 hr • J.Wang – words – relies on segmentation • Collect a series of samples for each character – no segmentation, easily scannable • A writer may have a few different styles for the same character
Landmarks extraction • Good choice: consistent from one image to another (high curvature, junctions) – for precision intermediate points • Mark by hand • J.Wang – series of 1-D Gabor filters • C.H.Teh – On the Detection of Dominant Points On Digital Curves • No parameters http://cg.scs.carleton.ca/~luc
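As a rough illustration of what "high curvature" landmarks means, the sketch below flags points of a traced stroke where the direction turns sharply. It is a simple stand-in, not the Teh and Chin dominant-point detector or the Gabor-filter method cited above; the function names and the 45-degree threshold are assumptions.

```python
import math

def turning_angle(prev, cur, nxt):
    """Absolute change of direction (radians) when passing through `cur`."""
    a_in = math.atan2(cur[1] - prev[1], cur[0] - prev[0])
    a_out = math.atan2(nxt[1] - cur[1], nxt[0] - cur[0])
    diff = abs(a_out - a_in)
    return min(diff, 2 * math.pi - diff)  # wrap into [0, pi]

def high_curvature_landmarks(points, min_angle_deg=45.0):
    """Indices of interior points whose turning angle reaches the threshold."""
    threshold = math.radians(min_angle_deg)
    return [
        i
        for i in range(1, len(points) - 1)
        if turning_angle(points[i - 1], points[i], points[i + 1]) >= threshold
    ]
```

On an L-shaped stroke only the corner is reported; intermediate points for precision would then be added between consecutive landmarks.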
Character prototype creation • Point correspondence • Shape pose/scale differences • Create prototype • Extract variations from shape clustering • N.Duta – Automatic Construction of 2D Shape Models
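A minimal sketch of the pose/scale step, assuming point correspondence is already solved so that landmark i means the same thing in every sample. Each shape is a list of complex numbers (x + yj), which makes the least-squares rotation and scale a single complex factor. This is a plain Procrustes-style alignment used for illustration, not Duta's full procedure, and all names are hypothetical.

```python
def center(shape):
    """Translate a shape (list of complex landmarks) so its centroid is 0."""
    c = sum(shape) / len(shape)
    return [z - c for z in shape]

def align(shape, reference):
    """Remove translation, rotation and scale of `shape` relative to `reference`."""
    z, w = center(shape), center(reference)
    # Least-squares complex factor a minimizing sum |a * z_i - w_i|^2.
    a = sum(zi.conjugate() * wi for zi, wi in zip(z, w)) / sum(abs(zi) ** 2 for zi in z)
    return [a * zi for zi in z]

def prototype(shapes):
    """Mean shape after aligning every sample to the first one."""
    aligned = [align(s, shapes[0]) for s in shapes]
    return [sum(points) / len(shapes) for points in zip(*aligned)]
```

The residuals between each aligned sample and the prototype are the variations that the clustering step would then work on.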
Random Character Generator • 1 character – 30 control points • Small perturbations not allowed • Preserve readability • Random generation in a simplex (uniformly distributed)
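One way to read "random generation in a simplex" is to blend several sample shapes of the same character with random convex weights. The sketch below assumes that reading: each sample is a list of 30 (x, y) control points, and the weights are drawn uniformly from the simplex by normalizing exponential variates (a flat Dirichlet distribution). Function names and data layout are hypothetical.

```python
import random

def uniform_simplex_weights(k, rng):
    """Draw k non-negative weights summing to 1, uniformly over the simplex."""
    draws = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def random_character(samples, rng=None):
    """Blend sample shapes (lists of (x, y) control points) with simplex weights."""
    rng = rng or random.Random()
    weights = uniform_simplex_weights(len(samples), rng)
    shape = []
    for i in range(len(samples[0])):
        x = sum(w * s[i][0] for w, s in zip(weights, samples))
        y = sum(w * s[i][1] for w, s in zip(weights, samples))
        shape.append((x, y))
    return shape
```

Because the result is a convex combination, every generated control point stays inside the range spanned by the collected samples, which is one way to keep the character readable.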
Curve Generation • Essential for high-resolution graphics • Lagrange interpolation • Smooth curve passes through a group of ordered control points • Blending functions - thought of as functions specifying how much the ith control point draws the curve towards it • Curve wiggles between the control points • Corners at control points when connecting curves • B-Spline Curves • The curve does not pass through each control point, but instead just passes near them • Slopes between curve segments are continuous • Usually cubic B-splines are used • Bezier Curves • Pass through the first and last control points • http://web.cs.wpi.edu/~matt/courses/cs563/talks/curves.html
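The three curve families can be compared on the same control points with a short sketch. It assumes 2-D points as (x, y) tuples, integer parameter values 0, 1, 2, ... for the Lagrange nodes, and a single uniform cubic B-spline segment; the function names are illustrative only.

```python
def lagrange_point(ctrl, t):
    """Lagrange curve through every control point; point i is reached at t = i."""
    x = y = 0.0
    for i, (px, py) in enumerate(ctrl):
        blend = 1.0  # blending function of control point i evaluated at t
        for j in range(len(ctrl)):
            if j != i:
                blend *= (t - j) / (i - j)
        x += blend * px
        y += blend * py
    return (x, y)

def bezier_point(ctrl, t):
    """Bezier curve at t in [0, 1], evaluated with de Casteljau's algorithm."""
    pts = [tuple(p) for p in ctrl]
    while len(pts) > 1:
        pts = [((1 - t) * x0 + t * x1, (1 - t) * y0 + t * y1)
               for (x0, y0), (x1, y1) in zip(pts, pts[1:])]
    return pts[0]

def bspline_point(p0, p1, p2, p3, t):
    """One uniform cubic B-spline segment at t in [0, 1]."""
    b0 = (1 - t) ** 3 / 6
    b1 = (3 * t**3 - 6 * t**2 + 4) / 6
    b2 = (-3 * t**3 + 3 * t**2 + 3 * t + 1) / 6
    b3 = t**3 / 6
    return tuple(b0 * a + b1 * b + b2 * c + b3 * d
                 for a, b, c, d in zip(p0, p1, p2, p3))
```

Evaluating all three on four control points shows the behavior listed above: the Lagrange curve hits every point, the Bezier curve hits only the first and last, and the B-spline segment starts near, but not on, the second control point.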
Ligature Generator • Aesthetic quality • Optimization process • Limited centrifugal acceleration • Limited acceleration / retardation in the direction of the velocity vector • Works for variable spacing between characters • M.Kokula – Automatic generation of script font ligature based on curve smoothness optimization
Variable width splines ? • R.V.Klassen - Variable width splines: a possible font representation? • centerline curve (spline) + width function (spline) • Control points + w (control width / scale factor) • Little experience designing characters with variable width splines • Storing burden & creation complexity
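To make the "centerline + width function" idea concrete, the sketch below turns an already-sampled centerline and a matching list of widths into the two outline polylines of a stroke. In Klassen's representation both the centerline and the width are splines; here they are assumed to be pre-sampled lists, and the function name is hypothetical.

```python
import math

def stroke_outline(centerline, widths):
    """Offset each centerline point by half its width along the local normal."""
    left, right = [], []
    n = len(centerline)
    for i, ((x, y), w) in enumerate(zip(centerline, widths)):
        # Tangent estimated from the neighboring samples (one-sided at the ends).
        x0, y0 = centerline[max(i - 1, 0)]
        x1, y1 = centerline[min(i + 1, n - 1)]
        dx, dy = x1 - x0, y1 - y0
        length = math.hypot(dx, dy) or 1.0  # guard against repeated points
        nx, ny = -dy / length, dx / length  # unit normal, 90 degrees from tangent
        left.append((x + nx * w / 2, y + ny * w / 2))
        right.append((x - nx * w / 2, y - ny * w / 2))
    return left, right
```

Randomizing the width list per stroke is one cheap way to get the variable width strokes mentioned later among the difficulties for recognition algorithms.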
Randomly Generated Sentences • http://www.manythings.org/rs/ • Subject + Verb • I swim. Joe swims. They swam. • Subject + Verb + Object • I drive a car. Joe plays the guitar. They ate dinner. • Subject + Verb + Complement • I am busy. Joe became a doctor. They look sick. • Subject + Verb + Indirect Object + Direct Object • I gave her a gift. She teaches us English. • Subject + Verb + Object + Complement • I left the door open. • We elected him president. • They named her Jane.
Possible Generated Text • Sentence + Question => one word answer • I drove a car. • What did I drive ? => car • Sentence + Question (prior) • Your name is Gabriel Terejanu. • What is your zip code? => 14217 • Instructions (use with prior information) • Write again your email address. => terejanu@buffalo.edu • Etc… (help)
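A toy version of the sentence / QA generator, covering the Subject + Verb + Object pattern and the prior-information variant. The vocabulary, function names, and form-field keys are made up for illustration; a real generator would need far more variety in sentence formulation.

```python
import random

# Hypothetical toy vocabulary: past-tense verb -> (base form, possible objects).
VERBS = {
    "drove": ("drive", ["a car", "a bus"]),
    "ate": ("eat", ["dinner", "lunch"]),
    "played": ("play", ["the guitar", "the piano"]),
}
SUBJECTS = ["I", "They", "We"]

def sentence_challenge(rng):
    """Subject + Verb + Object sentence plus a question with a one-word answer."""
    subject = rng.choice(SUBJECTS)
    past = rng.choice(sorted(VERBS))
    base, objects = VERBS[past]
    obj = rng.choice(objects)
    asked = subject if subject == "I" else subject.lower()
    return f"{subject} {past} {obj}.", f"What did {asked} {base}?", obj.split()[-1]

def prior_challenge(form_fields, rng):
    """Ask for a value the user already typed into the web form."""
    field = rng.choice(sorted(k for k in form_fields if k != "name"))
    sentence = f"Your name is {form_fields['name']}."
    return sentence, f"What is your {field}?", form_fields[field]

def is_correct(given, expected):
    """Compare answers ignoring case and surrounding whitespace."""
    return given.strip().lower() == expected.strip().lower()
```

Only the sentence and question would be rendered as handwriting; in the prior-information case the expected answer never appears in the image at all.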
Difficulties for Recognition Algorithms • Variety of handwritten styles • Random characters • Random spacing between characters • Ligatures • Variable width strokes ? • Huge lexicon • Prior information from the web form • Random spacing between words • Random sentence generator • Variable baselines for the words in a sentence • Maybe write on a curve / wave • Different handwritten style for the two sentences • Understanding engine
Pros / Cons • Seems difficult for machines • Accessible • Easy to automate the process (collection, modulation …) • Prior information (against paid eye) • Foreign language • Breaking possibility – need variety in sentence formulation
Conclusion • Integrate the CAPTCHA generation process into a script font (PostScript Type 3): random character, ligature paper • Handwriting is a task characteristic of humans that is difficult to reproduce using algorithms • Need first results • Test procedure ?
A Sense of Success : Get the first Handwritten CAPTCHA into the W3C Reports as a better alternative to CAPTCHA