
Regexes in the Wild: Empirical Studies on Security and Correctness

Explore the usage, implementation, and impact of regular expressions (regexes) in software engineering, focusing on security vulnerabilities like ReDoS. Learn about extracting regexes, building regex corpuses, and analyzing regex patterns. Discover insights from empirical studies on regexes in programming languages.


Presentation Transcript


  1. Regexes in the Wild: Empirical Studies on Security and Correctness. James Davis, Dongyoon Lee

  2. Regexes!

  3. Talk outline 1. Background • What is a regex? • What do software engineers use them for? • How are they implemented in programming languages? • How are regexes related to security (ReDoS)? (SECURITY’18) 2. Selected research • Methods: Where do I get regexes to analyze? (ASE’19) • Security: How widespread might ReDoS be? (ESEC/FSE’18) • Correctness: The promise and perils of re-using regexes (ESEC/FSE’19) 3. Advice to my past self

  4. Part 1: Background

  5. Primer on Regular Expressions (Regexes) Notation and sample strings: ab → “ab”; a+ → “a”, “aaa”; a* → “”, “aaa”; [a-z] → “a”, “x”, “z”; \w → [0-9a-zA-Z_] • String language • Supported in all PLs • Extended Regexes • NP-hard
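The notation on this slide can be checked directly with Python's `re` module. This is a minimal sketch: the sample strings are taken from the slide, while the `SAMPLES` table and the `accepts_all` helper are illustrative names, not part of the talk.

```python
import re

# Notation from the slide, mapped to sample strings each pattern should accept.
# Note: in Python, \w also matches "_" and, for str patterns, Unicode word characters.
SAMPLES = {
    r"ab": ["ab"],
    r"a+": ["a", "aaa"],
    r"a*": ["", "aaa"],
    r"[a-z]": ["a", "x", "z"],
    r"\w": ["0", "q", "Q"],
}

def accepts_all(pattern: str, strings: list) -> bool:
    """Return True if the pattern matches every string in full."""
    return all(re.fullmatch(pattern, s) is not None for s in strings)

for pattern, strings in SAMPLES.items():
    print(pattern, accepts_all(pattern, strings))
```

`re.fullmatch` is used so that each sample must be matched from its first to its last character, which mirrors the "string language" view of a regex.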

  6. Software engineers use regexes for… • In 30-40% of Python and JavaScript projects • Diverse purposes: User-agent string → Server-side rendering; File names → Command-line tools; Tokenizing → Lexers; HTML → Browser plug-ins; Email → Input validation [C&S ‘16] [D et al. ‘18]

  7. Some regex examples • /.+$/ → “Chars” • /\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/ → IPv4 • /^[a-zA-Z0-9]+([._]?[a-zA-Z0-9]+)*$/ → Username Super-linear “ReDoS regex”
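As a rough illustration, two of the regexes on this slide behave as follows in Python. The constant names and test strings are assumptions made for this sketch; only the patterns come from the slide. Only short inputs are used, since the username pattern is the kind of regex the talk flags as super-linear.

```python
import re

# Patterns copied from the slide; names are illustrative.
IPV4_LIKE = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")
USERNAME = re.compile(r"^[a-zA-Z0-9]+([._]?[a-zA-Z0-9]+)*$")

# The IPv4 pattern is unanchored, so it finds an address inside a longer string.
print(IPV4_LIKE.search("host 192.168.0.1 up").group(0))  # 192.168.0.1

# It checks shape only, not numeric range, so out-of-range octets also match.
print(bool(IPV4_LIKE.search("999.999.999.999")))  # True

# The username pattern allows single "." or "_" separators between alphanumerics.
print(bool(USERNAME.search("susie.q_99")))  # True
print(bool(USERNAME.search("susie..q")))    # False
```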

  8. How most regex engines work [Spencer’94] What are they? match(regex, string) → “Does this regex match this string?” How do they work? • /^a+$/ • Simulate on input “aaa”
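The simulation on this slide can be sketched as a depth-first, backtracking walk over a small NFA. The state names, transition table, and function name below are hypothetical and hand-built for /^a+$/; real engines compile the regex to a comparable structure automatically.

```python
from typing import Dict, List, Tuple

# Hand-built NFA for /^a+$/ (an assumption for this sketch):
#   q0 --a--> q1,  q1 --a--> q1,  accept when in q1 at end of input.
TRANSITIONS: Dict[Tuple[str, str], List[str]] = {
    ("q0", "a"): ["q1"],
    ("q1", "a"): ["q1"],
}
ACCEPTING = {"q1"}

def backtracking_match(text: str, state: str = "q0", pos: int = 0) -> bool:
    """Try each outgoing transition in turn; back up when a path dies."""
    if pos == len(text):
        return state in ACCEPTING
    for next_state in TRANSITIONS.get((state, text[pos]), []):
        if backtracking_match(text, next_state, pos + 1):
            return True
    return False  # no transition led to acceptance from here

print(backtracking_match("aaa"))  # True
print(backtracking_match("aab"))  # False
```

This NFA has only one choice per step, so the search is linear. Super-linear behavior arises when a state offers several transitions on the same character and the input forces the engine to try them all.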

  9. Super-linear “ReDoS Regexes” Simple ReDoS regex: /^(a+)+$/ (NFA) Malicious input: “aaaaaaaaaa…aa!” Mismatch → backtracking; exponential paths. Recurrence relation: T(n) = 2*T(n-1) = 2*(2*T(n-2)) = … = O(2^n)
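The exponential growth can be made concrete without timing anything by counting the attempts a naive backtracking matcher makes. The function below is a hand-written sketch specialised to /^(a+)+$/, not a general engine; its name and step-counting convention are assumptions. On a mismatching input of n "a" characters followed by "!", this particular matcher makes 2^n - 1 attempts.

```python
def count_backtracking_steps(text: str) -> int:
    """Count group attempts made by a naive backtracking matcher for ^(a+)+$."""
    steps = 0

    def match_groups(pos: int) -> bool:
        nonlocal steps
        # Find how far a run of 'a' extends from pos.
        end = pos
        while end < len(text) and text[end] == "a":
            end += 1
        # One (a+) group may consume text[pos:stop]; try the greedy choice first.
        for stop in range(end, pos, -1):
            steps += 1
            if stop == len(text):      # whole input consumed: overall match
                return True
            if match_groups(stop):     # otherwise try to start another group
                return True
        return False

    match_groups(0)
    return steps

for n in (5, 10, 15):
    print(n, count_backtracking_steps("a" * n + "!"))
```

Each extra "a" roughly doubles the work, matching the recurrence T(n) = 2*T(n-1). A matching input such as "aaa" succeeds on the first attempt.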

  10. Regular Expression Denial of Service (ReDoS) [C&W ‘03] /^[a-zA-Z0-9]+([._]?[a-zA-Z0-9]+)*$/ “aaa…aaaa!” “Susie” Malicious input injected [S&P ‘18] [D et al. ‘18]

  11. ReDoS @ CloudFlare – July 2019 [Chart: CPU utilization (all machines), 0% to 100%] Regex: (?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))

  12. ReDoS @ JavaScript: 10% of npm vulnerabilities SECURITY’18 – Excerpt

  13. Part 2: Research

  14. Research outline: 1. Methodology (ASE’19) 2. ReDoS (ESEC/FSE’18) 3. Re-use (ESEC/FSE’19)

  15. ASE’19 – Excerpt: How to build a regex corpus (research excerpt). James Davis, Daniel Moyer, Ayaan Kazerouni, Dongyoon Lee

  16. What’s a regex corpus, and why do I want one? Q: How to extract regexes? • Typical regex practices • Popular and unpopular features • Extent of super-linear regexes → Tool builders, regex engine devs.

  17. How have researchers built regex corpuses? • Programming language • Software → Applications, modules • Extraction methodology → “Static” and “Dynamic”
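To make "static" extraction concrete, here is a minimal sketch for Python source code using the standard `ast` module. It is not the authors' tooling. The function name, the set of `re` functions considered, and the restriction to string-literal first arguments of calls written as `re.<name>(...)` are all assumptions of this sketch.

```python
import ast
from typing import List

# Functions in the `re` module whose first argument is a pattern.
REGEX_FUNCS = {"compile", "match", "search", "fullmatch",
               "findall", "finditer", "sub", "subn", "split"}

def extract_static_regexes(source: str) -> List[str]:
    """Collect string-literal patterns passed to re.<func>(...) calls."""
    patterns = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call) or not node.args:
            continue
        func = node.func
        is_re_call = (
            isinstance(func, ast.Attribute)
            and isinstance(func.value, ast.Name)
            and func.value.id == "re"
            and func.attr in REGEX_FUNCS
        )
        first = node.args[0]
        if is_re_call and isinstance(first, ast.Constant) and isinstance(first.value, str):
            patterns.append(first.value)
    return patterns

example = 'import re\nA = re.compile(r"^a+$")\nB = re.search(pattern_var, text)\n'
print(extract_static_regexes(example))  # ['^a+$']
```

The second call in `example` is skipped because its pattern is a variable. Patterns that are only known at run time are the ones a dynamic approach can observe and a static one cannot.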

  18. Experimental design • Programming language • Software → Important open-source modules • Extraction methodology → Let’s try both! Research question: Does it matter which extraction methodology we follow?

  19. Regex collection methodology (125K regexes): Module selection (“Important”: top 25K) → Module regex extraction (static analysis, program instrumentation)
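Program instrumentation, the dynamic counterpart to static analysis, can be sketched in Python by temporarily wrapping `re.compile` and recording every pattern that reaches it. This illustrates the idea only and is not the instrumentation used in the study; `record_compiled_regexes` is a hypothetical helper. It sees only explicit `re.compile` calls, not patterns passed straight to `re.search` and similar functions.

```python
import re
from contextlib import contextmanager
from typing import Iterator, List

@contextmanager
def record_compiled_regexes() -> Iterator[List[str]]:
    """Record string patterns passed to re.compile while the block runs."""
    seen: List[str] = []
    original = re.compile

    def recording_compile(pattern, flags=0):
        if isinstance(pattern, str):
            seen.append(pattern)
        return original(pattern, flags)

    re.compile = recording_compile
    try:
        yield seen
    finally:
        re.compile = original  # always restore the real function

with record_compiled_regexes() as seen:
    re.compile("a" + "+")  # built at run time, invisible to a literal-only static pass
print(seen)  # ['a+']
```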

  20. Regex metrics • Representation: string length, features used, NFA size • Language diversity: matching strings • Complexity: DFA size, super-linear behavior (“ReDoS regexes”), use of advanced features

  21. Representative results (all charts look like this)

  22. Conclusions from this work

  23. Within a Prog. Lang., the two competing regex extraction methodologies yield indistinguishable regex corpuses (So let’s use the easier methodology!)

  24. Research outline: 1. Methodology (ASE’19) 2. ReDoS (ESEC/FSE’18) 3. Re-use (ESEC/FSE’19)

  25. ESEC/FSE’18 – Distinguished paper: The Impact of ReDoS in Practice. James Davis, Christy Coghlan, Francisco Servant, Dongyoon Lee

  26. Highlighted contributions • ReDoS regexes are prevalent in the wild (not just one or two: thousands!) • Developers (try to) fix them with 3 techniques

  27. ReDoS Regexes in the Wild

  28. Collecting Regexes: Module selection (565K, 125K) → “Can clone” filter (375K (66%), 70K (58%)) → Module regex extraction (45%, 35%) → Giant list of regexes (350K, 60K)

  29. Analyzing Regexes 60K 350K 2. Degree 1. ReDoS regexes [R&T ‘14] [WMBW ‘16] [WOHD ‘17] 704 (1%) 3.6K (1%) 13K (3%) 705 (1%)

  30. ReDoS Regexes are Usually Quadratic

  31. ReDoS Regexes occur in Prominent Places 1 regex 1 regex 3 regexes 3 regexes 2 regexes

  32. The Repair of ReDoS Regexes

  33. (ReDoS) Regexes Are Hard to Understand [CWS ‘17] /^(\+|-)?(\d+|(\d*\.\d*))? (E|e)?([-+])?(\d+)?$/ /([^\=\*\s]+)(\*)?\s*\=\s* (?:([^;'"\s]+\'[\w]*\’ [^;\s]+)|(?:\"([^"]*)\")| ([^;\s]*))(?:\s*(?:;\s*)|$)/ /^\S+@\S+\.\S+$/ /^(.*?)([.,:;!]+)$/ /<(\/)?([^ ]+?)(?:(\s*\/)| .*?)?>/ /^([\\s\\S]*?)((?:\\.{1,2}|[^\\\\\\/]+?|)(\\.[^.\\/\\\\]*|))(?:[\\\\\\/]*)$/ /^(\\/?|)([\\s\\S]*?)((?:\\.{1,2}|[^\\/]+?|(\\.[^.\\/]*|))(?:[\\/]*)$/ /\+OK.*(<[^>]+>)/ /\s*#?\s*$/ /^\s*/\*\s*(.+?)\s*\*/\s*$/

  34. Methodology • Historic disclosures & fixes: study all ReDoS reports in the CVE and Snyk.io DBs • “What do developers prefer when they know all the fix strategies?”: email 284 module maintainers (vulnerability disclosure, fix strategies)

  35. Fix Strategies for ReDoS Regexes Original: /^\S+@\S+\.\S+$/ Fix strategies • Trim: TRUNCATE(input, 1000) • Revise: /^[^@]+@([^\.@]+\.)+$/ • Replace* (custom parser): exactly one @, somewhere in the middle of the string; a ‘.’ to the right of the @, but not immediately to the right. Match different strings! cf. [vdM et al. ‘17]
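The three strategies can be sketched side by side in Python. The regexes are copied from the slide as transcribed. The function names, the reading of the three bulleted rules as the custom parser's checks, and the use of string slicing for TRUNCATE are assumptions of this sketch.

```python
import re

ORIGINAL = re.compile(r"^\S+@\S+\.\S+$")
REVISED = re.compile(r"^[^@]+@([^\.@]+\.)+$")  # as transcribed on the slide
MAX_LEN = 1000

def trim_then_match(text: str) -> bool:
    """Trim: cap the input length, then run the original regex unchanged."""
    return ORIGINAL.search(text[:MAX_LEN]) is not None

def revised_match(text: str) -> bool:
    """Revise: swap in a different regex."""
    return REVISED.search(text) is not None

def custom_parser_match(text: str) -> bool:
    """Replace: hand-written checks instead of a regex."""
    if text.count("@") != 1:           # exactly one @
        return False
    local, domain = text.split("@")
    if not local or not domain:        # the @ sits in the middle of the string
        return False
    return "." in domain[1:]           # a "." right of the @, but not adjacent to it

for candidate in ("a@b.c", "a@b.", "a@.c"):
    print(candidate, trim_then_match(candidate),
          revised_match(candidate), custom_parser_match(candidate))
```

On these inputs the three functions disagree with one another, which illustrates the slide's warning that the fixes "match different strings". Trimming keeps the original language for short inputs and only bounds the worst-case cost.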

  36. Fix strategies and correctness 2 incorrect 1 incorrect “All correct”

  37. What did we learn from this work?

  38. ReDoS regexes are a real problem in practice • Regexes are widely used in JavaScript and Python modules • 1% of unique regexes are ReDoS regexes • ReDoS regexes occur in 1-3% of modules • ReDoS regexes are hard to fix

  39. Research outline: 1. Methodology (ASE’19) 2. ReDoS (ESEC/FSE’18) 3. Re-use (ESEC/FSE’19)

  40. ESEC/FSE’19: The promise and perils of re-using regexes. James Davis, Mischa Michael, Christy Coghlan, Francisco Servant, Dongyoon Lee

  41. Summary of Contributions • Developer perspectives: Survey of 158 devs • Regex re-use: Corpus of 500K regexes • Portability experiments: Language differences

  42. The regex development process: 583K views; 1.23; \n1.23; ^\d+(\.\d{1,2})?$
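One possible reading of the two sample inputs on this slide is that the anchors `^` and `$` do not mean the same thing in every setting. The sketch below shows only Python's behavior, where the difference appears between the default mode and `re.MULTILINE`; the constant name is illustrative, and what the slide showed for other languages is not recoverable from the transcript.

```python
import re

DECIMAL = r"^\d+(\.\d{1,2})?$"  # regex from the slide

print(bool(re.search(DECIMAL, "1.23")))                  # True
# By default, ^ matches only at the very start of the string.
print(bool(re.search(DECIMAL, "\n1.23")))                # False
# With MULTILINE, ^ also matches right after a newline.
print(bool(re.search(DECIMAL, "\n1.23", re.MULTILINE)))  # True
# Even by default, $ matches just before a trailing newline.
print(bool(re.search(DECIMAL, "1.23\n")))                # True
```

A regex copied from a language whose anchors behave like the multiline case would therefore accept inputs that the same regex rejects in default-mode Python.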

  43. Research questions • Developer perspectives: Do developers re-use regexes? Where do developers re-use regexes from? Do developers believe regexes are portable across languages? • Measuring regex re-use: How commonly are regexes re-used? (other software; Internet) • Empirical portability: Different regex semantics? Different regex performance?

  44. Developer perspectives: Developers’ Regex Re-use Practices

  45. 158 Respondents: “We re-use regexes across languages” Methods: • IRB-approved survey. Respondents: the median respondent has • 3-5 years of professional experience • medium-size company • intermediate regex skill. Key results (in this paper – see also ASE’19): • 94% of developers re-use regexes • When re-using, developers are rarely confident that the regex comes from the same programming language

  46. Measuring Regex Re-Use: Polyglot Regex Corpus

  47. Regex collection methodology Module regex extraction Module selection Unique regexes Top 25K

  48. Regex corpus Unique regexes Num.modules 150K 25K 20K ” 45K ” 193K modules 580K unique regexes 45K ” 150K ” 20K ” All (30K) 140K 2K All (10K)

  49. “Complex” regex re-use by module

  50. (Empirical) Portability: Semantic + Performance Differences
