Explore the usage, implementation, and impact of regular expressions (regexes) in software engineering, focusing on security vulnerabilities like ReDoS. Learn about extracting regexes, building regex corpuses, and analyzing regex patterns. Discover insights from empirical studies on regexes in programming languages.
Regexes in the Wild: Empirical Studies on Security and Correctness James Davis Dongyoon Lee
Talk outline 1. Background • What is a regex? • What do software engineers use them for? • How are they implemented in programming languages? • How are regexes related to security (ReDoS)? (SECURITY’18) 2. Selected research • Methods: Where do I get regexes to analyze? (ASE’19) • Security: How widespread might ReDoS be? (ESEC/FSE’18) • Correctness: The promise and perils of re-using regexes (ESEC/FSE’19) 3. Advice to my past self
Primer on Regular Expressions (Regexes) Concept Sample Notation ab “ab” a+ “a”,“aaa” a* “ ”, “aaa” [a-z] “a”, “x”, “z” \w [0-9 a-z A-Z] • String language • Supported in all PLs • Extended Regexes • NP-hard
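The notation on this slide can be tried out directly. The following is a minimal sketch using Python's standard `re` module; the `EXAMPLES` table and the helper names `check` and `all_examples_hold` are illustrative and not from the talk. Note that in Python `\w` also matches the underscore (and, for `str` patterns, Unicode word characters), which is slightly broader than the `[0-9 a-z A-Z]` shorthand shown on the slide.

```python
import re

# (pattern, strings the pattern should accept, strings it should reject)
EXAMPLES = [
    (r"ab",    ["ab"],               ["a", "abb"]),
    (r"a+",    ["a", "aaa"],         ["", "b"]),
    (r"a*",    ["", "aaa"],          ["b"]),
    (r"[a-z]", ["a", "x", "z"],      ["A", "5"]),
    (r"\w",    ["a", "Z", "7", "_"], ["-", " "]),
]


def check(pattern, text):
    """Return True if the whole string is in the language of the pattern."""
    return re.fullmatch(pattern, text) is not None


def all_examples_hold():
    """Verify every accept/reject expectation in EXAMPLES."""
    for pattern, accepted, rejected in EXAMPLES:
        if not all(check(pattern, s) for s in accepted):
            return False
        if any(check(pattern, s) for s in rejected):
            return False
    return True
```

`re.fullmatch` is used so that each pattern is tested against the entire string, which matches the "string language" view on the slide.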
Software engineers use regexes for… • 30-40% of Python and JavaScript projects • Diverse purposes User-agent string Server-side rendering File names Command-line tools Tokenizing Lexers HTML Browser plug-ins Email Input validation [C&S ‘16] [D et al. ‘18]
Some regex examples /.+$/ “Chars” /\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/ IPv4 /^[a-zA-Z0-9]+([._]?[a-zA-Z0-9]+)*$/ Username Super-linear “ReDoS regex”
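As a rough illustration, the IPv4 and username regexes from this slide can be exercised with Python's `re` module. The function names `is_ipv4_like` and `is_username` are hypothetical. The sketch only uses short inputs, because the username regex is the super-linear one highlighted on the slide.

```python
import re

# Regexes copied from the slide.
IPV4 = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")
USERNAME = re.compile(r"^[a-zA-Z0-9]+([._]?[a-zA-Z0-9]+)*$")


def is_ipv4_like(text):
    """True if the whole string has the dotted-quad shape the regex describes."""
    return IPV4.fullmatch(text) is not None


def is_username(text):
    """True if the string is alphanumeric runs joined by single '.' or '_'."""
    return USERNAME.match(text) is not None
```

One thing this makes visible: `is_ipv4_like("999.999.999.999")` returns `True`, since the regex checks the shape of the address and not the 0 to 255 range of each octet.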
How most regex engines work [Spencer’94] What are they? match(regex, string) “Does this regex match this string?” How do they work? • /^a+$/ • Simulate on input “aaa”
Super-linear “ReDoS Regexes” Simple ReDoS regex /^(a+)+$/ NFA Malicious input “aaaaaaaaaa…aa!” Recurrence relation T(n) = 2*T(n-1) = 2*(2*T(n-2)) = ... = O(2^n) Mismatch - backtracking Exponential paths
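The exponential blow-up can be made concrete without timing a real engine. The sketch below is a hand-written backtracking matcher for the single regex /^(a+)+$/ that counts how many times it starts an iteration of the outer group. It is an illustration of the recurrence on the slide, not the implementation of any particular regex engine, and `count_steps` is a hypothetical name.

```python
def count_steps(text):
    """Backtracking match of ^(a+)+$ against text.

    Returns (matched, steps), where steps counts how many times the
    matcher begins one iteration of the outer group (a+).
    """
    steps = 0

    def outer(start):
        nonlocal steps
        steps += 1
        # Find how far a greedy a+ could reach from this position.
        limit = start
        while limit < len(text) and text[limit] == "a":
            limit += 1
        # Try every length for this a+, longest first (greedy with backtracking).
        for end in range(limit, start, -1):
            if end == len(text):
                return True   # $ matches: the whole input was consumed
            if outer(end):    # otherwise try another iteration of the group
                return True
        return False          # every split failed, so backtrack

    return outer(0), steps
```

On a matching input such as `"aaa"` the matcher succeeds on its first attempt. On the malicious shape, `n` copies of `a` followed by `!`, every way of splitting the run of `a` characters is tried before the mismatch is reported, so the step count doubles with each extra `a`: `count_steps("a" * 10 + "!")` gives `(False, 1024)`.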
Regular Expression Denial of Service (ReDoS) [C&W ‘03] /^[a-zA-Z0-9]+([._]?[a-zA-Z0-9]+)*$/ “aaa…aaaa!” “Susie” Malicious input injected [S&P ‘18] [D et al. ‘18]
ReDoS @ CloudFlare – July 2019 [Figure: CPU utilization (all machines), y-axis 0% to 100%] Regex: (?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))
ReDoS @ JavaScript: 10% of npm vulnerabilities SECURITY’18 – Excerpt
Research outline 1. Methodology (ASE’19) 2. ReDoS (ESEC/FSE’18) 3. Re-use (ESEC/FSE’19)
ASE’19 – Excerpt How to build a regex corpus (research excerpt) James Davis Daniel Moyer Ayaan Kazerouni Dongyoon Lee
What’s a regex corpus, and why do I want one? Q: How to extract regexes? • Typical regex practices • Popular and unpopular features • Extent of super-linear regexes • Tool builders, regex engine devs.
How have researchers built regex corpuses? • Programming language • Software Applications, modules • Extraction methodology “Static” and “Dynamic”
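To give a feel for what "static" extraction means, here is a minimal sketch for Python source code using the standard `ast` module. It is an assumption about what such an extractor could look like, not the tooling used in the study. The names `REGEX_FUNCS` and `extract_static_regexes` are hypothetical, and the sketch only recognizes calls written literally as `re.<function>("...")`.

```python
import ast

# Functions of Python's re module whose first argument is a pattern.
REGEX_FUNCS = {"compile", "match", "search", "fullmatch",
               "findall", "finditer", "sub", "subn", "split"}


def extract_static_regexes(source):
    """Return the string-literal patterns passed to re.<func>(...) in source."""
    patterns = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and isinstance(node.func.value, ast.Name)
                and node.func.value.id == "re"
                and node.func.attr in REGEX_FUNCS
                and node.args
                and isinstance(node.args[0], ast.Constant)
                and isinstance(node.args[0].value, str)):
            patterns.append(node.args[0].value)
    return patterns


# Small sample module: two literal patterns and one pattern built at runtime.
SAMPLE = r'''
import re
OCTET = re.compile(r"\d{1,3}")

def check(text, pattern):
    return re.search(pattern, text) or re.match("^a+$", text)
'''
```

Running the extractor on `SAMPLE` finds the two literal patterns but not the one passed in through the `pattern` variable. That gap is the kind of case where a dynamic approach, which observes patterns at runtime, can see regexes that a static one cannot.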
Experimental design • Programming language • Software Important open-source modules • Extraction methodology Let’s try both! Research question: Does it matter which extraction methodology we follow?
Regex collection methodology (125K regexes) Module regex extraction . . . Module selection Static analysis Prog. Instr. “Important” Top 25K
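Program instrumentation, the dynamic side of the pipeline, can be sketched in a few lines for Python by temporarily wrapping `re.compile` and recording each pattern that reaches it. This is only an illustration under stated assumptions, not the instrumentation used in the study. The name `record_regexes` is hypothetical, and the wrapper sees only direct calls to `re.compile`: module-level helpers such as `re.search` compile their patterns through an internal path and are not captured here.

```python
import re
from contextlib import contextmanager


@contextmanager
def record_regexes(log):
    """Temporarily wrap re.compile so every pattern passed to it is logged."""
    original = re.compile

    def wrapper(pattern, flags=0):
        log.append(pattern)            # record the pattern seen at runtime
        return original(pattern, flags)

    re.compile = wrapper
    try:
        yield log
    finally:
        re.compile = original          # always restore the real function
```

A typical use would be to run a module's test suite inside `with record_regexes(seen):` and then inspect `seen`. Unlike reading the source, this captures patterns that are assembled at runtime, but only on the code paths that actually execute.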
Regex metrics • Representation • String length • Features used • NFA size • Language diversity – matching strings • Complexity • DFA size • Super-linear behavior (“ReDoS regexes”) • Use of advanced features
Within a programming language, the two competing regex extraction methodologies yield indistinguishable regex corpuses (So let’s use the easier methodology!)
ESEC/FSE’18 – Distinguished paper The Impact of ReDoS in Practice James Davis Christy Coghlan Francisco Servant Dongyoon Lee
Highlighted contributions • ReDoS regexes are prevalent in the wild (not just one or two: thousands!) • Developers (try to) fix them with 3 techniques
Collecting Regexes Module selection Module regex extraction 45% 35% 565K 125K Giant list of regexes “Can clone” filter 350K 60K 375K (66%) 70K (58%)
Analyzing Regexes 60K 350K 2. Degree 1. ReDoS regexes [R&T ‘14] [WMBW ‘16] [WOHD ‘17] 704 (1%) 3.6K (1%) 13K (3%) 705 (1%)
ReDoS Regexes occur in Prominent Places 1 regex 1 regex 3 regexes 3 regexes 2 regexes
(ReDoS) Regexes Are Hard to Understand [CWS ‘17] /^(\+|-)?(\d+|(\d*\.\d*))? (E|e)?([-+])?(\d+)?$/ /([^\=\*\s]+)(\*)?\s*\=\s* (?:([^;'"\s]+\'[\w]*\’ [^;\s]+)|(?:\"([^"]*)\")| ([^;\s]*))(?:\s*(?:;\s*)|$)/ /^\S+@\S+\.\S+$/ /^(.*?)([.,:;!]+)$/ /<(\/)?([^ ]+?)(?:(\s*\/)| .*?)?>/ /^([\\s\\S]*?)((?:\\.{1,2}|[^\\\\\\/]+?|)(\\.[^.\\/\\\\]*|))(?:[\\\\\\/]*)$/ /^(\\/?|)([\\s\\S]*?)((?:\\.{1,2}|[^\\/]+?|(\\.[^.\\/]*|))(?:[\\/]*)$/ /\+OK.*(<[^>]+>)/ /\s*#?\s*$/ /^\s*/\*\s*(.+?)\s*\*/\s*$/
Methodology Historic Disclosures & Fixes “What do developers prefer when they know all the fix strategies?” Email 284 module maintainers Vulnerability disclosure Fix strategies Study all ReDoS reports in CVE and Snyk.io DBs
Fix Strategies For ReDoS Regexes Original /^\S+@\S+\.\S+$/ Fix strategies Trim TRUNCATE(input, 1000) Revise /^[^@]+@([^\.@]+\.)+$/ Replace* (Custom parser) • Exactly one @, somewhere in the middle of the string • A ‘.’ to the right of the @ • But not immediately to the right Match different strings! cf. [vdM et al. ‘17]
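Two of the strategies on this slide, Trim and Replace, can be sketched in Python as follows. The function names `match_trimmed` and `match_custom` and the constant `MAX_LEN` are hypothetical, and the custom parser is one reading of the three bullet points: "not immediately to the right" is interpreted here as "the character right after the @ is not a dot".

```python
import re

ORIGINAL = re.compile(r"^\S+@\S+\.\S+$")   # the vulnerable regex from the slide
MAX_LEN = 1000                             # trim length shown on the slide


def match_trimmed(text):
    """Trim: bound the input length before applying the original regex."""
    return ORIGINAL.match(text[:MAX_LEN]) is not None


def match_custom(text):
    """Replace: a linear-time custom parser instead of a regex."""
    at = text.find("@")
    # Exactly one '@', and neither the first nor the last character.
    if at <= 0 or at != text.rfind("@") or at == len(text) - 1:
        return False
    # A '.' to the right of the '@', but not immediately to the right.
    return text[at + 1] != "." and "." in text[at + 2:]
```

As the slide warns, the strategies do not accept the same strings. For example, `match_custom` does not check for whitespace the way `\S` does, and `match_trimmed` judges an over-long input by its first 1000 characters only, so each fix is a trade-off between safety and preserving the original behavior.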
Fix strategies and correctness 2 incorrect 1 incorrect “All correct”
ReDoS regexes are a real problem in practice • Regexes are widely used in JavaScript and Python modules • 1% of unique regexes are ReDoS regexes • ReDoS regexes occur in 1-3% of modules • ReDoS regexes are hard to fix
ESEC/FSE’19 The promise and perils of re-using regexes James Davis Mischa Michael Christy Coghlan Francisco Servant Dongyoon Lee
Summary of Contributions • Developer perspectives: Survey of 158 devs • Regex re-use: Corpus of 500K regexes • Portability experiments: Language differences
The regex development process 583K views 1.23 \n1.23 ^\d+(\.\d{1,2})?$
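The regex and the two inputs on this slide can be used to show how anchor semantics change a match result. The sketch below stays within Python and uses the `re.MULTILINE` flag as a stand-in for an engine whose `^` and `$` anchor at line boundaries; whether the slide intends exactly this contrast is an assumption, and the helper name `matches` is hypothetical.

```python
import re

# Regex from the slide: an integer with an optional 1-2 digit fraction.
PATTERN = r"^\d+(\.\d{1,2})?$"


def matches(text, line_anchors=False):
    """Search for PATTERN; optionally let ^ and $ anchor at line boundaries."""
    flags = re.MULTILINE if line_anchors else 0
    return re.search(PATTERN, text, flags) is not None
```

With the default anchors, `matches("1.23")` is `True` and `matches("\n1.23")` is `False`, because `^` can only match at the very start of the string. With `line_anchors=True` the second input is accepted as well, since `^` may now match just after the newline. A regex copied between environments with different anchor rules can therefore silently accept input that it previously rejected.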
Research questions Developer perspectives Do developers re-use regexes? Where do developers re-use regexes from? Do developers believe regexes are portable across languages? Measuring regex re-use How commonly are regexes re-used? (other software ; Internet) Empirical portability Different regex semantics? Different regex performance?
158 Respondents: “We re-use regexes across languages” Methods • IRB-approved survey Respondents The median respondent: • 3-5 years of professional experience • medium-size company • intermediate regex skill Key results (in this paper – see also ASE’19) • 94% of developers re-use regexes • When re-using, developers are rarely confident that the regex comes from the same programming language
Regex collection methodology Module regex extraction Module selection Unique regexes Top 25K
Regex corpus: 193K modules, 580K unique regexes [Figure: unique regexes and number of modules]