INFO 320 Server Technology I Week 7 Regular expressions INFO 320 week 7
Overview
One of the most powerful tools in UNIX/Linux is the ability to match text against regular expressions. Topics: regular expressions overview, grep, character classes, applications.
Regular expressions overview
Mostly from Regular-Expressions.info and the man pages cited
Regular expressions?
“A regular expression (regex or regexp for short) is a special text string for describing a search pattern.” While developed in UNIX, regular expressions can also be used with little modification in Windows, Perl, PHP, Java, or a .NET language. “Little modification?” Yes: you have to be careful which set of regex rules you're using.
Regular expressions
The down side? They look like complete and utter gibberish:
^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$
The good news? There are zillions of cookbook recipes for common uses of them. And with commands such as grep, ed, and sed, they can be used in scripts.
Fancy wildcards?
The basic idea is that regexes are wildcards on steroids. We saw in bash scripting that a star '*' can substitute for zero or more of any character, and a question mark '?' can substitute for exactly one character. In a regex they don't: both symbols have different meanings here. We'll also refine our use of brackets [ ] to include or exclude any specific one character.
Regex syntax
Within UNIX, there are variations on regex syntax. GNU grep (our main tool) uses GNU Basic Regular Expressions (BRE) syntax. GNU egrep uses GNU Extended Regular Expressions (ERE) syntax. POSIX-compliant systems use POSIX Basic Regular Expressions for grep, and POSIX Extended Regular Expressions for egrep.
BRE (grep) vs ERE (egrep)
In GNU grep, the main difference is that BREs use backslashes to give certain characters (such as ?, +, {, | and parentheses) a special meaning, while EREs use backslashes to take away the special meaning of the same characters. egrep has the same functions as grep; historically it could also be a little faster, but with GNU grep you should not expect a meaningful speed difference. grep -E is the same as egrep.
ed and sed
Similar regex rules are used by grep, ed, and sed. ed is a line-oriented text editor. sed is a stream editor, used to perform basic transformations on an input text stream.
grep
Regular expressions and grep
Regular expressions came into wide use on UNIX in the 1970s through the ed editor and the 'grep' command. The name grep comes from the ed command g/re/p: globally search for a regular expression and print the matching lines. egrep = extended grep. We'll focus on grep. grep matches BREs, which were defined by IEEE Std 1003.1-2001, Section 9.3, Basic Regular Expressions (now dated 2008).
grep syntax
The basic form is: grep -options pattern file
The normal output from grep is a text list of all the lines in the file which matched the pattern. Notice that a word hyphenated across a line break, like 're-member', is not found by a search for 'remember': grep tests one line at a time, so a regex match cannot span multiple lines.
grep options
Like most UNIX commands, grep has many options (see handout), including:
-c shows the count of lines matched, instead of the lines themselves
-i ignores case when matching (!)
-n gives the line number of each line matched
-v gives lines which don't match the pattern(s) as output
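To make the effect of these four options concrete, here is a minimal Python sketch that imitates them over a list of lines. It is not how grep is implemented: the helper name grep_lines, its keyword arguments, and the sample lines are all invented for illustration, and Python's re module follows a Perl-style syntax rather than grep's BRE.

```python
import re

def grep_lines(pattern, lines, count=False, ignore_case=False,
               line_numbers=False, invert=False):
    """Hypothetical helper imitating grep -c / -i / -n / -v."""
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    # Keep matching lines, or with invert=True (like -v) the non-matching ones.
    hits = [(number, line)
            for number, line in enumerate(lines, start=1)
            if bool(regex.search(line)) != invert]
    if count:                      # like -c
        return len(hits)
    if line_numbers:               # like -n
        return [f"{number}:{line}" for number, line in hits]
    return [line for _, line in hits]

# Made-up sample text for the demonstration.
sample = [
    "Four score and seven years ago",
    "our fathers brought forth",
    "on this continent a new nation",
]

print(grep_lines("our", sample, count=True))
print(grep_lines("FOUR", sample, ignore_case=True))
print(grep_lines("nation", sample, line_numbers=True))
print(grep_lines("our", sample, invert=True))
```

Note that 'our' is found inside 'Four' as well: like grep, the search looks for the pattern anywhere in the line unless you anchor it.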
grep options
You can also supply several patterns, each with its own -e option, or read the patterns from a file using the -f option. With the -x option, a line matches only when the whole line matches the pattern.
Search patterns
As a good habit, put the search pattern in quotes. Single quotes are the safer choice, because the shell still expands $ and backslashes inside double quotes. The pattern is a regular expression. If you give an empty pattern, all lines will be matched. So what does grep -c '' filename do?
Metacharacters
Regex metacharacters are characters that have special meaning in this context. We'll look at them in groups. To match exactly one of any character (except newline), use a period '.'; notice a '?' did this in the context of scripting. The star '*' also changes meaning: in a regex it matches zero or more of the preceding item, so the shell wildcard '*' (zero or more of any character) is written '.*'.
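The difference between the shell wildcards and the regex period and star is easy to check by experiment. The sketch below uses Python's re module (a Perl-style flavor, assumed here only because it is convenient to run); the helper name whole_match is made up for illustration.

```python
import re

def whole_match(pattern, text):
    """Hypothetical helper: True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# '.' stands for exactly one character.
print(whole_match(r"c.t", "cat"))      # True
print(whole_match(r"c.t", "ct"))       # False: the dot needs a character

# '*' repeats the preceding token, here the letter b.
print(whole_match(r"ab*c", "ac"))      # True: zero b's
print(whole_match(r"ab*c", "abbbc"))   # True: three b's
print(whole_match(r"ab*c", "axc"))     # False: x is not a b

# '.*' is the regex spelling of the shell wildcard '*'.
print(whole_match(r"c.*t", "carrot"))  # True
```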
Metacharacters
We can anchor a match to the start or end of a line. '^' (the caret) marks the start of the line, as in '^Four'. '$' (dollar) marks the end of the line, as in 'ago$'. Again, these have a different meaning than in scripting.
Metacharacters
We can identify the start or end of a word. '\<' marks the start of a word: '\<eat' would match eats or eating, but not feat. '\>' marks the end of a word: 'ing\>' would match loving but not sings.
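Line anchors and word boundaries can be tried out in Python as a rough sketch. One assumption to flag: Python's re module has no '\<' or '\>'; it uses a single '\b' for both word boundaries, so '\beat' stands in for '\<eat' and 'ing\b' for 'ing\>'. The helper name found and the test strings are made up for illustration.

```python
import re

def found(pattern, text):
    """Hypothetical helper: True if the pattern occurs anywhere in text."""
    return re.search(pattern, text) is not None

# Line anchors.
print(found(r"^Four", "Four score"))      # True: line starts with Four
print(found(r"^Four", "And Four"))        # False: Four is not at the start
print(found(r"ago$", "seven years ago"))  # True: line ends with ago

# Word boundaries (\b here plays the role of \< and \>).
print(found(r"\beat", "she eats"))        # True: eat starts a word
print(found(r"\beat", "a feat"))          # False: eat is inside feat
print(found(r"ing\b", "loving it"))       # True: ing ends a word
print(found(r"ing\b", "he sings"))        # False: ing is followed by s
```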
Character classes
Character classes
With a "character class" (or set) you can tell the regex engine to match only one out of several characters. Simply place the possible characters you want to match between square brackets. If you want to match an a or an e, use [ae]. You could use this in gr[ae]y to match either gray or grey. This is very useful if you do not know whether the document you are searching through is written in American or British English. From http://www.regular-expressions.info/charclass.html
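A quick way to see gr[ae]y at work is to pull every match out of a sentence. This is a small sketch in Python's re module; the sample sentence and variable names are invented for the example.

```python
import re

# Made-up sentence mixing both spellings and one typo.
text = "gray skies, grey skies, and a groy typo"

# [ae] accepts exactly one character, which must be a or e.
hits = re.findall(r"gr[ae]y", text)
print(hits)
```

The typo 'groy' is skipped because o is not in the class.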
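Character classes
The order of the characters inside a character class does not matter: [ae] and [ea] give identical results. The characters don't have to be sequential; [dptjgm583;] is fine. Outside a character class, if you want to match one of the special characters [ \ ^ $ . | ? * + ( ) { } literally, you need to add a backslash before it. Inside a character class most of them lose their special meaning. In Perl-style flavors a backslash still escapes inside a class, so [abc\\\?] matches a, b, c, \ or ?. In a POSIX bracket expression (as used by grep) a backslash inside the brackets is simply a literal backslash.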
Character classes
More generally, '[ ]' matches any one character specified between the brackets, and '[^abc]' matches any one character NOT specified between the brackets. That example matches a single character that is not a, b, or c; it does not mean that the line contains no a, b, or c. Notice the ^ has a very different meaning inside a character class than as a metacharacter on its own.
Character classes
Within character classes, ranges of possible characters can be given. [a-z] means any lower case letter. [a-zA-Z] means any upper or lower case letter. [a-zA-Z0-9] means any letter or digit, and [^a-zA-Z0-9] means any character that isn't a letter or digit.
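Ranges and negation are easiest to understand by listing what each class picks out of the same string. The following is a small Python sketch; the sample string and variable names are arbitrary.

```python
import re

text = "a-B_3!"  # arbitrary sample string

lower = re.findall(r"[a-z]", text)          # lower case letters only
alnum = re.findall(r"[a-zA-Z0-9]", text)    # any letter or digit
other = re.findall(r"[^a-zA-Z0-9]", text)   # anything that is NOT a letter or digit

print(lower)
print(alnum)
print(other)
```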
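Metacharacters
The pipe means logical OR in an expression, here called alternation. abc(def|xyz) matches abcdef or abcxyz. Multiple alternations are allowed: s(i|a|o)ng matches sing, sang, or song. Do not put the pipes inside square brackets: s[i|a|o]ng would also accept a literal | as the middle character, and s[iao]ng already does the job. Notice the parentheses group a string of characters to be treated as one. This is ERE (egrep) syntax; in GNU BRE, write \( \) and \| instead.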
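The difference between alternation inside parentheses and a pipe placed inside square brackets can be shown directly. This is a sketch in Python's re module, whose alternation syntax matches the ERE form; the word list and variable names are made up for the demonstration.

```python
import re

grouped = re.compile(r"s(i|a|o)ng")    # alternation: i OR a OR o
bracketed = re.compile(r"s[i|a|o]ng")  # class: one of i, |, a, o

words = ["sing", "sang", "song", "sung", "s|ng"]

grouped_hits = [w for w in words if grouped.fullmatch(w)]
bracketed_hits = [w for w in words if bracketed.fullmatch(w)]

print(grouped_hits)    # the three intended words
print(bracketed_hits)  # also lets the literal 's|ng' through

# Alternation between longer strings needs the parentheses.
suffix = re.compile(r"abc(def|xyz)")
suffix_hits = [w for w in ["abcdef", "abcxyz", "abcd"] if suffix.fullmatch(w)]
print(suffix_hits)
```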
Bracket expressions
POSIX defines named character classes as abbreviations for common sets. They are only valid inside a bracket expression, so the brackets are doubled when a class stands alone. For example, instead of [a-z] you can use [[:lower:]]; [a-zA-Z] becomes [[:alpha:]]; [a-zA-Z0-9] becomes [[:alnum:]]. What does [A-Fa-f0-9] = [[:xdigit:]] mean? Classes can be combined with other items, so [^x-z[:digit:]] matches a single character that is not x, y, z or a digit [0-9]. From http://www.regular-expressions.info/posixbrackets.html
Optional
The question mark attempts to match the preceding token zero times or once, in effect making it optional. colou?r matches both colour and color. Nov(ember)? will match Nov and November.
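Both examples can be checked with a few whole-string matches. The sketch below uses Python's re module; the helper name whole_match is invented for illustration.

```python
import re

def whole_match(pattern, text):
    """Hypothetical helper: True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# The ? applies to the single preceding token, the letter u.
print(whole_match(r"colou?r", "color"))         # True
print(whole_match(r"colou?r", "colour"))        # True
print(whole_match(r"colou?r", "colouur"))       # False: at most one u

# With parentheses, the ? applies to the whole group.
print(whole_match(r"Nov(ember)?", "Nov"))       # True
print(whole_match(r"Nov(ember)?", "November"))  # True
print(whole_match(r"Nov(ember)?", "Novem"))     # False: all of 'ember' or none
```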
Repetition
The asterisk or star tells the engine to attempt to match the preceding token zero or more times. '<[A-Za-z][A-Za-z0-9]*>' matches an HTML tag without any attributes. The plus tells the engine to attempt to match the preceding token once or more. '<[A-Za-z0-9]+>' will match a tag made of one or more alphanumeric characters.
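The two tag patterns behave differently on edge cases, which a short experiment makes visible. This is a sketch in Python's re module; the HTML snippet, the odd test string, and the variable names are all made up.

```python
import re

TAG = re.compile(r"<[A-Za-z][A-Za-z0-9]*>")  # a letter, then zero or more alphanumerics
LOOSE = re.compile(r"<[A-Za-z0-9]+>")        # one or more alphanumerics

html = "<b>bold</b> <h1>Title</h1> <a href='x'>link</a>"
tag_hits = TAG.findall(html)
print(tag_hits)  # closing tags and the tag with an attribute are skipped

odd = "<1> <> <b>"
strict_hits = TAG.findall(odd)   # must start with a letter
loose_hits = LOOSE.findall(odd)  # a leading digit is accepted
print(strict_hits)
print(loose_hits)
```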
Limiting repetition
As a further refinement, it's possible to specify how many times the preceding token may be repeated, by adding {min,max} instead of a star or plus. Max is infinite if not specified, so * = {0,}, + = {1,}, and ? = {0,1}. But {0,3} would limit the previous token to appear zero to three times.
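The {min,max} limits and their equivalence to the star, plus, and question mark can be verified mechanically. The sketch below uses Python's re module, where the braces are written without backslashes as in ERE; the helper name whole_match and the test strings are invented for illustration.

```python
import re

def whole_match(pattern, text):
    """Hypothetical helper: True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# {0,3}: between zero and three repetitions of the preceding token.
print(whole_match(r"a{0,3}", ""))           # True
print(whole_match(r"a{0,3}", "aaa"))        # True
print(whole_match(r"a{0,3}", "aaaa"))       # False: four is too many

# {2,4} applied to a character class.
print(whole_match(r"[0-9]{2,4}", "7"))      # False: fewer than two digits
print(whole_match(r"[0-9]{2,4}", "2024"))   # True

# The shorthand symbols agree with their brace forms on every sample.
samples = ["", "a", "aa", "aaaa", "b"]
equivalent = all(
    whole_match(short, s) == whole_match(braces, s)
    for short, braces in [(r"a*", r"a{0,}"), (r"a+", r"a{1,}"), (r"a?", r"a{0,1}")]
    for s in samples
)
print(equivalent)
```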
() [] [::]?
So in the context of a regex: Parentheses ( ) are used for grouping, to treat a series of characters as one unit for repetition. Square brackets [ ] define a character class, which matches any one character in that class. Square brackets with colons [: :] name a POSIX character class inside a bracket expression.
?*+{}?
Following any single character, group, character class, or bracket expression: ? makes it optional (repeated zero or one time); + repeats it one or more times; * repeats it zero or more times. Curly brackets { } control repetition by giving min and max limits.
Searching for special characters
Inside a POSIX bracket expression: To match a ], put it as the first character after the opening [ or the negating ^. To match a -, put it right before the closing ]. To match a ^, put it before the final literal - or the closing ]. Put together, []\d^-] matches ], \, d, ^ or -.
Applications
From http://www.regular-expressions.info/examples.html
Ok, now what?
Given this terribly complex set of rules for defining a regular expression … so what? Regexes are very handy for searching for specific terms, or validating inputs. Here we'll review a few cookbook examples.
Trimming Whitespace
A mundane example is to use regular expressions to get rid of spaces at the start and end of lines. Search for ^[ \t]+ and replace with nothing to delete leading whitespace. Search for [ \t]+$ and replace with nothing to trim trailing whitespace. [ \t] matches a space or tab.
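The two search-and-replace steps can be sketched in Python with re.sub. The function name trim is made up for the example; in real Python code str.strip would normally do this job, so the point here is only to show the patterns in action.

```python
import re

def trim(line):
    """Hypothetical helper: delete leading, then trailing, spaces and tabs."""
    line = re.sub(r"^[ \t]+", "", line)   # leading whitespace
    return re.sub(r"[ \t]+$", "", line)   # trailing whitespace

print(repr(trim("  \thello world \t ")))
```

Whitespace between the two words is left alone, because neither pattern can match in the middle of the line.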
Match IP addresses
A simplified version is
\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b
But that will also accept illegal IP addresses with numbers above 255; to fix that, use
\b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b
Ok, matching numbers is tough in a text world.
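Because the strict pattern repeats the same octet expression four times, it is easier to read when assembled from one piece. The Python sketch below does that and compares the two versions; the names OCTET, SIMPLE_IP, STRICT_IP, and the two helper functions are assumptions made for this example.

```python
import re

# One octet: 250-255, 200-249, or 0-199 (with an optional leading 0 or 1).
OCTET = r"(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"

SIMPLE_IP = re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b")
STRICT_IP = re.compile(r"\b" + r"\.".join([OCTET] * 4) + r"\b")

def simple_ip(text):
    """Hypothetical helper: whole text matches the simplified pattern."""
    return SIMPLE_IP.fullmatch(text) is not None

def strict_ip(text):
    """Hypothetical helper: whole text matches the range-checked pattern."""
    return STRICT_IP.fullmatch(text) is not None

for candidate in ["192.168.0.1", "255.255.255.255", "256.1.1.1", "999.1.1.1"]:
    print(candidate, simple_ip(candidate), strict_ip(candidate))
```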
Numbers are challenging
To match a real number:
[-+]?[0-9]*\.?[0-9]+
But if you might need exponential notation:
[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?
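A few test strings show what the two number patterns accept and reject. This is a sketch in Python's re module; the names REAL, SCIENTIFIC, is_real, and is_scientific are invented for the example.

```python
import re

REAL = re.compile(r"[-+]?[0-9]*\.?[0-9]+")
SCIENTIFIC = re.compile(r"[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?")

def is_real(text):
    """Hypothetical helper: whole text is a plain real number."""
    return REAL.fullmatch(text) is not None

def is_scientific(text):
    """Hypothetical helper: whole text is a real number, exponent allowed."""
    return SCIENTIFIC.fullmatch(text) is not None

for candidate in ["42", "3.14", "-.5", "1.", "1e10", "6.02E+23", "abc"]:
    print(candidate, is_real(candidate), is_scientific(candidate))
```

Note that '1.' is rejected by both patterns, because at least one digit must follow the optional decimal point.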
Validate email addresses
If you get a string and want to see if it's an email address, you could try
^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$
What assumption is made here about case?
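Trying the pattern with and without case-insensitive matching shows how much it depends on that setting. The Python sketch below is illustrative only: the names EMAIL and is_email and the sample addresses are made up, and the pattern is a cookbook approximation rather than a full address validator.

```python
import re

EMAIL = r"^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$"

def is_email(text, ignore_case=True):
    """Hypothetical helper: test text against the cookbook email pattern."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.search(EMAIL, text, flags) is not None

print(is_email("USER@EXAMPLE.COM", ignore_case=False))  # True
print(is_email("user@example.com", ignore_case=False))  # False: only A-Z is listed
print(is_email("user@example.com"))                     # True with case ignored
print(is_email("user@example"))                         # False: no dot and suffix
print(is_email("user@example.museum"))                  # False: suffix longer than 4
```

With grep, the -i option plays the role of the ignore_case argument here.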
Validate a date
(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])
Matches a date in yyyy-mm-dd format between 1900-01-01 and 2099-12-31. The separators may be dashes, spaces, slashes, or dots.
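The date pattern checks the shape of the string, not the calendar, which a few test values make clear. This is a sketch in Python's re module; the names DATE and looks_like_date are assumptions for the example.

```python
import re

DATE = re.compile(r"(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])")

def looks_like_date(text):
    """Hypothetical helper: whole text has the yyyy-mm-dd shape."""
    return DATE.fullmatch(text) is not None

print(looks_like_date("2010-02-28"))  # True
print(looks_like_date("2099.12.31"))  # True: dot separators are allowed
print(looks_like_date("1899-12-31"))  # False: year must start with 19 or 20
print(looks_like_date("2010-13-01"))  # False: there is no month 13
print(looks_like_date("2010/02/30"))  # True: the regex does not know February
```

Rejecting impossible dates such as February 30 takes real date arithmetic after the regex has confirmed the format.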
Validate credit cards
To validate a credit card number, you need to know the issuer's format, and first strip out spaces and dashes.
Visa: ^4[0-9]{12}(?:[0-9]{3})?$
All Visa card numbers start with a 4; new cards have 16 digits, old cards have 13.
MasterCard: ^5[1-5][0-9]{14}$
All MasterCard numbers start with the numbers 51 through 55; all have 16 digits.
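The strip-then-match procedure can be sketched in Python as follows. The function name card_type and the sample numbers are made up for illustration, and the check covers only the prefix and length rules stated above; it says nothing about whether a number was actually issued.

```python
import re

VISA = re.compile(r"^4[0-9]{12}(?:[0-9]{3})?$")
MASTERCARD = re.compile(r"^5[1-5][0-9]{14}$")

def card_type(raw):
    """Hypothetical helper: name the issuer format, or None if neither fits."""
    digits = re.sub(r"[ -]", "", raw)   # first strip out spaces and dashes
    if VISA.search(digits):
        return "Visa"
    if MASTERCARD.search(digits):
        return "MasterCard"
    return None

print(card_type("4111 1111 1111 1111"))  # 16 digits starting with 4
print(card_type("5500-0000-0000-0004"))  # 16 digits starting with 55
print(card_type("5600 0000 0000 0004"))  # 56 is outside 51 through 55
```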
References
Regular-Expressions.info: http://www.regular-expressions.info/
Grep man page: http://manpages.ubuntu.com/manpages/jaunty/en/man1/grep.1posix.html
Lots of books are also available on regular expressions.