Unix Programming Environment
Part 3-4: Regular Expression and Pattern Matching
Prepared by Xu Zhenya (xzy@buaa.edu.cn)
Draft: Xu Zhenya (2002/10/01)
Rev1.0: Xu Zhenya (2002/10/09)

Agenda
• 1. An Introduction to Regular Expression in UNIX
  • BRE & ERE, GNU-RE
• 2. grep, egrep, fgrep
• 3. sed
• Chapter 4 (4.1 & 4.2)

An Introduction to Regular Expression (1)
• UNIX commands combined with REs allow us to perform three tasks:
  • Pattern matching
    • Search for a particular pattern. The RE specifies the pattern.
    • UNIX commands that search: ed, ex, sed, vi, grep, awk
  • Modifying
    • Search for a particular pattern and change it. The RE specifies the pattern and sometimes how to change it.
    • UNIX commands that search and modify: ed, ex, sed, vi
  • Programming
    • awk provides a programming language which can use REs.
    • Perl, Python, Tcl, etc.
    • lex (flex in GNU tools)
• POSIX defines a standard regular expression library, included in libc:
  • regcomp, regexec…
• Three different pattern definitions:
  • The shells: patterns for filename/pathname expansion
  • Simple/Basic (BRE)
    • grep, sed, vi
  • Extended (ERE)
    • egrep, awk, perl, etc.

Meta-characters
• Meta-characters can be divided into three categories:
  • Matching characters: the primary building blocks of REs
  • Grouping and repetition characters
  • Tagging and back-referencing
• Some of the RE meta-characters are also shell meta-characters. For example:
  • $ egrep r* /etc/passwd    # the shell applies filename expansion to r* before egrep sees it
  • => so RE meta-characters should be quoted, e.g. $ egrep 'r*' /etc/passwd
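
The quoting point can be tried out safely in a scratch directory. This is a minimal sketch: the file names passwd.sample and readme and their contents are made up for illustration, and grep -E is used as the POSIX spelling of egrep.

```shell
# Scratch directory with hypothetical sample files.
tmp=$(mktemp -d)
printf '%s\n' 'root:x:0:0' 'alice:x:1000:1000' > "$tmp/passwd.sample"
: > "$tmp/readme"    # an empty file whose name matches the glob r*

# Unquoted: the shell expands r* to "readme" before grep runs,
# so grep searches for the literal text "readme" and counts 0 lines.
unquoted=$(cd "$tmp" && grep -E -c r* passwd.sample || true)

# Quoted: grep receives the RE r* (zero or more r's), which matches every line.
quoted=$(cd "$tmp" && grep -E -c 'r*' passwd.sample || true)

echo "unquoted: $unquoted  quoted: $quoted"
rm -rf "$tmp"
```

With these two sample lines the unquoted form counts 0 matching lines and the quoted form counts 2.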

Matching Characters (2)
• Example REs
  • hello          Match the string hello
  • ...\....       Match any three characters, followed by a ., followed by three more characters
  • ^...\....      Match the same as the previous one, but it must appear at the start of the line
  • ^hello$        Match any line which contains hello ONLY
  • [a-z][^a-z]    Match any two characters where the first one is in a-z and the second isn't
  • \[\]\\         Match the characters '[', ']', '\' in order and contiguous
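
The example REs can be checked against a few lines of text. The sketch below assumes made-up sample lines and a hypothetical helper count_matches; LC_ALL=C is set so that [a-z] means the plain ASCII range.

```shell
# Hypothetical sample lines to try the patterns on.
sample=$(printf '%s\n' 'hello' 'say hello' 'abc.def' 'xabc.defx' 'a1')

# count_matches RE: print how many sample lines match the BRE.
count_matches() {
    printf '%s\n' "$sample" | LC_ALL=C grep -c -e "$1" || true
}

count_matches 'hello'         # 2: "hello" and "say hello"
count_matches '^hello$'       # 1: only the line that is exactly hello
count_matches '...\....'      # 2: "abc.def" and "xabc.defx"
count_matches '^...\....'     # 1: the dot must be the 4th character
count_matches '[a-z][^a-z]'   # 4: every line except "hello"
```

The line hello does not match [a-z][^a-z] because its last letter is followed by the end of the line, not by another character.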

Grouping and Repetition Characters (2)
• Examples
  • OO+    Match two or more Os
  • /bin/(tcsh|bash)$    Match /bin/tcsh or /bin/bash occurring at the end of the line
  • ^[^:]*:[^:]*:[0-9]:
    • the start of a line (^), followed by
    • 0 or more characters which aren't :s ([^:]*), followed by
    • a : (:), followed by
    • 0 or more characters which aren't :s ([^:]*), followed by
    • a : (:), followed by
    • a single digit ([0-9]), followed by
    • a : (:)
  • (\+|-)?[0-9]+
  • [+-]?[0-9]+
    • both match an optionally signed integer (a plus or minus or nothing, followed by one or more digits)
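
A quick way to try these EREs is a small match/no-match report. This is a sketch only: the helpers matches and report and the passwd-style lines are hypothetical, the third pattern is written with a trailing : so that it selects a single-digit third field, and the integer patterns are anchored with ^ and $ so that they test the whole line.

```shell
# matches RE TEXT: succeed if the ERE matches somewhere in TEXT.
matches() {
    printf '%s\n' "$2" | LC_ALL=C grep -E -q -e "$1"
}

# report RE TEXT: print whether the ERE matches TEXT.
report() {
    if matches "$1" "$2"; then
        echo "match:    $1  <-  $2"
    else
        echo "no match: $1  <-  $2"
    fi
}

report 'OO+' 'MOO'
report 'OO+' 'MO'
report '/bin/(tcsh|bash)$' 'alice:x:1000:1000::/home/alice:/bin/bash'
report '/bin/(tcsh|bash)$' 'bob:x:1001:1001::/home/bob:/bin/sh'
report '^[^:]*:[^:]*:[0-9]:' 'root:x:0:0:root:/root:/bin/sh'
report '^[^:]*:[^:]*:[0-9]:' 'alice:x:1000:1000::/home/alice:/bin/bash'
report '^[+-]?[0-9]+$' '-42'
report '^(\+|-)?[0-9]+$' '+7'
report '^[+-]?[0-9]+$' '4-2'
```

Without the anchors, [+-]?[0-9]+ matches any line that contains a digit, which is usually not what is wanted when validating a number.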

Tagging and back-referencing (1)
• \( \) is used to tag (remember) the part of the RE you wish to back-reference.
• The matched text is placed into numbered registers 1, 2, ...
• Access the contents of a register using \N, where N is the number of the register.
• Examples
  • \(hello\) \1    Matches hello hello
  • \([0-9]*\),\([0-9]*\),\([0-9]*\) \3,\2,\1    Matches patterns like
    • 12,55,34 34,55,12
    • 1023,5321,934 934,5321,1023
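
Both back-reference examples can be run through grep, which uses BRE syntax by default. The input lines below are illustrative, and the variable names are hypothetical.

```shell
# \(hello\) \1 : \1 must repeat exactly what group 1 matched.
doubled=$(printf '%s\n' 'hello hello' 'hello world' | grep -c -e '\(hello\) \1' || true)

# Three comma-separated numbers, a space, then the same numbers reversed.
re='\([0-9]*\),\([0-9]*\),\([0-9]*\) \3,\2,\1'
mirrored=$(printf '%s\n' '12,55,34 34,55,12' '12,55,34 12,55,34' \
    '1023,5321,934 934,5321,1023' | grep -e "$re" || true)

echo "$doubled"
echo "$mirrored"
```

Only the first and third input lines of the second test are printed; 12,55,34 12,55,34 repeats the numbers in the same order, so \3,\2,\1 does not match it.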
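
Tagging and back-referencing (2)
$ sed -e 's/\([a-zA-Z]\+\) \([a-zA-Z]\+\)/\2, \1/'
• First \([a-zA-Z]\+\): put "Programming" into memory 1
• Second \([a-zA-Z]\+\): put "UNIX" into memory 2
• \2: extract memory 2
• \1: extract memory 1
• Input:  Programming UNIX
• Output: UNIX, Programming
• Note: \+ in a BRE is a GNU extension; the POSIX spelling is \{1,\}.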
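
The word swap can be wrapped in a small function for experimenting. This is a sketch: the function name swap_words is hypothetical, and the interval \{1,\} is used in place of \+ so that the command does not depend on GNU sed.

```shell
# swap_words: read lines on stdin and swap the first two words of each line.
# \{1,\} is the POSIX BRE spelling of "one or more" (GNU sed also accepts \+).
swap_words() {
    sed -e 's/\([a-zA-Z]\{1,\}\) \([a-zA-Z]\{1,\}\)/\2, \1/'
}

echo 'Programming UNIX' | swap_words    # prints: UNIX, Programming
```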

Conclusion
• 1. A few metacharacters are common, in both representation and meaning, to all three definitions:
  • The backslash (\)
  • The square brackets ([])
  • Two metacharacters within the brackets are also common to all three forms of RE:
    • If the caret (^) is the first character within the brackets, the complement of the given set of characters is meant. (POSIX shell patterns spell this !; many shells also accept ^.)
    • If the hyphen (-) occurs between two characters within the brackets, it indicates a range of characters.
• 2. In addition to BRE, extended REs add:
  • the parentheses (()) for grouping
  • the vertical bar (|) (also called pipe) for alternation
  • the plus sign (+), meaning repeat the preceding item at least once
  • the question mark (?), meaning match the preceding item either zero or one times

Conclusion (2)
• In basic regular expressions the metacharacters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \). (\?, \+, and \| are GNU extensions.)
• Large repetition counts in the {m,n} construct may cause grep to use lots of memory. In addition, certain other obscure regular expressions require exponential time and space, and may cause grep to run out of memory.
• Backreferences are very slow, and may require exponential time.
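
The BRE/ERE difference can be seen by writing the same pattern both ways. A minimal sketch with made-up input lines; the BRE version uses the interval \{1,\} rather than the GNU-only \+.

```shell
# ERE: + and ( ) are metacharacters.
ere=$(printf '%s\n' 'abab' 'a+b' | grep -E -e '^(ab)+$' || true)

# BRE: the same pattern, written with \( \) and the interval \{1,\}.
bre=$(printf '%s\n' 'abab' 'a+b' | grep -e '^\(ab\)\{1,\}$' || true)

# BRE: an unescaped + is an ordinary character, so this finds the literal text a+b.
literal=$(printf '%s\n' 'abab' 'a+b' | grep -e '^a+b$' || true)

echo "ERE: $ere  BRE: $bre  literal +: $literal"
```

The first two commands select abab; the third selects a+b.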

Overview of Text Manipulation Utils
• ed was a very early UNIX line editor. It included a number of commands to manipulate the file being edited.
• Other UNIX commands like sed, ex, and vi were built on ed and use similar commands.
• ex/vi command syntax: [ from_address [ , to_address ] ] command [ parameters ]
• Read the textbook: Appendix A
• Supplementary Reading:
  • An Introduction to Display Editing with Vi, William Joy, Mark Horton
  • VI (and Clones) Editor Reference Manual, Miles O'Neal

2. grep, egrep, fgrep
• grep: BRE
• egrep: ERE
• fgrep: f = fixed string
• Option: -r searches directories recursively
• Read the textbook: p72
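
The three commands differ only in how they read the pattern. The sketch below uses grep, grep -E, and grep -F (the POSIX spellings of egrep and fgrep) on three made-up lines.

```shell
text=$(printf '%s\n' 'a.c' 'abc' 'a+c')

# grep (BRE): . matches any single character, so all three lines match.
bre_count=$(printf '%s\n' "$text" | grep -c -e 'a.c' || true)

# grep -F (fgrep): the pattern is a fixed string, so only the line a.c matches.
fixed_count=$(printf '%s\n' "$text" | grep -F -c -e 'a.c' || true)

# grep -E (egrep): ? and | are metacharacters, \+ is a literal plus sign.
ere_count=$(printf '%s\n' "$text" | grep -E -c -e 'ab?c|a\+c' || true)

echo "BRE: $bre_count  fixed: $fixed_count  ERE: $ere_count"
```

The counts are 3, 1, and 2: the ERE matches abc (through ab?c) and a+c (through a\+c), but not a.c.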

3. sed
• sed(1) is a (s)tream (ed)itor, which manipulates data according to given rules.
• The sed command line syntax is:
  • $ sed [OPTIONS] -e 'INSTRUCTION' [-e 'INSTRUCTION' ..] FILE
  • $ sed [OPTIONS] -f SCRIPT.sed FILE
  • $ cat FILE | sed -f SCRIPT.sed    (with no FILE argument, sed reads standard input)
• Some of the most used options:
  • -f SCRIPT.sed : Read commands from the file SCRIPT.sed.
  • -e 'SED-EXPRESSION' : The expression follows immediately. You can give this option multiple times.
  • -n : Do not print lines unless the p command is used.
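
The three invocation styles and the -n option can be compared side by side. A sketch under stated assumptions: input.txt and script.sed are hypothetical files created in a temporary directory just for the demonstration.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'one' 'two' 'three' > "$tmp/input.txt"
printf '%s\n' 's/one/1/' 's/two/2/' > "$tmp/script.sed"

# Two instructions given with repeated -e options.
with_e=$(sed -e 's/one/1/' -e 's/two/2/' "$tmp/input.txt")

# The same instructions read from a script file.
with_f=$(sed -f "$tmp/script.sed" "$tmp/input.txt")

# No FILE argument: sed reads standard input.
from_stdin=$(cat "$tmp/input.txt" | sed -f "$tmp/script.sed")

# -n suppresses the default output; only lines printed by p appear.
quiet=$(sed -n -e '/t/p' "$tmp/input.txt")

printf '%s\n' "$with_e" '--' "$quiet"
rm -rf "$tmp"
```

All three invocations produce the same text (1, 2, three), and the -n run prints only the lines containing t (two, three).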

sed (2)
• The INSTRUCTION can be in the format /RE/COMMAND
  • /RE/ : the regular expression to search for
  • COMMAND : what to do
    • p = print
    • d = delete
• sed applies the COMMAND to every line that matches the RE. The optional leading g (global: do the COMMAND for all lines, as in g/RE/COMMAND) belongs to ed/ex; sed does not need it.

sed (3)
• Commands and options
  • 1. The delete line command
    • $ sed -e '/this/d' text1.txt
  • 2. The print lines option
    • $ sed -n -e '/this/p' text1.txt
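
The two commands are complements of each other, which a short run makes visible. The sample text below stands in for text1.txt and is made up for illustration.

```shell
text=$(printf '%s\n' 'keep me' 'drop this line' 'this too' 'last')

# /this/d deletes every line containing "this"; all other lines are printed.
deleted=$(printf '%s\n' "$text" | sed -e '/this/d')

# -n plus /this/p prints only the lines containing "this".
printed=$(printf '%s\n' "$text" | sed -n -e '/this/p')

printf '%s\n' "$deleted" '--' "$printed"
```

Without -n, the p command would print each matching line twice: once explicitly and once through sed's default output.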

sed (4)
• 3. The substitution of text in the line
  • The most used sed command is (s)ubstitute, and its syntax is a bit different:
    • [address]s/RE/replacement/[flag]
  • Examples:
    • s/this-word//g
    • s/UPE/UNIX Programming Environment/g
  • The (g)lobal flag causes the regular expression search to continue to the end of the line, so every occurrence on the line is replaced.
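
The effect of the g flag is easiest to see on a line with two occurrences. A small sketch with made-up input lines:

```shell
line='UPE notes: UPE part 3'

# Without g, only the first occurrence on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed -e 's/UPE/UNIX Programming Environment/')

# With g, every occurrence on the line is replaced.
all=$(printf '%s\n' "$line" | sed -e 's/UPE/UNIX Programming Environment/g')

# An empty replacement deletes the matched text.
removed=$(printf '%s\n' 'drop this-word here' | sed -e 's/this-word//g')

printf '%s\n' "$first_only" "$all" "$removed"
```

Note that deleting this-word leaves both surrounding spaces in place.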

sed (5)
• Address in the substitute command
  • The [address] says where the command does its work. It can be
    • a numeric address
    • a special marker, like $, which denotes the last line of the file
    • a regular expression, to restrict the command to certain lines
  • #1. We refer to explicit lines by number here:
    • 1s/BJ/Beijing/g         Do the substitution only at line 1.
    • 10,20s/BJ/Beijing/g     RANGE: from line 10 to line 20.
  • #2. We can mix line numbers with the other address markers:
    • 50,$s/BJ/Beijing/g      From line 50 to the end of the file.
    • 1,/^$/s/BJ/Beijing/g    From line 1 to the next empty line.
  • #3. Or use only regular expressions to delimit the line range:
    • $ sed -e '/BEGIN/,/END/s/variable1/variable2/g' some.code
    • Sample input (some.code):
      BEGIN
      line1;
      line2
      variable1 = variable1 + variableX;
      END
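
Each address form can be tried on a five-line sample. This is a sketch: the sample text, the smaller line numbers, and the helper run are assumptions chosen so that the effect of every address is visible.

```shell
text=$(printf '%s\n' 'BJ 1' 'BJ 2' '' 'BJ 4' 'BJ 5')

# run SCRIPT: apply a sed script to the sample text.
run() {
    printf '%s\n' "$text" | sed -e "$1"
}

run '1s/BJ/Beijing/g'        # line 1 only
run '2,4s/BJ/Beijing/g'      # lines 2 through 4
run '4,$s/BJ/Beijing/g'      # line 4 through the last line
run '1,/^$/s/BJ/Beijing/g'   # line 1 through the first empty line (line 3)

# A range delimited by two REs: only lines from BEGIN through END are changed.
code=$(printf '%s\n' 'variable1 = 0;' 'BEGIN' \
    'variable1 = variable1 + variableX;' 'END' 'variable1 = 9;')
in_block=$(printf '%s\n' "$code" | sed -e '/BEGIN/,/END/s/variable1/variable2/g')
printf '%s\n' "$in_block"
```

In the last command the assignments before BEGIN and after END keep the name variable1.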

sed (6)
• Examples:
  • $ sed -e 's/sweeping \(.*\) of \(.*\) steel/sweeping \2 \1 of steel/g'
    • Input:  sweeping blade of flashing steel
    • Output: sweeping flashing blade of steel
  • $ sed -e 's/sweeping \(.*\) of \(.*\) steel/evil &/g'
    • Input:  sweeping blade of flashing steel
    • Output: evil sweeping blade of flashing steel
  • & : stands for the entire text matched by the RE
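
Both substitutions can be run on the sample line directly. In this sketch the first replacement ends in "of steel", since the word steel is part of the matched text and has to be written back to appear in the output; the variable names are hypothetical.

```shell
line='sweeping blade of flashing steel'

# \1 is "blade", \2 is "flashing"; the replacement writes them back in swapped order.
swapped=$(printf '%s\n' "$line" |
    sed -e 's/sweeping \(.*\) of \(.*\) steel/sweeping \2 \1 of steel/')

# & inserts the entire matched text, here the whole line.
prefixed=$(printf '%s\n' "$line" |
    sed -e 's/sweeping \(.*\) of \(.*\) steel/evil &/')

printf '%s\n' "$swapped" "$prefixed"
```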

Supplementary Readings
• Regular Expression in Unix
  • In two parts. The first part is a simple and incomplete introduction to regular expressions in UNIX. The other is the RE specification excerpted from SUSv2, including the formal grammar for BRE and ERE.
• External Filters, Programs and Commands in Unix, Mendel Cooper