The awk command
Introduction • Awk is a programming language used for manipulating data and generating reports. The data may come from standard input, from one or more files, or from the output of another process. • Awk can be used at the command line for simple operations, or it can be written into programs for larger applications. • Awk scans a file (or other input) line by line, from the first line to the last, searching for lines that match a specified pattern and performing the selected actions (enclosed in curly braces) on those lines.
The name awk comes from the initials of the last names of the language's authors: Alfred Aho, Peter Weinberger, and Brian Kernighan. • There are a number of versions of awk: old awk, new awk, GNU awk, POSIX awk, and so on. • Awk combines features of several filters, but it has two features that set it apart. • 1. It can identify and manipulate individual fields in a line. • 2. Unlike filters such as grep and sed, it can perform computation. • Further, awk accepts extended regular expressions (EREs) for pattern matching, has C-type programming constructs, and provides several built-in variables and functions.
awk Preliminaries • The awk command follows the general syntax: • awk <options> 'selection_criteria { action }' <file(s)> • Note the use of single quotes and curly braces. • The selection_criteria (a form of addressing) filters the input and selects lines for the action component to act on. The action component is enclosed within curly braces. • The selection_criteria and action together constitute an awk program, which is surrounded by a pair of single quotes. • These programs are often one-liners, though they can span several lines as well. • Ex: to select the directors from the file, the awk command is: • $ awk '/dir./ {print}' emp.lst • 7898 | akash |dir. |mark. | 11/06/70 |9000
Unlike other filters, awk uses a contiguous sequence of spaces and tabs as the default field delimiter. In the example below, the delimiter has been changed to | with the -F option, and a comma separates the field specifiers in the print statement. • $ awk -F"|" '/dir./ {print $2,$3,$4,$6}' emp.lst • akash dir. mark. 9000 • Fields in awk are numbered $1, $2, etc. • Awk also addresses the entire line as $0. • Ex: to display the number of records in the file e.lst: • $ awk '{print $0}' e.lst | wc -l • 6
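The field-splitting behaviour described above can be tried with a short, self-contained sketch. The sample file here is hypothetical: two pipe-delimited records written without padding spaces so that the output is easy to predict.

```shell
# Hypothetical sample data (not the emp.lst file from the slides).
data=$(mktemp)
cat > "$data" <<'EOF'
2233|shukla|g.m|sales|12/12/52|20000
7898|akash|dir.|mark.|11/06/70|9000
EOF

# -F'|' sets the input field separator. The commas in print insert
# the output field separator, which is a single space by default.
out=$(awk -F'|' '{ print $2, $6 }' "$data")
printf '%s\n' "$out"

# $0 is the whole record, unchanged.
whole=$(awk -F'|' 'NR == 1 { print $0 }' "$data")
printf '%s\n' "$whole"

rm -f "$data"
```

Without the commas, `print $2 $6` would concatenate the two fields with nothing between them.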
The action section is represented by the statement { print }, which has the effect of printing all the selected lines. • If the selection_criteria is missing, the action applies to all lines of the file. • If the action is missing, the entire line is printed. • Either the selection_criteria or the action may be omitted (but not both), and whatever remains must be enclosed within a pair of single quotes. • All context patterns have to be enclosed within a pair of slashes (/). • The print statement, if used without any field specifiers, prints the entire line, though you can also use the variable $0 to indicate that explicitly. • Since print is the default action of awk, there is no need to specify it if you want to print the entire line. All three forms are equivalent: • $ awk '/dir/' emp.lst • $ awk '/dir/ {print}' emp.lst • $ awk '/dir/ {print $0}' emp.lst
For pattern matching, awk uses regular expressions of the egrep variety, with the requirement that each expression be bounded on either side by a /. This lets you locate both 'sharma' and 'Sarma': • $ awk -F"|" '/[Ss]h*arma/' e.lst • 9876 | sharma | mgr |product| 12/03/60 |15000 • 8888 | Sarma | dir.| sales | 05/09/60 |25000 • Awk also accepts a line address (a single line or a range) to select lines. • Ex: to select lines 3 to 6 from a file, use the built-in variable NR to specify line numbers: • $ awk -F"|" 'NR==3,NR==6 {print NR, $2, $3,$6}' e.lst • 3 akash dir. 9000 • 4 tiwary g.m 23000 • 5 kumar mgr 1500 • 6 Sarma dir. 25000
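Both selection styles on this slide, a regular expression and an NR range, can be checked with the sketch below. It assumes a made-up four-line sample file rather than the original e.lst.

```shell
# Hypothetical sample data, pipe-delimited without padding spaces.
data=$(mktemp)
cat > "$data" <<'EOF'
2233|shukla|g.m|sales|12/12/52|20000
9876|sharma|mgr|product|12/03/60|15000
7898|akash|dir.|mark.|11/06/70|9000
8888|Sarma|dir.|sales|05/09/60|25000
EOF

# Regular expression: S or s, zero or more h, then "arma".
names=$(awk -F'|' '/[Ss]h*arma/ { print $2 }' "$data")
printf '%s\n' "$names"

# Range pattern: every record from NR==2 through NR==3.
range=$(awk -F'|' 'NR==2,NR==3 { print NR, $2 }' "$data")
printf '%s\n' "$range"

rm -f "$data"
```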
Formatting output with printf • Awk uses the print and printf statements to write to standard output. print produces unformatted output. • Ex: to print all fields except the 4th, we can assign an empty string to the one we don't want: • $ awk -F"|" '{ $4=""; print}' e.lst | head -2 • 2233 shukla g.m 12/12/52 20000 • 9876 sharma mgr 12/03/60 15000 • When placing multiple statements on a single line, use ; as their delimiter. print here is the same as print $0. • With the C-like printf statement, you can use awk as a stream formatter. printf uses a quoted format specifier and a field list. • %s – String • %d – Integer • %f – Floating point number
To produce formatted output from unformatted input, using a regular expression: • $ awk -F"|" '/[sS]h*arma/{ • printf("%-20s %-12s %6d\n",$2,$3,$6) }' e.lst • sharma mgr 15000 • Sarma dir. 25000
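As a minimal sketch of how the width specifiers behave, the example below formats one hypothetical record. The `|` characters in the format string are an addition of this sketch; they are there only to make the padding visible.

```shell
# %-10s left-justifies a string in 10 columns, %-6s in 6 columns,
# and %6d right-justifies an integer in 6 columns.
out=$(printf '9876|sharma|mgr|product|12/03/60|15000\n' |
      awk -F'|' '{ printf "%-10s|%-6s|%6d\n", $2, $3, $6 }')
printf '%s\n' "$out"
```

Unlike print, printf does not add a newline on its own, which is why the format string ends in `\n`.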
The Logical And Relational Operators • To print three fields for the directors and the managers, you can write each pattern and action on a separate line: • $ awk -F"|" '/director/ { printf "%-20s %-12s %d\n", $2,$3,$6} • > /manager/ { printf "%-20s %-12s %d\n", $2,$3,$6}' emp.lst • But repeating the print action on each line can be tedious. Awk also provides the || and && logical operators: • $ awk -F"|" '$3==" mgr " || $3=="dir. "{ • printf("%-20s %-12s %6d\n",$2,$3,$6) }' e1.lst • akash dir. 9000 • kumar mgr 15000
If you want to print only the lines for persons who are neither director nor manager, use the != and && operators: • $ awk -F"|" '$3!=" dir." && $3!=" mgr" { • printf "%-20s %-12s %d\n", $2,$3,$6}' e1.lst • When using the == and != operators for string matching, remember that they compare fixed strings only (including any leading or trailing spaces in the field), not regular expressions. • How to match regular expressions: • Awk offers the ~ and !~ operators to match and negate a match, respectively. • $ awk -F"|" '$3 ~/g.m/ {print}' e1.lst • 2233 | shukla | g.m | sales | 12/12/52 | 20000 • 9876 | sharma |d.g.m|product| 12/03/60 | 15000 • 3456 | tiwary |g.m |product| 05/02/89 |23000
The previous example prints the d.g.m's as well as the g.m's, since the pattern g.m is embedded in the larger string. • To avoid this, use the regular expression characters ^ and $, which, when matched against a field with ~, anchor the pattern to the beginning and the end of that field, respectively. • $ awk -F"|" '$3 ~/^g.m/ {print}' e1.lst • 3456 | tiwary |g.m |product| 05/02/89 |23000
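The difference between an unanchored and an anchored field match can be sketched as follows. The data is a hypothetical, unpadded version of three records. The sketch also escapes the dot (`\.`), which the slide does not do, so that it matches only a literal period.

```shell
# Hypothetical sample data, pipe-delimited without padding spaces.
data=$(mktemp)
cat > "$data" <<'EOF'
2233|shukla|g.m|sales|12/12/52|20000
9876|sharma|d.g.m|product|12/03/60|15000
3456|tiwary|g.m|product|05/02/89|23000
EOF

# Unanchored: matches g.m anywhere in field 3, so d.g.m also qualifies.
loose=$(awk -F'|' '$3 ~ /g.m/ { print $2 }' "$data")
printf '%s\n' "$loose"

# Anchored at both ends with an escaped dot: only the exact text g.m.
exact=$(awk -F'|' '$3 ~ /^g\.m$/ { print $2 }' "$data")
printf '%s\n' "$exact"

rm -f "$data"
```

With padded fields such as those in e1.lst, an end anchor would also have to allow for the trailing spaces.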
The relational and regular expression matching operators used by awk • Operator : Significance • < : less than • <= : less than or equal to • == : equal to • != : not equal to • >= : greater than or equal to • > : greater than • ~ : matches a regular expression • !~ : doesn't match a regular expression
Number Processing • Awk uses the arithmetic operators +, -, *, /, and % (modulus). • It also overcomes one of the major limitations of the shell: its inability to handle decimal numbers. • You can use awk to print a pay-slip for the directors: • $ awk -F"|" '$3~/^dir./ { • > printf "%-20s %-12s,%d %d %d\n", $2,$3,$6,$6*0.4,$6*0.15}' e1.lst • akash dir. ,9000 3600 1350 • While awk has certain built-in variables, like NR and $0, it also permits users to define variables of their own choice. A user-defined variable in awk has a special feature: no type declaration is needed, and it is initialized by default to zero or to the null string, depending on the context in which it is used. Awk identifies the type of a variable from that context.
$ awk -F"|" '$6>=15000 { • > cnt = cnt+1 • > print cnt,$2,$3,$6}' e1.lst • 1 shukla g.m 20000 • 2 sharma d.g.m 15000 • 3 tiwary g.m 23000 • 4 kumar mgr 15000
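A small sketch, under the assumption of a made-up three-record file, shows both points from the slides above: arithmetic with decimal results, and a counter that starts from an uninitialized (zero) variable.

```shell
# Hypothetical sample data, pipe-delimited without padding spaces.
data=$(mktemp)
cat > "$data" <<'EOF'
2233|shukla|g.m|sales|12/12/52|20000
7898|akash|dir.|mark.|11/06/70|9000
9876|sharma|d.g.m|product|12/03/60|15000
EOF

# Decimal arithmetic: %.2f keeps two digits after the decimal point.
slip=$(awk -F'|' '$3 ~ /^dir/ { printf "%s %.2f %.2f\n", $2, $6*0.4, $6*0.15 }' "$data")
printf '%s\n' "$slip"

# cnt is never declared; it behaves as 0 the first time it is used.
counted=$(awk -F'|' '$6 >= 15000 { cnt = cnt + 1; print cnt, $2 }' "$data")
printf '%s\n' "$counted"

rm -f "$data"
```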
THE -f OPTION • Awk offers the -f option to take the program from the file named after the option. • $ cat q1.awk • $6>=15000 { print ++count,$2,$3,$6} • $ awk -F"|" -f q1.awk e1.lst • 1 shukla g.m 20000 • 2 sharma d.g.m 15000 • 3 tiwary g.m 23000 • 4 kumar mgr 15000
THE BEGIN AND END SECTIONS • If you need to print something before processing the first line, for example a heading, use the BEGIN section. Similarly, if you want to print totals after processing is over, do it in the END section. • The BEGIN and END sections are optional, and take the form: BEGIN {action} END {action} • When both are present, they are separated by the body of the awk program. Each uses a pair of curly braces to enclose its action. You can use these two sections to print a suitable heading at the beginning, and the average salary at the end.
$ cat q2.awk • BEGIN { • printf "\n\t\t EMPLOYEE ABSTRACT \n\n" • } • $6>15000 { • # used for comments • count++; • tot+=$6 • printf "%3d%-20s%-12s%d\n", count,$2,$3,$6 • } • END{ • printf "\n\t The average basic pay is %6d\n", tot/count • }
$ awk -F"|" -f q2.awk e1.lst • EMPLOYEE ABSTRACT • 1 shukla g.m 20000 • 2 tiwary g.m 23000 • The average basic pay is 21500
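The same BEGIN, body, END structure can be reduced to a compact sketch. The data below is hypothetical. The check on count in the END section is an addition of this sketch, meant to avoid a division by zero when no record qualifies.

```shell
# Hypothetical sample data, pipe-delimited without padding spaces.
data=$(mktemp)
cat > "$data" <<'EOF'
2233|shukla|g.m|sales|12/12/52|20000
9876|sharma|d.g.m|product|12/03/60|15000
3456|tiwary|g.m|product|05/02/89|23000
EOF

out=$(awk -F'|' '
BEGIN { print "NAME SALARY" }          # runs once, before any input
$6 > 15000 {                           # body: runs for each matching record
    count++
    tot += $6
    print $2, $6
}
END {                                  # runs once, after the last record
    if (count > 0)
        print "average", tot / count
}' "$data")
printf '%s\n' "$out"

rm -f "$data"
```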
Positional Parameters • The program q1.awk could take a more generalized form if the number 15000 were replaced with a variable. • To do that, the entire awk command (not just the program) should be stored in a shell script, and the value supplied as an argument to the script. That argument is then compared with the field in place of the hard-coded number. Such arguments are known as positional parameters, and the shell identifies them as $1, $2, $3, etc., in the order they are presented on the command line. • A positional parameter used in the awk command should be set off by single quotes (which close and reopen the quoted awk program around it), so that the shell expands it and it is not mistaken for an awk field identifier.
$ cat q1.awk • awk -F"|" '$6>='$1' { print $2,$3,$6}' e1.lst • $ q1.awk 15000
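A different way to get a shell positional parameter into awk, instead of the quote-splicing shown above, is the standard -v option, which assigns an awk variable before the program starts. The sketch below assumes a hypothetical shell function named above and a made-up data file. It illustrates an alternative, not the slide's own method.

```shell
# Hypothetical sample data, pipe-delimited without padding spaces.
data=$(mktemp)
cat > "$data" <<'EOF'
2233|shukla|g.m|sales|12/12/52|20000
7898|akash|dir.|mark.|11/06/70|9000
9876|sharma|d.g.m|product|12/03/60|15000
EOF

# "above" is a hypothetical helper. Its $1 is the shell's positional
# parameter; inside the quoted awk program, $6 is an awk field.
above() {
    awk -F'|' -v min="$1" '$6 >= min { print $2, $6 }' "$data"
}

out=$(above 15000)
printf '%s\n' "$out"

rm -f "$data"
```

Because the awk program stays inside one pair of single quotes, there is no risk of confusing the shell's $1 with awk's $1.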
BUILT-IN VARIABLES • Variable : Function • NR : Cumulative number of records read • FS : The input field separator • OFS : The output field separator • NF : Number of fields in the current record • FILENAME : The current input file • ARGC : Number of arguments on the command line • ARGV : The list of arguments
NR stores the record number of the current line. • FS defines the input field separator. This is an alternative to the -F option of the command. When used, it must be set in the BEGIN section so that the body of the program knows its value before it starts processing. • The default output field separator, a single space, can be reassigned using the variable OFS in the BEGIN section. • Ex: • $ awk 'BEGIN {FS="|";OFS="~"} • > $6>15000 {print $1,$2,$3,$6}' e1.lst • 2233 ~ shukla ~ g.m ~ 20000 • 3456 ~ tiwary ~g.m ~23000
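The effect of FS and OFS can be seen in a sketch on a single hypothetical record. It also shows one detail the slide leaves implicit: OFS is used between comma-separated print arguments, and $0 itself is rebuilt with OFS only after a field has been assigned.

```shell
rec='2233|shukla|g.m|sales|12/12/52|20000'

# OFS separates the comma-separated arguments of print.
picked=$(printf '%s\n' "$rec" |
         awk 'BEGIN { FS = "|"; OFS = "~" } { print $1, $2, $6 }')
printf '%s\n' "$picked"

# Assigning to any field (even to itself) rebuilds $0 using OFS.
rebuilt=$(printf '%s\n' "$rec" |
          awk 'BEGIN { FS = "|"; OFS = "~" } { $1 = $1; print }')
printf '%s\n' "$rebuilt"
```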
NF is useful for cleaning up a database of records that don't contain the right number of fields. • Ex: to locate those records that do not have 6 fields, and which have crept in due to faulty data entry: • $ awk 'BEGIN {FS="|"} • > NF!=6 { • > print "record no ",NR," has ",NF, " fields"}' emp.lst • FILENAME stores the name of the current file being processed. By default, awk doesn't print the filename, but you can instruct it to do so: • $ awk -F "|" '$6<15000 {print FILENAME,$0}' e1.lst • e1.lst 7898 | akash |dir. |mark. | 11/06/70 |9000
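The NF check can be exercised with a sketch on a hypothetical file in which the second record is deliberately missing one field.

```shell
# Hypothetical sample data; record 2 has only five fields.
data=$(mktemp)
cat > "$data" <<'EOF'
2233|shukla|g.m|sales|12/12/52|20000
9876|sharma|mgr|product|15000
7898|akash|dir.|mark.|11/06/70|9000
EOF

# Report every record whose field count differs from 6.
bad=$(awk 'BEGIN { FS = "|" }
NF != 6 { print "record no", NR, "has", NF, "fields" }' "$data")
printf '%s\n' "$bad"

rm -f "$data"
```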
When an awk program is run from a script, you can arrange to pass parameters to it. The array ARGV[ ] stores the entire list of arguments, • and the number of such arguments is stored in the variable ARGC. • $ emp.awk 3500 7000 director • Then ARGC takes the value 4, and the array ARGV[ ] is filled with the words of the command line: • ARGV[0] = the command name (in most awk implementations this is awk itself rather than the script name) • ARGV[1] = 3500 • ARGV[2] = 7000 • ARGV[3] = director
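A quick way to inspect ARGC and ARGV is a program that consists only of a BEGIN section, since awk then reads no input and does not try to open the arguments as files. The sketch passes the three values from the slide directly to awk. ARGV[0] is left out of the listing because its exact value depends on the implementation.

```shell
# ARGC counts the command name plus the three operands.
argc=$(awk 'BEGIN { print ARGC }' 3500 7000 director)
printf '%s\n' "$argc"

# List the operands, starting at index 1.
args=$(awk 'BEGIN { for (i = 1; i < ARGC; i++) print i, ARGV[i] }' 3500 7000 director)
printf '%s\n' "$args"
```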
FUNCTIONS • Awk has several built-in functions, performing both arithmetic and string operations. • The parameters are passed to a function in C-style, delimited by commas, and enclosed by a matched pair of parentheses.
Built-in functions in awk • Function : Description • int(x) : Returns the integer value of x • sqrt(x) : Returns the square root of x • index(s1,s2) : Returns the position of the string s2 in the string s1 • length( ) : Returns the length of the argument (of the complete record if no argument is given) • substr(s1,s2,s3) : Returns the portion of the string s1 of length s3, starting from position s2 • split(s,a) : Splits the string s into the array a and returns the number of fields
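The functions in the table can be tried in one BEGIN-only program. The string values are made up for illustration, and split is called here with an optional third argument (the separator) that the table above does not list.

```shell
out=$(awk 'BEGIN {
    print int(7.9)                    # truncates toward zero
    print sqrt(16)
    print index("sharma", "ar")       # 1-based position of "ar"
    print length("tiwary")
    print substr("12/03/60", 7, 2)    # 2 characters from position 7
    n = split("05/02/89", d, "/")     # n is the number of pieces
    print n, d[3]
}')
printf '%s\n' "$out"
```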
Control flow – THE if statement • The condition tested by if must be enclosed in parentheses. • $ awk -F"|" '{ if ($6 >15000) print($2,$6)}' e1.lst • shukla 20000 • tiwary 23000 • $ awk -F"|" '{ if ($6 >15000) commission = 0.15*$6 • > else commission = 0.10 *$6 } {print ($2,$6,commission)}' e1.lst • shukla 20000 3000 • sharma 15000 1500 • akash 9000 900 • tiwary 23000 3450 • kumar 15000 1500
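The if/else logic above can be sketched on a small hypothetical file. Each branch is written on its own line inside the awk program, which avoids any ambiguity about where one statement ends and the next begins.

```shell
# Hypothetical sample data, pipe-delimited without padding spaces.
data=$(mktemp)
cat > "$data" <<'EOF'
2233|shukla|g.m|sales|12/12/52|20000
9876|sharma|d.g.m|product|12/03/60|15000
7898|akash|dir.|mark.|11/06/70|9000
EOF

out=$(awk -F'|' '{
    if ($6 > 15000)
        commission = 0.15 * $6    # higher rate above 15000
    else
        commission = 0.10 * $6    # standard rate otherwise
    print $2, $6, commission
}' "$data")
printf '%s\n' "$out"

rm -f "$data"
```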