AWK
Tip of the Day • You have a file that you misplaced and want to find it quickly • You want to use the command find
find command example
% find ~ -name "final_exam.txt"
• The command above searches your home directory and all of its subdirectories for the file "final_exam.txt"
• When the file is found, find prints its full path name
Another find command example
% find . -name "*.pro" -ls
• The command above searches for all your IDL files (*.pro) in the current working directory and all subdirectories under it
• For each file found, it prints an ls-style listing of the file
Yet another find command example
% find / -name "*junk*" -exec rm {} \;
• The command above searches the entire directory tree for all files whose names contain the pattern junk
• Each file found is deleted
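Because -exec rm deletes whatever matches, it can help to list the matches first and delete only once the list looks right. The following is a minimal sketch in Bourne shell, run in a scratch directory; the file names and the added -type f test (which restricts the match to regular files) are assumptions for illustration, not part of the original example.

```shell
# Work in a scratch directory so nothing real is touched (hypothetical file names).
scratch=$(mktemp -d)
touch "$scratch/keep.txt" "$scratch/old_junk_1.tmp" "$scratch/junk.log"

# Step 1: list the matches first; -type f restricts the search to regular files.
matches=$(find "$scratch" -type f -name "*junk*" | sort)
echo "$matches"

# Step 2: once the list looks right, run the same search with -exec rm.
find "$scratch" -type f -name "*junk*" -exec rm {} \;

# Only the file without "junk" in its name should remain.
remaining=$(ls "$scratch")
echo "$remaining"
rm -rf "$scratch"
```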
Aho, Weinberger, Kernighan
• AWK, named after its authors' initials, is a text-processing utility that can efficiently process and extract text data with minimal programming
Example #1
• I have a column of numbers (input.dat) that requires conversion, e.g., square root, Centigrade to Fahrenheit, etc.
1
2
…
100
Solutions • You can write an IDL or C program to do this. • Transfer the data over to a spreadsheet • Or write a one line awk program
How Does AWK Work? • AWK is based on the concept of pattern matching • Think of AWK as a filter program • It looks for key "patterns" and processes the records that match them.
Syntax of AWK /pattern/ {action}
Simplest AWK program
% gawk '{print $0}' input.dat
• This simply prints out (echoes) the input file
To take the square root
% gawk '{print sqrt($0)}' input.dat
If you have two columns of data and you want to add them together, line by line
% gawk '{print $1+$2}' input.dat
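Example #1 also mentioned Centigrade to Fahrenheit; the same one-line form handles it. This is a small sketch that pipes three made-up values into awk instead of reading input.dat, and it uses plain awk on the assumption that any POSIX awk (including gawk) behaves the same here.

```shell
# Convert a column of Centigrade values to Fahrenheit: F = C * 9/5 + 32.
# The three input values are made up for illustration.
result=$(printf '0\n37\n100\n' | awk '{ print $1 * 9 / 5 + 32 }')
echo "$result"
```

With a real file, the printf pipe would be replaced by the file name, as in the examples above.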
Meaning of the fields
$0 - represents the entire input line
$1 - represents the first field
$2 - represents the second field
Etc.
NF - number of fields in the current record
NR - number of the current record
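To see NR and NF change from line to line, the sketch below prints both for every record of a tiny made-up input whose last line has only one field. The data and the output wording are assumptions chosen for illustration.

```shell
# Hypothetical input: two lines with two fields, one line with a single field.
data='10 20
30 40
50'

# For each record print its number (NR), its field count (NF), and its first field ($1).
report=$(printf '%s\n' "$data" | awk '{ print NR ": " NF " field(s), first = " $1 }')
echo "$report"
```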
Patterns and Regular Expressions
• In UNIX, metacharacters such as *, ?, \, and many more are used to build patterns
• A pattern is a concise means of specifying a template to match against
• The most common use is the shell's filename patterns (wildcards), which match filenames in a directory structure
• Regular expressions, used by utilities such as grep, sed, and awk, follow the same idea with a somewhat different syntax
• The following examples illustrate filename patterns
Metacharacter Patterns (most often used for filename matching)
• The asterisk character * matches a pattern of zero or more characters
  Example: % ls * [will match all files in the directory]
  Example: % ls *dat [will match files such as "data.dat", "dat", "1dat", "1.dat"]
• The question mark character ? matches a pattern of exactly one character
  Example: % ls ? [will match files such as "0", "1", "a", "z"]
  Example: % ls ?dat [will match files such as "1dat", "bdat"]
More Complex Patterns
• The pattern construct [0-9] or [a-z] matches a single character constrained to the given range (a digit or a lowercase letter, respectively)
  Example: % ls [a-z] [will match files "a", "b", "c", "z"]
  Example: % ls [0-9]dat [will match files "0dat", "5dat", "9dat"]
• The pattern construct {larry,moe,curly} lists explicit entries to match, separated by commas with no spaces
  Example: % ls {larry,moe,curly}.mail [will match the files "larry.mail", "moe.mail", "curly.mail"]
What if I want to match an actual metacharacter?
• To match an actual metacharacter, we need to tell UNIX that the following character should be interpreted literally rather than as a metacharacter
• This is accomplished using the "\" character
• For example, suppose we have a file named "*.dat" in our directory and we want to remove it
• DO NOT GIVE THE COMMAND % rm *.dat (it removes every file ending in .dat)
• INSTEAD % rm \*.dat
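The effect of the backslash can be checked safely in a scratch directory. The sketch below (Bourne shell; the file names a.dat and b.dat are made up) creates a file literally named *.dat next to two ordinary .dat files and shows that rm \*.dat removes only the former.

```shell
scratch=$(mktemp -d)

# Run in a subshell so the caller's working directory is unchanged.
left=$(
  cd "$scratch" || exit 1
  touch '*.dat' a.dat b.dat   # one file is literally named *.dat
  rm \*.dat                   # backslash makes * literal: only that one file is removed
  ls
)
echo "$left"
rm -rf "$scratch"
```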
Pattern Matching in other UNIX Utilities • Although our examples are in the context of file matching, be aware that pattern matching is prevalent in many UNIX utilities such as • vi (text editor) • sed (batch stream text editor) • awk (text processing utility) • shell programming • There will be slight differences between each utility, but the concept is the same • Proficiency with patterns can be gained with the following exercises
Pattern Exercise (Set 1) • In the directory ~rvrpci/public_html/simg726/20012/patterns are several filenames • Given the following patterns, list what filenames are selected (e.g. using ls) and explain why • Each of these patterns will evoke different results and you will need to study each one to understand any subtleties
Pattern Exercise (Set 2)
image.{red,grn,blu} image.* image* image. image[0-9] image?[0-9] image?[a-c] image?[a-z] image.[0-9] image?* image? image.? image.?? image?? image??? image.? image.\? image{1,3,5,7} image{0,1}0
Review of patterns
* - matches any string of zero or more characters
? - matches a single character
[0-9] - matches a single character that is a digit
[A-Z] - matches a single character that is an upper case letter
Matching
/pattern/ - matches any line that contains the pattern
/^pattern/ - matches only when the pattern starts at the beginning of a line
/pattern$/ - matches only when the pattern ends at the end of a line
$1 ~ /pattern/ - true when the first field matches the pattern
$1 !~ /pattern/ - true when the first field does NOT match the pattern
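The differences between these forms are easiest to see on a small input. The sketch below runs four of them over the same four made-up lines; the data and variable names are assumptions for illustration. A pattern with no action prints the whole matching line.

```shell
# Hypothetical four-line input.
data='alpha 1
beta 2
# alpha note
gamma alpha'

starts=$(printf '%s\n' "$data" | awk '/^alpha/')       # line begins with alpha
ends=$(printf '%s\n' "$data" | awk '/alpha$/')         # line ends with alpha
first=$(printf '%s\n' "$data" | awk '$1 ~ /alpha/')    # first field matches alpha
others=$(printf '%s\n' "$data" | awk '$1 !~ /alpha/ { print $1 }')  # first fields that do not match

echo "starts: $starts"
echo "ends:   $ends"
echo "first:  $first"
echo "others:"
echo "$others"
```

Note that the third line contains alpha but is selected by neither /^alpha/ nor $1 ~ /alpha/, because alpha is in its second field.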
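Suppose you had headers at the top of your file which you wanted to ignore
% gawk '/[0-9]/ {print $0}' input.dat
• This prints only the lines that contain a digit, so it works when the header lines contain no digits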
Removing comments
% gawk '$0 !~ /^#/'
• The above works for # at the beginning of a line
% gawk '$0 !~ /^ *#/'
• Better pattern
• Also works when the # is preceded by spaces
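The pattern /^ *#/ allows leading spaces but not leading tabs. If tabs may occur, a bracket expression covering both is one possible extension. This is a sketch on made-up input, not a pattern taken from the slides.

```shell
# Build a hypothetical input with three kinds of comment line and one data line.
tab=$(printf '\t')
data="# header
  # indented with spaces
${tab}# indented with a tab
400.350 0.0509975"

# [ \t]* accepts any run of spaces or tabs before the #.
kept=$(printf '%s\n' "$data" | awk '$0 !~ /^[ \t]*#/')
echo "$kept"
```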
Real Life Problem 1: ASD Spectra Conversion
[Figure: Water Quality Samples. MISI image example at 2000' AGL, 4' pixel. Labels: ID, Chlor, SS, CDOM, B1, P1, P2. Legend: MISI flight area, Boston Whaler, canoe, kayak, pier/bridge, truth panels. Pier Team: radiometer, thermistors, secchi depth, water samples. ASD Truth Panels.]
Conversion of wavelength units from nanometers to microns for a spectral file (water.ref)
400.350 0.0509975
410.170 0.0502359
419.990 0.0474999
…
683.900 0.0215759
693.440 0.0214323
702.980 0.0213168
Conversion AWK script
% gawk '{print $1/1000.0, $2}' water.ref > water.ref.microns
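print writes computed numbers with awk's default output format (%.6g), which keeps six significant digits. That is enough for the wavelengths shown, but if a fixed number of decimals is wanted, printf gives explicit control. The sketch below is one possible variant, applied to two lines from water.ref; the choice of six decimals is an assumption.

```shell
# Two sample lines from water.ref, piped in instead of read from the file.
converted=$(printf '400.350 0.0509975\n702.980 0.0213168\n' |
  awk '{ printf "%.6f %s\n", $1 / 1000.0, $2 }')   # %s passes the reflectance through unchanged
echo "$converted"
```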
What if you have multiple files?
water_0001.ref
water_0002.ref
…
soil_0001.ref
soil_0002.ref
…
cement_1000.ref
How do we repeatedly apply the AWK script?
• We would use the C shell (csh) foreach statement
• The form of the foreach statement
% foreach shell_variable (word_list)
unix_statements
unix_statements
…
unix_statements
end
• The word list is usually given as a filename pattern such as *.ref
Processing only the water files
% foreach i (water*.ref)
foreach? echo "Processing $i"
foreach? gawk '{print $1/1000.0, $2}' $i > $i.microns
foreach? end
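foreach belongs to the C shell. For readers working in a Bourne-type shell (sh, bash), the same loop can be written with for. The sketch below is that equivalent, not the csh form from the slide, and it runs on three made-up files in a scratch directory.

```shell
# Create hypothetical input files in a scratch directory.
scratch=$(mktemp -d)
printf '400.350 0.0509975\n' > "$scratch/water_0001.ref"
printf '410.170 0.0502359\n' > "$scratch/water_0002.ref"
printf '500.000 0.3000000\n' > "$scratch/soil_0001.ref"

# Bourne-shell counterpart of: foreach i (water*.ref) ... end
for i in "$scratch"/water*.ref; do
  echo "Processing $i"
  awk '{ print $1 / 1000.0, $2 }' "$i" > "$i.microns"
done

# Only the water files should have a .microns companion.
made=$(cd "$scratch" && ls *.microns)
first_line=$(cat "$scratch/water_0001.ref.microns")
echo "$made"
rm -rf "$scratch"
```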
Renaming a set of files
• Suppose you had a set of files
water_0001.ref.microns
water_0002.ref.microns
…
water_0100.ref.microns
• You want to rename them back to
water_0001.ref
water_0002.ref
…
We need tools to extract file name components
• Given the sample file water_0001.ref.microns
• Need to extract the file name extension(s)
.ref.microns
.microns
• Need to extract the file name base
water_0001
Shell Filename Modifiers
:h Remove a trailing pathname component, leaving only the head.
:r Remove a trailing suffix of the form .xxx, leaving the basename.
:e Remove all but the trailing suffix.
:t Remove all leading pathname components, leaving the tail.
Sample output of the shell modifiers
% set a=/usr/tmp/water_00001.ref.microns
% echo $a
/usr/tmp/water_00001.ref.microns
% echo $a:h
/usr/tmp
% echo $a:r
/usr/tmp/water_00001.ref
% echo $a:e
microns
% echo $a:t
water_00001.ref.microns
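The :h, :r, :e, and :t modifiers are C shell features. In a Bourne-type shell, roughly the same pieces can be obtained with dirname, basename, and parameter expansion. The sketch below shows one such mapping for the same sample path; the variable names h, r, e, and t are chosen only to mirror the modifiers.

```shell
a=/usr/tmp/water_00001.ref.microns

h=$(dirname "$a")    # like $a:h, the head (directory part)
t=$(basename "$a")   # like $a:t, the tail (file name part)
r=${a%.*}            # like $a:r, strip the last .suffix
e=${a##*.}           # like $a:e, keep only the last suffix

echo "$h"
echo "$r"
echo "$e"
echo "$t"
```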
Renaming the water files
% foreach i (water*.microns)
foreach? echo "Renaming $i to $i:r"
foreach? mv $i $i:r
foreach? end
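For comparison, here is a Bourne-shell sketch of the same renaming loop. It strips the .microns suffix with ${i%.microns} in place of the csh :r modifier, and it operates on two made-up empty files in a scratch directory.

```shell
# Hypothetical files to rename, created in a scratch directory.
scratch=$(mktemp -d)
touch "$scratch/water_0001.ref.microns" "$scratch/water_0002.ref.microns"

for i in "$scratch"/water*.microns; do
  echo "Renaming $i to ${i%.microns}"
  mv "$i" "${i%.microns}"    # ${i%.microns} drops the trailing .microns
done

renamed=$(ls "$scratch")
echo "$renamed"
rm -rf "$scratch"
```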
The foreach statement can extract elements of a shell variable
% set a='0.0 0.1 0.2'
% foreach i ($a)
foreach? echo $i
foreach? end
0.0
0.1
0.2
Real Life Problem 2: MODTRAN Output
• How do you extract a single value out of a 40 page output?
1 ***** MODTRAN 3.5 Version 1.1 Jan 97 *****
0 CARD 1 *****t0 7 2 2 1 0 0 0 0 0 0 1 1 0 0.000 0.00
0 CARD 1B *****T 8F 0 360.000
0 CARD 2 ***** 1 1 0 0 0 0 30.00000 0.00000 0.00000 0.00000 0.31500
0 GNDALT = 0.31500
0 CARD 2C ***** 15 0 0AUG01
MODEL ATMOSPHERE NO. 7 ICLD = 0
MODEL 0 / 7 USER INPUT DATA
0.315 9.842E+02 3.230E+01 7.545E-01 0.000E+00 0.000E+00 ABD2222222 22222
0.554 9.581E+02 2.720E+01 6.765E-01 0.000E+00 0.000E+00 ABD2222222 2
What do we want?
• "H2O" value
Z      P        T       REL H  H2O        CLD AMT    RAIN RATE  AEROSOL
(KM)   (MB)     (K)     (%)    (GM M-3)   (GM M-3)   (MM HR-1)  TYPE   PROFILE
0.315  984.200  305.45  2.20   7.545E-01  0.000E+00  0.000E+00  RURAL  RURAL
0.554  958.100  300.35  2.60   6.765E-01  0.000E+00  0.000E+00  RURAL
…
H2O         O3          CO2         CO          CH4         N2O
( ATM CM )
2.2208E+02  1.3433E-01  2.6589E+02  8.2446E-02  1.1924E+00  2.2553E-01
…
Z      P        T       REL H  H2O        CLD AMT    RAIN RATE  AEROSOL
(KM)   (MB)     (K)     (%)    (GM M-3)   (GM M-3)   (MM HR-1)  TYPE   PROFILE
0.315  984.200  305.45  2.20   7.545E-01  0.000E+00  0.000E+00  RURAL  RURAL
0.554  958.100  300.35  2.60   6.765E-01  0.000E+00  0.000E+00  RURAL
What do we know?
• We know that the value we want sits under a table header line that has "H2O" in its first field.
Z      P        T       REL H  H2O        CLD AMT    RAIN RATE  AEROSOL
(KM)   (MB)     (K)     (%)    (GM M-3)   (GM M-3)   (MM HR-1)  TYPE   PROFILE
0.315  984.200  305.45  2.20   7.545E-01  0.000E+00  0.000E+00  RURAL  RURAL
0.554  958.100  300.35  2.60   6.765E-01  0.000E+00  0.000E+00  RURAL
…
H2O         O3          CO2         CO          CH4         N2O
( ATM CM )
2.2208E+02  1.3433E-01  2.6589E+02  8.2446E-02  1.1924E+00  2.2553E-01
…
Z      P        T       REL H  H2O        CLD AMT    RAIN RATE  AEROSOL
(KM)   (MB)     (K)     (%)    (GM M-3)   (GM M-3)   (MM HR-1)  TYPE   PROFILE
0.315  984.200  305.45  2.20   7.545E-01  0.000E+00  0.000E+00  RURAL  RURAL
0.554  958.100  300.35  2.60   6.765E-01  0.000E+00  0.000E+00  RURAL
Using grep to help analyze pattern
% grep H2O output.tp6
Z P T REL H H2O CLD AMT RAIN RATE AEROSOL
I Z P H2O O3 CO2 CO CH4 N2O O2 NH3 NO NO2 SO2 HNO3
1 J Z H2O O3 CO2 CO CH4 N2O O2 NH3 NO NO2 SO2
H2O O3 CO2 CO CH4 N2O
1 J Z H2O O3 CO2 CO CH4 N2O O2 NH3 NO NO2 SO2
H2O O3 CO2 CO CH4 N2O
Need to Identify Unique Pattern Property
• Several H2O's in the file
• In the desired record, H2O is in the first column
• Need to specify "first column"-only matches
$1 ~ /H2O/
Need to skip to the value and extract the value
• Based on the following pattern
H2O O3 CO2 CO CH4 N2O
( ATM CM )
2.2208E+02 1.3433E-01 2.6589E+02 8.2446E-02 1.1924E+00 2.2553E-01
• We need to "skip" to the third line and get the first field
• This can be accomplished with the getline command, which reads the next input line into $0
Putting it all together
gawk '$1 ~ /H2O/ { getline; getline; getline; \
print ($1*18.015/22413.83) }' input_modtran.dat
• The action is a unit conversion of the water vapor value
print ($1*18.015/22413.83)
Can be made into a shell script (get_water_vapor.csh)
#!/bin/csh
gawk '$1 ~ /H2O/ { getline; getline; getline; \
print ($1*18.015/22413.83) }' $1
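The getline logic can be tried on a few lines before running it on a full MODTRAN output. The sketch below uses a made-up, shortened excerpt and assumes that a units line and one blank line separate the H2O header from the values, so that three getline calls land on the value line. The real file layout may differ, and the number of getline calls should be adjusted to match it.

```shell
# Hypothetical excerpt. ASSUMPTION: header, units line, blank line, then values.
sample='Z P T REL H H2O
0.315 984.200 305.45 2.20 7.545E-01
H2O O3 CO2
( ATM CM )

2.2208E+02 1.3433E-01 2.6589E+02'

# Match the line whose FIRST field is H2O, advance three lines, convert the first field.
vapor=$(printf '%s\n' "$sample" |
  awk '$1 ~ /H2O/ { getline; getline; getline; printf "%.4f\n", $1 * 18.015 / 22413.83 }')
echo "$vapor"
```

The first line of the excerpt also contains H2O, but in its fifth field, so the first-field test skips it.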
From within IDL
IDL> spawn, 'get_water_vapor.csh input.dat', results
What is this file?
400.350 0.0509975
410.170 0.0502359
419.990 0.0474999
…
683.900 0.0215759
693.440 0.0214323
702.980 0.0213168
Commented File
# Water reflectance data file
# ASD Reflectance May 20, 1999 11:31 PM
# Local Time
# Wavelength [Nanometers] Reflectance
# [unitless]
400.350 0.0509975
410.170 0.0502359
419.990 0.0474999
…
702.980 0.0213168