Lecture 8: Text processing

Lecture 8: Text processing CSE 4251

Tools
I assume you are already familiar with grep, sed, and awk. This lecture covers:
• tr : translate or delete characters
• nl : number the lines of a file (not to be confused with ln, which creates links)
• od : dump a file in octal and other formats
• paste : merge lines of files
• split : split a file into pieces
• cut : extract selected sections from each line of files
• sort : sort lines of text files
• head : print the first lines of a file
• tail : print the last lines of a file

Head and tail
• head -n NUM : print the first NUM lines
• tail -n NUM : print the last NUM lines
• tail -f -n NUM file : print the last NUM lines, then keep printing new lines as they are appended to file
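A minimal sketch of combining the two commands to pull out a range of lines. The file name sample.txt and its contents are made up for illustration.

```shell
# sample.txt is a hypothetical 10-line file, one number per line
seq 1 10 > sample.txt

first=$(head -n 3 sample.txt)                # lines 1-3
last=$(tail -n 2 sample.txt)                 # lines 9-10
middle=$(head -n 5 sample.txt | tail -n 3)   # lines 3-5: take the first 5, keep the last 3 of those

printf '%s\n' "$middle"
```

The head | tail pipeline is handy when you only need a quick look at a slice of a large file.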

less and more
• both display the contents of a file one screen at a time
• % less filename
• % more filename
• less can scroll backward and has more options than more
• less commands: movement, search
• q, Q, :q, :Q, ZZ -- exit
• see the other commands with h or H, or with man less

tr -- translate text
Learn from examples:
% echo "a test" | tr t p
a pesp
% echo "a test" | tr aest 1234
1 4234
% echo "a test" | tr -d t
a es
% echo "a test" | tr '[:lower:]' '[:upper:]'
A TEST
Change Windows newlines to UNIX newlines:
% tr -d '\r' < win_file.txt > unix_file.txt
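A few more tr idioms, as a hedged sketch: squeezing repeated characters with -s, using character ranges, and checking that the carriage returns are really gone. The file names win_demo.txt and unix_demo.txt are hypothetical, and printf is used instead of echo -e so the escapes behave the same in any POSIX shell.

```shell
# -s squeezes each run of a repeated character into a single one
squeezed=$(printf 'a    b   c\n' | tr -s ' ')

# character ranges work like [:lower:] / [:upper:] for plain ASCII text
upper=$(printf 'a test\n' | tr 'a-z' 'A-Z')

# simulate a Windows file (CRLF line endings), then strip the carriage returns
printf 'line1\r\nline2\r\n' > win_demo.txt
tr -d '\r' < win_demo.txt > unix_demo.txt

# the byte counts differ by one byte per line
before=$(wc -c < win_demo.txt | tr -d ' ')
after=$(wc -c < unix_demo.txt | tr -d ' ')
echo "$squeezed / $upper / $before -> $after bytes"
```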

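od/hexdump
Print a file or string in octal, hex, or ASCII format. The first column of the output is the byte offset, in octal.
% echo 'a' | od
0000000 005141
0000002
(by default od prints 2-byte words in octal; on a little-endian machine, 005141 is the bytes 'a' (0x61) and '\n' (0x0a) read as one word)
% echo 'a' | od -c
0000000 a \n
0000002
% echo 'abcdefgh' | od -w4 -c
0000000 a b c d
0000004 e f g h
0000010 \n
0000011
See the other options in man od.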
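When the 2-byte word grouping gets in the way, one option is to ask od for one byte at a time. The sketch below uses the standard -A n (no offset column) and -t x1 (hex, one byte per unit) options; the exact spacing of the raw output can differ between od implementations, so the second command strips it.

```shell
# -An drops the offset column, -tx1 prints each byte separately in hex
printf 'a\n' | od -An -tx1

# remove spaces and newlines so the bytes can be compared as one string
hex=$(printf 'a\n' | od -An -tx1 | tr -d ' \n')
echo "$hex"
```

Because each byte is printed on its own, the result does not depend on the byte order of the machine.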

Here document
• <<SYMB : read the following lines as standard input, up to a line containing only SYMB
% cat > save.txt <<EOF
% od -c <<END > save.txt
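A small sketch of a complete here document. The file names expanded.txt and literal.txt and the variable name are made up for illustration. It also shows one detail worth knowing: quoting the delimiter turns off variable expansion inside the body.

```shell
name=world

# unquoted delimiter: $name is expanded by the shell
cat > expanded.txt <<EOF
hello $name
EOF

# quoted delimiter: the body is taken literally
cat > literal.txt <<'EOF'
hello $name
EOF

cat expanded.txt literal.txt
```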

cut
• extract selected sections from each line of files
• Options:
  • -d specify the delimiter (the default is tab)
  • -f select fields (columns)
  • -f1 print the first column
  • -f1-3 print columns 1, 2, and 3
  • -f3- print column 3 and every column after it
  • -f 3,5 print column 3 and column 5
% cut -d':' -f2- employees
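The field options can be tried on a small colon-delimited file. The file employees_demo.txt and its records are invented for this sketch; the layout of the real employees file is not shown in the slides.

```shell
# hypothetical records: id:name:department:salary
printf '%s\n' '101:Alice:Engineering:5000' '102:Bob:Sales:4200' > employees_demo.txt

names=$(cut -d':' -f2 employees_demo.txt)      # column 2 only
rest=$(cut -d':' -f2- employees_demo.txt)      # column 2 through the last column
picked=$(cut -d':' -f1,3 employees_demo.txt)   # columns 1 and 3

printf '%s\n' "$rest"
```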

nl
• nl : number the lines of a file
• -w width of the number field
• -s separator string between the number and the content
• -n number format
  • ln left justified
  • rn right justified
  • rz right justified, with leading zeros
• -v starting line number
• -i increment between line numbers

nl cont.
% echo -e "a\nb\nc" | nl
1 a
2 b
3 c
% echo -e 'a\nb\nc' | nl -v 2 -s "----" -w4 -n rz -i 2
0002----a
0004----b
0006----c

sort
• sort : sort lines of text files
• -n : compare the sort key numerically
• -k : specify the field used as the sort key
• -r : reverse the order (the default is ascending)
% sort -k2 -n file
c 1
b 2
a 3
d 19
original file:
a 3
b 2
d 19
c 1
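To see why -n matters, the sketch below sorts the same four lines three ways. The file name scores_demo.txt is hypothetical, and LC_ALL=C is set on the text sort only to make the character ordering predictable.

```shell
printf '%s\n' 'a 3' 'b 2' 'd 19' 'c 1' > scores_demo.txt

numeric=$(sort -k2 -n scores_demo.txt)          # 1, 2, 3, 19
lexical=$(LC_ALL=C sort -k2 scores_demo.txt)    # compared as text: 1, 19, 2, 3
reversed=$(sort -k2 -n -r scores_demo.txt)      # 19, 3, 2, 1

printf '%s\n' "$lexical"
```

Without -n the key is compared character by character, so 19 sorts before 2.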

uniq
• uniq removes adjacent lines that are identical to each other
• because only adjacent duplicates are removed, the input to uniq is usually the output of sort
% sort file | uniq
1
3
4
file:
1
3
4
3
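The "adjacent" requirement is easy to check with the same four-line input. This is a sketch with a hypothetical file name, nums_demo.txt; the awk step only normalizes the spacing of the uniq -c output, which varies between implementations.

```shell
printf '%s\n' 1 3 4 3 > nums_demo.txt

unsorted=$(uniq nums_demo.txt)          # the two 3s are not adjacent, so nothing is removed
sorted=$(sort nums_demo.txt | uniq)     # after sorting, the duplicate 3 is dropped

# -c prefixes each line with its number of occurrences
counts=$(sort nums_demo.txt | uniq -c | awk '{print $1, $2}')

printf '%s\n' "$counts"
```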

diff and patch
• diff : compare files line by line
  • -y output in two columns, side by side
  • -r recursively compare any subdirectories found
• patch : apply a diff file

diff and patch cont.
You downloaded oldfile, which does not work on your system. After modifying it, you have newfile, which works, and you want to share the fix.
Create a patch:
% diff -u oldfile newfile > new.patch
Someone else can apply the patch:
% patch oldfile < new.patch
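The whole round trip can be tried on two tiny files. This is a sketch under the assumption that patch is installed; the file contents and the name patched_copy are invented, and the patch is applied to a copy so that oldfile stays available for comparison.

```shell
printf '%s\n' 'line one' 'line two' > oldfile
printf '%s\n' 'line one' 'line 2 (fixed)' > newfile

# diff exits with status 1 when the files differ, which is expected here
diff -u oldfile newfile > new.patch || true

# apply the patch to a copy of the old file
cp oldfile patched_copy
patch patched_copy < new.patch

# cmp -s is silent and succeeds only if the two files are identical
if cmp -s patched_copy newfile; then result=identical; else result=different; fi
echo "$result"
```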

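join and paste
• join : join lines of two files on a common field
• both files must be sorted on the common field before joining
• % join -1 2 -2 1 joinA joinB
  (the join field is column 2 of joinA and column 1 of joinB)
• paste : merge corresponding lines of the files, separated by tabs
• % paste fileA fileB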
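A minimal sketch of both commands on two three-line files. The names demoA and demoB and their contents are made up; both files are already sorted on the join field.

```shell
printf '%s\n' 'x 1' 'y 2' 'z 3' > demoA            # key is in column 2
printf '%s\n' '1 red' '2 green' '3 blue' > demoB   # key is in column 1

# join prints the key first, then the remaining fields of each file
joined=$(join -1 2 -2 1 demoA demoB)

# paste ignores keys: it glues line n of demoA to line n of demoB with a tab
pasted=$(paste demoA demoB)

printf '%s\n' "$joined"
```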

split
• split : split a file into smaller files
% split -l 20 file
You will see files xaa, xab, xac, and so on. Each contains 20 lines, except possibly the last.
For more advanced splitting, see csplit, which can split a file at lines matching a regular expression.
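The sketch below splits a 50-line file and checks that the pieces add up to the original. The file name numbers_demo.txt and the output prefix chunk_ are hypothetical; passing a prefix as the last argument replaces the default x.

```shell
seq 1 50 > numbers_demo.txt

# 20 lines per piece: chunk_aa, chunk_ab (20 lines each), chunk_ac (the remaining 10)
split -l 20 numbers_demo.txt chunk_

last_count=$(wc -l < chunk_ac | tr -d ' ')

# concatenating the pieces in order reproduces the original file
if cat chunk_aa chunk_ab chunk_ac | cmp -s - numbers_demo.txt; then roundtrip=ok; else roundtrip=broken; fi
echo "$last_count $roundtrip"
```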

md5sum and sha1sum
Message digest: a cryptographic term. Ideally:
• it is easy to compute the hash value of any given message
• it is infeasible to generate a message that has a given hash
• it is infeasible to modify a message without changing the hash
• it is infeasible to find two different messages with the same hash
• MD5 is no longer safe: collisions are easy to produce. SHA-1 collisions have also been demonstrated (2017), so prefer SHA-256 (sha256sum) where security matters.

md5sum and sha1sum cont.
% md5sum employees
b8b5425d0f8ed7278a0b8c47e4b43447 employees
% md5sum list.txt > md5oflist.txt
% md5sum -c md5oflist.txt
list.txt: OK
• -c : check files against the digests recorded in the given file
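A sketch of what -c reports when a file changes after its digest was recorded. It assumes the GNU coreutils md5sum; the file names and contents are invented for the demonstration.

```shell
printf 'hello\n' > list_demo.txt
md5sum list_demo.txt > md5_of_list_demo.txt

# unchanged file: the check succeeds (exit status 0)
if md5sum -c md5_of_list_demo.txt > /dev/null 2>&1; then before=ok; else before=failed; fi

# modify the file: the recorded digest no longer matches
printf 'tampered\n' >> list_demo.txt
if md5sum -c md5_of_list_demo.txt > /dev/null 2>&1; then after=ok; else after=failed; fi

echo "before: $before, after: $after"
```

The exit status makes -c convenient in scripts; the same pattern works with sha1sum and sha256sum.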

Application: clean data files
• Data from a real project.
• Three files: entity, property, and transaction.
  • entity : the entity dictionary
  • property : the property dictionary
  • transaction : line n lists the properties of entity n

File contents
Part of the property file:
name:61
speed:80
capital:40
Part of the entity file:
head:5
foot:10
eat:20
german:51
• transaction (at line 51): 61, 40
• meaning: entity 51 (german) has properties 61 and 40 (name and capital)

Prepare the files
• prepare the entity file
% cat col_mapping | tr ':' ' ' | awk '{print $2, $1}' | sort -n -k1 > entity
• prepare the property file
% cat row_mapping | tr ':' ' ' | awk '{print $2, $1}' | sort -n -k1 | sed -n '/^[1-9]\+ /p' | awk '{print $2, $1}' | uniq -f1 | awk '{print $2, $1}' > property
• add line numbers to the transactions
% cat property_transactions | nl -n ln | sed -n 's/[ \t]\+/ /gp' > transactions
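To follow what the first pipeline does, it helps to run it on a few lines. The sketch below assumes the mapping file holds name:index lines like the sample entity entries shown earlier; the file names col_mapping_demo and entity_demo are hypothetical, and the input is redirected into tr instead of going through cat.

```shell
# hypothetical mapping file, deliberately out of order
printf '%s\n' 'german:51' 'head:5' 'eat:20' 'foot:10' > col_mapping_demo

# 1. tr replaces the colon with a space:   "german 51"
# 2. awk swaps the two columns:            "51 german"
# 3. sort -n -k1 orders by numeric index
tr ':' ' ' < col_mapping_demo | awk '{print $2, $1}' | sort -n -k1 > entity_demo

cat entity_demo
```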

expand and unexpand
• expand : convert tabs into spaces
• unexpand : convert spaces into tabs
% echo -e "a\tb"
% echo -e "a\tb" | expand
How can you see the difference? Use wc to count the characters:
% echo -e "a\tb" | wc -c
% echo -e "a\tb" | expand | wc -c
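The counts from the wc experiment can be captured and compared, as in this sketch. It uses printf instead of echo -e for portability and assumes the default tab stops of 8 columns.

```shell
# "a<TAB>b\n" is 4 bytes
tabbed=$(printf 'a\tb\n' | wc -c | tr -d ' ')

# expand pads with spaces up to column 8: "a" + 7 spaces + "b" + newline = 10 bytes
spaced=$(printf 'a\tb\n' | expand | wc -c | tr -d ' ')

echo "with tab: $tabbed bytes, expanded: $spaced bytes"
```

Piping either version through od -c is another way to make the tab or the spaces visible.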

text format
• fold : wrap each line to fit in a specified width
  • -s break lines only at white space
  • -w NUM set the line width to NUM
• fmt : simple optimal text formatter
  • -w NUM set the line width to NUM
• pr : paginate or columnate a text file for printing; it does not wrap long lines
% cat article.txt | fmt -w30 | pr -2 -w 60
(also try it without fmt -w30 and see the difference)
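A short sketch comparing the three ways of wrapping one sentence. The sample text is made up; the checks at the end only confirm that no output line exceeds the requested width, since the exact line breaks chosen by fmt can vary between implementations.

```shell
text='the quick brown fox jumps over the lazy dog'

hard=$(printf '%s\n' "$text" | fold -w 10)       # breaks exactly every 10 characters, even mid-word
soft=$(printf '%s\n' "$text" | fold -s -w 10)    # breaks only at white space
filled=$(printf '%s\n' "$text" | fmt -w 20)      # refills the paragraph into lines of at most 20 columns

# list any line longer than the limit (prints nothing if all lines fit)
too_long_fold=$(printf '%s\n' "$soft" | awk 'length($0) > 10')
too_long_fmt=$(printf '%s\n' "$filled" | awk 'length($0) > 20')

printf '%s\n' "$soft"
```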

Reference • http://www.ibm.com/developerworks/aix/library/au-textprocess.html • http://www.ibm.com/developerworks/linux/tutorials/l-gnutex/index.html
