Lecture 8: Text processing

karlyn
Presentation Transcript


  1. Lecture 8: Text processing CSE 4251

  2. tools
     I assume you are already familiar with grep, sed, and awk.
     • tr : translate or delete characters
     • nl : number lines (not to be confused with ln)
     • od : dump a file in octal and other formats
     • paste : merge lines of files
     • split : split a file into pieces
     • cut : remove sections from each line of files
     • sort : sort lines of text files
     • head : print the first lines of a file
     • tail : print the last lines of a file

  3. head and tail
     • head -n NUM : print the first NUM lines
     • tail -n NUM : print the last NUM lines
     • tail -f -n NUM file : print the last NUM lines, then keep printing new lines as they are appended to file
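A minimal sketch of the two -n forms above. The ten-line input generated with seq is only an illustration and is not part of the lecture.

```shell
# Generate the numbers 1..10, one per line, as sample input.
first3=$(seq 1 10 | head -n 3)   # first three lines
last2=$(seq 1 10 | tail -n 2)    # last two lines

printf '%s\n' "$first3"
printf '%s\n' "$last2"
```

tail -f is left out of the sketch because it keeps running until it is interrupted.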

  4. less and more
     • both display the contents of a file one screen at a time
     • % less filename
     • % more filename
     • less can scroll backward and has more options than more
     • less commands: move, search
     • q, Q, :q, :Q, ZZ : exit
     • see the other commands with h, H, or man less

  5. tr : translate text
     Learn from examples:
     % echo "a test" | tr t p
     a pesp
     % echo "a test" | tr aest 1234
     1 4234
     % echo "a test" | tr -d t
     a es
     % echo "a test" | tr '[:lower:]' '[:upper:]'
     A TEST
     Change Windows line endings to UNIX line endings:
     % tr -d '\r' < win_file.txt > unix_file.txt

  6. od/hexdump
     Print a file or string in octal, hex, or ASCII format.
     % echo 'a' | od
     0000000 005141
     0000002
     % echo 'a' | od -c
     0000000 a \n
     0000002
     % echo 'abcdefgh' | od -w4 -c
     0000000 a b c d
     0000004 e f g h
     0000010 \n
     0000011
     See the other options in man od.
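The slide shows octal and character output but not hex. As a small sketch, the POSIX options -An (no address column) and -t x1 (one hex byte per unit) print the same two bytes in hexadecimal; the variable name is only illustrative.

```shell
# 'a' is byte 0x61 and the newline is byte 0x0a.
# -An drops the offset column, -t x1 prints each byte as two hex digits.
hex_bytes=$(printf 'a\n' | od -An -t x1 | tr -d ' \n')
echo "$hex_bytes"
```

Reading bytes one at a time with -t x1 also avoids the byte swapping visible in the default two-byte octal output (005141).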

  7. Here document
     • <<SYMB : the command reads its standard input from the lines that follow, up to a line that contains only SYMB
     % cat >save.txt <<EOF
     % od -c <<END >save.txt
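The commands on the slide are shown without the lines that follow them. A complete sketch, with a scratch directory and two made-up input lines as assumptions, could look like this:

```shell
# Work in a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)

# Every line up to the one containing only EOF becomes cat's standard input.
cat > "$workdir/save.txt" <<EOF
first line
second line
EOF

# Count the lines that were written.
saved_lines=$(wc -l < "$workdir/save.txt" | tr -d ' ')
echo "$saved_lines"
```

Quoting the delimiter (<<'EOF') turns off variable and command expansion inside the here document.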

  8. cut
     • remove sections from each line of files (only the selected fields are printed)
     • Options:
       • -d specify the field delimiter (the default is tab)
       • -f select fields (columns)
         • -f1 print the first column
         • -f1-3 print columns 1, 2, and 3
         • -f3- print column 3 and every column after it
         • -f 3,5 print column 3 and column 5
     % cut -d':' -f2- employees
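To make the -f forms concrete, here is a sketch on two made-up colon-separated records. The field layout (id:name:dept:city) is an assumption for illustration, not the lecture's employees file.

```shell
# Hypothetical colon-separated records (id:name:dept:city), for illustration only.
records='1:alice:sales:columbus
2:bob:support:dayton'

names=$(printf '%s\n' "$records" | cut -d':' -f2)      # second field only
rest=$(printf '%s\n' "$records" | cut -d':' -f2-)      # field 2 through the end of the line
picked=$(printf '%s\n' "$records" | cut -d':' -f1,3)   # fields 1 and 3

printf '%s\n' "$names" "$rest" "$picked"
```

When several fields are selected, cut keeps the delimiter between them in the output.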

  9. nl
     • nl : number lines
     • -w width of the number field
     • -s separator string placed between the number and the content
     • -n number format
       • ln left justified
       • rn right justified
       • rz right justified, leading zeros
     • -v starting line number
     • -i increment between line numbers

  10. nl
     % echo -e "a\nb\nc" | nl
     1 a
     2 b
     3 c
     % echo -e 'a\nb\nc' | nl -v 2 -s "----" -w4 -n rz -i 2
     0002----a
     0004----b
     0006----c

  11. sort
     • sort : sort lines of text files
     • -n : compare the sort key numerically
     • -k : specify the key (field) to sort on
     • -r : reverse the order (the default is ascending)
     % sort -k2 -n file
     c 1
     b 2
     a 3
     d 19
     original file:
     a 3
     b 2
     d 19
     c 1
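The effect of -n is easiest to see next to a plain text comparison. This sketch reuses the four lines of the slide's file; the variable names are only illustrative.

```shell
# The same four lines as the original file on the slide.
file_contents='a 3
b 2
d 19
c 1'

lexical=$(printf '%s\n' "$file_contents" | sort -k2)         # key compared as text: 19 sorts before 2
numeric=$(printf '%s\n' "$file_contents" | sort -k2 -n)      # key compared as a number
reversed=$(printf '%s\n' "$file_contents" | sort -k2 -n -r)  # numeric, largest first

printf '%s\n' "$lexical"
```

Without -n the line "d 19" lands between "c 1" and "b 2", because the character 1 sorts before the character 2.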

  12. uniq
     • uniq removes adjacent lines that are identical to each other
     • because only adjacent duplicates are removed, the input to uniq is usually the output of sort
     % sort file | uniq
     1
     3
     4
     file:
     1
     3
     4
     3
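A short sketch of why the sort step matters, using the same four lines as the slide's file. The -c option (prefix each line with its count) is a standard uniq option that the slide does not mention.

```shell
# The same four lines as the file on the slide.
data='1
3
4
3'

unsorted=$(printf '%s\n' "$data" | uniq)          # the two 3s are not adjacent, so both survive
sorted=$(printf '%s\n' "$data" | sort | uniq)     # sorting first makes the duplicates adjacent
# uniq -c pads the count with spaces; awk normalizes it to "count value".
counts=$(printf '%s\n' "$data" | sort | uniq -c | awk '{print $1, $2}')

printf '%s\n' "$counts"
```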

  13. diff and patch
     • diff : compare files line by line
       • -y output in two columns, side by side
       • -r recursively compare any subdirectories found
     • patch : apply a diff file

  14. diff and patch cont.
     You downloaded oldfile, which does not work on your system. After modifying it you have newfile, which works, and you want to share the fix.
     Create a patch:
     % diff -u oldfile newfile >new.patch
     Someone else can apply the patch:
     % patch oldfile <new.patch
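To show what ends up in new.patch, here is a sketch that builds two tiny files and prints only the changed lines of the unified diff. The file contents and the scratch directory are assumptions made for illustration.

```shell
workdir=$(mktemp -d)
printf 'hello\nwrold\n' > "$workdir/oldfile"   # hypothetical broken version
printf 'hello\nworld\n' > "$workdir/newfile"   # hypothetical fixed version

# diff exits with status 1 when the files differ; record it instead of treating it as a failure.
diff_status=0
diff -u "$workdir/oldfile" "$workdir/newfile" > "$workdir/new.patch" || diff_status=$?

# In the patch body, lines starting with - come from oldfile and lines starting with + from newfile.
changes=$(grep '^[-+][a-z]' "$workdir/new.patch")
printf '%s\n' "$changes"
```

The grep pattern skips the two header lines, which start with --- and +++ and name the files being compared.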

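  15. join and paste
     • join : join lines of two files on a common field
     • both files must be sorted on the common field before the join
     • % join -1 2 -2 1 joinA joinB
       (the common field is field 2 of joinA and field 1 of joinB)
     • paste : merge corresponding lines of the files, separated by tabs
     • % paste fileA fileB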
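The slide does not show the contents of joinA and joinB, so the two files below are made up for this sketch: joinA holds "name id" and joinB holds "id department", each already sorted on the id.

```shell
workdir=$(mktemp -d)
printf 'alice 10\nbob 20\ncarol 30\n'    > "$workdir/joinA"   # hypothetical: name id
printf '10 sales\n20 support\n40 legal\n' > "$workdir/joinB"   # hypothetical: id department

# Join field 2 of joinA with field 1 of joinB; lines without a partner are dropped.
joined=$(join -1 2 -2 1 "$workdir/joinA" "$workdir/joinB")

# paste glues line 1 to line 1, line 2 to line 2, and so on, with a tab in between.
pasted=$(paste "$workdir/joinA" "$workdir/joinB" | head -n 1)

printf '%s\n' "$joined" "$pasted"
```

join prints the common field first, followed by the remaining fields of each file, while paste does not look at the content at all.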

  16. split
     • split : split a file into smaller files
     % split -l 20 file
     You will see the files xaa, xab, xac, and so on; each contains 20 lines, except possibly the last one.
     For more advanced splitting, see csplit, which splits a file at lines matching a regular expression.
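A sketch of the same command on a generated 50-line file. The output prefix passed as the last argument is an addition here so that the pieces land in a scratch directory; without it split writes xaa, xab, ... into the current directory.

```shell
workdir=$(mktemp -d)
seq 1 50 > "$workdir/file"                      # hypothetical 50-line input

# Pieces are named <prefix>aa, <prefix>ab, ...; here the prefix is "$workdir/x".
split -l 20 "$workdir/file" "$workdir/x"

first_piece_lines=$(wc -l < "$workdir/xaa" | tr -d ' ')
last_piece_lines=$(wc -l < "$workdir/xac" | tr -d ' ')

# Concatenating the pieces in name order gives back the original file.
if cat "$workdir"/x?? | cmp -s - "$workdir/file"; then
    reassembled=yes
else
    reassembled=no
fi
echo "$first_piece_lines $last_piece_lines $reassembled"
```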

  17. md5sum and sha1sum
     message digest: a cryptographic term. Ideally:
     • it is easy to compute the hash value of any given message
     • it is infeasible to generate a message that has a given hash
     • it is infeasible to modify a message without changing the hash
     • it is infeasible to find two different messages with the same hash
     • MD5 is no longer safe. Practical collisions have also been found for SHA-1, so prefer a SHA-2 digest (for example sha256sum) where collision resistance matters.

  18. md5sum and sha1sum cont.
     % md5sum employees
     b8b5425d0f8ed7278a0b8c47e4b43447 employees
     % md5sum list.txt >md5oflist.txt
     % md5sum -c md5oflist.txt
     list.txt: OK
     • -c : check files against the digests recorded in the given file
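The check becomes more interesting once the file changes. This sketch assumes GNU md5sum and a made-up list.txt in a scratch directory; it records the digest, verifies it, appends a line, and verifies again.

```shell
workdir=$(mktemp -d)
printf 'alpha\nbeta\n' > "$workdir/list.txt"    # hypothetical file contents

# Record the digest, then verify it while the file is still unchanged.
( cd "$workdir" && md5sum list.txt > md5oflist.txt )
before=$(cd "$workdir" && md5sum -c md5oflist.txt)

# Modify the file; md5sum -c now exits with a non-zero status.
printf 'gamma\n' >> "$workdir/list.txt"
if ( cd "$workdir" && md5sum -c md5oflist.txt >/dev/null 2>&1 ); then
    after=unchanged
else
    after=modified
fi
echo "$before / $after"
```

The exit status is what a script would normally test; the printed OK or FAILED line is meant for people.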

  19. Application: clean data files
     • Data from a real project.
     • three files: entity, property, and transaction
       • entity : the entity dictionary
       • property : the property dictionary
       • transaction : line n lists the properties of entity n

  20. File contents
     part of the property file:
     name:61
     speed:80
     capital:40
     part of the entity file:
     head:5
     foot:10
     eat:20
     german:51
     • transaction (at line 51): 61, 40
     • meaning: entity 51 (index 51 is german) has properties 61 and 40 (name and capital)

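  21. Prepare the files
     • prepare the entity file (put the id first, then sort by id)
       % cat col_mapping | tr ':' ' ' | awk '{print $2, $1}' | sort -n -k1 >entity
     • prepare the property file (as above, then keep only positive integer ids and drop lines that repeat the previous id)
       % cat row_mapping | tr ':' ' ' | awk '{print $2, $1}' | sort -n -k1 | sed -n '/^[1-9][0-9]* /p' | awk '{print $2, $1}' | uniq -f1 | awk '{print $2, $1}' >property
     • add line numbers to the transactions and squeeze runs of blanks into one space
       % cat property_transactions | nl -n ln | sed -n 's/[ \t]\+/ /gp' >transactions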
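The real col_mapping file is not shown, so this sketch feeds the entity pipeline four made-up lines that reuse the entity values from the previous slide, in shuffled order.

```shell
# Hypothetical col_mapping contents (name:id), reusing the entity values from the previous slide.
col_mapping='german:51
head:5
eat:20
foot:10'

# Replace ':' with a space, swap the two columns so the id comes first, then sort by id.
entity=$(printf '%s\n' "$col_mapping" | tr ':' ' ' | awk '{print $2, $1}' | sort -n -k1)
printf '%s\n' "$entity"
```

Putting the id in the first column is what lets sort -n order the dictionary by id, so that the entry for entity n can be looked up by number.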

  22. expand and unexpand
     • expand : convert tabs into spaces
     • unexpand : convert spaces into tabs
     % echo -e "a\tb"
     % echo -e "a\tb" | expand
     How to see the difference? Use wc to count the characters:
     % echo -e "a\tb" | wc -c
     % echo -e "a\tb" | expand | wc -c

  23. text format
     • fold : wrap each line to fit in a specified width
       • -s only break lines at white space
       • -w NUM set the line width to NUM
     • fmt : simple optimal text formatter
       • -w NUM set the line width to NUM
     • pr : convert text files for printing; it does not wrap long lines
     % cat article.txt | fmt -w30 | pr -2 -w 60
     (also try it without fmt -w30 and see the difference)
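A sketch of what -s changes for fold, using a made-up sentence and a width of 12 as assumptions.

```shell
text='the quick brown fox jumps over the lazy dog'   # hypothetical input line

hard=$(printf '%s\n' "$text" | fold -w 12)      # breaks exactly every 12 characters, even inside a word
soft=$(printf '%s\n' "$text" | fold -s -w 12)   # breaks only after white space

printf '%s\n' "$hard"
printf '%s\n' "$soft"
```

Without -s the first output line ends in the middle of "brown"; with -s every word stays whole and no line is longer than 12 characters.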

  24. Reference
     • http://www.ibm.com/developerworks/aix/library/au-textprocess.html
     • http://www.ibm.com/developerworks/linux/tutorials/l-gnutex/index.html
