Unix Lecture 6 Hana Filip LIN 6932
HW6 - Part II
• Solutions are posted on my website (see syllabus)
Text Processing Command Line Utility Programs
sed wc awk comm cut ex iconv join paste sort tr uniq xargs
TextPro Lexicon File
Lexicon file “core.text”
Background: TextPro
• An information extraction system used at SRI International, Menlo Park, CA
• Developed by Doug Appelt
Copy “machen.txt” into your account
> cd ..
> cd c6932aab
> ls
… machen.txt …
> cp machen.txt ~c6932aad
> cd
> ls
… machen.txt …
Text Processing Command Line Utility Programs
tr: translate or delete characters
Example 1: delete (-d) all the newline characters from “machen.txt” and redirect the output to a file named “machen-cont.txt”.
% cat machen.txt | tr -d "\n" > machen-cont.txt
Example 2: delete (-d) all characters from “machen.txt” except for (-c, complement) alphabetic characters, newlines, and spaces, and redirect the output to a file named “machen-alpha.txt”.
% cat machen.txt | tr -c -d "[:alpha:]\n " > machen-alpha.txt
Try also (there is no space in the set, so spaces are deleted as well):
% cat machen.txt | tr -c -d "[:alpha:]\n" > machen-alpha.txt
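To make the difference between the three tr commands visible, here is a minimal sketch that runs them on a two-line sample. The sample text and the variable names are assumptions for illustration; they stand in for “machen.txt”, whose contents are not shown in the slides.

```shell
# Hypothetical two-line sample standing in for machen.txt.
tmpdir=$(mktemp -d)
printf 'Was machen Sie?\nIch mache Kuchen!\n' > "$tmpdir/sample.txt"

# Example 1: delete every newline, so the text becomes one long line.
joined=$(tr -d '\n' < "$tmpdir/sample.txt")

# Example 2: keep only letters, newlines, and spaces (punctuation is deleted).
alpha=$(tr -c -d '[:alpha:]\n ' < "$tmpdir/sample.txt")

# "Try also" variant: spaces are not in the kept set, so words run together.
nospace=$(tr -c -d '[:alpha:]\n' < "$tmpdir/sample.txt")

printf '%s\n' "$joined" "$alpha" "$nospace"
rm -r "$tmpdir"
```

The first result is a single line, the second keeps the line and word boundaries but loses “?” and “!”, and the third also loses the spaces between words.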
Text Processing Command Line Utility Programs
tr can be used to make a wordlist from a text. This can be done by replacing every space with a newline (\012 is the octal code for the newline character, so the two commands are equivalent):
% cat machen.txt | tr " " "\n" | less
% cat machen.txt | tr " " "\012" | less
We can combine the command above with the delete functionality of tr to make a wordlist without unwanted characters:
% cat machen.txt | tr " " "\n" | tr -c -d "[:alpha:]\n" > lex
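The wordlist pipeline can be tried on a single sentence before running it on a whole file. This is only a sketch; the sample sentence and the variable names are made up for illustration.

```shell
# Hypothetical sample sentence; any text would do.
text='Wir machen das. Was machen Sie?'

# Step 1: put each word on its own line.
# Step 2: delete everything that is not a letter or a newline.
words=$(printf '%s\n' "$text" | tr ' ' '\n' | tr -c -d '[:alpha:]\n')

printf '%s\n' "$words"
```

The result has one word per line with the punctuation removed. Note that two spaces in a row in the input would leave an empty line in the list.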
Text Processing Command Line Utility Programs
sort prints the lines of its input, or of the concatenation of all files listed in its argument list, in sorted order. (The -r flag reverses the sort order.)
% sort -r movie_characters
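A small sketch of sort with and without -r. The contents of “movie_characters” are not given in the slides, so the three names below are invented; LC_ALL=C is set only to make the ordering the same on every system.

```shell
# Hypothetical contents for the movie_characters file.
tmpdir=$(mktemp -d)
printf 'Rick\nIlsa\nVictor\n' > "$tmpdir/movie_characters"

# Default order (ascending) and reversed order (-r).
ascending=$(LC_ALL=C sort "$tmpdir/movie_characters")
descending=$(LC_ALL=C sort -r "$tmpdir/movie_characters")

printf '%s\n' "$ascending" '--' "$descending"
rm -r "$tmpdir"
```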
Text Processing Command Line Utility Programs
uniq takes a text file and outputs the file with adjacent identical lines collapsed to one
• it is a kind of filter program
• typically it is used after sort, because only adjacent duplicates are collapsed
% cat machen.txt | tr " " "\n" | tr -c -d "[:alpha:]\n" | sort | uniq > lex
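Why uniq usually follows sort can be seen in a short sketch. The word list below is invented for illustration.

```shell
# Hypothetical word list in which "machen" occurs three times, not all adjacent.
words=$(printf 'machen\ndas\nmachen\nmachen\n')

# Without sort, only the two adjacent copies at the end are collapsed.
unsorted=$(printf '%s\n' "$words" | uniq)

# With sort first, all copies become adjacent and collapse to one line.
sorted=$(printf '%s\n' "$words" | LC_ALL=C sort | uniq)

printf '%s\n' "$unsorted" '--' "$sorted"
```

The unsorted run still contains “machen” twice; the sorted run contains each word once.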
Text Processing Command Line Utility Programs
sed = stream editor
• a special editor for automatically modifying files
• a find-and-replace program: it reads text from standard input and writes the result to standard output (normally the screen)
• the search pattern is a regular expression, essentially the same as a grep regular expression (see references)
• often used in a program to make changes in a file
Text Processing Command Line Utility Programs
sed: simple example 1
% sed 's/United States/USA/' < usa-gaz.text > new-usa-gaz.text
s   substitute command
/../../   delimiter
United States   regular expression pattern string
USA   replacement string
< old_file > new_file
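The substitute command can be tested on one line of input before it is applied to a file. In this sketch the input line is made up; it also shows that, without the g flag, s replaces only the first match on each line.

```shell
# Hypothetical input line with two occurrences of the pattern.
line='United States and United States'

# Without g: only the first match on the line is replaced.
first=$(printf '%s\n' "$line" | sed 's/United States/USA/')

# With g: every match on the line is replaced.
all=$(printf '%s\n' "$line" | sed 's/United States/USA/g')

printf '%s\n' "$first" "$all"
```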
Text Processing Command Line Utility Programs
sed: simple example 2
% sed 's/\(United\) \(States\)/\2 \1/' < usa-gaz.text > usa-switch-gaz.text
switch two words around
\(   start of a remembered group
\)   end of a remembered group
/../../   delimiter
\1   register 1 (the text matched by the first group)
\2   register 2 (the text matched by the second group)
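A one-line sketch of the register mechanism, run on a literal string instead of “usa-gaz.text”. The space between the two groups has to appear in the pattern for “United States” to match, and it is written again in the replacement so the swapped words stay separated.

```shell
# Swap the two remembered groups: \1 is "United", \2 is "States".
swapped=$(printf 'United States\n' | sed 's/\(United\) \(States\)/\2 \1/')

printf '%s\n' "$swapped"   # prints "States United"
```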
Text Processing Command Line Utility Programs
Multiple sed commands may also be stored in a script file, one command per line. The "-f" option is used on the command line to read the commands from the script:
% sed -f sedscript.sed [file]
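As a sketch of what such a script file might look like, the example below writes two substitute commands into “sedscript.sed” and applies them with -f. The second command (United Kingdom to UK) and the input lines are invented for illustration.

```shell
# Create a hypothetical sed script with one command per line.
tmpdir=$(mktemp -d)
cat > "$tmpdir/sedscript.sed" <<'EOF'
s/United States/USA/
s/United Kingdom/UK/
EOF

# Apply every command in the script to each input line.
result=$(printf 'United States\nUnited Kingdom\n' | sed -f "$tmpdir/sedscript.sed")

printf '%s\n' "$result"
rm -r "$tmpdir"
```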
Text Processing Command Line Utility Programs
Put "LexEntry: " in front of every line of lex and " ; ." after it:
% sed 's/^/LexEntry: /g;s/$/ ; ./' lex > newlex
^   matches the beginning of the line
$   matches the end of the line
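The effect of the two anchored substitutions can be checked on a two-word list. This is a sketch with invented input; the g flag is left out because ^ and $ each match only once per line.

```shell
# Hypothetical two-line word list standing in for the file lex.
entries=$(printf 'das\nmachen\n' | sed 's/^/LexEntry: /;s/$/ ; ./')

printf '%s\n' "$entries"
# prints:
# LexEntry: das ; .
# LexEntry: machen ; .
```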
Text Processing Command Line Utility Programs & shell script
    #!/usr/local/bin/tcsh
    # usage: make_lex filename1
    #        make_lex filename1 filename2 ...

    # first, make sure the user typed in at least one argument
    if ( $#argv < 1 ) then
        echo "This script needs at least 1 argument."
        echo "Exiting...(annoyed)"
        # exit codes only go up to 255, so use 1 rather than 666
        exit 1
    endif

    # build one sorted word list from all the files named on the command line
    # (looping over the files would overwrite mylex and newlex on every pass)
    cat $argv | tr " " "\n" | tr -c -d "[:alpha:]\n" | sort | uniq > mylex
    # wrap every word as a lexicon entry
    sed 's/^/LexEntry: /;s/$/ ; ./' mylex > newlex
    echo "Your new lexical file is called 'newlex'."
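For comparison, the same pipeline can be sketched as a POSIX sh function. This is not the script from the lecture, only an illustration under several assumptions: the function name make_lex is borrowed from the usage comment, the result is written to standard output instead of to “newlex”, all named files are combined into one word list, and empty lines are dropped before the entries are built. The two sample files are invented.

```shell
# make_lex: hypothetical sh version of the lexicon pipeline.
make_lex() {
    # Make sure the caller gave at least one file name.
    if [ "$#" -lt 1 ]; then
        echo "This script needs at least 1 argument." >&2
        return 1
    fi
    # Concatenate all files, split into words, keep letters only,
    # sort, remove duplicates, drop empty lines, wrap each word.
    cat "$@" | tr ' ' '\n' | tr -c -d '[:alpha:]\n' |
        LC_ALL=C sort | uniq |
        sed '/^$/d;s/^/LexEntry: /;s/$/ ; ./'
}

# Usage with two made-up input files.
tmpdir=$(mktemp -d)
printf 'Wir machen das.\n' > "$tmpdir/a.txt"
printf 'Was machen Sie?\n' > "$tmpdir/b.txt"
newlex=$(make_lex "$tmpdir/a.txt" "$tmpdir/b.txt")
printf '%s\n' "$newlex"
rm -r "$tmpdir"
```

With LC_ALL=C the capitalized words sort before the lowercase ones, and “machen”, which occurs in both files, appears only once.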