Working with Command-Line Tools
Danielle Cunniff Plumer
School of Information, The University of Texas at Austin
Summer 2014
Download the dataset
• We will be working with a smallish (34M) dataset consisting of US Trademark Application Images from the USPTO. We will only be working with images from January 4, 2008. The data is made available by PublicResource.org.
• Wget
  • GNU Wget is a free utility for non-interactive download of files from the Web. It supports HTTP, HTTPS, and FTP protocols, as well as retrieval through HTTP proxies.
  • Because we are downloading only a single file, you do not need to specify any options.
• Open a terminal
    bcadmin@ubuntu:~$ cd Downloads/
    bcadmin@ubuntu:~/Downloads$ wget https://bulk.resource.org/trademark/USTrademarkImages/hr080104.zip
Run a checksum on the zip file
• md5sum
  • Print or check MD5 (128-bit) checksums. With no FILE, or when FILE is -, read standard input.
• In terminal
  • Make sure you are in the Downloads directory or other directory containing the zip file
    $ md5sum hr080104.zip
• Redirect the output to a file
  • Syntax: command and arguments followed by > and the name of the file for output.
• In terminal
    $ md5sum hr080104.zip > hr080104zip_md5sum.txt
    $ less hr080104zip_md5sum.txt
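A saved checksum file can also be used later to re-verify the download. The following is a minimal sketch, assuming GNU md5sum as shipped with Ubuntu. It works on a throwaway file in a temporary directory; sample.bin and sample_md5sum.txt are hypothetical stand-ins for hr080104.zip and its checksum file.

```shell
#!/usr/bin/env bash
# Sketch: record a checksum, then verify the file against it later.
# sample.bin is a hypothetical stand-in for hr080104.zip.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'example data\n' > sample.bin

# Record the checksum (same redirect pattern as on the slide).
md5sum sample.bin > sample_md5sum.txt

# -c recomputes the checksum of each listed file and compares it
# with the recorded value. Exit status 0 means everything matched.
verify_ok=0
md5sum -c sample_md5sum.txt || verify_ok=$?

# Simulate corruption by appending one byte, then check again.
printf 'x' >> sample.bin
verify_bad=0
md5sum -c sample_md5sum.txt || verify_bad=$?

echo "status before change: $verify_ok, after change: $verify_bad"
```

For the real dataset, the equivalent check would be md5sum -c hr080104zip_md5sum.txt, run from the directory that contains the zip file.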
Unzip the file using unzip
• Unzip
  • unzip will list, test, or extract files from a ZIP archive, commonly found on MS-DOS systems. The default behavior (with no options) is to extract into the current directory (and subdirectories below it) all files from the specified ZIP archive.
  • Option: -d will extract into a directory (the directory does not need to exist)
• In terminal
    $ unzip hr080104.zip -d hr080104
• tar is an alternative to unzip for other archive formats, and more powerful in general, but it doesn’t work for zip files. See man tar for details.
Inspect the files
• Install tree
  • Tree is a recursive directory listing program that produces a depth-indented listing of files
    $ sudo apt-get install tree
• Look at the files in the unzipped directory
    $ tree hr080104
Tree options
• Options
    $ man tree
  • -a Includes hidden files (those beginning with a dot ‘.’).
  • -f Prints the full path prefix for each file.
  • -i Makes tree not print the indentation lines, useful when used in conjunction with the -f option.
  • -p Print the file type and permissions for each file (as per ls -l).
  • -s Print the size of each file in bytes along with the name.
  • -h Print the size of each file but in a more human readable way.
  • -D Print the date of the last modification time for the file listed.
  • -o filename Send output to filename.
  • -r Sort the output in reverse alphabetic order.
  • -t Sort the output by last modification time instead of alphabetically.
• Look at the files again
    $ tree -afihD hr080104 -o hr080104.txt
    $ less hr080104.txt
Make a copy of a few files to play with
    $ mkdir temp
    $ cp hr080104/773621/77362188/* temp
    $ cd temp
    $ ls
• Remember that you can use the Ubuntu autocomplete options to help avoid typing mistakes
  • tab will complete the name of a directory or a file after you’ve typed the first few characters, starting in the directory you’re currently in.
  • tab tab will show you what files match the characters you’ve entered so far
  • The up and down arrows will let you go back to commands you’ve previously entered.
Corrupt a file
• Calculate a checksum on the .XML file
    $ md5sum 00000001.XML > md5sum.txt
• Open the file (for simplicity, we’ll use gedit). Be sure to enter the file name correctly; if you see an empty document, gedit has created a new document with nothing in it.
    $ gedit 00000001.XML
• Change one character, save the file with a new name, and close gedit (either click the x in the top left, or do a Ctrl-C from the command line)
  • Save as 00000001r.XML
• Run the checksum again, using >> to append the new output to the file you previously created
    $ md5sum 00000001r.XML >> md5sum.txt
• Compare the two checksums
    $ less md5sum.txt
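The comparison can also be made by the shell rather than by eye. This is a minimal sketch, assuming GNU md5sum, sed, and awk are available. The files original.xml and revised.xml and their contents are hypothetical stand-ins for 00000001.XML and 00000001r.XML.

```shell
#!/usr/bin/env bash
# Sketch: show that a one-character change yields a different MD5.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-ins for 00000001.XML and 00000001r.XML.
printf '<mark>ACME</mark>\n' > original.xml
sed 's/ACME/ACMF/' original.xml > revised.xml   # change one character

# md5sum prints "<hash>  <filename>"; keep only the hash field.
sum_original=$(md5sum original.xml | awk '{print $1}')
sum_revised=$(md5sum revised.xml | awk '{print $1}')

echo "original: $sum_original"
echo "revised:  $sum_revised"

# Compare the two hash strings.
if [ "$sum_original" = "$sum_revised" ]; then
    result="identical"
else
    result="different"
fi
echo "The files are $result"
```

The same comparison works on the real files by substituting their names for the stand-ins.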
Corrupt an image file
• Calculate a checksum on one of the .jpg files
    $ md5sum 00000002.JPG > md5sum_jpg.txt
• Open the file (for simplicity, we’ll use ghex)
    $ ghex 00000002.JPG
• Change one character, save the file with a new name, and close ghex
  • Save as 00000002r.JPG
• Run the checksum again
    $ md5sum 00000002r.JPG >> md5sum_jpg.txt
• Compare the two checksums
    $ less md5sum_jpg.txt
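A checksum shows that a file changed but not where. As a complement, cmp (not covered on the slides, but part of the standard diffutils package) can report which bytes differ. This is a minimal sketch; original.bin and revised.bin are hypothetical stand-ins for 00000002.JPG and 00000002r.JPG.

```shell
#!/usr/bin/env bash
# Sketch: locate the byte that differs between two binary files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-ins for 00000002.JPG and 00000002r.JPG.
printf 'ABCDEFGH' > original.bin
printf 'ABCXEFGH' > revised.bin     # fourth byte changed

# cmp -l lists every differing byte: its offset (counted from 1),
# then the byte value in each file, in octal.
# cmp exits with status 1 when the files differ, hence "|| true".
diff_report=$(cmp -l original.bin revised.bin || true)
echo "$diff_report"

# Count how many bytes differ (one line of output per byte).
diff_count=$(printf '%s\n' "$diff_report" | grep -c .)
echo "bytes changed: $diff_count"
```

On the real images, cmp -l 00000002.JPG 00000002r.JPG should list the offset of the byte edited in ghex.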
JHOVE
• See http://jhove.sourceforge.net/using.html
• Install JHOVE
    $ sudo apt-get install jhove
• Run JHOVE on the XML file in the directory that you DIDN’T edit
    $ jhove 00000001.XML
• Run JHOVE on the XML file in the directory that you corrupted
    $ jhove 00000001r.XML
• It might help to open these side-by-side in two terminal windows
• Repeat for the JPG files. What difference do you see? Why?
Extract metadata with ExifTool
• See http://www.sno.phy.queensu.ca/~phil/exiftool/
• Run exiftool on your uncorrupted image file
    $ exiftool 00000002.JPG
• Try it on the corrupted image file
    $ exiftool 00000002r.JPG
• Output exiftool results to CSV
    $ cd ..
    $ exiftool -csv temp > out.csv
• Open results in LibreOffice Calc (be sure to select the “comma” option when importing)
Bulk metadata operations with ExifTool
• Run exiftool over your complete download
    $ exiftool -r -csv hr080104 > hr080104.csv
• Open results in LibreOffice Calc
• For more work with exiftool, see the video tutorials by AVPreserve
  • http://www.avpreserve.com/exiftool-tutorial-series/
FITS
• FITS is a powerful set of tools for extracting and validating metadata. FITS includes:
  • Jhove
  • Exiftool
  • National Library of New Zealand Metadata Extractor
  • DROID
  • FFIdent
  • File Utility (windows)
• To run FITS, locate the script fits.sh on your virtual machine. It is probably located in /home/bcadmin/Tools/fits/. Verify this:
    $ ls /home/bcadmin/Tools/fits/
FITS options
  -i   The input file you want to examine
  -o   The destination of the output XML file
  -r   Process directories recursively when -i is a directory
  -h   Prints the usage message
  -v   Displays the FITS version number
  -x   Convert FITS output to a standard metadata schema
  -xc  Output using a standard metadata schema and include FITS xml
• If -o is not specified then the output is sent to the console window.
• The general syntax for our purposes is:
    $ /home/bcadmin/Tools/fits/fits.sh -i input_file -o output_file
Using FITS
• From the directory containing the temp directory and the hr080104 directory, try the following commands:
    $ /home/bcadmin/Tools/fits/fits.sh -i temp/00000001.XML
• You will probably see an error, followed by the output of the command printed to the screen. To save the output, add:
    $ /home/bcadmin/Tools/fits/fits.sh -i temp/00000001.XML -o xml_fits.txt
• Convert the output to a standard metadata schema:
    $ /home/bcadmin/Tools/fits/fits.sh -x -i temp/00000001.XML -o xmlstd_fits.txt
• Repeat for JPG files. Note the different standard metadata schemas.
Using FITS over directories
• You can process an entire directory of files with FITS. You need to add the -r (recursive) option if there are sub-directories, and specify a folder to hold the output.
    $ mkdir fits_temp
    $ /home/bcadmin/Tools/fits/fits.sh -x -i temp/ -o fits_temp/
    $ mkdir fits_hr080104
    $ /home/bcadmin/Tools/fits/fits.sh -x -r -i hr080104/ -o fits_hr080104/
• This will take a long time and you will see a lot of errors.
• Inspect the results. The main problem is that all the output files are stored in a single directory, and it’s difficult to see which FITS output goes with which file in the original directory.
bash scripting
• Some of the problems we’ve seen (such as with the FITS output) can be solved by careful use of scripting.
• For a good introduction to Bash, see
  • The Linux Documentation Project. (n.d.) Bash Tutorial Intro & How-To. Available from http://tldp.org/HOWTO/Bash-Prog-Intro-HOWTO-1.html
• Other options include Python and Perl scripting. If you want to do this sort of work professionally, it’s highly recommended that you learn at least one of these.
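As one illustration of how scripting could address the FITS output problem, the sketch below calls FITS once per file and mirrors the input directory tree in the output directory, so each report sits at a path matching its source file. It is a sketch under stated assumptions, not a tested workflow: the function name fits_mirror and the .fits.xml suffix are hypothetical choices, the fits.sh path is the one given on the slides, and only the -i and -o options described earlier are used.

```shell
#!/usr/bin/env bash
# Sketch: run FITS once per file and mirror the input directory tree
# in the output directory.

# Path from the slides; override by setting FITS before calling.
FITS="${FITS:-/home/bcadmin/Tools/fits/fits.sh}"

# fits_mirror SRC_DIR OUT_DIR   (hypothetical helper name)
fits_mirror() {
    local src="${1%/}"    # strip a trailing slash, if any
    local out="${2%/}"
    local file rel
    # Note: reading find output line by line assumes that no file
    # name contains a newline character.
    find "$src" -type f | while IFS= read -r file; do
        rel="${file#"$src"/}"                 # path relative to SRC_DIR
        mkdir -p "$out/$(dirname "$rel")"     # recreate the subdirectory
        # One FITS call per file; the report is named after the input.
        "$FITS" -i "$file" -o "$out/$rel.fits.xml" \
            || echo "FITS failed on $file" >&2
    done
}
```

After loading the function (for example with source), a call such as fits_mirror hr080104 fits_hr080104 would write one report per input file. Because FITS is started separately for every file, this is likely to be slower than a single recursive run with -r.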