Lecture 4. Getting data onto baboon The ASCII character code More Unix/Linux filters. Announcements. First Textools Quiz: October 6 (see review sheet). Getting data onto a machine running Linux. Scan it - scan data in and run optical character recognition (OCR) on it
Lecture 4 • Getting data onto baboon • The ASCII character code • More Unix/Linux filters
Announcements • First Textools Quiz: October 6 (see review sheet)
Getting data onto a machine running Linux • Scan it - scan the data in and run optical character recognition (OCR) on it • Copy it from a CD-ROM or floppy disk • Move it from another machine: • file transfer (ftp) • download data from the World Wide Web • email
TCP/IP • Any given machine might run Unix, DOS, Windows 95, ... • So how can machines with different operating systems communicate? • through TCP/IP - a common set of rules • Transmission Control Protocol (TCP) - manages data flow by breaking data into packets • Internet Protocol (IP) - moves the data
IP Addresses
• Each machine on the Internet has an official numeric address, called an IP address
• Some MSU IP addresses:
    sapir                   130.68.160.51
    Picard.Montclair.edu    130.68.1.31
    baboon.montclair.edu    130.68.160.66
    chss.montclair.edu      130.68.1.31
Domain Names
    fitzpatr@    baboon.    montclair.     edu
    emf@         homer.     att.           com
    user ID      machine    institution    domain name
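The labelled parts of an address can be pulled apart in the shell. The following is only a sketch: the variable names are made up for illustration, and it assumes the host name has exactly three dot-separated fields, as in the two addresses on the slide.

```shell
# Hypothetical example: split user@machine.institution.domain into its parts
# using only POSIX parameter expansion. Assumes exactly three host fields.
addr='fitzpatr@baboon.montclair.edu'

user=${addr%%@*}          # everything before the @
host=${addr#*@}           # everything after the @
machine=${host%%.*}       # first dot-separated field of the host name
rest=${host#*.}           # host name without the machine field
institution=${rest%%.*}   # next field
domain=${rest##*.}        # last field

# printf reuses its format for each pair of arguments.
printf '%-12s %s\n' 'user ID:' "$user" 'machine:' "$machine" \
    'institution:' "$institution" 'domain:' "$domain"
```

Real host names can have more or fewer fields, so this is an illustration of the naming scheme rather than a general parser.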
Telnet
• Lets you log in to other computers from baboon
telnet sapir
Trying 130.68.160.51...
Connected to sapir.
Escape character is '^]'.
Welcome to sapir.montclair.edu
-- Unauthorized Access Prohibited --
-----------------------------------------------------------------------
*ATTENTION USERS: If you cannot login to your account, please contact the System Administrator at admin@sapir.montclair.edu
-----------------------------------------------------------------------
login:
File Transfer Protocol (ftp)
• allows you to transfer files from a remote computer.
• anonymous ftp allows you to transfer files without having an account on the remote machine.
• Basic steps:
    ftp mrcnext.cso.uiuc.edu    (one Gutenberg site)
    login: anonymous
    password: fitzpatricke@baboon.montclair.edu    (your own email address)
File Compression • reduces the size of a file by finding repeating patterns and substituting a shorter code for each pattern • the compression tool is called gzip • check out man gzip (q to exit)
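To see the effect of repeated patterns, one can compress a deliberately repetitive file and compare sizes. This is a minimal sketch: the file name sample.txt and the scratch directory are assumptions made for the example, and the exact compressed size will vary between gzip versions.

```shell
# Work in a scratch directory so no existing files are touched.
tmpdir=$(mktemp -d)

# Build a highly repetitive sample file (hypothetical name: sample.txt).
i=0
while [ "$i" -lt 200 ]; do
    echo 'Four score and seven years ago'
    i=$((i + 1))
done > "$tmpdir/sample.txt"

# -c writes the compressed data to standard output and leaves sample.txt in place.
gzip -c "$tmpdir/sample.txt" > "$tmpdir/sample.txt.gz"

orig_size=$(wc -c < "$tmpdir/sample.txt")
gz_size=$(wc -c < "$tmpdir/sample.txt.gz")
echo "original: $orig_size bytes, compressed: $gz_size bytes"

# -d decompresses; cmp confirms that the round trip lost nothing.
if gzip -dc "$tmpdir/sample.txt.gz" | cmp -s - "$tmpdir/sample.txt"; then
    roundtrip=ok
else
    roundtrip=failed
fi
echo "round trip: $roundtrip"
```

A file with little repetition (for example, one that is already compressed) would shrink far less.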
Tape Archiving (tar) • tar saves multiple files to a single file, preserving the file names • this allows a set of files to be moved from machine to machine as one entity • tar also allows the restoration of the multiple files on the receiving machine • check out man tar (q to exit)
tar options
tar cf fn.tar fn*    create a single file named fn.tar from all files beginning with fn
tar xf fn.tar        extract the contents of the single file, restoring the multiple files
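The create and extract steps can be tried end to end in a scratch directory. This is a sketch under assumptions: the directories src and dest are invented stand-ins for the sending and receiving machines, and tar tf (list the archive contents) is an extra option not covered on the slide.

```shell
tmpdir=$(mktemp -d)
mkdir "$tmpdir/src" "$tmpdir/dest"

# Create three small files whose names begin with fn (hypothetical contents).
for n in 1 2 3; do
    echo "contents of file $n" > "$tmpdir/src/fn$n"
done

# c = create, f = the next argument is the archive file name.
(cd "$tmpdir/src" && tar cf ../fn.tar fn*)

# t = list the archive's contents without extracting anything.
listing=$(tar tf "$tmpdir/fn.tar")
echo "$listing"

# x = extract; a second directory stands in for the receiving machine.
(cd "$tmpdir/dest" && tar xf ../fn.tar)

# The restored files should match the originals byte for byte.
if diff -r "$tmpdir/src" "$tmpdir/dest" > /dev/null; then
    restored=ok
else
    restored=differs
fi
echo "restored files: $restored"
```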
The ASCII character code • The American Standard Code for Information Interchange • Its numeric character codes define the default sorting order on Unix/Linux • To see the ASCII standard order for characters, type man 7 ascii (q to exit)
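The codes behind the ordering can be inspected directly from the shell. The sketch below relies on the standard printf convention that an argument beginning with a quote prints the numeric code of the next character; the sample words are made up, and LC_ALL=C is set so that sort compares raw byte values rather than using a language-specific order.

```shell
# An argument of the form 'X (leading quote) makes printf %d print X's code.
code_0=$(printf '%d' "'0")
code_A=$(printf '%d' "'A")
code_a=$(printf '%d' "'a")
echo "0=$code_0 A=$code_A a=$code_a"     # prints: 0=48 A=65 a=97

# Digits come before uppercase letters, which come before lowercase letters.
sorted=$(printf '%s\n' banana Apple 42 cherry Zebra | LC_ALL=C sort)
echo "$sorted"
```

In other locale settings sort may interleave upper and lower case, which is why the example pins the C locale.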
egrep
• runs significantly faster than grep, but is greedy in terms of computer memory
• egrep has several facilities that grep does not:
    pattern   meaning                        grep   egrep
    c+        one or more occurrences of c   No     Yes
    c?        zero or one occurrence of c    No     Yes
    c1|c2     c1 or c2                       No     Yes
egrep (2)
egrep 'b+' words
abating
Abba
abbe
egrep 'b?' words    (b? also matches zero b's, so every line matches)
Aarhus
Aaron
Ababa
egrep (3)
egrep 'd|f' words
abaft
abandon
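The three operators can be compared on a small word list. This sketch uses grep -E, the current spelling of egrep on GNU systems, and a nine-word list assembled from the words shown on the slides; the anchored patterns ^Ab+a and ^Ab?a are additions meant to make the difference between + and ? visible.

```shell
# A tiny stand-in for the system word list (hypothetical temporary file).
words=$(mktemp)
printf '%s\n' Aarhus Aaron Ababa aback abaft abandon abating Abba abbe > "$words"

plus=$(grep -E 'b+' "$words")            # lines with at least one b
optional=$(grep -E -c 'b?' "$words")     # b? can match nothing, so all lines count
either=$(grep -E 'd|f' "$words")         # lines containing d or f

# Anchoring shows the difference: A, then b's, then a.
one_or_more=$(grep -E '^Ab+a' "$words")  # Ababa, Abba
zero_or_one=$(grep -E '^Ab?a' "$words")  # Aarhus, Aaron, Ababa

echo "b+ matches:";    echo "$plus"
echo "b? matches $optional of 9 lines"
echo "d|f matches:";   echo "$either"
echo "^Ab+a matches:"; echo "$one_or_more"
echo "^Ab?a matches:"; echo "$zero_or_one"
```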
The tr command
tr transforms characters
tr expects its input to come from standard input, so use '<' to redirect a file into it:
    tr from_chars to_chars < fn
refs file:
Bloomfield, L. 1933. Language.
Chomsky, N. 1986. Barriers.
Jacobson, R. 1941. Child Language.
tr o x < refs
Blxxmfield, L. 1933. Language.
Chxmsky, N. 1986. Barriers.
Jacxbsxn, R. 1941. Child Language.
Common Uses of tr
• Case conversion
tr a-z A-Z < refs
BLOOMFIELD, L. 1933. LANGUAGE.
CHOMSKY, N. 1986. BARRIERS.
• Conversion of spaces to newlines
tr ' ' '\n' < gettysburg
Four
score
and
seven
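The examples above can be reproduced without any existing files. This sketch recreates the refs file from the slides in a temporary location (the temporary file is an assumption of the example) and applies the same three translations.

```shell
# Recreate the refs file from the slides in a temporary file.
refs=$(mktemp)
cat > "$refs" <<'EOF'
Bloomfield, L. 1933. Language.
Chomsky, N. 1986. Barriers.
Jacobson, R. 1941. Child Language.
EOF

swapped=$(tr o x < "$refs")          # every o becomes x
upper=$(tr 'a-z' 'A-Z' < "$refs")    # lowercase to uppercase
tokens=$(tr ' ' '\n' < "$refs")      # one space-separated token per line

echo "$swapped"
echo "$upper"
echo "$tokens"
```

Putting one token on each line is what lets line-oriented filters such as sort and uniq work on individual words.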
tr options
tr -s o < refs    the squeeze option
Blomfield, L. 1933. Language.
Chomsky, N. 1986. Barriers.
Jacobson, R. 1941. Child Language.
tr -c 'A-Za-z0-9' ' ' < refs    the complement option: every character that is not a letter or digit, including each newline, becomes a space
Bloomfield  L  1933  Language  Chomsky  N  1986  Barriers  Jacobson  R  1941  Child Language  
tr options (2)
tr -d '0-9' < refs    the delete option
Bloomfield, L. . Language.
Chomsky, N. . Barriers.
Jacobson, R. . Child Language.
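The three options can be checked on a single line of the refs file. This is a sketch: combining -c with -s is an addition not shown on the slides, used here so that each run of punctuation and spaces collapses to one space.

```shell
line='Bloomfield, L. 1933. Language.'

# -s squeezes each run of the listed character down to one occurrence.
squeezed=$(printf '%s\n' "$line" | tr -s o)

# -d deletes every character in the set.
no_digits=$(printf '%s\n' "$line" | tr -d '0-9')

# -c complements the set (everything that is NOT a letter or digit);
# adding -s collapses each run of replacement spaces into a single space.
words_only=$(printf '%s\n' "$line" | tr -cs 'A-Za-z0-9' ' ')

echo "$squeezed"
echo "$no_digits"
echo "[$words_only]"    # brackets make the trailing space visible
```

The trailing space in the last result comes from the final period and newline, which also fall outside the letter-and-digit set.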
The sort command • sorts the lines of a file • reads previously sorted files and merges them • uses the ASCII code as the default order
sort options
sort -r     sort in reverse order
sort -n     sort by arithmetic value
sort -nr    arithmetic, reverse
sort -f     eliminate case distinctions
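The difference between the default order and the options shows up most clearly on numbers and mixed-case words. A small sketch with made-up input, run with LC_ALL=C so that the default order is plain ASCII order:

```shell
nums=$(printf '%s\n' 10 9 100 2)
words=$(printf '%s\n' cherry apple Banana)

# Default: compared character by character, so "100" sorts before "2".
default_order=$(printf '%s\n' "$nums" | LC_ALL=C sort)
numeric=$(printf '%s\n' "$nums" | LC_ALL=C sort -n)        # by arithmetic value
numeric_rev=$(printf '%s\n' "$nums" | LC_ALL=C sort -nr)   # arithmetic, reversed

# Default ASCII order puts uppercase B before lowercase a; -f ignores case.
plain=$(printf '%s\n' "$words" | LC_ALL=C sort)
folded=$(printf '%s\n' "$words" | LC_ALL=C sort -f)
reversed=$(printf '%s\n' "$words" | LC_ALL=C sort -r)

for result in "$default_order" "$numeric" "$numeric_rev" "$plain" "$folded" "$reversed"; do
    echo "$result" | tr '\n' ' '    # print each result on one line
    echo
done
```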
The uniq command • Operates on repeated lines • duplicate lines must be consecutive to be identified as duplicates (so sort often precedes uniq) • uniq deletes duplicate lines, keeping one copy of each • uniq -d reports duplicate lines • uniq -u reports unique lines • uniq -c reports each line with the number of times it occurred.
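The need to sort first, and the effect of each option, can be seen on a six-word sample. This is a sketch with invented input; the final pipeline, which ranks words by frequency, combines the tr, sort, and uniq filters from this lecture.

```shell
text='the dog the cat the dog'
lines=$(printf '%s\n' "$text" | tr ' ' '\n')     # one word per line

# Without sorting, no duplicates are adjacent, so uniq removes nothing.
unsorted_count=$(printf '%s\n' "$lines" | uniq | wc -l)

sorted=$(printf '%s\n' "$lines" | LC_ALL=C sort)
distinct=$(printf '%s\n' "$sorted" | uniq)        # cat dog the
repeated=$(printf '%s\n' "$sorted" | uniq -d)     # dog the
singles=$(printf '%s\n' "$sorted" | uniq -u)      # cat

# Count each word, then rank by count, highest first.
ranked=$(printf '%s\n' "$sorted" | uniq -c | LC_ALL=C sort -nr)

echo "lines left without sorting: $unsorted_count"
echo "$ranked"
```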