Workbook 8 String Processing Tools Pace Center for Business and Technology
String Processing Tools Key Concepts
• When storing text, computers transform characters into a numeric representation. This process is referred to as encoding the text.
• In order to accommodate the demands of a variety of languages, several different encoding techniques have been developed. These techniques are represented by a variety of character sets.
• The oldest and most prevalent encoding technique is known as the ASCII character set, which still serves as a least common denominator among other techniques.
• The wc command counts the number of characters, words, and lines in a file. When applied to structured data, the wc command can become a versatile counting tool.
• The cat command has options that allow representation of nonprinting characters such as NEWLINE.
• The head and tail commands have options that allow you to print only a certain number of lines or a certain number of bytes (one byte usually correlates to one character) from a file.
What are Files?
• Linux, like most operating systems, stores information that needs to be preserved outside of the context of any individual process in files. (In this context, and for most of this Workbook, the term file is meant in the sense of regular file.) Linux (and Unix) files store information using a simple model: information is stored as a single, ordered array of bytes, starting at the first byte and ending at the last. The number of bytes in the array is the length of the file. [9]
• What type of information is stored in files? Here are but a few examples.
• The characters that compose the book report you want to store until you can come back and finish it tomorrow are stored in a file called (say) ~/bookreport.txt.
• The individual colors that make up the picture you took with your digital camera are stored in the file (say) /mnt/camera/dcim/100nikon/dscn1203.jpg.
• The characters which define the usernames of users on a Linux system (and their home directories, etc.) are stored in the file /etc/passwd.
• The specific instructions which tell an x86 compatible CPU how to use the Linux kernel to list the files in a given directory are stored in the file /bin/ls.
Text Encoding Perhaps the most common type of data which computers are asked to store is text. As computers have developed, a variety of techniques for encoding text have been developed, from the simple in concept (which could encode only the Latin alphabet used in Western languages) to complicated but powerful techniques that attempt to encode all forms of human written communication, even attempting to include historical languages such as Egyptian hieroglyphics. The following sections discuss many of the encoding techniques commonly used in Red Hat Enterprise Linux.
ASCII One of the oldest, and still most commonly used, techniques for encoding text is called ASCII encoding. ASCII encoding simply takes the 26 lowercase and 26 uppercase letters which compose the Latin alphabet, 10 digits, and common English punctuation characters (those found on a keyboard), and maps each of them to an integer between 0 and 127 (the lower half of the 256 values a byte can hold), as outlined in the following table.
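The mapping can also be explored directly from the shell. The sketch below relies on a POSIX printf convention: a numeric argument that begins with a quote character is replaced by the numeric value of the character that follows it. The helper name ascii_code is made up for this illustration.

```shell
# ascii_code is a hypothetical helper: print the numeric code of one character.
# A leading single quote in a numeric printf argument means
# "use the code of the next character".
ascii_code() {
    printf '%d\n' "'$1"
}

ascii_code A    # an uppercase letter
ascii_code a    # a lowercase letter
ascii_code 0    # the character "0", not the integer 0
```

Uppercase and lowercase letters map to different integers, which is why text tools treat "A" and "a" as distinct characters.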
ASCII What about the integers 0 - 32? These integers are mapped to special keys on early teletypes, many of which have to do with manipulating the spacing on the page being typed on. The following characters are commonly called "whitespace" characters.
ASCII Others of the first 32 integers are mapped to keys which did not directly influence the "printed page", but instead sent "out of band" control signals between two teletypes. Many of these control signals have special interpretations within Linux (and Unix).
Generating Control Characters from the Keyboard Control and whitespace characters can be generated from the terminal keyboard directly using the CTRL key. For example, an audible bell can be generated using CTRL+G, while a backspace can be sent using CTRL+H, and we have already mentioned that CTRL+D is used to generate an "End of File" (or "End of Transmission"). Can you determine how the whitespace and control characters are mapped to the various CTRL key combinations? For example, what CTRL key combination generates a tab? What does CTRL+J generate? As you explore various control sequences, remember that the reset command will restore your terminal to sane behavior, if necessary. A tab can be generated with CTRL+I, while CTRL+J will generate a line feed (akin to hitting the RETURN key). In general, CTRL+A will generate ASCII character 1, CTRL+B will generate ASCII character 2, and so on. What about the values 128-255? ASCII encoding does not use them. The ASCII standard only defines the first 128 values of a byte, leaving the remaining 128 values to be defined by other schemes.
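The rule "CTRL+A generates 1, CTRL+B generates 2, and so on" amounts to subtracting 64 from the ASCII code of the uppercase letter (A is 65). A minimal sketch of that arithmetic, using a hypothetical helper named ctrl_code:

```shell
# ctrl_code is a hypothetical helper: given an uppercase letter, print the
# ASCII code that CTRL+<letter> generates (the letter's own code minus 64).
ctrl_code() {
    letter_code=$(printf '%d' "'$1")
    echo $(( letter_code - 64 ))
}

ctrl_code G    # audible bell
ctrl_code H    # backspace
ctrl_code I    # horizontal tab
ctrl_code J    # line feed
ctrl_code D    # End of Transmission
```

The helper only does the arithmetic; it does not send anything to the terminal.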
ISO 8859 and Other Character Sets Other standard encoding schemes have been developed, which map various glyphs (such as the symbols for the Yen and the Euro), diacritical marks found in many European languages, and non-Latin alphabets to the latter 128 values of a byte, which the ASCII standard leaves undefined. These standard encoding schemes are referred to as character sets. The following table lists some character sets which are supported in Linux, including their informal name, formal name, and a brief description.
ISO 8859 and Other Character Sets Notice a couple of implications about ISO 8859 encoding.
• Each of the alternate encodings maps a single glyph to a single byte, so that the number of letters encoded in a file equals the number of bytes which are required to encode them.
• Choosing a particular character set extends the range of characters that can be encoded, but you cannot encode characters from different character sets simultaneously. For example, you could not encode both a Latin capital A with a grave and a Greek letter Delta simultaneously.
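The first implication can be checked by counting bytes. In this sketch the byte with octal value 300 is written out directly; in ISO 8859-1 that byte stands for a Latin capital A with a grave. The variable names are illustrative only.

```shell
# Octal 300 (hex C0) is "capital A with grave" in ISO 8859-1 (Latin-1).
# One glyph is stored as exactly one byte.
one_glyph_bytes=$(( $(printf '\300' | wc -c) ))

# Three glyphs therefore occupy three bytes.
three_glyph_bytes=$(( $(printf '\300\300\300' | wc -c) ))

echo "$one_glyph_bytes $three_glyph_bytes"
```

The bytes themselves carry no record of which character set was intended, which is why glyphs from two ISO 8859 character sets cannot be mixed in one file.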
Unicode (UCS) In order to overcome the limitations of ASCII and ISO 8859 based encoding techniques, a Universal Character Set has been developed, commonly referred to as UCS, or Unicode. The Unicode standard acknowledges the fact that one byte of information, with its ability to encode 256 different values, is simply not enough to encode the variety of glyphs found in human communication. Instead, in its simplest fixed-width form (UCS-4), the standard uses 4 bytes to encode each character. Think of 4 bytes as 32 light switches. If we were to label each permutation of on and off for 32 switches with an integer, the mathematician would tell you that you would need 4,294,967,296 (over 4 billion) integers. Thus, 4 bytes leave room for over 4 billion glyphs (nearly enough for every person on the earth to have their own unique glyph; the user prince would approve), far more than the standard actually assigns.
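The light switch arithmetic can be reproduced with shell arithmetic. This sketch assumes the shell uses integers at least 64 bits wide, which is the case for bash and for typical shells on 64-bit systems.

```shell
# Each byte has 8 bits (switches), so it can hold 2^8 distinct values.
one_byte_values=$(( 1 << 8 ))

# Four bytes have 32 bits, giving 2^32 distinct values.
four_byte_values=$(( 1 << 32 ))

echo "$one_byte_values $four_byte_values"
```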
Text Encoding and the Open Source Community When contributors to the open source community are faced with decisions involving potentially incompatible formats, they generally balance local needs with an appreciation for adhering to widely accepted standards where appropriate. The UTF-8 encoding format seems to be evolving as an accepted standard, and in recent releases has become the default for Red Hat Enterprise Linux. The following paragraph, extracted from the utf-8(7) man page, says it well:
The LANG environment variable The LANG environment variable is used to define a user's language, and possibly the default encoding technique as well. The variable is expected to be set to a string using the following syntax: The locale command can be used to examine your current configuration (as can echo $LANG), while locale -a will list all settings currently supported by your system. The extent of the support for any given language will vary.
The LANG environment variable The following tables list some selected language codes, country codes, and code set specifications.
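The syntax line and the tables are not reproduced in this copy. On typical Linux systems, LANG values have the shape language_COUNTRY.codeset, for example en_US.UTF-8; that shape is stated here as an assumption about the missing text. The sketch below splits a sample value into its three parts using shell parameter expansion. The variable sample_lang is illustrative and is not read from your environment.

```shell
# sample_lang is an illustrative value, not read from the environment.
sample_lang='en_US.UTF-8'

language=${sample_lang%%_*}    # text before the first underscore
rest=${sample_lang#*_}         # text after the first underscore
country=${rest%%.*}            # text before the dot
codeset=${sample_lang#*.}      # text after the dot

echo "language=$language country=$country codeset=$codeset"
```

To inspect the real setting, use echo $LANG or the locale command as described above.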
Revisiting cat, head, and tail Revisiting cat We have been using the cat command to simply display the contents of files. Usually, the cat command generates a faithful copy of its input, without performing any edits or conversions. When called with one of the following command line switches, however, the cat command will indicate the presence of tabs, line feeds, and other control sequences, using the following conventions. Using the -A command line switch, the whitespace structure of the file becomes evident, as tabs are replaced with ^I, and line feeds are decorated with $. E.g. cat -A /etc/hosts
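A small sketch of the -A convention on made-up input rather than /etc/hosts, so the result does not depend on the system. It assumes GNU cat; the -A switch is not available in every cat implementation.

```shell
# Two lines of hypothetical text; the first contains a tab.
marked=$(printf 'name\tvalue\nlast line\n' | cat -A)

# The tab shows up as ^I and each line feed as a trailing $.
echo "$marked"
```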
Revisiting head and tail The head and tail commands have been used to display the first or last few lines of a file, respectively. But what makes a line? Imagine yourself working at a typewriter: click! clack! click! clack! clack! ziiing! Instead of the ziiing! of the typewriter carriage at the end of each line, the line feed character (ASCII 10) is chosen to mark the end of lines. Unfortunately, a common convention for how to mark the end of a line is not shared among operating systems. Linux (and Unix) uses the line feed character (ASCII 10, often represented \n), while classic Macintosh operating systems (before Mac OS X) used the carriage return character (ASCII 13, often represented \r or ^M), and Microsoft operating systems use a carriage return/line feed pair (ASCII 13, ASCII 10).
Revisiting head and tail For example, the following file contains a list of four musicians. Linux (and Unix) text files generally adhere to a convention that the last character of the file must be a line feed for the last line of text. Following the cat of the file musicians.mac, which does not contain any conventional Linux line feed characters, the bash prompt is not displayed in its usual location.
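The effect of the different conventions on line-oriented tools can be sketched with generated text instead of the musicians files, which are not reproduced here. The names below are placeholders.

```shell
# The same two entries, once with Linux line feeds and once with bare
# carriage returns (the classic Macintosh convention).
unix_lines=$(( $(printf 'first\nsecond\n' | wc -l) ))
mac_lines=$(( $(printf 'first\rsecond\r' | wc -l) ))

# Translating each carriage return into a line feed restores the lines,
# so that head can find the first one.
fixed_lines=$(( $(printf 'first\rsecond\r' | tr '\r' '\n' | wc -l) ))
first_fixed=$(printf 'first\rsecond\r' | tr '\r' '\n' | head -n 1)

echo "$unix_lines $mac_lines $fixed_lines $first_fixed"
```

Because wc -l counts line feed characters, the carriage-return version is reported as having no lines at all until it is translated.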
The wc (Word Count) Command Counting Made Easy Have you ever tried to answer a “25 words or less” quiz? Did you ever have to write a 1500-word essay? With the wc command you can easily verify that your contribution meets the criteria. The wc command counts the number of characters, words, and lines. It will take its input either from files named on its command line or from its standard input. Below is the command line form for the wc program:
The wc (Word Count) Command When used without any command line switches, wc will report the number of lines, words, and characters (bytes), in that order. Command line switches can be combined to return any combination of character count, line count or word count.
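A minimal sketch of the three most common switches (-l for lines, -w for words, -c for bytes) applied to two lines of made-up text. The counts are wrapped in shell arithmetic only to strip the padding that some wc implementations add.

```shell
# Hypothetical sample: two lines, five words, 26 bytes including line feeds.
count_lines=$(( $(printf 'the quick brown\nfox jumps\n' | wc -l) ))
count_words=$(( $(printf 'the quick brown\nfox jumps\n' | wc -w) ))
count_bytes=$(( $(printf 'the quick brown\nfox jumps\n' | wc -c) ))

echo "$count_lines $count_words $count_bytes"
```

Since wc reads standard input when no file is named, it can be placed at the end of any pipeline to count what the earlier commands produced.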
How To Recognize A Real Character Text files are composed using an alphabet of characters. Some characters are visible, such as numbers and letters. Some characters are used for horizontal distance, such as spaces and TAB characters. Some characters are used for vertical movement, such as carriage returns and line feeds. A line in a text file is a series of any characters other than a NEWLINE (line feed) character, followed by a NEWLINE character. Additional lines in the file immediately follow the first line. While a computer represents characters as numbers, the exact value used for each symbol varies depending on which alphabet has been chosen. The most common alphabet for English speakers is ASCII; its 8-bit extension ISO 8859-1 is also called “Latin-1”. Different human languages are represented by different computer encoding rules, so the exact numeric value for a given character depends on the human language being recorded.
So, What Is A Word? A word is a group of printing characters, such as letters and digits, surrounded by white space, such as space characters or horizontal TAB characters. Notice that our definition of a word does not include any notion of “meaning”. Only the form of the word is important, not its semantics. As far as Linux is concerned, a line such as:
Chapter 2. Finding Text: grep Key Concepts
• grep is a command that prints lines that match a specified text string or pattern.
• grep is commonly used as a filter to reduce output to only desired items.
• grep -r will recursively grep files underneath a given directory.
• grep -v prints lines that do NOT match a specified text string or pattern.
• Many other command line switches allow users to specify grep's output format.
Searching Text File Contents using grep In an earlier Lesson, we saw how the wc program can be used to count the characters, words and lines in text files. In this Lesson we introduce the grep program, a handy tool for searching text file contents for specific words or character sequences. The name grep comes from the ed editor command g/re/p, which globally searches for a regular expression and prints the matching lines. What, you may well ask, is a regular expression and why on earth should I want to search for one? We will provide a more formal definition of regular expressions in a later Lesson, but for now it is enough to know that a regular expression is simply a way of describing a pattern, or template, to match some sequence of characters. A simple regular expression would be “Hello”, which matches exactly five characters: “H”, “e”, two consecutive “l” characters, and a final “o”. More powerful search patterns are possible and we shall examine them in the next section. The figure below gives the general form of the grep command line:
Searching Text File Contents using grep There are actually three different names for the grep tool [10]:
• fgrep Does a fast search for simple patterns. Use this command to quickly locate patterns without any wildcard characters, useful when searching for an ordinary word.
• grep Pattern searches using ordinary regular expressions.
• egrep Pattern searches using more powerful extended regular expressions.
The pattern argument supplies the template characters for which grep is to search. The pattern is expected to be a single argument, so if pattern contains any spaces, or other characters special to the shell, you must enclose the pattern in quotes to prevent the shell from expanding or word splitting it.
Searching Text File Contents using grep The following table summarizes some of grep's more commonly used command line switches. Consult the grep(1) man page (or invoke grep --help) for more.
Show All Occurrences of a String in a File Under Linux, there are often several ways of accomplishing the same task. For example, to see if a file contains the word “even”, you could just visually scan the file: Reading the file, we see that the file does indeed contain the letters “even”. This method breaks down on a large file, because we could easily miss one word in a file of several thousand, or even several hundred thousand, words. We can use the grep tool to search through the file for us automatically: Here we searched for a word using its exact spelling. Instead of just a literal string, the pattern argument can also be a general template for matching more complicated character sequences; we shall explore that in a later Lesson.
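The workbook's own sample file is not reproduced in this copy, so the sketch below builds a small stand-in file in a temporary directory. The file name numbers.txt and its contents are invented for the illustration.

```shell
# Hypothetical sample file standing in for the workbook's example text.
workdir=$(mktemp -d)
printf 'one two three\nfour five six\nseven eight nine\n' > "$workdir/numbers.txt"

# Print every line that contains the letters "even".
even_lines=$(grep even "$workdir/numbers.txt")
echo "$even_lines"

rm -r "$workdir"
```

Note that the match is on the letters, not on the word: the line is printed because "seven" contains "even".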
Searching in Several Files at Once An easy way to search several files is just to name them on the grep command line: Perhaps we are more interested in just discovering which file mentions the word “nine” than actually seeing the line itself. Adding the -l switch to the grep line does just that:
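A sketch of both behaviours using two invented files, a.txt and b.txt, created in a temporary directory. The searches run from inside that directory so that the reported file names stay short.

```shell
workdir=$(mktemp -d)
printf 'one two three\n' > "$workdir/a.txt"
printf 'seven eight nine\n' > "$workdir/b.txt"

# With several files named, each matching line is prefixed by its file name.
with_lines=$(cd "$workdir" && grep nine a.txt b.txt)

# With -l, only the names of files containing a match are listed.
names_only=$(cd "$workdir" && grep -l nine a.txt b.txt)

echo "$with_lines"
echo "$names_only"

rm -r "$workdir"
```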
Searching Directories Recursively The grep command can also search all the files in a whole directory tree with a single command. This can be handy when working with a large number of files. The easiest way to understand this is to see it in action. In the directory /etc/sysconfig are text files that contain much of the configuration information about a Linux system. The Linux name for the first Ethernet network device on a system is “eth0”, so you can find which file contains the configuration for eth0 by letting the grep -r command do the searching for you [11]:
Searching Directories Recursively Every file in /etc/sysconfig that mentions eth0 is shown in the results. We can further limit the files listed to only those referring to an actual device by filtering the grep -r output through a grep DEVICE: This shows a common use of grep as a filter to simplify the outputs of other commands. If only the names of the files were of interest, the output can be simplified with the -l command line switch.
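Since the contents of /etc/sysconfig differ from system to system, the sketch below recreates the idea in a temporary directory. The directory layout and file contents are invented; only the grep switches come from the text. The -r switch is assumed to be available, as it is in GNU grep.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/conf/net"
printf 'DEVICE=eth0\nONBOOT=yes\n' > "$workdir/conf/net/ifcfg-eth0"
printf '# notes about eth0 go here\n' > "$workdir/conf/notes"

# Recursive search, then keep only the lines that also mention DEVICE.
device_lines=$(cd "$workdir" && grep -r eth0 conf | grep DEVICE)

# -l lists only the names of the files that contain a match.
matching_files=$(cd "$workdir" && grep -rl eth0 conf | sort)

echo "$device_lines"
echo "$matching_files"

rm -r "$workdir"
```

The second grep reads the output of the first through the pipe, which is the filter usage described above.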
Inverting grep By default, grep shows only the lines matching the search pattern. Usually, this is what you want, but sometimes you are interested in the lines that do not match the pattern. In these instances, the -v command line switch inverts grep's operation.
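One common use of -v is to hide comment lines. A minimal sketch on invented configuration text:

```shell
# Keep every line that does NOT contain a "#" character.
active=$(printf '# comment\nport=22\n# another\nuser=root\n' | grep -v '#')
echo "$active"
```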
Getting Line Numbers Often you may be searching a large file that has many occurrences of the pattern. Grep will list each line containing one or more matches, but how is one to locate those lines in the original file? Using the grep -n command will also list the line number of each matching line. The file /usr/share/dict/words contains a list of common dictionary words. Identify which line contains the word “dictionary”: You might also want to combine the -n switch with the -r switch when searching all the files below a directory:
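Because /usr/share/dict/words is not installed on every system, this sketch shows the -n switch on a few invented lines instead.

```shell
# Each matching line is prefixed with its line number and a colon.
numbered=$(printf 'alpha\nbeta\ngamma\nbeta again\n' | grep -n beta)
echo "$numbered"
```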
Limiting Matching to Whole Words Remember the file containing our nursery rhyme earlier? Suppose we wanted to retrieve all lines containing the word “at”. If we try the command: Do you see what happened? We matched the “at” string, whether it was an isolated word or part of a larger word. The grep command provides the -w switch to imply that the specified pattern should only match entire words. The -w switch considers a sequence of letters, numbers, and underscore characters, surrounded by anything else, to be a word.
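The difference can be sketched by counting matching lines with and without -w. The two lines of text are invented and stand in for the nursery rhyme; the -c switch makes grep print the number of matching lines instead of the lines themselves.

```shell
# Without -w, "at" also matches inside "cat", "sat" and "that".
any_at=$(printf 'the cat sat\nlook at that\n' | grep -c at)

# With -w, only the line containing the isolated word "at" matches.
word_at=$(printf 'the cat sat\nlook at that\n' | grep -c -w at)

echo "$any_at $word_at"
```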
Ignoring Case The string “Bob” has a meaning quite different from the string “bob”. However, sometimes we want to find either one, regardless of whether the word is capitalized or not. The grep -i command solves just this problem.
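A short sketch of the effect, again counting matching lines with -c on invented input:

```shell
# Case-sensitive search: only the all-lowercase line matches.
exact=$(printf 'Bob\nbob\nBOB\nrob\n' | grep -c bob)

# Case-insensitive search: every capitalization of "bob" matches.
any_case=$(printf 'Bob\nbob\nBOB\nrob\n' | grep -c -i bob)

echo "$exact $any_case"
```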
Examples: Finding Simple Character Strings Verify that your computer has the system account “lp”, used for the line printer tools. Hint: the file /etc/passwd contains one line for each user account on the system.
Questions: Chapter 2. Finding Text: grep, questions 1, 2 and 3
Chapter 3. Introduction to Regular Expressions Key Concepts
• Regular expressions are a standard Unix syntax for specifying text patterns.
• Regular expressions are understood by many commands, including grep, sed, vi, and many scripting languages.
• Within regular expressions, . and [] are used to match characters.
• Within regular expressions, +, *, and ? specify a number of consecutive occurrences.
• Within regular expressions, ^ and $ specify the beginning and end of a line.
• Within regular expressions, (, ), and | specify alternative groups.
• The regex(7) man page provides complete details.
Introducing Regular Expressions In the previous chapter you saw grep used to match either a whole word or part of a word. This by itself is very powerful, especially in conjunction with switches like -i and -v, but it is not appropriate for all search scenarios. Here are some examples of searches that the grep usage you've learned so far would not be able to do: First, suppose you had a file that looked like this:
Introducing Regular Expressions What if you wanted to pull out just the names of the people in people_and_pets.txt? A command like grep -w Name: would match the 'Name:' line for each person, but also the 'Name:' line for each person's pet. How could we match only the 'Name:' lines for people? Well, notice that the lines for pets' names are all indented, meaning that those lines begin with whitespace characters instead of text. Thus, we could achieve our goal if we had a way to say "Show me all lines that begin with 'Name:'". Another example: Suppose you and a friend both witnessed a hit-and-run car accident. You both got a look at the fleeing car's license plate and yet each of you recalls a slightly different number. You read the license number as "4I35VBB" but your friend read it as "413SV88". It seems that what you read as an 'I' in the second character, your friend read as a '1'. Similar differences appear in your interpretations of other parts of the license like '5' vs 'S' and 'BB' vs '88'. The police, having taken both of your statements, now need to narrow down the suspects by querying their database of license plates for plates that might match what you saw.
Introducing Regular Expressions One solution might be to do separate queries for "4I35VBB" and "413SV88" but doing so assumes that one of you is exactly right. What if the perpetrator's license number was actually "4135VB8"? In other words, what if you were right about some of the characters in question but your friend was right about others? It would be more effective if the police could query for a pattern that effectively said: "Show me all license numbers that begin with a '4', followed by an 'I' or a '1', followed by a '3', followed by a '5' or an 'S', followed by a 'V', followed by two characters that are each either a 'B' or an '8'". Query scenarios like these can be solved using regular expressions. While computer scientists sometimes use the term "regular expression" (or "regex" for short) to describe any method of describing complex patterns, in Linux and many programming languages the term refers to a very specific set of special characters used for solving problems like the above. Regular expressions are supported by a large number of tools including grep, vi, find and sed.
Introducing Regular Expressions To introduce the usage of regular expressions, let's look at some solutions to the two problems introduced earlier. Don't worry if these seem a bit complicated; the remainder of the unit will start from scratch and cover regular expressions in great detail. A regex that could solve the first problem, where we wanted to say "Show me all lines that begin with 'Name:'" might look like this: ...that's it! Regular expressions are all about the use of special characters, called metacharacters, to represent advanced query parameters. The caret ("^"), as shown here, means "Lines that begin with...". Note, by the way, that the regular expression was put in single quotes. This is a good habit to get into early on, as it prevents bash from interpreting special characters that were meant for grep.
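The file people_and_pets.txt is not reproduced in this copy, so the sketch below feeds grep a few invented lines with the same shape: unindented 'Name:' lines for people and indented ones for pets.

```shell
# Hypothetical stand-in for people_and_pets.txt; pet entries are indented.
people=$(printf 'Name: Alice\n    Name: Rex\nName: Carlos\n    Name: Tweety\n' \
    | grep '^Name:')

# Only the lines that start with "Name:" in the first column survive.
echo "$people"
```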
Introducing Regular Expressions Ok, so what about the second problem? That one involved a much more complicated query: "Show me all license numbers that begin with a '4', followed by an 'I' or a '1', followed by a '3', followed by a '5' or an 'S', followed by a 'V', followed by two characters that are each either a 'B' or an '8'". This could be represented by a regular expression that looks like this: Wow, that's pretty short considering how long it took to write out what we were looking for! There are only two types of regex metacharacters used here: square brackets ('[]') and curly braces ('{}'). When two or more characters are shown within square brackets it means "any one of these". So '[B8]' near the end of the expression means "'B' or '8'". When a number is shown within curly braces it means "this many of the preceding item". Thus, '[B8]{2}' means "two characters that are each either a 'B' or an '8'". Pretty powerful stuff! Now that you've gotten a taste of what regular expressions are and how they can be used, let's start from scratch and cover them in depth.
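One regular expression consistent with the description above can be sketched as follows. The list of plates is invented, and grep -E is used because curly braces belong to the extended syntax; it behaves like the egrep command discussed in this Workbook.

```shell
# A pattern built only from square brackets and curly braces, as described.
plate_regex='4[I1]3[5S]V[B8]{2}'

# Hypothetical plates: the first three fit the description, the last two do not.
matches=$(printf '4I35VBB\n413SV88\n4135VB8\n4735VBB\n4I35VB9\n' \
    | grep -E "$plate_regex")

echo "$matches"
```

Both witness readings match, and so does the mixed plate 4135VB8 that neither witness reported exactly.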
Regular Expressions, Extended Regular Expressions, and the grep Command As the Unix implementation of regular expression syntax has evolved, new metacharacters have been introduced. In order to preserve backward compatibility, commands usually choose to implement either basic regular expressions or extended regular expressions. In order to not become bogged down with the differences, this Lesson will introduce the extended syntax, summarizing differences at the end of the discussion. One of the most common uses for regular expressions is specifying search patterns for the grep command. As was mentioned in the previous Lesson, there are three versions of the grep command. Reiterating, the three differ in how they interpret regular expressions.
Regular Expressions, Extended Regular Expressions, and the grep Command
• fgrep The fgrep command is designed to be a "fast" grep. The fgrep command does not support regular expressions, but instead interprets every character in the specified search pattern literally.
• grep The grep command interprets each pattern using the original, basic regular expression syntax.
• egrep The egrep command interprets each pattern using extended regular expression syntax.
Because we are not yet making a distinction between the basic and extended regular expression syntax, the egrep command should be used whenever the search pattern contains regular expressions.
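The three interpretations can be compared on the same input. This sketch uses the switches grep -F and grep -E, which select the fgrep and egrep behaviour respectively; the three sample lines are invented, and -c prints the number of matching lines.

```shell
sample=$(printf 'abc\na.c\nb|c\n')

# Fixed strings: the dot is an ordinary character, so only "a.c" matches.
fixed_dot=$(printf '%s\n' "$sample" | grep -F -c 'a.c')

# Basic syntax: the dot matches any character, so "abc" matches as well.
basic_dot=$(printf '%s\n' "$sample" | grep -c 'a.c')

# Basic syntax: an unescaped | is literal, so only the line "b|c" matches.
basic_bar=$(printf '%s\n' "$sample" | grep -c 'b|c')

# Extended syntax: | means "b or c", which every sample line contains.
extended_bar=$(printf '%s\n' "$sample" | grep -E -c 'b|c')

echo "$fixed_dot $basic_dot $basic_bar $extended_bar"
```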
Anatomy of a Regular Expression In our discussion of the grep program family, we were introduced to the idea of using a pattern to identify the file content of interest. Our examples were carefully constructed so that the pattern contained exactly the text for which we were searching. We were careful to use only literal characters in our regular expressions; a literal character matches only itself. So when we used “hello” as the regular expression, we were using a five-character regular expression composed only of literal characters. While this let us concentrate on learning how to operate the grep program, it didn't allow us to get a full appreciation of the power of regular expressions. Before we see regular expressions in use, we shall first see how they are constructed.
Anatomy of a Regular Expression A regular expression is a sequence of:
• Literal Characters Literal characters match only themselves. Examples of literals are letters, digits and most special characters (see below for the exceptions).
• Wildcards Wildcard characters match any character. Within a regular expression, a period (“.”) matches any character, be it a space, a letter, a digit, punctuation, anything.
• Modifiers A modifier alters the meaning of the immediately preceding pattern character. For example, the expression “ab*c” matches the strings “ac”, “abc”, “abbc”, “abbbc”, and so on, because the asterisk (“*”) is a modifier that means “any number of (including zero)”. Thus, our pattern means to match any sequence of characters consisting of one “a”, a (possibly empty) series of “b” characters, and a final “c” character.
• Anchors Anchors establish the context for the pattern, such as "the beginning of a line", or "the end of a word". For example, the expression “cat” would match any occurrence of the three letters, while “^cat” would only match lines that begin “cat”.
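The modifier and anchor examples above can be tried on a few invented lines. The counts come from grep's -c switch, and grep -E is used in place of egrep.

```shell
# "ab*c": an "a", any number of "b" characters (including none), then a "c".
star=$(printf 'ac\nabc\nabbbc\nadc\n' | grep -E -c 'ab*c')

# "cat" matches anywhere in the line; "^cat" only at the start of the line.
anywhere=$(printf 'cat\nconcatenate\ntomcat\n' | grep -E -c 'cat')
at_start=$(printf 'cat\nconcatenate\ntomcat\n' | grep -E -c '^cat')

echo "$star $anywhere $at_start"
```

The line "adc" is the only one rejected by the first pattern, because a "d" rather than a "b" separates the "a" from the "c".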
Taking Literals Literally Literals are straightforward because each literal character in a regular expression matches one, and only one, copy of itself in the searched text. Uppercase characters are distinct from lowercase characters, so that “A” does not match “a”. Wildcards The "dot" wildcard The character “.” is used as a placeholder, to match one of any character. In the following example, the pattern matches any occurrence of the literal characters “x” and “s”, separated by exactly two other characters.
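The original example is not reproduced in this copy. A pattern that fits the description is “x..s”, sketched here on an invented word list:

```shell
# "x..s": an "x", exactly two characters of any kind, then an "s".
dotted=$(printf 'boxers\nexits\naxes\ntaxis\n' | grep 'x..s')
echo "$dotted"
```

The words "axes" and "taxis" are skipped because only one character separates their “x” from the following “s”.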
Bracket Expressions: Ranges of Literal Characters Normally a literal character in a regex pattern matches exactly one occurrence of itself in the searched text. Suppose we want to search for the string “hello” regardless of how it is capitalized: we want to match “Hello” and “HeLLo” as well. How might we do that? A regex feature called a bracket expression solves this problem neatly. A bracket expression is a range of literals enclosed in square brackets (“[” and “]”). For example, the regex pattern “[Hh]” is a character range that matches exactly one character: either an uppercase “H” or a lowercase “h” letter. Notice that no matter how large the set of characters within the range is, the set matches exactly one character, if it matches any at all. A bracket expression that matches the set of lowercase vowels could be written “[aeiou]” and would match exactly one vowel. In the following example, bracket expressions are used to find words from the file /usr/share/dict/words. In the first case, the first five words that contain three consecutive (lowercase) vowels are printed. In the second case, the first five words that contain lowercase letters in the pattern of vowel-consonant-vowel-consonant-vowel-consonant are printed.
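The “hello” question posed above can be answered with one bracket expression per letter. A minimal sketch on invented input, counting matching lines with -c:

```shell
# Only the first letter may vary in case.
first_only=$(printf 'hello\nHello\nHeLLo\nhelp\n' | grep -c '[Hh]ello')

# Every letter may be upper or lower case.
any_caps=$(printf 'hello\nHello\nHeLLo\nhelp\n' | grep -c '[Hh][Ee][Ll][Ll][Oo]')

echo "$first_only $any_caps"
```

Each bracket expression still consumes exactly one character, so the second pattern matches five characters in total, just like the literal “hello”.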
Bracket Expressions: Ranges of Literal Characters If the first character of a bracket expression is a “^”, the interpretation is inverted, and the bracket expression will match any single occurrence of a character not included in the range. For example, the expression “[^aeiou]” would match any single character that is not a lowercase vowel, which includes consonants but also digits, punctuation, and uppercase letters. The following example first lists words which contain three consecutive vowels, and secondly lists words which contain three consecutive consonant-vowel pairs.
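Because /usr/share/dict/words may not be installed, and the original output is not reproduced here, the sketch below applies both kinds of pattern to a short invented word list. grep -E is used because the patterns rely on curly braces and parentheses from the extended syntax.

```shell
words=$(printf 'beautiful\nqueue\nbanana\nrhythm\nstreet\n')

# Three consecutive lowercase vowels.
three_vowels=$(printf '%s\n' "$words" | grep -E '[aeiou]{3}')

# Three consecutive pairs of (non-vowel, vowel).
cv_pairs=$(printf '%s\n' "$words" | grep -E '([^aeiou][aeiou]){3}')

echo "$three_vowels"
echo "$cv_pairs"
```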
Range Expressions vs. Character Classes: Old School and New School Another way to express a character range is by giving the start- and end-letters of the sequence this way: “[a-d]” would match any character from the set a, b, c or d. A typical usage of this form would be “[0-9]” to represent any single digit, or “[A-Z]” to represent all capital letters.
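A brief sketch of the range forms mentioned above, on invented input. Setting LC_ALL=C is an assumption made here so that the ranges follow plain ASCII order; in other locales a range such as “[A-Z]” may match additional characters.

```shell
# Lines containing at least one digit.
digits=$(printf 'room 101\nno digits\nA1\n' | LC_ALL=C grep -c '[0-9]')

# Lines containing at least one capital letter.
capitals=$(printf 'room 101\nno digits\nA1\n' | LC_ALL=C grep -c '[A-Z]')

# Lines whose first character is a, b, c or d.
a_to_d=$(printf 'apple\nzoo\ndig\n' | LC_ALL=C grep -c '^[a-d]')

echo "$digits $capitals $a_to_d"
```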