String Manipulation
• In Chapter 4 we looked at the String object, which is one of the native objects that JavaScript makes available to us. We saw a number of its properties and methods, including the following:
• length: the length of the string in characters
• charAt() and charCodeAt(): the methods for returning the character or character code at a certain position in the string
• indexOf() and lastIndexOf(): the methods that search a string for another string and return the character position of that string if it is found
• substr() and substring(): the methods that return just a portion of a string
• toUpperCase() and toLowerCase(): the methods that return a string converted to upper- or lowercase
• In this chapter we'll look at four new methods of the String object, namely split(), match(), replace(), and search(). The last three, in particular, give us some very powerful text manipulation functionality. However, to make full use of this functionality, we need to learn about a slightly more complex subject.
• The methods split(), match(), replace(), and search() can all make use of regular expressions, something JavaScript wraps up in an object called the RegExp object. Regular expressions allow you to define a pattern of characters, which can be used for text searching or replacement. Say, for example, that you had a string in which you wanted to replace the single quotes around quoted text with double quotes. This may seem easy (just search the string for ' and replace it with "), but what if the string was Bob O'Hara said 'Hello'? We would not want to replace the ' in O'Hara. Without regular expressions this could still be done, but it would take more than the two lines of code needed if you use regular expressions.
• Although split(), match(), replace(), and search() are at their most powerful with regular expressions, they can also be used with just plain text. We'll take a look at how they work in this simpler context first, to familiarize ourselves with the methods.
Additional String Methods
• In this section we will take a look at the split(), replace(), search(), and match() methods, and see how they work without regular expressions.
The split() Method
• The String object's split() method splits a single string into an array of substrings. Where the string is split is determined by the separator parameter that we pass to the method. This parameter is simply a character or text string.
• For example, to split the string "A,B,C" so that we have an array populated with the letters between the commas, the code would be as follows:
• JavaScript creates an array with three elements. In the first element it puts everything from the start of the string myString up to the first comma. In the second element it puts everything from after the first comma to before the second comma. Finally, in the third element it puts everything from after the second comma to the end of the string. So our array myTextArray will contain the three elements A, B, and C.
• If, however, our string was "A,B,C," (with a trailing comma), JavaScript would split this into four elements, the last element containing everything from the last comma to the end of the string, or in other words, an empty string.
• This is something that can catch you off guard if you're not aware of it.
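The behavior described above can be sketched as follows. The variable names myString and myTextArray come from the text; withTrailingComma is a name introduced here purely for illustration.

```javascript
// Splitting on a plain-text separator.
var myString = "A,B,C";
var myTextArray = myString.split(",");
// myTextArray holds three elements: "A", "B", "C"

// A trailing separator produces an extra, empty final element.
var withTrailingComma = "A,B,C,".split(",");
// withTrailingComma holds four elements: "A", "B", "C", ""
```

Checking the length of the resulting array (or its last element) is a simple way to notice the empty trailing element before processing the pieces.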
• Let's create a short example using the split() method, in which we reverse the lines written in a <textarea> element.
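What follows is not the chapter's own listing but a minimal sketch of the core idea, kept free of any page or form code. The function name reverseLines is hypothetical, and it assumes lines are separated by "\n"; in a page, the text would come from the <textarea> element's value.

```javascript
// Reverse the order of the lines in a block of text.
function reverseLines(text) {
  var lines = text.split("\n"); // one array element per line
  lines.reverse();              // reverse the array in place
  return lines.join("\n");      // glue the lines back together
}

var reversed = reverseLines("Line 1\nLine 2\nLine 3");
// reversed is "Line 3\nLine 2\nLine 1"
```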
The replace() Method
• The replace() method searches a string for occurrences of a substring. Where it finds a match for this substring, it replaces the substring with a third string that we specify.
• Let's look at an example. Say we have a string with the word "May" in it as shown in the following:
• and we want to replace "May" with "June." We could use the replace() method like so:
• The value of myString will not be changed. Instead, the replace() method returns the value of myString but with "May" replaced with "June." We assign this returned string to the variable myCleanedUpString, which will contain the text
• "The event will be in June, the 21st of June"
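A short sketch of this example follows. The starting string is an assumption, inferred from the result quoted above; the variable names are the ones used in the text.

```javascript
// Assumed input string, chosen so the result matches the text quoted above.
var myString = "The event will be in May, the 21st of June";

// replace() returns a new string; myString itself is left untouched.
var myCleanedUpString = myString.replace("May", "June");
// myCleanedUpString is "The event will be in June, the 21st of June"
```

Note that when the first argument is plain text, replace() changes only the first occurrence it finds.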
The search() Method
• The search() method allows you to search a string for a particular piece of text. If the text is found, the character position at which it was found is returned; otherwise -1 is returned. The method takes only one parameter, namely the text you want to search for.
• When used with plain text, the search() method provides no real benefit over methods like indexOf(), which we've already seen. However, we'll see later that it's when we use regular expressions that the power of this method becomes apparent.
• In the following example, we want to find out if the word Java is contained within the string called myString.
• The alert box that appears will show the value 10, which is the character position of the J in the first occurrence of Java, as part of the word JavaScript.
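As a sketch, assume a string in which the first Java starts at character position 10 (positions count from 0). The string below is an assumption chosen to satisfy that; it is not taken from the original listing, and the names position and missing are illustrative.

```javascript
// Assumed input: "Beginning " is 10 characters long, so the J sits at index 10.
var myString = "Beginning JavaScript, Beginning Java";

var position = myString.search("Java"); // 10
var missing = myString.search("Perl");  // -1, because the text is not found

// With plain text, indexOf() gives the same answer.
var sameAsIndexOf = myString.indexOf("Java") === position; // true
```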
The match() Method
• The match() method is very similar to the search() method, except that instead of returning the position where a match was found, it returns an array. Each element of the array contains the text of one match that was found.
• Although you can use plain text with the match() method, it would be completely pointless to do so. For example, take a look at the following:
• This code results in myMatchArray holding an element containing the value 2000. Given that we already know our search string is 2000, you can see it's been a pretty pointless exercise.
• However, the match() method makes a lot more sense when we use it with regular expressions. Then we might search for all years in the 21st century, that is, those beginning with 2. In this case, our array would contain the values 2000, 2000, 2001, and 2002, which is much more useful information!
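The contrast can be sketched like this. The input string is an assumption, chosen so that the regular expression version yields the four values listed above; the pattern /2\d{3}/g uses syntax that is explained later in the chapter, and yearsFrom2000 is an illustrative name.

```javascript
// Assumed input string containing the years mentioned in the text.
var myString = "1997, 1998, 1999, 2000, 2000, 2001, 2002";

// Plain text: we only get back the text we already knew.
var myMatchArray = myString.match("2000");
// myMatchArray[0] is "2000"

// Regular expression: a 2 followed by three digits, matched globally.
var yearsFrom2000 = myString.match(/2\d{3}/g);
// yearsFrom2000 holds "2000", "2000", "2001", "2002"
```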
Regular Expressions
• Before we look at the split(), match(), search(), and replace() methods of the String object again, we need to look at regular expressions and the RegExp object. Regular expressions provide a means of defining a pattern of characters, which we can then use to split, search, or replace characters in a string where they fit the defined pattern.
• JavaScript's regular expression syntax borrows heavily from the regular expression syntax of Perl, another scripting language. The latest versions of languages such as VBScript have also incorporated regular expressions, as have many application programs, such as Microsoft Word, in which the Find facility allows regular expressions to be used. You'll find your regular expression knowledge will prove useful even outside JavaScript.
• Regular expressions are used in JavaScript through the RegExp object, which is a native JavaScript object, as are String, Array, and so on. There are two ways of creating a new RegExp object. The easier is with a regular expression literal, such as the following:
• The forward slashes (/) mark the start and end of the regular expression. This is a special syntax that tells JavaScript that the code is a regular expression, much as quote marks define a string's start and end. Don't worry about the actual expression's syntax yet (the \b'|'\b); we'll be explaining that in detail shortly.
• Alternatively, we could use the RegExp object's constructor function RegExp() and type the following:
• Either way of specifying a regular expression is fine, though the former method is a shorter, more efficient one for JavaScript to use, and therefore generally preferred. For much of the remainder of the chapter, we'll use the first method. The main reason for using the second method is that it allows the regular expression to be determined at runtime (as the code is executing and not when writing the code), for example, if we want to base it on user input.
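The two forms can be sketched side by side. The pattern \b'|'\b is the one mentioned in the text; the variable names and the userInput value are assumptions made for illustration.

```javascript
// Regular expression literal: the slashes delimit the pattern.
var literalRegExp = /\b'|'\b/;

// Constructor form: the pattern is a string, so each backslash must be doubled.
var constructedRegExp = new RegExp("\\b'|'\\b");

// Both describe the same pattern.
var samePattern = literalRegExp.source === constructedRegExp.source; // true

// The constructor lets the pattern be decided at runtime.
var userInput = "May"; // hypothetical value, e.g. read from a form field
var runtimeRegExp = new RegExp(userInput, "g");
var replaced = "May, May".replace(runtimeRegExp, "June"); // "June, June"
```

If the runtime text can contain characters that are special in regular expressions, it would need escaping first; that step is left out of this sketch.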
• Once we get familiar with regular expressions, we will come back to the second way of defining them using the RegExp() constructor.
• As you can see, the syntax of regular expressions is slightly different when using the second method, and we'll explain this in detail then.
• While we'll be concentrating on the use of the RegExp object as a parameter for the String object's split(), replace(), match(), and search() methods, the RegExp object does have its own methods and properties. For example, the test() method allows you to test whether the string passed to it as a parameter contains a pattern matching that defined in the RegExp object. We'll see the test() method in use in an example shortly.
Simple Regular Expressions
• Defining patterns of characters using regular expression syntax can get fairly complex. In this section we'll explore just the basics of regular expression patterns. The best way to do this is through examples.
• Let's start by looking at an example where we want to do a simple text replacement using the replace() method and a regular expression. Imagine we have the following string:
• and we want to replace any occurrence of the name "Paul" with "Ringo." Well, the pattern of text we need to look for is simply Paul. Representing this as a regular expression, we just have /Paul/.
• As we saw earlier, the forward slash characters mark the start and end of the regular expression. Now let's use this with the replace() method.
• You can see the replace() method takes two parameters: the RegExp object that defines the pattern to be searched and replaced, and the replacement text.
• If we put this all together in an example, we have the following:
• If you load this code into a browser, you will see the screen shown in the accompanying figure.
• We can see that this has replaced the first occurrence of Paul in our string. But what if we wanted all the occurrences of Paul in the string to be replaced? The two at the far end of the string are still there, so what happened?
• Well, by default the RegExp object only looks for the first matching pattern, in this case the first Paul, and then stops. This is common and important behavior for RegExp objects. Regular expressions tend to start at one end of a string and look through the characters until the first complete match is found, then stop.
• What we want is a global match, which is a search for all possible matches to be made and replaced. To help us out, the RegExp object has three attributes we can define. You can see these listed in the following table.
• If we change our RegExp object in the code to /Paul/gi, a global case-insensitive match will be made. Running the code now produces the result shown in Figure 8-4.
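A sketch of the difference the attributes make follows. The input string is an assumption, chosen to contain Paul on its own, inside other words, and in lowercase; the result variable names are illustrative.

```javascript
// Assumed input string for illustration.
var myString = "Paul, Paula, Pauline, paul, Paul";

// No attributes: only the first match is replaced.
var firstOnly = myString.replace(/Paul/, "Ringo");
// firstOnly is "Ringo, Paula, Pauline, paul, Paul"

// g = global (all matches), i = case-insensitive.
var globalInsensitive = myString.replace(/Paul/gi, "Ringo");
// globalInsensitive is "Ringo, Ringoa, Ringoine, Ringo, Ringo"
```

The second result shows the side effect discussed next: Paul is also replaced when it is part of a longer word.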
• The RegExp object has done its job correctly. We asked for all patterns of the characters Paul to be replaced and that's what we got. What we actually meant was for all occurrences of Paul, when it's a single word and not part of another word, such as Paula, to be replaced. The key to making regular expressions work is to define the pattern of characters exactly, so that only that pattern can match and no other. So let's do that.
1. We want paul or Paul to be replaced.
2. We don't want it replaced when it's actually part of another word, as in Pauline.
• How do we specify this second condition? How do we know when the word is joined to other characters, rather than just joined to spaces or punctuation or just the start or end of the string?
• To see how we can achieve this with regular expressions, we need to enlist the help of regular expression special characters. We'll look at these in the next section, by the end of which we should be able to solve the problem.
Regular Expressions: Special Characters
Text, Numbers, and Punctuation
• The first group of special characters we'll look at contains the character class special characters. By character class, I mean digits, letters, and white space characters. The special characters are displayed in the following table.
• Note that uppercase and lowercase characters mean very different things, so you need to be extra careful with case when using regular expressions.
• Let's look at an example. To match a telephone number in the format 1-800-888-5474, the regular expression would be as follows:
• \d-\d\d\d-\d\d\d-\d\d\d\d
• You can see that there's a lot of repetition of characters here, which makes the expression quite unwieldy. To make this simpler, regular expressions have a way of defining repetition. We'll see this a little later in the chapter, but first let's look at another example.
• We'll use what we've learned so far about regular expressions in a full example in which we check that a passphrase contains only letters and numbers (that is, alphanumeric characters) and not punctuation or symbols like @, %, and so on.
How It Works
• Let's start by looking at the regExpIs_valid() function defined at the top of the script block in the head of the page. That does the validity checking of our passphrase using regular expressions.
• The function takes just one parameter: the text we want to check for validity. We then declare a variable myRegExp and set it to a new regular expression, which implicitly creates a new RegExp object.
• The regular expression itself is fairly simple, but first let's think about what pattern we are looking for. What we want to find out is whether our passphrase string contains any characters that are not letters in the ranges A-Z and a-z, digits in the range 0-9, or a space. Let's see how this translates into a regular expression.
• First we used square brackets with the ^ symbol.
• [^]
• This means we want to match any character that is not one of the characters specified inside the square brackets. Next we added our a-z, which specifies any character in the range a through to z.
• [^a-z]
• So far our regular expression matches any character that is not between a and z. Note that, because we added the i to the end of the expression definition, we've made the pattern case-insensitive. So our regular expression actually matches any character not between A and Z or a and z.
• Next we added \d to indicate any digit character, or any character between 0 and 9.
• [^a-z\d]
• So our expression matches any character that is not between a and z, A and Z, or 0 and 9. Finally, we decided that a space is valid, so we add that inside the square brackets:
• [^a-z\d ]
• Putting this all together, we have a regular expression that will match any character that is not a letter, a digit, or a space.
• On the second and final line of the function we use the RegExp object's test() method to return a value.
• return !(myRegExp.test(text));
• The test() method of the RegExp object checks the string passed as its parameter to see if the characters specified by the regular expression syntax match anything inside the string. If they do, true is returned; if not, false is returned. Our regular expression will match the first invalid character found, so if we get a result of true, we have an invalid passphrase. However, it's a bit illogical for an is-valid function to return true when it's invalid, so we reverse the result returned by adding the NOT operator (!).
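Assembling the pieces walked through above gives a function along these lines. This is a sketch reconstructed from the description (the function name, the myRegExp variable, the pattern, and the return line all appear in the text); the sample calls and their variable names are added for illustration.

```javascript
// Returns true when text contains only letters, digits, and spaces.
function regExpIs_valid(text) {
  // Matches any single character that is NOT a letter, a digit, or a space.
  // The i attribute makes a-z cover A-Z as well.
  var myRegExp = /[^a-z\d ]/i;
  // test() is true when an invalid character is found, so invert it.
  return !(myRegExp.test(text));
}

var validPhrase = regExpIs_valid("Open sesame 42"); // true
var invalidPhrase = regExpIs_valid("Open@sesame");  // false
```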
• The other function defined in the head of the page is butCheckValid_onclick(). As the name suggests, this is called when the butCheckValid button defined in the body of the page is clicked.
• This function calls our regExpIs_valid() function in an if statement to check whether the passphrase entered by the user in the txtPhrase text box is valid. If it is, an alert box is used to inform the user.
• If it isn't, another alert box is used to let the user know that the text was invalid.
Repetition Characters
• Regular expressions include something called repetition characters, which are a way of specifying how many of the last item or character we want to match. This proves very useful, for example, if we want to specify a phone number that repeats a character a specific number of times. The following table lists some of the most common repetition characters and what they do.
• We saw earlier that to match a telephone number in the format 1-800-888-5474, the regular expression would be \d-\d\d\d-\d\d\d-\d\d\d\d. Let's see how this would be simplified using the repetition characters.
• The pattern we're looking for starts with one digit followed by a dash, so we need \d-.
• Next are three digits followed by a dash. This time we can use the repetition special characters: \d{3} will match exactly three \d, which is the any-digit character.
• Next there are three digits followed by a dash again, so now our regular expression is \d-\d{3}-\d{3}-.
• Finally, the last part of the expression is four digits, which is \d{4}.
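The long and short forms of the telephone pattern can be compared with a quick sketch. The variable names are illustrative; the phone number is the one used in the text.

```javascript
// The same pattern written without and with repetition characters.
var longForm = /\d-\d\d\d-\d\d\d-\d\d\d\d/;
var shortForm = /\d-\d{3}-\d{3}-\d{4}/;

var phoneNumber = "1-800-888-5474";

var longMatches = longForm.test(phoneNumber);       // true
var shortMatches = shortForm.test(phoneNumber);     // true
var tooShort = shortForm.test("1-800-888-547");     // false: only three final digits
```

Neither pattern is anchored, so both would also find a phone number embedded in a longer string.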
• We'd declare this regular expression as /\d-\d{3}-\d{3}-\d{4}/.
• Remember that the first / and last / tell JavaScript that what is in between those characters is a regular expression. JavaScript creates a RegExp object based on this regular expression.
• As another example, what if we have the string "Paul Paula Pauline," and we want to replace Paul and Paula with George? To do this, we would need a regular expression that matches both Paul and Paula.
• Let's break this down. We know we want the characters Paul, so our regular expression starts as Paul.
• Now we also want to match Paula, but if we make our expression Paula, this will exclude a match on Paul. This is where the special character ? comes in. It allows us to specify that the previous character is optional: it must appear zero (not at all) or one times. So the solution is Paula?, which we'd declare as /Paula?/.
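A minimal sketch of the optional-character pattern in use, with the g attribute added (an assumption) so that every match is replaced; the variable names are illustrative.

```javascript
var myString = "Paul Paula Pauline";

// a? makes the final a optional, so both Paul and Paula match.
var result = myString.replace(/Paula?/g, "George");
// result is "George George Georgeine"
```

The Paul at the start of Pauline matches too, which is the kind of side effect that the position characters covered next are designed to prevent.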
Position Characters
• The third group of special characters we'll look at are those that allow you to specify either where the match should start or end or what will be on either side of the character pattern. For example, we might want our pattern to exist at the start or end of a string or line, or we might want it to be between two words. The following table lists some of the most common position characters and what they do.
• For example, if we wanted to make sure our pattern was at the start of a line, we would type ^myPattern.
• This would match an occurrence of myPattern if it was at the beginning of a line.
• To match the same pattern, but at the end of a line, we would type myPattern$.
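The start and end position characters can be sketched as follows. The test strings and variable names are made up for illustration. One detail worth knowing: by default ^ and $ refer to the start and end of the whole string, and the m (multiline) attribute makes them apply to each line.

```javascript
var atStart = /^myPattern/;
var atEnd = /myPattern$/;

var startHit = atStart.test("myPattern comes first");  // true
var startMiss = atStart.test("here is myPattern");     // false
var endHit = atEnd.test("here is myPattern");          // true

// With two lines of text, only the multiline version sees the second line's start.
var twoLines = "first line\nmyPattern on line two";
var withoutM = /^myPattern/.test(twoLines);  // false
var withM = /^myPattern/m.test(twoLines);    // true
```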
• The word boundary special characters \b and \B can cause confusion, because they do not match characters but the positions between characters.
• Imagine we had the string "Hello world!, let's look at boundaries said 007." defined in the code as follows:
• To make the word boundaries (that is, the boundaries between the words) of this string stand out, let's convert them to the | character.
• We've replaced all the word boundaries, \b, with a |, and our message box looks like the one in the accompanying figure.
• You can see that the position between any word character (letters, numbers, or the underscore character) and any non-word character is a word boundary. You'll also notice that the boundary between the start or end of the string and a word character is considered to be a word boundary. The end of this string is a full stop, so the boundary between that and the end of the string is a non-word boundary, and therefore no | has been inserted.
• If we change the regular expression in the example so that it replaces non-word boundaries (\B) instead, we get the result shown in the accompanying figure.
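Both replacements can be tried with a short sketch. The string is the one quoted in the text; the use of | as the marker and the g attribute are assumptions about the missing listing, and the variable names are illustrative.

```javascript
var myString = "Hello world!, let's look at boundaries said 007.";

// Mark every word boundary.
var wordBoundaries = myString.replace(/\b/g, "|");
// wordBoundaries is
// "|Hello| |world|!, |let|'|s| |look| |at| |boundaries| |said| |007|."

// Mark every non-word boundary, shown on a shorter string for readability.
var nonWordBoundaries = "ab!,".replace(/\B/g, "|");
// nonWordBoundaries is "a|b!|,|"
```

In the shorter example the marker appears between a and b (two word characters), between ! and the comma (two non-word characters), and between the comma and the end of the string.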
• Now the position between a letter, number, or underscore and another letter, number, or underscore is considered a non-word boundary and is replaced by a | in our example. However, what is slightly confusing is that the boundary between two non-word characters, such as an exclamation mark and a comma, is also considered a non-word boundary. If you think about it, it actually does make sense, but it's easy to forget when creating regular expressions.
• You'll remember from when we started looking at regular expressions that I used the pattern /Paul/gi
to convert all instances of Paul or paul into Ringo. However, we found that this code actually converts all instances of Paul to Ringo, even when inside another word.
• One option to solve this problem would be to replace the string Paul only where it is followed by a non-word character. The special character for non-word characters is \W, so we need to alter our regular expression to /Paul\W/gi.
• This gives the result shown in the accompanying figure.
• At last we've got it right, and this example is finished.
Covering All Eventualities
• Perhaps the trickiest thing about a regular expression is making sure it covers all eventualities. In the previous example our regular expression works with the string as defined, but does it work with the following?
• Here the Paul substring in JeanPaul will be changed to Ringo. We really only want to convert the substring Paul where it is on its own, with a word boundary on either side. If we change our regular expression to /\bPaul\b/gi, we have our final answer and can be sure only Paul or paul will ever be matched.
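The two approaches can be compared with a sketch. The input string is an assumption, chosen to include JeanPaul followed by punctuation and a lowercase paul at the very end; the variable names are illustrative.

```javascript
// Assumed input string for illustration.
var myString = "Paul, Paula, JeanPaul, paul";

// Paul followed by a non-word character.
var followedByNonWord = myString.replace(/Paul\W/gi, "Ringo");
// followedByNonWord is "Ringo Paula, JeanRingo paul"

// Paul with a word boundary on either side.
var wholeWordOnly = myString.replace(/\bPaul\b/gi, "Ringo");
// wholeWordOnly is "Ringo, Paula, JeanPaul, Ringo"
```

The first result shows more than the JeanPaul problem: \W matches a real character, so the comma after each match is consumed by the replacement, and the final paul is skipped because nothing follows it. \b matches a position rather than a character, which avoids both effects.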
Grouping Regular Expressions
• Our final topic under regular expressions, before we look at examples using the match(), replace(), and search() methods, is how we can group expressions. In fact it's quite easy. If we want a number of expressions to be treated as a single group, we just enclose them in parentheses, for example /(\d\d)/. Parentheses in regular expressions are special characters that group together character patterns and are not themselves part of the characters to be matched.
• The question is, why would we want to do this? Well, by grouping characters into patterns, we can use the special repetition characters to apply to the whole group of characters, rather than just one.
• Let's take the string defined in myString below as an example.
• How could we match both JavaScript and VBScript using the same regular expression? The only thing they have in common is that they are whole words and they both end in Script. Well, an easy way would be to use parentheses to group the patterns Java and VB. Then we can use the ? special character to apply to each of these groups of characters to make our pattern any word having zero or one instances of the characters Java or VB, and ending in Script: \b(VB)?(Java)?Script\b.
• If we break this expression down, we can see the pattern it requires is as follows:
• A word boundary: \b
• Zero or one instances of VB: (VB)?
• Zero or one instances of Java: (Java)?
• The characters Script: Script
• A word boundary: \b
• Let's think about this problem. We want the pattern to match VBScript or JavaScript. Clearly they have the Script part in common. So what we want is a new word starting with Java or starting with VB, and either way it must end in Script.
• First, we know that the word must start with a word boundary: \b.
• Next we know that we want either VB or Java to be at the start of the word. In regular expressions, the | character provides the "or" we need, so in regular expression syntax we want (VB|Java).
• This would match the pattern VB or Java. Now we can just add the Script part.
• So our final pattern is \b(VB|Java)Script\b.
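The two grouping approaches discussed above behave differently on edge cases, which a short sketch makes visible. The input string and variable names are assumptions for illustration.

```javascript
// Assumed input string for illustration.
var myString = "JavaScript, VBScript and Perl";

// Alternation: the word must start with VB or Java.
var found = myString.match(/\b(VB|Java)Script\b/g);
// found holds "JavaScript", "VBScript"

// Optional groups: both prefixes may be missing, or both present.
var optionalGroups = /\b(VB)?(Java)?Script\b/;
var alternation = /\b(VB|Java)Script\b/;

var optionalMatchesBare = optionalGroups.test("Script");   // true
var alternationMatchesBare = alternation.test("Script");   // false
var optionalMatchesBoth = optionalGroups.test("VBJavaScript"); // true
```

The alternation form is the stricter of the two: it requires exactly one of the prefixes.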
Reusing Groups of Characters
• We can reuse the pattern specified by a group of characters later on in our regular expression. To refer to a previous group of characters, we just type \ and the order of the group. For example, the first group can be referred to as \1, the second as \2, and so on.
• Let's look at an example. Say we have a list of numbers in a string, with each number separated by a comma. For whatever reason, we are not allowed to have the same number repeated directly after itself, so while
• 009,007,001,002,004,003
• would be OK, the following:
• 007,007,001,002,002,003
• would not be valid, because we have 007 and 002 repeated after themselves.
• How can we find instances of repeated numbers and replace them with the word ERROR? We need to use the ability to refer to groups in regular expressions.
• First let's define our string as follows:
• Now we know we need to search for a series of one or more number characters. In regular expressions the \d specifies any digit character, and + means one or more of the previous character. So far, that gives our regular expression as \d+.
• We want to match a series of digits followed by a comma, so we just add the comma: \d+,
• This will match any series of digits followed by a comma, but how do we search for any series of digits followed by a comma, then followed again by the same series of digits? As the digits could be any digits, we can't add them directly into our expression, for example as \d+,007, because this will not work with the 002 repeat. What we need to do is put the first series of digits in a group; then we can specify that we want to match that group of digits again. This can be done using \1, which says, "match the characters found in the first group defined using parentheses." Put all this together, and we have (\d+),\1.
• This defines a group whose pattern of characters is one or more digit characters. This group must be followed by a comma and then by the same characters as were found in the first group.
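The pattern built up above can be tried on the two number lists from the text. This is a sketch: the g attribute and the variable names are assumptions about the missing listing.

```javascript
// A group of digits, a comma, then the same digits again.
var myRegExp = /(\d+),\1/g;

var invalidList = "007,007,001,002,002,003";
var flagged = invalidList.replace(myRegExp, "ERROR");
// flagged is "ERROR,001,ERROR,003"

var validList = "009,007,001,002,004,003";
var unchanged = validList.replace(myRegExp, "ERROR");
// unchanged is identical to validList, since nothing repeats
```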
• Put this into some JavaScript, and we have the following:
• The alert box will show
• That completes our brief look at regular expression syntax. Because regular expressions can get a little complex, it's often a good idea to start simple and build them up slowly, as we have done. In fact, most regular expressions are just too hard to get right in one step, at least for us mere mortals without a brain the size of a planet.
• If it's still looking a bit strange and confusing, don't panic. In the next sections, we'll be looking at the String object's split(), replace(), search(), and match() methods with plenty more examples of regular expression syntax.
The String Object: split(), replace(), search(), and match() Methods
• The main functions making use of regular expressions are the String object's split(), replace(), search(), and match() methods. We've already seen their syntax, so we'll concentrate on their use with regular expressions and at the same time learn more about regular expression syntax and usage.
The split() Method
• We've seen that the split() method allows us to split a string into various pieces, with the split being made at the character or characters specified as a parameter. The result of this method is an array with each element containing one of the split pieces. For example, the following string
• could be split into an array where each element contains a different fruit using
• How about if our string was instead
• This could, for example, contain both the names and prices of the fruit. How could we split the string, but just retrieve the names of the fruit and not the prices? We could do it without regular expressions, but it would take a number of lines of code. With regular expressions we can use the same code, and just amend the split() method's parameter.
• Let's create an example that solves the problem just described: it must split our string, but only include the fruit names, not the prices.
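One way to approach this, offered as a sketch rather than the chapter's own solution, is to treat every run of non-letter characters as the separator. Both input strings and all variable names below are assumptions for illustration.

```javascript
// Plain-text separator: names only.
var fruitOnly = "apple, banana, peach, orange".split(", ");
// fruitOnly holds "apple", "banana", "peach", "orange"

// Names mixed with prices: split on any run of characters that are not letters.
var myListString = "apple, 0.99, banana, 0.50, peach, 0.25, orange, 0.75";
var fruitNames = myListString.split(/[^a-z]+/i);
// fruitNames holds "apple", "banana", "peach", "orange", ""
```

Because the string ends with a price, the final separator sits at the end of the string and split() adds an empty last element, the same behavior described earlier for a trailing comma. That element can be dropped before the array is used.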