CMPT 125 Regular Expressions
Objectives
• Read and write simple regular expressions
John Edgar

What are Regular Expressions?
• Regular expressions are a string processing language
• A regular expression is a string that matches a set of strings
• They provide a concise and powerful method for processing strings
• Java supports regular expressions
• So do Perl and Python

Validating Email Addresses
• Assume that a valid email address has this format
  • username@place.extension
• The username, place and extension must each be a sequence of 1 or more alphanumeric characters (i.e. upper or lower case letters or digits)
• The username and place must be separated by an '@' symbol
• The place and extension must be separated by a '.'
• Let's look at a Java method to check email addresses

Java Method to Check Emails
    public static boolean isEmailAddress1(String s) {
        int atLoc = s.indexOf("@");
        int dotLoc = s.indexOf(".");
        if (atLoc == -1 || dotLoc == -1) {
            return false; // both "@" and "." are required
        } else if (atLoc > dotLoc) {
            return false; // "@" must come before "."
        } else if (atLoc != s.lastIndexOf("@") || dotLoc != s.lastIndexOf(".")) {
            return false; // there must be exactly one "." and one "@"
        } else {
            String username = s.substring(0, atLoc);
            String place = s.substring(atLoc + 1, dotLoc);
            String ext = s.substring(dotLoc + 1);
            return username.length() > 0 && isAlphaNumeric(username)
                && place.length() > 0 && isAlphaNumeric(place)
                && ext.length() > 0 && isAlphaNumeric(ext);
        }
    }

    // Returns true if every character of s is a letter or a digit
    public static boolean isAlphaNumeric(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isLetterOrDigit(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

Regular Expression to Check Emails
    public static boolean isEmailAddress2(String s) {
        return s.matches("\\w+@\\w+\\.\\w+");
    }

Checking Emails: Explanation
• "\\w" matches any word character: a letter, a digit or the "_" character
• "+" means 1 or more
• "@" means match the "@" character
• "\\." means match the period character
• Because "\\w" also accepts "_", isEmailAddress2 is slightly more permissive than the format stated earlier
• matches(String regex) is a Java String method that takes a regular expression as an argument, and returns true only if the entire calling string matches the regular expression

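As a quick check of the explanation above, here is a minimal sketch that runs the same expression against a few inputs. The class name EmailRegexDemo and the sample addresses are illustrative assumptions, not part of the slides.

```java
class EmailRegexDemo {
    // Same expression as isEmailAddress2.
    static boolean isEmailAddress(String s) {
        return s.matches("\\w+@\\w+\\.\\w+");
    }

    public static void main(String[] args) {
        System.out.println(isEmailAddress("bob@sfu.ca"));       // true
        System.out.println(isEmailAddress("bob_smith@sfu.ca")); // true: \w includes "_"
        System.out.println(isEmailAddress("bob@sfu"));          // false: no ".extension"
        System.out.println(isEmailAddress("mail bob@sfu.ca"));  // false: the whole string must match
    }
}
```

The last case is the one to remember: matches never succeeds on just a part of the string.
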
Regular Expression Syntax
• The regular expression language is very concise and cryptic
• What follows is a description of the basic regular expression syntax
• The language has a number of advanced features that we will not cover
• Different versions (known as flavours) of regular expressions (such as Perl and Python) differ from the Java version

String Matching
    if (s.matches("dog")) {
        System.out.println("it's a match!");
    } else {
        System.out.println("no match");
    }
• Only matches if s is "dog"

Matching Special Characters
    if (s.matches("Hello\\?")) {
        System.out.println("it's a match!");
    } else {
        System.out.println("no match");
    }
• This regular expression matches the string "Hello?"
• Note that ? is a metacharacter, so it requires the escape character
• To match any of [ \ ^ $ . | ? * + ( ) literally, precede it with two backslash characters (\\) in the Java string

The Escape Character: Why \\?
• To match characters that have special meaning in regular expressions (that is, metacharacters) they have to be preceded by the regular expression escape character ('\')
• Unfortunately, to have the '\' character appear in a Java string it must itself be escaped by the Java escape character (also '\')
• This is why two backslash characters have to be written
• Other languages (e.g. Perl) have found ways to avoid doing this

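The effect of the double backslash can be seen by comparing an escaped and an unescaped period. This is a small sketch; the class and method names are hypothetical. Pattern.quote is a standard java.util.regex method, not mentioned in the slides, that escapes a whole literal string at once.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Treats every character of literal as an ordinary character.
    static boolean matchesLiteral(String s, String literal) {
        return s.matches(Pattern.quote(literal));
    }

    public static void main(String[] args) {
        // Unescaped "." is a metacharacter: it matches any single character.
        System.out.println("a.b".matches("a.b"));   // true
        System.out.println("axb".matches("a.b"));   // true
        // "a\\.b" in Java source is the regex a\.b, which matches only a period.
        System.out.println("a.b".matches("a\\.b")); // true
        System.out.println("axb".matches("a\\.b")); // false
        // Quoting avoids escaping each metacharacter by hand.
        System.out.println(matchesLiteral("Hello?", "Hello?")); // true
    }
}
```
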
Matching Multiple Strings
    if (s.matches("xbox|ps2|gamecube|spectrum")) {
        System.out.println("it's a match!");
    } else {
        System.out.println("no match");
    }
• Matches any of the given strings separated by the | character
• "x|y" means match x or y
• The | character can be very useful to search for common misspellings

More Matching of Multiple Strings
    if (s.matches("Call of (Cthullu|Cthulu)")) {
        System.out.println("it's a match!");
    } else {
        System.out.println("no match");
    }
• The parentheses limit the scope of the | operator to the two spellings; without them the alternatives would be "Call of Cthullu" and "Cthulu"

Character Classes
    if (s.matches("[aeiou]")) {
        System.out.println("it's a vowel!");
    } else {
        System.out.println("not a vowel");
    }
• A character class is a set of characters
• The characters are enclosed in square brackets, […]
• A character class matches exactly one character from the set
• The above example would match "o" but not "oo" or "ae"

More About Character Classes
    if (s.matches("sep[ae]r[ae]te")) {
        System.out.println("it's a match!");
    } else {
        System.out.println("no match");
    }
• This example matches the word separate and some common misspellings of it
• Matches separate, separete, seperate, and seperete

Negating Character Classes
    if (s.matches("[^aeiou]")) {
        System.out.println("not a vowel!");
    } else {
        System.out.println("it's a vowel");
    }
• Preceding the characters with a ^ matches any single character that is not in the class; it does not match an empty string
• Note that it matches anything except a lowercase vowel, so far more than just the consonants (for example "A", "7" or "?")

Multiple Character Classes
    if (s.matches("7[0-9][0-9]")) {
        System.out.println("it's 7xx!");
    } else {
        System.out.println("it's not 7xx");
    }
• The - indicates a range of digits or characters
• The example matches all three-digit numbers beginning with 7
• The class [a-zA-Z0-9] matches any alphanumeric character

Postal Codes
    String rx;
    rx = "[a-zA-Z][0-9][a-zA-Z] [0-9][a-zA-Z][0-9]";
    if (s.matches(rx)) {
        System.out.println("it's a postal code!");
    } else {
        System.out.println("hey, where's that?");
    }
• Notice that there must be a single space in the postal code
• This is very restrictive, and it can be improved on

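One possible improvement, sketched here as an assumption rather than the author's intended answer, is to make the space optional with the ? quantifier and to write the digits as \\d. The class and method names are hypothetical.

```java
class PostalCodeDemo {
    // Assumed relaxation: the space between the two halves may be omitted.
    static boolean isPostalCode(String s) {
        return s.matches("[a-zA-Z]\\d[a-zA-Z] ?\\d[a-zA-Z]\\d");
    }

    public static void main(String[] args) {
        System.out.println(isPostalCode("V5A 1S6"));  // true
        System.out.println(isPostalCode("v5a1s6"));   // true: space omitted
        System.out.println(isPostalCode("V5A  1S6")); // false: two spaces
        System.out.println(isPostalCode("V5A-1S6"));  // false: hyphen not allowed
    }
}
```
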
Matching a Time
• Write a regular expression to match a time like 11:22am or 7:00pm
• What makes a valid time?
  • The number before the ":" must be from 1 to 9, or 10, 11 or 12
    • [1-9]|10|11|12
  • The first digit after the ":" must be in the range 0 to 5 and the second digit can be anything
    • [0-5][0-9]
  • The time must end with am or pm
    • (am|pm) or [ap]m
• Are there other possibilities?
  • Should we allow a.m., AM or A.M.?
  • Should we allow a leading 0 for the hour?
  • And so on …

Matching Times
    public static boolean isTime(String s) {
        String rx;
        rx = "([1-9]|10|11|12):[0-5][0-9](am|pm)";
        return s.matches(rx);
    }
• Note that the ()s are very important: | has the lowest precedence, so without them the alternatives would be [1-9], 10, 11, 12:[0-5][0-9]am and pm
• What about times for the 24 hour clock?

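For the 24 hour clock question, one possible answer is sketched below. It assumes hours 0 to 23 with an optional leading zero and no am/pm suffix; those choices, and the names used, are assumptions and not taken from the slides.

```java
class Time24Demo {
    // Hours: 0-19 with an optional leading 0 or 1, or 20-23. Minutes: 00-59.
    static boolean isTime24(String s) {
        return s.matches("([01]?\\d|2[0-3]):[0-5]\\d");
    }

    public static void main(String[] args) {
        System.out.println(isTime24("0:00"));   // true
        System.out.println(isTime24("09:05"));  // true
        System.out.println(isTime24("23:59"));  // true
        System.out.println(isTime24("24:00"));  // false: hour out of range
        System.out.println(isTime24("12:60"));  // false: minute out of range
        System.out.println(isTime24("7:00pm")); // false: no suffix allowed
    }
}
```

As with the 12 hour version, the parentheses around the hour alternatives are what keep the | from splitting the whole expression.
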
Testing Regular Expressions
• It is often convenient to test regular expressions with assertions
• Java ignores assert statements unless assertions are enabled, for example by running the program with java -ea
• Here is an example of testing the time regular expression
    private static void test4() {
        assert isTime("4:00am");
        assert isTime("12:59pm");
        assert isTime("3:24am");
        assert !isTime("4:60pm");
        assert !isTime("4:22");
        assert !isTime("13:21pm");
        assert !isTime("03:21pm");
        assert !isTime("0:20am");
        System.out.println("All tests passed.");
    }

Pre-defined Character Classes
• . matches any character (by default, any character except a line terminator)
• \\d matches any digit
  • equivalent to [0-9]
• \\D matches any character that is not a digit
  • equivalent to [^0-9]
• \\w matches a word character
  • equivalent to [a-zA-Z0-9_]
• \\W matches any non word character
  • equivalent to [^a-zA-Z0-9_]
• \\s matches any whitespace character
  • equivalent to [ \t\n\x0B\f\r]
• \\S matches any non whitespace character
  • equivalent to [^ \t\n\x0B\f\r]

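The pre-defined classes can be combined like any other character class. The following sketch uses two made-up formats (a seven digit phone number and a pair of words) purely for illustration; the names and formats are assumptions.

```java
class PredefinedClassDemo {
    // Three digits, a hyphen, then four digits.
    static boolean isPhoneNumber(String s) {
        return s.matches("\\d\\d\\d-\\d\\d\\d\\d");
    }

    // Two runs of word characters separated by exactly one whitespace character.
    static boolean isTwoWords(String s) {
        return s.matches("\\w+\\s\\w+");
    }

    public static void main(String[] args) {
        System.out.println(isPhoneNumber("555-1234"));  // true
        System.out.println(isPhoneNumber("555-12a4"));  // false: "a" is not a digit
        System.out.println(isTwoWords("hello world"));  // true
        System.out.println(isTwoWords("hello\tworld")); // true: a tab is whitespace
        System.out.println(isTwoWords("hello  world")); // false: two spaces
        System.out.println("ab".matches("."));          // false: "." is one character
    }
}
```
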
Quantifiers
• A quantifier specifies how many of a particular character (or character sequence) are matched
• ? makes a character optional, and matches 0 or 1 occurrences of the preceding character or group
• * matches 0 or more occurrences of the preceding character or group
• + matches 1 or more occurrences of the preceding character or group

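The three quantifiers above can be compared side by side. This is a minimal sketch with hypothetical method names and sample strings.

```java
class QuantifierDemo {
    // "u" may appear 0 or 1 times.
    static boolean optionalU(String s) { return s.matches("colou?r"); }

    // "b" may appear 0 or more times.
    static boolean zeroOrMoreB(String s) { return s.matches("ab*c"); }

    // "b" must appear 1 or more times.
    static boolean oneOrMoreB(String s) { return s.matches("ab+c"); }

    public static void main(String[] args) {
        System.out.println(optionalU("color"));    // true
        System.out.println(optionalU("colour"));   // true
        System.out.println(optionalU("colouur"));  // false
        System.out.println(zeroOrMoreB("ac"));     // true
        System.out.println(zeroOrMoreB("abbbc"));  // true
        System.out.println(oneOrMoreB("ac"));      // false
        System.out.println(oneOrMoreB("abc"));     // true
    }
}
```
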
Java Identifiers
• A Java identifier is a variable, method or class name
• The name must begin with a letter or an underscore (i.e. a word character that is not a digit) and
• is then followed by any number of word characters (letters, digits or underscores)
• This is a simplified rule: Java also allows the $ character and non-English letters, and reserved words are not valid identifiers
• Here are two regular expressions that match simple Java identifiers
  • "[a-zA-Z_]\\w*"
  • "[a-zA-Z_][a-zA-Z0-9_]*"

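The two identifier expressions should accept and reject the same strings. The sketch below checks that on a handful of sample names; the class name, constants and samples are illustrative assumptions.

```java
class IdentifierDemo {
    static final String RX1 = "[a-zA-Z_]\\w*";
    static final String RX2 = "[a-zA-Z_][a-zA-Z0-9_]*";

    static boolean isIdentifier(String s) {
        return s.matches(RX1);
    }

    public static void main(String[] args) {
        String[] samples = {"count", "_tmp", "x1", "1x", "my-var", ""};
        for (String s : samples) {
            // Print the sample followed by the result from each expression.
            System.out.println("\"" + s + "\" " + s.matches(RX1) + " " + s.matches(RX2));
        }
    }
}
```

The first three samples are accepted and the last three rejected by both expressions: "1x" starts with a digit, "my-var" contains a hyphen, and the empty string has no first character.
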
Laughter
• Create a regular expression that matches laughter
  • ha, or
  • haha, or
  • hahaha,
  • …

Matching Laughter
    if (s.matches("(ha)+")) {
        System.out.println("laughing!!");
    } else {
        System.out.println("not funny");
    }
• The ha has to be in parentheses or the quantifier will apply only to the preceding character, i.e. just the a
• Use the + quantifier as we require at least one ha for a string to match

Evil Genius Laughter
• Now let's change our expression to match evil genius laughter:
  • bwahhahhah, but
• one point to watch out for is whether or not the double h's are necessary, e.g.
  • bwahahaha

Modified Laughter Expression
    if (s.matches("(bwah?)?(hah?)+")) {
        System.out.println("laughing!!");
    } else {
        System.out.println("not funny");
    }
• Matches a variety of laughter expressions, e.g.
  • bwahhahhah
  • bwahhaha
  • ha
  • hahhah

Integers
    if (s.matches("0|-?[1-9]\\d*")) {
        System.out.println("an integer!");
    } else {
        System.out.println("no match");
    }
• The initial 0 matches just 0 but not 09, -0 etc.
• The - sign is made optional
• The first digit cannot be 0 (see the first point) and
• \\d* matches any sequence of digits, including none

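The points above can be checked with a few sample strings. This sketch wraps the same expression in a hypothetical isInteger method; the samples are illustrative.

```java
class IntegerDemo {
    // Same expression as on the slide.
    static boolean isInteger(String s) {
        return s.matches("0|-?[1-9]\\d*");
    }

    public static void main(String[] args) {
        System.out.println(isInteger("0"));    // true
        System.out.println(isInteger("42"));   // true
        System.out.println(isInteger("-7"));   // true
        System.out.println(isInteger("09"));   // false: leading zero
        System.out.println(isInteger("-0"));   // false
        System.out.println(isInteger("1.5"));  // false: not a whole number
        System.out.println(isInteger(""));     // false: at least one digit is required
    }
}
```

Note that the expression only checks the format; it says nothing about whether the value would fit in an int.
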
Using Regular Expressions
• Regular expressions have many uses besides matching entire strings
• There are other Java methods that accept regular expressions as arguments
  • String[] split(String regex) splits the calling string on matches to the given regular expression (discarding the matching substrings)
  • String replaceAll(String regex, String replace) returns a string with every substring that matches the regular expression replaced with the replacement string
  • String replaceFirst(String regex, String replace) is similar to replaceAll except that it replaces only the first matching substring

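The three methods listed above can be tried out with a short sketch like the following. The helper names and sample strings are hypothetical; only split, replaceAll and replaceFirst come from the slide.

```java
import java.util.Arrays;

class StringMethodsDemo {
    // Splits on runs of one or more whitespace characters.
    static String[] words(String s) {
        return s.split("\\s+");
    }

    // Replaces every digit with "#".
    static String maskDigits(String s) {
        return s.replaceAll("\\d", "#");
    }

    // Replaces only the first run of a's with a single "a".
    static String collapseFirstRun(String s) {
        return s.replaceFirst("a+", "a");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(words("the  quick\tbrown fox"))); // [the, quick, brown, fox]
        System.out.println(maskDigits("call 555-1234"));                     // call ###-####
        System.out.println(collapseFirstRun("baaad aaapple"));               // bad aaapple
    }
}
```
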
Advanced Regular Expressions
• The regular expression syntax has many advanced features that are not covered here
  • ^ (outside a character class) represents the start of the input by default, or the start of each line in multiline mode
  • $ represents the end of the input by default, or the end of each line in multiline mode
  • and others …
• The site http://www.regular-expressions.info/ is a good source of information on regular expressions

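Since String.matches always tests the whole string, the ^ and $ anchors matter mostly when searching within a string. The sketch below uses the standard java.util.regex Pattern and Matcher classes, which the slides do not cover; the method names and sample text are assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorDemo {
    // find() looks for a match anywhere, so ^ pins it to the start.
    static boolean startsWithDigit(String s) {
        return Pattern.compile("^\\d").matcher(s).find();
    }

    // $ pins the match to the end.
    static boolean endsWithDigit(String s) {
        return Pattern.compile("\\d$").matcher(s).find();
    }

    // With Pattern.MULTILINE, ^ also matches just after each line terminator.
    static int countLinesStartingWithHash(String text) {
        Matcher m = Pattern.compile("^#", Pattern.MULTILINE).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(startsWithDigit("7 dwarfs")); // true
        System.out.println(startsWithDigit("snow 7"));   // false
        System.out.println(endsWithDigit("snow 7"));     // true
        System.out.println(countLinesStartingWithHash("#a\nb\n#c")); // 2
    }
}
```
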