
CMPT 125



Presentation Transcript


  1. CMPT 125 Regular Expressions

  2. Objectives
     • Read and write simple regular expressions
     John Edgar

  3. What are Regular Expressions?
     • Regular expressions are a string processing language
     • A regular expression is a string that matches a set of strings
     • They provide a concise and powerful method for processing strings
     • Java supports regular expressions
       • So do Perl and Python

  4. Validating Email Addresses
     • Assume that a valid email address has this format
       • username@place.extension
     • The username, place and extension must each be a sequence of 1 or more alphanumeric characters (i.e. upper or lower case letters or digits)
     • The username and place should be separated by an '@' symbol
     • The place and extension should be separated by a '.'
     • Let's look at a Java method to check email addresses

  5. Java Method to Check Emails
         public static boolean isEmailAddress1(String s) {
             int atLoc = s.indexOf("@");
             int dotLoc = s.indexOf(".");
             if (atLoc == -1 || dotLoc == -1) {
                 return false;
             } else if (atLoc > dotLoc) {
                 return false; // "@" must come before "."
             } else if (atLoc != s.lastIndexOf("@") || dotLoc != s.lastIndexOf(".")) {
                 return false; // there must be exactly one "." and one "@"
             } else {
                 String username = s.substring(0, atLoc);
                 String place = s.substring(atLoc + 1, dotLoc);
                 String ext = s.substring(dotLoc + 1);
                 return username.length() > 0 && isAlphaNumeric(username)
                         && place.length() > 0 && isAlphaNumeric(place)
                         && ext.length() > 0 && isAlphaNumeric(ext);
             }
         }

         public static boolean isAlphaNumeric(String s) {
             for (int i = 0; i < s.length(); i++) {
                 if (!Character.isLetterOrDigit(s.charAt(i))) {
                     return false;
                 }
             }
             return true;
         }

  6. Regular Expression to Check Emails
         public static boolean isEmailAddress2(String s) {
             return s.matches("\\w+@\\w+\\.\\w+");
         }

  7. Checking Emails: Explanation
     • "\\w" means any alphanumeric character or the "_" character
     • "+" means 1 or more of the preceding item
     • "@" means match the "@" character
     • "\\." means match the period character
     • matches(String regex) is a Java String method that takes a regular expression as an argument, and returns true if the entire calling string matches the regular expression
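As a quick check of the explanation above, this sketch wraps the same pattern in a helper and tries a few inputs. The class name EmailRegexDemo and the sample addresses are made up for illustration. Because "\\w" also matches "_", this version accepts underscores, which a letters-and-digits-only check would reject.

```java
class EmailRegexDemo {
    // Same pattern as isEmailAddress2: word chars, '@', word chars, '.', word chars.
    static boolean isEmailAddress(String s) {
        return s.matches("\\w+@\\w+\\.\\w+");
    }

    public static void main(String[] args) {
        System.out.println(isEmailAddress("bob@sfu.ca"));        // true
        System.out.println(isEmailAddress("bob_smith@sfu.ca"));  // true: \w includes '_'
        System.out.println(isEmailAddress("bob@sfu"));           // false: no ".extension"
        System.out.println(isEmailAddress("bob@cs.sfu.ca"));     // false: the pattern allows only one '.'
    }
}
```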

  8. Regular Expression Syntax
     • The regular expression language is very concise and cryptic
     • What follows is a description of the basic regular expression syntax
     • The language has a number of advanced features that we will not cover
     • Different versions (known as flavours) of regular expressions, such as those in Perl and Python, differ in some details from the Java version

  9. String Matching
         if (s.matches("dog")) {
             System.out.println("it's a match!");
         } else {
             System.out.println("no match");
         }
     • Only matches if s is "dog"

  10. Matching Special Characters
          if (s.matches("Hello\\?")) {
              System.out.println("it's a match!");
          } else {
              System.out.println("no match");
          }
      • This regular expression matches the string "Hello?"
      • Note that ? is a metacharacter so requires the escape character
      • To match any of [\^$.|?*+() in a Java string literal requires preceding them with two backslash characters (\\)

  11. The Escape Character: Why \\?
      • To match characters that have special meaning in regular expressions (that is, metacharacters) they have to be preceded by the regular expression escape character ('\')
      • Unfortunately, to have the '\' character appear in a Java string requires that it be escaped by the Java escape character (also '\')
      • This results in the necessity for writing two backslash characters
      • Other languages (e.g. Perl) have found ways to avoid doing this
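The double escaping described above can be seen by printing the string that the regular expression engine actually receives. This is a minimal sketch; the class and helper names are hypothetical. Pattern.quote, from java.util.regex, is shown as one alternative when a whole string should be treated literally.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Quotes 'literal' so that every character in it is matched as plain text.
    static boolean matchesLiteral(String s, String literal) {
        return s.matches(Pattern.quote(literal));
    }

    public static void main(String[] args) {
        String regex = "Hello\\?";   // two backslashes typed in the source code...
        System.out.println(regex);   // ...but the string holds only one: Hello\?
        System.out.println("Hello?".matches(regex));            // true
        System.out.println("Hello?".matches("Hello?"));         // false: unescaped ? makes the o optional
        System.out.println("Hell".matches("Hello?"));           // true, for the same reason
        System.out.println(matchesLiteral("Hello?", "Hello?")); // true
    }
}
```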

  12. Matching Multiple Strings
          if (s.matches("xbox|ps2|gamecube|spectrum")) {
              System.out.println("it's a match!");
          } else {
              System.out.println("no match");
          }
      • Matches any of the given strings separated by the | character
      • "x|y" means match x or y
      • The | character can be very useful to search for common misspellings

  13. More Matching of Multiple Strings
          if (s.matches("Call of (Cthullu|Cthulu)")) {
              System.out.println("it's a match!");
          } else {
              System.out.println("no match");
          }
      • The parentheses limit the scope of the | operator to the two spellings; without them the alternatives would be "Call of Cthullu" and "Cthulu"
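To see what the grouping changes, the sketch below compares the slide's pattern with the same pattern written without the grouping. The class and method names are hypothetical.

```java
class AlternationScopeDemo {
    // | chooses between the two spellings only.
    static boolean withParens(String s) {
        return s.matches("Call of (Cthullu|Cthulu)");
    }

    // Without grouping, | splits the whole pattern: "Call of Cthullu" or "Cthulu".
    static boolean withoutParens(String s) {
        return s.matches("Call of Cthullu|Cthulu");
    }

    public static void main(String[] args) {
        System.out.println(withParens("Call of Cthulu"));     // true
        System.out.println(withoutParens("Call of Cthulu"));  // false
        System.out.println(withoutParens("Cthulu"));          // true, probably not intended
    }
}
```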

  14. Character Classes
          if (s.matches("[aeiou]")) {
              System.out.println("it's a vowel!");
          } else {
              System.out.println("not a vowel");
          }
      • A character class is a set of characters
      • The characters are enclosed in square brackets, […]
      • A character class matches exactly one character from the set
      • The above example would match "o" but not "oo" or "ae"

  15. More About Character Classes
          if (s.matches("sep[ae]r[ae]te")) {
              System.out.println("it's a match!");
          } else {
              System.out.println("no match");
          }
      • This example matches the word separate and some common misspellings of it
      • Matches separate, separete, seperate, and seperete

  16. Negating Character Classes
          if (s.matches("[^aeiou]")) {
              System.out.println("not a vowel!");
          } else {
              System.out.println("it's a vowel");
          }
      • Preceding the characters with a ^ matches any single character that is not in the class; it does not match the empty string
      • Note that it will match anything except a lowercase vowel, so far more than just the consonants

  17. Multiple Character Classes
          if (s.matches("7[0-9][0-9]")) {
              System.out.println("it's 7xx!");
          } else {
              System.out.println("it's not 7xx");
          }
      • The - indicates a range of digits or characters
      • The example matches all three-digit numbers beginning with 7
      • The class [a-zA-Z0-9] matches any alphanumeric character
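The character class slides above can be tried out with a few small helpers. This is an illustrative sketch only; the class and method names are hypothetical, and the patterns are the ones discussed on slides 14, 16 and 17.

```java
class CharClassDemo {
    static boolean isLowercaseVowel(String s)    { return s.matches("[aeiou]"); }
    static boolean isNotLowercaseVowel(String s) { return s.matches("[^aeiou]"); }
    static boolean isSevenHundreds(String s)     { return s.matches("7[0-9][0-9]"); }

    public static void main(String[] args) {
        System.out.println(isLowercaseVowel("o"));      // true
        System.out.println(isLowercaseVowel("oo"));     // false: a class matches one character
        System.out.println(isNotLowercaseVowel("A"));   // true: only lowercase vowels are excluded
        System.out.println(isNotLowercaseVowel("7"));   // true: digits are not vowels either
        System.out.println(isNotLowercaseVowel(""));    // false: one character is still required
        System.out.println(isSevenHundreds("742"));     // true
        System.out.println(isSevenHundreds("7421"));    // false: exactly three digits
    }
}
```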

  18. Postal Codes
          String rx;
          rx = "[a-zA-Z][0-9][a-zA-Z] [0-9][a-zA-Z][0-9]";
          if (s.matches(rx)) {
              System.out.println("it's a postal code!");
          } else {
              System.out.println("hey, where's that?");
          }
      • Notice that there must be a single space in the postal code
      • This is very restrictive, and it can be improved on
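One possible relaxation, sketched here as an assumption about what "improved" might mean, is to make the space optional with the ? quantifier and to use \\d for the digits. The class and method names are hypothetical, and the pattern still only checks the letter-digit shape, not whether the code really exists.

```java
class PostalCodeDemo {
    // Letter, digit, letter, optional single space, digit, letter, digit.
    static boolean isPostalCode(String s) {
        return s.matches("[a-zA-Z]\\d[a-zA-Z] ?\\d[a-zA-Z]\\d");
    }

    public static void main(String[] args) {
        System.out.println(isPostalCode("V5A 1S6"));   // true
        System.out.println(isPostalCode("v5a1s6"));    // true: space is optional
        System.out.println(isPostalCode("V5A  1S6"));  // false: at most one space
        System.out.println(isPostalCode("V5A-1S6"));   // false: '-' is not allowed
    }
}
```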

  19. Matching a Time
      • Write a regular expression to match a time like 11:22am or 7:00pm
      • What makes a valid time?
        • The number before the ":" must be from 1 to 9 or 10, 11, 12
          • [1-9]|10|11|12
        • The first digit after the ":" must be in the range 0 to 5 and the second digit can be any digit
          • [0-5][0-9]
        • The time must end with am or pm
          • (am|pm) or [ap]m
      • Are there other possibilities?
        • Should we allow a.m., AM, or A.M.?
        • Should we allow a leading 0 for the hour?
        • And so on …

  20. Matching Times
          public static boolean isTime(String s) {
              String rx;
              rx = "([1-9]|10|11|12):[0-5][0-9](am|pm)";
              return s.matches(rx);
          }
      • Note that the ()s are very important; the expression will not work properly if they are omitted
      • What about times for the 24 hour clock?

  21. Testing Regular Expressions
      • It is often convenient to test regular expressions with assertions (Java only checks assertions when run with the -ea flag)
      • Here is an example of testing the time regular expression
          private static void test4() {
              assert isTime("4:00am");
              assert isTime("12:59pm");
              assert isTime("3:24am");
              assert !isTime("4:60pm");
              assert !isTime("4:22");
              assert !isTime("13:21pm");
              assert !isTime("03:21pm");
              assert !isTime("0:20am");
              System.out.println("All tests passed.");
          }
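Slide 20 raises two points that a short sketch can illustrate: why the parentheses around the hour matter, and how a 24 hour clock might be matched. The 24 hour pattern is one possible answer, not the course's; it assumes hours 0 to 23 with an optional leading zero and no am/pm suffix. The class and method names are hypothetical.

```java
class TimeDemo {
    // The 12 hour pattern with the grouping around the hour removed.
    static boolean isTimeNoParens(String s) {
        return s.matches("[1-9]|10|11|12:[0-5][0-9](am|pm)");
    }

    // Assumed 24 hour format: 0-19 with optional leading zero, or 20-23, then minutes.
    static boolean isTime24(String s) {
        return s.matches("([01]?[0-9]|2[0-3]):[0-5][0-9]");
    }

    public static void main(String[] args) {
        System.out.println(isTimeNoParens("4:00am"));  // false: | now splits the whole pattern
        System.out.println(isTimeNoParens("4"));       // true: "[1-9]" alone is an alternative
        System.out.println(isTime24("23:59"));         // true
        System.out.println(isTime24("07:05"));         // true
        System.out.println(isTime24("24:00"));         // false
    }
}
```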

  22. Pre-defined Character Classes
      • . matches any character (by default, any character except a line terminator)
      • \\d matches any digit
        • equivalent to [0-9]
      • \\D matches any character that is not a digit
        • equivalent to [^0-9]
      • \\w matches a word character
        • equivalent to [a-zA-Z0-9_]
      • \\W matches any non-word character
        • equivalent to [^a-zA-Z0-9_]
      • \\s matches any whitespace character
        • equivalent to [ \t\n\x0B\f\r]
      • \\S matches any non-whitespace character
        • equivalent to [^ \t\n\x0B\f\r]
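A few single-character checks make the pre-defined classes above concrete. The helper below is a hypothetical convenience wrapper around String.matches, not part of the original slides.

```java
class PredefinedClassDemo {
    // True when the whole string s is matched by the regular expression rx.
    static boolean matches(String s, String rx) {
        return s.matches(rx);
    }

    public static void main(String[] args) {
        System.out.println(matches("7", "\\d"));   // true
        System.out.println(matches("x", "\\D"));   // true
        System.out.println(matches("_", "\\w"));   // true: underscore is a word character
        System.out.println(matches("-", "\\W"));   // true
        System.out.println(matches(" ", "\\s"));   // true: a space is whitespace
        System.out.println(matches("\t", "\\s"));  // true
        System.out.println(matches("a", "\\S"));   // true
        System.out.println(matches("\n", "."));    // false: . skips line terminators by default
    }
}
```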

  23. Quantifiers
      • A quantifier specifies how many of a particular character (or character sequence) are matched
      • ? makes the preceding character optional, i.e. it matches 0 or 1 occurrences of it
      • * matches 0 or more occurrences of the preceding character
      • + matches 1 or more occurrences of the preceding character
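The three quantifiers above can be compared side by side. The patterns and inputs in this sketch are invented for illustration, as are the class and method names.

```java
class QuantifierDemo {
    static boolean optionalU(String s)   { return s.matches("colou?r"); } // 0 or 1 'u'
    static boolean zeroOrMoreB(String s) { return s.matches("ab*"); }     // 'a' then 0 or more 'b'
    static boolean oneOrMoreB(String s)  { return s.matches("ab+"); }     // 'a' then 1 or more 'b'

    public static void main(String[] args) {
        System.out.println(optionalU("color"));    // true
        System.out.println(optionalU("colour"));   // true
        System.out.println(optionalU("colouur"));  // false: at most one 'u'
        System.out.println(zeroOrMoreB("a"));      // true
        System.out.println(oneOrMoreB("a"));       // false: at least one 'b' needed
        System.out.println(oneOrMoreB("abbb"));    // true
    }
}
```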

  24. Java Identifiers
      • A Java identifier is a variable, method or class name
      • The name must begin with a letter or an underscore and
      • is then followed by any number of word characters (letters, digits or underscores)
      • Here are two regular expressions that match valid Java identifiers of this form (Java itself also permits $ and non-ASCII letters, which are ignored here)
        • "[a-zA-Z_]\\w*"
        • "[a-zA-Z_][a-zA-Z0-9_]*"
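Here is a small sketch of the first identifier expression in use. The class and method names are hypothetical. Note the limits of the simplified rule: the pattern accepts reserved words such as class, and rejects names containing $ even though Java allows them.

```java
class IdentifierDemo {
    // Letter or underscore first, then any number of word characters.
    static boolean looksLikeIdentifier(String s) {
        return s.matches("[a-zA-Z_]\\w*");
    }

    public static void main(String[] args) {
        System.out.println(looksLikeIdentifier("count"));    // true
        System.out.println(looksLikeIdentifier("_tmp1"));    // true
        System.out.println(looksLikeIdentifier("2fast"));    // false: starts with a digit
        System.out.println(looksLikeIdentifier("my-var"));   // false: '-' is not a word character
        System.out.println(looksLikeIdentifier("class"));    // true, although class is a keyword
        System.out.println(looksLikeIdentifier("$amount"));  // false, although Java would accept it
    }
}
```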

  25. Laughter
      • Create a regular expression that matches laughter
        • ha, or
        • haha, or
        • hahaha,
        • …

  26. Matching Laughter
          if (s.matches("(ha)+")) {
              System.out.println("laughing!!");
          } else {
              System.out.println("not funny");
          }
      • The ha has to be in parentheses or the quantifier will apply only to the preceding character, i.e. just the a
      • Use the + quantifier as we require at least one ha for a string to match

  27. Evil Genius Laughter
      • Now let's change our expression to match evil genius laughter:
        • bwahhahhah
        • bwahahaha
      • One point to watch out for is whether or not the double h's are necessary

  28. Modified Laughter Expression
          if (s.matches("(bwah?)?(hah?)+")) {
              System.out.println("laughing!!");
          } else {
              System.out.println("not funny");
          }
      • Matches a variety of laughter expressions e.g.
        • bwahhahhah
        • bwahhaha
        • ha
        • hahhah
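The laughter expressions from slides 26 and 28 can be checked against the example strings, along with a couple of near misses. This is a sketch with hypothetical class and method names; the rejected inputs are invented for illustration.

```java
class LaughterDemo {
    static boolean isLaughter(String s)     { return s.matches("(ha)+"); }
    static boolean isEvilLaughter(String s) { return s.matches("(bwah?)?(hah?)+"); }

    public static void main(String[] args) {
        System.out.println(isLaughter("hahaha"));          // true
        System.out.println("haha".matches("ha+"));         // false: + applies only to the a
        System.out.println("haaa".matches("ha+"));         // true
        System.out.println(isEvilLaughter("bwahhahhah"));  // true
        System.out.println(isEvilLaughter("bwahahaha"));   // true: the h's are optional
        System.out.println(isEvilLaughter("bwah"));        // false: at least one ha is required
    }
}
```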

  29. Integers
          if (s.matches("0|-?[1-9]\\d*")) {
              System.out.println("an integer!");
          } else {
              System.out.println("no match");
          }
      • The initial 0 alternative matches just 0, so 09, -0 etc. are rejected
      • The - sign is made optional
      • The first digit cannot be 0 (see the first point) and
      • \\d* matches any sequence of digits, including none
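The integer expression can be exercised with the cases named on the slide plus a few more. The class and method names are hypothetical, and the extra inputs are invented for illustration.

```java
class IntegerDemo {
    // Either the single digit 0, or an optional '-' followed by a number with no leading zero.
    static boolean isInteger(String s) {
        return s.matches("0|-?[1-9]\\d*");
    }

    public static void main(String[] args) {
        System.out.println(isInteger("0"));    // true
        System.out.println(isInteger("42"));   // true
        System.out.println(isInteger("-17"));  // true
        System.out.println(isInteger("09"));   // false: leading zero
        System.out.println(isInteger("-0"));   // false
        System.out.println(isInteger("+5"));   // false: a '+' sign is not part of the pattern
    }
}
```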

  30. Using Regular Expressions
      • There are many uses for regular expressions other than just matching entire strings
      • There are other Java methods that accept regular expressions as arguments
        • String[] split(String regex) splits the calling string on matches of the given regular expression (discarding the matching substrings)
        • String replaceAll(String regex, String replace) returns a string with every substring that matches the regular expression replaced with the replacement string
        • String replaceFirst(String regex, String replace) is similar to replaceAll except that it replaces only the first matching substring
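The three String methods listed above are sketched here on made-up inputs. The wrapper methods and the class name are hypothetical; split, replaceAll and replaceFirst are the standard java.lang.String methods.

```java
import java.util.Arrays;

class RegexMethodsDemo {
    // Splits on commas, swallowing any whitespace around each comma.
    static String[] splitOnCommas(String s) { return s.split("\\s*,\\s*"); }

    // Replaces every digit with '#'.
    static String maskDigits(String s) { return s.replaceAll("\\d", "#"); }

    // Replaces only the first run of digits with '#'.
    static String maskFirstNumber(String s) { return s.replaceFirst("\\d+", "#"); }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnCommas("xbox, ps2 ,gamecube"))); // [xbox, ps2, gamecube]
        System.out.println(maskDigits("call 555-1234"));       // call ###-####
        System.out.println(maskFirstNumber("call 555-1234"));  // call #-1234
    }
}
```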

  31. Advanced Regular Expressions
      • The regular expression syntax has many advanced features that are not covered here
        • ^ (outside a character class) represents the start of a line
        • $ represents the end of a line
        • and others …
      • The site http://www.regular-expressions.info/ is a good source of information on regular expressions
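Because String.matches always tests the entire string, the anchors are easier to see with a search for a match anywhere in the input. The sketch below assumes the java.util.regex Pattern and Matcher classes, which the slides do not cover; the helper name is hypothetical. In Java's default mode, ^ and $ anchor to the start and end of the whole input rather than of each line.

```java
import java.util.regex.Pattern;

class AnchorDemo {
    // True when regex matches some substring of s (unlike matches, which needs the whole string).
    static boolean containsMatch(String s, String regex) {
        return Pattern.compile(regex).matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println("hot dog".matches("dog"));           // false: whole string must match
        System.out.println(containsMatch("hot dog", "dog"));    // true: found inside the string
        System.out.println(containsMatch("hot dog", "^dog"));   // false: not at the start
        System.out.println(containsMatch("hot dog", "dog$"));   // true: at the end
        System.out.println(containsMatch("dog house", "^dog")); // true: at the start
    }
}
```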
