Regular Expressions and Java. 蕭宇程 swanky.hsiao@gmail.com http://swanky.adsldns.org/. Introduction to Regular Expressions, Regular Expression Syntax, Object Models, Using regexes in Java. Part 1: Introduction to Regular Expressions.
Regular Expressions and Java 蕭宇程 swanky.hsiao@gmail.com http://swanky.adsldns.org/
Introduction to Regular Expressions • Regular Expression Syntax • Object Models • Using regexes in Java
Regular expressions are the key to powerful, flexible, and efficient text processing.
The Filename Analogy • In DOS/Windows: dir *.txt • * and ? are file globs, or wildcards • * matches anything • ? matches any one character
Generalized Pattern Language • Regular expressions: the term covers both a powerful, generalized pattern language and the patterns written in it
The Language Analogy Regular expressions are composed of: • Metacharacters (special characters) • Literals (normal text characters) Literal text acts as the words and metacharacters as the grammar.
Regular expression test program: http://www.javaworld.com.tw/blog/archives/ciyawasay/000381.html Slides and sample files for this regular expression talk: http://www.javaworld.com.tw/blog/archives/ciyawasay/000394.html
Start and End of the Line • Start: ^ (caret) • End: $ (dollar) Examples: cat, ^cat, cat$ • These match a position in the line rather than any actual text characters. Question: What do ^cat$, ^$, and ^ each mean?
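The anchors on this slide can be tried out with a short program. This is a minimal sketch: the class name AnchorDemo and the helper found are illustrative names, not part of the original slides. It uses Matcher.find(), which searches for a match anywhere in the line, so the anchors are what restrict where the match may sit.

```java
import java.util.regex.Pattern;

class AnchorDemo {
    // True when the regex matches somewhere in the line (a search, not a whole-string match).
    // "found" is a hypothetical helper used only for this illustration.
    static boolean found(String regex, String line) {
        return Pattern.compile(regex).matcher(line).find();
    }

    public static void main(String[] args) {
        String[] regexes = { "cat", "^cat", "cat$", "^cat$", "^$" };
        String[] lines = { "cat", "catalog", "bobcat", "" };
        // Print one result per regex/line pair so the effect of each anchor is visible.
        for (String regex : regexes) {
            for (String line : lines) {
                System.out.println(regex + " on \"" + line + "\": " + found(regex, line));
            }
        }
    }
}
```

In this sketch ^cat$ is found only in the line "cat", and ^$ only in the empty line.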
Character Classes • Matching any one of several characters […] • Negated character classes [^…]
Matching any one of several characters • […] grey, gray ⇒ gr[ea]y • Within a class, the character-class metacharacter ‘-’ (dash) indicates a range, as in [0-9]: <h[123456]> ⇒ <h[1-6]>; [0-9a-fA-F] = [A-Fa-f0-9] • A dash is a metacharacter only within a character class; otherwise it matches the normal dash character.
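The character classes above can be checked with a small sketch. The class name CharClassDemo and the helper wholeMatch are assumptions made for illustration. Pattern.matches requires the regex to match the entire input, which keeps each test to exactly the text shown.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // Pattern.matches succeeds only if the regex matches the whole input.
    // "wholeMatch" is a hypothetical helper used only for this illustration.
    static boolean wholeMatch(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        System.out.println(wholeMatch("gr[ea]y", "grey"));          // true
        System.out.println(wholeMatch("gr[ea]y", "gray"));          // true
        System.out.println(wholeMatch("gr[ea]y", "groy"));          // false
        System.out.println(wholeMatch("<h[1-6]>", "<h3>"));         // true
        System.out.println(wholeMatch("<h[1-6]>", "<h7>"));         // false
        // The order of ranges inside a class does not matter.
        System.out.println(wholeMatch("[0-9a-fA-F]+", "CafeBabe")); // true
        System.out.println(wholeMatch("[A-Fa-f0-9]+", "CafeBabe")); // true
        // Outside a class the dash is an ordinary character.
        System.out.println(wholeMatch("03-19", "03-19"));           // true
    }
}
```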
Negated character classes • [^…] Matches any character that isn’t listed. • [^1-6] matches a character that’s not 1 through 6. Question: Why doesn’t q[^u] match ‘Qantas’ or ‘Iraq’?
Character Class Notes • A character class, even negated, still requires a character to match. • Consider character classes as their own mini language. The rules regarding which metacharacters are supported (and what they do) are completely different inside and outside of character classes.
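The first note, that a negated class still has to consume a character, can be observed directly. This is a minimal sketch with illustrative names (NegatedClassDemo, found); it searches each word for q[^u] without any case-insensitivity flag.

```java
import java.util.regex.Pattern;

class NegatedClassDemo {
    // True when the regex matches somewhere in the text.
    // "found" is a hypothetical helper used only for this illustration.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // "Iraq": the q is the last character, so [^u] has nothing left to match.
        System.out.println(found("q[^u]", "Iraq"));   // false
        // "Iraq.": the period is a character that is not u.
        System.out.println(found("q[^u]", "Iraq."));  // true
        // "Qantas": the regex asks for a lowercase q, and the word has only an uppercase Q.
        System.out.println(found("q[^u]", "Qantas")); // false
        System.out.println(found("q[^u]", "qantas")); // true
        // "quit": the q is followed by u, which the negated class rejects.
        System.out.println(found("q[^u]", "quit"));   // false
    }
}
```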
Matching Any Character with Dot • . (dot, point) matches any character. 03.19.76 also matches unintended text such as 03a19●76. To match only 03/19/76, 03-19-76, and 03.19.76, use 03[-./]19[-./]76 • The dot is not a metacharacter within a character class • [.-/] would be a mistake: there the dash forms a range from . to /
Alternation: Matching any one of several subexpressions • | (or, bar) Combines multiple expressions into a single expression that matches any of the individual ones. gr[ea]y = grey|gray = gr(a|e)y • gr[a|e]y: Wrong! Within a class, the ‘|’ character is just a normal character.
Question • Jeffrey|Jeffery • (Geoff|Jeff)(rey|ery) • ^(From|Subject|Date):● • Start-of-line, followed by F, r, o, m, followed by ‘:●’ • Start-of-line, followed by S, u, b, j, e, c, t, followed by ‘:●’ • Start-of-line, followed by D, a, t, e, followed by ‘:●’
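A short sketch makes the difference between the bare dot and the character class concrete. The names DotDemo and wholeMatch are illustrative assumptions, and a plain space stands in for the separator in the unintended input.

```java
import java.util.regex.Pattern;

class DotDemo {
    // Pattern.matches succeeds only if the regex matches the whole input.
    // "wholeMatch" is a hypothetical helper used only for this illustration.
    static boolean wholeMatch(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        // The bare dot accepts any separator, including unintended ones.
        System.out.println(wholeMatch("03.19.76", "03a19 76"));         // true
        // The class limits the separator to dash, dot, or slash.
        System.out.println(wholeMatch("03[-./]19[-./]76", "03a19 76")); // false
        System.out.println(wholeMatch("03[-./]19[-./]76", "03/19/76")); // true
        System.out.println(wholeMatch("03[-./]19[-./]76", "03-19-76")); // true
        System.out.println(wholeMatch("03[-./]19[-./]76", "03.19.76")); // true
        // [.-/] is a range from '.' to '/', so it no longer matches a dash.
        System.out.println(wholeMatch("[.-/]", "-"));                   // false
    }
}
```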
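The equivalence of the three spellings, and the reason gr[a|e]y is wrong, can be checked with a few whole-string matches. This is only a sketch; AlternationDemo and wholeMatch are illustrative names.

```java
import java.util.regex.Pattern;

class AlternationDemo {
    // Pattern.matches succeeds only if the regex matches the whole input.
    // "wholeMatch" is a hypothetical helper used only for this illustration.
    static boolean wholeMatch(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        String[] regexes = { "gr[ea]y", "grey|gray", "gr(a|e)y" };
        for (String regex : regexes) {
            // Each of the three forms accepts both spellings.
            System.out.println(regex + ": grey=" + wholeMatch(regex, "grey")
                    + " gray=" + wholeMatch(regex, "gray"));
        }
        // Inside a class, '|' is a literal, so this odd string is accepted too.
        System.out.println(wholeMatch("gr[a|e]y", "gr|y")); // true
        System.out.println(wholeMatch("gr(a|e)y", "gr|y")); // false
    }
}
```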
Character Class & Alternation • A character class can match just a single character in the target text. • With alternation, since each alternative can be a full-fledged regular expression in and of itself, each alternative can match an arbitrary amount of text.
Character Class & Alternation • Character classes are almost like their own special mini-language (with their own ideas about metacharacters, for example), • while alternation is part of the “main” regular expression language.
Word Boundaries • \b (or \< and \> in some tools) Matches the position at the start or end of a word (word-based versions of ^ and $) \bcat\b, \bcat, cat\b
Optional Items • ? (question mark) color|colour ⇒ colou?r Optional: ? is placed after the character that is allowed to appear at that point in the expression, but whose presence isn’t required for a successful match. • ? can attach to a parenthesized expression. 4th|4 ⇒ 4(th)?
Other Quantifiers: Repetition • + (plus) One or more of the immediately preceding item • * (asterisk, star) Any number, including none, of the item <hr●size=14> ⇒ <hr●+size●*=●*14●*> Exercise: make the size part optional.
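The three boundary forms can be compared on a few sample words. This sketch assumes java.util.regex, which supports \b; the \< and \> forms come from other tools and are not used here. The names WordBoundaryDemo and found are illustrative.

```java
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // True when the regex matches somewhere in the text.
    // "found" is a hypothetical helper used only for this illustration.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // In a Java string literal each regex backslash is written twice.
        System.out.println(found("\\bcat\\b", "the cat sat")); // true: cat is a whole word
        System.out.println(found("\\bcat\\b", "concatenate")); // false: cat is inside a word
        System.out.println(found("\\bcat", "catalog"));        // true: word starts with cat
        System.out.println(found("\\bcat", "bobcat"));         // false
        System.out.println(found("cat\\b", "bobcat"));         // true: word ends with cat
        System.out.println(found("cat\\b", "catalog"));        // false
    }
}
```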
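Both uses of the question mark, after a single character and after a parenthesized group, can be sketched as follows. OptionalDemo and wholeMatch are illustrative names.

```java
import java.util.regex.Pattern;

class OptionalDemo {
    // Pattern.matches succeeds only if the regex matches the whole input.
    // "wholeMatch" is a hypothetical helper used only for this illustration.
    static boolean wholeMatch(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        // The u may appear once or not at all.
        System.out.println(wholeMatch("colou?r", "color"));   // true
        System.out.println(wholeMatch("colou?r", "colour"));  // true
        System.out.println(wholeMatch("colou?r", "colouur")); // false
        // Here ? applies to the whole group (th), not just the h.
        System.out.println(wholeMatch("4(th)?", "4"));        // true
        System.out.println(wholeMatch("4(th)?", "4th"));      // true
        System.out.println(wholeMatch("4(th)?", "4t"));       // false
    }
}
```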
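The tag pattern from this slide can be tried against a few spacing variants. In this sketch an ordinary space is written wherever the slide shows ●, and the names RepetitionDemo, HR_TAG, and wholeMatch are illustrative. The exercise of making the size part optional is left as on the slide.

```java
import java.util.regex.Pattern;

class RepetitionDemo {
    // The slide's pattern, with a plain space where the slide shows the space marker.
    static final String HR_TAG = "<hr +size *= *14 *>";

    // "wholeMatch" is a hypothetical helper used only for this illustration.
    static boolean wholeMatch(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        System.out.println(wholeMatch(HR_TAG, "<hr size=14>"));        // true
        // Extra spaces are absorbed by + and *.
        System.out.println(wholeMatch(HR_TAG, "<hr   size = 14 >"));   // true
        // + needs at least one space between hr and size.
        System.out.println(wholeMatch(HR_TAG, "<hrsize=14>"));         // false
    }
}
```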
Defined range of matches: intervals • {min,max} (interval quantifier) [a-zA-Z]{1,5}
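The interval quantifier can be watched at work by collecting every match in a longer string. This is a minimal sketch with illustrative names (IntervalDemo, findAll); each match takes as many letters as it can, up to the maximum of five.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IntervalDemo {
    // Collects the text of every successive match of regex in text.
    // "findAll" is a hypothetical helper used only for this illustration.
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Eight letters are split into a run of five and a run of three.
        System.out.println(findAll("[a-zA-Z]{1,5}", "abcdefgh 42 xy")); // [abcde, fgh, xy]
        // As a whole-string match, six letters exceed the maximum.
        System.out.println(Pattern.matches("[a-zA-Z]{1,5}", "abcdef")); // false
    }
}
```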
Parentheses and Backreferences • Parentheses can “remember” text matched by the subexpression they enclose. • Backreferencing: match new text that is the same as some text matched earlier in the expression. Doubled-word problem: the●the ⇒ \b([A-Za-z]+)●+\1\b With several groups: ([a-z])([0-9])\1\2
The Great Escape • \ (escape) When a metacharacter is escaped, it loses its special meaning and becomes a literal character. ega.att.com ⇒ ega\.att\.com (very) ⇒ \([a-zA-Z]+\) • Not escapes: \<, \>, \1 (here the backslash forms a metasequence instead)
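The doubled-word regex can be run from Java roughly as follows. This is a sketch under stated assumptions: the ● on the slide is taken to be a space, and the names DoubledWordDemo, DOUBLED, and firstDoubledWord are illustrative. The trailing \b is what stops "the theory" from being reported.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DoubledWordDemo {
    // \1 must match exactly the text that group 1 captured.
    private static final Pattern DOUBLED = Pattern.compile("\\b([A-Za-z]+) +\\1\\b");

    // Returns the first doubled word, or null when none is found.
    // "firstDoubledWord" is a hypothetical helper used only for this illustration.
    static String firstDoubledWord(String text) {
        Matcher m = DOUBLED.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstDoubledWord("this is the the test")); // the
        System.out.println(firstDoubledWord("the theory"));           // null
        // Two groups, two backreferences: the letter and digit must both repeat.
        System.out.println(Pattern.matches("([a-z])([0-9])\\1\\2", "a1a1")); // true
        System.out.println(Pattern.matches("([a-z])([0-9])\\1\\2", "a1b2")); // false
    }
}
```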
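In Java source the escaping happens twice: the regex backslash itself has to be written as \\ inside a string literal. The sketch below shows the slide's examples in that form. EscapeDemo and wholeMatch are illustrative names, and the use of Pattern.quote is an addition that does not appear in the slides.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Pattern.matches succeeds only if the regex matches the whole input.
    // "wholeMatch" is a hypothetical helper used only for this illustration.
    static boolean wholeMatch(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        // Unescaped dots match any character.
        System.out.println(wholeMatch("ega.att.com", "egaXattYcom"));       // true
        // Escaped dots match only a literal period.
        System.out.println(wholeMatch("ega\\.att\\.com", "egaXattYcom"));   // false
        System.out.println(wholeMatch("ega\\.att\\.com", "ega.att.com"));   // true
        // Escaped parentheses are literal and no longer form a group.
        System.out.println(wholeMatch("\\([a-zA-Z]+\\)", "(very)"));        // true
        // Pattern.quote (not in the slides) treats an entire string as literal text.
        System.out.println(wholeMatch(Pattern.quote("ega.att.com"), "egaXattYcom")); // false
    }
}
```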
Tasks that need to be done when using a regular expression:
Setup . . .
1. Accept a string as a regex; compile it to an internal form.
2. Associate the regex with the target text.
Actually apply the regex . . .
3. Initiate a match attempt.
See the results . . .
4. Learn whether the match is successful.
5. Gain access to further details of a successful attempt.
6. Query those details (what matched, where it matched, etc.).
You might repeat the steps from 3 onward to find the next match in the target string.
• Pattern: Represents a compiled regular expression. • Matcher: Holds all of the state associated with applying a Pattern object to a particular string.
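The setup, apply, and results tasks map onto Pattern and Matcher roughly as sketched below. The class name MatchWorkflowDemo and the helper describeMatches are illustrative, and the "text@start-end" output format is chosen only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchWorkflowDemo {
    // Returns one "text@start-end" entry per match found in target.
    // "describeMatches" is a hypothetical helper used only for this illustration.
    static List<String> describeMatches(String regex, String target) {
        Pattern pattern = Pattern.compile(regex);  // setup: compile the regex string
        Matcher matcher = pattern.matcher(target); // setup: associate it with the target text
        List<String> found = new ArrayList<>();
        // Apply: each find() call starts a new attempt after the previous match
        // and reports whether it succeeded.
        while (matcher.find()) {
            // Results: query what matched and where it matched.
            found.add(matcher.group() + "@" + matcher.start() + "-" + matcher.end());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(describeMatches("\\d+", "a1 b22 c333")); // [1@1-2, 22@4-6, 333@8-11]
    }
}
```

The Matcher carries the position of the previous match, which is why repeating find() walks through the target string.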
Java API
public final class Pattern {
    // Flag values ('or' together)
    public static final int UNIX_LINES, CASE_INSENSITIVE, COMMENTS, MULTILINE, DOTALL, UNICODE_CASE, CANON_EQ;
    // Factory methods (no public constructors)
    public static Pattern compile(String patt);
    public static Pattern compile(String patt, int flags);
    // Method to get a Matcher for this Pattern
    public Matcher matcher(CharSequence input);
    // Information methods
    public String pattern();
    public int flags();
    // Convenience methods
    public static boolean matches(String pattern, CharSequence input);
    public String[] split(CharSequence input);
    public String[] split(CharSequence input, int max);
}
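The flag constants are combined with bitwise OR and passed to the two-argument compile method. The sketch below counts matches of ^cat$ in a three-line string under different flags; FlagsDemo and countMatches are illustrative names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlagsDemo {
    // Counts how many times the pattern is found in text.
    // "countMatches" is a hypothetical helper used only for this illustration.
    static int countMatches(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Cat\ncat\nCATALOG";
        // No flags: ^ and $ refer to the whole input, and case must match.
        Pattern plain = Pattern.compile("^cat$");
        // MULTILINE: ^ and $ also match at line boundaries inside the input.
        Pattern multiline = Pattern.compile("^cat$", Pattern.MULTILINE);
        // Both flags 'or'ed together: line-based anchors and case-insensitive letters.
        Pattern both = Pattern.compile("^cat$", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

        System.out.println(countMatches(plain, text));     // 0
        System.out.println(countMatches(multiline, text)); // 1
        System.out.println(countMatches(both, text));      // 2
    }
}
```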
public final class Matcher {
    // Action: find or match methods
    public boolean matches();
    public boolean find();
    public boolean find(int start);
    public boolean lookingAt();
    // "Information about the previous match" methods
    public int start();
    public int start(int whichGroup);
    public int end();
    public int end(int whichGroup);
    public int groupCount();
    public String group();
    public String group(int whichGroup);
}
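The three action methods differ in how much of the input they must cover: matches() needs the whole input, lookingAt() needs a match starting at the beginning, and find() searches anywhere. The sketch below also shows the group-based information methods. The class, constant, and helper names are illustrative assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherMethodsDemo {
    // Two capturing groups: the numbers on either side of a dash.
    static final Pattern RANGE = Pattern.compile("(\\d+)-(\\d+)");

    // matches(): the regex must cover the entire input.
    static boolean matchesWhole(String text) {
        return RANGE.matcher(text).matches();
    }

    // lookingAt(): the regex must match at the start, but may stop early.
    static boolean matchesPrefix(String text) {
        return RANGE.matcher(text).lookingAt();
    }

    // find(): the regex may match anywhere; then query the match details.
    static String describeFirst(String text) {
        Matcher m = RANGE.matcher(text);
        if (!m.find()) {
            return "no match";
        }
        return m.group() + " groups=" + m.groupCount()
                + " g1=" + m.group(1) + " g2=" + m.group(2)
                + " g2start=" + m.start(2);
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("10-20 and 30-40"));      // false
        System.out.println(matchesPrefix("10-20 and 30-40"));     // true
        System.out.println(matchesPrefix("see 10-20"));           // false
        System.out.println(describeFirst("see 10-20 and 30-40")); // 10-20 groups=2 g1=10 g2=20 g2start=7
    }
}
```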
public final class Matcher {
    // Reset methods
    public Matcher reset();
    public Matcher reset(CharSequence newInput);
    // Replacement methods
    public Matcher appendReplacement(StringBuffer where, String newText);
    public StringBuffer appendTail(StringBuffer where);
    public String replaceAll(String newText);
    public String replaceFirst(String newText);
    // Information methods
    public Pattern pattern();
}
/* String, showing only the RE-related methods */
public final class String {
    public boolean matches(String regex);
    public String replaceFirst(String regex, String newStr);
    public String replaceAll(String regex, String newStr);
    public String[] split(String regex);
    public String[] split(String regex, int max);
}
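For one-off operations the String methods avoid creating Pattern and Matcher objects explicitly. The following sketch exercises each method listed above; the class name StringRegexDemo and its three helpers are illustrative. Note that String.matches, like Pattern.matches, requires the regex to match the whole string.

```java
import java.util.Arrays;

class StringRegexDemo {
    // Whole-string check for a yyyy-mm-dd shape (digits only, no calendar validation).
    static boolean isIsoDate(String s) {
        return s.matches("\\d{4}-\\d{2}-\\d{2}");
    }

    // Replaces every run of digits with a single '#'.
    static String maskDigits(String s) {
        return s.replaceAll("\\d+", "#");
    }

    // Splits on a comma followed by optional whitespace.
    static String[] splitCsv(String s) {
        return s.split(",\\s*");
    }

    public static void main(String[] args) {
        System.out.println(isIsoDate("2024-01-15"));                     // true
        System.out.println(isIsoDate("on 2024-01-15"));                  // false
        System.out.println(maskDigits("a1b22c"));                        // a#b#c
        System.out.println("a1b22c".replaceFirst("\\d+", "#"));          // a#b22c
        System.out.println(Arrays.toString(splitCsv("a, b,c")));         // [a, b, c]
        // The second argument caps the number of pieces.
        System.out.println(Arrays.toString("a, b,c".split(",\\s*", 2))); // [a, b,c]
    }
}
```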
SimpleRegexText.java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleRegexText {
    public static void main(String[] args) {
        String sampleText = "this is the 1st test string";
        // One or more digits followed by one or more word characters
        String sampleRegex = "\\d+\\w+";
        Pattern p = Pattern.compile(sampleRegex);
        Matcher m = p.matcher(sampleText);
        if (m.find()) {
            String matchedText = m.group();
            int matchedFrom = m.start();
            int matchedTo = m.end();
            System.out.println("matched [" + matchedText + "] from "
                + matchedFrom + " to " + matchedTo + ".");
        } else {
            System.out.println("didn't match");
        }
    }
}
Output:
matched [1st] from 12 to 15.
Example: extracting English words (from Thinking in Java)
//: c12:FindDemo.java
import java.util.regex.*;
import com.bruceeckel.simpletest.*; // test helper library that accompanies the book
import java.util.*;

public class FindDemo {
    private static Test monitor = new Test();
    public static void main(String[] args) {
        Matcher m = Pattern.compile("\\w+").matcher(
            "Evening is full of the linnet's wings");
        // Each find() advances to the next run of word characters
        while (m.find())
            System.out.println(m.group());
        monitor.expect(new String[] {
            "Evening", "is", "full", "of", "the", "linnet", "s", "wings"
        });
    }
} ///:~
import java.util.regex.*;

/** Split a String into a Java Array of Strings divided by an RE */
public class Split {
    public static void main(String[] args) {
        String[] x = Pattern.compile("ian").split(
            "the darwinian devonian explodianchicken");
        for (int i = 0; i < x.length; i++) {
            System.out.println(i + " \"" + x[i] + "\"");
        }
    }
}
Output:
0 "the darwin"
1 " devon"
2 " explod"
3 "chicken"
import java.util.regex.*;

/**
 * Quick demo of RE substitution: correct "demon" and other
 * spelling variants to the correct, non-satanic "daemon".
 */
public class ReplaceDemo {
    public static void main(String[] argv) {
        // Make an RE pattern to match almost any form (deamon, demon, etc.).
        String patt = "d[ae]{1,2}mon"; // i.e., 1 or 2 'a' or 'e' any combo
        // A test input.
        String input = "Unix hath demons and deamons in it!";
        System.out.println("Input: " + input);
        // Run it from a RE instance and see that it works
        Pattern r = Pattern.compile(patt);
        Matcher m = r.matcher(input);
        System.out.println("ReplaceAll: " + m.replaceAll("daemon"));
        // Show the appendReplacement method
        m.reset();
        StringBuffer sb = new StringBuffer();
        System.out.print("Append methods: ");
        while (m.find()) {
            // copy to before first match, plus the word "daemon"
            m.appendReplacement(sb, "daemon");
        }
        m.appendTail(sb); // copy remainder
        System.out.println(sb.toString());
    }
}
Output:
Input: Unix hath demons and deamons in it!
ReplaceAll: Unix hath daemons and daemons in it!
Append methods: Unix hath daemons and daemons in it!
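The End Thank you, everyone! If you have questions, you are welcome to come find me in Lab 506 and we can work through them together.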