490 likes | 524 Views
Session id: 40105. Introducing Oracle Regular Expressions. Jonathan Gennick, O'Reilly & Associates Peter Linsley, Oracle Corporation. What are Regular Expressions?. A language, or syntax, you can use to describe patterns in text Example: [0-9]{3}-[0-9]{4}
E N D
Session id: 40105 Introducing Oracle Regular Expressions Jonathan Gennick, O'Reilly & Associates Peter Linsley, Oracle Corporation
What are Regular Expressions? • A language, or syntax, you can use to describe patterns in text • Example: [0-9]{3}-[0-9]{4} • That which you can describe, you can find and manipulate • Unix ed, grep, perl, and now everywhere!
Why Describe Patterns? • Humans have long worked with patterns: • Postal and email addresses • URLs • Phone numbers • Often it’s not the data that’s important, but the pattern: • Bioinformatics • Validate format of URLs and email addresses • Correct formatting of phone numbers
Pre-Oracle Database 10g Find parks with acreage in their descriptions: SELECT * FROM park WHERE description LIKE '%acre%'; Finds '217-acre' and '27 acres', but also ‘few acres’, ‘more acres than all other parks’, 'the location of a massacre', etc.
Pre-Oracle Database 10g cont. • Pattern matching with LIKE • Limited to only two operators: % and _ • OWA_PATTERN • No support for alternation, ASCII only, relatively poor performance • Non-native solutions • External Procedures • Difficult to deploy, maintain, and support • Client based solutions • Pull all that data down across the network
Oracle Database 10g • Four regular expression functions • REGEXP_LIKE does pattern match? • REGEXP_INSTR where does it match? • REGEXP_SUBSTR what does it match? • REGEXP_REPLACE replace what matched. • POSIX Extended Regular Expressions • UNIX Regular Expressions • Backreference support added • Longest match not supported
REGEXP_LIKE • Determine whether a pattern exists in a string • Revisiting the acreage problem: SELECT * FROM park WHERE REGEXP_LIKE(description, '[0-9]+(-| )acre'); • Finds '217-acre' and '27 acres' • REJECTS ‘few acres’, ‘more acres than all other parks’, 'the location of a massacre', etc.
Useful for Constraints • Filter allowable data with check constraint • Only allow alphabetical characters: CREATE TABLE t1 (c1 VARCHAR2(20), CHECK (REGEXP_LIKE(c1, '^[[:alpha:]]+$'))); INSERT INTO t1 VALUES ('newuser'); 1 row created. INSERT INTO t1 VALUES ('newuser1'); ORA-02290: check constraint violated
Metacharacters Operator Description . match any character a? match 'a' zero or one time a* match 'a' zero or more times a+ match 'a' one or more times a|b match either 'a' or 'b' a{m,n} match 'a' between m and n times [abc] match either 'a' or 'b' or 'c' (abc) match group 'abc' \n match nth group [:cc:] match character class [.ce.] match collation element [=ec=] match equivalence class
REGEXP_INSTR • Find out where a match occurs: SELECT REGEXP_INSTR(description, '[0-9]+(-| )acre') FROM park; REGEXP_INSTR(DESCRIPTION,'[0-9]+… --------------------------------- 6 20 0 …
REGEXP_SUBSTR • Determine what text matched: SELECT REGEXP_SUBSTR(description, '[0-9]+(-| )acre') FROM park; REGEXP_SUBSTR(DESCRIPT ---------------------- 217-acre 27 acre …
REGEXP_SUBSTR Cont • To extract just the acreage value: SELECT REGEXP_SUBSTR( REGEXP_SUBSTR(description, '[0-9]+(-| )acre'),'[0-9]+') FROM park; REGEXP_SUBSTR(REGEXP -------------------- 217 27
REGEXP_REPLACE Convert acres to hectares: UPDATE park SET description = REGEXP_REPLACE( description,'([0-9]+)(-| )acre', TO_CHAR(0.4047 * TO_NUMBER( REGEXP_SUBSTR( REGEXP_SUBSTR(description, '[0-9]+(-| )acre'),'[0-9]+'))) || '\2' || 'hectare');
REGEXP_REPLACE Cont. Convert acres to hectares: UPDATE park SET description = REGEXP_REPLACE( description,'([0-9]+)(-| )acre', TO_CHAR(0.4047 * TO_NUMBER( REGEXP_SUBSTR( REGEXP_SUBSTR(description, '[0-9]+(-| )acre'),'[0-9]+'))) || '\2' || 'hectare');
REGEXP_REPLACE Cont. This 217-acre park is wonderful. UPDATE park SET description = REGEXP_REPLACE( description,'([0-9]+)(-| )acre', TO_CHAR(0.4047 * TO_NUMBER( REGEXP_SUBSTR( REGEXP_SUBSTR(description, '[0-9]+(-| )acre'),'[0-9]+'))) || '\2' || 'hectare');
REGEXP_REPLACE Cont. This 217-acre park is wonderful. 217-acre UPDATE park SET description = REGEXP_REPLACE( description,'([0-9]+)(-| )acre', TO_CHAR(0.4047 * TO_NUMBER( REGEXP_SUBSTR( REGEXP_SUBSTR(description, '[0-9]+(-| )acre'),'[0-9]+'))) || '\2' || 'hectare');
REGEXP_REPLACE Cont. This 217-acre park is wonderful. 217-acre 217 UPDATE park SET description = REGEXP_REPLACE( description,'([0-9]+)(-| )acre', TO_CHAR(0.4047 * TO_NUMBER( REGEXP_SUBSTR( REGEXP_SUBSTR(description, '[0-9]+(-| )acre'),'[0-9]+'))) || '\2' || 'hectare');
REGEXP_REPLACE Cont. This 217-acre park is wonderful. 217-acre 217 217 * 0.4047 = 87.8199 UPDATE park SET description = REGEXP_REPLACE( description,'([0-9]+)(-| )acre', TO_CHAR(0.4047 * TO_NUMBER( REGEXP_SUBSTR( REGEXP_SUBSTR(description, '[0-9]+(-| )acre'),'[0-9]+'))) || '\2' || 'hectare');
REGEXP_REPLACE Cont. This 217-acre park is wonderful. 217-acre 217 217 * 0.4047 = 87.8199 87.8199\2hectare UPDATE park SET description = REGEXP_REPLACE( description,'([0-9]+)(-| )acre', TO_CHAR(0.4047 * TO_NUMBER( REGEXP_SUBSTR( REGEXP_SUBSTR(description, '[0-9]+(-| )acre'),'[0-9]+'))) || '\2' || 'hectare');
REGEXP_REPLACE Cont. This 217-acre park is wonderful. 217-acre 217 217 * 0.4047 = 87.8199 87.8199\2hectare 87.8199-hectare 1 2 UPDATE park SET description = REGEXP_REPLACE( description,'([0-9]+)(-| )acre', TO_CHAR(0.4047 * TO_NUMBER( REGEXP_SUBSTR( REGEXP_SUBSTR(description, '[0-9]+(-| )acre'),'[0-9]+'))) || '\2' || 'hectare');
REGEXP_REPLACE Cont. This 217-acre park is wonderful. 217-acre 217 217 * 0.4047 = 87.8199 87.8199\2hectare 87.8199-hectare This 87.8199-hectare park is wonderful. UPDATE park SET description = REGEXP_REPLACE( description,'([0-9]+)(-| )acre', TO_CHAR(0.4047 * TO_NUMBER( REGEXP_SUBSTR( REGEXP_SUBSTR(description, '[0-9]+(-| )acre'),'[0-9]+'))) || '\2' || 'hectare');
D E M O N S T R A T I O N Oracle Regular Expressions
Performance • Pattern matching can be complex • Need to compile to state machine • Lex and parse • Examine all possible branches until match found • Compiled once per statement • Can be faster than LIKE for complex scenarios • Usually faster than PL/SQL equivalent • ZIP code checking 5 times faster
Performance Cont. • Some poorly-performing expressions: • 'a{2}' will be slower than 'aa' • '.*b' on input that doesn't contain a 'b' can also be quite time-consuming Mastering Regular Expressions By Jeffrey Friedl Chapter 6, Crafting an Efficient Expression
Using with Indexes • Use function-based indexes: CREATE INDEX acre_ind ON park (REGEXP_SUBSTR( REGEXP_SUBSTR(description, '[0-9]+(-| )acre'),'[0-9]+')); • To support regular expression queries: SELECT * FROM park WHERE REGEXP_SUBSTR(REGEXP_SUBSTR(description, '[0-9]+(-| )acre'),'[0-9]+') = 217;
Using with Views • Hide the complexity from users: CREATE VIEW park_acreage as SELECT park_name, REGEXP_SUBSTR( REGEXP_SUBSTR( description, '[0-9]+(-| )acre'), '[0-9]+') acreage FROM park;
Using with PL/SQL • REGEXP_LIKE acts as a Boolean function in PL/SQL: IF REGEXP_LIKE(description, '[0-9]+(-| )acre') THEN acres := REGEXP_SUBSTR( REGEXP_SUBSTR(description, '[0-9]+(-| )acre'),'[0-9]+'); ... • All other functions act identically in PL/SQL and SQL.
Longest Match vs Greediness • Greediness = each element matches as much as possible. For example: SELECT REGEXP_SUBSTR( 'In the beginning','.+[[:space:]]') FROM dual; In the
Longest Match vs Greediness • Longest match = find the variations resulting in the greatest number of matching characters: • SELECT REGEXP_SUBSTR('bbb','b|bb') FROM dual; b • SELECT REGEXP_SUBSTR('bbb','bb|b') FROM dual; bb
Optional Parameters • All but REGEXP_LIKE take optional parameters for starting position and occurrence: REGEXP_INSTR (source, pattern, start, occurrence, match) REGEXP_SUBSTR (source, pattern, start, occurrence, match) REGEXP_REPLACE(source, pattern, replace, start, occurrence, match) • For example: REGEXP_SUBSTR('description','[^[:space:]]+',1,10)
Match Parameter • All functions take an optional match parameter: • Is matching case sensitive? • Does period (.) match newlines? • Is the source string one line or many? • The match parameter comes last
Case-sensitivity • Case-insensitive search: SELECT * FROM park WHERE REGEXP_LIKE( description, '[0-9]+(-| )acre', 'i');
Newline matching INSERT INTO park VALUES ('Park 6', '640' || CHR(10) || 'ACRE'); SELECT * FROM park WHERE REGEXP_LIKE( description, '[0-9]+.acre', 'in');
String anchors INSERT INTO employee (surname) VALUES ('Ellison' || CHR(10) || 'Gennick'); SELECT * FROM EMPLOYEE WHERE REGEXP_LIKE( surname,'^Ellison'); Yes!
String anchors INSERT INTO employee (surname) VALUES ('Ellison' || CHR(10) || 'Gennick') SELECT * FROM EMPLOYEE WHERE REGEXP_LIKE( surname,'^Gennick'); No!
String anchors INSERT INTO employee (surname) VALUES ('Ellison' || CHR(10) || 'Gennick') SELECT * FROM EMPLOYEE WHERE REGEXP_LIKE( surname,'^Gennick','m'); Yes!
Locale Support • Full Locale Support • All character sets • All languages • Case and accent insensitive searching • Linguistic range • Character classes • Collation elements • Equivalence classes
Character Sets and Languages • For example, you can search for Ukrainian names beginning with Ґ and ending with к: SELECT * FROM employee WHERE REGEXP_LIKE( surname, '^Ґ[[:alpha:]]*к$','n');
Case- and Accent-Insensitive Searching • Respect for NLS settings: ALTER SESSION SET NLS_SORT = GENERIC_BASELETTER; • With this sort, case won't matter and an expression such as: REGEXP_INSTR(x,'resume') will find "resume", "résumé", "Résume", etc.
Linguistic Range • Ranges respect NLS_SORT settings: a,b,c…z NLS_SORT=GERMAN [a-z] a,A,b,B,c,C…z,Z NLS_SORT=GERMAN_CI
Character Classes • Character classes such as [:alpha:] and [:digit:] encompass more than just Latin characters. • For example, [:digit:] matches: • Latin 0 through 9 • Arabic-Indic٠through ٩ • And more
Collation Elements ALTER SESSION SET NLS_SORT=XSPANISH; SELECT REGEXP_SUBSTR( 'El caballo, Chico come la tortilla.', '[[:alpha:]]*[ch][[:alpha:]]*', 1,1,'i') FROM dual; caballo
Collation Elements ALTER SESSION SET NLS_SORT=XSPANISH; SELECT REGEXP_SUBSTR( 'El caballo, Chico come la tortilla.', '[[:alpha:]]*[[.ch.]][[:alpha:]]*', 1,1,'i') FROM dual; Chico
Equivalence Classes • Ignore case and accents without changing NLS_SORT: REGEXP_INSTR(x,'r[[=e=]]sum[[=e=]]') • Finds 'resume', 'résumé', and 'rEsumE'
Conclusion • String searching and manipulation is at the heart of a great many applications • Oracle Regular Expressions provide versatile string manipulation in the database instead of externalized in middle tier logic • They are Locale sensitive and support character large objects • Available in both SQL and PL/SQL
Next Steps…. • Recommended sessions • Session #40088 New SQL Capabilities • Session #40202 Oracle HTML DB • Recommended demos and/or hands-on labs • Database Globalization Pod R • See Your Business in Our Software • Visit the DEMOgrounds for a customized architectural review, see a customized demo with Solutions Factory, or receive a personalized proposal. Visit the DEMOgrounds for more information. • Relevant web sites to visit for more information • http://www.opengroup.org/onlinepubs/007904975/basedefs/xbd_chap09.html
Shameless Plug Oracle Regular Expressions Pocket Reference Jonathan Gennick & Peter Linsley Free! At the O'Reilly & Associaties Booth