1 / 48

Introducing Oracle Regular Expressions

Session id: 40105. Introducing Oracle Regular Expressions. Jonathan Gennick, O'Reilly & Associates Peter Linsley, Oracle Corporation. What are Regular Expressions?. A language, or syntax, you can use to describe patterns in text Example: [0-9]{3}-[0-9]{4}

vanwinkle
Download Presentation

Introducing Oracle Regular Expressions

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Session id: 40105 Introducing Oracle Regular Expressions Jonathan Gennick, O'Reilly & Associates Peter Linsley, Oracle Corporation

  2. What are Regular Expressions? • A language, or syntax, you can use to describe patterns in text • Example: [0-9]{3}-[0-9]{4} • That which you can describe, you can find and manipulate • Unix ed, grep, perl, and now everywhere!

  3. Why Describe Patterns? • Humans have long worked with patterns: • Postal and email addresses • URLs • Phone numbers • Often it’s not the data that’s important, but the pattern: • Bioinformatics • Validate format of URLs and email addresses • Correct formatting of phone numbers

  4. Pre-Oracle Database 10g Find parks with acreage in their descriptions: SELECT * FROM park WHERE description LIKE '%acre%'; Finds '217-acre' and '27 acres', but also ‘few acres’, ‘more acres than all other parks’, 'the location of a massacre', etc.

  5. Pre-Oracle Database 10g cont. • Pattern matching with LIKE • Limited to only two operators: % and _ • OWA_PATTERN • No support for alternation, ASCII only, relatively poor performance • Non-native solutions • External Procedures • Difficult to deploy, maintain, and support • Client based solutions • Pull all that data down across the network

  6. Oracle Database 10g • Four regular expression functions • REGEXP_LIKE does pattern match? • REGEXP_INSTR where does it match? • REGEXP_SUBSTR what does it match? • REGEXP_REPLACE replace what matched. • POSIX Extended Regular Expressions • UNIX Regular Expressions • Backreference support added • Longest match not supported

  7. REGEXP_LIKE • Determine whether a pattern exists in a string • Revisiting the acreage problem: SELECT * FROM park WHERE REGEXP_LIKE(description, '[0-9]+(-| )acre'); • Finds '217-acre' and '27 acres' • REJECTS ‘few acres’, ‘more acres than all other parks’, 'the location of a massacre', etc.

  8. Useful for Constraints • Filter allowable data with check constraint • Only allow alphabetical characters: CREATE TABLE t1 (c1 VARCHAR2(20), CHECK (REGEXP_LIKE(c1, '^[[:alpha:]]+$'))); INSERT INTO t1 VALUES ('newuser');  1 row created. INSERT INTO t1 VALUES ('newuser1');  ORA-02290: check constraint violated

  9. Metacharacters Operator Description . match any character a? match 'a' zero or one time a* match 'a' zero or more times a+ match 'a' one or more times a|b match either 'a' or 'b' a{m,n} match 'a' between m and n times [abc] match either 'a' or 'b' or 'c' (abc) match group 'abc' \n match nth group [:cc:] match character class [.ce.] match collation element [=ec=] match equivalence class

  10. REGEXP_INSTR • Find out where a match occurs: SELECT REGEXP_INSTR(description, '[0-9]+(-| )acre') FROM park; REGEXP_INSTR(DESCRIPTION,'[0-9]+… --------------------------------- 6 20 0 …

  11. REGEXP_SUBSTR • Determine what text matched: SELECT REGEXP_SUBSTR(description, '[0-9]+(-| )acre') FROM park; REGEXP_SUBSTR(DESCRIPT ---------------------- 217-acre 27 acre …

  12. REGEXP_SUBSTR Cont • To extract just the acreage value: SELECT REGEXP_SUBSTR( REGEXP_SUBSTR(description, '[0-9]+(-| )acre'),'[0-9]+') FROM park; REGEXP_SUBSTR(REGEXP -------------------- 217 27

  13. REGEXP_REPLACE Convert acres to hectares: UPDATE park SET description = REGEXP_REPLACE( description,'([0-9]+)(-| )acre', TO_CHAR(0.4047 * TO_NUMBER( REGEXP_SUBSTR( REGEXP_SUBSTR(description, '[0-9]+(-| )acre'),'[0-9]+'))) || '\2' || 'hectare');

  14. REGEXP_REPLACE Cont. Convert acres to hectares: UPDATE park SET description = REGEXP_REPLACE( description,'([0-9]+)(-| )acre', TO_CHAR(0.4047 * TO_NUMBER( REGEXP_SUBSTR( REGEXP_SUBSTR(description, '[0-9]+(-| )acre'),'[0-9]+'))) || '\2' || 'hectare');

  15. REGEXP_REPLACE Cont. This 217-acre park is wonderful. UPDATE park SET description = REGEXP_REPLACE( description,'([0-9]+)(-| )acre', TO_CHAR(0.4047 * TO_NUMBER( REGEXP_SUBSTR( REGEXP_SUBSTR(description, '[0-9]+(-| )acre'),'[0-9]+'))) || '\2' || 'hectare');

  16. REGEXP_REPLACE Cont. This 217-acre park is wonderful. 217-acre UPDATE park SET description = REGEXP_REPLACE( description,'([0-9]+)(-| )acre', TO_CHAR(0.4047 * TO_NUMBER( REGEXP_SUBSTR( REGEXP_SUBSTR(description, '[0-9]+(-| )acre'),'[0-9]+'))) || '\2' || 'hectare');

  17. REGEXP_REPLACE Cont. This 217-acre park is wonderful. 217-acre 217 UPDATE park SET description = REGEXP_REPLACE( description,'([0-9]+)(-| )acre', TO_CHAR(0.4047 * TO_NUMBER( REGEXP_SUBSTR( REGEXP_SUBSTR(description, '[0-9]+(-| )acre'),'[0-9]+'))) || '\2' || 'hectare');

  18. REGEXP_REPLACE Cont. This 217-acre park is wonderful. 217-acre 217 217 * 0.4047 = 87.8199 UPDATE park SET description = REGEXP_REPLACE( description,'([0-9]+)(-| )acre', TO_CHAR(0.4047 * TO_NUMBER( REGEXP_SUBSTR( REGEXP_SUBSTR(description, '[0-9]+(-| )acre'),'[0-9]+'))) || '\2' || 'hectare');

  19. REGEXP_REPLACE Cont. This 217-acre park is wonderful. 217-acre 217 217 * 0.4047 = 87.8199 87.8199\2hectare UPDATE park SET description = REGEXP_REPLACE( description,'([0-9]+)(-| )acre', TO_CHAR(0.4047 * TO_NUMBER( REGEXP_SUBSTR( REGEXP_SUBSTR(description, '[0-9]+(-| )acre'),'[0-9]+'))) || '\2' || 'hectare');

  20. REGEXP_REPLACE Cont. This 217-acre park is wonderful. 217-acre 217 217 * 0.4047 = 87.8199 87.8199\2hectare 87.8199-hectare 1 2 UPDATE park SET description = REGEXP_REPLACE( description,'([0-9]+)(-| )acre', TO_CHAR(0.4047 * TO_NUMBER( REGEXP_SUBSTR( REGEXP_SUBSTR(description, '[0-9]+(-| )acre'),'[0-9]+'))) || '\2' || 'hectare');

  21. REGEXP_REPLACE Cont. This 217-acre park is wonderful. 217-acre 217 217 * 0.4047 = 87.8199 87.8199\2hectare 87.8199-hectare This 87.8199-hectare park is wonderful. UPDATE park SET description = REGEXP_REPLACE( description,'([0-9]+)(-| )acre', TO_CHAR(0.4047 * TO_NUMBER( REGEXP_SUBSTR( REGEXP_SUBSTR(description, '[0-9]+(-| )acre'),'[0-9]+'))) || '\2' || 'hectare');

  22. D E M O N S T R A T I O N Oracle Regular Expressions

  23. Performance • Pattern matching can be complex • Need to compile to state machine • Lex and parse • Examine all possible branches until match found • Compiled once per statement • Can be faster than LIKE for complex scenarios • Usually faster than PL/SQL equivalent • ZIP code checking 5 times faster

  24. Performance Cont. • Some poorly-performing expressions: • 'a{2}' will be slower than 'aa' • '.*b' on input that doesn't contain a 'b' can also be quite time-consuming Mastering Regular Expressions By Jeffrey Friedl Chapter 6, Crafting an Efficient Expression

  25. Using with Indexes • Use function-based indexes: CREATE INDEX acre_ind ON park (REGEXP_SUBSTR( REGEXP_SUBSTR(description, '[0-9]+(-| )acre'),'[0-9]+')); • To support regular expression queries: SELECT * FROM park WHERE REGEXP_SUBSTR(REGEXP_SUBSTR(description, '[0-9]+(-| )acre'),'[0-9]+') = 217;

  26. Using with Views • Hide the complexity from users: CREATE VIEW park_acreage as SELECT park_name, REGEXP_SUBSTR( REGEXP_SUBSTR( description, '[0-9]+(-| )acre'), '[0-9]+') acreage FROM park;

  27. Using with PL/SQL • REGEXP_LIKE acts as a Boolean function in PL/SQL: IF REGEXP_LIKE(description, '[0-9]+(-| )acre') THEN acres := REGEXP_SUBSTR( REGEXP_SUBSTR(description, '[0-9]+(-| )acre'),'[0-9]+'); ... • All other functions act identically in PL/SQL and SQL.

  28. Longest Match vs Greediness • Greediness = each element matches as much as possible. For example: SELECT REGEXP_SUBSTR( 'In the beginning','.+[[:space:]]') FROM dual;  In the

  29. Longest Match vs Greediness • Longest match = find the variations resulting in the greatest number of matching characters: • SELECT REGEXP_SUBSTR('bbb','b|bb') FROM dual;  b • SELECT REGEXP_SUBSTR('bbb','bb|b') FROM dual;  bb

  30. Optional Parameters • All but REGEXP_LIKE take optional parameters for starting position and occurrence: REGEXP_INSTR (source, pattern, start, occurrence, match) REGEXP_SUBSTR (source, pattern, start, occurrence, match) REGEXP_REPLACE(source, pattern, replace, start, occurrence, match) • For example: REGEXP_SUBSTR('description','[^[:space:]]+',1,10)

  31. Match Parameter • All functions take an optional match parameter: • Is matching case sensitive? • Does period (.) match newlines? • Is the source string one line or many? • The match parameter comes last

  32. Case-sensitivity • Case-insensitive search: SELECT * FROM park WHERE REGEXP_LIKE( description, '[0-9]+(-| )acre', 'i');

  33. Newline matching INSERT INTO park VALUES ('Park 6', '640' || CHR(10) || 'ACRE'); SELECT * FROM park WHERE REGEXP_LIKE( description, '[0-9]+.acre', 'in');

  34. String anchors INSERT INTO employee (surname) VALUES ('Ellison' || CHR(10) || 'Gennick'); SELECT * FROM EMPLOYEE WHERE REGEXP_LIKE( surname,'^Ellison'); Yes!

  35. String anchors INSERT INTO employee (surname) VALUES ('Ellison' || CHR(10) || 'Gennick') SELECT * FROM EMPLOYEE WHERE REGEXP_LIKE( surname,'^Gennick'); No!

  36. String anchors INSERT INTO employee (surname) VALUES ('Ellison' || CHR(10) || 'Gennick') SELECT * FROM EMPLOYEE WHERE REGEXP_LIKE( surname,'^Gennick','m'); Yes!

  37. Locale Support • Full Locale Support • All character sets • All languages • Case and accent insensitive searching • Linguistic range • Character classes • Collation elements • Equivalence classes

  38. Character Sets and Languages • For example, you can search for Ukrainian names beginning with Ґ and ending with к: SELECT * FROM employee WHERE REGEXP_LIKE( surname, '^Ґ[[:alpha:]]*к$','n');

  39. Case- and Accent-Insensitive Searching • Respect for NLS settings: ALTER SESSION SET NLS_SORT = GENERIC_BASELETTER; • With this sort, case won't matter and an expression such as: REGEXP_INSTR(x,'resume') will find "resume", "résumé", "Résume", etc.

  40. Linguistic Range • Ranges respect NLS_SORT settings: a,b,c…z NLS_SORT=GERMAN [a-z] a,A,b,B,c,C…z,Z NLS_SORT=GERMAN_CI

  41. Character Classes • Character classes such as [:alpha:] and [:digit:] encompass more than just Latin characters. • For example, [:digit:] matches: • Latin 0 through 9 • Arabic-Indic٠through ٩ • And more

  42. Collation Elements ALTER SESSION SET NLS_SORT=XSPANISH; SELECT REGEXP_SUBSTR( 'El caballo, Chico come la tortilla.', '[[:alpha:]]*[ch][[:alpha:]]*', 1,1,'i') FROM dual; caballo

  43. Collation Elements ALTER SESSION SET NLS_SORT=XSPANISH; SELECT REGEXP_SUBSTR( 'El caballo, Chico come la tortilla.', '[[:alpha:]]*[[.ch.]][[:alpha:]]*', 1,1,'i') FROM dual; Chico

  44. Equivalence Classes • Ignore case and accents without changing NLS_SORT: REGEXP_INSTR(x,'r[[=e=]]sum[[=e=]]') • Finds 'resume', 'résumé', and 'rEsumE'

  45. Conclusion • String searching and manipulation is at the heart of a great many applications • Oracle Regular Expressions provide versatile string manipulation in the database instead of externalized in middle tier logic • They are Locale sensitive and support character large objects • Available in both SQL and PL/SQL

  46. Next Steps…. • Recommended sessions • Session #40088 New SQL Capabilities • Session #40202 Oracle HTML DB • Recommended demos and/or hands-on labs • Database Globalization Pod R • See Your Business in Our Software • Visit the DEMOgrounds for a customized architectural review, see a customized demo with Solutions Factory, or receive a personalized proposal. Visit the DEMOgrounds for more information. • Relevant web sites to visit for more information • http://www.opengroup.org/onlinepubs/007904975/basedefs/xbd_chap09.html

  47. Shameless Plug Oracle Regular Expressions Pocket Reference Jonathan Gennick & Peter Linsley Free! At the O'Reilly & Associaties Booth

More Related