
Mastering Regular Expressions for Efficient Text Processing

Learn how regular expressions (RegEx) can be used in search, text processing, and programming languages like Perl and Java. Understand the structure, literal matching, character classes, negation, start/end of line, repeated matches, word selection, and more. Explore examples and references.

weylin

Presentation Transcript


  1. Regular Expressions ashishFA@cse

  2. Where can I use RegEx? • Powerful in search, search and replace, and text processing • Text editors: vi, EditPlus • Programming languages: Perl, Java • Command-line tools: grep, awk

  3. Before RegEx • Wildcard • *.txt • My_report*.doc Here the * indicates any number of any characters.
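
For comparison with what follows, here is a minimal sketch of wildcard matching using Python's standard `fnmatch` module. The file names are made up for illustration; only the two patterns come from the slide.

```python
import fnmatch

# Hypothetical file names, invented for this example.
names = ["notes.txt", "My_report_final.doc", "My_report.doc", "image.png"]

# In a wildcard pattern, * stands for any number of any characters.
txt_files = [n for n in names if fnmatch.fnmatchcase(n, "*.txt")]
reports = [n for n in names if fnmatch.fnmatchcase(n, "My_report*.doc")]

print(txt_files)
print(reports)
```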

  4. ! Regular expressions (RegEx) tend to be easier to write than they are to read

  5. What is RegEx • A regular expression is a pattern that describes or matches a set of strings • Matched text: the chunk of text that matches the regular expression • ca[trn] • Matches car, can, cat • EditPlus is used throughout this presentation as the tool to demonstrate regular expressions
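
The `ca[trn]` example can be tried outside an editor too. This is a small sketch using Python's `re` module, which is an assumption here (the slides use EditPlus, whose dialect may differ in places); the sample sentence is invented.

```python
import re

pattern = re.compile(r"ca[trn]")

# Invented sample text; "cab" is included to show a non-match.
text = "The cat sat in the car near a can and a cab."

# findall returns every chunk of matched text, left to right.
matches = pattern.findall(text)
print(matches)
```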

  6. Structure of RegEx • Made up of normal characters and metacharacters • Metacharacters have a special function: $ ^ . \ [ ] \( \) + ? • $ means end of line • ^ means start of line

  7. Literal match • RegEx: cat will match the word cat • It will also match inside words like concatenation, delicate, located, modification • This is sometimes not desired • Solution?

  8. Matching • Match the space before and after “cat” • “ cat ” • Still a problem: this misses cat at the start or end of a line, or next to punctuation
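
The two slides above can be checked with a short sketch in Python's `re` module (an assumption; the slides use EditPlus). The sample text is invented. The last pattern uses `\b`, a word-boundary anchor that Python supports but that the slides do not mention; it is shown only as one possible way around the problem.

```python
import re

# Invented sample: "cat" appears as a word three times and inside two longer words.
text = "cat and concatenation; the cat sat. A delicate cat"

literal = re.findall(r"cat", text)       # also hits inside longer words
spaced = re.findall(r" cat ", text)      # needs a space on both sides
bounded = re.findall(r"\bcat\b", text)   # \b is not from the slides

print(len(literal), len(spaced), len(bounded))
```

The spaced pattern finds only the middle occurrence, because the first `cat` has no space before it and the last has none after it.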

  9. Character class • Want to search for ‘in’ or ‘on’ • Searching with RegEx [io]n will match both in and on • [ ] : used to specify a set of characters to select from • [a-h] : the set of all characters from a to h • [4-9A-T] : digits 4 to 9 and capital letters A to T
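
A quick sketch of these character classes in Python's `re` module (an assumption; the slides demonstrate with EditPlus). All sample strings are invented.

```python
import re

# [io]n: an "i" or an "o", followed by "n".
in_on = re.findall(r"[io]n", "in on an un")

# [a-h]: any single character in the range a to h.
a_to_h = re.findall(r"[a-h]", "zebra hat")

# [4-9A-T]: two ranges in one class, digits 4-9 or capitals A-T.
mixed = re.findall(r"[4-9A-T]", "3 4 9 A T U a")

print(in_on, a_to_h, mixed)
```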

  10. Character class • It can also contain individual characters, as in [acV5y0] • [0-9] : ? • [0-9][0-9] : ? • 18[0-9][0-9] : ?
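
One way to explore the three questions on this slide is to run the patterns against sample text. This sketch assumes Python's `re` module rather than EditPlus, and the input strings are invented.

```python
import re

# A single digit.
one_digit = re.findall(r"[0-9]", "a1b22")

# Two digits in a row; matches do not overlap, so "123" yields only "12".
two_digits = re.findall(r"[0-9][0-9]", "7 42 123")

# "18" followed by any two digits.
eighteen_xx = re.findall(r"18[0-9][0-9]", "1776 1857 1899 1905")

print(one_digit, two_digits, eighteen_xx)
```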

  11. Example • Set of vowels • [aeiou] • Set of consonants • [bcdfghjklmnpqrstvwxyz] • Consider matching three-letter words which start with 2 vowels and end with a consonant • [aeiou][aeiou][bcdfghjklmnpqrstvwxyz] ? • “ [aeiou][aeiou][bcdfghjklmnpqrstvwxyz] ”
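
The difference between the two patterns on this slide shows up on a short sample. This is a sketch in Python's `re` module (an assumption; the slides use EditPlus), with an invented sentence.

```python
import re

vowel = "[aeiou]"
consonant = "[bcdfghjklmnpqrstvwxyz]"

unspaced = vowel + vowel + consonant
spaced = " " + unspaced + " "   # the quoted version, with surrounding spaces

text = "we eat soup now"

# Without the spaces, the pattern also fires inside "soup".
inside_words = re.findall(unspaced, text)

# With the spaces, only the standalone three-letter word is found.
whole_words = re.findall(spaced, text)

print(inside_words, whole_words)
```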

  12. Negation • The absence of a character or set of characters is shown by placing ^ at the start of a character class • [^ab^8] : any character other than a, b, ^, or 8 (a ^ after the first position is a literal ^) • [^c-p] : any character other than c..p • “[^t]ion ” : selects all words ending with ion but with no t before it
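
Negated classes are easy to misread, so here is a small check using Python's `re` module (an assumption; EditPlus is the tool used in the slides). The input strings are invented.

```python
import re

# Only a leading ^ negates; the second ^ is an ordinary member of the set.
not_ab_caret_8 = re.findall(r"[^ab^8]", "ab^8cd")

# Any character outside the range c to p.
not_c_to_p = re.findall(r"[^c-p]", "bcpq")

# "ion" preceded by any character except t.
no_t_ion = re.findall(r"[^t]ion", "nation vision union")

print(not_ab_caret_8, not_c_to_p, no_t_ion)
```

Note that `[^t]` still consumes one character, so each match is four characters long.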

  13. Start/End of line • ^ : indicates start of line • $ : indicates end of line • Example: search for lines starting with the word I • Use RegEx : “^I ” • Search for lines ending with the word is • Use RegEx : “ is$”
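
A sketch of both anchors, assuming Python's `re` module instead of EditPlus. The sample lines are invented; each line is tested on its own, which mirrors a line-by-line editor search.

```python
import re

# Invented sample lines.
lines = [
    "I like tea",
    "It is hot today",
    "This is",
    "See the thesis",
    "So it is",
]

# ^ anchors the pattern to the start of the line.
starts_with_i = [ln for ln in lines if re.search(r"^I ", ln)]

# $ anchors the pattern to the end of the line.
ends_with_is = [ln for ln in lines if re.search(r" is$", ln)]

print(starts_with_i)
print(ends_with_is)
```

The spaces inside the patterns are what keep "It is hot today" and "See the thesis" out of the results.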

  14. Any-character match • . : matches any single character • e.e : matches every three-character string whose first and last characters are e • Try “ e.e ” • If you want only words to be found, change the query to • e[a-z]e
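
The effect of replacing `.` with `[a-z]` can be seen in this sketch, which assumes Python's `re` module (the slides use EditPlus) and an invented sample string.

```python
import re

text = "eye ewe e-e here"

# . accepts any single character in the middle, including "-".
any_middle = re.findall(r"e.e", text)

# [a-z] accepts only a lowercase letter in the middle.
letter_middle = re.findall(r"e[a-z]e", text)

print(any_middle)
print(letter_middle)
```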

  15. Repeated match • * : matches the previous character or character class zero or more times • be* : matches b followed by zero or more ‘e’ • + : similar to * • The only difference is that it matches one or more times
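
The difference between `*` and `+` is clearest side by side. This is a minimal sketch in Python's `re` module (an assumption; EditPlus is the slides' tool), run on an invented string.

```python
import re

text = "b be bee abbey"

# be*: a "b" followed by zero or more "e", so a bare "b" also matches.
star = re.findall(r"be*", text)

# be+: a "b" followed by at least one "e".
plus = re.findall(r"be+", text)

print(star)
print(plus)
```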

  16. Selecting a number • Single digit : [0-9] • A single digit repeated is a number • (digit)repeat • [0-9]* (zero or more digits, so it also matches the empty string; [0-9]+ requires at least one digit) • $[0-9]* : ? • \$[0-9]*
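
The escaped and unescaped dollar patterns behave very differently. The sketch below assumes Python's `re` module rather than EditPlus, uses an invented sentence, and adds `[0-9]+` (not on the slide) to show a pattern that cannot match nothing.

```python
import re

text = "Order 66 costs $120 or $7"

# [0-9]+ (one or more digits) is used here because [0-9]* can match nothing.
numbers = re.findall(r"[0-9]+", text)
star_matches_empty = re.fullmatch(r"[0-9]*", "") is not None

# Unescaped $ means end of line, so this finds only an empty match at the end.
unescaped = re.findall(r"$[0-9]*", text)

# Escaped \$ is a literal dollar sign.
amounts = re.findall(r"\$[0-9]*", text)

print(numbers, star_matches_empty, unescaped, amounts)
```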

  17. Selecting a word • A word is composed of letters • A word is : [a-z]* • A word in all capital letters : ?? • A word starting with a capital letter : [ ][ ]*

  18. Alternate match • | : used to specify alternative matches • Search: (above)|(below)
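
A short sketch of alternation, assuming Python's `re` module (the slides use EditPlus) and an invented sentence. Because the pattern contains groups, `re.findall` would return tuples, so `finditer` with `group(0)` is used to get the whole matched text.

```python
import re

text = "See the note above and the table below."

# group(0) is the full match, whichever alternative produced it.
hits = [m.group(0) for m in re.finditer(r"(above)|(below)", text)]

print(hits)
```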

  19. Search • Day words • [a-z]*day • “[a-z]+day ” • “[A-Z][a-z]+day ”
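
The three "day" patterns can be compared on one invented string. This sketch assumes Python's `re` module instead of EditPlus and leaves out the trailing space shown inside the slide's quotes.

```python
import re

text = "Sunday monday today day"

# Zero or more lowercase letters before "day": "day" alone qualifies,
# and the capital S of "Sunday" is left out of the match.
star_day = re.findall(r"[a-z]*day", text)

# At least one lowercase letter before "day".
plus_day = re.findall(r"[a-z]+day", text)

# A capital letter, then lowercase letters, then "day".
capital_day = re.findall(r"[A-Z][a-z]+day", text)

print(star_day, plus_day, capital_day)
```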

  20. Escaping special meaning • How to match (, ) or * • To match characters that are used as metacharacters, ‘\’ is added before them as an escape character • e.g. to match ( write \( and to match a period . write \.
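
Escaping can be sketched as follows, assuming Python's `re` module (the slides demonstrate with EditPlus). The sample strings are invented.

```python
import re

text = "f(x) = 3.14 * (y)"

# A backslash turns each metacharacter into a literal character.
parens = re.findall(r"\(", text)
periods = re.findall(r"\.", text)
stars = re.findall(r"\*", text)

# Without the backslash, . matches every character.
any_char = re.findall(r".", "3.1")

print(parens, periods, stars, any_char)
```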

  21. Search patterns • has, have, had • not, n’t • ((have)|(had)|(has)) • (( )|(n't)|( not ))* • ((have)|(had)|(has))(( )|(n't)|( not ))*
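
The combined pattern can be tried on a few invented sentences. This sketch assumes Python's `re` module, which tries alternatives from left to right; EditPlus or other engines may behave differently. The `reordered` pattern is a variation added here for comparison and is not from the slides.

```python
import re

text = "I haven't gone. She has not left. They had it."

slide_pattern = r"((have)|(had)|(has))(( )|(n't)|( not ))*"
# Variation (not from the slides): try " not " before the single space.
reordered = r"((have)|(had)|(has))(( not )|(n't)|( ))*"

slide_matches = [m.group(0) for m in re.finditer(slide_pattern, text)]
reordered_matches = [m.group(0) for m in re.finditer(reordered, text)]

print(slide_matches)
print(reordered_matches)
```

In Python, the slide's ordering stops after the single space following "has", so " not " is never tried; putting the longer alternative first lets the match extend over "has not ".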

  22. References • EditPlus help pages • http://gnosis.cx/publish/programming/regular_expressions.html • O'Reilly: Mastering Regular Expressions • Google “regular expression tutorial”

  23. Thank you !
