The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval David Goldschmidt, Ph.D. Regular expressions {week 04}, from Search Engines: Information Retrieval in Practice, 1st edition, by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0.
Storage and retrieval • Computers store and retrieve information • Retrieval first requires finding information • Once we find the data, we often must extract what we need...
Identifying traffic patterns • Weblogs record each and every access to the Web server • Use the data to answer questions • Which pages are the most popular? • How much spam is the site experiencing? • Are certain days/times busier than others? • Are there any missing pages (bad links)? • Where is the traffic coming from?
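Weblogs (not blogs!) • Apache records each request as one line in an access_log file:
    75.194.143.61 - - [26/Sep/2011:22:38:12 -0400] "GET /cis460/wordfreq.php HTTP/1.1" 200 566
• 75.194.143.61: requesting IP (or host)
• "- -": remote identity and authenticated username ("-" when not available)
• [26/Sep/2011:22:38:12 -0400]: access timestamp
• "GET /cis460/wordfreq.php HTTP/1.1": HTTP request
• 200: server response code
• 566: size in bytes of data returned
(for server response codes, see http://en.wikipedia.org/wiki/List_of_HTTP_status_codes)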
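The field layout above can be sketched as a single regular expression with one capturing group per field. This is a minimal sketch, not Apache's own parser: the class name AccessLogParser and the method parse() are hypothetical, and the pattern assumes the exact layout of the sample line (single spaces, a bracketed timestamp, a quoted request).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AccessLogParser {
    // Groups: 1 host, 2 identity, 3 user, 4 timestamp, 5 request, 6 status, 7 size.
    // Assumption: fields are separated by single spaces, as in the sample line.
    static final Pattern LOG_LINE = Pattern.compile(
        "(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\d+|-)");

    // Returns the seven fields in order, or null if the line does not fit the layout.
    static String[] parse(String line) {
        Matcher m = LOG_LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[m.groupCount()];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 1);
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "75.194.143.61 - - [26/Sep/2011:22:38:12 -0400] "
            + "\"GET /cis460/wordfreq.php HTTP/1.1\" 200 566";
        String[] f = parse(line);
        // prints: host=75.194.143.61 status=200 bytes=566
        System.out.println("host=" + f[0] + " status=" + f[5] + " bytes=" + f[6]);
    }
}
```

Returning null for a line that does not fit keeps the sketch short; real logs may use other layouts, so the pattern would need adjusting to the server's configured log format.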
What do we do with the data? • We have many options for using the data • The access_log can feed a summarized file (e.g. CSV or TSV), a spreadsheet, a database, or a web site
How do we process the data? • Regardless of what we do with the data, we must first parse or extract the data • We could write specific code to process the data and programmatically extract the desired information • Use regular expressions to simplify processing
Regular expressions (i) • A regular expression is an expression in a “mini language” designed specifically for textual pattern matching • Support for regular expressions is available in many languages, including Java, JavaScript, C, C++, PHP, etc.
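As one illustration of how a regular expression can simplify this kind of processing, the sketch below tallies requests per page, which speaks to the "most popular pages" question. The class name PageHitCounter and the method countHits() are hypothetical, and every sample line other than the first is made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PageHitCounter {
    // Captures the requested path (group 1) from the quoted HTTP request,
    // e.g. "GET /cis460/wordfreq.php HTTP/1.1".
    static final Pattern REQUEST = Pattern.compile("\"[A-Z]+ (\\S+) [^\"]*\"");

    // Counts requests per path; lines without a recognizable request are skipped.
    static Map<String, Integer> countHits(List<String> lines) {
        Map<String, Integer> hits = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = REQUEST.matcher(line);
            if (m.find()) {
                hits.merge(m.group(1), 1, Integer::sum);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Only the first line comes from the slides; the others are invented samples.
        List<String> lines = List.of(
            "75.194.143.61 - - [26/Sep/2011:22:38:12 -0400] \"GET /cis460/wordfreq.php HTTP/1.1\" 200 566",
            "192.0.2.7 - - [26/Sep/2011:22:39:01 -0400] \"GET /cis460/wordfreq.php HTTP/1.1\" 200 566",
            "192.0.2.9 - - [26/Sep/2011:22:40:15 -0400] \"GET /missing.html HTTP/1.1\" 404 209");
        countHits(lines).forEach((page, n) -> System.out.println(n + " " + page));
    }
}
```

Here find() is used rather than matches() because the request is only one part of the line; the rest of the line is ignored.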
Regular expressions (ii) • A pattern contains numerous character groupings and is specified as a string • Patterns to match a phone number include:
• [0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]
• [0-9]{3}-[0-9]{3}-[0-9]{4}
• \d\d\d-\d\d\d-\d\d\d\d
• \d{3}-\d{3}-\d{4}
• \(\d\d\d\) \d\d\d-\d\d\d\d (the parentheses are escaped so they match literally; unescaped, they would form a group)
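The patterns above can be tried out with a short Java sketch. The class name PhonePatterns and the sample numbers are hypothetical; the patterns use a plain ASCII hyphen, and backslashes are doubled because they sit inside Java string literals. The last two constants contrast unescaped parentheses, which form a group and match no literal "(" or ")", with escaped ones.

```java
class PhonePatterns {
    // Four equivalent ways to write ddd-ddd-dddd.
    static final String[] DASHED = {
        "[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]",
        "[0-9]{3}-[0-9]{3}-[0-9]{4}",
        "\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d",
        "\\d{3}-\\d{3}-\\d{4}"
    };

    // Unescaped parentheses only group the first three digits.
    static final String GROUPED = "(\\d\\d\\d) \\d\\d\\d-\\d\\d\\d\\d";

    // Escaped parentheses match the literal characters ( and ).
    static final String LITERAL_PARENS = "\\(\\d\\d\\d\\) \\d\\d\\d-\\d\\d\\d\\d";

    // True only if every dashed pattern accepts the whole string.
    static boolean matchesAllDashed(String s) {
        for (String p : DASHED) {
            if (!s.matches(p)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(matchesAllDashed("518-555-0123"));          // true
        System.out.println("518 555-0123".matches(GROUPED));           // true
        System.out.println("(518) 555-0123".matches(GROUPED));         // false
        System.out.println("(518) 555-0123".matches(LITERAL_PARENS));  // true
    }
}
```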
Regular expressions in Java (i) • The String class in Java provides a pattern-matching method called matches() • Unlike the default matching in many other languages, matches() succeeds only if the pattern matches the entire string
    String s = "Pattern matching in Java!";
    String p = "\\w+\\s\\w+\\s\\w{2}\\s\\w+!";
    if ( s.matches( p ) )
    {
      System.out.println( "MATCH!" );
    }
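The whole-string requirement is easiest to see next to a partial match. The sketch below contrasts String.matches() with Matcher.find() from java.util.regex, which looks for the pattern anywhere in the input; the class name WholeVersusPartial and its helper methods are hypothetical.

```java
import java.util.regex.Pattern;

class WholeVersusPartial {
    // True only if the regex matches the entire string.
    static boolean matchesWhole(String s, String regex) {
        return s.matches(regex);
    }

    // True if the regex matches any substring.
    static boolean containsMatch(String s, String regex) {
        return Pattern.compile(regex).matcher(s).find();
    }

    public static void main(String[] args) {
        String s = "Pattern matching in Java!";
        System.out.println(matchesWhole(s, "\\w+"));      // false: one word is not the whole string
        System.out.println(containsMatch(s, "\\w+"));     // true: "Pattern" is found
        System.out.println(matchesWhole(s, ".*Java.*"));  // true: .* absorbs the rest
    }
}
```

Wrapping a pattern in .* on both sides is one way to get a "contains" test out of matches(), though find() states the intent more directly.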
Regular expressions in Java (ii) • Additional pattern-matching methods: • Use the replaceFirst() and replaceAll() methods to replace a pattern with a string:
    String s = "<title>Cool Web Site</title>";
    String p = "</?\\w+>";   // backslash doubled inside the Java string literal
    String result = s.replaceAll( p, "" );   // result is "Cool Web Site"
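The difference between the two methods shows up when the pattern occurs more than once. A minimal sketch, assuming the same tag pattern as on the slide; the class name TagStripper and its method names are hypothetical.

```java
class TagStripper {
    // Matches a simple opening or closing tag such as <title> or </title>.
    // Limitation: tags with attributes (e.g. <a href="...">) are not matched.
    static final String TAG = "</?\\w+>";

    // Removes every simple tag.
    static String stripAllTags(String html) {
        return html.replaceAll(TAG, "");
    }

    // Removes only the first simple tag.
    static String stripFirstTag(String html) {
        return html.replaceFirst(TAG, "");
    }

    public static void main(String[] args) {
        String s = "<title>Cool Web Site</title>";
        System.out.println(stripAllTags(s));   // Cool Web Site
        System.out.println(stripFirstTag(s));  // Cool Web Site</title>
    }
}
```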
Regular expressions in Java (iii) • Additional pattern-matching methods: • Use the split() method to split a string into an array of substrings
    String s = "The Legend of Sleepy Hollow";
    String[] words = s.split( "\\s+" );   // words is { "The", "Legend", "of", "Sleepy", "Hollow" }
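Because the delimiter is itself a regular expression, \s+ treats any run of spaces, tabs, or newlines as a single separator. The sketch below illustrates this; the class name WordSplitter and the method words() are hypothetical, and the trim() call is an added assumption to avoid an empty first element when the text starts with whitespace.

```java
class WordSplitter {
    // Splits text on runs of whitespace after trimming both ends.
    static String[] words(String text) {
        return text.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String[] tidy = words("The Legend of Sleepy Hollow");
        String[] messy = words("  The   Legend\tof Sleepy\nHollow ");
        System.out.println(String.join("|", tidy));   // The|Legend|of|Sleepy|Hollow
        System.out.println(String.join("|", messy));  // The|Legend|of|Sleepy|Hollow
    }
}
```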