260 likes | 333 Views
Web Programming Week 13. Old Dominion University Department of Computer Science CS 418/518 Fall 2010 Martin Klein<mklein@cs.odu.edu> 11/23/10. Relational Data Model is a Special Case…. SELECT pi.fname, m.aces, m.unforced_errors, m.winners FROM player_info pi, matches m
E N D
Web ProgrammingWeek 13 Old Dominion University Department of Computer Science CS 418/518 Fall 2010 Martin Klein<mklein@cs.odu.edu> 11/23/10
Relational Data Model is a Special Case… SELECT pi.fname, m.aces, m.unforced_errors, m.winners FROM player_info pi, matches m WHERE pi.fname = “Andre” AND m.opponent_name = “Sampras” AND m.year = “2002”;
Precision and Recall • how much extra stuff did you get? • how much did you miss? source: http://www.hsl.creighton.edu/hsl/Searching/Recall-Precision.html
Precision and Recall 10 documents in the index are relevant search returns 20 documents 5 of which are relevant half of the relevant documents were retrieved 1 out of 4 retrieved documents are relevant source: http://www.hsl.creighton.edu/hsl/Searching/Recall-Precision.html
Precision and Recall 1 Precision figure 1.2 in FBY 0 1 Recall
Why Isn’t Recall Always 100%? Virginia Agricultural and Mechanical College? Virginia Agricultural and Mechanical College and Polytechnic Institute? Virginia Polytechnic Institute? Virginia Polytechnic Institute and State University? Virginia Tech?
CREATE Table mysql> CREATE TABLE ODUtennis( -> id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY, -> title VARCHAR(200), -> body TEXT, -> ); Query OK, 0 rows affected (0.00 sec)
INSERT mysql> INSERT INTO ODUtennis (title, body) VALUES -> ('Monarchs Eliminated from 2010 CAA Tournament in Quarterfinals', 'The fourth-seeded Monarchs ...'), -> ('Monarchs Close Out Season With 4-3 Win Over South Alabama', 'ODU closes out the 2010 tennis schedule...'), -> ('ODU Edged By DePaul in Mens Tennis Action, 4-3', 'Junior Tobias Fanselow was the other Monarch…'), -> ('ODU Mens Tennis Drops Delaware, 5-2, in CAA Action', 'The Old Dominion mens tennis team…doubles…Monarchs...'), -> ('Mens Tennis Nipped by #64 UNC Wilmington, 4-3', 'The two teams split the first two doubles matches…Monarchs…doubles...'); Query OK, 5 rows affected (0.00 sec) Records: 5 Duplicates: 0 Warnings: 0
LIKE & REGEXP • We can search rows with the “LIKE” (or “REGEXP”) operator • http://dev.mysql.com/doc/refman/5.0/en/pattern-matching.html • for tables of any size, this will be s-l-o-w
Example 1 mysql> SELECT id, title FROM ODUtennis WHERE title LIKE 'Monarchs Eliminated from 2010 CAA Tournament in Quarterfinals'; mysql> SELECT id, title FROM ODUtennis WHERE title REGEXP 'Monarchs Eliminated from 2010 CAA Tournament in Quarterfinals'; +----+--------------------------------------------------------------------------------------+ | id | title | +----+--------------------------------------------------------------------------------------+ | 1 | Monarchs Eliminated from 2010 CAA Tournament in Quarterfinals | +----+--------------------------------------------------------------------------------------+ 1 row in set (0.00 sec)
Example 2 mysql> SELECT id, title FROM ODUtennis WHERE title REGEXP 'Monarchs'; +----+---------------------------------------------------------------------------------------+ | id | title | +----+---------------------------------------------------------------------------------------+ | 1 | Monarchs Eliminated from 2010 CAA Tournament in Quarterfinals | | 2 | Monarchs Close Out Season With 4-3 Win Over South Alabama | +----+----------------------------------------------------------------------------------------+ 2 rows in set (0.01 sec) BUT mysql> SELECT id, title FROM ODUtennis WHERE title LIKE 'Monarchs'; Empty set (0.00 sec)
Full-Text Search – The Better Way • MATCH()…AGAINST() • Performs a natural language search over index • Index = set of one ore more columns of the same table • Index as argument to MATCH() • Search string as argument to AGAINST() • If used in WHERE clause result returned in order of relevance score • Relevance: similarity between search string and index row
CREATE Table mysql> CREATE TABLE ODUtennis ( -> id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY, -> title VARCHAR(200), -> body TEXT, -> FULLTEXT (title,body) -> ); Query OK, 0 rows affected (0.00 sec) • can only create FULLTEXT on CHAR, VARCHAR or TEXT columns • “title” and “body” still available as regular columns • if you want to search only on “title”, you need to create a separate index
INSERT mysql> INSERT INTO ODUtennis (title, body) VALUES -> ('Monarchs Eliminated from 2010 CAA Tournament in Quarterfinals', 'The fourth-seeded Monarchs ...'), -> ('Monarchs Close Out Season With 4-3 Win Over South Alabama', 'ODU closes out the 2010 tennis schedule...'), -> ('ODU Edged By DePaul in Mens Tennis Action, 4-3', 'Junior Tobias Fanselow was the other Monarch…'), -> ('ODU Mens Tennis Drops Delaware, 5-2, in CAA Action', 'The Old Dominion mens tennisteam…doubles…Monarchs...'), -> ('Mens Tennis Nipped by #64 UNC Wilmington, 4-3', 'The two teams split the first two doubles matches…Monarchs…doubles...'); Query OK, 5 rows affected (0.00 sec) Records: 5 Duplicates: 0 Warnings: 0
MATCH .. AGAINST mysql> SELECT * FROM ODUtennis WHERE MATCH(title,body) AGAINST('fanselow'); +----+------------------------------------------------+------------------------------------------------------------------------------------+ | id | title | body | +----+------------------------------------------------+------------------------------------------------------------------------------------+ | 3 | ODU Edged By DePaul in Mens Tennis Action, 4-3 | Junior Tobias Fanselow was the other … | +----+------------------------------------------------+------------------------------------------------------------------------------------+ 1 row in set (0.00 sec) mysql> SELECT * FROM ODUtennis WHERE MATCH(title,body) AGAINST('Monarchs'); Empty set (0.00 sec) why?!
Ranking • If the word appears in > 50% of the rows then the word is considered a “stop word” and is not matched (unless you are in Boolean mode) • this makes sense for large collections (the word is not a good discriminator of records), but can lead to unexpected results for small collections
Stopwords • Stopwords exist in stoplists or negative dictionaries • Idea: remove low semantic content • index should only have “important stuff” • What not to index is domain dependent, but often includes: • “small” words: a, and, the, but, of, an, very, etc. • NASA ADS example: • http://adsabs.harvard.edu/abs_doc/stopwords.html • MySQL full-text index: • http://dev.mysql.com/doc/refman/5.0/en/fulltext-stopwords.html
Stopwords • Punctuation, numbers often stripped or treated as stopwords • precision suffers on searches for: • NASA TM-3389 • F-15 • X.500 • .NET • Tree::Suffix • MySQL also treats words < 4 characters as stopwords • too bad for: “Liu”, “ORF”, “DEA”, etc.
Getting the Rank mysql> SELECT id, MATCH(title,body) AGAINST('doubles') from ODUtennis; +----+---------------------------------------------------+ | id | MATCH(title,body) AGAINST('doubles') | +----+---------------------------------------------------+ | 1 | 0 | | 2 | 0 | | 3 | 0 | | 4 | 0.28169098496437 | | 5 | 0.61670464277267 | +----+---------------------------------------------------+ 5 rows in set (0.00 sec)
Getting the Rank in Order mysql> SELECT id, title, MATCH(title,body) AGAINST('doubles') AS score FROM ODUtennis WHERE MATCH(title,body) AGAINST('doubles'); +----+-------------------------------------------------------------------------+----------------------------+ | id | title | score | +----+-------------------------------------------------------------------------+----------------------------+ | 5 | Mens Tennis Nipped by #64 UNC Wilmington, 4-3 | 0.61670464277267 | | 4 | ODU Mens Tennis Drops Delaware, 5-2, in CAA Action | 0.28169098496437 | +----+-------------------------------------------------------------------------+----------------------------+ 2 rows in set (0.00 sec)
Boolean Mode mysql> SELECT id, title FROM ODUtennis WHERE MATCH(title,body) AGAINST('+Monarchs' IN BOOLEAN MODE); +----+-----------------------------------------------------------------------------------------+ | id | title | +----+-----------------------------------------------------------------------------------------+ | 1 | Monarchs Eliminated from 2010 CAA Tournament in Quarterfinals | | 2 | Monarchs Close Out Season With 4-3 Win Over South Alabama | | 4 | ODU Mens Tennis Drops Delaware, 5-2, in CAA Action | | 5 | Mens Tennis Nipped by #64 UNC Wilmington, 4-3 | +----+-----------------------------------------------------------------------------------------+ 4 rows in set (0.00 sec) • Does not use the 50% threshold • Does use stopwords, length limitation • Operator list: • http://dev.mysql.com/doc/refman/5.0/en/fulltext-boolean.html
Blind Query Expansion (AKA Automatic Relevance Feedback) • General assumption: user query is insufficient • Too short • Too generic • Too many results • How does one keep up with Virginia Tech’s multiple names / nicknames? • Hokies, Fighting Gobblers, VPI, VPI&SU, Va Tech, VT • Idea: • run the query with the requested terms • then take the results and • re-run the query with the most relevant terms from the initial results
Blind Query Expansion (AKA Automatic Relevance Feedback) mysql> SELECT * FROM ODUtennis WHERE MATCH(title,body) AGAINST('fanselow' WITH QUERY EXPANSION); +----+-------------------------------------------------------------------------+----------------------------------------------------------------+ | id | title | body | +----+-------------------------------------------------------------------------+-----------------------------------------------------------------+ | 3 | ODU Edged By DePaul in Mens Tennis Action, 4-3 | Junior Tobias Fanselow was the other Monarch | | | to win, earning the win at No. 1 singles over | | | Alasdair Graetz, 6-2, 7-6 | ----------------------------------------------------------------------------------------------------------------------------------------------------+ | 4 | ODU Mens Tennis Drops Delaware, 5-2, in CAA Action | The Old Dominion mens tennis team improved | | | to 14-8 overall and 2-3 in the CAA with a 5-2 | | | victory over the Delaware Fighting Blue Hens | | | on Wednesday. After dropping all three doubles | | | matches, the Monarchs went on to win five of | | | the six singles matches for the victory. | +----+----------------------------------------------------+--------------------------------------------------------------------------------------+ 2 rows in set (0.00 sec)
For More Information… • MySQL documentation: • http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html • Chapter 12/13 “Building a Content Management System” • CS 751/851 “Introduction to Digital Libraries” • http://www.cs.odu.edu/~mln/teaching/ • esp. “Information Retrieval Concepts” lecture • Is MySQL the right tool for your job? • http://lucene.apache.org/ MySQL examples in this lecture based on those found at dev.mysql.com content snippets taken from www.odusports.com