130 likes | 418 Views
Postgresql. …and postgis & full text search & f uzzy comparisons. Postgis. For manipulating 2D/3D spatial data Points, lines, and polygons formed from points and lines Can perform union, intersection, operations Can project shapes into 2D areas Has a 3D geometry type (relatively new)
E N D
Postgresql …and postgis & full text search & fuzzy comparisons
Postgis • For manipulating 2D/3D spatial data • Points, lines, and polygons formed from points and lines • Can perform union, intersection, operations • Can project shapes into 2D areas • Has a 3D geometry type (relatively new) • Can calculate accurate distances in meters • Works with an open source server that allows folks to share geospatial data • Command line interface • Also supports some forms of raster data • Provides spatial indices • Has a notion of a geometric column
Queries SELECT superhero.name FROM city, superhero WHEREST_Contains(city.geom, superhero.geom) and city.name = 'Gotham'; SELECTAsBinary(the_geom) as wkb_geometryFROMriver AS r, state AS s WHERE intersects(r.the_geom, s.the_geom)
Mapnik • Used for OSM (open street map) data and uses postgis • Mapnik is an open source system for rendering maps • Used to design maps • Written in C++ • It renders maps from postgis databases
Next: full text and approximate text search • But first, not to be confused with the Like operator • Used % as the wild card • Or with regular expressions for character string comparison
Full text search • First, you index the words in a document and create an array of lexemes • Second, specify a boolean phrase using and, or, not, and parens • We typically don’t index “stop” words like and, or, the, etc. • Dictionaries are used to find roots of related words, like dead and dying • Thesauruses dictionaries are used to for recognition of domain-specific and similar words
documents • A document is a text attribute in a row of a table • Often we use part of a document or concatenate various parts of documents
Details: dictionaries • Define stop words that should not be indexed • Map synonyms to a single word. • Map phrases to a single word using a thesaurus. • Map different variations of a word to a canonical form
Searching • Uses a match operator - @@ • Basic search consists of asking about the relationship to a vector of words to a given document, which is also a vector • The vector can have and, or, etc. in it • tsvector – document – normalized lexemes • tsquery – query
Examples SELECT title FROM pgweb WHERE to_tsvector(title || ' ' || body) @@ to_tsquery('create & table') ORDER BY last_mod_date DESC LIMIT 10; select the ten most recent documents that contain create and table in the title or body Results can be ranked
Recent addition: fuzziness • soundex(text) returns text • Converts a string to its Soundexcode • Based on pronunciation • difference(text, text) returns int • converts two strings to their Soundex codes and then reports the number of matching code positions • 0 is a no match • 4 is a full match • Def: A phonetic coding system intended to suppress spelling variation and determining the relationship between two (similar) words
Levenshtein • Levenshtein distance is a metric for evaluating the difference between two sequences, in particular, words • E.g.: test=# SELECT levenshtein('GUMBO', 'GAMBOL'); • E.g.: SELECT * FROM some_table WHERE levenshtein(code, 'AB123-lHdfj') <= 3 ORDER BY levenshtein(code, 'AB123-lHdfj') LIMIT 10 • Used in particular, to detect nicknames
Metaphone • E.g., metaphone(text source, intmax_output_length) returns text • Similar to soundex • Used to classify words according to their english pronunciation • Apparently better for non-english languages, compared to soundex • E.g.: SELECT * FROM users WHERE METAPHONE(users.first_name, 2) = METAPHONE('Willem', 2) should detect similarity to word William