Tuesday, February 11, 2014

Seven expert tools for advanced string search in PostgreSQL

http://www.openlogic.com/wazi/bid/334546/seven-expert-tools-for-advanced-string-search-in-postgresql


You can improve PostgreSQL text search performance by executing searches directly in the database interface layer. Advanced text search mechanisms include SELECT LIKE queries, queries with regular expressions, the Levenshtein algorithm, Trigram, TSvector and TSquery, and Metaphone.

To get started, first make sure you have installed the latest stable version of PostgreSQL on your CentOS server. Pick the correct RPM package for your architecture from the download page and run the necessary commands to install PostgreSQL:

wget http://yum.postgresql.org/9.3/redhat/rhel-6-i386/pgdg-centos93-9.3-1.noarch.rpm
rpm -ivH pgdg-centos93-9.3-1.noarch.rpm
yum install postgresql93-devel postgresql93-server postgresql93-contrib

Next, initialize the PostgreSQL cluster, start the server, enter the command-line interface, and create a new database:

service postgresql-9.3 initdb
Initializing database:                                     [  OK  ]
/etc/init.d/postgresql-9.3 start
Starting postgresql-9.3 service:                           [  OK  ]
su postgres
bash-4.1$ psql
postgres=# create database vehicles;
CREATE DATABASE

Connect to the new database, set up tables, and add some sample data for a test case:
postgres=# \c vehicles;
You are now connected to database "vehicles" as user "postgres".
vehicles=# CREATE TABLE cars(id SERIAL PRIMARY KEY NOT NULL, name TEXT NOT NULL, description TEXT NOT NULL);
CREATE TABLE
vehicles=# INSERT INTO cars(name, description) VALUES ('BMW X5', 'Luxury German SUV with front-engine and four-wheel-drive'), ('Mercedes SLR', 'Luxury German grand tourer with front-engine and rear-wheel-drive'), ('Cadillac Escalade', 'Luxury USA SUV with front-engine and rear-wheel-drive or four-wheel-drive'),('Volkswagen Phaeton', 'Luxury German sedan with front-engine and four-wheel drive'), ('Ford Probe', 'USA sports coupe with front-engine and rear-wheel drive'), ('Honda City', 'Japanese compact car with front-engine and front-wheel drive'), ('KIA Sportage', 'South-Korean crossover with front-engine and rear-wheel-drive or four-wheel-drive'),('Skoda Yeti', 'Czech compact SUV with front-engine and front-wheel-drive or four-wheel-drive');
INSERT 0 8
vehicles=# CREATE TABLE engines_manufacturers(id SERIAL PRIMARY KEY NOT NULL, manufacturer_name TEXT NOT NULL);
CREATE TABLE
vehicles=# INSERT INTO engines_manufacturers(manufacturer_name) VALUES ('BMW'),('McLaren'),('GM'),('Mercedes Benz'),('Ford Motor Company'),('VW'),('Hyundai-Kia'),('Peugeot Citroen Moteurs'),('Honda Motor Company');
INSERT 0 9
vehicles=# CREATE TABLE cars_engines(cars_id INT REFERENCES cars NOT NULL, engines_id INT REFERENCES engines_manufacturers NOT NULL, UNIQUE(cars_id,engines_id));
CREATE TABLE
vehicles=# INSERT INTO cars_engines(cars_id,engines_id) VALUES(1,1),(2,2),(3,3),(4,6),(5,5),(6,9),(7,7),(8,6);
INSERT 0 8

Standard string search methods

PostgreSQL comes with two built-in mechanisms for string searches. Let's start with the one based on the standard SQL LIKE query structure. LIKE and ILIKE (case-insensitive) syntax can include the wild-card characters % and _ before and after the searched string. The percent sign replaces any given sequence of characters, while the underscore replaces a single character. For instance, this query lists records that include "luxury" and "suv" in any case in their description fields:

vehicles=# SELECT name, description FROM cars WHERE description ILIKE '%luxury%suv%';
       name        |                                description
-------------------+---------------------------------------------------------------------------
 BMW X5            | Luxury German SUV with front-engine and four-wheel-drive
 Cadillac Escalade | Luxury USA SUV with front-engine and rear-wheel-drive or four-wheel-drive
(2 rows)

You can also search for strings in text based on POSIX-style regular expressions. The ~ match operator must precede a regular expression. You can use the optional ! operator to inverse the logic (not match) and the * operator for case-insensitive search. The following query list all non-German cars from our database built by VW:

vehicles=# SELECT cars.name as Cars, engines_manufacturers.manufacturer_name as Manufacturer FROM cars INNER JOIN cars_engines ON cars.id = cars_engines.cars_id INNER JOIN engines_manufacturers ON cars_engines.engines_id = engines_manufacturers.id WHERE engines_manufacturers.manufacturer_name ILIKE 'vw' AND cars.description !~* '.*german.*';
    cars    | manufacturer
------------+--------------
 Skoda Yeti | VW
(1 row)

Fuzzy string matching

In addition to the standard solutions, PostgreSQL has several contributed packages that allow you to perform more complicated text searches. You'll find them in the PostgreSQL contrib package (in our case postgresql93-contrib), which we installed earlier.

The Levenshtein algorithm calculates the distance between two strings based on the total number of steps that should be completed to change one of the strings into the other. To determine the number of steps, it adds one point for each new character added, existing character removed, and replaced character.

Install and verify the fuzzystrmatch extension, which implements the Levenshtein algorithm, by running the command CREATE EXTENSION fuzzystrmatch;. You can then use the function to see the closest car name to a particular string – for example, "Sportage." Notice that even the difference in case adds one point to the calculation:

vehicles=# SELECT id, name FROM cars WHERE levenshtein(name, 'sportage') <=5;
 id |     name
----+--------------
  7 | KIA Sportage
(1 row)
vehicles=# SELECT id, name FROM cars WHERE levenshtein(name, 'sportage') <=4;
 id | name
----+------
(0 rows)
vehicles=# SELECT id, name FROM cars WHERE levenshtein(lower(name), lower('sportage')) <=4;
 id |     name
----+--------------
  7 | KIA Sportage
(1 row)

Trigram

Another tool for string searches, Trigram, splits the chosen string into substrings of three consecutive characters. The query result lists the string with most matching trigrams. That makes it especially useful if you enter a string in your query that may have slight misspellings. According to the PostgreSQL documentation each word is considered to have two spaces prefixed and one space suffixed when determining the set of trigrams contained in the string.

Install Trigram by executing the command CREATE EXTENSION pg_trgm; then query on the string "Ford":

vehicles=# select show_trgm('Ford');
          show_trgm
-----------------------------
 {"  f"," fo",for,ord,"rd "}
(1 row)

You can directly use the specific Trigram SELECT FROM WHERE % construction, but when you have a really large table with many records, you should optimize the searching process by creating a special index on the text column you plan to search before running the actual SELECT query. The Generalized Index Search Tree (GIST) provides fast text searches by converting the text into a vector of trigrams:

vehicles=# CREATE INDEX cars_name_trigram ON cars USING gist(name gist_trgm_ops);
CREATE INDEX

You can then look for a car model in your table even without spelling it correctly. Trigram's default similarity threshold is 0.3. You can change it to anything between 0 and 1; the smaller it gets, the more spelling error-tolerant it becomes.

vehicles=# select show_limit();
 show_limit
------------
        0.3
(1 row)
vehicles=# SELECT * FROM cars WHERE name % 'Folkswagen';
 id |        name        |                        description
----+--------------------+------------------------------------------------------------
  4 | Volkswagen Phaeton | Luxury German sedan with front-engine and four-wheel drive
(1 row)
vehicles=# SELECT * FROM cars WHERE name % 'Folkswagon';
 id | name | description
----+------+-------------
(0 rows)
vehicles=# select set_limit(0.2);
 set_limit
-----------
       0.2
(1 row)
vehicles=# SELECT * FROM cars WHERE name % 'Folkswagon';
 id |        name        |                        description
----+--------------------+------------------------------------------------------------
  4 | Volkswagen Phaeton | Luxury German sedan with front-engine and four-wheel drive
(1 row)

TSvector and TSquery

TSvector and TSquery allow full-text searches based on several words from an entire sentence. TSvector splits the full text against which you want to run the query into an array of tokens (words) called lexemes, which are stored in TSvector along with their position in the text. TSquery performs the actual query. You can specify the language whose dictionary should be used and the words that should be matched. The words are united with the combining operator &. The vector and the query interact via the full-text query operator @@. For example, you can see the luxury SUVs from your database with a query such as:

vehicles=# SELECT name, description FROM cars WHERE to_tsvector(description) @@ to_tsquery('english', 'luxury & SUVs');
       name        |                                description
-------------------+---------------------------------------------------------------------------
 BMW X5            | Luxury German SUV with front-engine and four-wheel-drive
 Cadillac Escalade | Luxury USA SUV with front-engine and rear-wheel-drive or four-wheel-drive
(2 rows)

As you can see, PostgreSQL returns the correct result even if one of the searched words is written in a plural form recognized as such by the dictionary used.

You can run the queries for text searches in different languages. PostgreSQL bundles more than a dozen text search dictionaries:

vehicles=# \dFd
                             List of text search dictionaries
   Schema   |      Name       |                        Description
------------+-----------------+-----------------------------------------------------------
 pg_catalog | danish_stem     | snowball stemmer for danish language
 pg_catalog | dutch_stem      | snowball stemmer for dutch language
 pg_catalog | english_stem    | snowball stemmer for english language
 pg_catalog | finnish_stem    | snowball stemmer for finnish language
 pg_catalog | french_stem     | snowball stemmer for french language
 pg_catalog | german_stem     | snowball stemmer for german language
 pg_catalog | hungarian_stem  | snowball stemmer for hungarian language
 pg_catalog | italian_stem    | snowball stemmer for italian language
 pg_catalog | norwegian_stem  | snowball stemmer for norwegian language
 pg_catalog | portuguese_stem | snowball stemmer for portuguese language
 pg_catalog | romanian_stem   | snowball stemmer for romanian language
 pg_catalog | russian_stem    | snowball stemmer for russian language
 pg_catalog | simple          | simple dictionary: just lower case and check for stopword
 pg_catalog | spanish_stem    | snowball stemmer for spanish language
 pg_catalog | swedish_stem    | snowball stemmer for swedish language
 pg_catalog | turkish_stem    | snowball stemmer for turkish language
(16 rows)

Here again, full-text search with really large tables can be quite slow. You can use the Generalized Inverted Index (GIN) to significantly increase the searching speed and improve the performance. It needs more time to be generated compared to GIST, but it runs faster. GIN indexes are preferred when you work with mostly static data, since queries will be completed much faster. GIST indexes are updated much faster when new data is inserted in the corresponding table, so they are more suitable for dynamic data.

vehicles=# CREATE INDEX cars_vector_search ON cars USING gist(to_tsvector('english', description));                          CREATE INDEX

Metaphone

Finally, the Metaphone algorithm, which is installed with the fuzzystrmatch extension, matches the searched string by the way it sounds. Metaphone allows more liberty than the other text searching tools described above when it comes to the way a searched word is spelled. It is the most spelling error-tolerant tool listed in the article.

vehicles=# SELECT metaphone('Honda City', 5);
 metaphone
-----------
 HNTST
(1 row)

A similar function, dmetaphone (short for double metaphone) generates two output strings based on the way a source string sounds, from the results of the dmetaphone() and dmetaphone_alt() functions. Usually they return same results, but when it comes to non-English names the results might be different, depending on the pronunciation. Using dmetaphone instead of metaphone gives you better chance to match the misspelled word with the original one stored in your database.
You can run several sample queries to become acquainted with the functions:

vehicles=# select manufacturer_name, metaphone(manufacturer_name,11), dmetaphone(manufacturer_name), dmetaphone_alt(manufacturer_name) from engines_manufacturers where engines_manufacturers.id=8;
    manufacturer_name     |  metaphone  | dmetaphone | dmetaphone_alt
-------------------------+-------------+------------+----------------
 Peugeot Citroen Moteurs | PJTSTRNMTRS | PJTS       | PKTS
(1 row)
vehicles=# SELECT manufacturer_name FROM engines_manufacturers WHERE metaphone(engines_manufacturers.manufacturer_name, 2) = metaphone('Pejo', 2);
    manufacturer_name
-------------------------
 Peugeot Citroen Moteurs
(1 row)

With a word like Peugout, in this example, the pronunciation and the spelling differ. In this case dmetaphone gives you two options instead of one with metaphone. In the second query, if you lower the max_output_length value, you have a better chance to match the manufacturer name you are looking for.

Next, you can search for the car models of a manufacturer which name is misspelled:

SELECT cars.name as Cars, engines_manufacturers.manufacturer_name as Manufacturer FROM cars INNER JOIN cars_engines ON cars.id = cars_engines.cars_id INNER JOIN engines_manufacturers ON cars_engines.engines_id = engines_manufacturers.id WHERE metaphone(engines_manufacturers.manufacturer_name, 6) = metaphone('FW', 6);
        cars        | manufacturer
--------------------+--------------
 Volkswagen Phaeton | VW
 Skoda Yeti         | VW
(2 rows)

To decide which algorithm to use, carefully analyze the data in your database records and the requirements for your project defined during the design stage. Sometimes one solutions works better and faster that the other. Or you can combine several solutions in a single query in order to get the best result.

The techniques described in this article allow application developers to save programming time and improve performance by completing text searches directly on the database level. Using the PostgreSQL syntax makes code easier for other IT specialists to understand.

No comments:

Post a Comment