Versions used
- Django 1.10
- Python 3.6
- PostgreSQL 9.6
Scope
Our base use case is definitely a common case in web interfaces: we want to offer our users a single text search field enabling them to search over a collection of objects and fields. A simple example would be looking for a customer using a single field that could match name, surname, email, phone number, internal reference or UUID, etc. More complex needs arise quickly, like using a single input to search over different objects in your database: invoices, attachments, customer-care emails, etc. All hell breaks loose when you want to perform such lookups on multi-lingual fields. Oh, and we want completion, right?
We all use this exact feature every day, every time we Google something. Any non-technical client thinks it's easy to implement and wants it. Well guess what, it's not easy to pull off: that's why Google is a thing. However all of us want to give the best to our users, so we rely on the only open-source solution existing in the field: Lucene (or its fancy packaged app, Elasticsearch).
From a structural and operational point of view, though, it's a pain to require a Redis instance for caching, an Elasticsearch cluster for text indexing, and so on, when you already have a perfectly good PostgreSQL server running under your Django app.
Explanations on the serious stuff
PostgreSQL text completion and search
PostgreSQL offers very serious mechanisms for text indexing. One of them is full-text search using the tsvector and tsquery types and dictionaries. You can read all about it in the text search documentation. It enables you to search a text for words or sentences while ignoring conjugation, case and accents (for accented languages like French, German, Spanish…), plural forms, punctuation and most common cohesive devices like determiners, pronouns and conjunctions.
The second mechanism PostgreSQL provides is trigram matching, which you can leverage using the pg_trgm extension. This enables word completion and correction of misspelled words. Combining these two mechanisms, you can offer a powerful and very smooth text search.
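To get an intuition for how trigram matching tolerates typos, here is a hypothetical pure-Python sketch of pg_trgm-style similarity (illustration only; PostgreSQL's real implementation lives in C, with its own padding and normalization rules):

```python
def trigrams(word):
    """Extract the trigram set of a word, padded roughly the way
    pg_trgm does: two spaces before the word and one space after."""
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a, b):
    """Jaccard-style similarity between two words' trigram sets,
    analogous to pg_trgm's similarity() function."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

# A prefix shares most of its trigrams with the full word, so "bet"
# scores much closer to "bete" than an unrelated word does:
# similarity("bete", "bet") == 0.5, well above similarity("bete", "best")
```

Because the score degrades gracefully with each differing trigram, a threshold on it catches both partial words and misspellings, which is exactly what we use for completion below.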
To dig deeper into these features, I recommend shisaa's and Rachid Belaid's excellent blog posts.
Django contrib.postgres
The contrib.postgres module of Django is getting more complete with every version. It enables you to use JSONB or UUID fields, along with tsvector- and tsquery-based queries.
JSONB translations with nece
If you're handling some kind of i18n in your project, you need to choose a technical stack for translation and localization. Historically, most of us Django users were relying on the Parler package. This was nice and all, but now that JSON columns are around, it seems ludicrous to use a join table as Parler does instead of putting all the translations in a well-organized JSON field. Well, that's exactly what nece does, so let's use it for now. Note that the techniques described here can very well be adapted to Parler, though.
How to seriously setup the serious stuff
Basic context
Let's say you have a basic model named foo, with a reference and some translations.
CREATE EXTENSION pgcrypto;

CREATE TABLE foo(
    uid uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    reference text,
    translations jsonb
);
Note you had the good taste of using UUIDs as primary keys. Just a taunt, that's not required whatsoever.
Dictionaries
Why use dictionaries? That's because we want to use the correct stemming for each language. This way we can match plurals and conjugations nicely. Hence, we need some kind of mapping between nece ISO language codes and PostgreSQL text configuration names.
CREATE TABLE text_config(
    config REGCONFIG,
    lang TEXT
);

INSERT INTO text_config VALUES
    ('english', 'en-gb'),
    ('english', 'en-us');
PostgreSQL uses snowball stemming dictionaries. That's awesome, except for those funny alterations named accents, which the stemmers leave alone. For some languages, we need to create another configuration handling these little peculiarities.
CREATE EXTENSION unaccent;

CREATE TEXT SEARCH CONFIGURATION fr_FR (copy=french);

ALTER TEXT SEARCH CONFIGURATION fr_FR
    ALTER MAPPING FOR asciihword, asciiword, hword, hword_asciipart, hword_part, word
    WITH unaccent, french_stem;

INSERT INTO text_config VALUES ('fr_FR', 'fr-fr');
We will also need a simple unaccented configuration:
CREATE TEXT SEARCH CONFIGURATION simple_unaccent (copy=simple);

ALTER TEXT SEARCH CONFIGURATION simple_unaccent
    ALTER MAPPING FOR asciihword, asciiword, hword, hword_asciipart, hword_part, word
    WITH unaccent, simple;
Search view
To build a single input search over multiple fields, we want to query a ts_vector aggregating all our translated fields (in translations) and the non translated ones (like reference). The easiest way to do it is to use a materialized view. Nice twist with this technique: you can easily build a view for a search on multiple models.
CREATE MATERIALIZED VIEW foo_search AS
    SELECT
        foo.uid,
        to_tsvector(foo.reference)
            || tsvector_agg(to_tsvector(text_config.config, field.value)) AS vector
    FROM foo,
        jsonb_each(foo.translations) AS lang,
        jsonb_each_text(lang.value) AS field,
        text_config
    WHERE text_config.lang = lang.key
    GROUP BY foo.uid;

-- REFRESH MATERIALIZED VIEW CONCURRENTLY (used below) requires a unique index:
CREATE UNIQUE INDEX foo_search_uid_idx ON foo_search (uid);

CREATE INDEX foo_search_idx ON foo_search USING GIN (vector);
Let's walk through it. First, we use PostgreSQL's JSON functions to unpack our translations field. Nece stores translations like this:
{
    "en-gb": {
        "title": "The Beauty and the Beast",
        "description": "Stockholm syndrome works. (thx @DanSlott)"
    },
    "fr-fr": {
        "title": "La Belle et la Bête",
        "description": "Le syndrome de Stockholm, ça marche."
    }
}
so we need to jsonb_each twice to get the actual text (field.value). We also use our text_config table to get the PostgreSQL text configuration matching the language (lang.key). Then we aggregate all these correctly-stemmed vectors using our home-made aggregate tsvector_agg:
CREATE FUNCTION concat_tsvectors(x tsvector, y tsvector)
RETURNS tsvector AS $$
BEGIN
    RETURN x || y;
END;
$$ LANGUAGE plpgsql;

CREATE AGGREGATE tsvector_agg (
    basetype = tsvector,
    sfunc = concat_tsvectors,
    stype = tsvector,
    initcond = ''
);
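To see what the double unpacking actually produces, here is a hypothetical pure-Python equivalent of the two jsonb_each calls and the text_config join (the TEXT_CONFIG dict mirrors our mapping table; names are illustrative):

```python
# Mirrors the text_config table: nece language code -> PostgreSQL config.
TEXT_CONFIG = {"en-gb": "english", "en-us": "english", "fr-fr": "fr_FR"}

def unpack(translations):
    """Yield one (config, text) pair per translated field, like the
    FROM/WHERE clauses of the foo_search view."""
    for lang, fields in translations.items():       # first jsonb_each
        config = TEXT_CONFIG.get(lang)
        if config is None:                          # no config row: the join drops it
            continue
        for _name, value in fields.items():         # second jsonb_each_text
            yield (config, value)

translations = {
    "en-gb": {"title": "The Beauty and the Beast"},
    "fr-fr": {"title": "La Belle et la Bête"},
}
pairs = sorted(unpack(translations))
# pairs == [("english", "The Beauty and the Beast"), ("fr_FR", "La Belle et la Bête")]
```

Each pair is then stemmed with the right configuration and the resulting vectors are folded together by tsvector_agg, exactly as the view does in SQL.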
Finally, you will need to keep your materialized view up to date by refreshing it on a schedule, for instance with a cron job (here, every hour at minute 5):
5 * * * * psql -c "REFRESH MATERIALIZED VIEW CONCURRENTLY foo_search"
This is the drawback of this simple method: new data is not immediately searchable, and if your table grows too big, the refresh will take ages.
Alternative: search table
The materialized view trick is satisfactory in a lot of cases, but if it's not good enough, you will need a table and triggers.
Tip
There are simpler alternatives for simpler cases. If your table is small enough and lookups scarce enough, you may simply use a "normal" view instead of the materialized one. If you only need to search the nece translations field, you can directly use a GIN index:
CREATE OR REPLACE FUNCTION nece_to_vector(translations jsonb)
RETURNS tsvector AS $$
DECLARE
    tsv tsvector;
BEGIN
    SELECT tsvector_agg(to_tsvector(text_config.config, field.value))
    INTO tsv
    FROM jsonb_each(translations) AS lang,
        jsonb_each_text(lang.value) AS field,
        text_config
    WHERE text_config.lang = lang.key;
    RETURN tsv;
END
$$ LANGUAGE plpgsql IMMUTABLE;
-- Declared IMMUTABLE so it can be used in an index. Strictly speaking it
-- reads text_config, so rebuild the index if that table ever changes.

CREATE INDEX foo_search_idx ON foo USING GIN (nece_to_vector(translations));
You can probably use an index in more complicated cases (you'll have to write more functions, as they have to be immutable to be used in the index), but it quickly becomes more intricate than the solution below.
In this case you will have to build the table by hand and use a trigger and a function.
CREATE TABLE foo_search(
    uid uuid REFERENCES foo ON DELETE CASCADE,
    vector tsvector
);

CREATE INDEX foo_search_idx ON foo_search USING GIN (vector);

CREATE OR REPLACE FUNCTION reindex_foo_translations()
RETURNS TRIGGER AS $$
DECLARE
    translations jsonb;  -- nece translations field
    tsv tsvector;        -- the search vector to store
BEGIN
    translations := NEW.translations;
    SELECT tsvector_agg(to_tsvector(text_config.config, field.value))
    INTO tsv
    FROM jsonb_each(translations) AS lang,
        jsonb_each_text(lang.value) AS field,
        text_config
    WHERE text_config.lang = lang.key;
    UPDATE foo_search SET vector = tsv WHERE uid = NEW.uid;
    IF NOT FOUND THEN
        INSERT INTO foo_search VALUES (NEW.uid, tsv);
    END IF;
    RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER reindex_search
    AFTER INSERT OR UPDATE ON foo
    FOR EACH ROW EXECUTE PROCEDURE reindex_foo_translations();
Completion view
Now your search view (or table) is fine for full-text searches, but you cannot use it for word completion or partial word lookups. You will need a trigram search for this.
CREATE EXTENSION pg_trgm;

CREATE MATERIALIZED VIEW foo_complete AS
    SELECT word FROM ts_stat(
        'SELECT to_tsvector(''simple_unaccent'', field.value)
         FROM foo,
             jsonb_each(foo.translations) AS lang,
             jsonb_each_text(lang.value) AS field'
    );

CREATE INDEX foo_complete_idx ON foo_complete USING GIN (word gin_trgm_ops);
This is far simpler. As we took care to unaccent our text configurations for the search, we also use an unaccented version of the simple configuration for the completion. You'll complete without accents, but the search will find the result anyway, as it is unaccented too.
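To see why the accent-free round trip works, here is a stdlib-Python sketch of unaccenting (PostgreSQL's unaccent extension does the equivalent with its own rules file; this is an illustration, not its actual algorithm):

```python
import unicodedata

def unaccent(text):
    """Strip combining accents after Unicode NFKD decomposition."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# Both the indexed words and the user's input end up accent-free,
# so typing "bet" can complete to the word extracted from "bête":
word = unaccent("bête")   # "bete"
match = word.startswith(unaccent("bet"))
```

Since the search configuration also chains unaccent before stemming, the completed word still matches the full-text vector, accents or not.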
Putting it all together
Now we can query. The PostgreSQL and Django contrib.postgres documentations give you plenty of ideas on how to query these views for your needs. You may want partial matches, though. For example, let's say we type "bet" and expect to match our movie description containing "bête" (yeah, we're French, we use accents, but not when searching for stuff).
SELECT foo.*
FROM foo
    NATURAL JOIN foo_search AS find,
    foo_complete AS complete
WHERE complete.word SIMILAR TO 'bet%'
    AND find.vector @@ to_tsquery('fr_FR', complete.word);
You can try to use contrib.postgres to build this query… or just cursor.execute() it.
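If you go the cursor.execute() route, a hedged sketch of the Python side could look like this. It only builds the parameterized (sql, params) pair; the cursor itself would come from django.db.connection, and the foo_search/foo_complete names are the views defined above:

```python
def build_search_query(term, config="fr_FR"):
    """Return (sql, params) for cursor.execute(): complete `term`
    against foo_complete, then match the completed words against
    foo_search. LIKE with a 'prefix%' pattern can use the trigram index."""
    sql = (
        "SELECT foo.* "
        "FROM foo "
        "NATURAL JOIN foo_search AS find, foo_complete AS complete "
        "WHERE complete.word LIKE %s "
        "AND find.vector @@ to_tsquery(%s, complete.word)"
    )
    # Parameters are passed separately so the driver handles escaping.
    return sql, [term + "%", config]

sql, params = build_search_query("bet")
# then: cursor.execute(sql, params)
```

Keeping the user input in the parameter list (never interpolated into the SQL string) is what makes this safe against injection.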