Versions used

PostgreSQL 9.0 to 9.4

Failure Recovery

Our failure recovery strategy depends on the kind of failure: different failure scenarios call for different recovery processes, so let us start with a quick classification of the failures we may experience.

Natural disaster
Bad things happen and some cannot be prevented: arson, earthquakes, floods, you name it. Therefore, we should make sure our critical data is held at different geographical locations simultaneously.
Human disaster
Our team may introduce a bug. Even PostgreSQL developers write bugs. Our OS distribution can ship one, or hackers can exploit a weakness: you never know. Such a bug can corrupt or delete our master data and all its hot copies. We need at least a week (a month is better) of inert archives (disconnected from our system) so we can restore our system to a previous state. If need be, this dreadful story should convince anyone of the necessity of such a procedure.
Server crash
Sometimes a server simply crashes: byzantine software interactions, memory leaks, resource exhaustion, a burnt motherboard or a corrupted disk. Hot data replication allows us to get back up very fast in case of a server failure.
Update
This is not really a failure, but we must make sure that we will not lose any data when we need to update our hardware, our OS or our PostgreSQL version. There is no excuse for running outdated software: staying up to date reduces the probability of very undesirable situations (a crash, a hacker finding their way in, an old bug waking up). Our update procedure must be as seamless as possible.

Reliability

System reliability is achieved through good testing and monitoring. This has nothing to do with failure recovery, which is our backup when something goes terribly wrong.

Regular recovery testing

Recovery processes must be tested regularly to ensure they can fulfill their duty. Every recovery procedure we had but did not test on a regular basis turned out to be broken. Our conclusion: an untested recovery procedure is, empirically, a broken and useless one.
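
For instance, a dump restore drill can be scripted and run every month; a minimal sketch, where the scratch port 5433, the restore_test database and the some_core_table sanity check are placeholders, and dump_dir is the dump produced by the pg_dump command shown further down:

# restore the latest dump into a scratch instance and run a quick sanity query
createdb -p 5433 restore_test
pg_restore -j4 -p 5433 -d restore_test dump_dir
psql -p 5433 -d restore_test -c "SELECT count(*) FROM some_core_table;"

# if any of these steps fail, the recovery procedure is broken: fix it now,
# not in the middle of an actual disaster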

Data Typology

We do not need the same recovery guarantees on all our datasets, so we made a simple classification.

Transient data
Many datasets have a very limited lifetime and can be computed or fetched again if need be: the state and telemetry of cars, charge points and batteries, user sessions on the web or in front of a kiosk, etc. This kind of data can be held in a cache. If it lives in a database, we can accept losing it on any failure, and even during a simple update.
Logs
Activity logs are not essential, but they are used for statistics, reports and verifications. Their business value is minor. We can accept losing all of them in case of a disaster, and some of them in case of a failure. We should not lose them during an update, though.
Core data
The core of our business data does not change much over time: client accounts, the list of cars and stations and their parameters. We can accept losing some of it: since it changes slowly, losing for example 24 hours of modifications only means that the clients who registered that day will have to register again, and that the new cars and stations of the day will have to be inserted and configured again: a small price to pay in the face of a disaster. Of course, we want to keep this time range as limited as possible.
Critical data

Finally, a limited set of data is critical and we cannot afford to lose it under any circumstances: invoices, accounting and payments, for example. We have to handle this data carefully so as to make sure it survives every failure scenario, so we review its life cycle step by step. For example, the invoice process goes as follows:

  1. We generate invoice data and wait a reasonable time for replication. If some invoices are lost because of a disaster at this point, it has no impact (they were not official invoices yet).
  2. We send invoice data for replication to an independent third-party system. The invoices are now replicated on two sites. Data loss is very unlikely at this point. Still, if some invoices are lost, it has no impact.
  3. We get a list of all invoices that were properly saved by our partner. These invoices are now official: we will not lose them even if a disaster occurs (a sketch of this reconciliation step follows the list).
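
To illustrate step 3, the reconciliation can boil down to a single statement once the partner's list of saved invoice ids has been loaded; a purely hypothetical sketch (the invoice table, its status column and the partner_saved_ids staging table are made-up names, not our actual schema):

# mark as official every invoice the partner confirmed it has safely stored
psql -d db_name -c "
  UPDATE invoice
     SET status = 'official'
   WHERE status = 'sent_to_partner'
     AND id IN (SELECT invoice_id FROM partner_saved_ids);"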

PostgreSQL backup and replication

PostgreSQL comes with extensive documentation about replication and recovery. We will only cover the essential parts here and get straight to the point.

Thou shalt not copy the whole file system

Some replication ideas revolve around a full synchronous copy of the disks, such as RAID storage (levels 1, 5 or 10). This is a poor idea:

Costly
First, the performance cost is very high. At best, it doubles the response time, since everything is written twice instead of once. In practice the cost is probably higher, because the need to make sure the data is written and duplicated on the storage side prevents the use of many great performance options.
Not much protection
We get no protection against disasters: it is a (probably on-site) hot copy, so we still need inert archives. Moreover, we will have to resort to some other recovery process if our storage or replication technology fails: our server and PostgreSQL instance are no longer single points of failure, but our storage and its replication still are.
No more hot update
This definitely prevents any hot update procedure: the service will need to be shut down for a PostgreSQL update. And if another separate instance of PostgreSQL can take over during updates, it means we already have a reliable hot standby and the storage replication is redundant.

However, if your PostgreSQL instance or the underlying operating system have a tendency to crash regularly, this could be an idea worth exploring.

About scaling

This technique (or any other one presented here) is not usable for scaling purposes: the storage will remain a bottleneck. The main scaling option is to shard the database. Here is a nice article from Instagram about this.

Dump

The easiest way to save our data is to dump our database every now and then:

pg_dump -j4 -Fd -T excluded_table -f dump_dir db_name

# -j4: Run 4 parallel jobs (needs pg_dump 9.3+). Add the --no-synchronized-snapshots
#      option when dumping from a PostgreSQL < 9.2 server.
# -Fd: Generate a compressed directory-format dump, more lightweight than the default
#      text dump (parallel dumps only work with this format).
# -T:  Exclude non-critical tables you accept to lose in case of failure.

More details on the parameters can be found in the Postgres documentation.

To restore:

pg_restore -j4 -d db_name dump_dir

We can store the dump on a remote server to protect ourselves from bugs and disasters. We would lose a lot of data if we had to recover from it, but dumps can still be the stone on which to build our disaster recovery plan.
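
Automating this is straightforward; a sketch of a daily cron entry, assuming passwordless SSH to a remote backup host (the backup.example.com host and the paths are placeholders):

# /etc/cron.d/pg-dump: daily dump at 03:00 as the postgres user, then copy off-site
0 3 * * * postgres pg_dump -j4 -Fd -T excluded_table -f /var/backups/db_$(date +\%F) db_name && rsync -a /var/backups/ backup.example.com:/srv/pg_backups/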

The almighty WAL

The Write-Ahead Log is a core component of PostgreSQL: every data-modifying transaction is written to the WAL before the query answer is returned. In case of a brutal server shutdown, a PostgreSQL database can use its own WAL to replay the last operations and ensure no data corruption ensued, provided the storage itself is OK.

Far more interestingly, the WAL can be forwarded to other servers, which can use it to rebuild a fully operational copy of the database, or even maintain a running copy very close to the master database.

PITR
Point-In-Time Recovery is a technique that uses the WAL over a database backup to let the administrator set up a copy of the database at any given time in the past since the original backup. This is like a dump on steroids, which enables optimal data recovery in disastrous scenarios.
Transaction Log Shipping (aka. Warm Standby)
A standby PostgreSQL server can also be configured to read the WAL archive and reconstruct a running copy of the database with just a little bit of delay.
Streaming replication (aka. Hot Standby)
The master-standby pair can also be configured so that every operation written to the WAL is also sent to the standby, which will replicate it immediately. If the stream goes down for any reason, the standby switches back to the WAL files as in Transaction Log Shipping mode.

Warning

All WAL-related techniques have a few requirements:

  • the database should not rely on hash indexes, as operations on them are not WAL-logged and would be missing from a restored copy (see the sketch after this list to spot them)
  • there must not be any CREATE DATABASE during the base backup
  • CREATE TABLESPACE requires replicas to be reinstalled with new base backups
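
A quick way to check whether a database relies on hash indexes before going down this road is to list them from the catalog; a sketch, to run against each database:

psql -d db_name -c "
  SELECT c.relname AS hash_index
    FROM pg_class c
    JOIN pg_am a ON a.oid = c.relam
   WHERE c.relkind = 'i' AND a.amname = 'hash';"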

Multi-tier replication

Considering the different techniques described above, we came up with a combination to set up a solid multi-tier replication scheme.

Tier 1-2: PITR and dumps

The documentation has the full explanation of this technique; here is a short cheat sheet.

Configure this on the master:

wal_level = hot_standby     # this is compatible with the "archive" level
archive_mode = on
# how to save wal records (%p is the source, %f the filename)
archive_command = 'gzip < %p > /mnt/nfs1/wal/%f.gz'
# in case we need to plug another hot standby or do a wal-streamed pg_basebackup
max_wal_senders = 2
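
Once this is in place, it is worth checking that segments actually reach the archive; a quick sketch, forcing a segment switch (pg_switch_xlog is the function's name on these 9.x versions):

# force the current WAL segment to be completed and handed to the archiver
psql -d db_name -c "SELECT pg_switch_xlog();"
# the gzipped segment should show up in the archive directory shortly after
ls -lt /mnt/nfs1/wal | head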

Then run a base backup:

pg_basebackup -Ft -z -D backup_dir_path

# -Ft: Output as a tar so we can compress
# -z: Compress using gzip

The base backup replicates the current database cluster (the whole PGDATA directory). Once it is done, a PostgreSQL instance can be started at any time with this backup as a base and, using the WAL archives stored in /mnt/nfs1/wal, replay all modifications up to a given point in time. This is configured with a recovery.conf file when setting up the recovery instance:

# how to restore WAL segments (%f is the archived file name, %p the destination path)
restore_command = 'gunzip < /mnt/nfs1/wal/%f.gz > %p'
# once restored, register where we stopped (%r is the first record to keep, the next start point)
archive_cleanup_command = 'echo %r > /tmp/wal_start_point'
# point in time at which we want to recover
recovery_target_time = '2015-04-01'
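
Putting it together, the recovery itself consists of extracting the base backup into a fresh data directory, dropping the recovery.conf above next to it and starting the instance; a sketch with placeholder paths:

# extract the base backup into an empty data directory (placeholder path)
mkdir -p /srv/pg_recovery/data && chmod 700 /srv/pg_recovery/data
tar -xzf backup_dir_path/base.tar.gz -C /srv/pg_recovery/data

# add the recovery.conf shown above, then start: PostgreSQL replays the archived
# WAL up to recovery_target_time and then accepts connections
cp recovery.conf /srv/pg_recovery/data/
pg_ctl -D /srv/pg_recovery/data start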

We also test our PITR process every month to make sure our backup copy is not corrupted and our procedure still works. We then dump the recovered database and clean up old WAL segments using pg_archivecleanup (the first segment to keep was saved by the archive_cleanup_command when we started up this recovery instance):

pg_archivecleanup -x .gz /mnt/nfs1/wal `cat /tmp/wal_start_point`

# -x .gz: wal files have the .gz extension

Usage

These first tiers are the cornerstone of our recovery strategy: in case of disaster (data corrupted or deleted by a bug, data warehouse destroyed, etc.) they enable us to set up a new database from any given point in the last month. We still keep doing daily dumps to make sure we have a backup in case we lose the WAL archives.

Tier 3: Streaming replication

Once again, the PostgreSQL documentation has a full section about streaming replication.

Set up authentication on the master (pg_hba.conf):

# Allow the user "foo" from standby (192.168.1.100) to connect to the master for replication (with password)
#
# TYPE  DATABASE        USER            ADDRESS                 METHOD
host    replication     foo             192.168.1.100/32        md5
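
The replication role itself must exist on the master; a one-liner sketch, assuming it has not been created yet (the REPLICATION attribute exists since 9.1; on 9.0 a superuser has to be used instead):

psql -c "CREATE ROLE foo WITH REPLICATION LOGIN PASSWORD 'foopass';"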

Then, on the standby (recovery.conf):

# The standby connects to the master (192.168.1.50:5432) as "foo" with password "foopass"
primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
# act as a standby, continuously applying incoming WAL
standby_mode = 'on'
# note: hot_standby = on must also be set in the standby's postgresql.conf to allow read-only queries
# same restore as for PITR, but no cleanup
restore_command = 'gunzip < /mnt/nfs1/wal/%f.gz > %p'
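
The standby's data directory must first be seeded with a copy of the master. One way is to stream a base backup straight from the master (this is what max_wal_senders was raised for); a sketch with a placeholder data directory (-X stream needs 9.2+):

# run on the standby against an empty data directory
pg_basebackup -h 192.168.1.50 -U foo -D /var/lib/postgresql/data -X stream -P
# then add the recovery.conf above to that directory and start the standby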

When the master instance is lost (or has been shut down gracefully for an upgrade), execute the following on the standby (9.1+ only):

pg_ctl promote

Usage

This third tier is used as a read-only server for requests that do not need to be perfectly in sync with the master. Also, it can take over the role of the master in a matter of minutes, should a server crash occur. Finally, it can replace the master temporarily for system or version upgrades, without any data loss.
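
Keeping an eye on replication health is cheap; two one-liners we can hook into monitoring (9.1+), run on the master and on the standby respectively:

# on the master: connected standbys and how far they have replayed
psql -c "SELECT client_addr, state, sent_location, replay_location FROM pg_stat_replication;"
# on the standby: a rough replication lag estimate
psql -c "SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay;"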

TL;DR: Summary

                              Downtime   Data loss   Disaster   Failure   Update
Recover from dump             < 48h      < 24h       ✓          ✓         🚫
Recover from PITR             < 12h      < 1h        ✓          ✓         🚫
Crash & promote hot standby   < 10mn     < 5mn       🚫         ✓         🚫
Shutdown & promote            < 2mn      None        🚫         🚫        ✓

(✓: the procedure covers this scenario, 🚫: it does not)

In the rare case of a server failure, we do not hope to get the storage and the file system back: we just switch immediately to the hot standby and lose at worst 5 minutes of non-critical data (most likely we will only lose a few transactions, maybe none if the storage and network are OK and only PostgreSQL crashed). In case of a disaster, or of a bug making the hot standby unavailable, we recover from the PITR archive or from a dump and lose more data. In case of an update, we shut down the master gracefully and promote the standby without losing any data (we still get a couple of minutes of downtime; we are still trying to figure out how to avoid that).

Configuration fine-tuning

Once our recovery strategy was settled, we configured PostgreSQL accordingly. Since, in the rare case of a server failure, we do not hope to get the storage and the file system back but switch immediately to the hot standby, we have no interest in keeping our file system consistent in failure scenarios. This allows us to move quite far away from the default settings. The PostgreSQL wiki has extended details about configuration.

Default settings:

# core config
max_connections = 100         # max connections
full_page_writes = on         # copy full pages to WAL
work_mem = 1MB                # warning: a query may allocate several buffers of this size
synchronous_commit = on       # wait for WAL to be flushed
maintenance_work_mem = 64MB   # maintenance memory (for vacuum, index creation)
shared_buffers = 128MB        # cross-queries memory (cache)

# hints for planner estimates
random_page_cost = 4.0        # Disk random/contiguous access cost
effective_cache_size = 4GB    # OS disk-caching capability for planner estimations

# WAL checkpoints and shipping
checkpoint_segments = 3       # max segments before checkpoint (3*16MB)
checkpoint_timeout = 5min     # timeout before checkpoint
checkpoint_completion_target = 0.5  # checkpoint write spread

Our performance settings:

# core config
max_connections = 500         # workers + apis + www
full_page_writes = off        # we never rebuild from the local file system (we use the standby)
work_mem = 30MB               # about 2 buffer allocations per query on average, so this is OK
synchronous_commit = off      # recent commits may be lost if PostgreSQL or the kernel crashes
maintenance_work_mem = 4GB    # 1/16 of memory for vacuum and index creation
shared_buffers = 20GB         # 1/4 of memory, shared between PostgreSQL processes

# hints for planner estimates
random_page_cost = 1.0        # using SSD drives
effective_cache_size = 40GB   # 1/2 memory

# WAL checkpoints and shipping
checkpoint_segments = 128     # max segments before checkpoint for WAL (2GB)
checkpoint_timeout = 5min     # timeout before checkpoint (should be hit before size limit)
checkpoint_completion_target = 0.9  # spread writes fully

This stresses the importance of having a clear recovery strategy: in short, our PostgreSQL instance operates in memory, much like a caching layer. It does not need to wait for a disk flush to answer a query, which is an enormous performance gain. At the same time, not worrying about our file system state lets us write shorter WAL records (full_page_writes = off), which also makes the streamed replication more efficient.
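
As a final sanity check after a reload or restart, we can confirm which values are actually in effect (some of them, like shared_buffers, only change on restart); a quick sketch:

psql -c "SELECT name, setting, unit FROM pg_settings
          WHERE name IN ('shared_buffers', 'work_mem', 'synchronous_commit',
                         'checkpoint_segments', 'random_page_cost');"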