PostgreSQL 9.0 to 9.4
Our failure recovery strategy depends on the kind of failure. We can make a quick classification of the failures we can experience, different failure scenarios implying different recovery processes.
- Natural disaster
- Bad things happen and some may not be prevented: arsons, earthquakes, floods, you name it. Therefore, we should make sure our critical data is held at different geographical places simultaneously.
- Human disaster
- Our team may introduce a bug. Even PostgreSQL developers can write bugs. Our OS distribution can ship one, or some hackers can exploit a weakness: you never know. Such a bug can corrupt or delete our master data and all their hot copies. We need at least a week (a month is better) of inert archives (disconnected from our system) so we can restore our system to a previous state. If need be, this dreadful story should convince anyone of the necessity of such a procedure.
- Server crash
- Sometimes a server crashes. There are byzantine softwares interactions, memory leaks or resource outage. Its motherboard burns, or a disk gets corrupted. Hot data replication can allow us to get up again very fast in case of a server failure.
- This is not really a failure, but we must make sure that we will not lose any data when we need to update our hardware, our OS or our PostgreSQL version. There is no excuse, and we want to be up-to-date to mitigate the probabilities of very undesirable situations (like a crash, a hacker finding its way in or an old bug awakening). Our update procedure must be as seamless as possible.
System reliability is achieved through good testing and monitoring. This has nothing to do with failure recovery, which is our backup when something goes terribly wrong.
Regular recovery testing
Recovery processes must be regularly tested to ensure they can fill their duty. Every recovery procedure we had and did not test on a regular basis was in fact broken. We concluded that a non-tested recovery procedure is empirically broken and useless.
We do not need the same recovery guarantees on all our datasets, so we made a simple classification.
- Transient data
- Many datasets have a very limited lifetime, and we can compute or fetch them again if need be. The state and telemetry of cars, charge points and batteries, user sessions on the web or in front of a kiosk, etc. This kind of data can be held in a cache. If they are in a database, we can accept to lose them on any failure, and even during a simple update.
- Activity logs are not essential, but they are used for statistics, reports and verifications. Their business value is minor. We can accept to lose all of them in case of a disaster, and lose some of them in case of a failure. We should not lose them during an update, though.
- Core data
- The core of our business data do not change so much over time: client accounts, cars and stations list and parameters. We can accept to lose some of it: as it does not change so much, losing for example 24 hours of modifications just means the clients who registered this day will have to register again, the new cars and new stations of the day will have to be inserted and configured again: a small price to pay in face of a disaster. Of course we want to make this time range as limited as possible.
- Critical data
Finally, a limited set of data is critical and we can not afford to just lose them in any case. Invoices, accounting and payments for example. We have to handle this data carefully as to make sure we do not lose them in any failure scenario, so we check the data life cycle step-by-step. For example, the invoice process goes as follows:
- We generate invoice data and wait a reasonable time for replication. If some invoices are lost because of a disaster at this point, it has no impact (they were not official invoices yet).
- We send invoice data for replication in a independent third-party system. The invoices are now replicated on two sites. Data loss is very unlikely at this point. Still, if some invoices are lost, it has no impact.
- We get a list of all invoices that were properly saved by our partner. These invoices are now official: we will not lose them even if a disaster occurs.
PostgreSQL backup and replication
PostgreSQL comes with an extended documentation about replication and recovery. We will only cover the essential parts here and get straight to the point.
Thou shall not copy the whole file system
Some replication ideas revolve around the full synchronous copy of your disks, like using a RAID storage (1, 5 or 10) or File System (Block-Device) Replication (using DRDB to replicate your disk). This is a very poor idea:
- First, the performance cost is very high. At best, it doubles your response time, since we write everything twice instead of once. The cost is probably higher in fact, because the need to make sure the data is written and duplicated on the storage side prevents the use of many great performance options.
- Not much protection
- We get no protection against disasters: it is a (probably on-site) hot copy, so we still need inert archives. Moreover, we will have to resort to some other recovery process if our storage or replication technology fails: our server and PostgreSQL application are not SPOF anymore, but our storage and its replication still are.
- No more hot update
- This will definitely prevent any hot update procedure: the service will need to be shut down for a PostgreSQL update. And if another separated instance of PostgreSQL can take over during updates, this means we already have a reliable hot standby and the storage replication is redundant.
However, if your PostgreSQL instance or the underlying operating system have a tendency to crash regularly, this could be an idea worth exploring.
This technique (or any other one presented here) is not usable for scaling purpose: the storage will stay a bottleneck. The main scaling option is to shard the database. Here is a nice article from Instagram about this.
The easiest way to save our data is to dump our database every now and then:
pg_dump -j4 -Fc -T excluded_table -f dump_file.sql db_name # -j4: Run on 4 threads. Add the --no-synchronized-snapshots option for PostgreSQL < 9.2. # -Fc: Generate compressed binary dumps, more lightweight than default text dumps. # -T: Exclude non-critical data you will accept to lose in case of failure.
More details on the parameters can be found in the Postgres documentation.
pg_restore -j4 -d db_name dump_file.sql
We can store the dump on a remote server and protect ourselves from bugs and disasters. We will lose a lot of data in case of recovery though, but dumps can still be the stone on which to build our disaster recovery plan.
The almighty WAL
The Write-Ahead Log is a core component of PostgreSQL: every data-modifying transaction PostgreSQL undertakes is written to the WAL before returning the query answer. In case of a brutal server shutdown, a PostgreSQL database can use its own WAL to rebuild the last operations and ensure no data corruption ensued if the storage is OK.
Far more interestingly, the WAL can be forwarded to other servers, which can use them to rebuild a fully operational copy of the database, or even maintain a running copy very close to the master database.
- Point-In-Time Recovery is a technique that uses the WAL over a database backup to let the administrator set up a copy of the database at any given time in the past since the original backup. This is like a dump on steroids, which enables optimal data recovery in disastrous scenarios.
- Transaction Log Shipping (aka. Warm Standby)
- A standby PostgreSQL server can also be configured to read the WAL archive and reconstruct a running copy of the database with just a little bit of delay.
- Streaming replication (aka. Hot Standby)
- The master-standby pair can also be configured so that every operation written to the WAL is also sent to the standby, which will replicate it immediately. If the stream goes down for any reason, the standby switches back to the WAL files as in Transaction Log Shipping mode.
All WAL-related techniques have a few requirements:
- the database cannot use any hash indexes, as they are not WAL-logged
- there must not be any CREATE DATABASE during the base backup
- CREATE TABLESPACE requires replicas to be reinstalled with new base backups
Considering the different techniques described above, we came up with a combination to set up a solid multi-tier replication scheme.
Tier 1-2: PITR and dumps
The documentation has the full explanation of this technique, here is a short cheatsheet.
Configure this on the master:
wal_level = hot_standby # this is compatible with the "archive" level archive_mode = on # how to save wal records (%p is the source, %f the filename) archive_command = 'gzip < %p > /mnt/nfs1/wal/%f' # in case we need to plug another hot standby or do a wal-streamed pg_basebackup max_wal_senders = 2
Then run a base backup:
pg_basebackup -Ft -z -D backup_dir_path # -Ft: Output as a tar so we can compress # -z: Compress using gzip
The base backup replicates the current database cluster (the whole pgdata). Once it is done, a psql instance can be started up any time with this backup as a base and, using the WAL archives stored in /mnt/nfs1/wal, replay all modifications until a given point in time. This is done using a recovery.conf when setting up the recovery psql instance:
# how to restore wal records (%p is the source, %f the filename) restore_command = 'gunzip < /mnt/nfs1/wal/%f > %p' # once restored, register where we stopped (%r is the first record to keep, the next start point) archive_cleanup_command = 'echo %r > /tmp/wal_start_point' # point in time at which we want to recover recovery_target_time = '2015-04-01'
We can also check our PITR process every month and make sure our backup copy is not corrupted and our process is OK. We then dump the database, and cleanup old WAL records using pg_archivecleanup (the record to keep was saved by the archive_cleanup_command at the end of the recovery process when we started up this instance):
pg_archivecleanup -x .gz /mnt/nfs1/wal `cat /tmp/wal_start_point` # -x .gz: wal files have the .gz extension
This first tiers are used as the cornerstone of our recovery strategy: in case of disaster (data corrupted or deleted by a bug, data warehouse destroyed, etc.) it enables us to set up a new database from any given point in the last month. We still keep doing daily dumps to make sure we have a backup in case we lose the WAL records.
Tier 3: Streaming replication
Yet again, the PostgreSQL documentation has a full section about streaming replication
Set up authentication, on the master (pg_hba.conf):
# Allow the user "foo" from standby (192.168.1.100) to connect to the master for replication (with password) # # TYPE DATABASE USER ADDRESS METHOD host replication foo 192.168.1.100/32 md5
and on the standby (recovery.conf):
# The standby connects to the master (192.168.1.50:5432) as "foo" with password "foopass" primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass' # it is a hot standby hot_standby = true # same restore as a PITR, but no cleanup restore_command = 'gunzip < /mnt/nfs1/wal/%f > %p'
When the master instance is lost (or shut it down for an upgrade), execute the following on the standby (9.1+ only):
This third tier is used as a read-only server for requests that do not need to be perfectly in sync with the master. Also, it can take over the role of the master in a matter of minutes, should a server crash occur. Finally, it can replace the master temporarily for system or version upgrades, without any data loss.
|Recover from dump||< 48h||< 24h|
|Recover from PITR||< 12h||< 1h|
|Crash & promote hot standby||< 10mn||< 5mn|
|Shutdown & promote||< 2mn||None|
In the rare case of a server failure, we do not hope to get the storage and the file system back: we just switch immediately to the hot standby and lose at worst 5 minutes of non-critical data (most likely we will only lose a few transactions, maybe none if the storage and networks are ok and only psql crashed). In case of a disaster or a bug making the hot standby unavailable, we recover from the PITR archive or from a dump and lose more data. In case of an update, we shutdown the master gracefully and promote the standby without losing any data (we still get a couple of minutes of downtime — we're still trying to figure how to avoid that).
Once we knew our recovery strategy, we configured our PostgreSQL accordingly. In the rare case of a server failure, we do not hope to get the storage and the file system back: we just switch immediately to the hot standby. This means we have no interest in keeping our file system consistent in failure scenarios. This allows us to change our settings from default quite a bit. There is a PostgreSQL wiki with extended details about configuration.
# core config max_connections = 100 # max connections full_page_writes = on # copy full pages to WAL work_mem = 1MB # warning: multiple buffers of such size by query synchronous_commit = on # wait for WAL to be flushed maintenance_work_mem = 64MB # maintenance memory (for vacuum, index creation) shared_buffers = 128MB # cross-queries memory (cache) # hints for planner estimates random_page_cost = 4.0 # Disk random/contiguous access cost effective_cache_size = 4GB # OS disk-caching capability for planner estimations # WAL checkpoints and shipping checkpoint_segments = 3 # max segments before checkpoint (3*16MB) checkpoint_timeout = 5min # timeout before checkpoint checkpoint_completion_target = 0.5 # checkpoint write spread
Our performance settings:
# core config max_connections = 500 # workers + apis + www full_page_writes = off # never rebuild ourselves (use standby) work_mem = 30MB # mean 2 buffer allocations by query so it is OK synchronous_commit = off # current checkpoint lost in some crash cases (psql, kernel) maintenance_work_mem = 4GB # 1/16 memory for vacuum shared_buffers = 20GB # 1/4 memory shared between psql processes # hints for planner estimates random_page_cost = 1.0 # using SSD drives effective_cache_size = 40GB # 1/2 memory # WAL checkpoints and shipping checkpoint_segments = 128 # max segments before checkpoint for WAL (2GB) checkpoint_timeout = 5min # timeout before checkpoint (should be hit before size limit) checkpoint_completion_target = 0.9 # spread writes fully
This stresses the importance of having a clear recovery strategy: in short, our PostgreSQL instance operates in memory, quite like a cache software. It does not need to wait for a disk flush to answer a query: this is an enormous performance gain. At the same time, not worrying about our file system state lets us use shorter WAL records (full_page_writes = off) and the streamed replica gets more efficient