In this article I will only talk about PostgreSQL, the database management system (DBMS) that we are using here at Polyconseil, but the same concepts should apply to its competitors. To understand why the database may suffer when doing schema migrations, we must first understand the concept of locking. This is a feature provided by the DBMS to control concurrent access to data in the database.
Each action on a table (SELECT, INSERT, UPDATE, DELETE, ALTER, …) requires a lock before being applied. The lock cannot be acquired until all conflicting locks on the same table are released. These locks are typically released at the end of the transaction (at commit). Locks, together with transactions, are the core mechanisms ensuring the ACID properties of database operations. You can read more about the different kinds of locks provided by PostgreSQL in the documentation.
Altering a table will most of the time require an access exclusive lock on this table. This kind of lock conflicts with all other locks. This means that during a schema update there is a good chance that no read and no write will be possible on the updated table. For example, this will be the case when adding a nullable column to a table. Fortunately, PostgreSQL is pretty quick at performing schema updates and the operation should not take more than a few milliseconds. However, there are a couple of rules that we should follow in order to minimize the impact on our production when running database schema changes, particularly on big and highly used tables.
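As a sketch (table and column names are hypothetical), such a change can be guarded with PostgreSQL's lock_timeout setting, so that if the access exclusive lock cannot be acquired quickly, the ALTER fails fast instead of queueing — which matters because even a *waiting* access exclusive lock blocks every query that arrives after it:

```sql
-- Abort the ALTER if the access exclusive lock cannot be acquired
-- within 2 seconds, instead of queueing behind long-running queries
-- and blocking every newer query on the table in the meantime.
SET lock_timeout = '2s';

-- Adding a nullable column is only a catalog change: once the lock
-- is acquired, it completes in milliseconds even on a big table.
ALTER TABLE "Table1" ADD COLUMN "comment" TEXT NULL;
```

If the statement times out, it can simply be retried at a quieter moment.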
Most of the time our database migration will run inside a transaction (this is the case, for example, when using Django migrations with PostgreSQL). When we lock more than one table we increase the risk of deadlock, which is one of the worst things that can happen during our schema migration. Indeed, if by any chance our application code also takes a lock on the same two tables, this could result in a deadlock and potentially in an error in our application code. Since access exclusive locks conflict with any other locks, it is sufficient that our application reads the content of these two tables in the same transaction to have a risk of deadlock.
For example suppose that:
- our application needs, in the same transaction, an access share lock on Table1 to read data (T1) and, later on, an access share lock on Table2 to read more data (T4)
- our migration alters Table2 (T2) and Table1 (T3), taking an access exclusive lock on both tables (which conflicts with the access share lock required to read from the tables)
If the operations run in the following order (T1) - (T2) - (T3) - (T4) (see the figure below) then we have a deadlock and our application code will fail either with a deadlock detected error or with a timeout error after PostgreSQL kills the transaction when the database timeout is exceeded.
Remember, until our transaction has committed, no read or write is possible on the modified table. We do not want this to take more than a few milliseconds. Consequently we need to limit the work done inside the same transaction. Note that since it can be hard to acquire an access exclusive lock on a highly used table, there is a trade-off between a few long transactions and many quick ones. Grouping about five schema updates in the same transaction is most likely safe in any case.
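Concretely, instead of batching many unrelated schema updates in one transaction, we can commit each one separately, so that each access exclusive lock is held as briefly as possible and the two locks are never held at the same time (table and column names are hypothetical):

```sql
BEGIN;
ALTER TABLE "Table1" ADD COLUMN "created_at" TIMESTAMPTZ NULL;
COMMIT;  -- the lock on Table1 is released here

BEGIN;
ALTER TABLE "Table2" ADD COLUMN "updated_at" TIMESTAMPTZ NULL;
COMMIT;  -- this transaction never holds locks on both tables at once
```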
There is one special case where rule #1 is not applicable: adding a foreign key constraint between two tables requires a share row exclusive lock on both tables. This lock conflicts with row exclusive locks. Acquiring it will thus prevent insertions or updates in these tables (and probably even reads on the table receiving the new column). Let's explore this case with a small example. Suppose that our application code regularly triggers the following transaction:
BEGIN; -- transaction (1)
INSERT INTO "Table1" ...; -- (1.1)
-- ... some more work
SELECT * FROM "Table2" WHERE id = ...; -- (1.2)
COMMIT;
If we want to add a foreign key from Table2 to Table1 we will need to run a query similar to this one:
BEGIN; -- transaction (2)
ALTER TABLE "Table2" ADD table1_id INTEGER NULL; -- (2.1)
ALTER TABLE "Table2"
  ADD CONSTRAINT "table1_id_foreign_key"
  FOREIGN KEY ("table1_id") REFERENCES "Table1" ("id")
  DEFERRABLE INITIALLY DEFERRED; -- (2.2)
COMMIT;
Now suppose that the statements are executed in the following order:
- (1.1): transaction 1 takes a row exclusive lock on Table1 to insert some values;
- (2.1): we start the schema migration, transaction 2 takes an access exclusive lock on Table2;
- (2.2): transaction 2 requests a share row exclusive lock on Table1 and Table2 to add the foreign key constraint. The lock is granted for Table2, but since the lock for Table1 conflicts with the lock acquired by (1.1), transaction 2 is blocked at this point;
- (1.2): transaction 1 requests an access share lock to read data from Table2. This lock conflicts with the lock acquired by (2.1); both transactions are now trapped in a deadlock :(.
The pattern is actually exactly the same as the one presented in the figure above. Similarly, removing a foreign key constraint requires an access exclusive lock on both tables, preventing both reads and writes on these tables. This exposes us to the same risk of deadlock. The trick here is that when we perform a schema migration on a table we tend to focus on the table receiving the new column. Remember that the table pointed to by the foreign key can also be impacted. Be careful when adding or removing a foreign key to or from a highly used table: the risk of deadlock is not negligible.
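One common mitigation — a suggestion on top of the rules above, not the article's own recipe — is to add the foreign key as NOT VALID, which still needs the lock on both tables but skips the full scan of existing rows, and then to validate it in a separate transaction under a much weaker lock (constraint and column names follow the example above):

```sql
BEGIN;
ALTER TABLE "Table2"
  ADD CONSTRAINT "table1_id_foreign_key"
  FOREIGN KEY ("table1_id") REFERENCES "Table1" ("id")
  NOT VALID;  -- locks both tables, but only for a quick catalog change
COMMIT;

-- In a separate transaction: scans Table2 to check existing rows,
-- while concurrent reads and writes remain possible (PostgreSQL 9.4+
-- only takes a share update exclusive lock for the validation).
ALTER TABLE "Table2" VALIDATE CONSTRAINT "table1_id_foreign_key";
```

The window during which the share row exclusive lock is held shrinks to the duration of the catalog change, which greatly reduces the deadlock exposure described above.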
Index creation is a costly operation that can easily take more than a minute on really big tables. The CREATE INDEX command acquires a share lock on the table, preventing any writes on the table. Fortunately, PostgreSQL provides a CREATE INDEX CONCURRENTLY command to avoid blocking writes on the table for dozens of seconds. This command only takes a share update exclusive lock, which allows UPDATE, DELETE and INSERT on the table. I see no reason to prefer the plain CREATE INDEX command over CREATE INDEX CONCURRENTLY in production: use CONCURRENTLY!
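A sketch (the index name and column are hypothetical) — keep in mind that CREATE INDEX CONCURRENTLY cannot run inside a transaction block, so with Django the migration must set atomic = False:

```sql
-- Must run outside a transaction block.
-- Builds the index without blocking INSERT, UPDATE or DELETE
-- on the table (it only takes a share update exclusive lock).
CREATE INDEX CONCURRENTLY "table2_table1_id_idx"
  ON "Table2" ("table1_id");
```

One caveat: if the concurrent build fails or is cancelled, it leaves behind an INVALID index that must be dropped manually before retrying.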
Deploying schema migrations on a running system always requires a preliminary analysis to limit the risks of impacting our final users. Web frameworks like Django can automatically generate the schema migrations for us, but we should always proofread the auto-generated files and make sure that:
- they are compatible with the code currently running in production (e.g. don't forget to make new columns nullable);
- the rules provided in this article were considered when touching big tables (several million rows) receiving more than one read or write per second.
We also try to schedule our riskier schema migrations at times when database activity is lowest (e.g. during the night).
Following this advice should put you in the safe zone. That's what we do at Polyconseil, and we deploy schema migrations several times a week with no impact. Or rather: usually no impact. Because there is one last rule: never underestimate schema changes; always review and triple-check them to anticipate deadlocks and other possible issues.
Note: remember to always make sure that new columns are nullable when you add them to existing, non-empty tables. Otherwise the schema update will immediately fail because the NOT NULL constraint will not be valid on existing rows. You probably don't want to fill the column with a default value (as provided by Django) because PostgreSQL will then rewrite the entire table. This can take a lot of time, and meanwhile the active access exclusive lock (acquired by the ALTER TABLE) will block all concurrent queries on this table.
Note: well… OK, you could also drop your entire database, which is waaay worse.
Note: if you are doing your schema updates with Django, chances are that the column insertion and the addition of the foreign key constraint will be performed in the same transaction (this is how Django creates the migration by default). This means that when you add the constraint, you already hold an access exclusive lock on the table receiving the column. Consequently, reads will not be possible on this table during the operation. This is even more problematic if you need to wait a few seconds to acquire the share row exclusive lock on the other table (e.g. if that table receives a lot of writes).