SaaS Database Migration Zero-Downtime: 3 Proven & Safe

Table of Contents

The migration was one line. It locked a production table for 45 minutes in the middle of the workday while we watched the error rate climb and practiced saying "it's progressing nicely" in a calm voice. That outage is the reason we never run a bare ALTER TABLE on a big table anymore.

Standard migration tutorials teach you to write a migration file and run it. In SaaS, that advice gets you paged at 3pm on a Tuesday. Your database and application code deploy independently. During a deployment, old code handles in-flight requests while the schema changes underneath it. If the migration breaks backward compatibility — even for a minute — you have an outage window between "migration applied" and "old code retired."

Here is the playbook for SaaS database migration zero-downtime in production: the three-phase expand-contract pattern, with TypeORM code, batch backfill strategies, and rollback plans that do not involve restoring from a 6-hour-old backup.

Why Standard Migrations Risk Downtime

The core problem is a deployment timing mismatch. On Vercel, Netlify, or Kubernetes rolling updates, the old application version processes active requests for 30 seconds to several minutes after the new version starts. If your migration:

Drops a column — the old code crashes because it queries a column that no longer exists
Renames a column — same problem
Adds NOT NULL — PostgreSQL takes an AccessExclusiveLock that blocks all reads and writes until the scan completes
Changes a column type — PostgreSQL rewrites the entire table, locking it for minutes on large tables

The database needs to be in a state that both old and new code can work with during the deployment window. That is what the expand-contract pattern guarantees.

SaaS database migration zero-downtime — SQL migration code on a monitor in a dark development environment

SaaS Database Migration Zero-Downtime: The Expand-Contract Pattern

The expand-contract pattern (also called parallel change) splits every destructive or risky schema change into three phases, each independently deployable and reversible.

1Phase 1: Expand ─── Add new structure alongside the old
2    │
3    ▼
4Phase 2: Migrate ── Backfill data, update application code
5    │
6    ▼
7Phase 3: Contract ── Remove old structure after verification

The key property: after each phase, the database remains backward-compatible with the application version running before the phase. This means you can deploy any phase, verify it, and roll back by deploying the previous application version if something goes wrong.

Phase 1: Expand — Add New Structure

The expand phase only adds things. It never removes or alters existing objects in a way that breaks running code. Here are the safe operations, written as TypeORM migrations:

1// 001-expand-add-column.ts
2import { MigrationInterface, QueryRunner } from 'typeorm';
3
4export class ExpandAddColumn1234567890 implements MigrationInterface {
5  name = 'ExpandAddColumn1234567890';
6
7  public async up(queryRunner: QueryRunner): Promise<void> {
8    // Add nullable column — instant in PostgreSQL, no table lock
9    await queryRunner.query(`
10      ALTER TABLE "users" ADD COLUMN "display_name" character varying
11    `);
12
13    // Add index concurrently — non-blocking
14    await queryRunner.query(`
15      CREATE INDEX CONCURRENTLY IF NOT EXISTS "idx_users_display_name"
16      ON "users" ("display_name")
17    `);
18
19    // Add a new table — always safe
20    await queryRunner.query(`
21      CREATE TABLE "user_profiles" (
22        "id" uuid NOT NULL DEFAULT gen_random_uuid(),
23        "user_id" uuid NOT NULL REFERENCES "users"("id"),
24        "bio" text,
25        "created_at" TIMESTAMPTZ NOT NULL DEFAULT now(),
26        CONSTRAINT "PK_user_profiles" PRIMARY KEY ("id")
27      )
28    `);
29  }
30
31  public async down(queryRunner: QueryRunner): Promise<void> {
32    await queryRunner.query(`DROP TABLE "user_profiles"`);
33    await queryRunner.query(`DROP INDEX IF EXISTS "idx_users_display_name"`);
34    await queryRunner.query(`ALTER TABLE "users" DROP COLUMN "display_name"`);
35  }
36}

Rules for the expand phase:

Add columns as nullable — a nullable column with no default is metadata-only in PostgreSQL. It takes milliseconds even on large tables
Create indexes with CONCURRENTLY — standard CREATE INDEX blocks writes. CREATE INDEX CONCURRENTLY builds the index without locking the table
Add new tables freely — creating a table does not affect existing queries
Use NOT VALID for foreign keys — ADD CONSTRAINT ... NOT VALID is instant; validate separately later

Phase 2: Migrate — Backfill and Update Code

This phase has two parts that happen in sequence: backfill historical data and deploy updated application code.

Backfill in Batches

Never backfill millions of rows in a single UPDATE. Short, bounded batches prevent lock contention and let monitoring catch problems early.

1// scripts/backfill-display-name.ts
2import { DataSource } from 'typeorm';
3
4export async function backfillDisplayName(
5  dataSource: DataSource,
6  batchSize = 1000,
7  delayMs = 50,
8) {
9  const queryRunner = dataSource.createQueryRunner();
10  let processed = 0;
11
12  while (true) {
13    const result = await queryRunner.query(`
14      UPDATE "users"
15      SET "display_name" = "username"
16      WHERE "id" IN (
17        SELECT "id" FROM "users"
18        WHERE "display_name" IS NULL
19        LIMIT $1
20      )
21      RETURNING "id"
22    `, [batchSize]);
23
24    if (result.length === 0) break;
25
26    processed += result.length;
27    console.log(`Backfilled ${processed} rows`);
28
29    // Give the database breathing room between batches
30    await new Promise((resolve) => setTimeout(resolve, delayMs));
31  }
32
33  console.log(`Backfill complete: ${processed} rows updated`);
34  await queryRunner.release();
35}

Three rules for safe backfills:

Batch by primary key range or LIMIT/OFFSET — never load the entire table into memory
Add a delay between batches — 50–200ms prevents overwhelming the database connection pool
Make the backfill idempotent — WHERE display_name IS NULL ensures re-running does not double-write

For large backfills, push the job to a background job queue instead of running it inline:

1// Run backfill as a BullMQ job — same pattern as our background job queue
2import { Processor, WorkerHost } from '@nestjs/bullmq';
3import { Job } from 'bullmq';
4
5@Processor('backfill')
6export class MigrationBackfillProcessor extends WorkerHost {
7  constructor(private dataSource: DataSource) {
8    super();
9  }
10
11  async process(job: Job<{ table: string; batchSize: number }>) {
12    await backfillDisplayName(this.dataSource, job.data.batchSize);
13  }
14}

Dual-Write Application Code

Once the new column exists and historical data is backfilled, deploy application code that writes to both the old and new fields:

1// apps/api/src/users/users.service.ts
2export class UsersService {
3  async updateUser(id: string, dto: UpdateUserDto) {
4    // Dual-write to both columns during migration period
5    await this.userRepository.update(id, {
6      username: dto.username,
7      displayName: dto.username, // new column, same data
8    });
9  }
10
11  async getUser(id: string) {
12    const user = await this.userRepository.findOneBy({ id });
13    // Read from new column, fall back to old column
14    return {
15      name: user.displayName ?? user.username,
16      // ...other fields
17    };
18  }
19}

The fallback pattern (?? user.username) ensures the application works even if some rows were not backfilled. After verifying all reads use the new column, remove the fallback in a subsequent deploy.

Phase 3: Contract — Remove Old Structure

The contract phase is the simplest but requires the most verification. Before removing the old column, confirm:

No running application version references the old field
Reconciliation queries show zero drift between old and new data
Error rates and latency are normal after the dual-write deploy

1// 003-contract-remove-username-column.ts
2import { MigrationInterface, QueryRunner } from 'typeorm';
3
4export class ContractRemoveUsername1234567892 implements MigrationInterface {
5  name = 'ContractRemoveUsername1234567892';
6
7  public async up(queryRunner: QueryRunner): Promise<void> {
8    // Remove the trigger or sync mechanism
9    // Then drop the old column
10    await queryRunner.query(`
11      ALTER TABLE "users" DROP COLUMN "username"
12    `);
13  }
14
15  public async down(queryRunner: QueryRunner): Promise<void> {
16    // Rollback: add the column back
17    await queryRunner.query(`
18      ALTER TABLE "users" ADD COLUMN "username" character varying
19    `);
20    await queryRunner.query(`
21      UPDATE "users" SET "username" = "display_name"
22    `);
23  }
24}

The down method exists for local development. In production, the rollback strategy is always forward rollback — write a new migration that adds back the column rather than reversing the applied migration. Once data is written to the new schema, reversing a migration can lose data.

Example: Adding a NOT NULL Column Safely

Adding plan to a users table with 500,000 existing rows:

Phase 1 — Expand:

1ALTER TABLE "users" ADD COLUMN "plan" character varying DEFAULT 'free';

The DEFAULT clause makes this safe — PostgreSQL fills existing rows instantly without scanning the table. But the column is technically nullable at the database level. Use application-level validation to enforce the constraint until phase 3.

Phase 2 — Migrate: Deploy code that always sets plan. No backfill needed if the default covered existing rows.

Phase 3 — Contract:

1ALTER TABLE "users" ALTER COLUMN "plan" SET NOT NULL;

This is safe only after confirming zero NULL rows exist:

1SELECT COUNT(*) FROM "users" WHERE "plan" IS NULL;
2-- Must return 0

If any NULLs exist despite the default, backfill them first:

1UPDATE "users" SET "plan" = 'free' WHERE "plan" IS NULL;

Web developer testing schema changes on a laptop and desktop setup in a modern workspace

Handling Large Table Migrations (Millions of Rows)

Engineer running a large-table backfill on a production database from a laptop in a data center

For tables with millions of rows, even adding a column with a default value can be expensive in some databases (though PostgreSQL handles it well with metadata-only defaults). The biggest risk is the backfill, not the schema change.

Use cursor-based batching instead of LIMIT/OFFSET for very large tables:

1export async function backfillInCursorBatches(
2  dataSource: DataSource,
3  batchSize = 5000,
4) {
5  let lastId = '00000000-0000-0000-0000-000000000000';
6
7  while (true) {
8    const result = await dataSource.query(`
9      UPDATE "orders"
10      SET "status_normalized" = "status"
11      WHERE "id" IN (
12        SELECT "id" FROM "orders"
13        WHERE "status_normalized" IS NULL
14          AND "id" > $1
15        ORDER BY "id" ASC
16        LIMIT $2
17      )
18      RETURNING "id"
19    `, [lastId, batchSize]);
20
21    if (result.length === 0) break;
22
23    lastId = result[result.length - 1].id;
24    console.log(`Backfilled up to ID ${lastId}`);
25
26    await new Promise((resolve) => setTimeout(resolve, 100));
27  }
28}

Cursor-based batching is restartable and predictable. If the backfill is interrupted, it resumes from the last processed ID instead of re-scanning from the beginning.

For extremely large tables (tens of millions of rows), consider using pg_repack or a dedicated online migration tool. But for most SaaS tables under 5 million rows, batched backfills with 100ms delays complete within a reasonable window without blocking production traffic.

Multi-Tenant Migration Strategies

In a multi-tenant SaaS, one size does not fit all. A tenant with 10 rows backfills instantly. A tenant with 10 million rows needs the cursor-based approach and takes hours.

Run backfills per tenant instead of globally:

1export async function backfillPerTenant(
2  dataSource: DataSource,
3  tenantIds: string[],
4) {
5  for (const tenantId of tenantIds) {
6    console.log(`Backfilling tenant ${tenantId}`);
7
8    // Scope every query by tenant_id
9    await dataSource.query(`
10      UPDATE "users"
11      SET "display_name" = "username"
12      WHERE "tenant_id" = $1
13        AND "display_name" IS NULL
14    `, [tenantId]);
15
16    // Rate-limit per tenant to avoid resource contention
17    await new Promise((resolve) => setTimeout(resolve, 200));
18  }
19}

For the rollout, use tenant rings: start with internal tenants, then small customers, then mid-market, then enterprise. If a backfill fails for one tenant, you catch it before it reaches the largest accounts. This pairs well with feature flags for the read cutover — enable the new column reads per tenant rather than globally.

Rollback Plans for Every Migration

Every migration needs a rollback plan before it runs. The plan is not "restore from backup." Restoring from a backup means losing data written since the last snapshot.

Forward rollback is the safe pattern: write a new migration that undoes the change.

For a column addition, the forward rollback is:

1ALTER TABLE "users" DROP COLUMN "display_name";

For a NOT NULL constraint addition:

1ALTER TABLE "users" ALTER COLUMN "plan" DROP NOT NULL;

For a column rename (after dual-write is deployed):

1-- Keep the old column, revert reads to use it
2-- The old column was never dropped, so there is nothing to undo
3-- Just deploy application code that reads from the old column again

The expand-contract pattern makes rollback trivial at every phase because nothing is destroyed until the contract phase, and the contract phase only runs after you are confident the new structure works.

Testing Migrations on a Staging Copy

Test every migration against a dataset that mirrors production size. A migration that takes 200ms on your local laptop with 50 rows can take 45 minutes on a production table with 5 million rows.

The safest approach for TypeORM + NestJS projects:

1// scripts/test-migration.ts
2import { DataSource } from 'typeorm';
3
4async function testMigration() {
5  const dataSource = new DataSource({
6    type: 'postgres',
7    url: process.env.STAGING_DATABASE_URL, // staging DB with production-sized data
8    migrations: ['dist/migrations/*.js'],
9  });
10
11  await dataSource.initialize();
12
13  // Run in a transaction we can roll back
14  const queryRunner = dataSource.createQueryRunner();
15  await queryRunner.startTransaction();
16
17  try {
18    await dataSource.runMigrations({ transaction: 'all' });
19    // If we get here without timeout or lock contention, the migration is safe
20    console.log('Migration test passed');
21    await queryRunner.rollbackTransaction(); // revert the test
22  } catch (err) {
23    await queryRunner.rollbackTransaction();
24    console.error('Migration test failed:', err);
25    process.exit(1);
26  } finally {
27    await dataSource.destroy();
28  }
29}

Even better: restore a recent production backup to staging, run the migration against it, and measure wall-clock time, lock duration, and row counts. If the migration takes longer than your deployment timeout or blocks reads, redesign the expand-contract phases.

CI/CD Sequencing

The order in a deployment pipeline matters:

Run backward-compatible migration (expand phase — add new columns)
Deploy new application code (dual-write, read from old or new)
Run backfill (as a background job after deploy)
Deploy read-cutover code (read from new column only)
Run contract migration (remove old column, run separately after verification)

Steps 3 through 5 can be separated by hours or days. There is no rule that says a migration must finish in one deploy. The expand-contract pattern is designed to span multiple deploys, giving you time to verify each phase before committing to the next.

1# .github/workflows/deploy.yml
2jobs:
3  migrate:
4    runs-on: ubuntu-latest
5    steps:
6      - run: npx typeorm migration:run -d dist/datasource.js
7        # Only runs expand-phase migrations (additive only)
8        # Contract-phase migrations run manually after verification
9
10  deploy:
11    needs: migrate
12    runs-on: ubuntu-latest
13    steps:
14      - run: npm run build && npm run deploy

The migrate step runs before deploy so the new schema exists before the new code needs it. The contract migration runs as a separate, manual step hours or days later.

Conclusion

The migration that locked our production table for 45 minutes was a SET NOT NULL on a table with 3 million rows. We did not check for NULL rows first. We did not batch the change. We ran the migration during business hours because the ticket said "minor schema change."

The expand-contract pattern turns every risky migration into a sequence of safe steps. Add the new column first. Backfill in batches with delays between them. Deploy dual-write code. Verify reconciliation. Only then drop the old column. Each phase is independent, reversible, and deployable at a separate time.

SaaS database migration zero-downtime is not about a clever tool — it is about ordering changes so the database is always compatible with the code talking to it. If you are planning a production migration right now and the first step is "DROP COLUMN" or "RENAME COLUMN," stop. Write the expand phase first. Deploy it. Wait a day. Then write the migrate phase. Your future self — holding a pager at 3pm on a Tuesday — will thank you for the boring, incremental approach that kept the site online.

If you have a migration coming up that scares you a little, that's the healthy reaction — and it's exactly the kind of thing we are happy to look over with you before you run it against production.

Frequently Asked Questions

Zero-downtime database migration is a strategy for changing your database schema without taking the application offline. Instead of running a single migration that locks tables or breaks queries, you split the change into backward-compatible phases — expand (add new structure), migrate (backfill data and update code), and contract (remove old structure). Each phase deploys independently and is reversible.

The expand-contract pattern splits every schema change into three phases. First, expand by adding the new column, table, or index alongside the existing structure. Second, migrate by backfilling data in batches and updating application code to dual-write to both old and new fields. Third, contract by removing the old structure once no code references it. This ensures old and new application versions can run simultaneously during deployment.

Add the new column as nullable, backfill existing data from the old column, update application code to write to both columns and read from the new one, then drop the old column in a separate deployment after verifying no code references it. Never use ALTER TABLE RENAME COLUMN in production — it breaks every query using the old name immediately.

Add the column as nullable first (instant operation), then backfill NULL rows with a default value in batches, and finally add the NOT NULL constraint in a separate step. PostgreSQL only needs to scan the table to validate NOT NULL, so ensure zero NULLs exist before adding the constraint to avoid a long table lock.

Use forward rollback — write a new migration that undoes the previous change instead of reverting the migration file. Never try to reverse-apply a migration after data has been written. For schema changes, this means writing a migration that adds back the old column or constraint. For data changes, maintain a deleted_at soft-delete pattern so recovery is possible without restoring from backup.

About Umar Farooq

Umar Farooq is the founder and lead engineer of Codify SaaS. He builds B2B SaaS products and web applications on modern TypeScript stacks and enterprise Java, and writes code-first guides drawn from real production work — the schema decisions, the migrations that almost went wrong, and the performance fixes that actually moved the numbers. When he recommends an approach, he shows the code and explains the trade-offs.

Read full bio