BRO SRE

Reliability practices, infrastructure, automation


PostgreSQL High Availability: Patroni, PgBouncer, and the Lessons We Learned

2025-11-20 · PostgreSQL, High Availability, Databases

PostgreSQL does not ship with built-in high availability. Streaming replication gives you a hot standby, but automated failover, leader election, and connection routing are your responsibility. After evaluating repmgr, Stolon, and Patroni, we chose Patroni. Here is what we learned running it in production for 18 months.

Why Patroni

Patroni uses a distributed consensus store (etcd, ZooKeeper, or Consul) for leader election and configuration. This means failover decisions are based on distributed consensus, not on a single monitoring node that might itself be partitioned.

The key advantages over repmgr: failover decisions come from the consensus store rather than from a single repmgrd daemon, cluster-wide configuration lives in the DCS and is applied consistently to every node, and each node exposes a REST API that health checks and load balancers can consume directly.

Architecture

Our production setup runs three PostgreSQL nodes across three availability zones, with a three-node etcd cluster for consensus. PgBouncer runs on each application node (not on the database nodes) in transaction pooling mode.

# patroni.yml (simplified)
scope: prod-main
name: pg-node-1

etcd3:
  hosts: etcd-1:2379,etcd-2:2379,etcd-3:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true
      parameters:
        max_connections: 200
        shared_buffers: 8GB
        wal_level: replica
        max_wal_senders: 10
        max_replication_slots: 10
        hot_standby: "on"

postgresql:
  listen: 0.0.0.0:5432
  data_dir: /var/lib/postgresql/16/main
  pgpass: /tmp/pgpass

  authentication:
    superuser:
      username: postgres
    replication:
      username: replicator
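
That REST API is worth dwelling on: each Patroni node serves it on port 8008 by default, and GET /primary returns 200 only on the current leader while GET /replica returns 200 on healthy replicas. One common way to route writes to the leader is to have HAProxy health-check those endpoints; a sketch (listener port and server names are illustrative, and this is not necessarily the routing layer described above):

```
# haproxy.cfg sketch: send writes to whichever node Patroni reports as leader
listen postgres_write
    bind *:5000
    mode tcp
    option httpchk GET /primary
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server pg-node-1 pg-node-1:5432 maxconn 200 check port 8008
    server pg-node-2 pg-node-2:5432 maxconn 200 check port 8008
    server pg-node-3 pg-node-3:5432 maxconn 200 check port 8008
```

On failover, the old primary's /primary check starts returning non-200 and HAProxy drains it within a few check intervals.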

PgBouncer Connection Pooling

PostgreSQL forks a process per connection. At 200 connections, that is manageable. At 2,000 connections across 50 microservices, it is not. PgBouncer solves this by multiplexing application connections onto a smaller pool of PostgreSQL connections.

We run PgBouncer in transaction pooling mode: a server connection is assigned to a client only for the duration of a transaction, then returned to the pool. This gives us 2,000 application connections mapped to 100 PostgreSQL connections.
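
A minimal pgbouncer.ini matching those numbers might look like this (hostnames and the database entry are assumptions; pool sizes mirror the figures above):

```
; pgbouncer.ini sketch: transaction pooling, 2000 clients onto 100 server conns
[databases]
; host/port/dbname are illustrative
prod_main = host=pg-primary.internal port=5432 dbname=prod_main

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction
max_client_conn = 2000     ; application-side connections accepted
default_pool_size = 100    ; server connections per (user, database) pair
```

Note that default_pool_size is per user/database pair, so the effective server connection count depends on how many distinct credentials your services use.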

The critical gotcha: transaction pooling breaks session-level features. Prepared statements, SET commands, LISTEN/NOTIFY, and advisory locks do not work as expected. We discovered this when a service using SET search_path started reading from the wrong schema intermittently — because the SET applied to a shared server connection that was subsequently used by a different client.
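
The workaround for SET specifically is SET LOCAL, which scopes the change to the current transaction and therefore to the one server connection assigned for its duration. A sketch (schema and table names are made up):

```sql
-- Unsafe under transaction pooling: the setting persists on the shared
-- server connection and leaks to whichever client uses it next
SET search_path TO tenant_a;

-- Safe: SET LOCAL reverts at COMMIT/ROLLBACK, so the pooled
-- connection returns to the pool in a clean state
BEGIN;
SET LOCAL search_path TO tenant_a;
SELECT * FROM widgets;  -- resolves against tenant_a
COMMIT;
```
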

WAL Archiving

Streaming replication protects against node failure. WAL archiving protects against data corruption and enables point-in-time recovery. We use WAL-G to archive WAL segments to S3:

# WAL-G configuration
WALG_S3_PREFIX=s3://our-bucket/wal-archive/prod-main
AWS_REGION=us-east-1
WALG_COMPRESSION_METHOD=lz4
WALG_UPLOAD_CONCURRENCY=4

# Full backup schedule (cron)
0 3 * * * /usr/local/bin/wal-g backup-push /var/lib/postgresql/16/main

# Retention: keep 7 days of PITR capability
0 4 * * * /usr/local/bin/wal-g delete retain 7 --confirm
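
The cron entries cover base backups, but continuous archiving of individual WAL segments goes through PostgreSQL's archiver process. With WAL-G that means wiring wal-push and wal-fetch into the server configuration, roughly like this (shown as Patroni parameters to match the earlier patroni.yml; exact placement is an assumption):

```
# bootstrap.dcs.postgresql.parameters (continued)
archive_mode: "on"
archive_command: "wal-g wal-push %p"      # ship each completed WAL segment
restore_command: "wal-g wal-fetch %f %p"  # fetch segments during recovery
```
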

We test recovery monthly by restoring from backup to a staging environment and running our integration test suite against it. This has caught two issues: a misconfigured S3 lifecycle policy that was deleting WAL segments after 3 days, and a WAL-G version incompatibility after an upgrade.
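
The restore drill itself is not shown above; the shape of it with WAL-G on PostgreSQL 12+ is roughly the following (paths and the recovery target are illustrative examples, not our actual values):

```
# Fetch the latest base backup into an empty data directory
wal-g backup-fetch /var/lib/postgresql/16/main LATEST

# PostgreSQL 12+: recovery settings are ordinary GUCs plus a recovery.signal file
cat >> /var/lib/postgresql/16/main/postgresql.auto.conf <<'EOF'
restore_command = 'wal-g wal-fetch %f %p'
recovery_target_time = '2025-11-19 03:00:00+00'
EOF
touch /var/lib/postgresql/16/main/recovery.signal
pg_ctl -D /var/lib/postgresql/16/main start
```

The server replays WAL from the archive up to the target time; once it reaches a consistent state you can run the integration suite against it.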

The Outage That Taught Us About synchronous_commit

Six months into production, we had a 12-minute outage that taught us a painful lesson about PostgreSQL replication guarantees.

The sequence: the primary node's EBS volume experienced elevated latency (an AWS event affecting our AZ). Patroni detected the primary as unhealthy and promoted the synchronous replica to primary. The failover completed in 8 seconds. But when the old primary came back, we discovered that 47 transactions committed in the last 200ms before failover were missing from the new primary.

The cause: we were running with synchronous_commit = on but synchronous_standby_names was empty. In that state PostgreSQL waits only for the WAL to be flushed to the local disk; it does not wait for replication to any standby. The settings that provide zero data loss are:

synchronous_commit = remote_apply
synchronous_standby_names = 'ANY 1 (pg-node-2, pg-node-3)'

The tradeoff: remote_apply adds 1-3ms of commit latency because each transaction waits for at least one replica to apply the WAL. For our workload, this was acceptable. For latency-sensitive workloads, remote_write is a middle ground: it waits for the WAL to be written to the standby's operating system, but not flushed to durable storage or applied there.
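
Whatever level you pick, verify on the primary that a standby is actually attached synchronously; pg_stat_replication reports this. (One caveat worth knowing: when Patroni's synchronous_mode is enabled, Patroni maintains synchronous_standby_names itself, and hand-edited values can be overwritten.)

```sql
-- Run on the primary to confirm synchronous replication is in effect
SELECT application_name, state, sync_state, replay_lag
FROM pg_stat_replication;
-- With 'ANY 1 (...)' quorum syntax, eligible standbys report
-- sync_state = 'quorum' rather than 'sync'
```
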

Monitoring

We track four critical metrics: