Replication Lag Detection | PG Pilot PostgreSQL Monitoring

What is this issue?

Replication lag measures how far behind a replica is from the primary database.
PG Pilot measures this in bytes of WAL (Write-Ahead Log) not yet consumed.

Lag thresholds:
- Warning: > 1 GB lag
- Critical: > 5 GB lag

High lag means:
- Read replicas are serving stale data
- Failover would result in data loss
- WAL files are accumulating on the primary

Why it matters

Stale Reads

Applications reading from replicas get outdated data

Data Loss Risk

Failover to a lagging replica loses uncommitted transactions

Disk Space

Primary retains WAL files until replicas consume them

Recovery Time

High lag means longer time to catch up during maintenance

How PG Pilot detects it

```sql
SELECT
  slot_name,
  active,
  pg_size_pretty(
    pg_current_wal_lsn() - confirmed_flush_lsn
  ) AS lag_bytes,
  pg_current_wal_lsn() - confirmed_flush_lsn AS lag_raw
FROM pg_replication_slots
WHERE slot_type = 'logical'

UNION ALL

SELECT
  slot_name,
  active,
  pg_size_pretty(
    pg_current_wal_lsn() - restart_lsn
  ) AS lag_bytes,
  pg_current_wal_lsn() - restart_lsn AS lag_raw
FROM pg_replication_slots
WHERE slot_type = 'physical';
```

How to fix it

1

Check replica status

Verify the replica is running and connected:

SELECT * FROM pg_stat_replication;

Look for the replica's state (should be streaming)
and sent_lsn vs replay_lsn.

2

Check replica performance

On the replica, check if it's keeping up:

SELECT
  pg_last_wal_receive_lsn(),
  pg_last_wal_replay_lsn(),
  pg_last_xact_replay_timestamp();

If replay is behind receive, the replica is I/O bound.

3

Investigate bottlenecks

Common causes of lag:
- Slow I/O on replica (check disk metrics)
- Network issues between primary and replica
- Heavy read load on replica affecting replay
- Replica running on undersized hardware

4

Tune replica settings

Increase replay parallelism:

-- On replica (requires restart)
max_parallel_apply_workers_per_subscription = 4

For physical replication, ensure sufficient:
- max_wal_senders on primary
- Network bandwidth
- Disk I/O capacity

Prevention

Size replicas appropriately for the write load
Monitor lag continuously with PG Pilot
Set up alerts before lag becomes critical
Test failover procedures regularly

Related Features

Replication Monitoring

PostgreSQL Documentation

PostgreSQL Streaming Replication

High Replication Lag