Restoring database - How to keep replication in place

database restoration

replication management

data recovery

database maintenance

replication configuration

Restoring database - How to keep replication in place

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Start Practicing Learn More

Introduction

Restoring a replicated database is risky because one wrong step can leave replicas diverged or permanently behind. The safe approach is to treat restore and replication reattachment as one coordinated procedure. If you preserve replication coordinates, seed replicas from a consistent point, and validate state before traffic returns, replication can stay intact during recovery.

Plan Recovery Target Before Touching Data

Start with clear objectives:

point-in-time recovery to a timestamp.
restore to the latest valid backup.
partial object-level recovery versus full cluster restore.

Without this decision, teams often mix incompatible steps, such as replaying logs past the intended recovery point. Document the chosen target and assign one operator to control command execution.

Also capture current replication metadata before changes. For MySQL-based systems, record SHOW REPLICA STATUS on replicas and SHOW MASTER STATUS on the primary. These snapshots are critical if rollback is needed.

Controlled Restore Sequence

The order of operations matters more than the individual commands.

1. Pause write activity

If possible, stop application writes to avoid new divergence during restore.

2. Stop replication apply

sql

STOP REPLICA;

3. Restore primary from known-good backup

bash

mysql -u admin -p < full_backup.sql

For physical backups, use engine-specific tooling and verify file permissions and log state before startup.

4. Capture new replication start point

sql

SHOW MASTER STATUS;

In GTID deployments, confirm GTID mode and executed sets rather than relying on file and position only.

5. Re-seed replicas from matching dataset

Every replica must start from data consistent with the restored primary. Do not attach replicas that still contain later transactions from the old timeline.

6. Reconfigure source coordinates and resume

sql

1CHANGE REPLICATION SOURCE TO
2  SOURCE_HOST='db-primary',
3  SOURCE_USER='repl',
4  SOURCE_PASSWORD='***',
5  SOURCE_LOG_FILE='mysql-bin.004120',
6  SOURCE_LOG_POS=982341;
7
8START REPLICA;

Use your server version syntax and security standards.

GTID and Non-GTID Considerations

GTID simplifies failover and reattachment, but only when configured consistently across nodes. In mixed or partially upgraded environments, operators can mistakenly apply non-GTID steps on GTID replicas.

Practical guidance:

verify gtid_mode, enforce_gtid_consistency, and related settings before incident day.
store restore procedures separately for GTID and non-GTID fleets.
include version-specific SQL examples in runbooks.

If you use channels or filtered replication, verify filter rules after restore. Silent filter mismatch can pass health checks while data drifts over time.

Validation Before Returning Traffic

Do not declare success immediately after START REPLICA. Validate:

replica IO and SQL threads running.
error fields empty.
lag trending toward zero.
checksum or row-count spot checks on critical tables.

sql

SHOW REPLICA STATUS\G;

Then reintroduce write traffic gradually while watching lag and replication errors. A phased ramp catches hidden issues without full customer impact.

Operational Hardening

The best restore is the one you rehearsed. Run non-production drills using realistic data volume and timing constraints. Confirm backup readability, credential access, and monitoring signals.

After each incident or drill:

update runbook steps that caused hesitation.
remove manual steps that can be scripted safely.
improve alerting for lag spikes and replication errors.

These improvements matter more than one-time heroics during an outage.

Common Pitfalls

Restoring primary without recording replication coordinates for reattachment.
Reattaching replicas with datasets from a different timeline.
Ignoring GTID-specific requirements in GTID-enabled environments.
Resuming full write traffic before lag and data checks are stable.
Treating green thread status as proof of full data consistency.

Summary

Keep restore objective explicit and shared before executing commands.
Pause writes and replication, then restore from a known consistent backup.
Re-seed replicas from matching data and attach using correct coordinates or GTID state.
Validate thread health, lag, and data consistency before full traffic return.
Rehearse and refine runbooks so replication-safe restores are repeatable under pressure.