Block Storage Disaster Recovery
RBD Mirror is a feature of Ceph Block Storage (RBD) that enables asynchronous data replication between different Ceph clusters, providing cross-cluster Disaster Recovery (DR). Its core function is to synchronize data in a primary-backup mode, ensuring rapid service takeover by the backup cluster when the primary cluster fails.
- RBD Mirror performs incremental synchronization based on snapshots, with a default snapshot interval of once per hour (configurable). The differential data between primary and backup clusters typically corresponds to writes within one snapshot cycle.
- RBD Mirror only provides underlying storage data backup and does not handle Kubernetes resource backups. Please use the platform's Backup and Restore feature to back up PVC and PV resources.
Terminology
Backup Configuration
Prerequisites
- Prepare two clusters capable of deploying Alauda Build of Rook-Ceph: a Primary cluster and a Secondary cluster, with network connectivity between them.
- Both clusters must run the same platform version (v3.12 or later).
- Create distributed storage services in both Primary and Secondary clusters.
- Create block storage pools with identical names in both Primary and Secondary clusters.
- Ensure that the following two images have been uploaded to the platform's private image repository:
  - quay.io/csiaddons/k8s-controller:v0.12.0 -> <registry>/csiaddons/k8s-controller:v0.12.0
  - quay.io/csiaddons/k8s-sidecar:v0.12.0 -> <registry>/csiaddons/k8s-sidecar:v0.12.0
Procedures
Bootstrap Peers (Primary <-> Secondary)
-
Enable Mirroring for Primary Cluster's Block Storage Pool
Execute the following command on both Primary and Secondary clusters' Control nodes:
Parameters:
<block-pool-name>: Block storage pool name.
-
Obtain the Peer Token
This token serves as the critical credential for establishing mirror connections between clusters.
Execute the following command on both Primary and Secondary clusters' Control nodes:
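A sketch of how the token might be read, assuming upstream Rook behaviour: once mirroring is enabled, the pool status records the name of a bootstrap peer Secret, and that Secret holds the token base64-encoded under the key token. The rook-ceph namespace and the function name are assumptions.

```shell
# Print the decoded bootstrap peer token for a pool.
# Assumption: the pool status exposes the Secret name at
# .status.info.rbdMirrorBootstrapPeerSecretName, as in upstream Rook.
get_peer_token() {
  pool="$1"
  ns="${2:-rook-ceph}"
  secret=$(kubectl -n "$ns" get cephblockpool "$pool" \
    -o jsonpath='{.status.info.rbdMirrorBootstrapPeerSecretName}')
  kubectl -n "$ns" get secret "$secret" -o jsonpath='{.data.token}' | base64 -d
}

# Usage: get_peer_token <block-pool-name>
```

Run it on each cluster and keep the two outputs apart; each token is consumed by the opposite cluster in the next step.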
-
Create Peer Token Secret in Peer Cluster
Execute the following command on both Primary and Secondary clusters' Control nodes:
Parameters:
-
<token>: Token obtained from Step 2.
On the Primary cluster, configure this field using the token obtained from the Secondary cluster.
On the Secondary cluster, configure this field using the token obtained from the Primary cluster.
-
<block-pool-name>: Block storage pool name.
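As an illustration, the peer Secret could be created with two literal keys, token and pool, which is the layout upstream Rook expects. The Secret name rbd-peer-site-secret and the rook-ceph namespace are hypothetical defaults chosen for this sketch.

```shell
# Create the Secret that stores the peer cluster's bootstrap token.
# Assumptions: Secret name and namespace are placeholders; adjust to your environment.
create_peer_secret() {
  pool="$1"
  token="$2"
  secret="${3:-rbd-peer-site-secret}"
  ns="${4:-rook-ceph}"
  kubectl -n "$ns" create secret generic "$secret" \
    --from-literal=token="$token" \
    --from-literal=pool="$pool"
}

# Usage: create_peer_secret <block-pool-name> <token>
```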
-
Patch Peer Secret for Block Storage Pool
Execute the following command on both Primary and Secondary clusters' Control nodes:
Parameters:
<block-pool-name>: Block storage pool name.
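A possible form of this patch, assuming the peer Secret from the previous step is referenced through spec.mirroring.peers.secretNames as in upstream Rook. The default Secret name and namespace are the same hypothetical values as before.

```shell
# Register the peer Secret on the CephBlockPool.
# Assumption: the Secret named in $2 already exists in the same namespace.
add_pool_peer() {
  pool="$1"
  secret="${2:-rbd-peer-site-secret}"
  ns="${3:-rook-ceph}"
  kubectl -n "$ns" patch cephblockpool "$pool" --type merge \
    -p "{\"spec\":{\"mirroring\":{\"peers\":{\"secretNames\":[\"$secret\"]}}}}"
}

# Usage: add_pool_peer <block-pool-name>
```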
-
Deploy Mirror Daemon
This daemon is responsible for monitoring and managing RBD mirror synchronization processes, including data synchronization and error handling.
Execute the following command on both Primary and Secondary clusters' Control nodes:
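The daemon is typically described by a CephRBDMirror resource. The sketch below renders a minimal manifest with a single daemon instance; the resource name rbd-mirror and the rook-ceph namespace are assumptions.

```shell
# Print a minimal CephRBDMirror manifest (one rbd-mirror daemon).
# Assumption: resource name and namespace are placeholders.
render_rbd_mirror() {
  ns="${1:-rook-ceph}"
  cat <<EOF
apiVersion: ceph.rook.io/v1
kind: CephRBDMirror
metadata:
  name: rbd-mirror
  namespace: $ns
spec:
  count: 1
EOF
}

# Usage: render_rbd_mirror | kubectl apply -f -
```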
-
Verify Mirror Status
Execute the following command on both Primary and Secondary clusters' Control nodes:
Parameters:
<block-pool-name>: Block storage pool name.
Setup Environment For Volume Replication
This feature enables efficient data replication and synchronization without interrupting primary application operations, enhancing system reliability and availability.
-
Setup CsiAddons Controller
Execute the following commands on both Primary and Secondary clusters' Control nodes:
Parameters:
<registry>: Registry address of platform.
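The controller manifests themselves are not shown here. As a sketch of the registry substitution only, the filter below rewrites quay.io/csiaddons image references to the platform registry in any manifest passed on standard input. The file name in the usage line is hypothetical.

```shell
# Rewrite csiaddons image references to point at the platform registry.
# Reads a manifest on stdin and writes the rewritten manifest to stdout.
rewrite_csiaddons_images() {
  registry="$1"
  sed "s#quay\.io/csiaddons/#${registry}/csiaddons/#g"
}

# Usage (hypothetical file name):
#   rewrite_csiaddons_images <registry> < setup-controller.yaml | kubectl apply -f -
```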
-
Enable CsiAddons sidecar
Execute the following commands on both Primary and Secondary clusters' Control nodes:
Wait for all CSI pods to restart successfully.
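In upstream Rook the sidecar is switched on through the operator ConfigMap key CSI_ENABLE_CSIADDONS. The sketch below assumes the platform keeps that ConfigMap name (rook-ceph-operator-config) and the rook-ceph namespace; both should be checked against the actual deployment.

```shell
# Turn on the CSI-Addons sidecar via the Rook operator ConfigMap.
# Assumption: ConfigMap name and namespace follow upstream Rook defaults.
enable_csiaddons_sidecar() {
  ns="${1:-rook-ceph}"
  kubectl -n "$ns" patch configmap rook-ceph-operator-config --type merge \
    -p '{"data":{"CSI_ENABLE_CSIADDONS":"true"}}'
}

# Usage: enable_csiaddons_sidecar
# Afterwards, watch the CSI pods restart: kubectl -n rook-ceph get pods -w
```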
-
Create VolumeReplicationClass
Execute the following commands on both Primary and Secondary clusters' Control nodes:
Parameters:
<scheduling-interval>: Scheduling interval (e.g., schedulingInterval: "1h" indicates execution every hour).
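A minimal VolumeReplicationClass could look like the manifest rendered below. It assumes snapshot-based mirroring, the upstream Rook provisioner naming (<namespace>.rbd.csi.ceph.com), and the rook-csi-rbd-provisioner Secret; the class name rbd-volumereplicationclass is a placeholder.

```shell
# Print a VolumeReplicationClass manifest for snapshot-based RBD mirroring.
# Assumptions: class name, provisioner name, and Secret name follow upstream Rook defaults.
render_volume_replication_class() {
  interval="$1"
  ns="${2:-rook-ceph}"
  cat <<EOF
apiVersion: replication.storage.openshift.io/v1alpha1
kind: VolumeReplicationClass
metadata:
  name: rbd-volumereplicationclass
spec:
  provisioner: $ns.rbd.csi.ceph.com
  parameters:
    mirroringMode: snapshot
    schedulingInterval: "$interval"
    replication.storage.openshift.io/replication-secret-name: rook-csi-rbd-provisioner
    replication.storage.openshift.io/replication-secret-namespace: $ns
EOF
}

# Usage: render_volume_replication_class 1h | kubectl apply -f -
```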
Enable Mirror for PVC
Execute the following command on the Primary cluster's Control node:
Parameters:
<vr-name>: The name of the VolumeReplication object; recommended to be the same as the PVC name.
<namespace>: The namespace to which the VolumeReplication belongs, which must be the same as the PVC namespace.
<pvc-name>: The name of the PVC for which mirroring needs to be enabled.
Note: After mirroring is enabled, the RBD image in the Secondary cluster becomes read-only.
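The three parameters map onto a VolumeReplication object roughly as follows. This is a sketch: the class name rbd-volumereplicationclass is a placeholder and must match whichever VolumeReplicationClass exists in the cluster.

```shell
# Print a VolumeReplication manifest that marks a PVC's image as primary.
# Assumption: the referenced VolumeReplicationClass name is a placeholder.
render_volume_replication() {
  vr_name="$1"
  namespace="$2"
  pvc_name="$3"
  cat <<EOF
apiVersion: replication.storage.openshift.io/v1alpha1
kind: VolumeReplication
metadata:
  name: $vr_name
  namespace: $namespace
spec:
  volumeReplicationClass: rbd-volumereplicationclass
  replicationState: primary
  dataSource:
    apiGroup: ""
    kind: PersistentVolumeClaim
    name: $pvc_name
EOF
}

# Usage: render_volume_replication <vr-name> <namespace> <pvc-name> | kubectl apply -f -
```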
Planned Migration
Use cases: Datacenter maintenance, technology refresh, disaster avoidance, etc.
Relocation
The Relocation operation is the process of switching production to a backup facility (normally your recovery site) or vice versa.
For relocation, access to the image on the primary site must be stopped first. The image is then made primary on the secondary cluster so that access can resume there.
Prerequisites
- The Kubernetes resources of the Primary cluster have been backed up and restored to the Secondary cluster, including PVCs, PVs, application workloads, etc.
Procedures
Follow the steps below for a planned migration of workloads from the Primary cluster to the Secondary cluster:
-
Scale down all the application pods which are using the mirrored PVC on the Primary cluster.
-
Update the VolumeReplications for all the PVCs for which mirroring is enabled on the Primary cluster.
Set spec.replicationState to secondary.
-
Create VolumeReplications for all the PVCs for which mirroring is enabled on the Secondary cluster.
Parameters:
<vr-name>: The name of the VolumeReplication object; recommended to be the same as the PVC name.
<namespace>: The namespace to which the VolumeReplication belongs, which must be the same as the PVC namespace.
<pvc-name>: The name of the PVC for which mirroring needs to be enabled.
-
Check the VolumeReplication CR status to verify that the image is marked primary on the secondary site.
-
Once the image is marked as primary, the PVC is ready to be used. The applications can now be scaled up to use the PVC.
Disaster Recovery
Use cases: Natural disasters, power failures, system failures, crashes, etc.
Failover (abrupt shutdown)
In case of a disaster, create a VolumeReplication CR at the secondary site.
Since the connection to the primary site is lost, the operator automatically sends a gRPC request down to the driver to forcefully mark the dataSource as primary on the secondary site.
Prerequisites
- The Kubernetes resources of the Primary cluster have been backed up and restored to the Secondary cluster, including PVCs, PVs, application workloads, etc.
Procedures
-
Create VolumeReplications for all the PVCs for which mirroring is enabled on the Secondary cluster.
Parameters:
<vr-name>: The name of the VolumeReplication object; recommended to be the same as the PVC name.
<namespace>: The namespace to which the VolumeReplication belongs, which must be the same as the PVC namespace.
<pvc-name>: The name of the PVC for which mirroring needs to be enabled.
-
Check the VolumeReplication CR status to verify that the image is marked primary on the secondary site.
-
Once the image is marked as primary, the PVC is ready to be used. The applications can now be scaled up to use the PVC.
Failback (post-disaster recovery)
Once the failed cluster on the primary site has recovered and you want to fail back from the secondary site, follow the steps below:
Prerequisites
- The Kubernetes resources of the Primary cluster have been backed up and restored to the Secondary cluster, including PVCs, PVs, application workloads, etc.
Procedures
-
Scale down the running applications (if any) on the primary site. Ensure that all persistent volumes in use by the workload are no longer in use on the primary cluster.
-
Update VolumeReplication CR replicationState from primary to secondary on the primary site.
-
Scale down the applications on the secondary site.
-
Update the VolumeReplication CR replicationState from primary to secondary on the secondary site.
-
On the primary site, verify the VolumeReplication status is marked as volume ready to use.
-
Once the volume is marked as ready to use, change the replicationState from secondary to primary on the primary site.
-
Scale up the applications again on the primary site.