One of the design principles to build reliable workloads is to test your recovery procedures. While this is a challenging task in traditional environments (i.e., on-premises), it’s much easier on the cloud because you can predict how your application fails and simulate a failure. You can then validate how your people, technology, and processes work together to recover from that failure. For more information about cloud design principles, review the Reliability Pillar of the AWS Well-Architected Framework.
In part 1 of this 2-part blog series, I built a disaster recovery (DR) solution for a workload hosted primarily on Google Cloud Platform (GCP), with AWS Cloud as the DR site. I used AWS Elastic Disaster Recovery (AWS DRS) to replicate source servers from GCP to AWS, failed over from GCP to AWS, and cut over by changing the Domain Name System (DNS) records.
In this post, I complete the demonstration by simulating a common requirement: organizations use the DR site to serve traffic and run transactions, and then return to normal operations on the primary site once the outage is over.
This process is called failback. It is the act of redirecting traffic from the recovery system (AWS in this case) back to the primary system (GCP). AWS DRS uses the Failback Client on the original source server (on GCP) to reverse the replication direction and replicate the data written on AWS during the outage back to GCP.
This solution assumes that you completed the failover part, as discussed in part 1 of this 2-part blog series.
Solution overview and walkthrough
The following diagram illustrates the solution workflow.
At a high-level, I perform the following steps:
- Prepare Failback Client.
- Meet the networking requirements.
- Create an AWS Identity and Access Management (IAM) user and generate IAM credentials.
- Boot Failback Client on GCP and complete the replication.
- Cutover and switch the replication back to normal.
1. Prepare Failback Client
The Failback Client will be used to boot the server that my system fails back to. The download link for the client depends on the AWS Region where the Recovery instances are located. I’m using us-east-2. I use this link to download the Failback Client software.
Because the Failback Client is distributed as a live CD ISO image (livecd.iso), I can’t use it directly to boot a virtual machine (VM) on GCP. To resolve this, I convert it to a virtual disk that GCP can use. The conversion process installs the Linux kernel, generates the Grand Unified Bootloader (GRUB) configuration file (grub.cfg), and then installs GRUB on the disk. The output of this conversion is a virtual disk (VMDK), which I use to create a GCP-compatible image (check how) that I can then use to create my failback VM.
The steps for the conversions are outside the scope of this post; however, you can view and download the code from here.
2. Meet the networking requirements
The failback process requires the following connections to be permitted:
- TCP 1500 from the Failback Client on GCP to the Recovery instance on AWS.
  - This allows replication in the opposite direction to the one used from GCP to AWS in part 1.
- TCP 443 from the Failback Client on GCP to the Amazon S3 endpoint.
  - This is used to download the replication software.
- TCP 443 from the Failback Client on GCP to the AWS DRS endpoint.
  - This is used for pairing, to initiate communication, and to send ongoing replication state updates.
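Before booting the Failback Client, it can help to confirm that these connections are actually permitted. The following is a minimal sketch, not part of the original walkthrough, that attempts a TCP connection to each target using only the Python standard library. The private IP address is a placeholder for the Recovery instance, and the two endpoint hostnames assume the standard regional naming for us-east-2; adjust all three to your environment.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Assumed targets for illustration only; replace with your own values.
REQUIRED_CONNECTIONS = [
    ("10.0.1.25", 1500),                   # placeholder private IP of the Recovery instance
    ("s3.us-east-2.amazonaws.com", 443),   # assumed regional Amazon S3 endpoint
    ("drs.us-east-2.amazonaws.com", 443),  # assumed regional AWS DRS endpoint
]


def check_all(targets):
    """Map each 'host:port' string to whether a TCP connection succeeded."""
    return {f"{host}:{port}": can_connect(host, port) for host, port in targets}
```

Calling `check_all(REQUIRED_CONNECTIONS)` from a machine in the same GCP network as the Failback Client returns a dictionary of results. A failed check on port 1500 can also mean that nothing is listening on the Recovery instance yet, so treat the result as a hint about firewall and routing rules rather than proof.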
3. Create an IAM user and generate IAM credentials
To perform a failback with the Elastic Disaster Recovery Failback Client, I must generate the required AWS credentials. These can be either permanent credentials (an IAM user) or temporary credentials (obtained by assuming a role). The credentials are only used during Failback Client installation. For this demonstration, I followed the instructions in Using the Failback Client to create an IAM user (drs-failback). I make a note of the user’s credentials because I use them in the next step (step 4).
4. Boot Failback Client on GCP and complete the replication
To prepare for the failback replication:
1. I confirm in the failback replication settings in the AWS DRS console that “Use private IP for data replication” is set to Yes, because I’m using the VPN connection that I built in part 1 of this post.
2. I identify the ID of the Recovery instance on AWS that I’d like to fail back from. I use this ID later, when the Failback Client asks for it.
3. For the failback to work, the server running the Failback Client must have volumes at least as large as those of the Recovery instance on AWS, so that they can hold any data written to the DR site during the outage. In this demonstration, I use the same VM on GCP that I used in part 1 of this post. I add an additional 30 GB volume to this VM and use this newly added volume (sdb) for the failback data replication.
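The volume size requirement can be checked before starting the client. The following is a small sketch under stated assumptions: the device names and the 20 GiB size of the Recovery instance volume are hypothetical (the post only states that the new target volume is 30 GB), and the sizes are entered by hand rather than read from either system.

```python
from dataclasses import dataclass

GIB = 1024 ** 3  # bytes per GiB


@dataclass(frozen=True)
class VolumePair:
    recovery_volume: str   # volume on the Recovery instance (AWS)
    recovery_bytes: int
    target_device: str     # block device on the Failback Client VM (GCP)
    target_bytes: int


def undersized(pairs):
    """Return the pairs whose target device is smaller than the recovery volume."""
    return [p for p in pairs if p.target_bytes < p.recovery_bytes]


# Hypothetical sizes for illustration; only the 30 GB target comes from the post.
pairs = [VolumePair("/dev/nvme0n1", 20 * GIB, "/dev/sdb", 30 * GIB)]

for p in undersized(pairs):
    print(f"{p.target_device} is too small for {p.recovery_volume}")
```

With the sizes above, nothing is reported because the target device is larger than the volume it receives.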
This is what the file system on the Failback Client VM looks like after adding the additional volume:
I start the Failback Client by running the start.sh script. It asks me for the name of the AWS Region, so I enter us-east-2.
The failback process starts and the wizard asks me to provide the following details:
- AWS access key ID and secret access key: I provide the credentials I created in step 3.
- Recovery instance ID: The Failback Client tries to match the configuration of the server where it is running against the Recovery instances on AWS. If it doesn’t find a matching instance, it asks you to enter the ID of the Recovery instance that you’d like to fail back from. I enter the ID I noted in step 4.2.
- Local block device: Similarly, the Failback Client tries to map the volumes of the Recovery instance to the volumes of the failback server. If it doesn’t find a mapping, it asks you to manually enter the local block device to which you want the data replicated.
After successfully mapping the volumes and establishing a connection, the Failback Client downloads the replication software and starts the reversed replication from AWS back to GCP. At this point, the screen shows “Replication in progress.”
To confirm the status on the AWS side, I go back to the AWS DRS console, where I see that the connection has been established and that the replication is about to start.
After some time, I see the replication has progressed to 44%.
And then it completes.
Once all the Recovery instances I plan to fail back show the statuses above, I select the checkbox to the left of each Instance ID and choose Failback. This stops data replication and starts the conversion process, which finalizes the failback and creates a replica of each Recovery instance on the corresponding source server.
This action will create a Job, which you can follow on the Recovery job history page. After the failback is complete, the Failback Client will show that the failback has been completed successfully.
All the data from the Recovery instance (on AWS) is now replicated to the new volume (sdb) that I created on the Failback Client VM on GCP. The last step to complete the failback process is to create a new image from this volume and then use that image to launch a new Compute Engine VM on GCP. This new VM has both the original data (that is, the data from before the disaster) and the data written on AWS during the DR test. I then use this VM to return to normal operations: I restart the replication from GCP to AWS, and I repeat the exercise when I have an outage or want to conduct another DR exercise.
I follow the steps described earlier in this post to create a new image:
I then follow the steps to create a new Compute Engine VM from the image I created.
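These two steps can also be scripted. The following is a sketch, assuming the gcloud CLI is installed and authenticated, that builds the two commands as argument lists. All resource names, the zone, and the machine type are hypothetical placeholders, and this is one possible approach rather than the exact procedure used in this post.

```python
def image_create_cmd(image_name: str, source_disk: str, zone: str) -> list:
    """Build a gcloud command that creates an image from an existing disk."""
    return [
        "gcloud", "compute", "images", "create", image_name,
        f"--source-disk={source_disk}",
        f"--source-disk-zone={zone}",
    ]


def instance_create_cmd(instance_name: str, image_name: str, zone: str,
                        machine_type: str) -> list:
    """Build a gcloud command that creates a VM that boots from the image."""
    return [
        "gcloud", "compute", "instances", "create", instance_name,
        f"--image={image_name}",
        f"--zone={zone}",
        f"--machine-type={machine_type}",
    ]


# Hypothetical names for illustration only.
commands = [
    image_create_cmd("failback-image", "failback-data-disk", "us-central1-a"),
    instance_create_cmd("app-server-restored", "failback-image",
                        "us-central1-a", "e2-medium"),
]

for cmd in commands:
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` to execute it. Depending on the state of the source disk, it may need to be detached, or its VM stopped, before the image can be created.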
5. Cutover and switch the replication back to normal
The last step is to cut over the DNS so that it points back to your original workload on GCP. For more information on options for changing DNS, see part 1 of this 2-part blog series.
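As one illustration of a DNS cutover, the sketch below builds the change batch for an UPSERT of an A record. It assumes the record is hosted in Amazon Route 53, which this post does not state, and the record name and IP address are hypothetical placeholders.

```python
def build_cutover_change(record_name: str, gcp_ip: str, ttl: int = 60) -> dict:
    """Build a Route 53 change batch that points an A record at the GCP workload."""
    return {
        "Comment": "Failback: point the record at the primary site on GCP",
        "Changes": [
            {
                "Action": "UPSERT",  # create the record, or update it if it exists
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "A",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": gcp_ip}],
                },
            }
        ],
    }


# Hypothetical values: a documentation-range IP and an example domain.
change = build_cutover_change("app.example.com.", "203.0.113.10")
```

With boto3 installed, the batch can be applied with `boto3.client("route53").change_resource_record_sets(HostedZoneId=..., ChangeBatch=change)`, where the hosted zone ID is your own. A low TTL shortens the time clients keep resolving to the DR site after the change.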
Solution cost and pricing
The pricing details discussed in part 1 of this post also apply to part 2. However, part 2 has an additional factor to consider and that is the outbound data transfer charges for the failback replication from AWS to GCP. Data transfer from AWS to the internet is charged per service, with rates specific to the originating Region. Refer to the pricing pages for each service for more details. For example, see the pricing page for Amazon Elastic Compute Cloud (Amazon EC2).
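To get a rough sense of this extra charge, multiply the amount of data replicated back to GCP by the outbound rate for your Region. The sketch below does that arithmetic. The rate of 0.09 USD per GB is a placeholder assumption, not a quoted price, and the 30 GB figure simply reuses the size of the failback volume from this demonstration as an upper bound.

```python
from decimal import Decimal


def estimate_transfer_cost(data_gb: Decimal, rate_per_gb: Decimal) -> Decimal:
    """Estimate outbound data transfer cost as data volume times a per-GB rate."""
    return data_gb * rate_per_gb


# Placeholder rate for illustration; look up the actual rate on the pricing page.
ASSUMED_RATE_USD_PER_GB = Decimal("0.09")

cost = estimate_transfer_cost(Decimal("30"), ASSUMED_RATE_USD_PER_GB)
print(f"Estimated outbound transfer cost: {cost} USD")
```

The estimate ignores free-tier allowances and tiered pricing, so treat it as an order-of-magnitude check only.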
To avoid incurring unwanted AWS costs after performing these steps, delete the AWS and GCP resources created for this demonstration, which include the Compute Engine and networking components on GCP, the replication instances, EBS snapshots, and the Recovery Instance created by AWS DRS on AWS.
In this post, I showed the steps to fail back from a disaster recovery site (AWS Cloud) to the primary site (GCP). This is usually the final task in a DR test in which users’ requests are served from the DR site and users perform transactions there. The failback process reverses the replication direction, from AWS back to GCP, so that the data written on the DR site during the outage is replicated back to the primary site. Because the ISO image of the Failback Client can’t be used directly on GCP, I performed a conversion process to create a bootable disk that is compatible with GCP. I completed the replication to a volume created for that purpose, and then used that volume to create a new Compute Engine VM on GCP that takes the place of the original server from before the outage. Finally, I switched DNS to serve customers’ traffic from GCP.
Thanks for reading this 2-part blog series. If you have any comments or questions, feel free to leave them in the comments section.