Application failover with Health Checks

By correctly implementing DNS failover with Amazon Route 53, we can detect an application outage and redirect application users to alternate endpoint locations. This improves the availability of your applications, because Route 53 automatically takes unhealthy endpoints out of service while your application is unavailable there.

In this example we will work through setting up failover between EC2 and S3 for an example domain, example.com.

To get started, log into the Amazon Route 53 console. If you already have a hosted zone to work with, use it; otherwise, create a new hosted zone for your account.

Create a new record set for the hosted zone's apex domain name; in this example that is a record for example.com. Set its type to A, and set alias to yes. Select your ELB endpoint in EC2 as the alias target (if you are not using an ELB, you can create a regular record pointing to the instance IP address, and create a health check for that resource). Set the routing policy to failover, with a failover type of primary. Set evaluate target health to yes (and, if applicable, set associate with health check to yes and choose the health check).

Save the record.
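For the non-ELB case mentioned above, the record is a regular (non-alias) A record with a health check attached. The following is a minimal sketch of what that record set could look like in the JSON form used by the Route 53 API. The IP address, the health check ID, and the "primary" identifier are placeholders chosen for illustration, not values from this walkthrough.

```python
import json


def primary_ip_record(name, ip_address, health_check_id, ttl=60):
    """Build a non-alias PRIMARY failover record that points at an instance IP."""
    return {
        "Name": name,
        "Type": "A",
        # Any label that is unique among the records sharing this name and type.
        "SetIdentifier": "primary",
        "Failover": "PRIMARY",
        # Non-alias records carry their own TTL.
        "TTL": ttl,
        "ResourceRecords": [{"Value": ip_address}],
        # ID of a health check created beforehand (placeholder below).
        "HealthCheckId": health_check_id,
    }


# 192.0.2.10 is a documentation-only address; the health check ID is hypothetical.
record = primary_ip_record("example.com", "192.0.2.10", "hypothetical-health-check-id")
print(json.dumps({"Action": "CREATE", "ResourceRecordSet": record}, indent=2))
```

The printed object has the same shape as one entry in the "Changes" list of a change batch.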

Next up is setting up the failover to go to your Amazon S3 bucket, which in this example could host a maintenance notification or something similar.

You will need to name your bucket correctly: the bucket name must match the record name (example.com in this example), otherwise you’ll encounter issues unless you’re implementing CloudFront (a CDN). Create another example.com record, type A, alias yes, and set it to your S3 bucket. Next, change the routing policy to failover, with a failover type of secondary. Set evaluate target health to yes. For S3 you don’t need a health check, as Amazon has internal health checks validating the status of S3.
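A quick sanity check for the bucket naming can be sketched as below. It assumes the usual requirement that a bucket used as a Route 53 alias target is named exactly after the DNS record; the function name is made up for this example.

```python
def bucket_matches_record(bucket_name, record_name):
    """Return True when the bucket name equals the DNS record name."""
    # DNS names are case-insensitive and may be written with a trailing dot,
    # so normalize the record name before comparing.
    normalized = record_name.rstrip(".").lower()
    return bucket_name == normalized


print(bucket_matches_record("example.com", "example.com."))  # matching name
print(bucket_matches_record("example-com", "example.com"))   # mismatched name
```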

Now, if you’ve set everything up correctly, requests should hit the ELB until the ELB fails the associated health check, at which point they should be directed to the S3 bucket.

If set up correctly your DNS should normally look like:

example.com ====> example.elb.amazon.com
            X               
            X
            X
            XXXX> example-com.s3.aws.amazon.com (not returned)

And if the ELB were to fail:

example.com XXXX> example.elb.amazon.com (not returned)
            |               
            |
            |
            ====> example-com.s3.aws.amazon.com

You can nest failover records by using other records in the zone as the alias target instead of the ELB, and setting up failover on those records as well.

example.com ====> web.example.com ====> pri-web-example.elb.amazon.com
            X                     X
            X                     X
            X                     X
            X                     ====> sec-web-example.elb.amazon.com (not returned)
            X
            X
            ====> example-com.s3.aws.amazon.com (not returned)
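The nested layout above can be modelled as a small tree in which each failover pair tries its primary first and falls back to its secondary. This is only a simplified sketch of the selection order, not Route 53's actual implementation: in particular, it returns None when nothing is healthy, and it does not model what Route 53 answers in that situation.

```python
def resolve(node, health):
    """Resolve a name through nested failover records.

    node is either an endpoint name (str) or a (primary, secondary) tuple.
    health maps endpoint names to True (healthy) or False (unhealthy).
    Returns the endpoint to answer with, or None if nothing under node is healthy.
    """
    if isinstance(node, str):
        return node if health.get(node, False) else None
    primary, secondary = node
    # Try the primary branch first; fall back to the secondary branch.
    return resolve(primary, health) or resolve(secondary, health)


# web.example.com fails over between two ELBs; the apex fails over to S3.
web = ("pri-web-example.elb.amazon.com", "sec-web-example.elb.amazon.com")
apex = (web, "example-com.s3.aws.amazon.com")

health = {
    "pri-web-example.elb.amazon.com": False,
    "sec-web-example.elb.amazon.com": True,
    "example-com.s3.aws.amazon.com": True,
}
print(resolve(apex, health))
```

With the primary ELB marked unhealthy, as in the sample health map, the lookup falls through to the secondary ELB; the S3 bucket is only reached when both ELBs are down.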

You can also do something similar with round robin DNS, by setting the record to evaluate target health.

Health Checks for External / Unique Resources

Since we’re working with failover records, you’ll need to set the record to evaluate target health or create a health check. For ELBs and CloudFront you can just set the record to evaluate target health, as they have a mechanism built in. For other resources you will need to set up a health check: from the Route 53 dashboard go to the health check section, and use the Create Health Check button to bring up the health check form.

As an example you would set the health check as follows:

  • Name: example.com-eip-health
  • Protocol: HTTP (or even better, HTTPS…)
  • Specify Endpoint By: Domain Name
  • Domain Name: some-eip.amazon.com
  • Port: 80
  • Path: /
  • Request Interval: Standard
  • Failure Threshold: 3
  • Enable String Matching: No
  • Create Alarm: No

When ready, click create.
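The same settings can also be expressed as a JSON configuration for the aws route53 create-health-check command. The sketch below builds that configuration with field names as I understand them from the Route 53 HealthCheckConfig structure; treat the mapping from console labels to fields as an assumption and verify it against the API reference. The name shown in the console is not part of this structure (as far as I know it is applied separately as a tag), so it is left out.

```python
import json


def build_health_check_config(domain, port=80, path="/", fast=False, failure_threshold=3):
    """Translate the console form values into a HealthCheckConfig-style dict."""
    return {
        # Plain HTTP check, i.e. string matching disabled.
        "Type": "HTTP",
        "FullyQualifiedDomainName": domain,
        "Port": port,
        "ResourcePath": path,
        # "Standard" is a 30-second interval, "Fast" is a 10-second interval.
        "RequestInterval": 10 if fast else 30,
        "FailureThreshold": failure_threshold,
    }


config = build_health_check_config("some-eip.amazon.com")
print(json.dumps(config, indent=2))
```

Saved to a file, the output could then be passed to create-health-check through its --health-check-config option, together with a unique --caller-reference value.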

If you’ve configured the health check as shown above, it will take between 90 and 150 seconds before the failover actually occurs: 90 seconds for three consecutive checks to fail at the standard 30-second interval, plus up to an additional 60 seconds for the DNS record to be updated on the client side. You can shorten that time by setting the health check request interval to fast (10 seconds) and the failure threshold to 1. That will decrease the time for a transition to between 10 and 70 seconds (10 seconds for the check to fail, plus the same 60 seconds of possible DNS caching).
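The arithmetic behind those ranges can be written down as a small helper. It assumes, as the figures above do, that a failure is detected after interval times threshold seconds and that clients may cache the old answer for up to 60 seconds; the function name is made up for this example.

```python
def failover_window(request_interval, failure_threshold, dns_ttl=60):
    """Return (best_case, worst_case) seconds until clients see the failover."""
    # Time for the required number of consecutive checks to fail.
    detection = request_interval * failure_threshold
    # Clients may keep the stale record until its TTL expires.
    return detection, detection + dns_ttl


print(failover_window(30, 3))  # standard interval, threshold 3
print(failover_window(10, 1))  # fast interval, threshold 1
```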

Creating Failover Records with the AWS CLI

To do this with the AWS CLI, you first need to write the changes out in JSON; the official CLI examples are a useful reference.

failover.json

{
  "Comment": "optional comment about the changes in this change batch request",
  "Changes": [
    {
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "example.com",
        "Type": "A",
        "SetIdentifier": "RANDOMTEXT1",
        "Failover": "PRIMARY",
        "AliasTarget": {
          "HostedZoneId": "hosted zone ID for your Elastic Load Balancing load balancer",
          "DNSName": "DNS domain name for your Elastic Load Balancing load balancer",
          "EvaluateTargetHealth": true
        }
      }
    },
    {
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "example.com",
        "Type": "A",
        "SetIdentifier": "RANDOMTEXT2",
        "Failover": "SECONDARY",
        "AliasTarget": {
          "HostedZoneId": "hosted zone ID for your Amazon S3 bucket",
          "DNSName": "DNS domain name for your Amazon S3 bucket",
          "EvaluateTargetHealth": true
        }
      }
    }
  ]
}

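When you’ve prepared the JSON file, you can run the command below. Note the file:// prefix, which tells the CLI to read the change batch from a file rather than treating the argument as inline JSON:

aws route53 change-resource-record-sets --hosted-zone-id ######### --change-batch file://failover.json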
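If you would rather generate the change batch than edit it by hand, a small script can build the same structure. This is a sketch under stated assumptions: the hosted zone IDs and DNS names passed in below are placeholders to replace with your own values, and the set identifiers are arbitrary labels.

```python
import json


def alias_change(name, set_id, role, target_zone_id, target_dns_name):
    """Build one CREATE change for an alias failover record."""
    return {
        "Action": "CREATE",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "A",
            "SetIdentifier": set_id,
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "AliasTarget": {
                "HostedZoneId": target_zone_id,
                "DNSName": target_dns_name,
                "EvaluateTargetHealth": True,
            },
        },
    }


def failover_change_batch(name, primary, secondary):
    """primary and secondary are (hosted_zone_id, dns_name) tuples."""
    return {
        "Comment": "failover between ELB and S3 for " + name,
        "Changes": [
            alias_change(name, "primary", "PRIMARY", *primary),
            alias_change(name, "secondary", "SECONDARY", *secondary),
        ],
    }


# Placeholder values; substitute the zone IDs and DNS names of your own resources.
batch = failover_change_batch(
    "example.com",
    ("ELB-HOSTED-ZONE-ID", "your-elb-dns-name"),
    ("S3-HOSTED-ZONE-ID", "your-s3-website-endpoint"),
)
print(json.dumps(batch, indent=2))
```

Redirecting the output into failover.json gives a file with the same shape as the hand-written one above.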