AWS does not support broadcast or multicast, so implementing an FHRP solution the way we're used to doing on-premises won't work. Fortunately, keepalived supports unicast peers, so running it on AWS is no problem!
The use case for this post continues where we left off in the previous one, with the difference that we will be using two VPN instances per VPC/region, running Ubuntu instead of CentOS. The full configuration can be found on my GitHub. Here's a quick illustration:
We have Libreswan running on the VPN instances, with one tunnel between
vpn01a.euw1 <> vpn01a.euc1 and another between vpn02a.euw1 <> vpn02a.euc1.
Each subnet on AWS has a route table attached to it; you could think of it as being similar to a VRF when it comes to external routes. Internal to the VPC, though, all subnets can reach each other through the VPC's implicit local route, so unless you apply Security Groups and NACLs, everything can reach everything inside the same VPC.
Here's how the route tables on my vpn01a instances look right now:
Both have a default route via the first usable IP of the subnet, which on AWS is always the VPC's implicit router. They both know how to reach the remote VPC via the IPsec tunnel (vti0), and they can indeed reach the other side. However, the AWS route tables still don't have the proper routes: they don't know how to reach the remote side, meaning that all the other instances that rely on the AWS route tables (all of them do!) won't be able to reach the remote VPC. Here's how my private-subnet route table looks on EUW1:
As we can see, EUW1 does not know how to reach EUC1 (10.240.0.0/24)! An easy fix would be adding a manual route to 10.240.0.0/24 with the next hop set to the vpn01a instance, then doing the same on the EUC1 side with the inverse route. Done! It works, but what if vpn01a fails? Also, maintaining manual routes becomes a nightmare as the network grows. AWS does not (yet) support any kind of dynamic routing protocol, so to work around that we will be using keepalived plus a crafted Python script for dynamic route injection, and we're going to use Ansible to automate the keepalived deployment.
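For reference, here's roughly what that manual fix looks like with boto3 (the route table and instance IDs below are placeholders, not my real lab values):

import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Point the remote (EUC1) prefix at the vpn01a instance
ec2.create_route(
    RouteTableId="rtb-xxxxxxxx",           # EUW1 private-subnet route table
    DestinationCidrBlock="10.240.0.0/24",  # EUC1 prefix
    InstanceId="i-xxxxxxxx",               # vpn01a.euw1
)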
Keepalived
VRRP is an old friend of us network engineers; we've been running it on our routers forever, so why not use this powerful FHRP solution on the public cloud as well? To deploy the keepalived configuration onto our VPN instances, we will be using Ansible. I assume keepalived is already installed; if not, please install it. Here's what the playbook and the Jinja2 template look like:
---
- hosts: vpn0*.euc1.*
  #gather_facts: no
  vars:
    left_side: 'vpn01.euc1.netoops.net'
    right_side: 'vpn02.euc1.netoops.net'
    host1: "{{ hostvars['vpn01.euc1.netoops.net']['ansible_ens3']['ipv4']['address'] }}"
    host2: "{{ hostvars['vpn02.euc1.netoops.net']['ansible_ens3']['ipv4']['address'] }}"
  tasks:
    - name: write the keepalived config file
      template: src=keepalived.j2 dest=/etc/keepalived/keepalived.conf
      become: true
      notify:
        - restart keepalived
    - name: ensure keepalived is running
      service: name=keepalived state=started
      become: true
  handlers:
    - name: restart keepalived
      service: name=keepalived state=restarted
      become: true
{% if ( left_side in inventory_hostname) %}
vrrp_instance VPN {
    interface ens3
    state MASTER
    priority 200
    virtual_router_id 33
    unicast_src_ip {{ host1 }}
    unicast_peer {
        {{ host2 }}
    }
    notify_master "/usr/local/bin/master.sh"
}
{% endif %}
{% if ( right_side in inventory_hostname) %}
vrrp_instance VPN {
    interface ens3
    state BACKUP
    priority 100
    virtual_router_id 33
    unicast_src_ip {{ host2 }}
    unicast_peer {
        {{ host1 }}
    }
    notify_master "/usr/local/bin/master.sh"
}
{% endif %}
The trick to making VRRP work on AWS is the following lines of the keepalived configuration:
unicast_src_ip {{ host1 }}
unicast_peer {
    {{ host2 }}
}
We change its behaviour from multicast to unicast. With Ansible we are able to fetch facts from any instance in the inventory, including its IP address, which makes the automation more dynamic: we don't have to worry about which IP an instance is using. These are the lines of our playbook that take care of the instances' IP addresses:
host1: "{{ hostvars['vpn01.euc1.netoops.net']['ansible_ens3']['ipv4']['address'] }}" host2: "{{ hostvars['vpn02.euc1.netoops.net']['ansible_ens3']['ipv4']['address'] }}"
Note: Don’t forget to quote 🙂
Keepalived also has a neat feature called notify, which lets us run a shell script whenever the node changes state. In our example, whenever the node enters the MASTER state, we run a shell script called master.sh. This script in turn is responsible for calling the aws_route_inject.py script, which takes care of injecting the routes into the AWS route tables.
#!/bin/bash
/usr/local/bin/aws_route_inject.py
echo "Route Injection Done" > /var/tmp/test.log
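I won't paste the whole aws_route_inject.py here (the real one is in the GitHub repo), but a minimal sketch of the idea looks like this: discover our own instance ID from the instance metadata, then point the relevant routes at ourselves. The route table ID and prefix list below are placeholders for my lab values:

#!/usr/bin/env python3
# Minimal sketch of aws_route_inject.py (placeholder IDs; see the repo for the real one)
import urllib.request

import boto3
from botocore.exceptions import ClientError

METADATA = "http://169.254.169.254/latest/meta-data/"
ROUTE_TABLE_ID = "rtb-xxxxxxxx"      # private-subnet route table of this VPC
REMOTE_PREFIXES = ["10.240.0.0/24"]  # prefixes reachable via the VPN

def metadata(path):
    # IMDSv1 for brevity
    return urllib.request.urlopen(METADATA + path, timeout=2).read().decode()

instance_id = metadata("instance-id")
region = metadata("placement/availability-zone")[:-1]  # strip the AZ letter

ec2 = boto3.client("ec2", region_name=region)

for prefix in REMOTE_PREFIXES:
    try:
        # Repoint the route at ourselves...
        ec2.replace_route(RouteTableId=ROUTE_TABLE_ID,
                          DestinationCidrBlock=prefix,
                          InstanceId=instance_id)
    except ClientError:
        # ...or create it if it doesn't exist yet
        ec2.create_route(RouteTableId=ROUTE_TABLE_ID,
                         DestinationCidrBlock=prefix,
                         InstanceId=instance_id)

Keep in mind the instance needs an IAM role that allows ec2:CreateRoute and ec2:ReplaceRoute, and must have source/destination checking disabled so it can forward traffic in the first place.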
Note: I installed the scripts onto the instances at boot time using user-data in Terraform.
That's it! With that, whenever an instance becomes master, it will inject the VPN routes into the AWS route table with itself as the next hop, and traffic from other instances in the VPC should then be able to reach the remote VPN locations via the VPN instances. Let's try it.
First we define in our keepalived-playbook.yml which instances we want to deploy the config to. In our example, we want the VRRP cluster between vpn01.euw1 and vpn02.euw1; we then save and run the playbook:
Done! We have a VRRP cluster up and running on AWS. Now, if our setup works, when vpn01a became master, notify_master should have called the Python script, and the routes should be available in the AWS route table with vpn01a's instance ID as the next hop. Let's check:
Hooray! Now, let's force keepalived to fail on vpn01a by stopping its service and see what happens while we tail vpn02a's logs:
Sweet! Routes converged to vpn02a as expected! But that's not enough to fail over the traffic on both sides; so far we were only able to fail over the tunnel on one side. To make keepalived trigger the failover/route change in both VPCs, we need to make use of a feature called vrrp_script. It's a small change to our keepalived template; here's how it looks:
vrrp_script vpn_check {
    script "/usr/local/bin/vpn_check.sh"
    interval 2   # check every 2 seconds
    fall 2       # require 2 failures for KO
    rise 2       # require 2 successes for OK
}

vrrp_instance VPN {
    ......
    track_script {
        vpn_check
    }
}
#!/bin/bash
VTI=$(ifconfig | grep vti)
if [ -z "$VTI" ]
then
    exit 1
else
    exit 0
fi
That's it. keepalived will run the vpn_check.sh script every 2 seconds, and the script has the simple task of checking whether the VPN is up, by checking whether a VTI interface exists. When the script returns 1, VRRP will enter the FAULT state and fail over; when it returns 0, the health check passes.
To test it, I've spun up one test instance in each VPC so we can ping from one to the other. Here's the result:
From EUW1 (10.250.0.0/24) I have an MTR to 10.240.0.84, our test instance in EUC1. Traffic is flowing symmetrically via vpn01a.euw1 (10.250.0.18) and vpn01a.euc1 (10.240.0.9). Now I will go ahead and terminate vpn01a.euw1 to simulate an instance failure (and also because I need to shut down the lab to save money).
Six packets lost, which is pretty much down to the health-check timers we configured (a 2-second interval with 2 falls before failover); with tighter timers it could be much quicker! Here's the proof that traffic shifted:
Hooray!
The goal here is to give people ideas for networking tricks that can be done on public clouds. There's much more we could (and should) do to improve this use case, like dynamic routing on the VTIs and better health-check scripts to improve failover/recovery and reduce errors. Also, this setup works only as Active/Standby, but it should be possible to improve it to work as Active/Active, etc.
As the network grows, it will start to get challenging to manage static point-to-point tunnels, so it might be time to start thinking about AWS Transit if the hub-and-spoke latency is not an issue, or an automated DMVPN solution as shown at the end of this presentation from re:Invent 2017 🙂
Hope this helps someone. Adios.