FHRP On AWS with Ansible, Keepalived and Python

AWS does not support broadcast or multicast, so implementing an FHRP solution the way we are used to doing it on-premises won't work. Fortunately, Keepalived supports unicast peers, so implementing it on AWS is no problem!

The use case for this post continues where we left off in the previous one, with the difference that we will be using two VPN instances per VPC/region, running Ubuntu instead of CentOS. The full configuration can be found on my GitHub. Here's a quick illustration:

[Screenshot: lab topology]

We have LibreSwan running on the VPN instances, with one tunnel between vpn01a.euw1<>vpn01a.euc1 and another between vpn02a.euw1<>vpn02a.euc1.

Each subnet on AWS has a route table attached to it; you could think of it as being similar to a VRF when we think about external routes. Internal to the VPC, however, all subnets can reach each other by default via the VPC local route, so unless you apply Security Groups and NACLs, everything can reach everything inside the same VPC.

Here's how the route tables on my vpn01a instances look right now:

Both have a default route via the first IP of the subnet, which on AWS is always the implicit subnet router. They both know how to reach the remote VPC via the IPsec tunnel (vti0), and they can indeed reach the other side. However, the AWS route tables still don't have the proper routes: they don't know how to reach the remote side, which means that all the other instances that rely on the AWS RT (and all of them do!) won't be able to reach the remote VPC. Here's how my private-subnet RT looks in EUW1:
[Screenshot: EUW1 private-subnet route table]

As we can see, EUW1 does not know how to reach EUC1 (10.240.0.0/24)! An easy fix would be adding a manual route to 10.240.0.0/24 with the vpn01a instance as the next hop, doing the same on the EUC1 side with the inverse route, and we're done! It works, but what if vpn01a fails? Also, maintaining manual routes becomes a nightmare as the network grows. AWS does not (yet) support any kind of dynamic routing protocol, so to work around that we will use Keepalived plus a small Python script for dynamic route injection, and we're going to use Ansible to automate the Keepalived deployment.
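
Just for reference, that manual route is a single EC2 API call. Here's a minimal boto3 sketch of it (the route table and instance IDs below are placeholders), and it is exactly this call that we are going to automate further down:

import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")
# Placeholder IDs: the EUW1 private-subnet route table and the vpn01a instance
ec2.create_route(RouteTableId="rtb-0123456789abcdef0",
                 DestinationCidrBlock="10.240.0.0/24",
                 InstanceId="i-0123456789abcdef0")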

Keepalived

VRRP is an old friend of us network engineers; we've been running it on our routers forever, so why not use this powerful FHRP solution on the public cloud as well? To deploy the Keepalived configuration onto our VPN instances, we will be using Ansible. I assume Keepalived is already installed; if not, please install it. Here's what the playbook and the Jinja2 template look like:

---
- hosts: vpn0*.euc1.*
  #gather_facts: no
  vars:
    left_side: 'vpn01.euc1.netoops.net'
    right_side: 'vpn02.euc1.netoops.net'
    host1: "{{ hostvars['vpn01.euc1.netoops.net']['ansible_ens3']['ipv4']['address'] }}"
    host2: "{{ hostvars['vpn02.euc1.netoops.net']['ansible_ens3']['ipv4']['address'] }}"

  tasks:
  - name: write the keepalived config file
    template: src=keepalived.j2 dest=/etc/keepalived/keepalived.conf
    become: true
    notify:
      - restart keepalived

  - name: ensure keepalived is running
    service: name=keepalived state=started
    become: true

  handlers:
    - name: restart keepalived
      service: name=keepalived state=restarted
      become: true

And the keepalived.j2 template:

{% if ( left_side in inventory_hostname) %}
vrrp_instance VPN {
    interface ens3
    state MASTER
    priority 200

    virtual_router_id 33
    unicast_src_ip {{ host1 }}
    unicast_peer {
        {{ host2 }}
    }

    notify_master "/usr/local/bin/master.sh"

}
{% endif %}

{% if ( right_side in inventory_hostname) %}
vrrp_instance VPN {
    interface ens3
    state BACKUP
    priority 100

    virtual_router_id 33
    unicast_src_ip {{ host2 }}
    unicast_peer {
        {{ host1 }}
    }

    notify_master "/usr/local/bin/master.sh"
}
{% endif %}

The trick to making VRRP work on AWS is the following lines of the Keepalived configuration:

    unicast_src_ip {{ host1 }}
    unicast_peer {
        {{ host2 }}
    }

We change its behaviour from multicast to unicast, and with Ansible we can pull facts from any inventory instance, including its IP address. That makes the automation more dynamic, since we don't have to worry about which IP each instance is using. These are the lines of our playbook that take care of the instance IP addresses:

    host1: "{{ hostvars['vpn01.euc1.netoops.net']['ansible_ens3']['ipv4']['address'] }}"
    host2: "{{ hostvars['vpn02.euc1.netoops.net']['ansible_ens3']['ipv4']['address'] }}"

Note: Don't forget to quote the hostvars expressions 🙂
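
If you want to double-check which facts Ansible actually sees for the interface before templating, an ad-hoc call such as ansible vpn01.euc1.netoops.net -i hosts -m setup -a 'filter=ansible_ens3' (adjust the hostname and inventory path to your own setup) dumps the ens3 facts, including the IPv4 address we reference above.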

Keepalived also has a neat feature called notify, which lets us run a shell script when the node changes state. In our example, whenever the node enters the MASTER state, we run a shell script called master.sh. This script in turn calls the aws_route_inject.py script, which takes care of injecting the routes into the AWS route tables.

#!/bin/bash
/usr/local/bin/aws_route_inject.py
echo "Route Injection Done" | cat >/var/tmp/test.log

Note: I installed the scripts onto the instances at boot time using user-data in Terraform.
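
The route-injection script itself lives in the repo; a minimal sketch of the idea looks like the following. This is an illustration, not the exact script: it assumes boto3 is installed, the instance profile allows ec2:ReplaceRoute and ec2:CreateRoute, the instance metadata service is reachable, and the route table IDs/CIDRs in ROUTES are placeholders you would replace with your own.

#!/usr/bin/env python3
# Minimal sketch of aws_route_inject.py (illustrative, not the repo version).
import urllib.request

import boto3
from botocore.exceptions import ClientError

# (route table ID, destination CIDR) pairs this VPN instance should own.
# Placeholder values: list every route table that needs the remote routes.
ROUTES = [
    ("rtb-0123456789abcdef0", "10.240.0.0/24"),
]

META = "http://169.254.169.254/latest/meta-data/"


def metadata(path):
    """Read a value from the EC2 instance metadata service."""
    with urllib.request.urlopen(META + path, timeout=2) as resp:
        return resp.read().decode()


def main():
    instance_id = metadata("instance-id")
    # e.g. eu-west-1a -> eu-west-1
    region = metadata("placement/availability-zone")[:-1]
    ec2 = boto3.client("ec2", region_name=region)

    for rtb_id, cidr in ROUTES:
        try:
            # Repoint the existing route at this instance...
            ec2.replace_route(RouteTableId=rtb_id,
                              DestinationCidrBlock=cidr,
                              InstanceId=instance_id)
        except ClientError:
            # ...or create it if it does not exist yet.
            ec2.create_route(RouteTableId=rtb_id,
                             DestinationCidrBlock=cidr,
                             InstanceId=instance_id)


if __name__ == "__main__":
    main()

On the standby node, this same script running from notify_master is what flips the routes over during a failover.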

That's it! With that, whenever an instance becomes master, it will inject the VPN routes into the AWS route table with itself as the next hop, and traffic from other instances in the VPC should then be able to reach the remote VPN locations via the VPN instances. Let's try it.

First, we define in our keepalived-playbook.yml which instances we want to deploy the config to. In our example, we want the VRRP cluster between vpn01.euw1 and vpn02.euw1; we then save and run the playbook:

[Screenshots: playbook run output]

Done! We have a VRRP cluster up and running on AWS. Now, if our setup works, when vpn01a became master, notify_master should have called the Python script, and the routes should be visible in the AWS RT with the vpn01a instance ID as the next hop. Let's check:
[Screenshot: AWS route table with the vpn01a instance as next hop]

Hooray! Now, let's force Keepalived to fail on vpn01a by stopping its service, and see what happens while we tail the vpn02a logs:
[Screenshots: vpn02a logs during the failover]

Sweet! Routes converged to vpn02a as expected! But that is not enough to fail traffic over on both sides: so far we were only able to fail the tunnel over on one side. To make Keepalived trigger the failover/route change on both VPCs, we need to use the vrrp_script feature. It's a small change to our Keepalived template; here's how it looks:

vrrp_script vpn_check {
  script       "/usr/local/bin/vpn_check.sh"
  interval 2   # check every 2 seconds
  fall 2       # require 2 failures for KO
  rise 2       # require 2 successes for OK
}
vrrp_instance VPN {

......

    track_script {
        vpn_check
    }
}

And the vpn_check.sh script itself:

#!/bin/bash
VTI=$(ifconfig | grep vti)
if [ -z "$VTI" ]
then
        exit 1
else
        exit 0
fi

That's it. Keepalived will run the vpn_check.sh script every 2 seconds, and the script has the simple task of checking whether the VPN is up by checking that a VTI interface exists. When the return code is 1, VRRP enters the FAULT state and fails over; when the return code is 0, the health check passes.

To test it, I've spun up one test instance in each VPC so we can ping from one to the other. Here's the result:

[Screenshot: MTR from the EUW1 test instance to 10.240.0.84]

From EUW1 (10.250.0.0/24) I have an MTR running to 10.240.0.84, our test instance in EUC1. Traffic is flowing symmetrically via vpn01a.euw1 (10.250.0.18) and vpn01a.euc1 (10.240.0.9). Now I will go ahead and terminate vpn01a.euw1 to simulate an instance failure (and also because I need to shut down the lab to save money).

[Screenshot: MTR output during the failover]

Six packets lost, which is pretty much down to the health check timers we chose; it could be much quicker! Here's the proof that traffic shifted:

[Screenshot: MTR showing traffic now flowing via vpn02a]

Hooray!

The goal here is to share ideas for networking tricks that can be done on public clouds. There is much more we could (and should) do to improve this use case, like dynamic routing over the VTIs and better health check scripts to improve failover/recovery and reduce errors; also, this setup only works as active/standby, but it should be possible to improve it and make it work as active/active, and so on.
As the network grows it will get challenging to manage static point-to-point tunnels, so it might be time to start thinking about AWS Transit if the hub-and-spoke latency is not an issue, or an automated DMVPN solution as shown at the end of this presentation from re:Invent 2017 🙂

Hope this helps someone. Adios.

LibreSwan IPSEC Tunnel with Ansible

In the previous post I wrote about deploying an AWS network stack with Terraform, and how to use Terraform to deploy a Linux instance with LibreSwan installed.

I've been wanting to learn Ansible for a while now, so in this post we are going to use it to make it easy to deploy the VPN configuration onto new VPN instances. I won't get into the details of how Ansible works because it has great documentation; it's easy to understand the basics and start playing with it!

In this setup we will be using the two VPCs in eu-central-1 and eu-west-1 that we built in the previous posts, with the difference that a second VPN instance was added per VPC, so we have two VPN instances for redundancy; the Security Groups were also changed because I noticed I had left them allowing everything (oops!).

The full thing can be found here on my Git Repo.

Playbook Tree

This is the file structure used by our vpn-instances playbook:

└── vpn-instances
    ├── hosts
    ├── libreswan.j2
    ├── libreswan_secrets.j2
    ├── vpn-playbook.yml
    └── vpn_vars
        ├── vault.yml
        ├── vpn01-euc1-euw1.yml
        └── vpn02-euc1-euw1.yml

hosts is our Ansible inventory file; it lists all of our VPN instances:

[euc1]
vpn01.euc1.netoops.net ansible_host=18.196.67.47
vpn02.euc1.netoops.net ansible_host=52.58.236.243

[euw1]
vpn01.euw1.netoops.net ansible_host=34.243.28.111
vpn02.euw1.netoops.net ansible_host=52.211.207.206

libreswan.j2 and libreswan_secrets.j2 are our Jinja2 template files; they hold the LibreSwan templates, into which the playbook will render our defined variables before deploying the result to our VPN instances.

{% if ( left_side in inventory_hostname) %}
conn {{ conn_name }}
  left=%defaultroute
  leftid={{ left_id }}
  right={{ right_peer }}
  authby=secret
  leftsubnet=0.0.0.0/0
  rightsubnet=0.0.0.0/0
  auto=start
  # route-based VPN requires marking and an interface
  mark=5/0xffffffff
  vti-interface={{ vti_left }}
  # do not setup routing because we don't want to send 0.0.0.0/0 over the tunnel
  vti-routing=no

{% endif %}

{% if ( right_side in inventory_hostname) %}
conn {{ conn_name }}
  right=%defaultroute
  rightid={{ right_id }}
  left={{left_peer}}
  authby=secret
  leftsubnet=0.0.0.0/0
  rightsubnet=0.0.0.0/0
  auto=start
  # route-based VPN requires marking and an interface
  mark=5/0xffffffff
  vti-interface={{ vti_right }}
  # do not setup routing because we don't want to send 0.0.0.0/0 over the tunnel
  vti-routing=no
{% endif %}

And libreswan_secrets.j2 is a one-liner:

{{ left_peer }} {{ right_peer }} : PSK "{{ vpn_psk }}"

vpn-playbook.yml is the Ansible playbook. Here's where we tell Ansible what to do and how to do it. The playbook is written in YAML, where you define tasks and the hosts the tasks need to be applied to.

---
- hosts: vpn0x.xxx1.*,vpn0y.other-yyy1.*
  gather_facts: no
  vars_files:
    - ./vpn_vars/vault.yml
    - ./vpn_vars/vpn0x-xxx1-yyy1.yml

  tasks:
  - name: write the vpn config file
    template: src=libreswan.j2 dest=/etc/ipsec.d/{{ conn_name }}.conf
    become: true
    register: tunnel

  - name: write the vpn secrets file
    template: src=libreswan_secrets.j2 dest=/etc/ipsec.d/{{ conn_name }}.secrets
    become: true


  - name: Activate the tunnel
    shell: "{{ item }}"
    with_items:
      - ipsec auto --rereadsecrets
      - ipsec auto --add {{ conn_name }}
      - ipsec auto --up {{ conn_name }}
    become: true
    when: tunnel.changed


  - name: Install Routes left
    shell: ip route add {{ right_subnet }} dev {{ vti_left }}
    when: (left_side in inventory_hostname) and tunnel.changed
    become: true

  - name: Install Routes Right
    shell: ip route add {{ left_subnet }} dev {{ vti_right }}
    when: (right_side in inventory_hostname) and tunnel.changed
    become: true

  - name: ensure ipsec is running
    service: name=ipsec state=started
    become: true

vault.yml is an Ansible Vault; this is where we store our PSKs, since we don't want them sitting in clear text in our configuration files, and especially not in clear text in a Git repo. Do remember to also add vault.yml to your .gitignore 🙂
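
If you haven't created the vault yet, running ansible-vault create vpn_vars/vault.yml will open an editor and encrypt the file on save; inside it you define one variable per tunnel, for example a line like vault_vpn01_psk: "some-long-random-string" (the value here being just a placeholder).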

Finally, we have the vpn01-euc1-euw1.yml and vpn02-euc1-euw1.yml files. These hold our variables, with all the parameters we need to set up each one of our IPsec VPN pairs. Here is an example of one of them:

---
#Name for the VPN Connection
conn_name: euc1-euw1
#EUW1, Left Side
left_id: 34.243.28.111
left_peer: 34.243.28.111
left_side: vpn01.euw1.netoops.net
vti_left: vti0
left_subnet: 10.250.0.0/24

#EUC1, right Side
right_peer: 18.196.67.47
right_id: 18.196.67.47
right_side: vpn01.euc1.netoops.net
vti_right: vti0
right_subnet: 10.240.0.0/24

#PSK to be used. Note: the actual PSK is stored in vault.yml
vpn_psk: "{{ vault_vpn01_psk }}"%

For each new point-to-point IPsec tunnel that we want to set up, we create a new file with a suggestive name and set the variables with the usual VPN parameters: local peer IP and ID, remote peer IP and ID, the subnet on each side, the VTI tunnel interface (one VTI per tunnel) and, last but not least, our pre-shared key. If you pay attention to the vpn_psk variable, you will see that it points to another variable with a vault_ prefix; this variable is stored in our Ansible Vault, and for each new tunnel we should define a new PSK in the Vault. Try not to use the same PSK everywhere!

With the variables done, all we need to do now is go back to our playbook, point vars_files at the files we defined above, and filter the hosts to apply the configuration to:

- hosts: vpn01.euc1.*,vpn01.euw1.*
  gather_facts: no
  vars_files:
    - ./vpn_vars/vault.yml
    - ./vpn_vars/vpn01-euc1-euw1.yml

That's it. The playbook will run against the inventory file and match the host entries vpn01.euc1* and vpn01.euw1* (wildcards are allowed; take a look at Ansible patterns). We source our encrypted variables from vault.yml and our VPN variables from vpn01-euc1-euw1.yml. Now we only need to execute the playbook, and the two VPN instances in EUC1 and EUW1 should have their IPsec tunnels established and the proper routes set to talk to each other.
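
Executing it looks something like ansible-playbook -i hosts vpn-playbook.yml --ask-vault-pass (or --vault-password-file if you keep the vault password in a file outside the repo); adjust the paths to your own layout.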

AWS cross-region talk in no time 🙂 So far we have the two VPN instances establishing the tunnel and setting up the static routes; the next step is to make the AWS route tables themselves route traffic from the AWS subnets to the proper VPN instances. Let's do that in a different post.

I hope this post helps show that a network engineer doesn't need to be a programmer or a systems wizard to start writing network automation tools that help with our day-to-day tasks; we just need to step a little bit outside our comfort zone and start learning new and interesting skills. It's fun, try it!