FHRP On AWS with Ansible, Keepalived and Python

AWS does not support broadcast or multicast, so implementing an FHRP solution the way we are used to doing it on-premises won't work. Fortunately, Keepalived supports unicast peers, so running it on AWS is no problem!

The use case for this post continues where we left off in the previous one, with the difference that we will be using two VPN instances per VPC/region, running Ubuntu instead of CentOS. The full configuration can be found on my GitHub. Here's a quick illustration:

[Diagram: the two VPCs (eu-west-1 and eu-central-1), each with two VPN instances and IPSEC tunnels between the regions]

We have LibreSwan running on the VPN instances, with one tunnel between vpn01a.euw1<>vpn01a.euc1 and another between vpn02a.euw1<>vpn02a.euc1.

Each subnet on AWS has a route table attached to it; you could think of it as something similar to a VRF when it comes to external routes. Inside the VPC, however, all subnets can reach each other by default, so unless you apply Security Groups and NACLs, everything can reach everything within the same VPC.

Here’s how my vpn01a Route Table instances looks like right now:

Both have a default route via the first IP of their subnet, which on AWS is always the implicit VPC router, the one that consults the subnet's AWS route table. They both know how to reach the remote VPC via the IPSEC tunnel (vti0), and they can indeed reach the other side. However, the AWS route tables still don't have the proper routes: they don't know how to reach the remote side, meaning that all the other instances that rely on the AWS route tables (and all of them do!) won't be able to reach the remote VPC. Here's how my private-subnet route table looks in EUW1:
[Screenshot: EUW1 private-subnet route table, with no route to 10.240.0.0/24]
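
If you prefer the API over clicking through the console, the same information can be pulled with the AWS CLI; a quick sketch, where the route table ID is a placeholder for your own:

aws ec2 describe-route-tables \
  --route-table-ids rtb-0123456789abcdef0 \
  --query 'RouteTables[0].Routes[].[DestinationCidrBlock,GatewayId,InstanceId,State]' \
  --output table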

As we can see, EUW1 does not know how to reach EUC1 (10.240.0.0/24)! An easy fix would be to add a manual route to 10.240.0.0/24 with the next hop set to the vpn01a instance, do the same on the EUC1 side with the inverse route, and be done with it. It works, but what if vpn01a fails? Also, maintaining manual routes becomes a nightmare as the network grows. AWS does not (yet) support any kind of dynamic routing protocol towards its route tables, so to work around that we will use Keepalived plus a small Python script for dynamic route injection, and we will use Ansible to automate the Keepalived deployment.

Keepalived

VRRP is an old friend of us network engineers; we've been running it on our routers forever, so why not use this powerful FHRP solution on the public cloud as well? To deploy the Keepalived configuration onto our VPN instances, we will use Ansible. I assume Keepalived is already installed; if not, please install it first. Here's what the playbook and the Jinja2 template look like:

---
- hosts: vpn0*.euc1.*
  #gather_facts: no
  vars:
    left_side: 'vpn01.euc1.netoops.net'
    right_side: 'vpn02.euc1.netoops.net'
    host1: "{{ hostvars['vpn01.euc1.netoops.net']['ansible_ens3']['ipv4']['address'] }}"
    host2: "{{ hostvars['vpn02.euc1.netoops.net']['ansible_ens3']['ipv4']['address'] }}"

  tasks:
  - name: write the keepalived config file
    template: src=keepalived.j2 dest=/etc/keepalived/keepalived.conf
    become: true
    notify:
      - restart keepalived

  - name: ensure keepalived is running
    service: name=keepalived state=started
    become: true

  handlers:
    - name: restart keepalived
      service: name=keepalived state=restarted
      become: true

And here is the keepalived.j2 template:

{% if ( left_side in inventory_hostname) %}
vrrp_instance VPN {
    interface ens3
    state MASTER
    priority 200

    virtual_router_id 33
    unicast_src_ip {{ host1 }}
    unicast_peer {
        {{ host2 }}
    }

    notify_master "/usr/local/bin/master.sh"

}
{% endif %}

{% if ( right_side in inventory_hostname) %}
vrrp_instance VPN {
    interface ens3
    state BACKUP
    priority 100

    virtual_router_id 33
    unicast_src_ip {{ host2 }}
    unicast_peer {
        {{ host1 }}
    }

    notify_master "/usr/local/bin/master.sh"
}
{% endif %}

The trick to making VRRP work on AWS is the following lines of the Keepalived configuration:

    unicast_src_ip {{ host1 }}
    unicast_peer {
        {{ host2 }}
    }

We change its behaviour from multicast to unicast. With Ansible we can pull facts from any instance in the inventory, including its IP address, which makes the automation more dynamic: we don't have to worry about which IP each instance is using. These are the lines of our playbook that take care of the instance IP addresses:

    host1: "{{ hostvars['vpn01.euc1.netoops.net']['ansible_ens3']['ipv4']['address'] }}"
    host2: "{{ hostvars['vpn02.euc1.netoops.net']['ansible_ens3']['ipv4']['address'] }}"

Note: Don’t forget to quote 🙂

Keepalived also has a neat feature called notify, which lets us run a shell script whenever the node changes state. In our example, whenever the node enters the MASTER state we run a shell script called master.sh. This script in turn is responsible for calling the aws_route_inject.py script, which takes care of injecting the routes into the AWS route tables.

#!/bin/bash
/usr/local/bin/aws_route_inject.py
echo "Route Injection Done" | cat >/var/tmp/test.log

Note: I installed the scripts onto the instances at boot time using user-data in Terraform.
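
The aws_route_inject.py script itself isn't listed in this post (it lives in the repo), but the idea is simple: find out which instance I am, then point the remote CIDR at me in the relevant route table. Here is a rough sketch of that logic using the AWS CLI instead of Python/boto3, just to illustrate it; the route table ID and CIDR below are placeholders for your own values.

#!/bin/bash
# Sketch of the route-injection logic (the real script is Python/boto3)
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
RTB_ID="rtb-0123456789abcdef0"   # private-subnet route table of the local VPC (placeholder)
REMOTE_CIDR="10.240.0.0/24"      # remote VPC CIDR reached via the IPSEC tunnel

# Point the remote CIDR at this instance; create the route if it does not exist yet
aws ec2 replace-route --route-table-id "$RTB_ID" \
    --destination-cidr-block "$REMOTE_CIDR" --instance-id "$INSTANCE_ID" \
  || aws ec2 create-route --route-table-id "$RTB_ID" \
    --destination-cidr-block "$REMOTE_CIDR" --instance-id "$INSTANCE_ID"

For this to work the instance profile needs ec2:ReplaceRoute/ec2:CreateRoute permissions, and the source/destination check must be disabled on the VPN instances so they are allowed to forward traffic.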

That’s it! With that, whenever a Instance become Master, it will Inject the VPN Routes into the AWS Route Table with the Next-Hop as itself, traffic from other Instances on the VPC should then be able to access the remote VPN locations via the vpn instances. Lets try it.

First we define in our keepalived-playbook.yml which instances we want to deploy the config to. In our example we want the VRRP cluster between vpn01.euw1 and vpn02.euw1, so we save and run the playbook:
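
Running it is a regular playbook run; assuming the inventory file is called hosts, as in the repo layout, it looks something like this:

ansible-playbook -i hosts keepalived-playbook.yml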

[Screenshots: keepalived-playbook.yml and the ansible-playbook run output]

Done! We have a VRRP cluster up and running on AWS. Now, if our setup works, when vpn01a became master, notify_master should have called the Python script, and the routes should be present in the AWS route table with the vpn01a instance ID as the next hop. Let's check:
[Screenshot: AWS route table with 10.240.0.0/24 pointing at the vpn01a instance]

Hooray! Now let's force Keepalived to fail on vpn01a by stopping its service, and see what happens while we tail the logs on vpn02a:
[Screenshots: vpn02a Keepalived log entering MASTER state and the updated AWS route table]

Sweet! Routes converged to vpn02a as expected! But that's not enough to fail over the traffic on both sides: so far we have only failed over the tunnel on one side. To make Keepalived trigger the failover/route change in both VPCs we need to make use of a feature called vrrp_script. It's a small change to our Keepalived template; here's how it looks:

vrrp_script vpn_check {
  script       "/usr/local/bin/vpn_check.sh"
  interval 2   # check every 2 seconds
  fall 2       # require 2 failures for KO
  rise 2       # require 2 successes for OK
}
vrrp_instance VPN {

......

    track_script {
        vpn_check
    }
}

And the vpn_check.sh health-check script itself:

#!/bin/bash
VTI=$(ifconfig | grep vti)
if [ -z "$VTI" ]
then
        exit 1
else
        exit 0
fi

That’s it, keepalived will run the vpn_check.sh script every 2 seconds, and the script has the simple task of checking if the VPN is up, by checking if a VTI interface exists. If when return is 1 VRRP will enter in FAULT state and failover, when resturn is 0 health check will pass.

To test it, I’ve spin up 1 test instance in each VPC so we can ping from one to the other and here’s the Result:

[Screenshot: MTR from the EUW1 test instance to 10.240.0.84 via vpn01a]

From EUW1 (10.250.0.0/24) I have an MTR running to 10.240.0.84, our test instance in EUC1. Traffic is flowing symmetrically via vpn01a.euw1 (10.250.0.18) and vpn01a.euc1 (10.240.0.9). Now I will go ahead and terminate vpn01a.euw1 to simulate an instance failure (and also because I need to shut down the lab to save money).

[Screenshot: MTR output during the failover]

Six lost packets, which is pretty much down to the health-check timers we configured; with more aggressive timers it could be much quicker! Here's the proof that traffic shifted:

[Screenshot: MTR now flowing via vpn02a]

Hooray!

The goal here is to give people ideas for networking tricks that can be done on public clouds. There's much more we could (and should) do to improve this use case: dynamic routing over the VTIs, better health-check scripts to improve failover/recovery and reduce errors, and so on. Also, this setup works only as active/standby, but it should be possible to improve it to work as active/active.
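
As a small example of the health-check idea, a slightly stronger vpn_check could ping the far end of the tunnel instead of only grepping for the interface. A minimal sketch, with the remote tunnel address as a placeholder:

#!/bin/bash
# Ping the inner address of the remote VPN instance across the tunnel (placeholder IP)
REMOTE_TUNNEL_IP="10.240.0.9"
if ping -c 1 -W 1 "$REMOTE_TUNNEL_IP" > /dev/null 2>&1
then
        exit 0
else
        exit 1
fi
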
As the network grows it will start to get challenging to manage static point-to-point tunnels, so it might be time to start thinking about an AWS Transit solution if the hub-and-spoke latency is not an issue, or an automated DMVPN solution as shown at the end of this presentation from re:Invent 2017 🙂

Hope this helps someone. Adios.


LibreSwan IPSEC Tunnel with Ansible

In the previous post I wrote about deploying an AWS network stack with Terraform, and how to use Terraform to deploy a Linux instance with LibreSwan installed.

I’ve been wanting to learn Ansible for a while now, so on this post we are going to use it to make it easy to deploy VPN Configuration into new VPN Instances. I won’t get into details of how Ansible works because they have a great Documentation, its easy to understand the basics and start playing with it!

In this setup we will be using the two VPCs in eu-central-1 and eu-west-1 that we built in the previous posts, with the difference that a second vpn-instance was added per VPC, so we have two VPN instances for redundancy. The Security Groups were also changed, because I noticed I had left them allowing everything (oops!).

The full thing can be found here on my Git Repo.

Playbook Tree

This is the File structure used by our vpn-instances Playbook:

└── vpn-instances
    ├── hosts
    ├── libreswan.j2
    ├── libreswan_secrets.j2
    ├── vpn-playbook.yml
    └── vpn_vars
        ├── vault.yml
        ├── vpn01-euc1-euw1.yml
        └── vpn02-euc1-euw1.yml

hosts is our Ansible inventory file; it lists all of our VPN instances:

[euc1]
vpn01.euc1.netoops.net ansible_host=18.196.67.47
vpn02.euc1.netoops.net ansible_host=52.58.236.243

[euw1]
vpn01.euw1.netoops.net ansible_host=34.243.28.111
vpn02.euw1.netoops.net ansible_host=52.211.207.206

libreswan.j2 and libreswan_secrets.j2 are our Jinja2 template files. They hold the LibreSwan configuration templates; the playbook will render our defined variables into them and deploy the result to our VPN instances.

{% if ( left_side in inventory_hostname) %}
conn {{ conn_name }}
  left=%defaultroute
  leftid={{ left_id }}
  right={{ right_peer }}
  authby=secret
  leftsubnet=0.0.0.0/0
  rightsubnet=0.0.0.0/0
  auto=start
  # route-based VPN requires marking and an interface
  mark=5/0xffffffff
  vti-interface={{ vti_left }}
  # do not setup routing because we don't want to send 0.0.0.0/0 over the tunnel
  vti-routing=no

{% endif %}

{% if ( right_side in inventory_hostname) %}
conn {{ conn_name }}
  right=%defaultroute
  rightid={{ right_id }}
  left={{left_peer}}
  authby=secret
  leftsubnet=0.0.0.0/0
  rightsubnet=0.0.0.0/0
  auto=start
  # route-based VPN requires marking and an interface
  mark=5/0xffffffff
  vti-interface={{ vti_right }}
  # do not setup routing because we don't want to send 0.0.0.0/0 over the tunnel
  vti-routing=no
{% endif %}

And libreswan_secrets.j2 is a one-line template holding the pre-shared key:

{{ left_peer }} {{ right_peer }} : PSK "{{ vpn_psk }}"

vpn-playbook.yml is the Ansible playbook. This is where we tell Ansible what to do and how to do it. The playbook is written in YAML, where you define tasks and the hosts the tasks need to be applied to.

---
- hosts: vpn0x.xxx1.*,vpn0y.other-yyy1.*
  gather_facts: no
  vars_files:
    - ./vpn_vars/vault.yml
    - ./vpn_vars/vpn0x-xxx1-yyy1.yml

  tasks:
  - name: write the vpn config file
    template: src=libreswan.j2 dest=/etc/ipsec.d/{{ conn_name }}.conf
    become: true
    register: tunnel

  - name: write the vpn secrets file
    template: src=libreswan_secrets.j2 dest=/etc/ipsec.d/{{ conn_name }}.secrets
    become: true


  - name: Activate the tunnel
    shell: "{{ item }}"
    with_items:
      - ipsec auto --rereadsecrets
      - ipsec auto --add {{ conn_name }}
      - ipsec auto --up {{ conn_name }}
    become: true
    when: tunnel.changed


  - name: Install Routes left
    shell: ip route add {{ right_subnet }} dev {{ vti_left }}
    when: (left_side in inventory_hostname) and tunnel.changed
    become: true

  - name: Install Routes Right
    shell: ip route add {{ left_subnet }} dev {{ vti_right }}
    when: (right_side in inventory_hostname) and tunnel.changed
    become: true

  - name: ensure ipsec is running
    service: name=ipsec state=started
    become: true
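
Once the playbook has run, the tunnel can be verified on either instance with the usual LibreSwan tooling, for example:

sudo ipsec status | grep euc1-euw1      # is the connection loaded and established?
sudo ipsec whack --trafficstatus        # active SAs and traffic counters
ip route show | grep vti                # the static routes pointing at the vti interface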

vault.yml is an Ansible Vault. This is where we store our PSKs, as we don't want them available in clear text in our configuration files, and especially not in clear text in a Git repo, so do remember to also add vault.yml to your .gitignore 🙂
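
Creating and editing the encrypted file is done with the ansible-vault command line, for example:

ansible-vault create vpn_vars/vault.yml    # create a new encrypted vault
ansible-vault edit vpn_vars/vault.yml      # edit it later (prompts for the vault password)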

Finally we have the vpn01-euc1-euw1.yml and vpn02-euc1-euw1.yml files. These hold our variables, with all the parameters we need to set up each of our IPSEC VPN pairs. Here is one of them as an example:

---
#Name for the VPN Connection
conn_name: euc1-euw1
#EUW1, Left Side
left_id: 34.243.28.111
left_peer: 34.243.28.111
left_side: vpn01.euw1.netoops.net
vti_left: vti0
left_subnet: 10.250.0.0/24

#EUC1, right Side
right_peer: 18.196.67.47
right_id: 18.196.67.47
right_side: vpn01.euc1.netoops.net
vti_right: vti0
right_subnet: 10.240.0.0/24

#PSK to be used. Note: the actual PSK is stored in the Ansible Vault
vpn_psk: "{{ vault_vpn01_psk }}"

For each new point-to-point IPSEC tunnel that we want to set up, we create a new file with a descriptive name and set the usual VPN parameters: local peer IP and ID, remote peer IP and ID, the subnet on each side, the VTI interface number (one VTI per tunnel), and last but not least, our pre-shared key. If you look closely at the vpn_psk variable, you will see that it points to another variable prefixed with vault_; this variable is stored in our Ansible Vault, and for each new tunnel we should define a new PSK in the vault. Try not to use the same PSK everywhere!

With the variables done, all we need to do now is go back to our playbook, point vars_files at the files we defined above, and filter the hosts to apply the configuration to:

- hosts: vpn01.euc1.*,vpn01.euw1.*
  gather_facts: no
  vars_files:
    - ./vpn_vars/vault.yml
    - ./vpn_vars/vpn01-euc1-euw1.yml

That’s it. The playbook will run against the inventory file, will match the host-entries vpn01.euc1* and vpn01.euw1* (wildcards allowed, take a look on Ansible Patterns), we will source our encrypted-variables from vault.yml and our vpn variables from vpn01-euc1-euw1.yml. We only need now to execute the Playbook and the 2 VPN Instances in EUC1 and EUW1 should have their IPSEC Tunnels established and proper routes set to talk to each-other.

AWS cross-region traffic in no time 🙂 So far we got the two VPN instances to establish the tunnel and set up the static routes; the next step is to make the AWS route tables themselves route traffic from the AWS subnets to the proper VPN instances. Let's do that in a different post.

I hope this post helps show that a network engineer doesn't need to be a programmer or a systems wizard to start writing network automation tools that help with our day-to-day tasks. We just need to step a little out of our comfort zone and start learning new and interesting skills. It's fun, try it!

AWS Network and Terraform – Part two

In part one we saw how to create the base AWS network stack using Terraform; in this post we are going to deploy a Linux instance that will be used to establish inter-region IPSEC tunnels using LibreSwan.

AWS Inter-Region Traffic

In November 2017 AWS announced support for inter-region VPC peering, which allows VPCs in different regions to communicate with each other over AWS's own backbone. However, not all regions currently support the inter-region peering feature, and the workaround is the good old IPSEC VPN tunnel.
The reason we use VPN instances instead of the native AWS VPN Gateway is that the VGW works in passive mode only, so it's not possible to initiate a tunnel between two VGWs.

VPN Instance Module

We start by defining a new module called vpn-instance. This module holds the template to create our Linux instance in the public subnet, and we use user-data to run a bootstrap script on the instance that applies a few settings and installs LibreSwan on first boot. LibreSwan is a nice and simple Linux IPSEC implementation that should do a good job of demonstrating our use case. Let's get started.

“../modules/vpn-instance”

We can then create our new project and make use of the previously created network-stack and vpn-instance modules.

“../projects/eu-central-1/main.tf”

Now we need to define our user-data script. I call it init_config.sh, and in it we define what we want to run on our instance at launch time.

“../projects/eu-central-1/init_config.sh”

Done. All we need to do now is run terraform init / terraform get / terraform apply, wait a few minutes, and our eu-central-1 VPC will be up and running with a Linux instance ready to be used as a VPN endpoint.
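
From inside the project folder the sequence is roughly:

cd projects/eu-central-1
terraform init     # initialise the working directory and providers
terraform get      # pull in the referenced modules
terraform apply    # build the VPC and the VPN instance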

[Screenshots: terraform apply output and the new VPN instance in the AWS console]

Now we go back to our eu-west-1 project; all we need to do is add the vpn-instance module snippet to main.tf, run terraform get / terraform apply, and it should deploy the instance in eu-west-1 as well. We only need to change the AMI ID to one that is available in euw1.

“../projects/eu-west-1/main.tf”

[Screenshot: terraform apply output showing 3 resources to add]

Note that only 3 resources were added. In the previous post we had already deployed the whole network stack; now we are only adding the resources defined in the vpn-instance module.

Terraform and Configuration Management

Terraform is a great tool for creating infrastructure, but it is not a configuration management tool. If we go back to our user-data file and add or remove something, Terraform will detect the change, and when we apply it the instance will be destroyed and re-created. We don't want that every time we add a new VPN peer, so we should use a configuration manager such as Ansible to manage our LibreSwan configuration.

I haven’t played much with Ansible yet, but as the main goal of this blog is to help-me learn and document what I am learning, on the Next Blog post I should have an Ansible environment ready to deploy those VPN Configs, and use Terraform to define and control the AWS Route Tables.

The complete Lab can be found on my GitHub.

That’s all folks.

AWS Network and Terraform – Part one

The word out there is that the public cloud is the solution to all problems... a bit too strong, right? As network engineers we tend to have an attachment to our on-premises datacenter, where we have control over all network matters, but I strongly believe we should embrace the cloud; it is an interesting, reliable and fun platform to learn and use in MANY different use cases. AWS is by far the leading IaaS in the market, and that's why I decided to learn my way into the public cloud with them. This post won't cover AWS networking fundamentals; for that I recommend the two-part blog post by Nick Matthews, "Amazon VPC for On-Premises Network Engineers" – Part 1 and Part 2.

Terraform is a simple and powerful infrastructure-as-code tool that can be used to template, plan and create infrastructure across a multitude of providers, including but not limited to AWS. I suggest a look at the Terraform Getting Started guide for a better idea of how it works; they have amazing documentation, and there's no need to be a programmer to understand it.

Let's play around with it so we can learn more.

Defining your Modules

Modules can be thought of as functions: you define a module that can be reused as many times as you want, which avoids repetitive code and gives us a cleaner and easier-to-manage configuration. In our example I am going to have two main folders, one called modules and another called projects; each new project is placed under projects and sources all of its resources from the modules folder. Let's start by defining the modules. Under "/modules/network-stack" we will create the Terraform files (.tf) that contain our templated code to spin up the AWS networking stack.

“../modules/network-stack/vpc.tf”

“../modules/network-stack/network.tf”

“../modules/network-stack/subnet.tf”

“../modules/network-stack/security.tf”

With that we have a complete aws-network-stack module, with all the pieces necessary to bring up an AWS VPC: public and private subnets across two different AZs, an IGW providing public internet access for the public subnets, a NAT gateway providing internet access for the private instances, and Security Groups adding Layer 3/4 security to the instances. We are now ready to use this module as many times as we want in our projects.

To demonstrate that, let's create a folder called eu-west-1 in our projects folder. There we are going to call the network-stack module we created above, pass in variable values and run Terraform over this project, which should create our AWS network stack in a matter of minutes.

“../projects/eu-west-1/main.tf”

That’s all. All we need to do now is run  terraform initterraform plan / terraform apply and the Base AWS Network Stack will be ready for use in a matter of minutes!

[Screenshot: terraform apply output]

In the next posts I want to show how to use Terraform to manage our route tables, VPNs, cross-region (X)Swan IPSEC tunnels, VPC peering, and more. Stay tuned.

The Git repo with all the configs can be found here.