How To Use HashiCorp Tools To Create a Secured Edge Infrastructure
This article is the missing step-by-step tutorial needed to glue all the pieces of Consul, Nomad, and Vault together.
This article explains how to build a secure platform using HashiCorp's stack. Many of the individual steps are well documented, but we could not find the hints needed to iron out the wrinkles we hit along the way, so here we show how to glue the whole procedure together. We describe the most critical parts in detail, explaining where we found issues and how we solved them, and point to the official tutorials where they are sufficient, to keep this article from getting too long and difficult to follow.
The Physical Architecture
As shown in the picture below, the physical architecture we want to achieve is composed of a Consul and Nomad cluster of five nodes: three of them are Control Plane nodes configured for High Availability (HA), while the remaining two are Data Plane nodes.
Given that we are in an edge environment, the final location of cluster servers is key:
Control plane nodes (servers) are distributed on the fair’s pavilions or installed on a dedicated cabinet in the main building
Data plane nodes (clients) are physically located on every pavilion
By default, the servers' role is to maintain the state of the cluster, while the clients serve users who are physically located near them.
Basic Requirements and Nodes Deployment
In the era of automation and virtualization, the way to proceed is to create a "test and destroy" virtualized environment in which to try our configurations before going into production and deploying everything on the physical infrastructure.
The DevOps toolbox, composed of tools such as Ansible, Terraform, and Git, has been a strict requirement in the process of evolving and refining the configuration and procedure for setting up such a complicated environment. Ansible Vault is our choice to ensure that sensitive content related to specific environments, such as passwords, keys, configuration variables, and bootstrapping tokens or TLS certificates, are protected and encrypted rather than visible in plain text from the Git repository where we store code for building the infrastructure.
We start by creating five virtual machines. For our tests, we used the following specs:
OS: Ubuntu 22.04 LTS
Disk: 20GB
Memory: 2GB
vCPU: 2
You can choose your preferred public cloud provider and deploy a handful of virtual machines there with Terraform, or, to speed up the process, create them on your laptop using VirtualBox and Vagrant. Either way, you will end up with an Ansible inventory looking similar to this one:
all:
  children:
    cluster_role_server:
      hosts:
        node-1:
        node-2:
        node-3:
    cluster_role_client:
      hosts:
        node-4:
          placement: "pavilion-1"
        node-5:
          placement: "pavilion-2"
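Before moving on, it is worth a quick sanity check that the inventory parses and that every node is reachable. A minimal sketch, assuming the file above is saved as inventory.yml (the filename is our own choice, not part of the original setup):
# show the group/host tree as Ansible sees it
ansible-inventory -i inventory.yml --graph
# verify SSH connectivity and Python availability on every node
ansible all -i inventory.yml -m ping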
Provision the Consul and Nomad Cluster
In this bootstrapping phase, the open source Ansible community provides a library of extensively tested public roles. Thanks to the Ansible Galaxy utility, we can pull the code of two roles for deploying Consul and Nomad, respectively.
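For reference, this is roughly how the two roles can be pulled with ansible-galaxy. The repository URLs below point to the ansible-community Consul and Nomad roles we believe are the ones in question, so treat them as an assumption and double-check them against the roles you actually use:
cat <<'EOF' > requirements.yml
- src: https://github.com/ansible-community/ansible-consul.git
  scm: git
  name: ansible-consul
- src: https://github.com/ansible-community/ansible-nomad.git
  scm: git
  name: ansible-nomad
EOF
ansible-galaxy install -r requirements.yml -p roles/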
We crafted the above roles into a proprietary playbook that, first of all, configures and sets up the server networking with a dummy network interface. That interface, which we will call dummy0, is mapped to the reserved IP address 169.254.1.1 and is used to resolve *.consul domain names in the cluster. We rely on Ansible to configure it on the target nodes, and at the end we will have the following file, /etc/systemd/network/dummy0.network, on every server:
[Match]
Name=dummy0
[Network]
Address=169.254.1.1/32
# We want these domains
Domains=~consul.
# To go to the consul DNS server we'll bind to this address
DNS=169.254.1.1
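The .network file above only configures the interface; the dummy device itself also has to exist. If you are reproducing this step by hand rather than through the playbook, a minimal systemd-networkd sketch to create the device on every server looks like this:
sudo tee /etc/systemd/network/dummy0.netdev > /dev/null <<'EOF'
[NetDev]
Name=dummy0
Kind=dummy
EOF
sudo systemctl restart systemd-networkd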
In addition, the playbook applies a few basic customizations on the servers, such as updating the OS, setting up the nodes' users and importing the related SSH authorized_keys, installing the container runtime, and adding some debugging CLI tools.
Finally, you should take care in configuring the Consul and Nomad roles by setting the proper Ansible variables. Most of their names are self-explanatory, but you certainly have to specify consul_version and nomad_version (in our case, 1.12.2 and 1.2.8, respectively). The two client nodes should set the consul_node_role and nomad_node_role variables to client, while the other three should set them to server. Another useful variable to set is nomad_cni_enable: true because, otherwise, you will have to manually install the CNI plugins on every Nomad node. More information about how to configure the roles for your specific case is available on the related GitHub pages.
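One possible way to lay these variables out, purely as an illustration, is with group_vars files named after the inventory groups shown earlier; the file names and layout are our own choice, so adapt them to your repository structure:
mkdir -p group_vars
cat <<'EOF' > group_vars/cluster_role_server.yml
consul_version: "1.12.2"
nomad_version: "1.2.8"
consul_node_role: server
nomad_node_role: server
nomad_cni_enable: true
EOF
cat <<'EOF' > group_vars/cluster_role_client.yml
consul_version: "1.12.2"
nomad_version: "1.2.8"
consul_node_role: client
nomad_node_role: client
nomad_cni_enable: true
EOF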
In this way, we are also able to support, with minimal effort, the rapid provisioning and deployment of different environment inventories, ranging from production bare-metal servers or cloudlets at the edge to development and testing virtual machines in the cloud.
Provision the Vault Cluster
At this point, you have a Consul + Nomad cluster ready, and you should be able to access the respective UIs and launch some test workloads. If everything is working fine, it is time to provision the Vault cluster which, for simplicity and exclusively for this testing phase, will be installed in HA on the very same server nodes where the Nomad and Consul control planes run.
If you are a Vault beginner, start here and here and play with it a little, then move to the Ansible Vault role to install this component. Before jumping into the playbook magic, though, a few manual steps are needed to bootstrap the PKI for our Vault cluster.
You can rely on an existing PKI configuration if your company provides one. However, we will start from scratch, creating our own Root Certification Authority (CA) and certificates.
First, create the root key (this is the CA private key and can be used to sign certificate requests), then use the root key to create the root certificate that needs to be distributed to all the clients that have to trust us. You can tune the value for the -days flag depending on your needs.
openssl genrsa -aes256 -out rootCA.key 4096
openssl req -x509 -new -key rootCA.key -sha256 -days 1024 -out rootCA.crt
You will have to customize the fields on prompts, for instance, with something like: C=IT, O=CompanyName, CN=company.com/emailAddress=example@company.com
and optionally check the root CA certificate with:
openssl x509 -in rootCA.crt -text
At this point, we need to create a certificate for every server, appliance, or application that needs to expose a TLS-encrypted service via a certificate trusted by our CA. In our case, we will assign this certificate to the Vault cluster, and therefore we will use the same certificate on every Vault node.
openssl genrsa -out vault.key 4096
# create the Certificate Signing Request (csr)
openssl req -new -sha256 \
-key vault.key \
-subj "/C=IT/ST=TN/O=Tofane/CN=vault.service.consul" \
-out vault.csr
# finally verify the CSR's content
openssl req -in vault.csr -noout -text
To complete the procedure, create an extfile with all the alternative names (these can be DNS names or IPs); in the example, the IP of every Vault node is used. Then generate the certificate using the CSR and key along with the root CA key.
echo "subjectAltName=DNS:*.vault.service.consul,IP:10.231.0.72,IP:10.231.0.71,IP:10.231.0.73" >> extfile.cnf
# optionally, if you want to use the csr also for authentication
echo extendedKeyUsage = serverAuth >> extfile.cnf
openssl x509 -req -in vault.csr -CA rootCA.crt -CAkey rootCA.key -CAcreateserial -out vault.crt -extfile extfile.cnf -days 365 -sha256
# verify the certificate content
openssl x509 -in vault.crt -text -noout
In addition to other minor files, you should now make sure the following generated files are present in the current working folder:
rootCA.crt
vault.crt
vault.key
The PKI files must be passed, through dedicated variables, to the Ansible role, together with the additional configuration suggested below; then start the Vault provisioning by assigning the vault role to your server nodes.
vault_backend: consul
vault_consul: 169.254.1.1:8500
vault_iface: "{{ ansible_default_ipv4.interface }}"
vault_listener_localhost_enable: true
vault_tls_disable: 0
vault_tls_src_files: "ANSIBLE_RELATIVE_PATH_TO_YOUR_CERT_FILES"
vault_tls_cert_file: "vault.crt"
vault_tls_key_file: "vault.key"
vault_tls_ca_file: "rootCA.crt"
One more step is needed if you used a self-signed CA: its certificate must be copied to every node and installed, and you can easily accomplish this task with the following:
sudo cp rootCA.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
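Since the same trust must be present on all five nodes, the two commands above can also be pushed everywhere with a couple of ad-hoc Ansible calls; a sketch, assuming rootCA.crt sits in your current working directory:
ansible all -m copy -a 'src=rootCA.crt dest=/usr/local/share/ca-certificates/rootCA.crt' --become
ansible all -m shell -a 'sudo update-ca-certificates'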
Finally, initialize your Vault cluster, unseal it, log in with the token returned from the initialization phase, and check its status; if everything is working well, move on to the next pitfall: enabling and configuring TLS on your fresh new cluster.
vault operator init
vault operator unseal
vault login
vault status
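For reference, here is a slightly more explicit sketch of that sequence; the key-share values are simply Vault's defaults and the output file name is our own choice, so treat it as an example rather than the exact procedure:
# initialize with 5 key shares and a threshold of 3, keeping the output somewhere safe
vault operator init -key-shares=5 -key-threshold=3 | tee vault-init.txt
# run unseal three times, pasting a different unseal key each time
vault operator unseal
vault operator unseal
vault operator unseal
# authenticate with the initial root token printed by the init command
vault login
vault status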
Configure TLS
In this section, we will walk through the procedure to enable secure cluster communications both for the infrastructure components (e.g., the Consul and Nomad agents) and for the application containers running in our cluster. We will secure Consul agent communication with TLS encryption and service authentication on top of mTLS verification. This can be done using certificates generated by the Vault cluster; in our case, certificates will be distributed via consul-template on servers and via the auto-encrypt method on clients.
Requirements
All the magic related to certificate renewal and rotation between the Consul and Nomad agents and Vault is in the hands of a required component: Consul Template. Let us share the reference to its GitHub page and briefly give you some pointers on how to install and configure it because, as of today, we could not find a complete guide on the web:
Download and unpack the binary in the home of every cluster node
ansible all -m shell -a 'sudo apt-get install -y unzip; CT_V=0.29.0; \
curl -O https://releases.hashicorp.com/consul-template/${CT_V}/consul-template_${CT_V}_linux_amd64.zip; \
unzip consul-template_${CT_V}_linux_amd64.zip; \
sudo mv consul-template /usr/local/bin/'
Create the unit file on every node for systemd
ansible all -m shell -a "sudo bash -c 'cat <<EOF > /etc/systemd/system/consul-template.service
[Unit]
Description=consul-template
Documentation=https://github.com/hashicorp/consul-template
Wants=network-online.target
After=network-online.target
[Service]
EnvironmentFile=/etc/environment
ExecStart=/usr/local/bin/consul-template -config /etc/consul-template.d
ExecReload=/bin/kill -HUP \$MAINPID
KillMode=process
KillSignal=SIGINT
Restart=on-failure
RestartSec=42s
[Install]
WantedBy=multi-user.target
EOF'"
Finally, promote it as a system service under systemd and create a folder for the consul-template configuration files:
ansible all -m shell -a "sudo systemctl daemon-reload"
ansible all -m shell -a \
'sudo mkdir -p /etc/consul-template.d; \
sudo chgrp consul /etc/consul-template.d'
The main scaffold for Consul Template is now ready; all you need to do in the next steps is to create its configuration file and template files for every service requesting certificate rotation.
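To give an idea of what such a configuration can look like, here is a sketch of a possible /etc/consul-template.d/consul-template.hcl; the Vault token, template paths, permissions, and reload command are placeholders to adapt to the templates described in the next sections:
sudo tee /etc/consul-template.d/consul-template.hcl > /dev/null <<'EOF'
vault {
  # on server nodes we point Consul Template at the local Vault listener
  # (see the consul-template configuration advice later in the article)
  address      = "http://localhost:8200"
  token        = "REPLACE_WITH_A_VAULT_TOKEN"
  renew_token  = true
  unwrap_token = false
}

template {
  # source template and destination file are placeholders for the examples below
  source      = "/opt/consul/templates/agent.crt.tpl"
  destination = "/opt/consul/agent-certs/agent.crt"
  perms       = 0600
  command     = "systemctl reload consul"
}
EOF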
Configure Consul TLS and Consul Connect
You can easily follow the below step-by-step tutorials, which are very detailed and need just minor and quite obvious adjustments for your specific case:
Configure Vault as Consul's CA
Install the Consul intermediate CA into every node
Now, before continuing with the Consul configuration, we need to enable and configure the Consul Connect feature. For this, we first need to enable Vault as a Consul Connect Certification Authority provider: the certificates used in the service mesh networks created by Consul Connect will be generated and managed by Vault. Jump to the following tutorial and make sure to choose the instructions under the "Self Managed" tab: Vault as Consul Service Mesh Certification Authority provider.
At this point, you need to configure Consul, first by distributing the bootstrap certificates to both clients and servers under the /opt/consul/agent-certs/ folder, and then by installing the Consul intermediate CA on every node using the update-ca-certificates utility, in a similar way to what we did for the Vault CA in the previous section. For these steps, you can easily follow the Configure Consul tutorial.
Now that you have distributed the bootstrap certificates to your nodes, you can enable TLS, first on servers and then on clients, by placing the respective Consul configuration in the /etc/consul.d/ folder and restarting the Consul agent.
These are the configurations used in our cluster:
enable-tls-server.hcl
verify_incoming = true
verify_outgoing = true
verify_server_hostname = true
ca_file = "/opt/consul/agent-certs/ca.crt"
cert_file = "/opt/consul/agent-certs/agent.crt"
key_file = "/opt/consul/agent-certs/agent.key"
auto_encrypt {
allow_tls = true
}
enable-tls-client.hcl
verify_incoming = false
verify_outgoing = true
verify_server_hostname = true
ca_file = "/opt/consul/agent-certs/ca.crt"
auto_encrypt {
tls = true
ip_san = ["169.254.1.1"]
}
Restart the Consul agent on every node, and the API over TLS should be available at https://169.254.1.1:8501. Still, we need to configure certificate renewal by means of Consul Template and also set up mTLS authentication.
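An ad-hoc Ansible command is enough for the restart, and a quick curl from a server node (where the agent certificate and key distributed above are readable) lets you verify the new HTTPS endpoint; a sketch:
ansible all -m shell -a 'sudo systemctl restart consul'
# from a server node: with verify_incoming enabled, the API requires the CA plus a client certificate
sudo curl --cacert /opt/consul/agent-certs/ca.crt \
  --cert /opt/consul/agent-certs/agent.crt --key /opt/consul/agent-certs/agent.key \
  https://169.254.1.1:8501/v1/status/leader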
The Configure consul-template page will guide you in setting up the template files, and here you should take care of a small detail: given that in our cluster the Consul agents are listening on the special dummy0 interface, we need to create proper certificates, via the template mechanism, by adding a custom ip_sans entry in every template file. For your convenience, here is a template example for the server agent certificate:
{{ with secret "pki_consul_int/issue/consul-tofane" "common_name=server.tofane.consul" "ttl=24h" "alt_names=localhost" "ip_sans=169.254.1.1"}}
{{ .Data.certificate }}
{{ end }}
Finally, you just need to create a consul-template configuration file, and here the only advice is to use address = "http://localhost:8200" for server nodes and address = "https://active.vault.service.consul:8200" for client nodes. In this way, servers will have a more stable configuration because their communication with Vault will keep working even when Consul is down and its DNS resolution is unavailable.
To perform a final check that TLS certificate rotation is working as expected, you should wait for the 24-hour TTL that you chose when creating the certificates in Vault to expire. Verifying that the certificates have been overwritten on every node can be done with this quite simple Ansible command:
ansible all -m shell -a 'sudo ls -la /opt/consul/agent-certs/'
In addition, you can use OpenSSL and check the certificates used on the RPC or API ports.
openssl s_client -connect $CONSUL_NODE_LAN_IP:8300 2>/dev/null | less
openssl s_client -connect 169.254.1.1:8501 2>/dev/null | less
And finally, try to connect to the API using the proper certificates from a server node.
consul members -http-addr="https://169.254.1.1:8501" \
-ca-file="/opt/consul/agent-certs/ca.crt" \
-client-cert="/opt/consul/agent-certs/agent.crt" \
-client-key="/opt/consul/agent-certs/agent.key"
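Equivalently, you can export the corresponding environment variables once and keep using the CLI as usual:
export CONSUL_HTTP_ADDR="https://169.254.1.1:8501"
export CONSUL_CACERT="/opt/consul/agent-certs/ca.crt"
export CONSUL_CLIENT_CERT="/opt/consul/agent-certs/agent.crt"
export CONSUL_CLIENT_KEY="/opt/consul/agent-certs/agent.key"
consul members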
Configure Nomad TLS
The procedure to configure Nomad to use TLS is very straightforward because it follows the very same steps as the Consul configuration. The official tutorial for your reference is Generate mTLS Certificates for Nomad using Vault, and below we describe only a few tweaks that will save you some headaches.
First of all, as we did for Consul, we need to add ip_sans in the consul-template templates, but in this case we also need to add 127.0.0.1 because Nomad will also be contacted on that address by services (e.g., the Nomad Autoscaler). Below is our example:
{{ with secret "pki_nomad_int/issue/nomad-cluster" "common_name=server.global.nomad" "ttl=24h" "alt_names=localhost" "ip_sans=127.0.0.1,169.254.1.1"}}
{{ .Data.certificate }}
{{ end }}
It is worth noting that the Consul Template configuration for Nomad can simply be appended to the existing configuration file, because we are already using it for Consul.
Given that Consul Template will be used to retrieve both Consul and Nomad certificates from different Vault paths, it is better to create a new token that carries both permissions.
vault token create -policy=nomad -policy=connect-ca
Of course, this token must be replaced in the Consul Template configuration file on every node.
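If your consul-template configuration follows the layout sketched earlier, the replacement can again be done in one shot with Ansible; a sketch (the file path and the token value are placeholders):
NEW_TOKEN="PASTE_THE_TOKEN_FROM_THE_COMMAND_ABOVE"
ansible all -m shell -a "sudo sed -i 's|token *= *\".*\"|token = \"$NEW_TOKEN\"|' /etc/consul-template.d/consul-template.hcl; \
sudo systemctl restart consul-template"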
As we did for Consul, for your convenience we share the file that should be placed in /etc/nomad.d/ in order to enable Nomad TLS on server nodes. For client nodes, just remove the last line of the tls block (the rpc_upgrade_mode setting).
enable-tls-server.hcl
tls {
http = true
rpc = true
ca_file = "/opt/nomad/agent-certs/ca.crt"
cert_file = "/opt/nomad/agent-certs/agent.crt"
key_file = "/opt/nomad/agent-certs/agent.key"
verify_server_hostname = true
verify_https_client = true
rpc_upgrade_mode = true
}
As soon as you have reloaded the Nomad agents for the first time, remember to change the rpc_upgrade_mode option to false on the server nodes, so that your Nomad servers will only accept TLS connections, and then restart the service again.
ansible cluster_role_server -m shell -a \
'sudo sed -i "s/rpc_upgrade_mode = true/rpc_upgrade_mode = false/g" /etc/nomad.d/enable-tls-server.hcl'
ansible cluster_role_server -m shell -a \
'sudo systemctl restart nomad'
Only a few more steps are needed to fully conclude the Nomad configuration: i) allow access to the web UI, ii) enable sidecars to use TLS, and iii) test that everything is working as expected.
In order to access the web UI when TLS is enabled, set verify_https_client to false on the servers and restart the Nomad service. For more information, we found this post very useful.
ansible cluster_role_server -m shell -a \
'sudo sed -i "s/verify_https_client = true/verify_https_client = false/g" /etc/nomad.d/enable-tls-server.hcl; \
sudo systemctl restart nomad'
Now you should be able to connect to https://localhost:4646/ui, but remember that for production environments you should configure your browser with the Vault root CA so that peer authentication is properly supported.
The second step is about enabling sidecars to use the TLS-protected mesh provided by Consul Connect: in order to initialize the channels in the mesh, Nomad needs to communicate with Consul and inject into the sidecars the right certificates to establish a protected connection.
To do that, update the Nomad configuration in /etc/nomad.d/base.hcl and allow Nomad to communicate with the Consul API via TLS by adding the lines below to the consul stanza:
address = "169.254.1.1:8501"
ssl = true
ca_file = "/opt/consul/agent-certs/ca.crt"
cert_file = ""
key_file = ""
Hopefully, you are now at the final testing phase; indeed, if you run the nomad status command, you will receive an error message because you still need to properly configure the CLI.
Export the endpoint and certificates variables:
export NOMAD_ADDR=https://localhost:4646
export NOMAD_CACERT="/opt/nomad/agent-certs/ca.crt"
export NOMAD_CLIENT_CERT="/opt/nomad/cli-certs/cli.crt"
export NOMAD_CLIENT_KEY="/opt/nomad/cli-certs/cli.key"
And finally, check the status again over TLS; the nomad status command should now work seamlessly.
Configure Consul ACL
This last step will enable the Consul Access Control List (ACL) system on our cluster. The main reference for this task is the HashiCorp tutorial about bootstrapping the ACL system, but again, we will give you some suggestions.
Enable the ACL
In the process of enabling the ACL system, we encountered just a small issue that prevented us from bootstrapping it while multiple nodes were active simultaneously. Small issues can be solved with easy solutions: why not temporarily shrink the cluster to one node before enabling the ACL? We are "lazy developers," so a rough but working bash loop over SSH helped us here:
# server nodes
S_L="node-1 node-2 node-3"
# client nodes
C_L="node-4 node-5"
# make every node except the first server (node-1) leave the cluster
for i in ${S_L#* } $C_L;
do echo "R $i"
ssh -t $i consul leave
done
Now copy the following Consul HCL configuration file, and then bootstrap the ACL on the single-node Consul cluster:
enable-acl.hcl
acl = {
enabled = true
#default_policy = "deny"
default_policy = "allow"
enable_token_persistence = true
}
FIRST=node-1
echo "R $FIRST"
scp enable-acl.hcl $FIRST:/var/tmp/enable-acl.hcl
ssh $FIRST sudo install --owner consul --group consul /var/tmp/enable-acl.hcl /etc/consul.d/enable-acl.hcl
ssh -t $FIRST sudo systemctl restart consul.service
Consul will reply with a success message and will allow us to proceed with saving the Consul bootstrap token and re-joining the other nodes to the original cluster by simply restarting their Consul agents.
# save consul bootstrap token
sudo consul acl bootstrap -format json 2>&1 | tee consul-acl-bootstrap.token
CONSUL_HTTP_TOKEN=$(cat consul-acl-bootstrap.token | jq -r '.SecretID')
export CONSUL_HTTP_TOKEN
# restore cluster nodes
for i in ${S_L#* } $C_L; do
echo "R $i"
scp enable-acl.hcl $i:/var/tmp/enable-acl.hcl
ssh $i sudo install --owner consul --group consul /var/tmp/enable-acl.hcl /etc/consul.d/enable-acl.hcl
ssh -t $i sudo systemctl restart consul.service
done
At this point, only the server nodes may be available in the cluster (check using consul members), and you may see that the Vault, Consul, and Nomad agents do not work anymore; that is quite normal, given that they do not yet have the proper ACL policies and the related tokens. Following the next steps, you will fix this.
Integrate ACL With Vault
The integration with Vault is quite easy; you just need to generate a token from a proper Consul policy and run the Ansible Vault role again to update all your nodes in one shot, after setting an additional Ansible extra variable:
vault_consul_token: "YOUR_TOKEN_HERE"
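If you need a starting point, the sketch below creates a dedicated Consul policy and token for Vault's Consul storage backend; the rules shown are the ones we believe the backend needs, so double-check them against the official Vault storage backend documentation before using them:
cat <<'EOF' > vault-storage-policy.hcl
key_prefix "vault/" {
  policy = "write"
}
service "vault" {
  policy = "write"
}
agent_prefix "" {
  policy = "read"
}
session_prefix "" {
  policy = "write"
}
EOF
consul acl policy create -name vault-storage -rules @vault-storage-policy.hcl
consul acl token create -description "Vault storage backend" -policy-name vault-storage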
Integrate ACL With Consul and Nomad Agents
Integration with the other components is absolutely straightforward, and you can easily follow the official tutorials in this order:
1. Create rules, policy, and tokens for Consul nodes and agents
2. Create rules, policy, and tokens for DNS; here, for simplicity, we just merged this policy with the one at point 1) and generated one for every node starting from the template below (a sketch of how to register each policy and token follows the template):
node "NODE-NAME" {
policy = "write"
}
service_prefix "" {
policy = "read"
}
node_prefix "" {
policy = "read"
}
query_prefix "" {
policy = "read"
}
agent "NODE-NAME" {
policy = "write"
}
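Here is a sketch of how each rendered policy can be registered and turned into an agent token; the template file name and policy names are illustrative, so adapt the loop to your own node list:
for NODE in node-1 node-2 node-3 node-4 node-5; do
  # render the per-node policy from the template above
  sed "s/NODE-NAME/${NODE}/g" node-policy.hcl.tpl > ${NODE}-policy.hcl
  consul acl policy create -name "${NODE}" -rules @${NODE}-policy.hcl
  consul acl token create -description "${NODE} agent token" -policy-name "${NODE}"
done
# then, on every node, register its own token so the agent can use it
consul acl set-agent-token agent "THE_SECRET_ID_GENERATED_FOR_THIS_NODE"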
Fully Enable ACL Policy
Now force the ACL default policy to deny and finally restart Consul on every node.
ansible all -m shell -a "sudo sed -i \
's/default_policy = \"allow\"/#default_policy = \"allow\"/g' /etc/consul.d/enable-acl.hcl"
ansible all -m shell -a "sudo sed -i \
's/#default_policy = \"deny\"/default_policy = \"deny\"/g' /etc/consul.d/enable-acl.hcl"
ansible all -m shell -a "sudo systemctl restart consul"
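As a last sanity check, an anonymous request should now be rejected (or return an empty result), while the same request with the bootstrap token saved earlier should still work:
# without a token: denied or empty, depending on the endpoint
consul members
# with the bootstrap token saved during the ACL bootstrap step
CONSUL_HTTP_TOKEN=$(jq -r '.SecretID' consul-acl-bootstrap.token) consul members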
Conclusions
Well done! At this point, your "Secure Edge Infrastructure" based on HashiCorp products should be up and running, and you can start installing your preferred workloads. In our case, we installed a layer of middleware on top of this deployment, enabling features like load balancing, monitoring, and autoscaling; to achieve this, we used Traefik, Prometheus, and the Nomad Autoscaler, respectively. As you may imagine, additional configuration is required to have these tools working properly on top of a full TLS infrastructure, but that will be the topic of another article.
In conclusion, we want to share a possible evolution of this work, which we think is quite natural: given that the infrastructure can be considered reasonably secure in terms of communication and authentication across components, why not use it as a pillar to build a hybrid infrastructure in which the control plane nodes live in the cloud of your choice while the data plane servers stay at the edge?
With such an architecture, you would have the unquestioned advantage of being able to control multiple edge sites with a single control plane, sweeping away the overhead and targeting economies of scale in your company. Of course, the system should be properly partitioned to support multi-tenancy, but that will surely be our next exciting project.