Tips for running Ansible playbooks on large networks

Since our recent migration from Puppet/MCollective, Ansible has proven to be an amazing tool. It has greatly simplified our orchestration/configuration layer and removed the headache and complexity of managing the management system.

Most of our interactions with the network are peer-to-peer (that is, from a central API server to a single machine in our vastly distributed network). This works great, and speed-wise we have been getting even better results than we did with MCollective.

But every now and then we run a network-wide process (deploying a given package, restarting a given service, rolling out a critical patch to all systems, and so on). With over 4,000 servers in the network, this presents some challenges.

Issues

The main issue we have faced when running playbooks over our large network has always been related to memory. A playbook would run fine initially, but after a while we would start getting errors like:

Process Process-132:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python2.7/dist-packages/ansible/runner/__init__.py", line 81, in _executor_hook
    while not job_queue.empty():
  File "<string>", line 2, in empty
  File "/usr/lib/python2.7/multiprocessing/managers.py", line 758, in _callmethod
    conn.send((self._id, methodname, args, kwds))
IOError: [Errno 32] Broken pipe

These are followed by out-of-memory errors once resources on the Ansible control machine are depleted.

Workaround

Ideally, Ansible itself would fix these issues at some point; in the meantime, we have come up with a way to run playbooks successfully against large networks.

When you run a playbook, the first step in the execution process is to gather facts for ALL the machines the playbook will run against. For very large networks this poses the first problem, as memory consumption can be huge.
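To see why this matters at our scale, here is a back-of-the-envelope sketch. The per-host fact size below is an assumption for illustration (real fact blobs vary with hardware and inventory), but the linear scaling is the point:

```python
# Rough sketch of why up-front fact gathering hurts on a big network.
# PER_HOST_FACTS_BYTES is an assumed figure, not a measured Ansible value.
PER_HOST_FACTS_BYTES = 200 * 1024  # assume ~200 KiB of facts per host

def facts_memory_mib(num_hosts):
    """Approximate memory held on the control machine by facts
    for num_hosts hosts, in MiB."""
    return num_hosts * PER_HOST_FACTS_BYTES / (1024 * 1024)

print(facts_memory_mib(4000))  # whole network at once: ~781 MiB
print(facts_memory_mib(20))    # a single batch of 20: ~4 MiB
```

Holding facts for every host at once costs hundreds of MiB before any task has even run, while a small batch stays in the single-digit MiB range.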

So the first thing we do is to avoid fact gathering at the start of the playbook execution:

gather_facts: false

Then we want to limit the number of servers processed in parallel. We use the serial parameter for that:

serial: "10%"
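To make the batching semantics concrete, here is a simplified sketch of how serial splits an inventory into batches that are processed one at a time. This is an illustration of the behaviour, not Ansible's actual implementation:

```python
def serial_batches(hosts, serial):
    """Split hosts into the batches a play would process one at a time.
    serial may be an int batch size or a percentage string like "10%".
    Simplified sketch of the 'serial' semantics, not Ansible's own code."""
    if isinstance(serial, str) and serial.endswith("%"):
        size = max(1, len(hosts) * int(serial[:-1]) // 100)
    else:
        size = int(serial)
    return [hosts[i:i + size] for i in range(0, len(hosts), size)]

hosts = ["web%03d" % n for n in range(1, 101)]  # 100 dummy hosts
batches = serial_batches(hosts, "10%")
print(len(batches), len(batches[0]))  # 10 batches of 10 hosts
```

With "10%" on 4,000 hosts that means batches of 400; a fixed serial: 20 would give 200 batches of 20, with the last batch smaller if the count does not divide evenly.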

Finally, we use the setup module as the first task in the play to gather facts only from the "serial-sized" batch of servers currently being processed (thus limiting the number of servers we get facts from at once).

A full playbook example could be:

---
- hosts: all
  remote_user: test
  sudo: yes
  gather_facts: false
  serial: 20
  tasks:
    - name: Get facts
      setup:
    - name: Find patching date
      debug: msg="Server {{ inventory_hostname }} was last patched {{ ansible_local.patching.status.last_patched }}"

We then execute the playbook, limiting forks to the maximum number of servers in each batch (20 in this example):

ansible-playbook -f 20 -T 10 patching.yml

This ensures smooth execution of the playbook across the whole network (although you need to be a bit patient for it to complete!).