Best Practices and Production Patterns
Learn production-ready patterns for security, testing, error handling, and team collaboration in Ansible automation.
Running Ansible in production environments requires more than functional playbooks. You need security controls, testing procedures, error handling strategies, and organizational patterns that support team collaboration. This section covers the practices that separate hobby automation from enterprise-grade infrastructure management.
Security Best Practices
Secrets Management
Never store sensitive data in plain text. Use Ansible Vault for encrypting secrets:
# Create an encrypted file
ansible-vault create group_vars/production/vault.yml
# Edit an encrypted file
ansible-vault edit group_vars/production/vault.yml
# Encrypt an existing file
ansible-vault encrypt secrets.yml
# Decrypt for viewing
ansible-vault view group_vars/production/vault.yml
Structure your variables to clearly separate sensitive data:
# group_vars/production/vars.yml (unencrypted)
database_host: db.production.example.com
database_port: 5432
database_name: app_production
database_user: app_user
# group_vars/production/vault.yml (encrypted)
vault_database_password: supersecretpassword
vault_api_key: abcd1234567890
vault_ssl_private_key: |
-----BEGIN PRIVATE KEY-----
MIIEvQIBADANBgkqhkiG9w0BAQEFAASCBKcwggSjAgEAAoIBAQC...
-----END PRIVATE KEY-----
Reference vault variables in your regular variables:
# group_vars/production/vars.yml
database_password: '{{ vault_database_password }}'
api_key: '{{ vault_api_key }}'
ssl_private_key: '{{ vault_ssl_private_key }}'
Vault Management Strategies
For team environments, use multiple vault files with different access levels:
# Structure for team access control
group_vars/
├── production/
│ ├── vars.yml # Public variables
│ ├── vault-common.yml # Shared secrets (team access)
│ ├── vault-database.yml # Database secrets (DBA access)
│ └── vault-certificates.yml # SSL secrets (security team access)
Use external secret management systems for enterprise environments:
# Using HashiCorp Vault lookup
tasks:
- name: Get database password from HashiCorp Vault
set_fact:
database_password: "{{ lookup('hashi_vault', 'secret=secret/database/production:password') }}"
no_log: true
- name: Configure application with secret
template:
src: app.conf.j2
dest: /etc/myapp/app.conf
mode: '0600'
no_log: true
Privilege Management
Follow the principle of least privilege:
# Bad: Running everything as root
- name: Configure application
hosts: appservers
become: yes # Everything runs as root
# Better: Selective privilege escalation
- name: Configure application
hosts: appservers
tasks:
- name: Install packages (requires root)
package:
name: myapp
state: present
become: yes
- name: Configure application (app user is sufficient)
template:
src: app.conf.j2
dest: /opt/myapp/app.conf
become: yes
become_user: myapp
- name: Start service (requires root)
service:
name: myapp
state: started
become: yes
Use sudo rules that limit specific commands:
# Configure sudo for specific Ansible operations
- name: Configure limited sudo for deployment user
lineinfile:
path: /etc/sudoers.d/ansible-deploy
line: 'deploy ALL=(ALL) NOPASSWD: /bin/systemctl restart myapp, /bin/systemctl reload nginx'
create: yes
validate: 'visudo -cf %s'
Input Validation and Sanitization
Validate user inputs to prevent injection attacks:
tasks:
- name: Validate database name format
fail:
msg: 'Database name contains invalid characters'
when: database_name is not match("^[a-zA-Z][a-zA-Z0-9_]*$")
- name: Validate port range
fail:
msg: 'Port must be between 1024 and 65535'
when: app_port | int < 1024 or app_port | int > 65535
- name: Sanitize user input
set_fact:
clean_app_name: "{{ app_name | regex_replace('[^a-zA-Z0-9_-]', '') }}"
Testing Strategies
Syntax and Lint Testing
Implement automated testing in your development workflow:
# Create a testing script
cat > scripts/test.sh << 'EOF'
#!/bin/bash
set -e
echo "=== Syntax Check ==="
ansible-playbook --syntax-check site.yml
echo "=== Ansible Lint ==="
ansible-lint site.yml
echo "=== YAML Lint ==="
yamllint .
echo "=== Check Mode (Dry Run) ==="
ansible-playbook site.yml --check --diff
echo "All tests passed!"
EOF
chmod +x scripts/test.sh
Ansible Lint Configuration
Create .ansible-lint
configuration:
# .ansible-lint
---
exclude_paths:
- .cache/ # implicit unless exclude_paths is defined in config
- .github/
- test/fixtures/formatting-before/
- test/fixtures/formatting-prettier/
use_default_rules: true
# Disable specific rules
skip_list:
- yaml[line-length] # Allow long lines in specific cases
- name[casing] # Allow flexible task naming
# Enable additional rules
enable_list:
- no-log-password
- name[prefix]
# Set rule-specific configuration
rules:
line-length:
max: 120
allow-non-breakable-words: true
allow-non-breakable-inline-mappings: true
Infrastructure Testing
Test your infrastructure changes in isolated environments:
# test-environment.yml
---
- name: Test infrastructure changes
hosts: test_servers
become: yes
pre_tasks:
- name: Create snapshot before changes (if supported)
uri:
url: '{{ cloud_api_endpoint }}/snapshots'
method: POST
body_format: json
body:
server_id: '{{ ansible_default_ipv4.address }}'
name: 'pre-ansible-{{ ansible_date_time.epoch }}'
delegate_to: localhost
when: cloud_snapshots_enabled | default(false)
roles:
- webserver
- database
post_tasks:
- name: Run application health checks
uri:
url: 'http://{{ inventory_hostname }}:{{ app_port }}/health'
method: GET
status_code: 200
register: health_check
retries: 5
delay: 10
- name: Validate database connectivity
postgresql_ping:
db: '{{ database_name }}'
login_host: '{{ inventory_hostname }}'
login_user: '{{ database_user }}'
login_password: '{{ database_password }}'
when: "'databases' in group_names"
Molecule Testing
Use Molecule for comprehensive role testing:
# molecule/default/molecule.yml
---
dependency:
name: galaxy
driver:
name: docker
platforms:
- name: ubuntu-instance
image: ubuntu:20.04
pre_build_image: true
command: /lib/systemd/systemd
v2: true
privileged: true
volumes:
- /sys/fs/cgroup:/sys/fs/cgroup:ro
provisioner:
name: ansible
config_options:
defaults:
interpreter_python: auto_silent
callback_whitelist: profile_tasks, timer, yaml
ssh_connection:
pipelining: false
verifier:
name: ansible
scenario:
test_sequence:
- dependency
- lint
- cleanup
- destroy
- syntax
- create
- prepare
- converge
- idempotence
- side_effect
- verify
- cleanup
- destroy
Error Handling and Recovery
Graceful Failure Handling
Implement proper error handling to prevent partial configurations:
---
- name: Deploy application with rollback capability
hosts: appservers
become: yes
vars:
app_backup_dir: /opt/backups/{{ app_name }}
max_failures: '{{ (ansible_play_hosts | length * 0.3) | int }}'
pre_tasks:
- name: Create backup directory
file:
path: '{{ app_backup_dir }}'
state: directory
mode: '0755'
tasks:
- name: Backup current application
block:
- name: Stop application service
service:
name: '{{ app_name }}'
state: stopped
register: service_stopped
- name: Create application backup
archive:
path: '{{ app_directory }}'
dest: '{{ app_backup_dir }}/backup-{{ ansible_date_time.epoch }}.tar.gz'
format: gz
register: backup_created
- name: Deploy new application version
unarchive:
src: '{{ app_package_url }}'
dest: '{{ app_directory }}'
remote_src: yes
owner: '{{ app_user }}'
group: '{{ app_user }}'
register: app_deployed
- name: Start application service
service:
name: '{{ app_name }}'
state: started
register: service_started
- name: Verify application health
uri:
url: 'http://{{ inventory_hostname }}:{{ app_port }}/health'
status_code: 200
register: health_check
retries: 5
delay: 10
rescue:
- name: Rollback on failure
block:
- name: Stop failed application
service:
name: '{{ app_name }}'
state: stopped
ignore_errors: yes
- name: Restore from backup
unarchive:
src: '{{ app_backup_dir }}/backup-{{ ansible_date_time.epoch }}.tar.gz'
dest: '{{ app_directory | dirname }}'
remote_src: yes
owner: '{{ app_user }}'
group: '{{ app_user }}'
when: backup_created is succeeded
- name: Start restored application
service:
name: '{{ app_name }}'
state: started
- name: Report rollback completion
debug:
msg: 'Application rolled back due to deployment failure on {{ inventory_hostname }}'
always:
- name: Notify deployment failure
mail:
to: '{{ ops_email }}'
subject: 'Deployment failed on {{ inventory_hostname }}'
body: 'Deployment failed and was rolled back. Check logs for details.'
delegate_to: localhost
when: notify_on_failure | default(true)
always:
- name: Clean old backups (keep last 5)
shell: |
cd {{ app_backup_dir }}
ls -t backup-*.tar.gz | tail -n +6 | xargs -r rm
ignore_errors: yes
Circuit Breaker Pattern
Stop execution if too many hosts fail:
- name: Deploy with failure limits
hosts: webservers
max_fail_percentage: 25 # Stop if more than 25% of hosts fail
serial: 5 # Process 5 hosts at a time
tasks:
- name: Deploy application
include_role:
name: app_deployment
Retry and Recovery Patterns
Implement intelligent retry logic:
tasks:
- name: Download application package with retries
get_url:
url: '{{ app_download_url }}'
dest: '/tmp/{{ app_package_name }}'
timeout: 30
register: download_result
retries: 3
delay: 10
until: download_result is succeeded
- name: Deploy with database connectivity check
block:
- name: Wait for database to be ready
wait_for:
host: '{{ database_host }}'
port: '{{ database_port }}'
timeout: 300
delay: 5
- name: Test database connection
postgresql_ping:
login_host: '{{ database_host }}'
login_user: '{{ database_user }}'
login_password: '{{ database_password }}'
register: db_connection
retries: 5
delay: 15
until: db_connection is succeeded
- name: Run database migrations
command: '{{ app_directory }}/migrate.sh'
become_user: '{{ app_user }}'
rescue:
- name: Handle database connectivity issues
debug:
msg: 'Database not available, deployment postponed'
- name: Schedule retry job
cron:
name: 'retry-deployment-{{ inventory_hostname }}'
minute: '*/30'
job: 'ansible-playbook /opt/ansible/retry-deployment.yml'
user: ansible
Performance Optimization
Execution Strategies
Optimize playbook execution for large infrastructures:
# Parallel execution strategies
- name: Fast deployment across many hosts
hosts: webservers
strategy: free # Don't wait for all hosts to complete each task
gather_facts: no # Skip fact gathering if not needed
tasks:
- name: Quick service restart
service:
name: myapp
state: restarted
# Serial execution for rolling updates
- name: Rolling update with load balancer management
hosts: webservers
serial: 2 # Process 2 hosts at a time
max_fail_percentage: 10
pre_tasks:
- name: Remove from load balancer
uri:
url: '{{ lb_api_endpoint }}/disable/{{ inventory_hostname }}'
method: POST
roles:
- app_update
post_tasks:
- name: Add back to load balancer
uri:
url: '{{ lb_api_endpoint }}/enable/{{ inventory_hostname }}'
method: POST
Fact Caching
Enable fact caching to improve performance:
# ansible.cfg
[defaults]
fact_caching = redis
fact_caching_connection = localhost:6379:0
fact_caching_timeout = 3600
fact_caching_prefix = ansible_facts_
# Or use JSON file caching
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_fact_cache
Pipelining and Connection Optimization
Optimize SSH connections:
# ansible.cfg
[defaults]
host_key_checking = False
pipelining = True
forks = 20
[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=60s -o PreferredAuthentications=publickey
pipelining = True
control_path = /tmp/ansible-ssh-%%h-%%p-%%r
CI/CD Integration
GitLab CI Integration
Create .gitlab-ci.yml
for automated testing and deployment:
# .gitlab-ci.yml
---
stages:
- lint
- test
- deploy-staging
- deploy-production
variables:
ANSIBLE_HOST_KEY_CHECKING: 'False'
ANSIBLE_STDOUT_CALLBACK: 'yaml'
before_script:
- pip install ansible ansible-lint yamllint
- mkdir -p ~/.ssh
- echo "$SSH_PRIVATE_KEY" > ~/.ssh/id_rsa
- chmod 600 ~/.ssh/id_rsa
lint:
stage: lint
script:
- yamllint .
- ansible-lint site.yml
- ansible-playbook --syntax-check site.yml
test:
stage: test
script:
- ansible-playbook site.yml --check --diff -i inventories/test/
deploy-staging:
stage: deploy-staging
script:
- ansible-playbook site.yml -i inventories/staging/ --vault-password-file $VAULT_PASSWORD
environment:
name: staging
url: https://staging.example.com
only:
- develop
deploy-production:
stage: deploy-production
script:
- ansible-playbook site.yml -i inventories/production/ --vault-password-file $VAULT_PASSWORD
environment:
name: production
url: https://example.com
when: manual
only:
- main
GitHub Actions Integration
Create .github/workflows/ansible.yml
:
name: Ansible CI/CD
on:
push:
branches: [main, develop]
pull_request:
branches: [main]
jobs:
lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.9'
- name: Install dependencies
run: |
pip install ansible ansible-lint yamllint
- name: Run YAML Lint
run: yamllint .
- name: Run Ansible Lint
run: ansible-lint site.yml
- name: Syntax check
run: ansible-playbook --syntax-check site.yml
test:
runs-on: ubuntu-latest
needs: lint
steps:
- uses: actions/checkout@v3
- name: Run molecule tests
run: |
pip install molecule[docker]
molecule test
deploy-staging:
runs-on: ubuntu-latest
needs: test
if: github.ref == 'refs/heads/develop'
environment: staging
steps:
- uses: actions/checkout@v3
- name: Deploy to staging
run: |
echo "${{ secrets.SSH_PRIVATE_KEY }}" > private_key
chmod 600 private_key
ansible-playbook site.yml -i inventories/staging/ \
--private-key private_key \
--vault-password-file <(echo "${{ secrets.VAULT_PASSWORD }}")
deploy-production:
runs-on: ubuntu-latest
needs: test
if: github.ref == 'refs/heads/main'
environment: production
steps:
- uses: actions/checkout@v3
- name: Deploy to production
run: |
echo "${{ secrets.SSH_PRIVATE_KEY }}" > private_key
chmod 600 private_key
ansible-playbook site.yml -i inventories/production/ \
--private-key private_key \
--vault-password-file <(echo "${{ secrets.VAULT_PASSWORD }}")
Team Collaboration Patterns
Code Review Process
Establish clear review criteria for Ansible changes:
# .github/pull_request_template.md
## Ansible Playbook Changes
### Checklist
- [ ] All tasks have descriptive names
- [ ] Variables are properly documented
- [ ] Secrets are encrypted with ansible-vault
- [ ] Changes are tested in staging environment
- [ ] Handlers are used appropriately
- [ ] Role dependencies are documented
### Testing
- [ ] `ansible-lint` passes without errors
- [ ] `yamllint` passes without errors
- [ ] Syntax check passes
- [ ] Dry run completes successfully
### Security Review
- [ ] No hardcoded secrets
- [ ] Appropriate privilege escalation
- [ ] Input validation where needed
- [ ] Backup and rollback procedures considered
### Documentation
- [ ] README updated if needed
- [ ] Variable documentation updated
- [ ] Change log entry added
Environment Management
Structure environments for safe testing and deployment:
inventories/
├── development/
│ ├── hosts.yml
│ ├── group_vars/
│ └── host_vars/
├── staging/
│ ├── hosts.yml
│ ├── group_vars/
│ └── host_vars/
├── production/
│ ├── hosts.yml
│ ├── group_vars/
│ └── host_vars/
└── shared/
└── group_vars/
└── all.yml
Use environment-specific variable validation:
# group_vars/production/vars.yml
---
environment: production
ssl_required: true
backup_enabled: true
monitoring_enabled: true
debug_mode: false
# Add validation
- name: Validate production environment
assert:
that:
- ssl_required | bool
- backup_enabled | bool
- not debug_mode | bool
fail_msg: "Production environment validation failed"
when: environment == "production"
Documentation Standards
Maintain comprehensive documentation:
# Role: webserver
## Description
Configures and manages nginx web servers with SSL support and performance optimization.
## Requirements
- Ubuntu 18.04+ or CentOS 7+
- SSL certificates (if SSL enabled)
- Firewall configuration allowing ports 80/443
## Role Variables
### Required Variables
- `server_name`: Primary server name for SSL certificate
- `document_root`: Web document root directory
### Optional Variables
- `ssl_enabled`: Enable SSL configuration (default: false)
- `worker_processes`: Number of nginx worker processes (default: CPU cores)
- `max_client_body_size`: Maximum upload size (default: 1m)
### Example Usage
```yaml
- hosts: webservers
roles:
- role: webserver
server_name: example.com
ssl_enabled: true
worker_processes: 4
```
Testing
Run the test playbook:
ansible-playbook tests/test.yml
Changelog
v2.1.0
- Added HTTP/2 support
- Improved SSL configuration
- Added rate limiting options
v2.0.0
- Breaking: Changed variable naming convention
- Added multi-site support
- Improved performance tuning
Next Steps
Production-ready Ansible automation requires attention to security, testing, error handling, and team collaboration. You've learned patterns that ensure your automation is reliable, secure, and maintainable at scale.
In the final section, we'll explore advanced Ansible features and discuss how to continue your automation journey - from dynamic inventories to custom modules and integration with other tools.
The production patterns you've learned here form the foundation for managing infrastructure automation in enterprise environments, where reliability and security are paramount.
Found an issue?