
🤖 Ansible Automation Guide

🔴 Advanced Guide

This guide covers the Ansible automation system used to manage all 176 services across 13 hosts in this homelab. Ansible enables Infrastructure as Code, automated deployments, and consistent configuration management.

🎯 Ansible in This Homelab

📊 Current Automation Scope

  • 13 hosts managed through Ansible inventory
  • 176 services deployed via playbooks
  • Automated health checks across all systems
  • Configuration management for consistent settings
  • Deployment automation for new services

🏗️ Architecture Overview

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Git Repository│───►│  Ansible Control│───►│  Target Hosts   │
│  (This repo)    │    │     Node        │    │  (All systems)  │
│                 │    │                 │    │                 │
│ • Playbooks     │    │ • Inventory     │    │ • Docker        │
│ • Inventory     │    │ • Execution     │    │ • Services      │
│ • Variables     │    │ • Logging       │    │ • Configuration │
└─────────────────┘    └─────────────────┘    └─────────────────┘

📁 Repository Structure

🗂️ Ansible Directory Layout

ansible/
├── automation/
│   ├── ansible.cfg              # Ansible configuration
│   ├── hosts                    # Main inventory file
│   ├── hosts.ini               # Alternative inventory format
│   ├── group_vars/             # Group-specific variables
│   │   ├── all.yml
│   │   ├── synology.yml
│   │   └── debian_clients.yml
│   ├── host_vars/              # Host-specific variables
│   │   ├── atlantis.yml
│   │   ├── calypso.yml
│   │   └── homelab.yml
│   ├── playbooks/              # Ansible playbooks
│   │   ├── deploy-service.yml
│   │   ├── health-check.yml
│   │   ├── system-update.yml
│   │   └── backup.yml
│   └── scripts/                # Helper scripts
│       ├── deploy.sh
│       └── health-check.sh
├── deploy_arr_suite_full.yml   # Specific deployment playbooks
├── deploy_arr_suite_updated.yml
└── inventory.ini               # Legacy inventory

🏠 Inventory Management

📋 Host Groups

The inventory organizes hosts into logical groups:

# Core Management Node
[homelab]
homelab ansible_host=100.67.40.126 ansible_user=homelab

# Synology NAS Cluster
[synology]
atlantis ansible_host=100.83.230.112 ansible_port=60000 ansible_user=vish
calypso ansible_host=100.103.48.78 ansible_port=62000 ansible_user=Vish
setillo ansible_host=100.125.0.20 ansible_user=vish

# Raspberry Pi Nodes
[rpi]
pi-5 ansible_host=100.77.151.40 ansible_user=vish
pi-5-kevin ansible_host=100.123.246.75 ansible_user=vish

# Hypervisors / Storage
[hypervisors]
pve ansible_host=100.87.12.28 ansible_user=root
truenas-scale ansible_host=100.75.252.64 ansible_user=vish

# Remote Systems
[remote]
vish-concord-nuc ansible_host=100.72.55.21 ansible_user=vish
vmi2076105 ansible_host=100.99.156.20 ansible_user=root

# Active Group (used by most playbooks)
[active:children]
homelab
synology
rpi
hypervisors
remote
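The system-update playbook later in this guide targets a `debian_clients` group (matching `group_vars/debian_clients.yml`), which is not part of the inventory shown above. A plausible sketch of how it could be defined (the membership here is an assumption, grouping the apt-based hosts):

```ini
# Debian-based hosts (apt-managed) — illustrative membership
[debian_clients:children]
homelab
rpi
```

Defining it as a `:children` group keeps host entries in one place while letting update playbooks target only apt-based systems.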

🔧 Host Variables

Each host has specific configuration:

# host_vars/atlantis.yml
---
# Synology-specific settings
synology_user_id: 1026
synology_group_id: 100
docker_compose_path: /volume1/docker
media_path: /volume1/media

# Service-specific settings
plex_enabled: true
grafana_enabled: true
prometheus_enabled: true

# Network settings
tailscale_ip: 100.83.230.112
local_ip: 10.0.0.250

📖 Playbook Examples

🚀 Service Deployment Playbook

---
- name: Deploy Docker Service
  hosts: "{{ target_host | default('all') }}"
  become: yes
  vars:
    # service_name is supplied via --extra-vars
    service_path: "{{ service_path | default('/opt/docker/' + service_name) }}"
  
  tasks:
    - name: Create service directory
      file:
        path: "{{ service_path }}"
        state: directory
        owner: "{{ ansible_user }}"
        group: "{{ ansible_user }}"
        mode: '0755'

    - name: Copy docker-compose file
      template:
        src: "{{ service_name }}/docker-compose.yml.j2"
        dest: "{{ service_path }}/docker-compose.yml"
        owner: "{{ ansible_user }}"
        group: "{{ ansible_user }}"
        mode: '0644'
      notify: restart service

    - name: Copy environment file
      template:
        src: "{{ service_name }}/.env.j2"
        dest: "{{ service_path }}/.env"
        owner: "{{ ansible_user }}"
        group: "{{ ansible_user }}"
        mode: '0600'
      notify: restart service

    - name: Start service
      community.docker.docker_compose_v2:
        project_src: "{{ service_path }}"
        state: present
        pull: always

    - name: Wait for service to be healthy
      uri:
        url: "http://{{ ansible_host }}:{{ service_port }}/health"
        method: GET
        status_code: 200
      register: health_result
      until: health_result.status == 200
      retries: 30
      delay: 10
      when: service_health_check is defined

  handlers:
    - name: restart service
      community.docker.docker_compose_v2:
        project_src: "{{ service_path }}"
        state: present
        pull: always
        recreate: always
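The template tasks above imply a per-service template directory. A minimal sketch of what a `docker-compose.yml.j2` might contain, using Uptime Kuma as the example (the variable names and defaults are illustrative, not taken from this repository):

```yaml
# templates/uptime-kuma/docker-compose.yml.j2 (hypothetical sketch)
services:
  uptime-kuma:
    image: louislam/uptime-kuma:1
    container_name: uptime-kuma
    ports:
      - "{{ service_port | default(3001) }}:3001"
    volumes:
      - "{{ service_path }}/data:/app/data"
    restart: "{{ default_restart_policy | default('unless-stopped') }}"
```

Because the playbook renders this with `template:`, per-host values from `host_vars/` and `group_vars/` flow into the compose file automatically.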

🔍 Health Check Playbook

---
- name: Health Check All Services
  hosts: active
  gather_facts: yes  # ansible_date_time (used in the report filename) requires facts
  
  tasks:
    - name: Check Docker daemon
      systemd:
        name: docker
        state: started
      register: docker_status

    - name: Get running containers
      docker_host_info:
        containers: yes
      register: docker_info

    - name: Check container health
      docker_container_info:
        name: "{{ item }}"
      register: container_health
      loop: "{{ expected_containers | default([]) }}"
      when: expected_containers is defined

    - name: Test service endpoints
      uri:
        url: "http://{{ ansible_host }}:{{ item.port }}{{ item.path | default('/') }}"
        method: GET
        timeout: 10
      register: endpoint_check
      loop: "{{ service_endpoints | default([]) }}"
      ignore_errors: yes

    - name: Generate health report
      template:
        src: health-report.j2
        dest: "/tmp/health-{{ inventory_hostname }}-{{ ansible_date_time.epoch }}.json"
      delegate_to: localhost
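The report template itself is not shown in this guide; a minimal `health-report.j2` sketch that serializes the results registered by the tasks above (the exact fields are an assumption):

```jinja
{# templates/health-report.j2 — hypothetical sketch #}
{
  "host": "{{ inventory_hostname }}",
  "timestamp": "{{ ansible_date_time.iso8601 | default('') }}",
  "docker_ok": {{ (docker_status is success) | to_json }},
  "running_containers": {{ docker_info.containers | default([]) | length }},
  "endpoint_status": {{ endpoint_check.results | default([]) | map(attribute='status') | list | to_json }}
}
```

Writing one JSON file per host makes it easy to aggregate reports later with a single `jq` pass or a follow-up play.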

🔄 System Update Playbook

---
- name: Update Systems and Services
  hosts: debian_clients
  become: yes
  serial: 1  # Update one host at a time
  
  pre_tasks:
    - name: Check if reboot required
      stat:
        path: /var/run/reboot-required
      register: reboot_required

  tasks:
    - name: Update package cache
      apt:
        update_cache: yes
        cache_valid_time: 3600

    - name: Upgrade packages
      apt:
        upgrade: dist
        autoremove: yes
        autoclean: yes

    - name: Update Docker containers
      shell: |
        docker compose pull
        docker compose up -d
      args:
        chdir: "{{ item }}"
      loop: "{{ docker_compose_paths | default([]) }}"
      when: docker_compose_paths is defined

    - name: Clean up Docker
      docker_prune:
        containers: yes
        images: yes
        networks: yes
        volumes: no  # Don't remove volumes
        builder_cache: yes

  post_tasks:
    - name: Reboot if required
      reboot:
        reboot_timeout: 300
      when: reboot_required.stat.exists

    - name: Wait for services to start
      wait_for:
        port: "{{ item }}"
        timeout: 300
      loop: "{{ critical_ports | default([22, 80, 443]) }}"

🔧 Configuration Management

⚙️ Ansible Configuration

# ansible.cfg
[defaults]
inventory = hosts
host_key_checking = False
timeout = 30
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts_cache
fact_caching_timeout = 86400

[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=60s -o UserKnownHostsFile=/dev/null
pipelining = True

📊 Group Variables

# group_vars/all.yml
---
# Global settings
timezone: America/Los_Angeles
docker_compose_version: "2.0"
default_restart_policy: "on-failure:5"

# Security settings
security_hardening: true
no_new_privileges: true
default_user_mapping: "1000:1000"

# Monitoring settings
prometheus_enabled: true
grafana_enabled: true
uptime_kuma_enabled: true

# Backup settings
backup_enabled: true
backup_retention_days: 30

# group_vars/synology.yml
---
# Synology-specific overrides
default_user_mapping: "1026:100"
docker_compose_path: "/volume1/docker"
media_path: "/volume1/media"
backup_path: "/volume1/backups"

# Synology Docker settings
docker_socket: "/var/run/docker.sock"
docker_data_root: "/volume1/@docker"

🚀 Deployment Workflows

📦 Single Service Deployment

# Deploy a specific service to a specific host
ansible-playbook -i hosts playbooks/deploy-service.yml \
  --extra-vars "target_host=atlantis service_name=uptime-kuma"

# Deploy to multiple hosts
ansible-playbook -i hosts playbooks/deploy-service.yml \
  --extra-vars "target_host=synology service_name=watchtower"

# Deploy with custom variables
ansible-playbook -i hosts playbooks/deploy-service.yml \
  --extra-vars "target_host=homelab service_name=grafana grafana_port=3001"

🏗️ Full Stack Deployment

# Deploy entire Arr suite to Atlantis
ansible-playbook -i hosts deploy_arr_suite_full.yml \
  --limit atlantis

# Deploy monitoring stack to all hosts
ansible-playbook -i hosts playbooks/deploy-monitoring.yml

# Deploy with dry-run first
ansible-playbook -i hosts playbooks/deploy-service.yml \
  --check --diff --extra-vars "service_name=new-service"

🔍 Health Checks and Monitoring

# Run health checks on all active hosts
ansible-playbook -i hosts playbooks/health-check.yml

# Check specific service group
ansible-playbook -i hosts playbooks/health-check.yml \
  --limit synology

# Generate detailed health report
ansible-playbook -i hosts playbooks/health-check.yml \
  --extra-vars "detailed_report=true"

📊 Advanced Automation

🔄 Automated Updates

# Cron job for automated updates
---
- name: Setup Automated Updates
  hosts: all
  become: yes
  
  tasks:
    - name: Create update script
      template:
        src: update-script.sh.j2
        dest: /usr/local/bin/homelab-update
        mode: '0755'

    - name: Schedule weekly updates
      cron:
        name: "Homelab automated update"
        minute: "0"
        hour: "2"
        weekday: "0"  # Sunday
        job: "/usr/local/bin/homelab-update >> /var/log/homelab-update.log 2>&1"
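The `update-script.sh.j2` template referenced above is not shown in this guide; a hedged sketch of what it might render to (the compose-path loop assumes the `docker_compose_paths` variable used elsewhere in this guide):

```shell
#!/usr/bin/env bash
# update-script.sh.j2 — hypothetical sketch of the templated update script
set -euo pipefail

# OS packages
apt-get update -qq
DEBIAN_FRONTEND=noninteractive apt-get -y dist-upgrade

# Refresh each compose project
{% for path in docker_compose_paths | default([]) %}
cd "{{ path }}" && docker compose pull && docker compose up -d
{% endfor %}

# Reclaim space from superseded images
docker image prune -f
```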

📈 Monitoring Integration

# Deploy monitoring agents
---
- name: Deploy Monitoring Stack
  hosts: all
  
  tasks:
    - name: Deploy Node Exporter
      docker_container:
        name: node-exporter
        image: prom/node-exporter:latest
        ports:
          - "9100:9100"
        volumes:
          - /proc:/host/proc:ro
          - /sys:/host/sys:ro
          - /:/rootfs:ro
        command:
          - '--path.procfs=/host/proc'
          - '--path.rootfs=/rootfs'
          - '--path.sysfs=/host/sys'
        restart_policy: on-failure

    - name: Register target with Prometheus (file-based service discovery)
      # Prometheus has no write API for targets; drop a file_sd entry instead
      copy:
        content: '[{"targets": ["{{ ansible_host }}:9100"]}]'
        dest: "/etc/prometheus/file_sd/{{ inventory_hostname }}.json"
      delegate_to: "{{ prometheus_server }}"
      become: yes
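On the Prometheus side, file-based service discovery picks up target files dropped by automation without a restart. A sketch of the relevant `prometheus.yml` fragment (the `/etc/prometheus/file_sd/` path is an assumption):

```yaml
# prometheus.yml fragment — path is an assumption
scrape_configs:
  - job_name: node
    file_sd_configs:
      - files:
          - /etc/prometheus/file_sd/*.json
        refresh_interval: 1m
```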

🔐 Security Automation

# Security hardening playbook
---
- name: Security Hardening
  hosts: all
  become: yes
  
  tasks:
    - name: Update all packages
      package:
        name: "*"
        state: latest

    - name: Configure firewall
      ufw:
        rule: allow
        port: "{{ item }}"
      loop: "{{ allowed_ports | default([22, 80, 443]) }}"

    - name: Disable root SSH
      lineinfile:
        path: /etc/ssh/sshd_config
        regexp: '^PermitRootLogin'
        line: 'PermitRootLogin no'
      notify: restart ssh

    - name: Configure fail2ban
      package:
        name: fail2ban
        state: present

    - name: Harden Docker daemon
      template:
        src: docker-daemon.json.j2
        dest: /etc/docker/daemon.json
      notify: restart docker

  handlers:
    - name: restart ssh
      service:
        name: ssh
        state: restarted

    - name: restart docker
      service:
        name: docker
        state: restarted
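The `docker-daemon.json.j2` template is not shown here; a minimal sketch of sensible hardening defaults it might render (these specific settings are an assumption, not taken from this repository):

```json
{
  "log-driver": "json-file",
  "log-opts": { "max-size": "10m", "max-file": "3" },
  "live-restore": true,
  "no-new-privileges": true
}
```

`live-restore` keeps containers running across daemon restarts, and bounded log options prevent `json-file` logs from filling the disk.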

🔍 Troubleshooting Ansible

Common Issues

SSH Connection Failures

# Test SSH connectivity
ansible all -i hosts -m ping

# Debug SSH issues
ansible all -i hosts -m ping -vvv

# Test with specific user
ansible all -i hosts -m ping -u username

# Check SSH key permissions
chmod 600 ~/.ssh/id_rsa

Permission Issues

# Test sudo access
ansible all -i hosts -m shell -a "sudo whoami" -b

# Fix sudo configuration
ansible all -i hosts -m lineinfile -a "path=/etc/sudoers.d/ansible line='ansible ALL=(ALL) NOPASSWD:ALL' create=yes validate='visudo -cf %s'" -b

# Check user groups
ansible all -i hosts -m shell -a "groups"

Docker Issues

# Check Docker status
ansible all -i hosts -m systemd -a "name=docker state=started" -b

# Test Docker access
ansible all -i hosts -m shell -a "docker ps"

# Add user to docker group
ansible all -i hosts -m user -a "name={{ ansible_user }} groups=docker append=yes" -b

🔧 Debugging Techniques

Verbose Output

# Increase verbosity
ansible-playbook -vvv playbook.yml

# Debug specific tasks
ansible-playbook playbook.yml --start-at-task="Task Name"

# Check mode (dry run)
ansible-playbook playbook.yml --check --diff

Fact Gathering

# Gather all facts
ansible <hostname> -i hosts -m setup

# Gather specific facts
ansible <hostname> -i hosts -m setup -a "filter=ansible_distribution*"

# Custom fact gathering
ansible <hostname> -i hosts -m shell -a "docker --version"

📊 Monitoring Ansible

📈 Execution Tracking

# Callback plugins for monitoring
# ansible.cfg
[defaults]
callback_plugins = /usr/share/ansible/plugins/callback
stdout_callback = json
callbacks_enabled = timer, profile_tasks, log_plays

# Log all playbook runs
log_path = /var/log/ansible.log

📊 Performance Metrics

# Time playbook execution
time ansible-playbook playbook.yml

# Profile task execution (profile_tasks is a callback plugin, not a variable)
ANSIBLE_CALLBACKS_ENABLED=profile_tasks ansible-playbook playbook.yml

# Monitor resource usage
htop  # During playbook execution

🚨 Error Handling

# Robust error handling
---
- name: Deploy with error handling
  hosts: all
  any_errors_fatal: no

  tasks:
    - name: Risky task with cleanup on failure
      block:
        - name: Risky task
          shell: potentially_failing_command
          register: result
          failed_when: result.rc not in [0, 2]  # Allow specific exit codes
      rescue:
        - name: Cleanup on failure
          file:
            path: /tmp/cleanup
            state: absent

🚀 Best Practices

Playbook Design

  • Idempotency: Playbooks should be safe to run multiple times
  • Error handling: Always handle potential failures gracefully
  • Documentation: Comment complex tasks and variables
  • Testing: Test playbooks in development before production
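Idempotency comes mostly for free when using modules, but raw `shell` tasks need help. A sketch contrasting the two patterns (the command and paths are illustrative):

```yaml
# Non-idempotent: runs and reports "changed" on every play
- name: Generate certificate (naive)
  shell: openssl req -x509 -newkey rsa:4096 -nodes -keyout /etc/ssl/key.pem -out /etc/ssl/cert.pem -days 365 -subj "/CN=homelab"

# Idempotent: skipped once the output file exists
- name: Generate certificate
  shell: openssl req -x509 -newkey rsa:4096 -nodes -keyout /etc/ssl/key.pem -out /etc/ssl/cert.pem -days 365 -subj "/CN=homelab"
  args:
    creates: /etc/ssl/cert.pem
```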

🔐 Security

  • Vault encryption: Encrypt sensitive variables with ansible-vault
  • SSH keys: Use SSH keys instead of passwords
  • Least privilege: Run tasks with minimum required permissions
  • Audit logs: Keep logs of all Ansible executions
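A minimal sketch of the vault pattern — keep secrets in an encrypted file and reference them indirectly (the variable and file names here are illustrative):

```yaml
# Create the encrypted file and run plays against it:
#   ansible-vault create group_vars/all/vault.yml
#   ansible-playbook playbook.yml --ask-vault-pass

# group_vars/all/vault.yml (encrypted at rest)
vault_grafana_admin_password: "change-me"

# group_vars/all.yml references the vaulted value
grafana_admin_password: "{{ vault_grafana_admin_password }}"
```

The `vault_` prefix is a convention that makes it easy to grep for every secret reference without decrypting anything.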

📊 Performance

  • Parallelism: Use appropriate fork settings
  • Fact caching: Cache facts to speed up subsequent runs
  • Task optimization: Combine tasks where possible
  • Selective execution: Use tags and limits to run specific parts
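Tags and limits compose well for selective execution; a hedged sketch (the task file and tag names are illustrative):

```yaml
# Tag tasks so subsets can be run on demand
- name: Deploy monitoring agents
  hosts: active
  tasks:
    - name: Deploy Node Exporter
      import_tasks: tasks/node-exporter.yml  # hypothetical task file
      tags: [monitoring, docker]

# Run only the monitoring tasks, only on the Synology group:
#   ansible-playbook site.yml --tags monitoring --limit synology
```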

🔄 Maintenance

  • Regular updates: Keep Ansible and modules updated
  • Inventory cleanup: Remove obsolete hosts and variables
  • Playbook refactoring: Regularly review and improve playbooks
  • Documentation: Keep documentation current with changes

📋 Next Steps

🎯 Learning Path

  1. Start simple: Begin with basic playbooks
  2. Understand inventory: Master host and group management
  3. Learn templating: Use Jinja2 for dynamic configurations
  4. Explore modules: Discover Ansible's extensive module library
  5. Advanced features: Roles, collections, and custom modules

📚 Resources

  • Official docs: docs.ansible.com
  • Ansible Galaxy: galaxy.ansible.com for roles and collections
  • Community: ansible.com/community
  • Training: Red Hat Ansible training courses

Ansible automation is what makes managing 176 services across 13 hosts feasible. Start with simple playbooks and gradually build more sophisticated automation as your confidence grows.