Sanitized mirror from private repository - 2026-04-16 07:12:52 UTC
New Playbooks Design — 2026-02-21

Context

This plan adds five playbooks to fill coverage gaps in the existing 42-playbook homelab automation suite. The infrastructure spans 10+ hosts and 200+ containers on a Tailscale mesh, with mixed platforms (Ubuntu, Debian, Synology DSM, TrueNAS SCALE, Proxmox, Alpine/Home Assistant, Raspberry Pi).

Approved Playbooks

1. network_connectivity.yml

Priority: High (user-requested)

Full mesh connectivity verification across the tailnet.

  • Targets: all (unreachable hosts handled gracefully with ignore_unreachable)
  • Checks per host:
    • Tailscale is running and has a valid IP (tailscale status --json)
    • Ping all other inventory hosts by Tailscale IP
    • SSH reachability to each peer
    • HTTP/HTTPS endpoint health for key services (Portainer, Gitea, Immich, Home Assistant, etc.) — defined in group_vars or inline vars
  • Output: connectivity matrix table + /tmp/connectivity_reports/connectivity_<timestamp>.json
  • Alert: ntfy notification on any failed node or endpoint
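
The matrix output and failure alerting described above can be sketched in Python as follows. This is only an illustration of the aggregation logic, not the playbook itself: the input shape (`results[source][target]` as a boolean) and the function names are assumptions made for this example, and the host names in the usage note are placeholders.

```python
from typing import Dict, List, Tuple

# Hypothetical input shape: results[source][target] is True when the
# source host could reach the target host (ping or SSH check passed).
Results = Dict[str, Dict[str, bool]]


def failed_pairs(results: Results) -> List[Tuple[str, str]]:
    """Return (source, target) pairs whose check failed, sorted for stable output."""
    return sorted(
        (src, dst)
        for src, targets in results.items()
        for dst, ok in targets.items()
        if not ok
    )


def render_matrix(results: Results) -> str:
    """Render a plain-text matrix: rows are sources, columns are targets."""
    hosts = sorted(set(results) | {dst for t in results.values() for dst in t})
    if not hosts:
        return ""
    width = max(max(len(h) for h in hosts), len("FAIL")) + 2
    lines = ["".ljust(width) + "".join(h.ljust(width) for h in hosts)]
    for src in hosts:
        cells = []
        for dst in hosts:
            if src == dst:
                cell = "-"
            elif dst not in results.get(src, {}):
                cell = "?"  # no data, e.g. the source host was unreachable
            else:
                cell = "ok" if results[src][dst] else "FAIL"
            cells.append(cell.ljust(width))
        lines.append(src.ljust(width) + "".join(cells))
    return "\n".join(line.rstrip() for line in lines)
```

For example, `failed_pairs({"host-a": {"host-b": False}})` yields `[("host-a", "host-b")]`, which could drive the ntfy alert, while `render_matrix` produces the summary table. Hosts that never reported (because they were unreachable) appear as `?` rather than as failures, so the two cases stay distinguishable.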

2. proxmox_management.yml

Priority: High

Proxmox-specific management targeting the pve host.

  • Checks:
    • VM/LXC inventory: count, names, state (running/stopped)
    • Resource allocation vs actual usage (RAM, CPU per VM)
    • Storage pool status and utilisation
    • Recent Proxmox task log (last 10 tasks)
  • Optional action: -e action=snapshot -e vm_id=100 to snapshot a specific VM
  • Output: JSON report at /tmp/health_reports/proxmox_<timestamp>.json
  • Pattern: mirrors synology_health.yml structure
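
The inventory and allocation-versus-usage checks could be summarized along these lines. This is a minimal Python sketch, assuming the guest list resembles the JSON returned by `pvesh get /cluster/resources --type vm --output-format json`, with `vmid`, `name`, `status`, `maxmem`, and `mem` fields. The function name and the report layout are hypothetical, not taken from this plan.

```python
from collections import Counter
from typing import Any, Dict, List


def summarize_guests(guests: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Summarize guest state counts and memory use relative to allocation."""
    states = Counter(g.get("status", "unknown") for g in guests)
    per_guest = []
    for g in guests:
        maxmem = g.get("maxmem") or 0  # allocated RAM in bytes (assumed field)
        mem = g.get("mem") or 0        # RAM currently in use, in bytes (assumed field)
        # Avoid dividing by zero when no allocation is reported.
        pct = round(100.0 * mem / maxmem, 1) if maxmem else None
        per_guest.append({
            "vmid": g.get("vmid"),
            "name": g.get("name"),
            "status": g.get("status", "unknown"),
            "mem_used_pct": pct,
        })
    return {"total": len(guests), "by_state": dict(states), "guests": per_guest}
```

The returned dictionary can be serialized directly into the JSON report. CPU usage could be added the same way if the source data exposes comparable allocated and used values.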

3. truenas_health.yml

Priority: High

TrueNAS SCALE-specific health check targeting the truenas-scale host.

  • Checks:
    • ZFS pool status (zpool status) — flags DEGRADED/FAULTED
    • Pool scrub: last scrub date, status, any errors
    • Dataset disk usage with warnings at 80%/90%
    • SMART status for physical disks
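    • TrueNAS apps: running app count, failed apps (apps are k3s-based on SCALE releases before 24.10 and Docker-based from 24.10 onward, so detection must match the installed release)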
  • Output: JSON report at /tmp/health_reports/truenas_<timestamp>.json
  • Complements existing synology_health.yml
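
The pool-state flagging and the 80%/90% usage warnings might be implemented roughly as below. This Python sketch assumes the scriptable form `zpool list -H -o name,health,capacity` as input rather than parsing the human-oriented `zpool status` text. That input choice, the function names, and the decision to treat the thresholds as inclusive (`>=`) are all assumptions for illustration.

```python
from typing import Dict, List, Optional

FLAGGED_STATES = {"DEGRADED", "FAULTED"}  # states the plan says to flag


def classify_usage(used_pct: float, warn: float = 80.0, crit: float = 90.0) -> str:
    """Map a usage percentage to ok / warning / critical (inclusive thresholds)."""
    if used_pct >= crit:
        return "critical"
    if used_pct >= warn:
        return "warning"
    return "ok"


def parse_zpool_list(output: str) -> List[Dict[str, object]]:
    """Parse `zpool list -H -o name,health,capacity` output, one pool per line."""
    pools = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        name, health, capacity = fields
        # Capacity looks like "45%"; an unavailable pool may report "-".
        used: Optional[float] = (
            float(capacity.rstrip("%")) if capacity.endswith("%") else None
        )
        pools.append({
            "name": name,
            "health": health,
            "flagged": health in FLAGGED_STATES,
            "used_pct": used,
            "usage": classify_usage(used) if used is not None else "unknown",
        })
    return pools
```

The same `classify_usage` helper could be reused for per-dataset figures, so pools and datasets share one threshold definition.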

4. ntp_check.yml

Priority: Medium

Time sync health check across all hosts. Check only — no configuration changes.

  • Targets: all
  • Platform-adaptive daemon detection: chronyd, systemd-timesyncd, ntpd, Synology NTP
  • Reports: sync source, current offset (ms), stratum, last sync time
  • Thresholds: warn >500ms, critical >1000ms
  • Alert: ntfy notification for hosts exceeding the warn threshold
  • Output: summary table + /tmp/ntp_reports/ntp_<timestamp>.json
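
The threshold logic above is small enough to state precisely. The Python sketch below assumes offsets are already normalized to milliseconds per host (each daemon reports them differently) and that a negative offset is as serious as a positive one. The function names and the `unknown` state for hosts with no reading are hypothetical.

```python
from typing import Dict, List, Optional


def classify_offset(
    offset_ms: Optional[float], warn_ms: float = 500.0, crit_ms: float = 1000.0
) -> str:
    """Classify a clock offset using the plan's thresholds (strictly greater than)."""
    if offset_ms is None:
        return "unknown"  # no daemon detected or offset not parseable
    magnitude = abs(offset_ms)  # clocks running behind count as much as ahead
    if magnitude > crit_ms:
        return "critical"
    if magnitude > warn_ms:
        return "warn"
    return "ok"


def hosts_to_alert(offsets: Dict[str, Optional[float]]) -> List[str]:
    """Return hosts whose offset exceeds the warn threshold, sorted by name."""
    return sorted(
        host
        for host, offset in offsets.items()
        if classify_offset(offset) in ("warn", "critical")
    )
```

An offset of exactly 500 ms is treated as `ok` here because the plan writes the thresholds as `>500ms` and `>1000ms`. Hosts with no reading are reported as `unknown` but do not trigger an alert in this sketch.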

5. cron_audit.yml

Priority: Medium

Scheduled task inventory and basic security audit across all hosts.

  • Inventories:
    • /etc/crontab, /etc/cron.d/*, /etc/cron.{hourly,daily,weekly,monthly}/
    • User crontabs (crontab -l -u <user> for each user with a crontab)
    • systemd timer units (systemctl list-timers --all)
  • Security flags:
    • Cron jobs running as root that reference world-writable paths
    • Cron jobs referencing paths that no longer exist
  • Output: per-host JSON at /tmp/cron_audit/<host>_<timestamp>.json + summary
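
The two security flags can be sketched in Python as follows, for entries in the system crontab format (five time fields or an `@` shortcut, then a user, then the command). The path extraction is a deliberately simple heuristic, and the function names and the decision to ignore `/dev/` paths (since `/dev/null` is world-writable by design) are assumptions for illustration.

```python
import os
import re
import stat
from typing import Dict, List, Optional

# Split a command on whitespace and common shell punctuation.
_TOKEN_SPLIT = re.compile(r"[\s;|&<>\"'()]+")


def parse_system_crontab_line(line: str) -> Optional[Dict[str, str]]:
    """Parse one /etc/crontab-style line into user and command, or None."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    if stripped.startswith("@"):  # e.g. "@reboot root /path/to/job"
        parts = stripped.split(None, 2)
        return {"user": parts[1], "command": parts[2]} if len(parts) == 3 else None
    parts = stripped.split(None, 6)
    if len(parts) < 7 or "=" in parts[0]:  # too short, or an env line like PATH=...
        return None
    return {"user": parts[5], "command": parts[6]}


def referenced_paths(command: str) -> List[str]:
    """Return absolute paths mentioned in a command, skipping /dev/ entries."""
    tokens = _TOKEN_SPLIT.split(command)
    return [t for t in tokens if t.startswith("/") and not t.startswith("/dev/")]


def audit_command(user: str, command: str) -> List[Dict[str, str]]:
    """Flag missing paths, and world-writable paths used by root jobs."""
    flags = []
    for path in referenced_paths(command):
        if not os.path.exists(path):
            flags.append({"path": path, "issue": "missing"})
        elif user == "root" and os.stat(path).st_mode & stat.S_IWOTH:
            flags.append({"path": path, "issue": "world-writable"})
    return flags
```

This would run on the target host (or against facts gathered from it), since both checks depend on the local filesystem. Commands that build paths from variables or globs would escape this heuristic, which is consistent with the plan describing the audit as basic.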

Patterns to Follow

  • Use changed_when: false on all read-only shell tasks
  • Use ignore_errors: true / ignore_unreachable: true for non-fatal checks
  • Platform detection via ansible_distribution and custom system_type host_vars
  • ntfy URL from ntfy_url variable (group_vars with default fallback)
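  • JSON reports saved to /tmp/<category>_reports/ with timestamp in filename (note: cron_audit.yml as specified above writes to /tmp/cron_audit/, which does not follow this pattern)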
  • delegate_to: localhost + run_once: true for report aggregation tasks
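
The report naming convention above could be captured in a small helper like the one below. This is a Python sketch only: the plan does not specify the timestamp format, so the compact UTC form used here is an assumption, as are the function names and the `base_dir` parameter.

```python
import json
import os
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def report_path(
    category: str, name: str, base_dir: str = "/tmp", now: Optional[datetime] = None
) -> str:
    """Build <base_dir>/<category>_reports/<name>_<timestamp>.json."""
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    stamp = moment.strftime("%Y%m%dT%H%M%SZ")  # assumed format, sortable by name
    return os.path.join(base_dir, f"{category}_reports", f"{name}_{stamp}.json")


def write_report(path: str, payload: Dict[str, Any]) -> None:
    """Create the report directory if needed and write the payload as JSON."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh, indent=2, sort_keys=True)
```

A sortable timestamp keeps the newest report last in a plain directory listing, which helps when several runs accumulate under /tmp.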

Out of Scope

  • NTP configuration/enforcement (check only, per user decision)
  • Home Assistant backup (deferred)
  • Docker compose drift detection (deferred)
  • Gitea health (deferred)