New Playbooks Design — 2026-02-21
Context
Adding 5 playbooks to fill coverage gaps in the existing 42-playbook homelab automation suite. Infrastructure: 10+ hosts, 200+ containers, Tailscale mesh, mixed platforms (Ubuntu, Debian, Synology DSM, TrueNAS SCALE, Proxmox, Alpine/Home Assistant, Raspberry Pi).
Approved Playbooks
1. network_connectivity.yml
Priority: High (user-requested)
Full mesh connectivity verification across the tailnet.
- Targets: `all` (unreachable hosts handled gracefully with `ignore_unreachable`)
- Checks per host:
  - Tailscale is running and has a valid IP (`tailscale status --json`)
  - Ping all other inventory hosts by Tailscale IP
  - SSH reachability to each peer
  - HTTP/HTTPS endpoint health for key services (Portainer, Gitea, Immich, Home Assistant, etc.) — defined in group_vars or inline vars
- Output: connectivity matrix table + `/tmp/connectivity_reports/connectivity_<timestamp>.json`
- Alert: ntfy notification on any failed node or endpoint
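The per-host checks above could be sketched as tasks like the following. This is a rough sketch only: the `tailscale_ip` host variable and task names are assumptions, not part of the approved design.

```yaml
# Sketch — assumes a tailscale_ip host_var exists for each inventory host.
- name: Get Tailscale status
  ansible.builtin.command: tailscale status --json
  register: ts_status
  changed_when: false
  ignore_errors: true

- name: Ping each peer by Tailscale IP
  ansible.builtin.command: "ping -c 1 -W 2 {{ hostvars[item].tailscale_ip }}"
  loop: "{{ groups['all'] | difference([inventory_hostname]) }}"
  register: peer_ping
  changed_when: false
  ignore_errors: true
```

The `loop` over `groups['all']` minus the current host is what makes the check a full mesh rather than hub-and-spoke.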
2. proxmox_management.yml
Priority: High
Proxmox-specific management targeting pve host.
- Checks:
  - VM/LXC inventory: count, names, state (running/stopped)
  - Resource allocation vs actual usage (RAM, CPU per VM)
  - Storage pool status and utilisation
  - Recent Proxmox task log (last 10 tasks)
- Optional action: `-e action=snapshot -e vm_id=100` to snapshot a specific VM
- Output: JSON report at `/tmp/health_reports/proxmox_<timestamp>.json`
- Pattern: mirrors `synology_health.yml` structure
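The inventory and optional snapshot steps could look roughly like this, using the standard `qm`/`pct` CLIs on the Proxmox host. The snapshot name format is an illustrative assumption.

```yaml
# Sketch — snapshot naming scheme is an assumption, not the final design.
- name: List QEMU VMs
  ansible.builtin.command: qm list
  register: vm_list
  changed_when: false

- name: List LXC containers
  ansible.builtin.command: pct list
  register: lxc_list
  changed_when: false

- name: Snapshot a VM when requested via -e action=snapshot -e vm_id=<id>
  ansible.builtin.command: "qm snapshot {{ vm_id }} ansible_{{ ansible_date_time.epoch }}"
  when: action | default('') == 'snapshot'
```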
3. truenas_health.yml
Priority: High
TrueNAS SCALE-specific health targeting truenas-scale host.
- Checks:
  - ZFS pool status (`zpool status`) — flags DEGRADED/FAULTED
  - Pool scrub: last scrub date, status, any errors
  - Dataset disk usage with warnings at 80%/90%
  - SMART status for physical disks
  - TrueNAS apps (k3s-based): running app count, failed apps
- Output: JSON report at `/tmp/health_reports/truenas_<timestamp>.json`
- Complements existing `synology_health.yml`
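The pool-health check might be sketched like this; `zpool status -x` prints a short "all pools are healthy" summary when nothing is wrong, which makes the DEGRADED/FAULTED flag a simple string match. The `pool_alert` fact name is illustrative.

```yaml
# Sketch — pool_alert is a hypothetical fact name for downstream reporting.
- name: Check ZFS pool health
  ansible.builtin.command: zpool status -x
  register: pool_status
  changed_when: false

- name: Flag degraded or faulted pools
  ansible.builtin.set_fact:
    pool_alert: "{{ 'DEGRADED' in pool_status.stdout or 'FAULTED' in pool_status.stdout }}"
```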
4. ntp_check.yml
Priority: Medium
Time sync health check across all hosts. Check only — no configuration changes.
- Targets: `all`
- Platform-adaptive daemon detection: `chronyd`, `systemd-timesyncd`, `ntpd`, Synology NTP
- Reports: sync source, current offset (ms), stratum, last sync time
- Thresholds: warn >500ms, critical >1000ms
- Alert: ntfy notification for hosts exceeding warn threshold
- Output: summary table + `/tmp/ntp_reports/ntp_<timestamp>.json`
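The platform-adaptive detection could fall through a chain of daemon checks like the sketch below; the fallback order and the `chronyc tracking` parse step are assumptions, and Synology NTP would need its own branch.

```yaml
# Sketch — detection order is an assumption; Synology branch omitted here.
- name: Detect active time-sync daemon
  ansible.builtin.shell: |
    if pidof chronyd >/dev/null 2>&1; then echo chronyd
    elif systemctl is-active --quiet systemd-timesyncd 2>/dev/null; then echo systemd-timesyncd
    elif pidof ntpd >/dev/null 2>&1; then echo ntpd
    else echo unknown
    fi
  register: ntp_daemon
  changed_when: false

- name: Read chrony offset and stratum
  ansible.builtin.command: chronyc tracking
  register: chrony_tracking
  changed_when: false
  when: ntp_daemon.stdout == 'chronyd'
```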
5. cron_audit.yml
Priority: Medium
Scheduled task inventory and basic security audit across all hosts.
- Inventories:
  - `/etc/crontab`, `/etc/cron.d/*`, `/etc/cron.{hourly,daily,weekly,monthly}/`
  - User crontabs (`crontab -l` for each user with a crontab)
  - `systemd` timer units (`systemctl list-timers --all`)
- Security flags:
  - Cron jobs running as root that reference world-writable paths
  - Cron jobs referencing paths that no longer exist
- Output: per-host JSON at `/tmp/cron_audit/<host>_<timestamp>.json` + summary
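The inventory tasks might be sketched like this; the `cron_users` variable is a hypothetical input (the real playbook would likely enumerate users from `/etc/passwd` instead).

```yaml
# Sketch — cron_users is an assumed variable, not part of the approved design.
- name: Collect user crontabs
  ansible.builtin.shell: "crontab -l -u {{ item }} 2>/dev/null || true"
  loop: "{{ cron_users | default(['root']) }}"
  register: user_crontabs
  changed_when: false

- name: List systemd timer units
  ansible.builtin.command: systemctl list-timers --all --no-pager
  register: timer_units
  changed_when: false
  when: ansible_service_mgr == 'systemd'
```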
Patterns to Follow
- Use `changed_when: false` on all read-only shell tasks
- Use `ignore_errors: true` / `ignore_unreachable: true` for non-fatal checks
- Platform detection via `ansible_distribution` and custom `system_type` host_vars
- ntfy URL from `ntfy_url` variable (group_vars with default fallback)
- JSON reports saved to `/tmp/<category>_reports/` with timestamp in filename
- `delegate_to: localhost` + `run_once: true` for report aggregation tasks
Out of Scope
- NTP configuration/enforcement (check only, per user decision)
- Home Assistant backup (deferred)
- Docker compose drift detection (deferred)
- Gitea health (deferred)