homelab-optimized/ansible/automation/docs/plans/2026-02-21-new-playbooks-design.md

# New Playbooks Design — 2026-02-21

## Context

This design adds five playbooks to fill coverage gaps in the existing 42-playbook homelab automation suite.
Infrastructure: 10+ hosts, 200+ containers, Tailscale mesh, mixed platforms (Ubuntu, Debian,
Synology DSM, TrueNAS SCALE, Proxmox, Alpine/Home Assistant, Raspberry Pi).

## Approved Playbooks

### 1. `network_connectivity.yml`
**Priority: High (user-requested)**

Full mesh connectivity verification across the tailnet.

- Targets: `all` (unreachable hosts handled gracefully with `ignore_unreachable: true`)
- Checks per host:
  - Tailscale is running and has a valid IP (`tailscale status --json`)
  - Ping all other inventory hosts by Tailscale IP
  - SSH reachability to each peer
  - HTTP/HTTPS endpoint health for key services (Portainer, Gitea, Immich, Home Assistant, etc.), with the endpoint list defined in `group_vars` or inline vars
- Output: connectivity matrix table + `/tmp/connectivity_reports/connectivity_<timestamp>.json`
- Alert: ntfy notification on any failed node or endpoint
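
The Tailscale and endpoint checks above could be sketched roughly as follows. This is a minimal illustration, not the final playbook: the `service_endpoints` variable, the example URL, and the registered variable names are assumptions made for the sketch, and the peer ping/SSH checks are left out.

```yaml
# Sketch only: service_endpoints, ts_status and tailscale_ips are illustrative names.
- name: Verify tailnet connectivity (sketch)
  hosts: all
  gather_facts: false
  ignore_unreachable: true
  vars:
    # Hypothetical endpoint list; the design places this in group_vars or inline vars.
    service_endpoints:
      - name: portainer
        url: "https://portainer.example.internal"
  tasks:
    - name: Read Tailscale state as JSON
      ansible.builtin.command: tailscale status --json
      register: ts_status
      changed_when: false
      failed_when: false

    - name: Extract this host's Tailscale IPs
      ansible.builtin.set_fact:
        tailscale_ips: "{{ (ts_status.stdout | from_json).Self.TailscaleIPs | default([]) }}"
      # rc is missing when the host was unreachable, so default to a non-zero value.
      when: (ts_status.rc | default(1)) == 0

    - name: Probe key service endpoints from the control node
      ansible.builtin.uri:
        url: "{{ item.url }}"
        method: GET
        timeout: 10
      loop: "{{ service_endpoints }}"
      loop_control:
        label: "{{ item.name }}"
      register: endpoint_checks
      delegate_to: localhost
      run_once: true
      ignore_errors: true
```

Probing endpoints once from the control node keeps the HTTP checks independent of per-host reachability; whether they should instead run from every host is a design choice this sketch does not settle.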

### 2. `proxmox_management.yml`
**Priority: High**

Proxmox-specific management targeting the `pve` host.

- Checks:
  - VM/LXC inventory: count, names, state (running/stopped)
  - Resource allocation vs actual usage (RAM, CPU per VM)
  - Storage pool status and utilisation
  - Recent Proxmox task log (last 10 tasks)
- Optional action: `-e action=snapshot -e vm_id=100` to snapshot a specific VM
- Output: JSON report at `/tmp/health_reports/proxmox_<timestamp>.json`
- Pattern: mirrors `synology_health.yml` structure
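
A minimal sketch of the inventory check and the optional snapshot action, assuming the connection user on `pve` may run `pvesh` and `qm`. The design names the switch `action`; because `action` is also a task keyword in Ansible and may trigger a reserved-name warning, the sketch uses a hypothetical `pve_action` variable instead. The snapshot name format is likewise an assumption.

```yaml
# Sketch only: pve_action and the snapshot naming scheme are illustrative.
- name: Proxmox inventory and optional snapshot (sketch)
  hosts: pve
  gather_facts: false
  tasks:
    - name: List VMs and containers with state and resource usage
      ansible.builtin.command: pvesh get /cluster/resources --type vm --output-format json
      register: pve_guests
      changed_when: false

    - name: Summarise guest count and running guests
      ansible.builtin.set_fact:
        pve_summary:
          total: "{{ pve_guests.stdout | from_json | length }}"
          running: >-
            {{ pve_guests.stdout | from_json
               | selectattr('status', 'equalto', 'running') | list | length }}

    # qm handles QEMU VMs only; an LXC container would need pct instead.
    - name: Snapshot one VM when explicitly requested
      ansible.builtin.command: "qm snapshot {{ vm_id }} ansible-{{ now(fmt='%Y%m%d%H%M%S') }}"
      when:
        - pve_action | default('') == 'snapshot'
        - vm_id is defined
```

Under these assumptions the snapshot would be requested with `-e pve_action=snapshot -e vm_id=100`; without those extra vars the play stays read-only.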

### 3. `truenas_health.yml`
**Priority: High**

TrueNAS SCALE-specific health targeting the `truenas-scale` host.

- Checks:
  - ZFS pool status (`zpool status`) — flags DEGRADED/FAULTED
  - Pool scrub: last scrub date, status, any errors
  - Dataset disk usage with warnings at 80%/90%
  - SMART status for physical disks
  - TrueNAS apps (k3s-based on releases before 24.10, Docker-based from 24.10 onward): running app count, failed apps
- Output: JSON report at `/tmp/health_reports/truenas_<timestamp>.json`
- Complements existing `synology_health.yml`
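
The pool health check could look roughly like this. It is a sketch under assumptions: it applies the 80% warning threshold to pool capacity from `zpool list` (the design applies thresholds to datasets), and `pool_alerts` is an illustrative variable name. Scrub, SMART, and app checks are omitted.

```yaml
# Sketch only: pool_alerts is an illustrative name; thresholds are applied to pools here.
- name: TrueNAS ZFS pool health (sketch)
  hosts: truenas-scale
  gather_facts: false
  tasks:
    # -H prints one whitespace-separated line per pool without headers.
    - name: Collect pool name, health and capacity
      ansible.builtin.command: zpool list -H -o name,health,capacity
      register: zpool_list
      changed_when: false

    - name: Flag pools that are unhealthy or at least 80% full
      ansible.builtin.set_fact:
        pool_alerts: "{{ pool_alerts | default([]) + [item] }}"
      loop: "{{ zpool_list.stdout_lines }}"
      when: >-
        item.split()[1] in ['DEGRADED', 'FAULTED']
        or (item.split()[2] | replace('%', '') | int) >= 80
```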

### 4. `ntp_check.yml`
**Priority: Medium**

Time sync health check across all hosts. Check only — no configuration changes.

- Targets: `all`
- Platform-adaptive daemon detection: `chronyd`, `systemd-timesyncd`, `ntpd`, Synology NTP
- Reports: sync source, current offset (ms), stratum, last sync time
- Thresholds: warn above 500 ms offset, critical above 1000 ms
- Alert: ntfy notification for hosts exceeding the warn threshold
- Output: summary table + `/tmp/ntp_reports/ntp_<timestamp>.json`
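
For hosts running chrony, the offset check and threshold classification might be sketched as below. The variable names (`ntp_warn_ms`, `ntp_crit_ms`, `ntp_offset_ms`, `ntp_state`) are assumptions, and the other daemons listed above would each need their own query command.

```yaml
# Sketch only: covers chrony hosts; variable names are illustrative.
- name: Time sync check for chrony hosts (sketch)
  hosts: all
  gather_facts: false
  ignore_unreachable: true
  vars:
    ntp_warn_ms: 500
    ntp_crit_ms: 1000
  tasks:
    - name: Query chrony tracking data
      ansible.builtin.command: chronyc tracking
      register: chrony_tracking
      changed_when: false
      failed_when: false

    # Parses the line "System time : <seconds> seconds fast|slow of NTP time".
    - name: Derive the absolute offset in milliseconds
      ansible.builtin.set_fact:
        ntp_offset_ms: >-
          {{ ((chrony_tracking.stdout_lines
               | select('match', '^System time')
               | first).split(':')[1].split()[0] | float) * 1000 }}
      when: (chrony_tracking.rc | default(1)) == 0

    - name: Classify the offset against the thresholds
      ansible.builtin.set_fact:
        ntp_state: >-
          {{ 'critical' if (ntp_offset_ms | float) > ntp_crit_ms
             else 'warn' if (ntp_offset_ms | float) > ntp_warn_ms
             else 'ok' }}
      when: ntp_offset_ms is defined
```

Hosts without `chronyc` simply skip the last two tasks, which leaves room for a per-daemon branch keyed on platform detection.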

### 5. `cron_audit.yml`
**Priority: Medium**

Scheduled task inventory and basic security audit across all hosts.

- Inventories:
  - `/etc/crontab`, `/etc/cron.d/*`, `/etc/cron.{hourly,daily,weekly,monthly}/`
  - User crontabs (`crontab -l` for each user with a crontab)
  - `systemd` timer units (`systemctl list-timers --all`)
- Security flags:
  - Cron jobs running as root that reference world-writable paths
  - Cron jobs referencing paths that no longer exist
- Output: per-host JSON at `/tmp/cron_audit/<host>_<timestamp>.json` + summary
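
The collection side of the audit could start from tasks like these. This is only a sketch: the registered variable names are assumptions, and the `find` task flags world-writable files inside the cron directories, which is a simpler stand-in for resolving the paths each root job references.

```yaml
# Sketch only: registered variable names are illustrative.
- name: Cron and timer inventory (sketch)
  hosts: all
  gather_facts: false
  ignore_unreachable: true
  tasks:
    - name: Read the system crontab
      ansible.builtin.command: cat /etc/crontab
      register: system_crontab
      changed_when: false
      failed_when: false

    - name: List systemd timers where systemd is present
      ansible.builtin.command: systemctl list-timers --all --no-pager
      register: systemd_timers
      changed_when: false
      failed_when: false

    # Simplified security flag: world-writable files in the cron directories themselves.
    - name: Find world-writable files under the cron directories
      ansible.builtin.command: >-
        find /etc/cron.d /etc/cron.hourly /etc/cron.daily
        /etc/cron.weekly /etc/cron.monthly -type f -perm -0002
      register: world_writable_cron
      changed_when: false
      failed_when: false
```

`failed_when: false` keeps platforms without a given file or without systemd from aborting the play; the non-zero `rc` remains available for the report.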

## Patterns to Follow

- Use `changed_when: false` on all read-only shell tasks
- Use `ignore_errors: true` / `ignore_unreachable: true` for non-fatal checks
- Platform detection via `ansible_distribution` and custom `system_type` host_vars
- ntfy URL from `ntfy_url` variable (group_vars with default fallback)
- JSON reports saved to `/tmp/<category>_reports/` with a timestamp in the filename (exception: `cron_audit.yml` writes to `/tmp/cron_audit/`)
- `delegate_to: localhost` + `run_once: true` for report aggregation tasks
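
Taken together, these conventions suggest a play skeleton along the following lines. It is a sketch, not a template the suite is known to use: the `example` category, the `uptime` check, the fallback ntfy URL, and the variable names are all placeholders.

```yaml
# Sketch only: report_dir, ntfy_endpoint and the uptime check are placeholders.
- name: Example check following the shared conventions (sketch)
  hosts: all
  gather_facts: false
  ignore_unreachable: true
  vars:
    report_dir: /tmp/example_reports
    # Placeholder fallback; the real default would come from group_vars.
    ntfy_endpoint: "{{ ntfy_url | default('https://ntfy.example.internal/homelab') }}"
  tasks:
    - name: Run a read-only check
      ansible.builtin.command: uptime
      register: check_result
      changed_when: false
      ignore_errors: true

    - name: Alert through ntfy when this host's check did not succeed
      ansible.builtin.uri:
        url: "{{ ntfy_endpoint }}"
        method: POST
        body: "Check failed on {{ inventory_hostname }}"
      delegate_to: localhost
      when: (check_result.rc | default(1)) != 0

    - name: Ensure the report directory exists on the control node
      ansible.builtin.file:
        path: "{{ report_dir }}"
        state: directory
        mode: "0755"
      delegate_to: localhost
      run_once: true

    # Aggregates every host's registered result into one host-keyed JSON document.
    - name: Write the aggregated JSON report
      ansible.builtin.copy:
        dest: "{{ report_dir }}/example_{{ now(utc=true, fmt='%Y%m%dT%H%M%S') }}.json"
        content: >-
          {{ dict(ansible_play_hosts
                  | zip(ansible_play_hosts | map('extract', hostvars, 'check_result')))
             | to_nice_json }}
        mode: "0644"
      delegate_to: localhost
      run_once: true
```

Using `now()` for the timestamp avoids depending on gathered facts, which may be missing for hosts that were unreachable.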

## Out of Scope

- NTP configuration/enforcement (check only, per user decision)
- Home Assistant backup (deferred)
- Docker compose drift detection (deferred)
- Gitea health (deferred)