# New Playbooks Design — 2026-02-21

## Context

Adding 5 playbooks to fill coverage gaps in the existing 42-playbook homelab automation suite.
Infrastructure: 10+ hosts, 200+ containers, Tailscale mesh, mixed platforms (Ubuntu, Debian,
Synology DSM, TrueNAS SCALE, Proxmox, Alpine/Home Assistant, Raspberry Pi).

## Approved Playbooks

### 1. `network_connectivity.yml`

**Priority: High (user-requested)**

Full mesh connectivity verification across the tailnet.

- Targets: `all` (unreachable hosts handled gracefully with `ignore_unreachable`)
- Checks per host:
  - Tailscale is running and has a valid IP (`tailscale status --json`)
  - Ping all other inventory hosts by Tailscale IP
  - SSH reachability to each peer
  - HTTP/HTTPS endpoint health for key services (Portainer, Gitea, Immich, Home Assistant, etc.) — defined in `group_vars` or inline vars
- Output: connectivity matrix table + `/tmp/connectivity_reports/connectivity_<timestamp>.json`
- Alert: ntfy notification on any failed node or endpoint
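
The peer-ping step might be sketched like this; `tailscale_ip` is a hypothetical host fact that an earlier task would derive from `tailscale status --json`:

```yaml
# Ping every other inventory host over its Tailscale IP.
# `tailscale_ip` is an assumed host fact, not part of the design above.
- name: Ping all peers via Tailscale
  ansible.builtin.command: "ping -c 2 -W 2 {{ hostvars[item].tailscale_ip }}"
  loop: "{{ groups['all'] | difference([inventory_hostname]) }}"
  register: peer_ping
  changed_when: false
  ignore_errors: true
  ignore_unreachable: true
```

Registering the looped result gives one entry per peer, which the aggregation task can fold into the connectivity matrix.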

### 2. `proxmox_management.yml`

**Priority: High**

Proxmox-specific management targeting the `pve` host.

- Checks:
  - VM/LXC inventory: count, names, state (running/stopped)
  - Resource allocation vs actual usage (RAM, CPU per VM)
  - Storage pool status and utilisation
  - Recent Proxmox task log (last 10 tasks)
- Optional action: `-e action=snapshot -e vm_id=100` to snapshot a specific VM
- Output: JSON report at `/tmp/health_reports/proxmox_<timestamp>.json`
- Pattern: mirrors `synology_health.yml` structure
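
The optional snapshot action could be gated on the extra vars like so (a sketch; the snapshot naming scheme is an assumption):

```yaml
# Runs only when invoked with -e action=snapshot -e vm_id=<id>.
# Snapshot name format is an assumption, not part of the design above.
- name: Snapshot VM {{ vm_id | default('') }}
  ansible.builtin.command: >
    qm snapshot {{ vm_id }} ansible_{{ ansible_date_time.iso8601_basic_short }}
  when:
    - action | default('') == 'snapshot'
    - vm_id is defined
```

Without the extra vars the task is skipped, so the playbook stays read-only by default.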

### 3. `truenas_health.yml`

**Priority: High**

TrueNAS SCALE-specific health checks targeting the `truenas-scale` host.

- Checks:
  - ZFS pool status (`zpool status`) — flags DEGRADED/FAULTED
  - Pool scrub: last scrub date, status, any errors
  - Dataset disk usage with warnings at 80%/90%
  - SMART status for physical disks
  - TrueNAS apps (k3s-based): running app count, failed apps
- Output: JSON report at `/tmp/health_reports/truenas_<timestamp>.json`
- Complements existing `synology_health.yml`
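
The pool-status check might look like this; using `grep`'s exit code on `zpool status -x` output is one possible approach, not the final implementation:

```yaml
# grep exits 0 when a DEGRADED/FAULTED line matches, 1 when the
# pools are healthy, so rc == 0 means trouble here.
- name: Check ZFS pool health
  ansible.builtin.shell: zpool status -x | grep -E 'DEGRADED|FAULTED'
  register: zpool_problems
  changed_when: false
  failed_when: false

- name: Record degraded-pool flag
  ansible.builtin.set_fact:
    zfs_degraded: "{{ zpool_problems.rc == 0 }}"
```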

### 4. `ntp_check.yml`

**Priority: Medium**

Time-sync health check across all hosts. Check only — no configuration changes.

- Targets: `all`
- Platform-adaptive daemon detection: `chronyd`, `systemd-timesyncd`, `ntpd`, Synology NTP
- Reports: sync source, current offset (ms), stratum, last sync time
- Thresholds: warn > 500 ms, critical > 1000 ms
- Alert: ntfy notification for hosts exceeding the warn threshold
- Output: summary table + `/tmp/ntp_reports/ntp_<timestamp>.json`
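
For hosts where chrony is detected, the offset check could be sketched as follows; the `System time` field position is an assumption about `chronyc tracking` output:

```yaml
# Extract the current offset in ms from chronyd, then classify it
# against the warn/critical thresholds above (500 ms / 1000 ms).
- name: Get chrony offset in milliseconds
  ansible.builtin.shell: >
    chronyc tracking | awk '/System time/ {print $4 * 1000}'
  register: ntp_offset_ms
  changed_when: false
  failed_when: false

- name: Classify time-sync state
  ansible.builtin.set_fact:
    ntp_state: >-
      {{ 'critical' if ntp_offset_ms.stdout | float > 1000
         else 'warn' if ntp_offset_ms.stdout | float > 500
         else 'ok' }}
```

Equivalent branches would cover `systemd-timesyncd` (`timedatectl timesync-status`) and `ntpd` (`ntpq -p`).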

### 5. `cron_audit.yml`

**Priority: Medium**

Scheduled-task inventory and basic security audit across all hosts.

- Inventories:
  - `/etc/crontab`, `/etc/cron.d/*`, `/etc/cron.{hourly,daily,weekly,monthly}/`
  - User crontabs (`crontab -l` for each user with a crontab)
  - `systemd` timer units (`systemctl list-timers --all`)
- Security flags:
  - Cron jobs running as root that reference world-writable paths
  - Cron jobs referencing paths that no longer exist
- Output: per-host JSON at `/tmp/cron_audit/<host>_<timestamp>.json` + summary
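
The timer and crontab inventory steps might look like this; `cron_users` is a hypothetical variable listing the users to audit:

```yaml
# Both tasks are read-only; failures are tolerated because not every
# platform (Alpine, Synology DSM) has systemd or per-user crontabs.
- name: List systemd timer units
  ansible.builtin.command: systemctl list-timers --all --no-pager
  register: timer_units
  changed_when: false
  failed_when: false

- name: Read user crontabs
  ansible.builtin.command: "crontab -l -u {{ item }}"
  loop: "{{ cron_users | default(['root']) }}"
  register: user_crontabs
  changed_when: false
  failed_when: false   # rc 1 when a user has no crontab
```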

## Patterns to Follow

- Use `changed_when: false` on all read-only shell tasks
- Use `ignore_errors: true` / `ignore_unreachable: true` for non-fatal checks
- Platform detection via `ansible_distribution` and custom `system_type` host_vars
- ntfy URL from the `ntfy_url` variable (group_vars with a default fallback)
- JSON reports saved to `/tmp/<category>_reports/` with a timestamp in the filename
- `delegate_to: localhost` + `run_once: true` for report aggregation tasks
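
The aggregation pattern in the last bullet might be applied like this; `results` is a hypothetical fact built up by the earlier check tasks:

```yaml
# Write one JSON summary on the controller, once per run, following
# the timestamped-report convention above.
- name: Write summary report
  ansible.builtin.copy:
    content: "{{ results | default({}) | to_nice_json }}"
    dest: "/tmp/connectivity_reports/connectivity_{{ ansible_date_time.epoch }}.json"
  delegate_to: localhost
  run_once: true
```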

## Out of Scope

- NTP configuration/enforcement (check only, per user decision)
- Home Assistant backup (deferred)
- Docker compose drift detection (deferred)
- Gitea health (deferred)