homelab-optimized/ansible/automation/docs/plans/2026-02-21-new-playbooks-design.md
Sanitized mirror from private repository - 2026-04-19 15:28:05 UTC
# New Playbooks Design — 2026-02-21
## Context
This plan adds five playbooks to fill coverage gaps in the existing 42-playbook homelab automation suite.
Infrastructure: 10+ hosts, 200+ containers, Tailscale mesh, mixed platforms (Ubuntu, Debian,
Synology DSM, TrueNAS SCALE, Proxmox, Alpine/Home Assistant, Raspberry Pi).
## Approved Playbooks
### 1. `network_connectivity.yml`
**Priority: High (user-requested)**
Full mesh connectivity verification across the tailnet.
- Targets: `all` (unreachable hosts handled gracefully with `ignore_unreachable`)
- Checks per host:
  - Tailscale is running and has a valid IP (`tailscale status --json`)
  - Ping all other inventory hosts by Tailscale IP
  - SSH reachability to each peer
  - HTTP/HTTPS endpoint health for key services (Portainer, Gitea, Immich, Home Assistant, etc.), defined in group_vars or inline vars
- Output: connectivity matrix table + `/tmp/connectivity_reports/connectivity_<timestamp>.json`
- Alert: ntfy notification on any failed node or endpoint
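
The Tailscale and endpoint checks above could be sketched roughly as follows. This is a minimal illustration, not the final playbook: the `key_endpoints` variable, its example URL, and the accepted status codes are assumptions, and the peer ping and SSH checks are left out for brevity.

```yaml
---
# Sketch only. `key_endpoints` and its URL are hypothetical placeholders
# that would normally come from group_vars.
- name: Verify tailnet connectivity (sketch)
  hosts: all
  gather_facts: false
  ignore_unreachable: true
  vars:
    key_endpoints:
      - name: example-service
        url: "https://service.example.internal/"
  tasks:
    - name: Read Tailscale state as JSON
      ansible.builtin.command: tailscale status --json
      register: ts_status
      changed_when: false
      failed_when: false

    # `Self.TailscaleIPs` is the list of addresses assigned to this node.
    - name: Extract this host's Tailscale IPs
      ansible.builtin.set_fact:
        ts_ips: "{{ (ts_status.stdout | from_json).Self.TailscaleIPs | default([]) }}"
      when: ts_status.rc | default(1) == 0

    - name: Probe key service endpoints from the control node
      ansible.builtin.uri:
        url: "{{ item.url }}"
        status_code: [200, 301, 302]
        timeout: 10
      loop: "{{ key_endpoints }}"
      register: endpoint_results
      delegate_to: localhost
      run_once: true
      failed_when: false
```

A host with no `ts_ips` fact after this play is either unreachable or not running Tailscale, which is the condition the connectivity matrix would need to flag.
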
### 2. `proxmox_management.yml`
**Priority: High**
Proxmox-specific management targeting the `pve` host.
- Checks:
  - VM/LXC inventory: count, names, state (running/stopped)
  - Resource allocation vs actual usage (RAM, CPU per VM)
  - Storage pool status and utilisation
  - Recent Proxmox task log (last 10 tasks)
- Optional action: `-e action=snapshot -e vm_id=100` to snapshot a specific VM
- Output: JSON report at `/tmp/health_reports/proxmox_<timestamp>.json`
- Pattern: mirrors `synology_health.yml` structure
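
One way the guest inventory and the optional snapshot action might look, assuming the playbook shells out to the stock `pvesh` and `qm` tools on the node. The `snapshot_name` format and the `guest_summary` fact are hypothetical, and `qm snapshot` covers QEMU VMs only.

```yaml
---
# Sketch only. `guest_summary` and `snapshot_name` are illustrative names.
- name: Proxmox inventory and optional snapshot (sketch)
  hosts: pve
  gather_facts: false
  tasks:
    - name: List VMs and containers with state and resource figures
      ansible.builtin.command: pvesh get /cluster/resources --type vm --output-format json
      register: pve_guests
      changed_when: false

    - name: Summarise guest counts by state
      ansible.builtin.set_fact:
        guest_summary:
          total: "{{ guests | length }}"
          running: "{{ guests | selectattr('status', 'equalto', 'running') | list | length }}"
          stopped: "{{ guests | selectattr('status', 'equalto', 'stopped') | list | length }}"
      vars:
        guests: "{{ pve_guests.stdout | from_json }}"

    # Runs only with: -e action=snapshot -e vm_id=100
    - name: Snapshot a single VM when requested
      ansible.builtin.command: "qm snapshot {{ vm_id }} {{ snapshot_name }}"
      vars:
        snapshot_name: "ansible-{{ lookup('pipe', 'date +%Y%m%d-%H%M%S') }}"
      when:
        - action | default('') == 'snapshot'
        - vm_id is defined
```

Ansible may warn that `action` is a reserved variable name, since it is also a task keyword. If that warning turns out to be noisy, a differently named extra var would avoid it.
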
### 3. `truenas_health.yml`
**Priority: High**
TrueNAS SCALE-specific health targeting the `truenas-scale` host.
- Checks:
  - ZFS pool status (`zpool status`), flagging DEGRADED/FAULTED
  - Pool scrub: last scrub date, status, any errors
  - Dataset disk usage with warnings at 80%/90%
  - SMART status for physical disks
  - TrueNAS apps (k3s-based): running app count, failed apps
- Output: JSON report at `/tmp/health_reports/truenas_<timestamp>.json`
- Complements existing `synology_health.yml`
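
The pool health check could start from the scripted output of `zpool list`, as in the sketch below. It covers pool-level health only. The `pools` fact and its field names are assumptions, and the scrub, dataset, SMART, and app checks are not shown.

```yaml
---
# Sketch only. The `pools` fact and its keys are illustrative.
- name: TrueNAS pool health (sketch)
  hosts: truenas-scale
  gather_facts: false
  tasks:
    # -H prints one pool per line, without headers.
    - name: List pool name, health and capacity
      ansible.builtin.command: zpool list -H -o name,health,capacity
      register: zpool_list
      changed_when: false

    - name: Build a list of pool records
      ansible.builtin.set_fact:
        pools: "{{ pools | default([]) + [{'name': f[0], 'health': f[1], 'used_pct': f[2] | replace('%', '') | int}] }}"
      vars:
        f: "{{ item.split() }}"
      loop: "{{ zpool_list.stdout_lines }}"

    - name: Flag pools that are not ONLINE
      ansible.builtin.debug:
        msg: "Pool {{ item.name }} is {{ item.health }}"
      loop: "{{ pools | default([]) | rejectattr('health', 'equalto', 'ONLINE') | list }}"
```
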
### 4. `ntp_check.yml`
**Priority: Medium**
Time sync health check across all hosts. Check only — no configuration changes.
- Targets: `all`
- Platform-adaptive daemon detection: `chronyd`, `systemd-timesyncd`, `ntpd`, Synology NTP
- Reports: sync source, current offset (ms), stratum, last sync time
- Thresholds: warn >500ms, critical >1000ms
- Alert: ntfy notification for hosts exceeding warn threshold
- Output: summary table + `/tmp/ntp_reports/ntp_<timestamp>.json`
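
The threshold logic above can be illustrated for the `chronyd` case alone. This sketch assumes the human-readable `chronyc tracking` output, which includes a `System time : <seconds> seconds fast/slow of NTP time` line. The variable names are hypothetical, and the other daemons would each need their own parsing branch.

```yaml
---
# Sketch only, chrony hosts only. Variable names are illustrative.
- name: NTP offset check (sketch)
  hosts: all
  gather_facts: false
  ignore_unreachable: true
  vars:
    ntp_warn_ms: 500
    ntp_crit_ms: 1000
  tasks:
    - name: Query chrony where it is installed
      ansible.builtin.command: chronyc tracking
      register: chrony_tracking
      changed_when: false
      failed_when: false

    # Take the number after the colon on the "System time" line
    # and convert seconds to milliseconds.
    - name: Convert the system time offset to milliseconds
      ansible.builtin.set_fact:
        ntp_offset_ms: >-
          {{ (((chrony_tracking.stdout_lines | select('match', 'System time') | first)
               .split(':')[1].split()[0] | float) * 1000) | round(3) }}
      when: chrony_tracking.rc | default(1) == 0

    - name: Classify the offset against the thresholds
      ansible.builtin.set_fact:
        ntp_state: >-
          {{ 'critical' if (ntp_offset_ms | float) > ntp_crit_ms
             else ('warn' if (ntp_offset_ms | float) > ntp_warn_ms else 'ok') }}
      when: ntp_offset_ms is defined
```

Hosts without `ntp_state` after this play either lack chrony or were unreachable, so the summary table would need to report them separately rather than as healthy.
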
### 5. `cron_audit.yml`
**Priority: Medium**
Scheduled task inventory and basic security audit across all hosts.
- Inventories:
  - `/etc/crontab`, `/etc/cron.d/*`, `/etc/cron.{hourly,daily,weekly,monthly}/`
  - User crontabs (`crontab -l` for each user with a crontab)
  - `systemd` timer units (`systemctl list-timers --all`)
- Security flags:
  - Cron jobs running as root that reference world-writable paths
  - Cron jobs referencing paths that no longer exist
- Output: per-host JSON at `/tmp/cron_audit/<host>_<timestamp>.json` + summary
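
A simplified sketch of the two security flags, limited to `/etc/crontab`. It treats any whitespace-separated token that starts with `/` as a path, which is a rough heuristic and an assumption of this example, not a full crontab parser. The `cron_paths` and `cron_flags` names are hypothetical.

```yaml
---
# Sketch only. Path extraction is a crude token match; names are illustrative.
- name: Cron and timer audit (sketch)
  hosts: all
  gather_facts: false
  tasks:
    - name: List systemd timers where systemd is present
      ansible.builtin.command: systemctl list-timers --all --no-pager
      register: timers
      changed_when: false
      failed_when: false

    - name: Read /etc/crontab
      ansible.builtin.slurp:
        src: /etc/crontab
      register: crontab_raw
      failed_when: false

    # Drop comment lines, then keep tokens that begin with "/".
    - name: Collect absolute paths mentioned in /etc/crontab
      ansible.builtin.set_fact:
        cron_paths: >-
          {{ ((crontab_raw.content | default('') | b64decode).splitlines()
              | reject('match', ' *#') | join(' ')).split()
             | select('match', '/') | unique | list }}

    - name: Stat each referenced path
      ansible.builtin.stat:
        path: "{{ item }}"
      loop: "{{ cron_paths }}"
      register: path_stats

    # `woth` is true when "others" have write permission.
    - name: Record missing and world-writable paths
      ansible.builtin.set_fact:
        cron_flags:
          missing: "{{ path_stats.results | rejectattr('stat.exists') | map(attribute='item') | list }}"
          world_writable: "{{ path_stats.results | selectattr('stat.woth', 'defined') | selectattr('stat.woth') | map(attribute='item') | list }}"
```
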
## Patterns to Follow
- Use `changed_when: false` on all read-only shell tasks
- Use `ignore_errors: true` / `ignore_unreachable: true` for non-fatal checks
- Platform detection via `ansible_distribution` and custom `system_type` host_vars
- ntfy URL from `ntfy_url` variable (group_vars with default fallback)
- JSON reports saved to `/tmp/<category>_reports/` with a timestamp in the filename (`cron_audit.yml` writes to `/tmp/cron_audit/` instead)
- `delegate_to: localhost` + `run_once: true` for report aggregation tasks
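
Taken together, these patterns might combine into a skeleton like the one below. It is a sketch under stated assumptions: the `uptime` command stands in for a real check, and `report_dir`, `ntfy_default_url`, `host_record`, `report`, and `failed_hosts` are hypothetical names introduced for illustration.

```yaml
---
# Sketch only. All variable names and the ntfy fallback URL are illustrative.
- name: Read-only check skeleton (sketch)
  hosts: all
  gather_facts: false
  ignore_unreachable: true
  vars:
    report_dir: /tmp/example_reports
    ntfy_default_url: https://ntfy.example.internal/homelab
    # Evaluated lazily: maps each host in the play to its host_record.
    report: "{{ dict(ansible_play_hosts | zip(ansible_play_hosts | map('extract', hostvars, 'host_record'))) }}"
    failed_hosts: "{{ report | dict2items | selectattr('value.state', 'equalto', 'failed') | map(attribute='key') | list }}"
  tasks:
    - name: Run a read-only check (placeholder command)
      ansible.builtin.command: uptime
      register: check_result
      changed_when: false
      ignore_errors: true

    # Unreachable hosts have no rc, so they fall through to 'failed'.
    - name: Reduce the result to a small per-host record
      ansible.builtin.set_fact:
        host_record:
          state: "{{ 'ok' if (check_result.rc | default(1)) == 0 else 'failed' }}"
          output: "{{ check_result.stdout | default('') }}"

    - name: Ensure the report directory exists on the control node
      ansible.builtin.file:
        path: "{{ report_dir }}"
        state: directory
        mode: "0755"
      delegate_to: localhost
      run_once: true

    - name: Write one aggregated JSON report
      ansible.builtin.copy:
        dest: "{{ report_dir }}/example_{{ now(utc=true, fmt='%Y%m%dT%H%M%SZ') }}.json"
        content: "{{ report | to_nice_json }}"
        mode: "0644"
      delegate_to: localhost
      run_once: true

    - name: Notify ntfy when any host failed
      ansible.builtin.uri:
        url: "{{ ntfy_url | default(ntfy_default_url) }}"
        method: POST
        body: "Check failed on: {{ failed_hosts | join(', ') }}"
        headers:
          Title: Homelab check
      delegate_to: localhost
      run_once: true
      when: failed_hosts | length > 0
```

Storing the per-host state as a plain string (`ok` or `failed`) sidesteps any ambiguity about how templated booleans are typed when they are read back from `hostvars`.
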
## Out of Scope
- NTP configuration/enforcement (check only, per user decision)
- Home Assistant backup (deferred)
- Docker compose drift detection (deferred)
- Gitea health (deferred)