Sanitized mirror from private repository - 2026-03-18 10:31:50 UTC
# New Playbooks Design — 2026-02-21
## Context

Adding 5 playbooks to fill coverage gaps in the existing 42-playbook homelab automation suite.
Infrastructure: 10+ hosts, 200+ containers, Tailscale mesh, mixed platforms (Ubuntu, Debian,
Synology DSM, TrueNAS SCALE, Proxmox, Alpine/Home Assistant, Raspberry Pi).

## Approved Playbooks
### 1. `network_connectivity.yml`
**Priority: High (user-requested)**

Full mesh connectivity verification across the tailnet.

- Targets: `all` (unreachable hosts handled gracefully with `ignore_unreachable`)
- Checks per host:
  - Tailscale is running and has a valid IP (`tailscale status --json`)
  - Ping all other inventory hosts by Tailscale IP
  - SSH reachability to each peer
  - HTTP/HTTPS endpoint health for key services (Portainer, Gitea, Immich, Home Assistant, etc.), defined in group_vars or inline vars
- Output: connectivity matrix table + `/tmp/connectivity_reports/connectivity_<timestamp>.json`
- Alert: ntfy notification on any failed node or endpoint
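
The checks above could be sketched as the following play. This is a minimal illustration under stated assumptions, not the approved implementation: the `ts_ip` fact, the `peers` and `health_endpoints` variables, the example URL, and the `ping` flags are all hypothetical, and `ping -W` in particular behaves differently across the platforms in this inventory.

```yaml
# Sketch only: fact names, the endpoint list, and ping flags are assumptions.
- name: Verify tailnet connectivity
  hosts: all
  gather_facts: false
  ignore_unreachable: true
  vars:
    # Hypothetical endpoint list; the design keeps this in group_vars or inline vars.
    health_endpoints:
      - name: example-service
        url: "https://service.example.internal/"
    # Every other host in the play, evaluated per host.
    peers: "{{ ansible_play_hosts_all | difference([inventory_hostname]) }}"
  tasks:
    - name: Read local Tailscale state
      ansible.builtin.command: tailscale status --json
      register: ts_status
      changed_when: false
      failed_when: false

    - name: Record this host's first Tailscale IP when the backend is running
      ansible.builtin.set_fact:
        ts_ip: "{{ (ts_status.stdout | from_json).Self.TailscaleIPs | first }}"
      when:
        - ts_status.rc | default(1) == 0
        - (ts_status.stdout | from_json).BackendState == 'Running'

    - name: Ping every peer by Tailscale IP
      ansible.builtin.command: "ping -c 2 -W 2 {{ hostvars[item].ts_ip }}"
      loop: "{{ peers }}"
      when: hostvars[item].ts_ip is defined
      register: ping_results
      changed_when: false
      failed_when: false

    - name: Check that each peer accepts connections on the SSH port
      ansible.builtin.wait_for:
        host: "{{ hostvars[item].ts_ip }}"
        port: 22
        timeout: 5
      loop: "{{ peers }}"
      when: hostvars[item].ts_ip is defined
      register: ssh_results
      ignore_errors: true

    - name: Probe service endpoints
      ansible.builtin.uri:
        url: "{{ item.url }}"
        timeout: 10
      loop: "{{ health_endpoints }}"
      register: endpoint_results
      ignore_errors: true
```

The registered `ping_results`, `ssh_results`, and `endpoint_results` would then feed the connectivity matrix and the ntfy alert. Peers without a `ts_ip` fact (unreachable, or Tailscale not running) are skipped rather than failing the play.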
### 2. `proxmox_management.yml`
**Priority: High**

Proxmox-specific management targeting `pve` host.

- Checks:
  - VM/LXC inventory: count, names, state (running/stopped)
  - Resource allocation vs actual usage (RAM, CPU per VM)
  - Storage pool status and utilisation
  - Recent Proxmox task log (last 10 tasks)
- Optional action: `-e action=snapshot -e vm_id=100` to snapshot a specific VM
- Output: JSON report at `/tmp/health_reports/proxmox_<timestamp>.json`
- Pattern: mirrors `synology_health.yml` structure
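
One possible shape for these tasks, using the stock Proxmox CLI tools (`pvesh`, `pvesm`, `qm`), is sketched below. The `pve_node` variable, the assumption that the node name equals the inventory name, and the `ansible_<timestamp>` snapshot naming are illustrative choices that do not come from the design.

```yaml
# Sketch only: the node name lookup and snapshot naming are assumptions.
- name: Proxmox inventory and optional snapshot
  hosts: pve
  gather_facts: false
  vars:
    # Hypothetical variable; assumes the Proxmox node name matches the inventory name.
    pve_node: "{{ inventory_hostname }}"
  tasks:
    - name: List VMs and containers with state and resource usage
      ansible.builtin.command: pvesh get /cluster/resources --type vm --output-format json
      register: pve_guests
      changed_when: false

    - name: Read storage pool status and utilisation
      ansible.builtin.command: pvesm status
      register: pve_storage
      changed_when: false

    - name: Read the last 10 tasks from the node task log
      ansible.builtin.command: "pvesh get /nodes/{{ pve_node }}/tasks --limit 10 --output-format json"
      register: pve_tasks
      changed_when: false
      ignore_errors: true

    - name: Count guests by state
      ansible.builtin.set_fact:
        guests_total: "{{ pve_guests.stdout | from_json | length }}"
        guests_running: "{{ pve_guests.stdout | from_json | selectattr('status', 'equalto', 'running') | list | length }}"

    - name: Snapshot one VM when requested with -e action=snapshot -e vm_id=<id>
      ansible.builtin.command: "qm snapshot {{ vm_id }} ansible_{{ now(utc=true, fmt='%Y%m%dT%H%M%S') }}"
      when:
        - action | default('') == 'snapshot'
        - vm_id is defined
```

Note that `qm snapshot` applies to QEMU VMs only; an LXC guest would need `pct snapshot` instead. Ansible may also warn that the extra variable `action` collides with a reserved playbook keyword, so a differently named variable might be worth considering during implementation.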
### 3. `truenas_health.yml`
**Priority: High**

TrueNAS SCALE-specific health targeting `truenas-scale` host.

- Checks:
  - ZFS pool status (`zpool status`), flagging DEGRADED/FAULTED
  - Pool scrub: last scrub date, status, any errors
  - Dataset disk usage with warnings at 80%/90%
  - SMART status for physical disks
  - TrueNAS apps (k3s-based before SCALE 24.10, Docker-based from 24.10 onward): running app count, failed apps
- Output: JSON report at `/tmp/health_reports/truenas_<timestamp>.json`
- Complements existing `synology_health.yml`
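
As a rough sketch, the pool-health and dataset-usage checks might look like this. The fact names (`bad_pools`, `dataset_levels`) are hypothetical, and the usage percentage is approximated as `used / (used + available)` per dataset, which is an assumption about how the 80%/90% thresholds should be measured. Scrub, SMART, and app checks are omitted.

```yaml
# Sketch only: fact names and the usage formula are assumptions.
- name: TrueNAS ZFS health
  hosts: truenas-scale
  gather_facts: false
  tasks:
    - name: List pool health in script-friendly form
      ansible.builtin.command: zpool list -H -o name,health
      register: zpool_health
      changed_when: false

    - name: Collect pools reported as DEGRADED or FAULTED
      ansible.builtin.set_fact:
        bad_pools: >-
          {{ zpool_health.stdout_lines | select('search', '(DEGRADED|FAULTED)$') | list }}

    - name: List dataset usage in bytes (tab-separated, no headers)
      ansible.builtin.command: zfs list -H -p -o name,used,available
      register: zfs_usage
      changed_when: false

    - name: Classify each dataset against the 80%/90% thresholds
      ansible.builtin.set_fact:
        dataset_levels: "{{ dataset_levels | default({}) | combine({fields[0]: level}) }}"
      vars:
        fields: "{{ item.split('\t') }}"
        # Guard against division by zero on empty datasets.
        used_pct: "{{ 100 * (fields[1] | int) / ([(fields[1] | int) + (fields[2] | int), 1] | max) }}"
        level: "{{ 'critical' if used_pct | float >= 90 else ('warning' if used_pct | float >= 80 else 'ok') }}"
      loop: "{{ zfs_usage.stdout_lines }}"
```

A non-empty `bad_pools` list or any `critical` entry in `dataset_levels` would be the natural trigger for an alert and a flagged entry in the JSON report.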
### 4. `ntp_check.yml`
**Priority: Medium**

Time sync health check across all hosts. Check only; no configuration changes.

- Targets: `all`
- Platform-adaptive daemon detection: `chronyd`, `systemd-timesyncd`, `ntpd`, Synology NTP
- Reports: sync source, current offset (ms), stratum, last sync time
- Thresholds: warn >500ms, critical >1000ms
- Alert: ntfy notification for hosts exceeding warn threshold
- Output: summary table + `/tmp/ntp_reports/ntp_<timestamp>.json`
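
To illustrate the threshold logic, here is a minimal sketch that handles only hosts running chrony; the other daemons would need their own query and parsing tasks. The variable names (`ntp_warn_ms`, `ntp_crit_ms`, `ntp_offset_ms`, `ntp_level`) are assumptions, and the parsing relies on the human-readable `chronyc tracking` output.

```yaml
# Sketch only: covers chrony hosts; variable names are assumptions.
- name: Time sync offset check
  hosts: all
  gather_facts: false
  ignore_unreachable: true
  vars:
    ntp_warn_ms: 500
    ntp_crit_ms: 1000
  tasks:
    - name: Query chrony tracking data
      ansible.builtin.command: chronyc tracking
      register: chrony_tracking
      changed_when: false
      failed_when: false

    - name: Extract the absolute offset in milliseconds and the stratum
      ansible.builtin.set_fact:
        ntp_offset_ms: >-
          {{ (chrony_tracking.stdout
              | regex_search('System time +: +([0-9.]+) seconds', '\\1')
              | first | float) * 1000 }}
        ntp_stratum: >-
          {{ chrony_tracking.stdout | regex_search('Stratum +: +([0-9]+)', '\\1') | first }}
      when: chrony_tracking.rc | default(1) == 0

    - name: Classify the offset against the warn and critical thresholds
      ansible.builtin.set_fact:
        ntp_level: >-
          {{ 'critical' if ntp_offset_ms | float > ntp_crit_ms
             else ('warning' if ntp_offset_ms | float > ntp_warn_ms else 'ok') }}
      when: ntp_offset_ms is defined
```

Hosts where `chronyc` is missing or the daemon is not running simply end up without an `ntp_level` fact, which a later task could treat as "daemon not detected" and fall through to the next detection method.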
### 5. `cron_audit.yml`
**Priority: Medium**

Scheduled task inventory and basic security audit across all hosts.

- Inventories:
  - `/etc/crontab`, `/etc/cron.d/*`, `/etc/cron.{hourly,daily,weekly,monthly}/`
  - User crontabs (`crontab -l` for each user with a crontab)
  - `systemd` timer units (`systemctl list-timers --all`)
- Security flags:
  - Cron jobs running as root that reference world-writable paths
  - Cron jobs referencing paths that no longer exist
- Output: per-host JSON at `/tmp/cron_audit/<host>_<timestamp>.json` + summary
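
The two security flags could be approximated as follows. This sketch only inspects `/etc/crontab`, extracts absolute paths with a simple regular expression, and does not yet filter by the job's user column, so it is a starting point under those assumptions rather than the full audit. All fact names are hypothetical.

```yaml
# Sketch only: inspects /etc/crontab alone; fact names are assumptions.
- name: Cron and timer audit
  hosts: all
  gather_facts: false
  become: true
  tasks:
    - name: List systemd timers
      ansible.builtin.command: systemctl list-timers --all --no-pager
      register: timers
      changed_when: false
      failed_when: false

    - name: Read the system crontab
      ansible.builtin.slurp:
        src: /etc/crontab
      register: crontab_raw
      failed_when: false

    - name: Extract absolute paths from non-comment lines
      ansible.builtin.set_fact:
        cron_paths: >-
          {{ (crontab_raw.content | b64decode).splitlines()
             | reject('match', '\\s*#')
             | join(' ')
             | regex_findall('(?:^|\\s)(/[A-Za-z0-9_./-]+)')
             | unique }}
      when: crontab_raw.content is defined

    - name: Stat each referenced path
      ansible.builtin.stat:
        path: "{{ item }}"
      loop: "{{ cron_paths | default([]) }}"
      register: cron_path_stats

    - name: Flag paths that are missing or world-writable
      ansible.builtin.set_fact:
        cron_missing: "{{ cron_path_stats.results | rejectattr('stat.exists') | map(attribute='item') | list }}"
        cron_world_writable: "{{ cron_path_stats.results | selectattr('stat.exists') | selectattr('stat.woth') | map(attribute='item') | list }}"
```

The same extract-and-stat approach would extend to files under `/etc/cron.d/` and to user crontabs once those are collected.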
## Patterns to Follow

- Use `changed_when: false` on all read-only shell tasks
- Use `ignore_errors: true` / `ignore_unreachable: true` for non-fatal checks
- Platform detection via `ansible_distribution` and custom `system_type` host_vars
- ntfy URL from `ntfy_url` variable (group_vars with default fallback)
- JSON reports saved to `/tmp/<category>_reports/` with timestamp in filename
- `delegate_to: localhost` + `run_once: true` for report aggregation tasks
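
Taken together, these patterns suggest a skeleton along the following lines. The check itself (`uptime`, `dpkg --audit`), the `example` report category, the `report` and `failed_hosts` variables, and the fallback ntfy URL are placeholders chosen for illustration only.

```yaml
# Sketch only: the checks, report category, and fallback URL are placeholders.
- name: Example check following the shared patterns
  hosts: all
  ignore_unreachable: true
  vars:
    report_dir: /tmp/example_reports
    ntfy_target: "{{ ntfy_url | default('https://ntfy.example.internal/homelab') }}"
    # Map each host to its command output, or 'no data' when nothing came back.
    report: >-
      {{ dict(ansible_play_hosts
              | zip(ansible_play_hosts
                    | map('extract', hostvars, 'uptime_out')
                    | map(attribute='stdout', default='no data'))) }}
    failed_hosts: "{{ report | dict2items | selectattr('value', 'equalto', 'no data') | map(attribute='key') | list }}"
  tasks:
    - name: Run a read-only check that never reports a change
      ansible.builtin.command: uptime
      register: uptime_out
      changed_when: false
      ignore_errors: true

    - name: Run a check only on Debian-family hosts
      ansible.builtin.command: dpkg --audit
      register: dpkg_audit
      changed_when: false
      ignore_errors: true
      when: ansible_distribution | default('') in ['Ubuntu', 'Debian']

    - name: Ensure the report directory exists on the controller
      ansible.builtin.file:
        path: "{{ report_dir }}"
        state: directory
        mode: "0755"
      delegate_to: localhost
      run_once: true

    - name: Write one aggregated JSON report
      ansible.builtin.copy:
        content: "{{ report | to_nice_json }}"
        dest: "{{ report_dir }}/example_{{ now(utc=true, fmt='%Y%m%dT%H%M%SZ') }}.json"
        mode: "0644"
      delegate_to: localhost
      run_once: true

    - name: Notify through ntfy when any host returned no data
      ansible.builtin.uri:
        url: "{{ ntfy_target }}"
        method: POST
        body: "No data from: {{ failed_hosts | join(', ') }}"
      delegate_to: localhost
      run_once: true
      when: failed_hosts | length > 0
```

Because `report` and `failed_hosts` are play-level variables, they are evaluated lazily at the point of use, after every host has registered its result.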
## Out of Scope

- NTP configuration/enforcement (check only, per user decision)
- Home Assistant backup (deferred)
- Docker compose drift detection (deferred)
- Gitea health (deferred)