*Sanitized mirror from private repository - 2026-04-18 11:19:59 UTC*

# New Playbooks Implementation Plan
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
**Goal:** Add 5 new Ansible playbooks covering network connectivity health, Proxmox management, TrueNAS health, NTP sync auditing, and cron job inventory.
**Architecture:** Each playbook is standalone, follows existing patterns (read-only shell tasks with `changed_when: false`, `failed_when: false` for non-fatal checks, ntfy alerting via `ntfy_url` var, JSON reports in `/tmp/<category>_reports/`). Platform detection is done inline via command availability checks rather than Ansible facts to keep cross-platform compatibility with Synology/TrueNAS.
**Tech Stack:** Ansible, bash shell commands, Tailscale CLI, Proxmox `qm`/`pct`/`pvesh` CLI, ZFS `zpool`/`zfs` tools, `chronyc`/`timedatectl`, `smartctl`, standard POSIX cron paths.
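The inline platform detection mentioned above boils down to probing a list of candidate binary paths and taking the first executable hit. A minimal stand-alone sketch (the paths are stand-ins — `/bin/sh` plays the role of a platform-specific binary):

```shell
# Probe candidate binary locations instead of trusting Ansible facts.
# Real playbooks list per-platform locations (e.g. the Synology package path).
for p in /nonexistent/tailscale /bin/sh; do
  if [ -x "$p" ]; then
    echo "found: $p"
    break
  fi
done
# → found: /bin/sh
```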
---
## Conventions to Follow (read this first)
These patterns appear in every existing playbook — match them exactly:
```yaml
# Read-only tasks always have:
changed_when: false
failed_when: false # (or ignore_errors: yes)
# Report directories:
delegate_to: localhost
run_once: true
# Variable defaults:
my_var: "{{ my_var | default('fallback') }}"
# Module names use fully-qualified form:
ansible.builtin.shell
ansible.builtin.debug
ansible.builtin.assert
# ntfy alerting (used in alert_check.yml — copy that pattern):
ntfy_url: "{{ ntfy_url | default('https://ntfy.sh/REDACTED_TOPIC') }}"
```
Reference files to read before each task:
- `playbooks/synology_health.yml` — pattern for platform-specific health checks
- `playbooks/tailscale_health.yml` — pattern for binary detection + JSON parsing
- `playbooks/disk_usage_report.yml` — pattern for threshold variables + report dirs
- `playbooks/alert_check.yml` — pattern for ntfy notifications
---
## Task 1: `network_connectivity.yml` — Full mesh connectivity check
**Files:**
- Create: `playbooks/network_connectivity.yml`
**What it does:** For every host in inventory, check Tailscale is Running, ping all other hosts by their `ansible_host` IP, test SSH port reachability, and verify HTTP endpoints for key services. Outputs a connectivity matrix and sends ntfy alert on failures.
**Step 1: Create the playbook file**
```yaml
---
# Network Connectivity Health Check
# Verifies Tailscale mesh connectivity between all inventory hosts
# and checks HTTP/HTTPS endpoints for key services.
#
# Usage: ansible-playbook -i hosts.ini playbooks/network_connectivity.yml
# Usage: ansible-playbook -i hosts.ini playbooks/network_connectivity.yml --limit homelab

- name: Network Connectivity Health Check
  hosts: "{{ host_target | default('active') }}"
  gather_facts: yes
  ignore_unreachable: true

  vars:
    report_dir: "/tmp/connectivity_reports"
    ts_candidates:
      - /usr/bin/tailscale
      - /var/packages/Tailscale/target/bin/tailscale
    warn_on_failure: true
    ntfy_url: "{{ ntfy_url | default('https://ntfy.sh/REDACTED_TOPIC') }}"
    # HTTP endpoints to verify — add/remove per your services
    http_endpoints:
      - name: Portainer (homelab)
        url: "http://100.67.40.126:9000"
      - name: Gitea (homelab)
        url: "http://100.67.40.126:3000"
      - name: Immich (homelab)
        url: "http://100.67.40.126:2283"
      - name: Home Assistant
        url: "http://100.112.186.90:8123"

  tasks:
    - name: Create connectivity report directory
      ansible.builtin.file:
        path: "{{ report_dir }}"
        state: directory
        mode: '0755'
      delegate_to: localhost
      run_once: true

    # ── Tailscale status ──────────────────────────────────────────────
    - name: Detect Tailscale binary
      ansible.builtin.shell: |
        for p in {{ ts_candidates | join(' ') }}; do
          [ -x "$p" ] && echo "$p" && exit 0
        done
        echo ""
      register: ts_bin
      changed_when: false
      failed_when: false

    - name: Get Tailscale status JSON
      ansible.builtin.command: "{{ ts_bin.stdout }} status --json"
      register: ts_status_raw
      changed_when: false
      failed_when: false
      when: ts_bin.stdout | length > 0

    - name: Parse Tailscale state
      ansible.builtin.set_fact:
        ts_parsed: "{{ ts_status_raw.stdout | from_json }}"
        ts_backend: "{{ (ts_status_raw.stdout | from_json).BackendState | default('unknown') }}"
        ts_ip: "{{ ((ts_status_raw.stdout | from_json).Self.TailscaleIPs | default([]) | first) | default('n/a') }}"
      when:
        - ts_bin.stdout | length > 0
        - ts_status_raw.rc | default(1) == 0
        - ts_status_raw.stdout | default('') | length > 0
        - ts_status_raw.stdout is search('{')
      failed_when: false

    # ── Peer reachability (ping each inventory host by Tailscale IP) ──
    - name: Ping all inventory hosts
      ansible.builtin.shell: |
        ping -c 2 -W 2 {{ hostvars[item]['ansible_host'] }} > /dev/null 2>&1 && echo "OK" || echo "FAIL"
      register: ping_results
      changed_when: false
      failed_when: false
      loop: "{{ groups['active'] | select('ne', inventory_hostname) | list }}"
      loop_control:
        label: "{{ item }}"

    - name: Summarise ping results
      ansible.builtin.set_fact:
        ping_summary: "{{ ping_summary | default({}) | combine({item.item: item.stdout | trim}) }}"
      loop: "{{ ping_results.results }}"
      loop_control:
        label: "{{ item.item }}"

    # ── SSH port check ────────────────────────────────────────────────
    - name: Check SSH port on all inventory hosts
      ansible.builtin.shell: |
        port="{{ hostvars[item]['ansible_port'] | default(22) }}"
        nc -zw3 {{ hostvars[item]['ansible_host'] }} "$port" > /dev/null 2>&1 && echo "OK" || echo "FAIL"
      register: ssh_port_results
      changed_when: false
      failed_when: false
      loop: "{{ groups['active'] | select('ne', inventory_hostname) | list }}"
      loop_control:
        label: "{{ item }}"

    - name: Summarise SSH port results
      ansible.builtin.set_fact:
        ssh_summary: "{{ ssh_summary | default({}) | combine({item.item: item.stdout | trim}) }}"
      loop: "{{ ssh_port_results.results }}"
      loop_control:
        label: "{{ item.item }}"

    # ── HTTP endpoint checks (run once from localhost) ────────────────
    - name: Check HTTP endpoints
      ansible.builtin.uri:
        url: "{{ item.url }}"
        method: GET
        status_code: [200, 301, 302, 401, 403]
        timeout: 5
        validate_certs: false
      register: http_results
      failed_when: false
      loop: "{{ http_endpoints }}"
      loop_control:
        label: "{{ item.name }}"
      delegate_to: localhost
      run_once: true

    # ── Connectivity summary ──────────────────────────────────────────
    - name: Display connectivity summary per host
      ansible.builtin.debug:
        msg: |
          ═══ {{ inventory_hostname }} ═══
          Tailscale: {{ ts_backend | default('not installed') }} | IP: {{ ts_ip | default('n/a') }}
          Peer ping results:
          {% for host, result in (ping_summary | default({})).items() %}
            {{ host }}: {{ result }}
          {% endfor %}
          SSH port results:
          {% for host, result in (ssh_summary | default({})).items() %}
            {{ host }}: {{ result }}
          {% endfor %}

    - name: Display HTTP endpoint results
      ansible.builtin.debug:
        msg: |
          ═══ HTTP Endpoint Health ═══
          {% for item in http_results.results | default([]) %}
          {{ item.item.name }}: {{ 'OK (' + (item.status | string) + ')' if item.status is defined and item.status > 0 else 'FAIL' }}
          {% endfor %}
      run_once: true
      delegate_to: localhost

    # ── Alert on failures ─────────────────────────────────────────────
    - name: Collect failed peers
      ansible.builtin.set_fact:
        failed_peers: >-
          {{ ping_summary | default({}) | dict2items | selectattr('value', 'eq', 'FAIL') | map(attribute='key') | list }}

    - name: Send ntfy alert for connectivity failures
      ansible.builtin.uri:
        url: "{{ ntfy_url }}"
        method: POST
        body: "Connectivity failures on {{ inventory_hostname }}: {{ failed_peers | join(', ') }}"
        headers:
          Title: "Homelab Network Alert"
          Priority: "high"
          Tags: "warning,network"
        body_format: raw
        status_code: [200, 204]
      delegate_to: localhost
      failed_when: false
      when:
        - warn_on_failure | bool
        - failed_peers | length > 0

    # ── Write JSON report ─────────────────────────────────────────────
    - name: Write connectivity report
      ansible.builtin.copy:
        content: "{{ {'host': inventory_hostname, 'timestamp': ansible_date_time.iso8601, 'tailscale_state': ts_backend | default('unknown'), 'tailscale_ip': ts_ip | default('n/a'), 'ping': ping_summary | default({}), 'ssh_port': ssh_summary | default({})} | to_nice_json }}"
        dest: "{{ report_dir }}/{{ inventory_hostname }}_{{ ansible_date_time.date }}.json"
      delegate_to: localhost
      changed_when: false
```
**Step 2: Validate YAML syntax**
```bash
cd /home/homelab/organized/repos/homelab/ansible/automation
ansible-playbook --syntax-check -i hosts.ini playbooks/network_connectivity.yml
```
Expected: `playbook: playbooks/network_connectivity.yml` with no errors.
**Step 3: Dry-run against one host**
```bash
ansible-playbook -i hosts.ini playbooks/network_connectivity.yml --limit homelab --check
```
Expected: Tasks run, no failures. Some tasks will report `skipped` (when conditions, etc.) — that's fine.
**Step 4: Run for real against one host**
```bash
ansible-playbook -i hosts.ini playbooks/network_connectivity.yml --limit homelab
```
Expected: Connectivity summary printed, report written to `/tmp/connectivity_reports/homelab_<date>.json`.
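To eyeball a report after the run, the JSON can be queried with a short `python3` one-liner. The sample file below is hypothetical, matching the schema the playbook writes:

```shell
# Build a hypothetical sample matching the playbook's report schema
cat > /tmp/sample_connectivity.json <<'EOF'
{"host": "homelab", "ping": {"pve": "OK", "truenas-scale": "FAIL"}, "ssh_port": {"pve": "OK"}}
EOF
# Print the peers whose ping check failed
python3 -c "
import json
report = json.load(open('/tmp/sample_connectivity.json'))
print(','.join(h for h, r in report['ping'].items() if r == 'FAIL'))
"
# → truenas-scale
```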
**Step 5: Run against all active hosts**
```bash
ansible-playbook -i hosts.ini playbooks/network_connectivity.yml
```
Expected: Summary for every host in `[active]` group. Unreachable hosts are handled gracefully (skipped, not errored).
**Step 6: Commit**
```bash
git add playbooks/network_connectivity.yml
git commit -m "feat: add network_connectivity playbook for full mesh health check"
```
---
## Task 2: `proxmox_management.yml` — Proxmox VM/LXC inventory and health
**Files:**
- Create: `playbooks/proxmox_management.yml`
**What it does:** Targets the `pve` host. Reports VM inventory (`qm list`), LXC inventory (`pct list`), node resource summary, storage pool status, and last 10 task log entries. Optional snapshot action via `-e action=snapshot -e vm_id=100`.
**Note:** `pve` uses `ansible_user=root` (see `hosts.ini`), so `become: false` is correct here — root already has all access.
**Step 1: Create the playbook**
```yaml
---
# Proxmox VE Management Playbook
# Reports VM/LXC inventory, resource usage, storage pool status, and recent tasks.
# Optionally creates a snapshot with -e action=snapshot -e vm_id=100
#
# Usage: ansible-playbook -i hosts.ini playbooks/proxmox_management.yml
# Usage: ansible-playbook -i hosts.ini playbooks/proxmox_management.yml -e action=snapshot -e vm_id=100

- name: Proxmox VE Management
  hosts: pve
  gather_facts: yes
  become: false

  vars:
    action: "{{ action | default('status') }}"  # status | snapshot
    vm_id: "{{ vm_id | default('') }}"
    report_dir: "/tmp/health_reports"

  tasks:
    - name: Create report directory
      ansible.builtin.file:
        path: "{{ report_dir }}"
        state: directory
        mode: '0755'
      delegate_to: localhost
      run_once: true

    # ── Node overview ─────────────────────────────────────────────────
    - name: Get PVE version
      ansible.builtin.command: pveversion
      register: pve_version
      changed_when: false
      failed_when: false

    - name: Get node resource summary
      ansible.builtin.shell: |
        pvesh get /nodes/$(hostname)/status --output-format json 2>/dev/null || \
          echo '{"error": "pvesh not available"}'
      register: node_status_raw
      changed_when: false
      failed_when: false

    - name: Parse node status
      ansible.builtin.set_fact:
        node_status: "{{ node_status_raw.stdout | from_json }}"
      failed_when: false
      when: node_status_raw.stdout | default('') | length > 0

    # ── VM inventory ──────────────────────────────────────────────────
    - name: List all VMs
      ansible.builtin.command: qm list
      register: vm_list
      changed_when: false
      failed_when: false

    - name: List all LXC containers
      ansible.builtin.command: pct list
      register: lxc_list
      changed_when: false
      failed_when: false

    - name: Count running VMs
      ansible.builtin.shell: |
        qm list 2>/dev/null | grep -c "running" || echo "0"
      register: vm_running_count
      changed_when: false
      failed_when: false

    - name: Count running LXCs
      ansible.builtin.shell: |
        pct list 2>/dev/null | grep -c "running" || echo "0"
      register: lxc_running_count
      changed_when: false
      failed_when: false

    # ── Storage pools ─────────────────────────────────────────────────
    - name: Get storage pool status
      ansible.builtin.shell: |
        pvesh get /nodes/$(hostname)/storage --output-format json 2>/dev/null | \
        python3 -c "
        import json,sys
        data=json.load(sys.stdin)
        for s in data:
            used_pct = round(s.get('used',0) / s.get('total',1) * 100, 1) if s.get('total',0) > 0 else 0
            print(f\"{s.get('storage','?'):20} {s.get('type','?'):10} used={used_pct}% avail={round(s.get('avail',0)/1073741824,1)}GiB\")
        " 2>/dev/null || pvesm status 2>/dev/null || echo "Storage info unavailable"
      register: storage_status
      changed_when: false
      failed_when: false

    # ── Recent task log ───────────────────────────────────────────────
    - name: Get recent PVE tasks
      ansible.builtin.shell: |
        pvesh get /nodes/$(hostname)/tasks \
          --limit 10 \
          --output-format json 2>/dev/null | \
        python3 -c "
        import json,sys,datetime
        tasks=json.load(sys.stdin)
        for t in tasks:
            ts=datetime.datetime.fromtimestamp(t.get('starttime',0)).strftime('%Y-%m-%d %H:%M')
            status=t.get('status','?')
            upid=t.get('upid','?')
            print(f'{ts} {status:12} {upid}')
        " 2>/dev/null || echo "Task log unavailable"
      register: recent_tasks
      changed_when: false
      failed_when: false

    # ── Summary output ────────────────────────────────────────────────
    - name: Display Proxmox summary
      ansible.builtin.debug:
        msg: |
          ═══ Proxmox VE — {{ inventory_hostname }} ═══
          Version: {{ pve_version.stdout | default('unknown') }}
          VMs: {{ vm_running_count.stdout | trim }} running
          {{ vm_list.stdout | default('(no VMs)') | indent(2) }}
          LXCs: {{ lxc_running_count.stdout | trim }} running
          {{ lxc_list.stdout | default('(no LXCs)') | indent(2) }}
          Storage Pools:
          {{ storage_status.stdout | default('n/a') | indent(2) }}
          Recent Tasks (last 10):
          {{ recent_tasks.stdout | default('n/a') | indent(2) }}

    # ── Optional: snapshot a VM ───────────────────────────────────────
    - name: Create VM snapshot
      ansible.builtin.shell: |
        snap_name="ansible-snap-$(date +%Y%m%d-%H%M%S)"
        qm snapshot {{ vm_id }} "$snap_name" --description "Ansible automated snapshot"
        echo "Snapshot created: $snap_name for VM {{ vm_id }}"
      register: snapshot_result
      when:
        - action == "snapshot"
        - vm_id | string | length > 0
      changed_when: true

    - name: Show snapshot result
      ansible.builtin.debug:
        msg: "{{ snapshot_result.stdout | default('No snapshot taken') }}"
      when: action == "snapshot"

    # ── Write JSON report ─────────────────────────────────────────────
    - name: Write Proxmox report
      ansible.builtin.copy:
        content: "{{ {'host': inventory_hostname, 'timestamp': ansible_date_time.iso8601, 'version': pve_version.stdout | default('unknown'), 'vms_running': vm_running_count.stdout | trim, 'lxcs_running': lxc_running_count.stdout | trim, 'storage': storage_status.stdout | default(''), 'tasks': recent_tasks.stdout | default('')} | to_nice_json }}"
        dest: "{{ report_dir }}/proxmox_{{ ansible_date_time.date }}.json"
      delegate_to: localhost
      changed_when: false
```
**Step 2: Validate syntax**
```bash
ansible-playbook --syntax-check -i hosts.ini playbooks/proxmox_management.yml
```
Expected: no errors.
**Step 3: Run against pve**
```bash
ansible-playbook -i hosts.ini playbooks/proxmox_management.yml
```
Expected: Proxmox summary table printed. JSON report written to `/tmp/health_reports/proxmox_<date>.json`.
**Step 4: Test snapshot action (optional — only if you have a test VM)**
```bash
# Replace 100 with a real VM ID from the qm list output above
ansible-playbook -i hosts.ini playbooks/proxmox_management.yml -e action=snapshot -e vm_id=100
```
Expected: `Snapshot created: ansible-snap-<timestamp> for VM 100`
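Snapshots created this way accumulate over time; `qm` can list and delete them by name. A sketch to run on the PVE host — VM ID `100` and the snapshot name are examples, and the commands are guarded so the script degrades gracefully elsewhere:

```shell
# Inspect and clean up playbook-created snapshots (run on the PVE host)
if command -v qm >/dev/null 2>&1; then
  qm listsnapshot 100                                # show snapshot tree for VM 100
  qm delsnapshot 100 ansible-snap-20260418-120000    # delete one snapshot by name
else
  echo "qm not found (run this on the PVE host)"
fi
```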
**Step 5: Commit**
```bash
git add playbooks/proxmox_management.yml
git commit -m "feat: add proxmox_management playbook for PVE VM/LXC inventory and health"
```
---
## Task 3: `truenas_health.yml` — TrueNAS SCALE ZFS and app health
**Files:**
- Create: `playbooks/truenas_health.yml`
**What it does:** Targets `truenas-scale`. Checks ZFS pool health, scrub status, dataset usage, SMART disk status, and running TrueNAS apps (k3s-based). Flags degraded/faulted pools. Mirrors `synology_health.yml` structure.
**Note:** TrueNAS SCALE runs on Debian. The `vish` user needs sudo for `smartctl` and `zpool` — check `host_vars/truenas-scale.yml`; `ansible_become: true` is set in `group_vars/homelab_linux.yml`, which covers all hosts.
**Step 1: Create the playbook**
```yaml
---
# TrueNAS SCALE Health Check
# Checks ZFS pool status, scrub health, dataset usage, SMART disk status, and app state.
# Mirrors synology_health.yml but for TrueNAS SCALE (Debian-based with ZFS).
#
# Usage: ansible-playbook -i hosts.ini playbooks/truenas_health.yml

- name: TrueNAS SCALE Health Check
  hosts: truenas-scale
  gather_facts: yes
  become: true

  vars:
    disk_warn_pct: 80
    disk_critical_pct: 90
    report_dir: "/tmp/health_reports"

  tasks:
    - name: Create report directory
      ansible.builtin.file:
        path: "{{ report_dir }}"
        state: directory
        mode: '0755'
      delegate_to: localhost
      run_once: true

    # ── System overview ───────────────────────────────────────────────
    - name: Get system uptime
      ansible.builtin.command: uptime -p
      register: uptime_out
      changed_when: false
      failed_when: false

    - name: Get TrueNAS version
      ansible.builtin.shell: |
        cat /etc/version 2>/dev/null || \
        midclt call system.version 2>/dev/null || \
        echo "version unavailable"
      register: truenas_version
      changed_when: false
      failed_when: false

    # ── ZFS pool health ───────────────────────────────────────────────
    - name: Get ZFS pool status
      ansible.builtin.command: zpool status -v
      register: zpool_status
      changed_when: false
      failed_when: false

    - name: Get ZFS pool list (usage)
      ansible.builtin.command: zpool list -H
      register: zpool_list
      changed_when: false
      failed_when: false

    - name: Check for degraded or faulted pools
      ansible.builtin.shell: |
        zpool status 2>/dev/null | grep -E "state:\s*(DEGRADED|FAULTED|OFFLINE|REMOVED)" | wc -l
      register: pool_errors
      changed_when: false
      failed_when: false

    - name: Assert no degraded pools
      ansible.builtin.assert:
        that:
          - (pool_errors.stdout | trim | int) == 0
        success_msg: "All ZFS pools ONLINE"
        fail_msg: "DEGRADED or FAULTED pool detected — run: zpool status"
      changed_when: false
      ignore_errors: yes

    # ── ZFS scrub status ──────────────────────────────────────────────
    - name: Get last scrub info per pool
      ansible.builtin.shell: |
        for pool in $(zpool list -H -o name 2>/dev/null); do
          echo "Pool: $pool"
          zpool status "$pool" 2>/dev/null | grep -E "scrub|scan" | head -3
          echo "---"
        done
      register: scrub_status
      changed_when: false
      failed_when: false

    # ── Dataset usage ─────────────────────────────────────────────────
    - name: Get dataset usage (top-level datasets)
      ansible.builtin.shell: |
        zfs list -H -o name,used,avail,refer,mountpoint -d 1 2>/dev/null | head -20
      register: dataset_usage
      changed_when: false
      failed_when: false

    # ── SMART disk status ─────────────────────────────────────────────
    - name: List physical disks
      ansible.builtin.shell: |
        lsblk -d -o NAME,SIZE,MODEL,SERIAL 2>/dev/null | grep -v "loop\|sr" || \
        ls /dev/sd? /dev/nvme?n? 2>/dev/null
      register: disk_list
      changed_when: false
      failed_when: false

    - name: Check SMART health for each disk
      ansible.builtin.shell: |
        failed=0
        for disk in $(lsblk -d -n -o NAME 2>/dev/null | grep -v "loop\|sr"); do
          result=$(smartctl -H /dev/$disk 2>/dev/null | grep -E "SMART overall-health|PASSED|FAILED" || echo "n/a")
          echo "$disk: $result"
          echo "$result" | grep -q "FAILED" && failed=$((failed+1))
        done
        exit $failed
      register: smart_results
      changed_when: false
      failed_when: false

    # ── TrueNAS apps (k3s) ────────────────────────────────────────────
    - name: Get TrueNAS app status
      ansible.builtin.shell: |
        if command -v k3s >/dev/null 2>&1; then
          k3s kubectl get pods -A --no-headers 2>/dev/null | \
            awk '{print $4}' | sort | uniq -c | sort -rn
        elif command -v midclt >/dev/null 2>&1; then
          midclt call chart.release.query 2>/dev/null | \
          python3 -c "
        import json,sys
        try:
            apps=json.load(sys.stdin)
            for a in apps:
                print(f\"{a.get('id','?'):30} {a.get('status','?')}\")
        except:
            print('App status unavailable')
        " 2>/dev/null
        else
          echo "App runtime not detected (k3s/midclt not found)"
        fi
      register: app_status
      changed_when: false
      failed_when: false

    # ── Summary output ────────────────────────────────────────────────
    - name: Display TrueNAS health summary
      ansible.builtin.debug:
        msg: |
          ═══ TrueNAS SCALE — {{ inventory_hostname }} ═══
          Version : {{ truenas_version.stdout | default('unknown') | trim }}
          Uptime  : {{ uptime_out.stdout | default('n/a') }}
          Pool errors: {{ pool_errors.stdout | trim | default('0') }}
          ZFS Pool List:
          {{ zpool_list.stdout | default('(none)') | indent(2) }}
          ZFS Pool Status (degraded/faulted check):
            Degraded pools found: {{ pool_errors.stdout | trim }}
          Scrub Status:
          {{ scrub_status.stdout | default('n/a') | indent(2) }}
          Dataset Usage (top-level):
          {{ dataset_usage.stdout | default('n/a') | indent(2) }}
          SMART Disk Status:
          {{ smart_results.stdout | default('n/a') | indent(2) }}
          TrueNAS Apps:
          {{ app_status.stdout | default('n/a') | indent(2) }}

    # ── Write JSON report ─────────────────────────────────────────────
    - name: Write TrueNAS health report
      ansible.builtin.copy:
        content: "{{ {'host': inventory_hostname, 'timestamp': ansible_date_time.iso8601, 'version': truenas_version.stdout | default('unknown') | trim, 'pool_errors': pool_errors.stdout | trim, 'zpool_list': zpool_list.stdout | default(''), 'scrub': scrub_status.stdout | default(''), 'smart': smart_results.stdout | default(''), 'apps': app_status.stdout | default('')} | to_nice_json }}"
        dest: "{{ report_dir }}/truenas_{{ ansible_date_time.date }}.json"
      delegate_to: localhost
      changed_when: false
```
**Step 2: Validate syntax**
```bash
ansible-playbook --syntax-check -i hosts.ini playbooks/truenas_health.yml
```
Expected: no errors.
**Step 3: Run against truenas-scale**
```bash
ansible-playbook -i hosts.ini playbooks/truenas_health.yml
```
Expected: Health summary printed, pool status shown, SMART results visible. JSON report at `/tmp/health_reports/truenas_<date>.json`.
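The `pool_errors` field in the report makes it easy to gate follow-up automation (e.g. a nightly cron that only alerts when pools degrade). A sketch — the sample report is hypothetical, matching the schema above:

```shell
# Hypothetical sample matching the playbook's report schema
cat > /tmp/sample_truenas.json <<'EOF'
{"host": "truenas-scale", "pool_errors": "0", "version": "24.04"}
EOF
# Exit nonzero when any pool is degraded/faulted; print a status otherwise
python3 -c "
import json, sys
report = json.load(open('/tmp/sample_truenas.json'))
sys.exit(1 if int(report['pool_errors']) > 0 else 0)
" && echo "pools healthy"
```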
**Step 4: Commit**
```bash
git add playbooks/truenas_health.yml
git commit -m "feat: add truenas_health playbook for ZFS pool, scrub, SMART, and app status"
```
---
## Task 4: `ntp_check.yml` — Time sync health audit
**Files:**
- Create: `playbooks/ntp_check.yml`
**What it does:** Checks time sync status across all hosts. Detects which NTP daemon is running, extracts current offset in milliseconds, warns at >500ms, critical at >1000ms. Sends ntfy alert for hosts exceeding warn threshold. Read-only — no config changes.
**Platform notes:**
- Ubuntu/Debian: `systemd-timesyncd` → use `timedatectl show-timesync` or `chronyc tracking`
- Synology: Uses its own NTP, check via `/proc/driver/rtc` or `synoinfo.conf` + `ntpq -p`
- TrueNAS: Debian-based, likely `chrony` or `systemd-timesyncd`
- Proxmox: Debian-based
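The chrony offset parsing in the playbook below hinges on one awk expression: field 4 of the `System time` line from `chronyc tracking` is the offset in seconds, scaled to milliseconds. Fed a canned sample line (illustrative values, not real output from any host):

```shell
# Field 4 is the offset in seconds; scale to ms and round to 3 decimals
line="System time     : 0.000123456 seconds fast of NTP time"
echo "$line" | awk '{printf "%.3f\n", $4 * 1000}'
# → 0.123
```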
**Step 1: Create the playbook**
```yaml
---
# NTP Time Sync Health Check
# Audits time synchronization across all hosts. Read-only — no config changes.
# Warns when offset > 500ms, critical > 1000ms.
#
# Usage: ansible-playbook -i hosts.ini playbooks/ntp_check.yml
# Usage: ansible-playbook -i hosts.ini playbooks/ntp_check.yml --limit synology

- name: NTP Time Sync Health Check
  hosts: "{{ host_target | default('active') }}"
  gather_facts: yes
  ignore_unreachable: true

  vars:
    warn_offset_ms: 500
    critical_offset_ms: 1000
    ntfy_url: "{{ ntfy_url | default('https://ntfy.sh/REDACTED_TOPIC') }}"
    report_dir: "/tmp/ntp_reports"

  tasks:
    - name: Create report directory
      ansible.builtin.file:
        path: "{{ report_dir }}"
        state: directory
        mode: '0755'
      delegate_to: localhost
      run_once: true

    # ── Detect NTP daemon ─────────────────────────────────────────────
    - name: Detect active NTP implementation
      ansible.builtin.shell: |
        if command -v chronyc >/dev/null 2>&1 && chronyc tracking >/dev/null 2>&1; then
          echo "chrony"
        elif timedatectl show-timesync 2>/dev/null | grep -q ServerName; then
          echo "timesyncd"
        elif timedatectl 2>/dev/null | grep -q "NTP service: active"; then
          echo "timesyncd"
        elif command -v ntpq >/dev/null 2>&1 && ntpq -p >/dev/null 2>&1; then
          echo "ntpd"
        else
          echo "unknown"
        fi
      register: ntp_impl
      changed_when: false
      failed_when: false

    # ── Get offset (chrony) ───────────────────────────────────────────
    - name: Get chrony tracking info
      ansible.builtin.shell: chronyc tracking 2>/dev/null
      register: chrony_tracking
      changed_when: false
      failed_when: false
      when: ntp_impl.stdout | trim == "chrony"

    - name: Parse chrony offset (ms)
      ansible.builtin.shell: |
        chronyc tracking 2>/dev/null | \
        grep "System time" | \
        awk '{printf "%.3f", $4 * 1000}'
      register: chrony_offset_ms
      changed_when: false
      failed_when: false
      when: ntp_impl.stdout | trim == "chrony"

    - name: Get chrony sync source
      ansible.builtin.shell: |
        chronyc sources -v 2>/dev/null | grep "^\^" | head -3
      register: chrony_sources
      changed_when: false
      failed_when: false
      when: ntp_impl.stdout | trim == "chrony"

    # ── Get offset (systemd-timesyncd) ────────────────────────────────
    - name: Get timesyncd status
      ansible.builtin.shell: timedatectl show-timesync 2>/dev/null || timedatectl 2>/dev/null
      register: timesyncd_info
      changed_when: false
      failed_when: false
      when: ntp_impl.stdout | trim == "timesyncd"

    - name: Parse timesyncd offset (ms)
      ansible.builtin.shell: |
        # timesyncd doesn't expose offset cleanly — use systemd journal instead
        # Fall back to 0 if not available
        journalctl -u systemd-timesyncd --since "1 hour ago" --no-pager 2>/dev/null | \
        grep -oE "offset [+-]?[0-9]+(\.[0-9]+)?(ms|us|s)" | tail -1 | \
        awk '{
          val=$2;
          unit=val; gsub(/[0-9.+-]/,"",unit);   # unit suffix is attached to the value (ms/us/s)
          gsub(/[^0-9.-]/,"",val);
          if (unit=="us") printf "%.3f", val/1000;
          else if (unit=="s") printf "%.3f", val*1000;
          else printf "%.3f", val;
        }' || echo "0"
      register: timesyncd_offset_ms
      changed_when: false
      failed_when: false
      when: ntp_impl.stdout | trim == "timesyncd"

    # ── Get offset (ntpd) ─────────────────────────────────────────────
    - name: Get ntpq peers
      ansible.builtin.shell: ntpq -pn 2>/dev/null | head -10
      register: ntpq_peers
      changed_when: false
      failed_when: false
      when: ntp_impl.stdout | trim == "ntpd"

    - name: Parse ntpq offset (ms)
      ansible.builtin.shell: |
        # offset is column 9 in ntpq -p output (milliseconds); '*' marks the selected peer
        ntpq -p 2>/dev/null | awk 'NR>2 && /^\*/ {printf "%.3f", $9; exit}' || echo "0"
      register: ntpq_offset_ms
      changed_when: false
      failed_when: false
      when: ntp_impl.stdout | trim == "ntpd"

    # ── Consolidate offset ────────────────────────────────────────────
    - name: Set unified offset fact
      ansible.builtin.set_fact:
        ntp_offset_ms: >-
          {{
            (chrony_offset_ms.stdout | default('0')) | float
            if ntp_impl.stdout | trim == 'chrony'
            else (timesyncd_offset_ms.stdout | default('0')) | float
            if ntp_impl.stdout | trim == 'timesyncd'
            else (ntpq_offset_ms.stdout | default('0')) | float
          }}
        ntp_raw_info: >-
          {{
            chrony_tracking.stdout | default('')
            if ntp_impl.stdout | trim == 'chrony'
            else timesyncd_info.stdout | default('')
            if ntp_impl.stdout | trim == 'timesyncd'
            else ntpq_peers.stdout | default('')
          }}

    - name: Determine sync status
      ansible.builtin.set_fact:
        # set_fact stores the offset as a string, so re-cast before abs/compare
        ntp_status: >-
          {{
            'CRITICAL' if (ntp_offset_ms | float | abs) >= critical_offset_ms
            else 'WARN' if (ntp_offset_ms | float | abs) >= warn_offset_ms
            else 'OK'
          }}

    # ── Per-host summary ──────────────────────────────────────────────
    - name: Display NTP summary
      ansible.builtin.debug:
        msg: |
          ═══ {{ inventory_hostname }} ═══
          NTP daemon : {{ ntp_impl.stdout | trim | default('unknown') }}
          Offset     : {{ ntp_offset_ms }} ms
          Status     : {{ ntp_status }}
          Details    :
          {{ ntp_raw_info | indent(2) }}

    # ── Alert on warn/critical ────────────────────────────────────────
    - name: Send ntfy alert for NTP issues
      ansible.builtin.uri:
        url: "{{ ntfy_url }}"
        method: POST
        body: "NTP {{ ntp_status }} on {{ inventory_hostname }}: offset={{ ntp_offset_ms }}ms (threshold={{ warn_offset_ms }}ms)"
        headers:
          Title: "Homelab NTP Alert"
          Priority: "{{ 'urgent' if ntp_status == 'CRITICAL' else 'high' }}"
          Tags: "warning,clock"
        body_format: raw
        status_code: [200, 204]
      delegate_to: localhost
      failed_when: false
      when: ntp_status in ['WARN', 'CRITICAL']

    # ── Write JSON report ─────────────────────────────────────────────
    - name: Write NTP report
      ansible.builtin.copy:
        content: "{{ {'host': inventory_hostname, 'timestamp': ansible_date_time.iso8601, 'ntp_daemon': ntp_impl.stdout | trim, 'offset_ms': ntp_offset_ms, 'status': ntp_status} | to_nice_json }}"
        dest: "{{ report_dir }}/{{ inventory_hostname }}_{{ ansible_date_time.date }}.json"
      delegate_to: localhost
      changed_when: false
```
**Step 2: Validate syntax**
```bash
ansible-playbook --syntax-check -i hosts.ini playbooks/ntp_check.yml
```
Expected: no errors.
**Step 3: Run against one host**
```bash
ansible-playbook -i hosts.ini playbooks/ntp_check.yml --limit homelab
```
Expected: NTP daemon detected, offset printed, status OK/WARN/CRITICAL.
**Step 4: Run across all hosts**
```bash
ansible-playbook -i hosts.ini playbooks/ntp_check.yml
```
Expected: Summary for every active host. Synology hosts may report `unknown` for daemon — that's acceptable (they have NTP but expose it differently).
**Step 5: Commit**
```bash
git add playbooks/ntp_check.yml
git commit -m "feat: add ntp_check playbook for time sync drift auditing across all hosts"
```
---
## Task 5: `cron_audit.yml` — Scheduled task inventory
**Files:**
- Create: `playbooks/cron_audit.yml`
**What it does:** Inventories all scheduled tasks across every host: system crontabs, user crontabs, and systemd timer units. Flags potential security issues (root cron jobs referencing world-writable paths, missing-file paths). Outputs per-host JSON.
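The world-writable test keys off the last octal permission digit: values 2, 3, 6, and 7 have the "other write" bit set. A self-contained demonstration on a throwaway file (Linux `stat -c` assumed, as in the playbook itself):

```shell
# Create a throwaway file and make it world-writable (mode 646)
f=$(mktemp)
chmod 646 "$f"
perms=$(stat -c "%a" "$f")
# A last digit in {2,3,6,7} means o+w is set
if echo "$perms" | grep -qE "^[0-9]+[2367]$"; then
  echo "FLAGGED: mode $perms is world-writable"
fi
rm -f "$f"
# → FLAGGED: mode 646 is world-writable
```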
**Step 1: Create the playbook**
```yaml
---
# Cron and Scheduled Task Audit
# Inventories crontabs and systemd timers across all hosts.
# Flags security concerns: root crons with world-writable path references.
#
# Usage: ansible-playbook -i hosts.ini playbooks/cron_audit.yml
# Usage: ansible-playbook -i hosts.ini playbooks/cron_audit.yml --limit homelab
- name: Cron and Scheduled Task Audit
hosts: "{{ host_target | default('active') }}"
gather_facts: yes
ignore_unreachable: true
vars:
report_dir: "/tmp/cron_audit"
tasks:
- name: Create audit report directory
ansible.builtin.file:
path: "{{ report_dir }}"
state: directory
mode: '0755'
delegate_to: localhost
run_once: true
# ── System crontabs ───────────────────────────────────────────────
- name: Read /etc/crontab
ansible.builtin.shell: cat /etc/crontab 2>/dev/null || echo "(not present)"
register: etc_crontab
changed_when: false
failed_when: false
- name: Read /etc/cron.d/ entries
ansible.builtin.shell: |
for f in /etc/cron.d/*; do
[ -f "$f" ] || continue
echo "=== $f ==="
cat "$f"
echo ""
done
register: cron_d_entries
changed_when: false
failed_when: false
- name: Read /etc/cron.{hourly,daily,weekly,monthly} scripts
ansible.builtin.shell: |
for dir in hourly daily weekly monthly; do
path="/etc/cron.$dir"
[ -d "$path" ] || continue
scripts=$(ls "$path" 2>/dev/null)
if [ -n "$scripts" ]; then
echo "=== /etc/cron.$dir ==="
echo "$scripts"
fi
done
register: cron_dirs
changed_when: false
failed_when: false
# ── User crontabs ─────────────────────────────────────────────────
- name: List users with crontabs
ansible.builtin.shell: |
if [ -d /var/spool/cron/crontabs ]; then
ls /var/spool/cron/crontabs/ 2>/dev/null
elif [ -d /var/spool/cron ]; then
ls /var/spool/cron/ 2>/dev/null | grep -v atjobs
else
echo "(crontab spool not found)"
fi
register: users_with_crontabs
changed_when: false
failed_when: false
- name: Dump user crontabs
ansible.builtin.shell: |
spool_dir=""
[ -d /var/spool/cron/crontabs ] && spool_dir=/var/spool/cron/crontabs
[ -d /var/spool/cron ] && [ -z "$spool_dir" ] && spool_dir=/var/spool/cron
if [ -z "$spool_dir" ]; then
echo "(no spool directory found)"
exit 0
fi
for user_file in "$spool_dir"/*; do
[ -f "$user_file" ] || continue
user=$(basename "$user_file")
echo "=== crontab for: $user ==="
cat "$user_file" 2>/dev/null
echo ""
done
register: user_crontabs
changed_when: false
failed_when: false
# ── Systemd timers ────────────────────────────────────────────────
- name: List systemd timers
ansible.builtin.shell: |
if command -v systemctl >/dev/null 2>&1; then
systemctl list-timers --all --no-pager 2>/dev/null || echo "(systemd not available)"
else
echo "(not a systemd host)"
fi
register: systemd_timers
changed_when: false
failed_when: false
# ── Security flags ────────────────────────────────────────────────
- name: Flag root cron entries referencing world-writable paths
ansible.builtin.shell: |
# Gather all root cron entries
tmpfile=$(mktemp)
{
cat /etc/crontab 2>/dev/null
cat /etc/cron.d/* 2>/dev/null
spool=""
[ -d /var/spool/cron/crontabs ] && spool=/var/spool/cron/crontabs
[ -d /var/spool/cron ] && [ -z "$spool" ] && spool=/var/spool/cron
[ -n "$spool" ] && cat "$spool/root" 2>/dev/null
} | grep -v "^#" | grep -v "^$" > "$tmpfile"
found=0
while IFS= read -r line; do
# The command starts at field 6 (user crontab format) or field 7
# (system crontab format, where field 6 is the user). Check both;
# a username will never pass the -f test below.
for cmd in $(echo "$line" | awk '{print $6; print $7}'); do
if [ -n "$cmd" ] && [ -f "$cmd" ]; then
perms=$(stat -c "%a" "$cmd" 2>/dev/null || echo "")
# World-writable: the "others" octal digit has the write bit (2) set
if echo "$perms" | grep -qE "^[0-7]{2,3}[2367]$"; then
echo "FLAGGED: $cmd is world-writable — used in cron: $line"
found=$((found+1))
fi
fi
done
done < "$tmpfile"
rm -f "$tmpfile"
[ "$found" -eq 0 ] && echo "No world-writable cron script paths found"
exit 0
register: security_flags
changed_when: false
failed_when: false
# ── Summary ───────────────────────────────────────────────────────
- name: Display cron audit summary
ansible.builtin.debug:
msg: |
═══ Cron Audit — {{ inventory_hostname }} ═══
/etc/crontab:
{{ etc_crontab.stdout | default('(empty)') | indent(2) }}
/etc/cron.d/:
{{ cron_d_entries.stdout | default('(empty)') | indent(2) }}
Cron directories (/etc/cron.{hourly,daily,weekly,monthly}):
{{ cron_dirs.stdout | default('(empty)') | indent(2) }}
Users with crontabs: {{ users_with_crontabs.stdout | default('(none)') | trim }}
User crontab contents:
{{ user_crontabs.stdout | default('(none)') | indent(2) }}
Systemd timers:
{{ systemd_timers.stdout | default('(none)') | indent(2) }}
Security flags:
{{ security_flags.stdout | default('(none)') | indent(2) }}
# ── Write JSON report ─────────────────────────────────────────────
- name: Write cron audit report
ansible.builtin.copy:
content: "{{ {'host': inventory_hostname, 'timestamp': ansible_date_time.iso8601, 'etc_crontab': etc_crontab.stdout | default(''), 'cron_d': cron_d_entries.stdout | default(''), 'cron_dirs': cron_dirs.stdout | default(''), 'users_with_crontabs': users_with_crontabs.stdout | default(''), 'user_crontabs': user_crontabs.stdout | default(''), 'systemd_timers': systemd_timers.stdout | default(''), 'security_flags': security_flags.stdout | default('')} | to_nice_json }}"
dest: "{{ report_dir }}/{{ inventory_hostname }}_{{ ansible_date_time.date }}.json"
delegate_to: localhost
changed_when: false
```
**Step 2: Validate syntax**
```bash
ansible-playbook --syntax-check -i hosts.ini playbooks/cron_audit.yml
```
Expected: no errors.
**Step 3: Run against one host**
```bash
ansible-playbook -i hosts.ini playbooks/cron_audit.yml --limit homelab
```
Expected: Cron entries and systemd timers displayed. Security flags report shown.
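Before running fleet-wide, the report a single-host run writes can be spot-checked locally. A minimal sketch using only stdlib Python for JSON validation — the sample file below mirrors the report schema with hypothetical values (a real run names the file `<host>_<date>.json`):

```shell
# Seed a sample report matching the schema the playbook writes
# (hypothetical host and values, for illustration only).
mkdir -p /tmp/cron_audit
cat > /tmp/cron_audit/homelab_sample.json <<'EOF'
{"host": "homelab", "security_flags": "No world-writable cron script paths found"}
EOF

# Validate the shape and print the security findings.
python3 - <<'PY'
import json

with open("/tmp/cron_audit/homelab_sample.json") as f:
    report = json.load(f)

# Every report must carry at least the host and the security findings.
assert "host" in report and "security_flags" in report
print(report["security_flags"])
PY
```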
**Step 4: Run across all hosts**
```bash
ansible-playbook -i hosts.ini playbooks/cron_audit.yml
```
Expected: Summary per host. Reports written to `/tmp/cron_audit/`.
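Once reports exist for every host, flagged findings can be triaged with a quick scan of the report directory. A sketch assuming the `/tmp/cron_audit/` layout above; the seeded demo file is hypothetical, only there so the scan has something to find:

```shell
# Seed one demo report containing a flagged entry
# (filename and content are hypothetical).
mkdir -p /tmp/cron_audit
printf '%s\n' '{"host": "pve", "security_flags": "FLAGGED: /usr/local/bin/backup.sh is world-writable"}' \
  > /tmp/cron_audit/pve_demo.json

# List every report that contains security findings.
for report in /tmp/cron_audit/*.json; do
  [ -f "$report" ] || continue
  grep -l "FLAGGED:" "$report" 2>/dev/null || true
done
```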
**Step 5: Commit**
```bash
git add playbooks/cron_audit.yml
git commit -m "feat: add cron_audit playbook for scheduled task inventory across all hosts"
```
---
## Task 6: Update README.md
**Files:**
- Modify: `README.md`
**Step 1: Add the 5 new playbooks to the relevant tables in README.md**
Add to the Health & Monitoring table:
```markdown
| **`network_connectivity.yml`** | 🆕 Full mesh Tailscale + SSH + HTTP endpoint health | Daily | ✅ |
| **`ntp_check.yml`** | 🆕 Time sync drift audit with ntfy alerts | Daily | ✅ |
```
Add a new "Platform Management" section (after Advanced Container Management):
```markdown
### 🖥️ Platform Management (3 playbooks)
| Playbook | Purpose | Usage | Multi-System |
|----------|---------|-------|--------------|
| `synology_health.yml` | Synology NAS health (DSM, RAID, Tailscale) | Monthly | Synology only |
| **`proxmox_management.yml`** | 🆕 PVE VM/LXC inventory, storage pools, snapshots | Weekly | PVE only |
| **`truenas_health.yml`** | 🆕 ZFS pool health, scrub, SMART, app status | Weekly | TrueNAS only |
```
Add to the Security & Maintenance table:
```markdown
| **`cron_audit.yml`** | 🆕 Scheduled task inventory + security flags | Monthly | ✅ |
```
**Step 2: Update the total playbook count at the bottom**
Change: `33 playbooks` → `38 playbooks`
**Step 3: Commit**
```bash
git add README.md
git commit -m "docs: update README with 5 new playbooks"
```