homelab-optimized/ansible/automation/docs/plans/2026-02-21-new-playbooks-implementation.md
Sanitized mirror from private repository - 2026-03-30 00:10:29 UTC

New Playbooks Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Goal: Add 5 new Ansible playbooks covering network connectivity health, Proxmox management, TrueNAS health, NTP sync auditing, and cron job inventory.

Architecture: Each playbook is standalone, follows existing patterns (read-only shell tasks with changed_when: false, failed_when: false for non-fatal checks, ntfy alerting via ntfy_url var, JSON reports in /tmp/<category>_reports/). Platform detection is done inline via command availability checks rather than Ansible facts to keep cross-platform compatibility with Synology/TrueNAS.
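The inline detection idiom (used below for Tailscale, k3s, midclt, and so on) can be sketched as a small shell helper; the candidate paths here are illustrative, not exhaustive:

```shell
# Return the first executable among the candidate paths, or empty output
# if none exists -- downstream tasks treat "" as "not installed".
detect_binary() {
  for p in "$@"; do
    if [ -x "$p" ]; then
      echo "$p"
      return 0
    fi
  done
  echo ""
}

# Example: Tailscale lives in different places on Debian vs Synology.
detect_binary /usr/bin/tailscale /var/packages/Tailscale/target/bin/tailscale
```

This is why the playbooks skip `ansible_facts`-based OS branching: a missing binary simply yields an empty string, and later tasks gate on `stdout | length > 0`.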

Tech Stack: Ansible, bash shell commands, Tailscale CLI, Proxmox qm/pct/pvesh CLI, ZFS zpool/zfs tools, chronyc/timedatectl, smartctl, standard POSIX cron paths.


Conventions to Follow (read this first)

These patterns appear in every existing playbook — match them exactly:

# Read-only tasks always have:
changed_when: false
failed_when: false   # (or ignore_errors: yes)

# Report directories:
delegate_to: localhost
run_once: true

# Variable defaults:
my_var: "{{ my_var | default('fallback') }}"

# Module names use fully-qualified form:
ansible.builtin.shell
ansible.builtin.debug
ansible.builtin.assert

# ntfy alerting (used in alert_check.yml — copy that pattern):
ntfy_url: "{{ ntfy_url | default('https://ntfy.sh/REDACTED_TOPIC') }}"

Reference files to read before each task:

  • playbooks/synology_health.yml — pattern for platform-specific health checks
  • playbooks/tailscale_health.yml — pattern for binary detection + JSON parsing
  • playbooks/disk_usage_report.yml — pattern for threshold variables + report dirs
  • playbooks/alert_check.yml — pattern for ntfy notifications

Task 1: network_connectivity.yml — Full mesh connectivity check

Files:

  • Create: playbooks/network_connectivity.yml

What it does: For every host in inventory, check Tailscale is Running, ping all other hosts by their ansible_host IP, test SSH port reachability, and verify HTTP endpoints for key services. Outputs a connectivity matrix and sends ntfy alert on failures.

Step 1: Create the playbook file

---
# Network Connectivity Health Check
# Verifies Tailscale mesh connectivity between all inventory hosts
# and checks HTTP/HTTPS endpoints for key services.
#
# Usage: ansible-playbook -i hosts.ini playbooks/network_connectivity.yml
# Usage: ansible-playbook -i hosts.ini playbooks/network_connectivity.yml --limit homelab

- name: Network Connectivity Health Check
  hosts: "{{ host_target | default('active') }}"
  gather_facts: yes
  ignore_unreachable: true
  vars:
    report_dir: "/tmp/connectivity_reports"
    ts_candidates:
      - /usr/bin/tailscale
      - /var/packages/Tailscale/target/bin/tailscale
    warn_on_failure: true
    ntfy_url: "{{ ntfy_url | default('https://ntfy.sh/REDACTED_TOPIC') }}"

    # HTTP endpoints to verify — add/remove per your services
    http_endpoints:
      - name: Portainer (homelab)
        url: "http://100.67.40.126:9000"
      - name: Gitea (homelab)
        url: "http://100.67.40.126:3000"
      - name: Immich (homelab)
        url: "http://100.67.40.126:2283"
      - name: Home Assistant
        url: "http://100.112.186.90:8123"

  tasks:
    - name: Create connectivity report directory
      ansible.builtin.file:
        path: "{{ report_dir }}"
        state: directory
        mode: '0755'
      delegate_to: localhost
      run_once: true

    # ── Tailscale status ──────────────────────────────────────────────
    - name: Detect Tailscale binary
      ansible.builtin.shell: |
        for p in {{ ts_candidates | join(' ') }}; do
          [ -x "$p" ] && echo "$p" && exit 0
        done
        echo ""
      register: ts_bin
      changed_when: false
      failed_when: false

    - name: Get Tailscale status JSON
      ansible.builtin.command: "{{ ts_bin.stdout }} status --json"
      register: ts_status_raw
      changed_when: false
      failed_when: false
      when: ts_bin.stdout | length > 0

    - name: Parse Tailscale state
      ansible.builtin.set_fact:
        ts_parsed: "{{ ts_status_raw.stdout | from_json }}"
        ts_backend: "{{ (ts_status_raw.stdout | from_json).BackendState | default('unknown') }}"
        ts_ip: "{{ ((ts_status_raw.stdout | from_json).Self.TailscaleIPs | default([]) | first) | default('n/a') }}"
      when:
        - ts_bin.stdout | length > 0
        - ts_status_raw.rc | default(1) == 0
        - ts_status_raw.stdout | default('') | length > 0
        - ts_status_raw.stdout is search('{')
      ignore_errors: yes   # template errors in set_fact bypass failed_when

    # ── Peer reachability (ping each inventory host by Tailscale IP) ──
    - name: Ping all inventory hosts
      ansible.builtin.shell: |
        ping -c 2 -W 2 {{ hostvars[item]['ansible_host'] }} > /dev/null 2>&1 && echo "OK" || echo "FAIL"
      register: ping_results
      changed_when: false
      failed_when: false
      loop: "{{ groups['active'] | select('ne', inventory_hostname) | list }}"
      loop_control:
        label: "{{ item }}"

    - name: Summarise ping results
      ansible.builtin.set_fact:
        ping_summary: "{{ ping_summary | default({}) | combine({item.item: item.stdout | trim}) }}"
      loop: "{{ ping_results.results }}"
      loop_control:
        label: "{{ item.item }}"

    # ── SSH port check ────────────────────────────────────────────────
    - name: Check SSH port on all inventory hosts
      ansible.builtin.shell: |
        port="{{ hostvars[item]['ansible_port'] | default(22) }}"
        nc -zw3 {{ hostvars[item]['ansible_host'] }} "$port" > /dev/null 2>&1 && echo "OK" || echo "FAIL"
      register: ssh_port_results
      changed_when: false
      failed_when: false
      loop: "{{ groups['active'] | select('ne', inventory_hostname) | list }}"
      loop_control:
        label: "{{ item }}"

    - name: Summarise SSH port results
      ansible.builtin.set_fact:
        ssh_summary: "{{ ssh_summary | default({}) | combine({item.item: item.stdout | trim}) }}"
      loop: "{{ ssh_port_results.results }}"
      loop_control:
        label: "{{ item.item }}"

    # ── HTTP endpoint checks (run once from localhost) ────────────────
    - name: Check HTTP endpoints
      ansible.builtin.uri:
        url: "{{ item.url }}"
        method: GET
        status_code: [200, 301, 302, 401, 403]
        timeout: 5
        validate_certs: false
      register: http_results
      failed_when: false
      loop: "{{ http_endpoints }}"
      loop_control:
        label: "{{ item.name }}"
      delegate_to: localhost
      run_once: true

    # ── Connectivity summary ──────────────────────────────────────────
    - name: Display connectivity summary per host
      ansible.builtin.debug:
        msg: |
          ═══ {{ inventory_hostname }} ═══
          Tailscale: {{ ts_backend | default('not installed') }} | IP: {{ ts_ip | default('n/a') }}
          Peer ping results:
          {% for host, result in (ping_summary | default({})).items() %}
            {{ host }}: {{ result }}
          {% endfor %}
          SSH port results:
          {% for host, result in (ssh_summary | default({})).items() %}
            {{ host }}: {{ result }}
          {% endfor %}

    - name: Display HTTP endpoint results
      ansible.builtin.debug:
        msg: |
          ═══ HTTP Endpoint Health ═══
          {% for item in http_results.results | default([]) %}
          {{ item.item.name }}: {{ 'OK (' + (item.status | string) + ')' if item.status is defined and item.status > 0 else 'FAIL' }}
          {% endfor %}
      run_once: true
      delegate_to: localhost

    # ── Alert on failures ─────────────────────────────────────────────
    - name: Collect failed peers
      ansible.builtin.set_fact:
        failed_peers: >-
          {{ ping_summary | default({}) | dict2items
             | selectattr('value', 'eq', 'FAIL')
             | map(attribute='key') | list }}

    - name: Send ntfy alert for connectivity failures
      ansible.builtin.uri:
        url: "{{ ntfy_url }}"
        method: POST
        body: "Connectivity failures on {{ inventory_hostname }}: {{ failed_peers | join(', ') }}"
        headers:
          Title: "Homelab Network Alert"
          Priority: "high"
          Tags: "warning,network"
        body_format: raw
        status_code: [200, 204]
      delegate_to: localhost
      failed_when: false
      when:
        - warn_on_failure | bool
        - failed_peers | length > 0

    # ── Write JSON report ─────────────────────────────────────────────
    - name: Write connectivity report
      ansible.builtin.copy:
        content: "{{ {'host': inventory_hostname, 'timestamp': ansible_date_time.iso8601, 'tailscale_state': ts_backend | default('unknown'), 'tailscale_ip': ts_ip | default('n/a'), 'ping': ping_summary | default({}), 'ssh_port': ssh_summary | default({})} | to_nice_json }}"
        dest: "{{ report_dir }}/{{ inventory_hostname }}_{{ ansible_date_time.date }}.json"
      delegate_to: localhost
      changed_when: false

Step 2: Validate YAML syntax

cd /home/homelab/organized/repos/homelab/ansible/automation
ansible-playbook --syntax-check -i hosts.ini playbooks/network_connectivity.yml

Expected: playbook: playbooks/network_connectivity.yml with no errors.

Step 3: Dry-run against one host

ansible-playbook -i hosts.ini playbooks/network_connectivity.yml --limit homelab --check

Expected: Tasks run, no failures. Some tasks will report skipped (when conditions, etc.) — that's fine.

Step 4: Run for real against one host

ansible-playbook -i hosts.ini playbooks/network_connectivity.yml --limit homelab

Expected: Connectivity summary printed, report written to /tmp/connectivity_reports/homelab_<date>.json.

Step 5: Run against all active hosts

ansible-playbook -i hosts.ini playbooks/network_connectivity.yml

Expected: Summary for every host in [active] group. Unreachable hosts are handled gracefully (skipped, not errored).

Step 6: Commit

git add playbooks/network_connectivity.yml
git commit -m "feat: add network_connectivity playbook for full mesh health check"

Task 2: proxmox_management.yml — Proxmox VM/LXC inventory and health

Files:

  • Create: playbooks/proxmox_management.yml

What it does: Targets the pve host. Reports VM inventory (qm list), LXC inventory (pct list), node resource summary, storage pool status, and last 10 task log entries. Optional snapshot action via -e action=snapshot -e vm_id=100.

Note: pve uses ansible_user=root (see hosts.ini), so become: false is correct here — root already has all access.
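One pitfall in the counting steps below: `grep -c` prints a count of `0` (while exiting non-zero) when nothing matches, so chaining `|| echo 0` would print a second zero. A minimal sketch of a safe counter:

```shell
# Count lines containing "running"; grep -c itself emits 0 on no match,
# so only the non-zero exit status needs neutralising.
count_running() {
  grep -c "running" || true
}

printf '100  vm1  running\n101  vm2  stopped\n' | count_running   # prints 1
```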

Step 1: Create the playbook

---
# Proxmox VE Management Playbook
# Reports VM/LXC inventory, resource usage, storage pool status, and recent tasks.
# Optionally creates a snapshot with -e action=snapshot -e vm_id=100
#
# Usage: ansible-playbook -i hosts.ini playbooks/proxmox_management.yml
# Usage: ansible-playbook -i hosts.ini playbooks/proxmox_management.yml -e action=snapshot -e vm_id=100

- name: Proxmox VE Management
  hosts: pve
  gather_facts: yes
  become: false
  vars:
    action: "{{ action | default('status') }}"   # status | snapshot
    vm_id: "{{ vm_id | default('') }}"
    report_dir: "/tmp/health_reports"

  tasks:
    - name: Create report directory
      ansible.builtin.file:
        path: "{{ report_dir }}"
        state: directory
        mode: '0755'
      delegate_to: localhost
      run_once: true

    # ── Node overview ─────────────────────────────────────────────────
    - name: Get PVE version
      ansible.builtin.command: pveversion
      register: pve_version
      changed_when: false
      failed_when: false

    - name: Get node resource summary
      ansible.builtin.shell: |
        pvesh get /nodes/$(hostname)/status --output-format json 2>/dev/null || \
          echo '{"error": "pvesh not available"}'
      register: node_status_raw
      changed_when: false
      failed_when: false

    - name: Parse node status
      ansible.builtin.set_fact:
        node_status: "{{ node_status_raw.stdout | from_json }}"
      ignore_errors: yes   # template errors in set_fact bypass failed_when
      when: node_status_raw.stdout | default('') | length > 0

    # ── VM inventory ──────────────────────────────────────────────────
    - name: List all VMs
      ansible.builtin.command: qm list
      register: vm_list
      changed_when: false
      failed_when: false

    - name: List all LXC containers
      ansible.builtin.command: pct list
      register: lxc_list
      changed_when: false
      failed_when: false

    - name: Count running VMs
      ansible.builtin.shell: |
        # grep -c prints 0 itself on no match (exit 1), so only neutralise the exit code
        qm list 2>/dev/null | grep -c "running" || true
      register: vm_running_count
      changed_when: false
      failed_when: false

    - name: Count running LXCs
      ansible.builtin.shell: |
        pct list 2>/dev/null | grep -c "running" || true
      register: lxc_running_count
      changed_when: false
      failed_when: false

    # ── Storage pools ─────────────────────────────────────────────────
    - name: Get storage pool status
      ansible.builtin.shell: |
        pvesh get /nodes/$(hostname)/storage --output-format json 2>/dev/null | \
          python3 -c "
import json,sys
data=json.load(sys.stdin)
for s in data:
    used_pct = round(s.get('used',0) / s.get('total',1) * 100, 1) if s.get('total',0) > 0 else 0
    print(f\"{s.get('storage','?'):20} {s.get('type','?'):10} used={used_pct}%  avail={round(s.get('avail',0)/1073741824,1)}GiB\")
" 2>/dev/null || pvesm status 2>/dev/null || echo "Storage info unavailable"
      register: storage_status
      changed_when: false
      failed_when: false

    # ── Recent task log ───────────────────────────────────────────────
    - name: Get recent PVE tasks
      ansible.builtin.shell: |
        pvesh get /nodes/$(hostname)/tasks \
          --limit 10 \
          --output-format json 2>/dev/null | \
          python3 -c "
import json,sys,datetime
tasks=json.load(sys.stdin)
for t in tasks:
    ts=datetime.datetime.fromtimestamp(t.get('starttime',0)).strftime('%Y-%m-%d %H:%M')
    status=t.get('status','?')
    upid=t.get('upid','?')
    print(f'{ts}  {status:12}  {upid}')
" 2>/dev/null || echo "Task log unavailable"
      register: recent_tasks
      changed_when: false
      failed_when: false

    # ── Summary output ────────────────────────────────────────────────
    - name: Display Proxmox summary
      ansible.builtin.debug:
        msg: |
          ═══ Proxmox VE — {{ inventory_hostname }} ═══
          Version: {{ pve_version.stdout | default('unknown') }}

          VMs:  {{ vm_running_count.stdout | trim }} running
          {{ vm_list.stdout | default('(no VMs)') | indent(2) }}

          LXCs: {{ lxc_running_count.stdout | trim }} running
          {{ lxc_list.stdout | default('(no LXCs)') | indent(2) }}

          Storage Pools:
          {{ storage_status.stdout | default('n/a') | indent(2) }}

          Recent Tasks (last 10):
          {{ recent_tasks.stdout | default('n/a') | indent(2) }}

    # ── Optional: snapshot a VM ───────────────────────────────────────
    - name: Create VM snapshot
      ansible.builtin.shell: |
        snap_name="ansible-snap-$(date +%Y%m%d-%H%M%S)"
        qm snapshot {{ vm_id }} "$snap_name" --description "Ansible automated snapshot"
        echo "Snapshot created: $snap_name for VM {{ vm_id }}"
      register: snapshot_result
      when:
        - action == "snapshot"
        - vm_id | string | length > 0
      changed_when: true

    - name: Show snapshot result
      ansible.builtin.debug:
        msg: "{{ snapshot_result.stdout | default('No snapshot taken') }}"
      when: action == "snapshot"

    # ── Write JSON report ─────────────────────────────────────────────
    - name: Write Proxmox report
      ansible.builtin.copy:
        content: "{{ {'host': inventory_hostname, 'timestamp': ansible_date_time.iso8601, 'version': pve_version.stdout | default('unknown'), 'vms_running': vm_running_count.stdout | trim, 'lxcs_running': lxc_running_count.stdout | trim, 'storage': storage_status.stdout | default(''), 'tasks': recent_tasks.stdout | default('')} | to_nice_json }}"
        dest: "{{ report_dir }}/proxmox_{{ ansible_date_time.date }}.json"
      delegate_to: localhost
      changed_when: false

Step 2: Validate syntax

ansible-playbook --syntax-check -i hosts.ini playbooks/proxmox_management.yml

Expected: no errors.

Step 3: Run against pve

ansible-playbook -i hosts.ini playbooks/proxmox_management.yml

Expected: Proxmox summary table printed. JSON report written to /tmp/health_reports/proxmox_<date>.json.

Step 4: Test snapshot action (optional — only if you have a test VM)

# Replace 100 with a real VM ID from the qm list output above
ansible-playbook -i hosts.ini playbooks/proxmox_management.yml -e action=snapshot -e vm_id=100

Expected: Snapshot created: ansible-snap-<timestamp> for VM 100

Step 5: Commit

git add playbooks/proxmox_management.yml
git commit -m "feat: add proxmox_management playbook for PVE VM/LXC inventory and health"

Task 3: truenas_health.yml — TrueNAS SCALE ZFS and app health

Files:

  • Create: playbooks/truenas_health.yml

What it does: Targets truenas-scale. Checks ZFS pool health, scrub status, dataset usage, SMART disk status, and running TrueNAS apps (k3s-based). Flags degraded/faulted pools. Mirrors synology_health.yml structure.

Note: TrueNAS SCALE runs on Debian. The vish user needs sudo for smartctl and zpool. Check host_vars/truenas-scale.yml; ansible_become: true is set in group_vars/homelab_linux.yml, which covers all hosts.
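If the grep over `zpool status` output ever proves brittle, the same degraded-pool count can be derived from `zpool list -H -o name,health`, which emits one tab-separated `name<TAB>health` row per pool. A hedged sketch (parsing sample output, since no pool is available here):

```shell
# Count pools whose health column is anything other than ONLINE.
count_unhealthy_pools() {
  awk -F'\t' 'NF >= 2 && $2 != "ONLINE" {n++} END {print n+0}'
}

# Sample of what `zpool list -H -o name,health` emits:
printf 'tank\tONLINE\nbackup\tDEGRADED\n' | count_unhealthy_pools   # prints 1
```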

Step 1: Create the playbook

---
# TrueNAS SCALE Health Check
# Checks ZFS pool status, scrub health, dataset usage, SMART disk status, and app state.
# Mirrors synology_health.yml but for TrueNAS SCALE (Debian-based with ZFS).
#
# Usage: ansible-playbook -i hosts.ini playbooks/truenas_health.yml

- name: TrueNAS SCALE Health Check
  hosts: truenas-scale
  gather_facts: yes
  become: true
  vars:
    disk_warn_pct: 80
    disk_critical_pct: 90
    report_dir: "/tmp/health_reports"

  tasks:
    - name: Create report directory
      ansible.builtin.file:
        path: "{{ report_dir }}"
        state: directory
        mode: '0755'
      delegate_to: localhost
      run_once: true

    # ── System overview ───────────────────────────────────────────────
    - name: Get system uptime
      ansible.builtin.command: uptime -p
      register: uptime_out
      changed_when: false
      failed_when: false

    - name: Get TrueNAS version
      ansible.builtin.shell: |
        cat /etc/version 2>/dev/null || \
          midclt call system.version 2>/dev/null || \
          echo "version unavailable"
      register: truenas_version
      changed_when: false
      failed_when: false

    # ── ZFS pool health ───────────────────────────────────────────────
    - name: Get ZFS pool status
      ansible.builtin.command: zpool status -v
      register: zpool_status
      changed_when: false
      failed_when: false

    - name: Get ZFS pool list (usage)
      ansible.builtin.command: zpool list -H
      register: zpool_list
      changed_when: false
      failed_when: false

    - name: Check for degraded or faulted pools
      ansible.builtin.shell: |
        zpool status 2>/dev/null | grep -E "state:\s*(DEGRADED|FAULTED|OFFLINE|REMOVED)" | wc -l
      register: pool_errors
      changed_when: false
      failed_when: false

    - name: Assert no degraded pools
      ansible.builtin.assert:
        that:
          - (pool_errors.stdout | trim | int) == 0
        success_msg: "All ZFS pools ONLINE"
        fail_msg: "DEGRADED or FAULTED pool detected — run: zpool status"
      changed_when: false
      ignore_errors: yes

    # ── ZFS scrub status ──────────────────────────────────────────────
    - name: Get last scrub info per pool
      ansible.builtin.shell: |
        for pool in $(zpool list -H -o name 2>/dev/null); do
          echo "Pool: $pool"
          zpool status "$pool" 2>/dev/null | grep -E "scrub|scan" | head -3
          echo "---"
        done
      register: scrub_status
      changed_when: false
      failed_when: false

    # ── Dataset usage ─────────────────────────────────────────────────
    - name: Get dataset usage (top-level datasets)
      ansible.builtin.shell: |
        zfs list -H -o name,used,avail,refer,mountpoint -d 1 2>/dev/null | head -20
      register: dataset_usage
      changed_when: false
      failed_when: false

    # ── SMART disk status ─────────────────────────────────────────────
    - name: List physical disks
      ansible.builtin.shell: |
        lsblk -d -o NAME,SIZE,MODEL,SERIAL 2>/dev/null | grep -v "loop\|sr" || \
          ls /dev/sd? /dev/nvme?n? 2>/dev/null
      register: disk_list
      changed_when: false
      failed_when: false

    - name: Check SMART health for each disk
      ansible.builtin.shell: |
        failed=0
        for disk in $(lsblk -d -n -o NAME 2>/dev/null | grep -v "loop\|sr"); do
          result=$(smartctl -H /dev/$disk 2>/dev/null | grep -E "SMART overall-health|PASSED|FAILED" || echo "n/a")
          echo "$disk: $result"
          echo "$result" | grep -q "FAILED" && failed=$((failed+1))
        done
        exit $failed
      register: smart_results
      changed_when: false
      failed_when: false

    # ── TrueNAS apps (k3s) ────────────────────────────────────────────
    - name: Get TrueNAS app status
      ansible.builtin.shell: |
        if command -v k3s >/dev/null 2>&1; then
          k3s kubectl get pods -A --no-headers 2>/dev/null | \
            awk '{print $4}' | sort | uniq -c | sort -rn
        elif command -v midclt >/dev/null 2>&1; then
          midclt call chart.release.query 2>/dev/null | \
            python3 -c "
import json,sys
try:
    apps=json.load(sys.stdin)
    for a in apps:
        print(f\"{a.get('id','?'):30} {a.get('status','?')}\")
except:
    print('App status unavailable')
" 2>/dev/null
        else
          echo "App runtime not detected (k3s/midclt not found)"
        fi
      register: app_status
      changed_when: false
      failed_when: false

    # ── Summary output ────────────────────────────────────────────────
    - name: Display TrueNAS health summary
      ansible.builtin.debug:
        msg: |
          ═══ TrueNAS SCALE — {{ inventory_hostname }} ═══
          Version : {{ truenas_version.stdout | default('unknown') | trim }}
          Uptime  : {{ uptime_out.stdout | default('n/a') }}
          Pool errors: {{ pool_errors.stdout | default('0') | trim }}

          ZFS Pool List:
          {{ zpool_list.stdout | default('(none)') | indent(2) }}

          ZFS Pool Status (degraded/faulted check):
          Degraded pools found: {{ pool_errors.stdout | trim }}

          Scrub Status:
          {{ scrub_status.stdout | default('n/a') | indent(2) }}

          Dataset Usage (top-level):
          {{ dataset_usage.stdout | default('n/a') | indent(2) }}

          SMART Disk Status:
          {{ smart_results.stdout | default('n/a') | indent(2) }}

          TrueNAS Apps:
          {{ app_status.stdout | default('n/a') | indent(2) }}

    # ── Write JSON report ─────────────────────────────────────────────
    - name: Write TrueNAS health report
      ansible.builtin.copy:
        content: "{{ {'host': inventory_hostname, 'timestamp': ansible_date_time.iso8601, 'version': truenas_version.stdout | default('unknown') | trim, 'pool_errors': pool_errors.stdout | trim, 'zpool_list': zpool_list.stdout | default(''), 'scrub': scrub_status.stdout | default(''), 'smart': smart_results.stdout | default(''), 'apps': app_status.stdout | default('')} | to_nice_json }}"
        dest: "{{ report_dir }}/truenas_{{ ansible_date_time.date }}.json"
      delegate_to: localhost
      changed_when: false

Step 2: Validate syntax

ansible-playbook --syntax-check -i hosts.ini playbooks/truenas_health.yml

Expected: no errors.

Step 3: Run against truenas-scale

ansible-playbook -i hosts.ini playbooks/truenas_health.yml

Expected: Health summary printed, pool status shown, SMART results visible. JSON report at /tmp/health_reports/truenas_<date>.json.

Step 4: Commit

git add playbooks/truenas_health.yml
git commit -m "feat: add truenas_health playbook for ZFS pool, scrub, SMART, and app status"

Task 4: ntp_check.yml — Time sync health audit

Files:

  • Create: playbooks/ntp_check.yml

What it does: Checks time sync status across all hosts. Detects which NTP daemon is running, extracts current offset in milliseconds, warns at >500ms, critical at >1000ms. Sends ntfy alert for hosts exceeding warn threshold. Read-only — no config changes.

Platform notes:

  • Ubuntu/Debian: systemd-timesyncd → use timedatectl show-timesync or chronyc tracking
  • Synology: Uses its own NTP, check via /proc/driver/rtc or synoinfo.conf + ntpq -p
  • TrueNAS: Debian-based, likely chrony or systemd-timesyncd
  • Proxmox: Debian-based
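The daemons report offsets in different units (chrony in seconds, the timesyncd journal in `us`/`ms`/`s` suffixes), so everything is normalised to milliseconds. The conversion logic can be sketched as a standalone helper, mirroring the awk used in the timesyncd task:

```shell
# Normalise an offset token such as "+1.234ms", "500us", or "0.5s" to ms.
to_ms() {
  printf '%s' "$1" | awk '{
    val = $0; unit = "ms"
    if (val ~ /us$/)      unit = "us"
    else if (val ~ /ms$/) unit = "ms"
    else if (val ~ /s$/)  unit = "s"
    gsub(/[^0-9.+-]/, "", val)
    if (unit == "us")     printf "%.3f", val / 1000
    else if (unit == "s") printf "%.3f", val * 1000
    else                  printf "%.3f", val
  }'
}

to_ms "+1.234ms"   # prints 1.234
to_ms "0.5s"       # prints 500.000
```

Note the `us` check must come before the `s` check, since every `us` token also ends in `s`.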

Step 1: Create the playbook

---
# NTP Time Sync Health Check
# Audits time synchronization across all hosts. Read-only — no config changes.
# Warns when offset > 500ms, critical > 1000ms.
#
# Usage: ansible-playbook -i hosts.ini playbooks/ntp_check.yml
# Usage: ansible-playbook -i hosts.ini playbooks/ntp_check.yml --limit synology

- name: NTP Time Sync Health Check
  hosts: "{{ host_target | default('active') }}"
  gather_facts: yes
  ignore_unreachable: true
  vars:
    warn_offset_ms: 500
    critical_offset_ms: 1000
    ntfy_url: "{{ ntfy_url | default('https://ntfy.sh/REDACTED_TOPIC') }}"
    report_dir: "/tmp/ntp_reports"

  tasks:
    - name: Create report directory
      ansible.builtin.file:
        path: "{{ report_dir }}"
        state: directory
        mode: '0755'
      delegate_to: localhost
      run_once: true

    # ── Detect NTP daemon ─────────────────────────────────────────────
    - name: Detect active NTP implementation
      ansible.builtin.shell: |
        if command -v chronyc >/dev/null 2>&1 && chronyc tracking >/dev/null 2>&1; then
          echo "chrony"
        elif timedatectl show-timesync 2>/dev/null | grep -q ServerName; then
          echo "timesyncd"
        elif timedatectl 2>/dev/null | grep -q "NTP service: active"; then
          echo "timesyncd"
        elif command -v ntpq >/dev/null 2>&1 && ntpq -p >/dev/null 2>&1; then
          echo "ntpd"
        else
          echo "unknown"
        fi
      register: ntp_impl
      changed_when: false
      failed_when: false

    # ── Get offset (chrony) ───────────────────────────────────────────
    - name: Get chrony tracking info
      ansible.builtin.shell: chronyc tracking 2>/dev/null
      register: chrony_tracking
      changed_when: false
      failed_when: false
      when: ntp_impl.stdout | trim == "chrony"

    - name: Parse chrony offset (ms)
      ansible.builtin.shell: |
        chronyc tracking 2>/dev/null | \
          grep "System time" | \
          awk '{printf "%.3f", $4 * 1000}'
      register: chrony_offset_ms
      changed_when: false
      failed_when: false
      when: ntp_impl.stdout | trim == "chrony"

    - name: Get chrony sync source
      ansible.builtin.shell: |
        chronyc sources -v 2>/dev/null | grep "^\^" | head -3
      register: chrony_sources
      changed_when: false
      failed_when: false
      when: ntp_impl.stdout | trim == "chrony"

    # ── Get offset (systemd-timesyncd) ────────────────────────────────
    - name: Get timesyncd status
      ansible.builtin.shell: timedatectl show-timesync 2>/dev/null || timedatectl 2>/dev/null
      register: timesyncd_info
      changed_when: false
      failed_when: false
      when: ntp_impl.stdout | trim == "timesyncd"

    - name: Parse timesyncd offset (ms)
      ansible.builtin.shell: |
        # timesyncd doesn't expose offset cleanly — use systemd journal instead
        # Fall back to 0 if not available
        journalctl -u systemd-timesyncd --since "1 hour ago" --no-pager 2>/dev/null | \
          grep -oE "offset [+-]?[0-9]+(\.[0-9]+)?(ms|us|s)" | tail -1 | \
          awk '{
            val=$2; unit=$3;
            gsub(/[^0-9.-]/,"",val);
            if (unit=="us") printf "%.3f", val/1000;
            else if (unit=="s") printf "%.3f", val*1000;
            else printf "%.3f", val;
          }' || echo "0"
      register: timesyncd_offset_ms
      changed_when: false
      failed_when: false
      when: ntp_impl.stdout | trim == "timesyncd"

    # ── Get offset (ntpd) ─────────────────────────────────────────────
    - name: Get ntpq peers
      ansible.builtin.shell: ntpq -pn 2>/dev/null | head -10
      register: ntpq_peers
      changed_when: false
      failed_when: false
      when: ntp_impl.stdout | trim == "ntpd"

    - name: Parse ntpq offset (ms)
      ansible.builtin.shell: |
        # offset is column 9 in ntpq -p output (milliseconds)
        ntpq -p 2>/dev/null | awk 'NR>2 && /^\*/ {printf "%.3f", $9; exit}' || echo "0"
      register: ntpq_offset_ms
      changed_when: false
      failed_when: false
      when: ntp_impl.stdout | trim == "ntpd"

    # ── Consolidate offset ────────────────────────────────────────────
    - name: Set unified offset fact
      ansible.builtin.set_fact:
        ntp_offset_ms: >-
          {{
            (chrony_offset_ms.stdout | default('0')) | float
            if ntp_impl.stdout | trim == 'chrony'
            else (timesyncd_offset_ms.stdout | default('0')) | float
            if ntp_impl.stdout | trim == 'timesyncd'
            else (ntpq_offset_ms.stdout | default('0')) | float
          }}
        ntp_raw_info: >-
          {{
            chrony_tracking.stdout | default('')
            if ntp_impl.stdout | trim == 'chrony'
            else timesyncd_info.stdout | default('')
            if ntp_impl.stdout | trim == 'timesyncd'
            else ntpq_peers.stdout | default('')
          }}

    - name: Determine sync status
      ansible.builtin.set_fact:
        # set_fact stores the block-templated offset as a string, so cast before abs
        ntp_status: >-
          {{
            'CRITICAL' if (ntp_offset_ms | float | abs) >= critical_offset_ms
            else 'WARN' if (ntp_offset_ms | float | abs) >= warn_offset_ms
            else 'OK'
          }}

    # ── Per-host summary ──────────────────────────────────────────────
    - name: Display NTP summary
      ansible.builtin.debug:
        msg: |
          ═══ {{ inventory_hostname }} ═══
          NTP daemon : {{ ntp_impl.stdout | trim | default('unknown') }}
          Offset     : {{ ntp_offset_ms }} ms
          Status     : {{ ntp_status }}
          Details    :
          {{ ntp_raw_info | indent(2) }}

    # ── Alert on warn/critical ────────────────────────────────────────
    - name: Send ntfy alert for NTP issues
      ansible.builtin.uri:
        url: "{{ ntfy_url }}"
        method: POST
        body: "NTP {{ ntp_status }} on {{ inventory_hostname }}: offset={{ ntp_offset_ms }}ms (threshold={{ warn_offset_ms }}ms)"
        headers:
          Title: "Homelab NTP Alert"
          Priority: "{{ 'urgent' if ntp_status == 'CRITICAL' else 'high' }}"
          Tags: "warning,clock"
        body_format: raw
        status_code: [200, 204]
      delegate_to: localhost
      failed_when: false
      when: ntp_status in ['WARN', 'CRITICAL']

    # ── Write JSON report ─────────────────────────────────────────────
    - name: Write NTP report
      ansible.builtin.copy:
        content: "{{ {'host': inventory_hostname, 'timestamp': ansible_date_time.iso8601, 'ntp_daemon': ntp_impl.stdout | trim, 'offset_ms': ntp_offset_ms, 'status': ntp_status} | to_nice_json }}"
        dest: "{{ report_dir }}/{{ inventory_hostname }}_{{ ansible_date_time.date }}.json"
      delegate_to: localhost
      changed_when: false

Step 2: Validate syntax

ansible-playbook --syntax-check -i hosts.ini playbooks/ntp_check.yml

Expected: no errors.

Step 3: Run against one host

ansible-playbook -i hosts.ini playbooks/ntp_check.yml --limit homelab

Expected: NTP daemon detected, offset printed, status OK/WARN/CRITICAL.
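When a single host reports a surprising offset, the chrony reading can be reproduced by hand. A sketch of turning `chronyc tracking` output into signed milliseconds (`chronyc_offset_ms` is a hypothetical helper; the playbook's own parsing, defined earlier in the file, may differ):

```shell
#!/bin/sh
# Convert chronyc tracking's "System time" line to signed milliseconds.
# "slow" means the local clock is behind NTP time; reported as negative here.
chronyc_offset_ms() {
  awk '/^System time/ {
    ms = $4 * 1000
    if ($0 ~ /slow/) ms = -ms
    printf "%.3f\n", ms
  }'
}

# Example line in the shape chronyc typically prints:
echo "System time     : 0.002500 seconds slow of NTP time" | chronyc_offset_ms
# prints -2.500
```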

Step 4: Run across all hosts

ansible-playbook -i hosts.ini playbooks/ntp_check.yml

Expected: Summary for every active host. Synology hosts may report unknown for daemon — that's acceptable (they have NTP but expose it differently).

Step 5: Commit

git add playbooks/ntp_check.yml
git commit -m "feat: add ntp_check playbook for time sync drift auditing across all hosts"

Task 5: cron_audit.yml — Scheduled task inventory

Files:

  • Create: playbooks/cron_audit.yml

What it does: Inventories all scheduled tasks across every host: system crontabs, user crontabs, and systemd timer units. Flags a potential security issue: root cron jobs that reference world-writable script paths. Outputs per-host JSON.
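The world-writable flag reduces to one permission-bit test: does the last octal digit of the file's mode include the others-write bit? A standalone sketch of that heuristic (`is_world_writable` is an illustrative helper; `stat -c %a` assumes GNU coreutils, as used in the playbook):

```shell
#!/bin/sh
# A mode's last octal digit carries the "others" bits; write is bit 2,
# so a last digit of 2, 3, 6, or 7 means world-writable.
is_world_writable() {
  perms=$(stat -c "%a" "$1" 2>/dev/null) || return 1
  case "$perms" in
    *[2367]) return 0 ;;
    *)       return 1 ;;
  esac
}

tmp=$(mktemp)
chmod 666 "$tmp"
is_world_writable "$tmp" && echo "FLAGGED: $tmp"
chmod 644 "$tmp"
is_world_writable "$tmp" || echo "ok: $tmp"
rm -f "$tmp"
```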

Step 1: Create the playbook

---
# Cron and Scheduled Task Audit
# Inventories crontabs and systemd timers across all hosts.
# Flags security concerns: root crons with world-writable path references.
#
# Usage: ansible-playbook -i hosts.ini playbooks/cron_audit.yml
# Usage: ansible-playbook -i hosts.ini playbooks/cron_audit.yml --limit homelab

- name: Cron and Scheduled Task Audit
  hosts: "{{ host_target | default('active') }}"
  gather_facts: yes
  ignore_unreachable: true
  vars:
    report_dir: "/tmp/cron_audit"

  tasks:
    - name: Create audit report directory
      ansible.builtin.file:
        path: "{{ report_dir }}"
        state: directory
        mode: '0755'
      delegate_to: localhost
      run_once: true

    # ── System crontabs ───────────────────────────────────────────────
    - name: Read /etc/crontab
      ansible.builtin.shell: cat /etc/crontab 2>/dev/null || echo "(not present)"
      register: etc_crontab
      changed_when: false
      failed_when: false

    - name: Read /etc/cron.d/ entries
      ansible.builtin.shell: |
        for f in /etc/cron.d/*; do
          [ -f "$f" ] || continue
          echo "=== $f ==="
          cat "$f"
          echo ""
        done
      register: cron_d_entries
      changed_when: false
      failed_when: false

    - name: Read /etc/cron.{hourly,daily,weekly,monthly} scripts
      ansible.builtin.shell: |
        for dir in hourly daily weekly monthly; do
          path="/etc/cron.$dir"
          [ -d "$path" ] || continue
          scripts=$(ls "$path" 2>/dev/null)
          if [ -n "$scripts" ]; then
            echo "=== /etc/cron.$dir ==="
            echo "$scripts"
          fi
        done
      register: cron_dirs
      changed_when: false
      failed_when: false

    # ── User crontabs ─────────────────────────────────────────────────
    - name: List users with crontabs
      ansible.builtin.shell: |
        if [ -d /var/spool/cron/crontabs ]; then
          ls /var/spool/cron/crontabs/ 2>/dev/null
        elif [ -d /var/spool/cron ]; then
          ls /var/spool/cron/ 2>/dev/null | grep -v atjobs
        else
          echo "(crontab spool not found)"
        fi
      register: users_with_crontabs
      changed_when: false
      failed_when: false

    - name: Dump user crontabs
      ansible.builtin.shell: |
        spool_dir=""
        [ -d /var/spool/cron/crontabs ] && spool_dir=/var/spool/cron/crontabs
        [ -d /var/spool/cron ] && [ -z "$spool_dir" ] && spool_dir=/var/spool/cron

        if [ -z "$spool_dir" ]; then
          echo "(no spool directory found)"
          exit 0
        fi

        for user_file in "$spool_dir"/*; do
          [ -f "$user_file" ] || continue
          user=$(basename "$user_file")
          echo "=== crontab for: $user ==="
          cat "$user_file" 2>/dev/null
          echo ""
        done
      register: user_crontabs
      changed_when: false
      failed_when: false

    # ── Systemd timers ────────────────────────────────────────────────
    - name: List systemd timers
      ansible.builtin.shell: |
        if command -v systemctl >/dev/null 2>&1; then
          systemctl list-timers --all --no-pager 2>/dev/null || echo "(systemd not available)"
        else
          echo "(not a systemd host)"
        fi
      register: systemd_timers
      changed_when: false
      failed_when: false

    # ── Security flags ────────────────────────────────────────────────
    - name: Flag root cron jobs referencing world-writable paths
      ansible.builtin.shell: |
        # Gather all root cron entries
        {
          cat /etc/crontab 2>/dev/null
          cat /etc/cron.d/* 2>/dev/null
          spool=""
          [ -d /var/spool/cron/crontabs ] && spool=/var/spool/cron/crontabs
          [ -d /var/spool/cron ] && [ -z "$spool" ] && spool=/var/spool/cron
          [ -n "$spool" ] && cat "$spool/root" 2>/dev/null
        } | grep -v "^#" | grep -v "^$" > /tmp/_cron_lines.txt

        found=0
        while IFS= read -r line; do
          # Candidate command field: 6 for user crontabs, 7 for /etc/crontab
          # and /etc/cron.d entries (those formats carry an extra user column).
          for field in 6 7; do
            cmd=$(echo "$line" | awk -v f="$field" '{print $f}')
            if [ -n "$cmd" ] && [ -f "$cmd" ]; then
              perms=$(stat -c "%a" "$cmd" 2>/dev/null || echo "")
              # Others-write bit set: mode's last octal digit is 2, 3, 6, or 7
              if echo "$perms" | grep -qE "[2367]$"; then
                echo "FLAGGED: $cmd is world-writable — used in cron: $line"
                found=$((found+1))
              fi
            fi
          done
        done < /tmp/_cron_lines.txt
        rm -f /tmp/_cron_lines.txt

        [ "$found" -eq 0 ] && echo "No world-writable cron script paths found"
        exit 0
      register: security_flags
      changed_when: false
      failed_when: false

    # ── Summary ───────────────────────────────────────────────────────
    - name: Display cron audit summary
      ansible.builtin.debug:
        msg: |
          ═══ Cron Audit — {{ inventory_hostname }} ═══

          /etc/crontab:
          {{ etc_crontab.stdout | default('(empty)') | indent(2) }}

          /etc/cron.d/:
          {{ cron_d_entries.stdout | default('(empty)') | indent(2) }}

          Cron directories (/etc/cron.{hourly,daily,weekly,monthly}):
          {{ cron_dirs.stdout | default('(empty)') | indent(2) }}

          Users with crontabs: {{ users_with_crontabs.stdout | default('') | trim | default('(none)', true) }}

          User crontab contents:
          {{ user_crontabs.stdout | default('(none)') | indent(2) }}

          Systemd timers:
          {{ systemd_timers.stdout | default('(none)') | indent(2) }}

          Security flags:
          {{ security_flags.stdout | default('(none)') | indent(2) }}

    # ── Write JSON report ─────────────────────────────────────────────
    - name: Write cron audit report
      ansible.builtin.copy:
        content: "{{ {'host': inventory_hostname, 'timestamp': ansible_date_time.iso8601, 'etc_crontab': etc_crontab.stdout | default(''), 'cron_d': cron_d_entries.stdout | default(''), 'cron_dirs': cron_dirs.stdout | default(''), 'users_with_crontabs': users_with_crontabs.stdout | default(''), 'user_crontabs': user_crontabs.stdout | default(''), 'systemd_timers': systemd_timers.stdout | default(''), 'security_flags': security_flags.stdout | default('')} | to_nice_json }}"
        dest: "{{ report_dir }}/{{ inventory_hostname }}_{{ ansible_date_time.date }}.json"
      delegate_to: localhost
      changed_when: false

Step 2: Validate syntax

ansible-playbook --syntax-check -i hosts.ini playbooks/cron_audit.yml

Expected: no errors.

Step 3: Run against one host

ansible-playbook -i hosts.ini playbooks/cron_audit.yml --limit homelab

Expected: Cron entries and systemd timers displayed. Security flags report shown.

Step 4: Run across all hosts

ansible-playbook -i hosts.ini playbooks/cron_audit.yml

Expected: Summary per host. Reports written to /tmp/cron_audit/.
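Once reports accumulate, a quick pass over /tmp/cron_audit/ shows which hosts need attention. A dependency-free sketch (`summarize_dir` is a hypothetical helper; it keys off the FLAGGED marker the playbook's security task emits and the `<host>_<date>.json` filename pattern):

```shell
#!/bin/sh
# List each audited host and whether its report contains a FLAGGED
# security finding. Pure POSIX: greps the JSON rather than parsing it,
# and takes the host name from the filename prefix.
summarize_dir() {
  for f in "$1"/*.json; do
    [ -f "$f" ] || continue
    host=$(basename "$f" | cut -d_ -f1)
    if grep -q FLAGGED "$f"; then
      echo "$host: ATTENTION (flagged cron entries)"
    else
      echo "$host: clean"
    fi
  done
}

summarize_dir /tmp/cron_audit
```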

Step 5: Commit

git add playbooks/cron_audit.yml
git commit -m "feat: add cron_audit playbook for scheduled task inventory across all hosts"

Task 6: Update README.md

Files:

  • Modify: README.md

Step 1: Add the 5 new playbooks to the relevant tables in README.md

Add to the Health & Monitoring table:

| **`network_connectivity.yml`** | 🆕 Full mesh Tailscale + SSH + HTTP endpoint health | Daily | ✅ |
| **`ntp_check.yml`** | 🆕 Time sync drift audit with ntfy alerts | Daily | ✅ |

Add a new "Platform Management" section (after Advanced Container Management):

### 🖥️ Platform Management (3 playbooks)
| Playbook | Purpose | Usage | Multi-System |
|----------|---------|-------|--------------|
| `synology_health.yml` | Synology NAS health (DSM, RAID, Tailscale) | Monthly | Synology only |
| **`proxmox_management.yml`** | 🆕 PVE VM/LXC inventory, storage pools, snapshots | Weekly | PVE only |
| **`truenas_health.yml`** | 🆕 ZFS pool health, scrub, SMART, app status | Weekly | TrueNAS only |

Add to the Security & Maintenance table:

| **`cron_audit.yml`** | 🆕 Scheduled task inventory + security flags | Monthly | ✅ |

Step 2: Update the total playbook count at the bottom

Change: 33 playbooks → 38 playbooks

Step 3: Commit

git add README.md
git commit -m "docs: update README with 5 new playbooks"