Sanitized mirror from private repository - 2026-03-31 23:50:30 UTC
# LAN Routing Fix: Tailscale Table 52 LAN Interception
## Problem
Hosts with host-level Tailscale on the `192.168.0.0/24` LAN have their local traffic intercepted by Tailscale's policy routing table 52. Instead of going directly over the physical 10GbE link, traffic gets routed through the WireGuard tunnel via Calypso's advertised `192.168.0.0/24` subnet route.
### Root Cause
Calypso (Headscale node ID:12) advertises `192.168.0.0/24` as a subnet route so remote nodes (Moon, Seattle, NUC) can reach LAN devices over Tailscale. However, machines that are **already on** that LAN also accept this route into Tailscale's routing table 52 (ip rule priority 5270), causing local traffic to hairpin through the tunnel.
Diagnosis:
```bash
# Shows traffic going through tailscale0 instead of the physical NIC
ip route get 192.168.0.200
# → 192.168.0.200 dev tailscale0 table 52 src 100.75.252.64
# Table 52 has the LAN subnet routed through Tailscale
ip route show table 52 | grep 192.168.0
# → 192.168.0.0/24 dev tailscale0
```
### Affected Hosts
Any host on `192.168.0.0/24` with `--accept-routes` enabled will have this issue. Calypso advertises the LAN subnet so remote nodes can reach it; LAN-local hosts must not route LAN traffic through the tunnel.
| Host | LAN IP | Physical NIC | Status |
|---|---|---|---|
| Guava (TrueNAS) | 192.168.0.100 | enp1s0f0np0 (10GbE) | **Fixed** — TrueNAS POSTINIT script |
| homelab-vm | 192.168.0.210 | ens18 | **Fixed** — systemd service |
| Pi-5 | 192.168.0.66 | eth0 | **Fixed** (2026-03-31) — dispatcher script + cron |
| Matrix-Ubuntu | 192.168.0.154 | ens3 | **Fixed** (2026-03-31) — dispatcher script + cron |
| PVE | 192.168.0.205 | vmbr0 | **Fixed** (2026-03-31) — cron @reboot |
| Atlantis | 192.168.0.200 | eth2/ovs_eth2 (10GbE) | Not affected (`--accept-routes` off) |
| Calypso | 192.168.0.250 | ovs_eth2 | Not affected (`--accept-routes` off) |
| NUC | 192.168.68.100 | eno1 | Not affected (different subnet) |
### Measured Impact (Guava → Atlantis)
| Route | Throughput | Retransmits |
|---|---|---|
| Before fix (via Tailscale) | 1.39 Gbps | 6,891 |
| After fix (direct LAN) | **7.61 Gbps** | 5,066 |
**5.5x improvement** — from WireGuard-encapsulated tunnel to direct 10GbE.
## Fix Applied
Add a policy routing rule at priority 5200 (evaluated before Tailscale's table 52 rule at priority 5270) that forces LAN-bound traffic to use the main routing table, which routes via the physical NIC:
```bash
sudo ip rule add to 192.168.0.0/24 lookup main priority 5200
```
This means: for any traffic destined to `192.168.0.0/24`, check the main table first. The main table has `192.168.0.0/24 dev <physical-nic>`, so traffic goes direct. All Tailscale traffic to `100.x.x.x` nodes is unaffected.
### Verification
```bash
# Should show physical NIC, not tailscale0
ip route get 192.168.0.200
# Should get sub-1ms ping
ping -c 3 192.168.0.200
# Confirm rule is in place
ip rule show | grep 5200
```
### Revert
```bash
sudo ip rule del to 192.168.0.0/24 lookup main priority 5200
```
## Persistence
Each host uses the persistence method appropriate to its OS:
### Guava (TrueNAS SCALE)
Init script added via TrueNAS API (ID: 2):
- **Type:** COMMAND
- **When:** POSTINIT
- **Command:** `ip rule add to 192.168.0.0/24 lookup main priority 5200`
- **Comment:** Bypass Tailscale routing for LAN traffic (direct 10GbE)
Manage via TrueNAS UI: **System → Advanced → Init/Shutdown Scripts**
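The same script can in principle be created from the shell via the middleware client; a hypothetical `midclt` sketch (the `initshutdownscript.create` field names are an assumption, so check `midclt call initshutdownscript.query` on your release first):

```shell
# Hypothetical API call; only runs on TrueNAS itself where midclt exists.
if command -v midclt >/dev/null 2>&1; then
    midclt call initshutdownscript.create '{
      "type": "COMMAND",
      "command": "ip rule add to 192.168.0.0/24 lookup main priority 5200",
      "when": "POSTINIT",
      "enabled": true,
      "comment": "Bypass Tailscale routing for LAN traffic (direct 10GbE)"
    }'
fi
```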
### homelab-vm (Ubuntu 24.04)
Systemd service at `/etc/systemd/system/lan-route-fix.service`:
```ini
[Unit]
Description=Ensure LAN traffic bypasses Tailscale routing table
After=network-online.target tailscaled.service
Wants=network-online.target
[Service]
Type=oneshot
ExecStart=/sbin/ip rule add to 192.168.0.0/24 lookup main priority 5200
ExecStop=/sbin/ip rule del to 192.168.0.0/24 lookup main priority 5200
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
```
Enabled with `sudo systemctl enable lan-route-fix.service`.
### Pi-5 (Raspberry Pi OS) and Matrix-Ubuntu (Ubuntu 24.04)
Dispatcher script at `/etc/networkd-dispatcher/routable.d/50-tailscale-lan` (must be executable; networkd-dispatcher skips non-executable scripts):
```bash
#!/bin/bash
if ! ip rule show | grep -q "5200.*192.168.0.0/24"; then
    ip rule add to 192.168.0.0/24 lookup main priority 5200
fi
```
Plus a belt-and-suspenders `@reboot` cron entry:
```
@reboot /bin/bash /etc/networkd-dispatcher/routable.d/50-tailscale-lan
```
### PVE (Proxmox VE)
Root crontab `@reboot` entry:
```
@reboot /sbin/ip rule add to 192.168.0.0/24 lookup main priority 5200 2>/dev/null
```
## Adding a New LAN Host
If a new host is added to `192.168.0.0/24` with Tailscale and `--accept-routes`:
1. Apply the fix: `sudo ip rule add to 192.168.0.0/24 lookup main priority 5200`
2. Verify: `ip route get 192.168.0.200` should show the physical NIC, not `tailscale0`
3. Make persistent using one of the methods above
4. Update this document
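The four steps collapse into one idempotent sketch (the probe IP `192.168.0.200` is Atlantis from the table above; run as root):

```shell
#!/bin/bash
# Onboarding sketch: apply and verify the table-52 bypass on a new LAN host.
LAN="192.168.0.0/24"
PRIO=5200
PROBE="192.168.0.200"   # Atlantis, a known LAN target

# Idempotent: add the rule only if it is not already present.
if ! ip rule show | grep -q "^$PRIO:.*$LAN"; then
    ip rule add to "$LAN" lookup main priority "$PRIO" \
        || echo "rule add failed (are you root?)" >&2
fi

# The route lookup must name the physical NIC, not tailscale0.
route="$(ip route get "$PROBE" 2>/dev/null || true)"
if printf '%s' "$route" | grep -q tailscale0; then
    echo "WARNING: $PROBE still routes via tailscale0" >&2
else
    echo "OK: $PROBE route: $route"
fi
```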
## Notes
- Remote nodes (Moon, Seattle, NUC, Setillo) that are **not** on `192.168.0.0/24` are unaffected — they correctly use Calypso's subnet route to reach LAN devices via Tailscale.
- The Synology boxes (Atlantis, Calypso) have `--accept-routes` disabled and use Open vSwitch bridging, so they are not affected.
- The `--accept-routes` flag also pulls in `192.168.68.0/22` and `192.168.69.0/24` routes (from NUC's subnet advertisement), but these don't conflict with the primary LAN.
- Enabling `--accept-routes` without the priority 5200 rule will silently break LAN connectivity — outbound packets route through `tailscale0` and replies never reach the sender via the expected path.

# SSH Mesh — Key-Based Authentication Across All Hosts
All Tailscale-connected hosts can SSH to each other using ed25519 key authentication.
No passwords needed.
## Participating Hosts
| Host | User | Tailscale IP | SSH Port | Key |
|------|------|-------------|----------|-----|
| homelab-vm | homelab | 100.67.40.126 | 22 | admin@thevish.io |
| atlantis | vish | 100.83.230.112 | 60000 | vish@atlantis |
| calypso | Vish | 100.103.48.78 | 62000 | calypso access |
| guava | vish | 100.75.252.64 | 22 | vish@guava |
| setillo | vish | 100.125.0.20 | 22 | setillo-key |
| pi-5 | vish | 100.77.151.40 | 22 | vish@pi-5 |
| nuc | vish | 100.72.55.21 | 22 | vish@nuc |
| moon | vish | 100.64.0.6 | 22 | vish@moon |
| seattle | root | 100.82.197.124 | 22 | root@seattle |
| matrix-ubuntu | test | 100.85.21.51 | 22 | test@matrix-ubuntu |
| jellyfish | lulu | 100.69.121.120 | 22 | lulu@jellyfish |
| pve | root | 100.87.12.28 | 22 | root@pve (RSA) |
| gl-mt3000 | root | 100.126.243.15 | 22 | (admin key only) |
| gl-be3600 | root | 100.105.59.123 | 22 | root@gl-be3600 |
The **admin key** (`admin@thevish.io` from homelab-vm) is present on every host.
## Ansible Playbook
Manage the mesh with `ansible/playbooks/ssh_mesh.yml`:
```bash
# Distribute keys to all hosts (collect + push)
ansible-playbook -i inventory.yml playbooks/ssh_mesh.yml --tags distribute
# Verify connectivity from localhost
ansible-playbook -i inventory.yml playbooks/ssh_mesh.yml --tags verify
# Generate missing keys + distribute
ansible-playbook -i inventory.yml playbooks/ssh_mesh.yml -e "generate_missing=true"
```
The `ssh_mesh` group in `inventory.yml` defines which hosts participate.
## Adding a New Host
1. Add the host to `ansible/inventory.yml` under the appropriate group and to the `ssh_mesh` children
2. Run the playbook with key generation:
```bash
ansible-playbook -i inventory.yml playbooks/ssh_mesh.yml -e "generate_missing=true"
```
3. This will generate a key on the new host if needed, collect all keys, and distribute them everywhere
## Notes
- **Synology NAS (Atlantis/Calypso/Setillo)**: Home directory must be `chmod 755` or stricter — SSH refuses key auth if home is world-writable. DSM can reset permissions on reboot.
- **OpenWrt routers (MT3000/BE3600)**: Use dropbear SSH, not OpenSSH. Keys must be in both `/etc/dropbear/authorized_keys` AND `/root/.ssh/authorized_keys`. Key auth works but `ssh -o` flags differ slightly.
- **GL-BE3600 in repeater mode**: SSH port 22 is accessible via Tailscale only — LAN SSH is blocked by the repeater firewall. Use `100.105.59.123` not `192.168.68.1`.
- **TrueNAS (Guava)**: Home directory is at `/mnt/data/vish-home/vish/`, not `/home/vish/`.
- **pi-5-kevin**: Frequently offline — will fail verification but has keys distributed.
- **homelab-vm**: its SSH config historically used password auth for connections to itself; key auth works to all other hosts.
- **rsync to Atlantis**: rsync from homelab-vm to Atlantis fails (Synology SSH subsystem issue). Use `scp -O -r -P 60000` instead, or pull from Atlantis.
## Router Tailscale Auto-Start
Both GL.iNet routers have init scripts to auto-connect to Headscale on boot:
**GL-MT3000** (`/etc/init.d/tailscale-up`, START=81):
```sh
tailscale up --accept-routes --login-server=https://headscale.vish.gg:8443 --accept-dns=false --advertise-routes=192.168.12.0/24
```
**GL-BE3600** (`/etc/init.d/tailscale-up`, START=99):
- Waits for network connectivity (repeater mode needs WiFi first)
- Polls every 2s for up to 120s before running `tailscale up`
- Advertises `192.168.68.0/22,192.168.8.0/24`
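Reconstructed from the description above, the BE3600 wait-then-connect pattern looks roughly like this as an OpenWrt init script (a sketch; the real `/etc/init.d/tailscale-up` may differ, and the reachability probe target is an assumption):

```shell
#!/bin/sh /etc/rc.common
# Sketch only: flags are taken from this doc, the structure is assumed.
START=99

start() {
    # Repeater mode: wait for the WiFi uplink before tailscale up.
    # Poll every 2s for up to 120s (60 attempts).
    n=0
    while [ "$n" -lt 60 ]; do
        ping -c 1 -W 2 headscale.vish.gg >/dev/null 2>&1 && break
        sleep 2
        n=$((n + 1))
    done
    tailscale up \
        --login-server=https://headscale.vish.gg:8443 \
        --accept-routes --accept-dns=false \
        --advertise-routes=192.168.68.0/22,192.168.8.0/24
}
```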
Update script on both: `/root/update-tailscale.sh` (Admon's GL.iNet updater, use `--force` for non-interactive).
## Established 2026-03-23, updated 2026-03-24

# Tailscale Mesh Connectivity Test
Last tested: 2026-03-31
## Test Method
Full `tailscale ping` from each SSH-accessible host to all other online Headscale nodes. This tests WireGuard tunnel establishment and reports whether the connection is direct (peer-to-peer) or relayed via DERP.
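In script form, the sweep from one host might look like this (a sketch; the peer list is abridged from the SSH mesh table, and the `classify` parsing is an assumption based on `tailscale ping`'s usual "pong ... via DERP(...)" vs. "via <ip>:<port>" output formats):

```shell
#!/bin/bash
# Classify one tailscale-ping result as Direct, Relayed, or failed.
classify() {
    case "$1" in
        *"via DERP"*) echo R ;;
        *pong*)       echo D ;;
        *)            echo FAIL ;;
    esac
}

# Abridged peer list (Tailscale IPs from the SSH mesh table).
for peer in 100.83.230.112 100.103.48.78 100.77.151.40 100.67.40.126; do
    out="$(tailscale ping --c 1 --timeout 5s "$peer" 2>&1 || true)"
    printf '%s %s\n' "$peer" "$(classify "$out")"
done
```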
## Results
All 10 online hosts can reach all other online hosts. No failures.
### Connection Type Matrix
`D` = direct, `R` = DERP relay, `—` = self
| From \ To | Atlantis | Calypso | Pi-5 | Homelab | Matrix-Ubuntu | Setillo | NUC | Guava | Seattle | PVE |
|-----------|----------|---------|------|---------|---------------|---------|-----|-------|---------|-----|
| **Atlantis** | — | D | D | D | D | D | D | D | D | D |
| **Calypso** | D | — | D | D | D | R | D | D | D | D |
| **Pi-5** | D | D | — | D | D | D | D | D | D | D |
| **Homelab-VM** | D | D | D | — | D | R | D | D | D | D |
| **Matrix-Ubuntu** | (tested inbound) | | | | — | | | | | |
| **Setillo** | D | D | D | R | R | — | | | | |
| **NUC** | D | D | D | D | R | D | — | R | D | R |
| **Guava** | (no CLI) | | | | | | | — | | |
| **Seattle** | D | D | D | D | D | D | R | D | — | D |
| **PVE** | D | D | D | D | D | D | D | D | D | — |
### Notes
- **Atlantis/Calypso**: Tailscale binary at `/var/packages/Tailscale/target/bin/tailscale` (Synology package)
- **Setillo**: Tailscale binary at `/usr/local/bin/tailscale`
- **Guava**: Tailscale runs via TrueNAS built-in management; no `tailscale` CLI in PATH. Confirmed reachable via inbound pings from all other hosts.
- **Matrix-Ubuntu**: SSH via LAN IP (192.168.0.154) was unreliable during testing due to table 52 LAN interception (since fixed). Confirmed reachable via Tailscale IP and inbound pings.
- **DERP relays** (NUC ↔ some peers, Setillo ↔ homelab/matrix-ubuntu): Normal for nodes behind different NATs. Adds 15-60ms latency but does not affect reliability. Connections may upgrade to direct over time.
### ICMP Ping Notes
Standard ICMP ping from the OS (not `tailscale ping`) fails for Atlantis, Calypso, and Setillo because those hosts have ICMP blocked at the firewall level. This does not indicate a connectivity problem — `tailscale ping` and SSH both work.
## Tailscale Health Warnings
After fixes applied on 2026-03-31:
| Host | Health |
|------|--------|
| Homelab-VM | none |
| Pi-5 | none |
| NUC | none |
| Seattle | none |
| PVE | none (was `--accept-routes is false`, fixed) |
| Matrix-Ubuntu | none (was `--accept-routes is false`, fixed) |
## Fixes Applied (2026-03-31)
### LAN Routing (table 52 interception)
1. **Pi-5**: Enabled `--accept-routes`, added LAN routing rule (priority 5200), persistent via dispatcher script + cron
2. **Matrix-Ubuntu**: Enabled `--accept-routes`, added LAN routing rule (priority 5200), persistent via dispatcher script + cron. Enabling `--accept-routes` without the rule broke LAN connectivity (SSH via 192.168.0.154 timed out).
3. **PVE**: Enabled `--accept-routes`, added LAN routing rule (priority 5200), persistent via cron @reboot
See [LAN Routing Fix](GUAVA_LAN_ROUTING_FIX.md) for full details on the table 52 issue.
### Kuma monitors
- Switched **all 20 Calypso monitors** from Tailscale IP (`100.103.48.78`) to LAN IP (`192.168.0.250`) in the Kuma SQLite DB. Pi-5 (where Kuma runs) is on the same LAN, so using Tailscale IPs added unnecessary fragility.
- Added LAN-based monitors for Rustdesk (ID:124) and Rackula (ID:125).
- Fixed corrupted `accepted_statuscodes_json` field (`[200-299]` → `["200-299"]`) that prevented the Kuma UI from loading.
- Fixed ntfy notifications by setting `primaryBaseURL` to `https://kuma.vish.gg` — the "view" action button was missing a URL.
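The monitor edits amount to something like the following SQL (a sketch: the `monitor` table and column names match Uptime Kuma's schema as of recent versions but may differ; stop Kuma and back up `kuma.db` first):

```shell
# Sketch: run against a *copy* of kuma.db with Kuma stopped.
fix_kuma_db() {
    sqlite3 "$1" <<'SQL'
-- Point the Calypso monitors at the LAN IP instead of the Tailscale IP
UPDATE monitor
   SET hostname = REPLACE(hostname, '100.103.48.78', '192.168.0.250'),
       url      = REPLACE(url,      '100.103.48.78', '192.168.0.250');
-- Repair the corrupted status-code list that broke the UI
UPDATE monitor
   SET accepted_statuscodes_json = '["200-299"]'
 WHERE accepted_statuscodes_json = '[200-299]';
SQL
}

# /path/to/kuma.db is a placeholder; Kuma keeps it in its data directory.
if [ -f /path/to/kuma.db ]; then
    fix_kuma_db /path/to/kuma.db
fi
```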
### Calypso Tailscale 5-minute disconnect (root cause)
**Symptom**: Calypso's disco key rotated every 5 minutes, dropping all peer WireGuard sessions for ~30 seconds.
**Root cause**: A cron job in `/etc/crontab` ran `/usr/local/bin/tailscale-fix.sh` every 5 minutes (`*/5 * * * *`). The script checked for the `tailscale0` TUN device, but Calypso runs Tailscale in `--tun=userspace-networking` mode (Synology DSM7), which has no TUN device. The script also checked `tailscale status --json` which returned empty state when run as the `tailscale` user. So every 5 minutes:
1. Script detects "tailscale0 missing" or "state empty"
2. Runs `configure-host` + full service restart via `synosystemctl`
3. Re-authenticates with `tailscale up --reset`
4. New disco key generated → all peers tear down and re-establish connections
**Fix**: Rewrote `/usr/local/bin/tailscale-fix.sh` to check if `tailscaled` process is running and can `tailscale ping` a known peer (Atlantis), which works in both TUN and userspace-networking modes.
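A sketch of the rewritten check (binary path and peer IP are from this doc; the restart branch and the exact `synosystemctl` target are assumptions):

```shell
#!/bin/sh
# TUN-agnostic health check: process alive + one known peer reachable.
# Avoids the tailscale0 check that misfires in userspace-networking mode.
TAILSCALE=/var/packages/Tailscale/target/bin/tailscale
PEER=100.83.230.112   # Atlantis

healthy() {
    pgrep -x tailscaled >/dev/null || return 1
    "$TAILSCALE" ping --c 1 --timeout 5s "$PEER" >/dev/null 2>&1
}

if ! healthy; then
    echo "tailscale-fix: unhealthy, restarting package" >&2
    # Hypothetical restart target; only reached when genuinely down.
    if command -v synosystemctl >/dev/null 2>&1; then
        synosystemctl restart pkgctl-Tailscale
    fi
fi
```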
**Additional changes on Calypso** (not the root cause but good hygiene):
- Disabled Docker IPv6 on all bridge interfaces via `sysctl` (77 routes removed)
- Updated `dockerd.json` with `"ipv6": false, "ip6tables": false` (persistent after Docker restart)
- Added `TS_DEBUG_NETMON_SKIP_INTERFACE_REGEXPS` and `TS_DEBUG_NETMON_NO_ROUTE_MONITORING` env vars to Tailscale startup script
- Added `/etc/hosts` entry: `192.168.0.250 headscale.vish.gg` to avoid hairpin NAT for control plane
### GL.iNet router Tailscale fixes
**GL-MT3000** (Hawaii, via Jellyfish at `192.168.12.1`):
- Tailscale was pointed at `https://controlplane.tailscale.com` (official) instead of Headscale
- Re-registered with preauth key: `tailscale up --login-server=https://headscale.vish.gg:8443 --authkey=<key> --hostname=gl-mt3000 --accept-routes --accept-dns=false --advertise-routes=192.168.12.0/24 --reset`
- Auto-start verified: `/etc/init.d/tailscale-up` (S81) runs `tailscale up` with correct flags after 10s delay
- State file at `/etc/tailscale/tailscaled.state` has `WantRunning: true`, `ControlURL: headscale.vish.gg:8443`
**GL-BE3600** (Home gateway at `192.168.0.1`):
- Was already connected to Headscale correctly, no fix needed
- SSH not available on LAN port 22 — accessible via Tailscale IP (`100.105.59.123`)
- Auto-start verified: `/etc/init.d/tailscale-up` (S99) waits up to 120s for network, then connects with `--advertise-routes=192.168.68.0/22,192.168.8.0/24`
### Speedtest results (Ookla official CLI)
| Host | Location | NIC | ISP | Download | Upload | Latency |
|------|----------|-----|-----|----------|--------|---------|
| Calypso | Home | 10GbE | Sonic | 6,255 Mbps | 7,039 Mbps | 2 ms |
| Atlantis | Home | 10GbE | Sonic | 4,623 Mbps | 6,988 Mbps | 3 ms |
| Matrix-Ubuntu | Home | virtio | Sonic | 1,907 Mbps | 6,330 Mbps | 3 ms |
| GL-MT3000 | Hawaii | 1GbE | Spectrum | 1,127 Mbps | 35 Mbps | 10 ms |
| NUC | Concord | 1GbE | Comcast | 882 Mbps | 361 Mbps | 13 ms |
| Pi-5 | Home | 1GbE | Sonic | 710 Mbps | 936 Mbps | 2 ms |
| Homelab-VM | Home | 1GbE virtio | Sonic | 448 Mbps | 931 Mbps | 3 ms |
Internal 10GbE link (Atlantis ↔ Calypso via iperf3): **8.61 Gbps** with 4 parallel streams, 0 retransmits.