# Guava SMB Incident — 2026-03-14
**Affected host:** guava (TrueNAS SCALE, `100.75.252.64` / `192.168.0.100`)
**Affected client:** shinku-ryuu (Windows, `192.168.0.3`)
**Symptoms:** All SMB shares on guava unreachable from shinku after guava reboot
---
## Root Causes (two separate issues)
### 1. Tailscale app was STOPPED after reboot
Guava's Tailscale was running as an **orphaned host process** rather than the managed TrueNAS app. On reboot the orphan was gone and the app didn't start because it was in `STOPPED` state.
**Why it was stopped:** The app had been upgraded from v1.3.30 → v1.4.2. The new version's startup script ran `tailscale up` but failed because the stored state had `--accept-dns=false` while the app config had `accept_dns: true` — a mismatch that requires `--reset`. The app exited, leaving the old manually-started daemon running until the next reboot.
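The failure mode can be reproduced by hand inside the app container. A sketch of the mismatch and the one-time recovery (error text paraphrased, not verbatim):

```bash
# tailscale up refuses to change settings when the stored preferences
# disagree with the flags it is given — it asks for either all
# non-default flags to be repeated, or --reset
tailscale up --accept-dns=true
# fails with "... requires mentioning all non-default flags ... or use --reset"

# one-time recovery: --reset discards the stored preferences and applies
# the flags as given, matching the state the app config now declares
tailscale up --reset --accept-dns=false
```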
### 2. Tailscale `accept_routes: true` caused SMB replies to route via tunnel
After fixing the app startup, shinku still couldn't reach guava on the LAN. The cause:
- **Calypso** advertises `192.168.0.0/24` as a subnet route via Tailscale
- Guava had `accept_routes: true` — it installed Calypso's `192.168.0.0/24` route into Tailscale's policy routing table (table 52, priority 5270)
- When shinku sent a TCP SYN to guava port 445, it arrived on `enp1s0f0np0`
- Guava's reply looked up `192.168.0.3` in the routing table — hit table 52 first — and sent the reply **out via `tailscale0`** instead of the LAN
- The reply never reached shinku; the connection timed out
This also affected shinku: it had `accept_routes: true` as well, so it was routing traffic destined for `192.168.0.100` via Calypso's Tailscale tunnel rather than its local Ethernet interface.
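On guava this asymmetric path is visible directly from the shell: Tailscale installs an `ip rule` at priority 5270 pointing at table 52, and `ip route get` shows which interface a reply would take. Roughly (output illustrative):

```bash
# Tailscale's policy-routing rule — consulted before the main table
ip rule show | grep 5270
#   5270: from all lookup 52

# Before the fix: replies to the LAN client resolve via table 52 → tailscale0
ip route get 192.168.0.3
#   192.168.0.3 dev tailscale0 table 52 ...

# After removing the route (Fix 2), the lookup falls through to the main
# table and picks the LAN interface instead
ip route get 192.168.0.3
#   192.168.0.3 dev enp1s0f0np0 ...
```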
---
## Fixes Applied
### Fix 1 — Tailscale app startup config
Updated the TrueNAS app config to match the node's actual desired state:
```bash
sudo midclt call app.update tailscale '{"values": {"tailscale": {
  "accept_dns": false,
  "accept_routes": false,
  "advertise_exit_node": true,
  "advertise_routes": [],
  "auth_key": "...",
  "auth_once": true,
  "hostname": "truenas-scale",
  "reset": true
}}}'
```
Key changes:
- `accept_dns: false` — matches the running state stored in Tailscale's state dir
- `accept_routes: false` — prevents guava from pulling in subnet routes from other nodes (see Fix 2)
- `reset: true` — clears the flag mismatch that was causing `tailscale up` to fail
**Saved in:** `/mnt/.ix-apps/app_configs/tailscale/versions/1.4.2/user_config.yaml`
### Fix 2 — Remove stale subnet routes from guava's routing table
After updating the app config the stale routes persisted in table 52. Removed manually:
```bash
sudo ip route del 192.168.0.0/24 dev tailscale0 table 52
sudo ip route del 192.168.12.0/24 dev tailscale0 table 52
sudo ip route del 192.168.68.0/22 dev tailscale0 table 52
sudo ip route del 192.168.69.0/24 dev tailscale0 table 52
```
With `accept_routes: false` now saved, these routes will not reappear on next reboot.
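To confirm the cleanup holds after the next reboot, table 52 should contain only guava's own Tailscale entries (`100.64.0.0/10` addresses) and no `192.168.*` routes. A quick check:

```bash
ssh guava "ip route show table 52 | grep 192.168 || echo 'no stale LAN routes'"
```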
### Fix 3 — Disable accept_routes on shinku
Shinku was also accepting Calypso's `192.168.0.0/24` route (metric 0 via Tailscale, beating Ethernet 3's metric 256):
```
# Destination       Gateway            Metric   Interface
# Before fix — traffic to 192.168.0.100 went via Tailscale
192.168.0.0/24      100.100.100.100    0        Tailscale
# After fix — traffic goes via the local LAN
192.168.0.0/24      0.0.0.0            256      Ethernet 3
```
Fixed by running on shinku:
```
tailscale up --accept-routes=false --login-server=https://headscale.vish.gg:8443
```
### Fix 4 — SMB password reset and credential cache
The SMB password for `vish` on guava was changed via the TrueNAS web UI. Windows had stale credentials cached. Fixed by:
1. Clearing Windows Credential Manager entry for `192.168.0.100`
2. Re-mapping shares from an interactive PowerShell session on shinku
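A sketch of those two steps from an elevated PowerShell prompt on shinku, using one share from the table below as an example (`cmdkey` and `net use` are the standard Windows tools here; flags worth verifying locally):

```
# 1. Drop the stale cached credential for guava
cmdkey /delete:192.168.0.100

# 2. Re-map a share with the new password (repeat per drive letter)
net use J: \\192.168.0.100\photos /user:vish /persistent:yes
```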
---
## SMB Share Layout on Guava
| Windows drive | Share | Path on guava |
|--------------|-------|---------------|
| I: | `guava_turquoise` | `/mnt/data/guava_turquoise` |
| J: | `photos` | `/mnt/data/photos` |
| K: | `data` | `/mnt/data/passionfruit` |
| L: | `website` | `/mnt/data/website` |
| M: | `jellyfin` | `/mnt/data/jellyfin` |
| N: | `truenas-exporters` | `/mnt/data/truenas-exporters` |
| Q: | `iso` | `/mnt/data/iso` |
All shares use `vish` as the SMB user. Credentials stored in Windows Credential Manager under `192.168.0.100`.
---
## Diagnosis Commands
```bash
# Check Tailscale app state on guava
ssh guava "sudo midclt call app.query '[[\"name\",\"=\",\"tailscale\"]]' | python3 -c 'import sys,json; a=json.load(sys.stdin)[0]; print(a[\"name\"], a[\"state\"])'"
# Check for rogue subnet routes in Tailscale's routing table
ssh guava "ip route show table 52 | grep 192.168"
# Check tailscale container logs
ssh guava "sudo docker logs \$(sudo docker ps | grep tailscale | awk '{print \$1}' | head -1) 2>&1 | tail -20"
# Check SMB audit log for auth failures on guava
ssh guava "sudo journalctl -u smbd --since '1 hour ago' --no-pager | grep -i 'wrong_password\|STATUS'"
# Check which Tailscale peer is advertising a given subnet (run on any node)
tailscale status --json | python3 -c "
import sys, json
d = json.load(sys.stdin)
for peer in d.get('Peer', {}).values():
    routes = peer.get('PrimaryRoutes') or []
    if routes:
        print(peer['HostName'], routes)
"
```
---
## Prevention
- **Guava:** `accept_routes: false` is now saved in the TrueNAS app config — will survive reboots
- **Shinku:** `--accept-routes=false` set via `tailscale up` — survives reboots
- **General rule:** Hosts on the same LAN as the subnet-advertising node (Calypso → `192.168.0.0/24`) should have `accept_routes: false`, or the advertised subnet should be scoped to only nodes that need remote access to that LAN
- **TrueNAS app upgrades:** After upgrading the Tailscale app version, always check the new `user_config.yaml` to ensure `accept_dns`, `accept_routes`, and other flags match the node's actual running state. If unsure, set `reset: true` once to clear any stale state, then set it back to `false`
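The post-upgrade flag check can be done without opening the UI. A minimal sketch of the kind of check, shown here against a sample file (`/tmp/user_config_sample.yaml` is a stand-in; on guava the real file is the `user_config.yaml` path listed under Fix 1):

```bash
# Stand-in for the real user_config.yaml, with the keys that bit us
cat > /tmp/user_config_sample.yaml <<'EOF'
tailscale:
  accept_dns: false
  accept_routes: false
  reset: false
EOF

# Flag check: these lines should match the node's actual running state
grep -E 'accept_dns|accept_routes|reset' /tmp/user_config_sample.yaml
```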