Sanitized mirror from private repository - 2026-04-20 01:24:42 UTC
docs/troubleshooting/matrix-ssl-authentik-incident-2026-03-19.md
# Matrix SSL + Authentik + Portainer OAuth Incidents — 2026-03-19/21

---

## Issues Addressed

### 1. mx.vish.gg "Not Secure" Warning

**Symptom:** Browser showed "Not Secure" on `https://mx.vish.gg`.

**Root cause:** NPM was serving the **Cloudflare Origin Certificate** (cert ID 1, `*.vish.gg`) for `mx.vish.gg`. Cloudflare Origin certs are trusted only by Cloudflare's edge. Because `mx.vish.gg` is **unproxied** (required for Matrix federation), browsers connect to the origin directly and do not trust the cert.

**Fix:**

1. Issued a Let's Encrypt cert for `mx.vish.gg` via Cloudflare DNS challenge on matrix-ubuntu:
   ```bash
   sudo certbot certonly --dns-cloudflare \
     --dns-cloudflare-credentials /etc/cloudflare.ini \
     -d mx.vish.gg --email your-email@example.com --agree-tos
   ```
2. Copied the cert to NPM as `npm-6`:
   ```
   /volume1/docker/nginx-proxy-manager/data/custom_ssl/npm-6/fullchain.pem
   /volume1/docker/nginx-proxy-manager/data/custom_ssl/npm-6/privkey.pem
   ```
3. Updated NPM proxy host 10 (`mx.vish.gg`) to use cert ID 6.
4. Set up a renewal hook: `/etc/letsencrypt/renewal-hooks/deploy/copy-to-npm.sh`

**Same fix applied for:** `livekit.mx.vish.gg` (cert `npm-7`, proxy host 47)

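To confirm which certificate an unproxied host actually serves, the issuer can be read straight off the TLS handshake. The following is a minimal sketch, not part of the original fix: `served_cert_info` and `classify_issuer` are hypothetical helper names, and the match on the words "CloudFlare Origin" is an assumption about how the Origin CA names itself in the issuer field. A typical call would be `classify_issuer "$(served_cert_info mx.vish.gg | grep '^issuer=')"`.

```bash
#!/usr/bin/env bash
# Print issuer, subject and expiry of the certificate a host serves.
# Hypothetical helper; needs network access and the openssl CLI.
served_cert_info() {
  local host="$1" port="${2:-443}"
  openssl s_client -connect "${host}:${port}" -servername "$host" </dev/null 2>/dev/null \
    | openssl x509 -noout -issuer -subject -enddate
}

# Classify an issuer line as "cloudflare-origin" or "other".
# Assumption: Cloudflare Origin CA issuers contain "CloudFlare Origin"
# (compared case-insensitively).
classify_issuer() {
  local lower
  lower="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  case "$lower" in
    *"cloudflare origin"*) echo "cloudflare-origin" ;;
    *)                     echo "other" ;;
  esac
}
```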

---

### 2. kuma.vish.gg Redirect Loop (`ERR_TOO_MANY_REDIRECTS`)

**Symptom:** `kuma.vish.gg` (Uptime Kuma) entered an infinite redirect loop when accessed through Authentik Forward Auth.

**Root cause (two issues):**

**Issue A: Missing `X-Original-URL` header**

The Authentik outpost returned `500` for Forward Auth requests because NPM wasn't passing the `X-Original-URL` header. The outpost log showed:

```
failed to detect a forward URL from nginx
```

**Fix:** Added to NPM advanced config for `kuma.vish.gg` (proxy host 41):

```nginx
auth_request /outpost.goauthentik.io/auth/nginx;
proxy_set_header X-Original-URL $scheme://$http_host$request_uri;
```

**Issue B: Empty `cookie_domain` on Forward Auth providers**

After login, Authentik couldn't set the session cookie correctly because `cookie_domain` was empty on every proxy provider except PK 5. This kept the auth loop going even after successful authentication.

**Fix:** Set `cookie_domain: vish.gg` on the affected proxy providers via the Authentik API:

| PK | Provider | Was | Now |
|----|----------|-----|-----|
| 4 | Paperless Forward Auth | `''` | `vish.gg` |
| 5 | vish.gg Domain Forward Auth | `vish.gg` | ✅ already set |
| 8 | Scrutiny Forward Auth | `''` | `vish.gg` |
| 12 | Uptime Kuma Forward Auth | `''` | `vish.gg` |
| 13 | Ollama Forward Auth | `''` | `vish.gg` |
| 14 | Wizarr Forward Auth | `''` | `vish.gg` |

```bash
AK_TOKEN="..."
for pk in 4 8 12 13 14; do
  # Fetch the full provider object, since PUT replaces every field
  PROVIDER=$(curl -s "https://sso.vish.gg/api/v3/providers/proxy/$pk/" -H "Authorization: Bearer $AK_TOKEN")
  # Set cookie_domain and re-serialize the object
  UPDATED=$(echo "$PROVIDER" | python3 -c "import sys,json; d=json.load(sys.stdin); d['cookie_domain']='vish.gg'; print(json.dumps(d))")
  # Write the modified object back
  curl -s -X PUT "https://sso.vish.gg/api/v3/providers/proxy/$pk/" \
    -H "Authorization: Bearer $AK_TOKEN" -H "Content-Type: application/json" -d "$UPDATED"
done
```

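Before or after a bulk update like the one above, it can help to list which providers still have an empty `cookie_domain`. This is a sketch under assumptions: `empty_cookie_domain_pks` is a hypothetical helper, and it assumes the provider list endpoint returns its items in a top-level `results` array. It reads the JSON on stdin, so it could be fed with `curl -s "https://sso.vish.gg/api/v3/providers/proxy/" -H "Authorization: Bearer $AK_TOKEN" | empty_cookie_domain_pks`.

```bash
#!/usr/bin/env bash
# Read an Authentik proxy-provider list response on stdin and print the pk
# of every provider whose cookie_domain is empty or missing.
# Assumption: items are wrapped in a top-level "results" array.
empty_cookie_domain_pks() {
  python3 -c '
import json, sys

data = json.load(sys.stdin)
for provider in data.get("results", []):
    if not provider.get("cookie_domain"):
        print(provider["pk"])
'
}
```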

---

### 3. TURN Server External Verification

**coturn** was verified working externally from the Seattle VPS (a different network):

| Test | Result |
|------|--------|
| UDP port 3479 reachable | ✅ |
| STUN Binding request | ✅ `0x0101` success, returns `184.23.52.14:3479` |
| TURN Allocate (auth required) | ✅ `0x0113` (401): server responds with the expected auth challenge |

Config: `/etc/turnserver.conf` on matrix-ubuntu

- `listening-port=3479`
- `use-auth-secret`
- `static-auth-secret` = same as `turn_shared_secret` in Synapse `homeserver.yaml`
- `realm=matrix.thevish.io`

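With `use-auth-secret`, clients authenticate using short-lived credentials derived from the shared secret rather than a fixed password. The sketch below shows the usual derivation (username is an expiry timestamp plus a label, password is the base64 HMAC-SHA1 of that username), which could be used to test an authenticated Allocate by hand. `turn_password` and `turn_credentials` are hypothetical helper names, and the label and one-hour lifetime are arbitrary choices.

```bash
#!/usr/bin/env bash
# password = base64(HMAC-SHA1(shared_secret, username))
turn_password() {
  local secret="$1" username="$2"
  printf '%s' "$username" | openssl dgst -sha1 -hmac "$secret" -binary | base64
}

# Print "username password" valid for ttl seconds (default 3600).
# The username format "<unix expiry>:<label>" is what use-auth-secret expects;
# the label itself is free-form.
turn_credentials() {
  local secret="$1" label="${2:-manual-test}" ttl="${3:-3600}" username
  username="$(( $(date +%s) + ttl )):${label}"
  printf '%s %s\n' "$username" "$(turn_password "$secret" "$username")"
}
```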

---

## NPM Certificate Reference

| Cert ID | Nice Name | Domain | Type | Expires | Notes |
|---------|-----------|--------|------|---------|-------|
| 1 | Cloudflare Origin - vish.gg | `*.vish.gg`, `vish.gg` | Cloudflare Origin | 2041 | Only trusted by CF edge; don't use for unproxied domains |
| 2 | Cloudflare Origin - thevish.io | `*.thevish.io` | Cloudflare Origin | 2026 | Same caveat |
| 3 | Cloudflare Origin - crista.love | `*.crista.love` | Cloudflare Origin | 2026 | Same caveat |
| 4 | git.vish.gg (LE) | `git.vish.gg` | Let's Encrypt | 2026-05 | |
| 5 | headscale.vish.gg (LE) | `headscale.vish.gg` | Let's Encrypt | 2026-06 | |
| 6 | mx.vish.gg (LE) | `mx.vish.gg` | Let's Encrypt | 2026-06 | Added 2026-03-19 |
| 7 | livekit.mx.vish.gg (LE) | `livekit.mx.vish.gg` | Let's Encrypt | 2026-06 | Added 2026-03-19 |

> **Rule:** Any domain that is **unproxied** in Cloudflare (DNS-only, orange cloud off) must use a real Let's Encrypt cert, not the Cloudflare Origin cert.

---

## Renewal Automation

Certs 6 and 7 are issued by certbot on `matrix-ubuntu` and auto-renewed via a systemd timer. A deploy hook copies renewed certs to NPM on Calypso:

```
/etc/letsencrypt/renewal-hooks/deploy/copy-to-npm.sh
```

To manually renew and deploy:

```bash
ssh matrix-ubuntu
# "renew" selects a certificate with --cert-name; -d is only accepted by certonly
sudo certbot renew --force-renewal --cert-name mx.vish.gg
# hook runs automatically and copies to NPM
```

---

## Issue 4 — Portainer OAuth Hanging (2026-03-21)

**Symptom:** Clicking "Sign in with SSO" on `https://pt.vish.gg` redirected to Authentik and authenticated successfully, but then hung on `https://pt.vish.gg/?code=...&state=...#!/auth`.

**Root causes (three layered issues):**

### A — NPM migrated to matrix-ubuntu (missed in session context)

NPM was migrated from Calypso to matrix-ubuntu (`192.168.0.154`) on 2026-03-20. All cert and proxy operations needed to target the new NPM instance.

### B — AdGuard wildcard DNS `*.vish.gg → 100.85.21.51` (matrix-ubuntu Tailscale IP)

The Calypso AdGuard had a wildcard rewrite `*.vish.gg → 100.85.21.51` (matrix-ubuntu's Tailscale IP) intended for LAN clients. This caused:

- `pt.vish.gg` → `100.85.21.51`: Portainer OAuth redirect went to matrix-ubuntu instead of Atlantis
- `sso.vish.gg` → `100.85.21.51`: Portainer's token exchange request to Authentik timed out
- `git.vish.gg` → `100.85.21.51`: Portainer GitOps stack polling timed out

**Fix:** Added specific overrides before the wildcard in AdGuard (`/opt/adguardhome/conf/AdGuardHome.yaml`):

```yaml
- domain: pt.vish.gg
  answer: 192.168.0.154 # NPM on matrix-ubuntu (proxies to Atlantis:10000)
  enabled: true
- domain: sso.vish.gg
  answer: 192.168.0.154 # NPM on matrix-ubuntu (proxies to Authentik)
  enabled: true
- domain: git.vish.gg
  answer: 192.168.0.154 # NPM on matrix-ubuntu (proxies to Gitea)
  enabled: true
- domain: '*.vish.gg'
  answer: 100.85.21.51 # wildcard: matrix-ubuntu for everything else
```

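After changing rewrites like these, it is worth confirming that each overridden name now resolves to the LAN address and not to a Tailscale address. The sketch below is illustrative only: `is_tailscale_ip` and `check_rewrite` are hypothetical helpers, the check relies on Tailscale addresses falling in `100.64.0.0/10`, and the resolver address passed to `check_rewrite` is whatever IP the AdGuard instance listens on (not stated in this document).

```bash
#!/usr/bin/env bash
# Succeed if the IPv4 address lies in 100.64.0.0/10 (the CGNAT range that
# Tailscale assigns from): first octet 100, second octet 64..127.
is_tailscale_ip() {
  local ip="$1" o1 rest o2
  o1="${ip%%.*}"
  rest="${ip#*.}"
  o2="${rest%%.*}"
  case "$o2" in ''|*[!0-9]*) return 1 ;; esac
  [ "$o1" = "100" ] && [ "$o2" -ge 64 ] && [ "$o2" -le 127 ]
}

# Resolve a name through a specific DNS server and report where it lands.
# Needs dig; the resolver address is supplied by the caller.
check_rewrite() {
  local name="$1" resolver="$2" ip
  ip="$(dig +short "$name" "@${resolver}" | head -n 1)"
  if is_tailscale_ip "$ip"; then
    echo "$name -> $ip (Tailscale range)"
  else
    echo "$name -> $ip (not Tailscale)"
  fi
}
```

For example, `check_rewrite pt.vish.gg <adguard-ip>` should report a non-Tailscale address once the override is active.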
### C — Cloudflare Origin certs not trusted by Synology/Atlantis

Even with correct DNS, Atlantis couldn't verify the Cloudflare Origin cert on `sso.vish.gg` and `pt.vish.gg`. Both are unproxied (DNS-only in Cloudflare), so Atlantis reached the origin directly and did not trust the cert.

**Fix:** Issued Let's Encrypt certs for each domain via Cloudflare DNS challenge on matrix-ubuntu:

| Domain | NPM cert ID | Expires |
|--------|-------------|---------|
| `sso.vish.gg` | `npm-12` | 2026-06 |
| `pt.vish.gg` | `npm-11` | 2026-06 |

All certs auto-renew via certbot on matrix-ubuntu, with a deploy hook at `/etc/letsencrypt/renewal-hooks/deploy/copy-to-npm.sh`.

The hook copies renewed certs to `/opt/npm/data/custom_ssl/npm-N/` and reloads nginx.

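The actual contents of `copy-to-npm.sh` are not shown in this document. As a rough sketch of what such a deploy hook could look like: certbot exports `RENEWED_LINEAGE` (the live directory of the renewed cert) to deploy hooks, and the domain-to-directory mapping below is taken from the cert tables in this document. The container name `nginx-proxy-manager` and the `docker exec` reload are assumptions, as are the helper names.

```bash
#!/usr/bin/env bash
# Sketch of a certbot deploy hook, not the hook actually deployed.
set -eu

NPM_SSL_DIR="${NPM_SSL_DIR:-/opt/npm/data/custom_ssl}"
NPM_CONTAINER="${NPM_CONTAINER:-nginx-proxy-manager}"  # assumed container name

# Map a certificate name to its NPM directory (per the cert reference tables).
npm_dir_for_domain() {
  case "$1" in
    mx.vish.gg)         echo "npm-6" ;;
    livekit.mx.vish.gg) echo "npm-7" ;;
    pt.vish.gg)         echo "npm-11" ;;
    sso.vish.gg)        echo "npm-12" ;;
    *)                  return 1 ;;
  esac
}

# Copy fullchain and key from a certbot lineage into the mapped NPM directory.
deploy_cert() {
  local lineage="$1" domain target
  domain="$(basename "$lineage")"
  target="$(npm_dir_for_domain "$domain")" || {
    echo "no NPM mapping for $domain, refusing to copy" >&2
    return 1
  }
  # -L follows the symlinks certbot keeps under live/
  cp -L "$lineage/fullchain.pem" "$NPM_SSL_DIR/$target/fullchain.pem"
  cp -L "$lineage/privkey.pem"   "$NPM_SSL_DIR/$target/privkey.pem"
  chmod 600 "$NPM_SSL_DIR/$target/privkey.pem"
}

# Only act when certbot invoked us with a renewed lineage.
if [ -n "${RENEWED_LINEAGE:-}" ]; then
  deploy_cert "$RENEWED_LINEAGE"
  docker exec "$NPM_CONTAINER" nginx -s reload
fi
```

Refusing to copy when a domain has no explicit mapping is a deliberate choice here: it avoids silently writing a cert into a directory that another host depends on.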

---

## Issue 5 — npm-8 cert overwrite caused mass cert mismatch (2026-03-21)

**Symptom:** All `*.vish.gg` services failed with `Hostname/IP does not match certificate's altnames: DNS:sso.vish.gg`. Kuma, Homarr, NTFY, Mastodon, NPM, and Ollama were all down.

**Root cause:** The LE cert issued for `sso.vish.gg` was copied into `npm-8`, which held the Cloudflare Origin wildcard cert for `*.vish.gg` that all other `*.vish.gg` services relied on.

**Fix:**

1. Created `npm-12` for the `sso.vish.gg` LE cert
2. Restored `npm-8` from `/opt/npm/data/custom_ssl/x-vish-gg/` (the CF Origin wildcard backup)
3. Updated the `sso.vish.gg` proxy host to use `npm-12`
4. Updated the certbot renewal hook to use `npm-12` for `sso.vish.gg`

**Prevention:** When adding a new LE cert, always use the **next available npm-N ID**; never reuse an existing one.

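One way to make the prevention rule mechanical is to compute the next free directory name instead of picking one by hand. The helper below is a hypothetical sketch: it only inspects the directory names under `custom_ssl` and assumes the highest existing `npm-N` plus one is free. NPM also tracks cert IDs in its own database, so the result should be checked against the NPM UI before use. Typical use would be `next_npm_id /opt/npm/data/custom_ssl`.

```bash
#!/usr/bin/env bash
# Print the next unused "npm-N" name under an NPM custom_ssl directory.
# Directories that do not match npm-<number> (such as backups) are ignored.
next_npm_id() {
  local ssl_dir="$1" max=0 dir n
  for dir in "$ssl_dir"/npm-*; do
    [ -d "$dir" ] || continue
    n="${dir##*/npm-}"
    case "$n" in ''|*[!0-9]*) continue ;; esac
    if [ "$n" -gt "$max" ]; then
      max="$n"
    fi
  done
  echo "npm-$(( max + 1 ))"
}
```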

---

### Current NPM cert reference (matrix-ubuntu) — FINAL

| Cert ID | Domain | Type | Used by |
|---------|--------|------|---------|
| npm-1 | `*.vish.gg` + `vish.gg` (CF Origin) | Cloudflare Origin | Legacy; don't use for unproxied |
| npm-2 | `*.thevish.io` (CF Origin) | Cloudflare Origin | Legacy |
| npm-3 | `*.crista.love` (CF Origin) | Cloudflare Origin | Legacy |
| npm-6 | `mx.vish.gg` | Let's Encrypt | `mx.vish.gg` (Matrix) |
| npm-7 | `livekit.mx.vish.gg` | Let's Encrypt | `livekit.mx.vish.gg` |
| npm-8 | `*.vish.gg` (CF Origin) | Cloudflare Origin | All `*.vish.gg` Cloudflare-proxied services |
| npm-9 | `*.thevish.io` | Let's Encrypt | All `*.thevish.io` services |
| npm-10 | `*.crista.love` | Let's Encrypt | All `*.crista.love` services |
| npm-11 | `pt.vish.gg` | Let's Encrypt | `pt.vish.gg` (Portainer) |
| npm-12 | `sso.vish.gg` | Let's Encrypt | `sso.vish.gg` (Authentik) |

> **Rule:** Any unproxied domain accessed by internal services (Portainer, Synology, Kuma) needs a real LE cert (npm-6+). Never overwrite an existing npm-N; always use the next available number.

**Last updated:** 2026-03-21