Sanitized mirror from private repository - 2026-04-20 01:24:42 UTC

Matrix SSL + Authentik + Portainer OAuth Incidents — 2026-03-19/21


Issues Addressed

1. mx.vish.gg "Not Secure" Warning

Symptom: Browser showed "Not Secure" on https://mx.vish.gg.

Root cause: NPM was serving the Cloudflare Origin Certificate (cert ID 1, *.vish.gg) for mx.vish.gg. Cloudflare Origin certs are only trusted by Cloudflare's edge — since mx.vish.gg is unproxied (required for Matrix federation), browsers hit the origin directly and don't trust the cert.

Fix:

  1. Got a proper Let's Encrypt cert for mx.vish.gg via Cloudflare DNS challenge on matrix-ubuntu:
    sudo certbot certonly --dns-cloudflare \
      --dns-cloudflare-credentials /etc/cloudflare.ini \
      -d mx.vish.gg --email your-email@example.com --agree-tos
    
  2. Copied cert to NPM as npm-6:
    /volume1/docker/nginx-proxy-manager/data/custom_ssl/npm-6/fullchain.pem
    /volume1/docker/nginx-proxy-manager/data/custom_ssl/npm-6/privkey.pem
    
  3. Updated NPM proxy host 10 (mx.vish.gg) to use cert ID 6
  4. Set up renewal hook: /etc/letsencrypt/renewal-hooks/deploy/copy-to-npm.sh

Same fix applied for: livekit.mx.vish.gg (cert npm-7, proxy host 47)
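
One way to catch this class of problem before a browser does is to look at the issuer of the certificate a host actually serves. The sketch below is an illustration rather than part of the original fix: `classify_issuer` and `served_issuer` are hypothetical helper names, and the substrings matched are assumptions about how Cloudflare Origin CA and Let's Encrypt issuer lines usually read.

```shell
# Classify an X.509 issuer line (as printed by `openssl x509 -noout -issuer`).
# The substrings matched here are assumptions; adjust them if your issuer
# strings differ.
classify_issuer() {
  issuer=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$issuer" in
    *"cloudflare origin"*) echo "cloudflare-origin" ;;
    *"let's encrypt"*)     echo "lets-encrypt" ;;
    *)                     echo "other" ;;
  esac
}

# Fetch the issuer line of the certificate a host serves on port 443.
served_issuer() {
  openssl s_client -connect "$1:443" -servername "$1" </dev/null 2>/dev/null |
    openssl x509 -noout -issuer
}

# Example (needs network access):
#   classify_issuer "$(served_issuer mx.vish.gg)"
```

An unproxied host that reports `cloudflare-origin` is a candidate for the same fix.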


2. kuma.vish.gg Redirect Loop (ERR_TOO_MANY_REDIRECTS)

Symptom: kuma.vish.gg (Uptime Kuma) caused an infinite redirect loop via Authentik Forward Auth.

Root cause (two issues):

Issue A — Missing X-Original-URL header: The Authentik outpost returned 500 for Forward Auth requests because NPM wasn't passing the X-Original-URL header. The outpost log showed:

failed to detect a forward URL from nginx

Fix: Added to NPM advanced config for kuma.vish.gg (proxy host 41):

auth_request /outpost.goauthentik.io/auth/nginx;
proxy_set_header X-Original-URL $scheme://$http_host$request_uri;

Issue B — Empty cookie_domain on all Forward Auth providers: After login, Authentik couldn't set the session cookie correctly because cookie_domain was empty on all proxy providers. This caused the auth loop to continue even after successful authentication.

Fix: Set cookie_domain: vish.gg on all proxy providers via Authentik API:

| PK | Provider | Was | Now |
|----|----------|-----|-----|
| 4 | Paperless Forward Auth | '' | vish.gg |
| 5 | vish.gg Domain Forward Auth | vish.gg | already set |
| 8 | Scrutiny Forward Auth | '' | vish.gg |
| 12 | Uptime Kuma Forward Auth | '' | vish.gg |
| 13 | Ollama Forward Auth | '' | vish.gg |
| 14 | Wizarr Forward Auth | '' | vish.gg |

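# Authentik API token (value elided)
AK_TOKEN="..."
# Providers with an empty cookie_domain (PK 5 was already set)
for pk in 4 8 12 13 14; do
  # Fetch the current provider object
  PROVIDER=$(curl -s "https://sso.vish.gg/api/v3/providers/proxy/$pk/" -H "Authorization: Bearer $AK_TOKEN")
  # Set cookie_domain and re-serialize the full object
  UPDATED=$(echo "$PROVIDER" | python3 -c "import sys,json; d=json.load(sys.stdin); d['cookie_domain']='vish.gg'; print(json.dumps(d))")
  # Write the updated object back
  curl -s -X PUT "https://sso.vish.gg/api/v3/providers/proxy/$pk/" \
    -H "Authorization: Bearer $AK_TOKEN" -H "Content-Type: application/json" -d "$UPDATED"
done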
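
To confirm the change took effect, each provider can be read back and its cookie_domain compared with the expected value. A minimal sketch, reusing the endpoint and AK_TOKEN from the loop above; `cookie_domain_of` and `check_providers` are hypothetical helper names.

```shell
# Print the cookie_domain field of a provider JSON object read from stdin.
cookie_domain_of() {
  python3 -c "import sys, json; print(json.load(sys.stdin).get('cookie_domain', ''))"
}

# Read each provider back and report any whose cookie_domain is not vish.gg.
# Requires AK_TOKEN to be set; makes live API calls.
check_providers() {
  for pk in "$@"; do
    got=$(curl -s "https://sso.vish.gg/api/v3/providers/proxy/$pk/" \
      -H "Authorization: Bearer $AK_TOKEN" | cookie_domain_of)
    if [ "$got" = "vish.gg" ]; then
      echo "provider $pk: ok"
    else
      echo "provider $pk: cookie_domain is '$got'"
    fi
  done
}

# Example:
#   check_providers 4 5 8 12 13 14
```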

3. TURN Server External Verification

coturn was verified as working externally from the Seattle VPS (a different network):

| Test | Result |
|------|--------|
| UDP port 3479 | reachable |
| STUN Binding request | 0x0101 success, returns 184.23.52.14:3479 |
| TURN Allocate (auth required) | 0x0113 (401): server responds, relay functional |

Config: /etc/turnserver.conf on matrix-ubuntu

  • listening-port=3479
  • use-auth-secret
  • static-auth-secret = same as turn_shared_secret in Synapse homeserver.yaml
  • realm=matrix.thevish.io
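
With use-auth-secret, coturn accepts time-limited credentials derived from the shared secret instead of fixed usernames and passwords: the username is a Unix expiry timestamp, optionally followed by a colon and a user ID, and the password is the base64-encoded HMAC-SHA1 of that username keyed with the secret. The sketch below shows the derivation, which can be handy when testing the Allocate step by hand. `turn_username` and `turn_password` are hypothetical helper names, and the one-hour default lifetime is an arbitrary choice.

```shell
# Username: Unix expiry time, optionally followed by ":<user id>".
turn_username() {
  ttl=${1:-3600}
  echo "$(( $(date +%s) + ttl ))${2:+:$2}"
}

# Password: base64(HMAC-SHA1(secret, username)).
# $1 = shared secret, $2 = username from turn_username.
turn_password() {
  printf '%s' "$2" | openssl dgst -sha1 -hmac "$1" -binary | base64
}

# Example (the secret is a placeholder, not a real value):
#   user=$(turn_username 3600 '@alice:example.org')
#   pass=$(turn_password 'REPLACE_WITH_SHARED_SECRET' "$user")
```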

NPM Certificate Reference

| Cert ID | Nice Name | Domain | Type | Expires | Notes |
|---------|-----------|--------|------|---------|-------|
| 1 | Cloudflare Origin - vish.gg | *.vish.gg, vish.gg | Cloudflare Origin | 2041 | Only trusted by CF edge; don't use for unproxied domains |
| 2 | Cloudflare Origin - thevish.io | *.thevish.io | Cloudflare Origin | 2026 | Same caveat |
| 3 | Cloudflare Origin - crista.love | *.crista.love | Cloudflare Origin | 2026 | Same caveat |
| 4 | git.vish.gg (LE) | git.vish.gg | Let's Encrypt | 2026-05 | |
| 5 | headscale.vish.gg (LE) | headscale.vish.gg | Let's Encrypt | 2026-06 | |
| 6 | mx.vish.gg (LE) | mx.vish.gg | Let's Encrypt | 2026-06 | Added 2026-03-19 |
| 7 | livekit.mx.vish.gg (LE) | livekit.mx.vish.gg | Let's Encrypt | 2026-06 | Added 2026-03-19 |

Rule: Any domain that is unproxied in Cloudflare (DNS-only, orange cloud off) must use a real Let's Encrypt cert, not the Cloudflare Origin cert.


Renewal Automation

Certs 6 and 7 are issued by certbot on matrix-ubuntu and auto-renewed via a systemd timer. A deploy hook copies renewed certs to NPM, which ran on Calypso at the time and moved to matrix-ubuntu on 2026-03-20 (see issue 4):

/etc/letsencrypt/renewal-hooks/deploy/copy-to-npm.sh

To manually renew and deploy:

ssh matrix-ubuntu
# `certbot renew` selects a single certificate with --cert-name, not -d
sudo certbot renew --force-renewal --cert-name mx.vish.gg
# hook runs automatically and copies to NPM
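
Before forcing a renewal it can help to check how close a certificate is to expiry. A small sketch using `openssl x509 -checkend`; `expires_within_days` is a hypothetical helper name, and the path in the example assumes certbot's default layout under /etc/letsencrypt/live.

```shell
# Exit 0 if the certificate in $2 expires within $1 days, 1 if it does not,
# and 2 if the file cannot be read.
expires_within_days() {
  [ -r "$2" ] || return 2
  # -checkend exits 0 when the cert is still valid N seconds from now.
  if openssl x509 -checkend "$(( $1 * 86400 ))" -noout -in "$2" >/dev/null; then
    return 1
  fi
  return 0
}

# Example:
#   if expires_within_days 30 /etc/letsencrypt/live/mx.vish.gg/fullchain.pem; then
#     echo "mx.vish.gg expires within 30 days"
#   fi
```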

4. Portainer OAuth Hanging (2026-03-21)

Symptom: Clicking "Sign in with SSO" on https://pt.vish.gg would redirect to Authentik, authenticate successfully, but then hang on https://pt.vish.gg/?code=...&state=...#!/auth.

Root causes (three layered issues):

A — NPM migrated to matrix-ubuntu (missed in session context)

NPM was migrated from Calypso to matrix-ubuntu (192.168.0.154) on 2026-03-20. All cert and proxy operations needed to target the new NPM instance.

B — AdGuard wildcard DNS *.vish.gg → 100.85.21.51 (matrix-ubuntu Tailscale IP)

The Calypso AdGuard had a wildcard rewrite *.vish.gg → 100.85.21.51 (matrix-ubuntu's Tailscale IP) intended for LAN clients. This caused:

  • pt.vish.gg → 100.85.21.51: Portainer OAuth redirect went to matrix-ubuntu instead of Atlantis
  • sso.vish.gg → 100.85.21.51: Portainer's token exchange request to Authentik timed out
  • git.vish.gg → 100.85.21.51: Portainer GitOps stack polling timed out

Fix: Added specific overrides before the wildcard in AdGuard (/opt/adguardhome/conf/AdGuardHome.yaml):

- domain: pt.vish.gg
  answer: 192.168.0.154   # NPM on matrix-ubuntu (proxies to Atlantis:10000)
  enabled: true
- domain: sso.vish.gg
  answer: 192.168.0.154   # NPM on matrix-ubuntu (proxies to Authentik)
  enabled: true
- domain: git.vish.gg
  answer: 192.168.0.154   # NPM on matrix-ubuntu (proxies to Gitea)
  enabled: true
- domain: '*.vish.gg'
  answer: 100.85.21.51    # wildcard — matrix-ubuntu for everything else
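
The intended result of these rewrites can be written down as a small lookup and compared with what the resolver actually returns. This is a sketch only: `expected_answer` is a hypothetical helper that mirrors the entries above, and `ADGUARD_IP` in the example is a placeholder variable for the AdGuard host's address, which is not given here.

```shell
# Expected answer for a name, mirroring the rewrite list:
# specific overrides first, then the *.vish.gg wildcard.
expected_answer() {
  case "$1" in
    pt.vish.gg|sso.vish.gg|git.vish.gg) echo "192.168.0.154" ;;
    *.vish.gg)                          echo "100.85.21.51" ;;
    *)                                  return 1 ;;
  esac
}

# Example comparison against the live resolver:
#   for name in pt.vish.gg sso.vish.gg git.vish.gg kuma.vish.gg; do
#     [ "$(dig +short "$name" @"$ADGUARD_IP")" = "$(expected_answer "$name")" ] ||
#       echo "unexpected answer for $name"
#   done
```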

C — Cloudflare Origin certs not trusted by Synology/Atlantis

Even with correct DNS, Atlantis couldn't verify the Cloudflare Origin cert on sso.vish.gg and pt.vish.gg since they're unproxied (DNS-only in Cloudflare).

Fix: Issued Let's Encrypt certs for each domain via Cloudflare DNS challenge on matrix-ubuntu:

| Domain | NPM cert ID | Expires |
|--------|-------------|---------|
| sso.vish.gg | npm-12 | 2026-06 |
| pt.vish.gg | npm-11 | 2026-06 |

All certs auto-renew via certbot on matrix-ubuntu with deploy hook at: /etc/letsencrypt/renewal-hooks/deploy/copy-to-npm.sh

The hook copies renewed certs to /opt/npm/data/custom_ssl/npm-N/ and reloads nginx.
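
The hook script itself is not reproduced in this document. A minimal sketch of what such a deploy hook could look like follows, assuming certbot's RENEWED_LINEAGE variable (the live directory of the renewed cert) and the cert-to-ID assignments recorded in this document. `npm_dir_for`, `deploy_cert`, and the NPM_SSL_ROOT override are hypothetical, and the nginx reload is left as a comment because the container name is not given.

```shell
NPM_SSL_ROOT=${NPM_SSL_ROOT:-/opt/npm/data/custom_ssl}

# Map a certificate name to its NPM custom_ssl directory.
npm_dir_for() {
  case "$1" in
    mx.vish.gg)         echo "npm-6" ;;
    livekit.mx.vish.gg) echo "npm-7" ;;
    pt.vish.gg)         echo "npm-11" ;;
    sso.vish.gg)        echo "npm-12" ;;
    *)                  return 1 ;;   # unknown cert: refuse rather than guess
  esac
}

# Copy fullchain.pem and privkey.pem from a certbot lineage into NPM.
deploy_cert() {
  lineage=$1
  name=$(basename "$lineage")
  dir=$(npm_dir_for "$name") || { echo "no NPM mapping for $name" >&2; return 1; }
  # cp copies the file contents behind certbot's live/ symlinks.
  mkdir -p "$NPM_SSL_ROOT/$dir" &&
    cp "$lineage/fullchain.pem" "$lineage/privkey.pem" "$NPM_SSL_ROOT/$dir/"
  # Reload nginx inside the NPM container here (container name not given).
}

# certbot sets RENEWED_LINEAGE when it runs deploy hooks.
if [ -n "${RENEWED_LINEAGE:-}" ]; then
  deploy_cert "$RENEWED_LINEAGE"
fi
```

Keeping the mapping explicit means a cert with no assigned directory fails loudly instead of landing in someone else's npm-N.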


5. npm-8 Cert Overwrite Caused Mass Cert Mismatch (2026-03-21)

Symptom: All *.vish.gg services failed with "Hostname/IP does not match certificate's altnames: DNS:sso.vish.gg". Kuma, Homarr, NTFY, Mastodon, NPM, and Ollama were all down.

Root cause: When the LE cert for sso.vish.gg was issued, it was copied into npm-8, which held the Cloudflare Origin wildcard cert (*.vish.gg) that all other *.vish.gg services relied on. Every proxy host using npm-8 then served a cert valid only for sso.vish.gg.

Fix:

  1. Created npm-12 for sso.vish.gg LE cert
  2. Restored npm-8 from /opt/npm/data/custom_ssl/x-vish-gg/ (the CF Origin wildcard backup)
  3. Updated sso.vish.gg proxy host to use npm-12
  4. Updated certbot renewal hook to use npm-12 for sso.vish.gg

Prevention: When adding a new LE cert, always use the next available npm-N ID, never reuse an existing one.
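
The prevention rule can be made mechanical by computing the next free ID from the directories already present. A sketch, assuming every cert lives in a directory named npm-N under custom_ssl as described in this document; `next_npm_id` is a hypothetical helper.

```shell
# Print the next unused npm-N directory name under the given custom_ssl root.
next_npm_id() {
  max=0
  for d in "$1"/npm-*; do
    [ -d "$d" ] || continue                      # also skips an unmatched glob
    n=${d##*/npm-}
    case "$n" in ''|*[!0-9]*) continue ;; esac   # ignore non-numeric suffixes
    [ "$n" -gt "$max" ] && max=$n
  done
  echo "npm-$(( max + 1 ))"
}

# Example:
#   new_dir="/opt/npm/data/custom_ssl/$(next_npm_id /opt/npm/data/custom_ssl)"
#   [ -e "$new_dir" ] && { echo "refusing to overwrite $new_dir" >&2; exit 1; }
```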


Current NPM cert reference (matrix-ubuntu) — FINAL

| Cert ID | Domain | Type | Used by |
|---------|--------|------|---------|
| npm-1 | *.vish.gg + vish.gg (CF Origin) | Cloudflare Origin | Legacy; don't use for unproxied domains |
| npm-2 | *.thevish.io (CF Origin) | Cloudflare Origin | Legacy |
| npm-3 | *.crista.love (CF Origin) | Cloudflare Origin | Legacy |
| npm-6 | mx.vish.gg | Let's Encrypt | mx.vish.gg (Matrix) |
| npm-7 | livekit.mx.vish.gg | Let's Encrypt | livekit.mx.vish.gg |
| npm-8 | *.vish.gg (CF Origin) | Cloudflare Origin | All *.vish.gg Cloudflare-proxied services |
| npm-9 | *.thevish.io | Let's Encrypt | All *.thevish.io services |
| npm-10 | *.crista.love | Let's Encrypt | All *.crista.love services |
| npm-11 | pt.vish.gg | Let's Encrypt | pt.vish.gg (Portainer) |
| npm-12 | sso.vish.gg | Let's Encrypt | sso.vish.gg (Authentik) |

Rule: Any unproxied domain accessed by internal services (Portainer, Synology, Kuma) needs a real LE cert, not one of the Cloudflare Origin certs (npm-1, npm-2, npm-3, npm-8). Never overwrite an existing npm-N; always use the next available number.

Last updated: 2026-03-21