Sanitized mirror from private repository - 2026-04-18 11:19:59 UTC
Commit fb00a325d1 (Gitea Mirror Bot): 1418 changed files with 359990 additions and 0 deletions

docs/runbooks/README.md
# Homelab Operational Runbooks
This directory contains step-by-step operational runbooks for common homelab management tasks. Each runbook provides clear procedures, prerequisites, and rollback steps.
## 📚 Available Runbooks
### Service Management
- **[Add New Service](add-new-service.md)** - Deploy new containerized services via GitOps
- **[Service Migration](service-migration.md)** - Move services between hosts safely
- **[Add New User](add-new-user.md)** - Onboard new users with proper access
### Infrastructure Maintenance
- **[Disk Full Procedure](disk-full-procedure.md)** - Handle full disk scenarios
- **[Certificate Renewal](certificate-renewal.md)** - Manage SSL/TLS certificates
- **[Synology DSM Upgrade](synology-dsm-upgrade.md)** - Safely upgrade NAS firmware
### Security
- **[Credential Rotation](credential-rotation.md)** - Rotate exposed or compromised credentials
## 🎯 How to Use These Runbooks
### Runbook Format
Each runbook follows a standard format:
1. **Overview** - What this procedure accomplishes
2. **Prerequisites** - What you need before starting
3. **Estimated Time** - How long it typically takes
4. **Risk Level** - Low/Medium/High impact assessment
5. **Procedure** - Step-by-step instructions
6. **Verification** - How to confirm success
7. **Rollback** - How to undo if something goes wrong
8. **Troubleshooting** - Common issues and solutions
### When to Use Runbooks
- **Planned Maintenance** - Follow runbooks during scheduled maintenance windows
- **Incident Response** - Use as quick reference during outages
- **Training** - Onboard new admins with documented procedures
- **Automation** - Use as basis for creating automated scripts
### Best Practices
- ✅ Always read the entire runbook before starting
- ✅ Have a rollback plan ready
- ✅ Test in development/staging when possible
- ✅ Take snapshots/backups before major changes
- ✅ Document any deviations from the runbook
- ✅ Update runbooks when procedures change
## 🚨 Emergency Procedures
For emergency situations, refer to:
- [Emergency Access Guide](../troubleshooting/EMERGENCY_ACCESS_GUIDE.md)
- [Recovery Guide](../troubleshooting/RECOVERY_GUIDE.md)
- [Disaster Recovery](../troubleshooting/disaster-recovery.md)
## 📋 Runbook Maintenance
### Contributing
When you discover a new procedure or improvement:
1. Create a new runbook using the template below
2. Follow the standard format
3. Include real examples from your infrastructure
4. Test the procedure before documenting
### Runbook Template
````markdown
# [Procedure Name]
## Overview
Brief description of what this accomplishes and when to use it.
## Prerequisites
- [ ] Required access/credentials
- [ ] Required tools/software
- [ ] Required knowledge/skills
## Metadata
- **Estimated Time**: X minutes/hours
- **Risk Level**: Low/Medium/High
- **Requires Downtime**: Yes/No
- **Reversible**: Yes/No
- **Tested On**: Date last tested
## Procedure
### Step 1: [Action]
Detailed instructions...
```bash
# Example commands
```
Expected output:
```
Example of what you should see
```
### Step 2: [Next Action]
Continue...
## Verification
How to confirm the procedure succeeded:
- [ ] Verification step 1
- [ ] Verification step 2
## Rollback Procedure
If something goes wrong:
1. Step to undo changes
2. How to restore previous state
## Troubleshooting
**Issue**: Common problem
**Solution**: How to fix it
## Related Documentation
- [Link to related doc](path)
## Change Log
- YYYY-MM-DD - Initial creation
- YYYY-MM-DD - Updated for new procedure
````
## 📞 Getting Help
If a runbook is unclear or doesn't work as expected:
1. Check the troubleshooting section
2. Refer to related documentation links
3. Review the homelab monitoring dashboards
4. Consult the [Infrastructure Overview](../infrastructure/INFRASTRUCTURE_OVERVIEW.md)
## 📊 Runbook Status
| Runbook | Status | Last Updated | Tested On |
|---------|--------|--------------|-----------|
| Add New Service | ✅ Active | 2026-02-14 | 2026-02-14 |
| Service Migration | ✅ Active | 2026-02-14 | 2026-02-14 |
| Add New User | ✅ Active | 2026-02-14 | 2026-02-14 |
| Disk Full Procedure | ✅ Active | 2026-02-14 | 2026-02-14 |
| Certificate Renewal | ✅ Active | 2026-02-14 | 2026-02-14 |
| Synology DSM Upgrade | ✅ Active | 2026-02-14 | 2026-02-14 |
| Credential Rotation | ✅ Active | 2026-02-20 | — |
---
**Last Updated**: 2026-02-14

docs/runbooks/add-new-service.md
# Add New Service Runbook
This runbook walks through a **clean, tested path** for adding a new service to the homelab using GitOps with Portainer.
> ⚠️ **Prerequisites**: CI runner access, SSH to target hosts, SSO admin privilege.
## 1. Prepare Compose File
```bash
# Generate a minimal stack template
../scripts/ci/workflows/gen-template.py --service myservice
```
Adjust `docker-compose.yml`:
- Image name
- Ports
- Environment variables
- Healthcheck
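As a sketch, the adjusted `docker-compose.yml` might look like the fragment below. The image, port, and healthcheck endpoint are placeholders; substitute your service's real values:

```yaml
services:
  myservice:
    image: ghcr.io/example/myservice:1.0.0   # pin a version, avoid :latest
    ports:
      - "8080:8080"
    environment:
      - TZ=Etc/UTC
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 5s
      retries: 3
    restart: unless-stopped
```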
## 2. Validate Configuration
```bash
docker compose -f docker-compose.yml config > /tmp/merged.yml
# Validate against OpenAPI specs if needed
```
## 3. Commit Locally
```bash
git add docker/compose/*.yml
git commit -m "Add myservice stack"
```
## 4. Push to Remote & Trigger GitOps
```bash
git push origin main
```
The Portainer EE GitOps agent will automatically deploy the stack. Monitor it via the Portainer UI or the Portainer REST API.
## 5. Post-Deployment Verification
| Check | Command | Expected Result |
|-------|---------|-----------------|
| Service Running | `docker ps --filter "name=myservice"` | One container running |
| Health Endpoint | `curl http://localhost:8080/health` | 200 OK |
| Logs | `docker logs myservice` | No fatal errors |
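The health-endpoint check above can be wrapped in a small polling helper so verification survives a slow container startup. A minimal sketch; the URL and retry count are examples:

```bash
# wait_for_health URL [RETRIES] -- poll until curl succeeds or retries run out
wait_for_health() {
  url="$1"; tries="${2:-10}"; i=0
  while [ "$i" -lt "$tries" ]; do
    if curl -fsS -o /dev/null "$url"; then
      echo "healthy: $url"
      return 0
    fi
    i=$((i + 1)); sleep 2
  done
  echo "unhealthy after $tries attempts: $url" >&2
  return 1
}

# Example (endpoint is illustrative):
# wait_for_health http://localhost:8080/health 10
```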
## 6. Update Documentation
1. Add entry to `docs/services/VERIFIED_SERVICE_INVENTORY.md`.
2. Create a quickstart guide in `docs/services/<service>/README.md`.
3. Publish to the shared wiki.
## 7. Optional Terraform Sync
If the service also needs infra changes (e.g., new VM), update the Terraform modules under `infra/` and run `terragrunt run-all apply`.
---
**Gotchas**
- *Race conditions*: rebase onto `origin/main` before pushing so the GitOps agent deploys a single coherent commit.
- *Healthcheck failures*: check the Portainer Events view for the failing container.
- *Secrets*: store them in Vault and reference them in the compose `secrets` section; never commit them to the repo.

docs/runbooks/add-new-user.md
# Add New User Runbook
## Overview
This runbook provides a comprehensive procedure for onboarding new users to the homelab, including network access, service authentication, and permission management. It ensures users get appropriate access while maintaining security.
## Prerequisites
- [ ] User's full name and email address
- [ ] Desired username (lowercase, no spaces)
- [ ] Access level determined (read-only, standard, admin)
- [ ] Required services identified
- [ ] Admin access to all relevant systems
- [ ] Authentik admin access (for SSO services)
- [ ] Tailscale admin access (for VPN)
- [ ] Synology admin access (for file shares)
## Metadata
- **Estimated Time**: 30-60 minutes
- **Risk Level**: Low (proper access controls in place)
- **Requires Downtime**: No
- **Reversible**: Yes (can remove user access)
- **Tested On**: 2026-02-14
## User Access Levels
| Level | Description | Typical Use Case | Services |
|-------|-------------|------------------|----------|
| **Guest** | Read-only, limited services | Family, friends | Plex, Jellyfin |
| **Standard** | Read/write, most services | Family members | Media + storage |
| **Power User** | Advanced services | Tech-savvy users | Dev tools, monitoring |
| **Admin** | Full access, can manage | Co-admins, yourself | Everything + admin panels |
## Pre-Onboarding Checklist
### Step 1: Gather Information
Create a user profile document:
```markdown
# New User: [Name]
**Username**: [username]
**Email**: [email@domain.com]
**Access Level**: [Guest/Standard/Power User/Admin]
**Start Date**: [YYYY-MM-DD]
## Services Requested:
- [ ] Plex/Jellyfin (Media streaming)
- [ ] File Shares (NAS access)
- [ ] Immich (Photo backup)
- [ ] Paperless (Document management)
- [ ] Development tools (Gitea, etc.)
- [ ] Monitoring dashboards
- [ ] Other: ___________
## Access Requirements:
- [ ] Remote access (Tailscale VPN)
- [ ] Local network only
- [ ] Mobile apps
- [ ] Web browser only
## Notes:
[Any special requirements or restrictions]
```
### Step 2: Plan Access
Determine which systems need accounts:
- [ ] **Tailscale** (VPN access to homelab)
- [ ] **Authentik** (SSO for web services)
- [ ] **Synology NAS** (File shares - Atlantis/Calypso)
- [ ] **Plex** (Media streaming)
- [ ] **Jellyfin** (Alternative media)
- [ ] **Immich** (Photo management)
- [ ] **Portainer** (Container management - admin only)
- [ ] **Grafana** (Monitoring - admin/power user)
- [ ] **Other services**: ___________
## User Onboarding Procedure
### Step 1: Create Tailscale Access
**Why First**: Tailscale provides secure remote access to the homelab network.
1. **Invite via Tailscale Admin Console**:
- Go to https://login.tailscale.com/admin/settings/users
- Click **Invite Users**
- Enter user's email
- Set expiration (optional)
- Click **Send Invite**
2. **User receives email**:
- User clicks invitation link
- Creates Tailscale account
- Installs Tailscale app on their device(s)
- Connects to your tailnet
3. **Configure ACLs** (if needed):
```json
// In Tailscale Admin Console → Access Controls
{
"acls": [
// Existing ACLs...
{
"action": "accept",
"src": ["user@email.com"],
"dst": [
"atlantis:*", // Allow access to Atlantis
"calypso:*", // Allow access to Calypso
"homelab-vm:*" // Allow access to VM
]
}
]
}
```
4. **Test connectivity**:
```bash
# Ask user to test
ping atlantis.your-tailnet.ts.net
curl http://atlantis.your-tailnet.ts.net:9000 # Test Portainer
```
### Step 2: Create Authentik Account (SSO)
**Purpose**: Single sign-on for most web services.
1. **Access Authentik Admin**:
- Navigate to your Authentik instance
- Log in as admin
2. **Create User**:
- **Directory** → **Users** → **Create**
- Fill in:
- **Username**: `username` (lowercase)
- **Name**: `First Last`
- **Email**: `user@email.com`
- **Groups**: Add to appropriate groups
- `homelab-users` (standard access)
- `homelab-admins` (for admin users)
- Service-specific groups (e.g., `jellyfin-users`)
3. **Set Password**:
- Option A: Set temporary password, force change on first login
- Option B: Send password reset link via email
4. **Assign Service Access**:
- **Applications** → **Outposts**
- For each service the user should access:
- Edit application
- Add user/group to **Policy Bindings**
5. **Test SSO**:
```bash
# User should test login to SSO-enabled services
# Example: Grafana, Jellyseerr, etc.
```
### Step 3: Create Synology NAS Account
**Purpose**: Access to file shares, Photos, Drive, etc.
#### On Atlantis (Primary NAS):
```bash
# SSH to Atlantis
ssh admin@atlantis
# Create user (DSM 7.x)
# Via DSM UI (recommended):
```
1. **Control Panel** → **User & Group** → **User** → **Create**
2. Fill in:
- **Name**: `username`
- **Description**: `[Full Name]`
- **Email**: `user@email.com`
- **Password**: Set strong password
3. **Join Groups**:
- `users` (default)
- `http` (if web service access needed)
4. **Configure Permissions**:
- **Applications** tab:
- [ ] Synology Photos (if needed)
- [ ] Synology Drive (if needed)
- [ ] File Station
- [ ] Other apps as needed
- **Shared Folders** tab:
- Set permissions for each share:
- Read/Write: For shares user can modify
- Read-only: For media libraries
- No access: For restricted folders
5. **User Quotas** (optional):
- Set storage quota if needed
- Limit upload/download speed if needed
6. **Click Create**
#### On Calypso (Secondary NAS):
Repeat the same process if user needs access to Calypso.
**Alternative: SSH Method**:
```bash
# Create user via command line
sudo synouser --add username "Full Name" "password" "user@email.com" 0 "" 0
# Add to groups
sudo synogroup --member users username add
# Set folder permissions (example)
sudo chown -R username:users /volume1/homes/username
```
### Step 4: Create Plex Account
**Option A: Managed User (Recommended for Family)**
1. Open Plex Web
2. **Settings** → **Users & Sharing** → **Manage Home Users**
3. Click **Add User**
4. Set:
- **Username**: `[Name]`
- **PIN**: 4-digit PIN
- Enable **Managed user** if restricted access desired
5. Configure library access
**Option B: Plex Account (For External Users)**
1. User creates their own Plex account
2. **Settings** → **Users & Sharing** → **Friends**
3. Invite by email
4. Select libraries to share
5. Configure restrictions:
- [ ] Allow sync
- [ ] Allow camera upload
- [ ] Rating restrictions (if children)
### Step 5: Create Jellyfin Account
```bash
# SSH to host running Jellyfin
ssh atlantis # or wherever Jellyfin runs
# Or via web UI:
```
1. Open Jellyfin web interface
2. **Dashboard** → **Users** → **Add User**
3. Set:
- **Name**: `username`
- **Password**: REDACTED_PASSWORD password
4. Configure:
- **Library access**: Select which libraries
- **Permissions**:
- [ ] Allow media deletion
- [ ] Allow remote access
- [ ] Enable live TV (if applicable)
5. **Save**
### Step 6: Create Immich Account (If Used)
```bash
# Via Immich web interface
```
1. Open Immich
2. **Administration** → **Users** → **Create User**
3. Set:
- **Email**: `user@email.com`
- **Password**: REDACTED_PASSWORD password
- **Name**: `Full Name`
4. User logs in and sets up mobile app
### Step 7: Grant Service-Specific Access
#### Gitea (Development)
1. Gitea web interface
2. **Site Administration** → **User Accounts** → **Create User Account**
3. Fill in details
4. Add to appropriate organizations/teams
#### Portainer (Admin/Power Users Only)
1. Portainer web interface
2. **Users** → **Add user**
3. Set:
- **Username**: `username`
- **Password**: REDACTED_PASSWORD password
4. Assign role:
- **Administrator**: Full access
- **Operator**: Can manage containers
- **User**: Read-only
5. Assign to teams/endpoints
#### Grafana (Monitoring)
If using Authentik SSO, the user automatically gets access.
If not using SSO:
1. Grafana web interface
2. **Configuration** → **Users** → **Invite**
3. Set role:
- **Viewer**: Read-only dashboards
- **Editor**: Can create dashboards
- **Admin**: Full access
### Step 8: Configure Mobile Apps
Provide user with setup instructions:
**Plex**:
- Download Plex app
- Sign in with Plex account
- Server should auto-discover via Tailscale
**Jellyfin**:
- Download Jellyfin app
- Add server: `http://atlantis.tailnet:8096`
- Sign in with credentials
**Immich** (if used):
- Download Immich app
- Server: `http://atlantis.tailnet:2283`
- Enable auto-backup (optional)
**Synology Apps**:
- DS File (file access)
- Synology Photos
- DS Audio/Video
- Server: `atlantis.tailnet` or QuickConnect ID
**Tailscale**:
- Already installed in Step 1
- Ensure "Always On VPN" enabled for seamless access
## User Documentation Package
Provide new user with documentation:
```markdown
# Welcome to the Homelab!
Hi [Name],
Your access has been set up. Here's what you need to know:
## Network Access
**Tailscale VPN**:
- Install Tailscale from: https://tailscale.com/download
- Log in with your account (check email for invitation)
- Connect to our tailnet
- You can now access services remotely!
## Available Services
### Media Streaming
- **Plex**: https://plex.vish.gg
- Username: [plex-username]
- Watch movies, TV shows, music
- **Jellyfin**: https://jellyfin.vish.gg
- Username: [username]
- Alternative media server
### File Storage
- **Atlantis NAS**: smb://atlantis.tailnet/[your-folder]
- Access via file explorer
- Windows: \\atlantis.tailnet\folder
- Mac: smb://atlantis.tailnet/folder
### Photos
- **Immich**: https://immich.vish.gg
- Auto-backup from your phone
- Private photo storage
### Other Services
- [List other services user has access to]
## Support
If you need help:
- Email: [your-email]
- [Alternative contact method]
## Security
- Don't share passwords
- Enable 2FA where available
- Report any suspicious activity
Welcome aboard!
```
## Post-Onboarding Tasks
### Step 1: Update Documentation
```bash
cd ~/Documents/repos/homelab
# Update user access documentation
nano docs/infrastructure/USER_ACCESS_GUIDE.md
# Add user to list:
# | Username | Access Level | Services | Status |
# | username | Standard | Plex, Files, Photos | ✅ Active |
git add .
git commit -m "Add new user: [username]"
git push
```
### Step 2: Test User Access
Verify everything works:
- [ ] User can connect via Tailscale
- [ ] User can access Plex/Jellyfin
- [ ] User can access file shares
- [ ] SSO login works
- [ ] Mobile apps working
- [ ] No access to restricted services
### Step 3: Monitor Usage
```bash
# Check user activity after a few days
# Grafana dashboards should show:
# - Network traffic from user's IP
# - Service access logs
# - Any errors
# Review logs
ssh atlantis
grep username /var/log/auth.log # SSH attempts
docker logs plex | grep username # Plex usage
```
## Verification Checklist
- [ ] Tailscale invitation sent and accepted
- [ ] Authentik account created and tested
- [ ] Synology NAS account created (Atlantis/Calypso)
- [ ] Plex/Jellyfin access granted
- [ ] Required service accounts created
- [ ] Mobile apps configured and tested
- [ ] User documentation sent
- [ ] User confirmed access is working
- [ ] Documentation updated
- [ ] No access to restricted services
## User Removal Procedure
When user no longer needs access:
### Step 1: Disable Accounts
```bash
# Disable in order of security priority:
# 1. Tailscale
# Admin Console → Users → [user] → Revoke keys
# 2. Authentik
# Directory → Users → [user] → Deactivate
# 3. Synology NAS
# Control Panel → User & Group → [user] → Disable
# Or via SSH:
sudo synouser --disable username
# 4. Plex
# Settings → Users & Sharing → Remove user
# 5. Jellyfin
# Dashboard → Users → [user] → Delete
# 6. Other services
# Remove from each service individually
```
### Step 2: Archive User Data (Optional)
```bash
# Backup user's data before deleting
# Synology home folder:
tar czf /volume1/backups/user-archives/username-$(date +%Y%m%d).tar.gz \
/volume1/homes/username
# User's Immich photos (if applicable)
# User's documents (if applicable)
```
### Step 3: Delete User
After confirming data is backed up:
```bash
# Synology: Delete user
# Control Panel → User & Group → [user] → Delete
# Choose whether to keep or delete user's data
# Or via SSH:
sudo synouser --del username
sudo rm -rf /volume1/homes/username # If deleting data
```
### Step 4: Update Documentation
```bash
# Update user access guide
nano docs/infrastructure/USER_ACCESS_GUIDE.md
# Mark user as removed with date
git add .
git commit -m "Remove user: [username] - access terminated [date]"
git push
```
## Troubleshooting
### Issue: User Can't Connect via Tailscale
**Solutions**:
- Verify invitation was accepted
- Check user installed Tailscale correctly
- Verify ACLs allow user's device
- Check user's device firewall
- Try: `tailscale ping atlantis`
### Issue: SSO Login Not Working
**Solutions**:
- Verify Authentik account is active
- Check user is in correct groups
- Verify application is assigned to user
- Clear browser cookies
- Try incognito mode
- Check Authentik logs
### Issue: Can't Access File Shares
**Solutions**:
```bash
# Check Synology user exists and is enabled
ssh atlantis
sudo synouser --get username
# Check folder permissions
ls -la /volume1/homes/username
# Check SMB service is running
sudo synoservicectl --status smbd
# Test from user's machine:
smbclient -L atlantis.tailnet -U username
```
### Issue: Plex Not Showing Up for User
**Solutions**:
- Verify user accepted Plex sharing invitation
- Check library access permissions
- Verify user's account email is correct
- Try removing and re-adding the user
- Check Plex server accessibility
## Best Practices
### Security
- Use strong passwords (12+ characters, mixed case, numbers, symbols)
- Enable 2FA where available (Authentik supports it)
- Least privilege principle (only grant needed access)
- Regular access reviews (quarterly)
- Disable accounts promptly when not needed
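For the password guidance above, one quick sketch: generate a random credential with `openssl`, which is already present on most of these hosts:

```bash
# 24 random bytes, base64-encoded: a 32-character password
openssl rand -base64 24
```

Store the result in a password manager rather than reusing it across services.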
### Documentation
- Keep user list up to date
- Document special access grants
- Note user role changes
- Archive user data before deletion
### Communication
- Set clear expectations with users
- Provide good documentation
- Be responsive to access issues
- Notify users of maintenance windows
## Related Documentation
- [Infrastructure Overview](../infrastructure/INFRASTRUCTURE_OVERVIEW.md)
- [User Access Guide](../infrastructure/USER_ACCESS_GUIDE.md)
- [SSH Access Guide](../infrastructure/SSH_ACCESS_GUIDE.md)
- [Authentik SSO Setup](../infrastructure/authentik-sso.md)
- [Security Guidelines](../infrastructure/security.md)
## Change Log
- 2026-02-14 - Initial creation
- 2026-02-14 - Added comprehensive onboarding and offboarding procedures

docs/runbooks/certificate-renewal.md
# SSL/TLS Certificate Renewal Runbook
## Overview
This runbook covers SSL/TLS certificate management across the homelab, including Let's Encrypt certificates, Cloudflare Origin certificates, and self-signed certificates. It provides procedures for manual renewal, troubleshooting auto-renewal, and emergency certificate fixes.
## Prerequisites
- [ ] SSH access to relevant hosts
- [ ] Cloudflare account access (if using Cloudflare)
- [ ] Domain DNS control
- [ ] Root/sudo privileges on hosts
- [ ] Backup of current certificates
## Metadata
- **Estimated Time**: 15-45 minutes
- **Risk Level**: Medium (service downtime if misconfigured)
- **Requires Downtime**: Minimal (few seconds during reload)
- **Reversible**: Yes (can restore old certificates)
- **Tested On**: 2026-02-14
## Certificate Types in Homelab
| Type | Used For | Renewal Method | Expiration |
|------|----------|----------------|------------|
| **Let's Encrypt** | Public-facing services | Certbot auto-renewal | 90 days |
| **Cloudflare Origin** | Services behind Cloudflare Tunnel | Manual/Cloudflare dashboard | 15 years |
| **Synology Certificates** | Synology DSM, services | Synology DSM auto-renewal | 90 days |
| **Self-Signed** | Internal/dev services | Manual generation | As configured |
## Certificate Inventory
Document your current certificates:
```bash
# Check Let's Encrypt certificates (on Linux hosts)
sudo certbot certificates
# Check Synology certificates
# DSM UI → Control Panel → Security → Certificate
# Or SSH:
sudo cat /usr/syno/etc/certificate/_archive/*/cert.pem | openssl x509 -text -noout
# Check certificate expiration for any domain
echo | openssl s_client -servername service.vish.gg -connect service.vish.gg:443 2>/dev/null | openssl x509 -noout -dates
# Check all certificates at once
for domain in st.vish.gg gf.vish.gg mx.vish.gg; do
echo "=== $domain ==="
echo | timeout 5 openssl s_client -servername $domain -connect $domain:443 2>/dev/null | openssl x509 -noout -dates
echo
done
```
Create inventory:
```markdown
| Domain | Type | Expiry Date | Auto-Renew | Status |
|--------|------|-------------|------------|--------|
| vish.gg | Let's Encrypt | 2026-05-15 | ✅ Yes | ✅ Valid |
| st.vish.gg | Let's Encrypt | 2026-05-15 | ✅ Yes | ✅ Valid |
| gf.vish.gg | Let's Encrypt | 2026-05-15 | ✅ Yes | ✅ Valid |
```
## Let's Encrypt Certificate Renewal
### Automatic Renewal (Certbot)
Let's Encrypt certificates should auto-renew. Check the renewal setup:
```bash
# Check certbot timer status (systemd)
sudo systemctl status certbot.timer
# Check cron job (if using cron)
sudo crontab -l | grep certbot
# Test renewal (dry-run, doesn't actually renew)
sudo certbot renew --dry-run
# Expected output:
# Congratulations, all simulated renewals succeeded
```
### Manual Renewal
If auto-renewal fails or you need to renew manually:
```bash
# Renew all certificates
sudo certbot renew
# Renew specific certificate
sudo certbot renew --cert-name vish.gg
# Force renewal (even if not expired)
sudo certbot renew --force-renewal
# Renew with verbose output for troubleshooting
sudo certbot renew --verbose
```
After renewal, reload web servers:
```bash
# Nginx
sudo nginx -t # Test configuration
sudo systemctl reload nginx
# Apache
sudo apachectl configtest
sudo systemctl reload apache2
```
### Let's Encrypt with Nginx Proxy Manager
If using Nginx Proxy Manager (NPM):
1. Open NPM UI (typically port 81)
2. Go to **SSL Certificates** tab
3. Certificates should auto-renew 30 days before expiry
4. To force renewal:
- Click the certificate
- Click **Renew** button
5. No service reload needed (NPM handles it)
## Synology Certificate Renewal
### Automatic Renewal on Synology NAS
```bash
# SSH to Synology NAS (Atlantis or Calypso)
ssh atlantis # or calypso
# Check certificate status
sudo /usr/syno/sbin/syno-letsencrypt list
# Force renewal check
sudo /usr/syno/sbin/syno-letsencrypt renew-all
# Check renewal logs
sudo cat /var/log/letsencrypt/letsencrypt.log
# Verify certificate expiry
sudo openssl x509 -in /usr/syno/etc/certificate/system/default/cert.pem -text -noout | grep "Not After"
```
### Via Synology DSM UI
1. Log in to DSM
2. **Control Panel** → **Security** → **Certificate**
3. Select certificate → Click **Renew**
4. DSM will automatically renew and apply
5. No manual reload needed
### Synology Certificate Configuration
Enable auto-renewal in DSM:
1. **Control Panel** → **Security** → **Certificate**
2. Click **Settings** button
3. Check **Auto-renew certificate**
4. Synology will renew 30 days before expiry
## Stoatchat Certificates (Gaming VPS)
The Stoatchat gaming server uses Let's Encrypt with Certbot:
```bash
# SSH to gaming VPS
ssh root@gaming-vps
# Check certificates
sudo certbot certificates
# Domains covered:
# - st.vish.gg
# - api.st.vish.gg
# - events.st.vish.gg
# - files.st.vish.gg
# - proxy.st.vish.gg
# - voice.st.vish.gg
# Renew all
sudo certbot renew
# Reload Nginx
sudo systemctl reload nginx
```
Auto-renewal cron:
```bash
# Check certbot timer
sudo systemctl status certbot.timer
# Or check cron
sudo crontab -l | grep certbot
```
## Cloudflare Origin Certificates
For services using Cloudflare Tunnel:
### Generate New Origin Certificate
1. Log in to Cloudflare Dashboard
2. Select domain (vish.gg)
3. **SSL/TLS** → **Origin Server**
4. Click **Create Certificate**
5. Configure:
- **Private key type**: RSA (2048)
- **Hostnames**: *.vish.gg, vish.gg
- **Certificate validity**: 15 years
6. Copy certificate and private key
7. Save to secure location
### Install Origin Certificate
```bash
# SSH to target host
ssh [host]
# Create certificate files
sudo nano /etc/ssl/cloudflare/cert.pem
# Paste certificate
sudo nano /etc/ssl/cloudflare/key.pem
# Paste private key
# Set permissions
sudo chmod 644 /etc/ssl/cloudflare/cert.pem
sudo chmod 600 /etc/ssl/cloudflare/key.pem
# Update Nginx configuration
sudo nano /etc/nginx/sites-available/[service]
# Use new certificate
ssl_certificate /etc/ssl/cloudflare/cert.pem;
ssl_certificate_key /etc/ssl/cloudflare/key.pem;
# Test and reload
sudo nginx -t
sudo systemctl reload nginx
```
## Self-Signed Certificates (Internal/Dev)
For internal-only services not exposed publicly:
### Generate Self-Signed Certificate
```bash
# Generate 10-year self-signed certificate
sudo openssl req -x509 -nodes -days 3650 -newkey rsa:2048 \
-keyout /etc/ssl/private/selfsigned.key \
-out /etc/ssl/certs/selfsigned.crt \
-subj "/C=US/ST=State/L=City/O=Homelab/CN=internal.vish.local"
# Generate with SAN (Subject Alternative Names) for multiple domains
sudo openssl req -x509 -nodes -days 3650 -newkey rsa:2048 \
-keyout /etc/ssl/private/selfsigned.key \
-out /etc/ssl/certs/selfsigned.crt \
-subj "/C=US/ST=State/L=City/O=Homelab/CN=*.vish.local" \
-addext "subjectAltName=DNS:*.vish.local,DNS:vish.local"
# Set permissions
sudo chmod 600 /etc/ssl/private/selfsigned.key
sudo chmod 644 /etc/ssl/certs/selfsigned.crt
```
### Install in Services
Update Docker Compose to mount certificates:
```yaml
services:
service:
volumes:
- /etc/ssl/certs/selfsigned.crt:/etc/ssl/certs/cert.pem:ro
- /etc/ssl/private/selfsigned.key:/etc/ssl/private/key.pem:ro
```
## Monitoring Certificate Expiration
### Set Up Expiration Alerts
Create a certificate monitoring script:
```bash
sudo nano /usr/local/bin/check-certificates.sh
```
```bash
#!/bin/bash
# Certificate Expiration Monitoring Script
DOMAINS=(
"vish.gg"
"st.vish.gg"
"gf.vish.gg"
"mx.vish.gg"
)
ALERT_DAYS=30 # Alert if expiring within 30 days
WEBHOOK_URL="https://ntfy.sh/REDACTED_TOPIC" # Your notification webhook
for domain in "${DOMAINS[@]}"; do
echo "Checking $domain..."
# Get certificate expiration date
expiry=$(echo | openssl s_client -servername $domain -connect $domain:443 2>/dev/null | \
openssl x509 -noout -dates | grep "notAfter" | cut -d= -f2)
# Convert to epoch time
expiry_epoch=$(date -d "$expiry" +%s)
current_epoch=$(date +%s)
days_left=$(( ($expiry_epoch - $current_epoch) / 86400 ))
echo "$domain expires in $days_left days"
if [ $days_left -lt $ALERT_DAYS ]; then
# Send alert
curl -H "Title: Certificate Expiring Soon" \
-H "Priority: high" \
-H "Tags: warning,certificate" \
-d "Certificate for $domain expires in $days_left days!" \
$WEBHOOK_URL
echo "⚠️ Alert sent for $domain"
fi
echo
done
```
Make executable and add to cron:
```bash
sudo chmod +x /usr/local/bin/check-certificates.sh
# Add to cron (daily at 9 AM)
(crontab -l 2>/dev/null; echo "0 9 * * * /usr/local/bin/check-certificates.sh") | crontab -
```
### Grafana Dashboard
Add certificate monitoring to Grafana:
```yaml
# Install blackbox_exporter for HTTPS probing
# Add to prometheus.yml:
scrape_configs:
- job_name: 'blackbox'
metrics_path: /probe
params:
module: [http_2xx]
static_configs:
- targets:
- https://vish.gg
- https://st.vish.gg
- https://gf.vish.gg
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: blackbox-exporter:9115
# Create alert rule:
- alert: SSLCertificateExpiring
expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
labels:
severity: warning
annotations:
summary: "SSL certificate expiring soon"
description: "SSL certificate for {{ $labels.instance }} expires in {{ $value | REDACTED_APP_PASSWORD }}"
```
## Troubleshooting
### Issue: Certbot Renewal Failing
**Symptoms**: `certbot renew` fails with DNS or HTTP challenge errors
**Solutions**:
```bash
# Check detailed error logs
sudo certbot renew --verbose
# Common issues:
# 1. Port 80/443 not accessible
sudo ufw status # Check firewall
sudo netstat -tlnp | grep :80 # Check if port is listening
# 2. DNS not resolving correctly
dig vish.gg # Verify DNS points to correct IP
# 3. Rate limits hit
# Let's Encrypt has rate limits: 50 certificates per domain per week
# Wait 7 days or use --staging for testing
# 4. Webroot path incorrect
sudo certbot renew --webroot -w /var/www/html
# 5. Try force renewal with different challenge
sudo certbot renew --force-renewal --preferred-challenges dns
```
### Issue: Certificate Valid But Browser Shows Warning
**Symptoms**: Certificate is valid but browsers show security warning
**Solutions**:
```bash
# Check certificate chain
openssl s_client -connect vish.gg:443 -showcerts
# Ensure intermediate certificates are included
# Nginx: Use fullchain.pem, not cert.pem
ssl_certificate /etc/letsencrypt/live/vish.gg/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/vish.gg/privkey.pem;
# Test SSL configuration
curl -I https://vish.gg
# Or use: https://www.ssllabs.com/ssltest/
```
### Issue: Synology Certificate Not Auto-Renewing
**Symptoms**: DSM certificate expired or shows renewal error
**Solutions**:
```bash
# SSH to Synology
ssh atlantis
# Check renewal logs
sudo cat /var/log/letsencrypt/letsencrypt.log
# Common issues:
# 1. Port 80 forwarding
# Ensure port 80 is forwarded to NAS during renewal
# 2. Domain validation
# Check DNS points to correct external IP
# 3. Force renewal
sudo /usr/syno/sbin/syno-letsencrypt renew-all
# 4. Restart certificate service
sudo synosystemctl restart nginx
```
### Issue: Nginx Won't Reload After Certificate Update
**Symptoms**: `nginx -t` shows SSL errors
**Solutions**:
```bash
# Test Nginx configuration
sudo nginx -t
# Common errors:
# 1. Certificate path incorrect
# Fix: Update nginx config with correct path
# 2. Certificate and key mismatch
# Verify:
sudo openssl x509 -noout -modulus -in cert.pem | openssl md5
sudo openssl rsa -noout -modulus -in key.pem | openssl md5
# MD5 sums should match
# 3. Permission issues
sudo chmod 644 /etc/ssl/certs/cert.pem
sudo chmod 600 /etc/ssl/private/key.pem
sudo chown root:root /etc/ssl/certs/cert.pem /etc/ssl/private/key.pem
# 4. SELinux blocking (if enabled)
sudo setsebool -P httpd_read_user_content 1
```
## Emergency Certificate Fix
If a certificate expires and services are down:
### Quick Fix: Use Self-Signed Temporarily
```bash
# Generate emergency self-signed certificate
sudo openssl req -x509 -nodes -days 30 -newkey rsa:2048 \
-keyout /tmp/emergency.key \
-out /tmp/emergency.crt \
-subj "/CN=*.vish.gg"
# Update Nginx to use emergency cert
sudo nano /etc/nginx/sites-available/default
ssl_certificate /tmp/emergency.crt;
ssl_certificate_key /tmp/emergency.key;
# Reload Nginx
sudo nginx -t && sudo systemctl reload nginx
# Services are now accessible (with browser warning)
# Then fix proper certificate renewal
```
### Restore from Backup
```bash
# If certificates were backed up
sudo cp /backup/letsencrypt/archive/vish.gg/* /etc/letsencrypt/archive/vish.gg/
# Update symlinks
sudo certbot certificates # Shows current status
sudo certbot install --cert-name vish.gg
```
## Best Practices
### Renewal Schedule
- certbot renews Let's Encrypt certificates at day 60 of their 90-day lifetime (30 days before expiry)
- Check certificates monthly
- Set up expiration alerts
- Test renewal process quarterly
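The monthly check can be scripted with `openssl x509 -checkend`. A minimal sketch against a local certificate file (the path and 30-day threshold are examples; adjust per host):

```shell
#!/bin/sh
# Warn when a certificate file is within WARN_DAYS of expiry.
CERT=/etc/letsencrypt/live/vish.gg/cert.pem   # adjust path per host
WARN_DAYS=30
# -checkend returns 0 if the cert is still valid after the given seconds
if openssl x509 -noout -checkend $(( WARN_DAYS * 86400 )) -in "$CERT" 2>/dev/null; then
  echo "OK: $CERT valid for more than $WARN_DAYS days"
else
  echo "WARNING: $CERT missing or expires within $WARN_DAYS days"
fi
```

Run it from cron and route WARNING lines to your notification service (e.g., Gotify).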
### Backup Certificates
```bash
# Backup Let's Encrypt certificates
sudo tar czf ~/letsencrypt-backup-$(date +%Y%m%d).tar.gz /etc/letsencrypt/
# Backup Synology certificates
# Done via Synology backup tasks
# Store backups securely (encrypted, off-site)
```
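Before trusting a backup, spot-check that the archive actually contains the live certificates (a sketch assuming the backup layout produced above):

```shell
# Confirm the newest archive contains the live fullchain.
BACKUP=$(ls -t ~/letsencrypt-backup-*.tar.gz 2>/dev/null | head -1)
if [ -n "$BACKUP" ] && tar tzf "$BACKUP" | grep -q 'live/vish.gg/fullchain.pem'; then
  echo "OK: $BACKUP contains live certs"
else
  echo "WARNING: no backup found or live certs missing"
fi
```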
### Documentation
- Document which certificates are used where
- Keep inventory of expiration dates
- Document renewal procedures
- Note any special configurations
## Verification Checklist
After certificate renewal:
- [ ] Certificate renewed successfully
- [ ] Certificate expiry date extended
- [ ] Web servers reloaded without errors
- [ ] All services accessible via HTTPS
- [ ] No browser security warnings
- [ ] Certificate chain complete
- [ ] Auto-renewal still enabled
- [ ] Monitoring updated (if needed)
## Related Documentation
- [Infrastructure Overview](../infrastructure/INFRASTRUCTURE_OVERVIEW.md)
- [Nginx Configuration](../infrastructure/networking.md)
- [Cloudflare Tunnels Setup](../infrastructure/cloudflare-tunnels-setup.md)
- [Emergency Access Guide](../troubleshooting/EMERGENCY_ACCESS_GUIDE.md)
## Change Log
- 2026-02-14 - Initial creation
- 2026-02-14 - Added monitoring and troubleshooting sections

# Credential Rotation Runbook
## Overview
Step-by-step rotation procedures for all credentials exposed in the
`homelab-optimized` public mirror (audited 2026-02-20). Work through each
section in priority order. After updating secrets in compose files, commit
and push — GitOps will redeploy automatically.
> **Note:** Almost all of these stem from the same root cause — secrets were
> hard-coded in compose files, then those files were committed to git, then
> `generate_service_docs.py` and wiki-upload scripts duplicated those secrets
> into documentation, creating 35× copies of every secret across the repo.
> See the "Going Forward" section for how to prevent this.
## Prerequisites
- [ ] SSH / Tailscale access to Atlantis, Calypso, Homelab VM, Seattle VM, matrix-ubuntu-vm
- [ ] Gitea admin access (`git.vish.gg`)
- [ ] Authentik admin access
- [ ] Google account access (Gmail app passwords)
- [ ] Cloudflare dashboard access
- [ ] OpenAI platform access
- [ ] Write access to this repository
## Metadata
- **Estimated Time**: 4-6 hours
- **Risk Level**: Medium (service restarts required for most items)
- **Requires Downtime**: Brief per-service restart only
- **Reversible**: Yes (old values can be restored if something breaks)
- **Last Updated**: 2026-02-20
---
## Priority 1 — Rotate Immediately (Externally Usable Tokens)
### 1. Gitea API Tokens
Two tokens hard-coded across scripts and docs.
#### 1a. Wiki/scripts token (`77e3ddaf...`)
**Files to update:**
- `scripts/cleanup-gitea-wiki.sh`
- `scripts/upload-all-docs-to-gitea-wiki.sh`
- `scripts/upload-to-gitea-wiki.sh`
- `scripts/create-clean-organized-wiki.sh`
- `scripts/upload-organized-wiki.sh`
- `docs/admin/DOCUMENTATION_MAINTENANCE_GUIDE.md`
```bash
# 1. Go to https://git.vish.gg/user/settings/applications
# 2. Revoke the token starting 77e3ddaf
# 3. Generate new token, name: homelab-wiki, scope: repo
# 4. Replace in all files:
NEW_TOKEN=REDACTED_TOKEN
for f in scripts/cleanup-gitea-wiki.sh \
scripts/upload-all-docs-to-gitea-wiki.sh \
scripts/upload-to-gitea-wiki.sh \
scripts/create-clean-organized-wiki.sh \
scripts/upload-organized-wiki.sh \
docs/admin/DOCUMENTATION_MAINTENANCE_GUIDE.md; do
sed -i "s/REDACTED_GITEA_TOKEN/$NEW_TOKEN/g" "$f"
done
```
#### 1b. Retro-site clone token (`52fa6ccb...`)
**File:** `Calypso/retro-site.yaml` and `hosts/synology/calypso/retro-site.yaml`
```bash
# 1. Go to https://git.vish.gg/user/settings/applications
# 2. Revoke the token starting 52fa6ccb
# 3. Generate new token, name: retro-site-deploy, scope: repo:read
# 4. Update the git clone URL in both compose files
# Consider switching to a deploy key for least-privilege access
```
---
### 2. Cloudflare API Token (`FGXlHM7doB8Z...`)
Appears in 13 files including active dynamic DNS updaters on multiple hosts.
**Files to update (active deployments):**
- `hosts/synology/atlantis/dynamicdnsupdater.yaml`
- `hosts/physical/guava/portainer_yaml/dynamic_dns.yaml`
- `hosts/physical/concord-nuc/dyndns_updater.yaml`
- Various Calypso/homelab-vm DDNS configs
**Files to sanitize (docs):**
- `docs/infrastructure/cloudflare-dns.md`
- `docs/infrastructure/npm-migration-jan2026.md`
- Any `docs/services/individual/ddns-*.md` files
```bash
# 1. Go to https://dash.cloudflare.com/profile/api-tokens
# 2. Find the token (FGXlHM7doB8Z...) and click Revoke
# 3. Create a new token: use "Edit zone DNS" template, scope to your zone only
# 4. Replace in all compose files above
# 5. Replace hardcoded value in docs with: YOUR_CLOUDFLARE_API_TOKEN
# Verify DDNS containers restart and can still update DNS:
docker logs cloudflare-ddns --tail 20
```
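Before wiring the new token into the DDNS containers, it can be sanity-checked against Cloudflare's token-verify endpoint (requires outbound network; the token value below is a placeholder):

```shell
# Cloudflare returns {"result":{"status":"active"},...} for a valid token.
NEW_CF_TOKEN=your_new_token_here   # placeholder -- paste the new token
resp=$(curl -s -H "Authorization: Bearer $NEW_CF_TOKEN" \
  "https://api.cloudflare.com/client/v4/user/tokens/verify")
if echo "$resp" | grep -q '"status":"active"'; then
  echo "token OK"
else
  echo "token NOT active: $resp"
fi
```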
---
### 3. OpenAI API Key (`sk-proj-C_IYp6io...`)
**Files to update:**
- `hosts/vms/homelab-vm/hoarder.yaml`
- `docs/services/individual/web.md` (replace with placeholder)
```bash
# 1. Go to https://platform.openai.com/api-keys
# 2. Delete the exposed key
# 3. Create a new key, set a usage limit
# 4. Update OPENAI_API_KEY in hoarder.yaml
# 5. Replace value in docs with: YOUR_OPENAI_API_KEY
```
---
## Priority 2 — OAuth / SSO Secrets
### 4. Grafana ↔ Authentik OAuth Secret
**Files to update:**
- `hosts/vms/homelab-vm/monitoring.yaml`
- `hosts/synology/atlantis/grafana.yml`
- `docs/infrastructure/authentik-sso.md` (replace with placeholder)
- `docs/services/individual/grafana-oauth.md` (replace with placeholder)
```bash
# 1. Log into Authentik admin: https://auth.vish.gg/if/admin/
# 2. Applications → Providers → find Grafana OAuth2 provider
# 3. Edit → regenerate Client Secret → copy both Client ID and Secret
# 4. Update in both compose files:
# GF_AUTH_GENERIC_OAUTH_CLIENT_ID: NEW_ID
# GF_AUTH_GENERIC_OAUTH_CLIENT_SECRET: NEW_SECRET
# 5. Commit and push — both Grafana stacks restart automatically
# Verify SSO works after restart:
curl -I https://gf.vish.gg
```
---
### 5. Seafile ↔ Authentik OAuth Secret
**Files to update:**
- `hosts/synology/calypso/seafile-oauth-config.py`
- `docs/services/individual/seafile-oauth.md` (replace with placeholder)
```bash
# 1. Log into Authentik admin
# 2. Applications → Providers → find Seafile OAuth2 provider
# 3. Regenerate client secret
# 4. Update OAUTH_CLIENT_ID and OAUTH_CLIENT_SECRET in seafile-oauth-config.py
# 5. Re-run the config script on the Seafile server to apply
```
---
### 6. Authentik Secret Key (`RpRexcYo5HAz...`)
**Critical** — this key encrypts all Authentik data (tokens, sessions, stored credentials).
**File:** `hosts/synology/calypso/authentik/docker-compose.yaml`
```bash
# 1. Generate a new secret:
python3 -c "import secrets; print(secrets.token_urlsafe(50))"
# 2. Update AUTHENTIK_SECRET_KEY in docker-compose.yaml
# 3. Commit and push — Authentik will restart
# WARNING: All active Authentik sessions will be invalidated.
# Users will need to log back in. SSO-protected services
# may temporarily show login errors while Authentik restarts.
# Verify Authentik is healthy after restart:
docker logs authentik_server --tail 30
```
---
## Priority 3 — Application Secrets (Require Service Restart)
### 7. Gmail App Passwords
Five distinct app passwords were found across the repo. Revoke all of them
in Google Account → Security → App passwords, then create new per-service ones.
| Password | Used For | Active Files |
|----------|----------|-------------|
| (see Vaultwarden) | Mastodon, Joplin, Authentik SMTP | `matrix-ubuntu-vm/mastodon/.env.production.template`, `atlantis/joplin.yml`, `calypso/authentik/docker-compose.yaml` |
| (see Vaultwarden) | Vaultwarden SMTP | `atlantis/vaultwarden.yaml` |
| (see Vaultwarden) | Documenso SMTP | `atlantis/documenso/documenso.yaml` |
| (see Vaultwarden) | Reactive Resume v4 (archived) | `archive/reactive_resume_v4_archived/docker-compose.yml` |
| (see Vaultwarden) | Reactive Resume v5 (active) | `calypso/reactive_resume_v5/docker-compose.yml` |
**Best practice:** Create one app password per service, named clearly (e.g.,
`homelab-joplin`, `homelab-mastodon`). Update each file's `SMTP_PASS` /
`SMTP_PASSWORD` / `MAILER_AUTH_PASSWORD` / `smtp_password` field.
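To make sure no file is missed, grep the repo for every SMTP credential field name before editing (the field names are the ones this audit found; add any others your services use):

```shell
# List every file/line holding an SMTP password field (-i also catches
# lowercase variants like smtp_password).
grep -rniE 'SMTP_PASS(WORD)?|MAILER_AUTH_PASSWORD' \
  hosts/ archive/ \
  --include='*.yml' --include='*.yaml' --include='*.template'
```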
---
### 8. Matrix Synapse Secrets
Three secrets in `homeserver.yaml`, plus the TURN shared secret.
**File:** `hosts/synology/atlantis/matrix_synapse_docs/homeserver.yaml`
```bash
# Generate fresh values for each:
python3 -c "import secrets; print(secrets.token_urlsafe(48))"
# Fields to rotate:
# registration_shared_secret
# macaroon_secret_key
# form_secret
# turn_shared_secret
# After updating homeserver.yaml, restart Synapse:
docker restart synapse # or via Portainer
# Also update coturn config on the server directly:
ssh atlantis
nano /path/to/turnserver.conf
# Update: static-auth-secret=NEW_TURN_SECRET
systemctl restart coturn
# Update instructions.txt — replace old values with REDACTED
```
---
### 9. Mastodon `SECRET_KEY_BASE` + `OTP_SECRET`
**File:** `hosts/synology/atlantis/mastodon.yml`
**Also in:** `docs/services/individual/mastodon.md` (replace with placeholder)
```bash
# Generate new values:
openssl rand -hex 64 # for SECRET_KEY_BASE
openssl rand -hex 64 # for OTP_SECRET
# Update both in mastodon.yml
# Commit and push — GitOps restarts Mastodon
# WARNING: All active user sessions are invalidated. Users must log back in.
# Verify Mastodon web is accessible:
curl -I https://your-mastodon-domain/
docker logs mastodon_web --tail 20
```
---
### 10. Documenso Secrets (3 keys)
**Files:**
- `hosts/synology/atlantis/documenso/documenso.yaml`
- `hosts/synology/atlantis/documenso/Secrets.txt` (will be removed by sanitizer)
- `docs/services/individual/documenso.md` (replace with placeholder)
```bash
# Generate new values:
python3 -c "import secrets; print(secrets.token_urlsafe(32))" # NEXTAUTH_SECRET
python3 -c "import secrets; print(secrets.token_urlsafe(32))" # NEXT_PRIVATE_ENCRYPTION_KEY
python3 -c "import secrets; print(secrets.token_urlsafe(32))" # NEXT_PRIVATE_ENCRYPTION_SECONDARY_KEY
# Update all three in documenso.yaml
# NOTE: Rotating encryption keys will invalidate signed documents.
# Confirm this is acceptable before rotating.
```
---
### 11. Paperless-NGX API Token
**Files:**
- `hosts/synology/calypso/paperless/paperless-ai.yml`
- `hosts/synology/calypso/paperless/README.md` (replace with placeholder)
- `docs/services/paperless.md` (replace with placeholder)
```bash
# 1. Log into Paperless web UI
# 2. Admin → Auth Token → delete existing, generate new
# 3. Update PAPERLESS_API_TOKEN in paperless-ai.yml
# 4. Commit and push
```
---
### 12. Immich JWT Secret (Both NAS)
**Files:**
- `hosts/synology/atlantis/immich/stack.env` (will be removed by sanitizer)
- `hosts/synology/calypso/immich/stack.env` (will be removed by sanitizer)
Since these files are removed by the sanitizer, ensure they are in `.gitignore`
or managed via Portainer env variables going forward.
```bash
# Generate new secret:
openssl rand -base64 96
# Update JWT_SECRET in both stack.env files locally,
# then apply via Portainer (not committed to git).
# WARNING: All active Immich sessions invalidated.
```
---
### 13. Revolt/Stoatchat — LiveKit API Secret + VAPID Private Key
**Files:**
- `hosts/vms/seattle/stoatchat/livekit.yml`
- `hosts/vms/seattle/stoatchat/Revolt.overrides.toml`
- `hosts/vms/homelab-vm/stoatchat.yaml`
- `docs/services/stoatchat/Revolt.overrides.toml` (replace with placeholder)
- `hosts/vms/seattle/stoatchat/DEPLOYMENT_SUMMARY.md` (replace with placeholder)
```bash
# Generate new LiveKit API key/secret pair:
# Use the LiveKit CLI or generate random strings:
python3 -c "import secrets; print(secrets.token_urlsafe(24))" # API key
python3 -c "import secrets; print(secrets.token_urlsafe(32))" # API secret
# Generate new VAPID key pair:
npx web-push generate-vapid-keys
# or: python3 -c "from py_vapid import Vapid; v=Vapid(); v.generate_keys(); print(v.private_key)"
# Update in livekit.yml and Revolt.overrides.toml
# Restart LiveKit and Revolt services
```
---
### 14. Jitsi Internal Auth Passwords (6 passwords)
**File:** `hosts/synology/atlantis/jitsi/jitsi.yml`
**Also in:** `hosts/synology/atlantis/jitsi/.env` (will be removed by sanitizer)
```bash
# Generate new passwords for each variable:
for var in JICOFO_COMPONENT_SECRET JICOFO_AUTH_PASSWORD JVB_AUTH_PASSWORD \
JIGASI_XMPP_PASSWORD JIBRI_RECORDER_PASSWORD JIBRI_XMPP_PASSWORD; do
echo "$var=$(openssl rand -hex 10)"
done
# Update all 6 in jitsi.yml
# Restart the entire Jitsi stack — all components must use the same passwords
docker compose -f jitsi.yml down && docker compose -f jitsi.yml up -d
```
---
### 15. SNMP v3 Auth + Priv Passwords
Used for NAS monitoring — same credentials across 6 files.
**Files to update:**
- `hosts/synology/setillo/prometheus/snmp.yml`
- `hosts/synology/atlantis/grafana_prometheus/snmp.yml`
- `hosts/synology/atlantis/grafana_prometheus/snmp_mariushosting.yml`
- `hosts/synology/calypso/grafana_prometheus/snmp.yml`
- `hosts/vms/homelab-vm/monitoring.yaml`
```bash
# 1. Log into each Synology NAS DSM
# 2. Go to Control Panel → Terminal & SNMP → SNMP tab
# 3. Update SNMPv3 auth password and privacy password to new values
# 4. Update the same values in all 5 config files above
# 5. The archive file (deprecated-monitoring-stacks) can just be left for
# the sanitizer to redact.
```
---
### 16. Invidious `hmac_key`
**Files:**
- `hosts/physical/concord-nuc/invidious/invidious.yaml`
- `hosts/physical/concord-nuc/invidious/invidious_old/invidious.yaml`
- `hosts/synology/atlantis/invidious.yml`
```bash
# Generate new hmac_key:
python3 -c "import secrets; print(secrets.token_hex(16))"
# Update hmac_key in each active invidious.yaml
# Restart Invidious containers
```
---
### 17. Open WebUI Secret Keys
**Files:**
- `hosts/vms/contabo-vm/ollama/docker-compose.yml`
- `hosts/synology/atlantis/ollama/docker-compose.yml`
- `hosts/synology/atlantis/ollama/64_bit_key.txt` (will be removed by sanitizer)
```bash
# Generate new key:
openssl rand -hex 32
# Update WEBUI_SECRET_KEY in both compose files
# Restart Open WebUI containers — active sessions invalidated
```
---
### 18. Portainer Edge Key
**File:** `hosts/vms/homelab-vm/portainer_agent.yaml`
```bash
# 1. Log into Portainer at https://192.168.0.200:9443
# 2. Go to Settings → Edge Compute → Edge Agents
# 3. Find the homelab-vm agent and regenerate its edge key
# 4. Update EDGE_KEY in portainer_agent.yaml with the new base64 value
# 5. Restart the Portainer edge agent container
```
---
### 19. OpenProject Secret Key
**File:** `hosts/vms/homelab-vm/openproject.yml`
**Also in:** `docs/services/individual/openproject.md` (replace with placeholder)
```bash
openssl rand -hex 64
# Update OPENPROJECT_SECRET_KEY_BASE in openproject.yml
# Restart OpenProject — sessions invalidated
```
---
### 20. RomM Auth Secret Key
**File:** `hosts/vms/homelab-vm/romm/romm.yaml`
**Also:** `hosts/vms/homelab-vm/romm/secret_key.yaml` (will be removed by sanitizer)
```bash
openssl rand -hex 32
# Update ROMM_AUTH_SECRET_KEY in romm.yaml
# Restart RomM — sessions invalidated
```
---
### 21. Hoarder NEXTAUTH Secret
**File:** `hosts/vms/homelab-vm/hoarder.yaml`
**Also in:** `docs/services/individual/web.md` (replace with placeholder)
```bash
openssl rand -base64 36
# Update NEXTAUTH_SECRET in hoarder.yaml
# Restart Hoarder — sessions invalidated
```
---
## Priority 4 — Shared / Weak Passwords
### 22. `REDACTED_PASSWORD123!` — Used Across 5+ Services
This password is the same for all of the following. Change each to a
**unique** strong password:
| Service | File | Variable |
|---------|------|----------|
| NetBox | `hosts/synology/atlantis/netbox.yml` | `SUPERUSER_PASSWORD` |
| Paperless admin | `hosts/synology/calypso/paperless/docker-compose.yml` | `PAPERLESS_ADMIN_PASSWORD` |
| Seafile admin | `hosts/synology/calypso/seafile-server.yaml` | `INIT_SEAFILE_ADMIN_PASSWORD` |
| Seafile admin (new) | `hosts/synology/calypso/seafile-new.yaml` | `INIT_SEAFILE_ADMIN_PASSWORD` |
| PhotoPrism | `hosts/physical/anubis/photoprism.yml` | `PHOTOPRISM_ADMIN_PASSWORD` |
| Hemmelig | `hosts/vms/bulgaria-vm/hemmelig.yml` | `SECRET_JWT_SECRET` |
| Vaultwarden admin | `hosts/synology/atlantis/bitwarden/bitwarden_token.txt` | (source password) |
For each: generate `openssl rand -base64 18`, update in the compose file,
restart the container, then log in to verify.
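The generation step can be done in one pass, emitting one unique password per service (service names here mirror the table above; store each in Vaultwarden as you go):

```shell
# One unique password per service; paste each into its compose file.
for svc in netbox paperless seafile seafile-new photoprism hemmelig; do
  printf '%-14s %s\n' "$svc" "$(openssl rand -base64 18)"
done
```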
---
### 23. `REDACTED_PASSWORD` — Used Across 3 Services
| Service | File | Variable |
|---------|------|----------|
| Gotify | `hosts/vms/homelab-vm/gotify.yml` | `GOTIFY_DEFAULTUSER_PASS` |
| Pi-hole | `hosts/synology/atlantis/pihole.yml` | `WEBPASSWORD` |
| Stirling PDF | `hosts/synology/atlantis/stirlingpdf.yml` | `SECURITY_INITIAL_LOGIN_PASSWORD` |
---
### 24. `mastodon_pass_2026` — Live PostgreSQL Password
**Files:**
- `hosts/vms/matrix-ubuntu-vm/mastodon/.env.production.template`
- `hosts/vms/matrix-ubuntu-vm/docs/SETUP.md`
```bash
# On the matrix-ubuntu-vm server:
ssh YOUR_WAN_IP
sudo -u postgres psql
ALTER USER mastodon WITH PASSWORD 'REDACTED_PASSWORD';
\q
# Update the password in .env.production.template and Mastodon's running config
# Restart Mastodon services
```
---
### 25. Watchtower API Token (`REDACTED_WATCHTOWER_TOKEN`)
| File |
|------|
| `hosts/synology/atlantis/watchtower.yml` |
| `hosts/synology/calypso/prometheus.yml` |
```bash
# Generate a proper random token:
openssl rand -hex 20
# Update WATCHTOWER_HTTP_API_TOKEN in both files
# Update any scripts that call the Watchtower API
```
---
### 26. `test:test` SSH Credentials on `YOUR_WAN_IP`
The matrix-ubuntu-vm CREDENTIALS.md shows a `test` user with password `test`.
```bash
# SSH to the server and remove or secure the test account:
ssh YOUR_WAN_IP
passwd test # change to a strong password
# or: userdel -r test # remove entirely if unused
```
---
## Priority 5 — Network Infrastructure
### 27. Management Switch Password Hashes
**File:** `mgmtswitch.conf` (will be removed from public mirror by sanitizer)
The SHA-512 hashes for `root`, `vish`, and `vkhemraj` switch accounts are
crackable offline. Rotate the switch passwords:
```bash
# SSH to the management switch
ssh admin@10.0.0.15
# Change passwords for all local accounts:
enable
configure terminal
username root secret NEW_PASSWORD
username vish secret NEW_PASSWORD
username vkhemraj secret NEW_PASSWORD
write memory
```
---
## Final Verification
After completing all rotations:
```bash
# 1. Commit and push all file changes
git add -A
git commit -m "chore(security): rotate all exposed credentials"
git push origin main
# 2. Wait for the mirror workflow to complete, then pull:
git -C /home/homelab/organized/repos/homelab-optimized pull
# 3. Verify none of the old secrets appear in the public mirror:
cd /home/homelab/organized/repos/homelab-optimized
grep -r "77e3ddaf\|52fa6ccb\|FGXlHM7d\|sk-proj-C_IYp6io\|ArP5XWdkwVyw\|bdtrpmpce\|toiunzuby" . 2>/dev/null
grep -r "244c619d\|RpRexcYo5\|mastodon_pass\|REDACTED_PASSWORD\|REDACTED_PASSWORD\|REDACTED_WATCHTOWER_TOKEN" . 2>/dev/null
grep -r "2e80b1b7d3a\|eca299ae59\|rxmr4tJoqfu\|ZjCofRlfm6\|QE5SudhZ99" . 2>/dev/null
# All should return no results
# 4. Verify GitOps deployments are healthy in Portainer:
# https://192.168.0.200:9443
```
---
## Going Forward — Preventing This Again
The root cause: secrets hard-coded in compose files that get committed to git.
**Rules:**
1. **Never hard-code secrets in compose files** — use Docker Secrets, or an
`.env` file excluded by `.gitignore` (Portainer can load env files from the
host at deploy time)
2. **Never put real values in documentation** — use `YOUR_API_KEY` placeholders
3. **Never create `Secrets.txt` or `CREDENTIALS.md` files in the repo** — use
a password manager (you already have Vaultwarden/Bitwarden)
4. **Run the sanitizer locally** before any commit that touches secrets:
```bash
# Test in a temp copy — see what the sanitizer would catch:
tmpdir=$(mktemp -d)
cp -r /path/to/homelab "$tmpdir/"
python3 "$tmpdir/homelab/.gitea/sanitize.py"
```
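Rule 4 can also be enforced automatically with a local pre-commit hook that refuses commits containing obvious secret patterns (the patterns below are illustrative; extend them with your own token prefixes and field names):

```shell
# Install a local pre-commit secret gate in the repo working copy.
cat > .git/hooks/pre-commit << 'EOF'
#!/bin/sh
# Abort the commit if staged changes look like they contain credentials.
if git diff --cached -U0 | grep -qiE 'sk-proj-|AUTHENTIK_SECRET_KEY|SMTP_PASS'; then
  echo "Possible secret in staged changes -- commit aborted." >&2
  exit 1
fi
EOF
chmod +x .git/hooks/pre-commit
```

Note this hook is per-clone (`.git/hooks` is not committed), so each machine needs it installed.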
## Related Documentation
- [Security Hardening](../security/SERVER_HARDENING.md)
- [Repository Sanitization](../admin/REPOSITORY_SANITIZATION.md)
- [GitOps Deployment Guide](../admin/gitops-deployment-guide.md)
## Portainer Git Credential Rotation
The saved Git credential **`portainer-homelab`** (credId: 1) is used by ~43 stacks to
pull compose files from `git.vish.gg`. When the Gitea token expires or is rotated,
all those stacks fail to redeploy.
```bash
# 1. Generate a new Gitea token at https://git.vish.gg/user/settings/applications
# Scope: read:repository
# 2. Test the token:
curl -s -o /dev/null -w "%{http_code}" \
-H "Authorization: token YOUR_NEW_TOKEN" \
"https://git.vish.gg/api/v1/repos/Vish/homelab"
# Should return 200
# 3. Update in Portainer:
curl -k -s -X PUT \
  -H "X-API-Key: REDACTED_API_KEY" \
-H "Content-Type: application/json" \
"https://192.168.0.200:9443/api/users/1/gitcredentials/1" \
-d '{"name":"portainer-homelab","username":"vish","password":"YOUR_NEW_TOKEN"}'
```
> Note: The API update may not immediately propagate to automated pulls.
> Pass credentials inline in redeploy calls to force use of the new token.
---
## Change Log
- 2026-02-27 — Incident: sanitization commit `037d766a` replaced credentials with
`REDACTED_PASSWORD` placeholders across 14 compose files. All affected containers
detected via Portainer API env scan and restored from `git show 037d766a^`. Added
Portainer Git credential rotation section above.
- 2026-02-20 — Initial creation (8 items)
- 2026-02-20 — Expanded after full private repo audit (27 items across 34 exposure categories)

# Disk Full Procedure Runbook
## Overview
This runbook provides procedures for handling disk space emergencies across all homelab hosts. It includes immediate actions, root cause analysis, and long-term solutions to prevent recurrence.
## Prerequisites
- [ ] SSH access to affected host
- [ ] Root/sudo privileges on the host
- [ ] Monitoring dashboards access
- [ ] Backup verification capability
## Metadata
- **Estimated Time**: 30-90 minutes (depending on severity)
- **Risk Level**: High (data loss possible if not handled carefully)
- **Requires Downtime**: Minimal (may need to stop services temporarily)
- **Reversible**: Partially (deleted data cannot be recovered)
- **Tested On**: 2026-02-14
## Severity Levels
| Level | Disk Usage | Action Required | Urgency |
|-------|------------|-----------------|---------|
| 🟢 **Normal** | < 80% | Monitor | Low |
| 🟡 **Warning** | 80-90% | Plan cleanup | Medium |
| 🟠 **Critical** | 90-95% | Immediate cleanup | High |
| 🔴 **Emergency** | > 95% | Emergency response | Critical |
## Quick Triage
First, determine which host and volume is affected:
```bash
# Check all hosts disk usage
ssh atlantis "df -h"
ssh calypso "df -h"
ssh concordnuc "df -h"
ssh homelab-vm "df -h"
ssh raspberry-pi-5 "df -h"
```
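The per-host `df` output can be reduced to just the volumes that cross a threshold (a triage sketch; run it on each host, or pipe `ssh host "df -P"` into the awk stage):

```shell
# Print only filesystems at or above the warning threshold.
THRESHOLD=80
df -P | awk -v t="$THRESHOLD" 'NR > 1 {
  use = $5; sub(/%/, "", use)                     # strip the % sign
  if (use + 0 >= t)
    printf "%s  %s%% used  (%s)\n", $6, use, $1   # mountpoint, usage, device
}'
```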
## Emergency Procedure (>95% Full)
### Step 1: Immediate Space Recovery
**Goal**: Free up 5-10% space immediately to prevent system issues.
```bash
# SSH to affected host
ssh [hostname]
# Identify what's consuming space
df -h
du -sh /* 2>/dev/null | sort -rh | head -20
# Quick wins - Clear Docker cache
docker system df # See what Docker is using
docker system prune -a --volumes --force # Reclaim space (BE CAREFUL!)
# This typically frees 10-50GB depending on your setup
```
**⚠️ WARNING**: `docker system prune` will remove:
- Stopped containers
- Unused networks
- Dangling images
- Build cache
- Unused volumes (with --volumes flag)
**Safer alternative** if you're unsure:
```bash
# Less aggressive - removes only stopped containers and dangling images
docker system prune --force
```
### Step 2: Clear Log Files
```bash
# Find large log files
find /var/log -type f -size +100M -exec ls -lh {} \; | sort -k 5 -rh
# Clear systemd journal (keeps last 3 days)
sudo journalctl --vacuum-time=3d
# Clear old Docker logs
sudo sh -c 'truncate -s 0 /var/lib/docker/containers/*/*-json.log'
# For Synology NAS
sudo find /volume1/@docker/containers -name "*-json.log" -size +100M -exec truncate -s 0 {} \;
```
### Step 3: Remove Old Docker Images
```bash
# List images by size
docker images --format "{{.Repository}}:{{.Tag}}\t{{.Size}}" | sort -k 2 -rh | head -20
# Remove specific old images
docker image rm [image:tag]
# Remove all unused images
docker image prune -a --force
```
### Step 4: Verify Space Recovered
```bash
# Check current usage
df -h
# Verify critical services are running
docker ps
# Check container logs for errors
docker ps --format "{{.Names}}" | xargs -I {} sh -c 'echo "=== {} ===" && docker logs --tail 20 {}'
```
## Detailed Analysis Procedure
Once immediate danger is passed, perform thorough analysis:
### Step 1: Identify Space Consumers
```bash
# Comprehensive disk usage analysis
sudo du -h --max-depth=2 / 2>/dev/null | sort -rh | head -30
# For Synology NAS specifically
sudo du -h --max-depth=2 /volume1 2>/dev/null | sort -rh | head -30
# Check Docker volumes
docker volume ls
docker system df -v
# Check specific large directories
du -sh /var/lib/docker/* | sort -rh
du -sh /volume1/docker/* | sort -rh # Synology
```
### Step 2: Analyze by Service
Create a space usage report:
```bash
# Create analysis script
cat > /tmp/analyze-space.sh << 'EOF'
#!/bin/bash
echo "=== Docker Container Volumes ==="
docker ps --format "{{.Names}}" | while read container; do
size=$(docker exec $container du -sh / 2>/dev/null | awk '{print $1}')
echo "$container: $size"
done | sort -rh
echo ""
echo "=== Docker Volumes ==="
docker volume ls --format "{{.Name}}" | while read vol; do
size=$(docker volume inspect $vol --format '{{.Mountpoint}}' | xargs sudo du -sh 2>/dev/null | awk '{print $1}')
echo "$vol: $size"
done | sort -rh
echo ""
echo "=== Log Files Over 100MB ==="
find /var/log -type f -size +100M -exec ls -lh {} \; 2>/dev/null
EOF
chmod +x /tmp/analyze-space.sh
/tmp/analyze-space.sh
```
### Step 3: Categorize Findings
Identify the primary space consumers:
| Category | Typical Culprits | Safe to Delete? |
|----------|------------------|-----------------|
| **Docker Images** | Old/unused image versions | ✅ Yes (if unused) |
| **Docker Volumes** | Database growth, media cache | ⚠️ Maybe (check first) |
| **Log Files** | Application logs, system logs | ✅ Yes (after review) |
| **Media Files** | Plex, Jellyfin transcodes | ✅ Yes (transcodes) |
| **Backups** | Old backup archives | ✅ Yes (keep recent) |
| **Application Data** | Various service data | ❌ No (review first) |
## Cleanup Strategies by Service Type
### Media Services (Plex, Jellyfin)
```bash
# Clear Plex transcode cache
docker exec plex rm -rf /transcode/*
# Clear Jellyfin transcode cache
docker exec jellyfin rm -rf /config/data/transcodes/*
# Find and remove old media previews
find /volume1/plex/Library/Application\ Support/Plex\ Media\ Server/Cache -type f -mtime +30 -delete
```
### *arr Suite (Sonarr, Radarr, etc.)
```bash
# Clear download client history and backups
docker exec sonarr find /config/Backups -mtime +30 -delete
docker exec radarr find /config/Backups -mtime +30 -delete
# Clean up old logs
docker exec sonarr find /config/logs -mtime +30 -delete
docker exec radarr find /config/logs -mtime +30 -delete
```
### Database Services (PostgreSQL, MariaDB)
```bash
# Check database size
docker exec postgres psql -U user -c "SELECT pg_database.datname, pg_size_pretty(pg_database_size(pg_database.datname)) FROM pg_database;"
# Vacuum databases (for PostgreSQL)
docker exec postgres vacuumdb -U user --all --full --analyze
# Check MariaDB size
docker exec mariadb mysql -u root -p -e "SELECT table_schema AS 'Database', ROUND(SUM(data_length + index_length) / 1024 / 1024, 2) AS 'Size (MB)' FROM information_schema.TABLES GROUP BY table_schema;"
```
### Monitoring Services (Prometheus, Grafana)
```bash
# Check Prometheus storage size
du -sh /volume1/docker/prometheus
# Prometheus retention is configured in prometheus.yml
# Default: --storage.tsdb.retention.time=15d
# Consider reducing retention if space is critical
# Clear old Grafana sessions
docker exec grafana find /var/lib/grafana/sessions -mtime +7 -delete
```
### Immich (Photo Management)
```bash
# Check Immich storage usage
docker exec immich-server df -h /usr/src/app/upload
# Immich uses a lot of space for:
# - Original photos
# - Thumbnails
# - Encoded videos
# - ML models
# Clear stale partial uploads from the staging area
# (files older than 90 days here are abandoned uploads, not library photos --
# verify the path on your install before deleting)
docker exec immich-server find /usr/src/app/upload/upload -type f -mtime +90 -delete
```
## Long-Term Solutions
### Solution 1: Configure Log Rotation
Create proper log rotation for Docker containers:
```bash
# Edit Docker daemon config
sudo nano /etc/docker/daemon.json
# Add log rotation settings
{
"log-driver": "json-file",
"log-opts": {
"max-size": "10m",
"max-file": "3"
}
}
# Restart Docker
sudo systemctl restart docker # Linux
# OR for Synology
sudo synoservicectl --restart pkgctl-Docker
```
### Solution 2: Set Up Automated Cleanup
Create a cleanup cron job:
```bash
# Create cleanup script
sudo nano /usr/local/bin/homelab-cleanup.sh
#!/bin/bash
# Homelab Automated Cleanup Script
# Remove stopped containers older than 7 days
docker container prune --filter "until=168h" --force
# Remove unused images older than 30 days
docker image prune --all --filter "until=720h" --force
# Remove unused volumes (BE CAREFUL - only if you're sure)
# docker volume prune --force
# Clear journal logs older than 7 days
journalctl --vacuum-time=7d
# Clear old backups (keep last 30 days)
find /volume1/backups -type f -mtime +30 -delete
echo "Cleanup completed at $(date)" >> /var/log/homelab-cleanup.log
# Make executable
sudo chmod +x /usr/local/bin/homelab-cleanup.sh
# Add to cron (runs weekly on Sunday at 3 AM)
(crontab -l 2>/dev/null; echo "0 3 * * 0 /usr/local/bin/homelab-cleanup.sh") | crontab -
```
### Solution 3: Configure Service-Specific Retention
Update each service with appropriate retention policies:
**Prometheus** (retention is set via command-line flags, not `prometheus.yml`):
```yaml
# In the Prometheus service definition (docker-compose):
command:
  - '--storage.tsdb.retention.time=7d'    # reduce from the 15d default if needed
  - '--storage.tsdb.retention.size=50GB'  # hard size cap
```
**Grafana** (docker-compose.yml):
```yaml
environment:
- GF_DATABASE_WAL=true
- GF_DATABASE_CLEANUP_INTERVAL=168h # Weekly cleanup
```
**Plex** (Plex settings):
- Settings → Transcoder → Transcoder temporary directory
- Settings → Scheduled Tasks → Clean Bundles (daily)
- Settings → Scheduled Tasks → Optimize Database (weekly)
### Solution 4: Monitor Disk Usage Proactively
Set up monitoring alerts in Grafana:
```yaml
# Alert rule for disk space
- alert: DiskSpaceWarning
expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100) < 20
for: 5m
labels:
severity: warning
annotations:
summary: "Disk space warning on {{ $labels.instance }}"
description: "Disk {{ $labels.mountpoint }} has less than 20% free space"
- alert: DiskSpaceCritical
expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100) < 10
for: 5m
labels:
severity: critical
annotations:
summary: "CRITICAL: Disk space on {{ $labels.instance }}"
description: "Disk {{ $labels.mountpoint }} has less than 10% free space"
```
## Host-Specific Considerations
### Atlantis (Synology DS1823xs+)
```bash
# Synology-specific cleanup
# Clear Synology logs
sudo find /var/log -name "*.log.*" -mtime +30 -delete
# Clear package logs
sudo find /var/packages/*/target/logs -name "*.log.*" -mtime +30 -delete
# Check storage pool status
sudo synostgpool --info
# DSM has built-in storage analyzer
# Control Panel → Storage Manager → Storage Analyzer
```
### Calypso (Synology DS723+)
Same as Atlantis - use Synology-specific commands.
### Concord NUC (Ubuntu)
```bash
# Ubuntu-specific cleanup
sudo apt-get clean
sudo apt-get autoclean
sudo apt-get autoremove --purge
# Clear old kernels (on current Ubuntu releases, `autoremove` above already handles
# old kernels; this legacy one-liner removes everything but the running kernel - use with care)
sudo apt-get autoremove --purge $(dpkg -l 'linux-*' | sed '/^ii/!d;/'"$(uname -r | sed "s/\(.*\)-\([^0-9]\+\)/\1/")"'/d;s/^[^ ]* [^ ]* \([^ ]*\).*/\1/;/[0-9]/!d')
# Clear thumbnail cache
rm -rf ~/.cache/thumbnails/*
```
### Homelab VM (Proxmox VM)
```bash
# VM-specific cleanup
# Clear apt cache
sudo apt-get clean
# Clear old cloud-init logs
sudo rm -rf /var/log/cloud-init*.log
# Compact QCOW2 disk (from Proxmox host)
# qemu-img convert -O qcow2 -c original.qcow2 compressed.qcow2
```
## Verification Checklist
After cleanup, verify:
- [ ] Disk usage below 80%: `df -h`
- [ ] All critical containers running: `docker ps`
- [ ] No errors in recent logs: `docker logs [container] --tail 50`
- [ ] Services accessible via web interface
- [ ] Monitoring dashboards show normal metrics
- [ ] Backup jobs can complete successfully
- [ ] Automated cleanup configured for future
## Rollback Procedure
If cleanup causes issues:
1. **Check what was deleted**: Review command history and logs
2. **Restore from backups**: If critical data was deleted
```bash
cd ~/Documents/repos/homelab
./restore.sh [backup-date]
```
3. **Recreate Docker volumes**: If volumes were accidentally pruned
4. **Restart affected services**: Redeploy from Portainer
## Troubleshooting
### Issue: Still Running Out of Space After Cleanup
**Solution**: Consider adding more storage
- Add external USB drives
- Expand existing RAID arrays
- Move services to hosts with more space
- Archive old media to cold storage
### Issue: Docker Prune Removed Important Data
**Solution**:
- Always use `--filter` to be selective
- Never use `docker volume prune` without checking first
- Keep recent backups before major cleanup operations
### Issue: Services Won't Start After Cleanup
**Solution**:
```bash
# Check for missing volumes
docker ps -a
docker volume ls
# Check logs
docker logs [container]
# Recreate volumes if needed (restore from backup)
./restore.sh [backup-date]
```
## Prevention Checklist
- [ ] Log rotation configured for all services
- [ ] Automated cleanup script running weekly
- [ ] Monitoring alerts set up for disk space
- [ ] Retention policies configured appropriately
- [ ] Regular backup verification scheduled
- [ ] Capacity planning review quarterly
## Related Documentation
- [Infrastructure Overview](../infrastructure/INFRASTRUCTURE_OVERVIEW.md)
- [Backup Strategies](../admin/backup-strategies.md)
- [Monitoring Setup](../admin/monitoring-setup.md)
- [Troubleshooting Guide](../troubleshooting/common-issues.md)
## Change Log
- 2026-02-14 - Initial creation with host-specific procedures
- 2026-02-14 - Added service-specific cleanup strategies

# Service Migration Runbook
## Overview
This runbook guides you through migrating a containerized service from one host to another in the homelab. The procedure minimizes downtime and ensures data integrity throughout the migration.
## Prerequisites
- [ ] SSH access to both source and target hosts
- [ ] Sufficient disk space on target host
- [ ] Network connectivity between hosts (Tailscale recommended)
- [ ] Service backup completed and verified
- [ ] Maintenance window scheduled (if downtime required)
- [ ] Portainer access for both hosts
## Metadata
- **Estimated Time**: 1-3 hours (depending on data size)
- **Risk Level**: Medium-High (data migration involved)
- **Requires Downtime**: Yes (typically 15-60 minutes)
- **Reversible**: Yes (can roll back to source host)
- **Tested On**: 2026-02-14
## When to Migrate Services
Common reasons for service migration:
| Scenario | Example | Recommended Target |
|----------|---------|-------------------|
| **Resource constraints** | NAS running out of CPU | Move to NUC or VM |
| **Storage constraints** | Running out of disk space | Move to larger NAS |
| **Performance issues** | High I/O affecting other services | Move to dedicated host |
| **Host consolidation** | Reducing number of active hosts | Consolidate to primary hosts |
| **Hardware maintenance** | Planned hardware upgrade | Temporary or permanent move |
| **Improved organization** | Group related services | Move to appropriate host |
## Migration Types
### Type 1: Simple Migration (Stateless Service)
- No persistent data
- Can be redeployed from scratch
- Example: Nginx, static web servers
- **Downtime**: Minimal (5-15 minutes)
### Type 2: Standard Migration (Small Data)
- Persistent data < 10GB
- Configuration and databases
- Example: Uptime Kuma, AdGuard Home
- **Downtime**: 15-30 minutes
### Type 3: Large Data Migration
- Persistent data > 10GB
- Media libraries, large databases
- Example: Plex, Immich, Jellyfin
- **Downtime**: 1-4 hours (depending on size)
## Pre-Migration Planning
### Step 1: Assess the Service
```bash
# SSH to source host
ssh [source-host]
# Identify container and volumes
docker ps | grep [service-name]
docker inspect [service-name] | grep -A 10 Mounts
# Check data size
docker exec [service-name] du -sh /config /data
# List all volumes used by service
docker volume ls | grep [service-name]
# Check volume sizes
docker system df -v | grep [service-name]
```
Document findings:
- Container name: ___________
- Image and tag: ___________
- Data size: ___________
- Volume count: ___________
- Network dependencies: ___________
- Port mappings: ___________
### Step 2: Check Target Host Capacity
```bash
# SSH to target host
ssh [target-host]
# Check available resources
df -h # Disk space
free -h # RAM
nproc # CPU cores
docker ps | wc -l # Current container count
# Check port conflicts
netstat -tlnp | grep [required-port]
```
### Step 3: Create Migration Plan
**Downtime Window**:
- Start: ___________
- End: ___________
- Duration: ___________
**Dependencies**:
- Services that depend on this: ___________
- Services this depends on: ___________
**Notification**:
- Who to notify: ___________
- When to notify: ___________
## Migration Procedure
### Method A: GitOps Migration (Recommended)
Best for: Most services with proper version control
#### Step 1: Backup Current Service
```bash
# SSH to source host
ssh [source-host]
# Create backup
docker stop [service-name]
docker export [service-name] > /tmp/[service-name]-backup.tar
# Backup volumes
for vol in $(docker volume ls -q | grep [service-name]); do
docker run --rm -v $vol:/source -v /tmp:/backup alpine tar czf /backup/$vol.tar.gz -C /source .
done
# Copy backups to safe location
scp /tmp/[service-name]*.tar* [backup-location]:~/backups/
```
#### Step 2: Export Configuration
```bash
# Get current docker-compose configuration
cd ~/Documents/repos/homelab
cat hosts/[source-host]/[service-name].yaml > /tmp/service-config.yaml
# Note environment variables
docker inspect [service-name] | grep -A 50 Env
```
#### Step 3: Copy Data to Target Host
**For Small Data (< 10GB)**: Use SCP
```bash
# From your workstation
scp -r [source-host]:/volume1/docker/[service-name] /tmp/
scp -r /tmp/[service-name] [target-host]:/path/to/docker/
```
**For Large Data (> 10GB)**: Use Rsync
```bash
# From source host to target host via Tailscale
ssh [source-host]
rsync -avz --progress /volume1/docker/[service-name]/ \
[target-host-tailscale-ip]:/path/to/docker/[service-name]/
# Monitor progress
watch -n 5 'du -sh /path/to/docker/[service-name]'
```
**For Very Large Data (> 100GB)**: Consider physical transfer
```bash
# Copy to USB drive, physically move, then copy to target
# Or use network-attached storage as intermediate
```
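To decide between network and physical transfer, a rough time estimate helps. A back-of-envelope sketch (the size and throughput numbers are assumptions; substitute your own):

```shell
#!/bin/sh
# Rough transfer-time estimate: GB * 8000 / Mbps = seconds.
SIZE_GB=500      # data to move (assumed)
SPEED_MBPS=400   # sustained throughput over the link (assumed)
awk -v gb="$SIZE_GB" -v mbps="$SPEED_MBPS" \
  'BEGIN { printf "~%.1f hours\n", gb * 8000 / mbps / 3600 }'
```

At 500 GB over a 400 Mbit/s link this prints `~2.8 hours`; a USB 3 drive plus a walk across the room often wins well before the 100 GB mark.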
#### Step 4: Stop Service on Source Host
```bash
# SSH to source host
ssh [source-host]
# Stop the container
docker stop [service-name]
# Verify it's stopped
docker ps -a | grep [service-name]
```
#### Step 5: Update Git Configuration
```bash
# On your workstation
cd ~/Documents/repos/homelab
# Move service definition to new host
git mv hosts/[source-host]/[service-name].yaml \
hosts/[target-host]/[service-name].yaml
# Update paths in the configuration file if needed
nano hosts/[target-host]/[service-name].yaml
# Update volume paths for target host
# Atlantis/Calypso: /volume1/docker/[service-name]
# NUC/VM: /home/user/docker/[service-name]
# Raspberry Pi: /home/pi/docker/[service-name]
# Commit changes
git add hosts/[target-host]/[service-name].yaml
git commit -m "Migrate [service-name] from [source-host] to [target-host]
- Move service configuration
- Update volume paths for target host
- Migration date: $(date +%Y-%m-%d)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>"
git push origin main
```
#### Step 6: Deploy on Target Host
**Via Portainer UI**:
1. Open Portainer → Select target host endpoint
2. Go to **Stacks****Add stack****Git Repository**
3. Configure:
- Repository URL: Your git repository
- Compose path: `hosts/[target-host]/[service-name].yaml`
- Enable GitOps (optional)
4. Click **Deploy the stack**
**Via GitOps Auto-Sync**:
- Wait 5-10 minutes for automatic deployment
- Monitor Portainer for new stack appearance
#### Step 7: Verify Migration
```bash
# SSH to target host
ssh [target-host]
# Check container is running
docker ps | grep [service-name]
# Check logs for errors
docker logs [service-name] --tail 100
# Test service accessibility
curl http://localhost:[port] # Internal
curl https://[service].vish.gg # External (if applicable)
# Verify data integrity
docker exec [service-name] ls -lah /config
docker exec [service-name] ls -lah /data
# Check resource usage
docker stats [service-name] --no-stream
```
#### Step 8: Update DNS/Reverse Proxy (If Applicable)
```bash
# Update Nginx Proxy Manager or reverse proxy configuration
# Point [service].vish.gg to new host IP
# Update Cloudflare DNS if using Cloudflare Tunnels
# Update local DNS (AdGuard Home) if applicable
```
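After updating DNS, confirm that clients actually resolve the service name to the new host before cleaning up the source. A small sketch (the hostname and IP here are placeholders):

```shell
#!/bin/sh
# Confirm the service hostname resolves to the target host's IP.
SERVICE_HOST="service.vish.gg"   # placeholder: migrated service hostname
EXPECTED_IP="192.168.1.50"       # placeholder: new host's IP

resolved=$(getent hosts "$SERVICE_HOST" | awk '{ print $1; exit }')
if [ "$resolved" = "$EXPECTED_IP" ]; then
  echo "DNS OK: $SERVICE_HOST -> $resolved"
else
  echo "DNS mismatch: got '${resolved:-nothing}', expected $EXPECTED_IP"
fi
```

Remember that AdGuard Home and client resolvers cache entries, so a stale answer here may just mean waiting out the TTL.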
#### Step 9: Remove from Source Host
**Only after verifying target is working correctly!**
```bash
# SSH to source host
ssh [source-host]
# Remove container and volumes
docker stop [service-name]
docker rm [service-name]
# Optional: Remove volumes (only if data copied successfully)
# docker volume rm $(docker volume ls -q | grep [service-name])
# Remove data directory
rm -rf /volume1/docker/[service-name] # BE CAREFUL!
# Remove from Portainer if manually managed
# Portainer UI → Stacks → Remove stack
```
### Method B: Manual Export/Import
Best for: Quick migrations without git changes, or when testing
#### Step 1: Stop and Export
```bash
# SSH to source host
ssh [source-host]
# Stop service
docker stop [service-name]
# Export container and volumes
docker run --rm \
-v [service-name]_data:/source \
-v /tmp:/backup \
alpine tar czf /backup/[service-name]-data.tar.gz -C /source .
# Export configuration
docker inspect [service-name] > /tmp/[service-name]-config.json
```
#### Step 2: Transfer to Target
```bash
# Copy data to target host
scp /tmp/[service-name]-data.tar.gz [target-host]:/tmp/
scp /tmp/[service-name]-config.json [target-host]:/tmp/
```
#### Step 3: Import on Target
```bash
# SSH to target host
ssh [target-host]
# Create volume
docker volume create [service-name]_data
# Import data
docker run --rm \
-v [service-name]_data:/target \
-v /tmp:/backup \
alpine tar xzf /backup/[service-name]-data.tar.gz -C /target
# Create and start container using saved configuration
# Adjust paths and ports as needed
docker create --name [service-name] \
[options-from-config.json] \
[image:tag]
docker start [service-name]
```
## Post-Migration Tasks
### Update Documentation
```bash
# Update service inventory
nano docs/services/VERIFIED_SERVICE_INVENTORY.md
# Update the host column for migrated service
# | Service | Host | Port | URL | Status |
# | Service | [NEW-HOST] | 8080 | https://service.vish.gg | ✅ Active |
```
### Update Monitoring
```bash
# Update Prometheus configuration if needed
nano prometheus/prometheus.yml
# Update target host IP for scraped metrics
# Restart Prometheus if configuration changed
```
### Test Backups
```bash
# Verify backups work on new host
./backup.sh --test
# Ensure service data is included in backup
ls -lah /path/to/backups/[service-name]
```
### Performance Baseline
```bash
# Document baseline performance on new host
docker stats [service-name] --no-stream
# Monitor for 24 hours to ensure stability
```
## Verification Checklist
- [ ] Service running on target host: `docker ps`
- [ ] All data migrated correctly
- [ ] Configuration preserved
- [ ] Logs show no errors: `docker logs [service]`
- [ ] External access works (if applicable)
- [ ] Internal service connectivity works
- [ ] Reverse proxy updated (if applicable)
- [ ] DNS records updated (if applicable)
- [ ] Monitoring updated
- [ ] Documentation updated
- [ ] Backups include new location
- [ ] Old host cleaned up
- [ ] Users notified of any URL changes
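Several of the checks above can be probed from a shell before ticking them off. A minimal reachability sketch (host and port are placeholders for the migrated service):

```shell
#!/bin/sh
# Probe the migrated service's HTTP port before signing off on the checklist.
HOST=localhost
PORT=8080   # placeholder: the migrated service's port
if curl -fsS -o /dev/null --max-time 5 "http://$HOST:$PORT/"; then
  echo "OK: $HOST:$PORT responded"
else
  echo "FAIL: $HOST:$PORT not reachable - check 'docker ps' and port mappings"
fi
```

Running this from both the target host and a client machine catches firewall and reverse-proxy gaps that a local-only check would miss.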
## Rollback Procedure
If migration fails or causes issues:
### Quick Rollback (Within 24 hours)
```bash
# SSH to source host
ssh [source-host]
# Restore from backup
docker import /tmp/[service-name]-backup.tar [service-name]:backup
# Or redeploy from git (revert git changes)
cd ~/Documents/repos/homelab
git revert HEAD
git push origin main
# Restart service on source host
# Via Portainer or:
docker start [service-name]
```
### Full Rollback (After cleanup)
```bash
# Restore from backup
./restore.sh [backup-date]
# Redeploy to original host
# Follow original deployment procedure
```
## Troubleshooting
### Issue: Data Transfer Very Slow
**Symptoms**: Rsync taking hours for moderate data
**Solutions**:
```bash
# Use compression for better network performance
rsync -avz --compress-level=6 --progress /source/ [target]:/dest/
# Or use parallel transfer tools
# Install: sudo apt-get install parallel
find /source -type f | parallel -j 4 scp {} [target]:/dest/{}
# For extremely large transfers, consider:
# 1. Physical USB drive transfer
# 2. NFS mount between hosts
# 3. Transfer during off-peak hours
```
### Issue: Service Won't Start on Target Host
**Symptoms**: Container starts then immediately exits
**Solutions**:
```bash
# Check logs
docker logs [service-name]
# Common issues:
# 1. Path issues - Update volume paths in compose file
# 2. Permission issues - Check PUID/PGID
# 3. Port conflicts - Check if port already in use
# 4. Missing dependencies - Ensure all required services running
# Fix permissions
docker exec [service-name] chown -R 1000:1000 /config /data
```
### Issue: Lost Configuration Data
**Symptoms**: Service starts but settings are default
**Solutions**:
```bash
# Check if volumes mounted correctly
docker inspect [service-name] | grep -A 10 Mounts
# Restore configuration from backup
docker stop [service-name]
docker run --rm -v [service-name]_config:/target -v /tmp:/backup alpine \
tar xzf /backup/config-backup.tar.gz -C /target
docker start [service-name]
```
### Issue: Network Connectivity Problems
**Symptoms**: Service can't reach other services
**Solutions**:
```bash
# Check network configuration
docker network ls
docker network inspect [network-name]
# Add service to required networks
docker network connect [network-name] [service-name]
# Verify DNS resolution
docker exec [service-name] ping [other-service]
```
## Migration Examples
### Example 1: Migrate Uptime Kuma from Calypso to Homelab VM
```bash
# 1. Backup on Calypso
ssh calypso
docker stop uptime-kuma
tar czf /tmp/uptime-kuma-data.tar.gz /volume1/docker/uptime-kuma
# 2. Transfer
scp /tmp/uptime-kuma-data.tar.gz homelab-vm:/tmp/
# 3. Update git
cd ~/Documents/repos/homelab
git mv hosts/synology/calypso/uptime-kuma.yaml \
hosts/vms/homelab-vm/uptime-kuma.yaml
# Update paths in file
sed -i 's|/volume1/docker/uptime-kuma|/home/user/docker/uptime-kuma|g' \
hosts/vms/homelab-vm/uptime-kuma.yaml
# 4. Deploy on target
git add . && git commit -m "Migrate Uptime Kuma to Homelab VM" && git push
# 5. Verify and cleanup Calypso
```
### Example 2: Migrate AdGuard Home between Hosts
```bash
# AdGuard Home requires DNS configuration updates
# 1. Note current DNS settings on clients
# 2. Migrate service (as above)
# 3. Update client DNS to point to new host IP
# 4. Test DNS resolution from clients
```
## Related Documentation
- [Add New Service](add-new-service.md)
- [Infrastructure Overview](../infrastructure/INFRASTRUCTURE_OVERVIEW.md)
- [Backup Strategies](../admin/backup-strategies.md)
- [Deployment Workflow](../admin/DEPLOYMENT_WORKFLOW.md)
## Change Log
- 2026-02-14 - Initial creation with multiple migration methods
- 2026-02-14 - Added large data migration strategies

# Synology DSM Upgrade Runbook
## Overview
This runbook provides a safe procedure for upgrading DiskStation Manager (DSM) on Synology NAS devices (Atlantis DS1823xs+ and Calypso DS723+). The procedure minimizes downtime and ensures data integrity during major and minor DSM upgrades.
## Prerequisites
- [ ] DSM admin credentials
- [ ] Complete backup of NAS (HyperBackup or external)
- [ ] Backup verification completed
- [ ] List of installed packages and their versions
- [ ] SSH access to NAS (for troubleshooting)
- [ ] Maintenance window scheduled (1-3 hours)
- [ ] All Docker containers documented and backed up
- [ ] Tailscale or alternative remote access configured
## Metadata
- **Estimated Time**: 1-3 hours (including backups and verification)
- **Risk Level**: Medium-High (system-level upgrade)
- **Requires Downtime**: Yes (30-60 minutes for upgrade itself)
- **Reversible**: Limited (can rollback but complicated)
- **Tested On**: 2026-02-14
## Upgrade Types
| Type | Example | Risk | Downtime | Reversibility |
|------|---------|------|----------|---------------|
| **Patch Update** | 7.2.1 → 7.2.2 | Low | 15-30 min | Easy |
| **Minor Update** | 7.2 → 7.3 | Medium | 30-60 min | Moderate |
| **Major Update** | 7.x → 8.0 | High | 60-120 min | Difficult |
## Pre-Upgrade Planning
### Step 1: Check Compatibility
Before upgrading, verify compatibility:
```bash
# SSH to NAS
ssh admin@atlantis # or calypso
# Check current DSM version
cat /etc.defaults/VERSION
# Check hardware compatibility
# Visit: https://www.synology.com/en-us/dsm
# Verify your model supports the target DSM version
# Check RAM requirements (DSM 7.2+ needs at least 1GB)
free -h
# Check disk space (need at least 5GB free in system partition)
df -h
```
### Step 2: Document Current State
Create a pre-upgrade snapshot of your configuration:
```bash
# Document installed packages
# DSM UI → Package Center → Installed
# Take screenshot or note down:
# - Package names and versions
# - Custom configurations
# Export Docker Compose files (already in git)
cd ~/Documents/repos/homelab
git status # Ensure all configs are committed
# Document running containers
ssh atlantis "docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}' > /volume1/docker/pre-upgrade-containers.txt"
ssh calypso "docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}' > /volume1/docker/pre-upgrade-containers.txt"
# Export package list
ssh atlantis "synopkg list > /volume1/docker/pre-upgrade-packages.txt"
ssh calypso "synopkg list > /volume1/docker/pre-upgrade-packages.txt"
```
### Step 3: Backup Everything
**Critical**: Complete a full backup before proceeding.
```bash
# 1. Backup via HyperBackup (if configured)
# DSM UI → HyperBackup → Backup Now
# 2. Export DSM configuration
# DSM UI → Control Panel → Update & Restore → Configuration Backup → Back Up Configuration
# 3. Backup Docker volumes
cd ~/Documents/repos/homelab
./backup.sh
# 4. Snapshot (if using Btrfs)
# Storage Manager → Storage Pool → Snapshots → Take Snapshot
# 5. Verify backups
ls -lh /volume1/backups/
# Ensure backup completed successfully
```
### Step 4: Notify Users
If other users rely on your homelab:
```bash
# Send notification (via your notification system)
curl -H "Title: Scheduled Maintenance" \
-H "Priority: high" \
-H "Tags: maintenance" \
-d "DSM upgrade scheduled for [DATE/TIME]. Services will be unavailable for approximately 1-2 hours." \
https://ntfy.sh/REDACTED_TOPIC
# Or send notification via Signal/Discord/etc.
```
### Step 5: Plan Rollback Strategy
Document your rollback plan:
- [ ] Backup location verified: ___________
- [ ] Restore procedure tested: Yes/No
- [ ] Alternative access method ready (direct keyboard/monitor)
- [ ] Support contact available if needed
## Upgrade Procedure
### Step 1: Download DSM Update
**Option A: Via DSM UI (Recommended)**
1. Log in to DSM web interface
2. **Control Panel****Update & Restore**
3. **DSM Update** tab
4. If update available, click **Download** (don't install yet)
5. Wait for download to complete
6. Read release notes carefully
**Option B: Manual Download**
1. Visit Synology Download Center
2. Find your model (DS1823xs+ or DS723+)
3. Download appropriate DSM version
4. Upload via DSM → **Manual DSM Update**
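When downloading the `.pat` file manually, verify its checksum against the value listed on the Download Center page before uploading. The filename below is illustrative:

```shell
#!/bin/sh
# Verify the downloaded DSM package against the published MD5
# (filename is illustrative - use the file you actually downloaded).
md5sum DSM_DS1823xs+_build.pat
```

Compare the printed digest with the MD5 shown next to the download link; a mismatch means a corrupted or tampered file that should never be installed.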
### Step 2: Prepare for Downtime
```bash
# Stop non-critical Docker containers (optional, reduces memory pressure)
ssh atlantis
docker stop $(docker ps -q --filter "name=pattern") # Stop specific containers
# Or stop all non-critical containers
# Review which containers can be safely stopped
docker ps
docker stop container1 container2 container3
# Leave critical services running:
# - Portainer (for post-upgrade management)
# - Monitoring (to track upgrade progress)
# - Core network services (AdGuard, VPN if critical)
```
### Step 3: Initiate Upgrade
**Via DSM UI**:
1. **Control Panel****Update & Restore****DSM Update**
2. Click **Update Now**
3. Review release notes and warnings
4. Check **Yes, I understand I need to perform a backup before updating DSM**
5. Click **OK** to start
**Via SSH** (advanced, not recommended unless necessary):
```bash
# SSH to NAS
ssh admin@atlantis
# Start upgrade manually
sudo synoupgrade --start /volume1/@tmp/upd@te/update.pat
# Monitor progress
tail -f /var/log/messages
```
### Step 4: Monitor Upgrade Progress
During upgrade, you'll see:
1. **Checking system**: Verifying prerequisites
2. **Downloading**: If not pre-downloaded
3. **Installing**: Actual upgrade process (30-45 minutes)
4. **Optimizing system**: Post-install tasks
5. **Reboot**: System will restart
**Monitoring via SSH** (if you have access during upgrade):
```bash
# Watch upgrade progress
tail -f /var/log/upgrade.log
# Or watch system messages
tail -f /var/log/messages | grep -i upgrade
```
**Expected timeline**:
- Preparation: 5-10 minutes
- Installation: 30-45 minutes
- First reboot: 3-5 minutes
- Optimization: 10-20 minutes
- Final reboot: 3-5 minutes
- **Total**: 60-90 minutes
### Step 5: Wait for Completion
**⚠️ IMPORTANT**: Do not power off or interrupt the upgrade!
Signs of normal upgrade:
- DSM UI becomes inaccessible
- NAS may beep once (starting upgrade)
- Disk lights active
- NAS will reboot 1-2 times
- Final beep when complete
### Step 6: First Login After Upgrade
1. Wait for NAS to complete all restarts
2. Access DSM UI (may take 5-10 minutes after last reboot)
3. Log in with admin credentials
4. You may see "Optimization in progress" - this is normal
5. Review the "What's New" page
6. Accept any new terms/agreements
## Post-Upgrade Verification
### Step 1: Verify System Health
```bash
# SSH to NAS
ssh admin@atlantis
# Check DSM version
cat /etc.defaults/VERSION
# Should show new version
# Check system status
sudo syno_disk_check
# Check RAID status
cat /proc/mdstat
# Check disk health
sudo smartctl -a /dev/sda
# Verify storage pools
synospace --get
```
Via DSM UI:
- **Storage Manager** → Verify all pools are "Healthy"
- **Resource Monitor** → Check CPU, RAM, network
- **Log Center** → Review any errors during upgrade
### Step 2: Verify Packages
```bash
# Check all packages are running
synopkg list
# Compare with pre-upgrade package list
diff /volume1/docker/pre-upgrade-packages.txt <(synopkg list)
# Start any stopped packages
# DSM UI → Package Center → Installed
# Check each package, start if needed
```
Common packages to verify:
- [ ] Docker
- [ ] Synology Drive
- [ ] Hyper Backup
- [ ] Snapshot Replication
- [ ] Any other installed packages
### Step 3: Verify Docker Containers
```bash
# SSH to NAS
ssh atlantis
# Check Docker is running
docker --version
docker info
# Check all containers
docker ps -a
# Compare with pre-upgrade state
diff /volume1/docker/pre-upgrade-containers.txt <(docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}')
# Start stopped containers
docker start $(docker ps -a -q -f status=exited)
# Check container logs for errors
docker ps --format "{{.Names}}" | xargs -I {} sh -c 'echo "=== {} ===" && docker logs --tail 20 {}'
```
### Step 4: Test Key Services
Verify critical services are working:
```bash
# Test network connectivity
ping -c 4 8.8.8.8
curl -I https://google.com
# Test Docker networking
docker exec [container] ping -c 2 8.8.8.8
# Test Portainer access
curl http://localhost:9000
# Test Plex
curl http://localhost:32400/web
# Test monitoring
curl http://localhost:3000 # Grafana
curl http://localhost:9090 # Prometheus
```
Via browser:
- [ ] Portainer accessible
- [ ] Grafana dashboards loading
- [ ] Plex/Jellyfin streaming works
- [ ] File shares accessible
- [ ] SSO (Authentik) working
### Step 5: Verify Scheduled Tasks
```bash
# Check cron jobs
crontab -l
# Via DSM UI
# Control Panel → Task Scheduler
# Verify all tasks are enabled
```
### Step 6: Test Remote Access
- [ ] Tailscale VPN working
- [ ] External access via domain (if configured)
- [ ] SSH access working
- [ ] Mobile app access working (DS File, DS Photo, etc.)
## Post-Upgrade Optimization
### Step 1: Update Packages
After DSM upgrade, packages may need updates:
1. **Package Center****Update** tab
2. Update available packages
3. Prioritize critical packages:
- Docker (if updated)
- Surveillance Station (if used)
- Drive, Office, etc.
### Step 2: Review New Features
DSM upgrades often include new features:
1. Review "What's New" page
2. Check for new security features
3. Review changed settings
4. Update documentation if needed
### Step 3: Re-enable Auto-Updates (if disabled)
```bash
# Via DSM UI
# Control Panel → Update & Restore → DSM Update
# Check "Notify me when DSM updates are available"
# Or "Install latest DSM updates automatically" (if you trust auto-updates)
```
### Step 4: Update Documentation
```bash
cd ~/Documents/repos/homelab
# Update infrastructure docs
nano docs/infrastructure/INFRASTRUCTURE_OVERVIEW.md
# Note DSM version upgrade
# Document any configuration changes
# Update troubleshooting docs if procedures changed
git add .
git commit -m "Update docs: DSM upgraded to X.X on Atlantis/Calypso"
git push
```
## Troubleshooting
### Issue: Upgrade Fails or Stalls
**Symptoms**: Progress bar stuck, no activity for >30 minutes
**Solutions**:
```bash
# If you have SSH access:
ssh admin@atlantis
# Check if upgrade process is running
ps aux | grep -i upgrade
# Check system logs
tail -100 /var/log/messages
tail -100 /var/log/upgrade.log
# Check disk space
df -h
# If completely stuck (>1 hour no progress):
# 1. Do NOT force reboot unless absolutely necessary
# 2. Contact Synology support first
# 3. As last resort, force reboot via physical button
```
### Issue: NAS Won't Boot After Upgrade
**Symptoms**: Cannot access DSM UI, NAS beeping continuously
**Solutions**:
1. **Check beep pattern** (indicates specific error)
- 1 beep: Normal boot
- 3 beeps: RAM issue
- 4 beeps: Disk issue
- Continuous: Critical failure
2. **Try Safe Mode**:
- Power off NAS
- Hold reset button
- Power on while holding reset
- Hold for 4 seconds until beep
- Release and wait for boot
3. **Check via Synology Assistant**:
- Download Synology Assistant on PC
- Scan network for NAS
- May show recovery mode option
4. **Last resort: Reinstall DSM**:
- Download latest DSM .pat file
- Access via http://[nas-ip]:5000
- Install DSM (will not erase data)
### Issue: Docker Not Working After Upgrade
**Symptoms**: Docker containers won't start, Docker package shows stopped
**Solutions**:
```bash
# SSH to NAS
ssh admin@atlantis
# Check Docker status
sudo synoservicectl --status pkgctl-Docker
# Restart Docker
sudo synoservicectl --restart pkgctl-Docker
# If Docker won't start, check logs
cat /var/log/docker.log
# Reinstall Docker package (preserves volumes)
# Via DSM UI → Package Center → Docker → Uninstall
# Then reinstall Docker
# Your volumes and data will be preserved
```
### Issue: Network Shares Not Accessible
**Symptoms**: Can't connect to SMB/NFS shares
**Solutions**:
```bash
# Check share services
sudo synoservicectl --status smbd # SMB
sudo synoservicectl --status nfsd # NFS
# Restart services
sudo synoservicectl --restart smbd
sudo synoservicectl --restart nfsd
# Check firewall
# Control Panel → Security → Firewall
# Ensure file sharing ports allowed
```
### Issue: Performance Degradation After Upgrade
**Symptoms**: Slow response, high CPU/RAM usage
**Solutions**:
```bash
# Check what's using resources
top
htop # If installed
# Via DSM UI → Resource Monitor
# Identify resource-hungry processes
# Common causes:
# 1. Indexing in progress (Photos, Drive, Universal Search)
# - Wait for indexing to complete (can take hours)
# 2. Optimization running
# - Check: ps aux | grep optimize
# - Let it complete
# 3. Too many containers started at once
# - Stagger container startup
```
## Rollback Procedure
⚠️ **WARNING**: Rollback is complex and risky. Only attempt if absolutely necessary.
### Method 1: DSM Archive (If Available)
```bash
# SSH to NAS
ssh admin@atlantis
# Check if previous DSM version archived
ls -la /volume1/@appstore/
# If archive exists, you can attempt rollback
# CAUTION: This is not officially supported and may cause data loss
```
### Method 2: Restore from Backup
If upgrade caused critical issues:
1. REDACTED_APP_PASSWORD
2. Restore from HyperBackup
3. Or restore from configuration backup:
- **Control Panel** → **Update & Restore**
- **Configuration Backup** → **Restore**
### Method 3: Fresh Install (Nuclear Option)
⚠️ **DANGER**: This will erase everything. Only for catastrophic failure.
1. Download previous DSM version
2. Install via Synology Assistant in "Recovery Mode"
3. Restore from complete backup
4. Reconfigure everything
## Best Practices
### Timing
- Schedule upgrades during low-usage periods
- Allow 3-4 hour maintenance window
- Don't upgrade before important events
- Wait 2-4 weeks after major DSM release (let others find bugs)
### Testing
- If you have 2 NAS units, upgrade one first
- Test on less critical NAS before primary
- Read community forums for known issues
- Review Synology release notes thoroughly
### Preparation
- Always complete full backup
- Test backup restore before upgrade
- Document all configurations
- Have physical access to NAS if possible
- Keep Synology Assistant installed on PC
### Post-Upgrade
- Monitor closely for 24-48 hours
- Check logs daily for first week
- Report any bugs to Synology
- Update your documentation
## Verification Checklist
- [ ] DSM upgraded to target version
- [ ] All storage pools healthy
- [ ] All packages running
- [ ] All Docker containers running
- [ ] Network shares accessible
- [ ] Remote access working (Tailscale, QuickConnect)
- [ ] Scheduled tasks running
- [ ] Monitoring dashboards functional
- [ ] Backups completing successfully
- [ ] No errors in system logs
- [ ] Performance normal
- [ ] Documentation updated
## Related Documentation
- [Infrastructure Overview](../infrastructure/INFRASTRUCTURE_OVERVIEW.md)
- [Emergency Access Guide](../troubleshooting/EMERGENCY_ACCESS_GUIDE.md)
- [Disaster Recovery](../troubleshooting/disaster-recovery.md)
- [Synology Disaster Recovery](../troubleshooting/synology-disaster-recovery.md)
- [Backup Strategies](../admin/backup-strategies.md)
## Additional Resources
- [Synology DSM Release Notes](https://www.synology.com/en-us/releaseNote/DSM)
- [Synology Community Forums](https://community.synology.com/)
- [Synology Knowledge Base](https://kb.synology.com/)
## Change Log
- 2026-02-14 - Initial creation
- 2026-02-14 - Added comprehensive troubleshooting and rollback procedures