Sanitized mirror from private repository - 2026-04-18 11:19:59 UTC
Commit fb00a325d1 (Gitea Mirror Bot): 1418 changed files with 359990 additions and 0 deletions

docs/runbooks/README.md
# Homelab Operational Runbooks
This directory contains step-by-step operational runbooks for common homelab management tasks. Each runbook provides clear procedures, prerequisites, and rollback steps.
## 📚 Available Runbooks
### Service Management
- **[Add New Service](add-new-service.md)** - Deploy new containerized services via GitOps
- **[Service Migration](service-migration.md)** - Move services between hosts safely
- **[Add New User](add-new-user.md)** - Onboard new users with proper access
### Infrastructure Maintenance
- **[Disk Full Procedure](disk-full-procedure.md)** - Handle full disk scenarios
- **[Certificate Renewal](certificate-renewal.md)** - Manage SSL/TLS certificates
- **[Synology DSM Upgrade](synology-dsm-upgrade.md)** - Safely upgrade NAS firmware
### Security
- **[Credential Rotation](credential-rotation.md)** - Rotate exposed or compromised credentials
## 🎯 How to Use These Runbooks
### Runbook Format
Each runbook follows a standard format:
1. **Overview** - What this procedure accomplishes
2. **Prerequisites** - What you need before starting
3. **Estimated Time** - How long it typically takes
4. **Risk Level** - Low/Medium/High impact assessment
5. **Procedure** - Step-by-step instructions
6. **Verification** - How to confirm success
7. **Rollback** - How to undo if something goes wrong
8. **Troubleshooting** - Common issues and solutions
### When to Use Runbooks
- **Planned Maintenance** - Follow runbooks during scheduled maintenance windows
- **Incident Response** - Use as quick reference during outages
- **Training** - Onboard new admins with documented procedures
- **Automation** - Use as basis for creating automated scripts
### Best Practices
- ✅ Always read the entire runbook before starting
- ✅ Have a rollback plan ready
- ✅ Test in development/staging when possible
- ✅ Take snapshots/backups before major changes
- ✅ Document any deviations from the runbook
- ✅ Update runbooks when procedures change
## 🚨 Emergency Procedures
For emergency situations, refer to:
- [Emergency Access Guide](../troubleshooting/EMERGENCY_ACCESS_GUIDE.md)
- [Recovery Guide](../troubleshooting/RECOVERY_GUIDE.md)
- [Disaster Recovery](../troubleshooting/disaster-recovery.md)
## 📋 Runbook Maintenance
### Contributing
When you discover a new procedure or improvement:
1. Create a new runbook using the template below
2. Follow the standard format
3. Include real examples from your infrastructure
4. Test the procedure before documenting
### Runbook Template
````markdown
# [Procedure Name]
## Overview
Brief description of what this accomplishes and when to use it.
## Prerequisites
- [ ] Required access/credentials
- [ ] Required tools/software
- [ ] Required knowledge/skills
## Metadata
- **Estimated Time**: X minutes/hours
- **Risk Level**: Low/Medium/High
- **Requires Downtime**: Yes/No
- **Reversible**: Yes/No
- **Tested On**: Date last tested
## Procedure
### Step 1: [Action]
Detailed instructions...
```bash
# Example commands
```
Expected output:
```
Example of what you should see
```
### Step 2: [Next Action]
Continue...
## Verification
How to confirm the procedure succeeded:
- [ ] Verification step 1
- [ ] Verification step 2
## Rollback Procedure
If something goes wrong:
1. Step to undo changes
2. How to restore previous state
## Troubleshooting
**Issue**: Common problem
**Solution**: How to fix it
## Related Documentation
- [Link to related doc](path)
## Change Log
- YYYY-MM-DD - Initial creation
- YYYY-MM-DD - Updated for new procedure
````
## 📞 Getting Help
If a runbook is unclear or doesn't work as expected:
1. Check the troubleshooting section
2. Refer to related documentation links
3. Review the homelab monitoring dashboards
4. Consult the [Infrastructure Overview](../infrastructure/INFRASTRUCTURE_OVERVIEW.md)
## 📊 Runbook Status
| Runbook | Status | Last Updated | Tested On |
|---------|--------|--------------|-----------|
| Add New Service | ✅ Active | 2026-02-14 | 2026-02-14 |
| Service Migration | ✅ Active | 2026-02-14 | 2026-02-14 |
| Add New User | ✅ Active | 2026-02-14 | 2026-02-14 |
| Disk Full Procedure | ✅ Active | 2026-02-14 | 2026-02-14 |
| Certificate Renewal | ✅ Active | 2026-02-14 | 2026-02-14 |
| Synology DSM Upgrade | ✅ Active | 2026-02-14 | 2026-02-14 |
| Credential Rotation | ✅ Active | 2026-02-20 | — |
---
**Last Updated**: 2026-02-14

docs/runbooks/add-new-service.md
# Add New Service Runbook
This runbook walks through a **clean, tested path** for adding a new service to the homelab using GitOps with Portainer.
> ⚠️ **Prerequisites**: CI runner access, SSH to target hosts, SSO admin privilege.
## 1. Prepare Compose File
```bash
# Generate a minimal stack template
../scripts/ci/workflows/gen-template.py --service myservice
```
Adjust `docker-compose.yml`:
- Image name
- Ports
- Environment variables
- Healthcheck
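As a sketch, the adjusted `docker-compose.yml` might look like the fragment below. The image, port, and healthcheck endpoint are placeholders; substitute your service's real values:

```yaml
services:
  myservice:
    image: ghcr.io/example/myservice:1.0.0   # pin a version, avoid :latest
    ports:
      - "8080:8080"
    environment:
      - TZ=Etc/UTC
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 5s
      retries: 3
    restart: unless-stopped
```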
## 2. Validate Configuration
```bash
docker compose -f docker-compose.yml config > /tmp/merged.yml
# Validate against OpenAPI specs if needed
```
## 3. Commit Locally
```bash
git add docker/compose/*.yml
git commit -m "Add myservice stack"
```
## 4. Push to Remote & Trigger GitOps
```bash
git push origin main
```
The Portainer EE GitOps agent will automatically deploy the stack. Monitor it via the Portainer UI or the Portainer REST API.
## 5. Post-Deployment Verification
| Check | Command | Expected Result |
|-------|---------|-----------------|
| Service Running | `docker ps --filter "name=myservice"` | One container running |
| Health Endpoint | `curl http://localhost:8080/health` | 200 OK |
| Logs | `docker logs myservice` | No fatal errors |
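The health-endpoint check above can be wrapped in a small polling helper so verification survives a slow container startup. A minimal sketch; the URL and retry count are examples:

```bash
# wait_for_health URL [RETRIES] -- poll until curl succeeds or retries run out
wait_for_health() {
  url="$1"; tries="${2:-10}"; i=0
  while [ "$i" -lt "$tries" ]; do
    if curl -fsS -o /dev/null "$url"; then
      echo "healthy: $url"
      return 0
    fi
    i=$((i + 1)); sleep 2
  done
  echo "unhealthy after $tries attempts: $url" >&2
  return 1
}

# Example (endpoint is illustrative):
# wait_for_health http://localhost:8080/health 10
```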
## 6. Update Documentation
1. Add entry to `docs/services/VERIFIED_SERVICE_INVENTORY.md`.
2. Create a quickstart guide in `docs/services/<service>/README.md`.
3. Publish to the shared wiki.
## 7. Optional Terraform Sync
If the service also needs infra changes (e.g., new VM), update the Terraform modules under `infra/` and run `terragrunt run-all apply`.
---
**Gotchas**
- *Race conditions*: rebase onto `origin/main` before pushing so the GitOps agent deploys a single coherent commit.
- *Healthcheck failures*: check the Portainer Events view for the failing container.
- *Secrets*: store them in Vault and reference them in the compose `secrets` section; never commit them to the repo.

docs/runbooks/add-new-user.md
# Add New User Runbook
## Overview
This runbook provides a comprehensive procedure for onboarding new users to the homelab, including network access, service authentication, and permission management. It ensures users get appropriate access while maintaining security.
## Prerequisites
- [ ] User's full name and email address
- [ ] Desired username (lowercase, no spaces)
- [ ] Access level determined (read-only, standard, admin)
- [ ] Required services identified
- [ ] Admin access to all relevant systems
- [ ] Authentik admin access (for SSO services)
- [ ] Tailscale admin access (for VPN)
- [ ] Synology admin access (for file shares)
## Metadata
- **Estimated Time**: 30-60 minutes
- **Risk Level**: Low (proper access controls in place)
- **Requires Downtime**: No
- **Reversible**: Yes (can remove user access)
- **Tested On**: 2026-02-14
## User Access Levels
| Level | Description | Typical Use Case | Services |
|-------|-------------|------------------|----------|
| **Guest** | Read-only, limited services | Family, friends | Plex, Jellyfin |
| **Standard** | Read/write, most services | Family members | Media + storage |
| **Power User** | Advanced services | Tech-savvy users | Dev tools, monitoring |
| **Admin** | Full access, can manage | Co-admins, yourself | Everything + admin panels |
## Pre-Onboarding Checklist
### Step 1: Gather Information
Create a user profile document:
```markdown
# New User: [Name]
**Username**: [username]
**Email**: [email@domain.com]
**Access Level**: [Guest/Standard/Power User/Admin]
**Start Date**: [YYYY-MM-DD]
## Services Requested:
- [ ] Plex/Jellyfin (Media streaming)
- [ ] File Shares (NAS access)
- [ ] Immich (Photo backup)
- [ ] Paperless (Document management)
- [ ] Development tools (Gitea, etc.)
- [ ] Monitoring dashboards
- [ ] Other: ___________
## Access Requirements:
- [ ] Remote access (Tailscale VPN)
- [ ] Local network only
- [ ] Mobile apps
- [ ] Web browser only
## Notes:
[Any special requirements or restrictions]
```
### Step 2: Plan Access
Determine which systems need accounts:
- [ ] **Tailscale** (VPN access to homelab)
- [ ] **Authentik** (SSO for web services)
- [ ] **Synology NAS** (File shares - Atlantis/Calypso)
- [ ] **Plex** (Media streaming)
- [ ] **Jellyfin** (Alternative media)
- [ ] **Immich** (Photo management)
- [ ] **Portainer** (Container management - admin only)
- [ ] **Grafana** (Monitoring - admin/power user)
- [ ] **Other services**: ___________
## User Onboarding Procedure
### Step 1: Create Tailscale Access
**Why First**: Tailscale provides secure remote access to the homelab network.
1. **Invite via Tailscale Admin Console**:
- Go to https://login.tailscale.com/admin/settings/users
- Click **Invite Users**
- Enter user's email
- Set expiration (optional)
- Click **Send Invite**
2. **User receives email**:
- User clicks invitation link
- Creates Tailscale account
- Installs Tailscale app on their device(s)
- Connects to your tailnet
3. **Configure ACLs** (if needed):
```json
// In Tailscale Admin Console → Access Controls
{
"acls": [
// Existing ACLs...
{
"action": "accept",
"src": ["user@email.com"],
"dst": [
"atlantis:*", // Allow access to Atlantis
"calypso:*", // Allow access to Calypso
"homelab-vm:*" // Allow access to VM
]
}
]
}
```
4. **Test connectivity**:
```bash
# Ask user to test
ping atlantis.your-tailnet.ts.net
curl http://atlantis.your-tailnet.ts.net:9000 # Test Portainer
```
### Step 2: Create Authentik Account (SSO)
**Purpose**: Single sign-on for most web services.
1. **Access Authentik Admin**:
- Navigate to your Authentik instance
- Log in as admin
2. **Create User**:
- **Directory** → **Users** → **Create**
- Fill in:
- **Username**: `username` (lowercase)
- **Name**: `First Last`
- **Email**: `user@email.com`
- **Groups**: Add to appropriate groups
- `homelab-users` (standard access)
- `homelab-admins` (for admin users)
- Service-specific groups (e.g., `jellyfin-users`)
3. **Set Password**:
- Option A: Set temporary password, force change on first login
- Option B: Send password reset link via email
4. **Assign Service Access**:
- **Applications** → **Outposts**
- For each service the user should access:
- Edit application
- Add user/group to **Policy Bindings**
5. **Test SSO**:
```bash
# User should test login to SSO-enabled services
# Example: Grafana, Jellyseerr, etc.
```
### Step 3: Create Synology NAS Account
**Purpose**: Access to file shares, Photos, Drive, etc.
#### On Atlantis (Primary NAS):
```bash
# SSH to Atlantis
ssh admin@atlantis
# Create user (DSM 7.x)
# Via DSM UI (recommended):
```
1. **Control Panel** → **User & Group** → **User** → **Create**
2. Fill in:
- **Name**: `username`
- **Description**: `[Full Name]`
- **Email**: `user@email.com`
- **Password**: Set strong password
3. **Join Groups**:
- `users` (default)
- `http` (if web service access needed)
4. **Configure Permissions**:
- **Applications** tab:
- [ ] Synology Photos (if needed)
- [ ] Synology Drive (if needed)
- [ ] File Station
- [ ] Other apps as needed
- **Shared Folders** tab:
- Set permissions for each share:
- Read/Write: For shares user can modify
- Read-only: For media libraries
- No access: For restricted folders
5. **User Quotas** (optional):
- Set storage quota if needed
- Limit upload/download speed if needed
6. **Click Create**
#### On Calypso (Secondary NAS):
Repeat the same process if user needs access to Calypso.
**Alternative: SSH Method**:
```bash
# Create user via command line
sudo synouser --add username "Full Name" "password" "user@email.com" 0 "" 0
# Add to groups
sudo synogroup --member users username add
# Set folder permissions (example)
sudo chown -R username:users /volume1/homes/username
```
### Step 4: Create Plex Account
**Option A: Managed User (Recommended for Family)**
1. Open Plex Web
2. **Settings** → **Users & Sharing** → **Manage Home Users**
3. Click **Add User**
4. Set:
- **Username**: `[Name]`
- **PIN**: 4-digit PIN
- Enable **Managed user** if restricted access desired
5. Configure library access
**Option B: Plex Account (For External Users)**
1. User creates their own Plex account
2. **Settings** → **Users & Sharing** → **Friends**
3. Invite by email
4. Select libraries to share
5. Configure restrictions:
- [ ] Allow sync
- [ ] Allow camera upload
- [ ] Rating restrictions (if children)
### Step 5: Create Jellyfin Account
```bash
# SSH to host running Jellyfin
ssh atlantis # or wherever Jellyfin runs
# Or via web UI:
```
1. Open Jellyfin web interface
2. **Dashboard** → **Users** → **Add User**
3. Set:
- **Name**: `username`
- **Password**: REDACTED_PASSWORD password
4. Configure:
- **Library access**: Select which libraries
- **Permissions**:
- [ ] Allow media deletion
- [ ] Allow remote access
- [ ] Enable live TV (if applicable)
5. **Save**
### Step 6: Create Immich Account (If Used)
```bash
# Via Immich web interface
```
1. Open Immich
2. **Administration** → **Users** → **Create User**
3. Set:
- **Email**: `user@email.com`
- **Password**: REDACTED_PASSWORD password
- **Name**: `Full Name`
4. User logs in and sets up mobile app
### Step 7: Grant Service-Specific Access
#### Gitea (Development)
1. Gitea web interface
2. **Site Administration** → **User Accounts** → **Create User Account**
3. Fill in details
4. Add to appropriate organizations/teams
#### Portainer (Admin/Power Users Only)
1. Portainer web interface
2. **Users** → **Add user**
3. Set:
- **Username**: `username`
- **Password**: REDACTED_PASSWORD password
4. Assign role:
- **Administrator**: Full access
- **Operator**: Can manage containers
- **User**: Read-only
5. Assign to teams/endpoints
#### Grafana (Monitoring)
If using Authentik SSO, the user automatically gets access.
If not using SSO:
1. Grafana web interface
2. **Configuration** → **Users** → **Invite**
3. Set role:
- **Viewer**: Read-only dashboards
- **Editor**: Can create dashboards
- **Admin**: Full access
### Step 8: Configure Mobile Apps
Provide user with setup instructions:
**Plex**:
- Download Plex app
- Sign in with Plex account
- Server should auto-discover via Tailscale
**Jellyfin**:
- Download Jellyfin app
- Add server: `http://atlantis.tailnet:8096`
- Sign in with credentials
**Immich** (if used):
- Download Immich app
- Server: `http://atlantis.tailnet:2283`
- Enable auto-backup (optional)
**Synology Apps**:
- DS File (file access)
- Synology Photos
- DS Audio/Video
- Server: `atlantis.tailnet` or QuickConnect ID
**Tailscale**:
- Already installed in Step 1
- Ensure "Always On VPN" enabled for seamless access
## User Documentation Package
Provide new user with documentation:
```markdown
# Welcome to the Homelab!
Hi [Name],
Your access has been set up. Here's what you need to know:
## Network Access
**Tailscale VPN**:
- Install Tailscale from: https://tailscale.com/download
- Log in with your account (check email for invitation)
- Connect to our tailnet
- You can now access services remotely!
## Available Services
### Media Streaming
- **Plex**: https://plex.vish.gg
- Username: [plex-username]
- Watch movies, TV shows, music
- **Jellyfin**: https://jellyfin.vish.gg
- Username: [username]
- Alternative media server
### File Storage
- **Atlantis NAS**: smb://atlantis.tailnet/[your-folder]
- Access via file explorer
- Windows: \\atlantis.tailnet\folder
- Mac: smb://atlantis.tailnet/folder
### Photos
- **Immich**: https://immich.vish.gg
- Auto-backup from your phone
- Private photo storage
### Other Services
- [List other services user has access to]
## Support
If you need help:
- Email: [your-email]
- [Alternative contact method]
## Security
- Don't share passwords
- Enable 2FA where available
- Report any suspicious activity
Welcome aboard!
```
## Post-Onboarding Tasks
### Step 1: Update Documentation
```bash
cd ~/Documents/repos/homelab
# Update user access documentation
nano docs/infrastructure/USER_ACCESS_GUIDE.md
# Add user to list:
# | Username | Access Level | Services | Status |
# | username | Standard | Plex, Files, Photos | ✅ Active |
git add .
git commit -m "Add new user: [username]"
git push
```
### Step 2: Test User Access
Verify everything works:
- [ ] User can connect via Tailscale
- [ ] User can access Plex/Jellyfin
- [ ] User can access file shares
- [ ] SSO login works
- [ ] Mobile apps working
- [ ] No access to restricted services
### Step 3: Monitor Usage
```bash
# Check user activity after a few days
# Grafana dashboards should show:
# - Network traffic from user's IP
# - Service access logs
# - Any errors
# Review logs
ssh atlantis
grep username /var/log/auth.log # SSH attempts
docker logs plex | grep username # Plex usage
```
## Verification Checklist
- [ ] Tailscale invitation sent and accepted
- [ ] Authentik account created and tested
- [ ] Synology NAS account created (Atlantis/Calypso)
- [ ] Plex/Jellyfin access granted
- [ ] Required service accounts created
- [ ] Mobile apps configured and tested
- [ ] User documentation sent
- [ ] User confirmed access is working
- [ ] Documentation updated
- [ ] No access to restricted services
## User Removal Procedure
When user no longer needs access:
### Step 1: Disable Accounts
```bash
# Disable in order of security priority:
# 1. Tailscale
# Admin Console → Users → [user] → Revoke keys
# 2. Authentik
# Directory → Users → [user] → Deactivate
# 3. Synology NAS
# Control Panel → User & Group → [user] → Disable
# Or via SSH:
sudo synouser --disable username
# 4. Plex
# Settings → Users & Sharing → Remove user
# 5. Jellyfin
# Dashboard → Users → [user] → Delete
# 6. Other services
# Remove from each service individually
```
### Step 2: Archive User Data (Optional)
```bash
# Backup user's data before deleting
# Synology home folder:
tar czf /volume1/backups/user-archives/username-$(date +%Y%m%d).tar.gz \
/volume1/homes/username
# User's Immich photos (if applicable)
# User's documents (if applicable)
```
### Step 3: Delete User
After confirming data is backed up:
```bash
# Synology: Delete user
# Control Panel → User & Group → [user] → Delete
# Choose whether to keep or delete user's data
# Or via SSH:
sudo synouser --del username
sudo rm -rf /volume1/homes/username # If deleting data
```
### Step 4: Update Documentation
```bash
# Update user access guide
nano docs/infrastructure/USER_ACCESS_GUIDE.md
# Mark user as removed with date
git add .
git commit -m "Remove user: [username] - access terminated [date]"
git push
```
## Troubleshooting
### Issue: User Can't Connect via Tailscale
**Solutions**:
- Verify invitation was accepted
- Check user installed Tailscale correctly
- Verify ACLs allow user's device
- Check user's device firewall
- Try: `tailscale ping atlantis`
### Issue: SSO Login Not Working
**Solutions**:
- Verify Authentik account is active
- Check user is in correct groups
- Verify application is assigned to user
- Clear browser cookies
- Try incognito mode
- Check Authentik logs
### Issue: Can't Access File Shares
**Solutions**:
```bash
# Check Synology user exists and is enabled
ssh atlantis
sudo synouser --get username
# Check folder permissions
ls -la /volume1/homes/username
# Check SMB service is running
sudo synoservicectl --status smbd
# Test from user's machine:
smbclient -L atlantis.tailnet -U username
```
### Issue: Plex Not Showing Up for User
**Solutions**:
- Verify user accepted Plex sharing invitation
- Check library access permissions
- Verify user's account email is correct
- Try removing and re-adding the user
- Check Plex server accessibility
## Best Practices
### Security
- Use strong passwords (12+ characters, mixed case, numbers, symbols)
- Enable 2FA where available (Authentik supports it)
- Least privilege principle (only grant needed access)
- Regular access reviews (quarterly)
- Disable accounts promptly when not needed
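For the password guidance above, one quick sketch: generate a random credential with `openssl`, which is already present on most of these hosts:

```bash
# 24 random bytes, base64-encoded: a 32-character password
openssl rand -base64 24
```

Store the result in a password manager rather than reusing it across services.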
### Documentation
- Keep user list up to date
- Document special access grants
- Note user role changes
- Archive user data before deletion
### Communication
- Set clear expectations with users
- Provide good documentation
- Be responsive to access issues
- Notify users of maintenance windows
## Related Documentation
- [Infrastructure Overview](../infrastructure/INFRASTRUCTURE_OVERVIEW.md)
- [User Access Guide](../infrastructure/USER_ACCESS_GUIDE.md)
- [SSH Access Guide](../infrastructure/SSH_ACCESS_GUIDE.md)
- [Authentik SSO Setup](../infrastructure/authentik-sso.md)
- [Security Guidelines](../infrastructure/security.md)
## Change Log
- 2026-02-14 - Initial creation
- 2026-02-14 - Added comprehensive onboarding and offboarding procedures

docs/runbooks/certificate-renewal.md
# SSL/TLS Certificate Renewal Runbook
## Overview
This runbook covers SSL/TLS certificate management across the homelab, including Let's Encrypt certificates, Cloudflare Origin certificates, and self-signed certificates. It provides procedures for manual renewal, troubleshooting auto-renewal, and emergency certificate fixes.
## Prerequisites
- [ ] SSH access to relevant hosts
- [ ] Cloudflare account access (if using Cloudflare)
- [ ] Domain DNS control
- [ ] Root/sudo privileges on hosts
- [ ] Backup of current certificates
## Metadata
- **Estimated Time**: 15-45 minutes
- **Risk Level**: Medium (service downtime if misconfigured)
- **Requires Downtime**: Minimal (few seconds during reload)
- **Reversible**: Yes (can restore old certificates)
- **Tested On**: 2026-02-14
## Certificate Types in Homelab
| Type | Used For | Renewal Method | Expiration |
|------|----------|----------------|------------|
| **Let's Encrypt** | Public-facing services | Certbot auto-renewal | 90 days |
| **Cloudflare Origin** | Services behind Cloudflare Tunnel | Manual/Cloudflare dashboard | 15 years |
| **Synology Certificates** | Synology DSM, services | Synology DSM auto-renewal | 90 days |
| **Self-Signed** | Internal/dev services | Manual generation | As configured |
## Certificate Inventory
Document your current certificates:
```bash
# Check Let's Encrypt certificates (on Linux hosts)
sudo certbot certificates
# Check Synology certificates
# DSM UI → Control Panel → Security → Certificate
# Or SSH:
sudo cat /usr/syno/etc/certificate/_archive/*/cert.pem | openssl x509 -text -noout
# Check certificate expiration for any domain
echo | openssl s_client -servername service.vish.gg -connect service.vish.gg:443 2>/dev/null | openssl x509 -noout -dates
# Check all certificates at once
for domain in st.vish.gg gf.vish.gg mx.vish.gg; do
echo "=== $domain ==="
echo | timeout 5 openssl s_client -servername $domain -connect $domain:443 2>/dev/null | openssl x509 -noout -dates
echo
done
```
Create inventory:
```markdown
| Domain | Type | Expiry Date | Auto-Renew | Status |
|--------|------|-------------|------------|--------|
| vish.gg | Let's Encrypt | 2026-05-15 | ✅ Yes | ✅ Valid |
| st.vish.gg | Let's Encrypt | 2026-05-15 | ✅ Yes | ✅ Valid |
| gf.vish.gg | Let's Encrypt | 2026-05-15 | ✅ Yes | ✅ Valid |
```
## Let's Encrypt Certificate Renewal
### Automatic Renewal (Certbot)
Let's Encrypt certificates should auto-renew. Check the renewal setup:
```bash
# Check certbot timer status (systemd)
sudo systemctl status certbot.timer
# Check cron job (if using cron)
sudo crontab -l | grep certbot
# Test renewal (dry-run, doesn't actually renew)
sudo certbot renew --dry-run
# Expected output:
# Congratulations, all simulated renewals succeeded
```
### Manual Renewal
If auto-renewal fails or you need to renew manually:
```bash
# Renew all certificates
sudo certbot renew
# Renew specific certificate
sudo certbot renew --cert-name vish.gg
# Force renewal (even if not expired)
sudo certbot renew --force-renewal
# Renew with verbose output for troubleshooting
sudo certbot renew --verbose
```
After renewal, reload web servers:
```bash
# Nginx
sudo nginx -t # Test configuration
sudo systemctl reload nginx
# Apache
sudo apachectl configtest
sudo systemctl reload apache2
```
### Let's Encrypt with Nginx Proxy Manager
If using Nginx Proxy Manager (NPM):
1. Open NPM UI (typically port 81)
2. Go to **SSL Certificates** tab
3. Certificates should auto-renew 30 days before expiry
4. To force renewal:
- Click the certificate
- Click **Renew** button
5. No service reload needed (NPM handles it)
## Synology Certificate Renewal
### Automatic Renewal on Synology NAS
```bash
# SSH to Synology NAS (Atlantis or Calypso)
ssh atlantis # or calypso
# Check certificate status
sudo /usr/syno/sbin/syno-letsencrypt list
# Force renewal check
sudo /usr/syno/sbin/syno-letsencrypt renew-all
# Check renewal logs
sudo cat /var/log/letsencrypt/letsencrypt.log
# Verify certificate expiry
sudo openssl x509 -in /usr/syno/etc/certificate/system/default/cert.pem -text -noout | grep "Not After"
```
### Via Synology DSM UI
1. Log in to DSM
2. **Control Panel** → **Security** → **Certificate**
3. Select certificate → Click **Renew**
4. DSM will automatically renew and apply
5. No manual reload needed
### Synology Certificate Configuration
Enable auto-renewal in DSM:
1. **Control Panel** → **Security** → **Certificate**
2. Click **Settings** button
3. Check **Auto-renew certificate**
4. Synology will renew 30 days before expiry
## Stoatchat Certificates (Gaming VPS)
The Stoatchat gaming server uses Let's Encrypt with Certbot:
```bash
# SSH to gaming VPS
ssh root@gaming-vps
# Check certificates
sudo certbot certificates
# Domains covered:
# - st.vish.gg
# - api.st.vish.gg
# - events.st.vish.gg
# - files.st.vish.gg
# - proxy.st.vish.gg
# - voice.st.vish.gg
# Renew all
sudo certbot renew
# Reload Nginx
sudo systemctl reload nginx
```
Auto-renewal cron:
```bash
# Check certbot timer
sudo systemctl status certbot.timer
# Or check cron
sudo crontab -l | grep certbot
```
## Cloudflare Origin Certificates
For services using Cloudflare Tunnel:
### Generate New Origin Certificate
1. Log in to Cloudflare Dashboard
2. Select domain (vish.gg)
3. **SSL/TLS** → **Origin Server**
4. Click **Create Certificate**
5. Configure:
- **Private key type**: RSA (2048)
- **Hostnames**: *.vish.gg, vish.gg
- **Certificate validity**: 15 years
6. Copy certificate and private key
7. Save to secure location
### Install Origin Certificate
```bash
# SSH to target host
ssh [host]
# Create certificate files
sudo nano /etc/ssl/cloudflare/cert.pem
# Paste certificate
sudo nano /etc/ssl/cloudflare/key.pem
# Paste private key
# Set permissions
sudo chmod 644 /etc/ssl/cloudflare/cert.pem
sudo chmod 600 /etc/ssl/cloudflare/key.pem
# Update Nginx configuration
sudo nano /etc/nginx/sites-available/[service]
# Use new certificate
ssl_certificate /etc/ssl/cloudflare/cert.pem;
ssl_certificate_key /etc/ssl/cloudflare/key.pem;
# Test and reload
sudo nginx -t
sudo systemctl reload nginx
```
## Self-Signed Certificates (Internal/Dev)
For internal-only services not exposed publicly:
### Generate Self-Signed Certificate
```bash
# Generate 10-year self-signed certificate
sudo openssl req -x509 -nodes -days 3650 -newkey rsa:2048 \
-keyout /etc/ssl/private/selfsigned.key \
-out /etc/ssl/certs/selfsigned.crt \
-subj "/C=US/ST=State/L=City/O=Homelab/CN=internal.vish.local"
# Generate with SAN (Subject Alternative Names) for multiple domains
sudo openssl req -x509 -nodes -days 3650 -newkey rsa:2048 \
-keyout /etc/ssl/private/selfsigned.key \
-out /etc/ssl/certs/selfsigned.crt \
-subj "/C=US/ST=State/L=City/O=Homelab/CN=*.vish.local" \
-addext "subjectAltName=DNS:*.vish.local,DNS:vish.local"
# Set permissions
sudo chmod 600 /etc/ssl/private/selfsigned.key
sudo chmod 644 /etc/ssl/certs/selfsigned.crt
```
### Install in Services
Update Docker Compose to mount certificates:
```yaml
services:
service:
volumes:
- /etc/ssl/certs/selfsigned.crt:/etc/ssl/certs/cert.pem:ro
- /etc/ssl/private/selfsigned.key:/etc/ssl/private/key.pem:ro
```
## Monitoring Certificate Expiration
### Set Up Expiration Alerts
Create a certificate monitoring script:
```bash
sudo nano /usr/local/bin/check-certificates.sh
```
```bash
#!/bin/bash
# Certificate Expiration Monitoring Script
DOMAINS=(
"vish.gg"
"st.vish.gg"
"gf.vish.gg"
"mx.vish.gg"
)
ALERT_DAYS=30 # Alert if expiring within 30 days
WEBHOOK_URL="https://ntfy.sh/REDACTED_TOPIC" # Your notification webhook
for domain in "${DOMAINS[@]}"; do
echo "Checking $domain..."
# Get certificate expiration date
expiry=$(echo | openssl s_client -servername $domain -connect $domain:443 2>/dev/null | \
openssl x509 -noout -dates | grep "notAfter" | cut -d= -f2)
# Convert to epoch time
expiry_epoch=$(date -d "$expiry" +%s)
current_epoch=$(date +%s)
days_left=$(( ($expiry_epoch - $current_epoch) / 86400 ))
echo "$domain expires in $days_left days"
if [ $days_left -lt $ALERT_DAYS ]; then
# Send alert
curl -H "Title: Certificate Expiring Soon" \
-H "Priority: high" \
-H "Tags: warning,certificate" \
-d "Certificate for $domain expires in $days_left days!" \
$WEBHOOK_URL
echo "⚠️ Alert sent for $domain"
fi
echo
done
```
Make executable and add to cron:
```bash
sudo chmod +x /usr/local/bin/check-certificates.sh
# Add to cron (daily at 9 AM)
(crontab -l 2>/dev/null; echo "0 9 * * * /usr/local/bin/check-certificates.sh") | crontab -
```
### Grafana Dashboard
Add certificate monitoring to Grafana:
```yaml
# Install blackbox_exporter for HTTPS probing
# Add to prometheus.yml:
scrape_configs:
- job_name: 'blackbox'
metrics_path: /probe
params:
module: [http_2xx]
static_configs:
- targets:
- https://vish.gg
- https://st.vish.gg
- https://gf.vish.gg
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: blackbox-exporter:9115
# Create alert rule:
- alert: SSLCertificateExpiring
expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
labels:
severity: warning
annotations:
summary: "SSL certificate expiring soon"
description: "SSL certificate for {{ $labels.instance }} expires in {{ $value | REDACTED_APP_PASSWORD }}"
```
## Troubleshooting
### Issue: Certbot Renewal Failing
**Symptoms**: `certbot renew` fails with DNS or HTTP challenge errors
**Solutions**:
```bash
# Check detailed error logs
sudo certbot renew --verbose
# Common issues:
# 1. Port 80/443 not accessible
sudo ufw status # Check firewall
sudo netstat -tlnp | grep :80 # Check if port is listening
# 2. DNS not resolving correctly
dig vish.gg # Verify DNS points to correct IP
# 3. Rate limits hit
# Let's Encrypt has rate limits: 50 certificates per domain per week
# Wait 7 days or use --staging for testing
# 4. Webroot path incorrect
sudo certbot renew --webroot -w /var/www/html
# 5. Try force renewal with different challenge
sudo certbot renew --force-renewal --preferred-challenges dns
```
### Issue: Certificate Valid But Browser Shows Warning
**Symptoms**: Certificate is valid but browsers show security warning
**Solutions**:
```bash
# Check certificate chain
openssl s_client -connect vish.gg:443 -showcerts
# Ensure intermediate certificates are included
# Nginx: Use fullchain.pem, not cert.pem
ssl_certificate /etc/letsencrypt/live/vish.gg/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/vish.gg/privkey.pem;
# Test SSL configuration
curl -I https://vish.gg
# Or use: https://www.ssllabs.com/ssltest/
```
### Issue: Synology Certificate Not Auto-Renewing
**Symptoms**: DSM certificate expired or shows renewal error
**Solutions**:
```bash
# SSH to Synology
ssh atlantis
# Check renewal logs
sudo cat /var/log/letsencrypt/letsencrypt.log
# Common issues:
# 1. Port 80 forwarding
# Ensure port 80 is forwarded to NAS during renewal
# 2. Domain validation
# Check DNS points to correct external IP
# 3. Force renewal
sudo /usr/syno/sbin/syno-letsencrypt renew-all
# 4. Restart certificate service
sudo synosystemctl restart nginx
```
### Issue: Nginx Won't Reload After Certificate Update
**Symptoms**: `nginx -t` shows SSL errors
**Solutions**:
```bash
# Test Nginx configuration
sudo nginx -t
# Common errors:
# 1. Certificate path incorrect
# Fix: Update nginx config with correct path
# 2. Certificate and key mismatch
# Verify:
sudo openssl x509 -noout -modulus -in cert.pem | openssl md5
sudo openssl rsa -noout -modulus -in key.pem | openssl md5
# MD5 sums should match
# 3. Permission issues
sudo chmod 644 /etc/ssl/certs/cert.pem
sudo chmod 600 /etc/ssl/private/key.pem
sudo chown root:root /etc/ssl/certs/cert.pem /etc/ssl/private/key.pem
# 4. SELinux blocking (if enabled)
sudo setsebool -P httpd_read_user_content 1
```
## Emergency Certificate Fix
If a certificate expires and services are down:
### Quick Fix: Use Self-Signed Temporarily
```bash
# Generate emergency self-signed certificate
sudo openssl req -x509 -nodes -days 30 -newkey rsa:2048 \
-keyout /tmp/emergency.key \
-out /tmp/emergency.crt \
-subj "/CN=*.vish.gg"
# Update Nginx to use emergency cert
sudo nano /etc/nginx/sites-available/default
ssl_certificate /tmp/emergency.crt;
ssl_certificate_key /tmp/emergency.key;
# Reload Nginx
sudo nginx -t && sudo systemctl reload nginx
# Services are now accessible (with browser warning)
# Then fix proper certificate renewal
```
### Restore from Backup
```bash
# If certificates were backed up
sudo cp /backup/letsencrypt/archive/vish.gg/* /etc/letsencrypt/archive/vish.gg/
# Update symlinks
sudo certbot certificates # Shows current status
sudo certbot install --cert-name vish.gg
```
## Best Practices
### Renewal Schedule
- certbot renews Let's Encrypt certificates at day 60 of their 90-day lifetime (30 days before expiry)
- Check certificates monthly
- Set up expiration alerts
- Test renewal process quarterly
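The monthly check can be scripted with `openssl x509 -checkend`. A minimal sketch against a local certificate file (the path and 30-day threshold are examples; adjust per host):

```shell
#!/bin/sh
# Warn when a certificate file is within WARN_DAYS of expiry.
CERT=/etc/letsencrypt/live/vish.gg/cert.pem   # adjust path per host
WARN_DAYS=30
# -checkend returns 0 if the cert is still valid after the given seconds
if openssl x509 -noout -checkend $(( WARN_DAYS * 86400 )) -in "$CERT" 2>/dev/null; then
  echo "OK: $CERT valid for more than $WARN_DAYS days"
else
  echo "WARNING: $CERT missing or expires within $WARN_DAYS days"
fi
```

Run it from cron and route WARNING lines to your notification service (e.g., Gotify).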
### Backup Certificates
```bash
# Backup Let's Encrypt certificates
sudo tar czf ~/letsencrypt-backup-$(date +%Y%m%d).tar.gz /etc/letsencrypt/
# Backup Synology certificates
# Done via Synology backup tasks
# Store backups securely (encrypted, off-site)
```
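Before trusting a backup, spot-check that the archive actually contains the live certificates (a sketch assuming the backup layout produced above):

```shell
# Confirm the newest archive contains the live fullchain.
BACKUP=$(ls -t ~/letsencrypt-backup-*.tar.gz 2>/dev/null | head -1)
if [ -n "$BACKUP" ] && tar tzf "$BACKUP" | grep -q 'live/vish.gg/fullchain.pem'; then
  echo "OK: $BACKUP contains live certs"
else
  echo "WARNING: no backup found or live certs missing"
fi
```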
### Documentation
- Document which certificates are used where
- Keep inventory of expiration dates
- Document renewal procedures
- Note any special configurations
## Verification Checklist
After certificate renewal:
- [ ] Certificate renewed successfully
- [ ] Certificate expiry date extended
- [ ] Web servers reloaded without errors
- [ ] All services accessible via HTTPS
- [ ] No browser security warnings
- [ ] Certificate chain complete
- [ ] Auto-renewal still enabled
- [ ] Monitoring updated (if needed)
## Related Documentation
- [Infrastructure Overview](../infrastructure/INFRASTRUCTURE_OVERVIEW.md)
- [Nginx Configuration](../infrastructure/networking.md)
- [Cloudflare Tunnels Setup](../infrastructure/cloudflare-tunnels-setup.md)
- [Emergency Access Guide](../troubleshooting/EMERGENCY_ACCESS_GUIDE.md)
## Change Log
- 2026-02-14 - Initial creation
- 2026-02-14 - Added monitoring and troubleshooting sections

# Credential Rotation Runbook
## Overview
Step-by-step rotation procedures for all credentials exposed in the
`homelab-optimized` public mirror (audited 2026-02-20). Work through each
section in priority order. After updating secrets in compose files, commit
and push — GitOps will redeploy automatically.
> **Note:** Almost all of these stem from the same root cause — secrets were
> hard-coded in compose files, then those files were committed to git, then
> `generate_service_docs.py` and wiki-upload scripts duplicated those secrets
> into documentation, creating 35× copies of every secret across the repo.
> See the "Going Forward" section for how to prevent this.
## Prerequisites
- [ ] SSH / Tailscale access to Atlantis, Calypso, Homelab VM, Seattle VM, matrix-ubuntu-vm
- [ ] Gitea admin access (`git.vish.gg`)
- [ ] Authentik admin access
- [ ] Google account access (Gmail app passwords)
- [ ] Cloudflare dashboard access
- [ ] OpenAI platform access
- [ ] Write access to this repository
## Metadata
- **Estimated Time**: 4-6 hours
- **Risk Level**: Medium (service restarts required for most items)
- **Requires Downtime**: Brief per-service restart only
- **Reversible**: Yes (old values can be restored if something breaks)
- **Last Updated**: 2026-02-20
---
## Priority 1 — Rotate Immediately (Externally Usable Tokens)
### 1. Gitea API Tokens
Two tokens hard-coded across scripts and docs.
#### 1a. Wiki/scripts token (`77e3ddaf...`)
**Files to update:**
- `scripts/cleanup-gitea-wiki.sh`
- `scripts/upload-all-docs-to-gitea-wiki.sh`
- `scripts/upload-to-gitea-wiki.sh`
- `scripts/create-clean-organized-wiki.sh`
- `scripts/upload-organized-wiki.sh`
- `docs/admin/DOCUMENTATION_MAINTENANCE_GUIDE.md`
```bash
# 1. Go to https://git.vish.gg/user/settings/applications
# 2. Revoke the token starting 77e3ddaf
# 3. Generate new token, name: homelab-wiki, scope: repo
# 4. Replace in all files:
NEW_TOKEN=REDACTED_TOKEN
for f in scripts/cleanup-gitea-wiki.sh \
scripts/upload-all-docs-to-gitea-wiki.sh \
scripts/upload-to-gitea-wiki.sh \
scripts/create-clean-organized-wiki.sh \
scripts/upload-organized-wiki.sh \
docs/admin/DOCUMENTATION_MAINTENANCE_GUIDE.md; do
sed -i "s/REDACTED_GITEA_TOKEN/$NEW_TOKEN/g" "$f"
done
```
#### 1b. Retro-site clone token (`52fa6ccb...`)
**File:** `Calypso/retro-site.yaml` and `hosts/synology/calypso/retro-site.yaml`
```bash
# 1. Go to https://git.vish.gg/user/settings/applications
# 2. Revoke the token starting 52fa6ccb
# 3. Generate new token, name: retro-site-deploy, scope: repo:read
# 4. Update the git clone URL in both compose files
# Consider switching to a deploy key for least-privilege access
```
---
### 2. Cloudflare API Token (`FGXlHM7doB8Z...`)
Appears in 13 files including active dynamic DNS updaters on multiple hosts.
**Files to update (active deployments):**
- `hosts/synology/atlantis/dynamicdnsupdater.yaml`
- `hosts/physical/guava/portainer_yaml/dynamic_dns.yaml`
- `hosts/physical/concord-nuc/dyndns_updater.yaml`
- Various Calypso/homelab-vm DDNS configs
**Files to sanitize (docs):**
- `docs/infrastructure/cloudflare-dns.md`
- `docs/infrastructure/npm-migration-jan2026.md`
- Any `docs/services/individual/ddns-*.md` files
```bash
# 1. Go to https://dash.cloudflare.com/profile/api-tokens
# 2. Find the token (FGXlHM7doB8Z...) and click Revoke
# 3. Create a new token: use "Edit zone DNS" template, scope to your zone only
# 4. Replace in all compose files above
# 5. Replace hardcoded value in docs with: YOUR_CLOUDFLARE_API_TOKEN
# Verify DDNS containers restart and can still update DNS:
docker logs cloudflare-ddns --tail 20
```
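Before wiring the new token into the DDNS containers, it can be sanity-checked against Cloudflare's token-verify endpoint (requires outbound network; the token value below is a placeholder):

```shell
# Cloudflare returns {"result":{"status":"active"},...} for a valid token.
NEW_CF_TOKEN=your_new_token_here   # placeholder -- paste the new token
resp=$(curl -s -H "Authorization: Bearer $NEW_CF_TOKEN" \
  "https://api.cloudflare.com/client/v4/user/tokens/verify")
if echo "$resp" | grep -q '"status":"active"'; then
  echo "token OK"
else
  echo "token NOT active: $resp"
fi
```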
---
### 3. OpenAI API Key (`sk-proj-C_IYp6io...`)
**Files to update:**
- `hosts/vms/homelab-vm/hoarder.yaml`
- `docs/services/individual/web.md` (replace with placeholder)
```bash
# 1. Go to https://platform.openai.com/api-keys
# 2. Delete the exposed key
# 3. Create a new key, set a usage limit
# 4. Update OPENAI_API_KEY in hoarder.yaml
# 5. Replace value in docs with: YOUR_OPENAI_API_KEY
```
---
## Priority 2 — OAuth / SSO Secrets
### 4. Grafana ↔ Authentik OAuth Secret
**Files to update:**
- `hosts/vms/homelab-vm/monitoring.yaml`
- `hosts/synology/atlantis/grafana.yml`
- `docs/infrastructure/authentik-sso.md` (replace with placeholder)
- `docs/services/individual/grafana-oauth.md` (replace with placeholder)
```bash
# 1. Log into Authentik admin: https://auth.vish.gg/if/admin/
# 2. Applications → Providers → find Grafana OAuth2 provider
# 3. Edit → regenerate Client Secret → copy both Client ID and Secret
# 4. Update in both compose files:
# GF_AUTH_GENERIC_OAUTH_CLIENT_ID: NEW_ID
# GF_AUTH_GENERIC_OAUTH_CLIENT_SECRET: NEW_SECRET
# 5. Commit and push — both Grafana stacks restart automatically
# Verify SSO works after restart:
curl -I https://gf.vish.gg
```
---
### 5. Seafile ↔ Authentik OAuth Secret
**Files to update:**
- `hosts/synology/calypso/seafile-oauth-config.py`
- `docs/services/individual/seafile-oauth.md` (replace with placeholder)
```bash
# 1. Log into Authentik admin
# 2. Applications → Providers → find Seafile OAuth2 provider
# 3. Regenerate client secret
# 4. Update OAUTH_CLIENT_ID and OAUTH_CLIENT_SECRET in seafile-oauth-config.py
# 5. Re-run the config script on the Seafile server to apply
```
---
### 6. Authentik Secret Key (`RpRexcYo5HAz...`)
**Critical** — this key encrypts all Authentik data (tokens, sessions, stored credentials).
**File:** `hosts/synology/calypso/authentik/docker-compose.yaml`
```bash
# 1. Generate a new secret:
python3 -c "import secrets; print(secrets.token_urlsafe(50))"
# 2. Update AUTHENTIK_SECRET_KEY in docker-compose.yaml
# 3. Commit and push — Authentik will restart
# WARNING: All active Authentik sessions will be invalidated.
# Users will need to log back in. SSO-protected services
# may temporarily show login errors while Authentik restarts.
# Verify Authentik is healthy after restart:
docker logs authentik_server --tail 30
```
---
## Priority 3 — Application Secrets (Require Service Restart)
### 7. Gmail App Passwords
Five distinct app passwords were found across the repo. Revoke all of them
in Google Account → Security → App passwords, then create new per-service ones.
| Password | Used For | Active Files |
|----------|----------|-------------|
| (see Vaultwarden) | Mastodon, Joplin, Authentik SMTP | `matrix-ubuntu-vm/mastodon/.env.production.template`, `atlantis/joplin.yml`, `calypso/authentik/docker-compose.yaml` |
| (see Vaultwarden) | Vaultwarden SMTP | `atlantis/vaultwarden.yaml` |
| (see Vaultwarden) | Documenso SMTP | `atlantis/documenso/documenso.yaml` |
| (see Vaultwarden) | Reactive Resume v4 (archived) | `archive/reactive_resume_v4_archived/docker-compose.yml` |
| (see Vaultwarden) | Reactive Resume v5 (active) | `calypso/reactive_resume_v5/docker-compose.yml` |
**Best practice:** Create one app password per service, named clearly (e.g.,
`homelab-joplin`, `homelab-mastodon`). Update each file's `SMTP_PASS` /
`SMTP_PASSWORD` / `MAILER_AUTH_PASSWORD` / `smtp_password` field.
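To make sure no file is missed, grep the repo for every SMTP credential field name before editing (the field names are the ones this audit found; add any others your services use):

```shell
# List every file/line holding an SMTP password field (-i also catches
# lowercase variants like smtp_password).
grep -rniE 'SMTP_PASS(WORD)?|MAILER_AUTH_PASSWORD' \
  hosts/ archive/ \
  --include='*.yml' --include='*.yaml' --include='*.template'
```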
---
### 8. Matrix Synapse Secrets
Three secrets in `homeserver.yaml`, plus the TURN shared secret.
**File:** `hosts/synology/atlantis/matrix_synapse_docs/homeserver.yaml`
```bash
# Generate fresh values for each:
python3 -c "import secrets; print(secrets.token_urlsafe(48))"
# Fields to rotate:
# registration_shared_secret
# macaroon_secret_key
# form_secret
# turn_shared_secret
# After updating homeserver.yaml, restart Synapse:
docker restart synapse # or via Portainer
# Also update coturn config on the server directly:
ssh atlantis
nano /path/to/turnserver.conf
# Update: static-auth-secret=NEW_TURN_SECRET
systemctl restart coturn
# Update instructions.txt — replace old values with REDACTED
```
---
### 9. Mastodon `SECRET_KEY_BASE` + `OTP_SECRET`
**File:** `hosts/synology/atlantis/mastodon.yml`
**Also in:** `docs/services/individual/mastodon.md` (replace with placeholder)
```bash
# Generate new values:
openssl rand -hex 64 # for SECRET_KEY_BASE
openssl rand -hex 64 # for OTP_SECRET
# Update both in mastodon.yml
# Commit and push — GitOps restarts Mastodon
# WARNING: All active user sessions are invalidated. Users must log back in.
# Verify Mastodon web is accessible:
curl -I https://your-mastodon-domain/
docker logs mastodon_web --tail 20
```
---
### 10. Documenso Secrets (3 keys)
**Files:**
- `hosts/synology/atlantis/documenso/documenso.yaml`
- `hosts/synology/atlantis/documenso/Secrets.txt` (will be removed by sanitizer)
- `docs/services/individual/documenso.md` (replace with placeholder)
```bash
# Generate new values:
python3 -c "import secrets; print(secrets.token_urlsafe(32))" # NEXTAUTH_SECRET
python3 -c "import secrets; print(secrets.token_urlsafe(32))" # NEXT_PRIVATE_ENCRYPTION_KEY
python3 -c "import secrets; print(secrets.token_urlsafe(32))" # NEXT_PRIVATE_ENCRYPTION_SECONDARY_KEY
# Update all three in documenso.yaml
# NOTE: Rotating encryption keys will invalidate signed documents.
# Confirm this is acceptable before rotating.
```
---
### 11. Paperless-NGX API Token
**Files:**
- `hosts/synology/calypso/paperless/paperless-ai.yml`
- `hosts/synology/calypso/paperless/README.md` (replace with placeholder)
- `docs/services/paperless.md` (replace with placeholder)
```bash
# 1. Log into Paperless web UI
# 2. Admin → Auth Token → delete existing, generate new
# 3. Update PAPERLESS_API_TOKEN in paperless-ai.yml
# 4. Commit and push
```
---
### 12. Immich JWT Secret (Both NAS)
**Files:**
- `hosts/synology/atlantis/immich/stack.env` (will be removed by sanitizer)
- `hosts/synology/calypso/immich/stack.env` (will be removed by sanitizer)
Since these files are removed by the sanitizer, ensure they are in `.gitignore`
or managed via Portainer env variables going forward.
```bash
# Generate new secret:
openssl rand -base64 96
# Update JWT_SECRET in both stack.env files locally,
# then apply via Portainer (not committed to git).
# WARNING: All active Immich sessions invalidated.
```
---
### 13. Revolt/Stoatchat — LiveKit API Secret + VAPID Private Key
**Files:**
- `hosts/vms/seattle/stoatchat/livekit.yml`
- `hosts/vms/seattle/stoatchat/Revolt.overrides.toml`
- `hosts/vms/homelab-vm/stoatchat.yaml`
- `docs/services/stoatchat/Revolt.overrides.toml` (replace with placeholder)
- `hosts/vms/seattle/stoatchat/DEPLOYMENT_SUMMARY.md` (replace with placeholder)
```bash
# Generate new LiveKit API key/secret pair:
# Use the LiveKit CLI or generate random strings:
python3 -c "import secrets; print(secrets.token_urlsafe(24))" # API key
python3 -c "import secrets; print(secrets.token_urlsafe(32))" # API secret
# Generate new VAPID key pair:
npx web-push generate-vapid-keys
# or: python3 -c "from py_vapid import Vapid; v=Vapid(); v.generate_keys(); print(v.private_key)"
# Update in livekit.yml and Revolt.overrides.toml
# Restart LiveKit and Revolt services
```
---
### 14. Jitsi Internal Auth Passwords (6 passwords)
**File:** `hosts/synology/atlantis/jitsi/jitsi.yml`
**Also in:** `hosts/synology/atlantis/jitsi/.env` (will be removed by sanitizer)
```bash
# Generate new passwords for each variable:
for var in JICOFO_COMPONENT_SECRET JICOFO_AUTH_PASSWORD JVB_AUTH_PASSWORD \
JIGASI_XMPP_PASSWORD JIBRI_RECORDER_PASSWORD JIBRI_XMPP_PASSWORD; do
echo "$var=$(openssl rand -hex 10)"
done
# Update all 6 in jitsi.yml
# Restart the entire Jitsi stack — all components must use the same passwords
docker compose -f jitsi.yml down && docker compose -f jitsi.yml up -d
```
---
### 15. SNMP v3 Auth + Priv Passwords
Used for NAS monitoring — same credentials across 6 files.
**Files to update:**
- `hosts/synology/setillo/prometheus/snmp.yml`
- `hosts/synology/atlantis/grafana_prometheus/snmp.yml`
- `hosts/synology/atlantis/grafana_prometheus/snmp_mariushosting.yml`
- `hosts/synology/calypso/grafana_prometheus/snmp.yml`
- `hosts/vms/homelab-vm/monitoring.yaml`
```bash
# 1. Log into each Synology NAS DSM
# 2. Go to Control Panel → Terminal & SNMP → SNMP tab
# 3. Update SNMPv3 auth password and privacy password to new values
# 4. Update the same values in all 5 config files above
# 5. The archive file (deprecated-monitoring-stacks) can just be left for
# the sanitizer to redact.
```
---
### 16. Invidious `hmac_key`
**Files:**
- `hosts/physical/concord-nuc/invidious/invidious.yaml`
- `hosts/physical/concord-nuc/invidious/invidious_old/invidious.yaml`
- `hosts/synology/atlantis/invidious.yml`
```bash
# Generate new hmac_key:
python3 -c "import secrets; print(secrets.token_hex(16))"
# Update hmac_key in each active invidious.yaml
# Restart Invidious containers
```
---
### 17. Open WebUI Secret Keys
**Files:**
- `hosts/vms/contabo-vm/ollama/docker-compose.yml`
- `hosts/synology/atlantis/ollama/docker-compose.yml`
- `hosts/synology/atlantis/ollama/64_bit_key.txt` (will be removed by sanitizer)
```bash
# Generate new key:
openssl rand -hex 32
# Update WEBUI_SECRET_KEY in both compose files
# Restart Open WebUI containers — active sessions invalidated
```
---
### 18. Portainer Edge Key
**File:** `hosts/vms/homelab-vm/portainer_agent.yaml`
```bash
# 1. Log into Portainer at https://192.168.0.200:9443
# 2. Go to Settings → Edge Compute → Edge Agents
# 3. Find the homelab-vm agent and regenerate its edge key
# 4. Update EDGE_KEY in portainer_agent.yaml with the new base64 value
# 5. Restart the Portainer edge agent container
```
---
### 19. OpenProject Secret Key
**File:** `hosts/vms/homelab-vm/openproject.yml`
**Also in:** `docs/services/individual/openproject.md` (replace with placeholder)
```bash
openssl rand -hex 64
# Update OPENPROJECT_SECRET_KEY_BASE in openproject.yml
# Restart OpenProject — sessions invalidated
```
---
### 20. RomM Auth Secret Key
**File:** `hosts/vms/homelab-vm/romm/romm.yaml`
**Also:** `hosts/vms/homelab-vm/romm/secret_key.yaml` (will be removed by sanitizer)
```bash
openssl rand -hex 32
# Update ROMM_AUTH_SECRET_KEY in romm.yaml
# Restart RomM — sessions invalidated
```
---
### 21. Hoarder NEXTAUTH Secret
**File:** `hosts/vms/homelab-vm/hoarder.yaml`
**Also in:** `docs/services/individual/web.md` (replace with placeholder)
```bash
openssl rand -base64 36
# Update NEXTAUTH_SECRET in hoarder.yaml
# Restart Hoarder — sessions invalidated
```
---
## Priority 4 — Shared / Weak Passwords
### 22. `REDACTED_PASSWORD123!` — Used Across 5+ Services
This password is the same for all of the following. Change each to a
**unique** strong password:
| Service | File | Variable |
|---------|------|----------|
| NetBox | `hosts/synology/atlantis/netbox.yml` | `SUPERUSER_PASSWORD` |
| Paperless admin | `hosts/synology/calypso/paperless/docker-compose.yml` | `PAPERLESS_ADMIN_PASSWORD` |
| Seafile admin | `hosts/synology/calypso/seafile-server.yaml` | `INIT_SEAFILE_ADMIN_PASSWORD` |
| Seafile admin (new) | `hosts/synology/calypso/seafile-new.yaml` | `INIT_SEAFILE_ADMIN_PASSWORD` |
| PhotoPrism | `hosts/physical/anubis/photoprism.yml` | `PHOTOPRISM_ADMIN_PASSWORD` |
| Hemmelig | `hosts/vms/bulgaria-vm/hemmelig.yml` | `SECRET_JWT_SECRET` |
| Vaultwarden admin | `hosts/synology/atlantis/bitwarden/bitwarden_token.txt` | (source password) |
For each: generate `openssl rand -base64 18`, update in the compose file,
restart the container, then log in to verify.
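The generation step can be done in one pass, emitting one unique password per service (service names here mirror the table above; store each in Vaultwarden as you go):

```shell
# One unique password per service; paste each into its compose file.
for svc in netbox paperless seafile seafile-new photoprism hemmelig; do
  printf '%-14s %s\n' "$svc" "$(openssl rand -base64 18)"
done
```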
---
### 23. `REDACTED_PASSWORD` — Used Across 3 Services
| Service | File | Variable |
|---------|------|----------|
| Gotify | `hosts/vms/homelab-vm/gotify.yml` | `GOTIFY_DEFAULTUSER_PASS` |
| Pi-hole | `hosts/synology/atlantis/pihole.yml` | `WEBPASSWORD` |
| Stirling PDF | `hosts/synology/atlantis/stirlingpdf.yml` | `SECURITY_INITIAL_LOGIN_PASSWORD` |
---
### 24. `mastodon_pass_2026` — Live PostgreSQL Password
**Files:**
- `hosts/vms/matrix-ubuntu-vm/mastodon/.env.production.template`
- `hosts/vms/matrix-ubuntu-vm/docs/SETUP.md`
```bash
# On the matrix-ubuntu-vm server:
ssh YOUR_WAN_IP
sudo -u postgres psql
ALTER USER mastodon WITH PASSWORD 'REDACTED_PASSWORD';
\q
# Update the password in .env.production.template and Mastodon's running config
# Restart Mastodon services
```
---
### 25. Watchtower API Token (`REDACTED_WATCHTOWER_TOKEN`)
| File |
|------|
| `hosts/synology/atlantis/watchtower.yml` |
| `hosts/synology/calypso/prometheus.yml` |
```bash
# Generate a proper random token:
openssl rand -hex 20
# Update WATCHTOWER_HTTP_API_TOKEN in both files
# Update any scripts that call the Watchtower API
```
---
### 26. `test:test` SSH Credentials on `YOUR_WAN_IP`
The matrix-ubuntu-vm CREDENTIALS.md shows a `test` user with password `test`.
```bash
# SSH to the server and remove or secure the test account:
ssh YOUR_WAN_IP
passwd test # change to a strong password
# or: userdel -r test # remove entirely if unused
```
---
## Priority 5 — Network Infrastructure
### 27. Management Switch Password Hashes
**File:** `mgmtswitch.conf` (will be removed from public mirror by sanitizer)
The SHA-512 hashes for `root`, `vish`, and `vkhemraj` switch accounts are
crackable offline. Rotate the switch passwords:
```bash
# SSH to the management switch
ssh admin@10.0.0.15
# Change passwords for all local accounts:
enable
configure terminal
username root secret NEW_PASSWORD
username vish secret NEW_PASSWORD
username vkhemraj secret NEW_PASSWORD
write memory
```
---
## Final Verification
After completing all rotations:
```bash
# 1. Commit and push all file changes
git add -A
git commit -m "chore(security): rotate all exposed credentials"
git push origin main
# 2. Wait for the mirror workflow to complete, then pull:
git -C /home/homelab/organized/repos/homelab-optimized pull
# 3. Verify none of the old secrets appear in the public mirror:
cd /home/homelab/organized/repos/homelab-optimized
grep -r "77e3ddaf\|52fa6ccb\|FGXlHM7d\|sk-proj-C_IYp6io\|ArP5XWdkwVyw\|bdtrpmpce\|toiunzuby" . 2>/dev/null
grep -r "244c619d\|RpRexcYo5\|mastodon_pass\|REDACTED_PASSWORD\|REDACTED_PASSWORD\|REDACTED_WATCHTOWER_TOKEN" . 2>/dev/null
grep -r "2e80b1b7d3a\|eca299ae59\|rxmr4tJoqfu\|ZjCofRlfm6\|QE5SudhZ99" . 2>/dev/null
# All should return no results
# 4. Verify GitOps deployments are healthy in Portainer:
# https://192.168.0.200:9443
```
---
## Going Forward — Preventing This Again
The root cause: secrets hard-coded in compose files that get committed to git.
**Rules:**
1. **Never hard-code secrets in compose files** — use Docker Secrets, or an
`.env` file excluded by `.gitignore` (Portainer can load env files from the
host at deploy time)
2. **Never put real values in documentation** — use `YOUR_API_KEY` placeholders
3. **Never create `Secrets.txt` or `CREDENTIALS.md` files in the repo** — use
a password manager (you already have Vaultwarden/Bitwarden)
4. **Run the sanitizer locally** before any commit that touches secrets:
```bash
# Test in a temp copy — see what the sanitizer would catch:
tmpdir=$(mktemp -d)
cp -r /path/to/homelab "$tmpdir/"
python3 "$tmpdir/homelab/.gitea/sanitize.py"
```
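Rule 4 can also be enforced automatically with a local pre-commit hook that refuses commits containing obvious secret patterns (the patterns below are illustrative; extend them with your own token prefixes and field names):

```shell
# Install a local pre-commit secret gate in the repo working copy.
cat > .git/hooks/pre-commit << 'EOF'
#!/bin/sh
# Abort the commit if staged changes look like they contain credentials.
if git diff --cached -U0 | grep -qiE 'sk-proj-|AUTHENTIK_SECRET_KEY|SMTP_PASS'; then
  echo "Possible secret in staged changes -- commit aborted." >&2
  exit 1
fi
EOF
chmod +x .git/hooks/pre-commit
```

Note this hook is per-clone (`.git/hooks` is not committed), so each machine needs it installed.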
## Related Documentation
- [Security Hardening](../security/SERVER_HARDENING.md)
- [Repository Sanitization](../admin/REPOSITORY_SANITIZATION.md)
- [GitOps Deployment Guide](../admin/gitops-deployment-guide.md)
## Portainer Git Credential Rotation
The saved Git credential **`portainer-homelab`** (credId: 1) is used by ~43 stacks to
pull compose files from `git.vish.gg`. When the Gitea token expires or is rotated,
all those stacks fail to redeploy.
```bash
# 1. Generate a new Gitea token at https://git.vish.gg/user/settings/applications
# Scope: read:repository
# 2. Test the token:
curl -s -o /dev/null -w "%{http_code}" \
-H "Authorization: token YOUR_NEW_TOKEN" \
"https://git.vish.gg/api/v1/repos/Vish/homelab"
# Should return 200
# 3. Update in Portainer:
curl -k -s -X PUT \
  -H "X-API-Key: REDACTED_API_KEY" \
-H "Content-Type: application/json" \
"https://192.168.0.200:9443/api/users/1/gitcredentials/1" \
-d '{"name":"portainer-homelab","username":"vish","password":"YOUR_NEW_TOKEN"}'
```
> Note: The API update may not immediately propagate to automated pulls.
> Pass credentials inline in redeploy calls to force use of the new token.
---
## Change Log
- 2026-02-27 — Incident: sanitization commit `037d766a` replaced credentials with
`REDACTED_PASSWORD` placeholders across 14 compose files. All affected containers
detected via Portainer API env scan and restored from `git show 037d766a^`. Added
Portainer Git credential rotation section above.
- 2026-02-20 — Initial creation (8 items)
- 2026-02-20 — Expanded after full private repo audit (27 items across 34 exposure categories)

# Disk Full Procedure Runbook
## Overview
This runbook provides procedures for handling disk space emergencies across all homelab hosts. It includes immediate actions, root cause analysis, and long-term solutions to prevent recurrence.
## Prerequisites
- [ ] SSH access to affected host
- [ ] Root/sudo privileges on the host
- [ ] Monitoring dashboards access
- [ ] Backup verification capability
## Metadata
- **Estimated Time**: 30-90 minutes (depending on severity)
- **Risk Level**: High (data loss possible if not handled carefully)
- **Requires Downtime**: Minimal (may need to stop services temporarily)
- **Reversible**: Partially (deleted data cannot be recovered)
- **Tested On**: 2026-02-14
## Severity Levels
| Level | Disk Usage | Action Required | Urgency |
|-------|------------|-----------------|---------|
| 🟢 **Normal** | < 80% | Monitor | Low |
| 🟡 **Warning** | 80-90% | Plan cleanup | Medium |
| 🟠 **Critical** | 90-95% | Immediate cleanup | High |
| 🔴 **Emergency** | > 95% | Emergency response | Critical |
## Quick Triage
First, determine which host and volume is affected:
```bash
# Check all hosts disk usage
ssh atlantis "df -h"
ssh calypso "df -h"
ssh concordnuc "df -h"
ssh homelab-vm "df -h"
ssh raspberry-pi-5 "df -h"
```
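The per-host `df` output can be reduced to just the volumes that cross a threshold (a triage sketch; run it on each host, or pipe `ssh host "df -P"` into the awk stage):

```shell
# Print only filesystems at or above the warning threshold.
THRESHOLD=80
df -P | awk -v t="$THRESHOLD" 'NR > 1 {
  use = $5; sub(/%/, "", use)                     # strip the % sign
  if (use + 0 >= t)
    printf "%s  %s%% used  (%s)\n", $6, use, $1   # mountpoint, usage, device
}'
```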
## Emergency Procedure (>95% Full)
### Step 1: Immediate Space Recovery
**Goal**: Free up 5-10% space immediately to prevent system issues.
```bash
# SSH to affected host
ssh [hostname]
# Identify what's consuming space
df -h
du -sh /* 2>/dev/null | sort -rh | head -20
# Quick wins - Clear Docker cache
docker system df # See what Docker is using
docker system prune -a --volumes --force # Reclaim space (BE CAREFUL!)
# This typically frees 10-50GB depending on your setup
```
**⚠️ WARNING**: `docker system prune` will remove:
- Stopped containers
- Unused networks
- Dangling images
- Build cache
- Unused volumes (with --volumes flag)
**Safer alternative** if you're unsure:
```bash
# Less aggressive - removes only stopped containers and dangling images
docker system prune --force
```
### Step 2: Clear Log Files
```bash
# Find large log files
find /var/log -type f -size +100M -exec ls -lh {} \; | sort -k 5 -rh
# Clear systemd journal (keeps last 3 days)
sudo journalctl --vacuum-time=3d
# Clear old Docker logs
sudo sh -c 'truncate -s 0 /var/lib/docker/containers/*/*-json.log'
# For Synology NAS
sudo find /volume1/@docker/containers -name "*-json.log" -size +100M -exec truncate -s 0 {} \;
```
### Step 3: Remove Old Docker Images
```bash
# List images by size
docker images --format "{{.Repository}}:{{.Tag}}\t{{.Size}}" | sort -k 2 -rh | head -20
# Remove specific old images
docker image rm [image:tag]
# Remove all unused images
docker image prune -a --force
```
### Step 4: Verify Space Recovered
```bash
# Check current usage
df -h
# Verify critical services are running
docker ps
# Check container logs for errors
docker ps --format "{{.Names}}" | xargs -I {} sh -c 'echo "=== {} ===" && docker logs --tail 20 {}'
```
## Detailed Analysis Procedure
Once immediate danger is passed, perform thorough analysis:
### Step 1: Identify Space Consumers
```bash
# Comprehensive disk usage analysis
sudo du -h --max-depth=2 / 2>/dev/null | sort -rh | head -30
# For Synology NAS specifically
sudo du -h --max-depth=2 /volume1 2>/dev/null | sort -rh | head -30
# Check Docker volumes
docker volume ls
docker system df -v
# Check specific large directories
du -sh /var/lib/docker/* | sort -rh
du -sh /volume1/docker/* | sort -rh # Synology
```
### Step 2: Analyze by Service
Create a space usage report:
```bash
# Create analysis script
cat > /tmp/analyze-space.sh << 'EOF'
#!/bin/bash
echo "=== Docker Container Volumes ==="
docker ps --format "{{.Names}}" | while read container; do
size=$(docker exec $container du -sh / 2>/dev/null | awk '{print $1}')
echo "$container: $size"
done | sort -rh
echo ""
echo "=== Docker Volumes ==="
docker volume ls --format "{{.Name}}" | while read vol; do
size=$(docker volume inspect $vol --format '{{.Mountpoint}}' | xargs sudo du -sh 2>/dev/null | awk '{print $1}')
echo "$vol: $size"
done | sort -rh
echo ""
echo "=== Log Files Over 100MB ==="
find /var/log -type f -size +100M -exec ls -lh {} \; 2>/dev/null
EOF
chmod +x /tmp/analyze-space.sh
/tmp/analyze-space.sh
```
### Step 3: Categorize Findings
Identify the primary space consumers:
| Category | Typical Culprits | Safe to Delete? |
|----------|------------------|-----------------|
| **Docker Images** | Old/unused image versions | ✅ Yes (if unused) |
| **Docker Volumes** | Database growth, media cache | ⚠️ Maybe (check first) |
| **Log Files** | Application logs, system logs | ✅ Yes (after review) |
| **Media Files** | Plex, Jellyfin transcodes | ✅ Yes (transcodes) |
| **Backups** | Old backup archives | ✅ Yes (keep recent) |
| **Application Data** | Various service data | ❌ No (review first) |
## Cleanup Strategies by Service Type
### Media Services (Plex, Jellyfin)
```bash
# Clear Plex transcode cache
docker exec plex rm -rf /transcode/*
# Clear Jellyfin transcode cache
docker exec jellyfin rm -rf /config/data/transcodes/*
# Find and remove old media previews
find /volume1/plex/Library/Application\ Support/Plex\ Media\ Server/Cache -type f -mtime +30 -delete
```
### *arr Suite (Sonarr, Radarr, etc.)
```bash
# Clear download client history and backups
docker exec sonarr find /config/Backups -mtime +30 -delete
docker exec radarr find /config/Backups -mtime +30 -delete
# Clean up old logs
docker exec sonarr find /config/logs -mtime +30 -delete
docker exec radarr find /config/logs -mtime +30 -delete
```
### Database Services (PostgreSQL, MariaDB)
```bash
# Check database size
docker exec postgres psql -U user -c "SELECT pg_database.datname, pg_size_pretty(pg_database_size(pg_database.datname)) FROM pg_database;"
# Vacuum databases (for PostgreSQL)
docker exec postgres vacuumdb -U user --all --full --analyze
# Check MariaDB size
docker exec mariadb mysql -u root -p -e "SELECT table_schema AS 'Database', ROUND(SUM(data_length + index_length) / 1024 / 1024, 2) AS 'Size (MB)' FROM information_schema.TABLES GROUP BY table_schema;"
```
### Monitoring Services (Prometheus, Grafana)
```bash
# Check Prometheus storage size
du -sh /volume1/docker/prometheus
# Prometheus retention is configured in prometheus.yml
# Default: --storage.tsdb.retention.time=15d
# Consider reducing retention if space is critical
# Clear old Grafana sessions
docker exec grafana find /var/lib/grafana/sessions -mtime +7 -delete
```
### Immich (Photo Management)
```bash
# Check Immich storage usage
docker exec immich-server df -h /usr/src/app/upload
# Immich uses a lot of space for:
# - Original photos
# - Thumbnails
# - Encoded videos
# - ML models
# Clear stale partial uploads from the staging area
# (files older than 90 days here are abandoned uploads, not library photos --
# verify the path on your install before deleting)
docker exec immich-server find /usr/src/app/upload/upload -type f -mtime +90 -delete
```
## Long-Term Solutions
### Solution 1: Configure Log Rotation
Create proper log rotation for Docker containers:
```bash
# Edit Docker daemon config
sudo nano /etc/docker/daemon.json
# Add log rotation settings
{
"log-driver": "json-file",
"log-opts": {
"max-size": "10m",
"max-file": "3"
}
}
# Restart Docker
sudo systemctl restart docker # Linux
# OR for Synology
sudo synoservicectl --restart pkgctl-Docker
```
### Solution 2: Set Up Automated Cleanup
Create a cleanup cron job:
```bash
# Create cleanup script
sudo nano /usr/local/bin/homelab-cleanup.sh
#!/bin/bash
# Homelab Automated Cleanup Script
# Remove stopped containers older than 7 days
docker container prune --filter "until=168h" --force
# Remove unused images older than 30 days
docker image prune --all --filter "until=720h" --force
# Remove unused volumes (BE CAREFUL - only if you're sure)
# docker volume prune --force
# Clear journal logs older than 7 days
journalctl --vacuum-time=7d
# Clear old backups (keep last 30 days)
find /volume1/backups -type f -mtime +30 -delete
echo "Cleanup completed at $(date)" >> /var/log/homelab-cleanup.log
# Make executable
sudo chmod +x /usr/local/bin/homelab-cleanup.sh
# Add to cron (runs weekly on Sunday at 3 AM)
(crontab -l 2>/dev/null; echo "0 3 * * 0 /usr/local/bin/homelab-cleanup.sh") | crontab -
```
### Solution 3: Configure Service-Specific Retention
Update each service with appropriate retention policies:
**Prometheus** (retention is set via command-line flags, not `prometheus.yml`):
```yaml
# In the Prometheus service definition (docker-compose):
command:
  - '--storage.tsdb.retention.time=7d'    # reduce from the 15d default if needed
  - '--storage.tsdb.retention.size=50GB'  # hard size cap
```
**Grafana** (docker-compose.yml):
```yaml
environment:
- GF_DATABASE_WAL=true
- GF_DATABASE_CLEANUP_INTERVAL=168h # Weekly cleanup
```
**Plex** (Plex settings):
- Settings → Transcoder → Transcoder temporary directory
- Settings → Scheduled Tasks → Clean Bundles (daily)
- Settings → Scheduled Tasks → Optimize Database (weekly)
### Solution 4: Monitor Disk Usage Proactively
Set up monitoring alerts in Grafana:
```yaml
# Alert rule for disk space
- alert: DiskSpaceWarning
expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100) < 20
for: 5m
labels:
severity: warning
annotations:
summary: "Disk space warning on {{ $labels.instance }}"
description: "Disk {{ $labels.mountpoint }} has less than 20% free space"
- alert: DiskSpaceCritical
expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100) < 10
for: 5m
labels:
severity: critical
annotations:
summary: "CRITICAL: Disk space on {{ $labels.instance }}"
description: "Disk {{ $labels.mountpoint }} has less than 10% free space"
```
## Host-Specific Considerations
### Atlantis (Synology DS1823xs+)
```bash
# Synology-specific cleanup
# Clear Synology logs
sudo find /var/log -name "*.log.*" -mtime +30 -delete
# Clear package logs
sudo find /var/packages/*/target/logs -name "*.log.*" -mtime +30 -delete
# Check storage pool status
sudo synostgpool --info
# DSM has built-in storage analyzer
# Control Panel → Storage Manager → Storage Analyzer
```
### Calypso (Synology DS723+)
Same as Atlantis - use Synology-specific commands.
### Concord NUC (Ubuntu)
```bash
# Ubuntu-specific cleanup
sudo apt-get clean
sudo apt-get autoclean
sudo apt-get autoremove --purge
# Clear old kernels (on current Ubuntu releases, `autoremove` above already handles
# old kernels; this legacy one-liner removes everything but the running kernel - use with care)
sudo apt-get autoremove --purge $(dpkg -l 'linux-*' | sed '/^ii/!d;/'"$(uname -r | sed "s/\(.*\)-\([^0-9]\+\)/\1/")"'/d;s/^[^ ]* [^ ]* \([^ ]*\).*/\1/;/[0-9]/!d')
# Clear thumbnail cache
rm -rf ~/.cache/thumbnails/*
```
### Homelab VM (Proxmox VM)
```bash
# VM-specific cleanup
# Clear apt cache
sudo apt-get clean
# Clear old cloud-init logs
sudo rm -rf /var/log/cloud-init*.log
# Compact QCOW2 disk (from Proxmox host)
# qemu-img convert -O qcow2 -c original.qcow2 compressed.qcow2
```
## Verification Checklist
After cleanup, verify:
- [ ] Disk usage below 80%: `df -h`
- [ ] All critical containers running: `docker ps`
- [ ] No errors in recent logs: `docker logs [container] --tail 50`
- [ ] Services accessible via web interface
- [ ] Monitoring dashboards show normal metrics
- [ ] Backup jobs can complete successfully
- [ ] Automated cleanup configured for future
## Rollback Procedure
If cleanup causes issues:
1. **Check what was deleted**: Review command history and logs
2. **Restore from backups**: If critical data was deleted
```bash
cd ~/Documents/repos/homelab
./restore.sh [backup-date]
```
3. **Recreate Docker volumes**: If volumes were accidentally pruned
4. **Restart affected services**: Redeploy from Portainer
## Troubleshooting
### Issue: Still Running Out of Space After Cleanup
**Solution**: Consider adding more storage
- Add external USB drives
- Expand existing RAID arrays
- Move services to hosts with more space
- Archive old media to cold storage
### Issue: Docker Prune Removed Important Data
**Solution**:
- Always use `--filter` to be selective
- Never use `docker volume prune` without checking first
- Keep recent backups before major cleanup operations
### Issue: Services Won't Start After Cleanup
**Solution**:
```bash
# Check for missing volumes
docker ps -a
docker volume ls
# Check logs
docker logs [container]
# Recreate volumes if needed (restore from backup)
./restore.sh [backup-date]
```
## Prevention Checklist
- [ ] Log rotation configured for all services
- [ ] Automated cleanup script running weekly
- [ ] Monitoring alerts set up for disk space
- [ ] Retention policies configured appropriately
- [ ] Regular backup verification scheduled
- [ ] Capacity planning review quarterly
## Related Documentation
- [Infrastructure Overview](../infrastructure/INFRASTRUCTURE_OVERVIEW.md)
- [Backup Strategies](../admin/backup-strategies.md)
- [Monitoring Setup](../admin/monitoring-setup.md)
- [Troubleshooting Guide](../troubleshooting/common-issues.md)
## Change Log
- 2026-02-14 - Initial creation with host-specific procedures
- 2026-02-14 - Added service-specific cleanup strategies

# Service Migration Runbook
## Overview
This runbook guides you through migrating a containerized service from one host to another in the homelab. The procedure minimizes downtime and ensures data integrity throughout the migration.
## Prerequisites
- [ ] SSH access to both source and target hosts
- [ ] Sufficient disk space on target host
- [ ] Network connectivity between hosts (Tailscale recommended)
- [ ] Service backup completed and verified
- [ ] Maintenance window scheduled (if downtime required)
- [ ] Portainer access for both hosts
## Metadata
- **Estimated Time**: 1-3 hours (depending on data size)
- **Risk Level**: Medium-High (data migration involved)
- **Requires Downtime**: Yes (typically 15-60 minutes)
- **Reversible**: Yes (can roll back to source host)
- **Tested On**: 2026-02-14
## When to Migrate Services
Common reasons for service migration:
| Scenario | Example | Recommended Target |
|----------|---------|-------------------|
| **Resource constraints** | NAS running out of CPU | Move to NUC or VM |
| **Storage constraints** | Running out of disk space | Move to larger NAS |
| **Performance issues** | High I/O affecting other services | Move to dedicated host |
| **Host consolidation** | Reducing number of active hosts | Consolidate to primary hosts |
| **Hardware maintenance** | Planned hardware upgrade | Temporary or permanent move |
| **Improved organization** | Group related services | Move to appropriate host |
## Migration Types
### Type 1: Simple Migration (Stateless Service)
- No persistent data
- Can be redeployed from scratch
- Example: Nginx, static web servers
- **Downtime**: Minimal (5-15 minutes)
### Type 2: Standard Migration (Small Data)
- Persistent data < 10GB
- Configuration and databases
- Example: Uptime Kuma, AdGuard Home
- **Downtime**: 15-30 minutes
### Type 3: Large Data Migration
- Persistent data > 10GB
- Media libraries, large databases
- Example: Plex, Immich, Jellyfin
- **Downtime**: 1-4 hours (depending on size)
## Pre-Migration Planning
### Step 1: Assess the Service
```bash
# SSH to source host
ssh [source-host]
# Identify container and volumes
docker ps | grep [service-name]
docker inspect [service-name] | grep -A 10 Mounts
# Check data size
docker exec [service-name] du -sh /config /data
# List all volumes used by service
docker volume ls | grep [service-name]
# Check volume sizes
docker system df -v | grep [service-name]
```
Document findings:
- Container name: ___________
- Image and tag: ___________
- Data size: ___________
- Volume count: ___________
- Network dependencies: ___________
- Port mappings: ___________
### Step 2: Check Target Host Capacity
```bash
# SSH to target host
ssh [target-host]
# Check available resources
df -h # Disk space
free -h # RAM
nproc # CPU cores
docker ps | wc -l # Current container count
# Check port conflicts
netstat -tlnp | grep [required-port]
```
### Step 3: Create Migration Plan
**Downtime Window**:
- Start: ___________
- End: ___________
- Duration: ___________
**Dependencies**:
- Services that depend on this: ___________
- Services this depends on: ___________
**Notification**:
- Who to notify: ___________
- When to notify: ___________
## Migration Procedure
### Method A: GitOps Migration (Recommended)
Best for: Most services with proper version control
#### Step 1: Backup Current Service
```bash
# SSH to source host
ssh [source-host]
# Create backup
docker stop [service-name]
docker export [service-name] > /tmp/[service-name]-backup.tar
# Backup volumes
for vol in $(docker volume ls -q | grep [service-name]); do
docker run --rm -v $vol:/source -v /tmp:/backup alpine tar czf /backup/$vol.tar.gz -C /source .
done
# Copy backups to safe location
scp /tmp/[service-name]*.tar* [backup-location]:~/backups/
```
#### Step 2: Export Configuration
```bash
# Get current docker-compose configuration
cd ~/Documents/repos/homelab
cat hosts/[source-host]/[service-name].yaml > /tmp/service-config.yaml
# Note environment variables
docker inspect [service-name] | grep -A 50 Env
```
#### Step 3: Copy Data to Target Host
**For Small Data (< 10GB)**: Use SCP
```bash
# From your workstation
scp -r [source-host]:/volume1/docker/[service-name] /tmp/
scp -r /tmp/[service-name] [target-host]:/path/to/docker/
```
**For Large Data (> 10GB)**: Use Rsync
```bash
# From source host to target host via Tailscale
ssh [source-host]
rsync -avz --progress /volume1/docker/[service-name]/ \
[target-host-tailscale-ip]:/path/to/docker/[service-name]/
# Monitor progress
watch -n 5 'du -sh /path/to/docker/[service-name]'
```
**For Very Large Data (> 100GB)**: Consider physical transfer
```bash
# Copy to USB drive, physically move, then copy to target
# Or use network-attached storage as intermediate
```
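To decide between network and physical transfer, a rough time estimate helps. A back-of-envelope sketch (the size and throughput numbers are assumptions; substitute your own):

```shell
#!/bin/sh
# Rough transfer-time estimate: GB * 8000 / Mbps = seconds.
SIZE_GB=500      # data to move (assumed)
SPEED_MBPS=400   # sustained throughput over the link (assumed)
awk -v gb="$SIZE_GB" -v mbps="$SPEED_MBPS" \
  'BEGIN { printf "~%.1f hours\n", gb * 8000 / mbps / 3600 }'
```

At 500 GB over a 400 Mbit/s link this prints `~2.8 hours`; a USB 3 drive plus a walk across the room often wins well before the 100 GB mark.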
#### Step 4: Stop Service on Source Host
```bash
# SSH to source host
ssh [source-host]
# Stop the container
docker stop [service-name]
# Verify it's stopped
docker ps -a | grep [service-name]
```
#### Step 5: Update Git Configuration
```bash
# On your workstation
cd ~/Documents/repos/homelab
# Move service definition to new host
git mv hosts/[source-host]/[service-name].yaml \
hosts/[target-host]/[service-name].yaml
# Update paths in the configuration file if needed
nano hosts/[target-host]/[service-name].yaml
# Update volume paths for target host
# Atlantis/Calypso: /volume1/docker/[service-name]
# NUC/VM: /home/user/docker/[service-name]
# Raspberry Pi: /home/pi/docker/[service-name]
# Commit changes
git add hosts/[target-host]/[service-name].yaml
git commit -m "Migrate [service-name] from [source-host] to [target-host]
- Move service configuration
- Update volume paths for target host
- Migration date: $(date +%Y-%m-%d)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>"
git push origin main
```
#### Step 6: Deploy on Target Host
**Via Portainer UI**:
1. Open Portainer → Select target host endpoint
2. Go to **Stacks****Add stack****Git Repository**
3. Configure:
- Repository URL: Your git repository
- Compose path: `hosts/[target-host]/[service-name].yaml`
- Enable GitOps (optional)
4. Click **Deploy the stack**
**Via GitOps Auto-Sync**:
- Wait 5-10 minutes for automatic deployment
- Monitor Portainer for new stack appearance
#### Step 7: Verify Migration
```bash
# SSH to target host
ssh [target-host]
# Check container is running
docker ps | grep [service-name]
# Check logs for errors
docker logs [service-name] --tail 100
# Test service accessibility
curl http://localhost:[port] # Internal
curl https://[service].vish.gg # External (if applicable)
# Verify data integrity
docker exec [service-name] ls -lah /config
docker exec [service-name] ls -lah /data
# Check resource usage
docker stats [service-name] --no-stream
```
#### Step 8: Update DNS/Reverse Proxy (If Applicable)
```bash
# Update Nginx Proxy Manager or reverse proxy configuration
# Point [service].vish.gg to new host IP
# Update Cloudflare DNS if using Cloudflare Tunnels
# Update local DNS (AdGuard Home) if applicable
```
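After updating DNS, confirm that clients actually resolve the service name to the new host before cleaning up the source. A small sketch (the hostname and IP here are placeholders):

```shell
#!/bin/sh
# Confirm the service hostname resolves to the target host's IP.
SERVICE_HOST="service.vish.gg"   # placeholder: migrated service hostname
EXPECTED_IP="192.168.1.50"       # placeholder: new host's IP

resolved=$(getent hosts "$SERVICE_HOST" | awk '{ print $1; exit }')
if [ "$resolved" = "$EXPECTED_IP" ]; then
  echo "DNS OK: $SERVICE_HOST -> $resolved"
else
  echo "DNS mismatch: got '${resolved:-nothing}', expected $EXPECTED_IP"
fi
```

Remember that AdGuard Home and client resolvers cache entries, so a stale answer here may just mean waiting out the TTL.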
#### Step 9: Remove from Source Host
**Only after verifying target is working correctly!**
```bash
# SSH to source host
ssh [source-host]
# Remove container and volumes
docker stop [service-name]
docker rm [service-name]
# Optional: Remove volumes (only if data copied successfully)
# docker volume rm $(docker volume ls -q | grep [service-name])
# Remove data directory
rm -rf /volume1/docker/[service-name] # BE CAREFUL!
# Remove from Portainer if manually managed
# Portainer UI → Stacks → Remove stack
```
### Method B: Manual Export/Import
Best for: Quick migrations without git changes, or when testing
#### Step 1: Stop and Export
```bash
# SSH to source host
ssh [source-host]
# Stop service
docker stop [service-name]
# Export container and volumes
docker run --rm \
-v [service-name]_data:/source \
-v /tmp:/backup \
alpine tar czf /backup/[service-name]-data.tar.gz -C /source .
# Export configuration
docker inspect [service-name] > /tmp/[service-name]-config.json
```
#### Step 2: Transfer to Target
```bash
# Copy data to target host
scp /tmp/[service-name]-data.tar.gz [target-host]:/tmp/
scp /tmp/[service-name]-config.json [target-host]:/tmp/
```
#### Step 3: Import on Target
```bash
# SSH to target host
ssh [target-host]
# Create volume
docker volume create [service-name]_data
# Import data
docker run --rm \
-v [service-name]_data:/target \
-v /tmp:/backup \
alpine tar xzf /backup/[service-name]-data.tar.gz -C /target
# Create and start container using saved configuration
# Adjust paths and ports as needed
docker create --name [service-name] \
[options-from-config.json] \
[image:tag]
docker start [service-name]
```
## Post-Migration Tasks
### Update Documentation
```bash
# Update service inventory
nano docs/services/VERIFIED_SERVICE_INVENTORY.md
# Update the host column for migrated service
# | Service | Host | Port | URL | Status |
# | Service | [NEW-HOST] | 8080 | https://service.vish.gg | ✅ Active |
```
### Update Monitoring
```bash
# Update Prometheus configuration if needed
nano prometheus/prometheus.yml
# Update target host IP for scraped metrics
# Restart Prometheus if configuration changed
```
### Test Backups
```bash
# Verify backups work on new host
./backup.sh --test
# Ensure service data is included in backup
ls -lah /path/to/backups/[service-name]
```
### Performance Baseline
```bash
# Document baseline performance on new host
docker stats [service-name] --no-stream
# Monitor for 24 hours to ensure stability
```
## Verification Checklist
- [ ] Service running on target host: `docker ps`
- [ ] All data migrated correctly
- [ ] Configuration preserved
- [ ] Logs show no errors: `docker logs [service]`
- [ ] External access works (if applicable)
- [ ] Internal service connectivity works
- [ ] Reverse proxy updated (if applicable)
- [ ] DNS records updated (if applicable)
- [ ] Monitoring updated
- [ ] Documentation updated
- [ ] Backups include new location
- [ ] Old host cleaned up
- [ ] Users notified of any URL changes
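Several of the checks above can be probed from a shell before ticking them off. A minimal reachability sketch (host and port are placeholders for the migrated service):

```shell
#!/bin/sh
# Probe the migrated service's HTTP port before signing off on the checklist.
HOST=localhost
PORT=8080   # placeholder: the migrated service's port
if curl -fsS -o /dev/null --max-time 5 "http://$HOST:$PORT/"; then
  echo "OK: $HOST:$PORT responded"
else
  echo "FAIL: $HOST:$PORT not reachable - check 'docker ps' and port mappings"
fi
```

Running this from both the target host and a client machine catches firewall and reverse-proxy gaps that a local-only check would miss.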
## Rollback Procedure
If migration fails or causes issues:
### Quick Rollback (Within 24 hours)
```bash
# SSH to source host
ssh [source-host]
# Restore from backup
docker import /tmp/[service-name]-backup.tar [service-name]:backup
# Or redeploy from git (revert git changes)
cd ~/Documents/repos/homelab
git revert HEAD
git push origin main
# Restart service on source host
# Via Portainer or:
docker start [service-name]
```
### Full Rollback (After cleanup)
```bash
# Restore from backup
./restore.sh [backup-date]
# Redeploy to original host
# Follow original deployment procedure
```
## Troubleshooting
### Issue: Data Transfer Very Slow
**Symptoms**: Rsync taking hours for moderate data
**Solutions**:
```bash
# Use compression for better network performance
rsync -avz --compress-level=6 --progress /source/ [target]:/dest/
# Or use parallel transfer tools
# Install: sudo apt-get install parallel
find /source -type f | parallel -j 4 scp {} [target]:/dest/{}
# For extremely large transfers, consider:
# 1. Physical USB drive transfer
# 2. NFS mount between hosts
# 3. Transfer during off-peak hours
```
### Issue: Service Won't Start on Target Host
**Symptoms**: Container starts then immediately exits
**Solutions**:
```bash
# Check logs
docker logs [service-name]
# Common issues:
# 1. Path issues - Update volume paths in compose file
# 2. Permission issues - Check PUID/PGID
# 3. Port conflicts - Check if port already in use
# 4. Missing dependencies - Ensure all required services running
# Fix permissions
docker exec [service-name] chown -R 1000:1000 /config /data
```
### Issue: Lost Configuration Data
**Symptoms**: Service starts but settings are default
**Solutions**:
```bash
# Check if volumes mounted correctly
docker inspect [service-name] | grep -A 10 Mounts
# Restore configuration from backup
docker stop [service-name]
docker run --rm -v [service-name]_config:/target -v /tmp:/backup alpine \
tar xzf /backup/config-backup.tar.gz -C /target
docker start [service-name]
```
### Issue: Network Connectivity Problems
**Symptoms**: Service can't reach other services
**Solutions**:
```bash
# Check network configuration
docker network ls
docker network inspect [network-name]
# Add service to required networks
docker network connect [network-name] [service-name]
# Verify DNS resolution
docker exec [service-name] ping [other-service]
```
## Migration Examples
### Example 1: Migrate Uptime Kuma from Calypso to Homelab VM
```bash
# 1. Backup on Calypso
ssh calypso
docker stop uptime-kuma
tar czf /tmp/uptime-kuma-data.tar.gz /volume1/docker/uptime-kuma
# 2. Transfer
scp /tmp/uptime-kuma-data.tar.gz homelab-vm:/tmp/
# 3. Update git
cd ~/Documents/repos/homelab
git mv hosts/synology/calypso/uptime-kuma.yaml \
hosts/vms/homelab-vm/uptime-kuma.yaml
# Update paths in file
sed -i 's|/volume1/docker/uptime-kuma|/home/user/docker/uptime-kuma|g' \
hosts/vms/homelab-vm/uptime-kuma.yaml
# 4. Deploy on target
git add . && git commit -m "Migrate Uptime Kuma to Homelab VM" && git push
# 5. Verify and cleanup Calypso
```
### Example 2: Migrate AdGuard Home between Hosts
```bash
# AdGuard Home requires DNS configuration updates
# 1. Note current DNS settings on clients
# 2. Migrate service (as above)
# 3. Update client DNS to point to new host IP
# 4. Test DNS resolution from clients
```
## Related Documentation
- [Add New Service](add-new-service.md)
- [Infrastructure Overview](../infrastructure/INFRASTRUCTURE_OVERVIEW.md)
- [Backup Strategies](../admin/backup-strategies.md)
- [Deployment Workflow](../admin/DEPLOYMENT_WORKFLOW.md)
## Change Log
- 2026-02-14 - Initial creation with multiple migration methods
- 2026-02-14 - Added large data migration strategies

# Synology DSM Upgrade Runbook
## Overview
This runbook provides a safe procedure for upgrading DiskStation Manager (DSM) on Synology NAS devices (Atlantis DS1823xs+ and Calypso DS723+). The procedure minimizes downtime and ensures data integrity during major and minor DSM upgrades.
## Prerequisites
- [ ] DSM admin credentials
- [ ] Complete backup of NAS (HyperBackup or external)
- [ ] Backup verification completed
- [ ] List of installed packages and their versions
- [ ] SSH access to NAS (for troubleshooting)
- [ ] Maintenance window scheduled (1-3 hours)
- [ ] All Docker containers documented and backed up
- [ ] Tailscale or alternative remote access configured
## Metadata
- **Estimated Time**: 1-3 hours (including backups and verification)
- **Risk Level**: Medium-High (system-level upgrade)
- **Requires Downtime**: Yes (30-60 minutes for upgrade itself)
- **Reversible**: Limited (can rollback but complicated)
- **Tested On**: 2026-02-14
## Upgrade Types
| Type | Example | Risk | Downtime | Reversibility |
|------|---------|------|----------|---------------|
| **Patch Update** | 7.2.1 → 7.2.2 | Low | 15-30 min | Easy |
| **Minor Update** | 7.2 → 7.3 | Medium | 30-60 min | Moderate |
| **Major Update** | 7.x → 8.0 | High | 60-120 min | Difficult |
## Pre-Upgrade Planning
### Step 1: Check Compatibility
Before upgrading, verify compatibility:
```bash
# SSH to NAS
ssh admin@atlantis # or calypso
# Check current DSM version
cat /etc.defaults/VERSION
# Check hardware compatibility
# Visit: https://www.synology.com/en-us/dsm
# Verify your model supports the target DSM version
# Check RAM requirements (DSM 7.2+ needs at least 1GB)
free -h
# Check disk space (need at least 5GB free in system partition)
df -h
```
### Step 2: Document Current State
Create a pre-upgrade snapshot of your configuration:
```bash
# Document installed packages
# DSM UI → Package Center → Installed
# Take screenshot or note down:
# - Package names and versions
# - Custom configurations
# Export Docker Compose files (already in git)
cd ~/Documents/repos/homelab
git status # Ensure all configs are committed
# Document running containers
ssh atlantis "docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}' > /volume1/docker/pre-upgrade-containers.txt"
ssh calypso "docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}' > /volume1/docker/pre-upgrade-containers.txt"
# Export package list
ssh atlantis "synopkg list > /volume1/docker/pre-upgrade-packages.txt"
ssh calypso "synopkg list > /volume1/docker/pre-upgrade-packages.txt"
```
### Step 3: Backup Everything
**Critical**: Complete a full backup before proceeding.
```bash
# 1. Backup via HyperBackup (if configured)
# DSM UI → HyperBackup → Backup Now
# 2. Export DSM configuration
# DSM UI → Control Panel → Update & Restore → Configuration Backup → Back Up Configuration
# 3. Backup Docker volumes
cd ~/Documents/repos/homelab
./backup.sh
# 4. Snapshot (if using Btrfs)
# Storage Manager → Storage Pool → Snapshots → Take Snapshot
# 5. Verify backups
ls -lh /volume1/backups/
# Ensure backup completed successfully
```
### Step 4: Notify Users
If other users rely on your homelab:
```bash
# Send notification (via your notification system)
curl -H "Title: Scheduled Maintenance" \
-H "Priority: high" \
-H "Tags: maintenance" \
-d "DSM upgrade scheduled for [DATE/TIME]. Services will be unavailable for approximately 1-2 hours." \
https://ntfy.sh/REDACTED_TOPIC
# Or send notification via Signal/Discord/etc.
```
### Step 5: Plan Rollback Strategy
Document your rollback plan:
- [ ] Backup location verified: ___________
- [ ] Restore procedure tested: Yes/No
- [ ] Alternative access method ready (direct keyboard/monitor)
- [ ] Support contact available if needed
## Upgrade Procedure
### Step 1: Download DSM Update
**Option A: Via DSM UI (Recommended)**
1. Log in to DSM web interface
2. **Control Panel****Update & Restore**
3. **DSM Update** tab
4. If update available, click **Download** (don't install yet)
5. Wait for download to complete
6. Read release notes carefully
**Option B: Manual Download**
1. Visit Synology Download Center
2. Find your model (DS1823xs+ or DS723+)
3. Download appropriate DSM version
4. Upload via DSM → **Manual DSM Update**
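When downloading the `.pat` file manually, verify its checksum against the value listed on the Download Center page before uploading. The filename below is illustrative:

```shell
#!/bin/sh
# Verify the downloaded DSM package against the published MD5
# (filename is illustrative - use the file you actually downloaded).
md5sum DSM_DS1823xs+_build.pat
```

Compare the printed digest with the MD5 shown next to the download link; a mismatch means a corrupted or tampered file that should never be installed.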
### Step 2: Prepare for Downtime
```bash
# Stop non-critical Docker containers (optional, reduces memory pressure)
ssh atlantis
docker stop $(docker ps -q --filter "name=pattern") # Stop specific containers
# Or stop all non-critical containers
# Review which containers can be safely stopped
docker ps
docker stop container1 container2 container3
# Leave critical services running:
# - Portainer (for post-upgrade management)
# - Monitoring (to track upgrade progress)
# - Core network services (AdGuard, VPN if critical)
```
### Step 3: Initiate Upgrade
**Via DSM UI**:
1. **Control Panel****Update & Restore****DSM Update**
2. Click **Update Now**
3. Review release notes and warnings
4. Check **Yes, I understand I need to perform a backup before updating DSM**
5. Click **OK** to start
**Via SSH** (advanced, not recommended unless necessary):
```bash
# SSH to NAS
ssh admin@atlantis
# Start upgrade manually
sudo synoupgrade --start /volume1/@tmp/upd@te/update.pat
# Monitor progress
tail -f /var/log/messages
```
### Step 4: Monitor Upgrade Progress
During upgrade, you'll see:
1. **Checking system**: Verifying prerequisites
2. **Downloading**: If not pre-downloaded
3. **Installing**: Actual upgrade process (30-45 minutes)
4. **Optimizing system**: Post-install tasks
5. **Reboot**: System will restart
**Monitoring via SSH** (if you have access during upgrade):
```bash
# Watch upgrade progress
tail -f /var/log/upgrade.log
# Or watch system messages
tail -f /var/log/messages | grep -i upgrade
```
**Expected timeline**:
- Preparation: 5-10 minutes
- Installation: 30-45 minutes
- First reboot: 3-5 minutes
- Optimization: 10-20 minutes
- Final reboot: 3-5 minutes
- **Total**: 60-90 minutes
### Step 5: Wait for Completion
**⚠️ IMPORTANT**: Do not power off or interrupt the upgrade!
Signs of normal upgrade:
- DSM UI becomes inaccessible
- NAS may beep once (starting upgrade)
- Disk lights active
- NAS will reboot 1-2 times
- Final beep when complete
### Step 6: First Login After Upgrade
1. Wait for NAS to complete all restarts
2. Access DSM UI (may take 5-10 minutes after last reboot)
3. Log in with admin credentials
4. You may see "Optimization in progress" - this is normal
5. Review the "What's New" page
6. Accept any new terms/agreements
## Post-Upgrade Verification
### Step 1: Verify System Health
```bash
# SSH to NAS
ssh admin@atlantis
# Check DSM version
cat /etc.defaults/VERSION
# Should show new version
# Check system status
sudo syno_disk_check
# Check RAID status
cat /proc/mdstat
# Check disk health
sudo smartctl -a /dev/sda
# Verify storage pools
synospace --get
```
Via DSM UI:
- **Storage Manager** → Verify all pools are "Healthy"
- **Resource Monitor** → Check CPU, RAM, network
- **Log Center** → Review any errors during upgrade
### Step 2: Verify Packages
```bash
# Check all packages are running
synopkg list
# Compare with pre-upgrade package list
diff /volume1/docker/pre-upgrade-packages.txt <(synopkg list)
# Start any stopped packages
# DSM UI → Package Center → Installed
# Check each package, start if needed
```
Common packages to verify:
- [ ] Docker
- [ ] Synology Drive
- [ ] Hyper Backup
- [ ] Snapshot Replication
- [ ] Any other installed packages
### Step 3: Verify Docker Containers
```bash
# SSH to NAS
ssh atlantis
# Check Docker is running
docker --version
docker info
# Check all containers
docker ps -a
# Compare with pre-upgrade state
diff /volume1/docker/pre-upgrade-containers.txt <(docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}')
# Start stopped containers
docker start $(docker ps -a -q -f status=exited)
# Check container logs for errors
docker ps --format "{{.Names}}" | xargs -I {} sh -c 'echo "=== {} ===" && docker logs --tail 20 {}'
```
### Step 4: Test Key Services
Verify critical services are working:
```bash
# Test network connectivity
ping -c 4 8.8.8.8
curl -I https://google.com
# Test Docker networking
docker exec [container] ping -c 2 8.8.8.8
# Test Portainer access
curl http://localhost:9000
# Test Plex
curl http://localhost:32400/web
# Test monitoring
curl http://localhost:3000 # Grafana
curl http://localhost:9090 # Prometheus
```
Via browser:
- [ ] Portainer accessible
- [ ] Grafana dashboards loading
- [ ] Plex/Jellyfin streaming works
- [ ] File shares accessible
- [ ] SSO (Authentik) working
### Step 5: Verify Scheduled Tasks
```bash
# Check cron jobs
crontab -l
# Via DSM UI
# Control Panel → Task Scheduler
# Verify all tasks are enabled
```
### Step 6: Test Remote Access
- [ ] Tailscale VPN working
- [ ] External access via domain (if configured)
- [ ] SSH access working
- [ ] Mobile app access working (DS File, DS Photo, etc.)
## Post-Upgrade Optimization
### Step 1: Update Packages
After DSM upgrade, packages may need updates:
1. **Package Center****Update** tab
2. Update available packages
3. Prioritize critical packages:
- Docker (if updated)
- Surveillance Station (if used)
- Drive, Office, etc.
### Step 2: Review New Features
DSM upgrades often include new features:
1. Review "What's New" page
2. Check for new security features
3. Review changed settings
4. Update documentation if needed
### Step 3: Re-enable Auto-Updates (if disabled)
```bash
# Via DSM UI
# Control Panel → Update & Restore → DSM Update
# Check "Notify me when DSM updates are available"
# Or "Install latest DSM updates automatically" (if you trust auto-updates)
```
### Step 4: Update Documentation
```bash
cd ~/Documents/repos/homelab
# Update infrastructure docs
nano docs/infrastructure/INFRASTRUCTURE_OVERVIEW.md
# Note DSM version upgrade
# Document any configuration changes
# Update troubleshooting docs if procedures changed
git add .
git commit -m "Update docs: DSM upgraded to X.X on Atlantis/Calypso"
git push
```
## Troubleshooting
### Issue: Upgrade Fails or Stalls
**Symptoms**: Progress bar stuck, no activity for >30 minutes
**Solutions**:
```bash
# If you have SSH access:
ssh admin@atlantis
# Check if upgrade process is running
ps aux | grep -i upgrade
# Check system logs
tail -100 /var/log/messages
tail -100 /var/log/upgrade.log
# Check disk space
df -h
# If completely stuck (>1 hour no progress):
# 1. Do NOT force reboot unless absolutely necessary
# 2. Contact Synology support first
# 3. As last resort, force reboot via physical button
```
### Issue: NAS Won't Boot After Upgrade
**Symptoms**: Cannot access DSM UI, NAS beeping continuously
**Solutions**:
1. **Check beep pattern** (indicates specific error)
- 1 beep: Normal boot
- 3 beeps: RAM issue
- 4 beeps: Disk issue
- Continuous: Critical failure
2. **Try Safe Mode**:
- Power off NAS
- Hold reset button
- Power on while holding reset
- Hold for 4 seconds until beep
- Release and wait for boot
3. **Check via Synology Assistant**:
- Download Synology Assistant on PC
- Scan network for NAS
- May show recovery mode option
4. **Last resort: Reinstall DSM**:
- Download latest DSM .pat file
- Access via http://[nas-ip]:5000
- Install DSM (will not erase data)
### Issue: Docker Not Working After Upgrade
**Symptoms**: Docker containers won't start, Docker package shows stopped
**Solutions**:
```bash
# SSH to NAS
ssh admin@atlantis
# Check Docker status
sudo synoservicectl --status pkgctl-Docker
# Restart Docker
sudo synoservicectl --restart pkgctl-Docker
# If Docker won't start, check logs
cat /var/log/docker.log
# Reinstall Docker package (preserves volumes)
# Via DSM UI → Package Center → Docker → Uninstall
# Then reinstall Docker
# Your volumes and data will be preserved
```
### Issue: Network Shares Not Accessible
**Symptoms**: Can't connect to SMB/NFS shares
**Solutions**:
```bash
# Check share services
sudo synoservicectl --status smbd # SMB
sudo synoservicectl --status nfsd # NFS
# Restart services
sudo synoservicectl --restart smbd
sudo synoservicectl --restart nfsd
# Check firewall
# Control Panel → Security → Firewall
# Ensure file sharing ports allowed
```
### Issue: Performance Degradation After Upgrade
**Symptoms**: Slow response, high CPU/RAM usage
**Solutions**:
```bash
# Check what's using resources
top
htop # If installed
# Via DSM UI → Resource Monitor
# Identify resource-hungry processes
# Common causes:
# 1. Indexing in progress (Photos, Drive, Universal Search)
# - Wait for indexing to complete (can take hours)
# 2. Optimization running
# - Check: ps aux | grep optimize
# - Let it complete
# 3. Too many containers started at once
# - Stagger container startup
```
## Rollback Procedure
⚠️ **WARNING**: Rollback is complex and risky. Only attempt if absolutely necessary.
### Method 1: DSM Archive (If Available)
```bash
# SSH to NAS
ssh admin@atlantis
# Check if previous DSM version archived
ls -la /volume1/@appstore/
# If archive exists, you can attempt rollback
# CAUTION: This is not officially supported and may cause data loss
```
### Method 2: Restore from Backup
If upgrade caused critical issues:
1. REDACTED_APP_PASSWORD
2. Restore from HyperBackup
3. Or restore from configuration backup:
- **Control Panel** → **Update & Restore**
- **Configuration Backup** → **Restore**
### Method 3: Fresh Install (Nuclear Option)
⚠️ **DANGER**: This will erase everything. Only for catastrophic failure.
1. Download previous DSM version
2. Install via Synology Assistant in "Recovery Mode"
3. Restore from complete backup
4. Reconfigure everything
## Best Practices
### Timing
- Schedule upgrades during low-usage periods
- Allow 3-4 hour maintenance window
- Don't upgrade before important events
- Wait 2-4 weeks after major DSM release (let others find bugs)
### Testing
- If you have 2 NAS units, upgrade one first
- Test on less critical NAS before primary
- Read community forums for known issues
- Review Synology release notes thoroughly
### Preparation
- Always complete full backup
- Test backup restore before upgrade
- Document all configurations
- Have physical access to NAS if possible
- Keep Synology Assistant installed on PC
### Post-Upgrade
- Monitor closely for 24-48 hours
- Check logs daily for first week
- Report any bugs to Synology
- Update your documentation
## Verification Checklist
- [ ] DSM upgraded to target version
- [ ] All storage pools healthy
- [ ] All packages running
- [ ] All Docker containers running
- [ ] Network shares accessible
- [ ] Remote access working (Tailscale, QuickConnect)
- [ ] Scheduled tasks running
- [ ] Monitoring dashboards functional
- [ ] Backups completing successfully
- [ ] No errors in system logs
- [ ] Performance normal
- [ ] Documentation updated
## Related Documentation
- [Infrastructure Overview](../infrastructure/INFRASTRUCTURE_OVERVIEW.md)
- [Emergency Access Guide](../troubleshooting/EMERGENCY_ACCESS_GUIDE.md)
- [Disaster Recovery](../troubleshooting/disaster-recovery.md)
- [Synology Disaster Recovery](../troubleshooting/synology-disaster-recovery.md)
- [Backup Strategies](../admin/backup-strategies.md)
## Additional Resources
- [Synology DSM Release Notes](https://www.synology.com/en-us/releaseNote/DSM)
- [Synology Community Forums](https://community.synology.com/)
- [Synology Knowledge Base](https://kb.synology.com/)
## Change Log
- 2026-02-14 - Initial creation
- 2026-02-14 - Added comprehensive troubleshooting and rollback procedures