Sanitized mirror from private repository - 2026-04-19 09:54:54 UTC
docs/runbooks/README.md (new file, 143 lines)

# Homelab Operational Runbooks

This directory contains step-by-step operational runbooks for common homelab management tasks. Each runbook provides clear procedures, prerequisites, and rollback steps.

## 📚 Available Runbooks

### Service Management
- **[Add New Service](add-new-service.md)** - Deploy new containerized services via GitOps
- **[Service Migration](service-migration.md)** - Move services between hosts safely
- **[Add New User](add-new-user.md)** - Onboard new users with proper access

### Infrastructure Maintenance
- **[Disk Full Procedure](disk-full-procedure.md)** - Handle full disk scenarios
- **[Certificate Renewal](certificate-renewal.md)** - Manage SSL/TLS certificates
- **[Synology DSM Upgrade](synology-dsm-upgrade.md)** - Safely upgrade NAS firmware

### Security
- **[Credential Rotation](credential-rotation.md)** - Rotate exposed or compromised credentials

## 🎯 How to Use These Runbooks

### Runbook Format
Each runbook follows a standard format:

1. **Overview** - What this procedure accomplishes
2. **Prerequisites** - What you need before starting
3. **Estimated Time** - How long it typically takes
4. **Risk Level** - Low/Medium/High impact assessment
5. **Procedure** - Step-by-step instructions
6. **Verification** - How to confirm success
7. **Rollback** - How to undo if something goes wrong
8. **Troubleshooting** - Common issues and solutions

### When to Use Runbooks
- **Planned Maintenance** - Follow runbooks during scheduled maintenance windows
- **Incident Response** - Use as a quick reference during outages
- **Training** - Onboard new admins with documented procedures
- **Automation** - Use as a basis for creating automated scripts

### Best Practices
- ✅ Always read the entire runbook before starting
- ✅ Have a rollback plan ready
- ✅ Test in development/staging when possible
- ✅ Take snapshots/backups before major changes
- ✅ Document any deviations from the runbook
- ✅ Update runbooks when procedures change

## 🚨 Emergency Procedures

For emergency situations, refer to:

- [Emergency Access Guide](../troubleshooting/EMERGENCY_ACCESS_GUIDE.md)
- [Recovery Guide](../troubleshooting/RECOVERY_GUIDE.md)
- [Disaster Recovery](../troubleshooting/disaster-recovery.md)

## 📋 Runbook Maintenance

### Contributing
When you discover a new procedure or improvement:

1. Create a new runbook using the template below
2. Follow the standard format
3. Include real examples from your infrastructure
4. Test the procedure before documenting

### Runbook Template

````markdown
# [Procedure Name]

## Overview
Brief description of what this accomplishes and when to use it.

## Prerequisites
- [ ] Required access/credentials
- [ ] Required tools/software
- [ ] Required knowledge/skills

## Metadata
- **Estimated Time**: X minutes/hours
- **Risk Level**: Low/Medium/High
- **Requires Downtime**: Yes/No
- **Reversible**: Yes/No
- **Tested On**: Date last tested

## Procedure

### Step 1: [Action]
Detailed instructions...

```bash
# Example commands
```

Expected output:
```
Example of what you should see
```

### Step 2: [Next Action]
Continue...

## Verification
How to confirm the procedure succeeded:
- [ ] Verification step 1
- [ ] Verification step 2

## Rollback Procedure
If something goes wrong:
1. Step to undo changes
2. How to restore previous state

## Troubleshooting
**Issue**: Common problem
**Solution**: How to fix it

## Related Documentation
- [Link to related doc](path)

## Change Log
- YYYY-MM-DD - Initial creation
- YYYY-MM-DD - Updated for new procedure
````

## 📞 Getting Help

If a runbook is unclear or doesn't work as expected:

1. Check the troubleshooting section
2. Refer to related documentation links
3. Review the homelab monitoring dashboards
4. Consult the [Infrastructure Overview](../infrastructure/INFRASTRUCTURE_OVERVIEW.md)

## 📊 Runbook Status

| Runbook | Status | Last Updated | Tested On |
|---------|--------|--------------|-----------|
| Add New Service | ✅ Active | 2026-02-14 | 2026-02-14 |
| Service Migration | ✅ Active | 2026-02-14 | 2026-02-14 |
| Add New User | ✅ Active | 2026-02-14 | 2026-02-14 |
| Disk Full Procedure | ✅ Active | 2026-02-14 | 2026-02-14 |
| Certificate Renewal | ✅ Active | 2026-02-14 | 2026-02-14 |
| Synology DSM Upgrade | ✅ Active | 2026-02-14 | 2026-02-14 |
| Credential Rotation | ✅ Active | 2026-02-20 | — |
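
The status table is easy to let drift. A small helper can list each runbook alongside its last git commit date for comparison (a sketch; it assumes the runbooks live in a git checkout and is run from `docs/runbooks/`):

```shell
# List each runbook with the date of its last git commit.
# Files outside git (or a missing git binary) show as "untracked".

# row NAME DATE: print one padded status line; DATE defaults to "untracked".
row() {
  printf '%-30s %s\n' "$1" "${2:-untracked}"
}

for f in *.md; do
  [ "$f" = "README.md" ] && continue
  row "$f" "$(git log -1 --format=%ad --date=short -- "$f" 2>/dev/null)"
done
```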

---

**Last Updated**: 2026-02-14
docs/runbooks/add-new-service.md (new file, 65 lines)

# Add New Service Runbook

This runbook walks through a **clean, tested path** for adding a new service to the homelab using GitOps with Portainer.

> ⚠️ **Prerequisites**: CI runner access, SSH to target hosts, SSO admin privileges.

## 1. Prepare Compose File

```bash
# Generate a minimal stack template
../scripts/ci/workflows/gen-template.py --service myservice
```

Adjust `docker-compose.yml`:
- Image name
- Ports
- Environment variables
- Health check
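
A minimal adjusted file might look like this (a sketch; the image, port, and health-check values are placeholders to adapt for your service):

```yaml
services:
  myservice:
    image: ghcr.io/example/myservice:1.0   # placeholder image
    ports:
      - "8080:8080"
    environment:
      - TZ=Etc/UTC
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      retries: 3
    restart: unless-stopped
```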

## 2. Validate Configuration

```bash
docker compose -f docker-compose.yml config > /tmp/merged.yml
# Validate against OpenAPI specs if needed
```

## 3. Commit Locally

```bash
git add docker/compose/*.yml
git commit -m "Add myservice stack"
```

## 4. Push to Remote & Trigger GitOps

```bash
git push origin main
```

The Portainer EE GitOps agent will deploy the stack automatically. Monitor it via the Portainer UI or the Portainer API.

## 5. Post-Deployment Verification

| Check | Command | Expected Result |
|-------|---------|-----------------|
| Service Running | `docker ps --filter "name=myservice"` | One container running |
| Health Endpoint | `curl http://localhost:8080/health` | 200 OK |
| Logs | `docker logs myservice` | No fatal errors |
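
The table above can be scripted for repeat use. A minimal sketch (`myservice` and port 8080 are placeholders; adjust for your stack):

```shell
# post_deploy_check.sh - scriptable version of the verification table.

# report LABEL VALUE: print PASS if VALUE is non-empty, FAIL otherwise.
report() {
  if [ -n "$2" ]; then echo "PASS: $1"; else echo "FAIL: $1"; fi
}

service="${1:-myservice}"

# Check 1: exactly the docker ps filter from the table.
running=$(docker ps --filter "name=${service}" --format '{{.Names}}' 2>/dev/null)
report "container running" "$running"

# Check 2: health endpoint should answer 200.
status=$(curl -s -o /dev/null -w '%{http_code}' "http://localhost:8080/health" 2>/dev/null)
if [ "$status" = "200" ]; then
  echo "PASS: health endpoint"
else
  echo "FAIL: health endpoint (${status:-no response})"
fi
```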

## 6. Update Documentation

1. Add an entry to `docs/services/VERIFIED_SERVICE_INVENTORY.md`.
2. Create a quick-start guide in `docs/services/<service>/README.md`.
3. Publish to the shared wiki.

## 7. Optional – Terraform Sync

If the service also needs infrastructure changes (e.g., a new VM), update the Terraform modules under `infra/` and run `terragrunt run-all apply`.

---

**Gotchas**
- *Race conditions*: rebase onto the latest `main` before pushing.
- *Health-check failures*: check Portainer Events.
- *Secrets*: store them in Vault and reference them in the `secrets` section.
docs/runbooks/add-new-user.md (new file, 601 lines)

# Add New User Runbook

## Overview
This runbook provides a comprehensive procedure for onboarding new users to the homelab, including network access, service authentication, and permission management. It ensures users get appropriate access while maintaining security.

## Prerequisites
- [ ] User's full name and email address
- [ ] Desired username (lowercase, no spaces)
- [ ] Access level determined (read-only, standard, admin)
- [ ] Required services identified
- [ ] Admin access to all relevant systems
- [ ] Authentik admin access (for SSO services)
- [ ] Tailscale admin access (for VPN)
- [ ] Synology admin access (for file shares)
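
The username convention (lowercase, no spaces) is easy to enforce with a small check before creating any accounts. A sketch; the exact policy (lowercase letters, digits, hyphens, 3-32 characters) is an assumption to tighten for your systems:

```shell
# valid_username NAME: accept lowercase letters, digits, and hyphens,
# 3-32 characters. Returns non-zero for anything else.
valid_username() {
  case "$1" in
    *[!a-z0-9-]*|"") return 1 ;;   # reject uppercase, spaces, symbols, empty
  esac
  [ ${#1} -ge 3 ] && [ ${#1} -le 32 ]
}
```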

## Metadata
- **Estimated Time**: 30-60 minutes
- **Risk Level**: Low (proper access controls in place)
- **Requires Downtime**: No
- **Reversible**: Yes (can remove user access)
- **Tested On**: 2026-02-14

## User Access Levels

| Level | Description | Typical Use Case | Services |
|-------|-------------|------------------|----------|
| **Guest** | Read-only, limited services | Family, friends | Plex, Jellyfin |
| **Standard** | Read/write, most services | Family members | Media + storage |
| **Power User** | Advanced services | Tech-savvy users | Dev tools, monitoring |
| **Admin** | Full access, can manage | Co-admins, yourself | Everything + admin panels |
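
The level-to-group mapping can be captured in a helper so onboarding scripts stay consistent with this table. A sketch; the group names follow the Authentik conventions used later in this runbook (`homelab-users`, `homelab-admins`, `jellyfin-users`) and should be adjusted to your own directory:

```shell
# groups_for LEVEL: print the directory groups to assign for an access level.
groups_for() {
  case "$1" in
    guest)    echo "jellyfin-users" ;;
    standard) echo "homelab-users jellyfin-users" ;;
    power)    echo "homelab-users jellyfin-users" ;;  # plus dev/monitoring groups
    admin)    echo "homelab-users homelab-admins" ;;
    *)        echo "unknown level: $1" >&2; return 1 ;;
  esac
}
```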

## Pre-Onboarding Checklist

### Step 1: Gather Information

Create a user profile document:

```markdown
# New User: [Name]

**Username**: [username]
**Email**: [email@domain.com]
**Access Level**: [Guest/Standard/Power User/Admin]
**Start Date**: [YYYY-MM-DD]

## Services Requested:
- [ ] Plex/Jellyfin (Media streaming)
- [ ] File Shares (NAS access)
- [ ] Immich (Photo backup)
- [ ] Paperless (Document management)
- [ ] Development tools (Gitea, etc.)
- [ ] Monitoring dashboards
- [ ] Other: ___________

## Access Requirements:
- [ ] Remote access (Tailscale VPN)
- [ ] Local network only
- [ ] Mobile apps
- [ ] Web browser only

## Notes:
[Any special requirements or restrictions]
```
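
Filling the template by hand gets tedious after a few users; a short helper can stamp out the header fields (a sketch; the output filename in the usage line is an assumption):

```shell
# new_profile USERNAME "Full Name" LEVEL: print a pre-filled profile header.
new_profile() {
  cat <<EOF
# New User: $2

**Username**: $1
**Access Level**: $3
**Start Date**: $(date +%F)
EOF
}

# Usage: new_profile alice "Alice Example" Standard > alice-profile.md
```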

### Step 2: Plan Access

Determine which systems need accounts:

- [ ] **Tailscale** (VPN access to homelab)
- [ ] **Authentik** (SSO for web services)
- [ ] **Synology NAS** (File shares - Atlantis/Calypso)
- [ ] **Plex** (Media streaming)
- [ ] **Jellyfin** (Alternative media)
- [ ] **Immich** (Photo management)
- [ ] **Portainer** (Container management - admin only)
- [ ] **Grafana** (Monitoring - admin/power user)
- [ ] **Other services**: ___________

## User Onboarding Procedure

### Step 1: Create Tailscale Access

**Why First**: Tailscale provides secure remote access to the homelab network.

1. **Invite via Tailscale Admin Console**:
   - Go to https://login.tailscale.com/admin/settings/users
   - Click **Invite Users**
   - Enter the user's email
   - Set expiration (optional)
   - Click **Send Invite**

2. **User receives email**:
   - User clicks the invitation link
   - Creates a Tailscale account
   - Installs the Tailscale app on their device(s)
   - Connects to your tailnet

3. **Configure ACLs** (if needed):

   ```json
   // In Tailscale Admin Console → Access Controls
   {
     "acls": [
       // Existing ACLs...
       {
         "action": "accept",
         "src": ["user@email.com"],
         "dst": [
           "atlantis:*",   // Allow access to Atlantis
           "calypso:*",    // Allow access to Calypso
           "homelab-vm:*"  // Allow access to VM
         ]
       }
     ]
   }
   ```

4. **Test connectivity**:

   ```bash
   # Ask user to test
   ping atlantis.your-tailnet.ts.net
   curl http://atlantis.your-tailnet.ts.net:9000  # Test Portainer
   ```
### Step 2: Create Authentik Account (SSO)

**Purpose**: Single sign-on for most web services.

1. **Access Authentik Admin**:
   - Navigate to your Authentik instance
   - Log in as admin

2. **Create User**:
   - **Directory** → **Users** → **Create**
   - Fill in:
     - **Username**: `username` (lowercase)
     - **Name**: `First Last`
     - **Email**: `user@email.com`
   - **Groups**: Add to appropriate groups
     - `homelab-users` (standard access)
     - `homelab-admins` (for admin users)
     - Service-specific groups (e.g., `jellyfin-users`)

3. **Set Password**:
   - Option A: Set a temporary password, force change on first login
   - Option B: Send a password reset link via email

4. **Assign Service Access**:
   - **Applications** → **Outposts**
   - For each service the user should access:
     - Edit the application
     - Add the user/group to **Policy Bindings**

5. **Test SSO**:

   ```bash
   # User should test login to SSO-enabled services
   # Example: Grafana, Jellyseerr, etc.
   ```
### Step 3: Create Synology NAS Account

**Purpose**: Access to file shares, Photos, Drive, etc.

#### On Atlantis (Primary NAS):

```bash
# SSH to Atlantis (only needed for the command-line method below)
ssh admin@atlantis
```

To create the user via the DSM UI (recommended, DSM 7.x):

1. **Control Panel** → **User & Group** → **User** → **Create**
2. Fill in:
   - **Name**: `username`
   - **Description**: `[Full Name]`
   - **Email**: `user@email.com`
   - **Password**: set a strong password
3. **Join Groups**:
   - `users` (default)
   - `http` (if web service access is needed)
4. **Configure Permissions**:
   - **Applications** tab:
     - [ ] Synology Photos (if needed)
     - [ ] Synology Drive (if needed)
     - [ ] File Station
     - [ ] Other apps as needed
   - **Shared Folders** tab:
     - Set permissions for each share:
       - Read/Write: for shares the user can modify
       - Read-only: for media libraries
       - No access: for restricted folders
5. **User Quotas** (optional):
   - Set a storage quota if needed
   - Limit upload/download speed if needed
6. Click **Create**

#### On Calypso (Secondary NAS):

Repeat the same process if the user needs access to Calypso.

**Alternative: SSH Method**:

```bash
# Create user via command line
# (argument order varies between DSM versions; check `synouser --help`
# and `synogroup --help` before running)
sudo synouser --add username "Full Name" "password" "user@email.com" 0 "" 0

# Add to groups
sudo synogroup --member users username add

# Set folder permissions (example)
sudo chown -R username:users /volume1/homes/username
```
### Step 4: Create Plex Account

**Option A: Managed User (Recommended for Family)**

1. Open Plex Web
2. **Settings** → **Users & Sharing** → **Manage Home Users**
3. Click **Add User**
4. Set:
   - **Username**: `[Name]`
   - **PIN**: 4-digit PIN
   - Enable **Managed user** if restricted access is desired
5. Configure library access

**Option B: Plex Account (For External Users)**

1. User creates their own Plex account
2. **Settings** → **Users & Sharing** → **Friends**
3. Invite by email
4. Select libraries to share
5. Configure restrictions:
   - [ ] Allow sync
   - [ ] Allow camera upload
   - [ ] Rating restrictions (if children)

### Step 5: Create Jellyfin Account

```bash
# SSH to the host running Jellyfin (only needed for host-level work)
ssh atlantis  # or wherever Jellyfin runs
```

Account creation itself happens in the web UI:

1. Open the Jellyfin web interface
2. **Dashboard** → **Users** → **Add User**
3. Set:
   - **Name**: `username`
   - **Password**: set an initial password
4. Configure:
   - **Library access**: select which libraries
   - **Permissions**:
     - [ ] Allow media deletion
     - [ ] Allow remote access
     - [ ] Enable live TV (if applicable)
5. **Save**
### Step 6: Create Immich Account (If Used)

Accounts are created via the Immich web interface:

1. Open Immich
2. **Administration** → **Users** → **Create User**
3. Set:
   - **Email**: `user@email.com`
   - **Password**: set an initial password
   - **Name**: `Full Name`
4. User logs in and sets up the mobile app

### Step 7: Grant Service-Specific Access

#### Gitea (Development)

1. Open the Gitea web interface
2. **Site Administration** → **User Accounts** → **Create User Account**
3. Fill in details
4. Add to appropriate organizations/teams

#### Portainer (Admin/Power Users Only)

1. Open the Portainer web interface
2. **Users** → **Add user**
3. Set:
   - **Username**: `username`
   - **Password**: set an initial password
4. Assign role:
   - **Administrator**: full access
   - **Operator**: can manage containers
   - **User**: read-only
5. Assign to teams/endpoints

#### Grafana (Monitoring)

If using Authentik SSO, the user automatically gets access.

If not using SSO:

1. Open the Grafana web interface
2. **Configuration** → **Users** → **Invite**
3. Set role:
   - **Viewer**: read-only dashboards
   - **Editor**: can create dashboards
   - **Admin**: full access
### Step 8: Configure Mobile Apps

Provide the user with setup instructions:

**Plex**:
- Download the Plex app
- Sign in with the Plex account
- Server should auto-discover via Tailscale

**Jellyfin**:
- Download the Jellyfin app
- Add server: `http://atlantis.tailnet:8096`
- Sign in with credentials

**Immich** (if used):
- Download the Immich app
- Server: `http://atlantis.tailnet:2283`
- Enable auto-backup (optional)

**Synology Apps**:
- DS File (file access)
- Synology Photos
- DS Audio/Video
- Server: `atlantis.tailnet` or QuickConnect ID

**Tailscale**:
- Already installed in Step 1
- Ensure "Always On VPN" is enabled for seamless access
## User Documentation Package

Provide the new user with documentation:

```markdown
# Welcome to the Homelab!

Hi [Name],

Your access has been set up. Here's what you need to know:

## Network Access

**Tailscale VPN**:
- Install Tailscale from: https://tailscale.com/download
- Log in with your account (check email for invitation)
- Connect to our tailnet
- You can now access services remotely!

## Available Services

### Media Streaming
- **Plex**: https://plex.vish.gg
  - Username: [plex-username]
  - Watch movies, TV shows, music
- **Jellyfin**: https://jellyfin.vish.gg
  - Username: [username]
  - Alternative media server

### File Storage
- **Atlantis NAS**: smb://atlantis.tailnet/[your-folder]
  - Access via file explorer
  - Windows: \\atlantis.tailnet\folder
  - Mac: smb://atlantis.tailnet/folder

### Photos
- **Immich**: https://immich.vish.gg
  - Auto-backup from your phone
  - Private photo storage

### Other Services
- [List other services user has access to]

## Support

If you need help:
- Email: [your-email]
- [Alternative contact method]

## Security

- Don't share passwords
- Enable 2FA where available
- Report any suspicious activity

Welcome aboard!
```
## Post-Onboarding Tasks

### Step 1: Update Documentation

```bash
cd ~/Documents/repos/homelab

# Update user access documentation
nano docs/infrastructure/USER_ACCESS_GUIDE.md

# Add user to list:
# | Username | Access Level | Services | Status |
# | username | Standard | Plex, Files, Photos | ✅ Active |

git add .
git commit -m "Add new user: [username]"
git push
```

### Step 2: Test User Access

Verify everything works:

- [ ] User can connect via Tailscale
- [ ] User can access Plex/Jellyfin
- [ ] User can access file shares
- [ ] SSO login works
- [ ] Mobile apps working
- [ ] No access to restricted services

### Step 3: Monitor Usage

```bash
# Check user activity after a few days
# Grafana dashboards should show:
# - Network traffic from user's IP
# - Service access logs
# - Any errors

# Review logs
ssh atlantis
grep username /var/log/auth.log   # SSH attempts
docker logs plex | grep username  # Plex usage
```

## Verification Checklist

- [ ] Tailscale invitation sent and accepted
- [ ] Authentik account created and tested
- [ ] Synology NAS account created (Atlantis/Calypso)
- [ ] Plex/Jellyfin access granted
- [ ] Required service accounts created
- [ ] Mobile apps configured and tested
- [ ] User documentation sent
- [ ] User confirmed access is working
- [ ] Documentation updated
- [ ] No access to restricted services
## User Removal Procedure

When a user no longer needs access:

### Step 1: Disable Accounts

```bash
# Disable in order of security priority:

# 1. Tailscale
#    Admin Console → Users → [user] → Revoke keys

# 2. Authentik
#    Directory → Users → [user] → Deactivate

# 3. Synology NAS
#    Control Panel → User & Group → [user] → Disable
#    Or via SSH:
sudo synouser --disable username

# 4. Plex
#    Settings → Users & Sharing → Remove user

# 5. Jellyfin
#    Dashboard → Users → [user] → Delete

# 6. Other services
#    Remove from each service individually
```

### Step 2: Archive User Data (Optional)

```bash
# Backup user's data before deleting
# Synology home folder:
tar czf /volume1/backups/user-archives/username-$(date +%Y%m%d).tar.gz \
  /volume1/homes/username

# User's Immich photos (if applicable)
# User's documents (if applicable)
```

### Step 3: Delete User

After confirming the data is backed up:

```bash
# Synology: Delete user
# Control Panel → User & Group → [user] → Delete
# Choose whether to keep or delete user's data

# Or via SSH:
sudo synouser --del username
sudo rm -rf /volume1/homes/username  # If deleting data
```

### Step 4: Update Documentation

```bash
# Update user access guide
nano docs/infrastructure/USER_ACCESS_GUIDE.md
# Mark user as removed with date

git add .
git commit -m "Remove user: [username] - access terminated [date]"
git push
```
## Troubleshooting

### Issue: User Can't Connect via Tailscale

**Solutions**:
- Verify the invitation was accepted
- Check that the user installed Tailscale correctly
- Verify ACLs allow the user's device
- Check the user's device firewall
- Try: `tailscale ping atlantis`

### Issue: SSO Login Not Working

**Solutions**:
- Verify the Authentik account is active
- Check the user is in the correct groups
- Verify the application is assigned to the user
- Clear browser cookies
- Try incognito mode
- Check the Authentik logs

### Issue: Can't Access File Shares

**Solutions**:

```bash
# Check Synology user exists and is enabled
ssh atlantis
sudo synouser --get username

# Check folder permissions
ls -la /volume1/homes/username

# Check SMB service is running
sudo synoservicectl --status smbd

# Test from user's machine:
smbclient -L atlantis.tailnet -U username
```

### Issue: Plex Not Showing Up for User

**Solutions**:
- Verify the user accepted the Plex sharing invitation
- Check library access permissions
- Verify the user's account email is correct
- Try removing and re-adding the user
- Check Plex server accessibility
## Best Practices

### Security
- Use strong passwords (12+ characters, mixed case, numbers, symbols)
- Enable 2FA where available (Authentik supports it)
- Apply the least-privilege principle (only grant needed access)
- Run regular access reviews (quarterly)
- Disable accounts promptly when no longer needed

### Documentation
- Keep the user list up to date
- Document special access grants
- Note user role changes
- Archive user data before deletion

### Communication
- Set clear expectations with users
- Provide good documentation
- Be responsive to access issues
- Notify users of maintenance windows

## Related Documentation

- [Infrastructure Overview](../infrastructure/INFRASTRUCTURE_OVERVIEW.md)
- [User Access Guide](../infrastructure/USER_ACCESS_GUIDE.md)
- [SSH Access Guide](../infrastructure/SSH_ACCESS_GUIDE.md)
- [Authentik SSO Setup](../infrastructure/authentik-sso.md)
- [Security Guidelines](../infrastructure/security.md)

## Change Log

- 2026-02-14 - Initial creation
- 2026-02-14 - Added comprehensive onboarding and offboarding procedures
docs/runbooks/certificate-renewal.md (new file, 570 lines)

# SSL/TLS Certificate Renewal Runbook

## Overview
This runbook covers SSL/TLS certificate management across the homelab, including Let's Encrypt certificates, Cloudflare Origin certificates, and self-signed certificates. It provides procedures for manual renewal, troubleshooting auto-renewal, and emergency certificate fixes.

## Prerequisites
- [ ] SSH access to relevant hosts
- [ ] Cloudflare account access (if using Cloudflare)
- [ ] Domain DNS control
- [ ] Root/sudo privileges on hosts
- [ ] Backup of current certificates

## Metadata
- **Estimated Time**: 15-45 minutes
- **Risk Level**: Medium (service downtime if misconfigured)
- **Requires Downtime**: Minimal (a few seconds during reload)
- **Reversible**: Yes (can restore old certificates)
- **Tested On**: 2026-02-14

## Certificate Types in Homelab

| Type | Used For | Renewal Method | Expiration |
|------|----------|----------------|------------|
| **Let's Encrypt** | Public-facing services | Certbot auto-renewal | 90 days |
| **Cloudflare Origin** | Services behind Cloudflare Tunnel | Manual/Cloudflare dashboard | 15 years |
| **Synology Certificates** | Synology DSM, services | Synology DSM auto-renewal | 90 days |
| **Self-Signed** | Internal/dev services | Manual generation | As configured |

## Certificate Inventory

Document your current certificates:

```bash
# Check Let's Encrypt certificates (on Linux hosts)
sudo certbot certificates

# Check Synology certificates
# DSM UI → Control Panel → Security → Certificate
# Or SSH:
sudo cat /usr/syno/etc/certificate/_archive/*/cert.pem | openssl x509 -text -noout

# Check certificate expiration for any domain
echo | openssl s_client -servername service.vish.gg -connect service.vish.gg:443 2>/dev/null | openssl x509 -noout -dates

# Check all certificates at once
for domain in st.vish.gg gf.vish.gg mx.vish.gg; do
  echo "=== $domain ==="
  echo | timeout 5 openssl s_client -servername $domain -connect $domain:443 2>/dev/null | openssl x509 -noout -dates
  echo
done
```

Create an inventory:

```markdown
| Domain | Type | Expiry Date | Auto-Renew | Status |
|--------|------|-------------|------------|--------|
| vish.gg | Let's Encrypt | 2026-05-15 | ✅ Yes | ✅ Valid |
| st.vish.gg | Let's Encrypt | 2026-05-15 | ✅ Yes | ✅ Valid |
| gf.vish.gg | Let's Encrypt | 2026-05-15 | ✅ Yes | ✅ Valid |
```
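
The `-dates` output above can be turned into a days-remaining figure for the inventory or for alerting. A sketch (requires GNU `date` for `-d`):

```shell
# days_left "notAfter=May 30 12:00:00 2026 GMT": days until that expiry,
# relative to now. Negative means already expired.
days_left() {
  exp=$(date -d "${1#notAfter=}" +%s) || return 1
  now=$(date +%s)
  echo $(( (exp - now) / 86400 ))
}

# Usage with a live endpoint:
# days_left "$(echo | openssl s_client -servername st.vish.gg \
#   -connect st.vish.gg:443 2>/dev/null | openssl x509 -noout -enddate)"
```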

## Let's Encrypt Certificate Renewal

### Automatic Renewal (Certbot)

Let's Encrypt certificates should auto-renew. Check the renewal setup:

```bash
# Check certbot timer status (systemd)
sudo systemctl status certbot.timer

# Check cron job (if using cron)
sudo crontab -l | grep certbot

# Test renewal (dry-run, doesn't actually renew)
sudo certbot renew --dry-run

# Expected output:
# Congratulations, all simulated renewals succeeded
```

### Manual Renewal

If auto-renewal fails or you need to renew manually:

```bash
# Renew all certificates
sudo certbot renew

# Renew a specific certificate
sudo certbot renew --cert-name vish.gg

# Force renewal (even if not expired)
sudo certbot renew --force-renewal

# Renew with verbose output for troubleshooting
sudo certbot renew --verbose
```

After renewal, reload web servers:

```bash
# Nginx
sudo nginx -t  # Test configuration
sudo systemctl reload nginx

# Apache
sudo apachectl configtest
sudo systemctl reload apache2
```
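
The renew-then-reload sequence can also be wired into certbot itself with `--deploy-hook`, which runs a command only when a certificate was actually renewed. A sketch; confirm the hook command matches your web server before relying on it:

```shell
# One-time: run renew with a deploy hook. Certbot saves the hook in the
# certificate's renewal configuration, so future automatic renewals
# reload nginx without a separate step.
# (Commented out here because it needs root and a live certbot install.)
# sudo certbot renew --deploy-hook 'systemctl reload nginx'

# The command that would run:
cmd="certbot renew --deploy-hook 'systemctl reload nginx'"
echo "$cmd"
```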

### Let's Encrypt with Nginx Proxy Manager

If using Nginx Proxy Manager (NPM):

1. Open the NPM UI (typically port 81)
2. Go to the **SSL Certificates** tab
3. Certificates should auto-renew 30 days before expiry
4. To force renewal:
   - Click the certificate
   - Click the **Renew** button
5. No service reload needed (NPM handles it)
## Synology Certificate Renewal

### Automatic Renewal on Synology NAS

```bash
# SSH to Synology NAS (Atlantis or Calypso)
ssh atlantis  # or calypso

# Check certificate status
sudo /usr/syno/sbin/syno-letsencrypt list

# Force renewal check
sudo /usr/syno/sbin/syno-letsencrypt renew-all

# Check renewal logs
sudo cat /var/log/letsencrypt/letsencrypt.log

# Verify certificate expiry
sudo openssl x509 -in /usr/syno/etc/certificate/system/default/cert.pem -text -noout | grep "Not After"
```

### Via Synology DSM UI

1. Log in to DSM
2. **Control Panel** → **Security** → **Certificate**
3. Select the certificate → click **Renew**
4. DSM will automatically renew and apply
5. No manual reload needed

### Synology Certificate Configuration

Enable auto-renewal in DSM:

1. **Control Panel** → **Security** → **Certificate**
2. Click the **Settings** button
3. Check **Auto-renew certificate**
4. Synology will renew 30 days before expiry
## Stoatchat Certificates (Gaming VPS)
|
||||
|
||||
The Stoatchat gaming server uses Let's Encrypt with Certbot:
|
||||
|
||||
```bash
|
||||
# SSH to gaming VPS
|
||||
ssh root@gaming-vps
|
||||
|
||||
# Check certificates
|
||||
sudo certbot certificates
|
||||
|
||||
# Domains covered:
|
||||
# - st.vish.gg
|
||||
# - api.st.vish.gg
|
||||
# - events.st.vish.gg
|
||||
# - files.st.vish.gg
|
||||
# - proxy.st.vish.gg
|
||||
# - voice.st.vish.gg
|
||||
|
||||
# Renew all
|
||||
sudo certbot renew
|
||||
|
||||
# Reload Nginx
|
||||
sudo systemctl reload nginx
|
||||
```
|
||||
|
||||
Auto-renewal cron:
|
||||
```bash
|
||||
# Check certbot timer
|
||||
sudo systemctl status certbot.timer
|
||||
|
||||
# Or check cron
|
||||
sudo crontab -l | grep certbot
|
||||
```
|
||||
|
||||
## Cloudflare Origin Certificates
|
||||
|
||||
For services using Cloudflare Tunnel:
|
||||
|
||||
### Generate New Origin Certificate
|
||||
|
||||
1. Log in to Cloudflare Dashboard
|
||||
2. Select domain (vish.gg)
|
||||
3. **SSL/TLS** → **Origin Server**
|
||||
4. Click **Create Certificate**
|
||||
5. Configure:
|
||||
- **Private key type**: RSA (2048)
|
||||
- **Hostnames**: *.vish.gg, vish.gg
|
||||
- **Certificate validity**: 15 years
|
||||
6. Copy certificate and private key
|
||||
7. Save to secure location
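Before installing, it is worth sanity-checking what you downloaded (`origin-cert.pem` here is whatever filename you saved the PEM as):

```bash
# Inspect subject, issuer, and expiry of the saved origin certificate
openssl x509 -in origin-cert.pem -noout -subject -issuer -enddate
```

The issuer should show Cloudflare's origin CA and the expiry roughly 15 years out.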
|
||||
|
||||
### Install Origin Certificate
|
||||
|
||||
```bash
|
||||
# SSH to target host
|
||||
ssh [host]
|
||||
|
||||
# Create certificate files
|
||||
sudo nano /etc/ssl/cloudflare/cert.pem
|
||||
# Paste certificate
|
||||
|
||||
sudo nano /etc/ssl/cloudflare/key.pem
|
||||
# Paste private key
|
||||
|
||||
# Set permissions
|
||||
sudo chmod 644 /etc/ssl/cloudflare/cert.pem
|
||||
sudo chmod 600 /etc/ssl/cloudflare/key.pem
|
||||
|
||||
# Update Nginx configuration
|
||||
sudo nano /etc/nginx/sites-available/[service]
|
||||
|
||||
# Use new certificate
|
||||
ssl_certificate /etc/ssl/cloudflare/cert.pem;
|
||||
ssl_certificate_key /etc/ssl/cloudflare/key.pem;
|
||||
|
||||
# Test and reload
|
||||
sudo nginx -t
|
||||
sudo systemctl reload nginx
|
||||
```
|
||||
|
||||
## Self-Signed Certificates (Internal/Dev)
|
||||
|
||||
For internal-only services not exposed publicly:
|
||||
|
||||
### Generate Self-Signed Certificate
|
||||
|
||||
```bash
|
||||
# Generate 10-year self-signed certificate
|
||||
sudo openssl req -x509 -nodes -days 3650 -newkey rsa:2048 \
|
||||
-keyout /etc/ssl/private/selfsigned.key \
|
||||
-out /etc/ssl/certs/selfsigned.crt \
|
||||
-subj "/C=US/ST=State/L=City/O=Homelab/CN=internal.vish.local"
|
||||
|
||||
# Generate with SAN (Subject Alternative Names) for multiple domains
|
||||
sudo openssl req -x509 -nodes -days 3650 -newkey rsa:2048 \
|
||||
-keyout /etc/ssl/private/selfsigned.key \
|
||||
-out /etc/ssl/certs/selfsigned.crt \
|
||||
-subj "/C=US/ST=State/L=City/O=Homelab/CN=*.vish.local" \
|
||||
-addext "subjectAltName=DNS:*.vish.local,DNS:vish.local"
|
||||
|
||||
# Set permissions
|
||||
sudo chmod 600 /etc/ssl/private/selfsigned.key
|
||||
sudo chmod 644 /etc/ssl/certs/selfsigned.crt
|
||||
```
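To confirm the SAN entries actually made it into the certificate:

```bash
# List the Subject Alternative Names embedded in the certificate
openssl x509 -in /etc/ssl/certs/selfsigned.crt -noout -text \
  | grep -A1 "Subject Alternative Name"
```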
|
||||
|
||||
### Install in Services
|
||||
|
||||
Update Docker Compose to mount certificates:
|
||||
|
||||
```yaml
|
||||
services:
|
||||
service:
|
||||
volumes:
|
||||
- /etc/ssl/certs/selfsigned.crt:/etc/ssl/certs/cert.pem:ro
|
||||
- /etc/ssl/private/selfsigned.key:/etc/ssl/private/key.pem:ro
|
||||
```
|
||||
|
||||
## Monitoring Certificate Expiration
|
||||
|
||||
### Set Up Expiration Alerts
|
||||
|
||||
Create a certificate monitoring script:
|
||||
|
||||
```bash
|
||||
sudo nano /usr/local/bin/check-certificates.sh
|
||||
```
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# Certificate Expiration Monitoring Script
|
||||
|
||||
DOMAINS=(
|
||||
"vish.gg"
|
||||
"st.vish.gg"
|
||||
"gf.vish.gg"
|
||||
"mx.vish.gg"
|
||||
)
|
||||
|
||||
ALERT_DAYS=30 # Alert if expiring within 30 days
|
||||
WEBHOOK_URL="https://ntfy.sh/REDACTED_TOPIC" # Your notification webhook
|
||||
|
||||
for domain in "${DOMAINS[@]}"; do
|
||||
echo "Checking $domain..."
|
||||
|
||||
# Get certificate expiration date
|
||||
expiry=$(echo | openssl s_client -servername $domain -connect $domain:443 2>/dev/null | \
|
||||
openssl x509 -noout -dates | grep "notAfter" | cut -d= -f2)
|
||||
|
||||
# Convert to epoch time
|
||||
expiry_epoch=$(date -d "$expiry" +%s)
|
||||
current_epoch=$(date +%s)
|
||||
days_left=$(( ($expiry_epoch - $current_epoch) / 86400 ))
|
||||
|
||||
echo "$domain expires in $days_left days"
|
||||
|
||||
if [ $days_left -lt $ALERT_DAYS ]; then
|
||||
# Send alert
|
||||
curl -H "Title: Certificate Expiring Soon" \
|
||||
-H "Priority: high" \
|
||||
-H "Tags: warning,certificate" \
|
||||
-d "Certificate for $domain expires in $days_left days!" \
|
||||
$WEBHOOK_URL
|
||||
|
||||
echo "⚠️ Alert sent for $domain"
|
||||
fi
|
||||
echo
|
||||
done
|
||||
```
|
||||
|
||||
Make executable and add to cron:
|
||||
```bash
|
||||
sudo chmod +x /usr/local/bin/check-certificates.sh
|
||||
|
||||
# Add to cron (daily at 9 AM)
|
||||
(crontab -l 2>/dev/null; echo "0 9 * * * /usr/local/bin/check-certificates.sh") | crontab -
|
||||
```
|
||||
|
||||
### Grafana Dashboard
|
||||
|
||||
Add certificate monitoring to Grafana:
|
||||
|
||||
```yaml
|
||||
# Install blackbox_exporter for HTTPS probing
|
||||
# Add to prometheus.yml:
|
||||
|
||||
scrape_configs:
|
||||
- job_name: 'blackbox'
|
||||
metrics_path: /probe
|
||||
params:
|
||||
module: [http_2xx]
|
||||
static_configs:
|
||||
- targets:
|
||||
- https://vish.gg
|
||||
- https://st.vish.gg
|
||||
- https://gf.vish.gg
|
||||
relabel_configs:
|
||||
- source_labels: [__address__]
|
||||
target_label: __param_target
|
||||
- source_labels: [__param_target]
|
||||
target_label: instance
|
||||
- target_label: __address__
|
||||
replacement: blackbox-exporter:9115
|
||||
|
||||
# Create alert rule:
|
||||
- alert: SSLCertificateExpiring
|
||||
expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
|
||||
labels:
|
||||
severity: warning
|
||||
annotations:
|
||||
summary: "SSL certificate expiring soon"
|
||||
      description: "SSL certificate for {{ $labels.instance }} expires in {{ $value | humanizeDuration }}"
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Issue: Certbot Renewal Failing
|
||||
|
||||
**Symptoms**: `certbot renew` fails with DNS or HTTP challenge errors
|
||||
|
||||
**Solutions**:
|
||||
|
||||
```bash
|
||||
# Check detailed error logs
|
||||
sudo certbot renew --verbose
|
||||
|
||||
# Common issues:
|
||||
|
||||
# 1. Port 80/443 not accessible
|
||||
sudo ufw status # Check firewall
|
||||
sudo netstat -tlnp | grep :80 # Check if port is listening
|
||||
|
||||
# 2. DNS not resolving correctly
|
||||
dig vish.gg # Verify DNS points to correct IP
|
||||
|
||||
# 3. Rate limits hit
|
||||
# Let's Encrypt rate limit: 50 certificates per registered domain per week
|
||||
# Wait 7 days or use --staging for testing
|
||||
|
||||
# 4. Webroot path incorrect
|
||||
sudo certbot renew --webroot -w /var/www/html
|
||||
|
||||
# 5. Try force renewal with different challenge
|
||||
sudo certbot renew --force-renewal --preferred-challenges dns
|
||||
```
|
||||
|
||||
### Issue: Certificate Valid But Browser Shows Warning
|
||||
|
||||
**Symptoms**: Certificate is valid but browsers show security warning
|
||||
|
||||
**Solutions**:
|
||||
|
||||
```bash
|
||||
# Check certificate chain
|
||||
openssl s_client -connect vish.gg:443 -showcerts
|
||||
|
||||
# Ensure intermediate certificates are included
|
||||
# Nginx: Use fullchain.pem, not cert.pem
|
||||
ssl_certificate /etc/letsencrypt/live/vish.gg/fullchain.pem;
|
||||
ssl_certificate_key /etc/letsencrypt/live/vish.gg/privkey.pem;
|
||||
|
||||
# Test SSL configuration
|
||||
curl -I https://vish.gg
|
||||
# Or use: https://www.ssllabs.com/ssltest/
|
||||
```
|
||||
|
||||
### Issue: Synology Certificate Not Auto-Renewing
|
||||
|
||||
**Symptoms**: DSM certificate expired or shows renewal error
|
||||
|
||||
**Solutions**:
|
||||
|
||||
```bash
|
||||
# SSH to Synology
|
||||
ssh atlantis
|
||||
|
||||
# Check renewal logs
|
||||
sudo cat /var/log/letsencrypt/letsencrypt.log
|
||||
|
||||
# Common issues:
|
||||
|
||||
# 1. Port 80 forwarding
|
||||
# Ensure port 80 is forwarded to NAS during renewal
|
||||
|
||||
# 2. Domain validation
|
||||
# Check DNS points to correct external IP
|
||||
|
||||
# 3. Force renewal
|
||||
sudo /usr/syno/sbin/syno-letsencrypt renew-all
|
||||
|
||||
# 4. Restart certificate service
|
||||
sudo synosystemctl restart nginx
|
||||
```
|
||||
|
||||
### Issue: Nginx Won't Reload After Certificate Update
|
||||
|
||||
**Symptoms**: `nginx -t` shows SSL errors
|
||||
|
||||
**Solutions**:
|
||||
|
||||
```bash
|
||||
# Test Nginx configuration
|
||||
sudo nginx -t
|
||||
|
||||
# Common errors:
|
||||
|
||||
# 1. Certificate path incorrect
|
||||
# Fix: Update nginx config with correct path
|
||||
|
||||
# 2. Certificate and key mismatch
|
||||
# Verify:
|
||||
sudo openssl x509 -noout -modulus -in cert.pem | openssl md5
|
||||
sudo openssl rsa -noout -modulus -in key.pem | openssl md5
|
||||
# MD5 sums should match
|
||||
|
||||
# 3. Permission issues
|
||||
sudo chmod 644 /etc/ssl/certs/cert.pem
|
||||
sudo chmod 600 /etc/ssl/private/key.pem
|
||||
sudo chown root:root /etc/ssl/certs/cert.pem /etc/ssl/private/key.pem
|
||||
|
||||
# 4. SELinux blocking (if enabled)
|
||||
sudo setsebool -P httpd_read_user_content 1
|
||||
```
|
||||
|
||||
## Emergency Certificate Fix
|
||||
|
||||
If a certificate expires and services are down:
|
||||
|
||||
### Quick Fix: Use Self-Signed Temporarily
|
||||
|
||||
```bash
|
||||
# Generate emergency self-signed certificate
|
||||
sudo openssl req -x509 -nodes -days 30 -newkey rsa:2048 \
|
||||
-keyout /tmp/emergency.key \
|
||||
-out /tmp/emergency.crt \
|
||||
-subj "/CN=*.vish.gg"
|
||||
|
||||
# Update Nginx to use emergency cert
|
||||
sudo nano /etc/nginx/sites-available/default
|
||||
|
||||
ssl_certificate /tmp/emergency.crt;
|
||||
ssl_certificate_key /tmp/emergency.key;
|
||||
|
||||
# Reload Nginx
|
||||
sudo nginx -t && sudo systemctl reload nginx
|
||||
|
||||
# Services are now accessible (with browser warning)
|
||||
# Then fix proper certificate renewal
|
||||
```
|
||||
|
||||
### Restore from Backup
|
||||
|
||||
```bash
|
||||
# If certificates were backed up
|
||||
sudo cp /backup/letsencrypt/archive/vish.gg/* /etc/letsencrypt/archive/vish.gg/
|
||||
|
||||
# Update symlinks
|
||||
sudo certbot certificates # Shows current status
|
||||
sudo certbot install --cert-name vish.gg
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Renewal Schedule
|
||||
- Let's Encrypt certificates are valid for 90 days; certbot renews them once they are 60 days old (30 days before expiry)
|
||||
- Check certificates monthly
|
||||
- Set up expiration alerts
|
||||
- Test renewal process quarterly
|
||||
|
||||
### Backup Certificates
|
||||
```bash
|
||||
# Backup Let's Encrypt certificates
|
||||
sudo tar czf ~/letsencrypt-backup-$(date +%Y%m%d).tar.gz /etc/letsencrypt/
|
||||
|
||||
# Backup Synology certificates
|
||||
# Done via Synology backup tasks
|
||||
|
||||
# Store backups securely (encrypted, off-site)
|
||||
```
|
||||
|
||||
### Documentation
|
||||
- Document which certificates are used where
|
||||
- Keep inventory of expiration dates
|
||||
- Document renewal procedures
|
||||
- Note any special configurations
|
||||
|
||||
## Verification Checklist
|
||||
|
||||
After certificate renewal:
|
||||
|
||||
- [ ] Certificate renewed successfully
|
||||
- [ ] Certificate expiry date extended
|
||||
- [ ] Web servers reloaded without errors
|
||||
- [ ] All services accessible via HTTPS
|
||||
- [ ] No browser security warnings
|
||||
- [ ] Certificate chain complete
|
||||
- [ ] Auto-renewal still enabled
|
||||
- [ ] Monitoring updated (if needed)
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- [Infrastructure Overview](../infrastructure/INFRASTRUCTURE_OVERVIEW.md)
|
||||
- [Nginx Configuration](../infrastructure/networking.md)
|
||||
- [Cloudflare Tunnels Setup](../infrastructure/cloudflare-tunnels-setup.md)
|
||||
- [Emergency Access Guide](../troubleshooting/EMERGENCY_ACCESS_GUIDE.md)
|
||||
|
||||
## Change Log
|
||||
|
||||
- 2026-02-14 - Initial creation
|
||||
- 2026-02-14 - Added monitoring and troubleshooting sections
|
||||
661
docs/runbooks/credential-rotation.md
Normal file
|
||||
# Credential Rotation Runbook
|
||||
|
||||
## Overview
|
||||
|
||||
Step-by-step rotation procedures for all credentials exposed in the
|
||||
`homelab-optimized` public mirror (audited 2026-02-20). Work through each
|
||||
section in priority order. After updating secrets in compose files, commit
|
||||
and push — GitOps will redeploy automatically.
|
||||
|
||||
> **Note:** Almost all of these stem from the same root cause — secrets were
|
||||
> hard-coded in compose files, then those files were committed to git, then
|
||||
> `generate_service_docs.py` and wiki-upload scripts duplicated those secrets
|
||||
> into documentation, creating 3–5× copies of every secret across the repo.
|
||||
> See the "Going Forward" section for how to prevent this.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- [ ] SSH / Tailscale access to Atlantis, Calypso, Homelab VM, Seattle VM, matrix-ubuntu-vm
|
||||
- [ ] Gitea admin access (`git.vish.gg`)
|
||||
- [ ] Authentik admin access
|
||||
- [ ] Google account access (Gmail app passwords)
|
||||
- [ ] Cloudflare dashboard access
|
||||
- [ ] OpenAI platform access
|
||||
- [ ] Write access to this repository
|
||||
|
||||
## Metadata
|
||||
|
||||
- **Estimated Time**: 4–6 hours
|
||||
- **Risk Level**: Medium (service restarts required for most items)
|
||||
- **Requires Downtime**: Brief per-service restart only
|
||||
- **Reversible**: Yes (old values can be restored if something breaks)
|
||||
- **Last Updated**: 2026-02-20
|
||||
|
||||
---
|
||||
|
||||
## Priority 1 — Rotate Immediately (Externally Usable Tokens)
|
||||
|
||||
### 1. Gitea API Tokens
|
||||
|
||||
Two tokens hard-coded across scripts and docs.
|
||||
|
||||
#### 1a. Wiki/scripts token (`77e3ddaf...`)
|
||||
|
||||
**Files to update:**
|
||||
- `scripts/cleanup-gitea-wiki.sh`
|
||||
- `scripts/upload-all-docs-to-gitea-wiki.sh`
|
||||
- `scripts/upload-to-gitea-wiki.sh`
|
||||
- `scripts/create-clean-organized-wiki.sh`
|
||||
- `scripts/upload-organized-wiki.sh`
|
||||
- `docs/admin/DOCUMENTATION_MAINTENANCE_GUIDE.md`
|
||||
|
||||
```bash
|
||||
# 1. Go to https://git.vish.gg/user/settings/applications
|
||||
# 2. Revoke the token starting 77e3ddaf
|
||||
# 3. Generate new token, name: homelab-wiki, scope: repo
|
||||
# 4. Replace in all files:
|
||||
NEW_TOKEN=REDACTED_TOKEN
|
||||
for f in scripts/cleanup-gitea-wiki.sh \
|
||||
scripts/upload-all-docs-to-gitea-wiki.sh \
|
||||
scripts/upload-to-gitea-wiki.sh \
|
||||
scripts/create-clean-organized-wiki.sh \
|
||||
scripts/upload-organized-wiki.sh \
|
||||
docs/admin/DOCUMENTATION_MAINTENANCE_GUIDE.md; do
|
||||
sed -i "s/REDACTED_GITEA_TOKEN/$NEW_TOKEN/g" "$f"
|
||||
done
|
||||
```
|
||||
|
||||
#### 1b. Retro-site clone token (`52fa6ccb...`)
|
||||
|
||||
**File:** `Calypso/retro-site.yaml` and `hosts/synology/calypso/retro-site.yaml`
|
||||
|
||||
```bash
|
||||
# 1. Go to https://git.vish.gg/user/settings/applications
|
||||
# 2. Revoke the token starting 52fa6ccb
|
||||
# 3. Generate new token, name: retro-site-deploy, scope: repo:read
|
||||
# 4. Update the git clone URL in both compose files
|
||||
# Consider switching to a deploy key for least-privilege access
|
||||
```
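If you switch to a deploy key, the setup looks roughly like this (key path and comment are illustrative):

```bash
# Generate a dedicated keypair for the retro-site clone
ssh-keygen -t ed25519 -f ~/.ssh/retro-site-deploy -N "" -C "retro-site-deploy"
# Then add ~/.ssh/retro-site-deploy.pub as a read-only Deploy Key
# under the repo's Settings in Gitea, and clone via SSH instead of HTTPS
```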
|
||||
|
||||
---
|
||||
|
||||
### 2. Cloudflare API Token (`FGXlHM7doB8Z...`)
|
||||
|
||||
Appears in 13 files including active dynamic DNS updaters on multiple hosts.
|
||||
|
||||
**Files to update (active deployments):**
|
||||
- `hosts/synology/atlantis/dynamicdnsupdater.yaml`
|
||||
- `hosts/physical/guava/portainer_yaml/dynamic_dns.yaml`
|
||||
- `hosts/physical/concord-nuc/dyndns_updater.yaml`
|
||||
- Various Calypso/homelab-vm DDNS configs
|
||||
|
||||
**Files to sanitize (docs):**
|
||||
- `docs/infrastructure/cloudflare-dns.md`
|
||||
- `docs/infrastructure/npm-migration-jan2026.md`
|
||||
- Any `docs/services/individual/ddns-*.md` files
|
||||
|
||||
```bash
|
||||
# 1. Go to https://dash.cloudflare.com/profile/api-tokens
|
||||
# 2. Find the token (FGXlHM7doB8Z...) and click Revoke
|
||||
# 3. Create a new token: use "Edit zone DNS" template, scope to your zone only
|
||||
# 4. Replace in all compose files above
|
||||
# 5. Replace hardcoded value in docs with: YOUR_CLOUDFLARE_API_TOKEN
|
||||
|
||||
# Verify DDNS containers restart and can still update DNS:
|
||||
docker logs cloudflare-ddns --tail 20
|
||||
```
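You can confirm the replacement token is valid before rolling it out, using Cloudflare's token-verify endpoint (`NEW_CF_TOKEN` below is assumed to hold the new token):

```bash
# Verify the new token directly against the Cloudflare API
curl -s -H "Authorization: Bearer $NEW_CF_TOKEN" \
  https://api.cloudflare.com/client/v4/user/tokens/verify
# A working token should return a JSON body with "success": true
```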
|
||||
|
||||
---
|
||||
|
||||
### 3. OpenAI API Key (`sk-proj-C_IYp6io...`)
|
||||
|
||||
**Files to update:**
|
||||
- `hosts/vms/homelab-vm/hoarder.yaml`
|
||||
- `docs/services/individual/web.md` (replace with placeholder)
|
||||
|
||||
```bash
|
||||
# 1. Go to https://platform.openai.com/api-keys
|
||||
# 2. Delete the exposed key
|
||||
# 3. Create a new key, set a usage limit
|
||||
# 4. Update OPENAI_API_KEY in hoarder.yaml
|
||||
# 5. Replace value in docs with: YOUR_OPENAI_API_KEY
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Priority 2 — OAuth / SSO Secrets
|
||||
|
||||
### 4. Grafana ↔ Authentik OAuth Secret
|
||||
|
||||
**Files to update:**
|
||||
- `hosts/vms/homelab-vm/monitoring.yaml`
|
||||
- `hosts/synology/atlantis/grafana.yml`
|
||||
- `docs/infrastructure/authentik-sso.md` (replace with placeholder)
|
||||
- `docs/services/individual/grafana-oauth.md` (replace with placeholder)
|
||||
|
||||
```bash
|
||||
# 1. Log into Authentik admin: https://auth.vish.gg/if/admin/
|
||||
# 2. Applications → Providers → find Grafana OAuth2 provider
|
||||
# 3. Edit → regenerate Client Secret → copy both Client ID and Secret
|
||||
# 4. Update in both compose files:
|
||||
# GF_AUTH_GENERIC_OAUTH_CLIENT_ID: NEW_ID
|
||||
# GF_AUTH_GENERIC_OAUTH_CLIENT_SECRET: NEW_SECRET
|
||||
# 5. Commit and push — both Grafana stacks restart automatically
|
||||
|
||||
# Verify SSO works after restart:
|
||||
curl -I https://gf.vish.gg
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 5. Seafile ↔ Authentik OAuth Secret
|
||||
|
||||
**Files to update:**
|
||||
- `hosts/synology/calypso/seafile-oauth-config.py`
|
||||
- `docs/services/individual/seafile-oauth.md` (replace with placeholder)
|
||||
|
||||
```bash
|
||||
# 1. Log into Authentik admin
|
||||
# 2. Applications → Providers → find Seafile OAuth2 provider
|
||||
# 3. Regenerate client secret
|
||||
# 4. Update OAUTH_CLIENT_ID and OAUTH_CLIENT_SECRET in seafile-oauth-config.py
|
||||
# 5. Re-run the config script on the Seafile server to apply
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 6. Authentik Secret Key (`RpRexcYo5HAz...`)
|
||||
|
||||
**Critical** — this key encrypts all Authentik data (tokens, sessions, stored credentials).
|
||||
|
||||
**File:** `hosts/synology/calypso/authentik/docker-compose.yaml`
|
||||
|
||||
```bash
|
||||
# 1. Generate a new secret:
|
||||
python3 -c "import secrets; print(secrets.token_urlsafe(50))"
|
||||
|
||||
# 2. Update AUTHENTIK_SECRET_KEY in docker-compose.yaml
|
||||
# 3. Commit and push — Authentik will restart
|
||||
# WARNING: All active Authentik sessions will be invalidated.
|
||||
# Users will need to log back in. SSO-protected services
|
||||
# may temporarily show login errors while Authentik restarts.
|
||||
|
||||
# Verify Authentik is healthy after restart:
|
||||
docker logs authentik_server --tail 30
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Priority 3 — Application Secrets (Require Service Restart)
|
||||
|
||||
### 7. Gmail App Passwords
|
||||
|
||||
Five distinct app passwords were found across the repo. Revoke all of them
|
||||
in Google Account → Security → App passwords, then create new per-service ones.
|
||||
|
||||
| Password | Used For | Active Files |
|
||||
|----------|----------|-------------|
|
||||
| (see Vaultwarden) | Mastodon, Joplin, Authentik SMTP | `matrix-ubuntu-vm/mastodon/.env.production.template`, `atlantis/joplin.yml`, `calypso/authentik/docker-compose.yaml` |
|
||||
| (see Vaultwarden) | Vaultwarden SMTP | `atlantis/vaultwarden.yaml` |
|
||||
| (see Vaultwarden) | Documenso SMTP | `atlantis/documenso/documenso.yaml` |
|
||||
| (see Vaultwarden) | Reactive Resume v4 (archived) | `archive/reactive_resume_v4_archived/docker-compose.yml` |
|
||||
| (see Vaultwarden) | Reactive Resume v5 (active) | `calypso/reactive_resume_v5/docker-compose.yml` |
|
||||
|
||||
**Best practice:** Create one app password per service, named clearly (e.g.,
|
||||
`homelab-joplin`, `homelab-mastodon`). Update each file's `SMTP_PASS` /
|
||||
`SMTP_PASSWORD` / `MAILER_AUTH_PASSWORD` / `smtp_password` field.
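After rotating, a repo-wide sweep helps catch any copies the table above missed (run from the repo root; the pattern covers the variable names used across these compose files):

```bash
# List files still containing an SMTP password field
grep -rliE 'smtp_pass|mailer_auth_password' hosts/ docs/
```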
|
||||
|
||||
---
|
||||
|
||||
### 8. Matrix Synapse Secrets
|
||||
|
||||
Three secrets in `homeserver.yaml`, plus the TURN shared secret.
|
||||
|
||||
**File:** `hosts/synology/atlantis/matrix_synapse_docs/homeserver.yaml`
|
||||
|
||||
```bash
|
||||
# Generate fresh values for each:
|
||||
python3 -c "import secrets; print(secrets.token_urlsafe(48))"
|
||||
|
||||
# Fields to rotate:
|
||||
# registration_shared_secret
|
||||
# macaroon_secret_key
|
||||
# form_secret
|
||||
# turn_shared_secret
|
||||
|
||||
# After updating homeserver.yaml, restart Synapse:
|
||||
docker restart synapse # or via Portainer
|
||||
|
||||
# Also update coturn config on the server directly:
|
||||
ssh atlantis
|
||||
nano /path/to/turnserver.conf
|
||||
# Update: static-auth-secret=NEW_TURN_SECRET
|
||||
systemctl restart coturn
|
||||
|
||||
# Update instructions.txt — replace old values with REDACTED
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 9. Mastodon `SECRET_KEY_BASE` + `OTP_SECRET`
|
||||
|
||||
**File:** `hosts/synology/atlantis/mastodon.yml`
|
||||
**Also in:** `docs/services/individual/mastodon.md` (replace with placeholder)
|
||||
|
||||
```bash
|
||||
# Generate new values:
|
||||
openssl rand -hex 64 # for SECRET_KEY_BASE
|
||||
openssl rand -hex 64 # for OTP_SECRET
|
||||
|
||||
# Update both in mastodon.yml
|
||||
# Commit and push — GitOps restarts Mastodon
|
||||
# WARNING: All active user sessions are invalidated. Users must log back in.
|
||||
|
||||
# Verify Mastodon web is accessible:
|
||||
curl -I https://your-mastodon-domain/
|
||||
docker logs mastodon_web --tail 20
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 10. Documenso Secrets (3 keys)
|
||||
|
||||
**Files:**
|
||||
- `hosts/synology/atlantis/documenso/documenso.yaml`
|
||||
- `hosts/synology/atlantis/documenso/Secrets.txt` (will be removed by sanitizer)
|
||||
- `docs/services/individual/documenso.md` (replace with placeholder)
|
||||
|
||||
```bash
|
||||
# Generate new values:
|
||||
python3 -c "import secrets; print(secrets.token_urlsafe(32))" # NEXTAUTH_SECRET
|
||||
python3 -c "import secrets; print(secrets.token_urlsafe(32))" # NEXT_PRIVATE_ENCRYPTION_KEY
|
||||
python3 -c "import secrets; print(secrets.token_urlsafe(32))" # NEXT_PRIVATE_ENCRYPTION_SECONDARY_KEY
|
||||
|
||||
# Update all three in documenso.yaml
|
||||
# NOTE: Rotating encryption keys will invalidate signed documents.
|
||||
# Confirm this is acceptable before rotating.
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 11. Paperless-NGX API Token
|
||||
|
||||
**Files:**
|
||||
- `hosts/synology/calypso/paperless/paperless-ai.yml`
|
||||
- `hosts/synology/calypso/paperless/README.md` (replace with placeholder)
|
||||
- `docs/services/paperless.md` (replace with placeholder)
|
||||
|
||||
```bash
|
||||
# 1. Log into Paperless web UI
|
||||
# 2. Admin → Auth Token → delete existing, generate new
|
||||
# 3. Update PAPERLESS_API_TOKEN in paperless-ai.yml
|
||||
# 4. Commit and push
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 12. Immich JWT Secret (Both NAS)
|
||||
|
||||
**Files:**
|
||||
- `hosts/synology/atlantis/immich/stack.env` (will be removed by sanitizer)
|
||||
- `hosts/synology/calypso/immich/stack.env` (will be removed by sanitizer)
|
||||
|
||||
Since these files are removed by the sanitizer, ensure they are in `.gitignore`
|
||||
or managed via Portainer env variables going forward.
|
||||
|
||||
```bash
|
||||
# Generate new secret:
|
||||
openssl rand -base64 96
|
||||
|
||||
# Update JWT_SECRET in both stack.env files locally,
|
||||
# then apply via Portainer (not committed to git).
|
||||
# WARNING: All active Immich sessions invalidated.
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 13. Revolt/Stoatchat — LiveKit API Secret + VAPID Private Key
|
||||
|
||||
**Files:**
|
||||
- `hosts/vms/seattle/stoatchat/livekit.yml`
|
||||
- `hosts/vms/seattle/stoatchat/Revolt.overrides.toml`
|
||||
- `hosts/vms/homelab-vm/stoatchat.yaml`
|
||||
- `docs/services/stoatchat/Revolt.overrides.toml` (replace with placeholder)
|
||||
- `hosts/vms/seattle/stoatchat/DEPLOYMENT_SUMMARY.md` (replace with placeholder)
|
||||
|
||||
```bash
|
||||
# Generate new LiveKit API key/secret pair:
|
||||
# Use the LiveKit CLI or generate random strings:
|
||||
python3 -c "import secrets; print(secrets.token_urlsafe(24))" # API key
|
||||
python3 -c "import secrets; print(secrets.token_urlsafe(32))" # API secret
|
||||
|
||||
# Generate new VAPID key pair:
|
||||
npx web-push generate-vapid-keys
|
||||
# or: python3 -c "from py_vapid import Vapid; v=Vapid(); v.generate_keys(); print(v.private_key)"
|
||||
|
||||
# Update in livekit.yml and Revolt.overrides.toml
|
||||
# Restart LiveKit and Revolt services
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 14. Jitsi Internal Auth Passwords (6 passwords)
|
||||
|
||||
**File:** `hosts/synology/atlantis/jitsi/jitsi.yml`
|
||||
**Also in:** `hosts/synology/atlantis/jitsi/.env` (will be removed by sanitizer)
|
||||
|
||||
```bash
|
||||
# Generate new passwords for each variable:
|
||||
for var in JICOFO_COMPONENT_SECRET JICOFO_AUTH_PASSWORD JVB_AUTH_PASSWORD \
|
||||
JIGASI_XMPP_PASSWORD JIBRI_RECORDER_PASSWORD JIBRI_XMPP_PASSWORD; do
|
||||
echo "$var=$(openssl rand -hex 10)"
|
||||
done
|
||||
|
||||
# Update all 6 in jitsi.yml
|
||||
# Restart the entire Jitsi stack — all components must use the same passwords
|
||||
docker compose -f jitsi.yml down && docker compose -f jitsi.yml up -d
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 15. SNMP v3 Auth + Priv Passwords
|
||||
|
||||
Used for NAS monitoring — same credentials across 6 files.
|
||||
|
||||
**Files to update:**
|
||||
- `hosts/synology/setillo/prometheus/snmp.yml`
|
||||
- `hosts/synology/atlantis/grafana_prometheus/snmp.yml`
|
||||
- `hosts/synology/atlantis/grafana_prometheus/snmp_mariushosting.yml`
|
||||
- `hosts/synology/calypso/grafana_prometheus/snmp.yml`
|
||||
- `hosts/vms/homelab-vm/monitoring.yaml`
|
||||
|
||||
```bash
|
||||
# 1. Log into each Synology NAS DSM
|
||||
# 2. Go to Control Panel → Terminal & SNMP → SNMP tab
|
||||
# 3. Update SNMPv3 auth password and privacy password to new values
|
||||
# 4. Update the same values in all 5 config files above
|
||||
# 5. The archive file (deprecated-monitoring-stacks) can just be left for
|
||||
# the sanitizer to redact.
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 16. Invidious `hmac_key`
|
||||
|
||||
**Files:**
|
||||
- `hosts/physical/concord-nuc/invidious/invidious.yaml`
|
||||
- `hosts/physical/concord-nuc/invidious/invidious_old/invidious.yaml`
|
||||
- `hosts/synology/atlantis/invidious.yml`
|
||||
|
||||
```bash
|
||||
# Generate new hmac_key:
|
||||
python3 -c "import secrets; print(secrets.token_hex(16))"
|
||||
|
||||
# Update hmac_key in each active invidious.yaml
|
||||
# Restart Invidious containers
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 17. Open WebUI Secret Keys
|
||||
|
||||
**Files:**
|
||||
- `hosts/vms/contabo-vm/ollama/docker-compose.yml`
|
||||
- `hosts/synology/atlantis/ollama/docker-compose.yml`
|
||||
- `hosts/synology/atlantis/ollama/64_bit_key.txt` (will be removed by sanitizer)
|
||||
|
||||
```bash
|
||||
# Generate new key:
|
||||
openssl rand -hex 32
|
||||
|
||||
# Update WEBUI_SECRET_KEY in both compose files
|
||||
# Restart Open WebUI containers — active sessions invalidated
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 18. Portainer Edge Key
|
||||
|
||||
**File:** `hosts/vms/homelab-vm/portainer_agent.yaml`
|
||||
|
||||
```bash
|
||||
# 1. Log into Portainer at https://192.168.0.200:9443
|
||||
# 2. Go to Settings → Edge Compute → Edge Agents
|
||||
# 3. Find the homelab-vm agent and regenerate its edge key
|
||||
# 4. Update EDGE_KEY in portainer_agent.yaml with the new base64 value
|
||||
# 5. Restart the Portainer edge agent container
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 19. OpenProject Secret Key
|
||||
|
||||
**File:** `hosts/vms/homelab-vm/openproject.yml`
|
||||
**Also in:** `docs/services/individual/openproject.md` (replace with placeholder)
|
||||
|
||||
```bash
|
||||
openssl rand -hex 64
|
||||
# Update OPENPROJECT_SECRET_KEY_BASE in openproject.yml
|
||||
# Restart OpenProject — sessions invalidated
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 20. RomM Auth Secret Key
|
||||
|
||||
**File:** `hosts/vms/homelab-vm/romm/romm.yaml`
|
||||
**Also:** `hosts/vms/homelab-vm/romm/secret_key.yaml` (will be removed by sanitizer)
|
||||
|
||||
```bash
|
||||
openssl rand -hex 32
|
||||
# Update ROMM_AUTH_SECRET_KEY in romm.yaml
|
||||
# Restart RomM — sessions invalidated
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 21. Hoarder NEXTAUTH Secret
|
||||
|
||||
**File:** `hosts/vms/homelab-vm/hoarder.yaml`
|
||||
**Also in:** `docs/services/individual/web.md` (replace with placeholder)
|
||||
|
||||
```bash
|
||||
openssl rand -base64 36
|
||||
# Update NEXTAUTH_SECRET in hoarder.yaml
|
||||
# Restart Hoarder — sessions invalidated
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Priority 4 — Shared / Weak Passwords
|
||||
|
||||
### 22. `REDACTED_PASSWORD123!` — Used Across 5+ Services
|
||||
|
||||
This password is the same for all of the following. Change each to a
|
||||
**unique** strong password:
|
||||
|
||||
| Service | File | Variable |
|
||||
|---------|------|----------|
|
||||
| NetBox | `hosts/synology/atlantis/netbox.yml` | `SUPERUSER_PASSWORD` |
|
||||
| Paperless admin | `hosts/synology/calypso/paperless/docker-compose.yml` | `PAPERLESS_ADMIN_PASSWORD` |
|
||||
| Seafile admin | `hosts/synology/calypso/seafile-server.yaml` | `INIT_SEAFILE_ADMIN_PASSWORD` |
|
||||
| Seafile admin (new) | `hosts/synology/calypso/seafile-new.yaml` | `INIT_SEAFILE_ADMIN_PASSWORD` |
|
||||
| PhotoPrism | `hosts/physical/anubis/photoprism.yml` | `PHOTOPRISM_ADMIN_PASSWORD` |
|
||||
| Hemmelig | `hosts/vms/bulgaria-vm/hemmelig.yml` | `SECRET_JWT_SECRET` |
|
||||
| Vaultwarden admin | `hosts/synology/atlantis/bitwarden/bitwarden_token.txt` | (source password) |
|
||||
|
||||
For each: generate `openssl rand -base64 18`, update in the compose file,
|
||||
restart the container, then log in to verify.
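To mint all the replacements in one pass (the service names are illustrative labels, not container names):

```bash
# Print one fresh, unique password per affected service
for svc in netbox paperless seafile photoprism hemmelig vaultwarden; do
  printf '%-12s %s\n' "$svc" "$(openssl rand -base64 18)"
done
```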

---

### 23. `REDACTED_PASSWORD` — Used Across 3 Services

| Service | File | Variable |
|---------|------|----------|
| Gotify | `hosts/vms/homelab-vm/gotify.yml` | `GOTIFY_DEFAULTUSER_PASS` |
| Pi-hole | `hosts/synology/atlantis/pihole.yml` | `WEBPASSWORD` |
| Stirling PDF | `hosts/synology/atlantis/stirlingpdf.yml` | `SECURITY_INITIAL_LOGIN_PASSWORD` |

---

### 24. `mastodon_pass_2026` — Live PostgreSQL Password

**Files:**
- `hosts/vms/matrix-ubuntu-vm/mastodon/.env.production.template`
- `hosts/vms/matrix-ubuntu-vm/docs/SETUP.md`

```bash
# On the matrix-ubuntu-vm server:
ssh YOUR_WAN_IP
sudo -u postgres psql
ALTER USER mastodon WITH PASSWORD 'REDACTED_PASSWORD';
\q

# Update the password in .env.production.template and Mastodon's running config
# Restart Mastodon services
```

---

### 25. Watchtower API Token (`REDACTED_WATCHTOWER_TOKEN`)

| File |
|------|
| `hosts/synology/atlantis/watchtower.yml` |
| `hosts/synology/calypso/prometheus.yml` |

```bash
# Generate a proper random token:
openssl rand -hex 20
# Update WATCHTOWER_HTTP_API_TOKEN in both files
# Update any scripts that call the Watchtower API
```

---

### 26. `test:test` SSH Credentials on `YOUR_WAN_IP`

The matrix-ubuntu-vm CREDENTIALS.md shows a `test` user with password `test`.

```bash
# SSH to the server and remove or secure the test account:
ssh YOUR_WAN_IP
passwd test      # change to a strong password
# or: userdel -r test   # remove entirely if unused
```

---

## Priority 5 — Network Infrastructure

### 27. Management Switch Password Hashes

**File:** `mgmtswitch.conf` (will be removed from public mirror by sanitizer)

The SHA-512 hashes for the `root`, `vish`, and `vkhemraj` switch accounts are crackable offline. Rotate the switch passwords:

```bash
# SSH to the management switch
ssh admin@10.0.0.15
# Change passwords for all local accounts:
enable
configure terminal
username root secret NEW_PASSWORD
username vish secret NEW_PASSWORD
username vkhemraj secret NEW_PASSWORD
write memory
```

---

## Final Verification

After completing all rotations:

```bash
# 1. Commit and push all file changes
git add -A
git commit -m "chore(security): rotate all exposed credentials"
git push origin main

# 2. Wait for the mirror workflow to complete, then pull:
git -C /home/homelab/organized/repos/homelab-optimized pull

# 3. Verify none of the old secrets appear in the public mirror:
cd /home/homelab/organized/repos/homelab-optimized
grep -r "77e3ddaf\|52fa6ccb\|FGXlHM7d\|sk-proj-C_IYp6io\|ArP5XWdkwVyw\|bdtrpmpce\|toiunzuby" . 2>/dev/null
grep -r "244c619d\|RpRexcYo5\|mastodon_pass\|REDACTED_PASSWORD\|REDACTED_PASSWORD\|REDACTED_WATCHTOWER_TOKEN" . 2>/dev/null
grep -r "2e80b1b7d3a\|eca299ae59\|rxmr4tJoqfu\|ZjCofRlfm6\|QE5SudhZ99" . 2>/dev/null
# All should return no results

# 4. Verify GitOps deployments are healthy in Portainer:
# https://192.168.0.200:9443
```
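
The grep checks above can be wrapped in a reusable scan that exits non-zero when any pattern survives, which makes it usable as a pre-push gate; the directory and patterns in the example call are illustrative:

```bash
# Exit non-zero if any secret fragment still appears under the given directory.
scan_secrets() {
  local dir=$1; shift
  local found=0
  for pat in "$@"; do
    if grep -rq "$pat" "$dir" 2>/dev/null; then
      echo "STILL EXPOSED: $pat"
      found=1
    fi
  done
  return $found
}

# Example (patterns are illustrative — use the real rotated fragments):
# scan_secrets /home/homelab/organized/repos/homelab-optimized "77e3ddaf" "52fa6ccb"
```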

---

## Going Forward — Preventing This Again

The root cause: secrets hard-coded in compose files that get committed to git.

**Rules:**
1. **Never hard-code secrets in compose files** — use Docker Secrets, or an `.env` file excluded by `.gitignore` (Portainer can load env files from the host at deploy time)
2. **Never put real values in documentation** — use `YOUR_API_KEY` placeholders
3. **Never create `Secrets.txt` or `CREDENTIALS.md` files in the repo** — use a password manager (you already have Vaultwarden/Bitwarden)
4. **Run the sanitizer locally** before any commit that touches secrets:

```bash
# Test in a temp copy — see what the sanitizer would catch:
tmpdir=$(mktemp -d)
cp -r /path/to/homelab "$tmpdir/"
python3 "$tmpdir/homelab/.gitea/sanitize.py"
```
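
Rule 1 in practice, as a minimal compose sketch (the service name, image, and variable are illustrative): the committed compose file only references an `.env` file that never enters git.

```yaml
# docker-compose.yaml — committed to git, contains no secrets
services:
  myservice:              # illustrative service
    image: myservice:latest
    env_file:
      - .env              # SECRET_KEY=... lives here; add ".env" to .gitignore
```

Portainer can supply the same env file from the host at deploy time, so the secret never needs to exist in the repository at all.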

## Related Documentation

- [Security Hardening](../security/SERVER_HARDENING.md)
- [Repository Sanitization](../admin/REPOSITORY_SANITIZATION.md)
- [GitOps Deployment Guide](../admin/gitops-deployment-guide.md)

## Portainer Git Credential Rotation

The saved Git credential **`portainer-homelab`** (credId: 1) is used by ~43 stacks to pull compose files from `git.vish.gg`. When the Gitea token expires or is rotated, all those stacks fail to redeploy.

```bash
# 1. Generate a new Gitea token at https://git.vish.gg/user/settings/applications
#    Scope: read:repository

# 2. Test the token:
curl -s -o /dev/null -w "%{http_code}" \
  -H "Authorization: token YOUR_NEW_TOKEN" \
  "https://git.vish.gg/api/v1/repos/Vish/homelab"
# Should return 200

# 3. Update in Portainer:
curl -k -s -X PUT \
  -H "X-API-Key: REDACTED_API_KEY" \
  -H "Content-Type: application/json" \
  "https://192.168.0.200:9443/api/users/1/gitcredentials/1" \
  -d '{"name":"portainer-homelab","username":"vish","password":"YOUR_NEW_TOKEN"}'
```

> Note: The API update may not immediately propagate to automated pulls.
> Pass credentials inline in redeploy calls to force use of the new token.

---

## Change Log

- 2026-02-27 — Incident: sanitization commit `037d766a` replaced credentials with `REDACTED_PASSWORD` placeholders across 14 compose files. All affected containers detected via Portainer API env scan and restored from `git show 037d766a^`. Added Portainer Git credential rotation section above.
- 2026-02-20 — Initial creation (8 items)
- 2026-02-20 — Expanded after full private repo audit (27 items across 34 exposure categories)

490
docs/runbooks/disk-full-procedure.md
Normal file

# Disk Full Procedure Runbook

## Overview
This runbook provides procedures for handling disk space emergencies across all homelab hosts. It includes immediate actions, root cause analysis, and long-term solutions to prevent recurrence.

## Prerequisites
- [ ] SSH access to affected host
- [ ] Root/sudo privileges on the host
- [ ] Monitoring dashboards access
- [ ] Backup verification capability

## Metadata
- **Estimated Time**: 30-90 minutes (depending on severity)
- **Risk Level**: High (data loss possible if not handled carefully)
- **Requires Downtime**: Minimal (may need to stop services temporarily)
- **Reversible**: Partially (deleted data cannot be recovered)
- **Tested On**: 2026-02-14

## Severity Levels

| Level | Disk Usage | Action Required | Urgency |
|-------|------------|-----------------|---------|
| 🟢 **Normal** | < 80% | Monitor | Low |
| 🟡 **Warning** | 80-90% | Plan cleanup | Medium |
| 🟠 **Critical** | 90-95% | Immediate cleanup | High |
| 🔴 **Emergency** | > 95% | Emergency response | Critical |

## Quick Triage

First, determine which host and volume is affected:

```bash
# Check all hosts' disk usage
ssh atlantis "df -h"
ssh calypso "df -h"
ssh concordnuc "df -h"
ssh homelab-vm "df -h"
ssh raspberry-pi-5 "df -h"
```
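
The per-host `df -h` checks can be condensed into a report that only prints filesystems at or above a threshold — a small sketch using only `df -P` and `awk`:

```bash
# Print mounts at or above a usage threshold (default 80%).
check_disk() {
  threshold=${1:-80}
  df -P | awk -v t="$threshold" 'NR > 1 { use = $5; sub(/%/, "", use); if (use + 0 >= t) print $6, $5 }'
}

# check_disk 90    # run locally, or push it over ssh:
# ssh atlantis "$(declare -f check_disk); check_disk 90"
```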

## Emergency Procedure (>95% Full)

### Step 1: Immediate Space Recovery

**Goal**: Free up 5-10% space immediately to prevent system issues.

```bash
# SSH to affected host
ssh [hostname]

# Identify what's consuming space
df -h
du -sh /* 2>/dev/null | sort -rh | head -20

# Quick wins - Clear Docker cache
docker system df                           # See what Docker is using
docker system prune -a --volumes --force   # Reclaim space (BE CAREFUL!)

# This typically frees 10-50GB depending on your setup
```

**⚠️ WARNING**: `docker system prune` will remove:
- Stopped containers
- Unused networks
- Dangling images
- Build cache
- Unused volumes (with `--volumes` flag)

**Safer alternative** if you're unsure:
```bash
# Less aggressive - removes only stopped containers and dangling images
docker system prune --force
```

### Step 2: Clear Log Files

```bash
# Find large log files
find /var/log -type f -size +100M -exec ls -lh {} \; | sort -k 5 -rh

# Clear systemd journal (keeps last 3 days)
sudo journalctl --vacuum-time=3d

# Clear old Docker logs
sudo sh -c 'truncate -s 0 /var/lib/docker/containers/*/*-json.log'

# For Synology NAS
sudo find /volume1/@docker/containers -name "*-json.log" -size +100M -exec truncate -s 0 {} \;
```
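
A cautious variant of the truncation commands above: preview what matches before zeroing anything. Truncating (rather than `rm`) keeps file handles held by running daemons valid. The helper is a sketch; point it at the log directory in question and adjust the name pattern and size:

```bash
# Preview, then zero, oversized logs under a directory.
truncate_big_logs() {
  dir=$1
  find "$dir" -type f -name '*.log' -size +100M -print           # dry run: list matches
  find "$dir" -type f -name '*.log' -size +100M -exec truncate -s 0 {} +
}

# truncate_big_logs /var/log
```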

### Step 3: Remove Old Docker Images

```bash
# List images by size
docker images --format "{{.Repository}}:{{.Tag}}\t{{.Size}}" | sort -k 2 -rh | head -20

# Remove specific old images
docker image rm [image:tag]

# Remove all unused images
docker image prune -a --force
```

### Step 4: Verify Space Recovered

```bash
# Check current usage
df -h

# Verify critical services are running
docker ps

# Check container logs for errors
docker ps --format "{{.Names}}" | xargs -I {} sh -c 'echo "=== {} ===" && docker logs --tail 20 {}'
```

## Detailed Analysis Procedure

Once the immediate danger has passed, perform a thorough analysis:

### Step 1: Identify Space Consumers

```bash
# Comprehensive disk usage analysis
sudo du -h --max-depth=2 / 2>/dev/null | sort -rh | head -30

# For Synology NAS specifically
sudo du -h --max-depth=2 /volume1 2>/dev/null | sort -rh | head -30

# Check Docker volumes
docker volume ls
docker system df -v

# Check specific large directories
du -sh /var/lib/docker/* | sort -rh
du -sh /volume1/docker/* | sort -rh   # Synology
```

### Step 2: Analyze by Service

Create a space usage report:

```bash
# Create analysis script
cat > /tmp/analyze-space.sh << 'EOF'
#!/bin/bash
echo "=== Docker Container Volumes ==="
docker ps --format "{{.Names}}" | while read container; do
  size=$(docker exec $container du -sh / 2>/dev/null | awk '{print $1}')
  echo "$container: $size"
done | sort -rh

echo ""
echo "=== Docker Volumes ==="
docker volume ls --format "{{.Name}}" | while read vol; do
  size=$(docker volume inspect $vol --format '{{.Mountpoint}}' | xargs sudo du -sh 2>/dev/null | awk '{print $1}')
  echo "$vol: $size"
done | sort -rh

echo ""
echo "=== Log Files Over 100MB ==="
find /var/log -type f -size +100M -exec ls -lh {} \; 2>/dev/null
EOF

chmod +x /tmp/analyze-space.sh
/tmp/analyze-space.sh
```

### Step 3: Categorize Findings

Identify the primary space consumers:

| Category | Typical Culprits | Safe to Delete? |
|----------|------------------|-----------------|
| **Docker Images** | Old/unused image versions | ✅ Yes (if unused) |
| **Docker Volumes** | Database growth, media cache | ⚠️ Maybe (check first) |
| **Log Files** | Application logs, system logs | ✅ Yes (after review) |
| **Media Files** | Plex, Jellyfin transcodes | ✅ Yes (transcodes) |
| **Backups** | Old backup archives | ✅ Yes (keep recent) |
| **Application Data** | Various service data | ❌ No (review first) |

## Cleanup Strategies by Service Type

### Media Services (Plex, Jellyfin)

```bash
# Clear Plex transcode cache (run the glob inside the container's shell,
# otherwise the * is expanded on the host and matches nothing)
docker exec plex sh -c 'rm -rf /transcode/*'

# Clear Jellyfin transcode cache
docker exec jellyfin sh -c 'rm -rf /config/data/transcodes/*'

# Find and remove old media previews
find /volume1/plex/Library/Application\ Support/Plex\ Media\ Server/Cache -type f -mtime +30 -delete
```

### *arr Suite (Sonarr, Radarr, etc.)

```bash
# Clear download client history and backups
docker exec sonarr find /config/Backups -mtime +30 -delete
docker exec radarr find /config/Backups -mtime +30 -delete

# Clean up old logs
docker exec sonarr find /config/logs -mtime +30 -delete
docker exec radarr find /config/logs -mtime +30 -delete
```

### Database Services (PostgreSQL, MariaDB)

```bash
# Check database size
docker exec postgres psql -U user -c "SELECT pg_database.datname, pg_size_pretty(pg_database_size(pg_database.datname)) FROM pg_database;"

# Vacuum databases (for PostgreSQL)
docker exec postgres vacuumdb -U user --all --full --analyze

# Check MariaDB size
docker exec mariadb mysql -u root -p -e "SELECT table_schema AS 'Database', ROUND(SUM(data_length + index_length) / 1024 / 1024, 2) AS 'Size (MB)' FROM information_schema.TABLES GROUP BY table_schema;"
```

### Monitoring Services (Prometheus, Grafana)

```bash
# Check Prometheus storage size
du -sh /volume1/docker/prometheus

# Prometheus retention is configured via startup flags
# Default: --storage.tsdb.retention.time=15d
# Consider reducing retention if space is critical

# Clear old Grafana sessions
docker exec grafana find /var/lib/grafana/sessions -mtime +7 -delete
```

### Immich (Photo Management)

```bash
# Check Immich storage usage
docker exec immich-server df -h /usr/src/app/upload

# Immich uses a lot of space for:
# - Original photos
# - Thumbnails
# - Encoded videos
# - ML models

# Clean up old upload logs
docker exec immich-server find /usr/src/app/upload/upload -mtime +90 -delete
```

## Long-Term Solutions

### Solution 1: Configure Log Rotation

Create proper log rotation for Docker containers:

```bash
# Edit Docker daemon config
sudo nano /etc/docker/daemon.json

# Add log rotation settings
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}

# Restart Docker
sudo systemctl restart docker   # Linux
# OR for Synology
sudo synoservicectl --restart pkgctl-Docker
```
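
A malformed `daemon.json` can prevent the Docker daemon from starting at all, so it is worth validating before the restart. A small sketch using Python's stdlib JSON parser; the `docker inspect` line in the comments is how you would confirm a recreated container picked up the new log options:

```bash
# Validate daemon.json before restarting the Docker daemon.
validate_daemon_json() {
  f=$1
  python3 -m json.tool "$f" > /dev/null || { echo "invalid JSON: $f" >&2; return 1; }
  grep -q '"max-size"' "$f" || echo "warning: no max-size log limit configured"
}

# validate_daemon_json /etc/docker/daemon.json
# After restarting, new or recreated containers should report the limit:
# docker inspect --format '{{.HostConfig.LogConfig.Config}}' <container>
```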

### Solution 2: Set Up Automated Cleanup

Create a cleanup cron job:

```bash
# Create cleanup script
sudo nano /usr/local/bin/homelab-cleanup.sh

#!/bin/bash
# Homelab Automated Cleanup Script

# Remove stopped containers older than 7 days
docker container prune --filter "until=168h" --force

# Remove unused images older than 30 days
docker image prune --all --filter "until=720h" --force

# Remove unused volumes (BE CAREFUL - only if you're sure)
# docker volume prune --force

# Clear journal logs older than 7 days
journalctl --vacuum-time=7d

# Clear old backups (keep last 30 days)
find /volume1/backups -type f -mtime +30 -delete

echo "Cleanup completed at $(date)" >> /var/log/homelab-cleanup.log

# Make executable
sudo chmod +x /usr/local/bin/homelab-cleanup.sh

# Add to cron (runs weekly on Sunday at 3 AM)
(crontab -l 2>/dev/null; echo "0 3 * * 0 /usr/local/bin/homelab-cleanup.sh") | crontab -
```

### Solution 3: Configure Service-Specific Retention

Update each service with appropriate retention policies:

**Prometheus** (retention is set via startup flags in the compose `command:`, not in `prometheus.yml`):
```yaml
command:
  - '--config.file=/etc/prometheus/prometheus.yml'
  - '--storage.tsdb.retention.time=7d'    # reduce from the 15d default if space is tight
  - '--storage.tsdb.retention.size=50GB'  # hard size cap
```

**Grafana** (docker-compose.yml):
```yaml
environment:
  - GF_DATABASE_WAL=true
  - GF_DATABASE_CLEANUP_INTERVAL=168h   # Weekly cleanup
```

**Plex** (Plex settings):
- Settings → Transcoder → Transcoder temporary directory
- Settings → Scheduled Tasks → Clean Bundles (daily)
- Settings → Scheduled Tasks → Optimize Database (weekly)

### Solution 4: Monitor Disk Usage Proactively

Set up monitoring alerts in Grafana:

```yaml
# Alert rule for disk space
- alert: DiskSpaceWarning
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100) < 20
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Disk space warning on {{ $labels.instance }}"
    description: "Disk {{ $labels.mountpoint }} has less than 20% free space"

- alert: DiskSpaceCritical
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100) < 10
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "CRITICAL: Disk space on {{ $labels.instance }}"
    description: "Disk {{ $labels.mountpoint }} has less than 10% free space"
```

## Host-Specific Considerations

### Atlantis (Synology DS1823xs+)

```bash
# Synology-specific cleanup
# Clear Synology logs
sudo find /var/log -name "*.log.*" -mtime +30 -delete

# Clear package logs
sudo find /var/packages/*/target/logs -name "*.log.*" -mtime +30 -delete

# Check storage pool status
sudo synostgpool --info

# DSM has built-in storage analyzer
# Control Panel → Storage Manager → Storage Analyzer
```

### Calypso (Synology DS723+)

Same as Atlantis - use Synology-specific commands.

### Concord NUC (Ubuntu)

```bash
# Ubuntu-specific cleanup
sudo apt-get clean
sudo apt-get autoclean
sudo apt-get autoremove --purge

# Clear old kernels (keep current + 1 previous)
sudo apt-get autoremove --purge $(dpkg -l 'linux-*' | sed '/^ii/!d;/'"$(uname -r | sed "s/\(.*\)-\([^0-9]\+\)/\1/")"'/d;s/^[^ ]* [^ ]* \([^ ]*\).*/\1/;/[0-9]/!d')

# Clear thumbnail cache
rm -rf ~/.cache/thumbnails/*
```

### Homelab VM (Proxmox VM)

```bash
# VM-specific cleanup
# Clear apt cache
sudo apt-get clean

# Clear old cloud-init logs
sudo rm -rf /var/log/cloud-init*.log

# Compact QCOW2 disk (from Proxmox host)
# qemu-img convert -O qcow2 -c original.qcow2 compressed.qcow2
```

## Verification Checklist

After cleanup, verify:

- [ ] Disk usage below 80%: `df -h`
- [ ] All critical containers running: `docker ps`
- [ ] No errors in recent logs: `docker logs [container] --tail 50`
- [ ] Services accessible via web interface
- [ ] Monitoring dashboards show normal metrics
- [ ] Backup jobs can complete successfully
- [ ] Automated cleanup configured for future

## Rollback Procedure

If cleanup causes issues:

1. **Check what was deleted**: Review command history and logs
2. **Restore from backups**: If critical data was deleted
   ```bash
   cd ~/Documents/repos/homelab
   ./restore.sh [backup-date]
   ```
3. **Recreate Docker volumes**: If volumes were accidentally pruned
4. **Restart affected services**: Redeploy from Portainer

## Troubleshooting

### Issue: Still Running Out of Space After Cleanup

**Solution**: Consider adding more storage
- Add external USB drives
- Expand existing RAID arrays
- Move services to hosts with more space
- Archive old media to cold storage

### Issue: Docker Prune Removed Important Data

**Solution**:
- Always use `--filter` to be selective
- Never use `docker volume prune` without checking first
- Keep recent backups before major cleanup operations

### Issue: Services Won't Start After Cleanup

**Solution**:
```bash
# Check for missing volumes
docker ps -a
docker volume ls

# Check logs
docker logs [container]

# Recreate volumes if needed (restore from backup)
./restore.sh [backup-date]
```

## Prevention Checklist

- [ ] Log rotation configured for all services
- [ ] Automated cleanup script running weekly
- [ ] Monitoring alerts set up for disk space
- [ ] Retention policies configured appropriately
- [ ] Regular backup verification scheduled
- [ ] Capacity planning review quarterly

## Related Documentation

- [Infrastructure Overview](../infrastructure/INFRASTRUCTURE_OVERVIEW.md)
- [Backup Strategies](../admin/backup-strategies.md)
- [Monitoring Setup](../admin/monitoring-setup.md)
- [Troubleshooting Guide](../troubleshooting/common-issues.md)

## Change Log

- 2026-02-14 - Initial creation with host-specific procedures
- 2026-02-14 - Added service-specific cleanup strategies
559
docs/runbooks/service-migration.md
Normal file

# Service Migration Runbook

## Overview
This runbook guides you through migrating a containerized service from one host to another in the homelab. The procedure minimizes downtime and ensures data integrity throughout the migration.

## Prerequisites
- [ ] SSH access to both source and target hosts
- [ ] Sufficient disk space on target host
- [ ] Network connectivity between hosts (Tailscale recommended)
- [ ] Service backup completed and verified
- [ ] Maintenance window scheduled (if downtime required)
- [ ] Portainer access for both hosts

## Metadata
- **Estimated Time**: 1-3 hours (depending on data size)
- **Risk Level**: Medium-High (data migration involved)
- **Requires Downtime**: Yes (typically 15-60 minutes)
- **Reversible**: Yes (can roll back to source host)
- **Tested On**: 2026-02-14

## When to Migrate Services

Common reasons for service migration:

| Scenario | Example | Recommended Target |
|----------|---------|-------------------|
| **Resource constraints** | NAS running out of CPU | Move to NUC or VM |
| **Storage constraints** | Running out of disk space | Move to larger NAS |
| **Performance issues** | High I/O affecting other services | Move to dedicated host |
| **Host consolidation** | Reducing number of active hosts | Consolidate to primary hosts |
| **Hardware maintenance** | Planned hardware upgrade | Temporary or permanent move |
| **Improved organization** | Group related services | Move to appropriate host |

## Migration Types

### Type 1: Simple Migration (Stateless Service)
- No persistent data
- Can be redeployed from scratch
- Example: Nginx, static web servers
- **Downtime**: Minimal (5-15 minutes)

### Type 2: Standard Migration (Small Data)
- Persistent data < 10GB
- Configuration and databases
- Example: Uptime Kuma, AdGuard Home
- **Downtime**: 15-30 minutes

### Type 3: Large Data Migration
- Persistent data > 10GB
- Media libraries, large databases
- Example: Plex, Immich, Jellyfin
- **Downtime**: 1-4 hours (depending on size)

## Pre-Migration Planning

### Step 1: Assess the Service

```bash
# SSH to source host
ssh [source-host]

# Identify container and volumes
docker ps | grep [service-name]
docker inspect [service-name] | grep -A 10 Mounts

# Check data size
docker exec [service-name] du -sh /config /data

# List all volumes used by service
docker volume ls | grep [service-name]

# Check volume sizes
docker system df -v | grep [service-name]
```

Document findings:
- Container name: ___________
- Image and tag: ___________
- Data size: ___________
- Volume count: ___________
- Network dependencies: ___________
- Port mappings: ___________

### Step 2: Check Target Host Capacity

```bash
# SSH to target host
ssh [target-host]

# Check available resources
df -h       # Disk space
free -h     # RAM
nproc       # CPU cores
docker ps | wc -l   # Current container count

# Check port conflicts
netstat -tlnp | grep [required-port]
```
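
The capacity check can be made fail-fast: a small helper that compares required space (in MiB) against `df` output for the target path and refuses to proceed if the data won't fit. A sketch, not tied to any particular host:

```bash
# Refuse to proceed unless the target path has at least $2 MiB free.
require_free_mib() {
  path=$1; need=$2
  avail=$(df -Pm "$path" | awk 'NR == 2 { print $4 }')
  if [ "$avail" -lt "$need" ]; then
    echo "need ${need} MiB on $path, only ${avail} MiB free" >&2
    return 1
  fi
}

# require_free_mib /path/to/docker 20480 && echo "enough room to migrate"
```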

### Step 3: Create Migration Plan

**Downtime Window**:
- Start: ___________
- End: ___________
- Duration: ___________

**Dependencies**:
- Services that depend on this: ___________
- Services this depends on: ___________

**Notification**:
- Who to notify: ___________
- When to notify: ___________

## Migration Procedure

### Method A: GitOps Migration (Recommended)

Best for: Most services with proper version control

#### Step 1: Backup Current Service

```bash
# SSH to source host
ssh [source-host]

# Create backup
docker stop [service-name]
docker export [service-name] > /tmp/[service-name]-backup.tar

# Backup volumes
for vol in $(docker volume ls -q | grep [service-name]); do
  docker run --rm -v $vol:/source -v /tmp:/backup alpine tar czf /backup/$vol.tar.gz -C /source .
done

# Copy backups to safe location
scp /tmp/[service-name]*.tar* [backup-location]:~/backups/
```
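
Before trusting one of these volume archives (and certainly before removing data from the source host in Step 9), restore it into a scratch directory and diff it against the original. A sketch of that round-trip check:

```bash
# Restore an archive into a scratch dir and compare it to the source tree.
verify_archive() {
  archive=$1; src=$2
  tmp=$(mktemp -d); rc=0
  tar xzf "$archive" -C "$tmp" || rc=1
  diff -r "$src" "$tmp" || rc=1     # non-zero if any file differs
  rm -rf "$tmp"
  return $rc
}

# verify_archive /tmp/[volume-name].tar.gz /path/to/original/data
```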

#### Step 2: Export Configuration

```bash
# Get current docker-compose configuration
cd ~/Documents/repos/homelab
cat hosts/[source-host]/[service-name].yaml > /tmp/service-config.yaml

# Note environment variables
docker inspect [service-name] | grep -A 50 Env
```

#### Step 3: Copy Data to Target Host

**For Small Data (< 10GB)**: Use SCP
```bash
# From your workstation
scp -r [source-host]:/volume1/docker/[service-name] /tmp/
scp -r /tmp/[service-name] [target-host]:/path/to/docker/
```

**For Large Data (> 10GB)**: Use Rsync
```bash
# From source host to target host via Tailscale
ssh [source-host]
rsync -avz --progress /volume1/docker/[service-name]/ \
  [target-host-tailscale-ip]:/path/to/docker/[service-name]/

# Monitor progress
watch -n 5 'du -sh /path/to/docker/[service-name]'
```

**For Very Large Data (> 100GB)**: Consider physical transfer
```bash
# Copy to USB drive, physically move, then copy to target
# Or use network-attached storage as intermediate
```

#### Step 4: Stop Service on Source Host

```bash
# SSH to source host
ssh [source-host]

# Stop the container
docker stop [service-name]

# Verify it's stopped
docker ps -a | grep [service-name]
```

#### Step 5: Update Git Configuration

```bash
# On your workstation
cd ~/Documents/repos/homelab

# Move service definition to new host
git mv hosts/[source-host]/[service-name].yaml \
  hosts/[target-host]/[service-name].yaml

# Update paths in the configuration file if needed
nano hosts/[target-host]/[service-name].yaml

# Update volume paths for target host
# Atlantis/Calypso: /volume1/docker/[service-name]
# NUC/VM: /home/user/docker/[service-name]
# Raspberry Pi: /home/pi/docker/[service-name]

# Commit changes
git add hosts/[target-host]/[service-name].yaml
git commit -m "Migrate [service-name] from [source-host] to [target-host]

- Move service configuration
- Update volume paths for target host
- Migration date: $(date +%Y-%m-%d)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>"

git push origin main
```

#### Step 6: Deploy on Target Host

**Via Portainer UI**:
1. Open Portainer → Select target host endpoint
2. Go to **Stacks** → **Add stack** → **Git Repository**
3. Configure:
   - Repository URL: Your git repository
   - Compose path: `hosts/[target-host]/[service-name].yaml`
   - Enable GitOps (optional)
4. Click **Deploy the stack**

**Via GitOps Auto-Sync**:
- Wait 5-10 minutes for automatic deployment
- Monitor Portainer for new stack appearance

#### Step 7: Verify Migration

```bash
# SSH to target host
ssh [target-host]

# Check container is running
docker ps | grep [service-name]

# Check logs for errors
docker logs [service-name] --tail 100

# Test service accessibility
curl http://localhost:[port]       # Internal
curl https://[service].vish.gg     # External (if applicable)

# Verify data integrity
docker exec [service-name] ls -lah /config
docker exec [service-name] ls -lah /data

# Check resource usage
docker stats [service-name] --no-stream
```

#### Step 8: Update DNS/Reverse Proxy (If Applicable)

```bash
# Update Nginx Proxy Manager or reverse proxy configuration
# Point [service].vish.gg to new host IP

# Update Cloudflare DNS if using Cloudflare Tunnels

# Update local DNS (AdGuard Home) if applicable
```

#### Step 9: Remove from Source Host

**Only after verifying target is working correctly!**

```bash
# SSH to source host
ssh [source-host]

# Remove container and volumes
docker stop [service-name]
docker rm [service-name]

# Optional: Remove volumes (only if data copied successfully)
# docker volume rm $(docker volume ls -q | grep [service-name])

# Remove data directory
rm -rf /volume1/docker/[service-name]   # BE CAREFUL!

# Remove from Portainer if manually managed
# Portainer UI → Stacks → Remove stack
```
|
||||
|
||||
### Method B: Manual Export/Import

Best for: Quick migrations without git changes, or when testing
#### Step 1: Stop and Export

```bash
# SSH to source host
ssh [source-host]

# Stop service
docker stop [service-name]

# Export container and volumes
docker run --rm \
  -v [service-name]_data:/source \
  -v /tmp:/backup \
  alpine tar czf /backup/[service-name]-data.tar.gz -C /source .

# Export configuration
docker inspect [service-name] > /tmp/[service-name]-config.json
```
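Before transferring the archive, it is cheap to confirm it actually contains files. A small sketch, using scratch data in `/tmp` as a stand-in for the real volume export:

```bash
# Create sample data standing in for the exported volume contents
mkdir -p /tmp/demo-src
echo 'sample' > /tmp/demo-src/settings.conf

# Export, then list the archive to confirm it is non-empty and readable
tar czf /tmp/demo-data.tar.gz -C /tmp/demo-src .
tar tzf /tmp/demo-data.tar.gz
echo "$(tar tzf /tmp/demo-data.tar.gz | wc -l) entries"
```

A count of zero (or a `tar` error) means the export failed and there is no point copying it to the target.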
#### Step 2: Transfer to Target

```bash
# Copy data to target host
scp /tmp/[service-name]-data.tar.gz [target-host]:/tmp/
scp /tmp/[service-name]-config.json [target-host]:/tmp/
```
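For large archives, a checksum guards against a corrupted copy: hash on the source, ship the hash alongside the archive, verify on the target. Demonstrated here with a scratch file in place of the real archive:

```bash
# On the source host: create a stand-in archive and record its checksum
printf 'payload' > /tmp/demo-transfer.tar.gz
( cd /tmp && sha256sum demo-transfer.tar.gz > demo-transfer.sha256 )

# scp both the archive and the .sha256 file, then on the target host:
( cd /tmp && sha256sum -c demo-transfer.sha256 )   # prints "... OK" when intact
```

`sha256sum -c` exits non-zero on a mismatch, so it can gate the import step in a script.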
#### Step 3: Import on Target

```bash
# SSH to target host
ssh [target-host]

# Create volume
docker volume create [service-name]_data

# Import data
docker run --rm \
  -v [service-name]_data:/target \
  -v /tmp:/backup \
  alpine tar xzf /backup/[service-name]-data.tar.gz -C /target

# Create and start container using saved configuration
# Adjust paths and ports as needed
docker create --name [service-name] \
  [options-from-config.json] \
  [image:tag]

docker start [service-name]
```
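Turning the saved `config.json` back into `docker create` flags by hand is error-prone. A hedged sketch of extracting just the environment variables and restart policy with a short embedded Python script (assumes `python3` is available); the inline JSON is a tiny stand-in for the real `/tmp/[service-name]-config.json`, whose top level is an array, as `docker inspect` emits:

```bash
# Stand-in for the saved `docker inspect` output
cat > /tmp/sample-config.json <<'EOF'
[{"Config": {"Env": ["PUID=1000", "PGID=1000"], "Image": "nginx:alpine"},
  "HostConfig": {"RestartPolicy": {"Name": "unless-stopped"}}}]
EOF

# Extract env vars, restart policy, and image as docker create arguments
python3 - /tmp/sample-config.json <<'EOF' | tee /tmp/create-flags.txt
import json, sys
cfg = json.load(open(sys.argv[1]))[0]
flags = [f"--env {e}" for e in cfg["Config"]["Env"]]
flags.append(f"--restart {cfg['HostConfig']['RestartPolicy']['Name']}")
print(" ".join(flags), cfg["Config"]["Image"])
EOF
```

This only covers two settings; port bindings and mounts live under `HostConfig.PortBindings` and `Mounts` and would need the same treatment. Review the generated flags before pasting them into `docker create`.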
## Post-Migration Tasks

### Update Documentation

```bash
# Update service inventory
nano docs/services/VERIFIED_SERVICE_INVENTORY.md

# Update the host column for migrated service
# | Service | Host       | Port | URL                     | Status    |
# | Service | [NEW-HOST] | 8080 | https://service.vish.gg | ✅ Active |
```
### Update Monitoring

```bash
# Update Prometheus configuration if needed
nano prometheus/prometheus.yml

# Update target host IP for scraped metrics
# Restart Prometheus if configuration changed
```
### Test Backups

```bash
# Verify backups work on new host
./backup.sh --test

# Ensure service data is included in backup
ls -lah /path/to/backups/[service-name]
```
### Performance Baseline

```bash
# Document baseline performance on new host
docker stats [service-name] --no-stream

# Monitor for 24 hours to ensure stability
```
## Verification Checklist

- [ ] Service running on target host: `docker ps`
- [ ] All data migrated correctly
- [ ] Configuration preserved
- [ ] Logs show no errors: `docker logs [service-name]`
- [ ] External access works (if applicable)
- [ ] Internal service connectivity works
- [ ] Reverse proxy updated (if applicable)
- [ ] DNS records updated (if applicable)
- [ ] Monitoring updated
- [ ] Documentation updated
- [ ] Backups include new location
- [ ] Old host cleaned up
- [ ] Users notified of any URL changes
## Rollback Procedure

If migration fails or causes issues:

### Quick Rollback (Within 24 Hours)

```bash
# SSH to source host
ssh [source-host]

# Restore from backup
docker import /tmp/[service-name]-backup.tar [service-name]:backup

# Or redeploy from git (revert git changes)
cd ~/Documents/repos/homelab
git revert HEAD
git push origin main

# Restart service on source host
# Via Portainer or:
docker start [service-name]
```
### Full Rollback (After Cleanup)

```bash
# Restore from backup
./restore.sh [backup-date]

# Redeploy to original host
# Follow original deployment procedure
```
## Troubleshooting

### Issue: Data Transfer Very Slow

**Symptoms**: Rsync taking hours for moderate data

**Solutions**:

```bash
# Use compression for better network performance
rsync -avz --compress-level=6 --progress /source/ [target]:/dest/

# Or parallelize per top-level directory
# (plain scp does not create missing directories on the target)
# Install GNU parallel first: sudo apt-get install parallel
ls /source | parallel -j 4 rsync -a /source/{} [target]:/dest/

# For extremely large transfers, consider:
# 1. Physical USB drive transfer
# 2. NFS mount between hosts
# 3. Transfer during off-peak hours
```
### Issue: Service Won't Start on Target Host

**Symptoms**: Container starts then immediately exits

**Solutions**:

```bash
# Check logs
docker logs [service-name]

# Common issues:
# 1. Path issues - Update volume paths in compose file
# 2. Permission issues - Check PUID/PGID
# 3. Port conflicts - Check if port already in use
# 4. Missing dependencies - Ensure all required services running

# Fix permissions
docker exec [service-name] chown -R 1000:1000 /config /data
```
### Issue: Lost Configuration Data

**Symptoms**: Service starts but settings are default

**Solutions**:

```bash
# Check if volumes mounted correctly
docker inspect [service-name] | grep -A 10 Mounts

# Restore configuration from backup
docker stop [service-name]
docker run --rm -v [service-name]_config:/target -v /tmp:/backup alpine \
  tar xzf /backup/config-backup.tar.gz -C /target
docker start [service-name]
```
### Issue: Network Connectivity Problems

**Symptoms**: Service can't reach other services

**Solutions**:

```bash
# Check network configuration
docker network ls
docker network inspect [network-name]

# Add service to required networks
docker network connect [network-name] [service-name]

# Verify DNS resolution
docker exec [service-name] ping [other-service]
```
## Migration Examples

### Example 1: Migrate Uptime Kuma from Calypso to Homelab VM

```bash
# 1. Backup on Calypso
ssh calypso
docker stop uptime-kuma
tar czf /tmp/uptime-kuma-data.tar.gz /volume1/docker/uptime-kuma

# 2. Transfer
scp /tmp/uptime-kuma-data.tar.gz homelab-vm:/tmp/

# 3. Update git
cd ~/Documents/repos/homelab
git mv hosts/synology/calypso/uptime-kuma.yaml \
  hosts/vms/homelab-vm/uptime-kuma.yaml
# Update paths in file
sed -i 's|/volume1/docker/uptime-kuma|/home/user/docker/uptime-kuma|g' \
  hosts/vms/homelab-vm/uptime-kuma.yaml

# 4. Deploy on target
git add . && git commit -m "Migrate Uptime Kuma to Homelab VM" && git push

# 5. Verify and cleanup Calypso
```
### Example 2: Migrate AdGuard Home between Hosts

```bash
# AdGuard Home requires DNS configuration updates
# 1. Note current DNS settings on clients
# 2. Migrate service (as above)
# 3. Update client DNS to point to new host IP
# 4. Test DNS resolution from clients
```
## Related Documentation

- [Add New Service](add-new-service.md)
- [Infrastructure Overview](../infrastructure/INFRASTRUCTURE_OVERVIEW.md)
- [Backup Strategies](../admin/backup-strategies.md)
- [Deployment Workflow](../admin/DEPLOYMENT_WORKFLOW.md)

## Change Log

- 2026-02-14 - Initial creation with multiple migration methods
- 2026-02-14 - Added large data migration strategies
---

<!-- New file: docs/runbooks/synology-dsm-upgrade.md (622 lines) -->
# Synology DSM Upgrade Runbook

## Overview

This runbook provides a safe procedure for upgrading DiskStation Manager (DSM) on Synology NAS devices (Atlantis DS1823xs+ and Calypso DS723+). The procedure minimizes downtime and ensures data integrity during major and minor DSM upgrades.
## Prerequisites

- [ ] DSM admin credentials
- [ ] Complete backup of NAS (Hyper Backup or external)
- [ ] Backup verification completed
- [ ] List of installed packages and their versions
- [ ] SSH access to NAS (for troubleshooting)
- [ ] Maintenance window scheduled (1-3 hours)
- [ ] All Docker containers documented and backed up
- [ ] Tailscale or alternative remote access configured
## Metadata

- **Estimated Time**: 1-3 hours (including backups and verification)
- **Risk Level**: Medium-High (system-level upgrade)
- **Requires Downtime**: Yes (30-60 minutes for upgrade itself)
- **Reversible**: Limited (can rollback but complicated)
- **Tested On**: 2026-02-14
## Upgrade Types

| Type | Example | Risk | Downtime | Reversibility |
|------|---------|------|----------|---------------|
| **Patch Update** | 7.2.1 → 7.2.2 | Low | 15-30 min | Easy |
| **Minor Update** | 7.2 → 7.3 | Medium | 30-60 min | Moderate |
| **Major Update** | 7.x → 8.0 | High | 60-120 min | Difficult |
## Pre-Upgrade Planning

### Step 1: Check Compatibility

Before upgrading, verify compatibility:

```bash
# SSH to NAS
ssh admin@atlantis  # or calypso

# Check current DSM version
cat /etc.defaults/VERSION

# Check hardware compatibility
# Visit: https://www.synology.com/en-us/dsm
# Verify your model supports the target DSM version

# Check RAM requirements (DSM 7.2+ needs at least 1GB)
free -h

# Check disk space (need at least 5GB free in system partition)
df -h
```
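Eyeballing `df -h` is easy to get wrong under pressure. The free-space requirement above can be turned into a pass/fail check; this is a minimal sketch assuming the 5 GB threshold and using `/` as a stand-in for the DSM system partition (adjust the mount point for your model):

```bash
# Fail fast if the filesystem has less than 5 GB free
required_kb=$((5 * 1024 * 1024))
free_kb=$(df -Pk / | awk 'NR==2 {print $4}')
if [ "$free_kb" -lt "$required_kb" ]; then
  echo "NOT OK: only ${free_kb} KB free, need ${required_kb} KB"
else
  echo "OK: ${free_kb} KB free"
fi
```

`df -P` forces POSIX single-line output so the `awk` column index stays stable across platforms.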
### Step 2: Document Current State

Create a pre-upgrade snapshot of your configuration:

```bash
# Document installed packages
# DSM UI → Package Center → Installed
# Take screenshot or note down:
# - Package names and versions
# - Custom configurations

# Export Docker Compose files (already in git)
cd ~/Documents/repos/homelab
git status  # Ensure all configs are committed

# Document running containers
ssh atlantis "docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}' > /volume1/docker/pre-upgrade-containers.txt"
ssh calypso "docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}' > /volume1/docker/pre-upgrade-containers.txt"

# Export package list
ssh atlantis "synopkg list > /volume1/docker/pre-upgrade-packages.txt"
ssh calypso "synopkg list > /volume1/docker/pre-upgrade-packages.txt"
```
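Keeping each pre-upgrade file loose makes later diffs awkward. One option is to collect them into a dated snapshot directory; this sketch uses `/tmp` paths and echo stand-ins for the real `docker ps` and `synopkg` output, so the layout (not the commands) is the point:

```bash
# One timestamped folder per upgrade attempt
snap="/tmp/pre-upgrade-$(date +%Y%m%d)"
mkdir -p "$snap"

# Stand-ins; on the NAS redirect the real commands here instead:
#   docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}' > "$snap/containers.txt"
#   synopkg list > "$snap/packages.txt"
echo 'demo containers' > "$snap/containers.txt"
echo 'demo packages'   > "$snap/packages.txt"

ls "$snap"
```

After the upgrade, `diff` against the files in this folder instead of hunting for loose `.txt` files in `/volume1/docker/`.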
### Step 3: Backup Everything

**Critical**: Complete a full backup before proceeding.

```bash
# 1. Backup via Hyper Backup (if configured)
# DSM UI → Hyper Backup → Backup Now

# 2. Export DSM configuration
# DSM UI → Control Panel → Update & Restore → Configuration Backup → Back Up Configuration

# 3. Backup Docker volumes
cd ~/Documents/repos/homelab
./backup.sh

# 4. Snapshot (if using Btrfs)
# Storage Manager → Storage Pool → Snapshots → Take Snapshot

# 5. Verify backups
ls -lh /volume1/backups/
# Ensure backup completed successfully
```
### Step 4: Notify Users

If other users rely on your homelab:

```bash
# Send notification (via your notification system)
curl -H "Title: Scheduled Maintenance" \
  -H "Priority: high" \
  -H "Tags: maintenance" \
  -d "DSM upgrade scheduled for [DATE/TIME]. Services will be unavailable for approximately 1-2 hours." \
  https://ntfy.sh/REDACTED_TOPIC

# Or send notification via Signal/Discord/etc.
```
### Step 5: Plan Rollback Strategy

Document your rollback plan:

- [ ] Backup location verified: ___________
- [ ] Restore procedure tested: Yes/No
- [ ] Alternative access method ready (direct keyboard/monitor)
- [ ] Support contact available if needed
## Upgrade Procedure

### Step 1: Download DSM Update

**Option A: Via DSM UI (Recommended)**

1. Log in to DSM web interface
2. **Control Panel** → **Update & Restore**
3. **DSM Update** tab
4. If update available, click **Download** (don't install yet)
5. Wait for download to complete
6. Read release notes carefully

**Option B: Manual Download**

1. Visit Synology Download Center
2. Find your model (DS1823xs+ or DS723+)
3. Download appropriate DSM version
4. Upload via DSM → **Manual DSM Update**
### Step 2: Prepare for Downtime

```bash
# Stop non-critical Docker containers (optional, reduces memory pressure)
ssh atlantis
docker stop $(docker ps -q --filter "name=pattern")  # Stop specific containers

# Or stop all non-critical containers
# Review which containers can be safely stopped
docker ps
docker stop container1 container2 container3

# Leave critical services running:
# - Portainer (for post-upgrade management)
# - Monitoring (to track upgrade progress)
# - Core network services (AdGuard, VPN if critical)
```
### Step 3: Initiate Upgrade

**Via DSM UI**:

1. **Control Panel** → **Update & Restore** → **DSM Update**
2. Click **Update Now**
3. Review release notes and warnings
4. Check **Yes, I understand I need to perform a backup before updating DSM**
5. Click **OK** to start

**Via SSH** (advanced, not recommended unless necessary):

```bash
# SSH to NAS
ssh admin@atlantis

# Start upgrade manually
sudo synoupgrade --start /volume1/@tmp/upd@te/update.pat

# Monitor progress
tail -f /var/log/messages
```
### Step 4: Monitor Upgrade Progress

During upgrade, you'll see:

1. **Checking system**: Verifying prerequisites
2. **Downloading**: If not pre-downloaded
3. **Installing**: Actual upgrade process (30-45 minutes)
4. **Optimizing system**: Post-install tasks
5. **Reboot**: System will restart

**Monitoring via SSH** (if you have access during upgrade):

```bash
# Watch upgrade progress
tail -f /var/log/upgrade.log

# Or watch system messages
tail -f /var/log/messages | grep -i upgrade
```

**Expected timeline**:

- Preparation: 5-10 minutes
- Installation: 30-45 minutes
- First reboot: 3-5 minutes
- Optimization: 10-20 minutes
- Final reboot: 3-5 minutes
- **Total**: 60-90 minutes
### Step 5: Wait for Completion

**⚠️ IMPORTANT**: Do not power off or interrupt the upgrade!

Signs of normal upgrade:

- DSM UI becomes inaccessible
- NAS may beep once (starting upgrade)
- Disk lights active
- NAS will reboot 1-2 times
- Final beep when complete
### Step 6: First Login After Upgrade

1. Wait for NAS to complete all restarts
2. Access DSM UI (may take 5-10 minutes after last reboot)
3. Log in with admin credentials
4. You may see "Optimization in progress" - this is normal
5. Review the "What's New" page
6. Accept any new terms/agreements
## Post-Upgrade Verification

### Step 1: Verify System Health

```bash
# SSH to NAS
ssh admin@atlantis

# Check DSM version
cat /etc.defaults/VERSION
# Should show new version

# Check system status
sudo syno_disk_check

# Check RAID status
cat /proc/mdstat

# Check disk health
sudo smartctl -a /dev/sda

# Verify storage pools
synospace --get
```

Via DSM UI:

- **Storage Manager** → Verify all pools are "Healthy"
- **Resource Monitor** → Check CPU, RAM, network
- **Log Center** → Review any errors during upgrade
### Step 2: Verify Packages

```bash
# Check all packages are running
synopkg list

# Compare with pre-upgrade package list
diff /volume1/docker/pre-upgrade-packages.txt <(synopkg list)

# Start any stopped packages
# DSM UI → Package Center → Installed
# Check each package, start if needed
```

Common packages to verify:

- [ ] Docker
- [ ] Synology Drive
- [ ] Hyper Backup
- [ ] Snapshot Replication
- [ ] Any other installed packages
### Step 3: Verify Docker Containers

```bash
# SSH to NAS
ssh atlantis

# Check Docker is running
docker --version
docker info

# Check all containers
docker ps -a

# Compare with pre-upgrade state
diff /volume1/docker/pre-upgrade-containers.txt <(docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}')

# Start stopped containers
docker start $(docker ps -a -q -f status=exited)

# Check container logs for errors
docker ps --format "{{.Names}}" | xargs -I {} sh -c 'echo "=== {} ===" && docker logs --tail 20 {}'
```
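One caveat on the `diff` above: `docker ps` does not guarantee a stable ordering, so an unchanged container set can still show as a difference. Sorting both sides first avoids that false alarm; demonstrated with stand-in name lists rather than real `docker ps` output:

```bash
# Same containers, different order - a naive diff would flag this
printf 'portainer\nadguard\n' > /tmp/pre-names.txt
printf 'adguard\nportainer\n' > /tmp/post-names.txt

# Sort before diffing so only real additions/removals show up
sort /tmp/pre-names.txt  > /tmp/pre-names.sorted
sort /tmp/post-names.txt > /tmp/post-names.sorted
diff /tmp/pre-names.sorted /tmp/post-names.sorted && echo "no container drift"
```

In practice, strip the `Status` column before comparing too, since uptimes will always differ across a reboot.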
### Step 4: Test Key Services

Verify critical services are working:

```bash
# Test network connectivity
ping -c 4 8.8.8.8
curl -I https://google.com

# Test Docker networking
docker exec [container] ping -c 2 8.8.8.8

# Test Portainer access
curl http://localhost:9000

# Test Plex
curl http://localhost:32400/web

# Test monitoring
curl http://localhost:3000  # Grafana
curl http://localhost:9090  # Prometheus
```

Via browser:

- [ ] Portainer accessible
- [ ] Grafana dashboards loading
- [ ] Plex/Jellyfin streaming works
- [ ] File shares accessible
- [ ] SSO (Authentik) working
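The individual `curl` checks above can be collapsed into one loop with a PASS/FAIL summary. A hedged sketch; the ports come from the runbook but are otherwise assumptions, and `http_ok` is a stub so the loop runs offline (swap in the `curl` body on the NAS):

```bash
# Stand-in that always succeeds; on the NAS use:
# http_ok() { curl -fsS -m 5 -o /dev/null "$1"; }
http_ok() { true; }

check_url() {  # $1 = label, $2 = url
  if http_ok "$2"; then echo "PASS $1"; else echo "FAIL $1"; fi
}

check_url portainer  http://localhost:9000
check_url plex       http://localhost:32400/web
check_url grafana    http://localhost:3000
check_url prometheus http://localhost:9090
```

Any `FAIL` line points directly at the service that needs attention, which is quicker than reading raw `curl` output for each port.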
### Step 5: Verify Scheduled Tasks

```bash
# Check cron jobs
crontab -l

# Via DSM UI
# Control Panel → Task Scheduler
# Verify all tasks are enabled
```
### Step 6: Test Remote Access

- [ ] Tailscale VPN working
- [ ] External access via domain (if configured)
- [ ] SSH access working
- [ ] Mobile app access working (DS File, DS Photo, etc.)
## Post-Upgrade Optimization

### Step 1: Update Packages

After DSM upgrade, packages may need updates:

1. **Package Center** → **Update** tab
2. Update available packages
3. Prioritize critical packages:
   - Docker (if updated)
   - Surveillance Station (if used)
   - Drive, Office, etc.
### Step 2: Review New Features

DSM upgrades often include new features:

1. Review "What's New" page
2. Check for new security features
3. Review changed settings
4. Update documentation if needed
### Step 3: Re-enable Auto-Updates (If Disabled)

Via DSM UI:

- **Control Panel** → **Update & Restore** → **DSM Update**
- Check "Notify me when DSM updates are available"
- Or "Install latest DSM updates automatically" (if you trust auto-updates)
### Step 4: Update Documentation

```bash
cd ~/Documents/repos/homelab

# Update infrastructure docs
nano docs/infrastructure/INFRASTRUCTURE_OVERVIEW.md

# Note DSM version upgrade
# Document any configuration changes
# Update troubleshooting docs if procedures changed

git add .
git commit -m "Update docs: DSM upgraded to X.X on Atlantis/Calypso"
git push
```
## Troubleshooting

### Issue: Upgrade Fails or Stalls

**Symptoms**: Progress bar stuck, no activity for >30 minutes

**Solutions**:

```bash
# If you have SSH access:
ssh admin@atlantis

# Check if upgrade process is running
ps aux | grep -i upgrade

# Check system logs
tail -100 /var/log/messages
tail -100 /var/log/upgrade.log

# Check disk space
df -h

# If completely stuck (>1 hour no progress):
# 1. Do NOT force reboot unless absolutely necessary
# 2. Contact Synology support first
# 3. As last resort, force reboot via physical button
```
### Issue: NAS Won't Boot After Upgrade

**Symptoms**: Cannot access DSM UI, NAS beeping continuously

**Solutions**:

1. **Check beep pattern** (indicates specific error)
   - 1 beep: Normal boot
   - 3 beeps: RAM issue
   - 4 beeps: Disk issue
   - Continuous: Critical failure

2. **Try Safe Mode**:
   - Power off NAS
   - Hold reset button
   - Power on while holding reset
   - Hold for 4 seconds until beep
   - Release and wait for boot

3. **Check via Synology Assistant**:
   - Download Synology Assistant on PC
   - Scan network for NAS
   - May show recovery mode option

4. **Last resort: Reinstall DSM**:
   - Download latest DSM .pat file
   - Access via http://[nas-ip]:5000
   - Install DSM (will not erase data)
### Issue: Docker Not Working After Upgrade

**Symptoms**: Docker containers won't start, Docker package shows stopped

**Solutions**:

```bash
# SSH to NAS
ssh admin@atlantis

# Check Docker status
sudo synoservicectl --status pkgctl-Docker

# Restart Docker
sudo synoservicectl --restart pkgctl-Docker

# If Docker won't start, check logs
cat /var/log/docker.log

# Reinstall Docker package (preserves volumes)
# Via DSM UI → Package Center → Docker → Uninstall
# Then reinstall Docker
# Your volumes and data will be preserved
```
### Issue: Network Shares Not Accessible

**Symptoms**: Can't connect to SMB/NFS shares

**Solutions**:

```bash
# Check share services
sudo synoservicectl --status smbd  # SMB
sudo synoservicectl --status nfsd  # NFS

# Restart services
sudo synoservicectl --restart smbd
sudo synoservicectl --restart nfsd

# Check firewall
# Control Panel → Security → Firewall
# Ensure file sharing ports allowed
```
### Issue: Performance Degradation After Upgrade

**Symptoms**: Slow response, high CPU/RAM usage

**Solutions**:

```bash
# Check what's using resources
top
htop  # If installed

# Via DSM UI → Resource Monitor
# Identify resource-hungry processes

# Common causes:
# 1. Indexing in progress (Photos, Drive, Universal Search)
#    - Wait for indexing to complete (can take hours)
# 2. Optimization running
#    - Check: ps aux | grep optimize
#    - Let it complete
# 3. Too many containers started at once
#    - Stagger container startup
```
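For cause 3, "stagger container startup" can be as simple as a loop with a delay between starts. A sketch with placeholder container names; `run` is indirection so the loop can be tried without Docker (on the NAS, give it a `docker start` body as shown):

```bash
# Stand-in; on the NAS use: run() { docker start "$1"; }
run() { echo "would start: $1"; }

# Start containers one at a time, giving each a head start
for name in portainer adguard grafana prometheus; do
  run "$name"
  sleep 1   # tune the delay (e.g. 10-30s) for heavy containers
done
```

Ordering matters too: start dependencies (databases, reverse proxy, DNS) before the services that rely on them.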
## Rollback Procedure

⚠️ **WARNING**: Rollback is complex and risky. Only attempt if absolutely necessary.

### Method 1: DSM Archive (If Available)

```bash
# SSH to NAS
ssh admin@atlantis

# Check if previous DSM version archived
ls -la /volume1/@appstore/

# If archive exists, you can attempt rollback
# CAUTION: This is not officially supported and may cause data loss
```
### Method 2: Restore from Backup

If upgrade caused critical issues:

1. REDACTED_APP_PASSWORD
2. Restore from Hyper Backup
3. Or restore from configuration backup:
   - **Control Panel** → **Update & Restore**
   - **Configuration Backup** → **Restore**
### Method 3: Fresh Install (Nuclear Option)

⚠️ **DANGER**: This will erase everything. Only for catastrophic failure.

1. Download previous DSM version
2. Install via Synology Assistant in "Recovery Mode"
3. Restore from complete backup
4. Reconfigure everything
## Best Practices

### Timing

- Schedule upgrades during low-usage periods
- Allow 3-4 hour maintenance window
- Don't upgrade before important events
- Wait 2-4 weeks after major DSM release (let others find bugs)

### Testing

- If you have 2 NAS units, upgrade one first
- Test on less critical NAS before primary
- Read community forums for known issues
- Review Synology release notes thoroughly

### Preparation

- Always complete full backup
- Test backup restore before upgrade
- Document all configurations
- Have physical access to NAS if possible
- Keep Synology Assistant installed on PC

### Post-Upgrade

- Monitor closely for 24-48 hours
- Check logs daily for first week
- Report any bugs to Synology
- Update your documentation
## Verification Checklist

- [ ] DSM upgraded to target version
- [ ] All storage pools healthy
- [ ] All packages running
- [ ] All Docker containers running
- [ ] Network shares accessible
- [ ] Remote access working (Tailscale, QuickConnect)
- [ ] Scheduled tasks running
- [ ] Monitoring dashboards functional
- [ ] Backups completing successfully
- [ ] No errors in system logs
- [ ] Performance normal
- [ ] Documentation updated
## Related Documentation

- [Infrastructure Overview](../infrastructure/INFRASTRUCTURE_OVERVIEW.md)
- [Emergency Access Guide](../troubleshooting/EMERGENCY_ACCESS_GUIDE.md)
- [Disaster Recovery](../troubleshooting/disaster-recovery.md)
- [Synology Disaster Recovery](../troubleshooting/synology-disaster-recovery.md)
- [Backup Strategies](../admin/backup-strategies.md)

## Additional Resources

- [Synology DSM Release Notes](https://www.synology.com/en-us/releaseNote/DSM)
- [Synology Community Forums](https://community.synology.com/)
- [Synology Knowledge Base](https://kb.synology.com/)

## Change Log

- 2026-02-14 - Initial creation
- 2026-02-14 - Added comprehensive troubleshooting and rollback procedures