Sanitized mirror from private repository - 2026-04-20 01:32:01 UTC
# Disk Full Procedure Runbook

## Overview

This runbook provides procedures for handling disk space emergencies across all homelab hosts. It includes immediate actions, root cause analysis, and long-term solutions to prevent recurrence.

## Prerequisites

- [ ] SSH access to the affected host
- [ ] Root/sudo privileges on the host
- [ ] Access to monitoring dashboards
- [ ] Backup verification capability

## Metadata

- **Estimated Time**: 30-90 minutes (depending on severity)
- **Risk Level**: High (data loss is possible if not handled carefully)
- **Requires Downtime**: Minimal (services may need to be stopped temporarily)
- **Reversible**: Partially (deleted data cannot be recovered)
- **Tested On**: 2026-02-14

## Severity Levels

| Level | Disk Usage | Action Required | Urgency |
|-------|------------|-----------------|---------|
| 🟢 **Normal** | < 80% | Monitor | Low |
| 🟡 **Warning** | 80-90% | Plan cleanup | Medium |
| 🟠 **Critical** | 90-95% | Immediate cleanup | High |
| 🔴 **Emergency** | > 95% | Emergency response | Critical |
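
The thresholds above can be applied mechanically. A small helper sketch (assumes GNU `df`; boundary values such as exactly 90% are assigned to the lower band):

```bash
# Map a disk-usage percentage to the severity levels in the table above.
severity() {
  local pct=$1
  if   [ "$pct" -gt 95 ]; then echo "EMERGENCY"
  elif [ "$pct" -gt 90 ]; then echo "CRITICAL"
  elif [ "$pct" -ge 80 ]; then echo "WARNING"
  else                         echo "NORMAL"
  fi
}

# Report severity for every real filesystem (skips tmpfs/devtmpfs).
df --output=target,pcent -x tmpfs -x devtmpfs | tail -n +2 |
while read -r mount pct; do
  echo "$mount: $pct $(severity "${pct%\%}")"
done
```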

## Quick Triage

First, determine which host and volume are affected:

```bash
# Check disk usage on all hosts
ssh atlantis "df -h"
ssh calypso "df -h"
ssh concordnuc "df -h"
ssh homelab-vm "df -h"
ssh raspberry-pi-5 "df -h"
```
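
The same per-host sweep can be wrapped in a loop. A sketch with the command runner injectable, so the loop can be exercised without live SSH access (hostnames as listed above):

```bash
# Hostnames as listed in the triage commands above.
HOSTS="atlantis calypso concordnuc homelab-vm raspberry-pi-5"

# Run "df -h" on every host. The runner defaults to ssh but is
# injectable, so the loop can be tested without reachable hosts.
check_all_hosts() {
  local runner=${1:-ssh}
  local h
  for h in $HOSTS; do
    echo "=== $h ==="
    "$runner" "$h" "df -h"
  done
}
```

Calling `check_all_hosts` with no arguments performs the same five `ssh <host> "df -h"` checks as above.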

## Emergency Procedure (>95% Full)

### Step 1: Immediate Space Recovery

**Goal**: Free up 5-10% of disk space immediately to prevent system issues.

```bash
# SSH to the affected host
ssh [hostname]

# Identify what's consuming space
df -h
du -sh /* 2>/dev/null | sort -rh | head -20

# Quick win - clear the Docker cache
docker system df                           # See what Docker is using
docker system prune -a --volumes --force   # Reclaim space (BE CAREFUL!)

# This typically frees 10-50GB depending on your setup
```

**⚠️ WARNING**: `docker system prune -a --volumes` will remove:

- Stopped containers
- Unused networks
- All images not used by a running container
- Build cache
- Unused volumes (with the `--volumes` flag)

**Safer alternative** if you're unsure:

```bash
# Less aggressive - removes only stopped containers and dangling images
docker system prune --force
```

### Step 2: Clear Log Files

```bash
# Find large log files
find /var/log -type f -size +100M -exec ls -lh {} \; | sort -k 5 -rh

# Clear the systemd journal (keeps the last 3 days)
sudo journalctl --vacuum-time=3d

# Truncate Docker container logs
sudo sh -c 'truncate -s 0 /var/lib/docker/containers/*/*-json.log'

# For Synology NAS
sudo find /volume1/@docker/containers -name "*-json.log" -size +100M -exec truncate -s 0 {} \;
```

### Step 3: Remove Old Docker Images

```bash
# List images by size
docker images --format "{{.Repository}}:{{.Tag}}\t{{.Size}}" | sort -k 2 -rh | head -20

# Remove specific old images
docker image rm [image:tag]

# Remove all unused images
docker image prune -a --force
```

### Step 4: Verify Space Recovered

```bash
# Check current usage
df -h

# Verify critical services are running
docker ps

# Check container logs for errors
docker ps --format "{{.Names}}" | xargs -I {} sh -c 'echo "=== {} ===" && docker logs --tail 20 {}'
```

## Detailed Analysis Procedure

Once the immediate danger has passed, perform a thorough analysis:

### Step 1: Identify Space Consumers

```bash
# Comprehensive disk usage analysis
sudo du -h --max-depth=2 / 2>/dev/null | sort -rh | head -30

# For Synology NAS specifically
sudo du -h --max-depth=2 /volume1 2>/dev/null | sort -rh | head -30

# Check Docker volumes
docker volume ls
docker system df -v

# Check specific large directories
du -sh /var/lib/docker/* | sort -rh
du -sh /volume1/docker/* | sort -rh  # Synology
```

### Step 2: Analyze by Service

Create a space usage report:

```bash
# Create analysis script
cat > /tmp/analyze-space.sh << 'EOF'
#!/bin/bash
echo "=== Docker Container Filesystems ==="
docker ps --format "{{.Names}}" | while read -r container; do
    size=$(docker exec "$container" du -sh / 2>/dev/null | awk '{print $1}')
    echo "$container: $size"
done | sort -rh -k2

echo ""
echo "=== Docker Volumes ==="
docker volume ls --format "{{.Name}}" | while read -r vol; do
    size=$(docker volume inspect "$vol" --format '{{.Mountpoint}}' | xargs sudo du -sh 2>/dev/null | awk '{print $1}')
    echo "$vol: $size"
done | sort -rh -k2

echo ""
echo "=== Log Files Over 100MB ==="
find /var/log -type f -size +100M -exec ls -lh {} \; 2>/dev/null
EOF

chmod +x /tmp/analyze-space.sh
/tmp/analyze-space.sh
```

### Step 3: Categorize Findings

Identify the primary space consumers:

| Category | Typical Culprits | Safe to Delete? |
|----------|------------------|-----------------|
| **Docker Images** | Old/unused image versions | ✅ Yes (if unused) |
| **Docker Volumes** | Database growth, media cache | ⚠️ Maybe (check first) |
| **Log Files** | Application logs, system logs | ✅ Yes (after review) |
| **Media Files** | Plex, Jellyfin transcodes | ✅ Yes (transcodes) |
| **Backups** | Old backup archives | ✅ Yes (keep recent) |
| **Application Data** | Various service data | ❌ No (review first) |

## Cleanup Strategies by Service Type

### Media Services (Plex, Jellyfin)

```bash
# Clear Plex transcode cache
docker exec plex rm -rf /transcode/*

# Clear Jellyfin transcode cache
docker exec jellyfin rm -rf /config/data/transcodes/*

# Find and remove old media previews
find /volume1/plex/Library/Application\ Support/Plex\ Media\ Server/Cache -type f -mtime +30 -delete
```

### *arr Suite (Sonarr, Radarr, etc.)

```bash
# Remove old automatic backups
docker exec sonarr find /config/Backups -type f -mtime +30 -delete
docker exec radarr find /config/Backups -type f -mtime +30 -delete

# Clean up old logs
docker exec sonarr find /config/logs -type f -mtime +30 -delete
docker exec radarr find /config/logs -type f -mtime +30 -delete
```

### Database Services (PostgreSQL, MariaDB)

```bash
# Check database sizes
docker exec postgres psql -U user -c "SELECT pg_database.datname, pg_size_pretty(pg_database_size(pg_database.datname)) FROM pg_database;"

# Vacuum databases (PostgreSQL; FULL locks tables while it runs)
docker exec postgres vacuumdb -U user --all --full --analyze

# Check MariaDB size (-it so the -p password prompt works)
docker exec -it mariadb mysql -u root -p -e "SELECT table_schema AS 'Database', ROUND(SUM(data_length + index_length) / 1024 / 1024, 2) AS 'Size (MB)' FROM information_schema.TABLES GROUP BY table_schema;"
```

### Monitoring Services (Prometheus, Grafana)

```bash
# Check Prometheus storage size
du -sh /volume1/docker/prometheus

# Prometheus retention is set via startup flags, not prometheus.yml
# Default: --storage.tsdb.retention.time=15d
# Consider reducing retention if space is critical

# Clear old Grafana sessions
docker exec grafana find /var/lib/grafana/sessions -type f -mtime +7 -delete
```

### Immich (Photo Management)

```bash
# Check Immich storage usage
docker exec immich-server df -h /usr/src/app/upload

# Immich uses a lot of space for:
# - Original photos
# - Thumbnails
# - Encoded videos
# - ML models

# Remove stale partial uploads from the staging directory
# (upload/upload holds in-progress uploads, NOT the photo library -
# verify the path on your install before deleting anything)
docker exec immich-server find /usr/src/app/upload/upload -type f -mtime +90 -delete
```

## Long-Term Solutions

### Solution 1: Configure Log Rotation

Create proper log rotation for Docker containers:

```bash
# Write log rotation settings to the Docker daemon config
# (merge by hand if the file already contains other options)
sudo tee /etc/docker/daemon.json << 'EOF'
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
EOF

# Note: the settings apply only to containers created after the restart

# Restart Docker
sudo systemctl restart docker  # Linux
# OR for Synology (DSM 6; on DSM 7, restart the Container Manager package)
sudo synoservicectl --restart pkgctl-Docker
```

### Solution 2: Set Up Automated Cleanup

Create a cleanup script and schedule it with cron:

```bash
# Create cleanup script
sudo tee /usr/local/bin/homelab-cleanup.sh << 'EOF'
#!/bin/bash
# Homelab Automated Cleanup Script

# Remove stopped containers older than 7 days
docker container prune --filter "until=168h" --force

# Remove unused images older than 30 days
docker image prune --all --filter "until=720h" --force

# Remove unused volumes (BE CAREFUL - only if you're sure)
# docker volume prune --force

# Clear journal logs older than 7 days
journalctl --vacuum-time=7d

# Clear old backups (keep last 30 days)
find /volume1/backups -type f -mtime +30 -delete

echo "Cleanup completed at $(date)" >> /var/log/homelab-cleanup.log
EOF

# Make executable
sudo chmod +x /usr/local/bin/homelab-cleanup.sh

# Add to cron (runs weekly on Sunday at 3 AM)
(crontab -l 2>/dev/null; echo "0 3 * * 0 /usr/local/bin/homelab-cleanup.sh") | crontab -
```

### Solution 3: Configure Service-Specific Retention

Update each service with appropriate retention policies:

**Prometheus** (startup flags, e.g. in `docker-compose.yml`; retention is not configured in `prometheus.yml`):
```yaml
command:
  - --storage.tsdb.retention.time=7d    # Reduce from the 15d default if needed
  - --storage.tsdb.retention.size=50GB  # Cap total TSDB size
```

**Grafana** (docker-compose.yml):
```yaml
environment:
  - GF_DATABASE_WAL=true
  - GF_DATABASE_CLEANUP_INTERVAL=168h  # Weekly cleanup
```

**Plex** (Plex settings):
- Settings → Transcoder → Transcoder temporary directory
- Settings → Scheduled Tasks → Clean Bundles (daily)
- Settings → Scheduled Tasks → Optimize Database (weekly)

### Solution 4: Monitor Disk Usage Proactively

Set up Prometheus alert rules for disk space (surfaced in the Grafana dashboards):

```yaml
# Alert rules for disk space
- alert: DiskSpaceWarning
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100) < 20
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Disk space warning on {{ $labels.instance }}"
    description: "Disk {{ $labels.mountpoint }} has less than 20% free space"

- alert: DiskSpaceCritical
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100) < 10
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "CRITICAL: Disk space on {{ $labels.instance }}"
    description: "Disk {{ $labels.mountpoint }} has less than 10% free space"
```

## Host-Specific Considerations

### Atlantis (Synology DS1823xs+)

```bash
# Synology-specific cleanup
# Clear rotated Synology logs
sudo find /var/log -name "*.log.*" -mtime +30 -delete

# Clear package logs
sudo find /var/packages/*/target/logs -name "*.log.*" -mtime +30 -delete

# Check storage pool status
sudo synostgpool --info

# DSM has a built-in storage analyzer
# Control Panel → Storage Manager → Storage Analyzer
```

### Calypso (Synology DS723+)

Same as Atlantis - use the Synology-specific commands above.

### Concord NUC (Ubuntu)

```bash
# Ubuntu-specific cleanup
sudo apt-get clean
sudo apt-get autoclean
sudo apt-get autoremove --purge

# Clear old kernels (keep current + 1 previous)
sudo apt-get autoremove --purge $(dpkg -l 'linux-*' | sed '/^ii/!d;/'"$(uname -r | sed "s/\(.*\)-\([^0-9]\+\)/\1/")"'/d;s/^[^ ]* [^ ]* \([^ ]*\).*/\1/;/[0-9]/!d')

# Clear thumbnail cache
rm -rf ~/.cache/thumbnails/*
```

### Homelab VM (Proxmox VM)

```bash
# VM-specific cleanup
# Clear apt cache
sudo apt-get clean

# Clear old cloud-init logs
sudo rm -rf /var/log/cloud-init*.log

# Compact the qcow2 disk (run from the Proxmox host, with the VM stopped)
# qemu-img convert -O qcow2 -c original.qcow2 compressed.qcow2
```

## Verification Checklist

After cleanup, verify:

- [ ] Disk usage below 80%: `df -h`
- [ ] All critical containers running: `docker ps`
- [ ] No errors in recent logs: `docker logs [container] --tail 50`
- [ ] Services accessible via web interface
- [ ] Monitoring dashboards show normal metrics
- [ ] Backup jobs can complete successfully
- [ ] Automated cleanup configured for the future
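
The first check can be scripted for routine review. A minimal sketch, assuming GNU `df` and the 80% threshold from this checklist:

```bash
# List mountpoints at or above a usage threshold (GNU df).
over_threshold() {
  local limit=$1
  df --output=target,pcent -x tmpfs -x devtmpfs | tail -n +2 |
  awk -v limit="$limit" '{ gsub(/%/, "", $2); if ($2 + 0 >= limit) print $1 }'
}

# Fail loudly if anything is at 80% or more.
if [ -n "$(over_threshold 80)" ]; then
  echo "FAIL: filesystems at or above 80%:"
  over_threshold 80
else
  echo "OK: all filesystems below 80%"
fi
```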

## Rollback Procedure

If cleanup causes issues:

1. **Check what was deleted**: Review command history and logs
2. **Restore from backups**: If critical data was deleted
   ```bash
   cd ~/Documents/repos/homelab
   ./restore.sh [backup-date]
   ```
3. **Recreate Docker volumes**: If volumes were accidentally pruned
4. **Restart affected services**: Redeploy from Portainer

## Troubleshooting

### Issue: Still Running Out of Space After Cleanup

**Solution**: Consider adding more storage:

- Add external USB drives
- Expand existing RAID arrays
- Move services to hosts with more space
- Archive old media to cold storage

### Issue: Docker Prune Removed Important Data

**Solution**:

- Always use `--filter` to be selective
- Never use `docker volume prune` without checking first
- Keep recent backups before major cleanup operations

### Issue: Services Won't Start After Cleanup

**Solution**:

```bash
# Check for missing volumes
docker ps -a
docker volume ls

# Check logs
docker logs [container]

# Recreate volumes if needed (restore from backup)
./restore.sh [backup-date]
```

## Prevention Checklist

- [ ] Log rotation configured for all services
- [ ] Automated cleanup script running weekly
- [ ] Monitoring alerts set up for disk space
- [ ] Retention policies configured appropriately
- [ ] Regular backup verification scheduled
- [ ] Capacity planning review quarterly

## Related Documentation

- [Infrastructure Overview](../infrastructure/INFRASTRUCTURE_OVERVIEW.md)
- [Backup Strategies](../admin/backup-strategies.md)
- [Monitoring Setup](../admin/monitoring-setup.md)
- [Troubleshooting Guide](../troubleshooting/common-issues.md)

## Change Log

- 2026-02-14 - Initial creation with host-specific procedures
- 2026-02-14 - Added service-specific cleanup strategies