# Disk Full Procedure Runbook
## Overview
This runbook provides procedures for handling disk space emergencies across all homelab hosts. It includes immediate actions, root cause analysis, and long-term solutions to prevent recurrence.
## Prerequisites
- [ ] SSH access to affected host
- [ ] Root/sudo privileges on the host
- [ ] Monitoring dashboards access
- [ ] Backup verification capability
## Metadata
- **Estimated Time**: 30-90 minutes (depending on severity)
- **Risk Level**: High (data loss possible if not handled carefully)
- **Requires Downtime**: Minimal (may need to stop services temporarily)
- **Reversible**: Partially (deleted data cannot be recovered)
- **Tested On**: 2026-02-14
## Severity Levels
| Level | Disk Usage | Action Required | Urgency |
|-------|------------|-----------------|---------|
| 🟢 **Normal** | < 80% | Monitor | Low |
| 🟡 **Warning** | 80-90% | Plan cleanup | Medium |
| 🟠 **Critical** | 90-95% | Immediate cleanup | High |
| 🔴 **Emergency** | > 95% | Emergency response | Critical |
## Quick Triage
First, determine which host and volume are affected:
```bash
# Check all hosts disk usage
ssh atlantis "df -h"
ssh calypso "df -h"
ssh concordnuc "df -h"
ssh homelab-vm "df -h"
ssh raspberry-pi-5 "df -h"
```
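The per-host output can also be scanned automatically against the severity table above. A minimal sketch, assuming `df -P`-style output; the heredoc below is illustrative sample data, not from a real host. In practice you would pipe `ssh <host> "df -P"` into `classify`:

```shell
#!/bin/sh
# Classify each filesystem line of `df` output against the severity table.
classify() {
  awk 'NR > 1 {
    pct = $5; sub(/%/, "", pct)                  # strip the % sign
    if      (pct > 95)  level = "EMERGENCY"
    else if (pct > 90)  level = "CRITICAL"
    else if (pct >= 80) level = "WARNING"
    else                level = "normal"
    printf "%-10s %3d%%  %s\n", level, pct, $6
  }'
}

classify << 'EOF'
Filesystem     1K-blocks      Used Available Use% Mounted on
/dev/sda1       51474044  49400282   2073762  96% /
/dev/sdb1      103754244  87154244  16600000  85% /volume1
tmpfs            4096000    102400   3993600   3% /tmp
EOF
```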
## Emergency Procedure (>95% Full)
### Step 1: Immediate Space Recovery
**Goal**: Free up 5-10% space immediately to prevent system issues.
```bash
# SSH to affected host
ssh [hostname]
# Identify what's consuming space
df -h
du -sh /* 2>/dev/null | sort -rh | head -20
# Quick wins - Clear Docker cache
docker system df # See what Docker is using
docker system prune -a --volumes --force # Reclaim space (BE CAREFUL!)
# This typically frees 10-50GB depending on your setup
```
**⚠️ WARNING**: `docker system prune -a --volumes` will remove:
- All stopped containers
- All unused networks
- All unused images, not just dangling ones (because of `-a`)
- All build cache
- Unused volumes (only with the `--volumes` flag)
**Safer alternative** if you're unsure:
```bash
# Less aggressive - removes only stopped containers and dangling images
docker system prune --force
```
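Before pruning, the RECLAIMABLE column of `docker system df` shows what each prune could free; summing it gives a quick estimate. A sketch that parses that output with awk; the heredoc is illustrative sample output, and in practice you would pipe `docker system df` into `reclaimable`:

```shell
#!/bin/sh
# Sum the RECLAIMABLE column of `docker system df` to estimate how much a
# prune could free.
reclaimable() {
  awk 'NR > 1 {
    # The reclaimable size is the right-most field that looks like a size
    # (the trailing "(NN%)" field, when present, does not match).
    for (i = NF; i >= 1; i--)
      if ($i ~ /^[0-9.]+(GB|MB|kB|B)$/) { v = $i; break }
    if      (v ~ /GB$/) { sub(/GB$/, "", v); sum += v }
    else if (v ~ /MB$/) { sub(/MB$/, "", v); sum += v / 1024 }
    else                { next }   # treat kB/B amounts as negligible
  }
  END { printf "~%.1fGB reclaimable\n", sum }'
}

reclaimable << 'EOF'
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          24        12        38.2GB    21.7GB (56%)
Containers      15        12        1.2GB     450MB (37%)
Local Volumes   18        14        52.1GB    9.8GB (18%)
Build Cache     120       0         14.3GB    14.3GB
EOF
```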
### Step 2: Clear Log Files
```bash
# Find large log files
find /var/log -type f -size +100M -exec ls -lh {} \; | sort -k 5 -rh
# Clear systemd journal (keeps last 3 days)
sudo journalctl --vacuum-time=3d
# Clear old Docker logs
sudo sh -c 'truncate -s 0 /var/lib/docker/containers/*/*-json.log'
# For Synology NAS
sudo find /volume1/@docker/containers -name "*-json.log" -size +100M -exec truncate -s 0 {} \;
```
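The commands above use `truncate -s 0` rather than `rm` because a service that holds a log file open keeps its disk blocks allocated until the file descriptor closes; deleting the path frees nothing until the process restarts. A self-contained demonstration of the truncate behavior:

```shell
#!/bin/sh
# Truncating a file in place: the path and inode survive, but the size
# drops to zero immediately -- safe even while a writer holds it open.
tmp=$(mktemp)
dd if=/dev/zero of="$tmp" bs=1024 count=1024 2>/dev/null   # a 1 MiB "log"
before=$(wc -c < "$tmp")
truncate -s 0 "$tmp"             # same command the runbook uses on Docker logs
after=$(wc -c < "$tmp")
echo "before=${before} after=${after}"
rm -f "$tmp"
```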
### Step 3: Remove Old Docker Images
```bash
# List images by size
docker images --format "{{.Repository}}:{{.Tag}}\t{{.Size}}" | sort -k 2 -rh | head -20
# Remove specific old images
docker image rm [image:tag]
# Remove all unused images
docker image prune -a --force
```
### Step 4: Verify Space Recovered
```bash
# Check current usage
df -h
# Verify critical services are running
docker ps
# Check container logs for errors
docker ps --format "{{.Names}}" | xargs -I {} sh -c 'echo "=== {} ===" && docker logs --tail 20 {}'
```
## Detailed Analysis Procedure
Once the immediate danger has passed, perform a thorough analysis:
### Step 1: Identify Space Consumers
```bash
# Comprehensive disk usage analysis
sudo du -h --max-depth=2 / 2>/dev/null | sort -rh | head -30
# For Synology NAS specifically
sudo du -h --max-depth=2 /volume1 2>/dev/null | sort -rh | head -30
# Check Docker volumes
docker volume ls
docker system df -v
# Check specific large directories
du -sh /var/lib/docker/* | sort -rh
du -sh /volume1/docker/* | sort -rh # Synology
```
### Step 2: Analyze by Service
Create a space usage report:
```bash
# Create analysis script
cat > /tmp/analyze-space.sh << 'EOF'
#!/bin/bash
echo "=== Docker Container Filesystems ==="
docker ps --format "{{.Names}}" | while read -r container; do
  size=$(docker exec "$container" du -sh / 2>/dev/null | awk '{print $1}')
  echo "$container: $size"
done | sort -rh -k 2

echo ""
echo "=== Docker Volumes ==="
docker volume ls --format "{{.Name}}" | while read -r vol; do
  size=$(docker volume inspect "$vol" --format '{{.Mountpoint}}' | xargs sudo du -sh 2>/dev/null | awk '{print $1}')
  echo "$vol: $size"
done | sort -rh -k 2

echo ""
echo "=== Log Files Over 100MB ==="
find /var/log -type f -size +100M -exec ls -lh {} \; 2>/dev/null
EOF
chmod +x /tmp/analyze-space.sh
/tmp/analyze-space.sh
```
### Step 3: Categorize Findings
Identify the primary space consumers:
| Category | Typical Culprits | Safe to Delete? |
|----------|------------------|-----------------|
| **Docker Images** | Old/unused image versions | ✅ Yes (if unused) |
| **Docker Volumes** | Database growth, media cache | ⚠️ Maybe (check first) |
| **Log Files** | Application logs, system logs | ✅ Yes (after review) |
| **Media Files** | Plex, Jellyfin transcodes | ✅ Yes (transcodes) |
| **Backups** | Old backup archives | ✅ Yes (keep recent) |
| **Application Data** | Various service data | ❌ No (review first) |
## Cleanup Strategies by Service Type
### Media Services (Plex, Jellyfin)
```bash
# Clear Plex transcode cache (the glob needs a shell inside the container)
docker exec plex sh -c 'rm -rf /transcode/*'
# Clear Jellyfin transcode cache
docker exec jellyfin sh -c 'rm -rf /config/data/transcodes/*'
# Find and remove media previews older than 30 days
find /volume1/plex/Library/Application\ Support/Plex\ Media\ Server/Cache -type f -mtime +30 -delete
```
### *arr Suite (Sonarr, Radarr, etc.)
```bash
# Clear old automatic backups (Sonarr/Radarr keep periodic backup archives here)
docker exec sonarr find /config/Backups -type f -mtime +30 -delete
docker exec radarr find /config/Backups -type f -mtime +30 -delete
# Clean up old logs
docker exec sonarr find /config/logs -type f -mtime +30 -delete
docker exec radarr find /config/logs -type f -mtime +30 -delete
```
### Database Services (PostgreSQL, MariaDB)
```bash
# Check database size
docker exec postgres psql -U user -c "SELECT pg_database.datname, pg_size_pretty(pg_database_size(pg_database.datname)) FROM pg_database;"
# Vacuum databases (for PostgreSQL)
docker exec postgres vacuumdb -U user --all --full --analyze
# Check MariaDB size
docker exec mariadb mysql -u root -p -e "SELECT table_schema AS 'Database', ROUND(SUM(data_length + index_length) / 1024 / 1024, 2) AS 'Size (MB)' FROM information_schema.TABLES GROUP BY table_schema;"
```
### Monitoring Services (Prometheus, Grafana)
```bash
# Check Prometheus storage size
du -sh /volume1/docker/prometheus
# Prometheus retention is set via command-line flags, not prometheus.yml:
#   --storage.tsdb.retention.time=15d (the default)
# Consider reducing retention if space is critical
# Clear old Grafana sessions
docker exec grafana find /var/lib/grafana/sessions -mtime +7 -delete
```
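A rough capacity rule for Prometheus: disk usage ≈ ingestion rate × retention × bytes per sample (on-disk samples typically compress to 1-2 bytes, per the Prometheus storage documentation). A back-of-envelope sketch, where 10,000 samples/s is a hypothetical rate; measure yours with the PromQL query `rate(prometheus_tsdb_head_samples_appended_total[1h])`:

```shell
#!/bin/sh
# Rough Prometheus TSDB sizing: samples/sec x retention seconds x bytes/sample.
samples_per_sec=10000     # hypothetical ingestion rate -- measure your own
retention_days=15
bytes_per_sample=2        # conservative; compression often achieves less
bytes=$((samples_per_sec * retention_days * 86400 * bytes_per_sample))
echo "estimated TSDB size: $((bytes / 1024 / 1024 / 1024)) GiB"
```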
### Immich (Photo Management)
```bash
# Check Immich storage usage
docker exec immich-server df -h /usr/src/app/upload
# Immich uses a lot of space for:
# - Original photos
# - Thumbnails
# - Encoded videos
# - ML models
# Clean up stale partial uploads from the staging directory (not the library itself)
docker exec immich-server find /usr/src/app/upload/upload -type f -mtime +90 -delete
```
## Long-Term Solutions
### Solution 1: Configure Log Rotation
Create proper log rotation for Docker containers:
```bash
# Edit Docker daemon config
sudo nano /etc/docker/daemon.json
```

Add log rotation settings:

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
```

Then restart Docker (note: these settings only apply to containers created after the change):

```bash
sudo systemctl restart docker # Linux
# OR for Synology
sudo synoservicectl --restart pkgctl-Docker
```
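A malformed `daemon.json` prevents the Docker daemon from starting, so validating the JSON before the restart is cheap insurance. A sketch using Python's stdlib `json.tool` as the validator, assuming `python3` is on the host; a temp file stands in for `/etc/docker/daemon.json`:

```shell
#!/bin/sh
# Validate daemon.json before restarting Docker; a syntax error here would
# stop the daemon from coming back up. The temp file is a stand-in for
# /etc/docker/daemon.json.
cfg=$(mktemp)
cat > "$cfg" << 'EOF'
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
EOF
if python3 -m json.tool "$cfg" > /dev/null 2>&1; then
  verdict="OK"
else
  verdict="INVALID"
fi
echo "daemon.json $verdict"
rm -f "$cfg"
```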
### Solution 2: Set Up Automated Cleanup
Create a cleanup cron job:
```bash
# Create cleanup script
sudo tee /usr/local/bin/homelab-cleanup.sh > /dev/null << 'EOF'
#!/bin/bash
# Homelab Automated Cleanup Script

# Remove stopped containers older than 7 days
docker container prune --filter "until=168h" --force
# Remove unused images older than 30 days
docker image prune --all --filter "until=720h" --force
# Remove unused volumes (BE CAREFUL - only if you're sure)
# docker volume prune --force
# Clear journal logs older than 7 days
journalctl --vacuum-time=7d
# Clear old backups (keep last 30 days)
find /volume1/backups -type f -mtime +30 -delete
echo "Cleanup completed at $(date)" >> /var/log/homelab-cleanup.log
EOF

# Make executable
sudo chmod +x /usr/local/bin/homelab-cleanup.sh
# Add to cron (runs weekly on Sunday at 3 AM)
(crontab -l 2>/dev/null; echo "0 3 * * 0 /usr/local/bin/homelab-cleanup.sh") | crontab -
```
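Re-running the `(crontab -l; echo …) | crontab -` line installs the job a second time; a grep guard keeps it idempotent. A sketch of the pattern, operating on a plain string standing in for the live `crontab -l` output:

```shell
#!/bin/sh
# Append a job line to a crontab-style listing only if it is not already
# present, so re-running setup does not duplicate the entry.
job='0 3 * * 0 /usr/local/bin/homelab-cleanup.sh'
current='0 0 * * * /usr/local/bin/backup.sh'   # stand-in for `crontab -l`

add_once() {
  if printf '%s\n' "$1" | grep -qxF "$2"; then
    printf '%s\n' "$1"              # already installed: unchanged
  else
    printf '%s\n%s\n' "$1" "$2"     # append the new job
  fi
}

updated=$(add_once "$current" "$job")
twice=$(add_once "$updated" "$job")
[ "$updated" = "$twice" ] && echo "idempotent: job installed once"
```

Against the live crontab the same guard looks like `crontab -l 2>/dev/null | grep -qxF "$job" || (crontab -l 2>/dev/null; echo "$job") | crontab -`.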
### Solution 3: Configure Service-Specific Retention
Update each service with appropriate retention policies:
**Prometheus** (retention is set via `--storage.tsdb` command-line flags, e.g. in the container's `command:` in `docker-compose.yml`):
```yaml
command:
  - '--storage.tsdb.retention.time=7d'    # reduce from the 15d default if needed
  - '--storage.tsdb.retention.size=50GB'  # hard size cap for the TSDB
```
**Grafana** (docker-compose.yml):
```yaml
environment:
  - GF_DATABASE_WAL=true
  - GF_DATABASE_CLEANUP_INTERVAL=168h # Weekly cleanup
```
**Plex** (Plex settings):
- Settings → Transcoder → Transcoder temporary directory
- Settings → Scheduled Tasks → Clean Bundles (daily)
- Settings → Scheduled Tasks → Optimize Database (weekly)
### Solution 4: Monitor Disk Usage Proactively
Set up disk-space alerting rules in Prometheus (surfaced in Grafana):
```yaml
# Alert rules for disk space
- alert: DiskSpaceWarning
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100) < 20
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Disk space warning on {{ $labels.instance }}"
    description: "Disk {{ $labels.mountpoint }} has less than 20% free space"

- alert: DiskSpaceCritical
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100) < 10
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "CRITICAL: Disk space on {{ $labels.instance }}"
    description: "Disk {{ $labels.mountpoint }} has less than 10% free space"
```
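The alert expression is plain arithmetic, avail/size × 100, so the thresholds can be sanity-checked by hand before deploying the rules. A sketch with hypothetical byte counts:

```shell
#!/bin/sh
# Mirror the alert expression (node_filesystem_avail_bytes /
# node_filesystem_size_bytes * 100) on hypothetical numbers: below 20%
# free fires the warning, below 10% the critical alert.
avail=9663676416      # 9 GiB available (hypothetical)
size=107374182400     # 100 GiB filesystem (hypothetical)
pct=$(awk -v a="$avail" -v s="$size" 'BEGIN { printf "%.1f", a / s * 100 }')
echo "free: ${pct}%"  # below 10, so this would fire the critical alert
```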
## Host-Specific Considerations
### Atlantis (Synology DS1823xs+)
```bash
# Synology-specific cleanup
# Clear Synology logs
sudo find /var/log -name "*.log.*" -mtime +30 -delete
# Clear package logs
sudo find /var/packages/*/target/logs -name "*.log.*" -mtime +30 -delete
# Check storage pool status
sudo synostgpool --info
# DSM has built-in storage analyzer
# Control Panel → Storage Manager → Storage Analyzer
```
### Calypso (Synology DS723+)
Same as Atlantis - use Synology-specific commands.
### Concord NUC (Ubuntu)
```bash
# Ubuntu-specific cleanup
sudo apt-get clean
sudo apt-get autoclean
sudo apt-get autoremove --purge
# Old kernels are normally removed by `apt-get autoremove --purge` above;
# verify the running kernel remains installed:
uname -r
dpkg -l 'linux-image-*' | awk '/^ii/{print $2}'
# Clear thumbnail cache
rm -rf ~/.cache/thumbnails/*
```
### Homelab VM (Proxmox VM)
```bash
# VM-specific cleanup
# Clear apt cache
sudo apt-get clean
# Clear old cloud-init logs
sudo rm -rf /var/log/cloud-init*.log
# Compact QCOW2 disk (from Proxmox host)
# qemu-img convert -O qcow2 -c original.qcow2 compressed.qcow2
```
## Verification Checklist
After cleanup, verify:
- [ ] Disk usage below 80%: `df -h`
- [ ] All critical containers running: `docker ps`
- [ ] No errors in recent logs: `docker logs [container] --tail 50`
- [ ] Services accessible via web interface
- [ ] Monitoring dashboards show normal metrics
- [ ] Backup jobs can complete successfully
- [ ] Automated cleanup configured for future
## Rollback Procedure
If cleanup causes issues:
1. **Check what was deleted**: Review command history and logs
2. **Restore from backups**: If critical data was deleted
```bash
cd ~/Documents/repos/homelab
./restore.sh [backup-date]
```
3. **Recreate Docker volumes**: If volumes were accidentally pruned
4. **Restart affected services**: Redeploy from Portainer
## Troubleshooting
### Issue: Still Running Out of Space After Cleanup
**Solution**: Consider adding more storage
- Add external USB drives
- Expand existing RAID arrays
- Move services to hosts with more space
- Archive old media to cold storage
### Issue: Docker Prune Removed Important Data
**Solution**:
- Always use `--filter` to be selective
- Never use `docker volume prune` without checking first
- Keep recent backups before major cleanup operations
### Issue: Services Won't Start After Cleanup
**Solution**:
```bash
# Check for missing volumes
docker ps -a
docker volume ls
# Check logs
docker logs [container]
# Recreate volumes if needed (restore from backup)
./restore.sh [backup-date]
```
## Prevention Checklist
- [ ] Log rotation configured for all services
- [ ] Automated cleanup script running weekly
- [ ] Monitoring alerts set up for disk space
- [ ] Retention policies configured appropriately
- [ ] Regular backup verification scheduled
- [ ] Capacity planning review quarterly
## Related Documentation
- [Infrastructure Overview](../infrastructure/INFRASTRUCTURE_OVERVIEW.md)
- [Backup Strategies](../admin/backup-strategies.md)
- [Monitoring Setup](../admin/monitoring-setup.md)
- [Troubleshooting Guide](../troubleshooting/common-issues.md)
## Change Log
- 2026-02-14 - Initial creation with host-specific procedures
- 2026-02-14 - Added service-specific cleanup strategies