# Disk Full Procedure Runbook

## Overview

This runbook provides procedures for handling disk space emergencies across all homelab hosts. It includes immediate actions, root cause analysis, and long-term solutions to prevent recurrence.

## Prerequisites

- [ ] SSH access to affected host
- [ ] Root/sudo privileges on the host
- [ ] Monitoring dashboards access
- [ ] Backup verification capability

## Metadata

- **Estimated Time**: 30-90 minutes (depending on severity)
- **Risk Level**: High (data loss possible if not handled carefully)
- **Requires Downtime**: Minimal (may need to stop services temporarily)
- **Reversible**: Partially (deleted data cannot be recovered)
- **Tested On**: 2026-02-14

## Severity Levels

| Level | Disk Usage | Action Required | Urgency |
|-------|------------|-----------------|---------|
| 🟢 **Normal** | < 80% | Monitor | Low |
| 🟡 **Warning** | 80-90% | Plan cleanup | Medium |
| 🟠 **Critical** | 90-95% | Immediate cleanup | High |
| 🔴 **Emergency** | > 95% | Emergency response | Critical |

## Quick Triage

First, determine which host and volume are affected:

```bash
# Check disk usage on all hosts
ssh atlantis "df -h"
ssh calypso "df -h"
ssh concordnuc "df -h"
ssh homelab-vm "df -h"
ssh raspberry-pi-5 "df -h"
```

## Emergency Procedure (>95% Full)

### Step 1: Immediate Space Recovery

**Goal**: Free up 5-10% space immediately to prevent system issues.

```bash
# SSH to affected host
ssh [hostname]

# Identify what's consuming space
df -h
du -sh /* 2>/dev/null | sort -rh | head -20

# Quick wins - Clear Docker cache
docker system df  # See what Docker is using
docker system prune -a --volumes --force  # Reclaim space (BE CAREFUL!)
# This typically frees 10-50GB depending on your setup
```

**⚠️ WARNING**: `docker system prune -a --volumes` will remove:

- Stopped containers
- Unused networks
- All images not used by a container (with the `-a` flag), not only dangling images
- Build cache
- Unused volumes (with the `--volumes` flag)

**Safer alternative** if you're unsure:

```bash
# Less aggressive - removes stopped containers, unused networks,
# dangling images, and build cache; keeps volumes and tagged images
docker system prune --force
```

### Step 2: Clear Log Files

```bash
# Find large log files
find /var/log -type f -size +100M -exec ls -lh {} \; | sort -k 5 -rh

# Clear systemd journal (keeps last 3 days)
sudo journalctl --vacuum-time=3d

# Clear old Docker logs
sudo sh -c 'truncate -s 0 /var/lib/docker/containers/*/*-json.log'

# For Synology NAS
sudo find /volume1/@docker/containers -name "*-json.log" -size +100M -exec truncate -s 0 {} \;
```

### Step 3: Remove Old Docker Images

```bash
# List images by size
docker images --format "{{.Repository}}:{{.Tag}}\t{{.Size}}" | sort -k 2 -rh | head -20

# Remove specific old images
docker image rm [image:tag]

# Remove all unused images
docker image prune -a --force
```

### Step 4: Verify Space Recovered

```bash
# Check current usage
df -h

# Verify critical services are running
docker ps

# Check container logs for errors
docker ps --format "{{.Names}}" | xargs -I {} sh -c 'echo "=== {} ===" && docker logs --tail 20 {}'
```

## Detailed Analysis Procedure

Once the immediate danger has passed, perform a thorough analysis:

### Step 1: Identify Space Consumers

```bash
# Comprehensive disk usage analysis
sudo du -h --max-depth=2 / 2>/dev/null | sort -rh | head -30

# For Synology NAS specifically
sudo du -h --max-depth=2 /volume1 2>/dev/null | sort -rh | head -30

# Check Docker volumes
docker volume ls
docker system df -v

# Check specific large directories
du -sh /var/lib/docker/* | sort -rh
du -sh /volume1/docker/* | sort -rh  # Synology
```

### Step 2: Analyze by Service

Create a space usage report:

```bash
# Create analysis script
cat > /tmp/analyze-space.sh << 'EOF'
#!/bin/bash

echo "=== Docker Container Volumes ==="
# Note: du inside the container also counts bind mounts and volumes.
# Size is printed first so that sort -rh orders by size, not by name.
docker ps --format "{{.Names}}" | while read -r container; do
    size=$(docker exec "$container" du -sh / 2>/dev/null | awk '{print $1}')
    printf '%s\t%s\n' "$size" "$container"
done | sort -rh

echo ""
echo "=== Docker Volumes ==="
docker volume ls --format "{{.Name}}" | while read -r vol; do
    size=$(docker volume inspect "$vol" --format '{{.Mountpoint}}' | xargs sudo du -sh 2>/dev/null | awk '{print $1}')
    printf '%s\t%s\n' "$size" "$vol"
done | sort -rh

echo ""
echo "=== Log Files Over 100MB ==="
find /var/log -type f -size +100M -exec ls -lh {} \; 2>/dev/null
EOF

chmod +x /tmp/analyze-space.sh
/tmp/analyze-space.sh
```

### Step 3: Categorize Findings

Identify the primary space consumers:

| Category | Typical Culprits | Safe to Delete? |
|----------|------------------|-----------------|
| **Docker Images** | Old/unused image versions | ✅ Yes (if unused) |
| **Docker Volumes** | Database growth, media cache | ⚠️ Maybe (check first) |
| **Log Files** | Application logs, system logs | ✅ Yes (after review) |
| **Media Files** | Plex, Jellyfin transcodes | ✅ Yes (transcodes) |
| **Backups** | Old backup archives | ✅ Yes (keep recent) |
| **Application Data** | Various service data | ❌ No (review first) |

## Cleanup Strategies by Service Type

### Media Services (Plex, Jellyfin)

```bash
# Clear Plex transcode cache
# (sh -c makes the glob expand inside the container, not on the host)
docker exec plex sh -c 'rm -rf /transcode/*'

# Clear Jellyfin transcode cache
docker exec jellyfin sh -c 'rm -rf /config/data/transcodes/*'

# Find and remove Plex cache files older than 30 days
find /volume1/plex/Library/Application\ Support/Plex\ Media\ Server/Cache -type f -mtime +30 -delete
```

### *arr Suite (Sonarr, Radarr, etc.)
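The `find ... -delete` commands used for these services remove files without confirmation. A minimal sketch of a dry-run helper, assuming a hypothetical function name `preview_old_files` and a `find` that supports `-mtime`, prints what would match without deleting anything:

```bash
# Hypothetical helper (not part of the original runbook):
# list regular files under a directory that are older than N days.
# Nothing is deleted; review the output before running the -delete variant.
preview_old_files() {
    local dir=$1 days=$2
    find "$dir" -type f -mtime +"$days" -print
}

# Example (host path is illustrative):
#   preview_old_files /volume1/backups 30
```

Inside a container, the equivalent preview replaces `-delete` with `-print`, for example `docker exec sonarr find /config/Backups -type f -mtime +30 -print`.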
```bash
# Remove backups older than 30 days
docker exec sonarr find /config/Backups -type f -mtime +30 -delete
docker exec radarr find /config/Backups -type f -mtime +30 -delete

# Clean up logs older than 30 days
docker exec sonarr find /config/logs -type f -mtime +30 -delete
docker exec radarr find /config/logs -type f -mtime +30 -delete
```

### Database Services (PostgreSQL, MariaDB)

```bash
# Check database size
docker exec postgres psql -U user -c "SELECT pg_database.datname, pg_size_pretty(pg_database_size(pg_database.datname)) FROM pg_database;"

# Vacuum databases (for PostgreSQL)
# Note: --full rewrites each table, so it needs free space for the copy
# and locks tables while it runs; free some space first on a full disk
docker exec postgres vacuumdb -U user --all --full --analyze

# Check MariaDB size (-it is needed so that -p can prompt for the password)
docker exec -it mariadb mysql -u root -p -e "SELECT table_schema AS 'Database', ROUND(SUM(data_length + index_length) / 1024 / 1024, 2) AS 'Size (MB)' FROM information_schema.TABLES GROUP BY table_schema;"
```

### Monitoring Services (Prometheus, Grafana)

```bash
# Check Prometheus storage size
du -sh /volume1/docker/prometheus

# Prometheus retention is normally set with command-line flags
# rather than in prometheus.yml
# Default: --storage.tsdb.retention.time=15d
# Consider reducing retention if space is critical

# Clear old Grafana sessions
# (this directory exists only on Grafana versions that use file-based sessions)
docker exec grafana find /var/lib/grafana/sessions -type f -mtime +7 -delete
```

### Immich (Photo Management)

```bash
# Check Immich storage usage
docker exec immich-server df -h /usr/src/app/upload

# Immich uses a lot of space for:
# - Original photos
# - Thumbnails
# - Encoded videos
# - ML models

# Clean up old files in the upload directory
# CAUTION: depending on your Immich storage settings, this directory can hold
# original uploads, not just logs. List the files and confirm before deleting.
docker exec immich-server find /usr/src/app/upload/upload -type f -mtime +90 -delete
```

## Long-Term Solutions

### Solution 1: Configure Log Rotation

Create proper log rotation for Docker containers:

```bash
# Edit Docker daemon config
sudo nano /etc/docker/daemon.json

# Add log rotation settings (file contents)
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}

# Restart Docker
# Note: the new log settings apply only to containers created after the restart
sudo systemctl restart docker  # Linux
# OR for Synology
sudo synoservicectl --restart pkgctl-Docker
```

### Solution 2: Set Up Automated Cleanup
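Before scheduling cleanup, it may help to encode the thresholds from the Severity Levels table so that scripts can decide when to act. The following is a minimal sketch, assuming integer percentages and that the overlapping table boundaries resolve as 80-89% warning and 90-95% critical. The function names `severity_for_usage` and `usage_pct` are hypothetical and not part of the existing tooling:

```bash
# Hypothetical helper: map an integer disk-usage percentage to a severity level.
# Boundary handling is an assumption: 80-89 warning, 90-95 critical, >95 emergency.
severity_for_usage() {
    local pct=$1
    if (( pct > 95 )); then
        echo "emergency"
    elif (( pct >= 90 )); then
        echo "critical"
    elif (( pct >= 80 )); then
        echo "warning"
    else
        echo "normal"
    fi
}

# Hypothetical helper: print the usage percentage (digits only) of the
# filesystem that holds the given path, using POSIX-format df output.
usage_pct() {
    df -P "$1" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

# Example:
#   severity_for_usage "$(usage_pct /)"
```

For example, `severity_for_usage "$(usage_pct /volume1)"` would print the level for a Synology data volume, and a cron wrapper could skip cleanup when the result is `normal`.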
Create a cleanup cron job:

```bash
# Create cleanup script
sudo tee /usr/local/bin/homelab-cleanup.sh > /dev/null << 'EOF'
#!/bin/bash
# Homelab Automated Cleanup Script

# Remove stopped containers older than 7 days
docker container prune --filter "until=168h" --force

# Remove unused images older than 30 days
docker image prune --all --filter "until=720h" --force

# Remove unused volumes (BE CAREFUL - only if you're sure)
# docker volume prune --force

# Clear journal logs older than 7 days
journalctl --vacuum-time=7d

# Clear old backups (keep last 30 days)
find /volume1/backups -type f -mtime +30 -delete

echo "Cleanup completed at $(date)" >> /var/log/homelab-cleanup.log
EOF

# Make executable
sudo chmod +x /usr/local/bin/homelab-cleanup.sh

# Add to root's crontab (runs weekly on Sunday at 3 AM)
# The script needs root to vacuum the journal and write to /var/log
(sudo crontab -l 2>/dev/null; echo "0 3 * * 0 /usr/local/bin/homelab-cleanup.sh") | sudo crontab -
```

### Solution 3: Configure Service-Specific Retention

Update each service with appropriate retention policies:

**Prometheus** (command-line flags, for example under `command:` in docker-compose.yml):

```yaml
command:
  # Setting command: replaces the image defaults, so keep the config file flag
  - --config.file=/etc/prometheus/prometheus.yml
  - --storage.tsdb.retention.time=15d  # Default is 15d; reduce to 7d if needed
  - --storage.tsdb.retention.size=50GB  # Set size limit
```

**Grafana** (docker-compose.yml):

```yaml
environment:
  - GF_DATABASE_WAL=true
  # Confirm this key exists in your Grafana version's configuration reference
  - GF_DATABASE_CLEANUP_INTERVAL=168h  # Weekly cleanup
```

**Plex** (Plex settings):

- Settings → Transcoder → Transcoder temporary directory
- Settings → Scheduled Tasks → Clean Bundles (daily)
- Settings → Scheduled Tasks → Optimize Database (weekly)

### Solution 4: Monitor Disk Usage Proactively

Set up disk space alerts. The rules below use the Prometheus alerting rule format:

```yaml
# Alert rules for disk space
- alert: REDACTED_APP_PASSWORD
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100) < 20
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Disk space warning on {{ $labels.instance }}"
    description: "Disk {{ $labels.mountpoint }} has less than 20% free space"

- alert: DiskSpaceCritical
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100) < 10
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "CRITICAL: Disk space on {{ $labels.instance }}"
    description: "Disk {{ $labels.mountpoint }} has less than 10% free space"
```

## Host-Specific Considerations

### Atlantis (Synology DS1823xs+)

```bash
# Synology-specific cleanup
# Clear Synology logs
sudo find /var/log -name "*.log.*" -mtime +30 -delete

# Clear package logs
sudo find /var/packages/*/target/logs -name "*.log.*" -mtime +30 -delete

# Check storage pool status
sudo synostgpool --info

# DSM has built-in storage analyzer
# Control Panel → Storage Manager → Storage Analyzer
```

### Calypso (Synology DS723+)

Same as Atlantis - use Synology-specific commands.

### Concord NUC (Ubuntu)

```bash
# Ubuntu-specific cleanup
sudo apt-get clean
sudo apt-get autoclean
sudo apt-get autoremove --purge

# Clear old kernels
# Note: this one-liner selects every installed linux-* package version except
# the running kernel; review the package list before confirming
sudo apt-get autoremove --purge $(dpkg -l 'linux-*' | sed '/^ii/!d;/'"$(uname -r | sed "s/\(.*\)-\([^0-9]\+\)/\1/")"'/d;s/^[^ ]* [^ ]* \([^ ]*\).*/\1/;/[0-9]/!d')

# Clear thumbnail cache
rm -rf ~/.cache/thumbnails/*
```

### Homelab VM (Proxmox VM)

```bash
# VM-specific cleanup
# Clear apt cache
sudo apt-get clean

# Clear old cloud-init logs
sudo rm -f /var/log/cloud-init*.log

# Compact QCOW2 disk (from Proxmox host)
# qemu-img convert -O qcow2 -c original.qcow2 compressed.qcow2
```

## Verification Checklist

After cleanup, verify:

- [ ] Disk usage below 80%: `df -h`
- [ ] All critical containers running: `docker ps`
- [ ] No errors in recent logs: `docker logs [container] --tail 50`
- [ ] Services accessible via web interface
- [ ] Monitoring dashboards show normal metrics
- [ ] Backup jobs can complete successfully
- [ ] Automated cleanup configured for future

## Rollback Procedure

If cleanup causes issues:

1. **Check what was deleted**: Review command history and logs
2. **Restore from backups**: If critical data was deleted

   ```bash
   cd ~/Documents/repos/homelab
   ./restore.sh [backup-date]
   ```

3. **Recreate Docker volumes**: If volumes were accidentally pruned
4. **Restart affected services**: Redeploy from Portainer

## Troubleshooting

### Issue: Still Running Out of Space After Cleanup

**Solution**: Consider adding more storage:

- Add external USB drives
- Expand existing RAID arrays
- Move services to hosts with more space
- Archive old media to cold storage

### Issue: Docker Prune Removed Important Data

**Solution**:

- Always use `--filter` to be selective
- Never use `docker volume prune` without checking first
- Keep recent backups before major cleanup operations

### Issue: Services Won't Start After Cleanup

**Solution**:

```bash
# Check for missing volumes
docker ps -a
docker volume ls

# Check logs
docker logs [container]

# Recreate volumes if needed (restore from backup)
./restore.sh [backup-date]
```

## Prevention Checklist

- [ ] Log rotation configured for all services
- [ ] Automated cleanup script running weekly
- [ ] Monitoring alerts set up for disk space
- [ ] Retention policies configured appropriately
- [ ] Regular backup verification scheduled
- [ ] Capacity planning review quarterly

## Related Documentation

- [Infrastructure Overview](../infrastructure/INFRASTRUCTURE_OVERVIEW.md)
- [Backup Strategies](../admin/backup-strategies.md)
- [Monitoring Setup](../admin/monitoring-setup.md)
- [Troubleshooting Guide](../troubleshooting/common-issues.md)

## Change Log

- 2026-02-14 - Initial creation with host-specific procedures
- 2026-02-14 - Added service-specific cleanup strategies