Disk Full Procedure Runbook
Overview
This runbook provides procedures for handling disk space emergencies across all homelab hosts. It includes immediate actions, root cause analysis, and long-term solutions to prevent recurrence.
Prerequisites
- SSH access to affected host
- Root/sudo privileges on the host
- Monitoring dashboards access
- Backup verification capability
Metadata
- Estimated Time: 30-90 minutes (depending on severity)
- Risk Level: High (data loss possible if not handled carefully)
- Requires Downtime: Minimal (may need to stop services temporarily)
- Reversible: Partially (deleted data cannot be recovered)
- Tested On: 2026-02-14
Severity Levels
| Level | Disk Usage | Action Required | Urgency |
|---|---|---|---|
| 🟢 Normal | < 80% | Monitor | Low |
| 🟡 Warning | 80-90% | Plan cleanup | Medium |
| 🟠 Critical | 90-95% | Immediate cleanup | High |
| 🔴 Emergency | > 95% | Emergency response | Critical |
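The thresholds above are easy to script against. A minimal POSIX-shell helper (a sketch for alert or cron scripts, not part of the monitoring stack; the `severity` function name is made up here) that maps a usage percentage to the table's levels:

```shell
#!/bin/sh
# Map a disk-usage percentage to the severity levels in the table above.
severity() {
  pct=$1
  if [ "$pct" -gt 95 ]; then echo "EMERGENCY"
  elif [ "$pct" -gt 90 ]; then echo "CRITICAL"
  elif [ "$pct" -ge 80 ]; then echo "WARNING"
  else echo "NORMAL"
  fi
}

severity 72   # NORMAL
severity 85   # WARNING
severity 93   # CRITICAL
severity 97   # EMERGENCY
```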
Quick Triage
First, determine which host and volume are affected:
# Check all hosts disk usage
ssh atlantis "df -h"
ssh calypso "df -h"
ssh concordnuc "df -h"
ssh homelab-vm "df -h"
ssh raspberry-pi-5 "df -h"
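Rather than eyeballing five `df -h` outputs, triage can flag offenders automatically. A small awk filter, shown here against canned `df -P` output so it can be sanity-checked without SSH; in practice pipe `ssh <host> "df -P"` into the same function (the `flag_full` name is just for illustration):

```shell
#!/bin/sh
# Print every filesystem at or above the 80% warning threshold.
flag_full() {
  awk -v limit=80 'NR > 1 { sub(/%/, "", $5); if ($5 + 0 >= limit) print $6, $5 "%" }'
}

flag_full <<'EOF'
Filesystem     1024-blocks     Used Available Capacity Mounted on
/dev/sda1         51474912 46327420   2509492      95% /
/dev/sdb1        961301832 384520732 527914424     43% /volume1
EOF
# Prints: / 95%
```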
Emergency Procedure (>95% Full)
Step 1: Immediate Space Recovery
Goal: Free up 5-10% space immediately to prevent system issues.
# SSH to affected host
ssh [hostname]
# Identify what's consuming space
df -h
du -sh /* 2>/dev/null | sort -rh | head -20
# Quick wins - Clear Docker cache
docker system df # See what Docker is using
docker system prune -a --volumes --force # Reclaim space (BE CAREFUL!)
# This typically frees 10-50GB depending on your setup
⚠️ WARNING: docker system prune will remove:
- Stopped containers
- Unused networks
- Dangling images
- Build cache
- Unused volumes (with --volumes flag)
Safer alternative if you're unsure:
# Less aggressive - removes only stopped containers and dangling images
docker system prune --force
Step 2: Clear Log Files
# Find large log files
find /var/log -type f -size +100M -exec ls -lh {} \; | sort -k 5 -rh
# Clear systemd journal (keeps last 3 days)
sudo journalctl --vacuum-time=3d
# Clear old Docker logs
sudo sh -c 'truncate -s 0 /var/lib/docker/containers/*/*-json.log'
# For Synology NAS
sudo find /volume1/@docker/containers -name "*-json.log" -size +100M -exec truncate -s 0 {} \;
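Why `truncate -s 0` instead of `rm`: if a process still holds the log open, deleting the file leaves its blocks allocated until the process closes it, so `rm` frees nothing. Truncating in place releases the space immediately. A self-contained demo in a temp directory:

```shell
#!/bin/sh
# Demonstrate that truncate zeroes a file in place.
dir=$(mktemp -d)
log="$dir/app.log"
head -c 1048576 /dev/zero > "$log"   # 1 MiB dummy log
wc -c < "$log"                        # 1048576
truncate -s 0 "$log"
wc -c < "$log"                        # 0
rm -rf "$dir"
```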
Step 3: Remove Old Docker Images
# List images by size
docker images --format "{{.Repository}}:{{.Tag}}\t{{.Size}}" | sort -k 2 -rh | head -20
# Remove specific old images
docker image rm [image:tag]
# Remove all unused images
docker image prune -a --force
Step 4: Verify Space Recovered
# Check current usage
df -h
# Verify critical services are running
docker ps
# Check container logs for errors
docker ps --format "{{.Names}}" | xargs -I {} sh -c 'echo "=== {} ===" && docker logs --tail 20 {}'
Detailed Analysis Procedure
Once the immediate danger has passed, perform a thorough analysis:
Step 1: Identify Space Consumers
# Comprehensive disk usage analysis
sudo du -h --max-depth=2 / 2>/dev/null | sort -rh | head -30
# For Synology NAS specifically
sudo du -h --max-depth=2 /volume1 2>/dev/null | sort -rh | head -30
# Check Docker volumes
docker volume ls
docker system df -v
# Check specific large directories
du -sh /var/lib/docker/* | sort -rh
du -sh /volume1/docker/* | sort -rh # Synology
Step 2: Analyze by Service
Create a space usage report:
# Create analysis script
cat > /tmp/analyze-space.sh << 'EOF'
#!/bin/bash
echo "=== Docker Container Filesystem Usage ==="
docker ps --format "{{.Names}}" | while read -r container; do
    size=$(docker exec "$container" du -sh / 2>/dev/null | awk '{print $1}')
    echo "$container: $size"
done | sort -t: -k2 -rh
echo ""
echo "=== Docker Volumes ==="
docker volume ls --format "{{.Name}}" | while read -r vol; do
    size=$(docker volume inspect "$vol" --format '{{.Mountpoint}}' | xargs sudo du -sh 2>/dev/null | awk '{print $1}')
    echo "$vol: $size"
done | sort -t: -k2 -rh
echo ""
echo "=== Log Files Over 100MB ==="
find /var/log -type f -size +100M -exec ls -lh {} \; 2>/dev/null
EOF
chmod +x /tmp/analyze-space.sh
/tmp/analyze-space.sh
Step 3: Categorize Findings
Identify the primary space consumers:
| Category | Typical Culprits | Safe to Delete? |
|---|---|---|
| Docker Images | Old/unused image versions | ✅ Yes (if unused) |
| Docker Volumes | Database growth, media cache | ⚠️ Maybe (check first) |
| Log Files | Application logs, system logs | ✅ Yes (after review) |
| Media Files | Plex, Jellyfin transcodes | ✅ Yes (transcodes) |
| Backups | Old backup archives | ✅ Yes (keep recent) |
| Application Data | Various service data | ❌ No (review first) |
Cleanup Strategies by Service Type
Media Services (Plex, Jellyfin)
# Clear Plex transcode cache (run the glob inside the container's shell,
# otherwise the wildcard is not expanded)
docker exec plex sh -c 'rm -rf /transcode/*'
# Clear Jellyfin transcode cache
docker exec jellyfin sh -c 'rm -rf /config/data/transcodes/*'
# Find and remove old media previews
find /volume1/plex/Library/Application\ Support/Plex\ Media\ Server/Cache -type f -mtime +30 -delete
*arr Suite (Sonarr, Radarr, etc.)
# Clear old backup archives (files only, so the Backups directory survives)
docker exec sonarr find /config/Backups -type f -mtime +30 -delete
docker exec radarr find /config/Backups -type f -mtime +30 -delete
# Clean up old logs
docker exec sonarr find /config/logs -type f -mtime +30 -delete
docker exec radarr find /config/logs -type f -mtime +30 -delete
Database Services (PostgreSQL, MariaDB)
# Check database size
docker exec postgres psql -U user -c "SELECT pg_database.datname, pg_size_pretty(pg_database_size(pg_database.datname)) FROM pg_database;"
# Vacuum databases (for PostgreSQL)
docker exec postgres vacuumdb -U user --all --full --analyze
# Check MariaDB size (-it is needed for the interactive password prompt)
docker exec -it mariadb mysql -u root -p -e "SELECT table_schema AS 'Database', ROUND(SUM(data_length + index_length) / 1024 / 1024, 2) AS 'Size (MB)' FROM information_schema.TABLES GROUP BY table_schema;"
Monitoring Services (Prometheus, Grafana)
# Check Prometheus storage size
du -sh /volume1/docker/prometheus
# Prometheus retention is set via command-line flags, not in prometheus.yml
# Default: --storage.tsdb.retention.time=15d
# Consider reducing retention if space is critical
# Clear old Grafana sessions (files only)
docker exec grafana find /var/lib/grafana/sessions -type f -mtime +7 -delete
Immich (Photo Management)
# Check Immich storage usage
docker exec immich-server df -h /usr/src/app/upload
# Immich uses a lot of space for:
# - Original photos
# - Thumbnails
# - Encoded videos
# - ML models
# Remove stale partial uploads from the staging folder (files older than 90 days)
docker exec immich-server find /usr/src/app/upload/upload -type f -mtime +90 -delete
Long-Term Solutions
Solution 1: Configure Log Rotation
Create proper log rotation for Docker containers:
# Edit Docker daemon config
sudo nano /etc/docker/daemon.json
# Add log rotation settings
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
# Restart Docker
sudo systemctl restart docker # Linux
# OR for Synology
sudo synoservicectl --restart pkgctl-Docker
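A malformed daemon.json prevents the Docker daemon from starting at all, so it is worth validating before the restart. A sketch using python3's stdlib JSON checker (assumes python3 is on the host, which holds for the Ubuntu and Synology boxes here); shown against a temp file, but point it at /etc/docker/daemon.json in practice:

```shell
#!/bin/sh
# Validate a daemon.json candidate before restarting Docker.
CONF=$(mktemp)
cat > "$CONF" <<'EOF'
{
  "log-driver": "json-file",
  "log-opts": { "max-size": "10m", "max-file": "3" }
}
EOF
if python3 -m json.tool "$CONF" > /dev/null 2>&1; then
  echo "daemon.json OK"
else
  echo "daemon.json INVALID - fix before restarting Docker" >&2
fi
rm -f "$CONF"
```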
Solution 2: Set Up Automated Cleanup
Create a cleanup cron job:
# Create cleanup script
sudo nano /usr/local/bin/homelab-cleanup.sh
#!/bin/bash
# Homelab Automated Cleanup Script
# Remove stopped containers older than 7 days
docker container prune --filter "until=168h" --force
# Remove unused images older than 30 days
docker image prune --all --filter "until=720h" --force
# Remove unused volumes (BE CAREFUL - only if you're sure)
# docker volume prune --force
# Clear journal logs older than 7 days
journalctl --vacuum-time=7d
# Clear old backups (keep last 30 days)
find /volume1/backups -type f -mtime +30 -delete
echo "Cleanup completed at $(date)" >> /var/log/homelab-cleanup.log
# Make executable
sudo chmod +x /usr/local/bin/homelab-cleanup.sh
# Add to cron (runs weekly on Sunday at 3 AM)
(crontab -l 2>/dev/null; echo "0 3 * * 0 /usr/local/bin/homelab-cleanup.sh") | crontab -
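Before wiring `find … -delete` rules into cron, dry-run them: replace `-delete` with `-print` and confirm only the files you expect are matched. A self-contained demo with backdated files (GNU `touch -d`):

```shell
#!/bin/sh
# Dry-run a 30-day retention rule in a temp dir before trusting it in cron.
dir=$(mktemp -d)
touch -d "40 days ago" "$dir/old-backup.tar.gz"
touch "$dir/fresh-backup.tar.gz"
find "$dir" -type f -mtime +30 -print    # lists only old-backup.tar.gz
find "$dir" -type f -mtime +30 -delete
ls "$dir"                                # fresh-backup.tar.gz
rm -rf "$dir"
```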
Solution 3: Configure Service-Specific Retention
Update each service with appropriate retention policies:
Prometheus (docker-compose.yml) - retention is controlled by command-line flags, not prometheus.yml:
command:
  - '--storage.tsdb.retention.time=7d'    # reduce from the 15d default if space is tight
  - '--storage.tsdb.retention.size=50GB'  # hard size cap
Grafana (docker-compose.yml):
environment:
- GF_DATABASE_WAL=true
- GF_DATABASE_CLEANUP_INTERVAL=168h # Weekly cleanup
Plex (Plex settings):
- Settings → Transcoder → Transcoder temporary directory
- Settings → Scheduled Tasks → Clean Bundles (daily)
- Settings → Scheduled Tasks → Optimize Database (weekly)
Solution 4: Monitor Disk Usage Proactively
Set up monitoring alerts in Grafana:
# Alert rules for disk space
- alert: DiskSpaceWarning
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100) < 20
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Disk space warning on {{ $labels.instance }}"
    description: "Disk {{ $labels.mountpoint }} has less than 20% free space"
- alert: DiskSpaceCritical
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100) < 10
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "CRITICAL: Disk space on {{ $labels.instance }}"
    description: "Disk {{ $labels.mountpoint }} has less than 10% free space"
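The alert expressions are plain arithmetic: percent free = avail_bytes / size_bytes * 100. A quick shell check of the thresholds (the `pct_free` helper is illustrative only):

```shell
#!/bin/sh
# Evaluate the alert-rule arithmetic for given avail/size byte counts.
pct_free() { awk -v a="$1" -v s="$2" 'BEGIN { printf "%.0f\n", a / s * 100 }'; }

pct_free 9000000000 100000000000    # 9  -> below the 10% critical threshold
pct_free 15000000000 100000000000   # 15 -> below the 20% warning threshold
```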
Host-Specific Considerations
Atlantis (Synology DS1823xs+)
# Synology-specific cleanup
# Clear Synology logs
sudo find /var/log -name "*.log.*" -mtime +30 -delete
# Clear package logs
sudo find /var/packages/*/target/logs -name "*.log.*" -mtime +30 -delete
# Check storage pool status
sudo synostgpool --info
# DSM has built-in storage analyzer
# Control Panel → Storage Manager → Storage Analyzer
Calypso (Synology DS723+)
Same as Atlantis - use Synology-specific commands.
Concord NUC (Ubuntu)
# Ubuntu-specific cleanup
sudo apt-get clean
sudo apt-get autoclean
sudo apt-get autoremove --purge
# Clear old kernels (keep current + 1 previous)
sudo apt-get autoremove --purge $(dpkg -l 'linux-*' | sed '/^ii/!d;/'"$(uname -r | sed "s/\(.*\)-\([^0-9]\+\)/\1/")"'/d;s/^[^ ]* [^ ]* \([^ ]*\).*/\1/;/[0-9]/!d')
# Clear thumbnail cache
rm -rf ~/.cache/thumbnails/*
Homelab VM (Proxmox VM)
# VM-specific cleanup
# Clear apt cache
sudo apt-get clean
# Clear old cloud-init logs
sudo rm -rf /var/log/cloud-init*.log
# Compact QCOW2 disk (from Proxmox host)
# qemu-img convert -O qcow2 -c original.qcow2 compressed.qcow2
Verification Checklist
After cleanup, verify:
- Disk usage below 80%: `df -h`
- All critical containers running: `docker ps`
- No errors in recent logs: `docker logs [container] --tail 50`
- Services accessible via web interface
- Monitoring dashboards show normal metrics
- Backup jobs can complete successfully
- Automated cleanup configured for future
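The first checklist item can be a pass/fail gate instead of a manual read of `df -h`. A sketch (the `below_limit` function name is made up) shown against canned output so it can be tested offline; in practice run `df -P | below_limit 80`:

```shell
#!/bin/sh
# Succeed only when every filesystem in df -P output is below the limit.
below_limit() {
  awk -v limit="$1" 'NR > 1 { sub(/%/, "", $5); if ($5 + 0 >= limit) bad = 1 } END { exit bad }'
}

below_limit 80 <<'EOF'
Filesystem     1024-blocks     Used Available Capacity Mounted on
/dev/sda1         51474912 18530968  30294452      40% /
EOF
echo "exit=$?"   # exit=0: all filesystems below 80%
```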
Rollback Procedure
If cleanup causes issues:
- Check what was deleted: review command history and logs
- Restore from backups if critical data was deleted:
  `cd ~/Documents/repos/homelab && ./restore.sh [backup-date]`
- Recreate Docker volumes: if volumes were accidentally pruned
- Restart affected services: redeploy from Portainer
Troubleshooting
Issue: Still Running Out of Space After Cleanup
Solution: Consider adding more storage
- Add external USB drives
- Expand existing RAID arrays
- Move services to hosts with more space
- Archive old media to cold storage
Issue: Docker Prune Removed Important Data
Solution:
- Always use `--filter` to be selective
- Never use `docker volume prune` without checking first
- Keep recent backups before major cleanup operations
Issue: Services Won't Start After Cleanup
Solution:
# Check for missing volumes
docker ps -a
docker volume ls
# Check logs
docker logs [container]
# Recreate volumes if needed (restore from backup)
./restore.sh [backup-date]
Prevention Checklist
- Log rotation configured for all services
- Automated cleanup script running weekly
- Monitoring alerts set up for disk space
- Retention policies configured appropriately
- Regular backup verification scheduled
- Capacity planning review quarterly
Change Log
- 2026-02-14 - Initial creation with host-specific procedures
- 2026-02-14 - Added service-specific cleanup strategies