
Disk Full Procedure Runbook

Overview

This runbook provides procedures for handling disk space emergencies across all homelab hosts. It includes immediate actions, root cause analysis, and long-term solutions to prevent recurrence.

Prerequisites

  • SSH access to affected host
  • Root/sudo privileges on the host
  • Monitoring dashboards access
  • Backup verification capability

Metadata

  • Estimated Time: 30-90 minutes (depending on severity)
  • Risk Level: High (data loss possible if not handled carefully)
  • Requires Downtime: Minimal (may need to stop services temporarily)
  • Reversible: Partially (deleted data cannot be recovered)
  • Tested On: 2026-02-14

Severity Levels

| Level | Disk Usage | Action Required | Urgency |
| --- | --- | --- | --- |
| 🟢 Normal | < 80% | Monitor | Low |
| 🟡 Warning | 80-90% | Plan cleanup | Medium |
| 🟠 Critical | 90-95% | Immediate cleanup | High |
| 🔴 Emergency | > 95% | Emergency response | Critical |
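The thresholds above can be encoded in a small helper for scripting triage (a sketch: the `severity` function name and the exact boundary handling are assumptions, not part of the runbook):

```shell
#!/bin/sh
# severity: map a df "Use%" value onto the runbook's severity levels.
# Boundary handling (90% -> Critical, 95% -> Critical) is an assumption.
severity() {
    pct=${1%\%}                       # accept "97" or "97%"
    if   [ "$pct" -gt 95 ]; then echo "Emergency"
    elif [ "$pct" -ge 90 ]; then echo "Critical"
    elif [ "$pct" -ge 80 ]; then echo "Warning"
    else                         echo "Normal"
    fi
}

severity 97%    # Emergency
severity 85     # Warning
```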

Quick Triage

First, determine which host and volume are affected:

# Check all hosts disk usage
ssh atlantis "df -h"
ssh calypso "df -h"
ssh concordnuc "df -h"
ssh homelab-vm "df -h"
ssh raspberry-pi-5 "df -h"

Emergency Procedure (>95% Full)

Step 1: Immediate Space Recovery

Goal: Free up 5-10% space immediately to prevent system issues.

# SSH to affected host
ssh [hostname]

# Identify what's consuming space
df -h
du -sh /* 2>/dev/null | sort -rh | head -20

# Quick wins - Clear Docker cache
docker system df  # See what Docker is using
docker system prune -a --volumes --force  # Reclaim space (BE CAREFUL!)

# This typically frees 10-50GB depending on your setup

⚠️ WARNING: docker system prune will remove:

  • Stopped containers
  • Unused networks
  • Dangling images
  • Build cache
  • Unused volumes (with --volumes flag)

Safer alternative if you're unsure:

# Less aggressive - removes only stopped containers and dangling images
docker system prune --force

Step 2: Clear Log Files

# Find large log files
find /var/log -type f -size +100M -exec ls -lh {} \; | sort -k 5 -rh

# Clear systemd journal (keeps last 3 days)
sudo journalctl --vacuum-time=3d

# Clear old Docker logs
sudo sh -c 'truncate -s 0 /var/lib/docker/containers/*/*-json.log'

# For Synology NAS
sudo find /volume1/@docker/containers -name "*-json.log" -size +100M -exec truncate -s 0 {} \;

Step 3: Remove Old Docker Images

# List images by size
docker images --format "{{.Repository}}:{{.Tag}}\t{{.Size}}" | sort -k 2 -rh | head -20

# Remove specific old images
docker image rm [image:tag]

# Remove all unused images
docker image prune -a --force

Step 4: Verify Space Recovered

# Check current usage
df -h

# Verify critical services are running
docker ps

# Check container logs for errors
docker ps --format "{{.Names}}" | xargs -I {} sh -c 'echo "=== {} ===" && docker logs --tail 20 {}'

Detailed Analysis Procedure

Once the immediate danger has passed, perform a thorough analysis:

Step 1: Identify Space Consumers

# Comprehensive disk usage analysis
sudo du -h --max-depth=2 / 2>/dev/null | sort -rh | head -30

# For Synology NAS specifically
sudo du -h --max-depth=2 /volume1 2>/dev/null | sort -rh | head -30

# Check Docker volumes
docker volume ls
docker system df -v

# Check specific large directories
du -sh /var/lib/docker/* | sort -rh
du -sh /volume1/docker/* | sort -rh  # Synology

Step 2: Analyze by Service

Create a space usage report:

# Create analysis script
cat > /tmp/analyze-space.sh << 'EOF'
#!/bin/bash
echo "=== Docker Container Volumes ==="
docker ps --format "{{.Names}}" | while read -r container; do
    size=$(docker exec "$container" du -sh / 2>/dev/null | awk '{print $1}')
    echo "$container: $size"
done | sort -t: -k2 -rh

echo ""
echo "=== Docker Volumes ==="
docker volume ls --format "{{.Name}}" | while read -r vol; do
    size=$(docker volume inspect "$vol" --format '{{.Mountpoint}}' | xargs sudo du -sh 2>/dev/null | awk '{print $1}')
    echo "$vol: $size"
done | sort -t: -k2 -rh

echo ""
echo "=== Log Files Over 100MB ==="
find /var/log -type f -size +100M -exec ls -lh {} \; 2>/dev/null
EOF

chmod +x /tmp/analyze-space.sh
/tmp/analyze-space.sh

Step 3: Categorize Findings

Identify the primary space consumers:

| Category | Typical Culprits | Safe to Delete? |
| --- | --- | --- |
| Docker Images | Old/unused image versions | Yes (if unused) |
| Docker Volumes | Database growth, media cache | ⚠️ Maybe (check first) |
| Log Files | Application logs, system logs | Yes (after review) |
| Media Files | Plex, Jellyfin transcodes | Yes (transcodes) |
| Backups | Old backup archives | Yes (keep recent) |
| Application Data | Various service data | No (review first) |

Cleanup Strategies by Service Type

Media Services (Plex, Jellyfin)

# Clear Plex transcode cache
docker exec plex rm -rf /transcode/*

# Clear Jellyfin transcode cache
docker exec jellyfin rm -rf /config/data/transcodes/*

# Find and remove old media previews
find /volume1/plex/Library/Application\ Support/Plex\ Media\ Server/Cache -type f -mtime +30 -delete

*arr Suite (Sonarr, Radarr, etc.)

# Clear old application backups (keep the last 30 days)
docker exec sonarr find /config/Backups -type f -mtime +30 -delete
docker exec radarr find /config/Backups -type f -mtime +30 -delete

# Clean up old logs
docker exec sonarr find /config/logs -type f -mtime +30 -delete
docker exec radarr find /config/logs -type f -mtime +30 -delete

Database Services (PostgreSQL, MariaDB)

# Check database size
docker exec postgres psql -U user -c "SELECT pg_database.datname, pg_size_pretty(pg_database_size(pg_database.datname)) FROM pg_database;"

# Vacuum databases (PostgreSQL; --full rewrites tables and takes exclusive
# locks, so run it during a maintenance window)
docker exec postgres vacuumdb -U user --all --full --analyze

# Check MariaDB size
docker exec mariadb mysql -u root -p -e "SELECT table_schema AS 'Database', ROUND(SUM(data_length + index_length) / 1024 / 1024, 2) AS 'Size (MB)' FROM information_schema.TABLES GROUP BY table_schema;"

Monitoring Services (Prometheus, Grafana)

# Check Prometheus storage size
du -sh /volume1/docker/prometheus

# Prometheus retention is set via command-line flags, not prometheus.yml
# Default: --storage.tsdb.retention.time=15d
# Consider reducing retention if space is critical

# Clear old Grafana sessions
docker exec grafana find /var/lib/grafana/sessions -type f -mtime +7 -delete

Immich (Photo Management)

# Check Immich storage usage
docker exec immich-server df -h /usr/src/app/upload

# Immich uses a lot of space for:
# - Original photos
# - Thumbnails
# - Encoded videos
# - ML models

# Remove stale partial uploads from the staging area (not the photo library)
docker exec immich-server find /usr/src/app/upload/upload -type f -mtime +90 -delete

Long-Term Solutions

Solution 1: Configure Log Rotation

Create proper log rotation for Docker containers:

# Edit Docker daemon config
sudo nano /etc/docker/daemon.json

# Add log rotation settings
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}

# Restart Docker
sudo systemctl restart docker  # Linux
# OR for Synology
sudo synoservicectl --restart pkgctl-Docker

Solution 2: Set Up Automated Cleanup

Create a cleanup cron job:

# Create cleanup script
sudo nano /usr/local/bin/homelab-cleanup.sh

#!/bin/bash
# Homelab Automated Cleanup Script

# Remove stopped containers older than 7 days
docker container prune --filter "until=168h" --force

# Remove unused images older than 30 days
docker image prune --all --filter "until=720h" --force

# Remove unused volumes (BE CAREFUL - only if you're sure)
# docker volume prune --force

# Clear journal logs older than 7 days
journalctl --vacuum-time=7d

# Clear old backups (keep last 30 days)
find /volume1/backups -type f -mtime +30 -delete

echo "Cleanup completed at $(date)" >> /var/log/homelab-cleanup.log

# Make executable
sudo chmod +x /usr/local/bin/homelab-cleanup.sh

# Add to cron (runs weekly on Sunday at 3 AM)
(crontab -l 2>/dev/null; echo "0 3 * * 0 /usr/local/bin/homelab-cleanup.sh") | crontab -
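The `until` filters in the script above are hour-based; 168h and 720h are simply 7 and 30 days converted (plain arithmetic, nothing Docker-specific):

```shell
# Sanity check for the hour-based prune filters: 7 days and 30 days in hours.
echo "$((7 * 24))h"    # 168h
echo "$((30 * 24))h"   # 720h
```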

Solution 3: Configure Service-Specific Retention

Update each service with appropriate retention policies:

Prometheus (retention is configured via command-line flags, e.g. in docker-compose.yml):

command:
  - '--storage.tsdb.retention.time=7d'    # default is 15d; reduce if space is tight
  - '--storage.tsdb.retention.size=50GB'  # hard size cap (either or both may be set)

Grafana (docker-compose.yml):

environment:
  - GF_DATABASE_WAL=true
  - GF_DATABASE_CLEANUP_INTERVAL=168h  # Weekly cleanup

Plex (Plex settings):

  • Settings → Transcoder → Transcoder temporary directory
  • Settings → Scheduled Tasks → Clean Bundles (daily)
  • Settings → Scheduled Tasks → Optimize Database (weekly)

Solution 4: Monitor Disk Usage Proactively

Set up monitoring alerts in Grafana:

# Alert rule for disk space
- alert: DiskSpaceWarning
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100) < 20
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Disk space warning on {{ $labels.instance }}"
    description: "Disk {{ $labels.mountpoint }} has less than 20% free space"

- alert: DiskSpaceCritical
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100) < 10
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "CRITICAL: Disk space on {{ $labels.instance }}"
    description: "Disk {{ $labels.mountpoint }} has less than 10% free space"

Host-Specific Considerations

Atlantis (Synology DS1823xs+)

# Synology-specific cleanup
# Clear rotated Synology logs
sudo find /var/log -name "*.log.*" -type f -mtime +30 -delete

# Clear package logs
sudo find /var/packages/*/target/logs -name "*.log.*" -type f -mtime +30 -delete

# Check storage pool status
sudo synostgpool --info

# DSM has built-in storage analyzer
# Control Panel → Storage Manager → Storage Analyzer

Calypso (Synology DS723+)

Same as Atlantis - use Synology-specific commands.

Concord NUC (Ubuntu)

# Ubuntu-specific cleanup
sudo apt-get clean
sudo apt-get autoclean
sudo apt-get autoremove --purge

# Clear old kernels (keep current + 1 previous)
sudo apt-get autoremove --purge $(dpkg -l 'linux-*' | sed '/^ii/!d;/'"$(uname -r | sed "s/\(.*\)-\([^0-9]\+\)/\1/")"'/d;s/^[^ ]* [^ ]* \([^ ]*\).*/\1/;/[0-9]/!d')

# Clear thumbnail cache
rm -rf ~/.cache/thumbnails/*
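The kernel-removal one-liner is dense; its core is the inner sed, which strips the flavour suffix from the running kernel's release string so that kernel's own packages are excluded from the purge list. Isolated here with an illustrative release string (the value is hypothetical; the one-liner uses `$(uname -r)`):

```shell
# Strip the flavour suffix (e.g. "-generic") from a kernel release string,
# as the one-liner does with $(uname -r). The example value is illustrative.
release="6.8.0-45-generic"
echo "$release" | sed 's/\(.*\)-\([^0-9]\+\)/\1/'   # 6.8.0-45
```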

Homelab VM (Proxmox VM)

# VM-specific cleanup
# Clear apt cache
sudo apt-get clean

# Clear old cloud-init logs
sudo rm -rf /var/log/cloud-init*.log

# Compact QCOW2 disk (from Proxmox host)
# qemu-img convert -O qcow2 -c original.qcow2 compressed.qcow2

Verification Checklist

After cleanup, verify:

  • Disk usage below 80%: df -h
  • All critical containers running: docker ps
  • No errors in recent logs: docker logs [container] --tail 50
  • Services accessible via web interface
  • Monitoring dashboards show normal metrics
  • Backup jobs can complete successfully
  • Automated cleanup configured for future
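The first checklist item can be scripted. A sketch (the `check_usage` name is mine): a filter that reads `df -P` output and flags any mount above a threshold, exiting non-zero so it can gate automation:

```shell
# check_usage: read `df -P` output on stdin, print mounts above LIMIT percent,
# and exit non-zero if any are found. Real use: df -P | check_usage 80
check_usage() {
    awk -v limit="$1" \
        'NR > 1 && $5 + 0 > limit { print $6 " at " $5; bad = 1 }
         END { exit bad }'
}

# Demonstration against canned df output:
sample='Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/sda1 1000000 960000 40000 96% /
/dev/sdb1 1000000 100000 900000 10% /mnt/data'
echo "$sample" | check_usage 80 || echo "cleanup incomplete"
```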

Rollback Procedure

If cleanup causes issues:

  1. Check what was deleted: Review command history and logs
  2. Restore from backups: If critical data was deleted
    cd ~/Documents/repos/homelab
    ./restore.sh [backup-date]
    
  3. Recreate Docker volumes: If volumes were accidentally pruned
  4. Restart affected services: Redeploy from Portainer

Troubleshooting

Issue: Still Running Out of Space After Cleanup

Solution: Consider adding more storage

  • Add external USB drives
  • Expand existing RAID arrays
  • Move services to hosts with more space
  • Archive old media to cold storage

Issue: Docker Prune Removed Important Data

Solution:

  • Always use --filter to be selective
  • Never use docker volume prune without checking first
  • Keep recent backups before major cleanup operations

Issue: Services Won't Start After Cleanup

Solution:

# Check for missing volumes
docker ps -a
docker volume ls

# Check logs
docker logs [container]

# Recreate volumes if needed (restore from backup)
./restore.sh [backup-date]

Prevention Checklist

  • Log rotation configured for all services
  • Automated cleanup script running weekly
  • Monitoring alerts set up for disk space
  • Retention policies configured appropriately
  • Regular backup verification scheduled
  • Capacity planning review quarterly

Change Log

  • 2026-02-14 - Initial creation with host-specific procedures
  • 2026-02-14 - Added service-specific cleanup strategies