
Disk Full Procedure Runbook

Overview

This runbook provides procedures for handling disk space emergencies across all homelab hosts. It includes immediate actions, root cause analysis, and long-term solutions to prevent recurrence.

Prerequisites

  • SSH access to affected host
  • Root/sudo privileges on the host
  • Monitoring dashboards access
  • Backup verification capability

Metadata

  • Estimated Time: 30-90 minutes (depending on severity)
  • Risk Level: High (data loss possible if not handled carefully)
  • Requires Downtime: Minimal (may need to stop services temporarily)
  • Reversible: Partially (deleted data cannot be recovered)
  • Tested On: 2026-02-14

Severity Levels

| Level | Disk Usage | Action Required | Urgency |
| --- | --- | --- | --- |
| 🟢 Normal | < 80% | Monitor | Low |
| 🟡 Warning | 80-90% | Plan cleanup | Medium |
| 🟠 Critical | 90-95% | Immediate cleanup | High |
| 🔴 Emergency | > 95% | Emergency response | Critical |
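The thresholds above can be encoded in a small helper for scripting triage (a sketch: the `severity` function name and the exact boundary handling are assumptions, not part of the runbook):

```shell
#!/bin/sh
# severity: map a df "Use%" value onto the runbook's severity levels.
# Boundary handling (90% -> Critical, 95% -> Critical) is an assumption.
severity() {
    pct=${1%\%}                       # accept "97" or "97%"
    if   [ "$pct" -gt 95 ]; then echo "Emergency"
    elif [ "$pct" -ge 90 ]; then echo "Critical"
    elif [ "$pct" -ge 80 ]; then echo "Warning"
    else                         echo "Normal"
    fi
}

severity 97%    # Emergency
severity 85     # Warning
```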

Quick Triage

First, determine which host and volume are affected:

# Check all hosts disk usage
ssh atlantis "df -h"
ssh calypso "df -h"
ssh concordnuc "df -h"
ssh homelab-vm "df -h"
ssh raspberry-pi-5 "df -h"

Emergency Procedure (>95% Full)

Step 1: Immediate Space Recovery

Goal: Free up 5-10% space immediately to prevent system issues.

# SSH to affected host
ssh [hostname]

# Identify what's consuming space
df -h
du -sh /* 2>/dev/null | sort -rh | head -20

# Quick wins - Clear Docker cache
docker system df  # See what Docker is using
docker system prune -a --volumes --force  # Reclaim space (BE CAREFUL!)

# This typically frees 10-50GB depending on your setup

⚠️ WARNING: docker system prune will remove:

  • Stopped containers
  • Unused networks
  • Dangling images
  • Build cache
  • Unused volumes (with --volumes flag)

Safer alternative if you're unsure:

# Less aggressive - removes only stopped containers and dangling images
docker system prune --force

Step 2: Clear Log Files

# Find large log files
find /var/log -type f -size +100M -exec ls -lh {} \; | sort -k 5 -rh

# Clear systemd journal (keeps last 3 days)
sudo journalctl --vacuum-time=3d

# Clear old Docker logs
sudo sh -c 'truncate -s 0 /var/lib/docker/containers/*/*-json.log'

# For Synology NAS
sudo find /volume1/@docker/containers -name "*-json.log" -size +100M -exec truncate -s 0 {} \;

Step 3: Remove Old Docker Images

# List images by size
docker images --format "{{.Repository}}:{{.Tag}}\t{{.Size}}" | sort -k 2 -rh | head -20

# Remove specific old images
docker image rm [image:tag]

# Remove all unused images
docker image prune -a --force

Step 4: Verify Space Recovered

# Check current usage
df -h

# Verify critical services are running
docker ps

# Check container logs for errors
docker ps --format "{{.Names}}" | xargs -I {} sh -c 'echo "=== {} ===" && docker logs --tail 20 {}'

Detailed Analysis Procedure

Once the immediate danger has passed, perform a thorough analysis:

Step 1: Identify Space Consumers

# Comprehensive disk usage analysis
sudo du -h --max-depth=2 / 2>/dev/null | sort -rh | head -30

# For Synology NAS specifically
sudo du -h --max-depth=2 /volume1 2>/dev/null | sort -rh | head -30

# Check Docker volumes
docker volume ls
docker system df -v

# Check specific large directories
du -sh /var/lib/docker/* | sort -rh
du -sh /volume1/docker/* | sort -rh  # Synology

Step 2: Analyze by Service

Create a space usage report:

# Create analysis script
cat > /tmp/analyze-space.sh << 'EOF'
#!/bin/bash
echo "=== Docker Container Volumes ==="
docker ps --format "{{.Names}}" | while read -r container; do
    size=$(docker exec "$container" du -sh / 2>/dev/null | awk '{print $1}')
    echo "$container: $size"
done | sort -t: -k2 -rh

echo ""
echo "=== Docker Volumes ==="
docker volume ls --format "{{.Name}}" | while read -r vol; do
    size=$(docker volume inspect "$vol" --format '{{.Mountpoint}}' | xargs sudo du -sh 2>/dev/null | awk '{print $1}')
    echo "$vol: $size"
done | sort -t: -k2 -rh

echo ""
echo "=== Log Files Over 100MB ==="
find /var/log -type f -size +100M -exec ls -lh {} \; 2>/dev/null
EOF

chmod +x /tmp/analyze-space.sh
/tmp/analyze-space.sh

Step 3: Categorize Findings

Identify the primary space consumers:

| Category | Typical Culprits | Safe to Delete? |
| --- | --- | --- |
| Docker Images | Old/unused image versions | Yes (if unused) |
| Docker Volumes | Database growth, media cache | ⚠️ Maybe (check first) |
| Log Files | Application logs, system logs | Yes (after review) |
| Media Files | Plex, Jellyfin transcodes | Yes (transcodes) |
| Backups | Old backup archives | Yes (keep recent) |
| Application Data | Various service data | No (review first) |

Cleanup Strategies by Service Type

Media Services (Plex, Jellyfin)

# Clear Plex transcode cache
docker exec plex rm -rf /transcode/*

# Clear Jellyfin transcode cache
docker exec jellyfin rm -rf /config/data/transcodes/*

# Find and remove old media previews
find /volume1/plex/Library/Application\ Support/Plex\ Media\ Server/Cache -type f -mtime +30 -delete

*arr Suite (Sonarr, Radarr, etc.)

# Clear old application backups (keep the last 30 days)
docker exec sonarr find /config/Backups -type f -mtime +30 -delete
docker exec radarr find /config/Backups -type f -mtime +30 -delete

# Clean up old logs
docker exec sonarr find /config/logs -type f -mtime +30 -delete
docker exec radarr find /config/logs -type f -mtime +30 -delete

Database Services (PostgreSQL, MariaDB)

# Check database size
docker exec postgres psql -U user -c "SELECT pg_database.datname, pg_size_pretty(pg_database_size(pg_database.datname)) FROM pg_database;"

# Vacuum databases (PostgreSQL; --full rewrites tables and takes exclusive
# locks, so run it during a maintenance window)
docker exec postgres vacuumdb -U user --all --full --analyze

# Check MariaDB size
docker exec mariadb mysql -u root -p -e "SELECT table_schema AS 'Database', ROUND(SUM(data_length + index_length) / 1024 / 1024, 2) AS 'Size (MB)' FROM information_schema.TABLES GROUP BY table_schema;"

Monitoring Services (Prometheus, Grafana)

# Check Prometheus storage size
du -sh /volume1/docker/prometheus

# Prometheus retention is set via command-line flags, not prometheus.yml
# Default: --storage.tsdb.retention.time=15d
# Consider reducing retention if space is critical

# Clear old Grafana sessions
docker exec grafana find /var/lib/grafana/sessions -type f -mtime +7 -delete

Immich (Photo Management)

# Check Immich storage usage
docker exec immich-server df -h /usr/src/app/upload

# Immich uses a lot of space for:
# - Original photos
# - Thumbnails
# - Encoded videos
# - ML models

# Remove stale partial uploads from the staging area (not the photo library)
docker exec immich-server find /usr/src/app/upload/upload -type f -mtime +90 -delete

Long-Term Solutions

Solution 1: Configure Log Rotation

Create proper log rotation for Docker containers:

# Edit Docker daemon config
sudo nano /etc/docker/daemon.json

# Add log rotation settings
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}

# Restart Docker
sudo systemctl restart docker  # Linux
# OR for Synology
sudo synoservicectl --restart pkgctl-Docker

Solution 2: Set Up Automated Cleanup

Create a cleanup cron job:

# Create cleanup script
sudo nano /usr/local/bin/homelab-cleanup.sh

#!/bin/bash
# Homelab Automated Cleanup Script

# Remove stopped containers older than 7 days
docker container prune --filter "until=168h" --force

# Remove unused images older than 30 days
docker image prune --all --filter "until=720h" --force

# Remove unused volumes (BE CAREFUL - only if you're sure)
# docker volume prune --force

# Clear journal logs older than 7 days
journalctl --vacuum-time=7d

# Clear old backups (keep last 30 days)
find /volume1/backups -type f -mtime +30 -delete

echo "Cleanup completed at $(date)" >> /var/log/homelab-cleanup.log

# Make executable
sudo chmod +x /usr/local/bin/homelab-cleanup.sh

# Add to cron (runs weekly on Sunday at 3 AM)
(crontab -l 2>/dev/null; echo "0 3 * * 0 /usr/local/bin/homelab-cleanup.sh") | crontab -
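The `until` filters in the script above are hour-based; 168h and 720h are simply 7 and 30 days converted (plain arithmetic, nothing Docker-specific):

```shell
# Sanity check for the hour-based prune filters: 7 days and 30 days in hours.
echo "$((7 * 24))h"    # 168h
echo "$((30 * 24))h"   # 720h
```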

Solution 3: Configure Service-Specific Retention

Update each service with appropriate retention policies:

Prometheus (retention is configured via command-line flags, e.g. in docker-compose.yml):

command:
  - '--storage.tsdb.retention.time=7d'    # default is 15d; reduce if space is tight
  - '--storage.tsdb.retention.size=50GB'  # hard size cap (either or both may be set)

Grafana (docker-compose.yml):

environment:
  - GF_DATABASE_WAL=true
  - GF_DATABASE_CLEANUP_INTERVAL=168h  # Weekly cleanup

Plex (Plex settings):

  • Settings → Transcoder → Transcoder temporary directory
  • Settings → Scheduled Tasks → Clean Bundles (daily)
  • Settings → Scheduled Tasks → Optimize Database (weekly)

Solution 4: Monitor Disk Usage Proactively

Set up monitoring alerts in Grafana:

# Alert rule for disk space
- alert: DiskSpaceWarning
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100) < 20
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Disk space warning on {{ $labels.instance }}"
    description: "Disk {{ $labels.mountpoint }} has less than 20% free space"

- alert: DiskSpaceCritical
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100) < 10
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "CRITICAL: Disk space on {{ $labels.instance }}"
    description: "Disk {{ $labels.mountpoint }} has less than 10% free space"

Host-Specific Considerations

Atlantis (Synology DS1823xs+)

# Synology-specific cleanup
# Clear rotated Synology logs
sudo find /var/log -name "*.log.*" -type f -mtime +30 -delete

# Clear package logs
sudo find /var/packages/*/target/logs -name "*.log.*" -type f -mtime +30 -delete

# Check storage pool status
sudo synostgpool --info

# DSM has built-in storage analyzer
# Control Panel → Storage Manager → Storage Analyzer

Calypso (Synology DS723+)

Same as Atlantis - use Synology-specific commands.

Concord NUC (Ubuntu)

# Ubuntu-specific cleanup
sudo apt-get clean
sudo apt-get autoclean
sudo apt-get autoremove --purge

# Clear old kernels (keep current + 1 previous)
sudo apt-get autoremove --purge $(dpkg -l 'linux-*' | sed '/^ii/!d;/'"$(uname -r | sed "s/\(.*\)-\([^0-9]\+\)/\1/")"'/d;s/^[^ ]* [^ ]* \([^ ]*\).*/\1/;/[0-9]/!d')

# Clear thumbnail cache
rm -rf ~/.cache/thumbnails/*
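The kernel-removal one-liner is dense; its core is the inner sed, which strips the flavour suffix from the running kernel's release string so that kernel's own packages are excluded from the purge list. Isolated here with an illustrative release string (the value is hypothetical; the one-liner uses `$(uname -r)`):

```shell
# Strip the flavour suffix (e.g. "-generic") from a kernel release string,
# as the one-liner does with $(uname -r). The example value is illustrative.
release="6.8.0-45-generic"
echo "$release" | sed 's/\(.*\)-\([^0-9]\+\)/\1/'   # 6.8.0-45
```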

Homelab VM (Proxmox VM)

# VM-specific cleanup
# Clear apt cache
sudo apt-get clean

# Clear old cloud-init logs
sudo rm -rf /var/log/cloud-init*.log

# Compact QCOW2 disk (from Proxmox host)
# qemu-img convert -O qcow2 -c original.qcow2 compressed.qcow2

Verification Checklist

After cleanup, verify:

  • Disk usage below 80%: df -h
  • All critical containers running: docker ps
  • No errors in recent logs: docker logs [container] --tail 50
  • Services accessible via web interface
  • Monitoring dashboards show normal metrics
  • Backup jobs can complete successfully
  • Automated cleanup configured for future
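The first checklist item can be scripted. A sketch (the `check_usage` name is mine): a filter that reads `df -P` output and flags any mount above a threshold, exiting non-zero so it can gate automation:

```shell
# check_usage: read `df -P` output on stdin, print mounts above LIMIT percent,
# and exit non-zero if any are found. Real use: df -P | check_usage 80
check_usage() {
    awk -v limit="$1" \
        'NR > 1 && $5 + 0 > limit { print $6 " at " $5; bad = 1 }
         END { exit bad }'
}

# Demonstration against canned df output:
sample='Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/sda1 1000000 960000 40000 96% /
/dev/sdb1 1000000 100000 900000 10% /mnt/data'
echo "$sample" | check_usage 80 || echo "cleanup incomplete"
```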

Rollback Procedure

If cleanup causes issues:

  1. Check what was deleted: Review command history and logs
  2. Restore from backups: If critical data was deleted
    cd ~/Documents/repos/homelab
    ./restore.sh [backup-date]
    
  3. Recreate Docker volumes: If volumes were accidentally pruned
  4. Restart affected services: Redeploy from Portainer

Troubleshooting

Issue: Still Running Out of Space After Cleanup

Solution: Consider adding more storage

  • Add external USB drives
  • Expand existing RAID arrays
  • Move services to hosts with more space
  • Archive old media to cold storage

Issue: Docker Prune Removed Important Data

Solution:

  • Always use --filter to be selective
  • Never use docker volume prune without checking first
  • Keep recent backups before major cleanup operations

Issue: Services Won't Start After Cleanup

Solution:

# Check for missing volumes
docker ps -a
docker volume ls

# Check logs
docker logs [container]

# Recreate volumes if needed (restore from backup)
./restore.sh [backup-date]

Prevention Checklist

  • Log rotation configured for all services
  • Automated cleanup script running weekly
  • Monitoring alerts set up for disk space
  • Retention policies configured appropriately
  • Regular backup verification scheduled
  • Capacity planning review quarterly

Change Log

  • 2026-02-14 - Initial creation with host-specific procedures
  • 2026-02-14 - Added service-specific cleanup strategies