# Disk Full Procedure Runbook
## Overview
This runbook provides procedures for handling disk space emergencies across all homelab hosts. It includes immediate actions, root cause analysis, and long-term solutions to prevent recurrence.
## Prerequisites
- [ ] SSH access to affected host
- [ ] Root/sudo privileges on the host
- [ ] Monitoring dashboards access
- [ ] Backup verification capability
## Metadata
- **Estimated Time**: 30-90 minutes (depending on severity)
- **Risk Level**: High (data loss possible if not handled carefully)
- **Requires Downtime**: Minimal (may need to stop services temporarily)
- **Reversible**: Partially (deleted data cannot be recovered)
- **Tested On**: 2026-02-14
## Severity Levels
| Level | Disk Usage | Action Required | Urgency |
|-------|------------|-----------------|---------|
| 🟢 **Normal** | < 80% | Monitor | Low |
| 🟡 **Warning** | 80-90% | Plan cleanup | Medium |
| 🟠 **Critical** | 90-95% | Immediate cleanup | High |
| 🔴 **Emergency** | > 95% | Emergency response | Critical |
## Quick Triage
First, determine which host and volume are affected:
```bash
# Check all hosts disk usage
ssh atlantis "df -h"
ssh calypso "df -h"
ssh concordnuc "df -h"
ssh homelab-vm "df -h"
ssh raspberry-pi-5 "df -h"
```
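The per-host output can also be scanned automatically against the severity table above. A minimal sketch, assuming `df -P`-style output; the heredoc below is illustrative sample data, not from a real host. In practice you would pipe `ssh <host> "df -P"` into `classify`:

```shell
#!/bin/sh
# Classify each filesystem line of `df` output against the severity table.
classify() {
  awk 'NR > 1 {
    pct = $5; sub(/%/, "", pct)                  # strip the % sign
    if      (pct > 95)  level = "EMERGENCY"
    else if (pct > 90)  level = "CRITICAL"
    else if (pct >= 80) level = "WARNING"
    else                level = "normal"
    printf "%-10s %3d%%  %s\n", level, pct, $6
  }'
}

classify << 'EOF'
Filesystem     1K-blocks      Used Available Use% Mounted on
/dev/sda1       51474044  49400282   2073762  96% /
/dev/sdb1      103754244  87154244  16600000  85% /volume1
tmpfs            4096000    102400   3993600   3% /tmp
EOF
```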
## Emergency Procedure (>95% Full)
### Step 1: Immediate Space Recovery
**Goal**: Free up 5-10% space immediately to prevent system issues.
```bash
# SSH to affected host
ssh [hostname]
# Identify what's consuming space
df -h
du -sh /* 2>/dev/null | sort -rh | head -20
# Quick wins - Clear Docker cache
docker system df # See what Docker is using
docker system prune -a --volumes --force # Reclaim space (BE CAREFUL!)
# This typically frees 10-50GB depending on your setup
```
**⚠️ WARNING**: `docker system prune -a --volumes` will remove:
- All stopped containers
- All unused networks
- All unused images, not just dangling ones (because of `-a`)
- All build cache
- Unused volumes (only with the `--volumes` flag)
**Safer alternative** if you're unsure:
```bash
# Less aggressive - removes only stopped containers and dangling images
docker system prune --force
```
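Before pruning, the RECLAIMABLE column of `docker system df` shows what each prune could free; summing it gives a quick estimate. A sketch that parses that output with awk; the heredoc is illustrative sample output, and in practice you would pipe `docker system df` into `reclaimable`:

```shell
#!/bin/sh
# Sum the RECLAIMABLE column of `docker system df` to estimate how much a
# prune could free.
reclaimable() {
  awk 'NR > 1 {
    # The reclaimable size is the right-most field that looks like a size
    # (the trailing "(NN%)" field, when present, does not match).
    for (i = NF; i >= 1; i--)
      if ($i ~ /^[0-9.]+(GB|MB|kB|B)$/) { v = $i; break }
    if      (v ~ /GB$/) { sub(/GB$/, "", v); sum += v }
    else if (v ~ /MB$/) { sub(/MB$/, "", v); sum += v / 1024 }
    else                { next }   # treat kB/B amounts as negligible
  }
  END { printf "~%.1fGB reclaimable\n", sum }'
}

reclaimable << 'EOF'
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          24        12        38.2GB    21.7GB (56%)
Containers      15        12        1.2GB     450MB (37%)
Local Volumes   18        14        52.1GB    9.8GB (18%)
Build Cache     120       0         14.3GB    14.3GB
EOF
```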
### Step 2: Clear Log Files
```bash
# Find large log files
find /var/log -type f -size +100M -exec ls -lh {} \; | sort -k 5 -rh
# Clear systemd journal (keeps last 3 days)
sudo journalctl --vacuum-time=3d
# Clear old Docker logs
sudo sh -c 'truncate -s 0 /var/lib/docker/containers/*/*-json.log'
# For Synology NAS
sudo find /volume1/@docker/containers -name "*-json.log" -size +100M -exec truncate -s 0 {} \;
```
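The commands above use `truncate -s 0` rather than `rm` because a service that holds a log file open keeps its disk blocks allocated until the file descriptor closes; deleting the path frees nothing until the process restarts. A self-contained demonstration of the truncate behavior:

```shell
#!/bin/sh
# Truncating a file in place: the path and inode survive, but the size
# drops to zero immediately -- safe even while a writer holds it open.
tmp=$(mktemp)
dd if=/dev/zero of="$tmp" bs=1024 count=1024 2>/dev/null   # a 1 MiB "log"
before=$(wc -c < "$tmp")
truncate -s 0 "$tmp"             # same command the runbook uses on Docker logs
after=$(wc -c < "$tmp")
echo "before=${before} after=${after}"
rm -f "$tmp"
```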
### Step 3: Remove Old Docker Images
```bash
# List images by size
docker images --format "{{.Repository}}:{{.Tag}}\t{{.Size}}" | sort -k 2 -rh | head -20
# Remove specific old images
docker image rm [image:tag]
# Remove all unused images
docker image prune -a --force
```
### Step 4: Verify Space Recovered
```bash
# Check current usage
df -h
# Verify critical services are running
docker ps
# Check container logs for errors
docker ps --format "{{.Names}}" | xargs -I {} sh -c 'echo "=== {} ===" && docker logs --tail 20 {}'
```
## Detailed Analysis Procedure
Once the immediate danger has passed, perform a thorough analysis:
### Step 1: Identify Space Consumers
```bash
# Comprehensive disk usage analysis
sudo du -h --max-depth=2 / 2>/dev/null | sort -rh | head -30
# For Synology NAS specifically
sudo du -h --max-depth=2 /volume1 2>/dev/null | sort -rh | head -30
# Check Docker volumes
docker volume ls
docker system df -v
# Check specific large directories
du -sh /var/lib/docker/* | sort -rh
du -sh /volume1/docker/* | sort -rh # Synology
```
### Step 2: Analyze by Service
Create a space usage report:
```bash
# Create analysis script
cat > /tmp/analyze-space.sh << 'EOF'
#!/bin/bash
echo "=== Docker Container Filesystems ==="
docker ps --format "{{.Names}}" | while read -r container; do
  size=$(docker exec "$container" du -sh / 2>/dev/null | awk '{print $1}')
  echo "$container: $size"
done | sort -rh -k 2

echo ""
echo "=== Docker Volumes ==="
docker volume ls --format "{{.Name}}" | while read -r vol; do
  size=$(docker volume inspect "$vol" --format '{{.Mountpoint}}' | xargs sudo du -sh 2>/dev/null | awk '{print $1}')
  echo "$vol: $size"
done | sort -rh -k 2

echo ""
echo "=== Log Files Over 100MB ==="
find /var/log -type f -size +100M -exec ls -lh {} \; 2>/dev/null
EOF
chmod +x /tmp/analyze-space.sh
/tmp/analyze-space.sh
```
### Step 3: Categorize Findings
Identify the primary space consumers:
| Category | Typical Culprits | Safe to Delete? |
|----------|------------------|-----------------|
| **Docker Images** | Old/unused image versions | ✅ Yes (if unused) |
| **Docker Volumes** | Database growth, media cache | ⚠️ Maybe (check first) |
| **Log Files** | Application logs, system logs | ✅ Yes (after review) |
| **Media Files** | Plex, Jellyfin transcodes | ✅ Yes (transcodes) |
| **Backups** | Old backup archives | ✅ Yes (keep recent) |
| **Application Data** | Various service data | ❌ No (review first) |
## Cleanup Strategies by Service Type
### Media Services (Plex, Jellyfin)
```bash
# Clear Plex transcode cache (the glob needs a shell inside the container)
docker exec plex sh -c 'rm -rf /transcode/*'
# Clear Jellyfin transcode cache
docker exec jellyfin sh -c 'rm -rf /config/data/transcodes/*'
# Find and remove media previews older than 30 days
find /volume1/plex/Library/Application\ Support/Plex\ Media\ Server/Cache -type f -mtime +30 -delete
```
### *arr Suite (Sonarr, Radarr, etc.)
```bash
# Clear old automatic backups (Sonarr/Radarr keep periodic backup archives here)
docker exec sonarr find /config/Backups -type f -mtime +30 -delete
docker exec radarr find /config/Backups -type f -mtime +30 -delete
# Clean up old logs
docker exec sonarr find /config/logs -type f -mtime +30 -delete
docker exec radarr find /config/logs -type f -mtime +30 -delete
```
### Database Services (PostgreSQL, MariaDB)
```bash
# Check database size
docker exec postgres psql -U user -c "SELECT pg_database.datname, pg_size_pretty(pg_database_size(pg_database.datname)) FROM pg_database;"
# Vacuum databases (for PostgreSQL)
docker exec postgres vacuumdb -U user --all --full --analyze
# Check MariaDB size
docker exec mariadb mysql -u root -p -e "SELECT table_schema AS 'Database', ROUND(SUM(data_length + index_length) / 1024 / 1024, 2) AS 'Size (MB)' FROM information_schema.TABLES GROUP BY table_schema;"
```
### Monitoring Services (Prometheus, Grafana)
```bash
# Check Prometheus storage size
du -sh /volume1/docker/prometheus
# Prometheus retention is set via command-line flags, not prometheus.yml:
#   --storage.tsdb.retention.time=15d (the default)
# Consider reducing retention if space is critical
# Clear old Grafana sessions
docker exec grafana find /var/lib/grafana/sessions -mtime +7 -delete
```
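A rough capacity rule for Prometheus: disk usage ≈ ingestion rate × retention × bytes per sample (on-disk samples typically compress to 1-2 bytes, per the Prometheus storage documentation). A back-of-envelope sketch, where 10,000 samples/s is a hypothetical rate; measure yours with the PromQL query `rate(prometheus_tsdb_head_samples_appended_total[1h])`:

```shell
#!/bin/sh
# Rough Prometheus TSDB sizing: samples/sec x retention seconds x bytes/sample.
samples_per_sec=10000     # hypothetical ingestion rate -- measure your own
retention_days=15
bytes_per_sample=2        # conservative; compression often achieves less
bytes=$((samples_per_sec * retention_days * 86400 * bytes_per_sample))
echo "estimated TSDB size: $((bytes / 1024 / 1024 / 1024)) GiB"
```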
### Immich (Photo Management)
```bash
# Check Immich storage usage
docker exec immich-server df -h /usr/src/app/upload
# Immich uses a lot of space for:
# - Original photos
# - Thumbnails
# - Encoded videos
# - ML models
# Clean up stale partial uploads from the staging directory (not the library itself)
docker exec immich-server find /usr/src/app/upload/upload -type f -mtime +90 -delete
```
## Long-Term Solutions
### Solution 1: Configure Log Rotation
Create proper log rotation for Docker containers:
```bash
# Edit Docker daemon config
sudo nano /etc/docker/daemon.json
```

Add log rotation settings:

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
```

Then restart Docker (note: these settings only apply to containers created after the change):

```bash
sudo systemctl restart docker # Linux
# OR for Synology
sudo synoservicectl --restart pkgctl-Docker
```
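A malformed `daemon.json` prevents the Docker daemon from starting, so validating the JSON before the restart is cheap insurance. A sketch using Python's stdlib `json.tool` as the validator, assuming `python3` is on the host; a temp file stands in for `/etc/docker/daemon.json`:

```shell
#!/bin/sh
# Validate daemon.json before restarting Docker; a syntax error here would
# stop the daemon from coming back up. The temp file is a stand-in for
# /etc/docker/daemon.json.
cfg=$(mktemp)
cat > "$cfg" << 'EOF'
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
EOF
if python3 -m json.tool "$cfg" > /dev/null 2>&1; then
  verdict="OK"
else
  verdict="INVALID"
fi
echo "daemon.json $verdict"
rm -f "$cfg"
```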
### Solution 2: Set Up Automated Cleanup
Create a cleanup cron job:
```bash
# Create cleanup script
sudo tee /usr/local/bin/homelab-cleanup.sh > /dev/null << 'EOF'
#!/bin/bash
# Homelab Automated Cleanup Script

# Remove stopped containers older than 7 days
docker container prune --filter "until=168h" --force
# Remove unused images older than 30 days
docker image prune --all --filter "until=720h" --force
# Remove unused volumes (BE CAREFUL - only if you're sure)
# docker volume prune --force
# Clear journal logs older than 7 days
journalctl --vacuum-time=7d
# Clear old backups (keep last 30 days)
find /volume1/backups -type f -mtime +30 -delete
echo "Cleanup completed at $(date)" >> /var/log/homelab-cleanup.log
EOF

# Make executable
sudo chmod +x /usr/local/bin/homelab-cleanup.sh
# Add to cron (runs weekly on Sunday at 3 AM)
(crontab -l 2>/dev/null; echo "0 3 * * 0 /usr/local/bin/homelab-cleanup.sh") | crontab -
```
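Re-running the `(crontab -l; echo …) | crontab -` line installs the job a second time; a grep guard keeps it idempotent. A sketch of the pattern, operating on a plain string standing in for the live `crontab -l` output:

```shell
#!/bin/sh
# Append a job line to a crontab-style listing only if it is not already
# present, so re-running setup does not duplicate the entry.
job='0 3 * * 0 /usr/local/bin/homelab-cleanup.sh'
current='0 0 * * * /usr/local/bin/backup.sh'   # stand-in for `crontab -l`

add_once() {
  if printf '%s\n' "$1" | grep -qxF "$2"; then
    printf '%s\n' "$1"              # already installed: unchanged
  else
    printf '%s\n%s\n' "$1" "$2"     # append the new job
  fi
}

updated=$(add_once "$current" "$job")
twice=$(add_once "$updated" "$job")
[ "$updated" = "$twice" ] && echo "idempotent: job installed once"
```

Against the live crontab the same guard looks like `crontab -l 2>/dev/null | grep -qxF "$job" || (crontab -l 2>/dev/null; echo "$job") | crontab -`.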
### Solution 3: Configure Service-Specific Retention
Update each service with appropriate retention policies:
**Prometheus** (retention is set via `--storage.tsdb` command-line flags, e.g. in the container's `command:` in `docker-compose.yml`):
```yaml
command:
  - '--storage.tsdb.retention.time=7d'    # reduce from the 15d default if needed
  - '--storage.tsdb.retention.size=50GB'  # hard size cap for the TSDB
```
**Grafana** (docker-compose.yml):
```yaml
environment:
  - GF_DATABASE_WAL=true
  - GF_DATABASE_CLEANUP_INTERVAL=168h # Weekly cleanup
```
**Plex** (Plex settings):
- Settings → Transcoder → Transcoder temporary directory
- Settings → Scheduled Tasks → Clean Bundles (daily)
- Settings → Scheduled Tasks → Optimize Database (weekly)
### Solution 4: Monitor Disk Usage Proactively
Set up disk-space alerting rules in Prometheus (surfaced in Grafana):
```yaml
# Alert rules for disk space
- alert: DiskSpaceWarning
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100) < 20
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Disk space warning on {{ $labels.instance }}"
    description: "Disk {{ $labels.mountpoint }} has less than 20% free space"

- alert: DiskSpaceCritical
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100) < 10
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "CRITICAL: Disk space on {{ $labels.instance }}"
    description: "Disk {{ $labels.mountpoint }} has less than 10% free space"
```
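The alert expression is plain arithmetic, avail/size × 100, so the thresholds can be sanity-checked by hand before deploying the rules. A sketch with hypothetical byte counts:

```shell
#!/bin/sh
# Mirror the alert expression (node_filesystem_avail_bytes /
# node_filesystem_size_bytes * 100) on hypothetical numbers: below 20%
# free fires the warning, below 10% the critical alert.
avail=9663676416      # 9 GiB available (hypothetical)
size=107374182400     # 100 GiB filesystem (hypothetical)
pct=$(awk -v a="$avail" -v s="$size" 'BEGIN { printf "%.1f", a / s * 100 }')
echo "free: ${pct}%"  # below 10, so this would fire the critical alert
```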
## Host-Specific Considerations
### Atlantis (Synology DS1823xs+)
```bash
# Synology-specific cleanup
# Clear Synology logs
sudo find /var/log -name "*.log.*" -mtime +30 -delete
# Clear package logs
sudo find /var/packages/*/target/logs -name "*.log.*" -mtime +30 -delete
# Check storage pool status
sudo synostgpool --info
# DSM has built-in storage analyzer
# Control Panel → Storage Manager → Storage Analyzer
```
### Calypso (Synology DS723+)
Same as Atlantis - use Synology-specific commands.
### Concord NUC (Ubuntu)
```bash
# Ubuntu-specific cleanup
sudo apt-get clean
sudo apt-get autoclean
sudo apt-get autoremove --purge
# Old kernels are normally removed by `apt-get autoremove --purge` above;
# verify the running kernel remains installed:
uname -r
dpkg -l 'linux-image-*' | awk '/^ii/{print $2}'
# Clear thumbnail cache
rm -rf ~/.cache/thumbnails/*
```
### Homelab VM (Proxmox VM)
```bash
# VM-specific cleanup
# Clear apt cache
sudo apt-get clean
# Clear old cloud-init logs
sudo rm -rf /var/log/cloud-init*.log
# Compact QCOW2 disk (from Proxmox host)
# qemu-img convert -O qcow2 -c original.qcow2 compressed.qcow2
```
## Verification Checklist
After cleanup, verify:
- [ ] Disk usage below 80%: `df -h`
- [ ] All critical containers running: `docker ps`
- [ ] No errors in recent logs: `docker logs [container] --tail 50`
- [ ] Services accessible via web interface
- [ ] Monitoring dashboards show normal metrics
- [ ] Backup jobs can complete successfully
- [ ] Automated cleanup configured for future
## Rollback Procedure
If cleanup causes issues:
1. **Check what was deleted**: Review command history and logs
2. **Restore from backups**: If critical data was deleted
```bash
cd ~/Documents/repos/homelab
./restore.sh [backup-date]
```
3. **Recreate Docker volumes**: If volumes were accidentally pruned
4. **Restart affected services**: Redeploy from Portainer
## Troubleshooting
### Issue: Still Running Out of Space After Cleanup
**Solution**: Consider adding more storage
- Add external USB drives
- Expand existing RAID arrays
- Move services to hosts with more space
- Archive old media to cold storage
### Issue: Docker Prune Removed Important Data
**Solution**:
- Always use `--filter` to be selective
- Never use `docker volume prune` without checking first
- Keep recent backups before major cleanup operations
### Issue: Services Won't Start After Cleanup
**Solution**:
```bash
# Check for missing volumes
docker ps -a
docker volume ls
# Check logs
docker logs [container]
# Recreate volumes if needed (restore from backup)
./restore.sh [backup-date]
```
## Prevention Checklist
- [ ] Log rotation configured for all services
- [ ] Automated cleanup script running weekly
- [ ] Monitoring alerts set up for disk space
- [ ] Retention policies configured appropriately
- [ ] Regular backup verification scheduled
- [ ] Capacity planning review quarterly
## Related Documentation
- [Infrastructure Overview](../infrastructure/INFRASTRUCTURE_OVERVIEW.md)
- [Backup Strategies](../admin/backup-strategies.md)
- [Monitoring Setup](../admin/monitoring-setup.md)
- [Troubleshooting Guide](../troubleshooting/common-issues.md)
## Change Log
- 2026-02-14 - Initial creation with host-specific procedures
- 2026-02-14 - Added service-specific cleanup strategies