Sanitized mirror from private repository - 2026-04-20 01:32:01 UTC
# Disk Full Procedure Runbook

## Overview

This runbook provides procedures for handling disk space emergencies across all homelab hosts. It includes immediate actions, root cause analysis, and long-term solutions to prevent recurrence.

## Prerequisites

- [ ] SSH access to the affected host
- [ ] Root/sudo privileges on the host
- [ ] Access to monitoring dashboards
- [ ] Backup verification capability

## Metadata

- **Estimated Time**: 30-90 minutes (depending on severity)
- **Risk Level**: High (data loss is possible if not handled carefully)
- **Requires Downtime**: Minimal (services may need to be stopped temporarily)
- **Reversible**: Partially (deleted data cannot be recovered)
- **Tested On**: 2026-02-14

## Severity Levels

| Level | Disk Usage | Action Required | Urgency |
|-------|------------|-----------------|---------|
| 🟢 **Normal** | < 80% | Monitor | Low |
| 🟡 **Warning** | 80-90% | Plan cleanup | Medium |
| 🟠 **Critical** | 90-95% | Immediate cleanup | High |
| 🔴 **Emergency** | > 95% | Emergency response | Critical |
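
The thresholds above can be applied mechanically. A small helper sketch (assumes GNU `df`; boundary values such as exactly 90% are assigned to the lower band):

```bash
# Map a disk-usage percentage to the severity levels in the table above.
severity() {
  local pct=$1
  if   [ "$pct" -gt 95 ]; then echo "EMERGENCY"
  elif [ "$pct" -gt 90 ]; then echo "CRITICAL"
  elif [ "$pct" -ge 80 ]; then echo "WARNING"
  else                         echo "NORMAL"
  fi
}

# Report severity for every real filesystem (skips tmpfs/devtmpfs).
df --output=target,pcent -x tmpfs -x devtmpfs | tail -n +2 |
while read -r mount pct; do
  echo "$mount: $pct $(severity "${pct%\%}")"
done
```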

## Quick Triage

First, determine which host and volume are affected:

```bash
# Check disk usage on all hosts
ssh atlantis "df -h"
ssh calypso "df -h"
ssh concordnuc "df -h"
ssh homelab-vm "df -h"
ssh raspberry-pi-5 "df -h"
```
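
The same per-host sweep can be wrapped in a loop. A sketch with the command runner injectable, so the loop can be exercised without live SSH access (hostnames as listed above):

```bash
# Hostnames as listed in the triage commands above.
HOSTS="atlantis calypso concordnuc homelab-vm raspberry-pi-5"

# Run "df -h" on every host. The runner defaults to ssh but is
# injectable, so the loop can be tested without reachable hosts.
check_all_hosts() {
  local runner=${1:-ssh}
  local h
  for h in $HOSTS; do
    echo "=== $h ==="
    "$runner" "$h" "df -h"
  done
}
```

Calling `check_all_hosts` with no arguments performs the same five `ssh <host> "df -h"` checks as above.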

## Emergency Procedure (>95% Full)

### Step 1: Immediate Space Recovery

**Goal**: Free up 5-10% of disk space immediately to prevent system issues.

```bash
# SSH to the affected host
ssh [hostname]

# Identify what's consuming space
df -h
du -sh /* 2>/dev/null | sort -rh | head -20

# Quick win - clear the Docker cache
docker system df                           # See what Docker is using
docker system prune -a --volumes --force   # Reclaim space (BE CAREFUL!)

# This typically frees 10-50GB depending on your setup
```

**⚠️ WARNING**: `docker system prune -a --volumes` will remove:

- Stopped containers
- Unused networks
- All images not used by a running container
- Build cache
- Unused volumes (with the `--volumes` flag)

**Safer alternative** if you're unsure:

```bash
# Less aggressive - removes only stopped containers and dangling images
docker system prune --force
```

### Step 2: Clear Log Files

```bash
# Find large log files
find /var/log -type f -size +100M -exec ls -lh {} \; | sort -k 5 -rh

# Clear the systemd journal (keeps the last 3 days)
sudo journalctl --vacuum-time=3d

# Truncate Docker container logs
sudo sh -c 'truncate -s 0 /var/lib/docker/containers/*/*-json.log'

# For Synology NAS
sudo find /volume1/@docker/containers -name "*-json.log" -size +100M -exec truncate -s 0 {} \;
```

### Step 3: Remove Old Docker Images

```bash
# List images by size
docker images --format "{{.Repository}}:{{.Tag}}\t{{.Size}}" | sort -k 2 -rh | head -20

# Remove specific old images
docker image rm [image:tag]

# Remove all unused images
docker image prune -a --force
```

### Step 4: Verify Space Recovered

```bash
# Check current usage
df -h

# Verify critical services are running
docker ps

# Check container logs for errors
docker ps --format "{{.Names}}" | xargs -I {} sh -c 'echo "=== {} ===" && docker logs --tail 20 {}'
```

## Detailed Analysis Procedure

Once the immediate danger has passed, perform a thorough analysis:

### Step 1: Identify Space Consumers

```bash
# Comprehensive disk usage analysis
sudo du -h --max-depth=2 / 2>/dev/null | sort -rh | head -30

# For Synology NAS specifically
sudo du -h --max-depth=2 /volume1 2>/dev/null | sort -rh | head -30

# Check Docker volumes
docker volume ls
docker system df -v

# Check specific large directories
du -sh /var/lib/docker/* | sort -rh
du -sh /volume1/docker/* | sort -rh  # Synology
```

### Step 2: Analyze by Service

Create a space usage report:

```bash
# Create analysis script
cat > /tmp/analyze-space.sh << 'EOF'
#!/bin/bash
echo "=== Docker Container Filesystems ==="
docker ps --format "{{.Names}}" | while read -r container; do
    size=$(docker exec "$container" du -sh / 2>/dev/null | awk '{print $1}')
    echo "$container: $size"
done | sort -rh -k2

echo ""
echo "=== Docker Volumes ==="
docker volume ls --format "{{.Name}}" | while read -r vol; do
    size=$(docker volume inspect "$vol" --format '{{.Mountpoint}}' | xargs sudo du -sh 2>/dev/null | awk '{print $1}')
    echo "$vol: $size"
done | sort -rh -k2

echo ""
echo "=== Log Files Over 100MB ==="
find /var/log -type f -size +100M -exec ls -lh {} \; 2>/dev/null
EOF

chmod +x /tmp/analyze-space.sh
/tmp/analyze-space.sh
```

### Step 3: Categorize Findings

Identify the primary space consumers:

| Category | Typical Culprits | Safe to Delete? |
|----------|------------------|-----------------|
| **Docker Images** | Old/unused image versions | ✅ Yes (if unused) |
| **Docker Volumes** | Database growth, media cache | ⚠️ Maybe (check first) |
| **Log Files** | Application logs, system logs | ✅ Yes (after review) |
| **Media Files** | Plex, Jellyfin transcodes | ✅ Yes (transcodes) |
| **Backups** | Old backup archives | ✅ Yes (keep recent) |
| **Application Data** | Various service data | ❌ No (review first) |

## Cleanup Strategies by Service Type

### Media Services (Plex, Jellyfin)

```bash
# Clear Plex transcode cache
docker exec plex rm -rf /transcode/*

# Clear Jellyfin transcode cache
docker exec jellyfin rm -rf /config/data/transcodes/*

# Find and remove old media previews
find /volume1/plex/Library/Application\ Support/Plex\ Media\ Server/Cache -type f -mtime +30 -delete
```

### *arr Suite (Sonarr, Radarr, etc.)

```bash
# Remove old automatic backups
docker exec sonarr find /config/Backups -type f -mtime +30 -delete
docker exec radarr find /config/Backups -type f -mtime +30 -delete

# Clean up old logs
docker exec sonarr find /config/logs -type f -mtime +30 -delete
docker exec radarr find /config/logs -type f -mtime +30 -delete
```

### Database Services (PostgreSQL, MariaDB)

```bash
# Check database sizes
docker exec postgres psql -U user -c "SELECT pg_database.datname, pg_size_pretty(pg_database_size(pg_database.datname)) FROM pg_database;"

# Vacuum databases (PostgreSQL; FULL locks tables while it runs)
docker exec postgres vacuumdb -U user --all --full --analyze

# Check MariaDB size (-it so the -p password prompt works)
docker exec -it mariadb mysql -u root -p -e "SELECT table_schema AS 'Database', ROUND(SUM(data_length + index_length) / 1024 / 1024, 2) AS 'Size (MB)' FROM information_schema.TABLES GROUP BY table_schema;"
```

### Monitoring Services (Prometheus, Grafana)

```bash
# Check Prometheus storage size
du -sh /volume1/docker/prometheus

# Prometheus retention is set via startup flags, not prometheus.yml
# Default: --storage.tsdb.retention.time=15d
# Consider reducing retention if space is critical

# Clear old Grafana sessions
docker exec grafana find /var/lib/grafana/sessions -type f -mtime +7 -delete
```

### Immich (Photo Management)

```bash
# Check Immich storage usage
docker exec immich-server df -h /usr/src/app/upload

# Immich uses a lot of space for:
# - Original photos
# - Thumbnails
# - Encoded videos
# - ML models

# Remove stale partial uploads from the staging directory
# (upload/upload holds in-progress uploads, NOT the photo library -
# verify the path on your install before deleting anything)
docker exec immich-server find /usr/src/app/upload/upload -type f -mtime +90 -delete
```

## Long-Term Solutions

### Solution 1: Configure Log Rotation

Create proper log rotation for Docker containers:

```bash
# Write log rotation settings to the Docker daemon config
# (merge by hand if the file already contains other options)
sudo tee /etc/docker/daemon.json << 'EOF'
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
EOF

# Note: the settings apply only to containers created after the restart

# Restart Docker
sudo systemctl restart docker  # Linux
# OR for Synology (DSM 6; on DSM 7, restart the Container Manager package)
sudo synoservicectl --restart pkgctl-Docker
```

### Solution 2: Set Up Automated Cleanup

Create a cleanup script and schedule it with cron:

```bash
# Create cleanup script
sudo tee /usr/local/bin/homelab-cleanup.sh << 'EOF'
#!/bin/bash
# Homelab Automated Cleanup Script

# Remove stopped containers older than 7 days
docker container prune --filter "until=168h" --force

# Remove unused images older than 30 days
docker image prune --all --filter "until=720h" --force

# Remove unused volumes (BE CAREFUL - only if you're sure)
# docker volume prune --force

# Clear journal logs older than 7 days
journalctl --vacuum-time=7d

# Clear old backups (keep last 30 days)
find /volume1/backups -type f -mtime +30 -delete

echo "Cleanup completed at $(date)" >> /var/log/homelab-cleanup.log
EOF

# Make executable
sudo chmod +x /usr/local/bin/homelab-cleanup.sh

# Add to cron (runs weekly on Sunday at 3 AM)
(crontab -l 2>/dev/null; echo "0 3 * * 0 /usr/local/bin/homelab-cleanup.sh") | crontab -
```

### Solution 3: Configure Service-Specific Retention

Update each service with appropriate retention policies:

**Prometheus** (startup flags, e.g. in `docker-compose.yml`; retention is not configured in `prometheus.yml`):
```yaml
command:
  - --storage.tsdb.retention.time=7d    # Reduce from the 15d default if needed
  - --storage.tsdb.retention.size=50GB  # Cap total TSDB size
```

**Grafana** (docker-compose.yml):
```yaml
environment:
  - GF_DATABASE_WAL=true
  - GF_DATABASE_CLEANUP_INTERVAL=168h  # Weekly cleanup
```

**Plex** (Plex settings):
- Settings → Transcoder → Transcoder temporary directory
- Settings → Scheduled Tasks → Clean Bundles (daily)
- Settings → Scheduled Tasks → Optimize Database (weekly)

### Solution 4: Monitor Disk Usage Proactively

Set up Prometheus alert rules for disk space (surfaced in the Grafana dashboards):

```yaml
# Alert rules for disk space
- alert: DiskSpaceWarning
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100) < 20
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Disk space warning on {{ $labels.instance }}"
    description: "Disk {{ $labels.mountpoint }} has less than 20% free space"

- alert: DiskSpaceCritical
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100) < 10
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "CRITICAL: Disk space on {{ $labels.instance }}"
    description: "Disk {{ $labels.mountpoint }} has less than 10% free space"
```

## Host-Specific Considerations

### Atlantis (Synology DS1823xs+)

```bash
# Synology-specific cleanup
# Clear rotated Synology logs
sudo find /var/log -name "*.log.*" -mtime +30 -delete

# Clear package logs
sudo find /var/packages/*/target/logs -name "*.log.*" -mtime +30 -delete

# Check storage pool status
sudo synostgpool --info

# DSM has a built-in storage analyzer
# Control Panel → Storage Manager → Storage Analyzer
```

### Calypso (Synology DS723+)

Same as Atlantis - use the Synology-specific commands above.

### Concord NUC (Ubuntu)

```bash
# Ubuntu-specific cleanup
sudo apt-get clean
sudo apt-get autoclean
sudo apt-get autoremove --purge

# Clear old kernels (keep current + 1 previous)
sudo apt-get autoremove --purge $(dpkg -l 'linux-*' | sed '/^ii/!d;/'"$(uname -r | sed "s/\(.*\)-\([^0-9]\+\)/\1/")"'/d;s/^[^ ]* [^ ]* \([^ ]*\).*/\1/;/[0-9]/!d')

# Clear thumbnail cache
rm -rf ~/.cache/thumbnails/*
```

### Homelab VM (Proxmox VM)

```bash
# VM-specific cleanup
# Clear apt cache
sudo apt-get clean

# Clear old cloud-init logs
sudo rm -rf /var/log/cloud-init*.log

# Compact the qcow2 disk (run from the Proxmox host, with the VM stopped)
# qemu-img convert -O qcow2 -c original.qcow2 compressed.qcow2
```

## Verification Checklist

After cleanup, verify:

- [ ] Disk usage below 80%: `df -h`
- [ ] All critical containers running: `docker ps`
- [ ] No errors in recent logs: `docker logs [container] --tail 50`
- [ ] Services accessible via web interface
- [ ] Monitoring dashboards show normal metrics
- [ ] Backup jobs can complete successfully
- [ ] Automated cleanup configured for the future
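
The first check can be scripted for routine review. A minimal sketch, assuming GNU `df` and the 80% threshold from this checklist:

```bash
# List mountpoints at or above a usage threshold (GNU df).
over_threshold() {
  local limit=$1
  df --output=target,pcent -x tmpfs -x devtmpfs | tail -n +2 |
  awk -v limit="$limit" '{ gsub(/%/, "", $2); if ($2 + 0 >= limit) print $1 }'
}

# Fail loudly if anything is at 80% or more.
if [ -n "$(over_threshold 80)" ]; then
  echo "FAIL: filesystems at or above 80%:"
  over_threshold 80
else
  echo "OK: all filesystems below 80%"
fi
```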

## Rollback Procedure

If cleanup causes issues:

1. **Check what was deleted**: Review command history and logs
2. **Restore from backups**: If critical data was deleted
   ```bash
   cd ~/Documents/repos/homelab
   ./restore.sh [backup-date]
   ```
3. **Recreate Docker volumes**: If volumes were accidentally pruned
4. **Restart affected services**: Redeploy from Portainer

## Troubleshooting

### Issue: Still Running Out of Space After Cleanup

**Solution**: Consider adding more storage:

- Add external USB drives
- Expand existing RAID arrays
- Move services to hosts with more space
- Archive old media to cold storage

### Issue: Docker Prune Removed Important Data

**Solution**:

- Always use `--filter` to be selective
- Never use `docker volume prune` without checking first
- Keep recent backups before major cleanup operations

### Issue: Services Won't Start After Cleanup

**Solution**:

```bash
# Check for missing volumes
docker ps -a
docker volume ls

# Check logs
docker logs [container]

# Recreate volumes if needed (restore from backup)
./restore.sh [backup-date]
```

## Prevention Checklist

- [ ] Log rotation configured for all services
- [ ] Automated cleanup script running weekly
- [ ] Monitoring alerts set up for disk space
- [ ] Retention policies configured appropriately
- [ ] Regular backup verification scheduled
- [ ] Capacity planning review quarterly

## Related Documentation

- [Infrastructure Overview](../infrastructure/INFRASTRUCTURE_OVERVIEW.md)
- [Backup Strategies](../admin/backup-strategies.md)
- [Monitoring Setup](../admin/monitoring-setup.md)
- [Troubleshooting Guide](../troubleshooting/common-issues.md)

## Change Log

- 2026-02-14 - Initial creation with host-specific procedures
- 2026-02-14 - Added service-specific cleanup strategies