# Common Issues & Troubleshooting

## Overview

This guide covers the most frequently encountered issues in the homelab environment and provides step-by-step solutions. Issues are organized by category with diagnostic steps and resolution procedures.

## Container & Docker Issues

### Container Won't Start

#### Symptoms

- Container exits immediately after starting
- "Container exited with code 1" errors
- Service unavailable after deployment

#### Diagnostic Steps

```bash
# Check container status
docker ps -a

# View container logs
docker logs container-name

# Inspect container configuration
docker inspect container-name

# Check resource usage
docker stats
```

#### Common Causes & Solutions

**Port Conflicts**

```bash
# Check port usage
netstat -tulpn | grep :8080
ss -tulpn | grep :8080

# Solution: change the host port in docker-compose.yml
# ports:
#   - "8081:8080"  # Use different external port
```

**Permission Issues**

```bash
# Check file ownership
ls -la /mnt/storage/service-name

# Fix ownership
sudo chown -R 1000:1000 /mnt/storage/service-name

# Set proper permissions
sudo chmod -R 755 /mnt/storage/service-name
```

**Missing Environment Variables**

```bash
# Check environment variables
docker exec container-name env

# Add missing variables to .env file
echo "MISSING_VAR=value" >> .env

# Recreate container
docker-compose up -d --force-recreate
```

### Container Memory Issues

#### Symptoms

- Container killed by OOM (Out of Memory)
- Slow performance or timeouts
- System becomes unresponsive

#### Diagnostic Steps

```bash
# Check memory usage
free -h
docker stats

# Check system logs for OOM kills
dmesg | grep -i "killed process"
journalctl -u docker.service | grep -i oom
```

#### Solutions

```bash
# Add memory limits to docker-compose.yml
# services:
#   service-name:
#     deploy:
#       resources:
#         limits:
#           memory: 2G
#         reservations:
#           memory: 1G

# Increase system swap
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
```

### Docker Daemon Issues

#### Symptoms

- "Cannot connect to Docker daemon" errors
- Docker commands hang or time out
- Services become unresponsive

#### Diagnostic Steps

```bash
# Check Docker daemon status
systemctl status docker

# Check Docker daemon logs
journalctl -u docker.service -f

# Test Docker connectivity
docker version
docker info
```

#### Solutions

```bash
# Restart Docker daemon
sudo systemctl restart docker

# Clean up Docker system (removes all unused images, stopped containers, and unused networks)
docker system prune -a

# Reset Docker daemon (last resort)
# WARNING: this deletes ALL images, containers, and volumes stored under /var/lib/docker
sudo systemctl stop docker
sudo rm -rf /var/lib/docker
sudo systemctl start docker
```

## Network & Connectivity Issues

### Service Not Accessible

#### Symptoms

- Connection refused errors
- Timeouts when accessing services
- Services work internally but not externally

#### Diagnostic Steps

```bash
# Test local connectivity
curl -I http://localhost:8080

# Test network connectivity
curl -I http://server-ip:8080

# Check firewall rules
sudo ufw status
sudo iptables -L

# Check port binding
netstat -tulpn | grep :8080
```

#### Solutions

```bash
# Open firewall ports
sudo ufw allow 8080/tcp

# Check Docker port binding
# Ensure ports are properly exposed in docker-compose.yml
# ports:
#   - "0.0.0.0:8080:8080"  # Bind to all interfaces

# Restart networking
sudo systemctl restart networking
```

### DNS Resolution Issues

#### Symptoms

- Cannot resolve service hostnames
- "Name or service not known" errors
- Services can't communicate with each other

#### Diagnostic Steps

```bash
# Test DNS resolution
nslookup service.local
dig service.local

# Check DNS configuration
cat /etc/resolv.conf

# Test container DNS
docker exec container-name nslookup google.com
```

#### Solutions

```bash
# Update DNS servers
echo "nameserver 8.8.8.8" | sudo tee -a /etc/resolv.conf

# Restart systemd-resolved
sudo systemctl restart systemd-resolved

# Configure Docker DNS
# Add to /etc/docker/daemon.json
# {
#   "dns": ["8.8.8.8", "8.8.4.4"]
# }
sudo systemctl restart docker
```

### Reverse Proxy Issues

#### Symptoms

- 502 Bad Gateway errors
- SSL certificate errors
- Services accessible directly but not through proxy

#### Diagnostic Steps

```bash
# Check proxy container logs
docker logs nginx-proxy-manager

# Test backend connectivity
curl -I http://backend-service:8080

# Check proxy configuration
docker exec nginx-proxy-manager cat /etc/nginx/nginx.conf
```

#### Solutions

```bash
# Verify backend service is running
docker ps | grep backend-service

# Check network connectivity between proxy and backend
docker exec nginx-proxy-manager ping backend-service

# Regenerate SSL certificates
# Through Nginx Proxy Manager UI or:
certbot renew --force-renewal
```

## Storage & File System Issues

### Disk Space Issues

#### Symptoms

- "No space left on device" errors
- Services failing to write data
- System performance degradation

#### Diagnostic Steps

```bash
# Check disk usage
df -h
sudo du -sh /* 2>/dev/null

# Check Docker space usage
docker system df

# Find large files
find / -type f -size +1G 2>/dev/null
```

#### Solutions

```bash
# Clean Docker system
docker system prune -a
# WARNING: deletes the data in every volume not used by a container
docker volume prune

# Clean log files
sudo journalctl --vacuum-time=7d
sudo find /var/log -name "*.log" -type f -mtime +30 -delete

# Move data to larger partition (stop Docker first so files are not in use)
sudo systemctl stop docker
sudo mv /var/lib/docker /mnt/storage/docker
sudo ln -s /mnt/storage/docker /var/lib/docker
sudo systemctl start docker
```

### Permission Issues

#### Symptoms

- "Permission denied" errors
- Services can't read/write files
- Configuration files not loading

#### Diagnostic Steps

```bash
# Check file permissions
ls -la /mnt/storage/service-name

# Check user/group IDs
id username
docker exec container-name id

# Check mount points
mount | grep storage
```

#### Solutions

```bash
# Fix ownership recursively
sudo chown -R 1000:1000 /mnt/storage/service-name

# Set proper permissions
sudo chmod -R 755 /mnt/storage/service-name

# Add user to docker group (log out and back in for it to take effect)
sudo usermod -aG docker $USER

# Set PUID/PGID in docker-compose.yml
# environment:
#   - PUID=1000
#   - PGID=1000
```

### RAID Array Issues

#### Symptoms

- Degraded RAID arrays
- Disk failure notifications
- Slow storage performance

#### Diagnostic Steps

```bash
# Check RAID status
cat /proc/mdstat
sudo mdadm --detail /dev/md0

# Check disk health
sudo smartctl -a /dev/sda

# Check system logs
dmesg | grep -i raid
journalctl | grep -i mdadm
```

#### Solutions

```bash
# Replace failed disk: mark it as failed, then remove it from the array
sudo mdadm --manage /dev/md0 --fail /dev/sdb
sudo mdadm --manage /dev/md0 --remove /dev/sdb
# Physically replace disk, then add the new one (this starts the rebuild)
sudo mdadm --manage /dev/md0 --add /dev/sdb

# Alternative: re-add a previously removed member that is still healthy
sudo mdadm --manage /dev/md0 --re-add /dev/sdb

# Monitor rebuild progress
watch cat /proc/mdstat
```

## Service-Specific Issues

### Database Connection Issues

#### Symptoms

- "Connection refused" to database
- "Too many connections" errors
- Database corruption warnings

#### Diagnostic Steps

```bash
# Check database container status
docker logs postgres-container

# Test database connectivity
docker exec postgres-container psql -U user -d database -c "SELECT 1;"

# Check connection limits
docker exec postgres-container psql -U user -c "SHOW max_connections;"
```

#### Solutions

```bash
# Restart database container
docker-compose restart postgres

# Increase connection limits in postgresql.conf (requires a database restart)
# max_connections = 200

# Clean up idle connections (excluding the session running this query)
docker exec postgres-container psql -U user -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND pid <> pg_backend_pid();"
```

### Web Service Issues

#### Symptoms

- 500 Internal Server Error
- Slow response times
- Service timeouts

#### Diagnostic Steps

```bash
# Check service logs
docker logs web-service

# Test service health
curl -I http://localhost:8080/health

# Check resource usage
docker stats web-service
```

#### Solutions

```bash
# Restart service
docker-compose restart web-service

# Increase resource limits in docker-compose.yml
# deploy:
#   resources:
#     limits:
#       memory: 2G
#       cpus: '1.0'

# Check application configuration
docker exec web-service cat /config/app.conf
```

### Authentication Issues

#### Symptoms

- Login failures
- "Unauthorized" errors
- SSO integration problems

#### Diagnostic Steps

```bash
# Check authentication service logs
docker logs authentik-server

# Test authentication endpoint
curl -X POST http://auth.local/api/v3/auth/login

# Check user database
docker exec authentik-server ak list_users
```

#### Solutions

```bash
# Reset user password
docker exec authentik-server ak reset_password username

# Restart authentication service
docker-compose restart authentik

# Check LDAP connectivity (if applicable)
docker exec authentik-server ldapsearch -x -H ldap://server
```

## Monitoring & Alerting Issues

### Metrics Collection Issues

#### Symptoms

- Missing metrics in Grafana
- Prometheus targets down
- Exporters not responding

#### Diagnostic Steps

```bash
# Check Prometheus targets
curl http://prometheus:9090/api/v1/targets

# Test exporter endpoints
curl http://node-exporter:9100/metrics

# Check Prometheus configuration
docker exec prometheus cat /etc/prometheus/prometheus.yml
```

#### Solutions

```bash
# Restart monitoring stack
docker-compose -f monitoring.yml restart

# Reload Prometheus configuration
# (only works when Prometheus runs with --web.enable-lifecycle)
curl -X POST http://prometheus:9090/-/reload

# Check network connectivity
docker exec prometheus ping node-exporter
```

### Alert Manager Issues

#### Symptoms

- Alerts not firing
- Notifications not received
- Alert routing problems

#### Diagnostic Steps

```bash
# Check AlertManager status
curl http://alertmanager:9093/api/v2/status

# View active alerts
curl http://alertmanager:9093/api/v2/alerts

# Check routing configuration
docker exec alertmanager cat /etc/alertmanager/alertmanager.yml
```

#### Solutions

```bash
# Test notification channels by posting a test alert
curl -X POST http://alertmanager:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{"labels":{"alertname":"test"}}]'

# Restart AlertManager
docker-compose restart alertmanager

# Validate configuration
docker exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml
```

## Performance Issues

### High CPU Usage

#### Symptoms

- System sluggishness
- High load averages
- Services timing out

#### Diagnostic Steps

```bash
# Check system load
uptime
htop

# Check container CPU usage
docker stats

# Identify CPU-intensive processes
top -o %CPU
```

#### Solutions

```bash
# Limit container CPU usage in docker-compose.yml
# deploy:
#   resources:
#     limits:
#       cpus: '0.5'

# Optimize service configuration
# Reduce worker processes, adjust cache settings

# Scale services horizontally
docker-compose up -d --scale web-service=3
```

### High Memory Usage

#### Symptoms

- System swapping
- OOM kills
- Slow performance

#### Diagnostic Steps

```bash
# Check memory usage
free -h
cat /proc/meminfo

# Check container memory usage
docker stats

# List the processes using the most memory
ps aux --sort=-%mem | head
```

#### Solutions

```bash
# Add memory limits in docker-compose.yml
# deploy:
#   resources:
#     limits:
#       memory: 1G

# Increase system memory or swap
sudo fallocate -l 2G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Optimize application memory usage
# Adjust JVM heap size, database buffers, etc.
```

### Network Performance Issues

#### Symptoms

- Slow file transfers
- High network latency
- Connection timeouts

#### Diagnostic Steps

```bash
# Test network speed
iperf3 -c server-ip

# Check network interface statistics
ip -s link show

# Monitor network traffic
iftop
nethogs
```

#### Solutions

```bash
# Optimize network settings (tee is needed because the redirect must run as root)
echo 'net.core.rmem_max = 16777216' | sudo tee -a /etc/sysctl.conf
echo 'net.core.wmem_max = 16777216' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

# Check for network congestion
# Upgrade network infrastructure if needed

# Optimize Docker networking
# Use host networking for performance-critical services in docker-compose.yml
# network_mode: host
```

## Security Issues

### SSL Certificate Issues

#### Symptoms

- Certificate expired warnings
- SSL handshake failures
- Browser security warnings

#### Diagnostic Steps

```bash
# Check certificate expiration
openssl x509 -in cert.pem -text -noout | grep "Not After"

# Test SSL connectivity
openssl s_client -connect domain.com:443

# Check certificate chain (verbose output shows the TLS handshake and certificate details)
curl -vI https://domain.com
```

#### Solutions

```bash
# Renew Let's Encrypt certificates
certbot renew

# Generate new self-signed certificate
openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -days 365

# Update certificate in services
# Copy new certificates to appropriate volumes
```

### Authentication Failures

#### Symptoms

- Repeated login failures
- Account lockouts
- Suspicious access attempts

#### Diagnostic Steps

```bash
# Check authentication logs
journalctl -u ssh.service | grep "Failed password"
docker logs authentik-server | grep "login failed"

# Check fail2ban status
sudo fail2ban-client status
sudo fail2ban-client status sshd
```

#### Solutions

```bash
# Unban IP addresses
sudo fail2ban-client set sshd unbanip IP_ADDRESS

# Strengthen authentication
# Enable 2FA, use SSH keys, implement rate limiting

# Monitor for brute force attacks
# Set up alerting for repeated failures
```

## Emergency Procedures

### Complete System Recovery

#### When to Use

- Multiple service failures
- System corruption
- Hardware failures

#### Recovery Steps

```bash
# 1. Stop all services
docker stop $(docker ps -q)

# 2. Check system integrity (run only on an unmounted filesystem)
sudo fsck /dev/sda1

# 3. Restore from backup
./scripts/restore-system.sh

# 4. Restart critical services
./scripts/deploy-critical.sh

# 5. Verify system health
./scripts/health-check.sh
```

### Data Recovery

#### When to Use

- Data corruption
- Accidental deletion
- Storage failures

#### Recovery Steps

```bash
# 1. Stop affected services
docker-compose down

# 2. Mount backup storage
sudo mount /dev/backup /mnt/restore

# 3. Restore data
sudo rsync -av /mnt/restore/service-data/ /mnt/storage/service-data/

# 4. Fix permissions
sudo chown -R 1000:1000 /mnt/storage/service-data

# 5. Restart services
docker-compose up -d
```

### Network Recovery

#### When to Use

- Network connectivity loss
- DNS failures
- Routing issues

#### Recovery Steps

```bash
# 1. Check physical connectivity
ip link show

# 2. Restart networking
sudo systemctl restart networking

# 3. Reset network configuration
sudo netplan apply

# 4. Flush DNS cache
sudo systemctl restart systemd-resolved

# 5. Test connectivity
ping 8.8.8.8
```

## Prevention Strategies

### Monitoring & Alerting

- Set up comprehensive monitoring
- Configure proactive alerts
- Regular health checks
- Performance baselines

### Backup & Recovery

- Automated backup schedules
- Regular restore testing
- Offsite backup storage
- Documentation of procedures

### Maintenance

- Regular system updates
- Capacity planning
- Performance optimization
- Security hardening

### Documentation

- Incident response procedures
- Configuration documentation
- Change management processes
- Knowledge sharing

## Related Documentation

- **[Monitoring Setup](../admin/monitoring-setup.md)** - Monitoring configuration
- **[Security Guidelines](../security/README.md)** - Security best practices
- **[Backup Procedures](../admin/backup-procedures.md)** - Backup and recovery
- **[Emergency Contacts](../admin/README.md)** - Emergency procedures

---

*This troubleshooting guide provides comprehensive solutions for common issues encountered in the homelab environment. Keep this guide updated with new issues and solutions as they are discovered.*
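
The "Regular health checks" item under Prevention Strategies could be approximated with a small triage helper that bundles two of the diagnostics from this guide: disk usage and Docker daemon reachability. The following is a minimal sketch, not the contents of `./scripts/health-check.sh`; the function names (`disk_usage_percent`, `at_or_above`, `report`, `triage`) and the default 90% threshold are hypothetical choices made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical triage helper. Function names and the default threshold
# are illustrative assumptions, not part of the homelab scripts.

# Print the usage percentage (digits only) of the filesystem holding a path.
disk_usage_percent() {
    df -P "$1" | awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# Succeed (exit 0) when the first integer is at or above the second.
at_or_above() {
    [ "$1" -ge "$2" ]
}

# Print one check result as "<STATUS> <message>" with an aligned status column.
report() {
    printf '%-8s %s\n' "$1" "$2"
}

# Run the disk and Docker checks for a path and a usage limit (in percent).
triage() {
    local path="${1:-/}" limit="${2:-90}" usage
    usage="$(disk_usage_percent "$path")"
    if at_or_above "$usage" "$limit"; then
        report "WARN" "disk usage on $path is ${usage}% (limit ${limit}%)"
    else
        report "OK" "disk usage on $path is ${usage}%"
    fi

    if ! command -v docker >/dev/null 2>&1; then
        report "SKIP" "docker CLI not found"
    elif docker info >/dev/null 2>&1; then
        report "OK" "Docker daemon is reachable"
        # List containers that have exited, with their last status
        docker ps -a --filter "status=exited" --format '{{.Names}}: {{.Status}}'
    else
        report "FAIL" "cannot connect to the Docker daemon"
    fi
}
```

After sourcing the file, `triage /mnt/storage 85` would check the storage mount against an 85% limit. Keeping the comparison in `at_or_above` separate from the `df` parsing makes the threshold logic easy to exercise without touching the real system.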