# Common Issues & Troubleshooting

## Overview

This guide covers the most frequently encountered issues in the homelab environment and provides step-by-step solutions. Issues are organized by category with diagnostic steps and resolution procedures.

## Container & Docker Issues

### Container Won't Start

#### Symptoms
- Container exits immediately after starting
- "Container exited with code 1" errors
- Service unavailable after deployment

#### Diagnostic Steps
```bash
# Check container status
docker ps -a

# View container logs
docker logs container-name

# Inspect container configuration
docker inspect container-name

# Check resource usage
docker stats
```

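The exit code reported by `docker ps -a` often narrows the cause before you read any logs. The sketch below maps the conventional codes to a short hint. The function name `explain_exit_code` is made up for this guide, and the hints are general shell and Docker conventions rather than guarantees about any particular image.

```bash
# Hypothetical helper: translate a container exit code into a likely cause.
explain_exit_code() {
  local code="$1"
  case "$code" in
    0)   echo "exited normally" ;;
    1)   echo "application error; check the container logs" ;;
    125) echo "docker run itself failed (bad flag or daemon error)" ;;
    126) echo "container command found but not executable" ;;
    127) echo "container command not found" ;;
    137) echo "killed with SIGKILL (often the OOM killer)" ;;
    143) echo "terminated with SIGTERM" ;;
    *)
      # Codes above 128 conventionally mean "terminated by signal (code - 128)".
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "terminated by signal $((code - 128))"
      else
        echo "application-specific exit code"
      fi
      ;;
  esac
}

# Usage (requires Docker; container-name is a placeholder):
#   explain_exit_code "$(docker inspect --format '{{.State.ExitCode}}' container-name)"
```

A result of 137 is a good reason to jump to the Container Memory Issues section.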
#### Common Causes & Solutions

**Port Conflicts**
```bash
# Check port usage
netstat -tulpn | grep :8080
ss -tulpn | grep :8080

# Solution: Change port in docker-compose.yml
ports:
  - "8081:8080"  # Use different external port
```

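When the preferred port is taken, picking the replacement by hand gets tedious. As a rough sketch, the helpers below read `ss -tln` output on stdin and walk upward from a starting port until they find one that is not listening. The names `port_in_use` and `next_free_port` are illustrative, and the parsing assumes the default column layout of `ss -tln` (local address in the fourth column).

```bash
# Hypothetical helper: succeed if the given TCP port appears in `ss -tln` output on stdin.
port_in_use() {
  local port="$1"
  # Column 4 is "Local Address:Port"; the port is whatever follows the last colon.
  awk -v p="$port" 'NR > 1 { n = split($4, a, ":"); if (a[n] == p) found = 1 }
                    END { exit found ? 0 : 1 }'
}

# Hypothetical helper: print the first port at or above $1 that is not listening.
next_free_port() {
  local port="$1" listing
  listing="$(cat)"
  while printf '%s\n' "$listing" | port_in_use "$port"; do
    port=$((port + 1))
  done
  echo "$port"
}

# Usage: ss -tln | next_free_port 8080
```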
**Permission Issues**
```bash
# Check file ownership
ls -la /mnt/storage/service-name

# Fix ownership (1000:1000 is an example; use the UID:GID the container runs as)
sudo chown -R 1000:1000 /mnt/storage/service-name

# Set proper permissions
sudo chmod -R 755 /mnt/storage/service-name
```

**Missing Environment Variables**
```bash
# Check environment variables
# docker inspect works even when the container has already exited
docker inspect --format '{{json .Config.Env}}' container-name
# docker exec only works while the container is running
docker exec container-name env

# Add missing variables to .env file
echo "MISSING_VAR=value" >> .env

# Recreate container
docker-compose up -d --force-recreate
```

### Container Memory Issues

#### Symptoms
- Container killed by OOM (Out of Memory)
- Slow performance or timeouts
- System becomes unresponsive

#### Diagnostic Steps
```bash
# Check memory usage
free -h
docker stats

# Check system logs for OOM kills
dmesg | grep -i "killed process"
journalctl -u docker.service | grep -i oom
```

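If the kernel log contains many OOM entries, a per-process count shows which service is the repeat offender. This is a minimal sketch: `oom_victims` is a name invented here, and it assumes the usual kernel message shape `Killed process <pid> (<name>)`, which can vary slightly between kernel versions.

```bash
# Hypothetical helper: read kernel log text on stdin and count OOM kills per process name.
oom_victims() {
  sed -n 's/.*[Kk]illed process [0-9][0-9]* (\([^)]*\)).*/\1/p' \
    | sort | uniq -c | sort -rn
}

# Usage: sudo dmesg | oom_victims
```

The process at the top of the list is the first candidate for a memory limit or a configuration review.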
#### Solutions
```bash
# Add memory limits to docker-compose.yml
# (legacy docker-compose v1 applies deploy limits only when run with --compatibility)
services:
  service-name:
    deploy:
      resources:
        limits:
          memory: 2G
        reservations:
          memory: 1G

# Increase system swap
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
```

### Docker Daemon Issues

#### Symptoms
- "Cannot connect to Docker daemon" errors
- Docker commands hang or time out
- Services become unresponsive

#### Diagnostic Steps
```bash
# Check Docker daemon status
systemctl status docker

# Check Docker daemon logs
journalctl -u docker.service -f

# Test Docker connectivity
docker version
docker info
```

#### Solutions
```bash
# Restart Docker daemon
sudo systemctl restart docker

# Clean up Docker system
# WARNING: removes all stopped containers, unused networks, and every unused image
docker system prune -a

# Reset Docker daemon (last resort)
# WARNING: deletes ALL images, containers, and named volumes; back up first
sudo systemctl stop docker
sudo rm -rf /var/lib/docker
sudo systemctl start docker
```

## Network & Connectivity Issues

### Service Not Accessible

#### Symptoms
- Connection refused errors
- Timeouts when accessing services
- Services work internally but not externally

#### Diagnostic Steps
```bash
# Test local connectivity
curl -I http://localhost:8080

# Test network connectivity
curl -I http://server-ip:8080

# Check firewall rules
sudo ufw status
sudo iptables -L -n

# Check port binding
netstat -tulpn | grep :8080
```

#### Solutions
```bash
# Open firewall ports
sudo ufw allow 8080/tcp

# Check Docker port binding
# Ensure ports are properly exposed in docker-compose.yml
ports:
  - "0.0.0.0:8080:8080"  # Bind to all interfaces

# Restart networking (Debian ifupdown systems; on netplan systems use: sudo netplan apply)
sudo systemctl restart networking
```

### DNS Resolution Issues

#### Symptoms
- Cannot resolve service hostnames
- "Name or service not known" errors
- Services can't communicate with each other

#### Diagnostic Steps
```bash
# Test DNS resolution
nslookup service.local
dig service.local

# Check DNS configuration
cat /etc/resolv.conf

# Test container DNS
docker exec container-name nslookup google.com
```

#### Solutions
```bash
# Update DNS servers
# (temporary if /etc/resolv.conf is managed by systemd-resolved or NetworkManager)
echo "nameserver 8.8.8.8" | sudo tee -a /etc/resolv.conf

# Restart systemd-resolved
sudo systemctl restart systemd-resolved

# Configure Docker DNS
# Add to /etc/docker/daemon.json
{
  "dns": ["8.8.8.8", "8.8.4.4"]
}

sudo systemctl restart docker
```

### Reverse Proxy Issues

#### Symptoms
- 502 Bad Gateway errors
- SSL certificate errors
- Services accessible directly but not through proxy

#### Diagnostic Steps
```bash
# Check proxy container logs
docker logs nginx-proxy-manager

# Test backend connectivity
curl -I http://backend-service:8080

# Check proxy configuration
docker exec nginx-proxy-manager cat /etc/nginx/nginx.conf
```

#### Solutions
```bash
# Verify backend service is running
docker ps | grep backend-service

# Check network connectivity between proxy and backend
docker exec nginx-proxy-manager ping backend-service

# Regenerate SSL certificates
# Through Nginx Proxy Manager UI or:
certbot renew --force-renewal
```

## Storage & File System Issues

### Disk Space Issues

#### Symptoms
- "No space left on device" errors
- Services failing to write data
- System performance degradation

#### Diagnostic Steps
```bash
# Check disk usage
df -h
du -sh /*

# Check Docker space usage
docker system df

# Find large files
find / -type f -size +1G 2>/dev/null
```

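The `df` check above is easy to turn into a threshold test that can run from cron or a health-check script. The following is a sketch only: `full_filesystems` is a hypothetical name, the 90% default is an arbitrary assumption, and the parsing expects POSIX-format `df -P` output with mount points that contain no spaces.

```bash
# Hypothetical helper: list mount points at or above a usage threshold.
# Reads `df -P` output on stdin; $1 is the threshold in percent (default 90).
full_filesystems() {
  local threshold="${1:-90}"
  awk -v t="$threshold" 'NR > 1 {
    use = $5
    sub(/%/, "", use)              # strip the percent sign from the Capacity column
    if (use + 0 >= t + 0) print $6 " " $5
  }'
}

# Usage: df -P | full_filesystems 85
```

An empty result means every filesystem is below the threshold.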
#### Solutions
```bash
# Clean Docker system
# WARNING: prune -a removes every unused image; volume prune deletes unused volumes and their data
docker system prune -a
docker volume prune

# Clean log files
sudo journalctl --vacuum-time=7d
sudo find /var/log -name "*.log" -type f -mtime +30 -delete

# Move data to larger partition (stop Docker first so nothing changes mid-move)
sudo systemctl stop docker
sudo mv /var/lib/docker /mnt/storage/docker
sudo ln -s /mnt/storage/docker /var/lib/docker
sudo systemctl start docker
```

### Permission Issues

#### Symptoms
- "Permission denied" errors
- Services can't read/write files
- Configuration files not loading

#### Diagnostic Steps
```bash
# Check file permissions
ls -la /mnt/storage/service-name

# Check user/group IDs
id username
docker exec container-name id

# Check mount points
mount | grep storage
```

#### Solutions
```bash
# Fix ownership recursively (use the UID:GID reported by the id commands above)
sudo chown -R 1000:1000 /mnt/storage/service-name

# Set proper permissions
sudo chmod -R 755 /mnt/storage/service-name

# Add user to docker group (log out and back in for it to take effect)
sudo usermod -aG docker $USER

# Set PUID/PGID in docker-compose.yml (only honored by images that support these variables)
environment:
  - PUID=1000
  - PGID=1000
```

### RAID Array Issues

#### Symptoms
- Degraded RAID arrays
- Disk failure notifications
- Slow storage performance

#### Diagnostic Steps
```bash
# Check RAID status
cat /proc/mdstat
sudo mdadm --detail /dev/md0

# Check disk health
sudo smartctl -a /dev/sda

# Check system logs
dmesg | grep -i raid
journalctl | grep -i mdadm
```

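In `/proc/mdstat`, each array's status line ends with a member map such as `[UU]`, where an underscore marks a missing or failed member. As a small sketch, the function below prints the names of arrays whose map contains an underscore. `degraded_arrays` is a name made up for this guide, and it assumes arrays are named `mdN`.

```bash
# Hypothetical helper: print md arrays whose member map (e.g. [U_]) shows a missing disk.
# Reads /proc/mdstat-formatted text on stdin.
degraded_arrays() {
  awk '/^md[0-9]+ :/        { array = $1 }     # remember the array this block describes
       /\[[U_]*_[U_]*\]/    { print array }'   # a map containing "_" means degraded
}

# Usage: degraded_arrays < /proc/mdstat
```

No output means every array reports all members up.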
#### Solutions
```bash
# Replace failed disk (mark it failed first, then remove it from the array)
sudo mdadm --manage /dev/md0 --fail /dev/sdb
sudo mdadm --manage /dev/md0 --remove /dev/sdb
# Physically replace disk
sudo mdadm --manage /dev/md0 --add /dev/sdb

# Re-add a disk that was dropped from the array but is still healthy
sudo mdadm --manage /dev/md0 --re-add /dev/sdb

# Monitor rebuild progress
watch cat /proc/mdstat
```

## Service-Specific Issues

### Database Connection Issues

#### Symptoms
- "Connection refused" to database
- "Too many connections" errors
- Database corruption warnings

#### Diagnostic Steps
```bash
# Check database container logs
docker logs postgres-container

# Test database connectivity
docker exec postgres-container psql -U user -d database -c "SELECT 1;"

# Check connection limits
docker exec postgres-container psql -U user -c "SHOW max_connections;"
```

#### Solutions
```bash
# Restart database container
docker-compose restart postgres

# Increase connection limits (takes effect after a PostgreSQL restart)
# In postgresql.conf:
max_connections = 200

# Clean up idle connections
docker exec postgres-container psql -U user -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle';"
```

### Web Service Issues

#### Symptoms
- 500 Internal Server Error
- Slow response times
- Service timeouts

#### Diagnostic Steps
```bash
# Check service logs
docker logs web-service

# Test service health
curl -I http://localhost:8080/health

# Check resource usage
docker stats web-service
```

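A single `curl` right after a restart often fails simply because the service is still starting. One way to handle that is a small retry wrapper around the health check. The sketch below is an illustration, not part of the homelab scripts: `retry` is a hypothetical helper, and the attempt count and delay in the usage line are arbitrary.

```bash
# Hypothetical helper: run a command up to $1 times, sleeping $2 seconds between attempts.
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Only sleep when another attempt is still coming.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Usage: retry 10 3 curl -fsS -o /dev/null http://localhost:8080/health
```

The `-f` flag makes `curl` return a non-zero status on HTTP errors, which is what lets the wrapper treat a 500 response as a failed attempt.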
#### Solutions
```bash
# Restart service
docker-compose restart web-service

# Increase resource limits
deploy:
  resources:
    limits:
      memory: 2G
      cpus: '1.0'

# Check application configuration
docker exec web-service cat /config/app.conf
```

### Authentication Issues

#### Symptoms
- Login failures
- "Unauthorized" errors
- SSO integration problems

#### Diagnostic Steps
```bash
# Check authentication service logs
docker logs authentik-server

# Test authentication endpoint
curl -X POST http://auth.local/api/v3/auth/login

# Check user database
# (ak subcommands differ between authentik releases; confirm this one exists in yours)
docker exec authentik-server ak list_users
```

#### Solutions
```bash
# Reset user password
# (ak subcommands differ between authentik releases; confirm this one exists in yours)
docker exec authentik-server ak reset_password username

# Restart authentication service
docker-compose restart authentik

# Check LDAP connectivity (if applicable; requires ldapsearch inside the container)
docker exec authentik-server ldapsearch -x -H ldap://server
```

## Monitoring & Alerting Issues

### Metrics Collection Issues

#### Symptoms
- Missing metrics in Grafana
- Prometheus targets down
- Exporters not responding

#### Diagnostic Steps
```bash
# Check Prometheus targets
curl http://prometheus:9090/api/v1/targets

# Test exporter endpoints
curl http://node-exporter:9100/metrics

# Check Prometheus configuration
docker exec prometheus cat /etc/prometheus/prometheus.yml
```

#### Solutions
```bash
# Restart monitoring stack
docker-compose -f monitoring.yml restart

# Reload Prometheus configuration (requires Prometheus to run with --web.enable-lifecycle)
curl -X POST http://prometheus:9090/-/reload

# Check network connectivity
docker exec prometheus ping node-exporter
```

### Alert Manager Issues

#### Symptoms
- Alerts not firing
- Notifications not received
- Alert routing problems

#### Diagnostic Steps
```bash
# Check AlertManager status (the v1 API was removed in recent releases; use v2)
curl http://alertmanager:9093/api/v2/status

# View active alerts
curl http://alertmanager:9093/api/v2/alerts

# Check routing configuration
docker exec alertmanager cat /etc/alertmanager/alertmanager.yml
```

#### Solutions
```bash
# Test notification channels
curl -X POST http://alertmanager:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{"labels":{"alertname":"test"}}]'

# Restart AlertManager
docker-compose restart alertmanager

# Validate configuration
docker exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml
```

## Performance Issues

### High CPU Usage

#### Symptoms
- System sluggishness
- High load averages
- Services timing out

#### Diagnostic Steps
```bash
# Check system load
uptime
htop

# Check container CPU usage
docker stats

# Identify CPU-intensive processes
top -o %CPU
```

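A load average only means something relative to the number of CPU cores: a load of 4 is fine on 8 cores and a problem on 2. The sketch below compares the two. `load_status` is a hypothetical helper, and the 0.7 and 1.0 cut-offs are rule-of-thumb assumptions rather than fixed limits.

```bash
# Hypothetical helper: classify a load average against the core count.
# $1 = 1-minute load average, $2 = number of CPU cores.
load_status() {
  local load="$1" cores="$2"
  awk -v l="$load" -v c="$cores" 'BEGIN {
    ratio = l / c                      # load per core
    if (ratio >= 1.0)      print "overloaded"
    else if (ratio >= 0.7) print "busy"
    else                   print "ok"
  }'
}

# Usage: load_status "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```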
#### Solutions
```bash
# Limit container CPU usage
deploy:
  resources:
    limits:
      cpus: '0.5'

# Optimize service configuration
# Reduce worker processes, adjust cache settings

# Scale services horizontally
docker-compose up -d --scale web-service=3
```

### High Memory Usage

#### Symptoms
- System swapping
- OOM kills
- Slow performance

#### Diagnostic Steps
```bash
# Check memory usage
free -h
cat /proc/meminfo

# Check container memory usage
docker stats

# Check for memory leaks
ps aux --sort=-%mem | head
```

#### Solutions
```bash
# Add memory limits
deploy:
  resources:
    limits:
      memory: 1G

# Increase system memory or swap
sudo fallocate -l 2G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Optimize application memory usage
# Adjust JVM heap size, database buffers, etc.
```

### Network Performance Issues

#### Symptoms
- Slow file transfers
- High network latency
- Connection timeouts

#### Diagnostic Steps
```bash
# Test network speed
iperf3 -c server-ip

# Check network interface statistics
ip -s link show

# Monitor network traffic
iftop
nethogs
```

#### Solutions
```bash
# Optimize network settings (sudo tee is needed because the redirect must run as root)
echo 'net.core.rmem_max = 16777216' | sudo tee -a /etc/sysctl.conf
echo 'net.core.wmem_max = 16777216' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

# Check for network congestion
# Upgrade network infrastructure if needed

# Optimize Docker networking
# Use host networking for performance-critical services
network_mode: host
```

## Security Issues

### SSL Certificate Issues

#### Symptoms
- Certificate expired warnings
- SSL handshake failures
- Browser security warnings

#### Diagnostic Steps
```bash
# Check certificate expiration
openssl x509 -in cert.pem -text -noout | grep "Not After"

# Test SSL connectivity (-servername sends SNI; </dev/null ends the session after the handshake)
openssl s_client -connect domain.com:443 -servername domain.com </dev/null

# Check certificate chain (-v prints the certificate details curl received)
curl -vI https://domain.com
```

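Reading the "Not After" date by eye works for one certificate, but a number of days is easier to alert on. As a hedged sketch, the function below converts the `notAfter=` line printed by `openssl x509 -enddate` into days remaining. `days_until_expiry` is a made-up name, and the date parsing assumes GNU `date` (the `-d` option), so it will not work unchanged on BSD or macOS.

```bash
# Hypothetical helper: days between "now" and a certificate's expiry date.
# $1 = line such as "notAfter=Jan 11 00:00:00 2030 GMT" (from openssl x509 -enddate)
# $2 = current time in epoch seconds
days_until_expiry() {
  local enddate="${1#notAfter=}" now="$2" expiry
  # GNU date parses the openssl date format directly.
  expiry="$(date -u -d "$enddate" +%s)" || return 1
  echo $(( (expiry - now) / 86400 ))
}

# Usage:
#   days_until_expiry "$(openssl x509 -in cert.pem -noout -enddate)" "$(date +%s)"
```

A negative result means the certificate has already expired.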
#### Solutions
```bash
# Renew Let's Encrypt certificates
certbot renew

# Generate new self-signed certificate
openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -days 365

# Update certificate in services
# Copy new certificates to appropriate volumes
```

### Authentication Failures

#### Symptoms
- Repeated login failures
- Account lockouts
- Suspicious access attempts

#### Diagnostic Steps
```bash
# Check authentication logs
journalctl -u ssh.service | grep "Failed password"
docker logs authentik-server | grep "login failed"

# Check fail2ban status
sudo fail2ban-client status
sudo fail2ban-client status sshd
```

#### Solutions
```bash
# Unban IP addresses
sudo fail2ban-client set sshd unbanip IP_ADDRESS

# Strengthen authentication
# Enable 2FA, use SSH keys, implement rate limiting

# Monitor for brute force attacks
# Set up alerting for repeated failures
```

## Emergency Procedures

### Complete System Recovery

#### When to Use
- Multiple service failures
- System corruption
- Hardware failures

#### Recovery Steps
```bash
# 1. Stop all services
docker stop $(docker ps -q)

# 2. Check system integrity
# (run fsck only on an unmounted filesystem, for example from a rescue environment)
sudo fsck /dev/sda1

# 3. Restore from backup
./scripts/restore-system.sh

# 4. Restart critical services
./scripts/deploy-critical.sh

# 5. Verify system health
./scripts/health-check.sh
```

### Data Recovery

#### When to Use
- Data corruption
- Accidental deletion
- Storage failures

#### Recovery Steps
```bash
# 1. Stop affected services
docker-compose down

# 2. Mount backup storage
sudo mount /dev/backup /mnt/restore

# 3. Restore data
sudo rsync -av /mnt/restore/service-data/ /mnt/storage/service-data/

# 4. Fix permissions
sudo chown -R 1000:1000 /mnt/storage/service-data

# 5. Restart services
docker-compose up -d
```

### Network Recovery

#### When to Use
- Network connectivity loss
- DNS failures
- Routing issues

#### Recovery Steps
```bash
# 1. Check physical connectivity
ip link show

# 2. Restart networking (Debian ifupdown systems)
sudo systemctl restart networking

# 3. Reset network configuration (netplan systems)
sudo netplan apply

# 4. Flush DNS cache
sudo systemctl restart systemd-resolved

# 5. Test connectivity (-c 4 stops after four packets)
ping -c 4 8.8.8.8
```

## Prevention Strategies

### Monitoring & Alerting
- Set up comprehensive monitoring
- Configure proactive alerts
- Regular health checks
- Performance baselines

### Backup & Recovery
- Automated backup schedules
- Regular restore testing
- Offsite backup storage
- Documentation of procedures

### Maintenance
- Regular system updates
- Capacity planning
- Performance optimization
- Security hardening

### Documentation
- Incident response procedures
- Configuration documentation
- Change management processes
- Knowledge sharing

## Related Documentation

- **[Monitoring Setup](../admin/monitoring-setup.md)** - Monitoring configuration
- **[Security Guidelines](../security/README.md)** - Security best practices
- **[Backup Procedures](../admin/backup-procedures.md)** - Backup and recovery
- **[Emergency Contacts](../admin/README.md)** - Emergency procedures

---

*This troubleshooting guide provides comprehensive solutions for common issues encountered in the homelab environment. Keep this guide updated with new issues and solutions as they are discovered.*