homelab-optimized/docs/getting-started/40-Common-Issues.md
Sanitized mirror from private repository - 2026-04-19 09:52:01 UTC
# Common Issues & Troubleshooting
## Overview
This guide covers the most frequently encountered issues in the homelab environment and provides step-by-step solutions. Issues are organized by category with diagnostic steps and resolution procedures.
## Container & Docker Issues
### Container Won't Start
#### Symptoms
- Container exits immediately after starting
- "Container exited with code 1" errors
- Service unavailable after deployment
#### Diagnostic Steps
```bash
# Check container status
docker ps -a
# View container logs
docker logs container-name
# Inspect container configuration
docker inspect container-name
# Check resource usage
docker stats
```
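The exit code reported by `docker ps -a` or `docker inspect` often narrows the cause before you read any logs. The sketch below maps the conventional codes to a likely meaning. `explain_exit_code` is a hypothetical helper name, not a Docker command, and the wording of each hint is only a starting point.

```bash
#!/usr/bin/env bash
# explain_exit_code CODE: print a likely meaning for a container exit code.
# Hypothetical helper; the codes follow shell and `docker run` conventions.
# Usage: explain_exit_code "$(docker inspect -f '{{.State.ExitCode}}' container-name)"
explain_exit_code() {
  local code="$1"
  case "$code" in
    0)   echo "exited normally" ;;
    1)   echo "application error; check 'docker logs'" ;;
    125) echo "docker run itself failed (bad flag or daemon error)" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found in the image" ;;
    137) echo "killed by SIGKILL (128+9); often the OOM killer" ;;
    143) echo "terminated by SIGTERM (128+15)" ;;
    *)
      # Codes above 128 conventionally mean "killed by signal (code - 128)".
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "killed by signal $((code - 128))"
      else
        echo "application-specific exit code $code"
      fi
      ;;
  esac
}
```

A code of 137 is worth cross-checking against the OOM diagnostics in the memory section, since a manual `docker kill` produces the same value.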
#### Common Causes & Solutions
**Port Conflicts**
```bash
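# Check port usage (sudo is needed to see the owning process)
sudo netstat -tulpn | grep :8080
sudo ss -tulpn | grep :8080
# Solution: change the host port in docker-compose.yml (YAML, not a shell command)
ports:
  - "8081:8080" # host port 8081 maps to container port 8080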
```
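Grepping for `:8080` can also match unrelated lines, so a stricter check may be useful in scripts. The following is a minimal sketch that reads `ss -tlnH` output on stdin and compares only the port part of the local address. `port_in_use` is a hypothetical helper name, and the column position assumes TCP-only output (`-t`), where the local address is the fourth field.

```bash
#!/usr/bin/env bash
# port_in_use PORT: succeed if any listening socket on stdin uses PORT.
# Expects `ss -tlnH` style lines: State Recv-Q Send-Q Local:Port Peer:Port
port_in_use() {
  local port="$1"
  awk -v port="$port" '
    {
      n = split($4, parts, ":")      # the port is the text after the last colon
      if (parts[n] == port) found = 1
    }
    END { exit (found ? 0 : 1) }
  '
}
# Usage: ss -tlnH | port_in_use 8080 && echo "8080 is already taken"
```

Reading from stdin keeps the function independent of the host, so the same check works on saved output from another machine.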
**Permission Issues**
```bash
# Check file ownership
ls -la /mnt/storage/service-name
# Fix ownership
sudo chown -R 1000:1000 /mnt/storage/service-name
# Set proper permissions (capital X adds execute only to directories and already-executable files)
sudo chmod -R u=rwX,go=rX /mnt/storage/service-name
```
**Missing Environment Variables**
```bash
# Check environment variables
docker exec container-name env
# Add missing variables to .env file
echo "MISSING_VAR=value" >> .env
# Recreate container
docker-compose up -d --force-recreate
```
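If the container cannot start at all, `docker exec` is not available, so it can help to check the `.env` file directly. The sketch below lists which of the variables you expect are not declared in a file. `missing_env_vars` is a hypothetical helper, and the variable names in the usage line are placeholders.

```bash
#!/usr/bin/env bash
# missing_env_vars FILE VAR...: print each VAR that has no "VAR=" line in FILE.
# A variable declared with an empty value (VAR=) counts as present.
missing_env_vars() {
  local file="$1"; shift
  local var
  for var in "$@"; do
    grep -q "^${var}=" "$file" || echo "$var"
  done
}
# Usage (placeholder names): missing_env_vars .env DB_HOST DB_USER DB_PASS
```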
### Container Memory Issues
#### Symptoms
- Container killed by OOM (Out of Memory)
- Slow performance or timeouts
- System becomes unresponsive
#### Diagnostic Steps
```bash
# Check memory usage
free -h
docker stats
# Check system logs for OOM kills
dmesg | grep -i "killed process"
journalctl -u docker.service | grep -i oom
```
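Kernel OOM messages are long, and usually only the PID and process name matter. As a rough sketch, the filter below extracts those two fields from lines containing `Killed process PID (name)`, which is the form recent Linux kernels log. `oom_victims` is a hypothetical helper name, and older kernels may word the message differently.

```bash
#!/usr/bin/env bash
# oom_victims: read kernel log lines on stdin, print "PID NAME" per OOM kill.
# Lines that do not contain "Killed process PID (name)" are ignored.
oom_victims() {
  sed -n 's/.*Killed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}
# Usage: dmesg | oom_victims
```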
#### Solutions
```bash
# Add memory limits to docker-compose.yml (YAML, not a shell command)
services:
  service-name:
    deploy:
      resources:
        limits:
          memory: 2G
        reservations:
          memory: 1G
# Increase system swap
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
```
### Docker Daemon Issues
#### Symptoms
- "Cannot connect to Docker daemon" errors
- Docker commands hang or timeout
- Services become unresponsive
#### Diagnostic Steps
```bash
# Check Docker daemon status
systemctl status docker
# Check Docker daemon logs
journalctl -u docker.service -f
# Test Docker connectivity
docker version
docker info
```
#### Solutions
```bash
# Restart Docker daemon
sudo systemctl restart docker
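# Clean up Docker system (removes stopped containers, unused networks, and ALL unused images)
docker system prune -a
# Reset Docker daemon (last resort: this deletes ALL images, containers, and volumes)
sudo systemctl stop docker
sudo rm -rf /var/lib/docker
sudo systemctl start docker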
```
## Network & Connectivity Issues
### Service Not Accessible
#### Symptoms
- Connection refused errors
- Timeouts when accessing services
- Services work internally but not externally
#### Diagnostic Steps
```bash
# Test local connectivity
curl -I http://localhost:8080
# Test network connectivity
curl -I http://server-ip:8080
# Check firewall rules
sudo ufw status
sudo iptables -L -n
# Check port binding
netstat -tulpn | grep :8080
```
#### Solutions
```bash
# Open firewall ports
sudo ufw allow 8080/tcp
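# Check Docker port binding
# Ensure ports are published in docker-compose.yml (YAML, not a shell command)
ports:
  - "0.0.0.0:8080:8080" # Bind to all interfaces
# Restart networking (Debian-style ifupdown hosts; use your distribution's network service otherwise)
sudo systemctl restart networking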
```
### DNS Resolution Issues
#### Symptoms
- Cannot resolve service hostnames
- "Name or service not known" errors
- Services can't communicate with each other
#### Diagnostic Steps
```bash
# Test DNS resolution
nslookup service.local
dig service.local
# Check DNS configuration
cat /etc/resolv.conf
# Test container DNS
docker exec container-name nslookup google.com
```
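When reading `/etc/resolv.conf`, it matters whether the host points at real upstream servers or at the local systemd-resolved stub (`127.0.0.53`), because in the latter case manual edits to the file are typically overwritten. The sketch below extracts the configured nameservers and tests for the stub. Both function names are hypothetical helpers.

```bash
#!/usr/bin/env bash
# list_nameservers [FILE]: print each nameserver address in a resolv.conf-style file.
list_nameservers() {
  awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# uses_stub_resolver [FILE]: succeed if the file points at the
# systemd-resolved stub listener on 127.0.0.53.
uses_stub_resolver() {
  list_nameservers "$1" | grep -qx '127\.0\.0\.53'
}
# Usage: uses_stub_resolver && echo "managed by systemd-resolved; check resolvectl status"
```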
#### Solutions
```bash
# Update DNS servers (temporary: systemd-resolved or DHCP may overwrite /etc/resolv.conf)
echo "nameserver 8.8.8.8" | sudo tee -a /etc/resolv.conf
# Restart systemd-resolved
sudo systemctl restart systemd-resolved
# Configure Docker DNS
# Add to /etc/docker/daemon.json (JSON, not a shell command)
{
  "dns": ["8.8.8.8", "8.8.4.4"]
}
sudo systemctl restart docker
```
### Reverse Proxy Issues
#### Symptoms
- 502 Bad Gateway errors
- SSL certificate errors
- Services accessible directly but not through proxy
#### Diagnostic Steps
```bash
# Check proxy container logs
docker logs nginx-proxy-manager
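# Test backend connectivity (container names resolve only inside the shared Docker network,
# so run the test from the proxy container; assumes curl is available in that image)
docker exec nginx-proxy-manager curl -I http://backend-service:8080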
# Check proxy configuration
docker exec nginx-proxy-manager cat /etc/nginx/nginx.conf
```
#### Solutions
```bash
# Verify backend service is running
docker ps | grep backend-service
# Check network connectivity between proxy and backend
docker exec nginx-proxy-manager ping backend-service
# Regenerate SSL certificates
# Through Nginx Proxy Manager UI or:
certbot renew --force-renewal
```
## Storage & File System Issues
### Disk Space Issues
#### Symptoms
- "No space left on device" errors
- Services failing to write data
- System performance degradation
#### Diagnostic Steps
```bash
# Check disk usage
df -h
du -sh /*
# Check Docker space usage
docker system df
# Find large files
find / -type f -size +1G 2>/dev/null
```
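For a quick pass over many mounts, the `df` output can be filtered by a usage threshold. This is a minimal sketch that reads `df -P` output on stdin, where the POSIX format keeps each filesystem on one line. `full_filesystems` is a hypothetical helper, the default of 90 percent is an arbitrary assumption, and mount points containing spaces are not handled.

```bash
#!/usr/bin/env bash
# full_filesystems [THRESHOLD]: read `df -P` output on stdin and print
# "MOUNTPOINT USE%" for every filesystem at or above THRESHOLD percent.
full_filesystems() {
  local threshold="${1:-90}"
  awk -v limit="$threshold" '
    NR > 1 {                          # skip the header line
      use = $5; sub(/%/, "", use)     # column 5 is Capacity, e.g. 95%
      if (use + 0 >= limit + 0) print $6, $5
    }
  '
}
# Usage: df -P | full_filesystems 85
```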
#### Solutions
```bash
# Clean Docker system (deletes all unused images and unused volumes; confirm no needed data lives there)
docker system prune -a
docker volume prune
# Clean log files
sudo journalctl --vacuum-time=7d
sudo find /var/log -name "*.log" -type f -mtime +30 -delete
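# Move Docker data to a larger partition (stop Docker first so no files change mid-move)
sudo systemctl stop docker
sudo mv /var/lib/docker /mnt/storage/docker
sudo ln -s /mnt/storage/docker /var/lib/docker
sudo systemctl start docker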
```
### Permission Issues
#### Symptoms
- "Permission denied" errors
- Services can't read/write files
- Configuration files not loading
#### Diagnostic Steps
```bash
# Check file permissions
ls -la /mnt/storage/service-name
# Check user/group IDs
id username
docker exec container-name id
# Check mount points
mount | grep storage
```
#### Solutions
```bash
# Fix ownership recursively
sudo chown -R 1000:1000 /mnt/storage/service-name
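# Set proper permissions (capital X adds execute only to directories and already-executable files)
sudo chmod -R u=rwX,go=rX /mnt/storage/service-name
# Add user to docker group (log out and back in for the change to take effect)
sudo usermod -aG docker $USER
# Set PUID/PGID in docker-compose.yml (YAML; honored only by images that support them, such as LinuxServer.io images)
environment:
  - PUID=1000
  - PGID=1000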
```
### RAID Array Issues
#### Symptoms
- Degraded RAID arrays
- Disk failure notifications
- Slow storage performance
#### Diagnostic Steps
```bash
# Check RAID status
cat /proc/mdstat
sudo mdadm --detail /dev/md0
# Check disk health
sudo smartctl -a /dev/sda
# Check system logs
dmesg | grep -i raid
journalctl | grep -i mdadm
```
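In `/proc/mdstat`, a degraded array shows an underscore in its member status, for example `[U_]` instead of `[UU]`. The sketch below prints the names of arrays in that state so the check can run from cron or a monitoring script. `degraded_arrays` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# degraded_arrays [FILE]: print the name of every md array whose member
# status (e.g. [U_]) contains an underscore, meaning a member is missing.
degraded_arrays() {
  awk '
    /^md[^ ]* :/ { name = $1 }                  # remember the array being described
    /\[[U_]*_[U_]*\]/ && name != "" {           # status block with at least one "_"
      print name
      name = ""
    }
  ' "${1:-/proc/mdstat}"
}
# Usage: [ -n "$(degraded_arrays)" ] && echo "RAID degraded"
```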
#### Solutions
```bash
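# Replace failed disk: mark it failed, then remove it from the array
sudo mdadm --manage /dev/md0 --fail /dev/sdb
sudo mdadm --manage /dev/md0 --remove /dev/sdb
# Physically replace the disk, then add the new one (the rebuild starts automatically)
sudo mdadm --manage /dev/md0 --add /dev/sdb
# Alternative for a disk that only dropped out temporarily: re-add the same disk
sudo mdadm --manage /dev/md0 --re-add /dev/sdb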
# Monitor rebuild progress
watch cat /proc/mdstat
```
## Service-Specific Issues
### Database Connection Issues
#### Symptoms
- "Connection refused" to database
- "Too many connections" errors
- Database corruption warnings
#### Diagnostic Steps
```bash
# Check database container status
docker logs postgres-container
# Test database connectivity
docker exec postgres-container psql -U user -d database -c "SELECT 1;"
# Check connection limits
docker exec postgres-container psql -U user -c "SHOW max_connections;"
```
#### Solutions
```bash
# Restart database container
docker-compose restart postgres
# Increase connection limits
# In postgresql.conf (takes effect only after PostgreSQL is restarted):
max_connections = 200
# Clean up idle connections
docker exec postgres-container psql -U user -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle';"
```
### Web Service Issues
#### Symptoms
- 500 Internal Server Error
- Slow response times
- Service timeouts
#### Diagnostic Steps
```bash
# Check service logs
docker logs web-service
# Test service health
curl -I http://localhost:8080/health
# Check resource usage
docker stats web-service
```
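A service that has just been restarted may need some time before its health endpoint answers, so a single `curl` can give a false negative. The following is a minimal retry sketch. `retry` is a hypothetical helper, and the attempt count and delay in the usage line are arbitrary assumptions.

```bash
#!/usr/bin/env bash
# retry ATTEMPTS DELAY CMD...: run CMD until it succeeds, at most ATTEMPTS
# times, sleeping DELAY seconds between tries. Fails if every attempt fails.
retry() {
  local attempts="$1" delay="$2"; shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    # Do not sleep after the final failed attempt.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
  done
  return 1
}
# Usage: retry 10 3 curl -fsS -o /dev/null http://localhost:8080/health
```

The `-f` flag makes `curl` return a non-zero status on HTTP errors such as 500, which is what lets `retry` treat an unhealthy response as a failure.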
#### Solutions
```bash
# Restart service
docker-compose restart web-service
# Increase resource limits (docker-compose.yml, under the service)
deploy:
  resources:
    limits:
      memory: 2G
      cpus: '1.0'
# Check application configuration
docker exec web-service cat /config/app.conf
```
### Authentication Issues
#### Symptoms
- Login failures
- "Unauthorized" errors
- SSO integration problems
#### Diagnostic Steps
```bash
# Check authentication service logs
docker logs authentik-server
# Test authentication endpoint
curl -X POST http://auth.local/api/v3/auth/login
# Check user database
docker exec authentik-server ak list_users
```
#### Solutions
```bash
# Reset user password
docker exec authentik-server ak reset_password username
# Restart authentication service
docker-compose restart authentik
# Check LDAP connectivity (if applicable)
docker exec authentik-server ldapsearch -x -H ldap://server
```
## Monitoring & Alerting Issues
### Metrics Collection Issues
#### Symptoms
- Missing metrics in Grafana
- Prometheus targets down
- Exporters not responding
#### Diagnostic Steps
```bash
# Check Prometheus targets
curl http://prometheus:9090/api/v1/targets
# Test exporter endpoints
curl http://node-exporter:9100/metrics
# Check Prometheus configuration
docker exec prometheus cat /etc/prometheus/prometheus.yml
```
#### Solutions
```bash
# Restart monitoring stack
docker-compose -f monitoring.yml restart
# Reload Prometheus configuration (requires Prometheus to run with --web.enable-lifecycle)
curl -X POST http://prometheus:9090/-/reload
# Check network connectivity
docker exec prometheus ping node-exporter
```
### Alert Manager Issues
#### Symptoms
- Alerts not firing
- Notifications not received
- Alert routing problems
#### Diagnostic Steps
```bash
# Check AlertManager status (API v2; the v1 API is deprecated and removed in recent releases)
curl http://alertmanager:9093/api/v2/status
# View active alerts
curl http://alertmanager:9093/api/v2/alerts
# Check routing configuration
docker exec alertmanager cat /etc/alertmanager/alertmanager.yml
```
#### Solutions
```bash
# Test notification channels (API v2; the v1 API is deprecated and removed in recent releases)
curl -X POST http://alertmanager:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{"labels":{"alertname":"test"}}]'
# Restart AlertManager
docker-compose restart alertmanager
# Validate configuration
docker exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml
```
## Performance Issues
### High CPU Usage
#### Symptoms
- System sluggishness
- High load averages
- Services timing out
#### Diagnostic Steps
```bash
# Check system load
uptime
htop
# Check container CPU usage
docker stats
# Identify CPU-intensive processes
top -o %CPU
```
#### Solutions
```bash
# Limit container CPU usage (docker-compose.yml, under the service)
deploy:
  resources:
    limits:
      cpus: '0.5'
# Optimize service configuration
# Reduce worker processes, adjust cache settings
# Scale services horizontally
docker-compose up -d --scale web-service=3
```
### High Memory Usage
#### Symptoms
- System swapping
- OOM kills
- Slow performance
#### Diagnostic Steps
```bash
# Check memory usage
free -h
cat /proc/meminfo
# Check container memory usage
docker stats
# Check for memory leaks
ps aux --sort=-%mem | head
```
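`MemAvailable` in `/proc/meminfo` is usually a better pressure indicator than the `free` column, because it accounts for reclaimable cache. As a rough sketch, the function below reports available memory as a whole-number percentage of the total. `mem_available_pct` is a hypothetical helper, and any alert threshold you pair with it is your own choice.

```bash
#!/usr/bin/env bash
# mem_available_pct [FILE]: print MemAvailable as an integer percentage of
# MemTotal, reading a /proc/meminfo-style file (values are in kB).
mem_available_pct() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END { if (total > 0) printf "%d\n", avail * 100 / total }
  ' "${1:-/proc/meminfo}"
}
# Usage: [ "$(mem_available_pct)" -lt 10 ] && echo "less than 10% memory available"
```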
#### Solutions
```bash
# Add memory limits (docker-compose.yml, under the service)
deploy:
  resources:
    limits:
      memory: 1G
# Increase system memory or swap
sudo fallocate -l 2G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# Optimize application memory usage
# Adjust JVM heap size, database buffers, etc.
```
### Network Performance Issues
#### Symptoms
- Slow file transfers
- High network latency
- Connection timeouts
#### Diagnostic Steps
```bash
# Test network speed
iperf3 -c server-ip
# Check network interface statistics
ip -s link show
# Monitor network traffic
iftop
nethogs
```
#### Solutions
```bash
# Optimize network settings (tee runs the write as root; a plain >> redirection would not)
echo 'net.core.rmem_max = 16777216' | sudo tee -a /etc/sysctl.conf
echo 'net.core.wmem_max = 16777216' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
# Check for network congestion
# Upgrade network infrastructure if needed
# Optimize Docker networking
# Use host networking for performance-critical services
network_mode: host
```
## Security Issues
### SSL Certificate Issues
#### Symptoms
- Certificate expired warnings
- SSL handshake failures
- Browser security warnings
#### Diagnostic Steps
```bash
# Check certificate expiration
openssl x509 -in cert.pem -text -noout | grep "Not After"
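# Test SSL connectivity (closing stdin makes s_client exit after the handshake)
openssl s_client -connect domain.com:443 -servername domain.com </dev/null
# Check certificate chain
openssl s_client -connect domain.com:443 -servername domain.com -showcerts </dev/null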
```
#### Solutions
```bash
# Renew Let's Encrypt certificates
certbot renew
# Generate new self-signed certificate
openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -days 365
# Update certificate in services
# Copy new certificates to appropriate volumes
```
### Authentication Failures
#### Symptoms
- Repeated login failures
- Account lockouts
- Suspicious access attempts
#### Diagnostic Steps
```bash
# Check authentication logs
journalctl -u ssh.service | grep "Failed password"
docker logs authentik-server | grep "login failed"
# Check fail2ban status
sudo fail2ban-client status
sudo fail2ban-client status sshd
```
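To see whether failures come from one source or many, the log lines can be grouped by client address. The sketch below counts `Failed password` lines per IP and sorts by count. `failed_logins_by_ip` is a hypothetical helper, and it assumes the usual OpenSSH message form, `Failed password for USER from IP port N ssh2`.

```bash
#!/usr/bin/env bash
# failed_logins_by_ip: read sshd log lines on stdin and print "COUNT IP",
# highest count first, for every "Failed password" line.
failed_logins_by_ip() {
  awk '
    /Failed password/ {
      # The address is the field that follows the word "from".
      for (i = 1; i < NF; i++)
        if ($i == "from") { count[$(i + 1)]++; break }
    }
    END { for (ip in count) print count[ip], ip }
  ' | sort -rn
}
# Usage: journalctl -u ssh.service --no-pager | failed_logins_by_ip | head
```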
#### Solutions
```bash
# Unban IP addresses
sudo fail2ban-client set sshd unbanip IP_ADDRESS
# Strengthen authentication
# Enable 2FA, use SSH keys, implement rate limiting
# Monitor for brute force attacks
# Set up alerting for repeated failures
```
## Emergency Procedures
### Complete System Recovery
#### When to Use
- Multiple service failures
- System corruption
- Hardware failures
#### Recovery Steps
```bash
# 1. Stop all services
docker stop $(docker ps -q)
# 2. Check system integrity (run only on an unmounted filesystem, e.g. from a rescue environment)
sudo fsck /dev/sda1
# 3. Restore from backup
./scripts/restore-system.sh
# 4. Restart critical services
./scripts/deploy-critical.sh
# 5. Verify system health
./scripts/health-check.sh
```
### Data Recovery
#### When to Use
- Data corruption
- Accidental deletion
- Storage failures
#### Recovery Steps
```bash
# 1. Stop affected services
docker-compose down
# 2. Mount backup storage
mount /dev/backup /mnt/restore
# 3. Restore data
rsync -av /mnt/restore/service-data/ /mnt/storage/service-data/
# 4. Fix permissions
chown -R 1000:1000 /mnt/storage/service-data
# 5. Restart services
docker-compose up -d
```
### Network Recovery
#### When to Use
- Network connectivity loss
- DNS failures
- Routing issues
#### Recovery Steps
```bash
# 1. Check physical connectivity
ip link show
# 2. Restart networking (Debian-style ifupdown hosts)
sudo systemctl restart networking
# 3. Reset network configuration (netplan-based hosts such as Ubuntu)
sudo netplan apply
# 4. Flush DNS cache
sudo systemctl restart systemd-resolved
# 5. Test connectivity
ping 8.8.8.8
```
## Prevention Strategies
### Monitoring & Alerting
- Set up comprehensive monitoring
- Configure proactive alerts
- Regular health checks
- Performance baselines
### Backup & Recovery
- Automated backup schedules
- Regular restore testing
- Offsite backup storage
- Documentation of procedures
### Maintenance
- Regular system updates
- Capacity planning
- Performance optimization
- Security hardening
### Documentation
- Incident response procedures
- Configuration documentation
- Change management processes
- Knowledge sharing
## Related Documentation
- **[Monitoring Setup](../admin/monitoring-setup.md)** - Monitoring configuration
- **[Security Guidelines](../security/README.md)** - Security best practices
- **[Backup Procedures](../admin/backup-procedures.md)** - Backup and recovery
- **[Emergency Contacts](../admin/README.md)** - Emergency procedures
---
*This troubleshooting guide provides comprehensive solutions for common issues encountered in the homelab environment. Keep this guide updated with new issues and solutions as they are discovered.*