# 🔧 Comprehensive Infrastructure Troubleshooting Guide

This guide provides systematic approaches to diagnose and resolve common infrastructure issues across all homelab components. When encountering problems, follow this troubleshooting flow.

## 🔍 Troubleshooting Methodology

### 1. **Gather Information**

- Check service status in Portainer
- Review recent changes (Git commits)
- Collect error messages and logs
- Identify affected hosts/services

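
When gathering information, it helps to capture everything into one timestamped directory so later steps all reference the same snapshot. A minimal sketch (the output path and repo location are illustrative, not from this setup; the docker/git lines are commented out because they only apply on the homelab VM):

```bash
#!/bin/sh
# Collect a timestamped diagnostic bundle (illustrative paths).
outdir="/tmp/diag-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$outdir"
uname -a > "$outdir/host.txt"
date -u > "$outdir/collected-at.txt"
# On the homelab VM you would also capture, for example:
# docker ps -a > "$outdir/containers.txt"
# git -C /path/to/infra-repo log --oneline -10 > "$outdir/recent-changes.txt"
echo "diagnostics written to $outdir"
```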
### 2. **Check Service Status**

```bash
# On the homelab VM
docker ps -a
docker stats --no-stream
# Review stack status in the Portainer UI (Stacks view);
# Portainer does not ship a "portainer stacks list" CLI.
```

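
To spot stopped containers quickly in the `docker ps -a` output, a small filter helps. This sketch runs against hardcoded sample output (the container names are hypothetical) so it is self-contained; in practice, pipe `docker ps -a --format '{{.Names}} {{.State}}'` into it:

```bash
#!/bin/sh
# Print containers whose state is not "running".
# Real usage: docker ps -a --format '{{.Names}} {{.State}}' | flag_stopped
flag_stopped() {
  awk '$2 != "running" { print "NOT RUNNING: " $1 }'
}

# Sample input (hypothetical container names):
printf '%s\n' \
  'nginx-proxy-manager running' \
  'uptime-kuma exited' \
  'grafana running' | flag_stopped
```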
### 3. **Verify Network Connectivity**

```bash
# Test connectivity to services
ping [host]
telnet [host] [port]   # or, where telnet is unavailable: nc -zv [host] [port]
curl -v [service-url]
```

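
Beyond telnet and curl, a scriptable port probe is handy for checking several services in one pass. A sketch using bash's built-in `/dev/tcp` redirection (the host and port below are placeholders; substitute your real services):

```bash
#!/bin/bash
# Probe a TCP port without telnet/nc, via bash's /dev/tcp redirection.
check_port() {
  host=$1 port=$2
  if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "open:   $host:$port"
  else
    echo "closed: $host:$port"
  fi
}

# Placeholder target -- substitute your real hosts/ports:
check_port 127.0.0.1 80
```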
### 4. **Review Logs and Metrics**

- Check Docker logs via Portainer or `docker logs`
- Review Grafana dashboards
- Monitor Uptime Kuma alerts

## 🚨 Common Issues and Solutions

### Authentication Problems

**Symptom**: Cannot access services like Portainer, Git, or Authentik

**Solution Steps**:

1. Verify correct credentials (check Vaultwarden)
2. Check Tailscale status (`tailscale status`)
3. Confirm DNS resolution works for service domains
4. Restart affected containers in Portainer

|
### Network Connectivity Issues

**Symptom**: Services unreachable from external networks or clients

**Common Causes**:

- Firewall rules blocking ports
- Incorrect Nginx Proxy Manager configuration
- Tailscale connectivity issues
- Cloudflare DNS propagation delays

**Troubleshooting Steps**:

1. Check Portainer for container running status
2. Verify host firewall settings (Synology DSM or UFW)
3. Test direct access to service ports via the Tailscale network
4. Confirm the NPM reverse proxy is correctly configured

|
### Container Failures

**Symptom**: Containers failing or crashing repeatedly

**Solution Steps**:

1. Check container logs (`docker logs [container-name]`)
2. Verify image versions (check for `:latest` tags)
3. Inspect volume mounts and data paths
4. Check resource limits/usage
5. Restart the container in Portainer

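
Step 2 (hunting for `:latest` tags) can be automated with a quick scan of the compose file. A sketch against a sample file (the file path and image names are hypothetical); note it does not handle registry hosts that include a port, such as `registry:5000/app`:

```bash
#!/bin/sh
# Flag compose images that are untagged or pinned to :latest.
# Sample compose file so the sketch is self-contained (hypothetical services):
cat > /tmp/compose-sample.yml <<'EOF'
services:
  proxy:
    image: nginx:latest
  db:
    image: postgres:16.2
  cache:
    image: redis
EOF

awk '/^[[:space:]]*image:/ {
  n = split($2, p, ":")
  if (n == 1 || p[n] == "latest") print "unpinned: " $2
}' /tmp/compose-sample.yml
```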
|
### Backup Issues

**Symptom**: Backup failures or incomplete backups

**Troubleshooting Steps**:

1. Confirm backup task settings match documentation
2. Check Hyper Backup logs for specific errors
3. Verify network connectivity to destination storage
4. Review the Backblaze B2 dashboard for errors
5. Validate the local backup copy exists before cloud upload

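
A quick way to confirm a backup actually ran is to check the age of the newest file at the destination. A sketch (the demo path and the 26-hour window are illustrative assumptions; `find -printf` is GNU-specific):

```bash
#!/bin/sh
# Warn if the newest file under a backup path is older than max_hours.
backup_fresh() {
  dir=$1 max_hours=${2:-26}
  newest=$(find "$dir" -type f -printf '%T@\n' 2>/dev/null | sort -n | tail -1)
  if [ -z "$newest" ]; then
    echo "WARN: no files found in $dir"
    return 1
  fi
  age_h=$(( ( $(date +%s) - ${newest%.*} ) / 3600 ))
  if [ "$age_h" -gt "$max_hours" ]; then
    echo "WARN: newest backup in $dir is ${age_h}h old"
  else
    echo "OK: newest backup in $dir is ${age_h}h old"
  fi
}

# Demo against a freshly created file (illustrative path):
mkdir -p /tmp/backup-demo && touch /tmp/backup-demo/latest.bak
backup_fresh /tmp/backup-demo
```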
|
### Storage Problems

**Symptom**: Low disk space, read/write failures

**Solution Steps**:

1. Check disk usage via Portainer or the host shell

   ```bash
   df -h
   du -sh /volume1/docker/*
   ```

2. Identify large files or directories
3. Verify proper mount points and permissions
4. Check Synology volume health status (via DSM UI)

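
Step 1 can be turned into a threshold check suitable for a cron job. A sketch (the 90% limit is an assumption; adjust per volume, and point it at `/volume1` on the NAS):

```bash
#!/bin/sh
# Warn when a filesystem's usage crosses a threshold (default 90%).
disk_check() {
  mount=$1 limit=${2:-90}
  used=$(df -P "$mount" | awk 'NR==2 { sub(/%/, "", $5); print $5 }')
  if [ "$used" -ge "$limit" ]; then
    echo "WARN: $mount at ${used}% used (limit ${limit}%)"
  else
    echo "OK: $mount at ${used}% used"
  fi
}

disk_check /    # on the NAS, check /volume1 instead
```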
|
## 🔄 Recovery Procedures

### Container-Level Recovery

1. Stop the affected container
2. Back up configuration/data volumes if needed
3. Remove the container from Portainer
4. Redeploy from the Git source

### Service-Level Recovery

1. Verify compose file integrity in the Git repository
2. Confirm correct image tags
3. Redeploy using GitOps (Portainer auto-deploys on push)

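
Before pushing, it is worth validating the compose file so the GitOps redeploy does not pick up a broken definition. Where Docker is available, `docker compose -f <file> config -q` is the authoritative validator; the grep-based fallback below is a light sketch only (the demo file and service names are illustrative):

```bash
#!/bin/sh
# Light sanity check for a compose file (not a full schema validation;
# prefer: docker compose -f "$f" config -q).
compose_sane() {
  f=$1
  [ -s "$f" ] || { echo "FAIL: $f missing or empty"; return 1; }
  grep -q '^services:' "$f" || { echo "FAIL: no top-level services key in $f"; return 1; }
  echo "OK: $f looks like a compose file"
}

# Demo with an illustrative compose file:
cat > /tmp/demo-compose.yml <<'EOF'
services:
  web:
    image: nginx:1.27
EOF
compose_sane /tmp/demo-compose.yml
```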
|
### Data Recovery Steps

1. Identify the backup location based on service type:
   - Critical data: Cloud backups (Backblaze B2)
   - Local data: NAS storage backups (Hyper Backup)
   - Docker configs: Setillo replication via Syncthing

|
## 📊 Monitoring-Based Troubleshooting

### Uptime Kuma Alerts

When Uptime Kuma signals downtime:

1. Check service status in Portainer
2. Verify container logs for error messages
3. Review recent system changes or updates
4. Confirm the network is functional at multiple levels

|
### Grafana Dashboard Checks

Monitor these key metrics:

- CPU usage (target: <80%)
- Memory utilization (target: <70%)
- Disk space (must be >10% free)
- Network I/O bandwidth
- Container restart counts

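
These thresholds can be encoded as a small check that compares sampled values against the targets. The values below are hardcoded samples so the sketch runs anywhere; in practice, pull them from the Grafana/Prometheus API:

```bash
#!/bin/sh
# Compare sampled metrics (hardcoded here) against the dashboard targets.
check() { # usage: check <name> <value-pct> <limit-pct>
  if [ "$2" -ge "$3" ]; then
    echo "ALERT: $1 at $2% (limit $3%)"
  else
    echo "ok: $1 at $2%"
  fi
}

check cpu 45 80        # target <80%
check memory 72 70     # target <70%
check disk_used 85 90  # >10% free means <90% used
```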
|
## 🔧 Emergency Procedures

### 1. **Immediate Actions**

- Document the issue with timestamps
- Check Uptime Kuma and Grafana for context
- Contact team members if this affects shared access

|
### 2. **Service Restoration Process**

1. Identify the affected service(s)
2. Confirm availability of backups
3. Determine restoration priority (critical services first)
4. Execute the backup restore from the appropriate source
5. Monitor service status post-restoration
6. Validate functionality and notify users

|
### 3. **Communication Protocol**

- Send an ntfy notification to the team when:
  - A critical system is down for >10 minutes
  - Data loss is confirmed and recovery from backups is required
  - Restoration requires extended downtime

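
The ntfy notification itself is a single curl call. A sketch with a dry-run switch so it can be exercised offline (the server URL and topic are placeholders, not this homelab's actual endpoint; `Title` and `Priority` are standard ntfy publish headers):

```bash
#!/bin/sh
# Send a team alert via ntfy. NTFY_URL is a placeholder -- set your real topic URL.
NTFY_URL="${NTFY_URL:-https://ntfy.example.com/homelab-alerts}"

notify() {
  msg=$1
  if [ -n "$DRY_RUN" ]; then
    # Offline mode for testing: report instead of sending.
    echo "would send: $msg"
    return 0
  fi
  curl -s -H "Title: Homelab alert" -H "Priority: high" -d "$msg" "$NTFY_URL"
}

DRY_RUN=1 notify "vaultwarden down for >10 minutes, restoring from backup"
```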
|
## 📋 Diagnostic Checklist

Before starting troubleshooting, complete this checklist:

- [ ] Have recent changes been identified?
- [ ] Are all logs and error messages collected?
- [ ] Is network connectivity working at multiple levels?
- [ ] Can containers be restarted successfully?
- [ ] Are backups available for restoring data?
- [ ] What are the priority service impacts?

|
## 📚 Related Documentation

- [Disaster Recovery Guidelines](../infrastructure/monitoring/disaster-recovery.md)
- [Service Recovery Procedures](../infrastructure/backup-strategy.md)
- [Monitoring Stack Documentation](../infrastructure/monitoring/README.md)
- [Security Best Practices](../infrastructure/security.md)

---

*Last updated: 2026*