Sanitized mirror from private repository - 2026-04-06 21:14:57 UTC
docs/troubleshooting/comprehensive-troubleshooting.md
# 🔧 Comprehensive Infrastructure Troubleshooting Guide

This guide provides systematic approaches to diagnose and resolve common infrastructure issues across all homelab components. When encountering problems, follow this troubleshooting flow.

## 🔍 Troubleshooting Methodology

### 1. **Gather Information**

- Check service status in Portainer
- Review recent changes (Git commits)
- Collect error messages and logs
- Identify affected hosts/services

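### 2. **Check Service Status**

```bash
# On homelab VM
docker ps -a                # all containers, including stopped ones
docker stats --no-stream    # one-shot CPU/memory snapshot instead of a live view
docker compose ls --all     # Compose projects on this host (Portainer ships no CLI; use its web UI for stack details)
```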

### 3. **Verify Network Connectivity**

```bash
# Test connectivity to services
ping [host]
telnet [host] [port]
curl -v [service-url]
```
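
The manual checks above can be chained so that the first failing layer is named explicitly. The following is a minimal sketch, assuming a Linux host with bash, `getent`, `timeout`, and `curl`; the function names (`resolves`, `tcp_open`, `check_service`) are illustrative and not part of any existing tooling in this repository.

```bash
#!/usr/bin/env bash
# Layered reachability check: DNS, then TCP, then (optionally) HTTP.
# Host, port, and URL are supplied by the caller.

# Return 0 if the name (or IP literal) resolves via the system resolver.
resolves() {
  getent ahosts "$1" >/dev/null 2>&1
}

# Return 0 if a TCP connection to host:port succeeds within 3 seconds.
# Uses bash's built-in /dev/tcp, so telnet or nc need not be installed.
tcp_open() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Walk the layers in order and stop at the first failure, so the output
# names the layer to investigate.
check_service() {
  local host=$1 port=$2 url=${3:-}
  resolves "$host" || { echo "FAIL dns $host"; return 1; }
  tcp_open "$host" "$port" || { echo "FAIL tcp $host:$port"; return 2; }
  if [ -n "$url" ]; then
    curl -fsS -o /dev/null --max-time 5 "$url" || { echo "FAIL http $url"; return 3; }
  fi
  echo "OK $host:$port"
}
```

Call it as `check_service [host] [port] [service-url]`; the URL is optional. A DNS failure points at name resolution, a TCP failure at a firewall or a stopped container, and an HTTP failure at the application or the reverse proxy.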

### 4. **Review Logs and Metrics**

- Check Docker logs via Portainer or `docker logs`
- Review Grafana dashboards
- Monitor Uptime Kuma alerts

## 🚨 Common Issues and Solutions

### Authentication Problems

**Symptom**: Cannot access services like Portainer, Git, or Authentik

**Solution Steps**:

1. Verify correct credentials (check Vaultwarden)
2. Check Tailscale status (`tailscale status`)
3. Confirm DNS resolution works for service domains
4. Restart affected containers in Portainer

### Network Connectivity Issues

**Symptom**: Services unreachable from external networks or clients

**Common Causes**:

- Firewall rules blocking ports
- Incorrect Nginx Proxy Manager configuration
- Tailscale connectivity issues
- Cloudflare DNS propagation delays

**Troubleshooting Steps**:

1. Check Portainer for container running status
2. Verify host firewall settings (Synology DSM or UFW)
3. Test direct access to service ports via Tailscale network
4. Confirm NPM reverse proxy is correctly configured

### Container Failures

**Symptom**: Containers failing or crashing repeatedly

**Solution Steps**:

1. Check container logs (`docker logs [container-name]`)
2. Verify image versions (check for `:latest` tags)
3. Inspect volume mounts and data paths
4. Check resource limits/usage
5. Restart container in Portainer
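
The exit code often narrows the cause before the logs are read. The sketch below gathers the restart count, OOM flag, exit code, and recent log lines for one container. It assumes the Docker CLI is available on the host; `explain_exit_code` and `triage_container` are hypothetical helper names, and the code-to-cause mapping follows the usual shell convention of 128 plus the signal number.

```bash
#!/usr/bin/env bash

# Map a container exit code to a likely cause.
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly" ;;
    125) echo "docker run itself failed" ;;
    126) echo "container command could not be invoked" ;;
    127) echo "container command not found" ;;
    137) echo "killed by SIGKILL (often the OOM killer or a forced stop)" ;;
    139) echo "segmentation fault (SIGSEGV)" ;;
    143) echo "terminated by SIGTERM (normal stop request)" ;;
    *)   echo "application error, check the logs" ;;
  esac
}

# Summarise one container: restart count, OOM flag, exit code, last log lines.
triage_container() {
  local name=$1 code
  docker inspect --format 'restarts={{.RestartCount}} oom_killed={{.State.OOMKilled}}' "$name"
  code=$(docker inspect --format '{{.State.ExitCode}}' "$name")
  echo "exit code $code: $(explain_exit_code "$code")"
  docker logs --tail 50 --timestamps "$name"
}
```

Run `triage_container [container-name]` before restarting, since a restart resets the state being inspected. An `oom_killed=true` result or exit code 137 suggests checking resource limits (step 4) before anything else.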

### Backup Issues

**Symptom**: Backup failures or incomplete backups

**Troubleshooting Steps**:

1. Confirm backup task settings match documentation
2. Check Hyper Backup logs for specific errors
3. Verify network connectivity to destination storage
4. Review Backblaze B2 dashboard for errors
5. Validate local backup copy exists before cloud upload

### Storage Problems

**Symptom**: Low disk space, read/write failures

**Solution Steps**:

1. Check disk usage via Portainer or host shell

   ```bash
   df -h
   du -sh /volume1/docker/*
   ```

2. Identify large files or directories
3. Verify proper mount points and permissions
4. Check Synology volume health status (via DSM UI)
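
For step 2, sorting the `du` output makes the largest consumers obvious. This is a small sketch using only `du`, `sort`, and `head`; `largest_entries` is an illustrative name, and the glob skips hidden (dot) entries.

```bash
#!/usr/bin/env bash

# Print the N largest entries directly under a directory, biggest first.
# Sizes are reported in KiB (du -k) so a plain numeric sort orders them
# correctly regardless of locale or du implementation.
largest_entries() {
  local dir=${1:-.} count=${2:-10}
  du -sk -- "$dir"/* 2>/dev/null | sort -rn | head -n "$count"
}
```

For example, `largest_entries /volume1/docker 5` lists the five largest directories under the Docker data path used above. Run it again inside the top result to drill down.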

## 🔄 Recovery Procedures

### Container-Level Recovery

1. Stop affected container
2. Back up configuration/data volumes if needed
3. Remove container from Portainer
4. Redeploy from Git source
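
Step 2 can be as simple as archiving the container's bind-mounted data directory once the container is stopped. The sketch below assumes the data lives in a host directory (as with the `/volume1/docker/` layout referenced elsewhere in this guide) rather than in a named Docker volume; `snapshot_dir` and the paths in the usage note are hypothetical.

```bash
#!/usr/bin/env bash

# Archive a data directory to a timestamped tarball before the container
# is removed. Prints the archive path on success.
snapshot_dir() {
  local src=$1 dest=$2 stamp archive
  stamp=$(date +%Y%m%d-%H%M%S)
  archive="$dest/$(basename "$src")-$stamp.tar.gz"
  mkdir -p "$dest" || return 1
  # -C keeps paths inside the archive relative to the parent directory.
  tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
  echo "$archive"
}
```

A typical sequence would be `docker stop [container-name]` followed by `snapshot_dir /volume1/docker/[service] /volume1/backups/pre-recovery`, where the destination path is an assumption. Stopping the container first avoids archiving files that are mid-write.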

### Service-Level Recovery

1. Verify compose file integrity in Git repository
2. Confirm correct image tags
3. Redeploy using GitOps (Portainer auto-deploys on push)
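
Steps 1 and 2 can be checked before pushing. `docker compose -f [compose-file] config --quiet` validates the file's syntax, and the sketch below flags image references that are not pinned to a specific version. It is a simple line-based scan, not a YAML parser: it assumes each image is declared as `image: <ref>` on its own line, and `unpinned_images` is an illustrative name.

```bash
#!/usr/bin/env bash

# List image references in a compose file that are unpinned: tagged
# ":latest" or carrying no tag at all. Digest-pinned images are skipped.
unpinned_images() {
  awk -v sq="'" '
    $1 == "image:" {
      ref = $2
      gsub("[\"" sq "]", "", ref)      # strip surrounding quotes
      if (ref ~ /@sha256:/) next       # digest-pinned
      name = ref
      sub(/^.*\//, "", name)           # drop registry/namespace (may contain a port)
      if (name !~ /:/ || name ~ /:latest$/) print ref
    }
  ' "$1"
}
```

An empty result means every image carries an explicit tag or digest. Any line printed is a candidate for pinning, since an unpinned tag can pull a different image on redeploy than the one that was running before the failure.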

### Data Recovery Steps

1. Identify backup location based on service type:
   - Critical data: Cloud backups (Backblaze B2)
   - Local data: NAS storage backups (Hyper Backup)
   - Docker configs: Setillo replication via Syncthing

## 📊 Monitoring-Based Troubleshooting

### Uptime Kuma Alerts

When Uptime Kuma signals downtime:

1. Check service status in Portainer
2. Verify container logs for error messages
3. Review recent system changes or updates
4. Confirm network is functional at multiple levels

### Grafana Dashboard Checks

Monitor these key metrics:

- CPU usage (target: <80%)
- Memory utilization (target: <70%)
- Disk space (must be >10% free)
- Network I/O bandwidth
- Container restart counts
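
When Grafana itself is unavailable, the disk-space threshold can be checked from a shell. This sketch reads `df -P` output (the POSIX format, one line per filesystem) and reports mounts at or below the free-space threshold; `low_space_mounts` is a hypothetical helper, and mount points containing spaces are not handled.

```bash
#!/usr/bin/env bash

# Read `df -P` output on stdin and print every mount whose free space is
# at or below the given percentage (default 10, matching the list above).
low_space_mounts() {
  local min_free=${1:-10}
  awk -v min="$min_free" '
    NR > 1 {                            # skip the header row
      used = $5; sub(/%/, "", used)     # Capacity column, e.g. "95%"
      free = 100 - used
      if (free <= min) printf "%s %d%% free\n", $6, free
    }
  '
}
```

Use it as `df -P | low_space_mounts 10`. No output means every mounted filesystem has more than 10% free.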

## 🔧 Emergency Procedures

### 1. **Immediate Actions**

- Document the issue with timestamps
- Check Uptime Kuma and Grafana for context
- Contact team members if this affects shared access

### 2. **Service Restoration Process**

```
1. Identify affected service(s)
2. Confirm availability of backups
3. Determine restoration priority (critical services first)
4. Execute backup restore from appropriate source
5. Monitor service status post-restoration
6. Validate functionality and notify users
```

### 3. **Communication Protocol**

- Send ntfy notification to team when:
  - Critical system is down for >10 minutes
  - Data loss is confirmed through backups
  - Restoration requires extended downtime
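
A notification of this kind can be sent with a single HTTP POST to ntfy. The sketch below is one possible shape: the server URL and topic are placeholders (the real values are not stated in this guide), and `outage_needs_alert` and `send_alert` are illustrative names. The `Title`, `Priority`, and `Tags` headers are part of ntfy's publish API.

```bash
#!/usr/bin/env bash

NTFY_URL=${NTFY_URL:-https://ntfy.example.com}   # hypothetical server, override as needed
NTFY_TOPIC=${NTFY_TOPIC:-homelab-alerts}         # hypothetical topic name

# Return 0 when an outage has lasted longer than the 10-minute threshold.
outage_needs_alert() {
  local down_minutes=$1
  [ "$down_minutes" -gt 10 ]
}

# Publish a message through ntfy's HTTP API (the POST body is the message).
send_alert() {
  local title=$1 message=$2
  curl -fsS --max-time 10 \
    -H "Title: $title" \
    -H "Priority: urgent" \
    -H "Tags: warning" \
    -d "$message" \
    "$NTFY_URL/$NTFY_TOPIC"
}
```

For example: `outage_needs_alert 15 && send_alert "Service down" "[service] unreachable for 15 minutes"`. Keeping the threshold check separate from the send makes it easy to reuse `send_alert` for the other two conditions.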

## 📋 Diagnostic Checklist

Before starting troubleshooting, complete this checklist:

- [ ] Have recent changes been identified?
- [ ] Are all logs and error messages collected?
- [ ] Is network connectivity working at multiple levels?
- [ ] Can containers be restarted successfully?
- [ ] Are backups available for restoring data?
- [ ] Have the priority service impacts been identified?

## 📚 Related Documentation

- [Disaster Recovery Guidelines](../infrastructure/monitoring/disaster-recovery.md)
- [Service Recovery Procedures](../infrastructure/backup-strategy.md)
- [Monitoring Stack Documentation](../infrastructure/monitoring/README.md)
- [Security Best Practices](../infrastructure/security.md)

---

*Last updated: 2026*