Sanitized mirror from private repository - 2026-04-06 21:14:57 UTC

2026-04-06 21:14:57 +00:00
commit d3fa5d354a
1415 changed files with 359812 additions and 0 deletions
--- a/docs/troubleshooting/comprehensive-troubleshooting.md
+++ b/docs/troubleshooting/comprehensive-troubleshooting.md
@@ -0,0 +1,166 @@
+# 🔧 Comprehensive Infrastructure Troubleshooting Guide
+
+This guide provides systematic approaches to diagnose and resolve common infrastructure issues across all homelab components. When encountering problems, follow this troubleshooting flow.
+
+## 🔍 Troubleshooting Methodology
+
+### 1. **Gather Information**
+- Check service status in Portainer
+- Review recent changes (Git commits)
+- Collect error messages and logs
+- Identify affected hosts/services
+
+### 2. **Check Service Status**
+```bash
+# On homelab VM
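+# List all containers, including stopped ones
+docker ps -a
+# One-shot resource snapshot (without --no-stream this runs until interrupted)
+docker stats --no-stream
+# Portainer ships no `portainer` CLI; list Compose projects (stacks) with Docker instead
+docker compose ls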
+```
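+
+To narrow the container list to entries that need attention, a small filter along these lines may help. This is a sketch: the function name `not_running` is illustrative, and the usage line assumes a Docker version whose `docker ps --format` supports the `.State` field.
+
+```bash
+# Print "name: status" for every container whose state is not "running".
+# Reads tab-separated "name<TAB>state<TAB>status" lines on stdin.
+not_running() {
+  awk -F '\t' '$2 != "running" { print $1 ": " $3 }'
+}
+
+# Usage (requires docker):
+#   docker ps -a --format '{{.Names}}\t{{.State}}\t{{.Status}}' | not_running
+```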
+
+### 3. **Verify Network Connectivity**
+```bash
+# Test connectivity to services
+ping [host]
+telnet [host] [port]
+curl -v [service-url]
+```
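+
+On hosts where `telnet` is not installed, a TCP check can be sketched with bash alone. The helper name `port_open` and the hostname in the usage line are placeholders, and the sketch assumes `timeout` from coreutils is available.
+
+```bash
+# Return 0 if a TCP connection to host:port succeeds within 3 seconds.
+# Uses bash's /dev/tcp redirection, so neither telnet nor nc is required.
+port_open() {
+  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
+}
+
+# Usage (hostname is a placeholder):
+#   port_open nas.example.lan 443 && echo "port reachable" || echo "port closed or filtered"
+```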
+
+### 4. **Review Logs and Metrics**
+- Check Docker logs via Portainer or `docker logs`
+- Review Grafana dashboards
+- Monitor Uptime Kuma alerts
+
+## 🚨 Common Issues and Solutions
+
+### Authentication Problems
+**Symptom**: Cannot access services like Portainer, Git, or Authentik  
+**Solution Steps**:
+1. Verify correct credentials (check Vaultwarden)
+2. Check Tailscale status (`tailscale status`)
+3. Confirm DNS resolution works for service domains 
+4. Restart affected containers in Portainer
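+
+Steps 2 and 3 can be checked from a shell. The sketch below assumes `getent` exists on the host (it may not on every NAS firmware); the helper name `resolves` and the hostname are placeholders.
+
+```bash
+# Return 0 if the name resolves through the system resolver (/etc/hosts and DNS).
+resolves() {
+  getent hosts "$1" >/dev/null 2>&1
+}
+
+# Usage (hostname is a placeholder):
+#   tailscale status          # confirm this host and its peers are listed
+#   resolves portainer.example.lan || echo "DNS lookup failed"
+```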
+
+### Network Connectivity Issues
+**Symptom**: Services unreachable from external networks or clients  
+**Common Causes**:
+- Firewall rules blocking ports
+- Incorrect Nginx Proxy Manager configuration
+- Tailscale connectivity issues
+- Cloudflare DNS propagation delays  
+
+**Troubleshooting Steps**:
+1. Check Portainer for container running status
+2. Verify host firewall settings (Synology DSM or UFW)
+3. Test direct access to service ports via Tailscale network
+4. Confirm NPM reverse proxy is correctly configured
+
+### Container Failures
+**Symptom**: Containers failing or crashing repeatedly  
+**Solution Steps**:
+1. Check container logs (`docker logs [container-name]`)
+2. Verify image versions (a `:latest` tag may have pulled a newer, incompatible image)
+3. Inspect volume mounts and data paths
+4. Check resource limits/usage
+5. Restart container in Portainer
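+
+Alongside the logs, the exit code often points at the cause. The helper below is a rough sketch (the name `explain_exit` and the container name `myapp` are illustrative); it relies on the `.State.ExitCode` and `.State.OOMKilled` fields reported by `docker inspect`.
+
+```bash
+# Translate a container's exit code and OOMKilled flag into a likely cause.
+# $1 = exit code, $2 = "true" or "false" (OOMKilled)
+explain_exit() {
+  local code=$1 oom=$2
+  if [ "$oom" = "true" ]; then echo "killed by the OOM killer: raise the memory limit or reduce usage"
+  elif [ "$code" -eq 137 ]; then echo "SIGKILL (128+9): forced stop or host-level OOM"
+  elif [ "$code" -eq 143 ]; then echo "SIGTERM (128+15): stopped on request"
+  elif [ "$code" -eq 127 ]; then echo "command not found: check the image tag and entrypoint"
+  elif [ "$code" -eq 126 ]; then echo "command not executable: check file permissions"
+  elif [ "$code" -eq 0 ]; then echo "clean exit: the main process finished; check the restart policy"
+  else echo "application error (exit $code): read the container logs"
+  fi
+}
+
+# Usage (requires docker; "myapp" is a placeholder):
+#   explain_exit $(docker inspect --format '{{.State.ExitCode}} {{.State.OOMKilled}}' myapp)
+```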
+
+### Backup Issues
+**Symptom**: Backup failures or incomplete backups  
+**Troubleshooting Steps**:
+1. Confirm backup task settings match documentation 
+2. Check Hyper Backup logs for specific errors
+3. Verify network connectivity to destination storage
+4. Review Backblaze B2 dashboard for errors
+5. Validate local backup copy exists before cloud upload
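+
+Step 5 can be approximated by checking that the local backup directory contains a recently modified file. This is only a sketch: the helper name `has_recent_backup` and the path in the usage line are assumptions, and file age does not prove the backup is complete or restorable.
+
+```bash
+# Return 0 if the directory holds at least one file modified within the last N days.
+# $1 = backup directory, $2 = N (default 1)
+has_recent_backup() {
+  [ -n "$(find "$1" -type f -mtime "-${2:-1}" -print 2>/dev/null | head -n 1)" ]
+}
+
+# Usage (path is a placeholder):
+#   has_recent_backup /volume1/backups 1 || echo "no local backup from the last 24 hours"
+```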
+
+### Storage Problems
+**Symptom**: Low disk space, read/write failures  
+**Solution Steps**:
+1. Check disk usage via Portainer or host shell
+   ```bash
+   df -h
+   du -sh /volume1/docker/*
+   ```
+2. Identify large files or directories
+3. Verify proper mount points and permissions  
+4. Check Synology volume health status (via DSM UI)
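+
+To flag nearly full filesystems automatically, the `df` output can be filtered. A minimal sketch, assuming POSIX-format output from `df -P` and mount points without spaces; the function name `low_space` is illustrative.
+
+```bash
+# Read `df -P` output on stdin and print mounts whose usage exceeds a limit.
+# $1 = maximum allowed use percentage (default 90, i.e. at least 10% free)
+low_space() {
+  awk -v max="${1:-90}" 'NR > 1 { use = $5; sub(/%/, "", use); if (use + 0 > max) print $6 " is " $5 " full" }'
+}
+
+# Usage:
+#   df -P | low_space 90
+```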
+
+## 🔄 Recovery Procedures
+
+### Container-Level Recovery
+1. Stop affected container
+2. Back up configuration/data volumes if needed
+3. Remove the container in Portainer
+4. Redeploy from Git source
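+
+Steps 1 to 3 might look like this from a shell. The helper `backup_dir`, the container name `myapp`, and both paths are placeholders; adjust them to the actual bind-mount layout.
+
+```bash
+# Archive a container's data directory before the container is removed.
+# $1 = data directory, $2 = directory to write the archive into.
+# Prints the path of the archive on success.
+backup_dir() {
+  local src=$1 dest=$2 name
+  name="$(basename "$src")-$(date +%Y%m%d-%H%M%S).tar.gz"
+  mkdir -p "$dest" &&
+    tar -czf "$dest/$name" -C "$(dirname "$src")" "$(basename "$src")" &&
+    echo "$dest/$name"
+}
+
+# Usage (names and paths are placeholders):
+#   docker stop myapp
+#   backup_dir /volume1/docker/myapp /volume1/backups
+#   docker rm myapp
+```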
+
+### Service-Level Recovery
+1. Verify compose file integrity in Git repository
+2. Confirm correct image tags
+3. Redeploy using GitOps (Portainer auto-deploys on push)
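+
+Before pushing, the compose file can be validated and scanned for unpinned images. The scan below is a naive, line-based sketch (the name `unpinned_images` is illustrative): it does not resolve variables or YAML anchors, and it skips images pinned by digest.
+
+```bash
+# Print image references in a compose file that are untagged or use :latest.
+# $1 = path to the compose file
+unpinned_images() {
+  grep -E '^[[:space:]]*image:' "$1" | awk '{ print $2 }' | tr -d "\"'" |
+    awk '{
+      if ($0 ~ /@sha256:/) next          # pinned by digest
+      ref = $0; sub(/.*\//, "", ref)     # drop registry/namespace (may contain a port)
+      if (ref !~ /:/ || ref ~ /:latest$/) print $0
+    }'
+}
+
+# Usage:
+#   docker compose -f docker-compose.yml config --quiet   # syntax check
+#   unpinned_images docker-compose.yml
+```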
+
+### Data Recovery Steps
+1. Identify backup location based on service type:
+   - Critical data: Cloud backups (Backblaze B2)
+   - Local data: NAS storage backups (Hyper Backup)
+   - Docker configs: Setillo replication via Syncthing
+
+## 📊 Monitoring-Based Troubleshooting
+
+### Uptime Kuma Alerts
+When Uptime Kuma signals downtime:
+1. Check service status in Portainer
+2. Verify container logs for error messages
+3. Review recent system changes or updates
+4. Confirm network is functional at multiple levels
+
+### Grafana Dashboard Checks
+Monitor these key metrics:
+- CPU usage (target: <80%)
+- Memory utilization (target: <70%)  
+- Disk space (must be >10% free)
+- Network I/O bandwidth
+- Container restart counts
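+
+Restart counts can also be read directly from Docker when Grafana is unavailable. A sketch, assuming the `docker` CLI on the host; the function name `restart_loops` and the default threshold of 3 are assumptions, not values from this guide.
+
+```bash
+# Read "name restart_count" pairs on stdin and print containers above a threshold.
+# $1 = maximum tolerated restarts (default 3)
+restart_loops() {
+  awk -v max="${1:-3}" '$2 + 0 > max { print $1 " restarted " $2 " times" }'
+}
+
+# Usage (requires docker):
+#   docker ps -aq | xargs -r docker inspect --format '{{.Name}} {{.RestartCount}}' | restart_loops 3
+```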
+
+## 🔧 Emergency Procedures
+
+### 1. **Immediate Actions**
+- Document the issue with timestamps
+- Check Uptime Kuma and Grafana for context
+- Contact team members if this affects shared access
+
+### 2. **Service Restoration Process**
+1. Identify the affected service(s)
+2. Confirm availability of backups
+3. Determine restoration priority (critical services first)
+4. Execute backup restore from appropriate source
+5. Monitor service status post-restoration
+6. Validate functionality and notify users
+
+### 3. **Communication Protocol**
+- Send ntfy notification to team when:
+  - Critical system is down for >10 minutes
+  - Data loss is confirmed after checking backups
+  - Restoration requires extended downtime
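+
+The 10-minute rule can be scripted roughly as follows. `NTFY_URL` and `started_at` are placeholders for the team's topic URL and the outage start time; `Title` and `Priority` are standard ntfy publish headers. The helper name is illustrative.
+
+```bash
+# Return 0 when an outage has lasted longer than 10 minutes (600 seconds).
+# $1 = epoch seconds when the outage began, $2 = current epoch seconds (default: now)
+outage_exceeds_threshold() {
+  local start=$1 now=${2:-$(date +%s)}
+  [ $(( now - start )) -gt 600 ]
+}
+
+# Usage (NTFY_URL and started_at are placeholders):
+#   outage_exceeds_threshold "$started_at" &&
+#     curl -H "Title: Service down" -H "Priority: urgent" \
+#          -d "Portainer has been down for more than 10 minutes" "$NTFY_URL"
+```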
+
+## 📋 Diagnostic Checklist
+
+Before starting troubleshooting, complete this checklist:
+
+□ Have recent changes been identified?  
+□ Are all logs and error messages collected?  
+□ Is network connectivity working at multiple levels?  
+□ Can containers be restarted successfully?  
+□ Are backups available for restoring data?  
+□ What are the priority service impacts?  
+
+## 📚 Related Documentation
+
+- [Disaster Recovery Guidelines](../infrastructure/monitoring/disaster-recovery.md)
+- [Service Recovery Procedures](../infrastructure/backup-strategy.md)  
+- [Monitoring Stack Documentation](../infrastructure/monitoring/README.md)
+- [Security Best Practices](../infrastructure/security.md)
+
+---
+*Last updated: 2026*