# 🔧 Comprehensive Infrastructure Troubleshooting Guide

This guide provides systematic approaches to diagnose and resolve common infrastructure issues across all homelab components. When encountering problems, follow this troubleshooting flow.

## 🔍 Troubleshooting Methodology

### 1. **Gather Information**

- Check service status in Portainer
- Review recent changes (Git commits)
- Collect error messages and logs
- Identify affected hosts/services

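
When gathering information, it helps to capture everything into one timestamped directory so later steps all reference the same snapshot. A minimal sketch (the output path and repo location are illustrative, not from this setup; the docker/git lines are commented out because they only apply on the homelab VM):

```bash
#!/bin/sh
# Collect a timestamped diagnostic bundle (illustrative paths).
outdir="/tmp/diag-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$outdir"
uname -a > "$outdir/host.txt"
date -u > "$outdir/collected-at.txt"
# On the homelab VM you would also capture, for example:
# docker ps -a > "$outdir/containers.txt"
# git -C /path/to/infra-repo log --oneline -10 > "$outdir/recent-changes.txt"
echo "diagnostics written to $outdir"
```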
### 2. **Check Service Status**

```bash
# On the homelab VM
docker ps -a
docker stats --no-stream
# Review stack status in the Portainer UI (Stacks view);
# Portainer does not ship a "portainer stacks list" CLI.
```

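
To spot stopped containers quickly in the `docker ps -a` output, a small filter helps. This sketch runs against hardcoded sample output (the container names are hypothetical) so it is self-contained; in practice, pipe `docker ps -a --format '{{.Names}} {{.State}}'` into it:

```bash
#!/bin/sh
# Print containers whose state is not "running".
# Real usage: docker ps -a --format '{{.Names}} {{.State}}' | flag_stopped
flag_stopped() {
  awk '$2 != "running" { print "NOT RUNNING: " $1 }'
}

# Sample input (hypothetical container names):
printf '%s\n' \
  'nginx-proxy-manager running' \
  'uptime-kuma exited' \
  'grafana running' | flag_stopped
```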
### 3. **Verify Network Connectivity**

```bash
# Test connectivity to services
ping [host]
telnet [host] [port]   # or, where telnet is unavailable: nc -zv [host] [port]
curl -v [service-url]
```

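
Beyond telnet and curl, a scriptable port probe is handy for checking several services in one pass. A sketch using bash's built-in `/dev/tcp` redirection (the host and port below are placeholders; substitute your real services):

```bash
#!/bin/bash
# Probe a TCP port without telnet/nc, via bash's /dev/tcp redirection.
check_port() {
  host=$1 port=$2
  if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "open:   $host:$port"
  else
    echo "closed: $host:$port"
  fi
}

# Placeholder target -- substitute your real hosts/ports:
check_port 127.0.0.1 80
```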
### 4. **Review Logs and Metrics**

- Check Docker logs via Portainer or `docker logs`
- Review Grafana dashboards
- Monitor Uptime Kuma alerts

## 🚨 Common Issues and Solutions

### Authentication Problems

**Symptom**: Cannot access services like Portainer, Git, or Authentik

**Solution Steps**:

1. Verify correct credentials (check Vaultwarden)
2. Check Tailscale status (`tailscale status`)
3. Confirm DNS resolution works for service domains
4. Restart affected containers in Portainer

|
### Network Connectivity Issues

**Symptom**: Services unreachable from external networks or clients

**Common Causes**:

- Firewall rules blocking ports
- Incorrect Nginx Proxy Manager configuration
- Tailscale connectivity issues
- Cloudflare DNS propagation delays

**Troubleshooting Steps**:

1. Check Portainer for container running status
2. Verify host firewall settings (Synology DSM or UFW)
3. Test direct access to service ports via the Tailscale network
4. Confirm the NPM reverse proxy is correctly configured

|
### Container Failures

**Symptom**: Containers failing or crashing repeatedly

**Solution Steps**:

1. Check container logs (`docker logs [container-name]`)
2. Verify image versions (check for `:latest` tags)
3. Inspect volume mounts and data paths
4. Check resource limits/usage
5. Restart the container in Portainer

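
Step 2 (hunting for `:latest` tags) can be automated with a quick scan of the compose file. A sketch against a sample file (the file path and image names are hypothetical); note it does not handle registry hosts that include a port, such as `registry:5000/app`:

```bash
#!/bin/sh
# Flag compose images that are untagged or pinned to :latest.
# Sample compose file so the sketch is self-contained (hypothetical services):
cat > /tmp/compose-sample.yml <<'EOF'
services:
  proxy:
    image: nginx:latest
  db:
    image: postgres:16.2
  cache:
    image: redis
EOF

awk '/^[[:space:]]*image:/ {
  n = split($2, p, ":")
  if (n == 1 || p[n] == "latest") print "unpinned: " $2
}' /tmp/compose-sample.yml
```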
|
### Backup Issues

**Symptom**: Backup failures or incomplete backups

**Troubleshooting Steps**:

1. Confirm backup task settings match documentation
2. Check Hyper Backup logs for specific errors
3. Verify network connectivity to destination storage
4. Review the Backblaze B2 dashboard for errors
5. Validate the local backup copy exists before cloud upload

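
A quick way to confirm a backup actually ran is to check the age of the newest file at the destination. A sketch (the demo path and the 26-hour window are illustrative assumptions; `find -printf` is GNU-specific):

```bash
#!/bin/sh
# Warn if the newest file under a backup path is older than max_hours.
backup_fresh() {
  dir=$1 max_hours=${2:-26}
  newest=$(find "$dir" -type f -printf '%T@\n' 2>/dev/null | sort -n | tail -1)
  if [ -z "$newest" ]; then
    echo "WARN: no files found in $dir"
    return 1
  fi
  age_h=$(( ( $(date +%s) - ${newest%.*} ) / 3600 ))
  if [ "$age_h" -gt "$max_hours" ]; then
    echo "WARN: newest backup in $dir is ${age_h}h old"
  else
    echo "OK: newest backup in $dir is ${age_h}h old"
  fi
}

# Demo against a freshly created file (illustrative path):
mkdir -p /tmp/backup-demo && touch /tmp/backup-demo/latest.bak
backup_fresh /tmp/backup-demo
```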
|
### Storage Problems

**Symptom**: Low disk space, read/write failures

**Solution Steps**:

1. Check disk usage via Portainer or the host shell

   ```bash
   df -h
   du -sh /volume1/docker/*
   ```

2. Identify large files or directories
3. Verify proper mount points and permissions
4. Check Synology volume health status (via DSM UI)

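
Step 1 can be turned into a threshold check suitable for a cron job. A sketch (the 90% limit is an assumption; adjust per volume, and point it at `/volume1` on the NAS):

```bash
#!/bin/sh
# Warn when a filesystem's usage crosses a threshold (default 90%).
disk_check() {
  mount=$1 limit=${2:-90}
  used=$(df -P "$mount" | awk 'NR==2 { sub(/%/, "", $5); print $5 }')
  if [ "$used" -ge "$limit" ]; then
    echo "WARN: $mount at ${used}% used (limit ${limit}%)"
  else
    echo "OK: $mount at ${used}% used"
  fi
}

disk_check /    # on the NAS, check /volume1 instead
```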
|
## 🔄 Recovery Procedures

### Container-Level Recovery

1. Stop the affected container
2. Back up configuration/data volumes if needed
3. Remove the container from Portainer
4. Redeploy from the Git source

### Service-Level Recovery

1. Verify compose file integrity in the Git repository
2. Confirm correct image tags
3. Redeploy using GitOps (Portainer auto-deploys on push)

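
Before pushing, it is worth validating the compose file so the GitOps redeploy does not pick up a broken definition. Where Docker is available, `docker compose -f <file> config -q` is the authoritative validator; the grep-based fallback below is a light sketch only (the demo file and service names are illustrative):

```bash
#!/bin/sh
# Light sanity check for a compose file (not a full schema validation;
# prefer: docker compose -f "$f" config -q).
compose_sane() {
  f=$1
  [ -s "$f" ] || { echo "FAIL: $f missing or empty"; return 1; }
  grep -q '^services:' "$f" || { echo "FAIL: no top-level services key in $f"; return 1; }
  echo "OK: $f looks like a compose file"
}

# Demo with an illustrative compose file:
cat > /tmp/demo-compose.yml <<'EOF'
services:
  web:
    image: nginx:1.27
EOF
compose_sane /tmp/demo-compose.yml
```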
|
### Data Recovery Steps

1. Identify the backup location based on service type:
   - Critical data: Cloud backups (Backblaze B2)
   - Local data: NAS storage backups (Hyper Backup)
   - Docker configs: Setillo replication via Syncthing

|
## 📊 Monitoring-Based Troubleshooting

### Uptime Kuma Alerts

When Uptime Kuma signals downtime:

1. Check service status in Portainer
2. Verify container logs for error messages
3. Review recent system changes or updates
4. Confirm the network is functional at multiple levels

|
### Grafana Dashboard Checks

Monitor these key metrics:

- CPU usage (target: <80%)
- Memory utilization (target: <70%)
- Disk space (must be >10% free)
- Network I/O bandwidth
- Container restart counts

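
These thresholds can be encoded as a small check that compares sampled values against the targets. The values below are hardcoded samples so the sketch runs anywhere; in practice, pull them from the Grafana/Prometheus API:

```bash
#!/bin/sh
# Compare sampled metrics (hardcoded here) against the dashboard targets.
check() { # usage: check <name> <value-pct> <limit-pct>
  if [ "$2" -ge "$3" ]; then
    echo "ALERT: $1 at $2% (limit $3%)"
  else
    echo "ok: $1 at $2%"
  fi
}

check cpu 45 80        # target <80%
check memory 72 70     # target <70%
check disk_used 85 90  # >10% free means <90% used
```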
|
## 🔧 Emergency Procedures

### 1. **Immediate Actions**

- Document the issue with timestamps
- Check Uptime Kuma and Grafana for context
- Contact team members if this affects shared access

|
### 2. **Service Restoration Process**

1. Identify the affected service(s)
2. Confirm availability of backups
3. Determine restoration priority (critical services first)
4. Execute the backup restore from the appropriate source
5. Monitor service status post-restoration
6. Validate functionality and notify users

|
### 3. **Communication Protocol**

- Send an ntfy notification to the team when:
  - A critical system is down for >10 minutes
  - Data loss is confirmed and recovery from backups is required
  - Restoration requires extended downtime

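
The ntfy notification itself is a single curl call. A sketch with a dry-run switch so it can be exercised offline (the server URL and topic are placeholders, not this homelab's actual endpoint; `Title` and `Priority` are standard ntfy publish headers):

```bash
#!/bin/sh
# Send a team alert via ntfy. NTFY_URL is a placeholder -- set your real topic URL.
NTFY_URL="${NTFY_URL:-https://ntfy.example.com/homelab-alerts}"

notify() {
  msg=$1
  if [ -n "$DRY_RUN" ]; then
    # Offline mode for testing: report instead of sending.
    echo "would send: $msg"
    return 0
  fi
  curl -s -H "Title: Homelab alert" -H "Priority: high" -d "$msg" "$NTFY_URL"
}

DRY_RUN=1 notify "vaultwarden down for >10 minutes, restoring from backup"
```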
|
## 📋 Diagnostic Checklist

Before starting troubleshooting, complete this checklist:

- [ ] Have recent changes been identified?
- [ ] Are all logs and error messages collected?
- [ ] Is network connectivity working at multiple levels?
- [ ] Can containers be restarted successfully?
- [ ] Are backups available for restoring data?
- [ ] What are the priority service impacts?

|
## 📚 Related Documentation

- [Disaster Recovery Guidelines](../infrastructure/monitoring/disaster-recovery.md)
- [Service Recovery Procedures](../infrastructure/backup-strategy.md)
- [Monitoring Stack Documentation](../infrastructure/monitoring/README.md)
- [Security Best Practices](../infrastructure/security.md)

---

*Last updated: 2026*