🔧 Comprehensive Infrastructure Troubleshooting Guide
This guide provides systematic approaches to diagnose and resolve common infrastructure issues across all homelab components. When encountering problems, follow this troubleshooting flow.
🔍 Troubleshooting Methodology
1. Gather Information
- Check service status in Portainer
- Review recent changes (Git commits)
- Collect error messages and logs
- Identify affected hosts/services
2. Check Service Status
# On homelab VM
docker ps -a
docker stats
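# There is no portainer CLI command for listing stacks; with Compose v2 installed,
# Compose-based stacks should show up as projects here (or check the Portainer UI)
docker compose ls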
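When many containers are running, the full docker ps -a listing can bury the few that need attention. The sketch below filters that listing down to containers whose status does not start with "Up". The function name not_running and the container names in the usage comment are illustrative only.

```shell
# Print the names of containers that are not in an "Up ..." state.
# Reads "name<TAB>status" lines on stdin, so the filter works on saved output too.
not_running() {
  awk -F '\t' '$2 !~ /^Up / { print $1 }'
}

# Typical use on the Docker host:
#   docker ps -a --format '{{.Names}}\t{{.Status}}' | not_running
```

Exited, Created, and Restarting containers are all reported, which is usually the set worth checking first.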
3. Verify Network Connectivity
# Test connectivity to services
ping [host]
telnet [host] [port]
curl -v [service-url]
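The curl exit status already hints at which layer failed (DNS, TCP, TLS), which saves reading the full -v output. A minimal sketch, assuming the documented curl exit codes 6, 7, 28, 35, and 60; the helper name explain_curl_exit and the hostname in the usage comment are placeholders.

```shell
# Map a curl exit status to the layer that most likely failed.
explain_curl_exit() {
  case "$1" in
    0)     echo "ok: service answered" ;;
    6)     echo "dns: hostname did not resolve" ;;
    7)     echo "tcp: connection refused or host unreachable" ;;
    28)    echo "timeout: no answer in time" ;;
    35|60) echo "tls: handshake or certificate problem" ;;
    *)     echo "other: curl exit $1" ;;
  esac
}

# Typical use (service.example.internal is a placeholder):
#   curl -sS -o /dev/null --max-time 5 https://service.example.internal
#   explain_curl_exit $?
```

A dns result points at Cloudflare or local DNS, tcp at firewall rules or a stopped container, and tls at the reverse proxy certificate.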
4. Review Logs and Metrics
- Check Docker logs via Portainer or docker logs
- Review Grafana dashboards
- Monitor Uptime Kuma alerts
🚨 Common Issues and Solutions
Authentication Problems
Symptom: Cannot access services like Portainer, Git, or Authentik
Solution Steps:
- Verify correct credentials (check Vaultwarden)
- Check Tailscale status (tailscale status)
- Confirm DNS resolution works for service domains
- Restart affected containers in Portainer
Network Connectivity Issues
Symptom: Services unreachable from external networks or clients
Common Causes:
- Firewall rules blocking ports
- Incorrect Nginx Proxy Manager configuration
- Tailscale connectivity issues
- Cloudflare DNS propagation delays
Troubleshooting Steps:
- Confirm in Portainer that the container is running
- Verify host firewall settings (Synology DSM or UFW)
- Test direct access to service ports via Tailscale network
- Confirm NPM reverse proxy is correctly configured
Container Failures
Symptom: Containers failing or crashing repeatedly
Solution Steps:
- Check container logs (docker logs [container-name])
- Verify image versions (check for :latest tags)
- Inspect volume mounts and data paths
- Check resource limits/usage
- Restart container in Portainer
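Checking for :latest tags by eye is error-prone, because an image reference with no tag at all also resolves to latest. The following sketch flags both cases and skips images pinned by digest. The function name floating_tags is an assumption, not an existing tool.

```shell
# Print image references that float: explicit :latest or no tag at all.
# Reads one image reference per line on stdin.
floating_tags() {
  awk '
    /@sha256:/ { next }                 # pinned by digest, not floating
    {
      n = split($0, parts, "/")         # the tag, if any, lives in the last path segment
      last = parts[n]
      if (last !~ /:/ || last ~ /:latest$/) print $0
    }'
}

# Typical use on the Docker host:
#   docker ps --format '{{.Image}}' | sort -u | floating_tags
```

Splitting on "/" first keeps a registry port such as registry.local:5000 from being mistaken for a tag.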
Backup Issues
Symptom: Backup failures or incomplete backups
Troubleshooting Steps:
- Confirm backup task settings match documentation
- Check Hyper Backup logs for specific errors
- Verify network connectivity to destination storage
- Review Backblaze B2 dashboard for errors
- Validate local backup copy exists before cloud upload
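The last step, confirming that a local copy exists, can be sketched as a freshness check. This assumes the local backup lands as ordinary files under a directory; the function name has_recent_backup and the path in the usage comment are hypothetical, and the one-day window is only a default.

```shell
# Succeed if DIR contains at least one file modified within the last DAYS days.
has_recent_backup() {
  dir=$1
  days=${2:-1}
  [ -n "$(find "$dir" -type f -mtime "-$days" 2>/dev/null | head -n 1)" ]
}

# Typical use (the path is a placeholder for the local backup target):
#   has_recent_backup /volume1/backups/local 1 || echo "no fresh local backup"
```

This only shows that something was written recently; it does not verify that the backup is complete or restorable.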
Storage Problems
Symptom: Low disk space, read/write failures
Solution Steps:
- Check disk usage via Portainer or host shell
df -h
du -sh /volume1/docker/*
- Identify large files or directories
- Verify proper mount points and permissions
- Check Synology volume health status (via DSM UI)
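To turn the df output into a quick pass/fail check, the sketch below flags any filesystem at or above a usage threshold. The default of 90% used mirrors the "more than 10% free" target in this guide. The function name low_space is illustrative, and mount points containing spaces are not handled.

```shell
# Print mount points whose usage is at or above MAX_USED percent (default 90).
# Expects POSIX-format df output (df -P) on stdin.
low_space() {
  max_used=${1:-90}
  awk -v max="$max_used" '
    NR > 1 {                       # skip the header row
      used = $5
      sub(/%/, "", used)           # "95%" -> "95"
      if (used + 0 >= max) print $6, used "%"
    }'
}

# Typical use on the host:
#   df -P | low_space 90
```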
🔄 Recovery Procedures
Container-Level Recovery
- Stop affected container
- Back up configuration/data volumes if needed
- Remove container from Portainer
- Redeploy from Git source
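The first three steps can be sketched as a small dry-run helper. It prints the commands by default and only executes them when DRY_RUN=0 is set. The data directory layout (/volume1/docker/<name>) and the /tmp archive location are assumptions for illustration; adjust both to the actual volume paths.

```shell
# Print a command (default) or execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi
}

# Stop a container, archive its data directory, then remove the container.
# Each step runs only if the previous one succeeded.
recover_container() {
  name=$1
  data_dir=${2:-/volume1/docker/$name}   # assumed layout
  run docker stop "$name" &&
  run tar -czf "/tmp/${name}-backup.tgz" -C "$data_dir" . &&
  run docker rm "$name"
  # Redeploy is left to the Git-based workflow (Portainer redeploys from the repository).
}

# Dry run:      recover_container my-service
# Real run:     DRY_RUN=0 recover_container my-service
```

Defaulting to a dry run makes it harder to remove a container before confirming that the archive path and data directory are correct.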
Service-Level Recovery
- Verify compose file integrity in Git repository
- Confirm correct image tags
- Redeploy using GitOps (Portainer auto-deploys on push)
Data Recovery Steps
- Identify backup location based on service type:
- Critical data: Cloud backups (Backblaze B2)
- Local data: NAS storage backups (Hyper Backup)
- Docker configs: Setillo replication via Syncthing
📊 Monitoring-Based Troubleshooting
Uptime Kuma Alerts
When Uptime Kuma signals downtime:
- Check service status in Portainer
- Verify container logs for error messages
- Review recent system changes or updates
- Confirm network is functional at multiple levels
Grafana Dashboard Checks
Monitor these key metrics:
- CPU usage (target: <80%)
- Memory utilization (target: <70%)
- Disk space (must be >10% free)
- Network I/O bandwidth
- Container restart counts
🔧 Emergency Procedures
1. Immediate Actions
- Document the issue with timestamps
- Check Uptime Kuma and Grafana for context
- Contact team members if this affects shared access
2. Service Restoration Process
1. Identify the affected service(s)
2. Confirm availability of backups
3. Determine restoration priority (critical services first)
4. Execute backup restore from appropriate source
5. Monitor service status post-restoration
6. Validate functionality and notify users
3. Communication Protocol
- Send ntfy notification to team when:
- Critical system is down for >10 minutes
- Data loss is confirmed after checking backups
- Restoration requires extended downtime
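The 10-minute rule can be sketched as a small check plus an ntfy publish call. ntfy accepts a plain HTTP POST with the message as the body and optional Title and Priority headers. The server URL, the topic name, and both function names below are placeholders, not values from this setup.

```shell
# Placeholder ntfy endpoint; replace with the real server and topic.
NTFY_URL=${NTFY_URL:-https://ntfy.example.internal/homelab-alerts}

# Succeed if the outage has lasted longer than LIMIT seconds (default 600).
# Arguments: outage start and current time, both as epoch seconds.
should_notify() {
  down_since=$1
  now=$2
  limit=${3:-600}
  [ $((now - down_since)) -gt "$limit" ]
}

# Publish a high-priority message: notify "title" "body"
notify() {
  curl -fsS -H "Title: $1" -H "Priority: high" -d "$2" "$NTFY_URL"
}

# Typical use:
#   should_notify "$down_since" "$(date +%s)" && notify "Service down" "Down for more than 10 minutes"
```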
📋 Diagnostic Checklist
Before starting troubleshooting, complete this checklist:
□ Have recent changes been identified?
□ Are all logs and error messages collected?
□ Is network connectivity working at multiple levels?
□ Can containers be restarted successfully?
□ Are backups available for restoring data?
□ Have the priority service impacts been identified?
📚 Related Documentation
- Disaster Recovery Guidelines
- Service Recovery Procedures
- Monitoring Stack Documentation
- Security Best Practices
Last updated: 2026