🔧 Comprehensive Infrastructure Troubleshooting Guide

This guide provides systematic approaches to diagnose and resolve common infrastructure issues across all homelab components. When encountering problems, follow this troubleshooting flow.

🔍 Troubleshooting Methodology

1. Gather Information

  • Check service status in Portainer
  • Review recent changes (Git commits)
  • Collect error messages and logs
  • Identify affected hosts/services

2. Check Service Status

```shell
# On the homelab VM
docker ps -a
docker stats --no-stream
docker compose ls   # stack status is also visible in the Portainer UI
```

3. Verify Network Connectivity

```shell
# Test connectivity to services
ping [host]
telnet [host] [port]
curl -v [service-url]
```

4. Review Logs and Metrics

  • Check Docker logs via Portainer or docker logs
  • Review Grafana dashboards
  • Monitor Uptime Kuma alerts
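Beyond the dashboards, logs can be pulled straight from the host; a quick sketch (`myservice` is a placeholder container name):

```shell
# Tail the last 100 lines and follow new output
docker logs --tail 100 --follow myservice

# Or limit to the last 30 minutes, with timestamps
docker logs --since 30m --timestamps myservice
```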

🚨 Common Issues and Solutions

Authentication Problems

Symptom: Cannot access services like Portainer, Git, or Authentik
Solution Steps:

  1. Verify correct credentials (check Vaultwarden)
  2. Check Tailscale status (tailscale status)
  3. Confirm DNS resolution works for service domains
  4. Restart affected containers in Portainer
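Steps 2 and 3 can be verified from any host on the tailnet; a sketch (the domain below is a placeholder for your real service hostname):

```shell
# Confirm Tailscale is connected and peers are reachable
tailscale status

# Check that the service domain resolves
nslookup portainer.example.internal

# If it resolves, probe the HTTPS endpoint directly
curl -kI https://portainer.example.internal
```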

Network Connectivity Issues

Symptom: Services unreachable from external networks or clients
Common Causes:

  • Firewall rules blocking ports
  • Incorrect Nginx Proxy Manager configuration
  • Tailscale connectivity issues
  • Cloudflare DNS propagation delays

Troubleshooting Steps:

  1. Check Portainer for container running status
  2. Verify host firewall settings (Synology DSM or UFW)
  3. Test direct access to service ports via Tailscale network
  4. Confirm NPM reverse proxy is correctly configured
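Steps 2 and 3 sketched from the shell (the Tailscale IP and port below are placeholders):

```shell
# Ports actually listening on the host
sudo ss -tlnp

# Host firewall rules (on hosts using UFW; Synology DSM manages its firewall in the UI)
sudo ufw status verbose

# Probe a service port directly over the Tailscale network
nc -zv 100.64.0.10 9000
```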

Container Failures

Symptom: Containers failing or crashing repeatedly
Solution Steps:

  1. Check container logs (docker logs [container-name])
  2. Verify image versions (check for :latest tags)
  3. Inspect volume mounts and data paths
  4. Check resource limits/usage
  5. Restart container in Portainer
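Steps 1 through 4 map to a few docker commands; a sketch with `mycontainer` as a placeholder name:

```shell
# Recent logs from the failing container
docker logs --tail 50 mycontainer

# Exit code and restart count
docker inspect --format '{{.State.ExitCode}} {{.RestartCount}}' mycontainer

# Image tag actually in use (prefer pinned tags over :latest)
docker inspect --format '{{.Config.Image}}' mycontainer

# Volume mounts, as JSON
docker inspect --format '{{json .Mounts}}' mycontainer
```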

Backup Issues

Symptom: Backup failures or incomplete backups
Troubleshooting Steps:

  1. Confirm backup task settings match documentation
  2. Check Hyper Backup logs for specific errors
  3. Verify network connectivity to destination storage
  4. Review Backblaze B2 dashboard for errors
  5. Validate local backup copy exists before cloud upload
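Steps 3 and 5 can be spot-checked from the shell; the endpoint is Backblaze's public B2 API host, and the backup path is a placeholder for your own:

```shell
# Reachability of the Backblaze B2 API endpoint
curl -sI https://api.backblazeb2.com | head -n 1

# Confirm a recent local backup copy exists before the cloud upload runs
ls -lht /volume1/backups/ | head
```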

Storage Problems

Symptom: Low disk space, read/write failures
Solution Steps:

  1. Check disk usage via Portainer or a host shell:

    ```shell
    df -h
    du -sh /volume1/docker/*
    ```
  2. Identify large files or directories
  3. Verify proper mount points and permissions
  4. Check Synology volume health status (via DSM UI)
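To find what is actually consuming space, including Docker's own caches (the data root path below is this guide's example; adjust to your layout):

```shell
# Docker's view of disk consumers: images, containers, volumes, build cache
docker system df

# Ten largest items under the Docker data directory
du -sh /volume1/docker/* | sort -rh | head -n 10

# Reclaim space from stopped containers and dangling images
# (review what will be deleted before confirming)
docker system prune
```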

🔄 Recovery Procedures

Container-Level Recovery

  1. Stop affected container
  2. Back up configuration/data volumes if needed
  3. Remove container from Portainer
  4. Redeploy from Git source
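The four steps above, sketched as shell commands (container and volume names are placeholders; this assumes the service keeps its data in a named volume):

```shell
# 1. Stop the container
docker stop myservice

# 2. Snapshot its named volume to a dated tarball
docker run --rm -v myservice_data:/data -v "$PWD":/backup alpine \
  tar czf "/backup/myservice_data-$(date +%F).tar.gz" -C /data .

# 3. Remove the container (named volumes survive removal)
docker rm myservice

# 4. Redeploy from the compose definition tracked in Git
docker compose up -d myservice
```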

Service-Level Recovery

  1. Verify compose file integrity in Git repository
  2. Confirm correct image tags
  3. Redeploy using GitOps (Portainer auto-deploys on push)
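Before pushing, the compose file can be checked locally; a sketch (the file path is a placeholder for the stack's actual definition):

```shell
# Syntax-check the compose file
docker compose -f docker-compose.yml config --quiet && echo "compose file OK"

# Confirm which image tags it pins
grep -n 'image:' docker-compose.yml

# Recent commits touching the stack definition
git log --oneline -5 -- docker-compose.yml
```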

Data Recovery Steps

  1. Identify backup location based on service type:
    • Critical data: Cloud backups (Backblaze B2)
    • Local data: NAS storage backups (Hyper Backup)
    • Docker configs: Setillo replication via Syncthing

📊 Monitoring-Based Troubleshooting

Uptime Kuma Alerts

When Uptime Kuma signals downtime:

  1. Check service status in Portainer
  2. Verify container logs for error messages
  3. Review recent system changes or updates
  4. Confirm network is functional at multiple levels

Grafana Dashboard Checks

Monitor these key metrics:

  • CPU usage (target: <80%)
  • Memory utilization (target: <70%)
  • Disk space (must be >10% free)
  • Network I/O bandwidth
  • Container restart counts
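The same metrics can be spot-checked on the host when Grafana itself is unavailable; a sketch (assumes GNU coreutils `df --output`, which may be missing on BusyBox-based hosts):

```shell
# Per-container CPU and memory, one-shot
docker stats --no-stream --format 'table {{.Name}}\t{{.CPUPerc}}\t{{.MemPerc}}'

# Free disk space per mount (guide threshold: keep >10% free)
df -h --output=target,pcent

# Containers that have restarted at least once
docker ps -q | xargs docker inspect --format '{{.Name}} {{.RestartCount}}' | awk '$2 > 0'
```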

🔧 Emergency Procedures

1. Immediate Actions

  • Document the issue with timestamps
  • Check Uptime Kuma and Grafana for context
  • Contact team members if this affects shared access

2. Service Restoration Process

1. Identify the affected service(s)
2. Confirm availability of backups
3. Determine restoration priority (critical services first)  
4. Execute backup restore from appropriate source
5. Monitor service status post-restoration
6. Validate functionality and notify users

3. Communication Protocol

  • Send ntfy notification to team when:
    • Critical system is down for >10 minutes
    • Data loss is confirmed after checking backups
    • Restoration requires extended downtime
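ntfy notifications are plain HTTP POSTs, so the alert above can be sent from any host; the topic name and message below are placeholders:

```shell
curl -H "Title: Homelab outage" \
     -H "Priority: high" \
     -d "Portainer down >10 minutes; restoration in progress" \
     https://ntfy.sh/homelab-alerts
```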

📋 Diagnostic Checklist

Before starting troubleshooting, complete this checklist:

□ Have recent changes been identified?
□ Are all logs and error messages collected?
□ Is network connectivity working at multiple levels?
□ Can containers be restarted successfully?
□ Are backups available for restoring data?
□ What are the priority service impacts?


Last updated: 2026