🔧 Comprehensive Infrastructure Troubleshooting Guide

This guide provides systematic approaches to diagnose and resolve common infrastructure issues across all homelab components. When encountering problems, follow this troubleshooting flow.

🔍 Troubleshooting Methodology

1. Gather Information

  • Check service status in Portainer
  • Review recent changes (Git commits)
  • Collect error messages and logs
  • Identify affected hosts/services

2. Check Service Status

```shell
# On the homelab VM
docker ps -a
docker stats --no-stream
docker compose ls   # stack status is also visible in the Portainer UI
```

3. Verify Network Connectivity

```shell
# Test connectivity to services
ping [host]
telnet [host] [port]
curl -v [service-url]
```

4. Review Logs and Metrics

  • Check Docker logs via Portainer or docker logs
  • Review Grafana dashboards
  • Monitor Uptime Kuma alerts
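Beyond the dashboards, logs can be pulled straight from the host; a quick sketch (`myservice` is a placeholder container name):

```shell
# Tail the last 100 lines and follow new output
docker logs --tail 100 --follow myservice

# Or limit to the last 30 minutes, with timestamps
docker logs --since 30m --timestamps myservice
```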

🚨 Common Issues and Solutions

Authentication Problems

Symptom: Cannot access services like Portainer, Git, or Authentik
Solution Steps:

  1. Verify correct credentials (check Vaultwarden)
  2. Check Tailscale status (tailscale status)
  3. Confirm DNS resolution works for service domains
  4. Restart affected containers in Portainer
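Steps 2 and 3 can be verified from any host on the tailnet; a sketch (the domain below is a placeholder for your real service hostname):

```shell
# Confirm Tailscale is connected and peers are reachable
tailscale status

# Check that the service domain resolves
nslookup portainer.example.internal

# If it resolves, probe the HTTPS endpoint directly
curl -kI https://portainer.example.internal
```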

Network Connectivity Issues

Symptom: Services unreachable from external networks or clients
Common Causes:

  • Firewall rules blocking ports
  • Incorrect Nginx Proxy Manager configuration
  • Tailscale connectivity issues
  • Cloudflare DNS propagation delays

Troubleshooting Steps:

  1. Check Portainer for container running status
  2. Verify host firewall settings (Synology DSM or UFW)
  3. Test direct access to service ports via Tailscale network
  4. Confirm NPM reverse proxy is correctly configured
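Steps 2 and 3 sketched from the shell (the Tailscale IP and port below are placeholders):

```shell
# Ports actually listening on the host
sudo ss -tlnp

# Host firewall rules (on hosts using UFW; Synology DSM manages its firewall in the UI)
sudo ufw status verbose

# Probe a service port directly over the Tailscale network
nc -zv 100.64.0.10 9000
```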

Container Failures

Symptom: Containers failing or crashing repeatedly
Solution Steps:

  1. Check container logs (docker logs [container-name])
  2. Verify image versions (check for :latest tags)
  3. Inspect volume mounts and data paths
  4. Check resource limits/usage
  5. Restart container in Portainer
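Steps 1 through 4 map to a few docker commands; a sketch with `mycontainer` as a placeholder name:

```shell
# Recent logs from the failing container
docker logs --tail 50 mycontainer

# Exit code and restart count
docker inspect --format '{{.State.ExitCode}} {{.RestartCount}}' mycontainer

# Image tag actually in use (prefer pinned tags over :latest)
docker inspect --format '{{.Config.Image}}' mycontainer

# Volume mounts, as JSON
docker inspect --format '{{json .Mounts}}' mycontainer
```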

Backup Issues

Symptom: Backup failures or incomplete backups
Troubleshooting Steps:

  1. Confirm backup task settings match documentation
  2. Check Hyper Backup logs for specific errors
  3. Verify network connectivity to destination storage
  4. Review Backblaze B2 dashboard for errors
  5. Validate local backup copy exists before cloud upload
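Steps 3 and 5 can be spot-checked from the shell; the endpoint is Backblaze's public B2 API host, and the backup path is a placeholder for your own:

```shell
# Reachability of the Backblaze B2 API endpoint
curl -sI https://api.backblazeb2.com | head -n 1

# Confirm a recent local backup copy exists before the cloud upload runs
ls -lht /volume1/backups/ | head
```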

Storage Problems

Symptom: Low disk space, read/write failures
Solution Steps:

  1. Check disk usage via Portainer or a host shell:

    ```shell
    df -h
    du -sh /volume1/docker/*
    ```
  2. Identify large files or directories
  3. Verify proper mount points and permissions
  4. Check Synology volume health status (via DSM UI)
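To find what is actually consuming space, including Docker's own caches (the data root path below is this guide's example; adjust to your layout):

```shell
# Docker's view of disk consumers: images, containers, volumes, build cache
docker system df

# Ten largest items under the Docker data directory
du -sh /volume1/docker/* | sort -rh | head -n 10

# Reclaim space from stopped containers and dangling images
# (review what will be deleted before confirming)
docker system prune
```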

🔄 Recovery Procedures

Container-Level Recovery

  1. Stop affected container
  2. Back up configuration/data volumes if needed
  3. Remove container from Portainer
  4. Redeploy from Git source
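The four steps above, sketched as shell commands (container and volume names are placeholders; this assumes the service keeps its data in a named volume):

```shell
# 1. Stop the container
docker stop myservice

# 2. Snapshot its named volume to a dated tarball
docker run --rm -v myservice_data:/data -v "$PWD":/backup alpine \
  tar czf "/backup/myservice_data-$(date +%F).tar.gz" -C /data .

# 3. Remove the container (named volumes survive removal)
docker rm myservice

# 4. Redeploy from the compose definition tracked in Git
docker compose up -d myservice
```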

Service-Level Recovery

  1. Verify compose file integrity in Git repository
  2. Confirm correct image tags
  3. Redeploy using GitOps (Portainer auto-deploys on push)
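Before pushing, the compose file can be checked locally; a sketch (the file path is a placeholder for the stack's actual definition):

```shell
# Syntax-check the compose file
docker compose -f docker-compose.yml config --quiet && echo "compose file OK"

# Confirm which image tags it pins
grep -n 'image:' docker-compose.yml

# Recent commits touching the stack definition
git log --oneline -5 -- docker-compose.yml
```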

Data Recovery Steps

  1. Identify backup location based on service type:
    • Critical data: Cloud backups (Backblaze B2)
    • Local data: NAS storage backups (Hyper Backup)
    • Docker configs: Setillo replication via Syncthing

📊 Monitoring-Based Troubleshooting

Uptime Kuma Alerts

When Uptime Kuma signals downtime:

  1. Check service status in Portainer
  2. Verify container logs for error messages
  3. Review recent system changes or updates
  4. Confirm network is functional at multiple levels

Grafana Dashboard Checks

Monitor these key metrics:

  • CPU usage (target: <80%)
  • Memory utilization (target: <70%)
  • Disk space (must be >10% free)
  • Network I/O bandwidth
  • Container restart counts
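The same metrics can be spot-checked on the host when Grafana itself is unavailable; a sketch (assumes GNU coreutils `df --output`, which may be missing on BusyBox-based hosts):

```shell
# Per-container CPU and memory, one-shot
docker stats --no-stream --format 'table {{.Name}}\t{{.CPUPerc}}\t{{.MemPerc}}'

# Free disk space per mount (guide threshold: keep >10% free)
df -h --output=target,pcent

# Containers that have restarted at least once
docker ps -q | xargs docker inspect --format '{{.Name}} {{.RestartCount}}' | awk '$2 > 0'
```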

🔧 Emergency Procedures

1. Immediate Actions

  • Document the issue with timestamps
  • Check Uptime Kuma and Grafana for context
  • Contact team members if this affects shared access

2. Service Restoration Process

1. Identify the affected service(s)
2. Confirm availability of backups
3. Determine restoration priority (critical services first)  
4. Execute backup restore from appropriate source
5. Monitor service status post-restoration
6. Validate functionality and notify users

3. Communication Protocol

  • Send ntfy notification to team when:
    • Critical system is down for >10 minutes
    • Data loss is confirmed after checking backups
    • Restoration requires extended downtime
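ntfy notifications are plain HTTP POSTs, so the alert above can be sent from any host; the topic name and message below are placeholders:

```shell
curl -H "Title: Homelab outage" \
     -H "Priority: high" \
     -d "Portainer down >10 minutes; restoration in progress" \
     https://ntfy.sh/homelab-alerts
```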

📋 Diagnostic Checklist

Before starting troubleshooting, complete this checklist:

□ Have recent changes been identified?
□ Are all logs and error messages collected?
□ Is network connectivity working at multiple levels?
□ Can containers be restarted successfully?
□ Are backups available for restoring data?
□ What are the priority service impacts?


Last updated: 2026