
🔒 Disaster Recovery Procedures

This document outlines comprehensive disaster recovery procedures for the homelab infrastructure. These procedures should be followed when dealing with catastrophic failures or data loss events.

🎯 Recovery Objectives

Recovery Time Objective (RTO)

  • Critical Services: 30 minutes
  • Standard Services: 2 hours
  • Non-Critical: 1 day

Recovery Point Objective (RPO)

  • Critical Data: 1 hour
  • Standard Data: 24 hours
  • Non-Critical: 7 days
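The RPO tiers above can be encoded as a small freshness check, usable from cron or a monitoring hook. This is a sketch: the tier names and thresholds simply mirror the table above, and `within_rpo` accepts epoch timestamps so it can be exercised without touching real backups.

```shell
#!/bin/sh
# Map a backup tier to its RPO window in seconds (mirrors the tiers above).
rpo_seconds() {
  case "$1" in
    critical)     echo 3600   ;;  # 1 hour
    standard)     echo 86400  ;;  # 24 hours
    non-critical) echo 604800 ;;  # 7 days
    *) echo "unknown tier: $1" >&2; return 1 ;;
  esac
}

# within_rpo TIER LAST_BACKUP_EPOCH [NOW_EPOCH]
# Exits 0 if the newest backup is still inside the tier's RPO window.
within_rpo() {
  limit=$(rpo_seconds "$1") || return 2
  now=${3:-$(date +%s)}
  [ $(( now - $2 )) -le "$limit" ]
}
```

Wiring this into Uptime Kuma (or any alerting hook) turns a silent RPO breach into a page instead of a surprise during a restore.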

🧰 Recovery Resources

Backup Locations

  1. Local NAS Copies: Hyper Backup to Calypso
  2. Cloud Storage: Backblaze B2 (primary)
  3. Offsite Replication: Syncthing to Setillo
  4. Docker Configs: Git repository with Syncthing sync

Emergency Access

  • Tailscale VPN access (primary)
  • Physical console access to hosts
  • SSH keys stored in Vaultwarden
  • Emergency USB drives with recovery tools

🚨 Incident Response Workflow

1. Initial Assessment

1. Confirm nature of incident
2. Determine scope and impact
3. Notify team members
4. Document incident time and details
5. Activate appropriate recovery procedures
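Step 4 is easy to skip under pressure; a one-line helper keeps incident timestamps consistent. The log path here is a placeholder — set `INCIDENT_LOG` to wherever incident notes actually live.

```shell
#!/bin/sh
# Append a UTC-timestamped entry to the incident log.
# INCIDENT_LOG is a placeholder path; override it for the real environment.
log_incident() {
  printf '%s | %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" \
    >> "${INCIDENT_LOG:-/tmp/incident-log.txt}"
}

# Example: log_incident "Calypso NAS unreachable; scope: all Docker stacks"
```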

2. Service Restoration Priority

```
Critical (1-2 hours):
├── Authentik SSO
├── Gitea Git hosting
├── Vaultwarden password manager
└── Nginx Proxy Manager

Standard (6-24 hours):
├── Docker configurations
├── Database services
├── Media servers
└── Monitoring stack

Non-Critical (1 week):
├── Development instances
└── Test environments
```

3. Recovery Steps

Docker Stack Recovery

  1. Navigate to corresponding Git repository
  2. Verify stack compose file integrity
  3. Deploy using GitOps in Portainer
  4. Restore any required data from backups
  5. Validate container status and service access
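Steps 1–5 can be scripted for the manual-fallback case where Portainer GitOps is unavailable. This is a hedged sketch: the repository layout (one directory per stack) and the function name are assumptions to adapt to the real repository.

```shell
#!/bin/sh
# Manual fallback for the stack recovery steps above.
# Assumes one directory per stack in the Git repository (adjust to the real layout).
recover_stack() {
  stack=$1 repo_url=$2
  workdir=$(mktemp -d)

  git clone --depth 1 "$repo_url" "$workdir" || return 1    # step 1: fetch repo
  (
    cd "$workdir/$stack" || exit 1
    docker compose config -q || exit 1                      # step 2: validate compose file
    docker compose up -d || exit 1                          # step 3: deploy
    # step 4 (data restore) is service-specific; see "Data Restoration" below.
    docker compose ps                                       # step 5: verify container status
  )
}
```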

Data Restoration

  1. Identify backup source (Backblaze B2, NAS)
  2. Confirm available restore points
  3. Select appropriate backup version
  4. Execute restoration process
  5. Verify data integrity
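The restoration steps above can be sketched with rclone, assuming a configured `b2:` remote is the transfer tool; the remote name, bucket, and paths are placeholders for the real `vk-*` buckets.

```shell
#!/bin/sh
# Sketch of a B2 restore, assuming an rclone remote named "b2" is configured.
# Bucket and path arguments are placeholders for the real vk-* buckets.
restore_from_b2() {
  bucket=$1 src_path=$2 dest=$3

  # Step 2: confirm the restore point exists before pulling anything.
  rclone lsf "b2:$bucket/$src_path" >/dev/null || return 1
  # Step 4: execute the restore.
  rclone copy "b2:$bucket/$src_path" "$dest" || return 1
  # Step 5: verify integrity by comparing source and destination.
  rclone check "b2:$bucket/$src_path" "$dest"
}
```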

📦 Service-Specific Recovery

Authentik SSO Recovery

  • Source: Calypso B2 daily backups
  • Restoration time: under 30 minutes
  • Key files: PostgreSQL database and config files
  • Verify the restoring account has the required access permissions before starting

Gitea Git Hosting

  • Source: Calypso B2 daily backups
  • Restoration time: under 30 minutes
  • Key files: MariaDB database, repository data
  • Ensure service accounts are recreated post-restore
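For Gitea, the database half of the restore typically looks like the sketch below, assuming a MariaDB SQL dump from the daily backup; the container name, database name, credential variable, and function name are placeholders.

```shell
#!/bin/sh
# Restore a Gitea MariaDB dump into the database container.
# Container name, database name, and credential variable are placeholders.
restore_gitea_db() {
  dump=$1 container=${2:-gitea-db}

  [ -f "$dump" ] || { echo "dump not found: $dump" >&2; return 1; }
  # Pipe the SQL dump into the running MariaDB container.
  docker exec -i "$container" mysql -u root -p"$MYSQL_ROOT_PASSWORD" gitea < "$dump"
}
```

After the import, recreate any service accounts and verify repository data before re-enabling external access.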

Backup Systems

  • Local Hyper Backup: Calypso /volume1/backups/
  • Cloud B2: vk-atlantis, vk-concord-1, vk-setillo, vk-guava
  • Critical services: Atlantis NAS, Calypso NAS, Setillo NAS, Guava TrueNAS
  • Restore method: Manual restore using the existing Hyper Backup tasks, or from an alternate backup source

Media Services

  • Plex: Local storage + metadata backed up
  • Jellyfin: Local storage with metadata recovery
  • Immich: Photo DB plus media backup
  • Recovery time: under 1 hour for basic access

🎯 Recovery Testing

Quarterly Tests

  1. Simulate hardware failures
  2. Conduct full data restores
  3. Verify service availability post-restore
  4. Document test results and improvements

Automation Testing

  • Scripted recovery workflows
  • Docker compose file validation
  • Backup integrity checks
  • Restoration time measurements
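The compose-file validation item can be a small loop that `docker compose config` does the real work for; the directory layout and file names below are assumptions about how stacks are organized in the repository.

```shell
#!/bin/sh
# Validate every compose file under a stacks directory (layout is an assumption).
# Returns non-zero if any file fails `docker compose config`.
validate_stacks() {
  root=$1 rc=0
  for f in "$root"/*/compose.yaml "$root"/*/docker-compose.yml; do
    [ -f "$f" ] || continue
    if docker compose -f "$f" config -q; then
      echo "OK   $f"
    else
      echo "FAIL $f"
      rc=1
    fi
  done
  return $rc
}
```

Running this in CI on every push to the Git repository catches a broken compose file before a disaster does.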

📋 Recovery Checklists

Complete Infrastructure Restore

□ Power cycle failed hardware
□ Reinstall operating system (DSM for Synology)
□ Configure basic network settings
□ Initialize storage volumes
□ Install Docker and Portainer
□ Clone Git repository to local directory
□ Deploy stacks from Git (Portainer GitOps)
□ Restore service-specific data from backups
□ Test all services through Tailscale
□ Verify external access through Cloudflare

Critical Service Restore

□ Confirm service is down
□ Validate backup availability for service
□ Initiate restore process
□ Monitor progress
□ Resume service configuration
□ Test functionality
□ Update monitoring

🔄 Failover Procedures

Host-Level Failover

  1. Identify primary host failure
  2. Deploy stack to alternative host
  3. Validate access via Tailscale
  4. Update DNS if needed (Cloudflare)
  5. Confirm service availability from external access
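Step 4 can be done with one call to the Cloudflare API. This is a sketch, not a drop-in script: the zone ID, record ID, and `CF_API_TOKEN` variable are placeholders you must look up first, and only the record's `content` (target IP) is changed.

```shell
#!/bin/sh
# Point a DNS record at the alternative host via the Cloudflare API (step 4).
# CF_API_TOKEN, zone ID, and record ID are placeholders looked up beforehand.
update_dns_record() {
  zone_id=$1 record_id=$2 new_ip=$3
  curl -fsS -X PATCH \
    "https://api.cloudflare.com/client/v4/zones/$zone_id/dns_records/$record_id" \
    -H "Authorization: Bearer $CF_API_TOKEN" \
    -H "Content-Type: application/json" \
    --data "{\"content\": \"$new_ip\"}"
}
```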

Network-Level Failover

  1. Switch traffic routing via Cloudflare
  2. Update DNS records for affected services
  3. Test connectivity from multiple sources
  4. Monitor service health in Uptime Kuma
  5. Document routing changes

⚠️ Known Limitations

Data Without Full Backups

  • Jellyfish (RPi 5): Photos-only backup, no cloud sync
  • Homelab VM: Monitoring databases are treated as disposable and rebuilt from scratch rather than restored
  • Concord NUC: Small config files that can be regenerated

Recovery Dependencies

  • Some services require Tailscale access for proper operation
  • External DNS resolution depends on Cloudflare being operational
  • Backup restoration assumes sufficient disk space is available

Last updated: 2026