
🔒 Disaster Recovery Procedures

This document outlines comprehensive disaster recovery procedures for the homelab infrastructure. These procedures should be followed when dealing with catastrophic failures or data loss events.

🎯 Recovery Objectives

Recovery Time Objective (RTO)

  • Critical Services: 30 minutes
  • Standard Services: 2 hours
  • Non-Critical: 1 day

Recovery Point Objective (RPO)

  • Critical Data: 1 hour
  • Standard Data: 24 hours
  • Non-Critical: 7 days
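The RPO tiers above can be encoded as a small freshness check, usable from cron or a monitoring hook. This is a sketch: the tier names and thresholds simply mirror the table above, and `within_rpo` accepts epoch timestamps so it can be exercised without touching real backups.

```shell
#!/bin/sh
# Map a backup tier to its RPO window in seconds (mirrors the tiers above).
rpo_seconds() {
  case "$1" in
    critical)     echo 3600   ;;  # 1 hour
    standard)     echo 86400  ;;  # 24 hours
    non-critical) echo 604800 ;;  # 7 days
    *) echo "unknown tier: $1" >&2; return 1 ;;
  esac
}

# within_rpo TIER LAST_BACKUP_EPOCH [NOW_EPOCH]
# Exits 0 if the newest backup is still inside the tier's RPO window.
within_rpo() {
  limit=$(rpo_seconds "$1") || return 2
  now=${3:-$(date +%s)}
  [ $(( now - $2 )) -le "$limit" ]
}
```

Wiring this into Uptime Kuma (or any alerting hook) turns a silent RPO breach into a page instead of a surprise during a restore.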

🧰 Recovery Resources

Backup Locations

  1. Local NAS Copies: Hyper Backup to Calypso
  2. Cloud Storage: Backblaze B2 (primary)
  3. Offsite Replication: Syncthing to Setillo
  4. Docker Configs: Git repository with Syncthing sync

Emergency Access

  • Tailscale VPN access (primary)
  • Physical console access to hosts
  • SSH keys stored in Vaultwarden
  • Emergency USB drives with recovery tools

🚨 Incident Response Workflow

1. Initial Assessment

1. Confirm nature of incident
2. Determine scope and impact
3. Notify team members
4. Document incident time and details
5. Activate appropriate recovery procedures
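Step 4 is easy to skip under pressure; a one-line helper keeps incident timestamps consistent. The log path here is a placeholder — set `INCIDENT_LOG` to wherever incident notes actually live.

```shell
#!/bin/sh
# Append a UTC-timestamped entry to the incident log.
# INCIDENT_LOG is a placeholder path; override it for the real environment.
log_incident() {
  printf '%s | %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" \
    >> "${INCIDENT_LOG:-/tmp/incident-log.txt}"
}

# Example: log_incident "Calypso NAS unreachable; scope: all Docker stacks"
```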

2. Service Restoration Priority

```
Critical (1-2 hours):
├── Authentik SSO
├── Gitea Git hosting
├── Vaultwarden password manager
└── Nginx Proxy Manager

Standard (6-24 hours):
├── Docker configurations
├── Database services
├── Media servers
└── Monitoring stack

Non-Critical (1 week):
├── Development instances
└── Test environments
```

3. Recovery Steps

Docker Stack Recovery

  1. Navigate to corresponding Git repository
  2. Verify stack compose file integrity
  3. Deploy using GitOps in Portainer
  4. Restore any required data from backups
  5. Validate container status and service access
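Steps 1–5 can be scripted for the manual-fallback case where Portainer GitOps is unavailable. This is a hedged sketch: the repository layout (one directory per stack) and the function name are assumptions to adapt to the real repository.

```shell
#!/bin/sh
# Manual fallback for the stack recovery steps above.
# Assumes one directory per stack in the Git repository (adjust to the real layout).
recover_stack() {
  stack=$1 repo_url=$2
  workdir=$(mktemp -d)

  git clone --depth 1 "$repo_url" "$workdir" || return 1    # step 1: fetch repo
  (
    cd "$workdir/$stack" || exit 1
    docker compose config -q || exit 1                      # step 2: validate compose file
    docker compose up -d || exit 1                          # step 3: deploy
    # step 4 (data restore) is service-specific; see "Data Restoration" below.
    docker compose ps                                       # step 5: verify container status
  )
}
```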

Data Restoration

  1. Identify backup source (Backblaze B2, NAS)
  2. Confirm available restore points
  3. Select appropriate backup version
  4. Execute restoration process
  5. Verify data integrity
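The restoration steps above can be sketched with rclone, assuming a configured `b2:` remote is the transfer tool; the remote name, bucket, and paths are placeholders for the real `vk-*` buckets.

```shell
#!/bin/sh
# Sketch of a B2 restore, assuming an rclone remote named "b2" is configured.
# Bucket and path arguments are placeholders for the real vk-* buckets.
restore_from_b2() {
  bucket=$1 src_path=$2 dest=$3

  # Step 2: confirm the restore point exists before pulling anything.
  rclone lsf "b2:$bucket/$src_path" >/dev/null || return 1
  # Step 4: execute the restore.
  rclone copy "b2:$bucket/$src_path" "$dest" || return 1
  # Step 5: verify integrity by comparing source and destination.
  rclone check "b2:$bucket/$src_path" "$dest"
}
```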

📦 Service-Specific Recovery

Authentik SSO Recovery

  • Source: Calypso B2 daily backups
  • Restoration time: under 30 minutes
  • Key files: PostgreSQL database and config files
  • Verify the restoring account has the required access permissions before starting

Gitea Git Hosting

  • Source: Calypso B2 daily backups
  • Restoration time: under 30 minutes
  • Key files: MariaDB database, repository data
  • Ensure service accounts are recreated post-restore
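For Gitea, the database half of the restore typically looks like the sketch below, assuming a MariaDB SQL dump from the daily backup; the container name, database name, credential variable, and function name are placeholders.

```shell
#!/bin/sh
# Restore a Gitea MariaDB dump into the database container.
# Container name, database name, and credential variable are placeholders.
restore_gitea_db() {
  dump=$1 container=${2:-gitea-db}

  [ -f "$dump" ] || { echo "dump not found: $dump" >&2; return 1; }
  # Pipe the SQL dump into the running MariaDB container.
  docker exec -i "$container" mysql -u root -p"$MYSQL_ROOT_PASSWORD" gitea < "$dump"
}
```

After the import, recreate any service accounts and verify repository data before re-enabling external access.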

Backup Systems

  • Local Hyper Backup: Calypso /volume1/backups/
  • Cloud B2: vk-atlantis, vk-concord-1, vk-setillo, vk-guava
  • Critical services: Atlantis NAS, Calypso NAS, Setillo NAS, Guava TrueNAS
  • Restore method: Manual restore using the existing Hyper Backup tasks, or from an alternate backup source

Media Services

  • Plex: Local storage + metadata backed up
  • Jellyfin: Local storage with metadata recovery
  • Immich: Photo DB plus media backup
  • Recovery time: under 1 hour for basic access

🎯 Recovery Testing

Quarterly Tests

  1. Simulate hardware failures
  2. Conduct full data restores
  3. Verify service availability post-restore
  4. Document test results and improvements

Automation Testing

  • Scripted recovery workflows
  • Docker compose file validation
  • Backup integrity checks
  • Restoration time measurements
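The compose-file validation item can be a small loop that `docker compose config` does the real work for; the directory layout and file names below are assumptions about how stacks are organized in the repository.

```shell
#!/bin/sh
# Validate every compose file under a stacks directory (layout is an assumption).
# Returns non-zero if any file fails `docker compose config`.
validate_stacks() {
  root=$1 rc=0
  for f in "$root"/*/compose.yaml "$root"/*/docker-compose.yml; do
    [ -f "$f" ] || continue
    if docker compose -f "$f" config -q; then
      echo "OK   $f"
    else
      echo "FAIL $f"
      rc=1
    fi
  done
  return $rc
}
```

Running this in CI on every push to the Git repository catches a broken compose file before a disaster does.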

📋 Recovery Checklists

Complete Infrastructure Restore

□ Power cycle failed hardware
□ Reinstall operating system (DSM for Synology)
□ Configure basic network settings
□ Initialize storage volumes
□ Install Docker and Portainer
□ Clone Git repository to local directory
□ Deploy stacks from Git (Portainer GitOps)
□ Restore service-specific data from backups
□ Test all services through Tailscale
□ Verify external access through Cloudflare

Critical Service Restore

□ Confirm service is down
□ Validate backup availability for service
□ Initiate restore process
□ Monitor progress
□ Resume service configuration
□ Test functionality
□ Update monitoring

🔄 Failover Procedures

Host-Level Failover

  1. Identify primary host failure
  2. Deploy stack to alternative host
  3. Validate access via Tailscale
  4. Update DNS if needed (Cloudflare)
  5. Confirm service availability from external access
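Step 4 can be done with one call to the Cloudflare API. This is a sketch, not a drop-in script: the zone ID, record ID, and `CF_API_TOKEN` variable are placeholders you must look up first, and only the record's `content` (target IP) is changed.

```shell
#!/bin/sh
# Point a DNS record at the alternative host via the Cloudflare API (step 4).
# CF_API_TOKEN, zone ID, and record ID are placeholders looked up beforehand.
update_dns_record() {
  zone_id=$1 record_id=$2 new_ip=$3
  curl -fsS -X PATCH \
    "https://api.cloudflare.com/client/v4/zones/$zone_id/dns_records/$record_id" \
    -H "Authorization: Bearer $CF_API_TOKEN" \
    -H "Content-Type: application/json" \
    --data "{\"content\": \"$new_ip\"}"
}
```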

Network-Level Failover

  1. Switch traffic routing via Cloudflare
  2. Update DNS records for affected services
  3. Test connectivity from multiple sources
  4. Monitor service health in Uptime Kuma
  5. Document routing changes

⚠️ Known Limitations

Data Without Full Backups

  • Jellyfish (RPi 5): Photos-only backup, no cloud sync
  • Homelab VM: Monitoring databases are treated as disposable and rebuilt from scratch rather than restored
  • Concord NUC: Small config files that can be regenerated

Recovery Dependencies

  • Some services require Tailscale access for proper operation
  • External DNS resolution depends on Cloudflare being operational
  • Backup restoration assumes sufficient disk space is available

Last updated: 2026