# 🚨 Synology NAS Disaster Recovery Guide

**🔴 Critical Emergency Procedures**

This guide covers critical disaster recovery scenarios specific to Synology NAS systems, with detailed procedures for the DS1823xs+ and related hardware failures. These procedures can save your data and minimize downtime.

## 🎯 Critical Scenarios Covered

1. **💾 SSD Cache Failure** - Current critical issue with Atlantis
2. **🔥 Complete NAS Failure** - Hardware replacement procedures
3. **⚡ Power Surge Damage** - Recovery from electrical damage
4. **🌊 Water/Physical Damage** - Emergency data extraction
5. **🔒 Encryption Key Loss** - Encrypted volume recovery
6. **📦 DSM Corruption** - Operating system recovery

---

## 💾 SSD Cache Failure Recovery (CURRENT CRITICAL ISSUE)

### **🚨 Current Situation: Atlantis DS1823xs+**
```bash
# CRITICAL STATUS:
# - SSD cache corrupted after DSM update
# - Volume1 is OFFLINE due to cache failure
# - 2x WD Black SN750 SE 500GB drives affected
# - All Docker services down
# - Immediate action required

# Symptoms:
# - Volume1 shows as "Crashed" in Storage Manager
# - SSD cache shows errors or corruption
# - Services fail to start
# - Data appears inaccessible
```

### **⚡ Emergency Recovery Procedure**

#### **Step 1: Immediate Assessment (5 minutes)**
```bash
# SSH into Atlantis
ssh admin@atlantis.vish.local
# or via Tailscale IP
ssh admin@100.83.230.112

# Check system status
sudo -i
cat /proc/mdstat
df -h
dmesg | tail -50

# Check volume status
synodisk --enum
synovolume --enum
```
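On an 8-bay unit the `/proc/mdstat` output from the assessment above can be long. As a minimal sketch, the helper below lists only the md arrays whose member map shows a missing disk (`_`). The name `degraded_arrays` is hypothetical and not part of DSM. It assumes the standard Linux mdstat layout and accepts an optional file path, so it can also run against a saved copy.

```bash
# degraded_arrays [FILE]
# Print the name of every md array whose status map (e.g. [UU_U]) shows a
# missing member. FILE defaults to /proc/mdstat.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }                   # remember the array being described
        /\[[U_]+\]/   { if ($NF ~ /_/) print name }   # status map is the last field
    ' "${1:-/proc/mdstat}"
}

# Usage on the NAS:
#   degraded_arrays
```

Arrays reported as `inactive` have no status map, so this sketch does not flag them. Check for those separately in the raw output.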

#### **Step 2: Disable SSD Cache (10 minutes)**
```bash
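# CRITICAL: Removing the damaged cache usually brings Volume1 back online
# Navigate via web interface:
# 1. DSM > Storage Manager
# 2. Storage > SSD Cache
# 3. Select corrupted cache
# 4. Click "Remove" or "Disable"
# 5. Confirm removal. Data already on the HDD volume is preserved, but if this
#    was a read-write cache, writes that were never flushed to the HDDs may be
#    lost, so treat recently changed files as suspect.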

# Alternative via SSH (if web interface fails):
echo 'Disabling SSD cache via command line...'
# Note: Exact commands vary by DSM version
# Consult Synology documentation for CLI cache management
```

#### **Step 3: Verify Volume1 Recovery (5 minutes)**
```bash
# Check if Volume1 is back online
df -h | grep volume1
ls -la /volume1/

# Report the result based on whether /volume1 is actually mounted
if grep -q ' /volume1 ' /proc/mounts; then
    echo "✅ Volume1 recovered successfully"
else
    echo "❌ Volume1 still offline - proceed to advanced recovery"
fi
```

#### **Step 4: Emergency Data Backup (30-60 minutes)**
```bash
# IMMEDIATELY backup critical data once Volume1 is accessible
# Assumes /volume2 is a separate, healthy volume with enough free space
BACKUP=/volume2/emergency-backup
TODAY=$(date +%Y%m%d)
mkdir -p "$BACKUP/db-backups"

# Priority order:

# 1. Docker configurations (highest priority)
rsync -av /volume1/docker/ "$BACKUP/docker-$TODAY/"
tar -czf "$BACKUP/docker-configs-$TODAY.tar.gz" -C /volume1 docker

# 2. Critical documents
rsync -av /volume1/documents/ "$BACKUP/documents-$TODAY/"

# 3. Database backups (-R keeps the source path, so same-named files do not overwrite each other)
find /volume1/docker -name "*backup*" -type f -exec rsync -aR {} "$BACKUP/db-backups/" \;

# 4. Configuration files
cp -r /volume1/homelab/ "$BACKUP/homelab-$TODAY/"

# Record checksums of the backup (the manifest is excluded from its own listing)
echo "Recording backup checksums..."
find "$BACKUP" -type f ! -name 'checksums-*.md5' -exec md5sum {} + > "$BACKUP/checksums-$TODAY.md5"
```
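Checksumming only the backup proves little unless the result is compared with the source. The sketch below shows one way to do that: build a manifest from the source tree, then re-check the same relative paths in the copy. The function names `make_manifest` and `verify_copy` are hypothetical, and the code assumes GNU `find`, `sort`, `xargs` and `md5sum` are available.

```bash
# make_manifest SRC MANIFEST
# Write md5 checksums for every file under SRC, using paths relative to SRC.
# Keep MANIFEST outside SRC so it does not end up listing itself.
make_manifest() {
    local src=$1 manifest=$2
    ( cd "$src" && find . -type f -print0 | sort -z | xargs -0 -r md5sum ) > "$manifest"
}

# verify_copy DEST MANIFEST
# Re-check the same relative paths under DEST. MANIFEST must be an absolute
# path. Returns non-zero if any file is missing or differs.
verify_copy() {
    local dest=$1 manifest=$2
    ( cd "$dest" && md5sum --quiet -c "$manifest" )
}

# Usage (paths follow Step 4; adjust the date suffix):
#   make_manifest /volume1/docker /volume2/emergency-backup/docker-src.md5
#   verify_copy /volume2/emergency-backup/docker-YYYYMMDD /volume2/emergency-backup/docker-src.md5
```

Files that change between the copy and the check (for example live database files) will be reported as mismatches, so stop writers first where possible.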

#### **Step 5: Remove Failed SSD Drives (15 minutes)**
```bash
# Physical removal of corrupted SSD drives
# 1. Shutdown Atlantis safely
sudo shutdown -h now

# 2. Wait for complete shutdown (LED off)
# 3. Remove power cable
# 4. Open NAS case
# 5. Remove both WD Black SN750 SE drives from M.2 slots
# 6. Close case and reconnect power
# 7. Power on and verify system boots normally

# After boot, verify no SSD cache references remain
# DSM > Storage Manager > Storage > SSD Cache
# Should show "No SSD cache configured"
```

### **🔧 Permanent Solution: New NVMe Installation**

#### **Hardware Installation (When New Drives Arrive)**
```bash
# New hardware to install:
# - 2x Crucial P310 1TB (CT1000P310SSD801)
# - 1x Synology SNV5420-400G

# Installation procedure:
# 1. Power down Atlantis
# 2. Install Crucial P310 drives in M.2 slots 1 & 2
# 3. Install Synology SNV5420 in E10M20-T1 card M.2 slot
# 4. Power on and wait for drive recognition
```

#### **007revad Script Configuration**
```bash
# After hardware installation, run 007revad scripts
cd /volume1/homelab/synology_scripts/

# 1. Enable M.2 volume support (success message prints only if the script succeeds)
cd 007revad_enable_m2/
sudo ./syno_enable_m2_volume.sh && echo "✅ M.2 volume support enabled"

# 2. Create M.2 volumes
cd ../007revad_m2_volume/
sudo ./syno_m2_volume.sh && echo "✅ M.2 volumes created"

# 3. Update HDD database (for IronWolf Pro drives)
cd ../007revad_hdd_db/
sudo ./syno_hdd_db.sh && echo "✅ HDD database updated"
```

#### **New Cache Configuration**
```bash
# Configure new SSD cache with Crucial P310 drives
# DSM > Storage Manager > Storage > SSD Cache

# Recommended configuration:
# - Cache Type: Read-Write cache
# - RAID Type: RAID 1 (for redundancy)
# - Drives: Both Crucial P310 1TB drives
# - Skip data consistency check: NO (ensure integrity)

# Synology SNV5420 usage:
# - Use as separate high-performance volume
# - Ideal for Docker containers requiring high IOPS
# - Configure as Volume3 for critical services
```

---

## 🔥 Complete NAS Hardware Failure

### **Emergency Data Extraction**
```bash
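# If NAS won't boot but drives are intact
# Use Linux PC for data recovery

# 1. Remove drives from failed NAS (label each drive with its bay number)
# 2. Connect all drives of the array to the Linux system (SATA ports or USB adapters)
# 3. Install mdadm and lvm2 for RAID recovery (Synology volumes often sit on LVM)

sudo apt update && sudo apt install -y mdadm lvm2

# 4. Scan for RAID arrays and activate any LVM volume groups
sudo mdadm --assemble --scan
sudo vgchange -ay
sudo mdadm --detail --scan
cat /proc/mdstat

# 5. Mount the recovered data volume READ-ONLY
# The small arrays (often md0/md1) typically hold the DSM system and swap
# partitions. The data volume is usually the largest array, or an LVM logical
# volume on top of it (commonly something like /dev/vg1000/lv). Check lsblk.
DATA_DEV=/dev/mdX   # placeholder: set this to your data volume device
sudo mkdir -p /mnt/synology-recovery
sudo mount -o ro "$DATA_DEV" /mnt/synology-recovery

# 6. Copy critical data
mkdir -p ~/synology-recovery
rsync -av /mnt/synology-recovery/docker/ ~/synology-recovery/docker/
rsync -av /mnt/synology-recovery/documents/ ~/synology-recovery/documents/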
```
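Picking the right device to mount is the step most likely to go wrong. As a rough aid, the sketch below reads `/proc/mdstat` and prints the array with the most blocks, which on Synology-formatted disks is normally the data array, not the small system or swap arrays. The name `largest_array` is hypothetical. If the volume uses LVM, mount the logical volume on top of this array, not the array itself.

```bash
# largest_array [FILE]
# Print the device path of the md array with the most blocks.
# FILE defaults to /proc/mdstat.
largest_array() {
    awk '
        /^md[0-9]+ :/  { name = $1 }
        $2 == "blocks" { if ($1 + 0 > max) { max = $1 + 0; best = name } }
        END            { if (best != "") print "/dev/" best }
    ' "${1:-/proc/mdstat}"
}

# Usage on the recovery PC, after "mdadm --assemble --scan":
#   largest_array
```

Treat the result as a hint only and confirm it against `lsblk` before mounting.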

### **NAS Replacement Procedure**
```bash
# Complete DS1823xs+ replacement

# Step 1: Order identical replacement
# - Same model: DS1823xs+
# - Same RAM configuration: 32GB DDR4 ECC
# - Same expansion cards: E10M20-T1

# Step 2: Drive migration
# - Remove all drives from old unit
# - Note drive bay positions (critical!)
# - Install drives in new unit in EXACT same order
# - Install M.2 drives in same slots

# Step 3: First boot
# - Power on new NAS
# - DSM will detect existing configuration
# - Follow migration wizard
# - Do NOT initialize drives (will erase data)

# Step 4: Configuration restoration
# - Restore DSM configuration from backup
# - Reinstall packages and applications
# - Run 007revad scripts
# - Verify all services operational
```

---

## ⚡ Power Surge Recovery

### **Assessment Procedure**
```bash
# After power surge or electrical event

# Step 1: Visual inspection
# - Check for burn marks on power adapter
# - Inspect NAS case for damage
# - Look for LED indicators

# Step 2: Controlled power-on test
# - Use different power outlet
# - Connect only essential cables
# - Power on and observe boot sequence

# Step 3: Component testing
# If NAS powers on:
# - Check all drive recognition
# - Verify network connectivity
# - Test all expansion cards

# If NAS doesn't power on:
# - Try different power adapter (if available)
# - Check fuses in power adapter
# - Consider professional repair
```

### **Data Protection After Surge**
```bash
# If NAS boots but shows errors:

# 1. Immediate backup
# Priority: Get data off potentially damaged system
rsync -av /volume1/critical/ /external-backup/

# 2. Drive health check
# Check all drives for damage
sudo smartctl -a /dev/sda
sudo smartctl -a /dev/sdb
# Repeat for all drives

# 3. Memory test
# Run memory diagnostic if available
# Check for ECC errors in logs

# 4. Replace damaged components
# Order replacements for any failed components
# Consider UPS installation to prevent future damage
```
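Running `smartctl` by hand for every bay is tedious and easy to get wrong under stress. The sketch below reduces `smartctl -H` output to a one-word verdict so all drives can be checked in a loop. `smart_verdict` is a hypothetical helper name. It looks for the two verdict lines smartctl prints for ATA/NVMe and SCSI devices, and device names in the usage comment are assumptions that vary by model and DSM version.

```bash
# smart_verdict
# Read `smartctl -H` output on stdin and print the overall health verdict,
# or UNKNOWN if no verdict line is found.
smart_verdict() {
    awk -F': *' '
        /SMART overall-health self-assessment test result/ { print $2; found = 1 }
        /SMART Health Status/                              { print $2; found = 1 }
        END { if (!found) print "UNKNOWN" }
    '
}

# Usage on the NAS (device names are assumptions; list /dev to confirm):
#   for dev in /dev/sd? /dev/sata?; do
#       [ -e "$dev" ] || continue
#       printf '%s: %s\n' "$dev" "$(sudo smartctl -H "$dev" | smart_verdict)"
#   done
```

A PASSED verdict after a surge is not proof of health. Still review the full `smartctl -a` attributes for reallocated or pending sectors.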

---

## 🌊 Water/Physical Damage Recovery

### **Emergency Response (First 30 minutes)**
```bash
# If NAS exposed to water or physical damage:

# IMMEDIATE ACTIONS:
# 1. POWER OFF IMMEDIATELY - do not attempt to boot
# 2. Disconnect all cables
# 3. Remove drives if possible
# 4. Do not attempt to power on

# Drive preservation:
# - Place drives in anti-static bags
# - Store in dry, cool location
# - Do not attempt to clean or dry
# - Contact professional recovery service if needed
```

### **Professional Recovery Decision**
```bash
# When to contact professional data recovery:
# - Water damage to drives
# - Physical damage to drive enclosures
# - Clicking or grinding noises from drives
# - Drives not recognized by any system
# - Critical data with no backup

# Professional services:
# - DriveSavers: 1-800-440-1904
# - Ontrack: 1-800-872-2599
# - Secure Data Recovery: 1-800-388-1266

# Cost considerations:
# - $500-$5000+ depending on damage
# - Success not guaranteed
# - Weigh cost vs. data value
```

---

## 🔒 Encryption Key Recovery

### **Encrypted Volume Access**
```bash
# If encryption key is lost or corrupted:

# Step 1: Locate backup keys
# Check these locations:
# - Password manager (Vaultwarden)
# - Physical key backup (if created)
# - Email notifications from Synology
# - Configuration backup files

# Step 2: Key recovery attempt
# DSM > Control Panel > Shared Folder
# Select encrypted folder > Edit > Security
# Try "Recover" option with backup key

# Step 3: If no backup key exists:
# Data is likely unrecoverable without professional help
# Synology uses strong encryption - no backdoors
# Consider professional cryptographic recovery services
```

### **Prevention for Future**
```bash
# Create encryption key backup NOW:
# 1. DSM > Control Panel > Shared Folder
# 2. Select encrypted folder > Edit > Security
# 3. Export encryption key
# 4. Store in multiple secure locations:
#    - Password manager
#    - Physical printout in safe
#    - Encrypted cloud storage
#    - Secondary NAS location
```

---

## 📦 DSM Operating System Recovery

### **DSM Corruption Recovery**
```bash
# If DSM won't boot or is corrupted:

# Step 1: Download DSM installer
# From Synology website:
# - Find your exact model (DS1823xs+)
# - Download latest DSM .pat file
# - Save to computer

# Step 2: Synology Assistant recovery
# 1. Install Synology Assistant on computer
# 2. Connect NAS and computer to same network
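# 3. With the NAS powered on, press and hold RESET (about 4 s) until it beeps, then release
# 4. Within 10 s, press and hold RESET again (about 4 s) until it beeps 3 more times;
#    the STATUS LED blinks orange once the NAS is ready for reinstall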
# 5. Use Synology Assistant to reinstall DSM

# Step 3: Configuration restoration
# After DSM reinstall:
# - Restore from configuration backup
# - Reinstall packages
# - Reconfigure services
# - Run 007revad scripts
```

### **Manual DSM Installation**
```bash
# If Synology Assistant fails:

# 1. Access recovery mode
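# - Make sure the NAS is powered on
# - Press and hold RESET (about 4 s) until it beeps, then release
# - Within 10 s, press and hold RESET again (about 4 s) until it beeps 3 more times
# - Wait for the STATUS LED to blink orange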

# 2. Web interface recovery
# - Open browser to NAS IP address
# - Should show recovery interface
# - Upload DSM .pat file
# - Follow installation wizard

# 3. Data preservation
# - Choose "Keep existing data" if option appears
# - Do not format drives unless absolutely necessary
# - Existing volumes should be preserved
```

---

## 🛠️ 007revad Scripts for Disaster Recovery

### **Post-Recovery Script Execution**
```bash
# After any hardware replacement or DSM reinstall:

# 1. Download/update scripts
cd /volume1/homelab/synology_scripts/
git pull origin main  # Update to latest versions

# 2. HDD Database Update (for IronWolf Pro drives)
cd 007revad_hdd_db/
sudo ./syno_hdd_db.sh
# Ensures Seagate IronWolf Pro drives are properly recognized
# Prevents compatibility warnings
# Enables full SMART monitoring

# 3. Enable M.2 Volume Support
cd ../007revad_enable_m2/
sudo ./syno_enable_m2_volume.sh
# Re-enables M.2 volume creation after DSM updates
# Required after any DSM reinstall
# Fixes DSM limitations on M.2 usage

# 4. Create M.2 Volumes
cd ../007revad_m2_volume/
sudo ./syno_m2_volume.sh
# Creates storage volumes on M.2 drives
# Allows M.2 drives to be used for more than just cache
# Essential for high-performance storage setup
```

### **Script Automation for Recovery**
```bash
# Create automated recovery script
cat > /volume1/homelab/scripts/post-recovery-setup.sh << 'EOF'
#!/bin/bash
# Post-disaster recovery automation script

echo "🚀 Starting post-recovery setup..."

# Update 007revad scripts
cd /volume1/homelab/synology_scripts/
git pull origin main

# Run HDD database update
echo "📀 Updating HDD database..."
cd 007revad_hdd_db/
sudo ./syno_hdd_db.sh

# Enable M.2 volumes
echo "💾 Enabling M.2 volume support..."
cd ../007revad_enable_m2/
sudo ./syno_enable_m2_volume.sh

# Create M.2 volumes
echo "🔧 Creating M.2 volumes..."
cd ../007revad_m2_volume/
sudo ./syno_m2_volume.sh

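# Restart Docker services
# DSM runs Docker as a package, not as a plain systemd "docker" unit.
# The package is named ContainerManager on DSM 7.2+ and Docker on earlier releases.
echo "🐳 Restarting Docker services..."
sudo synopkg restart ContainerManager || sudo synopkg restart Docker

# Verify services
echo "✅ Verifying critical services..."
sudo docker ps | grep -E "(plex|grafana|vaultwarden)"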

echo "🎉 Post-recovery setup complete!"
EOF

chmod +x /volume1/homelab/scripts/post-recovery-setup.sh
```
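The automation script above keeps going even if a `cd` fails, for instance because the scripts directory was not restored yet. A pre-flight check avoids running half a recovery. The sketch below uses a hypothetical helper, `missing_scripts`, to report which expected scripts are absent or not executable. The paths in the usage comment are the ones used in this guide.

```bash
# missing_scripts BASE REL...
# Print each REL path that is not an executable regular file under BASE.
# Returns non-zero if anything is missing.
missing_scripts() {
    local base=$1 rel rc=0
    shift
    for rel in "$@"; do
        if [ ! -f "$base/$rel" ] || [ ! -x "$base/$rel" ]; then
            echo "$rel"
            rc=1
        fi
    done
    return "$rc"
}

# Usage before running post-recovery-setup.sh:
#   missing_scripts /volume1/homelab/synology_scripts \
#       007revad_hdd_db/syno_hdd_db.sh \
#       007revad_enable_m2/syno_enable_m2_volume.sh \
#       007revad_m2_volume/syno_m2_volume.sh \
#     || echo "Fix the paths listed above before continuing"
```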

---

## 📋 Recovery Checklists

### **🚨 SSD Cache Failure Checklist**
```text
☐ SSH access to NAS confirmed
☐ Volume status assessed
☐ SSD cache disabled/removed
☐ Volume1 accessibility verified
☐ Emergency backup completed
☐ Failed SSD drives physically removed
☐ System stability confirmed
☐ New drives ordered (if needed)
☐ 007revad scripts prepared
☐ Recovery procedure documented
```

### **🔥 Complete NAS Failure Checklist**
```text
☐ Damage assessment completed
☐ Drives safely removed
☐ Drive order documented
☐ Replacement NAS ordered
☐ Data recovery attempted (if needed)
☐ New NAS configured
☐ Drives installed in correct order
☐ Configuration restored
☐ 007revad scripts executed
☐ All services verified operational
```

### **⚡ Power Surge Recovery Checklist**
```text
☐ Visual damage inspection completed
☐ Power adapter tested/replaced
☐ Controlled power-on test performed
☐ Drive health checks completed
☐ Memory diagnostics run
☐ Network connectivity verified
☐ UPS installation planned
☐ Surge protection upgraded
☐ Insurance claim filed (if applicable)
```

---

## 🚨 Emergency Contacts & Resources

### **Professional Data Recovery Services**
```text
# DriveSavers (24/7 emergency service)
Phone: 1-800-440-1904
Web: https://www.drivesavers.com
Specialties: RAID, NAS, enterprise storage

# Ontrack Data Recovery
Phone: 1-800-872-2599
Web: https://www.ontrack.com
Specialties: Synology NAS, RAID arrays

# Secure Data Recovery Services
Phone: 1-800-388-1266
Web: https://www.securedatarecovery.com
Specialties: Water damage, physical damage
```

### **Synology Support**
```text
# Synology Technical Support
Phone: 1-425-952-7900 (US)
Email: support@synology.com
Web: https://www.synology.com/support
Hours: 24/7 for critical issues

# Synology Community
Forum: https://community.synology.com
Reddit: r/synology
Discord: Synology Community Server
```

### **Hardware Vendors**
```text
# Seagate Support (IronWolf Pro drives)
Phone: 1-800-732-4283
Web: https://www.seagate.com/support/
Warranty: https://www.seagate.com/support/warranty-and-replacements/

# Crucial Support (P310 SSDs)
Phone: 1-800-336-8896
Web: https://www.crucial.com/support
Warranty: https://www.crucial.com/support/warranty
```

---

## 🔄 Prevention & Monitoring

### **Proactive Monitoring Setup**
```bash
# Set up monitoring to prevent disasters:

# 1. SMART monitoring for all drives
# DSM > Storage Manager > Storage > HDD/SSD
# Enable SMART test scheduling

# 2. Temperature monitoring
# Drives report their own temperature through SMART, so start with those readings
# Set up alerts for overheating

# 3. UPS monitoring
# Install Network UPS Tools (NUT)
# Configure automatic shutdown

# 4. Backup verification
# Automated backup integrity checks
# Regular restore testing
```
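For the temperature item above, drive temperature can be read from SMART attributes without extra hardware. The sketch below parses the ATA attribute table printed by `smartctl -A` and compares the raw temperature with a limit you choose. `drive_temp` and `too_hot` are hypothetical helper names, the limit in the usage comment is only an example value, and NVMe drives print a different layout that this sketch does not handle.

```bash
# drive_temp
# Read `smartctl -A` output (ATA attribute table) on stdin and print the raw
# value of attribute 194 (Temperature_Celsius), falling back to attribute 190
# (Airflow_Temperature_Cel). Prints nothing if neither is present.
drive_temp() {
    awk '
        $1 == 194 { t194 = $10 }
        $1 == 190 { t190 = $10 }
        END {
            if (t194 != "")      print t194
            else if (t190 != "") print t190
        }
    '
}

# too_hot LIMIT
# Succeeds (exit 0) when the temperature read from stdin exceeds LIMIT.
too_hot() {
    local temp
    temp=$(drive_temp)
    [ -n "$temp" ] && [ "$temp" -gt "$1" ]
}

# Usage (50 is an example limit, not a vendor figure):
#   sudo smartctl -A /dev/sda | too_hot 50 && echo "ALERT: /dev/sda is running hot"
```

Run from a scheduled task, a check like this can feed whatever alerting channel the homelab already uses.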

### **Regular Maintenance Schedule**
```text
# Monthly tasks:
☐ Check drive health (SMART status)
☐ Verify backup integrity
☐ Test UPS functionality
☐ Update DSM and packages
☐ Run 007revad scripts if needed

# Quarterly tasks:
☐ Full system backup
☐ Configuration export
☐ Hardware inspection
☐ Update disaster recovery documentation
☐ Test recovery procedures

# Annually:
☐ Replace UPS batteries
☐ Review warranty status
☐ Update emergency contacts
☐ Disaster recovery drill
☐ Insurance policy review
```

---

**💡 Critical Reminder**: The current SSD cache failure on Atlantis requires immediate attention. Follow the emergency recovery procedure above to restore Volume1 access and prevent data loss.

**🔄 Update Status**: This document should be updated after resolving the current cache failure and installing the new Crucial P310 and Synology SNV5420 drives.

**📞 Emergency Protocol**: If you cannot resolve issues using this guide, contact professional data recovery services immediately. Time is critical for data preservation.