# 🚨 Synology NAS Disaster Recovery Guide
**🔴 Critical Emergency Procedures**
This guide covers critical disaster recovery scenarios specific to Synology NAS systems, with detailed procedures for the DS1823xs+ and related hardware failures. These procedures can save your data and minimize downtime.
## 🎯 Critical Scenarios Covered
1. **💾 SSD Cache Failure** - Current critical issue with Atlantis
2. **🔥 Complete NAS Failure** - Hardware replacement procedures
3. **⚡ Power Surge Damage** - Recovery from electrical damage
4. **🌊 Water/Physical Damage** - Emergency data extraction
5. **🔒 Encryption Key Loss** - Encrypted volume recovery
6. **📦 DSM Corruption** - Operating system recovery
---
## 💾 SSD Cache Failure Recovery (CURRENT CRITICAL ISSUE)
### **🚨 Current Situation: Atlantis DS1823xs+**
```bash
# CRITICAL STATUS:
# - SSD cache corrupted after DSM update
# - Volume1 is OFFLINE due to cache failure
# - 2x WD Black SN750 SE 500GB drives affected
# - All Docker services down
# - Immediate action required
# Symptoms:
# - Volume1 shows as "Crashed" in Storage Manager
# - SSD cache shows errors or corruption
# - Services fail to start
# - Data appears inaccessible
```
### **⚡ Emergency Recovery Procedure**
#### **Step 1: Immediate Assessment (5 minutes)**
```bash
# SSH into Atlantis
ssh admin@atlantis.vish.local
# or via Tailscale IP
ssh admin@100.83.230.112
# Check system status
sudo -i
cat /proc/mdstat
df -h
dmesg | tail -50
# Check volume status
synodisk --enum
synovolume --enum
```
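The `/proc/mdstat` output from Step 1 can be long on an 8-bay unit. The following is a minimal sketch, not part of the original procedure, of a helper that lists only arrays that look inactive or degraded. The function name `mdstat_problems` is hypothetical, and it assumes the standard Linux md status format (a `[UU_]` style member map where `_` marks a missing member).

```bash
# Hypothetical helper: print md arrays that are inactive or degraded.
# Reads /proc/mdstat by default; a saved copy can be passed as the first argument.
mdstat_problems() {
  local file="${1:-/proc/mdstat}"
  awk '
    # Array header line, e.g. "md2 : active raid5 sda3[0] sdb3[1]"
    /^md[0-9]+ :/ {
      name = $1
      if ($3 == "inactive") print name ": inactive"
    }
    # Member status map, e.g. "[UU_]"; an underscore marks a missing member
    /\[[U_]*_[U_]*\]/ { print name ": degraded" }
  ' "$file"
}
```

Run `mdstat_problems` with no arguments on the NAS; empty output means no array matched either pattern.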
#### **Step 2: Disable SSD Cache (10 minutes)**
```bash
# CRITICAL: Removing the corrupted cache should restore Volume1 access
# Navigate via web interface:
# 1. DSM > Storage Manager
# 2. Storage > SSD Cache
# 3. Select corrupted cache
# 4. Click "Remove" or "Disable"
# 5. Confirm removal (data on the volume is normally preserved; a read-write cache may still hold writes that were never flushed to disk)
# Alternative via SSH (if web interface fails):
echo 'Disabling SSD cache via command line...'
# Note: Exact commands vary by DSM version
# Consult Synology documentation for CLI cache management
```
#### **Step 3: Verify Volume1 Recovery (5 minutes)**
```bash
# Check if Volume1 is back online
df -h | grep volume1
ls -la /volume1/
# Report the result based on whether /volume1 is actually mounted
if grep -q ' /volume1 ' /proc/mounts; then
  echo "✅ Volume1 recovered successfully"
else
  echo "❌ Volume1 still offline - proceed to advanced recovery"
fi
```
#### **Step 4: Emergency Data Backup (30-60 minutes)**
```bash
# IMMEDIATELY backup critical data once Volume1 is accessible
# Create the backup target first so rsync/cp do not fail on a missing directory
mkdir -p /volume2/emergency-backup/db-backups
# Priority order:
# 1. Docker configurations (highest priority)
rsync -av /volume1/docker/ /volume2/emergency-backup/docker-$(date +%Y%m%d)/
tar -czf /volume2/emergency-backup/docker-configs-$(date +%Y%m%d).tar.gz /volume1/docker/
# 2. Critical documents
rsync -av /volume1/documents/ /volume2/emergency-backup/documents-$(date +%Y%m%d)/
# 3. Database backups (files are flattened into one directory, so identical names overwrite each other)
find /volume1/docker -name "*backup*" -type f -exec cp {} /volume2/emergency-backup/db-backups/ \;
# 4. Configuration files
cp -r /volume1/homelab/ /volume2/emergency-backup/homelab-$(date +%Y%m%d)/
# Record checksums of the backed-up files for later verification
echo "Recording backup checksums..."
find /volume2/emergency-backup/ -type f ! -name 'checksums-*.md5' -exec md5sum {} \; > /volume2/emergency-backup/checksums-$(date +%Y%m%d).md5
```
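Step 4 writes a checksum file but does not show how to check a backup against it later. The sketch below is one possible way to do that, assuming GNU-style `md5sum` and a manifest stored outside the directory being hashed. The function names `write_checksums` and `verify_checksums` are hypothetical.

```bash
# Hypothetical helpers: record and later verify checksums for a backup directory.
# Paths in the manifest are relative to the directory, so the backup can be
# moved or restored elsewhere and still be verified.
write_checksums() {
  local dir="$1" manifest="$2"
  # Keep the manifest outside "$dir" so it is not hashed while being written
  ( cd "$dir" && find . -type f -exec md5sum {} + ) > "$manifest"
}

verify_checksums() {
  local dir="$1" manifest="$2"
  # md5sum -c exits non-zero if any file is missing or its hash differs
  ( cd "$dir" && md5sum -c - ) < "$manifest"
}
```

For example, `write_checksums /volume2/emergency-backup /volume2/emergency-backup.md5` right after the backup, then `verify_checksums` with the same arguments before relying on it.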
#### **Step 5: Remove Failed SSD Drives (15 minutes)**
```bash
# Physical removal of corrupted SSD drives
# 1. Shutdown Atlantis safely
sudo shutdown -h now
# 2. Wait for complete shutdown (LED off)
# 3. Remove power cable
# 4. Open NAS case
# 5. Remove both WD Black SN750 SE drives from M.2 slots
# 6. Close case and reconnect power
# 7. Power on and verify system boots normally
# After boot, verify no SSD cache references remain
# DSM > Storage Manager > Storage > SSD Cache
# Should show "No SSD cache configured"
```
### **🔧 Permanent Solution: New NVMe Installation**
#### **Hardware Installation (When New Drives Arrive)**
```bash
# New hardware to install:
# - 2x Crucial P310 1TB (CT1000P310SSD801)
# - 1x Synology SNV5420-400G
# Installation procedure:
# 1. Power down Atlantis
# 2. Install Crucial P310 drives in M.2 slots 1 & 2
# 3. Install Synology SNV5420 in E10M20-T1 card M.2 slot
# 4. Power on and wait for drive recognition
```
#### **007revad Script Configuration**
```bash
# After hardware installation, run 007revad scripts
cd /volume1/homelab/synology_scripts/
# 1. Enable M.2 volume support (success message prints only if the script exits cleanly)
cd 007revad_enable_m2/
sudo ./syno_enable_m2_volume.sh && echo "✅ M.2 volume support enabled"
# 2. Create M.2 volumes
cd ../007revad_m2_volume/
sudo ./syno_m2_volume.sh && echo "✅ M.2 volumes created"
# 3. Update HDD database (for IronWolf Pro drives)
cd ../007revad_hdd_db/
sudo ./syno_hdd_db.sh && echo "✅ HDD database updated"
```
#### **New Cache Configuration**
```bash
# Configure new SSD cache with Crucial P310 drives
# DSM > Storage Manager > Storage > SSD Cache
# Recommended configuration:
# - Cache Type: Read-Write cache
# - RAID Type: RAID 1 (for redundancy)
# - Drives: Both Crucial P310 1TB drives
# - Skip data consistency check: NO (ensure integrity)
# Synology SNV5420 usage:
# - Use as separate high-performance volume
# - Ideal for Docker containers requiring high IOPS
# - Configure as Volume3 for critical services
```
---
## 🔥 Complete NAS Hardware Failure
### **Emergency Data Extraction**
```bash
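# If NAS won't boot but drives are intact
# Use Linux PC for data recovery
# 1. Remove drives from failed NAS
# 2. Connect drives to Linux system via USB adapters
# 3. Install mdadm and lvm2 for RAID recovery (Synology volumes often sit on LVM)
sudo apt update && sudo apt install mdadm lvm2
# 4. Scan for RAID arrays and activate any LVM volume groups
sudo mdadm --assemble --scan
sudo mdadm --detail --scan
sudo vgchange -ay
cat /proc/mdstat
sudo lvs
# 5. Mount recovered volumes read-only
# On Synology, md0 and md1 are normally the DSM system and swap partitions;
# the data volume is usually a higher-numbered md device or an LVM logical volume.
# Set DATA_DEV to the device identified above (the value below is only an example).
DATA_DEV=/dev/vg1/volume_1
sudo mkdir -p /mnt/synology-recovery
sudo mount -o ro "$DATA_DEV" /mnt/synology-recovery
# 6. Copy critical data
mkdir -p ~/synology-recovery
rsync -av /mnt/synology-recovery/docker/ ~/synology-recovery/docker/
rsync -av /mnt/synology-recovery/documents/ ~/synology-recovery/documents/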
```
### **NAS Replacement Procedure**
```bash
# Complete DS1823xs+ replacement
# Step 1: Order identical replacement
# - Same model: DS1823xs+
# - Same RAM configuration: 32GB DDR4 ECC
# - Same expansion cards: E10M20-T1
# Step 2: Drive migration
# - Remove all drives from old unit
# - Note drive bay positions (critical!)
# - Install drives in new unit in EXACT same order
# - Install M.2 drives in same slots
# Step 3: First boot
# - Power on new NAS
# - DSM will detect existing configuration
# - Follow migration wizard
# - Do NOT initialize drives (will erase data)
# Step 4: Configuration restoration
# - Restore DSM configuration from backup
# - Reinstall packages and applications
# - Run 007revad scripts
# - Verify all services operational
```
---
## ⚡ Power Surge Recovery
### **Assessment Procedure**
```bash
# After power surge or electrical event
# Step 1: Visual inspection
# - Check for burn marks on power adapter
# - Inspect NAS case for damage
# - Look for LED indicators
# Step 2: Controlled power-on test
# - Use different power outlet
# - Connect only essential cables
# - Power on and observe boot sequence
# Step 3: Component testing
# If NAS powers on:
# - Check all drive recognition
# - Verify network connectivity
# - Test all expansion cards
# If NAS doesn't power on:
# - Try different power adapter (if available)
# - Check fuses in power adapter
# - Consider professional repair
```
### **Data Protection After Surge**
```bash
# If NAS boots but shows errors:
# 1. Immediate backup
# Priority: Get data off potentially damaged system
rsync -av /volume1/critical/ /external-backup/
# 2. Drive health check
# Check all drives for damage
sudo smartctl -a /dev/sda
sudo smartctl -a /dev/sdb
# Repeat for all drives
# 3. Memory test
# Run memory diagnostic if available
# Check for ECC errors in logs
# 4. Replace damaged components
# Order replacements for any failed components
# Consider UPS installation to prevent future damage
```
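"Repeat for all drives" can be scripted. The sketch below is an illustration only: it loops over the device paths you pass in and reports the overall SMART health verdict for each. The function name `check_drive_health` and the `SMARTCTL` override variable are hypothetical, and the check assumes `smartctl -H` prints a line containing `PASSED` for a healthy drive. Device names differ between DSM versions, so pass the paths that exist on your system.

```bash
# Hypothetical helper: report overall SMART health for each device given.
# SMARTCTL can be overridden (for example with "sudo smartctl" wrapped in a
# function) when the default binary name is not suitable.
SMARTCTL="${SMARTCTL:-smartctl}"

check_drive_health() {
  local dev failed=0
  for dev in "$@"; do
    # smartctl -H prints the overall health self-assessment result
    if "$SMARTCTL" -H "$dev" | grep -q 'PASSED'; then
      echo "OK   $dev"
    else
      echo "FAIL $dev"
      failed=1
    fi
  done
  # Non-zero exit status if any drive did not report PASSED
  return "$failed"
}
```

Example: `check_drive_health /dev/sda /dev/sdb`. Any `FAIL` line warrants a full `smartctl -a` review of that drive.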
---
## 🌊 Water/Physical Damage Recovery
### **Emergency Response (First 30 minutes)**
```bash
# If NAS exposed to water or physical damage:
# IMMEDIATE ACTIONS:
# 1. POWER OFF IMMEDIATELY - do not attempt to boot
# 2. Disconnect all cables
# 3. Remove drives if possible
# 4. Do not attempt to power on
# Drive preservation:
# - Place drives in anti-static bags
# - Store in dry, cool location
# - Do not attempt to clean or dry
# - Contact professional recovery service if needed
```
### **Professional Recovery Decision**
```bash
# When to contact professional data recovery:
# - Water damage to drives
# - Physical damage to drive enclosures
# - Clicking or grinding noises from drives
# - Drives not recognized by any system
# - Critical data with no backup
# Professional services:
# - DriveSavers: 1-800-440-1904
# - Ontrack: 1-800-872-2599
# - Secure Data Recovery: 1-800-388-1266
# Cost considerations:
# - $500-$5000+ depending on damage
# - Success not guaranteed
# - Weigh cost vs. data value
```
---
## 🔒 Encryption Key Recovery
### **Encrypted Volume Access**
```bash
# If encryption key is lost or corrupted:
# Step 1: Locate backup keys
# Check these locations:
# - Password manager (Vaultwarden)
# - Physical key backup (if created)
# - Email notifications from Synology
# - Configuration backup files
# Step 2: Key recovery attempt
# DSM > Control Panel > Shared Folder
# Select encrypted folder > Edit > Security
# Try "Recover" option with backup key
# Step 3: If no backup key exists:
# Data is likely unrecoverable without professional help
# Synology uses strong encryption - no backdoors
# Consider professional cryptographic recovery services
```
### **Prevention for Future**
```bash
# Create encryption key backup NOW:
# 1. DSM > Control Panel > Shared Folder
# 2. Select encrypted folder > Edit > Security
# 3. Export encryption key
# 4. Store in multiple secure locations:
# - Password manager
# - Physical printout in safe
# - Encrypted cloud storage
# - Secondary NAS location
```
---
## 📦 DSM Operating System Recovery
### **DSM Corruption Recovery**
```bash
# If DSM won't boot or is corrupted:
# Step 1: Download DSM installer
# From Synology website:
# - Find your exact model (DS1823xs+)
# - Download latest DSM .pat file
# - Save to computer
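# Step 2: Synology Assistant recovery
# 1. Install Synology Assistant on computer
# 2. Connect NAS and computer to same network
# 3. With the NAS powered on, press and hold the reset button for about 4 seconds until it beeps, then release
# 4. Within 10 seconds, press and hold reset again for about 4 seconds until it beeps 3 times; the STATUS LED then blinks orange
# 5. Use Synology Assistant to reinstall DSM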
# Step 3: Configuration restoration
# After DSM reinstall:
# - Restore from configuration backup
# - Reinstall packages
# - Reconfigure services
# - Run 007revad scripts
```
### **Manual DSM Installation**
```bash
# If Synology Assistant fails:
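# 1. Access recovery mode
# - Make sure the NAS is powered on
# - Press and hold the reset button for about 4 seconds until it beeps, then release
# - Within 10 seconds, press and hold reset again for about 4 seconds until it beeps 3 times
# - The STATUS LED blinks orange when the NAS is ready for reinstallation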
# 2. Web interface recovery
# - Open browser to NAS IP address
# - Should show recovery interface
# - Upload DSM .pat file
# - Follow installation wizard
# 3. Data preservation
# - Choose "Keep existing data" if option appears
# - Do not format drives unless absolutely necessary
# - Existing volumes should be preserved
```
---
## 🛠️ 007revad Scripts for Disaster Recovery
### **Post-Recovery Script Execution**
```bash
# After any hardware replacement or DSM reinstall:
# 1. Download/update scripts
cd /volume1/homelab/synology_scripts/
git pull origin main # Update to latest versions
# 2. HDD Database Update (for IronWolf Pro drives)
cd 007revad_hdd_db/
sudo ./syno_hdd_db.sh
# Ensures Seagate IronWolf Pro drives are properly recognized
# Prevents compatibility warnings
# Enables full SMART monitoring
# 3. Enable M.2 Volume Support
cd ../007revad_enable_m2/
sudo ./syno_enable_m2_volume.sh
# Re-enables M.2 volume creation after DSM updates
# Required after any DSM reinstall
# Fixes DSM limitations on M.2 usage
# 4. Create M.2 Volumes
cd ../007revad_m2_volume/
sudo ./syno_m2_volume.sh
# Creates storage volumes on M.2 drives
# Allows M.2 drives to be used for more than just cache
# Essential for high-performance storage setup
```
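When the scripts above are run by hand, it is easy to miss that one of them failed. The following is a small sketch of a wrapper that runs one step and prints a pass or fail line based on its exit status. The name `run_step` is hypothetical and is not part of the 007revad scripts.

```bash
# Hypothetical helper: run one recovery step and report its outcome.
# Usage: run_step "label" command [args...]
run_step() {
  local label="$1" rc=0
  shift
  # Capture the exit status without aborting the calling shell
  "$@" || rc=$?
  if [ "$rc" -eq 0 ]; then
    echo "✅ $label"
  else
    echo "❌ $label (exit $rc)"
  fi
  return "$rc"
}
```

For example, `run_step "HDD database updated" sudo ./syno_hdd_db.sh` from inside the script's directory. Chaining several calls with `&&` stops the sequence at the first failure.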
### **Script Automation for Recovery**
```bash
# Create automated recovery script
mkdir -p /volume1/homelab/scripts
cat > /volume1/homelab/scripts/post-recovery-setup.sh << 'EOF'
#!/bin/bash
# Post-disaster recovery automation script
# Stop at the first failing step instead of continuing with a half-configured system
set -euo pipefail
echo "🚀 Starting post-recovery setup..."
# Update 007revad scripts
cd /volume1/homelab/synology_scripts/
git pull origin main
# Run HDD database update
echo "📀 Updating HDD database..."
cd 007revad_hdd_db/
sudo ./syno_hdd_db.sh
# Enable M.2 volumes
echo "💾 Enabling M.2 volume support..."
cd ../007revad_enable_m2/
sudo ./syno_enable_m2_volume.sh
# Create M.2 volumes
echo "🔧 Creating M.2 volumes..."
cd ../007revad_m2_volume/
sudo ./syno_m2_volume.sh
# Restart Docker services
# Assumption: on DSM, Docker runs as a package ("ContainerManager" on DSM 7.2+,
# "Docker" on earlier releases), so it is restarted with synopkg rather than systemctl
echo "🐳 Restarting Docker services..."
sudo synopkg restart ContainerManager || sudo synopkg restart Docker
# Verify services
echo "✅ Verifying critical services..."
docker ps | grep -E "(plex|grafana|vaultwarden)" || echo "⚠️ No critical services running yet"
echo "🎉 Post-recovery setup complete!"
EOF
chmod +x /volume1/homelab/scripts/post-recovery-setup.sh
```
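The `docker ps | grep` check in the script shows which critical containers are running, but not which ones are missing. As a hedged sketch, the helper below reads running container names on stdin and prints every expected name that does not appear. The function name `missing_services` is hypothetical, and matching is a simple case-insensitive substring test, so `grafana` also matches a container named `grafana-1`.

```bash
# Hypothetical helper: list expected services that are not running.
# Reads running container names on stdin, one per line;
# expected service names are passed as arguments.
missing_services() {
  local running expected
  running="$(cat)"
  for expected in "$@"; do
    # Case-insensitive fixed-string match against the running list
    printf '%s\n' "$running" | grep -qiF -- "$expected" || echo "$expected"
  done
}
```

Example: `docker ps --format '{{.Names}}' | missing_services plex grafana vaultwarden`. Empty output means every listed service has at least one matching container.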
---
## 📋 Recovery Checklists
### **🚨 SSD Cache Failure Checklist**
```text
☐ SSH access to NAS confirmed
☐ Volume status assessed
☐ SSD cache disabled/removed
☐ Volume1 accessibility verified
☐ Emergency backup completed
☐ Failed SSD drives physically removed
☐ System stability confirmed
☐ New drives ordered (if needed)
☐ 007revad scripts prepared
☐ Recovery procedure documented
```
### **🔥 Complete NAS Failure Checklist**
```text
☐ Damage assessment completed
☐ Drives safely removed
☐ Drive order documented
☐ Replacement NAS ordered
☐ Data recovery attempted (if needed)
☐ New NAS configured
☐ Drives installed in correct order
☐ Configuration restored
☐ 007revad scripts executed
☐ All services verified operational
```
### **⚡ Power Surge Recovery Checklist**
```text
☐ Visual damage inspection completed
☐ Power adapter tested/replaced
☐ Controlled power-on test performed
☐ Drive health checks completed
☐ Memory diagnostics run
☐ Network connectivity verified
☐ UPS installation planned
☐ Surge protection upgraded
☐ Insurance claim filed (if applicable)
```
---
## 🚨 Emergency Contacts & Resources
### **Professional Data Recovery Services**
```text
# DriveSavers (24/7 emergency service)
Phone: 1-800-440-1904
Web: https://www.drivesavers.com
Specialties: RAID, NAS, enterprise storage
# Ontrack Data Recovery
Phone: 1-800-872-2599
Web: https://www.ontrack.com
Specialties: Synology NAS, RAID arrays
# Secure Data Recovery Services
Phone: 1-800-388-1266
Web: https://www.securedatarecovery.com
Specialties: Water damage, physical damage
```
### **Synology Support**
```text
# Synology Technical Support
Phone: 1-425-952-7900 (US)
Email: support@synology.com
Web: https://www.synology.com/support
Hours: 24/7 for critical issues
# Synology Community
Forum: https://community.synology.com
Reddit: r/synology
Discord: Synology Community Server
```
### **Hardware Vendors**
```text
# Seagate Support (IronWolf Pro drives)
Phone: 1-800-732-4283
Web: https://www.seagate.com/support/
Warranty: https://www.seagate.com/support/warranty-and-replacements/
# Crucial Support (P310 SSDs)
Phone: 1-800-336-8896
Web: https://www.crucial.com/support
Warranty: https://www.crucial.com/support/warranty
```
---
## 🔄 Prevention & Monitoring
### **Proactive Monitoring Setup**
```bash
# Set up monitoring to prevent disasters:
# 1. SMART monitoring for all drives
# DSM > Storage Manager > Storage > HDD/SSD
# Enable SMART test scheduling
# 2. Temperature monitoring
# Install temperature sensors
# Set up alerts for overheating
# 3. UPS monitoring
# Install Network UPS Tools (NUT)
# Configure automatic shutdown
# 4. Backup verification
# Automated backup integrity checks
# Regular restore testing
```
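One simple automated check for item 4 is whether a backup directory has been written to recently. The sketch below is an assumption-laden illustration, not a full integrity check: the function name `backup_is_fresh` is hypothetical, and it only tests whether any file under the directory was modified within the given number of days.

```bash
# Hypothetical helper: succeed if any file under DIR was modified within MAX_DAYS.
# Usage: backup_is_fresh DIR [MAX_DAYS]   (MAX_DAYS defaults to 1)
backup_is_fresh() {
  local dir="$1" max_days="${2:-1}"
  # find -mtime -N matches files modified less than N days ago
  [ -n "$(find "$dir" -type f -mtime "-$max_days" 2>/dev/null | head -n 1)" ]
}
```

Example: `backup_is_fresh /volume2/emergency-backup 7 || echo "⚠️ No backup written in the last 7 days"`. This could run from a scheduled task, alongside periodic restore tests.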
### **Regular Maintenance Schedule**
```text
# Monthly tasks:
☐ Check drive health (SMART status)
☐ Verify backup integrity
☐ Test UPS functionality
☐ Update DSM and packages
☐ Run 007revad scripts if needed
# Quarterly tasks:
☐ Full system backup
☐ Configuration export
☐ Hardware inspection
☐ Update disaster recovery documentation
☐ Test recovery procedures
# Annually:
☐ Replace UPS batteries
☐ Review warranty status
☐ Update emergency contacts
☐ Disaster recovery drill
☐ Insurance policy review
```
---
**💡 Critical Reminder**: The current SSD cache failure on Atlantis requires immediate attention. Follow the emergency recovery procedure above to restore Volume1 access and prevent data loss.
**🔄 Update Status**: This document should be updated after resolving the current cache failure and installing the new Crucial P310 and Synology SNV5420 drives.
**📞 Emergency Protocol**: If you cannot resolve issues using this guide, contact professional data recovery services immediately. Time is critical for data preservation.