# ⚡ Performance Troubleshooting Guide

## Overview

This guide helps diagnose and resolve performance issues in your homelab, from slow containers to network bottlenecks and storage problems.

---

## 🔍 Quick Diagnostics Checklist

Before diving deep, run through this checklist:

```bash
# 1. Check system resources
htop                           # CPU, memory usage
docker stats                   # Container resource usage
df -h                          # Disk space
iostat -x 1 5                  # Disk I/O

# 2. Check network
iperf3 -c <target-ip>          # Network throughput
ping -c 10 <target>            # Latency
netstat -tulpn                 # Listening TCP/UDP ports and owning processes

# 3. Check containers
docker ps -a                   # Container status
docker logs <container> --tail 100  # Recent logs
```
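The disk-space check is easy to script. The sketch below is a hypothetical helper (the name `disks_over_threshold` and the 90% default are assumptions, not part of any tool) that filters `df -P` output. It does not handle mount points containing spaces.

```bash
# Print mount points whose usage is at or above a threshold (percent).
# Reads `df -P` output on stdin; the 90% default is an illustrative assumption.
disks_over_threshold() {
    awk -v limit="${1:-90}" 'NR > 1 {
        use = $5; sub(/%/, "", use)          # "93%" -> 93
        if (use + 0 >= limit) print $6, $5   # mount point, usage
    }'
}

# Usage:
# df -P | disks_over_threshold 85
```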

---

## 🐌 Slow Container Performance

### Symptoms
- Container takes a long time to respond
- High CPU usage by specific container
- Container restarts frequently

### Diagnosis

```bash
# Check container resource usage
docker stats <container_name>

# Check container logs for errors (2>&1 so the container's stderr is also searched)
docker logs --tail 200 <container_name> 2>&1 | grep -iE "error|warn|slow"

# Inspect container health
docker inspect <container_name> | jq '.[0].State'

# Check container processes
docker top <container_name>
```

### Common Causes & Solutions

#### 1. Memory Limits Too Low
```yaml
# docker-compose.yml - Increase memory limits
services:
  myservice:
    mem_limit: 2g        # Hard memory cap (unlimited if unset)
    memswap_limit: 4g    # Memory + swap combined, so this allows 2g of swap
```
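Before raising a limit, it helps to confirm that the container really was killed for exceeding it. A minimal sketch, assuming the Docker CLI is available; `was_oom_killed` is a hypothetical helper name, and it reads the `State.OOMKilled` field reported by `docker inspect`.

```bash
# Succeeds (exit status 0) if Docker recorded an out-of-memory kill
# for the given container's last run.
was_oom_killed() {
    [ "$(docker inspect --format '{{.State.OOMKilled}}' "$1")" = "true" ]
}

# Usage:
# was_oom_killed <container_name> && echo "OOM kill recorded: raise mem_limit"
```

If this reports no OOM kill, frequent restarts are more likely caused by the application itself, so check the exit code and logs first.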

#### 2. CPU Throttling
```yaml
# docker-compose.yml - Adjust CPU limits
services:
  myservice:
    cpus: '2.0'          # Allow up to 2 CPU cores
    cpu_shares: 2048     # Relative weight under contention (default is 1024)
```
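To see whether a CPU limit is actually throttling a container, the kernel's `cpu.stat` counters can be summarised. This is a sketch assuming a cgroup v2 host with a CPU limit set, where `nr_periods` and `nr_throttled` are present; `throttled_percent` is a hypothetical helper name.

```bash
# Print the percentage of scheduler periods in which the cgroup was throttled.
# Reads cgroup v2 `cpu.stat` content on stdin.
throttled_percent() {
    awk '$1 == "nr_periods"   { periods = $2 }
         $1 == "nr_throttled" { throttled = $2 }
         END {
             if (periods > 0) printf "%.1f\n", 100 * throttled / periods
             else print "0.0"
         }'
}

# Usage (path as seen from inside the container on a cgroup v2 host):
# docker exec <container_name> cat /sys/fs/cgroup/cpu.stat | throttled_percent
```

A value that stays well above zero under normal load suggests the `cpus` limit is too tight for the workload.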

#### 3. Storage I/O Bottleneck
```bash
# Check if container is doing heavy I/O
docker stats --format "table {{.Name}}\t{{.BlockIO}}"

# Solution: Move data to faster storage (NVMe cache, SSD)
```

#### 4. Database Performance
```bash
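# PostgreSQL slow queries (requires the pg_stat_statements extension;
# on PostgreSQL 12 and older the columns are mean_time and total_time)
docker exec -it postgres psql -U user -c "
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;"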

# Add indexes for slow queries
# Increase shared_buffers in postgresql.conf
```

---

## 🌐 Network Performance Issues

### Symptoms
- Slow file transfers between hosts
- High latency to services
- Buffering when streaming media

### Diagnosis

```bash
# Test throughput between hosts
iperf3 -s                      # On server
iperf3 -c <server-ip> -t 30    # On client

# Expected speeds:
# - 1GbE: ~940 Mbps
# - 2.5GbE: ~2.35 Gbps
# - 10GbE: ~9.4 Gbps

# Check for packet loss
ping -c 100 <target> | tail -3

# Check network interface errors
ip -s link show eth0
```
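The expected speeds listed above all work out to about 94% of the nominal link rate. The sketch below applies that ratio to an iperf3 result. `check_throughput` is a hypothetical helper, and the 10% tolerance below the expected value is an assumption to tune for your network.

```bash
# Classify a measured throughput against a link's nominal rate (both in Mbps).
# Expected = 94% of nominal; anything under 90% of expected is flagged LOW.
check_throughput() {
    awk -v measured="$1" -v nominal="$2" 'BEGIN {
        expected = nominal * 94 / 100
        minimum  = expected * 90 / 100
        status = (measured >= minimum) ? "OK" : "LOW"
        printf "%s: measured %d Mbps, expected ~%d Mbps\n", status, measured, expected
    }'
}

# Usage:
# check_throughput 930 1000
```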

### Common Causes & Solutions

#### 1. MTU Mismatch
```bash
# Check current MTU
ip link show | grep mtu

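# Test for MTU issues: 1472-byte payload + 28 bytes of IP/ICMP headers = 1500
# (-M do forbids fragmentation, so the ping fails if the path MTU is smaller)
ping -M do -c 4 -s 1472 <target>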

# Fix: Set consistent MTU across network
ip link set eth0 mtu 1500
```
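When the 1472-byte test fails, a binary search over payload sizes shows what the path actually supports. A minimal sketch, assuming Linux iputils `ping`; `probe_path_mtu` is a hypothetical helper, and the 1200 and 8972 bounds are assumptions (8972 is the payload for a 9000-byte jumbo frame).

```bash
# Find the largest ICMP payload that reaches a host unfragmented, then
# print the implied path MTU (payload + 28 bytes of IP and ICMP headers).
# Assumes a 1200-byte payload always passes; if it does not, the result is wrong.
probe_path_mtu() (
    target=$1
    low=1200
    high=8972
    while [ "$low" -lt "$high" ]; do
        mid=$(( (low + high + 1) / 2 ))
        if ping -M do -c 1 -W 1 -s "$mid" "$target" >/dev/null 2>&1; then
            low=$mid              # fits: search upward
        else
            high=$(( mid - 1 ))   # too big: search downward
        fi
    done
    echo $(( low + 28 ))
)

# Usage:
# probe_path_mtu <target>
```

A result below the MTU configured on your interfaces points to a device in the path with a smaller MTU.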

#### 2. Duplex/Speed Mismatch
```bash
# Check link speed
ethtool eth0 | grep -i speed

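# Pin the link to 1 Gbps full duplex (if auto-negotiation picks the wrong speed).
# Keep autoneg on: 1000BASE-T requires it, so "autoneg off" usually drops the link.
ethtool -s eth0 speed 1000 duplex full autoneg on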
```

#### 3. DNS Resolution Slow
```bash
# Test DNS resolution time
time dig google.com

# If slow, check /etc/resolv.conf
# Use local Pi-hole/AdGuard or fast upstream DNS
```

Fix in Docker:

```yaml
# docker-compose.yml
services:
  myservice:
    dns:
      - 192.168.1.x    # Local DNS (Pi-hole)
      - 1.1.1.1        # Fallback
```
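To compare resolvers directly, `dig` reports its own query time, which excludes process start-up overhead. A small sketch; `dns_query_ms` is a hypothetical helper that extracts the `Query time` line from `dig +noall +stats`.

```bash
# Print the query time in milliseconds that dig reports for a lookup.
# Extra arguments are passed to dig, e.g. @1.1.1.1 to pick a resolver.
dns_query_ms() {
    dig +noall +stats "$@" | awk '/Query time:/ { print $4 }'
}

# Usage:
# dns_query_ms google.com             # resolver from /etc/resolv.conf
# dns_query_ms google.com @1.1.1.1    # a specific upstream
```

Run each lookup a few times: the first result is usually a cache miss and the later ones show cached latency.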

#### 4. Tailscale Performance
```bash
# Check Tailscale connection type
tailscale status

# If using DERP relay (slow), check firewall
# Port 41641/UDP should be open for direct connections

# Check Tailscale latency
tailscale ping <device>
```

#### 5. Reverse Proxy Bottleneck
```bash
# Check Nginx Proxy Manager logs
docker logs nginx-proxy-manager --tail 100
```

Increase worker connections in `nginx.conf`:

```nginx
worker_processes auto;
events {
    worker_connections 4096;
}
```

---

## 💾 Storage Performance Issues

### Symptoms
- Slow read/write speeds
- High disk I/O wait
- Database queries timing out

### Diagnosis

```bash
# Check disk I/O statistics
iostat -xz 1 10

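# Key metrics:
# - %util > 90% = disk saturated (reliable for HDDs; SSD/NVMe serve requests
#   in parallel, so high %util alone does not prove saturation)
# - r_await / w_await > 20ms = slow disk (older sysstat shows a single await)
# - r/s, w/s = operations per second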

# Check for processes doing heavy I/O
iotop -o

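# Test disk speed (1 GiB in 1 MiB blocks; bs=1G would need a 1 GiB RAM buffer)
# Sequential write
dd if=/dev/zero of=/volume1/test bs=1M count=1024 oflag=direct

# Sequential read
dd if=/volume1/test of=/dev/null bs=1M count=1024 iflag=direct

# Remove the test file afterwards
rm /volume1/test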
```
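Because `iostat -x` column order differs between sysstat versions, a filter should locate `%util` by its header rather than by position. A sketch under that assumption; `busy_disks` is a hypothetical helper, and the 90% default mirrors the threshold above.

```bash
# Print devices whose %util is at or above a threshold (default 90).
# Reads `iostat -x` output on stdin and finds the %util column by header name.
busy_disks() {
    awk -v limit="${1:-90}" '
        $1 ~ /^Device/ {
            for (i = 1; i <= NF; i++) if ($i == "%util") col = i
            next
        }
        col && NF >= col && $col + 0 >= limit { print $1, $col }'
}

# Usage:
# iostat -xz 1 5 | busy_disks 90
```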

### Common Causes & Solutions

#### 1. HDD vs SSD/NVMe
```
Expected speeds:
- HDD (7200 RPM): 100-200 MB/s sequential
- SATA SSD: 500-550 MB/s
- NVMe SSD: 2000-7000 MB/s

# Move frequently accessed data to faster storage
# Use NVMe cache on Synology NAS
```

#### 2. RAID Rebuild in Progress
```bash
# Check Synology RAID status
cat /proc/mdstat

# During rebuild, expect 30-50% performance loss
# Wait for rebuild to complete
```
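The progress line in `/proc/mdstat` can be condensed into one line per array. A rough sketch, assuming the usual Linux md layout where a `recovery = N% ... finish=...` line follows the array header; `raid_rebuild_status` is a hypothetical helper name.

```bash
# Summarise rebuild/resync progress for each md array.
# Reads /proc/mdstat by default, or the file given as the first argument.
raid_rebuild_status() {
    awk '/^md[0-9]+ :/ { array = $1 }
         /(recovery|resync|reshape|check) *=/ {
             pct = ""; eta = "unknown"
             for (i = 1; i <= NF; i++) {
                 if ($i ~ /%$/)       { pct = $i; sub(/^=/, "", pct) }
                 if ($i ~ /^finish=/) { eta = substr($i, 8) }
             }
             print array, pct, "done, about", eta, "remaining"
         }' "${1:-/proc/mdstat}"
}

# Usage:
# raid_rebuild_status
```

No output means no array is currently rebuilding, resyncing, or being checked.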

#### 3. NVMe Cache Not Working
```bash
# On Synology, check cache status in DSM
# Storage Manager > SSD Cache

# Common issues:
# - Cache full (increase size or add more SSDs)
# - Wrong cache mode (read-only vs read-write)
# - Cache disabled after DSM update
```

#### 4. SMB/NFS Performance
```bash
# Test SMB performance
smbclient //nas/share -U user -c "put largefile.bin"
```

Commonly suggested `smb.conf` tuning. Note that `read raw`, `write raw`, and `max xmit` only apply to SMB1, so they have no effect on SMB2/SMB3 clients:

```ini
socket options = TCP_NODELAY IPTOS_LOWDELAY
read raw = yes
write raw = yes
max xmit = 65535
```

```bash
# For NFS, use NFSv4.1 with larger rsize/wsize
mount -t nfs -o vers=4.1,rsize=1048576,wsize=1048576 nas:/share /mnt
```

#### 5. Docker Volume Performance
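```bash
# Check volume driver
docker volume inspect <volume>

# For better performance, use:
# - Bind mounts to place large datasets on a specific fast disk
#   (on Linux, a named volume with the default local driver performs the same;
#   it is just stored under Docker's data directory)
# - Local SSD for database volumes
```

```yaml
# docker-compose.yml (under the service definition)
volumes:
  - /fast-ssd/postgres:/var/lib/postgresql/data
```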

---

## 📺 Media Streaming Performance

### Symptoms
- Buffering during playback
- Transcoding takes too long
- Multiple streams cause stuttering

### Plex/Jellyfin Optimization

```bash
# Check transcoding status
# Plex: Settings > Dashboard > Now Playing
# Jellyfin: Dashboard > Active Streams

# Enable hardware transcoding
# Plex: Settings > Transcoder > Hardware Acceleration
# Jellyfin: Dashboard > Playback > Transcoding

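# For Intel QuickSync (Synology), pass the GPU device into the container.
# Comments cannot follow a trailing backslash, so they go above the command.
docker run -d \
  --device /dev/dri:/dev/dri \
  -e PLEX_CLAIM="claim-xxx" \
  plexinc/pms-docker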
```

### Direct Play vs Transcoding
```
Performance comparison:
- Direct Play: ~5-20 Mbps per stream (negligible CPU usage)
- Transcoding: ~2000-4000 PassMark CPU score per 1080p stream (software transcoding)

# Optimize for Direct Play:
# 1. Use compatible codecs (H.264, AAC)
# 2. Match client capabilities
# 3. Disable transcoding for local clients
```

### Multiple Concurrent Streams
```
These figures imply roughly 115-120 Mbps per 4K stream:
10GbE can handle: ~80 concurrent 4K streams (theoretical)
1GbE can handle: ~8 concurrent 4K streams

# If hitting limits:
# 1. Reduce stream quality for remote users
# 2. Enable bandwidth limits per user
# 3. Upgrade network infrastructure
```
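The stream counts above can be reproduced with simple arithmetic. In this sketch, the 94% usable-capacity factor and the 117 Mbps per-stream bitrate are assumptions inferred from those figures, not measured values, and `max_streams` is a hypothetical helper.

```bash
# Estimate how many streams of a given bitrate fit on a link.
# Arguments: nominal link rate in Mbps, per-stream bitrate in Mbps.
max_streams() (
    link_mbps=$1
    stream_mbps=$2
    usable=$(( link_mbps * 94 / 100 ))   # assume ~94% of line rate is usable
    echo $(( usable / stream_mbps ))
)

# Usage:
# max_streams 1000 117     # 1GbE
# max_streams 10000 117    # 10GbE
```

Substitute the real bitrate of your media; typical streams are often far below 117 Mbps, so the practical limit is usually storage or transcoding rather than the network.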

---

## 🖥️ Synology NAS Performance

### Check System Health
```bash
# SSH into Synology
ssh admin@nas

# Check CPU/memory
top

# Check storage health
cat /proc/mdstat
syno_hdd_util --all

# Check Docker performance
docker stats
```

### Common Synology Issues

#### 1. Indexing Slowing System
```bash
# Check if Synology is indexing
ps aux | grep -i index

# Temporarily stop indexing (DSM 6; DSM 7 replaced synoservicectl with synosystemctl)
synoservicectl --stop synoindexd

# Or schedule indexing for off-hours
# Control Panel > Indexing Service > Schedule
```

#### 2. Snapshot Replication Running
```bash
# Check running tasks
synoschedtask --list

# Schedule snapshots during low-usage hours
```

#### 3. Antivirus Scanning
```bash
# Disable real-time scanning or schedule scans
# Security Advisor > Advanced > Scheduled Scan
```

#### 4. Memory Pressure
```bash
# Check memory usage
free -h

# If low on RAM, consider:
# - Adding more RAM (DS1823xs+ supports up to 32GB)
# - Reducing number of running containers
# - Disabling unused packages
```
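`free -h` is easy to misread because cached memory counts as used. A sketch that computes the same ratio as the `HighMemoryUsage` alert in this guide, directly from `/proc/meminfo`; `mem_used_percent` is a hypothetical helper name.

```bash
# Print memory in use as a percentage: (1 - MemAvailable / MemTotal) * 100.
# Reads /proc/meminfo by default, or the file given as the first argument.
mem_used_percent() {
    awk '$1 == "MemTotal:"     { total = $2 }
         $1 == "MemAvailable:" { avail = $2 }
         END { if (total > 0) printf "%.0f\n", (1 - avail / total) * 100 }' \
        "${1:-/proc/meminfo}"
}

# Usage:
# mem_used_percent
```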

---

## 📊 Monitoring for Performance

### Set Up Prometheus Alerts

```yaml
# prometheus/rules/performance.yml
groups:
  - name: performance
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: warning

      - alert: DiskIOHigh
        expr: rate(node_disk_io_time_seconds_total[5m]) > 0.9
        for: 10m
        labels:
          severity: warning

      - alert: NetworkErrors
        expr: rate(node_network_receive_errs_total[5m]) > 10
        for: 5m
        labels:
          severity: warning
```

### Grafana Dashboard Panels

Key metrics to monitor:
- CPU usage by core
- Memory usage and swap
- Disk I/O latency (await)
- Network throughput and errors
- Container resource usage
- Docker volume I/O

---

## 🛠️ Performance Tuning Checklist

### System Level
- [ ] Kernel parameters optimized (`/etc/sysctl.conf`)
- [ ] Disk scheduler appropriate for workload (`none` or `mq-deadline` for SSD/NVMe)
- [ ] Swap configured appropriately
- [ ] File descriptor limits increased

### Docker Level
- [ ] Container resource limits set
- [ ] Logging driver configured (json-file with max-size)
- [ ] Unused containers/images removed
- [ ] Volumes on appropriate storage

### Network Level
- [ ] Jumbo frames enabled (if supported)
- [ ] DNS resolution fast
- [ ] Firewall rules optimized
- [ ] Quality of Service (QoS) configured

### Application Level
- [ ] Database indexes optimized
- [ ] Caching enabled (Redis/Memcached)
- [ ] Connection pooling configured
- [ ] Static assets served efficiently

---

## 🔗 Related Documentation

- [Network Performance Tuning](../infrastructure/network-performance-tuning.md)
- [Monitoring Setup](../admin/monitoring.md)
- [Common Issues](common-issues.md)
- [10GbE Backbone](../diagrams/10gbe-backbone.md)
- [Storage Topology](../diagrams/storage-topology.md)