Sanitized mirror from private repository - 2026-04-18 11:13:17 UTC
# ⚡ Performance Troubleshooting Guide

## Overview

This guide helps diagnose and resolve performance issues in your homelab, from slow containers to network bottlenecks and storage problems.

---

## 🔍 Quick Diagnostics Checklist

Before diving deep, run through this checklist:

```bash
# 1. Check system resources
htop                                  # CPU, memory usage
docker stats                          # Container resource usage
df -h                                 # Disk space
iostat -x 1 5                         # Disk I/O

# 2. Check network
iperf3 -c <target-ip>                 # Network throughput
ping -c 10 <target>                   # Latency
netstat -tulpn                        # Open ports/connections

# 3. Check containers
docker ps -a                          # Container status
docker logs <container> --tail 100    # Recent logs
```
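
The disk-space step is easy to turn into a filter. The sketch below is one possible way to flag nearly full filesystems; the function name `disks_over_threshold` and the 90% default are illustrative choices, not part of any tool. It reads `df -P` output on stdin and assumes filesystem names and mount points contain no spaces.

```bash
#!/usr/bin/env bash
# Print mount points whose usage meets or exceeds a threshold (default 90%).
# Reads `df -P` output on stdin so the parsing can be checked without real disks.
disks_over_threshold() {
    local threshold="${1:-90}"
    awk -v limit="$threshold" '
        NR == 1 { next }              # skip the header row
        {
            use = $5                  # Capacity column, e.g. "93%"
            sub(/%$/, "", use)        # "93%" -> "93"
            if (use + 0 >= limit) { print $6 " " use "%" }
        }'
}

# Live use on a host:
#   df -P | disks_over_threshold 85
```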

---

## 🐌 Slow Container Performance

### Symptoms

- Container takes a long time to respond
- High CPU usage by a specific container
- Container restarts frequently

### Diagnosis

```bash
# Check container resource usage
docker stats <container_name>

# Check container logs for errors
# (2>&1 is needed because docker logs replays the container's stderr on stderr)
docker logs <container_name> --tail 200 2>&1 | grep -iE "error|warn|slow"

# Inspect container health
docker inspect <container_name> | jq '.[0].State'

# Check container processes
docker top <container_name>
```
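
To scan every container at once rather than one by one, the `docker stats` output can be filtered. This is a minimal sketch: `hot_containers` and its 80%/90% defaults are hypothetical, and the input is assumed to be one `<name> <cpu%> <mem%>` line per container.

```bash
#!/usr/bin/env bash
# Flag containers whose CPU or memory percentage exceeds the given limits.
# Expects "<name> <cpu%> <mem%>" lines on stdin, for example from:
#   docker stats --no-stream --format '{{.Name}} {{.CPUPerc}} {{.MemPerc}}'
hot_containers() {
    local cpu_limit="${1:-80}" mem_limit="${2:-90}"
    awk -v cpu_limit="$cpu_limit" -v mem_limit="$mem_limit" '
        {
            cpu = $2; mem = $3
            gsub(/%/, "", cpu); gsub(/%/, "", mem)   # strip the percent signs
            if (cpu + 0 > cpu_limit || mem + 0 > mem_limit) {
                print $1 " cpu=" cpu "% mem=" mem "%"
            }
        }'
}
```

Note that the CPU percentage is relative to a single core, so a busy multi-threaded container can legitimately report more than 100%.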

### Common Causes & Solutions

#### 1. Memory Limits Too Low

```yaml
# docker-compose.yml - Increase memory limits
services:
  myservice:
    mem_limit: 2g        # Hard memory cap; raise it if the container is being OOM-killed
    memswap_limit: 4g    # Total of memory + swap, so this allows up to 2g of swap
```

#### 2. CPU Throttling

```yaml
# docker-compose.yml - Adjust CPU limits
services:
  myservice:
    cpus: '2.0'          # Allow up to 2 CPU cores
    cpu_shares: 2048     # Relative weight under contention (default is 1024)
```

#### 3. Storage I/O Bottleneck

```bash
# Check if container is doing heavy I/O
docker stats --format "table {{.Name}}\t{{.BlockIO}}"

# Solution: Move data to faster storage (NVMe cache, SSD)
```

#### 4. Database Performance

```bash
# PostgreSQL slow queries (requires the pg_stat_statements extension)
# Column names are for PostgreSQL 13+; older versions use mean_time and total_time
docker exec -it postgres psql -U user -c "
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;"

# Add indexes for slow queries
# Increase shared_buffers in postgresql.conf
```

---

## 🌐 Network Performance Issues

### Symptoms

- Slow file transfers between hosts
- High latency to services
- Buffering when streaming media

### Diagnosis

```bash
# Test throughput between hosts
iperf3 -s                       # On server
iperf3 -c <server-ip> -t 30     # On client

# Expected speeds:
# - 1GbE:   ~940 Mbps
# - 2.5GbE: ~2.35 Gbps
# - 10GbE:  ~9.4 Gbps

# Check for packet loss
ping -c 100 <target> | tail -3

# Check network interface errors
ip -s link show eth0
```
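
The expected speeds above sit below the nominal line rate because every frame carries protocol overhead. As a rough worked example, assuming IPv4, TCP with the timestamp option, and a 1500-byte MTU (the function name `tcp_goodput_mbps` is illustrative):

```bash
#!/usr/bin/env bash
# Estimate the best-case TCP goodput of an Ethernet link in Mbps.
#   $1 = link speed in Mbps, $2 = MTU in bytes (default 1500)
tcp_goodput_mbps() {
    local link_mbps="$1" mtu="${2:-1500}"
    awk -v link="$link_mbps" -v mtu="$mtu" 'BEGIN {
        payload = mtu - 20 - 20 - 12      # minus IPv4 header, TCP header, TCP timestamp option
        on_wire = mtu + 14 + 4 + 8 + 12   # plus Ethernet header, FCS, preamble, inter-frame gap
        printf "%.0f\n", link * payload / on_wire
    }'
}

tcp_goodput_mbps 1000     # 1GbE
tcp_goodput_mbps 2500     # 2.5GbE
tcp_goodput_mbps 10000    # 10GbE
```

If `iperf3` reports numbers well below these estimates, the causes below are worth checking.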

### Common Causes & Solutions

#### 1. MTU Mismatch

```bash
# Check current MTU
ip link show | grep mtu

# Test for MTU issues: 1472 bytes of payload + 28 bytes of IP/ICMP headers = 1500
# -M do forbids fragmentation, so the ping fails if any hop has a smaller MTU
ping -c 4 -M do -s 1472 <target>

# Fix: Set consistent MTU across network
ip link set eth0 mtu 1500
```
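
When the 1472-byte test fails, it helps to know the largest size that does get through. The following is a sketch of a binary search using the same `ping -M do` flags as above; `find_path_mtu`, `ping_probe`, and the `TARGET` variable are hypothetical names, and the search assumes that if one size passes, every smaller size passes too.

```bash
#!/usr/bin/env bash
# Binary-search the largest MTU for which a probe succeeds.
#   $1 = probe command; it receives an ICMP payload size and returns 0 if the ping got through
#   $2 = lower bound assumed to work (default 576), $3 = upper bound to try (default 9000)
find_path_mtu() {
    local probe="$1" lo="${2:-576}" hi="${3:-9000}" mid
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi + 1) / 2 ))
        # Payload = MTU - 28 (20-byte IPv4 header + 8-byte ICMP header)
        if "$probe" $(( mid - 28 )); then
            lo="$mid"            # mid fits, search higher
        else
            hi=$(( mid - 1 ))    # mid is too large, search lower
        fi
    done
    echo "$lo"
}

# Real probe for a live network; TARGET is a variable you set yourself.
ping_probe() {
    ping -c 1 -W 1 -M do -s "$1" "$TARGET" > /dev/null 2>&1
}
# Live use: TARGET=<target> find_path_mtu ping_probe
```

Passing the probe as an argument keeps the search logic separate from `ping`, so it can be dry-run with a stub before pointing it at a real host.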

#### 2. Duplex/Speed Mismatch

```bash
# Check link speed
ethtool eth0 | grep -i speed

# Pin the link to 1000/full if auto-negotiation picks the wrong mode
# 1000BASE-T requires auto-negotiation, so keep it on and restrict what is advertised
ethtool -s eth0 speed 1000 duplex full autoneg on
```

#### 3. DNS Resolution Slow

```bash
# Test DNS resolution time
time dig google.com

# If slow, check /etc/resolv.conf
# Use local Pi-hole/AdGuard or fast upstream DNS

# Fix in Docker (docker-compose.yml):
#   services:
#     myservice:
#       dns:
#         - 192.168.1.x  # Local DNS (Pi-hole)
#         - 1.1.1.1      # Fallback
```
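
`dig` also prints its own timing in a `;; Query time:` line, which is easier to act on than the wall-clock figure from `time`. A small sketch for extracting it follows; the helper names and the 100 ms default are arbitrary example choices.

```bash
#!/usr/bin/env bash
# Extract the "Query time" value (in ms) from dig output on stdin.
dig_query_ms() {
    awk '/^;; Query time:/ { print $4; exit }'
}

# Succeed (exit 0) when the lookup took longer than the limit in ms.
dns_is_slow() {
    local limit_ms="${1:-100}" ms
    ms="$(dig_query_ms)"
    [ -n "$ms" ] && [ "$ms" -gt "$limit_ms" ]
}

# Live use:
#   dig google.com | dig_query_ms
#   dig google.com | dns_is_slow 100 && echo "DNS is slow"
```

Run the lookup twice when testing: the second result usually comes from the resolver's cache and shows the best case.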

#### 4. Tailscale Performance

```bash
# Check Tailscale connection type
tailscale status

# If using DERP relay (slow), check firewall
# Port 41641/UDP should be open for direct connections

# Check Tailscale latency
tailscale ping <device>
```
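
With many peers, the relayed ones are easy to miss in the status table. The sketch below assumes the human-readable layout in which the hostname is the second column and relayed peers show `relay "<region>"`; that layout is an assumption and may change between versions, so treat `tailscale status --json` as the more stable source for scripting. `relayed_peers` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# List peers that are currently reached through a DERP relay.
# Reads `tailscale status` output on stdin.
relayed_peers() {
    awk '/relay "/ { print $2 }'
}

# Live use:
#   tailscale status | relayed_peers
```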

#### 5. Reverse Proxy Bottleneck

```bash
# Check Nginx Proxy Manager logs
docker logs nginx-proxy-manager --tail 100

# Increase worker connections in nginx.conf:
#   worker_processes auto;
#   events {
#       worker_connections 4096;
#   }
```

---

## 💾 Storage Performance Issues

### Symptoms

- Slow read/write speeds
- High disk I/O wait
- Database queries timing out

### Diagnosis

```bash
# Check disk I/O statistics
iostat -xz 1 10

# Key metrics:
# - %util > 90% = disk saturated (less meaningful on NVMe and RAID, which serve requests in parallel)
# - await > 20ms = slow disk
# - r/s, w/s = operations per second

# Check for processes doing heavy I/O
iotop -o

# Test disk speed (writes a 1 GiB test file)
# Sequential write
dd if=/dev/zero of=/volume1/test bs=1M count=1024 oflag=direct

# Sequential read
dd if=/volume1/test of=/dev/null bs=1M count=1024 iflag=direct

# Remove the test file afterwards
rm /volume1/test
```
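
The `%util` threshold can be checked automatically. This is a minimal sketch, not a monitoring replacement: `saturated_disks` is a hypothetical name, and the column is looked up by its header because its position differs between sysstat versions.

```bash
#!/usr/bin/env bash
# Print devices whose %util exceeds a limit (default 90).
# Reads `iostat -x` output on stdin.
saturated_disks() {
    local limit="${1:-90}"
    awk -v limit="$limit" '
        $1 ~ /^Device/ {                  # header row: locate the %util column
            col = 0
            for (i = 1; i <= NF; i++) { if ($i == "%util") col = i }
            next
        }
        col && NF >= col && $col + 0 > limit { print $1 " " $col }
    '
}

# Live use (the first report is the average since boot, so sample a few intervals):
#   iostat -xz 1 5 | saturated_disks 90
```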

### Common Causes & Solutions

#### 1. HDD vs SSD/NVMe

```
Expected speeds:
- HDD (7200 RPM): 100-200 MB/s sequential
- SATA SSD: 500-550 MB/s
- NVMe SSD: 2000-7000 MB/s

# Move frequently accessed data to faster storage
# Use NVMe cache on Synology NAS
```

#### 2. RAID Rebuild in Progress

```bash
# Check Synology RAID status
cat /proc/mdstat

# During rebuild, expect 30-50% performance loss
# Wait for rebuild to complete
```
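
While a rebuild is running, `/proc/mdstat` shows a progress line such as `recovery = 12.6%` under the affected array. The sketch below pulls out just that information; `raid_rebuild_progress` is an illustrative name, and it prints nothing when no array is rebuilding.

```bash
#!/usr/bin/env bash
# Print "<array> <operation> <percent>" for each md array that is rebuilding.
# Reads /proc/mdstat-formatted text on stdin.
raid_rebuild_progress() {
    awk '
        /^md[0-9]+ :/ { array = $1 }                 # remember the current array name
        /(recovery|resync|reshape|check) *=/ {
            for (i = 1; i <= NF; i++) {
                if ($i == "=") { print array, $(i - 1), $(i + 1) }
            }
        }'
}

# Live use:
#   raid_rebuild_progress < /proc/mdstat
```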

#### 3. NVMe Cache Not Working

```bash
# On Synology, check cache status in DSM
# Storage Manager > SSD Cache

# Common issues:
# - Cache full (increase size or add more SSDs)
# - Wrong cache mode (read-only vs read-write)
# - Cache disabled after DSM update
```

#### 4. SMB/NFS Performance

```bash
# Test SMB performance
smbclient //nas/share -U user -c "put largefile.bin"

# Optimize SMB settings in smb.conf:
#   socket options = TCP_NODELAY IPTOS_LOWDELAY
#   read raw = yes
#   write raw = yes
#   max xmit = 65535
# read raw, write raw, and max xmit apply to the SMB1 protocol, so expect little
# effect with SMB2/3 clients; benchmark before and after changing them

# For NFS, use NFSv4.1 with larger rsize/wsize
mount -t nfs -o vers=4.1,rsize=1048576,wsize=1048576 nas:/share /mnt
```

#### 5. Docker Volume Performance

```bash
# Check volume driver
docker volume inspect <volume>

# For better performance, use:
# - Bind mounts instead of named volumes for large datasets
# - Local SSD for database volumes

# docker-compose.yml (under the service definition):
#   volumes:
#     - /fast-ssd/postgres:/var/lib/postgresql/data
```

---

## 📺 Media Streaming Performance

### Symptoms

- Buffering during playback
- Transcoding takes too long
- Multiple streams cause stuttering

### Plex/Jellyfin Optimization

```bash
# Check transcoding status
# Plex: Settings > Dashboard > Now Playing
# Jellyfin: Dashboard > Active Streams

# Enable hardware transcoding
# Plex: Settings > Transcoder > Hardware Acceleration
# Jellyfin: Dashboard > Playback > Transcoding

# For Intel QuickSync (Synology), pass the GPU through with --device
# (a comment cannot follow a line-continuation backslash, so it lives up here)
docker run -d \
  --device /dev/dri:/dev/dri \
  -e PLEX_CLAIM="claim-xxx" \
  plexinc/pms-docker
```

### Direct Play vs Transcoding

```
Performance comparison:
- Direct Play: ~5-20 Mbps per stream (negligible CPU usage)
- Transcoding: ~2000-4000 PassMark CPU score per 1080p stream

# Optimize for Direct Play:
# 1. Use compatible codecs (H.264, AAC)
# 2. Match client capabilities
# 3. Disable transcoding for local clients
```

### Multiple Concurrent Streams

```
10GbE can handle: ~80 concurrent 4K streams (theoretical)
1GbE can handle: ~8 concurrent 4K streams

# If hitting limits:
# 1. Reduce stream quality for remote users
# 2. Enable bandwidth limits per user
# 3. Upgrade network infrastructure
```
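
The stream counts above are simply usable bandwidth divided by per-stream bitrate. As a worked example, the sketch below assumes about 117 Mbps per 4K stream, a figure back-derived from the numbers above rather than measured; substitute the bitrate of your own library. `max_streams` is an illustrative name.

```bash
#!/usr/bin/env bash
# Whole number of streams that fit in the usable link bandwidth.
#   $1 = usable throughput in Mbps, $2 = per-stream bitrate in Mbps
max_streams() {
    echo $(( $1 / $2 ))
}

# 117 Mbps per stream is an assumption, see the note above.
max_streams 940 117     # 1GbE
max_streams 9400 117    # 10GbE
```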

---

## 🖥️ Synology NAS Performance

### Check System Health

```bash
# SSH into Synology
ssh admin@nas

# Check CPU/memory
top

# Check storage health
cat /proc/mdstat
syno_hdd_util --all

# Check Docker performance
docker stats
```

### Common Synology Issues

#### 1. Indexing Slowing System

```bash
# Check if Synology is indexing
ps aux | grep -i index

# Temporarily stop indexing
synoservicectl --stop synoindexd

# Or schedule indexing for off-hours
# Control Panel > Indexing Service > Schedule
```

#### 2. Snapshot Replication Running

```bash
# Check running tasks
synoschedtask --list

# Schedule snapshots during low-usage hours
```

#### 3. Antivirus Scanning

```bash
# Disable real-time scanning or schedule scans
# Security Advisor > Advanced > Scheduled Scan
```

#### 4. Memory Pressure

```bash
# Check memory usage
free -h

# If low on RAM, consider:
# - Adding more RAM (DS1823xs+ supports up to 32GB)
# - Reducing number of running containers
# - Disabling unused packages
```
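
For a single number to track, the "available" figure is more telling than "free", because Linux uses otherwise idle RAM for page cache. The sketch below computes the used percentage as `(1 - MemAvailable / MemTotal) * 100`; the function name `mem_used_percent` is an illustrative choice.

```bash
#!/usr/bin/env bash
# Percentage of RAM in use, based on MemAvailable rather than MemFree.
# Reads /proc/meminfo-formatted text on stdin.
mem_used_percent() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END { if (total > 0) printf "%.1f\n", (1 - avail / total) * 100 }'
}

# Live use on Linux:
#   mem_used_percent < /proc/meminfo
```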

---

## 📊 Monitoring for Performance

### Set Up Prometheus Alerts

```yaml
# prometheus/rules/performance.yml
groups:
  - name: performance
    rules:
      - alert: HighCPUUsage
        # rate() rather than irate(): irate only looks at the last two samples, which makes alerts flap
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: warning

      - alert: DiskIOHigh
        expr: rate(node_disk_io_time_seconds_total[5m]) > 0.9
        for: 10m
        labels:
          severity: warning

      - alert: NetworkErrors
        expr: rate(node_network_receive_errs_total[5m]) > 10
        for: 5m
        labels:
          severity: warning
```

### Grafana Dashboard Panels

Key metrics to monitor:
- CPU usage by core
- Memory usage and swap
- Disk I/O latency (await)
- Network throughput and errors
- Container resource usage
- Docker volume I/O

---

## 🛠️ Performance Tuning Checklist

### System Level
- [ ] Kernel parameters optimized (`/etc/sysctl.conf`)
- [ ] Disk scheduler appropriate for workload (mq-deadline for SSD)
- [ ] Swap configured appropriately
- [ ] File descriptor limits increased
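
To verify the disk scheduler item, the kernel exposes the available schedulers in `/sys/block/<dev>/queue/scheduler` and marks the active one with square brackets. A small sketch for reading it follows; `active_scheduler` is a hypothetical helper and `sda` is only an example device name.

```bash
#!/usr/bin/env bash
# Print the active I/O scheduler from the contents of a queue/scheduler file.
# Example input: "[mq-deadline] kyber bfq none"
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Live use:
#   active_scheduler < /sys/block/sda/queue/scheduler
```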

### Docker Level
- [ ] Container resource limits set
- [ ] Logging driver configured (json-file with max-size)
- [ ] Unused containers/images removed
- [ ] Volumes on appropriate storage

### Network Level
- [ ] Jumbo frames enabled (if supported)
- [ ] DNS resolution fast
- [ ] Firewall rules optimized
- [ ] Quality of Service (QoS) configured

### Application Level
- [ ] Database indexes optimized
- [ ] Caching enabled (Redis/Memcached)
- [ ] Connection pooling configured
- [ ] Static assets served efficiently

---

## 🔗 Related Documentation

- [Network Performance Tuning](../infrastructure/network-performance-tuning.md)
- [Monitoring Setup](../admin/monitoring.md)
- [Common Issues](common-issues.md)
- [10GbE Backbone](../diagrams/10gbe-backbone.md)
- [Storage Topology](../diagrams/storage-topology.md)