# Ollama on Seattle - Local LLM Inference Server

## Overview

| Setting | Value |
|---------|-------|
| **Host** | Seattle VM (Contabo VPS) |
| **Port** | 11434 (Ollama API) |
| **Image** | `ollama/ollama:latest` |
| **API** | http://100.82.197.124:11434 (Tailscale) |
| **Stack File** | `hosts/vms/seattle/ollama.yaml` |
| **Data Volume** | `ollama-seattle-data` |

## Why Ollama on Seattle?

Ollama was deployed on Seattle to provide:

1. **CPU-Only Inference**: Ollama is optimized for CPU inference, unlike vLLM, which requires a GPU
2. **Additional Capacity**: Supplements the main Ollama instance on Atlantis (192.168.0.200)
3. **Geographic Distribution**: Runs on a Contabo VPS, providing inference capability outside the local network
4. **Integration with Perplexica**: Can be added as an additional LLM provider for redundancy

## Specifications

### Hardware
- **CPU**: 16 vCPU AMD EPYC Processor
- **RAM**: 64GB
- **Storage**: 300GB SSD
- **Location**: Contabo Data Center
- **Network**: Tailscale VPN (100.82.197.124)

### Resource Allocation

Set via the Compose `deploy.resources` block:

```yaml
deploy:
  resources:
    limits:
      cpus: '12'
      memory: 32G
    reservations:
      cpus: '4'
      memory: 8G
```

## Installed Models

### Qwen 2.5 1.5B Instruct
- **Model ID**: `qwen2.5:1.5b`
- **Size**: ~986 MB
- **Context Window**: 32K tokens
- **Use Case**: Fast, lightweight inference for search queries
- **Performance**: Excellent on CPU, ~5-10 tokens/second

## Installation History

### February 16, 2026 - Initial Setup

**Problem**: Attempted to use vLLM for CPU inference
- The vLLM container crashed with device detection errors
- vLLM is primarily designed for GPU inference
- CPU mode is not well supported in recent vLLM versions

**Solution**: Switched to Ollama
- Ollama is specifically optimized for CPU inference
- Provides better performance and reliability on CPU-only systems
- Simpler configuration and management
- Native support for multiple model formats

**Deployment Steps**:
1. Removed the failing vLLM container
2. Created the `ollama.yaml` Docker Compose configuration
3. Deployed the Ollama container
4. Pulled the `qwen2.5:1.5b` model
5. Tested API connectivity via Tailscale

## Configuration

### Docker Compose

See `hosts/vms/seattle/ollama.yaml` (the named volume matches the `ollama-seattle-data` volume referenced above and in the backup commands):

```yaml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama-seattle
    ports:
      - "11434:11434"
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
      - OLLAMA_KEEP_ALIVE=24h
      - OLLAMA_NUM_PARALLEL=2
      - OLLAMA_MAX_LOADED_MODELS=2
    volumes:
      - ollama-seattle-data:/root/.ollama
    restart: unless-stopped

volumes:
  ollama-seattle-data:
```

### Environment Variables

- `OLLAMA_HOST`: Bind to all interfaces
- `OLLAMA_KEEP_ALIVE`: Keep models loaded for 24 hours
- `OLLAMA_NUM_PARALLEL`: Allow 2 parallel requests
- `OLLAMA_MAX_LOADED_MODELS`: Cache up to 2 models in memory

## Usage

### API Endpoints

#### List Models
```bash
curl http://100.82.197.124:11434/api/tags
```

#### Generate Completion
```bash
curl http://100.82.197.124:11434/api/generate -d '{
  "model": "qwen2.5:1.5b",
  "prompt": "Explain quantum computing in simple terms"
}'
```

#### Chat Completion
```bash
curl http://100.82.197.124:11434/api/chat -d '{
  "model": "qwen2.5:1.5b",
  "messages": [
    {"role": "user", "content": "Hello!"}
  ]
}'
```
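
By default these endpoints stream newline-delimited JSON objects rather than returning one response (add `"stream": false` for a single body). A minimal Python sketch for consuming a chat stream; the request itself is commented out because it needs the live Tailscale endpoint:

```python
import json

def collect_stream(lines):
    """Join the content pieces from an Ollama chat stream.

    Each line is one JSON object; the generated text lives at
    message.content, and the final object carries "done": true.
    """
    parts = []
    for line in lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        if chunk.get("done"):
            break
        parts.append(chunk.get("message", {}).get("content", ""))
    return "".join(parts)

# Against the live server, something like:
# import urllib.request
# req = urllib.request.Request(
#     "http://100.82.197.124:11434/api/chat",
#     data=json.dumps({"model": "qwen2.5:1.5b",
#                      "messages": [{"role": "user", "content": "Hello!"}]}).encode(),
# )
# with urllib.request.urlopen(req) as resp:
#     print(collect_stream(resp))
```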

### Model Management

#### Pull a New Model
```bash
ssh seattle-tailscale "docker exec ollama-seattle ollama pull <model-name>"

# Examples:
# docker exec ollama-seattle ollama pull qwen2.5:3b
# docker exec ollama-seattle ollama pull llama3.2:3b
# docker exec ollama-seattle ollama pull mistral:7b
```

#### List Downloaded Models
```bash
ssh seattle-tailscale "docker exec ollama-seattle ollama list"
```

#### Remove a Model
```bash
ssh seattle-tailscale "docker exec ollama-seattle ollama rm <model-name>"
```

## Integration with Perplexica

To add this Ollama instance as an LLM provider in Perplexica:

1. Navigate to **http://192.168.0.210:4785/settings**
2. Click **"Model Providers"**
3. Click **"Add Provider"**
4. Configure as follows:

```json
{
  "name": "Ollama Seattle",
  "type": "ollama",
  "baseURL": "http://100.82.197.124:11434",
  "apiKey": ""
}
```

5. Click **"Save"**
6. Select `qwen2.5:1.5b` from the model dropdown when searching

### Benefits of Multiple Ollama Instances

- **Load Distribution**: Distribute inference load across multiple servers
- **Redundancy**: If one instance is down, use the other
- **Model Variety**: Different instances can host different models
- **Network Optimization**: Use the closest/fastest instance
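
The redundancy point can be sketched as client-side failover: probe each instance (a healthy Ollama answers `GET /api/tags`) and use the first one that responds. The instance ordering below is illustrative:

```python
import urllib.request

# Candidate instances, preferred first (addresses from this deployment).
INSTANCES = [
    "http://192.168.0.200:11434",   # Atlantis (LAN)
    "http://100.82.197.124:11434",  # Seattle (Tailscale)
]

def is_healthy(base_url, timeout=3):
    """A live Ollama instance answers GET /api/tags with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as r:
            return r.status == 200
    except OSError:
        return False

def pick_instance(urls, healthy=is_healthy):
    """Return the first reachable instance, or None if all are down."""
    for url in urls:
        if healthy(url):
            return url
    return None
```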

## Performance

### Expected Performance (CPU-Only)

| Model | Size | Tokens/Second | Memory Usage |
|-------|------|---------------|--------------|
| qwen2.5:1.5b | 986 MB | 8-12 | ~2-3 GB |
| qwen2.5:3b | ~2 GB | 5-8 | ~4-5 GB |
| llama3.2:3b | ~2 GB | 4-7 | ~4-5 GB |
| mistral:7b | ~4 GB | 2-4 | ~8-10 GB |
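
These figures can be measured rather than estimated: a non-streaming response from `/api/generate` includes `eval_count` (tokens generated) and `eval_duration` (in nanoseconds), so throughput is their ratio. A small sketch:

```python
def tokens_per_second(resp):
    """Throughput from Ollama's response metrics: eval_count tokens
    generated over eval_duration nanoseconds."""
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

# The JSON body of a '"stream": false' request carries these fields, e.g.:
sample = {"eval_count": 120, "eval_duration": 12_000_000_000}  # 12 seconds
print(tokens_per_second(sample))  # 10.0 tokens/second
```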

### Optimization Tips

1. **Use Smaller Models**: 1.5B and 3B models work best on CPU
2. **Limit Parallel Requests**: Set `OLLAMA_NUM_PARALLEL=2` to avoid overload
3. **Keep Models Loaded**: A long `OLLAMA_KEEP_ALIVE` prevents reload delays
4. **Monitor Memory**: Watch RAM usage with `docker stats ollama-seattle`

## Monitoring

### Container Status
```bash
# Check if running
ssh seattle-tailscale "docker ps | grep ollama"

# View logs
ssh seattle-tailscale "docker logs -f ollama-seattle"

# Check resource usage
ssh seattle-tailscale "docker stats ollama-seattle"
```

### API Health Check
```bash
# Test connectivity
curl -m 5 http://100.82.197.124:11434/api/tags

# Test inference
curl http://100.82.197.124:11434/api/generate -d '{
  "model": "qwen2.5:1.5b",
  "prompt": "test",
  "stream": false
}'
```

### Performance Metrics
```bash
# Check response time
time curl -s http://100.82.197.124:11434/api/tags > /dev/null

# Monitor CPU usage
ssh seattle-tailscale "top -b -n 1 | grep ollama"
```

## Troubleshooting

### Container Won't Start

```bash
# Check logs
ssh seattle-tailscale "docker logs ollama-seattle"

# Common issues:
# - Port 11434 already in use
# - Insufficient memory
# - Volume mount permissions
```

### Slow Inference

**Causes**:
- Model too large for the available CPU
- Too many parallel requests
- Insufficient RAM

**Solutions**:
```bash
# Use a smaller model
docker exec ollama-seattle ollama pull qwen2.5:1.5b

# Reduce parallel requests
# Edit ollama.yaml: OLLAMA_NUM_PARALLEL=1

# Increase CPU allocation
# Edit ollama.yaml: cpus: '16'
```

### Connection Timeout

**Problem**: Unable to reach Ollama from other machines

**Solutions**:
1. Verify the Tailscale connection:

   ```bash
   ping 100.82.197.124
   tailscale status | grep seattle
   ```

2. Check that port 11434 is listening on the host (and not blocked by a firewall):

   ```bash
   ssh seattle-tailscale "ss -tlnp | grep 11434"
   ```

3. Verify the container is listening:

   ```bash
   ssh seattle-tailscale "docker exec ollama-seattle netstat -tlnp"
   ```

### Model Download Fails

```bash
# Check available disk space
ssh seattle-tailscale "df -h"

# Check internet connectivity
ssh seattle-tailscale "curl -I https://ollama.com"

# Try a manual download
ssh seattle-tailscale "docker exec -it ollama-seattle ollama pull <model>"
```

## Maintenance

### Updates

```bash
# Pull the latest Ollama image
ssh seattle-tailscale "docker pull ollama/ollama:latest"

# Recreate the container
ssh seattle-tailscale "cd /opt/ollama && docker compose up -d --force-recreate"
```

### Backup

```bash
# Back up models and configuration
# (\$ is escaped so \$(pwd) expands on the remote host, not locally)
ssh seattle-tailscale "docker run --rm -v ollama-seattle-data:/data -v \$(pwd):/backup alpine tar czf /backup/ollama-backup.tar.gz /data"

# Restore
ssh seattle-tailscale "docker run --rm -v ollama-seattle-data:/data -v \$(pwd):/backup alpine tar xzf /backup/ollama-backup.tar.gz -C /"
```

### Cleanup

```bash
# Remove unused models
ssh seattle-tailscale "docker exec ollama-seattle ollama list"
ssh seattle-tailscale "docker exec ollama-seattle ollama rm <unused-model>"

# Clean up Docker
ssh seattle-tailscale "docker system prune -f"
```

## Security Considerations

### Network Access

- Ollama is exposed on port 11434
- **Only accessible via Tailscale** (100.82.197.124)
- Not exposed to the public internet
- Consider adding authentication if exposing it publicly

### API Security

Ollama doesn't have built-in authentication. For production use:

1. **Use a reverse proxy** with authentication (Nginx, Caddy)
2. **Restrict access** via firewall rules
3. **Use Tailscale ACLs** to limit access
4. **Monitor usage** for abuse
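
As a sketch of option 1, an Nginx fragment with HTTP basic auth in front of the API; the hostname, TLS setup, and htpasswd path are placeholders, not part of this deployment:

```nginx
server {
    listen 443 ssl;
    server_name ollama.example.com;  # placeholder

    auth_basic           "Ollama";
    auth_basic_user_file /etc/nginx/.htpasswd;  # placeholder path

    location / {
        proxy_pass http://127.0.0.1:11434;
        proxy_read_timeout 300s;  # allow for model loads and long generations
    }
}
```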

## Cost Analysis

### Contabo VPS Costs
- **Monthly Cost**: ~$25-35 USD
- **Inference Cost**: $0 (self-hosted)
- **vs Cloud APIs**: OpenAI costs ~$0.15-0.60 per 1M tokens
- **Light usage** (<1M tokens/month): Cloud APIs are cheaper
- **Medium usage** (1-10M tokens/month): Self-hosted breaks even
- **Heavy usage** (>10M tokens/month): Self-hosted is much cheaper
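
The break-even volume is just the monthly VPS cost divided by the per-million-token price, so it depends heavily on which cloud model you compare against. A quick sketch with illustrative prices (the $3.00 figure stands in for a pricier cloud model and is an assumption, not from the figures above):

```python
def break_even_tokens(monthly_cost_usd, price_per_mtok_usd):
    """Millions of tokens/month at which self-hosting matches a cloud API
    on raw token price alone (ignoring latency, privacy, and ops effort)."""
    return monthly_cost_usd / price_per_mtok_usd

print(break_even_tokens(30, 0.60))  # roughly 50M tokens at the high-end price
print(break_even_tokens(30, 3.00))  # 10M tokens against a pricier model
```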

## Future Enhancements

### Potential Improvements

1. **GPU Support**: Migrate to a GPU-enabled VPS for faster inference
2. **Load Balancer**: Set up Nginx to load balance between Ollama instances
3. **Auto-scaling**: Deploy additional instances based on load
4. **Model Caching**: Pre-warm multiple models for faster switching
5. **Monitoring Dashboard**: Grafana + Prometheus for metrics
6. **API Gateway**: Add rate limiting and authentication

### Model Recommendations

For different use cases on CPU:

- **Fast responses**: qwen2.5:1.5b, phi3:3.8b
- **Better quality**: qwen2.5:3b, llama3.2:3b
- **Code tasks**: qwen2.5-coder:1.5b, codegemma:2b
- **Instruction following**: mistral:7b (slower but better)

## Related Services

- **Atlantis Ollama** (`192.168.0.200:11434`) - Main Ollama instance
- **Perplexica** (`192.168.0.210:4785`) - AI search engine client
- **LM Studio** (`100.98.93.15:1234`) - Alternative LLM server

## References

- [Ollama Documentation](https://github.com/ollama/ollama)
- [Available Models](https://ollama.com/library)
- [Ollama API Reference](https://github.com/ollama/ollama/blob/main/docs/api.md)
- [Qwen 2.5 Model Card](https://ollama.com/library/qwen2.5)

---

**Status:** ✅ Fully operational
**Last Updated:** February 16, 2026
**Maintained By:** Docker Compose (manual)