# Ollama on Seattle - Local LLM Inference Server

## Overview

| Setting | Value |
|---------|-------|
| **Host** | Seattle VM (Contabo VPS) |
| **Port** | 11434 (Ollama API) |
| **Image** | `ollama/ollama:latest` |
| **API** | http://100.82.197.124:11434 (Tailscale) |
| **Stack File** | `hosts/vms/seattle/ollama.yaml` |
| **Data Volume** | `ollama-seattle-data` |

## Why Ollama on Seattle?

Ollama was deployed on Seattle to provide:

1. **CPU-Only Inference**: Ollama is optimized for CPU inference, unlike vLLM, which requires a GPU
2. **Additional Capacity**: Supplements the main Ollama instance on Atlantis (192.168.0.200)
3. **Geographic Distribution**: Runs on a Contabo VPS, providing inference capability outside the local network
4. **Integration with Perplexica**: Can be added as an additional LLM provider for redundancy

## Specifications

### Hardware

- **CPU**: 16 vCPU AMD EPYC Processor
- **RAM**: 64GB
- **Storage**: 300GB SSD
- **Location**: Contabo Data Center
- **Network**: Tailscale VPN (100.82.197.124)

### Resource Allocation

```yaml
limits:
  cpus: '12'
  memory: 32G
reservations:
  cpus: '4'
  memory: 8G
```

## Installed Models

### Qwen 2.5 1.5B Instruct

- **Model ID**: `qwen2.5:1.5b`
- **Size**: ~986 MB
- **Context Window**: 32K tokens
- **Use Case**: Fast, lightweight inference for search queries
- **Performance**: Excellent on CPU, ~8-12 tokens/second

## Installation History

### February 16, 2026 - Initial Setup

**Problem**: Attempted to use vLLM for CPU inference

- vLLM container crashed with device detection errors
- vLLM is primarily designed for GPU inference
- CPU mode is not well-supported in recent vLLM versions

**Solution**: Switched to Ollama

- Ollama is specifically optimized for CPU inference
- Provides better performance and reliability on CPU-only systems
- Simpler configuration and management
- Native support for multiple model formats

**Deployment Steps**:

1. Removed failing vLLM container
2. Created `ollama.yaml` docker-compose configuration
3. Deployed Ollama container
4. Pulled `qwen2.5:1.5b` model
5. Tested API connectivity via Tailscale

## Configuration

### Docker Compose

See `hosts/vms/seattle/ollama.yaml`:

```yaml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama-seattle
    ports:
      - "11434:11434"
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
      - OLLAMA_KEEP_ALIVE=24h
      - OLLAMA_NUM_PARALLEL=2
      - OLLAMA_MAX_LOADED_MODELS=2
    volumes:
      - ollama-seattle-data:/root/.ollama
    restart: unless-stopped

volumes:
  ollama-seattle-data:
    name: ollama-seattle-data
```

### Environment Variables

- `OLLAMA_HOST`: Bind to all interfaces
- `OLLAMA_KEEP_ALIVE`: Keep models loaded for 24 hours
- `OLLAMA_NUM_PARALLEL`: Allow 2 parallel requests
- `OLLAMA_MAX_LOADED_MODELS`: Cache up to 2 models in memory

## Usage

### API Endpoints

#### List Models

```bash
curl http://100.82.197.124:11434/api/tags
```

#### Generate Completion

```bash
curl http://100.82.197.124:11434/api/generate -d '{
  "model": "qwen2.5:1.5b",
  "prompt": "Explain quantum computing in simple terms"
}'
```

#### Chat Completion

```bash
curl http://100.82.197.124:11434/api/chat -d '{
  "model": "qwen2.5:1.5b",
  "messages": [
    {"role": "user", "content": "Hello!"}
  ]
}'
```

### Model Management

#### Pull a New Model

```bash
ssh seattle-tailscale "docker exec ollama-seattle ollama pull <model>"

# Examples:
# docker exec ollama-seattle ollama pull qwen2.5:3b
# docker exec ollama-seattle ollama pull llama3.2:3b
# docker exec ollama-seattle ollama pull mistral:7b
```

#### List Downloaded Models

```bash
ssh seattle-tailscale "docker exec ollama-seattle ollama list"
```

#### Remove a Model

```bash
ssh seattle-tailscale "docker exec ollama-seattle ollama rm <model>"
```

## Integration with Perplexica

To add this Ollama instance as an LLM provider in Perplexica:

1. Navigate to **http://192.168.0.210:4785/settings**
2. Click **"Model Providers"**
3. Click **"Add Provider"**
4. Configure as follows:

   ```json
   {
     "name": "Ollama Seattle",
     "type": "ollama",
     "baseURL": "http://100.82.197.124:11434",
     "apiKey": ""
   }
   ```

5. Click **"Save"**
6. Select `qwen2.5:1.5b` from the model dropdown when searching

### Benefits of Multiple Ollama Instances

- **Load Distribution**: Distribute inference load across multiple servers
- **Redundancy**: If one instance is down, use the other
- **Model Variety**: Different instances can host different models
- **Network Optimization**: Use the closest/fastest instance

## Performance

### Expected Performance (CPU-Only)

| Model | Size | Tokens/Second | Memory Usage |
|-------|------|---------------|--------------|
| qwen2.5:1.5b | 986 MB | 8-12 | ~2-3 GB |
| qwen2.5:3b | ~2 GB | 5-8 | ~4-5 GB |
| llama3.2:3b | ~2 GB | 4-7 | ~4-5 GB |
| mistral:7b | ~4 GB | 2-4 | ~8-10 GB |

### Optimization Tips

1. **Use Smaller Models**: 1.5B and 3B models work best on CPU
2. **Limit Parallel Requests**: Set `OLLAMA_NUM_PARALLEL=2` to avoid overload
3. **Keep Models Loaded**: A long `OLLAMA_KEEP_ALIVE` prevents reload delays
4. **Monitor Memory**: Watch RAM usage with `docker stats ollama-seattle`

## Monitoring

### Container Status

```bash
# Check if running
ssh seattle-tailscale "docker ps | grep ollama"

# View logs
ssh seattle-tailscale "docker logs -f ollama-seattle"

# Check resource usage
ssh seattle-tailscale "docker stats ollama-seattle"
```

### API Health Check

```bash
# Test connectivity
curl -m 5 http://100.82.197.124:11434/api/tags

# Test inference
curl http://100.82.197.124:11434/api/generate -d '{
  "model": "qwen2.5:1.5b",
  "prompt": "test",
  "stream": false
}'
```

### Performance Metrics

```bash
# Check response time
time curl -s http://100.82.197.124:11434/api/tags > /dev/null

# Monitor CPU usage
ssh seattle-tailscale "top -b -n 1 | grep ollama"
```

## Troubleshooting

### Container Won't Start

```bash
# Check logs
ssh seattle-tailscale "docker logs ollama-seattle"

# Common issues:
# - Port 11434 already in use
# - Insufficient memory
# - Volume mount permissions
```

### Slow Inference

**Causes**:

- Model too large for available CPU
- Too many parallel requests
- Insufficient RAM
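Before changing anything, it helps to confirm actual throughput. Ollama's non-streaming `/api/generate` response includes `eval_count` (tokens generated) and `eval_duration` (nanoseconds), so tokens/second can be computed directly. A minimal sketch (the `grep`-based JSON extraction is rough; `jq` is cleaner if available):

```shell
#!/bin/sh
# Compute tokens/second from Ollama's eval_count and eval_duration (ns) fields.
tok_per_sec() {
  awk -v c="$1" -v d="$2" 'BEGIN { printf "%.1f\n", c / (d / 1e9) }'
}

# Example: save a non-streaming response, then extract the two fields:
#   curl -s http://100.82.197.124:11434/api/generate \
#     -d '{"model":"qwen2.5:1.5b","prompt":"test","stream":false}' > resp.json
#   count=$(grep -o '"eval_count":[0-9]*' resp.json | cut -d: -f2)
#   dur=$(grep -o '"eval_duration":[0-9]*' resp.json | cut -d: -f2)
#   tok_per_sec "$count" "$dur"

tok_per_sec 96 12000000000   # 96 tokens in 12s -> 8.0 tok/s
```

A result well below the expected 8-12 tok/s for `qwen2.5:1.5b` points at one of the causes above.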
**Solutions**:

```bash
# Use a smaller model
docker exec ollama-seattle ollama pull qwen2.5:1.5b

# Reduce parallel requests
# Edit ollama.yaml: OLLAMA_NUM_PARALLEL=1

# Increase CPU allocation
# Edit ollama.yaml: cpus: '16'
```

### Connection Timeout

**Problem**: Unable to reach Ollama from other machines

**Solutions**:

1. Verify the Tailscale connection:

   ```bash
   ping 100.82.197.124
   tailscale status | grep seattle
   ```

2. Check that the port is listening on the host:

   ```bash
   ssh seattle-tailscale "ss -tlnp | grep 11434"
   ```

3. Verify the container is listening:

   ```bash
   ssh seattle-tailscale "docker exec ollama-seattle netstat -tlnp"
   ```

### Model Download Fails

```bash
# Check available disk space
ssh seattle-tailscale "df -h"

# Check internet connectivity
ssh seattle-tailscale "curl -I https://ollama.com"

# Try manual download
ssh seattle-tailscale "docker exec -it ollama-seattle ollama pull <model>"
```

## Maintenance

### Updates

```bash
# Pull latest Ollama image
ssh seattle-tailscale "docker pull ollama/ollama:latest"

# Recreate container
ssh seattle-tailscale "cd /opt/ollama && docker compose up -d --force-recreate"
```

### Backup

```bash
# Backup models and configuration
# Note: \$(pwd) is escaped so it expands on the remote host, not locally
ssh seattle-tailscale "docker run --rm -v ollama-seattle-data:/data -v \$(pwd):/backup alpine tar czf /backup/ollama-backup.tar.gz /data"

# Restore
ssh seattle-tailscale "docker run --rm -v ollama-seattle-data:/data -v \$(pwd):/backup alpine tar xzf /backup/ollama-backup.tar.gz -C /"
```

### Cleanup

```bash
# Remove unused models
ssh seattle-tailscale "docker exec ollama-seattle ollama list"
ssh seattle-tailscale "docker exec ollama-seattle ollama rm <model>"

# Clean up Docker
ssh seattle-tailscale "docker system prune -f"
```

## Security Considerations

### Network Access

- Ollama is exposed on port 11434
- **Only accessible via Tailscale** (100.82.197.124)
- Not exposed to the public internet
- Consider adding authentication if exposing publicly

### API Security

Ollama has no built-in authentication. For production use:

1. **Use a reverse proxy** with authentication (Nginx, Caddy)
2. **Restrict access** via firewall rules
3. **Use Tailscale ACLs** to limit access
4. **Monitor usage** for abuse

## Cost Analysis

### Contabo VPS Costs

- **Monthly Cost**: ~$25-35 USD
- **Inference Cost**: $0 (self-hosted)
- **vs Cloud APIs**: OpenAI costs ~$0.15-0.60 per 1M tokens

### Break-even Analysis

At the cloud prices above, a ~$30/month VPS breaks even at roughly 50M tokens/month (at $0.60 per 1M) to 200M tokens/month (at $0.15 per 1M):

- **Light usage** (<1M tokens/month): Cloud APIs far cheaper
- **Medium usage** (1-50M tokens/month): Cloud APIs still cheaper on raw cost; self-hosting adds privacy and freedom from rate limits
- **Heavy usage** (>50M tokens/month): Self-hosted cheaper

## Future Enhancements

### Potential Improvements

1. **GPU Support**: Migrate to a GPU-enabled VPS for faster inference
2. **Load Balancer**: Set up Nginx to load balance between Ollama instances
3. **Auto-scaling**: Deploy additional instances based on load
4. **Model Caching**: Pre-warm multiple models for faster switching
5. **Monitoring Dashboard**: Grafana + Prometheus for metrics
6. **API Gateway**: Add rate limiting and authentication

### Model Recommendations

For different use cases on CPU:

- **Fast responses**: qwen2.5:1.5b, phi3:3.8b
- **Better quality**: qwen2.5:3b, llama3.2:3b
- **Code tasks**: qwen2.5-coder:1.5b, codegemma:2b
- **Instruction following**: mistral:7b (slower but better)

## Related Services

- **Atlantis Ollama** (`192.168.0.200:11434`) - Main Ollama instance
- **Perplexica** (`192.168.0.210:4785`) - AI search engine client
- **LM Studio** (`100.98.93.15:1234`) - Alternative LLM server

## References

- [Ollama Documentation](https://github.com/ollama/ollama)
- [Available Models](https://ollama.com/library)
- [Ollama API Reference](https://github.com/ollama/ollama/blob/main/docs/api.md)
- [Qwen 2.5 Model Card](https://ollama.com/library/qwen2.5)

---

**Status:** ✅ Fully operational
**Last Updated:** February 16, 2026
**Maintained By:** Docker Compose (manual)
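The model-management commands in the Usage section all share the same `ssh` + `docker exec` prefix. A small wrapper function (hypothetical, not part of the deployed setup) keeps them short:

```shell
#!/bin/sh
# Hypothetical convenience wrapper: run any ollama subcommand inside the
# seattle container over Tailscale SSH.
ollama_seattle() {
  ssh seattle-tailscale "docker exec ollama-seattle ollama $*"
}

# Usage:
#   ollama_seattle list
#   ollama_seattle pull qwen2.5:3b
#   ollama_seattle rm mistral:7b
```

Dropping this into `~/.bashrc` on the management machine makes remote model administration a one-liner.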
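The API health check from the Monitoring section can be wrapped into a cron-friendly script whose exit code signals health. A sketch, assuming `curl` is installed on the monitoring machine:

```shell
#!/bin/sh
# Health-check sketch for the Ollama API.
# Exits 0 when /api/tags answers within the timeout, non-zero otherwise.
check_ollama() {
  url="${1:-http://100.82.197.124:11434}"   # API base URL (default: seattle)
  timeout="${2:-5}"                          # seconds before giving up
  curl -sf -m "$timeout" "$url/api/tags" >/dev/null
}

# Example cron entry (hypothetical script path):
#   */5 * * * * /usr/local/bin/check-ollama.sh || logger "ollama-seattle down"
```

The `-f` flag makes `curl` treat HTTP errors as failures, so a misbehaving reverse proxy in front of Ollama would also trip the check.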
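After running the backup command from the Maintenance section, it is worth confirming the archive is actually readable before relying on it. A minimal check:

```shell
#!/bin/sh
# Sanity-check a backup archive: returns 0 if the tarball is readable
# and lists at least one entry, non-zero otherwise.
verify_backup() {
  tar tzf "$1" 2>/dev/null | grep -q .
}

# Usage (in the directory where ollama-backup.tar.gz was written):
#   verify_backup ollama-backup.tar.gz && echo "backup OK" || echo "backup BROKEN"
```

Running this immediately after each backup catches truncated or corrupted archives while the source data still exists.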
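The break-even point in the Cost Analysis section is just the monthly VPS cost divided by the per-token API price. A small calculator, using $30/month and $0.60 per 1M tokens as assumed mid-range figures from the tables above:

```shell
#!/bin/sh
# Break-even between the self-hosted VPS and a cloud API.
#   break_even VPS_USD_PER_MONTH  API_USD_PER_1M_TOKENS
# Prints the monthly volume (in millions of tokens) where the two costs match.
break_even() {
  awk -v vps="$1" -v price="$2" 'BEGIN { printf "%.0f\n", vps / price }'
}

break_even 30 0.60   # -> 50 (million tokens/month)
```

Note this compares raw cost only; privacy and freedom from rate limits may justify self-hosting well below the break-even volume.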