# Ollama on Seattle - Local LLM Inference Server

## Overview
| Setting | Value |
|---|---|
| Host | Seattle VM (Contabo VPS) |
| Port | 11434 (Ollama API) |
| Image | ollama/ollama:latest |
| API | http://100.82.197.124:11434 (Tailscale) |
| Stack File | hosts/vms/seattle/ollama.yaml |
| Data Volume | ollama-seattle-data |
## Why Ollama on Seattle?

Ollama was deployed on Seattle to provide:
- CPU-Only Inference: Ollama is optimized for CPU inference, unlike vLLM which requires GPU
- Additional Capacity: Supplements the main Ollama instance on Atlantis (192.168.0.200)
- Geographic Distribution: Runs on a Contabo VPS, providing inference capability outside the local network
- Integration with Perplexica: Can be added as an additional LLM provider for redundancy
## Specifications

### Hardware
- CPU: 16 vCPU AMD EPYC Processor
- RAM: 64GB
- Storage: 300GB SSD
- Location: Contabo Data Center
- Network: Tailscale VPN (100.82.197.124)
### Resource Allocation

```yaml
limits:
  cpus: '12'
  memory: 32G
reservations:
  cpus: '4'
  memory: 8G
```
## Installed Models

### Qwen 2.5 1.5B Instruct

- Model ID: `qwen2.5:1.5b`
- Size: ~986 MB
- Context Window: 32K tokens
- Use Case: Fast, lightweight inference for search queries
- Performance: Excellent on CPU, ~8-12 tokens/second
## Installation History

### February 16, 2026 - Initial Setup
Problem: Attempted to use vLLM for CPU inference
- vLLM container crashed with device detection errors
- vLLM is primarily designed for GPU inference
- CPU mode is not well-supported in recent vLLM versions
Solution: Switched to Ollama
- Ollama is specifically optimized for CPU inference
- Provides better performance and reliability on CPU-only systems
- Simpler configuration and management
- Native support for multiple model formats
Deployment Steps:

1. Removed failing vLLM container
2. Created `ollama.yaml` docker-compose configuration
3. Deployed Ollama container
4. Pulled `qwen2.5:1.5b` model
5. Tested API connectivity via Tailscale
## Configuration

### Docker Compose

See `hosts/vms/seattle/ollama.yaml`:

```yaml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama-seattle
    ports:
      - "11434:11434"
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
      - OLLAMA_KEEP_ALIVE=24h
      - OLLAMA_NUM_PARALLEL=2
      - OLLAMA_MAX_LOADED_MODELS=2
    volumes:
      # Named volume matches the data volume listed in the overview table
      - ollama-seattle-data:/root/.ollama
    restart: unless-stopped

volumes:
  ollama-seattle-data:
```
### Environment Variables

- `OLLAMA_HOST`: Bind to all interfaces
- `OLLAMA_KEEP_ALIVE`: Keep models loaded for 24 hours
- `OLLAMA_NUM_PARALLEL`: Allow 2 parallel requests
- `OLLAMA_MAX_LOADED_MODELS`: Cache up to 2 models in memory
## Usage

### API Endpoints

#### List Models

```bash
curl http://100.82.197.124:11434/api/tags
```

#### Generate Completion

```bash
curl http://100.82.197.124:11434/api/generate -d '{
  "model": "qwen2.5:1.5b",
  "prompt": "Explain quantum computing in simple terms"
}'
```

#### Chat Completion

```bash
curl http://100.82.197.124:11434/api/chat -d '{
  "model": "qwen2.5:1.5b",
  "messages": [
    {"role": "user", "content": "Hello!"}
  ]
}'
```
### Model Management

#### Pull a New Model

```bash
ssh seattle-tailscale "docker exec ollama-seattle ollama pull <model-name>"

# Examples:
# docker exec ollama-seattle ollama pull qwen2.5:3b
# docker exec ollama-seattle ollama pull llama3.2:3b
# docker exec ollama-seattle ollama pull mistral:7b
```

#### List Downloaded Models

```bash
ssh seattle-tailscale "docker exec ollama-seattle ollama list"
```

#### Remove a Model

```bash
ssh seattle-tailscale "docker exec ollama-seattle ollama rm <model-name>"
```
## Integration with Perplexica

To add this Ollama instance as an LLM provider in Perplexica:

1. Navigate to http://192.168.0.210:4785/settings
2. Click "Model Providers"
3. Click "Add Provider"
4. Configure as follows:

```json
{
  "name": "Ollama Seattle",
  "type": "ollama",
  "baseURL": "http://100.82.197.124:11434",
  "apiKey": ""
}
```

5. Click "Save"
6. Select `qwen2.5:1.5b` from the model dropdown when searching
## Benefits of Multiple Ollama Instances
- Load Distribution: Distribute inference load across multiple servers
- Redundancy: If one instance is down, use the other
- Model Variety: Different instances can host different models
- Network Optimization: Use closest/fastest instance
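The redundancy point above can be sketched as client-side failover: probe each endpoint's `/api/tags` in preference order and use the first that answers. A minimal sketch (the endpoint ordering and timeout are illustrative choices, not part of the deployed setup):

```python
import urllib.request

# Ordered by preference: LAN instance first, Seattle over Tailscale as fallback.
ENDPOINTS = [
    "http://192.168.0.200:11434",   # Atlantis (LAN)
    "http://100.82.197.124:11434",  # Seattle (Tailscale)
]

def is_up(base_url, timeout=3):
    """Liveness probe: /api/tags answers 200 when Ollama is serving."""
    try:
        with urllib.request.urlopen(base_url + "/api/tags", timeout=timeout) as r:
            return r.status == 200
    except OSError:
        return False

def first_available(endpoints, probe=is_up):
    """Return the first endpoint whose probe succeeds, or None if all are down."""
    for url in endpoints:
        if probe(url):
            return url
    return None
```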
## Performance

### Expected Performance (CPU-Only)
| Model | Size | Tokens/Second | Memory Usage |
|---|---|---|---|
| qwen2.5:1.5b | 986 MB | 8-12 | ~2-3 GB |
| qwen2.5:3b | ~2 GB | 5-8 | ~4-5 GB |
| llama3.2:3b | ~2 GB | 4-7 | ~4-5 GB |
| mistral:7b | ~4 GB | 2-4 | ~8-10 GB |
### Optimization Tips

- Use Smaller Models: 1.5B and 3B models work best on CPU
- Limit Parallel Requests: Set `OLLAMA_NUM_PARALLEL=2` to avoid overload
- Keep Models Loaded: A long `OLLAMA_KEEP_ALIVE` prevents reload delays
- Monitor Memory: Watch RAM usage with `docker stats ollama-seattle`
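To sanity-check the Monitor Memory tip against `OLLAMA_MAX_LOADED_MODELS=2`, a back-of-the-envelope helper using the upper-bound footprints from the table above (estimates, not measurements; `fits` assumes the 32G compose memory limit):

```python
# Upper-bound resident footprints per model, from the performance table (GB).
FOOTPRINT_GB = {"qwen2.5:1.5b": 3, "qwen2.5:3b": 5, "llama3.2:3b": 5, "mistral:7b": 10}

def worst_case_ram(loaded_models):
    """Worst-case RAM if all of these models are cached at once."""
    return sum(FOOTPRINT_GB[m] for m in loaded_models)

def fits(loaded_models, limit_gb=32):
    """True if the cached set stays under the container memory limit."""
    return worst_case_ram(loaded_models) <= limit_gb
```

For example, keeping `qwen2.5:1.5b` and `mistral:7b` loaded together stays well under the limit, so two cached models are safe with this model mix.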
## Monitoring

### Container Status

```bash
# Check if running
ssh seattle-tailscale "docker ps | grep ollama"

# View logs
ssh seattle-tailscale "docker logs -f ollama-seattle"

# Check resource usage
ssh seattle-tailscale "docker stats ollama-seattle"
```

### API Health Check

```bash
# Test connectivity
curl -m 5 http://100.82.197.124:11434/api/tags

# Test inference
curl http://100.82.197.124:11434/api/generate -d '{
  "model": "qwen2.5:1.5b",
  "prompt": "test",
  "stream": false
}'
```
### Performance Metrics

```bash
# Check response time
time curl -s http://100.82.197.124:11434/api/tags > /dev/null

# Monitor CPU usage
ssh seattle-tailscale "top -b -n 1 | grep ollama"
```
## Troubleshooting

### Container Won't Start

```bash
# Check logs
ssh seattle-tailscale "docker logs ollama-seattle"

# Common issues:
# - Port 11434 already in use
# - Insufficient memory
# - Volume mount permissions
```

### Slow Inference

Causes:

- Model too large for available CPU
- Too many parallel requests
- Insufficient RAM

Solutions:

```bash
# Use a smaller model
docker exec ollama-seattle ollama pull qwen2.5:1.5b

# Reduce parallel requests
# Edit ollama.yaml: OLLAMA_NUM_PARALLEL=1

# Increase CPU allocation
# Edit ollama.yaml: cpus: '16'
```
### Connection Timeout

Problem: Unable to reach Ollama from other machines

Solutions:

1. Verify Tailscale connection:

   ```bash
   ping 100.82.197.124
   tailscale status | grep seattle
   ```

2. Check firewall:

   ```bash
   ssh seattle-tailscale "ss -tlnp | grep 11434"
   ```

3. Verify container is listening:

   ```bash
   ssh seattle-tailscale "docker exec ollama-seattle netstat -tlnp"
   ```
### Model Download Fails

```bash
# Check available disk space
ssh seattle-tailscale "df -h"

# Check internet connectivity
ssh seattle-tailscale "curl -I https://ollama.com"

# Try manual download
ssh seattle-tailscale "docker exec -it ollama-seattle ollama pull <model>"
```
## Maintenance

### Updates

```bash
# Pull latest Ollama image
ssh seattle-tailscale "docker pull ollama/ollama:latest"

# Recreate container
ssh seattle-tailscale "cd /opt/ollama && docker compose up -d --force-recreate"
```

### Backup

```bash
# Backup models and configuration (escape \$(pwd) so it expands on seattle,
# not on the local machine)
ssh seattle-tailscale "docker run --rm -v ollama-seattle-data:/data -v \$(pwd):/backup alpine tar czf /backup/ollama-backup.tar.gz /data"

# Restore
ssh seattle-tailscale "docker run --rm -v ollama-seattle-data:/data -v \$(pwd):/backup alpine tar xzf /backup/ollama-backup.tar.gz -C /"
```

### Cleanup

```bash
# Remove unused models
ssh seattle-tailscale "docker exec ollama-seattle ollama list"
ssh seattle-tailscale "docker exec ollama-seattle ollama rm <unused-model>"

# Clean up Docker
ssh seattle-tailscale "docker system prune -f"
```
## Security Considerations

### Network Access
- Ollama is exposed on port 11434
- Only accessible via Tailscale (100.82.197.124)
- Not exposed to public internet
- Consider adding authentication if exposing publicly
### API Security
Ollama doesn't have built-in authentication. For production use:
- Use a reverse proxy with authentication (Nginx, Caddy)
- Restrict access via firewall rules
- Use Tailscale ACLs to limit access
- Monitor usage for abuse
## Cost Analysis

### Contabo VPS Costs
- Monthly Cost: ~$25-35 USD
- Inference Cost: $0 (self-hosted)
- vs Cloud APIs: OpenAI costs ~$0.15-0.60 per 1M tokens
### Break-even Analysis
- Light usage (<1M tokens/month): Cloud APIs cheaper
- Medium usage (1-10M tokens/month): Self-hosted breaks even
- Heavy usage (>10M tokens/month): Self-hosted much cheaper
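One way to make the break-even arithmetic explicit is the sketch below. The `inference_share` parameter is an assumption introduced here: it models the fact that the Seattle VM also runs other services, so only part of its monthly bill counts against inference (with the full bill attributed, break-even at the high end of the cloud range sits around 50M tokens/month).

```python
def break_even_tokens_m(monthly_vps_usd, cloud_usd_per_m, inference_share=1.0):
    """Monthly volume (millions of tokens) where self-hosting matches cloud cost.

    inference_share: fraction of the VPS bill attributed to inference
    (assumption - the VM runs other services too; adjust to taste).
    """
    return monthly_vps_usd * inference_share / cloud_usd_per_m

# Full $30/mo attributed, at the high end of the cloud range ($0.60 per 1M tokens):
full = break_even_tokens_m(30, 0.60)
# If only ~20% of the bill counts toward inference:
shared = break_even_tokens_m(30, 0.60, inference_share=0.2)
```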
## Future Enhancements

### Potential Improvements
- GPU Support: Migrate to GPU-enabled VPS for faster inference
- Load Balancer: Set up Nginx to load balance between Ollama instances
- Auto-scaling: Deploy additional instances based on load
- Model Caching: Pre-warm multiple models for faster switching
- Monitoring Dashboard: Grafana + Prometheus for metrics
- API Gateway: Add rate limiting and authentication
### Model Recommendations
For different use cases on CPU:
- Fast responses: qwen2.5:1.5b, phi3:3.8b
- Better quality: qwen2.5:3b, llama3.2:3b
- Code tasks: qwen2.5-coder:1.5b, codegemma:2b
- Instruction following: mistral:7b (slower but better)
## Related Services

- Atlantis Ollama (`192.168.0.200:11434`) - Main Ollama instance
- Perplexica (`192.168.0.210:4785`) - AI search engine client
- LM Studio (`100.98.93.15:1234`) - Alternative LLM server
Status: ✅ Fully operational
Last Updated: February 16, 2026
Maintained By: Docker Compose (manual)