# Ollama on Seattle - Local LLM Inference Server
## Overview
| Setting | Value |
|---------|-------|
| **Host** | Seattle VM (Contabo VPS) |
| **Port** | 11434 (Ollama API) |
| **Image** | `ollama/ollama:latest` |
| **API** | http://100.82.197.124:11434 (Tailscale) |
| **Stack File** | `hosts/vms/seattle/ollama.yaml` |
| **Data Volume** | `ollama-seattle-data` |

## Why Ollama on Seattle?

Ollama was deployed on Seattle to provide:

1. **CPU-Only Inference**: Ollama is optimized for CPU inference, unlike vLLM, which requires a GPU
2. **Additional Capacity**: Supplements the main Ollama instance on Atlantis (192.168.0.200)
3. **Geographic Distribution**: Runs on a Contabo VPS, providing inference capability outside the local network
4. **Integration with Perplexica**: Can be added as an additional LLM provider for redundancy

## Specifications
### Hardware

- **CPU**: 16 vCPU AMD EPYC Processor
- **RAM**: 64GB
- **Storage**: 300GB SSD
- **Location**: Contabo Data Center
- **Network**: Tailscale VPN (100.82.197.124)

### Resource Allocation

```yaml
limits:
  cpus: '12'
  memory: 32G
reservations:
  cpus: '4'
  memory: 8G
```
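
On its own this fragment is ambiguous about placement; in a Compose file it sits under the service's `deploy.resources` key. A minimal sketch of the assumed wiring (an assumption, not a copy of the stack file; Docker Compose v2 applies these limits outside Swarm mode):

```yaml
services:
  ollama:
    deploy:
      resources:
        limits:
          cpus: '12'
          memory: 32G
        reservations:
          cpus: '4'
          memory: 8G
```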
## Installed Models

### Qwen 2.5 1.5B Instruct

- **Model ID**: `qwen2.5:1.5b`
- **Size**: ~986 MB
- **Context Window**: 32K tokens
- **Use Case**: Fast, lightweight inference for search queries
- **Performance**: Excellent on CPU, ~5-10 tokens/second

## Installation History

### February 16, 2026 - Initial Setup

**Problem**: Attempted to use vLLM for CPU inference

- vLLM container crashed with device detection errors
- vLLM is primarily designed for GPU inference
- CPU mode is not well supported in recent vLLM versions

**Solution**: Switched to Ollama

- Ollama is specifically optimized for CPU inference
- Provides better performance and reliability on CPU-only systems
- Simpler configuration and management
- Native support for multiple model formats

**Deployment Steps**:

1. Removed failing vLLM container
2. Created `ollama.yaml` Docker Compose configuration
3. Deployed Ollama container
4. Pulled `qwen2.5:1.5b` model
5. Tested API connectivity via Tailscale

## Configuration

### Docker Compose

See `hosts/vms/seattle/ollama.yaml`:

```yaml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama-seattle
    ports:
      - "11434:11434"
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
      - OLLAMA_KEEP_ALIVE=24h
      - OLLAMA_NUM_PARALLEL=2
      - OLLAMA_MAX_LOADED_MODELS=2
    volumes:
      - ollama-data:/root/.ollama
    restart: unless-stopped
```

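As quoted, the file never declares the named volume at the top level, and the overview table lists the data volume as `ollama-seattle-data` rather than `ollama-data`. Presumably the real file reconciles the two with a top-level `volumes:` block along these lines (an assumption, not a copy of the file):

```yaml
volumes:
  ollama-data:
    name: ollama-seattle-data
```
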
### Environment Variables

- `OLLAMA_HOST`: Bind to all interfaces
- `OLLAMA_KEEP_ALIVE`: Keep models loaded for 24 hours
- `OLLAMA_NUM_PARALLEL`: Allow 2 parallel requests
- `OLLAMA_MAX_LOADED_MODELS`: Cache up to 2 models in memory

## Usage

### API Endpoints

#### List Models

```bash
curl http://100.82.197.124:11434/api/tags
```

#### Generate Completion

```bash
curl http://100.82.197.124:11434/api/generate -d '{
  "model": "qwen2.5:1.5b",
  "prompt": "Explain quantum computing in simple terms"
}'
```

#### Chat Completion

```bash
curl http://100.82.197.124:11434/api/chat -d '{
  "model": "qwen2.5:1.5b",
  "messages": [
    {"role": "user", "content": "Hello!"}
  ]
}'
```

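For scripted use it can help to build the payload in a heredoc and validate it locally before sending. A sketch (assumes `python3` is on the PATH for the JSON check; with `"stream": false` the API returns a single JSON object instead of newline-delimited chunks):

```shell
# Build the /api/chat payload in a heredoc; the quoted 'EOF' delimiter
# prevents shell expansion inside the JSON
payload=$(cat <<'EOF'
{
  "model": "qwen2.5:1.5b",
  "messages": [
    {"role": "user", "content": "Hello!"}
  ],
  "stream": false
}
EOF
)

# Sanity-check the JSON locally before sending it anywhere
echo "$payload" | python3 -m json.tool > /dev/null && echo "payload OK"

# Then post it to the Seattle instance:
# curl http://100.82.197.124:11434/api/chat -d "$payload"
```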
### Model Management
#### Pull a New Model

```bash
ssh seattle-tailscale "docker exec ollama-seattle ollama pull <model-name>"

# Examples:
# docker exec ollama-seattle ollama pull qwen2.5:3b
# docker exec ollama-seattle ollama pull llama3.2:3b
# docker exec ollama-seattle ollama pull mistral:7b
```

#### List Downloaded Models

```bash
ssh seattle-tailscale "docker exec ollama-seattle ollama list"
```

#### Remove a Model

```bash
ssh seattle-tailscale "docker exec ollama-seattle ollama rm <model-name>"
```

## Integration with Perplexica

To add this Ollama instance as an LLM provider in Perplexica:

1. Navigate to **http://192.168.0.210:4785/settings**
2. Click **"Model Providers"**
3. Click **"Add Provider"**
4. Configure as follows:

   ```json
   {
     "name": "Ollama Seattle",
     "type": "ollama",
     "baseURL": "http://100.82.197.124:11434",
     "apiKey": ""
   }
   ```

5. Click **"Save"**
6. Select `qwen2.5:1.5b` from the model dropdown when searching

### Benefits of Multiple Ollama Instances

- **Load Distribution**: Distribute inference load across multiple servers
- **Redundancy**: If one instance is down, use the other
- **Model Variety**: Different instances can host different models
- **Network Optimization**: Use the closest/fastest instance

## Performance

### Expected Performance (CPU-Only)

| Model | Size | Tokens/Second | Memory Usage |
|-------|------|---------------|--------------|
| qwen2.5:1.5b | 986 MB | 8-12 | ~2-3 GB |
| qwen2.5:3b | ~2 GB | 5-8 | ~4-5 GB |
| llama3.2:3b | ~2 GB | 4-7 | ~4-5 GB |
| mistral:7b | ~4 GB | 2-4 | ~8-10 GB |

### Optimization Tips

1. **Use Smaller Models**: 1.5B and 3B models work best on CPU
2. **Limit Parallel Requests**: Set `OLLAMA_NUM_PARALLEL=2` to avoid overload
3. **Keep Models Loaded**: A long `OLLAMA_KEEP_ALIVE` prevents reload delays
4. **Monitor Memory**: Watch RAM usage with `docker stats ollama-seattle`

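To check the tokens/second figures against reality, the non-streaming `/api/generate` response reports `eval_count` (tokens generated) and `eval_duration` (nanoseconds), from which throughput follows directly. A sketch with illustrative numbers, not measured ones:

```shell
# Fields as returned by /api/generate with "stream": false
eval_count=142              # tokens generated
eval_duration=14200000000   # generation time in nanoseconds (14.2 s)

# tokens/second = eval_count / (eval_duration / 1e9)
tps=$(awk -v c="$eval_count" -v d="$eval_duration" \
  'BEGIN { printf "%.1f", c / (d / 1e9) }')
echo "$tps tokens/sec"   # → 10.0 tokens/sec
```

The same two fields can be pulled from a real response (e.g. with `jq`) to compare against the table above.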
## Monitoring

### Container Status

```bash
# Check if running
ssh seattle-tailscale "docker ps | grep ollama"

# View logs
ssh seattle-tailscale "docker logs -f ollama-seattle"

# Check resource usage
ssh seattle-tailscale "docker stats ollama-seattle"
```

### API Health Check

```bash
# Test connectivity
curl -m 5 http://100.82.197.124:11434/api/tags

# Test inference
curl http://100.82.197.124:11434/api/generate -d '{
  "model": "qwen2.5:1.5b",
  "prompt": "test",
  "stream": false
}'
```

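The connectivity test can be wrapped in a small function that exits non-zero on failure, so scripts or cron alerts can gate on it. A sketch (the function name and default host are this doc's conventions, not part of Ollama):

```shell
# Probe an Ollama instance's /api/tags endpoint; returns non-zero when
# the instance cannot be reached within 5 seconds
check_ollama() {
  local host="${1:-100.82.197.124}"
  if curl -sf -m 5 "http://$host:11434/api/tags" > /dev/null; then
    echo "ollama on $host: OK"
  else
    echo "ollama on $host: UNREACHABLE"
    return 1
  fi
}

# check_ollama                  # Seattle instance (default)
# check_ollama 192.168.0.200    # Atlantis instance
```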
### Performance Metrics

```bash
# Check response time
time curl -s http://100.82.197.124:11434/api/tags > /dev/null

# Monitor CPU usage
ssh seattle-tailscale "top -b -n 1 | grep ollama"
```

## Troubleshooting

### Container Won't Start

```bash
# Check logs
ssh seattle-tailscale "docker logs ollama-seattle"

# Common issues:
# - Port 11434 already in use
# - Insufficient memory
# - Volume mount permissions
```

### Slow Inference

**Causes**:

- Model too large for available CPU
- Too many parallel requests
- Insufficient RAM

**Solutions**:

```bash
# Use a smaller model
docker exec ollama-seattle ollama pull qwen2.5:1.5b

# Reduce parallel requests
# Edit ollama.yaml: OLLAMA_NUM_PARALLEL=1

# Increase CPU allocation
# Edit ollama.yaml: cpus: '16'
```

### Connection Timeout

**Problem**: Unable to reach Ollama from other machines

**Solutions**:

1. Verify Tailscale connection:

   ```bash
   ping 100.82.197.124
   tailscale status | grep seattle
   ```

2. Check firewall:

   ```bash
   ssh seattle-tailscale "ss -tlnp | grep 11434"
   ```

3. Verify container is listening:

   ```bash
   ssh seattle-tailscale "docker exec ollama-seattle netstat -tlnp"
   ```

### Model Download Fails

```bash
# Check available disk space
ssh seattle-tailscale "df -h"

# Check internet connectivity
ssh seattle-tailscale "curl -I https://ollama.com"

# Try a manual download
ssh seattle-tailscale "docker exec -it ollama-seattle ollama pull <model>"
```

## Maintenance

### Updates

```bash
# Pull the latest Ollama image
ssh seattle-tailscale "docker pull ollama/ollama:latest"

# Recreate the container
ssh seattle-tailscale "cd /opt/ollama && docker compose up -d --force-recreate"
```

### Backup

```bash
# Back up models and configuration; \$(pwd) is escaped so it expands on
# the remote host rather than on the local machine running ssh
ssh seattle-tailscale "docker run --rm -v ollama-seattle-data:/data -v \$(pwd):/backup alpine tar czf /backup/ollama-backup.tar.gz /data"

# Restore
ssh seattle-tailscale "docker run --rm -v ollama-seattle-data:/data -v \$(pwd):/backup alpine tar xzf /backup/ollama-backup.tar.gz -C /"
```

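The same tar flags can be dry-run locally to confirm the archive round-trips, with no Docker or remote host involved; the paths below are throwaway temp files, not the real volume:

```shell
# Stand-in for the /data volume contents
demo=$(mktemp -d)
mkdir -p "$demo/models"
echo "demo" > "$demo/models/manifest"

# Create the archive (same czf / -C pattern as the backup command)
tar czf "$demo.tar.gz" -C "$demo" models

# List it to verify; the same check works on the real ollama-backup.tar.gz
tar tzf "$demo.tar.gz" | grep manifest
```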
### Cleanup

```bash
# Remove unused models
ssh seattle-tailscale "docker exec ollama-seattle ollama list"
ssh seattle-tailscale "docker exec ollama-seattle ollama rm <unused-model>"

# Clean up Docker
ssh seattle-tailscale "docker system prune -f"
```

## Security Considerations

### Network Access

- Ollama is exposed on port 11434
- **Only accessible via Tailscale** (100.82.197.124)
- Not exposed to the public internet
- Consider adding authentication if exposing it publicly

### API Security

Ollama doesn't have built-in authentication. For production use:

1. **Use a reverse proxy** with authentication (Nginx, Caddy)
2. **Restrict access** via firewall rules
3. **Use Tailscale ACLs** to limit access
4. **Monitor usage** for abuse

## Cost Analysis

### Contabo VPS Costs

- **Monthly Cost**: ~$25-35 USD
- **Inference Cost**: $0 (self-hosted)
- **vs Cloud APIs**: OpenAI costs ~$0.15-0.60 per 1M tokens

### Break-even Analysis

- **Light usage** (<1M tokens/month): Cloud APIs cheaper
- **Medium usage** (1-10M tokens/month): Self-hosted breaks even
- **Heavy usage** (>10M tokens/month): Self-hosted much cheaper

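The break-even bands can be sanity-checked with a one-liner comparing flat VPS cost to metered API spend; the numbers here are illustrative, taken from the ranges above:

```shell
tokens_millions=100   # monthly volume, in millions of tokens
api_per_1m=0.60       # USD per 1M tokens (upper end of the cloud range)
vps_monthly=30        # USD/month (mid-range of the $25-35 estimate)

# Metered spend grows with volume; the VPS cost stays flat
awk -v t="$tokens_millions" -v p="$api_per_1m" -v c="$vps_monthly" \
  'BEGIN { printf "API: $%.2f vs VPS: $%.2f\n", t * p, c }'
```

At this assumed price, 100M tokens/month costs twice the VPS; at lower per-token prices the break-even volume shifts higher.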
## Future Enhancements

### Potential Improvements

1. **GPU Support**: Migrate to a GPU-enabled VPS for faster inference
2. **Load Balancer**: Set up Nginx to load balance between Ollama instances
3. **Auto-scaling**: Deploy additional instances based on load
4. **Model Caching**: Pre-warm multiple models for faster switching
5. **Monitoring Dashboard**: Grafana + Prometheus for metrics
6. **API Gateway**: Add rate limiting and authentication

### Model Recommendations

For different use cases on CPU:

- **Fast responses**: qwen2.5:1.5b, phi3:3.8b
- **Better quality**: qwen2.5:3b, llama3.2:3b
- **Code tasks**: qwen2.5-coder:1.5b, codegemma:2b
- **Instruction following**: mistral:7b (slower but better)

## Related Services

- **Atlantis Ollama** (`192.168.0.200:11434`) - Main Ollama instance
- **Perplexica** (`192.168.0.210:4785`) - AI search engine client
- **LM Studio** (`100.98.93.15:1234`) - Alternative LLM server

## References

- [Ollama Documentation](https://github.com/ollama/ollama)
- [Available Models](https://ollama.com/library)
- [Ollama API Reference](https://github.com/ollama/ollama/blob/main/docs/api.md)
- [Qwen 2.5 Model Card](https://ollama.com/library/qwen2.5)

---

**Status:** ✅ Fully operational
**Last Updated:** February 16, 2026
**Maintained By:** Docker Compose (manual)