# Ollama on Seattle - Local LLM Inference Server

## Overview

| Setting | Value |
|---------|-------|
| **Host** | Seattle VM (Contabo VPS) |
| **Port** | 11434 (Ollama API) |
| **Image** | `ollama/ollama:latest` |
| **API** | http://100.82.197.124:11434 (Tailscale) |
| **Stack File** | `hosts/vms/seattle/ollama.yaml` |
| **Data Volume** | `ollama-seattle-data` |

## Why Ollama on Seattle?

Ollama was deployed on Seattle to provide:

1. **CPU-Only Inference**: Ollama is optimized for CPU inference, unlike vLLM, which requires a GPU
2. **Additional Capacity**: Supplements the main Ollama instance on Atlantis (192.168.0.200)
3. **Geographic Distribution**: Runs on a Contabo VPS, providing inference capability outside the local network
4. **Integration with Perplexica**: Can be added as an additional LLM provider for redundancy

## Specifications

### Hardware
- **CPU**: 16 vCPU AMD EPYC Processor
- **RAM**: 64GB
- **Storage**: 300GB SSD
- **Location**: Contabo Data Center
- **Network**: Tailscale VPN (100.82.197.124)

### Resource Allocation

Set via the Compose `deploy.resources` block:

```yaml
deploy:
  resources:
    limits:
      cpus: '12'
      memory: 32G
    reservations:
      cpus: '4'
      memory: 8G
```

## Installed Models

### Qwen 2.5 1.5B Instruct
- **Model ID**: `qwen2.5:1.5b`
- **Size**: ~986 MB
- **Context Window**: 32K tokens
- **Use Case**: Fast, lightweight inference for search queries
- **Performance**: Excellent on CPU, ~5-10 tokens/second

## Installation History

### February 16, 2026 - Initial Setup

**Problem**: Attempted to use vLLM for CPU inference
- The vLLM container crashed with device detection errors
- vLLM is primarily designed for GPU inference
- CPU mode is not well supported in recent vLLM versions

**Solution**: Switched to Ollama
- Ollama is specifically optimized for CPU inference
- Provides better performance and reliability on CPU-only systems
- Simpler configuration and management
- Native support for multiple model formats

**Deployment Steps**:
1. Removed the failing vLLM container
2. Created the `ollama.yaml` Docker Compose configuration
3. Deployed the Ollama container
4. Pulled the `qwen2.5:1.5b` model
5. Tested API connectivity via Tailscale

## Configuration

### Docker Compose

See `hosts/vms/seattle/ollama.yaml` (the named volume matches the `ollama-seattle-data` volume referenced above and in the backup commands):

```yaml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama-seattle
    ports:
      - "11434:11434"
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
      - OLLAMA_KEEP_ALIVE=24h
      - OLLAMA_NUM_PARALLEL=2
      - OLLAMA_MAX_LOADED_MODELS=2
    volumes:
      - ollama-seattle-data:/root/.ollama
    restart: unless-stopped

volumes:
  ollama-seattle-data:
```

### Environment Variables

- `OLLAMA_HOST`: Bind to all interfaces
- `OLLAMA_KEEP_ALIVE`: Keep models loaded for 24 hours
- `OLLAMA_NUM_PARALLEL`: Allow 2 parallel requests
- `OLLAMA_MAX_LOADED_MODELS`: Cache up to 2 models in memory

## Usage

### API Endpoints

#### List Models
```bash
curl http://100.82.197.124:11434/api/tags
```

#### Generate Completion
```bash
curl http://100.82.197.124:11434/api/generate -d '{
  "model": "qwen2.5:1.5b",
  "prompt": "Explain quantum computing in simple terms"
}'
```

#### Chat Completion
```bash
curl http://100.82.197.124:11434/api/chat -d '{
  "model": "qwen2.5:1.5b",
  "messages": [
    {"role": "user", "content": "Hello!"}
  ]
}'
```
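
By default these endpoints stream newline-delimited JSON objects rather than returning one response (add `"stream": false` for a single body). A minimal Python sketch for consuming a chat stream; the request itself is commented out because it needs the live Tailscale endpoint:

```python
import json

def collect_stream(lines):
    """Join the content pieces from an Ollama chat stream.

    Each line is one JSON object; the generated text lives at
    message.content, and the final object carries "done": true.
    """
    parts = []
    for line in lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        if chunk.get("done"):
            break
        parts.append(chunk.get("message", {}).get("content", ""))
    return "".join(parts)

# Against the live server, something like:
# import urllib.request
# req = urllib.request.Request(
#     "http://100.82.197.124:11434/api/chat",
#     data=json.dumps({"model": "qwen2.5:1.5b",
#                      "messages": [{"role": "user", "content": "Hello!"}]}).encode(),
# )
# with urllib.request.urlopen(req) as resp:
#     print(collect_stream(resp))
```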

### Model Management

#### Pull a New Model
```bash
ssh seattle-tailscale "docker exec ollama-seattle ollama pull <model-name>"

# Examples:
# docker exec ollama-seattle ollama pull qwen2.5:3b
# docker exec ollama-seattle ollama pull llama3.2:3b
# docker exec ollama-seattle ollama pull mistral:7b
```

#### List Downloaded Models
```bash
ssh seattle-tailscale "docker exec ollama-seattle ollama list"
```

#### Remove a Model
```bash
ssh seattle-tailscale "docker exec ollama-seattle ollama rm <model-name>"
```

## Integration with Perplexica

To add this Ollama instance as an LLM provider in Perplexica:

1. Navigate to **http://192.168.0.210:4785/settings**
2. Click **"Model Providers"**
3. Click **"Add Provider"**
4. Configure as follows:

```json
{
  "name": "Ollama Seattle",
  "type": "ollama",
  "baseURL": "http://100.82.197.124:11434",
  "apiKey": ""
}
```

5. Click **"Save"**
6. Select `qwen2.5:1.5b` from the model dropdown when searching

### Benefits of Multiple Ollama Instances

- **Load Distribution**: Distribute inference load across multiple servers
- **Redundancy**: If one instance is down, use the other
- **Model Variety**: Different instances can host different models
- **Network Optimization**: Use the closest/fastest instance
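
The redundancy point can be sketched as client-side failover: probe each instance (a healthy Ollama answers `GET /api/tags`) and use the first one that responds. The instance ordering below is illustrative:

```python
import urllib.request

# Candidate instances, preferred first (addresses from this deployment).
INSTANCES = [
    "http://192.168.0.200:11434",   # Atlantis (LAN)
    "http://100.82.197.124:11434",  # Seattle (Tailscale)
]

def is_healthy(base_url, timeout=3):
    """A live Ollama instance answers GET /api/tags with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as r:
            return r.status == 200
    except OSError:
        return False

def pick_instance(urls, healthy=is_healthy):
    """Return the first reachable instance, or None if all are down."""
    for url in urls:
        if healthy(url):
            return url
    return None
```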

## Performance

### Expected Performance (CPU-Only)

| Model | Size | Tokens/Second | Memory Usage |
|-------|------|---------------|--------------|
| qwen2.5:1.5b | 986 MB | 8-12 | ~2-3 GB |
| qwen2.5:3b | ~2 GB | 5-8 | ~4-5 GB |
| llama3.2:3b | ~2 GB | 4-7 | ~4-5 GB |
| mistral:7b | ~4 GB | 2-4 | ~8-10 GB |
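
These figures can be measured rather than estimated: a non-streaming response from `/api/generate` includes `eval_count` (tokens generated) and `eval_duration` (in nanoseconds), so throughput is their ratio. A small sketch:

```python
def tokens_per_second(resp):
    """Throughput from Ollama's response metrics: eval_count tokens
    generated over eval_duration nanoseconds."""
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

# The JSON body of a '"stream": false' request carries these fields, e.g.:
sample = {"eval_count": 120, "eval_duration": 12_000_000_000}  # 12 seconds
print(tokens_per_second(sample))  # 10.0 tokens/second
```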

### Optimization Tips

1. **Use Smaller Models**: 1.5B and 3B models work best on CPU
2. **Limit Parallel Requests**: Set `OLLAMA_NUM_PARALLEL=2` to avoid overload
3. **Keep Models Loaded**: A long `OLLAMA_KEEP_ALIVE` prevents reload delays
4. **Monitor Memory**: Watch RAM usage with `docker stats ollama-seattle`

## Monitoring

### Container Status
```bash
# Check if running
ssh seattle-tailscale "docker ps | grep ollama"

# View logs
ssh seattle-tailscale "docker logs -f ollama-seattle"

# Check resource usage
ssh seattle-tailscale "docker stats ollama-seattle"
```

### API Health Check
```bash
# Test connectivity
curl -m 5 http://100.82.197.124:11434/api/tags

# Test inference
curl http://100.82.197.124:11434/api/generate -d '{
  "model": "qwen2.5:1.5b",
  "prompt": "test",
  "stream": false
}'
```

### Performance Metrics
```bash
# Check response time
time curl -s http://100.82.197.124:11434/api/tags > /dev/null

# Monitor CPU usage
ssh seattle-tailscale "top -b -n 1 | grep ollama"
```

## Troubleshooting

### Container Won't Start

```bash
# Check logs
ssh seattle-tailscale "docker logs ollama-seattle"

# Common issues:
# - Port 11434 already in use
# - Insufficient memory
# - Volume mount permissions
```

### Slow Inference

**Causes**:
- Model too large for the available CPU
- Too many parallel requests
- Insufficient RAM

**Solutions**:
```bash
# Use a smaller model
docker exec ollama-seattle ollama pull qwen2.5:1.5b

# Reduce parallel requests
# Edit ollama.yaml: OLLAMA_NUM_PARALLEL=1

# Increase CPU allocation
# Edit ollama.yaml: cpus: '16'
```

### Connection Timeout

**Problem**: Unable to reach Ollama from other machines

**Solutions**:
1. Verify the Tailscale connection:

   ```bash
   ping 100.82.197.124
   tailscale status | grep seattle
   ```

2. Check that port 11434 is listening on the host (and not blocked by a firewall):

   ```bash
   ssh seattle-tailscale "ss -tlnp | grep 11434"
   ```

3. Verify the container is listening:

   ```bash
   ssh seattle-tailscale "docker exec ollama-seattle netstat -tlnp"
   ```

### Model Download Fails

```bash
# Check available disk space
ssh seattle-tailscale "df -h"

# Check internet connectivity
ssh seattle-tailscale "curl -I https://ollama.com"

# Try a manual download
ssh seattle-tailscale "docker exec -it ollama-seattle ollama pull <model>"
```

## Maintenance

### Updates

```bash
# Pull the latest Ollama image
ssh seattle-tailscale "docker pull ollama/ollama:latest"

# Recreate the container
ssh seattle-tailscale "cd /opt/ollama && docker compose up -d --force-recreate"
```

### Backup

```bash
# Back up models and configuration
# (\$ is escaped so \$(pwd) expands on the remote host, not locally)
ssh seattle-tailscale "docker run --rm -v ollama-seattle-data:/data -v \$(pwd):/backup alpine tar czf /backup/ollama-backup.tar.gz /data"

# Restore
ssh seattle-tailscale "docker run --rm -v ollama-seattle-data:/data -v \$(pwd):/backup alpine tar xzf /backup/ollama-backup.tar.gz -C /"
```

### Cleanup

```bash
# Remove unused models
ssh seattle-tailscale "docker exec ollama-seattle ollama list"
ssh seattle-tailscale "docker exec ollama-seattle ollama rm <unused-model>"

# Clean up Docker
ssh seattle-tailscale "docker system prune -f"
```

## Security Considerations

### Network Access

- Ollama is exposed on port 11434
- **Only accessible via Tailscale** (100.82.197.124)
- Not exposed to the public internet
- Consider adding authentication if exposing it publicly

### API Security

Ollama doesn't have built-in authentication. For production use:

1. **Use a reverse proxy** with authentication (Nginx, Caddy)
2. **Restrict access** via firewall rules
3. **Use Tailscale ACLs** to limit access
4. **Monitor usage** for abuse
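
As a sketch of option 1, an Nginx fragment with HTTP basic auth in front of the API; the hostname, TLS setup, and htpasswd path are placeholders, not part of this deployment:

```nginx
server {
    listen 443 ssl;
    server_name ollama.example.com;  # placeholder

    auth_basic           "Ollama";
    auth_basic_user_file /etc/nginx/.htpasswd;  # placeholder path

    location / {
        proxy_pass http://127.0.0.1:11434;
        proxy_read_timeout 300s;  # allow for model loads and long generations
    }
}
```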

## Cost Analysis

### Contabo VPS Costs
- **Monthly Cost**: ~$25-35 USD
- **Inference Cost**: $0 (self-hosted)
- **vs Cloud APIs**: OpenAI costs ~$0.15-0.60 per 1M tokens
- **Light usage** (<1M tokens/month): Cloud APIs are cheaper
- **Medium usage** (1-10M tokens/month): Self-hosted breaks even
- **Heavy usage** (>10M tokens/month): Self-hosted is much cheaper
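
The break-even volume is just the monthly VPS cost divided by the per-million-token price, so it depends heavily on which cloud model you compare against. A quick sketch with illustrative prices (the $3.00 figure stands in for a pricier cloud model and is an assumption, not from the figures above):

```python
def break_even_tokens(monthly_cost_usd, price_per_mtok_usd):
    """Millions of tokens/month at which self-hosting matches a cloud API
    on raw token price alone (ignoring latency, privacy, and ops effort)."""
    return monthly_cost_usd / price_per_mtok_usd

print(break_even_tokens(30, 0.60))  # roughly 50M tokens at the high-end price
print(break_even_tokens(30, 3.00))  # 10M tokens against a pricier model
```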

## Future Enhancements

### Potential Improvements

1. **GPU Support**: Migrate to a GPU-enabled VPS for faster inference
2. **Load Balancer**: Set up Nginx to load balance between Ollama instances
3. **Auto-scaling**: Deploy additional instances based on load
4. **Model Caching**: Pre-warm multiple models for faster switching
5. **Monitoring Dashboard**: Grafana + Prometheus for metrics
6. **API Gateway**: Add rate limiting and authentication

### Model Recommendations

For different use cases on CPU:

- **Fast responses**: qwen2.5:1.5b, phi3:3.8b
- **Better quality**: qwen2.5:3b, llama3.2:3b
- **Code tasks**: qwen2.5-coder:1.5b, codegemma:2b
- **Instruction following**: mistral:7b (slower but better)

## Related Services

- **Atlantis Ollama** (`192.168.0.200:11434`) - Main Ollama instance
- **Perplexica** (`192.168.0.210:4785`) - AI search engine client
- **LM Studio** (`100.98.93.15:1234`) - Alternative LLM server

## References

- [Ollama Documentation](https://github.com/ollama/ollama)
- [Available Models](https://ollama.com/library)
- [Ollama API Reference](https://github.com/ollama/ollama/blob/main/docs/api.md)
- [Qwen 2.5 Model Card](https://ollama.com/library/qwen2.5)

---

**Status:** ✅ Fully operational
**Last Updated:** February 16, 2026
**Maintained By:** Docker Compose (manual)