# Ollama on Seattle - Local LLM Inference Server
## Overview
| Setting | Value |
|---------|-------|
| **Host** | Seattle VM (Contabo VPS) |
| **Port** | 11434 (Ollama API) |
| **Image** | `ollama/ollama:latest` |
| **API** | http://100.82.197.124:11434 (Tailscale) |
| **Stack File** | `hosts/vms/seattle/ollama.yaml` |
| **Data Volume** | `ollama-seattle-data` |
## Why Ollama on Seattle?
Ollama was deployed on Seattle to provide:
1. **CPU-Only Inference**: Ollama is optimized for CPU inference, unlike vLLM, which is designed primarily for GPUs
2. **Additional Capacity**: Supplements the main Ollama instance on Atlantis (192.168.0.200)
3. **Geographic Distribution**: Runs on a Contabo VPS, providing inference capability outside the local network
4. **Integration with Perplexica**: Can be added as an additional LLM provider for redundancy
## Specifications
### Hardware
- **CPU**: 16 vCPU AMD EPYC Processor
- **RAM**: 64GB
- **Storage**: 300GB SSD
- **Location**: Contabo Data Center
- **Network**: Tailscale VPN (100.82.197.124)
### Resource Allocation
```yaml
limits:
  cpus: '12'
  memory: 32G
reservations:
  cpus: '4'
  memory: 8G
```
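To confirm Docker actually applied these limits, inspect the running container (a host-side check; Docker reports the values in nano-CPUs and bytes):
```bash
docker inspect -f '{{.HostConfig.NanoCpus}} {{.HostConfig.Memory}}' ollama-seattle
# Expect 12000000000 (12 CPUs) and 34359738368 (32G) given the limits above
```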
## Installed Models
### Qwen 2.5 1.5B Instruct
- **Model ID**: `qwen2.5:1.5b`
- **Size**: ~986 MB
- **Context Window**: 32K tokens
- **Use Case**: Fast, lightweight inference for search queries
- **Performance**: Excellent on CPU, ~8-12 tokens/second (see the performance table below)
## Installation History
### February 16, 2026 - Initial Setup
**Problem**: Attempted to use vLLM for CPU inference
- vLLM container crashed with device detection errors
- vLLM is primarily designed for GPU inference
- CPU mode is not well-supported in recent vLLM versions
**Solution**: Switched to Ollama
- Ollama is specifically optimized for CPU inference
- Provides better performance and reliability on CPU-only systems
- Simpler configuration and management
- Native support for multiple model formats
**Deployment Steps**:
1. Removed failing vLLM container
2. Created `ollama.yaml` docker-compose configuration
3. Deployed Ollama container
4. Pulled `qwen2.5:1.5b` model
5. Tested API connectivity via Tailscale
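Condensed, those steps look like this (run on the Seattle host; assumes the stack file lives at `/opt/ollama`, matching the update procedure below):
```bash
cd /opt/ollama
docker compose -f ollama.yaml up -d                    # deploy the container
docker exec ollama-seattle ollama pull qwen2.5:1.5b    # fetch the model
curl -m 5 http://localhost:11434/api/tags              # sanity-check the API
```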
## Configuration
### Docker Compose
See `hosts/vms/seattle/ollama.yaml`:
```yaml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama-seattle
    ports:
      - "11434:11434"
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
      - OLLAMA_KEEP_ALIVE=24h
      - OLLAMA_NUM_PARALLEL=2
      - OLLAMA_MAX_LOADED_MODELS=2
    volumes:
      - ollama-data:/root/.ollama
    restart: unless-stopped
```
### Environment Variables
- `OLLAMA_HOST`: Bind to all interfaces
- `OLLAMA_KEEP_ALIVE`: Keep models loaded for 24 hours
- `OLLAMA_NUM_PARALLEL`: Allow 2 parallel requests
- `OLLAMA_MAX_LOADED_MODELS`: Cache up to 2 models in memory
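A quick way to confirm these are set inside the running container:
```bash
ssh seattle-tailscale "docker exec ollama-seattle env | grep OLLAMA"
# Expected output (per the compose file above):
# OLLAMA_HOST=0.0.0.0:11434
# OLLAMA_KEEP_ALIVE=24h
# OLLAMA_NUM_PARALLEL=2
# OLLAMA_MAX_LOADED_MODELS=2
```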
## Usage
### API Endpoints
#### List Models
```bash
curl http://100.82.197.124:11434/api/tags
```
#### Generate Completion
```bash
curl http://100.82.197.124:11434/api/generate -d '{
"model": "qwen2.5:1.5b",
"prompt": "Explain quantum computing in simple terms"
}'
```
#### Chat Completion
```bash
curl http://100.82.197.124:11434/api/chat -d '{
"model": "qwen2.5:1.5b",
"messages": [
{"role": "user", "content": "Hello!"}
]
}'
```
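#### Chat Completion (OpenAI-Compatible)
Ollama also serves an OpenAI-compatible API under `/v1`, useful for clients that only speak the OpenAI format:
```bash
curl http://100.82.197.124:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5:1.5b",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```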
### Model Management
#### Pull a New Model
```bash
ssh seattle-tailscale "docker exec ollama-seattle ollama pull <model-name>"
# Examples:
# docker exec ollama-seattle ollama pull qwen2.5:3b
# docker exec ollama-seattle ollama pull llama3.2:3b
# docker exec ollama-seattle ollama pull mistral:7b
```
#### List Downloaded Models
```bash
ssh seattle-tailscale "docker exec ollama-seattle ollama list"
```
#### Remove a Model
```bash
ssh seattle-tailscale "docker exec ollama-seattle ollama rm <model-name>"
```
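#### Show Loaded Models
`ollama ps` reports which models are currently resident in memory:
```bash
ssh seattle-tailscale "docker exec ollama-seattle ollama ps"
```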
## Integration with Perplexica
To add this Ollama instance as an LLM provider in Perplexica:
1. Navigate to **http://192.168.0.210:4785/settings**
2. Click **"Model Providers"**
3. Click **"Add Provider"**
4. Configure as follows:
```json
{
  "name": "Ollama Seattle",
  "type": "ollama",
  "baseURL": "http://100.82.197.124:11434",
  "apiKey": ""
}
```
5. Click **"Save"**
6. Select `qwen2.5:1.5b` from the model dropdown when searching
### Benefits of Multiple Ollama Instances
- **Load Distribution**: Distribute inference load across multiple servers
- **Redundancy**: If one instance is down, use the other (see the failover sketch after this list)
- **Model Variety**: Different instances can host different models
- **Network Optimization**: Use closest/fastest instance
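As a minimal illustration of the redundancy point, a hypothetical shell helper (names are illustrative) that prefers the Seattle instance and falls back to Atlantis when Seattle does not answer within 3 seconds:
```bash
#!/usr/bin/env bash
OLLAMA_SEATTLE=http://100.82.197.124:11434
OLLAMA_ATLANTIS=http://192.168.0.200:11434

# Return the first instance whose API answers a quick health probe
ollama_url() {
  if curl -sf -m 3 "$OLLAMA_SEATTLE/api/tags" > /dev/null; then
    echo "$OLLAMA_SEATTLE"
  else
    echo "$OLLAMA_ATLANTIS"
  fi
}

curl "$(ollama_url)/api/generate" -d '{"model": "qwen2.5:1.5b", "prompt": "hello", "stream": false}'
```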
## Performance
### Expected Performance (CPU-Only)
| Model | Size | Tokens/Second | Memory Usage |
|-------|------|---------------|--------------|
| qwen2.5:1.5b | 986 MB | 8-12 | ~2-3 GB |
| qwen2.5:3b | ~2 GB | 5-8 | ~4-5 GB |
| llama3.2:3b | ~2 GB | 4-7 | ~4-5 GB |
| mistral:7b | ~4 GB | 2-4 | ~8-10 GB |
### Optimization Tips
1. **Use Smaller Models**: 1.5B and 3B models work best on CPU
2. **Limit Parallel Requests**: Set `OLLAMA_NUM_PARALLEL=2` to avoid overload
3. **Keep Models Loaded**: Long `OLLAMA_KEEP_ALIVE` prevents reload delays
4. **Monitor Memory**: Watch RAM usage with `docker stats ollama-seattle`
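Putting tips 3 and 4 into practice: a one-shot resource snapshot, plus Ollama's `/api/ps` endpoint, which lists the models currently loaded and their memory footprint:
```bash
# One-shot CPU/RAM snapshot of the container
ssh seattle-tailscale "docker stats --no-stream ollama-seattle"
# Which models are resident in memory, and how large they are
curl -s http://100.82.197.124:11434/api/ps
```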
## Monitoring
### Container Status
```bash
# Check if running
ssh seattle-tailscale "docker ps | grep ollama"
# View logs
ssh seattle-tailscale "docker logs -f ollama-seattle"
# Check resource usage
ssh seattle-tailscale "docker stats ollama-seattle"
```
### API Health Check
```bash
# Test connectivity
curl -m 5 http://100.82.197.124:11434/api/tags
# Test inference
curl http://100.82.197.124:11434/api/generate -d '{
"model": "qwen2.5:1.5b",
"prompt": "test",
"stream": false
}'
```
### Performance Metrics
```bash
# Check response time
time curl -s http://100.82.197.124:11434/api/tags > /dev/null
# Monitor CPU usage
ssh seattle-tailscale "top -b -n 1 | grep ollama"
```
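For a true tokens-per-second figure, use the timing fields returned by a non-streaming generate call: `eval_count` tokens produced over `eval_duration` nanoseconds (assumes `jq` is available locally):
```bash
curl -s http://100.82.197.124:11434/api/generate -d '{
  "model": "qwen2.5:1.5b",
  "prompt": "Explain DNS in one paragraph.",
  "stream": false
}' | jq '{tokens: .eval_count, seconds: (.eval_duration / 1e9), tok_per_sec: (.eval_count / .eval_duration * 1e9)}'
```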
## Troubleshooting
### Container Won't Start
```bash
# Check logs
ssh seattle-tailscale "docker logs ollama-seattle"
# Common issues:
# - Port 11434 already in use
# - Insufficient memory
# - Volume mount permissions
```
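Host-side checks matching each of those causes:
```bash
# Port conflict: is something else already bound to 11434?
ssh seattle-tailscale "ss -tlnp | grep 11434"
# Memory pressure: how much RAM is actually free?
ssh seattle-tailscale "free -h"
# Volume problems: does the data volume exist and mount cleanly?
ssh seattle-tailscale "docker volume inspect ollama-seattle-data"
```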
### Slow Inference
**Causes**:
- Model too large for available CPU
- Too many parallel requests
- Insufficient RAM
**Solutions**:
```bash
# Use a smaller model
docker exec ollama-seattle ollama pull qwen2.5:1.5b
# Reduce parallel requests
# Edit ollama.yaml: OLLAMA_NUM_PARALLEL=1
# Increase CPU allocation
# Edit ollama.yaml: cpus: '16'
```
### Connection Timeout
**Problem**: Unable to reach Ollama from other machines
**Solutions**:
1. Verify Tailscale connection:
```bash
ping 100.82.197.124
tailscale status | grep seattle
```
2. Check that the port is listening on the host:
```bash
ssh seattle-tailscale "ss -tlnp | grep 11434"
```
3. Verify the container responds locally (netstat is often missing from minimal images, so query the API instead):
```bash
ssh seattle-tailscale "curl -s http://localhost:11434/api/tags"
```
### Model Download Fails
```bash
# Check available disk space
ssh seattle-tailscale "df -h"
# Check internet connectivity
ssh seattle-tailscale "curl -I https://ollama.com"
# Try manual download
ssh seattle-tailscale "docker exec -it ollama-seattle ollama pull <model>"
```
## Maintenance
### Updates
```bash
# Pull latest Ollama image
ssh seattle-tailscale "docker pull ollama/ollama:latest"
# Recreate container
ssh seattle-tailscale "cd /opt/ollama && docker compose up -d --force-recreate"
```
### Backup
```bash
# Backup models and configuration (single quotes so $(pwd) expands on the remote host)
ssh seattle-tailscale 'docker run --rm -v ollama-seattle-data:/data -v "$(pwd)":/backup alpine tar czf /backup/ollama-backup.tar.gz /data'
# Restore
ssh seattle-tailscale 'docker run --rm -v ollama-seattle-data:/data -v "$(pwd)":/backup alpine tar xzf /backup/ollama-backup.tar.gz -C /'
```
### Cleanup
```bash
# Remove unused models
ssh seattle-tailscale "docker exec ollama-seattle ollama list"
ssh seattle-tailscale "docker exec ollama-seattle ollama rm <unused-model>"
# Clean up Docker
ssh seattle-tailscale "docker system prune -f"
```
## Security Considerations
### Network Access
- Ollama is exposed on port 11434
- **Only accessible via Tailscale** (100.82.197.124)
- Not exposed to public internet
- Consider adding authentication if exposing publicly
### API Security
Ollama doesn't have built-in authentication. For production use:
1. **Use a reverse proxy** with authentication (Nginx, Caddy)
2. **Restrict access** via firewall rules
3. **Use Tailscale ACLs** to limit access
4. **Monitor usage** for abuse
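As an example of point 2, hypothetical `ufw` rules (assuming ufw is in use on the host) that limit the port to Tailscale's CGNAT range:
```bash
# Allow Ollama only from Tailscale addresses (100.64.0.0/10), deny all else.
# Caveat: Docker's own iptables rules for published ports can bypass ufw,
# so verify from outside the tailnet after applying.
sudo ufw allow from 100.64.0.0/10 to any port 11434 proto tcp
sudo ufw deny 11434/tcp
```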
## Cost Analysis
### Contabo VPS Costs
- **Monthly Cost**: ~$25-35 USD
- **Inference Cost**: $0 (self-hosted)
- **vs Cloud APIs**: OpenAI costs ~$0.15-0.60 per 1M tokens
### Break-even Analysis
At the quoted rates, 1M API tokens costs $0.15-0.60, so a ~$30/month VPS dedicated to inference breaks even at roughly 50-200M tokens/month:
- **Light usage** (<10M tokens/month): cloud APIs are cheaper on raw token cost
- **Heavy usage** (50M+ tokens/month): self-hosting wins decisively
- **This setup**: the VPS already exists in the homelab, so inference here adds little marginal cost at any volume
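The arithmetic behind the break-even point, using an assumed mid-range API price of $0.30 per 1M tokens:
```bash
# Millions of tokens per month at which a $30/month VPS matches API spend
echo "30 / 0.30" | bc   # => 100, i.e. ~100M tokens/month
```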
## Future Enhancements
### Potential Improvements
1. **GPU Support**: Migrate to GPU-enabled VPS for faster inference
2. **Load Balancer**: Set up Nginx to load balance between Ollama instances
3. **Auto-scaling**: Deploy additional instances based on load
4. **Model Caching**: Pre-warm multiple models for faster switching
5. **Monitoring Dashboard**: Grafana + Prometheus for metrics
6. **API Gateway**: Add rate limiting and authentication
### Model Recommendations
For different use cases on CPU:
- **Fast responses**: qwen2.5:1.5b, phi3:3.8b
- **Better quality**: qwen2.5:3b, llama3.2:3b
- **Code tasks**: qwen2.5-coder:1.5b, codegemma:2b
- **Instruction following**: mistral:7b (slower but better)
## Related Services
- **Atlantis Ollama** (`192.168.0.200:11434`) - Main Ollama instance
- **Perplexica** (`192.168.0.210:4785`) - AI search engine client
- **LM Studio** (`100.98.93.15:1234`) - Alternative LLM server
## References
- [Ollama Documentation](https://github.com/ollama/ollama)
- [Available Models](https://ollama.com/library)
- [Ollama API Reference](https://github.com/ollama/ollama/blob/main/docs/api.md)
- [Qwen 2.5 Model Card](https://ollama.com/library/qwen2.5)
---
**Status:** ✅ Fully operational
**Last Updated:** February 16, 2026
**Maintained By:** Docker Compose (manual)