# Ollama on Seattle - Local LLM Inference Server
## Overview
| Setting | Value |
|---------|-------|
| **Host** | Seattle VM (Contabo VPS) |
| **Port** | 11434 (Ollama API) |
| **Image** | `ollama/ollama:latest` |
| **API** | http://100.82.197.124:11434 (Tailscale) |
| **Stack File** | `hosts/vms/seattle/ollama.yaml` |
| **Data Volume** | `ollama-seattle-data` |

## Why Ollama on Seattle?

Ollama was deployed on Seattle to provide:

1. **CPU-Only Inference**: Ollama is optimized for CPU inference, unlike vLLM, which requires a GPU
2. **Additional Capacity**: Supplements the main Ollama instance on Atlantis (192.168.0.200)
3. **Geographic Distribution**: Runs on a Contabo VPS, providing inference capability outside the local network
4. **Integration with Perplexica**: Can be added as an additional LLM provider for redundancy

## Specifications
### Hardware

- **CPU**: 16 vCPU AMD EPYC Processor
- **RAM**: 64GB
- **Storage**: 300GB SSD
- **Location**: Contabo Data Center
- **Network**: Tailscale VPN (100.82.197.124)

### Resource Allocation

```yaml
limits:
  cpus: '12'
  memory: 32G
reservations:
  cpus: '4'
  memory: 8G
```
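
On its own this fragment is ambiguous about placement; in a Compose file it sits under the service's `deploy.resources` key. A minimal sketch of the assumed wiring (an assumption, not a copy of the stack file; Docker Compose v2 applies these limits outside Swarm mode):

```yaml
services:
  ollama:
    deploy:
      resources:
        limits:
          cpus: '12'
          memory: 32G
        reservations:
          cpus: '4'
          memory: 8G
```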
## Installed Models

### Qwen 2.5 1.5B Instruct

- **Model ID**: `qwen2.5:1.5b`
- **Size**: ~986 MB
- **Context Window**: 32K tokens
- **Use Case**: Fast, lightweight inference for search queries
- **Performance**: Excellent on CPU, ~5-10 tokens/second

## Installation History

### February 16, 2026 - Initial Setup

**Problem**: Attempted to use vLLM for CPU inference

- vLLM container crashed with device detection errors
- vLLM is primarily designed for GPU inference
- CPU mode is not well supported in recent vLLM versions

**Solution**: Switched to Ollama

- Ollama is specifically optimized for CPU inference
- Provides better performance and reliability on CPU-only systems
- Simpler configuration and management
- Native support for multiple model formats

**Deployment Steps**:

1. Removed failing vLLM container
2. Created `ollama.yaml` Docker Compose configuration
3. Deployed Ollama container
4. Pulled `qwen2.5:1.5b` model
5. Tested API connectivity via Tailscale

## Configuration

### Docker Compose

See `hosts/vms/seattle/ollama.yaml`:

```yaml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama-seattle
    ports:
      - "11434:11434"
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
      - OLLAMA_KEEP_ALIVE=24h
      - OLLAMA_NUM_PARALLEL=2
      - OLLAMA_MAX_LOADED_MODELS=2
    volumes:
      - ollama-data:/root/.ollama
    restart: unless-stopped
```

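As quoted, the file never declares the named volume at the top level, and the overview table lists the data volume as `ollama-seattle-data` rather than `ollama-data`. Presumably the real file reconciles the two with a top-level `volumes:` block along these lines (an assumption, not a copy of the file):

```yaml
volumes:
  ollama-data:
    name: ollama-seattle-data
```
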
### Environment Variables

- `OLLAMA_HOST`: Bind to all interfaces
- `OLLAMA_KEEP_ALIVE`: Keep models loaded for 24 hours
- `OLLAMA_NUM_PARALLEL`: Allow 2 parallel requests
- `OLLAMA_MAX_LOADED_MODELS`: Cache up to 2 models in memory

## Usage

### API Endpoints

#### List Models

```bash
curl http://100.82.197.124:11434/api/tags
```

#### Generate Completion

```bash
curl http://100.82.197.124:11434/api/generate -d '{
  "model": "qwen2.5:1.5b",
  "prompt": "Explain quantum computing in simple terms"
}'
```

#### Chat Completion

```bash
curl http://100.82.197.124:11434/api/chat -d '{
  "model": "qwen2.5:1.5b",
  "messages": [
    {"role": "user", "content": "Hello!"}
  ]
}'
```

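For scripted use it can help to build the payload in a heredoc and validate it locally before sending. A sketch (assumes `python3` is on the PATH for the JSON check; with `"stream": false` the API returns a single JSON object instead of newline-delimited chunks):

```shell
# Build the /api/chat payload in a heredoc; the quoted 'EOF' delimiter
# prevents shell expansion inside the JSON
payload=$(cat <<'EOF'
{
  "model": "qwen2.5:1.5b",
  "messages": [
    {"role": "user", "content": "Hello!"}
  ],
  "stream": false
}
EOF
)

# Sanity-check the JSON locally before sending it anywhere
echo "$payload" | python3 -m json.tool > /dev/null && echo "payload OK"

# Then post it to the Seattle instance:
# curl http://100.82.197.124:11434/api/chat -d "$payload"
```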
### Model Management
#### Pull a New Model

```bash
ssh seattle-tailscale "docker exec ollama-seattle ollama pull <model-name>"

# Examples:
# docker exec ollama-seattle ollama pull qwen2.5:3b
# docker exec ollama-seattle ollama pull llama3.2:3b
# docker exec ollama-seattle ollama pull mistral:7b
```

#### List Downloaded Models

```bash
ssh seattle-tailscale "docker exec ollama-seattle ollama list"
```

#### Remove a Model

```bash
ssh seattle-tailscale "docker exec ollama-seattle ollama rm <model-name>"
```

## Integration with Perplexica

To add this Ollama instance as an LLM provider in Perplexica:

1. Navigate to **http://192.168.0.210:4785/settings**
2. Click **"Model Providers"**
3. Click **"Add Provider"**
4. Configure as follows:

   ```json
   {
     "name": "Ollama Seattle",
     "type": "ollama",
     "baseURL": "http://100.82.197.124:11434",
     "apiKey": ""
   }
   ```

5. Click **"Save"**
6. Select `qwen2.5:1.5b` from the model dropdown when searching

### Benefits of Multiple Ollama Instances

- **Load Distribution**: Distribute inference load across multiple servers
- **Redundancy**: If one instance is down, use the other
- **Model Variety**: Different instances can host different models
- **Network Optimization**: Use the closest/fastest instance

## Performance

### Expected Performance (CPU-Only)

| Model | Size | Tokens/Second | Memory Usage |
|-------|------|---------------|--------------|
| qwen2.5:1.5b | 986 MB | 8-12 | ~2-3 GB |
| qwen2.5:3b | ~2 GB | 5-8 | ~4-5 GB |
| llama3.2:3b | ~2 GB | 4-7 | ~4-5 GB |
| mistral:7b | ~4 GB | 2-4 | ~8-10 GB |

### Optimization Tips

1. **Use Smaller Models**: 1.5B and 3B models work best on CPU
2. **Limit Parallel Requests**: Set `OLLAMA_NUM_PARALLEL=2` to avoid overload
3. **Keep Models Loaded**: A long `OLLAMA_KEEP_ALIVE` prevents reload delays
4. **Monitor Memory**: Watch RAM usage with `docker stats ollama-seattle`

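To check the tokens/second figures against reality, the non-streaming `/api/generate` response reports `eval_count` (tokens generated) and `eval_duration` (nanoseconds), from which throughput follows directly. A sketch with illustrative numbers, not measured ones:

```shell
# Fields as returned by /api/generate with "stream": false
eval_count=142              # tokens generated
eval_duration=14200000000   # generation time in nanoseconds (14.2 s)

# tokens/second = eval_count / (eval_duration / 1e9)
tps=$(awk -v c="$eval_count" -v d="$eval_duration" \
  'BEGIN { printf "%.1f", c / (d / 1e9) }')
echo "$tps tokens/sec"   # → 10.0 tokens/sec
```

The same two fields can be pulled from a real response (e.g. with `jq`) to compare against the table above.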
## Monitoring

### Container Status

```bash
# Check if running
ssh seattle-tailscale "docker ps | grep ollama"

# View logs
ssh seattle-tailscale "docker logs -f ollama-seattle"

# Check resource usage
ssh seattle-tailscale "docker stats ollama-seattle"
```

### API Health Check

```bash
# Test connectivity
curl -m 5 http://100.82.197.124:11434/api/tags

# Test inference
curl http://100.82.197.124:11434/api/generate -d '{
  "model": "qwen2.5:1.5b",
  "prompt": "test",
  "stream": false
}'
```

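The connectivity test can be wrapped in a small function that exits non-zero on failure, so scripts or cron alerts can gate on it. A sketch (the function name and default host are this doc's conventions, not part of Ollama):

```shell
# Probe an Ollama instance's /api/tags endpoint; returns non-zero when
# the instance cannot be reached within 5 seconds
check_ollama() {
  local host="${1:-100.82.197.124}"
  if curl -sf -m 5 "http://$host:11434/api/tags" > /dev/null; then
    echo "ollama on $host: OK"
  else
    echo "ollama on $host: UNREACHABLE"
    return 1
  fi
}

# check_ollama                  # Seattle instance (default)
# check_ollama 192.168.0.200    # Atlantis instance
```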
### Performance Metrics

```bash
# Check response time
time curl -s http://100.82.197.124:11434/api/tags > /dev/null

# Monitor CPU usage
ssh seattle-tailscale "top -b -n 1 | grep ollama"
```

## Troubleshooting

### Container Won't Start

```bash
# Check logs
ssh seattle-tailscale "docker logs ollama-seattle"

# Common issues:
# - Port 11434 already in use
# - Insufficient memory
# - Volume mount permissions
```

### Slow Inference

**Causes**:

- Model too large for available CPU
- Too many parallel requests
- Insufficient RAM

**Solutions**:

```bash
# Use a smaller model
docker exec ollama-seattle ollama pull qwen2.5:1.5b

# Reduce parallel requests
# Edit ollama.yaml: OLLAMA_NUM_PARALLEL=1

# Increase CPU allocation
# Edit ollama.yaml: cpus: '16'
```

### Connection Timeout

**Problem**: Unable to reach Ollama from other machines

**Solutions**:

1. Verify Tailscale connection:

   ```bash
   ping 100.82.197.124
   tailscale status | grep seattle
   ```

2. Check firewall:

   ```bash
   ssh seattle-tailscale "ss -tlnp | grep 11434"
   ```

3. Verify container is listening:

   ```bash
   ssh seattle-tailscale "docker exec ollama-seattle netstat -tlnp"
   ```

### Model Download Fails

```bash
# Check available disk space
ssh seattle-tailscale "df -h"

# Check internet connectivity
ssh seattle-tailscale "curl -I https://ollama.com"

# Try a manual download
ssh seattle-tailscale "docker exec -it ollama-seattle ollama pull <model>"
```

## Maintenance

### Updates

```bash
# Pull the latest Ollama image
ssh seattle-tailscale "docker pull ollama/ollama:latest"

# Recreate the container
ssh seattle-tailscale "cd /opt/ollama && docker compose up -d --force-recreate"
```

### Backup

```bash
# Back up models and configuration; \$(pwd) is escaped so it expands on
# the remote host rather than on the local machine running ssh
ssh seattle-tailscale "docker run --rm -v ollama-seattle-data:/data -v \$(pwd):/backup alpine tar czf /backup/ollama-backup.tar.gz /data"

# Restore
ssh seattle-tailscale "docker run --rm -v ollama-seattle-data:/data -v \$(pwd):/backup alpine tar xzf /backup/ollama-backup.tar.gz -C /"
```

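The same tar flags can be dry-run locally to confirm the archive round-trips, with no Docker or remote host involved; the paths below are throwaway temp files, not the real volume:

```shell
# Stand-in for the /data volume contents
demo=$(mktemp -d)
mkdir -p "$demo/models"
echo "demo" > "$demo/models/manifest"

# Create the archive (same czf / -C pattern as the backup command)
tar czf "$demo.tar.gz" -C "$demo" models

# List it to verify; the same check works on the real ollama-backup.tar.gz
tar tzf "$demo.tar.gz" | grep manifest
```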
### Cleanup

```bash
# Remove unused models
ssh seattle-tailscale "docker exec ollama-seattle ollama list"
ssh seattle-tailscale "docker exec ollama-seattle ollama rm <unused-model>"

# Clean up Docker
ssh seattle-tailscale "docker system prune -f"
```

## Security Considerations

### Network Access

- Ollama is exposed on port 11434
- **Only accessible via Tailscale** (100.82.197.124)
- Not exposed to the public internet
- Consider adding authentication if exposing it publicly

### API Security

Ollama doesn't have built-in authentication. For production use:

1. **Use a reverse proxy** with authentication (Nginx, Caddy)
2. **Restrict access** via firewall rules
3. **Use Tailscale ACLs** to limit access
4. **Monitor usage** for abuse

## Cost Analysis

### Contabo VPS Costs

- **Monthly Cost**: ~$25-35 USD
- **Inference Cost**: $0 (self-hosted)
- **vs Cloud APIs**: OpenAI costs ~$0.15-0.60 per 1M tokens

### Break-even Analysis

- **Light usage** (<1M tokens/month): Cloud APIs cheaper
- **Medium usage** (1-10M tokens/month): Self-hosted breaks even
- **Heavy usage** (>10M tokens/month): Self-hosted much cheaper

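The break-even bands can be sanity-checked with a one-liner comparing flat VPS cost to metered API spend; the numbers here are illustrative, taken from the ranges above:

```shell
tokens_millions=100   # monthly volume, in millions of tokens
api_per_1m=0.60       # USD per 1M tokens (upper end of the cloud range)
vps_monthly=30        # USD/month (mid-range of the $25-35 estimate)

# Metered spend grows with volume; the VPS cost stays flat
awk -v t="$tokens_millions" -v p="$api_per_1m" -v c="$vps_monthly" \
  'BEGIN { printf "API: $%.2f vs VPS: $%.2f\n", t * p, c }'
```

At this assumed price, 100M tokens/month costs twice the VPS; at lower per-token prices the break-even volume shifts higher.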
## Future Enhancements

### Potential Improvements

1. **GPU Support**: Migrate to a GPU-enabled VPS for faster inference
2. **Load Balancer**: Set up Nginx to load balance between Ollama instances
3. **Auto-scaling**: Deploy additional instances based on load
4. **Model Caching**: Pre-warm multiple models for faster switching
5. **Monitoring Dashboard**: Grafana + Prometheus for metrics
6. **API Gateway**: Add rate limiting and authentication

### Model Recommendations

For different use cases on CPU:

- **Fast responses**: qwen2.5:1.5b, phi3:3.8b
- **Better quality**: qwen2.5:3b, llama3.2:3b
- **Code tasks**: qwen2.5-coder:1.5b, codegemma:2b
- **Instruction following**: mistral:7b (slower but better)

## Related Services

- **Atlantis Ollama** (`192.168.0.200:11434`) - Main Ollama instance
- **Perplexica** (`192.168.0.210:4785`) - AI search engine client
- **LM Studio** (`100.98.93.15:1234`) - Alternative LLM server

## References

- [Ollama Documentation](https://github.com/ollama/ollama)
- [Available Models](https://ollama.com/library)
- [Ollama API Reference](https://github.com/ollama/ollama/blob/main/docs/api.md)
- [Qwen 2.5 Model Card](https://ollama.com/library/qwen2.5)

---

**Status:** ✅ Fully operational
**Last Updated:** February 16, 2026
**Maintained By:** Docker Compose (manual)