Ollama on Seattle - Local LLM Inference Server

Overview

| Setting | Value |
| --- | --- |
| Host | Seattle VM (Contabo VPS) |
| Port | 11434 (Ollama API) |
| Image | ollama/ollama:latest |
| API | http://100.82.197.124:11434 (Tailscale) |
| Stack File | hosts/vms/seattle/ollama.yaml |
| Data Volume | ollama-seattle-data |

Why Ollama on Seattle?

Ollama was deployed on Seattle to provide:

  1. CPU-Only Inference: Ollama is optimized for CPU inference, unlike vLLM, which is designed primarily for GPUs
  2. Additional Capacity: Supplements the main Ollama instance on Atlantis (192.168.0.200)
  3. Geographic Distribution: Runs on a Contabo VPS, providing inference capability outside the local network
  4. Integration with Perplexica: Can be added as an additional LLM provider for redundancy

Specifications

Hardware

  • CPU: 16 vCPU AMD EPYC Processor
  • RAM: 64GB
  • Storage: 300GB SSD
  • Location: Contabo Data Center
  • Network: Tailscale VPN (100.82.197.124)

Resource Allocation

```yaml
limits:
  cpus: '12'
  memory: 32G
reservations:
  cpus: '4'
  memory: 8G
```

Installed Models

Qwen 2.5 1.5B Instruct

  • Model ID: qwen2.5:1.5b
  • Size: ~986 MB
  • Context Window: 32K tokens
  • Use Case: Fast, lightweight inference for search queries
  • Performance: Excellent on CPU, ~5-10 tokens/second

Installation History

February 16, 2026 - Initial Setup

Problem: Attempted to use vLLM for CPU inference

  • vLLM container crashed with device detection errors
  • vLLM is primarily designed for GPU inference
  • CPU mode is not well-supported in recent vLLM versions

Solution: Switched to Ollama

  • Ollama is specifically optimized for CPU inference
  • Provides better performance and reliability on CPU-only systems
  • Simpler configuration and management
  • Native support for multiple model formats

Deployment Steps:

  1. Removed failing vLLM container
  2. Created ollama.yaml docker-compose configuration
  3. Deployed Ollama container
  4. Pulled qwen2.5:1.5b model
  5. Tested API connectivity via Tailscale

Configuration

Docker Compose

See hosts/vms/seattle/ollama.yaml:

```yaml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama-seattle
    ports:
      - "11434:11434"
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
      - OLLAMA_KEEP_ALIVE=24h
      - OLLAMA_NUM_PARALLEL=2
      - OLLAMA_MAX_LOADED_MODELS=2
    volumes:
      - ollama-data:/root/.ollama
    restart: unless-stopped

# Top-level volume declaration, named to match the data volume above
volumes:
  ollama-data:
    name: ollama-seattle-data
```

Environment Variables

  • OLLAMA_HOST: Bind to all interfaces
  • OLLAMA_KEEP_ALIVE: Keep models loaded for 24 hours
  • OLLAMA_NUM_PARALLEL: Allow 2 parallel requests
  • OLLAMA_MAX_LOADED_MODELS: Cache up to 2 models in memory

Usage

API Endpoints

List Models

```bash
curl http://100.82.197.124:11434/api/tags
```

Generate Completion

```bash
curl http://100.82.197.124:11434/api/generate -d '{
  "model": "qwen2.5:1.5b",
  "prompt": "Explain quantum computing in simple terms"
}'
```

Chat Completion

```bash
curl http://100.82.197.124:11434/api/chat -d '{
  "model": "qwen2.5:1.5b",
  "messages": [
    {"role": "user", "content": "Hello!"}
  ]
}'
```
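
The same calls can be scripted. Below is a minimal Python sketch (standard library only) against the endpoint above; the streamed-chunk helper assumes Ollama's newline-delimited JSON streaming format, where each line carries a partial `response` field.

```python
import json
import urllib.request

# Tailscale address of the Seattle instance (from this doc)
OLLAMA_URL = "http://100.82.197.124:11434"

def generate(prompt: str, model: str = "qwen2.5:1.5b", base_url: str = OLLAMA_URL) -> str:
    """Non-streaming call to /api/generate; returns the full completion text."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)["response"]

def join_stream_chunks(lines: list[str]) -> str:
    """With "stream": true, Ollama emits one JSON object per line;
    concatenating their "response" fields yields the full completion."""
    return "".join(json.loads(ln).get("response", "") for ln in lines if ln.strip())
```

For example, `generate("Explain quantum computing in simple terms")` mirrors the curl call above.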

Model Management

Pull a New Model

```bash
ssh seattle-tailscale "docker exec ollama-seattle ollama pull <model-name>"

# Examples:
# docker exec ollama-seattle ollama pull qwen2.5:3b
# docker exec ollama-seattle ollama pull llama3.2:3b
# docker exec ollama-seattle ollama pull mistral:7b
```

List Downloaded Models

```bash
ssh seattle-tailscale "docker exec ollama-seattle ollama list"
```

Remove a Model

```bash
ssh seattle-tailscale "docker exec ollama-seattle ollama rm <model-name>"
```

Integration with Perplexica

To add this Ollama instance as an LLM provider in Perplexica:

  1. Navigate to http://192.168.0.210:4785/settings
  2. Click "Model Providers"
  3. Click "Add Provider"
  4. Configure as follows:
```json
{
  "name": "Ollama Seattle",
  "type": "ollama",
  "baseURL": "http://100.82.197.124:11434",
  "apiKey": ""
}
```

  5. Click "Save"
  6. Select qwen2.5:1.5b from the model dropdown when searching

Benefits of Multiple Ollama Instances

  • Load Distribution: Distribute inference load across multiple servers
  • Redundancy: If one instance is down, use the other
  • Model Variety: Different instances can host different models
  • Network Optimization: Use closest/fastest instance
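
The redundancy point can be implemented client-side with a simple failover probe. A sketch (Python, standard library only; the instance URLs are the two addresses listed in this doc):

```python
import urllib.error
import urllib.request

# The two instances mentioned in this doc: Seattle (Tailscale) and Atlantis (LAN)
INSTANCES = [
    "http://100.82.197.124:11434",
    "http://192.168.0.200:11434",
]

def is_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Probe /api/tags; HTTP 200 means the instance is up."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def pick_instance(instances=INSTANCES, probe=is_healthy):
    """Return the first healthy instance in priority order, or None if all are down."""
    for url in instances:
        if probe(url):
            return url
    return None
```

Ordering the list by preference (closest/fastest first) also covers the network-optimization point.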

Performance

Expected Performance (CPU-Only)

| Model | Size | Tokens/Second | Memory Usage |
| --- | --- | --- | --- |
| qwen2.5:1.5b | 986 MB | 8-12 | ~2-3 GB |
| qwen2.5:3b | ~2 GB | 5-8 | ~4-5 GB |
| llama3.2:3b | ~2 GB | 4-7 | ~4-5 GB |
| mistral:7b | ~4 GB | 2-4 | ~8-10 GB |
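
These rates translate directly into wall-clock latency. A trivial sketch using mid-range figures from the table above (illustrative estimates, not fresh benchmarks):

```python
def estimated_seconds(tokens: int, tokens_per_second: float) -> float:
    """Rough wall-clock time to generate `tokens` at a steady rate."""
    return tokens / tokens_per_second

# Illustrative, using mid-range rates:
# qwen2.5:1.5b at 10 tok/s -> a 300-token answer takes ~30 s
# mistral:7b at 3 tok/s    -> the same answer takes ~100 s
```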

Optimization Tips

  1. Use Smaller Models: 1.5B and 3B models work best on CPU
  2. Limit Parallel Requests: Set OLLAMA_NUM_PARALLEL=2 to avoid overload
  3. Keep Models Loaded: Long OLLAMA_KEEP_ALIVE prevents reload delays
  4. Monitor Memory: Watch RAM usage with docker stats ollama-seattle

Monitoring

Container Status

```bash
# Check if running
ssh seattle-tailscale "docker ps | grep ollama"

# View logs
ssh seattle-tailscale "docker logs -f ollama-seattle"

# Check resource usage
ssh seattle-tailscale "docker stats ollama-seattle"
```

API Health Check

```bash
# Test connectivity
curl -m 5 http://100.82.197.124:11434/api/tags

# Test inference
curl http://100.82.197.124:11434/api/generate -d '{
  "model": "qwen2.5:1.5b",
  "prompt": "test",
  "stream": false
}'
```

Performance Metrics

```bash
# Check response time
time curl -s http://100.82.197.124:11434/api/tags > /dev/null

# Monitor CPU usage
ssh seattle-tailscale "top -b -n 1 | grep ollama"
```

Troubleshooting

Container Won't Start

```bash
# Check logs
ssh seattle-tailscale "docker logs ollama-seattle"

# Common issues:
# - Port 11434 already in use
# - Insufficient memory
# - Volume mount permissions
```

Slow Inference

Causes:

  • Model too large for available CPU
  • Too many parallel requests
  • Insufficient RAM

Solutions:

```bash
# Use a smaller model
docker exec ollama-seattle ollama pull qwen2.5:1.5b

# Reduce parallel requests
# Edit ollama.yaml: OLLAMA_NUM_PARALLEL=1

# Increase CPU allocation
# Edit ollama.yaml: cpus: '16'
```

Connection Timeout

Problem: Unable to reach Ollama from other machines

Solutions:

  1. Verify Tailscale connection:

     ```bash
     ping 100.82.197.124
     tailscale status | grep seattle
     ```

  2. Check that port 11434 is listening on the host:

     ```bash
     ssh seattle-tailscale "ss -tlnp | grep 11434"
     ```

  3. Verify the container itself is listening:

     ```bash
     ssh seattle-tailscale "docker exec ollama-seattle netstat -tlnp"
     ```
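
To separate network or firewall problems from errors inside Ollama itself, a plain TCP probe is often the quickest first check. A small Python sketch:

```python
import socket

def port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """TCP-level reachability check: if this fails, the problem is the
    network path or firewall, not the Ollama application."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example usage: port_open("100.82.197.124", 11434)
```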
    

Model Download Fails

```bash
# Check available disk space
ssh seattle-tailscale "df -h"

# Check internet connectivity
ssh seattle-tailscale "curl -I https://ollama.com"

# Try manual download
ssh seattle-tailscale "docker exec -it ollama-seattle ollama pull <model>"
```

Maintenance

Updates

```bash
# Pull latest Ollama image
ssh seattle-tailscale "docker pull ollama/ollama:latest"

# Recreate container
ssh seattle-tailscale "cd /opt/ollama && docker compose up -d --force-recreate"
```

Backup

```bash
# Backup models and configuration
# Note: \$(pwd) is escaped so it expands on the remote host, not locally
ssh seattle-tailscale "docker run --rm -v ollama-seattle-data:/data -v \$(pwd):/backup alpine tar czf /backup/ollama-backup.tar.gz /data"

# Restore
ssh seattle-tailscale "docker run --rm -v ollama-seattle-data:/data -v \$(pwd):/backup alpine tar xzf /backup/ollama-backup.tar.gz -C /"
```

Cleanup

```bash
# Remove unused models
ssh seattle-tailscale "docker exec ollama-seattle ollama list"
ssh seattle-tailscale "docker exec ollama-seattle ollama rm <unused-model>"

# Clean up Docker
ssh seattle-tailscale "docker system prune -f"
```

Security Considerations

Network Access

  • Ollama is exposed on port 11434
  • Only accessible via Tailscale (100.82.197.124)
  • Not exposed to public internet
  • Consider adding authentication if exposing publicly

API Security

Ollama doesn't have built-in authentication. For production use:

  1. Use a reverse proxy with authentication (Nginx, Caddy)
  2. Restrict access via firewall rules
  3. Use Tailscale ACLs to limit access
  4. Monitor usage for abuse

Cost Analysis

Contabo VPS Costs

  • Monthly Cost: ~$25-35 USD
  • Inference Cost: $0 (self-hosted)
  • vs Cloud APIs: OpenAI costs ~$0.15-0.60 per 1M tokens

Break-even Analysis

The break-even point depends heavily on which cloud model you compare against:

  • Against frontier-model pricing (several dollars per 1M tokens), break-even arrives at only a few million tokens/month
  • Against the budget rates above (~$0.15-0.60 per 1M tokens), break-even requires roughly 50-200M tokens/month at ~$30/month
  • Below those volumes, cloud APIs are cheaper in raw dollars
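
The arithmetic behind the comparison is a one-liner; the frontier-model price below is an illustrative assumption, not a quote:

```python
def breakeven_tokens_millions(monthly_cost_usd: float, cloud_price_per_mtok: float) -> float:
    """Monthly token volume (in millions) at which self-hosting matches
    a cloud API billed per million tokens."""
    return monthly_cost_usd / cloud_price_per_mtok

# At ~$30/month for the VPS:
# vs a $0.60/1M-token budget model -> break-even near 50M tokens/month
# vs a $10/1M-token frontier model -> break-even near 3M tokens/month
#   ($10/1M is an illustrative price, not a quote)
```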

Future Enhancements

Potential Improvements

  1. GPU Support: Migrate to GPU-enabled VPS for faster inference
  2. Load Balancer: Set up Nginx to load balance between Ollama instances
  3. Auto-scaling: Deploy additional instances based on load
  4. Model Caching: Pre-warm multiple models for faster switching
  5. Monitoring Dashboard: Grafana + Prometheus for metrics
  6. API Gateway: Add rate limiting and authentication

Model Recommendations

For different use cases on CPU:

  • Fast responses: qwen2.5:1.5b, phi3:3.8b
  • Better quality: qwen2.5:3b, llama3.2:3b
  • Code tasks: qwen2.5-coder:1.5b, codegemma:2b
  • Instruction following: mistral:7b (slower but better)

Related Services

  • Atlantis Ollama (192.168.0.200:11434) - Main Ollama instance
  • Perplexica (192.168.0.210:4785) - AI search engine client
  • LM Studio (100.98.93.15:1234) - Alternative LLM server

Status: Fully operational
Last Updated: February 16, 2026
Maintained By: Docker Compose (manual)