# Ollama on Seattle - Local LLM Inference Server

## Overview
| Setting | Value |
|---|---|
| Host | Seattle VM (Contabo VPS) |
| Port | 11434 (Ollama API) |
| Image | ollama/ollama:latest |
| API | http://100.82.197.124:11434 (Tailscale) |
| Stack File | hosts/vms/seattle/ollama.yaml |
| Data Volume | ollama-seattle-data |
## Why Ollama on Seattle?

Ollama was deployed on Seattle to provide:
- CPU-Only Inference: Ollama is optimized for CPU inference, unlike vLLM which requires GPU
- Additional Capacity: Supplements the main Ollama instance on Atlantis (192.168.0.200)
- Geographic Distribution: Runs on a Contabo VPS, providing inference capability outside the local network
- Integration with Perplexica: Can be added as an additional LLM provider for redundancy
## Specifications

### Hardware
- CPU: 16 vCPU AMD EPYC Processor
- RAM: 64GB
- Storage: 300GB SSD
- Location: Contabo Data Center
- Network: Tailscale VPN (100.82.197.124)
### Resource Allocation

```yaml
limits:
  cpus: '12'
  memory: 32G
reservations:
  cpus: '4'
  memory: 8G
```
## Installed Models

### Qwen 2.5 1.5B Instruct

- Model ID: `qwen2.5:1.5b`
- Size: ~986 MB
- Context Window: 32K tokens
- Use Case: Fast, lightweight inference for search queries
- Performance: Excellent on CPU, ~8-12 tokens/second
## Installation History

### February 16, 2026 - Initial Setup
Problem: Attempted to use vLLM for CPU inference
- vLLM container crashed with device detection errors
- vLLM is primarily designed for GPU inference
- CPU mode is not well-supported in recent vLLM versions
Solution: Switched to Ollama
- Ollama is specifically optimized for CPU inference
- Provides better performance and reliability on CPU-only systems
- Simpler configuration and management
- Native support for multiple model formats
Deployment Steps:

1. Removed failing vLLM container
2. Created `ollama.yaml` docker-compose configuration
3. Deployed Ollama container
4. Pulled `qwen2.5:1.5b` model
5. Tested API connectivity via Tailscale
## Configuration

### Docker Compose

See `hosts/vms/seattle/ollama.yaml`:

```yaml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama-seattle
    ports:
      - "11434:11434"
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
      - OLLAMA_KEEP_ALIVE=24h
      - OLLAMA_NUM_PARALLEL=2
      - OLLAMA_MAX_LOADED_MODELS=2
    volumes:
      # Named volume matches the data volume listed in the overview table
      - ollama-seattle-data:/root/.ollama
    restart: unless-stopped

volumes:
  ollama-seattle-data:
```
### Environment Variables

- `OLLAMA_HOST`: Bind to all interfaces
- `OLLAMA_KEEP_ALIVE`: Keep models loaded for 24 hours
- `OLLAMA_NUM_PARALLEL`: Allow 2 parallel requests
- `OLLAMA_MAX_LOADED_MODELS`: Cache up to 2 models in memory
## Usage

### API Endpoints

#### List Models

```bash
curl http://100.82.197.124:11434/api/tags
```

#### Generate Completion

```bash
curl http://100.82.197.124:11434/api/generate -d '{
  "model": "qwen2.5:1.5b",
  "prompt": "Explain quantum computing in simple terms"
}'
```

#### Chat Completion

```bash
curl http://100.82.197.124:11434/api/chat -d '{
  "model": "qwen2.5:1.5b",
  "messages": [
    {"role": "user", "content": "Hello!"}
  ]
}'
```
### Model Management

#### Pull a New Model

```bash
ssh seattle-tailscale "docker exec ollama-seattle ollama pull <model-name>"

# Examples:
# docker exec ollama-seattle ollama pull qwen2.5:3b
# docker exec ollama-seattle ollama pull llama3.2:3b
# docker exec ollama-seattle ollama pull mistral:7b
```

#### List Downloaded Models

```bash
ssh seattle-tailscale "docker exec ollama-seattle ollama list"
```

#### Remove a Model

```bash
ssh seattle-tailscale "docker exec ollama-seattle ollama rm <model-name>"
```
## Integration with Perplexica

To add this Ollama instance as an LLM provider in Perplexica:

1. Navigate to http://192.168.0.210:4785/settings
2. Click "Model Providers"
3. Click "Add Provider"
4. Configure as follows:

```json
{
  "name": "Ollama Seattle",
  "type": "ollama",
  "baseURL": "http://100.82.197.124:11434",
  "apiKey": ""
}
```

5. Click "Save"
6. Select `qwen2.5:1.5b` from the model dropdown when searching
## Benefits of Multiple Ollama Instances
- Load Distribution: Distribute inference load across multiple servers
- Redundancy: If one instance is down, use the other
- Model Variety: Different instances can host different models
- Network Optimization: Use closest/fastest instance
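The redundancy point above can be sketched as client-side failover: probe each endpoint's `/api/tags` in preference order and use the first that answers. A minimal sketch (the endpoint ordering and timeout are illustrative choices, not part of the deployed setup):

```python
import urllib.request

# Ordered by preference: LAN instance first, Seattle over Tailscale as fallback.
ENDPOINTS = [
    "http://192.168.0.200:11434",   # Atlantis (LAN)
    "http://100.82.197.124:11434",  # Seattle (Tailscale)
]

def is_up(base_url, timeout=3):
    """Liveness probe: /api/tags answers 200 when Ollama is serving."""
    try:
        with urllib.request.urlopen(base_url + "/api/tags", timeout=timeout) as r:
            return r.status == 200
    except OSError:
        return False

def first_available(endpoints, probe=is_up):
    """Return the first endpoint whose probe succeeds, or None if all are down."""
    for url in endpoints:
        if probe(url):
            return url
    return None
```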
## Performance

### Expected Performance (CPU-Only)
| Model | Size | Tokens/Second | Memory Usage |
|---|---|---|---|
| qwen2.5:1.5b | 986 MB | 8-12 | ~2-3 GB |
| qwen2.5:3b | ~2 GB | 5-8 | ~4-5 GB |
| llama3.2:3b | ~2 GB | 4-7 | ~4-5 GB |
| mistral:7b | ~4 GB | 2-4 | ~8-10 GB |
### Optimization Tips

- Use Smaller Models: 1.5B and 3B models work best on CPU
- Limit Parallel Requests: Set `OLLAMA_NUM_PARALLEL=2` to avoid overload
- Keep Models Loaded: A long `OLLAMA_KEEP_ALIVE` prevents reload delays
- Monitor Memory: Watch RAM usage with `docker stats ollama-seattle`
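To sanity-check the Monitor Memory tip against `OLLAMA_MAX_LOADED_MODELS=2`, a back-of-the-envelope helper using the upper-bound footprints from the table above (estimates, not measurements; `fits` assumes the 32G compose memory limit):

```python
# Upper-bound resident footprints per model, from the performance table (GB).
FOOTPRINT_GB = {"qwen2.5:1.5b": 3, "qwen2.5:3b": 5, "llama3.2:3b": 5, "mistral:7b": 10}

def worst_case_ram(loaded_models):
    """Worst-case RAM if all of these models are cached at once."""
    return sum(FOOTPRINT_GB[m] for m in loaded_models)

def fits(loaded_models, limit_gb=32):
    """True if the cached set stays under the container memory limit."""
    return worst_case_ram(loaded_models) <= limit_gb
```

For example, keeping `qwen2.5:1.5b` and `mistral:7b` loaded together stays well under the limit, so two cached models are safe with this model mix.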
## Monitoring

### Container Status

```bash
# Check if running
ssh seattle-tailscale "docker ps | grep ollama"

# View logs
ssh seattle-tailscale "docker logs -f ollama-seattle"

# Check resource usage
ssh seattle-tailscale "docker stats ollama-seattle"
```

### API Health Check

```bash
# Test connectivity
curl -m 5 http://100.82.197.124:11434/api/tags

# Test inference
curl http://100.82.197.124:11434/api/generate -d '{
  "model": "qwen2.5:1.5b",
  "prompt": "test",
  "stream": false
}'
```
### Performance Metrics

```bash
# Check response time
time curl -s http://100.82.197.124:11434/api/tags > /dev/null

# Monitor CPU usage
ssh seattle-tailscale "top -b -n 1 | grep ollama"
```
## Troubleshooting

### Container Won't Start

```bash
# Check logs
ssh seattle-tailscale "docker logs ollama-seattle"

# Common issues:
# - Port 11434 already in use
# - Insufficient memory
# - Volume mount permissions
```

### Slow Inference

Causes:

- Model too large for available CPU
- Too many parallel requests
- Insufficient RAM

Solutions:

```bash
# Use a smaller model
docker exec ollama-seattle ollama pull qwen2.5:1.5b

# Reduce parallel requests
# Edit ollama.yaml: OLLAMA_NUM_PARALLEL=1

# Increase CPU allocation
# Edit ollama.yaml: cpus: '16'
```
### Connection Timeout

Problem: Unable to reach Ollama from other machines

Solutions:

1. Verify Tailscale connection:

   ```bash
   ping 100.82.197.124
   tailscale status | grep seattle
   ```

2. Check firewall:

   ```bash
   ssh seattle-tailscale "ss -tlnp | grep 11434"
   ```

3. Verify container is listening:

   ```bash
   ssh seattle-tailscale "docker exec ollama-seattle netstat -tlnp"
   ```
### Model Download Fails

```bash
# Check available disk space
ssh seattle-tailscale "df -h"

# Check internet connectivity
ssh seattle-tailscale "curl -I https://ollama.com"

# Try manual download
ssh seattle-tailscale "docker exec -it ollama-seattle ollama pull <model>"
```
## Maintenance

### Updates

```bash
# Pull latest Ollama image
ssh seattle-tailscale "docker pull ollama/ollama:latest"

# Recreate container
ssh seattle-tailscale "cd /opt/ollama && docker compose up -d --force-recreate"
```

### Backup

```bash
# Backup models and configuration (escape \$(pwd) so it expands on seattle,
# not on the local machine)
ssh seattle-tailscale "docker run --rm -v ollama-seattle-data:/data -v \$(pwd):/backup alpine tar czf /backup/ollama-backup.tar.gz /data"

# Restore
ssh seattle-tailscale "docker run --rm -v ollama-seattle-data:/data -v \$(pwd):/backup alpine tar xzf /backup/ollama-backup.tar.gz -C /"
```

### Cleanup

```bash
# Remove unused models
ssh seattle-tailscale "docker exec ollama-seattle ollama list"
ssh seattle-tailscale "docker exec ollama-seattle ollama rm <unused-model>"

# Clean up Docker
ssh seattle-tailscale "docker system prune -f"
```
## Security Considerations

### Network Access
- Ollama is exposed on port 11434
- Only accessible via Tailscale (100.82.197.124)
- Not exposed to public internet
- Consider adding authentication if exposing publicly
### API Security
Ollama doesn't have built-in authentication. For production use:
- Use a reverse proxy with authentication (Nginx, Caddy)
- Restrict access via firewall rules
- Use Tailscale ACLs to limit access
- Monitor usage for abuse
## Cost Analysis

### Contabo VPS Costs
- Monthly Cost: ~$25-35 USD
- Inference Cost: $0 (self-hosted)
- vs Cloud APIs: OpenAI costs ~$0.15-0.60 per 1M tokens
### Break-even Analysis
- Light usage (<1M tokens/month): Cloud APIs cheaper
- Medium usage (1-10M tokens/month): Self-hosted breaks even
- Heavy usage (>10M tokens/month): Self-hosted much cheaper
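One way to make the break-even arithmetic explicit is the sketch below. The `inference_share` parameter is an assumption introduced here: it models the fact that the Seattle VM also runs other services, so only part of its monthly bill counts against inference (with the full bill attributed, break-even at the high end of the cloud range sits around 50M tokens/month).

```python
def break_even_tokens_m(monthly_vps_usd, cloud_usd_per_m, inference_share=1.0):
    """Monthly volume (millions of tokens) where self-hosting matches cloud cost.

    inference_share: fraction of the VPS bill attributed to inference
    (assumption - the VM runs other services too; adjust to taste).
    """
    return monthly_vps_usd * inference_share / cloud_usd_per_m

# Full $30/mo attributed, at the high end of the cloud range ($0.60 per 1M tokens):
full = break_even_tokens_m(30, 0.60)
# If only ~20% of the bill counts toward inference:
shared = break_even_tokens_m(30, 0.60, inference_share=0.2)
```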
## Future Enhancements

### Potential Improvements
- GPU Support: Migrate to GPU-enabled VPS for faster inference
- Load Balancer: Set up Nginx to load balance between Ollama instances
- Auto-scaling: Deploy additional instances based on load
- Model Caching: Pre-warm multiple models for faster switching
- Monitoring Dashboard: Grafana + Prometheus for metrics
- API Gateway: Add rate limiting and authentication
### Model Recommendations
For different use cases on CPU:
- Fast responses: qwen2.5:1.5b, phi3:3.8b
- Better quality: qwen2.5:3b, llama3.2:3b
- Code tasks: qwen2.5-coder:1.5b, codegemma:2b
- Instruction following: mistral:7b (slower but better)
## Related Services

- Atlantis Ollama (`192.168.0.200:11434`) - Main Ollama instance
- Perplexica (`192.168.0.210:4785`) - AI search engine client
- LM Studio (`100.98.93.15:1234`) - Alternative LLM server
Status: ✅ Fully operational
Last Updated: February 16, 2026
Maintained By: Docker Compose (manual)