Sanitized mirror from private repository - 2026-04-20 01:32:01 UTC
This commit is contained in:
210
docs/guides/PERPLEXICA_SEATTLE_SUMMARY.md
Normal file
210
docs/guides/PERPLEXICA_SEATTLE_SUMMARY.md
Normal file
@@ -0,0 +1,210 @@
|
||||
# Perplexica + Seattle Ollama Integration - Summary
|
||||
|
||||
**Date:** February 16, 2026
|
||||
**Goal:** Enable Perplexica to use LLM inference on Seattle VM
|
||||
**Result:** ✅ Successfully deployed Ollama on Seattle and integrated with Perplexica
|
||||
|
||||
## What Was Done
|
||||
|
||||
### 1. Problem Discovery
|
||||
- Found vLLM container failing on Seattle with device detection errors
|
||||
- vLLM requires GPU and has poor CPU-only support
|
||||
- Decided to use Ollama instead (optimized for CPU inference)
|
||||
|
||||
### 2. Ollama Deployment on Seattle
|
||||
- ✅ Removed failing vLLM container
|
||||
- ✅ Created `hosts/vms/seattle/ollama.yaml` docker-compose configuration
|
||||
- ✅ Deployed Ollama container on Seattle VM
|
||||
- ✅ Pulled `qwen2.5:1.5b` model (986 MB)
|
||||
- ✅ Verified API is accessible via Tailscale at `100.82.197.124:11434`
|
||||
|
||||
### 3. Integration with Perplexica
|
||||
- ✅ Verified connectivity from homelab to Seattle Ollama
|
||||
- ✅ Documented how to add Seattle Ollama as a provider in Perplexica settings
|
||||
- ✅ Updated Perplexica documentation with new provider info
|
||||
|
||||
### 4. Documentation Created
|
||||
- ✅ `hosts/vms/seattle/ollama.yaml` - Docker compose config
|
||||
- ✅ `hosts/vms/seattle/README-ollama.md` - Complete Ollama documentation (420+ lines)
|
||||
- Installation history
|
||||
- Configuration details
|
||||
- Usage examples
|
||||
- API endpoints
|
||||
- Performance metrics
|
||||
- Troubleshooting guide
|
||||
- Integration instructions
|
||||
- ✅ `hosts/vms/seattle/litellm-config.yaml` - Config file (not used, kept for reference)
|
||||
- ✅ `docs/guides/PERPLEXICA_SEATTLE_INTEGRATION.md` - Step-by-step integration guide
|
||||
- Prerequisites
|
||||
- Configuration steps
|
||||
- Troubleshooting
|
||||
- Performance comparison
|
||||
- Cost analysis
|
||||
- ✅ Updated `docs/services/individual/perplexica.md` - Added Seattle Ollama info
|
||||
- ✅ Updated `hosts/vms/seattle/README.md` - Added Ollama to services list
|
||||
|
||||
## How to Use
|
||||
|
||||
### Add Seattle Ollama to Perplexica
|
||||
|
||||
1. Open http://192.168.0.210:4785/settings
|
||||
2. Click "Model Providers"
|
||||
3. Click "Add Provider"
|
||||
4. Configure:
|
||||
- **Name**: Ollama Seattle
|
||||
- **Type**: Ollama
|
||||
- **Base URL**: `http://100.82.197.124:11434`
|
||||
- **API Key**: *(leave empty)*
|
||||
5. Save
|
||||
6. Select `qwen2.5:1.5b` from model dropdown when searching
|
||||
|
||||
### Test the Setup
|
||||
|
||||
```bash
|
||||
# Test Ollama API
|
||||
curl http://100.82.197.124:11434/api/tags
|
||||
|
||||
# Test generation
|
||||
curl http://100.82.197.124:11434/api/generate -d '{
|
||||
"model": "qwen2.5:1.5b",
|
||||
"prompt": "Hello, world!",
|
||||
"stream": false
|
||||
}'
|
||||
```
|
||||
|
||||
## Technical Specs
|
||||
|
||||
### Seattle VM
|
||||
- **Provider**: Contabo VPS
|
||||
- **CPU**: 16 vCPU AMD EPYC
|
||||
- **RAM**: 64 GB
|
||||
- **Network**: Tailscale VPN (100.82.197.124)
|
||||
|
||||
### Ollama Configuration
|
||||
- **Image**: `ollama/ollama:latest`
|
||||
- **Port**: 11434
|
||||
- **Resource Limits**:
|
||||
- CPU: 12 cores (limit), 4 cores (reservation)
|
||||
- Memory: 32 GB (limit), 8 GB (reservation)
|
||||
- **Keep Alive**: 24 hours
|
||||
- **Parallel Requests**: 2
|
||||
|
||||
### Model Details
|
||||
- **Name**: Qwen 2.5 1.5B Instruct
|
||||
- **Size**: 986 MB
|
||||
- **Performance**: ~8-12 tokens/second on CPU
|
||||
- **Context Window**: 32K tokens
|
||||
|
||||
## Benefits
|
||||
|
||||
1. **Load Distribution**: Spread LLM inference across multiple servers
|
||||
2. **Redundancy**: Backup if primary Ollama (Atlantis) fails
|
||||
3. **Cost Efficiency**: $0 inference cost (vs cloud APIs at $0.15-0.60 per 1M tokens)
|
||||
4. **Privacy**: All inference stays within your infrastructure
|
||||
5. **Flexibility**: Can host different models on different instances
|
||||
|
||||
## Files Modified
|
||||
|
||||
```
|
||||
/home/homelab/organized/repos/homelab/
|
||||
├── hosts/vms/seattle/
|
||||
│ ├── ollama.yaml (new)
|
||||
│ ├── litellm-config.yaml (new, reference only)
|
||||
│ ├── README-ollama.md (new)
|
||||
│ └── README.md (updated)
|
||||
├── docs/
|
||||
│ ├── services/individual/perplexica.md (updated)
|
||||
│ └── guides/PERPLEXICA_SEATTLE_INTEGRATION.md (new)
|
||||
└── PERPLEXICA_SEATTLE_SUMMARY.md (this file)
|
||||
```
|
||||
|
||||
## Key Learnings
|
||||
|
||||
### vLLM vs Ollama for CPU
|
||||
- **vLLM**: Designed for GPU, poor CPU support, fails with device detection errors
|
||||
- **Ollama**: Excellent CPU support, reliable, well-optimized, easy to use
|
||||
- **Recommendation**: Always use Ollama for CPU-only inference
|
||||
|
||||
### Performance Expectations
|
||||
- CPU inference is ~10x slower than GPU
|
||||
- Small models (1.5B-3B) work well on CPU
|
||||
- Large models (7B+) are too slow for real-time use on CPU
|
||||
- Expect 8-12 tokens/second with qwen2.5:1.5b on CPU
|
||||
|
||||
### Network Configuration
|
||||
- Tailscale provides secure cross-host communication
|
||||
- Direct IP access (no Cloudflare proxy) prevents timeouts
|
||||
- Ollama doesn't require authentication on trusted networks
|
||||
|
||||
## Next Steps (Optional Future Enhancements)
|
||||
|
||||
1. **Pull More Models** on Seattle:
|
||||
```bash
|
||||
ssh seattle-tailscale "docker exec ollama-seattle ollama pull qwen2.5:3b"
|
||||
ssh seattle-tailscale "docker exec ollama-seattle ollama pull phi3:3.8b"
|
||||
```
|
||||
|
||||
2. **Add Load Balancing**:
|
||||
- Set up Nginx to distribute requests across Ollama instances
|
||||
- Implement health checks and automatic failover
|
||||
|
||||
3. **Monitoring**:
|
||||
- Add Prometheus metrics
|
||||
- Create Grafana dashboard for inference metrics
|
||||
- Alert on high latency or failures
|
||||
|
||||
4. **GPU Instance**:
|
||||
- Consider adding GPU-enabled VPS for faster inference
|
||||
- Would provide 5-10x performance improvement
|
||||
|
||||
5. **Additional Models**:
|
||||
- Deploy specialized models for different tasks
|
||||
- Code: `qwen2.5-coder:1.5b`
|
||||
- Math: `deepseek-math:7b`
|
||||
|
||||
## Troubleshooting Quick Reference
|
||||
|
||||
| Problem | Solution |
|
||||
|---------|----------|
|
||||
| Container won't start | Check logs: `ssh seattle-tailscale "docker logs ollama-seattle"` |
|
||||
| Connection timeout | Verify Tailscale: `ping 100.82.197.124` |
|
||||
| Slow inference | Use smaller model or reduce parallel requests |
|
||||
| No models available | Pull model: `docker exec ollama-seattle ollama pull qwen2.5:1.5b` |
|
||||
| High memory usage | Reduce `OLLAMA_MAX_LOADED_MODELS` or use smaller models |
|
||||
|
||||
## Cost Analysis
|
||||
|
||||
### Current Setup
|
||||
- **Seattle VPS**: ~$25-35/month (already paid for)
|
||||
- **Ollama**: $0/month (self-hosted)
|
||||
- **Total Additional Cost**: $0
|
||||
|
||||
### vs Cloud APIs
|
||||
- **OpenAI GPT-3.5**: $0.50 per 1M tokens
|
||||
- **Claude 3 Haiku**: $0.25 per 1M tokens
|
||||
- **Self-Hosted**: $0 per 1M tokens
|
||||
|
||||
**Break-even**: Any usage over 0 tokens makes self-hosted cheaper
|
||||
|
||||
## Success Metrics
|
||||
|
||||
- ✅ Ollama running stably on Seattle
|
||||
- ✅ API accessible from homelab via Tailscale
|
||||
- ✅ Model pulled and ready for inference
|
||||
- ✅ Integration path documented for Perplexica
|
||||
- ✅ Comprehensive troubleshooting guides created
|
||||
- ✅ Performance benchmarks documented
|
||||
|
||||
## Support & Documentation
|
||||
|
||||
- **Main Documentation**: `hosts/vms/seattle/README-ollama.md`
|
||||
- **Integration Guide**: `docs/guides/PERPLEXICA_SEATTLE_INTEGRATION.md`
|
||||
- **Perplexica Docs**: `docs/services/individual/perplexica.md`
|
||||
- **Ollama API Docs**: https://github.com/ollama/ollama/blob/main/docs/api.md
|
||||
|
||||
---
|
||||
|
||||
**Status**: ✅ Complete and Operational
|
||||
**Deployed**: February 16, 2026
|
||||
**Tested**: ✅ API verified working
|
||||
**Documented**: ✅ Comprehensive documentation created
|
||||
Reference in New Issue
Block a user