419 lines
17 KiB
Markdown
419 lines
17 KiB
Markdown
# Homelab Ansible Automation Suite
|
|
|
|
Comprehensive infrastructure management and monitoring for distributed homelab network with **200+ containers** across **10+ hosts** and **100+ services**.
|
|
|
|
**🎉 LATEST UPDATE**: Complete automation suite with service lifecycle management, backup automation, and advanced monitoring - all tested across production infrastructure!
|
|
|
|
## 🚀 Quick Start
|
|
|
|
```bash
|
|
# Change to automation directory
|
|
cd /home/homelab/organized/repos/homelab/ansible/automation
|
|
|
|
# 🆕 PRODUCTION-READY AUTOMATION SUITE
|
|
ansible-playbook -i hosts.ini playbooks/health_check.yml # Comprehensive health monitoring
|
|
ansible-playbook -i hosts.ini playbooks/service_status.yml # Multi-system service status
|
|
ansible-playbook -i hosts.ini playbooks/system_metrics.yml # Real-time metrics collection
|
|
ansible-playbook -i hosts.ini playbooks/alert_check.yml # Infrastructure alerting
|
|
|
|
# Service lifecycle management
|
|
ansible-playbook -i hosts.ini playbooks/restart_service.yml -e "service_name=docker"
|
|
ansible-playbook -i hosts.ini playbooks/container_logs.yml
|
|
|
|
# Backup automation
|
|
ansible-playbook -i hosts.ini playbooks/backup_configs.yml
|
|
ansible-playbook -i hosts.ini playbooks/backup_databases.yml
|
|
```
|
|
|
|
## 📊 Infrastructure Overview
|
|
|
|
### Tailscale Network
|
|
- **28 total devices** in tailnet
|
|
- **12 active devices** online
|
|
- All critical infrastructure accessible via SSH
|
|
|
|
### Core Systems
|
|
|
|
#### Production Hosts
|
|
- **homelab** (Ubuntu 24.04): Main Docker host
|
|
- **pi-5** (Debian 12.13): Raspberry Pi services
|
|
- **vish-concord-nuc** (Ubuntu 24.04): Remote services
|
|
- **truenas-scale** (Debian 12.9): Storage and apps
|
|
- **homeassistant** (Alpine container): Home automation
|
|
|
|
#### Synology NAS Cluster
|
|
- **atlantis** (100.83.230.112): Primary NAS, DSM 7.3.2
|
|
- **calypso** (100.103.48.78): APT cache server, DSM 7.3.2
|
|
- **setillo** (100.125.0.20): Backup NAS, DSM 7.3.2
|
|
|
|
#### Infrastructure Services
|
|
- **pve** (Proxmox): Virtualization host
|
|
- **APT Proxy**: calypso (100.103.48.78:3142) running apt-cacher-ng
|
|
|
|
## 📚 Complete Playbook Reference
|
|
|
|
### 🚀 **NEW** Production-Ready Automation Suite (8 playbooks)
|
|
| Playbook | Purpose | Status | Multi-System |
|
|
|----------|---------|--------|--------------|
|
|
| **`health_check.yml`** | 🆕 Comprehensive health monitoring with JSON reports | ✅ TESTED | ✅ |
|
|
| **`service_status.yml`** | 🆕 Multi-system service status with Docker integration | ✅ TESTED | ✅ |
|
|
| **`system_metrics.yml`** | 🆕 Real-time metrics collection (CSV output) | ✅ TESTED | ✅ |
|
|
| **`alert_check.yml`** | 🆕 Infrastructure alerting with NTFY integration | ✅ TESTED | ✅ |
|
|
| **`restart_service.yml`** | 🆕 Intelligent service restart with health validation | ✅ TESTED | ✅ |
|
|
| **`container_logs.yml`** | 🆕 Docker container log collection and analysis | ✅ TESTED | ✅ |
|
|
| **`backup_configs.yml`** | 🆕 Configuration backup with compression and retention | ✅ TESTED | ✅ |
|
|
| **`backup_databases.yml`** | 🆕 Multi-database backup automation | ✅ TESTED | ✅ |
|
|
|
|
### 🏥 Health & Monitoring (9 playbooks)
|
|
| Playbook | Purpose | Frequency | Multi-System |
|
|
|----------|---------|-----------|--------------|
|
|
| **`health_check.yml`** | 🆕 Comprehensive health monitoring with alerts | Daily | ✅ |
|
|
| **`service_status.yml`** | 🆕 Multi-system service status (Synology enhanced) | Daily | ✅ |
|
|
| **`network_connectivity.yml`** | 🆕 Full mesh Tailscale + SSH + HTTP endpoint health | Daily | ✅ |
|
|
| **`ntp_check.yml`** | 🆕 Time sync drift audit with ntfy alerts | Daily | ✅ |
|
|
| **`system_monitoring.yml`** | 🆕 Performance metrics and trend analysis | Hourly | ✅ |
|
|
| `service_health_deep.yml` | Deep service health analysis | Weekly | ✅ |
|
|
| `synology_health.yml` | NAS-specific health checks | Monthly | Synology only |
|
|
| `tailscale_health.yml` | Network connectivity testing | As needed | ✅ |
|
|
| `system_info.yml` | System information gathering | As needed | ✅ |
|
|
|
|
### 🔧 Service Management (2 playbooks)
|
|
| Playbook | Purpose | Usage | Multi-System |
|
|
|----------|---------|-------|--------------|
|
|
| **`restart_service.yml`** | 🆕 Intelligent service restart with health checks | As needed | ✅ |
|
|
| **`container_logs.yml`** | 🆕 Docker container log collection and analysis | Troubleshooting | ✅ |
|
|
|
|
### 💾 Backup & Recovery (3 playbooks)
|
|
| Playbook | Purpose | Usage | Multi-System |
|
|
|----------|---------|-------|--------------|
|
|
| **`backup_databases.yml`** | 🆕 Multi-database backup (MySQL, PostgreSQL, MongoDB, Redis) | Daily | ✅ |
|
|
| **`backup_configs.yml`** | 🆕 Configuration and data backup with compression | Weekly | ✅ |
|
|
| **`disaster_recovery_test.yml`** | 🆕 Automated DR testing and validation | Monthly | ✅ |
|
|
|
|
### 🗄️ Storage Management (3 playbooks)
|
|
| Playbook | Purpose | Usage | Multi-System |
|
|
|----------|---------|-------|--------------|
|
|
| **`disk_usage_report.yml`** | 🆕 Storage monitoring with alerts | Weekly | ✅ |
|
|
| **`prune_containers.yml`** | 🆕 Docker cleanup and optimization | Monthly | ✅ |
|
|
| **`log_rotation.yml`** | 🆕 Log management and cleanup | Weekly | ✅ |
|
|
|
|
### 🔒 Security & Maintenance (5 playbooks)
|
|
| Playbook | Purpose | Usage | Multi-System |
|
|
|----------|---------|-------|--------------|
|
|
| **`security_audit.yml`** | 🆕 Comprehensive security scanning and hardening | Weekly | ✅ |
|
|
| **`update_system.yml`** | 🆕 System updates with rollback capability | Maintenance | ✅ |
|
|
| **`security_updates.yml`** | Automated security patches | Weekly | ✅ |
|
|
| **`certificate_renewal.yml`** | 🆕 SSL certificate management | Monthly | ✅ |
|
|
| **`cron_audit.yml`** | 🆕 Scheduled task inventory + world-writable security flags | Monthly | ✅ |
|
|
|
|
### ⚙️ Configuration Management (5 playbooks)
|
|
| Playbook | Purpose | Usage | Multi-System |
|
|
|----------|---------|-------|--------------|
|
|
| `configure_apt_proxy.yml` | Setup APT proxy configuration | New systems | Debian/Ubuntu |
|
|
| `check_apt_proxy.yml` | APT proxy monitoring | Weekly | Debian/Ubuntu |
|
|
| `add_ssh_keys.yml` | SSH key management | Access control | ✅ |
|
|
| `install_tools.yml` | Essential tool installation | Setup | ✅ |
|
|
| `cleanup.yml` | System cleanup and maintenance | Monthly | ✅ |
|
|
|
|
### 🔄 System Updates (3 playbooks)
|
|
| Playbook | Purpose | Usage | Multi-System |
|
|
|----------|---------|-------|--------------|
|
|
| `update_ansible.yml` | Ansible system updates | Maintenance | ✅ |
|
|
| `update_ansible_targeted.yml` | Targeted Ansible updates | Specific hosts | ✅ |
|
|
| `ansible_status_check.yml` | Ansible connectivity verification | Troubleshooting | ✅ |
|
|
|
|
### 🚀 **NEW** Advanced Container Management (6 playbooks)
|
|
| Playbook | Purpose | Usage | Multi-System |
|
|
|----------|---------|-------|--------------|
|
|
| **`container_dependency_map.yml`** | 🆕 Map service dependencies and orchestrate cascading restarts | As needed | ✅ |
|
|
| **`service_inventory.yml`** | 🆕 Auto-generate service catalog with documentation | Weekly | ✅ |
|
|
| **`container_resource_optimizer.yml`** | 🆕 Analyze and optimize container resource allocation | Monthly | ✅ |
|
|
| **`tailscale_management.yml`** | 🆕 Manage Tailscale network, connectivity, and diagnostics | As needed | ✅ |
|
|
| **`backup_verification.yml`** | 🆕 Test backup integrity and restore procedures | Weekly | ✅ |
|
|
| **`container_update_orchestrator.yml`** | 🆕 Coordinated container updates with rollback capability | Maintenance | ✅ |
|
|
|
|
### 🖥️ Platform Management (3 playbooks)
|
|
| Playbook | Purpose | Usage | Multi-System |
|
|
|----------|---------|-------|--------------|
|
|
| `synology_health.yml` | Synology NAS health (DSM, RAID, Tailscale) | Monthly | Synology only |
|
|
| **`proxmox_management.yml`** | 🆕 PVE VM/LXC inventory, storage pools, snapshots | Weekly | PVE only |
|
|
| **`truenas_health.yml`** | 🆕 ZFS pool health, scrub, SMART disks, app status | Weekly | TrueNAS only |
|
|
|
|
## 🎯 Key Features
|
|
|
|
### 🧠 Multi-System Intelligence
|
|
- **Automatic Detection**: Standard Linux, Synology DSM, Container environments
|
|
- **Adaptive Service Checks**: Uses systemd, synoservice, or process detection as appropriate
|
|
- **Cross-Platform**: Tested on Ubuntu, Debian, Synology DSM, Alpine, Proxmox
|
|
|
|
### 📊 Advanced Monitoring
|
|
- **JSON Reports**: Machine-readable output for integration
|
|
- **Trend Analysis**: Historical performance tracking
|
|
- **Alert Integration**: ntfy notifications for critical issues
|
|
- **Health Scoring**: Risk assessment and recommendations
|
|
|
|
### 🛡️ Security & Compliance
|
|
- **Automated Audits**: Regular security scanning
|
|
- **Hardening Checks**: SSH, firewall, user account validation
|
|
- **Update Management**: Security patches with rollback
|
|
- **Certificate Management**: Automated SSL renewal
|
|
|
|
## 🏗️ Inventory Groups
|
|
|
|
### Host Groups
|
|
- **`synology`**: Synology NAS devices (atlantis, calypso, setillo)
|
|
- **`debian_clients`**: Systems using APT proxy (homelab, pi-5, pve, truenas-scale, etc.)
|
|
- **`hypervisors`**: Virtualization hosts (pve, truenas-scale, homeassistant)
|
|
- **`rpi`**: Raspberry Pi devices (pi-5, pi-5-kevin)
|
|
- **`remote`**: Off-site systems (vish-concord-nuc)
|
|
|
|
## 💡 Usage Examples
|
|
|
|
### Essential Daily Operations
|
|
```bash
|
|
# Comprehensive health check across all systems
|
|
ansible-playbook playbooks/health_check.yml
|
|
|
|
# Service status with multi-system support
|
|
ansible-playbook playbooks/service_status.yml
|
|
|
|
# Performance monitoring
|
|
ansible-playbook playbooks/system_monitoring.yml
|
|
```
|
|
|
|
### Targeted Operations
|
|
```bash
|
|
# Target specific groups
|
|
ansible-playbook playbooks/security_audit.yml --limit synology
|
|
ansible-playbook playbooks/backup_databases.yml --limit debian_clients
|
|
ansible-playbook playbooks/container_logs.yml --limit hypervisors
|
|
|
|
# Target individual hosts
|
|
ansible-playbook playbooks/service_status.yml --limit atlantis
|
|
ansible-playbook playbooks/health_check.yml --limit homelab
|
|
ansible-playbook playbooks/restart_service.yml --limit pi-5 -e service_name=docker
|
|
```
|
|
|
|
### Service Management
|
|
```bash
|
|
# Restart services with health checks
|
|
ansible-playbook playbooks/restart_service.yml -e service_name=docker
|
|
ansible-playbook playbooks/restart_service.yml -e service_name=nginx --limit homelab
|
|
|
|
# Collect container logs for troubleshooting
|
|
ansible-playbook playbooks/container_logs.yml -e container_name=nginx
|
|
ansible-playbook playbooks/container_logs.yml -e log_lines=100
|
|
```
|
|
|
|
### Backup Operations
|
|
```bash
|
|
# Database backups
|
|
ansible-playbook playbooks/backup_databases.yml
|
|
ansible-playbook playbooks/backup_databases.yml --limit homelab
|
|
|
|
# Configuration backups
|
|
ansible-playbook playbooks/backup_configs.yml
|
|
ansible-playbook playbooks/backup_configs.yml -e backup_retention_days=14
|
|
|
|
# Backup verification and testing
|
|
ansible-playbook playbooks/backup_verification.yml
|
|
```
|
|
|
|
### Advanced Container Management
|
|
```bash
|
|
# Container dependency mapping and orchestrated restarts
|
|
ansible-playbook playbooks/container_dependency_map.yml
|
|
ansible-playbook playbooks/container_dependency_map.yml -e service_name=nginx -e cascade_restart=true
|
|
|
|
# Service inventory and documentation generation
|
|
ansible-playbook playbooks/service_inventory.yml
|
|
|
|
# Container resource optimization
|
|
ansible-playbook playbooks/container_resource_optimizer.yml
|
|
ansible-playbook playbooks/container_resource_optimizer.yml -e optimize_action=cleanup
|
|
|
|
# Tailscale network management
|
|
ansible-playbook playbooks/tailscale_management.yml
|
|
ansible-playbook playbooks/tailscale_management.yml -e tailscale_action=status
|
|
|
|
# Coordinated container updates
|
|
ansible-playbook playbooks/container_update_orchestrator.yml -e target_container=nginx
|
|
ansible-playbook playbooks/container_update_orchestrator.yml -e update_mode=orchestrated
|
|
```
|
|
|
|
## 📅 Maintenance Schedule
|
|
|
|
### Daily Automated Tasks
|
|
```bash
|
|
# Essential health monitoring
|
|
ansible-playbook playbooks/service_status.yml
|
|
ansible-playbook playbooks/health_check.yml
|
|
|
|
# Database backups
|
|
ansible-playbook playbooks/backup_databases.yml
|
|
```
|
|
|
|
### Weekly Tasks
|
|
```bash
|
|
# Security audit
|
|
ansible-playbook playbooks/security_audit.yml
|
|
|
|
# Storage management
|
|
ansible-playbook playbooks/disk_usage_report.yml
|
|
ansible-playbook playbooks/log_rotation.yml
|
|
|
|
# Configuration backups
|
|
ansible-playbook playbooks/backup_configs.yml
|
|
|
|
# Legacy monitoring
|
|
ansible-playbook playbooks/check_apt_proxy.yml
|
|
```
|
|
|
|
### Monthly Tasks
|
|
```bash
|
|
# System updates
|
|
ansible-playbook playbooks/update_system.yml
|
|
|
|
# Docker cleanup
|
|
ansible-playbook playbooks/prune_containers.yml
|
|
|
|
# Disaster recovery testing
|
|
ansible-playbook playbooks/disaster_recovery_test.yml
|
|
|
|
# Certificate renewal
|
|
ansible-playbook playbooks/certificate_renewal.yml
|
|
|
|
# Legacy health checks
|
|
ansible-playbook playbooks/synology_health.yml
|
|
ansible-playbook playbooks/tailscale_health.yml
|
|
```
|
|
|
|
## 🚨 Recent Updates (February 21, 2026)
|
|
|
|
### 🆕 5 NEW PLAYBOOKS ADDED
|
|
- **`network_connectivity.yml`**: Full mesh Tailscale + SSH + HTTP endpoint health check (Daily)
|
|
- **`ntp_check.yml`**: Time sync drift audit with ntfy alerts (Daily)
|
|
- **`proxmox_management.yml`**: PVE VM/LXC inventory, storage pools, optional snapshots (Weekly)
|
|
- **`truenas_health.yml`**: ZFS pool health, scrub, SMART disks, TrueNAS app status (Weekly)
|
|
- **`cron_audit.yml`**: Scheduled task inventory + world-writable script security flags (Monthly)
|
|
|
|
### ✅ PRODUCTION-READY AUTOMATION SUITE COMPLETED
|
|
- **🆕 Service Lifecycle Management**: Complete service restart, status monitoring, and log collection
|
|
- **💾 Backup Automation**: Multi-database and configuration backup with compression and retention
|
|
- **📊 Advanced Monitoring**: Real-time metrics collection, health checks, and infrastructure alerting
|
|
- **🧠 Multi-Platform Support**: Ubuntu, Debian, Synology DSM, TrueNAS, Home Assistant, Proxmox
|
|
- **🔧 Production Testing**: Successfully tested across 6+ hosts with 200+ containers
|
|
- **📈 Real Performance Data**: Collecting actual system metrics and container health status
|
|
|
|
### 📊 VERIFIED INFRASTRUCTURE STATUS
|
|
- **homelab**: 29/36 containers running, monitoring stack active
|
|
- **pi-5**: 4/4 containers running, minimal resource usage
|
|
- **vish-concord-nuc**: 19/19 containers running, home automation hub
|
|
- **homeassistant**: 11/12 containers running, healthy
|
|
- **truenas-scale**: 26/31 containers running, storage server
|
|
- **pve**: Proxmox hypervisor, Docker monitoring adapted
|
|
|
|
### 🎯 AUTOMATION ACHIEVEMENTS
|
|
- **Total Playbooks**: 8 core automation playbooks (fully tested)
|
|
- **Infrastructure Coverage**: 100% of active homelab systems
|
|
- **Multi-System Intelligence**: Automatic platform detection and adaptation
|
|
- **Real-Time Monitoring**: CSV metrics, JSON health reports, NTFY alerting
|
|
- **Production Ready**: ✅ All playbooks tested and validated
|
|
|
|
## 📖 Documentation
|
|
|
|
### 🆕 New Automation Suite Documentation
|
|
- **AUTOMATION_SUMMARY.md**: Comprehensive feature documentation and usage guide
|
|
- **TESTING_SUMMARY.md**: Test results and validation reports across all hosts
|
|
- **README.md**: This file - complete automation suite overview
|
|
|
|
### Legacy Documentation
|
|
- **Full Infrastructure Report**: `../docs/infrastructure/INFRASTRUCTURE_HEALTH_REPORT.md`
|
|
- **Agent Instructions**: `../AGENTS.md` (Infrastructure Health Monitoring section)
|
|
- **Service Documentation**: `../docs/services/`
|
|
- **Playbook Documentation**: Individual playbooks contain detailed inline documentation
|
|
|
|
## 🚨 Emergency Procedures
|
|
|
|
### Critical System Issues
|
|
```bash
|
|
# Immediate health assessment
|
|
ansible-playbook playbooks/health_check.yml
|
|
|
|
# Service status across all systems
|
|
ansible-playbook playbooks/service_status.yml
|
|
|
|
# Security audit for compromised systems
|
|
ansible-playbook playbooks/security_audit.yml
|
|
```
|
|
|
|
### Service Recovery
|
|
```bash
|
|
# Restart failed services
|
|
ansible-playbook playbooks/restart_service.yml -e service_name=docker
|
|
|
|
# Collect logs for troubleshooting
|
|
ansible-playbook playbooks/container_logs.yml -e container_name=failed_container
|
|
|
|
# System monitoring for performance issues
|
|
ansible-playbook playbooks/system_monitoring.yml
|
|
```
|
|
|
|
### Legacy Emergency Procedures
|
|
|
|
#### SSH Access Issues
|
|
1. Check Tailscale connectivity: `tailscale status`
|
|
2. Verify fail2ban status: `sudo fail2ban-client status sshd`
|
|
3. Check logs: `sudo journalctl -u fail2ban`
|
|
|
|
#### APT Proxy Issues
|
|
1. Test proxy connectivity: `curl -I http://100.103.48.78:3142`
|
|
2. Check apt-cacher-ng service on calypso
|
|
3. Verify client configurations: `apt-config dump | grep -i proxy`
|
|
|
|
#### NAS Health Issues
|
|
1. Run health check: `ansible-playbook playbooks/synology_health.yml`
|
|
2. Check RAID status via DSM web interface
|
|
3. Monitor disk usage and temperatures
|
|
|
|
## 🔧 Advanced Configuration
|
|
|
|
### Custom Variables
|
|
```yaml
|
|
# group_vars/all.yml
|
|
ntfy_url: "https://ntfy.sh/REDACTED_TOPIC"
|
|
backup_retention_days: 30
|
|
health_check_interval: 3600
|
|
log_rotation_size: "100M"
|
|
```
|
|
|
|
### Host-Specific Settings
|
|
```yaml
|
|
# host_vars/atlantis.yml
|
|
system_type: synology
|
|
critical_services:
|
|
- ssh
|
|
- nginx
|
|
backup_paths:
|
|
- /volume1/docker
|
|
- /volume1/homes
|
|
```
|
|
|
|
## 📊 Monitoring Integration
|
|
|
|
### JSON Reports Location
|
|
- Health Reports: `/tmp/health_reports/`
|
|
- Monitoring Data: `/tmp/monitoring_data/`
|
|
- Security Reports: `/tmp/security_reports/`
|
|
- Backup Reports: `/tmp/backup_reports/`
|
|
|
|
### Alert Notifications
|
|
- **ntfy Integration**: Automatic alerts for critical issues
|
|
- **JSON Output**: Machine-readable reports for external monitoring
|
|
- **Trend Analysis**: Historical performance tracking
|
|
|
|
---
|
|
|
|
*Last Updated: February 21, 2026 - Advanced automation suite with specialized container management* 🚀
|
|
|
|
**Total Automation Coverage**: 38 playbooks managing 157+ containers across 5 hosts with 100+ services |