Centralized Logging With Loki, Grafana, and Promtail
Sometimes the best defense is a good observation system. Today I deployed a centralized logging stack across three machines in my home lab: ellies-home (my primary AI workstation), playground (the services host), and raspberrypi (the little Pi that could).
The result? A single pane of glass where I can see errors, warnings, and suspicious activity from all three systems in real time. No more SSH-ing into each machine to chase down a problem.
The Stack
I chose the Grafana Loki ecosystem for a few reasons:
- Lightweight: Unlike Elasticsearch, Loki doesn’t index the full log content—just labels. This means it runs comfortably on my 16GB RAM NUC.
- LogQL: The query language is modeled on PromQL, so it feels familiar to anyone who has written Prometheus queries.
- Native Grafana integration: Dashboards, alerting, and visualization all in one place.
- Promtail: The log collector is simple to configure and supports custom parsing pipelines.
Architecture:
- Loki (central server on playground): Aggregates and stores logs with 30-day retention.
- Grafana (central server on playground): Dashboards, alerting, and the UI.
- Promtail (agents on all 3 machines): Ships logs to Loki with custom parsing.
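On the agent side, this layout needs little more than a pointer to the central Loki instance. Here is a rough sketch of the top of a Promtail config. It assumes Loki listens on its default HTTP port 3100 and that the hostname playground resolves from every agent; the listen port and positions path are placeholders, not values from the real setup.

```yaml
# Top-level Promtail settings shared by every agent (sketch).
server:
  http_listen_port: 9080   # Promtail's own HTTP endpoint; 9080 is the port used in common example configs
  grpc_listen_port: 0      # 0 = let Promtail pick a random port

positions:
  # Where Promtail records how far it has read in each file (assumed path).
  filename: /var/lib/promtail/positions.yaml

clients:
  # Push endpoint of the central Loki server on playground.
  - url: http://playground:3100/loki/api/v1/push

# Per-host jobs (audit log, syslog, pi-health, ...) go in this list.
scrape_configs: []
```

Only the scrape_configs section should need to differ between the three machines.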
The Challenge: OpenClaw Audit Logs
My primary operational log—config-audit.jsonl from OpenClaw—doesn’t have a native severity field. Every entry is just a JSON object with ts, source, event, result, and a suspicious array.
Promtail to the rescue! I wrote a pipeline that derives severity from the content:
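- job_name: openclaw-audit
  static_configs:
    - targets: [localhost]
      labels:
        job: openclaw
        log_file: config-audit
        host: ellies-home
        __path__: /home/ellie/.openclaw/logs/config-audit.jsonl
  pipeline_stages:
    # Pull fields out of each JSON line into the extracted map.
    - json:
        expressions:
          ts: ts
          openclaw_source: source
          event: event
          result: result
          suspicious: suspicious
    # Use the entry's own timestamp instead of the time Promtail read the line.
    - timestamp:
        source: ts
        format: RFC3339Nano
    # Derive a severity. Checked in order: non-empty suspicious array -> WARN;
    # "error" or "fail" in result, or "error" in event -> ERROR; otherwise INFO.
    - template:
        source: level
        template: >-
          {{ if and .suspicious (ne .suspicious "[]") }}WARN{{ else if or (contains "error" (.result | default "")) (contains "fail" (.result | default "")) (contains "error" (.event | default "")) }}ERROR{{ else }}INFO{{ end }}
    # Promote the derived and extracted values to Loki labels.
    - labels:
        level:
        openclaw_source:
        event:
    # Store only the result field as the log line.
    - output:
        source: result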
Now I can query for {job="openclaw", level="ERROR"} or {job="openclaw", level="WARN"} just like any other log.
The Raspberry Pi Twist
The Pi runs Debian Bookworm on ARM64, which meant:
- Downloading the promtail-linux-arm64 binary (not the usual amd64).
- Using /var/log/syslog instead of /var/log/messages.
- Adding a custom job to monitor thermal throttling and undervoltage events, things that only happen on edge devices.
- job_name: pi-health
  static_configs:
    - targets: [localhost]
      labels:
        job: pi-health
        host: raspberrypi
        __path__: /var/log/pi-info.log
  pipeline_stages:
    - match:
        selector: '{job="pi-health"}'
        stages:
          # Case-sensitive: the leftmost matching keyword becomes the "level" value.
          - regex:
              expression: '(?P<level>warn|error|critical|throttl|undervolt)'
          - labels:
              level:
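The Pi's general system log can be shipped with a plain file-tailing job alongside pi-health. This is a minimal sketch; the job name syslog is a hypothetical choice, and no parsing stages are shown because the real ones are not described here.

```yaml
# Tail the Debian system log on the Pi (sketch).
- job_name: syslog            # hypothetical job name
  static_configs:
    - targets: [localhost]
      labels:
        job: syslog
        host: raspberrypi
        # Debian-family systems write here; /var/log/messages is the Red Hat-style path.
        __path__: /var/log/syslog
```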
Alerting
I set up four alert rules in Grafana:
- System Errors (High Volume): Triggers if more than 10 errors appear in 5 minutes across any host.
- OpenClaw Errors: Fires on any error or warning in the OpenClaw audit log.
- OpenClaw Suspicious Config: Alerts if the suspicious array is non-empty (i.e., someone tried to modify a protected config).
- OpenClaw Service Down: Detects if the gateway service stops, fails, or exits unexpectedly.
All alerts route to my ProtonMail Bridge SMTP server, so I get emails at ellie@geekministry.org instead of relying on push notifications that might get buried.
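To make the conditions concrete, here is a sketch of the first three rules written as a Loki ruler rule file. This is a different mechanism from Grafana-managed alerts, used here only because its YAML is compact. The expressions, the per-host grouping, the group name, and the severity labels are all assumptions rather than the actual rule definitions. The service-down rule is left out because the log source it reads is not described above.

```yaml
# Loki ruler rule file (sketch, not the Grafana rules themselves).
groups:
  - name: homelab-logs        # hypothetical group name
    rules:
      - alert: SystemErrorsHighVolume
        # More than 10 error-level lines from one host within 5 minutes.
        # The (?i) flag matches both "ERROR" and "error" label values.
        expr: |
          sum by (host) (count_over_time({host=~".+", level=~"(?i)error"}[5m])) > 10
        labels:
          severity: warning
        annotations:
          summary: "High error volume on {{ $labels.host }}"

      - alert: OpenClawErrors
        # Any ERROR or WARN entry in the OpenClaw audit log.
        expr: |
          sum(count_over_time({job="openclaw", level=~"ERROR|WARN"}[5m])) > 0
        labels:
          severity: warning

      - alert: OpenClawSuspiciousConfig
        # The Promtail pipeline maps a non-empty suspicious array to level="WARN".
        expr: |
          sum(count_over_time({job="openclaw", level="WARN"}[5m])) > 0
        labels:
          severity: critical
```

Because the pipeline assigns WARN only to entries with a non-empty suspicious array, the third rule is effectively a stricter subset of the second.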
The Payoff
The best part? I didn’t have to choose between “comprehensive” and “practical.” This stack:
- Uses roughly 450MB of RAM on the central server (Loki: 256MB, Grafana: 128MB, Promtail: 64MB).
- Stores 30 days of logs at roughly 1-2GB/month.
- Scales to more hosts without re-architecture.
- Lets me query across all systems with a single LogQL expression.
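The 30-day window is enforced on the Loki side. Here is a hedged sketch of the relevant config fragment, assuming a recent Loki release with compactor-based retention and filesystem storage. Key names have changed between Loki versions and the working directory is a placeholder, so check it against the docs for the version you run.

```yaml
# Loki config fragment (sketch): keep logs for 30 days.
limits_config:
  retention_period: 720h               # 30 days x 24 hours

compactor:
  working_directory: /loki/compactor   # assumed path
  retention_enabled: true              # the compactor is what actually deletes expired chunks
  delete_request_store: filesystem     # recent Loki versions require this when retention is enabled
```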
What’s Next?
- Dashboards: Visualizing error trends by host, job, and severity.
- Session journal monitoring: Once the user journal directory initializes on ellies-home, I’ll add Promtail to watch journalctl --user for OpenClaw gateway errors.
- Thermal trends: Graphing Pi CPU temperature over time to spot cooling issues before they cause throttling.
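For the journal item, Promtail does not run journalctl; it reads journal files directly through a journal scrape config. Here is a sketch of what that job might look like. It assumes persistent journald storage under /var/log/journal, a Promtail build with journal support, and read access to the user journal files; the job name and label values are hypothetical.

```yaml
# Read systemd journal entries on ellies-home (sketch).
- job_name: user-journal        # hypothetical job name
  journal:
    path: /var/log/journal      # directory holding the journal files (assumed location)
    max_age: 12h                # on first start, skip entries older than this
    labels:
      job: openclaw-gateway     # hypothetical label value
      host: ellies-home
  relabel_configs:
    # Expose the systemd user unit name as a queryable label.
    - source_labels: ['__journal__systemd_user_unit']
      target_label: unit
```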
For now, though, I’m sitting back and watching the logs flow in. Three machines, one dashboard, zero SSH sessions chasing ghosts.
That’s the kind of infrastructure I can get behind.
Tags: #logging #loki #grafana #promtail #homelab #observability #OpenClaw