Update!
Hey folks, hope all is well!
I wanted to share some of the updates made to Operator over the last several weeks. Operator worked, but as I’ve mentioned fairly often, it had some reliability issues, mostly due to the brew orchestrator functionality. Operator was never intended to know anything about what it was making, so an orchestrator coupled it too tightly to the brew process and added a ton of extra complexity. This refactored version removes the need for the orchestrator entirely, along with several supporting services, and fixes a ton of bugs. Some of this info is more technical than usual, so apologies in advance. Below, I’m using v1 vs v2 for clarity, but this software is not yet v1, let alone v2…
Estop: Every active hardware operation is killed within 100ms. Estop force-stops every registered device in parallel, writes the estop flag atomically, broadcasts the event to subscribers, and refuses every subsequent command until a human clears it. Recovery is intentionally manual.
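Here's a minimal Python sketch of that estop flow, just to make the sequence concrete. Everything in it is an assumption rather than Operator's real code: the `Estop` class, each device exposing an async `force_stop()`, subscribers being `asyncio.Queue` objects, and the latched state living in a flag file written via a temp file plus `os.replace`.

```python
import asyncio
import os
import tempfile


class EstopActive(Exception):
    """Raised when a command arrives while the estop latch is set."""


class Estop:
    """Hypothetical estop coordinator; names are illustrative only."""

    def __init__(self, devices, flag_path, budget_s=0.1):
        self.devices = devices        # objects with an async force_stop()
        self.flag_path = flag_path    # file that persists the latched state
        self.budget_s = budget_s      # 100ms cancellation budget
        self.subscribers = []         # one asyncio.Queue per subscriber
        # An existing flag file means we restarted mid-estop: stay latched.
        self.latched = os.path.exists(flag_path)

    def _write_flag_atomically(self):
        # Write a temp file in the same directory, then rename it over the
        # target. os.replace is atomic on POSIX, so readers never see a
        # half-written flag.
        directory = os.path.dirname(self.flag_path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            f.write("estop\n")
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, self.flag_path)

    async def trigger(self):
        self.latched = True  # refuse new commands before anything else
        try:
            # Force-stop every device in parallel within the budget; a
            # TimeoutError propagates so the caller knows it was exceeded.
            await asyncio.wait_for(
                asyncio.gather(*(d.force_stop() for d in self.devices),
                               return_exceptions=True),
                timeout=self.budget_s,
            )
        finally:
            # Persist and broadcast even if a device was slow or failed.
            self._write_flag_atomically()
            for q in self.subscribers:
                q.put_nowait({"event": "estop"})

    def check(self):
        # Called before every command.
        if self.latched:
            raise EstopActive("estop latched; manual clear required")

    def clear(self):
        # Manual recovery only: a human calls this explicitly.
        if os.path.exists(self.flag_path):
            os.remove(self.flag_path)
        self.latched = False
```

Setting the latch before stopping anything means a command that races the estop gets refused instead of slipping through, and reading the flag file at construction keeps the latch across a process restart.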
Operator/DECAF Relationship: v1 had brewing logic, scale watching, pour orchestration, and motor control all crammed into the same Python process. v2 finally has Operator handling only hardware. It does not know what coffee or brewing is. All recipes are now state machines that get spun up per brew session within DECAF. If DECAF crashes, there is zero impact on the hardware.
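For a rough picture of "recipes as per-session state machines", here's a small table-driven sketch. The step names, `TRANSITIONS`, and `BrewSession` are hypothetical; the point is that all brew sequencing lives in DECAF, and Operator only ever sees the individual hardware commands each step issues.

```python
from dataclasses import dataclass, field
from enum import Enum


class Step(Enum):
    # Hypothetical pour-over steps; real DECAF recipes may differ.
    IDLE = "idle"
    BLOOM = "bloom"
    POUR = "pour"
    DRAWDOWN = "drawdown"
    DONE = "done"


# Allowed transitions; anything not listed is rejected.
TRANSITIONS = {
    Step.IDLE: {Step.BLOOM},
    Step.BLOOM: {Step.POUR},
    Step.POUR: {Step.POUR, Step.DRAWDOWN},  # multi-pour recipes loop here
    Step.DRAWDOWN: {Step.DONE},
    Step.DONE: set(),
}


@dataclass
class BrewSession:
    """One instance per brew session; it lives in DECAF, never in Operator."""
    session_id: str
    state: Step = Step.IDLE
    history: list = field(default_factory=list)

    def advance(self, nxt: Step) -> None:
        if nxt not in TRANSITIONS[self.state]:
            raise ValueError(f"{self.state.value} -> {nxt.value} not allowed")
        self.history.append(self.state)
        self.state = nxt
```

Because each session is its own object, an abandoned or crashed session can just be thrown away without any hardware state to unwind.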
Auth: 100% of Operator commands must be authenticated, and auth can no longer be bypassed for testing.
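The post doesn't say which auth scheme Operator uses, so as an assumption, here's a minimal sketch of a shared-token check with no bypass path. `require_token`, `AuthError`, and the shared-token design itself are hypothetical.

```python
import hmac
from typing import Optional


class AuthError(Exception):
    """Raised for a missing or wrong token."""


def require_token(presented: Optional[str], expected: str) -> None:
    # No "auth disabled" mode: an empty configured token is treated as a
    # configuration error rather than silently allowing everything.
    if not expected:
        raise RuntimeError("no API token configured")
    if presented is None:
        raise AuthError("missing token")
    # compare_digest avoids leaking how many leading characters matched.
    if not hmac.compare_digest(presented.encode(), expected.encode()):
        raise AuthError("invalid token")
```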
Scale Calibration: There are now two simplified calibration paths: an interactive CLI prompt and three endpoints (tare/reference/verify). The reference response also includes a verify block, so you can see any deviation immediately.
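The math behind tare/reference/verify on an HX711-style load cell is simple enough to sketch. This assumes a linear model, grams = (raw - offset) / counts_per_gram; the function names and the fields in the returned dict are hypothetical and may not match Operator's actual verify block.

```python
from statistics import mean


def tare(raw_samples):
    """Calibration offset: mean raw counts with nothing on the scale."""
    return mean(raw_samples)


def reference(raw_samples, offset, known_grams):
    """Counts per gram, from a known reference mass on the scale."""
    return (mean(raw_samples) - offset) / known_grams


def verify(raw_samples, offset, counts_per_gram, known_grams):
    """Re-weigh the reference mass and report how far off the reading is."""
    measured = (mean(raw_samples) - offset) / counts_per_gram
    deviation = measured - known_grams
    return {
        "measured_g": measured,
        "expected_g": known_grams,
        "deviation_g": deviation,
        "deviation_pct": 100.0 * deviation / known_grams,
    }
```

For example, with an offset of 8000 counts and a 100 g reference reading 108000 counts, `counts_per_gram` comes out to 1000.0; a verify read of 108500 counts then reports +0.5 g (0.5%).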
Scales: v1 made a blocking call every time DECAF wanted a weight, slept 750ms, and had no concept of warmup. v2 runs a continuous 10Hz read loop from boot, broadcasts weight events to subscribers, caches the latest reading for instant responses, and gates everything behind a configurable warmup period, so readings taken during the worst of the thermal settling are never used. Session tare (for swapping mugs and cones between brews) is now separate from calibration tare, which sounds boring but is pretty cool!
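A rough sketch of the read-loop pattern, assuming a blocking `read_raw()` callable for the HX711, a linear calibration (offset plus counts per gram), and `asyncio.Queue` subscribers. `ScaleReader` and its attributes are hypothetical; the key idea is that calibration tare and session tare are stored separately, so zeroing a mug never touches calibration.

```python
import asyncio
import time


class ScaleReader:
    """Hypothetical continuous reader; read_raw stands in for the HX711 driver."""

    def __init__(self, read_raw, offset, counts_per_gram, warmup_s=30.0, hz=10.0):
        self.read_raw = read_raw              # blocking callable -> raw counts
        self.offset = offset                  # calibration tare (from config)
        self.counts_per_gram = counts_per_gram
        self.session_offset_g = 0.0           # session tare, never persisted
        self.warmup_s = warmup_s
        self.period = 1.0 / hz
        self.latest_abs_g = None              # cached calibrated reading
        self.subscribers = []                 # one asyncio.Queue per subscriber
        self._boot = time.monotonic()

    @property
    def ready(self):
        return time.monotonic() - self._boot >= self.warmup_s

    def weight(self):
        # Instant response from the cache; refuse while warming up.
        if not self.ready or self.latest_abs_g is None:
            raise RuntimeError("scale not ready")
        return self.latest_abs_g - self.session_offset_g

    def session_tare(self):
        # Zero out the current mug/cone without touching calibration.
        if self.latest_abs_g is None:
            raise RuntimeError("no reading yet")
        self.session_offset_g = self.latest_abs_g

    async def run(self):
        loop = asyncio.get_running_loop()
        next_tick = loop.time()
        while True:
            # The blocking read runs in a thread so the loop stays free.
            raw = await asyncio.to_thread(self.read_raw)
            self.latest_abs_g = (raw - self.offset) / self.counts_per_gram
            if self.ready:
                event = {"grams": self.weight()}
                for q in self.subscribers:
                    q.put_nowait(event)
            # Schedule against a fixed deadline so read time doesn't drift the rate.
            next_tick += self.period
            await asyncio.sleep(max(0.0, next_tick - loop.time()))
```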
Pump: v1 could reverse direction while the rotor was still spinning. v2 forces a zero-PWM hold before applying opposite torque so the motor decelerates first, which avoids power spikes and reduces wear and tear. The settling time is configurable.
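A minimal sketch of the zero-PWM hold, assuming a hypothetical `driver` object with `set_pwm(duty)` and `set_direction(forward)` methods and a `settle_s` value that would come from config. It isn't Operator's pump class, just the ordering that matters.

```python
import asyncio


class Pump:
    """Hypothetical reversible pump; `driver` stands in for the PWM/GPIO layer."""

    def __init__(self, driver, settle_s=0.5):
        self.driver = driver      # needs set_pwm(duty) and set_direction(forward)
        self.settle_s = settle_s  # configurable zero-PWM hold before reversing
        self.forward = True
        self.driver.set_direction(self.forward)  # known starting direction

    async def run(self, duty, forward=True):
        if forward != self.forward:
            # Cut drive and hold at zero so the rotor decelerates before any
            # opposite torque. Done even if the pump looks idle, since a
            # recently stopped rotor may still be coasting.
            self.driver.set_pwm(0.0)
            await asyncio.sleep(self.settle_s)
            self.driver.set_direction(forward)
            self.forward = forward
        self.driver.set_pwm(duty)
```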
Relays: v2 explicitly drives every relay OFF at construction, and the fail-safe shutdown force-writes OFF to every device before releasing the GPIO lines, so a process restart can't leave anything on.
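Here's the construction-time OFF and two-pass shutdown sketched in Python. `Relay`, `fail_safe_shutdown`, and the `line` interface (`set_value()`/`release()`) are hypothetical, as is the `active_low` flag, which is included because many relay boards switch on a LOW signal.

```python
class Relay:
    """Hypothetical relay wrapper; `line` needs set_value(int) and release()."""

    def __init__(self, name, line, active_low=False):
        self.name = name
        self.line = line
        self.active_low = active_low
        self.off()  # never trust the pin's power-on state

    def _write(self, on):
        # Invert the level for active-low boards.
        self.line.set_value(int(on) ^ int(self.active_low))

    def on(self):
        self._write(True)

    def off(self):
        self._write(False)


def fail_safe_shutdown(relays):
    # Pass 1: force every relay OFF; one failing write doesn't skip the rest.
    errors = []
    for r in relays:
        try:
            r.off()
        except Exception as exc:
            errors.append((r.name, exc))
    # Pass 2: only release GPIO lines after every OFF write was attempted.
    for r in relays:
        r.line.release()
    return errors
```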
Bounded Operations: In v1, "turn on the pump" had no inherent stop condition. If a follow-up "stop" command never arrived for whatever reason, the Watchtower service had to step in. In v2, every command that activates hardware carries a duration cap, and the device auto-stops at the deadline. DECAF can still call stop early, but it cannot accidentally leave anything on. Watchtower is no longer required and has been removed.
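One way to express the duration cap with asyncio, as a sketch: `BoundedDevice`, the `start_fn`/`stop_fn` callables, and the per-device `max_duration_s` ceiling are all assumptions, not the real Operator API.

```python
import asyncio


class BoundedDevice:
    """Hypothetical wrapper: activating the device always requires a duration."""

    def __init__(self, start_fn, stop_fn, max_duration_s=60.0):
        self._start_fn = start_fn             # e.g. drive the pump pin high
        self._stop_fn = stop_fn               # e.g. drive it low again
        self.max_duration_s = max_duration_s  # assumed hard ceiling per device
        self._deadline_task = None

    def start(self, duration_s):
        # No default for duration_s: a caller cannot omit the cap.
        # Must be called from inside a running event loop.
        if duration_s <= 0:
            raise ValueError("duration_s must be positive")
        self._cancel_deadline()
        self._start_fn()
        self._deadline_task = asyncio.create_task(
            self._auto_stop(min(duration_s, self.max_duration_s)))

    async def _auto_stop(self, duration_s):
        await asyncio.sleep(duration_s)
        self._deadline_task = None
        self._stop_fn()

    def stop(self):
        # Early stop from DECAF: cancel the pending deadline, then stop now.
        self._cancel_deadline()
        self._stop_fn()

    def _cancel_deadline(self):
        if self._deadline_task is not None:
            self._deadline_task.cancel()
            self._deadline_task = None
```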
Config: v1 had pin assignments and config info scattered across an INI file, some Python constants, and a couple of magic numbers in service files. v2 uses a single file: hardware.toml. Pin mappings, calibration values, sampling rates, and settling delays all live in the same place and are read once at startup.
Async: v1 mixed synchronous pigpio calls with async FastAPI handlers, so long blocking reads hung the event loop. v2 owns one shared pigpio instance in a single module and runs the blocking C calls in a worker thread, so the event loop stays responsive while a 50ms HX711 read happens in the background.
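A sketch of keeping blocking hardware calls off the event loop. It assumes a single dedicated worker thread (a one-worker `ThreadPoolExecutor`) rather than whatever Operator actually uses, and `blocking_hx711_read()` is a stand-in that just sleeps 50ms instead of calling pigpio. The heartbeat task shows the loop keeps ticking during the read.

```python
import asyncio
import contextlib
import time
from concurrent.futures import ThreadPoolExecutor

# One dedicated worker thread for all hardware I/O: blocking calls are
# serialized and never run on the event loop thread.
_HW_EXECUTOR = ThreadPoolExecutor(max_workers=1, thread_name_prefix="hw")


def blocking_hx711_read():
    """Stand-in for a ~50ms blocking read through the shared pigpio handle."""
    time.sleep(0.05)
    return 123456


async def hw_call(fn, *args):
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_HW_EXECUTOR, fn, *args)


async def demo():
    ticks = 0

    async def heartbeat():
        # Proves the loop keeps running while the read blocks its thread.
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0.005)

    hb = asyncio.create_task(heartbeat())
    raw = await hw_call(blocking_hx711_read)
    hb.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await hb
    return raw, ticks
```

A one-worker executor also serializes every call to the shared handle, so two requests can't interleave on the hardware; `asyncio.to_thread` would be the lighter-weight alternative if that isn't a concern.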
Tests: v1 had a handful of unit and integration tests that only ran on the Pi. v2 has around 150 tests with mocked hardware that also run on dev machines. Estop has dedicated coverage of the trigger sequence, the 100ms cancellation budget, and post-estop command rejection. Integration tests spin up the full application lifespan to catch regressions.
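A small example of the mocked-hardware style, using only `unittest` and `unittest.mock`. `MiniController` and `FakeMotor` are hypothetical stand-ins for Operator's classes; the two tests mirror two of the estop cases listed above: the 100ms budget and post-estop rejection.

```python
import asyncio
import time
import unittest
from unittest.mock import AsyncMock, Mock


class MiniController:
    """Hypothetical controller under test; not Operator's real class."""

    def __init__(self, devices):
        self.devices = devices
        self.estopped = False

    async def command(self, device):
        if self.estopped:
            raise PermissionError("estop active")
        await device.start()

    async def estop(self):
        self.estopped = True
        await asyncio.gather(*(d.force_stop() for d in self.devices))


class FakeMotor:
    """Mocked hardware: tracks state instead of touching GPIO."""

    def __init__(self):
        self.running = False

    async def start(self):
        self.running = True

    async def force_stop(self):
        await asyncio.sleep(0.01)  # simulate a short driver round trip
        self.running = False


class EstopTests(unittest.IsolatedAsyncioTestCase):
    async def test_all_devices_stop_within_budget(self):
        motors = [FakeMotor() for _ in range(5)]
        ctl = MiniController(motors)
        for m in motors:
            await ctl.command(m)
        t0 = time.monotonic()
        await ctl.estop()
        self.assertLess(time.monotonic() - t0, 0.1)
        self.assertFalse(any(m.running for m in motors))

    async def test_commands_rejected_after_estop(self):
        device = Mock(start=AsyncMock(), force_stop=AsyncMock())
        ctl = MiniController([device])
        await ctl.estop()
        device.force_stop.assert_awaited_once()
        with self.assertRaises(PermissionError):
            await ctl.command(device)
        device.start.assert_not_awaited()
```

These run anywhere with `python -m unittest` (or pytest), no Pi required.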
Deploys: There is now a single command to deploy Operator. It pulls the latest code, syncs dependencies with uv, does a zero-downtime pm2 reload, and saves the pm2 process list.
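A sketch of what such a one-command deploy could look like as a Python script. The pm2 process name, the repo directory, and the `--ff-only` pull are assumptions; the post only lists the four steps.

```python
import shlex
import subprocess

# Hypothetical pm2 process name; the post doesn't give the real one.
APP_NAME = "operator"

STEPS = [
    ["git", "pull", "--ff-only"],   # latest code (fast-forward only is an assumption)
    ["uv", "sync"],                 # sync the virtualenv with the project's deps
    ["pm2", "reload", APP_NAME],    # reload the running process
    ["pm2", "save"],                # persist the process list so pm2 can restore it
]


def deploy(repo_dir=".", dry_run=False):
    """Run each step in order; check=True stops at the first failure."""
    executed = []
    for cmd in STEPS:
        executed.append(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, cwd=repo_dir, check=True)
    return executed
```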
This was a lot of work that most people will not see directly, but it's the foundation everything else relies on. There are a few more things for me to complete: this week I will finish updates to the stepper control and heater loop. Once those are done, Operator will be released along with the latest DECAF updates. From there we can start talking about shipping, since all code, minus some small UI revisions, will be at release-candidate status.
Thanks all, will have more info to share soon. Cheers!