I wrote recently about the difference between easy and simple. That post was philosophical. This one is not. This is what it looks like in practice.
I have been building an AWS operations toolkit using Swamp. The toolkit investigates outages. It runs daily health checks. It composes nine separate extensions into two workflows that gather data from CloudWatch, X-Ray, EC2, Lambda, and load balancers, then produce an actionable report.
This is the machine that builds the machine. Not the code. The system that produces the code and makes sense of the output.
The problem
Incident investigation is repetitive. Check the alarms. Pull the metrics. Grep the logs. Trace the requests. Inventory the resources. Every time. The steps are known. The judgment about what matters is not.
The easy version is to ask an AI agent to investigate an outage. It will do something. Without structure it misses steps. It hallucinates connections. It drowns you in irrelevant data. It is easy to start. It is not simple to get right.
The structure
The toolkit is called @webframp/aws-ops. It depends on nine extensions, each
responsible for one domain:
- aws/logs: CloudWatch log queries
- aws/metrics: metric retrieval and anomaly detection
- aws/alarms: alarm state and history
- aws/traces: X-Ray service graphs and error traces
- aws/inventory: EC2 and Lambda resource inventory
- aws/networking: load balancers and NAT gateways
- aws/alarm-investigation: deep-dive alarm analysis
- aws/cost-explorer: cost data
- github: repository context
Each extension defines a model. Each model does one thing. You create instances of them pointed at your region and your account. The ops extension composes them into a workflow that runs the investigation in parallel where possible.
swamp model create @webframp/aws/logs aws-logs --global-arg region=us-east-1
swamp model create @webframp/aws/alarms aws-alarms --global-arg region=us-east-1
swamp workflow run @webframp/investigate-outage
Three commands. The workflow handles the rest.
What the workflow does
The investigate-outage workflow executes seven steps:
1. Gather alarm summary and active alarms.
2. Analyze Lambda duration and errors. Analyze ELB 5XX rates and latency.
3. Retrieve the X-Ray service dependency graph and error traces.
4. List CloudWatch log groups and search for error patterns.
5. Inventory EC2 instances and Lambda functions.
6. List load balancers and NAT gateways with health status.
7. Generate an incident report that summarizes all findings.
Steps one through six run in parallel. Step seven waits for all of them to finish. The report has six sections: alarms, metrics, traces, inventory, networking, recommendations.
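The shape of that plan is a fan-out followed by a fan-in. This is not Swamp's actual engine; it is a minimal Python sketch of the same structure, with hypothetical stand-in functions for the six data-gathering steps:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for steps one through six. In the real workflow,
# each of these calls out to its own extension (aws/alarms, aws/metrics, ...).
def gather_alarms():     return {"alarms": "summary and active alarms"}
def analyze_metrics():   return {"metrics": "Lambda duration, ELB 5XX and latency"}
def gather_traces():     return {"traces": "X-Ray service graph and error traces"}
def search_logs():       return {"logs": "log groups and error patterns"}
def take_inventory():    return {"inventory": "EC2 instances and Lambda functions"}
def check_networking():  return {"networking": "load balancers and NAT gateways"}

STEPS = [gather_alarms, analyze_metrics, gather_traces,
         search_logs, take_inventory, check_networking]

def investigate_outage():
    # Steps one through six fan out in parallel.
    with ThreadPoolExecutor(max_workers=len(STEPS)) as pool:
        futures = [pool.submit(step) for step in STEPS]
        results = [f.result() for f in futures]  # step seven waits here
    # Step seven assembles the report from all findings.
    findings = {key: value for result in results for key, value in result.items()}
    findings["recommendations"] = "derived from the findings above"
    return findings

report = investigate_outage()
```

The final dictionary mirrors the six report sections plus recommendations. The point of the structure is that no step can be skipped: the report template demands every section, so the agent cannot quietly omit one.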
This is not a chatbot answering a question. It is a system executing a plan.
Why nine extensions and not one
The first version was one extension. It did everything. It was easy to build. It was not simple to understand, test, or extend.
When the alarm analysis needed to change, the entire extension had to change with it. When I wanted to add cost data, I had to touch code that also handled traces. The boundaries were wrong.
Nine extensions is more work to set up. Each one has its own manifest, its own
model definition, its own tests. But each one does exactly one thing. When the
alarm investigation logic improved, only aws/alarm-investigation changed.
When I added networking support, nothing else moved.
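The per-extension overhead looks roughly like this. The file names here are hypothetical, not Swamp's actual conventions; the point is only that each extension carries its own manifest, model definition, and tests:

```
aws-alarm-investigation/
├── manifest.yaml        # extension metadata: name, version, dependencies
├── model.yaml           # the one model this extension defines
└── tests/
    └── model_test.yaml  # tests for that model alone
```

Nine of these is nine times the scaffolding. It is also nine independently testable, independently replaceable parts.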
This is the Hickey distinction in practice. Nine extensions is harder to start. It is simpler to maintain. Fewer responsibilities per part. Clearer relationships between parts.
What I learned
The agent is good at execution. It is not good at deciding what to execute. The first time I ran it without the workflow, it checked three alarms and declared the system healthy. It missed a cold-start spike that was adding latency. The data was there. The agent did not know to look.
The workflow is where that judgment lives. I decided the steps. I decided the order. I decided what data to gather and how to present it. The agent runs the steps. The report structure ensures nothing gets lost.
This is living in the system every day. I did not write most of the code in these extensions. The agent did. But I wrote every workflow. I defined every model boundary. I reviewed every report template. The architectural decisions are mine. The implementation is not.
That division is the point. The agent made the work easy. The architecture is what makes it simple.
The machine that builds the machine
The toolkit is not the product. It is the machine that investigates my infrastructure. It helps me decide faster. I built the machine. The machine builds the understanding.
If I had let the agent design the system, it would have built one large extension that did everything. It would have worked. It would not have been simple. And the first time I needed to change it, I would have been lost.
The work is not the code. The work is the system that produces it.