I wrote three months ago about the context files and memory system behind my AI coding agent. This post is the follow-up I wish existed back then: the concrete tricks and patterns that make the agent reliably useful, not just occasionally impressive. Most of these are small. They compound.
1. A memory system with types, not a scratchpad
The naive approach is “tell the agent to remember things.” That produces a growing blob of ambiguous notes that the agent dutifully reads at the start of every session and then ignores, because it has no way to decide which lines are load-bearing.
What works is a directory of small typed files, one concept per file, with a one-line index in MEMORY.md. I use four types:
- user — facts about me, my role, and my preferences, so the agent can tailor responses without asking
- feedback — rules I have enforced through correction or confirmation, with the why attached
- project — ephemeral state about ongoing work (deadlines, stakeholders, incidents)
- reference — pointers to external systems (Linear, Jira, Grafana, a specific Google Sheet)
The feedback type is the one that pays the most. Each entry is roughly: “here is the rule, here is the incident that produced it, here is when to apply it.” That third part is the secret. Without it, the agent knows what but not when, and it ends up applying the rule in the wrong place or not at all.
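To make the rule / incident / when structure concrete, here is a minimal sketch of how one entry could be modelled. It is not the actual file format: the field names (`kind`, `rule`, `why`, `when`), the `slug`, the rendered layout, and the wording of the sample `when` are all assumptions made for illustration.

```python
from dataclasses import dataclass

# The four memory types named above.
MEMORY_TYPES = ("user", "feedback", "project", "reference")


@dataclass(frozen=True)
class MemoryEntry:
    """One concept per file. All field names are illustrative assumptions."""
    slug: str       # hypothetical file stem, e.g. "commit-before-deploy"
    kind: str       # one of MEMORY_TYPES
    rule: str       # what to do
    why: str = ""   # the incident that produced the rule
    when: str = ""  # the situations in which the rule applies

    def __post_init__(self) -> None:
        if self.kind not in MEMORY_TYPES:
            raise ValueError(f"unknown memory type: {self.kind!r}")
        # A feedback rule with no "when" is the failure mode described above.
        if self.kind == "feedback" and not (self.why and self.when):
            raise ValueError("feedback entries need both 'why' and 'when'")

    def to_file_text(self) -> str:
        """Render the body of the entry's own small file."""
        return (
            f"type: {self.kind}\n"
            f"rule: {self.rule}\n"
            f"why: {self.why}\n"
            f"when: {self.when}\n"
        )

    def index_line(self) -> str:
        """Render the one-line pointer that would go into MEMORY.md."""
        return f"- [{self.kind}] {self.slug}.md: {self.rule}"


entry = MemoryEntry(
    slug="commit-before-deploy",
    kind="feedback",
    rule="Always commit before running deploy.sh",
    why="A production push of uncommitted local edits",
    when="Any time deploy.sh is about to be run",  # illustrative wording
)
print(entry.index_line())
```

Rejecting a feedback entry that lacks `why` or `when` is one way to enforce the "third part" at write time rather than hoping it gets filled in later.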
After a few months there are about sixty of these. A sample of the ones I reach for most:
- No business logic in _metrics_bridge — use the bizlogic pipeline. Saved the day the bridge migration cost me (see the previous post).
- Always commit before running deploy.sh. Learned after a production push of uncommitted local edits.
- When user says “slowly,” use 15-30 s delays for scraping. Earned by getting rate-limited on a gov API.
- “ss” always means screenshot — don’t ask, just read it. A personal quirk that saves a round-trip every time.
- Never mix work and personal — secrets, keys, GCP projects, service accounts. A hard boundary that no amount of convenience justifies crossing.
2. Domain-specific CLI tools as first-class tool use
The agent is good at running shell commands. It is less good at writing 40-line curl calls to authenticated REST APIs from scratch. The fix is to wrap each API in a small CLI tool with a readable surface, and document it once in CLAUDE.md. Then the agent reaches for the tool the way a human would.
My toolbox looks like this:
- bigquery.sh — BigQuery query runner with project aliases (raw, dev, prod, data)
- jira-ticket.sh — create, comment, assign, transition, and list Jira tickets from the shell
- dbt-cloud.sh — trigger and inspect dbt Cloud jobs
- pipedrive.sh — read deals, orgs, persons, custom fields
- holistics — query the semantic layer with natural language, returning AQL + results
- browse — headless Playwright reader for authenticated pages
- read-gsheet — read any Google Sheet via Application Default Credentials
- gmail / gcal — work and personal inboxes and calendars
Each of these is maybe 200 lines of shell or Python. Together they replaced thousands of lines of “agent writes ad-hoc API code that almost works.” The CLAUDE.md entry for each tool is four or five examples. The agent extrapolates from there.
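The shape of such a wrapper can be sketched in a few dozen lines. This is a hypothetical `tickets` tool, not the real jira-ticket.sh: the subcommands, the placeholder base URL, and the endpoint paths are invented for illustration and do not correspond to any real tracker's API. The sketch stops at building the request; sending it is left out.

```python
import argparse
import json

# Placeholder base URL; a real tool would read it, and a token, from the environment.
BASE_URL = "https://tracker.example.invalid/api"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="tickets", description="Hypothetical ticket CLI")
    sub = parser.add_subparsers(dest="command", required=True)

    create = sub.add_parser("create", help="create a ticket")
    create.add_argument("--project", required=True)
    create.add_argument("--title", required=True)

    comment = sub.add_parser("comment", help="comment on a ticket")
    comment.add_argument("ticket_id")
    comment.add_argument("text")

    return parser


def build_request(args: argparse.Namespace) -> dict:
    """Translate the readable CLI surface into one HTTP request description."""
    if args.command == "create":
        return {
            "method": "POST",
            "url": f"{BASE_URL}/tickets",
            "json": {"project": args.project, "title": args.title},
        }
    if args.command == "comment":
        return {
            "method": "POST",
            "url": f"{BASE_URL}/tickets/{args.ticket_id}/comments",
            "json": {"text": args.text},
        }
    raise ValueError(f"unsupported command: {args.command}")


def main(argv: list[str]) -> dict:
    request = build_request(build_parser().parse_args(argv))
    # A real tool would send the request here; this sketch only prints it.
    print(json.dumps(request, indent=2))
    return request


main(["comment", "DATA-123", "Fixed in the latest push"])
```

The agent only ever sees the short command line, so the four or five documented examples cover most of what it needs; authentication and URL construction stay inside the tool.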
3. Schema files as a contract against hallucination
The single biggest category of wrong code the agent produced, before I fixed it, was “plausible-looking column names that do not exist.” damage_report_id instead of dr_id. created_date instead of created_at. It would try variations until something worked, which sometimes meant a query that returned results, but the wrong ones.
The fix is a set of CSV files in a knowledge base that list every column of every dbt model in the warehouse, and a single hard rule in CLAUDE.md:
Before writing ANY BigQuery query or dbt model changes, ALWAYS look up column names in these schema files. If a column doesn’t exist in the schema file, READ the actual dbt model source code. Never try variations.
The CSVs are regenerated on a CI schedule. The agent now runs grep "rpt_damage_report_metrics," /path/to/dwh_reporting_schema.csv | head -30 before writing SQL. The error rate on column names dropped to roughly zero.
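The same lookup can be done programmatically. The sketch below assumes a CSV layout of `table_name,column_name,data_type`; the real schema files may be laid out differently, and the data types in the sample are made up.

```python
import csv
import io


def load_schema(csv_text: str) -> dict[str, set[str]]:
    """Map each table name to the set of columns listed for it."""
    schema: dict[str, set[str]] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        schema.setdefault(row["table_name"], set()).add(row["column_name"])
    return schema


def unknown_columns(schema: dict[str, set[str]], table: str, columns: list[str]) -> list[str]:
    """Return the requested columns that the schema file does not list."""
    if table not in schema:
        raise KeyError(f"table not in schema file: {table}")
    return [c for c in columns if c not in schema[table]]


# Sample content in the assumed layout; the data types are illustrative.
SAMPLE = """table_name,column_name,data_type
rpt_damage_report_metrics,dr_id,STRING
rpt_damage_report_metrics,created_at,TIMESTAMP
"""

schema = load_schema(SAMPLE)
print(unknown_columns(schema, "rpt_damage_report_metrics",
                      ["dr_id", "damage_report_id", "created_date"]))
# prints ['damage_report_id', 'created_date']
```

A non-empty result is the signal to stop and read the dbt model source instead of guessing another spelling.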
4. Testing as a deploy gate, not a polishing step
The agent is good at making things work. It is less good at noticing when it has broken something else. The only reliable way I have found to close that gap is to make tests mandatory at the deploy boundary, and to make them comprehensive enough that “all tests pass” actually means something.
For each of my projects, the deploy script does not ship code until it passes a local test battery. The numbers are specific on purpose:
- A PWA project — 59 smoke tests against prod (67 against dev, which adds dev-environment checks), plus 34 Playwright end-to-end tests driving a real browser through the route planner, the vehicle picker, login, and the PWA install flow.
- A games-database project — 161 commits under 17 smoke tests covering the paste auto-submit, coverage scrapers, and the /out/ short-link layer.
- The internal Slack data bot — a 100-question test battery run in parallel against the live API, checking that each answer matches a known-good result. 86 pass, 0 fail, 14 intentional “overasks” (the bot asking a clarifying question).
- dbt data warehouse — Project Evaluator PK tests on layers 07, 10, 11, 13, 14, plus standard dbt tests (unique, not_null, custom macros) wired into CI.
Two specific lessons from writing these:
Playwright tests need waitForResponse before goto. Not after. If you register the response listener after navigating, you miss every API call the page fires during its initial load. This caught me for about a week on the PWA project because a test was passing for the wrong reason.
Smoke tests should use a variable for the frontend directory. Dev and prod serve different paths on my server; hardcoding /srv/app/frontend/ in the smoke test passed on dev and failed silently on prod. A FE_DIR variable, set at the top of the script and passed into each test, removed the entire class of environment-mismatch bugs.
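A minimal sketch of the FE_DIR idea, written in Python rather than shell. The `index.html` check is a hypothetical stand-in for real smoke tests, and the choice to raise when FE_DIR is unset (rather than fall back to a default) is an assumption.

```python
import os
import tempfile
from pathlib import Path


def frontend_dir() -> Path:
    """Resolve the frontend directory once, from the FE_DIR environment variable."""
    value = os.environ.get("FE_DIR")
    if not value:
        # Fail loudly instead of silently falling back to one environment's path.
        raise RuntimeError("FE_DIR is not set")
    return Path(value)


def check_index_exists(fe_dir: Path) -> bool:
    """Hypothetical smoke check: the frontend directory contains index.html."""
    return (fe_dir / "index.html").is_file()


def run_smoke_tests(fe_dir: Path) -> dict[str, bool]:
    """Every check receives the same fe_dir; none of them hardcodes a path."""
    checks = {"index_exists": check_index_exists}
    return {name: check(fe_dir) for name, check in checks.items()}


# Demonstration against a throwaway directory standing in for a real deployment.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "index.html").write_text("<!doctype html>")
    os.environ["FE_DIR"] = tmp
    print(run_smoke_tests(frontend_dir()))
```

Because the path enters in exactly one place, running the same battery against dev or prod is a matter of changing one variable.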
The point is not that 100% test coverage matters. The point is that the tests cover the user-visible contracts — does clicking the vehicle picker open the modal, does the route-planner DP produce a non-zero cost, does the Slack bot still answer “what was revenue last month” correctly — and the deploy script refuses to ship if any of them fail. That refusal is the gate.
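The gate itself is a small amount of logic. Here is a sketch, assuming each test suite is a command that exits non-zero on failure; the `run_gate` and `deploy` names and the stand-in commands are hypothetical, and the real deploy.sh is a shell script rather than Python.

```python
import subprocess
import sys
from typing import Callable


def run_gate(test_commands: list[list[str]]) -> bool:
    """Run every test command; the gate passes only if all exit with status 0."""
    all_passed = True
    for command in test_commands:
        result = subprocess.run(command, capture_output=True, text=True)
        passed = result.returncode == 0
        print(f"{'PASS' if passed else 'FAIL'}: {' '.join(command)}")
        all_passed = all_passed and passed
    return all_passed


def deploy(test_commands: list[list[str]], ship: Callable[[], None]) -> bool:
    """Call ship() only when the whole battery passes; otherwise refuse."""
    if not run_gate(test_commands):
        print("Refusing to deploy: at least one test failed.")
        return False
    ship()
    return True


# Stand-in commands; a real battery would invoke the smoke and end-to-end suites.
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(1)"]

deploy([passing, failing], ship=lambda: print("shipping"))
```

The sketch runs every suite before deciding, so a single run reports all failures instead of stopping at the first one.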
5. Investigate before implementing
Early on, my instinct was to give the agent a task and let it run. That produced beautiful, confidently wrong implementations — building the feature the agent thought I wanted based on the naming, rather than the one I actually wanted based on the data.
The rule now, saved as feedback memory:
Investigate data thoroughly before building — query examples first, believe the user.
In practice this means: for anything non-trivial, the agent’s first action is a bq query or a grep or a wc -l, not a code edit. It looks at a sample of the actual rows, the actual distribution, the actual column types. Often the investigation finds that the premise of the task was slightly wrong, and the real task is 30% smaller than the original ask. Occasionally it finds that the task is impossible as specified, which saves a full implementation of the wrong thing.
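As an illustration of what "look at the actual rows" can mean, the sketch below profiles a small sample: observed types, null counts, and the most common values per column. The sample rows are made up, and the function assumes the values are hashable (strings, numbers), which holds for typical query results.

```python
from collections import Counter


def profile_rows(rows: list[dict]) -> dict[str, dict]:
    """Summarise a sample: per column, observed types, null count, top values."""
    profile: dict[str, dict] = {}
    columns = sorted({key for row in rows for key in row})
    for column in columns:
        values = [row.get(column) for row in rows]
        non_null = [v for v in values if v is not None]
        profile[column] = {
            "types": sorted({type(v).__name__ for v in non_null}),
            "nulls": len(values) - len(non_null),
            "top_values": Counter(non_null).most_common(3),
        }
    return profile


# Made-up sample rows; in practice these would come from a query against real data.
sample = [
    {"dr_id": "A1", "status": "open", "cost": 120.0},
    {"dr_id": "A2", "status": "open", "cost": None},
    {"dr_id": "A3", "status": "closed", "cost": 0.0},
]

for column, stats in profile_rows(sample).items():
    print(column, stats)
```

Even on three rows this surfaces the kind of detail that changes a task: a nullable cost column and a zero-cost record that an implementation built from the column names alone would not anticipate.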
6. Auto-deploy on merge, robot reviewer before human
Every merge to master on my side projects triggers deploy.sh on the target server. The agent does not run deploys manually. It pushes, the CI runs, and the service redeploys. This removes an entire category of “agent deploys before running tests” incidents.
On pull requests, I route them through CoderabbitAI before I look at them myself. CoderabbitAI catches the small stuff — obvious bugs, typos, forgotten rate limits, missing error handling. By the time I read the PR, those are already addressed. My review focuses on architecture and intent, which is where a human is actually useful.
The memory rule that goes with this:
Always reply to every CR comment after fixing — don’t just push.
A silent fix is indistinguishable from ignoring the feedback, both for the reviewer and for future audits of why a change was made.
7. Local cron for everything the agent might otherwise forget
There are a few recurring tasks the agent could do “when it remembers,” and a few it definitely cannot because it is not running. Those go in crontab on my Mac, often calling the claude CLI in non-interactive mode to produce the analysis.
Recent examples:
- Every Tuesday 10:00, pull the last seven days of the PWA project visitor behaviour from Umami on the server, feed the raw session data to the claude CLI, append a new entry to visitors.log.md.
- Daily 09:13, check all revoked GCP service accounts for accidental use, ping Slack if any request shows up.
- Weekly 04:00, back up the three WordPress sites and the PWA project SQLite database, summarise the result by email.
The pattern is the same in every case: a shell script gathers raw data, the claude CLI analyses it and writes a markdown fragment, and cron re-runs the job on a schedule. The agent becomes a background process that maintains its own logs.
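The gather / analyse / append loop can be sketched as below. The actual jobs are shell scripts, so this Python version, the `run_job` and `claude_analyze` names, and the log layout are assumptions. `claude_analyze` assumes a `claude` binary on PATH that accepts `-p <prompt>` and reads the piped data from stdin; the demonstration swaps in a stand-in so nothing external is called.

```python
import subprocess
import tempfile
from datetime import date
from pathlib import Path
from typing import Callable


def claude_analyze(raw: str, prompt: str) -> str:
    """Pipe raw data to the claude CLI in non-interactive mode and return its reply.

    Assumes `claude -p <prompt>` is available and reads stdin; not called below.
    """
    result = subprocess.run(
        ["claude", "-p", prompt],
        input=raw, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def run_job(gather: Callable[[], str],
            analyze: Callable[[str], str],
            log_path: Path,
            today: date) -> str:
    """Gather raw data, analyse it, and append a dated markdown fragment to the log."""
    fragment = f"\n## {today.isoformat()}\n\n{analyze(gather())}\n"
    with log_path.open("a", encoding="utf-8") as log:
        log.write(fragment)
    return fragment


# Dry run with stand-ins for the data source and the analyser.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "visitors.log.md"
    run_job(
        gather=lambda: "session data placeholder",
        analyze=lambda raw: f"Received {len(raw)} characters of raw data.",
        log_path=log,
        today=date(2024, 1, 2),
    )
    print(log.read_text(encoding="utf-8"))
```

In a real job the `analyze` argument would be something like `lambda raw: claude_analyze(raw, "Summarise this week's visitor behaviour")`, and the crontab entry would simply invoke the script. Appending rather than overwriting is what turns the output into a log.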
A working definition of “reliable”
None of these patterns is exotic. What matters is that each one closes one class of failure, and together they close enough classes that the baseline of what the agent ships is trustworthy enough to merge without reading every line. That is the useful threshold. Below it the agent is a toy. Above it, it is a collaborator.
