Migrating 700+ chatbot dialogs off Infobip Answers with Claude Code

Days after Opus 4.8 shipped, I pointed /goal runs and dynamic agent workflows at a 3.1 MB chatbot platform export. One weekend later: owned Python, 912 passing tests, and zero dangling routes — with every fidelity claim adversarially verified.

When I joined the team, a WhatsApp health chatbot had already been running on Infobip Answers, a no-code visual flow builder, for almost three years. It worked. But the only artifact the platform gives you is a 3.1 MB JSON export: no version control that means anything, no search, no tests, and business logic trapped in JavaScript snippets capped at 5,000 characters each. I ran into all of this within my first weeks. Migrating to owned code had been discussed before and always dismissed as too expensive — the kind of problem that sits and waits.

Then, in the span of three weeks, Claude Code shipped two features that changed the calculation: /goal (v2.1.139, May 12), an autonomous loop where a separate evaluator re-checks a finish condition after every turn and keeps the agent working until it holds, and dynamic workflows (May 28), script-orchestrated fan-out to dozens of background subagents. Opus 4.8 had just been released on top of that. I wanted to know whether these could carry a real production job or just look good in demos. So the waiting problem became the test case: the weekend of May 29–31, I pointed all three at the export.

This post is for two audiences: people who want to get off Infobip Answers (the anatomy and the recipe transfer directly), and people who want to understand what /goal and dynamic workflows can handle in practice. The headline numbers above are lightly rounded for publication but real: each is reproducible from the repo, and each was checked by a separate agent that tried to falsify it.

What you're actually migrating

An Answers export is a JSON array of dialogs. Each dialog has elements of about ten types: GO_TO_DIALOG (the routing graph), SEND_MESSAGE, EVALUATE_ATTRIBUTE (branching), API_CALL, CODING (embedded JavaScript — where the real logic hides), plus delays, user-input matchers, and session control. This bot: over 700 dialogs, almost 4,000 elements, more than 1,500 inter-dialog routes. The 329 CODING snippets deduplicated to 110 unique ones (one appears 112 times), and only 44 of those were trivial.

Two properties of the export shape the whole migration, and you should know them before you start. First, attributes and keywords appear only as numeric IDs: the roughly 200 attribute IDs and 30 keyword IDs your branches depend on carry no names where they are used. About 70% of the attribute names can be reverse-engineered from the export's PEOPLE_PERSON mappings; keyword names don't appear anywhere, so request a name export from Infobip up front. Second, order is semantics: branch rules are first-match-wins, and almost half the elements jump to non-adjacent targets, so you cannot rely on fall-through.
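To make "order is semantics" concrete, here is a minimal sketch of first-match-wins evaluation over numeric attribute IDs. The Rule shape, the first_match helper, and the element names are hypothetical illustrations; the post does not show the export's actual field names.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Rule:
    """Hypothetical branch rule; the real export's field names are not shown here."""
    attr_id: int                         # numeric attribute ID, never a guessed name
    predicate: Callable[[object], bool]  # condition applied to the attribute value
    target: str                          # element to jump to when the rule matches


def first_match(rules: List[Rule], attrs: Dict[int, object],
                default: Optional[str] = None) -> Optional[str]:
    """Return the target of the first rule that holds, in declared order."""
    for rule in rules:
        if rule.predicate(attrs.get(rule.attr_id)):
            return rule.target
    return default


# Two rules on the same attribute: a specific one, then a catch-all.
rules = [
    Rule(123, lambda v: v is not None, "el_known_user"),
    Rule(123, lambda v: True, "el_catch_all"),
]

in_order = first_match(rules, {123: "x"})
reordered = first_match(list(reversed(rules)), {123: "x"})
```

With the declared order the specific rule wins; reversing the list sends the same session to the catch-all. That is why a port has to preserve rule order exactly instead of treating the rules as an unordered set.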

The target was intentionally plain: a small deterministic state-machine engine (~2,350 hand-written lines of Python) where each dialog is a declarative function, the runtime suspends on user input, and sessions persist as JSON. No LLM anywhere in the runtime — the bot's behaviour stays exactly as deterministic as it was on the platform.
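As a rough illustration of that shape, and not the actual engine, the sketch below runs declarative dialogs until one needs user input, then suspends so the session can be stored as JSON and resumed later. The step vocabulary ("say", "ask", "goto"), the Session fields, and the run function are assumptions made for this example.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Session:
    """Hypothetical session record; everything in it is JSON-serialisable."""
    dialog: str = "start"
    step: int = 0
    attrs: dict = field(default_factory=dict)
    outbox: list = field(default_factory=list)


# Each dialog is declarative data: ("say", text), ("ask", attr_key), ("goto", dialog).
DIALOGS = {
    "start": [("say", "Hi!"), ("ask", "attr_123"), ("goto", "done")],
    "done": [("say", "Thanks.")],
}


def run(session, user_input=None):
    """Advance deterministically; return True when suspended waiting for input."""
    while session.step < len(DIALOGS[session.dialog]):
        kind, arg = DIALOGS[session.dialog][session.step]
        if kind == "say":
            session.outbox.append(arg)
        elif kind == "ask":
            if user_input is None:
                return True  # suspend here; the caller persists the session
            session.attrs[arg] = user_input
            user_input = None  # each input is consumed exactly once
        elif kind == "goto":
            session.dialog, session.step = arg, 0
            continue
        session.step += 1
    return False


session = Session()
waiting = run(session)                        # stops at the "ask" step
stored = json.dumps(asdict(session))          # persist between user messages
resumed = Session(**json.loads(stored))       # restore on the next message
finished = not run(resumed, user_input="yes")
```

Because the whole state is plain data, a suspended conversation survives a process restart, and replaying the same inputs always yields the same outbox.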

The setup: build the checks before the autonomy

The single most important decision came before any autonomous run: never translate JSON→Python directly, and never let a model guess. Concretely, that meant three pieces of groundwork, built and tested by hand (with Claude, but supervised):

A typed intermediate representation (IR). A pipeline extracts every dialog into pydantic models that keep the original element verbatim alongside typed fields. The IR — not the raw JSON, and not the Python — is the source of truth that everything downstream is verified against.
Registries keyed by numeric ID. Attributes become attr_123, keywords kw_456. No name is ever guessed into a branch; an unknown-name ledger tracks every unresolved ID instead.
A 5-layer conformance harness: (1) the Python dialog graph must be isomorphic to the IR graph, (2) every message string must match exactly, (3) golden scenario traces must replay identically, (4) every one of the 1,500+ routes must resolve with zero dangling targets, (5) any dialog branching on an unresolved ID is flagged by the ledger, not silently passed.
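The sketch below shows roughly how a typed IR can feed two of those layers: the dangling-route check (layer 4) and the unresolved-ID ledger (layer 5). The real pipeline uses pydantic models; this version uses stdlib dataclasses to stay dependency-free, and every field and function name in it is an assumption for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Element:
    """Hypothetical IR element: typed fields plus the original kept verbatim."""
    id: str
    type: str                            # e.g. "GO_TO_DIALOG", "SEND_MESSAGE"
    raw: dict                            # untouched element from the export
    target_dialog: Optional[str] = None  # set for GO_TO_DIALOG elements
    attr_ids: Tuple[int, ...] = ()       # numeric IDs this element branches on


@dataclass
class Dialog:
    id: str
    elements: List[Element] = field(default_factory=list)


def dangling_routes(dialogs):
    """Layer 4: every GO_TO_DIALOG target must be a dialog that exists."""
    known = {d.id for d in dialogs}
    return [
        (d.id, e.id, e.target_dialog)
        for d in dialogs
        for e in d.elements
        if e.type == "GO_TO_DIALOG" and e.target_dialog not in known
    ]


def unresolved_dialogs(dialogs, resolved_names):
    """Layer 5: dialogs branching on an ID that has no resolved name yet."""
    return sorted({
        d.id
        for d in dialogs
        for e in d.elements
        for attr_id in e.attr_ids
        if attr_id not in resolved_names
    })


ir = [
    Dialog("a", [
        Element("e1", "GO_TO_DIALOG", {}, target_dialog="b"),
        Element("e2", "GO_TO_DIALOG", {}, target_dialog="c"),  # no dialog "c"
    ]),
    Dialog("b", [Element("e3", "EVALUATE_ATTRIBUTE", {}, attr_ids=(123, 456))]),
]
dangling = dangling_routes(ir)
flagged = unresolved_dialogs(ir, resolved_names={123: "attr_123_name"})
```

Both checks return lists instead of raising, so a harness run can print the counts and keep going; an empty dangling list is the "zero dangling routes" condition.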

With the harness in place, the work split cleanly: roughly 63% of dialogs contain no complex logic and were scaffolded deterministically from the IR — exact by construction, since a script can't mistranslate. Model judgment was spent only where judgment was required: the 110 unique JavaScript snippets.

The shape of the migration — deterministic where provable, agents where judgment lives, everything proven by the harness

/goal: autonomy with a measurable finish line

/goal is mechanically simple: it wraps a session-scoped Stop hook. After every turn, a small fast model (Haiku by default) evaluates your condition against what Claude has surfaced; if it fails, Claude starts another turn with the evaluator's reasoning as guidance, instead of returning control to you. The detail that matters: the evaluator cannot call tools. It judges only what's already in the transcript.

That one constraint dictates how you write conditions. A vague goal ("migrate the bot") gives the evaluator nothing to check. A measurable one does:

“Every dialog migrated and behaviourally green on the conformance harness; full test suite passing; zero dangling routes — with the counts printed in the transcript.”

The final clause is the important part: the run must surface its own evidence. Every wave of work ended by re-running the harness and printing the scoreboard, so the tool-less evaluator had real numbers to judge. The migration took three /goal runs: Phase 1 owned the logic (every dialog, ~24 hours wall-clock), Phase 2 owned the I/O (production messaging adapters, mock-first, with the simulator path byte-identical), and a short Phase 3 produced the handoff docs. Without /goal, someone babysits hundreds of turns of "continue". With it, the loop closes itself — against a proof, not against the model's own opinion that it's done.
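A scoreboard of this kind can be very small. The function below is a hypothetical sketch, not the repo's script: it formats counts that the harness would supply and prints one unambiguous verdict line for a tool-less evaluator to read. The numbers in the example call are placeholders, not results from the migration.

```python
def scoreboard(dialogs_total, dialogs_green, tests_passed, tests_failed,
               dangling_routes):
    """Format harness counts as plain text that ends up in the transcript."""
    goal_met = (
        dialogs_green == dialogs_total
        and tests_failed == 0
        and dangling_routes == 0
    )
    lines = [
        "=== CONFORMANCE SCOREBOARD ===",
        f"dialogs green:   {dialogs_green}/{dialogs_total}",
        f"tests:           {tests_passed} passed, {tests_failed} failed",
        f"dangling routes: {dangling_routes}",
        f"GOAL_MET: {goal_met}",
    ]
    return "\n".join(lines)


# Placeholder counts for illustration only.
print(scoreboard(dialogs_total=3, dialogs_green=3,
                 tests_passed=12, tests_failed=0, dangling_routes=0))
```

The point of the explicit GOAL_MET line is that the evaluator never has to infer success from prose: the condition and the printed evidence use the same three quantities.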

Dynamic workflows: fan out only where judgment lives

Dynamic workflows move orchestration out of the conversation: Claude writes a JavaScript script that spawns background subagents with agent(), streams items through stages with pipeline(), and keeps intermediate results in script variables instead of chat context. Inside the /goal runs, three fan-outs did the heavy lifting:

The CODING port — a two-stage pipeline over the 110 unique JavaScript snippets: one agent ports a snippet to Python, a different agent adversarially verifies the port by executing both sides and comparing traces. One file per snippet, so parallel agents never touch the same file. 44 agents, ~1.55M tokens, ~10 minutes.
Acceptance transcripts — author→verify pipeline over the key user flows: 16 agents, ~65 seconds, 8/8 transcripts kept.
An adversarial fidelity panel — five agents, each handed one claim about the migration and told to falsify it by running code. Three claims held, one was a wording nit — and one probe found a genuinely severe bug (next section).

The pattern across all three is the same and is the part I'd reuse anywhere: the agent that did the work never gets to declare it correct.
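The real workflows are JavaScript scripts built on agent() and pipeline(). The block below is only a Python asyncio analogue of the author-then-verifier shape, with stub coroutines standing in for the two agents; port_snippet, verify_port, and the file layout are hypothetical.

```python
import asyncio


async def port_snippet(snippet_id):
    """Stand-in for the authoring agent; writes to its own file per snippet."""
    await asyncio.sleep(0)  # placeholder for real work
    return f"ports/{snippet_id}.py"


async def verify_port(snippet_id, path):
    """Stand-in for a separate verifier that never shares state with the author."""
    await asyncio.sleep(0)  # placeholder for executing both sides and comparing
    return path.endswith(f"{snippet_id}.py")


async def port_and_verify(snippet_id, limit):
    """One item flows author -> verifier without waiting for other items."""
    async with limit:
        path = await port_snippet(snippet_id)
        verdict = await verify_port(snippet_id, path)
        return snippet_id, verdict


async def run_pipeline(snippet_ids, max_parallel=8):
    limit = asyncio.Semaphore(max_parallel)  # cap concurrent items
    pairs = await asyncio.gather(
        *(port_and_verify(s, limit) for s in snippet_ids)
    )
    return dict(pairs)  # results live in script variables, not chat context


results = asyncio.run(run_pipeline(["s1", "s2", "s3"]))
```

Each snippet moves to verification as soon as its own port finishes, so there is no barrier between the stages, and the verdict is produced by a function the author stage cannot influence.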

What adversarial verification caught

This is the section that should decide whether you trust the approach, because every bug below would have shipped silently in a naive "transpile it" migration — and none was found by the agent that wrote the code.

Six JS↔Python coercion traps, caught by the port verifiers and each reproduced in Node. Four of them: an empty array is truthy in JavaScript but falsy in Python (a branch silently skipped); toLocaleDateString zero-pads day and month where the naive port didn't; Python's regex $ matches before a trailing newline while JavaScript's doesn't; and JavaScript's loose x != false evaluates to false for 0 and "", whereas the port's x is not False evaluated to True.
JSON.parse(null) semantics: JavaScript returns null; Python's json.loads(None) raises. Eight ports crashed on unset attributes, poisoning every dialog routed through them.
An engine-level entry bug: the engine's jumps between dialogs entered the target dialog at its first declared element, whereas the platform enters at the dialog's designated start element, which differs from the first declared element for roughly a third of the dialogs.
The most serious one: a high-severity button inversion. The first implementation mapped button postbacks to keywords positionally. The skeptic panel found a dialog whose buttons render [no, yes] while its rules are ordered [yes, no], so a user tapping “I don't have a monitor” would have been routed down the has-monitor branch of a health conversation. The fix dropped positional mapping and paired each postback with the keyword it accompanies in the overwhelming majority of dialogs across the whole bot (97-vs-1 and 90-vs-1 evidence), with a regression test named after the lesson.
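Several of the coercion traps above can be pinned down with small compatibility helpers. The functions below are a hedged sketch of that idea, not the project's code: the names are hypothetical, and js_loose_ne_false covers only None, booleans, numbers, and plain decimal strings (other JavaScript numeric string forms, such as hex literals, are not handled).

```python
import json
import re


def js_truthy(value):
    """JavaScript truthiness for JSON-like values: [] and {} are truthy."""
    if isinstance(value, (list, dict)):
        return True
    if isinstance(value, float) and value != value:  # NaN is falsy in JS
        return False
    return bool(value)


def js_json_parse(text):
    """JSON.parse(null) yields null in JS; json.loads(None) raises TypeError."""
    if text is None:
        return None
    return json.loads(text)


def js_loose_ne_false(value):
    """Approximate JavaScript `value != false` for common scalar inputs."""
    if value is None:
        return True            # null == false is false in JS
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        return value != 0
    if isinstance(value, str):
        if value.strip() == "":
            return False       # "" coerces to 0, which loosely equals false
        try:
            return float(value) != 0
        except ValueError:
            return True        # non-numeric strings coerce to NaN
    raise TypeError(f"unsupported type for this sketch: {type(value).__name__}")


# Python's $ also matches just before a trailing newline; \Z does not.
python_dollar = re.search(r"^\d+$", "123\n") is not None
strict_end = re.search(r"^\d+\Z", "123\n") is not None
```

Routing every ported truthiness test, JSON parse, and end-of-string regex through helpers like these gives a verifier one named place to compare against Node, instead of hunting for bare if x: checks across the ports.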

How it ended

Every dialog is behaviourally green on the four mechanical layers: structure, message text, golden traces, routing. The full suite is 912 passing tests over 725 snapshot artifacts. But only about 45% of dialogs are fully green, meaning they also clear the fifth layer, the unresolved-ID ledger. The rest branch on numeric IDs whose names the export simply doesn't contain, and they stay flagged in the ledger until the platform's name export fills them in. That two-tier result is deliberate. The honest claim is “provably identical, with the unknowns surfaced”, not a fake 100% built on guessed names.

Total model spend across the three big workflow fan-outs was roughly 2.1M tokens — on the order of a weekend's subscription usage, against a migration that had been "too expensive" for almost three years.

Lessons learned

Autonomy without a proof oracle is just fast guessing. The conformance harness was built and trusted before the first /goal run. Everything else follows from having a check the agent can't argue with.
Write /goal conditions like assertions, and make the run print its evidence. The evaluator can't call tools — if the counts aren't in the transcript, they don't exist.
Be deterministic wherever the transform is provable. Scripts scaffolded 63% of the work exactly; the model was reserved for the 110 places that needed judgment.
Verification must come from a different agent than the author. Every bug listed above was caught by a verifier or a skeptic, none by the agent that wrote the code. Budget for it: verification roughly doubled the agent count and was worth it every time.
Never guess a name into a branch. IDs stay IDs until a real source resolves them. A ledger of honest unknowns beats a clean-looking lie.
Shape the work so agents can't collide. One file per snippet, pipelines over barriers, durable state in files and registries instead of chat context.
The new features held. Days after release, /goal + workflows + Opus 4.8 carried a real production migration end-to-end. The caveats are real but minor: workflows are a research preview, runs pause on permission prompts, and you still want a human reading the assumptions log between phases.

If you're sitting on an Answers bot — or any no-code platform whose only export is a blob — the recipe is repeatable: extract a typed IR, build the conformance harness, scaffold what's provable, fan out judgment with adversarial verification, and let /goal close the loop against the checks. For this migration the missing piece was never motivation; it was tooling. In May the tooling became good enough.