Dragutin Oreški AI Engineer · Zagreb
May 31, 2026 Case study 8 min read

Migrating 700+ chatbot dialogs off Infobip Answers with Claude Code

Days after Opus 4.8 shipped, I pointed /goal runs and dynamic agent workflows at a 3.1 MB chatbot platform export. One weekend later: owned Python, 912 passing tests, and zero dangling routes — with every fidelity claim adversarially verified.

700+ dialogs migrated · ~4,000 nodes ported · 912 tests green · 0 dangling routes · ~48h wall-clock time

When I joined the team, a WhatsApp health chatbot had already been running on Infobip Answers, a no-code visual flow builder, for almost three years. It worked. But the only artifact the platform gives you is a 3.1 MB JSON export: no version control that means anything, no search, no tests, and business logic trapped in JavaScript snippets capped at 5,000 characters each. I ran into all of this within my first weeks. Migrating to owned code had been discussed before and always dismissed as too expensive — the kind of problem that sits and waits.

Then, in the span of three weeks, Claude Code shipped two features that changed the calculation: /goal (v2.1.139, May 12), an autonomous loop where a separate evaluator re-checks a finish condition after every turn and keeps the agent working until it holds, and dynamic workflows (May 28), script-orchestrated fan-out to dozens of background subagents. Opus 4.8 had just been released on top of that. I wanted to know whether these could carry a real production job or just look good in demos. So the waiting problem became the test case: the weekend of May 29–31, I pointed all three at the export.

This post is for two audiences: people who want to get off Infobip Answers (the anatomy and the recipe transfer directly), and people who want to understand what /goal and dynamic workflows can handle in practice. The numbers in the strip above are lightly rounded for publication but real — each reproducible from the repo, and each checked by a separate agent that tried to falsify it.

What you're actually migrating

An Answers export is a JSON array of dialogs. Each dialog has elements of about ten types: GO_TO_DIALOG (the routing graph), SEND_MESSAGE, EVALUATE_ATTRIBUTE (branching), API_CALL, CODING (embedded JavaScript — where the real logic hides), plus delays, user-input matchers, and session control. This bot: over 700 dialogs, almost 4,000 elements, more than 1,500 inter-dialog routes. The 329 CODING snippets deduplicated to 110 unique ones (one appears 112 times), and only 44 of those were trivial.
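A first census of an export like this can be sketched in a few lines. The key names below ("elements", "type", "code") are assumptions for illustration, not the documented Answers schema, and the whitespace normalisation is one possible deduplication rule rather than the one used in this migration.

```python
import hashlib
from collections import Counter

# Hypothetical export shape: a list of dialogs, each holding an "elements" list.
# Every element has a "type"; CODING elements also carry a "code" string.

def census(dialogs):
    """Count elements per type across all dialogs."""
    return Counter(el["type"] for d in dialogs for el in d.get("elements", []))

def unique_snippets(dialogs):
    """Count CODING snippets, keyed by a hash of their whitespace-normalised source."""
    counts = Counter()
    for d in dialogs:
        for el in d.get("elements", []):
            if el["type"] == "CODING":
                normalised = " ".join(el.get("code", "").split())
                digest = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
                counts[digest] += 1
    return counts

# Tiny made-up sample; real data would come from json.load() on the export file.
sample = [
    {"id": 1, "elements": [
        {"type": "SEND_MESSAGE"},
        {"type": "CODING", "code": "var a = 1;"},
        {"type": "GO_TO_DIALOG", "target": 2},
    ]},
    {"id": 2, "elements": [
        {"type": "CODING", "code": "var a  =  1;"},
        {"type": "CODING", "code": "var b = 2;"},
    ]},
]
```

Collapsing whitespace also collapses it inside string literals, so a real deduplication pass would want a stricter rule before treating two snippets as the same.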

Two properties of the export shape the whole migration, and you should know them before you start. First, attributes and keywords are numeric IDs with no names: the export never labels the roughly 200 attribute IDs and 30 keyword IDs your branches depend on. About 70% of the attribute names can be reverse-engineered from the export's PEOPLE_PERSON mappings; keyword names don't appear anywhere, so request a name export from Infobip up front. Second, order is semantics: branch rules are first-match-wins, and almost half the elements jump to non-adjacent targets, so you cannot rely on fall-through.
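The "order is semantics" point is easy to show with a minimal sketch of first-match-wins routing. The attribute ID `attr_1041` and the dialog names are made up for illustration; only the evaluation order mirrors what the text describes.

```python
from typing import Callable, Optional

# A rule is (predicate over session attributes, target dialog name).
Rule = tuple[Callable[[dict], bool], str]

def route(rules: list[Rule], attrs: dict, default: Optional[str] = None) -> Optional[str]:
    """Return the target of the first rule whose predicate holds."""
    for predicate, target in rules:
        if predicate(attrs):
            return target
    return default

# Hypothetical branch on an unnamed numeric attribute.
rules = [
    (lambda a: a.get("attr_1041", 0) >= 18, "adult_flow"),
    (lambda a: a.get("attr_1041", 0) >= 0, "minor_flow"),
]

first = route(rules, {"attr_1041": 30})
swapped = route(list(reversed(rules)), {"attr_1041": 30})
```

With the rules in export order the session lands in `adult_flow`; with the same two rules reversed it lands in `minor_flow`. A port that sorts, merges, or reorders branch rules therefore changes behaviour even if every individual rule is translated correctly.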

The target was intentionally plain: a small deterministic state-machine engine (~2,350 hand-written lines of Python) where each dialog is a declarative function, the runtime suspends on user input, and sessions persist as JSON. No LLM anywhere in the runtime — the bot's behaviour stays exactly as deterministic as it was on the platform.
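As a rough illustration of that design (not the actual engine), the sketch below declares a dialog as a function returning ordered steps, suspends when it reaches an input step, and keeps the whole session in a JSON string. The step vocabulary ("send", "ask") and the session fields are assumptions.

```python
import json

def greeting():
    """A dialog declared as data: an ordered list of (op, argument) steps."""
    return [
        ("send", "Hi! What is your name?"),
        ("ask", "name"),              # suspend until the user replies
        ("send", "Thanks, {name}."),
    ]

DIALOGS = {"greeting": greeting}

def new_session(dialog):
    """Create a fresh session, serialised as JSON."""
    return json.dumps({"dialog": dialog, "pc": 0, "attrs": {}, "waiting_for": None})

def run(session_json, user_input=None):
    """Advance a session until it needs input or ends; return (outbox, session_json)."""
    s = json.loads(session_json)
    waiting = s.get("waiting_for")
    if waiting is not None:
        if user_input is None:
            return [], session_json   # still suspended, nothing to do
        s["attrs"][waiting] = user_input
        s["waiting_for"] = None
    steps = DIALOGS[s["dialog"]]()
    outbox = []
    while s["pc"] < len(steps):
        op, arg = steps[s["pc"]]
        s["pc"] += 1
        if op == "send":
            outbox.append(arg.format(**s["attrs"]))
        elif op == "ask":
            s["waiting_for"] = arg
            break
    return outbox, json.dumps(s)
```

Because the session is plain JSON with a step counter rather than a live Python object, it can be stored between messages and resumed by any process, and the same inputs always produce the same outputs.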

The setup: build the checks before the autonomy

The single most important decision came before any autonomous run: never translate JSON→Python directly, and never let a model guess. Concretely, that meant three pieces of groundwork, built and tested by hand (with Claude, but supervised):

With the harness in place, the work split cleanly: roughly 63% of dialogs contain no complex logic and were scaffolded deterministically from the IR — exact by construction, since a script can't mistranslate. Model judgment was spent only where judgment was required: the 110 unique JavaScript snippets.
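"Exact by construction" can be illustrated with a template-based emitter: typed IR nodes go in, Python source comes out, and anything the template does not cover is refused rather than guessed. The `Send` and `Goto` node types and the emitted step format are hypothetical stand-ins for whatever the real IR contains.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Send:
    text: str

@dataclass(frozen=True)
class Goto:
    target: str

def scaffold(name, steps):
    """Emit Python source for a dialog purely by template; no model involved.

    `name` is assumed to be a valid Python identifier.
    """
    lines = [f"def {name}():", "    return ["]
    for step in steps:
        if isinstance(step, Send):
            lines.append(f"        ('send', {step.text!r}),")
        elif isinstance(step, Goto):
            lines.append(f"        ('goto', {step.target!r}),")
        else:
            # Anything outside the template goes to the judgment path instead.
            raise TypeError(f"not scaffoldable: {step!r}")
    lines.append("    ]")
    return "\n".join(lines) + "\n"

source = scaffold("welcome", [Send("Hello"), Goto("menu")])
```

The refusal branch is the important design choice: a dialog either fits the template completely or is routed to the path where a model ports it and a separate check verifies it.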

[Figure: migration pipeline. JSON export (3.1 MB, 700+ dialogs, ~4,000 elements) → Typed IR (lossless, verified) → Deterministic scaffold (~63% of dialogs, exact by construction) alongside a pipeline() fan-out (110 snippets, 44 agents, ~10 min) of Port agent (JS → Python) and Skeptic agent (tries to falsify) → Conformance harness (5 layers, the proof) → all dialogs green, unknowns ledgered. The /goal loop wraps the whole flow: the evaluator re-checks the finish condition after every turn.]
The shape of the migration — deterministic where provable, agents where judgment lives, everything proven by the harness

/goal: autonomy with a measurable finish line

/goal is mechanically simple: it wraps a session-scoped Stop hook. After every turn, a small fast model (Haiku by default) evaluates your condition against what Claude has surfaced; if it fails, Claude starts another turn with the evaluator's reasoning as guidance, instead of returning control to you. The detail that matters: the evaluator cannot call tools. It judges only what's already in the transcript.
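A conceptual model of that loop, in plain Python, looks roughly like this. It is not Claude Code's implementation or API; `goal_loop`, `make_worker`, and `transcript_evaluator` are illustrative names, and the worker is a toy that reports its count and then fixes one route per turn.

```python
def goal_loop(take_turn, evaluate, condition, max_turns=50):
    """Keep taking turns until the evaluator says the condition holds."""
    transcript = []
    guidance = None
    for turn in range(1, max_turns + 1):
        transcript.append(take_turn(guidance))
        # The evaluator is handed the transcript and nothing else.
        met, guidance = evaluate(condition, transcript)
        if met:
            return turn, transcript
    raise RuntimeError(f"condition not met after {max_turns} turns")

def make_worker(dangling):
    """Toy worker: prints the current count, then fixes one dangling route."""
    remaining = [dangling]
    def take_turn(guidance):
        line = f"dangling routes: {remaining[0]}"
        remaining[0] = max(0, remaining[0] - 1)
        return line
    return take_turn

def transcript_evaluator(condition, transcript):
    """Toy evaluator: passes only if the latest turn printed the required line."""
    met = condition in transcript[-1]
    guidance = None if met else f"transcript does not yet show {condition!r}"
    return met, guidance
```

In this model the evaluator has no way to go and count routes itself; if the worker never prints the count, the condition can never be judged as met.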

That one constraint dictates how you write conditions. A vague goal ("migrate the bot") gives the evaluator nothing to check. A measurable one does:

“Every dialog migrated and behaviourally green on the conformance harness; full test suite passing; zero dangling routes — with the counts printed in the transcript.”

The final clause is the important part: the run must surface its own evidence. Every wave of work ended by re-running the harness and printing the scoreboard, so the tool-less evaluator had real numbers to judge. The migration took three /goal runs: Phase 1 owned the logic (every dialog, ~24 hours wall-clock), Phase 2 owned the I/O (production messaging adapters, mock-first, with the simulator path byte-identical), and a short Phase 3 produced the handoff docs. Without /goal, someone babysits hundreds of turns of "continue". With it, the loop closes itself — against a proof, not against the model's own opinion that it's done.
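A scoreboard of this kind can be very small. The sketch below assumes routes are available as (source, target) pairs and that the harness reports which dialogs are green; the function names and the output format are illustrative, not the ones from the repo.

```python
def dangling_routes(routes, known_dialogs):
    """Return (source, target) pairs whose target dialog does not exist."""
    return sorted((src, dst) for src, dst in routes if dst not in known_dialogs)

def scoreboard(green, known_dialogs, routes, tests_passed, tests_failed):
    """Build the text a run would print, plus whether the finish condition holds."""
    dangling = dangling_routes(routes, known_dialogs)
    not_green = sorted(set(known_dialogs) - set(green))
    lines = [
        f"dialogs green: {len(known_dialogs) - len(not_green)}/{len(known_dialogs)}",
        f"tests: {tests_passed} passed, {tests_failed} failed",
        f"dangling routes: {len(dangling)}",
    ]
    done = not not_green and tests_failed == 0 and not dangling
    return "\n".join(lines), done

known = {"welcome", "menu"}
text, done = scoreboard(
    green={"welcome", "menu"},
    known_dialogs=known,
    routes=[("welcome", "menu"), ("menu", "faq")],  # "faq" does not exist
    tests_passed=10,
    tests_failed=0,
)
print(text)
```

Printing the three counts at the end of every wave puts the evidence in the transcript, which is the only place a tool-less evaluator can look.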

Dynamic workflows: fan out only where judgment lives

Dynamic workflows move orchestration out of the conversation: Claude writes a JavaScript script that spawns background subagents with agent(), streams items through stages with pipeline(), and keeps intermediate results in script variables instead of chat context. Inside the /goal runs, three fan-outs did the heavy lifting:

The pattern across all three is the same and is the part I'd reuse anywhere: the agent that did the work never gets to declare it correct.
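The shape of that rule can be sketched in plain Python with ordinary callables. This is not the agent() or pipeline() API from dynamic workflows; `verify_all`, the golden table, and the two toy "ports" are hypothetical, and the second port is deliberately wrong so the checker has something to catch.

```python
from concurrent.futures import ThreadPoolExecutor

def verify_all(items, port, skeptic, max_workers=8):
    """Port each item, then let an independent checker try to break the result.

    Acceptance depends only on the skeptic's findings, never on the porter.
    """
    def one(item):
        candidate = port(item)
        objections = skeptic(item, candidate)  # empty list means it survived
        return item, candidate, objections

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(one, items))
    accepted = {item: cand for item, cand, obj in results if not obj}
    rejected = {item: obj for item, cand, obj in results if obj}
    return accepted, rejected

# Recorded (input, expected output) pairs per snippet, made up for illustration.
golden = {"double": [(2, 4), (3, 6)], "clamp": [(5, 5), (150, 100)]}

# Stand-ins for ported snippets; "clamp" forgets the upper bound on purpose.
ported = {"double": lambda x: x * 2, "clamp": lambda x: x}

def skeptic(name, fn):
    """Return every (input, expected, actual) triple where the port disagrees."""
    return [(x, want, fn(x)) for x, want in golden[name] if fn(x) != want]

accepted, rejected = verify_all(list(golden), ported.__getitem__, skeptic)
```

Here `double` is accepted and `clamp` is rejected with the counterexample (150, 100, 150), which is the kind of concrete evidence a second agent can be asked to produce.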

What adversarial verification caught

This is the section that should decide whether you trust the approach, because every bug below would have shipped silently in a naive "transpile it" migration — and none was found by the agent that wrote the code.

How it ended

Every dialog is behaviourally green on the four mechanical layers: structure, message text, golden traces, routing. The full suite is 912 passing tests over 725 snapshot artifacts. But only about 45% of dialogs are fully green; the rest branch on numeric IDs whose names the export simply doesn't contain, and they stay flagged in the ledger until the platform's name export fills them in. That two-tier result is deliberate. The honest claim is “provably identical, with the unknowns surfaced”, not a fake 100% built on guessed names.
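The two-tier classification can be sketched as a small function. The layer names come from the text above; the tier labels, the ID format, and the shape of the inputs are assumptions for illustration.

```python
MECHANICAL_LAYERS = ("structure", "message_text", "golden_traces", "routing")

def tier(layer_results, referenced_ids, known_names):
    """Classify one dialog and list the IDs that still have no name.

    layer_results: layer name -> bool from the conformance harness
    referenced_ids: numeric IDs the dialog branches on
    known_names: ID -> name, as far as names have been recovered
    """
    unresolved = sorted(i for i in referenced_ids if i not in known_names)
    if not all(layer_results.get(layer, False) for layer in MECHANICAL_LAYERS):
        return "red", unresolved
    if unresolved:
        return "green-with-unknowns", unresolved
    return "fully-green", unresolved

all_pass = {layer: True for layer in MECHANICAL_LAYERS}
names = {1041: "age"}  # hypothetical recovered name

status_a = tier(all_pass, [1041], names)
status_b = tier(all_pass, [1041, 2077], names)
```

A dialog that passes every mechanical layer but branches on an unnamed ID is reported as green with unknowns, and the unresolved IDs are exactly what gets written to the ledger.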

Total model spend across the three big workflow fan-outs was roughly 2.1M tokens — on the order of a weekend's subscription usage, against a migration that had been "too expensive" for almost three years.

Lessons learned

If you're sitting on an Answers bot — or any no-code platform whose only export is a blob — the recipe is repeatable: extract a typed IR, build the conformance harness, scaffold what's provable, fan out judgment with adversarial verification, and let /goal close the loop against the checks. For this migration the missing piece was never motivation; it was tooling. In May the tooling became good enough.