Applied Research · April 2026

Fourteen Hours, Two Agents, One Deployed Tool

A case study in heterogeneous dual-agent autonomous coding and validation-first methodology.


In mid-April 2026, we built and shipped a working developer utility called curlit, going from a blank repository to a production deployment at curlit.zustis.com in fourteen hours of active build effort. Those hours were spread across calendar time, with explicit human verification between increments. The tool is open-sourced at github.com/abhimanyusinghal/curlit and is now at version 1.3.

The novel element is not the tool itself, though the tool is useful. The novel element is how it was built. Humans did not type code; humans verified behaviour. Two frontier models — Claude Opus 4.6 as executor and GPT-5.4 as planner-and-reviewer — worked in a structured planner/executor/reviewer loop supervised by a single human architect who ran the software and checked it against reality after each feature landed. The output is production-running software that solves a real problem for real users.

We are publishing this case study because curlit is applied research, not a product launch. At Zustis, we hold a specific technical thesis about enterprise AI: outputs must be validated against reality, not trusted on authority. Most of the conversation about autonomous coding agents in 2025–2026 has stayed at the level of demos, benchmarks, and vendor marketing. What has been missing is disciplined, reproducible evidence of the dual-agent pattern deployed to ship working software, with the architectural reasoning made explicit. This article supplies that evidence and the reasoning behind it.

The Artifact

What curlit Is and Why We Built It

Curlit sits in a crowded but persistently under-served niche: the space between a quick curl one-liner and a full API client like Postman or Bruno. Developers live in cURL because it is ubiquitous, scriptable, and close to the wire — but raw cURL is hostile when you want to iterate on requests, compose headers cleanly, share reproducible invocations with teammates, or keep a lightweight log of what worked and what did not.

Heavier API clients solve those problems but drag in workspaces, accounts, sync, and UI weight that many engineers actively avoid. Curlit is the minimum viable bridge: a fast, shareable way to build, execute, and revisit cURL-style HTTP invocations with just enough structure to be useful and not enough to be in the way.

That is a small product idea. It is deliberately small. We chose it because a small, honest scope was the right substrate for a methodology experiment. The tool had to be genuinely useful — otherwise the exercise is vanity — but it had to be bounded enough that fourteen hours was credible. The repository is open under Zustis sponsorship, the deployment is live, and the code is inspectable. What follows is how it was produced.

The Pattern

The Dual-Agent Pattern, Concretely

A planner/executor/reviewer split across two heterogeneous frontier models, with a single human architect as arbiter.

Role 1

Planner & Reviewer

GPT-5.4

Decomposed product intent into interface contracts, module boundaries, data flow, edge cases, failure modes, and ordered task lists. After each increment, reviewed the diff against the plan, looking for missing coverage, regressions, and violated contracts. Occasionally rejected entire increments.

Role 2

Coding Executor

Claude Opus 4.6

Given a task and the current repository state, produced file edits, new modules, test scaffolding, configuration, and commit messages. Ran its own sanity checks, proposed its own tests, and exposed its reasoning so the reviewer had something substantive to critique.

Role 3

Human Architect

Abhimanyu Singhal

Three non-negotiable responsibilities: setting product intent, arbitrating disagreements between the two agents, and final acceptance. Everything in between was the loop.
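The division of labour above can be sketched as a small control loop. This is a minimal illustration, not the orchestration code used for curlit: the `run_loop` function, the `Review` type, the role signatures, and the `max_rounds` limit are all hypothetical names, and the model calls are replaced by plain Python stubs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Review:
    approved: bool
    findings: List[str]


# Hypothetical role signatures; in practice each would wrap a model or human step.
Planner = Callable[[str], List[str]]           # product intent -> ordered tasks
Executor = Callable[[str, List[str]], str]     # task + prior findings -> diff
Reviewer = Callable[[str, str], Review]        # task + diff -> review verdict
Arbiter = Callable[[str, str, Review], bool]   # human: accept this increment?


def run_loop(intent: str, plan: Planner, execute: Executor,
             review: Reviewer, accept: Arbiter, max_rounds: int = 3) -> List[str]:
    """Land one increment per task; each must pass review and human acceptance."""
    landed = []
    for task in plan(intent):
        findings: List[str] = []
        for _ in range(max_rounds):
            diff = execute(task, findings)
            verdict = review(task, diff)
            if verdict.approved and accept(task, diff, verdict):
                landed.append(diff)
                break
            # Feed the reviewer's findings (or the human rejection) back to the executor.
            findings = verdict.findings or ["rejected by human arbiter"]
        else:
            raise RuntimeError(f"not accepted after {max_rounds} rounds: {task}")
    return landed


# Stub roles, invented purely to exercise the loop.
def plan(intent: str) -> List[str]:
    return [f"{intent}: parse request", f"{intent}: send request"]


def execute(task: str, findings: List[str]) -> str:
    # The stub only adds error handling once the reviewer has asked for it.
    suffix = " +error-handling" if findings else ""
    return f"diff({task}){suffix}"


def review(task: str, diff: str) -> Review:
    if "+error-handling" in diff:
        return Review(True, [])
    return Review(False, ["error handling narrower than the task required"])


def accept(task: str, diff: str, verdict: Review) -> bool:
    return True  # stands in for the human running the software


landed = run_loop("curl builder", plan, execute, review, accept)
print(landed)
```

In this stubbed run every task takes two rounds: the first diff is rejected by the reviewer, and the second lands after the findings are fed back. The human arbiter sits behind the reviewer, so nothing reaches acceptance without an approved review.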

This split maps onto an emerging consensus in the 2025–2026 agent literature. Plan-then-Execute (Del Rosario, Krawiecka, Schroeder de Witt, Sep 2025) argues that separating strategic planning from tactical execution yields systems that are easier to audit, more resilient to control-flow hijacking, and measurably more cost-efficient than reactive patterns like ReAct. Plan-and-Act (Berkeley & Meta) hit state-of-the-art on WebArena-Lite using a dedicated planner LLM. AgentCoder showed that test generation in a separate agent that cannot see the implementation beats generating tests in the same agent — 96.3% pass@1 on HumanEval with GPT-4.

What curlit adds to this literature is heterogeneity. Most published multi-agent frameworks use the same base model in every role. We deliberately did not.

The Mapping

Why Two Different Models Matter

The common shorthand — "Claude for coding, GPT for reasoning" — is too coarse. The benchmark landscape tells a more honest story.

Claude Opus 4.6

Executor · Released Feb 5, 2026

  • 80.8% on SWE-bench Verified
  • 65.4% on Terminal-Bench 2.0
  • 76% on MRCR v2 at 1M tokens (vs 18.5% for GPT-5.2)
  • Dominates multi-file coherence, long-horizon agentic coding, ambiguous-intent recovery

GPT-5.4

Planner/Reviewer · Released Mar 5, 2026

  • 57.7% on SWE-bench Pro (top models cluster in the 40s–50s)
  • 75.0% on OSWorld-Verified (surpasses human expert)
  • Structured planning, harder-problem reasoning, tool-use breadth, computer-use
  • $2.50/$15 per MTok (vs Opus $5/$25)

The rational mapping is not "Claude does everything" or "GPT does everything." It is: give each model the job at which it is measurably better. GPT-5.4's advantage on structured planning and harder-problem reasoning makes it the stronger planner-reviewer. Opus 4.6's advantage on multi-file coherence and long-context execution makes it the stronger implementer. The pairing is complementary, not redundant.
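The list prices quoted above can be turned into a rough cost comparison. The sketch below assumes that the two figures in each pair are input and output prices in USD per million tokens (the usual convention, though not stated explicitly above). The token volumes are purely hypothetical and are not measurements from the curlit build.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_mtok: float   # USD per million input tokens (assumed meaning)
    output_per_mtok: float  # USD per million output tokens (assumed meaning)

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        """Total USD cost for the given token volumes."""
        return (input_tokens * self.input_per_mtok
                + output_tokens * self.output_per_mtok) / 1_000_000


GPT_5_4 = Pricing(input_per_mtok=2.50, output_per_mtok=15.0)
OPUS_4_6 = Pricing(input_per_mtok=5.0, output_per_mtok=25.0)

# Hypothetical review workload: reads a lot of context, writes relatively little.
review_in, review_out = 2_000_000, 200_000

gpt_cost = GPT_5_4.cost(review_in, review_out)
opus_cost = OPUS_4_6.cost(review_in, review_out)
print(f"GPT-5.4: ${gpt_cost:.2f}  Opus 4.6: ${opus_cost:.2f}")
```

Under these assumed volumes the same workload costs $8.00 at the GPT-5.4 prices and $15.00 at the Opus 4.6 prices. The ratio depends on the input/output mix, so it should be recomputed for any real workload.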

This is what the ensemble literature has been converging on: Beyond Single LLMs (Chen et al., arXiv:2510.01379) reached 91.5% pass@1 on Rust with multi-stage orchestration; EnsLLM (arXiv:2503.15838) hit 90.2% on HumanEval via cross-model voting; More Agents Is All You Need (arXiv:2402.05120) formalized the scaling law. The dual-agent pattern curlit uses is a minimal, disciplined instance of this ensemble insight.

Load-Bearing

Why a Separate Reviewer Is Not Cosmetic

The single most load-bearing part of the pattern is that the reviewer is a different model from the executor. This is not aesthetic. It is a response to a specific, measurable failure mode of single-model self-critique.

Stechly, Marquez, and Kambhampati's 2023 paper — bluntly titled GPT-4 Doesn't Know It's Wrong — demonstrated that iterative self-critique within a single model is unreliable. A model's self-evaluation inherits its own blind spots. The 2025 Finch-Zk work quantified the alternative: fine-grained cross-model consistency checks improved hallucination-detection F1 by 6–39 percentage points over single-model baselines. Cross-Refine, a dual-LLM variant of Self-Refine, showed that heterogeneous generator-critic pairs work where homogeneous self-refinement collapses. CaLM (arXiv:2406.05365) showed that even a smaller secondary model can validate a larger primary's grounded generation.

The mechanism is straightforward. When the same model both writes and reviews, systematic errors — confident misreadings of a spec, skipped edge cases, hallucinated API shapes, shallow test coverage — survive review because the reviewing pass shares the generating pass's priors. A different model has different priors. It reads the spec differently, cares about different edge cases, and often notices exactly what the first model was predisposed to miss.
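The mechanism can be illustrated with a deliberately simple set model. This is a toy, not a measurement: the defect categories are borrowed from the paragraph above, and the assignment of blind spots to two hypothetical models (A and B) is invented for the example.

```python
def surviving_defects(defects: set[str], reviewer_blind_spots: set[str]) -> set[str]:
    """Defects survive review when the reviewer is predisposed to miss them."""
    return defects & reviewer_blind_spots


# Invented blind spots for two hypothetical models.
blind_a = {"hallucinated API shape", "skipped edge case", "shallow test coverage"}
blind_b = {"misread spec", "skipped edge case"}

# Model A writes the code, so the defects it ships come from its own blind spots.
defects_from_a = set(blind_a)

self_review = surviving_defects(defects_from_a, blind_a)   # A reviews A
cross_review = surviving_defects(defects_from_a, blind_b)  # B reviews A

print(sorted(self_review))
print(sorted(cross_review))
```

In this toy, self-review lets all three defects through, while cross-review lets through only the one blind spot the two models share. The overlap still survives, which is consistent with keeping executed tests and human acceptance in the loop alongside the second model.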

Observed on curlit

GPT-5.4 caught in Opus 4.6's output:

  • Quietly over-scoped increments
  • Error handling narrower than the task required
  • Naming that would collide with later-planned modules
  • Tests only covering the happy path

Opus 4.6 caught in GPT-5.4's plans:

  • Underspecified interfaces
  • Plan assumptions about runtime properties that did not hold
  • Decomposition gaps surfaced only at implementation

Neither model, working alone, would have produced the version running at curlit.zustis.com in fourteen hours. The loop is not a tax. It is the mechanism.

The Thesis

Validation-First AI, Made Operational

Our position on enterprise AI is that outputs must be validated against reality, not trusted on authority. We call this approach validation-first and reality-dependent, and we contrast it deliberately with the dominant authority-dependent pattern that retrieves text from a trusted corpus and treats a language model's synthesis of that text as the answer.

Authority-Dependent (RAG)

Grounds outputs in retrieved sources but adds no independent check that the synthesis is actually faithful or correct in the world. When assumptions fail, the failure is silent. The model still sounds confident.

Validation-First

Treats model output as a hypothesis that must earn its status as an answer by surviving an independent check — tests that execute, a schema that parses, a second model that disagrees productively, a symbolic trace that terminates.

The 2025 hallucination survey (arXiv:2510.06265) taxonomized this as the third layer of grounding, above retrieval and reasoning-structure grounding. The November 2025 Verification-First Reasoning paper (arXiv:2511.21734) showed that even a prompting-level variant delivers meaningful accuracy gains at zero marginal cost.

The dual-agent pattern used to build curlit is a form of validation-first AI. The reviewer is the validator. Combined with executed tests and a human arbiter, it produces a chain of validation rather than a chain of authority. Each commit that lands has survived: executor self-checks, cross-model review, automated tests, and human acceptance. Each of those is a verifier. No layer is trusted on authority alone.
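The chain of validation described above can be sketched as a sequence of independent verifiers that a candidate output must pass before it counts as an answer. This is a minimal sketch under stated assumptions: the `validate` function, the `Verifier` type, and the two example checks (a JSON parse and a required-keys check on a made-up request shape) are hypothetical stand-ins, not the checks used on curlit.

```python
import json
from typing import Callable, List, Optional, Tuple

# A check returns None on success, or a human-readable reason on failure.
Verifier = Tuple[str, Callable[[str], Optional[str]]]


def parses_as_json(candidate: str) -> Optional[str]:
    try:
        json.loads(candidate)
        return None
    except json.JSONDecodeError as exc:
        return f"does not parse: {exc.msg}"


def has_required_keys(candidate: str) -> Optional[str]:
    # Hypothetical contract: a request object needs "method" and "url".
    data = json.loads(candidate)
    if not isinstance(data, dict):
        return "not a JSON object"
    missing = {"method", "url"} - set(data)
    return f"missing keys: {sorted(missing)}" if missing else None


def validate(candidate: str, chain: List[Verifier]) -> Tuple[bool, List[str]]:
    """Run verifiers in order and stop at the first failure."""
    log: List[str] = []
    for name, check in chain:
        reason = check(candidate)
        if reason is not None:
            return False, log + [f"{name} FAILED: {reason}"]
        log.append(f"{name} ok")
    return True, log


chain: List[Verifier] = [
    ("schema parses", parses_as_json),
    ("contract check", has_required_keys),
]

ok, ok_log = validate('{"method": "GET", "url": "https://example.com"}', chain)
truncated, truncated_log = validate('{"method": "GET"', chain)
partial, partial_log = validate('{"method": "GET"}', chain)
print(ok, ok_log)
print(truncated, truncated_log)
print(partial, partial_log)
```

Each verifier is independent of the generator and of the other verifiers, so a confident but wrong output is stopped at the first check it cannot survive. Cross-model review and human acceptance would slot into the same chain as further entries.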

This is the same architectural intuition we apply to enterprise AI systems at Zustis — neuro-symbolic verification, knowledge-engineering-grade schemas, and explicit verifier stages between generation and use. Curlit is the smallest credible demonstration that the intuition ships working software.

Positioning

Where This Sits in the Autonomous Coding Landscape

The broader market in April 2026 is unambiguously agentic. Anthropic reports that the majority of its own code is now written by Claude Code. Cognition's Devin 2.0, now paired with the acquired Windsurf IDE, handles 4–8 hour junior-level tickets; Litera reports 40% test-coverage gains and 93% regression-cycle reduction after deploying it. GitHub's Copilot coding agent, as of February 2026, lets you assign the same issue to Claude, Codex, and Copilot simultaneously and compare pull requests. Cursor Composer 2, Google's Jules and Antigravity IDE, OpenHands, Factory Droid, Amazon Q Developer, Atlassian Rovo Dev, and the open-weight MiniMax M2.5 and GLM-5 families have all pushed the envelope in parallel. SWE-bench Verified scores now cluster in the 80s for frontier models; Claude Opus 4.7, released four days before this article, reached 87.6%.

Curlit is not positioned against any of these. It is a deliberately small artifact demonstrating a specific architectural pattern (heterogeneous planner-executor-reviewer with a disciplined human arbiter) applied end-to-end to ship real, deployed, open software on a tight timeline. Most of the products above are coding environments or agent platforms. Curlit is the output of a coding process. The distinction matters. The industry has reasonably well-resourced answers to what tools autonomous coding agents should be. It has far fewer disciplined answers to how teams should actually use them to ship.

Zustis

What Zustis Is Pioneering

Zustis Technologies Limited, incorporated under DIFC's Innovation and AI framework in Dubai, operates at the intersection of enterprise AI architecture, neuro-symbolic systems, and knowledge engineering. The work we do for clients is not visible in a public repository. Curlit is. It is meant to be.

We are pioneering the applied implementation of autonomous coding agents inside a disciplined, validation-first methodology. Theorizing about multi-agent code generation is a crowded academic field. Building end-to-end coding agent platforms is a crowded commercial field. What is far less crowded is the middle — teams that take the best available frontier models, assemble them into a specific validated methodology, ship working software, and document the methodology with enough precision that others can evaluate and adopt it.

We treat open source as a research instrument. Curlit is open because methodology claims about software have to be demonstrable in software. A repository that a reader can clone, read, and run is a stronger evidentiary basis than a conference slide or a benchmark screenshot.

Generalizations

Lessons for Enterprise AI

Three findings from the curlit build that generalize beyond the tool.

01

Heterogeneity is not optional

Single-model loops — even with the strongest frontier model — are measurably weaker than cross-model loops on any non-trivial task distribution. Enterprises locking to a single vendor are optimizing for procurement convenience at the cost of output quality.

02

The human role narrows — and sharpens

Human time went almost entirely to three things: product intent, disagreement arbitration, and final acceptance. This is the inverse of the anxious "AI as autocomplete" frame. In a validation-first dual-agent loop, humans set direction and adjudicate; they do not type.

03

Validation-first generalizes

The same pattern (generation by one system, independent validation by a different system, explicit acceptance criteria, and human arbitration only at decision points) applies to knowledge workflows, contract analysis, and regulatory compliance. In regulated markets including DIFC, it is the only defensible architecture.


The Conclusion

The Methodology Is the Claim

Curlit is a fourteen-hour build. It is also a position statement. The position is that the autonomous coding agent conversation has matured past demos and leaderboards and is now ready for disciplined methodology work — and that the methodology that generalizes is heterogeneous, planner-executor-reviewer, validation-first, and measurably better than the alternatives it replaces.

The tool is the smallest, clearest proof point we could publish. The methodology is the larger claim. We intend to keep shipping both.