Case study · Client product · Built end to end

ServiceNow Co-Pilot

An assistant for ServiceNow developers, built on one principle: an answer is only as good as its grounding and its proof. When it cannot ground an answer, it says so.

The problem: ServiceNow work is high stakes and a generic AI makes it worse. It hallucinates plausible APIs with total confidence, and a wrong answer here ships to production.
What I built: A copilot that reads a screenshot and a question, grounds its answer in retrieved docs with a citation on every claim, and for buildable requests actually builds and tests the change in a sandbox instance.
My role: Built end to end for a client: the trust model, the knowledge layer, the sandbox validator, the evaluation harness, and the release.
The difference: Three honest answer tiers. Built and tested, advice only, or an admission that it does not know. It never dresses one up as another.

178 automated tests · 3 honest answer tiers · Built and tested in a sandbox · A citation on every claim

178 automated tests, safety first

3 honest answer tiers

7 sequenced build phases

5 weighted eval dimensions

01 · The problem

Three taxes on every ServiceNow change

ServiceNow development is high stakes and slow to get right. A developer staring at a broken form, a slow Business Rule, or a vague "build me a utility" request pays three recurring taxes. A generic AI assistant makes every one of them worse.

First, the diagnosis tax: working out what the screenshot even shows (which table, which error, which form context) before any real work can start.

Second, the correctness tax. ServiceNow has strong opinions, and a generic assistant will happily hallucinate a plausible but wrong API. Here a wrong answer ships to production.

Third, the proof tax. Even a correct answer is just words until it compiles and runs on the instance, and the developer is still left to build and test it by hand.

Most AI co-pilots stop at confident prose. The hard, valuable part, grounding the answer in real docs and proving it runs, is exactly what they skip. That gap is the product.

02 · The trust model

An answer earns its place; it is not assumed

The clearest way to understand the product is its trust spectrum. The same interface renders three honestly different classes of answer, and it never pretends one is another.

An answer is graded by how much I can actually stand behind it, not by how confident the prose sounds.

A buildable solution is actually built and smoke tested in a sandbox ServiceNow instance, then exported as an Update Set that the developer imports. Nothing is labelled validated unless it actually ran. This is the hero flow.

03 · The trust model

Diagnostic, advice only

Not everything is buildable. A "why is this happening" question is classified as diagnostic, grounded in retrieved docs and cited on every claim, and labelled "advice only, not lab validated". It is real advice, never dressed up as something that was tested.

04 · The trust model

And when it cannot ground an answer, it says so

Asked about something with no supporting source, the co-pilot admits it. It returns an unverified badge, low confidence, and zero citations, offers a generic starting point, and refuses to invent a property name or a citation.

The most important screen in the product is the one where it admits it does not know. That behaviour is enforced by the architecture, not by prompt politeness: the model only ever drafts prose, and the orchestrator, not the model, attaches citations. A claim with no retrieved source cannot acquire a fake one.
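To make that rule concrete, here is a minimal sketch of an orchestrator step that resolves citations from retrieved chunks only. The type names, the `attachCitations` function, and the rule that a single uncited claim downgrades the whole answer are all assumptions for illustration, not the production contract.

```typescript
// Hypothetical shapes for illustration; not the production Answer Contract.
type Tier = "built_and_tested" | "advice_only" | "unverified";

interface RetrievedChunk {
  id: string;
  title: string;
  url: string;
}

// What the model is allowed to return: prose, plus ids of chunks it leaned on.
// There is no field in which it could write a citation of its own.
interface ModelDraft {
  claims: { text: string; chunkIds: string[] }[];
}

interface Answer {
  tier: Tier;
  claims: { text: string; citations: RetrievedChunk[] }[];
}

function attachCitations(
  draft: ModelDraft,
  retrieved: RetrievedChunk[],
  ranInSandbox: boolean,
): Answer {
  const byId = new Map<string, RetrievedChunk>();
  for (const chunk of retrieved) byId.set(chunk.id, chunk);

  const claims = draft.claims.map((claim) => {
    const citations: RetrievedChunk[] = [];
    for (const id of claim.chunkIds) {
      const chunk = byId.get(id);
      // An id that was never retrieved resolves to nothing and is dropped.
      if (chunk !== undefined) citations.push(chunk);
    }
    return { text: claim.text, citations };
  });

  // Assumption: one uncited claim is enough to downgrade the whole answer.
  const grounded =
    claims.length > 0 && claims.every((c) => c.citations.length > 0);
  const tier: Tier = !grounded
    ? "unverified"
    : ranInSandbox
      ? "built_and_tested"
      : "advice_only";
  return { tier, claims };
}
```

Because citations are looked up rather than written, a chunk id the model invents simply fails to resolve, and the answer falls to the unverified tier even if a sandbox run was reported.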

05 · Architecture

One diagram, the decisions that make trust real

The whole system is shaped to make a wrong or unsafe answer structurally hard, not just discouraged. A developer pastes a screenshot, types a question, and optionally attaches a ServiceNow XML export. The request flows through a reasoning agent that parses the image, retrieves cited chunks, reasons into a strict contract, classifies the request, and only then, if it is buildable and a sandbox is configured, builds and proves it.

The Answer Contract: the model fills a strict schema of reasoning prose only and never holds the pen on a citation; the orchestrator attaches the citations and the validation status. Embeddings run locally with bge-small-en-v1.5, so there is no per-query cost and no corpus egress to a third party.

The validator is PDI-only (it talks only to a ServiceNow Personal Developer Instance) by construction. A single chokepoint throws before any network request if the target is not the configured sandbox, so the agent can never be steered into writing to production. Buildable requests follow build, then test, then prove, and degrade honestly to advice only if anything fails.
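A chokepoint of this kind can be sketched in a few lines. This is an illustration under assumptions: `assertSandboxTarget`, `createSandboxClient`, the injected `Fetcher` type, and the instance hostname in the usage note are hypothetical names, and comparing URL origins is one reasonable way to define "is the configured sandbox".

```typescript
class NonSandboxTargetError extends Error {
  constructor(origin: string) {
    super(`Refusing to call ${origin}: not the configured sandbox`);
    this.name = "NonSandboxTargetError";
  }
}

// The single check every outbound call passes through.
// Comparing parsed origins avoids prefix tricks such as
// "https://dev00000.service-now.com.evil.example".
function assertSandboxTarget(targetUrl: string, sandboxOrigin: string): URL {
  const target = new URL(targetUrl);
  const sandbox = new URL(sandboxOrigin);
  if (target.protocol !== "https:" || target.origin !== sandbox.origin) {
    throw new NonSandboxTargetError(target.origin);
  }
  return target;
}

// Hypothetical transport signature, injected so the client never
// touches the network directly.
type Fetcher = (
  url: string,
  init?: { method?: string; body?: string },
) => Promise<unknown>;

function createSandboxClient(sandboxOrigin: string, fetcher: Fetcher) {
  return {
    request(url: string, init?: { method?: string; body?: string }) {
      // Throws synchronously, so the fetcher is never invoked for a bad target.
      const target = assertSandboxTarget(url, sandboxOrigin);
      return fetcher(target.href, init);
    },
  };
}
```

A client created with `createSandboxClient("https://dev00000.service-now.com", fetch)` would reject any request to another host before a socket is opened. The point of the design is that there is no second code path to production for a prompt to steer the agent onto.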

XML intake is XXE-safe (safe against XML External Entity attacks) because it never parses the XML as XML. It strips the DOCTYPE with a bracket-aware scanner, leaving entity expansion attacks no parser to attack.
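A bracket-aware scanner of that kind might look like the sketch below. It is an assumption about the approach, not the shipped code: `stripDoctype` and its helper are hypothetical names. The scanner tracks square brackets and quotes because a DOCTYPE with an internal subset contains `>` characters that do not end the declaration.

```typescript
// Find where a DOCTYPE declaration ends, starting at its "<".
// A ">" only closes the declaration when we are outside quotes and
// outside the [ ... ] internal subset.
function endOfDoctype(xml: string, start: number): number {
  let depth = 0;
  let quote = "";
  for (let i = start; i < xml.length; i++) {
    const ch = xml.charAt(i);
    if (quote !== "") {
      if (ch === quote) quote = "";
      continue;
    }
    if (ch === '"' || ch === "'") quote = ch;
    else if (ch === "[") depth++;
    else if (ch === "]") depth = Math.max(0, depth - 1);
    else if (ch === ">" && depth === 0) return i + 1;
  }
  // Unterminated declaration: fail closed and drop the rest.
  return xml.length;
}

function stripDoctype(xml: string): string {
  let out = "";
  let pos = 0;
  while (pos < xml.length) {
    const finder = /<!DOCTYPE/gi; // lenient about case on purpose
    finder.lastIndex = pos;
    const match = finder.exec(xml);
    if (match === null) {
      out += xml.slice(pos);
      break;
    }
    out += xml.slice(pos, match.index);
    pos = endOfDoctype(xml, match.index);
  }
  return out;
}
```

The sketch treats the input purely as text. It does not handle every corner of the XML grammar (an apostrophe inside a comment in the internal subset would confuse the quote tracking), but in those cases it removes more than necessary rather than less.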

06 · The product, end to end

Every screen, captured from the running app

Every screen is captured from the running application. Together they walk the full path: from the composer a developer types into, through the source it reads, to the three honest answer tiers, the offline state, and the phone.

The composer accepts a typed question, a pasted, dragged or uploaded screenshot with a live preview, and an optional ServiceNow XML export. It is an installable progressive web app: go offline and past conversations stay viewable while the composer disables new questions, and on the phone the full structured answer reflows cleanly.

07 · How I built it

Seven phases, quality measured not vibed

The product was delivered in seven sequenced phases, each test-driven and merged behind passing checks: the model gateway, then the knowledge layer, then the reasoning agent and XML intake, then the PDI validator, then persistence, then the web app and PWA, and finally the evaluation harness.

Quality is measured, not vibed. A golden set scores every answer on five weighted dimensions: completeness, citation or unverified, kind match, concept coverage, and groundedness. Two hard guardrails sit on top of the weighted score: anti-hallucination and groundedness. A model A/B benchmark holds retrieval constant and swaps only the model, so the cheapest one that clears the guardrails wins.
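The scoring and selection logic can be sketched as follows. The weights, the 0.8 threshold, and the names `scoreAnswer` and `pickCheapestPassing` are hypothetical. The sketch also assumes the anti-hallucination guardrail is read off the citation-or-unverified dimension, which the case study does not state.

```typescript
type Dimension =
  | "completeness"
  | "citationOrUnverified"
  | "kindMatch"
  | "conceptCoverage"
  | "groundedness";

type Scores = Record<Dimension, number>; // each score in [0, 1]

// Hypothetical weights; they sum to 1.
const WEIGHTS: Scores = {
  completeness: 0.2,
  citationOrUnverified: 0.25,
  kindMatch: 0.15,
  conceptCoverage: 0.15,
  groundedness: 0.25,
};

// Assumption: these two dimensions back the hard guardrails.
const GUARDRAILS: Dimension[] = ["citationOrUnverified", "groundedness"];
const GUARDRAIL_MIN = 0.8; // hypothetical threshold

function scoreAnswer(scores: Scores): { weighted: number; passesGuardrails: boolean } {
  let weighted = 0;
  for (const dim of Object.keys(WEIGHTS) as Dimension[]) {
    weighted += WEIGHTS[dim] * scores[dim];
  }
  // A guardrail is pass or fail: a high weighted score cannot buy it back.
  const passesGuardrails = GUARDRAILS.every((dim) => scores[dim] >= GUARDRAIL_MIN);
  return { weighted, passesGuardrails };
}

interface ModelRun {
  model: string;
  costPerAnswerUsd: number;
  answers: Scores[]; // one entry per golden-set question, same retrieval for every model
}

function pickCheapestPassing(runs: ModelRun[]): string | null {
  const passing = runs.filter(
    (run) =>
      run.answers.length > 0 &&
      run.answers.every((a) => scoreAnswer(a).passesGuardrails),
  );
  passing.sort((a, b) => a.costPerAnswerUsd - b.costPerAnswerUsd);
  return passing.length > 0 ? passing[0].model : null;
}
```

Keeping the guardrails outside the weighted sum is the design choice that matters here: a model that writes thorough but ungrounded answers is disqualified rather than merely marked down.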

A later high-effort code review pass found and fixed 10 issues, each with a regression test. The result is a codebase where the safety-critical behaviours are the most tested ones.

Screenshot: a ServiceNow list of open P1 incidents, the kind of input the model reads. From the screenshot and a question, it extracts the table, the error text, and the form context before any answer is drafted.

Tech & tools

TypeScript · Hono · React 19 · Vite · Supabase (Postgres, pgvector, Storage) · OpenRouter · bge-small-en-v1.5 local embeddings · Zod · PWA

My role

Designed the trust model: three honest answer tiers, a strict Answer Contract, and the rule that the model never holds the pen on a citation. A product and safety design problem, not a prompt.
Built the full system: the local embedding knowledge layer whose retrieved chunks become the citations, the reasoning agent, and the sandbox validator that builds, tests and proves changes, degrading honestly to advice when anything fails.
Shipped it across seven test-driven phases with a measured quality bar: a golden set scored on five weighted dimensions, a guardrailed model benchmark, and a hardening pass that fixed 10 issues with regression tests.