Case study · Client product · Built end to end
ServiceNow Co-Pilot
A developer assistant built around one principle: an answer is only as good as its grounding and its proof. When it cannot ground an answer, it says so.
Three taxes on every ServiceNow change
ServiceNow development is high stakes and slow to get right. A developer staring at a broken form, a slow Business Rule, or a vague "build me a utility" request pays three recurring taxes, and a generic AI assistant makes every one of them worse.
The first is the diagnosis tax: working out what the screenshot is even showing, which table, which error, which form context, before any real work can start.
The second is the correctness tax. ServiceNow has strong opinions, and a generic assistant will happily hallucinate a plausible but wrong API. Here a wrong answer ships to production.
The third is the proof tax. Even a correct answer is just words until it compiles and runs on the instance, and the developer is still left to build and test it by hand.
Most AI co-pilots stop at confident prose. The hard and valuable part, grounding the answer in real documentation and proving it runs, is exactly what they skip. That gap is the product.
An answer earns its place, it is not assumed
The clearest way to understand the product is its trust spectrum. The same interface renders three honestly different classes of answer, and it never pretends one is another.
An answer is graded by how much I can actually stand behind it, not by how confident the prose sounds.
A buildable solution is actually built and smoke tested in a sandbox ServiceNow instance, then exported as an Update Set the developer imports. Nothing is claimed validated unless it ran. This is the hero flow.
Diagnostic, advice only
Not everything is buildable. A "why is this happening" question is classified as diagnostic, grounded in retrieved docs and cited on every claim, and labelled "advice only, not lab validated". It is real advice, never dressed up as something that was tested.
And when it cannot ground an answer, it says so
Asked about something with no supporting source, the co-pilot admits it. It returns an unverified badge, low confidence, and zero citations, offers a generic starting point, and refuses to invent a property name or a citation.
The most important screen in the product is the one where it admits it does not know. That behaviour is enforced by the architecture, not by prompt politeness: the model only ever drafts prose, and the orchestrator, not the model, attaches citations. A claim with no retrieved source cannot acquire a fake one.
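The split between drafting and citing can be sketched in a few lines. This is a minimal illustration, not the product's code: the `Chunk` and `Answer` shapes, the `attach_citations` name, and the "grounded" and "normal" labels are all assumptions made for the example. The point it shows is that the only path to a citation runs through the retrieved chunks.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Chunk:
    """A retrieved documentation chunk (hypothetical shape)."""
    source: str  # identifier of the document the chunk came from
    text: str


@dataclass
class Answer:
    prose: str  # drafted by the model
    citations: List[str] = field(default_factory=list)  # set by the orchestrator only
    badge: str = "unverified"  # default until sources exist
    confidence: str = "low"


def attach_citations(draft_prose: str, retrieved: List[Chunk]) -> Answer:
    """Orchestrator step: citations are derived from retrieval, never from the draft."""
    answer = Answer(prose=draft_prose)
    if retrieved:
        # dict.fromkeys de-duplicates while keeping retrieval order.
        answer.citations = list(dict.fromkeys(c.source for c in retrieved))
        answer.badge = "grounded"      # hypothetical label
        answer.confidence = "normal"   # hypothetical label
    return answer
```

With an empty retrieval result, the answer keeps its unverified badge and an empty citation list, even if the drafted prose mentions a source by name.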
One diagram, the decisions that make trust real
The whole system is shaped to make a wrong or unsafe answer structurally hard, not just discouraged. A developer pastes a screenshot, types a question, and optionally attaches a ServiceNow XML export. The request flows through a reasoning agent that parses the image, retrieves cited chunks, reasons into a strict contract, classifies the request, and only then, if it is buildable and a sandbox is configured, builds and proves it.
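The routing decision at the end of that flow could look roughly like the sketch below. The `Path` enum, the `choose_path` name, the `"buildable"` kind string, and the rule that a request with no sources is always unverified are assumptions for illustration; the source only states that a build happens when the request is buildable and a sandbox is configured.

```python
from enum import Enum


class Path(Enum):
    BUILD_AND_PROVE = "build_and_prove"
    ADVICE_ONLY = "advice_only"
    UNVERIFIED = "unverified"


def choose_path(kind: str, source_count: int, sandbox_configured: bool) -> Path:
    """Pick the path for a classified request (hypothetical rules)."""
    if source_count == 0:
        # Assumption: nothing retrieved means nothing can be grounded.
        return Path.UNVERIFIED
    if kind == "buildable" and sandbox_configured:
        return Path.BUILD_AND_PROVE
    # Diagnostic questions, or buildable ones with no sandbox, stay advice only.
    return Path.ADVICE_ONLY
```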
The Answer Contract: the model fills a strict schema of reasoning prose only and never holds the pen on a citation; the orchestrator attaches citations and the validation status. Embeddings run locally with bge-small-en-v1.5, so there is no per-query cost and no corpus egress to a third party.
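One way to make such a contract strict is to reject any model output that carries a field the schema does not name. The sketch below assumes the model returns JSON and that the schema has two prose fields, `summary` and `reasoning`; both field names and the `parse_model_draft` helper are hypothetical.

```python
import json
from typing import Dict

# Hypothetical prose-only schema: no field exists for citations or status.
ALLOWED_FIELDS = {"summary", "reasoning"}


class ContractViolation(ValueError):
    """Raised when a model draft does not match the schema exactly."""


def parse_model_draft(raw: str) -> Dict[str, str]:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ContractViolation("draft is not valid JSON") from exc
    if not isinstance(data, dict):
        raise ContractViolation("draft must be a JSON object")
    extra = set(data) - ALLOWED_FIELDS
    if extra:
        # A draft that tries to supply "citations" fails here.
        raise ContractViolation("unexpected fields: %s" % sorted(extra))
    missing = ALLOWED_FIELDS - set(data)
    if missing:
        raise ContractViolation("missing fields: %s" % sorted(missing))
    for key, value in data.items():
        if not isinstance(value, str):
            raise ContractViolation("field %r must be a string" % key)
    return data
```

Rejecting unknown fields outright, rather than silently dropping them, makes an attempted citation visible as a failure.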
The validator is PDI-only by construction: a single chokepoint throws before any network request if the target is not the configured sandbox, so the agent can never be steered into writing to production. Buildable requests follow build then test then prove, and degrade honestly to advice only if anything fails.
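A chokepoint of this kind, and the honest degradation around it, might be sketched as follows. The function names, the HTTPS requirement, the result dictionary, and the `pdi.example.test` host used in the usage note are assumptions for illustration, not the product's actual API.

```python
from typing import Callable, Dict, List, Tuple
from urllib.parse import urlsplit


class NotSandboxError(RuntimeError):
    """Raised before any network I/O when the target is not the sandbox."""


def assert_sandbox_target(url: str, sandbox_host: str) -> None:
    parts = urlsplit(url)
    # hostname ignores userinfo, so "sandbox@prod-host" resolves to prod-host.
    host = (parts.hostname or "").lower()
    if parts.scheme != "https" or host != sandbox_host.lower():
        raise NotSandboxError("refusing non-sandbox target: %r" % url)


def guarded_send(url: str, sandbox_host: str, send: Callable[[str], object]) -> object:
    """Single chokepoint: the guard runs before the transport is touched."""
    assert_sandbox_target(url, sandbox_host)
    return send(url)


def build_test_prove(steps: List[Tuple[str, Callable[[], None]]]) -> Dict[str, str]:
    """Run build, test, prove in order; any failure degrades to advice only."""
    for name, step in steps:
        try:
            step()
        except Exception as exc:
            return {"status": "advice_only", "failed_step": name, "reason": str(exc)}
    return {"status": "validated"}
```

With `sandbox_host="pdi.example.test"`, a URL such as `https://pdi.example.test@prod.example.test/api` is rejected, because the guard compares the parsed hostname and not the raw string.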
And XML intake is XXE safe because it never parses the XML as XML. It strips the DOCTYPE with a bracket aware scanner, leaving entity expansion attacks no parser to attack.
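A bracket-aware DOCTYPE scanner can be quite small. The version below is a sketch under assumptions: the `strip_doctype` name, the quote tracking, and the choice to raise on an unterminated declaration are illustrative, not taken from the product. It shows why bracket awareness matters: an internal subset contains `>` characters of its own, so stopping at the first `>` would leave entity declarations behind.

```python
import re

_DOCTYPE_START = re.compile(r"<!DOCTYPE", re.IGNORECASE)


def strip_doctype(xml_text: str) -> str:
    """Remove every DOCTYPE declaration using plain text scanning only."""
    out = []
    pos = 0
    while True:
        match = _DOCTYPE_START.search(xml_text, pos)
        if match is None:
            out.append(xml_text[pos:])
            return "".join(out)
        out.append(xml_text[pos:match.start()])
        i = match.end()
        depth = 0   # nesting level of the [ ... ] internal subset
        quote = ""  # active quote character, if any
        while i < len(xml_text):
            ch = xml_text[i]
            if quote:
                if ch == quote:
                    quote = ""
            elif ch in "\"'":
                quote = ch
            elif ch == "[":
                depth += 1
            elif ch == "]" and depth > 0:
                depth -= 1
            elif ch == ">" and depth == 0:
                break  # the real end of the declaration
            i += 1
        else:
            # Fail closed (assumption): never pass a half-open DOCTYPE along.
            raise ValueError("unterminated DOCTYPE declaration")
        pos = i + 1
```

An entity reference such as `&x;` may survive in the body, but with its declaration gone and no XML parser in the path, it is only inert text.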
Every screen, captured from the running app
Every screen is captured from the running application. Together they walk the full path: from the composer a developer types into, through the source it reads, to the three honest answer tiers, the offline state, and the phone.
The composer accepts a typed question, a pasted, dragged or uploaded screenshot with a live preview, and an optional ServiceNow XML export. It is an installable progressive web app: go offline and past conversations stay viewable while the composer disables new questions, and on the phone the full structured answer reflows cleanly.
Seven phases, quality measured not vibed
The product was delivered in seven sequenced phases, each test driven and merged behind passing checks: model gateway, then the knowledge layer, then the reasoning agent and XML intake, then the PDI validator, then persistence, then the web app and PWA, and finally the evaluation harness.
Quality is measured, not vibed. A golden set scores every answer on five weighted dimensions: completeness, citation or unverified, kind match, concept coverage, and groundedness. Two of them are hard guardrails: anti hallucination and groundedness. A model A/B benchmark holds retrieval constant and swaps only the model, so the cheapest one that clears the guardrails wins.
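The scoring and selection logic can be sketched as below. Everything numeric here is an assumption: the weights, the guardrail floor, and the per-model costs are invented for illustration. The sketch also assumes that the "citation or unverified" dimension is the anti hallucination guardrail, which the text does not state explicitly.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical weights for the five dimensions; they sum to 1.0.
WEIGHTS = {
    "completeness": 0.20,
    "citation_or_unverified": 0.25,
    "kind_match": 0.15,
    "concept_coverage": 0.15,
    "groundedness": 0.25,
}
# Assumption: these two dimensions act as the hard guardrails.
GUARDRAILS = ("citation_or_unverified", "groundedness")
GUARDRAIL_FLOOR = 1.0  # hypothetical: a guardrail dimension must score perfectly


def score_answer(dims: Dict[str, float]) -> Tuple[float, bool]:
    """Return (weighted score, guardrails passed) for one golden-set answer."""
    total = sum(weight * dims[name] for name, weight in WEIGHTS.items())
    passed = all(dims[name] >= GUARDRAIL_FLOOR for name in GUARDRAILS)
    return total, passed


def pick_cheapest_passing(
    results: Dict[str, Tuple[float, List[Dict[str, float]]]]
) -> Optional[str]:
    """results maps model name to (cost, per-answer dimension scores)."""
    eligible = [
        (cost, name)
        for name, (cost, answers) in results.items()
        if all(score_answer(dims)[1] for dims in answers)
    ]
    return min(eligible)[1] if eligible else None
```

A guardrail failure disqualifies a model regardless of its weighted total, which is what separates a hard guardrail from a heavily weighted dimension.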
A later high effort code review pass found and fixed 10 issues, each with a regression test. The result is a codebase where the safety critical behaviours are the most tested ones.
Tech & tools
My role
- Designed the trust model: the three honest answer tiers and the rule that the model never holds the pen on a citation. This was a product and safety design problem, not a prompt.
- Designed the knowledge layer: a local embedding pipeline and pgvector retrieval whose chunks become the citations the orchestrator attaches.
- Designed the reasoning agent and the strict Answer Contract that turns model output into a structured, validatable answer.
- Designed the PDI validator, the PDI-only safety guard, and the build then test then prove flow that degrades honestly to advice only.
- Designed persistence and the web PWA: schema decoupled answer storage, a private screenshot bucket, and an offline aware shell.
- Designed the evaluation harness: the golden set, the five weighted dimensions, and the guardrailed model A/B benchmark.
- Shipped it across seven phases and led the review and hardening pass that found and fixed 10 issues with regression tests.






