
Senior Director - Engineering
Over the last year, we’ve shipped LLM-powered analytics and decision-support products into production for regulated enterprises: finance, risk, operations, and customer analytics.
2024 was around the time we went from “chat with your PDF” toys to production-ready LLM-backed solutions that sit on top of real data models, with real SLAs, and real executives asking, “Can I use this in next month’s review?”
Through 2025, after enough releases, incidents, and “why did it say that?” moments, we’ve seen the same patterns recur. This is a distilled set of lessons from that journey.
Most LLM projects start the same way: pick a model, pick a vector DB, build a chat UI, and then ask, “So… what should this actually do?”
The useful ones start from a different place, with questions like:
Once those are written down, design stops being abstract, and you can:
See which datasets matter and which are distractions
Specify what a “good” answer looks like (table versus narrative, level of aggregation, and caveats to call out)
Define success metrics in concrete terms: “If we automate these 15 questions reliably, this is worth keeping”
We also learnt to ship narrow slices first: one persona, one workflow, and one coherent data slice, end-to-end. When that slice behaves like a boring internal system (no drama at month-end), then we earn the right to add the next persona or workflow.
If you can’t ignore the system for a week without anxiety, it’s not ready for expansion.
LLMs are very good at language, but they are not magic against messy data. Every time we tried to shortcut the data work, the system paid us back with creative but wrong answers. Line items moved, new hierarchies appeared, slightly different charts of accounts arrived, and the model happily pretended nothing had changed.
The fixes were predictable but non-negotiable:
Raw lands in an immutable store; analytics live in a curated model with its own versions, owners, and release notes.
Metric definitions, grain, tolerances, and refresh cycles
The model is given the actual schema snapshot for the current tenant/use case; if that contract breaks, the request fails fast instead of hallucinating.
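A minimal sketch of the fail-fast contract check, assuming the contract and the current schema snapshot are both available as plain mappings from table name to columns. The names `Column`, `SchemaContractError`, and `check_contract` are hypothetical and chosen only for illustration; a real system would read the snapshot from its own catalog.

```python
from dataclasses import dataclass


class SchemaContractError(Exception):
    """Raised when the live schema no longer matches the agreed contract."""


@dataclass(frozen=True)
class Column:
    name: str
    dtype: str


def check_contract(contract: dict[str, list[Column]],
                   snapshot: dict[str, list[Column]]) -> None:
    """Compare the expected contract with the current schema snapshot.

    Raises SchemaContractError listing every missing table, missing column,
    or type change, so the request stops before any prompt is built.
    """
    problems: list[str] = []
    for table, expected_cols in contract.items():
        if table not in snapshot:
            problems.append(f"missing table: {table}")
            continue
        actual = {c.name: c.dtype for c in snapshot[table]}
        for col in expected_cols:
            if col.name not in actual:
                problems.append(f"missing column: {table}.{col.name}")
            elif actual[col.name] != col.dtype:
                problems.append(
                    f"type changed: {table}.{col.name} "
                    f"{col.dtype} -> {actual[col.name]}"
                )
    if problems:
        raise SchemaContractError("; ".join(problems))
```

Calling `check_contract` at the start of each request means a renamed column surfaces as an explicit error rather than as a plausible-looking wrong answer.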
That’s the structural part. The second half is domain understanding.
For a demo, you can type “Act as a [domain] expert” in a prompt, pour some definitions into a vector store, and call it done. In production, that falls over quickly. Modern LLMs already know the textbook view of your domain; what they don’t know is your view and your users’ view:
We ended up needing domain experts in three places:
The schema must reflect how the business thinks, not how a single data engineer saw the source system.
The prompt’s job is to trigger the right “neural pathways” in the base model: compare across periods, reconcile narrative with numbers, and apply the house definitions.
“Looks right” is not good enough; validation requires worked examples, reconciliations back to known reports, and explicit sign-off from SMEs on what counts as correct or acceptable with caveats.
When developers own prompts and domain definitions alone, the result is usually impressive… and quietly wrong.
LLM systems fail differently from traditional apps, but the cure is familiar: instrumentation, evaluation, and cost visibility.
We converged on a surprisingly small, stubborn set of metrics and evaluation signals:
This is assessed using curated question sets with expected answer characteristics (numeric tolerances, dimension coverage, and mandatory caveats). These are not perfect labels, but they are sufficient to catch drift when models, prompts, or schemas change. We started with approximately 50 curated questions and expanded to roughly 300 over three quarters as new real-world questions kept showing up.
What percentage of real user questions can be handled without hand-offs to humans or static reports?
UI, ingestion, retrieval, SQL generation, post-processing, rendering, and even graphing each had their own checks. Latency was tracked per layer as well. “It’s slow” isn’t helpful if you don’t know which part is slow.
This is derived from tokens, model calls, and a simple view of cloud cost over a representative month.
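One way to picture a curated question check is sketched below, assuming each question has a single reference number, a relative tolerance, a set of required dimensions, and a list of mandatory caveat phrases. The `Expected` and `Answer` shapes and the `evaluate` function are illustrative assumptions, not the structures used in the systems described here.

```python
from dataclasses import dataclass, field


@dataclass
class Expected:
    value: float            # reference number from a known report
    rel_tolerance: float    # acceptable relative error, e.g. 0.01 for 1%
    dimensions: set[str] = field(default_factory=set)  # must be covered
    caveats: list[str] = field(default_factory=list)   # phrases that must appear


@dataclass
class Answer:
    value: float
    dimensions: set[str]
    text: str


def evaluate(answer: Answer, expected: Expected) -> list[str]:
    """Return a list of failure reasons; an empty list means the check passed."""
    failures = []
    # Guard against division by zero when the reference value is 0.
    denom = max(abs(expected.value), 1e-12)
    if abs(answer.value - expected.value) / denom > expected.rel_tolerance:
        failures.append("value outside tolerance")
    missing = expected.dimensions - answer.dimensions
    if missing:
        failures.append(f"missing dimensions: {sorted(missing)}")
    lowered = answer.text.lower()
    for caveat in expected.caveats:
        if caveat.lower() not in lowered:
            failures.append(f"missing caveat: {caveat}")
    return failures
```

Returning reasons rather than a single boolean makes drift easier to diagnose: a run that starts dropping caveats looks different from one that starts getting numbers wrong.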
Behind those metrics, we treated every API call and contract as something to instrument: ingestion pipelines, retrieval hops, SQL generation, model responses, and visualization steps all emitted structured logs with clear pass/fail or “flagged” statuses.
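As a rough sketch of what one structured record per layer could look like, the context manager below times a step and logs its status as JSON. The logger name, the field names, and the `traced` helper are assumptions made for illustration; only the three statuses (passed, failed, flagged) and per-layer latency come from the text.

```python
import json
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("copilot.trace")  # hypothetical logger name


@contextmanager
def traced(layer: str, request_id: str):
    """Time one pipeline layer and emit a single structured log record."""
    record = {"request_id": request_id, "layer": layer, "status": "passed"}
    start = time.perf_counter()
    try:
        # The layer may attach details or set record["status"] = "flagged".
        yield record
    except Exception as exc:
        record["status"] = "failed"
        record["error"] = repr(exc)
        raise
    finally:
        record["latency_ms"] = round((time.perf_counter() - start) * 1000, 2)
        logger.info(json.dumps(record))
```

A layer would wrap its work in `with traced("sql_generation", request_id) as rec:` and add fields such as the generated SQL to `rec`, so that “it’s slow” or “it’s wrong” can be traced to a specific step.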
We used LLMs as judges sparingly, and only for things humans genuinely don’t want to do at scale: primarily natural-language tone and surface quality, never as the sole source of truth for correctness.
On top of the metrics, we learnt to keep rich traces including prompt + response pairs, the executed SQL/queries, and validation outcomes (passed, failed, and flagged).
You need these when:
Think of it as unit tests and logging for a probabilistic system. “Ask the LLM again” is not a valid error-handling strategy.
Most security work looks familiar: provider due diligence (data retention, training, and residency), identity and access, network controls, logging, and compliance. Your cloud and cyber teams already know that playbook.
Where LLM systems add nuance is in how you guard the interaction between user, model, and data. The pattern that worked for us was to think in concentric circles around the data.
You still need the usual security architecture around this. But treating the model as one fallible component inside layered defenses is much healthier than assuming “the guardrail will handle it.”
In theory, we talk about “trustworthy AI.” In practice, most power users and executives want something simpler: “Show me how you got that number.”
Adoption improved significantly when the systems became inspectable by default, enabling you to:
The goal is not to turn everyone into a data engineer; it is to make it obvious whether a surprising answer is a genuine signal, a data/model limitation, or just a plain bug.
A reasonable mental model is a “high-performing junior analyst.” Such an analyst is most effective when they show their work, cite sources, and expect questions.
Finally, the unromantic bit. If you are building on cloud LLMs, assume that over a 12–18-month horizon:
In one case, a provider changed the default behavior behind a widely used model alias. Our curated question set suddenly started failing in about 10% of cases; some answers became overcautious and dropped important caveats, others became oddly verbose, and a few began refusing questions they had previously answered.
On another occasion, we walked into an important review with multiple teams exercising the same tenant through automation runs and evaluations. A subtle rate-limit change at the provider level meant that, mid-demo, half the questions started timing out with “try again later” errors. From the user’s point of view, the entire copilot had just fallen over.
In both these cases, nothing “broke” in the infrastructure sense, but the behavior had drifted enough to break functionality or bring the demo crashing down. Those incidents forced us to treat vendors and versions as moving parts. The design responses were simple but non-negotiable:
We also started treating model and prompt upgrades like any other production alteration, with change tickets, limited rollouts, and regression checks against our curated question sets, before anything touched a live tenant.
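A regression check of this kind can be sketched as a simple gate over per-question pass/fail results, assuming the curated set has already been run against both the current configuration and the proposed one. The function name `regression_gate` and the 2% default threshold are illustrative assumptions, not values from the incidents above.

```python
def regression_gate(baseline: dict[str, bool],
                    candidate: dict[str, bool],
                    max_drop: float = 0.02) -> tuple[bool, list[str]]:
    """Compare per-question pass/fail results for two configurations.

    Returns (ok, regressions), where regressions are question ids that passed
    under the baseline but fail (or are missing) under the candidate.
    """
    if not baseline:
        raise ValueError("baseline results are empty")
    regressions = sorted(
        q for q, passed in baseline.items()
        if passed and not candidate.get(q, False)
    )
    base_rate = sum(baseline.values()) / len(baseline)
    # Questions missing from the candidate run count as failures.
    cand_rate = sum(candidate.get(q, False) for q in baseline) / len(baseline)
    ok = (base_rate - cand_rate) <= max_drop
    return ok, regressions
```

Blocking the rollout when `ok` is false, and attaching the `regressions` list to the change ticket, keeps the decision about a model or prompt upgrade tied to specific questions rather than to a general impression.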
One of the simplest robustness tests we now use is brutal but effective: swap your current LLM for the equivalent “mini” version.
If you find the overall solution turns into a train wreck with wrong answers, broken SQL, incoherent charts, or nonsense caveats, then you’re not ready for enterprise production. The system is overfitting to one model’s quirks.
If, on the other hand, your users say, “The final answers could use some polish, but overall, it’s okay,” then you’re on the right path. The less your product depends on a single model’s personality, the more it behaves like proper enterprise software.
Under the hood, these systems involve embeddings, retrieval strategies, prompt patterns, and orchestration. All of that matters to engineers.
But from a product point of view, the lessons that keep repeating are simpler:
Most of this will still apply even if base models get 10x better or 10x cheaper. The specifics of which LLM you call will change, but the hard parts, like understanding workflows, shaping data and domain definitions, building trust, and planning for failure, will not.
Do those well, and “we used an LLM” stops being the headline and becomes a single, unremarkable line in the architecture diagram. For serious enterprise systems, that is exactly where it belongs.
