Having a “vision” is not enough. Enterprises need clear objectives, solid data, and a design plan with built-in evaluations and humans in the loop.
Every day brings a new, better large language model (LLM) or a new approach to finding signal in all the AI noise. It’s exhausting to try to keep up. But here’s a comforting yet uncomfortable truth about enterprise AI: Most of what’s loud today won’t persist tomorrow. While models trend like memes, frameworks spawn like rabbits, and at any given moment a new “this time it’s different” pattern elbows yesterday’s breakthrough into irrelevance, the reality is you don’t need to chase every shiny AI object. You just need to master a handful of durable skills and decisions that compound over time.
Think of these durable skills and decisions as the “operating system” of enterprise AI work: the core upon which everything else runs. Get those elements right and all the other stuff, from agents and retrieval-augmented generation (RAG) to memory and whatever gets rebranded next, becomes a plug-in.
Focus on the job, not the model
The most consequential AI decision is figuring out what problem you’re trying to solve in the first place. This sounds obvious, yet most AI projects still begin with, “We should use agents!” instead of, “We need to cut case resolution times by 30%.” Most AI failures trace back to unclear objectives, lack of data readiness (more on that below), and lack of evaluation. Success starts with defining the business problem and establishing key performance indicators (KPIs). This seems ridiculously simple. You can’t declare victory if you haven’t established what victory looks like. However, this all-important first step is commonly overlooked, as I’ve noted.
Hence, it’s critical to translate the business goal into a crisp task spec:
- Inputs: what the system actually receives (structured fields, PDFs, logs)
- Constraints: latency, accuracy thresholds, regulatory boundaries
- Success definition: the metric the business will celebrate (fewer escalations, faster cycle time, lower cost per ticket, etc.)
This task spec drives everything else: whether you even need generative AI (often you won’t), which patterns fit, and how you’ll prove value. It’s also how you stop your project from growing into an unmaintainable “AI experience” that does many things poorly.
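As a minimal sketch (the class and field names are illustrative, not a standard), the whole spec can be small enough to live next to the code that implements it:

```python
from dataclasses import dataclass

@dataclass
class TaskSpec:
    """Illustrative task spec: inputs, constraints, and a success definition."""
    name: str
    inputs: list[str]            # what the system actually receives
    max_latency_ms: int          # constraint: interactive budget
    min_accuracy: float          # constraint: quality threshold
    policy_notes: str            # constraint: regulatory boundaries
    success_metric: str          # the number the business will celebrate

spec = TaskSpec(
    name="support-case-triage",
    inputs=["case description (text)", "customer tier (structured)", "recent tickets (logs)"],
    max_latency_ms=1500,
    min_accuracy=0.90,
    policy_notes="No PII leaves the region; every decision must be auditable.",
    success_metric="Cut case resolution time by 30%",
)
```

Writing it down this way forces the inputs, constraints, and success definition to be agreed on before anyone argues about models.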
Make data clean, governed, and retrievable
Your enterprise’s advantage is not your model; it’s your data, but “we have a lot of data” is not a strategy. Useful AI depends on three things:
- Fitness for use: You want data that’s clean enough, labeled enough, and recent enough for the task. Perfection is a tax you don’t need to pay; fitness is what matters. Long before genAI became a thing, I wrote, “For years we’ve oversold the glamorous side of data science … while overlooking the simple reality that much of data science is cleaning and preparing data, and this aspect of data science is fundamental to doing data science well.” That’s never been more true.
- Governance: Know what data you can use, how you can use it, and under what policy.
- Retrievability: You need to get the right slice of data to the model at inference time. That’s not a model problem; it’s a data modeling and indexing problem.
Approaches to retrieval-augmented generation will continue to morph, but here’s a principle that won’t: The system can only be as good as the context you retrieve. As I’ve suggested, without organization-specific context such as policies, data, and workflows, even great models will miss the point. We therefore must invest in:
- Document normalization: Consistent formats and chunking should align with how your users ask questions.
- Indexing strategy: Hybrid search (lexical plus vector) is table stakes; tune for the tasks you actually run.
- Freshness pipelines: Your index is a dynamic asset, not a quarterly project. Memory is the “killer app” for AI, as I’ve written, but much of that memory must be kept fresh and recent to be useful, particularly for real-time applications.
- Meta-permissions: Retrieval must respect row/column/object-level access, not just “who can use the chatbot.”
In other words, treat your retrieval layer like an API contract. Stability and clarity there outlast any particular RAG library.
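Here’s a minimal sketch of what that contract might look like. The interface, the hybrid-scoring blend, and the injected dependencies (lexical index, vector index, ACL checker) are all illustrative assumptions; the point is that the rest of the system codes against the contract, not against any particular library:

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Chunk:
    doc_id: str
    text: str
    score: float

class Retriever(Protocol):
    """The stable contract: permission-aware, ranked context for a query."""
    def retrieve(self, query: str, user_id: str, k: int = 8) -> list[Chunk]: ...

class HybridRetriever:
    """Sketch: blends lexical and vector scores; all dependencies are injected."""
    def __init__(self, lexical_index, vector_index, acl, alpha: float = 0.5):
        self.lexical_index = lexical_index  # e.g., BM25-style search (hypothetical)
        self.vector_index = vector_index    # e.g., embedding store (hypothetical)
        self.acl = acl                      # acl.allowed(user_id, doc_id) -> bool
        self.alpha = alpha                  # lexical/vector blend weight to tune

    def retrieve(self, query: str, user_id: str, k: int = 8) -> list[Chunk]:
        lexical = {c.doc_id: c for c in self.lexical_index.search(query, k * 4)}
        vector = {c.doc_id: c for c in self.vector_index.search(query, k * 4)}
        merged = []
        for doc_id in lexical.keys() | vector.keys():
            if not self.acl.allowed(user_id, doc_id):
                continue  # respect row/object-level access at retrieval time
            lex_score = lexical[doc_id].score if doc_id in lexical else 0.0
            vec_score = vector[doc_id].score if doc_id in vector else 0.0
            text = (lexical.get(doc_id) or vector[doc_id]).text
            merged.append(Chunk(doc_id, text,
                                self.alpha * lex_score + (1 - self.alpha) * vec_score))
        return sorted(merged, key=lambda c: c.score, reverse=True)[:k]
```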
Evaluation is software testing for AI (run it like CI)
If your “evaluation” is two PMs and a demo room, you don’t have evaluation, because LLMs fail gracefully right up until they don’t. The way out is automated, repeatable, task-aligned evals. Great AI requires systematic, skeptical evaluation, not vibes-driven development. Hence, success depends on treating model behavior like crash-test engineering, not magic. That means golden sets (representative prompts/inputs and expected outputs, ideally derived from real production traces), numeric- and rubric-based scoring, guardrail checks, and regression gates (no new model, prompt, or retrieval change ships without passing your evaluation suite).
Evaluations are how you get off the treadmill of endless prompt fiddling and onto a track where improvements are proven. They also enable developers to swap models in or out with confidence. You wouldn’t ship back-end code without tests, so stop shipping AI that way.
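To make “run it like CI” concrete, here’s a minimal sketch of a regression gate. The scorer and threshold are stand-ins (real suites mix numeric metrics with rubric-based grading), but the shape is the same as any test suite: a golden set, a pass rate, and a build that fails when the rate drops:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GoldenCase:
    prompt: str    # ideally pulled from real production traces
    expected: str  # expected answer, or a key fact the output must contain

def contains_expected(output: str, expected: str) -> bool:
    """Toy scorer; swap in numeric metrics or rubric-based grading as needed."""
    return expected.lower() in output.lower()

def run_eval(generate: Callable[[str], str],
             golden_set: list[GoldenCase],
             scorer: Callable[[str, str], bool] = contains_expected,
             pass_threshold: float = 0.95) -> bool:
    """Regression gate: no model, prompt, or retrieval change ships below threshold."""
    passed = sum(scorer(generate(case.prompt), case.expected) for case in golden_set)
    rate = passed / max(len(golden_set), 1)
    print(f"eval: {passed}/{len(golden_set)} passed ({rate:.1%})")
    return rate >= pass_threshold

# In CI, fail the build exactly as you would for unit tests:
# assert run_eval(candidate_pipeline, golden_set)
```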
Design systems, not demos
The earliest wins in enterprise AI came from heroic demos. You know, the stuff you wade through on X all day. (“Wow, I can’t believe I can create a full-length movie with a two-line prompt!”) That hype-ware has its place, but truly great AI is dull, as I’ve noted. “Anyone who’s pushed real software to production knows that getting code to compile, pass tests, and run reliably in the wild is a far tougher slog than generating the code in the first place.”
Sustainable wins come from composable systems with boring interfaces:
- Inference gateways abstract model selection behind a stable API.
- Orchestration layers sequence tools: Retrieval → Reasoning → Action → Verification.
- State and memory are explicit: short-term (per task), session-level (per user), and durable (auditable).
- Observability comes from logs, traces, cost and latency telemetry, and drift detection.
“AI agents” will keep evolving, but they’re just planners plus tools plus policies. In an enterprise, the policies (permissions, approvals, escalation paths) are the hard part. Build those in early.
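Here’s one way that sequence can be sketched, with policy checks treated as first-class steps rather than afterthoughts. Every name below (the retriever, model, tools, and policy objects and their methods) is an illustrative assumption, not any particular framework’s API:

```python
def escalate(user, item):
    """Stub: hand the step to a human approver instead of acting autonomously."""
    return {"status": "needs_human_review", "user": user.id, "item": item}

def handle_request(query, user, retriever, model, tools, policy):
    """Sketch of the Retrieval -> Reasoning -> Action -> Verification sequence."""
    # Retrieval: ground the model in permitted, relevant context
    context = retriever.retrieve(query, user_id=user.id)

    # Reasoning: the model proposes an answer and any tool calls it thinks it needs
    plan = model.plan(query=query, context=context)

    # Action: tools run only if policy allows them for this user and this step
    results = []
    for step in plan.tool_calls:
        if not policy.allows(user, step):
            return escalate(user, step)  # approvals and escalation paths built in
        results.append(tools[step.name](**step.args))

    # Verification: guardrail checks before anything ships to the user
    draft = model.answer(query=query, context=context, tool_results=results)
    return draft if policy.verify(draft) else escalate(user, draft)
```

The specific framework matters far less than the fact that the permission and verification checks sit in the control flow at all.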
Latency, cost, and UX are product features
Enterprises don’t abandon AI because it’s “not smart enough.” They abandon it because it’s too slow, too expensive, or too weird for users. Here are a few examples:
- Latency: For interactive flows, aim under ~700ms for visible progress and under ~1.5s for a “feels instant” reply. This will have a huge impact on your customer experience. Use smaller or distilled models wherever you can and stage responses (e.g., quick summary first, deep analysis on demand).
- Cost: Track tokens like a P&L. Cache aggressively (semantic caching matters; see the sketch after this list), reuse embeddings, and pick models by task need, not ego. Most tasks don’t need your largest model (or a model at all).
- UX: Users want predictability more than surprise. Offer controls (“cite sources,” “show steps”), affordances to correct errors (“edit query,” “thumbs down retrain”), and consistent failure modes.
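On the caching point, here’s a minimal sketch of a semantic cache: reuse a prior answer when a new prompt lands close enough in embedding space. The embedding function and the similarity threshold are assumptions you’d tune against your own traffic:

```python
import numpy as np

class SemanticCache:
    """Sketch: return a cached answer for prompts that are semantically close."""
    def __init__(self, embed_fn, threshold: float = 0.92):
        self.embed_fn = embed_fn      # hypothetical text -> vector function
        self.threshold = threshold    # similarity cutoff, tuned per workload
        self.entries: list[tuple[np.ndarray, str]] = []  # (embedding, answer)

    def get(self, prompt: str):
        query = np.asarray(self.embed_fn(prompt), dtype=float)
        for emb, answer in self.entries:
            sim = float(np.dot(query, emb) /
                        (np.linalg.norm(query) * np.linalg.norm(emb) + 1e-9))
            if sim >= self.threshold:
                return answer         # cache hit: zero tokens spent
        return None

    def put(self, prompt: str, answer: str) -> None:
        self.entries.append((np.asarray(self.embed_fn(prompt), dtype=float), answer))
```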
AI doesn’t change the laws of enterprise “physics.” If you can show “we cut average handle time by 19% at $0.03 per interaction,” your budget conversations around AI become easy, just like any other enterprise technology.
Security, privacy, and compliance are essential design inputs
Nothing kills momentum faster than a late-stage “Legal says no.” Bring them in early and design with constraints as first-class requirements. Enough said. This is the shortest section but arguably the most important.
Keep people in the loop
The fastest way to production is rarely “full autonomy.” It’s human-in-the-loop: Assist → Suggest → Approve → Automate. You start with the AI doing the grunt work (drafts, summaries, extractions), and your people verify. Over time, your evals and telemetry make specific steps safe to auto-approve.
There are at least two benefits to this approach. The first is quality: Humans catch the 1% that wrecks trust. The second is adoption: Your team feels augmented, not replaced. That matters if you want real usage rather than quiet revolt. It’s also essential since the best approach to AI (in software development and beyond) augments skilled people with fast-but-unthinking AI.
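One way to make that ladder explicit is to encode it as a promotion policy: a workflow step earns more autonomy only as eval scores and production telemetry justify it. The thresholds below are illustrative assumptions, not recommendations:

```python
from enum import Enum

class Autonomy(Enum):
    ASSIST = 1    # AI drafts, a person does the work
    SUGGEST = 2   # AI proposes, a person edits and sends
    APPROVE = 3   # AI acts only after explicit human sign-off
    AUTOMATE = 4  # AI acts alone; humans audit samples

def allowed_autonomy(eval_pass_rate: float, weeks_in_production: int) -> Autonomy:
    """Promote a step only as evidence accumulates (thresholds are illustrative)."""
    if eval_pass_rate >= 0.99 and weeks_in_production >= 8:
        return Autonomy.AUTOMATE
    if eval_pass_rate >= 0.97 and weeks_in_production >= 4:
        return Autonomy.APPROVE
    if eval_pass_rate >= 0.90:
        return Autonomy.SUGGEST
    return Autonomy.ASSIST
```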
Portability, or “don’t marry your model”
Andy Oliver is right: “The latest GPT, Claude, Gemini, and o-series models have different strengths and weaknesses, so it pays to mix and match.” Not only that, but the models are in constant flux, as is their pricing and, very likely, your enterprise’s risk posture. As such, you don’t want to be hardwired to any particular model. If swapping a model means rewriting your app, you only built a demo, not a system. You also built a problem. Hence, successful deployments follow these principles:
- Abstract behind an inference layer with consistent request/response schemas (including tool call formats and safety signals).
- Keep prompts and policies versioned outside code so you can A/B and roll back without redeploying.
- Dual run during migrations: Send the same request to old and new models and compare via evaluation harness before cutting over.
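That dual run is easy to automate. Here’s a minimal sketch, assuming you already have a scorer from your evaluation suite and callables for the old and new models; all names are illustrative:

```python
def dual_run(requests, old_model, new_model, scorer) -> bool:
    """Send the same traffic to both models and compare before cutting over."""
    old_wins = new_wins = ties = 0
    for req in requests:
        old_score = scorer(req, old_model(req))
        new_score = scorer(req, new_model(req))
        if new_score > old_score:
            new_wins += 1
        elif new_score < old_score:
            old_wins += 1
        else:
            ties += 1
    print(f"dual run: new wins {new_wins}, old wins {old_wins}, ties {ties}")
    # Cut over only when the new model holds or improves quality on real traffic.
    return new_wins >= old_wins
```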
Portability isn’t just insurance; it’s how you negotiate better with vendors and adopt improvements without fear.
Things that matter less than you think
I’ve been talking about how to ensure success, yet surely some (many!) people who have read up to this point are thinking, “Sure, but really it’s about prompt engineering.” Or a better model. Or whatever. These are AI traps. Don’t get carried away by:
- The perfect prompt. Good prompts help; great retrieval, evaluations, and UX help more.
- The biggest model. Most enterprise tasks thrive on right-sized models plus strong context. Context is the key.
- Tomorrow’s acronym. Agents, RAG, memory: these are ingredients. Data, evaluation, and orchestration are what make it all work.
- A single vendor to rule them all. Consolidation is nice, but only if your abstractions keep you from being stuck.
These principles and pitfalls may sound sexy and new when applied to AI, but they’re the same things that make or break enterprise applications generally. Ultimately, the vendors and enterprises that win in AI will be those that deliver an exceptional developer experience or that follow the principles I’ve laid out and avoid the pitfalls.


