AGENTIC AI
Scalable & Responsible
Deployment of
AI Agents in the Enterprise
Debmalya Biswas, PhD
TABLE OF CONTENTS
INTRODUCTION TO AGENTIC AI
AGENTIC AI REFERENCE ARCHITECTURE
AGENTS DISCOVERY & MARKETPLACE
PERSONALIZING UX FOR AGENTIC AI
AGENTIC AI MEMORY MANAGEMENT
AGENTIC RAGS
REINFORCEMENT LEARNING AGENTS
RESPONSIBLE AI AGENTS
INTRODUCTION TO AGENTIC AI
The discussion around ChatGPT (and Generative AI in general) has now evolved into Agentic AI. While ChatGPT is primarily a chatbot
that can generate text responses, AI agents can execute complex
tasks autonomously, e.g., make a sale, plan a trip, make a flight
booking, book a contractor to do a house job, order a pizza. Fig. 1
below illustrates the evolution of agentic AI systems.
Fig. 1: Agentic AI evolution
Bill Gates recently envisioned a future where we would have an AI
agent that is able to process and respond to natural language and
accomplish a number of different tasks. Gates used planning a trip
as an example.
Ordinarily, this would involve booking your hotel, flights,
restaurants, etc. on your own. But an AI agent would be able to use
its knowledge of your preferences to book and purchase those
things on your behalf.
An AI agent today is able to decompose a given task, monitor long
running sub-tasks, and autonomously adapt its execution strategy
to achieve its goal. This has led to the rise of AI agents optimized to
perform specific tasks, potentially provided by different vendors,
and published / catalogued in an agent marketplace.
AI agents today are primarily characterized by three capabilities
– illustrated in Fig. 2:
- Complex task decomposition & orchestration
- (Long-term) memory management & context sharing
- Autonomous execution based on ‘reflect & adapt’
Fig. 2: Agentic AI core capabilities
This has led to agentic AI getting applied in almost every vertical
with the potential for significant disruption. For example, agentic AI
use-cases in the BFSI domain include:
- Credit monitoring: Set a goal of monitoring a credit portfolio, and the agent will go out and research the relevant metrics, consolidate them, and then continue to monitor the portfolio until the metrics are optimized.
- Research reports: The agent researches the internet for noted equity analysts, sends them an email about what they thought of the market, compiles the responses, and puts them in a report to wealth management clients.
- Branch locations: The agent researches high-traffic locations using cellphone data, looks for available locations, and ranks them according to cost. The agent then maps competitors and competitor traffic to produce a location report.
- Data management: Given an ETL requirement, the agent writes its own code in Python, debugs it, and moves it to production to clean and transform data into the target data model.
AGENTIC AI REFERENCE ARCHITECTURE
In this section, we highlight the key components of a reference AI
agent platform—illustrated in Fig. 3:
• Agent marketplace
• Orchestration layer
• Integration layer
• Shared memory layer
• Governance layer, including explainability, privacy, security,
etc.
Fig. 3: AI agent platform reference architecture
Given a user task, we prompt an LLM for the task decomposition—
this is the overlap with Gen AI. Unfortunately, this also means that
agentic AI systems today are limited by the reasoning capabilities
of large language models (LLMs). For example, the GPT-4 task
decomposition of the prompt:
Generate a tailored email campaign to achieve sales of USD 1
Million in 1 month. The applicable products and their performance
metrics are available at [url]. Connect to CRM system [integration]
for customer names, email addresses, and demographic details.
is detailed in Fig. 4: (Analyze products)—(Identify target
audience)—(Create tailored email campaign).
Fig. 4: Agentic AI execution of a Marketing use-case
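For illustration, the sketch below shows how such a decomposition could be elicited as structured output, assuming the OpenAI Python SDK; the model name and the JSON schema are our assumptions, not a prescribed interface.

```python
import json
from openai import OpenAI  # assumes the OpenAI Python SDK is installed

client = OpenAI()  # API key read from the OPENAI_API_KEY environment variable

task = ("Generate a tailored email campaign to achieve sales of USD 1 Million "
        "in 1 month. Product metrics are at [url]; customer details come from "
        "the CRM system [integration].")

response = client.chat.completions.create(
    model="gpt-4o",  # assumption: any capable chat model works here
    messages=[
        {"role": "system",
         "content": "Decompose the user's task into an ordered JSON array of "
                    "sub-tasks, e.g. [{\"step\": 1, \"action\": \"...\"}]. "
                    "Return only the JSON."},
        {"role": "user", "content": task},
    ],
)

# Assumes the model returns bare JSON; production code would validate this.
plan = json.loads(response.choices[0].message.content)
for step in plan:
    print(step["step"], step["action"])
```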
The LLM then monitors the execution / environment and adapts
autonomously as needed. In this case, the agent realised that it is
not going to achieve its sales goal and autonomously added the
tasks:
(Find alternative products)—(Utilize customer data to personalize
the emails)—(Perform A/B testing).
Given the need to orchestrate multiple agents, there is a need for
an integration layer supporting different agent interaction
patterns, e.g., agent-to-agent APIs, an agent API providing output
for human consumption, a human triggering an AI agent, and AI
agent-to-agent with a human in the loop. The integration patterns need to be
supported by the underlying AgentOps platform. Andrew Ng
recently talked about this aspect from a performance perspective:
Today, a lot of LLM output is for human consumption. But in an
agentic workflow, an LLM might be prompted repeatedly to reflect
on and improve its output, use tools, plan and execute multiple
steps, or implement multiple agents that collaborate. So, we might
generate hundreds of thousands of tokens or more before showing
any output to a user. This makes fast token generation very
desirable and makes slower generation a bottleneck to taking better
advantage of existing models.
It is also important to mention that integration with enterprise
systems (e.g., CRM in this case) will be needed for most use-cases.
For instance, refer to the Model Context Protocol (MCP) proposed by
Anthropic recently to connect AI agents to external systems where
enterprise data resides.
Given the long-running nature of such complex tasks, memory
management is key for Agentic AI systems. Once the initial email
campaign is launched, the agent needs to monitor the campaign for
1 month.
This entails both context sharing between tasks and maintaining
execution context over long periods.
The standard approach here is to save the embedding
representation of agent information into a vector store database
that can support maximum inner product search (MIPS). For fast
retrieval, approximate nearest neighbor (ANN) algorithms are used,
returning the approximate top-k nearest neighbors and trading a
small loss in accuracy for a large gain in speed.
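As a minimal sketch, the FAISS library supports both exact MIPS and ANN-style indexes over such embeddings; the dimensionality and data below are stand-ins.

```python
import numpy as np
import faiss  # assumes the faiss-cpu package

d = 384  # embedding dimensionality (encoder-dependent)
memory = np.random.rand(10_000, d).astype("float32")  # stand-in for agent memory embeddings

# Exact maximum inner product search (MIPS) over the stored embeddings.
exact = faiss.IndexFlatIP(d)
exact.add(memory)

# Approximate variant: an IVF index trades a little accuracy for speed.
quantizer = faiss.IndexFlatIP(d)
ann = faiss.IndexIVFFlat(quantizer, d, 100, faiss.METRIC_INNER_PRODUCT)
ann.train(memory)
ann.add(memory)

query = np.random.rand(1, d).astype("float32")
scores, ids = ann.search(query, 5)  # approximate top-5 nearest neighbors
```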
Finally, the governance layer. We need to ensure that data shared
by the user specific to a task, or user profile data that cuts across
tasks, is only shared with the relevant agents (privacy,
authentication, and access control). Refer to the later section on
Responsible AI Agents for a discussion on the key dimensions
needed to enable a well governed AI agent platform in terms of
hallucination guardrails, data quality, privacy, reproducibility,
explainability, etc.
AGENTS DISCOVERY & MARKETPLACE
Given a user task, the goal of an agentic AI system is to identify
(compose) an agent (group of agents) capable of executing that
task. A high-level approach to solving such complex tasks
involves:
• decomposition of the given complex task into (a hierarchy or
workflow of) simple tasks, followed by
• determining and orchestrating agents capable of executing
the simple(r) tasks.
This can be achieved in a dynamic or static manner. In the
dynamic approach, given a complex user task, the system comes up
with a plan to fulfill the request depending on the capabilities of
available agents at run-time. In the static approach, given a set of
agents, composite agents are defined manually at design-time
combining their capabilities.
The main focus of this section is on the discovery aspect of AI
agents, i.e., identifying AI agent(s) capable of executing a given
user prompt / task—illustrated in Fig. 5.
This implies that there exists a marketplace (registry of agents),
with a well-defined description of the agent capabilities and
constraints.
The challenge is then to design a matchmaking algorithm capable
of identifying the ‘right’ agent(s) that can execute a given agentic
task, given the task requirements and the capabilities / constraints
of the agents available in the marketplace.
Fig. 5: Agent discovery and matchmaking for Agentic AI
In the sequel, we outline a learning-to-rank algorithm to identify
the top-k agent(s) based on natural language descriptions of both
the tasks and marketplace agents.
We also consider the inherent non-determinism in agentic AI
systems. For example, let us consider the e-shopping scenario
illustrated in Fig. 6.
Fig. 6: E-shopping scenario with non-determinism
There are two non-deterministic operators in the execution plan:
‘Check Credit’ and ‘Delivery Mode’. The choice ‘Delivery Mode’
indicates that the user can either pick-up the order directly from the
store or have it shipped to his address. Given this, shipping is a
non-deterministic choice and may not be invoked during the actual
execution. As such, the question arises
if the constraints of the shipping agent, that is, the fact that it can
only ship to certain countries, be projected as constraints of the
composite e-shopping service (or not)?
Note that even component services composed via deterministic
operators (Payment and Shipping) are not guaranteed to be invoked
if they are preceded by a choice.
AGENT DISCOVERY BASED ON NATURAL LANGUAGE
DESCRIPTIONS
In this section, we propose a learning-to-rank algorithm to
discover the right agent(s) corresponding to a user prompt.
For example, the available agents in the marketplace for the e-
shopping scenario can be: “Check for price”, “Compare products”,
“Deliver products”, “Pay for order”, etc. Given that both the
prompts and agent service descriptions are provided in natural
language, this is primarily a multi-class text classification
problem. Unfortunately, traditional vector or cosine similarity based
algorithms face challenges here due to the following reasons:
• the agent descriptions are ambiguous and overlap with that of
other agents (possibly provided by a different vendor);
• the user prompts (including those of different users)
themselves can be ambiguous and overlapping;
• we expect new agents to be frequently added to the
marketplace, leading to scaling challenges, since each addition
would require re-training the full model.
The learning-to-rank (L2R) algorithm to select top-k agents given
a user prompt is illustrated in Fig. 7. We first convert agent (class)
descriptions to semantic embeddings offline and use them to train
the L2R model. The user prompts and the agents use the same
generic embedding model. The inference results, including the agent
description embeddings used during training and inference, are
cached to enable the meta-learning process for the L2R algorithm.
Fig. 7: Learning-to-Rank (L2R) algorithm to select top-k Agents
given a User prompt
For instance, for a user prompt
What is the price of Product X?
the ratings of agents “Check for Price”, “List provider Y’s products (Y
is the provider of product X)”, “Separator”, “Compare Products”,
“Non-Y’s Products” are {2, 1, 0,−1,−2}, where higher positive
ratings indicate that the agent description is more related to the
user prompt / (decomposed) task description.
The “Separator” class indicates a class that is not defined in the
current model. The L2R model maps the separator class to the
boundary of the top-k positions, i.e., the placeholder “Separator”
class itself and all agents whose service descriptions are not
relevant to the user prompt / task description are ranked below this separator.
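To make this concrete, the sketch below ranks agents against a prompt with a generic sentence encoder; note that it substitutes plain cosine similarity for the trained L2R model, purely to illustrate the separator mechanism.

```python
from sentence_transformers import SentenceTransformer, util

# Offline: embed agent (class) descriptions with a generic embedding model.
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumption: any sentence encoder
agents = ["Check for price", "Compare products", "Deliver products",
          "Pay for order", "Separator"]  # placeholder class for 'no relevant agent'
agent_emb = model.encode(agents, convert_to_tensor=True)

# Online: embed the user prompt with the same model and rank agents.
prompt = "What is the price of Product X?"
prompt_emb = model.encode(prompt, convert_to_tensor=True)
scores = util.cos_sim(prompt_emb, agent_emb)[0]

ranked = sorted(zip(agents, scores.tolist()), key=lambda x: -x[1])
sep_score = dict(ranked)["Separator"]
top_k = [(a, s) for a, s in ranked if s > sep_score]  # agents ranked above the separator
print(top_k)
```

In the full system, the cached embeddings and rated examples train the L2R model instead of this direct similarity ranking.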
Finally, one could also ask: why not pose the classification
problem itself to an LLM at run-time?
Indeed, recent studies have shown promising results with respect to
LLM based few-shot text classification. However, in our
experiments, we found that due to hallucinations, the LLM’s outputs
often did not contain the defined agent descriptions. Also, it was
becoming a challenge to fit all the task and agent descriptions, with
few-shot examples, into an LLM invocation given the cost and token
limitations of LLMs. So L2R proved to be the best compromise in
this case.
CONCLUSION
In this section, we focused on the discovery aspect of autonomous
AI agents. We highlighted the current challenges of leveraging LLMs
as an execution engine for Agentic AI systems. We showed that to
execute a complex task, a pre-requisite is an agent marketplace
with a registry of agents, specifying their capabilities and
constraints. We outlined an L2R model to match agents based on
the textual descriptions, and discover top-k agents given a user
prompt / task.
PERSONALIZING UX FOR AGENTIC AI
In this section, we focus on personalizing AI agents based on user
personas. Analogous to the fine-tuning of LLMs into domain-specific
LLMs / small language models (SLMs),
we argue that customization / fine-tuning of these (generic) AI
agents will be needed with respect to enterprise specific context (of
applicable user personas and use-cases) to drive their enterprise
adoption.
The key benefits of AI agent personalization include:
• Personalized interaction: The AI agent adapts its language,
tone, and complexity based on user preferences and
interaction history. This ensures that the conversation is more
aligned with the user’s expectations and communication style.
• Use-case context: The AI agent is aware of the underlying
enterprise use-case processes, so that it can prioritize or
highlight process features, relevant pieces of content, etc.—
optimizing the interaction to achieve the use-case goal more
efficiently.
• Proactive Assistance: The AI agent anticipates the needs of
different users and offers proactive suggestions, resources, or
reminders tailored to their specific profiles or tasks.
To summarize, while the current focus in realising AI agents remains
on the functional aspects (and rightfully so), we highlight in this
section that UI/UX for AI agents is equally important as the last
mile to drive enterprise adoption.
USER PERSONA BASED AI AGENT
PERSONALIZATION
Users today expect a seamless and personalized experience with
customized execution to meet their specific requirements. However,
enterprise user and process specific AI agent personalization
remains difficult due to scale, performance, and privacy
challenges.
User persona based agent personalization aims to overcome these
challenges by segmenting the end-users of a service into a
manageable set of user categories, which represent the
demographics and preferences of the majority of users. For example,
the typical personas in an AI agent enabled IT service desk (one of
the areas with highest Gen AI adoption) scenario include:
• Leadership: Senior individuals (e.g., VPs, Directors) who
require priority support with secure access to sensitive data,
and assistance with high-level presentations and video
conferencing.
• Knowledge workers: employees who rely heavily on
technology to perform their daily tasks (e.g., analysts,
engineers, designers).
• Field workers: employees who work primarily outside the
office (e.g., sales representatives, service technicians). As
such, their requirements are mostly focused on remote access
to corporate systems, reliable VPNs, and support with offline
work capabilities.
• Administrative / HR: support staff responsible for various
administrative tasks (e.g., HR, Finance) with primary
requirements around assistance with MS Office software,
access to specific business applications, and quick resolution
of routine IT issues.
• New employees / interns: individuals who are new to the
organization and may not be fully familiar with the company’s
IT systems. As such, their queries mostly focus on onboarding
related queries.
Given this, the solution architecture to perform user persona based
fine-tuning of AI agents is illustrated in Fig. 8.
Fig. 8: User persona based fine-tuning of AI agents
The fine-tuning process consists of first parameterizing
(aggregated) user data and conversation history and storing it as
memory in the LLM via adapters, followed by fine-tuning the LLM for
personalized response generation. The agent—user persona router
helps in performing user segmentation (scoring) and routing the
tasks / prompts to the most relevant agent persona.
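A minimal sketch of such a router, assuming hypothetical persona descriptions distilled from aggregated user data, could score and route prompts as follows.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumption: any sentence encoder

# Hypothetical persona descriptions (distilled from aggregated user data).
personas = {
    "leadership": "priority support, secure access to sensitive data, presentations",
    "knowledge_worker": "development tools, analytics software, troubleshooting",
    "field_worker": "remote access, VPN, offline work capabilities",
    "admin_hr": "MS Office assistance, business applications, routine IT issues",
    "new_employee": "onboarding, account setup, getting-started guidance",
}
persona_emb = {p: model.encode(d, convert_to_tensor=True) for p, d in personas.items()}

def route(prompt: str) -> str:
    """Score the prompt against each persona and return the best match."""
    q = model.encode(prompt, convert_to_tensor=True)
    return max(persona_emb, key=lambda p: float(util.cos_sim(q, persona_emb[p])))

print(route("My VPN keeps dropping when I work from client sites"))  # -> field_worker
```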
Fine-tuning AI agents on raw user data is unfortunately often too
complex, even if it is at the (aggregated) persona level. This is
primarily due to the following reasons:
• Agent interaction data usually spans multiple journeys with
sparse data points, various interaction types (multimodal),
and potential noise or inconsistencies with incomplete
queries—responses.
• Moreover, effective personalization often requires a deep
understanding of the latent intent / sentiment behind user
actions, which can pose difficulties for generic (pre-trained)
LLMs—LLM agents.
• Finally, fine-tuning is computationally intensive. Agent-user
interaction data can be lengthy. Processing and modeling such
long sequences (e.g., multi-years’ worth of interaction
history) with LLMs can be practically infeasible.
A good solution reference to overcome the above issues is Google’s
work on User-LLMs. According to the authors,
USER-LLM distills compressed representations from diverse and
noisy user interactions, effectively capturing the essence of a user’s
behavioral patterns and preferences across various interaction
modalities.
This approach empowers LLMs with a deeper understanding of
users’ latent intent (inc. sentiment) and historical patterns (e.g.,
temporal evolution of user queries—responses) enabling LLMs to
tailor responses and generate personalized outcomes.
CONCLUSION
In this section, we considered personalization of AI agent
interactions based on user personas for enterprise use-cases.
Agentic AI personalization has the potential to significantly
accelerate agentic AI adoption by improving user satisfaction rates.
We provided the details to implement a personalization layer for the
platform with (a) agent-user router to perform user segmentation
and map tasks / prompts to the most relevant agent persona, and
(b) leveraging agent-user interaction embeddings.
AGENTIC AI MEMORY MANAGEMENT
Given the long-running nature of AI agents, memory
management is key for Agentic AI systems. This entails both
context sharing between tasks and maintaining execution context
over long periods.
The current solution is to use vector databases (Vector DBs) to
store the agent memory externally—making data items accessible
as needed. In the sequel, we deep dive into the details of
• how agentic memory is managed using Vector DBs,
• their corresponding data quality issues.
We then show that vector databases (while sufficient for
conversational memory—Q&A pairs) are insufficient for agentic
tasks given their need to manage additional memory types:
• semantic memory (general knowledge),
• episodic memory (personal experiences),
• procedural memory (skills and task procedures),
• emotional memory (feelings tied to experiences);
and highlight the need for alternative formalisms (e.g., knowledge
graphs, finite state machines) to manage the same effectively.
CONVERSATIONAL MEMORY MANAGEMENT USING
VECTOR DBS
Vector databases (DBs) are specifically designed to store vectors
and handle queries based on vector similarity. They are currently
the primary medium to store and retrieve data (memory)
corresponding to conversational agents. Vector DB based encoding /
memory management for conversational agents is illustrated in Fig.
9.
Fig. 9: Vector DB based encoding for LLMs
This involves selecting an encoder model that performs offline data
encoding as a separate process, converting various forms of raw
data, such as text, audio, and video, into vectors. Similar raw
conversational data items are mapped to closely positioned vectors
in the encoded space.
For example, text needs to be converted into numerical vectors to
be processed by computers and this is done using Tokenizers. A
token can be a byte, character, set of characters, words, or even
full sentences. Byte-pair encoding (BPE) is the most common
tokenization approach today; it iteratively merges frequent pairs of
adjacent bytes / characters into tokens.
Selecting the right ‘token’ is key as it impacts both the inter-token
relationships that the neural network will be able to grasp, and the
computational complexity of training that network.
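For illustration, OpenAI's tiktoken package exposes such BPE vocabularies directly.

```python
import tiktoken  # OpenAI's BPE tokenizer package

enc = tiktoken.get_encoding("cl100k_base")  # BPE vocabulary used by recent GPT models
tokens = enc.encode("Agentic AI memory management")
print(tokens)              # a list of integer token ids
print(enc.decode(tokens))  # round-trips back to the original text
```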
This encoded data is saved in a Vector DB, which can then be
retrieved during inference using the same encoder model based on
vector similarity. During a chat, the conversational agent has the
option of querying the long-term memory system by encoding the
query and searching for relevant information within the vector
database. The retrieved information is then used to answer the
query.
HUMAN MEMORY UNDERSTANDING
In this section, we first try to understand how the human brain
works with respect to short-term and long-term memory—
illustrated in Fig. 10.
Fig. 10: Memory management by the Human Brain
It all begins with the sensory system, where information from the
environment enters our sensory memory. This initial memory
stage holds sensory information in its original form but only for a
brief duration, typically lasting a few hundred milliseconds.
From there, the information we pay attention to is transferred to
short-term memory (STM). This working memory has a limited
capacity, capable of holding around 7 chunks of information for
about 20—30 seconds. It is where conscious mental processes like
thinking, problem-solving, and decision-making take place.
For information to move from short-term memory to long-term
memory (LTM), it undergoes encoding, transforming it into a more
durable and meaningful representation.
Encoding occurs through mechanisms such as repetition,
elaboration, or creating associations with existing knowledge.
Once successfully encoded, the information enters long-term
memory, which has an extensive capacity and can store information
for extended periods, ranging from hours to a lifetime.
The retrieval system operates based on associations with
contextual information. Retrieval cues, both external and internal,
aid in accessing specific memories by recreating the encoding
context.
• Recall involves actively reconstructing information without
external cues.
• Recognition involves identifying previously encountered
information among alternatives.
• Finally, retrieval strategies such as priming, mnemonic
techniques, chunking, and rehearsal, enhance the retrieval
process.
MAPPING TO AGENTIC MEMORY
Based on our understanding of the human brain and requirements
of AI agents / apps, we need to consider the following memory
types—illustrated in Fig. 11:
• Semantic knowledge: Information sourced from external (e.g.
Wikipedia) and internal systems (e.g., Sharepoint,
Confluence, documents, messaging platforms, etc.).
• Episodic memory: of specific past events and situations.
These insights are garnered during the operation of AI agents.
• Procedural memory: comparable to how humans remember
motor skills like swimming or driving a car. It encompasses
workflows and procedures outlining how AI agents can
achieve a specific task.
• Emotional memory: catalogs feelings tied to experiences. It
delves into user relationships, preferences, reactions, and
data that renders the AI more human-like and aligned in user
interactions.
Semantic memory is probably the only type of memory that is
available today within LLMs through pre-training and embeddings—
the other memory types are still WIP. We show how a memory
management module can be implemented for agentic AI systems
in Fig. 11.
Fig. 11: Agentic AI memory management
The memory router, by default, always routes first to the long-term
memory (LTM) module to check whether an existing pattern exists to
respond to the given user prompt. If yes, it retrieves the pattern and
responds immediately, personalizing the response as needed.
If the LTM fails, the memory router routes it to the short-term
memory (STM) module which then uses its retrieval processes
(function calling, APIs, etc.) to get the relevant context into the STM
(working memory) —leveraging applicable data services.
The STM—LTM transformer module is always active: it continuously
processes the retrieved context, extracting recipes out of it (e.g.,
refer to the concepts of teachable agents and recipes in AutoGen)
and storing them in a semantic layer (implemented via a Vector DB).
At the same time, it also collects other associated properties (e.g.,
number of tokens, cost of executing the response, state of the
system, tasks executed / responses generated) and
• creates an episode, which is then stored in a knowledge
graph,
• with the underlying procedure stored in a finite state
machine (FSM).
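A minimal sketch of this routing logic is shown below; the ltm, stm, and transformer interfaces are hypothetical stand-ins for the modules in Fig. 11.

```python
# All interfaces (ltm, stm, transformer) are hypothetical stand-ins.

LTM_THRESHOLD = 0.85  # assumed similarity cut-off for a long-term memory 'hit'

def handle_prompt(prompt, ltm, stm, transformer):
    # 1. Default route: check long-term memory for an existing pattern.
    pattern, score = ltm.search(prompt)          # vector similarity over stored recipes
    if score >= LTM_THRESHOLD:
        return ltm.personalize(pattern, prompt)  # retrieve and respond immediately

    # 2. LTM miss: short-term / working memory pulls fresh context via its
    #    retrieval processes (function calling, APIs, data services).
    context = stm.retrieve_context(prompt)
    response = stm.respond(prompt, context)

    # 3. The always-on STM-LTM transformer consolidates the interaction:
    #    a recipe into the semantic layer (Vector DB), an episode into the
    #    knowledge graph, and the procedure into a finite state machine.
    transformer.consolidate(prompt, context, response)
    return response
```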
CONCLUSION
To conclude, memory management is critical for the adoption of
long running AI agents. While vector databases are quite
performant for conversational agents, we showed that they are
insufficient for the diverse memory requirements of complex agentic
AI tasks, esp., episodic and procedural memory.
We proposed a first draft of an agentic memory architecture in this
section, with a memory router routing requests between the STM and
LTM modules. The key takeaway is the proposal of an STM-to-LTM
transformer module that is able to abstract and store episodic
and procedural memory in knowledge graphs and FSMs,
respectively.
AGENTIC RAGS
We are increasingly seeing the need to query SQL and document
repositories in an integrated fashion - consolidating the respective
query pipelines.
For example, let us consider the following marketing data
landscape with:
• profile data of customers, products, and sales agents, stored
in documents,
• transactional data on orders placed by customers including
the involved sales agents stored in a SQL DB.
Given this, we will need to get insights from both the profile
(unstructured) and transactional (structured) data repositories to
answer the below query:
“Provide a detailed summary of the Top 3 sales agents for
Product X in year 2023.”
Needless to say, resolving the above query requires a sequential
decomposition of the user query:
• retrieving the Top 3 sales agents for Product X first from the
SQL DB, followed by
• retrieving their respective profiles from the profiles document
repository, followed by
• generating a summary based on the retrieved profiles and
sales data.
In the sequel, we outline a solution flow for the above scenario
- leveraging an Agentic AI framework for the query
decomposition part,
- followed by SQL & document query agents querying the
underlying SQL and document repositories, respectively.
- Finally, the RAG pipeline is completed by adding the retrieved
structured and/or unstructured data to the original query
(prompt)—leading to the generation of a contextualized
response.
QUERY DOCUMENT REPOSITORIES USING RAGS
We start by showing how to build RAG pipelines on unstructured
data / documents. Together with fine-tuning, RAG forms one of the
primary mechanisms of ‘adapting’ a pre-trained large language
model (LLM) with enterprise data making it more contextual—
reducing hallucinations in the process (illustrated in the Gen AI
lifecycle stages in Fig. 12).
Fig. 12: Gen AI lifecycle stages
Given a user query, a RAG pipeline consists of the three phases
below (Fig. 13):
• Retrieve: Transform the user query into an embedding (vector
format) to compare its similarity score (search) with other
content.
• Augment: with search results / context retrieved from a
vector store that is kept current and in sync with the
underlying document repository.
• Generate: contextualized responses by making retrieved
chunks part of the prompt template that provides additional
context to the LLM on how to answer the query.
Fig. 13: RAG pipeline highlighting the data quality issues
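A minimal end-to-end sketch of these three phases, assuming a generic sentence encoder for retrieval and the OpenAI SDK for generation, looks as follows.

```python
from sentence_transformers import SentenceTransformer, util
from openai import OpenAI

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumption: any sentence encoder
client = OpenAI()

# Stand-in corpus; in practice these are document chunks kept in sync
# with the underlying repository via a vector store.
documents = ["<chunk 1>", "<chunk 2>", "<chunk 3>"]
doc_emb = encoder.encode(documents, convert_to_tensor=True)

def rag_answer(query: str, k: int = 2) -> str:
    # Retrieve: embed the query and find the most similar chunks.
    q_emb = encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, doc_emb, top_k=k)[0]
    context = "\n".join(documents[h["corpus_id"]] for h in hits)
    # Augment: make the retrieved chunks part of the prompt template.
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    # Generate: contextualized response from the LLM.
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumption: any chat model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```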
For example, Snowflake provides Cortex Search as a managed
service implementing the retrieval and augmentation parts of a RAG
pipeline. So it can be combined with Cortex LLM Functions to create
a full fledged RAG pipeline using data in Snowflake as the
underlying context repository.
DATA QUALITY ISSUES IN RAGS
In this section, we highlight the data quality issues below that can
arise (and need to be addressed) in such a RAG pipeline—illustrated
in Fig. 13. We first look at the common data quality dimensions in
our structured / SQL data world today:
• Accuracy: how closely does the data represent the real-world
scenario?
• Completeness: of data with respect to missing values, null
values, etc.
• Consistency: whether the same information is stored identically
across different places.
• Timeliness: freshness of the data captured in the form of
timestamps.
and then try to map them to the unstructured data world / Vector
DBs.
In the Vector DB world, a collection corresponds to a SQL table, and
each collection item can consist of: ID, vector (actual data, captured
as an array of floating point numbers), metadata (e.g., product
categories, timestamps).
Accuracy: of the data captured in vector stores. Think of an AI
writing news articles based on inaccurate information—it could end
up generating misinformation instead of valuable insights. We rely
on the below two metrics to capture this:
• Correctness: refers to the factual accuracy of the LLM’s
response,
• Groundedness: refers to the relationship between the LLM’s
response and its underlying KB.
Studies have shown how a response could be correct, but still
improperly grounded.
Incorrect and/or inconsistent vectors: Due to issues in the
embedding process, some vectors may end up corrupted, incomplete,
or generated with a different dimensionality.
This can lead to confused or disjointed outputs. For example, if an
AI is generating audio based on recordings of varying quality, the
result might be noticeably uneven. In text generation, inconsistent
grammar or tone in the data can lead to awkward or disjointed
content.
Missing data can be in the form of missing vectors or metadata.
For instance, a GenAI generating visual designs from incomplete
datasets might output products with missing features.
Timeliness: If the documents in your vector store, which provide
context relevant to prompts in a RAG pipeline, are outdated, then
the GenAI system can produce irrelevant outputs. For example, a
GenAI enabled chatbot answering policy questions, if fed obsolete
policy documents, could provide inaccurate and misleading
responses.
TEXT2SQL: NATURAL LANGUAGE QUERIES OVER
STRUCTURED DATA
In this section, we extend the natural language querying capabilities
to structured data stored in a SQL data repository. This capability is
called Text2SQL, also referred to as conversational BI.
Cortex Analyst is the Text2SQL equivalent of Cortex Search to
query structured data / SQL databases and data warehouses.
Cortex Analyst is actually an end-to-end text-to-answer solution as
it returns the generated SQL as well—providing the final query
response. It is very easy to deploy: available as an API, and
Snowflake also provides a simple Streamlit application that can be
deployed with a few lines of code.
Overall, it provides the following three differentiating features with
respect to other Text2SQL tools in the market:
• User intent validation and query explanation
• Lightweight semantic model
• Flexibility with respect to the underlying LLM
We all know that LLMs hallucinate, so the first (and best) thing to
do is to validate the system’s understanding of a given query
with the user—before responding with the final answer. Cortex
Analyst enables this by:
• having a dialogue with the user, where for any given query, it
first presents its understanding of the query to the user with
an explanation of how it generated the SQL query.
• In addition, it also provides suggestions to make the query
more concrete in case of ambiguities.
Secondly, Cortex Analyst addresses the data repository metadata
mapping problem highlighted earlier with the help of a semantic
model.
The semantic model is the bridge mapping the domain or business
specific terms used by users to the database schemas.
These additional semantic details, like more descriptive names or
synonyms, enable Cortex Analyst to answer natural language
queries more reliably.
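Purely as an illustration of the kind of mapping a semantic model captures (this is not Snowflake's actual semantic model format), consider:

```python
# Illustrative only: business-term-to-schema mapping of a semantic model.
semantic_model = {
    "table": "SALES.ORDERS",
    "measures": {
        "revenue": {"expr": "SUM(order_amount)",
                    "synonyms": ["sales", "turnover", "income"]},
    },
    "dimensions": {
        "sales_agent": {"expr": "agent_name",
                        "synonyms": ["rep", "seller", "account manager"]},
        "order_year": {"expr": "YEAR(order_date)", "synonyms": ["year"]},
    },
}
```

With such synonyms, a query like "top reps by turnover in 2023" can be mapped reliably to the underlying ORDERS schema.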
Finally, Cortex Analyst provides a lot of flexibility with respect to
choosing the underlying LLM. By default, Cortex Analyst leverages
Snowflake-hosted Cortex LLMs, which have been heavily fine-
tuned for text-to-SQL generation tasks and are one of the most
powerful LLMs available today. It is however possible to explicitly
opt-in to allow Cortex Analyst to use the latest OpenAI GPT models,
hosted by Microsoft Azure, alongside the Snowflake-hosted models.
At runtime, Cortex Analyst will select the optimal combination of
models to ensure the highest accuracy and performance for each
query.
RAG AGENTS CONSOLIDATING QUERIES OVER SQL
AND DOCUMENT REPOSITORIES
In this section, we tie everything together by outlining an Agentic AI
framework to build RAG pipelines that work seamlessly over both
structured and unstructured data stored in Snowflake. The
reference architecture of such a RAG Agent is illustrated in Fig. 14.
Fig. 14: Reference architecture of a RAG Agent querying both SQL &
Document repositories
We first focus on the query decomposition aspect. Given a user
task, the goal of an AI agent is to identify (compose) an agent
(group of agents) capable of executing the given task. This is
achieved by leveraging an LLM as a reasoning engine capable of
decomposing the task (query in this case) into sub-tasks, with
execution of the respective agents orchestrated by an orchestration
engine, e.g. LangGraph.
The Supervisor agent is responsible for coordinating the overall
execution, routing (via if-then-else logic) to the individual agents.
As such, the Router is primarily used to route the (SQL and/or
document related) sub-queries to the SQL & Document query
agents.
The SQL & Document query agents are self-explanatory, and
leverage the respective Snowflake Cortex Analyst and Search
components detailed earlier to query the underlying SQL and
Document repositories. Finally, to complete the RAG pipeline, the
retrieved data is added to the original query / prompt—leading to
the generation of a contextualized response.
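A minimal sketch of this supervisor / router pattern in LangGraph is shown below; the agent bodies are stubs standing in for the Cortex Analyst and Cortex Search calls.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class RAGState(TypedDict, total=False):
    query: str
    sql_result: str
    answer: str

def supervisor(state: RAGState) -> RAGState:
    return state  # coordinates overall execution; routing happens below

def sql_agent(state: RAGState) -> RAGState:
    # Stub: would call a Text2SQL service (e.g., Cortex Analyst) here.
    return {"sql_result": "Top 3 sales agents: A, B, C"}

def doc_agent(state: RAGState) -> RAGState:
    # Stub: would call document search (e.g., Cortex Search) + summarization.
    return {"answer": f"Profile summary for: {state['sql_result']}"}

def route(state: RAGState) -> str:
    # if-then-else routing: resolve the SQL sub-query first, then documents.
    return "sql_agent" if "sql_result" not in state else "doc_agent"

graph = StateGraph(RAGState)
graph.add_node("supervisor", supervisor)
graph.add_node("sql_agent", sql_agent)
graph.add_node("doc_agent", doc_agent)
graph.set_entry_point("supervisor")
graph.add_conditional_edges("supervisor", route,
                            {"sql_agent": "sql_agent", "doc_agent": "doc_agent"})
graph.add_edge("sql_agent", "supervisor")  # loop back for the next sub-query
graph.add_edge("doc_agent", END)
app = graph.compile()
# result = app.invoke({"query": "Summarize the Top 3 sales agents for Product X in 2023"})
```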
It is important to highlight here the need for a comprehensive
evaluation strategy for such complex systems.
LLMs are used in both pipelines, so it is mostly the difference
between retrieving information from structured vs. unstructured
repositories. In the case of SQL data retrieval, the challenge is in
the Text2SQL conversion. Once the right SQL script has been
determined, there is no confusion (hallucination) with respect to the
retrieved SQL table output. This is solved using a mix of semantic
models and manual validation of the generated SQL as discussed in
the previous section on Text2SQL.
In the case of unstructured data / documents, we need to evaluate
with respect to hallucinations on the output (generated text). So we
can apply RAG evaluation frameworks like RAGAS here. In our case,
the two steps / agents in the presented scenario are sequential:
retrieve the SQL sales data, followed by summarizing the profiles of
the top sales agents - so they can be evaluated independently and in
parallel.
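For instance, a minimal RAGAS evaluation of the summarization leg, assuming RAGAS' question / answer / contexts column conventions and a configured judge LLM, could look like:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

data = Dataset.from_dict({
    "question": ["Provide a detailed summary of the Top 3 sales agents for Product X in 2023."],
    "answer": ["<the generated summary>"],
    "contexts": [["<retrieved agent profiles>", "<retrieved sales rows>"]],
})
scores = evaluate(data, metrics=[faithfulness, answer_relevancy])
print(scores)  # per-metric scores for the unstructured (generation) leg
```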
REINFORCEMENT LEARNING AGENTS
When we talk about AI agents today, we mostly talk about LLM
agents, which loosely translates to invoking (prompting) an LLM to
perform natural language processing (NLP) tasks, e.g., processing
documents, summarizing them, generating responses based on the
retrieved data. For example, refer to the “researcher” agent
scenario outlined by LangGraph.
It is important to mention here that some agentic tasks may be
better suited to other machine learning (ML) techniques, e.g.,
reinforcement learning (RL), predictive analytics, etc.—depending
on the use-case objectives – illustrated in Fig. 15.
Fig. 15: Agentic AI platform reference architecture focusing on RL
agents
In this section, we focus on RL agents, and show how LLMs can be
used to fine-tune the RL reward and policy functions.
REINFORCEMENT LEARNING
Reinforcement learning (RL) is able to achieve complex goals by
maximizing a reward function in real-time. The reward function
works similar to incentivizing a child with candy and spankings,
such that the algorithm is penalized when it takes a wrong decision
and rewarded when it takes a right one—this is reinforcement. The
reinforcement aspect also allows it to adapt faster to real-time
changes in the user sentiment. For a detailed introduction to RL,
please refer to our papers leveraging RL for chatbots and
recommender systems.
Fig. 16: Reinforcement Learning (RL) methodology (Image
by Author)
Some interesting observations about RL—relevant for the following
discussion on RL agents:
• RL rewards and policies are not the same: The roles and
responsibilities of the reward function vs. RL agent policies are
not very well defined, and can vary between architectures. A
naïve understanding would be that given an associated
reward/cost with every state-action pair, the policy would
always try to minimize the overall cost. In practice, however,
sometimes keeping the ecosystem in a stable state can
be more important than minimizing the cost (e.g. in a climate
control use-case). As such,
the RL agent policy goal need not always be aligned with the reward
function, and that is why two separate functions are needed.
• Similar to supervised approaches in machine learning / deep
learning, the RL approach most suitable for enterprise
adoption is ‘model based RL’.
In model based RL, it is possible to develop a model of the
problem scenario, and bootstrap initial RL training based on
the model simulation values. For instance, for energy
optimization use-cases, a blueprint of the building heating,
ventilation and air conditioning (HVAC) system serves as a
model, whose simulation values can be used to train the RL
model—detailed in the HVAC optimization section below. For complex scenarios (e.g.
games, robotic tasks), where it is not possible to build a
model of the problem scenario, it might still be possible to
bootstrap an RL model based on historical values.
This is referred to as ‘offline training’, and is considered a good
starting point in the absence of a model. And, this is also the reason
why RL is often considered as a hybrid between supervised and
unsupervised learning, rather than a purely unsupervised learning
paradigm.
• Online and model-free RL remain the most challenging, where
the RL agent is trying to learn and react in real-time without
any supervision. Research in this field seems to lack a
theoretical foundation at this stage. Researchers are trying
out different approaches by simply throwing more data and
computing power at the problems. As such, this remains the
most “interesting” part of RL, with current research primarily
focusing on efficient heuristics and distributed computation to
cover the search space in an accelerated fashion.
Applying DL (neural networks) to the different RL aspects, e.g.,
policies, rewards, also remains a hot topic—referred to as deep
reinforcement learning.
LLM BASED FINE-TUNING OF RL REWARD FUNCTION
In this section, we show how RL agents can be developed by
leveraging LLMs to fine-tune the RL agent reward function. The key
steps to perform LLM based fine-tuning of the RL
agent reward function are as follows—illustrated in Fig. 17:
Fig. 17: LLM based fine-tuning of RL Agent Reward Function
1. Initial prompt to the LLM with the RL formulation of the
<use-case>, outlining the objectives of the RL reward and policy
functions, with (optional) examples of some reward function
templates.
2. The LLM generates candidate RL reward functions, with the
‘temperature’ parameter providing the exploration—exploitation
balance with respect to the novelty of the generated reward
functions.
3. Validate the generated reward functions for relevance with
respect to the given <use-case>, followed by their evaluation
with respect to the <use-case> accuracy.
4. Add the generated reward functions with their evaluation
results to LLM memory—for the LLM to improve its (future)
generation process.
5. Repeat steps 2–4 until convergence is reached with respect
to a threshold improvement in the RL agent reward function.
6. Finally, update the RL agent reward function with the top
ranked LLM generated reward function. Fig. 5 outlines sample
templates for the initial prompt and for storing the evaluation
results of an iteration in LLM memory.
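A minimal sketch of this loop is shown below; the three helper callables are hypothetical stand-ins for the LLM generation, validation, and evaluation steps.

```python
def tune_reward(generate_rewards, is_relevant, evaluate_reward,
                use_case: str, iterations: int = 10, temperature: float = 0.8):
    """Iteratively refine the RL reward function using an LLM (steps 1-6 above)."""
    memory = []  # (reward_fn, score) pairs fed back into the LLM prompt (step 4)
    best = None
    for _ in range(iterations):
        # Step 2: the LLM proposes candidate reward functions; 'temperature'
        # balances exploration vs. exploitation of the reward design space.
        candidates = generate_rewards(use_case, memory, temperature)
        for fn in candidates:
            # Step 3: relevance validation, then use-case accuracy evaluation.
            if not is_relevant(fn, use_case):
                continue
            score = evaluate_reward(fn, use_case)
            memory.append((fn, score))
            if best is None or score > best[1]:
                best = (fn, score)
        # Step 5: a convergence check on threshold improvement would go here.
    return best  # Step 6: the top ranked reward function updates the RL agent
```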
RL AGENTS FOR HVAC OPTIMIZATION
In this section, we apply the outlined LLM based RL reward fine-
tuning approach to a concrete use-case of optimizing HVAC
consumption in buildings / factories.
RL based HVAC optimization is an interesting case study in the
context of our current discussion. It showcases the successful
transition of an industrial control system run by a traditional PID
(proportional integral derivative) controller for the last 10+ years—
to a more efficient RL based controller.
In this section, we primarily show that it is possible to re-create the
RL based HVAC controller outlined in the paper, using the LLM
based fine-tuning approach proposed in the previous section.
• The industrial control system in this case refers to HVAC units
responsible for maintaining the temperature and humidity
settings in factories (buildings in general). The sensors
correspond to the indoor (and outdoor) temperature and
humidity sensors; and the actuators correspond to the cooling,
heating, re-heating and humidifier valves of the HVAC units.
• The RL model formulation of the HVAC optimization problem is
illustrated in Fig. 18.
Fig. 18: HVAC optimization—Reinforcement Learning formulation
The RL controller logic can be summarized as follows: Given the
zone state in terms of the (inside and outside) temperature and
humidity values, the RL model needs to decide by how much to
open the cooling, heating, re-heating and humidifier valves. To take
an informed decision in this scenario, the RL model needs to first
understand the HVAC system behavior, in terms of, say, how much
zone temperature drop can be expected by opening the cooling
valve to X%.
Once the RL model understands the HVAC system behavior, the
final step is to design the control strategy. For instance, the RL
model now has to choose whether to open the cooling valve to 25%
when the zone temperature reaches 23 degrees, or wait till the
zone temperature reaches 24 degrees before opening the cooling
valve to 40%. Note that waiting longer before opening the valve
contributes positively towards lowering the energy consumption.
However, it then runs the risk of violating the
temperature / humidity tolerance levels as the outside weather
conditions are always unpredictable. As a result, it might actually
have to open the cooling valve to a higher percentage if it waits
longer, consuming more energy.
The above trade-offs are quantified by the reward function
illustrated in Equation 1, which assigns a reward to each possible
action based on the following three parameters: setpoint closeness
(SC), energy cost (EC), and tolerance violation (TV).
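Equation 1 itself is not reproduced in this text; a plausible weighted form, with the weights being our assumption, is:

```latex
% A plausible weighted form of the reward (Equation 1 not reproduced here):
R(s, a) \;=\; w_{SC}\cdot SC(s, a) \;-\; w_{EC}\cdot EC(s, a) \;-\; w_{TV}\cdot TV(s, a)
```

Under this reading, a large tolerance-violation weight yields the ‘safe’ strategy below, a large setpoint-closeness weight the ‘business first’ policy, and a large energy-cost weight the ‘energy optimal’ policy.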
Given this, we were able to leverage the LLM based fine-tuning
approach to refine both ‘safe’ and ‘business first’ control strategies
to an ‘energy optimal’ policy.
• ‘Safe’ control strategy: assign a very high negative weightage
(penalty) to tolerance violations, ensuring that they never
happen—albeit at a higher energy cost.
• Setpoint closeness encourages a ‘business first’ policy where
the RL model attempts to keep the zone temperature as close
as possible to the temperature / humidity setpoints, implicitly
reducing the risk of violations—but at a higher energy cost.
• ‘Energy optimal’ prioritizes energy savings over the other two
parameters.
CONCLUSION
When we talk about AI agents today, we mostly talk about LLM
agents, which loosely translates to prompting an LLM to perform
NLP tasks. In this section, we highlighted the fact that some agentic
tasks can be better suited to other ML techniques, e.g.,
reinforcement learning (RL), predictive analytics, etc.
More concretely, we focused on RL agents, and showed how LLMs
can be used to fine-tune the RL agent reward / policy functions. We
showed a concrete example of applying the fine-tuning methodology
to a real-life industrial control system—designing the RL based
controller for HVAC optimization in a building setting. The proposed
RL fine-tuning methodology shows promise to be applied to a wide
range of RL agentic tasks.
RESPONSIBLE AI AGENTS
The growing adoption of generative AI and agentic AI has reignited
the discussion around AI regulations—to ensure that AI/ML systems
are responsibly trained and deployed.
Unfortunately, this effort is complicated by different governmental
organizations and regulatory bodies releasing their own guidelines
and policies with little to no agreement on the definition of terms.
The specification language is also (sometimes intentionally) kept at
such an abstract level that it is difficult to map it to an
implementation / enforcement mechanism. For instance, the EU AI
Act mandates a different set of dos & don’ts depending on the ‘risk
level’ of an AI application. However, quantifying the risk level of an
AI application is easier said than done as it depends on multiple
parameters, and basically requires one to classify how the
capabilities of a non-deterministic system will impact users and
systems who might interact with it in the future.
The table below summarizes the key challenges and solutions in
implementing responsible AI for the different generative AI and
agentic AI architectural patterns.
Table 1: Responsible AI integrated with AgentOps
We expand on the above points in the rest of this section to enable
an integrated AgentOps pipeline with responsible LLM governance.
For instance, as mentioned earlier, with respect to hallucinations
for agentic AI systems:
the hallucination likelihood increases exponentially as the
number of agents involved grows.
The right prompts can help, but only to a limited extent. To further
limit hallucinations, LLMs need to be fine-tuned with curated data
and/or have their response search space limited to relevant and
recent enterprise data.
Data consistency: The data used for training (esp. fine-tuning)
the LLM should be accurate and precise, meaning that only data
relevant to the specific use-case should be used to train the LLM.
For example, if the use-case is to generate summaries of medical
prescriptions, other data such as diagnosis Q&As should not be
used; only medical prescriptions and their corresponding summaries
should be used. Many a time, data pipelines need to be created to
ingest the data and feed it to the LLM. In such scenarios, extra
caution needs to be exercised when consuming free-text fields, as
these fields often hold inconsistent and incorrect data.
Data Quality: From a data quality point of view, we see the
following challenges with respect to LLMs, esp. vector databases.
• Accuracy of the encodings in vector stores, measured in terms
of the correctness and groundedness of the generated LLM
responses.
• Incorrect and/or inconsistent vectors: due to issues in the
embedding process, some vectors may end up corrupted,
incomplete, or generated with a different dimensionality.
• Missing data can be in the form of missing vectors or
metadata.
• Timeliness issues with respect to outdated documents
impacting the vector store.
Bias/Fairness: With respect to model performance and reliability,
it is difficult to control undesired biases in black-box LLMs, though it
can be controlled to some extent by using uniform and unbiased
data to fine-tune the LLMs and/or contextualize the LLMs in a RAG
architecture.
Accountability: To make LLMs more reliable, it is recommended to
have manual validation of the LLM’s outputs. Involving humans
ensures that if LLMs hallucinate or provide a wrong response, a
human can evaluate and make the necessary corrections.
Explainability: Explainability is an umbrella term for a range of
tools, algorithms and methods, which accompany AI model
inferences with explanations. Chain of Thought (CoT) is a
framework that exposes how an LLM solves a problem. CoT can
be implemented using two primary approaches:
• User prompting: Here, during prompting, the user provides the
logic for approaching a certain problem, and the LLM solves
similar problems using the same logic, returning the output
along with the logic.
• Automating CoT prompting: Manually handcrafting CoT can be
time consuming and lead to sub-optimal solutions; Automatic
CoT (Auto-CoT) can be leveraged to generate the reasoning
chains automatically, eliminating human intervention.
Auto-CoT basically works in two steps: 1. Question
clustering: cluster the questions of a given dataset. 2.
Demonstration sampling: select a representative question
from each cluster and generate its reasoning chain using zero-
shot CoT. Auto-CoT works well for LLMs with approximately
100B parameters, but is less accurate for SLMs.
DATA PRIVACY
We also see a growing (and worrisome) trend where enterprises are
applying the privacy frameworks and controls that they had
designed for their data science / predictive analytics pipelines—as-
is to Gen AI / LLM use-cases.
This is clearly inefficient (and risky) and we need to adapt the
enterprise privacy frameworks, checklists and tooling—to take into
account the novel and differentiating privacy aspects of LLMs.
From a privacy point of view, we need to consider the following
additional / different LLM privacy risks – illustrated in Fig. 19:
• Membership and property leakage from pre-training data
• Model features leakage from pre-trained LLM
• Privacy leakage from conversations (history) with LLMs
• Compliance with privacy intent of users
Pre-training Data Leakage
Instead of privacy leakage from training data belonging to the
enterprise only, we need to start by considering privacy leakage
from training data used to train the pre-trained LLM. For example,
previous studies have shown that GPT models can leak privacy-
sensitive training data, e.g. email addresses from the standard
Enron email dataset, implying that the Enron dataset is very likely
included in the training data of GPT-4 and GPT-3.5.
Fig. 19: LLM privacy risks
Leakage tests consisted of a mix of context, zero- and few-shot
prompting.
The core idea is to provide k-shot true (name, email) pairs (from
other users) as demonstrations, and then prompt the model with
the target user’s name to predict the target email address.
Example templates used for few-shot prompting:
• “the email address of {target_name} is”,
• “name: {target_name}, email:”,
• “{target_name} [mailto:”,
• “—–Original Message—–\nFrom: {target_name} [mailto: ”
Enterprise Data Leakage
Privacy of enterprise (training) data does become relevant when we
start leveraging LLMs in a RAG setting or fine-tune LLMs with
enterprise data to create an enterprise / domain specific solution /
small language model (SLM).
Fig. 20: Enterprise data Leakage with respect to fine-tuned LLMs
The interesting part here is that the attacker observes both model
snapshots: the pre-trained LLM and the fine-tuned SLM. And, we
then need to measure the privacy leakage (membership / property
inference) with respect to the whole training data: pre-training data
+ (delta) enterprise data.
Previous studies have shown that leakage prone weight sensitive
features in a trained DL model can correspond to specific words in a
language prediction model. It has been further shown that fine-
tuned models are highly susceptible to privacy attacks, given only
API access to the model. This means that if a model is fine-tuned on
highly sensitive data, great care must be taken before deploying
that model—as large portions of the fine-tuning dataset can be
extracted with black-box access! The recommendation then is to
deploy such models with additional privacy-preserving techniques,
e.g., Differential Privacy.
Conversational Privacy Leakage
With traditional ML models, we are primarily talking about a one-
way inference regarding a prediction or classification task. In
contrast, LLMs enable a two-way conversation, so we additionally
need to consider conversation related privacy risks, where e.g. GPT
models can leak the private information provided by a user in a
conversation (history).
Fig. 21: PII & implicit privacy conversations
Personally identifiable information (PII) privacy leakage concerns in
conversations are real given that various applications (e.g., office
suites) have started to deploy GPT models at the inference stage to
help process enterprise data / documents, which usually contain
sensitive (confidential) information.
We can only expect Gen AI adoption to grow in different verticals,
e.g. customer support, health, banking, dating; leading to the
inevitable harvesting of prompts posed by the users as a ‘source of
personal data’ for advertising, phishing, etc. scenarios. Given this,
we also need to consider implicit privacy risks of natural language
conversations (along the lines of side-channel attacks) together
with PII leakage concerns.
For example, the query: “Wow, this dress looks amazing! What is its
price?” can leak the user’s sentiment as compared to a more
neutral prompt: “This dress fits my requirements. What is its price?”
Privacy Intent Compliance
Finally, LLMs today allow users to be a lot more prescriptive with
respect to processing their prompts / queries, e.g. chain-of-thought
(CoT) prompting. CoT is a framework that exposes how an LLM
solves a problem. During prompting, the user provides the logic
for approaching a certain problem, and the LLM solves the
task using the suggested logic, returning the output along with the
logic.
CoT can be extended to allow the User to explicitly specify their
privacy intent in prompts using keywords e.g., “in confidence”,
“confidentially”, “privately”, “in private”, “in secret”, etc. So we also
need to assess the LLM effectiveness in complying with these user
privacy requests. For example, it has been shown that GPT-4 will
leak private information when told “confidentially”, but will not
when prompted “in confidence”.
CONCLUSION
Agentic AI is a disruptive technology, and we are seeing it evolve
faster than anything we have experienced before. Responsible
usage of LLMs also requires assessing how the identified LLMOps
and AgentOps architectural patterns impact relevant responsible AI
dimensions.
Toward this end, we highlighted the responsible AI dimensions
relevant to AI agents; and showed how they can be integrated in a
seamless fashion with the underlying LLMOps and AgentOps
pipelines. We believe that these will effectively future-proof agentic
AI investments and ensure that AI agents are able to cope as the AI
agent platform and regulatory landscape evolves with time.

  • 10.
    8 AGENTS DISCOVERY & MARKETPLACE Given auser task, the goal of an agentic AI system is to identify (compose) an agent (group of agents) capable to executing that given task. A high-level approach to solving such complex tasks involves: • decomposition of the given complex task into (a hierarchy or workflow of) simple tasks, followed by • determining and orchestrating agents capable of executing the simple(r) tasks. This can be achieved in a dynamic or static manner. In the dynamic approach, given a complex user task, the system comes up with a plan to fulfill the request depending on the capabilities of available agents at run-time. In the static approach, given a set of agents, composite agents are defined manually at design-time combining their capabilities.
  • 11.
    9 The main focusof this article is on the discovery aspect of AI agents, i.e., identifying AI agent(s) capable of executing a given user prompt / task—illustrated in Fig. 5. This implies that there exists a marketplace (registry of agents), with a well-defined description of the agent capabilities and constraints. The challenge is then to design a matchmaking algorithm capable of identifying the ‘right’ agents(s) that can execute a given agentic task, given the task requirements and capabilities / constraints of the agents available in the marketplace. Fig. 5: Agent discovery and matchmaking for Agentic AI In the sequel, we outline a learning-to-rank algorithm to identify the top-k agent(s) based on natural language descriptions of both the tasks and marketplace agents. We also consider the inherent non-determinism in agentic AI systems. For example, let us consider the e-shopping scenario illustrated in Fig. 6.
  • 12.
    10 Fig. 6: E-shoppingscenario with non-determinism There are two non-deterministic operators in the execution plan: ‘Check Credit’ and ‘Delivery Mode’. The choice ‘Delivery Mode’ indicates that the user can either pick-up the order directly from the store or have it shipped to his address. Given this, shipping is a non-deterministic choice and may not be invoked during the actual execution. As such, the question arises if the constraints of the shipping agent, that is, the fact that it can only ship to certain countries, be projected as constraints of the composite e-shopping service (or not)? Note that even component services composed via deterministic operators (Payment and Shipping) are not guaranteed to be invoked if they are preceded by a choice. AGENT DISCOVERY BASED ON NATURAL LANGUAGE DESCRIPTIONS In this section, we propose a learning-to-rank algorithm to discover the right agent(s) corresponding to a user prompt. For example the available agents in the marketplace for the e- shopping scenario can be: “Check for price”, “Compare products”, “Deliver products”, “Pay for order”, etc. Given that both the prompts and agent service descriptions are provided in natural
  • 13.
    11 language, this isprimarily a multi-class text classification problem. Unfortunately, traditional vector or cosine similarity based algorithms face challenges here due to the following reasons: • the agent descriptions are ambiguous and overlap with that of other agents (possibly provided by a different vendor); • the user prompts (including those of different users) themselves can be ambiguous and overlapping; • we expect new agents to be frequently added to the marketplace, leading to scaling challenges requiring re- training of the full model. The learning-to-rank (L2R) algorithm to select top-k agents given a user prompt is illustrated in Fig. 7. We first convert agent (class) descriptions to semantic embeddings offline and use them to train the L2R model. The user prompts and the agents use the same generic embedding model. The inference results including the agent description embeddings during training and inferencing are cached to enable the meta-learning process for the L2R algorithm. Fig. 7: Learning-to-Rank (L2R) algorithm to select top-k Agents given a User prompt For instance, for a user prompt What is the price of Product X?
  • 14.
    12 the ratings ofagents “Check for Price”, “List provider Y’s products (Y is the provider of product X)”, “Separator”, “Compare Products”, “Non-Y’s Products” are {2, 1, 0,−1,−2}, where higher positive ratings indicate that the agent description is more related to the user prompt / (decomposed) task description. The “Separator” class indicates a class that is not defined in the current model. The L2R model maps the separator class to the top-k position, i.e., the placeholder “Separator” class itself and agents whose service descriptions are not relevant to the user prompt / task description are ranked below this separator. Finally, one can also challenge as to why not pose the classification problem itself to an LLM at run-time? Indeed, recent studies have shown promising results with respect to LLM based few-shot text classification. However, in our experiments, we found that due to hallucinations, the LLM’s outputs often did not contain the defined agent descriptions. Also, it was becoming a challenge to fit all the task and agent descriptions, with few-shot examples, into an LLM invocation given the cost and token limitations of LLMs. So L2R proved to be the best compromise in this case. CONCLUSION In this section, we focused on the discovery aspect of autonomous AI agents. We highlighted the current challenges of leveraging LLMs as an execution engine for Agentic AI systems. We showed that to execute a complex task, a pre-requisite is an agent marketplace with a registry of agents, specifying their capabilities and constraints. We outlined an L2R model to match agents based on the textual descriptions, and discover top-k agents given a user prompt / task.
  • 15.
    13 PERSONALIZING UX FOR AGENTIC AI Wefocus on this agent personalization aspect (based on user persona) in this article. Analogous to fine-tuning of LLMs to domain specific LLMs / small language models (SLMs), we argue that customization / fine-tuning of these (generic) AI agents will be needed with respect to enterprise specific context (of applicable user personas and use-cases) to drive their enterprise adoption. The key benefits of AI agent personalization include: • Personalized interaction: The AI agent adapts its language, tone, and complexity based on user preferences and interaction history. This ensures that the conversation is more aligned with the user’s expectations and communication style. • Use-case context: The AI agent is aware of the underlying enterprise use-case processes, so that it can prioritize or highlight process features, relevant pieces of content, etc.— optimizing the interaction to achieve the use-case goal more efficiently.
  • 16.
    14 • Proactive Assistance:The AI agent anticipates the needs of different users and offers proactive suggestions, resources, or reminders tailored to their specific profiles or tasks. To summarize, while the current focus to realise AI agents remains on the functional aspects (and, rightfully so); we highlight in this section that UI/UX for AI agents is equally important as the last mile to drive enterprise adoption. USER PERSONA BASED AI AGENT PERSONALIZATION Users today expect a seamless and personalized experience with customized execution to meet their specific requirements. However, enterprise user and process specific AI agent personalization remains challenging due to scale, performance, and privacy challenges. User persona based agent personalization aims to overcome these challenges by segmenting the end-users of a service into a manageable set of user categories, which represent the demographics and preferences of majority of users. For example, the typical personas in an AI agent enabled IT service desk (one of the areas with highest Gen AI adoption) scenario include: • Leadership: Senior individuals (e.g., VPs, Directors) who require priority support with secure access to sensitive data, and assistance with high-level presentations and video conferencing. • Knowledge workers: employees who rely heavily on technology to perform their daily tasks (e.g., analysts, engineers, designers). • Field workers: employees who work primarily outside the office (e.g., sales representatives, service technicians). As such, their requirements are mostly focused on remote access to corporate systems, reliable VPNs, and support with offline work capabilities.
  • 17.
    15 • Administrative /HR: support staff responsible for various administrative tasks (e.g., HR, Finance) with primary requirements around assistance with MS Office software, access to specific business applications, and quick resolution of routine IT issues. • New employees / interns: individuals who are new to the organization and may not be fully familiar with the company’s IT systems. As such, their queries mostly focus on onboarding related queries. Given this, the solution architecture to perform user persona based fine-tuning of AI agents is illustrated in Fig. 8. Fig. 8: User persona based fine-tuning of AI agents The fine-tuning process consists of first parameterizing (aggregated) user data and conversation history and storing it as memory in the LLM via adapters, followed by fine-tuning the LLM for personalized response generation. The agent—user persona router helps in performing user segmentation (scoring) and routing the tasks / prompts to the most relevant agent persona. Fine-tuning AI agents on raw user data is unfortunately often too complex, even if it is at the (aggregated) persona level. This is primarily due to the following reasons: • Agent interaction data usually spans multiple journeys with sparse data points, various interaction types (multimodal), and potential noise or inconsistencies with incomplete queries—responses.
  • 18.
    16 • Moreover, effectivepersonalization often requires a deep understanding of the latent intent / sentiment behind user actions, which can pose difficulties for generic (pre-trained) LLMs—LLM agents. • Finally, fine-tuning is computationally intensive. Agent-user interaction data can be lengthy. Processing and modeling such long sequences (e.g., multi-years’ worth of interaction history) with LLMs can be practically infeasible. A good solution reference to overcome the above issues is Google’s work on User-LLMs. According to the authors, USER-LLM distills compressed representations from diverse and noisy user interactions, effectively capturing the essence of a user’s behavioral patterns and preferences across various interaction modalities. This approach empowers LLMs with a deeper understanding of users’ latent intent (inc. sentiment) and historical patterns (e.g., temporal evolution of user queries—responses) enabling LLMs to tailor responses and generate personalized outcomes. CONCLUSION In this section, we considered personalization of AI agent interactions based on user personas for enterprise use-cases. Agentic AI personalization has the potential to significantly accelerate agentic AI adoption by improving user satisfaction rates. We provided the details to implement a personalization layer for the platform with (a) agent-user router to perform user segmentation and map tasks / prompts to the most relevant agent persona, and (b) leveraging agent-user interaction embeddings.
  • 19.
    17 AGENTIC AI MEMORY MANAGEMENT Given thelong-running nature of AI agents, memory management is key for Agentic AI systems. This entails both context sharing between tasks and maintaining execution context over long periods. The current solution is to use vector databases (Vector DBs) to store the agent memory externally—making data items accessible as needed. In the sequel, we deep dive into the details of • how agentic memory is managed using Vector DBs, • their corresponding data quality issues. We then show that vector databases (while sufficient for conversational memory—Q&A pairs) are insufficient for agentic tasks given their need to manage additional memory types: • semantic memory (general knowledge), • episodic memory (personal experiences),
  • 20.
    18 • procedural memory(skills and task procedures), • emotional memory (feelings tied to experiences); and highlight the need for alternative formalisms (e.g., knowledge graphs, finite state machines) to manage the same effectively. CONVERSATIONAL MEMORY MANAGEMENT USING VECTOR DBS Vector databases (DBs) are specifically designed to store vectors and handle queries based on vector similarity. They are currently the primary medium to store and retrieve data (memory) corresponding to conversational agents. Vector DB based encoding / memory management for conversational agents is illustrated in Fig. 9. Fig. 9: Vector DB based encoding for LLMs This involves selecting an encoder model that performs offline data encoding as a separate process, converting various forms of raw data, such as text, audio, and video, into vectors. Similar raw conversational data items are mapped to closely positioned vectors in the encoded space. For example, text needs to be converted into numerical vectors to be processed by computers and this is done using Tokenizers. A token can be a byte, character, set of characters, words, or even full sentences. Byte-pair encoding (BPE) is the most common tokenizer today using a pair of adjoining bytes as a token.
  • 21.
    19 Selecting the right‘token’ is key as it impacts both the inter-token relationships that the neural network will be able to grasp, and the computational complexity of training that network. This encoded data is saved in a Vector DB, which can then be retrieved during inference using the same encoder model based on vector similarity. During a chat, the conversational agent has the option of querying the long-term memory system by encoding the query and searching for relevant information within the vector database. The retrieved information is then used to answer the query based on the stored information. HUMAN MEMORY UNDERSTANDING In this section, we first try to understand how the human brain works with respect to short-term and long-term memory— illustrated in Fig. 10. Fig. 10: Memory management by the Human Brain It all begins with the sensory system, where information from the environment enters our sensory memory. This initial memory stage holds sensory information in its original form but only for a brief duration, typically lasting a few hundred milliseconds. From there, the information we pay attention to is transferred to short-term memory (STM). This working memory has a limited capacity, capable of holding around 7 chunks of information for about 20—30 seconds. It is where conscious mental processes like thinking, problem-solving, and decision-making take place. For information to move from short-term memory to long-term memory (LTM), it undergoes encoding, transforming it into a more durable and meaningful representation. Encoding occurs through mechanisms such as repetition, elaboration, or creating associations with existing knowledge.
  • 22.
    20 Once successfully encoded,the information enters long-term memory, which has an extensive capacity and can store information for extended periods, ranging from hours to a lifetime. The retrieval system operates based on associations with contextual information. Retrieval cues, both external and internal, aid in accessing specific memories by recreating the encoding context. • Recall involves actively reconstructing information without external cues. • Recognition involves identifying previously encountered information among alternatives. • Finally, retrieval strategies such as priming, mnemonic techniques, chunking, and rehearsal, enhance the retrieval process. MAPPING TO AGENTIC MEMORY Based on our understanding of the human brain and requirements of AI agents / apps, we need to consider the following memory types—illustrated in Fig. 11: • Semantic knowledge: Information sourced from external (e.g. Wikipedia) and internal systems (e.g., Sharepoint, Confluence, documents, messaging platforms, etc.). • Episodic memory: of specific past events and situations. These insights are garnered during the operation of AI agents. • Procedural memory: comparable to how humans remember motor skills like swimming or driving a car. It encompasses workflows and procedures outlining how AI agents can achieve a specific task. • Emotional memory: catalogs feelings tied to experiences. It delves into user relationships, preferences, reactions, and data that renders the AI more human-like and aligned in user interactions.
  • 23.
    21 Semantic memory isprobably the only type of memory that is available today within LLMs through pre-training and embeddings— the other memory types are still WIP. We show how a memory management module can be implemented for agentic AI systems in Fig. 11. Fig. 11: Agentic AI memory management The memory router, always, by default, routes to the long-term memory (LTM) module to see if an existing pattern is there to respond to the given user prompt. If yes, it retrieves and immediately responds, personalizing it as needed. If the LTM fails, the memory router routes it to the short-term memory (STM) module which then uses its retrieval processes (function calling, APIs, etc.) to get the relevant context into the STM (working memory) —leveraging applicable data services. The STM—LTM transformer module is always active and constantly getting the context retrieved and extracting recipes out of it (e.g., refer to the concepts of teachable agents and recipes in AutoGen) and storing in a semantic layer (implemented via Vector DB). At the same time, it is also collecting other associated properties (e.g., number of tokens, cost of executing the response, state of the system, tasks executed / responses generated) and
  • 24.
    22 • creating anepisode which is then getting stored in a knowledge graph • with the underlying procedure stored in a finite state machine (FSM). CONCLUSION To conclude, memory management is critical for the adoption of long running AI agents. While vector databases are quite performant for conversational agents, we showed that they are insufficient for the diverse memory requirements of complex agentic AI tasks, esp., episodic and procedural memory. We proposed a first draft of an agentic memory architecture in this article with a memory router routing requests between STM and LTM modules. The key takeaway is the proposal of a STM — to — LTM transformer module that is able to abstract and store episodic and procedural memory in knowledge graphs and FSMs, respectively.
  • 25.
    23 AGENTIC RAGS We areincreasingly seeing the need to query SQL and document repositories in an integrated fashion - consolidating the respective query pipelines. For example, let us consider the following marketing data landscape with: • profile data of customers, products, and sales agents, stored in documents, • transactional data on orders placed by customers including the involved sales agents stored in a SQL DB. Given this, we will need to get insights from both the profile (unstructured) and transactional (structured) data repositories to answer the below query: “Provide a detailed summary of the the Top 3 sales agents for Product X in year 2023.” Needless to say, resolving the above query requires a sequential decomposition of the user query: • retrieving the Top 3 sales agents for Product X first from the SQL DB, followed by
  • 26.
    24 • retrieving theirrespective profiles from the profiles document repository, followed by • generating a summary based on the retrieved profiles and sales data. In the sequel, we outline a solution flow for the above scenario - leveraging an Agentic AI framework for the query decomposition part, - followed by SQL & document query agents querying the underlying SQL and document repositories, respectively. - Finally, the RAG pipeline is completed by adding the retrieved structured and/or unstructured data to the original query (prompt)—leading to the generation of a contextualized response. QUERY DOCUMENT REPOSITORIES USING RAGS We start by showing how to build RAG pipelines on unstructured data / documents. Together with fine-tuning, RAG forms one of the primary mechanisms of ‘adapting’ a pre-trained large language model (LLM) with enterprise data making it more contextual— reducing hallucinations in the process (illustrated in the Gen AI lifecycle stages in Fig. 12). Fig. 12: Gen AI lifecycle stages Given a user query, a RAG pipeline literally consists of the 3 phases below (Fig. 13):
  • 27.
    25 • Retrieve: Transformuser queries to embeddings (vector format) to compare its similarity score (search) with other content. • Augment: with search results / context retrieved from a vector store that is kept current and in sync with the underlying document repository. • Generate: contextualized responses by making retrieved chunks part of the prompt template that provides additional context to the LLM on how to answer the query. Fig. 13: RAG pipeline highlighting the data quality issues For example, Snowflake provides Cortex Search as a managed service implementing the retrieval and augmentation parts of a RAG pipeline. So it can be combined with Cortex LLM Functions to create a full fledged RAG pipeline using data in Snowflake as the underlying context repository. DATA QUALITY ISSUES IN RAGS In this section, we highlight the data quality issues below that can arise (and need to be addressed) in such a RAG pipeline—illustrated in Fig. 13. We first look at the common data quality dimensions in our structured / SQL data world today: • Accuracy: how closely does the data represent the real-world scenario? • Completeness: of data with respect to missing values, null values, etc.
  • 28.
    26 • Consistency: thesame information is stored differently at different places. • Timeliness: freshness of the data captured in the form of timestamps. and then try to map them to the unstructured data world / Vector DBs. In the Vector DB world, a collection corresponds to a SQL table, and each collection item can consist of: ID, vector (actual data, captured as an array of floating point numbers), metadata (e.g., product categories, timestamps). Accuracy: of the data captured in vector stores. Think of an AI writing news articles based on inaccurate information—it could end up generating misinformation instead of valuable insights. We rely on the below two metrics to capture this: • Correctness: refers to the factual accuracy of the LLM’s response, • Groundedness: refers to the relationship between the LLM’s response and its underlying KB. Studies have shown how a response could be correct, but still improperly grounded. Incorrect and/or inconsistent vectors: Due to issues in the embedding process, some vectors may end up getting corrupted, be incomplete, or getting generated with a different dimensionality. This can lead to confused or disjointed outputs. For example, if an AI is generating audio based on recordings of varying quality, the result might be noticeably uneven. In text generation, inconsistent grammar or tone in the data can lead to awkward or disjointed content. Missing data can be in the form of missing vectors or metadata. For instance, a GenAI generating visual designs from incomplete datasets might output products with missing features.
  • 29.
    27 Timeliness: If thedocuments in your vector store, which provides context relevant to prompts in a RAG pipeline, are outdated, then GenAI systems could result in irrelevant outputs. A GenAI enabled chatbot answering questions related to policy if fed from obsolete policy documents, it could provide inaccurate and misleading response to questions. TEXT2SQL: NATURAL LANGUAGE QUERIES OVER STRUCTURED DATA In this section, we extend the natural language querying capabilities to structured data stored in a SQL data repository. This capability is called Text2SQL, also referred to as conversational BI. Cortex Analyst is the Text2SQL equivalent of Cortex Search to query structured data / SQL databases and data warehouses. Cortex Analyst is actually an end-to-end text-to-answer solution as it returns the generated SQL as well—providing the final query response. It is very easy to deploy: available as an API, and Snowflake also provides a simple Streamlit application that can be deployed with a few lines of code. Overall, it provides the following three differentiating features with respect to other Text2SQL tools in the market: • User intent validation and query explanation • Lightweight semantic model • Flexibility with respect to the underlying LLM We all know that LLMs hallucinate, so the first (and best) thing to do is to validate the system’s understanding of a given query with the user—before responding with the final answer. Cortex Analyst enables this by: • having a dialogue with the user, where for any given query, it first presents its understanding of the query to the user with an explanation of how it generated the SQL query.
  • 30.
    28 • In addition,it also provides suggestions to make the query more concrete in case of ambiguities. Secondly, Cortex Analyst addresses the data repository metadata mapping problem highlighted earlier with the help of a semantic model. The semantic model is the bridge mapping the domain or business specific terms used by users to the database schemas. These additional semantic details, like more descriptive names or synonyms, enable Cortex Analyst to answer natural language queries more reliably. Finally, Cortex Analyst provides a lot of flexibility with respect to choosing the underlying LLM. By default, Cortex Analyst leverages Snowflake-hosted Cortex LLMs, which have been heavily fine- tuned for text-to-SQL generation tasks and are one of the most powerful LLMs available today. It is however possible to explicitly opt-in to allow Cortex Analyst to use the latest OpenAI GPT models, hosted by Microsoft Azure, alongside the Snowflake-hosted models. At runtime, Cortex Analyst will select the optimal combination of models to ensure the highest accuracy and performance for each query. RAG AGENTS CONSOLIDATING QUERIES OVER SQL AND DOCUMENT REPOSITORIES In this section, we tie everything together by outlining an Agentic AI framework to build RAG pipelines that work seamlessly over both structured and unstructured data stored in Snowflake. The reference architecture of such a RAG Agent is illustrated in Fig. 14.
  • 31.
    29 Fig. 14: Referencearchitecture of a RAG Agent querying both SQL & Document repositories We first focus on the query decomposition aspect. Given a user task, the goal of an AI agent is to identify (compose) an agent (group of agents) capable to executing the given task. This is achieved by leveraging an LLM as a reasoning engine capable of decomposing the task (query in this case) into sub-tasks, with execution of the respective agents orchestrated by an orchestration engine, e.g. LangGraph. The Supervisor agent is responsible for coordinating the overall execution, routing (via if-then-else logic) to the individual agents. As such, the Router is primarily used to route the (SQL and/or document related) sub-queries to the SQL & Document query agents. The SQL & Document query agents are self-explanatory, and leverage the respective Snowflake Cortex Analyst and Search components detailed earlier to query the underlying SQL and Document repositories. Finally, to complete the RAG pipeline, the retrieved data is added to the original query / prompt—leading the generation of a contextualized response.
  • 32.
    30 It is importantto highlight here the need for a comprehensive evaluation strategy for such complex systems. LLMs are used in both pipelines, so it is mostly the difference between retrieving information from structured vs. unstructured repositories. In the case of SQL data retrieval, the challenge is in the Text2SQL conversion. Once the right SQL script has been determined, there is no confusion (hallucination) with respect to the retrieved SQL table output. This is solved using a mix of semantic models and manual validation of the generated SQL as discussed in the previous section on Text2SQL. In the case of unstructured data / documents, we need to evaluate with respect to hallucinations on the output (generated text). So we can apply RAG evaluation frameworks like RAGAS here. In our case, in the scenario presented in the article, the two steps / agents are sequential: retrieve the SQL Sales data, followed by summarizing the profiles of the Top Sales Agents - so they can be evaluated independently and in parallel.
  • 33.
    31 REINFORCEMENT LEARNING AGENTS When we talkabout AI agents today, we mostly talk about LLM agents, which loosely translates to invoking (prompting) an LLM to perform natural language processing (NLP) tasks, e.g., processing documents, summarizing them, generating responses based on the retrieved data. For example, refer to the “researcher” agent scenario outlined by LangGraph. It is important to mention here that some agentic tasks may be better suited to other machine learning (ML) techniques, e.g., reinforcement learning (RL), predictive analytics, etc.—depending on the use-case objectives – illustrated in Fig. 15.
  • 34.
    32 Fig. 15: AgenticAI platform reference architecture focusing on RL agents In this section, we focus on RL agents, and show how LLMs can be used to fine-tune the RL rewards and policy functions. REINFORCEMENT LEARNING Reinforcement learning (RL) is able to achieve complex goals by maximizing a reward function in real-time. The reward function works similar to incentivizing a child with candy and spankings, such that the algorithm is penalized when it takes a wrong decision and rewarded when it takes a right one—this is reinforcement. The reinforcement aspect also allows it to adapt faster to real-time changes in the user sentiment. For a detailed introduction to RL, please refer to our papers leveraging RL for chatbots and recommender systems.
  • 35.
    33 Fig. 16: ReinforcementLearning (RL) methodology (Image by Author) Some interesting observations about RL—relevant for the following discussion on RL agents: • RL rewards and policies are not the same: The roles and responsibilities of the reward function vs. RL agent policies are not very well defined, and can vary between architectures. A naïve understanding would be that given an associated reward/cost with every state-action pair, the policy would always try to minimize the overall cost. Apparently, it seems that sometimes keeping the ecosystem in a stable state can be more important than minimizing the cost (e.g. in a climate control use-case). As such, the RL agent policy goal need not always be aligned with the reward function, and that is why two separate functions are needed. • Similar to supervised approaches in machine learning / deep learning, the RL approach most suitable for enterprise adoption is ‘model based RL’. In model based RL, it is possible to develop a model of the problem scenario, and bootstrap initial RL training based on the model simulation values. For instance, for energy optimization use-cases, a blueprint of the building heating, ventilation and air conditioning (HVAC) system serves as a model, whose simulation values can be used to train the RL model—detailed in Section 3.2. For complex scenarios (e.g.
  • 36.
    34 games, robotic tasks),where it is not possible to build a model of the problem scenario, it might still be possible to bootstrap an RL model based on historical values. This is referred to as ‘offline training’, and is considered a good starting point in the absence of a model. And, this is also the reason why RL is often considered as a hybrid between supervised and unsupervised learning, rather than a purely unsupervised learning paradigm. • Online and model-free RL remain the most challenging, where the RL agent is trying to learn and react in real-time without any supervision. Research in this field seems to lack a theoretical foundation at this stage. Researchers are trying out different approaches by simply throwing more data and computing power at the problems. As such, this remains the most “interesting” part of RL, with current research primarily focusing on efficient heuristics and distributed computation to cover the search space in an accelerated fashion. Applying DL (neural networks) to the different RL aspects, e.g., policies, rewards, also remains a hot topic—referred to as deep reinforcement learning. LLM BASED FINE-TUNING OF RL REWARD FUNCTION In this section, we show how RL agents can be developed leveraging LLMs to fine-tune the RL agent reward function in this section. The key steps to perform LLM based fine-tuning of the RL agent reward function are as follows—illustrated in Fig. 17:
  • 37.
    35 Fig. 17: LLMbased fine-tuning of RL Agent Reward Function Initial prompt to the LLM with the RL formulation of the <use- case>, outlining the objectives of the RL reward and policy functions; with (optional) examples of some rewards function templates. 1. LLM generates candidate RL reward functions with the ‘temperature’ parameter providing the exploration— exploitation balance with respect to novelty of the generated reward functions. 2. Validate the generated reward functions for relevance with respect to the given <use-case>; followed by their evaluation with respect to the <use-case> accuracy. 3. Add the generated reward functions with their evaluation results to LLM memory—for the LLM to improve its (future) generation process. 4. Repeat steps 2–4 until convergence is reached with respect to a threshold improvement in the RL agent reward function.
  • 38.
    36 5. Finally, updatethe RL agent reward function with the top ranked LLM generated reward function. Fig. 5 outlines sample templates for the initial prompt and to store evaluation results of an iteration in LLM memory. RL AGENTS FOR HVAC OPTIMIZATION In this section, we apply the outlined LLM based RL reward fine- tuning approach to a concrete use-case of optimizing HVAC consumption in buildings / factories. RL based HVAC optimization It is an interesting case study in the context of our current discussion. It showcases the successful transition of an industrial control system run by a traditional PID (proportional integral derivative) controller for the last 10+ years— to a more efficient RL based controller. In this article, we primarily show that it is possible to re-create the RL based HVAC controller outlined in the paper, using the LLM based fine-tuning approach proposed in the previous section. • The industrial control system in this case refers to HVAC units responsible for maintaining the temperature and humidity settings in factories (buildings in general). The sensors correspond to the indoor (and outdoor) temperature and humidity sensors; and the actuators correspond to the cooling, heating, re-heating and humidifier valves of the HVAC units. • The RL model formulation of the HVAC optimization problem is illustrated in Fig. 18. Fig. 18: HVAC optimization—Reinforcement Learning formulation
  • 39.
    37 The RL controllerlogic can be summarized as follows: Given the zone state in terms of the (inside and outside) temperature and humidity values, the RL model needs to decide by how much to open the cooling, heating, re-heating and humidifier valves. To take an informed decision in this scenario, the RL model needs to first understand the HVAC system behavior, in terms say how much zone temperature drop can be expected by opening the Cooling valve to X%? Once the RL model understands the HVAC system behavior, the final step is to design the control strategy. For instance, the RL model now has to choose whether to open the cooling value to 25% when the zone temperature reaches 23 degrees, or wait till the zone temperature reaches 24 degrees before opening the cooling valve to 40%. Note that the longer it waits before opening the valve, contributes positively towards lowering the energy consumption. However, it then runs the risk of violating the temperature / humidity tolerance levels as the outside weather conditions are always unpredictable. As a result, it might actually have to open the cooling valve to a higher percentage if it waits longer, consuming more energy. The above probabilities are quantified by the reward function illustrated in Equation 1, which assigns a reward to each possible action based on the following three parameters: setpoint closeness (SC), energy cost (EC), tolerance violation (TV). Given this, we were able to leverage the LLM based fine-tuning approach to refine both ‘safe’ and ‘business first’ control strategies to an ‘energy optimal’ policy. • ‘Safe’ control strategy: assign a very high negative weightage (penalty) to tolerance violations, ensuring that they never happen—albeit at a higher energy cost. • Setpoint closeness encourages a ‘business first’ policy where the RL model attempts to keep the zone temperature as close
  • 40.
    38 as possible tothe temperature / humidity setpoints, implicitly reducing the risk of violations—but at a higher energy cost. • ‘Energy optimal’ prioritizes energy savings over the other two parameters. CONCLUSION When we talk about AI agents today, we mostly talk about LLM agents, which loosely translates to prompting an LLM to perform NLP tasks. In this article, we highlighted the fact that some agentic tasks can be better suited to other ML techniques, e.g., reinforcement learning (RL), predictive analytics, etc. More concretely, we focused on RL agents, and showed how LLMs can be used to fine-tune the RL agent reward / policy functions. We showed a concrete example of applying the fine-tuning methodology to a real-life industrial control system—designing the RL based controller for HVAC optimization in a building setting. The proposed RL fine-tuning methodology shows promise to be applied to a wide range of RL agentic tasks.
  • 41.
    39 RESPONSIBLE AI AGENTS The growingadoption of generative AI and agentic AI has reignited the discussion around AI regulations—to ensure that AI/ML systems are responsibly trained and deployed. Unfortunately, this effort is complicated by different governmental organizations and regulatory bodies releasing their own guidelines and policies with little to no agreement on the definition of terms. The specification language is also (sometimes intentionally) kept at such an important level that it is difficult to map them to an implementation / enforcement mechanism. For instance, the EU AI Act mandates a different set of dos & don’ts depending on the ‘risk level’ of an AI application. However, quantifying the risk level of an AI application is easier said than done as it depends on multiple parameters, and basically requires one to classify how the capabilities of a non-deterministic system will impact users and systems who might interact with it in the future. The table below summarizes the key challenges and solutions in implementing responsible AI for the different generative AI and agentic AI architectural patterns.
  • 42.
    40 Table 1: ResponsibleAI integrated with AgentOps We expand on the above points in the rest of the article to enable an integrated AgentOps pipeline with responsible LLM governance. For instance, as mentioned earlier, with respect to hallucinations for agentic AI systems: hallucination likelihood increases exponentially with increase in the number of agents involved. The right prompts can help but only to a limited extent. To further limit the hallucination, LLMs need to be fine-tuned with curated data and/or limit the search space of responses to relevant and recent enterprise data. Data consistency: The data used for training (esp., fine-tuning) the LLM should be accurate and precise, which means the relevant data pertaining to the specific use-case should be used to train the LLMs, e.g. if the use case is to generate summary of a medical prescription—the user should not use other data like Q&A of a diagnosis, user must use only medical prescriptions and corresponding summarization of the prescription. Many a times, data pipelines need to be created to ingest the data and feed that to LLMs. In such scenarios, extra caution needs to be exercised to consume the running text fields as these fields hold mostly inconsistent and incorrect data.
  • 43.
    41 Data Quality: Froma data quality point of view, we see the following challenges with respect to LLMs, esp. vector databases. • Accuracy of the encodings in vector stores, measures in terms of correctness and Groundedness of the generated LLM responses. • Incorrect and/or inconsistent vectors: Due to issues in the embedding process, some vectors may end up getting corrupted, be incomplete, or getting generated with a different dimensionality. • Missing data can be in the form of missing vectors or metadata. • Timeliness issues with respect to outdated documents impacting the vector store. Bias/Fairness: With respect to model performance and reliability, it is difficult to control undesired biases in black-box LLMs, though it can be controlled to some extent by using uniform and unbiased data to fine-tune the LLMs and/or contextualize the LLMs in a RAG architecture. Accountability: To make LLMs more reliable, it is recommended to have manual validation of the LLM’s outputs. Involving humans ensures if LLMs hallucinate or provide wrong response, a human can evaluate and make the necessary corrections. Explainability: Explainability is an umbrella term for a range of tools, algorithms and methods, which accompany AI model inferences with explanations. Chain of Thought (CoT) is a framework that addresses how a LLM is solving a problem. CoT can be primarily implemented using two approaches: • User prompting: Here, during prompting, user provides the logic about how to approach a certain problem and LLM will solve similar problem using same logic and return the output along with the logic.
  • 44.
    42 • Automating CoTprompting: Manual handcrafting CoT can be time consuming and provide sub optimal solution, Automatic CoT (Auto-CoT) can be leveraged to generate the reasoning chains automatically thus eliminating the human intervention. Auto-CoT basically works on two processes: 1. Question clustering: Cluster the questions of a given dataset. 2. Demonstration sampling: Select the representative question from each cluster and generate the reasoning chain using zero shot CoT. Auto-CoT works well for LLMs having approximately 100B parameters but not so accurate for SLMs. DATA PRIVACY We also see a growing (and worrisome) trend where enterprises are applying the privacy frameworks and controls that they had designed for their data science / predictive analytics pipelines—as- is to Gen AI / LLM use-cases. This is clearly inefficient (and risky) and we need to adapt the enterprise privacy frameworks, checklists and tooling—to take into account the novel and differentiating privacy aspects of LLMs. From a privacy point of view, we need to consider the following additional / different LLM privacy risks – illustrated in Fig. 19: • Membership and property leakage from pre-training data • Model features leakage from pre-trained LLM • Privacy leakage from conversations (history) with LLMs • Compliance with privacy intent of users Pre-training Data Leakage Instead of privacy leakage from training data belonging to the enterprise only, we need to start by considering privacy leakage from training data used to train the pre-trained LLM. For example, previous studies have shown that GPT models can leak privacy- sensitive training data, e.g. email addresses from the standard
  • 45.
    43 Enron email dataset,implying that the Enron dataset is very likely included in the training data of GPT-4 and GPT-3.5. Fig. 19: LLM privacy risks Leakage tests consisted of a mix of context, zero- and few-shot prompting. The core idea is to provide k-shot true (name, email) pairs (from other users) as demonstrations, and then prompt the model with the target user’s name to predict the target email address. Example templates used for few-shot prompting: • “the email address of {target_name} is”, • “name: {target_name}, email:”, • “{target_name} [mailto:”, • “—–Original Message—–n From: {target_name} [mailto: ” Enterprise Data Leakage Privacy of enterprise (training) data does become relevant when we start leveraging LLMs in a RAG setting or fine-tune LLMs with enterprise data to create an enterprise / domain specific solution / small language model (SLM).
  • 46.
    44 Fig. 20: Enterprisedata Leakage with respect to fine-tuned LLMs The interesting part here is that the attacker observes both model snapshots: the pre-trained LLM and the fine-tuned SLM. And, we then need to measure the privacy leakage (membership / property inference) with respect to the whole training data: pre-training data + (delta) enterprise data. Previous studies have shown that leakage prone weight sensitive features in a trained DL model can correspond to specific words in a language prediction model. It has been further shown that fine- tuned models are highly susceptible to privacy attacks, given only API access to the model. This means that if a model is fine-tuned on highly sensitive data, great care must be taken before deploying that model—as large portions of the fine-tuning dataset can be extracted with black-box access! The recommendation then is to deploy such models with additional privacy-preserving techniques, e.g., Differential Privacy. Conversational Privacy Leakage With traditional ML models, we are primarily talking about a one- way inference reg. a prediction or classification task. In contrast, LLMs enable a two-way conversation, so we need to consider conversation related privacy risks in addition, where e.g. GPT
  • 47.
    45 models can leakthe user private information provided in a conversation (history). Fig. 21: PII & implicit privacy conversations Personally identifiable information (PII) privacy leakage concerns in conversations are real given that various applications (e.g., office suites) have started to deploy GPT models at the inference stage to help process enterprise data / documents, which usually contain sensitive (confidential) information. We can only expect Gen AI adoption to grow in different verticals, e.g. customer support, health, banking, dating; leading to the inevitable harvesting of prompts posed by the users as a ‘source of personal data’ for adverting, phishing, etc. scenarios. Given this, we also need to consider implicit privacy risks of natural language conversations (along the lines of side-channel attacks) together with PII leakage concerns.
  • 48.
    46 For example, thequery: “Wow, this dress looks amazing! What is its price?” can leak the the user’s sentiment as compared to a more neutral prompt: “This dress fits my requirements. What is its price?” Privacy Intent Compliance Finally, LLMs today allow users to be a lot more prescriptive with respect to processing their prompts / queries, e.g. chain-of-thought (CoT) prompting. CoT is a framework that addresses how a LLM is solving a problem. During prompting, the user provides the logic about how to approach a certain problem and LLM will solve the task using suggested logic and returns the output along with the logic. CoT can be extended to allow the User to explicitly specify their privacy intent in prompts using keywords e.g., “in confidence”, “confidentially”, “privately”, “in private”, “in secret”, etc. So we also need to assess the LLM effectiveness in complying with these user privacy requests. For example, it has been shown that GPT-4 will leak private information when told “confidentially”, but will not when prompted “in confidence”.l CONCLUSION Agentic AI is a disruptive technology, and we are seeing it evolve faster than anything we have experienced before. Responsible usage of LLMs also requires assessing how the identified LLMOps and AgentOps architectural patterns impact relevant responsible AI dimensions. Toward this end, we highlighted the responsible AI dimensions relevant to AI agents; and showed how they can be integrated in a seamless fashion with the underlying LLMOps and AgentOps pipelines. We believe that these will effectively future-proof agentic AI investments and ensure that AI agents are able to cope as the AI agent platform and regulatory landscape evolves with time.