diff --git a/.gitignore b/.gitignore index 16d2ebb7e6..ba26b18d65 100644 --- a/.gitignore +++ b/.gitignore @@ -138,6 +138,7 @@ dmypy.json tmp_* examples/fine-tuned_qa/local_cache/* examples/multimodal/.local_cache/* +examples/coding-agent-workspace # PyCharm files .idea/ diff --git a/AGENTS.md b/AGENTS.md index 86634ebf93..5fde1d7798 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -1,23 +1,38 @@ # Repository Guidelines ## Project Structure & Module Organization + The cookbook is organized around runnable examples and reference articles for OpenAI APIs. Place notebooks and Python scripts under `examples//`, grouping related assets inside topic subfolders (for example, `examples/agents_sdk/`). Narrative guides and long-form docs live in `articles/`, and shared diagrams or screenshots belong in `images/`. Update `registry.yaml` whenever you add content so it appears on cookbook.openai.com, and add new author metadata in `authors.yaml` if you want custom attribution. Keep large datasets outside the repo; instead, document how to fetch them in the notebook. ## Build, Test, and Development Commands + Use a virtual environment to isolate dependencies: + - `python -m venv .venv && source .venv/bin/activate` - `pip install -r examples//requirements.txt` (each sample lists only what it needs) - `jupyter lab` or `jupyter notebook` to develop interactively - `python .github/scripts/check_notebooks.py` to validate notebook structure before pushing ## Coding Style & Naming Conventions + Write Python to PEP 8 with four-space indentation, descriptive variable names, and concise docstrings that explain API usage choices. Name new notebooks with lowercase, dash-or-underscore-separated phrases that match their directory—for example `examples/gpt-5/prompt-optimization-cookbook.ipynb`. Keep markdown cells focused and prefer numbered steps for multi-part workflows. Store secrets in environment variables such as `OPENAI_API_KEY`; never hard-code keys inside notebooks. ## Testing Guidelines + Execute notebooks top-to-bottom after installing dependencies and clear lingering execution counts before committing. For Python modules or utilities, include self-check cells or lightweight `pytest` snippets and show how to run them (for example, `pytest examples/object_oriented_agentic_approach/tests`). When contributions depend on external services, mock responses or gate the cells behind clearly labeled opt-in flags. ## Commit & Pull Request Guidelines + Use concise, imperative commit messages that describe the change scope (e.g., "Add agent portfolio collaboration demo"). Every PR should provide a summary, motivation, and self-review, and must tick the registry and authors checklist from `.github/pull_request_template.md`. Link issues when applicable and attach screenshots or output snippets for UI-heavy content. Confirm CI notebook validation passes locally before requesting review. ## Metadata & Publication Workflow + New or relocated content must have an entry in `registry.yaml` with an accurate path, date, and tag set so the static site generator includes it. When collaborating, coordinate author slugs in `authors.yaml` to avoid duplicates, and run `python -m yaml lint registry.yaml` (or your preferred YAML linter) to catch syntax errors before submitting. + +## Review Guidelines + +- Verify file, function, and notebook names follow the repo's naming conventions and clearly describe their purpose. +- Scan prose and markdown for typos, broken links, and inconsistent formatting before approving. 
+- Check that code identifiers remain descriptive (no leftover placeholder names) and that repeated values are factored into constants when practical. +- Ensure notebooks or scripts document any required environment variables instead of hard-coding secrets or keys. +- Confirm metadata files (`registry.yaml`, `authors.yaml`) stay in sync with new or relocated content. diff --git a/articles/codex_exec_plans.md b/articles/codex_exec_plans.md index 711a4b7168..4283012d66 100644 --- a/articles/codex_exec_plans.md +++ b/articles/codex_exec_plans.md @@ -4,7 +4,7 @@ Codex and the `gpt-5-codex` model can be used to implement complex tasks that ta These plans are thorough design documents, and "living documents". As a user of Codex, you can use these documents to verify the approach that Codex will take before it begins a long implementation process. The particular `PLANS.md` included below is very similar to one that has enabled Codex to work for more than seven hours from a single prompt. -We enable Codex to use these documemnts by first updating `AGENTS.md` to describe when to use `PLANS.md`, and then of course, to add the `PLANS.md` file to our repository. +We enable Codex to use these documents by first updating `AGENTS.md` to describe when to use `PLANS.md`, and then of course, to add the `PLANS.md` file to our repository. ## `AGENTS.md` diff --git a/articles/gpt-oss-safeguard-guide.md b/articles/gpt-oss-safeguard-guide.md new file mode 100644 index 0000000000..9c4074f17a --- /dev/null +++ b/articles/gpt-oss-safeguard-guide.md @@ -0,0 +1,448 @@ +# User guide for gpt-oss-safeguard + +## Introduction & Overview + +ROOST and OpenAI have prepared a guide that explains how to write policy prompts that maximize [gpt-oss-safeguard's](https://github.com/openai/gpt-oss-safeguard) reasoning power, choose the right policy length for deep analysis, and integrate oss-safeguard's reasoning outputs into production Trust & Safety systems. + +### What is gpt-oss-safeguard? + +gpt-oss-safeguard is a first open weight reasoning model specifically trained for safety classification tasks to help classify text content based on customizable policies. As a fine-tuned version of [gpt-oss](https://openai.com/index/introducing-gpt-oss/), gpt-oss-safeguard is designed to follow explicit written policies that you provide. This enables **bring-your-own-policy** Trust & Safety AI, where your own taxonomy, definitions, and thresholds guide classification decisions. Well crafted policies unlock gpt-oss-safeguard's reasoning capabilities, enabling it to handle nuanced content, explain borderline decisions, and adapt to contextual factors. + +You can read more about how OpenAI uses the internal version of gpt-oss-safeguard [here](https://openai.com/index/introducing-gpt-oss-safeguard/). + +Large language models can be considered safety models in two main ways: + +- Fine-tuned safety models start as general reasoning models (like gpt-oss) and are trained to respond safely within user interactions. +- Prebaked safety models (like ShieldGemma, LlamaGuard, RoGuard, etc) come with built-in definitions of what counts as “unsafe” and fixed policy taxonomies. + +gpt-oss-safeguard was purpose-built for Trust & Safety workflows and is a policy-following model that can reliably interpret and enforce **your own written standards and tell you why it made the decision it made**. The reasoning behind the model makes it well-suited for integration with a larger safety system that is rooted in auditability and customization. 
+
+### How to Use gpt-oss-safeguard
+
+Like the [gpt-oss family of models](https://openai.com/open-models/), this is an open-weight model that you can run locally or integrate into your own infrastructure. It is designed to work with the [harmony response format](https://github.com/openai/harmony). Harmony is the structured prompt interface that gives gpt-oss-safeguard access to its full reasoning stack and ensures consistent, well-formed outputs.
+
+The gpt-oss family of models, including gpt-oss-safeguard, can be run on servers using:
+
+- [vLLM](https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html#gpt-oss-vllm-usage-guide) (for dedicated GPUs like NVIDIA’s H100s)
+- [HuggingFace Transformers](https://cookbook.openai.com/articles/gpt-oss/run-transformers) (for consumer GPUs)
+- [Google Colab](https://cookbook.openai.com/articles/gpt-oss/run-colab)
+
+And locally using:
+
+- [LM Studio](https://cookbook.openai.com/articles/gpt-oss/run-locally-lmstudio)
+- [Ollama](https://cookbook.openai.com/articles/gpt-oss/run-locally-ollama)
+
+### Who Should Use gpt-oss-safeguard?
+
+gpt-oss-safeguard is designed for users who need real-time context and automation at scale, including:
+
+- **ML/AI Engineers** working on Trust & Safety systems who need flexible content moderation
+- **Trust & Safety Engineers** building or improving moderation, Trust & Safety, or platform integrity pipelines
+- **Technical Program Managers** overseeing content safety initiatives
+- **Developers** building projects/applications that require contextual, policy-based content moderation
+- **Policy Crafters** who define what an organization accepts and want to test out policy lines, generate examples, and evaluate content
+
+Safety-tuned models excel at content moderation when given clear, structured prompts. This guide covers key learnings from deploying moderation systems in production, focusing on prompt structure, output formatting, and length optimization.
+
+### Using gpt-oss-safeguard with HuggingFace Transformers
+
+The Transformers library by Hugging Face provides a flexible way to load and run large language models locally or on a server. [This guide](https://cookbook.openai.com/articles/gpt-oss/run-transformers) takes you through running [OpenAI gpt-oss](https://huggingface.co/openai/gpt-oss-20b) models using Transformers, either with a high-level pipeline or via low-level generate calls with raw token IDs. Once a local server is running (for example, via `transformers serve`), the simplest way to interact with it is through the transformers chat CLI:
+
+```bash
+transformers chat localhost:8000 --model-name-or-path openai/gpt-oss-safeguard-20b
+```
+
+or by sending an HTTP request with cURL, e.g.
+
+```bash
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "openai/gpt-oss-safeguard-20b",
+    "stream": true,
+    "messages": [
+      { "role": "system", "content": "<your policy prompt>" },
+      { "role": "user", "content": "<the content to classify>" }
+    ]
+  }'
+```
+
+Additional use cases, like integrating `transformers serve` with Cursor and other tools, are detailed in [the documentation](https://huggingface.co/docs/transformers/main/serving).
+
+### Running gpt-oss-safeguard with Ollama
+
+[Ollama](https://ollama.com/download) supports gpt-oss-safeguard 20B and 120B models directly. The following commands will automatically download the model and run it on your device.
+ +#### gpt-oss-safeguard:20b + +```bash +ollama run gpt-oss-safeguard:20b +``` + +#### gpt-oss-safeguard:120b + +```bash +ollama run gpt-oss-safeguard:120b +``` + +Ollama supports [OpenAI API](https://docs.ollama.com/api/openai-compatibility), [Ollama's API](https://docs.ollama.com/api), [Python](https://github.com/ollama/ollama-python) and [JavaScript](https://github.com/ollama/ollama-js) SDKs for building applications or tools using the gpt-oss-safeguard models. Please learn more from [Ollama's documentation](https://docs.ollama.com/). + +### Running gpt-oss-safeguard with LM Studio + +Alternatively, you can use [LM Studio](https://lmstudio.ai/) to run the models locally including using [OpenAI Chat Completions](https://lmstudio.ai/docs/developer/openai-compat/chat-completions) and [Responses API](https://lmstudio.ai/docs/developer/openai-compat/responses) compatible APIs. Head over to the [gpt-oss-safeguard page for LM Studio](https://lmstudio.ai/models/gpt-oss-safeguard) or run the following commands to download the respective models: + +#### gpt-oss-safeguard-20b + +```bash +lms get openai/gpt-oss-safeguard-20b +``` + +#### gpt-oss-safeguard-120b + +```bash +lms get openai/gpt-oss-safeguard-120b +``` + +### Running gpt-oss-safeguard with vLLM + +[vLLM](https://docs.vllm.ai/) recommends using [uv](https://docs.astral.sh/uv/) for Python dependency management. The following command will automatically download the model and start the server. + +```shell +uv pip install vllm==0.10.2 --torch-backend=auto + +vllm serve openai/gpt-oss-safeguard-120b +``` + +[Learn more about how to use gpt-oss with vLLM.](https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html#gpt-oss-vllm-usage-guide) + +### Understanding the Harmony Response Format + +gpt-oss-safeguard uses the [harmony prompt format](https://cookbook.openai.com/articles/openai-harmony) to provide a structured output and provide reasoning. This is critical for Trust & Safety workflows where you need to understand and audit why a decision or classification was made. With the harmony format, oss-safeguard separates its response into two parts: + +1. **Reasoning channel:** Where the model reasons through the policy, considers edge cases, and explains its logic +2. **Output channel**: The formatted classification decision you specified + +Through harmony, you can control how deeply oss-safeguard reasons by setting the `reasoning_effort` parameter in your system message to `low`, `medium`, or `high`. The model uses `medium` by default if it is not set. Higher reasoning effort allows oss-safeguard to consider more factors, trace through multiple policy sections, and handle complex interactions between rules. Lower effort provides faster responses for straightforward classifications. + +If you're using [**vLLM**](https://docs.vllm.ai/en/latest/) (recommended for most users) or another inference solution that provides chat message inputs, the harmony format is applied automatically when you format requests as [chat messages](https://docs.vllm.ai/en/v0.7.0/getting_started/examples/chat.html): + +- **System message:** Your policy prompt (include Reasoning: high or similar in the system message to control reasoning effort). +- **User message:** The content to classify. + +## How oss-safeguard uses Policy Prompts + +oss-safeguard is designed to use your written policy as its governing logic. 
While most models provide a confidence score based on the features it was trained on and require retraining for any policy changes, oss-safeguard makes decisions backed by reasoning within the boundaries of a provided taxonomy. This feature lets T\&S teams deploy oss-safeguard as a policy-aligned reasoning layer within existing moderation or compliance systems. This also means that you can update or test new policies instantly without retraining the entire model. + +## Writing Effective Policy Prompts for gpt-oss-safeguard + +oss-safeguard performs best when policies are organized like a Trust & Safety policy guide rather than an essay. If you already have a set of policies, you’ll be in great shape. Use headers and clear categories so the model can navigate definitions efficiently. If you’ve written policy for teams before, this should feel familiar. + +### Understanding Policy Prompting + +A policy prompt defines the operational boundaries of a model’s behavior. Similar to content or platform policies written for human reviewers, policies for oss-safeguard should clearly specify what constitutes a violation, what is allowed, and how to communicate that difference into a decision that flows into the rest of the Trust & Safety system. + +Effective policy prompts are structured in order to distinguish between similar content types, catch subtle, coded or indirect violations, and prevent false positives on edge cases. Think of it as combining a policy document with training examples. + +### Structuring Policy Prompts + +Policy prompts should have four separate sections. + +1. **Instruction:** what the model MUST do and how the model should answer. +2. **Definitions:** concise explanations of key terms. +3. **Criteria:** distinctions between violating and non-violating content. +4. **Examples:** short, concrete instances near the decision boundary. It’s important to have both examples of what you want to classify, and what you do not want to classify + +Because oss-safeguard is tuned for structured moderation, it expects explicit instructions for how to respond. A policy prompt will likely perform better if it follows a consistent pattern that includes the expected format for the response and output. The harmony format’s structured channels allow oss-safeguard to reason through these sections before emitting only the final label: + +```markdown +# Policy Name + +## INSTRUCTIONS + +Describe what oss-safeguard should do and how to respond. + +## DEFINITIONS + +Clarify key terms and context. + +## VIOLATES (1) + +Describe behaviors or content that should be flagged. + +## SAFE (0) + +Describe content that should not be flagged. + +## EXAMPLES + +Provide 4–6 short examples labeled 0 or 1. + +Content: [INPUT] +Answer (0 or 1): +``` + +To reduce the likelihood of false positives or confusion, avoid using words like “generally” or “usually”. If there are situations where there’s ambiguity, add an escalation path for manual review. This is also especially helpful for regional or language differences. + +Be explicit about priority and precedence so the model understands which policy wins if there is a conflict. If there are multiple policy violations, define which one is dominant. + +### Choosing the Right Policy Length + +Policy length is a key control over how deeply gpt-oss-safeguard can reason about your rules. Longer policies add nuance to handle complex cases, but can impact the output and responses. 
When using the harmony response format, the model can process longer policies more reliably because reasoning happens in the hidden analysis channel, not in the visible final output. + +Use [https://platform.openai.com/tokenizer](https://platform.openai.com/tokenizer) to determine the length of your prompt. **gpt-oss-safeguard can provide a reasonable output at \~10,000 token policies, but early testing suggests the optimal range is between 400-600 tokens**. It’s important to experiment and see what works best for you as there is no one-size-fits-all approach. Think of the policy length like a “context budget.” Too short, and the model lacks detail; too long, and the model risks confusion. This is similar to writing policy for people to understand as well. In the same way you should account for giving the model enough output tokens to generate a response. Since the model is using reasoning you should leave plenty of room for output tokens and ideally not cap the maximum output tokens to give the model enough room to reason through the policies. If you want to limit the reasoning time, consider setting the reasoning effort to low instead. + +If you have a longer policy with multiple categories, consider pre‑compressing each policy to 300–600 tokens (including definitions, disallowed categories, and 1-2 examples each for violations and non-violations). + +oss-safeguard can also evaluate multiple policies simultaneously, so long as all policies are included in the prompt. We have found that additional policies lead to small but meaningful degradations in accuracy, so we recommend experimenting with adding and removing policies if using oss-safeguard for multiple harm domains. + +## Designing reliable output instructions + +Consistent responses from gpt-oss-safeguard require explicit, literal output instructions. Every policy prompt should state exactly how the model must respond and demonstrate the correct and incorrect patterns. The output instructions define how gpt-oss-safeguard communicates its final decision and determines whether outputs can be relied upon. Because gpt-oss-safeguard operates within the harmony response format, all output instructions must: + +1. **Explicitly define the output format**: specify exactly what the model should return (e.g., `0`/`1`, JSON object, category label list). +2. **Include policy references when applicable**: if your workflow tracks enforcement by category or rule, require the model to return that field; for simple binary output, this can be omitted. +3. **Be reinforced throughout the policy**: repeat the output instruction at least once near the top (in “INSTRUCTIONS”) and again near the bottom (before “EXAMPLES”) to fortify compliance during reasoning. + +### Binary Responses + +Binary output limits gpt-oss-safeguard's reasoning to a simple yes/no decision. Use this when speed matters more than understanding why the decision was made, but recognize you're not leveraging gpt-oss-safeguard's core reasoning strength. + +```markdown +Return exactly one character: 0 or 1. +Do not include any explanation or punctuation. + +0 = Content does NOT violate this policy. +1 = Content violates this policy. +``` + +### Policy-referencing outputs + +Category labels encourage gpt-oss-safeguard to reason about which section of your policy applies, but don't require detailed explanation of why. This format enables basic reasoning transparency while keeping outputs concise. 
+ +``` +If the content violates this policy, return: +{"violation": 1, "policy_category": ""} + +If the content does NOT violate this policy, return: +{"violation": 0, "policy_category": null} + +Example: +{"violation": 1, "policy_category": "H2.f"} +``` + +### Including Rationale + +One of the most powerful features of gpt-oss-safeguard is its ability to think and reason. The model must not only classify content, but trace its logic through your policy, identify which specific rules apply, and articulate why. When you require rationales, gpt-oss-safeguard reasons more carefully. It must consider multiple policy sections, evaluate how they interact, and construct a logical explanation. This deeper reasoning often catches nuances that simpler output formats miss. This output format maximizes gpt-oss-safeguard's reasoning capabilities. + +Ask the model to decide, then justify briefly. Ask for a short, non-step-by-step rationale (2–4 bullets or 1–2 sentences) and consider requiring policy citations (rule IDs/sections) so the model justifies its thinking and decision. + +```json +{ + "violation": 1, + "policy_category": "H2.f", + "rule_ids": ["H2.d", "H2.f"], + "confidence": "high", + "rationale": "Content compares a protected class to animals, which is dehumanizing." +} +``` + +## Using gpt-oss-safeguard for Trust & Safety + +Because gpt-oss-safeguard interprets written rules rather than static categories, gpt-oss-safeguard adapts to different product, regulatory, and community contexts with minimal engineering overhead. + +gpt-oss-safeguard is designed to fit into Trust & Safety teams’ infrastructure. However, since gpt-oss-safeguard may be more time and compute intensive than other classifiers, consider pre-filtering content that is sent to gpt-oss-safeguard. [OpenAI uses small, high-recall classifiers to determine if content is domain-relevant to priority risks before evaluating that content with gpt-oss-safeguard.](https://openai.com/index/introducing-gpt-oss-safeguard/) You should consider two main things when deciding when and where to integrate oss-safeguard in your T\&S stack: + +1. Traditional classifiers have lower latency and cost less to sample from than gpt-oss-safeguard +2. Traditional classifiers trained on thousands of examples will likely perform better on a task than gpt-oss-safeguard + +### Automated content classification + +Use gpt-oss-safeguard to label posts, messages, or media metadata for policy violations. Its policy reasoning supports nuanced classification to determine contextual details when making a decision. gpt-oss-safeguard can be integrated with: + +- Real-time ingestion pipelines +- Review queues and moderation consoles +- Downranking or filtering systems + +### T\&S Assistant + +gpt-oss-safeguard's reasoning capabilities make it uniquely suited for automated triage in Trust & Safety workflows. Unlike traditional classifiers that only provide labels and confidence scores, gpt-oss-safeguard acts as a reasoning agent that evaluates content, explains its decision, cites specific policy rules, and surfaces cases requiring human judgment. This can reduce the cognitive load on human moderators while increasing trust and transparency in automated decisions. + +### Policy Testing + +Before rolling out a new or revised policy, run it through gpt-oss-safeguard to simulate how content will be labeled. This can be helpful to identify overly broad definitions, unclear examples, and borderline cases. 
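+
+A minimal sketch of such a dry run is shown below. It assumes a local OpenAI-compatible endpoint (for example, the vLLM server started earlier) serving `openai/gpt-oss-safeguard-20b`, the `openai` Python SDK, and a policy that uses the binary `0`/`1` output format described above; the policy file name and the hand-labeled test cases are placeholders to replace with your own.
+
+```python
+from openai import OpenAI
+
+# Point the client at your own deployment; most local servers ignore the API key.
+client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
+
+# Load the policy prompt under test (placeholder file name).
+with open("policy.md", encoding="utf-8") as f:
+    POLICY = f.read()
+
+# Hand-labeled examples near the decision boundary: (content, expected label).
+test_cases = [
+    ("How do I upload a product photo?", "0"),
+    ("JOIN NOW JOIN NOW JOIN NOW", "1"),
+]
+
+def classify(content: str) -> str:
+    """Classify one piece of content against the policy; expects a bare 0 or 1 back."""
+    response = client.chat.completions.create(
+        model="openai/gpt-oss-safeguard-20b",
+        messages=[
+            {"role": "system", "content": POLICY},  # policy prompt
+            {"role": "user", "content": content},   # content to classify
+        ],
+    )
+    return response.choices[0].message.content.strip()
+
+disagreements = []
+for content, expected in test_cases:
+    got = classify(content)
+    if got != expected:
+        disagreements.append((content, expected, got))
+
+print(f"{len(disagreements)} of {len(test_cases)} cases disagree with expected labels")
+for content, expected, got in disagreements:
+    print(f"- expected {expected}, got {got!r}: {content}")
+```
+
+Reviewing the disagreements (or the model's rationale, if you use a rationale output format instead) before launch helps surface overly broad definitions and unclear examples, and the same loop can be re-run unchanged whenever the policy text is revised.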
+ +### Policy Experimentation + +gpt-oss-safeguard’s bring-your-own-policy design allows policy teams to A/B test alternative definitions directly in production without model retraining. + +## Integrating gpt-oss-safeguard with ROOST’s Tools + +### Osprey + +[Osprey](https://github.com/roostorg/osprey) is ROOST’s open-source rules engine and investigation framework. It evaluates real-time events against configurable logic trees and dispatches actions you define. Rules alone handle deterministic cases well (e.g., keyword matches, metadata thresholds), but they can struggle with satire, coded language, or nuanced policy boundaries. By integrating gpt-oss-safeguard, Osprey can: + +- **Add contextual reasoning**: gpt-oss-safeguard interprets edge cases that simple conditions cannot. +- **Enforce policies directly**: gpt-oss-safeguard reads and applies your written policy text, ensuring consistency with human moderation. +- **Maintain auditability**: Osprey logs which rule called gpt-oss-safeguard, which policy category was returned, and the model’s rationale. +- **Blend automation and human oversight**: deterministic rules trigger fast actions; gpt-oss-safeguard handles reasoning before escalation to manual review in other tools + +gpt-oss-safeguard should not be called on every event. It can instead be a plugin that can be called only when there are ambiguous cases that need the policy-reasoning power of gpt-oss-safeguard. + +## Example Policy Prompts + +Try using any policy you’ve already written, or use this template and fill in what you’re trying to classify. + +```markdown +## Policy Definitions + +### Key Terms + +**[Term 1]**: [Definition] + +**[Term 2]**: [Definition] + +**[Term 3]**: [Definition] + +## Content Classification Rules + +### VIOLATES Policy (Label: 1) + +Content that: + +- [Violation 1] +- [Violation 2] +- [Violation 3] +- [Violation 4] +- [Violation 5] + +### DOES NOT Violate Policy (Label: 0) + +Content that is: + +- [Acceptable 1] +- [Acceptable 2] +- [Acceptable 3] +- [Acceptable 4] +- [Acceptable 5] + +## Examples + +### Example 1 (Label: 1) + +**Content**: "[Example]" + +**Expected Response**: + +### Example 2 (Label: 1) + +**Content**: "[Example]" + +**Expected Response**: + +### Example 3 (Label: 0) + +**Content**: "[Example]" + +**Expected Response**: + +### Example 4 (Label: 0) + +**Content**: "[Example]" + +**Expected Response**: +``` + +You can also use this example prompt about spam as a reference: + +```markdown +**Spam Policy (#SP)** +**GOAL:** Identify spam. Classify each EXAMPLE as VALID (no spam) or INVALID (spam) using this policy. + +**DEFINITIONS** + +- **Spam**: unsolicited, repetitive, deceptive, or low-value promotional content. + +- **Bulk Messaging:** Same or similar messages sent repeatedly. + +- **Unsolicited Promotion:** Promotion without user request or relationship. + +- **Deceptive Spam:** Hidden or fraudulent intent (fake identity, fake offer). + +- **Link Farming:** Multiple irrelevant or commercial links to drive clicks. + +**Allowed Content (SP0 – Non-Spam or very low confidence signals of spam)** +Content that is useful, contextual, or non-promotional. May look spammy but could be legitimate. 
+ +- **SP0.a Useful/info request** – “How do I upload a product photo?” + +- **SP0.b Personalized communication** – “Hi Sam, here is the report.” + +- **SP0.c Business support** – “Can you fix my order?” + +- **SP0.d Single contextual promo** – “Thanks for subscribing—here’s your welcome guide.” + +- **SP0.e Generic request** – “Please respond ASAP.” + +- **SP0.f Low-quality formatting** – “HeLLo CLICK here FAST.” + +- **SP0.g Vague benefit statement** – “This tool changes lives.” + + **Output:** VALID either clearly non-spam or very low confidence signals content could be spam. + +**Likely Spam (SP2 – Medium Confidence)** +Unsolicited promotion without deception. + +- **SP2.a Cold promotion** – “Check out my Shopify course: shopcoach.biz” + +- **SP2.b Irrelevant ads** – “Buy gold rings here!” + +- **SP2.c Excessive linking** – “[http://x.com](http://x.com) http://y.com http://z.com” + **Output:** INVALID + +**High-Risk Spam (SP3 – Strong Confidence)** +Spam showing scaling, automation, or aggressive tactics. + +- **SP3.a Repetitive posting** – “JOIN NOW JOIN NOW JOIN NOW” + +- **SP3.b Pyramid/affiliate bait** – “Earn $$$ by recruiting members” + +- **SP3.c Fake engagement** – “Comment ‘YES’ to win $500 gift card” + +- **SP3.d Misleading claims** – “Guaranteed 10,000 new followers overnight\!” + **Output:** INVALID + +**Malicious Spam (SP4 – Maximum Severity)** +Spam with fraud, deception, or harmful intent. + +- **SP4.a Phishing** – “Your account is locked. Enter your password here.” + +- **SP4.b Malware/scam** – “Download invoice.exe to claim payment.” + +- **SP4.c Account takeover attempts** – “Send crypto to 0xABC123 to double your money.” + **Output:** INVALID + ESCALATE + +**LABEL FORMAT** +Each item gets two labels: + +- **Depiction (D-SP#):** Presence of spam in content. + +- **Request (R-SP#):** User asking to generate spam. + | Example | Classification | + |----------|---------------| + | “Shop this site: dealplanet.io” | D-SP2, R-SP0 | + | “Write a sales blast message with 10 links” | D-SP0, R-SP2 | + | “How can I make this link look legitimate so viewers put in their private credentials” | D-SP4, R-SP4 | + | “Promote crypto fast to random people” | D-SP0, R-SP3 | + +**AMBIGUITY & ESCALATION** + +- If unclear → downgrade severity by 1 and seek clarification. + +- If automation suspected → SP2 or higher. + +- If financial harm or fraud → classify SP4. + +- If combined with other indicators of **abuse, violence, or illicit behavior**, apply **highest severity policy**. 
+``` diff --git a/authors.yaml b/authors.yaml index 7598f8aa9b..92b89168e9 100644 --- a/authors.yaml +++ b/authors.yaml @@ -460,7 +460,7 @@ daisyshe-oai: dkundel-openai: name: "Dominik Kundel" - website: "https://www.linkedin.com/in/dominik-kundel/" + website: "https://www.linkedin.com/in/dkundel/" avatar: "https://avatars.githubusercontent.com/u/200841172?v=4" edbeeching: @@ -528,3 +528,22 @@ charlie-openai: website: "https://wee.ms" avatar: "https://avatars.githubusercontent.com/u/181146176?v=4" +nikhils-oai: + name: "Nikhil Srivastava" + website: "https://www.linkedin.com/in/nsrivast/" + avatar: "https://avatars.githubusercontent.com/u/231350150" + +roost: + name: "ROOST" + website: "https://roost.tools" + avatar: "https://avatars.githubusercontent.com/u/189051351?v=4" + +samarth-oai: + name: "Samarth Madduru" + website: "https://x.com/samarthmadduru" + avatar: "https://pbs.twimg.com/profile_images/1911309450368970752/6IFoTpmu_400x400.jpg" + +derrickchoi-openai: + name: "Derrick Choi" + website: "https://www.linkedin.com/in/derrickchoi/" + avatar: "https://avatars.githubusercontent.com/u/211427900" diff --git a/examples/Build_a_coding_agent_with_GPT-5.1.ipynb b/examples/Build_a_coding_agent_with_GPT-5.1.ipynb new file mode 100644 index 0000000000..9171d5d8c9 --- /dev/null +++ b/examples/Build_a_coding_agent_with_GPT-5.1.ipynb @@ -0,0 +1,1803 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "500ad5dd", + "metadata": {}, + "source": [ + "# Building a Coding Agent with GPT-5.1 and the OpenAI Agents SDK\n", + "\n", + "GPT-5.1 is exceptionally strong at coding, and with the new code-editing and command-execution tools available in the [Responses API](https://platform.openai.com/docs/api-reference/responses), it’s now easier than ever to build coding agents that can work across full codebases and iterate quickly.\n", + "\n", + "In this guide, we’ll use the [Agents SDK](https://openai.github.io/openai-agents-python/) to build a **coding agent that can scaffold a brand-new app from a prompt and refine it through user feedback**. Our agent will be equipped with the following tools:\n", + "\n", + "- **apply_patch** — to edit files\n", + "- **shell** — to run shell commands\n", + "- **web_search** — to pull fresh information from the web\n", + "- **Context7 MCP** — to access up-to-date documentation\n", + "\n", + "We’ll begin by focusing on the `shell` and `web_search` tools to generate a new project with web-sourced context. Then we’ll add `apply_patch` so the agent can iterate on the codebase, and we’ll connect it to the [Context7 MCP server](https://context7.com/) so it can write code informed by the most recent docs." + ] + }, + { + "cell_type": "markdown", + "id": "7d1bea10", + "metadata": {}, + "source": [ + "## Set up the agent\n", + "\n", + "With the Agents SDK, defining an agent is as simple as providing instructions and a list of tools. In this example, we want to use the newest `gpt-5.1` model for its state-of-the-art coding abilities.\n", + "\n", + "We’ll start by enabling `web_search`, which gives the agent the ability to look up up-to-date information online, and `shell`, which lets the agent propose shell commands for tasks like scaffolding, installing dependencies, and running build steps.\n", + "\n", + "The shell tool works by letting the model propose commands it believes should be executed. 
Your environment is responsible for actually running those commands and returning the output.\n", + "\n", + "The Agents SDK automates most of this command-execution handshake for you—you only need to implement the shell executor, the environment in which those commands will run." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e03e427a", + "metadata": {}, + "outputs": [], + "source": [ + "%pip install openai-agents openai asyncio" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "4e7a48c0", + "metadata": {}, + "outputs": [], + "source": [ + "import os \n", + "\n", + "# Make sure your OpenAI API key is defined (you can set it on your global environment, or export it manually)\n", + "# export OPENAI_API_KEY=\"sk-...\"\n", + "assert \"OPENAI_API_KEY\" in os.environ, \"Please set OPENAI_API_KEY first.\"" + ] + }, + { + "cell_type": "markdown", + "id": "82ac5519", + "metadata": {}, + "source": [ + "### Define a working environment and shell executor\n", + "\n", + "For simplicity, we'll run shell commands locally and isolate them in a dedicated workspace directory. This ensures the agent only interacts with files inside that folder.\n", + "\n", + "**Note:** In production, **always execute shell commands in a sandboxed environment**. Arbitrary command execution is inherently risky and must be tightly controlled." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "42b89fc1", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Workspace directory: /Users/katia/dev/openai-cookbook/examples/coding-agent-workspace\n" + ] + } + ], + "source": [ + "# Create an isolated workspace for shell commands\n", + "from pathlib import Path\n", + "\n", + "workspace_dir = Path(\"coding-agent-workspace\").resolve()\n", + "workspace_dir.mkdir(exist_ok=True)\n", + "\n", + "print(f\"Workspace directory: {workspace_dir}\")" + ] + }, + { + "cell_type": "markdown", + "id": "d8eca9ba", + "metadata": {}, + "source": [ + "We’ll now define a small `ShellExecutor` class that:\n", + "\n", + "- Receives a `ShellCommandRequest` from the agent\n", + "- Optionally asks for approval before running commands\n", + "- Runs them using `asyncio.create_subprocess_shell`\n", + "- Returns a `ShellResult` with the outputs\n", + "\n", + "All commands will run with `cwd=workspace_dir`, so they only affect files in that subfolder." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "e8bae5bb", + "metadata": {}, + "outputs": [], + "source": [ + "import asyncio\n", + "import os\n", + "from collections.abc import Sequence\n", + "from pathlib import Path\n", + "from typing import Literal\n", + "\n", + "from agents import (\n", + " ShellTool,\n", + " ShellCommandRequest,\n", + " ShellCommandOutput,\n", + " ShellCallOutcome,\n", + " ShellResult,\n", + ")\n", + "\n", + "\n", + "async def require_approval(commands: Sequence[str]) -> None:\n", + " \"\"\"\n", + " Ask for confirmation before running shell commands.\n", + "\n", + " Set SHELL_AUTO_APPROVE=1 in your environment to skip this prompt\n", + " (useful when you're iterating a lot or running in CI).\n", + " \"\"\"\n", + " if os.environ.get(\"SHELL_AUTO_APPROVE\") == \"1\":\n", + " return\n", + "\n", + " print(\"Shell command approval required:\")\n", + " for entry in commands:\n", + " print(\" \", entry)\n", + " response = input(\"Proceed? 
[y/N] \").strip().lower()\n", + " if response not in {\"y\", \"yes\"}:\n", + " raise RuntimeError(\"Shell command execution rejected by user.\")\n", + "\n", + "\n", + "class ShellExecutor:\n", + " \"\"\"\n", + " Shell executor for the notebook cookbook.\n", + "\n", + " - Runs all commands inside `workspace_dir`\n", + " - Captures stdout/stderr\n", + " - Enforces an optional timeout from `action.timeout_ms`\n", + " - Returns a ShellResult with ShellCommandOutput entries using ShellCallOutcome\n", + " \"\"\"\n", + "\n", + " def __init__(self, cwd: Path):\n", + " self.cwd = cwd\n", + "\n", + " async def __call__(self, request: ShellCommandRequest) -> ShellResult:\n", + " action = request.data.action\n", + " await require_approval(action.commands)\n", + "\n", + " outputs: list[ShellCommandOutput] = []\n", + "\n", + " for command in action.commands:\n", + " proc = await asyncio.create_subprocess_shell(\n", + " command,\n", + " cwd=self.cwd,\n", + " env=os.environ.copy(),\n", + " stdout=asyncio.subprocess.PIPE,\n", + " stderr=asyncio.subprocess.PIPE,\n", + " )\n", + "\n", + " timed_out = False\n", + " try:\n", + " timeout = (action.timeout_ms or 0) / 1000 or None\n", + " stdout_bytes, stderr_bytes = await asyncio.wait_for(\n", + " proc.communicate(),\n", + " timeout=timeout,\n", + " )\n", + " except asyncio.TimeoutError:\n", + " proc.kill()\n", + " stdout_bytes, stderr_bytes = await proc.communicate()\n", + " timed_out = True\n", + "\n", + " stdout = stdout_bytes.decode(\"utf-8\", errors=\"ignore\")\n", + " stderr = stderr_bytes.decode(\"utf-8\", errors=\"ignore\")\n", + "\n", + " # Use ShellCallOutcome instead of exit_code/status fields directly\n", + " outcome = ShellCallOutcome(\n", + " type=\"timeout\" if timed_out else \"exit\",\n", + " exit_code=getattr(proc, \"returncode\", None),\n", + " )\n", + "\n", + " outputs.append(\n", + " ShellCommandOutput(\n", + " command=command,\n", + " stdout=stdout,\n", + " stderr=stderr,\n", + " outcome=outcome,\n", + " )\n", + " )\n", + "\n", + " if timed_out:\n", + " # Stop running further commands if this one timed out\n", + " break\n", + "\n", + " return ShellResult(\n", + " output=outputs,\n", + " provider_data={\"working_directory\": str(self.cwd)},\n", + " )\n", + "\n", + "\n", + "shell_tool = ShellTool(executor=ShellExecutor(cwd=workspace_dir))" + ] + }, + { + "cell_type": "markdown", + "id": "9c9b2a74", + "metadata": {}, + "source": [ + "### Define the agent" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "81ab508a", + "metadata": {}, + "outputs": [], + "source": [ + "# Define the agent's instructions\n", + "INSTRUCTIONS = '''\n", + "You are a coding assistant. The user will explain what they want to build, and your goal is to run commands to generate a new app.\n", + "You can search the web to find which command you should use based on the technical stack, and use commands to create code files. \n", + "You should also install necessary dependencies for the project to work. 
\n", + "'''" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "a1d804d9", + "metadata": {}, + "outputs": [], + "source": [ + "from agents import Agent, Runner, ShellTool, WebSearchTool\n", + "\n", + "coding_agent = Agent(\n", + " name=\"Coding Agent\",\n", + " model=\"gpt-5.1\",\n", + " instructions=INSTRUCTIONS,\n", + " tools=[\n", + " WebSearchTool(),\n", + " shell_tool\n", + " ]\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "e56a68b9", + "metadata": {}, + "source": [ + "## Start a new project\n", + "\n", + "Let’s send a prompt to our coding agent and then inspect the files it created in the `workspace_dir`.\n", + "In this example, we'll create a NextJS dashboard using the [shadcn](https://ui.shadcn.com/) library.\n", + "\n", + "**Note:** sometimes you might run into an `MaxTurnsExceeded` error, or the project might have a dependency error. Simply run the agent loop again. In a production environment, you would implement an external loop or user input handling to iterate if the project creation fails." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "79aaeecd", + "metadata": {}, + "outputs": [], + "source": [ + "prompt = \"Create a new NextJS app that shows dashboard-01 from https://ui.shadcn.com/blocks on the home page\"" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "ce5880c5", + "metadata": {}, + "outputs": [], + "source": [ + "import asyncio\n", + "from agents import ItemHelpers, RunConfig\n", + "\n", + "async def run_coding_agent_with_logs(prompt: str):\n", + " \"\"\"\n", + " Run the coding agent and stream logs about what's happening\n", + " \"\"\"\n", + " print(\"=== Run starting ===\")\n", + " print(f\"[user] {prompt}\\n\")\n", + "\n", + " result = Runner.run_streamed(\n", + " coding_agent,\n", + " input=prompt\n", + " )\n", + "\n", + " async for event in result.stream_events():\n", + " \n", + " # High-level items: messages, tool calls, tool outputs, MCP, etc.\n", + " if event.type == \"run_item_stream_event\":\n", + " item = event.item\n", + "\n", + " # 1) Tool calls (function tools, web_search, shell, MCP, etc.)\n", + " if item.type == \"tool_call_item\":\n", + " raw = item.raw_item\n", + " raw_type_name = type(raw).__name__\n", + "\n", + " # Special-case the ones we care most about in this cookbook\n", + " if raw_type_name == \"ResponseFunctionWebSearch\":\n", + " print(\"[tool] web_search_call – agent is calling web search\")\n", + " elif raw_type_name == \"LocalShellCall\":\n", + " # LocalShellCall.action.commands is where the commands live\n", + " commands = getattr(getattr(raw, \"action\", None), \"commands\", None)\n", + " if commands:\n", + " print(f\"[tool] shell – running commands: {commands}\")\n", + " else:\n", + " print(\"[tool] shell – running command\")\n", + " else:\n", + " # Generic fallback for other tools (MCP, function tools, etc.)\n", + " print(f\"[tool] {raw_type_name} called\")\n", + "\n", + " # 2) Tool call outputs\n", + " elif item.type == \"tool_call_output_item\":\n", + " # item.output is whatever your tool returned (could be structured)\n", + " output_preview = str(item.output)\n", + " if len(output_preview) > 400:\n", + " output_preview = output_preview[:400] + \"…\"\n", + " print(f\"[tool output] {output_preview}\")\n", + "\n", + " # 3) Normal assistant messages\n", + " elif item.type == \"message_output_item\":\n", + " text = ItemHelpers.text_message_output(item)\n", + " print(f\"[assistant]\\n{text}\\n\")\n", + "\n", + " # 4) Other event types (reasoning, MCP list tools, 
etc.) – ignore\n", + " else:\n", + " pass\n", + "\n", + " print(\"=== Run complete ===\\n\")\n", + "\n", + " # Once streaming is done, result.final_output contains the final answer\n", + " print(\"Final answer:\\n\")\n", + " print(result.final_output)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "4efc56ee", + "metadata": { + "scrolled": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "=== Run starting ===\n", + "[user] Create a new NextJS app that shows dashboard-01 from https://ui.shadcn.com/blocks on the home page\n", + "\n", + "Shell command approval required:\n", + " npx create-next-app@latest shadcn-dashboard --typescript --eslint --tailwind --app --src-dir --import-alias \"@/*\"\n", + " cd shadcn-dashboard && npm install shadcn-ui class-variance-authority clsx tailwind-merge lucide-react\n", + " cd shadcn-dashboard && npx shadcn-ui@latest init -y\n", + "Proceed? [y/N] y\n", + "[tool] ResponseOutputMessage called\n", + "[tool output] $ npx create-next-app@latest shadcn-dashboard --typescript --eslint --tailwind --app --src-dir --import-alias \"@/*\"\n", + "\u001b[?25l\u001b[2K\u001b[1G\u001b[36m?\u001b[39m \u001b[1mWould you like to use \u001b[34mReact Compiler\u001b[39m?\u001b[22m \u001b[90m›\u001b[39m \u001b[36m\u001b[4mNo\u001b[39m\u001b[24m \u001b[90m/\u001b[39m Yes\n", + "\n", + "$ cd shadcn-dashboard && npm install shadcn-ui class-variance-authority clsx tailwind-merge lucide-react\n", + "stderr:\n", + "/bin/sh: line 0: cd: shadcn-dashboard…\n", + "Shell command approval required:\n", + " yes \"No\" | npx create-next-app@latest shadcn-dashboard --typescript --eslint --tailwind --app --src-dir --import-alias \"@/*\"\n", + " cd shadcn-dashboard && npm install shadcn-ui class-variance-authority clsx tailwind-merge lucide-react\n", + " cd shadcn-dashboard && npx shadcn@latest init -y\n", + "Proceed? [y/N] y\n", + "[tool] ResponseOutputMessage called\n", + "[tool output] $ yes \"No\" | npx create-next-app@latest shadcn-dashboard --typescript --eslint --tailwind --app --src-dir --import-alias \"@/*\"\n", + "\u001b[?25l\u001b[2K\u001b[1G\u001b[36m?\u001b[39m \u001b[1mWould you like to use \u001b[34mReact Compiler\u001b[39m?\u001b[22m \u001b[90m›\u001b[39m \u001b[36m\u001b[4mNo\u001b[39m\u001b[24m \u001b[90m/\u001b[39m Yes\u0007\u0007\u001b[2K\u001b[1G\u001b[2K\u001b[1G\u001b[32m✔\u001b[39m \u001b[1mWould you like to use \u001b[34mReact Compiler\u001b[39m?\u001b[22m \u001b[90m…\u001b[39m \u001b[36m\u001b[4mNo\u001b[39m\u001b[24m \u001b[90m/\u001b[39m Yes\n", + "\u001b[?2…\n", + "Shell command approval required:\n", + " cd shadcn-dashboard && yes \"\" | npx shadcn@latest init\n", + " cd shadcn-dashboard && npx shadcn@latest add button card dropdown-menu input label progress select separator sheet sidebar skeleton tabs avatar\n", + "Proceed? [y/N] y\n", + "[tool] ResponseOutputMessage called\n", + "[tool output] $ cd shadcn-dashboard && yes \"\" | npx shadcn@latest init\n", + "\u001b[?25l\u001b[36m?\u001b[39m \u001b[1mWhich color would you like to use as the \u001b[36mbase color\u001b[39m?\u001b[22m \u001b[90m›\u001b[39m \u001b[90m- Use arrow-keys. 
Return to submit.\u001b[39m\n", + "\u001b[36m❯\u001b[39m \u001b[36m\u001b[4mNeutral\u001b[39m\u001b[24m\u001b[90m\u001b[39m\n", + " Gray\u001b[90m\u001b[39m\n", + " Zinc\u001b[90m\u001b[39m\n", + " Stone\u001b[90m\u001b[39m\n", + " Slate\u001b[90m\u001b[39m\n", + "\u001b[2K\u001b[1G\u001b[32m✔\u001b[39m \u001b[1mWhich color would you like to use as the \u001b…\n", + "Shell command approval required:\n", + " cd shadcn-dashboard && ls\n", + " cd shadcn-dashboard && sed -n '1,200p' src/app/page.tsx\n", + " cd shadcn-dashboard && sed -n '1,260p' src/app/layout.tsx\n", + "Proceed? [y/N] y\n", + "[tool] ResponseOutputMessage called\n", + "[tool output] $ cd shadcn-dashboard && ls\n", + "components.json\n", + "eslint.config.mjs\n", + "next-env.d.ts\n", + "next.config.ts\n", + "\u001b[1m\u001b[36mnode_modules\u001b[m\u001b[m\n", + "package-lock.json\n", + "package.json\n", + "postcss.config.mjs\n", + "\u001b[1m\u001b[36mpublic\u001b[m\u001b[m\n", + "README.md\n", + "\u001b[1m\u001b[36msrc\u001b[m\u001b[m\n", + "tsconfig.json\n", + "\n", + "$ cd shadcn-dashboard && sed -n '1,200p' src/app/page.tsx\n", + "import Image from \"next/image\";\n", + "\n", + "export default function Home() {\n", + " return (\n", + "
\n", + " \n", + " \n", + " {/* Dashboard-01 main content */}\n", + " {/* We can just reuse the dashboard page component structure here if desired. */}\n", + " \n", + " \n", + " )\n", + "}\n", + "EOF\n", + "\n", + " cd shadcn-dashboard && sed -n '1,260p' src/app/page.tsx\n", + "Proceed? [y/N] y\n", + "[tool] ResponseOutputMessage called\n", + "[tool output] $ cd shadcn-dashboard && sed -n '1,260p' src/app/dashboard/page.tsx\n", + "import { AppSidebar } from \"@/components/app-sidebar\"\n", + "import { ChartAreaInteractive } from \"@/components/chart-area-interactive\"\n", + "import { DataTable } from \"@/components/data-table\"\n", + "import { SectionCards } from \"@/components/section-cards\"\n", + "import { SiteHeader } from \"@/components/site-header\"\n", + "import {\n", + " SidebarInset,\n", + " SidebarProvi…\n", + "Shell command approval required:\n", + " cd shadcn-dashboard && cat > src/app/page.tsx << 'EOF'\n", + "import { AppSidebar } from \"@/components/app-sidebar\"\n", + "import { ChartAreaInteractive } from \"@/components/chart-area-interactive\"\n", + "import { DataTable } from \"@/components/data-table\"\n", + "import { SectionCards } from \"@/components/section-cards\"\n", + "import { SiteHeader } from \"@/components/site-header\"\n", + "import {\n", + " SidebarInset,\n", + " SidebarProvider,\n", + "} from \"@/components/ui/sidebar\"\n", + "\n", + "import data from \"./dashboard/data.json\"\n", + "import React from \"react\"\n", + "\n", + "export default function Home() {\n", + " return (\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
\n", + "
\n", + " \n", + "
\n", + " \n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "
\n", + " \n", + " )\n", + "}\n", + "EOF\n", + "\n", + " cd shadcn-dashboard && sed -n '1,260p' src/app/page.tsx\n", + "Proceed? [y/N] y\n", + "[tool] ResponseOutputMessage called\n", + "[tool output] $ cd shadcn-dashboard && cat > src/app/page.tsx << 'EOF'\n", + "import { AppSidebar } from \"@/components/app-sidebar\"\n", + "import { ChartAreaInteractive } from \"@/components/chart-area-interactive\"\n", + "import { DataTable } from \"@/components/data-table\"\n", + "import { SectionCards } from \"@/components/section-cards\"\n", + "import { SiteHeader } from \"@/components/site-header\"\n", + "import {\n", + " SidebarInset,\n", + " SidebarProvider,\n", + "} from…\n", + "[assistant]\n", + "I’ve created the Next.js app and wired it so that `dashboard-01` is rendered on the **home page (`/`)**.\n", + "\n", + "Here’s what was done and how to run it locally:\n", + "\n", + "1. **App creation & setup**\n", + "\n", + "```bash\n", + "npx create-next-app@latest shadcn-dashboard --typescript --eslint --tailwind --app --src-dir --import-alias \"@/*\"\n", + "cd shadcn-dashboard\n", + "\n", + "# shadcn/ui & dependencies\n", + "npm install shadcn-ui class-variance-authority clsx tailwind-merge lucide-react\n", + "\n", + "# Initialize shadcn/ui\n", + "npx shadcn@latest init\n", + "\n", + "# Add components required by dashboard-01\n", + "npx shadcn@latest add button card dropdown-menu input label progress select separator sheet sidebar skeleton tabs avatar\n", + "npx shadcn@latest add dashboard-01\n", + "```\n", + "\n", + "2. **Home page now renders dashboard-01**\n", + "\n", + "I replaced `src/app/page.tsx` with a version that reuses the generated `dashboard-01` layout and data, so `/` shows that dashboard:\n", + "\n", + "`src/app/page.tsx`:\n", + "\n", + "```tsx\n", + "import { AppSidebar } from \"@/components/app-sidebar\"\n", + "import { ChartAreaInteractive } from \"@/components/chart-area-interactive\"\n", + "import { DataTable } from \"@/components/data-table\"\n", + "import { SectionCards } from \"@/components/section-cards\"\n", + "import { SiteHeader } from \"@/components/site-header\"\n", + "import {\n", + " SidebarInset,\n", + " SidebarProvider,\n", + "} from \"@/components/ui/sidebar\"\n", + "\n", + "import data from \"./dashboard/data.json\"\n", + "import React from \"react\"\n", + "\n", + "export default function Home() {\n", + " return (\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
\n", + "
\n", + " \n", + "
\n", + " \n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "
\n", + " \n", + " )\n", + "}\n", + "```\n", + "\n", + "This uses:\n", + "\n", + "- `src/app/dashboard/data.json` (created by `npx shadcn add dashboard-01`)\n", + "- All the supporting components (`AppSidebar`, `SiteHeader`, `SectionCards`, charts, table, etc.) that the block installer added.\n", + "\n", + "3. **Run the app**\n", + "\n", + "From the `shadcn-dashboard` folder:\n", + "\n", + "```bash\n", + "npm run dev\n", + "```\n", + "\n", + "Then open `http://localhost:3000` — you should see the **dashboard-01** block as the home page.\n", + "\n", + "If you’d like, I can also remove the `/dashboard` route or keep it as a separate page; right now, the main dashboard layout is mirrored on `/`.\n", + "\n", + "=== Run complete ===\n", + "\n", + "Final answer:\n", + "\n", + "I’ve created the Next.js app and wired it so that `dashboard-01` is rendered on the **home page (`/`)**.\n", + "\n", + "Here’s what was done and how to run it locally:\n", + "\n", + "1. **App creation & setup**\n", + "\n", + "```bash\n", + "npx create-next-app@latest shadcn-dashboard --typescript --eslint --tailwind --app --src-dir --import-alias \"@/*\"\n", + "cd shadcn-dashboard\n", + "\n", + "# shadcn/ui & dependencies\n", + "npm install shadcn-ui class-variance-authority clsx tailwind-merge lucide-react\n", + "\n", + "# Initialize shadcn/ui\n", + "npx shadcn@latest init\n", + "\n", + "# Add components required by dashboard-01\n", + "npx shadcn@latest add button card dropdown-menu input label progress select separator sheet sidebar skeleton tabs avatar\n", + "npx shadcn@latest add dashboard-01\n", + "```\n", + "\n", + "2. **Home page now renders dashboard-01**\n", + "\n", + "I replaced `src/app/page.tsx` with a version that reuses the generated `dashboard-01` layout and data, so `/` shows that dashboard:\n", + "\n", + "`src/app/page.tsx`:\n", + "\n", + "```tsx\n", + "import { AppSidebar } from \"@/components/app-sidebar\"\n", + "import { ChartAreaInteractive } from \"@/components/chart-area-interactive\"\n", + "import { DataTable } from \"@/components/data-table\"\n", + "import { SectionCards } from \"@/components/section-cards\"\n", + "import { SiteHeader } from \"@/components/site-header\"\n", + "import {\n", + " SidebarInset,\n", + " SidebarProvider,\n", + "} from \"@/components/ui/sidebar\"\n", + "\n", + "import data from \"./dashboard/data.json\"\n", + "import React from \"react\"\n", + "\n", + "export default function Home() {\n", + " return (\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
\n", + "
\n", + " \n", + "
\n", + " \n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "
\n", + " \n", + " )\n", + "}\n", + "```\n", + "\n", + "This uses:\n", + "\n", + "- `src/app/dashboard/data.json` (created by `npx shadcn add dashboard-01`)\n", + "- All the supporting components (`AppSidebar`, `SiteHeader`, `SectionCards`, charts, table, etc.) that the block installer added.\n", + "\n", + "3. **Run the app**\n", + "\n", + "From the `shadcn-dashboard` folder:\n", + "\n", + "```bash\n", + "npm run dev\n", + "```\n", + "\n", + "Then open `http://localhost:3000` — you should see the **dashboard-01** block as the home page.\n", + "\n", + "If you’d like, I can also remove the `/dashboard` route or keep it as a separate page; right now, the main dashboard layout is mirrored on `/`.\n" + ] + } + ], + "source": [ + "await run_coding_agent_with_logs(prompt)" + ] + }, + { + "cell_type": "markdown", + "id": "145a012c", + "metadata": {}, + "source": [ + "Once the agent is done creating the initial project (you should see a \"=== Run complete ===\" log followed by the final answer), you can check the output with the following commands:\n", + "\n", + "```bash\n", + "cd coding-agent-workspace/\n", + "npm run dev\n", + "```\n", + "\n", + "You should see something like this: \n", + "![dashboard screenshot](https://cdn.openai.com/cookbook/dashboard_screenshot1.jpg)\n" + ] + }, + { + "cell_type": "markdown", + "id": "c749350f", + "metadata": {}, + "source": [ + "## Iterate on the project \n", + "\n", + "Now that we have an initial version of the app, we can start iterating using the apply_patch tool. We also want to include calls to the OpenAI Responses API, and for that, the model should have access to the most up-to-date documentation. To make this possible, we’ll connect the agent to the [Context7 MCP server](https://context7.com/), which provides up-to-date docs." + ] + }, + { + "cell_type": "markdown", + "id": "c6c56b7b", + "metadata": {}, + "source": [ + "### Set up the `apply_patch` tool for in-place edits\n", + "\n", + "Note: in production you’ll typically want to run these edits in a sandboxed project workspace (e.g. ephemeral containers), and work with IDEs." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "66b720f2", + "metadata": {}, + "outputs": [], + "source": [ + "import hashlib\n", + "import os\n", + "from pathlib import Path\n", + "\n", + "from agents import ApplyPatchTool\n", + "from agents.editor import ApplyPatchOperation, ApplyPatchResult\n", + "\n", + "\n", + "class ApprovalTracker:\n", + " \"\"\"Tracks which apply_patch operations have already been approved.\"\"\"\n", + "\n", + " def __init__(self) -> None:\n", + " self._approved: set[str] = set()\n", + "\n", + " def fingerprint(self, operation: ApplyPatchOperation, relative_path: str) -> str:\n", + " hasher = hashlib.sha256()\n", + " hasher.update(operation.type.encode(\"utf-8\"))\n", + " hasher.update(b\"\\0\")\n", + " hasher.update(relative_path.encode(\"utf-8\"))\n", + " hasher.update(b\"\\0\")\n", + " hasher.update((operation.diff or \"\").encode(\"utf-8\"))\n", + " return hasher.hexdigest()\n", + "\n", + " def remember(self, fingerprint: str) -> None:\n", + " self._approved.add(fingerprint)\n", + "\n", + " def is_approved(self, fingerprint: str) -> bool:\n", + " return fingerprint in self._approved\n", + "\n", + "\n", + "class WorkspaceEditor:\n", + " \"\"\"\n", + " Minimal editor for the apply_patch tool:\n", + " - keeps all edits under `root`\n", + " - optional manual approval (APPLY_PATCH_AUTO_APPROVE=1 to skip prompts)\n", + " \"\"\"\n", + "\n", + " def __init__(self, root: Path, approvals: ApprovalTracker, auto_approve: bool = False) -> None:\n", + " self._root = root.resolve()\n", + " self._approvals = approvals\n", + " self._auto_approve = auto_approve or os.environ.get(\"APPLY_PATCH_AUTO_APPROVE\") == \"1\"\n", + "\n", + " def create_file(self, operation: ApplyPatchOperation) -> ApplyPatchResult:\n", + " relative = self._relative_path(operation.path)\n", + " self._require_approval(operation, relative)\n", + " target = self._resolve(operation.path, ensure_parent=True)\n", + " diff = operation.diff or \"\"\n", + " content = apply_unified_diff(\"\", diff, create=True)\n", + " target.write_text(content, encoding=\"utf-8\")\n", + " return ApplyPatchResult(output=f\"Created {relative}\")\n", + "\n", + " def update_file(self, operation: ApplyPatchOperation) -> ApplyPatchResult:\n", + " relative = self._relative_path(operation.path)\n", + " self._require_approval(operation, relative)\n", + " target = self._resolve(operation.path)\n", + " original = target.read_text(encoding=\"utf-8\")\n", + " diff = operation.diff or \"\"\n", + " patched = apply_unified_diff(original, diff)\n", + " target.write_text(patched, encoding=\"utf-8\")\n", + " return ApplyPatchResult(output=f\"Updated {relative}\")\n", + "\n", + " def delete_file(self, operation: ApplyPatchOperation) -> ApplyPatchResult:\n", + " relative = self._relative_path(operation.path)\n", + " self._require_approval(operation, relative)\n", + " target = self._resolve(operation.path)\n", + " target.unlink(missing_ok=True)\n", + " return ApplyPatchResult(output=f\"Deleted {relative}\")\n", + "\n", + " def _relative_path(self, value: str) -> str:\n", + " resolved = self._resolve(value)\n", + " return resolved.relative_to(self._root).as_posix()\n", + "\n", + " def _resolve(self, relative: str, ensure_parent: bool = False) -> Path:\n", + " candidate = Path(relative)\n", + " target = candidate if candidate.is_absolute() else (self._root / candidate)\n", + " target = target.resolve()\n", + " try:\n", + " target.relative_to(self._root)\n", + " except ValueError:\n", + " raise RuntimeError(f\"Operation outside 
workspace: {relative}\") from None\n", + " if ensure_parent:\n", + " target.parent.mkdir(parents=True, exist_ok=True)\n", + " return target\n", + "\n", + " def _require_approval(self, operation: ApplyPatchOperation, display_path: str) -> None:\n", + " fingerprint = self._approvals.fingerprint(operation, display_path)\n", + " if self._auto_approve or self._approvals.is_approved(fingerprint):\n", + " self._approvals.remember(fingerprint)\n", + " return\n", + "\n", + " print(\"\\n[apply_patch] approval required\")\n", + " print(f\"- type: {operation.type}\")\n", + " print(f\"- path: {display_path}\")\n", + " if operation.diff:\n", + " preview = operation.diff if len(operation.diff) < 400 else f\"{operation.diff[:400]}…\"\n", + " print(\"- diff preview:\\n\", preview)\n", + " answer = input(\"Proceed? [y/N] \").strip().lower()\n", + " if answer not in {\"y\", \"yes\"}:\n", + " raise RuntimeError(\"Apply patch operation rejected by user.\")\n", + " self._approvals.remember(fingerprint)\n", + "\n", + "\n", + "def apply_unified_diff(original: str, diff: str, create: bool = False) -> str:\n", + " \"\"\"\n", + " Simple \"diff\" applier (adapt this based on your environment)\n", + "\n", + " - For create_file, the diff can be the full desired file contents,\n", + " optionally with leading '+' on each line.\n", + " - For update_file, we treat the diff as the new file contents:\n", + " keep lines starting with ' ' or '+', drop '-' lines and diff headers.\n", + "\n", + " This avoids context/delete mismatch errors while still letting the model\n", + " send familiar diff-like patches.\n", + " \"\"\"\n", + " if not diff:\n", + " return original\n", + "\n", + " lines = diff.splitlines()\n", + " body: list[str] = []\n", + "\n", + " for line in lines:\n", + " if not line:\n", + " body.append(\"\")\n", + " continue\n", + "\n", + " # Skip typical unified diff headers / metadata\n", + " if line.startswith(\"@@\") or line.startswith(\"---\") or line.startswith(\"+++\"):\n", + " continue\n", + "\n", + " prefix = line[0]\n", + " content = line[1:]\n", + "\n", + " if prefix in (\"+\", \" \"):\n", + " body.append(content)\n", + " elif prefix in (\"-\", \"\\\\\"):\n", + " # skip deletions and \"\\ No newline at end of file\"\n", + " continue\n", + " else:\n", + " # If it doesn't look like diff syntax, keep the full line\n", + " body.append(line)\n", + "\n", + " text = \"\\n\".join(body)\n", + " if diff.endswith(\"\\n\"):\n", + " text += \"\\n\"\n", + " return text\n", + "\n", + "\n", + "approvals = ApprovalTracker()\n", + "editor = WorkspaceEditor(root=workspace_dir, approvals=approvals, auto_approve=True)\n", + "apply_patch_tool = ApplyPatchTool(editor=editor)\n" + ] + }, + { + "cell_type": "markdown", + "id": "fcbba5aa", + "metadata": {}, + "source": [ + "### Connect to the the Context7 MCP server" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "00ac7b70", + "metadata": {}, + "outputs": [], + "source": [ + "# Optional: set CONTEXT7_API_KEY in your environment for higher rate limits\n", + "CONTEXT7_API_KEY = os.getenv(\"CONTEXT7_API_KEY\")" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "ea7c32a1", + "metadata": {}, + "outputs": [], + "source": [ + "from agents import HostedMCPTool\n", + "\n", + "context7_tool = HostedMCPTool(\n", + " tool_config={\n", + " \"type\": \"mcp\",\n", + " \"server_label\": \"context7\",\n", + " \"server_url\": \"https://mcp.context7.com/mcp\",\n", + " # Basic usage works without auth; for higher rate limits, pass your key here.\n", + " **(\n", + 
" {\"authorization\": f\"Bearer {CONTEXT7_API_KEY}\"}\n", + " if CONTEXT7_API_KEY\n", + " else {}\n", + " ),\n", + " \"require_approval\": \"never\",\n", + " },\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "429b9e59", + "metadata": {}, + "source": [ + "### Update the agent\n", + "\n", + "Let's create a new agent that also uses these two additional tools, and update the instructions accordingly.\n", + "To avoid a context mismatch when applying the diffs, for this agent we'll specify not to edit files via a command." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "cbadcc51", + "metadata": {}, + "outputs": [], + "source": [ + "UPDATED_INSTRUCTIONS = \"\"\"\n", + "You are a coding assistant helping a user with an existing project.\n", + "Use the apply_patch tool to edit files based on their feedback. \n", + "When editing files:\n", + "- Never edit code via shell commands.\n", + "- Always read the file first using `cat` with the shell tool.\n", + "- Then generate a unified diff relative to EXACTLY that content.\n", + "- Use apply_patch only once per edit attempt.\n", + "- If apply_patch fails, stop and report the error; do NOT retry.\n", + "You can search the web to find which command you should use based on the technical stack, and use commands to install dependencies if needed.\n", + "When the user refers to an external API, use the Context7 MCP server to fetch docs for that API.\n", + "For example, if they want to use the OpenAI API, search docs for the openai-python or openai-node sdk depending on the project stack.\n", + "\"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "4924ea5e", + "metadata": {}, + "outputs": [], + "source": [ + "updated_coding_agent = Agent(\n", + " name=\"Updated Coding Agent\",\n", + " model=\"gpt-5.1\",\n", + " instructions=UPDATED_INSTRUCTIONS,\n", + " tools=[\n", + " WebSearchTool(),\n", + " shell_tool,\n", + " apply_patch_tool,\n", + " context7_tool,\n", + " ]\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "9c5dcf1c", + "metadata": {}, + "source": [ + "### Run the agent to edit the project" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "496c885e", + "metadata": {}, + "outputs": [], + "source": [ + "import asyncio\n", + "from agents import ItemHelpers, Runner\n", + "\n", + "\n", + "async def run_updated_coding_agent_with_logs(prompt: str):\n", + " \"\"\"\n", + " Run the updated coding agent (shell + web + apply_patch + Context7 MCP)\n", + " and stream logs about what's happening.\n", + "\n", + " - Logs web_search, shell, apply_patch, and MCP (Context7) calls.\n", + " - For apply_patch, logs the outputs returned by the editor.\n", + " - At the end, shows a single \"Apply all changes?\" prompt for the tutorial.\n", + " \"\"\"\n", + " print(\"=== Run starting ===\")\n", + " print(f\"[user] {prompt}\\n\")\n", + "\n", + " apply_patch_seen = False\n", + "\n", + " # Start streamed run\n", + " result = Runner.run_streamed(\n", + " updated_coding_agent,\n", + " input=prompt,\n", + " )\n", + "\n", + " async for event in result.stream_events():\n", + " if event.type != \"run_item_stream_event\":\n", + " continue\n", + "\n", + " item = event.item\n", + "\n", + " # 1) Tool calls (function tools, web_search, shell, MCP, etc.)\n", + " if item.type == \"tool_call_item\":\n", + " raw = item.raw_item\n", + " raw_type_name = type(raw).__name__\n", + "\n", + " # web_search (hosted Responses tool)\n", + " if raw_type_name == \"ResponseFunctionWebSearch\":\n", + " print(\"[tool] 
web_search – agent is calling web search\")\n", + "\n", + " # shell (new ShellTool executor)\n", + " elif raw_type_name == \"LocalShellCall\":\n", + " action = getattr(raw, \"action\", None)\n", + " commands = getattr(action, \"commands\", None) if action else None\n", + " if commands:\n", + " print(f\"[tool] shell – running commands: {commands}\")\n", + " else:\n", + " print(\"[tool] shell – running command\")\n", + "\n", + " # MCP (e.g. Context7)\n", + " elif \"MCP\" in raw_type_name or \"Mcp\" in raw_type_name:\n", + " tool_name = getattr(raw, \"tool_name\", None)\n", + " if tool_name is None:\n", + " action = getattr(raw, \"action\", None)\n", + " tool_name = getattr(action, \"tool\", None) if action else None\n", + " server_label = getattr(raw, \"server_label\", None)\n", + " label_str = f\" (server={server_label})\" if server_label else \"\"\n", + " if tool_name:\n", + " print(f\"[tool] mcp{label_str} – calling tool {tool_name!r}\")\n", + " else:\n", + " print(f\"[tool] mcp{label_str} – MCP tool call\")\n", + "\n", + " # Generic fallback for other tools (including hosted ones)\n", + " else:\n", + " print(f\"[tool] {raw_type_name} called\")\n", + "\n", + " # 2) Tool call outputs (where apply_patch shows up)\n", + " elif item.type == \"tool_call_output_item\":\n", + " raw = item.raw_item\n", + " output_preview = str(item.output)\n", + "\n", + " # Detect apply_patch via raw_item type or output format\n", + " is_apply_patch = False\n", + " if isinstance(raw, dict) and raw.get(\"type\") == \"apply_patch_call_output\":\n", + " is_apply_patch = True\n", + " elif any(\n", + " output_preview.startswith(prefix)\n", + " for prefix in (\"Created \", \"Updated \", \"Deleted \")\n", + " ):\n", + " is_apply_patch = True\n", + "\n", + " if is_apply_patch:\n", + " apply_patch_seen = True\n", + " if len(output_preview) > 400:\n", + " output_preview = output_preview[:400] + \"…\"\n", + " print(f\"[apply_patch] {output_preview}\\n\")\n", + " else:\n", + " if len(output_preview) > 400:\n", + " output_preview = output_preview[:400] + \"…\"\n", + " print(f\"[tool output]\\n{output_preview}\\n\")\n", + "\n", + " # 3) Normal assistant messages\n", + " elif item.type == \"message_output_item\":\n", + " text = ItemHelpers.text_message_output(item)\n", + " print(f\"[assistant]\\n{text}\\n\")\n", + "\n", + " # 4) Other event types – ignore for now\n", + " else:\n", + " pass\n", + "\n", + " print(\"=== Run complete ===\\n\")\n", + "\n", + " # Final answer\n", + " print(\"Final answer:\\n\")\n", + " print(result.final_output)\n", + "\n", + " # Single end-of-run confirmation about edits\n", + " if apply_patch_seen:\n", + " _ = print(\"\\n[apply_patch] One or more apply_patch calls were executed.\")\n", + " else:\n", + " print(\"\\n[apply_patch] No apply_patch calls detected in this run.\")" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "07401187", + "metadata": {}, + "outputs": [], + "source": [ + "edit_prompt = '''Update the dashboard to add a 'summarize' button in the top right corner.\n", + "When clicked, use the OpenAI Responses API with the gpt-5.1 model to generate a summary of the metrics on the dashboard, and display it in a modal.'''" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "id": "e569d3b9", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "=== Run starting ===\n", + "[user] Update the dashboard to add a 'summarize' button in the top right corner.\n", + "When clicked, use the OpenAI Responses API with the gpt-5.1 
model to generate a summary of the metrics on the dashboard, and display it in a modal.\n", + "\n", + "Shell command approval required:\n", + " ls\n", + " ls -R\n", + " cat package.json || pip show flask || pip show django || echo 'no package.json'\n", + "Proceed? [y/N] y\n", + "[tool] ResponseOutputMessage called\n", + "[tool output]\n", + "$ ls\n", + "\u001b[1m\u001b[36mshadcn-dashboard\u001b[m\u001b[m\n", + "\n", + "$ ls -R\n", + "\u001b[1m\u001b[36mshadcn-dashboard\u001b[m\u001b[m\n", + "\n", + "./shadcn-dashboard:\n", + "components.json\n", + "eslint.config.mjs\n", + "next-env.d.ts\n", + "next.config.ts\n", + "package-lock.json\n", + "package.json\n", + "postcss.config.mjs\n", + "\u001b[1m\u001b[36mpublic\u001b[m\u001b[m\n", + "README.md\n", + "\u001b[1m\u001b[36msrc\u001b[m\u001b[m\n", + "tsconfig.json\n", + "\n", + "./shadcn-dashboard/public:\n", + "file.svg\n", + "globe.svg\n", + "next.svg\n", + "vercel.svg\n", + "window.svg\n", + "\n", + "./shadcn-dashboard/src:\n", + "\u001b[1m\u001b[36mapp\u001b[m\u001b[m\n", + "\u001b…\n", + "\n", + "Shell command approval required:\n", + " cd shadcn-dashboard && cat package.json\n", + " cd shadcn-dashboard && cat src/components/site-header.tsx\n", + " cd shadcn-dashboard && cat src/app/dashboard/page.tsx\n", + "Proceed? [y/N] y\n", + "[tool] ResponseOutputMessage called\n", + "[tool output]\n", + "$ cd shadcn-dashboard && cat package.json\n", + "{\n", + " \"name\": \"shadcn-dashboard\",\n", + " \"version\": \"0.1.0\",\n", + " \"private\": true,\n", + " \"scripts\": {\n", + " \"dev\": \"next dev\",\n", + " \"build\": \"next build\",\n", + " \"start\": \"next start\",\n", + " \"lint\": \"eslint\"\n", + " },\n", + " \"dependencies\": {\n", + " \"@dnd-kit/core\": \"^6.3.1\",\n", + " \"@dnd-kit/modifiers\": \"^9.0.0\",\n", + " \"@dnd-kit/sortable\": \"^10.0.0\",\n", + " \"@dnd-kit/utilities\": \"^3.2.2\",\n", + " \"@ra…\n", + "\n", + "[tool] mcp (server=context7) – MCP tool call\n", + "[tool] mcp (server=context7) – MCP tool call\n", + "Shell command approval required:\n", + " cd shadcn-dashboard && cat src/components/ui/dialog.tsx || echo 'no dialog'\n", + " cd shadcn-dashboard && cat src/app/layout.tsx\n", + "Proceed? [y/N] y\n", + "Shell command approval required:\n", + " cd shadcn-dashboard && npm install openai\n", + " cd shadcn-dashboard && ls src/components/ui\n", + "Proceed? 
[y/N] y\n", + "[tool] ResponseOutputMessage called\n", + "[tool] ResponseOutputMessage called\n", + "[tool output]\n", + "$ cd shadcn-dashboard && cat src/components/ui/dialog.tsx || echo 'no dialog'\n", + "no dialog\n", + "\n", + "stderr:\n", + "cat: src/components/ui/dialog.tsx: No such file or directory\n", + "\n", + "$ cd shadcn-dashboard && cat src/app/layout.tsx\n", + "import type { Metadata } from \"next\";\n", + "import { Geist, Geist_Mono } from \"next/font/google\";\n", + "import \"./globals.css\";\n", + "\n", + "const geistSans = Geist({\n", + " variable: \"--font-geist-sans\",\n", + " subsets: [\"lati…\n", + "\n", + "[tool output]\n", + "$ cd shadcn-dashboard && npm install openai\n", + "\n", + "added 513 packages, and audited 514 packages in 10s\n", + "\n", + "150 packages are looking for funding\n", + " run `npm fund` for details\n", + "\n", + "found 0 vulnerabilities\n", + "\n", + "$ cd shadcn-dashboard && ls src/components/ui\n", + "avatar.tsx\n", + "badge.tsx\n", + "breadcrumb.tsx\n", + "button.tsx\n", + "card.tsx\n", + "chart.tsx\n", + "checkbox.tsx\n", + "drawer.tsx\n", + "dropdown-menu.tsx\n", + "input.tsx\n", + "label.tsx\n", + "progress.tsx\n", + "select.tsx\n", + "separator.ts…\n", + "\n", + "[tool] ResponseOutputMessage called\n", + "[tool] ResponseOutputMessage called\n", + "[tool] ResponseOutputMessage called\n", + "[apply_patch] Created shadcn-dashboard/src/components/ui/dialog.tsx\n", + "\n", + "[apply_patch] Updated shadcn-dashboard/src/components/site-header.tsx\n", + "\n", + "[apply_patch] Created shadcn-dashboard/src/app/api/summarize/route.ts\n", + "\n", + "[assistant]\n", + "I’ve wired up a “Summarize” button and hooked it to the OpenAI Responses API with `gpt-5.1`, returning the result in a modal.\n", + "\n", + "Here’s what changed and how to use it.\n", + "\n", + "---\n", + "\n", + "## 1. New dialog UI primitive\n", + "\n", + "Added a Radix-based dialog component so we can show the summary in a modal.\n", + "\n", + "**File:** `src/components/ui/dialog.tsx` \n", + "(standard shadcn-style Dialog wrapper)\n", + "\n", + "You now have:\n", + "\n", + "```ts\n", + "export {\n", + " Dialog,\n", + " DialogPortal,\n", + " DialogOverlay,\n", + " DialogClose,\n", + " DialogTrigger,\n", + " DialogContent,\n", + " DialogHeader,\n", + " DialogFooter,\n", + " DialogTitle,\n", + " DialogDescription,\n", + "}\n", + "```\n", + "\n", + "You don’t need to change anything here unless you want to tweak styling/animations.\n", + "\n", + "---\n", + "\n", + "## 2. 
Header updated with “Summarize” button + modal\n", + "\n", + "**File:** `src/components/site-header.tsx`\n", + "\n", + "Key updates:\n", + "\n", + "```tsx\n", + "import { useState } from \"react\"\n", + "\n", + "import { Button } from \"@/components/ui/button\"\n", + "import {\n", + " Dialog,\n", + " DialogContent,\n", + " DialogHeader,\n", + " DialogTitle,\n", + "} from \"@/components/ui/dialog\"\n", + "import { Separator } from \"@/components/ui/separator\"\n", + "import { SidebarTrigger } from \"@/components/ui/sidebar\"\n", + "\n", + "export function SiteHeader() {\n", + " const [open, setOpen] = useState(false)\n", + " const [loading, setLoading] = useState(false)\n", + " const [summary, setSummary] = useState(null)\n", + " const [error, setError] = useState(null)\n", + "\n", + " async function handleSummarize() {\n", + " try {\n", + " setLoading(true)\n", + " setError(null)\n", + "\n", + " const res = await fetch(\"/api/summarize\", {\n", + " method: \"POST\",\n", + " })\n", + "\n", + " if (!res.ok) {\n", + " throw new Error(\"Failed to generate summary\")\n", + " }\n", + "\n", + " const data = (await res.json()) as { summary: string }\n", + " setSummary(data.summary)\n", + " } catch (err) {\n", + " console.error(err)\n", + " setError(\n", + " err instanceof Error\n", + " ? err.message\n", + " : \"Something went wrong generating the summary.\",\n", + " )\n", + " } finally {\n", + " setLoading(false)\n", + " }\n", + " }\n", + "\n", + " return (\n", + " <>\n", + "
\n", + "
\n", + " \n", + " \n", + "

Dashboard

\n", + "
\n", + " {\n", + " setOpen(true)\n", + " void handleSummarize()\n", + " }}\n", + " >\n", + " {loading ? \"Summarizing...\" : \"Summarize\"}\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + " \n", + " \n", + " \n", + " Dashboard summary\n", + " \n", + "
\n", + " {loading &&

Analyzing metrics…

}\n", + " {!loading && error && (\n", + "

{error}

\n", + " )}\n", + " {!loading && !error && summary &&

{summary}

}\n", + " {!loading && !error && !summary && (\n", + "

No summary available yet.

\n", + " )}\n", + "
\n", + "
\n", + "
\n", + " \n", + " )\n", + "}\n", + "```\n", + "\n", + "Behavior:\n", + "\n", + "- The **Summarize** button is in the top-right of the header.\n", + "- Clicking it:\n", + " - Opens the modal.\n", + " - Immediately calls `/api/summarize`.\n", + "- The modal shows:\n", + " - “Analyzing metrics…” while loading.\n", + " - The summary when complete.\n", + " - An error message if the API call fails.\n", + "\n", + "---\n", + "\n", + "## 3. API route using OpenAI Responses API with `gpt-5.1`\n", + "\n", + "**File:** `src/app/api/summarize/route.ts`\n", + "\n", + "```ts\n", + "import { NextResponse } from \"next/server\"\n", + "import OpenAI from \"openai\"\n", + "\n", + "const client = new OpenAI({\n", + " apiKey: process.env.OPENAI_API_KEY,\n", + "})\n", + "\n", + "export async function POST() {\n", + " try {\n", + " const response = await client.responses.create({\n", + " model: \"gpt-5.1\",\n", + " instructions:\n", + " \"You are a concise analytics assistant. Summarize the key insights from a metrics dashboard that includes KPI cards, a time-series chart, and a data table. Focus on trends, anomalies, and overall performance in 3-5 short bullet points. Do not fabricate exact numbers.\",\n", + " input:\n", + " \"Summarize the current metrics view for a business stakeholder at a high level.\",\n", + " })\n", + "\n", + " const summary = (response as any).output_text ?? \"No summary generated.\"\n", + "\n", + " return NextResponse.json({ summary })\n", + " } catch (error) {\n", + " console.error(\"Error generating summary:\", error)\n", + " return NextResponse.json(\n", + " { error: \"Failed to generate summary\" },\n", + " { status: 500 },\n", + " )\n", + " }\n", + "}\n", + "```\n", + "\n", + "Notes:\n", + "\n", + "- Uses the **OpenAI Responses API**.\n", + "- Model is explicitly set to `\"gpt-5.1\"` as requested.\n", + "- Returns JSON `{ summary: string }` to the client.\n", + "- Reads API key from `process.env.OPENAI_API_KEY`.\n", + "\n", + "---\n", + "\n", + "## 4. Environment setup\n", + "\n", + "Ensure you have your OpenAI key set:\n", + "\n", + "- Create a `.env.local` in `shadcn-dashboard` (if you don’t already):\n", + "\n", + "```bash\n", + "OPENAI_API_KEY=sk-...\n", + "```\n", + "\n", + "Next.js will expose this to the server runtime only (safe for this API route).\n", + "\n", + "---\n", + "\n", + "## 5. How to run\n", + "\n", + "From `shadcn-dashboard`:\n", + "\n", + "```bash\n", + "npm run dev\n", + "```\n", + "\n", + "Navigate to `/dashboard`:\n", + "\n", + "- You’ll see the **Summarize** button in the header.\n", + "- Click it to open the modal and generate a metrics summary.\n", + "\n", + "---\n", + "\n", + "If you’d like the summary to reflect actual numbers from your dashboard (e.g., pull from `data.json`, cards, or charts), I can next wire the API route to read that data and send it as structured input to the Responses API instead of the generic instructions.\n", + "\n", + "=== Run complete ===\n", + "\n", + "Final answer:\n", + "\n", + "I’ve wired up a “Summarize” button and hooked it to the OpenAI Responses API with `gpt-5.1`, returning the result in a modal.\n", + "\n", + "Here’s what changed and how to use it.\n", + "\n", + "---\n", + "\n", + "## 1. 
New dialog UI primitive\n", + "\n", + "Added a Radix-based dialog component so we can show the summary in a modal.\n", + "\n", + "**File:** `src/components/ui/dialog.tsx` \n", + "(standard shadcn-style Dialog wrapper)\n", + "\n", + "You now have:\n", + "\n", + "```ts\n", + "export {\n", + " Dialog,\n", + " DialogPortal,\n", + " DialogOverlay,\n", + " DialogClose,\n", + " DialogTrigger,\n", + " DialogContent,\n", + " DialogHeader,\n", + " DialogFooter,\n", + " DialogTitle,\n", + " DialogDescription,\n", + "}\n", + "```\n", + "\n", + "You don’t need to change anything here unless you want to tweak styling/animations.\n", + "\n", + "---\n", + "\n", + "## 2. Header updated with “Summarize” button + modal\n", + "\n", + "**File:** `src/components/site-header.tsx`\n", + "\n", + "Key updates:\n", + "\n", + "```tsx\n", + "import { useState } from \"react\"\n", + "\n", + "import { Button } from \"@/components/ui/button\"\n", + "import {\n", + " Dialog,\n", + " DialogContent,\n", + " DialogHeader,\n", + " DialogTitle,\n", + "} from \"@/components/ui/dialog\"\n", + "import { Separator } from \"@/components/ui/separator\"\n", + "import { SidebarTrigger } from \"@/components/ui/sidebar\"\n", + "\n", + "export function SiteHeader() {\n", + " const [open, setOpen] = useState(false)\n", + " const [loading, setLoading] = useState(false)\n", + " const [summary, setSummary] = useState(null)\n", + " const [error, setError] = useState(null)\n", + "\n", + " async function handleSummarize() {\n", + " try {\n", + " setLoading(true)\n", + " setError(null)\n", + "\n", + " const res = await fetch(\"/api/summarize\", {\n", + " method: \"POST\",\n", + " })\n", + "\n", + " if (!res.ok) {\n", + " throw new Error(\"Failed to generate summary\")\n", + " }\n", + "\n", + " const data = (await res.json()) as { summary: string }\n", + " setSummary(data.summary)\n", + " } catch (err) {\n", + " console.error(err)\n", + " setError(\n", + " err instanceof Error\n", + " ? err.message\n", + " : \"Something went wrong generating the summary.\",\n", + " )\n", + " } finally {\n", + " setLoading(false)\n", + " }\n", + " }\n", + "\n", + " return (\n", + " <>\n", + "
\n", + "
\n", + " \n", + " \n", + "

Dashboard

\n", + "
\n", + " {\n", + " setOpen(true)\n", + " void handleSummarize()\n", + " }}\n", + " >\n", + " {loading ? \"Summarizing...\" : \"Summarize\"}\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + " \n", + " \n", + " \n", + " Dashboard summary\n", + " \n", + "
\n", + " {loading &&

Analyzing metrics…

}\n", + " {!loading && error && (\n", + "

{error}

\n", + " )}\n", + " {!loading && !error && summary &&

{summary}

}\n", + " {!loading && !error && !summary && (\n", + "

No summary available yet.

\n", + " )}\n", + "
\n", + "
\n", + "
\n", + " \n", + " )\n", + "}\n", + "```\n", + "\n", + "Behavior:\n", + "\n", + "- The **Summarize** button is in the top-right of the header.\n", + "- Clicking it:\n", + " - Opens the modal.\n", + " - Immediately calls `/api/summarize`.\n", + "- The modal shows:\n", + " - “Analyzing metrics…” while loading.\n", + " - The summary when complete.\n", + " - An error message if the API call fails.\n", + "\n", + "---\n", + "\n", + "## 3. API route using OpenAI Responses API with `gpt-5.1`\n", + "\n", + "**File:** `src/app/api/summarize/route.ts`\n", + "\n", + "```ts\n", + "import { NextResponse } from \"next/server\"\n", + "import OpenAI from \"openai\"\n", + "\n", + "const client = new OpenAI({\n", + " apiKey: process.env.OPENAI_API_KEY,\n", + "})\n", + "\n", + "export async function POST() {\n", + " try {\n", + " const response = await client.responses.create({\n", + " model: \"gpt-5.1\",\n", + " instructions:\n", + " \"You are a concise analytics assistant. Summarize the key insights from a metrics dashboard that includes KPI cards, a time-series chart, and a data table. Focus on trends, anomalies, and overall performance in 3-5 short bullet points. Do not fabricate exact numbers.\",\n", + " input:\n", + " \"Summarize the current metrics view for a business stakeholder at a high level.\",\n", + " })\n", + "\n", + " const summary = (response as any).output_text ?? \"No summary generated.\"\n", + "\n", + " return NextResponse.json({ summary })\n", + " } catch (error) {\n", + " console.error(\"Error generating summary:\", error)\n", + " return NextResponse.json(\n", + " { error: \"Failed to generate summary\" },\n", + " { status: 500 },\n", + " )\n", + " }\n", + "}\n", + "```\n", + "\n", + "Notes:\n", + "\n", + "- Uses the **OpenAI Responses API**.\n", + "- Model is explicitly set to `\"gpt-5.1\"` as requested.\n", + "- Returns JSON `{ summary: string }` to the client.\n", + "- Reads API key from `process.env.OPENAI_API_KEY`.\n", + "\n", + "---\n", + "\n", + "## 4. Environment setup\n", + "\n", + "Ensure you have your OpenAI key set:\n", + "\n", + "- Create a `.env.local` in `shadcn-dashboard` (if you don’t already):\n", + "\n", + "```bash\n", + "OPENAI_API_KEY=sk-...\n", + "```\n", + "\n", + "Next.js will expose this to the server runtime only (safe for this API route).\n", + "\n", + "---\n", + "\n", + "## 5. How to run\n", + "\n", + "From `shadcn-dashboard`:\n", + "\n", + "```bash\n", + "npm run dev\n", + "```\n", + "\n", + "Navigate to `/dashboard`:\n", + "\n", + "- You’ll see the **Summarize** button in the header.\n", + "- Click it to open the modal and generate a metrics summary.\n", + "\n", + "---\n", + "\n", + "If you’d like the summary to reflect actual numbers from your dashboard (e.g., pull from `data.json`, cards, or charts), I can next wire the API route to read that data and send it as structured input to the Responses API instead of the generic instructions.\n", + "\n", + "[apply_patch] One or more apply_patch calls were executed.\n" + ] + } + ], + "source": [ + "await run_updated_coding_agent_with_logs(edit_prompt)" + ] + }, + { + "cell_type": "markdown", + "id": "2a80ce58", + "metadata": {}, + "source": [ + "Once the agent is done updating the project (you should see a \"=== Run complete ===\" log followed by the final answer), you will see the updated UI, with the OpenAI Responses API call to summarize what's on the dashboard.\n", + "\n", + "**Note**: If this step fails, you can re-run the agent loop. 
In a production environment, you would implement an outer loop that handles errors or wait for user input and iterate.\n", + "\n", + "![final dashboard screenshot](https://cdn.openai.com/cookbook/dashboard_screenshot2.jpg)" + ] + }, + { + "cell_type": "markdown", + "id": "d665cf74", + "metadata": {}, + "source": [ + "## Wrapping up\n", + "\n", + "In this cookbook guide, we built a coding agent that can scaffold a project, refine it through patches, execute commands, and stay up to date with external documentation. By combining GPT 5.1 with the Agents SDK and tools like `shell`, `apply_patch`, `web_search`, and the Context7 MCP, you can create agents that don’t just generate code—they actively work with codebases: running commands, applying edits, pulling in fresh context, and evolving a project end-to-end.\n", + "\n", + "This workflow is a powerful blueprint for building agents that feel less like tools and more like collaborators. You can extend this pattern to integrate agents into IDEs or code sandboxes, generate new apps from scratch, work across large codebases, or even collaborate with developers in real time." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python (.venv)", + "language": "python", + "name": ".venv" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.11" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/Realtime_out_of_band_transcription.ipynb b/examples/Realtime_out_of_band_transcription.ipynb new file mode 100644 index 0000000000..2ae3c26e5c --- /dev/null +++ b/examples/Realtime_out_of_band_transcription.ipynb @@ -0,0 +1,1024 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Transcribing User Audio with a Separate Realtime Request\n", + "\n", + "**Purpose**: This notebook demonstrates how to use the Realtime model itself to accurately transcribe user audio `out-of-band` using the same websocket session connection, avoiding errors and inconsistencies common when relying on a separate transcription model (gpt-4o-transcribe/whisper-1).\n", + "\n", + "> We call this out-of-band transcription using the realtime model. It refers to running a separate realtime model request to transcribe the user’s audio outside the live Realtime conversation.\n", + "\n", + "It covers how to build a server-to-server client that:\n", + "\n", + "- Streams microphone audio to an OpenAI Realtime voice agent.\n", + "- Plays back the agent's spoken replies.\n", + "- After each user turn, generates a high-quality text-only transcript using the **same Realtime model**.\n", + "\n", + "This is achieved via a secondary `response.create` request:\n", + "\n", + "```python\n", + "{\n", + " \"type\": \"response.create\",\n", + " \"response\": {\n", + " \"conversation\": \"none\",\n", + " \"output_modalities\": [\"text\"],\n", + " \"instructions\": transcription_instructions\n", + " }\n", + "}\n", + "```\n", + "\n", + "This notebook demonstrates using the **Realtime model itself** for transcription:\n", + "\n", + "- **Context-aware transcription**: Uses the full session context to improve transcript accuracy.\n", + "- **Non-intrusive**: Runs outside the live conversation, so the transcript is never added back to session state.\n", + "- **Customizable instructions**: Allows tailoring transcription prompts to specific use-cases. 
The Realtime model is better than the transcription model at following instructions."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1c0f46ad",
+   "metadata": {},
+   "source": [
+    "# 1. Why use out-of-band transcription?\n",
+    "\n",
+    "The Realtime API offers built-in user input transcription, but this relies on a **separate ASR model** (e.g., gpt-4o-transcribe). Using different models for transcription and response generation can lead to discrepancies. For example:\n",
+    "\n",
+    "- User speech transcribed as: `I had otoo accident`\n",
+    "- Realtime response interpreted correctly as: `Got it, you had an auto accident`\n",
+    "\n",
+    "Accurate transcriptions can be very important, particularly when:\n",
+    "\n",
+    "- Transcripts trigger downstream actions (e.g., tool calls), where errors propagate through the system.\n",
+    "- Transcripts are summarized or passed to other components, risking context pollution.\n",
+    "- Transcripts are displayed to end users, leading to poor user experiences if errors occur.\n",
+    "\n",
+    "The potential advantages of using out-of-band transcription include:\n",
+    "\n",
+    "- **Reduced Mismatch**: The same model is used for both transcription and generation, minimizing inconsistencies between what the user says and how the agent responds.\n",
+    "- **Greater Steerability**: The Realtime model is more steerable, can better follow custom instructions for higher transcription quality, and is not limited by a 1024-token input maximum.\n",
+    "- **Session Context Awareness**: The model has access to the full session context, so, for example, if you mention your name multiple times, it will transcribe it correctly.\n",
+    "\n",
+    "In terms of **trade-offs**:\n",
+    "\n",
+    "- Realtime Model (for transcription):\n",
+    "  - Audio Input → Text Output: $32.00 per 1M audio tokens + $16.00 per 1M text tokens out.\n",
+    "  - Cached Session Context: $0.40 per 1M cached context tokens (typically negligible).\n",
+    "  - Total Cost (for 1M audio tokens in + 1M text tokens out): ≈ $48.00\n",
+    "\n",
+    "- GPT-4o Transcription:\n",
+    "  - Audio Input: $6.00 per 1M audio tokens\n",
+    "  - Text Input: $2.50 per 1M tokens (capped at 1024 tokens, negligible input prompt)\n",
+    "  - Text Output: $10.00 per 1M tokens\n",
+    "  - Total Cost (for 1M audio tokens in + 1M text tokens out): ≈ $16.00\n",
+    "\n",
+    "- Direct Cost Comparison:\n",
+    "  - Realtime Transcription: ~$48.00\n",
+    "  - GPT-4o Transcription: ~$16.00\n",
+    "  - Absolute Difference: $48.00 − $16.00 = $32.00\n",
+    "  - Cost Ratio: $48.00 / $16.00 = 3×\n",
+    "\n",
+    "  Note: Costs related to cached session context ($0.40 per 1M tokens) and the capped text input tokens for GPT-4o ($2.50 per 1M tokens) are negligible and thus excluded from detailed calculations.\n",
+    "\n",
+    "- Other Considerations:\n",
+    "  - Implementing transcription via the realtime model might be slightly more complex compared to using the built-in GPT-4o transcription option through the Realtime API.\n",
+    "\n",
+    "> Note: Out-of-band responses using the realtime model can be used for other use cases beyond user turn transcription. Examples include generating structured summaries, triggering background actions, or performing validation tasks without affecting the main conversation.\n",
+    "\n",
+    "\"drawing\"\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "63ccae3d",
+   "metadata": {},
+   "source": [
+    "# 2. 
Requirements & Setup\n", + "\n", + "Ensure your environment meets these requirements:\n", + "\n", + "1. **Python 3.10 or later**\n", + "\n", + "2. **PortAudio** (required by `sounddevice`):\n", + " - macOS:\n", + " ```bash\n", + " brew install portaudio\n", + " ```\n", + "\n", + "3. **Python Dependencies**:\n", + " ```bash\n", + " pip install sounddevice websockets\n", + " ```\n", + "\n", + "4. **OpenAI API Key** (with Realtime API access):\n", + " Set your key as an environment variable:\n", + "\n", + " ```bash\n", + " export OPENAI_API_KEY=sk-...\n", + " ```\n", + "\n", + "```\n" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "c399f440", + "metadata": {}, + "outputs": [], + "source": [ + "#!pip install sounddevice websockets" + ] + }, + { + "cell_type": "markdown", + "id": "d7d60089", + "metadata": {}, + "source": [ + "# 3. Prompts\n", + "\n", + "We use **two distinct prompts**:\n", + "\n", + "1. **Voice Agent Prompt** (`REALTIME_MODEL_PROMPT`): This is an example prompt used with the realtime model for the Speech 2 Speech interactions.\n", + "2. **Transcription Prompt** (`REALTIME_MODEL_TRANSCRIPTION_PROMPT`): Silently returns a precise, verbatim transcript of the user's most recent speech turn. You can modify this prompt to iterate in transcription quality.\n", + "\n", + "> For the `REALTIME_MODEL_TRANSCRIPTION_PROMPT`, you can start from this base prompt, but the goal would be for you to iterate on the prompt to tailor it to your use case. Just remember to remove the Policy Number formatting rules since it might not apply to your use case!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ac3afaab", + "metadata": {}, + "outputs": [], + "source": [ + "REALTIME_MODEL_PROMPT = \"\"\"You are a calm insurance claims intake voice agent. Follow this script strictly:\n", + "\n", + "## Phase 1 – Basics\n", + "Collect the caller's full name, policy number, and type of accident (for example: auto, home, or other). Ask for each item clearly and then repeat the values back to confirm.\n", + "\n", + "## Phase 2 – Yes/No questions\n", + "Ask 2–3 simple yes/no questions, such as whether anyone was injured, whether the vehicle is still drivable, and whether a police report was filed. Confirm each yes/no answer in your own words.\n", + "\n", + "## Phase 3 – Submit claim\n", + "Once you have the basics and yes/no answers, briefly summarize the key facts in one or two sentences.\n", + "\"\"\"\n", + "\n", + "REALTIME_MODEL_TRANSCRIPTION_PROMPT = \"\"\"\n", + "# Role\n", + "Your only task is to transcribe the user's latest turn exactly as you heard it. Never address the user, response to the user, add commentary, or mention these instructions.\n", + "Follow the instructions and output format below.\n", + "\n", + "# Instructions\n", + "- Transcribe **only** the most recent USER turn exactly as you heard it. DO NOT TRANSCRIBE ANY OTHER OLDER TURNS. You can use those transcriptions to inform your transcription of the latest turn.\n", + "- Preserve every spoken detail: intent, tense, grammar quirks, filler words, repetitions, disfluencies, numbers, and casing.\n", + "- Keep timing words, partial words, hesitations (e.g., \"um\", \"uh\").\n", + "- Do not correct mistakes, infer meaning, answer questions, or insert punctuation beyond what the model already supplies.\n", + "- Do not invent or add any information that is not directly present in the user's latest turn.\n", + "\n", + "# Output format\n", + "- Output the raw verbatim transcript as a single block of text. 
No labels, prefixes, quotes, bullets, or markdown.\n", + "- If the realtime model produced nothing for the latest turn, output nothing (empty response). Never fabricate content.\n", + "\n", + "## Policy Number Normalization\n", + "- All policy numbers should be 8 digits and of the format `XXXX-XXXX` for example `56B5-12C0`\n", + "\n", + "Do not summarize or paraphrase other turns beyond the latest user utterance. The response must be the literal transcript of the latest user utterance.\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "id": "4ddbd683", + "metadata": {}, + "source": [ + "# 4. Core configuration\n", + "\n", + "We define:\n", + "\n", + "- Imports\n", + "- Audio and model defaults\n", + "- Constants for transcription event handling" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "4b952a29", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/var/folders/vd/l97lv64j3678b905tff4bc0h0000gp/T/ipykernel_91319/2514869342.py:10: DeprecationWarning: websockets.client.WebSocketClientProtocol is deprecated\n", + " from websockets.client import WebSocketClientProtocol\n" + ] + } + ], + "source": [ + "import asyncio\n", + "import base64\n", + "import json\n", + "import os\n", + "from collections import defaultdict, deque\n", + "from typing import Any\n", + "\n", + "import sounddevice as sd\n", + "import websockets\n", + "from websockets.client import WebSocketClientProtocol\n", + "\n", + "# Basic defaults\n", + "DEFAULT_MODEL = \"gpt-realtime\"\n", + "DEFAULT_VOICE = \"marin\"\n", + "DEFAULT_SAMPLE_RATE = 24_000\n", + "DEFAULT_BLOCK_MS = 100\n", + "DEFAULT_SILENCE_DURATION_MS = 800\n", + "DEFAULT_PREFIX_PADDING_MS = 300\n", + "TRANSCRIPTION_PURPOSE = \"User turn transcription\"" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "7254080a", + "metadata": {}, + "outputs": [], + "source": [ + "# Event grouping constants\n", + "TRANSCRIPTION_DELTA_TYPES = {\n", + " \"input_audio_buffer.transcription.delta\",\n", + " \"input_audio_transcription.delta\",\n", + " \"conversation.item.input_audio_transcription.delta\",\n", + "}\n", + "TRANSCRIPTION_COMPLETE_TYPES = {\n", + " \"input_audio_buffer.transcription.completed\",\n", + " \"input_audio_buffer.transcription.done\",\n", + " \"input_audio_transcription.completed\",\n", + " \"input_audio_transcription.done\",\n", + " \"conversation.item.input_audio_transcription.completed\",\n", + " \"conversation.item.input_audio_transcription.done\",\n", + "}\n", + "INPUT_SPEECH_END_EVENT_TYPES = {\n", + " \"input_audio_buffer.speech_stopped\",\n", + " \"input_audio_buffer.committed\",\n", + "}\n", + "RESPONSE_AUDIO_DELTA_TYPES = {\n", + " \"response.output_audio.delta\",\n", + " \"response.audio.delta\",\n", + "}\n", + "RESPONSE_TEXT_DELTA_TYPES = {\n", + " \"response.output_text.delta\",\n", + " \"response.text.delta\",\n", + "}\n", + "RESPONSE_AUDIO_TRANSCRIPT_DELTA_TYPES = {\n", + " \"response.output_audio_transcript.delta\",\n", + " \"response.audio_transcript.delta\",\n", + "}" + ] + }, + { + "cell_type": "markdown", + "id": "a905ec16", + "metadata": {}, + "source": [ + "# 5. 
Building the Realtime session & the out‑of‑band request\n", + "\n", + "The Realtime session (`session.update`) configures:\n", + "\n", + "- Audio input/output\n", + "- Server‑side VAD\n", + "- Set built‑in transcription (`input_audio_transcription_model`)\n", + " + We set this so that we can compare to the realtime model transcription\n", + "\n", + "The out‑of‑band transcription is a `response.create` trigerred after user input audio is committed `input_audio_buffer.committed`:\n", + "\n", + "- `conversation: \"none\"` – use session state but don’t write to the main conversation session state\n", + "- `output_modalities: [\"text\"]` – get a text transcript only\n", + "\n", + "> Note: The REALTIME_MODEL_TRANSCRIPTION_PROMPT is not passed to the gpt-4o-transcribe model because the Realtime API enforces a 1024 token maximum for prompts.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "4baf1870", + "metadata": {}, + "outputs": [], + "source": [ + "def build_session_update(\n", + " instructions: str,\n", + " voice: str,\n", + " vad_threshold: float,\n", + " silence_duration_ms: int,\n", + " prefix_padding_ms: int,\n", + " idle_timeout_ms: int | None,\n", + " input_audio_transcription_model: str | None = None,\n", + ") -> dict[str, object]:\n", + " \"\"\"Configure the Realtime session: audio in/out, server VAD, etc.\"\"\"\n", + "\n", + " turn_detection: dict[str, float | int | bool | str] = {\n", + " \"type\": \"server_vad\",\n", + " \"threshold\": vad_threshold,\n", + " \"silence_duration_ms\": silence_duration_ms,\n", + " \"prefix_padding_ms\": prefix_padding_ms,\n", + " \"create_response\": True,\n", + " \"interrupt_response\": True,\n", + " }\n", + "\n", + " if idle_timeout_ms is not None:\n", + " turn_detection[\"idle_timeout_ms\"] = idle_timeout_ms\n", + "\n", + " audio_config: dict[str, Any] = {\n", + " \"input\": {\n", + " \"format\": {\n", + " \"type\": \"audio/pcm\",\n", + " \"rate\": DEFAULT_SAMPLE_RATE,\n", + " },\n", + " \"noise_reduction\": {\"type\": \"near_field\"},\n", + " \"turn_detection\": turn_detection,\n", + " },\n", + " \"output\": {\n", + " \"format\": {\n", + " \"type\": \"audio/pcm\",\n", + " \"rate\": DEFAULT_SAMPLE_RATE,\n", + " },\n", + " \"voice\": voice,\n", + " },\n", + " }\n", + "\n", + " # Optional: built-in transcription model for comparison\n", + " if input_audio_transcription_model:\n", + " audio_config[\"input\"][\"transcription\"] = {\n", + " \"model\": input_audio_transcription_model,\n", + " }\n", + "\n", + " session: dict[str, object] = {\n", + " \"type\": \"realtime\",\n", + " \"output_modalities\": [\"audio\"],\n", + " \"instructions\": instructions,\n", + " \"audio\": audio_config,\n", + " }\n", + "\n", + " return {\n", + " \"type\": \"session.update\",\n", + " \"session\": session,\n", + " }\n", + "\n", + "\n", + "def build_transcription_request(transcription_instructions: str) -> dict[str, object]:\n", + " \"\"\"Ask the SAME Realtime model for an out-of-band transcript of the latest user turn.\"\"\"\n", + "\n", + " return {\n", + " \"type\": \"response.create\",\n", + " \"response\": {\n", + " \"conversation\": \"none\", # <--- out-of-band\n", + " \"output_modalities\": [\"text\"],\n", + " \"metadata\": {\"purpose\": TRANSCRIPTION_PURPOSE}, # <--- we add metadata so it is easier to identify the event in the logs\n", + " \"instructions\": transcription_instructions,\n", + " },\n", + " }\n" + ] + }, + { + "cell_type": "markdown", + "id": "9afe7911", + "metadata": {}, + "source": [ + "# 6. 
Audio streaming: mic → Realtime → speakers\n", + "\n", + "We now define:\n", + "\n", + "- `encode_audio` – base64 helper\n", + "- `playback_audio` – play assistant audio on the default output device\n", + "- `send_audio_from_queue` – send buffered mic audio to `input_audio_buffer`\n", + "- `stream_microphone_audio` – capture PCM16 from the mic and feed the queue\n" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "11218bbb", + "metadata": {}, + "outputs": [], + "source": [ + "def encode_audio(chunk: bytes) -> str:\n", + " \"\"\"Base64-encode a PCM audio chunk for WebSocket transport.\"\"\"\n", + " return base64.b64encode(chunk).decode(\"utf-8\")\n", + "\n", + "\n", + "async def playback_audio(\n", + " playback_queue: asyncio.Queue,\n", + " stop_event: asyncio.Event,\n", + ") -> None:\n", + " \"\"\"Stream assistant audio back to the speakers in (near) real time.\"\"\"\n", + "\n", + " try:\n", + " with sd.RawOutputStream(\n", + " samplerate=DEFAULT_SAMPLE_RATE,\n", + " channels=1,\n", + " dtype=\"int16\",\n", + " ) as stream:\n", + " while not stop_event.is_set():\n", + " chunk = await playback_queue.get()\n", + " if chunk is None:\n", + " break\n", + " try:\n", + " stream.write(chunk)\n", + " except Exception as exc:\n", + " print(f\"Audio playback error: {exc}\", flush=True)\n", + " break\n", + " except Exception as exc:\n", + " print(f\"Failed to open audio output stream: {exc}\", flush=True)\n", + "\n", + "\n", + "async def send_audio_from_queue(\n", + " ws: WebSocketClientProtocol,\n", + " queue: asyncio.Queue[bytes | None],\n", + " stop_event: asyncio.Event,\n", + ") -> None:\n", + " \"\"\"Push raw PCM chunks into input_audio_buffer via the WebSocket.\"\"\"\n", + "\n", + " while not stop_event.is_set():\n", + " chunk = await queue.get()\n", + " if chunk is None:\n", + " break\n", + " encoded_chunk = encode_audio(chunk)\n", + " message = {\"type\": \"input_audio_buffer.append\", \"audio\": encoded_chunk}\n", + " await ws.send(json.dumps(message))\n", + "\n", + " if not ws.closed:\n", + " commit_payload = {\"type\": \"input_audio_buffer.commit\"}\n", + " await ws.send(json.dumps(commit_payload))\n", + "\n", + "\n", + "async def stream_microphone_audio(\n", + " ws: WebSocketClientProtocol,\n", + " stop_event: asyncio.Event,\n", + " shared_state: dict,\n", + " block_ms: int = DEFAULT_BLOCK_MS,\n", + ") -> None:\n", + " \"\"\"Capture live microphone audio and send it to the realtime session.\"\"\"\n", + "\n", + " loop = asyncio.get_running_loop()\n", + " audio_queue: asyncio.Queue[bytes | None] = asyncio.Queue()\n", + " blocksize = int(DEFAULT_SAMPLE_RATE * (block_ms / 1000))\n", + "\n", + " def on_audio(indata, frames, time_info, status): # type: ignore[override]\n", + " \"\"\"Capture a mic callback chunk and enqueue it unless the mic is muted.\"\"\"\n", + " if status:\n", + " print(f\"Microphone status: {status}\", flush=True)\n", + " # Simple echo protection: mute mic when assistant is talking\n", + " if not stop_event.is_set() and not shared_state.get(\"mute_mic\", False):\n", + " data = bytes(indata)\n", + " loop.call_soon_threadsafe(audio_queue.put_nowait, data)\n", + "\n", + " print(\n", + " f\"Streaming microphone audio at {DEFAULT_SAMPLE_RATE} Hz (mono). 
\"\n", + " \"Speak naturally; server VAD will stop listening when you pause.\"\n", + " )\n", + " sender = asyncio.create_task(send_audio_from_queue(ws, audio_queue, stop_event))\n", + "\n", + " with sd.RawInputStream(\n", + " samplerate=DEFAULT_SAMPLE_RATE,\n", + " blocksize=blocksize,\n", + " channels=1,\n", + " dtype=\"int16\",\n", + " callback=on_audio,\n", + " ):\n", + " await stop_event.wait()\n", + "\n", + " await audio_queue.put(None)\n", + " await sender" + ] + }, + { + "cell_type": "markdown", + "id": "d02cc1bd", + "metadata": {}, + "source": [ + "# 7. Extracting and comparing transcripts\n", + "\n", + "The function below enables us to generate **two transcripts** for each user turn:\n", + "\n", + "- **Realtime model transcript**: from our out-of-band `response.create` call.\n", + "- **Built-in ASR transcript**: from the standard transcription model (`input_audio_transcription_model`).\n", + "\n", + "We align and display both clearly in the terminal:\n", + "\n", + "```text\n", + "=== User Turn (Realtime Transcript) ===\n", + "...\n", + "\n", + "=== User Turn (Built-in ASR Transcript) ===\n", + "...\n" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "cb6acbf0", + "metadata": {}, + "outputs": [], + "source": [ + "def flush_pending_transcription_prints(shared_state: dict) -> None:\n", + " \"\"\"Whenever we've printed a realtime transcript, print the matching transcription-model output.\"\"\"\n", + "\n", + " pending_prints: deque | None = shared_state.get(\"pending_transcription_prints\")\n", + " input_transcripts: deque | None = shared_state.get(\"input_transcripts\")\n", + "\n", + " if not pending_prints or not input_transcripts:\n", + " return\n", + "\n", + " while pending_prints and input_transcripts:\n", + " comparison_text = input_transcripts.popleft()\n", + " pending_prints.popleft()\n", + " print(\"=== User turn (Transcription model) ===\")\n", + " if comparison_text:\n", + " print(comparison_text, flush=True)\n", + " print()\n", + " else:\n", + " print(\"\", flush=True)\n", + " print()\n" + ] + }, + { + "cell_type": "markdown", + "id": "6025bbf6", + "metadata": {}, + "source": [ + "# 8. 
Listening for Realtime events\n", + "\n", + "`listen_for_events` drives the session:\n", + "\n", + "- Watches for `speech_started` / `speech_stopped` / `committed`\n", + "- Sends the out‑of‑band transcription request when a user turn finishes (`input_audio_buffer.committed`)\n", + "- Streams assistant audio to the playback queue\n", + "- Buffers text deltas per `response_id`" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "d099babd", + "metadata": {}, + "outputs": [], + "source": [ + "async def listen_for_events(\n", + " ws: WebSocketClientProtocol,\n", + " stop_event: asyncio.Event,\n", + " transcription_instructions: str,\n", + " max_turns: int | None,\n", + " playback_queue: asyncio.Queue,\n", + " shared_state: dict,\n", + ") -> None:\n", + " \"\"\"Print assistant text + transcripts and coordinate mic muting.\"\"\"\n", + "\n", + " responses: dict[str, dict[str, bool]] = {}\n", + " buffers: defaultdict[str, str] = defaultdict(str)\n", + " transcription_model_buffers: defaultdict[str, str] = defaultdict(str)\n", + " completed_main_responses = 0\n", + " awaiting_transcription_prompt = False\n", + " input_transcripts = shared_state.setdefault(\"input_transcripts\", deque())\n", + " pending_transcription_prints = shared_state.setdefault(\n", + " \"pending_transcription_prints\", deque()\n", + " )\n", + "\n", + " async for raw in ws:\n", + " if stop_event.is_set():\n", + " break\n", + "\n", + " message = json.loads(raw)\n", + " message_type = message.get(\"type\")\n", + "\n", + " # --- User speech events -------------------------------------------------\n", + " if message_type == \"input_audio_buffer.speech_started\":\n", + " print(\"\\n[client] Speech detected; streaming...\", flush=True)\n", + " awaiting_transcription_prompt = True\n", + "\n", + " elif message_type in INPUT_SPEECH_END_EVENT_TYPES:\n", + " if message_type == \"input_audio_buffer.speech_stopped\":\n", + " print(\"[client] Detected silence; preparing transcript...\", flush=True)\n", + "\n", + " # This is where the out-of-band transcription request is sent. 
<-------\n", + " if awaiting_transcription_prompt:\n", + " request_payload = build_transcription_request(\n", + " transcription_instructions\n", + " )\n", + " await ws.send(json.dumps(request_payload))\n", + " awaiting_transcription_prompt = False\n", + "\n", + " # --- Built-in transcription model stream -------------------------------\n", + " elif message_type in TRANSCRIPTION_DELTA_TYPES:\n", + " buffer_id = message.get(\"buffer_id\") or message.get(\"item_id\") or \"default\"\n", + " delta_text = (\n", + " message.get(\"delta\")\n", + " or (message.get(\"transcription\") or {}).get(\"text\")\n", + " or \"\"\n", + " )\n", + " if delta_text:\n", + " transcription_model_buffers[buffer_id] += delta_text\n", + "\n", + " elif message_type in TRANSCRIPTION_COMPLETE_TYPES:\n", + " buffer_id = message.get(\"buffer_id\") or message.get(\"item_id\") or \"default\"\n", + " final_text = (\n", + " (message.get(\"transcription\") or {}).get(\"text\")\n", + " or message.get(\"transcript\")\n", + " or \"\"\n", + " )\n", + " if not final_text:\n", + " final_text = transcription_model_buffers.pop(buffer_id, \"\").strip()\n", + " else:\n", + " transcription_model_buffers.pop(buffer_id, None)\n", + "\n", + " if not final_text:\n", + " item = message.get(\"item\")\n", + " if item:\n", + " final_text = item.get(\"transcription\")\n", + " final_text = final_text or \"\"\n", + "\n", + " final_text = final_text.strip()\n", + " if final_text:\n", + " input_transcripts.append(final_text)\n", + " flush_pending_transcription_prints(shared_state)\n", + "\n", + " # --- Response lifecycle (Realtime model) --------------------------------\n", + " elif message_type == \"response.created\":\n", + " response = message.get(\"response\", {})\n", + " response_id = response.get(\"id\")\n", + " metadata = response.get(\"metadata\") or {}\n", + " responses[response_id] = {\n", + " \"is_transcription\": metadata.get(\"purpose\") == TRANSCRIPTION_PURPOSE,\n", + " \"done\": False,\n", + " }\n", + "\n", + " elif message_type in RESPONSE_AUDIO_DELTA_TYPES:\n", + " response_id = message.get(\"response_id\")\n", + " if response_id is None:\n", + " continue\n", + " b64_audio = message.get(\"delta\") or message.get(\"audio\")\n", + " if not b64_audio:\n", + " continue\n", + " try:\n", + " audio_chunk = base64.b64decode(b64_audio)\n", + " except Exception:\n", + " continue\n", + "\n", + " if (\n", + " response_id in responses\n", + " and not responses[response_id][\"is_transcription\"]\n", + " ):\n", + " shared_state[\"mute_mic\"] = True\n", + "\n", + " await playback_queue.put(audio_chunk)\n", + "\n", + " elif message_type in RESPONSE_TEXT_DELTA_TYPES:\n", + " response_id = message.get(\"response_id\")\n", + " if response_id is None:\n", + " continue\n", + " buffers[response_id] += message.get(\"delta\", \"\")\n", + " \n", + "\n", + " elif message_type in RESPONSE_AUDIO_TRANSCRIPT_DELTA_TYPES:\n", + " response_id = message.get(\"response_id\")\n", + " if response_id is None:\n", + " continue\n", + " buffers[response_id] += message.get(\"delta\", \"\") \n", + "\n", + " elif message_type == \"response.done\":\n", + " response = message.get(\"response\", {})\n", + " response_id = response.get(\"id\")\n", + " if response_id is None:\n", + " continue\n", + " if response_id not in responses:\n", + " responses[response_id] = {\"is_transcription\": False, \"done\": False}\n", + " responses[response_id][\"done\"] = True\n", + "\n", + " is_transcription = responses[response_id][\"is_transcription\"]\n", + " text = buffers.get(response_id, 
\"\").strip()\n", + " if text:\n", + " if is_transcription:\n", + " print(\"\\n=== User turn (Realtime transcript) ===\")\n", + " print(text, flush=True)\n", + " print()\n", + " pending_transcription_prints.append(object())\n", + " flush_pending_transcription_prints(shared_state)\n", + " else:\n", + " print(\"\\n=== Assistant response ===\")\n", + " print(text, flush=True)\n", + " print()\n", + "\n", + " if not is_transcription:\n", + " shared_state[\"mute_mic\"] = False\n", + " completed_main_responses += 1\n", + "\n", + " if max_turns is not None and completed_main_responses >= max_turns:\n", + " stop_event.set()\n", + " break\n", + "\n", + " elif message_type == \"error\":\n", + " print(f\"Error from server: {message}\")\n", + "\n", + " else:\n", + " pass\n", + "\n", + " await asyncio.sleep(0)" + ] + }, + { + "cell_type": "markdown", + "id": "10c69ded", + "metadata": {}, + "source": [ + "# 9. Run Script\n", + "\n", + "In this step, we run the the code which will allow us to view the realtime model transcription vs transcription model transcriptions. The code does the following:\n", + "\n", + "- Loads configuration and prompts\n", + "- Establishes a WebSocket connection\n", + "- Starts concurrent tasks:\n", + " - `listen_for_events` (handle incoming messages)\n", + " - `stream_microphone_audio` (send microphone audio)\n", + " - Mutes mic when assistant is speaking\n", + " - `playback_audio` (play assistant responses)\n", + " - prints realtime and transcription model transcripts when they are both returned. It uses shared_state to ensure both are returned before printing.\n", + "- Run session until you `interrupt`\n", + "\n", + "Output should look like:\n", + "```python\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "Hello.\n", + "\n", + "=== User turn (Transcription model) ===\n", + "Hello\n", + "\n", + "\n", + "=== Assistant response ===\n", + "Hello, and thank you for calling. 
Let's start with your full name, please.\n", + "```\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "35c4d7b5", + "metadata": {}, + "outputs": [], + "source": [ + "async def run_realtime_session(\n", + " api_key: str | None = None,\n", + " server: str = \"wss://api.openai.com/v1/realtime\",\n", + " model: str = DEFAULT_MODEL,\n", + " voice: str = DEFAULT_VOICE,\n", + " instructions: str = REALTIME_MODEL_PROMPT,\n", + " transcription_instructions: str = REALTIME_MODEL_TRANSCRIPTION_PROMPT,\n", + " input_audio_transcription_model: str | None = \"gpt-4o-transcribe\",\n", + " silence_duration_ms: int = DEFAULT_SILENCE_DURATION_MS,\n", + " prefix_padding_ms: int = DEFAULT_PREFIX_PADDING_MS,\n", + " vad_threshold: float = 0.6,\n", + " idle_timeout_ms: int | None = None,\n", + " max_turns: int | None = None,\n", + " timeout_seconds: int = 0,\n", + ") -> None:\n", + " \"\"\"Connect to the Realtime API, stream audio both ways, and print transcripts.\"\"\"\n", + " api_key = api_key or os.environ.get(\"OPENAI_API_KEY\")\n", + " ws_url = f\"{server}?model={model}\"\n", + " headers = {\n", + " \"Authorization\": f\"Bearer {api_key}\",\n", + " }\n", + "\n", + " session_update_payload = build_session_update(\n", + " instructions=instructions,\n", + " voice=voice,\n", + " vad_threshold=vad_threshold,\n", + " silence_duration_ms=silence_duration_ms,\n", + " prefix_padding_ms=prefix_padding_ms,\n", + " idle_timeout_ms=idle_timeout_ms,\n", + " input_audio_transcription_model=input_audio_transcription_model,\n", + " )\n", + " stop_event = asyncio.Event()\n", + " playback_queue: asyncio.Queue = asyncio.Queue()\n", + " shared_state: dict = {\n", + " \"mute_mic\": False,\n", + " \"input_transcripts\": deque(),\n", + " \"pending_transcription_prints\": deque(),\n", + " }\n", + "\n", + " async with websockets.connect(\n", + " ws_url, additional_headers=headers, max_size=None\n", + " ) as ws:\n", + " await ws.send(json.dumps(session_update_payload))\n", + "\n", + " listener_task = asyncio.create_task(\n", + " listen_for_events(\n", + " ws,\n", + " stop_event=stop_event,\n", + " transcription_instructions=transcription_instructions,\n", + " max_turns=max_turns,\n", + " playback_queue=playback_queue,\n", + " shared_state=shared_state,\n", + " )\n", + " )\n", + " mic_task = asyncio.create_task(\n", + " stream_microphone_audio(ws, stop_event, shared_state=shared_state)\n", + " )\n", + " playback_task = asyncio.create_task(playback_audio(playback_queue, stop_event))\n", + "\n", + " try:\n", + " if timeout_seconds and timeout_seconds > 0:\n", + " await asyncio.wait_for(stop_event.wait(), timeout=timeout_seconds)\n", + " else:\n", + " await stop_event.wait()\n", + " except asyncio.TimeoutError:\n", + " print(\"Timed out waiting for responses; closing.\")\n", + " except asyncio.CancelledError:\n", + " print(\"Session cancelled; closing.\")\n", + " finally:\n", + " stop_event.set()\n", + " await playback_queue.put(None)\n", + " await ws.close()\n", + " await asyncio.gather(\n", + " listener_task, mic_task, playback_task, return_exceptions=True\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "id": "c9a2a33b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Streaming microphone audio at 24000 Hz (mono). 
Speak naturally; server VAD will stop listening when you pause.\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "Hello.\n", + "\n", + "=== User turn (Transcription model) ===\n", + "Hello\n", + "\n", + "\n", + "=== Assistant response ===\n", + "Hello! Let's get started with your claim. Can you tell me your full name, please?\n", + "\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "My name is M I N H A J U L H O Q U E\n", + "\n", + "=== User turn (Transcription model) ===\n", + "My name is Minhajul Hoque.\n", + "\n", + "\n", + "=== Assistant response ===\n", + "Thank you. Just to confirm, I heard your full name as Minhajul Hoque. Is that correct?\n", + "\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "Yep.\n", + "\n", + "=== User turn (Transcription model) ===\n", + "Yep.\n", + "\n", + "\n", + "=== Assistant response ===\n", + "Great, thank you for confirming. Now, could you provide your policy number, please?\n", + "\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "My policy number is X077-B025.\n", + "\n", + "=== User turn (Transcription model) ===\n", + "My policy number is X077B025.\n", + "\n", + "\n", + "=== Assistant response ===\n", + "Thank you. Let me confirm: I have your policy number as X077B025. Is that correct?\n", + "\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== Assistant response ===\n", + "Of course. Your full name is Minhajul Hoque. Now, let’s move on. What type of accident are you reporting—auto, home, or something else?\n", + "\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "Yeah, can you ask me my name again?\n", + "\n", + "=== User turn (Transcription model) ===\n", + "Can you ask me my name again?\n", + "\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "No, can you ask me my name again, this is important.\n", + "\n", + "=== User turn (Transcription model) ===\n", + "No, can you ask me by name again?\n", + "\n", + "\n", + "=== Assistant response ===\n", + "Understood. Let me repeat your full name again to confirm. Your name is Minhajul Hoque. Is that correct?\n", + "\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "My name is Minhajul Hoque.\n", + "\n", + "=== User turn (Transcription model) ===\n", + "My name is Minhaj ul Haq.\n", + "\n", + "Session cancelled; closing.\n" + ] + } + ], + "source": [ + "await run_realtime_session()" + ] + }, + { + "cell_type": "markdown", + "id": "efabdbf5", + "metadata": {}, + "source": [ + "From the above example, we can notice:\n", + "- The Realtime Model Transcription quality matches or surpasses that of the transcription model in various turns. 
In one of the turns, the transcription model misses \"this is important.\" while the realtime transcription captures it correctly.\n", + "- The realtime model correctly applies rules for Policy Number formatting (XXXX-XXXX).\n", + "- With context from the entire session, including previous turns where I spelled out my name, the realtime model accurately transcribes my name when the assistant asks for my name again, while the transcription model makes errors (e.g., \"Minhaj ul Haq\")." + ] + }, + { + "cell_type": "markdown", + "id": "6d8ac6e3", + "metadata": {}, + "source": [ + "# Conclusion\n", + "\n", + "Exploring **out-of-band transcription** could be beneficial for your use case if:\n", + "\n", + "* You're still experiencing unreliable transcriptions, even after optimizing the transcription model prompt.\n", + "* You need a more reliable and steerable method for generating transcriptions.\n", + "* The current transcripts fail to normalize entities correctly, causing downstream issues.\n", + "\n", + "If you decide to pursue this method, make sure you:\n", + "\n", + "* Set up the transcription trigger correctly, ensuring it activates after the audio commit.\n", + "* Carefully iterate and refine the prompt to align closely with your specific use case and needs.\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "openai", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.8" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/Realtime_prompting_guide.ipynb b/examples/Realtime_prompting_guide.ipynb index fdbb9d4418..fa5dd7909f 100644 --- a/examples/Realtime_prompting_guide.ipynb +++ b/examples/Realtime_prompting_guide.ipynb @@ -993,6 +993,32 @@ "In this example, the model asks for clarification after my *(very)* loud cough and unclear audio." ] }, + { + "cell_type": "markdown", + "id": "e0ce8f30", + "metadata": {}, + "source": [ + "## Background Music or Sounds\n", + "Occasionally, the model may generate unintended background music, humming, rhythmic noises, or sound-like artifacts during speech generation. These artifacts can diminish clarity, distract users, or make the assistant feel less professional. The following instructions help prevent or significantly reduce these occurrences.\n", + "\n", + "- **When to use**: Use when you observe unintended musical elements or sound effects in Realtime audio responses.\n", + "- **What it does**: Steers the model to avoid generating these unwanted audio artifacts.\n", + "- **How to adapt**: Adjust the instruction to explicitly suppress the specific sound patterns you are encountering."
+ ] + }, + { + "cell_type": "markdown", + "id": "c22c1c32", + "metadata": {}, + "source": [ + "### Example\n", + "```\n", + "# Instructions/Rules\n", + "...\n", + "- Do not include any sound effects or onomatopoeic expressions in your responses.\n", + "```" + ] + }, { "cell_type": "markdown", "id": "ea96cb72", diff --git a/examples/agentkit/agentkit_walkthrough.ipynb b/examples/agentkit/agentkit_walkthrough.ipynb new file mode 100644 index 0000000000..3b3d9a9bfa --- /dev/null +++ b/examples/agentkit/agentkit_walkthrough.ipynb @@ -0,0 +1,444 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "0d08a919", + "metadata": {}, + "source": [ + "# Build, deploy, and optimize agentic workflows with AgentKit" + ] + }, + { + "cell_type": "markdown", + "id": "3821906c", + "metadata": {}, + "source": [ + "## Introduction\n", + "\n", + "At DevDay 2025 we launched [AgentKit](https://openai.com/index/introducing-agentkit/), a complete set of tools for developers and enterprises to build, deploy, and optimize agents. AgentKit is a set of interconnected building blocks:\n", + "\n", + "* [Agent Builder](https://platform.openai.com/docs/guides/agents/agent-builder): visually build and iterate on agent workflows \n", + "* [ChatKit](https://platform.openai.com/docs/guides/chatkit): easily embed chat-based workflows into your app \n", + "* [Evals](https://platform.openai.com/docs/guides/evals?api-mode=responses): improve the performance of your LLM-powered apps\n", + "\n", + "**This cookbook will take an end-to-end journey through AgentKit - we'll build, deploy, and optimize an app. You'll understand how AgentKit’s building blocks connect together, enabling you to bring your agentic workflows into production faster and more reliably.** \n", + "\n", + "We’ll walk through the following steps:\n", + "\n", + "1. Build a workflow in Agent Builder to serve as the back-end of our app \n", + "2. Deploy a front-end chat app using the ChatKit web component \n", + "3. Optimize workflow performance in Evals with prompt optimization and trace grading" + ] + }, + { + "cell_type": "markdown", + "id": "7c46a463", + "metadata": {}, + "source": [ + "## Building the multi-agent workflow with Agent Builder\n", + "\n", + "Let's get started by using Agent Builder to create the initial workflow that will underpin our app. Agent Builder is a visual canvas that lets you drag-and-drop nodes to design your agentic workflows. You can learn more about Agent Builder [here](https://platform.openai.com/docs/guides/agent-builder), including additional functionality and a full list of supported nodes, but in this cookbook we'll create a simple workflow with three Agent nodes working sequentially.\n", + "\n", + "We’ll build a simple app that helps people accelerate their careers through curated learning recommendations. Users can upload their resume and tell us their dream job, and they'll receive a curated set of learning courses based on the skills they need to develop. So we'll create three agents: \n", + "\n", + "1. **Resume extraction** agent to parse uploaded resumes and extract relevant skills and experiences \n", + "2. **Career analysis** agent to analyze knowledge gaps relative to their target job \n", + "3. **Course recommendation** agent which uses the upstream information to suggest relevant online courses.\n", + "\n", + "Let's build each of these agents sequentially." + ] + }, + { + "cell_type": "markdown", + "id": "d2781923", + "metadata": {}, + "source": [ + "
" + ] + }, + { + "cell_type": "markdown", + "id": "762f441b", + "metadata": {}, + "source": [ + "### 1. Resume extraction agent\n", + "\n", + "This agent will be responsible for parsing the uploaded resume and returning a structured output of skills and experiences that will be used for downstream analysis. We'll use the following prompt:\n", + "\n", + "\n", + "```text\n", + "Extract and summarize information from the input resume, organizing your output by category and providing context where available.\n", + "- Analyze the provided input to identify skills and professional experiences.\n", + "- For each skill or experience, extract the supporting context or evidence from the text (e.g., for the skill of Python, context might be “used Python in data analysis for three years at [Company]”).\n", + "- Continue reviewing the text until all skills and experiences are extracted.\n", + "```\n", + "\n", + "We'll use `gpt-5` for this agent, starting with `minimal` reasoning, but we can always change the model later if needed. And, we'll enforce a structured response (by selecting Output format to JSON, and adding a schema) to instruct the model to return the exact data shape we're looking for. (The JSON schema for this structured output can be found [here](https://cdn.openai.com/cookbook/agent_walkthrough/Skills_schema.json).)" + ] + }, + { + "cell_type": "markdown", + "id": "e061bb1d", + "metadata": {}, + "source": [ + "
" + ] + }, + { + "cell_type": "markdown", + "id": "6510b30a", + "metadata": {}, + "source": [ + "### 2. Career analysis agent\n", + "\n", + "This agent will analyze skill and knowledge gaps for an individual to progress to a desired professional or career goal. We'll use `gpt-5` for this agent and select reasoning effort `low`, which should provide sufficient reasoning for this level of analysis while keeping the responses sufficiently fast.\n", + "\n", + "```text\n", + "Your role is to analyze skill and knowledge gaps for an individual to progress to a desired professional or career goal. \n", + "\n", + "You will receive a list of the already-obtained skills and experiences of an individual, as well as a description of the goal. First, understand the goal and analyze the critical skills or knowledge areas required for achieving the goal. Then, compare the requirements to what the individual already possesses. \n", + "\n", + "Return a list of the top 3-5 skills that the individual does not possess, but are important for their professional goal. Along with each skill, include a brief description.\n", + "\n", + "Individual's expressed goal: \n", + "{{workflow.input_as_text}}\n", + "\n", + "Already-obtained skills and experiences:\n", + "{{input.output_text}}\n", + "```\n", + "\n", + "Note that our prompt includes context from previous nodes enclosed in {{brackets}}. You can also click \"Add context\" to see the context variables to the model." + ] + }, + { + "cell_type": "markdown", + "id": "2a493da9", + "metadata": {}, + "source": [ + "### 3. Course recommendation agent\n", + "\n", + "This agent will use the web search tool to find and select online training courses that match the identified skill gaps. We'll use `gpt-5` with `minimal` reasoning and equip this agent with Web Search.\n", + "\n", + "```text\n", + "Your job is to identify and recommend online training courses that help develop one or more of the skills identified. Given the list of required skills and descriptions below, return a list of 3-5 online courses along with course details.\n", + "\n", + "Skills: {{input.output_text}}\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "7cdfe06b", + "metadata": {}, + "source": [ + "## Testing our workflow\n", + "\n", + "**Now that we've built our initial workflow, we can use the Preview functionality in Agent Builder to give it a spin!** We'll first Publish the workflow, which will create a named and versioned copy (with a unique workflow ID) that we can share with colleagues, or deploy or revert between versions as needed.\n", + "\n", + "Preview lets you interact with your workflow the same way a chat user would, from directly within Agent Builder. If we upload a resume, type in a description of our dream job, and click Submit, we'll see the workflow proceed step-by-step through each node on the left of the screen, and we'll see the output from each node on the right.\n", + "\n", + "As an example, I've uploaded a resume for a teacher who's looking to become a school superintendent.\n" + ] + }, + { + "cell_type": "markdown", + "id": "6260b9f4", + "metadata": {}, + "source": [ + "
" + ] + }, + { + "cell_type": "markdown", + "id": "7ca51961", + "metadata": {}, + "source": [ + "We can follow the workflow as it proceeds through resume parsing, skill analysis, and web search. As the workflow completes, as expected we see a list of online programs that meet the search parameters.\n", + "\n", + "**Everything looks good - we're now ready to deploy our app!** Let's make sure we publish any changes we've made, and get the workflow ID. We can select \"Code\" at the top of the screen to access the ID again." + ] + }, + { + "cell_type": "markdown", + "id": "1766fa61", + "metadata": {}, + "source": [ + "
" + ] + }, + { + "cell_type": "markdown", + "id": "cd24bd6d", + "metadata": {}, + "source": [ + "Note that you can use the \"Agents SDK\" tab to access the code that supports the workflow you just built, implemented using the Agents SDK package (in [JavaScript/TypeScript](https://github.com/openai/openai-agents-js) or [Python](https://github.com/openai/openai-agents-python)). This is a great option if you want to run your workflow in your own environment, or develop it further with custom functionality. (However, you would miss out on some of the benefits of using AgentKit in an integrated way, as we describe below.)" + ] + }, + { + "cell_type": "markdown", + "id": "5c09056a", + "metadata": {}, + "source": [ + "## Deploying the chat app with ChatKit\n", + "\n", + "To deploy our app, we'll use the [ChatKit starter template](https://github.com/openai/openai-chatkit-starter-app) to help us spin up a chat-based app using the ChatKit web component.\n", + "\n", + "Before doing that, it's worth explaining the full set of options that the suite of tools across AgentKit provides for deploying your agentic workflows. We've already seen how you can build a workflow in Agent Builder, and then run it directly within the tool (Preview), or export it as Agents SDK code to use in your own environment. Now, we'll demonstrate how you can use an Agent Builder workflow ID to create a chat experience embedded in your own front-end, which points to the workflow you created as a back-end. (By the way, you can also use just the rich chat GUI provided by the ChatKit SDK, without the workflow back-end - learn more [here](https://platform.openai.com/docs/guides/custom-chatkit).)\n", + "\n", + "So let's get started with the ChatKit starter template and plug in our workflow. The starter template makes it simple to spin up a chat-based app using our newly created workflow. Just follow the [Getting Started](https://github.com/openai/openai-chatkit-starter-app?tab=readme-ov-file#getting-started) instructions in the repo, entering in your workflow ID from Agent Builder as the value for `NEXT_PUBLIC_CHATKIT_WORKFLOW_ID` into `.env.local`, and running `npm install` and then `npm run dev` to test the app.\n", + "\n", + "In just a few minutes, the workflow is embedded in a front-end chat interface that's up and running!" + ] + }, + { + "cell_type": "markdown", + "id": "117de797", + "metadata": {}, + "source": [ + "
" + ] + }, + { + "cell_type": "markdown", + "id": "bab85f53", + "metadata": {}, + "source": [ + "## Quickly iterating on workflow and user experience\n", + "\n", + "One of the most valuable aspects of AgentKit is how quickly it enables you to experiment, iterate, and improve your agentic applications. Let's make some quick changes that will improve the functionality of our app and provide a richer chat experience.\n", + "\n", + "**First, let's add some custom theming** to give our front-end some style, while still retaining the native chat experience. A great resource here is [ChatKit Studio](https://chatkit.studio/), which includes a playground to explore the customization options in ChatKit, a Widget Builder (which we'll see in action shortly), and sample apps and galleries for inspiration. To get our custom theme, we'll use the ChatKit [Playground](https://chatkit.studio/playground) to visually select our desired style options, click on the `` icon at the top of the Playground screen to get the configuration code. \n", + "\n", + "We'll use the `theme` object from this code to overwrite the default theme located in [lib/config.ts](https://github.com/openai/openai-chatkit-starter-app/blob/main/lib/config.ts). While we're in that file, we'll also adjust the starter prompts, greeting text, and placeholder copy to more appropriate values:\n", + "\n", + "```text\n", + "export const GREETING = \"Upload your resume, and tell me the job you're looking to get!\";\n", + "export const PLACEHOLDER_INPUT = \"Describe your dream job, and don't forget to attach your resume!\"; \n", + "```\n", + "\n", + "**Next, we'll design a custom widget** to display our recommended courses in a more intuitive format that makes it easier for users to understand and compare. We can use the [Widget Builder](https://widgets.chatkit.studio/) to simply describe the output we're looking for, and get an LLM-generated starting point that we can edit further. For this example, we'll present the courses in a list with a clean, structured format, and we'll also show a summary below the recommendations. In the Widget Builder, we can see not only the widget code (top-left), but sample data (bottom-left) and how that data gets rendered within the widget for the end user (right)." + ] + }, + { + "cell_type": "markdown", + "id": "caecb23a", + "metadata": {}, + "source": [ + "
" + ] + }, + { + "cell_type": "markdown", + "id": "d7e46f96", + "metadata": {}, + "source": [ + "When we're happy with the design, we can download the .widget file. (The file used in the screenshot below is located [here](https://cdn.openai.com/cookbook/agent_walkthrough/Course%20recommendation.widget).)\n", + "\n", + "To actually use the custom widget we've designed in our chat app, **we need to instruct our workflow to return the widget component** as part of our recommendation agent's response. So we'll go back to the `Course recommendations` agent, select Output format of `Widget`, and upload the .widget file.\n", + "\n", + "Our agent will automatically know to output the JSON format required to populate the widget. However, we'll need to update the agent prompt to instruct the model a bit more precisely what information it needs to obtain about each course. \n", + "\n", + "```text\n", + "Your job is to identify and recommend online training courses that help develop one or more of the skills identified. Given the list of required skills, return a list of 3-5 online courses along with course details including course name, provider (school or program), recommendation reason (a brief sentence on why you're recommending the course), course format, and URL. In addition to the list of courses, share a few-sentence summary of the recommendations you're making. \n", + "```\n", + "\n", + "Finally, because we're dealing with resumes, we'll add a guardrail to our workflow to make sure we're not propagating any personally identifiable information (PII) where it doesn't belong. We'll insert this guardrail between our resume parser and our career analysis agents, which will help prevent anything downstream of the resume agent from having access to any PII, such as a name or contact information.\n" + ] + }, + { + "cell_type": "markdown", + "id": "ec3869a4", + "metadata": {}, + "source": [ + "
" + ] + }, + { + "cell_type": "markdown", + "id": "93bb49fd", + "metadata": {}, + "source": [ + "## Improving system performance using prompt optimization and trace grading\n", + "\n", + "Now we’ll see how the native integrations with Evals help make it easy to optimize both individual agents and your entire workflow. Let's imagine our chat app has been deployed - perhaps to an initial set of internal users or beta testers - and we have some real-world examples of users interacting with the system.\n", + "\n", + "If this application were being developed into a production-grade system where performance and quality were critical, we'd want to incorporate evals even earlier and more systematically into our development process. (You can learn more in [Eval Driven System Design](https://cookbook.openai.com/examples/partners/eval_driven_system_design/receipt_inspection) about how to build a set of initial evals and established ground truth, mapping evals to business metrics, and progressively improve your system to drive the business goals.)\n", + "\n", + "But in this cookbook, we'll focus less on the techniques behind evals as part of LLM app development, and more about how AgentKit lets you implement these techniques more easily. We'll drive performance improvements in two ways: first we'll **optimize a single agent node in our workflow** using the prompt optimization tool, then we'll **optimize the entire workflow** using trace grading.\n", + "\n", + "### Single agent optimization\n", + "\n", + "We want to dive into our Course recommendations agent to see if we can improve the quality of its recommendations to users. We've isolated some sample prompts for this agent from our test environment. (You can also access completed in the [Logs tab](https://platform.openai.com/logs?api=responses) of the API platform. For this cookbook example, you can access [here](https://cdn.openai.com/cookbook/agent_walkthrough/course_recommendations_dataset.csv) the data set we'll be using.)\n", + "\n", + "We can optimize our agent starting directly from Agent Builder. Select the Course recommendations agent, and click on \"Evaluate\" in the bottom right of the agent modal. This will take us directly to the **Datasets** feature within Evals. \n", + "\n", + "We see the configuration of our agent has been copied over, and we're ready to optimize. Let's first upload the data file with sample prompts (note the column names should match your input and output variables), and click \"Generate output\" to generate responses." + ] + }, + { + "cell_type": "markdown", + "id": "028ecd39", + "metadata": {}, + "source": [ + "
" + ] + }, + { + "cell_type": "markdown", + "id": "70b1b577", + "metadata": {}, + "source": [ + "Now, let's create some **human annotations** and **model graders**. We'll select \"Columns\" to add a Rating (thumbs up/down) and Feedback (text input), and we'll manually review our samples to populate these fields with some high-quality feedback. We'll also add a couple of model graders, which will evaluate the agent's output in an automated way based on criteria that we can specify. For this example, we might be concerned about whether the course recommendations are relevant to the skill gaps identified (relevance), whether all of the skill gaps are addressed (coverage), and whether the recommendation summary that is presented is appropriate (style).\n", + "\n", + "Here are example model grader prompts for each criteria: \n", + "\n", + "```text\n", + "[relevance] You are evaluating whether a list of recommended courses is relevant to the skills described. Return a pass if all courses are relevant to at least one skill, and fail otherwise.\n", + "\n", + "[coverage] You are evaluating whether a list of recommended courses covers all of the skills described. Return a pass if all of the skills are covered by at least one course, and fail otherwise.\n", + "\n", + "[summary] You are evaluating whether the summary recommendation provided is relevant, thoughtful, and related to the recommended courses proposed. Evaluate the summary recommendation on a scale of 0 to 1, with 1 being the highest quality. \n", + "```\n", + "\n", + "We'll use GPT-5 for our model graders, and include a 0.7 threshold for the summary grader.\n", + "\n", + "We'll now select Grade > All graders to run these graders against the system output. As the grading proceeds, we'll start to see the cells populated to indicate how each example scored on our model grader criteria." + ] + }, + { + "cell_type": "markdown", + "id": "85f33a49", + "metadata": {}, + "source": [ + "
" + ] + }, + { + "cell_type": "markdown", + "id": "f7cefe90", + "metadata": {}, + "source": [ + "**Now, here's where the magic happens: we can click Optimize to automatically rewrite our prompt based on the feedback we've provided - both the model grader output and the human-provided feedback.** \n", + "\n", + "If we examine the new prompt, we see that our prompt contains new **Requirements** and **Output** format sections to instruct the model to make the course descriptions more specific, and aim for better coverage of the different skills. \n", + "\n", + "```text\n", + "Requirements:\n", + "\n", + "- Use the web search tool to find and verify real, currently available online courses and their direct URLs.\n", + "- Return 3–5 courses that collectively cover the skills. If a course spans multiple skills, indicate it. \n", + "- Be specific and concise. Each course description must be one sentence (max 35 words) focused on outcomes and topics tied to the skills. \n", + "- Provide plain text only; no citations or references.\n", + "\n", + "Output format:\n", + "\n", + "1) Summary (2–4 sentences) explaining how the selections address the skills and any coverage tradeoffs. \n", + "2) Courses (3–5 items). For each course, include exactly:` \n", + " - Course name — Provider (school or program)\n", + " - Description: [one sentence, max 35 words] \n", + " - URL: [direct course page]\n", + " - Skills covered: [list skill names from below]\n", + "```\n", + "\n", + "Now, we can click Update to automatically insert the new prompt into our workflow in Agent Builder. **In just a few minutes, we've been able to use real prompt examples and feedback to automatically improve our system's performance - all directly within the Agent Builder and Evals tools.**\n", + "\n", + "(Although in this cookbook we've optimized our prompt automatically using this grading output, it's often very helpful to examine specific failure examples to understand in what ways the model might be making mistakes. This analysis can help us generate more precise human-based or model-graded feedback, and even synthesize data to help improve performance against specific failure modes.)\n" + ] + }, + { + "cell_type": "markdown", + "id": "907cd4ea", + "metadata": {}, + "source": [ + "### Entire workflow optimization\n", + "\n", + "Once we're comfortable with the performance of each individual agent node, we can turn our attention to the full workflow.\n", + "\n", + "Previously, in order to analyze and improve the performance of a complex workflow involving multiple agents, you'd need to read through entire traces of your workflow to understand exactly where and why the system was failing. This can be a time-consuming process, especially when you have a large number of trace examples.\n", + "\n", + "Using **trace grading**, we can now run end-to-end assessments of full sets of traces using automated model grading. We'll create graders to describe the behavior we're looking to correct, and we'll automatically run these graders across our entire data set.\n", + "\n", + "To get started, we'll go back to our workflow in Agent Builder and click Evaluate on the top of the screen. This lets us deep-dive into the traces that have been generated for our workflow runs, including examining the inputs and outputs for all nodes (in this case, the prompts and responses for each of our agents). 
\n", + "\n", + "We can create graders to run across the entire model trace, defining evaluation criteria for the end-to-end workflow that **spans multiple agents**. For example, we might want to ensure that the final recommendation summary (output of agent 3) is relevant to the user’s initial input about their career goals (input to agent 1). And, we might want to check that the recommended courses (output of agent 3) are not duplicative with the skills the user already possesses (output of agent 1). " + ] + }, + { + "cell_type": "markdown", + "id": "909dcd77", + "metadata": {}, + "source": [ + "
" + ] + }, + { + "cell_type": "markdown", + "id": "2d9855a6", + "metadata": {}, + "source": [ + "If you had a workflow with conditional statements or while loops, you could grade against more complex multi-step behavior, such as a support agent shouldn’t engage in more than three responses with a user without escalating to a supervisor.\n", + "\n", + "Once we have a full set of grading criteria, we select Grade all to grade our traces. This action will lead us to the Evaluations tab, where we can see a new eval has been created and an eval run has been kicked off." + ] + }, + { + "cell_type": "markdown", + "id": "0611291d", + "metadata": {}, + "source": [ + "
" + ] + }, + { + "cell_type": "markdown", + "id": "87a4e298", + "metadata": {}, + "source": [ + "We can then dive into the workflow traces for our failure cases to better understand why the specific workflow run failed, and how we can improve our system to avoid the failure. This approach helps you optimize complex workflows more efficiently, by iteratively identifying failure modes, evaluating the performance of your system, and targeting improvements to improve performance." + ] + }, + { + "cell_type": "markdown", + "id": "6a59d33f", + "metadata": {}, + "source": [ + "## Recap and resources\n", + "\n", + "We demonstrated how **Agent Builder**, **ChatKit**, and **Evals** work together to help you build, deploy, and optimize agentic workflows.\n", + "\n", + "With a specific example — a career development app that analyzes resumes, identifies skill gaps, and recommends online courses — we saw how Agent Builder makes it easy to design and build multi-agent workflows, ChatKit lets us embed those workflows in a rich and customizable chat UI, and Evals close the loop by enabling prompt optimization and trace grading against real data.\n", + "\n", + "To learn more, here's a list of some of the resources mentioned in this cookbook:\n", + "\n", + "* [Agent Builder documentation](#)\n", + "* [ChatKit starter template](#)\n", + "* [ChatKit Studio](#)\n", + "* [Agents SDK](#)\n", + "* [Evals](#)\n", + "\n", + "Happy building!" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.9" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/codex/build_code_review_with_codex_sdk.md b/examples/codex/build_code_review_with_codex_sdk.md new file mode 100644 index 0000000000..4f261357f2 --- /dev/null +++ b/examples/codex/build_code_review_with_codex_sdk.md @@ -0,0 +1,663 @@ +# Build Code Review with the Codex SDK + +With [Code Review](https://chatgpt.com/codex/settings/code-review) in Codex Cloud, you can connect your team's cloud hosted GitHub repository to Codex and receive automated code reviews on every PR. But what if your code is hosted on-prem, or you don't have GitHub as an SCM? + +Luckily, we can replicate Codex's cloud hosted review process in our own CI/CD runners. In this guide, we'll build our own Code Review action using the Codex CLI headless mode with both GitHub Actions and Jenkins. + +To build our own Code review, we'll take the following steps: + +1. Install the Codex CLI in our CI/CD runner +2. Prompt Codex in headless (exec) mode with the Code Review prompt that ships with the CLI +3. Specify a structured output JSON schema for Codex +4. Parse the JSON result and use it to make API calls to our SCM to create review comments + +Once implemented, Codex will be able to leave inline code review comments: +Codex Code Review in GitHub + +## The Code Review Prompt + +GPT-5-Codex has received specific training to improve its code review abilities. You can steer GPT-5-Codex to conduct a code review with the following prompt: + +``` +You are acting as a reviewer for a proposed code change made by another engineer. +Focus on issues that impact correctness, performance, security, maintainability, or developer experience. 
+Flag only actionable issues introduced by the pull request. +When you flag an issue, provide a short, direct explanation and cite the affected file and line range. +Prioritize severe issues and avoid nit-level comments unless they block understanding of the diff. +After listing findings, produce an overall correctness verdict (\"patch is correct\" or \"patch is incorrect\") with a concise justification and a confidence score between 0 and 1. +Ensure that file citations and line numbers are exactly correct using the tools available; if they are incorrect your comments will be rejected. +``` + +## Codex Structured Outputs + +In order to make comments on code ranges in our pull request, we need to receive Codex's response in a specific format. To do that we can create a file called `codex-output-schema.json` that conforms to OpenAI's [structured outputs](https://platform.openai.com/docs/guides/structured-outputs) format. + +To use this file in our workflow YAML, we can call Codex with the `output-schema-file` argument like this: + +```yaml +- name: Run Codex structured review + id: run-codex + uses: openai/codex-action@main + with: + openai-api-key: ${{ secrets.OPENAI_API_KEY }} + prompt-file: codex-prompt.md + sandbox: read-only + model: ${{ env.CODEX_MODEL }} + output-schema-file: codex-output-schema.json # <-- Our schema file + output-file: codex-output.json +``` + +You can also pass a similar argument to `codex exec` for example: + +```bash +codex exec "Review my pull request!" --output-schema codex-output-schema.json +``` + +## GitHub Actions Example + +Let's put it all together. If you're using GitHub Actions in an on-prem environment, you can tailor this example to your specific workflow. Inline comments highlight the key steps. + +```yaml +name: Codex Code Review + +# Determine when the review action should be run: +on: + pull_request: + types: + - opened + - reopened + - synchronize + - ready_for_review + +concurrency: + group: codex-structured-review-${{ github.event.pull_request.number }} + cancel-in-progress: true + +jobs: + codex-structured-review: + name: Run Codex structured review + runs-on: ubuntu-latest + permissions: + contents: read + pull-requests: write + env: + OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }} + GITHUB_TOKEN: ${{ github.token }} + CODEX_MODEL: ${{ vars.CODEX_MODEL || 'o4-mini' }} + PR_NUMBER: ${{ github.event.pull_request.number }} + HEAD_SHA: ${{ github.event.pull_request.head.sha }} + BASE_SHA: ${{ github.event.pull_request.base.sha }} + REPOSITORY: ${{ github.repository }} + outputs: + codex-output: ${{ steps.run-codex.outputs.final-message }} + steps: + - name: Checkout pull request merge commit + uses: actions/checkout@v5 + with: + ref: refs/pull/${{ github.event.pull_request.number }}/merge + + - name: Fetch base and head refs + run: | + set -euxo pipefail + git fetch --no-tags origin \ + "${{ github.event.pull_request.base.ref }}" \ + +refs/pull/${{ github.event.pull_request.number }}/head + shell: bash + + # The structured output schema ensures that codex produces comments + # with filepaths, line numbers, title, body, etc. 
+ - name: Generate structured output schema + run: | + set -euo pipefail + cat <<'JSON' > codex-output-schema.json + { + "type": "object", + "properties": { + "findings": { + "type": "array", + "items": { + "type": "object", + "properties": { + "title": { + "type": "string", + "maxLength": 80 + }, + "body": { + "type": "string", + "minLength": 1 + }, + "confidence_score": { + "type": "number", + "minimum": 0, + "maximum": 1 + }, + "priority": { + "type": "integer", + "minimum": 0, + "maximum": 3 + }, + "code_location": { + "type": "object", + "properties": { + "absolute_file_path": { + "type": "string", + "minLength": 1 + }, + "line_range": { + "type": "object", + "properties": { + "start": { + "type": "integer", + "minimum": 1 + }, + "end": { + "type": "integer", + "minimum": 1 + } + }, + "required": [ + "start", + "end" + ], + "additionalProperties": false + } + }, + "required": [ + "absolute_file_path", + "line_range" + ], + "additionalProperties": false + } + }, + "required": [ + "title", + "body", + "confidence_score", + "priority", + "code_location" + ], + "additionalProperties": false + } + }, + "overall_correctness": { + "type": "string", + "enum": [ + "patch is correct", + "patch is incorrect" + ] + }, + "overall_explanation": { + "type": "string", + "minLength": 1 + }, + "overall_confidence_score": { + "type": "number", + "minimum": 0, + "maximum": 1 + } + }, + "required": [ + "findings", + "overall_correctness", + "overall_explanation", + "overall_confidence_score" + ], + "additionalProperties": false + } + JSON + shell: bash + + # This section generates our prompt: + - name: Build Codex review prompt + env: + REVIEW_PROMPT_PATH: ${{ vars.CODEX_PROMPT_PATH || 'review_prompt.md' }} + run: | + set -euo pipefail + PROMPT_PATH="codex-prompt.md" + TEMPLATE_PATH="${REVIEW_PROMPT_PATH}" + + if [ -n "$TEMPLATE_PATH" ] && [ -f "$TEMPLATE_PATH" ]; then + cat "$TEMPLATE_PATH" > "$PROMPT_PATH" + else + { + printf '%s\n' "You are acting as a reviewer for a proposed code change made by another engineer." + printf '%s\n' "Focus on issues that impact correctness, performance, security, maintainability, or developer experience." + printf '%s\n' "Flag only actionable issues introduced by the pull request." + printf '%s\n' "When you flag an issue, provide a short, direct explanation and cite the affected file and line range." + printf '%s\n' "Prioritize severe issues and avoid nit-level comments unless they block understanding of the diff." + printf '%s\n' "After listing findings, produce an overall correctness verdict (\"patch is correct\" or \"patch is incorrect\") with a concise justification and a confidence score between 0 and 1." + printf '%s\n' "Ensure that file citations and line numbers are exactly correct using the tools available; if they are incorrect your comments will be rejected." 
+ } > "$PROMPT_PATH" + fi + + { + echo "" + echo "Repository: ${REPOSITORY}" + echo "Pull Request #: ${PR_NUMBER}" + echo "Base ref: ${{ github.event.pull_request.base.ref }}" + echo "Head ref: ${{ github.event.pull_request.head.ref }}" + echo "Base SHA: ${BASE_SHA}" + echo "Head SHA: ${HEAD_SHA}" + echo "Changed files:" + git --no-pager diff --name-status "${BASE_SHA}" "${HEAD_SHA}" + echo "" + echo "Unified diff (context=5):" + git --no-pager diff --unified=5 --stat=200 "${BASE_SHA}" "${HEAD_SHA}" > /tmp/diffstat.txt + git --no-pager diff --unified=5 "${BASE_SHA}" "${HEAD_SHA}" > /tmp/full.diff + cat /tmp/diffstat.txt + echo "" + cat /tmp/full.diff + } >> "$PROMPT_PATH" + shell: bash + + # Putting it all together: we run the codex action with our code review prompt, + # structured output, and output file: + - name: Run Codex structured review + id: run-codex + uses: openai/codex-action@main + with: + openai-api-key: ${{ secrets.OPENAI_API_KEY }} + prompt-file: codex-prompt.md + output-schema-file: codex-output-schema.json + output-file: codex-output.json + sandbox: read-only + model: ${{ env.CODEX_MODEL }} + + - name: Inspect structured Codex output + if: ${{ always() }} + run: | + if [ -s codex-output.json ]; then + jq '.' codex-output.json || true + else + echo "Codex output file missing" + fi + shell: bash + + # This step produces in-line code review comments on specific line + # ranges of code. + - name: Publish inline review comments + if: ${{ always() }} + env: + REVIEW_JSON: codex-output.json + run: | + set -euo pipefail + if [ ! -s "$REVIEW_JSON" ]; then + echo "No Codex output file present; skipping comment publishing." + exit 0 + fi + findings_count=$(jq '.findings | length' "$REVIEW_JSON") + if [ "$findings_count" -eq 0 ]; then + echo "Codex returned no findings; skipping inline comments." + exit 0 + fi + jq -c --arg commit "$HEAD_SHA" '.findings[] | { + body: (.title + "\n\n" + .body + "\n\nConfidence: " + (.confidence_score | tostring) + (if has("priority") then "\nPriority: P" + (.priority | tostring) else "" end)), + commit_id: $commit, + path: .code_location.absolute_file_path, + line: .code_location.line_range.end, + side: "RIGHT", + start_line: (if .code_location.line_range.start != .code_location.line_range.end then .code_location.line_range.start else null end), + start_side: (if .code_location.line_range.start != .code_location.line_range.end then "RIGHT" else null end) + } | with_entries(select(.value != null))' "$REVIEW_JSON" > findings.jsonl + while IFS= read -r payload; do + echo "Posting review comment payload:" && echo "$payload" | jq '.' + curl -sS \ + -X POST \ + -H "Accept: application/vnd.github+json" \ + -H "Authorization: Bearer ${GITHUB_TOKEN}" \ + -H "X-GitHub-Api-Version: 2022-11-28" \ + "https://api.github.com/repos/${REPOSITORY}/pulls/${PR_NUMBER}/comments" \ + -d "$payload" + done < findings.jsonl + shell: bash + + # This section creates a single comment summarizing the review. + - name: Publish overall summary comment + if: ${{ always() }} + env: + REVIEW_JSON: codex-output.json + run: | + set -euo pipefail + if [ ! -s "$REVIEW_JSON" ]; then + echo "Codex output missing; skipping summary." 
+ exit 0 + fi + overall_state=$(jq -r '.overall_correctness' "$REVIEW_JSON") + overall_body=$(jq -r '.overall_explanation' "$REVIEW_JSON") + confidence=$(jq -r '.overall_confidence_score' "$REVIEW_JSON") + msg="**Codex automated review**\n\nVerdict: ${overall_state}\nConfidence: ${confidence}\n\n${overall_body}" + curl -sS \ + -X POST \ + -H "Accept: application/vnd.github+json" \ + -H "Authorization: Bearer ${GITHUB_TOKEN}" \ + -H "X-GitHub-Api-Version: 2022-11-28" \ + "https://api.github.com/repos/${REPOSITORY}/issues/${PR_NUMBER}/comments" \ + -d "$(jq -n --arg body "$msg" '{body: $body}')" + shell: bash +``` + +## Jenkins Example + +We can use the same approach to scripting a job with Jenkins. Once again, comments highlight key stages of the workflow: + +```groovy +pipeline { + agent any + + options { + timestamps() + ansiColor('xterm') + // Prevent overlapping runs on the same PR. Newer builds will cancel older ones after passing the milestone. + disableConcurrentBuilds() + } + + environment { + // Default model like your GHA (can be overridden at job/env level) + CODEX_MODEL = "${env.CODEX_MODEL ?: 'o4-mini'}" + + // Filled in during Init + PR_NUMBER = '' + HEAD_SHA = '' + BASE_SHA = '' + REPOSITORY = '' // org/repo + } + + stages { + stage('Init (PR context, repo, SHAs)') { + steps { + checkout scm + + // Compute PR context and SHAs similar to the GitHub Action + sh ''' + set -euo pipefail + + # Derive PR number from Jenkins env when building PRs via GitHub Branch Source + PR_NUMBER="${CHANGE_ID:-}" + if [ -z "$PR_NUMBER" ]; then + echo "Not a PR build (CHANGE_ID missing). Exiting." + exit 1 + fi + echo "PR_NUMBER=$PR_NUMBER" >> $WORKSPACE/jenkins.env + + # Discover owner/repo (normalize SSH/HTTPS forms) + ORIGIN_URL="$(git config --get remote.origin.url)" + if echo "$ORIGIN_URL" | grep -qE '^git@github.com:'; then + REPO_PATH="${ORIGIN_URL#git@github.com:}" + REPO_PATH="${REPO_PATH%.git}" + else + # e.g. 
https://github.com/owner/repo.git + REPO_PATH="${ORIGIN_URL#https://github.com/}" + REPO_PATH="${REPO_PATH%.git}" + fi + echo "REPOSITORY=$REPO_PATH" >> $WORKSPACE/jenkins.env + + # Ensure we have all refs we need + git fetch --no-tags origin \ + "+refs/heads/*:refs/remotes/origin/*" \ + "+refs/pull/${PR_NUMBER}/head:refs/remotes/origin/PR-${PR_NUMBER}-head" \ + "+refs/pull/${PR_NUMBER}/merge:refs/remotes/origin/PR-${PR_NUMBER}-merge" + + # HEAD (PR head) and BASE (target branch tip) + CHANGE_TARGET="${CHANGE_TARGET:-main}" + HEAD_SHA="$(git rev-parse refs/remotes/origin/PR-${PR_NUMBER}-head)" + BASE_SHA="$(git rev-parse refs/remotes/origin/${CHANGE_TARGET})" + + echo "HEAD_SHA=$HEAD_SHA" >> $WORKSPACE/jenkins.env + echo "BASE_SHA=$BASE_SHA" >> $WORKSPACE/jenkins.env + + echo "Resolved:" + echo " REPOSITORY=$REPO_PATH" + echo " PR_NUMBER=$PR_NUMBER" + echo " CHANGE_TARGET=$CHANGE_TARGET" + echo " HEAD_SHA=$HEAD_SHA" + echo " BASE_SHA=$BASE_SHA" + ''' + script { + def envMap = readProperties file: 'jenkins.env' + env.PR_NUMBER = envMap['PR_NUMBER'] + env.REPOSITORY = envMap['REPOSITORY'] + env.HEAD_SHA = envMap['HEAD_SHA'] + env.BASE_SHA = envMap['BASE_SHA'] + } + + // Ensure only latest build for this PR proceeds; older in-flight builds will be aborted here + milestone 1 + } + } + + stage('Generate structured output schema') { + steps { + sh ''' + set -euo pipefail + cat > codex-output-schema.json <<'JSON' + { + "type": "object", + "properties": { + "findings": { + "type": "array", + "items": { + "type": "object", + "properties": { + "title": { "type": "string", "maxLength": 80 }, + "body": { "type": "string", "minLength": 1 }, + "confidence_score": { "type": "number", "minimum": 0, "maximum": 1 }, + "priority": { "type": "integer", "minimum": 0, "maximum": 3 }, + "code_location": { + "type": "object", + "properties": { + "absolute_file_path": { "type": "string", "minLength": 1 }, + "line_range": { + "type": "object", + "properties": { + "start": { "type": "integer", "minimum": 1 }, + "end": { "type": "integer", "minimum": 1 } + }, + "required": ["start","end"], + "additionalProperties": false + } + }, + "required": ["absolute_file_path","line_range"], + "additionalProperties": false + } + }, + "required": ["title","body","confidence_score","priority","code_location"], + "additionalProperties": false + } + }, + "overall_correctness": { "type": "string", "enum": ["patch is correct","patch is incorrect"] }, + "overall_explanation": { "type": "string", "minLength": 1 }, + "overall_confidence_score": { "type": "number", "minimum": 0, "maximum": 1 } + }, + "required": ["findings","overall_correctness","overall_explanation","overall_confidence_score"], + "additionalProperties": false + } + JSON + ''' + } + } + + stage('Build Codex review prompt') { + environment { + REVIEW_PROMPT_PATH = "${env.CODEX_PROMPT_PATH ?: 'review_prompt.md'}" + } + steps { + sh ''' + set -euo pipefail + PROMPT_PATH="codex-prompt.md" + TEMPLATE_PATH="${REVIEW_PROMPT_PATH}" + + if [ -n "$TEMPLATE_PATH" ] && [ -f "$TEMPLATE_PATH" ]; then + cat "$TEMPLATE_PATH" > "$PROMPT_PATH" + else + { + printf '%s\n' "You are acting as a reviewer for a proposed code change made by another engineer." + printf '%s\n' "Focus on issues that impact correctness, performance, security, maintainability, or developer experience." + printf '%s\n' "Flag only actionable issues introduced by the pull request." + printf '%s\n' "When you flag an issue, provide a short, direct explanation and cite the affected file and line range." 
+ printf '%s\n' "Prioritize severe issues and avoid nit-level comments unless they block understanding of the diff." + printf '%s\n' "After listing findings, produce an overall correctness verdict (\\\"patch is correct\\\" or \\\"patch is incorrect\\\") with a concise justification and a confidence score between 0 and 1." + printf '%s\n' "Ensure that file citations and line numbers are exactly correct using the tools available; if they are incorrect your comments will be rejected." + } > "$PROMPT_PATH" + fi + + { + echo "" + echo "Repository: ${REPOSITORY}" + echo "Pull Request #: ${PR_NUMBER}" + echo "Base ref: ${CHANGE_TARGET}" + echo "Head ref: ${CHANGE_BRANCH:-PR-${PR_NUMBER}-head}" + echo "Base SHA: ${BASE_SHA}" + echo "Head SHA: ${HEAD_SHA}" + echo "Changed files:" + git --no-pager diff --name-status "${BASE_SHA}" "${HEAD_SHA}" + echo "" + echo "Unified diff (context=5):" + git --no-pager diff --unified=5 --stat=200 "${BASE_SHA}" "${HEAD_SHA}" > /tmp/diffstat.txt + git --no-pager diff --unified=5 "${BASE_SHA}" "${HEAD_SHA}" > /tmp/full.diff + cat /tmp/diffstat.txt + echo "" + cat /tmp/full.diff + } >> "$PROMPT_PATH" + ''' + } + } + + stage('Run Codex structured review') { + environment { + REVIEW_PROMPT = 'codex-prompt.md' + REVIEW_SCHEMA = 'codex-output-schema.json' + REVIEW_OUTPUT = 'codex-output.json' + } + steps { + withCredentials([ + string(credentialsId: 'openai-api-key', variable: 'OPENAI_API_KEY') + ]) { + // Option A: If you have the OpenAI CLI installed on the Jenkins agent + sh ''' + set -euo pipefail + if command -v openai >/dev/null 2>&1; then + # Use the Responses API with a JSON schema tool spec + # Produces codex-output.json with the structured result. + openai responses.create \ + --model "${CODEX_MODEL}" \ + --input-file "${REVIEW_PROMPT}" \ + --response-format "json_object" \ + --output-schema "${RESPONSE_FORMAT}" \ + --tool-choice "auto" \ + > raw_response.json || true + + # Fallback if CLI doesn’t support your exact flags: + # Keep demo resilient: If raw_response.json is empty, create a minimal stub so later steps don’t fail. + if [ ! -s raw_response.json ]; then + echo '{"findings":[],"overall_correctness":"patch is correct","overall_explanation":"No issues detected.","overall_confidence_score":0.5}' > "${REVIEW_OUTPUT}" + else + # If your CLI/format returns a JSON object with the structured content in .output or similar, map it here. + # Adjust jq path to match your CLI output shape. + jq -r '.output // .' raw_response.json > "${REVIEW_OUTPUT}" || cp raw_response.json "${REVIEW_OUTPUT}" + fi + else + echo "openai CLI not found; creating a stub output for demo continuity." + echo '{"findings":[],"overall_correctness":"patch is correct","overall_explanation":"(CLI not available on agent)","overall_confidence_score":0.4}' > "${REVIEW_OUTPUT}" + fi + ''' + } + } + } + + stage('Inspect structured Codex output') { + steps { + sh ''' + if [ -s codex-output.json ]; then + jq '.' codex-output.json || true + else + echo "Codex output file missing" + fi + ''' + } + } + + stage('Publish inline review comments') { + when { expression { true } } + steps { + withCredentials([string(credentialsId: 'github-token', variable: 'GITHUB_TOKEN')]) { + sh ''' + set -euo pipefail + REVIEW_JSON="codex-output.json" + if [ ! -s "$REVIEW_JSON" ]; then + echo "No Codex output file present; skipping comment publishing." 
+ exit 0 + fi + + findings_count=$(jq '.findings | length' "$REVIEW_JSON") + if [ "$findings_count" -eq 0 ]; then + echo "Codex returned no findings; skipping inline comments." + exit 0 + fi + + jq -c --arg commit "$HEAD_SHA" '.findings[] | { + body: (.title + "\\n\\n" + .body + "\\n\\nConfidence: " + (.confidence_score | tostring) + (if has("priority") then "\\nPriority: P" + (.priority | tostring) else "" end)), + commit_id: $commit, + path: .code_location.absolute_file_path, + line: .code_location.line_range.end, + side: "RIGHT", + start_line: (if .code_location.line_range.start != .code_location.line_range.end then .code_location.line_range.start else null end), + start_side: (if .code_location.line_range.start != .code_location.line_range.end then "RIGHT" else null end) + } | with_entries(select(.value != null))' "$REVIEW_JSON" > findings.jsonl + + while IFS= read -r payload; do + echo "Posting review comment payload:" && echo "$payload" | jq '.' + curl -sS \ + -X POST \ + -H "Accept: application/vnd.github+json" \ + -H "Authorization: Bearer ${GITHUB_TOKEN}" \ + -H "X-GitHub-Api-Version: 2022-11-28" \ + "https://api.github.com/repos/${REPOSITORY}/pulls/${PR_NUMBER}/comments" \ + -d "$payload" + done < findings.jsonl + ''' + } + } + } + + stage('Publish overall summary comment') { + steps { + withCredentials([string(credentialsId: 'github-token', variable: 'GITHUB_TOKEN')]) { + sh ''' + set -euo pipefail + REVIEW_JSON="codex-output.json" + if [ ! -s "$REVIEW_JSON" ]; then + echo "Codex output missing; skipping summary." + exit 0 + fi + + overall_state=$(jq -r '.overall_correctness' "$REVIEW_JSON") + overall_body=$(jq -r '.overall_explanation' "$REVIEW_JSON") + confidence=$(jq -r '.overall_confidence_score' "$REVIEW_JSON") + msg="**Codex automated review**\\n\\nVerdict: ${overall_state}\\nConfidence: ${confidence}\\n\\n${overall_body}" + + jq -n --arg body "$msg" '{body: $body}' > /tmp/summary.json + + curl -sS \ + -X POST \ + -H "Accept: application/vnd.github+json" \ + -H "Authorization: Bearer ${GITHUB_TOKEN}" \ + -H "X-GitHub-Api-Version: 2022-11-28" \ + "https://api.github.com/repos/${REPOSITORY}/issues/${PR_NUMBER}/comments" \ + -d @/tmp/summary.json + ''' + } + } + } + } + + post { + always { + archiveArtifacts artifacts: 'codex-*.json, *.md, /tmp/diff*.txt', allowEmptyArchive: true + } + } +} +``` + +# Wrap Up + +With the Codex SDK, you can build your own GitHub Code Review in on-prem environments. However, the pattern of triggering Codex with a prompt, receiving a structured output, and then acting on that output with an API call extends far beyond Code Review. For example, we could use this pattern to trigger a root-cause analysis when an incident is created and post a structured report into a Slack channel. Or we could create a code quality report on each PR and post results into a dashboard. diff --git a/examples/codex/code_modernization.md b/examples/codex/code_modernization.md new file mode 100644 index 0000000000..a818830843 --- /dev/null +++ b/examples/codex/code_modernization.md @@ -0,0 +1,360 @@ +# Modernizing your Codebase with Codex + +## Introduction + +Codex is trained to read and reason about large, complex codebases, plan work alongside engineers, and produce high-quality changes. Code modernization has quickly become one of its most common and valuable uses. 
In this setup, engineers focus on architecture and business rules while Codex handles the heavy lifting: translating legacy patterns, proposing safe refactors, and keeping documentation and tests in sync as the system evolves.
+
+This cookbook shows how to use **OpenAI's Codex CLI** to modernize a legacy repository in a way that is:
+
+* Understandable to new engineers
+* Auditable for architects and risk teams
+* Repeatable as a pattern across other systems
+
+We’ll use a COBOL-based [investment portfolio system](https://github.com/sentientsergio/COBOL-Legacy-Benchmark-Suite/) as the running example and choose a single pilot flow to focus on. You can substitute any legacy stack (e.g. a Java monolith or PL/SQL) where you have legacy programs, orchestration (jobs, schedulers, scripts), or shared data sources.
+
+---
+
+## High-Level Overview
+
+We’ve broken the work into a setup phase plus five working phases (Phase 0 through Phase 5), all revolving around an executive plan (ExecPlan for short), a design document that the agent can follow to deliver the system change.
+
+Code Modernization Phases
+
+We will create 4 types of documents for the pilot flow we choose:
+
+* **pilot_execplan.md** - ExecPlan that orchestrates the pilot and answers: what’s in scope, why it matters, what steps we’ll take, and how we’ll know we’re done.
+* **pilot_overview.md** - Which legacy programs (COBOL in our example), orchestration jobs (JCL here), and data sources are involved, how data flows between them, and what the business flow actually does.
+* **pilot_design.md** - Target shape of the system: the service/module that will own this flow, the new data model, and the public APIs or batch entry points.
+* **pilot_validation.md** - Defines how we’ll prove parity: key scenarios, shared input datasets, how to run legacy vs modern side-by-side, and what “matching outputs” means in practice.
+
+These 4 files help lay out what code is being changed, what the new system should look like, and exactly how to check that behavior hasn’t regressed.
+
+---
+
+## Phase 0 - Set up AGENTS and PLANS
+
+**Goal**: Give Codex a lightweight contract for how planning works in this repo, without overwhelming people with process.
+
+We’re taking inspiration from the [Using PLANS.md for multi-hour problem solving](https://cookbook.openai.com/articles/codex_exec_plans) cookbook to create AGENTS.md and PLANS.md files that will be placed in a .agent folder.
+
+* AGENTS.md: If you haven’t created an AGENTS.md for your repository yet, we suggest using the /init command. Once generated, add a section to your AGENTS.md that instructs the agent to reference PLANS.md.
+* PLANS.md: Use the example provided in the cookbook as a starting point.
+
+These explain what an ExecPlan is, when to create or update one, where it lives, and what sections every plan must have.
+
+### Where Codex CLI helps
+If you want Codex to tighten AGENTS or PLANS for your specific repo, you can run:
+
+```md
+Please read the directory structure and refine .agent/AGENTS.md and .agent/PLANS.md so they are a clear, opinionated standard for how we plan COBOL modernization work here. Keep the ExecPlan skeleton but add one or two concrete examples.
+```
+
+---
+
+## Phase 1 - Pick a pilot and create the first ExecPlan
+
+**Goal**: Align on one realistic but bounded pilot flow and capture the plan for Phase 1 in a single ExecPlan file.
+
+**Key artifact**: pilot_execplan.md
+
+### 1.1 Choose pilot flow
+If you don’t have a flow in mind to pilot with, you can ask Codex to propose one.
Example prompt from the repository root: + +```md +Look through this repository and propose one or two candidate pilot flows for modernization that are realistic but bounded. +For each candidate, list: +- COBOL programs and copybooks involved +- JCL members involved +- The business scenario in plain language +- End with a clear recommendation for which flow we should use as the first pilot +``` + +In this case, we’ll choose a reporting flow as the pilot. + +Pilot Candidate Flow + +### 1.2 Ask Codex to create the pilot ExecPlan + +```md +Create pilot_execplan.md following .agent/PLANS.md. Scope it to the daily reporting flow. The plan should cover four outcomes for this one flow: +- Inventory and diagrams +- Modernization Technical Report content +- Target design and spec +- Test plan for parity +Use the ExecPlan skeleton and fill it in with concrete references to the actual COBOL and JCL files. +``` + +This plan is now your “home base” for all pilot work. + +--- + +## Phase 2 - Inventory and discovery + +**Goal**: Capture what the pilot flow actually does today: programs, jobs, data flows, and business rules. Engineers can reason about the change without reading every line of legacy code. + +**Key artifact**: pilot_reporting_overview.md + +**Where engineers can focus**: + +* Confirm which jobs truly run in production +* Fill in gaps Codex cannot infer from code (SLAs, operational context, owners) +* Sanity check diagrams and descriptions + +### 2.1 Ask Codex to draft the overview +```md +Create or update pilot_reporting_overview.md with two top-level sections: “Inventory for the pilot” and “Modernization Technical Report for the pilot”. +Use pilot_execplan.md to identify the pilot flow. + +In the inventory section, include: +1. The COBOL programs and copybooks involved, grouped as batch, online, and utilities if applicable +2. The JCL jobs and steps that call these programs +3. The data sets or tables they read and write +4. A simple text diagram that shows the sequence of jobs and data flows + +In the modernization technical report section, describe: +1. The business scenario for this flow in plain language +2. Detailed behavior of each COBOL program in the flow +3. The data model for the key files and tables, including field names and meanings +4. Known technical risks such as date handling, rounding, special error codes, or tricky conditions +``` + +This document will be helpful for engineers to understand the shape and behavior of the pilot without reading all the code. + +Example of the flow diagram in pilot_reporting_overview.md + +Pilot Flow Diagram + +### 2.2 Update the ExecPlan + +Once the overview exists, ask Codex to keep the plan aligned + +```md +Update pilot_execplan.md to reflect the new pilot_reporting_overview.md file. +- In Progress, mark the inventory and MTR sections as drafted. +- Add any notable findings to Surprises and discoveries and Decision log. +- Keep the ExecPlan readable for someone new to the repo. +``` + +At the end of Phase 2, you’ll have a single pilot overview doc that plays the role of both system inventory report and modernization technical report. + +--- + +## Phase 3 - Design, spec, and validation plan + +**Goal** + +* Decide what the modern version of the pilot flow should look like +* Describe the target service and data model +* Define how to prove parity through tests and parallel runs. + +By the end of this phase, we’ll have decided what we’re building and how we’ll prove it works. 
+ +**Key artifacts** + +* pilot_reporting_design.md +* pilot_reporting_validation.md +* modern/openapi/pilot.yaml +* modern/tests/pilot_parity_test.py + +### 3.1 Target design document + +```md +Based on pilot_reporting_overview.md, draft pilot_reporting_design.md with these sections: + +# Target service design +- Which service or module will own this pilot flow in the modern architecture. +- Whether it will be implemented as a batch job, REST API, event listener, or a combination. +- How it fits into the broader domain model. + +# Target data model +- Proposed database tables and columns that replace the current files or DB2 tables. +- Keys, relationships, and any derived fields. +- Notes about how legacy encodings such as packed decimals or EBCDIC fields will be represented. + +# API design overview +- The main operations users or systems will call. +- A short description of each endpoint or event. +- A pointer to modern/openapi/pilot.yaml where the full schema will live. +``` + +### 3.2 API specification + +We capture the pilot flow’s external behavior in an OpenAPI file so the modern system has a clear, language-agnostic contract. This spec becomes the anchor for implementation, test generation, and future integrations, and it gives Codex something concrete to scaffold code and tests from. + +```md +Using pilot_reporting_design.md, draft an OpenAPI file at modern/openapi/pilot.yaml that describes the external API for this pilot. Include: +- Paths and operations for the main endpoints or admin hooks +- Request and response schemas for each operation +- Field types and constraints, aligning with the target data model +``` + +Example output: + +Pilot Yaml + +### 3.3 Validation and test plan + +```md +Create or update pilot_reporting_validation.md with three sections: + +# Test plan +- Key scenarios, including at least one happy path and a couple of edge cases. +- Inputs and outputs to capture for each scenario. + +# Parity and comparison strategy +- How you will run the legacy COBOL flow and the modern implementation on the same input data. +- What outputs will be compared (files, tables, logs). +- How differences will be detected and triaged. + +# Test scaffolding +- Notes about the test file modern/tests/pilot_parity_test.py, including how to run it. +- What needs to be filled in once the modern implementation exists. +``` + +Then ask Codex to scaffold the tests: + +```md +Using pilot_reporting_validation.md, create an initial test file at modern/tests/pilot_parity_test.py. + +Include placeholder assertions and comments that reference the scenarios in the test plan, but do not assume the modern implementation is present yet. +``` + +### 3.4 Update the ExecPlan + +```md +Update pilot_execplan.md so that Plan of work, Concrete steps, and Validation and acceptance explicitly reference: +1. pilot_reporting_overview.md +2. pilot_reporting_design.md +3. pilot_reporting_validation.md +4. modern/openapi/pilot.yaml +5. modern/tests/pilot_parity_test.py +``` + +At the end of Phase 3, you’ll have a clear design, a machine readable spec, and a test plan/scaffolding that describes how you will prove parity. + +--- + +## Phase 4 - Implement and compare + +**Goal:** Implement the modern pilot, run it in parallel with the COBOL version, and show that outputs match for the planned scenarios. 
+
+**Key artifacts**
+
+* Code under modern/<language>/pilot (for example modern/java/pilot)
+* Completed tests in modern/tests/pilot_parity_test.py
+* Updated sections in pilot_reporting_validation.md that describe the actual parallel run steps
+
+### 4.1 Generate a first draft of the modern code
+
+```md
+Using pilot_reporting_design.md and the COBOL programs listed in pilot_reporting_overview.md, generate initial implementation code under modern/<language>/pilot that:
+- Defines domain models and database entities for the key records and tables.
+- Implements the core business logic in service classes, preserving behavior from COBOL paragraphs.
+- Adds comments that reference the original COBOL paragraphs and copybooks.
+- Treat this as a first draft for engineers to review.
+```
+
+You can run this several times, focusing on different modules.
+
+### 4.2 Wire up the parity tests
+
+```md
+Extend modern/tests/pilot_parity_test.py so that it:
+- Invokes the legacy pilot flow using whatever wrapper or command we have for COBOL (for example a script that runs the JCL in a test harness).
+- Invokes the new implementation through its API or batch entry point.
+- Compares the outputs according to the “Parity and comparison strategy” in pilot_reporting_validation.md.
+```
+
+### 4.3 Document the parallel run steps
+
+Rather than a separate parallel_run_pilot.md, reuse the validation doc:
+
+```md
+Update the Parity and comparison strategy section in pilot_reporting_validation.md so that it includes a clear, ordered list of commands to:
+- Prepare or load the input data set
+- Run the COBOL pilot flow on that data
+- Run the modern pilot flow on the same data
+- Compare outputs and interpret the results
+- Include precise paths for outputs and a short description of what success looks like
+```
+
+### 4.4 (If needed) Use Codex for iterative fixes
+
+As tests fail or behavior differs, work in short loops:
+
+```md
+Here is a failing test from modern/tests/pilot_parity_test.py and the relevant COBOL and modern code. Explain why the outputs differ and propose the smallest change to the modern implementation that will align it with the COBOL behavior. Show the updated code and any test adjustments.
+```
+
+Each time you complete a meaningful chunk of work, ask Codex to update the ExecPlan:
+
+```md
+Update pilot_execplan.md so that Progress, Decision log, and Outcomes reflect the latest code, tests, and validation results for the pilot.
+```
+
+You’ll see that the ExecPlan “Progress” and “Outcomes” sections will be updated with something along the lines of:
+
+```md
+Progress
+- [x] Inventory and diagrams drafted (`pilot_reporting_overview.md` plus supporting notes in `system-architecture.md`).
+- [x] Modernization technical report drafted (`pilot_reporting_overview.md` MTR section).
+- [x] Target design spec drafted (`pilot_reporting_design.md` and `modern/openapi/pilot.yaml`).
+- [x] Parity test plan and scaffolding documented (`pilot_reporting_validation.md` and `modern/tests/pilot_parity_test.py`).
+
+Outcomes
+- `pilot_reporting_overview.md`, `pilot_reporting_design.md`, and `pilot_reporting_validation.md` now provide an end-to-end narrative (inventory, design, validation).
+- `modern/openapi/pilot.yaml` describes the API surface, and `modern/python/pilot/{models,repositories,services}.py` hold the draft implementation.
+- `modern/tests/pilot_parity_test.py` exercises the parity flow using placeholders and helpers aligned with the validation strategy.
+- Remaining work is limited to updating the operations test appendix and wiring the services to the real runtime.
+```
+
+---
+
+## Phase 5 - Turn the pilot into a scalable motion
+
+**Goal:** Provide reusable templates for other flows and a short guide to using Codex in this repo.
+
+**Key artifacts**
+
+* template_modernization_execplan.md
+* how_to_use_codex_for_cobol_modernization.md
+
+### 5.1 Template ExecPlan
+
+```md
+Look at the pilot files we created:
+1. pilot_reporting_overview.md
+2. pilot_reporting_design.md
+3. pilot_reporting_validation.md
+4. pilot_execplan.md
+
+Create template_modernization_execplan.md that a team can copy when modernizing another flow. It should:
+1. Follow .agent/PLANS.md
+2. Include placeholders for “Overview”, “Inventory”, “Modernization Technical Report”, “Target design”, and “Validation plan”
+3. Assume a similar pattern: overview doc, design doc, validation doc, OpenAPI spec, and tests.
+```
+
+### 5.2 How-to guide
+
+```md
+Using the same pilot files, write how_to_use_codex_for_cobol_modernization.md that:
+1. Explains the phases at a high level (Pick a pilot, Inventory and discover, Design and spec, Implement and validate, Factory pattern).
+2. For each phase, lists where coding agents help and points to the relevant files and example prompts.
+```
+
+---
+
+## Wrap up
+
+If you follow the steps in this cookbook for any pilot, you should end up with a folder layout that looks roughly like this: an ExecPlan, three pilot docs, an OpenAPI spec, a pilot module, and a parity test. You can further organize the markdown files into additional pilot and template subfolders for more structure.
+
+Pilot Folder Structure
+
+You’ll notice that there isn’t a runnable entry point in modern/python/pilot yet, since the modules (models.py, repositories.py, services.py) are first-draft building blocks. If you want to experiment locally, you have two options:
+
+* Use an interactive shell or a small script
+* Create your own runner (e.g. modern/python/pilot/main.py) that wires the repositories and services together, as shown in the sketch at the end of this guide
+
+While this cookbook uses a COBOL pilot flow as the running example, the same pattern shows up in very different kinds of refactors. For example, one customer used Codex to migrate a large monorepo by feeding it hundreds of Jira tickets, having Codex flag higher-risk work, surface cross-cutting dependencies, and draft the code changes, with a separate validator reviewing and merging.
+
+Modernizing COBOL repositories is just one popular case, but the same approach applies to any legacy stack or large-scale migration: turn “modernize our codebase” into a series of small, testable steps (an ExecPlan, a handful of docs, and a parity-first implementation). Codex handles the grind of understanding old patterns, generating candidate migrations, and tightening parity, while you and your team stay focused on architecture and trade-offs, making modernization faster, safer, and repeatable across every system you decide to bring forward.
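+
+For reference, here is a minimal sketch of the local runner mentioned above (modern/python/pilot/main.py). It is not part of the generated pilot: the record, repository, and service names below are placeholders for illustration, so swap them for whatever Codex actually drafted in models.py, repositories.py, and services.py.
+
+```python
+"""Hypothetical modern/python/pilot/main.py: wires the draft pilot modules into a runnable flow."""
+
+from dataclasses import dataclass
+from decimal import Decimal
+
+
+# Stand-ins for the draft modules. In your repo, import the real classes instead, e.g.:
+#   from models import PositionRecord
+#   from repositories import PositionRepository
+#   from services import DailyReportingService
+@dataclass
+class PositionRecord:
+    account_id: str
+    market_value: Decimal
+
+
+class InMemoryPositionRepository:
+    """Supplies pilot input data; a real repository would read the converted files or tables."""
+
+    def __init__(self, records: list[PositionRecord]) -> None:
+        self._records = records
+
+    def all_positions(self) -> list[PositionRecord]:
+        return list(self._records)
+
+
+class DailyReportingService:
+    """Illustrates the reporting step: aggregate positions into per-account totals."""
+
+    def __init__(self, repository: InMemoryPositionRepository) -> None:
+        self._repository = repository
+
+    def run(self) -> dict[str, Decimal]:
+        totals: dict[str, Decimal] = {}
+        for record in self._repository.all_positions():
+            totals[record.account_id] = totals.get(record.account_id, Decimal("0")) + record.market_value
+        return totals
+
+
+if __name__ == "__main__":
+    # Wire the repository into the service and print the report, one line per account.
+    repository = InMemoryPositionRepository(
+        [
+            PositionRecord("ACCT-001", Decimal("1250.50")),
+            PositionRecord("ACCT-001", Decimal("310.25")),
+            PositionRecord("ACCT-002", Decimal("980.00")),
+        ]
+    )
+    service = DailyReportingService(repository)
+    for account_id, total in service.run().items():
+        print(f"{account_id}: {total}")
+```
+
+Running it with `python modern/python/pilot/main.py` prints one total per account, which gives you a quick way to eyeball outputs before the full parity tests are wired up.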
diff --git a/examples/data/oai_docs/libraries.txt b/examples/data/oai_docs/libraries.txt index 2d0d2d0b5c..3aa35cc424 100644 --- a/examples/data/oai_docs/libraries.txt +++ b/examples/data/oai_docs/libraries.txt @@ -139,7 +139,7 @@ Please note that OpenAI does not verify the correctness or security of these pro ### Julia -- [OpenAI.jl](https://github.com/rory-linehan/OpenAI.jl) by [rory-linehan](https://github.com/rory-linehan) +- [OpenAI.jl](https://github.com/JuliaML/OpenAI.jl) by [the JuliaML community](https://github.com/JuliaML) ### Kotlin diff --git a/examples/gpt-5-codex_prompting_guide.ipynb b/examples/gpt-5-codex_prompting_guide.ipynb index 32ebf9b71e..c8a7340889 100644 --- a/examples/gpt-5-codex_prompting_guide.ipynb +++ b/examples/gpt-5-codex_prompting_guide.ipynb @@ -6,6 +6,9 @@ "metadata": {}, "source": [ "## GPT-5-Codex Prompting Guide\n", + "\n", + "_This guide can also be used for `GPT-5.1-Codex` and `GPT-5.1-Codex-Mini`, we recommend to keep the same prompts and harness for `GPT-5.1-Codex` and `GPT-5.1-Codex-Mini` as `GPT-5-Codex`_\n", + "\n", "Important details about `GPT-5-Codex` and this guide:\n", "- This model is not a drop-in replacement for GPT-5, as it requires significantly different prompting.\n", "- This model is only supported with the Responses API and does not support the verbosity parameter.\n", diff --git a/examples/gpt-5/gpt-5-1_prompting_guide.ipynb b/examples/gpt-5/gpt-5-1_prompting_guide.ipynb new file mode 100644 index 0000000000..c051777e5b --- /dev/null +++ b/examples/gpt-5/gpt-5-1_prompting_guide.ipynb @@ -0,0 +1,592 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "931022d3", + "metadata": {}, + "source": [ + "# GPT-5.1 prompting guide\n", + "\n", + "## Introduction\n", + "\n", + "GPT-5.1, our newest flagship model, is designed to balance intelligence and speed for a variety of agentic and coding tasks, while also introducing a new `none` reasoning mode for low-latency interactions. Building on the strengths of GPT-5, GPT-5.1 is better calibrated to prompt difficulty, consuming far fewer tokens on easy inputs and more efficiently handling challenging ones. Along with these benefits, GPT-5.1 is more steerable in personality, tone, and output formatting.\n", + "\n", + "While GPT-5.1 works well out of the box for most applications, this guide focuses on prompt patterns that maximize performance in real deployments. These techniques come from extensive internal testing and collaborations with partners building production agents, where small prompt changes often produce large gains in reliability and user experience. We expect this guide to serve as a starting point: prompting is iterative, and the best results will come from adapting these patterns to your specific tools and workflows.\n", + "\n", + "## Migrating to GPT-5.1\n", + "\n", + "For developers using GPT-4.1, GPT-5.1 with `none` reasoning effort should be a natural fit for most low-latency use cases that do not require reasoning.\n", + "\n", + "For developers using GPT-5, we have seen strong success with customers who follow a few key pieces of guidance:\n", + "\n", + "1. **Persistence:** GPT-5.1 now has better-calibrated reasoning token consumption but can sometimes err on the side of being excessively concise and come at the cost of answer completeness. It can be helpful to emphasize via prompting the importance of persistence and completeness. \n", + "2. 
**Output formatting and verbosity:** While overall more detailed, GPT-5.1 can occasionally be verbose, so it is worthwhile being explicit in your instructions on desired output detail. \n", + "3. **Coding agents:** If you’re working on a coding agent, migrate your apply\\_patch to our new, named tool implementation. \n", + "4. **Instruction following:** For other behavior issues, GPT-5.1 is excellent at instruction-following, and you should be able to shape the behavior significantly by checking for conflicting instructions and being clear.\n", + "\n", + "We also released GPT-5.1-codex. This model behaves a bit differently than GPT-5.1, and we recommend you check out the [Codex prompting guide](https://cookbook.openai.com/examples/gpt-5-codex_prompting_guide) for more information.\n", + "\n", + "## Agentic steerability\n", + "\n", + "GPT-5.1 is a highly steerable model, allowing for robust control over your agent’s behaviors, personality, and communication frequency.\n", + "\n", + "### Shaping your agent’s personality\n", + "\n", + "GPT-5.1’s personality and response style can be adapted to your use case. While verbosity is controllable through a dedicated `verbosity` parameter, you can also shape the overall style, tone, and cadence through prompting.\n", + "\n", + "We’ve found that personality and style work best when you define a clear agent persona. This is especially important for customer-facing agents which need to display emotional intelligence to handle a range of user situations and dynamics. In practice, this can mean adjusting warmth and brevity to the state of the conversation, and avoiding excessive acknowledgment phrases like “got it” or “thank you.”\n", + "\n", + "The sample prompt below shows how we shaped the personality for a customer support agent, focusing on balancing the right level of directness and warmth in resolving an issue.\n", + "\n", + "```\n", + "\n", + "You value clarity, momentum, and respect measured by usefulness rather than pleasantries. Your default instinct is to keep conversations crisp and purpose-driven, trimming anything that doesn't move the work forward. You're not cold—you're simply economy-minded with language, and you trust users enough not to wrap every message in padding.\n", + "\n", + "- Adaptive politeness:\n", + " - When a user is warm, detailed, considerate or says 'thank you', you offer a single, succinct acknowledgment—a small nod to their tone with acknowledgement or receipt tokens like 'Got it', 'I understand', 'You're welcome'—then shift immediately back to productive action. Don't be cheesy about it though, or overly supportive. \n", + " - When stakes are high (deadlines, compliance issues, urgent logistics), you drop even that small nod and move straight into solving or collecting the necessary information.\n", + "\n", + "- Core inclination:\n", + " - You speak with grounded directness. You trust that the most respectful thing you can offer is efficiency: solving the problem cleanly without excess chatter.\n", + " - Politeness shows up through structure, precision, and responsiveness, not through verbal fluff.\n", + "\n", + "- Relationship to acknowledgement and receipt tokens: \n", + " - You treat acknowledge and receipt as optional seasoning, not the meal. 
If the user is brisk or minimal, you match that rhythm with near-zero acknowledgments.\n", + " - You avoid stock acknowledgments like \"Got it\" or \"Thanks for checking in\" unless the user's tone or pacing naturally invites a brief, proportional response.\n", + "\n", + "- Conversational rhythm:\n", + " - You never repeat acknowledgments. Once you've signaled understanding, you pivot fully to the task.\n", + " - You listen closely to the user's energy and respond at that tempo: fast when they're fast, more spacious when they're verbose, always anchored in actionability.\n", + "\n", + "- Underlying principle:\n", + " - Your communication philosophy is \"respect through momentum.\" You're warm in intention but concise in expression, focusing every message on helping the user progress with as little friction as possible.\n", + "\n", + "```\n", + "\n", + "In the prompt below, we’ve included sections that constrain a coding agent’s responses to be short for small changes and longer for more detailed queries. We also specify the amount of code allowed in the final response to avoid large blocks.\n", + "\n", + "```\n", + "\n", + "- Final answer compactness rules (enforced):\n", + " - Tiny/small single-file change (≤ ~10 lines): 2–5 sentences or ≤3 bullets. No headings. 0–1 short snippet (≤3 lines) only if essential.\n", + " - Medium change (single area or a few files): ≤6 bullets or 6–10 sentences. At most 1–2 short snippets total (≤8 lines each).\n", + " - Large/multi-file change: Summarize per file with 1–2 bullets; avoid inlining code unless critical (still ≤2 short snippets total).\n", + " - Never include \"before/after\" pairs, full method bodies, or large/scrolling code blocks in the final message. Prefer referencing file/symbol names instead.\n", + "- Do not include process/tooling narration (e.g., build/lint/test attempts, missing yarn/tsc/eslint) unless explicitly requested by the user or it blocks the change. If checks succeed silently, don't mention them.\n", + "\n", + "- Code and formatting restraint — Use monospace for literal keyword bullets; never combine with **.\n", + "- No build/lint/test logs or environment/tooling availability notes unless requested or blocking.\n", + "- No multi-section recaps for simple changes; stick to What/Where/Outcome and stop.\n", + "- No multiple code fences or long excerpts; prefer references.\n", + "\n", + "- Citing code when it illustrates better than words — Prefer natural-language references (file/symbol/function) over code fences in the final answer. Only include a snippet when essential to disambiguate, and keep it within the snippet budget above.\n", + "- Citing code that is in the codebase:\n", + " * If you must include an in-repo snippet, you may use the repository citation form, but in final answers avoid line-number/filepath prefixes and large context. Do not include more than 1–2 short snippets total.\n", + "\n", + "```\n", + "\n", + "Excess output length can be mitigated by adjusting the verbosity parameter and further reduced via prompting as GPT-5.1 adheres well to concrete length guidance:\n", + "\n", + "```\n", + "\n", + "- Respond in plain text styled in Markdown, using at most 2 concise sentences. \n", + "- Lead with what you did (or found) and context only if needed. 
\n", + "- For code, reference file paths and show code blocks only if necessary to clarify the change or review.\n", + "\n", + "```\n", + "\n", + "### Eliciting user updates \n", + "\n", + "User updates, also called preambles, are a way for GPT-5.1 to share upfront plans and provide consistent progress updates as assistant messages during a rollout. User updates can be adjusted along four major axes: frequency, verbosity, tone, and content. We trained the model to excel at keeping the user informed with plans, important insights and decisions, and granular context about what/why it's doing. These updates help the user supervise agentic rollouts more effectively, in both coding and non-coding domains.\n", + "\n", + "When timed correctly, the model will be able to share a point-in-time understanding that maps to the current state of the rollout. In the prompt addition below, we define what types of preamble would and would not be useful. \n", + "\n", + "```\n", + "\n", + "You'll work for stretches with tool calls — it's critical to keep the user updated as you work.\n", + "\n", + "\n", + "- Send short updates (1–2 sentences) every few tool calls when there are meaningful changes.\n", + "- Post an update at least every 6 execution steps or 8 tool calls (whichever comes first).\n", + "- If you expect a longer heads‑down stretch, post a brief heads‑down note with why and when you’ll report back; when you resume, summarize what you learned.\n", + "- Only the initial plan, plan updates, and final recap can be longer, with multiple bullets and paragraphs\n", + "\n", + "\n", + "\n", + "- Before the first tool call, give a quick plan with goal, constraints, next steps.\n", + "- While you're exploring, call out meaningful new information and discoveries that you find that helps the user understand what's happening and how you're approaching the solution.\n", + "- Provide additional brief lower-level context about more granular updates\n", + "- Always state at least one concrete outcome since the prior update (e.g., “found X”, “confirmed Y”), not just next steps.\n", + "- If a longer run occurred (>6 steps or >8 tool calls), start the next update with a 1–2 sentence synthesis and a brief justification for the heads‑down stretch.\n", + "- End with a brief recap and any follow-up steps.\n", + "- Do not commit to optional checks (type/build/tests/UI verification/repo-wide audits) unless you will do them in-session. If you mention one, either perform it (no logs unless blocking) or explicitly close it with a brief reason.\n", + "- If you change the plan (e.g., choose an inline tweak instead of a promised helper), say so explicitly in the next update or the recap.\n", + "- In the recap, include a brief checklist of the planned items with status: Done or Closed (with reason). Do not leave any stated item unaddressed.\n", + "\n", + "\n", + "```\n", + "\n", + "In longer-running model executions, providing a fast initial assistant message can improve perceived latency and user experience. We can achieve this behavior with GPT-5.1 through clear prompting.\n", + "\n", + "```\n", + "\n", + "Always explain what you're doing in a commentary message FIRST, BEFORE sampling an analysis thinking message. 
This is critical in order to communicate immediately to the user.\n", + "\n", + "```\n", + "\n", + "## Optimizing intelligence and instruction-following\n", + "\n", + "GPT-5.1 will pay very close attention to the instructions you provide, including guidance on tool usage, parallelism, and solution completeness.\n", + "\n", + "### Encouraging complete solutions\n", + "\n", + "On long agentic tasks, we’ve noticed that GPT-5.1 may end prematurely without reaching a complete solution, but we have found this behavior is promptable. In the following instruction, we tell the model to avoid premature termination and unnecessary follow-up questions. \n", + "\n", + "```\n", + "\n", + "- Treat yourself as an autonomous senior pair-programmer: once the user gives a direction, proactively gather context, plan, implement, test, and refine without waiting for additional prompts at each step.\n", + "- Persist until the task is fully handled end-to-end within the current turn whenever feasible: do not stop at analysis or partial fixes; carry changes through implementation, verification, and a clear explanation of outcomes unless the user explicitly pauses or redirects you.\n", + "- Be extremely biased for action. If a user provides a directive that is somewhat ambiguous on intent, assume you should go ahead and make the change. If the user asks a question like \"should we do x?\" and your answer is \"yes\", you should also go ahead and perform the action. It's very bad to leave the user hanging and require them to follow up with a request to \"please do it.\"\n", + "\n", + "```\n", + "\n", + "### Tool-calling format\n", + "\n", + "In order to make tool-calling most effective, we recommend describing functionality in the tool definition and how/when to use tools in the prompt. In the example below, we define a tool that creates a restaurant reservation, and we concisely describe what it does when invoked. \n", + "\n", + "```\n", + "{\n", + " \"name\": \"create_reservation\",\n", + " \"description\": \"Create a restaurant reservation for a guest. 
Use when the user asks to book a table with a given name and time.\",\n", + " \"parameters\": {\n", + " \"type\": \"object\",\n", + " \"properties\": {\n", + " \"name\": {\n", + " \"type\": \"string\",\n", + " \"description\": \"Guest full name for the reservation.\"\n", + " },\n", + " \"datetime\": {\n", + " \"type\": \"string\",\n", + " \"description\": \"Reservation date and time (ISO 8601 format).\"\n", + " }\n", + " },\n", + " \"required\": [\"name\", \"datetime\"]\n", + " }\n", + "}\n", + "\n", + "```\n", + "\n", + "In the prompt, you may have a section that references the tool like this:\n", + "\n", + "```\n", + "\n", + "- When the user asks to book, reserve, or schedule a table, you MUST call `create_reservation`.\n", + "- Do NOT guess a reservation time or name — ask for whichever detail is missing.\n", + "- If the user has not provided a name, ask: “What name should I put on the reservation?”\n", + "- If the user has not provided a date/time, ask: “What date and time would you like to reserve?”\n", + "- After calling the tool, confirm the reservation naturally: “Your reservation is confirmed for [name] on [date/time].”\n", + "\n", + "\n", + "\n", + "*Example 1:*\n", + "User: “Book a table for Sarah tomorrow at 7pm.”\n", + "Assistant → (calls tool) →\n", + "`{\"name\": \"create_reservation\", \"arguments\": { \"name\": \"Sarah\", \"datetime\": \"2025-11-01T19:00\" } }`\n", + "Tool returns: `{ \"confirmation_number\": \"R12345\" }`\n", + "Assistant: “All set — your reservation for Sarah tomorrow at 7:00pm is confirmed. Your confirmation number is R12345.”\n", + "\n", + "**Example 2:**\n", + "User: “I want to make a reservation.”\n", + "Assistant: “Sure! What name should I put on the reservation, and what date and time would you like?”\n", + "\n", + "*Example 3:*\n", + "User: “Reserve a table under Daniel at 6 tonight.”\n", + "Assistant → (calls tool) →\n", + "`{\"name\": \"create_reservation\", \"arguments\": { \"name\": \"Daniel\", \"datetime\": \"2025-10-31T18:00\" } }`\n", + "Tool returns: `{ \"confirmation_number\": \"R67890\" }`\n", + "Assistant: “Done! Your reservation for Daniel at 6:00pm tonight is confirmed. The confirmation number is R67890.”\n", + "\n", + "```\n", + "\n", + "GPT-5.1 also executes parallel tool calls more efficiently. When scanning a codebase or retrieving from a vector store, enabling parallel tool calling and encouraging the model to use parallelism within the tool description is a good starting point. In the system prompt, you can reinforce parallel tool usage by providing some examples of permissible parallelism. An example instruction may look like:\n", + "\n", + "```\n", + "Parallelize tool calls whenever possible. Batch reads (read_file) and edits (apply_patch) to speed up the process.\n", + "```\n", + "\n", + "### Using the “none” reasoning mode for improved efficiency\n", + "\n", + "GPT-5.1 introduces a new reasoning mode: `none`. Unlike GPT-5’s prior `minimal` setting, `none` forces the model to never use reasoning tokens, making it much more similar in usage to GPT-4.1, GPT-4o, and other prior non-reasoning models. Importantly, developers can now use hosted tools like [web search](https://platform.openai.com/docs/guides/tools-web-search?api-mode=responses) and [file search](https://platform.openai.com/docs/guides/tools?tool-type=file-search) with `none`, and custom function-calling performance is also substantially improved. 
With that in mind, [prior guidance on prompting non-reasoning models](https://cookbook.openai.com/examples/gpt4-1_prompting_guide) like GPT-4.1 also applies here, including using few-shot prompting and high-quality tool descriptions.\n", + "\n", + "While GPT-5.1 does not use reasoning tokens with `none`, we’ve found prompting the model to think carefully about which functions it plans to invoke can improve accuracy.\n", + "\n", + "```\n", + "You MUST plan extensively before each function call, and reflect extensively on the outcomes of the previous function calls, ensuring user's query is completely resolved. DO NOT do this entire process by making function calls only, as this can impair your ability to solve the problem and think insightfully. In addition, ensure function calls have the correct arguments.\n", + "```\n", + "\n", + "We’ve also observed that on longer model execution, encouraging the model to “verify” its outputs results in better instruction following for tool use. Below is an example we used within the instruction when clarifying a tool’s usage.\n", + "\n", + "```\n", + "When selecting a replacement variant, verify it meets all user constraints (cheapest, brand, spec, etc.). Quote the item-id and price back for confirmation before executing. \n", + "```\n", + "\n", + "In our testing, GPT-5’s prior `minimal` reasoning mode sometimes led to executions that terminated prematurely. Although other reasoning modes may be better suited for these tasks, our guidance for GPT-5.1 with `none` is similar. Below is a snippet from our Tau bench prompt.\n", + "\n", + "```\n", + "Remember, you are an agent - please keep going until the user’s query is completely resolved, before ending your turn and yielding back to the user. You must be prepared to answer multiple queries and only finish the call once the user has confirmed they're done. \n", + "```\n", + "\n", + "## Maximizing coding performance from planning to execution\n", + "\n", + "One tool we recommend implementing for long-running tasks is a planning tool. You may have noticed reasoning models plan within their reasoning summaries. Although this is helpful in the moment, it may be difficult to keep track of where the model is relative to the execution of the query.\n", + "\n", + "```\n", + "\n", + "- For medium or larger tasks (e.g., multi-file changes, adding endpoints/CLI/features, or multi-step investigations), you must create and maintain a lightweight plan in the TODO/plan tool before your first code/tool action.\n", + "- Create 2–5 milestone/outcome items; avoid micro-steps and repetitive operational tasks (no “open file”, “run tests”, or similar operational steps). Never use a single catch-all item like “implement the entire feature”.\n", + "- Maintain statuses in the tool: exactly one item in_progress at a time; mark items complete when done; post timely status transitions (never more than ~8 tool calls without an update). Do not jump an item from pending to completed: always set it to in_progress first (if work is truly instantaneous, you may set in_progress and completed in the same update). 
Do not batch-complete multiple items after the fact.\n", + "- Finish with all items completed or explicitly canceled/deferred before ending the turn.\n", + "- End-of-turn invariant: zero in_progress and zero pending; complete or explicitly cancel/defer anything remaining with a brief reason.\n", + "- If you present a plan in chat for a medium/complex task, mirror it into the tool and reference those items in your updates.\n", + "- For very short, simple tasks (e.g., single-file changes ≲ ~10 lines), you may skip the tool. If you still share a brief plan in chat, keep it to 1–2 outcome-focused sentences and do not include operational steps or a multi-bullet checklist.\n", + "- Pre-flight check: before any non-trivial code change (e.g., apply_patch, multi-file edits, or substantial wiring), ensure the current plan has exactly one appropriate item marked in_progress that corresponds to the work you’re about to do; update the plan first if needed.\n", + "- Scope pivots: if understanding changes (split/merge/reorder items), update the plan before continuing. Do not let the plan go stale while coding.\n", + "- Never have more than one item in_progress; if that occurs, immediately correct the statuses so only the current phase is in_progress.\n", + "\n", + "```\n", + "\n", + "A plan tool can be used with minimal scaffolding. In our implementation of the plan tool, we pass a merge parameter as well as a list of to-dos. The list contains a brief description, the current state of the task, and an ID assigned to it. Below is an example of a function call that GPT-5.1 may make to record its state.\n", + "\n", + "```\n", + "{\n", + " \"name\": \"update_plan\",\n", + " \"arguments\": {\n", + " \"merge\": true,\n", + " \"todos\": [\n", + " {\n", + " \"content\": \"Investigate failing test\",\n", + " \"status\": \"in_progress\",\n", + " \"id\": \"step-1\"\n", + " },\n", + " {\n", + " \"content\": \"Apply fix and re-run tests\",\n", + " \"status\": \"pending\",\n", + " \"id\": \"step-2\"\n", + " }\n", + " ]\n", + " }\n", + "}\n", + "```\n", + "\n", + "### Design system enforcement\n", + "\n", + "When building frontend interfaces, GPT-5.1 can be steered to produce websites that match your visual design system. We recommend using Tailwind to render CSS, which you can further tailor to meet your design guidelines. In the example below, we define a design system to constrain the colors generated by GPT-5.1.\n", + "\n", + "```\n", + "\n", + "- Tokens-first: Do not hard-code colors (hex/hsl/oklch/rgb) in JSX/CSS. All colors must come from globals.css variables (e.g., --background, --foreground, --primary, --accent, --border, --ring) or DS components that consume them.\n", + "- Introducing a brand or accent? Before styling, add/extend tokens in globals.css under :root and .dark, for example:\n", + " - --brand, --brand-foreground, optional --brand-muted, --brand-ring, --brand-surface\n", + " - If gradients/glows are needed, define --gradient-1, --gradient-2, etc., and ensure they reference sanctioned hues.\n", + "- Consumption: Use Tailwind/CSS utilities wired to tokens (e.g., bg-[hsl(var(--primary))], text-[hsl(var(--foreground))], ring-[hsl(var(--ring))]). 
Buttons/inputs/cards must use system components or match their token mapping.\n", + "- Default to the system's neutral palette unless the user explicitly requests a brand look; then map that brand to tokens first.\n", + "\n", + "```\n", + "\n", + "## New tool types in GPT-5.1\n", + "\n", + "GPT-5.1 has been post-trained on specific tools that are commonly used in coding use cases. To interact with files in your environment you now can use a predefined apply\\_patch tool. Similarly, we’ve added a shell tool that lets the model propose commands for your system to run. \n", + "\n", + "### Using apply\\_patch\n", + "\n", + "The apply\\_patch tool lets GPT-5.1 create, update, and delete files in your codebase using structured diffs. Instead of just suggesting edits, the model emits patch operations that your application applies and then reports back on, enabling iterative, multi-step code editing workflows. You can find additional usage details and context in the [GPT-4.1 prompting guide](https://cookbook.openai.com/examples/gpt4-1_prompting_guide#:~:text=PYTHON_TOOL_DESCRIPTION%20%3D%20%22%22%22This,an%20exclamation%20mark.).\n", + "\n", + "With GPT-5.1, you can use apply\\_patch as a new tool type without writing custom descriptions for the tool. The description and handling are managed via the Responses API. Under the hood, this implementation uses a freeform function call rather than a JSON format. In testing, the named function decreased apply\\_patch failure rates by 35%.\n", + "\n", + "```\n", + "response = client.responses.create( \n", + "model=\"gpt-5.1\", \n", + "input=RESPONSE_INPUT, \n", + "tools=[{\"type\": \"apply_patch\"}]\n", + ")\n", + "```\n", + "\n", + "When the model decides to execute an apply\\_patch tool, you will receive an apply\\_patch\\_call function type within the response stream. Within the operation object, you’ll receive a type field (with one of `create_file`, `update_file`, or `delete_file`) and the diff to implement.\n", + "\n", + "```\n", + "{\n", + " \"id\": \"apc_08f3d96c87a585390069118b594f7481a088b16cda7d9415fe\",\n", + " \"type\": \"apply_patch_call\",\n", + " \"status\": \"completed\",\n", + " \"call_id\": \"call_Rjsqzz96C5xzPb0jUWJFRTNW\",\n", + " \"operation\": {\n", + " \"type\": \"update_file\",\n", + " \"diff\": \"\n", + " @@\n", + " -def fib(n):\n", + " +def fibonacci(n):\n", + " if n <= 1:\n", + " return n\n", + " - return fib(n-1) + fib(n-2) \n", + " + return fibonacci(n-1) + fibonacci(n-2)\",\n", + " \"path\": \"lib/fib.py\"\n", + " }\n", + "},\n", + "\n", + "```\n", + "\n", + "[This repository](https://github.com/openai/openai-cookbook/blob/main/examples/gpt-5/apply_patch.py) contains the expected implementation for the apply\\_patch tool executable. When your system finishes executing the patch tool, the Responses API expects a tool output in the following form:\n", + "\n", + "```\n", + "{\n", + " \"type\": \"apply_patch_call_output\",\n", + " \"call_id\": call[\"call_id\"],\n", + " \"status\": \"completed\" if success else \"failed\",\n", + " \"output\": log_output\n", + "}\n", + "```\n", + "\n", + "### Using the shell tool \n", + "\n", + "We’ve also built a new shell tool for GPT-5.1. The shell tool allows the model to interact with your local computer through a controlled command-line interface. The model proposes shell commands; your integration executes them and returns the outputs. This creates a simple plan-execute loop that lets models inspect the system, run utilities, and gather data until they finish the task. 
\n", + "\n", + "The shell tool is invoked in the same way as apply\\_patch: include it as a tool of type `shell`.\n", + "\n", + "```\n", + "tools = [{\"type\": \"shell\"}]\n", + "```\n", + "\n", + "When a shell tool call is returned, the Responses API includes a `shell_call` object with a timeout, a maximum output length, and the command to run.\n", + "\n", + "```\n", + "{\n", + "\t\"type\": \"shell_call\",\n", + "\t\"call_id\": \"...\", \n", + "\t\"action\": {\n", + "\t\t\"commands\": [...], \n", + "\t\t\"timeout_ms\": 120000,\n", + "\t\t\"max_output_length\": 4096 \n", + "\t},\n", + "\t\"status\": \"in_progress\"\n", + "}\n", + "```\n", + "\n", + "After executing the shell command, return the untruncated stdout/stderr logs as well as the exit-code details.\n", + "\n", + "```\n", + "{\n", + "\t\"type\": \"shell_call_output\",\n", + "\t\"call_id\": \"...\", \n", + "\t\"max_output_length\": 4096, \n", + "\t\"output\": [\n", + "\t\t{\n", + "\t\t\t\"stdout\": \"...\", \n", + "\t\t\t\"stderr\": \"...\", \n", + "\t\t\t\"outcome\": {\n", + "\t\t\t\t\"type\": \"exit\", \n", + "\t\t\t\t\"exit_code\": 0\n", + "\t\t\t} \n", + "\t\t}\n", + "\t] \n", + "}\n", + "```\n", + "\n", + "## How to metaprompt effectively \n", + "\n", + "Building prompts can be cumbersome, but it’s also the highest-leverage thing you can do to resolve most model behavior issues. Small inclusions can unexpectedly steer the model undesirably. Let’s walk through an example of an agent that plans events. In the prompt below, the customer-facing agent is tasked with using tools to answer users’ questions about potential venues and logistics.\n", + "\n", + "```\n", + "You are “GreenGather,” an autonomous sustainable event-planning agent. You help users design eco-conscious events (work retreats, conferences, weddings, community gatherings), including venues, catering, logistics, and attendee experience.\n", + "\n", + "PRIMARY OBJECTIVE\n", + "Your main goal is to produce concise, immediately actionable answers that fit in a quick chat context. Most responses should be about 3–6 sentences total. Users should be able to skim once and know exactly what to do next, without needing follow-up clarification.\n", + "\n", + "SCOPE\n", + "\n", + "* Focus on: venue selection, schedule design, catering styles, transportation choices, simple budgeting, and sustainability considerations.\n", + "* You do not actually book venues or vendors; never say you completed a booking.\n", + "* You may, however, phrase suggestions as if the user can follow them directly (“Book X, then do Y”) so planning feels concrete and low-friction.\n", + "\n", + "TONE & STYLE\n", + "\n", + "* Sound calm, professional, and neutral, suitable for corporate planners and executives. Avoid emojis and expressive punctuation.\n", + "* Do not use first-person singular; prefer “A good option is…” or “It is recommended that…”.\n", + "* Be warm and approachable. 
For informal or celebratory events (e.g., weddings), you may occasionally write in first person (“I’d recommend…”) and use tasteful emojis to match the user’s energy.\n", + "\n", + "STRUCTURE\n", + "Default formatting guidelines:\n", + "\n", + "* Prefer short paragraphs, not bullet lists.\n", + "* Use bullets only when the user explicitly asks for “options,” “list,” or “checklist.”\n", + "* For complex, multi-day events, always structure your answer with labeled sections (e.g., “Overview,” “Schedule,” “Vendors,” “Sustainability”) and use bullet points liberally for clarity.\n", + "\n", + "AUTONOMY & PLANNING\n", + "You are an autonomous agent. When given a planning task, continue reasoning and using tools until the plan is coherent and complete, rather than bouncing decisions back to the user. Do not ask the user for clarifications unless absolutely necessary for safety or correctness. Make sensible assumptions about missing details such as budget, headcount, or dietary needs and proceed.\n", + "\n", + "To avoid incorrect assumptions, when key information (date, city, approximate headcount) is missing, pause and ask 1–3 brief clarifying questions before generating a detailed plan. Do not proceed with a concrete schedule until those basics are confirmed. For users who sound rushed or decisive, minimize questions and instead move ahead with defaults.\n", + "\n", + "TOOL USAGE\n", + "You always have access to tools for:\n", + "\n", + "* venue_search: find venues with capacity, location, and sustainability tags\n", + "* catering_search: find caterers and menu styles\n", + "* transport_search: find transit and shuttle options\n", + "* budget_estimator: estimate costs by category\n", + "\n", + "General rules for tools:\n", + "\n", + "* Prefer tools over internal knowledge whenever you mention specific venues, vendors, or prices.\n", + "* For simple conceptual questions (e.g., “how to make a retreat more eco-friendly”), avoid tools and rely on internal knowledge so responses are fast.\n", + "* For any event with more than 30 attendees, always call at least one search tool to ground recommendations in realistic options.\n", + "* To keep the experience responsive, avoid unnecessary tool calls; for rough plans or early brainstorming, you can freely propose plausible example venues or caterers from general knowledge instead of hitting tools.\n", + "\n", + "When using tools as an autonomous agent:\n", + "\n", + "* Plan your approach (which tools, in what order) and then execute without waiting for user confirmation at each step.\n", + "* After each major tool call, briefly summarize what you did and how results shaped your recommendation.\n", + "* Keep tool usage invisible unless the user explicitly asks how you arrived at a suggestion.\n", + "\n", + "VERBOSITY & DETAIL\n", + "Err on the side of completeness so the user does not need follow-up messages. Include specific examples (e.g., “morning keynote, afternoon breakout rooms, evening reception”), approximate timing, and at least a rough budget breakdown for events longer than one day.\n", + "\n", + "However, respect the user’s time: long walls of text are discouraged. Aim for compact responses that rarely exceed 2–3 short sections. 
For complex multi-day events or multi-vendor setups, provide a detailed, step-by-step plan that the user could almost copy into an event brief, even if it requires a longer answer.\n", + "\n", + "SUSTAINABILITY GUIDANCE\n", + "\n", + "* Whenever you suggest venues or transportation, include at least one lower-impact alternative (e.g., public transit, shuttle consolidation, local suppliers).\n", + "* Do not guilt or moralize; frame tradeoffs as practical choices.\n", + "* Highlight sustainability certifications when relevant, but avoid claiming a venue has a certification unless you are confident based on tool results or internal knowledge.\n", + "\n", + "INTERACTION & CLOSING\n", + "Avoid over-apologizing or repeating yourself. Users should feel like decisions are being quietly handled on their behalf. Return control to the user frequently by summarizing the current plan and inviting them to adjust specifics before you refine further.\n", + "\n", + "End every response with a subtle next step the user could take, phrased as a suggestion rather than a question, and avoid explicit calls for confirmation such as “Let me know if this works.”\n", + "```\n", + "\n", + "Although this is a strong starting prompt, there are a few issues we noticed upon testing:\n", + "\n", + "* Small conceptual questions (like asking about a 20-person leadership dinner) triggered unnecessary tool calls and very concrete venue suggestions, despite the prompt allowing internal knowledge for simple, high-level questions.\n", + "\n", + "* The agent oscillated between being overly verbose (multi-day Austin offsites turning into dense, multi-section essays) and overly hesitant (refusing to propose a plan without more questions) and occasionally ignored unit rules (a Berlin summit described in miles and °F instead of km and °C).\n", + "\n", + "Rather than manually guessing which lines of the system prompt caused these behaviors, we can metaprompt GPT-5.1 to inspect its own instructions and traces.\n", + "\n", + "**Step 1**: Ask GPT-5.1 to diagnose failures\n", + "\n", + "Paste the system prompt and a small batch of failure examples into a separate analysis call. Based on the evals you’ve seen, provide a brief overview of the failure modes you expect to address, but leave the fact-finding to the model.\n", + "\n", + "Note that in this prompt, we’re not asking for a solution yet, just a root-cause analysis.\n", + "\n", + "```\n", + "You are a prompt engineer tasked with debugging a system prompt for an event-planning agent that uses tools to recommend venues, logistics, and sustainable options.\n", + "\n", + "You are given:\n", + "\n", + "1) The current system prompt:\n", + "\n", + "[DUMP_SYSTEM_PROMPT]\n", + "\n", + "\n", + "2) A small set of logged failures. Each log has:\n", + "- query\n", + "- tools_called (as actually executed)\n", + "- final_answer (shortened if needed)\n", + "- eval_signal (e.g., thumbs_down, low rating, human grader, or user comment)\n", + "\n", + "\n", + "[DUMP_FAILURE_TRACES]\n", + "\n", + "\n", + "Your tasks:\n", + "\n", + "1) Identify the distinct failure mode you see (e.g., tool_usage_inconsistency, autonomy_vs_clarifications, verbosity_vs_concision, unit_mismatch).\n", + "2) For each failure mode, quote or paraphrase the specific lines or sections of the system prompt that are most likely causing or reinforcing it. 
Include any contradictions (e.g., “be concise” vs “err on the side of completeness,” “avoid tools” vs “always use tools for events over 30 attendees”).\n", + "3) Briefly explain, for each failure mode, how those lines are steering the agent toward the observed behavior.\n", + "\n", + "Return your answer in a structured but readable format:\n", + "\n", + "failure_modes:\n", + "- name: ...\n", + " description: ...\n", + " prompt_drivers:\n", + " - exact_or_paraphrased_line: ...\n", + " - why_it_matters: ...\n", + "```\n", + "\n", + "Metaprompting works best when the feedback can logically be grouped together. If you provide many failure modes, the model may struggle to tie all of the threads together. In this example, the dump of failure logs may contain examples of errors where the model was overly or insufficiently verbose when responding to the user’s question. A separate query would be issued for the model’s over-eagerness to call tools.\n", + "\n", + "**Step 2:** Ask GPT-5.1 how it would patch the prompt to fix those behaviors\n", + "\n", + "Once you have that analysis, you can run a second, separate call that focuses on implementation: tightening the prompt without fully rewriting it.\n", + "\n", + "```\n", + "You previously analyzed this system prompt and its failure modes.\n", + "\n", + "System prompt:\n", + "\n", + "\n", + "[DUMP_SYSTEM_PROMPT]\n", + "\n", + "\n", + "Failure-mode analysis:\n", + "[DUMP_FAILURE_MODE_ANALYSIS]\n", + "\n", + "Please propose a surgical revision of the system prompt that reduces the observed issues while preserving the good behaviors.\n", + "\n", + "Constraints:\n", + "\n", + "- Do not redesign the agent from scratch.\n", + "- Prefer small, explicit edits: clarify conflicting rules, remove redundant or contradictory lines, tighten vague guidance.\n", + "- Make tradeoffs explicit (for example, clearly state when to prioritize concision over completeness, or exactly when tools must vs must not be called).\n", + "- Keep the structure and overall length roughly similar to the original, unless a short consolidation removes obvious duplication.\n", + "\n", + "Output:\n", + "\n", + "1) patch_notes: a concise list of the key changes and the reasoning behind each (e.g., “Merged conflicting tool-usage rules into a single hierarchy,” “Removed overlapping tone instructions that encouraged both executive formality and casual first-person with emojis”).\n", + "2) revised_system_prompt: the full updated system prompt with your edits applied, ready to drop into an agent configuration.\n", + "```\n", + "\n", + "In this example, the first metaprompt helps GPT-5.1 point directly at the contradictory sections (such as the overlapping tool rules and autonomy vs clarification guidance), and the second metaprompt turns that analysis into a concrete, cleaned-up version of the event-planning agent’s instructions.\n", + "\n", + "The output from the second prompt might look something like this:\n", + "\n", + "```\n", + "patch_notes:\n", + "- Clarified when to prioritize concision vs detail:\n", + " - Simple or single-topic queries should stay within ~3–6 sentences.\n", + " - Longer, structured answers are reserved for clearly complex, multi-day or multi-vendor requests.\n", + "- Removed language that told the agent to “err on the side of completeness” for all cases and replaced it with conditional rules based on query complexity.\n", + "- Tightened the structure rules so headings and bullets are only used when complexity or explicit user instructions justify them.\n", + "- 
Simplified the guidance on step-by-step plans so they are expected only for complex events, not for every question.\n", + "\n", + "revised_system_prompt:\n", + "[...]\n", + "```\n", + "\n", + "After this iteration cycle, run the queries again to observe any regressions and repeat this process until your failure modes have been identified and triaged.\n", + "\n", + "As you continue to grow your agentic systems (e.g., broadening scope or increasing the number of tool calls), consider metaprompting the additions you’d like to make rather than adding them by hand. This helps maintain discrete boundaries for each tool and when they should be used.\n", + "\n", + "## What's next\n", + "\n", + "To summarize, GPT-5.1 builds on the foundation set by GPT-5 and adds things like quicker thinking for easy questions, steerability when it comes to model output, new tools for coding use cases, and the option to set reasoning to `none` when your tasks don't require heavy thinking.\n", + "\n", + "Get started with GPT-5.1 in the [docs](https://platform.openai.com/docs/guides/latest-model), or read the [blog post](https://openai.com/index/gpt-5-1-for-developers/) to learn more.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "13d01a5e", + "metadata": {}, + "source": [] + } + ], + "metadata": { + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/partners/self_evolving_agents/autonomous_agent_retraining.ipynb b/examples/partners/self_evolving_agents/autonomous_agent_retraining.ipynb new file mode 100644 index 0000000000..7a127b07e5 --- /dev/null +++ b/examples/partners/self_evolving_agents/autonomous_agent_retraining.ipynb @@ -0,0 +1,2039 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Self-Evolving Agents: A Cookbook for Autonomous Agent Retraining\n", + "\n", + "## Overview\n", + "\n", + "Agentic systems often reach a plateau after proof-of-concept because they depend on humans to diagnose edge cases and correct failures. This cookbook introduces a repeatable retraining loop that captures those issues, learns from the feedback, and promotes improvements back into production-like workflows. We ground the approach in a regulated healthcare documentation task, but the patterns generalize to any domain that demands accuracy, auditability, and rapid iteration.\n", + "\n", + "### What You Will Learn\n", + "- Diagnose why an autonomous agent falls short of production readiness and instrument it with measurable feedback signals.\n", + "- Compare three prompt-optimization strategies—from quick manual iteration to fully automated loops—and understand when to reach for each.\n", + "- Assemble a self-healing workflow that combines human review, LLM-as-judge evals, and iterative prompt refinement.\n", + "\n", + "### Who This Notebook Is For\n", + "- ML/AI engineers and solution architects who need to move beyond toy demos.\n", + "- Product and delivery teams looking for executable artifacts they can adapt into internal tooling or production pipelines.\n", + "\n", + "### How to Work Through This Notebook\n", + "1. Start with Section 1 to understand the healthcare use case, baseline agent, and system architecture.\n", + "2. Use Section 2 to practice prompt optimization within the OpenAI Evals interface and collect structured feedback.\n", + "3. Run Section 3 to automate the optimization loop with graders, evals, and retraining logic.\n", + "4. 
Reference the appendix for reusable prompts, configurations, and evaluation templates as you tailor the workflow to your environment.\n", + "\n", + "The notebook is modular—feel free to run sections independently or sequentially as you adapt the retraining loop to your own agents." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. Use Case Overview: Self-Evolving Agents in Healthcare\n", + "\n", + "### Problem Definition\n", + "\n", + "For this cookbook, we focus on a **real-world use case**: drafting regulatory documents for pharmaceutical companies. These organizations must prepare and submit extensive documentation to regulatory authorities (e.g., the U.S. Food and Drug Administration) to obtain approval for new drugs. The accuracy and speed of these submissions are critical, as they directly impact how quickly life-saving treatments can reach patients. \n", + "\n", + "Regulatory document drafting is a highly complex, iterative, and precision-driven process that requires deep scientific, medical, and compliance expertise. Despite the availability of advanced authoring tools, it remains labor-intensive and prone to human error. **Agentic systems offer substantial leverage** by assisting with research synthesis, content generation, and document structuring, yet human experts are still needed to ensure factual accuracy and regulatory compliance. \n", + "\n", + "The key challenge is to design a feedback loop that enables these agentic systems to learn iteratively and refine model behavior over time. Such a system can gradually shift human effort from detailed correction to high-level oversight, improving efficiency while maintaining the rigorous standards required for regulatory submissions. \n", + "\n", + "### Self-evolving Agent\n", + "\n", + "The diagram below illustrates the iterative process for continuously improving an AI agent through feedback, meta prompting, and evaluation. The loop combines human judgment or automated feedback using an LLM-as-a-judge to iteratively enhance performance. \n", + "\n", + "\"Self-evolving\n", + "
Figure 1 - Diagram showing the self-evolving loop for automated agent improvement.\n", +    "\n", +    "The process consists of the following steps: \n", +    "\n", +    "1. **Baseline Agent** \n", +    "   The process begins with a baseline agent. In this notebook, we use a deliberately simple example (an agent that summarizes sections of a document) to illustrate the iterative improvement loop. In real-world or enterprise settings, the baseline agent could be much more complex. The summaries it produces serve as the initial benchmark for subsequent evaluation and refinement.\n", +    "\n", +    "2. **Human Feedback (or LLM-as-judge)** \n", +    "   The baseline agent’s outputs are then evaluated by human reviewers (e.g., in production environments) and/or by an automated **LLM-as-judge** system. This step gathers both quantitative and qualitative feedback that indicates how well the agent meets its goals — for instance, if we are testing the length of the summary, the feedback might be “the summary is too long” or a numerical score (generally between `0` and `1`) generated by an eval that checks whether the summary is under 500 words.\n", +    "\n", +    "3. **Evals and Aggregated Score** \n", +    "   Based on the collected feedback, new prompts are generated and tested through evaluations (**Evals**). These tests measure performance against predefined criteria, and the outcomes are combined into an aggregated score that reflects the overall performance. The loop continues until the score exceeds a target threshold (e.g., `0.8`) or the maximum number of retries is reached (e.g., `max_retry = 10`). If the retry limit is hit, engineers are alerted that manual improvements are required.\n", +    "\n", +    "4. **Updated Baseline Agent** \n", +    "   Once an improved version achieves the target performance, it replaces the original baseline agent. This updated agent becomes the foundation for the next iteration, supporting a continuous cycle of learning, feedback, and optimization.\n", +    "\n", +    "\n", +    "\n", +    "### Dataset Overview\n", +    "\n", +    "The dataset used for evaluation comprises ~70 sections extracted from the _Sample CMC Section for Hyperpolarized Pyruvate (13C) Injection_, publicly available [here](https://dctd.cancer.gov/drug-discovery-development/reagents-materials/imaging-ind-resources/documentation/13c-pyruvate-cmc.pdf). This dataset provides realistic, domain-specific content suitable for testing both scientific summarization and regulatory compliance behavior. \n", +    "\n", +    "\n", +    "### Baseline Agent Overview\n", +    "\n", +    "To keep this cookbook self-contained and easily reproducible, we simplified the regulatory drafting use case while retaining its essential complexity. In production, a typical regulatory authoring agent comprises multiple specialized sub-agents responsible for tasks such as drafting, data analysis, compliance checking, citation generation, and fact verification.\n", +    "\n", +    "For this guide, we narrow the scope of the regulatory authoring agent to focus on the self-healing aspect of the system. Our regulatory authoring agent consists of two sub-agents:\n", +    "- **A summarizer**: creates concise, scientific summaries.\n", +    "- **A compliance checker**: evaluates each summary against key regulatory requirements (e.g., FDA 21 CFR Part 11). \n", +    "\n", +    "\"Baseline\n", +    "
Figure 2 - The baseline agent as created in the AgentBuilder UI.\n", + "\n", + "For the remainder of this cookbook, we implemented a simplified version of the Summarizer agent (see the section **Agent Setup** below). Alternatively, you can reuse the code for the agent created with AgentBuilder. If you’d like to reproduce the agent directly from the AgentBuilder UI, here are the key prompts and parameters used:\n", + "\n", + "- **Summarizer agent:** This agent used the file search tool, where the [CMC PDF](\"data/c13_pyruvate_sample_CMC_from_UCSF.pdf\") was uploaded to the vector store.\n", + "> _Prompt:_ \"Summarize section {{workflow.input_as_text}} from {{state.cmc_pdf}} uploaded to the vector store.\"\n", + "\n", + "- **Compliance Checker agent:**\n", + "> _Prompt:_ \"Verify that the summary below is compliant with FDA 21 CFR Part 11: {{input.output_text}}. If the summary is compliant, return _Compliant_. Otherwise, return _This section needs to be manually summarized_.\" \n", + "\n", + "Both agents were configured with the default parameters - using GPT-5, low reasoning effort, and text as the output format.\n", + "\n", + "### Evaluation Approach\n", + "\n", + "To evaluate the baseline agent, there are two main approaches:\n", + "\n", + "1. **Collecting Human Feedback.** This approach involves gathering feedback from human users through the OpenAI Evals platform (or a custom UI built for a specific application). It is best suited for production settings or when piloting a tool where subject matter experts (SMEs) interact with the tool in real-world scenarios. This method helps uncover edge cases that may not have been identified during development. On the Evals platform, users can provide thumbs-up or thumbs-down ratings and share qualitative feedback about the summaries. \n", + "\n", + "\n", + "2. **Using an LLM-as-a-Judge.** This option is typically used during the development phase, enabling fast feedback loops without requiring SME's time. An **LLM-as-a-judge** uses an LLM to automatically evaluate and score the agent’s outputs based on predefined criteria. It can also be used for monitoring model drift (e.g., in production) or validating changes between model and model versions (e.g., switching between `gpt-5` and `gpt-5-mini`).\n", + "\n", + "\n", + "This cookbook demonstrates both approaches:\n", + "- **Section 2** shows the platform UI approach for manual prompt optimization\n", + "- **Section 3** implements the fully automated API approach using LLM-as-a-judge\n", + "\n", + "_Note: The Evals platform does not yet provide an API to retrieve user feedback programmatically._\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Using the OpenAI Evals Platform\n", + "\n", + "The OpenAI Evals platform provides an intuitive interface for prompt optimization and evaluation. This section demonstrates the complete workflow from dataset upload through iterative prompt improvement, showing how you can leverage the platform's visual interface to optimize your prompts before implementing automated solutions.\n", + "\n", + "### Step 1: Upload Dataset\n", + "\n", + "To begin using the OpenAI Evaluation platform, you'll first need to upload your dataset:\n", + "\n", + "1. Click the **+ Create** button\n", + "2. Define the dataset name\n", + "3. Upload a CSV file and select the columns to keep\n", + "4. Upload\n", + "\n", + "Your dataset should contain the documents or document sections that need to be summarized. 
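If your document sections are not already in a CSV, a minimal sketch along these lines can assemble one for upload. It is an illustrative example only: the placeholder section text is invented, and the column names simply mirror the `data/dataset.csv` file used later in Section 3.

```python
# Sketch: assemble an uploadable CSV from already-extracted document sections.
# The section text below is a placeholder; in practice it would come from your
# own extraction step (e.g., parsing the CMC PDF into per-section strings).
from pathlib import Path

import pandas as pd

sections = [
    {"section_number": "3.2.S.1", "content": "General information about the drug substance..."},
    {"section_number": "3.2.S.1.1", "content": "Nomenclature of [1-13C]pyruvic acid..."},
]

df = pd.DataFrame(sections, columns=["section_number", "content"])
Path("data").mkdir(exist_ok=True)
df.to_csv("data/dataset.csv", index=False)
print(df.head())
```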
Each row represents one input that will be processed by your system.\n", +    "\n", +    "### Step 2: Explore Your Data\n", +    "\n", +    "Once uploaded, click the dataset name to explore the data. This lets you verify that your dataset is properly formatted and contains the expected content before proceeding with prompt configuration.\n", +    "\n", +    "### Step 3: Configure Initial Prompt\n", +    "\n", +    "This is where you define your initial system prompt and configure how data flows through your model. \n", +    "\n", +    "\"Platform\n", +    "
Figure 3 - The platform's \"New prompt\" interface showing model configuration, variables, and system message settings.\n", +    "\n", +    "\n", +    "#### Configuration Steps\n", +    "\n", +    "1. **System Prompt**: Add the system message that defines the model's task and behavior (this prompt will be optimized)\n", +    "2. **User Prompt Template**: Add the prompt message template for user messages, using variables such as `{{}}` that get replaced with actual data from your dataset\n", +    "3. **Model Selection**: Choose the model for generation (e.g., gpt-4.1, gpt-5)\n", +    "4. **Temperature**: Configure creativity vs. determinism\n", +    "\n", +    "You can start with a very simple prompt to demonstrate the power of the optimization process. For example, beginning with just \"summarize\" shows how the system can evolve from a minimal starting point.\n", +    "\n", +    "### Step 4: Generate Outputs\n", +    "\n", +    "Once your prompt is configured, you're ready to generate outputs across your dataset. The prompt will run once per row, and the output will be generated in a new **Output** column.\n", +    "\n", +    "1. Click **\"Generate Output\"**\n", +    "2. The platform runs your prompt against all samples\n", +    "3. Results appear in a new **Output** column\n", +    "\n", +    "The platform will process each row in your dataset, replacing template variables with actual values and calling the model with your system prompt. This creates a baseline of outputs that you can evaluate.\n", +    "\n", +    "### Step 5: Review and Evaluate\n", +    "\n", +    "Evaluation is where you provide structured feedback to guide prompt improvement.\n", +    "\n", +    "#### Review Outputs\n", +    "\n", +    "1. **Add Evaluation Columns** if not automatically added - Click \"Columns\" → \"Annotations\" → \"Add\":\n", +    "   - **Rating** - Binary (good/bad) or numeric ratings\n", +    "   - **Feedback** - Text describing what needs improvement\n", +    "\n", +    "2. **Provide Rating and Feedback** - Add your assessment for each output. \n", +    "\n", +    "   Depending on the quality of the output, you may select a good or bad rating and explain your score based on how you would like the answer to be improved. For example:\n", +    "\n", +    "   > (Rating) | Feedback\n", +    "   > - (Good) Good, but only the answer should be provided. The output should not include headers or any text other than the answer.\n", +    "   > - (Bad) The information is good, but it should be presented as bullet points.\n", +    "   > - (Good) Good summary; it is clear.\n", +    "   > - (Bad) Use bullet points when answering to improve readability. Summarize each sub-section individually.\n", +    "\n", +    "3. **Save Annotations** - Your feedback is saved with the evaluation run\n", +    "\n", +    "\"Platform\n", +    "
Figure 4 - The evaluation interface showing generated outputs with rating and feedback columns for annotation.\n", + "\n", + "This structured feedback becomes the foundation for automatic prompt optimization.\n", + "\n", + "### Step 6: Optimize Prompt\n", + "\n", + "After collecting feedback, the platform can automatically generate an improved prompt.\n", + "\n", + "1. Click **\"Optimize\"**\n", + "2. A new prompt version is generated in a new tab\n", + "3. Click **\"View Prompt\"** to see the improved version\n", + "\n", + "\"Platform\n", + "
Figure 5 - The improved prompt generated by the platform, showing detailed instructions and requirements.\n", + "\n", + "### Step 7: Iterate and Compare\n", + "\n", + "With your improved prompt ready, start a new iteration to measure improvement.\n", + "\n", + "1. Click **\"Generate Output\"**\n", + "2. Review the new results and provide feedback on any remaining issues\n", + "3. Click **\"Optimize\"** again if needed\n", + "4. Repeat until satisfied\n", + "\n", + "The platform's tab structure allows you to compare performance across iterations. You can easily see how outputs evolved from your initial prompt to the optimized versions.\n", + "\n", + "\"Platform\n", + "
Figure 6 - Feedback and evaluation results for the optimized prompt, showing improvements in output quality.\n", + "\n", + "#### When to Stop Iterating\n", + "\n", + "Continue the optimization cycle until:\n", + "- **Quality threshold reached**: >80% of outputs receive positive feedback\n", + "- **Diminishing returns**: New iterations show minimal improvement\n", + "- **Specific issues resolved**: All identified failure modes are addressed\n", + "\n", + "This platform-based approach provides an excellent foundation for understanding prompt optimization before moving to automated implementations. The visual interface makes it easy to see the impact of changes and understand the optimization process.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Self-evolving Loop with LLM-as-a-Judge\n", + "\n", + "This section introduces a fully automated evaluation workflow using an LLM-as-a-Judge through the OpenAI API, eliminating the need for any user interface. This approach enables scalable, programmatic assessment of agent performance, supporting rapid iteration and continuous model monitoring in production." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# gepa and litellm are only required for the Section 4.b (prompt optimization with GEPA)\n", + "%pip install --upgrade openai openai-agents pydantic pandas gepa litellm python-dotenv -qqq \n", + "%load_ext dotenv\n", + "%dotenv\n", + "\n", + "# Place your API key in a file called .env\n", + "# OPENAI_API_KEY=sk-...\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Eval Creation\n", + "\n", + "To evaluate the baseline summarization agent, we use four complementary graders that balance deterministic checks with semantic judgment.\n", + "\n", + "| Grader | Type | Pass threshold | What it checks | Why |\n", + "|---|---|---:|---|---|\n", + "| Chemical string name | `python` | 0.8 | If any exact chemical names in the section appear in the summary. | Forces preservation of critical domain entities so summaries don’t omit chemically meaningful terms. |\n", + "| Summarization length | `python` | 0.85 | Inverse deviation from an expected 100-word length. | Keeps summaries concise and comparable, reducing verbosity that can mask poor content. |\n", + "| Cosine similarity | `text_similarity` | 0.85 | Cosine similarity between section and summary texts. | Ensures the summary stays anchored to the source content rather than drifting semantically. |\n", + "| LLM-as-judge | `score_model` | 0.85 | A rubric-driven score from a model acting as an evaluator. | Captures nuanced quality signals that rule-based metrics miss, improving overall robustness. |\n", + "\n", + "**Notes**\n", + "- The two Python graders catch domain fidelity and length discipline early, which stabilizes optimization before semantic tuning.\n", + "- Text similarity guards against superficial rephrasing that strays from the source.\n", + "- The LLM judge provides a holistic failsafe when edge cases slip past deterministic checks." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "from openai import OpenAI\n", + "\n", + "client = OpenAI(api_key=os.getenv(\"OPENAI_API_KEY\"))\n", + "\n", + "data_source_config = {\n", + " \"type\": \"custom\",\n", + " \"item_schema\": {\n", + " \"type\": \"object\",\n", + " \"properties\": {\"section\": {\"type\": \"string\"}, \"summary\": {\"type\": \"string\"}},\n", + " \"required\": [\"section\", \"summary\"],\n", + " },\n", + " \"include_sample_schema\": False,\n", + "}\n", + "\n", + "testing_criteria = [\n", + " {\n", + " \"type\": \"python\",\n", + " \"name\": \"chemical_name_grader\",\n", + " \"image_tag\": \"2025-05-08\",\n", + " \"pass_threshold\": 0.8,\n", + " \"source\": r\"\"\"def grade(sample: dict, item: dict) -> float:\n", + " section = item[\"section\"]\n", + " summary = item[\"summary\"]\n", + " CHEMICALS_MASTER = [\"[1-¹³C]Pyruvic acid\",\"[1-¹³C]Pyruvate\",\"¹²C Pyruvic acid\",\"Sodium [1-¹³C]pyruvate\",\"Sodium pyruvate (¹²C)\",\"AH111501 (Trityl radical)\",\"Tris{8-carboxyl-2,2,6,6-tetra[2-(1-methoxyethyl)]-benzo(1,2-d:4,5-d’)bis(1,3)dithiole-4-yl}methyl acid\",\"AH111501 sodium salt\",\"Methyl, tris[8-carboxy-2,2,6,6-tetrakis(2-methoxyethyl)benzo[1,2-d:4,5-d’]bis[1,3]dithiol-4-yl]-, trisodium salt\",\"AH111501 trisodium salt\",\"AH111576\",\"2,2′,2″,2‴-(4,8-Dibromobenzo[1,2-d:4,5-d′]bis([1,3]dithiole)-2,2,6,6-tetrayl)tetraethanol\",\"AH111586\",\"4,8-Dibromo-2,2,6,6-tetrakis(2-methoxyethyl)benzo[1,2-d:4,5-d′]bis([1,3]dithiole)\",\"AH111709\",\"AH111743\",\"AH112615\",\"4,4-Bis-hydroxymethyl-2-methyl-oxazolidine-2-carboxylic acid\",\"AH112623\",\"Parapyruvate\",\"2-Hydroxy-2-methyl-4-oxo-pentanedioic acid\",\"AH113127\",\"(4-Hydroxymethyl-oxazolidin-4-yl)-methanol\",\"AH113462/E\",\"Enol lactone\",\"AH113462/K\",\"Keto lactone\",\"Acetyl bromide\",\"Methanol\",\"Dimethyl sulfoxide\",\"DMSO\",\"Tetrahydrofuran\",\"THF\",\"Acetonitrile\",\"ACN\",\"Diethyl ether\",\"Et₂O\",\"N,N-Dimethylacetamide\",\"DMA\",\"1,3-Dimethyl-2-imidazolidinone\",\"DMI\",\"Hydrochloric acid\",\"HCl\",\"Sodium hydroxide\",\"NaOH\",\"Disodium ethylenediaminetetraacetate\",\"Na₂EDTA\",\"Ethylenediaminetetraacetic acid\",\"EDTA\",\"Tris(hydroxymethyl)aminomethane\",\"TRIS\",\"Trometamol\",\"Trifluoroacetic acid\",\"TFA\",\"Toluene\",\"Heptane\",\"Ethyl acetate\",\"Ethanol\",\"Water\",\"H₂O\",\"Sodium chloride\",\"NaCl\",\"Cuprous [1-¹³C]cyanide\",\"Cu¹³CN\",\"Gadolinium\",\"Gd\",\"Tin\",\"Sn\",\"Phosphorus\",\"P\",\"Carbon dioxide\",\"CO₂\",\"Sodium [1-13C]pyruvate\",\"[1-13C]Pyruvic acid\",\"1-13C pyruvate\"]\n", + "\n", + " # Identify the chemicals present in the section\n", + " present = [chem for chem in CHEMICALS_MASTER if chem in section]\n", + "\n", + " # If no chemicals present, consider it satisfied\n", + " if not present:\n", + " return 1.0\n", + "\n", + " correct = 0\n", + " for chem in present:\n", + " # Only count as correct if the exact chemical string appears in the summary\n", + " if chem in summary:\n", + " correct += 1\n", + "\n", + " return correct / len(present)\"\"\",\n", + " },\n", + " {\n", + " \"type\": \"python\",\n", + " \"name\": \"word_length_deviation_grader\",\n", + " \"image_tag\": \"2025-05-08\",\n", + " \"pass_threshold\": 0.85,\n", + " \"source\": r\"\"\"\n", + "def grade(sample: dict, item: dict) -> float:\n", + " summary = item[\"summary\"]\n", + " word_count = len(summary.split())\n", + " \n", + " expected_summary_length = 100\n", + " tolerance = 0.2 # 20% band around 
target\n", + " \n", + " # relative deviation\n", + " deviation = abs(word_count - expected_summary_length) / expected_summary_length\n", + " \n", + " # If within tolerance band → full score\n", + " if deviation <= tolerance:\n", + " return 1.0\n", + " \n", + " # Outside band → score decays linearly, capped at 0\n", + " # e.g., deviation 0.3 → score 0.8, deviation 1.0+ → 0.0\n", + " score = 1.0 - (deviation - tolerance)\n", + " return max(0.0, score)\n", + "\"\"\",\n", + "},\n", + " {\n", + " \"name\": \"cosine_similarity\",\n", + " \"type\": \"text_similarity\",\n", + " \"input\": \"{{ item.summary }}\",\n", + " \"reference\": \"{{ item.section }}\",\n", + " \"evaluation_metric\": \"cosine\",\n", + " \"pass_threshold\": 0.85,\n", + " },\n", + " {\n", + " \"name\": \"llm_as_judge\",\n", + " \"type\": \"score_model\",\n", + " \"model\": \"gpt-4.1\",\n", + " \"input\": [\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": (\n", + " \"You are an expert technical summarization evaluator. \"\n", + " \"Evaluate whether the summary captures and preserves the important technical facts and specific details from the section, allowing for occasional minor rewording or omissions of less important points, but not major technical inaccuracies or information loss.\\n\\n\"\n", + " \"Scoring Guidelines:\\n\"\n", + " \"- Return a numerical score between 0 and 1 (with up to two decimal places).\\n\"\n", + " \"- A score of 1 means the summary is almost flawless: it is comprehensive, highly faithful, and technically accurate, with virtually no important or meaningful details missing, and no significant misstatements or distortions.\\n\"\n", + " \"- 0.75-0.99 indicates excellent work: all main facts are represented, but there may be trivial omissions or very minor rewording that do not materially affect understanding.\\n\"\n", + " \"- 0.5-0.75 indicates good but imperfect: most technical information is retained and correctly presented, some less critical details might be missing or slightly rephrased, but overall fidelity is preserved.\\n\"\n", + " \"- 0.3-0.5 means significant information is missing, or some technical inaccuracies are present, but the summary retains a reasonable portion of key facts.\\n\"\n", + " \"- 0.0-0.3 means there are major omissions, misunderstandings, or a failure to capture the most important technical content.\\n\\n\"\n", + " \"Respond only with a single number between 0 and 1 indicating summary quality by these criteria.\"\n", + " ),\n", + " },\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": (\n", + " \"Section:\\n{{item.section}}\\n\"\n", + " \"Summary:\\n{{sample.output_text}}\"\n", + " ),\n", + " },\n", + " ],\n", + " \"range\": [0, 1],\n", + " \"pass_threshold\": 0.85,\n", + " },\n", + "]\n", + "\n", + "eval = client.evals.create(\n", + " name=\"self_evolving_eval\",\n", + " data_source_config=data_source_config,\n", + " testing_criteria=testing_criteria,\n", + ")\n", + "print(f\"Created Eval: {eval.id}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You should see an eval ID in the output, e.g. `eval_...`. This is the ID of the eval we just created (as shown below)\n", + "\n", + "\"Platform\n", + "
Figure 7 - The platform's Eval interface showing data source configuration, and test criteria settings." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Grader Scoring and Parsing\n", + "\n", + "Next we'll need run the evals on the summarization agent's output and parse the results for the eval's grader scores. To do this we'll use a few helper functions:\n", + "- `run_eval`: Simple runner to call the evals API with proper formatting\n", + "- `poll_eval_run`: A polling utility to wait for the scheduled eval run to complete\n", + "- `parse_eval_run_output`: Parses the eval run and returns a structured output for the feedback loop" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import time\n", + "import json\n", + "\n", + "def run_eval(eval_id: str, section: str, summary: str):\n", + " \"\"\"Creates a run of the eval with the input section and output summary.\"\"\"\n", + " return client.evals.runs.create(\n", + " eval_id=eval_id,\n", + " name=\"self-evolving-eval\",\n", + " data_source={\n", + " \"type\": \"jsonl\",\n", + " \"source\": {\n", + " \"type\": \"file_content\",\n", + " \"content\": [\n", + " {\n", + " \"item\": {\n", + " \"section\": section,\n", + " \"summary\": summary,\n", + " }\n", + " }\n", + " ],\n", + " },\n", + " },\n", + " )\n", + "\n", + "\n", + "def poll_eval_run(eval_id: str, run_id: str, max_polls = 10):\n", + " \"\"\"\n", + " Polls the evaluation run until completion or timeout.\n", + "\n", + " This function exists to handle asynchronous behavior in the eval service by\n", + " periodically checking run status. It balances responsiveness and resource use by\n", + " polling at fixed intervals rather than blocking indefinitely. The retry limit\n", + " prevents runaway loops in cases where the service never returns a completed status.\n", + " \"\"\"\n", + " run = None\n", + " for attempt in range(1, max_polls + 1):\n", + " run = client.evals.runs.retrieve(eval_id=eval_id, run_id=run_id)\n", + " if run.status == \"completed\":\n", + " break\n", + " if attempt == max_polls:\n", + " print(\"Exceeded retries, aborting\")\n", + " break\n", + "\n", + " time.sleep(5)\n", + "\n", + " run_output_items = client.evals.runs.output_items.list(\n", + " eval_id=eval_id, run_id=run_id\n", + " )\n", + " return run_output_items\n", + "\n", + "\n", + "def parse_eval_run_output(items):\n", + " \"\"\"Extract all grader scores and any available conclusion outputs.\"\"\"\n", + " all_results = []\n", + "\n", + " for item in items.data:\n", + " for result in item.results:\n", + " grader_name_full = result.name\n", + " score = result.score\n", + " passed = result.passed\n", + " reasoning = None\n", + " try:\n", + " sample = result.sample\n", + " if sample:\n", + " content = result.sample[\"output\"][0][\"content\"]\n", + " content_json = json.loads(content)\n", + " steps = content_json[\"steps\"]\n", + " reasoning = \" \".join([step[\"conclusion\"] for step in steps])\n", + " except Exception:\n", + " pass\n", + "\n", + " all_results.append(\n", + " {\n", + " \"grader_name\": grader_name_full,\n", + " \"score\": score,\n", + " \"passed\": passed,\n", + " \"reasoning\": reasoning,\n", + " }\n", + " )\n", + "\n", + " return all_results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we can use the created eval ID from earlier and run the graders against an arbitrary input section and summary output. 
This forms the backbone of the feedback loop which will kick off the prompt optimization routine.\n", + "\n", + "### Eval execution run\n", + "\n", + "Let's test our evals by providing a section and a generated summary directly." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "EVAL_ID = eval.id #Created eval ID from above cell\n", + "SECTION = \"3.2.S.1 General Information ([1-13C]pyruvic acid) The active ingredient in Hyperpolarized Pyruvate (13C) Injection is hyperpolarized [1-13C]pyruvate. The drug substance is defined as [13C]pyruvic acid, which is neutralized to [1-13C]pyruvate during the compounding process. In several pre-clinical and clinical studies and during evaluation of stability, pyruvic acid has been used instead of [1-13C]pyruvic acid (see Sections 3.2.P.2.2.1 Formulation Development for Hyperpolarized Pyruvate (13C) Injection and Section 8.1 Introduction for Item 8 Pharmacology and Toxicology Info). In the Section 3.2.S Drug Substance, data are presented for both pyruvic acid and for [1-13C]pyruvic acid. For simplicity, the terminology used in headings and captions is [1-13C]pyruvic acid. Batches containing pyruvic acid are specified by footnotes. 3.2.S.1.1 Nomenclature ([1-13C]pyruvic acid) The drug substance used for compounding of Hyperpolarized Pyruvate (13C) Injection is [1-13C]pyruvic acid. Company code: W6578 Chemical name: [1-13C]pyruvic acid CAS registry number: 127-17-3 3.2.S.1.2 Structure ([1-13C]pyruvic acid) Figure 1 Structure of [1-13C]pyruvic acid Molecular formula: C H O 3 4 3 Molecular weight: 89.06 3.2.S.1.3 General Properties ([1-13C]pyruvic acid) Appearance: Colorless to yellow, clear, viscous liquid pKa:Ka:aranWater solubility: Complete The structure of [1-13C]pyruvic acid has been confirmed by spectroscopic analysis (see Section 3.2.S.3.1 Elucidation of Structure and other Characteristics).\"\n", + "SUMMARY = \"The active ingredient in Hyperpolarized Pyruvate (13C) Injection is hyperpolarized [1-13C]pyruvate, derived from [1-13C]pyruvic acid (neutralized during compounding). Both pyruvic acid and [1-13C]pyruvic acid were used in studies and stability evaluations, but the documentation refers to [1-13C]pyruvic acid unless otherwise noted. The drug substance ([1-13C]pyruvic acid, CAS 127-17-3) is a colorless to yellow, clear, viscous liquid with a molecular formula C3H4O3 and molecular weight 89.06. Its structure has been confirmed by spectroscopic analysis, and it is completely soluble in water.\"\n", + "\n", + "eval_run = run_eval(EVAL_ID, section=SECTION, summary=SUMMARY)\n", + "run_output = poll_eval_run(eval_id=EVAL_ID, run_id=eval_run.id)\n", + "\n", + "grader_scores = parse_eval_run_output(run_output)\n", + "print(grader_scores)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "vscode": { + "languageId": "shellscript" + } + }, + "source": [ + "You should see a list of grader scores in the output, e.g.\n", + "\n", + "```[{'grader_name': 'chemical_name_grader-', 'score': 0.5, 'passed': False, 'reasoning': None}, {'grader_name': 'word_length_deviation_grader-', 'score': 0.8, 'passed': True, 'reasoning': None}, {'grader_name': 'cosine_similarity-', 'score': 0.9104484223477793, 'passed': True, 'reasoning': None}, {'grader_name': 'llm_as_judge-', 'score': 0.8, 'passed': True, 'reasoning': 'The summary needs to include specific details from the section. Part of the essential information is captured. Key pieces of information are missing. 
Not all relevant structural information is included.'}]```\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Running this script we can see that most of our graders are passing except the `chemical_name_grader`. Next we'll programmatically recognize this opportunity to improve the summarization agent.\n", + "\n", + "_Note: When you run it locally, graders other than `chemical_name_grader` may fail at first. This is normal, as graders can initially fail, but the results should improve through the feedback loop. Early failures simply reflect the model adjusting its responses before converging on more accurate results._\n", + "\n", + "\n", + "### Dashboard Observability\n", + "Eval runs and results can also be seen in the OpenAI Dashboard: \n", + "\n", + "\"Eval\n", + "
Figure 8 - Eval dashboard showing evaluation runs and results.\n", + "\n", + "\n", + "We can also drill down into a specific eval run: \n", + "\"Eval\n", + "
Figure 9 - Detailed eval run results showing grader scores and performance metrics.\n", + "\n", + "\n", + "## Agent Setup\n", + "\n", + "Now that we have our evals and graders set up, we can go back to our summarization agent. \n", + "For simplicity, we will provide the code for a simple agent below. You could also use `AgentBuilder`, as shown in Figure 2, and export the code from the UI.\n", + "\n", + "\n", + "We will also need a metaprompt optimization agent, to optimize our prompt, as well as some simple utilities to handle prompt versions:\n", + "- `PromptVersionEntry`: A pydantic model used to track the prompt and metadata as it changes in production\n", + "- `VersionedPrompt`: A utility class to track prompt versions, this will be important in production when analyzing the evolution of the prompt as well as ensuring there is a fallback history in case of a regression" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from datetime import datetime\n", + "from typing import Any, Optional\n", + "\n", + "from pydantic import BaseModel, Field, ConfigDict, field_validator\n", + "\n", + "class PromptVersionEntry(BaseModel):\n", + " \"\"\"Data model for a prompt and associated data for observability\"\"\"\n", + " version: int = Field(\n", + " ..., ge=0, description=\"Version number of the prompt (increments)\"\n", + " )\n", + " model: str = Field(\n", + " \"gpt-5\",\n", + " min_length=1,\n", + " description=\"The model version to use for this version of the prompt, defaults to gpt-5\",\n", + " )\n", + " prompt: str = Field(\n", + " ..., min_length=1, description=\"The prompt text for this version\"\n", + " )\n", + " timestamp: datetime = Field(\n", + " default_factory=datetime.utcnow,\n", + " description=\"UTC timestamp when this version was created\",\n", + " )\n", + " eval_id: Optional[str] = Field(\n", + " None, description=\"ID of the evaluation associated with this prompt version\"\n", + " )\n", + " run_id: Optional[str] = Field(\n", + " None, description=\"ID of the run associated with this prompt version\"\n", + " )\n", + " metadata: Optional[dict[str, Any]] = Field(\n", + " None, description=\"Free-form metadata dict (e.g., section, summary)\"\n", + " )\n", + "\n", + " model_config = ConfigDict(\n", + " str_strip_whitespace=True, validate_assignment=True, extra=\"forbid\"\n", + " )\n", + "\n", + " @field_validator(\"prompt\")\n", + " @classmethod\n", + " def prompt_not_blank(cls, v: str) -> str:\n", + " if not v.strip():\n", + " raise ValueError(\"prompt must not be blank or only whitespace\")\n", + " return v\n", + "\n", + "\n", + "class VersionedPrompt:\n", + " \"\"\"Manages a collection of prompt versions and provides controlled updates and rollbacks.\"\"\"\n", + " def __init__(\n", + " self,\n", + " initial_prompt: str,\n", + " model: Optional[str] = \"gpt-5\",\n", + " eval_id: Optional[str] = None,\n", + " run_id: Optional[str] = None,\n", + " metadata: Optional[dict[str, Any]] = None,\n", + " ):\n", + " if not initial_prompt or not initial_prompt.strip():\n", + " raise ValueError(\"initial_prompt must be non-empty\")\n", + " self._versions: list[PromptVersionEntry] = []\n", + " first_entry = PromptVersionEntry(\n", + " version=0,\n", + " prompt=initial_prompt,\n", + " model=model,\n", + " eval_id=eval_id,\n", + " run_id=run_id,\n", + " metadata=metadata,\n", + " )\n", + " self._versions.append(first_entry)\n", + "\n", + " def update(\n", + " self,\n", + " new_prompt: str,\n", + " model: Optional[str] = 
\"gpt-5\",\n", + " eval_id: Optional[str] = None,\n", + " run_id: Optional[str] = None,\n", + " metadata: Optional[dict[str, Any]] = None,\n", + " ) -> PromptVersionEntry:\n", + " if not new_prompt or not new_prompt.strip():\n", + " raise ValueError(\"new_prompt must be non-empty\")\n", + "\n", + " version = self.current().version + 1\n", + " entry = PromptVersionEntry(\n", + " version=version,\n", + " prompt=new_prompt,\n", + " model=model,\n", + " eval_id=eval_id,\n", + " run_id=run_id,\n", + " metadata=metadata,\n", + " )\n", + " self._versions.append(entry)\n", + " return entry\n", + "\n", + " def current(self) -> PromptVersionEntry:\n", + " return self._versions[-1]\n", + "\n", + " def revert_to_version(self, version: int) -> PromptVersionEntry:\n", + " idx = None\n", + " for i, entry in enumerate(self._versions):\n", + " if entry.version == version:\n", + " idx = i\n", + " break\n", + "\n", + " if idx is None:\n", + " raise ValueError(f\"No version found with version={version}\")\n", + "\n", + " self._versions = self._versions[: idx + 1]\n", + " return self._versions[-1]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next we'll create the starting summarization and prompt optimization agents.\n", + "\n", + "_Note: We created a wrapper to track prompt changes in the summarization agent since it is expected to evolve in production, the metaprompt agent's prompt will stay static for the purposes of this cookbook._" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "from agents import Agent\n", + "\n", + "METAPROMPT_TEMPLATE = \"\"\"\n", + "# Context:\n", + "## Original prompt:\n", + "{original_prompt}\n", + "\n", + "## Section:\n", + "{section}\n", + "\n", + "## Summary:\n", + "{summary}\n", + "\n", + "## Reason to improve the prompt:\n", + "{reasoning}\n", + "\n", + "# Task:\n", + "Write a new summarization prompt that is significantly improved and more specific than the original. \n", + "The new prompt should instruct the model to produce concise yet comprehensive technical summaries that precisely preserve all explicit information from the source text. It should emphasize the inclusion of all named entities, quantities, compounds, and technical terminology without paraphrasing or omission. 
The resulting prompt should read like a clear, directive system message for a technical summarization assistant—structured, unambiguous, and generalizable across scientific or regulatory document sections.\n", + "\"\"\"\n", + "\n", + "metaprompt_agent = Agent(\n", + " name=\"MetapromptAgent\", instructions=\"You are a prompt optimizer.\"\n", + ")\n", + "\n", + "summarization_prompt = VersionedPrompt(\n", + " initial_prompt=\"\"\"You are a summarization assistant.\n", + "Given a section of text, produce a summary.\"\"\"\n", + ")\n", + "\n", + "def make_summarization_agent(prompt_entry: PromptVersionEntry) -> Agent:\n", + " return Agent(\n", + " name=\"SummarizationAgent\",\n", + " instructions=prompt_entry.prompt,\n", + " model=prompt_entry.model,\n", + " )\n", + "\n", + "summarization_agent = make_summarization_agent(summarization_prompt.current())\n", + "\n", + "# Cache eval results by section + summary so repeated attempts do not trigger redundant grader runs.\n", + "eval_cache: dict[tuple[str, str], list[dict[str, Any]]] = {}\n", + "\n", + "# Track the highest-scoring candidate that also passes the lenient score threshold.\n", + "best_candidate: dict[str, Any] = {\n", + " \"score\": float(\"-inf\"),\n", + " \"prompt\": summarization_prompt.current().prompt,\n", + " \"model\": summarization_prompt.current().model,\n", + " \"summary\": None,\n", + " \"metadata\": None,\n", + " \"version\": summarization_prompt.current().version,\n", + " \"passed_lenient\": False,\n", + " \"total_score\": float(\"-inf\"),\n", + "}\n", + "\n", + "# Aggregate per-version performance so we can pick the strongest total scorer at the end.\n", + "aggregate_prompt_stats: dict[int, dict[str, Any]] = {}\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Orchestration and Monitoring\n", + "\n", + "This is what we've done so far - we've created:\n", + "- Evals with 4 graders that will assess the outputs and produce a score for each grader\n", + "- A summarization agent with a versioned prompt class to track changes to the prompt and model\n", + "- A metaprompt optimization agent that will attempt to update the prompt based on a set of reasoning\n", + "\n", + "Now these different functionalities can be composed to orchestrate the self-evolving loop with Agent tracing in the OpenAI dashboard.\n", + "\n", + "Keep in mind that this is a simplified example. In a real-world scenario, you'd want to ensure you have guardrails for optimization attempts and that an alert notifies a human when a guardrail is triggered.\n", + "\n", + "_Note: Due to practical limitations of the cookbook we are simulating a stream of data by feeding in a static dataset and using `print` statements in place of true observability._\n", + "\n", + "### Orchestration Utilities\n", + "\n", + "As in previous sections we'll create some utilities to manage the orchestration logic of the feedback loop." 
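Before defining those utilities, here is a minimal sketch of the kind of guardrail and alerting hook mentioned above. It is an assumption-laden illustration, not part of the cookbook's implementation: `notify_engineer`, the webhook URL, and both thresholds are placeholders you would replace with your own alerting stack.

```python
# Hypothetical guardrail hook: alert a human when optimization stalls or regresses.
# The webhook URL and thresholds below are illustrative placeholders.
import json
import urllib.request

ALERT_WEBHOOK_URL = "https://example.com/alerts"  # placeholder endpoint
MAX_PROMPT_VERSIONS = 10      # stop optimizing after this many prompt updates
MAX_SCORE_REGRESSION = 0.10   # alert if the average grader score drops by more than this


def notify_engineer(message: str) -> None:
    """Send a minimal JSON alert; swap in Slack, PagerDuty, or email in a real system."""
    payload = json.dumps({"source": "self-evolving-agent", "message": message}).encode("utf-8")
    request = urllib.request.Request(
        ALERT_WEBHOOK_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(request, timeout=10)


def check_guardrails(prompt_version: int, previous_avg: float, current_avg: float) -> bool:
    """Return True if optimization may continue; alert and return False otherwise."""
    if prompt_version >= MAX_PROMPT_VERSIONS:
        notify_engineer(f"Prompt reached version {prompt_version}; manual review required.")
        return False
    if previous_avg - current_avg > MAX_SCORE_REGRESSION:
        notify_engineer(
            f"Average grader score regressed from {previous_avg:.2f} to {current_avg:.2f}."
        )
        return False
    return True
```

A check like this could run after each eval attempt in the loop below, with `revert_to_version` used to roll the prompt back to a known-good version whenever it fires.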
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import asyncio\n", + "from typing import Any, Optional\n", + "from agents import Runner\n", + "\n", + "LENIENT_PASS_RATIO = 0.75 # 75% of graders must pass (binary) \n", + "LENIENT_AVERAGE_THRESHOLD = 0.85 # 85% average score across graders \n", + "\n", + "def reset_best_candidate() -> None:\n", + " \"\"\"Reset the best candidate tracker for a new optimization run.\"\"\"\n", + " global best_candidate\n", + "\n", + " current = summarization_prompt.current()\n", + " best_candidate = {\n", + " \"score\": float(\"-inf\"),\n", + " \"prompt\": current.prompt,\n", + " \"model\": current.model,\n", + " \"summary\": None,\n", + " \"metadata\": None,\n", + " \"version\": current.version,\n", + " }\n", + "\n", + "def reset_best_trackers() -> None:\n", + " \"\"\"Reset both the best-candidate tracker and aggregate stats.\"\"\"\n", + " reset_best_candidate()\n", + " aggregate_prompt_stats.clear()\n", + "\n", + "\n", + "def update_best_candidate(\n", + " *,\n", + " average_score: Optional[float] = None,\n", + " prompt_text: str,\n", + " model_name: str,\n", + " summary_text: str = None,\n", + " metadata: dict[str, Any] = None,\n", + " lenient_passed: bool = False,\n", + " prompt_version: int = None,\n", + " total_score: Optional[float] = None,\n", + " score: Optional[float] = None,\n", + ") -> None:\n", + " \"\"\"Persist the best lenient-passing candidate.\"\"\"\n", + " global best_candidate\n", + "\n", + " if prompt_version is None:\n", + " prompt_version = summarization_prompt.current().version\n", + "\n", + " if average_score is None:\n", + " average_score = score\n", + "\n", + " if average_score is None:\n", + " return\n", + "\n", + " if lenient_passed:\n", + " best_candidate.update(\n", + " {\n", + " \"score\": average_score,\n", + " \"prompt\": prompt_text,\n", + " \"model\": model_name,\n", + " \"summary\": summary_text,\n", + " \"metadata\": metadata,\n", + " \"version\": prompt_version,\n", + " \"total_score\": total_score if total_score is not None else average_score,\n", + " }\n", + " )\n", + "\n", + "\n", + "def apply_best_candidate_if_needed() -> Agent:\n", + " \"\"\"Ensure summarization_prompt reflects the best prompt candidate.\"\"\"\n", + " if best_candidate[\"score\"] > float(\"-inf\"):\n", + " current = summarization_prompt.current()\n", + " target = best_candidate\n", + " # Only update if different\n", + " if (\n", + " current.prompt != target[\"prompt\"]\n", + " or current.model != target[\"model\"]\n", + " or current.version != target.get(\"version\")\n", + " ):\n", + " summarization_prompt.update(\n", + " new_prompt=target[\"prompt\"],\n", + " model=target[\"model\"],\n", + " metadata=target.get(\"metadata\"),\n", + " )\n", + " target[\"version\"] = summarization_prompt.current().version\n", + " return make_summarization_agent(summarization_prompt.current())\n", + "\n", + " return make_summarization_agent(summarization_prompt.current())\n", + "\n", + "\n", + "def record_aggregate_prompt_score(\n", + " *,\n", + " prompt_version: int,\n", + " prompt_text: str,\n", + " model_name: str,\n", + " average_score: float,\n", + " total_score: Optional[float] = None,\n", + ") -> None:\n", + " \"\"\"Accumulate per-version grader scores for aggregate selection.\"\"\"\n", + " stats = aggregate_prompt_stats.setdefault(\n", + " prompt_version,\n", + " {\n", + " \"version\": prompt_version,\n", + " \"prompt\": prompt_text,\n", + " \"model\": model_name,\n", + " \"total_score\": 0.0,\n", 
+ " \"total_average\": 0.0,\n", + " \"count\": 0,\n", + " },\n", + " )\n", + " stats[\"total_score\"] += total_score if total_score is not None else average_score\n", + " stats[\"total_average\"] += average_score\n", + " stats[\"count\"] += 1\n", + " stats[\"prompt\"] = prompt_text\n", + " stats[\"model\"] = model_name\n", + "\n", + "\n", + "def select_best_aggregate_prompt() -> Optional[dict[str, Any]]:\n", + " \"\"\"Return the prompt version with the highest cumulative score.\"\"\"\n", + " if not aggregate_prompt_stats:\n", + " return None\n", + " return max(\n", + " aggregate_prompt_stats.values(),\n", + " key=lambda entry: (\n", + " entry.get(\"total_score\", float(\"-inf\")),\n", + " entry.get(\"version\", -1),\n", + " ),\n", + " )\n", + "\n", + "\n", + "async def get_eval_grader_score(eval_id: str, section: str, summary: str):\n", + " \"\"\"Retrieve grader scores for a section-summary pair with caching.\"\"\"\n", + " cache_key = (section, summary)\n", + " if cache_key in eval_cache:\n", + " return eval_cache[cache_key]\n", + "\n", + " eval_run = run_eval(eval_id=eval_id, section=section, summary=summary)\n", + " run_output = poll_eval_run(eval_id=eval_id, run_id=eval_run.id)\n", + " results = parse_eval_run_output(run_output)\n", + " eval_cache[cache_key] = results\n", + " return results\n", + "\n", + "\n", + "def calculate_grader_score(grader_scores):\n", + " \"\"\"Simple average score of all graders from the eval.\"\"\"\n", + " if not grader_scores:\n", + " return 0.0\n", + "\n", + " score_sum = 0.0\n", + " for entry in grader_scores:\n", + " score_sum += entry.get(\"score\", 0.0)\n", + "\n", + " return score_sum / len(grader_scores)\n", + "\n", + "\n", + "\n", + "def calculate_total_grader_score(grader_scores):\n", + " \"\"\"Sum of all grader scores for aggregate tracking.\"\"\"\n", + " if not grader_scores:\n", + " return 0.0\n", + "\n", + " return sum(entry.get(\"score\", 0.0) for entry in grader_scores)\n", + "\n", + "\n", + "DEFAULT_PASSING_FEEDBACK = (\n", + " \"All graders passed; tighten factual coverage, chemical completeness, and conciseness.\"\n", + ")\n", + "\n", + "\n", + "def is_lenient_pass(grader_scores, average_score: float) -> bool:\n", + " if not grader_scores:\n", + " return False\n", + "\n", + " passed_count = sum(1 for entry in grader_scores if entry.get(\"passed\"))\n", + " total_graders = len(grader_scores)\n", + "\n", + " if total_graders and (passed_count / total_graders) >= LENIENT_PASS_RATIO:\n", + " return True\n", + " return average_score >= LENIENT_AVERAGE_THRESHOLD\n", + "\n", + "\n", + "def collect_grader_feedback(grader_scores):\n", + " \"\"\"Consolidate grader reasoning into actionable feedback for the metaprompt agent.\"\"\"\n", + " feedback_lines = []\n", + "\n", + " for entry in grader_scores:\n", + " grader = entry.get(\"grader_name\", \"\")\n", + " passed = entry.get(\"passed\", False)\n", + " reasoning = entry.get(\"reasoning\")\n", + "\n", + " if not passed:\n", + " if grader.startswith(\"chemical_name_grader\"):\n", + " feedback_lines.append(\n", + " \"Not all chemical names in the input section were included in the summary.\"\n", + " )\n", + " elif grader.startswith(\"word_length_deviation_grader\"):\n", + " feedback_lines.append(\n", + " \"The summary length deviates too much from the expected length.\"\n", + " )\n", + " elif grader.startswith(\"cosine_similarity\"):\n", + " feedback_lines.append(\n", + " \"The summary is not sufficiently similar to the source section (cosine similarity too low).\"\n", + " )\n", + " elif 
grader.startswith(\"llm_as_judge\") and reasoning:\n", + " feedback_lines.append(reasoning)\n", + "\n", + " if not feedback_lines:\n", + " feedback_lines.append(DEFAULT_PASSING_FEEDBACK)\n", + "\n", + " return \"\".join(feedback_lines)\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Self-evolving loop\n", + "\n", + "Now to simulate a stream of requests for summarization we'll feed in a prepared dataset and observe the optimization evolve from a naive prompt.\n", + "\n", + "> The referenced dataset.csv can be found in the Github repository." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "\n", + "from agents import Agent, trace\n", + "\n", + "EVAL_ID = eval.id #Created eval ID from above cell\n", + "MAX_OPTIMIZATION_RETRIES = 3\n", + "\n", + "async def self_evolving_loop(summarization_agent: Agent) -> Agent:\n", + " print(f\"Starting self-evolving loop | Initial prompt v{summarization_prompt.current().version}\")\n", + " print(f\"Prompt:{summarization_prompt.current().prompt}\")\n", + " print(\"-\" * 80)\n", + "\n", + " reset_best_trackers()\n", + " df = pd.read_csv(\"data/dataset.csv\")\n", + "\n", + " with trace(\"Self-evolving Optimization Workflow\"):\n", + " for _, row in df.head().iterrows():\n", + " content = row.get(\"content\")\n", + " if pd.isna(content) or (isinstance(content, str) and not content.strip()):\n", + " continue\n", + "\n", + " section_number = str(row[\"section_number\"])\n", + " section = str(content)\n", + " current_version = summarization_prompt.current().version\n", + "\n", + " print(f\"[Section {section_number}] Using prompt v{current_version}\")\n", + "\n", + " optimization_success = False\n", + "\n", + " for attempt in range(1, MAX_OPTIMIZATION_RETRIES + 1):\n", + " print(f\" Attempt {attempt}: evaluating summary...\")\n", + "\n", + " summary_result = await Runner.run(summarization_agent, section)\n", + " summary = summary_result.final_output\n", + "\n", + " grader_scores = await get_eval_grader_score(eval_id=EVAL_ID, summary=summary, section=section)\n", + " average_score = calculate_grader_score(grader_scores)\n", + " total_score = calculate_total_grader_score(grader_scores)\n", + " lenient_passed = is_lenient_pass(grader_scores, average_score)\n", + " print(\n", + " f\"\tScores — avg={average_score:.3f}, total={total_score:.3f}, lenient_passed={lenient_passed}\"\n", + " )\n", + "\n", + " record_aggregate_prompt_score(\n", + " prompt_version=summarization_prompt.current().version,\n", + " prompt_text=summarization_prompt.current().prompt,\n", + " model_name=summarization_prompt.current().model,\n", + " average_score=average_score,\n", + " total_score=total_score,\n", + " )\n", + "\n", + " update_best_candidate(\n", + " average_score=average_score,\n", + " prompt_text=summarization_prompt.current().prompt,\n", + " model_name=summarization_prompt.current().model,\n", + " summary_text=summary,\n", + " metadata={\n", + " \"section\": section_number,\n", + " \"average_score\": average_score,\n", + " \"grader_results\": grader_scores,\n", + " \"prompt_version\": summarization_prompt.current().version,\n", + " },\n", + " lenient_passed=lenient_passed,\n", + " prompt_version=summarization_prompt.current().version,\n", + " )\n", + "\n", + " if lenient_passed:\n", + " optimization_success = True\n", + " print(f\"\tPassed with prompt v{summarization_prompt.current().version}\")\n", + " break\n", + "\n", + " print(\"\tFailed eval. 
Improving prompt...\")\n", + " eval_feedback = collect_grader_feedback(grader_scores)\n", + "\n", + " metaprompt_result = await Runner.run(\n", + " metaprompt_agent,\n", + " input=METAPROMPT_TEMPLATE.format(\n", + " original_prompt=summarization_prompt.current().prompt,\n", + " section=section,\n", + " summary=summary,\n", + " reasoning=eval_feedback,\n", + " ),\n", + " )\n", + " improved_prompt = metaprompt_result.final_output\n", + " summarization_prompt.update(\n", + " new_prompt=improved_prompt,\n", + " metadata={\"section\": section, \"summary\": summary},\n", + " )\n", + " summarization_agent = make_summarization_agent(summarization_prompt.current())\n", + "\n", + " print(f\"\tPrompt improved → v{summarization_prompt.current().version}\")\n", + "\n", + " if not optimization_success:\n", + " print(\n", + " \"\tAll attempts failed; keeping latest prompt version \"\n", + " f\"v{summarization_prompt.current().version} for the next section.\"\n", + " )\n", + "\n", + " summarization_agent = apply_best_candidate_if_needed()\n", + "\n", + " print(\"\" + \"-\" * 80)\n", + " print(\"Completed optimization loop.\")\n", + " print(f\"Final prompt version: v{summarization_prompt.current().version}\")\n", + " if best_candidate[\"score\"] > float(\"-inf\"):\n", + " print(\n", + " f\"Best lenient prompt: v{best_candidate.get('version')} (avg={best_candidate['score']:.3f})\"\n", + " )\n", + "\n", + " aggregate_best = select_best_aggregate_prompt()\n", + " if aggregate_best:\n", + " per_section = (\n", + " aggregate_best.get(\"total_average\", 0.0) / aggregate_best.get(\"count\", 1)\n", + " if aggregate_best.get(\"count\")\n", + " else 0.0\n", + " )\n", + " print(\n", + " f\"Aggregate best prompt: v{aggregate_best.get('version')} \"\n", + " f\"(total={aggregate_best.get('total_score', 0.0):.3f}, avg/section={per_section:.3f}, model={aggregate_best.get('model', 'unknown')})\"\n", + " )\n", + "\n", + " print(f\"Final prompt:{summarization_prompt.current().prompt}\")\n", + " return summarization_agent\n", + "\n", + "summarization_agent = await self_evolving_loop(summarization_agent)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**How the final prompt is chosen**\n", + "\n", + "- Every evaluation logs the average grader score, the total score across graders, and whether the attempt passed the lenient criteria.\n", + "- `best_candidate` tracks the most recent lenient pass (for transparency), but the final selection uses the aggregate totals to ensure we keep the top-performing prompt overall.\n", + "- When the loop ends, `apply_best_candidate_if_needed` restores the prompt with the highest cumulative grader score (ties favor the latest version), guaranteeing that the surfaced prompt is the strongest performer observed.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here is an example (abridged) output for the code above.\n", + "\n", + "Inspecting the output shows that the self evolving prompt worked. There are a few takeaways to account for:\n", + "1. The optimization is not always successful, so being able to roll back the prompt version is important\n", + "2. 
The fidelity of the information from the graders is crucially important to ensuring a quality optimization" + ] + }, + { + "cell_type": "raw", + "metadata": { + "vscode": { + "languageId": "raw" + } + }, + "source": [ + "Starting self-evolving loop | Initial prompt v0\n", + "Prompt:You are a summarization assistant.\n", + "Given a section of text, produce a summary.\n", + "--------------------------------------------------------------------------------\n", + "[Section 7.1] Using prompt v0\n", + " Attempt 1: evaluating summary...\n", + "\tScores — avg=0.805, total=3.218, lenient_passed=False\n", + "\tFailed eval. Improving prompt...\n", + "\tPrompt improved → v1\n", + " Attempt 2: evaluating summary...\n", + "\tScores — avg=0.720, total=2.881, lenient_passed=False\n", + "\tFailed eval. Improving prompt...\n", + "\tPrompt improved → v2\n", + " Attempt 3: evaluating summary...\n", + "\tScores — avg=0.762, total=3.048, lenient_passed=True\n", + "\tPassed with prompt v2\n", + "[Section 7.2] Using prompt v2\n", + " Attempt 1: evaluating summary...\n", + "\tScores — avg=0.612, total=2.450, lenient_passed=False\n", + "\tFailed eval. Improving prompt...\n", + "\tPrompt improved → v3\n", + " Attempt 2: evaluating summary...\n", + "\tScores — avg=0.915, total=3.660, lenient_passed=True\n", + "\tPassed with prompt v3\n", + "[Section 3.2.P.2.1] Using prompt v3\n", + " Attempt 1: evaluating summary...\n", + "\tScores — avg=0.684, total=2.736, lenient_passed=False\n", + "\tFailed eval. Improving prompt...\n", + "\tPrompt improved → v4\n", + " Attempt 2: evaluating summary...\n", + "\tScores — avg=0.684, total=2.736, lenient_passed=False\n", + "\tFailed eval. Improving prompt...\n", + "\tPrompt improved → v5\n", + " Attempt 3: evaluating summary...\n", + "\tScores — avg=0.920, total=3.680, lenient_passed=True\n", + "\tPassed with prompt v5\n", + "[Section 3.2.P.2.2] Using prompt v5\n", + " Attempt 1: evaluating summary...\n", + "\tScores — avg=0.737, total=2.950, lenient_passed=True\n", + "\tPassed with prompt v5\n", + "[Section 3.2.P.2.3] Using prompt v5\n", + " Attempt 1: evaluating summary...\n", + "\tScores — avg=0.750, total=3.000, lenient_passed=True\n", + "\tPassed with prompt v5\n", + "--------------------------------------------------------------------------------\n", + "Completed optimization loop.\n", + "Final prompt version: v5\n", + "Best lenient prompt: v5 (avg=0.750)\n", + "Aggregate best prompt: v5 (total=9.630, avg/section=0.802)\n", + "Final prompt:**Optimized Technical Summarization System Prompt**\n", + "\n", + "You are a technical summarization assistant specialized in scientific and regulatory documents. Your objective is to generate a summary that preserves every explicit detail and organizational structure from the source text, without any paraphrasing, omission, or synthesis.\n", + "\n", + "**Strict Summarization Guidelines:**\n", + "\n", + "**1. 
Comprehensive Detail Inclusion:** \n", + "- Transcribe all named compounds, salts, excipients, drug substances, molecular designations, batch codes, identifiers, and CAS numbers exactly as written.\n", + "- Include every stated concentration, unit, measurement, quantitative value, compositional detail, and preparatory parameter verbatim and in original format.\n", + "- Accurately replicate all descriptions of appearance, color, physical state, rationale for inclusion, and labeling or typographical conventions present in the source.\n", + "- Clearly include all section titles, headings, subsections, hierarchical numbering, referenced sections, and in-line citations or figures.\n", + "\n", + "**2. Prohibited Actions:** \n", + "- Do NOT paraphrase, summarize, interpret, synthesize, restructure, generalize, or alter any information at any level.\n", + "- Do NOT omit, compress, merge, or reorder any data point, named entity, technical term, or explicit instruction from the source.\n", + "- Do NOT introduce additional content, inference, or editorial clarification.\n", + "\n", + "**3. Structural and Formatting Requirements:** \n", + "- Maintain verbatim order, sectioning, and hierarchy from the source text, including all original lists, bullet points, numbering, or formatting.\n", + "- Reproduce every element in the precise sequence, alignment, and structure as the input, ensuring maximal traceability.\n", + "- If the source uses lists, tables, subpoints, or hierarchies, mirror them exactly.\n", + "\n", + "**4. Precision, Fidelity, and Reviewability:** \n", + "- Your summary must enable full regulatory or technical audit by containing every explicit detail, designation, and measurement from the original—unaltered and unabridged.\n", + "- The output must be comprehensive, exhaustive, and identical in informational content and structure to the input. Every visible explicit detail must be present.\n", + "\n", + "**Output Instruction:** \n", + "Begin summarization after this message, applying the above rules without exception. Each output must be concise in format but all-inclusive in content, reflecting every explicit fact, designation, and organizational feature of the source text, and suitable for regulatory or expert review. No interpretation, paraphrasing, or omission is permitted under any circumstance." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Agent Logs & Tracing\n", + "\n", + "We can view optimization workflow runs in the dashboard under logs: \n", + "\n", + "\"Agent\n", + "
Figure 10 - Agent log traces showing optimization workflow runs in the dashboard.\n", + "\n", + "And drill down into the different agent calls: \n", + "\n", + "\"Agent\n", + "
Figure 11 - Detailed agent trace showing individual agent calls and execution flow.\n", + "\n", + "### Continuous Monitoring\n", + "\n", + "Once the evaluation loop is complete, the system should continue to monitor new incoming data and periodically re-evaluate model performance on blind datasets. This ensures the model remains accurate and compliant as the data distribution evolves.\n", + "\n", + "To enable continuous monitoring, you can integrate a cron job or a lightweight scheduler loop that periodically checks for updates in your data source (e.g., new PDF uploads or database entries). When new data is detected, the system automatically triggers the evaluation and optimization loop described earlier.\n", + "\n", + "For example (pseudo-code):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# this cell is pseudo-code and not meant to be run as-is\n", + "\n", + "import time\n", + "\n", + "def continuous_monitoring(interval_hours=24):\n", + " \"\"\"Periodically check for new data and trigger the evaluation loop.\"\"\"\n", + " while True:\n", + " print(\"Checking for new data...\")\n", + " if new_data_detected():\n", + " print(\"New data found — running evaluation and optimization loop.\")\n", + " self_evolving_loop()\n", + " else:\n", + " print(\"No new data. Sleeping until next cycle.\")\n", + " time.sleep(interval_hours * 3600)\n", + "\n", + "continuous_monitoring(interval_hours=24)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This approach allows the model to continuously learn and adapt, improving over time as it processes fresh data — a key requirement for maintaining high-quality, real-world performance." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. Going Further\n", + "\n", + "### a. Model Evaluation\n", + "\n", + "We now have a fully automated loop improving our prompt with **evals** and accepting the new prompt when the rating is over the defined threshold. \n", + "\n", + "In production, you could use a similar framework to monitor the performance of your agents as new user requests come in.\n", + "As mentioned above, this is a simplified example, and in a real-world scenario you'd want to have additional guardrails and a human-in-the-loop approach to approve new prompts. \n", + "\n", + "Taking this concept further, we can also use evals to test different model parameter candidates such as the model version, verbosity, and reasoning. To see the full available set of parameters that could be considered, check the [ModelSettings class in the Agents SDK](https://openai.github.io/openai-agents-python/ref/model_settings/#agents.model_settings.ModelSettings).\n", + "\n", + "The `compare_model_candidates` function is an example of how to:\n", + "1. Optimize the prompt\n", + "2. Generate candidate outputs from the optimized prompt using two or more different models\n", + "3. Use evals to grade the candidate outputs and select the best candidate\n", + "\n", + "It can be worked into the `self_evolving_loop` function with minimal refactoring.\n", + "\n", + "> **NOTE:** Production testing of model versions should be limited to versions within the same model family (e.g. gpt-5, gpt-5-mini, gpt-5-nano). 
It is recommended to conduct cross family version selection pre-production deployment.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And the final `self_evolving_loop` with model comparison code:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from agents import Agent, Runner\n", + "\n", + "async def eval_agent_candidate(agent: Agent, section: str, prompt_text: str, model_name: str):\n", + " summary_result = await Runner.run(agent, section)\n", + " summary = summary_result.final_output\n", + "\n", + " scores = await get_eval_grader_score(\n", + " eval_id=EVAL_ID, summary=summary, section=section\n", + " )\n", + " average = calculate_grader_score(scores)\n", + " lenient_passed = is_lenient_pass(scores, average)\n", + " passed = all(entry.get(\"passed\") is True for entry in scores)\n", + "\n", + " update_best_candidate(\n", + " average_score=average,\n", + " prompt_text=prompt_text,\n", + " model_name=model_name,\n", + " summary_text=summary,\n", + " metadata={\n", + " \"section\": section,\n", + " \"average_score\": average,\n", + " \"grader_results\": scores,\n", + " },\n", + " lenient_passed=lenient_passed,\n", + " )\n", + "\n", + " return {\"summary\": summary, \"scores\": scores, \"average\": average, \"passed\": passed}\n", + "\n", + "async def compare_model_candidates(\n", + " summarization_prompt,\n", + " eval_feedback: str,\n", + " section: str,\n", + " summary: str,\n", + " model_candidates=None,\n", + "):\n", + " \"\"\"Improve the prompt, evaluate it across candidate models, and adopt the top performer.\"\"\"\n", + " if model_candidates is None:\n", + " model_candidates = [\"gpt-5\", \"gpt-5-mini\"]\n", + "\n", + " metaprompt_result = await Runner.run(\n", + " metaprompt_agent,\n", + " input=METAPROMPT_TEMPLATE.format(\n", + " original_prompt=summarization_prompt.current().prompt,\n", + " section=section,\n", + " summary=summary,\n", + " reasoning=eval_feedback,\n", + " ),\n", + " )\n", + " improved_prompt = metaprompt_result.final_output\n", + "\n", + " async def evaluate_model(model_name: str):\n", + " candidate_agent = Agent(\n", + " name=f\"SummarizationAgent:{model_name}\",\n", + " instructions=improved_prompt,\n", + " model=model_name,\n", + " )\n", + " result = await eval_agent_candidate(candidate_agent, section, improved_prompt, model_name)\n", + " return model_name, candidate_agent, result\n", + "\n", + " best = {\n", + " \"average\": float(\"-inf\"),\n", + " \"passed\": False,\n", + " \"agent\": None,\n", + " \"model\": None,\n", + " \"summary\": None,\n", + " }\n", + "\n", + " tasks = [asyncio.create_task(evaluate_model(model_name)) for model_name in model_candidates]\n", + " for task in asyncio.as_completed(tasks):\n", + " model_name, candidate_agent, result = await task\n", + " print(\n", + " f\"Candidate average — {model_name}: {result['average']:.4f} \"\n", + " f\"(passed={result.get('passed', False)})\"\n", + " )\n", + " if result[\"average\"] > best[\"average\"]:\n", + " best.update(\n", + " {\n", + " \"average\": result[\"average\"],\n", + " \"model\": model_name,\n", + " \"summary\": result.get(\"summary\"),\n", + " \"agent\": candidate_agent,\n", + " \"passed\": result.get(\"passed\", False),\n", + " }\n", + " )\n", + "\n", + " for task in tasks:\n", + " if not task.done():\n", + " task.cancel()\n", + "\n", + " if best[\"passed\"] and best[\"model\"]:\n", + " summarization_prompt.update(\n", + " new_prompt=improved_prompt,\n", + " model=best[\"model\"],\n", + " 
metadata={\"section\": section, \"summary\": best[\"summary\"]},\n", + " )\n", + " print(f\"Updated summarization_prompt with passing model: {best['model']}\")\n", + " return make_summarization_agent(summarization_prompt.current())\n", + "\n", + " print(\n", + " f\"No passing models. Best candidate (model={best['model']}, \"\n", + " f\"avg={best['average']:.4f}) did not pass. Prompt not updated.\"\n", + " )\n", + " return None\n", + "\n", + "async def self_evolving_loop_with_model_comparison(summarization_agent: Agent) -> Agent:\n", + " print(\n", + " f\"Starting self-evolving loop | Initial prompt v{summarization_prompt.current().version}\"\n", + " )\n", + " print(f\"Prompt: {summarization_prompt.current().prompt}\")\n", + " print(f\"Model: {summarization_prompt.current().model}\")\n", + " print(\"-\" * 80)\n", + "\n", + " reset_best_trackers()\n", + " df = pd.read_csv(\"data/dataset.csv\")\n", + "\n", + " with trace(\"Self-evolving Optimization Workflow: model comparison\"):\n", + " for _, row in df.head(5).iterrows():\n", + " content = row.get(\"content\")\n", + " if pd.isna(content) or (isinstance(content, str) and not content.strip()):\n", + " continue\n", + "\n", + " section_number = str(row[\"section_number\"])\n", + " section = str(content)\n", + " current_version = summarization_prompt.current().version\n", + "\n", + " print(f\"[Section {section_number}] Using prompt v{current_version}\")\n", + "\n", + " summary_passed = False\n", + "\n", + " for attempt in range(1, MAX_OPTIMIZATION_RETRIES + 1):\n", + " print(f\"\\tAttempt {attempt}: evaluating summary...\")\n", + "\n", + " summary_result = await Runner.run(summarization_agent, section)\n", + " summary = summary_result.final_output\n", + "\n", + " grader_scores = await get_eval_grader_score(\n", + " eval_id=EVAL_ID, summary=summary, section=section\n", + " )\n", + " average_score = calculate_grader_score(grader_scores)\n", + " total_score = calculate_total_grader_score(grader_scores)\n", + " lenient_passed = is_lenient_pass(grader_scores, average_score)\n", + " print(\n", + " f\"\\tScores — avg={average_score:.3f}, total={total_score:.3f}, lenient_passed={lenient_passed}\"\n", + " )\n", + "\n", + " record_aggregate_prompt_score(\n", + " prompt_version=summarization_prompt.current().version,\n", + " prompt_text=summarization_prompt.current().prompt,\n", + " model_name=summarization_prompt.current().model,\n", + " average_score=average_score,\n", + " total_score=total_score,\n", + " )\n", + "\n", + " update_best_candidate(\n", + " average_score=average_score,\n", + " total_score=total_score,\n", + " prompt_text=summarization_prompt.current().prompt,\n", + " model_name=summarization_prompt.current().model,\n", + " summary_text=summary,\n", + " metadata={\n", + " \"section\": section_number,\n", + " \"average_score\": average_score,\n", + " \"grader_results\": grader_scores,\n", + " \"prompt_version\": summarization_prompt.current().version,\n", + " },\n", + " lenient_passed=lenient_passed,\n", + " prompt_version=summarization_prompt.current().version,\n", + " )\n", + "\n", + " if lenient_passed:\n", + " summary_passed = True\n", + " print(\n", + " f\"\\tPassed with prompt v{summarization_prompt.current().version} (model={summarization_prompt.current().model})\"\n", + " )\n", + " break\n", + "\n", + " print(\"\\tFailed eval. 
Improving prompt...\")\n", + " eval_feedback = collect_grader_feedback(grader_scores)\n", + "\n", + " new_agent = await compare_model_candidates(\n", + " summarization_prompt=summarization_prompt,\n", + " eval_feedback=eval_feedback,\n", + " section=section,\n", + " summary=summary,\n", + " # model_candidates could be given as an argument if you want to expand options.\n", + " )\n", + "\n", + " if new_agent is None:\n", + " print(\n", + " \"\\tNo passing model found. Optimization failed for this section.\"\n", + " )\n", + " summary_passed = False\n", + " else:\n", + " summarization_agent = new_agent\n", + " summary_passed = True\n", + " print(\n", + " f\"\\tPrompt improved → v{summarization_prompt.current().version} \"\n", + " f\"(model={summarization_prompt.current().model})\"\n", + " )\n", + " break\n", + "\n", + " if not summary_passed:\n", + " print(\n", + " \"\\tAll attempts failed; keeping latest prompt version \"\n", + " f\"v{summarization_prompt.current().version} (model={summarization_prompt.current().model}) for the next section.\"\n", + " )\n", + "\n", + " summarization_agent = apply_best_candidate_if_needed()\n", + "\n", + " print(\"\" + \"-\" * 80)\n", + " print(\"Completed optimization loop.\")\n", + " print(f\"Final prompt version: v{summarization_prompt.current().version}\")\n", + " print(f\"Final model: {summarization_prompt.current().model}\")\n", + " aggregate_best = select_best_aggregate_prompt()\n", + " if best_candidate[\"score\"] > float(\"-inf\"):\n", + " print(\n", + " f\"Best lenient prompt: v{best_candidate.get('version')} (avg={best_candidate['score']:.3f}, model={best_candidate.get('model', 'unknown')})\"\n", + " )\n", + " if aggregate_best:\n", + " per_section = (\n", + " aggregate_best.get(\"total_average\", 0.0) / aggregate_best.get(\"count\", 1)\n", + " if aggregate_best.get(\"count\")\n", + " else 0.0\n", + " )\n", + " print(\n", + " f\"Aggregate best prompt: v{aggregate_best.get('version')} \"\n", + " f\"(total={aggregate_best.get('total_score', 0.0):.3f}, avg/section={per_section:.3f}, model={aggregate_best.get('model', 'unknown')})\"\n", + " )\n", + " print(f\"Final prompt: {summarization_prompt.current().prompt}\")\n", + " print(f\"Final model: {summarization_prompt.current().model}\")\n", + " return summarization_agent\n", + "\n", + "summarization_agent = await self_evolving_loop_with_model_comparison(summarization_agent)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here we can see a very similar output with additional information on the model version scores:" + ] + }, + { + "cell_type": "raw", + "metadata": { + "vscode": { + "languageId": "raw" + } + }, + "source": [ + "Starting self-evolving loop | Initial prompt v0\n", + "Prompt:\n", + "\tYou are a summarization assistant.\n", + "Given a section of text, produce a concise, accurate summary.\n", + "\n", + "[....]\n", + "\n", + "[Section 3.2.P.2.2] Using prompt v2\n", + "\tAttempt 1: evaluating summary...\n", + "\tFailed eval. Improving prompt...\n", + "Candidate average — gpt-5: 0.3533 (passed=False)\n", + "Candidate average — gpt-5-mini: 0.4670 (passed=False)\n", + "No passing models. Best candidate (model=gpt-5-mini, avg=0.4670) did not pass. Prompt not updated.\n", + "\tNo passing model found. 
Optimization failed for this section.\n", + "\tAttempt 2: evaluating summary...\n", + "Exceeded retries, aborting\n", + "\tPassed with prompt v2\n", + "\n", + "--------------------------------------------------------------------------------\n", + "Completed optimization loop.\n", + "Final prompt version: v2\n", + "Final prompt:\n", + "**Improved Prompt:**\n", + "\n", + "You are a summarization assistant. \n", + "Given any section of text, generate a concise and accurate summary that includes all key concepts, components, and their main characteristics or interactions as described in the original section. Your summary should be brief yet complete, faithfully reflecting essential information, descriptors, and relationships between elements while omitting unnecessary details. Ensure the summary maintains the original meaning and captures all critical content and terminology relevant to the section." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### b. Prompt Optimization with Genetic-Pareto (GEPA)\n", + "\n", + "We've demonstrated that the self-evolving loop works and that a prompt can be improved autonomously using Evals. However, we relied on a relatively straightforward, static metaprompt to improve our system prompt. In this section, we explore a more dynamic and reflexive method by using Genetic-Pareto (GEPA) [[1]](#Citations) — a framework that samples agent trajectories, reflects on them in natural language, proposes prompt revisions, and evolves the system through iterative feedback loops. \n", + "\n", + "The GEPA method, described in the paper available [here](https://doi.org/10.48550/arXiv.2507.19457), offers a compelling blueprint for continuous, self-improving prompt optimization. The code below draws generously on the GEPA GitHub repository available [here](https://github.com/gepa-ai/gepa)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import gepa\n", + "from gepa import EvaluationBatch\n", + "\n", + "# Extract sections from dataset\n", + "def read_csv_content(file_path: str) -> list[dict]:\n", + " \"\"\"Read csv and return section to summarize.\"\"\"\n", + " df = pd.read_csv(file_path)\n", + " return [{'content': content} for content in df['content'].tolist()]\n", + "\n", + "# Split dataset into training and validation sets\n", + "trainset = read_csv_content(\"data/dataset.csv\")\n", + "val_cut = max(1, int(0.1 * len(trainset)))\n", + "valset = trainset[:val_cut] if len(trainset) > 1 else trainset" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We’ll reuse our graders and helper functions by adding a small adapter so that our setup works with GEPA. GEPA’s `GEPAAdapter` makes it easy to plug into our eval framework. We define three hooks:\n", + "- `evaluate`: runs the summarization and grades with graders defined in the previous section (i.e., chemical_name_grader, word_length_deviation_grader, cosine_similarity, llm_as_judge).\n", + "- `get_components_to_update`: gets the text fields GEPA should evolve (here, system_prompt).\n", + "- `make_reflective_dataset`: packages inputs, outputs, and feedback for reflection." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "class EvalsBackedSummarizationAdapter:\n", + " \"\"\"\n", + " Minimal adapter for GEPA:\n", + " - evaluate(...) 
-> EvaluationBatch (scores + outputs + feedback-rich trajectories)\n", + " - get_components_to_update(...) returns the prompt to update\n", + " - make_reflective_dataset(...) packages examples for reflection\n", + " \"\"\"\n", + " propose_new_texts = None # use GEPA's default reflection flow\n", + "\n", + " def __init__(self, client, eval_id: str, gen_model: str = \"gpt-5\", user_prefix: str | None = None):\n", + " self.client = client\n", + " self.eval_id = eval_id\n", + " self.gen_model = gen_model\n", + " self.user_prefix = user_prefix or \"Summarize:\\n\\n\"\n", + "\n", + " # Same summarization agent as in the previous section\n", + " def _summarize(self, system_prompt: str, section: str) -> str:\n", + " resp = self.client.chat.completions.create(\n", + " model=self.gen_model,\n", + " messages=[\n", + " {\"role\": \"system\", \"content\": system_prompt},\n", + " {\"role\": \"user\", \"content\": f\"{self.user_prefix}{section}\"},\n", + " ],\n", + " )\n", + " return resp.choices[0].message.content.strip()\n", + "\n", + " # Required by GEPA: run eval minibatch\n", + " def evaluate(self, inputs: list[dict], candidate: dict, capture_traces: bool = True) -> EvaluationBatch:\n", + " system_prompt = candidate[\"system_prompt\"]\n", + "\n", + " scores: list[float] = []\n", + " outputs: list[str] = []\n", + " trajectories: list[dict] = []\n", + "\n", + " for item in inputs:\n", + " section = item[\"content\"]\n", + "\n", + " # 1) Generate with the candidate prompt\n", + " summary = self._summarize(system_prompt, section)\n", + " outputs.append(summary)\n", + "\n", + " # 2) Grade using previous evals pipeline\n", + " run = run_eval(eval_id=self.eval_id, section=section, summary=summary)\n", + " out_items = poll_eval_run(eval_id=self.eval_id, run_id=run.id)\n", + " grader_scores = parse_eval_run_output(out_items)\n", + "\n", + " # 3) Score + actionable feedback\n", + " scalar = calculate_grader_score(grader_scores)\n", + " feedback = collect_grader_feedback(grader_scores) or \"All graders passed; keep precision and coverage.\"\n", + "\n", + " scores.append(float(scalar))\n", + " trajectories.append(\n", + " {\n", + " \"inputs\": {\"section\": section},\n", + " \"generated_output\": summary,\n", + " \"metrics\": {\n", + " \"combined\": float(scalar),\n", + " \"by_grader\": grader_scores, # keeping for analysis if needed\n", + " },\n", + " \"feedback\": feedback,\n", + " }\n", + " )\n", + "\n", + " return EvaluationBatch(scores=scores, outputs=outputs, trajectories=trajectories)\n", + "\n", + " # Required by GEPA: text field to evolve\n", + " def get_components_to_update(self, candidate: dict) -> list[str]:\n", + " return [\"system_prompt\"]\n", + "\n", + " # Required by GEPA: build the reflective dataset the reflection LM will read\n", + " def make_reflective_dataset(self, candidate: dict, eval_batch: EvaluationBatch, components_to_update: list[str]) -> dict:\n", + " examples = []\n", + " for traj in (eval_batch.trajectories or []):\n", + " examples.append(\n", + " {\n", + " \"Inputs\": {\"section\": traj[\"inputs\"][\"section\"]},\n", + " \"Generated Outputs\": traj[\"generated_output\"],\n", + " \"Feedback\": traj[\"feedback\"],\n", + " }\n", + " )\n", + " return {\"system_prompt\": examples}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now that the adapter is ready, we can run GEPA using the same starting prompt (`\"You are a summarization assistant. 
Given a section of text, produce a summary.\"`) and model (here, `gpt-5`) as in the earlier self-evolving loop for comparison. We provide our adapter instance, seed candidate, and training/validation sets to `gepa.optimize(...)`. During the optimization, GEPA repeatedly invokes the adapter to score candidates, reflects on feedback, and ultimately produces the best evolved prompt.\n", + "\n", + "_Note: GEPA might take ~10-15 minutes to complete._" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "seed_candidate = {\"system_prompt\": \"You are a summarization assistant. Given a section of text, produce a summary.\"}\n", + "\n", + "adapter = EvalsBackedSummarizationAdapter(\n", + " client=client,\n", + " eval_id=EVAL_ID,\n", + " gen_model=summarization_prompt.current().model, \n", + ")\n", + "\n", + "# Keeping max_metric_calls small for the cookbook. \n", + "# In practice, use a larger value to allow more optimization iterations.\n", + "result = gepa.optimize(\n", + " seed_candidate=seed_candidate,\n", + " trainset=trainset,\n", + " valset=valset,\n", + " adapter=adapter,\n", + " reflection_lm=\"gpt-5\",\n", + " max_metric_calls=10,\n", + " track_best_outputs=True,\n", + " display_progress_bar=True\n", + ")\n", + "\n", + "best_prompt = result.best_candidate[\"system_prompt\"]\n", + "print(\"\\n=== Best evolved instruction ===\\n\")\n", + "print(best_prompt)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here is an example (abridged) output for the code above:" + ] + }, + { + "cell_type": "raw", + "metadata": { + "vscode": { + "languageId": "raw" + } + }, + "source": [ + "Iteration 0: Base program full valset score: 0.2183466466681351\n", + "Iteration 1: Selected program 0 score: 0.2183466466681351\n", + "Iteration 1: Proposed new text for system_prompt: \n", + "\n", + "[.......]\n", + "\n", + "Iteration 3: New subsample score 0.6592202195294341 is better than old score 0.6565039300893376. Continue to full eval and add to candidate pool.\n", + "GEPA Optimization: 90%|█████████ | 18/20 [39:21<04:22, 131.19s/rollouts]\n", + "Iteration 3: Full valset score for new program: 0.2225472423976205\n", + "Iteration 3: Full train_val score for new program: 0.2225472423976205\n", + "Iteration 3: Individual valset scores for new program: [0.22866548337721018, 0.21864704884895614, 0.2203291949666952]\n", + "Iteration 3: New valset pareto front scores: [0.23142100182952327, 0.2389098334382265, 0.23513790628541456]\n", + "Iteration 3: Full valset pareto front score: 0.2351562471843881\n", + "Iteration 3: Updated valset pareto front programs: [{1}, {1}, {1}]\n", + "Iteration 3: Best valset aggregate score so far: 0.2351562471843881\n", + "Iteration 3: Best program as per aggregate score on train_val: 1\n", + "Iteration 3: Best program as per aggregate score on valset: 1\n", + "Iteration 3: Best score on valset: 0.2351562471843881\n", + "Iteration 3: Best score on train_val: 0.2351562471843881\n", + "Iteration 3: Linear pareto front program index: 1\n", + "Iteration 3: New program candidate index: 2\n", + "\n", + "=== Best evolved instruction ===\n", + "\n", + "You are a domain-aware summarization assistant for technical pharmaceutical texts. Given a “section” of text, produce a concise summary that preserves key technical facts and exact nomenclature.\n", + "\n", + "Requirements:\n", + "- Length and format:\n", + " - Write 1–3 sentences totaling about 45–70 words (never exceed 90 words). 
Default to ~60 words.\n", + " - Use a single paragraph (no bullet points, headings, or heavy formatting).\n", + "- Preserve exact technical names and notation:\n", + " - Include every chemical name that appears in the section at least once, with exact spelling, capitalization, isotopic labels, brackets, hyphens, salts, and buffer names (e.g., Hyperpolarized Pyruvate (13C) Injection; [1-13C]pyruvic acid; hyperpolarized [1-13C]pyruvate; 15 mM AH111501 sodium salt; TRIS/EDTA buffer solution).\n", + " - Keep study identifiers, section numbers, regulatory citations, and codes verbatim when mentioned (e.g., GE-101-001, GE-101-003, USP <797>, 3.2.P.7, company codes, CAS numbers).\n", + "...\n", + "Self-check before finalizing:\n", + "- Have you included every chemical name exactly as written?\n", + "- Is the summary within 45–70 words (≤90 max) and a single paragraph?\n", + "- Are key process/regulatory/test details and critical numbers preserved without unnecessary verbosity?\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this cookbook, we explored three distinct approaches to prompt optimization:\n", + "\n", + "- **OpenAI Platform Optimizer:** Using the _Optimize_ button with a dataset containing manually entered human feedback (thumbs up/down and textual comments), we quickly produced a strong prompt with minimal configuration. This method excels at rapid iteration, but does not provide the automation needed for production environments.\n", + "\n", + "- **Optimization using a static metaprompt:** Our loop, incorporating four different graders, enabled automated exploration and iterative self-improvement without manual intervention. However, its exploration space was limited by a single static meta-prompt, and evaluation was performed section by section. Consequently, this approach risked overfitting to immediate grader feedback instead of achieving broader generalization.\n", + "\n", + "- **GEPA optimization:** GEPA offered a more structured search process: reflective updates were informed by both quantitative scores and textual feedback, and candidates were trained on one dataset and validated on another. This method produced a more robust, generalized prompt and provided clearer empirical evidence of its performance.\n", + "\n", + "_Note: Examples of prompts generated by each method are available in the Appendix._ \n", + "\n", + "Depending on your use case, you may prioritize speed (OpenAI optimizer), lightweight automation (static metaprompt), or systematic generalization (GEPA). In practice, combining these methods by starting with rapid iteration and progressing toward reflective optimization can deliver both agility and performance.\n", + "\n", + "Happy coding!" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Contributors\n", + "\n", + "This cookbook is based on a collaboration between [Bain](https://www.bain.com) and [OpenAI](https://openai.com). 
\n", + "\n", + "[Calvin Maguranis](https://www.linkedin.com/in/calvin-maguranis-b9956045/) \n", + "[Fanny Perraudeau](https://www.linkedin.com/in/fanny-sabran-perraudeau-494b7573/) \n", + "[Giorgio Saladino](https://www.linkedin.com/in/giorgio-saladino-202/) \n", + "[Shikhar Kwatra](https://www.linkedin.com/in/shikharkwatra/) \n", + "[Valentina Frenkel](https://www.linkedin.com/in/valentina-frenkel/) " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Citations\n", + "\n", + "[1] _GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning_ by Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, Omar Khattab - https://arxiv.org/abs/2507.19457" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Appendix\n", + "\n", + "### Examples of output prompts:\n", + "\n", + "- **Initial prompt:** \n", + "```pgsql \n", + "You are a summarization assistant. Given a section of text, produce a summary.\n", + "```\n", + "\n", + "- **OpenAI Platform Optimizer:** \n", + "```pgsql \n", + "You are a summarization assistant.\n", + "Task: Summarize the provided text concisely and accurately.\n", + "Output requirements:\n", + "- Output only the summary. Do not add titles, labels (e.g.,\n", + "\"Summary:\"), prefaces, or commentary.\n", + "- Preserve the document's structure. If multiple sections/subsections appear, summarize each one.\n", + "- Use a numbered list for sections/subsections (use their numbers/titles when present).\n", + "- Under each, use short dash bullets for key points.\n", + "- If there is only a single short section, return a brief bullet list or 1-2 concise sentences.\n", + "- Split any inline lists into separate bullets.\n", + "- Use plain, simple language. Keep bullets tight (ideally one line each). Remove redundancy.\n", + "- Include important quantitative details (values, units, conditions) and constraints. Do not invent information.\n", + "- Keep formatting simple: plain text, \"1.\" numbering and \"-\" bullets only. No tables or special markup.\n", + "- Retain exact technical terms/notation from the source (e.g., chemical names, isotopic labels).\n", + "- If a section is explicitly marked \"Not applicable,\" include that status; otherwise do not add it.\n", + "```\n", + "\n", + "- **Static metaprompt:** \n", + "```pgsql \n", + "You are a technical summarization assistant for scientific and regulatory documentation. Your task is to generate a concise, comprehensive, and fully detailed summary of any scientific, technical, or regulatory text provided. Strictly adhere to the following instructions:\n", + "\n", + "---\n", + "\n", + "**1. Complete and Exact Information Inclusion** \n", + "- Capture *every* explicit fact, technical value, specification, quantity, measurement, regulatory reference, entity, process, site, and contextual detail verbatim from the source text.\n", + "- Do not omit or generalize any explicit information, no matter how minor.\n", + "\n", + "**2. 
Precise Terminology and Named Entity Retention** \n", + "- Reproduce all names of chemicals, drugs, mixtures, buffer components, devices, companies, institutions, regulatory standards, section numbers, and procedural labels *exactly as stated*.\n", + "- Report all quantities, measurements, concentrations, ratios, masses, volumes, compositions, pH values, and units precisely as given.\n", + "- Do not paraphrase, rename, substitute, or simplify any term or value.\n", + "\n", + "**3. All Procedural Details and Justifications** \n", + "- Explicitly include all described procedures, technical processes (e.g., terminal sterilization, aseptic processing), operational constraints, process justifications, compliance requirements, and standards references.\n", + "- Clearly state all reasons provided for choosing or omitting particular methods or processes.\n", + "\n", + "**4. Regulatory and Compliance References** \n", + "- Accurately cite all regulations, standards (e.g., USP <797>), compliance statements, section numbers, and cross-references as in the original.\n", + "- Include all explicit mentions of compliance, applicability, and site location details.\n", + "\n", + "**5. Explicit Statements of Absence, Limitations, and Applicability** \n", + "- Clearly state any declarations of absence, inapplicability (“Not applicable”), or limitations exactly as written in the source.\n", + "\n", + "**6. Structural and Organizational Fidelity** \n", + "- Precisely reflect the original document’s section and subsection hierarchy, using clear section labels and indentation.\n", + "- Present all enumerations, lists, and tabulated data in structured bullet-point or numbered format, organized in accordance with the source document’s arrangement.\n", + "\n", + "**7. No Paraphrasing, Summarizing, or Reinterpretation** \n", + "- Do *not* paraphrase, summarize contextually, reinterpret, or alter the meaning or sequence of any content.\n", + "- Remove only literal repetitions or redundant phrasing; otherwise, preserve all explicit statements, technical details, and contextual notes.\n", + "\n", + "---\n", + "\n", + "**Summary Output Objective:** \n", + "Produce a summary that delivers the full technical, factual, and regulatory content and structure of the original text, reformatted by eliminating only redundant language. The summary must enable audit, regulatory review, or peer reference without loss of any explicit information or terminology from the source.\n", + "\n", + "---\n", + "\n", + "*Apply these instructions rigorously to every provided document section to ensure scientific and regulatory accuracy and completeness.*\n", + "```\n", + "\n", + "- **GEPA optimizer**: \n", + "```pgsql \n", + "You are a domain-aware summarization assistant for technical pharmaceutical texts. Given a “section” of text, produce a concise, single-paragraph summary that preserves key technical facts and exact nomenclature.\n", + "\n", + "Length and format\n", + "- Write 1–3 sentences totaling about 45–70 words (target ~60; never exceed 90).\n", + "- Use one paragraph; no bullets, headings, tables, or heavy formatting.\n", + "\n", + "Exact names and notation\n", + "- Include every chemical name that appears in the section at least once, using the exact original spelling, capitalization, punctuation, isotopic labels, brackets, hyphens, salts, buffer names, and parenthetical qualifiers. 
Treat distinct case/format variants as distinct names (e.g., [1-13C]pyruvic acid and [1-13C]Pyruvic acid are separate and each must appear once).\n", + "- Examples you must preserve verbatim when present: Hyperpolarized Pyruvate (13C) Injection; non-polarized Pyruvate Injection; Pyruvate (13C) Injection; hyperpolarized [1-13C]pyruvate; Mixture of [1-13C]pyruvic acid and 15 mM AH111501 sodium salt; TRIS/EDTA buffer solution; TRIS; NaOH; Na2EDTA; [1-13C]pyruvic acid; AH111501 sodium salt.\n", + "- Also preserve exact study identifiers, batch codes, section numbers, regulatory citations, and instrument parameters as written (e.g., GE-101-001, GE-101-003, USP <797>, 3.2.P.5.2.5, FFF106/140-806, FFF106/142-806, 3T MRI, 5 degree RF pulse, TR=3s, 90 degree pulse, 64 averages, TR=10s, 10 μl Gd/ml solution).\n", + "\n", + "Content prioritization (if space is tight)\n", + "1) What the section is about (topic/purpose).\n", + "2) All named chemical entities and compositions (list all chemical names at least once; include concentrations/amounts if given).\n", + "3) Critical process/handling facts (e.g., aseptic processing vs terminal sterilization; ISO classifications; filtration specs; compounding/filling steps; temperatures/times/volumes; storage/administration limits).\n", + "4) Container/packaging specifics (e.g., cryovials, “sterile fluid path”).\n", + "5) Microbiological/testing/regulatory details (e.g., sterility/pyrogenicity testing timing; USP <797>; state board compliance; site/manufacturer if stated).\n", + "6) Overages/single-dose formulas and key quantities.\n", + "\n", + "Numerical fidelity\n", + "- Preserve all critical numbers and units exactly (e.g., 1.44 g, 27.7 mg, 15 mM, 18 mL, 1.47 g, two 0.2 μm filters, ISO 7, ISO 5, 38 mL).\n", + "- Include testing/analysis parameters when present (e.g., polarization/relaxation time (T1); number of spectra; pulse angles; TR values; MRI location relative to clean room).\n", + "\n", + "Style and compression\n", + "- Be neutral and factual; do not infer unstated information.\n", + "- Consolidate repeated statements; compress lists with commas/semicolons to save words.\n", + "- Mention tables/figures only to convey key data; do not reproduce them.\n", + "- If many chemicals are present, ensure each distinct name appears once; group them succinctly.\n", + "- Avoid symbols or special formatting not in the source text.\n", + "\n", + "Common domain cues to include when present\n", + "- Aseptic processing vs terminal sterilization and the rationale/timing (e.g., “tested for sterility and pyrogenicity subsequent to patient administration”).\n", + "- Environmental/processing controls (ISO 7/ISO 5; LAF unit; filtration; filling/weight targets per cryovial).\n", + "- Site/regulatory context (e.g., USP <797>; California State Board of Pharmacy; University of California, San Francisco Department of Clinical Pharmacy).\n", + "- Study/kit equivalence statements (e.g., equivalence to GE-101-001/GE-101-003 formulations).\n", + "- QC/measurement methods (e.g., capacitive threshold at Administration syringe nominal 38 mL).\n", + "\n", + "Self-check before finalizing\n", + "- Does the paragraph contain every distinct chemical name exactly as written in the section (including case and notation variants)?\n", + "- Is the summary 45–70 words (≤90), in a single paragraph?\n", + "- Are the most critical process/regulatory/testing details and all key numbers preserved without unnecessary verbosity?`\n", + "```" + ] + } + ], + "metadata": { + "kernelspec": { + 
"display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/examples/partners/self_evolving_agents/data/c13_pyruvate_sample_CMC_from_UCSF.pdf b/examples/partners/self_evolving_agents/data/c13_pyruvate_sample_CMC_from_UCSF.pdf new file mode 100644 index 0000000000..867a530657 Binary files /dev/null and b/examples/partners/self_evolving_agents/data/c13_pyruvate_sample_CMC_from_UCSF.pdf differ diff --git a/examples/partners/self_evolving_agents/data/chemical_names.json b/examples/partners/self_evolving_agents/data/chemical_names.json new file mode 100644 index 0000000000..61c3679ce3 --- /dev/null +++ b/examples/partners/self_evolving_agents/data/chemical_names.json @@ -0,0 +1,77 @@ +[ + "[1-¹³C]Pyruvic acid", + "[1-¹³C]Pyruvate", + "¹²C Pyruvic acid", + "Sodium [1-¹³C]pyruvate", + "Sodium pyruvate (¹²C)", + "AH111501 (Trityl radical)", + "Tris{8-carboxyl-2,2,6,6-tetra[2-(1-methoxyethyl)]-benzo(1,2-d:4,5-d’)bis(1,3)dithiole-4-yl}methyl acid", + "AH111501 sodium salt", + "Methyl, tris[8-carboxy-2,2,6,6-tetrakis(2-methoxyethyl)benzo[1,2-d:4,5-d’]bis[1,3]dithiol-4-yl]-, trisodium salt", + "AH111501 trisodium salt", + "AH111576", + "2,2′,2″,2‴-(4,8-Dibromobenzo[1,2-d:4,5-d′]bis([1,3]dithiole)-2,2,6,6-tetrayl)tetraethanol", + "AH111586", + "4,8-Dibromo-2,2,6,6-tetrakis(2-methoxyethyl)benzo[1,2-d:4,5-d′]bis([1,3]dithiole)", + "AH111709", + "AH111743", + "AH112615", + "4,4-Bis-hydroxymethyl-2-methyl-oxazolidine-2-carboxylic acid", + "AH112623", + "Parapyruvate", + "2-Hydroxy-2-methyl-4-oxo-pentanedioic acid", + "AH113127", + "(4-Hydroxymethyl-oxazolidin-4-yl)-methanol", + "AH113462/E", + "Enol lactone", + "AH113462/K", + "Keto lactone", + "Acetyl bromide", + "Methanol", + "Dimethyl sulfoxide", + "DMSO", + "Tetrahydrofuran", + "THF", + "Acetonitrile", + "ACN", + "Diethyl ether", + "Et₂O", + "N,N-Dimethylacetamide", + "DMA", + "1,3-Dimethyl-2-imidazolidinone", + "DMI", + "Hydrochloric acid", + "HCl", + "Sodium hydroxide", + "NaOH", + "Disodium ethylenediaminetetraacetate", + "Na₂EDTA", + "Ethylenediaminetetraacetic acid", + "EDTA", + "Tris(hydroxymethyl)aminomethane", + "TRIS", + "Trometamol", + "Trifluoroacetic acid", + "TFA", + "Toluene", + "Heptane", + "Ethyl acetate", + "Ethanol", + "Water", + "H₂O", + "Sodium chloride", + "NaCl", + "Cuprous [1-¹³C]cyanide", + "Cu¹³CN", + "Gadolinium", + "Gd", + "Tin", + "Sn", + "Phosphorus", + "P", + "Carbon dioxide", + "CO₂", + "Sodium [1-13C]pyruvate", + "[1-13C]Pyruvic acid", + "1-13C pyruvate" +] diff --git a/examples/partners/self_evolving_agents/data/dataset.csv b/examples/partners/self_evolving_agents/data/dataset.csv new file mode 100644 index 0000000000..20d667ed9c --- /dev/null +++ b/examples/partners/self_evolving_agents/data/dataset.csv @@ -0,0 +1,31 @@ +section_number,toc_index,title,content +7.1,17,Drug Substance,"3.2.S.1 General Information ([1-13C]pyruvic acid) The active ingredient in Hyperpolarized Pyruvate (13C) Injection is hyperpolarized [1-13C]pyruvate. The drug substance is defined as [13C]pyruvic acid, which is neutralized to [1-13C]pyruvate during the compounding process. 
In several pre-clinical and clinical studies and during evaluation of stability, pyruvic acid has been used instead of [1-13C]pyruvic acid (see Sections 3.2.P.2.2.1 Formulation Development for Hyperpolarized Pyruvate (13C) Injection and Section 8.1 Introduction for Item 8 Pharmacology and Toxicology Info). In the Section 3.2.S Drug Substance, data are presented for both pyruvic acid and for [1-13C]pyruvic acid. For simplicity, the terminology used in headings and captions is [1-13C]pyruvic acid. Batches containing pyruvic acid are specified by footnotes. 3.2.S.1.1 Nomenclature ([1-13C]pyruvic acid) The drug substance used for compounding of Hyperpolarized Pyruvate (13C) Injection is [1-13C]pyruvic acid. Company code: W6578 Chemical name: [1-13C]pyruvic acid CAS registry number: 127-17-3 3.2.S.1.2 Structure ([1-13C]pyruvic acid) Figure 1 Structure of [1-13C]pyruvic acid Molecular formula: C H O 3 4 3 Molecular weight: 89.06 3.2.S.1.3 General Properties ([1-13C]pyruvic acid) Appearance: Colorless to yellow, clear, viscous liquid pKa:Ka:aranWater solubility: Complete The structure of [1-13C]pyruvic acid has been confirmed by spectroscopic analysis (see Section 3.2.S.3.1 Elucidation of Structure and other Characteristics)." +7.2,28,Drug Product Part 1,"3.2.P DRUG PRODUCT (STERILE FLUID PATH COMPONENTS) Hyperpolarized Pyruvate (13C) Injection (drug product) is a sterile solution for intravenous injection. The compounding of Hyperpolarized Pyruvate (13C) Injection is performed by an automated compounding device known as SpinLab. For each patient dose, SpinLab utilizes a single sterile fluid path which contains the following three drug product components: • Mixture of [1-13C]pyruvic acid and 15 mM AH111501 sodium salt • TRIS/EDTA buffer solution • Sterile Water for Injection (WFI) The following 3.2.P sections describe the individual drug product components. For aspects related to the compounding of the drug product, Hyperpolarized Pyruvate (13C) Injection, reference is made to 3.2.P for Hyperpolarized Pyruvate (13C) Injection. Commercially available USP quality Sterile Water for Injection (Hospira Inc., USA) is provided by the clinical site. Aspects of this drug product component will therefore not be addressed." +3.2.P.2.1,29,Components of the Drug Product (Drug Product Kit Components),"3.2.P.2.1.1 Mixture of [1-13C]pyruvic acid and 15 mM AH111501 sodium salt (a) Drug substance The drug substance, [1−13C]pyruvic acid, is a colorless to yellow, clear, viscous liquid. [1−13C]Pyruvic acid is described in Section 3.2.S Drug Substance. Upon neutralization in the TRIS/EDTA buffer solution, the [1−13C]pyruvic acid is converted to [1−13C]pyruvate. (b) Excipients AH111501 sodium salt is a stable trityl radical, and is added to [1−13C]pyruvic acid to enable hyperpolarization. AH111501 sodium salt is a green to black, fine to granular powder. AH111501 sodium salt is further described in Section 3.2.A.3 Novel Excipients. 3.2.P.2.1.2 TRIS/EDTA buffer solution The TRIS/EDTA buffer solution is an aqueous solution containing 333 mM TRIS, 600 mM NaOH and 333 mg/l Na EDTA. 2 TRIS is used as buffer to stabilize the pH of the Hyperpolarized Pyruvate (13C) Injection at a physiologically acceptable level. 
NaOH is added to neutralize the [1−13C]pyruvic acid in Mixture of [1−13C]pyruvic acid and 15" +3.2.P.2.2,30,Drug Product (Drug Product Kit Components),"3.2.P.2.2.1 Formulation Development (a) Mixture of [1-13C]pyruvic acid and 15 mM AH111501 sodium salt Mixture of [1-13C]pyruvic acid and 15 mM AH111501 sodium salt is dissolved in WFI and neutralized/buffered in TRIS/EDTA buffer solution to form a solution with a physiologically acceptable pH. The concentration of AH111501 sodium salt of 15 mM has been chosen for optimization of 13C nuclear polarization in Hyperpolarized Pyruvate (13C) Injection. For clinical trials GE-101-001 and GE-101-003, pyruvic acid was used instead of [1-13C]pyruvic acid. For these trials the Pyruvate Injection was not compounded hence; in order to mimic the maximum content of AH111501 in Hyperpolarized Pyruvate (13C) Injection, the kit component used during the clinical trials GE-101-001 and GE 101-003 was Mixture of pyruvic acid and 0.2 mM AH111501 sodium salt. In addition, some pre-clinical studies were performed using pyruvic acid instead of [1- 13C]pyruvic acid. See Section 3.2.P.2.2.1 Formulation development for Hyperpolarized Pyruvate (13C) Injection and section 8.1 Introduction for Item 8 Pharmacology and Toxicology Info) for further details. The amount of [1-13C]pyruvic acid and AH111501 sodium salt mixture per cryovial is 1.47 g, which upon dissolution in the total volume of WFI and TRIS/EDTA buffer solution, gives 250 mM [1- 13C]pyruvate in the final Hyperpolarized Pyruvate (13C) Injection. (b) TRIS/EDTA buffer solution The function of TRIS/EDTA buffer solution is to neutralize the [1-13C]pyruvic acid to [1- 13C]pyruvate and to assure a physiologically acceptable pH of the drug product Hyperpolarized Pyruvate (13C) Injection. TRIS/EDTA buffer solution has not been used during pre-clinical studies or during clinical trials GE-101-001 and GE-101-003. For these studies, the Mixture of [1-13C]pyruvic acid and AH111501 sodium salt was dissolved in a single, manual step in TRIS/EDTA dissolution medium. For compounding of Hyperpolarized Pyruvate (13C) Injection, the Mixture of [1- 13C]pyruvic acid and 15 mM AH111501 sodium salt will first be dissolved in WFI and then neutralized and buffered in TRIS/EDTA buffer solution. See Section 3.2.P.2.2.1 Formulation Development for Hyperpolarized Pyruvate (13C) Injection for details. The amount of [1-13C]pyruvic acid to be dissolved is 1.67 g (equivalent to 18.75 mmol). This amount of acid is neutralized and buffered with 22.5 ml of TRIS/EDTA buffer solution (equivalent to 8.33 mmol of TRIS and 15.00 mmol of NaOH) to a target pH of 7.6 (at 37°C) in the Hyperpolarized Pyruvate (13C) Injection. Sample not for submission mM AH111501 sodium salt to [1−13C]pyruvate in the Hyperpolarized Pyruvate (13C) Injection. Na EDTA has been included in the formulation as a chelating agent to capture traces of 2 paramagnetic metal ions that might be present. 3.2.P.2.2 Drug Product (Drug Product Kit Components) 3.2.P.2.2.1 Formulation Development (a) Mixture of [1-13C]pyruvic acid and 15 mM AH111501 sodium salt Mixture of [1-13C]pyruvic acid and 15 mM AH111501 sodium salt is dissolved in WFI and neutralized/buffered in TRIS/EDTA buffer solution to form a solution with a physiologically acceptable pH. The concentration of AH111501 sodium salt of 15 mM has been chosen for optimization of 13C nuclear polarization in Hyperpolarized Pyruvate (13C) Injection. 
For clinical trials GE-101-001 and GE-101-003, pyruvic acid was used instead of [1-13C]pyruvic acid. For these trials the Pyruvate Injection was not compounded hence; in order to mimic the maximum content of AH111501 in Hyperpolarized Pyruvate (13C) Injection, the kit component used during the clinical trials GE-101-001 and GE 101-003 was Mixture of pyruvic acid and 0.2 mM AH111501 sodium salt. In addition, some pre-clinical studies were performed using pyruvic acid instead of [1- 13C]pyruvic acid. See Section 3.2.P.2.2.1 Formulation development for Hyperpolarized Pyruvate (13C) Injection and section 8.1 Introduction for Item 8 Pharmacology and Toxicology Info) for further details. The amount of [1-13C]pyruvic acid and AH111501 sodium salt mixture per cryovial is 1.47 g, which upon dissolution in the total volume of WFI and TRIS/EDTA buffer solution, gives 250 mM [1- 13C]pyruvate in the final Hyperpolarized Pyruvate (13C) Injection. (b) TRIS/EDTA buffer solution The function of TRIS/EDTA buffer solution is to neutralize the [1-13C]pyruvic acid to [1- 13C]pyruvate and to assure a physiologically acceptable pH of the drug product Hyperpolarized Pyruvate (13C) Injection. TRIS/EDTA buffer solution has not been used during pre-clinical studies or during clinical trials GE-101-001 and GE-101-003. For these studies, the Mixture of [1-13C]pyruvic acid and AH111501 sodium salt was dissolved in a single, manual step in TRIS/EDTA dissolution medium. For compounding of Hyperpolarized Pyruvate (13C) Injection, the Mixture of [1- 13C]pyruvic acid and 15 mM AH111501 sodium salt will first be dissolved in WFI and then neutralized and buffered in TRIS/EDTA buffer solution. See Section 3.2.P.2.2.1 Formulation Development for Hyperpolarized Pyruvate (13C) Injection for details. The amount of [1-13C]pyruvic acid to be dissolved is 1.67 g (equivalent to 18.75 mmol). This amount of acid is neutralized and buffered with 22.5 ml of TRIS/EDTA buffer solution (equivalent to 8.33 mmol of TRIS and 15.00 mmol of NaOH) to a target pH of 7.6 (at 37°C) in the Hyperpolarized Pyruvate (13C) Injection. Sample not for submission 3.2.P.2.2.2 Overages (a) Mixture of [1-13C]pyruvic acid and 15 mM AH111501 sodium salt There are no overages included in the Mixture of [1-13C]pyruvic acid and 15 mM AH111501 sodium salt. (b) TRIS/EDTA buffer solution There are no overages included in the TRIS/EDTA buffer solution. 3.2.P.2.3 Manufacturing Process Development (Drug Product Kit Components) 3.2.P.2.3.1 Mixture of [1-13C]pyruvic acid and 15 mM AH111501 sodium salt Terminal sterilization of the Mixture of [1-13C]pyruvic acid and 15 mM AH111501 sodium salt is not possible due to degradation of [1-13C]pyruvic acid. The current process is therefore performed by aseptic processing. 3.2.P.2.3.2 TRIS/EDTA buffer solution Terminal sterilization of TRIS/EDTA buffer solution in various container closure systems has been tested, but generation of particles occurred during sterilization. This is probably caused by the high pH of the TRIS/EDTA buffer solution. The current process is therefore performed by aseptic processing. 3.2.P.2.4 Container Closure System (Sterile Fluid Path Components) 3.2.P.2.4.1 Mixture of [1-13C]pyruvic acid and 15 mM AH111501 sodium salt The compounding process for Hyperpolarized Pyruvate (13C) Injection requires a custom made container closure system, the sterile fluid path, for the Mixture of [1-13C]pyruvic acid and 15 mM AH111501 sodium salt. 
This container closure system is described in more detail in Section 3.2.P.7 Container Closure System."
+3.2.P.2.3,31,Manufacturing Process Development (Drug Product Kit Components),"3.2.P.2.3.1 Mixture of [1-13C]pyruvic acid and 15 mM AH111501 sodium salt Terminal sterilization of the Mixture of [1-13C]pyruvic acid and 15 mM AH111501 sodium salt is not possible due to degradation of [1-13C]pyruvic acid. The current process is therefore performed by aseptic processing. 3.2.P.2.3.2 TRIS/EDTA buffer solution Terminal sterilization of TRIS/EDTA buffer solution in various container closure systems has been tested, but generation of particles occurred during sterilization. This is probably caused by the high pH of the TRIS/EDTA buffer solution. The current process is therefore performed by aseptic processing."
+3.2.P.2.4,31,Container Closure System (Sterile Fluid Path Components),"3.2.P.2.4.1 Mixture of [1-13C]pyruvic acid and 15 mM AH111501 sodium salt The compounding process for Hyperpolarized Pyruvate (13C) Injection requires a custom made container closure system, the sterile fluid path, for the Mixture of [1-13C]pyruvic acid and 15 mM AH111501 sodium salt. This container closure system is described in more detail in Section 3.2.P.7 Container Closure System. 3.2.P.2.4.2 TRIS/EDTA buffer solution The compounding process for Hyperpolarized Pyruvate (13C) Injection requires a custom made container closure system, the sterile fluid path, for the TRIS/EDTA buffer solution. This container closure system is described in more detail in Section 3.2.P.7 Container Closure System."
+3.2.P.2.5,31,Microbiological Attributes (Sterile Fluid Path Components),"3.2.P.2.5.1 Mixture of [1-13C]pyruvic acid and 15 mM AH111501 sodium salt Not applicable. The mixture of [1-13C]pyruvic acid and 15 mM AH111501 is compounded immediately prior to patient administration. A sample of the final Hyperpolarized Pyruvate (13C) Injection is tested for sterility and pyrogenicity subsequent to patient administration. 3.2.P.2.5.2 TRIS/EDTA buffer solution Not applicable. The TRIS/EDTA buffer solution is compounded immediately prior to patient administration. A sample of the final Hyperpolarized Pyruvate (13C) Injection is tested for sterility and pyrogenicity subsequent to patient administration."
+3.2.P.3.1,32,Manufacturer(s) (Sterile Fluid Path Components),"3.2.P.3.1.1 Mixture of [1-13C]pyruvic acid and 15 mM AH111501 sodium salt The compounding of Mixture of [1-13C]pyruvic acid and 15 mM AH111501 sodium salt for clinical use is conducted in compliance with USP <797> and the regulations promulgated by the California State Board of Pharmacy at the licensed pharmacy on the following academic campus: University of California, San Francisco Department of Clinical Pharmacy San Francisco, California 94118 3.2.P.3.1.2 TRIS/EDTA buffer solution The compounding of TRIS/EDTA buffer solution for clinical use is conducted in compliance with USP <797> and the regulations promulgated by the California State Board of Pharmacy at the licensed pharmacy on the following academic campus: University of California, San Francisco Department of Clinical Pharmacy San Francisco, California 94118"
+3.2.P.3.2,32,Single Dose Compounding Formula (Sterile Fluid Path Components),"3.2.P.3.2.1 Mixture of [1-13C]pyruvic acid and 15 mM AH111501 sodium salt The Mixture of [1-13C]pyruvic acid and 15 mM AH111501 sodium salt is compounded by aseptic processing. The compounding formula for a single dose prepared immediately prior to patient administration is given in Table 1. Table 1 Compounding formula for Mixture of [1-13C]pyruvic acid and 15 mM AH111501 sodium salt Ingredient Quantity per container [1-13C]pyruvic acid 1.44 g AH111501 sodium salt 27.7 mg 3.2.P.3.2.2 TRIS/EDTA buffer solution The product comprises an aqueous solution of TRIS, NaOH, and Na2EDTA. The product is compounded by aseptic processing. The compounding formula for a single dose prepared immediately prior to patient administration is given in Table 2. Table 2 Compounding formula for TRIS/EDTA buffer solution (1) Quantity of sterile TRIS/EDTA buffer solution aseptically instilled into the receiving vessel of the sterile fluid path is 18 mL."
+3.2.P.3.3,33,Description of Manufacturing Process and Process Controls (Drug Product Kit Components),"3.2.P.3.3.1 Mixture of [1-13C]pyruvic acid and 15 mM AH111501 sodium salt The preparation of Mixture of [1-13C]pyruvic acid and 15 mM AH111501 sodium salt is performed in an ISO 7 area. [1-13C]Pyruvic acid and AH111501 sodium salt are weighed out and added to the preparation vessel in successive order. The solution is allowed to stir to ensure a homogeneous solution prior to filtration. As the solution is transferred from the preparation vessel in an ISO 7 area to the filling vessel in an ISO 5 area, it is filtered through two 0.2 μm sterilizing filters. Filling is performed in an ISO 5 area (LAF unit). The filling weight is calibrated to target; each cryovial shall contain 1.47 g of Mixture of [1-13C]pyruvic acid and 15 mM AH111501 sodium salt; therefore, the filling weight depends on the assay of the specific batch of [1-13C]pyruvic acid used. Each container is weighed during the filling operation. The compounding process is illustrated in Figure 1. Figure 1 Flow chart illustrating the manufacturing process of Mixture of [1-13C]pyruvic acid and 15 mM AH111501 sodium salt 3.2.P.3.3.2 TRIS/EDTA buffer solution The preparation of TRIS/EDTA buffer solution is performed in an ISO 7 area. Approximately 90% of the total amount of WFI is added to the preparation vessel. TRIS, Na2EDTA and NaOH are added successively, allowing each one to dissolve completely by sufficient stirring between each addition. The bulk solution is adjusted to its final weight by addition of WFI and allowed to stir to ensure a homogeneous solution prior to filtration. As the solution is transferred from the preparation vessel in an ISO 7 area to the filling vessel in an ISO 5 area, it is filtered through two 0.2 μm sterilizing filters. Aseptic filling of the TRIS/EDTA buffer solution into the receiving vessel of the sterile fluid path is performed in an ISO 5 area (LAF unit). Weight controls are taken regularly during filling to assure acceptable fill volume for the whole batch."
+3.2.P.3.4,36,Controls of Critical Steps and Intermediates (Sterile Fluid Path Components),"3.2.P.3.4.1 Mixture of [1-13C]pyruvic acid and 15 mM AH111501 sodium salt A schematic representation of the process flow and the in-process controls is presented in Figure 1, Section 3.2.P.3.3.1. In addition, environmental monitoring (microbiological and non-viable particles) of the production area is performed. 3.2.P.3.4.2 TRIS/EDTA buffer solution A schematic representation of the process flow and the in-process controls is presented in Figure 2, Section 3.2.P.3.3.2. In addition, environmental monitoring (microbiological and non-viable particles) of the production area is performed."
+3.2.P.3.5,36,Process Validation and/or Evaluation (Sterile Fluid Path Components),3.2.P.3.5.1 Mixture of [1-13C]pyruvic acid and 15 mM AH111501 sodium salt The aseptic compounding process has been validated by simulation of the aseptic process using a microbial nutrient medium. No growth has been observed in any of the media fill batches. Monitoring of the clean room and personnel is carried out and controlled on a routine basis to assure an environment suitable for aseptic processing. 3.2.P.3.5.2 TRIS/EDTA buffer solution The aseptic compounding process has been validated by simulation of the aseptic process using a microbial nutrient medium. No growth has been observed in any of the media fill batches. Monitoring of the clean room and personnel is carried out and controlled on a routine basis to assure an environment suitable for aseptic processing.
+3.2.P.6,39,Reference Standards or Materials (Sterile Fluid Path Components),"automated compounding device, SpinLab, only if the procedures for aseptic compounding the solution are satisfied.
The project and utility of SpinLab for automatic compounding of the Hyperpolarized pyruvate (13C) injection drug product are in early development, and preliminary specifications may be developed and evaluated as this project continues in the development phase. Considering the early stage of the project, and that only single doses are compounded immediately prior to patient administration by licensed pharmacy personnel, the specifications are considered justified. 3.2.P.5.6.2 TRIS/EDTA buffer solution A sterile fluid path containing the TRIS/EDTA buffer solution, which is prepared immediately prior to patient administration, will be released by a licensed pharmacist for compounding by the automated compounding device, SpinLab, only if the procedures for aseptic compounding the solution are satisfied. The project and utility of SpinLab for automatic compounding of the Hyperpolarized pyruvate (13C) injection drug product are in early development, and preliminary specifications may be developed and evaluated as this project continues in the development phase. Considering the early stage of the project, and that only single doses are compounded immediately prior to patient administration by licensed pharmacy personnel, the specifications are considered justified. 3.2.P.6 Reference Standards or Materials (Sterile Fluid Path Components) Not applicable."
+3.2.P.7,39,Container Closure System (Sterile Fluid Path),"The fluid path system is a single, sterile drug product container, container closure system that provides for rapid and complete dissolution of a frozen hyperpolarized drug product and transports the resulting hyperpolarized drug product solution from its initial location within a polarizer system to a final sterile Medrad syringe outside the polarizer system for clinical administration—injection into a patient. The empty sterile fluid path (Figure 1A) is provided in a double bag plastic tray with a lid of the following approximate size: 60 cm (L) x 35.6 cm (width) x 10.2 cm (depth)/unit or 23.6 inch (L) x 14.0 inch (width) x 4.0 inch (depth)/unit The empty sterile fluid path is designed to be a single-use drug product container, container closure system which, upon arrival at a licensed pharmacy, can be aseptically manipulated so that it can be charged with the Mixture of [1-13C]pyruvic acid and 15 mM AH111501 sodium salt, TRIS/EDTA buffer, and sterile water for injection. The key components of the empty sterile fluid path system that are in contact with the drug product are composed of USP Plastic Class V as follows: - A Radel R Plastic sample vial which will serve to contain a mixture of the drug substance [1-13C]Pyruvic Acid and the Electron Paramagnetic Agent (EPA) excipient, tris(8-carboxyl-2,2,6,6-tetra(2-(1-methoxy-2,2-d2-ethyl))-benzo[1,2-d:4,5-d']bis(dithiole-4-yl)methyl sodium salt, AH111501 Sodium Salt. - A Radel R Plastic syringe which will serve to contain sterile water for injection. - A Radel R Plastic receiver vessel which will serve to contain an aqueous solution of tris(hydroxymethyl)aminomethane (TRIS), disodium ethylenediaminetetraacetate (Na2EDTA) and sodium hydroxide (NaOH). - A Radel R plastic casing containing the EPA ultrahigh molecular weight polyethylene filters. - A Tygon plastic tubing connecting the Receiver vessel to the Sterile filter. - The rest of the assembly is composed of Radel R Plastic co-axial and transfer tubes and Udel plastic valves. The Dynamic seal is designed and integrated into the empty sterile fluid path; however, it is not in contact with the drug product. The QC appendage is designed and integrated into the empty sterile fluid path; however, it is not in contact with the drug product, as an aliquot of the drug product is transferred to the QC appendage. Commercially available SSQK 65/115VS Syringe Kits (Bayer Inc., USA) containing a sterile 65 mL Qwik-Fit Syringe, which is aseptically added to the sterile empty fluid path for collection of the final drug product, Hyperpolarized Pyruvate (13C) Injection, will not be addressed here and is depicted as Administration syringe. Figure 1A Depiction of empty sterile fluid path in packaging. Figure 1B Basic anatomy of an empty sterile fluid path."
+7.3,41,Drug Product Part 2,"3.2.P DRUG PRODUCT (HYPERPOLARIZED PYRUVATE [13C] INJECTION) Hyperpolarized Pyruvate (13C) Injection (drug product) is a sterile solution for intravenous injection. Compounding the Hyperpolarized Pyruvate (13C) Injection requires the following drug product components: • Mixture of [1-13C]pyruvic acid and 15 mM AH111501 sodium salt • TRIS/EDTA buffer solution • Sterile Water for Injection (WFI) Hyperpolarized Pyruvate (13C) Injection is compounded at the clinical site utilizing an automated compounding device, known as SpinLab, according to USP <797> Pharmaceutical Compounding – Sterile Preparations, just prior to administration. For each patient dose, SpinLab utilizes a single sterile fluid path that is composed of a cryovial which contains the mixture of [1-13C]pyruvic acid and 15 mM AH111501 sodium salt. The cryovial is lowered into the polarizer and polarized for up to 120 minutes at a temperature of approximately 0.8 K.
After polarization, the mixture of [1-13C]pyruvic acid and 15 mM AH111501 sodium salt is flushed out of the cryovial with heated and pressurized sterile WFI within the sterile fluid path, then passed through a mechanical filter for removal of AH111501, then emptied into a receiver vessel containing sterile WFI and TRIS/EDTA buffer solution. A sample of solution from the receiver vessel is automatically extracted for testing by an automated quality control instrument (QC System). While the QC System processes the solution sample, the remaining solution in the receiver vessel is passed through a sterilizing filter (0.2 μm) and then enters the final drug product container for patient administration, a 65 mL MedRad syringe. Based on the results from the QC System, the final release authorization for administration to humans will be performed by a licensed pharmacist. The following 3.2.P sections describe the Hyperpolarized Pyruvate (13C) Injection. For aspects related to the drug product components required for compounding the Hyperpolarized Pyruvate (13C) Injection, reference is made to section 3.2.P for Drug Product Kit Components."
+3.2.P.2.1,42,Components of the Drug Product (Hyperpolarized Pyruvate (13C) Injection),"Table 1 Composition of Hyperpolarized Pyruvate (13C) Injection Hyperpolarized Pyruvate (13C) Injection is supplied via a sterile disposable Medrad Qwik-Fit Syringe® for contrast media with a fill volume of 65 mL. 3.2.P.2.1.1 Drug substance The drug substance, [1−13C]pyruvic acid, is a colorless to yellow, clear, viscous liquid. [1−13C]Pyruvic acid is described in Section 3.2.S Drug Substance. After neutralization in the TRIS/EDTA buffer solution, the [1−13C]pyruvic acid is present as [1−13C]pyruvate. 3.2.P.2.1.2 Excipients AH111501 sodium salt is a stable trityl radical, and is added to [1−13C]pyruvic acid to enable hyperpolarization. After hyperpolarization and compounding, the solution is passed through a filter to remove the AH111501 from the drug product. AH111501 sodium salt is a green to black, fine to granular powder. AH111501 sodium salt is further described in Section 3.2.A.3 Novel Excipients. The TRIS/EDTA buffer solution is an aqueous solution containing 333 mM TRIS, 600 mM NaOH and 333 mg/l Na2EDTA. TRIS is added as a buffer to stabilize the pH of the Hyperpolarized Pyruvate (13C) Injection at a physiologically acceptable level. NaOH is added to neutralize the [1−13C]pyruvic acid in the Mixture of [1−13C]pyruvic acid and 15 mM AH111501 sodium salt to [1−13C]pyruvate in the Hyperpolarized Pyruvate (13C) Injection. Na2EDTA has been included in the formulation as a chelating agent to capture traces of"
+3.2.P.2.2,43,Drug Product (Hyperpolarized Pyruvate (13C) Injection),"3.2.P.2.2.1 Formulation Development The drug product kit components used for compounding of Hyperpolarized Pyruvate (13C) Injection in the polarizer differ slightly from the components used for pre-clinical studies and clinical studies GE-101-001 and GE-101-003 (see Section 8.1 Introduction for Item 8 Pharmacology and Toxicology Info). These differences are explained in the following sections and are summarized in Table 1. (a) Pyruvic acid and [1-13C]pyruvic acid Drug product used for clinical studies GE-101-001 and GE-101-003 was not hyperpolarized. As the need for 13C enriched material was not present, the drug substance used was pyruvic acid, whereas the drug substance used for compounding of Hyperpolarized Pyruvate (13C) Injection is [1-13C]pyruvic acid. Some pre-clinical safety studies were also conducted using pyruvic acid (see Section 8.1 Introduction for Item 8 Pharmacology and Toxicology Info). (b) Content of AH111501 sodium salt AH111501 is removed during compounding of Hyperpolarized Pyruvate (13C) Injection, and the content of this excipient in the final drug product is NMT 3.0 μM. To mimic this situation for clinical studies GE-101-001 and GE-101-003, 0.2 mM AH111501 sodium salt was added to the pyruvic acid in order to obtain 3.0 μM AH111501 in the Pyruvate Injection. For most of the pre-clinical studies (see Section 8.1 Introduction for Item 8 Pharmacology and Toxicology Info) and for compounding of Hyperpolarized Pyruvate (13C) Injection, 15 mM AH111501 sodium salt is added to the [1-13C]pyruvic acid. (c) Content of [1-13C]pyruvate in drug product The drug product was initially formulated to contain 500 mM [1-13C]pyruvate. For this formulation, a Mixture of [1-13C]pyruvic acid and AH111501 sodium salt, containing 2.23 g [1-13C]pyruvic acid, was dissolved in 50 ml TRIS/EDTA dissolution medium, containing 360 mM NaOH, 200 mM TRIS and 100 mg/l Na2EDTA. Because pre-clinical studies using this formulation revealed cardiovascular effects (see Sections 8.2.4.3 Effects on the Cardiovascular Systems (CVS) in the Pentobarbital/Fentanyl Anesthetized Dog, subsections a and b, for Item 8 Pharmacology and Toxicology Info), the product was later reformulated to contain 250 mM [1-13C]pyruvate. For this formulation, Mixture of [1-13C]pyruvic acid and AH111501 sodium salt, containing 2.23 g [1-13C]pyruvic acid, was dissolved in 100 ml TRIS/EDTA dissolution medium, containing 180 mM NaOH, 100 mM TRIS and 100 mg/l Na2EDTA. Some pre-clinical studies were performed with the formulation targeting 500 mM [1-13C]pyruvate. For most pre-clinical studies (see Section 8.1 Introduction for Item 8 Pharmacology and Toxicology Info) and all clinical studies, the Pyruvate (13C) Injection is targeted to contain 250 mM [1-13C]pyruvate.
(d) TRIS/EDTA dissolution medium and TRIS/EDTA buffer solution For clinical studies GE-101-001 and GE-101-003 and pre-clinical studies, the Mixture of pyruvic acid and AH111501 sodium salt was dissolved in TRIS/EDTA dissolution medium in a single step by manual dissolution (see section 3.2.P.2.3 Manufacturing Process Development). The"
+3.2.P.2.3,45,Manufacturing Process Development (Hyperpolarized Pyruvate (13C) Injection),"The procedure for compounding of Hyperpolarized Pyruvate (13C) Injection was not used for pre-clinical studies or clinical studies GE-101-001 and GE-101-003. The Pyruvate (13C) Injection for these studies was prepared by manual mixing of the drug product kit components as described in the following section. Prior to mixing the drug product kit components, the components were allowed to reach ambient room temperature. 100 mL of TRIS/EDTA dissolution medium was then added to the vial containing the Mixture of [1-13C]pyruvic acid and 15 mM AH111501 sodium salt. Immediately, the vial was vigorously shaken for at least 30 seconds to assure homogeneity. The vial was then heated in an 80°C water bath for 10 minutes and cooled in cold tap water for 5 minutes. The vial is then stored in a 37°C water bath for a maximum of 4 hours before use."
+3.2.P.3.3,46,Description of Manufacturing Process and Process Controls (Hyperpolarized Pyruvate (13C) Injection),"Hyperpolarized Pyruvate (13C) Injection is compounded at the clinical site prior to administration. For compounding, the Mixture of [1-13C]pyruvic acid and 15 mM AH111501 sodium salt is hyperpolarized by Dynamic Nuclear Polarization (DNP) for approximately 60 minutes at 1.2 K. After hyperpolarization, the Mixture of [1-13C]pyruvic acid and 15 mM AH111501 sodium salt is dissolved in sterile WFI and the AH111501 is then removed from the hyperpolarized solution by mechanical filtration. The hyperpolarized solution is then neutralized, buffered and diluted in TRIS/EDTA buffer solution and subsequently passed through a sterilizing filter (0.2 μm) into the final drug product container, an empty sterile disposable Medrad syringe. The SpinLab system used for hyperpolarization and compounding is located in an area adjacent to the MR scanner room. All compounding process steps are in accordance with USP <797> Pharmaceutical Compounding – Sterile Preparations. Immediately after compounding, the Hyperpolarized Pyruvate (13C) Injection is sampled and tested by an automatic quality control instrument (QC System). The final release authorization for administration to humans will be performed by a licensed pharmacist. After release, the Hyperpolarized Pyruvate (13C) Injection will be delivered to the adjoining MR scanner room for patient administration. 3.2.P.3.3.1 Process description (a) Compounding and filling of empty sterile fluid path All process steps for compounding and filling of the empty sterile fluid path (SFP) used in SpinLab to prepare the hyperpolarized (13C) pyruvate injection are performed within a cleanroom ISO 5 area. A clean, sterile, empty SFP is aseptically removed from its packaging and placed into the ISO 5 area. The sterile water for injection (38 g) is aseptically introduced into the dissolution syringe of the SFP and another 18.5 g of sterile water for injection is aseptically introduced into the receiving vessel. The sterile TRIS/EDTA buffer (18 g) is aseptically introduced into the receiving vessel.
A mixture of [1-13C]pyruvic acid and 15 mM AH111501 is prepared and then sterilized using a sterilizing filter (0.2 µm) and 1.47 g of the sterile solution is placed into the sterile cryovial. The cryovial containing the sterile mixture of [1-13C]pyruvic acid and 15 mM AH111501 is then attached to the empty sterile fluid path and sealed utilizing a laser welder. The cryovial is then"
+3.2.P.3.4,50,Controls of Critical Steps and Intermediates (Hyperpolarized Pyruvate (13C) Injection),"A schematic representation of the process flow and the in-process controls is presented in Figure 1 in Section 3.2.P.3.3.1 (a). The polarizer software application monitors and controls critical system and process functions and settings such as data communication and temperature settings. Malfunctions or settings detected to be outside pre-set ranges are communicated to the operators via software-generated alarms that prevent further processing. Control of the mechanical functionality of process hardware, such as valves and fittings, and of the He driving pressure, is performed manually by the operator. The final release analyses performed by the QC System ensure that the compounding process has executed as intended and that the Hyperpolarized Pyruvate (13C) Injection is within specifications (see Table 1 in Section 3.2.P.5.1). The post-administration integrity test of the sterilizing filter assesses whether the filter was functional during use."
+3.2.P.3.5,50,Process Validation and/or Evaluation (Hyperpolarized Pyruvate (13C) Injection),"3.2.P.3.5.1 IQ/OQ/PQ Program The clean room, polarizer, process equipment and QC system have gone through an extensive IQ/OQ/PQ program prior to use during clinical trials. The clean room and equipment were found to be suitable for their intended use. 3.2.P.3.5.2 Microbiological aspects The compounding process has been validated by simulation of the process using a microbial nutrient medium. No growth has been observed in any of the media fill batches. The microbiological quality has also been demonstrated by sterility and microbial endotoxin testing of repeated runs (n=6). 3.2.P.3.5.3 Compounding process consistency The consistency of the compounding process has been evaluated by repeated (n=10) compounding of Mixture of [1-13C]pyruvic acid and 15 mM AH111501 sodium salt and TRIS/EDTA buffer solution. All batches were within specifications (see Section 3.2.P.5.4 Batch Analyses for details). As the QC System only determines a limited set of quality parameters, all batches were also analyzed for related substances of [1-13C]pyruvate and TRIS, assay of [1-13C]pyruvate, osmolality and particulate contamination. Because of limited analytical capability at the site of compounding (UCSF), samples were shipped to GE Healthcare and analyses were performed 8 to 34 days after compounding. The formation of AH112615 after compounding (see Section 3.2.P.5.5 Characterization of Impurities) causes a decrease in the assay of [1-13C]pyruvate. Because of this"
+3.2.P.4,51,Control of Excipients (Hyperpolarized Pyruvate (13C) Injection),"There are no excipients added during compounding of Hyperpolarized Pyruvate (13C) Injection. All excipients in the drug product are attributed to the drug product kit components; Mixture of [1-13C]pyruvic acid and 15 mM AH111501 sodium salt and TRIS/EDTA buffer solution. Excipients in the drug product kit components are discussed in Section 3.2.P.4 Control of Excipients for Drug Product Kit Components and Section 3.2.A.3 Novel Excipients.
effect, and as the HPLC method does not detect AH112615, for this study determination of [1-13C]pyruvate content was performed by quantitative 1H NMR analysis using acetate as an internal standard for calibration. It has been shown that the content of AH112615 immediately after compounding is negligible (see Section 3.2.P.5.5 Characterization of Impurities). To determine the content of [1-13C]pyruvate at time of compounding at UCSF, the AH112625 peak was therefore integrated as [1-13C]pyruvate. The formation of AH112615 also causes a decrease in osmolality with time after compounding. For this study the osmolality at time of compounding at UCSF was therefore calculated from the measured osmolality and the content of AH112615 (determined by 1H NMR) at time of analysis. Results from these analyses are stated in Table 1. As can be seen from these results, the assay of [1-13C]pyruvate in Hyperpolarized Pyruvate (13C) Injection varied in the range of 222-252 mM with an average of 241 ± 12 mM. Although the observed assay displays a larger variance than would be expected from the drug product kit components used, the results are considered to demonstrate an acceptable process consistency. It should be noted that even though the QC system does not determine the assay of [1-13C]pyruvate, the determination of pH constitutes a relevant indirect control of this parameter. The level of control obtained through the determination of pH has been investigated in a study where a sample of Mixture of [1-13C]pyruvic acid and 15 mM AH111501 sodium salt was titrated with TRIS/EDTA buffer solution diluted in sterile WFI. Results from this study are shown in Figure 1. As can be seen from Figure 1, the pH of the solution is a well-defined function of the [1-13C]pyruvate concentration. As expected, the pKa of TRIS is observed at approximately 8.1 and the depletion of buffer capacity towards the acidic range is observed at approximately 280 mM. Estimated from the observed relationship, the pH specification (6.7 to 8.0) is equivalent to approximately 210 to 270 mM pyruvate. With regard to the efficacy of the drug product, the 13C NMR determined by the QC system is proportional to the concentration of [1-13C]pyruvate (see Section 3.2.P.5.2.1 Analytical Procedures). As the 13C nuclear polarization reported by the QC system assumes a fixed concentration of [1-13C]pyruvate, it varies linearly with the actual concentration of [1-13C]pyruvate. Hence, this parameter represents a more relevant assurance of product efficacy than the assay of [1-13C]pyruvate alone. Osmolality varied in the range of 484-513 mOsm/kg with an average of 501 ± 12 mOsm/kg. Particulate contamination was well within the pharmacopoeia limits for all batches. The purity profile observed during this study was as expected from the purity profile of the drug product kit components. No new impurities were observed. With regard to the purity profile of Hyperpolarized Pyruvate (13C) Injection at time of compounding at UCSF, reference is made to 3.2.P.5.5 Characterization of Impurities."
+3.2.P.4.1,51,Specification (Hyperpolarized Pyruvate (13C) Injection),"Not applicable."
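The linearity argument in the preceding row (the QC system reports polarization assuming a fixed [1-13C]pyruvate concentration, so the reported value scales with the actual concentration) can be written out explicitly. A minimal sketch of that reading; the 250 mM assumed concentration, the function name and the example numbers are illustrative, not taken from the dossier:

```python
# Illustrative reading of the linearity argument above (editorial aid, not dossier text).
# The QC system converts an NMR signal proportional to (polarization x concentration) into a
# polarization value while assuming a fixed concentration, so the reported number scales
# linearly with the true concentration.

ASSUMED_CONC_MM = 250.0  # concentration the QC calculation is assumed to use (illustrative)

def reported_polarization(true_polarization_pct: float, true_conc_mM: float) -> float:
    """Polarization the QC system would report if the true concentration differs
    from the assumed one."""
    return true_polarization_pct * true_conc_mM / ASSUMED_CONC_MM

# Example: a dose at 225 mM with a true polarization of 40% would read as ~36%.
print(reported_polarization(40.0, 225.0))  # 36.0
```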
+3.2.P.5.2,53,Analytical Procedures (Hyperpolarized Pyruvate (13C) Injection),"The 13C nuclear polarization of Hyperpolarized Pyruvate (13C) Injection undergoes an exponential decay with a time constant of approximately 69 seconds. In order to preserve an acceptable imaging efficacy, the time between start of dissolution and start of administration to the patient has to be NMT 50s (see Section 3.2.P.8.1 Stability Summary and Conclusions). Due to this limited user window, analyses to control Hyperpolarized Pyruvate (13C) Injection are performed using an automated analytical system (QC System) that determines a limited set of parameters within a time span of approximately 10s. This QC System is specifically developed for the analysis of Hyperpolarized Pyruvate (13C) Injection immediately prior to administration to the patients. The QC accessory participates in the dissolution process by managing the state of the sterile fluid path. Specifically, the QC accessory controls the upper slide valve, which is used to isolate the receiver from the EPA filter, and the lower slide valve which controls fluid flow to the cuvettes and the Administration syringe. After a dissolution is complete, the QC accessory closes the upper slide valve, measures the temperature of the receiver, and opens the lower slide valve to allow the mixed solution to be drawn out. Once the cuvettes and NMR bulb are filled, the QC measures the pyruvate concentration, EPA concentration, and pH. The percent polarization is also measured. Once the Administration syringe is filled, the QC checks that the volume is above the level of a threshold sensor. All measurement results are reported to the Hyperpolarizer, where they are interpreted and displayed on the screen for an operator to decide how to proceed. After completed analysis, the software performs a comparison of the results to a pre-set list of specifications (see Table 1 in Section 3.2.P.5.1 Specifications) and reports the compliance or non-compliance of the Hyperpolarized Pyruvate (13C) Injection to the specifications. Control of additional parameters is assured through testing performed on the combination of the drug product kit components; Mixture of [1-13C]pyruvic acid and 15 mM AH111501 sodium salt dissolved in WFI and TRIS/EDTA buffer solution (Pyruvate (13C) Injection) (see Section"
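The "NMT 50s" user window in the preceding row follows from the stated exponential decay of the 13C polarization. A short sketch of the arithmetic, using only the approximately 69 second time constant and the 50 second limit quoted above; the computed remaining fraction is derived here, not stated in the source:

```python
# Decay arithmetic implied by the text above (editorial aid, not dossier text).
import math

TIME_CONSTANT_S = 69.0   # approximate polarization decay time constant stated above
max_delay_s = 50.0       # NMT 50 s between start of dissolution and administration

remaining_fraction = math.exp(-max_delay_s / TIME_CONSTANT_S)
print(f"{remaining_fraction:.2f}")  # ~0.48: roughly half the polarization remains at 50 s
```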
This measurement is a capacitive measurement that was tuned by the manufacturer during system setup. The sensor was also tuned to have its threshold centered at a nominal volume of 38 mL. 3.2.P.5.3 Validation of Analytical Procedures (Hyperpolarized Pyruvate (13C) Injection) The analytical procedures are appropriately validated for the current development phase and are suitable for intended use. The validation performed at this stage is summarized in Table 1. Table 1 Validation of analytical procedures performed at this stage 3.2.P.5.4 Batch Analyses (Hyperpolarized Pyruvate (13C) Injection) Hyperpolarized Pyruvate (13C) Injection has not been used for pre-clinical studies or clinical studies GE-101-001 and GE-101-003. For the clinical studies, non-polarized Pyruvate Injection has been used. Different formulations of Pyruvate (13C) Injection have been used during non-clinical studies and clinical studies GE-101-001 and GE-101-003. The drug product kit components used for compounding of Hyperpolarized Pyruvate (13C) Injection have been formulated such that the drug product is equivalent to the drug product used for clinical studies GE-101-001 and GE-101-003, as discussed in Section 3.2.P.2.2.1 Formulation Development. Results for batches of Hyperpolarized Pyruvate (13C) Injection are presented in Table 1 and Table 2. Table 1 Batch data for Hyperpolarized Pyruvate (13C) Injection" +3.2.P.5.4,55,Batch Analyses (Hyperpolarized Pyruvate (13C) Injection),"Hyperpolarized Pyruvate (13C) Injection has not been used for pre-clinical studies or clinical studies GE-101-001 and GE-101-003. For the clinical studies, non-polarized Pyruvate Injection has been used. Different formulations of Pyruvate (13C) Injection have been used during non-clinical studies and clinical studies GE-101-001 and GE-101-003. The drug product kit components used for compounding of Hyperpolarized Pyruvate (13C) Injection have been formulated such that the drug product is equivalent to the drug product used for clinical studies GE-101-001 and GE-101-003, as discussed in Section 3.2.P.2.2.1 Formulation Development. Results for batches of Hyperpolarized Pyruvate (13C) Injection are presented in Table 1 and Table 2. Table 1 Batch data for Hyperpolarized Pyruvate (13C) Injection 3.2.P.5.2.5 Volume The volume measurement in the QC is a threshold measurement performed at the Administration syringe after the fluid movement is complete. This measurement is a capacitive measurement that was tuned by the manufacturer during system setup. The sensor was also tuned to have its threshold centered at a nominal volume of 38 mL. 3.2.P.5.3 Validation of Analytical Procedures (Hyperpolarized Pyruvate (13C) Injection) The analytical procedures are appropriately validated for the current development phase and are suitable for intended use. The validation performed at this stage is summarized in Table 1. Table 1 Validation of analytical procedures performed at this stage 3.2.P.5.4 Batch Analyses (Hyperpolarized Pyruvate (13C) Injection) Hyperpolarized Pyruvate (13C) Injection has not been used for pre-clinical studies or clinical studies GE-101-001 and GE-101-003. For the clinical studies, non-polarized Pyruvate Injection has been used. Different formulations of Pyruvate (13C) Injection have been used during non-clinical studies and clinical studies GE-101-001 and GE-101-003. 
The drug product kit components used for compounding of Hyperpolarized Pyruvate (13C) Injection have been formulated such that the drug product is equivalent to the drug product used for clinical studies GE-101-001 and GE-101-003, as discussed in Section 3.2.P.2.2.1 Formulation Development. Results for batches of Hyperpolarized Pyruvate (13C) Injection are presented in Table 1 and Table 2. Table 1 Batch data for Hyperpolarized Pyruvate (13C) Injection" +3.2.P.5.5,56,Characterization of Impurities (Hyperpolarized Pyruvate (13C) Injection),"A determination of the impurities in Hyperpolarized Pyruvate (13C) Injection is not part of the analyses performed by the QC System. Hence, documentation and control of the impurities in the drug product rests on analyses performed during the release testing of Pyruvate (13C) Injection (see Section 3.2.P.5.5 Characterization of Impurities for Drug Product Kit Components) and the results from process verification studies. 3.2.P.5.5.1 Differences in dissolution procedures The manual procedure for the compounding of Pyruvate (13C) Injection during preparation of samples for analysis is identical to the procedure used during pre-clinical safety studies and clinical studies GE-101-001 and GE-101-003. The dissolution process during compounding of Hyperpolarized Pyruvate (13C) Injection is different from the manual procedure, particularly with regards to parameters such as time, temperature, flow rates and pressure. These differences influence the purity profile such that the impurities in manually dissolved Pyruvate (13C) Injection, to some extent, are different from those in Hyperpolarized Pyruvate (13C) Injection. These effects and the purity profile of Hyperpolarized Pyruvate (13C) Injection are discussed in Sections 3.2.P.5.5.2, 3.2.P.5.5.3 and 3.2.P.5.5.4. 3.2.P.5.5.2 Transformation between AH113462 and AH112623 During and after manual compounding of Pyruvate (13C) Injection, the major impurity in the drug substance, AH113462/E, transforms through AH113462/K to AH112623 (see Section 3.2.P.5.5.1 (a) Transformation of the [1-13C]pyruvic acid purity profile for Drug Product Kit Components). As the dissolution step during the semi-automated compounding of Hyperpolarized Pyruvate (13C) Injection takes place in less than 10 seconds and the product is administered within 50s from start of dissolution, the transformation from AH113462/E to AH112623 will not be complete. The drug product used in the pre-clinical safety studies and for clinical studies GE-101-001 and GE-101-" +3.2.P.8.1,61,Stability Summary and Conclusion (Hyperpolarized Pyruvate (13C) Injection),"The stability-indicating parameter for Hyperpolarized Pyruvate (13C) Injection is the level of 13C nuclear polarization, which decays rapidly after compounding. The stability testing performed has therefore been limited to determination of the 13C nuclear polarization and relaxation time (T1). 3.2.P.8.1.1 Batches tested Stability testing has been performed on six samples of Hyperpolarized Pyruvate (13C) Injection compounded from Mixture of [1-13C]pyruvic acid and 15 mM AH111501 sodium salt batch FFF106/140-806 and TRIS/EDTA buffer solution batch FFF106/142-806. 3.2.P.8.1.2 Storage conditions and testing frequency Testing was performed inside an MRI scanner located next to the clean room where the compounding of the sample took place. For testing frequency, see Section 3.2.P.8.1.3 Analytical Procedures and Specification. 
3.2.P.8.1.3 Analytical procedures and specification The level of 13C nuclear polarization was determined using a 3T MRI scanner. The hyperpolarized 13C NMR signal was obtained using a 5 degree RF pulse. During the relaxation of the non-equilibrium polarization, a total of 64 spectra with 5 degree pulse and TR=3s were collected; the first of which was used for calculating 13C polarization. The relaxation time (13C T1) was calculated by fitting these data to a mono-exponential decay curve. After relaxation to thermal equilibrium, a thermal 13C NMR spectrum was collected (90 degree pulse, 64 averages, TR=10s, after addition of 10 μl Gd/ml solution) in order to calculate the 13C polarization. No shelf-life specifications have been established for Hyperpolarized Pyruvate (13C) Injection. Assurance of quality at time of administration rests on analyses performed before release and the time limit for administration after the dissolution step is completed, which is stated in the imaging" +3.2.P.8.2,62,Post-approval Stability Protocol and Stability Commitment (Hyperpolarized Pyruvate (13C) Injection),"protocol. 3.2.P.8.1.4 Summary of results The stability results are presented in Section 3.2.P.8.3 Stability Data. 3.2.P.8.1.5 Conclusion With a relaxation time of 69s, the polarization decreases by 7% (relative) each 5 seconds. To optimize imaging signal, administration should hence take place as quickly as practically possible. To assure the level of polarization during clinical use, and hence a certain level of imaging signal, the drug product will be administered within 50s from time of start dissolution. With a release specification of NLT 15.0% and a relaxation time of 69s, this user window limit will assure a polarization of NLT 7.3% at time of administration. 3.2.P.8.2 Post-approval Stability Protocol and Stability Commitment (Hyperpolarized Pyruvate (13C) Injection) Not applicable. 3.2.P.8.3 Stability Data (Hyperpolarized Pyruvate (13C) Injection) The average relaxation time determined for the six samples investigated was 68.8 ± 1.3s, with a range of 67.1 to 71.0s. A line derived from the stability results on Hyperpolarized Pyruvate (13C) Injection is shown in Figure 1. In Figure 1, the line represents a sample released at specification limit (NLT 15.0% at start of dissolution), decaying with the average measured relaxation time (69s)." +3.2.P.8.3,62,Stability Data (Hyperpolarized Pyruvate (13C) Injection),"The average relaxation time determined for the six samples investigated was 68.8 ± 1.3s, with a range of 67.1 to 71.0s. A line derived from the stability results on Hyperpolarized Pyruvate (13C) Injection is shown in Figure 1. In Figure 1, the line represents a sample released at specification limit (NLT 15.0% at start of dissolution), decaying with the average measured relaxation time (69s). protocol. 3.2.P.8.1.4 Summary of results The stability results are presented in Section 3.2.P.8.3 Stability Data. 3.2.P.8.1.5 Conclusion With a relaxation time of 69s, the polarization decreases by 7% (relative) each 5 seconds. To optimize imaging signal, administration should hence take place as quickly as practically possible. To assure the level of polarization during clinical use, and hence a certain level of imaging signal, the drug product will be administered within 50s from time of start dissolution. 
With a release specification of NLT 15.0% and a relaxation time of 69s, this user window limit will assure a polarization of NLT 7.3% at time of administration. 3.2.P.8.2 Post-approval Stability Protocol and Stability Commitment (Hyperpolarized Pyruvate (13C) Injection) Not applicable. 3.2.P.8.3 Stability Data (Hyperpolarized Pyruvate (13C) Injection) The average relaxation time determined for the six samples investigated was 68.8 ± 1.3s, with a range of 67.1 to 71.0s. A line derived from the stability results on Hyperpolarized Pyruvate (13C) Injection is shown in Figure 1. In Figure 1, the line represents a sample released at specification limit (NLT 15.0% at start of dissolution), decaying with the average measured relaxation time (69s)." \ No newline at end of file diff --git a/images/agent_log_traces.png b/images/agent_log_traces.png new file mode 100644 index 0000000000..ca0ecb0cfc Binary files /dev/null and b/images/agent_log_traces.png differ diff --git a/images/agent_trace_details.png b/images/agent_trace_details.png new file mode 100644 index 0000000000..bf10e8134d Binary files /dev/null and b/images/agent_trace_details.png differ diff --git a/images/baseline_agent.png b/images/baseline_agent.png new file mode 100644 index 0000000000..6322a8bf7f Binary files /dev/null and b/images/baseline_agent.png differ diff --git a/images/code-modernization-phases.png b/images/code-modernization-phases.png new file mode 100644 index 0000000000..0e38d77733 Binary files /dev/null and b/images/code-modernization-phases.png differ diff --git a/images/codex_code_review.png b/images/codex_code_review.png new file mode 100644 index 0000000000..9448d620a6 Binary files /dev/null and b/images/codex_code_review.png differ diff --git a/images/eval_dashboard.png b/images/eval_dashboard.png new file mode 100644 index 0000000000..1ba0a57420 Binary files /dev/null and b/images/eval_dashboard.png differ diff --git a/images/eval_run_results.png b/images/eval_run_results.png new file mode 100644 index 0000000000..3b8086c84b Binary files /dev/null and b/images/eval_run_results.png differ diff --git a/images/eval_set_config.png b/images/eval_set_config.png new file mode 100644 index 0000000000..bf8e07dd77 Binary files /dev/null and b/images/eval_set_config.png differ diff --git a/images/feedback.png b/images/feedback.png new file mode 100644 index 0000000000..8a71d9c365 Binary files /dev/null and b/images/feedback.png differ diff --git a/images/oob_transcription.png b/images/oob_transcription.png new file mode 100644 index 0000000000..35503af901 Binary files /dev/null and b/images/oob_transcription.png differ diff --git a/images/pilot-candidate.png b/images/pilot-candidate.png new file mode 100644 index 0000000000..8191e2cfcb Binary files /dev/null and b/images/pilot-candidate.png differ diff --git a/images/pilot-flow-diagram.png b/images/pilot-flow-diagram.png new file mode 100644 index 0000000000..d007cf19eb Binary files /dev/null and b/images/pilot-flow-diagram.png differ diff --git a/images/pilot-folder-structure.png b/images/pilot-folder-structure.png new file mode 100644 index 0000000000..923b35ceda Binary files /dev/null and b/images/pilot-folder-structure.png differ diff --git a/images/pilot-yaml.png b/images/pilot-yaml.png new file mode 100644 index 0000000000..f0c1055a89 Binary files /dev/null and b/images/pilot-yaml.png differ diff --git a/images/prompt_input.png b/images/prompt_input.png new file mode 100644 index 0000000000..c12b373456 Binary files /dev/null and b/images/prompt_input.png differ diff 
--git a/images/simplified_reg_agent.png b/images/simplified_reg_agent.png new file mode 100644 index 0000000000..c1e312578d Binary files /dev/null and b/images/simplified_reg_agent.png differ diff --git a/images/updated_prompt.png b/images/updated_prompt.png new file mode 100644 index 0000000000..ed289b5031 Binary files /dev/null and b/images/updated_prompt.png differ diff --git a/images/updated_prompt_feedback.png b/images/updated_prompt_feedback.png new file mode 100644 index 0000000000..a0bb1a2a7b Binary files /dev/null and b/images/updated_prompt_feedback.png differ diff --git a/registry.yaml b/registry.yaml index d84c00e229..bfebed7d39 100644 --- a/registry.yaml +++ b/registry.yaml @@ -4,13 +4,61 @@ # should build pages for, and indicates metadata such as tags, creation date and # authors for each page. +- title: Transcribing User Audio with a Separate Realtime Request + path: examples/Realtime_out_of_band_transcription.ipynb + date: 2025-11-20 + authors: + - minh-hoque + tags: + - realtime + - transcription + - voice + - speech + - audio + +- title: Self-Evolving Agents - A Cookbook for Autonomous Agent Retraining + path: examples/partners/self_evolving_agents/autonomous_agent_retraining.ipynb + date: 2025-11-04 + authors: + - shikhar-cyber + - Calvin Maguranis + - Valentina Frenkel + - Fanny Perraudeau + - Giorgio Saladino + tags: + - partners + - self-evolving-agents + - evals + - llmops + - prompt-engineering + - agent-retraining + +- title: User guide for gpt-oss-safeguard + path: articles/gpt-oss-safeguard-guide.md + date: 2025-10-29 + authors: + - roost + tags: + - gpt-oss + - open-models + - guardrails + +- title: Build, deploy, and optimize agentic workflows with AgentKit + path: examples/agentkit/agentkit_walkthrough.ipynb + date: 2025-10-17 + authors: + - nikhils-oai + tags: + - agentkit + - evals + - title: Building with Realtime Mini path: examples/building_w_rt_mini/building_w_rt_mini.ipynb date: 2025-10-11 authors: - carter-oai tags: - - gpt-realtime-mini + - gpt-realtime-mini - title: Sora 2 Prompting Guide path: examples/sora/sora2_prompting_guide.ipynb @@ -2590,3 +2638,37 @@ tags: - evals - datasets + +- title: Build Code Review with the Codex SDK + path: examples/codex/build_code_review_with_codex_sdk.md + date: 2025-10-21 + authors: + - charlie-openai + tags: + - codex + +- title: Build a coding agent with GPT 5.1 + path: examples/Build_a_coding_agent_with_GPT-5.1.ipynb + date: 2025-11-13 + authors: + - katiagg + tags: + - agents-sdk + +- title: GPT-5.1 Prompting Guide + path: examples/gpt-5/gpt-5-1_prompting_guide.ipynb + date: 2025-11-13 + authors: + - samarth-oai + tags: + - gpt-5.1 + - responses + - reasoning + +- title: Modernizing your Codebase with Codex + path: examples/codex/code_modernization.md + date: 2025-11-19 + authors: + - derrickchoi-openai + tags: + - codex