Bridging Infrastructure and AI: Robinhood’s MCP-Powered Ecosystem
How the Model Context Protocol enables AI-native storage automation
Authors: Andrew Fan, Ashvin Vinodh
Robinhood was founded on a simple idea: that our financial markets should be accessible to all. With customers at the heart of our decisions, Robinhood is lowering barriers and providing greater access to financial information and investing. Together, we are building products and services that help create a financial system everyone can participate in.
Introduction
At Robinhood, ensuring safe database operations is critical. Our storage infrastructure supports real-time financial operations, powering everything from transactional systems to analytical workloads. As our customer base and number of services continue to grow, so do the complexity and scale of our storage systems, and with them, the cost of manual operations.
To meet this challenge, we built a unified, LLM-driven platform powered by the Model Context Protocol (MCP). The goal of this system is to provide AI-powered reasoning over real-time database performance insights and diagnostic data in the most secure and least invasive way possible. These insights are delivered through multiple interfaces, from GitHub Pull Requests to a Slack app/bot.
The Problem
Robinhood’s Storage team owns the reliability, safety, and performance of a vast storage footprint that spans hundreds of Postgres databases; Postgres is our relational database of choice for all online data operations. We leverage Amazon Aurora, AWS's managed offering for Postgres.
We deploy hundreds of microservices, each with its own dedicated database cluster. With the proliferation of services, our infrastructure workflows encountered bottlenecks, including:
- Fragmented tooling and on-call fatigue: Debugging database issues requires sifting through numerous dashboards, logs, and bespoke scripts, increasing MTTR for alerts and incidents
- High volumes of schema migration Pull Requests: These require manual reviews from the Storage team to reduce the risk of backwards-incompatible DB changes and long-running migrations
- Insufficient context for PR reviewers, particularly regarding live schema or index states.
We needed an extensible platform capable of integrating infrastructure metadata with intelligent agents to deduce insights from a diverse range of signals. At the same time, we wanted to do this as securely as possible, removing any risk of sensitive data leaks or performance regressions.
Use Cases
In building an intelligent Storage platform, we aimed to achieve several key objectives:
- Provide diagnostic and reasoning capabilities over a large number of Storage signals, surfaced via natural language interfaces and automations
- Minimize manual review cycles and accelerate incident triage
- Automate schema migration reviews by leveraging best-practice runbooks and real-time database statistics
- Make Storage insights more accessible across all teams
Key personas include:
- Storage Engineers: Validate migrations and debug database performance issues
- Application Engineers: Write schema migrations and investigate application performance characteristics related to DB access patterns
Architecture Overview
MCP (Model Context Protocol) is an open-source standard for connecting AI applications to external systems. MCP provides a standardized way to connect LLMs to data sources (e.g. local files, databases) and tools, enabling them to access key contextual information.
When an LLM agent connects to an MCP server, it automatically fetches the tools the server makes available. Leveraging the metadata defined for each tool, the LLM identifies which tool(s) are most relevant to the current prompt and calls them sequentially.
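For illustration, here is a minimal sketch of that handshake using the official MCP Python SDK; the server entrypoint, tool arguments, and timestamps below are hypothetical, not our production setup:

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Spawn the MCP server as a subprocess and talk to it over stdio
    # ("pganalyze_mcp_server.py" is a hypothetical entrypoint)
    params = StdioServerParameters(command="python", args=["pganalyze_mcp_server.py"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Discover the tools the server advertises
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])
            # Call a tool with arguments the LLM produced
            result = await session.call_tool(
                "getSchemaTableStats",
                arguments={
                    "databaseId": "0",
                    "schemaTableId": "0",
                    "startTs": 1735689600,
                    "endTs": 1735693200,
                },
            )
            print(result.content)

asyncio.run(main())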
The core principle behind our architecture is this: once tools are exposed via MCP, they become reusable, composable building blocks for any workflow. Our ecosystem builds on this abstraction to deliver consistent and structured access to storage insights.
Our system has three core layers:
- Pganalyze MCP Server: Surfaces diagnostics, schema, and performance tools using MCP’s structured interface by leveraging Pganalyze GraphQL APIs.
- Agents: LLM-powered, multi-turn, and protocol-aware applications. The custom MCP client we built serves as the foundation for our intelligent agents, including a GitHub PR Agent and a Slack Assistant
- User Interfaces: The insights are surfaced through familiar entry points: GitHub PR Checks via CI pipelines, Slack conversations via a new Slack App, and third-party clients (Cursor, Claude Code)
Why Pganalyze?
There are various open-source Postgres MCP servers that expose rich tooling for surfacing database metadata, as well as tools to execute SQL directly against the database. These come with a few key challenges, however:
- Risk of fetching sensitive/user data: Strict controls are required to ensure the MCP server executes as a dedicated database user with a limited set of permissions
- SPI/PII redaction: If the goal is to fetch user data while filtering away sensitive columns, LLMs can often find creative ways to obtain similar data from other tables/columns in your database, depending on how they are prompted
- Performance Considerations: LLMs can execute sub-optimal SQL queries via the MCP Server if permitted to execute arbitrary SQL, or overwhelm the database with connections depending on scale
To ensure the highest standards of safety, we instead chose to implement a custom MCP Server over Pganalyze, a third-party Postgres monitoring tool which Robinhood licenses and deploys in our clusters. Pganalyze periodically collects metadata “snapshots” of each database in our fleet by querying various system catalogs and monitoring tables. These snapshots contain information about database schema and entities, table and index statistics, and query performance stats.
Since no user data is exported, Pganalyze removes any risk of data leaks and provides a secure data source to pull insights from. It is important to note that the relevant Pganalyze configurations must be set to enforce filtering/masking of logs and query text/parameter values so as to avoid any risk of sensitive data being exposed.
Component Deep Dive
Pganalyze MCP Server
We leverage the official MCP Python SDK (https://github.com/modelcontextprotocol/python-sdk) to build a custom Pganalyze MCP Server, which exposes Postgres database performance insights and metadata tools by wrapping Pganalyze’s native GraphQL APIs.
We define the Pganalyze GraphQL endpoints in a JSON file as shown below, which is then parsed to generate the relevant MCP tool definitions and GraphQL calls dynamically:
"getSchemaTableStats": {
"description": "Deliver time-series metrics for a table (insert/update/delete rates plus table, index, TOAST and total size) over a specified period to spot bloat and growth trends. Returns arrays of [unix_ts, value] pairs that may be empty when the range has no snapshots. Note: timestamp intervals are 600 seconds apart but data is aggregated per 60 seconds.",
"operation_type": "query",
"query": "query getSchemaTableStats($databaseId: ID!, $schemaTableId: ID!, $startTs: Int!, $endTs: Int!) {\n getSchemaTableStats(\n databaseId: $databaseId\n schemaTableId: $schemaTableId\n startTs: $startTs\n endTs: $endTs\n ) {\n __typename\n # deletes # Type: LIST\n # hotUpdates # Type: LIST\n id # Type: NON_NULL\n indexSize # Type: LIST\n inserts # Type: LIST\n tableSize # Type: LIST\n # toastSize # Type: LIST\n totalSize # Type: LIST\n updates # Type: LIST\n }\n}",
"variables": {
"databaseId": {
"type": "string",
"description": "GraphQL variable: $databaseId",
"example": "0",
"required": true
},
"schemaTableId": {
"type": "string",
"description": "GraphQL variable: $schemaTableId",
"example": "0",
"required": true
},
"startTs": {
"type": "integer",
"description": "GraphQL variable: $startTs",
"example": null,
"required": true
},
"endTs": {
"type": "integer",
"description": "GraphQL variable: $endTs",
"example": null,
"required": true
}
}
},
. . .
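To show how a config like this can drive dynamic tool registration, here is a simplified sketch using the SDK's low-level Server API. The config path, GraphQL URL, and auth header format are assumptions for illustration, not our production values:

import json
import os

import httpx
import mcp.types as types
from mcp.server import Server

# Hypothetical config path and endpoint; the auth header format is an
# assumption rather than Pganalyze's documented scheme.
with open("pganalyze_endpoints.json") as f:
    ENDPOINTS = json.load(f)
GRAPHQL_URL = os.environ.get("PGANALYZE_URL", "https://pganalyze.internal/graphql")

server = Server("pganalyze-mcp")

@server.list_tools()
async def list_tools() -> list[types.Tool]:
    # One MCP tool per configured GraphQL endpoint
    return [
        types.Tool(
            name=name,
            description=spec["description"],
            inputSchema={
                "type": "object",
                "properties": {
                    var: {"type": meta["type"], "description": meta["description"]}
                    for var, meta in spec["variables"].items()
                },
                "required": [v for v, m in spec["variables"].items() if m["required"]],
            },
        )
        for name, spec in ENDPOINTS.items()
    ]

@server.call_tool()
async def call_tool(name: str, arguments: dict) -> list[types.TextContent]:
    # Forward the tool call to Pganalyze as a GraphQL request
    spec = ENDPOINTS[name]
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            GRAPHQL_URL,
            json={"query": spec["query"], "variables": arguments},
            headers={"Authorization": f"Token {os.environ['PGANALYZE_API_TOKEN']}"},
        )
    resp.raise_for_status()
    return [types.TextContent(type="text", text=resp.text)]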
Key Details:
- Caching & Safety: All data is sourced from Pganalyze metrics snapshots, not live DB queries.
- Security: Since Pganalyze does not (and cannot) access user data, there is no mechanism for the MCP Server to inadvertently surface actual user data.
- Performance Optimized: TTL caching is applied at the MCP tool level to reduce load and improve latency, particularly during high-volume operations like PR analysis (see the sketch after this list)
- Tool Coverage: Over 90 MCP tools are registered dynamically via a JSON config. New tools can be added declaratively without touching application logic.
- Extensibility: Defining a new tool is as simple as adding a new JSON object for the relevant endpoint; the MCP tool is generated dynamically
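As referenced above, a minimal sketch of tool-level TTL caching might look like the following; our production implementation may differ:

import functools
import time

def ttl_cache(ttl_seconds: int = 600):
    """Cache an async tool handler's result, keyed by its arguments, for ttl_seconds."""
    def decorator(fn):
        store: dict = {}

        @functools.wraps(fn)
        async def wrapper(*args, **kwargs):
            key = (args, tuple(sorted(kwargs.items())))
            now = time.monotonic()
            entry = store.get(key)
            if entry is not None and now - entry[0] < ttl_seconds:
                return entry[1]  # still fresh: serve the cached result
            result = await fn(*args, **kwargs)
            store[key] = (now, result)
            return result

        return wrapper
    return decorator

Applying a decorator like this to a tool handler lets repeated calls within the TTL window, such as a burst of PR analyses against the same table, skip the GraphQL round trip entirely.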
Storage PR Agent
The Storage PR Agent is a multi-turn agent that performs automated reviews of schema migration GitHub Pull Requests. It exposes an HTTP API that is called from our Buildkite CI pipeline, and it serves as the first line of defense against dangerous schema migration changes.
Example Schema Migration:
Consider the following migration where we add a new index on a large table:
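For illustration, a migration of this shape might look like the following (the table and column names are hypothetical):

-- Hypothetical migration: a plain CREATE INDEX takes a lock that blocks
-- writes and can run for a long time on a large table, stalling the
-- deployment that applies it.
CREATE INDEX idx_orders_account_id ON orders (account_id);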
The analysis posted back to the GitHub PR explicitly calls out the size of the table and the risk of timing out application deployments if the index creation runs as part of one.
Moreover, the PR is explicitly labeled as “High Risk” for clarity.
The above example showcases a schema migration defined as SQL, but we also support a similar workflow for Django (python-based) schema migration definitions.
Key Details:
- CI Integration: When Buildkite pipelines detect schema migration PRs, the Agent is called with the PR contents
- Dynamic Tool Orchestration: Agent pulls context from GitHub PRs and invokes MCP tools to assess table/index stats and other schema characteristics.
- Framework-Aware Runbook Selection: The system prompt uses a language/ORM-aware prompt-engineering strategy (Django for Python, Golang) to enforce different validation rules based on the detected migration style and internal best-practice runbooks
- Agent Analysis Loop: Runs the LLM in an iterative, multi-turn tool-calling process (see the sketch after this list)
- Risk Labeling: Posts results as GitHub Check Runs with clear markdown summaries and labels such as "Storage: Schema Migration – High Risk".
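A simplified sketch of such an analysis loop is shown below; the llm chat client, its tool-calling interface, and the helper name are hypothetical stand-ins for illustration:

async def review_migration(pr_diff: str, session, llm) -> str:
    # `session` is an initialized MCP ClientSession; `llm` is a hypothetical
    # chat client that supports tool calling (names and shapes are illustrative).
    tools = (await session.list_tools()).tools
    messages = [
        {"role": "system", "content": "You review Postgres schema migrations for risk."},
        {"role": "user", "content": pr_diff},
    ]
    for _ in range(10):  # bound the number of reasoning turns
        reply = await llm.chat(messages, tools=tools)
        if not reply.tool_calls:
            return reply.content  # final markdown summary plus a risk label
        for call in reply.tool_calls:
            result = await session.call_tool(call.name, arguments=call.arguments)
            messages.append(
                {"role": "tool", "name": call.name, "content": result.content[0].text}
            )
    return "Analysis did not converge; flag for manual review."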
Slack Storage Assistant
The Slack Assistant offers engineers a conversational interface for querying database metadata and insights, without needing access to specialized dashboards or knowledge of MCP tooling.
We wanted to make this Assistant available across multiple surfaces, including DMs, threads, and channels. For this purpose, we built a custom MCP client that integrates with the Slack API to aggregate the relevant message context, such as thread, channel, and DM history.
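As a minimal sketch, thread-context aggregation with the Slack Web API might look like this (the helper name and parameters are illustrative, not our production code):

import os
from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

def gather_thread_context(channel_id: str, thread_ts: str, limit: int = 50) -> list[str]:
    # Pull the messages in a thread so they can be handed to the agent as context
    resp = client.conversations_replies(channel=channel_id, ts=thread_ts, limit=limit)
    return [msg.get("text", "") for msg in resp["messages"]]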
Key Details:
- Shared Foundation: Leverages the same MCP client integration as the PR Agent, allowing reuse of the full toolset in conversational workflows.
- Context Modes - Channel Mentions: Generates concise, structured summaries contextualized to the conversation thread.
- Context Modes - DMs: Supports multi-turn investigations with retained message history for interactive debugging.
- Context Modes - Empty Mentions: Reviews full thread or channel history to summarize risks or action items (especially useful during incidents).
- Safe by Design: Only authorized Robinhood users can use the Slack App, and since it fetches and analyzes only metadata via the Pganalyze MCP Server, there is no risk of data leaks
Impact
Since adopting this ecosystem, we’ve seen dramatic improvements across our workflows:
- Developer Velocity: Reduced operational burden on the Storage team by enabling seamless database performance analysis and automating complex queries such as cross-shard schema mismatch detection
- Schema review coverage: Expanded from a handful of critical services to automated coverage across all of our hundreds of services
- PR review latency: Decreased from 10+ minutes to <2 minutes for most cases
Looking Ahead
- Expand MCP Server Coverage: Develop additional MCP Server integrations over our metrics, logging and cloud infrastructure to provide real-time insights over our entire operational stack
- Improve Accuracy: Integrate RAG strategies to optimize accuracy, drawing on continuous usage data and best-practice recommendations