In-Depth

Navigating VS Code AI Toolkit and Microsoft Foundry for Agent Development

Everybody is hopping on to the agentic AI bandwagon these days, but the sprawling ecosystem of tools, platforms, and documentation can make it hard to know where to start -- and even harder to understand why you keep hitting walls as you try to scale up. The AI Toolkit for VS Code and Microsoft Foundry are powerful tools for building and deploying AI agents, but they come with their own set of complexities and gotchas that can trip up even experienced developers.

If you just jump right into to build AI agents in VS Code's AI Toolkit extension and then try to scale those agents up using Microsoft Foundry, you will almost certainly run into a wall -- multiple walls, actually, each from a different direction. Rate limits that seem to appear at random. Models that claim to support tools but silently ignore them. Models that exist in the catalog but are mysteriously unavailable in your region. Context errors on prompts that don't feel especially large. And a documentation landscape fragmented enough to make triangulating the real answer a genuine research project.

I learned about a lot about this stuff in a recent hands-on walkthrough of building an agent across AI Toolkit and Foundry -- including model selection friction, rate/size limits, and tool wiring (see "How I Built a 'Journalist' AI Agent in VS Code to Replace Me.")

This article assembles what's actually going on across all of those friction points, explains the underlying systems driving them, and offers practical guidance for navigating the ecosystem with fewer surprises.

What the Two Platforms Actually Are
The AI Toolkit for VS Code is an extension that provides a unified local environment for discovering models, testing prompts, and wiring up AI agents. It connects to model sources including GitHub-hosted models (via GitHub Models), OpenAI, Anthropic, Google, and local runtimes via Ollama or ONNX. Its primary surfaces are: a Model Catalog for browsing and deploying models from multiple providers; a Playground for testing model behavior with direct prompts; an Agent Builder for constructing agents with instructions, tools, and MCP server connections; and Evaluation tools for running bulk tests against datasets.

The intent is that you prototype and refine locally in VS Code, then deploy to Microsoft Foundry for production-scale usage. A significant update in late 2025 added native support for deploying agents to Foundry with a single click, making the bridge between the two surfaces more seamless on paper -- though the seams remain visible in practice.

Microsoft Foundry documentation explains the cloud counterpart. It hosts model deployments, provides the Agent Service for orchestrating multi-tool agent runs, manages quota and billing, and exposes the full Azure-backed model catalog. When you deploy a model in Foundry, you create a dedicated endpoint in a specific Azure region, backed by your subscription's allocated tokens-per-minute (TPM) and requests-per-minute (RPM) quota. This is materially different from using the same model's GitHub-hosted variant through the AI Toolkit's local path -- and that difference is the root cause of most of the quota frustration developers encounter.

How the Two Systems Relate -- and Why That Creates Confusion
The AI Toolkit presents itself as a seamless bridge between local development and cloud deployment, but in practice the seams are very visible. Model catalogs are not unified: a model visible in AI Toolkit's catalog may not appear in Foundry's, and vice versa. Models available in GitHub Copilot Chat within VS Code may not appear in Agent Builder at all, even in the same session. These are distinct catalog surfaces backed by different APIs, and the fragmentation is a known and documented issue with the current toolchain.

Authentication layers are additive. Selecting a GitHub-hosted model in AI Toolkit triggers a GitHub OAuth flow. Connecting to a Foundry deployment requires Azure authentication. Some MCP tools require their own API keys on top of both. Each layer is independent -- and when one silently fails, the error messages are not always clear about which layer is the problem.

There is also a non-obvious UX split inside Agent Builder itself: the form is where you configure instructions, tools, and model connections, but the prompting interface -- where you actually run the agent -- is in the Playground tab. This has produced documented confusion, including a “Try in Agent” button that appears disabled with the message “Create a new agent first” while the user is actively in the process of creating one. The button becomes active only after the agent configuration has been saved and you have navigated to the Playground surface.

Finally, sync between AI Toolkit and the Foundry portal is not real-time. Model deployments made in the portal may not immediately appear in AI Toolkit's resource list. Going directly to Microsoft Foundry for deployment management is more reliable than relying on the extension's sync, especially for new deployments.

The Quota Problem: Why You Hit Walls and Why They're So Hard to Diagnose
The most common and frustrating failure mode when building agents in this ecosystem is hitting rate limits and quota errors with minimal explanation. Understanding why requires knowing that multiple independent rate-limiting systems are operating simultaneously.

When you use a model through the AI Toolkit's GitHub-backed path -- which includes popular models surfaced without an Azure deployment -- your requests are hitting the GitHub Models inference endpoint. As GitHub's own documentation notes, this endpoint is rate-limited by requests per minute, requests per day, tokens per request, and concurrent requests, and is not designed for production use cases. These limits are substantially lower than Azure-dedicated quotas, and they are especially punishing for agentic workflows, where a single agent run may make many sequential tool calls. Each tool call appends its output to the model's context window, meaning what looks like a modest prompt can easily balloon to tens of thousands of tokens by the second or third tool invocation -- tripping “input too large” errors on GitHub-hosted model variants even before rate limiting kicks in.

The AI Toolkit UI does surface a fix in its error toast: “Input too large for the GitHub o1 model. Please reduce it, use GitHub pay-as-you-go models or Deploy to Microsoft Foundry for higher limits.” The Foundry path is the more practical option, because the GitHub pay-as-you-go upgrade is not straightforwardly accessible from within the tooling itself.

Once you deploy a model in Foundry and connect your agent to that deployment, you draw from your Azure subscription's quota pool, allocated per model, per region, per deployment type, in units of TPM and RPM. The RPM is set proportionally to the TPM: a higher TPM allocation yields a proportionally higher RPM. This is why Foundry-connected agents feel more capable -- they are drawing from a dedicated pool scoped to your subscription rather than a shared public pool.

Foundry quota is further tiered. Microsoft uses a tiered quota system with a Free Tier and Tiers 1 through 6, where Tier 6 carries the highest allocations. As the Azure OpenAI quota documentation explains, initial tier assignment is based on existing usage patterns and your Microsoft relationship -- Enterprise Agreement customers start higher -- and tiers are designed to increase automatically as usage grows. This means you may hit limits early in a project simply because the system has not yet recognized your usage pattern.

Above the per-deployment quota sits a tenant-level cap per model that applies across all subscriptions, regions, and deployments under the same Azure tenant. As the Foundry Models quotas documentation makes clear, a customer's usage is defined as the total tokens consumed across all deployments, in all subscriptions, in all regions for a given tenant. This is the least visible limit: you can be well within a single deployment's quota while hitting a tenant-level ceiling. Quota increase requests exist for Azure OpenAI and Anthropic models, but as that same documentation notes, models from partners and community -- Llama, Mistral, Phi, and others -- do not support quota increases at all.

Separate from model rate limits, the Foundry Agent Service itself enforces fixed limits on file attachments per vector store, files per message, and connected tool count. These limits apply to all Foundry projects regardless of subscription type or region, and cannot be increased through quota requests. When you see a “Too Many Requests” error, it is worth checking the VS Code Output Panel rather than relying solely on the toast notification -- the Output Panel often identifies whether the error originates from the model endpoint, the Agent Service, or the MCP tool layer.

The Tool Support Gap: Why You Can't Tell Before Deploying
One of the most significant practical problems in the current ecosystem is that tool compatibility -- whether a model supports function calling, web search grounding, or MCP server invocations -- is not reliably surfaced at model selection time. This is not simply a documentation oversight; it reflects genuine architectural complexity.

Tool and function calling requires models to be specifically trained and fine-tuned to recognize structured tool definitions and generate valid tool call responses. This is distinct from general model intelligence: a highly capable reasoning model may not support tool calling if it was not fine-tuned for it, while a smaller, faster model may support it robustly. The Foundry model catalog spans Azure OpenAI models (which generally support tool calling), Anthropic models (which do), and an expanding catalog of community and partner models. These community models vary enormously in their tool support, and many were not fine-tuned for structured tool calling.

The model cards in the Foundry catalog display capability tags -- “Coding,” “Reasoning,” “Agents,” “Multilingual” -- but these are not comprehensive or guaranteed. The tool best practices documentation states that region and model together determine which tools are available to your agent, and notes that the tool availability table only accounts for service availability -- you also need to verify that the model you want is available in that same region. This information exists in scattered documentation tables, not in the model card UI before deployment.

Microsoft's own regional feature availability documentation is candid about this limitation, explicitly stating that it does not include a single real-time matrix for every model and feature combination, and directing developers to use linked service-specific pages to confirm current availability before deployment. The fragmentation has a second dimension: AI Toolkit's Agent Builder and GitHub Copilot Chat operate from different model catalogs. A model may be accessible in Copilot but absent from Agent Builder, or deployable in Foundry but not surfaceable in the Agent Builder model picker. The only reliable approach right now is empirical -- test the simplest possible tool call immediately after deployment, before investing time in prompt engineering or workflow construction.

Regional Availability: Why East US ≠ East US 2
Azure is a global network of physically distinct data centers, and model availability is not uniform across regions. For AI workloads, the differences are significant and frequently encountered.

Newer, high-capacity models require specific GPU hardware that Microsoft has not deployed uniformly across all Azure regions. East US (eastus) is an older region; East US 2 (eastus2) has received newer infrastructure investments more rapidly. The model and region support documentation for Foundry Agent Service is the authoritative reference for which models are available where, though it lags behind actual platform state for newly launched models. If your project is in East US and you encounter a “Model is not available in the selected AI service's region” error, the model you want is very likely available in East US 2.

Microsoft's regional feature availability documentation recommends validating three things before choosing a production region: whether your required model is available there, whether you have sufficient quota for your expected traffic, and whether all dependent services (Agent Service tools, Content Safety, etc.) are available in that region. The Foundry portal's deployment flow surfaces a region picker, but it only reveals incompatibilities after you attempt deployment -- not before.

One important note on multi-region strategy: as the quota documentation explains, quota is scoped per subscription, per region, per model, which means you can hold deployments in multiple regions under the same subscription without double-billing. If you need a model only available in East US 2 but your project hub is in East US, you can create a secondary Foundry project in East US 2, deploy the model there, and call that endpoint from your primary workflow.

MCP and Tool Connections: Giving Agents External Reach
The Model Context Protocol (MCP) is the primary standard for giving agents access to external tools -- web search, data retrieval, file systems, third-party APIs -- without requiring custom integration code for each tool. In the AI Toolkit Agent Builder, you add tools through either the MCP Server path (connecting an MCP server that exposes multiple capabilities) or the Custom Tool path (a single tool backed by your own code or API endpoint).

When adding an MCP server, you are offered a choice between a “local” connection -- running the MCP server process on your machine, requiring Node.js or Python depending on the server -- and a “foundry” connection, which uses a remotely hosted MCP endpoint. The local option is simpler to set up but introduces the same GitHub-backed quota constraints described above, since MCP tool outputs are injected into the model context and increase token consumption per turn. The Foundry-hosted MCP path integrates more cleanly with a Foundry-deployed model and scales with your Azure quota.

The Agent Service tool best practices documentation recommends registering only the tools your agent actually needs, since each tool definition consumes input tokens on every call. It also recommends reviewing run traces to confirm when your agent calls tools and to inspect tool inputs and outputs -- particularly useful for diagnosing cases where a model silently ignores tool definitions rather than returning an explicit error.

Best Practices for a Smoother Experience
The following practices reduce friction across the VS Code AI Toolkit and Foundry workflow, based on documented behavior and practical experience:

Model Selection and Deployment

  • Deploy to Foundry from the start for any workflow involving multiple tool calls or extended conversations. GitHub-hosted models are explicitly documented as designed for learning and experimentation, not production use cases.
  • Use Microsoft Foundry directly for deployments rather than relying on AI Toolkit's sync. The portal is more reliable and surfaces deployment errors more clearly.
  • Filter the Foundry model catalog by “Agent supported” before selecting a model, then cross-reference the model against the model and region support matrix for the specific tools you need.
  • Test tool compatibility immediately after deployment with the simplest possible tool call -- before investing time in prompt engineering or workflow construction.
  • For the broadest model selection, deploy in East US 2 or Sweden Central. Cross-region endpoint calls add minimal latency and maximize model access.
  • When context window size matters -- particularly for document processing, multi-step research, or editorial workflows -- prioritize models with large context windows over raw benchmark performance. A context-limited model will fail on real-world agentic tasks regardless of its reasoning score.

Quota and Rate Limit Management

  • Check your TPM/RPM allocation for each deployment via Foundry's quota management (Management Center > Quota) before running agent workflows in volume.
  • Request quota increases for Azure OpenAI and Anthropic models proactively, before you need them. Requests are evaluated individually and prioritized for customers with active existing usage.
  • Be aware that tenant-level caps exist above individual deployment quotas. If you are hitting limits inconsistent with your deployment quota, check tenant-level ceilings with Microsoft support.
  • Minimize MCP tools registered to an agent. Each tool definition consumes input tokens on every call. Remove tools not needed for the current task.
  • When debugging rate limit errors, check the Output Panel in VS Code rather than relying solely on the toast notification -- it often identifies whether the error is from the model endpoint, the Agent Service, or the MCP tool layer.

Agent Builder UX

  • Remember that Agent Builder (instructions and tool configuration) and the Playground (where you run the agent) are separate surfaces. Configuration changes take effect when you run in the Playground, not in Agent Builder itself.
  • Use the “Improve” button in the Playground to iterate on instructions -- it submits your current output back to the model for refinement suggestions.
  • Keep agent instructions precise and scoped. Longer instructions consume more tokens per turn and increase the likelihood of hitting context limits when combined with tool outputs.

Model Catalog Fragmentation

  • Do not assume model availability in GitHub Copilot Chat implies availability in Agent Builder, or vice versa. Always confirm the model appears in the surface you are actually building on.
  • If a model you need is not appearing in AI Toolkit's catalog, check the Foundry catalog directly -- Foundry's view is more complete and more current.
  • For MCP server connections, prefer the Foundry-hosted connection type over local when your model is deployed in Foundry. This keeps the tool call pipeline within Azure's managed infrastructure and avoids cross-environment authentication issues.

The Bigger Picture: Ecosystem Maturity
The VS Code AI Toolkit and Microsoft Foundry ecosystem is in active, rapid development. The 2025 updates added substantial capability -- particularly around MCP integration and the one-click Foundry deployment pathway -- but underlying fragmentation between Agent Builder, GitHub Models, Copilot, and the Foundry portal has not fully resolved. Different catalog surfaces, inconsistent model syncing, and underdocumented tool compatibility remain real friction points for developers at all skill levels.

The quota and rate limiting systems are improving -- the tiered quota auto-escalation system is a meaningful step forward -- but the opacity of those systems still requires developers to build empirical intuition rather than relying on published documentation alone. The tool support discoverability problem is perhaps the most actionable remaining gap: a model card that clearly indicates tool calling support, web search grounding compatibility, and regional availability would eliminate a substantial portion of the trial-and-error currently required.

Despite all of this, the core capability is real. A well-configured Foundry-deployed agent with appropriate MCP tools can reliably execute multi-step retrieval, synthesis, and output generation workflows that would have required substantial custom engineering infrastructure not long ago. The friction is in the tooling scaffolding, not in the underlying AI capability -- and that gap is closing.

Resources

comments powered by Disqus

Featured

  • Mastering AI Development and Building AI Apps with GitHub Copilot

    Two Microsoft experts explain how GitHub Copilot is evolving from a coding assistant into a broader platform for building, customizing and testing AI-powered developer workflows.

  • VS Code 1.123 Adds Agent Session Sync, 1M Context Windows

    Microsoft released Visual Studio Code 1.123 on June 3, adding agent-focused features, larger model context support, integrated browser updates and a new delay for some automatic extension updates.

  • Copilot Billing Shock Hits Developers

    Developer complaints about GitHub Copilot's new usage-based billing model have centered on unexpectedly rapid AI credit consumption, and neither GitHub nor Microsoft has responded directly to the backlash, though they have previously published guidance to lessen model usage costs.

  • Hands On with GitHub Copilot App Technical Preview: Turning a Blazor Issue into a PR

    GitHub's brand-new Copilot desktop app, in technical preview, handled a small Blazor issue from planning through pull request creation, but the hands-on test also showed why developers still need to verify agent work in the running app before merging.

Subscribe on YouTube