In-Depth
Going Local (& a Bit Loco) with Open-Source AI in VS Code
I thought using local, open-source AI models for my work in Visual Studio Code would be great, providing benefits such as:
- More privacy, because drafts, notes, and half-baked ideas can stay on my own machine instead of heading off to somebody else's servers.
- Lower ongoing cost, with no per-prompt or per-token meter running in the background every time I ask the model to do something.
- Offline or low-connectivity use, which is appealing if I'm on the road, on flaky Wi-Fi, or just don't want to depend on a cloud service being available.
- More control over the stack, including the ability to choose the model, swap it out, update it, or experiment with alternatives when something isn't working.
- Freedom from vendor lock-in, or at least the comforting illusion of it, since open models and local runtimes promise a more portable workflow.
- No quota surprises, rate limits, or "come back later" moments tied to someone else's platform policies.
- The geeky satisfaction of making VS Code do something a little more self-contained, customizable, and under my control than the standard cloud-AI setup.
That all might be true -- if you have a monster machine with tons of RAM and a beefy GPU, plus the patience to wade through a still-evolving patchwork of model catalogs, runtimes, tool-use requirements, and UI quirks that can make "run it locally" sound a lot simpler than it actually is.
The reality is, if you have a machine with an 11th Gen Intel Core i5-1135G7 processor, 12 GB of RAM, and no dedicated GPU -- like my company laptop -- "local AI" starts to feel less like a sleek, futuristic workflow and more like an endurance test.
Look at what just Ollama and VS Code did to my resources:
[Click on image for larger view.] Not enough resources. (source: Ramel).
That dog PC is all I had to work with, and through countless hours of testing, tweaking, and troubleshooting, it was almost -- but not quite-- able to successfully run one of my primary editorial workflows: formatting a freelance article with custom CSS and HTML tags demanded for our CMS. I could use it with my accustomed practice of using GitHub Copilot Chat agents to handle the formatting, which was a nice bonus. I could actually choose my agent and model from Chat window's pickers, just like normal with my high-powered 4,000-word instruction set aptly handled by Claude Sonnet 4.5.
Well, that comes with an important caveat: My new local, open-source setup could do the job like a toddler could run the Boston Marathon along with the last winner, Kenya's John Korir. Juuuuuuuust a slightly different experience.
It Looked So Promising in AI Toolkit
To get there, I first had to learn that not everything in VS Code that looks local is actually local in the way a normal person might interpret that word. I started in Microsoft's AI Toolkit extension, because that seemed like the obvious, all-in-the-family path. It presented a nice model catalog, tabs for different back ends, and enough shiny buttons to make me think I was about 10 minutes away from editorial-AI bliss.
[Click on image for larger view.] AI Toolkit's Ollama catalog made the local route look easy at first glance. (source: Ramel).
I was not.
Foundry Local Was My First Plot Twist
One early stop was Foundry Local, which sounded perfect. Local. Foundry. Microsoft. VS Code. What could go wrong? Quite a bit, as it turned out. Instead of giving me the cozy, fully on-device experience I had in mind, it smacked me with a tokens-per-minute quota error. Quota. For "local." That was my first clue that this journey was going to involve a lot of branding archaeology and a lot less smooth sailing than advertised.
[Click on image for larger view.]My first "local" adventure ended with a quota error, which was not exactly the on-device dream I had in mind. (source: Ramel).
Ollama Was the Real Local Path
So I backed out of that path and started poking around the Ollama side of AI Toolkit, which is where things got more real. That was the genuinely local route, but it came with its own gotcha: even if the models show up neatly in the catalog inside VS Code, you still need Ollama installed and running underneath for that part of the magic trick to work. So now I was downloading Ollama, waiting on installers, waiting on model downloads, and generally spending more time watching progress bars than doing any actual editorial work.
AI Toolkit quickly made clear that I needed Ollama installed locally.. (source: Ramel).
Then the Hardware Started Laughing at Me
Then came the hardware reality check. Some of the models that looked promising were simply too big for my machine. Phi-4? Nice idea. Also, on this laptop, basically a fantasy. So I had to shop in the bargain bin of smaller models that might actually fit in memory without pushing the whole machine into a medically concerning state.
That led me to Gemma3:4b, which initially seemed like the Goldilocks option. Not too big, not too small, maybe just right. Except it wasn't. It fit the machine better, but it didn't fit the workflow. In practice, it didn't really cooperate with the agent setup I wanted. It could exist in the ecosystem, sure, but that didn't mean it could play nicely in the places I needed it to play nicely.
On my laptop, Gemma3:4b looked like the most realistic option. (source: Ramel).
Added Is Not the Same as Usable
That turned into one of the more useful lessons of this whole exercise: not every model you can see in VS Code is a model you can actually use the way you want. In Model Management, some models could be right-clicked and added to the model picker in the GitHub Copilot Chat window. Some could not. Gemma3:4B could not. That wasn't random. The dividing line was tool use. If a model couldn't support the right capabilities, it might exist in the interface, but not in the workflow I actually cared about.
Gemma3:4b was definitely installed and sitting there under My Resources. That still didn't make it the right model for the workflow. (source: Ramel).
That is how I eventually wound up with Ollama-hosted Qwen2.5 as the model that fit the bill. Not because it was perfect. Not because it had the best prose. Not because angels sang when I selected it. It just happened to be the open-source local model that could be surfaced in the Copilot Chat model picker and used in a way that felt at least vaguely compatible with my existing workflow.
[Click on image for larger view.] Qwen2.5 (source:Ramel).
Why Qwen2.5 Made the Cut When Gemma3 Didn't
The reason comes down to something called tool use, and it's worth understanding because it will save you the confusion I had to work through the hard way.
When you open VS Code's Model Management -- via the Command Palette, under "Language Models: Manage Language Models" -- you can see all the models available to you, including local ones running through Ollama. But right-clicking a model reveals something important: some have an option to add them to the Copilot Chat model picker. Some simply don't. That option is either there or it isn't, and there's no explanation attached to tell you why.
The answer is tool-use support. For a model to be selectable in GitHub Copilot Chat for anything resembling an agent workflow -- reading files, following instructions, taking actions -- it has to support the tool-calling protocol that Copilot Chat depends on. Models that don't support it can exist in the interface, can run in the AI Toolkit playground, can respond to basic prompts. But they can't be plugged into the Copilot Chat model picker for agent use. Gemma3:4b fell into that category. Qwen2.5 did not. That single difference was what determined which model I could actually work with.
[Click on image for larger view.] Show/Hide in Chat Model Picker (source:Ramel).
So if you're shopping for a local model and wondering why a perfectly good-looking candidate won't show up where you need it: check whether it supports tool use. That's the filter that matters, and it isn't prominently labeled anywhere in the UI.
My Workflow Was Not Exactly a Lightweight Test
That existing workflow, by the way, was not some cute little "summarize this paragraph" test. I already had a hulking instruction set for formatting articles -- thousands of words of directions covering HTML cleanup, quote handling, dash conversion, subheads, image markup, sidebars, takeaways, summaries, social posts, QA mode, and special branching rules for different authors. With Claude Sonnet 4.5, that whole monstrosity worked amazingly well. So naturally I thought: what if I just swapped in a local open-source model and kept everything else the same?
That is usually the moment in these stories when the narrator pauses to reflect on his own innocence.
Because while Qwen2.5 could run locally and could be selected in the Copilot Chat window, that did not mean it could gracefully absorb a giant agent file, figure out which author-specific branch to apply, locate the right HTML file, preserve all the finicky CMS conventions, and then output something ready for polite society. Sometimes it went generic. Sometimes it got confused about file paths. Sometimes it seemed to lose the plot entirely and head off in the wrong formatting direction. At one point it behaved like it had wandered into the wrong newsroom entirely.
I tried to help. I simplified the setup. I stripped PowerShell references out of the instructions because the local model clearly wasn't going to become my cheerful little shell-scripting assistant. I reduced friction where I could. I spoon-fed file paths. I explicitly referenced instruction files. I nudged, coaxed, pleaded, and occasionally stared at the screen in the universal expression of people who have trusted software too much.
Eventually, Yes, It Worked -- Sort Of
And yes, eventually, I got it to do enough that I could say the concept worked. That matters. It means this is not vaporware, not fantasy, not some "well, theoretically" thought experiment. I really did get a local open-source model running in VS Code, exposed in the GitHub Copilot Chat picker, and pointed at a real editorial formatting workflow.
But I also learned that there is a canyon between "it runs" and "it replaces the cloud model I was already using."
The Speed Gap Was Not Subtle
Speed alone was enough to drive that point home. A cloud model like Claude, running in infrastructure built for this kind of thing, makes my big formatting workflow feel brisk and confident. The local setup on this laptop felt like every token had to be individually mined out of the earth by hand. That is not a bug. That is just physics, economics, and a little bit of cruelty.
And yet, weirdly, that made the experiment better.
If everything had worked perfectly, I would have wound up with a boring article about how to click a few buttons and enjoy the future. Instead, I got a far more useful story: yes, you can run open-source AI locally in VS Code. Yes, you can even wedge it into a familiar Copilot Chat workflow under the right conditions. But the details matter. The hardware matters. The model's capabilities matter. The current state of the integrations matters. And if your grand plan involves handing a small local model a four-thousand-word rulebook and expecting frontier-model behavior on a modest office laptop, you may discover new emotional textures.
What This Project Actually Proved
Still, I don't consider the project a failure. Quite the opposite. It proved that local AI in VS Code is real enough to be worth exploring, especially for privacy-conscious or cost-conscious workflows. It also proved that "real" does not automatically mean "ready for my most demanding use case." Those are not the same thing, and the sooner people say that out loud, the better.
So yes, I went local. And yes, I went a bit loco waiting literally hours for a prompt to complete -- or just freeze and time out. But somewhere between the quota confusion, the Ollama install, the model picker detective work, the stripped-down agent file, and the glacial formatting run on an overmatched laptop, I did get my answer: the future is here, it just wheezes a little on older hardware.
(Hey boss, about that new laptop request ...)
About the Author
David Ramel is an editor and writer at Converge 360.