Why I Stopped Forcing Llama 3 to Work (And What Switching to Gemma 3 Taught Me)
There’s a point when working with AI where the question quietly changes. Early on, it’s all about capability—what a model can do. But over time, especially when you start building real systems, that question becomes something else entirely.
Not “Can it do this?”
But “Why am I having to work so hard to make it do this consistently?”
That’s where I found myself with Llama 3.
I was running everything locally using Ollama, with orchestration through LangChain. The goal wasn’t anything exotic—take in structured or semi-structured data, extract specific values, and pass that into a downstream process. Straightforward in theory. In practice, it turned into a constant exercise in second-guessing.
The same input wouldn’t always produce the same result. Sometimes fields would be missing. Other times values would shift slightly, just enough to cause problems later on. And occasionally, the model would return something that I couldn’t trace back to the original input at all. That last one is what really breaks confidence. If you can’t explain where an output came from, you can’t rely on it.
At first, I assumed it was something I was doing wrong. So I did what most people do in that situation—I kept improving the prompt. I added more structure, more rules, more explicit instructions. I tried to close every gap the model might fall through. And it did help, at least temporarily. But the cost of that improvement started to show up in other ways. Prompts got longer, responses got slower, and the whole system became fragile. Small changes in wording or input format could undo all that work.
Looking back, what I was really doing was compensating for a mismatch. I wasn’t solving the problem—I was building layers around it.
That realization didn’t fully click until I switched to Gemma 3, still running locally through the same setup. On the surface, the change seemed simple. Outputs became more consistent. I didn’t need nearly as much instruction to get reliable results. Responses were faster, partly because the prompts themselves were smaller. But the more important shift wasn’t technical—it was conceptual.
Gemma 3 didn’t magically “fix” anything. It just made it obvious that I had been trying to force one model into a role it wasn’t naturally suited for.
That’s the part that tends to get overlooked. Different models don’t just vary in size or performance—they behave differently. They have tendencies. Some lean toward flexibility and interpretation. Others are more rigid, more literal, more predictable. When you push a model too far outside its natural behavior, you can usually get it to comply—but only by adding more instructions, more constraints, more effort. Eventually, you end up maintaining the prompt more than the system itself.
What changed for me wasn’t just swapping Llama 3 for Gemma 3. It was stepping back and asking a simpler question:
Is there a better model for what I’m trying to do?
Once you look at it that way, a lot of things fall into place. Instead of trying to standardize behavior through increasingly complex prompts, you start selecting tools based on how they naturally operate. Some models are better suited for open-ended reasoning or content generation. Others are better for structured, repeatable tasks where consistency matters more than creativity. Neither approach is inherently better—it just depends on what you need.
In my case, I wasn’t looking for creativity. I needed repeatability. I needed to be able to run the same input twice and get the same answer, or at least something close enough that it wouldn’t break downstream logic. Gemma 3 aligned with that need more naturally, which meant I could remove complexity instead of adding it.
Llama 3 still has its place. For exploratory work, brainstorming, or anything where flexibility is an advantage, it’s a strong option. But trying to use it as the backbone of a repeatable, structured pipeline required more effort than it should have.
That’s really the takeaway. Not that one model is better than another, but that trying to make a single model do everything leads to unnecessary complexity. When something feels harder than it should be—when you’re adding more instructions just to keep things stable—it’s worth stepping back and asking whether the model itself is the right fit.
Because in a lot of cases, the problem isn’t that the model can’t do the job.
It’s that it’s not the one that does it naturally.

Leave a Reply