One Model for Everything Is Lazy and Expensive

2026-03-16 · 7 min read · ai, engineering, cost

GPT-4 for everything is the AI equivalent of using a Ferrari to deliver groceries. Here's how we route tasks across 5 models and why it cuts costs by 10x.


You're using GPT-4o for everything. I know because everyone is. Classification, extraction, summarization, voice, OCR. One model. One API key. One monthly bill that makes your finance team send increasingly passive-aggressive emails.

It works. That's the problem. It works well enough that nobody questions it. And "well enough" at scale is how you spend $15,000 a month on something that should cost $1,500.

The Ferrari problem

GPT-4o is a remarkable model. It's also $2.50 per million input tokens and $10 per million output tokens. For complex reasoning tasks, multi-step analysis, or nuanced generation, that price is justified. The model earns its keep.

For classifying a support ticket into one of eight categories? You're paying a surgeon to put on a Band-Aid.

This isn't about being cheap. It's about being precise. Different tasks have fundamentally different computational requirements. A classification task needs pattern matching. An extraction task needs structure recognition. A voice task needs real-time streaming. An OCR task needs visual understanding. Pretending these are the same task because they all involve "AI" is like saying a screwdriver and a crane are the same tool because they both build things.

What smart routing looks like

We built a healthcare platform last year. Seven AI components. If we'd used one model for everything, the monthly inference bill would have been north of $8,000. Instead, we routed:

| Task | Model | Why | Cost / 10K requests |
| --- | --- | --- | --- |
| Triage classification | GPT-4o-mini | Binary/categorical, needs speed not depth | $42 |
| Medical document extraction | Gemini 2.5 Pro | Best-in-class structured output, long context | $315 |
| Voice agent | Azure Realtime API | Sub-200ms latency requirement, streaming | $630 |
| Scanned report OCR | DeepSeek via Vertex AI | Visual understanding at 1/10th the cost | $95 |
| Patient summary generation | Claude Haiku | Concise output, handles medical terminology | $68 |

Total: ~$1,150/month. Single-model approach: ~$5,830/month. Same outputs. Same quality where it matters. 80% cost reduction.
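The arithmetic is simple enough to sanity-check yourself. The figures below are the per-10K-request costs quoted in this post; the single-model column assumes GPT-4o for every task:

```python
# Per-task monthly costs (USD, 10K requests/task) from the tables in this post
routed = {
    "triage": 42,        # GPT-4o-mini
    "extraction": 315,   # Gemini 2.5 Pro
    "voice": 630,        # Azure Realtime API
    "ocr": 95,           # DeepSeek via Vertex AI
    "summary": 68,       # Claude Haiku
}
# Same workload priced through GPT-4o alone
single_model = {
    "triage": 840,
    "extraction": 1260,
    "voice": 2100,
    "ocr": 950,
    "summary": 680,
}

routed_total = sum(routed.values())        # 1150
single_total = sum(single_model.values())  # 5830
savings = 1 - routed_total / single_total  # ~0.80
print(f"${routed_total} vs ${single_total}: {savings:.0%} saved")
```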

[Interactive: Cost Router — Single Model vs. Smart Routing]

| Task | Model | Monthly cost |
| --- | --- | --- |
| Classification | GPT-4o | $840 |
| Document extraction | GPT-4o | $1,260 |
| Voice processing | GPT-4o | $2,100 |
| OCR / scans | GPT-4o | $950 |
| Summarization | GPT-4o | $680 |
| **Monthly total (10K requests/task)** | | **$5,830** |

One model. One price. One way to burn money.

Each model has a lane

The "just use GPT-4" crowd misses something fundamental: models aren't uniformly good at everything. They have specializations that emerge from their architecture, training data, and optimization targets.

GPT-4o-mini is absurdly good at classification and simple extraction. It's fast, cheap, and accurate on structured tasks. Trying to use it for complex multi-hop reasoning will disappoint you. But for routing a ticket, parsing a date, or categorizing a document? It's the right tool.

Gemini 2.5 Pro handles long documents better than anything else on the market right now. A 200-page contract? Gemini processes the entire thing in one pass with strong structural understanding. GPT-4o would need chunking strategies that add latency and lose cross-reference context.

Claude Haiku writes the cleanest concise output. For summarization tasks where you need tight, readable prose without filler, it consistently outperforms models 10x its cost. It doesn't over-explain. It doesn't add caveats to caveats.

DeepSeek via Vertex AI is the OCR play. Visual document understanding at a fraction of GPT-4V pricing. For scanned documents, handwritten forms, and photographed receipts, the accuracy matches the premium models at 10% of the cost.

Azure Realtime API is the only viable option for production voice agents. Sub-200ms response time. Streaming. Interruption handling. You can't build a voice agent on a batch API and expect users to wait three seconds between turns.

The eval pipeline is the product

Choosing models isn't a one-time decision. It's a continuous process. Models update. Pricing changes. New competitors emerge. Last month's best-in-class extraction model might be dethroned by a model that costs half as much.

This is why eval pipelines matter more than model selection. A proper eval pipeline:

  1. Defines task-specific metrics. Classification accuracy. Extraction F1 score. Voice latency p95. OCR character error rate. Each task has its own success criteria.
  2. Tests on your actual data. Not benchmarks. Not synthetic examples. Your messy, inconsistent, real-world data with all its quirks.
  3. Runs automatically. When a new model drops or a provider changes pricing, the pipeline re-evaluates without human intervention.
  4. Produces a routing table. Task X goes to Model Y. With cost and accuracy data to justify the decision.
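The core of that loop fits in a few lines. This is a sketch, not a framework: the model names, metric values, and accuracy thresholds below are hypothetical placeholders, and a real pipeline would plug in task-specific metrics and live eval data.

```python
# Sketch: pick the cheapest model that clears a task's accuracy bar.
from dataclasses import dataclass

@dataclass
class EvalResult:
    model: str
    accuracy: float      # task-specific metric: F1, CER, latency p95, ...
    cost_per_10k: float  # USD per 10K requests

def pick_model(results: list[EvalResult], min_accuracy: float) -> str:
    """Return the cheapest model meeting the accuracy threshold."""
    qualified = [r for r in results if r.accuracy >= min_accuracy]
    if not qualified:
        raise ValueError("no model meets the accuracy threshold")
    return min(qualified, key=lambda r: r.cost_per_10k).model

# Hypothetical classification results from one eval run
results = [
    EvalResult("gpt-4o", 0.97, 840),
    EvalResult("gpt-4o-mini", 0.95, 42),
    EvalResult("claude-haiku", 0.93, 68),
]
print(pick_model(results, min_accuracy=0.94))  # gpt-4o-mini
```

Run this after every model release or price change, and the routing table updates itself with a justification attached.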

The organizational resistance

I've pitched smart routing to engineering leads who looked at me like I'd suggested they adopt a second child. "We don't want to manage five different APIs." "What if one provider has an outage?" "Our team only knows OpenAI."

These are valid concerns dressed up as objections. Here's how they resolve:

"Multiple APIs are complex." An abstraction layer takes two days to build. One function that accepts a task type and returns a model response. The routing logic lives in a config file. Your application code never knows which model it's talking to.
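Here's roughly what that layer looks like. The task names are illustrative and the provider call is stubbed; in production each model dispatches to its actual SDK.

```python
# Minimal routing layer: task type in, completion out.
# ROUTES would live in a config file, not in code.
ROUTES = {
    "classify": "gpt-4o-mini",
    "extract": "gemini-2.5-pro",
    "summarize": "claude-haiku",
}

def call_provider(model: str, prompt: str) -> str:
    # Stub: in production this dispatches to the matching provider SDK.
    return f"[{model}] response"

def complete(task: str, prompt: str) -> str:
    model = ROUTES[task]  # application code never names a model directly
    return call_provider(model, prompt)

print(complete("classify", "route this support ticket"))  # [gpt-4o-mini] response
```

Swapping a model is a one-line config change; no caller knows or cares.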

"Provider outages." This is actually an argument FOR multi-model routing. If your entire system depends on one provider and that provider goes down, you go down. With multiple models, you have fallback paths. OpenAI is rate-limiting you? Route overflow to Anthropic. Azure is down? Fall back to Gemini. Resilience is a feature, not a bug.
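A fallback chain is just an ordered retry. This sketch uses hypothetical model names and a deliberately broad exception catch; a real implementation would narrow it to rate-limit and availability errors.

```python
import time

# Ordered fallback chains per task (names are illustrative)
FALLBACKS = {"extract": ["gemini-2.5-pro", "gpt-4o", "claude-sonnet"]}

def complete_with_fallback(task, prompt, call, retries_per_model=2):
    """Try each model in the task's chain; move on after transient failures."""
    last_err = None
    for model in FALLBACKS[task]:
        for attempt in range(retries_per_model):
            try:
                return call(model, prompt)
            except Exception as err:  # narrow to rate-limit/outage errors in practice
                last_err = err
                time.sleep(0.1 * (attempt + 1))  # simple linear backoff
    raise RuntimeError("all providers failed") from last_err
```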

"Our team only knows one SDK." The SDKs are nearly identical. If your team can use the OpenAI SDK, they can use the Anthropic SDK. The learning curve is measured in hours, not weeks.

Cost is a design constraint, not an afterthought

The teams that treat cost as a design constraint from day one build better systems. Not because they're frugal, but because cost-awareness forces precision. When every token has a price tag, you think harder about what you're sending to the model. You compress prompts. You cache responses. You preprocess inputs to strip irrelevant content.

These optimizations don't just save money. They improve performance. Shorter prompts mean faster responses. Cached responses mean zero latency on repeat queries. Preprocessed inputs mean the model gets cleaner data and produces better output.
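Caching repeat queries, for instance, is a one-function change. A sketch using an in-process dict; a production system would back this with Redis or similar, and only cache deterministic (temperature-zero) calls.

```python
import hashlib

_cache: dict[str, str] = {}

def cached_complete(model: str, prompt: str, call) -> str:
    """Return a cached response for identical (model, prompt) pairs."""
    key = hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call(model, prompt)  # only pay for the first request
    return _cache[key]
```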

The team that uses GPT-4 for everything doesn't think about any of this. They throw the full context at the biggest model and wait. It works. It's slow. It's expensive. And every optimization they eventually need to make is a retrofit instead of a design decision.

The model changes. The routing stays.

GPT-5 will come out. GPT-6 after that. Gemini will release new versions. Anthropic will update Claude. Models are a moving target. If your architecture is coupled to a specific model, every upgrade is a migration project.

If your architecture routes tasks to capabilities, upgrades are config changes. New model drops with better extraction performance? Update the routing table. Run the eval pipeline. If it passes, swap it in. No code changes. No deployment risk. No three-week migration sprint.

This is what engineering looks like in AI. Not picking the best model. Building the system that always uses the best model, automatically, for each specific task, at the lowest viable cost.

One model for everything is the easy choice. It's also the expensive one. And in production, easy and expensive eventually become just expensive.