The Inference Bill Comes Due — Hard Constraints

For three years, the AI conversation has been dominated by training costs. How many GPUs, how many megawatts, how many billions. Training is the headline number because it’s big, singular, and easy to report. But training is capex. Inference is opex, and opex is what kills companies.

Here’s the arithmetic nobody wants on a slide. A model that costs $100 million to train and serves a billion requests a month at half a cent each is spending $5 million a month, or $60 million a year, on inference. At that rate the serving bill passes the training bill in 20 months, and it keeps running for as long as the product does. That’s fine if each request produces more than half a cent of value. It is very much not fine if you’ve priced your product as if compute were free, which, judging by the flat-rate subscriptions still floating around, plenty of teams did.
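
For anyone who wants to rerun that arithmetic with their own numbers, here is a minimal sketch. The function and constant names are made up for illustration; the three input figures are the ones in the paragraph above, and the model assumes flat traffic and a flat per-request cost, which real workloads rarely have.

```python
def monthly_inference_cost(requests_per_month: float, cost_per_request_usd: float) -> float:
    """Serving spend for one month, in dollars."""
    return requests_per_month * cost_per_request_usd


def months_to_match_training(training_cost_usd: float, monthly_cost_usd: float) -> float:
    """Months of serving after which cumulative inference spend equals the training cost."""
    return training_cost_usd / monthly_cost_usd


# Figures taken from the paragraph above.
TRAINING_COST_USD = 100_000_000
REQUESTS_PER_MONTH = 1_000_000_000
COST_PER_REQUEST_USD = 0.005  # half a cent

monthly = monthly_inference_cost(REQUESTS_PER_MONTH, COST_PER_REQUEST_USD)
annual = 12 * monthly
crossover = months_to_match_training(TRAINING_COST_USD, monthly)

print(f"monthly inference: ${monthly:,.0f}")
print(f"annual inference:  ${annual:,.0f}")
print(f"inference matches training after {crossover:.0f} months")
```

Swap in your own request volume and unit cost; the crossover month is the number worth putting on the slide.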

The factory analogy holds

Manufacturing people figured this out a century ago. The cost of building the factory matters less than the cost per unit coming off the line. Toyota didn’t win on cheaper factories; it won on relentlessly lower marginal cost at consistent quality. AI is now hitting the same wall, and the winners will be the ones who treat inference like a production line: measured, optimized, and costed to the fourth decimal.

The levers are familiar to anyone who’s run a plant. Quantization is scrap reduction — you’re throwing away precision the product didn’t need. Batching is line balancing. Speculative decoding is a faster conveyor with a quality check at the end. Distillation is redesigning the part so a cheaper machine can make it.
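
To make the first of those levers concrete, here is a rough sketch of symmetric int8 quantization in plain Python. The function names and sample weights are illustrative assumptions, not anyone's production code; real systems typically add per-channel scales, calibration data, and hardware-specific kernels that this leaves out.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only to show the rounding behavior.
weights = [0.5, -1.27, 0.003, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Worst-case rounding error is half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"scale: {scale:.4f}, max error: {max_error:.4f}")
```

Each stored value drops from four bytes (fp32) to one, and the price is a rounding error of at most half a step. Whether the product needed that precision is an empirical question you answer with an evaluation set, not an assumption.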

The uncomfortable truth: most production workloads don’t need a frontier model any more than most parts need aerospace tolerances.

What changes now

Expect three things. First, per-token pricing gets more honest, and some “unlimited” tiers quietly die. Second, small models eat a much bigger share of production traffic, not because they’re impressive, but because they clear the quality bar at a tenth of the cost. Third, inference hardware diversifies fast, because when something is, say, 80% of your compute bill, you stop being sentimental about your vendor.
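
The second prediction is easy to put numbers on. The sketch below computes a blended per-request cost when some share of traffic is routed to a small model. The half-cent frontier cost is the figure used earlier in this piece and the one-tenth ratio comes from the paragraph above; the 70% routing share is purely a hypothetical input, and the names are invented for illustration.

```python
def blended_cost_per_request(small_share: float,
                             small_cost_usd: float,
                             large_cost_usd: float) -> float:
    """Average cost per request when `small_share` of traffic goes to the small model."""
    if not 0.0 <= small_share <= 1.0:
        raise ValueError("small_share must be between 0 and 1")
    return small_share * small_cost_usd + (1.0 - small_share) * large_cost_usd


LARGE_COST_USD = 0.005                 # half a cent per request on the frontier model
SMALL_COST_USD = LARGE_COST_USD / 10   # "a tenth of the cost"
SMALL_SHARE = 0.70                     # hypothetical: 70% of requests clear the bar on the small model

blended = blended_cost_per_request(SMALL_SHARE, SMALL_COST_USD, LARGE_COST_USD)
savings = 1.0 - blended / LARGE_COST_USD

print(f"blended cost per request: ${blended:.5f}")
print(f"savings vs. all-frontier: {savings:.0%}")
```

The interesting variable is the routing share, which depends entirely on how many requests genuinely clear your quality bar on the cheaper model.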

None of this is bearish on AI. It’s the opposite: it’s what an actual industry looks like. Demos don’t have unit economics. Products do. The companies that survive the next two years won’t be the ones with the best benchmark scores. They’ll be the ones who know, precisely and per request, what their intelligence costs to make — and who charge accordingly.

The inference bill was always going to come due. The only question was whether you’d priced for it.