Tokenmaxxing: Why AI Token Leaderboards Are a Terrible Idea

There's a new term making the rounds in engineering circles: tokenmaxxing. Developers are running background agents on pointless tasks, regenerating code they've already written, and feeding entire codebases into context windows. Not because any of it is productive, but because someone is watching the numbers.

This isn't fringe behavior. It's happening at Meta, Amazon, and Uber, and it's the entirely predictable result of treating token consumption as a proxy for productivity.

The Scoreboard Problem

Several companies have rolled out internal leaderboards that rank engineers by how many AI tokens they consume. The logic: if AI tools make developers more productive, then developers who use more AI must be more productive. Track usage, rank people, and the laggards will get the message.

Every link in that chain is broken.

Amazon set weekly AI-usage targets for the bulk of its developer workforce and used internal leaderboards showing token consumption to track who was hitting them. The result? Engineers started running pointless tasks to inflate their scores. The Financial Times reported on the practice, and it's since become a case study in what happens when a productivity metric becomes a target.

Meta built an internal leaderboard ranking employees by token usage across its AI coding tools, with tiers like "Session Immortal" and "Token Legend." Token consumption came to be treated informally as a proxy for AI-driven productivity, while lower usage raised concerns about an employee's adaptation to AI tools. The company has set explicit goals for engineers to produce 50-80% of their code with AI assistance.

This is Goodhart's Law playing out in real time: "When a measure becomes a target, it ceases to be a good measure."

The Budget Explosions

The financial fallout is already here.

Uber burned through its entire 2026 AI budget in four months after incentivizing employees to adopt AI coding tools through an internal leaderboard ranking teams by total usage. Its COO has openly admitted he cannot connect the growth in token consumption to actual features shipped. The company is now questioning whether the investment is worth it at all.

Microsoft cancelled most Claude Code licenses in its Experiences and Devices division, the unit responsible for Windows, Microsoft 365, Outlook, Teams, and Surface, after the tool exhausted the division's annual AI budget. Engineers loved it. They used it heavily. And that was the problem. The company is now consolidating on GitHub Copilot CLI for tighter integration and cost control. The irony: the tool was cancelled not because it failed, but because it succeeded too well at generating token consumption.

These aren't scrappy startups burning through runway. These are two of the largest technology companies on earth, discovering in real time that incentivizing usage without measuring outcomes just means you spend more money faster.

We've Seen This Movie Before

Token counting is the lines-of-code metric of the AI era.

In the 1980s and 1990s, some organizations measured developer productivity by lines of code written per day. The problems were immediately obvious to anyone who'd actually written software: a developer who refactors 500 lines into 50 has done more valuable work than one who copy-pastes their way to 2,000 lines. But the metric said otherwise.

The same dynamic applies to tokens. A developer who uses AI to quickly prototype three approaches, picks the best one, and ships a clean implementation has done better work than one who runs an agent in a loop regenerating the same file to keep their numbers up. But the leaderboard can't tell the difference.

Other historically terrible productivity metrics in software:

Lines of code – incentivizes verbosity over clarity
Number of commits – incentivizes splitting work into trivial pieces
Tickets closed – incentivizes working on easy tickets and avoiding hard problems
Hours logged – incentivizes presence over output

Token consumption fits right in. It measures activity, not outcomes. And like every metric before it, the moment you tie it to performance reviews, people optimize for the metric instead of the work.

Who Benefits From This Narrative?

Ask yourself who profits when companies believe that more token consumption equals more productivity.

Nvidia CEO Jensen Huang has been explicit about this. At GTC 2026, he proposed that engineers should consume AI tokens worth roughly half their annual salary each year to be "fully productive." He's suggested that engineers should be evaluated based on how many AI tokens they use, and that compute usage should become a new measure of productivity. He's compared not using AI to "using paper and pencil for chip design."

This is not disinterested advice. Nvidia sells the GPUs that power every token generated, whether that's through direct hardware sales or the cloud providers purchasing capacity to offer AI-as-a-service. Every company that adopts a "more tokens = more productive" philosophy drives demand up the entire supply chain, and Nvidia sits at the top of it. When the CEO of a GPU company tells you that your engineers should each burn through six figures of compute annually, the conflict of interest is not subtle.

The Subsidy Trap

Here's what makes the current spending spree even more reckless: today's token prices are artificially cheap. Model providers are heavily subsidized by venture capital, billing tokens well below their sustainable rate to drive adoption. The playbook is old: get users hooked at below-cost pricing, build dependency into workflows and toolchains, then raise prices once switching costs are high enough that nobody can walk away.

If this sounds like a drug dealer's business model, that's because it is.

Companies building their engineering culture around maximizing token consumption are building on prices that won't last. When the subsidies dry up and providers need to charge rates that actually cover their compute costs, organizations that encouraged tokenmaxxing will face a brutal choice: absorb dramatically higher bills, or rip out workflows that have become load-bearing.

Uber's budget explosion happened at today's subsidized rates. What happens when those rates double or triple?

And yet, executives at major companies are buying the "more tokens = more productive" framing wholesale. Shopify CEO Tobi Lütke declared that AI usage is now a "fundamental expectation of everyone at Shopify" and that teams must demonstrate why they cannot get what they want done using AI before requesting headcount. The mandate itself isn't unreasonable, but when combined with usage tracking and performance pressure, it creates the same perverse incentives.

The Performance Review Trap

The most damaging version of this trend is tying AI token usage to performance reviews. Reports from Business Insider indicate that Meta, Google, and JPMorgan Chase are incorporating AI usage into performance evaluations. At Meta, employees with low token usage reportedly face concerns about their "adaptation to AI tools."

When your job security depends on a number going up, rational people will make that number go up by whatever means necessary. That's not a character flaw. It's what happens when you design bad incentives.

The developers running background agents on worthless tasks aren't lazy or dishonest. They're responding rationally to a system that measures the wrong thing. Tell someone their performance review depends on token consumption when they know token consumption has no correlation with the quality of their output, and they have two choices: do good work and risk a bad review, or game the metric and keep their job.

Most people pick the second option. You would too.

What Actually Matters

AI coding tools are genuinely useful. I use them every day. They're great for generating boilerplate, exploring unfamiliar APIs, rubber-ducking architecture decisions, writing tests, and translating between languages. I'm not arguing against using them.

But the value comes from the quality of the output, not the quantity of the input. A developer who uses 10,000 tokens to solve a problem that would have taken two hours has gotten real value. A developer who burns 500,000 tokens running an agent in circles and then fixes the result manually has wasted money.
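
To make that comparison concrete, here's a rough back-of-envelope sketch. The per-token price and the hourly rate are placeholder assumptions I made up for illustration, not real figures from any provider or company, and I'm assuming the looping session saved no time at all because the result had to be fixed by hand. Swap in your own numbers.

```python
def net_value_usd(tokens_used, hours_saved,
                  usd_per_million_tokens=10.0,    # hypothetical blended token price
                  loaded_hourly_rate_usd=100.0):  # hypothetical cost of an engineer-hour
    """Estimated value of the time saved minus what the tokens cost."""
    token_cost = tokens_used / 1_000_000 * usd_per_million_tokens
    return hours_saved * loaded_hourly_rate_usd - token_cost


# The two developers described above, under the assumed prices.
focused = net_value_usd(tokens_used=10_000, hours_saved=2.0)
looping = net_value_usd(tokens_used=500_000, hours_saved=0.0)

print(f"focused session:  {focused:+.2f} USD")
print(f"agent in circles: {looping:+.2f} USD")
```

A leaderboard would rank the second session 50 times higher than the first. Under these assumptions, the estimate puts the first session well into positive territory and the second below zero.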

The questions that actually matter for AI tool adoption:

Are we shipping features faster?
Is our code quality improving or degrading?
Are developers spending less time on tedious work?
Are we catching bugs earlier?
Is the cost justified by the time saved?

None of these questions are answered by a token leaderboard.
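
Two of those questions, shipping speed and code quality, can be roughly approximated from data most teams already have. Here's a minimal sketch, assuming you can export one record per merged change with an opened time, a merged time, and a flag for whether it later caused a defect. The Change record and the sample data are hypothetical, not taken from any real tracker.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass
class Change:
    """One merged change (hypothetical record shape)."""
    opened: datetime
    merged: datetime
    caused_defect: bool


def median_cycle_hours(changes):
    """Median time from opening a change to merging it, in hours."""
    return median((c.merged - c.opened).total_seconds() / 3600 for c in changes)


def defect_rate(changes):
    """Fraction of changes later linked to a defect."""
    return sum(c.caused_defect for c in changes) / len(changes)


# Made-up sample data for illustration only.
changes = [
    Change(datetime(2026, 1, 5, 9), datetime(2026, 1, 5, 17), False),
    Change(datetime(2026, 1, 6, 9), datetime(2026, 1, 7, 9), True),
    Change(datetime(2026, 1, 8, 10), datetime(2026, 1, 8, 14), False),
    Change(datetime(2026, 1, 9, 9), datetime(2026, 1, 9, 21), False),
]

print(f"median cycle time: {median_cycle_hours(changes):.1f} h")
print(f"defect rate: {defect_rate(changes):.0%}")
```

Compare these numbers before and after an AI rollout, or across teams with different usage. Neither function ever looks at a token count.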

The Path Forward

If you're an engineering leader thinking about AI adoption metrics:

Measure outcomes, not inputs. Track cycle time, defect rates, developer satisfaction, feature throughput. If AI tools are helping, those numbers improve regardless of how many tokens get consumed.

Set budgets, not targets. Give teams a token budget and let them decide how to spend it. If they ship great work using 20% of their budget, that's a win, not a concern.

Trust your engineers. The ones who find AI tools useful will use them. The ones who don't find them useful for their particular work shouldn't be penalized. Not every task benefits equally from AI assistance.

Ignore the hype from those who profit. When a GPU company tells you your engineers need to consume more compute, consider the source. When a model provider says higher usage means higher productivity, remember they bill by the token.

Watch for Goodhart's Law. If usage numbers are climbing but you can't point to corresponding improvements in output, you're measuring the wrong thing.
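
One crude way to run that check is to compare how fast usage is growing against how fast output is growing. This is only a sketch: the function names, the tolerance, and the monthly figures are all hypothetical, and "features shipped" stands in for whatever outcome measure you actually trust.

```python
def relative_growth(series):
    """Growth from the first to the last value, as a fraction of the first."""
    first, last = series[0], series[-1]
    return (last - first) / first


def usage_outpaces_output(tokens, outcomes, tolerance=0.5):
    """True when token growth exceeds outcome growth by more than `tolerance`.

    `tolerance` is an arbitrary, assumed threshold; tune it for your own data.
    """
    return relative_growth(tokens) - relative_growth(outcomes) > tolerance


# Made-up monthly figures for illustration only.
monthly_tokens = [40_000_000, 90_000_000, 160_000_000, 240_000_000]
monthly_features = [12, 13, 12, 13]

if usage_outpaces_output(monthly_tokens, monthly_features):
    print("Usage is climbing much faster than output. Check what you're measuring.")
```

A flag here doesn't prove anyone is gaming the numbers. It only tells you that usage has stopped telling you anything about output.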

The companies that get the most value from AI coding tools will be the ones that treat them as tools, not as metrics. A hammer is useful. Counting hammer swings tells you nothing about how well the house is built.