Your Increasing AI Token Spend is an Architecture Problem

Businesses of all sizes are under pressure to do more with AI. More users are testing AI copilots. More teams are building agents. More workflows now include retrieval, tool calls, reasoning steps, and long prompts.

At the same time, token prices are falling. Stanford HAI reported that the cost of querying a GPT-3.5-level model fell from $20 per 1 million tokens in November 2022 to $0.07 by October 2024. On paper, that sounds like AI should be getting cheaper.

But in practice, many enterprise AI bills are still growing.

The reason is simple: companies are not just sending short prompts anymore. They are sending long documents, full chat histories, system prompts, tool instructions, search results, and agent steps into every request. In short, the amount of context being sent to AI systems is growing. Stanford Digital Economy Lab found that agentic coding tasks used about 1000x more tokens than code chat or code reasoning tasks, and the same task could vary by up to 30x in token use.

That is the core issue behind token maxing, the trend of offloading more and more context and tasks to AI to automate as many processes as possible. The cost is not only coming from the model; it is coming from how the full AI system is built and used.

What are Tokens?

A token is a small unit of text that an AI model reads or writes. It can be a word, part of a word, a number, a space, or a symbol.

For example, the input prompt “Summarize this customer contract and flag the renewal terms.” gets split into a list of tokens such as “Sum mar ize this customer contract and flag the renewal terms .” Each of these tokens is then converted to a number that the LLM can process and use to generate new tokens that answer the question. So what looks to a person like a nine-word request actually takes up around twelve to fifteen tokens, depending on the LLM.

While that short sentence may only use a small number of tokens, the actual AI request may include much more than that. It may include the full contract, the company policy, user permissions, past notes, system instructions, tool descriptions, and formatting rules.

So the model is not just processing the user’s sentence. It is processing the full package sent with the request.

This is why token cost can grow quietly. A simple user question may trigger thousands of input tokens before the model even starts writing an answer. If the workflow uses an agent, it may call tools, read more data, retry steps, and generate more tokens across the process. Each of those steps adds to the number of tokens the model has to process.

That is how one small question can become a costly AI task.
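To make the arithmetic concrete, here is a minimal sketch of how a request package adds up. Every token count and the price per million tokens are hypothetical placeholders chosen for illustration, not measurements or any vendor's real pricing, and the agent run is simplified to resending the same package on each model call.

```python
# Hypothetical token counts for one request package (illustrative only).
REQUEST_PACKAGE = {
    "user_question": 15,
    "system_instructions": 800,
    "tool_descriptions": 1_500,
    "company_policy": 2_000,
    "full_contract": 12_000,
    "past_notes": 1_200,
}

# Assumed price in dollars per 1 million input tokens (placeholder value).
PRICE_PER_MILLION_INPUT = 1.00


def input_tokens(package: dict[str, int]) -> int:
    """Total input tokens sent with a single model call."""
    return sum(package.values())


def task_cost(package: dict[str, int], model_calls: int, price_per_million: float) -> float:
    """Input cost of a task that resends the same package on every model call.

    Simplification: ignores output tokens and any context added by tool results.
    """
    return input_tokens(package) * model_calls * price_per_million / 1_000_000


single_call = task_cost(REQUEST_PACKAGE, model_calls=1, price_per_million=PRICE_PER_MILLION_INPUT)
agent_run = task_cost(REQUEST_PACKAGE, model_calls=8, price_per_million=PRICE_PER_MILLION_INPUT)

total = input_tokens(REQUEST_PACKAGE)
print(f"Input tokens per call: {total:,}")
print(f"User question share of input: {REQUEST_PACKAGE['user_question'] / total:.2%}")
print(f"One call: ${single_call:.4f}, eight-call agent run: ${agent_run:.4f}")
```

In this made-up package the user's sentence is a tiny fraction of what the model reads, which is why trimming the question itself does little for the bill.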

Why Token Load is an Architecture Problem

Many teams try to reduce AI cost by switching to a cheaper model. That can help, but it does not solve the full problem.

If the AI system is sent too much context, repeats the same instructions, loads every tool at once, or routes every task to a large model, the cost will still rise. The model price may be lower, but the workflow is still wasteful. Furthermore, using a less powerful model that is not right for the job increases the risk that the entire workflow fails, with the team still paying for the tokens that were consumed.

This is why token load is an architecture problem.

A strong AI architecture decides what context is needed, which model should handle the task, when to cache repeated content, when to stop an agent loop, and how to track cost across each step.

A weak architecture sends too much to the model and hopes the output improves.

Research does not support that assumption. Stanford Digital Economy Lab found that higher token use did not always improve accuracy in agentic tasks [Stanford Digital Economy Lab]. Long-context research also shows that more context does not always mean better results, especially when the right information gets buried inside a large prompt [Lost in the Middle, NoLiMa].

In enterprise AI, more tokens can mean more cost, more latency, and more noise.

The better question is not: “Which model is cheapest?”

The better question is: “How do we get the right result with the smallest useful amount of context and the best model for the job?”

How Can You Improve Token Efficiency?

Token efficiency means getting the right output without sending more tokens than the task needs.

It is not about cutting context blindly. It is about sending the right context, to the right model, at the right time.

There are five practical areas to focus on: context, model routing, learning, architecture redesign, and guardrails.

Send the right context, not all the context

Context is one of the biggest drivers of token cost.

Every AI system needs context to give useful answers. For example, a support AI may need customer history, ticket data, product docs, and policy rules. But it does not need every document, every past chat, or every tool description for every request.

This is where many systems go wrong. They send too much information because it feels safer. But too much context can make the request slower, more costly, and harder for the model to use.

A better approach is to build context in steps.

First, identify the user’s task. Then retrieve only the most relevant records. Next, remove duplicate or stale content. Finally, send the model only the information it needs to answer or act.
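The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Record` type, the relevance scores, the one-year freshness window, and the token budget are all hypothetical, and a real system would get scores from its retrieval layer and token counts from its tokenizer.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    text: str
    relevance: float      # assumed score from a retrieval step, 0.0 to 1.0
    last_updated: date
    tokens: int           # assumed to be precomputed by a tokenizer


def build_context(records, today, max_age_days=365, min_relevance=0.5, token_budget=2000):
    """Keep relevant, fresh, unique records until the token budget is used up."""
    seen = set()
    selected = []
    used = 0
    # Most relevant records first, so the budget goes to the best matches.
    for rec in sorted(records, key=lambda r: r.relevance, reverse=True):
        if rec.relevance < min_relevance:
            continue  # not relevant enough for this task
        if (today - rec.last_updated).days > max_age_days:
            continue  # stale content
        key = " ".join(rec.text.lower().split())  # normalize case and whitespace
        if key in seen:
            continue  # duplicate of a record already selected
        if used + rec.tokens > token_budget:
            continue  # would exceed the budget; a smaller record may still fit
        seen.add(key)
        selected.append(rec)
        used += rec.tokens
    return selected, used


records = [
    Record("Renewal terms: 12 months, auto-renew.", 0.92, date(2025, 3, 1), 900),
    Record("renewal terms: 12 months,  auto-renew.", 0.90, date(2025, 3, 1), 900),  # duplicate
    Record("Old pricing sheet.", 0.80, date(2021, 1, 1), 700),                      # stale
    Record("Office party schedule.", 0.10, date(2025, 2, 1), 300),                  # irrelevant
    Record("Termination clause summary.", 0.75, date(2025, 1, 15), 800),
]
context, used = build_context(records, today=date(2025, 6, 1))
print([rec.text for rec in context], used)
```

Of the five candidate records, only two reach the model in this example; the duplicate, the stale sheet, and the irrelevant note are dropped before any tokens are spent on them.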

Prompt caching can also help when the same content is used again and again. OpenAI says prompt caching can cut input-token cost by up to 90% and latency by up to 80%. AWS and Google also offer caching options that reduce the cost of repeated context.

The goal is simple: do not pay full price for the same context every time.
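A rough way to estimate what caching could save is sketched below. The price, the token counts, and the assumption that every request after the first gets the full discount are all hypothetical. The 90% figure is the "up to" upper bound quoted above, so this is a best case, and real providers attach conditions (cache expiry, minimum prompt length, identical prompt prefixes) that this sketch ignores.

```python
def input_cost(static_tokens, dynamic_tokens, requests, price_per_million, cached_discount=0.0):
    """Estimated input cost across `requests` calls that share one static prefix.

    Assumes the first request pays full price for the static prefix and every
    later request gets `cached_discount` off those tokens.
    """
    if requests <= 0:
        return 0.0
    full_price = price_per_million / 1_000_000
    first = (static_tokens + dynamic_tokens) * full_price
    later = (requests - 1) * (
        static_tokens * full_price * (1 - cached_discount)
        + dynamic_tokens * full_price
    )
    return first + later


# Hypothetical workload: 6,000 repeated tokens (system prompt, policies, tool
# descriptions) plus 500 tokens that change with each request.
without_cache = input_cost(6_000, 500, requests=1_000, price_per_million=1.00)
with_cache = input_cost(6_000, 500, requests=1_000, price_per_million=1.00, cached_discount=0.9)

print(f"Without caching: ${without_cache:.4f}")
print(f"With caching:    ${with_cache:.4f}")
```

The savings depend almost entirely on how much of each request is repeated, which is a reason to keep stable content such as instructions and tool descriptions together and unchanged between calls.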

Route tasks to the right model

Not every task needs the largest model.

A customer email classification task may work well on a smaller model. A legal risk review may need a stronger model. A simple data lookup may not need a large language model at all.

Model routing helps decide which model should handle each request based on task type, data sensitivity, cost, latency, and quality needs. A good routing setup can send routine tasks to lower-cost models and reserve stronger models for complex or high-risk work.
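As a minimal sketch, a policy-based router can be little more than an ordered list of rules. The model names, task kinds, and the rule that sensitive data goes to a privately hosted model are hypothetical placeholders; this is not how any specific product implements routing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    kind: str                # e.g. "classification", "lookup", "legal_review"
    sensitive: bool = False
    high_risk: bool = False


# Hypothetical handler names; placeholders, not real products.
SMALL_MODEL = "small-model"
LARGE_MODEL = "large-model"
PRIVATE_MODEL = "private-hosted-model"
NO_MODEL = "rule-based-lookup"


def route(task: Task) -> str:
    """Pick a handler using simple, ordered policy rules."""
    if task.kind == "lookup":
        return NO_MODEL          # a plain data lookup needs no LLM
    if task.sensitive:
        return PRIVATE_MODEL     # assumed policy: sensitive data stays in-house
    if task.high_risk or task.kind == "legal_review":
        return LARGE_MODEL       # reserve the stronger model for judgment calls
    return SMALL_MODEL           # routine work goes to the cheaper tier


for task in (Task("classification"), Task("legal_review"), Task("lookup")):
    print(task.kind, "->", route(task))
```

The order of the rules is the policy: here the lookup check runs first, so a request that needs no model never reaches one, whatever its other attributes.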

AISquared’s UNIFI supports semantic and policy-based model routing so enterprises can choose models based on workspace, role, data sensitivity, task type, latency needs, and cost controls. It also works with the Bolt family of models, which are built for specific enterprise tasks like routing, retrieval, guardrails, and document extraction.

Track what’s working and what’s wasting tokens

Token efficiency improves when the system learns from real usage.

Teams need to know which prompts are expensive, which workflows retry too often, which users are sending long requests, which models are being overused, and which outputs are actually helpful.

Without that visibility, cost control becomes guesswork.

A learning loop should track usage, latency, model choice, user feedback, cost, and task success. If users keep rejecting an answer, the issue may be weak retrieval. If an agent keeps looping, the issue may be poor workflow logic. If a simple task is always routed to a high-cost model, the issue may be routing policy.
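One way to start such a loop is to aggregate per-call logs by workflow, as in the sketch below. The `CallLog` fields, workflow names, and sample numbers are hypothetical; the point is the shape of the report, not the values.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CallLog:
    workflow: str
    model: str
    input_tokens: int
    output_tokens: int
    retries: int
    succeeded: bool


def summarize(logs):
    """Aggregate per-workflow token use, retries, and success rate."""
    totals = defaultdict(lambda: {"calls": 0, "tokens": 0, "retries": 0, "successes": 0})
    for log in logs:
        row = totals[log.workflow]
        row["calls"] += 1
        row["tokens"] += log.input_tokens + log.output_tokens
        row["retries"] += log.retries
        row["successes"] += int(log.succeeded)

    report = {}
    for workflow, row in totals.items():
        report[workflow] = {
            "tokens": row["tokens"],
            "avg_retries": row["retries"] / row["calls"],
            "success_rate": row["successes"] / row["calls"],
            # Tokens spent per successful task; high values flag waste.
            "tokens_per_success": row["tokens"] / row["successes"] if row["successes"] else None,
        }
    return report


logs = [
    CallLog("invoice_review", "large-model", 9_000, 500, 0, True),
    CallLog("invoice_review", "large-model", 9_500, 400, 3, False),
    CallLog("email_triage", "small-model", 600, 50, 0, True),
    CallLog("email_triage", "small-model", 650, 40, 0, True),
]
report = summarize(logs)
for workflow, stats in report.items():
    print(workflow, stats)
```

Tokens per successful task is a more useful number than raw token totals, because it charges failed and retried runs to the workflow that caused them.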

Feedback also matters. If users rate outputs or add comments, teams can see where the AI is helping and where it is wasting tokens.

AISquared’s UNIFI includes feedback capture, usage dashboards, audit logs, workflow metrics, and model routing metadata. That kind of visibility helps teams improve the system over time, not just react to the bill at the end of the month.

Redesign the AI stack to bring control

Sometimes token cost cannot be fixed with prompt edits. The workflow itself needs to change.

For example, imagine an AI agent that reviews invoices. If it sends the full invoice, vendor history, payment policy, approval rules, and audit notes to a large model every time, the cost will add up quickly.

A better design may split the task into smaller steps. One smaller model extracts fields. A rule-based step checks basic policy. A retrieval step pulls only the needed vendor records. A larger model reviews only the cases that need judgment. A human approves high-risk cases.
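The staged design can be sketched as a function that records which steps an invoice actually passes through. The thresholds, vendor list, and step names are hypothetical, and the model and retrieval calls are represented only by labels; a real system would invoke the corresponding services at each step.

```python
from dataclasses import dataclass


@dataclass
class Invoice:
    vendor: str
    amount: float
    has_po_number: bool


# Hypothetical policy values, for illustration only.
AUTO_APPROVE_LIMIT = 1_000.00
HUMAN_REVIEW_LIMIT = 50_000.00
KNOWN_VENDORS = {"Acme Supplies", "Northwind"}


def review_invoice(invoice: Invoice) -> list[str]:
    """Return the ordered steps an invoice passes through."""
    steps = ["extract_fields:small_model"]
    # Rule-based policy check: no model call needed.
    steps.append("policy_check:rules")
    if not invoice.has_po_number:
        steps.append("reject:missing_po")
        return steps
    if invoice.vendor in KNOWN_VENDORS and invoice.amount <= AUTO_APPROVE_LIMIT:
        steps.append("approve:auto")
        return steps
    # Only cases that need judgment reach retrieval and the larger model.
    steps.append("retrieve:vendor_records")
    steps.append("review:large_model")
    if invoice.amount > HUMAN_REVIEW_LIMIT:
        steps.append("approve:human")
    return steps


print(review_invoice(Invoice("Acme Supplies", 250.00, True)))
print(review_invoice(Invoice("New Vendor LLC", 75_000.00, True)))
```

Under these assumed rules, a small invoice from a known vendor never touches the large model, while an unusual one gets retrieval, a large-model review, and a human approval.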

That is an architecture redesign.

The point is to stop treating every AI task as one large prompt. Many enterprise tasks work better as smaller, controlled steps.

This also improves reliability. AISquared’s AI Controls Model describes workflow orchestration as the layer that adds state management, retries, fallbacks, human review, approvals, and termination controls around AI systems. These controls are important because agents can otherwise repeat steps, lose state, or run longer than expected.

When AI workflows are designed as systems, teams can control cost, speed, risk, and output quality in a much more direct way.

Use guardrails to control cost and risk

Guardrails are not only for safety. They are also useful for cost control.

A guardrail can stop a request that is too long. It can block a user from sending data they are not allowed to use. It can limit the number of times an agent can call a tool. It can route high-risk requests for human review. It can prevent the model from answering when the needed context is missing.

These controls reduce wasted tokens and reduce risk at the same time.

For example, if a user asks a question that requires finance data they cannot access, the system should not send a large context package to the model and then fail later. It should check permissions first.

If an agent has already tried the same tool multiple times, the system should stop the loop and return a clear status.

If a prompt includes private data, the system should apply masking or policy checks before inference.
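The permission check and the loop limit described above might look like the following sketch. The limits, role names, and status strings are hypothetical assumptions, and the agent is simplified to a planned list of tool names rather than a live loop.

```python
class GuardrailError(Exception):
    """Raised when a request is stopped before any tokens are spent."""


MAX_INPUT_TOKENS = 8_000     # assumed limit per request
MAX_TOOL_CALLS = 3           # assumed limit per tool per task


def check_request(user_roles: set[str], required_role: str, input_tokens: int) -> None:
    """Run cheap checks before building context or calling a model."""
    if required_role not in user_roles:
        raise GuardrailError("permission_denied")
    if input_tokens > MAX_INPUT_TOKENS:
        raise GuardrailError("request_too_long")


def run_agent(tool_calls: list[str]) -> str:
    """Walk a sequence of tool calls and stop any repeated-tool loop."""
    counts: dict[str, int] = {}
    for tool in tool_calls:
        counts[tool] = counts.get(tool, 0) + 1
        if counts[tool] > MAX_TOOL_CALLS:
            return f"stopped: {tool} exceeded {MAX_TOOL_CALLS} calls"
    return "completed"


try:
    check_request({"sales"}, required_role="finance", input_tokens=500)
except GuardrailError as exc:
    print("Blocked before inference:", exc)

print(run_agent(["search", "search", "search", "search"]))
```

Both checks run before or between model calls, so a blocked request costs no inference tokens at all.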

AISquared’s UNIFI supports guardrails such as PII masking, content policy checks, custom compliance rules, role-based access, audit logs, and workspace boundaries. These controls help teams keep AI usage safe, traceable, and cost-aware.

Conclusion

Token spend is not just a model pricing issue. It is a system design issue.

Cheaper models can help, but they cannot fix poor context design, weak routing, repeated prompts, long agent loops, missing guardrails, or lack of cost visibility.

The real goal is not to use fewer tokens at any cost. The goal is to get more useful work from every token.

That starts with better architecture: retrieve only the right context, route tasks to the right model, cache what repeats, track what happens, learn from feedback, and control how agents run.

As AI moves from pilots to production, token cost will become a core part of enterprise AI planning. The teams that manage it well will not just lower spend. They will build AI systems that are easier to trust, easier to scale, and easier to improve.
