The Token Economy: The Architecture of Information in the AI World

For a while, the dominant model for consumer AI has been an all-you-can-eat buffet. Pay monthly, use as much as you like, and the provider would absorb whatever it cost to serve you. That worked tolerably well when most people were using AI the way they’d use a search engine, with a question here, a bit of help drafting something there. Usage was bounded by human conversation speed. You can only type so fast. The early plans set a limit of ‘how many times you could go back to the buffet’ – essentially a limit on the number of conversations – but not on how much you could eat (how many tokens).

What changed everything was the rise of agentic AI. This is AI that doesn’t wait for you to ask the next question; it just gets on with the job, autonomously, and sometimes that job can take hours. A developer like myself might set up an agent to work through a codebase overnight. That agent doesn’t chat; it reads, reasons, writes, tests, and iterates, consuming resources at a rate no flat monthly fee was ever designed to absorb.

Roll on to 2026, and a $200-per-month subscription was being used to run anywhere from $1,000 to $5,000 worth of agent compute tasks. At that point, the economics simply don’t work. The buffet analogy breaks down when a handful of customers start arriving with industrial catering equipment.

The clearest signal of this shift came from Anthropic (the makers of Claude) when they effectively banned, and then unbanned, the most popular agent framework (OpenClaw) from their tiered plans. GitHub followed: GitHub Copilot officially announced it would retire “Premium requests” and transition all plans to usage-based billing on June 1, 2026, using GitHub AI Credits. Cursor and Windsurf made similar moves in 2025 and early 2026 respectively. This isn’t a one-off pricing change; it’s a structural change in how the industry works.

And the unit being counted, in all of these new pricing structures, is the token. So it’s worth knowing what one actually is.

What is a token?

Tokenisation isn’t new. Computers have been breaking text into tokens since the very earliest compilers. When you write a line of code, the first thing the compiler does is slice it into meaningful chunks (keywords, operators, identifiers) before it tries to make any sense of it. That process is called lexical analysis, and it has been the foundation of computing for as long as there have been programming languages. In that context, a token is just the smallest unit of a sequence that carries meaning to the system processing it. AI hasn’t invented the idea. It’s borrowed it and turbocharged it by applying it to human languages.
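
To make lexical analysis concrete, here is a minimal sketch of a lexer for a tiny, made-up expression language. The token classes, the keyword list and the function name lex are all my own assumptions for illustration; a real compiler's lexer handles far more.

```python
import re

# Hypothetical token classes for a toy language (illustrative only).
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[+\-*/=()]"),
    ("SKIP", r"\s+"),
]
KEYWORDS = {"if", "else", "while", "return"}

# One combined pattern; each alternative is a named group.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def lex(source):
    """Slice source text into (kind, text) tokens, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at position {pos}")
        kind = match.lastgroup
        text = match.group()
        pos = match.end()
        if kind == "SKIP":
            continue
        # Identifiers that appear in the keyword list are reclassified.
        if kind == "IDENT" and text in KEYWORDS:
            kind = "KEYWORD"
        tokens.append((kind, text))
    return tokens


print(lex("return total = price * 2"))
```

Every chunk that comes out is the smallest piece that still means something to the compiler, which is exactly the sense of "token" that AI systems have borrowed.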

Natural language is the messy, ambiguous, contextual stuff that humans write to each other. It is an entirely different challenge from source code. Solving that challenge is what drove decades of work in computational linguistics, some of which I wrote about over on the SocialOptic blog recently. The TL;DR is that language turns out to be extraordinarily resistant to the neat rule-based approaches that work so well for programming languages.

The first approaches treated words as the basic unit. One word, one token. That sounds sensible enough. But it created an immediate problem: the system could only handle words it had seen before. Anything new (a proper noun, a technical term, a misspelling) fell into a gap. And languages with rich inflection, where a single verb might have dozens of forms, made the vocabulary explode to unmanageable sizes.
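
The gap is easy to see in a sketch. The example below builds a word-level vocabulary from a tiny corpus and maps anything unseen to a catch-all "<unk>" entry. The function names and the "<unk>" marker are my own choices for illustration, not taken from any particular system.

```python
def build_vocab(corpus):
    """Assign an integer id to every distinct word in the corpus."""
    vocab = {"<unk>": 0}  # id 0 is reserved for unknown words
    for word in corpus.lower().split():
        vocab.setdefault(word, len(vocab))
    return vocab


def encode_words(text, vocab):
    """One word, one token; unseen words all collapse to <unk>."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]


vocab = build_vocab("the cat sat on the mat")
print(encode_words("the dog sat", vocab))  # "dog" was never seen, so it becomes 0
```

Every word the system has not met collapses to the same id, so whatever distinguished it is lost before the model ever sees it.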

The solution, which underpins almost every modern AI language model, was to go sub-word.

This is where the linguistics gets genuinely interesting, if you think about how spoken language actually works. When you hear speech, you don’t process it as a stream of individual letters, or as a sequence of complete words. You process it as phonemes, the smallest units of sound that carry distinction. The word “cat” is three phonemes: /k/, /æ/, /t/. Phonemes combine into syllables, syllables into words. None of this is conscious; it’s just how the auditory system is wired. The interesting thing is that this chunking is efficient precisely because it isn’t arbitrary. The chunks that emerge reflect the actual statistical structure of the language. It was something I was first introduced to when I was working on speech recognition and speech synthesis systems as a teenager. That was very much last century, which shows how long this approach has been around in computing.

Tokens, in modern AI systems, work on a similar principle. They’re sub-word chunks. They are bigger than individual characters, but smaller than words. They emerge from the statistical patterns in vast quantities of text used to train AI models. The word “unbelievable” might become three tokens: “un,” “believ,” “able.” Those aren’t random slices; they reflect units that appear frequently enough across the language to be worth encoding as a single item. The word “connection” might be a single token, because it’s more common. There isn’t a lot of obvious rhyme or reason to it at first, but there are patterns, and you gradually get a feel for them. A rare technical term might fragment into many small pieces, because the system has never seen enough of it to learn it whole. Different AI models take different approaches, and that shapes the model’s tokeniser (sometimes called the “encoder”), the component that turns what you type into tokens.
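
As a rough illustration of sub-word splitting, the sketch below uses a greedy longest-match rule against a tiny hand-picked vocabulary. Both the vocabulary and the matching rule are assumptions chosen to mirror the examples above; real tokenisers learn their vocabularies from data and do not necessarily split this way.

```python
# Hypothetical vocabulary, hand-picked to match the examples in the text.
VOCAB = {"un", "believ", "able", "connection"}


def tokenize(word, vocab=VOCAB):
    """Split a word by repeatedly taking the longest known piece."""
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, shrinking until one is known.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            # Nothing matched: fall back to a single character.
            tokens.append(word[start])
            start += 1
    return tokens


print(tokenize("unbelievable"))  # ['un', 'believ', 'able']
print(tokenize("connection"))    # ['connection']
print(tokenize("qubit"))         # ['q', 'u', 'b', 'i', 't']
```

The rare word falls apart into single characters because nothing larger is in the vocabulary, which is the fragmentation effect described above.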

The upshot is that (very) roughly 750 words is equivalent to about 1,000 tokens in English, which gives you a feel for the scale. That ratio varies considerably by language, in a way that matters and which I’ll come back to.
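
If you want that rule of thumb in a form you can reuse, here is a minimal sketch. The function name is my own, and the ratio is only the rough English figure quoted above; it will be wrong for other languages and for unusual text.

```python
# Rule of thumb from the text: roughly 750 English words per 1,000 tokens.
TOKENS_PER_WORD = 1000 / 750


def estimate_tokens(word_count, tokens_per_word=TOKENS_PER_WORD):
    """Very rough token estimate for English prose."""
    return round(word_count * tokens_per_word)


print(estimate_tokens(750))   # 1000
print(estimate_tokens(3000))  # 4000
```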

Why reading costs less than writing

Output tokens, the AI’s response, are typically three to five times more expensive than input tokens, the text you send in. There is a reason for that asymmetry. When the model reads your prompt (the text you type in), it processes everything in parallel. It can read the whole thing at once. When it generates a response, it works one token at a time, sequentially. Each new token requires a full pass through the model’s parameters. It’s the difference between scanning a page and writing one by hand. The reading is fast; the writing is expensive. That’s why, in the rawest pricing models, you see different prices for input and output tokens.
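
A small sketch shows how the asymmetry plays out on a bill. The prices in the example are hypothetical placeholders (I have simply picked an output price four times the input price, inside the three-to-five range above), not any vendor's actual rates.

```python
def request_cost(input_tokens, output_tokens,
                 input_price_per_million, output_price_per_million):
    """Cost of one request when input and output are priced separately."""
    input_cost = input_tokens * input_price_per_million
    output_cost = output_tokens * output_price_per_million
    return (input_cost + output_cost) / 1_000_000


# Hypothetical prices: 3.00 per million input tokens, 12.00 per million output.
cost = request_cost(2000, 500, 3.00, 12.00)
print(f"{cost:.4f}")
```

With these made-up prices, the 500 tokens of response cost as much as the 2,000 tokens of prompt, and the request comes to 0.012 in total.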

The hidden complexity in “output”

So far, so logical. But the pricing picture gets more complicated than a simple input/output split, in two directions.

The first is on the input side. Different tasks require very different amounts of context to produce a good result. Asking a capable model a short, well-framed question might get you an excellent answer in fifty tokens. Asking the same question of a model that needs more handholding, or working on a task that requires feeding in a lot of material, like analysing a large document or understanding a complex codebase, means sending tens or hundreds of thousands of tokens before the model has even begun to respond. The input cost isn’t just about the length of your question; it’s about how much of the world the model needs to see before it can give you a useful answer.

The second complication is on the output side, and it’s one that’s just starting to show up clearly in pricing structures. There are, effectively, two kinds of output tokens.

The first kind is the response you actually read. The second kind is something harder to name cleanly. It is most commonly called reasoning tokens, because that’s the term the industry has settled on, though I’ll confess a certain scepticism about the word “reasoning” in this context. But that’s a discussion for another day. What these tokens represent, in practice, is the internal working that certain models produce before they write their actual response. So-called “thinking models” generate a chain of intermediate steps before committing to an answer. You don’t always see this working, but it’s being computed, and it’s being charged for. Reasoning models often have a high time-to-first-token, the delay before the first characters start to appear on your screen. It can be anywhere from ten to a hundred and fifty seconds. For some of the work I do, it can even be hours. The model works through this internal process before producing the first word of its response. For some AI models the number of reasoning tokens can dwarf the visible output. One of my favourite prompts for testing new models is “write a long poem”: a thinking model writes more about writing the poem than ends up in the final poem, as it works out rhythm and rhyme and sense-checks the result. A question that produces a two-paragraph answer might have consumed thousands of reasoning tokens getting there.

This matters because two models might appear similarly priced on paper, but if one relies heavily on internal reasoning to achieve its results, it will cost significantly more per useful output word than a model that doesn’t. Comparing AI pricing on headline numbers alone is a bit like comparing car running costs purely on purchase price. That isn’t quite how it works.
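
Here is a minimal sketch of that comparison. It assumes reasoning tokens are billed at the same rate as visible output tokens, which is my assumption for the example rather than a statement about any particular vendor, and the token counts and price are made up.

```python
def effective_output_cost(visible_tokens, reasoning_tokens, output_price_per_million):
    """Return (total cost, billed-to-visible ratio) for one response.

    Assumes reasoning tokens are billed at the output rate (an assumption).
    """
    billed_tokens = visible_tokens + reasoning_tokens
    total_cost = billed_tokens * output_price_per_million / 1_000_000
    overhead = billed_tokens / visible_tokens
    return total_cost, overhead


# Same hypothetical headline price, very different real cost per useful token.
plain = effective_output_cost(300, 0, 12.00)
thinking = effective_output_cost(300, 3000, 12.00)
print(plain[1], thinking[1])  # 1.0 11.0
```

In this made-up case the two models share a headline price, but the one that thinks its way to the answer bills eleven tokens for every one you actually read.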

The speed dimension: tokens per second

There’s another way to think about tokens that goes beyond cost, and it’s equally important for understanding how these systems actually behave: speed.

Tokens per second (usually written as tok/s) measures the rate at which a model generates output. A model at 50 tok/s already outruns any human reader of its output. Average reading speed in English is about 250 words per minute, which works out to close to six tokens per second. So a model generating fifty tokens per second is running at about eight to nine times reading speed. That is fast enough that the bottleneck is you, not it. Under about thirty tokens per second, an interactive experience starts to feel sluggish; forty to fifty is widely considered the threshold at which inference becomes genuinely fluid for a human on the other end. This is important for local AI and I’ll come back to that in future posts.
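
The arithmetic behind those figures is short enough to write out. This sketch reuses the rough 750-words-per-1,000-tokens ratio from earlier; the function names are my own.

```python
# Rough English ratio used earlier in the text.
TOKENS_PER_WORD = 1000 / 750


def reading_speed_tok_per_s(words_per_minute=250.0, tokens_per_word=TOKENS_PER_WORD):
    """Convert a human reading speed into tokens per second."""
    return words_per_minute / 60 * tokens_per_word


def times_faster_than_reader(model_tok_per_s, words_per_minute=250.0):
    """How many times faster the model writes than a person reads."""
    return model_tok_per_s / reading_speed_tok_per_s(words_per_minute)


print(round(reading_speed_tok_per_s(), 2))       # 5.56
print(round(times_faster_than_reader(50.0), 1))  # 9.0
```

At exactly 250 words per minute the ratio comes out at nine; rounding the reading speed up to six tokens per second gives a little over eight, which is where the eight-to-nine range comes from.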

The speed threshold you need depends entirely on what you’re doing with the output. For a chat interface where you’re reading as the response streams in, fifty tokens per second is absolutely plenty. For an offline task, like generating a long report or processing a batch of documents overnight, speed matters less than cost. This is also an opportunity to power-shift and be greener. For agentic workflows, where AI systems are orchestrating dozens of sub-tasks and passing output between models, you’re in entirely different territory. Specialised inference systems optimised for high throughput can reach tens of thousands of tokens per second. For bulk processing at scale, that’s the target.

This creates a real split in the market. Cloud frontier models like Claude, ChatGPT and Gemini typically deliver somewhere between fifty and two hundred and thirty tokens per second. For most interactive use, that’s comfortable. The trade-off is that you’re paying per token for that convenience, on someone else’s hardware.

Local models, running on your own machine, are a different story. A high-end consumer GPU running a mid-sized model can achieve sixty to eighty tokens per second, which is competitive with cloud performance for interactive use. The key constraint is memory: token generation is fundamentally limited by how fast the hardware can move model weights in and out of memory, not by raw processing power. That’s why the amount of video memory in a GPU matters so much for local inference, and why larger, more capable models need either more hardware or slower speeds. The hardware question and the token question are, in the end, the same question. That is a big part of the reason for the explosion in DRAM prices you hear so much about.
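
A back-of-the-envelope sketch makes the memory point tangible. If every generated token needs the full set of weights to be read from memory once, then memory bandwidth divided by model size gives a rough ceiling on single-stream speed. That simplification ignores caches, batching and everything else a real system does, and the hardware and model figures below are hypothetical, so treat the output as an upper bound on a napkin, not a benchmark.

```python
def max_tokens_per_second(params_billion, bytes_per_param, bandwidth_gb_per_s):
    """Rough ceiling: memory bandwidth divided by the size of the weights.

    Assumes one full read of the weights per generated token (a simplification).
    """
    model_size_gb = params_billion * bytes_per_param
    return bandwidth_gb_per_s / model_size_gb


# Hypothetical GPU with 1,000 GB/s of memory bandwidth, weights at 4 bits (0.5 bytes).
print(max_tokens_per_second(8, 0.5, 1000))             # 250.0
print(round(max_tokens_per_second(70, 0.5, 1000), 1))  # 28.6
```

On the same hypothetical hardware, the larger model's ceiling drops below the thirty tokens per second at which things start to feel sluggish, before you even ask whether it fits in memory at all.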

Not all words are equal

There is an inequity that doesn’t get talked about enough. Because most of these models were trained primarily on English-language text, the tokenisation is much more efficient for English than for other languages. The same content in Hindi, Telugu, or Arabic gets fragmented into smaller, less efficient pieces. Remember, more tokens means higher cost and less effective “memory” available for the actual conversation. Research suggests the same meaning can be two to five times more expensive to process in many major world languages than in English. The AI economy isn’t evenly distributed, and the tokenisation layer is one of the less visible reasons why. Expect to see the large AI vendors making efforts to distract from this or to recentre the argument.
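
To put rough numbers on that, the sketch below applies a token multiplier to the same piece of content and reports the cost and the share of the context window it takes up. The multiplier of three, the price and the window size are all hypothetical values picked for the example; only the two-to-five range comes from the research mentioned above.

```python
def language_penalty(english_tokens, multiplier, context_window, price_per_million):
    """Cost and context share for content that needs `multiplier` times the tokens."""
    tokens = english_tokens * multiplier
    return {
        "tokens": tokens,
        "cost": tokens * price_per_million / 1_000_000,
        "share_of_context": tokens / context_window,
    }


english = language_penalty(1000, 1, 8000, 3.00)
other = language_penalty(1000, 3, 8000, 3.00)
print(english["share_of_context"], other["share_of_context"])  # 0.125 0.375
```

The same meaning costs three times as much and leaves far less room in the model's "memory" for the rest of the conversation.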

A reason to be cautiously optimistic

The shift from “flat rate with mysterious limits” to “pay per token” is, in some ways, fairer and more responsible. You are paying for what you use. There is also another reason to genuinely welcome it that goes beyond fairness, and it’s one I find myself returning to. Tokens are, ultimately, a proxy for compute. And compute consumes energy. Charging by the token starts to align the cost of AI more directly with its environmental impact. The more you use, the more you pay, and the more energy is consumed. That alignment doesn’t solve the environmental question, but it’s a better foundation for thinking about it than the old model, where the costs were socialised and the incentive to use more was essentially unlimited. It means that people are putting huge efforts into being more efficient. There is an immediate payback. Whether the industry follows through on that logic is another question. But the direction, at least, points to somewhere better.

There is more to tokens

There is much more that can be said about tokens. Large Language Models (LLMs) are fundamentally dependent on turning continuous human language into these numerical primitives. It isn’t just text; audio, images and video are tokenised too. The methods differ, but tokenisation tools serve as the critical interface between the unstructured, high-dimensional reality of human communication and the discrete, vector-based internal world of the transformer-based neural networks that drive today’s AI. This fundamental procedure determines not only the model’s linguistic boundaries but also its cognitive efficiency, its economic feasibility, and the equitability (an awkward word for an awkward fact) of its global access. There is a huge amount of mathematics and technology involved in creating more and more effective ways to tokenise, and even a basic introduction would run to many, many blog posts, with mysterious terms like BPE (Byte Pair Encoding), unigrams and SentencePiece.
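
To take a little of the mystery out of BPE, here is a toy sketch of its training loop: start from individual characters, find the most frequent adjacent pair, merge it into a new symbol, and repeat. The word counts and function names are made up for illustration, and production tokenisers add a great deal on top (byte-level handling, pre-tokenisation, special tokens), so read this as the core idea only.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(word_counts, num_merges):
    """Learn up to num_merges merge rules from a word-frequency table."""
    words = {tuple(word): freq for word, freq in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words


# Made-up word counts for illustration.
merges, words = train_bpe({"hug": 10, "pug": 5, "hugs": 5}, num_merges=2)
print(merges)  # [('u', 'g'), ('h', 'ug')]
```

After two merges the frequent word "hug" has become a single symbol, while the rarer "pug" is still in two pieces, which is the same common-versus-rare behaviour described earlier.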

The token as the fundamental unit of the intelligence economy

Thankfully we don’t all need to understand all of the technologies involved in tokenisation, but we do need a basic understanding of what tokens are. AI vendors will inevitably start to push the token as a unit of currency, as they work towards managing their costs and shaping user behaviour.

The transition of AI into the mainstream has transformed the token into the definitive unit of the “intelligence economy.” From the first principles of lexical analysis to the current state-of-the-art byte-level BPE, tokenisation has evolved into a highly optimised interface that governs the speed, accuracy, and price of artificial intelligence.

As exemplified by the shift in billing models for ChatGPT, Claude, and GitHub Copilot, the “flat-rate” era is giving way to a consumption-based reality where developers must treat tokens as a metered utility. Management techniques like prompt caching and compression are providing the tools to scale AI more responsibly. Understanding the mechanics of the token is no longer just a technical skill; it is a foundational piece of knowledge for working with the tools.