AI Tokenomics

Why tokenomics is the new AI board conversation

Every call to a large language model consumes tokens — fragments of text that are the unit currency of inference. The economics of those tokens now drive the total cost of AI ownership for every bank, regulator, and exchange deploying GenAI at scale. Most organisations are running pilots without understanding their run-rate. The ones that do are making fundamentally different architecture and vendor decisions.

~750

Tokens per typical page of text

10–100×

Cost range across model tiers

40–60%

Inference as % of total AI cost

$10–75

Per 1M output tokens (frontier)

The token cost landscape · 2026

Token pricing varies by orders of magnitude depending on model tier, provider, and whether you're consuming input or output tokens. Understanding this landscape is the starting point for any rational AI cost model.

Model tier	Input / 1M tokens	Output / 1M tokens	Context window	Best for
Frontier reasoning (e.g. GPT-4o, Claude Opus, Gemini Ultra)	$2.50–15.00	$10.00–75.00	128K–1M	Complex analysis, strategy
Mid-tier (e.g. Claude Sonnet, GPT-4o-mini, Gemini Flash)	$0.30–3.00	$1.00–15.00	128K–200K	Production workloads
Lightweight (e.g. Haiku, Gemini Nano, Phi, Mistral)	$0.01–0.25	$0.05–1.00	32K–128K	High-volume, classification
Self-hosted open-source (e.g. Llama, Nemotron, Mixtral)	Infra cost only	Infra cost only	Configurable	Sovereignty, cost control

Prices indicative as of Q1 2026. Frontier model pricing is falling ~40–50% annually. The smart play is architecture that can swap models without re-engineering the application layer.
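
To make the input/output split concrete, here is a minimal sketch of a per-request cost calculation. The tier names and per-million prices are illustrative assumptions picked from inside the ranges in the table above; they are not quotes from any provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierPrice:
    """Price in USD per 1M tokens (illustrative, not a provider quote)."""
    input_per_m: float
    output_per_m: float


# Hypothetical price points chosen from inside the table's ranges.
PRICES = {
    "frontier": TierPrice(input_per_m=5.00, output_per_m=25.00),
    "mid": TierPrice(input_per_m=1.00, output_per_m=5.00),
    "lightweight": TierPrice(input_per_m=0.10, output_per_m=0.40),
}


def request_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one call; input and output tokens are priced separately."""
    price = PRICES[tier]
    total = input_tokens * price.input_per_m + output_tokens * price.output_per_m
    return total / 1_000_000


if __name__ == "__main__":
    # Same request (2,000 tokens in, 500 tokens out) priced on each tier.
    for tier in PRICES:
        print(tier, request_cost(tier, 2_000, 500))
```

Keeping prices in a lookup table rather than in application code is one way to let a model swap change a configuration entry instead of the application layer.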

Cost vs Value framework

The board question is never "what does AI cost?" — it's "what does each unit of AI generate in value?" This framework connects token spend to business outcomes.

AI ROI = (Value Generated − Total AI Cost) ÷ Total AI Cost

Where Total AI Cost = Inference (tokens) + Infrastructure + Engineering + Governance + Data
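
The two formulas above translate directly into code. The sketch below is a plain transcription; the dollar figures in the example are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def total_ai_cost(inference: float, infrastructure: float, engineering: float,
                  governance: float, data: float) -> float:
    """Total AI Cost = Inference + Infrastructure + Engineering + Governance + Data."""
    return inference + infrastructure + engineering + governance + data


def ai_roi(value_generated: float, total_cost: float) -> float:
    """AI ROI = (Value Generated - Total AI Cost) / Total AI Cost."""
    if total_cost <= 0:
        raise ValueError("total_cost must be positive")
    return (value_generated - total_cost) / total_cost


if __name__ == "__main__":
    # Hypothetical annual figures in USD.
    cost = total_ai_cost(
        inference=1_500_000,
        infrastructure=800_000,
        engineering=700_000,
        governance=250_000,
        data=250_000,
    )
    print(cost)                     # total of the five layers
    print(ai_roi(7_000_000, cost))  # ROI as a ratio; 1.0 means 100%
```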

The five cost layers

1. Inference cost (tokens)

40–60% of total · Variable

Direct token consumption across all models in production. Scales linearly with usage. The largest and most controllable cost lever — model selection, prompt engineering, caching, and routing decisions directly impact this line.

2. Infrastructure cost

15–30% of total · Fixed + Variable

GPU compute (NVIDIA A100/H100/Blackwell), storage, networking, and orchestration. For self-hosted models this is the dominant cost; for API-consumed models it's embedded in token pricing. The private AI factory vs public cloud decision lives here.

3. Engineering cost

15–25% of total · Fixed

Prompt engineering, RAG pipeline development, fine-tuning, evaluation harnesses, MLOps, and integration. Often underestimated — the "last mile" from model to production is where most programmes stall and overspend.

4. Governance & risk cost

5–10% of total · Fixed

Model risk management (SR 11-7 / SS1/23), testing, monitoring, audit, bias detection, explainability tooling, and the human oversight layer that regulators demand. Under-investing here is a false economy — it creates regulatory exposure.

5. Data cost

5–15% of total · Variable

Data acquisition, labelling, cleaning, vector embeddings, and ongoing data pipeline maintenance. RAG architectures shift cost from fine-tuning to data infrastructure. The quality of your data layer determines the quality ceiling of every model.

The value side · what tokens generate

Every token consumed should trace to one of four value categories. If it doesn't, you're subsidising experimentation at production prices.

Revenue uplift

Cross-sell/upsell targeting precision
Fee income from AI-powered products (Intelligence-as-a-Service, data products)
Client advisory quality and throughput
New revenue streams (Intelligence-as-a-Service)

Cost avoidance

Manual processing hours displaced
Straight-through processing uplift
Regulatory reporting automation
Contact centre deflection rate

Risk reduction

Fraud detection accuracy improvement
Credit loss reduction from better models
Operational risk event prevention
Regulatory penalty avoidance

Capital efficiency

RWA optimisation (Basel IV)
Collateral efficiency through better risk models
Faster time-to-market for new products
Employee productivity × decision quality

Private vs Public AI factory · decision framework

The most consequential infrastructure decision in enterprise AI. It determines your cost floor, your sovereignty posture, and your ability to scale.

Dimension	Public cloud / API	Private AI factory	Hybrid
Capital outlay	Low (OpEx)	High (CapEx)	Medium
Unit cost at scale	Higher (margin embedded)	Lower (amortised)	Optimised
Data sovereignty	Shared tenancy risk	Full control	Tiered
Model flexibility	Widest choice	Open-source focused	Best of both
Time to production	Weeks	Months	Weeks–Months
Regulatory comfort	Depends on jurisdiction	Highest	High
Scaling ceiling	Elastic	Capacity-bound	Elastic + local
Best for	Early stage, variable demand	Sovereignty-first, high volume	Most enterprises at scale

The sweet spot for most FS institutions: a hybrid model — sovereign-sensitive workloads (customer data, credit models, surveillance) on a private AI factory; frontier reasoning tasks and low-sensitivity workloads via API. The router layer that decides which request goes where is the most important piece of AI infrastructure you'll build.
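
A router of this kind can start as a small, auditable rule set. The sketch below is one possible shape under stated assumptions: the `Request` fields, the `Target` names, and the list of sovereign use cases are all hypothetical, and a production router would also consider quality bars, cost, and capacity.

```python
from dataclasses import dataclass
from enum import Enum


class Target(Enum):
    PRIVATE = "private_ai_factory"
    API_FRONTIER = "api_frontier"
    API_LIGHTWEIGHT = "api_lightweight"


@dataclass(frozen=True)
class Request:
    use_case: str
    contains_customer_data: bool
    needs_complex_reasoning: bool


# Hypothetical list of use cases that must stay on sovereign infrastructure.
SOVEREIGN_USE_CASES = {"credit_model", "surveillance"}


def route(req: Request) -> Target:
    """Pick a target for one request; the sovereignty check always runs first."""
    if req.contains_customer_data or req.use_case in SOVEREIGN_USE_CASES:
        return Target.PRIVATE
    if req.needs_complex_reasoning:
        return Target.API_FRONTIER
    return Target.API_LIGHTWEIGHT


if __name__ == "__main__":
    print(route(Request("market_summary", False, True)))
    print(route(Request("credit_model", False, True)))
```

Ordering the rules so that sovereignty is checked before capability means a sensitive request can never be sent to an external API just because it needs complex reasoning.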

Caching and batching economics — concrete savings models

Two technical levers with outsized impact on token costs. Both require architectural decisions made early and are hard to bolt on later.

Semantic caching (20–40% cost reduction)

High ROI · Low effort

Many enterprise queries have semantic duplicates: "Who are our top 10 customers in the energy sector?" gets asked in different ways by different teams. Semantic caching stores responses keyed by semantic similarity, not exact text match. If an incoming query is 95%+ similar to a cached query, return the cached answer without re-invoking the model.

Implementation: Embed every query and store the embedding alongside its response in a vector store. For each new query, find the top-K most similar cached queries. If the best similarity score exceeds the threshold, return the cached result after a freshness check.
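
The lookup logic can be sketched in a few lines. This is a toy illustration, not a production design: `embed` is a bag-of-words stand-in for a real embedding model, the linear scan stands in for a vector store, only the single best match is considered (K = 1), and the class and parameter names are assumptions.

```python
import math
import time
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a real embedding model: lower-cased word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.95, max_age_s: float = 3600.0):
        self.threshold = threshold
        self.max_age_s = max_age_s
        self._entries = []  # (query embedding, response, stored_at)

    def put(self, query: str, response: str, now: float = None) -> None:
        stored_at = time.time() if now is None else now
        self._entries.append((embed(query), response, stored_at))

    def get(self, query: str, now: float = None):
        """Return the best cached response above the threshold, else None."""
        now = time.time() if now is None else now
        q = embed(query)
        best_score, best_response = 0.0, None
        for emb, response, stored_at in self._entries:
            if now - stored_at > self.max_age_s:
                continue  # freshness check: ignore stale entries
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score > self.threshold else None
```

On a miss (`None`), the caller invokes the model and stores the result with `put`. The threshold and the maximum age are the two settings that trade hit rate against the risk of returning a wrong or outdated answer.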

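Model economics: A typical enterprise with 10,000 daily queries might see 30–40% hit rates on a semantic cache. At $3/1M input tokens and an average query of 1,000 tokens, input spend is $30/day, so a 30% hit rate saves about $9/day, or roughly $3.3K/year, on input tokens alone. Avoided output tokens, which are priced several times higher, add to that saving. Payback on the caching infrastructure therefore depends on query volume and response length; the saving scales linearly with both.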
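
The input-side arithmetic for this scenario (10,000 queries a day, 1,000 input tokens each, $3/1M tokens) can be checked with a short calculation. The function name and parameters are illustrative, and output tokens are deliberately left out, so the result understates the full saving.

```python
def daily_input_spend(queries_per_day: int, tokens_per_query: int,
                      price_per_m: float) -> float:
    """Daily input-token spend in USD before any caching."""
    return queries_per_day * tokens_per_query * price_per_m / 1_000_000


def daily_cache_saving(queries_per_day: int, tokens_per_query: int,
                       price_per_m: float, hit_rate: float) -> float:
    """Input-token spend avoided per day when `hit_rate` of queries hit the cache."""
    return daily_input_spend(queries_per_day, tokens_per_query, price_per_m) * hit_rate


if __name__ == "__main__":
    spend = daily_input_spend(10_000, 1_000, 3.0)
    saving = daily_cache_saving(10_000, 1_000, 3.0, hit_rate=0.30)
    print(spend, saving, saving * 365)  # daily spend, daily saving, annual saving
```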

Batch processing (50% cost reduction on non-urgent workloads)

Massive cost leverage · Requires planning

Most LLM providers offer batch APIs that are 50% cheaper than real-time. The tradeoff: requests are processed on the provider's schedule, usually 8–24 hours later. Not suitable for customer-facing or time-critical work, but perfect for overnight reports, periodic analysis, and batch scoring.

Use cases: Daily regulatory compliance reports (10K documents), monthly market analysis, quarterly portfolio rebalancing analysis, nightly contact centre call summaries.

Model economics: A bank running 1,000 complex analyses per month at 5,000 tokens each and paying $10/1M tokens for real-time would spend $50 per month. Via batch at $5/1M tokens, that's $25. Annual savings: $300 per use case. Multiply across 20 use cases and you're looking at about $6K annually from this one lever alone. The saving scales with volume: at these volumes 100 use cases would save about $30K/year, so the $200K–500K/year achievable at a large institution with 100+ batch use cases assumes per-use-case volumes roughly 7–17× higher than in this example.
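
A small helper makes it easy to see how the batch saving scales with volume and with the number of use cases. It is a sketch under the assumptions of this example (a flat 50% batch discount, identical use cases); the function name and defaults are illustrative.

```python
def annual_batch_saving(analyses_per_month: int, tokens_each: int,
                        realtime_price_per_m: float,
                        batch_discount: float = 0.5,
                        use_cases: int = 1) -> float:
    """Annual USD saved by moving identical use cases from real-time to batch."""
    monthly_realtime = analyses_per_month * tokens_each * realtime_price_per_m / 1_000_000
    return monthly_realtime * batch_discount * 12 * use_cases


if __name__ == "__main__":
    # 1,000 analyses/month, 5,000 tokens each, $10 per 1M tokens real-time.
    for n in (1, 20, 100):
        print(n, annual_batch_saving(1_000, 5_000, 10.0, use_cases=n))
```

Because the saving is linear in every input, a tenfold increase in tokens per use case has the same effect as ten times as many use cases.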

Combining caching + batching (up to 70% cost reduction on low-urgency workloads)

Maximum impact

For periodic batch workloads (e.g. daily regulatory reports), use semantic caching to avoid re-running identical analyses, then batch the remainder at night. With a 40% cache hit rate, the remaining 60% of requests run at half price, so total cost falls to about 30% of the real-time baseline: a 70% reduction overall. At a 30% hit rate the reduction is 65%.
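
How the two discounts compose can be shown in a few lines. The sketch assumes cache hits cost nothing and that the batch discount applies to every non-cached request; the function names are illustrative.

```python
def remaining_cost_fraction(cache_hit_rate: float, batch_discount: float = 0.5) -> float:
    """Fraction of the real-time baseline still paid after caching, then batching."""
    if not 0.0 <= cache_hit_rate <= 1.0 or not 0.0 <= batch_discount <= 1.0:
        raise ValueError("rates must be between 0 and 1")
    non_cached = 1.0 - cache_hit_rate          # requests that still reach the model
    return non_cached * (1.0 - batch_discount)  # each of those is billed at the batch rate


def total_reduction(cache_hit_rate: float, batch_discount: float = 0.5) -> float:
    """Overall cost reduction versus running everything in real time."""
    return 1.0 - remaining_cost_fraction(cache_hit_rate, batch_discount)


if __name__ == "__main__":
    for hit_rate in (0.0, 0.30, 0.40):
        print(hit_rate, total_reduction(hit_rate))
```

The discounts multiply rather than add: batching alone gives 50%, and caching only reduces the share of requests that the batch price applies to.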

Building a TCO model — the spreadsheet you need

Most organisations underestimate the true cost of AI by 3–5×. Here's what a realistic TCO should capture:

Annual AI TCO = (Inference tokens × price/token) + Infrastructure + Engineering + Governance + Data

What each line item should include

Inference token cost

Project tokens per use case (tokens ≈ characters ÷ 4 for English text)
Model pricing for each tier (input vs output)
Query volume per month × 12
Caching adjustment (30–40% reduction)
Batch vs real-time split (50% cost for batch portion)
Committed volume discounts (negotiate 20–40% off list price)
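
The line items above combine into a single calculation per use case. The following is a minimal sketch, assuming the adjustments apply multiplicatively and that cache hits cost nothing; the `UseCase` fields and all example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    name: str
    queries_per_month: int
    input_chars: int            # average characters sent per query
    output_chars: int           # average characters returned per query
    input_price_per_m: float    # USD per 1M input tokens
    output_price_per_m: float   # USD per 1M output tokens
    cache_hit_rate: float = 0.0
    batch_share: float = 0.0    # fraction of queries sent via batch


def chars_to_tokens(chars: float) -> float:
    """Rule of thumb from the list above: roughly 4 characters per token."""
    return chars / 4


def annual_inference_cost(uc: UseCase, batch_discount: float = 0.5,
                          volume_discount: float = 0.0) -> float:
    """Annual inference spend in USD after caching, batching and discounts."""
    per_query = (chars_to_tokens(uc.input_chars) * uc.input_price_per_m
                 + chars_to_tokens(uc.output_chars) * uc.output_price_per_m) / 1_000_000
    billable_queries = uc.queries_per_month * 12 * (1 - uc.cache_hit_rate)
    batch_factor = 1 - uc.batch_share * batch_discount
    return per_query * billable_queries * batch_factor * (1 - volume_discount)


if __name__ == "__main__":
    memo = UseCase("credit_memo", 100_000, 8_000, 2_000, 1.00, 5.00,
                   cache_hit_rate=0.30, batch_share=0.50)
    print(annual_inference_cost(memo, volume_discount=0.20))
```

Summing this function over every use case gives the inference line of the TCO; keeping each adjustment as a separate parameter makes it easy to test how sensitive the total is to any one assumption.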

Infrastructure cost

Cloud compute (if self-hosted): GPU cost, networking, storage
API gateway and monitoring tools
Vector DB for embeddings (if RAG-heavy)
Load balancing and auto-scaling infrastructure
Disaster recovery and backup
For private AI factory: amortise GPU capex over 3 years

Engineering cost

Prompt engineering (0.5–2 FTE per 5 use cases)
RAG pipeline development (1 FTE per major pipeline)
Fine-tuning (if needed): $500K–2M per model per year
MLOps: monitoring, logging, model versioning (1 FTE per 20 models)
Integration with enterprise systems (0.5–1 FTE per use case)
Loaded cost per FTE: $150K–250K all-in

Governance & risk cost

Model risk management testing and documentation
Bias and fairness testing (1–2 FTE)
Explainability and auditability tools
Regulatory reporting and examination prep
Compliance reviews and sign-offs
Internal audit and quality assurance

Data cost

Data acquisition and licensing
Data labelling (if supervised fine-tuning)
Data cleaning and validation
Vector embeddings generation (recurring)
Data pipeline maintenance and monitoring
Data freshness and versioning infrastructure

Contingency (often forgotten)

Unexpected model retraining (budget 10% overrun)
Performance debugging and tuning
Regulatory change response and re-engineering
Vendor relationship management
Training and change management
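
Pulling the line items together, a first-pass TCO model might look like the sketch below. It makes several simplifying assumptions that are not in the text: engineering is modelled purely as headcount times loaded cost, governance and data are entered as lump sums, and the 10% contingency is applied to the whole base rather than to retraining alone. All figures in the example are hypothetical.

```python
def annual_tco(inference: float,
               infra_opex: float,
               gpu_capex: float,
               engineering_ftes: float,
               loaded_cost_per_fte: float,
               governance: float,
               data: float,
               amortisation_years: int = 3,
               contingency: float = 0.10) -> float:
    """Annual AI TCO in USD across the five cost layers plus contingency."""
    if amortisation_years <= 0:
        raise ValueError("amortisation_years must be positive")
    # GPU capex is spread evenly over the amortisation period.
    infrastructure = infra_opex + gpu_capex / amortisation_years
    engineering = engineering_ftes * loaded_cost_per_fte
    base = inference + infrastructure + engineering + governance + data
    return base * (1 + contingency)


if __name__ == "__main__":
    print(annual_tco(
        inference=1_000_000,
        infra_opex=300_000,
        gpu_capex=1_500_000,
        engineering_ftes=6,
        loaded_cost_per_fte=200_000,
        governance=250_000,
        data=250_000,
    ))
```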

Eight levers to optimise token economics

Architecture levers

Model routing — Route each request to the cheapest model that meets the quality bar. Frontier for complex reasoning; lightweight for classification and extraction. A router that picks the right model saves 30–50% of inference cost.
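Prompt engineering — Concise, structured prompts reduce input tokens by 30–50%. Cache and reuse system prompts where the provider supports prompt caching, so repeated prefixes are billed at a reduced rate rather than full price on every call. Trimming 1,000 tokens from a prompt saves $3 per 1,000 calls at $3/1M input tokens, or $3,000 per 1M calls, and more on frontier tiers.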
Semantic caching — Cache responses for similar queries. At scale, 20–40% of enterprise queries are near-duplicates. Caching eliminates repeat inference.
RAG over fine-tuning — Retrieval-augmented generation is cheaper to maintain than fine-tuned models. Data updates don't require retraining. Trade-off: latency and retrieval quality matter more.

Commercial levers

Committed volume pricing — Negotiate committed-use discounts with providers. Provisioned throughput can cut per-token cost by 30–60%. Requires forecasting accuracy and contract flexibility.
Batch vs real-time — Batch inference is 50% cheaper than real-time at most providers. Schedule non-urgent workloads (reports, summaries) for batch. Set expectations: 8–24 hour latency.
Open-source for base workloads — Llama, Mistral, and Nemotron models on owned infrastructure for high-volume, stable workloads. Breaks even at ~$200K/month token spend.
Metering and chargeback — Instrument every application's token consumption. Business units that see their AI bill make better architecture decisions. Cost transparency is the best optimisation lever.
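
Metering and chargeback can begin with a very small amount of instrumentation. The sketch below is illustrative only: the `TokenMeter` class, its method names, and the single flat price pair are assumptions, and a real deployment would price each model tier separately and persist the counters.

```python
from collections import defaultdict


class TokenMeter:
    """Accumulates token usage per (business unit, application)."""

    def __init__(self, input_price_per_m: float, output_price_per_m: float):
        self.input_price_per_m = input_price_per_m
        self.output_price_per_m = output_price_per_m
        self._usage = defaultdict(lambda: [0, 0])  # key -> [input, output] tokens

    def record(self, business_unit: str, app: str,
               input_tokens: int, output_tokens: int) -> None:
        """Call once per model invocation with the token counts it reported."""
        usage = self._usage[(business_unit, app)]
        usage[0] += input_tokens
        usage[1] += output_tokens

    def chargeback(self) -> dict:
        """USD billed to each business unit, summed over its applications."""
        bill = defaultdict(float)
        for (unit, _app), (tokens_in, tokens_out) in self._usage.items():
            cost = (tokens_in * self.input_price_per_m
                    + tokens_out * self.output_price_per_m) / 1_000_000
            bill[unit] += cost
        return dict(bill)


if __name__ == "__main__":
    meter = TokenMeter(input_price_per_m=3.0, output_price_per_m=15.0)
    meter.record("retail", "chatbot", 1_000_000, 200_000)
    meter.record("markets", "research", 500_000, 100_000)
    print(meter.chargeback())
```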

Cost governance framework — how to control runaway spending

AI token costs grow quickly without guardrails. Here's the framework to prevent bill shock:

1. Budgeting by use case

Each AI use case gets a monthly token budget based on expected volume × cost/token. Budgets are tracked in real time. When a use case hits 80% of budget, alert the owner. At 100%, trigger manual approval for overflow.
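
The 80% and 100% thresholds map onto a simple status check that can run on every metering update. The function name and the status strings below are illustrative assumptions.

```python
def budget_status(spent: float, budget: float, alert_at: float = 0.80) -> str:
    """Classify month-to-date spend against a use case's token budget."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    used = spent / budget
    if used >= 1.0:
        return "approval_required"  # overflow needs manual sign-off
    if used >= alert_at:
        return "alert_owner"        # warn the use-case owner
    return "ok"


if __name__ == "__main__":
    for spent in (500, 850, 1_000):
        print(spent, budget_status(spent, budget=1_000))
```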

2. Cost per transaction KPIs

Track not just total cost, but cost-per-unit-of-value. For credit memos: $/memo. For customer intelligence: $/recommendation. For market reports: $/page. If cost-per-unit drifts upward, you have a prompt efficiency or query volume problem to solve.

3. Token consumption rate monitoring

Plot your daily/weekly token consumption. If the curve is accelerating (non-linear growth), you have a quality degradation or scope creep problem. Flat or linear growth is healthy. Exponential growth is a red flag.
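
One simple way to separate linear from accelerating growth is to look at whether the week-over-week increase is itself increasing. The sketch below is a deliberately crude heuristic with an illustrative function name; noisy real data would need smoothing or a fitted trend before applying it.

```python
def is_accelerating(weekly_tokens: list, min_points: int = 4) -> bool:
    """True if every week-over-week increase is larger than the one before."""
    if len(weekly_tokens) < min_points:
        return False  # not enough history to judge
    # First differences: growth per week.
    deltas = [b - a for a, b in zip(weekly_tokens, weekly_tokens[1:])]
    # Second differences: change in growth per week.
    accel = [b - a for a, b in zip(deltas, deltas[1:])]
    return all(a > 0 for a in accel)


if __name__ == "__main__":
    print(is_accelerating([100, 110, 120, 130]))  # steady linear growth
    print(is_accelerating([100, 200, 400, 800]))  # doubling every week
```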

4. Model tier ratios

Track the split of workloads across model tiers. If 50% of your queries run on frontier models when lightweight models would handle them, you're overspending by 5–10× on those queries. Healthy ratio: 10% frontier, 40% mid-tier, 50% lightweight.
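
Comparing the actual tier mix with the target ratio is a short calculation. In the sketch below the target mix is the 10/40/50 split quoted above; the tier keys, the function names, and the 5-point tolerance are illustrative assumptions.

```python
HEALTHY_MIX = {"frontier": 0.10, "mid": 0.40, "lightweight": 0.50}


def tier_shares(query_counts: dict) -> dict:
    """Share of total queries handled by each tier."""
    total = sum(query_counts.values())
    if total == 0:
        raise ValueError("no queries recorded")
    return {tier: query_counts.get(tier, 0) / total for tier in HEALTHY_MIX}


def over_target(query_counts: dict, tolerance: float = 0.05) -> dict:
    """Tiers whose share exceeds the target by more than `tolerance`."""
    shares = tier_shares(query_counts)
    return {
        tier: shares[tier] - HEALTHY_MIX[tier]
        for tier in HEALTHY_MIX
        if shares[tier] - HEALTHY_MIX[tier] > tolerance
    }


if __name__ == "__main__":
    print(over_target({"frontier": 500, "mid": 0, "lightweight": 500}))
    print(over_target({"frontier": 100, "mid": 400, "lightweight": 500}))
```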

5. Quarterly business case revalidation

Every quarter, revalidate the original business case vs actual spend and delivered value. If a use case is running 2× the projected token budget with 50% lower value realisation, kill it and redeploy the budget.

What does enterprise AI actually cost? · Three scenarios

Annual run-rate for a mid-size bank or exchange running 15–25 AI use cases in production, modelled across three infrastructure strategies.

All-API

$4–8M

Per year. Fastest to deploy. Highest unit cost. Vendor-dependent. Limited sovereignty.

Hybrid

$2.5–5M

Per year. Best balance. Sovereign workloads on-prem, frontier reasoning via API. Requires router.

Private AI factory

$6–12M

Year 1 (CapEx heavy). Drops to $1.5–3M/yr run-rate. Full sovereignty. Open-source models.

Indicative ranges for illustration. Actual costs depend on use case mix, model selection, data volume, and geographic deployment. The right answer is almost always hybrid.
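
Because the private AI factory is front-loaded, the comparison changes with the time horizon. The sketch below compares cumulative cost using the midpoint of each range above as an assumption ($6M/yr all-API, $3.75M/yr hybrid, $9M in year 1 then $2.25M/yr private). It ignores discounting, hardware refresh, and falling token prices, so it is an illustration of the method rather than a forecast.

```python
# (year-1 cost, run-rate from year 2 onward) in $M, midpoints of the ranges above.
ALL_API = (6.0, 6.0)
HYBRID = (3.75, 3.75)
PRIVATE = (9.0, 2.25)


def cumulative_cost(year1: float, run_rate: float, years: int) -> float:
    """Total spend in $M after `years` years."""
    return year1 + run_rate * (years - 1)


def first_year_cheaper(a: tuple, b: tuple, horizon: int = 10):
    """First year in which strategy `a` has cost less in total than `b`, else None."""
    for year in range(1, horizon + 1):
        if cumulative_cost(*a, year) < cumulative_cost(*b, year):
            return year
    return None


if __name__ == "__main__":
    print(first_year_cheaper(PRIVATE, ALL_API))
    print(first_year_cheaper(PRIVATE, HYBRID))
```

Under these midpoint assumptions the private option overtakes all-API early but takes several more years to overtake hybrid, which is the kind of sensitivity a break-even business case should make explicit.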

Questions the board should be asking

What is our current monthly token consumption and run-rate — and who owns the number?
What is the blended cost per AI-assisted decision across our top 5 use cases?
For each use case, can we quantify the value generated per dollar of token spend?
Are we routing requests to the right model tier, or paying frontier prices for commodity tasks?
What percentage of our AI workloads require data sovereignty — and are those workloads on sovereign infrastructure?
Do we have a private AI factory business case, and at what volume does it break even vs API?
Is our AI cost growing linearly with value, or are we on a diverging curve?
Who in the organisation has P&L accountability for AI economics — not just AI delivery?
