Every call to a large language model consumes tokens — fragments of text that are the unit currency of inference. The economics of those tokens now drive the total cost of AI ownership for every bank, regulator, and exchange deploying GenAI at scale. Most organisations are running pilots without understanding their run-rate. The ones that do are making fundamentally different architecture and vendor decisions.
Token pricing varies by orders of magnitude depending on model tier, provider, and whether you're consuming input or output tokens. Understanding this landscape is the starting point for any rational AI cost model.
| Model tier | Input / 1M tokens | Output / 1M tokens | Context window | Best for |
|---|---|---|---|---|
| Frontier reasoning (e.g. GPT-4o, Claude Opus, Gemini Ultra) | $2.50–15.00 | $10.00–75.00 | 128K–1M | Complex analysis, strategy |
| Mid-tier (e.g. Claude Sonnet, GPT-4o-mini, Gemini Flash) | $0.30–3.00 | $1.00–15.00 | 128K–200K | Production workloads |
| Lightweight (e.g. Haiku, Gemini Nano, Phi, Mistral) | $0.01–0.25 | $0.05–1.00 | 32K–128K | High-volume, classification |
| Self-hosted open-source (e.g. Llama, Nemotron, Mixtral) | Infra cost only | Infra cost only | Configurable | Sovereignty, cost control |
Prices indicative as of Q1 2026. Frontier model pricing is falling ~40–50% annually. The smart play is architecture that can swap models without re-engineering the application layer.
The board question is never "what does AI cost?" — it's "what does each unit of AI generate in value?" This framework connects token spend to business outcomes.
Direct token consumption across all models in production. Scales linearly with usage. The largest and most controllable cost lever — model selection, prompt engineering, caching, and routing decisions directly impact this line.
GPU compute (NVIDIA A100/H100/Blackwell), storage, networking, and orchestration. For self-hosted models this is the dominant cost; for API-consumed models it's embedded in token pricing. The private AI factory vs public cloud decision lives here.
Prompt engineering, RAG pipeline development, fine-tuning, evaluation harnesses, MLOps, and integration. Often underestimated — the "last mile" from model to production is where most programmes stall and overspend.
Model risk management (SR 11-7 / SS1/23), testing, monitoring, audit, bias detection, explainability tooling, and the human oversight layer that regulators demand. Under-investing here is a false economy — it creates regulatory exposure.
Data acquisition, labelling, cleaning, vector embeddings, and ongoing data pipeline maintenance. RAG architectures shift cost from fine-tuning to data infrastructure. The quality of your data layer determines the quality ceiling of every model.
Every token consumed should trace to one of four value categories. If it doesn't, you're subsidising experimentation at production prices.
The most consequential infrastructure decision in enterprise AI. It determines your cost floor, your sovereignty posture, and your ability to scale.
| Dimension | Public cloud / API | Private AI factory | Hybrid |
|---|---|---|---|
| Capital outlay | Low (OpEx) | High (CapEx) | Medium |
| Unit cost at scale | Higher (margin embedded) | Lower (amortised) | Optimised |
| Data sovereignty | Shared tenancy risk | Full control | Tiered |
| Model flexibility | Widest choice | Open-source focused | Best of both |
| Time to production | Weeks | Months | Weeks–Months |
| Regulatory comfort | Depends on jurisdiction | Highest | High |
| Scaling ceiling | Elastic | Capacity-bound | Elastic + local |
| Best for | Early stage, variable demand | Sovereignty-first, high volume | Most enterprises at scale |
The sweet spot for most FS institutions: a hybrid model — sovereign-sensitive workloads (customer data, credit models, surveillance) on a private AI factory; frontier reasoning tasks and low-sensitivity workloads via API. The router layer that decides which request goes where is the most important piece of AI infrastructure you'll build.
Two technical levers with outsized impact on token costs. Both require architectural decisions made early and are hard to bolt on later.
Most enterprise queries have semantic duplicates — "Who are our top 10 customers in the energy sector?" asked different ways by different teams. Semantic caching stores responses keyed by semantic similarity, not exact text match. If a cached answer is 95%+ similar to an incoming query, return it without re-invoking the model.
Implementation: Embed all queries into a vector store. For new query, find top-K similar cached responses. If score > threshold, return cached result with freshness check.
Model economics: A typical enterprise with 10,000 daily queries might see 30–40% hit rates on semantic cache. At $3/1M input tokens and average query = 1,000 tokens, that's $30/day saved per 30% cache hit rate, or ~$10K/year. With caching infrastructure, ROI is positive in week one.
Most LLM providers offer batch APIs that are 50% cheaper than real-time. The tradeoff: requests are processed on the provider's schedule, usually 8–24 hours later. Not suitable for customer-facing or time-critical work, but perfect for overnight reports, periodic analysis, and batch scoring.
Use cases: Daily regulatory compliance reports (10K documents), monthly market analysis, quarterly portfolio rebalancing analysis, nightly contact centre call summaries.
Model economics: A bank running 1,000 complex analyses per month at 5,000 tokens each and paying $10/1M tokens for real-time would spend $50. Via batch at $5/1M tokens, that's $25. Annual savings: $300. Multiply across 20 use cases and you're looking at $6–8K annually from this one lever alone. Scaled to a large institution with 100+ batch use cases, batch processing can save $200K–500K/year.
For periodic batch workloads (e.g. daily regulatory reports), use semantic caching to avoid re-running identical analyses, then batch the remainder at night. This creates a 70% cost reduction vs real-time on the non-cached portion.
Most organisations underestimate the true cost of AI by 3–5×. Here's what a realistic TCO should capture:
AI token costs grow quickly without guardrails. Here's the framework to prevent bill shock:
Each AI use case gets a monthly token budget based on expected volume × cost/token. Budgets are tracked in real time. When a use case hits 80% of budget, alert owner. At 100%, trigger manual approval for overflow.
Track not just total cost, but cost-per-unit-of-value. For credit memos: $/memo. For customer intelligence: $/recommendation. For market reports: $/page. If cost-per-unit drifts upward, you have a prompt efficiency or query volume problem to solve.
Plot your daily/weekly token consumption. If the curve is accelerating (non-linear growth), you have a quality degradation or scope creep problem. Flat or linear growth is healthy. Exponential growth is a red flag.
Track the split of workloads across model tiers. If you're running 50% of your queries on frontier models and 50% should be lightweight, you're overspending by 5–10×. Healthy ratio: 10% frontier, 40% mid-tier, 50% lightweight.
Every quarter, revalidate the original business case vs actual spend and delivered value. If a use case is running 2× the projected token budget with 50% lower value realisation, kill it and redeploy the budget.
Annual run-rate for a mid-size bank or exchange running 15–25 AI use cases in production, modelled across three infrastructure strategies.
Per year. Fastest to deploy. Highest unit cost. Vendor-dependent. Limited sovereignty.
Per year. Best balance. Sovereign workloads on-prem, frontier reasoning via API. Requires router.
Year 1 (CapEx heavy). Drops to $1.5–3M/yr run-rate. Full sovereignty. Open-source models.
Indicative ranges for illustration. Actual costs depend on use case mix, model selection, data volume, and geographic deployment. The right answer is almost always hybrid.
Get in touch to discuss how this accelerator fits your institution.
Book a Consultation →