SubQ 12M Context Window — Long-Context Pricing Impact

I wrote this with the pricing table open, not as a generic AI-tools list. The useful question is simple: where does this choice change the bill, the cap, or the model-routing decision?

Subquadratic has launched SubQ, a new long-context AI model aimed squarely at repository-scale coding, long-running agents, and document-heavy workflows. The headline claim is unusually aggressive: a 12-million-token context window today, with a 50-million-token target later this year.

There is no public per-token rate yet. Instead, Subquadratic is positioning the model as a cost-efficiency play: its site says SubQ runs at one-fifth the cost of other leading LLMs, while its coding-agent product claims about 25% lower bills and faster exploration by redirecting expensive model turns.

That makes SubQ worth watching for one reason. If the claims hold up in production, it attacks one of the most expensive parts of modern AI systems: stuffing huge context into premium frontier models.

What changed

Subquadratic says SubQ is built on Subquadratic Selective Attention, an architecture designed to avoid the quadratic scaling problem of standard transformer attention. In plain English: instead of comparing every token with every other token, the model tries to select the relationships that matter without making the selection step itself quadratic.
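To see why quadratic scaling matters at these lengths, here is a back-of-envelope comparison of how many query-key scores one attention layer computes. It is a sketch under stated assumptions, not SubQ's algorithm: it assumes each query scores a fixed number of selected keys (`K_SELECTED = 4_096`, a hypothetical value) and ignores the cost of choosing those keys, which is the step Subquadratic says it also keeps below quadratic.

```python
# Back-of-envelope count of attention scores per layer: dense vs. a
# hypothetical fixed-budget selection. K_SELECTED is an illustrative
# assumption, not a published SubQ parameter; selection cost is ignored.

K_SELECTED = 4_096  # hypothetical number of keys each query attends to


def dense_pairs(n_tokens: int) -> int:
    """Standard attention: every query token scores every key token."""
    return n_tokens * n_tokens


def selected_pairs(n_tokens: int, k: int = K_SELECTED) -> int:
    """Selective sketch: each query scores at most k keys."""
    return n_tokens * min(k, n_tokens)


for n in (128_000, 1_000_000, 12_000_000):
    ratio = dense_pairs(n) / selected_pairs(n)
    print(f"{n:>12,} tokens: dense / selected = {ratio:,.0f}x")
```

The point is the shape rather than the exact numbers: under this assumption the dense-to-selected ratio grows linearly with context length, so the gap is widest at the multi-million-token lengths SubQ targets. Real-world speedups depend on how selection is computed and on hardware.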

The company is launching two beta products:

Product	What it does	Pricing implication
SubQ API	OpenAI-compatible API with a 12M-token context window, streaming, and tool use	Lets teams send entire repositories, large histories, or pipeline state in one call
SubQ Code	Coding-agent layer for Claude Code, Codex, Cursor-style workflows	Claims ~25% lower bills by using long context for exploration and redirecting expensive turns
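Sending an entire repository in one call starts with packing it into a single prompt and checking that it fits. The sketch below assumes a rough 4-characters-per-token heuristic, a hypothetical 32,000-token reserve for the answer, and an illustrative list of file suffixes. None of these are SubQ parameters, and a real estimate should use the provider's own tokenizer.

```python
# Sketch: pack a repository into one long-context prompt and check it
# against a token budget. CHARS_PER_TOKEN, OUTPUT_RESERVE and the suffix
# list are illustrative assumptions, not SubQ parameters.
from pathlib import Path

CHARS_PER_TOKEN = 4           # rough heuristic; real tokenizers vary
CONTEXT_LIMIT = 12_000_000    # SubQ's advertised window
OUTPUT_RESERVE = 32_000       # hypothetical room kept for the model's answer
SOURCE_SUFFIXES = {".py", ".ts", ".go", ".rs", ".java", ".md"}


def pack_repository(root: str) -> str:
    """Concatenate matching files, each prefixed with its relative path."""
    base = Path(root)
    parts = []
    for path in sorted(base.rglob("*")):
        if path.is_file() and path.suffix in SOURCE_SUFFIXES:
            text = path.read_text(encoding="utf-8", errors="replace")
            parts.append(f"### FILE: {path.relative_to(base)}\n{text}")
    return "\n\n".join(parts)


def estimate_tokens(text: str) -> int:
    """Character-based estimate, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_one_call(text: str, limit: int = CONTEXT_LIMIT,
                     reserve: int = OUTPUT_RESERVE) -> bool:
    """Assumes the window must hold both the prompt and the answer."""
    return estimate_tokens(text) + reserve <= limit
```

Usage would look like `prompt = pack_repository("path/to/repo")` followed by `fits_in_one_call(prompt)`. The path headers keep file boundaries visible to the model, which matters when you later ask where something is defined.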

Subquadratic also says the model reaches 150 tokens per second and can reason across 12M tokens at one-fifth the cost of other leading LLMs. Exact token pricing has not been published, so buyers should treat these figures as vendor claims until beta invoices or a public pricing table are available.

Why this matters for AI bills

Long context is valuable, but it’s usually expensive. OpenAI’s GPT-5.5 is listed at $5 per 1M input tokens and $30 per 1M output tokens. Claude Opus 4.7 is $5 / $25, and Gemini 3 Pro is $2 / $12. Those are manageable for normal prompts; they become painful when applications push hundreds of thousands or millions of tokens into every request.

A 1M-token prompt to GPT-5.5 costs about $5 before output at standard input rates. Ten such prompts cost $50 before the model writes a single answer. If a workflow repeatedly ships large repositories, customer histories, legal archives, research corpora, or agent memory into a frontier model, context becomes the bill.
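The arithmetic above is easy to reproduce for your own token mix. This minimal sketch uses the list prices quoted in this article; the dictionary keys are informal labels, not official API model IDs, and it ignores caching discounts, batch pricing, and any long-context surcharges a provider may apply.

```python
# Per-request cost from per-1M-token list prices quoted in this article.
# Ignores caching discounts, batch pricing, and long-context surcharges.

PRICES_PER_1M = {
    # label: (input USD, output USD) per 1M tokens; labels are informal
    "gpt-5.5": (5.00, 30.00),
    "claude-opus-4.7": (5.00, 25.00),
    "gemini-3-pro": (2.00, 12.00),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_rate, out_rate = PRICES_PER_1M[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


# Ten 1M-token prompts to GPT-5.5, before any output: 10 * $5.
print(request_cost("gpt-5.5", 1_000_000, 0) * 10)  # 50.0
```

Adding even a short answer changes little at this scale: a 1M-token Claude Opus 4.7 prompt with a 5,000-token reply comes to $5.125, so input volume dominates.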

SubQ’s pitch is that architecture can change that math. If a long-context model can process much larger prompts at materially lower cost, teams may not need to choose between expensive brute-force context and complex retrieval pipelines for every workload.
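One way to make the pitch concrete is to project a context-heavy workload at a reference rate and at the claimed one-fifth. Because SubQ has no public rate card, the 0.2 multiplier below is a hypothetical reading of the vendor claim, and the workload (200 requests a day carrying 800,000 context tokens each) is illustrative. Output tokens are left out to keep the comparison on the input side.

```python
# What the "one-fifth the cost" claim would mean for a context-heavy workload.
# CLAIMED_MULTIPLIER is a hypothetical reading of a vendor claim, not a price.

def monthly_input_cost(requests_per_day: int, context_tokens: int,
                       rate_per_1m: float, days: int = 30) -> float:
    """Input-side spend only; output tokens are ignored for simplicity."""
    return requests_per_day * days * context_tokens * rate_per_1m / 1_000_000


REFERENCE_RATE = 5.00        # GPT-5.5 input, USD per 1M tokens
CLAIMED_MULTIPLIER = 1 / 5   # hypothetical, pending a public rate card

baseline = monthly_input_cost(200, 800_000, REFERENCE_RATE)
claimed = monthly_input_cost(200, 800_000, REFERENCE_RATE * CLAIMED_MULTIPLIER)
print(f"reference: ${baseline:,.0f}/mo, at claimed 1/5: ${claimed:,.0f}/mo")
```

The gap only becomes a real saving if answer quality holds; a cheaper call that fails more often can cost more per useful result.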

For current reference rates, compare our OpenAI pricing, Anthropic pricing, and Google AI pricing pages, or model your own token mix in the AI token calculator.

Context window comparison

Model / platform	Public context headline	Pricing status	Best-fit workload
SubQ	12M tokens	Public per-token price not posted; vendor claims 1/5 cost	Full repositories, long agent state, huge document sets
GPT-5.5	1.05M tokens	$5 input / $30 output per 1M tokens	Premium reasoning, coding, research synthesis
Gemini 3.1 Pro	Up to 2M tokens in current site coverage	Paid API tiers	Long-context and multimodal analysis
Claude Opus 4.7	Strong long-context/coding focus	$5 input / $25 output per 1M tokens	High-quality coding, reasoning, editorial workflows

The important distinction is that context length and cost are separate buying questions. A huge window is only useful if retrieval quality holds up near the end of the prompt, latency stays acceptable, and the price doesn’t destroy margins.
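A quick way to see that window size and cost are separate questions: splitting a corpus across a smaller window barely changes the token count, but it multiplies calls and breaks the corpus into pieces the model never sees together. The sketch below assumes a hypothetical 20,000-token per-call overhead (instructions, schema, running summary) repeated in every call, and ignores output tokens and chunk overlap.

```python
# Sketch: calls needed to cover a corpus at a given window, and the input
# tokens that costs once per-call overhead is repeated. OVERHEAD is hypothetical.

def calls_needed(corpus_tokens: int, window_tokens: int, overhead_tokens: int = 0) -> int:
    usable = window_tokens - overhead_tokens
    if usable <= 0:
        raise ValueError("per-call overhead must be smaller than the window")
    return -(-corpus_tokens // usable)  # integer ceiling division


def total_input_tokens(corpus_tokens: int, window_tokens: int, overhead_tokens: int = 0) -> int:
    calls = calls_needed(corpus_tokens, window_tokens, overhead_tokens)
    return corpus_tokens + calls * overhead_tokens


CORPUS = 10_000_000
OVERHEAD = 20_000  # hypothetical tokens repeated in every call
for name, window in (("GPT-5.5", 1_050_000), ("SubQ", 12_000_000)):
    print(name, calls_needed(CORPUS, window, OVERHEAD),
          total_input_tokens(CORPUS, window, OVERHEAD))
```

Under these assumptions the smaller window needs ten calls and only about 2% more input tokens. The bigger difference is qualitative: ten separate calls cannot reason across chunk boundaries without extra summarization or retrieval work.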

What Subquadratic claims on benchmarks

The New Stack reports several strong claims from Subquadratic’s launch materials:

92.1% needle-in-a-haystack retrieval at 12M tokens
52x faster than dense attention at 1M tokens
82.4% on SWE-bench Verified in one reported run
83 on MRCR v2, described as ahead of OpenAI in the launch comparison

Those numbers are interesting, especially for coding agents and long-context retrieval. But they warrant caution. The reported evaluations are early, the model is in beta, and The New Stack notes that each model was run only once in the technical paper because inference was expensive. SWE-bench margins can also depend heavily on the harness, not only the model.
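A single run on a finite test set also carries plain sampling noise. This sketch puts a rough 95% interval around the reported SWE-bench Verified score using the normal approximation to a binomial proportion, assuming the run covered all 500 tasks in that benchmark. It captures sampling error only, not harness differences or run-to-run model variance.

```python
# Rough 95% interval for one benchmark run (normal approximation to a
# binomial proportion). Sampling noise only; harness effects are not modeled.
import math


def proportion_ci(score: float, n_tasks: int, z: float = 1.96) -> tuple[float, float]:
    se = math.sqrt(score * (1 - score) / n_tasks)
    return max(0.0, score - z * se), min(1.0, score + z * se)


# Assumes the reported 82.4% covered all 500 SWE-bench Verified tasks.
low, high = proportion_ci(0.824, 500)
print(f"82.4% on 500 tasks -> roughly {low:.1%} to {high:.1%}")
```

The interval spans roughly plus or minus 3 points, so a lead of a point or two between single runs can sit inside the noise.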

The safe interpretation: SubQ isn’t yet a proven replacement for GPT-5.5, Claude Opus 4.7, or Gemini in every high-value workflow. It’s a serious new benchmark target for teams whose costs are dominated by massive context.

Who should test SubQ first

Coding-agent teams should pay attention. Repository-scale exploration is one of the clearest use cases for a 12M-token model. If SubQ Code can map a full repo and reduce the number of expensive frontier-model calls, the savings could be real even if final implementation turns still use Claude, GPT-5.5, or Gemini.

Enterprise search and research teams are another fit. Legal archives, customer histories, compliance material, incident logs, and internal documentation often exceed the comfortable context limits of mainstream models. A larger context window can simplify architecture if retrieval quality stays high.

Agent platforms with persistent state should also test it. Today’s agents often summarize, compress, retrieve, and discard state because context is scarce. A 12M-token window changes what can be kept directly available, though it doesn’t eliminate the need for memory hygiene.
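Memory hygiene can be as simple as a hard token budget with oldest-first eviction. In this minimal sketch the event format and the 4-characters-per-token estimate are illustrative assumptions; a production agent would likely summarize evicted events rather than drop them outright.

```python
# Sketch of basic memory hygiene: keep the newest agent events within a
# token budget, evicting the oldest first. Token estimate is a heuristic.
from collections import deque


class AgentMemory:
    def __init__(self, budget_tokens: int, chars_per_token: int = 4):
        self.budget = budget_tokens
        self.chars_per_token = chars_per_token
        self.events: deque[str] = deque()
        self.used = 0

    def _tokens(self, text: str) -> int:
        return -(-len(text) // self.chars_per_token)  # round up

    def add(self, text: str) -> None:
        self.events.append(text)
        self.used += self._tokens(text)
        # Evict oldest events until back under budget; always keep the newest.
        while self.used > self.budget and len(self.events) > 1:
            self.used -= self._tokens(self.events.popleft())

    def render(self) -> str:
        return "\n".join(self.events)
```

With a 12M-token budget, eviction would trigger rarely, but every retained token is still billed on each request, so the budget is a cost control as much as a capacity limit.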

Who should wait

Most production teams shouldn’t immediately replace their model routing with SubQ. Wait if your workload is mostly short prompts, simple classification, routine support drafts, or extraction jobs that already run cheaply on smaller models.

Also wait if you need mature enterprise controls, published SLAs, deterministic pricing, audited data handling, or stable public documentation before adopting a new model vendor. The launch is exciting, but procurement teams need invoices, contract terms, and real usage data.

Practical advice

If you get beta access, test SubQ against a concrete long-context bill, not a generic benchmark.

Pick one expensive workflow. Good candidates are repo-wide coding questions, multi-document research, contract review, or long agent history analysis.
Compare total task cost. Measure SubQ against GPT-5.5, Claude Opus 4.7, and Gemini on cost per successful answer, not only token price.
Track retrieval failures. With huge contexts, the key question is whether the model finds the right details reliably near the end of the prompt.
Keep a frontier fallback. Use SubQ for context gathering or first-pass analysis, then escalate hard reasoning to your current best model when needed.
Wait for public pricing before committing architecture. A vendor claim of 1/5 cost is useful, but a published rate card is what lets teams forecast spend.
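The "cost per successful answer" and "frontier fallback" steps combine naturally. This sketch compares a frontier-only strategy, a long-context model alone, and long-context-first with escalation. Every cost and success rate is a hypothetical placeholder to replace with your own measurements, and it assumes failures are detectable (tests, a checker) and that the frontier model succeeds on escalated tasks at its overall rate, which is likely optimistic because escalated tasks tend to be harder.

```python
# Compare strategies on cost per successful answer, not token price.
# All numbers are hypothetical placeholders; measure your own workload.
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    cost_per_task: float   # USD per attempt, input plus output
    success_rate: float    # fraction of tasks judged correct


def cost_per_success(opt: Option) -> float:
    return opt.cost_per_task / opt.success_rate


def cheap_first_with_fallback(cheap: Option, frontier: Option) -> Option:
    """Try the cheaper model first; escalate detected failures."""
    fail = 1 - cheap.success_rate
    cost = cheap.cost_per_task + fail * frontier.cost_per_task
    success = cheap.success_rate + fail * frontier.success_rate
    return Option(f"{cheap.name} -> {frontier.name}", cost, success)


frontier = Option("frontier only", cost_per_task=5.00, success_rate=0.80)
long_ctx = Option("long-context model", cost_per_task=1.00, success_rate=0.60)
routed = cheap_first_with_fallback(long_ctx, frontier)

for opt in (frontier, long_ctx, routed):
    print(f"{opt.name}: ${cost_per_success(opt):.2f} per success, "
          f"{opt.success_rate:.0%} solved")
```

Cost per success alone can mislead: with these placeholders the long-context model by itself is cheapest per success but leaves 40% of tasks unsolved, while the routed strategy solves more tasks than either model alone at about half the frontier cost per success.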

My read

SubQ is a pricing story because it challenges the assumption that huge context must mean huge bills.

If Subquadratic’s architecture performs as advertised, long-context workloads could move from expensive frontier-model calls toward a cheaper dedicated context layer. That wouldn’t kill retrieval, caching, or model routing; it would give builders another lever.

For now, treat SubQ as a high-priority benchmark candidate for long-context AI systems. The next thing buyers need is simple: public API pricing and real production cost data.

For more context, read our context windows and cost impact guide and GPT-5.5 vs GPT-5.4 pricing comparison.

Sources: Subquadratic launch site and The New Stack: Subquadratic debuts a 12-million-token window.