A startup claims it broke through a bottleneck that’s holding back LLMs
MIT Technology Review

Miami-based AI startup Subquadratic came out of stealth mode last month with a huge claim. It announced that it had solved a mathematical bottleneck that had been holding back large language models for almost a decade. The details were thin, and many people were unconvinced. But Subquadratic has started to bring the receipts, sharing the results of an independent evaluation of its new tech.

The results suggest that the company’s claims might be worth paying attention to. According to Subquadratic, it has developed a new kind of LLM, called SubQ, that is faster and cheaper and uses a lot less energy than any other model on the market. The company also claims that SubQ is able to process up to 12 times as much text at once as most other models, allowing it to carry out a range of data-heavy tasks, such as analyzing hundreds of documents or entire code bases.

What’s more, Subquadratic says, SubQ does this while more or less matching the performance of the best models put out by Google DeepMind, OpenAI, and Anthropic on key tasks like coding. The problem was that the company at first provided little evidence for its claims beyond a handful of self-published test scores. And it has yet to make SubQ widely available for people to try out themselves. So it’s no surprise that Subquadratic’s claims were met with skepticism.

Dan McAteer, an artificial intelligence engineer, captured the overall response on X: “SubQ is either the biggest breakthrough since the Transformer ... or it’s AI Theranos.”

A month on, the company has published more information about its model, including the results of additional independent tests run by third-party firm Appen. “We expected healthy skepticism,” says Subquadratic cofounder and chief technology officer Alex Whedon. “In hindsight, releasing the third-party benchmarks alongside the initial announcement would have preempted much of the skepticism, which is why we’re taking the time to make sure any future results are fully verified before putting them out.”

Subquadratic asked Appen, which evaluates other companies’ models, to run its tests on SubQ. The results seem to back up a lot of Subquadratic’s claims. “That was really exciting to me, it validated their architecture,” says Jeanine Sinanan-Singh, Appen’s director of generative AI research. “I was like, ‘Wow, this could be a game changer,’ because models struggle with speed and inefficiency,” she adds. “But when you have kind of shocking results, it’s really not as credible when you say it yourself.”

SubQ won’t replace existing top models across the board, but it could offer huge increases in speed at a fraction of the typical cost for certain tasks. Subquadratic insists that in the long run, though, its breakthrough could change how LLMs are built. “We hope we’re kicking off a new age of efficiency,” says Justin Dangel, the firm’s cofounder and CEO. “We don’t think anybody will be building on transformers in a few years.”

Attention!

To understand why Subquadratic’s claims are a big deal, let’s dig into how most LLMs work. The key mechanism inside an LLM is a type of neural network called a transformer, which runs a process known as dense attention. Today’s LLMs typically chain together multiple transformers. (The foundational paper of the LLM era, published by researchers at Google in 2017, was titled “Attention Is All You Need.”)

Dense attention works like this: When a transformer processes a chunk of text, it first encodes each word (or part of a word, known as a token) with a number. To capture the meaning of the full text, it then multiplies each of those numbers by every other number in that text. For example, a piece of text 10,000 words long would kick off almost 50 million individual multiplications. That’s a lot of computation and the main reason that LLMs are notorious power hogs.
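For readers who want to see the mechanism, here is a minimal sketch of dense attention in plain Python. It goes one step beyond the simplified description above: in a real transformer each token is a list of numbers (a vector) rather than a single number, and the pairwise scores are normalized with a softmax before being used. The function name `dense_attention` and the toy vectors are illustrative assumptions, not code from any particular model. Note that this version scores every token against every token, itself included, so it counts n × n scores rather than the roughly n × (n − 1) / 2 distinct pairs counted above; both grow quadratically.

```python
import math

def dense_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs = []
    score_count = 0
    for q in queries:
        # Score this token against every token in the text.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        score_count += len(scores)
        # Softmax turns the scores into weights that sum to 1.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # The output is the weighted average of the value vectors.
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(len(values[0]))])
    return outputs, score_count

# Hypothetical encodings for a three-token text.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, score_count = dense_attention(vectors, vectors, vectors)
print(score_count)  # 9 scores for 3 tokens
```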

“If you want to summarize The Great Gatsby, you have to look at the first word and the last word together, and then you have to look at every other combination,” says Dangel.

As the length of the text increases, the number of computations skyrockets. That’s because each additional number must be multiplied by every number that came before it. Double the number of words, and you roughly quadruple the number of computations, a rate of increase known as quadratic growth. (You can picture this yourself: Draw a circle and mark dots around its edge. Each dot is a token. Then draw lines between pairs of dots to represent the multiplication of those two tokens. A circle with five dots will have 10 lines crossing it. Make it 10 dots and you will have 45 lines, 20 dots and you will have 190 lines, and so on.)
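The dot-and-line counts follow a simple formula: n tokens produce n × (n − 1) / 2 distinct pairs. The short sketch below (the function name `pair_count` is ours) reproduces the figures in the circle example and the 10,000-word case, and shows what happens when the text doubles in length.

```python
def pair_count(n_tokens):
    """Number of distinct token pairs: n * (n - 1) / 2."""
    return n_tokens * (n_tokens - 1) // 2

for n in (5, 10, 20, 10_000, 20_000):
    print(f"{n:>6} tokens -> {pair_count(n):>11,} pairs")

# Doubling the text length roughly quadruples the work.
growth = pair_count(20_000) / pair_count(10_000)
print(f"doubling 10,000 tokens multiplies the pairs by {growth:.4f}")
```

The 10,000-token case comes to 49,995,000 pairs, which is the "almost 50 million" multiplications cited for a text of that length.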

Slashing costs

Subquadratic’s solution is to ditch dense attention, the core operation of a transformer, in favor of what’s known as sparse attention, which slashes the number of computations needed. Instead of multiplying the number assigned to each token by every other number, sparse attention selects just some of the numbers to multiply. The idea is that not all relationships between words in a piece of text matter.
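To get a feel for the savings, consider the simplest kind of sparse attention: each token is compared only with a fixed number of the tokens just before it. The sketch below counts the pairs under that scheme and under dense attention. The window size of 64 is an arbitrary assumption chosen for illustration, and this is not a description of how SubQ works.

```python
def dense_pairs(n_tokens):
    """Every distinct pair of tokens is multiplied once."""
    return n_tokens * (n_tokens - 1) // 2

def windowed_pairs(n_tokens, window):
    """Each token is paired with at most `window` of the tokens before it."""
    return sum(min(i, window) for i in range(n_tokens))

n = 10_000
window = 64  # assumed value, for illustration only
print(dense_pairs(n))             # 49995000
print(windowed_pairs(n, window))  # 637920

# Doubling the text roughly doubles the sparse count instead of quadrupling it.
print(windowed_pairs(2 * n, window) / windowed_pairs(n, window))
```

With a fixed window, the pair count grows in proportion to the length of the text rather than to its square. The catch is that the pairs left out might have mattered.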

“Sparse attention says not all of those relationships are important, because they’re not,” says Whedon. “If you’re reading a book, you’re not going to look at the first and second words, first and third. That’s insane.”

It’s a simple approach, and Subquadratic is not the first to try it. “Pretty much everything under the sun has been attempted,” says Will Depue, an independent AI researcher who previously worked at OpenAI. “It’s not impossible, but it’s akin to running a four-minute mile.”

Previous techniques for selecting which numbers to multiply and which to ignore have not produced a mechanism that can capture the meaning of a document as well as dense attention can. Subquadratic claims to have cracked the problem at last. It pitches SubQ as the first sparse-attention LLM that rivals mainstream dense-attention models in performance.

“Historically, most mechanisms have used fixed patterns, like always comparing the first word to the fifth,” says Whedon. “That’s pretty limiting. Language is too sophisticated for that. And so, one of the things that makes our mechanism unique is that we dynamically select which ones are important.”

The firm won’t say exactly how SubQ chooses which words to focus on, but the selection is calculated on the fly and differs for each piece of text the model is given. “That’s kind of where the secret sauce is,” says Whedon.
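Since Subquadratic has not disclosed its method, the following is only a generic illustration of the difference between a fixed pattern and a content-dependent one. The function names and the relevance scores are hypothetical. The fixed pattern always picks the same positions; the dynamic version keeps whichever earlier tokens score highest for the text at hand.

```python
import heapq

def fixed_pattern(position, window):
    """Fixed pattern: always attend to the `window` tokens just before `position`."""
    return list(range(max(0, position - window), position))

def dynamic_top_k(scores, k):
    """Content-dependent selection: keep the indices of the k highest scores."""
    return sorted(heapq.nlargest(k, range(len(scores)), key=scores.__getitem__))

# Hypothetical relevance of seven earlier tokens to the token at position 7.
relevance = [0.9, 0.1, 0.05, 0.8, 0.02, 0.03, 0.2]

print(fixed_pattern(7, 3))          # [4, 5, 6]
print(dynamic_top_k(relevance, 3))  # [0, 3, 6]
```

In this toy case the fixed pattern misses the two most relevant tokens (positions 0 and 3), while the dynamic selection finds them. One caveat: this naive top-k still has to score every token before choosing, so it saves nothing on its own. A practical mechanism needs a cheap way to guess which tokens matter, and that is presumably the part Subquadratic is keeping to itself.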

Testing, testing

The upshot is that for certain tasks, SubQ may be faster and cheaper to run than most other models. Appen evaluated SubQ on a handful of standard tests.

  • In a straight-up speed test, which sets a baseline for how fast a model can operate in theory rather than assessing what a model can actually do, Appen found that SubQ was 56 times faster than models using FlashAttention, a widely used, heavily optimized implementation of dense attention.
  • On LiveCodeBench, a test that looks at how well models perform on competitive coding problems taken from real contests, SubQ scored 89.7%, putting it in the same ballpark as other top coding models.

“This model continues to provide frontier-level performance in coding,” says Appen’s Sinanan-Singh.

Subquadratic’s claims about cost are harder to verify because SubQ is not yet widely available. According to Dangel, it costs $2,600 to run Anthropic’s LLM Opus 4.6 through RULER 128, a test developed by Nvidia to assess a model’s ability to retrieve information from large data sets. And SubQ? “It cost us eight dollars,” he says.
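Taken at face value, Dangel's two figures imply a gap of more than two orders of magnitude. The arithmetic below uses only the numbers he quoted, which have not been independently verified.

```python
opus_cost_usd = 2600  # figure quoted by Dangel for Opus 4.6 on RULER 128
subq_cost_usd = 8     # figure quoted by Dangel for SubQ on the same test

ratio = opus_cost_usd / subq_cost_usd
share = subq_cost_usd / opus_cost_usd
print(f"{ratio:.0f}x cheaper, or {share:.2%} of the cost")
# prints "325x cheaper, or 0.31% of the cost"
```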

SubQ does seem to be able to handle very large data sets. The model has a context window (roughly akin to a working memory) up to 12 million tokens long. Most top models today have context windows one million tokens long. In a demo that Whedon ran for me, he asked SubQ to perform a task that required it to reason about information contained in 400 documents. It responded in seconds. When he gave Perplexity, a popular LLM-powered search engine, the same task, it failed to load all 400 documents.

Appen also ran the needle-in-a-haystack test, which assesses how well a model can retrieve specific information buried in a large amount of data. In its report, Appen states that SubQ scored 98% with context windows six million and 12 million tokens long, “sustaining near-perfect long-context retrieval at scales few models are tested at.”
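The idea behind a needle-in-a-haystack test is easy to sketch, though Appen's exact protocol is not described here. In the toy version below, one sentence containing a random code is hidden at a random position in a pile of filler text, and a model is scored on how often it retrieves the code. Everything in it is an assumption for illustration, including `keyword_model`, a trivial keyword search standing in for an LLM.

```python
import random

def build_haystack(needle, filler, n_sentences, position):
    """Return filler text with the needle sentence inserted at `position`."""
    sentences = [filler] * n_sentences
    sentences.insert(position, needle)
    return " ".join(sentences)

def keyword_model(context, question_keyword):
    """Stand-in for an LLM: return the first sentence containing the keyword."""
    for sentence in context.split(". "):
        if question_keyword in sentence:
            return sentence
    return ""

def retrieval_accuracy(model, trials, n_sentences, seed=0):
    """Fraction of trials in which the model's answer contains the hidden code."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        code = str(rng.randint(1000, 9999))
        needle = f"The secret code is {code}."
        position = rng.randint(0, n_sentences)
        haystack = build_haystack(needle, "The sky was grey that day.",
                                  n_sentences, position)
        if code in model(haystack, "secret code"):
            hits += 1
    return hits / trials

print(retrieval_accuracy(keyword_model, trials=20, n_sentences=50))
```

A keyword search finds the needle every time, which is why the test is only interesting for models that must read the whole context to answer. A score of 98% means the model found the buried fact in 98 of every 100 attempts.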

Too good to be true?

Despite the high scores, benchmarks paint an incomplete picture of what a model can and cannot do. Testing under very specific conditions is not a substitute for running a model on a wide range of real tasks. Subquadratic is offering SubQ as a model tailored to coding and to searching very large data sets. It says that tens of thousands of potential users have already signed up for early access, including more than 500 enterprise customers. But there’s a long waitlist, and the firm has given very few people access so far.

Subquadratic’s response is that it is a new, small company with limited resources and cannot serve too many people at once. Until more people get their hands on the model and try it out for themselves, some skepticism is justified.

One nagging issue is that Subquadratic reused the weights (values set within a model during training that determine how it will behave) from a version of the Chinese open-source model Qwen to bootstrap SubQ, rather than training it from scratch. That’s a common thing for model makers to do, but it cuts across Subquadratic’s claim that it has fully reinvented how LLMs work.

“They may have built something real and useful,” says Depue. “But the public evidence does not yet justify the stronger claim that they have solved the quadratic attention bottleneck.”

In the meantime, Subquadratic cofounder Whedon insists that making something different was his only option. If you want to build a competitive model, you have to have new ideas, he says: “We’re more up against it than OpenAI is.”
