
Companies scaling AI deployments are hitting an invisible performance wall. The culprit? Static speculators that can’t keep up with shifting workloads.
Speculators are smaller AI models that work alongside large language models during inference. They draft several tokens ahead of time, which the main model then verifies in parallel. This technique, called speculative decoding, has become essential for companies trying to reduce inference costs and latency. Instead of generating one token at a time, the system can accept multiple tokens per step, dramatically improving throughput.
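To see what ATLAS is accelerating, it helps to look at the basic loop behind speculative decoding. The sketch below is a minimal, greedy illustration, not Together AI’s implementation; `draft_next` and `target_batch` are hypothetical stand-ins for the small speculator and the large target model, and production systems compare probability distributions rather than single greedy tokens.

```python
def speculative_step(prompt, draft_next, target_batch, k=5):
    # 1. The small speculator drafts k tokens autoregressively (cheap, sequential).
    drafted = []
    ctx = list(prompt)
    for _ in range(k):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # 2. The large target model scores all k draft positions in one parallel pass.
    predicted = target_batch(list(prompt), drafted)

    # 3. Accept the longest prefix on which the target agrees with the draft;
    #    the target's first disagreement still yields one usable token.
    accepted = []
    for d, t in zip(drafted, predicted):
        if d != t:
            accepted.append(t)
            break
        accepted.append(d)
    return accepted  # 1..k tokens emitted per expensive target-model pass


# Toy usage: stand-in "models" that happen to agree on a canned continuation.
canned = ["def", "add", "(", "a", ","]
print(speculative_step(["#"], lambda ctx: canned[len(ctx) - 1],
                       lambda p, d: canned[:len(d)]))
```

In the best case the target model emits k tokens for the price of a single forward pass; in the worst case it still emits one, so output quality never degrades.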
Together AI today announced research and a new system called ATLAS (AdapTive-LeArning Speculator System) that aims to help companies overcome the challenge of static speculators. The technique provides a self-learning inference optimization capability that can deliver inference performance up to 400% faster than the baseline offered by existing inference technologies such as vLLM.
The company, founded in 2023, has focused on optimizing inference on its enterprise AI platform. Earlier this year it raised $305 million as adoption and customer demand grew.
“The companies we work with generally, as they grow, see changes in workloads and then don’t see as much acceleration in speculative execution as they once did,” Tri Dao, chief scientist at Together AI, told VentureBeat in an exclusive interview. “These speculators often don’t perform well when their workload domain starts to change.”
The workload drift problem no one talks about
Most speculators in production today are “static” models. They are trained once on a fixed dataset representing expected workloads and then deployed without any adaptability. Companies like Meta and Mistral ship pre-trained speculators alongside their core models. Inference platforms like vLLM use these static speculators to increase throughput without changing the quality of the output.
But there is a problem. When a company’s use of AI evolves, the static speculator’s accuracy plummets.
“If you are a company that produces coding agents, and most of your developers write in Python, then suddenly some of them switch to writing Rust or C, and you see the speed start to slow down,” Dao explained. “The speculator has a mismatch between what it was trained on and what the actual workload is.”
This workload drift represents a hidden tax on AI scaling. Companies either accept the performance degradation or invest in retraining custom speculators, a process that only captures a snapshot in time and quickly becomes outdated.
How Adaptive Speculators Work: A Dual Model Approach
ATLAS uses a dual-speculator architecture that combines stability with adaptation:
The static speculator – A heavyweight model trained on broad data that provides consistent baseline performance. It serves as a “speed floor.”
The adaptive speculator – A lightweight model that continuously learns from live traffic. It specializes in emerging domains and usage patterns.
The confidence-aware controller – An orchestration layer that dynamically chooses which speculator to use and adjusts the speculation “lookahead” based on confidence scores.
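Together AI has not published the controller’s internals. The sketch below is only an illustration of how a confidence-aware router might decide between the two speculators; every name and threshold in it (the `SpeculatorStats` class, the 0.6 confidence floor, the 1,000-draft warm-up) is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class SpeculatorStats:
    """Running acceptance statistics for one speculator (illustrative)."""
    accepted: int = 0
    drafted: int = 0

    @property
    def acceptance_rate(self) -> float:
        return self.accepted / self.drafted if self.drafted else 0.0


def choose_speculator(static_stats: SpeculatorStats,
                      adaptive_stats: SpeculatorStats,
                      min_confidence: float = 0.6) -> str:
    """Pick the adaptive speculator only once it is demonstrably better.

    The static speculator acts as the "speed floor": it keeps serving until
    the adaptive one has seen enough live traffic and beats it on acceptance.
    """
    adaptive_ready = (
        adaptive_stats.drafted > 1_000
        and adaptive_stats.acceptance_rate >= min_confidence
        and adaptive_stats.acceptance_rate > static_stats.acceptance_rate
    )
    return "adaptive" if adaptive_ready else "static"
```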
“Before the adaptive speculator learns anything, we still have the static speculator to help provide the speedup at the beginning,” Ben Athiwaratkun, AI scientist at Together AI, explained to VentureBeat. “When the adaptive speculator becomes more confident, the speed increases over time.”
The technical innovation lies in balancing acceptance rate (how often the target model agrees with the drafted tokens) against draft latency. As the adaptive model learns from traffic patterns, the controller leans more on the lightweight speculator and extends the lookahead, compounding the performance gains.
Users do not need to adjust any parameters. “On the user side, users don’t need to turn any knobs,” Dao said. “On our side, we turn these knobs for users to land on a setting that achieves good acceleration.”
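How those knobs are turned is not disclosed. One plausible feedback rule, offered purely as an assumption rather than Together AI’s method, is to draft further ahead while acceptance stays high and pull back when drafts start getting rejected:

```python
def tune_lookahead(current_k: int, acceptance_rate: float,
                   k_min: int = 1, k_max: int = 8) -> int:
    """Adjust how many tokens the speculator drafts per step (illustrative).

    High acceptance means drafted tokens are cheap wins, so speculate further
    ahead; low acceptance means wasted target-model work, so pull back.
    """
    if acceptance_rate > 0.8:
        return min(current_k + 1, k_max)
    if acceptance_rate < 0.5:
        return max(current_k - 1, k_min)
    return current_k
```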
Performance that rivals custom silicon
Together AI tests show that ATLAS achieves 500 tokens per second on DeepSeek-V3.1 when fully adapted. Most impressively, these numbers on Nvidia B200 GPUs match or exceed specialized inference chips like Groq’s custom hardware.
“Software and algorithmic improvement are able to bridge the gap with truly specialized hardware,” Dao said. “We were seeing 500 tokens per second on these huge models that are even faster than some custom chips.”
The 400% inference speedup the company claims represents the cumulative effect of Together’s Turbo optimization suite. FP4 quantization delivers an 80% speedup over the FP8 baseline. The static Turbo speculator adds another 80-100% gain. The adaptive system sits on top of both. Each optimization compounds the benefits of the others.
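Those gains multiply rather than add. A quick back-of-the-envelope check using the figures quoted above (not Together AI’s official accounting):

```python
# Rough compounding of the quoted gains; illustrative, not official figures.
fp4_gain = 1.8          # ~80% speedup over the FP8 baseline
static_spec_gain = 1.9  # midpoint of the 80-100% gain from the static Turbo speculator
print(f"{fp4_gain * static_spec_gain:.1f}x before adaptation")  # ~3.4x
# The adaptive speculator's additional acceptance lift on top of this is what
# pushes the cumulative figure toward the quoted 400% number.
```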
Compared to standard inference engines like vLLM or Nvidia’s TensorRT-LLM, the improvement is substantial. Together AI benchmarks against the stronger of the two baselines for each workload before applying speculative optimizations.
The memory-computation tradeoff explained
Performance gains come from exploiting a fundamental inefficiency in modern inference: wasted computational power.
Dao explained that normally during inference, much of the computational power is not fully utilized.
“During inference, which is actually the dominant workload nowadays, you mainly use the memory subsystem,” he said.
Speculative decoding trades idle computation for reduced memory access. When a model generates one token at a time, it becomes memory bound. The GPU is idle while waiting for memory. But when the speculator proposes five tokens and the target model checks them simultaneously, computation utilization increases while memory access remains approximately constant.
“The total amount of computation to generate five tokens is the same, but you only needed to access the memory once instead of five times,” Dao said.
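A rough model makes the quote concrete: if every decoding step has to stream the model’s weights from GPU memory, the weight traffic dominates the step time, and verifying several drafted tokens in one pass amortizes that traffic. The numbers below are illustrative assumptions, not measurements of DeepSeek-V3.1 or B200 hardware.

```python
# Illustrative, not measured: why batched verification helps a memory-bound decoder.
weight_bytes = 700e9    # assume a ~700 GB model resident in low precision
hbm_bandwidth = 8e12    # assume ~8 TB/s of aggregate memory bandwidth

# One-token-at-a-time decoding: every token pays a full weight read.
time_per_token_sequential = weight_bytes / hbm_bandwidth

# Speculative verification: five drafted tokens share one weight read.
k = 5
time_per_token_speculative = (weight_bytes / hbm_bandwidth) / k

print(f"sequential : {time_per_token_sequential * 1e3:.1f} ms/token")
print(f"speculative: {time_per_token_speculative * 1e3:.1f} ms/token (ideal, all accepted)")
```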
Think of it as a smart cache for AI
For infrastructure teams familiar with traditional database optimization, adaptive speculators work like an intelligent caching layer, but with a crucial difference.
Traditional caching systems like Redis or memcached require exact matches. You store a query result and retrieve it only when that exact query is executed again. Adaptive speculators work differently.
“You can see this as a clever way of caching, not by caching exactly, but by discovering some patterns you see,” Dao explained. “Broadly speaking, we’re observing that you’re working with similar code, or, you know, controlling computation in a similar way. We can then predict what the big model will say. We just get better and better at predicting it.”
Instead of storing exact answers, the system learns patterns from how the model generates tokens. It recognizes that if you are editing Python files in a specific codebase, certain token sequences become more likely. The speculator adapts to these patterns, improving its predictions over time without needing identical data.
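To make the contrast concrete, here is a toy comparison (not ATLAS code): an exact-match cache only helps on a literal repeat of a key, while a tiny bigram counter, a crude stand-in for an adaptive speculator, keeps helping on similar-but-new inputs.

```python
from collections import Counter, defaultdict

# Exact-match cache: only a literal repeat of the key helps.
exact_cache = {("SELECT", "*", "FROM", "users"): "cached result"}
print(exact_cache.get(("SELECT", "*", "FROM", "user"), "miss"))  # near-miss key -> no help

# Pattern learner: counts which token tends to follow each token, so
# similar-but-unseen inputs still benefit from what live traffic taught it.
ngram = defaultdict(Counter)

def observe(tokens):
    for i in range(len(tokens) - 1):
        ngram[tokens[i]][tokens[i + 1]] += 1

def predict_next(token):
    return ngram[token].most_common(1)[0][0] if ngram[token] else None

observe(["def", "parse", "(", "path", ")", ":"])
observe(["def", "load", "(", "path", ")", ":"])
print(predict_next("def"))  # a plausible next token learned from traffic
print(predict_next("("))    # -> "path", even for a function it never saw
```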
Use cases: RL training and evolving workloads
Two business scenarios particularly benefit from adaptive speculators:
Reinforcement Learning Training: Static speculators quickly fall out of alignment as policy evolves during training. ATLAS continuously adapts to changes in policy distribution.
Evolving workloads: As companies discover new AI use cases, the composition of the workload changes. “Maybe they started using AI for chatbots, but then they realized, hey, it can write code, so they started switching to code,” Dao said. “Or they realize that these AIs can actually call up tools and control computers and do accounting and things like that.”
In a coding session, the adaptive system can specialize for the specific codebase being edited, even for files never seen during training. This further increases acceptance rates and decoding speed.
What this means for businesses and the inference ecosystem
ATLAS is now available on Together AI’s dedicated endpoints as part of the platform at no additional cost. The company’s more than 800,000 developers (up from 450,000 in February) have access to the optimization.
But the broader implications go beyond a supplier’s product. The shift from static to adaptive optimization represents a fundamental rethinking of how inference platforms should work. As companies deploy AI across domains, the industry will need to move beyond once-trained models toward systems that continually learn and improve.
Together AI has historically released some of its research techniques as open source and collaborated on projects like vLLM. Although the fully integrated ATLAS system is proprietary, some of the underlying techniques may eventually influence the broader inference ecosystem.
For companies looking to lead in AI, the message is clear: adaptive algorithms on commodity hardware can match custom silicon at a fraction of the cost. As this approach matures across the industry, software optimization increasingly outperforms specialized hardware.
