
SSD: The Parallel Speculative Decoding Engine That Eliminates Drafting Overhead


Bright Coding

Author


What if every token your LLM generated came with zero drafting delay? Not faster drafting, but drafting that never sits on the critical path. While the rest of the AI world obsesses over bigger models and longer contexts, a quiet revolution is happening in inference optimization. And the most shocking part? It doesn't sacrifice a single bit of accuracy.

If you've deployed large language models in production, you know the excruciating pain: users staring at loading spinners, token-by-token generation that feels like watching paint dry, and the brutal economics of GPU clusters sitting partially idle while draft models sequentially guess and verify. Traditional speculative decoding promised relief but delivered only incremental gains—because drafting and verification still happened in lockstep, on the same hardware, one after another.

Enter SSD (Speculative Speculative Decoding): a lightweight inference engine that breaks this sequential bottleneck by running drafting and verification in parallel on distinct hardware. The small model doesn't just guess the next tokens; it anticipates the likely verification outcomes and speculates for each of those branches simultaneously. When one of its predictions hits, the next draft is already waiting and that step incurs no drafting delay. The output stays exact either way.

This isn't theoretical. SSD is working code built by Tanishq Kumar with Tri Dao and Avner May, accepted at ICLR 2026, and already benchmarked on H100 clusters with Llama-3 and Qwen3 models. In this deep dive, you'll see where synchronous speculative decoding leaves performance on the table, and how to try the parallel alternative yourself.

What is SSD? Unpacking the Speculative Speculative Decoding Revolution

SSD is a new exact LLM inference algorithm that reimagines speculative decoding from the ground up. Created by researcher Tanishq Kumar alongside Tri Dao (creator of FlashAttention) and Avner May, this isn't an incremental tweak—it's a fundamental architectural shift in how draft and target models collaborate during generation.

Traditional speculative decoding (SD) follows a simple but limiting pattern: a small, fast draft model generates candidate token sequences, then the large, slow target model verifies them in a single forward pass. The catch? These phases execute sequentially on identical hardware. The draft model sits idle during verification; the target model sits idle during drafting. GPUs burn memory and power while achieving only partial utilization.
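The draft-then-verify loop described above can be sketched with toy models. This is a minimal illustration under greedy decoding, not code from the SSD repository: `target_next` and `draft_next` are hypothetical stand-ins for the two models, and verification is simulated with one call per position rather than a single batched forward pass.

```python
from typing import Callable, List

Model = Callable[[List[int]], int]  # maps a token prefix to the next token (greedy)

def target_next(prefix: List[int]) -> int:
    """Hypothetical stand-in for the large target model."""
    return (sum(prefix) * 31 + len(prefix)) % 50

def draft_next(prefix: List[int]) -> int:
    """Hypothetical draft model: wrong whenever the prefix length is a multiple of 4."""
    guess = target_next(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 50

def autoregressive(model: Model, prompt: List[int], n: int) -> List[int]:
    """Plain token-by-token generation, used as the reference output."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(model(seq))
    return seq[len(prompt):]

def sync_speculative(prompt: List[int], n: int, k: int) -> List[int]:
    """Synchronous speculative decoding: draft k tokens, then verify them."""
    seq = list(prompt)
    while len(seq) - len(prompt) < n:
        # Phase 1: the draft model proposes k tokens, one at a time.
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # Phase 2: the target checks each proposal. A real engine scores all
        # k positions in one forward pass; the toy target is called per position.
        accepted: List[int] = []
        for tok in draft:
            expected = target_next(seq + accepted)
            if tok != expected:
                accepted.append(expected)  # the target's token replaces the miss
                break
            accepted.append(tok)
        else:
            # Every draft token matched, so the target contributes one extra token.
            accepted.append(target_next(seq + accepted))
        seq.extend(accepted)
    return seq[len(prompt):len(prompt) + n]
```

Because every token that enters the sequence is one the target itself would have chosen, `sync_speculative([1, 2, 3], 40, 6)` returns the same list as `autoregressive(target_next, [1, 2, 3], 40)`. The two phases still run one after the other, which is the bottleneck the rest of the article is about.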

SSD obliterates this sequential dependency through parallel speculative decoding on distinct hardware. Here's the breakthrough: instead of drafting one sequence and hoping verification succeeds, the draft model pre-computes speculations for all likely verification outcomes across multiple branches. It speculates speculatively—hence the name. If the target model's verification aligns with any pre-computed branch, that result returns immediately without any drafting overhead whatsoever.
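One way to picture the multi-branch idea is a cache of pre-computed drafts keyed by the verification outcome. The sketch below is a toy simulation, not the repository's implementation: the key shape (number of accepted tokens, next target token), the way `f` limits candidates per position, and all function names are assumptions made for illustration. The two steps marked as concurrent run sequentially here.

```python
from typing import Dict, List, Tuple

def target_next(prefix: List[int]) -> int:
    """Hypothetical stand-in for the target model (greedy)."""
    return (sum(prefix) * 31 + len(prefix)) % 50

def draft_ranked(prefix: List[int]) -> List[int]:
    """Hypothetical draft model returning its top-3 guesses, best first."""
    t = target_next(prefix)
    if len(prefix) % 7 == 0:
        return [(t + 1) % 50, (t + 2) % 50, (t + 3) % 50]  # truth not in the list
    if len(prefix) % 4 == 0:
        return [(t + 1) % 50, t, (t + 2) % 50]             # truth ranked second
    return [t, (t + 1) % 50, (t + 2) % 50]                 # truth ranked first

def draft_chain(prefix: List[int], k: int) -> List[int]:
    """Draft k tokens by following the draft model's top guess."""
    out: List[int] = []
    for _ in range(k):
        out.append(draft_ranked(prefix + out)[0])
    return out

def verify(seq: List[int], draft: List[int]) -> Tuple[int, int]:
    """Return (accepted draft tokens, the target's next token after them)."""
    j = 0
    while j < len(draft) and draft[j] == target_next(seq + draft[:j]):
        j += 1
    return j, target_next(seq + draft[:j])

def build_branch_cache(seq: List[int], draft: List[int], k: int, f: int
                       ) -> Dict[Tuple[int, int], List[int]]:
    """Pre-compute the next draft for guessed verification outcomes."""
    cache: Dict[Tuple[int, int], List[int]] = {}
    for j in range(len(draft) + 1):
        prefix = seq + draft[:j]
        for tok in draft_ranked(prefix)[:f]:
            if j < len(draft) and tok == draft[j]:
                continue  # that token would have been accepted, not a correction
            cache[(j, tok)] = draft_chain(prefix + [tok], k)
    return cache

def ssd_generate(prompt: List[int], n: int, k: int, f: int) -> Tuple[List[int], float]:
    seq = list(prompt)
    draft = draft_chain(seq, k)
    hits = rounds = 0
    while len(seq) - len(prompt) < n:
        # These two steps would run concurrently on separate hardware.
        cache = build_branch_cache(seq, draft, k, f)
        j, tok = verify(seq, draft)
        seq = seq + draft[:j] + [tok]
        rounds += 1
        if (j, tok) in cache:
            draft = cache[(j, tok)]     # hit: the next draft is already waiting
            hits += 1
        else:
            draft = draft_chain(seq, k)  # miss: draft again before verifying
    return seq[len(prompt):len(prompt) + n], (hits / rounds if rounds else 0.0)

def target_only(prompt: List[int], n: int) -> List[int]:
    """Reference output from the target model alone."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(target_next(seq))
    return seq[len(prompt):]
```

In this toy, a cached draft is identical to the one a fresh drafting pass would produce, so a hit changes only when the draft becomes available, never which tokens are emitted. Raising `f` can only add cache entries, so the hit rate never drops as `f` grows, at the cost of more draft compute per round.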

The literary epigraph on the repository captures this beautifully: Jorge Luis Borges's "Garden of Forking Paths" describes a labyrinth where all possible choices exist simultaneously. SSD operationalizes this metaphor in silicon—every potential token path is explored in parallel, and reality simply selects the matching branch.

Why it's trending now: With LLM deployment costs dominating AI infrastructure spend, inference optimization has become the critical battleground. SSD arrives at the perfect moment—offering exact (not approximate) results with dramatic speedups, compatible with modern optimizations like PagedAttention and CUDA graphs. The repository has accumulated significant attention since its ICLR 2026 acceptance, with engineers particularly excited about its clean implementation and real-world benchmark results.

Key Features: The Technical Arsenal Behind SSD's Speed

SSD isn't merely a research prototype—it's a production-grade inference engine packing serious engineering firepower. Let's dissect what makes this system tick.

Parallel Speculative Decoding Core: The heart of SSD is its asynchronous execution model. While traditional SD interleaves drafting and verification on shared GPUs, SSD assigns dedicated hardware to each phase. The draft model runs on its own GPU(s), continuously generating multi-branch speculations. The target model runs separately, consuming these pre-computed branches during verification. When verification matches a branch, the next draft is already available, so that step incurs no drafting latency.

Exact, Lossless Generation: Like standard speculative decoding, and unlike approximate acceleration methods that trade quality for speed, SSD preserves the target model's outputs. With greedy decoding (--temp 0, as in the benchmark commands in this article) that means the same tokens autoregressive generation would produce. The second "speculative" in the name refers to speculating about the outcome of verification, not to approximating the target distribution. You get the same tokens, just faster.

Multi-Model Family Support: Out-of-the-box compatibility with Qwen3 and Llama3 model families, covering the most widely deployed open-weight architectures. The codebase is architected for clean extension to additional models.

Tensor Parallelism: Scales across multiple GPUs with sophisticated tensor parallelism strategies, enabling 70B parameter models to run efficiently on modest clusters.
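As a rough picture of what tensor parallelism does to a single layer, the sketch below splits a weight matrix by columns across simulated devices and concatenates the partial results. It is a pure-Python illustration with hypothetical helper names. It assumes the output width divides evenly by the shard count and says nothing about how SSD actually shards its layers.

```python
from typing import List

Matrix = List[List[int]]  # d_in rows, each with d_out columns

def matvec(x: List[int], w: Matrix) -> List[int]:
    """Compute x @ w for a single input vector."""
    d_out = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(d_out)]

def split_columns(w: Matrix, n_shards: int) -> List[Matrix]:
    """Give each simulated device a contiguous slice of the output columns."""
    d_out = len(w[0])
    if d_out % n_shards:
        raise ValueError("output width must divide evenly across shards")
    step = d_out // n_shards
    return [[row[s * step:(s + 1) * step] for row in w] for s in range(n_shards)]

def column_parallel_matvec(x: List[int], w: Matrix, n_shards: int) -> List[int]:
    """Each shard computes its slice of the output; slices are then concatenated."""
    partials = [matvec(x, shard) for shard in split_columns(w, n_shards)]
    return [value for part in partials for value in part]
```

Each shard holds only `1 / n_shards` of the weights, which is why a model too large for one GPU can still be served by several.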

Production Optimizations Stack: SSD integrates the modern inference optimization toolkit:

  • PagedAttention: Efficient KV-cache memory management preventing fragmentation
  • CUDA Graphs: Minimizing CPU launch overhead for repeated operations
  • Torch Compilation: JIT optimization of compute kernels
  • Prefix Caching: Reusing computations for shared prompt prefixes
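To make the prefix-caching bullet concrete, here is a toy block-level cache: it records which token prefixes already have cached state and reports how much of a new prompt can be reused. The class name, block size, and string placeholders are hypothetical. A real engine stores KV tensors in paged GPU memory rather than strings.

```python
from typing import Dict, List, Tuple

class PrefixCache:
    """Toy prefix cache that tracks reusable prompt prefixes in fixed-size blocks."""

    def __init__(self, block_size: int = 4) -> None:
        self.block_size = block_size
        # Maps a full token prefix (ending on a block boundary) to a placeholder
        # standing in for that block's cached keys and values.
        self.blocks: Dict[Tuple[int, ...], str] = {}

    def insert(self, tokens: List[int]) -> None:
        """Register every complete block of a processed prompt."""
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            start = end - self.block_size
            self.blocks.setdefault(tuple(tokens[:end]), f"kv[{start}:{end}]")

    def match(self, tokens: List[int]) -> int:
        """Return how many leading tokens of a new prompt are already cached."""
        n = 0
        while (n + self.block_size <= len(tokens)
               and tuple(tokens[:n + self.block_size]) in self.blocks):
            n += self.block_size
        return n
```

With a 10-token shared system prompt and a block size of 4, a second request reuses the first 8 tokens and recomputes only the remainder.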

Comprehensive Benchmarking Infrastructure: The bench/ directory provides rigorous evaluation across four datasets (HumanEval, Alpaca, and others), with built-in comparisons against autoregressive baselines, synchronous speculative decoding, SGLang, and vLLM backends.

Interactive Chat Interface: Real-time streaming chat with metrics output—token counts, generation speed, and Time-To-First-Token (TTFT)—enabling intuitive performance validation.
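The three metrics named above can be computed from any token stream. The helper below is a generic sketch, not the code behind the chat interface: `measure_stream` and its return keys are hypothetical names, and the clock is injectable so the logic can be checked without real timing.

```python
import time
from typing import Callable, Dict, Iterable

def measure_stream(tokens: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> Dict[str, float]:
    """Consume a token stream and report count, TTFT, and tokens per second."""
    start = clock()
    first_token_at = None
    count = 0
    for _ in tokens:
        now = clock()
        if first_token_at is None:
            first_token_at = now  # time-to-first-token is measured here
        count += 1
    end = clock()
    elapsed = end - start
    return {
        "tokens": count,
        "ttft_s": (first_token_at - start) if first_token_at is not None else float("nan"),
        "tokens_per_s": count / elapsed if elapsed > 0 else 0.0,
    }
```

TTFT is dominated by prompt processing, while tokens per second reflects steady-state decoding, so the two are worth reading separately when comparing decoding modes.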

Use Cases: Where SSD Transforms LLM Inference

1. High-Throughput Production APIs

Serving millions of requests daily? SSD's parallel architecture means draft model GPUs never stall waiting for verification. Your throughput ceiling rises dramatically because hardware utilization approaches theoretical maximums. For API providers charging per token, this directly improves margins without quality degradation.

2. Real-Time Interactive Applications

Chatbots, coding assistants, and creative writing tools demand sub-100ms token latency for fluid user experience. Traditional SD's sequential drafting creates perceptible pauses—especially with cold starts. SSD's branch prediction means common token sequences flow without drafting interruption, delivering genuinely streaming-like responsiveness.

3. Multi-Tenant GPU Clusters

In shared infrastructure where GPU allocation fragments across users, SSD's decoupled draft/target hardware enables better scheduling. A small GPU pool handles drafting for multiple target models; larger GPUs focus purely on verification. Resource allocation becomes more flexible and economically efficient.

4. Research and Model Evaluation

When benchmarking new architectures or evaluating across diverse datasets, SSD's exact output guarantee is crucial. You can accelerate evaluation several-fold (the FAQ in this article puts SSD at 2-3x over synchronous SD, which is itself 2-3x over autoregressive) without wondering if speedups come from approximation artifacts. The built-in --all benchmark flag across four datasets ensures robust, distribution-aware performance characterization.

5. Cost-Constrained Startups

For teams running 70B models on limited hardware, SSD's efficiency translates to fewer GPUs for equivalent performance or better performance on existing hardware. The ability to run Llama-3.1 70B with a 1B draft model on 5 GPUs (4 target + 1 draft) versus 4 GPUs synchronous, while achieving superior speed, is a genuine economic unlock.

Step-by-Step Installation & Setup Guide

Ready to benchmark SSD yourself? The setup is streamlined for modern Python workflows, though the H100 requirement means you'll need serious hardware or cloud credits.

Prerequisites

System Requirements:

  • Python 3.11+
  • CUDA >= 12.8
  • NVIDIA H100 GPUs (tested hardware; A100/L40/4090 may work with arch adjustments)
  • Sufficient GPU memory for your target model (70B parameters need multiple GPUs)
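A quick back-of-the-envelope check shows why a 70B model needs several GPUs. The sketch assumes 2 bytes per parameter (16-bit weights) and 80 GB of memory per H100. It counts weights only, so KV cache and activations come on top, and the function names are illustrative.

```python
import math

def weight_memory_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Memory needed for the weights alone, in GB (10^9 bytes)."""
    return params_billion * 1e9 * bytes_per_param / 1e9

def min_gpus_for_weights(params_billion: float, gpu_memory_gb: float = 80.0,
                         bytes_per_param: int = 2) -> int:
    """Smallest GPU count whose combined memory holds the weights."""
    return math.ceil(weight_memory_gb(params_billion, bytes_per_param) / gpu_memory_gb)

if __name__ == "__main__":
    total = weight_memory_gb(70)
    print(f"70B weights at 16-bit: {total:.0f} GB")
    print(f"Minimum 80 GB GPUs for weights alone: {min_gpus_for_weights(70)}")
    print(f"Per-GPU share with 4-way tensor parallelism: {total / 4:.0f} GB")
```

Under these assumptions the weights take 140 GB, so two 80 GB GPUs is the floor. The 4-GPU target configuration used in this article leaves roughly 45 GB per GPU for KV cache and activations.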

Install UV Package Manager

SSD uses uv for fast, reliable dependency management:

# Install uv if not present
curl -LsSf https://astral.sh/uv/install.sh | sh

# Ensure uv is in PATH for current shell
export PATH="$HOME/.local/bin:$PATH"

Clone and Configure SSD

# Clone the repository
git clone https://github.com/tanishqkumar/ssd && cd ssd

# Install core dependencies
uv sync

# Optional: add script dependencies for model/dataset downloading
# uv sync --extra scripts

# Activate the virtual environment
source .venv/bin/activate

# Verify installation
python -c "from ssd import LLM; print('ok')"

Environment Configuration

Critical: these paths must be set correctly before any operations:

# HuggingFace hub directory containing models--org--name/ subdirectories
# Example: /data/huggingface/hub  (NOT /data/huggingface/)
export SSD_HF_CACHE=/path/to/huggingface/hub

# Directory containing dataset subdirectories: humaneval/, alpaca/, etc.
export SSD_DATASET_DIR=/path/to/processed_datasets

# GPU architecture: 9.0=H100, 8.0=A100, 8.9=L40/4090
export SSD_CUDA_ARCH=9.0
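Since a wrong path is the easiest way to lose time here, a small preflight check can catch mistakes before a long benchmark run. This helper is a hypothetical sketch, not part of the SSD repository. It relies only on the three variables and the `models--org--name/` layout described above, and it treats the three listed architectures as the known values.

```python
import os
from pathlib import Path
from typing import List, Mapping

# Architecture values listed in the article; other values may also be valid.
KNOWN_ARCHS = {"9.0": "H100", "8.0": "A100", "8.9": "L40/4090"}

def check_ssd_env(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return a list of human-readable problems; an empty list means all clear."""
    problems: List[str] = []

    hf_cache = env.get("SSD_HF_CACHE")
    if not hf_cache:
        problems.append("SSD_HF_CACHE is not set")
    elif not any(Path(hf_cache).glob("models--*")):
        problems.append(
            f"SSD_HF_CACHE={hf_cache} has no models--*/ subdirectories "
            "(it should point at the hub/ directory itself)"
        )

    dataset_dir = env.get("SSD_DATASET_DIR")
    if not dataset_dir:
        problems.append("SSD_DATASET_DIR is not set")
    elif not Path(dataset_dir).is_dir():
        problems.append(f"SSD_DATASET_DIR={dataset_dir} is not a directory")

    arch = env.get("SSD_CUDA_ARCH")
    if arch not in KNOWN_ARCHS:
        problems.append(f"SSD_CUDA_ARCH={arch!r} is not one of {sorted(KNOWN_ARCHS)}")

    return problems

if __name__ == "__main__":
    for problem in check_ssd_env():
        print("WARNING:", problem)
```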

Download Models and Datasets

If you already have models via huggingface-cli, skip to datasets—just verify SSD_HF_CACHE points correctly.

# Requires scripts extra: uv sync --extra scripts

# Download Llama models (respects SSD_HF_CACHE)
python scripts/download_from_hf.py llama

# Configure dataset cache location
export HF_DATASETS_CACHE=/path/to  # parent directory of SSD_DATASET_DIR

# Download and process datasets (10K samples)
python scripts/get_data_from_hf.py --num-samples 10000

Pro tip: Always use python -O for benchmarks and chat—this disables Python debug assertions that add overhead in tight inference loops.

REAL Code Examples: SSD in Action

Let's examine actual code patterns from the SSD repository, with detailed explanations of what each accomplishes.

Example 1: Autoregressive Baseline Benchmark

This establishes your performance floor before enabling speculative decoding:

cd bench

# Autoregressive generation with Llama-3 70B on 4 GPUs
# --b 1: batch size 1 (single sequence)
# --temp 0: greedy decoding (deterministic)
# --numseqs 128: 128 prompts per dataset
# --output_len 512: generate 512 tokens per prompt
# --all: evaluate across all four benchmark datasets
python -O bench.py \
    --llama \
    --size 70 \
    --gpus 4 \
    --b 1 \
    --temp 0 \
    --numseqs 128 \
    --output_len 512 \
    --all

What's happening: This runs vanilla autoregressive generation—each token computed sequentially by the full 70B model. The 4 GPUs operate in tensor-parallel mode, splitting layer computations. The --all flag ensures you see performance across HumanEval (code), Alpaca (instruction), and other distributions—critical because predictability varies by domain. This baseline typically achieves ~15-20 tokens/second for 70B models.

Example 2: Synchronous Speculative Decoding

Traditional SD: draft and verify sequentially on shared hardware:

# Sync spec decode: 70B target + 1B draft, 4 GPUs total
# --spec: enable speculative decoding
# --k 6: draft 6 tokens ahead before verification
python -O bench.py \
    --llama \
    --size 70 \
    --gpus 4 \
    --spec \
    --k 6 \
    --b 1 \
    --temp 0 \
    --numseqs 128 \
    --output_len 512 \
    --all

What's happening: The 1B draft model generates 6 candidate tokens; the 70B model verifies all in one forward pass. On 4 GPUs, both models share hardware: drafting pauses during verification, and verification pauses during drafting. Speedups are typically 2-3x over autoregressive, but they are limited by this sequential bottleneck and by GPU memory contention between the two models.
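To see how much a draft depth of k can buy, the speculative decoding literature uses a simple estimate: if each draft token is accepted independently with probability alpha, one verification pass yields (1 - alpha^(k+1)) / (1 - alpha) tokens on average. The sketch below evaluates that formula. The independence assumption is a simplification, and the acceptance rates printed are illustrative, not measurements from SSD's benchmarks.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens produced per verification pass with draft depth k.

    Assumes each draft token is accepted independently with probability alpha.
    The count includes the one token the target always contributes.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)

if __name__ == "__main__":
    for alpha in (0.6, 0.8, 0.9):  # illustrative acceptance rates
        print(alpha, round(expected_tokens_per_round(alpha, k=6), 2))
```

The curve flattens as k grows, since deeper drafts are rarely accepted in full, which is one reason depth alone cannot fix the sequential bottleneck.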

Example 3: Asynchronous Speculative Decoding (SSD)

The main event—parallel execution on distinct hardware:

# Async spec decode (SSD): 70B target (4 GPUs) + 1B draft (1 GPU)
# --async: enable parallel speculative decoding
# --k 7: speculation depth (tokens ahead)
# --f 3: branching factor (parallel speculation paths)
# --gpus 5: 4 for target + 1 for draft
python -O bench.py \
    --llama \
    --size 70 \
    --gpus 5 \
    --spec \
    --async \
    --k 7 \
    --f 3 \
    --b 1 \
    --temp 0 \
    --numseqs 128 \
    --output_len 512 \
    --all

What's happening: This is where SSD shines. The --async flag activates parallel speculative decoding. With --k 7 --f 3, the draft model on its dedicated GPU generates 3 parallel speculation branches, each 7 tokens deep, anticipating different verification outcomes. The 70B target on 4 GPUs consumes these branches; when verification matches any pre-computed path, zero drafting overhead occurs. The extra GPU (5 vs 4) pays for itself through eliminated idle time and higher effective throughput.

Example 4: Interactive Chat with Performance Metrics

Real-time streaming with SSD's chat interface:

cd bench

# SSD chat mode with 5 GPUs, streaming metrics
python -O chat.py \
    --ssd \
    --spec \
    --async \
    --k 7 \
    --f 3 \
    --gpus 5 \
    --metrics

What's happening: The --ssd flag selects SSD's native backend (alternatives: --sglang, --vllm). --metrics prints token count, generation speed (tokens/sec), and TTFT after each assistant response—essential for subjective quality-of-experience validation. The chat loop supports AR, sync SD, and async SD modes for direct A/B comparison.

Example 5: Backend Comparison Commands

SSD includes built-in comparisons against leading inference engines:

# SGLang with speculative decoding
python -O chat.py --sglang

# SGLang autoregressive baseline
python -O chat.py --sglang --ar

# vLLM with speculative decoding
python -O chat.py --vllm

What's happening: These launch SGLang/vLLM servers automatically and route chat through them. SSD's benchmarking infrastructure ensures fair, identical-prompt comparisons—no cherry-picking. Engineers can validate SSD's claims independently using their preferred baselines.

Advanced Usage & Best Practices

Tune --k and --f for your distribution: The optimal speculation depth (--k) and branching factor (--f) depend heavily on token predictability. Code (HumanEval) often permits deeper speculation than open-ended creative writing. Start with k=7, f=3 and grid-search for your use case.
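A grid search over depth and branching factor can be scripted by generating one benchmark command per combination. The sketch below only builds and prints the command lines, using the flags shown in Example 3. The candidate values for k and f are arbitrary placeholders, and `sweep_commands` is a hypothetical helper.

```python
import itertools
import shlex
from typing import Iterable, List

# Flags copied from the async benchmark command in Example 3.
BASE_CMD = [
    "python", "-O", "bench.py",
    "--llama", "--size", "70", "--gpus", "5",
    "--spec", "--async",
    "--b", "1", "--temp", "0",
    "--numseqs", "128", "--output_len", "512", "--all",
]

def sweep_commands(ks: Iterable[int], fs: Iterable[int]) -> List[List[str]]:
    """Build one argv list per (k, f) combination."""
    return [BASE_CMD + ["--k", str(k), "--f", str(f)]
            for k, f in itertools.product(ks, fs)]

if __name__ == "__main__":
    # Placeholder grid; adjust to your workload and draft GPU budget.
    for cmd in sweep_commands(ks=[5, 7, 9], fs=[1, 2, 3]):
        print(shlex.join(cmd))
```

Printing rather than executing keeps the sweep easy to review. Each line can then be run from the bench/ directory or handed to a job scheduler.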

Monitor branch hit rates: SSD's efficiency depends on the draft model correctly anticipating verification paths. If hit rates drop below ~60%, consider a larger draft model or reducing --f to focus computation on fewer, higher-quality branches.
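If your setup exposes a per-step hit or miss signal, a rolling window is enough to watch for the drop described above. The class below is a hypothetical sketch: how you obtain hit events from the engine is not covered here, and the 60% threshold simply mirrors the rule of thumb in the text.

```python
from collections import deque

class HitRateMonitor:
    """Rolling branch hit rate over the most recent verification steps."""

    def __init__(self, window: int = 1000, threshold: float = 0.6) -> None:
        self.events = deque(maxlen=window)
        self.threshold = threshold

    def record(self, hit: bool) -> None:
        """Record whether the latest verification matched a pre-computed branch."""
        self.events.append(bool(hit))

    @property
    def rate(self) -> float:
        """Fraction of hits in the current window (0.0 when empty)."""
        return sum(self.events) / len(self.events) if self.events else 0.0

    def needs_attention(self) -> bool:
        """True once the window is full and the hit rate is below the threshold."""
        return len(self.events) == self.events.maxlen and self.rate < self.threshold
```

Waiting for a full window before raising a flag avoids reacting to the first few steps of a request, where the rate is noisy.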

Leverage prefix caching for multi-turn: In chat applications with long contexts, prefix caching avoids recomputing shared prompt prefixes across turns. Combine with --metrics to quantify TTFT improvements.

CUDA graphs for steady-state: After warmup, CUDA graphs eliminate CPU launch overhead. Ensure your first few generations complete before measuring—reported speeds stabilize after graph capture.
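The same warmup discipline applies to any timing you do yourself. The helper below is a generic sketch, not part of SSD: it runs a callable a few times untimed, then reports the median of the timed runs. `timed_runs` and its defaults are illustrative.

```python
import statistics
import time
from typing import Callable

def timed_runs(fn: Callable[[], object], warmup: int = 3, repeats: int = 10,
               clock: Callable[[], float] = time.perf_counter) -> float:
    """Return the median wall-clock time of fn() in seconds, excluding warmup."""
    for _ in range(warmup):
        fn()  # untimed: lets one-time costs such as compilation and capture finish
    samples = []
    for _ in range(repeats):
        start = clock()
        fn()
        samples.append(clock() - start)
    return statistics.median(samples)
```

The median is less sensitive than the mean to a single slow run, which matters when other jobs share the machine.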

Draft data parallel (coming soon): The roadmap includes multi-GPU draft parallelism (up to 4 devices) to prevent draft model compute saturation. For now, single-GPU draft with f <= 3 typically balances well.

Comparison with Alternatives

| Feature | SSD | SGLang SD | vLLM SD | Standard SD |
| --- | --- | --- | --- | --- |
| Draft/Verify Parallelism | ✅ Async, distinct HW | ❌ Sequential | ❌ Sequential | ❌ Sequential |
| Exact Output Guarantee | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes |
| Draft Overhead Elimination | ✅ On branch hit | ❌ Always present | ❌ Always present | ❌ Always present |
| Multi-Branch Speculation | ✅ --f parameter | ❌ Single branch | ❌ Single branch | ❌ Single branch |
| Built-in Benchmark Suite | ✅ 4 datasets | Partial | Partial | No |
| Interactive Chat + Metrics | ✅ Native | Via server | Via server | No |
| OpenAI-Compatible API | 🚧 Roadmap | ✅ Yes | ✅ Yes | Varies |
| Model Support | Llama3, Qwen3 | Broader | Broader | Varies |

Why choose SSD? When you need maximum inference efficiency with exact outputs and can dedicate separate GPU(s) for drafting. The parallel architecture fundamentally removes a bottleneck that synchronous methods cannot escape. For pure throughput-per-dollar on large clusters, SSD's hardware decoupling enables smarter scheduling.

When to prefer alternatives: If you must run everything on minimal hardware (single GPU), synchronous SD or vLLM's optimized kernels may win. If you need immediate OpenAI API compatibility, SGLang/vLLM are currently ahead—though SSD's roadmap addresses this.

FAQ

Q: Does SSD change my model's outputs? A: Absolutely not. SSD is mathematically exact. It produces identical tokens to autoregressive generation—just faster. The "speculative speculative" refers to predicting verification paths, not approximating distributions.

Q: What hardware do I actually need? A: Tested on H100s with CUDA 12.8+. Theoretically works on A100 (arch 8.0) and L40/4090 (8.9) with SSD_CUDA_ARCH adjustment, but your mileage may vary. You need at least one GPU more than your target model requires for the draft model.

Q: How much faster is SSD versus synchronous speculative decoding? A: Speedups depend on distribution predictability and branch hit rates. On favorable distributions with well-tuned --k and --f, SSD can achieve 2-3x over synchronous SD, which itself is 2-3x over autoregressive. The key win is consistent performance—no drafting stalls.

Q: Can I use my own draft model? A: The repository currently supports Llama3 and Qwen3 families with paired draft/target sizes. Check bench/bench.py for model configuration details. Custom draft models require architecture alignment with the target.

Q: Is production deployment ready? A: The inference engine is robust for research and internal deployment. OpenAI-compatible HTTP serving is on the roadmap—contribute or watch for updates if you need API compatibility immediately.

Q: Why "Speculative Speculative Decoding"—the name seems redundant? A: It's precisely descriptive! Normal SD speculates tokens; SSD speculates about which speculation branch verification will follow. The nested speculation enables parallel pre-computation.

Q: Where can I read the technical details? A: The ICLR 2026 paper provides full algorithmic description and theoretical analysis. The repository implements the exact methods described.

Conclusion: The Future of Inference is Forking, Not Sequential

SSD represents a genuine paradigm shift in how we architect LLM inference systems. By recognizing that drafting and verification need not be sequential prisoners on shared hardware, it unlocks efficiency gains that synchronous methods fundamentally cannot achieve. The Borges epigraph isn't mere literary ornament—it's the core insight: in the garden of forking token paths, why traverse one branch at a time when you can explore all simultaneously?

For engineers running production LLM services, SSD demands serious evaluation. The exact-output guarantee removes the quality-versus-speed tradeoff that has plagued approximate acceleration methods. The built-in benchmarking against SGLang and vLLM ensures you can validate claims with your own workloads. And the clean, modern codebase—built on uv, with clear environment configuration and comprehensive examples—respects your time.

The roadmap promises even more: draft data parallelism for larger branch spaces, OpenAI-compatible serving, and expanded model support including GPT-OSS and Kimi-K2.5. With ICLR 2026 recognition and active development, SSD is positioned to become a foundational tool in the inference optimization toolkit.

Don't let your GPUs idle while drafting and verification take turns. Clone the SSD repository today, run the benchmarks on your hardware, and experience what parallel speculative decoding actually delivers. The fork in the path isn't a dilemma—when you take all paths at once, every choice is correct.
