Infrastructure · February 13, 2026

8 Large Language Models on a Single NVIDIA Blackwell Server

James Withall, Co-founder
6 min read

We recently deployed 8 open-source large language models on a server equipped with 8x NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs — one model per GPU. Each GPU has 96 GB of GDDR7 memory, enough to run 70B-parameter models on a single card using 4-bit quantization.

This post covers the hardware, the models we chose, how we configured them, and how our pricing compares to other inference providers.

The Hardware

The NVIDIA RTX PRO 6000 Blackwell Server Edition is built on the GB202 chip (NVIDIA's Blackwell architecture) with 96 GB of GDDR7 memory per GPU. Key specs:

  • Architecture: Blackwell (Compute Capability 12.0)
  • Memory: 96 GB GDDR7 per GPU
  • CUDA Driver: 580.126.09
  • Server: 8 GPUs, no NVLink interconnect (each GPU runs independently)

The 96 GB memory per GPU is significant. Previous-generation professional GPUs topped out at 48 GB (RTX 6000 Ada) or required multi-GPU setups for large models. With Blackwell, a 70B-parameter model quantized to INT4 (~35 GB weights) fits comfortably on a single GPU with room left for KV cache and concurrent requests.
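
As a rough sanity check on those numbers, weight memory scales with parameter count times bytes per parameter. A minimal back-of-the-envelope sketch (ignoring KV cache and runtime overhead, which sit on top of the weights):

```python
# Back-of-the-envelope weight memory: parameters x bytes per parameter.
# Ignores KV cache and runtime overhead, which are allocated on top of this.
BYTES_PER_PARAM = {"BF16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_memory_gb(params_billions: float, precision: str) -> float:
    return params_billions * BYTES_PER_PARAM[precision]

print(weight_memory_gb(70, "INT4"))   # ~35 GB  -> fits on one 96 GB GPU
print(weight_memory_gb(70, "BF16"))   # ~140 GB -> would need multiple GPUs
print(weight_memory_gb(32, "BF16"))   # ~64 GB  -> in line with the 32B rows below
```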

The Models

We selected 8 models that cover general chat, coding, reasoning, and multilingual use cases:

| Model | Parameters | Precision | VRAM Used | Context |
|---|---|---|---|---|
| Qwen 3 32B | 32B | BF16 | ~65 GB | 32K |
| Qwen 3 30B MoE | 30B (3B active) | FP16 | ~61 GB | 32K |
| Llama 3.3 70B Instruct | 70B | AWQ INT4 | ~35 GB | 32K |
| Nemotron 70B Instruct | 70B | AWQ INT4 | ~35 GB | 32K |
| Mistral Large 123B | 123B | AWQ INT4 | ~65 GB | 32K |
| Qwen 2.5 72B Instruct | 72B | GPTQ INT4 | ~37 GB | 32K |
| Qwen 2.5 Coder 32B | 32B | BF16 | ~65 GB | 32K |
| DeepSeek R1 Distill 32B | 32B | BF16 | ~65 GB | 32K |

The 32B models run at full BF16 precision — no quantization needed. The 70B and 123B models use INT4 quantization (AWQ or GPTQ), which reduces memory by roughly 4x while preserving most of the model's quality.

The Qwen 3 30B MoE model uses a Mixture of Experts architecture with 30B total parameters but only 3B active per token, making it the fastest model in the lineup.

Infrastructure Setup

All models run on vLLM v0.15.1 with the CUDA 13.0 container image. Each model gets a dedicated GPU in its own Docker container. The key configuration (a Python sketch of these settings follows the list):

  • --enforce-eager — Runs the model in eager mode instead of capturing CUDA graphs. In our testing on Blackwell, graph capture with torch.compile currently causes a significant performance regression, so eager mode is faster on this architecture.
  • --gpu-memory-utilization 0.95 — Allocates 95% of VRAM to vLLM, leaving a small buffer for system overhead.
  • --max-model-len 32768 — 32K context window for all models.
  • --max-num-seqs 64 — Supports up to 64 concurrent sequences per GPU.
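
Taken together, those flags map onto the following vLLM settings. This is a minimal sketch using vLLM's offline Python API; in production each model runs as the OpenAI-compatible server inside its own Docker container, and the checkpoint name below is an example AWQ build rather than our exact model path.

```python
# Minimal sketch: the flags above expressed through vLLM's Python API.
# The checkpoint name is an example; our deployment launches the
# OpenAI-compatible server in Docker with the equivalent CLI flags.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct-AWQ-INT4",  # example AWQ checkpoint
    quantization="awq",
    enforce_eager=True,            # skip CUDA graph capture (faster on Blackwell today)
    gpu_memory_utilization=0.95,   # give vLLM 95% of the 96 GB card
    max_model_len=32768,           # 32K context window
    max_num_seqs=64,               # up to 64 concurrent sequences
)

outputs = llm.generate(
    ["Explain the KV cache in one paragraph."],
    SamplingParams(max_tokens=200),
)
print(outputs[0].outputs[0].text)
```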

Performance Highlights

We benchmarked all 8 models with 200-token and 500-token generation tests. Some highlights:

  • The Qwen 3 30B MoE model averaged 33.4 tokens/second — the fastest in the lineup, thanks to its Mixture of Experts architecture which only activates 3B parameters per token despite having 30B total.
  • Qwen 2.5 72B achieved 31.9 tok/s despite being a 72B-parameter model, demonstrating efficient INT4 quantization with minimal throughput penalty.
  • Both 70B models (Llama 3.3 and Nemotron) maintained a consistent ~28.5 tok/s, showing that AWQ INT4 quantization on Blackwell delivers stable single-stream performance.
  • Mistral Large 123B — the largest model at 123 billion parameters — ran at 18.5 tok/s on a single GPU. Running a 123B model on one GPU at all is only possible because of the 96 GB GDDR7 capacity.

All benchmarks were run on a single GPU per model with no batching — these are single-stream generation speeds.
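
For reference, a single-stream measurement like the ones above can be reproduced against any OpenAI-compatible endpoint. Here is a minimal sketch; the base URL, API key, and model name are placeholders, and this is not our exact benchmark harness.

```python
# Rough single-stream throughput check against an OpenAI-compatible endpoint.
# Base URL, API key, and model name are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

stream = client.chat.completions.create(
    model="llama-3.3-70b-instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Write a 500-word story about a lighthouse."}],
    max_tokens=500,
    stream=True,
)

start = time.time()
first_token_at = None
chunks = 0
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.time()
        chunks += 1
end = time.time()

# With vLLM, each content chunk corresponds to roughly one generated token.
print(f"time to first token: {first_token_at - start:.2f}s")
print(f"decode rate: {chunks / (end - first_token_at):.1f} tok/s (approx.)")
```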

Pricing Compared to Other Providers

All models are available through our Token Factory API, which is fully OpenAI-compatible. Here's how our pricing compares for models where direct comparisons exist:

Llama 3.3 70B Instruct

| Provider | Input (per 1M tokens) | Output (per 1M tokens) | Source |
|---|---|---|---|
| DeepInfra | $0.23 | $0.40 | deepinfra.com |
| Nebius (Base) | $0.13 | $0.40 | nebius.com |
| Together.ai | $0.88 | $0.88 | together.ai |
| Fireworks.ai | $0.90 | $0.90 | fireworks.ai |
| AWS Bedrock | $0.72 | $0.72 | aws.amazon.com |
| packet.ai | $0.10 | $0.10 | dash.packet.ai |

Qwen 2.5 72B Instruct

| Provider | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| DeepInfra | $0.23 | $0.40 |
| Together.ai | $0.90 | $0.90 |
| packet.ai | $0.10 | $0.10 |

Mistral Large (123B)

| Provider | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| Together.ai | $2.00 | $6.00 |
| Fireworks.ai | $3.00 | $9.00 |
| packet.ai | $0.10 | $0.10 |
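
To make the difference concrete, here is a quick back-of-the-envelope cost comparison for a hypothetical monthly workload of 10M input and 2M output tokens, using the Mistral Large prices from the table above:

```python
# Example monthly cost for a hypothetical workload, using the Mistral Large
# prices listed above (USD per 1M input tokens, USD per 1M output tokens).
PRICES = {
    "Together.ai":  (2.00, 6.00),
    "Fireworks.ai": (3.00, 9.00),
    "packet.ai":    (0.10, 0.10),
}

INPUT_M, OUTPUT_M = 10, 2  # millions of tokens per month

for provider, (inp, out) in PRICES.items():
    print(f"{provider}: ${INPUT_M * inp + OUTPUT_M * out:.2f}/month")
# Together.ai: $32.00/month, Fireworks.ai: $48.00/month, packet.ai: $1.20/month
```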

At $0.10 per million tokens for both input and output, we're significantly cheaper than every other provider on this list. But low price isn't the only differentiator:

  • Dedicated GPU per model — Your requests aren't sharing a GPU with other customers. You get consistent, predictable latency.
  • No rate limits — Concurrent request limits are set by the hardware, not by an artificial cap.
  • Enterprise infrastructure — Our servers run in managed US datacenters with a 99.9% uptime SLA.
  • Full model selection — 8 large models available simultaneously, including specialized options like Qwen 2.5 Coder 32B and DeepSeek R1 Distill 32B that are harder to find hosted elsewhere.

Models You Won't Find Everywhere

Several of our models aren't commonly available from other inference providers:

  • Qwen 3 32B — Alibaba's latest dense model with both thinking and non-thinking modes.
  • Qwen 3 30B MoE — The Mixture of Experts variant, offering lower latency than similar-quality dense models.
  • Nemotron 70B — NVIDIA's helpfulness-optimized model, fine-tuned from Llama 3.1.
  • Qwen 2.5 Coder 32B — A coding-specialist model that rivals larger models on code generation benchmarks.
  • DeepSeek R1 Distill 32B — A reasoning model distilled from DeepSeek R1, capable of chain-of-thought reasoning at a fraction of the cost of the full model.

Getting Started

All models are accessible through our OpenAI-compatible Token Factory API. You can use any OpenAI SDK or HTTP client — just point it at your packet.ai endpoint and swap in the model name.
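
Here is a minimal sketch with the official OpenAI Python SDK; the base URL, API key, and model identifier below are placeholders, so check the Token Factory documentation for the exact values.

```python
# Minimal chat completion against an OpenAI-compatible endpoint.
# Base URL, API key, and model identifier are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.packet.ai/v1",  # placeholder endpoint
    api_key="YOUR_PACKET_AI_KEY",
)

resp = client.chat.completions.create(
    model="llama-3.3-70b-instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize the Blackwell GPU architecture in two sentences."}],
    max_tokens=200,
)
print(resp.choices[0].message.content)
```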

Try them in the Token Factory Playground or integrate directly via the API.