8 Large Language Models on a Single NVIDIA Blackwell Server
We recently deployed 8 open-source large language models on a server equipped with 8x NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs — one model per GPU. Each GPU has 96 GB of GDDR7 memory, enough to run 70B-parameter models on a single card using 4-bit quantization.
This post covers the hardware, the models we chose, how we configured them, and how our pricing compares to other inference providers.
The Hardware
The NVIDIA RTX PRO 6000 Blackwell Server Edition is built on the GB202 chip (NVIDIA's Blackwell architecture) with 96 GB of GDDR7 memory per GPU. Key specs:
- Architecture: Blackwell (Compute Capability 12.0)
- Memory: 96 GB GDDR7 per GPU
- CUDA Driver: 580.126.09
- Server: 8 GPUs, no NVLink interconnect (each GPU runs independently)
The 96 GB memory per GPU is significant. Previous-generation professional GPUs topped out at 48 GB (RTX 6000 Ada) or required multi-GPU setups for large models. With Blackwell, a 70B-parameter model quantized to INT4 (~35 GB weights) fits comfortably on a single GPU with room left for KV cache and concurrent requests.
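As a rough back-of-the-envelope check (a sketch only; it ignores quantization scales, activation buffers, and runtime overhead, which add a few GB), the arithmetic looks like this:

```python
# Rough VRAM estimate for a 70B-parameter model quantized to INT4.
# Assumption: ~0.5 bytes per weight at 4 bits, ignoring per-group
# scales/zeros and runtime overhead.

params = 70e9                 # 70B parameters
bytes_per_weight = 0.5        # 4-bit quantization
weights_gb = params * bytes_per_weight / 1e9

gpu_gb = 96                   # RTX PRO 6000 Blackwell Server Edition
headroom_gb = gpu_gb - weights_gb

print(f"Weights: ~{weights_gb:.0f} GB, headroom for KV cache: ~{headroom_gb:.0f} GB")
# Weights: ~35 GB, headroom for KV cache: ~61 GB
```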
The Models
We selected 8 models that cover general chat, coding, reasoning, and multilingual use cases:
| Model | Parameters | Precision | VRAM Used | Context |
|---|---|---|---|---|
| Qwen 3 32B | 32B | BF16 | ~65 GB | 32K |
| Qwen 3 30B MoE | 30B (3B active) | FP16 | ~61 GB | 32K |
| Llama 3.3 70B Instruct | 70B | AWQ INT4 | ~35 GB | 32K |
| Nemotron 70B Instruct | 70B | AWQ INT4 | ~35 GB | 32K |
| Mistral Large 123B | 123B | AWQ INT4 | ~65 GB | 32K |
| Qwen 2.5 72B Instruct | 72B | GPTQ INT4 | ~37 GB | 32K |
| Qwen 2.5 Coder 32B | 32B | BF16 | ~65 GB | 32K |
| DeepSeek R1 Distill 32B | 32B | BF16 | ~65 GB | 32K |
The 32B models run at full BF16 precision — no quantization needed. The 70B and 123B models use INT4 quantization (AWQ or GPTQ), which reduces memory by roughly 4x while preserving most of the model's quality.
The Qwen 3 30B MoE model is a Mixture of Experts architecture with 30B total parameters but only 3B active per token, making it the fastest model in the lineup.
Infrastructure Setup
All models run on vLLM v0.15.1 with the CUDA 13.0 container image. Each model gets a dedicated GPU in its own Docker container. The key configuration:
- `--enforce-eager` — We tested CUDA graph compilation (torch.compile) on Blackwell and found it currently causes a significant performance regression. Eager mode is faster on this architecture.
- `--gpu-memory-utilization 0.95` — Allocates 95% of VRAM to vLLM, leaving a small buffer for system overhead.
- `--max-model-len 32768` — 32K context window for all models.
- `--max-num-seqs 64` — Supports up to 64 concurrent sequences per GPU.
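For reference, here is a minimal sketch of the same settings expressed through vLLM's offline Python API. The checkpoint name is illustrative, and in production each model is served from its own container rather than this in-process API, but the engine arguments map one-to-one to the flags above:

```python
# Minimal sketch: the same engine settings via vLLM's Python API.
# The checkpoint name is illustrative; production uses a containerized
# `vllm serve` deployment with the equivalent CLI flags.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",  # illustrative; we serve an AWQ INT4 build
    enforce_eager=True,            # CUDA graph compilation regresses on Blackwell
    gpu_memory_utilization=0.95,   # leave ~5% of VRAM for system overhead
    max_model_len=32768,           # 32K context window
    max_num_seqs=64,               # up to 64 concurrent sequences
)

out = llm.generate(["Hello, Blackwell!"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```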
Performance Highlights
We benchmarked all 8 models with 200-token and 500-token generation tests. Some highlights:
- The Qwen 3 30B MoE model averaged 33.4 tokens/second — the fastest in the lineup, thanks to its Mixture of Experts architecture which only activates 3B parameters per token despite having 30B total.
- Qwen 2.5 72B achieved 31.9 tok/s despite being a 72B-parameter model, demonstrating efficient INT4 quantization with minimal throughput penalty.
- Both 70B models (Llama 3.3 and Nemotron) maintained a consistent ~28.5 tok/s, showing that AWQ INT4 quantization on Blackwell delivers stable single-stream performance.
- Mistral Large 123B — the largest model at 123 billion parameters — ran at 18.5 tok/s on a single GPU. Running a 123B model on one GPU at all is only possible because of the 96 GB GDDR7 capacity.
All benchmarks were run on a single GPU per model with no batching — these are single-stream generation speeds.
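If you want to sanity-check similar numbers yourself, a minimal single-stream timing sketch with vLLM's offline API might look like the following. The model name and prompt are placeholders, and this is not the exact harness behind the figures above:

```python
# Rough single-stream generation-speed check, mirroring the 500-token test.
# Note: elapsed time includes prompt prefill, so it slightly understates
# pure decode speed.
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-Coder-32B-Instruct",  # placeholder model
    enforce_eager=True,
    gpu_memory_utilization=0.95,
    max_model_len=32768,
)

params = SamplingParams(max_tokens=500, ignore_eos=True)
start = time.time()
out = llm.generate(["Explain KV caching in transformers."], params)
elapsed = time.time() - start

n_tokens = len(out[0].outputs[0].token_ids)
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/s")
```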
Pricing Compared to Other Providers
All models are available through our Token Factory API, which is fully OpenAI-compatible. Here's how our pricing compares for models where direct comparisons exist:
Llama 3.3 70B Instruct
| Provider | Input (per 1M tokens) | Output (per 1M tokens) | Source |
|---|---|---|---|
| DeepInfra | $0.23 | $0.40 | deepinfra.com |
| Nebius (Base) | $0.13 | $0.40 | nebius.com |
| Together.ai | $0.88 | $0.88 | together.ai |
| Fireworks.ai | $0.90 | $0.90 | fireworks.ai |
| AWS Bedrock | $0.72 | $0.72 | aws.amazon.com |
| packet.ai | $0.10 | $0.10 | dash.packet.ai |
Qwen 2.5 72B Instruct
| Provider | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| DeepInfra | $0.23 | $0.40 |
| Together.ai | $0.90 | $0.90 |
| packet.ai | $0.10 | $0.10 |
Mistral Large (123B)
| Provider | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| Together.ai | $2.00 | $6.00 |
| Fireworks.ai | $3.00 | $9.00 |
| packet.ai | $0.10 | $0.10 |
At $0.10 per million tokens for both input and output, we're significantly cheaper than every other provider on this list. But low price isn't the only differentiator:
- Dedicated GPU per model — Your requests aren't sharing a GPU with other customers. You get consistent, predictable latency.
- No rate limits — Concurrent request limits are set by the hardware, not by an artificial cap.
- Enterprise infrastructure — Our servers run in managed US datacenters with a 99.9% SLA.
- Full model selection — 8 large models available simultaneously, including specialized options like Qwen 2.5 Coder 32B and DeepSeek R1 Distill 32B that are harder to find hosted elsewhere.
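To make the size of the price gap concrete, here is a quick back-of-the-envelope cost comparison for Llama 3.3 70B, using the per-million-token rates from the table above and an assumed monthly volume of 10M input and 2M output tokens (the volumes are an assumption for illustration only):

```python
# Rough monthly-cost comparison for Llama 3.3 70B.
# Rates ($/1M tokens) come from the pricing table above; the token
# volumes are an assumed example workload.
rates = {
    "packet.ai":   (0.10, 0.10),
    "DeepInfra":   (0.23, 0.40),
    "Together.ai": (0.88, 0.88),
    "AWS Bedrock": (0.72, 0.72),
}

input_m, output_m = 10, 2  # millions of tokens per month (assumed)

for provider, (in_rate, out_rate) in rates.items():
    cost = input_m * in_rate + output_m * out_rate
    print(f"{provider:12s} ${cost:.2f}/month")
# e.g. packet.ai $1.20/month vs. Together.ai $10.56/month
```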
Models You Won't Find Everywhere
Several of our models aren't commonly available from other inference providers:
- Qwen 3 32B — Alibaba's latest dense model with both thinking and non-thinking modes.
- Qwen 3 30B MoE — The Mixture of Experts variant, offering lower latency than similar-quality dense models.
- Nemotron 70B — NVIDIA's helpfulness-optimized model, fine-tuned from Llama 3.1.
- Qwen 2.5 Coder 32B — A coding-specialist model that rivals larger models on code generation benchmarks.
- DeepSeek R1 Distill 32B — A reasoning model distilled from DeepSeek R1, capable of chain-of-thought reasoning at a fraction of the cost of the full model.
Getting Started
All models are accessible through our OpenAI-compatible Token Factory API. You can use any OpenAI SDK or HTTP client — just point it at your packet.ai endpoint and swap in the model name.
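For example, with the official OpenAI Python SDK (the endpoint URL, API key, and model id below are placeholders; substitute the values from your packet.ai dashboard):

```python
# Call a packet.ai-hosted model through the OpenAI-compatible Token Factory API.
# The base URL, API key, and model id are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.packet.ai/v1",  # placeholder endpoint
    api_key="YOUR_PACKET_AI_KEY",
)

response = client.chat.completions.create(
    model="llama-3.3-70b-instruct",  # placeholder model id
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Why does 96 GB of VRAM matter for 70B models?"},
    ],
)
print(response.choices[0].message.content)
```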
Try them in the Token Factory Playground or integrate directly via the API.
