An LLM Practitioner’s Field Guide #2: First Contact

Before you download the 30 gigabytes, before you request the cluster, before you spin up the instance — you want to know if the model fits. Not roughly. Actually.

You have a GPU with a known amount of VRAM. You have a model with a known number of parameters. What you need is the bridge between those two facts: how parameters become bytes, how bytes become VRAM, and where the surprises hide.

Spoiler: the surprises are real. A 7B BF16 model fits on a 24GB GPU with room to spare — until you start serving a handful of users simultaneously, and the KV cache eats what was left. The weights didn’t change. The GPU didn’t change. But the math did, and it wasn’t in the model card.

This post is that math. Not the full story of quantization (that’s Post 5), but the baseline every practitioner needs: how dtype maps to bytes, how bytes map to VRAM, and how to run a quick sanity check before you commit to anything.


What’s actually in those bytes

Before the arithmetic, you need to understand what format those bytes are in. When a model card says “BF16 weights,” it’s telling you two things: how many bits per parameter, and how those bits are arranged. The arrangement matters more than you’d think.

A floating-point number is carved up into three fields: a sign bit (positive or negative), exponent bits (the scale: how large or small the number is overall), and mantissa bits (the precision: the significant digits within that scale). These three fields determine a format’s behavior in practice.
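You can pull these three fields out of any float yourself. A quick sketch using Python's standard struct module to split an FP32 value into its sign, exponent, and mantissa bits:

```python
import struct

def fp32_fields(x: float) -> tuple[int, int, int]:
    """Split an FP32 value into its (sign, exponent, mantissa) bit fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # raw 32-bit pattern
    sign     = bits >> 31             # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF        # 23 bits, implicit leading 1 not stored
    return sign, exponent, mantissa

s, e, m = fp32_fields(-1.5)
# -1.5 = -(1.1 in binary) * 2^0 -> sign=1, exponent=0+127, mantissa=0b100...0
print(s, e, hex(m))  # 1 127 0x400000
```

The same decomposition works for any IEEE-style format once you know the field widths; only the shift amounts and masks change.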

Here’s what they look like across the formats you’ll encounter:

FP32: 32 bits
┌─┬────────────────┬───────────────────────────────────────────────────┐
│S│   Exponent (8) │                   Mantissa (23)                   │
└─┴────────────────┴───────────────────────────────────────────────────┘
 1        8                              23                    = 32 bits

FP16: 16 bits
┌─┬───────────┬───────────────────┐
│S│  Exp (5)  │   Mantissa (10)   │
└─┴───────────┴───────────────────┘
 1     5              10          = 16 bits

BF16: 16 bits
┌─┬──────────────────┬───────────┐
│S│   Exponent (8)   │ Mant. (7) │
└─┴──────────────────┴───────────┘
 1        8                7      = 16 bits

FP8 (E4M3): 8 bits
┌─┬────────┬───────┐
│S│ Exp(4) │ M.(3) │
└─┴────────┴───────┘
 1    4        3    = 8 bits

FP8 (E5M2): 8 bits
┌─┬──────────┬─────┐
│S│  Exp (5) │M.(2)│
└─┴──────────┴─────┘
 1     5       2    = 8 bits

Look at FP16 and BF16 side by side. Both are 16 bits. They take identical memory. But the bits are split completely differently.

FP16 has 5 exponent bits and 10 mantissa bits. That 10-bit mantissa gives you roughly 3 decimal digits of precision, which is genuinely fine for inference. But 5 exponent bits means a limited range: values larger than about 65,504 overflow to infinity. During training, gradients can spike well past that. You get inf or NaN, your loss turns to garbage, and you’re debugging a numerical stability problem instead of a model problem.

BF16 copies FP32’s exponent field verbatim: 8 bits, same range as single precision. That’s the whole point. Its mantissa is only 7 bits, less precision than FP16, but you almost never care, because the stability benefit of not overflowing during training is enormous. This is why most modern LLMs are trained in BF16, and why BF16 is the default dtype you’ll see on most Hugging Face model cards today.
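The range difference falls straight out of the bit layout. A sketch computing each format's largest finite value from its exponent and mantissa widths (standard IEEE-style encoding; note that FP8 E4M3 deviates from this formula by reclaiming NaN bit patterns, so its actual max is 448 rather than what the formula gives):

```python
def max_finite(exp_bits: int, mant_bits: int) -> float:
    """Largest finite value of an IEEE-style float: (2 - 2^-m) * 2^bias."""
    bias = 2 ** (exp_bits - 1) - 1           # e.g. 127 for 8 exponent bits
    return (2 - 2 ** -mant_bits) * 2.0 ** bias

print(f"FP16: {max_finite(5, 10):.6g}")    # 65504 -- easy to overflow in training
print(f"BF16: {max_finite(8, 7):.6g}")     # ~3.39e38 -- same range as FP32
print(f"FP32: {max_finite(8, 23):.6g}")    # ~3.40e38
```

Ten extra orders of magnitude of nothing would matter; thirty-five is the difference between a gradient spike that passes through and one that turns your loss to NaN.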

FP32 is what used to be the standard: 32 bits per parameter, 4 bytes each, full single precision. You mostly encounter it now as the accumulation dtype inside matrix multiplications, or in older models that predate the widespread adoption of 16-bit training.

TF32 trips up almost everyone who first encounters it. It is not a 16-bit format. It is not a storage format at all. TF32 is a compute mode on NVIDIA Ampere and later GPUs: when you enable it, the tensor cores internally use 19 bits (1 sign + 8 exponent + 10 mantissa) for matrix multiply operations, while still reading and writing FP32 memory. Your weights are FP32 on disk and in VRAM; TF32 just changes what the multiplication hardware does with them. If you see TF32 mentioned in a training config, it’s a throughput setting, not a storage decision.

All the formats above are floating-point. The INT8, INT4, and NF4 rows in the table below are a different kind: low-bit storage formats. Their primary purpose is compression. In the most common deployment pattern — weight-only quantization — weights are stored compressed and dequantized to the compute dtype (BF16 or FP16) just before each matrix multiply; tensor cores still compute in full precision, and only the weights at rest are smaller. Some runtimes also support true integer compute paths (vLLM’s W8A8, for example, runs actual INT8 matrix multiplications with quantized activations), but for the purposes of understanding memory footprint, weight-only is what most practitioners encounter: quantization changes the storage cost, and the compute format may or may not change depending on the scheme.

INT8 stores each weight as a signed 8-bit integer: 8 bits, 1 byte per parameter. The mapping from float to integer uses per-group scale factors that record what the original value range was, so the conversion back is precise enough for most inference workloads. LLM.int8() (bitsandbytes) and SmoothQuant are the two schemes you’ll encounter most.

INT4 halves the bits again: 4 bits per weight, nominally 0.5 bytes. NF4 — Normal Float 4 — is a different 4-bit format, introduced in the QLoRA paper (Dettmers et al., 2023). INT4 spaces its 16 quantization levels uniformly across a value range. NF4 spaces them non-uniformly, concentrating more levels near zero to match the roughly bell-curve distribution of weights in trained neural networks. In practice NF4 preserves more information than INT4 at the same bit width, which is why it became the standard 4-bit format for fine-tuning pipelines.

Neither 4-bit format reaches exactly 0.5 bytes per parameter on disk. Quantization is done in groups — typically 32 to 128 consecutive weights share a scale factor stored in FP16 — and those scale factors add a few percent overhead. A “3.5 GB” 7B model in NF4 typically lands closer to 3.8–4 GB once you include them. NF4 also supports double quantization (quantizing the scale factors themselves), which recovers most of that overhead. The full mechanics of grouping, scale granularity, and double quantization are covered in Post 5.


The memory math

This is the calculation you need to be able to do in your head:

VRAM for weights = number of parameters × bytes per parameter

That’s it. The only two things to know are your parameter count and your dtype’s byte width:

Dtype          Bits per param   Bytes per param
FP32                 32                4
BF16 / FP16          16                2
INT8                  8                1
INT4 / NF4            4                0.5

Walk through a 7B model in BF16: 7 × 10⁹ parameters × 2 bytes = 14 × 10⁹ bytes = 14 GB. In FP32 that’s 28 GB. In INT8 it’s 7 GB. In 4-bit it’s 3.5 GB.

A note on units: all figures here use SI gigabytes (1 GB = 10⁹ bytes). Tools like nvidia-smi and most frameworks report in GiB (1 GiB = 2³⁰ bytes ≈ 1.07 GB), so the same 14 GB model will show as roughly 13 GiB in system output. The gap is about 7% — small enough not to change any sizing decision, but worth knowing when you’re cross-referencing these estimates against real system readouts.
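The same arithmetic as a two-line helper, including the GB-to-GiB conversion for cross-checking against nvidia-smi:

```python
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_vram(params: float, dtype: str) -> tuple[float, float]:
    """Weight memory in (SI gigabytes, binary gibibytes)."""
    total_bytes = params * BYTES_PER_PARAM[dtype]
    return total_bytes / 1e9, total_bytes / 2**30

gb, gib = weight_vram(7e9, "bf16")
print(f"7B in BF16: {gb:.1f} GB = {gib:.1f} GiB")  # 14.0 GB = 13.0 GiB
```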

Here’s the table across the model sizes you’re likely to encounter:

Model size   FP32        BF16 / FP16   INT8      INT4 / NF4
7B           ~28 GB      ~14 GB        ~7 GB     ~3.5 GB
13B          ~52 GB      ~26 GB        ~13 GB    ~6.5 GB
34B          ~136 GB     ~68 GB        ~34 GB    ~17 GB
70B          ~280 GB     ~140 GB       ~70 GB    ~35 GB
120B         ~480 GB     ~240 GB       ~120 GB   ~60 GB
240B         ~960 GB     ~480 GB       ~240 GB   ~120 GB
450B         ~1,800 GB   ~900 GB       ~450 GB   ~225 GB

These are weight-only numbers. The weights are the dominant cost when the model is sitting idle, but as soon as you start doing anything — inference, training, serving concurrent users — other memory consumers show up. We’ll get there in a moment.


Why training costs so much more

The table above is inference memory, just the model weights sitting in VRAM. The moment you try to train that model, you need to hold a lot more in memory simultaneously.

Full training is greedy. You need the weights themselves, same as inference. But you also need gradients — one per parameter, typically held in FP32 — and if you’re using Adam, two moment tensors per parameter on top of that, also in FP32. The numbers stack fast.

The following assumes the standard BF16 mixed-precision workflow, which requires Ampere or newer hardware (CC 8.0+; A100, RTX 30xx and later). On older GPUs — Volta, Turing — BF16 has no tensor-core support and FP16 AMP is the practical alternative.

Add it up: weights at 2 bytes (BF16) + gradients at 4 bytes (FP32) + two Adam moments at 4 bytes each = roughly 14 bytes per parameter. For a 7B model that’s about 98 GB before you count activations. You’re not fitting that on a single GPU without significant engineering.

The actual number often cited is 16-18 bytes per parameter for full BF16 training with FP32 optimizer states. The gap from 14 to 18 comes from one more item the 14-byte tally omits: standard mixed-precision training maintains a separate FP32 master copy of the weights alongside the BF16 compute copy. The optimizer applies updates to the master copy — where small gradient steps won’t be lost to rounding — then casts back to BF16 for the forward pass. That’s another 4 bytes per parameter, landing at 18. The 16-byte end of the range applies to setups that skip FP32 gradients and accumulate in BF16 instead, keeping only the FP32 master weights and FP32 optimizer moments. For a 7B model, 16-18 bytes/param is 112-126 GB — roughly 8x the inference weight cost.
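The per-parameter tally, written out so the 16-18 byte figure can be checked term by term (assuming the standard recipe described above: BF16 compute weights, FP32 master weights, Adam with FP32 moments):

```python
def training_bytes_per_param(bf16_grads: bool = False) -> int:
    """Per-parameter training state for BF16 mixed precision with Adam."""
    compute_weights = 2                       # BF16 copy for forward/backward
    master_weights  = 4                       # FP32 copy the optimizer updates
    gradients       = 2 if bf16_grads else 4  # FP32 grads, or BF16 accumulation
    adam_moments    = 4 + 4                   # first and second moment, FP32
    return compute_weights + master_weights + gradients + adam_moments

print(training_bytes_per_param())        # 18 bytes/param
print(training_bytes_per_param(True))    # 16 bytes/param
print(f"7B training state: {7e9 * 18 / 1e9:.0f} GB")  # 126 GB, before activations
```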

LoRA attacks the optimizer state problem. Instead of computing gradients for all 7 billion parameters, you freeze the base weights and only train a small set of low-rank adapter matrices. The adapter might be a few hundred million parameters at most, often far less. Your optimizer states shrink to match, and suddenly training a 7B model is feasible on a single 40GB GPU with some headroom.

QLoRA goes further. It keeps the frozen base weights in 4-bit storage (NF4) rather than BF16. Instead of 14 GB for the 7B base, you’re holding roughly 4 GB. You still compute in BF16: the 4-bit weights are dequantized on the fly before matrix multiplications, but the memory footprint of the base model drops by roughly 3.5x. The adapter matrices you’re actually training remain in BF16 or FP16. QLoRA is how people finetune 13B models comfortably on a single 24GB consumer GPU, and push 34B into range with careful settings — short sequences, small batch sizes, gradient accumulation.

The hierarchy: full training > LoRA > QLoRA, in order of memory cost and, roughly, in reverse order of accessibility. Full training is not a dead end — it’s where the multi-GPU playbook begins: gradient checkpointing avoids storing intermediate activations by recomputing them on the backward pass (at roughly 33% extra compute cost), and ZeRO/FSDP shards weights, gradients, and optimizer states across devices so no single GPU carries the full 18-bytes-per-parameter budget. These are the tools that make training large models tractable at all; what this post has given you is the baseline to understand what they’re working against.

On Hopper, Ada Lovelace, and Blackwell: native FP8 training

Hardware released since 2022 changed this picture. The H100 (Hopper) introduced native FP8 tensor cores; the RTX 4000 series (Ada Lovelace) followed; Blackwell expanded further to MXFP8 and FP4. On these GPUs, NVIDIA’s Transformer Engine enables a training scheme where matrix multiplications run in FP8 rather than BF16 — cutting compute weights from 2 bytes to 1.

How much memory you actually save beyond that depends on which other components stay in FP32. Master weights, gradients, and optimizer states can remain at full precision or be reduced further, producing very different memory profiles:

Component         BF16 AMP         FP8 conservative   FP8 aggressive (COAT)
Compute weights   2 bytes (BF16)   1 byte (FP8)       1 byte (FP8)
Master weights    4 bytes (FP32)   4 bytes (FP32)     2 bytes (BF16)
Gradients         4 bytes (FP32)   4 bytes (FP32)     1 byte (FP8)
Adam m1           4 bytes (FP32)   4 bytes (FP32)     1–2 bytes
Adam m2           4 bytes (FP32)   4 bytes (FP32)     1–2 bytes
Total             ~18 bytes        ~17 bytes          ~6–8 bytes

The conservative column reflects implementations that keep master weights and gradients in FP32 — including DeepSeek V3’s published recipe. The per-parameter weight memory saving is modest: one byte per parameter saved. The larger wins in this regime are activation memory (activations are also FP8 during the forward pass, which dominates at large batch sizes) and throughput: roughly 35–40% faster training on H100s compared to BF16 AMP. One implementation detail further reduces the actual savings regardless of recipe: FP8’s limited dynamic range requires per-tensor scaling factors — additional metadata for each tensor that adds a small but real overhead on top of the raw bit savings. On Blackwell hardware using the MXFP8 format specifically, Transformer Engine additionally maintains a transposed copy of the quantized weights to accelerate the backward pass, adding further overhead; this is a format-specific behavior, not a property of FP8 training in general.

The aggressive column represents COAT and similar research that also quantizes optimizer states, reaching roughly 6–8 bytes per parameter in training state. That looks like a 2–3× reduction on paper relative to BF16’s ~18 bytes, but COAT reports a ~1.5× end-to-end memory reduction — because parameter states are only part of total training memory. Activations, gradient communication buffers, and framework overhead make up the remainder and don’t compress as dramatically. This is active research rather than a standard production recipe.

The recipe differences above trace partly to a detail in the format itself: FP8 has two subtypes — E4M3 (higher precision) and E5M2 (higher range). Some recipes, including Transformer Engine’s default, use E4M3 for weights and activations and E5M2 for gradients on the theory that gradients need more dynamic range. Others, including DeepSeek V3’s published recipe, use E4M3 throughout. It’s a tunable choice, not a requirement — which is why the conservative column above doesn’t specify a format split.

At frontier scale the combination is decisive: DeepSeek V3, a 671B MoE model (671B total parameters, ~37B active per forward pass), was trained natively in FP8 and ships only FP8 weights (a BF16 conversion script is provided for experimentation). Qwen 3.5’s infrastructure blog similarly describes a native FP8 training pipeline.

A terminology distinction that matters here: “FP8 weights” on a model card can mean two very different things. The FP8 model checkpoints you find on Hugging Face — including Qwen’s own FP8 releases — are typically post-hoc quantized artifacts: a BF16-trained model compressed to FP8 after training, the same process as any other quantization. That is different from native FP8 training, where the model was built in FP8 from the start. The infrastructure blog above describes the latter; the FP8 checkpoint you download is the former. When you see an FP8 model card, assume post-hoc quantization unless the documentation explicitly states the model was trained natively in FP8. True native FP8 training is currently limited to labs operating at frontier scale on Hopper-class or newer hardware.


The KV cache: the thing that surprises everyone

You’ve loaded your 7B model in BF16. It’s using 14 GB. You have 24 GB. You run a few short prompts and everything is fine. Then you try to serve it to multiple users with longer contexts, and suddenly you’re out of memory again, even though the model didn’t change.

The culprit is the KV cache.

Transformers work by computing attention: each token attends to every previous token. To avoid recomputing those previous tokens’ key and value matrices on every generation step, the runtime caches them. These cached tensors are the KV cache, and they’re separate from the model weights entirely.

The cache scales with everything that makes a model useful. More layers, more tensors to cache. Longer context, more tokens per sequence. More concurrent users, more sequences in flight simultaneously. The shape of each cached element is set by the number of KV heads times the head dimension, and the byte cost per element is 2 for BF16, 1 if you’ve enabled KV cache quantization. Written out as a formula:

num_layers × num_kv_heads × head_dim × 2 (K and V) × sequence_length × batch_size × bytes_per_element

For Qwen1.5-7B — a January 2024 model from before GQA became standard (see below), with an architecture simple enough to follow: 32 uniform layers, 32 KV heads, head dimension 128 — in BF16, with a 4096-token context and a batch of 8 concurrent users:

32 × 32 × 128 × 2 × 4096 × 8 × 2 bytes ≈ 17.2 GB

That’s more than the entire weight VRAM for the same model — 14 GB of weights, 17 GB of cache — and that’s with a modest context and a small batch. Scale to 32K context or 64 concurrent users and the cache dwarfs the weights entirely.

This is why “will this model fit?” is never just a weight question when you’re deploying. The KV cache is the hidden tax on inference, and it scales with exactly the things that make inference useful: longer prompts and more users.

Some runtimes (vLLM in particular) let you quantize the KV cache to INT8, cutting that 17.2 GB roughly in half. That’s a separate knob from weight quantization: you can have BF16 weights with INT8 KV cache, or 4-bit weights with BF16 KV cache. They’re independent decisions.

A note on modern models and GQA. The 17.2 GB figure above reflects a dense architecture where every attention layer keeps a full set of KV heads — one per query head, 32 total. That was standard in 2023. In May 2023 a Google Research paper introduced Grouped Query Attention (GQA): instead of one KV pair per query head, multiple query heads share a single KV pair, so the cache only stores one entry per group. The saving is proportional to how aggressively the heads are grouped.

Nearly every major model released since late 2023 uses GQA. Qwen2.5-7B — the direct successor to Qwen1.5-7B — uses 4 KV heads instead of 32, across 28 layers. Same workload:

28 × 4 × 128 × 2 × 4096 × 8 × 2 bytes ≈ 1.9 GB

From 17.2 GB to 1.9 GB — a 9× reduction — for the same context length and the same number of concurrent users. Weight memory is essentially unchanged (both models are ~15 GB in BF16); only the KV cache shrinks. When you read num_key_value_heads in a config.json and it’s smaller than num_attention_heads, that’s GQA at work. Post 3 walks through exactly those fields.
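Both figures drop straight out of the formula. A sketch, with the config values for each model taken from the text above:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, dtype_bytes: int = 2) -> int:
    """KV cache size: layers x kv_heads x head_dim x 2 (K and V)
    x tokens per sequence x sequences in flight x bytes per element."""
    return layers * kv_heads * head_dim * 2 * seq_len * batch * dtype_bytes

dense = kv_cache_bytes(32, 32, 128, 4096, 8)  # Qwen1.5-7B: no GQA
gqa   = kv_cache_bytes(28, 4, 128, 4096, 8)   # Qwen2.5-7B: 4 KV heads
print(f"dense: {dense / 1e9:.1f} GB, GQA: {gqa / 1e9:.1f} GB, "
      f"ratio: {dense / gqa:.1f}x")
# dense: 17.2 GB, GQA: 1.9 GB, ratio: 9.1x
```

Swap in your own model's num_hidden_layers, num_key_value_heads, and head_dim from its config.json and this becomes a serving-budget calculator.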


Will this fit? A mental model for GPU sizing

The question is never just “how big is the model” — it’s “how big is the model relative to what I need it to do.” Here’s how to think through it.

For inference:

  1. Look up (or calculate) weight VRAM from the table above.
  2. Add 20-30% headroom for activations during forward passes, framework overhead, and KV cache at short-to-moderate contexts.
  3. If the result fits in your GPU’s VRAM, you’re probably fine for single-user or low-concurrency use.
  4. For serving multiple users or long contexts, budget KV cache explicitly: it can easily add 50-100% of weight VRAM at realistic workloads.

A rough rule of thumb that holds for most models: you can comfortably run a model if your GPU has at least 1.2-1.3x the weight VRAM in total capacity, for inference with short contexts. For production serving, assume you’ll need 2x the weight VRAM to have breathing room.

For finetuning:

  • LoRA (BF16 base): you need the full weight VRAM plus significant overhead for activations and optimizer state. A 7B model needs at least 20-24 GB; 24 GB cards are tight.
  • QLoRA (NF4 base): weight VRAM drops to roughly 3.5-4 GB for 7B. Total VRAM needed is roughly 12-16 GB for 7B with moderate sequence lengths. A 16 GB GPU can work; 24 GB is comfortable.
  • Full training: multiple high-memory GPUs, gradient checkpointing enabled, and a sharding strategy (ZeRO-3 or FSDP). The per-GPU burden distributes significantly, but the total cluster requirement stays high.

The quick sanity check:

Inference budget:  weight_GB × 1.3
LoRA budget:       weight_GB (BF16) × 1.5  [rough estimate]
QLoRA budget:      weight_GB (NF4)  × 3.5  [accounts for activations and overhead]

These are estimates, not guarantees. Actual usage depends on sequence length, batch size, framework version, and which runtime you’re using. But they’re close enough to tell you whether you’re in the right neighborhood before you launch a job.
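The three rules of thumb, as one throwaway function — the multipliers are the rough estimates from above, nothing more authoritative:

```python
def vram_budget_gb(params_b: float, mode: str) -> float:
    """Back-of-envelope VRAM budget in GB for a model of params_b billion
    parameters. Multipliers are rough estimates, not guarantees."""
    if mode == "inference":  # BF16 weights + ~30% headroom
        return params_b * 2 * 1.3
    if mode == "lora":       # BF16 base + activation/optimizer overhead
        return params_b * 2 * 1.5
    if mode == "qlora":      # NF4 base, x3.5 for activations and overhead
        return params_b * 0.5 * 3.5
    raise ValueError(f"unknown mode: {mode}")

for mode in ("inference", "lora", "qlora"):
    print(f"7B {mode}: ~{vram_budget_gb(7, mode):.0f} GB")
```

For a 7B model that prints roughly 18, 21, and 12 GB — consistent with the per-mode ranges above, and enough to tell you which GPU tier to even consider.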

One final note: the 24 GB GPU that “runs great” on the model card was almost certainly tested with a single user, short context, and default settings. That’s the minimum viable case, not a production characterization. If you’re planning real serving, budget generously.


What you now know, and what’s next

You can now translate a model card into a VRAM estimate. You know why BF16 is everywhere (range stability, not precision), why TF32 is a compute mode and not a storage format, and why the KV cache is the hidden cost that breaks inference budgets. You have a mental framework for whether something will fit, and a sense of why full BF16 training costs roughly 8–9× as much as BF16 inference for the same model.

In Post 3, we’ll open the model’s box: read the config.json that specifies the full architecture, count the parameters yourself from shapes, and inspect the files without loading any weights. After that, Post 4 covers the hardware side — which GPU generations support which compute formats, and why the gap between what hardware claims and what runtimes use is where most confusion lives.


Next: [Post 3 | Open the Box: reading the model spec and counting parameters yourself]
