You download the model. Thirty gigabytes of something arrives on your drive. You run the loading script. Maybe it works. The model answers your prompts, roughly as advertised. Maybe it doesn’t: CUDA out of memory, a dtype mismatch, a format your runtime doesn’t recognize.
Either way, you’re left with a nagging suspicion that you don’t really understand what you’re holding. What is inside those files? When someone says “run this model in FP8,” what are they actually asking? When your GPU says it supports INT4, why does the model still run slowly?
This series is about answering those questions, not abstractly, but for the specific decisions you'll face as someone who works with language models: loading them, running them, quantizing them, finetuning them, converting them between formats, and deploying them on hardware they weren't designed for.
This first post is about foundations. Before anything else, you need a mental model: a small set of concepts that will make every subsequent choice legible. We’ll cover what weights actually are, the three distinct ways precision comes up in practice, the three tensor classes that matter for LLMs, and how a model moves from disk to inference output. Nothing in this post is complicated. But if you’re fuzzy on any of it, the rest of the series will feel arbitrary.
Who this series is for
This series is aimed at early-to-intermediate practitioners: people who have loaded a model or two, maybe run some inference, maybe tried finetuning, and now want to understand the mechanics well enough to make deliberate choices instead of cargo-culting configs from tutorials.
If some of that is unfamiliar, here is a one-paragraph orientation. A language model is a program that predicts the next word (more precisely, the next token) given a sequence of previous ones. It’s built from many stacked layers, each performing a series of mathematical operations on numbers flowing through the network. The numbers that define those operations are called parameters or weights: they were learned by feeding the model enormous amounts of text and adjusting the values to reduce prediction error, a process called training. When you download a model, you’re downloading those learned weights. That’s it. Everything else in this series builds on that.
If even that feels too fast, 3Blue1Brown’s Neural Networks playlist (YouTube, visual and short) and Jay Alammar’s The Illustrated Transformer (a single web article, no math required) are genuinely good starting points that won’t eat your weekend.
What a model actually is: a file of numbers
There’s a temptation to treat a language model as something almost magical, an emergent mind, a black box, an oracle. But from a purely mechanical perspective, a model is a function. Specifically, it’s a function approximator: a mathematical object that takes a sequence of tokens as input and produces a probability distribution over the next token as output.
That function is defined by its weights: numbers, organized into matrices, that parameterize every linear transformation in every layer of the network. When you download a “70 billion parameter model,” you’re downloading roughly 70 billion of these numbers. At two bytes each (a common storage size), that’s around 140GB. At one byte each (a compressed format), it’s closer to 70GB. The exact size depends on how the numbers are stored, which is precisely what this series is about.
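The size arithmetic above is worth internalizing as a one-liner. A minimal sketch (the byte widths here are the common cases, not a complete catalog):

```python
# Common storage widths, in bytes per parameter.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "fp8": 1.0, "int8": 1.0}

def checkpoint_gb(n_params: float, dtype: str) -> float:
    """Rough checkpoint size: parameter count times bytes per value."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

print(checkpoint_gb(70e9, "bf16"))  # -> 140.0
print(checkpoint_gb(70e9, "fp8"))   # -> 70.0
```

This ignores quantization metadata and non-weight tensors, which is why real files never land exactly on these numbers.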
The weights are stored in tensor files, most commonly safetensors or pytorch_model.bin files. Open one of them and you’d find named matrices: model.layers.0.self_attn.q_proj.weight, model.layers.23.mlp.gate_proj.weight, and so on. Each entry in each matrix is a weight: a learned numerical value that hasn’t changed since training finished. These numbers are what the training process optimized. They encode, in some distributed and opaque way, everything the model “knows.”
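You don't have to take the "dictionary of named matrices" claim on faith. The safetensors format is simple enough to peek at with the standard library: the first 8 bytes are a little-endian u64 giving the length of a JSON header, which maps tensor names to dtype, shape, and byte offsets. Real tooling (the `safetensors` library) does this for you; the sketch below builds a tiny fake file in the same layout just to demonstrate the parsing, with a tensor name and shape chosen for illustration:

```python
import json
import struct
import tempfile

def read_safetensors_header(path):
    """Read only the JSON header of a .safetensors file, skipping tensor data.
    Layout: 8-byte little-endian u64 header length, then that many JSON bytes."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))

# Build a minimal file in the same layout: one 4x4 BF16 tensor = 32 data bytes.
header = json.dumps({
    "model.layers.0.self_attn.q_proj.weight":
        {"dtype": "BF16", "shape": [4, 4], "data_offsets": [0, 32]},
}).encode()

with tempfile.NamedTemporaryFile(suffix=".safetensors", delete=False) as f:
    f.write(struct.pack("<Q", len(header)) + header + b"\x00" * 32)
    path = f.name

for name, meta in read_safetensors_header(path).items():
    print(name, meta["dtype"], meta["shape"])
```

The same header-only trick is how tools list a checkpoint's tensors without loading gigabytes of data.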
The 30GB blob on your drive is not a mystery. It’s a dictionary of named matrices containing learned numerical data: floating-point values in full-precision checkpoints, or quantized codes plus per-block scales/metadata in compressed ones.
The three dtypes you must keep separate
Here’s where it gets important, and where most confusion lives.
When someone says “this model is FP16” or “this model is 4-bit,” they’re making a claim about precision, but precision shows up in three distinct places, and conflating them is the source of countless loading errors, unexpected memory usage, and performance surprises.
Storage dtype is how the weights are encoded in the files on disk. This is usually what people mean when they put a precision in a model’s name. A “BF16 model” stores each weight as a 16-bit brain float. A “4-bit model” stores each weight as a 4-bit compressed value plus some metadata. Storage dtype determines file size and, importantly, determines what formats your runtime can load.
Compute dtype is the precision at which the actual matrix multiplications run. This is separate from storage. A model might be stored in 4-bit, dequantized to BF16 on the way into the GPU’s math units, and then multiplied in BF16. The storage says 4-bit; the compute says BF16. This is not a contradiction: it’s the normal case for most “quantized” LLMs today.
Accumulation dtype is the precision used for the internal running sums inside each matrix multiply. When you multiply two matrices, you’re accumulating thousands of partial products. Those accumulators are often kept at higher precision (usually FP32 or sometimes BF16) to prevent numerical error from compounding. This dtype is almost never exposed in model cards, but it matters for training stability and occasionally for inference correctness.
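You can see why accumulators are kept wide with a toy experiment. This is pure illustration (real kernels accumulate inside the matmul hardware, not in a Python loop), but the failure mode is the same: once the running sum grows large, small FP16 addends round away to nothing.

```python
import numpy as np

# 100k small values; the true sum is about 10.
vals = np.full(100_000, 1e-4, dtype=np.float16)

acc16 = np.float16(0.0)
for v in vals:                       # accumulate IN fp16
    acc16 = np.float16(acc16 + v)

acc32 = np.float32(0.0)
for v in vals:                       # same values, fp32 accumulator
    acc32 += np.float32(v)

print(acc16)  # stalls far below the true sum once addends round away
print(acc32)  # close to 10
```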
The practical takeaway: when someone says “4-bit model,” that’s usually a storage claim. The compute is almost always happening in BF16 or FP16. The accumulation is usually FP32 internally. Keep these three levels distinct and a lot of apparent contradictions resolve.
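Here is a toy sketch of that three-level split, using int8 codes in place of a real 4-bit scheme (the values and the one-scale-per-block layout are simplified for illustration):

```python
weights = [0.12, -0.53, 0.91, -0.07]             # "true" trained values
scale = max(abs(w) for w in weights) / 127       # one scale for the block
stored = [round(w / scale) for w in weights]     # STORAGE: small int codes

compute = [q * scale for q in stored]            # COMPUTE: dequantize to float
x = [1.0, 2.0, -1.0, 0.5]                        # an input vector

acc = 0.0                                        # ACCUMULATION: running sum
for w, xi in zip(compute, x):
    acc += w * xi

print(stored)   # integers, 1 byte each on disk
print(acc)      # close to the dot product of the original floats
```

The file stores `stored` and `scale`; the math runs on `compute`; the sum lives in `acc`. Three dtypes, one model.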
The three tensor classes that matter for LLMs
“The model” isn’t just one thing in memory. During inference and training, there are three distinct categories of tensors, each with different memory characteristics and different implications for what “quantization” helps with. [For full finetuning, also include gradients and optimizer state; this section focuses on the main inference-time tensors.]
Weights are the static parameters, the 70 billion numbers from the files you downloaded. They live in GPU memory throughout inference. They’re the dominant contributor to VRAM usage at rest. Weight-only quantization (storing weights in 4-bit instead of 16-bit) directly reduces this footprint. This is why quantization is so valuable: for a 70B model, going from 16-bit to 4-bit storage cuts weight VRAM from roughly 140GB to roughly 35GB. [Back-of-envelope estimate; real 4-bit footprints are higher due to scales and metadata.]
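The "higher due to scales" caveat can be made concrete. Many 4-bit schemes store weights in small blocks, each carrying its own scale; assuming 32-weight blocks with one FP16 scale per block (the layout used by GGUF's Q4_0, for example), the effective cost is 4.5 bits per weight, not 4:

```python
def effective_bits_per_weight(code_bits: int, block_size: int, scale_bits: int) -> float:
    """Quantized code plus its amortized share of the per-block scale."""
    return code_bits + scale_bits / block_size

bits = effective_bits_per_weight(code_bits=4, block_size=32, scale_bits=16)
print(bits)                    # -> 4.5
print(70e9 * bits / 8 / 1e9)   # ~39.4 GB for a 70B model, not 35
```

Schemes with zero-points, smaller blocks, or double quantization shift this number further; the function above is the back-of-envelope version.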
Activations are the intermediate results that flow between layers during a forward pass. Unlike weights, activations depend on batch size and sequence length: they’re proportional to how much you’re processing right now, not only to how big the model is. Activations dominate memory during finetuning (especially with long sequences), because you have to store them for the backward pass. Most “quantized inference” formats don’t help you here; they compress weights, not activations.
KV cache is the hidden third consumer, and it surprises almost everyone when they first encounter it. During inference, transformers maintain a cache of computed key and value tensors for every token in the context, across every layer. For a model with a 128K context window serving many concurrent users, the KV cache can easily dwarf the weight storage. Some runtimes can quantize the KV cache (often to int8), which is a separate concern from weight quantization.
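The KV cache size follows directly from the model's shapes: two tensors (K and V) per layer, per KV head, per token. A sketch, with shapes roughly like a 70B GQA model (illustrative numbers, not any exact release):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem):
    """K and V tensors, per layer, per token, per sequence in the batch."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# 80 layers, 8 KV heads of dim 128, BF16 cache, one user at a 128K context:
size = kv_cache_bytes(80, 8, 128, 128 * 1024, 1, 2)
print(size / 2**30)  # -> 40.0 GiB for a single sequence
```

Forty gigabytes for one user's cache, before you've served a second request: this is why long-context serving is a memory problem even when the weights fit comfortably.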
Here’s how they map together:
| | Weights | Activations | KV Cache |
|---|---|---|---|
| Storage dtype | On disk; quantized to save file size | Not stored on disk | Not stored on disk |
| Compute dtype | Dequantized for math | Processed at runtime precision | Cached at runtime precision |
| When it dominates | Right after loading\* | Training / long sequences | Long-context inference / many users |

\* Usually true right after loading the model. In long-context or high-concurrency serving, the KV cache can become the larger VRAM consumer.
The reason this table matters: when someone asks “will quantizing help with my OOM error?” the answer depends entirely on which of these three is causing the problem. Weight quantization helps if your model is too big to load. It doesn’t help if your KV cache is eating your VRAM at inference time. It barely touches the activations dominating your finetuning run.
From disk to output: the minimal pipeline
A model moves through your system in stages, and precision (or format) can change at each one.
┌─────────────────────┐
│ Disk / Download │ <- Storage dtype (what's in the files)
│ (safetensors/GGUF) │
└──────────┬──────────┘
│ load + possibly quantize/dequantize
v
┌─────────────────────┐
│ GPU Memory (VRAM) │ <- Runtime weight dtype (may differ from storage)
│ Weights loaded │
└──────────┬──────────┘
│ forward pass
v
┌─────────────────────┐
│ Compute Kernels │ <- Compute dtype + accumulation dtype
│ (tensor cores) │
└──────────┬──────────┘
│
v
┌─────────────────────┐
│ Output tokens │
└─────────────────────┘
Each arrow in that diagram is a potential transformation. The loader can quantize weights on the fly (BF16 on disk to NF4 in VRAM). [Runtime-dependent; some stacks require pre-quantized checkpoints.] The kernels can dequantize weights before math (NF4 in VRAM to BF16 for matmul). The accumulation happens internally at a precision the user rarely controls.
This is why “can my GPU run this format?” is a three-part question:
- Can the loader read the storage format?
- Can the runtime hold the weights in VRAM (in some form)?
- Does a compute kernel exist for that format on your specific GPU?
All three need to be true. A 4-bit model might load fine but run in an unoptimized fallback path that’s slower than BF16. An FP8 model might fail to load at all if your GPU predates FP8 tensor cores. The pipeline view makes these failure modes legible.
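The third question is the one tied to hardware generations. As a rough sketch, here is the mapping from NVIDIA compute capability to natively supported tensor-core dtypes (illustrative shorthand from public architecture docs; consult vendor documentation for the authoritative matrix, and note that runtimes may still emulate unsupported dtypes slowly):

```python
def tensor_core_dtypes(cc_major: int, cc_minor: int) -> list:
    """Rough map from NVIDIA compute capability to native tensor-core dtypes."""
    cc = (cc_major, cc_minor)
    dtypes = []
    if cc >= (7, 0):
        dtypes.append("FP16")        # Volta and later
    if cc >= (7, 5):
        dtypes.append("INT8")        # Turing and later
    if cc >= (8, 0):
        dtypes += ["BF16", "TF32"]   # Ampere and later
    if cc >= (8, 9):
        dtypes.append("FP8")         # Ada / Hopper and later
    return dtypes

print(tensor_core_dtypes(8, 6))  # e.g. an RTX 3090: BF16 yes, FP8 no
```

This is exactly the "FP8 model on a pre-FP8 GPU" failure mode from above, expressed as a lookup.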
What you now know, and what’s next
You now have the mental model that the rest of this series builds on:
- A model is a file of numbers (weights) organized into named matrices.
- Precision appears in three distinct places: storage (disk), compute (the math units), and accumulation (the running sums). These often differ from each other.
- Memory during LLM work is consumed by three different tensor classes: weights (static), activations (scale with batch/sequence), and KV cache (scales with context and concurrency).
- A model travels through a pipeline from disk to output, and format/precision can change at each stage.
In Post 2, we’ll make this concrete: you’ve downloaded a 30GB file, and you need to know whether it’ll fit. We’ll do the actual memory math and build the table every LLM practitioner should have in their head.
In Post 3, we’ll open the model’s box before moving further: reading the config.json specification, counting parameters from shapes, and inspecting the files without loading a single tensor into memory.
In Post 4, we’ll look at the hardware side: not all GPUs are equal, and the gap between what your GPU supports and what your runtime uses is where a surprising amount of confusion lives.
The series follows a model from first download to production. You’re at the beginning of that journey.
Next: [Post 2 | First Contact: the memory math every practitioner needs in their head]