Post 2 gave us the math: 2 bytes per parameter in BF16. Apply that to a 9-billion-parameter model and you get roughly 18 GB of weights. Your GPU has 24 GB. The numbers work out. Time to get the model.
Getting the model
We’re using Qwen3.5-9B throughout this series — a 9-billion-parameter multimodal model (text + image + video) released in early 2026, Apache 2.0 licensed, nothing to accept. Its Hugging Face page is at Qwen/Qwen3.5-9B.
You have three ways to get it, and the choice matters more than it might seem.
Hugging Face — the source
Before you download anything, spend a minute on the repo page itself. What you land on is the model card — a README documenting the model’s intended use, training details, evaluation benchmarks, and known limitations. It’s the closest thing open-weight models have to official documentation, and it varies enormously in quality.
Two tabs at the top are worth bookmarking. Files and versions shows every file in the repository, with sizes, and lets you browse or download individual files — useful for peeking at config.json before pulling thirty gigabytes. Community is the most underrated part of Hugging Face: users post questions, benchmarks, compatibility notes, and workarounds there. If the model does something unexpected, or a quantization behaves oddly, or a particular runtime version has a bug — someone has usually documented it in Community before it made it into any official changelog. It’s worth checking before you spend an afternoon debugging.
# Standalone installer (recommended — no Python environment needed):
curl -LsSf https://hf.co/cli/install.sh | bash
# Or with uv, as a persistent global tool:
uv tool install huggingface_hub
# Then download:
hf download Qwen/Qwen3.5-9B \
  --local-dir ./Qwen3.5-9B
Or the Python equivalent:
from huggingface_hub import snapshot_download
snapshot_download("Qwen/Qwen3.5-9B", local_dir="./Qwen3.5-9B")
Some models require accepting a license agreement before HF will serve them. For those, you run hf auth login first and the CLI handles the auth header. Qwen3.5-9B is ungated — no login needed.
After the download, here’s what you’ll find:
Qwen3.5-9B/
├── config.json ← architecture spec (3 kB)
├── chat_template.jinja ← conversation format template (separate file)
├── preprocessor_config.json ┐
├── video_preprocessor_config.json ┘ vision/video input config (multimodal)
├── model-00001-of-00004.safetensors ┐ 5.28 GB
├── model-00002-of-00004.safetensors │ 5.34 GB weights split across
├── model-00003-of-00004.safetensors │ 5.37 GB 4 shards (19.3 GB total)
├── model-00004-of-00004.safetensors ┘ 3.33 GB
├── model.safetensors.index.json ← index: tensor name → shard file
├── tokenizer.json ← vocabulary + merge rules (12.8 MB)
├── tokenizer_config.json ← tokenizer settings
├── merges.txt ← BPE merge rules (also in tokenizer.json)
└── vocab.json ← token-to-ID lookup
19.3 GB total — 9.65B parameters (text model + vision encoder) × 2 bytes (BF16) = 19.3 GB. The 1.3 GB above a naive 9 × 2 = 18 GB estimate comes from the actual count being 9.65B once the vision encoder is included, and the “9B” in the name being a rounded marketing figure. The weights are split across four .safetensors shards because a single ~19 GB file is unwieldy; each shard is simply a slice of the full weight set. The .index.json file maps every tensor name to whichever shard contains it. (All sizes in this post use decimal GB — 1 GB = 10⁹ bytes — consistent with how Hugging Face and most download tools report file sizes.)
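That arithmetic is worth a two-line sanity check, in the same decimal GB convention:

```python
# Weight size in decimal GB (1 GB = 1e9 bytes): parameters × bytes per parameter.
def weight_size_gb(params_billions, bytes_per_param=2):
    return params_billions * 1e9 * bytes_per_param / 1e9

print(weight_size_gb(9.0))            # naive "9B" estimate: 18.0 GB
print(weight_size_gb(9.65))           # actual count incl. vision encoder: 19.3 GB
print(sum([5.28, 5.34, 5.37, 3.33]))  # the four shard sizes add up: ~19.3 GB
```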
Ollama
The model’s Ollama library page is at qwen3.5:9b. The default pull:
ollama pull qwen3.5:9b # downloads the default tag: Q4_K_M, 6.6 GB
ollama pull qwen3.5:9b-q8_0 # higher quality, 11 GB
The library page shows the most common tags. The full list — every available variant with its quantization type, exact file size, and upload date — is at qwen3.5/tags. That’s where to look when you want to compare options before committing to a download.
What you’re downloading:
Ollama stores models using a content-addressable manifest system — similar in structure to how Docker stores image layers, with each component identified by hash rather than filename. The actual model weights are stored internally as a GGUF file, a self-contained binary developed by the llama.cpp project that bakes the architecture spec, tokenizer vocabulary, and model weights (quantized or full-precision) together in a single file. We’ll look at the GGUF format properly after working through the Hugging Face download below.
Why 6.6 GB:
The bytes-per-parameter estimates from the previous post are the right mental model and hold well for a full-precision text-only model. A few real-world factors shift the number for this specific download.
The model’s actual parameter count is 9.65 billion — the full model count including the vision encoder; “9B” is the marketing name.

More significantly, Q4_K_M is not pure 4-bit storage. It’s a mixed-precision K-quant: weights are quantized in blocks of 256, and each block stores a scale factor in FP16. That overhead raises effective storage to roughly 5 bits per weight rather than 4. The “M” (medium) variant also keeps the most quality-sensitive tensors at Q6_K, roughly 6.6 bits per weight — in llama.cpp’s default quantization, the exact selection is determined by tensor-category heuristics and varies by model and llama.cpp version, but typically includes the LM head and certain feed-forward tensors on critical layers. The LM head alone is 4,096 × 248,320 ≈ 1B parameters; at Q6_K it contributes ~0.8 GB rather than the ~0.6 GB it would at Q4_K.

And because Qwen3.5-9B is multimodal, the vision encoder components are quantized separately from the main text model weights — the exact precision depends on the quantization recipe and converter used. (Quantization defaults reflect llama.cpp’s heuristics at the time of conversion; run ollama show qwen3.5:9b to confirm the exact quant type for your local artifact.)
Work through the numbers: the ~9.1B text parameters at Q4_K_M, averaging ~5.3 bits/weight once Q6_K uplift on the largest tensors is factored in, contribute ~6.0 GB. The vision encoder, tokenizer, and file metadata together account for the remaining ~0.6 GB — that figure is the empirical difference between the 6.6 GB total and the text model’s calculable contribution; the exact split between vision weights and metadata within it depends on the specific artifact and quantization recipe. The same factors apply to any quantized multimodal artifact; the bytes-per-parameter formula stays accurate once you account for them.
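Those estimates can be scripted; the ~5.3 bits/weight average is the assumption carried over from the text, not an exact llama.cpp figure:

```python
# Rough artifact-size model for a mixed-precision quant (bits/weight is an
# assumed average, not an exact llama.cpp figure).
def quant_size_gb(params_billions, avg_bits_per_weight):
    return params_billions * 1e9 * avg_bits_per_weight / 8 / 1e9

text_gb = quant_size_gb(9.1, 5.3)  # text model at Q4_K_M with Q6_K uplift
print(round(text_gb, 1))           # ~6.0 GB
print(round(6.6 - text_gb, 1))     # ~0.6 GB left for vision weights + metadata
```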
What you can inspect:
Ollama is not a black box. You can read what you pulled:
ollama show qwen3.5:9b # architecture, quant level, context length, parameter count
ollama show qwen3.5:9b --verbose # full tensor listing
ollama list # all local models with sizes and pull dates
The model files live under ~/.ollama/models/ in the manifest/blob structure. The layout is not meant to be hand-edited, but the model’s contents are fully visible through ollama show and the GGUF inspection tools we’ll cover later in this post.
A note on scope: Ollama started as a local inference tool and that remains its primary use case. As of 2025 it also offers Ollama Cloud — cloud-hosted inference, including a paid Ollama Turbo tier — so a model name like qwen3.5:9b can route to local or remote compute depending on configuration. In contexts where API calls must stay local (data residency, air-gapped environments), be explicit about which endpoint you’re targeting.
The quantization scheme for a given Ollama tag is baked into that artifact — switching to a different level means pulling a different tag. Converting between formats is generally possible; Post 5 covers the quality trade-offs and Post 8 covers the conversion landscape.
LM Studio
LM Studio is a desktop application (macOS, Windows, Linux) for discovering, downloading, and running local models. Its model discovery interface searches Hugging Face for GGUF files, shows available quantization variants alongside their sizes and quality tags before you commit to a download, and links to the source model card. That makes it somewhat more transparent than Ollama’s CLI about what you’re actually pulling.
Beyond the GUI, LM Studio ships a local OpenAI-compatible REST API server and a headless daemon (lms) for non-interactive use, which makes it usable in scripted environments and CI without the GUI running. It supports loading multiple models simultaneously with independent context configurations and GPU-offload ratios — useful when you’re comparing two quantization levels side by side, or mixing a text model with an embeddings model for a retrieval pipeline.
The same caveats apply as with Ollama: you’re working with pre-compressed GGUF artifacts. The quantization level is fixed for the artifact you downloaded — changing it means downloading a different variant. Finetuning from a GGUF file is not a standard workflow; for that you typically want a full-precision source. Format conversion and re-quantization from GGUF is possible in many cases, though quality implications vary by method.
Reading the Hugging Face download
Ollama and LM Studio are runtime-first: get something running with minimum friction. The Hugging Face download is different — it’s inspection-first. You have the full set of specification files and direct access to every tensor before loading anything into GPU memory. That’s what makes it the right starting point for understanding a model.
One thing worth noting before we dive in: Hugging Face is a hosting platform, not a file format. The Qwen3.5-9B repository happens to distribute BF16 safetensors — but plenty of models on HF publish GGUF files, GPTQ checkpoints, AWQ variants, or other formats in their “Files” tab alongside or instead of safetensors. The file listing is how you know what you actually got. For this walk-through, we’re working with the safetensors download.
Let’s go through what you just downloaded, starting with the file that controls everything.
config.json
Every Hugging Face model repository ships a config.json — a short JSON file that fully specifies the model’s architecture. It’s the ground truth: layer count, head configuration, vocabulary size, context window, and every other structural parameter that determines how large the model is and how it behaves. If a model card makes a claim about the architecture, this is where you verify it.
Qwen3.5-9B’s is worth going through in detail, because it contains a few things you won’t see in a plain-transformer config.
Here’s an annotated excerpt:
// comments added for explanation; real config.json contains none
{
"architectures": ["Qwen3_5ForConditionalGeneration"],
"model_type": "qwen3_5",
"image_token_id": 248056, // <- vocabulary ID used as placeholder for image tokens
"video_token_id": 248057, // <- vocabulary ID used as placeholder for video tokens
"vision_start_token_id": 248053, // <- wraps the image/video token block
"vision_end_token_id": 248054,
"tie_word_embeddings": false, // <- LM head is a separate matrix, not shared with embedding
"text_config": {
"model_type": "qwen3_5_text",
"num_hidden_layers": 32, // <- 32 transformer blocks total (9B; 27B uses 64)
"hidden_size": 4096, // <- residual stream width
"num_attention_heads": 16, // <- Q heads in full-attention layers
"num_key_value_heads": 4, // <- GQA: 4 KV heads shared across 16 Q heads
"head_dim": 256, // <- per-head dimension (full attention)
"intermediate_size": 12288, // <- MLP hidden dimension
"vocab_size": 248320, // <- vocabulary size
"max_position_embeddings": 262144, // <- native context window (256K tokens)
"full_attention_interval": 4, // <- every 4th layer uses full attention; rest use linear
"layer_types": ["linear_attention", "linear_attention", "linear_attention", "full_attention", ...],
"attn_output_gate": true, // <- additional gating on attention output
"mtp_num_hidden_layers": 1, // <- one multi-token prediction (MTP) block
"mtp_use_dedicated_embeddings": false,
"mamba_ssm_dtype": "float32", // <- DeltaNet SSM computations in FP32
"linear_conv_kernel_dim": 4, // <- DeltaNet: convolution kernel size
"linear_key_head_dim": 128, // <- DeltaNet: key head dimension
"linear_num_key_heads": 16, // <- DeltaNet: number of key heads
"linear_value_head_dim": 128, // <- DeltaNet: value head dimension
"linear_num_value_heads": 32, // <- DeltaNet: number of value heads (9B; 27B uses 48)
"rope_parameters": {
"mrope_interleaved": true, // <- multimodal RoPE: interleaves dims per modality
"mrope_section": [11, 11, 10], // <- splits 32 RoPE dims: language / image / video
"partial_rotary_factor": 0.25, // <- only 25% of each head's dims are rotated
"rope_type": "default",
"rope_theta": 10000000 // <- RoPE base frequency (10M, tuned for 256K context)
}
},
"vision_config": {
"depth": 27, // <- 27 Vision Transformer blocks
"hidden_size": 1152, // <- ViT residual stream width
"intermediate_size": 4304, // <- ViT MLP hidden dimension
"num_heads": 16,
"out_hidden_size": 4096, // <- projection to text model hidden size
...
}
}
A few things are immediately different from the configs you may have seen before. First, this is a multimodal model (text + image + video), so the config has a top-level wrapper: text_config holds the language model parameters, and a sibling vision_config block holds the visual encoder. When you’re doing text-only memory math, you read text_config.
Second, notice full_attention_interval: 4. That’s the hybrid architecture marker. Qwen3.5-9B stacks 32 layers, but not all 32 are standard transformer attention. Every 4th layer uses full scaled dot-product attention (the standard O(n²) kind); the other 24 layers use Gated DeltaNet, a linear attention variant that avoids the quadratic cost. GQA (num_key_value_heads: 4) applies to those 8 full-attention layers.
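A quick sketch of what those two facts mean numerically. The dict below just mirrors the config excerpt above; in practice you’d json.load the downloaded config.json and read its text_config block:

```python
# Mirrors fields from the config excerpt above; in practice, load the real file:
#   text = json.load(open("Qwen3.5-9B/config.json"))["text_config"]
text = {
    "num_hidden_layers": 32,
    "num_attention_heads": 16,
    "num_key_value_heads": 4,
    "full_attention_interval": 4,
}

full_blocks = text["num_hidden_layers"] // text["full_attention_interval"]
linear_blocks = text["num_hidden_layers"] - full_blocks
gqa_ratio = text["num_attention_heads"] // text["num_key_value_heads"]
print(full_blocks, linear_blocks, gqa_ratio)  # -> 8 24 4
```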
That’s the complete language architecture in one block. Let’s unpack what each field actually controls.
What the fields control
This section covers the fields that determine model size, memory cost, and architecture. Fields governing training behaviour (dropout, initializer_range, attention_bias) are omitted — they’re not relevant to inference planning.
A terminology note: in LLM configs, “layers” means complete transformer blocks — each one is a self-contained compute unit combining an attention module (full self-attention in standard transformers, or a linear attention variant in hybrid architectures) with an MLP feed-forward network. This is not the same as “layer” in the classical deep-learning sense (a single linear transformation). A transformer block contains several linear layers. The field name num_hidden_layers counts blocks; one “layer” here equals one full block.
| Field | Value | What it does |
|---|---|---|
| num_hidden_layers | 32 | Number of transformer blocks (each = an attention module + MLP; for Qwen3.5-9B, 24 of 32 use Gated DeltaNet linear attention rather than full self-attention). Multiplier on every per-block weight and on KV cache cost. |
| hidden_size | 4096 | Width of the residual stream — the vector every block reads from and writes back to. All weight matrices are sized relative to this: Q projection is 4096 × (heads × head_dim), each MLP matrix is 4096 × 12288. |
| num_attention_heads | 16 | Number of Q attention heads in the full-attention blocks. Each head attends independently using head_dim-dimensional (256-dim) key/query/value vectors. Q projection = hidden_size × (16 × 256) = 16.7M params per block. |
| num_key_value_heads | 4 | GQA: 4 KV heads shared among 16 Q heads. K and V projection matrices shrink to hidden_size × (4 × 256) = 4.2M params each (vs 16.7M with full MHA), and the KV cache shrinks by the same 4× factor. |
| full_attention_interval | 4 | Every 4th block uses standard full attention; the other 24 use linear attention with a fixed-size recurrent state that does not grow with context length. Only the 8 full-attention blocks contribute a growing KV cache. |
| intermediate_size | 12288 | MLP hidden width. With SwiGLU, 3 matrices per block at 4096 × 12288 ≈ 50M params each — ~150M per block × 32 blocks ≈ 4.8B total, more than half the model. |
| vocab_size | 248,320 | The number of distinct tokens the model recognizes — its vocabulary. The embedding table maps each of the 248,320 tokens to a 4096-dim vector: 248,320 × 4,096 ≈ 1B params. |
| max_position_embeddings | 262,144 | The longest sequence the positional encoding can handle reliably — the native context window. |
| rope_theta | 10,000,000 | Base frequency for RoPE (Rotary Position Embedding) — the mechanism that tells the model where each token sits in the sequence by rotating its key/query vectors based on position. Higher theta stretches the positional range. The original LLaMA value (10,000) handled up to ~4K tokens; 10M is 1000× higher, needed to distinguish positions across a 256K window. |
When num_key_value_heads is smaller than num_attention_heads, that’s GQA in action — fewer KV heads means smaller K and V projection matrices and a proportionally smaller KV cache. Post 2 covered the memory implications in full, including a worked example showing a 9× KV cache reduction between Qwen1.5-7B (pre-GQA) and Qwen2.5-7B (GQA). For Qwen3.5-9B, the savings go further because of the hybrid architecture: only the 8 full-attention blocks accumulate a growing KV cache; the other 24 linear-attention blocks maintain a fixed-size recurrent state regardless of sequence length (roughly 20–30 MB per active sequence across all 24 DeltaNet blocks in a BF16 runtime implementation — negligible at batch-size-1 inference, but scales linearly with batch size). The per-token KV cache cost in BF16:
8 blocks × 2 (K + V) × 4 KV heads × 256 head_dim × 2 bytes = 32,768 bytes per token (32 KB)
At 128K context: 32,768 × 131,072 ≈ 4.3 GB for the attention cache. A hypothetical all-full-attention model with standard MHA — 32 blocks, 16 KV heads — would cost 32 × 2 × 16 × 256 × 2 = 524,288 bytes per token, or roughly 69 GB at 128K context. GQA (4× fewer KV heads) combined with the hybrid architecture (4× fewer blocks with a growing cache) delivers a 16× lower KV cache for long-context workloads.
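The same arithmetic as a reusable sketch:

```python
# Per-token KV cache in bytes: only blocks with a growing cache count.
def kv_bytes_per_token(blocks, kv_heads, head_dim, bytes_per_elt=2):
    return blocks * 2 * kv_heads * head_dim * bytes_per_elt  # 2 = K and V

hybrid = kv_bytes_per_token(8, 4, 256)  # 8 full-attention blocks, GQA
mha = kv_bytes_per_token(32, 16, 256)   # hypothetical all-full-attention MHA
ctx = 131072                            # 128K context

print(hybrid)              # 32768 bytes/token (32 KB)
print(hybrid * ctx / 1e9)  # ~4.3 GB at 128K context
print(mha * ctx / 1e9)     # ~68.7 GB at 128K context
print(mha // hybrid)       # 16x reduction
```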
Counting parameters yourself
The model card says 9B. Let’s work through the count from config.json. Most components derive directly from config fields. The MTP block and vision encoder require estimates — their internal layouts are not fully enumerated in config.json — but together they account for less than 1B of the ~9.65B total and are flagged as estimates below.
Qwen3.5-9B’s hybrid architecture means you have to account for two different layer types. Same approach as always — multiply through the shapes — just more parts.
MLP parameters (all 32 layers):
All 32 layers share the same MLP structure. With SwiGLU, there are three projections per layer:
– Gate: hidden_size × intermediate_size = 4096 × 12,288 = 50,331,648
– Up: same = 50,331,648
– Down: intermediate_size × hidden_size = 12,288 × 4096 = 50,331,648
MLP per layer: ~151M. 32 layers: ~4.83B — more than half the model in this one component alone.
Full-attention layers (8 layers, every 4th):
These use standard GQA with num_attention_heads: 16, num_key_value_heads: 4, head_dim: 256:
– Q projection: hidden_size × (16 × 256) = 4096 × 4096 = 16,777,216
– K projection: hidden_size × (4 × 256) = 4096 × 1024 = 4,194,304
– V projection: same as K = 4,194,304
– O projection: (16 × 256) × hidden_size = 4096 × 4096 = 16,777,216
Full-attention per layer: ~42M. 8 layers: ~336M
Linear attention layers (24 layers, the rest):
The Gated DeltaNet layers have a different QKV configuration. From the config: linear_num_key_heads: 16, linear_key_head_dim: 128, linear_num_value_heads: 32, linear_value_head_dim: 128:
– Q projection: 4096 × (16 × 128) = 8.4M
– K projection: same = 8.4M
– V projection: 4096 × (32 × 128) = 16.8M
– O projection: (32 × 128) × 4096 = 16.8M
– Gate projection (attn_gate.weight): hidden_size × hidden_size = 4096 × 4096 = 16.8M (attn_output_gate: true means a square gate matrix; confirmed in the GGUF tensor listing below)
Roughly 67M per layer in attention weights.
24 layers × ~67M: ~1.6B
Embedding table:
vocab_size × hidden_size = 248,320 × 4,096 ≈ 1.02B
LM head:
Qwen3.5-9B does not tie the LM head to the embedding matrix (tie_word_embeddings: false). That’s a separate weight matrix: another ~1.02B.
Remaining components:
The MTP block (mtp_num_hidden_layers: 1) adds one additional transformer-like layer to the text LLM — ~0.25B (estimated from typical MTP block sizes; the MTP architecture is not fully enumerated in config.json). Layer norms and other minor weights are negligible. Together these account for ~0.30B. The vision encoder is a separate ~0.55B (from vision_config: 27 ViT blocks with hidden_size: 1152 and intermediate_size: 4304 ≈ 20M params each).
Approximate total:
| Component | Params |
|---|---|
| MLP (32 layers) | ~4.83B |
| Full-attention (8 layers) | ~0.34B |
| Linear attention (24 layers) | ~1.60B |
| Embedding table | ~1.02B |
| LM head (untied) | ~1.02B |
| MTP block + layer norms | ~0.30B (est.) |
| Total (text LLM) | ~9.1B |
(est.) rows are approximated; all other rows derive directly from config.json fields.
The text LLM comes to ~9.1B. Adding the vision encoder (~0.55B) gives 9.65B — the full model count — and 9.65B × 2 bytes ≈ 19.3 GB, consistent with the safetensors download.
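The whole table can be recomputed in a few lines straight from the config fields (the MTP row is carried over as the text’s estimate, not derived):

```python
# Parameter count from config.json fields. The MTP figure is the text's
# estimate, not derived from the config.
hidden, inter, vocab = 4096, 12288, 248320
layers, full_every = 32, 4
full_layers = layers // full_every    # 8 full-attention blocks
linear_layers = layers - full_layers  # 24 DeltaNet blocks

mlp = layers * 3 * hidden * inter  # SwiGLU: gate/up/down per block
full_attn = full_layers * 2 * (hidden * 16 * 256 + hidden * 4 * 256)  # Q+O, K+V
lin_attn = linear_layers * (2 * hidden * 16 * 128    # Q, K projections
                            + 2 * hidden * 32 * 128  # V, O projections
                            + hidden * hidden)       # attention gate
embeddings = 2 * vocab * hidden  # embedding table + untied LM head
mtp_est = 0.30e9                 # MTP block + layer norms (estimate)

total = mlp + full_attn + lin_attn + embeddings + mtp_est
print(round(total / 1e9, 2))  # ~9.11B for the text LLM
```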
A note on hybrid models: For a clean standard transformer you can verify the count with a few lines of algebra. For hybrid architectures like this one, the arithmetic is more involved and it’s worth using the safetensors inspection below as your primary verification tool. The important thing is knowing which config fields control which components — that’s what lets you trust or question the numbers on any model card.
Inspecting safetensors without loading the weights
You can verify the parameter count from the files themselves — without loading a single tensor into memory. The safetensors format stores a JSON metadata header at the start of each file containing every tensor’s name, dtype, and shape. You can read it in a few lines of Python:
import json, struct

def inspect_safetensors(path):
    with open(path, "rb") as f:
        # First 8 bytes: length of the JSON header
        header_len = struct.unpack("<Q", f.read(8))[0]
        header = json.loads(f.read(header_len))
    total_params = 0
    for name, info in header.items():
        if name == "__metadata__":  # <- skip the file-level metadata entry
            continue
        shape = info["shape"]
        dtype = info["dtype"]
        n = 1
        for dim in shape:
            n *= dim
        total_params += n
        print(f"{name:<60s} {str(shape):<25s} {dtype}")
    print(f"\nTotal parameters: {total_params:,}")
# For a single-file model:
inspect_safetensors("model.safetensors")
# Most 9B+ models are sharded across multiple files.
# Run the same function on each shard and sum the results.
# If the model uses tied embeddings (embedding table = LM head), check
# model.safetensors.index.json — if lm_head.weight is absent from the index
# entirely, it is not stored as a separate tensor; count model.embed_tokens.weight
# once and do not add a second 1.02B for the LM head.
# (This convention holds for HF Transformers safetensors checkpoints;
# other formats may handle tied weights differently.)
This reads only the JSON header — not the tensor data itself. For a sharded model with, say, four .safetensors files, you call this on each file. The __metadata__ key at the top level holds file-level information (format version, sometimes a hash); skip it when summing shapes.
Tensor naming conventions vary by model family and architecture. A standard LLaMA-style model uses patterns like model.layers.{n}.self_attn.q_proj.weight and model.layers.{n}.mlp.gate_proj.weight, but Qwen3.5-9B’s hybrid architecture means the 24 linear attention blocks have different tensor names from the 8 full-attention blocks — you can see the GGUF naming scheme for this model’s architecture-specific tensors (e.g., blk.9.attn_gate.weight, blk.9.ssm_out.weight) in the GGUF section below. The HF safetensors equivalents follow the same architecture but use the Transformers library’s naming scheme. Use the inspection script to discover the actual names in your checkpoint; scanning those names tells you exactly which components are present, which is invaluable when debugging a loading error.
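One way to digest a few hundred tensor names from the header is to collapse the layer index and count what’s left — a sketch using illustrative LLaMA-style names, not the actual checkpoint’s:

```python
import re
from collections import Counter

def component_counts(tensor_names):
    """Collapse per-layer indices so repeated components group together."""
    pattern = re.compile(r"\.\d+\.")
    return Counter(pattern.sub(".{n}.", name) for name in tensor_names)

# Illustrative LLaMA-style names — discover the real ones with the
# inspection script above:
names = [
    "model.layers.0.mlp.gate_proj.weight",
    "model.layers.1.mlp.gate_proj.weight",
    "model.layers.3.self_attn.q_proj.weight",
]
for component, count in component_counts(names).items():
    print(f"{count:3d} × {component}")
```

Running this over a full checkpoint’s tensor names shows at a glance which components are present and how many layers carry each one.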
Tokenizer files
The other files in the Qwen3.5-9B directory are the tokenizer. These are worth understanding because they’re the interface between human text and the numbers the model processes.
tokenizer.json is the main tokenizer file for most modern models. It’s a large JSON file — typically 3–7 MB for text-only models, but multimodal models with expanded vocabularies run larger; Qwen3.5-9B’s 248,320-token vocabulary puts it at 12.8 MB — containing the vocabulary, the merge rules for BPE tokenizers, and the pre-tokenization configuration. This file is the single source of truth for how text becomes token IDs and back.
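To make “merge rules” concrete: a BPE tokenizer starts from individual characters and repeatedly applies the earliest-listed merge until none apply. A toy sketch — toy vocabulary and merges, nothing to do with Qwen’s actual rules:

```python
def bpe_tokenize(word, merges):
    """Toy BPE: start from characters, repeatedly apply the best-ranked merge."""
    tokens = list(word)
    ranks = {pair: i for i, pair in enumerate(merges)}  # earlier merge = higher priority
    while True:
        pairs = [(ranks.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(tokens, tokens[1:]))]
        best_rank, i = min(pairs, default=(float("inf"), -1))
        if best_rank == float("inf"):  # no applicable merge left
            return tokens
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]

# merges.txt lines like "l o" / "lo w" become ordered pairs:
merges = [("l", "o"), ("lo", "w"), ("e", "r")]
print(bpe_tokenize("lower", merges))  # -> ['low', 'er']
```

The real tokenizer.json encodes ~247K of these merges plus pre-tokenization rules, which is why it’s the single source of truth for text-to-ID conversion.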
tokenizer_config.json is the tokenizer configuration. It specifies the tokenizer class, the model type, and the special token definitions. It also sometimes contains the chat_template field — a Jinja2 template string that defines how conversations should be formatted for this model.
In Qwen3.5-9B, the chat template lives in a separate file: chat_template.jinja. This is an increasingly common pattern for newer and multimodal models, where the template is too long and complex to embed cleanly inline. Hugging Face Transformers reads the template from whichever location it finds it in. For practitioners, the key point is: if you’re looking for the chat template and there’s no chat_template key in tokenizer_config.json, check for a .jinja file in the same directory. We’ll spend an entire post on this (Post 7).
Special token mappings (which token signals “start of sequence,” “end of sequence,” and so on) are embedded directly in tokenizer_config.json for Qwen3.5-9B rather than in a separate special_tokens_map.json. The tokens themselves are not English words — they’re structural signals the model treats differently from content tokens. The exact tokens vary by model family, and mixing them up causes behavior that looks like quality issues but is actually a formatting problem. That’s covered in depth in Post 7.
For models that use SentencePiece instead of the Hugging Face fast tokenizer, you’ll also find a tokenizer.model file: a binary file containing the SentencePiece vocabulary. Older models (Llama 1/2, Mistral v0.1) used this; most newer models have moved to tokenizer.json.
GGUF: a self-contained model format
The files above — safetensors shards, config.json, tokenizer.json — are all separate artifacts that a runtime assembles at load time. GGUF works differently. It emerged from the llama.cpp project in August 2023 as a replacement for the earlier GGML binary format. Where GGML stored weights but kept metadata scattered or inferred at runtime, GGUF added an extensible key-value header that encodes the full architecture spec, tokenizer vocabulary, and each tensor’s quantization type together in a single self-contained binary. A llama.cpp-based runtime — which includes Ollama, and LM Studio (which uses llama.cpp as its primary inference backend) — can load a GGUF file without any external config or tokenizer files (for standard text GGUFs; some multimodal setups distribute the vision encoder as a separate companion file). The Q4_K_M / Q6_K / Q8_0 labels you see in Ollama tags and GGUF filenames map directly to quantization-type identifiers baked into each tensor’s metadata within the file.
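The “self-contained binary” claim is easy to check at the byte level. GGUF opens with a fixed little-endian preamble — magic bytes, format version, tensor count, and metadata KV count — before the key/value data. A minimal parser, demonstrated on a synthetic preamble rather than a real file:

```python
import struct

def gguf_preamble(data):
    """Parse the fixed GGUF preamble: magic, version, tensor count, KV count."""
    magic, version = struct.unpack_from("<4sI", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    tensor_count, kv_count = struct.unpack_from("<QQ", data, 8)
    return version, tensor_count, kv_count

# Synthetic 24-byte preamble matching the dump below (v3, 427 tensors, 41 KV pairs);
# on a real file you'd pass the first 24 bytes of the .gguf:
blob = struct.pack("<4sIQQ", b"GGUF", 3, 427, 41)
print(gguf_preamble(blob))  # -> (3, 427, 41)
```

Everything after these 24 bytes is the typed key/value metadata and tensor info that gguf-dump prints.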
Hugging Face also hosts GGUF files. Many model repositories publish official or community-contributed GGUF variants in their “Files” tab alongside safetensors, or sometimes instead of them. Downloading from HF does not guarantee safetensors; the format depends on what the maintainers chose to publish. GGUF is not the only alternative either — GPTQ, AWQ, EXL2, and compiled engine formats all serve different use cases. Post 8 covers that landscape in full.
Inspecting GGUF files
The quickest way to inspect a GGUF file is the gguf-dump command, which ships with llama.cpp. Here’s the output for the Qwen3.5-9B BF16 model — available as a GGUF file directly on Hugging Face, or generated from the safetensors download using llama.cpp’s convert_hf_to_gguf.py. This dump covers the text model weights; depending on the conversion path, the vision encoder may be embedded in the same file or distributed as a companion file:
$ gguf-dump Qwen3.5-9B.gguf
INFO:gguf-dump:* Loading: Qwen3.5-9B.gguf
* File is LITTLE endian, script is running on a LITTLE endian host.
* Dumping 44 key/value pair(s)
1: UINT32 | 1 | GGUF.version = 3
2: UINT64 | 1 | GGUF.tensor_count = 427
3: UINT64 | 1 | GGUF.kv_count = 41
4: STRING | 1 | general.architecture = 'qwen35' <- note: 'qwen35', not 'qwen3_5'
5: STRING | 1 | general.type = 'model'
6: STRING | 1 | general.name = 'Qwen3.5 9B'
7: STRING | 1 | general.basename = 'Qwen3.5'
8: STRING | 1 | general.size_label = '9B'
9: STRING | 1 | general.license = 'apache-2.0'
10: STRING | 1 | general.license.link = 'https://huggingface.co/Qwen/Qwen3.5-9B/blob/main/LICENSE'
11: UINT32 | 1 | general.base_model.count = 1
12: STRING | 1 | general.base_model.0.name = 'Qwen3.5 9B Base'
13: STRING | 1 | general.base_model.0.organization = 'Qwen'
14: STRING | 1 | general.base_model.0.repo_url = 'https://huggingface.co/Qwen/Qwen3.5-9B-Base'
15: [STRING] | 1 | general.tags = ['image-text-to-text']
16: UINT32 | 1 | qwen35.block_count = 32 <- num_hidden_layers
17: UINT32 | 1 | qwen35.context_length = 262144 <- max_position_embeddings
18: UINT32 | 1 | qwen35.embedding_length = 4096 <- hidden_size
19: UINT32 | 1 | qwen35.feed_forward_length = 12288 <- intermediate_size
20: UINT32 | 1 | qwen35.attention.head_count = 16 <- num_attention_heads
21: UINT32 | 1 | qwen35.attention.head_count_kv = 4 <- GQA: 4 KV heads
22: [INT32] | 4 | qwen35.rope.dimension_sections = [11, 11, 10, 0] <- same 3 sections as config.json; trailing 0 is GGUF padding
23: FLOAT32 | 1 | qwen35.rope.freq_base = 10000000.0 <- rope_theta
24: FLOAT32 | 1 | qwen35.attention.layer_norm_rms_epsilon = 9.999999974752427e-07
25: UINT32 | 1 | qwen35.attention.key_length = 256 <- head_dim
26: UINT32 | 1 | qwen35.attention.value_length = 256
27: UINT32 | 1 | general.file_type = 32 <- 32 = BF16 (unquantized weights)
28: UINT32 | 1 | qwen35.ssm.conv_kernel = 4 <- linear attention (DeltaNet) config
29: UINT32 | 1 | qwen35.ssm.state_size = 128
30: UINT32 | 1 | qwen35.ssm.group_count = 16
31: UINT32 | 1 | qwen35.ssm.time_step_rank = 32
32: UINT32 | 1 | qwen35.ssm.inner_size = 4096 <- linear_num_value_heads × linear_value_head_dim = 32 × 128
33: UINT32 | 1 | qwen35.full_attention_interval = 4 <- hybrid: full attn every 4th block
34: UINT32 | 1 | qwen35.rope.dimension_count = 64
35: UINT32 | 1 | general.quantization_version = 2
36: STRING | 1 | tokenizer.ggml.model = 'gpt2'
37: STRING | 1 | tokenizer.ggml.pre = 'qwen35'
38: [STRING] | 248320 | tokenizer.ggml.tokens = ['!', '"', '#', '%', '&', ...] <- full vocab embedded in one file
39: [INT32] | 248320 | tokenizer.ggml.token_type = [1, 1, 1, 1, 1, 1, ...]
40: [STRING] | 247587 | tokenizer.ggml.merges = ['Ġ Ġ', 'ĠĠ ĠĠ', 'i n', 'Ġ t', 'ĠĠĠĠ ĠĠĠĠ', 'e r', ...]
41: UINT32 | 1 | tokenizer.ggml.eos_token_id = 248046
42: UINT32 | 1 | tokenizer.ggml.padding_token_id = 248044
43: BOOL | 1 | tokenizer.ggml.add_bos_token = False
44: STRING | 1 | tokenizer.chat_template = '{%- set image_count = namespace(value=0) %}\n{%- set video...'
* Dumping 427 tensor(s)
1: 1017118720 | 4096, 248320, 1, 1 | BF16 | output.weight <- LM head: hidden_size × vocab_size = 1.02B params (~2.0 GB at BF16)
2: 1017118720 | 4096, 248320, 1, 1 | BF16 | token_embd.weight <- embedding table: same shape, stored separately (tie_word_embeddings: false)
3: 50331648 | 12288, 4096, 1, 1 | BF16 | blk.14.ffn_down.weight <- MLP down projection: intermediate_size × hidden_size
4: 50331648 | 4096, 12288, 1, 1 | BF16 | blk.14.ffn_gate.weight
...
421: 131072 | 4096, 32, 1, 1 | BF16 | blk.9.ssm_alpha.weight <- DeltaNet decay rates: hidden_size × time_step_rank (32)
422: 131072 | 4096, 32, 1, 1 | BF16 | blk.9.ssm_beta.weight
423: 16777216 | 4096, 4096, 1, 1 | BF16 | blk.9.attn_gate.weight <- attention gate projection: hidden_size × hidden_size
424: 128 | 128, 1, 1, 1 | F32 | blk.9.ssm_norm.weight <- SSM normalization kept at F32 for precision
425: 16777216 | 4096, 4096, 1, 1 | BF16 | blk.9.ssm_out.weight <- DeltaNet output projection: inner_size × hidden_size
426: 4096 | 4096, 1, 1, 1 | F32 | blk.9.post_attention_norm.weight
427: 4096 | 4096, 1, 1, 1 | F32 | output_norm.weight <- final layer norm before LM head
(Comments added for clarity; the actual gguf-dump output doesn’t include them.)
Output reflects GGUF v3 format (GGUF.version = 3, field 1 above) from a llama.cpp build circa early 2026. Field names and K-quant defaults evolve across llama.cpp releases; treat this as a representative example, not a fixed specification.
The same architecture fields you read from config.json are now baked into the file header — no separate spec file needed. general.file_type = 32 means BF16; the main weight tensors show BF16 in the dtype column, while norm weights (ssm_norm, post_attention_norm, output_norm) are kept at F32 for numerical stability. qwen35.full_attention_interval = 4 tells llama.cpp to dispatch a different compute kernel every 4th block — the hybrid architecture is a first-class citizen in the file format. Lines 28–32 (ssm.*) are the DeltaNet linear attention configuration, confirmed against config.json: ssm.inner_size = 4096 directly encodes linear_num_value_heads × linear_value_head_dim = 32 × 128. These fields have no equivalent in a standard transformer GGUF.
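A small sketch of the arithmetic behind those fields. Note the layer count below is an assumption for illustration only — it isn't in the excerpt above — and the exact phase of the full-attention blocks is llama.cpp's to decide:

```python
# qwen35.ssm.inner_size encodes the DeltaNet value-head geometry:
linear_num_value_heads = 32
linear_value_head_dim = 128
assert linear_num_value_heads * linear_value_head_dim == 4096  # ssm.inner_size

# qwen35.full_attention_interval = 4: one block in four runs full attention.
# ASSUMED: 48 blocks, and "every 4th" = indices 3, 7, 11, ... (0-indexed);
# the real layer count and phase are defined by the model and llama.cpp.
interval, n_blocks = 4, 48
full_attn_blocks = [i for i in range(n_blocks) if (i + 1) % interval == 0]
print(full_attn_blocks[:4])  # [3, 7, 11, 15]
```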
The two largest tensors — output.weight and token_embd.weight — are 4,096 × 248,320 = 1.02B parameters each; at BF16 that's ~2.0 GB apiece. For a Q4_K_M quantized file, a typical llama.cpp conversion would show general.file_type = 15, Q4_K on most tensors, and Q6_K on the LM head and other quality-sensitive tensors identified by llama.cpp's heuristics — exact assignments vary by model and llama.cpp version. More on quantization-type naming in Post 8.
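The byte arithmetic is quick to check yourself — a minimal sketch, with the shapes taken from the dump above:

```python
# Size of each of the two largest tensors (output.weight, token_embd.weight).
hidden_size = 4096
vocab_size = 248_320

params = hidden_size * vocab_size   # 1,017,118,720 parameters per tensor
size_gb = params * 2 / 1e9          # BF16 = 2 bytes per parameter

print(f"{params:,} params -> {size_gb:.2f} GB per tensor")
# -> 1,017,118,720 params -> 2.03 GB per tensor
```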
You can also inspect GGUF programmatically:
from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("model.gguf")

# Key-value metadata (architecture, tokenizer, etc.)
for field in reader.fields.values():
    print(f"{field.name}: {field.parts[-1].tolist()}")

# Tensor listing (names, shapes, types — not the data itself)
for tensor in reader.tensors:
    print(f"{tensor.name:<40s} {list(tensor.shape)}")
The gguf Python package is the reference implementation from the llama.cpp project. It reads metadata without pulling tensor data into memory — the same principle as the safetensors inspection above.
What you now know, and what’s next
You can now read a model’s full specification before running a single line of inference code. config.json tells you the architecture: layer count, width, head split for GQA, vocabulary size, context window. You can multiply through those shapes and reproduce the parameter count yourself. You can inspect safetensors files without loading the weights, and you know how GGUF embeds the same information in a self-contained format.
The tokenizer files are there too, and now you know what each one does. The one we didn’t dig into — the chat template, whether it’s a field in tokenizer_config.json or its own .jinja file — is where a lot of practitioners quietly go wrong. That comes later.
Next, we look at the hardware side. All that arithmetic assumed your GPU can actually execute the formats you care about. Whether it can depends on your GPU generation in ways that aren’t obvious from the spec sheets.
Next: [Post 4 | Know Your Rig: what your hardware can actually do]