Understanding Model Parameters: Total Parameters vs Active Parameters
An explanation of how parameters are used during LLM inference, and the key differences between total and active parameters in Dense vs Mixture of Experts (MoE) architectures.
When comparing Large Language Models (LLMs), you’ll often see specifications like:
- Llama 3.1 8B – 8 billion parameters
- DeepSeek-V3 – 671 billion total parameters, 37 billion active parameters
- Mixtral 8x7B – 47 billion total parameters, ~13 billion active parameters
At first glance, these numbers can be confusing. If DeepSeek-V3 has 671B parameters, why does it use only 37B for each token? This article explains how parameters are used during inference and how total and active parameter counts differ.
What Are Parameters?
A parameter is a learned numerical value (weight) inside a neural network.
During training, the model adjusts billions of these values so that it can recognize patterns, predict the next token, answer questions, write code, and perform reasoning tasks.
Think of parameters as the model’s learned knowledge encoded as numbers.
How Are Parameters Used?
During inference (when you ask the model something), the model doesn’t “look up” an answer from memory. Instead, it repeatedly performs mathematical operations using its parameters.
Each transformer layer takes the output of the layer before it (or the input embeddings, for the first layer) and processes it roughly like this:
Layer Input
│
▼
Multi-Head Attention
│
▼
Feed Forward Network
│
▼
Output
Almost every computation inside these blocks involves multiplying the input vectors by matrices containing the model’s learned parameters.
For example:
Output = Input × Weight Matrix + Bias
Here, the weight matrix consists entirely of learned parameters.
Every transformer layer performs these operations, meaning the model repeatedly reads and uses its parameters to generate the next token.
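To make the formula above concrete, here is a minimal sketch of a single linear layer in plain Python. The function names and the tiny 2×3 weight matrix are made up for illustration; real models do the same arithmetic with matrices holding millions of values and run it on a GPU.

```python
def linear(x, weight, bias):
    """Compute Output = Input x Weight Matrix + Bias for one input vector.

    x:      list of d_in floats
    weight: d_in rows, each a list of d_out floats (learned parameters)
    bias:   list of d_out floats (learned parameters)
    """
    y = list(bias)  # start from the bias, then add each weighted input
    for i, x_i in enumerate(x):
        for j in range(len(y)):
            y[j] += x_i * weight[i][j]
    return y


def count_parameters(weight, bias):
    """Every entry of the weight matrix and the bias is one parameter."""
    return len(weight) * len(weight[0]) + len(bias)


# Toy values chosen for illustration only.
x = [1.0, 2.0]
weight = [[0.5, -1.0, 0.0],
          [0.25, 0.0, 2.0]]
bias = [0.5, 0.0, -1.0]

y = linear(x, weight, bias)
print(y)                               # [1.5, -1.0, 3.0]
print(count_parameters(weight, bias))  # 9
```

Note that producing the output touches all nine parameters once. That is the pattern the rest of this article builds on: using a layer means reading every weight in it.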
Dense Models
Most traditional language models are dense models.
In a dense model:
- Every transformer layer is executed.
- Every weight matrix is used.
- Nearly every parameter participates in generating every token.
Input
│
▼
Layer 1 (All Parameters)
│
▼
Layer 2 (All Parameters)
│
▼
Layer 3 (All Parameters)
│
▼
Output
If a model has 8 billion parameters, almost all 8 billion participate in every forward pass.
Examples:
| Model | Total Parameters | Active Parameters |
|---|---|---|
| Llama 3.1 8B | 8B | 8B |
| Llama 3.1 70B | 70B | 70B |
For dense models, to a good approximation:
Total Parameters ≈ Active Parameters
Why Does This Become Expensive?
As models grow larger, every generated token requires computation using every parameter.
For a 70B dense model:
- 70 billion parameters are read.
- Huge matrix multiplications are performed.
- GPU memory bandwidth often becomes the bottleneck, because the weights have to be streamed from memory at every decoding step.
- Latency increases.
Because the cost per token grows in proportion to the parameter count, simply making a dense model larger eventually becomes inefficient.
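A rough back-of-the-envelope sketch shows why reading the weights matters so much. It assumes that every weight is read once per generated token, that weights are stored in 16 bits (2 bytes), and that the hardware offers a hypothetical 2 TB/s of memory bandwidth. None of these figures come from a specific GPU. The sketch also ignores batching, the KV cache, and multi-GPU setups, so treat the result as a lower bound on latency, not a prediction.

```python
def min_seconds_per_token(num_params, bytes_per_param, bandwidth_bytes_per_s):
    """Lower bound on per-token latency if every weight is read once per token."""
    bytes_read = num_params * bytes_per_param
    return bytes_read / bandwidth_bytes_per_s


# Assumed values, for illustration only.
BYTES_PER_PARAM = 2   # 16-bit weights
BANDWIDTH = 2e12      # hypothetical 2 TB/s of memory bandwidth

for name, params in [("8B dense", 8 * 10**9), ("70B dense", 70 * 10**9)]:
    gigabytes = params * BYTES_PER_PARAM / 1e9
    seconds = min_seconds_per_token(params, BYTES_PER_PARAM, BANDWIDTH)
    print(f"{name}: {gigabytes:.0f} GB read per token, "
          f"at least {seconds * 1000:.0f} ms per token")
```

Under these assumptions the 70B model has to move about 140 GB of weights for each token, which caps it at roughly 70 ms per token, while the 8B model moves about 16 GB.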
Mixture of Experts (MoE)
Instead of using one enormous network for every token, MoE models replace parts of the network, typically the feed-forward blocks, with several parallel sub-networks called experts.
Instead of this:
Feed Forward Network
an MoE layer becomes:
                   Router
                      │
       ┌─────────┬────┴────┬─────────┐
       ▼         ▼         ▼         ▼
    Expert1   Expert2   Expert3   Expert4
       └─────────┴────┬────┴─────────┘
                      ▼
               Combine Output
The router decides which experts should process the current token.
Only the selected experts perform computation.
The others are skipped.
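The routing step can be sketched in a few lines of plain Python. This is a simplified illustration of one common scheme (score every expert, keep the top k, weight their outputs with a softmax over the kept scores), not the exact method of any particular model. The names `moe_forward`, `make_expert`, and `router_weights`, and all numeric values, are hypothetical. The toy experts simply scale their input and record that they were called, which makes it easy to see which ones ran.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, router_weights, experts, k=2):
    """Route one token vector x to its top-k experts and combine their outputs."""
    # Router: one score per expert (dot product of x with that expert's router row).
    scores = [sum(w * v for w, v in zip(row, x)) for row in router_weights]
    # Keep only the k highest-scoring experts.
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    gates = softmax([scores[i] for i in chosen])
    output = [0.0] * len(x)
    for gate, idx in zip(gates, chosen):
        expert_out = experts[idx](x)  # only the selected experts are executed
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output, chosen


calls = []  # records which experts actually ran


def make_expert(idx, scale):
    """Toy expert: scales its input and logs that it was called."""
    def expert(x):
        calls.append(idx)
        return [scale * v for v in x]
    return expert


experts = [make_expert(i, float(i + 1)) for i in range(4)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5]]
x = [2.0, 1.0]

output, chosen = moe_forward(x, router_weights, experts, k=2)
print(chosen)         # [0, 3]
print(sorted(calls))  # [0, 3]
```

Experts 1 and 2 are never called for this token, so their weights are never read. A different input vector would produce different router scores and could select a different pair.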
Total Parameters
Total parameters refer to every parameter stored inside the model.
Example:
64 Experts
10B parameters each
Total parameters:
64 × 10B = 640B
All of these parameters exist in memory.
Even if most experts aren’t used for a particular token, they still contribute to the model’s total parameter count.
Active Parameters
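Active parameters are the parameters actually used to process a single token during one forward pass. Different tokens can be routed to different experts, so the count is stated per token.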
Suppose the router selects only:
Expert 8
Expert 21
Each expert contains:
10B parameters
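Then only:
10B + 10B = 20B
expert parameters participate in computation.
All other experts remain idle for that token.
This simplified example counts expert weights only. In a real MoE model, the attention layers, embeddings, and router are shared by every token, so they add to both the total and the active count.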
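The arithmetic from the last two sections fits in one small helper. This is a sketch of the simplified counting used in this article. The function name and the `shared_params` argument (weights such as attention and embeddings that every token uses) are assumptions added for illustration, and the 5B shared figure in the second call is made up.

```python
def moe_parameter_counts(num_experts, params_per_expert,
                         experts_per_token, shared_params=0):
    """Return (total, active) parameter counts for a simplified MoE model."""
    total = shared_params + num_experts * params_per_expert
    active = shared_params + experts_per_token * params_per_expert
    return total, active


B = 10**9  # one billion

# The example from the text: 64 experts of 10B each, 2 selected per token.
total, active = moe_parameter_counts(64, 10 * B, 2)
print(total // B, active // B)  # 640 20

# Same setup with a hypothetical 5B of shared (always-active) weights.
total_shared, active_shared = moe_parameter_counts(64, 10 * B, 2, shared_params=5 * B)
print(total_shared // B, active_shared // B)  # 645 25
```

Shared weights raise both numbers by the same amount, which is why the active count of a real MoE model is larger than the selected experts alone would suggest.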
Visual Comparison
Dense Model
Stored Parameters
████████████
Used Parameters
████████████
Every parameter is used.
MoE Model
Stored Parameters
████████████████████████████
Used Parameters
██
Most parameters stay inactive during that forward pass.
Why Active Parameters Matter
During inference, the expensive part isn’t how many parameters exist.
The expensive part is how many parameters actually take part in the computation for each token.
Every active parameter requires:
- Memory access
- Matrix multiplication
- GPU computation
- Power consumption
Inactive experts don’t perform any computation.
This is why inference cost is much closer to the active parameter count than the total parameter count.
Memory vs Compute
This distinction is important.
Memory
The model must still store all parameters.
For example:
671B total parameters
All 671B parameters need to exist in GPU memory, CPU memory, or be distributed across multiple machines.
Memory usage depends primarily on total parameters.
Compute
During inference:
671B total
↓
Router selects
↓
37B active
For any given token, only about 37B parameters participate in the matrix multiplications.
Compute depends primarily on active parameters.
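Both relationships can be estimated with two one-line functions. The sketch below assumes that weight memory is simply parameter count times bytes per parameter, and it uses the common rule of thumb of about 2 floating-point operations per active parameter per token (one multiply and one add). Both are approximations: real deployments also need memory for activations and the KV cache, and attention over long contexts adds compute. The function names are illustrative.

```python
def weight_memory_gb(total_params, bytes_per_param):
    """Memory needed just to hold the weights, in gigabytes (10^9 bytes)."""
    return total_params * bytes_per_param / 1e9


def forward_flops_per_token(active_params):
    """Rule of thumb: roughly 2 FLOPs per active parameter per token."""
    return 2 * active_params


B = 10**9
TOTAL, ACTIVE = 671 * B, 37 * B  # figures quoted in this article

print(weight_memory_gb(TOTAL, 2))             # 1342.0 GB with 16-bit weights
print(weight_memory_gb(TOTAL, 1))             # 671.0 GB with 8-bit weights
print(forward_flops_per_token(ACTIVE) / 1e9)  # 74.0 GFLOPs per token
print(forward_flops_per_token(TOTAL) / 1e9)   # 1342.0 GFLOPs for a dense model of equal size
```

Memory follows the 671B figure, while per-token compute follows the 37B figure. Under this approximation, that is roughly 18 times less compute than a dense model of the same total size would need.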
Real Examples
| Model | Architecture | Total Parameters | Active Parameters |
|---|---|---|---|
| Llama 3.1 8B | Dense | 8B | 8B |
| Llama 3.1 70B | Dense | 70B | 70B |
| Mixtral 8x7B | MoE | ~47B | ~13B |
| DeepSeek-V3 | MoE | ~671B | ~37B |
Notice that MoE models have significantly more stored parameters than active parameters.
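To put a number on that gap, the short sketch below computes the share of parameters that is active per token, using the approximate figures from the table (in billions). The helper name `active_fraction` is made up for this example.

```python
def active_fraction(total_billion, active_billion):
    """Share of a model's parameters that is used for each token."""
    return active_billion / total_billion


# (total, active) in billions of parameters, taken from the table above.
models = {
    "Llama 3.1 8B": (8, 8),
    "Llama 3.1 70B": (70, 70),
    "Mixtral 8x7B": (47, 13),
    "DeepSeek-V3": (671, 37),
}

for name, (total, active) in models.items():
    print(f"{name}: {active_fraction(total, active):.1%} active per token")
```

With these figures the dense models come out at 100%, Mixtral 8x7B at about 27.7%, and DeepSeek-V3 at about 5.5%.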
Why Build MoE Models?
MoE models provide two major advantages:
- They increase the model’s overall capacity by adding more experts.
- They avoid computing every parameter for every token.
This allows models to scale into hundreds of billions of parameters without requiring the same compute cost as an equivalently sized dense model.
The trade-off is added routing complexity and a larger memory footprint than a dense model with the same per-token compute.
Summary
- Parameters are learned numerical weights inside a neural network.
- During inference, parameters are repeatedly used in matrix multiplications to transform the input into the next token.
- In dense models, nearly every parameter participates in every forward pass.
- In Mixture of Experts (MoE) models, a router activates only a small subset of experts for each token.
- Total parameters represent everything stored inside the model.
- Active parameters represent only the parameters used to process a token during one forward pass.
- Memory requirements depend mainly on total parameters.
- Inference compute depends mainly on active parameters.
This is why a model with 671B total parameters and 37B active parameters can provide the capacity of an extremely large model while requiring computation closer to that of a much smaller dense model.