When comparing Large Language Models (LLMs), you’ll often see specifications like:

Llama 3.1 8B – 8 billion parameters
DeepSeek-V3 – 671 billion total parameters, 37 billion active parameters
Mixtral 8x7B – 47 billion total parameters, ~13 billion active parameters

At first glance, these numbers can be confusing. If DeepSeek-V3 has 671B parameters, why does it use only 37B for each token? This article explains how parameters are used during inference and the difference between total and active parameters.

What Are Parameters?

A parameter is a learned numerical value (weight) inside a neural network.

During training, the model adjusts billions of these values so that it can recognize patterns, predict the next token, answer questions, write code, and perform reasoning tasks.

Think of parameters as the model’s learned knowledge encoded as numbers.

How Are Parameters Used?

During inference (when you ask the model something), the model doesn’t “look up” an answer from memory. Instead, it repeatedly performs mathematical operations using its parameters.

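Inside each transformer layer, the computation looks roughly like this (the first layer receives the token embeddings; later layers receive the previous layer’s output):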

Input Embeddings
       │
       ▼
 Multi-Head Attention
       │
       ▼
 Feed Forward Network
       │
       ▼
Output

Almost every computation inside these blocks involves multiplying the input vectors by matrices containing the model’s learned parameters.

For example:

Output = Input × Weight Matrix + Bias

Here, every entry of the weight matrix is a learned parameter, and so is every entry of the bias vector (many recent LLMs omit the bias term altogether).
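
To make the formula concrete, here is a minimal sketch of one such layer in plain Python. The function names and the tiny 3-by-2 weight matrix are made up for illustration; real layers use optimized tensor libraries and matrices with millions of entries.

```python
def linear(x, weight, bias):
    """Compute y = x @ weight + bias for one input vector, in plain Python."""
    out_features = len(bias)
    y = []
    for j in range(out_features):
        # Dot product of the input with column j of the weight matrix.
        total = sum(x[i] * weight[i][j] for i in range(len(x)))
        y.append(total + bias[j])
    return y


def count_parameters(weight, bias):
    """Every entry of the weight matrix and the bias vector is one parameter."""
    return sum(len(row) for row in weight) + len(bias)


# Hypothetical toy layer: 3 input features -> 2 output features.
weight = [[1.0, 0.0],
          [0.0, 1.0],
          [2.0, -1.0]]
bias = [0.5, -0.5]
x = [1.0, 2.0, 3.0]

y = linear(x, weight, bias)
n_params = count_parameters(weight, bias)
print(y, n_params)
```

This toy layer holds 8 parameters (6 weights plus 2 biases), and all 8 are read to produce the output.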

Every transformer layer performs these operations, meaning the model repeatedly reads and uses its parameters to generate the next token.

Dense Models

Most traditional language models are dense models.

In a dense model:

Every transformer layer is executed.
Every weight matrix is used.
Nearly every parameter participates in generating every token.

Input
   │
   ▼
Layer 1 (All Parameters)
   │
   ▼
Layer 2 (All Parameters)
   │
   ▼
Layer 3 (All Parameters)
   │
   ▼
Output

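If a model has 8 billion parameters, almost all 8 billion participate in every forward pass. (The main exception is the token embedding table, where only the rows for the input tokens are read.)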

Examples:

Model	Total Parameters	Active Parameters
Llama 3.1 8B	8B	8B
Llama 3.1 70B	70B	70B

For dense models:

Total Parameters ≈ Active Parameters
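
To see where a figure like "8B" comes from, the sketch below adds up the weight matrices of a Llama-style dense decoder. The function name is made up, and the dimensions passed in are assumptions, believed to match the published Llama 3.1 8B configuration; check them against the model’s own config before relying on them.

```python
def dense_transformer_params(vocab, d_model, n_layers, d_ff,
                             n_heads, n_kv_heads, tied_embeddings=False):
    """Rough parameter count for a Llama-style dense decoder (no biases)."""
    head_dim = d_model // n_heads
    kv_dim = n_kv_heads * head_dim
    # Attention: query and output projections are d_model x d_model;
    # key and value projections are smaller under grouped-query attention.
    attention = 2 * d_model * d_model + 2 * d_model * kv_dim
    # Gated feed-forward network: gate, up, and down projections.
    ffn = 3 * d_model * d_ff
    # Two normalization weight vectors per layer.
    norms = 2 * d_model
    per_layer = attention + ffn + norms
    embeddings = vocab * d_model
    output_head = 0 if tied_embeddings else vocab * d_model
    final_norm = d_model
    return n_layers * per_layer + embeddings + output_head + final_norm


# Assumed dimensions, believed to match the published Llama 3.1 8B config.
total = dense_transformer_params(
    vocab=128_256, d_model=4096, n_layers=32, d_ff=14_336,
    n_heads=32, n_kv_heads=8,
)
print(f"{total / 1e9:.2f}B parameters")
```

Under these assumptions the count comes to roughly 8.03B, and a dense model uses essentially all of it for every token.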

Why Does This Become Expensive?

In a dense model, every generated token requires computation with essentially every parameter, so the cost per token grows in step with model size.

For a 70B dense model:

Roughly 70 billion parameters are read from memory for each generated token.
Huge matrix multiplications are performed.
GPU memory bandwidth often becomes the bottleneck, especially at small batch sizes.
Latency per token increases.

Simply adding more parameters to a dense model makes every token proportionally more expensive.
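
A back-of-the-envelope estimate shows why memory bandwidth matters. The sketch below assumes the simplest case: one request at a time, every weight read once per generated token, and nothing else limiting speed. The function name and the 2 TB/s bandwidth figure are hypothetical, not measurements of any real accelerator.

```python
def min_seconds_per_token(n_params, bytes_per_param, bandwidth_bytes_per_s):
    """Lower bound on decode latency if every weight is read once per token."""
    weight_bytes = n_params * bytes_per_param
    return weight_bytes / bandwidth_bytes_per_s


# 70B dense model stored in 16-bit precision (2 bytes per parameter).
# The 2 TB/s bandwidth is a hypothetical accelerator, not a measurement.
latency = min_seconds_per_token(70e9, 2, 2e12)
print(f"at least {latency * 1000:.0f} ms per token, "
      f"at most {1 / latency:.1f} tokens/s")
```

With these assumed numbers, reading 140 GB of weights takes about 70 ms, which caps generation at around 14 tokens per second no matter how fast the arithmetic units are.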

Mixture of Experts (MoE)

Rather than running one enormous network for every token, MoE models split parts of the network, typically the feed-forward blocks, into multiple experts.

Instead of this:

Feed Forward Network

an MoE layer becomes:

          Router
             │
 ┌──────┬────┴────┬──────┐
 ▼      ▼         ▼      ▼
Expert1 Expert2 Expert3 Expert4
             │
         Combine Output

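The router, itself a small learned layer, scores the experts and decides which of them should process the current token.

Only the selected experts perform computation, and their outputs are combined, typically as a weighted sum.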

The others are skipped.
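
The routing step can be sketched in a few lines. This is a minimal illustration of one common variant (top-k selection with softmax weights), not how any particular model implements it: the function names, the four toy experts, and the router scores are all made up, and real routers differ in details such as how scores are normalized.

```python
import math


def softmax(scores):
    """Turn raw router scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_layer(x, router_scores, experts, top_k=2):
    """Run only the top_k highest-scoring experts and mix their outputs."""
    probs = softmax(router_scores)
    # Indices of the top_k experts, highest probability first.
    ranked = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    # Renormalize so the selected experts' weights sum to 1.
    norm = sum(probs[i] for i in chosen)
    output = 0.0
    for i in chosen:
        # Experts that were not chosen are never called.
        output += (probs[i] / norm) * experts[i](x)
    return output, chosen


# Four hypothetical "experts"; real experts are feed-forward networks.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
# In a real model the router scores come from a learned linear layer.
router_scores = [0.1, 2.0, -1.0, 2.0]

output, chosen = moe_layer(3.0, router_scores, experts, top_k=2)
print(chosen, output)
```

Here experts 1 and 3 receive the highest scores, so only those two functions run; experts 0 and 2 cost nothing for this token.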

Total Parameters

Total parameters refer to every parameter stored inside the model.

Example:

64 Experts

10B parameters each

Total parameters:

64 × 10B = 640B

All of these parameters exist in memory.

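Even if most experts aren’t used for a particular token, they still contribute to the model’s total parameter count. Real MoE models also contain shared parameters, such as attention layers and embeddings, that every token uses. This is why Mixtral 8x7B totals about 47B parameters rather than 8 × 7B = 56B.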

Active Parameters

Active parameters are the parameters actually used to process one token during a forward pass.

Suppose the router selects only:

Expert 8
Expert 21

Each expert contains:

10B parameters

Then only:

10B + 10B = 20B

parameters participate in computation.

Everything else remains idle for that token.
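
The arithmetic from the last two sections fits in one small helper. The function name is made up, and the optional shared_params argument is an added assumption: it stands for non-expert weights (such as attention layers) that every token uses, which the simplified example above leaves out.

```python
def moe_param_counts(n_experts, params_per_expert, experts_per_token,
                     shared_params=0):
    """Return (total, active-per-token) counts for a simplified MoE model."""
    total = shared_params + n_experts * params_per_expert
    active = shared_params + experts_per_token * params_per_expert
    return total, active


B = 1_000_000_000  # one billion

# The example above: 64 experts of 10B each, 2 selected per token.
total, active = moe_param_counts(64, 10 * B, 2)
print(f"total: {total // B}B, active: {active // B}B")

# Same model with a hypothetical 5B of shared (non-expert) parameters.
total_s, active_s = moe_param_counts(64, 10 * B, 2, shared_params=5 * B)
print(f"total: {total_s // B}B, active: {active_s // B}B")
```

Shared parameters count toward both numbers, so they raise the active count by the same amount as the total.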

Visual Comparison

Dense Model

Stored Parameters

████████████

Used Parameters

████████████

Every parameter is used.

MoE Model

Stored Parameters

████████████████████████████

Used Parameters

██

Most parameters stay inactive during that forward pass.

Why Active Parameters Matter

During inference, the compute cost per token isn’t set by how many parameters exist.

It is set by how many parameters actually take part in the computation.

Every active parameter requires:

A read from memory
Arithmetic in a matrix multiplication
Power for both

Inactive experts don’t perform any computation.

This is why per-token inference compute tracks the active parameter count much more closely than the total parameter count.

Memory vs Compute

This distinction is important.

Memory

The model must still store all parameters.

For example:

671B total parameters

All 671B parameters need to exist in GPU memory, CPU memory, or be distributed across multiple machines.

Memory usage depends primarily on total parameters.
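
As a rough sketch, weight memory is just the total parameter count times the bytes stored per parameter. The helper name and the precision table below are illustrative assumptions, and the estimate covers the weights only; activations and the key-value cache need additional memory.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(total_params, dtype):
    """Memory for the weights alone, in gigabytes (1 GB = 10**9 bytes)."""
    return total_params * BYTES_PER_PARAM[dtype] / 1e9


total_params = 671e9  # the 671B total from the example above
for dtype in ("bf16", "fp8", "int4"):
    print(f"{dtype}: about {weight_memory_gb(total_params, dtype):,.0f} GB")
```

Note that the active parameter count does not appear anywhere in this calculation: a 671B model needs about 1,342 GB at 16-bit precision whether it activates 37B parameters per token or all of them.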

Compute

During inference:

671B total

↓

Router selects

↓

37B active

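For each token, only about 37B parameters participate in matrix multiplications. Different tokens can select different experts, so across a batch of tokens many more experts are touched.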

Compute depends primarily on active parameters.
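
A common rule of thumb puts the forward-pass cost at about 2 floating-point operations (one multiply and one add) per active parameter per token. The sketch below applies that approximation; it ignores the extra attention cost that grows with context length, and the function name and the dense 671B comparison model are hypothetical.

```python
def forward_flops_per_token(active_params):
    """Rule of thumb: one multiply and one add per active weight."""
    return 2 * active_params


moe_flops = forward_flops_per_token(37_000_000_000)     # 37B active
dense_flops = forward_flops_per_token(671_000_000_000)  # hypothetical dense 671B
ratio = dense_flops / moe_flops

print(f"MoE:   {moe_flops / 1e9:.0f} GFLOPs per token")
print(f"Dense: {dense_flops / 1e9:.0f} GFLOPs per token")
print(f"The dense model would need about {ratio:.0f}x more compute per token")
```

Under this approximation, activating 37B of 671B parameters cuts per-token compute by a factor of about 18 compared with a dense model of the same total size.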

Real Examples

Model	Architecture	Total Parameters	Active Parameters
Llama 3.1 8B	Dense	8B	8B
Llama 3.1 70B	Dense	70B	70B
Mixtral 8x7B	MoE	~47B	~13B
DeepSeek-V3	MoE	~671B	~37B

Notice that MoE models have significantly more stored parameters than active parameters.
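
The gap is easier to compare as a fraction. This short sketch uses the approximate figures from the table above; the helper name is made up for illustration.

```python
def active_fraction(total_params_b, active_params_b):
    """Share of stored parameters used per token (inputs in billions)."""
    return active_params_b / total_params_b


# Approximate (total, active) counts in billions, copied from the table above.
models = {
    "Llama 3.1 8B": (8, 8),
    "Llama 3.1 70B": (70, 70),
    "Mixtral 8x7B": (47, 13),
    "DeepSeek-V3": (671, 37),
}

fractions = {name: active_fraction(t, a) for name, (t, a) in models.items()}
for name, fraction in fractions.items():
    print(f"{name:<14} {fraction:.1%} of parameters active per token")
```

By these figures, Mixtral 8x7B uses a little over a quarter of its parameters per token, and DeepSeek-V3 uses under 6%.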

Why Build MoE Models?

MoE models provide two major advantages:

They increase the model’s overall capacity by adding more experts.
They avoid computing every parameter for every token.

This allows models to scale into hundreds of billions of parameters without requiring the same compute cost as an equivalently sized dense model.

The trade-off is added routing complexity, including keeping the load balanced across experts, and memory requirements set by the total rather than the active parameter count.

Summary

Parameters are learned numerical weights inside a neural network.
During inference, parameters are repeatedly used in matrix multiplications to transform the input into a prediction of the next token.
In dense models, nearly every parameter participates in every forward pass.
In Mixture of Experts (MoE) models, a router activates only a small subset of experts for each token.
Total parameters represent everything stored inside the model.
Active parameters represent only the parameters used to process one token during a forward pass.
Memory requirements depend mainly on total parameters.
Inference compute depends mainly on active parameters.

This is why a model with 671B total parameters and 37B active parameters can provide the capacity of an extremely large model while requiring computation closer to that of a much smaller dense model.