From Switch Transformers to DeepSeek-V2 — how conditional computation reshapes large-scale AI

Over the past five years, model scaling has followed a predictable recipe — double the parameters, wait for better results. But dense scaling quickly hits physical and financial limits.
The AI industry's scaling obsession began with a simple observation: larger models generally performed better. GPT-3's 175 billion parameters crushed benchmarks that GPT-2's 1.5 billion couldn't touch. This led to an arms race where "more is better" became gospel.
But this approach has fundamental problems:
A 70-billion-parameter dense model requires 140GB of memory just to store the weights (FP16). Training requires 4-8× that for gradients and optimizer states. A single training run can cost millions in compute.
For enterprises, inference cost scales with model size—every user query activates all 70B parameters, even when the task needs only a fraction of that capacity.
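A rough back-of-the-envelope check makes these numbers concrete. This is a minimal sketch assuming FP16 weights and a typical mixed-precision Adam setup; the exact multiplier depends on the optimizer and precision recipe.

```python
# Rough memory arithmetic for a 70B-parameter dense model.
# Assumptions (illustrative): FP16 weights (2 B/param) for inference;
# training adds FP16 grads + FP32 master weights + two FP32 Adam moments (~16 B/param total).
params = 70e9

inference_gb = params * 2 / 1e9
training_gb = params * (2 + 2 + 4 + 4 + 4) / 1e9

print(f"Inference weights: ~{inference_gb:.0f} GB")                    # ~140 GB
print(f"Training state (excl. activations): ~{training_gb:.0f} GB")    # ~1120 GB, ~8x inference
```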
Research shows that dense models exhibit massive redundancy. Studies using pruning and distillation reveal that 30-40% of parameters can be removed with minimal performance loss.
This suggests most parameters aren't critical for most inputs—they're just along for the ride, consuming energy and memory.
Training a large language model can emit as much CO₂ as five cars do over their lifetimes. The carbon footprint of AI is becoming a regulatory and reputational risk for enterprises.
Different tasks require different capabilities. A model analyzing financial statements doesn't need the same neural pathways as one generating creative fiction.
Yet dense models train every parameter on every example, forcing a one-size-fits-all approach that's inherently inefficient.
Most inputs only need a small subset of model parameters to reason effectively. A movie-review sentiment classifier doesn't need the same neurons that decode protein structures. A financial risk query doesn't need creative writing capabilities.
This observation—that conditional computation could match dense performance at a fraction of the cost—gave rise to Mixture-of-Experts (MoE) architectures.
The MoE concept isn't new. Early work by Jacobs et al. (1991) and Jordan & Jacobs (1994) explored modular networks with gating mechanisms. But these systems were limited to small-scale problems.
Google's 2017 paper "Outrageously Large Neural Networks" by Shazeer et al. revived MoE at scale, showing that sparsely gated expert layers (applied to LSTM language models at the time) could scale to billions of parameters. However, training instability and load imbalance limited adoption.
The breakthrough came in 2021-2024 with Switch Transformers, GLaM, and DeepSeek-V2, which solved these challenges through careful engineering and algorithmic innovations.
Instead of running all layers for every token, MoE introduces experts (sub-modules) and a router that dynamically decides which experts to activate based on input characteristics.
The genius of MoE lies in its simplicity: treat the model as a collection of specialists, each excellent at different subtasks, then route inputs to the most relevant specialists.
Let's denote the N experts as E₁ … E_N and the gating (router) function as g. The MoE layer output for an input token x is

y = Σᵢ gᵢ(x) · Eᵢ(x)

where gᵢ(x) ≈ 0 for most experts (sparse gating).

In practice, we use top-k gating: keep the k largest gate values, renormalize them to sum to 1, and set the rest to zero:

g(x) = softmax(top-k(xWg))

If only k experts are activated per token, the expert-layer compute shrinks in proportion to k/N:

active cost ≈ (k / N) · full cost

This means a model with 128 experts and k = 2 does only 1/64th of the expert-layer computation it would need if all experts were activated.
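As a minimal sketch of top-k gating (tensor names here are illustrative, not taken from any specific paper):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_experts, k = 128, 2
x = torch.randn(4, 512)                      # 4 tokens, d_model = 512
w_gate = torch.randn(512, num_experts)       # router weight matrix W_g

gate_probs = F.softmax(x @ w_gate, dim=-1)   # [4, 128] routing probabilities
topk_probs, topk_idx = gate_probs.topk(k, dim=-1)
topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)   # renormalize to sum to 1

print(topk_idx)      # the 2 experts selected for each token
print(topk_probs)    # their mixing weights
# Only 2 of 128 expert FFNs would run per token, so expert-layer FLOPs drop by 128/2 = 64x.
```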
Imagine your company has 100 specialists (doctors, lawyers, engineers, designers). When a problem arises, you don't gather all 100 people in a room—that's expensive and slow.
Instead, you have a router (project manager) who reads the problem and calls in just 2-3 relevant specialists. The meeting is faster, cheaper, and often more effective because each specialist brings focused expertise.
MoE does exactly this for neural networks: route each input to the most relevant specialists, keeping the rest idle.
Hence, a model can hold billions of parameters but run like a much smaller one. This is the essence of sparse activation—large capacity, small cost.
A production MoE layer is more sophisticated than the basic formulation suggests. Understanding each component is crucial for implementation:
| Component | Function |
|---|---|
| Router / Gating Network | Chooses top-k experts for each token |
| Experts | Independent FFNs or modules, often replicated per layer |
| Load Balancer | Ensures tokens are distributed fairly (avoids "hot" experts) |
| Sparse Dispatch / Combine | Efficiently route inputs to selected experts and aggregate outputs |
In Transformers, the FFN accounts for roughly two-thirds of each layer's parameters in standard architectures (d_ff = 4·d_model). This makes it the ideal target for sparsification.
Attention layers, by contrast, are kept dense because they're relatively small and handle critical cross-token dependencies that shouldn't be sparse.
Thus, MoE typically replaces each FFN with N expert FFNs, multiplying total capacity by N while keeping active cost constant.
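A quick parameter count makes both claims concrete. This is a sketch assuming a standard Transformer layer with d_ff = 4·d_model, ignoring embeddings, biases, and norms; the expert count is a hypothetical example.

```python
d_model, d_ff = 4096, 16384              # standard ratio: d_ff = 4 * d_model
num_experts = 8                          # hypothetical expert count

attn_params = 4 * d_model * d_model      # W_Q, W_K, W_V, W_O projections
ffn_params = 2 * d_model * d_ff          # up- and down-projection

print(ffn_params / (attn_params + ffn_params))   # ~0.67 -> FFN is ~2/3 of the layer

# Swap the single FFN for N expert FFNs (Switch-style, k = 1):
moe_total_params = num_experts * ffn_params      # stored capacity grows 8x
moe_active_params = 1 * ffn_params               # per-token cost stays at one FFN
print(moe_total_params / ffn_params, moe_active_params / ffn_params)   # 8.0, 1.0
```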
Let's examine three pivotal MoE architectures that shaped modern sparse models, understanding not just what they did, but why their design choices mattered.
The Switch Transformer, introduced by Fedus et al., made MoE practical through radical simplification: one expert per token (k = 1).
Previous MoE systems used k=2 or k=4, believing that blending multiple experts was necessary for quality. Switch challenged this assumption.
By using k=1, Switch eliminated expensive weighted combinations and simplified routing to a single argmax operation, making training and inference significantly faster.
The router picks the top-1 expert via softmax gating:

g(x) = softmax(xWg),   expert(x) = argmaxᵢ gᵢ(x)

During training, a load-balancing loss encourages uniform utilization:

L_aux = N · Σᵢ fᵢ · pᵢ

where fᵢ = fraction of tokens routed to expert i, and pᵢ = average routing probability assigned to expert i. The loss is minimized when both are uniform at 1/N.
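A minimal sketch of this auxiliary loss, following the fᵢ and pᵢ definitions above (the function and variable names are mine, not the paper's):

```python
import torch

def switch_load_balancing_loss(gate_probs: torch.Tensor, expert_index: torch.Tensor) -> torch.Tensor:
    """gate_probs: [tokens, num_experts] softmax outputs of the router.
    expert_index: [tokens] index of the top-1 expert chosen for each token."""
    num_experts = gate_probs.shape[-1]
    # f_i: fraction of tokens dispatched to expert i
    ones = torch.ones_like(expert_index, dtype=gate_probs.dtype)
    f = torch.zeros(num_experts, dtype=gate_probs.dtype,
                    device=gate_probs.device).scatter_add_(0, expert_index, ones)
    f = f / expert_index.numel()
    # p_i: mean routing probability assigned to expert i
    p = gate_probs.mean(dim=0)
    # Minimized (value ~1.0) when both f and p are uniform at 1/N; grows toward N on collapse
    return num_experts * torch.sum(f * p)

# Example: 6 tokens, 4 experts
probs = torch.softmax(torch.randn(6, 4), dim=-1)
print(switch_load_balancing_loss(probs, probs.argmax(dim=-1)))
```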
Advantages:
- Routing reduces to a single argmax per token, with no weighted combination of expert outputs, so dispatch is cheap and simple.
- The lowest possible active compute and communication per token: only one expert FFN runs.

Limitations:
- With no second expert to compensate, a wrong routing decision hits output quality directly.
- Load imbalance is more severe, so the auxiliary loss and expert-capacity limits become essential.
Switch introduced expert capacity—a maximum number of tokens each expert can process per batch. When an expert reaches capacity, overflow tokens are passed through residually.
Typical capacity_factor = 1.25, allowing 25% headroom for load imbalance. This prevents memory explosions and ensures deterministic performance.
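A sketch of how capacity clipping might look; the function name and overflow policy shown here are illustrative (Switch passes overflow tokens straight through the residual connection rather than processing them in an expert):

```python
import torch

def keep_mask_with_capacity(expert_index: torch.Tensor, num_experts: int,
                            capacity_factor: float = 1.25) -> torch.Tensor:
    """Return a boolean mask of tokens each expert actually processes.
    Tokens beyond an expert's capacity are dropped from the expert and
    simply flow through the residual path instead."""
    num_tokens = expert_index.numel()
    capacity = int(capacity_factor * num_tokens / num_experts)

    keep = torch.zeros_like(expert_index, dtype=torch.bool)
    for e in range(num_experts):
        positions = (expert_index == e).nonzero(as_tuple=True)[0]
        keep[positions[:capacity]] = True        # first `capacity` tokens survive
    return keep

# Example: 16 tokens routed among 4 experts -> capacity = int(1.25 * 16 / 4) = 5
idx = torch.randint(0, 4, (16,))
print(keep_mask_with_capacity(idx, num_experts=4))
```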
GLaM extended Switch with top-2 experts per token and introduced critical innovations for production deployment.
Research showed that k=1 occasionally made catastrophic routing errors. By using k=2, GLaM provides a safety net: if the top expert is wrong, the second can compensate.
The weighted combination also allows soft specialization—experts can partially activate based on input ambiguity.
The output becomes a weighted sum of the two selected experts:

y = g₁(x) · E₁(x) + g₂(x) · E₂(x)

with the two gate values renormalized to sum to 1.

GLaM's load-balancing loss takes the same form, scaled by a coefficient:

L_aux = α · N · Σᵢ fᵢ · pᵢ

where α is a tunable coefficient (typically 0.01).
GLaM pioneered expert parallelism—distributing experts across GPUs where each device hosts a subset of experts.
During the forward pass, each token's hidden state is dispatched (via all-to-all communication) to the devices hosting its selected experts, processed there, and sent back to its original device, where the expert outputs are combined.
This scales to thousands of experts across hundreds of GPUs without duplicating expert weights.
GLaM demonstrated that MoE could match or exceed dense models on real-world tasks: the full 1.2T-parameter GLaM reported better average zero-shot and one-shot performance than GPT-3 while using roughly one-third of GPT-3's training energy and about half the inference FLOPs.
DeepSeek-V2 represents the current state-of-the-art, introducing fine-grained routing and hierarchical expert organization.
Unlike previous models, in which all experts are homogeneous members of a single flat pool, DeepSeek organizes its experts into specialized groups.
This allows for more targeted specialization than random expert assignment.
DeepSeek uses a two-level routing system:

```
Level 1: Task Router
   └─ selects an expert group (e.g., "Code")
Level 2: Expert Router
   └─ selects specific experts within that group
      (e.g., "Python Expert", "Systems Expert")
```
This reduces routing complexity from O(N) to O(√N) and improves specialization accuracy.
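A minimal sketch of two-level routing in this spirit; the group/expert structure and all names here are illustrative, not DeepSeek-V2's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalRouter(nn.Module):
    """Route each token first to one expert group, then to k experts inside it."""
    def __init__(self, d_model=512, num_groups=4, experts_per_group=8, k=2):
        super().__init__()
        self.k = k
        self.group_gate = nn.Linear(d_model, num_groups)            # level 1: pick a group
        self.expert_gates = nn.ModuleList(                          # level 2: one gate per group
            [nn.Linear(d_model, experts_per_group) for _ in range(num_groups)]
        )

    def forward(self, x):                                           # x: [tokens, d_model]
        group = self.group_gate(x).argmax(dim=-1)                   # [tokens]
        probs_list, idx_list = [], []
        for t in range(x.shape[0]):                                 # per-token loop for clarity
            logits = self.expert_gates[int(group[t])](x[t])
            p, i = F.softmax(logits, dim=-1).topk(self.k)
            probs_list.append(p / p.sum())                          # renormalize top-k weights
            idx_list.append(i)
        return group, torch.stack(idx_list), torch.stack(probs_list)

router = HierarchicalRouter()
group, experts, weights = router(torch.randn(6, 512))
print(group, experts, weights, sep="\n")
```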
A further technique periodically merges expert gradients so that knowledge learned by one expert is shared with the rest of the pool.
Goal: Build a working PyTorch implementation to understand routing, load balancing, and expert dispatch in practice.
We'll implement a single MoE layer with a learned router, top-k dispatch to per-expert FFNs, and the auxiliary load-balancing loss:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoE(nn.Module):
    """
    Minimal Mixture-of-Experts layer with top-k routing and load balancing.

    Args:
        d_model: Hidden dimension size
        num_experts: Total number of expert networks
        expert_dim: Expert hidden layer size (typically 4× d_model)
        k: Number of experts to activate per token
        dropout: Dropout probability
    """
    def __init__(
        self,
        d_model=512,
        num_experts=8,
        expert_dim=2048,
        k=2,
        dropout=0.1
    ):
        super().__init__()
        self.num_experts = num_experts
        self.k = k
        self.d_model = d_model

        # Create N expert networks (each is a 2-layer FFN)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, expert_dim),
                nn.ReLU(),
                nn.Dropout(dropout),
                nn.Linear(expert_dim, d_model)
            ) for _ in range(num_experts)
        ])

        # Gating network: learns to route tokens to experts
        self.gate = nn.Linear(d_model, num_experts)

        # For tracking expert usage (load balancing)
        self.register_buffer('expert_counts', torch.zeros(num_experts))

    def forward(self, x, return_load_loss=True):
        """
        Args:
            x: Input tensor [batch_size, seq_len, d_model]
            return_load_loss: Whether to compute load balancing loss

        Returns:
            output: MoE layer output [batch_size, seq_len, d_model]
            load_loss: Load balancing auxiliary loss (if return_load_loss=True)
        """
        batch_size, seq_len, d_model = x.shape

        # Flatten to [batch * seq, d_model] for routing
        x_flat = x.view(-1, d_model)

        # Compute routing scores for each expert
        gate_logits = self.gate(x_flat)               # [batch*seq, num_experts]
        gate_scores = F.softmax(gate_logits, dim=-1)

        # Select top-k experts per token
        topk_scores, topk_indices = torch.topk(
            gate_scores, self.k, dim=-1
        )  # Both: [batch*seq, k]

        # Normalize top-k scores to sum to 1
        topk_scores = topk_scores / topk_scores.sum(dim=-1, keepdim=True)

        # Initialize output
        output = torch.zeros_like(x_flat)

        # Dispatch tokens to selected experts
        for i in range(self.k):
            expert_idx = topk_indices[:, i]                    # [batch*seq]
            expert_weight = topk_scores[:, i].unsqueeze(-1)    # [batch*seq, 1]

            # Process each expert separately (inefficient but clear)
            for expert_id in range(self.num_experts):
                mask = (expert_idx == expert_id)
                if mask.any():
                    # Route tokens to this expert
                    expert_input = x_flat[mask]
                    expert_output = self.experts[expert_id](expert_input)

                    # Add weighted output back
                    output[mask] += expert_output * expert_weight[mask]

                    # Track expert usage
                    self.expert_counts[expert_id] += mask.sum()

        # Reshape back to [batch, seq, d_model]
        output = output.view(batch_size, seq_len, d_model)

        # Compute load balancing loss
        if return_load_loss:
            # f_i: fraction of tokens assigned to expert i
            f = torch.zeros(self.num_experts, device=x.device)
            for i in range(self.num_experts):
                f[i] = (topk_indices == i).float().sum()
            f = f / (batch_size * seq_len * self.k)

            # p_i: average routing probability to expert i
            p = gate_scores.mean(dim=0)

            # Auxiliary loss: encourages f_i ≈ p_i ≈ 1/N
            load_loss = self.num_experts * (f * p).sum()

            return output, load_loss

        return output

# Example usage
if __name__ == "__main__":
    # Create a mini-MoE layer
    moe = SimpleMoE(
        d_model=512,
        num_experts=8,
        expert_dim=2048,
        k=2
    )

    # Random input: batch=4, seq=10, dim=512
    x = torch.randn(4, 10, 512)

    # Forward pass
    output, load_loss = moe(x, return_load_loss=True)

    print(f"Input shape: {x.shape}")
    print(f"Output shape: {output.shape}")
    print(f"Load balancing loss: {load_loss.item():.4f}")
    print(f"Expert usage: {moe.expert_counts}")
```

This toy version demonstrates the core concepts. In practice, frameworks like DeepSpeed-MoE, Fairseq-MoE, and Megatron-LM implement optimized all-to-all communication for dispatching tokens across GPUs with minimal overhead.
Without constraints, neural networks exhibit a troubling behavior: expert collapse—a few experts dominate while others never train. This is MoE's biggest training challenge.
Early in training, random initialization causes some experts to perform slightly better. The router learns to prefer them. They get more training signal, improving further. Other experts get fewer tokens, weaker gradients, and fall behind permanently.
In extreme cases, one or two experts end up handling 80% of the tokens while most of the remaining experts sit idle—wasting capacity and compute.
This is analogous to the "rich get richer" phenomenon in economics, also known as preferential attachment in network theory.
Common balancing techniques prevent this collapse:
| Method | Idea | Equation / Mechanism |
|---|---|---|
| Auxiliary Loss | Penalize uneven traffic | L_aux = α · N · Σᵢ fᵢ · pᵢ |
| Noise Jitter | Adds randomness to gate logits (sketched below) | g(x) = softmax(xWg + ϵ) |
| Token Drop | Skip overflow tokens to cap load | Ensures deterministic batch size |
| Capacity Factor | Max tokens per expert | capacity = capacity_factor · tokens / num_experts |
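As a concrete example of the noise-jitter row above, here is a minimal sketch; the function name and fixed noise scale are illustrative (the original noisy top-k gating of Shazeer et al. learns a per-expert noise scale instead):

```python
import torch
import torch.nn.functional as F

def noisy_gate(x, w_gate, noise_std=1.0, training=True):
    """Noise-jittered gating: g(x) = softmax(x @ W_g + eps), eps ~ N(0, noise_std^2).
    The jitter encourages exploration so that cold experts still receive some tokens."""
    logits = x @ w_gate
    if training:
        logits = logits + noise_std * torch.randn_like(logits)
    return F.softmax(logits, dim=-1)

gates = noisy_gate(torch.randn(4, 512), torch.randn(512, 8))
print(gates.argmax(dim=-1))   # chosen experts vary more between runs than without noise
```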
Routing decides which sub-model to fire per query type, keeping inference cost constant even as the system's knowledge base expands.
In financial risk management, separate experts can cover credit, market, operational, and climate risk, while a shared embedding and router allow dynamic switching across domains.
In pharmaceutical R&D, experts can specialize in molecule synthesis, toxicology, trial protocols, and post-market surveillance, with the router learning to send each prompt to the appropriate scientific expert automatically.
In DataXpert, a hierarchical MoE router orchestrates: a Data Science Expert (for numerical analysis), a Programming Expert (for code synthesis), and a Business Expert (for KPI explanation) — forming an Agentic MoE system that mimics real consulting workflows.
| Model | Total Params | Active Params | Speed-up vs Dense | Paper |
|---|---|---|---|---|
| Switch Transformer | 1.6T | 10B | 4× | Google, 2021 |
| GLaM | 1.2T | 97B | 12× | Google, 2022 |
| DeepSeek-V2 | 236B | 21B | 10× | DeepSeek, 2024 |
These results show that conditional computation beats brute force — unlocking trillion-parameter-scale capacity at sub-100-billion-parameter active cost.
```
            ┌────────────────┐
Input ──►   │ Shared Encoder │
            └───────┬────────┘
                    ↓
            ┌────────────┐
            │ Router NN  │
            └──────┬─────┘
                   │ Top-k
       ┌───────────┴──────────────┐
       │ Expert-1   Expert-2  ... │  ← each trained on sub-domain data
       └───────────┬──────────────┘
                   ↓
            Aggregate + FFN
                   ↓
            Output / Logits
```

At runtime, the router picks a few experts per token — often dispatched across GPUs via AllToAll communication primitives.
| Dimension | Dense | Mixture-of-Experts |
|---|---|---|
| Parameters active per token | All | k ≪ N |
| Compute efficiency | Low | High |
| Training stability | Stable | Requires careful balancing |
| Memory footprint | Scales with parameter count | Weights scale with all N experts; per-token activation memory scales with k |
| Inference cost vs. total parameters | Linear | Sub-linear (only k experts run per token) |
| Interpretability | Uniform | Experts offer explainable modularity |
Training MoE models requires specialized techniques beyond standard Transformer training:
Since only k experts receive gradients per token, training can be unstable. Use larger batch sizes and gradient accumulation to ensure all experts receive sufficient training signal.
Randomly dropping experts during training forces the model to learn redundancy and prevents over-reliance on specific experts.
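A sketch of what this might look like at the router level; the mechanism and names here are illustrative, not a specific paper's recipe. Randomly masking a subset of expert logits forces the router to learn usable alternatives.

```python
import torch
import torch.nn.functional as F

def gate_with_expert_dropout(gate_logits, drop_prob=0.1, training=True):
    """Randomly disable experts for this step by setting their logits to -inf,
    forcing the router to spread competence across the remaining experts."""
    if training and drop_prob > 0:
        num_experts = gate_logits.shape[-1]
        alive = torch.rand(num_experts, device=gate_logits.device) > drop_prob
        if not alive.any():                         # never drop every expert
            alive[torch.randint(num_experts, (1,))] = True
        gate_logits = gate_logits.masked_fill(~alive, float("-inf"))
    return F.softmax(gate_logits, dim=-1)

probs = gate_with_expert_dropout(torch.randn(4, 8))
print(probs)   # dropped experts get exactly zero probability this step
```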
Emerging research trends:

- Coarse task selector → fine-grained expert routing (DeepSeek-V2)
- Text, image, and code unified in one MoE
- Dynamic expert spawning for new domains
- Multiple autonomous LLM agents specialized by role — precisely the paradigm Finarb is building for its multi-agent data-analytics systems
| Metric | Dense Model | MoE Model |
|---|---|---|
| Training compute | 100% baseline | 25–30% |
| Inference latency | Grows with model size | Roughly constant (only k experts active) |
| Energy cost | High | Reduced |
| Scalability | Limited by GPU RAM | Horizontally scalable across experts |
| Domain adaptation | Full retrain | Add expert module only |
MoE fundamentally shifts the economics of AI — enabling enterprises to own large, modular AI systems that scale capacity without scaling cost.
MoE formalizes conditional computation — selectively using parts of a massive network — analogous to how human brains recruit specialized cortical regions per task.
Mathematically, the active compute per token is a fixed fraction of the full model's compute:

active compute ≈ p · total compute,   where p = k/N.
Thus, you can increase N arbitrarily while keeping compute fixed by reducing p — the essence of scaling "horizontally" instead of "vertically."
Mixture-of-Experts architectures mark a paradigm shift: capacity can now grow independently of per-token cost.
For enterprises, that means AI systems that grow without growing costs — experts that specialize by function, department, or domain, much like a real organization.
At Finarb, this principle already powers our internal multi-agent products: KPIxpert with specialized expert modules for KPI optimization, and DataXpert with MoE-style orchestration of domain-specific data experts. Together, they exemplify how applied innovation meets scalable intelligence.