The Transformer architecture, introduced in "Attention Is All You Need" (Vaswani et al., 2017), has fundamentally changed how we approach sequence modeling. At its core lies the self-attention mechanism.
The Self-Attention Formula
The attention function maps a query and a set of key-value pairs to an output:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
Where:

- Q (Query): What we're looking for
- K (Key): What each position offers
- V (Value): The actual content
- d_k: The dimension of the keys
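The formula can be traced by hand with a small dependency-free sketch. The helper names (`matmul`, `softmax`, `attention`) and the toy matrices are illustrative assumptions, not taken from the paper or from any library.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]   # K^T
    scores = matmul(q, k_t)                # (n_q, n_k) raw dot products
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Toy input (assumed values): one query, two key-value pairs
q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[10.0, 0.0], [0.0, 10.0]]

out, weights = attention(q, k, v)
print(weights[0])  # the first weight is larger because the query aligns with the first key
print(out[0])      # a weighted average of the two value rows
```

Each row of `weights` sums to 1, so every output row is a convex combination of the value rows.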
Multi-Head Attention
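Rather than performing a single attention function, multi-head attention projects Q, K, and V into num_heads lower-dimensional subspaces (each of size d_k = d_model / num_heads), applies attention in each subspace in parallel, then concatenates the results and passes them through a final output projection: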
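```python
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        # Each head works in a d_model / num_heads slice, so the split must be exact.
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # per-head dimension
        # Learned projections for queries, keys, and values
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        # Output projection applied after the heads are concatenated
        self.W_o = nn.Linear(d_model, d_model)
```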
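One way the four projections might be combined in a forward pass is sketched below. It is written in plain Python rather than PyTorch so that it runs without dependencies. The function `multi_head_attention`, its helpers, the explicit weight matrices standing in for `W_q`, `W_k`, `W_v`, and `W_o` (biases omitted), and the identity weights in the usage lines are all assumptions made for illustration.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention for a single head."""
    d_k = len(k[0])
    scores = matmul(q, [list(col) for col in zip(*k)])
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Self-attention over x (n, d_model) with num_heads heads; weights are (d_model, d_model)."""
    d_model = len(x[0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_k = d_model // num_heads
    # Project once, then slice the feature axis into per-head chunks of width d_k.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    heads = []
    for h in range(num_heads):
        cols = slice(h * d_k, (h + 1) * d_k)
        heads.append(attention([r[cols] for r in q],
                               [r[cols] for r in k],
                               [r[cols] for r in v]))
    # Concatenate the head outputs position by position, then apply the output projection.
    concat = [[val for head in heads for val in head[i]] for i in range(len(x))]
    return matmul(concat, w_o)

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

# Usage with assumed toy values: 3 positions, d_model = 4, 2 heads, identity projections
x = [[1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0, 0.0]]
eye = identity(4)
y = multi_head_attention(x, eye, eye, eye, eye, num_heads=2)
print(len(y), len(y[0]))  # output keeps the input shape: 3 positions, 4 features
```

Slicing the projected matrices by column is equivalent to giving each head its own smaller projection, which is why a single `d_model × d_model` weight per role is enough.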
Why It Works
Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single head, averaging over all positions inhibits this.
The key insight is that attention provides a dynamic routing mechanism: unlike convolution or recurrence, where the pattern of connections between positions is fixed by the architecture, attention computes that pattern from the content itself.
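The contrast with a fixed receptive field can be made concrete: against the same set of keys, two different queries produce different attention weights. The sketch below is illustrative only; `attention_weights` and the toy vectors are assumptions.

```python
import math

def attention_weights(query, keys):
    """Softmax of scaled dot products between one query and every key."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Three positions with fixed keys (assumed toy values)
keys = [[4.0, 0.0], [0.0, 4.0], [2.0, 2.0]]

# Two different query contents
w_a = attention_weights([4.0, 0.0], keys)
w_b = attention_weights([0.0, 4.0], keys)

print(w_a.index(max(w_a)))  # 0: the first query routes mostly to position 0
print(w_b.index(max(w_b)))  # 1: the second query routes mostly to position 1
```

A convolution with a fixed kernel would mix the same neighbouring positions in both cases; here the mixing pattern follows the input.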
Complexity Analysis
| Layer Type | Complexity per Layer | Sequential Operations |
|---|---|---|
| Self-Attention | O(n²·d) | O(1) |
| Recurrence | O(n·d²) | O(n) |
| Convolution | O(k·n·d²) | O(1) |
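Here n is the sequence length, d is the representation dimension, and k is the convolution kernel size. When n is smaller than d, n²·d is smaller than n·d², so a self-attention layer costs less per layer than a recurrent layer, and it needs only O(1) sequential operations instead of O(n).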
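A quick calculation shows where the crossover sits. The functions below count only the dominant term from each table row and ignore constant factors; the function names and the values of `n`, `d`, and `k` are assumptions chosen for illustration.

```python
def self_attention_cost(n, d):
    """Dominant term for a self-attention layer: n^2 * d."""
    return n * n * d

def recurrent_cost(n, d):
    """Dominant term for a recurrent layer: n * d^2."""
    return n * d * d

def conv_cost(n, d, k):
    """Dominant term for a convolutional layer: k * n * d^2."""
    return k * n * d * d

d, k = 512, 3

# Short sequence (n < d): self-attention is cheaper than recurrence.
short = 64
print(self_attention_cost(short, d))  # 2097152
print(recurrent_cost(short, d))       # 16777216
print(conv_cost(short, d, k))         # 50331648

# Long sequence (n > d): the quadratic term in n dominates.
long = 2048
print(self_attention_cost(long, d))   # 2147483648
print(recurrent_cost(long, d))        # 536870912
```

The ratio of recurrent to self-attention cost is d / n, so the two are equal at n = d and self-attention becomes the more expensive layer beyond that point.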