Understanding Transformer Attention Mechanisms

The Transformer architecture, introduced in "Attention Is All You Need" (Vaswani et al., 2017), has fundamentally changed how we approach sequence modeling. At its core lies the self-attention mechanism.

The Self-Attention Formula

The attention function maps a query and a set of key-value pairs to an output:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

Where: - Q (Query): What we're looking for - K (Key): What each position offers - V (Value): The actual content - d_k: The dimension of the keys

Multi-Head Attention

Rather than performing a single attention function, we project Q, K, V into multiple subspaces:

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

Why It Works

Self-attention allows the model to attend to information from different representation subspaces at different positions.

The key insight is that attention provides a dynamic routing mechanism — unlike convolutions or recurrence, the receptive field is determined by the content itself, not by architecture.

Complexity Analysis

Operation	Time Complexity	Sequential Ops
Self-Attention	O(n²·d)	O(1)
Recurrence	O(n·d²)	O(n)
Convolution	O(k·n·d²)	O(1)

For sequence lengths shorter than the representation dimension, self-attention is faster than recurrence.