Multi-head self-attention
Objectives
Understand what Multi-head self-attention is and why it is important
Explore how it is implemented in a transformer block
What is Multi-head attention?
Multi-head attention divides the attention process into multiple independent parallel instances
Each parallel process is called an “attention head”
Extends the attention mechanism to multiple parallel processes
Why is it important?
Independent parallel attention heads allow the model to focus on different aspects of the input simultaneously
Helps the model recognize more intricate and nuanced patterns in the text than a single-head attention mechanism can
E.g., one head might capture grammatical structure while another captures semantic meaning
Multi-head attention mechanism

Independent Projections: multiple sets of learnable weight matrices (\(W_{q}\), \(W_{k}\), \(W_{v}\)) create distinct Query, Key, and Value matrices for each head
Parallel Attention: each head executes the attention mechanism in parallel, generating its own context vectors

Concatenation:
The output vectors from all the parallel heads are joined (concatenated) together to form a single, longer vector
Final Linear Projection:
A learnable output projection combines the information from all heads into the final output dimension required by the next layer (see the sketch after this list)
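
A minimal sketch of these four steps is below, assuming PyTorch. The class name, shapes, and the omission of masking and dropout are simplifications for illustration, not taken from any particular library.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttentionNaive(nn.Module):
    """Illustrative multi-head attention with explicit per-head projections."""
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_head = d_model // num_heads
        # Independent projections: one (W_q, W_k, W_v) set per head
        self.q_proj = nn.ModuleList([nn.Linear(d_model, self.d_head) for _ in range(num_heads)])
        self.k_proj = nn.ModuleList([nn.Linear(d_model, self.d_head) for _ in range(num_heads)])
        self.v_proj = nn.ModuleList([nn.Linear(d_model, self.d_head) for _ in range(num_heads)])
        # Final linear projection back to the model dimension
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        head_outputs = []
        for Wq, Wk, Wv in zip(self.q_proj, self.k_proj, self.v_proj):
            q, k, v = Wq(x), Wk(x), Wv(x)                 # each: (batch, seq_len, d_head)
            scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
            weights = torch.softmax(scores, dim=-1)
            head_outputs.append(weights @ v)              # context vectors for this head
        # Concatenation of all head outputs, then the final projection
        concat = torch.cat(head_outputs, dim=-1)          # (batch, seq_len, d_model)
        return self.out_proj(concat)
```

For example, with d_model = 512 and num_heads = 8, each head attends over 64-dimensional queries, keys, and values, and the concatenated output is projected back to 512 dimensions.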
Note
In efficient implementations, rather than literally stacking separate per-head layers, multi-head attention is often achieved by projecting the input into one large dimension and then mathematically “splitting” the result into heads for processing, as in the sketch below
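
A sketch of that efficient formulation, again assuming PyTorch: one large projection per role (query, key, value), reshaped into heads instead of looping over separate per-head layers. Names and shapes are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Illustrative 'split into heads' multi-head attention."""
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # Single large projections; the per-head split happens by reshaping
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        b, t, _ = x.shape

        def split(y):  # (b, t, d_model) -> (b, num_heads, t, d_head)
            return y.view(b, t, self.num_heads, self.d_head).transpose(1, 2)

        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)   # (b, heads, t, t)
        weights = torch.softmax(scores, dim=-1)
        context = weights @ v                                       # (b, heads, t, d_head)
        # Merge heads back (inverse of the split), then apply the final projection
        context = context.transpose(1, 2).contiguous().view(b, t, -1)
        return self.out_proj(context)
```

Both sketches compute the same kind of result; the reshaping version simply avoids a Python loop over heads, which is why it is the form typically used in practice.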