Introduction to embedding

Objectives

  • Understand what embedding is and why it is important

  • Explore how embedding layers in LLMs process tokens

Token embedding

What is token embedding?

  • Token embedding is the process of converting discrete tokens (specifically token-IDs) into vectors

  • These vectors can be represented in a high-dimensional space

    • Token ID -> vector with length N (points in N-dimensional space)

  • Representing tokens in a high-dimensional space enables the model to capture complex patterns and relationships effectively
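A minimal sketch of this lookup (toy sizes, not GPT-2's actual dimensions): an embedding layer is simply a learnable table that maps each token ID to a vector of length N.

import torch

# Toy example: vocabulary of 10 tokens, embedding dimension N = 4
vocab_size, embedding_dim = 10, 4
embedding = torch.nn.Embedding(vocab_size, embedding_dim)

# A token ID is just a row index into the embedding matrix
token_ids = torch.tensor([3, 7, 1])
vectors = embedding(token_ids)
print(vectors.shape)  # torch.Size([3, 4]) -> three points in 4-dimensional space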

Why is token embedding important?

  • Token IDs are just arbitrary integers assigned during tokenization; there is no mathematical relationship between these integers

    • Token IDs are discrete numerical labels with no geometric or relational structure

  • Token embedding converts these numerical labels into structured representations: vectors (points in a high-dimensional space) that capture the relationships between tokens

  • In this high-dimensional space, semantically related tokens like “dog”, “cat”, “animal” cluster together

  • This structure is learned during the model training process so that words with similar contextual roles have similar vector representations (similarity between vectors represents relationships between tokens)
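As a purely illustrative example (the vectors below are made up, not learned embeddings), the relationship between two token vectors is typically measured with cosine similarity:

import torch
import torch.nn.functional as F

# Made-up toy vectors purely for illustration; real embeddings are learned during training
cat = torch.tensor([0.9, 0.8, 0.1])
dog = torch.tensor([0.85, 0.75, 0.2])
pencil = torch.tensor([0.1, 0.2, 0.9])

print(F.cosine_similarity(cat, dog, dim=0))     # high similarity: related tokens
print(F.cosine_similarity(cat, pencil, dim=0))  # low similarity: unrelated tokens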

Demo

Explore token embeddings (Word2Vec embeddings):

  • Word2Vec embeddings were widely used before the introduction of LLM technology

The Word2Vec figure shows that words with similar roles in natural language cluster together when represented in a high-dimensional space

Python implementation
import gensim.downloader as api
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Download and load the pre-trained Google News model
wv = api.load('word2vec-google-news-300')

print(f"Dimensions of 'king' embeddings: {wv['king'].shape}")
print(f"First 10 elements of 'king' embeddings: {wv['king'][:10]}")

similar_words = wv.most_similar('king', topn=3)
[print(f"{word}: {similarity:.4f}")for word, similarity in similar_words]
print(similar_words)

def get_3d_projections(word_list, wv):
    # Extract embeddings for the given words
    print("Word List:", word_list)
    embeddings = [wv[word] for word in word_list]

    # Apply PCA to reduce dimensions to 3D
    pca = PCA(n_components=3)
    projections = pca.fit_transform(embeddings)

    return projections

def plot_3d_projections(projections, words):
    fig = plt.figure(figsize=(8, 6))
    ax = fig.add_subplot(111, projection='3d')

    for i, word in enumerate(words):
        x, y, z = projections[i]
        ax.scatter(x, y, z, label=word)
        ax.text(x, y, z, word, fontsize=10)

    ax.set_title("3D PCA Projection of Word2Vec Embeddings")
    ax.set_xlabel("PC1")
    ax.set_ylabel("PC2")
    ax.set_zlabel("PC3")
    #plt.legend(loc='upper left')
    #plt.tight_layout()
    plt.show()

words = [word for word, _ in similar_words]
words.extend(["software", "internet", "web" ])
reduced_embeddings = get_3d_projections(words, wv)
plot_3d_projections(reduced_embeddings, words)

Token embeddings in LLMs

  • The tokenizer used by the smallest GPT-2 model has a vocabulary of 50,257 tokens

  • The GPT-2 tokenizer maps tokens to the integers 0-50256, with no mathematical relationship in those assignments

  • For example, the tokenizer maps the input “cat sat on the” to token IDs [9246, 3332, 319, 262]; there is no inherent relationship between these numbers themselves

  • Token embeddings convert these arbitrary token IDs into dense vectors in a continuous space of 768 dimensions (for GPT-2 small)

  • Now the tokens of “cat sat on the” aren’t just different numbers; they’re points in a high-dimensional space where the similarity between vectors has meaning

  • Capturing semantic meaning through token embeddings is learned during LLM pre-training process (detailed later)
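A short sketch of the idea above, assuming the Hugging Face transformers library with the GPT-2 tokenizer and model (the printed IDs should correspond to those quoted above):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Arbitrary integer IDs from the tokenizer ...
token_ids = tokenizer.encode("cat sat on the")
print(token_ids)

# ... become dense 768-dimensional vectors after the embedding lookup
with torch.no_grad():
    vectors = model.transformer.wte(torch.tensor(token_ids))
print(vectors.shape)  # torch.Size([4, 768])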

GPT-2 Token embeddings: Step-by-Step Breakdown

  1. Initialize a learnable matrix (dimensions [vocab_size × embedding_dim])

    • Each row represents one token’s embedding vector

  2. Initialize the matrix with small random values (e.g., -2.84 to 1.58) to break symmetry

    • If all embeddings were identical, tokens could not be differentiated during training

  3. As the model processes training examples, it makes predictions using these random embeddings

  4. Prediction errors generate gradients that flow back through the network to the embedding layer

  5. Token embeddings are optimized through backpropagation (tokens appearing in similar contexts receive similar updates)

  6. Through thousands of iterations in the pre-training process, random vectors evolve into meaningful representations where “cat” and “dog” cluster together

  7. The final optimized embeddings encode semantic relationships learned entirely from the training data patterns
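A toy sketch of steps 1-6 (illustrative only, not GPT-2's actual training loop): randomly initialize an embedding matrix, make a prediction with a hypothetical next-token head, and let backpropagation update only the embedding rows of the tokens that appeared in the example.

import torch

torch.manual_seed(0)
vocab_size, embedding_dim = 50, 8

# Steps 1-2: learnable matrix [vocab_size x embedding_dim], randomly initialized to break symmetry
embedding = torch.nn.Embedding(vocab_size, embedding_dim)
output_head = torch.nn.Linear(embedding_dim, vocab_size)  # toy next-token predictor (hypothetical)
optimizer = torch.optim.SGD(
    list(embedding.parameters()) + list(output_head.parameters()), lr=0.1
)

# Step 3: one toy "context token -> next token" training example
context_id, next_id = torch.tensor([4]), torch.tensor([17])
logits = output_head(embedding(context_id))

# Steps 4-5: prediction error -> gradients flow back -> embedding rows are updated
loss = torch.nn.functional.cross_entropy(logits, next_id)
loss.backward()
optimizer.step()

# Only the row for token 4 received a gradient; over many such steps (step 6),
# tokens that appear in similar contexts drift toward similar vectors
print(embedding.weight.grad[4])  # non-zero gradient
print(embedding.weight.grad[5])  # all zeros (token 5 was not in this example)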

Explore LLM token embeddings

Embedding Dimensions:

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Access the word token embedding layer
wte = model.transformer.wte

# Get vocabulary size and embedding dimension
print(f"Vocabulary Size: {wte.num_embeddings}; Embedding Dimension: {wte.embedding_dim}")

# The embedding matrix is stored in the 'weight' attribute
print(f"Shape of the embedding matrix: {wte.weight.shape}")

LLM Embedding of made-up words:

import torch
from transformers import AutoTokenizer

# Load the matching GPT-2 tokenizer (if not already loaded earlier)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text_rand = "rand_xyz"
rand_token_ids = tokenizer.encode(text_rand)
print(f"Token IDs for text '{text_rand}': {rand_token_ids}")
print(f"Decoded text: {tokenizer.decode(rand_token_ids)}")
print()

# Disable gradient calculation (embedding lookup only, no training)
with torch.no_grad():
    rand_token_embeddings = wte(torch.tensor(rand_token_ids))
print(f"Shape of the random token embeddings: {rand_token_embeddings.shape}")
print()

for i in range(len(rand_token_ids)):
    print(f"Token {i}: {tokenizer.decode(rand_token_ids[i])} -> {rand_token_ids[i]};\n\tEmbeddings (first 5): {rand_token_embeddings[i][:5]}")

Output

Embedding Dimensions
Vocabulary Size: 50257; Embedding Dimension: 768
Shape of the embedding matrix: torch.Size([50257, 768])
LLM Embedding of made-up words
Token IDs for text 'rand_xyz': [25192, 62, 5431, 89]
Decoded text: rand_xyz

Shape of the random token embeddings: torch.Size([4, 768])

Token 0: rand -> 25192;
	Embeddings (first 5): tensor([-0.0456, -0.1112,  0.2527,  0.0098, -0.0464])
Token 1: _ -> 62;
	Embeddings (first 5): tensor([-0.0073, -0.0894,  0.0005,  0.0701, -0.0090])
Token 2: xy -> 5431;
	Embeddings (first 5): tensor([-0.1123, -0.0957,  0.1115, -0.0743,  0.0958])
Token 3: z -> 89;
	Embeddings (first 5): tensor([-0.0141, -0.0427,  0.0941, -0.1052,  0.0594])

Position embeddings

  • The token embedding process (converting unstructured token IDs into dense vectors) helps capture relationships between tokens

  • However, the token embedding process treats all positions in the sequence equally

  • With token embeddings alone, the model cannot distinguish token order

    • Unable to distinguish between “dog bites man” and “man bites dog”

  • Position embeddings inject position information into the embedding vectors
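A toy illustration of the point above (made-up sizes and token IDs, not GPT-2's actual layers): with token embeddings alone, “dog bites man” and “man bites dog” contain exactly the same set of vectors, but adding position embeddings ties each vector to its slot in the sequence.

import torch

torch.manual_seed(0)
vocab_size, max_len, dim = 10, 8, 4
tok_emb = torch.nn.Embedding(vocab_size, dim)  # token embeddings (toy)
pos_emb = torch.nn.Embedding(max_len, dim)     # position embeddings (toy)

# Hypothetical token IDs: dog=1, bites=2, man=3
dog_bites_man = torch.tensor([1, 2, 3])
man_bites_dog = torch.tensor([3, 2, 1])
positions = torch.arange(3)

# Token embeddings alone: both sentences contain the same set of vectors,
# so an order-insensitive view (e.g., their sum) cannot tell them apart
print(torch.allclose(tok_emb(dog_bites_man).sum(0), tok_emb(man_bites_dog).sum(0)))  # True

# Adding position embeddings ties each vector to its slot, so the sequences now differ
a = tok_emb(dog_bites_man) + pos_emb(positions)
b = tok_emb(man_bites_dog) + pos_emb(positions)
print(torch.allclose(a, b))  # False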

Embedding layer of LLMs

  • Token embeddings convert discrete token IDs into vector representations through a learnable matrix

  • Positional embeddings are added to inject sequence-order information into the LLM embeddings

  • The embedding vectors (combined token and position embeddings) are passed to the next layer of the LLM, the transformer layer/block

LLM embeddings

Figure source: transformer-explainer

Demo

Token and position embeddings:

Steps in the following script:

  • Tokenize input text bank is near the river bank

  • Pass input to token and position embeddings

  • Calculate the cosine similarity between the first and last tokens

import torch

text_1 = "bank is near the river bank"
token_ids = tokenizer.encode(text_1)

for i in range(0,len(token_ids)):
    print(f"Token {i}: {tokenizer.decode(token_ids[i])} -> {token_ids[i]}")

wte = model.transformer.wte # Token embedding
wpe = model.transformer.wpe # Position embedding

# First "bank" token, at position 0
bank_1_id = token_ids[0]
wte_bank_1 = wte(torch.tensor(bank_1_id))
wpe_bank_1 = wpe(torch.tensor(0))

# Last "bank" token, at the final position in the sequence
bank_2_id = token_ids[-1]
wte_bank_2 = wte(torch.tensor(bank_2_id))
wpe_bank_2 = wpe(torch.tensor(len(token_ids) - 1))


cosine_sim = torch.nn.functional.cosine_similarity(
    wte_bank_1 + wpe_bank_1,
    wte_bank_2 + wpe_bank_2,
    dim=0
)
print(f"Similarity: {cosine_sim.item():<.4f}")  # Less than 1.0 - they're different!

  • The contextualization happens in the transformer layers of the LLM, which update the embedding vectors

  • The transformer layers update the vectors from the embedding layer to reflect attention patterns that capture semantic relationships such as “river” → “bank” (as in riverbank); a sketch exploring this follows the output below

Output

LLM Embedding
## Notice that token 0 "bank" and token 5 " bank" (with a leading space) map to different token IDs

Token 0: bank -> 17796
Token 1:  is -> 318
Token 2:  near -> 1474
Token 3:  the -> 262
Token 4:  river -> 7850
Token 5:  bank -> 3331

Similarity: 0.5472
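As a follow-up sketch (assuming the model and tokenizer from the earlier snippets), one possible way to observe this contextualization is to compare the two "bank" tokens again, this time using the hidden states after the final transformer layer instead of the raw embeddings:

import torch

input_ids = torch.tensor([tokenizer.encode("bank is near the river bank")])

# Request hidden states from every layer; hidden_states[0] is the embedding-layer
# output and hidden_states[-1] is the output of the last transformer block
with torch.no_grad():
    outputs = model(input_ids, output_hidden_states=True)

last_layer = outputs.hidden_states[-1][0]  # shape: [seq_len, 768]
contextual_sim = torch.nn.functional.cosine_similarity(
    last_layer[0], last_layer[-1], dim=0
)
print(f"Contextualized similarity: {contextual_sim.item():.4f}")

Unlike the embedding-layer vectors, these last-layer vectors have already mixed in information from the surrounding tokens (such as "river"), so they are no longer simple lookups from the same table.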