Introduction to Large Language Models (LLMs)
Objectives
Understand why the field of language modeling needed LLMs.
Explore what LLMs are.
Understand how LLMs differ from other machine learning approaches.
Language modeling (LM)

LM was introduced in the early 1980s with the advent of Recurrent Neural Networks (RNNs)
As the field of LM advanced, more sophisticated extensions to RNNs were introduced to:
preserve gradients and maintain information (1997-2014; gating mechanisms, e.g., LSTMs and GRUs)
handle long-range dependencies (2015; attention for RNNs)
manage variable-length input/output sequences (2014; encoder-decoder RNNs)
Why does the LM field need LLMs?
RNNs process inputs sequentially, and the attention mechanism was not built into the core architecture
This sequential processing makes RNNs slow and leads to scalability challenges
The Transformer architecture, introduced in the 2017 paper “Attention Is All You Need”, addressed these limitations of modern RNNs
Transformers (and therefore LLMs) eliminate the sequential dependency:
all positions can be computed in parallel (see the sketch below)
this enables scalable model training and inference
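To make this concrete, here is a minimal sketch of scaled dot-product self-attention in plain NumPy; all shapes and weights are toy values chosen purely for illustration. The key point is that the whole sequence is processed with a few matrix multiplications rather than a per-token loop:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a whole sequence at once."""
    # X: (seq_len, d_model). Unlike an RNN, there is no step-by-step
    # recurrence: every position is projected and attended to in one pass.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # contextualized vectors

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8                              # toy sizes for illustration
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)           # -> (5, 8)
```

An RNN, by contrast, must finish computing position t before it can start position t + 1; this is exactly the sequential bottleneck the Transformer removes.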
LLMs
Transformer-based neural networks with a large number of parameters (billions to trillions) that employ self-attention mechanisms and are trained on vast amounts of data (billions to trillions of tokens)
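As a concrete (if small-scale) illustration, the sketch below loads a pretrained causal language model and generates a continuation. It assumes the Hugging Face transformers library is installed; GPT-2 is used purely as a stand-in for the much larger models described above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is an illustrative choice: a small, openly available causal LM.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Large language models are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)  # autoregressive decoding
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```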
LLMs vs other ML approaches
The behavior of traditional ML approaches is tied specifically to their training objectives
LLMs exhibit capabilities for which they were never explicitly trained
i.e., Simple training objectives lead to complex capabilities
e.g., an LLM’s ability to translate despite never being specifically trained for translation (see the sketch below)
These capabilities are referred to as the “Emergent Behavior” of LLMs
This complexity emerges from:
Large scale + rich data + powerful architecture
The mechanism behind this emergence is still not fully understood
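One way to probe emergent translation is with a plain text prompt, as in the hedged sketch below. It again assumes the transformers library; GPT-2 is only a stand-in here, and larger models follow this prompt far more reliably:

```python
from transformers import pipeline

# Zero-shot translation via prompting: the model was never trained on a
# translation objective, so any success here is emergent behavior.
generator = pipeline("text-generation", model="gpt2")
prompt = (
    "Translate English to French.\n"
    "English: The weather is nice today.\n"
    "French:"
)
print(generator(prompt, max_new_tokens=20)[0]["generated_text"])
```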
Anatomy of an LLM

Tokenizer
Embedding layer
Transformer block
Self-attention layer
Feedforward neural network
Language modeling head (LM head)
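A minimal sketch of how these components fit together, written in PyTorch; every size and name, and the simplified block structure (positional encodings and causal masking omitted), are illustrative assumptions rather than a production design:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One transformer block: self-attention layer + feedforward network."""
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # self-attention layer
        x = self.ln1(x + attn_out)         # residual connection + norm
        return self.ln2(x + self.ffn(x))   # feedforward network + residual + norm

class TinyLM(nn.Module):
    """Skeleton of an LLM; the tokenizer lives upstream, mapping text to ids."""
    def __init__(self, vocab_size=1000, d_model=64, n_heads=4, d_ff=256, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)   # embedding layer
        self.blocks = nn.ModuleList(
            TransformerBlock(d_model, n_heads, d_ff) for _ in range(n_layers)
        )
        self.lm_head = nn.Linear(d_model, vocab_size)    # language modeling head

    def forward(self, token_ids):                        # (batch, seq_len)
        x = self.embed(token_ids)                        # ids -> vectors
        for block in self.blocks:
            x = block(x)
        return self.lm_head(x)                           # logits over the vocabulary

model = TinyLM()
logits = model(torch.randint(0, 1000, (1, 12)))          # 12 toy token ids
print(logits.shape)                                      # torch.Size([1, 12, 1000])
```

Real LLMs stack dozens of such blocks with billions of parameters; the LM head's logits are what greedy decoding or sampling turns back into tokens.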