
Building a 13 Billion Parameter LLM from Scratch Using Python — PART 2

Neural pAi

Part 2: Mathematical Foundations and Core Components

1. Overview and Recap

In Part 1, we introduced the motivation for building a 13B parameter model, reviewed the fundamentals of LLMs and the Transformer architecture, and outlined the project roadmap. Now, we delve into the core mathematical components that power modern Transformers. Understanding these elements is essential not only for implementing our model but also for troubleshooting and optimizing performance during training.

2. Mathematical Foundations of Self-Attention

Scaled Dot-Product Attention

At the heart of the Transformer is the self-attention mechanism, which allows the model to weigh the importance of different tokens in an input sequence relative to one another. The computation is performed as follows:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

Here, Q (queries), K (keys), and V (values) are matrices derived from the input embeddings, and d_k is the dimensionality of the key vectors. Dividing the dot products by the scaling factor sqrt(d_k) keeps the softmax inputs in a numerically well-behaved range, which stabilizes the gradients during training.
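
To make the mechanics concrete, here is a minimal NumPy sketch of scaled dot-product attention. It assumes the Q, K, and V projections have already been applied to the embeddings; the function name, array shapes, and example dimensions are illustrative only, not the exact implementation we will build for the 13B model later in this series.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single sequence.

    Q, K, V: arrays of shape (seq_len, d_k) — assumed to be the already
    projected queries, keys, and values (illustrative shapes only).
    """
    d_k = Q.shape[-1]
    # Pairwise similarity between every query and every key,
    # scaled by sqrt(d_k) to keep the logits in a stable range.
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax turns the scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output token is a weighted average of the value vectors.
    return weights @ V

# Tiny usage example with random "embeddings": 4 tokens, d_k = 8.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8)
```

Using the same matrix for Q, K, and V, as in the toy example above, mirrors self-attention, where queries, keys, and values all come from the same input sequence.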

Detailed Derivation and Intuition

  1. Dot-Product Computation:
    The term QK^T computes the…
