LLM From Scratch

Chapter Summaries

Chapter 1: Understanding Large Language Models
  • Objective: \(\max_\theta\ \prod_{t} p_\theta\big(x_t\,\big|\,x_{\lt t}\big)\) — predict the next token given history.
  • Transformer (decoder‑only): masked self‑attention + MLP blocks with residuals and LayerNorm.
  • Stages: implement core → pretrain on large corpora → fine‑tune for tasks or instructions.
Chapter 2: Working with Text Data
  • BPE tokenization: learn merges from frequency of symbol pairs; robust to OOV.
  • Embeddings: token embeddings + positional (learned) embeddings; typical dims 256–4096.
  • Packing/sampling: sliding windows across streams create (input, target) with right‑shifted labels.
Chapter 3: Attention Mechanisms
  • Self‑attention: \(Q=X W_{Q},\ K=X W_{K},\ V=X W_{V};\ \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V\).
  • Causal mask: upper‑triangular positions are \(-\infty\) before softmax to block future tokens.
  • Multi‑head: split \(D\) into \(H\) heads with \(d_k=D/H\); concat heads then project with \(W_O\). Cost \(\mathcal{O}(T^2 D)\).
Chapter 4: Implementing a GPT Model
  • Pre‑LN block: LN → MHA → residual; LN → FFN → residual; improves stability in deep stacks.
  • FFN: two linear layers with GELU; hidden size ≈ 4×D; dominant param count.
  • Regularization: dropout, attention masking, weight decay.
Chapter 5: Pretraining
  • Loss: cross‑entropy on next token; perplexity \(= \exp(\mathrm{NLL})\).
  • Optimization: AdamW, warmup + cosine decay, gradient clipping, mixed precision.
  • Decoding: temperature scaling, top‑k and top‑p (nucleus) sampling.
  • Checkpoints: reuse open weights (e.g., GPT‑2) to skip costly pretraining.
Chapter 6: Fine‑tuning for Classification
  • Head: replace the LM head with a linear classifier; take features from the last token or from mean pooling over all tokens.
  • Efficient fine‑tuning: freeze base; train head + top layers; monitor accuracy/F1; handle class imbalance.
Chapter 7: Fine‑tuning to Follow Instructions
  • SFT data: instruction–response pairs with a prompt template (roles/system/user/assistant).
  • Behavior: supervise to follow directions; optionally add RLHF/AI‑feedback later.
  • Evaluation: win‑rate, MT‑Bench‑style pairwise judging, task‑specific metrics.

Welcome to the Interactive Guide!

This website is an interactive companion to the book "Build a Large Language Model From Scratch." Here, you can explore the core concepts of building an LLM with hands-on examples and visualizations.

The LLM Building Process

Building an LLM involves three main stages (implementing the core architecture, pretraining on unlabeled data, and fine-tuning), which the chapters below cover in detail:

1. Understanding Large Language Models

What is an LLM?

A deep neural network trained on huge text corpora to predict the next token and generate human‑like text.

The Transformer Architecture

GPT‑style models use a decoder‑only transformer: masked self‑attention + position‑wise feed‑forward networks stacked for autoregressive generation.

2. Working with Text Data

The Data Processing Pipeline

Before an LLM can learn, raw text must be converted into a numerical format. This involves several steps: tokenization, converting tokens to token IDs, and then creating embedding vectors.

Tokenization

LLMs can't process raw text. First, we need to break the text down into smaller units called tokens. This can be done at the word level, character level, or sub-word level. Byte-Pair Encoding (BPE) is a common sub-word tokenization algorithm used by models like GPT-2, which can handle any word by breaking unknown words into smaller known pieces.
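
To make this concrete, here is a minimal sketch using the tiktoken package (GPT-2's BPE tokenizer); the example text is arbitrary and the package is assumed to be installed (pip install tiktoken).

import tiktoken

enc = tiktoken.get_encoding("gpt2")            # load GPT-2's BPE merges and vocabulary
text = "Hello, do you like tea? Antidisestablishmentarianism"
ids = enc.encode(text)                         # text -> list of integer token IDs
pieces = [enc.decode([i]) for i in ids]        # unknown words get split into known sub-word pieces
print(ids)
print(pieces)
print(enc.decode(ids) == text)                 # BPE round-trips losslessly -> True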

Interactive Tokenizer

This demo uses a simple regex to split text into words and punctuation.

Interactive BPE Merge Simulator

A tiny, educational BPE that applies a few merges to show how subwords form.

Embeddings & Positional Encoding

After tokenization, we map each token ID to a high-dimensional vector called an embedding. These vectors capture the semantic meaning of the tokens. However, the basic self-attention mechanism is position-agnostic. To solve this, we add positional embeddings to the token embeddings, giving the model a sense of word order.
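
As a rough sketch (PyTorch assumed, GPT-2-sized dimensions used for illustration), the embedding step looks like this:

import torch
import torch.nn as nn

vocab_size, context_len, emb_dim = 50257, 1024, 768   # GPT-2-style sizes
tok_emb = nn.Embedding(vocab_size, emb_dim)           # one learned vector per token ID
pos_emb = nn.Embedding(context_len, emb_dim)          # one learned vector per position

token_ids = torch.tensor([[15, 99, 7, 512]])          # (batch=1, seq_len=4), made-up IDs
positions = torch.arange(token_ids.shape[1])          # tensor([0, 1, 2, 3])
x = tok_emb(token_ids) + pos_emb(positions)           # (1, 4, 768): input to the first block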

3. Attention Mechanisms

Self-Attention: The Core Idea

Self-attention allows the model to weigh the importance of different words in a sequence when processing a particular word. For each token, it creates three vectors: a **Query**, a **Key**, and a **Value**. The attention score between two tokens is calculated by taking the dot product of the Query vector of the current token and the Key vector of the other token. These scores are then normalized (using softmax) to create attention weights, which are used to create a weighted sum of the Value vectors.

Deep Dive: Multi-Head Attention Shapes

Attention Math (at a glance)

Q = X W_Q,  K = X W_K,  V = X W_V
scores = (Q K^T) / sqrt(d_k)
masked_scores = scores + causal_mask  // -inf above diagonal for GPT
weights = softmax(masked_scores)
output = weights V

The causal mask prevents a position t from attending to future positions t' > t.
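
For readers who want to run the math above, here is a single-head sketch in PyTorch (weight matrices are random placeholders, not trained values):

import torch

def causal_self_attention(x, W_q, W_k, W_v):
    # x: (batch, T, d_in); W_q, W_k, W_v: (d_in, d_k)
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k**0.5            # (batch, T, T)
    T = scores.shape[-1]
    mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))       # block attention to future positions
    weights = torch.softmax(scores, dim=-1)                # each row sums to 1
    return weights @ V                                     # weighted sum of value vectors

x = torch.randn(1, 6, 16)                                  # toy batch: 6 tokens, d_in = 16
W_q, W_k, W_v = (torch.randn(16, 16) for _ in range(3))
out = causal_self_attention(x, W_q, W_k, W_v)              # (1, 6, 16)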

Interactive Attention Matrix

Hover over a word in the sentence. The matrix below will show simulated attention scores from that word (Query) to all other words (Keys). Higher scores (darker blue) mean more attention.

Example sentence: "The cat sat on the mat."

Causal and Multi-Head Attention

Causal Attention: For generative models like GPT, we need to prevent a token from "seeing" future tokens. This is done by masking the attention scores for subsequent positions, ensuring the prediction for a word only depends on the words that came before it.

Multi-Head Attention: Instead of performing attention once, we do it multiple times in parallel with different weight matrices (multiple "heads"). Each head can learn different types of relationships between words. The results from all heads are then combined.
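
The head split is mostly a reshaping exercise. The sketch below shows the tensor shapes only (PyTorch assumed, sizes illustrative); the attention computation itself is the same as in the single-head example above, applied per head:

import torch

B, T, D, H = 2, 8, 64, 4                          # batch, sequence length, model dim, heads
d_k = D // H
Q = torch.randn(B, T, D)                          # e.g. the output of x @ W_q before splitting
Q_heads = Q.view(B, T, H, d_k).transpose(1, 2)    # (B, H, T, d_k): each head works independently
# ... scaled dot-product attention runs per head on (B, H, T, d_k) tensors ...
out = Q_heads.transpose(1, 2).reshape(B, T, D)    # concatenate heads, then project with W_o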

4. Implementing a GPT Model from Scratch

The Transformer Block

A GPT-like model is essentially a stack of these "Transformer Blocks". Each block contains the same core components, allowing the model to build progressively more complex representations of the input text. The components are listed here, with a minimal code sketch after the list.

  • Multi-Head Attention: The component we explored in the previous chapter. It finds relationships between tokens.
  • Layer Normalization: Applied before major components to stabilize the training process by normalizing the activations.
  • Feed Forward Network: A simple two-layer neural network that processes each token's representation independently. It adds computational depth.
  • Shortcut Connections (Residuals): The input to a sub-layer (like attention) is added to its output. This helps with the vanishing gradient problem in deep networks.
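
Putting the pieces together, a minimal pre-LN block in PyTorch might look like the sketch below (dimensions follow GPT-2's smallest configuration; nn.MultiheadAttention stands in for a hand-written attention module):

import torch.nn as nn

class TransformerBlock(nn.Module):
    # Pre-LN block: LN -> attention -> residual, then LN -> feed-forward -> residual.
    def __init__(self, d_model=768, n_heads=12, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),   # expand to ~4x the model dimension
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),   # project back down
            nn.Dropout(dropout),
        )

    def forward(self, x, causal_mask):
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask)   # masked self-attention
        x = x + attn_out                       # residual around attention
        x = x + self.ffn(self.ln2(x))          # residual around the feed-forward network
        return x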

5. Pretraining on Unlabeled Data

The Pretraining Task: Next-Word Prediction

The model is trained on a simple yet powerful task: predicting the next word in a sentence. Given a sequence of words, the model calculates a probability distribution over its entire vocabulary for what the next word should be. The difference between its prediction and the actual next word (the loss) is used to update its weights. By doing this on a massive dataset, the model learns grammar, facts, and reasoning abilities.
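
In code, the loss is ordinary cross-entropy between the predicted distributions and the actual next tokens; here is a sketch (PyTorch assumed, with `model` standing in for any module that maps token IDs to per-position vocabulary logits):

import torch.nn.functional as F

def next_token_loss(model, token_ids):
    # token_ids: (batch, T) integer tensor; the labels are the inputs shifted one step left
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs)                                  # (batch, T-1, vocab_size)
    return F.cross_entropy(logits.flatten(0, 1), targets.flatten())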

Softmax & Temperature Playground

Decoding Strategies for Generation

When generating text, simply picking the most likely next word every time (greedy decoding) can be repetitive. We use decoding strategies to introduce controlled randomness.

  • Temperature Scaling: Controls the "peakedness" of the probability distribution. Higher temperature means more randomness, lower temperature means more deterministic.
  • Top-k Sampling: Limits the word choice to the 'k' most likely next words, preventing the model from picking very unlikely words.
  • Top-p (Nucleus) Sampling: Samples from the smallest set of words whose cumulative probability exceeds 'p', adapting the candidate pool to the model's confidence. A code sketch of these strategies follows this list.
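
Here is a compact sketch of one decoding step with temperature scaling and top-k filtering (PyTorch assumed; the logits are random placeholders rather than real model outputs):

import torch

def sample_next_token(logits, temperature=0.7, top_k=10):
    logits = logits / temperature                   # <1 sharpens, >1 flattens the distribution
    if top_k is not None:
        kth_value = torch.topk(logits, top_k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))   # drop all but the top k
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # sample rather than always taking the argmax

next_id = sample_next_token(torch.randn(50257))     # toy logits over a GPT-2-sized vocabulary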

Interactive Text Generation

Demo defaults: temperature 0.7, top-k 10, top-p 0.9.

Perplexity Calculator

Enter probabilities assigned to the true next tokens (0-1), comma-separated. Lower perplexity is better.
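
The calculation behind the demo is small enough to show directly: perplexity is the exponential of the average negative log-probability assigned to the true tokens.

import math

def perplexity(probs):
    nll = [-math.log(p) for p in probs]       # negative log-likelihood per true token
    return math.exp(sum(nll) / len(nll))

print(perplexity([0.9, 0.8, 0.95]))   # confident predictions -> low perplexity (~1.13)
print(perplexity([0.1, 0.2, 0.05]))   # poor predictions      -> high perplexity (~10.0)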

6. Fine-tuning for Classification

Adapting the Model for a New Task

A pretrained model is a generalist. To make it a specialist, we perform fine-tuning. For a task like spam classification, we replace the model's final layer (which outputs logits for the entire vocabulary) with a new "classification head" that outputs logits for our specific classes (e.g., two outputs for "spam" and "not spam"). Then, we continue training on a smaller, labeled dataset. Often, we only train the new head and a few of the top layers, keeping the base of the model "frozen" to save computational resources.
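
A sketch of this surgery (PyTorch assumed; the attribute names `out_head` and `blocks` are illustrative and depend on how the model class is written):

import torch.nn as nn

def adapt_for_classification(model, d_model=768, num_classes=2):
    # model: a pretrained GPT-style nn.Module (attribute names are assumptions for this sketch)
    for param in model.parameters():
        param.requires_grad = False                      # freeze the entire pretrained base
    model.out_head = nn.Linear(d_model, num_classes)     # new head: "not spam" vs "spam" logits
    # Optionally also unfreeze the top transformer block for a little extra capacity:
    # for param in model.blocks[-1].parameters():
    #     param.requires_grad = True
    return model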

Interactive Spam Classifier

This demo simulates a fine-tuned model by checking for common spam keywords.

7. Fine-tuning to Follow Instructions

Creating a Chat Model

To create a personal assistant or chatbot, we perform "instruction fine-tuning." We fine-tune the model on a dataset of instructions and desired responses, often formatted in a specific prompt style (like the Alpaca format). This teaches the model to follow commands and have a conversation, rather than just completing text.
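
As an illustration, here is one common Alpaca-style template (the exact wording varies between datasets; this is a sketch, not the book's canonical template):

def format_example(instruction, response, input_text=""):
    prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n"
    )
    if input_text:
        prompt += f"### Input:\n{input_text}\n\n"
    return prompt + f"### Response:\n{response}"

print(format_example("Rewrite the sentence in passive voice.",
                     "The ball was thrown by the boy.",
                     "The boy threw the ball."))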

Mini Chatbot

This is a simple demo. Ask it about concepts from the book!

8. Resources, Study Plan, and Downloadables

Quick-Start: Minimal GPT (pseudocode)

// Shapes: batch B, sequence T, embedding D, heads H
// Token + positional embeddings
E = Embedding(vocab, D)
P = PositionalEmbedding(T, D)

x = E[token_ids] + P[:T]                  // (B, T, D); the whole sequence is processed at once

for block in transformer_blocks:          // repeated N times
  // Multi-head self-attention (pre-LN, causal) with residual
  h = LayerNorm(x)
  Q = h Wq,  K = h Wk,  V = h Wv          // split D into H heads with d_k = D/H
  att = softmax((Q K^T)/sqrt(d_k) + causal_mask)   // -inf above the diagonal blocks future tokens
  x = x + concat(att V across heads) Wo
  // Feed-forward MLP (pre-LN) with residual
  h = LayerNorm(x)
  x = x + GELU(h W1 + b1) W2 + b2

logits = LayerNorm(x) Wout                // final LayerNorm, then project to the vocabulary

// Position t predicts token t+1, so the labels are the inputs shifted by one
loss = cross_entropy(logits[:, :-1], token_ids[:, 1:])
update(parameters)  // AdamW, warmup + cosine LR decay, weight decay, gradient clipping

This sketch mirrors the blocks described throughout this guide and in the referenced book. Use it as a mental model while reading.

References & Attribution

  • Inspired by the excellent book Build a Large Language Model (From Scratch) by Sebastian Raschka.
  • Figures replaced with native SVG diagrams for reliable, fast loading.
  • Transformer architecture and GPT block semantics follow the canonical literature: “Attention Is All You Need.”