Part I: The Foundations of Modern LLMs

Transformer basics, tokenization, embeddings, and the attention mechanism.

The Transformer Architecture

Because self-attention processes all positions in parallel rather than one step at a time, Transformers scale to very large training runs and capture long-range dependencies directly.
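
As a rough sketch of what one layer of this architecture does, the NumPy code below runs a simplified, single-head Transformer encoder layer: self-attention followed by a position-wise feed-forward network, each wrapped in a residual connection and layer normalization. All positions are handled at once by matrix multiplications, which is what makes the computation parallel. The dimensions, random weights, single head, and post-norm ordering are illustrative assumptions rather than a faithful reproduction of any particular model; the Q/K/V attention step is unpacked later in this part.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_layer(X, params):
    """One simplified Transformer encoder layer (single head, no biases)."""
    W_q, W_k, W_v, W_o, W_1, W_2 = params
    # Self-attention: every position attends to every other position in parallel.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1) @ V
    X = layer_norm(X + A @ W_o)                # residual connection + layer norm
    # Position-wise feed-forward network (ReLU MLP applied to each position).
    H = np.maximum(0.0, X @ W_1) @ W_2
    return layer_norm(X + H)                   # residual connection + layer norm

# Toy sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 16, 32, 64
X = rng.normal(size=(seq_len, d_model))
params = (
    rng.normal(size=(d_model, d_model)),  # W_q
    rng.normal(size=(d_model, d_model)),  # W_k
    rng.normal(size=(d_model, d_model)),  # W_v
    rng.normal(size=(d_model, d_model)),  # W_o
    rng.normal(size=(d_model, d_ff)),     # W_1
    rng.normal(size=(d_ff, d_model)),     # W_2
)
print(encoder_layer(X, params).shape)     # (16, 32): one vector per position
```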

Tokenization

Subword units balance vocabulary size with the ability to handle rare words.
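
Here is a small sketch of how subword tokenization behaves, using a greedy longest-match split in the style of WordPiece. The toy vocabulary and the "##" continuation convention are illustrative assumptions; real tokenizers (BPE, WordPiece, SentencePiece) learn their vocabularies from a corpus.

```python
import string

# Toy vocabulary: a few whole words, a few subword pieces, and single
# characters as a fallback so every lowercase word can be split somehow.
VOCAB = {"the", "cat", "sat", "on", "mat", "token", "##ization", "un", "##seen"}
VOCAB |= set(string.ascii_lowercase)
VOCAB |= {"##" + c for c in string.ascii_lowercase}

def tokenize_word(word, vocab):
    """Greedy longest-match subword split (WordPiece-style)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece     # continuation pieces carry a "##" prefix
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]             # no piece matched at all
        start = end
    return pieces

print(tokenize_word("cat", VOCAB))           # ['cat']: common word stays whole
print(tokenize_word("tokenization", VOCAB))  # ['token', '##ization']: rarer word splits
print(tokenize_word("unseen", VOCAB))        # ['un', '##seen']
```

A common word maps to a single token, while a rarer word is covered by a few reusable pieces, which is how a fixed-size vocabulary can still represent unseen words.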

Embeddings & Positional Encoding

Tokens are mapped to dense vectors and enriched with positional information.
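
The sketch below shows both steps in NumPy: an embedding-table lookup that turns token ids into dense vectors, and sinusoidal positional encodings (as in the original Transformer paper) added on top. The vocabulary size, model width, random initialization, and example token ids are illustrative assumptions; many models instead learn their positional embeddings.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal encodings: PE[pos, 2i] = sin(pos / 10000^(2i/d_model)),
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2)
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe

vocab_size, d_model, max_len = 1000, 64, 128       # toy sizes
rng = np.random.default_rng(0)
embedding_table = rng.normal(scale=0.02, size=(vocab_size, d_model))

token_ids = np.array([5, 42, 7, 981])              # output of the tokenizer
x = embedding_table[token_ids]                     # lookup: (4, d_model) dense vectors
x = x + positional_encoding(max_len, d_model)[: len(token_ids)]
print(x.shape)                                     # (4, 64)
```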

The Attention Mechanism: Q, K, V Explained

Q, K, and V each play a distinct role in self-attention.

Query (Q)

The current token's question about context.

Key (K)

An index of what each token offers.

Attention Score

The similarity of Q to K determines relevance.

Value (V)

The content that is aggregated, weighted by the attention scores.

How it Works

Self-attention compares each token's Query to every Key, normalizes the resulting scores with a softmax into attention weights, and uses those weights to average the Values, yielding a context-aware representation for each token.
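
Below is a minimal NumPy sketch of that computation for a single attention head; the toy dimensions, random inputs, and random projection matrices are assumptions for illustration (real models add multiple heads, masking, and learned parameters).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over X with shape (seq_len, d_model)."""
    Q = X @ W_q                          # queries: what each token is asking about
    K = X @ W_k                          # keys: what each token offers
    V = X @ W_v                          # values: the content to aggregate
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of every query to every key
    weights = softmax(scores, axis=-1)   # each row sums to 1: attention weights
    return weights @ V                   # context-aware mix of values per token

# Toy example: 4 tokens, model width 8.
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))
print(self_attention(X, W_q, W_k, W_v).shape)   # (4, 8): one vector per token
```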