Part I: The Foundations of Modern LLMs
Transformer basics, tokenization, embeddings, and the attention mechanism.
The Transformer Architecture
Parallel self-attention enables large-scale training and long-range dependencies.
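To make the parallelism concrete, below is a minimal sketch (assuming PyTorch) of a single encoder block: self-attention and a position-wise feed-forward layer are applied to every position of the sequence in one batched pass, with no per-token recurrence. The sizes (d_model=512, n_heads=8, d_ff=2048) are illustrative defaults, not values taken from this article.

```python
# Minimal Transformer encoder block sketch (assuming PyTorch).
# All positions are processed in a single batched pass, which is
# what allows training to parallelize across the whole sequence.
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                    # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x)     # self-attention: Q, K, V all come from x
        x = self.norm1(x + attn_out)         # residual connection + layer norm
        x = self.norm2(x + self.ff(x))       # position-wise feed-forward + residual
        return x

block = EncoderBlock()
tokens = torch.randn(2, 16, 512)             # 2 sequences of 16 token embeddings
print(block(tokens).shape)                   # torch.Size([2, 16, 512])
```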
Tokenization
Subword units balance vocabulary size with the ability to handle rare words.
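A toy greedy longest-match tokenizer in the WordPiece style illustrates the trade-off: common words stay whole, while rare words are split into known subword pieces instead of becoming out-of-vocabulary tokens. The VOCAB set and the "##" continuation marker here are illustrative assumptions, not any real model's vocabulary.

```python
# Toy greedy longest-match subword tokenizer (WordPiece-style) with a
# hypothetical vocabulary. Rare words are split into known pieces
# rather than mapped to a single unknown token.
VOCAB = {"trans", "##form", "##er", "##s", "the", "un", "##believ", "##able", "[UNK]"}

def tokenize(word):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in VOCAB:               # take the longest matching subword
                pieces.append(piece)
                break
            end -= 1
        else:                                # no matching subword found
            return ["[UNK]"]
        start = end
    return pieces

print(tokenize("transformers"))   # ['trans', '##form', '##er', '##s']
print(tokenize("unbelievable"))   # ['un', '##believ', '##able']
print(tokenize("the"))            # ['the']
```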
Embeddings & Positional Encoding
Tokens are mapped to dense vectors and enriched with positional information.
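As a minimal sketch (assuming PyTorch and the sinusoidal encoding from the original Transformer paper), the snippet below maps token ids to dense vectors and adds positional information; the sizes (vocab_size, d_model, seq_len) are arbitrary placeholders.

```python
# Token embeddings plus sinusoidal positional encoding (assuming PyTorch).
# PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
# PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
import torch
import torch.nn as nn

d_model, vocab_size, seq_len = 512, 30000, 16
embed = nn.Embedding(vocab_size, d_model)            # token id -> dense vector

pos = torch.arange(seq_len).unsqueeze(1)              # (seq_len, 1)
i = torch.arange(0, d_model, 2)                       # even embedding dimensions
angles = pos / (10000 ** (i / d_model))               # (seq_len, d_model/2)
pe = torch.zeros(seq_len, d_model)
pe[:, 0::2] = torch.sin(angles)
pe[:, 1::2] = torch.cos(angles)

token_ids = torch.randint(0, vocab_size, (1, seq_len))  # a batch of one sequence
x = embed(token_ids) + pe                                # enriched with position info
print(x.shape)                                           # torch.Size([1, 16, 512])
```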
The Attention Mechanism: Q, K, V Explained
Each of Q, K, and V plays a distinct role in self-attention.
Query (Q)
The current token's question about the context.
Key (K)
An index of what each token offers.
Attention Score
The similarity of Q to each K, which determines how relevant that token is.
Value (V)
The content that is aggregated, weighted by the attention scores.
How it Works
Self-attention compares each token's Query against every Key, scales the resulting dot products and passes them through a softmax to obtain weights over the Values; the weighted sum gives each token a context-aware representation.
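The sketch below implements this flow for a single attention head (assuming PyTorch; the projection matrices w_q, w_k, and w_v stand in for learned parameters).

```python
# Scaled dot-product self-attention sketch (assuming PyTorch).
# Each Query is compared to every Key, the softmax turns the scores
# into weights, and the weights aggregate the Values.
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_k) projections."""
    Q, K, V = x @ w_q, x @ w_k, x @ w_v       # project tokens into Q, K, V spaces
    d_k = Q.size(-1)
    scores = Q @ K.T / d_k ** 0.5              # similarity of every Q to every K
    weights = F.softmax(scores, dim=-1)        # each row sums to 1: attention weights
    return weights @ V                         # weighted sum of Values per token

torch.manual_seed(0)
x = torch.randn(4, 8)                          # 4 tokens, d_model = 8
w_q, w_k, w_v = (torch.randn(8, 8) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([4, 8])
```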