From Foundations to Deployment
An interactive journey through the core concepts, architectures, and practical applications of modern AI engineering. Explore the building blocks of systems like GPT and learn how to create your own intelligent applications.
Mind Map: AI Engineering Landscape
Explore the high-level overview derived from a NotebookLM mind map. Click the image to view it full-screen.
Part I: The Foundations of Modern LLMs
This section deconstructs the core scientific and engineering principles that power today's large language models. We'll explore the Transformer architecture, from the basic idea of next-token prediction to the sophisticated mechanics of the attention mechanism.
The Transformer Architecture
The core innovation that enabled modern LLMs. By replacing sequential RNNs with a parallelized attention mechanism, it allowed for training models at an unprecedented scale on GPUs.
Tokenization
The process of converting raw text into a sequence of numerical IDs. Modern models use subword tokenization, balancing vocabulary size with the ability to handle unknown words.
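As a concrete illustration, here is a minimal sketch of subword tokenization using the `tiktoken` library (an assumption for the example; any BPE tokenizer behaves similarly).

```python
# A minimal sketch of subword tokenization with tiktoken (one common BPE library).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a BPE vocabulary used by several OpenAI models

text = "Transformers tokenize unfamiliar words into subwords."
token_ids = enc.encode(text)                    # text -> list of integer IDs
pieces = [enc.decode([t]) for t in token_ids]   # inspect each subword piece

print(token_ids)
print(pieces)  # rare words are split into several subword pieces
```

Note how a word the vocabulary has never seen is still representable as a sequence of shorter, known pieces, which is what keeps the vocabulary compact.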
Embeddings & Positional Encoding
Tokens are mapped to high-dimensional vectors (embeddings) that capture semantic meaning. A positional encoding is added to give the model a sense of sequence order.
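The sketch below shows the sinusoidal positional encoding from the original Transformer paper, added elementwise to the token embeddings; many modern models instead use learned or rotary position embeddings.

```python
# A sketch of sinusoidal positional encoding from the original Transformer paper.
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    positions = np.arange(seq_len)[:, None]                       # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                            # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                              # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                         # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                         # odd dimensions: cosine
    return pe

# token_embeddings: (seq_len, d_model) looked up from the embedding table
# model_input = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
print(sinusoidal_positional_encoding(4, 8).round(2))
```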
AI Engineering Overview
- Build generative AI systems end-to-end
- Scale with data, infra, and evaluation
- Interdisciplinary: AI, data, product, UX
Understanding Foundation Models
- Evolution from LLMs to multimodal foundation models
- General-purpose capabilities via pretraining
- Adaptation with prompting and fine-tuning
Probabilistic Nature of AI
- Sampling process drives variability
- Open-ended outputs: creativity & hallucination
- Control via temperature, top-k, nucleus
The Attention Mechanism: Q, K, V Explained
The core of the Transformer is self-attention, which allows the model to weigh the importance of different words in a sequence. This is achieved using three learned vectors: Query, Key, and Value. Click on each component to learn its role.
Query (Q)
Represents the current word's "question" about the context. It's what the word is looking for.
Key (K)
Represents what a word "offers." It's like a label or index for the information in that token.
Attention Score
A high dot-product score between a Query and a Key indicates high relevance.
Value (V)
Represents the actual content or information of the word.
How it Works
The model calculates an attention score by comparing the **Query** of the current word with the **Key** of every other word. These scores are then used to create a weighted sum of all the **Value** vectors, producing a new, context-aware representation for the current word. Click on Q, K, or V to see a detailed explanation.
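To make this concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention; the weight matrices are random stand-ins for learned parameters.

```python
# A minimal NumPy sketch of scaled dot-product self-attention for a single head.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model); Wq/Wk/Wv: learned projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # Query-Key relevance scores
    weights = softmax(scores, axis=-1)        # attention weights per token
    return weights @ V                        # weighted sum of Value vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 8
X = rng.normal(size=(seq_len, d_model))                        # toy token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
context = self_attention(X, Wq, Wk, Wv)                        # (seq_len, d_head)
print(context.shape)
```

Each output row is a context-aware representation of the corresponding token, built from the Values of every token weighted by Query-Key relevance.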
Part II: Building Applications with LLM APIs
This section transitions from theory to practice, focusing on the essential skills for programmatically controlling pre-trained models. We'll cover API interaction, advanced prompt engineering, and the powerful technique of tool calling.
Interactive Flow: The Tool Calling Loop
Tool calling allows LLMs to interact with external systems. It's a multi-step conversation between your app and the model. Click each step to understand the flow.
Step 1: Request with Tools
The developer sends a prompt to the model, including a list of available `tools` defined by a JSON schema. This tells the model what functions it can potentially use.
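A hedged sketch of this first step using an OpenAI-style Python client; the model name and `get_weather` function are placeholders, and in the full loop your app would execute the chosen function and send its result back to the model for a final answer.

```python
# A sketch of a tool-calling request with an OpenAI-style chat client.
# The model name and get_weather tool are placeholders for this example.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)

# If the model chose a tool, it returns the function name and typed arguments;
# the app then executes the function and sends the result back in a follow-up turn.
call = response.choices[0].message.tool_calls[0]
print(call.function.name, json.loads(call.function.arguments))
```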
Structured Outputs: JSON and Schemas
Modern models support constrained generation for predictable outputs. Prefer JSON mode and function/tool schemas to reduce parsing errors and improve reliability.
JSON Mode
Ask the model to return strict JSON. Validate and reject malformed responses; retry with a shorter context if needed.
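A small sketch of that validate-and-retry pattern; `call_model` is a hypothetical helper that returns the model's raw text response.

```python
# A sketch of accepting only well-formed JSON, with a bounded retry.
# `call_model(prompt)` is a hypothetical helper returning the model's raw text.
import json

def get_json(prompt: str, call_model, max_attempts: int = 3) -> dict:
    for attempt in range(max_attempts):
        raw = call_model(prompt)
        try:
            return json.loads(raw)           # accept only parseable JSON
        except json.JSONDecodeError:
            # tighten the instruction (and optionally shorten the context) before retrying
            prompt = "Return ONLY valid JSON, with no commentary.\n" + prompt
    raise ValueError("model did not return valid JSON")
```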
Function Schemas
Define `tools` with JSON Schema. Models select a function and emit typed arguments, lowering post-processing complexity.
Schema Tips
- Mark required fields in `required`
- Use enums for finite choices
- Bound numbers with `minimum`/`maximum`
- Prefer small, composable functions (example schema below)
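A sketch of a tool schema that applies the tips above; the `book_table` function and its fields are invented for illustration.

```python
# A hypothetical tool definition using JSON Schema constraints.
book_table = {
    "type": "function",
    "function": {
        "name": "book_table",
        "description": "Reserve a restaurant table.",
        "parameters": {
            "type": "object",
            "properties": {
                "time": {"type": "string", "description": "ISO 8601 start time"},
                "party_size": {"type": "integer", "minimum": 1, "maximum": 12},  # bounded number
                "seating": {"type": "string", "enum": ["indoor", "outdoor", "bar"]},  # finite choices
            },
            "required": ["time", "party_size"],   # mark required fields
        },
    },
}
```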
Sampling Strategies
- Greedy (argmax over logits): Deterministic best-token decoding; low diversity.
- Temperature: Scales logits; lower is more deterministic.
- Top-k / Top-p: Limit the candidate set by count or cumulative probability mass (see the sampling sketch after this list).
- Beam search: Multiple hypotheses; costly and less used for open text.
- Constrained / JSON mode: Enforce schema for structured outputs.
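The NumPy sketch below combines temperature scaling with top-k and top-p (nucleus) filtering over a vector of logits.

```python
# A sketch of temperature, top-k, and top-p (nucleus) sampling from raw logits.
import numpy as np

def sample(logits, temperature=1.0, top_k=None, top_p=None, rng=np.random.default_rng()):
    logits = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]                # token indices by descending probability
    if top_k is not None:
        order = order[:top_k]                      # keep only the k most likely tokens
    if top_p is not None:
        cumulative = np.cumsum(probs[order])
        order = order[: np.searchsorted(cumulative, top_p) + 1]  # smallest set with mass >= top_p

    kept = probs[order] / probs[order].sum()       # renormalize over the candidate set
    return rng.choice(order, p=kept)

token_id = sample([2.0, 1.5, 0.3, -1.0], temperature=0.7, top_k=3, top_p=0.9)
print(token_id)
```

Lower temperature sharpens the distribution toward greedy decoding; higher values, or larger k and p, increase diversity at the risk of less coherent output.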
Prompt Engineering
- Clarity: Roles, goals, constraints, and examples (see the message-assembly sketch after this list).
- Context: Provide retrieved snippets and metadata.
- Defensive: Refusal guidance, jailbreak-resistant patterns.
- Tool-first: Encourage calling functions with schemas.
- Chain-of-thought (hidden): Let the model reason internally; expose only concise summaries to the user.
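A sketch of assembling a prompt from a role, constraints, retrieved context, and one worked example; the product name and snippets are invented.

```python
# A sketch of prompt assembly: role, constraints, grounding context, and an example.
def build_messages(question: str, retrieved_snippets: list[str]) -> list[dict]:
    system = (
        "You are a support assistant for Acme Analytics.\n"              # role (hypothetical product)
        "Answer only from the provided context; if unsure, say so.\n"    # constraint / refusal guidance
        "Cite snippet numbers like [1]."                                 # output format
    )
    context = "\n\n".join(f"[{i + 1}] {s}" for i, s in enumerate(retrieved_snippets))
    example = "Q: How do I reset my password?\nA: Use Settings > Security [1]."
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": f"Context:\n{context}\n\nExample:\n{example}\n\nQ: {question}"},
    ]

print(build_messages("How do refunds work?", ["Refunds are issued within 14 days."]))
```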
Part III: Engineering Intelligent Systems
Here, we explore how to construct complex systems that can perform multi-step tasks. This involves architecting "agents," grounding them in factual data with RAG, and designing multi-agent systems for collaboration.
The RAG Pipeline: Grounding LLMs in Your Data
Retrieval-Augmented Generation (RAG) reduces hallucinations by giving the LLM an "open book" of your data to consult before answering. Explore the three main stages of the pipeline; a minimal end-to-end sketch follows them.
1. Indexing
Documents are split into chunks, converted to vector embeddings, and stored in a vector database.
2. Retrieval
The user's query is embedded, and a similarity search finds the most relevant chunks from the database.
3. Generation
The retrieved chunks are added to the prompt, giving the LLM context to generate a grounded answer.
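A self-contained sketch of the three stages using in-memory NumPy vectors; `toy_embed` stands in for a real embedding model and the chunks are invented, whereas production systems use an embedding API plus a vector database.

```python
# An end-to-end RAG sketch with toy embeddings and an in-memory "vector store".
import numpy as np

def toy_embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for an embedding model (not semantically meaningful)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=dim)

def cosine_top_k(query_vec, doc_vecs, k=2):
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return np.argsort(d @ q)[::-1][:k]        # indices of the most similar chunks

# 1. Indexing: split documents into chunks and embed each one
chunks = [
    "Refunds are issued within 14 days of cancellation.",
    "Support is available 9:00-17:00 CET on weekdays.",
    "Enterprise plans include a dedicated account manager.",
]
doc_vecs = np.stack([toy_embed(c) for c in chunks])

# 2. Retrieval: embed the query and run a similarity search
query = "How long do refunds take?"
top_ids = cosine_top_k(toy_embed(query), doc_vecs)

# 3. Generation: prepend the retrieved chunks to the prompt as grounding context
prompt = "Answer using only this context:\n" + "\n".join(chunks[i] for i in top_ids) + f"\n\nQ: {query}"
print(prompt)
```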
RAG Best Practices
Chunking
Use semantic or heading-aware chunking with modest overlap (e.g., 128–256 tokens) to preserve coherence without duplication.
Hybrid Search
Combine dense vectors with keyword/BM25 to capture exact terms, numbers, and symbols.
Reranking
Apply cross-encoder rerankers on top-K candidates to improve precision while keeping latency manageable.
Filters & Metadata
Index rich metadata (source, section, date) and filter by scope/time to reduce noise.
Citations
Return source URLs/IDs with spans. Encourage grounded answers and enable auditability.
Evaluation
Measure retrieval precision/recall, answer faithfulness, and context sensitivity on a task-specific test set.
Multi-Agent Frameworks: A Comparison
To build systems where multiple agents collaborate, developers use frameworks that manage their interaction. The choice of framework involves a trade-off between control and ease of use.
| Feature | LangGraph | CrewAI |
|---|---|---|
| Core Abstraction | State graphs (nodes & edges) | Agents, tasks, crews |
| Control Level | Low-level, explicit control over state and transitions. | High-level, abstracts away orchestration. |
| Ease of Use | Steeper learning curve; requires thinking in graph concepts. | Beginner-friendly; intuitive role-playing paradigm. |
| Ideal Use Case | Complex, production-grade systems with custom control flows. | Rapid prototyping of collaborative agent teams. |
AI Agents
Components
- Planner
- Tools & Environment
- Memory & State
Planning & Tool Use
- Decompose tasks to toolable steps
- Validate preconditions and outputs
- Fallback paths and retries (see the sketch below)
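A small sketch of guarded tool execution with output validation, retries with backoff, and a fallback path; `primary` and `fallback` are hypothetical tool functions.

```python
# A sketch of guarded tool execution: validate outputs, retry with backoff, then fall back.
import time

def call_with_fallback(args: dict, primary, fallback, retries: int = 2):
    for attempt in range(retries):
        try:
            result = primary(**args)
            if result:                      # validate the output before trusting it
                return result
        except Exception:
            pass                            # swallow tool errors and retry
        time.sleep(2 ** attempt)            # simple exponential backoff
    return fallback(**args)                 # fallback path when the primary tool keeps failing
```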
Memory Systems
- Short-term: chat window/context
- Long-term: vector store/user profile
- Episodic/task memory: artifacts
Part IV: Advanced Customization & Evaluation
To build truly effective applications, we must customize models for specific tasks and rigorously evaluate their performance. This section covers fine-tuning with LoRA and the critical frameworks for agent evaluation.
Fine-Tuning Efficiency: LoRA vs. Full Fine-Tuning
Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA drastically reduce the computational cost of adapting models. This chart visualizes the difference in trainable parameters for a typical large model.
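A hedged sketch of attaching LoRA adapters with the `peft` library; the base checkpoint and `target_modules` are placeholders that vary by architecture.

```python
# A sketch of LoRA fine-tuning setup with peft; checkpoint and modules are placeholders.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # placeholder checkpoint

config = LoraConfig(
    r=16,                                   # rank of the low-rank update matrices
    lora_alpha=32,                          # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # attention projections to adapt (model-specific)
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()          # typically well under 1% of total parameters
```

Only the small adapter matrices are trained; the billions of base weights stay frozen, which is what the chart's parameter gap reflects.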
Evaluating Agent Performance
Evaluating agents requires looking beyond simple accuracy. A robust framework assesses the entire process, from tool use to response quality. Key metrics include:
- ✓ Task Performance: Success rate, accuracy, error rate.
- ✓ Efficiency: Cost (tokens consumed) and latency.
- ✓ Trajectory Evaluation: The sequence of actions, including tool-use precision and recall (see the sketch below).
- ✓ Safety & Ethics: Checks for bias and harmful content.
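A minimal sketch of trajectory-level tool-use precision and recall, comparing the tools an agent actually called against a reference list for the task.

```python
# A sketch of tool-use precision/recall for a single agent trajectory.
def tool_use_precision_recall(called: list[str], expected: list[str]) -> tuple[float, float]:
    called_set, expected_set = set(called), set(expected)
    hits = len(called_set & expected_set)
    precision = hits / len(called_set) if called_set else 1.0   # no calls means nothing spurious
    recall = hits / len(expected_set) if expected_set else 1.0
    return precision, recall

p, r = tool_use_precision_recall(
    called=["search_docs", "send_email"],        # what the agent did (hypothetical)
    expected=["search_docs", "create_ticket"],   # what the reference trajectory expected
)
print(p, r)   # 0.5 precision, 0.5 recall
```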
Preference Optimization
- RLHF: Policy optimized with a reward model; high quality but complex.
- DPO/ORPO: Direct optimization from preference pairs; simpler training, strong results (loss sketch below).
- Self-Play/Constitutional: Model critiques and revises outputs using principles to reduce human labeling.
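For intuition, here is the DPO objective for a single preference pair, written in terms of log-probabilities of the chosen and rejected responses under the policy and a frozen reference model; the numbers are illustrative only.

```python
# A sketch of the DPO loss for one preference pair:
#   L = -log sigmoid( beta * [ (log pi(y_w) - log pi_ref(y_w)) - (log pi(y_l) - log pi_ref(y_l)) ] )
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))   # -log sigmoid(margin)

# The loss shrinks as the policy raises the chosen response's likelihood
# relative to the rejected one, anchored to the reference model.
print(dpo_loss(logp_w=-12.0, logp_l=-15.0, ref_logp_w=-13.0, ref_logp_l=-14.0))
```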
Latency & Cost Optimizations
- Speculative Decoding: Draft model proposes tokens; larger model verifies for speedups.
- Caching: Reuse embeddings, retrieval results, and prompt prefixes; apply ETags for HTTP resources.
- Distillation: Train smaller student models for inference while preserving quality.
Advanced Fine-Tuning & Data
- PEFT techniques: LoRA, QLoRA, adapters for efficiency.
- Model merging: Blend specialist checkpoints carefully.
- Curation: High-quality instruction/preference pairs.
- Synthesis: Generate data with guardrails and review.
Inference Optimization
- Metrics: Latency (p50/p95), throughput, cost.
- Compression: Quantization, pruning, KV cache reuse.
- Service level: Batching, streaming, autoscaling.
Part V: Multimodal AI & Deployment
The final frontier is expanding beyond text and making our applications accessible to the world. This section covers voice and image generation, and the essentials of deploying apps with platforms like Hugging Face and Gradio.
Deployment in 5 Steps
With modern tools like Gradio and Hugging Face Spaces, deploying a live AI demo has never been easier. The process abstracts away complex web infrastructure; a minimal `app.py` sketch follows the steps.
Write Script
Create `app.py` with Gradio UI.
Define Deps
List libraries in `requirements.txt`.
Create Space
On Hugging Face, select Gradio SDK.
Push Code
Use Git to upload your files.
Go Live!
Your app is automatically deployed.
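The whole flow reduces to a small script. Here is a minimal `app.py` sketch for step one; `answer` is a placeholder for your real model call, and the `requirements.txt` from step two would list `gradio`.

```python
# A minimal app.py sketch for a Gradio demo; replace `answer` with a real model call.
import gradio as gr

def answer(question: str) -> str:
    # Placeholder logic keeps the demo self-contained and runnable.
    return f"You asked: {question}"

demo = gr.Interface(
    fn=answer,
    inputs=gr.Textbox(label="Question"),
    outputs=gr.Textbox(label="Answer"),
    title="AI Engineering Demo",
)

if __name__ == "__main__":
    demo.launch()   # Spaces starts this automatically once the repo is pushed
```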
Production Ops
Operational excellence keeps LLM systems reliable and affordable. Track quality, cost, and safety with rigorous monitoring.
Observability
- Structured logs with request IDs
- Traces across retrieval, tools, and model calls
- Prompt/version tracking
Cost Control
- Token and tool budgets per request
- Early exit and truncation policies
- Autoscaling and request batching
Reliability
- Retries with backoff and idempotency keys
- Circuit breaking on upstream errors
- Chaos testing for tool failures
User Feedback Loop
- Collect thumbs up/down and free-form feedback
- Capture input, retrieval context, and tool traces
- Triage to datasets for supervised or preference tuning
Safety & Governance
Build in safeguards and continuous evaluation to mitigate risk while maintaining utility.
Runtime Controls
- Input/output filters and PII redaction
- Tool allowlists and rate limits
- Safety classifiers with blocklists and context-aware rules
Red Teaming & Audits
- Adversarial prompts and jailbreak testing
- Dataset audits for bias and leakage
- Post-incident reviews and action tracking