Part IV: Advanced Customization & Evaluation
LoRA vs full fine-tuning, preference optimization, and latency/cost techniques.
Fine-Tuning Efficiency
- LoRA vs Full Fine-Tuning: adapter methods update a small fraction of weights for far lower memory and compute; full fine-tuning updates every parameter.
Evaluating Agent Performance
- Task Performance: Success rate, accuracy.
- Efficiency: Cost and latency.
- Trajectory: Tool-use precision and recall (see the metrics sketch after this list).
- Safety & Ethics: Bias and harm checks.
Preference Optimization
- RLHF: Reward-model-based; powerful but operationally complex.
- DPO/ORPO: Train directly on preference pairs (see the loss sketch after this list).
- Constitutional/Self-Play: AI feedback reduces human labeling.
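The DPO loss below is a compact sketch of the published formulation: it takes per-sequence log-probabilities for the chosen and rejected responses under the policy and a frozen reference model, and pushes the implicit reward margin apart. The function name and the `beta` default are illustrative choices, not fixed by any library.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss over a batch of preference pairs.

    Each argument is a tensor of per-sequence log-probabilities, shape (batch,).
    """
    # Implicit rewards: how far the policy has moved from the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Encourage a large chosen-minus-rejected margin.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

if __name__ == "__main__":
    b = 4
    loss = dpo_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b))
    print(loss.item())
```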
Latency & Cost
- Speculative Decoding: A small draft model proposes tokens that the target model verifies, giving speedups while preserving the target model's output distribution.
- Caching: Reuse embeddings, retrieval results, and prompt prefixes (see the cache sketch after this list).
- Distillation: Train smaller student models from a larger teacher for cheaper inference.
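One simple way to reuse repeated work is an LRU map keyed on a hash of the request text. The sketch below is a generic in-process version; the class name, eviction size, and `get_or_compute` interface are assumptions made for illustration, and a shared store would typically replace the in-memory dict in production.

```python
import hashlib
from collections import OrderedDict
from typing import Callable

class LRUResponseCache:
    """Tiny LRU cache keyed on a hash of the request text.

    Useful for reusing embeddings, retrieval results, or full completions
    across repeated or identical requests.
    """
    def __init__(self, max_items: int = 10_000):
        self._store = OrderedDict()   # key -> cached value, in recency order
        self._max_items = max_items

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get_or_compute(self, text: str, compute: Callable[[str], object]) -> object:
        k = self._key(text)
        if k in self._store:
            self._store.move_to_end(k)       # mark as recently used
            return self._store[k]
        value = compute(text)                # cache miss: do the expensive call
        self._store[k] = value
        if len(self._store) > self._max_items:
            self._store.popitem(last=False)  # evict least recently used
        return value

if __name__ == "__main__":
    cache = LRUResponseCache(max_items=2)
    fake_embed = lambda t: [len(t), sum(map(ord, t)) % 997]  # stand-in for a real embedder
    print(cache.get_or_compute("hello", fake_embed))
    print(cache.get_or_compute("hello", fake_embed))  # served from cache
```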
Advanced Fine-Tuning & Data
- PEFT: LoRA, QLoRA, and adapters (a LoRA sketch follows this list).
- Model Merging: Combine specialist checkpoints (e.g., by weight averaging), checking that capabilities do not interfere.
- Curation: High-quality instruction and preference data.
- Synthesis: Guardrailed synthetic generation followed by review.
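To make the LoRA idea concrete, here is a hand-rolled low-rank adapter wrapped around a frozen `nn.Linear`. In practice a library such as Hugging Face PEFT handles this; treat the class below as an illustrative sketch of the technique, not the library's implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (LoRA).

    Output = base(x) + (alpha / r) * B(A(x)), where only A and B are trained.
    """
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # freeze the pretrained weights
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)     # start as a no-op update
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))

if __name__ == "__main__":
    layer = LoRALinear(nn.Linear(768, 768), r=8)
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    total = sum(p.numel() for p in layer.parameters())
    print(f"trainable {trainable} / total {total} ({100 * trainable / total:.1f}%)")
    print(layer(torch.randn(2, 768)).shape)
```

The printout shows the core trade-off: only a few percent of the layer's parameters receive gradients, which is what makes adapter fine-tuning cheap in memory and optimizer state.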
Inference Optimization
- Metrics: p50/p95 latency, throughput, and cost per request (computed in the sketch after this list).
- Compression: Quantization, pruning, and KV-cache reuse.
- Serving: Batching, streaming, and autoscaling.
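A small helper that turns raw request latencies into the headline serving metrics. The index-based percentile and the report field names are assumptions made for illustration; a real dashboard would compute these over sliding windows.

```python
def latency_report(latencies_ms: list[float], total_tokens: int,
                   cost_usd: float, window_s: float) -> dict[str, float]:
    """Summarize serving metrics: percentile latency, throughput, and unit cost."""
    ordered = sorted(latencies_ms)

    def pct(p: float) -> float:
        # approximate percentile by index into the sorted latencies
        idx = min(len(ordered) - 1, max(0, round(p / 100 * (len(ordered) - 1))))
        return ordered[idx]

    return {
        "p50_ms": pct(50),
        "p95_ms": pct(95),
        "requests_per_s": len(ordered) / window_s,
        "tokens_per_s": total_tokens / window_s,
        "usd_per_1k_tokens": 1000 * cost_usd / total_tokens,
    }

if __name__ == "__main__":
    import random
    lats = [random.gauss(120, 30) for _ in range(1000)]
    print(latency_report(lats, total_tokens=250_000, cost_usd=0.75, window_s=60.0))
```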