Part IV: Advanced Customization & Evaluation

LoRA versus full fine-tuning, preference optimization, and techniques for cutting latency and cost.

Fine-Tuning Efficiency

  • Full Fine-Tuning: Updates every weight; highest quality ceiling, but the most memory and compute.
  • LoRA/QLoRA: Trains small low-rank adapters on frozen base weights; near-parity quality at a small fraction of the trainable parameters (see the sketch below).

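A minimal sketch of the LoRA setup using Hugging Face's peft library. The base model name, rank, and target module names are placeholder assumptions for a Llama-style architecture, not recommendations.

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # Placeholder checkpoint; swap in your own base model.
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

    # LoRA: freeze the base weights and train small low-rank adapters
    # injected into the attention projections.
    config = LoraConfig(
        r=16,                 # adapter rank (assumed; tune per task)
        lora_alpha=32,        # scaling factor for the adapter update
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # assumed names for Llama-style models
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, config)
    model.print_trainable_parameters()  # typically well under 1% of total parameters
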
Evaluating Agent Performance

  • Task Performance: Success rate, accuracy.
  • Efficiency: Cost and latency.
  • Trajectory: Tool-use precision and recall against reference trajectories (see the sketch after this list).
  • Safety & Ethics: Bias and harm checks.
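A hypothetical helper for the trajectory metric: scoring an agent's tool calls against a reference trajectory. Representing each call as a (tool, args) pair matched exactly is a simplifying assumption; production evaluators often allow partial argument matches or enforce ordering.

    def tool_use_metrics(predicted_calls, reference_calls):
        """Precision/recall of tool calls, treating each (tool, args) pair as a set element."""
        pred, ref = set(predicted_calls), set(reference_calls)
        true_positives = len(pred & ref)
        precision = true_positives / len(pred) if pred else 0.0
        recall = true_positives / len(ref) if ref else 0.0
        return precision, recall

    # Example: one correct call, one spurious call, one missed call.
    p, r = tool_use_metrics(
        predicted_calls=[("search", "q=weather"), ("calculator", "2+2")],
        reference_calls=[("search", "q=weather"), ("email", "to=bob")],
    )
    # p == 0.5, r == 0.5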

Preference Optimization

  • RLHF: Trains a reward model, then optimizes the policy against it; powerful but operationally complex.
  • DPO/ORPO: Train directly on preference pairs, skipping the separate reward model (DPO loss sketched below).
  • Constitutional/Self-Play: AI feedback reduces human labeling.
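The DPO objective is simple enough to sketch directly: it pushes the policy's log-probability margin on (chosen, rejected) pairs above the reference model's margin. This assumes per-sequence log-probabilities have already been summed over tokens.

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logp, policy_rejected_logp,
                 ref_chosen_logp, ref_rejected_logp, beta=0.1):
        """DPO loss over a batch of preference pairs.

        Each argument is a 1-D tensor of summed per-sequence log-probs.
        beta controls how far the policy may drift from the reference.
        """
        chosen_margin = policy_chosen_logp - ref_chosen_logp
        rejected_margin = policy_rejected_logp - ref_rejected_logp
        # -log sigmoid(beta * (chosen margin - rejected margin))
        return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

    # Example with a batch of two pairs (values illustrative):
    loss = dpo_loss(
        policy_chosen_logp=torch.tensor([-12.0, -9.5]),
        policy_rejected_logp=torch.tensor([-14.0, -9.0]),
        ref_chosen_logp=torch.tensor([-12.5, -10.0]),
        ref_rejected_logp=torch.tensor([-13.0, -9.2]),
    )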

Latency & Cost

  • Speculative Decoding: A cheap draft model proposes tokens; the target model verifies them in one pass (sketched after this list).
  • Caching: Reuse embeddings, retrieval results, and prompt-prefix KV caches.
  • Distillation: Train smaller student models for cheaper inference.
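A greedy draft-and-verify sketch of speculative decoding. draft_argmax and target_argmax are hypothetical callables standing in for the two models; full speculative sampling also handles non-greedy decoding via rejection sampling, which is omitted here.

    def speculative_decode(target_argmax, draft_argmax, prompt_ids, k=4, max_new=32):
        """Greedy speculative decoding sketch.

        draft_argmax(seq)  -> one greedy next token from the cheap model.
        target_argmax(seq) -> list of greedy next tokens at EVERY position
                              (one batched forward pass); both hypothetical.
        """
        out = list(prompt_ids)
        while len(out) - len(prompt_ids) < max_new:
            # 1. Cheap model drafts k tokens autoregressively.
            draft = []
            for _ in range(k):
                draft.append(draft_argmax(out + draft))
            # 2. Target scores the whole draft in one pass; verify[m] is the
            #    target's greedy token after the first m+1 tokens.
            verify = target_argmax(out + draft)
            # 3. Accept the longest prefix where the target agrees with the draft.
            n_accept = 0
            while n_accept < k and verify[len(out) + n_accept - 1] == draft[n_accept]:
                n_accept += 1
            out += draft[:n_accept]
            # 4. Always append one token from the target (the first disagreement,
            #    or the token after a fully accepted draft), guaranteeing progress.
            out.append(verify[len(out) - 1])
        return out[: len(prompt_ids) + max_new]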

Advanced Fine-Tuning & Data

  • PEFT: LoRA, QLoRA, adapters.
  • Model Merging: Blend specialist checkpoints carefully (a simple linear merge is sketched after this list).
  • Curation: High-quality instruction/preference data.
  • Synthesis: Guardrailed generation + review.
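As a sketch of the simplest merging strategy, a weighted average of parameter tensors ("model soup" style). The function name and structure are illustrative assumptions; methods such as TIES or DARE add sign-conflict handling that this omits.

    import torch

    def linear_merge(state_dicts, weights):
        """Weighted average of same-architecture checkpoints.

        state_dicts: list of model state_dicts with identical keys and shapes.
        weights: mixing coefficients, assumed to sum to 1.
        """
        assert abs(sum(weights) - 1.0) < 1e-6, "weights should sum to 1"
        merged = {}
        for key in state_dicts[0]:
            merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
        return merged

    # merged = linear_merge([coder_sd, writer_sd], weights=[0.6, 0.4])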

Inference Optimization

  • Metrics: Track p50/p95 latency, throughput, and cost per request (computation sketched below).
  • Compression: Quantization, pruning, KV cache reuse.
  • Service: Batching, streaming, autoscaling.
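A minimal sketch of the serving metrics, assuming you log per-request latency and output token counts; nearest-rank percentiles are good enough for dashboards.

    def percentile(values, p):
        """Nearest-rank percentile over a list of samples."""
        ranked = sorted(values)
        idx = min(len(ranked) - 1, round(p / 100 * (len(ranked) - 1)))
        return ranked[idx]

    def serving_metrics(latencies_s, output_tokens, window_s):
        """p50/p95 latency plus token throughput over a time window."""
        return {
            "p50_s": percentile(latencies_s, 50),
            "p95_s": percentile(latencies_s, 95),
            "tokens_per_s": sum(output_tokens) / window_s,
        }

    # Example: 5 requests observed over a 10-second window.
    stats = serving_metrics(
        latencies_s=[0.8, 1.1, 0.9, 2.4, 1.0],
        output_tokens=[120, 200, 150, 400, 180],
        window_s=10.0,
    )
    # stats["p95_s"] == 2.4, stats["tokens_per_s"] == 105.0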