FastDistill — Production-grade Data Distillation Framework
Bridge the gap between raw data and high-quality student model training. Generate trustworthy datasets through a unified provider gateway, deterministic data contracts, and rigorous quality gates.
Rich step and task components covering the full pipeline of data generation, evaluation, and filtering.
A full-stack data distillation solution built for ML engineers
Seamlessly integrate OpenAI-compatible endpoints (vLLM, SGLang, Ollama, OpenRouter) and switch teacher models with a single config change.
Canonical inputs + sample IDs + manifests ensure full reproducibility and auditability.
Multi-stage filtering: rules → execution → LLM judge, ensuring data quality at every step.
Auto-generated per-stage timing and quality reports for quick bottleneck identification.
Data generation and model training are fully separated, supporting MLX LoRA, vLLM, and other backends.
YAML configuration with environment/runtime layered overrides for easy reproduction and sharing.
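Layered overrides of this kind usually come down to a deep merge in which later layers win: base YAML, then an environment overlay, then runtime flags. A minimal sketch of that merge order (the layer contents and key names here are illustrative, not FastDistill's actual schema):

```python
from copy import deepcopy

def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge `override` into `base`; later layers win on conflicts."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Hypothetical layers, as they would look after YAML parsing.
base = {"teacher": {"model": "qwen3:0.6b", "temperature": 0.7}, "filter": {"judge": True}}
env_overlay = {"teacher": {"model": "qwen3:8b"}}   # e.g. a production environment file
runtime = {"teacher": {"temperature": 0.2}}        # e.g. CLI flag overrides

config = deep_merge(deep_merge(base, env_overlay), runtime)
print(config["teacher"])  # model from the env layer, temperature from the runtime layer
```

Because each layer is a plain mapping, the fully merged config can be written back out alongside the run's manifest, which is what makes a run easy to reproduce and share.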
End-to-end data flow from raw data to distilled datasets
Normalize raw inputs into a stable data schema
Deduplicate using MinHash and embedding-based similarity
Generate candidate data using high-quality teacher LLMs
Rule filtering → execution eval → LLM judge
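One common way to realize the MinHash step above is word-shingle signatures compared by estimated Jaccard similarity: near-duplicate samples share most shingles, so their signatures agree in most slots. A self-contained sketch (shingle size and signature length are illustrative choices, not FastDistill's defaults):

```python
import hashlib

def shingles(text: str, k: int = 3) -> set:
    """k-word shingles of a whitespace-tokenized, lowercased text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(text: str, num_hashes: int = 128) -> list:
    """One slot per seeded hash function: the minimum hash over all shingles."""
    sets = shingles(text)
    return [
        min(
            int.from_bytes(
                hashlib.blake2b(seed.to_bytes(2, "big") + s.encode(),
                                digest_size=8).digest(), "big")
            for s in sets
        )
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching slots approximates the true Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = "SELECT name FROM users WHERE id = 1"
b = "SELECT name FROM users WHERE id = 1"   # exact duplicate of a
c = "totally different instruction about cooking pasta at home"

sim_dup = estimated_jaccard(minhash_signature(a), minhash_signature(b))    # 1.0
sim_diff = estimated_jaccard(minhash_signature(a), minhash_signature(c))   # near 0.0
```

At corpus scale, signatures would typically be bucketed with locality-sensitive hashing rather than compared pairwise, so only candidate pairs in the same bucket need the full similarity check.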
Just a few commands to launch your data distillation pipeline
# Install FastDistill
pip install fastdistill
# Or install with Ollama support for local distillation
pip install "fastdistill[ollama]"
# Run end-to-end distillation example
OLLAMA_MODEL=qwen3:0.6b python examples/fastdistill/ollama_distill_e2e.py
# Agent distillation: one-command training
fastdistill agent distill --task "Build a SQL query helper"
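What makes teacher swapping this easy is that all of the supported backends speak the same OpenAI-style `/v1/chat/completions` protocol, so only the base URL and model name change between providers. A stdlib-only sketch of that idea (the endpoint registry below uses each server's usual local default port, but real deployments would load it from config):

```python
import json

# Illustrative endpoint registry keyed by provider name.
PROVIDERS = {
    "vllm":   "http://localhost:8000/v1",
    "ollama": "http://localhost:11434/v1",
    "sglang": "http://localhost:30000/v1",
}

def build_chat_request(provider: str, model: str, prompt: str) -> tuple[str, bytes]:
    """Build the (url, body) pair for an OpenAI-compatible chat completion call."""
    url = f"{PROVIDERS[provider]}/chat/completions"
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return url, body

url, body = build_chat_request("ollama", "qwen3:0.6b", "Write a SQL query for top users.")
# With a server running, send via urllib.request or any HTTP client;
# switching teachers means changing only `provider` and `model`.
```

The same request shape works against hosted gateways such as OpenRouter as well, with the base URL pointing at the remote service and an API key added to the request headers.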
WikiSQL 1k Text2SQL distillation experiment results
Join the FastDistill community and explore the limitless possibilities of synthetic data and AI feedback.