Now open source

AI Agent for ML Engineers
Train Any Model with One Click

FastDistill — Production-grade Data Distillation Framework

Bridge the gap between raw data and high-quality student model training. Generate trustworthy datasets through a unified provider gateway, deterministic data contracts, and rigorous quality gates.

Core Capabilities

Rich step and task components covering the full pipeline of data generation, evaluation, and filtering

💬
ChatGeneration
Chat generation tasks with multi-turn dialogue and role-playing support for building instruction-following datasets.
Generation Dialogue
🔄
SelfInstruct
Self-guided data generation that automatically expands diverse instruction data from seed tasks.
Augmentation Bootstrap
📈
EvolComplexity
Evolutionary complexity enhancement that increases instruction complexity through multi-round evolution, pushing model capability boundaries.
Evolution Complexity
⭐
UltraFeedback
UltraFeedback quality assessment with multi-dimensional scoring to precisely identify high-quality samples.
Evaluation Quality
🔍
DEITA Filtering
Complexity and quality-based data filtering that retains the most valuable training samples.
Filtering Selection
🎯
SQLiteExecEval
SQL execution evaluation that verifies generated SQL correctness in a sandbox, an essential check for Text2SQL pipelines.
Execution Verification
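The SelfInstruct step above expands new instructions from a handful of seed tasks by prompting a teacher model. A minimal sketch of the prompt-building half, assuming illustrative names (`build_expansion_prompt`, the seed tasks, and the template are not FastDistill's actual API):

```python
# Illustrative SelfInstruct-style expansion prompt builder.
# The seed tasks and template wording are assumptions for the sketch.

SEED_TASKS = [
    "Summarize the following article in two sentences.",
    "Translate this sentence into French.",
]

PROMPT_TEMPLATE = (
    "You are generating new task instructions.\n"
    "Here are {n} example tasks:\n{examples}\n"
    "Write {k} new, diverse instructions in the same style."
)

def build_expansion_prompt(seeds, k=5):
    """Format seed tasks into a single teacher prompt."""
    examples = "\n".join(f"{i + 1}. {t}" for i, t in enumerate(seeds))
    return PROMPT_TEMPLATE.format(n=len(seeds), examples=examples, k=k)

prompt = build_expansion_prompt(SEED_TASKS, k=3)
print(prompt)
```

The teacher's completions would then be parsed back into new seed tasks, letting the pool bootstrap itself round after round.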
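UltraFeedback-style evaluation scores each sample along several dimensions and aggregates them into one number. A sketch under assumed dimension names and equal default weights (not the shipped defaults):

```python
# Illustrative multi-dimensional score aggregation; dimension names
# and weights are assumptions, not FastDistill's configuration.

DIMENSIONS = ("helpfulness", "honesty", "instruction_following", "truthfulness")

def overall_score(scores, weights=None):
    """Weighted mean over per-dimension judge scores (e.g. 1-10 scale)."""
    weights = weights or {d: 1.0 for d in DIMENSIONS}
    total = sum(scores[d] * weights[d] for d in DIMENSIONS)
    return total / sum(weights[d] for d in DIMENSIONS)

sample = {"helpfulness": 8, "honesty": 9,
          "instruction_following": 7, "truthfulness": 9}
print(overall_score(sample))  # 8.25
```

Samples scoring above a threshold on the aggregate (or on every dimension) would be kept as high-quality training data.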
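The execution-verification idea behind SQLiteExecEval can be sketched with the standard library alone: run the gold and predicted queries against an in-memory SQLite sandbox and compare result sets. The helper name and schema here are illustrative, not FastDistill's API:

```python
import sqlite3

# Minimal execution-based SQL check in a SQLite sandbox.

def exec_match(setup_sql, gold_sql, pred_sql):
    """Return True if the predicted SQL yields the same rows as the gold SQL."""
    def run(query):
        con = sqlite3.connect(":memory:")
        try:
            con.executescript(setup_sql)
            return sorted(con.execute(query).fetchall())
        except sqlite3.Error:
            return None  # invalid SQL never matches
        finally:
            con.close()
    gold, pred = run(gold_sql), run(pred_sql)
    return gold is not None and gold == pred

SETUP = ("CREATE TABLE users(id INT, name TEXT);"
         "INSERT INTO users VALUES (1,'a'),(2,'b');")
print(exec_match(SETUP, "SELECT name FROM users",
                 "SELECT name FROM users ORDER BY id"))  # True
print(exec_match(SETUP, "SELECT name FROM users",
                 "SELECT id FROM users"))  # False
```

Sorting the fetched rows makes the comparison order-insensitive, which is the usual convention for execution-match metrics.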

Core Features

A full-stack data distillation solution built for ML engineers

🛡️

Unified Provider Gateway

Seamlessly integrate OpenAI-compatible endpoints (vLLM, SGLang, Ollama, OpenRouter) and switch teacher models with one click.
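Because all of these servers speak the OpenAI-compatible protocol, switching teachers reduces to swapping a base URL and model name. A sketch with an assumed registry shape (the `Provider` class and the vLLM model name are illustrative, not FastDistill's gateway):

```python
from dataclasses import dataclass

# Illustrative provider registry for OpenAI-compatible endpoints.

@dataclass(frozen=True)
class Provider:
    name: str
    base_url: str
    model: str

PROVIDERS = {
    "vllm":   Provider("vllm",   "http://localhost:8000/v1",  "qwen3-8b"),
    "ollama": Provider("ollama", "http://localhost:11434/v1", "qwen3:0.6b"),
}

def switch_teacher(name):
    """Swap the teacher backend by name; the request payload stays identical."""
    p = PROVIDERS[name]
    return {"base_url": p.base_url, "model": p.model}

print(switch_teacher("ollama"))
```

The rest of the pipeline never sees the difference: generation code talks to one gateway interface regardless of which backend serves it.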

🔒

Deterministic Data Contracts

Canonical inputs + sample IDs + manifests ensure full reproducibility and auditability.
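One common way to get deterministic sample IDs, sketched here as an assumption about the approach (the function name is illustrative): hash the canonical, sorted-key JSON form of each record so the same content always yields the same ID, regardless of key order.

```python
import hashlib
import json

# Sketch: deterministic sample ID from canonical JSON.

def sample_id(record: dict) -> str:
    canonical = json.dumps(record, sort_keys=True,
                           separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

a = sample_id({"question": "q1", "answer": "a1"})
b = sample_id({"answer": "a1", "question": "q1"})  # key order must not matter
print(a == b)  # True
```

Stable IDs like this are what make manifests auditable: re-running a stage on the same inputs reproduces the same IDs, so diffs point only at real changes.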

🚦

Rigorous Quality Gates

Multi-stage filtering: rules → execution → LLM judge, ensuring data quality at every step.
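The gate ordering matters: cheap rule checks run first so that expensive execution and LLM-judge stages only see surviving samples. A sketch with stand-in stage functions (the checks and threshold are assumptions, not FastDistill's defaults):

```python
# Illustrative multi-stage quality gate: rules -> execution -> LLM judge.
# Each stage function is a stand-in for a real check.

def rule_check(s):
    return len(s["sql"]) > 0 and ";" not in s["sql"]

def exec_check(s):
    return s.get("exec_ok", False)        # stand-in for a sandbox run

def judge_check(s):
    return s.get("judge_score", 0) >= 7   # stand-in for an LLM judge call

GATES = [rule_check, exec_check, judge_check]

def passes(sample):
    """all() short-circuits: later, costlier gates run only if earlier ones pass."""
    return all(gate(sample) for gate in GATES)

good = {"sql": "SELECT 1", "exec_ok": True, "judge_score": 8}
bad = {"sql": "", "exec_ok": True, "judge_score": 9}
print(passes(good), passes(bad))  # True False
```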

📊

Detailed Observability

Auto-generated per-stage timing and quality reports for quick bottleneck identification.
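Per-stage timing needs nothing more than wrapping each stage call with a clock; the report shape below is an assumption for the sketch, not FastDistill's output format:

```python
import time

# Illustrative per-stage timing collector.

class StageTimer:
    def __init__(self):
        self.timings = {}

    def run(self, name, fn, *args):
        """Execute a stage, recording its wall-clock duration."""
        start = time.perf_counter()
        result = fn(*args)
        self.timings[name] = time.perf_counter() - start
        return result

    def report(self):
        """Share of total pipeline time per stage."""
        total = sum(self.timings.values()) or 1.0
        return {k: f"{v / total:.0%}" for k, v in self.timings.items()}

timer = StageTimer()
timer.run("dedup", lambda xs: sorted(set(xs)), [3, 1, 1, 2])
timer.run("generate", lambda: time.sleep(0.01))
print(timer.report())
```

A report like this makes the usual bottleneck (teacher generation dominating everything else) obvious at a glance.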

🧩

Decoupled Training Pipeline

Data generation and model training are fully separated, supporting MLX LoRA, vLLM, and other backends.

🎛️

Config-Driven

YAML configuration with environment/runtime layered overrides for easy reproduction and sharing.
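Layered overrides resolve to a single effective config by deep-merging each layer over the last. Shown here with plain dicts standing in for parsed YAML; the keys and values are illustrative:

```python
# Sketch of layered config resolution: base <- env layer <- runtime flags.

def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base, returning a new dict."""
    out = dict(base)
    for k, v in override.items():
        if isinstance(v, dict) and isinstance(out.get(k), dict):
            out[k] = deep_merge(out[k], v)
        else:
            out[k] = v
    return out

base = {"teacher": {"model": "deepseek-v3", "temperature": 0.7}, "batch": 32}
env = {"teacher": {"model": "qwen3:0.6b"}}
runtime = {"batch": 8}

config = deep_merge(deep_merge(base, env), runtime)
print(config)  # {'teacher': {'model': 'qwen3:0.6b', 'temperature': 0.7}, 'batch': 8}
```

Because each layer only states what it changes, sharing a run means sharing the base YAML plus a small override file.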

Workflow

End-to-end data flow from raw data to distilled datasets

1

Ingest & Normalize

Normalize raw inputs into a stable data schema

2

Deduplication

Remove near-duplicate samples using MinHash signatures and embedding similarity

3

Teacher Generation

Generate candidate data using high-quality teacher LLMs

4

Quality Gate

Rule filtering → execution eval → LLM judge
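The MinHash deduplication in step 2 can be sketched in a few lines: shingle each text, take the minimum hash per seeded hash function, and estimate Jaccard similarity from matching signature slots. This is a toy version for intuition; production pipelines add banded LSH for sub-quadratic candidate lookup, and the signature size and shingle width here are arbitrary choices:

```python
import hashlib

# Toy MinHash near-duplicate estimator.

def shingles(text, width=3):
    """Set of word n-grams for the text."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + width])
            for i in range(max(1, len(tokens) - width + 1))}

def minhash(shingle_set, num_hashes=64):
    """Signature: per seed, the minimum hash over all shingles."""
    return [min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
                for s in shingle_set)
            for seed in range(num_hashes)]

def similarity(a, b):
    """Estimated Jaccard similarity: fraction of matching signature slots."""
    sa, sb = minhash(shingles(a)), minhash(shingles(b))
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)

near = similarity("the quick brown fox jumps", "the quick brown fox leaps")
far = similarity("the quick brown fox jumps", "completely unrelated sentence here")
print(near, far)
```

Pairs whose estimated similarity exceeds a threshold would be collapsed to one representative before teacher generation, so the teacher's budget is never spent on duplicates.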

Quick Start

Just a few commands to launch your data distillation pipeline

# Install FastDistill
pip install fastdistill

# Or install with Ollama support for local distillation
pip install "fastdistill[ollama]"

# Run end-to-end distillation example
OLLAMA_MODEL=qwen3:0.6b python examples/fastdistill/ollama_distill_e2e.py

# Agent distillation - one-click training
fastdistill agent distill --task "Build a SQL query helper"

Proven Results

WikiSQL 1k Text2SQL distillation experiment results

Teacher Model Baseline (DeepSeek V3.2): 11.9%
Gold Match Accuracy (strict match metric): 30.9%

Start Building Your Data Pipeline

Join the FastDistill community and explore what synthetic data and AI feedback can do for your training pipelines.