Now open source

AI Agent for ML Engineers
Train Any Model with One Click

FastDistill — Production-grade Data Distillation Framework

Bridge the gap between raw data and high-quality student model training. Generate trustworthy datasets through a unified provider gateway, deterministic data contracts, and rigorous quality gates.

Core Capabilities

Rich step and task components covering the full pipeline of data generation, evaluation, and filtering

💬
ChatGeneration
Chat generation tasks with multi-turn dialogue and role-playing support for building instruction-following datasets.
Generation Dialogue
🔄
SelfInstruct
Self-guided data generation that automatically expands diverse instruction data from seed tasks.
Augmentation Bootstrap
📈
EvolComplexity
Evolutionary complexity enhancement that increases instruction complexity through multi-round evolution, pushing model capability boundaries.
Evolution Complexity
⭐
UltraFeedback
UltraFeedback quality assessment with multi-dimensional scoring to precisely identify high-quality samples.
Evaluation Quality
🔍
DEITA Filtering
Complexity and quality-based data filtering that retains the most valuable training samples.
Filtering Selection
🎯
SQLiteExecEval
SQL execution evaluation that verifies generated SQL correctness in a sandbox, an essential check for Text2SQL pipelines.
Execution Verification
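The SelfInstruct step above expands new instructions from a handful of seed tasks by prompting a teacher model. A minimal sketch of the prompt-building half, assuming illustrative names (`build_expansion_prompt`, the seed tasks, and the template are not FastDistill's actual API):

```python
# Illustrative SelfInstruct-style expansion prompt builder.
# The seed tasks and template wording are assumptions for the sketch.

SEED_TASKS = [
    "Summarize the following article in two sentences.",
    "Translate this sentence into French.",
]

PROMPT_TEMPLATE = (
    "You are generating new task instructions.\n"
    "Here are {n} example tasks:\n{examples}\n"
    "Write {k} new, diverse instructions in the same style."
)

def build_expansion_prompt(seeds, k=5):
    """Format seed tasks into a single teacher prompt."""
    examples = "\n".join(f"{i + 1}. {t}" for i, t in enumerate(seeds))
    return PROMPT_TEMPLATE.format(n=len(seeds), examples=examples, k=k)

prompt = build_expansion_prompt(SEED_TASKS, k=3)
print(prompt)
```

The teacher's completions would then be parsed back into new seed tasks, letting the pool bootstrap itself round after round.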
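UltraFeedback-style evaluation scores each sample along several dimensions and aggregates them into one number. A sketch under assumed dimension names and equal default weights (not the shipped defaults):

```python
# Illustrative multi-dimensional score aggregation; dimension names
# and weights are assumptions, not FastDistill's configuration.

DIMENSIONS = ("helpfulness", "honesty", "instruction_following", "truthfulness")

def overall_score(scores, weights=None):
    """Weighted mean over per-dimension judge scores (e.g. 1-10 scale)."""
    weights = weights or {d: 1.0 for d in DIMENSIONS}
    total = sum(scores[d] * weights[d] for d in DIMENSIONS)
    return total / sum(weights[d] for d in DIMENSIONS)

sample = {"helpfulness": 8, "honesty": 9,
          "instruction_following": 7, "truthfulness": 9}
print(overall_score(sample))  # 8.25
```

Samples scoring above a threshold on the aggregate (or on every dimension) would be kept as high-quality training data.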
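The execution-verification idea behind SQLiteExecEval can be sketched with the standard library alone: run the gold and predicted queries against an in-memory SQLite sandbox and compare result sets. The helper name and schema here are illustrative, not FastDistill's API:

```python
import sqlite3

# Minimal execution-based SQL check in a SQLite sandbox.

def exec_match(setup_sql, gold_sql, pred_sql):
    """Return True if the predicted SQL yields the same rows as the gold SQL."""
    def run(query):
        con = sqlite3.connect(":memory:")
        try:
            con.executescript(setup_sql)
            return sorted(con.execute(query).fetchall())
        except sqlite3.Error:
            return None  # invalid SQL never matches
        finally:
            con.close()
    gold, pred = run(gold_sql), run(pred_sql)
    return gold is not None and gold == pred

SETUP = ("CREATE TABLE users(id INT, name TEXT);"
         "INSERT INTO users VALUES (1,'a'),(2,'b');")
print(exec_match(SETUP, "SELECT name FROM users",
                 "SELECT name FROM users ORDER BY id"))  # True
print(exec_match(SETUP, "SELECT name FROM users",
                 "SELECT id FROM users"))  # False
```

Sorting the fetched rows makes the comparison order-insensitive, which is the usual convention for execution-match metrics.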

Core Features

A full-stack data distillation solution built for ML engineers

🛡️

Unified Provider Gateway

Seamlessly integrate OpenAI-compatible endpoints (vLLM, SGLang, Ollama, OpenRouter) and switch teacher models with one click.
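Because all of these servers speak the OpenAI-compatible protocol, switching teachers reduces to swapping a base URL and model name. A sketch with an assumed registry shape (the `Provider` class and the vLLM model name are illustrative, not FastDistill's gateway):

```python
from dataclasses import dataclass

# Illustrative provider registry for OpenAI-compatible endpoints.

@dataclass(frozen=True)
class Provider:
    name: str
    base_url: str
    model: str

PROVIDERS = {
    "vllm":   Provider("vllm",   "http://localhost:8000/v1",  "qwen3-8b"),
    "ollama": Provider("ollama", "http://localhost:11434/v1", "qwen3:0.6b"),
}

def switch_teacher(name):
    """Swap the teacher backend by name; the request payload stays identical."""
    p = PROVIDERS[name]
    return {"base_url": p.base_url, "model": p.model}

print(switch_teacher("ollama"))
```

The rest of the pipeline never sees the difference: generation code talks to one gateway interface regardless of which backend serves it.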

🔒

Deterministic Data Contracts

Canonical inputs + sample IDs + manifests ensure full reproducibility and auditability.
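One common way to get deterministic sample IDs, sketched here as an assumption about the approach (the function name is illustrative): hash the canonical, sorted-key JSON form of each record so the same content always yields the same ID, regardless of key order.

```python
import hashlib
import json

# Sketch: deterministic sample ID from canonical JSON.

def sample_id(record: dict) -> str:
    canonical = json.dumps(record, sort_keys=True,
                           separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

a = sample_id({"question": "q1", "answer": "a1"})
b = sample_id({"answer": "a1", "question": "q1"})  # key order must not matter
print(a == b)  # True
```

Stable IDs like this are what make manifests auditable: re-running a stage on the same inputs reproduces the same IDs, so diffs point only at real changes.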

🚦

Rigorous Quality Gates

Multi-stage filtering: rules → execution → LLM judge, ensuring data quality at every step.
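The gate ordering matters: cheap rule checks run first so that expensive execution and LLM-judge stages only see surviving samples. A sketch with stand-in stage functions (the checks and threshold are assumptions, not FastDistill's defaults):

```python
# Illustrative multi-stage quality gate: rules -> execution -> LLM judge.
# Each stage function is a stand-in for a real check.

def rule_check(s):
    return len(s["sql"]) > 0 and ";" not in s["sql"]

def exec_check(s):
    return s.get("exec_ok", False)        # stand-in for a sandbox run

def judge_check(s):
    return s.get("judge_score", 0) >= 7   # stand-in for an LLM judge call

GATES = [rule_check, exec_check, judge_check]

def passes(sample):
    """all() short-circuits: later, costlier gates run only if earlier ones pass."""
    return all(gate(sample) for gate in GATES)

good = {"sql": "SELECT 1", "exec_ok": True, "judge_score": 8}
bad = {"sql": "", "exec_ok": True, "judge_score": 9}
print(passes(good), passes(bad))  # True False
```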

📊

Detailed Observability

Auto-generated per-stage timing and quality reports for quick bottleneck identification.
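Per-stage timing needs nothing more than wrapping each stage call with a clock; the report shape below is an assumption for the sketch, not FastDistill's output format:

```python
import time

# Illustrative per-stage timing collector.

class StageTimer:
    def __init__(self):
        self.timings = {}

    def run(self, name, fn, *args):
        """Execute a stage, recording its wall-clock duration."""
        start = time.perf_counter()
        result = fn(*args)
        self.timings[name] = time.perf_counter() - start
        return result

    def report(self):
        """Share of total pipeline time per stage."""
        total = sum(self.timings.values()) or 1.0
        return {k: f"{v / total:.0%}" for k, v in self.timings.items()}

timer = StageTimer()
timer.run("dedup", lambda xs: sorted(set(xs)), [3, 1, 1, 2])
timer.run("generate", lambda: time.sleep(0.01))
print(timer.report())
```

A report like this makes the usual bottleneck (teacher generation dominating everything else) obvious at a glance.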

🧩

Decoupled Training Pipeline

Data generation and model training are fully separated, supporting MLX LoRA, vLLM, and other backends.

🎛️

Config-Driven

YAML configuration with environment/runtime layered overrides for easy reproduction and sharing.
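Layered overrides resolve to a single effective config by deep-merging each layer over the last. Shown here with plain dicts standing in for parsed YAML; the keys and values are illustrative:

```python
# Sketch of layered config resolution: base <- env layer <- runtime flags.

def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base, returning a new dict."""
    out = dict(base)
    for k, v in override.items():
        if isinstance(v, dict) and isinstance(out.get(k), dict):
            out[k] = deep_merge(out[k], v)
        else:
            out[k] = v
    return out

base = {"teacher": {"model": "deepseek-v3", "temperature": 0.7}, "batch": 32}
env = {"teacher": {"model": "qwen3:0.6b"}}
runtime = {"batch": 8}

config = deep_merge(deep_merge(base, env), runtime)
print(config)  # {'teacher': {'model': 'qwen3:0.6b', 'temperature': 0.7}, 'batch': 8}
```

Because each layer only states what it changes, sharing a run means sharing the base YAML plus a small override file.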

Workflow

End-to-end data flow from raw data to distilled datasets

1

Ingest & Normalize

Normalize raw inputs into a stable data schema

2

Deduplication

Remove near-duplicate samples using MinHash signatures and embedding similarity

3

Teacher Generation

Generate candidate data using high-quality teacher LLMs

4

Quality Gate

Rule filtering → execution eval → LLM judge
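The MinHash deduplication in step 2 can be sketched in a few lines: shingle each text, take the minimum hash per seeded hash function, and estimate Jaccard similarity from matching signature slots. This is a toy version for intuition; production pipelines add banded LSH for sub-quadratic candidate lookup, and the signature size and shingle width here are arbitrary choices:

```python
import hashlib

# Toy MinHash near-duplicate estimator.

def shingles(text, width=3):
    """Set of word n-grams for the text."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + width])
            for i in range(max(1, len(tokens) - width + 1))}

def minhash(shingle_set, num_hashes=64):
    """Signature: per seed, the minimum hash over all shingles."""
    return [min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
                for s in shingle_set)
            for seed in range(num_hashes)]

def similarity(a, b):
    """Estimated Jaccard similarity: fraction of matching signature slots."""
    sa, sb = minhash(shingles(a)), minhash(shingles(b))
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)

near = similarity("the quick brown fox jumps", "the quick brown fox leaps")
far = similarity("the quick brown fox jumps", "completely unrelated sentence here")
print(near, far)
```

Pairs whose estimated similarity exceeds a threshold would be collapsed to one representative before teacher generation, so the teacher's budget is never spent on duplicates.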

Quick Start

Just a few commands to launch your data distillation pipeline

# Install FastDistill
pip install fastdistill

# Or install with Ollama support for local distillation
pip install "fastdistill[ollama]"

# Run end-to-end distillation example
OLLAMA_MODEL=qwen3:0.6b python examples/fastdistill/ollama_distill_e2e.py

# Agent distillation - one-click training
fastdistill agent distill --task "Build a SQL query helper"

Proven Results

WikiSQL 1k Text2SQL distillation experiment results

Teacher Model Baseline (DeepSeek V3.2): 11.9%
Gold Match Accuracy (strict match metric): 30.9%

Start Building Your Data Pipeline

Join the FastDistill community and explore what synthetic data and AI feedback can do for your training pipelines.