DeepFabric Documentation - Home

Overview

DeepFabric is a tool designed to generate high-quality synthetic datasets at scale, specifically for language model training, evaluation, and research. It leverages topic-driven data generation, using hierarchical topic trees and experimental topic graphs to produce diverse, context-rich training examples.

Key users are researchers, engineers, and practitioners needing synthetic data for:

- Model distillation
- Agent evaluation
- Statistical research

DeepFabric supports creating conversational datasets, domain-specific training examples, and evaluation benchmarks, emphasizing quality and diversity in the generated data.

---

Core Capabilities

DeepFabric's pipeline features three main stages:

- Topic Generation: Creates hierarchical trees or complex graph-based domain representations.
- Dataset Generation Engine: Produces contextually relevant training examples based on generated topics.
- Packaging: Outputs datasets in standard formats ready for immediate use.

The methodology extends beyond basic prompt-based generation by building a conceptual map of your domain, resulting in broader domain coverage and consistency.

---

Topic Trees and Graphs

Topic Trees: Hierarchical topic breakdowns ideal for domains with clear categorical structures like academic subjects, product categories, or organizational hierarchies.

Topic Graphs (Experimental): Feature cross-connections between topics, suitable for domains with interconnected concepts such as research fields, technical topics, or social phenomena.

Both utilize large language models to expand topics intelligently and generate relevant content.

Tip: Choosing Between Trees and Graphs
Trees fit hierarchical relationships. Graphs are best for domains with complex, interrelated topics.

---

Getting Started

To create your first dataset with DeepFabric:

1. Installation
2. Configuration
3. Generation

A step-by-step process is available in the Getting Started section, including practical runnable examples.
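
Before turning to configuration, the difference between a topic tree and a topic graph described above can be made concrete with a small, self-contained sketch. This is not DeepFabric's API: `TopicNode`, `leaf_paths`, and the example topics are hypothetical names chosen for illustration, and the tree is written by hand where a language model would normally propose the subtopics.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TopicNode:
    """One topic in a hypothetical topic tree (not a DeepFabric class)."""
    name: str
    children: list[TopicNode] = field(default_factory=list)


def leaf_paths(node: TopicNode, prefix: tuple = ()) -> list:
    """Return every root-to-leaf path as a tuple of topic names."""
    path = prefix + (node.name,)
    if not node.children:
        return [path]
    paths = []
    for child in node.children:
        paths.extend(leaf_paths(child, path))
    return paths


# Hand-written tree: each topic has exactly one parent.
tree = TopicNode("Machine Learning", [
    TopicNode("Supervised Learning", [
        TopicNode("Classification"),
        TopicNode("Regression"),
    ]),
    TopicNode("Reinforcement Learning", [
        TopicNode("Policy Gradients"),
    ]),
])

# A graph keeps the same parent/child edges and adds cross-connections
# between topics that sit in different branches.
graph_edges = {
    (parent, child)
    for path in leaf_paths(tree)
    for parent, child in zip(path, path[1:])
}
graph_edges.add(("Classification", "Policy Gradients"))  # illustrative cross-link

# Each root-to-leaf path could serve as the context for one generated example.
for path in leaf_paths(tree):
    print(" > ".join(path))
```

In this sketch every root-to-leaf path is a candidate context for generating one training example, while the extra edge shows the kind of relationship a strict hierarchy cannot express.
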
YAML-based configuration offers full control over dataset generation. The Python API provides programmatic access that mirrors the CLI's functionality.

---

Integration Ecosystem

DeepFabric integrates with the broader machine learning ecosystem.

Providers:

- OpenAI
- Anthropic
- Local Ollama instances
- Cloud-based solutions

Export: Datasets export directly to Hugging Face Hub with automated dataset cards and metadata.

CLI commands support modular workflows:

- deepfabric validate: Configuration validation.
- deepfabric visualize: Topic graph exploration.
- deepfabric upload: Dataset publishing.

---

Next Steps

- Start with the Installation Guide.
- Generate your first dataset via the First Dataset tutorial.
- Dive deeper into YAML configuration in the Configuration Guide.
- Explore programmatic usage through the API Reference.

---

Badges & Community Links

- License: Apache 2.0
- CI Status: Active
- PyPI Version and Downloads: Available
- Community Chat: Join Discord
- GitHub Repository: lukehinds/deepfabric

---

Summary: DeepFabric is a synthetic dataset generator that uses topic-driven approaches built on hierarchical and graph models. It supports scalable, diverse, and high-quality dataset creation for NLP tasks and integrates with existing ML ecosystems. The documentation guides users from installation to advanced configuration and offers both CLI and API interfaces for flexibility.