Data Generation¶
smelt includes smelt-datagen, a deterministic data generation tool for creating realistic test datasets. It generates Parquet files from a YAML configuration, with support for partitioning, foreign keys, entity pools, and configurable distributions.
When to use datagen¶
| Approach | Best for | Size |
|---|---|---|
| Seeds (CSV) | Small reference data, version-controlled test fixtures | < 1,000 rows |
| smelt-datagen (Parquet) | Large test datasets with realistic distributions | 1K–100M+ rows |
| Inline test data | Unit test fixtures in .test.sql files | < 100 rows |
Use datagen when you need datasets large enough to test performance, partitioning, or incremental models — but still want deterministic, reproducible output.
Installation¶
smelt-datagen is included with the smelt package:
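Assuming smelt is distributed on PyPI (adjust for your package manager if it is installed another way):

```shell
pip install smelt
```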
Verify the installation:
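Checking that the binary is on your PATH is enough; `--help` prints the available options:

```shell
smelt-datagen --help
```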
Quick start¶
Create a file called datagen.yaml:
```yaml
seed: 42
datasets:
  - name: users
    output: data/users
    num_rows: 1000
    columns:
      - name: user_id
        generator: { type: sequential_id }
      - name: name_prefix
        generator:
          type: weighted_choice
          values:
            Mr: 0.40
            Ms: 0.35
            Dr: 0.05
            "": 0.20
      - name: country
        generator:
          type: one_of
          values: [US, UK, DE, FR, CA, AU]
      - name: signup_date
        generator: { type: date, start: "2022-01-01", end: "2024-12-31" }
```
Generate the data:
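Point the tool at the config with the `--config` flag:

```shell
smelt-datagen --config datagen.yaml
```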
This creates data/users/data.parquet. Inspect it with DuckDB:
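For example, with the DuckDB CLI, which can query Parquet files directly:

```shell
duckdb -c "SELECT * FROM 'data/users/data.parquet' LIMIT 10"
```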
Configuration reference¶
Top-level keys¶
```yaml
seed: 42            # Global random seed (default: 42)
scale_factor: 1.0   # Multiplier for all dataset num_rows
datasets:
  - ...             # List of dataset configurations
```
Dataset keys¶
```yaml
- name: orders        # Dataset name (used by foreign_key references)
  output: data/orders # Output directory
  num_rows: 1000000   # Number of rows to generate
  seed: 123           # Per-dataset seed (overrides global)
  partition: ...      # Optional: Hive-style date partitioning
  entity: ...         # Optional: entity pool for sticky attributes
  columns:            # Column definitions
    - name: order_id
      generator: { type: sequential_id }
```
Generator types¶
Identifiers¶
sequential_id¶
Auto-incrementing integer starting at 1.
uuid¶
Random UUID v4, generated deterministically from the configured seed.
foreign_key¶
Random reference to another dataset's sequential ID. The referenced dataset must appear earlier in the config.
Strings¶
constant¶
Fixed value for every row.
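A minimal sketch, assuming the fixed value is supplied via a `value` key (this parameter name is not confirmed above):

```yaml
generator: { type: constant, value: "USD" }
```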
weighted_choice¶
Select from options with probability weights. Weights are normalized automatically.
```yaml
generator:
  type: weighted_choice
  values:
    completed: 0.70
    shipped: 0.12
    cancelled: 0.05
    returned: 0.05
```
one_of¶
Uniform random selection from a list.
string_pattern¶
Template-based string generation with embedded placeholders.
Supported placeholders:
| Placeholder | Description | Example output |
|---|---|---|
| `{sequential_id}` | Row index + 1 | `42` |
| `{uuid}` | Random UUID | `a1b2c3d4-...` |
| `{uniform_int:MIN-MAX}` | Random integer in range | `4527` |
| `{one_of:a,b,c}` | Random choice from list | `b` |
```yaml
# SKU codes
generator: { type: string_pattern, template: "SKU-{uniform_int:1000-9999}" }

# Usernames
generator: { type: string_pattern, template: "{one_of:alpha,beta,gamma}-{sequential_id}" }
```
Numbers¶
uniform_int¶
Uniform random integer in [min, max).
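For example, with `min` and `max` keys matching the [min, max) bounds (key names assumed from the interval notation):

```yaml
# Quantities from 1 to 99 (max is exclusive)
generator: { type: uniform_int, min: 1, max: 100 }
```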
uniform_float¶
Uniform random float in [min, max).
log_normal¶
Log-normal distribution, useful for prices, counts, and durations. The median is the center, sigma controls spread, and max caps the output.
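A sketch using the parameters described above; the exact key names (`median`, `sigma`, `max`) are assumed from the prose:

```yaml
# Order totals: centered near 45, long right tail, capped at 500
generator: { type: log_normal, median: 45.0, sigma: 0.8, max: 500.0 }
```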
geometric¶
Geometric distribution — "number of failures before first success". Useful for count data skewed toward zero.
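A sketch, assuming the success probability is configured via a `prob` key (name assumed, by analogy with the `optional` generator):

```yaml
# Items per order: mostly 0-2, occasionally more
generator: { type: geometric, prob: 0.5 }
```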
Dates and timestamps¶
date¶
Random date within a range, output as YYYY-MM-DD string.
timestamp¶
Random timestamp within a range, output as YYYY-MM-DDTHH:MM:SS string.
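By analogy with the date generator's `start`/`end` keys (the accepted input format is assumed to match the output format):

```yaml
generator: { type: timestamp, start: "2024-01-01T00:00:00", end: "2024-12-31T23:59:59" }
```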
Boolean and nullable¶
bool¶
Boolean with configurable probability of true.
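A sketch, assuming the probability key is `prob`, as in the `optional` generator:

```yaml
# ~90% of rows are true
generator: { type: bool, prob: 0.9 }
```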
optional¶
Wraps another generator, producing null with probability 1 - prob.
```yaml
generator:
  type: optional
  prob: 0.08
  inner:
    type: one_of
    values: [Defective, Wrong Item, Changed Mind]
```
Partitioning¶
Add a partition block to create Hive-style date-partitioned output. Each day gets its own directory with a data.parquet file, and rows are distributed evenly across days.
```yaml
- name: events
  output: data/events
  num_rows: 5000000
  partition:
    column: event_date
    start: "2024-07-01"
    days: 90
  columns:
    - name: event_id
      generator: { type: sequential_id }
    - name: event_type
      generator:
        type: weighted_choice
        values:
          page_view: 0.50
          click: 0.30
          purchase: 0.20
```
Output directory structure:
```text
data/events/
  event_date=2024-07-01/data.parquet
  event_date=2024-07-02/data.parquet
  ...
  event_date=2024-09-28/data.parquet
```
Entity pools¶
Entity pools create a fixed set of "entities" (e.g., visitors, devices) whose attributes are consistent across all rows that reference them. This models the real-world pattern where the same user appears in multiple events with the same country, device, etc.
```yaml
- name: sessions
  output: data/sessions
  num_rows: 10000000
  entity:
    pool_ratio: 0.2   # 10M rows × 0.2 = 2M unique visitors
    columns:
      - name: visitor_id
        generator: { type: uuid }
      - name: country
        generator:
          type: weighted_choice
          values:
            US: 0.40
            UK: 0.15
            DE: 0.10
            Other: 0.35
  columns:
    - name: session_id
      generator: { type: uuid }
    - name: platform
      generator:
        type: one_of
        values: [web, ios, android]
```
Each row gets a randomly selected entity from the pool. The entity's visitor_id and country are the same every time that entity appears.
Foreign keys¶
Foreign keys create referential integrity between datasets. The referenced dataset must be listed earlier in the config so its row count is known.
```yaml
datasets:
  # Dimension table (listed first)
  - name: customers
    output: data/customers
    num_rows: 100000
    columns:
      - name: customer_id
        generator: { type: sequential_id }
      - name: segment
        generator:
          type: weighted_choice
          values: { Regular: 0.50, Premium: 0.25, VIP: 0.10, New: 0.15 }

  # Fact table (references customers)
  - name: orders
    output: data/orders
    num_rows: 1000000
    columns:
      - name: order_id
        generator: { type: sequential_id }
      - name: customer_id
        generator: { type: foreign_key, dataset: customers }
```
The foreign_key generator produces random integers in [1, referenced_dataset_row_count], matching the sequential_id values in the dimension table.
Scale factors¶
Scale factors let you resize datasets for different environments without changing the config.
In the config file:
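A top-level `scale_factor` entry (the key shown in the configuration reference) applies to every dataset:

```yaml
scale_factor: 0.1   # All datasets generate at 10% of their configured num_rows
```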
Or override via CLI:
```shell
# CI: fast, small
smelt-datagen --config datagen.yaml --scale-factor 0.01

# Development: moderate
smelt-datagen --config datagen.yaml --scale-factor 0.1

# Load testing: full size
smelt-datagen --config datagen.yaml --scale-factor 1.0
```
The scale factor multiplies each dataset's num_rows. Foreign key references automatically adjust to match the scaled dimension table sizes.
Using datagen with smelt sources¶
A typical workflow: generate Parquet data with datagen, then configure smelt to read it as a source.
1. Generate the data:
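```shell
smelt-datagen --config datagen.yaml
```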
2. Configure sources in sources.yml:
```yaml
version: 1
sources:
  - name: raw
    tables:
      - name: customers
        path: data/customers
      - name: orders
        path: data/orders
```
3. Write models that reference the sources:
```sql
-- models/marts/customer_orders.sql
SELECT
    c.customer_id,
    c.segment,
    COUNT(*) AS order_count
FROM smelt.ref('raw', 'customers') c
JOIN smelt.ref('raw', 'orders') o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.segment
```
4. Build:
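Assuming the standard smelt build entry point (the exact command is not shown on this page):

```shell
smelt build
```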
CLI reference¶
```text
smelt-datagen [OPTIONS]

Options:
      --config <PATH>          YAML config file
  -o, --output <PATH>          Output directory [default: output]
  -s, --seed <SEED>            Random seed [default: 42]
      --scale-factor <FACTOR>  Scale multiplier for dataset sizes
  -q, --quiet                  Suppress progress output
  -h, --help                   Print help
```
Example: retail analytics¶
The examples/retail_analytics/datagen.yaml in the smelt repository defines a complete star schema with:
- Dimension tables: `customers` (100K), `products` (10K), `stores` (500)
- Fact tables: `orders` (1M), `order_items` (3M), `web_events` (5M)
Fact tables use foreign_key to reference dimensions, and are date-partitioned across 90 days. Run it at 1% scale for quick iteration:
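Using the `--config` and `--scale-factor` flags from the CLI reference (config path taken from the repository layout above):

```shell
smelt-datagen --config examples/retail_analytics/datagen.yaml --scale-factor 0.01
```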