Data Generation

smelt includes smelt-datagen, a deterministic data generation tool for creating realistic test datasets. It generates Parquet files from a YAML configuration, with support for partitioning, foreign keys, entity pools, and configurable distributions.

When to use datagen

| Approach | Best for | Size |
| --- | --- | --- |
| Seeds (CSV) | Small reference data, version-controlled test fixtures | < 1,000 rows |
| smelt-datagen (Parquet) | Large test datasets with realistic distributions | 1K–100M+ rows |
| Inline test data | Unit test fixtures in .test.sql files | < 100 rows |

Use datagen when you need datasets large enough to test performance, partitioning, or incremental models — but still want deterministic, reproducible output.

Installation

smelt-datagen is included with the smelt package:

pip install smelt-sql

Verify the installation:

smelt-datagen --help

Quick start

Create a file called datagen.yaml:

seed: 42

datasets:
  - name: users
    output: data/users
    num_rows: 1000
    columns:
      - name: user_id
        generator: { type: sequential_id }
      - name: name_prefix
        generator:
          type: weighted_choice
          values:
            Mr: 0.40
            Ms: 0.35
            Dr: 0.05
            "": 0.20
      - name: country
        generator:
          type: one_of
          values: [US, UK, DE, FR, CA, AU]
      - name: signup_date
        generator: { type: date, start: "2022-01-01", end: "2024-12-31" }

Generate the data:

smelt-datagen --config datagen.yaml

This creates data/users/data.parquet. Inspect it with DuckDB:

duckdb -c "SELECT * FROM 'data/users/data.parquet' LIMIT 10"

Configuration reference

Top-level keys

seed: 42              # Global random seed (default: 42)
scale_factor: 1.0     # Multiplier for all dataset num_rows

datasets:
  - ...                # List of dataset configurations

Dataset keys

- name: orders            # Dataset name (used by foreign_key references)
  output: data/orders     # Output directory
  num_rows: 1000000       # Number of rows to generate
  seed: 123               # Per-dataset seed (overrides global)
  partition: ...          # Optional: Hive-style date partitioning
  entity: ...             # Optional: entity pool for sticky attributes
  columns:                # Column definitions
    - name: order_id
      generator: { type: sequential_id }

Generator types

Identifiers

sequential_id

Auto-incrementing integer starting at 1.

generator: { type: sequential_id }

uuid

Deterministic UUID v4.

generator: { type: uuid }

foreign_key

Random reference to another dataset's sequential ID. The referenced dataset must appear earlier in the config.

generator: { type: foreign_key, dataset: customers }

Strings

constant

Fixed value for every row.

generator: { type: constant, value: "active" }

weighted_choice

Select from options with probability weights. Weights are normalized automatically.

generator:
  type: weighted_choice
  values:
    completed: 0.70
    shipped: 0.12
    cancelled: 0.05
    returned: 0.05
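
Because the weights above sum to 0.92, normalization matters: after normalizing, "completed" is drawn about 76% of the time, not 70%. A minimal Python sketch of this behaviour (illustrative only, not the tool's implementation):

```python
import random

def weighted_choice(rng, values):
    # Normalize the weights so they need not sum to 1, mirroring the
    # documented behaviour, then draw one key.
    keys = list(values)
    total = sum(values.values())
    return rng.choices(keys, weights=[values[k] / total for k in keys])[0]

rng = random.Random(42)  # fixed seed keeps the output reproducible
weights = {"completed": 0.70, "shipped": 0.12, "cancelled": 0.05, "returned": 0.05}
sample = [weighted_choice(rng, weights) for _ in range(10_000)]
print(round(sample.count("completed") / len(sample), 2))  # near 0.70 / 0.92 ≈ 0.76
```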

one_of

Uniform random selection from a list.

generator:
  type: one_of
  values: [red, green, blue, yellow]

string_pattern

Template-based string generation with embedded placeholders.

generator: { type: string_pattern, template: "user_{sequential_id}@example.com" }

Supported placeholders:

| Placeholder | Description | Example output |
| --- | --- | --- |
| {sequential_id} | Row index + 1 | 42 |
| {uuid} | Random UUID | a1b2c3d4-... |
| {uniform_int:MIN-MAX} | Random integer in range | 4527 |
| {one_of:a,b,c} | Random choice from list | b |

# SKU codes
generator: { type: string_pattern, template: "SKU-{uniform_int:1000-9999}" }

# Usernames
generator: { type: string_pattern, template: "{one_of:alpha,beta,gamma}-{sequential_id}" }
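
The placeholder expansion can be sketched in a few lines of Python. This is an illustrative re-implementation of the documented placeholders, not the tool's actual code:

```python
import random
import re
import uuid

def expand_template(template, row_index, rng):
    # Expand the documented string_pattern placeholders for one row.
    def repl(match):
        token = match.group(1)
        if token == "sequential_id":
            return str(row_index + 1)
        if token == "uuid":
            return str(uuid.UUID(int=rng.getrandbits(128), version=4))
        if token.startswith("uniform_int:"):
            lo, hi = token.split(":")[1].split("-")
            return str(rng.randint(int(lo), int(hi)))
        if token.startswith("one_of:"):
            return rng.choice(token.split(":")[1].split(","))
        raise ValueError(f"unknown placeholder: {token}")
    return re.sub(r"\{([^}]+)\}", repl, template)

rng = random.Random(42)
print(expand_template("user_{sequential_id}@example.com", 41, rng))  # user_42@example.com
print(expand_template("SKU-{uniform_int:1000-9999}", 0, rng))        # e.g. SKU-<4 digits>
```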

Numbers

uniform_int

Uniform random integer in [min, max).

generator: { type: uniform_int, min: 1950, max: 2005 }

uniform_float

Uniform random float in [min, max).

generator: { type: uniform_float, min: 0.0, max: 1.0 }

log_normal

Log-normal distribution, useful for prices, counts, and durations. The median is the center, sigma controls spread, and max caps the output.

generator: { type: log_normal, median: 2500, sigma: 1.2, max: 500000 }
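
A sketch of these semantics, assuming the conventional parameterization where the value is the exponential of a normal centered on ln(median) (the exact parameterization used by the tool may differ):

```python
import math
import random

def log_normal(rng, median, sigma, cap=None):
    # exp of a normal whose mean is ln(median): half the mass falls below
    # `median`, sigma widens the right tail, and `cap` (the config's max)
    # clips the output.
    value = math.exp(rng.gauss(math.log(median), sigma))
    return min(value, cap) if cap is not None else value

rng = random.Random(42)
samples = sorted(log_normal(rng, median=2500, sigma=1.2, cap=500_000) for _ in range(10_001))
print(samples[5000])          # the sample median lands close to 2500
print(max(samples) <= 500_000)  # True: the cap holds
```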

geometric

Geometric distribution — "number of failures before first success". Useful for count data skewed toward zero.

generator: { type: geometric, p: 0.5 }
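
The "failures before first success" definition implies a mean of (1 - p) / p, so p: 0.5 yields counts averaging 1 with roughly half the values at zero. A small illustrative sketch:

```python
import random

def geometric(rng, p):
    # Count Bernoulli(p) failures before the first success.
    failures = 0
    while rng.random() >= p:
        failures += 1
    return failures

rng = random.Random(42)
counts = [geometric(rng, 0.5) for _ in range(10_000)]
print(round(sum(counts) / len(counts), 1))  # mean near (1 - p) / p = 1.0
```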

Dates and timestamps

date

Random date within a range, output as YYYY-MM-DD string.

generator: { type: date, start: "2020-01-01", end: "2024-12-31" }

timestamp

Random timestamp within a range, output as YYYY-MM-DDTHH:MM:SS string.

generator: { type: timestamp, start: "2024-01-01T00:00:00", end: "2024-03-31T23:59:59" }

Boolean and nullable

bool

Boolean with configurable probability of true.

generator: { type: bool, prob: 0.15 }

optional

Wraps another generator, producing null with probability 1 - prob.

generator:
  type: optional
  prob: 0.08
  inner:
    type: one_of
    values: [Defective, Wrong Item, Changed Mind]
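
With prob: 0.08, about 92% of rows come out null. A minimal sketch of the wrapper semantics (illustrative, not the tool's code):

```python
import random

def optional(rng, prob, inner):
    # Emit the inner generator's value with probability `prob`;
    # otherwise emit None (null), i.e. null with probability 1 - prob.
    return inner(rng) if rng.random() < prob else None

rng = random.Random(42)
reasons = ["Defective", "Wrong Item", "Changed Mind"]
values = [optional(rng, 0.08, lambda r: r.choice(reasons)) for _ in range(10_000)]
print(round(values.count(None) / len(values), 2))  # near 1 - 0.08 = 0.92
```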

Partitioning

Add a partition block to create Hive-style date-partitioned output. Each day gets its own directory with a data.parquet file, and rows are distributed evenly across days.

- name: events
  output: data/events
  num_rows: 5000000
  partition:
    column: event_date
    start: "2024-07-01"
    days: 90
  columns:
    - name: event_id
      generator: { type: sequential_id }
    - name: event_type
      generator:
        type: weighted_choice
        values:
          page_view: 0.50
          click: 0.30
          purchase: 0.20

Output directory structure:

data/events/
  event_date=2024-07-01/data.parquet
  event_date=2024-07-02/data.parquet
  ...
  event_date=2024-09-28/data.parquet
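
The layout above can be sketched as a plan mapping partition directories to row counts. Note the remainder handling (spreading leftover rows over the earliest days) is an assumption here; the doc only states that rows are distributed evenly:

```python
from datetime import date, timedelta

def partition_plan(start, days, num_rows):
    # Split num_rows evenly across Hive-style date partitions,
    # giving any remainder to the earliest days (assumed behaviour).
    base, extra = divmod(num_rows, days)
    start_day = date.fromisoformat(start)
    return {
        f"event_date={start_day + timedelta(days=i)}": base + (1 if i < extra else 0)
        for i in range(days)
    }

plan = partition_plan("2024-07-01", 90, 5_000_000)
print(len(plan))           # 90 partitions
print(min(plan), max(plan))  # event_date=2024-07-01 event_date=2024-09-28
```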

Entity pools

Entity pools create a fixed set of "entities" (e.g., visitors, devices) whose attributes are consistent across all rows that reference them. This models the real-world pattern where the same user appears in multiple events with the same country, device, etc.

- name: sessions
  output: data/sessions
  num_rows: 10000000
  entity:
    pool_ratio: 0.2    # 10M rows × 0.2 = 2M unique visitors
    columns:
      - name: visitor_id
        generator: { type: uuid }
      - name: country
        generator:
          type: weighted_choice
          values:
            US: 0.40
            UK: 0.15
            DE: 0.10
            Other: 0.35
  columns:
    - name: session_id
      generator: { type: uuid }
    - name: platform
      generator:
        type: one_of
        values: [web, ios, android]

Each row gets a randomly selected entity from the pool. The entity's visitor_id and country are the same every time that entity appears.
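
The pool mechanics can be sketched as follows: entities are generated once up front, then every row reuses one pooled entity, so its attributes stay consistent wherever it appears. Illustrative only (the country weighting is simplified to a uniform choice):

```python
import random
import uuid

rng = random.Random(42)
num_rows, pool_ratio = 1_000, 0.2

# Build the fixed entity pool once: 1,000 rows × 0.2 = 200 visitors.
pool = [
    {
        "visitor_id": str(uuid.UUID(int=rng.getrandbits(128), version=4)),
        "country": rng.choice(["US", "UK", "DE", "Other"]),
    }
    for _ in range(int(num_rows * pool_ratio))
]

# Each row copies one pooled entity and adds its own per-row columns.
rows = [
    {**rng.choice(pool), "session_id": str(uuid.UUID(int=rng.getrandbits(128), version=4))}
    for _ in range(num_rows)
]
print(len(pool))  # 200 unique visitors shared across 1,000 rows
```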

Foreign keys

Foreign keys create referential integrity between datasets. The referenced dataset must be listed earlier in the config so its row count is known.

datasets:
  # Dimension table (listed first)
  - name: customers
    output: data/customers
    num_rows: 100000
    columns:
      - name: customer_id
        generator: { type: sequential_id }
      - name: segment
        generator:
          type: weighted_choice
          values: { Regular: 0.50, Premium: 0.25, VIP: 0.10, New: 0.15 }

  # Fact table (references customers)
  - name: orders
    output: data/orders
    num_rows: 1000000
    columns:
      - name: order_id
        generator: { type: sequential_id }
      - name: customer_id
        generator: { type: foreign_key, dataset: customers }

The foreign_key generator produces random integers in [1, referenced_dataset_row_count], matching the sequential_id values in the dimension table.
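
That contract is easy to picture with a sketch: every generated key is a uniform random integer in [1, dimension_row_count], so it always matches some sequential_id in the dimension table (illustrative, not the tool's code):

```python
import random

rng = random.Random(42)
customers_rows = 100_000  # the dimension table's num_rows

# Simulate the fact table's foreign-key column.
order_customer_ids = [rng.randint(1, customers_rows) for _ in range(100_000)]
print(min(order_customer_ids) >= 1 and max(order_customer_ids) <= customers_rows)  # True
```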

Scale factors

Scale factors let you resize datasets for different environments without changing the config.

In the config file:

scale_factor: 0.01   # Generate 1% of full dataset

Or override via CLI:

# CI: fast, small
smelt-datagen --config datagen.yaml --scale-factor 0.01

# Development: moderate
smelt-datagen --config datagen.yaml --scale-factor 0.1

# Load testing: full size
smelt-datagen --config datagen.yaml --scale-factor 1.0

The scale factor multiplies each dataset's num_rows. Foreign key references automatically adjust to match the scaled dimension table sizes.
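
A sketch of the arithmetic, using the example datasets above. The exact rounding behaviour is an assumption here; the key point is that dimension and fact tables scale together, so scaled foreign keys still land inside the scaled dimension table:

```python
def scaled_rows(num_rows, scale_factor):
    # Round-to-nearest with a floor of 1 row (assumed rounding behaviour).
    return max(1, round(num_rows * scale_factor))

for name, rows in [("customers", 100_000), ("orders", 1_000_000)]:
    print(name, scaled_rows(rows, 0.01))
# customers 1000
# orders 10000
```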

Using datagen with smelt sources

A typical workflow: generate Parquet data with datagen, then configure smelt to read it as a source.

1. Generate the data:

smelt-datagen --config datagen.yaml

2. Configure sources in sources.yml:

version: 1
sources:
  - name: raw
    tables:
      - name: customers
        path: data/customers
      - name: orders
        path: data/orders

3. Write models that reference the sources:

-- models/marts/customer_orders.sql
SELECT
    c.customer_id,
    c.segment,
    COUNT(*) as order_count
FROM smelt.ref('raw', 'customers') c
JOIN smelt.ref('raw', 'orders') o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.segment

4. Build:

smelt build

CLI reference

smelt-datagen [OPTIONS]

Options:
  --config <PATH>          YAML config file
  -o, --output <PATH>      Output directory [default: output]
  -s, --seed <SEED>        Random seed [default: 42]
  --scale-factor <FACTOR>  Scale multiplier for dataset sizes
  -q, --quiet              Suppress progress output
  -h, --help               Print help

Example: retail analytics

The examples/retail_analytics/datagen.yaml in the smelt repository defines a complete star schema with:

  • Dimension tables: customers (100K), products (10K), stores (500)
  • Fact tables: orders (1M), order_items (3M), web_events (5M)

Fact tables use foreign_key to reference dimensions, and are date-partitioned across 90 days. Run it at 1% scale for quick iteration:

smelt-datagen --config examples/retail_analytics/datagen.yaml --scale-factor 0.01

Further reading

  • Seeds for small CSV-based test data
  • Sources for configuring external data references
  • Testing for inline test data in .test.sql files