Architecture

smelt is a SQL-to-SQL compiler and orchestrator for data pipelines. It parses SQL models written in smelt's dialect, resolves dependencies, optionally optimizes across model boundaries, and emits dialect-specific SQL for target execution engines.

Three ideas make smelt different from dbt:

  1. Logical/Physical Separation -- Users write WHAT to compute; the planner and dialect-aware printer decide HOW it executes on each backend.
  2. Cross-Model Planning -- The planner operates at the model-graph level (creating shared materializations, redirecting refs, merging models), not at the expression level.
  3. First-Class Editor Support -- A language server built on Salsa and Rowan provides incremental compilation and real-time diagnostics.

Compilation Pipeline

Source files flow through five stages before reaching a target engine:

Source Files (.sql / .py)
     |
     v
Parse  (smelt-parser)       -- Rowan CST, error-recovery, lossless
     |
     v
Analyze  (smelt-db)         -- Salsa incremental queries: refs, types, diagnostics
     |
     v
Plan  (smelt-planner)       -- Model-graph transforms (optional)
     |
     v
Generate  (smelt-dialect)   -- CST walk --> target-specific SQL string
     |
     v
Execute  (smelt-backend-*)  -- Send SQL to DuckDB / Spark / etc.

Key invariant: The Rowan CST is the single representation from parse through generation. There is no intermediate IR. This avoids fidelity boundaries and preserves comments, formatting, and smelt extensions throughout the pipeline.

Stage Details

Parse -- The parser uses recursive descent with error recovery at sync points (semicolons, keywords). Invalid input produces ERROR nodes in the CST rather than aborting. Python model files are discovered via subprocess/PyO3 and their SQL is extracted from @model decorators before parsing.
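The recovery behavior can be illustrated with a simplified sketch. All types here are hypothetical stand-ins; the real parser is recursive descent over a lossless Rowan CST, not a semicolon splitter:

```rust
// Simplified sketch of sync-point error recovery (hypothetical types; the
// real parser builds a lossless Rowan CST with recursive descent).
#[derive(Debug)]
enum Node {
    Stmt(String),  // a statement that parsed successfully
    Error(String), // invalid input preserved verbatim as an ERROR node
}

/// Parse a script, recovering at semicolons: invalid statements become
/// Error nodes instead of aborting the whole parse.
fn parse_script(src: &str) -> Vec<Node> {
    src.split(';')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(|stmt| {
            // Toy validity check standing in for real recursive descent.
            if stmt.to_uppercase().starts_with("SELECT") {
                Node::Stmt(stmt.to_string())
            } else {
                Node::Error(stmt.to_string())
            }
        })
        .collect()
}

fn main() {
    let nodes = parse_script("SELECT 1; oops syntax; SELECT 2");
    // The bad statement yields an Error node; parsing continues after it.
    assert_eq!(nodes.len(), 3);
    assert!(matches!(nodes[1], Node::Error(_)));
}
```

The key property is that every input produces a complete tree: the statement after the error is parsed normally, which is what lets the LSP keep working on broken files.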

Analyze -- Salsa provides automatic incremental recomputation. When a file changes, only affected queries re-evaluate. The CST itself is never mutated; semantic information is computed as derived queries (parse_file, model_refs, resolve_ref, file_diagnostics, model_schema).

Plan -- The planner reads CST structure to detect patterns, then emits Transformation instructions and new SQL strings. It does not mutate existing CSTs. Planning is optional; models run correctly without it.

Generate -- The dialect-aware printer walks the CST in a single forward pass and emits target-specific SQL. Each construct that needs translation is a match arm in the recursive walk. For the native dialect (DuckDB), the printer emits SQL identical to the input modulo smelt.ref() resolution.

Execute -- Each backend implementation handles DDL (CREATE TABLE AS, views, incremental inserts) and returns ExecutionResult with duration, row count, and optional data preview as Arrow RecordBatch.
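A rough shape of the backend boundary, sketched synchronously for brevity (the real Backend trait is async and its preview is an Arrow RecordBatch; field and method names here are illustrative):

```rust
// Simplified sync sketch of the backend boundary (hypothetical signatures;
// the real trait is async and previews data as Arrow RecordBatch).
struct ExecutionResult {
    duration_ms: u64,
    rows_affected: u64,
    preview: Option<Vec<String>>, // stands in for an Arrow RecordBatch
}

trait Backend {
    /// Execute one SQL statement (e.g. CREATE TABLE AS) against the engine.
    fn execute(&mut self, sql: &str) -> Result<ExecutionResult, String>;
}

/// Toy in-memory backend used only to illustrate the trait shape.
struct MockBackend {
    statements: Vec<String>,
}

impl Backend for MockBackend {
    fn execute(&mut self, sql: &str) -> Result<ExecutionResult, String> {
        self.statements.push(sql.to_string());
        Ok(ExecutionResult { duration_ms: 1, rows_affected: 0, preview: None })
    }
}

fn main() {
    let mut be = MockBackend { statements: vec![] };
    let res = be.execute("CREATE TABLE t AS SELECT 1").unwrap();
    assert_eq!(be.statements.len(), 1);
    assert_eq!(res.rows_affected, 0);
    assert!(res.preview.is_none());
}
```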


Crate Structure

Core Pipeline

Crate           Purpose                                              Key Types                                        Sync/Async
smelt-types     SQL data types shared across crates                  DataType, TypedColumn                            sync
smelt-parser    Rowan-based error-recovery parser                    SyntaxNode, SyntaxKind, Parse                    sync
smelt-core      Project config and model discovery                   Config, ModelFile, ModelDiscovery                sync
smelt-db        Salsa incremental queries over parsed models         Database, parse_file, model_refs, model_schema   sync
smelt-dialect   Dialect-aware CST printer and backend capabilities   SqlDialect, BackendCapabilities                  sync
smelt-planner   Model-graph optimization rules                       Transformation, ExecutionStep, Opportunity       sync

Execution

Crate                  Purpose                                            Key Types                  Sync/Async
smelt-cli              Command-line interface (run, explain, backbuild)   --                         async
smelt-backend          Execution trait and result types                   Backend, ExecutionResult   async
smelt-backend-duckdb   DuckDB execution                                   DuckDbBackend              async
smelt-backend-spark    Spark/Databricks execution                         SparkBackend               async

Tools

Crate         Purpose                               Key Types                    Sync/Async
smelt-lsp     Language Server Protocol server       --                           async (tower-lsp)
smelt-state   Run manifests and interval tracking   RunManifest, IntervalStore   sync
smelt-ui      Web dashboard and run execution       --                           async

Testing

Crate                 Purpose                                                       Sync/Async
smelt-bench           Benchmarks                                                    standalone
smelt-datagen         Test data generation                                          standalone
smelt-parser-compat   Compatibility tests against pg_query, sqlparser-rs, sqlglot   sync

Dependency Graph

                      smelt-types
                          |
                      smelt-parser
                          |
                   +------+------+
                   |      |      |
              smelt-core  |  smelt-dialect
                   |      |      |
                   |  smelt-db   |
                   |      |      |
            +------+------+------+
            |      |      |
         smelt-lsp |  smelt-planner
            |      |      |
            |  smelt-cli--+
            |      |
            |  smelt-backend
            |      |
            |   +--+------+
            |   |         |
            |  duckdb   spark
            |  backend  backend
            |
         (LSP binary)

Why smelt-dialect is separate

The LSP needs dialect information (e.g., "QUALIFY will be rewritten for PostgreSQL") but must not link against heavy async/native dependencies like Arrow, Tokio, and DuckDB. smelt-dialect is a lightweight sync crate that both smelt-lsp and smelt-cli depend on, without either needing to depend on smelt-backend.
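In manifest terms, the split looks roughly like this (a hypothetical Cargo.toml fragment for smelt-lsp; paths and crate selection are illustrative):

```toml
# Hypothetical manifest fragment for smelt-lsp: dialect hints come from the
# lightweight sync crate, so no Arrow/Tokio-heavy backend crates are linked.
[dependencies]
smelt-dialect = { path = "../smelt-dialect" }
smelt-db      = { path = "../smelt-db" }
# Note: no smelt-backend, smelt-backend-duckdb, or smelt-backend-spark here.
```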


Key Design Decisions

Rowan CST as Single Representation

There is no intermediate IR like DataFusion LogicalPlan. The concrete syntax tree from parsing flows through the entire pipeline. This preserves:

  • Comments and whitespace -- important for readable generated SQL
  • smelt extensions -- smelt.ref(), smelt.metric(), named parameters
  • Original formatting -- the printer's default arm emits tokens verbatim
  • Error nodes -- the LSP works with incomplete/invalid code

Avoiding an IR also eliminates two fidelity boundaries (CST-to-IR and IR-to-SQL) that would lose information or require complex round-tripping.

Salsa for Incremental Computation

Salsa tracks query dependencies automatically. When a file changes, only affected queries are recomputed. This is critical for LSP responsiveness -- parsing 1000 models once takes around 1 second, but re-analyzing a single changed file takes around 50ms.

Key Salsa queries in smelt-db:

  • parse_file() -- CST from source text
  • model_refs() -- ref names and source positions
  • resolve_ref() -- target model for a ref
  • file_diagnostics() -- errors and warnings
  • model_schema() -- inferred column types

Error-Recovery Parsing

While editing, code spends most of its time in an invalid state. The parser always produces a tree, even from invalid input. Rowan's ERROR nodes allow the LSP to provide diagnostics, completions, and go-to-definition on incomplete code. The parser uses sync points (semicolons, SQL keywords) to recover and continue parsing after errors.

Two-Graph Architecture

smelt maintains two distinct graph representations:

Logical graph -- The user's models as written. Each model maps to a .sql file. References between models form a DAG. This graph is never modified by the planner.

Physical graph -- The execution plan. The planner may add synthetic nodes (shared materializations, cube split intermediates), remove nodes (fused models), redirect references, and change materialization strategies. The physical graph is what actually gets executed.

This separation means users always see their original model structure, while the execution engine works with an optimized plan.
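The relationship between the two graphs can be sketched as a pure derivation: the physical graph starts as a copy, and only that copy is edited (types and names here are illustrative, not the planner's real representation):

```rust
use std::collections::BTreeMap;

// Sketch of the two-graph split (hypothetical types): the logical graph is
// the user's models as written; the physical graph is derived from it and
// is the only one the planner ever edits.
type Graph = BTreeMap<String, Vec<String>>; // model -> upstream refs

/// Derive a physical graph by redirecting every ref from `from` to `to`,
/// e.g. pointing a consumer at a shared materialization.
fn redirect_ref(logical: &Graph, from: &str, to: &str) -> Graph {
    let mut physical = logical.clone();
    for refs in physical.values_mut() {
        for r in refs.iter_mut() {
            if r == from {
                *r = to.to_string();
            }
        }
    }
    physical
}

fn main() {
    let mut logical = Graph::new();
    logical.insert("daily_orders".into(), vec!["stg_orders".into()]);
    let physical = redirect_ref(&logical, "stg_orders", "shared_stg_orders");
    // The logical graph is untouched; only the derived physical copy changed.
    assert_eq!(logical["daily_orders"], vec!["stg_orders".to_string()]);
    assert_eq!(physical["daily_orders"], vec!["shared_stg_orders".to_string()]);
}
```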

Sync Core, Async Edges

All core logic (parsing, analysis, optimization, printing) is synchronous. Async is only at the execution boundary where network I/O happens (backend crates, LSP server via tower-lsp). This keeps the codebase simple, testable, and compatible with Salsa (which is sync).

Expression Optimization is the Engine's Job

smelt does not attempt predicate pushdown, join reordering, or cost-based optimization within a single query. DuckDB, Spark, and BigQuery all have mature query optimizers for that. smelt's value is in cross-model planning that no single engine can do because it doesn't see the full pipeline.


Planner Rule System

The planner is smelt's key differentiator. It inspects model SQL (via the CST) and the model dependency graph, then produces transformations.

Transformation Types

Rules emit Transformation instructions that describe graph edits:

Transformation       Purpose
CreateNode           Add a synthetic intermediate model (shared materialization, cube split temp)
RemoveNode           Remove a model from execution (fused into another)
RedirectRef          Point all references from one model to another
SetMaterialization   Override a model's materialization strategy
SetIncremental       Mark a model for incremental execution with time partitioning
ReplaceWithPlan      Replace single-query execution with a multi-step plan

Multi-step plans use ExecutionStep variants: CreateTemp, AppendToTemp, FinalQuery, DropTemp.
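As Rust, the vocabulary above might look like the following. The variant names come from the tables; the field names and the `describe` helper are illustrative, not the definitions in smelt-planner:

```rust
// Hedged sketch of the transformation vocabulary (variant names from the
// tables above; field names are illustrative).
#[allow(dead_code)]
enum Transformation {
    CreateNode { name: String, sql: String },
    RemoveNode { name: String },
    RedirectRef { from: String, to: String },
    SetMaterialization { model: String, strategy: String },
    SetIncremental { model: String, partition_column: String },
    ReplaceWithPlan { model: String, steps: Vec<ExecutionStep> },
}

#[allow(dead_code)]
enum ExecutionStep {
    CreateTemp { name: String, sql: String },
    AppendToTemp { name: String, sql: String },
    FinalQuery { sql: String },
    DropTemp { name: String },
}

/// Render a multi-step plan in execution order (hypothetical helper,
/// e.g. for an explain-style listing).
fn describe(steps: &[ExecutionStep]) -> Vec<String> {
    steps.iter().map(|s| match s {
        ExecutionStep::CreateTemp { name, .. } => format!("create temp {name}"),
        ExecutionStep::AppendToTemp { name, .. } => format!("append to {name}"),
        ExecutionStep::FinalQuery { .. } => "final query".to_string(),
        ExecutionStep::DropTemp { name } => format!("drop temp {name}"),
    }).collect()
}

fn main() {
    let plan = vec![
        ExecutionStep::CreateTemp { name: "t1".into(), sql: "SELECT 1".into() },
        ExecutionStep::FinalQuery { sql: "SELECT 2".into() },
        ExecutionStep::DropTemp { name: "t1".into() },
    ];
    assert_eq!(describe(&plan)[0], "create temp t1");
    assert_eq!(describe(&plan)[2], "drop temp t1");
}
```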

Rule Structure

Each rule implements a detect/rewrite pattern:

// Detection: inspect model SQL and graph structure
pub fn detect(model: &ModelInfo) -> Result<Option<Opportunity>, String>;

// Rewriting: produce transformations for a detected opportunity
pub fn rewrite(model: &ModelInfo) -> Option<Vec<ExecutionStep>>;

Current rules:

  • Cube split -- Detects models with multiple COUNT(DISTINCT) aggregations and splits into parallel sub-queries joined on GROUP BY keys, reducing memory pressure.
  • Incremental materialization -- Detects time-partitioned GROUP BY and generates DELETE+INSERT execution plans.

Phase Ordering

The planner applies rules in two phases:

  1. Cross-model rules first -- Graph-level transforms (shared materializations, model fusion, ref redirection) that restructure the dependency graph.
  2. Single-model rules second -- Execution transforms (cube split, incremental detection) that change how individual models are executed.

Python Rule Extensibility

Rules can be written in Python via PyO3. Python rules implement the same detect/rewrite pattern and return serialized Transformation values. This allows data engineers to write domain-specific optimization rules without modifying Rust code.


Dialect-Aware Printer

The printer in smelt-dialect walks the CST in a single forward pass. Each construct that needs dialect translation is a match arm in the recursive walk:

  • smelt.ref() calls are resolved to schema.model_name
  • QUALIFY is rewritten to a subquery wrapper for backends that lack native support
  • Array literal syntax is adapted per dialect
  • DATE literals, JSON functions, and other constructs are remapped as needed

The default arm emits tokens verbatim, preserving whitespace and comments. Adding a new rewrite means adding a new match arm. Nested rewrites compose naturally through recursion.

Identity property: For the native dialect (DuckDB), the printer emits SQL identical to the input modulo smelt extension resolution. This is a testable correctness invariant.
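A minimal sketch of this pass, assuming a trivially simplified token model (the real printer walks a Rowan CST, not a flat token list): one arm translates, the default arm copies, and identity falls out for unrewritten input.

```rust
// Minimal sketch of the single-pass printer (hypothetical token model): one
// match arm resolves smelt.ref() calls, the default arm copies the original
// text verbatim, giving the identity property for the native dialect.
enum Token {
    SmeltRef(String), // a parsed smelt.ref('model') call
    Verbatim(String), // any other token, including whitespace and comments
}

fn print_sql(tokens: &[Token], schema: &str) -> String {
    tokens.iter().map(|t| match t {
        // Translation arm: resolve the ref to schema.model_name.
        Token::SmeltRef(model) => format!("{schema}.{model}"),
        // Default arm: emit the original text untouched.
        Token::Verbatim(text) => text.clone(),
    }).collect()
}

fn main() {
    let tokens = vec![
        Token::Verbatim("SELECT * FROM ".into()),
        Token::SmeltRef("orders".into()),
        Token::Verbatim(" -- keep this comment".into()),
    ];
    let out = print_sql(&tokens, "main");
    assert_eq!(out, "SELECT * FROM main.orders -- keep this comment");
}
```

In the real printer each rewrite (QUALIFY, array literals, date literals) is another arm beside the ref arm, and recursion makes nested rewrites compose.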


LSP Integration

The LSP (smelt-lsp) is a thin async shell (tower-lsp) over sync Salsa queries. It provides:

  • Parse error diagnostics with accurate positions
  • Undefined ref diagnostics for smelt.ref() and smelt.source()
  • Undeclared column diagnostics for references to columns not in upstream schemas or sources.yml
  • Go-to-definition for smelt.ref(), smelt.source(), CTE names, table aliases, and column references (traces through SELECT * wildcards)
  • Hover with type information
  • Column completions including table alias completions
  • Model name completions in smelt.ref()

The LSP depends on smelt-dialect for dialect-specific informational hints (e.g., "QUALIFY will be rewritten to a subquery for PostgreSQL") without linking to any backend binary.