Architecture¶
smelt is a SQL-to-SQL compiler and orchestrator for data pipelines. It parses SQL models written in smelt's dialect, resolves dependencies, optionally optimizes across model boundaries, and emits dialect-specific SQL for target execution engines.
Three ideas make smelt different from dbt:
- Logical/Physical Separation -- Users write WHAT to compute; the planner and dialect-aware printer decide HOW it executes on each backend.
- Cross-Model Planning -- The planner operates at the model-graph level (creating shared materializations, redirecting refs, merging models), not at the expression level.
- First-Class Editor Support -- LSP + Salsa + Rowan for incremental compilation and real-time diagnostics.
Compilation Pipeline¶
Source files flow through five stages before reaching a target engine:
```text
Source Files (.sql / .py)
          |
          v
Parse (smelt-parser)       -- Rowan CST, error recovery, lossless
          |
          v
Analyze (smelt-db)         -- Salsa incremental queries: refs, types, diagnostics
          |
          v
Plan (smelt-planner)       -- model-graph transforms (optional)
          |
          v
Generate (smelt-dialect)   -- CST walk --> target-specific SQL string
          |
          v
Execute (smelt-backend-*)  -- send SQL to DuckDB / Spark / etc.
```
Key invariant: The Rowan CST is the single representation from parse through generation. There is no intermediate IR. This avoids fidelity boundaries and preserves comments, formatting, and smelt extensions throughout the pipeline.
Stage Details¶
Parse -- The parser uses recursive descent with error recovery at sync points (semicolons, keywords). Invalid input produces ERROR nodes in the CST rather than aborting. Python model files are discovered via subprocess/PyO3 and their SQL is extracted from @model decorators before parsing.
Analyze -- Salsa provides automatic incremental recomputation. When a file changes, only affected queries re-evaluate. The CST itself is never mutated; semantic information is computed as derived queries (parse_file, model_refs, resolve_ref, file_diagnostics, model_schema).
Plan -- The planner reads CST structure to detect patterns, then emits Transformation instructions and new SQL strings. It does not mutate existing CSTs. Planning is optional; models run correctly without it.
Generate -- The dialect-aware printer walks the CST in a single forward pass and emits target-specific SQL. Each construct that needs translation is a match arm in the recursive walk. For the native dialect (DuckDB), the printer emits SQL identical to the input modulo smelt.ref() resolution.
Execute -- Each backend implementation handles DDL (CREATE TABLE AS, views, incremental inserts) and returns ExecutionResult with duration, row count, and optional data preview as Arrow RecordBatch.
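The execution boundary can be sketched as a trait plus a result struct. This is a simplified, synchronous stand-in: the real `Backend` trait is async and the preview is an Arrow `RecordBatch`, and the field and type names here are illustrative assumptions rather than the actual smelt-backend API.

```rust
use std::time::Duration;

// Hypothetical sketch of a backend result; field names are illustrative,
// not the actual smelt-backend definitions.
#[derive(Debug)]
pub struct ExecutionResult {
    pub duration: Duration,
    pub rows_affected: u64,
    pub preview: Option<Vec<String>>, // stand-in for an Arrow RecordBatch preview
}

// Simplified, synchronous stand-in for the async `Backend` trait.
pub trait Backend {
    fn execute(&self, sql: &str) -> Result<ExecutionResult, String>;
}

// A toy backend that "executes" by validating the SQL string is non-empty.
pub struct MockBackend;

impl Backend for MockBackend {
    fn execute(&self, sql: &str) -> Result<ExecutionResult, String> {
        if sql.trim().is_empty() {
            return Err("empty SQL".to_string());
        }
        Ok(ExecutionResult {
            duration: Duration::from_millis(1),
            rows_affected: 0,
            preview: None,
        })
    }
}
```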
Crate Structure¶
Core Pipeline¶
| Crate | Purpose | Key Types | Sync/Async |
|---|---|---|---|
| `smelt-types` | SQL data types shared across crates | `DataType`, `TypedColumn` | sync |
| `smelt-parser` | Rowan-based error-recovery parser | `SyntaxNode`, `SyntaxKind`, `Parse` | sync |
| `smelt-core` | Project config and model discovery | `Config`, `ModelFile`, `ModelDiscovery` | sync |
| `smelt-db` | Salsa incremental queries over parsed models | `Database`, `parse_file`, `model_refs`, `model_schema` | sync |
| `smelt-dialect` | Dialect-aware CST printer and backend capabilities | `SqlDialect`, `BackendCapabilities` | sync |
| `smelt-planner` | Model-graph optimization rules | `Transformation`, `ExecutionStep`, `Opportunity` | sync |
Execution¶
| Crate | Purpose | Key Types | Sync/Async |
|---|---|---|---|
| `smelt-cli` | Command-line interface (`run`, `explain`, `backbuild`) | -- | async |
| `smelt-backend` | Execution trait and result types | `Backend`, `ExecutionResult` | async |
| `smelt-backend-duckdb` | DuckDB execution | `DuckDbBackend` | async |
| `smelt-backend-spark` | Spark/Databricks execution | `SparkBackend` | async |
Tools¶
| Crate | Purpose | Key Types | Sync/Async |
|---|---|---|---|
| `smelt-lsp` | Language Server Protocol server | -- | async (tower-lsp) |
| `smelt-state` | Run manifests and interval tracking | `RunManifest`, `IntervalStore` | sync |
| `smelt-ui` | Web dashboard and run execution | -- | async |
Testing¶
| Crate | Purpose | Sync/Async |
|---|---|---|
| `smelt-bench` | Benchmarks | standalone |
| `smelt-datagen` | Test data generation | standalone |
| `smelt-parser-compat` | Compatibility tests against pg_query, sqlparser-rs, sqlglot | sync |
Dependency Graph¶
```text
            smelt-types
                 |
            smelt-parser
                 |
          +------+------+
          |      |      |
   smelt-core    |   smelt-dialect
          |      |      |
          |   smelt-db  |
          |      |      |
          +------+------+------+
          |      |             |
    smelt-lsp    |       smelt-planner
          |      |             |
          |   smelt-cli--------+
          |      |
          |  smelt-backend
          |      |
          |   +--+------+
          |   |         |
          | duckdb    spark
          | backend   backend
          |
    (LSP binary)
```
Why smelt-dialect is separate
The LSP needs dialect information (e.g., "QUALIFY will be rewritten for PostgreSQL") but must not link against heavy async/native dependencies like Arrow, Tokio, and DuckDB. smelt-dialect is a lightweight sync crate that both smelt-lsp and smelt-cli depend on, without either needing to depend on smelt-backend.
Key Design Decisions¶
Rowan CST as Single Representation¶
There is no intermediate IR like DataFusion LogicalPlan. The concrete syntax tree from parsing flows through the entire pipeline. This preserves:
- Comments and whitespace -- important for readable generated SQL
- smelt extensions -- `smelt.ref()`, `smelt.metric()`, named parameters
- Original formatting -- the printer's default arm emits tokens verbatim
- Error nodes -- the LSP works with incomplete/invalid code
Avoiding an IR also eliminates two fidelity boundaries (CST-to-IR and IR-to-SQL) that would lose information or require complex round-tripping.
Salsa for Incremental Computation¶
Salsa tracks query dependencies automatically. When a file changes, only affected queries are recomputed. This is critical for LSP responsiveness -- parsing 1000 models once takes around 1 second, but re-analyzing a single changed file takes around 50ms.
Key Salsa queries in smelt-db:
- `parse_file()` -- CST from source text
- `model_refs()` -- ref names and source positions
- `resolve_ref()` -- target model for a ref
- `file_diagnostics()` -- errors and warnings
- `model_schema()` -- inferred column types
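The caching behavior these queries rely on can be illustrated with a hand-rolled, single-query version. This is not the Salsa API: real Salsa tracks dependencies between queries automatically, while this toy cache keys one derived value (a `parse_file` stand-in, modeled as a token count) on a per-file revision.

```rust
use std::collections::HashMap;

// Toy sketch of incremental recomputation: cache a derived value per file,
// keyed by a revision of its input, and recompute only when the input changed.
struct ParseCache {
    sources: HashMap<String, (u64, String)>, // file -> (revision, text)
    parsed: HashMap<String, (u64, usize)>,   // file -> (revision, derived value)
    recomputes: usize,                       // instrumentation for the example
}

impl ParseCache {
    fn new() -> Self {
        Self { sources: HashMap::new(), parsed: HashMap::new(), recomputes: 0 }
    }

    // Setting a source bumps its revision, invalidating the cached derivation.
    fn set_source(&mut self, file: &str, text: &str) {
        let rev = self.sources.get(file).map_or(0, |(r, _)| r + 1);
        self.sources.insert(file.to_string(), (rev, text.to_string()));
    }

    // Stand-in for `parse_file`: derives a value (here, a whitespace token
    // count) and reuses the cached result while the revision is unchanged.
    fn parse_file(&mut self, file: &str) -> usize {
        let (rev, text) = self.sources[file].clone();
        if let Some(&(cached_rev, value)) = self.parsed.get(file) {
            if cached_rev == rev {
                return value; // input unchanged -> no recomputation
            }
        }
        self.recomputes += 1;
        let value = text.split_whitespace().count();
        self.parsed.insert(file.to_string(), (rev, value));
        value
    }
}
```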
Error-Recovery Parsing¶
Developers write invalid code most of the time while editing. The parser always produces a tree, even from invalid input. Rowan's ERROR nodes allow the LSP to provide diagnostics, completions, and go-to-definition on incomplete code. The parser uses sync points (semicolons, SQL keywords) to recover and continue parsing after errors.
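The recovery strategy can be sketched at the token level. This is an assumption-laden simplification (the real parser builds a lossless Rowan CST), but it shows the control flow: wrap unexpected tokens in an error node, skip forward to a sync point, and keep parsing.

```rust
// Toy sketch of sync-point recovery over a pre-tokenized input.
#[derive(Debug, PartialEq)]
enum Node {
    Statement(Vec<String>),
    Error(Vec<String>),
}

// Sync points: semicolons and statement-starting keywords (illustrative set).
fn is_sync_point(tok: &str) -> bool {
    tok == ";" || matches!(tok, "SELECT" | "CREATE" | "WITH")
}

fn parse(tokens: &[&str]) -> Vec<Node> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < tokens.len() {
        if matches!(tokens[i], "SELECT" | "CREATE" | "WITH") {
            // Happy path: consume a statement up to the terminating ';'.
            let mut stmt = Vec::new();
            while i < tokens.len() && tokens[i] != ";" {
                stmt.push(tokens[i].to_string());
                i += 1;
            }
            i += 1; // consume ';'
            out.push(Node::Statement(stmt));
        } else {
            // Recovery: wrap unexpected tokens in an Error node and skip
            // to the next sync point instead of aborting the whole parse.
            let mut junk = Vec::new();
            while i < tokens.len() && !is_sync_point(tokens[i]) {
                junk.push(tokens[i].to_string());
                i += 1;
            }
            if i < tokens.len() && tokens[i] == ";" {
                i += 1; // a ';' sync point belongs to the error region
            }
            out.push(Node::Error(junk));
        }
    }
    out
}
```

Even with garbage between two statements, both statements still parse, which is what lets the LSP keep working on broken files.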
Two-Graph Architecture¶
smelt maintains two distinct graph representations:
Logical graph -- The user's models as written. Each model maps to a .sql file. References between models form a DAG. This graph is never modified by the planner.
Physical graph -- The execution plan. The planner may add synthetic nodes (shared materializations, cube split intermediates), remove nodes (fused models), redirect references, and change materialization strategies. The physical graph is what actually gets executed.
This separation means users always see their original model structure, while the execution engine works with an optimized plan.
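A minimal sketch of the split, with hypothetical types (the `Graph` struct, `plan` function, and model names are illustrative, not smelt's actual code): the planner clones the logical graph and applies transformations only to the copy.

```rust
use std::collections::HashMap;

// Toy dependency graph: model name -> upstream models it references.
#[derive(Clone)]
struct Graph {
    edges: HashMap<String, Vec<String>>,
}

impl Graph {
    // Point every reference at `from` to `to` instead (cf. RedirectRef).
    fn redirect_ref(&mut self, from: &str, to: &str) {
        for deps in self.edges.values_mut() {
            for dep in deps.iter_mut() {
                if dep.as_str() == from {
                    *dep = to.to_string();
                }
            }
        }
    }
}

// The physical graph starts as a copy of the logical graph; all planner
// edits happen on the copy, so the user's graph is never modified.
fn plan(logical: &Graph) -> Graph {
    let mut physical = logical.clone();
    physical.redirect_ref("stg_orders", "shared_stg_orders");
    physical
}
```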
Sync Core, Async Edges¶
All core logic (parsing, analysis, optimization, printing) is synchronous. Async is only at the execution boundary where network I/O happens (backend crates, LSP server via tower-lsp). This keeps the codebase simple, testable, and compatible with Salsa (which is sync).
Expression Optimization is the Engine's Job¶
smelt does not attempt predicate pushdown, join reordering, or cost-based optimization within a single query. DuckDB, Spark, and BigQuery all have mature query optimizers for that. smelt's value is in cross-model planning, which no single engine can do because no engine sees the full pipeline.
Planner Rule System¶
The planner is smelt's key differentiator. It inspects model SQL (via the CST) and the model dependency graph, then produces transformations.
Transformation Types¶
Rules emit Transformation instructions that describe graph edits:
| Transformation | Purpose |
|---|---|
| `CreateNode` | Add a synthetic intermediate model (shared materialization, cube split temp) |
| `RemoveNode` | Remove a model from execution (fused into another) |
| `RedirectRef` | Point all references from one model to another |
| `SetMaterialization` | Override a model's materialization strategy |
| `SetIncremental` | Mark a model for incremental execution with time partitioning |
| `ReplaceWithPlan` | Replace single-query execution with a multi-step plan |
Multi-step plans use `ExecutionStep` variants: `CreateTemp`, `AppendToTemp`, `FinalQuery`, `DropTemp`.
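As a sketch, the instruction types might look like the following. The variant names come from the table above; the payload fields are assumptions for illustration, not the actual smelt-planner definitions.

```rust
// Hypothetical shapes for the planner's instruction types; variant names
// match the documented set, payloads are illustrative guesses.
#[derive(Debug, PartialEq)]
enum Transformation {
    CreateNode { name: String, sql: String },
    RemoveNode { name: String },
    RedirectRef { from: String, to: String },
    SetMaterialization { model: String, strategy: String },
    SetIncremental { model: String, partition_column: String },
    ReplaceWithPlan { model: String, steps: Vec<ExecutionStep> },
}

// Steps of a multi-step plan, executed in order against the backend.
#[derive(Debug, PartialEq)]
enum ExecutionStep {
    CreateTemp { name: String, sql: String },
    AppendToTemp { name: String, sql: String },
    FinalQuery { sql: String },
    DropTemp { name: String },
}
```

Because transformations are plain data rather than CST mutations, they can be produced by Rust or Python rules alike and inspected before execution.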
Rule Structure¶
Each rule implements a detect/rewrite pattern:
```rust
// Detection: inspect model SQL and graph structure
pub fn detect(model: &ModelInfo) -> Result<Option<Opportunity>, String>;

// Rewriting: produce transformations for a detected opportunity
pub fn rewrite(model: &ModelInfo) -> Option<Vec<ExecutionStep>>;
```
Current rules:
- Cube split -- Detects models with multiple `COUNT(DISTINCT)` aggregations and splits them into parallel sub-queries joined on the GROUP BY keys, reducing memory pressure.
- Incremental materialization -- Detects time-partitioned GROUP BY and generates DELETE+INSERT execution plans.
Phase Ordering¶
The planner applies rules in two phases:
- Cross-model rules first -- Graph-level transforms (shared materializations, model fusion, ref redirection) that restructure the dependency graph.
- Single-model rules second -- Execution transforms (cube split, incremental detection) that change how individual models are executed.
Python Rule Extensibility¶
Rules can be written in Python via PyO3. Python rules implement the same detect/rewrite pattern and return serialized Transformation values. This allows data engineers to write domain-specific optimization rules without modifying Rust code.
Dialect-Aware Printer¶
The printer in smelt-dialect walks the CST in a single forward pass. Each construct that needs dialect translation is a match arm in the recursive walk:
- `smelt.ref()` calls are resolved to `schema.model_name`
- `QUALIFY` is rewritten to a subquery wrapper for backends that lack native support
- Array literal syntax is adapted per dialect
- `DATE` literals, JSON functions, and other constructs are remapped as needed
The default arm emits tokens verbatim, preserving whitespace and comments. Adding a new rewrite means adding a new match arm. Nested rewrites compose naturally through recursion.
Identity property: For the native dialect (DuckDB), the printer emits SQL identical to the input modulo smelt extension resolution. This is a testable correctness invariant.
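A token-level toy makes both the default-arm behavior and the identity property concrete. The `Dialect` enum and the placeholder QUALIFY arm are illustrative assumptions; the real printer walks a CST and wraps the query in a subquery rather than emitting a marker.

```rust
// Toy model of the printer's single forward pass: translate the constructs
// a dialect needs, and let the default arm pass everything else through
// verbatim. Token rewriting stands in for the real recursive CST walk.
#[derive(Clone, Copy, PartialEq)]
enum Dialect {
    DuckDb,   // native: identity modulo smelt.ref() resolution
    Postgres, // e.g. no native QUALIFY
}

fn print_sql(tokens: &[&str], dialect: Dialect) -> String {
    let mut out = Vec::new();
    for &tok in tokens {
        match (tok, dialect) {
            // Hypothetical rewrite arm: mark QUALIFY for non-native targets
            // (the real printer emits a subquery wrapper here).
            ("QUALIFY", Dialect::Postgres) => out.push("/* rewritten */".to_string()),
            // Default arm: emit the token verbatim.
            _ => out.push(tok.to_string()),
        }
    }
    out.join(" ")
}
```

The identity invariant is then directly testable: printing for the native dialect reproduces the input.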
LSP Integration¶
The LSP (smelt-lsp) is a thin async shell (tower-lsp) over sync Salsa queries. It provides:
- Parse error diagnostics with accurate positions
- Undefined ref diagnostics for `smelt.ref()` and `smelt.source()`
- Undeclared column diagnostics for references to columns not in upstream schemas or `sources.yml`
- Go-to-definition for `smelt.ref()`, `smelt.source()`, CTE names, table aliases, and column references (traces through `SELECT *` wildcards)
- Hover with type information
- Column completions, including table alias completions
- Model name completions in `smelt.ref()`
The LSP depends on smelt-dialect for dialect-specific informational hints (e.g., "QUALIFY will be rewritten to a subquery for PostgreSQL") without linking to any backend binary.