Web analytics example: amplitude_id and three refinements¶
A self-contained smelt example demonstrating a bronze→silver→gold pipeline
over JSON-encoded web events, with an Amplitude-style always-present
amplitude_id as the no-merging baseline plus three parallel refinements
that progressively merge the identity space — all surfaced side-by-side in a
single wide event-level table for direct comparison.
The example lives at
examples/web_analytics/
in the repo.
What this example demonstrates¶
- A complete bronze → silver → gold lineage over a synthetic 60-day event
stream of ~1M events (at
scale_factor=1.0), generated bysmelt-datageninto Parquet on disk and ingested viasmelt.sources. - Incremental sessionization with a 30-minute inactivity rule plus a platform-boundary split, using a 7-day rolling lookback so late-arriving events join to the correct session.
- A never-NULL
amplitude_idsynthesised atsilver/events_parsed:'u:' || user_idwhen the event carries a signed-in user, else'd:' || device_id. The'u:'/'d:'prefix keeps the namespaces disjoint. This is the no-merging baseline Amplitude would emit before any cross-event stitching. - Three parallel identity refinements materialised as separate
gold-layer models, joined into the wide
eventstream_with_identityrow per event with COALESCE to the device-prefixamplitude_idfor rows the refinement doesn't tag. Every method's column is non-null on every event. - Two gold-mart views (
daily_active_users_by_method,identity_method_comparison) that quantify the per-day and per-pair algorithmic differences. - Five inline
.test.sqlassertions covering session boundaries, each identity algorithm's defining invariant, and the four-way amplitude_id monotonicity invariant.
The four methods¶
Inspired by the Amplitude track-unique-users methodology. The methods here are not faithful reproductions of any Amplitude implementation, but they cover the same algorithmic spectrum (no-merging → in-session → per-device → cross-device).
amplitude_id (no-merging baseline)¶
'u:' || user_id when the event is signed-in, else 'd:' || device_id.
Always non-null. Two events on the same anonymous device share a 'd:'
identity; two signed-in events from the same user on different devices share
a 'u:' identity.
Forward-only¶
Within-session refinement. Each session's identifiable events are tagged with
the latest in-session signed-in user, via arg_max(user_id, event_ts) FILTER
(WHERE user_id IS NOT NULL) grouped by session_id, prefixed with 'u:'.
No cross-session propagation, no per-device election, no clustering. Sessions
with zero signed-in observations resolve to NULL at the model boundary; the
eventstream COALESCEs that NULL to 'd:' || device_id downstream.
Backward-fill¶
Per-device canonical-user election. The most-frequent signed-in user across
all sessions on a device wins; ties are broken by first-seen, then by
smallest user_id. Every event on that device — anonymous or signed-in,
across every session — retroactively belongs to that user, output as
'u:' || elected user_id. Devices that never had a signed-in event fall
back to 'd:' || device_id at the eventstream COALESCE.
Connected-components¶
Cross-device clustering via union-find over the (device, user)
co-occurrence graph. The cluster representative is 'u:' || min(user_id)
in the cluster, and every event of every device in the cluster is tagged
with that representative. Implemented as 8-iteration unrolled label
propagation over silver/device_user_edges.
Why four?¶
The methods span an algorithmic tradeoff that real-world product analytics teams face every day:
- amplitude_id is the no-merging baseline — every event has an identity, but anonymous events on the same device collapse to one device-prefix id, and signed-in events that should belong to the same person across two devices stay distinct. It costs nothing beyond a string concat at silver.
- Forward-only is the simplest refinement: no cross-session reasoning,
no shared state. It promotes anonymous events to
'u:' ||the latest in-session signed-in user, but only inside a session that had at least one signin. The most conservative refinement under shared devices: it never attributes events to a user who didn't sign in during that session. - Backward-fill broadens the promotion to every event on a device that has ever had a signed-in user. This is the Amplitude-basic model — it promotes many more events, at the cost of potentially attributing events to a user who wasn't actually present (e.g., a family-shared tablet).
- Connected-components is the most aggressive: it clusters across devices via shared users and attributes every event in the cluster to the cluster minimum. This can attribute events on a device the canonical user never actually used (because that device shares a user with the canonical user's device). It can also over-cluster if a device is shared by unrelated users.
Each refinement promotes a strict-or-equal superset of events from the
device-fallback 'd:' namespace to the 'u:' namespace — the central
identified-events invariant the marts encode. Per-day DAU is only
partially monotonic because the four methods are different partitions of
the same identity space, not a single refinement chain (see below).
The two marts make the tradeoffs concrete:
daily_active_users_by_methodreports per-dayidentified_events_*anddau_*counts under each of the four methods.identified_events_*is fully four-way monotone per day:raw ≤ forward_only ≤ backward_fill ≤ connected_components.dau_*is only partially monotone per day. The guaranteed inequality isdau_backward_fill ≥ dau_connected_components(cluster collapse). Otherwise,dau_forward_onlycan exceeddau_rawon days where forward_only inherits identities from prior-day signins via session continuation, anddau_backward_fillcan exceeddau_forward_onlyon days where many devices have only anon-only sessions but signed-in events on other days.
identity_method_comparisonis a pairwise event-level breakdown:(agree_user_events, agree_device_events, disagree_events, only_left_user, only_right_user, total_events)for each pair of methods. The disjointness invariantagree_user + agree_device + disagree + only_left_user + only_right_user = total_eventsholds on every row; the qualitative shape is dataset-dependent and is the primary tool for inspecting where the refinements diverge.
The DAU monotonicity inline test
(tests/dau_monotonicity_invariants.test.sql)
and the integration-test gate in
crates/smelt-datagen/tests/example_web_analytics.rs
are the correctness guards on the identified_events_* four-way
monotonicity and the dau_backward_fill ≥ dau_connected_components per-day
inequality.
Where to find the code¶
- Example directory — full source, README, datagen config, and inline tests.
models/gold/identity_forward_only.sqlmodels/gold/identity_backward_fill.sqlmodels/gold/identity_connected_components.sqlmodels/marts/daily_active_users_by_method.sqlmodels/marts/identity_method_comparison.sql
Run it locally¶
git clone https://github.com/adbrowne/smelt-sql.git
cd smelt-sql/examples/web_analytics
smelt-datagen --config datagen.yaml --scale-factor 0.01
duckdb target/dev.duckdb < setup_sources.sql
smelt build
smelt test
See the
example README
for a deeper walkthrough including example duckdb-prompt queries against
the marts.