Snowflake in Commerce: From Data Silo to Shared Architecture

Commerce data lives in silos. The shop knows the order, the ERP knows the delivery, the CRM knows the contact, the marketing stack knows the campaign, the service tool knows the case — and each of these worlds carries its own definition of what "customer" and "revenue" actually mean. For CDOs and CIOs this becomes visible the moment CLV reports, margin analyses and campaign ROI sit side by side and contradict each other. A data foundation on Snowflake with dbt modelling is the answer we have built most often in DACH commerce projects in recent years — because it resolves exactly this contradiction without forcing a redesign of the operating model. This article shows why commerce data is fragmented, what changes once a shared architecture is in place, and how Snowflake differentiates itself from BigQuery, Databricks and Redshift.

Why commerce data is fragmented

The data landscape in commerce has grown over years, not been designed. With retailers and industrial manufacturers, we typically see five to seven operational systems, each of which works well on its own but shares no common data model: shop platform (Shopware, commercetools, Spryker), ERP (SAP S/4HANA, Microsoft Dynamics, Infor), CRM (Salesforce, HubSpot, Dynamics 365), marketing stack (Klaviyo, Emarsys, Adobe Campaign), service tool (Zendesk, Salesforce Service Cloud), PIM, and in many cases a subscription or loyalty system.

Each of these systems writes its own truth. The shop closes an order with gross revenue; the ERP books it net, with discounts, cancellations and returns; the CRM only knows the contact event. The consequence is not just a reporting problem — it's a definition problem. Marketing optimises against gross revenue, controlling calculates net margin, sales measures pipeline. Three pillars, three truths.

On top of that comes the identity problem. In B2C, a customer is often known under five different identifiers in five systems — guest order in the shop, email hash in the marketing stack, customer number in the ERP, account in the CRM, ticket ID in service. In B2B it becomes more complex: buying centres, multi-branch structures and contract-based assortments mean that a "customer" can also be a hierarchy. Anyone calculating CLV without clean identity resolution is systematically wrong on the sum.

As long as these systems only exchange data among themselves — point-to-point integrations, nightly exports, BI tools with proprietary connectors — complexity grows quadratically. Each new source means n new connections. A shared data layer reduces this to 1:n.

Snowflake and dbt as a shared data layer

Snowflake is a cloud data platform with separated compute and storage: data sits once, workloads run in independent virtual warehouses. That is the technical precondition for marketing, controlling, data science and BI to work on the same dataset simultaneously without blocking each other. dbt (data build tool) adds a declarative modelling layer in SQL on top — version-controlled in Git, with tests, lineage and documentation.

In foobar projects we work with four layers that have become the de-facto standard in the dbt community:

- raw — the unchanged ingestion tables from source systems, loaded via connectors such as Fivetran, Airbyte or Snowpipe. - staging — typed, renamed, uniquely referenced versions of the source tables. One staging layer per source system. - intermediate — joins, identity resolution, business logic that is not yet report-ready but is reused as a building block. - marts — domain models: `marts.commerce`, `marts.marketing`, `marts.finance`. This is where the tables sit on which dashboards and AI models are built.

This layering is conceptually close to the "Bronze/Silver/Gold" logic from the Databricks lakehouse vocabulary; in the dbt world the terms raw/staging/intermediate/marts are more common. Equivalent in content, cleaner terminology for a Snowflake setup.

What changes concretely once the foundation is in place: CLV, revenue and margin have exactly one definition. The `dim_customer` table in `marts.commerce` is the single source of truth — marketing automation, service routing, forecasting and board reporting all draw from the same source. Metrics are defined once in dbt and consumed identically everywhere. Discussions about whose number is correct don't end because people start calculating the same way — they end because the model is authoritative.

Architecture sketch: sources, layers, applications

A commerce data foundation on Snowflake can be thought of in three columns — sources on the left, Snowflake layers in the middle, applications on the right. On the source side: shop, ERP, CRM, marketing stack, service and PIM. Via connectors their tables land in `raw` — a schema-stable mirror of the source, ideally with change data capture so that changes, not just end states, arrive.

In the middle, dbt models the staging layer (typing, renaming), the intermediate layer (joins, identity resolution, sessionizing) and the marts layer (dimensions and facts per domain). This is where the central tables emerge: `dim_customer`, `dim_product`, `fct_orders`, `fct_marketing_touchpoints`, `fct_service_cases`. dbt tests monitor uniqueness, completeness and value ranges; the generated documentation shows where each column originates.

On the application side, four typical consumer groups draw from the same data model: BI tools (Looker, Tableau, Power BI) for reporting; marketing automation for segments and campaigns; operations and service for real-time lookups; data science for models like CLV, churn and next-best-action. Snowflake Cortex extends this layer with native ML and LLM functions that operate directly on the modelled tables — without exporting data into a separate ML platform.

This architecture is not spectacular. It is anchored in business processes, networked between the commerce and intelligence layer, and forward-thought because new sources or applications attach to the foundation instead of producing another point-to-point connector.

Snowflake, BigQuery, Databricks, Redshift: three decision criteria

The question of which platform is the right one cannot be answered off the cuff. It depends on three criteria we work through consistently in architecture workshops.

First: multi-cloud. BigQuery is tightly bound to Google Cloud, Redshift to AWS, Azure Synapse to Microsoft. Snowflake runs natively on AWS, Azure and GCP — the same account can host workloads on different cloud providers. For DACH groups with hybrid cloud strategies (for example SAP world on Azure, analytics on AWS) this is a real difference. Databricks is multi-cloud as well, but functionally a lakehouse, not a classical warehouse — its strengths sit with streaming, Spark workloads and ML platforms.

Second: workload separation. Snowflake's separated virtual warehouses mean that a BI dashboard, a nightly dbt run and an ad-hoc data-science notebook scale independently of each other. With BigQuery, slots are shared; with Redshift, you scale clusters or use serverless. Anyone with many competing workloads — reporting, marketing automation, forecasting, ML training — gets the simplest isolation with Snowflake.

Third: data sharing. Snowflake shares data across accounts without copying it. In a commerce context that means supplier data, POS data from retail partners or market research feeds can be incorporated live. For retailers with marketplace strategies or industrial manufacturers with dealer networks, this is a function for which BigQuery (Analytics Hub) and Databricks (Delta Sharing) have counterparts, but where Snowflake is the most mature.

None of these arguments makes Snowflake the universally correct choice. But they are the three questions that most decisions can be tied to with confidence.

Five signs your commerce data needs a foundation

If you need a quick check in an architecture review — these five signs are the ones we most often see trigger a foundation project.

#

Sign

What it means

1

Marketing, sales and controlling each report a different revenue figure

Definitions are misaligned; each source calculates differently

2

A new campaign needs two weeks of data lead time

Identity resolution and segmentation happen manually, not in the model

3

CLV exists as a concept, but as a number only in PowerPoint

There is no productive table that multiple teams consume

4

Each new data source requires a new connector to each destination

Point-to-point integration instead of a central data layer

5

BI reports and marketing segments diverge

No shared dim_customer; each tool reads from its own source

Industry case: RFM-based CLV at a DACH multi-channel retailer

A multi-channel retailer from the DACH region with its own TV, online and app presence built a customer lifetime value prediction on Snowflake and dbt with foobar Agency. The starting position was typical: order data in the ERP, contact events in multiple marketing tools, master data in the CRM, customer service data in a third system. A consistent statement about customer value across reports was not achievable.

The foundation was built in three steps. First: identity resolution. Orders, contact points and service cases were merged into a unique `dim_customer` identity via deterministic keys (customer number, hashed email, phone number) and probabilistic fallbacks. Second: RFM calculation. On the fact table `fct_orders`, recency, frequency and monetary value per customer are calculated in dbt — deterministically, traceably, secured against definition drift by dbt tests. Third: probabilistic CLV prediction. Based on the RFM features, a model from the BG/NBD and Gamma-Gamma family estimates the expected number and expected value of future purchases per customer.

The result is not a dashboard, but a table: `marts.commerce.customer_clv` contains for each customer a recency value, an RFM segment, a historical value and a probabilistic prediction. Marketing uses it for segments and routing, service uses it for escalation routing, controlling uses it for cohort reporting. Same table, same definition — no more database tourism between teams. How the RFM-based CLV logic works in detail, and why it favours probabilistic models over pure ML black boxes, is covered in post 2 of this series.

A data architecture only becomes a foundation when three teams from three worlds pull the same number and no one has to ask whose definition applies. Snowflake and dbt are not magic — they force us to write definitions cleanly once and then make them productive. The effect shows up in every project first where CLV, margin and campaign ROI used to drift apart.

Dominik ThalmeierTeam Lead Data Science foobar Agency

Frequently asked questions

Not necessarily. If your world sits entirely in Google Cloud and workload separation is not a problem, BigQuery is a valid choice. Snowflake becomes relevant when workloads are distributed across AWS, Azure and GCP, when competing workloads slow each other down, or when you want to share data cross-account. The modelling logic in dbt works on both platforms — a migration between the warehouses is feasible, but not a trivial weekend project.

A productive minimum setup with three to four connected sources, raw/staging/intermediate/marts layers and a first domain-specific marts layer (for example commerce) typically takes us twelve to sixteen weeks. The precondition is that source systems are clearly defined and a data owner is named per source. Identity resolution and AI use cases such as CLV prediction follow afterwards in iterative steps — they are not a prerequisite for step one.

Snowflake is consumption-based: you pay compute credits per warehouse runtime used and storage per data volume. In a commerce context with a well-modelled dbt pipeline and cleanly configured virtual warehouses (auto-suspend, size tuning), running platform costs are usually well below the tool costs caused by redundant BI, CDP and analytics stacks. Reliable numbers only emerge after a workload assessment — flat monthly figures without a workload profile are rarely meaningful.

Data foundation workshop

In two hours at foobar Agency we clarify with you which sources your foundation has to carry, which metrics can be resolved first and whether Snowflake, BigQuery or Databricks fits your cloud landscape — hand on the architecture diagram, not on slides.

Dominik Thalmeier

Dominik Thalmeier

Data Scientist

Dominik ist Data Scientist und Softwareentwickler und setzt sich leidenschaftlich dafür ein, eine Brücke zwischen Forschung und Industrie zu schlagen. Er hat sich mit Algorithmen des verstärkenden Lernens und des maschinellen Lernens befasst, um Krankheiten wie Demenz und genetisch bedingten Hörverlust zu diagnostizieren. In der Industrie ist er als Berater für Data Science Architektur und Governance tätig und arbeitet als KI-Entwickler und Data Scientist.

All articles by Dominik Thalmeier

Get in touch

We look forward to your enquiry.

Please accept marketing cookies to load the registration form.