Migrate from DDL, DML, and stored procedures
Introduction
One of the more common situations that new dbt adopters encounter is a historical codebase of transformations written as a hodgepodge of DDL and DML statements, or stored procedures. Going from DML statements to dbt models is often a challenging hump for new users to get over, because the process involves a significant paradigm shift between a procedural flow of building a dataset (e.g. a series of DDL and DML statements) to a declarative approach to defining a dataset (e.g. how dbt uses SELECT statements to express data models). This guide aims to provide tips, tricks, and common patterns for converting DML statements to dbt models.
Preparing to migrate
Before getting into the meat of conversion, it’s worth noting that DML statements will not always illustrate a comprehensive set of columns and column types that an original table might contain. Without knowing the DDL to create the table, it’s impossible to know precisely if your conversion effort is apples-to-apples, but you can generally get close.
If your data warehouse supports SHOW CREATE TABLE
, that can be a quick way to get a comprehensive set of columns you’ll want to recreate. If you don’t have the DDL, but are working on a substantial stored procedure, one approach that can work is to pull column lists out of any DML statements that modify the table, and build up a full set of the columns that appear.
As for ensuring that you have the right column types, since models materialized by dbt generally use CREATE TABLE AS SELECT
or CREATE VIEW AS SELECT
as the driver for object creation, tables can end up with unintended column types if the queries aren’t explicit. For example, if you care about INT
versus DECIMAL
versus NUMERIC
, it’s generally going to be best to be explicit. The good news is that this is easy with dbt: you just cast the column to the type you intend.
We also generally recommend that column renaming and type casting happen as close to the source tables as possible, typically in a layer of staging transformations, which helps ensure that future dbt modelers will know where to look for those transformations! See How we structure our dbt projects for more guidance on overall project structure.
Operations we need to map
There are four primary DML statements that you are likely to have to convert to dbt operations while migrating a procedure:
INSERT
UPDATE
DELETE
MERGE
Each of these can be addressed using various techniques in dbt. Handling MERGE
s is a bit more involved than the rest, but can be handled effectively via dbt. The first three, however, are fairly simple to convert.
Map INSERTs
An INSERT
statement is functionally the same as using dbt to SELECT
from an existing source or other dbt model. If you are faced with an INSERT
-SELECT
statement, the easiest way to convert the statement is to just create a new dbt model, and pull the SELECT
portion of the INSERT
statement out of the procedure and into the model. That’s basically it!
To really break it down, let’s consider a simple example:
INSERT INTO returned_orders (order_id, order_date, total_return)
SELECT order_id, order_date, total FROM orders WHERE type = 'return'
Converting this with a first pass to a dbt model (in a file called returned_orders.sql) might look something like:
SELECT
order_id as order_id,
order_date as order_date,
total as total_return
FROM {{ ref('orders') }}
WHERE type = 'return'
Functionally, this would create a model (which could be materialized as a table or view depending on needs) called returned_orders
that contains three columns: order_id
, order_date
, total_return
) predicated on the type column. It achieves the same end as the INSERT
, just in a declarative fashion, using dbt.
A note on FROM
clauses
In dbt, using a hard-coded table or view name in a FROM
clause is one of the most serious mistakes new users make. dbt uses the ref and source macros to discover the ordering that transformations need to execute in, and if you don’t use them, you’ll be unable to benefit from dbt’s built-in lineage generation and pipeline execution. In the sample code throughout the remainder of this article, we’ll use ref statements in the dbt-converted versions of SQL statements, but it is an exercise for the reader to ensure that those models exist in their dbt projects.
Sequential INSERT
s to an existing table can be UNION ALL
’ed together
Since dbt models effectively perform a single CREATE TABLE AS SELECT
(or if you break it down into steps, CREATE
, then an INSERT
), you may run into complexities if there are multiple INSERT
statements in your transformation that all insert data into the same table. Fortunately, this is a simple thing to handle in dbt. Effectively, the logic is performing a UNION ALL
between the INSERT
queries. If I have a transformation flow that looks something like (ignore the contrived nature of the scenario):
CREATE TABLE all_customers
INSERT INTO all_customers SELECT * FROM us_customers
INSERT INTO all_customers SELECT * FROM eu_customers
The dbt-ified version of this would end up looking something like:
SELECT * FROM {{ ref('us_customers') }}
UNION ALL
SELECT * FROM {{ ref('eu_customers') }}
The logic is functionally equivalent. So if there’s another statement that INSERT
s into a model that I’ve already created, I can just add that logic into a second SELECT
statement that is just UNION ALL
'ed with the first. Easy!