Dataform template
Dataform is a service for data analysts to develop, test, control versions, and schedule complex workflows for data transformation in BigQuery. Dataform lets us manage data transformation in the Extraction, Loading, and Transformation (ELT) process for data integration. After raw data is extracted from source systems and loaded into BigQuery, Dataform helps us to transform it into a well-defined, tested, and documented suite of data tables.
Sources:
- Dataform template repository
- Create and execute a SQL workflow in Dataform
- Introduction to SQL workflows
- Configure Dataform workflow settings
- Recommended structure of the definitions directory
- SQLX file config block
- SQLX file body
- Connect to a third-party Git repository
The template repository structure
The following shows the repository structure:
.
├── README.md
├── definitions
│ ├── intermediate
│ ├── outputs
│ └── sources
└── workflow_settings.yaml
The workflow settings file
The workflow_settings.yaml
file, introduced in
Dataform core 3.0, stores Dataform workflow settings in the YAML
format. The following code shows a sample
workflow_settings.yaml
file:
defaultProject: <gcp-project-id>
defaultDataset: <dataset-name>
defaultLocation: US
defaultAssertionDataset: dataform_assertions
In the preceding code sample, the key-value pairs are described as follows:
defaultProject
: Your BigQuery Google Cloud project ID.defaultDataset
: The BigQuery dataset in which Dataform creates assets, called dataform by default.defaultLocation
: Your default BigQuery dataset region. In this location, Dataform processes your code and stores executed data. This processing region has to match the location of your BigQuery datasets, but it does not need to match the Dataform repository region.defaultAssertionDataset
: The BigQuery dataset in which Dataform creates views with assertion results, called dataform_assertions by default.
The definitions directory
The following structure of subdirectories in the
definitions
directory reflects the key stages of a SQL
workflow:
sources
: Data source declarations and basic transformation of source data, for example, filtering.intermediate
: Tables and actions that read from sources and transform data before you use the transformed data to define outputs tables. Tables typically not exposed to additional processes or tools, such as business intelligence (BI) tools, after Dataform executes them to BigQuery.outputs
: Definitions of tables consumed by processes or tools, such as BI, after Dataform executes them in BigQuery.extra
: Files outside of the main pipeline of your SQL workflow, for example, files that contain workflow data prepared for additional use, like machine learning. An optional and custom subdirectory.
The following example shows a subdirectory structure inside the
definitions
directory with filenames that conform to the
recommended naming strategy:
definitions/
sources/
google_analytics.sqlx
google_analytics_filtered.sqlx
intermediate/
stg_analytics_concept.sqlx
outputs/
customers.sqlx
sales/
sales.sqlx
sales_revenue.sqlx
ads/
campaigns.sqlx
ads_revenue.sqlx
SQLX files
To use Dataform core, we can write SQLX files. Each SQLX file contains a query that defines a database relation that Dataform creates and updates inside BigQuery.
SQLX file config block
A SQLX file consists of a config block and a body. All config properties, and the config block itself, are optional. Given this, any plain SQL file is a valid SQLX file that Dataform executes as-is. We can use the cofing block for:
Data source declarations, for example:
config {type: "declaration", database: "bigquery-public-data", schema: "samples", "shakespeare", name: }
where:
database
: the project ID of the project which contains the data source.schema
: the BigQuery dataset in which the data source exists.name
: the name of the table or view that you want to use as the data source. You can later use that name to reference the data source in Dataform.
Create tables, for example:
config {type: "table" }
where:
type
: specifies the table type, which can be one of the following:table
,view
, orincremental
.
SQLX file body
The SQLX body contains the SQL query used to create a table or
perform other SQL operations in BigQuery. In the body we can use the
ref
function that lets us reference tables defined in the
Dataform project instead of hard coding the schema and table names of
the data table. For example:
type: "table" }
config {
SELECT
AS date,
order_date AS order_id,
order_id AS order_status,
order_status SUM(item_count) AS item_count,
SUM(amount) AS revenue
FROM ${ref("store_clean")}
GROUP BY 1, 2, 3