Getting Started

This guide walks you through setting up EMPI Lite from a fresh clone to your first successful dbt run. Estimated time: 1-2 hours if your source tables are already available.

Prerequisites

Before you start, confirm you have:

dbt Core ≥ 1.7 installed (dbt --version)
The appropriate dbt adapter for your warehouse (dbt-snowflake, dbt-bigquery, dbt-redshift, dbt-databricks, or dbt-duckdb)
Warehouse credentials and a database/schema where EMPI Lite can create tables
Source data: an eligibility table, a patient table, or both (see Data Requirements)
Permission to create schemas and tables in your warehouse

Step 1 - Clone the repo

git clone <your-repo-url> empi_lite
cd empi_lite/empi_lite    # all dbt commands run from the inner directory

The double directory empi_lite/empi_lite/ is intentional. The inner directory is the dbt project root. Always run dbt commands from there.

Step 2 - Configure your dbt profile

Add a profile named empi_lite to ~/.dbt/profiles.yml. Profile templates for all supported warehouses are below.

Snowflake

empi_lite:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: <your_account>           # e.g., xy12345.us-east-1
      user: <your_user>
      password: <your_password>
      role: <your_role>
      warehouse: <your_warehouse>
      database: <your_database>
      schema: empi
      threads: 4

BigQuery

empi_lite:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth                     # or service-account
      project: <your_gcp_project>
      dataset: empi
      threads: 4
      timeout_seconds: 300

Amazon Redshift

empi_lite:
  target: dev
  outputs:
    dev:
      type: redshift
      host: <your_cluster>.redshift.amazonaws.com
      user: <your_user>
      password: <your_password>
      port: 5439
      dbname: <your_database>
      schema: empi
      threads: 4

Microsoft Fabric / Synapse

empi_lite:
  target: dev
  outputs:
    dev:
      type: fabric
      driver: ODBC Driver 18 for SQL Server
      server: <your_server>.sql.azuresynapse.net
      port: 1433
      database: <your_database>
      schema: empi
      authentication: CLI               # or ActiveDirectoryPassword
      threads: 4

Databricks

empi_lite:
  target: dev
  outputs:
    dev:
      type: databricks
      host: <your_workspace>.azuredatabricks.net
      http_path: /sql/1.0/warehouses/<your_warehouse_id>
      token: <your_pat_token>
      catalog: <your_catalog>
      schema: empi
      threads: 4

DuckDB

empi_lite:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: /path/to/empi.duckdb       # use ':memory:' for in-memory
      schema: empi
      threads: 4

Step 3 - Create the manual review tables

EMPI Lite reads from two tables in the empi_manual_review schema. They must exist (even if empty) before the first dbt run. You can either run the DDL below or use the project macro: dbt run-operation create_manual_review_tables.

CREATE SCHEMA IF NOT EXISTS your_database.empi_manual_review;

CREATE TABLE IF NOT EXISTS your_database.empi_manual_review.match_review_decisions (
    data_source_a       VARCHAR     NOT NULL,
    source_person_id_a  VARCHAR     NOT NULL,
    data_source_b       VARCHAR     NOT NULL,
    source_person_id_b  VARCHAR     NOT NULL,
    is_match            BOOLEAN,              -- TRUE = match, FALSE = no match, NULL = pending
    reviewed            BOOLEAN,              -- FALSE = pending, TRUE = decision finalized
    reviewer_name       VARCHAR,
    review_date         DATE,
    notes               VARCHAR
);

CREATE TABLE IF NOT EXISTS your_database.empi_manual_review.split_review_decisions (
    data_source         VARCHAR     NOT NULL,
    source_person_id    VARCHAR     NOT NULL,
    is_split            BOOLEAN,               -- TRUE = split, FALSE = not split, NULL = pending
    reviewed            BOOLEAN,               -- FALSE = pending, TRUE = decision finalized
    reviewer_name       VARCHAR,
    review_date         DATE,
    notes               VARCHAR
);

Step 4 - Point EMPI Lite at your source tables

Open models/sources.yml and update the database/schema vars to match where your source tables live:

# models/sources.yml (excerpt)
sources:
  - name: empi_input
    database: "{{ var('empi_input_database') }}"
    schema:   "{{ var('empi_input_schema') }}"
    ...

Then set the corresponding vars in dbt_project.yml:

# dbt_project.yml
vars:
  empi_lite:
    empi_input_database: your_database
    empi_input_schema:   your_schema

    # Table identifiers (if your table names differ from the defaults)
    empi_eligibility_table: eligibility
    empi_patient_table:     patient

    # Manual review schema
    empi_manual_review_database: your_database
    empi_manual_review_schema:   empi_manual_review

If you only have one source type (e.g., eligibility only, no patient table), the staging model handles this automatically - the patient source is optional.

Step 5 - Install dbt packages

dbt deps

This installs dbt_utils (the only external dependency).

Step 6 - Load seeds

Load all reference data seeds:

# Load all seeds except the large static ones (ZIP proximity table, eval ground truth)
dbt seed --exclude tag:large_seed

# Load the ZIP code proximity table (only needed once; takes a few minutes on first load)
dbt seed --select gaz2024zcta5distance50miles --full-refresh

Tip: After the initial load, the ZIP proximity table never needs to be reloaded unless you explicitly want to refresh it. dbt seed --exclude tag:large_seed is the correct command for all routine runs.

Step 7 - Run the project

dbt run

This builds all staging, intermediate, and final models. On first run with a large dataset, expect this to take 15-45 minutes depending on patient volume and warehouse size.

You can monitor build progress with:

dbt run --select tag:final    # build only final output tables
dbt run --select empi_crosswalk empi_golden_record    # build specific tables

Step 8 - Run tests

dbt test

All tests should pass. Common failure causes on first run:

Source tables missing required columns → add the columns (nulls allowed)
Manual review tables don't exist → re-run the DDL from Step 3
Profile misconfiguration → check dbt debug output

Step 9 - Validate your results

Open the output tables and spot-check match quality:

-- How many records matched vs. singletons?
SELECT match_status, COUNT(*) as records
FROM empi.empi_crosswalk
GROUP BY match_status;

-- How large are the clusters?
SELECT cluster_size, COUNT(DISTINCT empi_id) as clusters
FROM empi.empi_crosswalk
GROUP BY cluster_size
ORDER BY cluster_size;

-- Review a sample of matches with their narratives
SELECT *
FROM empi.empi_patient_events
WHERE event_type = 'EMPI_MATCH'
LIMIT 25;

-- Check data quality anomalies
SELECT *
FROM empi.empi_demographic_anomalies
ORDER BY affected_records DESC;

If match quality looks off (too many or too few matches), see the Configuration Guide for threshold tuning.

Step 10 - Enable snapshots (optional but recommended)

Snapshots enable change-detection events in empi_patient_events - you'll see EMPI_ID_CHANGED, MATCH_STATUS_CHANGED, and SOURCE_DATA_UPDATED events across runs.

Run snapshots after the first dbt run:

dbt snapshot

Then enable snapshot events in dbt_project.yml:

vars:
  empi_lite:
    empi_snapshot_enabled: true

On all subsequent runs, execute in order:

dbt snapshot && dbt run

Installation pattern A - Local package import

If you want EMPI Lite to run as part of your existing dbt project rather than as a standalone project, use the local package import pattern.

In your existing dbt project's packages.yml:

packages:
  - local: ../empi_lite/empi_lite    # path relative to your project root

Then run dbt deps in your existing project. EMPI Lite models will build alongside your existing models in a single dbt run.

Your existing project's dbt_project.yml must include the EMPI Lite vars:

vars:
  empi_lite:
    empi_input_database: your_database
    empi_input_schema: your_schema
    match_threshold: 0.70
    review_threshold_low: 0.50
    review_threshold_high: 0.69
    empi_snapshot_enabled: false
    empi_custom_attributes_enabled: false

Routine operations

Once set up, the ongoing operational cadence is:

# On each source data refresh:
dbt snapshot && dbt run

# Process review queue (ad hoc or scheduled):
# 1. Query empi_review_queue_matches and empi_review_queue_splits
# 2. Insert decisions into empi_manual_review tables
# 3. Re-run: dbt run

Upgrading to a new version

When a new version of EMPI Lite is released:

Pull the latest code from the upstream repo
Do not overwrite:
- seeds/empi_blocking_rules.csv (if you've customized blocking groups)
- seeds/empi_attribute_scores.csv (if you've tuned weights)
- seeds/empi_invalid_values.csv (if you've added custom invalid values)
- dbt_project.yml (contains your vars)
- models/sources.yml (contains your source configuration)
Safe to overwrite: all models/ SQL files, macros/, snapshots/, tests/
Run dbt deps && dbt seed --exclude tag:large_seed && dbt run

Recommended: Keep your customizations on a dedicated git branch. Merge new versions into main and cherry-pick or rebase your branch.

Prerequisites​

Step 1 - Clone the repo​

Step 2 - Configure your dbt profile​

Snowflake​

BigQuery​

Amazon Redshift​

Microsoft Fabric / Synapse​

Databricks​

DuckDB​

Step 3 - Create the manual review tables​

Step 4 - Point EMPI Lite at your source tables​

Step 5 - Install dbt packages​

Step 6 - Load seeds​

Step 7 - Run the project​

Step 8 - Run tests​

Step 9 - Validate your results​

Step 10 - Enable snapshots (optional but recommended)​

Installation pattern A - Local package import​

Routine operations​

Upgrading to a new version​