Skip to main content

Configuration Guide

EMPI Lite is configured through three mechanisms: dbt vars (thresholds and feature flags), blocking rules (which attribute combinations create candidate pairs), and attribute scores (per-attribute weights and scoring behavior). All three are tunable without modifying any model code.

Matching thresholds

Set in dbt_project.yml under vars > empi_lite.

VariableDefaultDescription
match_threshold0.70Pairs scoring at or above this are automatically linked. Raise to be more conservative (fewer matches, fewer false positives). Lower to be more aggressive (more matches, fewer missed links).
review_threshold_low0.50Pairs scoring at or above this (and below match_threshold) are routed to the match review queue. Pairs below this are dismissed without review.
review_threshold_high0.69Controls the HIGH / MEDIUM / LOW priority label on review queue rows. Pairs between review_threshold_high and match_threshold receive HIGH priority. Does not affect which pairs enter the queue - that is controlled by review_threshold_low and match_threshold exclusively.
# dbt_project.yml
vars:
empi_lite:
match_threshold: 0.70
review_threshold_low: 0.50
review_threshold_high: 0.69

How the three thresholds interact

Scores are continuous. A pair with score 0.697 is below match_threshold (0.70) and above review_threshold_high (0.69), so it lands in the review queue with HIGH priority - it is not auto-matched. The boundaries work as follows:

score < 0.50              → dismissed, no action
0.50 ≤ score < 0.70 → routed to review queue
0.50 ≤ score ≤ 0.69 → MEDIUM or LOW priority
0.69 < score < 0.70 → HIGH priority
score ≥ 0.70 → auto-matched

review_threshold_high is purely a priority label - it does not create a gap in coverage. Every score between review_threshold_low and match_threshold lands in the queue.

Choosing your thresholds

Start with the defaults. Run the project, examine empi_patient_events match narratives for a sample of EMPI_MATCH events, and spot-check 20-30 matches manually.

Signs the match threshold is too low (too many false positives):

  • Matches where only one attribute agreed (e.g., same last name, nothing else)
  • Patients being merged who are clearly different people
  • Large clusters with 5+ records where some records don't belong

Signs the match threshold is too high (too many false negatives):

  • Obvious same-person records in different systems not being linked
  • High review queue volume with most pairs being obvious matches
  • Low EMPI_MATCHED percentage, high SINGLETON percentage

Blocking rules

Naively comparing every patient record against every other record is O(n²) - infeasible at any meaningful scale. Blocking rules solve this by first grouping records into buckets, so only records that land in the same bucket are compared.

Configured in: seeds/empi_blocking_rules.csv

After editing this file, re-run:

dbt seed --select empi_blocking_rules && dbt run

AND within a group, OR across groups

This is the most important concept to understand when configuring blocking.

Within a group - AND logic. All attributes in a group must match for two records to be placed in the same bucket. Group 1 requires FIRST_NAME and LAST_NAME and BIRTH_DATE to all agree (as a composite hash) before two records become candidates. A single attribute disagreement excludes the pair from that group's bucket.

Across groups - OR logic. A pair becomes a candidate if they share at least one group's bucket. Two records that don't share a Group 1 hash (name + DOB) but do share a Group 2 hash (SSN alone) will still be compared. A pair only has to satisfy one group to enter scoring.

Group 1: FIRST_NAME AND LAST_NAME AND BIRTH_DATE  ──┐
Group 2: SOCIAL_SECURITY_NUMBER ├── any one match → candidate pair → scored
Group 3: FIRST_NAME AND LAST_NAME │
Group 4: LAST_NAME AND BIRTH_DATE ──┘

This design lets you tune the blocking strategy precisely: tighter groups (more AND conditions) reduce the candidate pair count; additional groups (more OR options) improve recall at the cost of more comparisons.

Default blocking groups

group_idAttributes (all must match - AND)Purpose
1FIRST_NAME, LAST_NAME, BIRTH_DATECatches most same-person records with consistent demographics
2SOCIAL_SECURITY_NUMBERCatches records where names differ but SSN agrees
3FIRST_NAME, LAST_NAMECatches records where DOB is missing or entered differently
4LAST_NAME, BIRTH_DATECatches records where first name varies (nicknames, initials)

Seed structure

group_id,attribute,enabled
1,FIRST_NAME,true
1,LAST_NAME,true
1,BIRTH_DATE,true
2,SOCIAL_SECURITY_NUMBER,true
3,FIRST_NAME,true
3,LAST_NAME,true
4,LAST_NAME,true
4,BIRTH_DATE,true

Each row is one attribute within a blocking group. The composite hash for a group is built from all attributes in that group where enabled = true.

Adding or removing blocking groups

To add a new group (e.g., phone number as a standalone blocking key):

5,PHONE,true

To disable a group without deleting it, set enabled to false:

3,FIRST_NAME,false
3,LAST_NAME,false

Note: Blocking rules are snapshotted - changes are tracked in empi_blocking_rules_snapshot.

Tuning guidance

  • Too many missed matches (false negatives)? Add more blocking groups so pairs with fewer shared attributes can still become candidates. More groups = more candidate pairs = more compute.
  • Pipeline too slow? Remove groups or tighten existing ones (add more AND conditions). Fewer pairs to score = faster pipeline, but some matches may be missed.
  • Custom attributes can participate in any blocking group - see Custom Attributes below.

Attribute scores

Attribute scores control how each demographic attribute contributes to the similarity score.

Configured in: seeds/empi_attribute_scores.csv

After editing this file, re-run:

dbt seed --select empi_attribute_scores && dbt run

Attributes and their scoring behavior

The scoring engine ships with pre-tuned weights and scoring behavior for 14 demographic attributes. The full configuration lives in seeds/empi_attribute_scores.csv, which is included in your repo.

The attributes, in rough order of discriminating power, are:

AttributeMatching method
Social Security NumberExact match; strong mismatch penalty
Birth DateFuzzy (edit-distance)
Last NameFuzzy (Levenshtein similarity)
EmailExact match
Death DateFuzzy
PhoneExact (digits normalized)
AddressFuzzy
First NameFuzzy
ZIP CodeGeographic proximity (within 50 miles = partial credit)
StateExact
SexExact
CityFuzzy
CountyExact
RaceExact

Seed structure

attribute,weight,use_fuzzy_match,fuzzy_threshold,exact_match_score,mismatch_penalty
SOCIAL_SECURITY_NUMBER,...
BIRTH_DATE,...
LAST_NAME,...
...

Understanding each column

weight - The maximum positive contribution this attribute can make to the total score denominator. Higher weight = this attribute matters more.

use_fuzzy_match - If true, a fuzzy (edit-distance / Levenshtein) similarity score is computed. Pairs with similarity above fuzzy_threshold receive partial credit proportional to similarity. If false, only exact matches score.

fuzzy_threshold - Minimum similarity ratio (0-1) to receive any fuzzy match credit. Pairs below this threshold are treated as mismatches.

exact_match_score - Points awarded for an exact match. Typically equal to weight.

mismatch_penalty - Points subtracted when two non-missing, non-matching values exist. A penalty of 0 means a mismatch is simply neutral (no points gained, no points lost). A negative penalty means mismatches actively reduce the score.

ZIP code special behavior: ZIP codes are scored using geographic proximity rather than exact or fuzzy match. Two ZIP codes within 50 miles receive partial credit; beyond 50 miles they are treated as a mismatch. The mismatch_penalty column still applies for ZIPs that are far apart.

How normalized scoring works

The final similarity score is normalized to account for which attributes were present on each record:

score = sum(attribute_scores) / sum(weights_for_present_attributes)

A pair where both records are missing 8 of 14 attributes is not penalized against a pair where all 14 are present. This is critical for real-world data where fill rates vary significantly across source systems.

Tuning guidance

To increase the importance of an attribute: Raise its weight and exact_match_score.

To make mismatches hurt more: Lower (more negative) the mismatch_penalty.

To allow fuzzier name matches: Lower the fuzzy_threshold for FIRST_NAME or LAST_NAME (e.g., 0.60 instead of 0.70).

To make an attribute mandatory for matching: Set a very high mismatch penalty (e.g., -100) on an attribute like SSN. A SSN mismatch will then drive the score below any reasonable threshold, preventing the match.

Attribute scores are snaphotted - changes are tracked in empi_attribute_scores_snapshot.

Feature flags

VariableDefaultDescription
empi_snapshot_enabledfalseSet to true after the first dbt snapshot run. Enables change-detection events in empi_patient_events.
empi_custom_attributes_enabledfalseSet to true if you have a custom_attributes source table.

Data quality / anomaly detection tuning

VariableDefaultDescription
anomaly_z_threshold3.0Log z-score threshold for frequency outlier detection. Higher = fewer, more extreme flags. Recommended range: 2.5-4.0.
anomaly_min_count10Minimum number of patients sharing a value before it can be flagged as an outlier. Prevents small-N noise.
anomaly_min_pct_name0.02Minimum population share (first/last name) to flag - 2%.
anomaly_min_pct_dob0.01Minimum population share (birth date) to flag - 1%.
anomaly_min_pct_contact0.005Minimum population share (phone, address) to flag - 0.5%.
anomaly_min_pct_zip0.10Minimum population share (ZIP, city, payer) to flag - 10%. Higher because these are legitimately concentrated.

Custom attributes

To add organization-specific identifiers to the matching engine:

1. Enable the feature:

vars:
empi_lite:
empi_custom_attributes_enabled: true

2. Populate the custom_attributes source table with your data (see Data Requirements).

3. Register the attribute in empi_attribute_scores.csv:

Add a row for each custom attribute, specifying a weight appropriate to its discriminating power, whether it should use fuzzy matching, and the desired mismatch penalty. Refer to the existing rows in the seed for examples.

attribute,weight,use_fuzzy_match,fuzzy_threshold,exact_match_score,mismatch_penalty
NPI,...
EMPLOYEE_ID,...

4. Optionally add the attribute to a blocking group in empi_blocking_rules.csv:

5,EMAIL,true
6,NPI,true

5. Re-run:

dbt seed && dbt run

Custom attributes automatically participate in blocking, scoring, the review queues, and the audit trail. No model code changes required.

Source configuration reference

All source table pointers live in models/sources.yml and are controlled by these vars:

VariableDescription
empi_input_databaseDatabase containing your source tables
empi_input_schemaSchema containing your source tables
empi_eligibility_tableTable name for the eligibility source (default: eligibility)
empi_patient_tableTable name for the patient source (default: patient)
empi_custom_attributes_tableTable name for custom attributes (default: empi_custom_attributes)
empi_manual_review_databaseDatabase for manual review tables
empi_manual_review_schemaSchema for manual review tables (default: empi_manual_review)