Configuration Guide
EMPI Lite is configured through three mechanisms: dbt vars (thresholds and feature flags), blocking rules (which attribute combinations create candidate pairs), and attribute scores (per-attribute weights and scoring behavior). All three are tunable without modifying any model code.
Matching thresholds
Set in dbt_project.yml under vars > empi_lite.
| Variable | Default | Description |
|---|---|---|
| match_threshold | 0.70 | Pairs scoring at or above this are automatically linked. Raise to be more conservative (fewer matches, fewer false positives). Lower to be more aggressive (more matches, fewer missed links). |
| review_threshold_low | 0.50 | Pairs scoring at or above this (and below match_threshold) are routed to the match review queue. Pairs below this are dismissed without review. |
| review_threshold_high | 0.69 | Controls the HIGH / MEDIUM / LOW priority label on review queue rows. Pairs between review_threshold_high and match_threshold receive HIGH priority. Does not affect which pairs enter the queue - that is controlled by review_threshold_low and match_threshold exclusively. |
# dbt_project.yml
vars:
  empi_lite:
    match_threshold: 0.70
    review_threshold_low: 0.50
    review_threshold_high: 0.69
How the three thresholds interact
Scores are continuous. A pair with score 0.697 is below match_threshold (0.70) and above review_threshold_high (0.69), so it lands in the review queue with HIGH priority - it is not auto-matched. The boundaries work as follows:
score < 0.50            → dismissed, no action
0.50 ≤ score < 0.70     → routed to review queue
    0.50 ≤ score ≤ 0.69 → MEDIUM or LOW priority
    0.69 < score < 0.70 → HIGH priority
score ≥ 0.70            → auto-matched
review_threshold_high is purely a priority label - it does not create a gap in coverage. Every score between review_threshold_low and match_threshold lands in the queue.
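The boundaries above can be sketched as a small routing function. This is an illustrative Python sketch, not the project's model code: the `Thresholds` class, the `route_pair` name, and the returned labels are hypothetical, and because this guide does not say where MEDIUM ends and LOW begins, the two are left as a single label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Defaults copied from the table in this guide.
    match_threshold: float = 0.70
    review_threshold_low: float = 0.50
    review_threshold_high: float = 0.69


def route_pair(score: float, t: Thresholds = Thresholds()) -> str:
    """Return a hypothetical outcome label for one scored pair."""
    if score >= t.match_threshold:
        return "auto_match"
    if score < t.review_threshold_low:
        return "dismissed"
    # Everything left enters the review queue; review_threshold_high
    # only changes the priority label, never queue membership.
    if score > t.review_threshold_high:
        return "review_high"
    return "review_medium_or_low"


if __name__ == "__main__":
    for s in (0.49, 0.50, 0.69, 0.697, 0.70):
        print(s, route_pair(s))
```

Note that every score from review_threshold_low up to (but not including) match_threshold returns one of the two review labels, which is the "no gap in coverage" property described above.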
Choosing your thresholds
Start with the defaults. Run the project, examine empi_patient_events match narratives for a sample of EMPI_MATCH events, and spot-check 20-30 matches manually.
Signs the match threshold is too low (too many false positives):
- Matches where only one attribute agreed (e.g., same last name, nothing else)
- Patients being merged who are clearly different people
- Large clusters with 5+ records where some records don't belong
Signs the match threshold is too high (too many false negatives):
- Obvious same-person records in different systems not being linked
- High review queue volume with most pairs being obvious matches
- Low EMPI_MATCHED percentage, high SINGLETON percentage
Blocking rules
Naively comparing every patient record against every other record is O(n²) - infeasible at any meaningful scale. Blocking rules solve this by first grouping records into buckets, so only records that land in the same bucket are compared.
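To make the scale argument concrete, here is a back-of-the-envelope sketch in Python. The record count and bucket sizes are made-up illustrative numbers, not measurements from EMPI Lite, and the function names are hypothetical.

```python
from math import comb


def naive_pair_count(n_records: int) -> int:
    """Pairs compared when every record is compared with every other record."""
    return comb(n_records, 2)  # n * (n - 1) / 2


def blocked_pair_count(bucket_sizes) -> int:
    """Pairs compared when only records inside the same bucket are compared.

    Counts a single blocking group; with several groups, a pair that shares
    more than one bucket would be counted once per bucket here.
    """
    return sum(comb(size, 2) for size in bucket_sizes)


if __name__ == "__main__":
    # 1,000,000 records compared naively.
    print(naive_pair_count(1_000_000))         # 499999500000
    # The same records split into 100,000 buckets of 10 records each.
    print(blocked_pair_count([10] * 100_000))  # 4500000
```

Under these assumed bucket sizes, blocking cuts the comparison count by roughly five orders of magnitude.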
Configured in: seeds/empi_blocking_rules.csv
After editing this file, re-run:
dbt seed --select empi_blocking_rules && dbt run
AND within a group, OR across groups
This is the most important concept to understand when configuring blocking.
Within a group - AND logic. All attributes in a group must match for two records to be placed in the same bucket. Group 1 requires FIRST_NAME and LAST_NAME and BIRTH_DATE to all agree (as a composite hash) before two records become candidates. A single attribute disagreement excludes the pair from that group's bucket.
Across groups - OR logic. A pair becomes a candidate if the two records share at least one group's bucket. Two records that don't share a Group 1 hash (name + DOB) but do share a Group 2 hash (SSN alone) will still be compared. A pair only has to satisfy one group to enter scoring.
Group 1: FIRST_NAME AND LAST_NAME AND BIRTH_DATE  ──┐
Group 2: SOCIAL_SECURITY_NUMBER                     ├── any one match → candidate pair → scored
Group 3: FIRST_NAME AND LAST_NAME                   │
Group 4: LAST_NAME AND BIRTH_DATE                 ──┘
This design lets you tune the blocking strategy precisely: tighter groups (more AND conditions) reduce the candidate pair count; additional groups (more OR options) improve recall at the cost of more comparisons.
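The AND-within, OR-across behavior can be sketched in a few lines of Python. This is a simplified illustration, not the package's SQL: it uses a tuple of raw values as the bucket key instead of a composite hash, and it assumes a record with a missing value in any of a group's attributes gets no key for that group. The `candidate_pairs` function and the sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Default groups from this guide: AND within a group, OR across groups.
BLOCKING_GROUPS = {
    1: ["FIRST_NAME", "LAST_NAME", "BIRTH_DATE"],
    2: ["SOCIAL_SECURITY_NUMBER"],
    3: ["FIRST_NAME", "LAST_NAME"],
    4: ["LAST_NAME", "BIRTH_DATE"],
}


def candidate_pairs(records, groups=BLOCKING_GROUPS):
    """records: dict mapping record_id -> dict of attribute -> value."""
    pairs = set()
    for group_id, attrs in groups.items():
        buckets = defaultdict(list)
        for rec_id, rec in records.items():
            values = tuple(rec.get(a) for a in attrs)
            # Assumption: a missing attribute means no bucket key for this group.
            if any(v is None or v == "" for v in values):
                continue
            # AND logic: the key contains every attribute in the group.
            buckets[(group_id, values)].append(rec_id)
        # OR logic: pairs from any group's buckets are unioned together.
        for ids in buckets.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs


SAMPLE = {
    "a": {"FIRST_NAME": "JOHN", "LAST_NAME": "SMITH",
          "BIRTH_DATE": "1980-01-01", "SOCIAL_SECURITY_NUMBER": "111"},
    "b": {"FIRST_NAME": "JON", "LAST_NAME": "SMITH",
          "BIRTH_DATE": "1980-01-01", "SOCIAL_SECURITY_NUMBER": None},
    "c": {"FIRST_NAME": "JANE", "LAST_NAME": "DOE",
          "BIRTH_DATE": "1975-05-05", "SOCIAL_SECURITY_NUMBER": "111"},
    "d": {"FIRST_NAME": "MARY", "LAST_NAME": "JONES",
          "BIRTH_DATE": "1990-09-09", "SOCIAL_SECURITY_NUMBER": "222"},
}

if __name__ == "__main__":
    print(sorted(candidate_pairs(SAMPLE)))
```

In the sample, records a and b meet only through Group 4 (last name + birth date), a and c meet only through Group 2 (SSN), and d shares no bucket with anyone, so it is never scored.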
Default blocking groups
| group_id | Attributes (all must match - AND) | Purpose |
|---|---|---|
| 1 | FIRST_NAME, LAST_NAME, BIRTH_DATE | Catches most same-person records with consistent demographics |
| 2 | SOCIAL_SECURITY_NUMBER | Catches records where names differ but SSN agrees |
| 3 | FIRST_NAME, LAST_NAME | Catches records where DOB is missing or entered differently |
| 4 | LAST_NAME, BIRTH_DATE | Catches records where first name varies (nicknames, initials) |
Seed structure
group_id,attribute,enabled
1,FIRST_NAME,true
1,LAST_NAME,true
1,BIRTH_DATE,true
2,SOCIAL_SECURITY_NUMBER,true
3,FIRST_NAME,true
3,LAST_NAME,true
4,LAST_NAME,true
4,BIRTH_DATE,true
Each row is one attribute within a blocking group. The composite hash for a group is built from all attributes in that group where enabled = true.
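As a rough illustration of how the rows roll up into groups, the following Python sketch reads seed text in the format above and keeps only rows where enabled is true. It is not how dbt loads the seed; `load_blocking_groups` and `SEED_TEXT` are hypothetical names, and the sample has Group 3 switched off to show the filter.

```python
import csv
import io
from collections import defaultdict

# Sample in the seed's format; Group 3 is disabled for illustration.
SEED_TEXT = """group_id,attribute,enabled
1,FIRST_NAME,true
1,LAST_NAME,true
1,BIRTH_DATE,true
2,SOCIAL_SECURITY_NUMBER,true
3,FIRST_NAME,false
3,LAST_NAME,false
4,LAST_NAME,true
4,BIRTH_DATE,true
"""


def load_blocking_groups(csv_text: str) -> dict:
    """Map group_id -> list of enabled attributes, in file order."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["enabled"].strip().lower() == "true":
            groups[int(row["group_id"])].append(row["attribute"].strip())
    return dict(groups)


if __name__ == "__main__":
    for group_id, attrs in load_blocking_groups(SEED_TEXT).items():
        print(group_id, " AND ".join(attrs))
```

A group whose rows are all disabled simply does not appear in the result, so it contributes no buckets.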
Adding or removing blocking groups
To add a new group (e.g., phone number as a standalone blocking key):
5,PHONE,true
To disable a group without deleting it, set enabled to false:
3,FIRST_NAME,false
3,LAST_NAME,false
Note: Blocking rules are snapshotted - changes are tracked in empi_blocking_rules_snapshot.
Tuning guidance
- Too many missed matches (false negatives)? Add more blocking groups so pairs with fewer shared attributes can still become candidates. More groups = more candidate pairs = more compute.
- Pipeline too slow? Remove groups or tighten existing ones (add more AND conditions). Fewer pairs to score = faster pipeline, but some matches may be missed.
- Custom attributes can participate in any blocking group - see Custom Attributes below.
Attribute scores
Attribute scores control how each demographic attribute contributes to the similarity score.
Configured in: seeds/empi_attribute_scores.csv
After editing this file, re-run:
dbt seed --select empi_attribute_scores && dbt run
Attributes and their scoring behavior
The scoring engine ships with pre-tuned weights and scoring behavior for 14 demographic attributes. The full configuration lives in seeds/empi_attribute_scores.csv, which is included in your repo.
The attributes, in rough order of discriminating power, are:
| Attribute | Matching method |
|---|---|
| Social Security Number | Exact match; strong mismatch penalty |
| Birth Date | Fuzzy (edit-distance) |
| Last Name | Fuzzy (Levenshtein similarity) |
| | Exact match |
| Death Date | Fuzzy |
| Phone | Exact (digits normalized) |
| Address | Fuzzy |
| First Name | Fuzzy |
| ZIP Code | Geographic proximity (within 50 miles = partial credit) |
| State | Exact |
| Sex | Exact |
| City | Fuzzy |
| County | Exact |
| Race | Exact |
Seed structure
attribute,weight,use_fuzzy_match,fuzzy_threshold,exact_match_score,mismatch_penalty
SOCIAL_SECURITY_NUMBER,...
BIRTH_DATE,...
LAST_NAME,...
...
Understanding each column
weight - The maximum positive contribution this attribute can make to the total score denominator. Higher weight = this attribute matters more.
use_fuzzy_match - If true, a fuzzy (edit-distance / Levenshtein) similarity score is computed. Pairs with similarity above fuzzy_threshold receive partial credit proportional to similarity. If false, only exact matches score.
fuzzy_threshold - Minimum similarity ratio (0-1) to receive any fuzzy match credit. Pairs below this threshold are treated as mismatches.
exact_match_score - Points awarded for an exact match. Typically equal to weight.
mismatch_penalty - Points applied when both records have a value for the attribute and the values do not match. A penalty of 0 makes a mismatch neutral (no points gained, no points lost). A negative penalty means mismatches actively reduce the score.
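One way to read the five columns together is the per-attribute sketch below. It is an interpretation of the descriptions above, not the engine's actual SQL. Several things are assumptions: `difflib.SequenceMatcher` stands in for the project's Levenshtein similarity, missing values are treated as neutral, partial credit is computed as exact_match_score times similarity, and the configuration values in the example are invented (the shipped seed values are not reproduced here).

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass(frozen=True)
class AttributeConfig:
    weight: float
    use_fuzzy_match: bool
    fuzzy_threshold: float
    exact_match_score: float
    mismatch_penalty: float  # 0 = neutral, negative = reduces the score


def similarity(a: str, b: str) -> float:
    # Stand-in similarity ratio in [0, 1]; not the project's Levenshtein.
    return SequenceMatcher(None, a, b).ratio()


def score_attribute(a: Optional[str], b: Optional[str],
                    cfg: AttributeConfig) -> float:
    if not a or not b:
        return 0.0  # assumption: a missing value neither adds nor subtracts
    if a == b:
        return cfg.exact_match_score
    if cfg.use_fuzzy_match:
        sim = similarity(a, b)
        if sim >= cfg.fuzzy_threshold:
            # Assumed form of "partial credit proportional to similarity".
            return cfg.exact_match_score * sim
    return cfg.mismatch_penalty


# Invented values for illustration only.
EXAMPLE_FUZZY = AttributeConfig(weight=10, use_fuzzy_match=True,
                                fuzzy_threshold=0.75, exact_match_score=10,
                                mismatch_penalty=-2)

if __name__ == "__main__":
    print(score_attribute("SMITH", "SMITH", EXAMPLE_FUZZY))
    print(score_attribute("SMITH", "SMYTH", EXAMPLE_FUZZY))
    print(score_attribute("SMITH", "JONES", EXAMPLE_FUZZY))
```

With use_fuzzy_match set to false, the near-miss pair falls through to the mismatch penalty instead of earning partial credit.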
ZIP code special behavior: ZIP codes are scored using geographic proximity rather than exact or fuzzy match. Two ZIP codes within 50 miles receive partial credit; beyond 50 miles they are treated as a mismatch. The mismatch_penalty column still applies for ZIPs that are far apart.
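The ZIP rule can be sketched as follows. Much of this is assumed rather than taken from the package: the centroid lookup, the placeholder ZIP keys and coordinates, the haversine distance, the linear decay used for partial credit, and the neutral treatment of unknown ZIPs are all hypothetical. Only the 50-mile cutoff and the use of mismatch_penalty beyond it come from the description above.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8

# Hypothetical centroid lookup (latitude, longitude); placeholder keys, not real ZIPs.
ZIP_CENTROIDS = {
    "ZIP_A": (40.0, -75.0),
    "ZIP_B": (40.3, -75.0),  # roughly 21 miles north of ZIP_A
    "ZIP_C": (42.0, -75.0),  # roughly 138 miles north of ZIP_A
}


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in miles."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    h = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(h))


def score_zip(zip_a, zip_b, exact_match_score, mismatch_penalty,
              max_miles=50.0, centroids=ZIP_CENTROIDS):
    if not zip_a or not zip_b:
        return 0.0  # assumption: missing values are neutral
    if zip_a == zip_b:
        return exact_match_score
    if zip_a not in centroids or zip_b not in centroids:
        return 0.0  # assumption: unknown ZIPs are neutral
    miles = haversine_miles(*centroids[zip_a], *centroids[zip_b])
    if miles <= max_miles:
        # Assumed linear decay; the guide only says "partial credit".
        return exact_match_score * (1 - miles / max_miles)
    return mismatch_penalty


if __name__ == "__main__":
    print(score_zip("ZIP_A", "ZIP_B", 10, -2))  # nearby: partial credit
    print(score_zip("ZIP_A", "ZIP_C", 10, -2))  # far apart: mismatch penalty
```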