Skip to main content

EMPI Lite - Product Overview

Enterprise Master Patient Index - delivered as a dbt package

The Problem

Patient data arrives from multiple systems - an EHR, a claims payer, a lab, a state registry - each with its own patient identifier and its own version of a person's name, address, and date of birth. There is no single authoritative answer to "who is this patient, and where do they appear across our data?"

The result is overcounting, fragmented care histories, duplicate outreach, inaccurate attribution, and analytics that cannot be trusted at the patient level.

EMPI Lite solves this. It connects patient records across source systems, assigns a single persistent identity to each real person, and delivers that identity back to every downstream model and report.

What You Get

EMPI Lite is a complete, production-ready master patient index built entirely in dbt SQL. It ships as full source code that runs in your warehouse - no black-box API, no vendor dependency, no data leaving your environment.

1. Probabilistic Identity Matching

EMPI Lite compares patient records across every source system and computes a weighted similarity score for each candidate pair using up to 16 demographic attributes.

How it works:

  • Blocking reduces the comparison space from O(n²) to manageable scale. Records are grouped by composite attribute hashes (e.g., last name + birth date, SSN) - only records sharing at least one blocking group become candidate pairs.
  • Attribute scoring applies per-attribute weights, exact match scores, fuzzy match (edit-distance similarity), and mismatch penalties. Every attribute is independently configurable.
  • Normalized scoring produces a 0-1 similarity score that accounts for which attributes were present on each record - a pair with only two attributes populated is not penalized against a pair with ten.
  • ZIP code proximity - ZIP codes within 50 miles receive partial match credit, not a binary match/miss.

Default matching attributes and their weights:

AttributeWeightNotesSource
Social Security Number20Exact match; strong mismatch penaltyeligibility / patient
Birth Date10Fuzzy (edit-distance)eligibility / patient
Last Name7Fuzzy (Levenshtein)eligibility / patient
Email6Exact matchcustom attributes table
Death Date5Fuzzyeligibility / patient
Phone4Exact (digits normalized)eligibility / patient
Address4Fuzzyeligibility / patient
First Name3Fuzzyeligibility / patient
ZIP Code3Proximity-scored via Census distance tableeligibility / patient
State2Exacteligibility / patient
Sex2Exacteligibility / patient
City2Fuzzyeligibility / patient
County1.5Exactpatient only
Race1Exacteligibility / patient

All weights, scores, penalties, and fuzzy thresholds are controlled by a seed CSV - no code changes required to re-tune. Attributes marked "custom attributes table" require enabling the custom attributes feature and supplying values in a separate key-value source table.

2. Persistent EMPI ID Assignment

Every source patient record receives an empi_id - a stable identifier that represents the real person behind the record, regardless of which system it came from.

Matched records are grouped into clusters using connected-components analysis (iterative min-label propagation). If patient A matches patient B, and patient B matches patient C, all three receive the same empi_id - even if A and C were never directly compared.

empi_crosswalk - the primary output table - maps every (data_source, source_person_id) pair to its empi_id, along with:

  • match_status - SINGLETON, EMPI_MATCHED, or MANUAL_MATCHED
  • Cluster size and data source membership
  • Enrollment date spans per source and across the full cluster

3. Golden Record

empi_golden_record resolves one authoritative demographic row per EMPI ID from all the records in each cluster.

Resolution logic:

  • Clinical patient records are preferred over eligibility/enrollment records
  • The most recent effective date wins within each record type
  • Critical identifiers (SSN, phone, death date, address) are backfilled from any record in the cluster when the primary record is missing them
  • Field coverage counts (ssn_source_count, birth_date_source_count, etc.) provide downstream confidence signals
  • Built-in data quality flags surface potential problems: has_conflicting_birth_dates, has_conflicting_last_names, is_matched_record

4. Full Audit Trail

empi_patient_events is a complete chronological record of everything that has ever happened to a patient identity. Every row is a discrete, human-readable event.

Event TypeWhen it fires
DEMOGRAPHIC_LOADEDEach demographic record ingested from a source system
DEMOGRAPHIC_CHANGEA field (name, address, phone, sex) changed between consecutive records for the same person
EMPI_MATCHTwo source IDs were linked - algorithmically or by a reviewer
RECORD_SPLITA record was flagged as containing data for more than one patient
ENROLLMENT_STARTA new enrollment span began
ENROLLMENT_ENDAn enrollment span ended
EMPI_ID_CHANGED(when snapshots enabled) A person's cluster membership changed between runs
MATCH_STATUS_CHANGED(when snapshots enabled) A record moved from SINGLETON to matched
SOURCE_DATA_UPDATED(when snapshots enabled) An upstream source record was corrected

Every EMPI_MATCH event includes a plain-English narrative explaining exactly why the match was made:

"Automatically linked: [EHR_A:PAT-001] and [CLAIMS_B:MBR-9921]. Last name matched exactly. Birth date matched exactly. First name was a close fuzzy match (Jon vs. John, 75% similarity). Address matched exactly. Social security number matched exactly. Similarity score: 91.4% - above the 70.0% match threshold."

5. Manual Review Workflows

Not every match decision can or should be made by an algorithm. EMPI Lite ships two ready-to-use worklists that feed human reviewers and automatically propagate their decisions into the matching engine.

Match Review Queue

empi_review_queue_matches surfaces candidate pairs whose score falls in the configurable review zone - not high enough to auto-match, not low enough to dismiss. Each row includes:

  • Side-by-side attribute comparison with exact, close-match, and no-match labels
  • Full person profiles for both records with all available attributes
  • Lists of matching and mismatching attributes
  • review_priority (HIGH / MEDIUM / LOW) based on how close the score is to the match threshold
  • SSN masked for safe display in BI tools

Once a reviewer inserts a decision into the match_review_decisions worklist table and re-runs dbt, that decision is immediately reflected in the EMPI crosswalk, golden record, and event trail.

Split Review Queue

empi_review_queue_splits identifies source person records whose own demographic history contains conflicting values - a signal that the source record may contain data for more than one patient (a common EHR data entry error).

Priority is driven by severity of the conflict:

PriorityTriggered by
CRITICALConflicting birth date or SSN
HIGHConflicting last name
MEDIUMConflicting first name, sex, or email
LOWConflicting address, phone, or other fields

A split record, once confirmed, is excluded from automatic matching entirely and can only be linked to other records through explicit manual decisions.

6. Data Quality Monitoring

empi_demographic_anomalies runs statistical surveillance over the entire patient population to detect two classes of data quality problem:

Frequency outliers - demographic values that appear across a disproportionate share of records relative to the distribution for that attribute. Common examples: test birth dates (1900-01-01), placeholder names (UNKNOWN, TEST), recycled phone numbers. Detected using a log z-score model so the heavily right-skewed frequency distribution is normalized before testing. Fully tunable thresholds.

Unique field violations - identifier fields (SSN, email) that are shared across two or more distinct EMPI IDs. These are the most clinically dangerous data quality issues - a shared SSN means two different patients may receive each other's care history, billing, or clinical decisions.

In addition, EMPI Lite ships an invalid values seed containing ~23,000 known-bad demographic values (test SSNs, placeholder names and addresses, fake birth dates) that are suppressed before matching begins.

7. Snapshot-Based Change Detection

EMPI Lite snapshots capture the state of the crosswalk and source tables at each run, enabling questions that a stateless pipeline cannot answer:

  • When was this person first assigned to their current EMPI ID?
  • Did this cluster reorganize after last week's data refresh?
  • Which source records were corrected upstream between runs?
  • How many records transitioned from SINGLETON to matched this month?

Five snapshot targets are included: crosswalk history, source eligibility, source patient, blocking rules, and attribute score configuration.

8. Precision / Recall Evaluation Framework

For organizations with ground-truth data, EMPI Lite includes a built-in evaluation framework that measures how well the matching engine is performing.

empi_threshold_analysis computes precision, recall, and F1 score at every 0.01 threshold increment from 0.50 to 0.95, enabling data-driven threshold selection rather than guesswork.

empi_threshold_examples surfaces specific false positives and false negatives at the current threshold - the fastest way to understand why the model is making the decisions it is.

9. Custom Attribute Matching

The matching engine is fully extensible. Any organization-specific identifier - NPI, employee ID, health plan subscriber ID, custom biometric - can be registered as a custom attribute and will automatically participate in blocking, scoring, the review queues, and the audit trail.

No core model changes are required - custom attributes are configured entirely through a source table and seed CSV rows.

10. Tuva Integration

EMPI Lite writes Tuva-compatible patient and eligibility tables where person_id = empi_id. Every downstream Tuva mart - claims, clinical, quality measures - automatically inherits the resolved patient identity with no additional transformation required.

Outputs at a Glance

TableWhat it answers
empi_crosswalkWhat is the EMPI ID for this source person?
empi_golden_recordWhat are the best demographics for this patient?
empi_patient_eventsWhat has happened to this patient's identity over time?
empi_demographics_tallWhat is every known demographic value for this patient across all sources?
empi_review_queue_matchesWhich pairs need a human match/no-match decision?
empi_review_queue_splitsWhich records may contain data for more than one patient?
empi_demographic_anomaliesWhere is demographic data statistically suspicious or dangerously shared?
empi_threshold_analysisWhat is the precision/recall of the matching engine at each threshold?

What Is and Is Not Included

Included

  • Full dbt source code for all models, macros, seeds, and snapshots
  • 4 default blocking groups (name+DOB, SSN, name only, last+DOB) - extensible with any custom attribute
  • 16 pre-configured matching attributes with tuned weights
  • ~23,000 known-bad invalid value suppressions
  • US Census ZIP code proximity table (365,000+ ZIP pair distances)
  • US state code normalization seed
  • Manual review worklist schema and workflow
  • Plain-English match narrative generation
  • Unit tests for staging, intermediate, and final models
  • Full documentation

Not Included

  • A hosted UI or application layer (the review queues are tables - you bring your own BI tool or app)
  • A REST API or real-time matching endpoint (EMPI Lite is a batch dbt pipeline)
  • A pre-built patient portal or clinician-facing workflow application

Delivery Model

EMPI Lite is delivered as a clone-and-customize dbt project. You receive the complete source code and own it outright. There is no runtime dependency on a vendor system, no data sent to a third party, and no license keys.

Two integration patterns are supported:

  • Import into an existing dbt project via a local package reference (local: ../empi_lite/empi_lite) - EMPI Lite runs alongside your existing models in a single dbt run
  • Run as a standalone dbt project with its own dedicated profile and run cadence

Both patterns support all six major data warehouses: Snowflake, BigQuery, Redshift, Microsoft Fabric, DuckDB, and Databricks.

Typical Implementation Timeline

PhaseActivitiesTypical Duration
SetupClone repo, configure profile, point at source tables, load seeds, first dbt run1-2 days
ValidationReview match quality, spot-check golden records, tune thresholds3-5 days
Review workflowCreate worklist tables, establish review cadence, process initial queue1-2 weeks
Steady stateScheduled dbt run + dbt snapshot on each source data refreshOngoing

Requirements

  • dbt Core ≥ 1.7
  • One of: Snowflake, BigQuery, Redshift, Microsoft Fabric, DuckDB, or Databricks
  • Source tables: an eligibility table and/or a patient table with standard demographic columns
  • Two empty worklist tables in a empi_manual_review schema (DDL provided in README.md)