Skip to main content

Output Table Reference

EMPI Lite produces eight final output tables, all written to the empi schema in your warehouse. This document describes each table, its columns, and example rows.

empi_crosswalk

The primary identity mapping table. One row per (data_source, source_person_id) pair. Every source record in every source system maps to exactly one empi_id.

How to use it: Join this table to any other table using data_source and source_person_id to add empi_id. Then join on empi_id to any other EMPI output.

Key columns

ColumnTypeDescription
empi_idVARCHARGlobally unique stable identifier for the real person (deterministic hash, format EMP- + hex or numeric). Shared by all source records that have been linked to the same individual.
data_sourceVARCHARSource system name (e.g., EHR_SYSTEM, CLAIMS_PAYER).
source_person_idVARCHARPatient identifier as it appears in the source system.
match_statusVARCHARSINGLETON - not linked to any other record. EMPI_MATCHED - linked algorithmically. MANUAL_MATCHED - linked by a human reviewer.
cluster_sizeINTEGERHow many source records share this empi_id.
cluster_data_sourcesVARCHARPipe-delimited list of all data sources represented in the cluster.
cluster_enrollment_startDATEEarliest enrollment start date across all records in the cluster.
cluster_enrollment_endDATELatest enrollment end date across all records in the cluster.
source_enrollment_startDATEEnrollment start for this specific source record.
source_enrollment_endDATEEnrollment end for this specific source record.

Example rows

empi_id      data_source    source_person_id   match_status    cluster_size   cluster_data_sources
----------- ------------- ----------------- -------------- ------------- ------------------------------------
EMP-0000042 EHR_SYSTEM PAT-10042 EMPI_MATCHED 2 EHR_SYSTEM|CLAIMS_PAYER
EMP-0000042 CLAIMS_PAYER MBR-88199A EMPI_MATCHED 2 EHR_SYSTEM|CLAIMS_PAYER
EMP-0000043 EHR_SYSTEM PAT-10043 SINGLETON 1 EHR_SYSTEM
EMP-0000044 EHR_SYSTEM PAT-10044 EMPI_MATCHED 3 EHR_SYSTEM|CLAIMS_PAYER|LAB_SYSTEM
EMP-0000044 CLAIMS_PAYER MBR-10031 EMPI_MATCHED 3 EHR_SYSTEM|CLAIMS_PAYER|LAB_SYSTEM
EMP-0000044 LAB_SYSTEM LAB-000291 EMPI_MATCHED 3 EHR_SYSTEM|CLAIMS_PAYER|LAB_SYSTEM

empi_golden_record

One authoritative demographic row per EMPI ID. Resolves the best available values from all source records in each cluster.

Resolution logic:

  • Patient records are preferred over eligibility/enrollment records
  • The most recent effective date wins within each record type
  • Critical identifiers (SSN, phone, death date, address) are backfilled from any record in the cluster when the primary record is missing them

How to use it: Join on empi_id to get resolved demographics for any patient. Use the has_conflicting_* flags to qualify downstream analysis.

Key columns

ColumnTypeDescription
empi_idVARCHARPatient identifier.
first_name, last_nameVARCHARResolved name.
birth_dateDATEResolved date of birth.
death_dateDATEResolved date of death (backfilled from any cluster record).
social_security_numberVARCHARResolved SSN.
sexVARCHARResolved sex/gender.
address, city, state, zip_codeVARCHARResolved address.
phone, emailVARCHARResolved contact info.
ssn_source_countINTEGERNumber of distinct SSN values seen across the cluster.
birth_date_source_countINTEGERNumber of distinct birth dates seen across the cluster.
is_matched_recordBOOLEANTrue if this patient has been linked to at least one other source record.
has_conflicting_birth_datesBOOLEANTrue if two or more distinct birth dates exist across the cluster.
has_conflicting_last_namesBOOLEANTrue if two or more distinct last names exist across the cluster.
primary_record_typeVARCHARThe source record type used as the primary source (PATIENT or ELIGIBILITY).

Example rows

empi_id      first_name  last_name   birth_date   sex     state   is_matched_record   has_conflicting_birth_dates   has_conflicting_last_names
----------- ---------- ---------- ----------- ------ ------ ------------------- --------------------------- --------------------------
EMP-0000042 Jonathan Smith 1978-03-14 male OR true false false
EMP-0000043 Maria Gonzalez 1991-07-02 female CA false false false
EMP-0000044 Robert Johnson 1965-11-28 male CO true false false
EMP-0000091 Jane Doe 1983-05-19 female TX true true false

EMP-0000091 has conflicting birth dates - the same person appears with different DOBs in different source systems. A downstream analyst should investigate before relying on the birth date.

empi_patient_events

A complete chronological audit trail. One row per event per patient. Every match, every demographic change, every enrollment span, and every snapshot-detected change is recorded here.

How to use it: Filter on empi_id to reconstruct the complete history for any patient. Filter on event_type to find all matches, all demographic changes, etc.

Event types

Event typeDescription
DEMOGRAPHIC_LOADEDA demographic record was ingested from a source system.
DEMOGRAPHIC_CHANGEA demographic field changed between consecutive records for the same person in the same source.
EMPI_MATCHTwo source IDs were linked - algorithmically or by a reviewer. Includes a plain-English narrative.
RECORD_SPLITA source record was flagged as containing data for more than one real patient.
ENROLLMENT_STARTA new enrollment span began in a source system.
ENROLLMENT_ENDAn enrollment span ended.
EMPI_ID_CHANGED(snapshots enabled) A record's cluster membership changed between pipeline runs.
MATCH_STATUS_CHANGED(snapshots enabled) A record transitioned from SINGLETON to EMPI_MATCHED.
SOURCE_DATA_UPDATED(snapshots enabled) An upstream source record was corrected between runs.

Key columns

ColumnTypeDescription
empi_idVARCHARPatient identifier.
event_dateDATEWhen the event occurred (or was detected).
event_typeVARCHARSee table above.
event_descriptionVARCHARHuman-readable description of the event.
data_sourceVARCHARSource system involved.
source_person_idVARCHARSource record involved.

Example rows

empi_id      event_date   event_type            event_description
----------- ----------- -------------------- -----------------------------------------------------------------------
EMP-0000042 2019-01-14 DEMOGRAPHIC_LOADED Record loaded from EHR_SYSTEM (PAT-10042): Jonathan Smith, DOB 1978-03-14
EMP-0000042 2019-01-14 ENROLLMENT_START Enrollment began at CLAIMS_PAYER (MBR-88199A) on 2018-11-01
EMP-0000042 2019-01-15 EMPI_MATCH Automatically linked: [EHR_SYSTEM:PAT-10042] and
[CLAIMS_PAYER:MBR-88199A]. Last name matched exactly. Birth date
matched exactly. First name was a close fuzzy match (Jon vs. Jonathan,
78% similarity). Social security number matched exactly.
Similarity score: 93.2% - above the 70.0% match threshold.
EMP-0000042 2023-06-01 DEMOGRAPHIC_CHANGE Address changed on EHR_SYSTEM (PAT-10042):
123 Oak St → 847 Pine Ave

empi_demographics_tall

Every known demographic value for every patient, across all sources, in a long/tall format. Useful for understanding what values exist across a cluster and for building source-level comparison views.

Key columns

ColumnTypeDescription
empi_idVARCHARPatient identifier.
data_sourceVARCHARWhich source system provided this value.
source_person_idVARCHARSource record that provided this value.
attributeVARCHARDemographic attribute name (e.g., first_name, birth_date, ssn).
valueVARCHARThe attribute value from this source.
attribute_tierVARCHARSTANDARD (eligibility/patient columns) or CUSTOM (from custom attributes table).

empi_review_queue_matches

Candidate pairs in the review zone, ready for human adjudication. Score is above the dismissal threshold but below the auto-match threshold. A reviewer sets is_match = TRUE (same person) or is_match = FALSE (not a match), then sets reviewed = TRUE to finalize the decision.

See Manual Review Workflow for the full decision workflow.

Key columns

ColumnTypeDescription
data_source_a / data_source_bVARCHARData source for each record in the pair.
source_person_id_a / source_person_id_bVARCHARSource person ID for each record.
person_profile_a / person_profile_bVARCHARDemographic summary for each record (SSN masked).
side_by_side_comparisonVARCHARAttribute-by-attribute comparison with match indicators.
normalized_scoreFLOATNormalized score between 0 and 1.
review_priorityVARCHARHIGH (close to match threshold), MEDIUM, or LOW.
matching_attributesVARCHARComma-separated list of attributes that agreed.
mismatching_attributesVARCHARComma-separated list of attributes that conflicted.

Example rows

data_source_a   source_person_id_a   data_source_b   source_person_id_b   normalized_score   review_priority
-------------- ------------------- -------------- ------------------- ----------------- ---------------
EHR_SYSTEM PAT-10091 CLAIMS_PAYER MBR-20047 0.61 HIGH
EHR_SYSTEM PAT-10145 LAB_SYSTEM LAB-000512 0.57 MEDIUM
EHR_SYSTEM PAT-10201 CLAIMS_PAYER MBR-30110 0.52 LOW

empi_review_queue_splits

Source records whose own demographic history contains conflicting values. A signal that a source record may contain data for more than one real patient (a common EHR data entry error).

See Manual Review Workflow for the split decision workflow.

Key columns

ColumnTypeDescription
data_sourceVARCHARSource system containing the suspicious record.
source_person_idVARCHARThe record with conflicting demographics.
record_countINTEGERNumber of distinct demographic records found for this source person.
review_priorityVARCHARCRITICAL (conflicting DOB or SSN), HIGH (conflicting last name), MEDIUM, or LOW.
inconsistent_attribute_countINTEGERNumber of attributes with conflicting values.
inconsistent_attributesVARCHARComma-separated list of attributes that conflict.
inconsistency_summaryVARCHARNewline-delimited summary of each conflicting attribute and its observed values, ordered by severity.

Example rows

data_source   source_person_id   record_count   review_priority   inconsistent_attribute_count   inconsistent_attributes   inconsistency_summary
------------ ----------------- ------------- --------------- ---------------------------- ----------------------- -------------------------------------------
EHR_SYSTEM PAT-10188 2 CRITICAL 1 birth_date birth_date: 1972-04-01, 1989-11-15
EHR_SYSTEM PAT-10244 2 HIGH 1 last_name last_name: Martinez, Hoffman

empi_demographic_anomalies

Statistical surveillance over the patient population. Surfaces values that appear across a disproportionate share of records and identifier fields shared across multiple distinct patients.

Anomaly types

Anomaly typeDescription
FREQUENCY_OUTLIERA demographic value appears across an unusually high share of records relative to the distribution for that attribute - likely a test value, placeholder, or recycled value.
UNIQUE_FIELD_VIOLATIONAn identifier field (SSN, email) is shared across two or more distinct EMPI IDs. This is a high-risk data quality issue.

Example rows

anomaly_type             attribute   value          affected_records   pct_of_population   description
----------------------- ---------- ------------- ----------------- ------------------ ------------------------------------------
FREQUENCY_OUTLIER birth_date 1900-01-01 847 2.3% Placeholder/test birth date shared by 847 records
FREQUENCY_OUTLIER first_name TEST 312 0.8% Likely test/placeholder first name
FREQUENCY_OUTLIER phone 555-000-0000 214 0.6% Placeholder phone number
UNIQUE_FIELD_VIOLATION ssn 123-45-6789 14 - SSN shared across 14 distinct EMPI IDs
UNIQUE_FIELD_VIOLATION email test@test.com 28 - Email shared across 28 distinct EMPI IDs

empi_score_distribution

Score distribution across all candidate pairs. Shows how many pairs fall in each 0.01 score bucket. Used for threshold selection and understanding the shape of your data.

empi_attribute_coverage

Attribute fill rates across the patient population. Shows what percentage of records have each demographic attribute populated. Low fill rates identify attributes that will contribute little to matching quality for your specific data.

Evaluation models (optional)

If you supply a true_patient_mappings seed (ground-truth known-correct links), two additional evaluation models are available:

TableDescription
empi_threshold_analysisPrecision, recall, and F1 score at every 0.01 threshold increment from 0.50 to 0.95.
empi_threshold_examplesSpecific false positives and false negatives at the current threshold - for understanding why the model makes specific decisions.

Tuva integration tables

EMPI Lite also writes two tables to the public schema for downstream Tuva compatibility:

Table (alias)Description
tuva_patient (alias: patient)Tuva-compatible patient table where person_id = empi_id.
tuva_eligibility (alias: eligibility)Tuva-compatible eligibility table where person_id = empi_id.

Every downstream Tuva mart automatically inherits the resolved patient identity with no additional transformation.