Output Table Reference

EMPI Lite produces eight final output tables, all written to the empi schema in your warehouse. This document describes each table, its columns, and example rows.

empi_crosswalk

The primary identity mapping table. One row per (data_source, source_person_id) pair. Every source record in every source system maps to exactly one empi_id.

How to use it: Join this table to any other table using data_source and source_person_id to add empi_id. Then join on empi_id to any other EMPI output.

Key columns

Column	Type	Description
`empi_id`	VARCHAR	Globally unique stable identifier for the real person (deterministic hash, format `EMP-` + hex or numeric). Shared by all source records that have been linked to the same individual.
`data_source`	VARCHAR	Source system name (e.g., `EHR_SYSTEM`, `CLAIMS_PAYER`).
`source_person_id`	VARCHAR	Patient identifier as it appears in the source system.
`match_status`	VARCHAR	`SINGLETON` - not linked to any other record. `EMPI_MATCHED` - linked algorithmically. `MANUAL_MATCHED` - linked by a human reviewer.
`cluster_size`	INTEGER	How many source records share this `empi_id`.
`cluster_data_sources`	VARCHAR	Pipe-delimited list of all data sources represented in the cluster.
`cluster_enrollment_start`	DATE	Earliest enrollment start date across all records in the cluster.
`cluster_enrollment_end`	DATE	Latest enrollment end date across all records in the cluster.
`source_enrollment_start`	DATE	Enrollment start for this specific source record.
`source_enrollment_end`	DATE	Enrollment end for this specific source record.

Example rows

empi_id      data_source    source_person_id   match_status    cluster_size   cluster_data_sources
-----------  -------------  -----------------  --------------  -------------  ------------------------------------
EMP-0000042  EHR_SYSTEM     PAT-10042          EMPI_MATCHED    2              EHR_SYSTEM|CLAIMS_PAYER
EMP-0000042  CLAIMS_PAYER   MBR-88199A         EMPI_MATCHED    2              EHR_SYSTEM|CLAIMS_PAYER
EMP-0000043  EHR_SYSTEM     PAT-10043          SINGLETON       1              EHR_SYSTEM
EMP-0000044  EHR_SYSTEM     PAT-10044          EMPI_MATCHED    3              EHR_SYSTEM|CLAIMS_PAYER|LAB_SYSTEM
EMP-0000044  CLAIMS_PAYER   MBR-10031          EMPI_MATCHED    3              EHR_SYSTEM|CLAIMS_PAYER|LAB_SYSTEM
EMP-0000044  LAB_SYSTEM     LAB-000291         EMPI_MATCHED    3              EHR_SYSTEM|CLAIMS_PAYER|LAB_SYSTEM

empi_golden_record

One authoritative demographic row per EMPI ID. Resolves the best available values from all source records in each cluster.

Resolution logic:

Patient records are preferred over eligibility/enrollment records
The most recent effective date wins within each record type
Critical identifiers (SSN, phone, death date, address) are backfilled from any record in the cluster when the primary record is missing them

How to use it: Join on empi_id to get resolved demographics for any patient. Use the has_conflicting_* flags to qualify downstream analysis.

Key columns

Column	Type	Description
`empi_id`	VARCHAR	Patient identifier.
`first_name`, `last_name`	VARCHAR	Resolved name.
`birth_date`	DATE	Resolved date of birth.
`death_date`	DATE	Resolved date of death (backfilled from any cluster record).
`social_security_number`	VARCHAR	Resolved SSN.
`sex`	VARCHAR	Resolved sex/gender.
`address`, `city`, `state`, `zip_code`	VARCHAR	Resolved address.
`phone`, `email`	VARCHAR	Resolved contact info.
`ssn_source_count`	INTEGER	Number of distinct SSN values seen across the cluster.
`birth_date_source_count`	INTEGER	Number of distinct birth dates seen across the cluster.
`is_matched_record`	BOOLEAN	True if this patient has been linked to at least one other source record.
`has_conflicting_birth_dates`	BOOLEAN	True if two or more distinct birth dates exist across the cluster.
`has_conflicting_last_names`	BOOLEAN	True if two or more distinct last names exist across the cluster.
`primary_record_type`	VARCHAR	The source record type used as the primary source (`PATIENT` or `ELIGIBILITY`).

Example rows

empi_id      first_name  last_name   birth_date   sex     state   is_matched_record   has_conflicting_birth_dates   has_conflicting_last_names
-----------  ----------  ----------  -----------  ------  ------  -------------------  ---------------------------  --------------------------
EMP-0000042  Jonathan    Smith       1978-03-14   male    OR      true                 false                         false
EMP-0000043  Maria       Gonzalez    1991-07-02   female  CA      false                false                         false
EMP-0000044  Robert      Johnson     1965-11-28   male    CO      true                 false                         false
EMP-0000091  Jane        Doe         1983-05-19   female  TX      true                 true                          false

EMP-0000091 has conflicting birth dates - the same person appears with different DOBs in different source systems. A downstream analyst should investigate before relying on the birth date.

empi_patient_events

A complete chronological audit trail. One row per event per patient. Every match, every demographic change, every enrollment span, and every snapshot-detected change is recorded here.

How to use it: Filter on empi_id to reconstruct the complete history for any patient. Filter on event_type to find all matches, all demographic changes, etc.

Event types

Event type	Description
`DEMOGRAPHIC_LOADED`	A demographic record was ingested from a source system.
`DEMOGRAPHIC_CHANGE`	A demographic field changed between consecutive records for the same person in the same source.
`EMPI_MATCH`	Two source IDs were linked - algorithmically or by a reviewer. Includes a plain-English narrative.
`RECORD_SPLIT`	A source record was flagged as containing data for more than one real patient.
`ENROLLMENT_START`	A new enrollment span began in a source system.
`ENROLLMENT_END`	An enrollment span ended.
`EMPI_ID_CHANGED`	(snapshots enabled) A record's cluster membership changed between pipeline runs.
`MATCH_STATUS_CHANGED`	(snapshots enabled) A record transitioned from `SINGLETON` to `EMPI_MATCHED`.
`SOURCE_DATA_UPDATED`	(snapshots enabled) An upstream source record was corrected between runs.

Key columns

Column	Type	Description
`empi_id`	VARCHAR	Patient identifier.
`event_date`	DATE	When the event occurred (or was detected).
`event_type`	VARCHAR	See table above.
`event_description`	VARCHAR	Human-readable description of the event.
`data_source`	VARCHAR	Source system involved.
`source_person_id`	VARCHAR	Source record involved.

Example rows

empi_id      event_date   event_type            event_description
-----------  -----------  --------------------  -----------------------------------------------------------------------
EMP-0000042  2019-01-14   DEMOGRAPHIC_LOADED    Record loaded from EHR_SYSTEM (PAT-10042): Jonathan Smith, DOB 1978-03-14
EMP-0000042  2019-01-14   ENROLLMENT_START      Enrollment began at CLAIMS_PAYER (MBR-88199A) on 2018-11-01
EMP-0000042  2019-01-15   EMPI_MATCH            Automatically linked: [EHR_SYSTEM:PAT-10042] and
                                                [CLAIMS_PAYER:MBR-88199A]. Last name matched exactly. Birth date
                                                matched exactly. First name was a close fuzzy match (Jon vs. Jonathan,
                                                78% similarity). Social security number matched exactly.
                                                Similarity score: 93.2% - above the 70.0% match threshold.
EMP-0000042  2023-06-01   DEMOGRAPHIC_CHANGE    Address changed on EHR_SYSTEM (PAT-10042):
                                                123 Oak St → 847 Pine Ave

empi_demographics_tall

Every known demographic value for every patient, across all sources, in a long/tall format. Useful for understanding what values exist across a cluster and for building source-level comparison views.

Key columns

Column	Type	Description
`empi_id`	VARCHAR	Patient identifier.
`data_source`	VARCHAR	Which source system provided this value.
`source_person_id`	VARCHAR	Source record that provided this value.
`attribute`	VARCHAR	Demographic attribute name (e.g., `first_name`, `birth_date`, `ssn`).
`value`	VARCHAR	The attribute value from this source.
`attribute_tier`	VARCHAR	`STANDARD` (eligibility/patient columns) or `CUSTOM` (from custom attributes table).

empi_review_queue_matches

Candidate pairs in the review zone, ready for human adjudication. Score is above the dismissal threshold but below the auto-match threshold. A reviewer sets is_match = TRUE (same person) or is_match = FALSE (not a match), then sets reviewed = TRUE to finalize the decision.

See Manual Review Workflow for the full decision workflow.

Key columns

Column	Type	Description
`data_source_a` / `data_source_b`	VARCHAR	Data source for each record in the pair.
`source_person_id_a` / `source_person_id_b`	VARCHAR	Source person ID for each record.
`person_profile_a` / `person_profile_b`	VARCHAR	Demographic summary for each record (SSN masked).
`side_by_side_comparison`	VARCHAR	Attribute-by-attribute comparison with match indicators.
`normalized_score`	FLOAT	Normalized score between 0 and 1.
`review_priority`	VARCHAR	`HIGH` (close to match threshold), `MEDIUM`, or `LOW`.
`matching_attributes`	VARCHAR	Comma-separated list of attributes that agreed.
`mismatching_attributes`	VARCHAR	Comma-separated list of attributes that conflicted.

Example rows

data_source_a   source_person_id_a   data_source_b   source_person_id_b   normalized_score   review_priority
--------------  -------------------  --------------  -------------------  -----------------  ---------------
EHR_SYSTEM      PAT-10091            CLAIMS_PAYER    MBR-20047             0.61               HIGH
EHR_SYSTEM      PAT-10145            LAB_SYSTEM     LAB-000512           0.57               MEDIUM
EHR_SYSTEM      PAT-10201            CLAIMS_PAYER   MBR-30110             0.52               LOW

empi_review_queue_splits

Source records whose own demographic history contains conflicting values. A signal that a source record may contain data for more than one real patient (a common EHR data entry error).

See Manual Review Workflow for the split decision workflow.

Key columns

Column	Type	Description
`data_source`	VARCHAR	Source system containing the suspicious record.
`source_person_id`	VARCHAR	The record with conflicting demographics.
`record_count`	INTEGER	Number of distinct demographic records found for this source person.
`review_priority`	VARCHAR	`CRITICAL` (conflicting DOB or SSN), `HIGH` (conflicting last name), `MEDIUM`, or `LOW`.
`inconsistent_attribute_count`	INTEGER	Number of attributes with conflicting values.
`inconsistent_attributes`	VARCHAR	Comma-separated list of attributes that conflict.
`inconsistency_summary`	VARCHAR	Newline-delimited summary of each conflicting attribute and its observed values, ordered by severity.

Example rows

data_source   source_person_id   record_count   review_priority   inconsistent_attribute_count   inconsistent_attributes   inconsistency_summary
------------  -----------------  -------------  ---------------   ----------------------------   -----------------------  -------------------------------------------
EHR_SYSTEM    PAT-10188          2              CRITICAL          1                              birth_date                birth_date: 1972-04-01, 1989-11-15
EHR_SYSTEM    PAT-10244          2              HIGH              1                              last_name                 last_name: Martinez, Hoffman

empi_demographic_anomalies

Statistical surveillance over the patient population. Surfaces values that appear across a disproportionate share of records and identifier fields shared across multiple distinct patients.

Anomaly types

Anomaly type	Description
`FREQUENCY_OUTLIER`	A demographic value appears across an unusually high share of records relative to the distribution for that attribute - likely a test value, placeholder, or recycled value.
`UNIQUE_FIELD_VIOLATION`	An identifier field (SSN, email) is shared across two or more distinct EMPI IDs. This is a high-risk data quality issue.

Example rows

anomaly_type             attribute   value          affected_records   pct_of_population   description
-----------------------  ----------  -------------  -----------------  ------------------  ------------------------------------------
FREQUENCY_OUTLIER        birth_date  1900-01-01     847                2.3%                Placeholder/test birth date shared by 847 records
FREQUENCY_OUTLIER        first_name  TEST           312                0.8%                Likely test/placeholder first name
FREQUENCY_OUTLIER        phone       555-000-0000   214                0.6%                Placeholder phone number
UNIQUE_FIELD_VIOLATION   ssn         123-45-6789    14                 -                   SSN shared across 14 distinct EMPI IDs
UNIQUE_FIELD_VIOLATION   email       test@test.com  28                 -                   Email shared across 28 distinct EMPI IDs

empi_score_distribution

Score distribution across all candidate pairs. Shows how many pairs fall in each 0.01 score bucket. Used for threshold selection and understanding the shape of your data.

empi_attribute_coverage

Attribute fill rates across the patient population. Shows what percentage of records have each demographic attribute populated. Low fill rates identify attributes that will contribute little to matching quality for your specific data.

Evaluation models (optional)

If you supply a true_patient_mappings seed (ground-truth known-correct links), two additional evaluation models are available:

Table	Description
`empi_threshold_analysis`	Precision, recall, and F1 score at every 0.01 threshold increment from 0.50 to 0.95.
`empi_threshold_examples`	Specific false positives and false negatives at the current threshold - for understanding why the model makes specific decisions.

Tuva integration tables

EMPI Lite also writes two tables to the public schema for downstream Tuva compatibility:

Table (alias)	Description
`tuva_patient` (alias: `patient`)	Tuva-compatible patient table where `person_id = empi_id`.
`tuva_eligibility` (alias: `eligibility`)	Tuva-compatible eligibility table where `person_id = empi_id`.

Every downstream Tuva mart automatically inherits the resolved patient identity with no additional transformation.

empi_crosswalk​

Key columns​

Example rows​

empi_golden_record​

Key columns​

Example rows​

empi_patient_events​

Event types​

Key columns​

Example rows​

empi_demographics_tall​

Key columns​

empi_review_queue_matches​

Key columns​

Example rows​

empi_review_queue_splits​

Key columns​

Example rows​

empi_demographic_anomalies​

Anomaly types​

Example rows​

empi_score_distribution​

empi_attribute_coverage​

Evaluation models (optional)​

Tuva integration tables​

empi_crosswalk

Key columns

Example rows

empi_golden_record

Key columns

Example rows

empi_patient_events

Event types

Key columns

Example rows

empi_demographics_tall

Key columns

empi_review_queue_matches

Key columns

Example rows

empi_review_queue_splits

Key columns

Example rows

empi_demographic_anomalies

Anomaly types

Example rows

empi_score_distribution

empi_attribute_coverage

Evaluation models (optional)

Tuva integration tables