4. Data Quality

Last updated: 06-21-2025

Data Quality is a critical tool in the Tuva Project. It's used to check whether you've mapped source data to Tuva correctly, to identify data quality problems in source data, and to show how those problems impact Data Marts and, in turn, your ability to do analytics with the data.

Data Quality in Tuva includes 3 main components:

  1. Data Quality Tests: 600+ tests built into the Tuva data pipelines that identify atomic-level data quality problems
  2. Data Quality Tables: Data tables where the data quality test results are aggregated which can be used for downstream analysis
  3. Data Quality Dashboard: A visualization for easily understanding the data quality problems in your source data

Repository for Dashboard: https://github.com/tuva-health/tuva_dqi

Below we explain how to run Data Quality.

What Does It Do?

The Tuva Data Quality Dashboard takes the output from the data quality tests run by the Tuva dbt package and presents it in an easy-to-understand web application. Its main goals are to help you:

  1. Monitor Overall Health: Get a quick sense of your data's quality with an A-F grading system.
  2. Assess Usability: Understand if specific data marts (like Service Categories or CMS HCCs) are reliable enough to use based on test results.
  3. Investigate Issues: Drill down into specific failing tests to understand what went wrong and where.

How It Works: The Big Picture

It's important to understand that the dashboard itself doesn't run the data quality tests. Instead, it visualizes the results generated by the Tuva dbt package.

Here's the typical workflow:

  1. Run Tests: You use the Tuva Health dbt package (version 0.14.3 or later) within your dbt project to execute data quality tests against your data warehouse.
  2. Generate Outputs: The dbt package creates specific tables containing the test results (data_quality__testing_summary) and data for visualizations (data_quality__exploratory_charts).
  3. Export Data: You export these two tables from your data warehouse into CSV files (with headers).
  4. Import into Dashboard: You upload these CSV files into the running Data Quality Dashboard application.
  5. Visualize & Analyze: The dashboard processes the CSVs and displays the interactive charts, grades, and test details.

Key Features at a Glance

  • Data Quality Grade: An overall A-F grade summarizing data health.
  • Data Mart Status: Clear indicators of usability for each data mart.
  • Detailed Test Results: See which tests passed or failed, including severity.
  • Quality Dimension Analysis: Break down results by completeness, validity, etc.
  • Interactive Visualizations: Charts exploring data patterns and quality metrics over time or by category.
  • Report Generation: Create shareable PDF report cards.
  • Severity Levels: Tests are categorized by severity to help prioritize fixes:
    • Level 1: Critical (prevents dbt build)
    • Level 2: Major (affects data reliability)
    • Level 3: Moderate (use data with caution)
    • Level 4: Minor (limited impact)
    • Level 5: Informational (dbt informational tests)

Getting Started: Running the Dashboard

Here’s how you can get the dashboard running locally. For detailed instructions, always refer to the project's README.

Prerequisites:

  • Python (version 3.11 or higher recommended)
  • Git

Steps:

  1. Clone the Repository:
    git clone https://github.com/tuva-health/tuva_dqi.git
    cd tuva_dqi
  2. Set up a Virtual Environment: (Recommended to avoid dependency conflicts)
    # Navigate into the cloned directory (if you aren't already)
    cd tuva_dqi
    # Create the environment
    python -m venv .venv # Use python3 if 'python' doesn't work
    # Activate the environment
    # On macOS/Linux:
    source .venv/bin/activate
    # On Windows:
    # .venv\Scripts\activate
  3. Install Dependencies:
    python -m pip install -r requirements.txt
  4. Run the Application:
    python app.py
  5. Access: Open your web browser and navigate to http://localhost:8080.

(Note: Docker instructions are also available in the repository's README if you prefer containerization.)

Generating the Input Data (Crucial Step!)

The dashboard needs specific data generated by the Tuva dbt package.

Prerequisites:

  • A working dbt project.
  • Your healthcare data loaded into your data warehouse.

Steps:

  1. Add the Tuva Package: Ensure your project's packages.yml includes the Tuva package, version 0.14.3 or later:

    packages:
      - package: tuva-health/the_tuva_project
        version: [">=0.14.3"] # Or specify a higher version
  2. Install/Update the Package:

    dbt clean
    dbt deps
  3. Run Tuva Models and Tests: Execute the necessary dbt commands to build Tuva models and run the data quality tests. A typical sequence is:

    dbt seed --full-refresh # Load necessary seed data
    dbt run # Build Tuva models
    dbt test # Run data quality tests
  4. Export the Results: After the dbt test command finishes successfully, two key tables will exist in your data warehouse (typically in your data quality schema, with the Tuva prefix if one was specified):

    • data_quality__testing_summary
    • data_quality__exploratory_charts

    You need to export the data from these two tables into CSV files, including headers. The method for exporting depends on your specific data warehouse (e.g., using SQL queries, warehouse UI export features, or other tools).

  5. Upload to Dashboard: Once you have the testing_summary.csv and exploratory_charts.csv files, use the "Import Test Results" feature within the running dashboard application to upload them.
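Export methods vary by warehouse, but the pattern is always the same: run a `select *` against the table and write the rows, header first, to a CSV file. A minimal sketch using Python's built-in sqlite3 module as a stand-in for your warehouse connection (the table name matches the dbt output above; the sample columns and rows are illustrative only):

```python
import csv
import sqlite3

def export_table_to_csv(conn, table_name, csv_path):
    """Export a table to CSV, headers included, via any DB-API connection."""
    cursor = conn.execute(f"select * from {table_name}")
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cursor.description])  # header row
        writer.writerows(cursor)  # remaining rows

# Illustrative stand-in for data_quality__testing_summary in your warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("create table data_quality__testing_summary (test_name text, status text)")
conn.execute("insert into data_quality__testing_summary values ('not_null_claim_id', 'pass')")
export_table_to_csv(conn, "data_quality__testing_summary", "testing_summary.csv")
```

With a real warehouse you would swap the sqlite3 connection for your warehouse's own Python connector; the export logic itself is unchanged.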

Running tests just within dbt

DQI introduces variables into the Tuva dbt project that let you enable or target specific DQI tests while staying within dbt. For example:

# Run all the tests associated with the HCC mart. Warnings change into errors
dbt test -s tag:dqi_cms_hccs --warn-error

# Test sev 1 and sev 2 issues only
dbt test -s tag:tuva_dqi_sev_1 -s tag:tuva_dqi_sev_2

# Run all tests that are specific to tuva dqi
dbt test -s tag:dqi

List of mart DQI test tags

  • dqi_service_categories
  • dqi_ccsr
  • dqi_cms_chronic_conditions
  • dqi_tuva_chronic_conditions
  • dqi_cms_hccs
  • dqi_ed_classification
  • dqi_financial_pmpm
  • dqi_quality_measures
  • dqi_readmission

How to disable DQI

Set the enable_input_layer_testing variable to false, either on the CLI or in your dbt_project.yml file, to disable this functionality.
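For example, using the variable name given above, the project-level setting would look like this in dbt_project.yml (a sketch, assuming the package reads the variable at the project level):

```yaml
# dbt_project.yml
vars:
  enable_input_layer_testing: false
```

Equivalently, it can be passed for a single invocation with dbt's standard flag, e.g. dbt build --vars '{enable_input_layer_testing: false}'.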

Development Status

Please note that the Tuva Data Quality Dashboard is currently in alpha/early development. It's designed to work with outputs from the Tuva dbt package version 0.14.3 or later.

Legacy Data Quality

Atomic Data Quality

Prior to developing the data quality framework outlined above we used a different approach. This approach (documented below) is not inherently built into the data pipelines, making it difficult to catch issues at run-time. It also takes a very long time to run and there is no corresponding dashboard to visualize results. We will be phasing out this legacy approach in coming releases but will keep it documented here for transparency.

Atomic-level data quality focuses on identifying problems inherent in your raw data, irrespective of specific analytics use cases. For example, invalid ICD-10-CM diagnosis codes are an atomic-level data quality problem. Obviously invalid codes like this will also impact analytics use cases, but at the atomic-level we focus on identifying and summarizing these problems across the dataset without linking them to specific analytics use cases.

There are two types of atomic-level data quality problems:

  1. Mapping Problems
  2. Inherent Problems

Mapping problems are data quality problems that are inadvertently introduced when data is transformed from the source to the Tuva Data Model. These problems can be corrected by fixing the mapping.

Inherent problems are data quality problems that are artifacts of the raw data that were not introduced during mapping and cannot be corrected by mapping. They can only be corrected by acquiring better data from the source.

For each claims data table in the Tuva Input Layer, we analyze several domains of data quality problems. We have organized these domains in order of importance below. To view a union of all the individual checks (ignoring date trend and other checks that are more complex) query the table below:

select *
from data_quality.data_quality_test_summary
order by result_count desc

Medical Claim

Primary Key

The most basic check we perform on each data table is a primary key check. This ensures the data in the table is at the proper grain. Modern data warehouses (e.g. Snowflake) don't always enforce primary keys. If the primary key isn't set properly this could cause downstream queries to unintentionally explode in size and number of records.

The primary key on the medical_claim table is comprised of 3 fields: claim_id, claim_line_number, and data_source. The primary_key_check table verifies whether there are any duplicates across the primary key fields for each data table. If any result in the duplicate_records column is greater than zero, then that table's primary key is not set properly and this needs to be corrected in the mapping.

select *
from data_quality.primary_key_check

Primary Keys

In the example above the pharmacy_claim table has 100 distinct claims that have multiple records with the same values for the primary key fields. If any result in this table is non-zero you need to correct the mapping to fix it.
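The logic behind a primary key check is a simple group-by on the key fields with a HAVING filter. A runnable sketch using SQLite as a stand-in for your warehouse (column names follow the Input Layer; the sample rows are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table medical_claim (
        claim_id text, claim_line_number integer, data_source text
    )
""")
# Two rows share the same primary key values -- a grain violation.
conn.executemany(
    "insert into medical_claim values (?, ?, ?)",
    [("C1", 1, "payer_a"), ("C1", 1, "payer_a"), ("C1", 2, "payer_a")],
)

duplicates = conn.execute("""
    select claim_id, claim_line_number, data_source, count(*) as duplicate_records
    from medical_claim
    group by claim_id, claim_line_number, data_source
    having count(*) > 1
""").fetchall()
print(duplicates)  # -> [('C1', 1, 'payer_a', 2)]
```

The same query pattern works in any warehouse; the primary_key_check table simply aggregates results like these across all Input Layer tables.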

Person ID

The patient is at the center of the vast majority of the analyses we're interested in. Therefore it's important that we check a few things related to the person_id field on every medical claim, specifically:

  1. Does every line on each claim have a value for person_id populated?
  2. Is there more than one value for person_id within a single claim (there shouldn't be)?
  3. Does the person_id value indicated on the claim have corresponding valid eligibility during the time period when the claim was rendered?

The medical_claim_person_id table verifies whether any of these data quality problems occur in the medical_claim table. You can query it as follows:

select *
from data_quality.medical_claim_person_id

This query returns the number of unique claim IDs that have each of these data quality problems.

Medical Claim Patient ID

In the example table above we observe the following:

  1. person_id is populated for every single record in the medical_claim table
  2. 50 claim IDs have more than 1 person_id. This can occur when two or more distinct lines on the claim have different values for person_id.
  3. 1,000 claim IDs are considered "orphaned claims". This means that the claim_start_date or claim_end_date occur during a month when the patient does not have insurance eligibility.

If any of these problems occur in your data you should attempt to correct them in mapping. However, the specific techniques will vary by dataset and it may not be possible to correct every problem. If the problems can't be corrected we still include these records in the dataset, but there will be limitations on how useful they are for analytics.
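Check 2 above (more than one person_id on a single claim) reduces to counting distinct person_id values per claim. A sketch with SQLite standing in for the warehouse (sample rows are made up; the orphaned-claims check works the same way but joins claim dates against eligibility spans):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table medical_claim (claim_id text, claim_line_number int, person_id text)"
)
conn.executemany(
    "insert into medical_claim values (?, ?, ?)",
    [("C1", 1, "P1"), ("C1", 2, "P2"),   # two person_ids on one claim: a problem
     ("C2", 1, "P3"), ("C2", 2, "P3")],  # consistent person_id: fine
)

# Count claims whose lines disagree on person_id
multi_person = conn.execute("""
    select count(*) from (
        select claim_id
        from medical_claim
        group by claim_id
        having count(distinct person_id) > 1
    )
""").fetchone()[0]
print(multi_person)  # -> 1
```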

Date Fields

The majority of analyses are longitudinal in nature, meaning they look at trends or changes over time. To make these sorts of analyses possible we need reliable date fields. In medical claims the following fields are the important date fields:

  • claim_start_date
  • claim_end_date
  • claim_line_start_date
  • claim_line_end_date
  • admission_date
  • discharge_date
  • paid_date

We need to check these fields for the following problems:

  1. Every claim line should have a value populated for the following fields:
  • claim_start_date
  • claim_end_date
  • claim_line_start_date
  • claim_line_end_date
  • paid_date
  2. Every institutional claim should have an admission_date and discharge_date field populated.
  3. The following fields should have 1 and only 1 value on the claim when they are populated:
  • claim_start_date
  • claim_end_date
  • admission_date
  • discharge_date

It's possible and expected for claim line dates to be different for different records on a single claim.

The following table returns the count of distinct claims that violate the rules outlined above.

select *
from data_quality.medical_claim_date_checks

Medical Claim Basic Dates

In the example table above we can see that 50 medical claims are missing a claim_end_date and 1,000 claims are missing a claim_line_start_date.

Beyond these basic date checks, we need to check whether the values within our date fields are reasonable over time. For example, if we have a dataset that covers 3 calendar years, it's possible that claim_start_date is perfectly normal for the first two years, but then is completely missing thereafter. The following query analyzes trends in each of the date fields by looking at the count of distinct claims over time for each date field:

select *
from data_quality.claim_date_trends
order by 1

Claim Date Trends

In the example table above we see that the first 3 months of 2021 have claim start and end dates populated, but no admission, discharge, or paid dates. This indicates we are missing date information on these fields during these months.

Diagnosis and Procedure Fields

Most analyses we perform incorporate information about the patient's health conditions and / or procedures performed on the patient. Therefore it's critical that we have solid information on the ICD-10-CM and ICD-10-PCS codes which encode this information.

Specifically we check to ensure the following:

  1. Every claim line should have a value populated for diagnosis_code_1
  2. Every value in the diagnosis_code_1 field should be a valid ICD-10-CM code
  3. Every claim should have 1 and only 1 value for diagnosis_code_1
  4. For any secondary diagnosis codes that are populated (e.g. diagnosis_code_2, diagnosis_code_3, etc.) these values should be valid ICD-10-CM codes
  5. We also want to know the percent of claims that have 1 or more secondary diagnosis codes
  6. For any procedure codes (e.g. procedure_code_1, procedure_code_2, etc.) that are populated these values should be valid ICD-10-PCS codes

The query below will display the results from each of these checks:

select *
from data_quality.dx_and_px

Dx and Px

In the example table above, we see that 50 claims have 1 or more invalid primary diagnosis codes and that 1,000 claims have more than 1 primary diagnosis code, making it impossible to use the primary diagnosis code from these claims in analysis.
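Validity checks like these come down to a left join against a terminology table, keeping populated codes that find no match. A sketch with SQLite standing in for the warehouse (the two-code terminology table here is a toy list, not the real ICD-10-CM reference):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table medical_claim (claim_id text, diagnosis_code_1 text)")
conn.execute("create table icd_10_cm (icd_10_cm text)")
conn.executemany("insert into icd_10_cm values (?)", [("E11.9",), ("I10",)])
conn.executemany(
    "insert into medical_claim values (?, ?)",
    [("C1", "E11.9"), ("C2", "BADCODE"), ("C3", None)],
)

# Claims whose primary diagnosis is populated but not found in the terminology table
invalid = conn.execute("""
    select count(distinct mc.claim_id)
    from medical_claim mc
    left join icd_10_cm t on mc.diagnosis_code_1 = t.icd_10_cm
    where mc.diagnosis_code_1 is not null and t.icd_10_cm is null
""").fetchone()[0]
print(invalid)  # -> 1 (only 'BADCODE'; the null code is a missingness issue, not a validity one)
```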

Institutional Header Fields

There are several key fields on institutional claims that we use to identify the type of service being delivered and to group claims into encounters. These fields include the following:

  • bill_type_code
  • discharge_disposition_code
  • ms_drg_code
  • apr_drg_code

For each of these fields we need to check institutional claims to ensure the following is true:

  1. Every claim line has a value populated
  2. The value populated is valid
  3. There is 1 and only 1 value for each claim across all claim lines

The following code returns the results of these checks:

select *
from data_quality.medical_claim_inst_header_fields
order by result_count desc

Inst Header Fields

In the example output above we see that 50 claims have an invalid bill_type_code, meaning it does not match a code in the terminology table, and that 1,000 claims have multiple values for bill_type_code.

Claim Line Fields

There are a few key fields across both professional and institutional claims that we use to assign service categories and to group claims into encounters. These include the following fields:

  • revenue_center_code
  • place_of_service_code
  • hcpcs_code

We need to check whether the values of these fields are missing or invalid. The following table returns the results of these checks.

select *
from data_quality.medical_claim_claim_line_fields
order by result_count desc

Medical Claim Line Checks

In the example table above we see that 1,000 claims are missing revenue_center_code.

Provider NPI Fields

Often we wish to do analysis about the specific providers or facilities delivering care. NPI is critical here because it's how we identify providers, facilities, and health systems. Therefore we perform the following checks on NPI:

  • Every claim should have 1 and only 1 billing_npi and it should be valid
  • Every claim should have 1 or more rendering_npi values and they should be valid
  • Every institutional claim should have 1 and only 1 facility_npi and it should be valid

select *
from data_quality.medical_claim_provider_npi
order by result_count desc

Provider NPI
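A "valid" NPI here means, at a minimum, a 10-digit number whose check digit works out under the Luhn algorithm when the number is prefixed with 80840 (the standard NPI check-digit scheme). A sketch of that structural check (a real pipeline would typically also confirm the NPI exists in the NPPES registry):

```python
def npi_is_valid(npi: str) -> bool:
    """Structural NPI check: 10 digits plus a Luhn check over '80840' + NPI."""
    if len(npi) != 10 or not npi.isdigit():
        return False
    digits = [int(d) for d in "80840" + npi]
    total = 0
    # Double every second digit from the right; subtract 9 from two-digit products.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(npi_is_valid("1234567893"))  # True: check digit 3 satisfies the Luhn sum
print(npi_is_valid("1234567890"))  # False: wrong check digit
```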

Trended Claim Volume and Dollars

Often we see anomalies where claim volume or dollars can change unexpectedly over time, possibly indicating a data quality problem. Therefore we need to check whether this is the case.

select *
from data_quality.medical_claim_volume_and_dollars
order by year_month

Medical Claim Volume and Dollars

In the example above we clearly have medical claims in the month of March 2021 but we have no paid amounts.

Data Loss

Often when we map data we perform complex filtering and other types of transformation operations. In these scenarios it can be easy for unintended data loss to occur. Therefore we need to confirm some basic statistics between the raw source data and the Tuva Input Layer to ensure unintended data loss hasn't occurred.

The table identified in the query below will only populate once you've created the data loss table in the Input Layer. To create this table, you manually calculate the exact same metrics as the source data and map these results to the data_loss table in the claims Input Layer.

select *
from data_quality.data_loss

Data Loss

In the example above we can compare several basic "table stakes" statistics for medical and pharmacy claims, as well as eligibility. The highlighted rows all show data loss as we move from raw to input layer to core. Some of this data loss might be expected and explainable. You should work to understand any and all data loss that is not currently explainable before moving on.

Pharmacy Claim

Primary Key

The most basic check we perform on any table is a primary key check. This ensures the data in this table is at the proper grain. Modern data warehouses (e.g. Snowflake) don't always enforce primary keys. If the primary key isn't set properly this could cause downstream queries to explode in size and number of records.

The primary key on the pharmacy_claim table is comprised of 3 fields: claim_id, claim_line_number, and data_source. Query the following table to check whether your data has any duplicate records across these fields. The duplicate_records column in this table should equal 0 for every table. If it's not you have a primary key problem to fix.

select *
from data_quality.primary_key_check

Person ID

Similar to the medical_claim table, the patient is at the center of the vast majority of the analyses we're interested in. Therefore it's important that we check a few things related to the person_id field on every pharmacy claim, specifically:

  1. Does every line on each claim have a value for person_id populated?
  2. Is there more than one value for person_id within a single claim?
  3. Does the person_id value indicated on the claim have corresponding valid eligibility during the time period when the claim was rendered?

The pharmacy_claim_person_id table verifies whether any of these data quality problems occur in the pharmacy_claim table. You can query it as follows:

select *
from data_quality.pharmacy_claim_person_id

This query returns a table with one row per check and the count of unique claim IDs that have that particular data quality issue.

Pharmacy Claim Patient ID

In the example table above we observe the following:

  1. person_id is populated for every single record in the pharmacy_claim table
  2. 50 claim IDs have more than 1 person_id. This can occur when two or more distinct lines on the claim have different values for person_id.
  3. 1,000 claim IDs are considered "orphaned claims". This means that the paid_date or the dispensing_date occurs during a month when the patient does not have insurance eligibility.

If any of these problems occur in your data you should attempt to correct them in mapping. However, the specific techniques will vary by dataset and it may not be possible to correct every problem. If the problems can't be corrected we still include these records in the dataset, but there will be limitations on how useful they are for analytics.

Date Fields

The majority of analyses are longitudinal in nature, meaning they look at trends or changes over time. To make these sorts of analyses possible we need reliable date fields. In pharmacy claims the following fields are the important date fields:

  • paid_date
  • dispensing_date

We need to check that both of these fields are populated for every claim and claim line.

The following table returns the count of distinct claims that violate the rules outlined above.

select *
from data_quality.pharmacy_claim_date_checks

Pharmacy Claim Basic Dates

In the example table above we can see that 1,000 pharmacy claims are missing a dispensing_date while paid_date is always populated.

Beyond these basic date checks, we need to check whether the values within our date fields are reasonable over time. For example, if we have a dataset that covers 3 calendar years, it's possible that paid_date is perfectly normal for the first two years, but then is completely missing thereafter. The following query analyzes trends in each of the date fields by looking at the count of distinct claims over time for each date field:

select *
from data_quality.claim_date_trends
order by 1

Pharmacy Claim Date Trends

In the example table above we see that the first 3 months of 2021 have only a small number of claims with paid_dates. It is worth investigating further to see whether we are missing dates for those claims, or whether this is just ramp-up in the pharmacy claims data we have received.

NDC

Every pharmacy claim should have an NDC (National Drug Code) on it. We check to see whether every claim and claim line has an NDC field that is populated, and whether that NDC is a valid NDC (meaning it is part of the published NDC list by the FDA). These checks are found here:

select *
from data_quality.pharmacy_claim_ndc
order by 1

Pharmacy Claim NDC

Some invalid NDCs are expected, as there are often NDCs in pharmacy data for items like 'glucose test strips' that are not found on the FDA list.

Prescription Details

A pharmacy claim will have prescription details in these important fields:

  • quantity
  • days supply
  • refills

We check each of these fields to ensure they are populated.

select *
from data_quality.pharmacy_claim_prescription_details
order by 1

Pharmacy Prescription Details

Quantity is the most important of the fields, as it is used in the pharmacy mart to calculate the generic available opportunity.

NPI Fields

Pharmacy claims contain two important NPI fields:

  • prescribing_npi
  • dispensing_npi

Of the two fields, prescribing_npi is more important as it enables analysis around who is prescribing certain drugs and what type of practitioners they are.

We check these fields to ensure they are populated and valid NPIs. We also check to ensure the prescribing_npi is the expected entity type (an individual instead of an organization).

select *
from data_quality.pharmacy_claim_npi

Pharmacy NPI Details

Trended Claim Volume and Dollars

Often we see anomalies where claim volume or dollars can change unexpectedly over time, possibly indicating a data quality problem. Therefore we need to check whether this is the case.

select *
from data_quality.pharmacy_claim_volume_and_dollars

Pharmacy Claim Volume and Dollars

In the example above we clearly have pharmacy claims in the month of March 2021 but we have no paid amounts.

Data Loss

Often when we map data we perform complex filtering and other types of transformation operations. In these scenarios it can be easy for unintended data loss to occur. Therefore we need to confirm some basic statistics between the raw source data and the Tuva Input Layer to ensure unintended data loss hasn't occurred.

The table identified in the query below will only populate once you've created the data loss table in the Input Layer. To create this table, you manually calculate the exact same metrics as the source data and map these results to the data_loss table in the claims Input Layer.

select *
from data_quality.data_loss

Data Loss

In the example above we can compare several basic "table stakes" statistics for medical and pharmacy claims, as well as eligibility. The highlighted rows all show data loss as we move from raw to input layer to core. Some of this data loss might be expected and explainable. You should work to understand any and all data loss that is not currently explainable before moving on.

Eligibility

Primary Key

The most basic check we perform on any table is a primary key check. This ensures the data in this table is at the proper grain.
Modern data warehouses (e.g. Snowflake) don't always enforce primary keys. If the primary key isn't set properly this could cause downstream queries to explode in size and number of records.

The primary key on the eligibility table is comprised of 5 fields: member_id, payer, plan, enrollment_start_date, and enrollment_end_date.
Query the following table to check whether your data has any duplicate records across these fields. The duplicate_records column in this table should equal 0 for every table. If it's not you have a primary key problem to fix.

select *
from data_quality.primary_key_check

Primary Keys

In the example above the pharmacy_claim table has 100 distinct claims that have multiple records with the same values for the primary key fields. If any result in this table is non-zero you need to correct the mapping to fix it.

Person ID

Does every row of eligibility have a person_id populated?

The eligibility_missing_person_id table verifies whether any rows do not have a person_id. You can query it as follows:

select *
from data_quality.eligibility_missing_person_id

This query returns the number of rows in the eligibility table that do not have a person_id. If this number is greater than 0, you need to correct the mapping to fix it.

Eligibility Patient ID

In the example table above we observe that all rows in the source data have a person_id.

Date Fields

In eligibility, the following fields are the important date fields:

  • enrollment_start_date
  • enrollment_end_date

We need to check these fields for the following problems:

  1. Every row in the eligibility input layer should have valid dates in these fields.
  2. enrollment_start_date should be less than or equal to enrollment_end_date.
  3. The latest enrollment_end_date value should be the last day of the current month as of when the Tuva Project was last run.
  4. Dates should be relatively recent, e.g. no dates before 1900.

The following table returns the count of eligibility rows that violate the rules outlined above.

select *
from data_quality.eligibility_date_checks

Eligibility Date Checks
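The span-ordering check (rule 2 above) reduces to a simple comparison between the two date fields. A sketch with SQLite standing in for the warehouse (sample spans are made up; ISO-formatted date strings compare correctly as text):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table eligibility "
    "(person_id text, enrollment_start_date text, enrollment_end_date text)"
)
conn.executemany(
    "insert into eligibility values (?, ?, ?)",
    [("P1", "2021-01-01", "2021-06-30"),
     ("P2", "2021-05-01", "2021-02-28")],  # end precedes start: a violation
)

# Count spans where the start date falls after the end date
bad_spans = conn.execute("""
    select count(*)
    from eligibility
    where enrollment_start_date > enrollment_end_date
""").fetchone()[0]
print(bad_spans)  # -> 1
```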

Patient Demographics

Birth date and gender information about a member is crucial for some downstream analytics. The eligibility_demographics table provides a count of eligibility spans that are missing this data.

select *
from data_quality.eligibility_demographics

Eligibility Demographics

Payer Info

The eligibility_missing_payer table provides a count of eligibility spans with missing or invalid payer information:

  • missing payer type (e.g. medicare, medicaid, commercial)
  • missing payer name (e.g. CMS)
  • invalid payer type

select *
from data_quality.eligibility_missing_payer

Eligibility Missing Payer

Trended Enrollment Volume

The eligibility_trend table provides a count of eligibility spans that contain enrollment in the corresponding year and month. Trending this count can surface abnormal drops and/or spikes in enrollment.

select *
from data_quality.eligibility_trend
order by year_month

Eligibility Trend

Data Loss

Often when we map data we perform complex filtering and other types of transformation operations. In these scenarios it can be easy for unintended data loss to occur. Therefore we need to confirm some basic statistics between the raw source data and the Tuva Input Layer to ensure unintended data loss hasn't occurred.

The table identified in the query below will only populate once you've created the data loss table in the Input Layer.
To create this table, you manually calculate the exact same metrics as the source data and map these results to the data_loss table in the claims Input Layer.

select *
from data_quality.data_loss

Data Loss

In the example above we can compare several basic "table stakes" statistics for medical and pharmacy claims, as well as eligibility. The highlighted rows all show data loss as we move from raw to input layer to core. Some of this data loss might be expected and explainable. You should work to understand any and all data loss that is not currently explainable before moving on.

Analytics Data Quality

Analytics-level data quality focuses on understanding the ability of your data to support specific analyses and whether the results generated from those analyses are reasonable. This level of data quality assessment goes beyond the basic integrity checks performed at the atomic level and examines the data from an analytical perspective.

There are two main aspects of analytics-level data quality:

  1. Assessing which analyses your data can support
  2. Determining if the results your data generates for those analyses are reasonable

The analytics-level checks described below are intended to be used as a checklist. Each check exists as a data table in the Tuva Data Model, and the documentation below describes why the check is important and how to use these tables to perform the check.

Since many of the analytics-level checks involve assessing reasonableness, it's important to have something to compare to. For many of the checks we include reference data points based on Medicare FFS claims data. While not perfect for comparing to commercial and Medicaid populations, it still adds directional value (e.g. is a reasonable PMPM closer to $400 or $4,000).

We have organized most of the analytics data quality checks into two tables for ease of use.

The analytics reference summary table unions together the various analytics measures and their associated values in the Medicare FFS claims data set.

select *
from data_quality.analytics_reference_summary
order by analytics_concept, analytics_measure

The analytics checks summary table unions together the different data quality checks and the count of results for each of those checks. Some of the checks are "never events," meaning any result greater than zero is a problem worth looking into. Others are checks where results aren't necessarily a problem. The two types of checks are distinguished by the normally_zero flag in the data.

select *
from data_quality.analytics_checks_summary
where normally_zero = 1
order by result_count

Core Populated

The most basic check we perform at the analytics level is to confirm whether or not the Core Data Model is populated. Sometimes the Tuva Project builds all data tables in the data model, but because of data quality problems, a set of tables ends up with zero records (i.e., is not populated). This check gives us a quick glance at whether or not this is happening in the Core Data Model.

select *
from data_quality.core_populated

This query returns information about which tables in the Core Data Model are populated and which are not. If any tables are not populated, it may indicate upstream data quality issues that need to be addressed.

Core Populated

In the above image, core.pharmacy_claim has returned 0 rows, indicating that the table is not populated.

Analytics Populated

On top of the Core Data Model sit the Data Marts, which we use to perform analytics. However, sometimes these tables are not populated because of data quality issues. This table lets us see at a glance whether that is happening.

select *
from data_quality.analytics_populated

This query provides information about which Data Mart tables are populated. If any tables are not populated, it may indicate issues with the data transformation process or upstream data quality problems.

Analytics Populated

In the above image, ed_visits has returned 0 rows, indicating there is an issue with one of the fields used to create ED encounters or service categories.

Chronic Conditions

Chronic diseases follow a very predictable distribution in patient populations. Here we check the prevalence of the top 10 chronic conditions in Medicare FFS data and compare it to the prevalence in your data.

select *
from data_quality.chronic_conditions_prevalence

This query returns the prevalence rates of common chronic conditions in your population, along with benchmark comparisons. Significant deviations from expected prevalence rates may indicate issues with diagnosis coding or population differences that need to be understood.

select *
from data_quality.chronic_conditions_none

The above query returns the percent of patients who have no chronic conditions, along with benchmark comparisons from the Medicare data. A significantly higher or lower number of members without chronic conditions may indicate issues with diagnosis coding or population differences that need to be understood. For example, commercial populations may have a higher percentage of members without chronic conditions due to a generally younger and healthier group of enrollees.

Encounter Types and Service Categories

Encounter types and service categories provide insight into the distribution of healthcare services across different types of care. This section examines the utilization rates, costs, and trends associated with various encounter types and service categories. Understanding these patterns helps identify potential areas of concern or opportunity in healthcare delivery.

select *
from data_quality.encounters_missing_groups

This query returns information about encounter groups that are missing from the dataset. It is a high-level check to ensure that all encounter groups are present in the data. If any encounter groups are missing, it may indicate data mapping issues that need to be addressed.

select *
from data_quality.encounters_cost_and_utilization_trend
order by year_month

The above query provides trend data for encounter costs per unit and utilization (as PKPYs) over time. This information is valuable for identifying patterns in healthcare service delivery and costs, which can inform resource allocation and policy decisions. Additionally, we can ensure consistent trends across time periods and quickly identify potential data quality issues if data is suddenly missing. For example, if we ran this query:

select *
from data_quality.encounters_cost_and_utilization_trend
where encounter_type = 'acute inpatient'
order by year_month

With these results:

Acute Inpatient Encounter Trend

We can see that we have paid amounts for acute inpatient encounters in January and February (shown as very high paid-per values for those months) but low volume (PKPY). This is a clear indication of a data quality issue, as the values differ significantly from those in the rest of the year.
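An anomaly like this can also be caught programmatically. The sketch below flags months whose paid-per value is far from the median for the period; the function name, the 3x threshold, and the sample values are all illustrative assumptions:

```python
from statistics import median

def flag_outlier_months(paid_per_by_month: dict, ratio: float = 3.0) -> list:
    """Flag months whose paid-per value is more than `ratio` times the
    period median (or less than the median divided by `ratio`)."""
    typical = median(paid_per_by_month.values())
    return [
        month
        for month, value in paid_per_by_month.items()
        if value > ratio * typical or value < typical / ratio
    ]

# Hypothetical paid-per-encounter values by month, mimicking the
# January/February spike described above.
monthly = {
    "2023-01": 95_000, "2023-02": 88_000, "2023-03": 21_000,
    "2023-04": 20_500, "2023-05": 22_000, "2023-06": 19_800,
}
print(flag_outlier_months(monthly))  # ['2023-01', '2023-02']
```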

select *
from data_quality.encounters_cost_and_utilization

This query returns paid per encounter and PKPY data for different encounter types. Analyzing this information helps in understanding the distribution of healthcare resources across various service categories and can highlight areas of high cost or utilization that may require further investigation.

If we ran the following query to specifically examine acute inpatient PKPY and cost per encounter relative to the Medicare data:

select *
from data_quality.encounters_cost_and_utilization
where analytics_measure = 'acute inpatient'

And got the following results:

Acute Inpatient Encounter Cost and Utilization

It would raise some questions about the underlying dataset and potentially identify a data quality issue. For example, in the image above the PKPY values are very similar (263.32 vs. 258.68), yet the paid amount per encounter in our dataset is less than half the Medicare reference ($7,998 vs. $16,134).

To determine whether this difference is real or the result of a data quality issue, it is important to take into account the other information we know about the dataset. Is it Medicare data? Commercial data? What is the age distribution of the population? What are their risk scores? What types of hospital procedures are being performed? The answers to these questions help us better understand the context of the data and potentially identify a data quality issue.

select *
from data_quality.service_category_pmpm_trend

The query above shows the trend of per member per month (PMPM) costs for different service categories over time. This data is crucial for tracking changes in healthcare spending patterns and identifying potential cost drivers or areas for cost containment. Comparing this to industry benchmarks or historical data can give insights into overall cost trends and the efficiency of healthcare delivery in your population.

Financial PMPM

Per Member Per Month (PMPM) costs are a key metric in healthcare analytics. As the previous sections examined PMPM at the service category and encounter levels, this section evaluates overall PMPM costs. This can be useful in identifying any high level issues with claim type mapping, or missing claims in general.

select *
from data_quality.data_quality_financial_pmpm
order by year_nbr

Typical PMPM (per member per month) values for various populations are outlined in the table below:

| Payer Type | PMPM Range |
| --- | --- |
| Commercial | $300-$500 |
| Medicare/Medicare Advantage | $800-$1200 |
| Medicaid | $200-$400 |

There are always exceptions to these rules, specifically if a population is higher risk than average for the payer type.
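The ranges above can be expressed as a simple lookup for quick sanity checks. This is only a sketch; the function name and payer-type keys are assumptions:

```python
# Typical overall PMPM ranges by payer type, per the table above.
TYPICAL_PMPM_RANGE = {
    "commercial": (300, 500),
    "medicare": (800, 1200),  # includes Medicare Advantage
    "medicaid": (200, 400),
}

def pmpm_in_typical_range(payer_type: str, pmpm: float) -> bool:
    """Return True if the observed PMPM falls inside the typical range."""
    low, high = TYPICAL_PMPM_RANGE[payer_type]
    return low <= pmpm <= high

print(pmpm_in_typical_range("commercial", 450))  # True: inside $300-$500
print(pmpm_in_typical_range("medicare", 400))    # False: well below $800-$1200
```

A False result is not proof of a problem, only a prompt to investigate, since higher-risk populations can legitimately fall outside these ranges.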

A typical split based on claim type is listed below:

| Claim Type | Percent of Total Spend |
| --- | --- |
| Institutional | 50% - 80% |
| Professional | 20% - 40% |
| Pharmacy (if applicable) | 10% - 30% |

The split of claims spend by claim type can be viewed with the following query:

select *
from data_quality.data_quality_claim_percent
order by percent_of_total_paid

If spend falls outside of these ranges, it could indicate a mapping or data quality issue.
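As a quick sanity check on the split above, the shares can also be validated outside the warehouse with a sketch like the following (the function name, keys, and dollar amounts are illustrative):

```python
# Typical share of total spend by claim type, per the table above.
TYPICAL_SPEND_SHARE = {
    "institutional": (0.50, 0.80),
    "professional": (0.20, 0.40),
    "pharmacy": (0.10, 0.30),
}

def flag_unusual_shares(paid_by_claim_type: dict) -> list:
    """Return (claim_type, share) pairs whose share of total paid
    falls outside the typical range."""
    total = sum(paid_by_claim_type.values())
    flags = []
    for claim_type, paid in paid_by_claim_type.items():
        low, high = TYPICAL_SPEND_SHARE[claim_type]
        share = paid / total
        if not (low <= share <= high):
            flags.append((claim_type, round(share, 2)))
    return flags

# Hypothetical spend where professional claims dominate -- possibly a sign
# that institutional claims were dropped or mismapped.
print(flag_unusual_shares(
    {"institutional": 200_000, "professional": 700_000, "pharmacy": 100_000}
))  # [('institutional', 0.2), ('professional', 0.7)]
```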

Atomic Checks

The following atomic checks are a good place to look for data quality issues if the analytics checks above surface any potential issues:

  • place_of_service_code
  • bill_type_code
  • revenue_center_code
  • ms_drg_code

ED Visits

ED visit analysis is one of the most common types of analysis. The ED visit PKPY rate is covered under the encounter and service category sections. Here we specifically look at whether the distribution of ED visits by preventability classification is reasonable and whether each of the important categories is populated.

In this table we check that each avoidable or non-avoidable category of emergency department use is populated. If one or more categories are missing from the data, a 1 will appear in the result_count column. This likely indicates an issue with diagnosis code mapping, since the classification algorithm is diagnosis code based.

select *
from data_quality.data_quality_ed_classification

The reference table below provides the corresponding classification breakdown from the Medicare FFS claims data, which can be used to judge whether the distribution of ED visits by classification in your data is reasonable.

select *
from data_quality.data_quality_ed_classification_reference

Atomic Checks

The following atomic checks are a good place to look for data quality issues if the analytics checks above surface any potential issues:

  • icd_diagnosis_code
  • bill_type_code

Acute Inpatient Visits

Acute Inpatient Visits represent a significant portion of healthcare costs and serve as an important indicator of population health. This section analyzes various aspects of inpatient care, including visit rates, length of stay (ALOS), costs, mortality rates, and case mix complexity. These metrics provide insights into the intensity and quality of inpatient care delivery.

  • ALOS (Average Length of Stay)
  • Average Mortality Rate
  • Top DRGs (Diagnosis-Related Groups)

The following table checks whether any acute inpatient encounters lack associated professional claims. Professional claims are matched with institutional claims as part of the encounter-building process. If no professional claims can be matched (based on claim dates and patient ID) to institutional claims, these cases will be flagged here.

select *
from data_quality.data_quality_acute_inpatient

This table measures the average length of stay and mortality rate of inpatient encounters, comparing them to the Medicare FFS (Fee-for-Service) claims data. If the average length of stay differs from the reference by more than ±2 days, it warrants further investigation to understand the cause. The mortality rate may vary depending on the population type.

select *
from data_quality.data_quality_acute_inpatient_reference

This table measures the prevalence of the top 10 DRGs in the data and compares their prevalence to the top 10 DRGs in Medicare data. If the most common DRGs from Medicare are not found in the data source, it could indicate a data issue. Alternatively, it could reflect differences in the population compared to the Medicare population.

select *
from data_quality.data_quality_acute_inpatient_prevalence

Atomic Checks

The following atomic checks are a good place to look for data quality issues if the analytics checks above surface any potential issues:

  • icd_diagnosis_code
  • ms_drg_code
  • bill_type_code
  • claim_line_start_date
  • claim_line_end_date
  • claim_start_date
  • claim_end_date
  • admit_date
  • discharge_date

Readmissions

Readmission rates are a key quality indicator in healthcare. This section focuses on 30-day readmission rates, both overall and for specific diagnoses. Understanding readmission patterns can help identify opportunities for improving care transitions and reducing unnecessary hospitalizations.

  • 30-day readmission rate

The following table checks for various reasons why inpatient encounters were not eligible to be considered as index admissions (and were therefore excluded from the readmissions calculation). Some degree of encounters with data quality issues is unavoidable in claims data, but if the number of excluded encounters exceeds a few percentage points of the total, it is worth investigating further.

select *
from data_quality.data_quality_readmissions

This table compares the readmission rate in the data source to the readmission rate from the Medicare FFS (Fee-for-Service) claims data. Normal variation in readmission rates is between 5% and 20%. Any rates outside of this range are unusually high or low and should be researched further.

select *
from data_quality.data_quality_readmissions_reference
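The 5%-20% range can be applied directly to counts of index admissions and readmissions. The sketch below is illustrative, and the function name is an assumption:

```python
def readmission_rate_flag(index_admissions: int, readmissions: int) -> tuple:
    """Return the 30-day readmission rate and whether it falls outside
    the normal 5%-20% range noted above."""
    rate = readmissions / index_admissions
    unusual = not (0.05 <= rate <= 0.20)
    return round(rate, 3), unusual

print(readmission_rate_flag(1000, 152))  # (0.152, False): within normal range
print(readmission_rate_flag(1000, 412))  # (0.412, True): warrants investigation
```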

Atomic Checks

The following atomic checks are a good place to look for data quality issues if the analytics checks above surface any potential issues:

  • icd_diagnosis_code
  • ms_drg_code
  • bill_type_code
  • admit_date
  • discharge_date

CMS-HCCs

The CMS Hierarchical Condition Category (HCC) risk adjustment model is widely used to predict healthcare costs. This section analyzes risk scores, comorbidity rates, and trends in risk scores over time. These analyses provide insight into the health status of the population and its predicted resource utilization.

This table checks whether both demographic factors and disease factors are present in the CMS-HCC mart. Issues with enrollment or diagnosis codes may cause one or both factors to be missing. A result count of 1 indicates that the factor was missing from the data.

select *
from data_quality.data_quality_cms_hcc

This table compares the average CMS HCC score in the data source to the average score from Medicare FFS (Fee-for-Service) claims data. The expected score for Medicare populations is typically around 1. If there is significant deviation from 1 (< 0.7 or > 1.5), it could indicate a problem with the data. However, if the population is known to be particularly healthy or very sick, this variance might be expected.

For commercial and Medicaid populations, the risk score cannot be directly compared, as the CMS-HCC risk adjustment algorithm was specifically trained on Medicare populations and their unique conditions.

select *
from data_quality.data_quality_cms_hcc_reference

Atomic Checks

The following atomic checks are a good place to look for data quality issues if the analytics checks above surface any potential issues:

  • icd_diagnosis_code
  • enrollment_start_date
  • enrollment_end_date
  • sex
  • birth_date

Quality Measures

Quality measures are essential for assessing the effectiveness and efficiency of healthcare delivery. This section examines various quality measure rates, their trends over time, and gaps in care. These analyses provide insights into the quality of care being delivered and highlight areas for potential improvement.

select *
from data_quality.data_quality_quality_measures

In this query, we check whether the numerator or denominator for any quality measure is 0. While it is possible for a population to have a 0 in either category, it is unlikely. Therefore, we flag the number of quality measures where this occurs and recommend investigating the possible reasons to determine if such results are plausible.

select *
from data_quality.data_quality_quality_measures_reference

In this query, we compare the percentage of each measure (numerator/denominator) against the Medicare FFS (Fee-for-Service) claims data set. This comparison helps assess the reasonableness of the values for each measure. Some measures may naturally have very low percentages, while others may have very high percentages. If your data shows results that are drastically different, it could indicate a potential data quality issue.

Atomic Checks

The following atomic checks are a good place to look for data quality issues if the analytics checks above surface any potential issues:

  • icd_diagnosis_code
  • enrollment_start_date
  • enrollment_end_date
  • sex
  • birth_date
  • hcpcs_code
  • icd_procedure_code