4. Data Quality
Data Quality is a critical tool in the Tuva Project. It's used to check whether you've mapped source data to Tuva correctly, identify data quality problems in source data, and show how source data quality problems impact Data Marts and as a result your ability to do analytics with the data.
Data Quality in Tuva includes 3 main components:
- Data Quality Tests: 600+ tests built into the Tuva data pipelines that identify atomic-level data quality problems
- Data Quality Tables: Data tables where the data quality test results are aggregated which can be used for downstream analysis
- Data Quality Dashboard: A visualization for easily understanding the data quality problems in your source data
Repository for Dashboard: https://github.com/tuva-health/tuva_dqi
Below we explain how to run Data Quality.
What Does It Do?
The Tuva Data Quality Dashboard takes the output from the data quality tests run by the Tuva dbt package and presents it in an easy-to-understand web application. Its main goals are to help you:
- Monitor Overall Health: Get a quick sense of your data's quality with an A-F grading system.
- Assess Usability: Understand if specific data marts (like
Service Categories
orCMS HCCs
) are reliable enough to use based on test results. - Investigate Issues: Drill down into specific failing tests to understand what went wrong and where.
How It Works: The Big Picture
It's important to understand that the dashboard itself doesn't run the data quality tests. Instead, it visualizes the results generated by the Tuva dbt package.
Here's the typical workflow:
- Run Tests: You use the Tuva Health dbt package (version 0.14.3 or later) within your dbt project to execute data quality tests against your data warehouse.
- Generate Outputs: The dbt package creates specific tables containing the test results (
data_quality__testing_summary
) and data for visualizations (data_quality__exploratory_charts
). - Export Data: You export these two tables from your data warehouse into CSV files (with headers).
- Import into Dashboard: You upload these CSV files into the running Data Quality Dashboard application.
- Visualize & Analyze: The dashboard processes the CSVs and displays the interactive charts, grades, and test details.
Key Features at a Glance
- Data Quality Grade: An overall A-F grade summarizing data health.
- Data Mart Status: Clear indicators of usability for each data mart.
- Detailed Test Results: See which tests passed or failed, including severity.
- Quality Dimension Analysis: Break down results by completeness, validity, etc.
- Interactive Visualizations: Charts exploring data patterns and quality metrics over time or by category.
- Report Generation: Create shareable PDF report cards.
- Severity Levels: Tests are categorized by severity to help prioritize fixes:
- Level 1: Critical (prevents dbt build)
- Level 2: Major (affects data reliability)
- Level 3: Moderate (use data with caution)
- Level 4: Minor (limited impact)
- Level 5: Informational (dbt informational tests)
Getting Started: Running the Dashboard
Here’s how you can get the dashboard running locally. For detailed instructions, always refer to the project's README.
Prerequisites:
- Python (version 3.11 or higher recommended)
- Git
Steps:
- Clone the Repository:
git clone https://github.com/tuva-health/tuva_dqi.git
cd tuva_dqi - Set up a Virtual Environment: (Recommended to avoid dependency conflicts)
# Navigate into the cloned directory (if you aren't already)
cd tuva_dqi
# Create the environment
python -m venv .venv # Use python3 if 'python' doesn't work
# Activate the environment
# On macOS/Linux:
source .venv/bin/activate
# On Windows:
# .venv\Scripts\activate - Install Dependencies:
python -m pip install -r requirements.txt
- Run the Application:
python app.py
- Access: Open your web browser and navigate to
http://localhost:8080
.
(Note: Docker instructions are also available in the repository's README if you prefer containerization.)
Generating the Input Data (Crucial Step!)
The dashboard needs specific data generated by the Tuva dbt package.
Prerequisites:
- A working dbt project.
- Your healthcare data loaded into your data warehouse.
Steps:
-
Add the Tuva Package: Ensure your project's
packages.yml
includes the Tuva package, version 0.14.3 or later:packages:
- package: tuva-health/the_tuva_project
version: [">=0.14.3"] # Or specify a higher version -
Install/Update the Package:
dbt clean
dbt deps -
Run Tuva Models and Tests: Execute the necessary dbt commands to build Tuva models and run the data quality tests. A typical sequence is:
dbt seed --full-refresh # Load necessary seed data
dbt run # Build Tuva models
dbt test # Run data quality tests -
Export the Results: After the
dbt test
command finishes successfully, two key tables will exist in your data warehouse (likely in your dbt data quality schema with Tuva prefix if specified):data_quality__testing_summary
data_quality__exploratory_charts
You need to export the data from these two tables into CSV files, including headers. The method for exporting depends on your specific data warehouse (e.g., using SQL queries, warehouse UI export features, or other tools).
-
Upload to Dashboard: Once you have the
testing_summary.csv
andexploratory_charts.csv
files, use the "Import Test Results" feature within the running dashboard application to upload them.
Running tests just within dbt
DQI has introduced variables into the Tuva dbt project. You can enable or target specific dqi tests while staying within dbt. For example
# Run all the tests associated with the HCC mart. Warnings change into errors
dbt test -s tag:dqi_cms_hccs --warn-error
# Test sev 1 and sev 2 issues only
dbt test -s tag:tuva_dqi_sev_1 -s tag:tuva_dqi_sev_2
# Run all tests that are specific to tuva dqi
dbt test -s tag:dqi
List of mart DQI test tags
dqi_service_categories
dqi_ccsr
dqi_cms_chronic_conditions
dqi_tuva_chronic_conditions
dqi_cms_hccs
dqi_ed_classification
dqi_financial_pmpm
dqi_quality_measures
dqi_readmission
How to disable DQI
Set the variable in your CLI or dbt_project.yml
file enable_input_layer_testing
to false
and this will disable the
functionality.
Development Status
Please note that the Tuva Data Quality Dashboard is currently in alpha/early development. It's designed to work with outputs from the Tuva dbt package version 0.14.3 or later.
Legacy Data Quality
Atomic Data Quality
Prior to developing the data quality framework outlined above we used a different approach. This approach (documented below) is not inherently built into the data pipelines, making it difficult to catch issues at run-time. It also takes a very long time to run and there is no corresponding dashboard to visualize results. We will be phasing out this legacy approach in coming releases but will keep it documented here for transparency.
Atomic-level data quality focuses on identifying problems inherent in your raw data, irrespective of specific analytics use cases. For example, invalid ICD-10-CM diagnosis codes are an atomic-level data quality problem. Obviously invalid codes like this will also impact analytics use cases, but at the atomic-level we focus on identifying and summarizing these problems across the dataset without linking them to specific analytics use cases.
There are two types of atomic-level data quality problems:
- Mapping Problems
- Inherent Problems
Mapping problems are data quality problems that are inadvertently introduced when data is transformed from the source to the Tuva Data Model. These problems can be corrected by fixing the mapping.
Inherent problems are data quality problems that are artifacts of the raw data that were not introduced during mapping and cannot be corrected by mapping. They can only be corrected by acquiring better data from the source.
For each claims data table in the Tuva Input Layer, we analyze several domains of data quality problems. We have organized these domains in order of importance below. To view a union of all the individual checks (ignoring date trend and other checks that are more complex) query the table below:
select *
from data_quality.data_quality_test_summary
order by result_count desc
Medical Claim
Primary Key
The most basic check we perform on each data table is a primary key check. This ensures the data in the table is at the proper grain. Modern data warehouses (e.g. Snowflake) don't always enforce primary keys. If the primary key isn't set properly this could cause downstream queries to unintentionally explode in size and number of records.
The primary key on the medical_claim
table is comprised of 3 fields: claim_id
, claim_line_number
, and data_source
. The primary_key_check
table verifies whether there are any duplicates across the primary key fields for each data table. If any result in the duplicate_records column is greater than zero, then that table's primary key is not set properly and this needs to be corrected in the mapping.
select *
from data_quality.primary_key_check
In the example above the pharmacy_claim table has 100 distinct claims that have multiple records with the same values for the primary key fields. If any result in this table is non-zero you need to correct the mapping to fix it.
Person ID
The patient is at the center of the vast majority of the analyses we're interested in. Therefore it's important that we check a few things related to the person_id
field on every medical claim, specifically:
- Does every line on each claim have a value for
person_id
populated? - Is there more than one value for
person_id
within a single claim (there shouldn't be)? - Does the
person_id
value indicated on the claim have corresponding valid eligibility during the time period when claim was rendered?
The medical_claim_person_id
table verifies whether any of these data quality problems occur in the medical_claim
table. You can query it as follows:
select *
from data_quality.medical_claim_person_id
This query returns the number of unique claim IDs that have each of these data quality problems.
In the example table above we observe the following:
- person_id is populated for every single record in the medical_claim table
- 50 claim IDs have more than 1 person_id. This can occur when two or more distinct lines on the claim have different values for person_id.
- 1,000 claim IDs are considered "orphaned claims". This means that the claim_start_date or claim_end_date occur during a month when the patient does not have insurance eligibility.
If any of these problems occur in your data you should attempt to correct them in mapping. However the specific techniques to do this will vary by dataset and it may not be possible to correct the problems. If the problems can't be correct we still include these records in the dataset, but there will be limitations in terms of how useful they are for analytics.
Date Fields
The majority of analyses are longitudinal in nature, meaning they look at trends or changes over time. To make these sorts of analyses possible we need reliable date fields. In medical claims the following fields are the important date fields:
claim_start_date
claim_end_date
claim_line_start_date
claim_line_end_date
admission_date
discharge_date
paid_date
We need to check these fields for the following problems:
- Every claim line should have a value populated for the following fields:
claim_start_date
claim_end_date
claim_line_start_date
claim_line_end_date
paid_date
-
Every institutional claim should have an
admission_date
anddischarge_date
field populated. -
The following fields should have 1 and only 1 value on the claim when they are populated.
claim_start_date
claim_end_date
admission_date
discharge_date
It's possible and expected for claim line dates to be different for different records on a single claim.
The following table returns the count of distinct claims that violate the rules outlined above.
select *
from data_quality.medical_claim_date_checks
In the example table above we can see that 50 medical claims are missing a claim_end_date and 1,000 claims are missing a claim_line_start_date.
Beyond these basic date checks, we need to check whether the values within our date fields are reasonable over time. For example, if we have a dataset that covers 3 calendar years, it's possible that claim_start_date is perfectly normal for the first two years, but then is completely missing thereafter. The following query analyzes trends in each of the date fields by looking at the count of distinct claims over time for each date field:
select *
from data_quality.claim_date_trends
order by 1
In the example table above we see that the first 3 months of 2021 have claim start and end dates populated, but no admission, discharge, or paid dates. This indicates we are missing date information on these fields during these months.
Diagnosis and Procedure Fields
Most analyses we perform incorporate information about the patient's health conditions and / or procedures performed on the patient. Therefore it's critical that we have solid information on the ICD-10-CM and ICD-10-PCS codes which encode this information.
Specifically we check to ensure the following:
- Every claim line should have a value populated for diagnosis_code_1
- Every value in the diagnosis_code_1 field should be a valid ICD-10-CM code
- Every claim should have 1 and only 1 value for diagnosis_code_1
- For any secondary diagnosis codes that are populated (e.g. diagnosis_code_2, diagnosis_code_3, etc.) these values should be valid ICD-10-CM codes
- We also want to know the percent of claims that have 1 or more secondary diagnosis codes
- For any procedure codes (e.g. procedure_code_1, procedure_code_2, etc.) that are populated these values should be valid ICD-10-PCS codes
The query below will display the results from each of these checks:
select *
from data_quality.dx_and_px
In the example table above, we see that 50 claims have 1 or more invalid primary diagnosis codes and that 1,000 claims have more than 1 primary diagnosis code, making it impossible to use the primary diagnosis code from these claims in analysis.
Institutional Header Fields
There are several key fields on institutional claims that we use to identify the type of service being delivered and to group claims into encounters. These fields include the following:
- bill_type_code
- discharge_disposition_code
- ms_drg_code
- apr_drg_code
For each of these fields we need to check institutional claims to ensure the following is true:
- Every claim line has a value populated
- The value populated is valid
- There is 1 and only 1 value for each claim across all claim lines
The following code returns the results of these checks:
select *
from data_quality.medical_claim_inst_header_fields
order by result_count desc
In the example output above we see that 50 claims have an invalid bill_type_code, meaning it does not match a code in the terminology table, and that 1,000 claims have multiple values for bill_type_code.
Claim Line Fields
There are a few key fields across both professional and institutional claims that we use to assign service categories and to group claims into encounters. These including the following fields:
- revenue_center_code
- place_of_service_code
- hcpcs_code
We need to check these fields to see if the values of these fields are missing or invalid. The following table returns the results of this.
select *
from data_quality.medical_claim_claim_line_fields
order by result_count desc
In the example table above we see that 1,000 claims are missing revenue_center_code.
Provider NPI Fields
Often we wish to do analysis about the specific providers or facilities delivering care. NPI is critical here because it's how we identify providers, facilities, and health systemcs. Therefore we perform the following checks on NPI:
- Every claim should have 1 and only 1
billing_npi
and it should be valid - Every claim should have 1 or more
rendering_npi
values and they should be valid - Every institutional claim should have 1 and only 1
facility_npi
and it should be valid
select *
from data_quality.medical_claim_provider_npi
order by result_count desc
Trended Claim Volume and Dollars
Often we see anomalies where claim volume or dollars can change unexpectedly over time, possibly indicating a data quality problem. Therefore we need to check whether this is the case.
select *
from data_quality.medical_claim_volume_and_dollars
order by year_month
In the example above we clearly have medical claims in the month of March 2021 but we have no paid amounts.
Data Loss
Often when we map data we perform complex filtering and other types of transformation operations. In these scenarios it can be easy for unintended data loss to occur. Therefore we need to confirm some basic statistics between the raw source data and the Tuva Input Layer to ensure unintended data loss hasn't occurred.
The table identified in the query below will only populate once you've created the data loss table in the Input Layer. To create this table, you manually calculate the exact same metrics as the source data and map these results to the data_loss table in the claims Input Layer.
select *
from data_quality.data_loss
In the example above we can compare several basic "table stakes" statistics for medical and pharmacy claims, as well as eligibility. The highlighted rows all show data loss as we move from raw to input layer to core. Some of this data loss might be expected and explainable. You should work to understand any and all data loss that is not currently explainable before moving on.
Pharmacy Claim
Primary Key
The most basic check we perform on any table is a primary key check. This ensures the data in this table is at the proper grain. Modern data warehouses (e.g. Snowflake) don't always enforce primary keys. If the primary key isn't set properly this could cause downstream queries to explode in size and number of records.
The primary key on the pharmacy_claim
table is comprised of 3 fields: claim_id
, claim_line_number
, and data_source
. Query the following table to check whether your data has any duplicate records across these fields. The duplicate_records
column in this table should equal 0 for every table. If it's not you have a primary key problem to fix.
select *
from data_quality.primary_keys
Person ID
Similar to the medical_claim table, the patient is at the center of the vast majority of the analyses we're interested in. Therefore it's important that we check a few things related to the person_id
field on every pharmacy claim, specifically:
- Does every line on each claim have a value for
person_id
populated? - Is there more than one value for
person_id
within a single claim? - Does the
person_id
value indicated on the claim have corresponding valid eligibility during the time period when claim was rendered?
The pharmacy_claim_person_id
table verifies whether any of these data quality problems occur in the pharmacy_claim
table. You can query it as follows:
select *
from data_quality.pharmacy_claim_person_id
This query returns a table with one row per check and the count of unique claim IDs that have that particular data quality issue
In the example table above we observe the following:
- person_id is populated for every single record in the pharmacy_claim table
- 50 claim IDs have more than 1 person_id. This can occur when two or more distinct lines on the claim have different values for person_id.
- 1,000 claim IDs are considered "orphaned claims". This means that the paid_date or the dispensing_date occurs during a month when the patient does not have insurance eligibility.
If any of these problems occur in your data you should attempt to correct them in mapping. However the specific techniques to do this will vary by dataset and it may not be possible to correct the problems. If the problems can't be correct we still include these records in the dataset, but there will be limitations in terms of how useful they are for analytics.
Date Fields
The majority of analyses are longitudinal in nature, meaning they look at trends or changes over time. To make these sorts of analyses possible we need reliable date fields. In pharmacy claims the following fields are the important date fields:
paid_date
dispensing_date
We need to check these fields to ensure they are populated for every claim and claim line:
paid_date
dispensing_date
The following table returns the count of distinct claims that violate the rules outlined above.
select *
from data_quality.pharmacy_claim_date_checks
In the example table above we can see that 1000 pharmacy claims are missing a dispensing_date while paid_date is always populated.
Beyond these basic date checks, we need to check whether the values within our date fields are reasonable over time. For example, if we have a dataset that covers 3 calendar years, it's possible that paid_date is perfectly normal for the first two years, but then is completely missing thereafter. The following query analyzes trends in each of the date fields by looking at the count of distinct claims over time for each date field:
select *
from data_quality.claim_date_trends
order by 1
In the example table above we see that the first 3 months of 2021 have some small amount of claims with paid_dates. It is worth further investigation to see if we are missing dates for those claims, or if that is just ramp up in the pharmacy claims data we have received.
NDC
Every pharmacy claim should have an NDC (National Drug Code) on it. We check to see whether every claim and claim line has an NDC field that is populated, and whether that NDC is a valid NDC (meaning it is part of the published NDC list by the FDA). These checks are found here:
select *
from data_quality.pharmacy_claim_ndc
order by 1
Some invalid NDCs are expected, as there are often NDCs in pharmacy data for items like 'glucose test strips' that are not found on the FDA list.
Prescription Details
A pharmacy claim will have prescription details in these important fields:
quantity
days supply
refills
We check each of these fields to ensure they are are populated.
select *
from data_quality.pharmacy_claim_prescription_details
order by 1
Quantity is the most important of the fields, as it is used in the pharmacy mart to calculate the generic available opportunity.
NPI Fields
Pharmacy claims contain two important NPI fields:
prescribing_npi
dispensing_npi
Of the two fields, prescribing npi is more important as it enables analysis around who is prescribing certain drugs and what type of practioners they are.
We check these fields to ensure they are populated and valid NPIs. We also check to ensure the prescribing_npi is the expected entity type (an individual instead of an organization).
select *
from data_quality.pharmacy_claim_npi
Trended Claim Volume and Dollars
Often we see anomalies where claim volume or dollars can change unexpectedly over time, possibly indicating a data quality problem. Therefore we need to check whether this is the case.
select *
from data_quality.pharmacy_claim_volume_and_dollars
In the example above we clearly have pharmacy claims in the month of March 2021 but we have no paid amounts.
Data Loss
Often when we map data we perform complex filtering and other types of transformation operations. In these scenarios it can be easy for unintended data loss to occur. Therefore we need to confirm some basic statistics between the raw source data and the Tuva Input Layer to ensure unintended data loss hasn't occurred.
The table identified in the query below will only populate once you've created the data loss table in the Input Layer. To create this table, you manually calculate the exact same metrics as the source data and map these results to the data_loss table in the claims Input Layer.
select *
from data_quality.data_loss
In the example above we can compare several basic "table stakes" statistics for medical and pharmacy claims, as well as eligibility. The highlighted rows all show data loss as we move from raw to input layer to core. Some of this data loss might be expected and explainable. You should work to understand any and all data loss that is not currently explainable before moving on.
Eligibility
Primary Key
The most basic check we perform on any table is a primary key check. This ensures the data in this table is at the proper grain.
Modern data warehouses (e.g. Snowflake) don't always enforce primary keys. If the primary key isn't set properly this could cause
downstream queries to explode in size and number of records.
The primary key on the eligibility
table is comprised of 4 fields: memner_id
, payer
, plan
, enrollment_start_date
, and enrollment_end_date
.
Query the following table to check whether your data has any duplicate records across these fields. The duplicate_records
column in this table should equal 0 for every table. If it's not you have a primary key problem to fix.
select *
from data_quality.primary_key_check
In the example above the pharmacy_claim table has 100 distinct claims that have multiple records with the same values for the primary key fields. If any result in this table is non-zero you need to correct the mapping to fix it.
Person ID
Does every row of eligibility have a person_id
populated?
The eligibility_missing_person_id
table verifies whether any rows do not have a person_id
. You can query it as follows:
select *
from data_quality.eligibility_missing_person_id
This query returns the number of rows in the eligibility table that do not have a person_id
. If this number is greater than 0,
you need to correct the mapping to fix it.
In the example table above we observe that all rows in the source data have a person_id
.
Date Fields
In eligibility, the following fields are the important date fields:
enrollment_start_date
enrollment_end_date
We need to check these fields for the following problems:
- Every row in the eligibility input layer should have valid dates in these fields.
enrollment_start_date
should be less than or equal toenrollment_end_date
.- The latest
enrollment_end_date
value should be the last day of the current month when the tuva project was last run. - Dates should be relatively recent e.g., no dates before 1900.
The following table returns the count of eligibility rows that violate the rules outlined above.
select *
from data_quality.eligibility_date_checks
Patient Demographics
Birth date and gender information about a member is crucial for some downstream analytics. The eligibility_demographics
table provides a count of eligibility spans that are missing this data.
select *
from data_quality.eligibility_demographics
Payer Info
The eligibility_missing_payer
table provides a count of eligibility spans that are missing the following:
- payer type (e.g. medicare, medicaid, commercial)
- payer name (e.g. CMS)
- invalid payer
select *
from data_quality.eligibility_missing_payer
Trended Enrollment Volume
The eligibility_trend
table provides a count of eligibility spans that contain enrollment in the corresponding year and month.
Trending this count can surface abnormal drops and/or spikes in enrollment.
select *
from data_quality.eligibility_trend
order by year_month
Data Loss
Often when we map data we perform complex filtering and other types of transformation operations. In these scenarios it can be easy for unintended data loss to occur. Therefore we need to confirm some basic statistics between the raw source data and the Tuva Input Layer to ensure unintended data loss hasn't occurred.
The table identified in the query below will only populate once you've created the data loss table in the Input Layer.
To create this table, you manually calculate the exact same metrics as the source data and map these results to the
data_loss table in the claims Input Layer.
select *
from data_quality.data_loss
In the example above we can compare several basic "table stakes" statistics for medical and pharmacy claims, as well as eligibility. The highlighted rows all show data loss as we move from raw to input layer to core. Some of this data loss might be expected and explainable. You should work to understand any and all data loss that is not currently explainable before moving on.
Analytics Data Quality
Analytics-level data quality focuses on understanding the ability of your data to support specific analyses and whether the results generated from those analyses are reasonable. This level of data quality assessment goes beyond the basic integrity checks performed at the atomic level and examines the data from an analytical perspective.
There are two main aspects of analytics-level data quality:
- Assessing which analyses your data can support
- Determining if the results your data generates for those analyses are reasonable
The analytics-level checks described below are intended to be used as a checklist. Each check exists as a data table in the Tuva Data Model, and the documentation below describes why the check is important and how to use these tables to perform the check.
Since many of the analytics-level checks involve assessing reasonableness, it's important to have something to compare to. For many of the checks we include reference data points based on Medicare FFS claims data. While not perfect for comparing to commercial and Medicaid populations, it still adds directional value (e.g. is a reasonable PMPM closer to $400 or $4,000).
We have organized most of the analytics data quality checks into two tables for ease of use.
The analytics reference summary table unions together the various analytics measures and their associated values in the Medicare FFS claims data set.
select *
from data_quality.analytics_reference_summary
order by analytics_concept
,analytics_measure
The analytics checks summary table unions together the different data quality checks and the count of results for each of those checks. Some of the checks are seen as "never events," meaning any result greater than zero is a problem worth looking into. Others are checks where results aren't a problem. The two different types of checks are separated with the "normally zero" in the data.
select *
from data_quality.analytics_checks_summary
where normally_zero = 1
order by result_count
Core Populated
The most basic check we perform at the analytics level is to confirm whether or not the Core Data Model is populated. Sometimes The Tuva Project builds all data tables in the data model, but because of data quality problems, a set of tables will end up with zero records (i.e., is not populated). This check gives us a quick glance at whether or not this is happening in the Core Data Model.
select *
from data_quality.core_populated
This query returns information about which tables in the Core Data Model are populated and which are not. If any tables are not populated, it may indicate upstream data quality issues that need to be addressed.
In the above image, core.pharmacy_claim
has returned 0 rows, indicating that the table is not populated.
Analytics Populated
On top of the Core Data Model are the Data Marts. We use these Data Marts to perform analytics. However, sometimes the tables are not populated because of data quality issues. This table lets us see if this is happening at a glance.
select *
from data_quality.analytics_populated
This query provides information about which Data Mart tables are populated. If any tables are not populated, it may indicate issues with the data transformation process or upstream data quality problems.
In the above image, ed_visits
has returned 0 rows, indicating there is an issue with one of the fields used to create ED encounters or service categories.
Chronic Conditions
Chronic diseases follow a very predictable distribution in patient populations. Here we check the prevalence of the top 10 chronic conditions in Medicare FFS data and compare it to the prevalence in your data.
select *
from data_quality.chronic_conditions_prevalence
This query returns the prevalence rates of common chronic conditions in your population, along with benchmark comparisons. Significant deviations from expected prevalence rates may indicate issues with diagnosis coding or population differences that need to be understood.
select *
from data_quality.chronic_conditions_none
The above query returns the percent of patients who have no chronic conditions, along with benchmark comparisons from the Medicare data. A significantly higher or lower number of members without chronic conditions may indicate issues with diagnosis coding or population differences that need to be understood. For example, commercial populations may have a higher percentage of members without chronic conditions due a generally younger and healthier group of enrollees.
Encounter Types and Service Categories
Encounter types and service categories provide insight into the distribution of healthcare services across different types of care. This section examines the utilization rates, costs, and trends associated with various encounter types and service categories. Understanding these patterns helps identify potential areas of concern or opportunity in healthcare delivery.
select *
from data_quality.encounters_missing_groups
This query returns information about encounter groups that are missing in the dataset. This is a high level check to ensure that all encounter groups are present in the data. If any encounter groups are missing, it may indicate issues with data mapping that needs to be addressed.
select *
from data_quality.encounters_cost_and_utilization_trend
order by year_month
The above query provides trend data for encounter costs per unit and utilization (as PKPYs) over time. This information is valuable for identifying patterns in healthcare service delivery and costs, which can inform resource allocation and policy decisions. Additionally, we can ensure consistent trends across time periods and quickly identify potential data quality issues if data is suddenly missing. For example, if we ran this query:
select *
from data_quality.encounters_cost_and_utilization_trend
where encounter_type = 'acute inpatient'
order by year_month
With these results:
We can see that we have paid amounts for acute inpatient encounters in January and February (shown as very high paid per values for those months) but low volume (PKPY). This is a clear indication of a data quality issue as the values are significantly different for the rest of the year.
select *
from data_quality.encounters_cost_and_utilization
This query returns paid per encounter and PKPY data for different encounter types. Analyzing this information helps in understanding the distribution of healthcare resources across various service categories and can highlight areas of high cost or utilization that may require further investigation.
If we ran the following query to specifically examine acute inpatient PKPY and cost pers relative to the Medicare data:
select *
from data_quality.encounters_cost_and_utilization
where analytics_measure = 'acute inpatient'
And got the following results:
It would raise some questions about the underlying dataset and potentially identify a data quality issue. For example, in the image above we see that the PKPY values are very similar (263.32 vs 258.68). However, the paid per encounter is less than half in our dataset ($7,998) vs ($16,134).
In order to better determine if this is true or the result of a data quality issue, it is important to take into account the other information we know about the dataset. Is it Medicare data? Commercial data? What is the age distribution of the population? What are their risk scores? What are the types of hospital procedures that are being performed? The answers to these questions can help us better understand the context of the data and potentially identify a data quality issue.
select *
from data_quality.service_category_pmpm_trend
The query above shows the trend of per member per month (PMPM) costs for different service categories over time. This data is crucial for tracking changes in healthcare spending patterns and identifying potential cost drivers or areas for cost containment. Comparing this to industry benchmarks or historical data can give insights into overall cost trends and the efficiency of healthcare delivery in your population.
Financial PMPM
Per Member Per Month (PMPM) costs are a key metric in healthcare analytics. As the previous sections examined PMPM at the service category and encounter levels, this section evaluates overall PMPM costs. This can be useful in identifying any high level issues with claim type mapping, or missing claims in general.
select *
from data_quality.data_quality_financial_pmpm
order by year_nbr
A typical PMPM (Per Member Per Month) for various populations are outlined in the table below:
Payer Type | PMPM Range |
---|---|
Commercial | $300-$500 |
Medicare/Medicare Advantage | $800-$1200 |
Medicaid | $200-$400 |
There are always exceptions to these rules, specifically if a population is higher risk than average for the payer type.
A typical split based on claim type is listed below:
Claim Type | Percent of Total Spend |
---|---|
Institutional | 50% - 80% |
Professional | 20% - 40% |
Pharmacy (if applicable) | 10% - 30% |
The claims data split by claim type can be viewed at the table here:
select *
from data_quality.data_quality_claim_percent
order by percent_of_total_paid
If spend falls outside of these ranges, it could indicate a mapping or data quality issue.
Atomic Checks The following atomic checks would be a good place to check for data quality issues if the analytics check surface any potential issues
- place of service code
- bill_type_code
- revenue center code
- ms_drg_code
ED Visits
ED visit analysis is one of the most common types of analysis. The ED visit PKPY rate is covered under the encounter and service category sections. Here we specifically look at whether the distribution of ED Visits by preventable classification is reasonable and if each of the important categories are populated
In this table we check to make sure each of the avoidable or non avoidable categories of emergency department use are populated. If one or more of the sections is missing in the data, a 1 would appear in the result_count column. This would likely indicate an issue with diagnosis code mapping as the classification algorithm is diagnosis code based.
select *
from data_quality.data_quality_ed_classification
In the table below, we verify that each category of avoidable or non-avoidable emergency department use is populated. If one or more sections are missing from the data, a 1 will appear in the result_count column. This likely indicates an issue with diagnosis code mapping, as the classification algorithm is based on diagnosis codes.
select *
from data_quality.data_quality_ed_classification_reference
Atomic Checks The following atomic checks would be a good place to check for data quality issues if the analytics check surface any potential issues
- icd_diagnosis_code
- bill_type_code
Acute Inpatient Visits
Acute Inpatient Visits represent a significant portion of healthcare costs and serve as an important indicator of population health. This section analyzes various aspects of inpatient care, including visit rates, length of stay (ALOS), costs, mortality rates, and case mix complexity. These metrics provide insights into the intensity and quality of inpatient care delivery.
- ALOS (Average Length of Stay)
- Average Mortality Rate
- Top DRGs (Diagnosis-Related Groups)
The following table checks whether any acute inpatient encounters lack associated professional claims. Professional claims are matched with institutional claims as part of the encounter-building process. If no professional claims can be matched (based on claim dates and patient ID) to institutional claims, these cases will be flagged here.
select *
from data_quality.data_quality_acute_inpatient
This table measures the average length of stay and mortality rate of inpatient encounters, comparing them to the Medicare FFS (Fee-for-Service) claims data. If there is a variance of ±2 days in the length of stay, it warrants further investigation to understand the cause. The mortality rate may vary depending on the population type.
select *
from data_quality.data_quality_acute_inpatient_reference
This table measures the prevalence of the top 10 DRGs in the data and compares their prevalence to the top 10 DRGs in Medicare data. If the most common DRGs from Medicare are not found in the data source, it could indicate a data issue. Alternatively, it could reflect differences in the population compared to the Medicare population.
select *
from data_quality.data_quality_acute_inpatient_prevalence
Atomic Checks The following atomic checks would be a good place to check for data quality issues if the analytics check surface any potential issues
- icd_diagnosis_code
- ms_drg_code
- bill_type_code
- claim_line_start_date
- claim_line_end_date
- claim_start_date
- claim_end_date
- admit_date
- discharge_date
Readmissions
Readmission rates are a key quality indicator in healthcare. This section focuses on 30-day readmission rates, both overall and for specific diagnoses. Understanding readmission patterns can help identify opportunities for improving care transitions and reducing unnecessary hospitalizations.
- 30-day readmission rate
The following table checks for various reasons why inpatient encounters were not eligible to be considered as index admissions (and were therefore excluded from the readmissions calculation). Some degree of encounters with data quality issues is unavoidable in claims data, but if the number of excluded encounters exceeds a few percentage points of the total, it is worth investigating further.
select *
from data_quality.data_quality_readmissions
This table compares the readmission rate in the data source to the readmission rate from the Medicare FFS (Fee-for-Service) claims data. Normal variation in readmission rates is between 5% and 20%. Any rates outside of this range are unusually high or low and should be researched further.
select *
from data_quality.data_quality_readmissions_reference
Atomic Checks The following atomic checks would be a good place to check for data quality issues if the analytics check surface any potential issues
- icd_diagnosis_code
- ms_drg_code
- bill_type_code
- admit_date
- discharge_date
CMS-HCCs
The CMS Hierarchical Condition Category (HCC) risk adjustment model is widely used to predict healthcare costs. This section analyzes risk scores, comorbidity rates, and trends in risk scores over time. These analyses provide insight into the health status of the population and its predicted resource utilization.
This table checks whether both demographic factors and disease factors are present in the CMS-HCC mart. Issues with enrollment or diagnosis codes may cause one or both factors to be missing. A result count of 1 indicates that the factor was missing from the data.
select *
from data_quality.data_quality_cms_hcc
This table compares the average CMS HCC score in the data source to the average score from Medicare FFS (Fee-for-Service) claims data. The expected score for Medicare populations is typically around 1. If there is significant deviation from 1 (< 0.7 or > 1.5)
, it could indicate a problem with the data. However, if the population is known to be particularly healthy or very sick, this variance might be expected.
For commercial and Medicaid populations, the risk score cannot be directly compared, as the CMS-HCC risk adjustment algorithm was specifically trained on Medicare populations and their unique conditions.
select *
from data_quality.data_quality_cms_hcc_reference
Atomic Checks The following atomic checks would be a good place to check for data quality issues if the analytics check surface any potential issues:
- icd_diagnosis_code
- enrollment_start_date
- enrollment_end_Date
- sex
- birth_date
Quality Measures
Quality measures are essential for assessing the effectiveness and efficiency of healthcare delivery. This section examines various quality measure rates, their trends over time, and gaps in care. These analyses provide insights into the quality of care being delivered and highlight areas for potential improvement.
select *
from data_quality.data_quality_quality_measures
In this query, we check whether the numerator or denominator for any quality measure is 0. While it is possible for a population to have a 0 in either category, it is unlikely. Therefore, we flag the number of quality measures where this occurs and recommend investigating the possible reasons to determine if such results are plausible.
select *
from data_quality.data_quality_quality_measures_reference
In this query, we compare the percentage of each measure (numerator/denominator) against the Medicare FFS (Fee-for-Service) claims data set. This comparison helps assess the reasonableness of the values for each measure. Some measures may naturally have very low percentages, while others may have very high percentages. If your data shows results that are drastically different, it could indicate a potential data quality issue.
Atomic Checks The following atomic checks would be a good place to check for data quality issues if the analytics check surface any potential issues:
- icd_diagnosis_code
- enrollment_start_date
- enrollment_end_Date
- sex
- birth_date
- hcpcs_code
- icd_procedure_code