CMS BCDA

Check out the code on GitHub

Overview

CMS provides access to claims data via a FHIR endpoint called the CMS Beneficiary Claims Data API (BCDA). You can read about it here.

This process allows you to transform data from CMS BCDA into the Tuva Data Model. There are 3 steps:

  1. Flatten FHIR to CSVs
  2. Load CSVs into Data Warehouse
  3. Import and Run dbt Packages

We also created a demo project that leverages the sample CMS BCDA data: the BCDA demo project.

Step 1: Flatten FHIR to CSVs

Create configs for BCDA JSON files

This step is only required if your BCDA files differ in format from the sample BCDA data. Unfortunately there isn't a good way to tell if your data is structured differently until you attempt to run the connector.

  1. Clone the FHIR_inferno repository
  2. Delete the config files in fhir_inferno\configurations\configuration_bcda\config
  3. Copy your JSON into the fhir_inferno\configurations\configuration_bcda\input_fhir folder
  4. Open the script analyzeJson_new.py in fhir_inferno\helper_scripts
  5. In analyzeJson_new.py, update the configurations (a sketch of these settings follows the command below):
    1. folder_path - the path to your BCDA files (e.g. ...\configurations\configuration_bcda\input_fhir)
    2. keyword - the filename or JSON resource type (e.g. coverage)
    3. anchor_path - root path of the JSON (leave blank when executing the script for the first time)
    4. ignore_list - paths in JSON to ignore (leave blank when executing the script for the first time)
    5. inputFormat - defaults to ndJSON; no need to update
  6. Execute the script
py analyzeJson_new.py
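
For reference, these settings are plain variables near the top of the script. A minimal sketch, assuming the variable names listed above (verify them against your copy of analyzeJson_new.py):

    # Sketch of the configuration block in analyzeJson_new.py.
    # Names follow the list above; the path is illustrative.
    folder_path = r"..\configurations\configuration_bcda\input_fhir"  # your BCDA files
    keyword = "coverage"    # filename or JSON resource type to analyze
    anchor_path = ""        # leave blank on the first run
    ignore_list = []        # leave blank on the first run
    inputFormat = "ndjson"  # BCDA files are ndJSON; no need to change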

In the terminal, analyzeJson_new.py will output the different keys in the JSON file. The number following each key is how many arrays that key contains. Since we are flattening the FHIR object, think of the keys as tables and the arrays as columns. Decide which keys should be broken out into their own tables to cut down on the number of columns. The screenshot below depicts the output for the coverage resource; the key extension will be split into its own table with 154 columns. [screenshot: analyzeJson_output]
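
To make the table/column analogy concrete, here is an illustrative Python sketch (not code from the connector) of what splitting a key out buys you: one coverage row fans out into a narrow coverage table plus an extension table keyed back to the parent id, instead of one very wide table.

    # Illustrative only: flattening a nested FHIR resource by hand.
    coverage = {
        "id": "cov-1",
        "status": "active",
        "extension": [
            {"url": "example-url-1", "valueCode": "A"},
            {"url": "example-url-2", "valueCode": "B"},
        ],
    }

    # coverage table: everything except the key we split out
    coverage_row = {k: v for k, v in coverage.items() if k != "extension"}

    # coverage_extension table: one row per array element, carrying a
    # foreign key back to the parent coverage row
    extension_rows = [
        {"coverage_id": coverage["id"], **ext} for ext in coverage["extension"]
    ]

    print(coverage_row)    # {'id': 'cov-1', 'status': 'active'}
    print(extension_rows)  # two narrow rows instead of 154 extra columns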

  7. Once you've decided which keys you want to split out, update the configuration in analyzeJson_new.py and rerun the script for each key.

    1. Update ignore_list with all keys you want as a separate file. [screenshot: cov_config_example]
    2. Rerun analyzeJson_new.py. A file called config_coverage.ini will be generated in the same directory as analyzeJson_new.py. This config file contains the information to parse the coverage resource without the extension key. [screenshot: config_ext_anchor]
    3. Move extension from ignore_list to anchor_path. Rerun analyzeJson_new.py. Another file called config_coverage_extension.ini will be generated in the same directory. This config file contains the information to parse the extension key of the coverage resource. [screenshot: cov_config_ext_example]
  8. Repeat steps 5-7 for each resource in the BCDA data (i.e. patient and explanationofbenefit)

  9. Update each config file with a primary key to facilitate JOINs between tables (a scripted version of this edit is sketched after this list)

    1. In the BCDA data, id is the unique identifier for each line. In the main resource, e.g. coverage, move id to the top of the list of arrays under [Struct]. [screenshot: config_primary_key]

    2. In the sub-resource, e.g. coverage_extension, add coverage_id = id to [Struct]. [screenshot: config_move_pk]

    3. Repeat step 9 for each resource

  10. Copy all config files to fhir_inferno\configurations\configuration_bcda\config
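
If you would rather script the step 9 edit than make it by hand, here is a sketch using Python's standard configparser. It covers only the coverage_id = id addition from step 9.2; the file and section names follow the coverage example above, so verify them against your generated configs.

    # Sketch: add a foreign key to the sub-resource config so the two
    # CSVs can be JOINed after loading. Assumes config_coverage_extension.ini
    # exists with a [Struct] section, as generated in the steps above.
    import configparser

    cfg = configparser.ConfigParser()
    cfg.read("config_coverage_extension.ini")

    # coverage_id = id, per step 9.2 (note: configparser normalizes
    # whitespace when it rewrites the file)
    cfg["Struct"]["coverage_id"] = "id"

    with open("config_coverage_extension.ini", "w") as f:
        cfg.write(f)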

Parse JSON files to create CSVs

Now that you've generated the configs, which are the instructions for how to parse the JSON, it's time to actually run the parser.

  1. Move parseFHIR.py from fhir_inferno to fhir_inferno\configurations\configuration_bcda
  2. Open BCDAConnector.py in fhir_inferno\configurations\configuration_bcda
  3. If you're following these instructions, the configs in BCDAConnector.py should not need to be updated, but they can be if you're using custom folders (a sketch of these settings follows below).
    1. config_dir - the location of the configuration .ini files
    2. input_dir - the location of the JSON files
    3. output_dir - the location the CSVs will be output to
    4. configs - the name of each resource
      1. the name must match the resource in the .ini files
  4. Execute the script
py BCDAConnector.py
  5. Confirm CSVs have been created in the output_dir (e.g. output_csv)
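
For orientation, those settings are plain variables inside BCDAConnector.py. A hypothetical sketch matching the defaults described above (confirm the exact names in your copy of the script):

    # Sketch of the settings in BCDAConnector.py, per the list above.
    config_dir = "config"      # location of the configuration .ini files
    input_dir = "input_fhir"   # location of the JSON files
    output_dir = "output_csv"  # location the CSVs will be written to
    configs = [                # one entry per resource; must match the .ini names
        "coverage",
        "coverage_extension",
        "patient",
        "explanationofbenefit",
    ]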

Step 2: Load CSVs into Data Warehouse

Now you need to load the CSVs into your data warehouse of choice (we'll trust you can take care of that).
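
Any load path works. As one minimal sketch, assuming a SQLAlchemy-compatible warehouse and modest file sizes (the connection string is a placeholder; prefer your warehouse's native bulk loader for real volumes):

    # Minimal sketch: load every CSV in output_csv as its own table.
    import pathlib

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://user:pass@host:5432/tuva")  # placeholder
    for csv_path in pathlib.Path("output_csv").glob("*.csv"):
        df = pd.read_csv(csv_path)
        df.to_sql(csv_path.stem, engine, if_exists="replace", index=False)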

Step 3: Import dbt Package

Now you're going to import the bcda_connector dbt package into your dbt project (create a new dbt project if you don't already have one). This package will convert the CSVs into the Tuva Input Layer.

  1. Add the bcda_connector to your packages.yml

    packages:
      - git: "https://github.com/tuva-health/bcda_connector.git"
        revision: main
  2. Add the following vars to your dbt_project.yml with the applicable values

    1. bcda_coverage_file_prefix - a string of text used to parse the enrollment date
      1. When CMS sends enrollment data, it does not contain an explicit field with the year and month; it is implied that members present in the monthly coverage file are enrolled. So the date the file was sent needs to be stored in the database as a discrete field to be used for analytics. This is done by parsing the filename, which contains the date of the file. FHIR Inferno includes a filename field in every table; below is an example from coverage. The string fhir_input\\coverage_ needs to be removed to parse out the date. In my dbt_project.yml, my var will be bcda_coverage_file_prefix: fhir_input\\\\coverage_ (there are 4 backslashes to escape the 2 backslashes in the filename). [screenshots: filename, dbtprojectyml_example]
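
As an illustration of what this var enables downstream (the filename format here is hypothetical; match the prefix to your own files):

    # Hypothetical illustration of stripping the prefix to recover the
    # enrollment date. In Python source, "\\" is a single literal
    # backslash; the extra escaping in dbt_project.yml serves the same
    # purpose there. Requires Python 3.9+ for str.removeprefix.
    prefix = "fhir_input\\coverage_"                     # fhir_input\coverage_
    filename = "fhir_input\\coverage_2024-01-01.ndjson"  # hypothetical filename
    enrollment_date = filename.removeprefix(prefix).split(".")[0]
    print(enrollment_date)  # -> 2024-01-01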

From here you're ready to import the Tuva package and run dbt build to populate the data model and load all seed files.
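
For example, the typical closing sequence is dbt deps to pull the packages down, then dbt build:

    dbt deps
    dbt build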