
7. Data Assets

Tuva stores its shared data assets in public DoltHub repositories under the tuva-health organization. These repositories are the source of truth for the data assets that ship with the project and are loaded into supported workflows as needed.

DoltHub is the source of truth, but when you run Tuva, the Tuva dbt package loads these assets by pulling them from AWS S3, Google Cloud Storage, or Azure Blob Storage, depending on the data warehouse you are using.
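The warehouse-dependent choice of storage service can be pictured as a simple lookup. The following is a minimal Python sketch, not the Tuva dbt package's actual logic: the function name `storage_for`, the `Storage` enum, and in particular the warehouse-to-storage pairings are illustrative assumptions, since this page does not state which warehouse reads from which service.

```python
from enum import Enum


class Storage(Enum):
    """The three storage services named in the text."""
    S3 = "AWS S3"
    GCS = "Google Cloud Storage"
    AZURE = "Azure Blob Storage"


# ASSUMPTION: these pairings are placeholders for illustration only.
# The documentation does not say which warehouse uses which service.
EXAMPLE_WAREHOUSE_STORAGE = {
    "snowflake": Storage.S3,
    "bigquery": Storage.GCS,
    "fabric": Storage.AZURE,
}


def storage_for(warehouse, mapping=EXAMPLE_WAREHOUSE_STORAGE):
    """Return the storage service configured for a warehouse name.

    Hypothetical helper: normalizes the name, then looks it up in the
    supplied mapping and fails loudly if the warehouse is unknown.
    """
    key = warehouse.strip().lower()
    if key not in mapping:
        raise ValueError(f"no storage configured for warehouse {warehouse!r}")
    return mapping[key]
```

Passing the mapping as a parameter keeps the sketch honest about what is unknown: callers would supply the real pairings for their environment rather than rely on the placeholder table.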

Data Asset Types

  • Terminology: Healthcare code systems and their associated descriptions, including assets such as ICD-10, SNOMED, HCPCS, NDC, and related dictionaries used throughout the Tuva project.

  • Reference Data: Shared supporting reference datasets used across Tuva, including reusable calendar tables, geographic crosswalks, code type references, and similar lookup assets.

  • Concept Library: Tuva-defined healthcare concepts and the related value sets that back them, including concepts for conditions, diagnostics, and other internally maintained clinical definitions.

  • Value Sets: Third-party-defined code groupings and value sets used by Tuva data marts such as CMS HCCs, CCSR, readmissions, quality measures, and other external methodologies.

  • Provider Data: Provider-focused datasets derived from sources such as NPPES and NUCC taxonomy files, cleaned and reshaped into reusable provider assets for analytics and attribution workflows.

  • Synthetic Data: Public synthetic datasets maintained by the Tuva project for testing, demos, and integration workflows. Tables with the _small suffix are reduced copies used for CI and faster validation workflows.

These six repositories separate Tuva-defined concepts from third-party definitions while keeping each asset class versioned, queryable, and easy to maintain in one place.
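The `_small` suffix convention described under Synthetic Data can be sketched as a small naming helper. This is an illustration under stated assumptions: the helper name `synthetic_table_name` and the table name `medical_claim` are hypothetical, and the sketch assumes every full-size synthetic table has a `_small` counterpart, which this page does not guarantee.

```python
SMALL_SUFFIX = "_small"  # suffix named in the Synthetic Data description


def synthetic_table_name(base_table, use_small=False):
    """Return the table name to read, optionally the reduced copy.

    Hypothetical helper: strips an existing suffix first so the result
    never ends in "_small_small".
    """
    if base_table.endswith(SMALL_SUFFIX):
        base_table = base_table[: -len(SMALL_SUFFIX)]
    return base_table + SMALL_SUFFIX if use_small else base_table


# "medical_claim" is a placeholder table name, not taken from this page.
ci_table = synthetic_table_name("medical_claim", use_small=True)
full_table = synthetic_table_name("medical_claim")
```

A CI job would request the reduced copy for faster validation, while demos and integration runs would use the full table.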