Merged
39 changes: 39 additions & 0 deletions .zenodo.json
@@ -0,0 +1,39 @@
{
  "title": "HeartBioPortal DataHub: HBP 3.0 NAR release",
  "creators": [
    {
      "name": "HeartBioPortal contributors"
    },
    {
      "name": "TBD: confirm author list before release"
    }
  ],
  "description": "Data integration, provenance, artifact publishing, and serving-datamart tooling for the HeartBioPortal 3.0 NAR Database Issue manuscript release.",
  "access_right": "open",
  "upload_type": "software",
  "license": "cc-by-4.0",
  "keywords": [
    "HeartBioPortal",
    "cardiovascular genomics",
    "data integration",
    "provenance",
    "NAR Database Issue"
  ],
  "related_identifiers": [
    {
      "identifier": "https://github.com/HeartBioPortal/DataHub",
      "relation": "isSupplementTo",
      "scheme": "url"
    },
    {
      "identifier": "https://heartbioportal.org/",
      "relation": "isSupplementTo",
      "scheme": "url"
    },
    {
      "identifier": "https://github.com/HeartBioPortal",
      "relation": "isPartOf",
      "scheme": "url"
    }
  ]
}
11 changes: 11 additions & 0 deletions ARTIFACT_MANIFEST.tsv
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
artifact_name	hbp_layer	artifact_path_or_pattern	artifact_type	key_fields	record_count	gene_count	variant_count	source_count	build_version	schema_version	compressed	redistributable	license_notes	description
population_frequency_datamart	Population-frequency context	datamart/population_frequency*; datamart/association_serving_slim.duckdb	tables/duckdb	rsid; allele; population_label; population_group; sample_size	TBD; verify from production QA	TBD	TBD; expected about 18.1M rsIDs if production QA confirms	37 if production QA confirms	v3.0.0-nar	TBD	mixed	requires review	requires source-license review	Approximate 594.3M source-specific observations was not locally verified; confirm before release.
association_artifacts	Association and phenotype evidence	analyzed_data_unified/association/final/**; association/final/**	json/json.gz/duckdb	gene; variant_id; phenotype_path; p_value; ancestry; consequence; clinical_significance	TBD; verify from production QA	TBD	TBD	TBD	v3.0.0-nar	TBD	mixed	requires review	requires source-license review	Gene-level association artifacts, variant-index payloads, phenotype rollups, and serving summaries.
structural_variant_gene_payloads	Structural-variant evidence	analyzed_data/dbvar/dbvar_structural_variants_nstd229.json.zip	json zip	gene; sv_id; sv_type; coordinates; clinical_significance	3072942	75192	3040582	1	v3.0.0-nar	structural_variant_legacy	yes	requires review	requires TOPMed/dbVar source-term review	Local dbVar nstd229 report verifies records, variants, and genes.
exon_enriched_sv_artifacts	Structural-variant evidence	analyzed_data/dbvar/dbvar_structural_variants_nstd229.exons.json.zip	json zip	gene; sv_id; transcript_overlap; exon_overlap	TBD; verify from production QA	TBD	TBD	1	v3.0.0-nar	TBD	yes	requires review	requires TOPMed/dbVar source-term review	Exon-enriched structural-variant payloads derived from dbVar nstd229 and annotation overlap.
protein_context_artifacts	Protein context	secondary_analyses/final/protein_context/*.json.gz	json.gz	gene; transcript_id; translation_id; protein_accession; feature_id	TBD; verify from production QA	5304 local untracked payload files observed; production expected value requires QA	TBD	Ensembl/UniProt/EBI/InterPro sources	v3.0.0-nar	TBD	yes	requires review	requires source attribution and terms review	Production target mentions about 66.9k isoforms and more than 3.2M feature annotations; not locally verified.
gene_profile_artifacts	Gene profiles	secondary_analyses/final/gene_profile/*.json.gz; generated gene-profile paths	json/json.gz	gene; source_id; xref	TBD; verify from production QA	TBD	TBD	HGNC/NCBI/UniProt/GOA/Reactome/etc.	v3.0.0-nar	TBD	mixed	requires review	requires source-license review	Gene summary, nomenclature, ontology, pathway, and cross-reference payloads.
expression_payloads	Expression layer	secondary_analyses/final/expression/*.json.gz; configured expression outputs	json/json.gz	gene; tissue; cell_type; expression_value	TBD; verify from production QA	TBD	TBD	TBD	v3.0.0-nar	TBD	mixed	requires review	requires source-license review	Expression payloads imported or transformed for HBP gene dossiers.
shared_architecture_summaries	Shared-architecture layer	secondary_analyses/final/sga/*.json.gz; configured SGA outputs	json/json.gz	gene; cvd_phenotype; trait_phenotype; shared_variant_count	TBD; verify from production QA	TBD	TBD	inherits association source count	v3.0.0-nar	TBD	mixed	requires review	inherits association source restrictions	Derived overlap summaries between CVD and trait association variants.
drug_discovery_payloads	Drug-discovery / Drugs & Compounds layer	drug-discovery output paths; per-gene drug payloads	json/json.gz	gene; molecule_id; molecule_name; target_id; source	TBD; reported 17128 records requires production QA	TBD; reported 1839 gene files requires production QA	TBD	Open Targets; DrugBank	v3.0.0-nar	TBD	mixed	requires review	DrugBank raw data license-restricted; Open Targets terms require confirmation	Reported 17,128 records, 1,839 gene files, and 1,454 unique molecules were not locally verified.
guideline_derived_payloads	Clinical guidelines / guideline graph links	HCG/HCG-KG derived import paths; hcgkg_llm artifacts	json/graph/vector	gene; guideline_id; recommendation_id; evidence_class; evidence_level; snippet_id	TBD; verify from HCG/HCG-KG release QA	TBD	TBD	AHA/ACC guideline sources; future ESC	v3.0.0-nar	TBD	mixed	requires review	guideline source licenses require review	Gene-first guideline context consumed by HBP search_summary and guideline_detail when present.
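
Many count columns in this manifest still hold free-text placeholders rather than integers. A minimal sketch of a pre-release check that lists them follows. It assumes the file is tab-delimited with the header shown above. The function name `unverified_counts` and the two-row `SAMPLE` excerpt (a subset of the real columns) are illustrative and not part of DataHub.

```python
import csv
import io

# Illustrative excerpt: a subset of the manifest's columns, two rows taken from it.
SAMPLE = (
    "artifact_name\trecord_count\tgene_count\tvariant_count\n"
    "structural_variant_gene_payloads\t3072942\t75192\t3040582\n"
    "expression_payloads\tTBD; verify from production QA\tTBD\tTBD\n"
)

COUNT_COLUMNS = ("record_count", "gene_count", "variant_count")


def unverified_counts(tsv_text: str) -> dict[str, list[str]]:
    """Map each artifact name to the count columns that are not plain integers."""
    pending: dict[str, list[str]] = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        open_columns = [c for c in COUNT_COLUMNS if not row[c].strip().isdigit()]
        if open_columns:
            pending[row["artifact_name"]] = open_columns
    return pending


if __name__ == "__main__":
    for artifact, columns in unverified_counts(SAMPLE).items():
        print(f"{artifact}: {', '.join(columns)} still need QA")
```

Run against the full manifest, a check like this would report every row except `structural_variant_gene_payloads`, the only artifact whose three counts are plain integers.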
31 changes: 31 additions & 0 deletions BUILD_METADATA.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
{
  "project": "HeartBioPortal",
  "component": "DataHub",
  "release": "3.0.0-nar",
  "created_at": "2026-05-18T00:00:00-04:00",
  "git_commit": "784decace6458cd0791e8149e09eb76552a31bee",
  "datahub_schema_version": "TBD; confirm before release",
  "genome_assembly": "GRCh38 if applicable; confirm per artifact",
  "major_layers": [
    "association and phenotype evidence",
    "population-frequency context",
    "variant annotation",
    "structural-variant evidence",
    "protein context",
    "gene profiles",
    "clinical guideline graph links",
    "drug-discovery / drugs and compounds",
    "expression",
    "shared genetic architecture"
  ],
  "source_manifest": "DATA_SOURCES.tsv",
  "artifact_manifest": "ARTIFACT_MANIFEST.tsv",
  "restricted_data_policy": "Do not redistribute controlled individual-level human data, credentials, restricted raw source data, raw DrugBank full database files, or license-uncertain third-party source files.",
  "controlled_individual_level_data": "not redistributed",
  "notes": [
    "Update git_commit after final release commit is selected.",
    "Counts marked TBD in ARTIFACT_MANIFEST.tsv require production QA confirmation.",
    "Local dbVar nstd229 report verifies 3072942 records, 3040582 variants, and 75192 gene-level payloads.",
    "DrugBank v5.1.12 raw data are license-restricted and should not be included unless redistribution permission is confirmed."
  ]
}
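
Several fields in this file are explicit placeholders, and the notes ask for `git_commit` to be updated before release. The sketch below shows one way such fields could be flagged automatically. It is an illustration, not an existing DataHub check. The function name `release_blockers` and the rule that any top-level string containing "TBD" or "confirm" is a placeholder are assumptions.

```python
import json
import re

# Excerpt of BUILD_METADATA.json, with values copied from the file above.
SAMPLE = json.loads("""
{
  "release": "3.0.0-nar",
  "git_commit": "784decace6458cd0791e8149e09eb76552a31bee",
  "datahub_schema_version": "TBD; confirm before release",
  "genome_assembly": "GRCh38 if applicable; confirm per artifact"
}
""")

FULL_SHA1 = re.compile(r"[0-9a-f]{40}")


def release_blockers(metadata: dict) -> list[str]:
    """Return human-readable problems found in top-level string fields."""
    problems = []
    if not FULL_SHA1.fullmatch(str(metadata.get("git_commit", ""))):
        problems.append("git_commit is not a 40-character hex SHA")
    for key, value in metadata.items():
        # Assumed heuristic: placeholder values mention "TBD" or "confirm".
        if isinstance(value, str) and ("TBD" in value or "confirm" in value.lower()):
            problems.append(f"{key} still carries a placeholder: {value!r}")
    return problems


if __name__ == "__main__":
    for problem in release_blockers(SAMPLE):
        print(problem)
```

The check confirms only that `git_commit` is well-formed. Whether it points at the final release commit still has to be confirmed by hand.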
18 changes: 18 additions & 0 deletions CHECKSUMS.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
ecff620da4e234325c5616ee5e8efd732e074748b23705c99a5cf6c06ac24451 .zenodo.json
f2df420efc5b1962e21fc9f431714605873150300bdbf03b5f59ab300cff57b7 ARTIFACT_MANIFEST.tsv
76544cb7d297551bfadcb8816d433bb8b8855ac917a2f144bd1e5a3c480a9da7 BUILD_METADATA.json
a4435542f0fb7ca1b8dbd1c14861a2c4628b637483a31c72bb52de6bd9647bc8 CITATION.cff
7509ecb44ad4137b71ec083a454f50cb4987bed4f04f3d20c3af2abb8139f83b DATA_SOURCES.md
04982f407a637edfc0736a1b74fc59bf991430872f858d9a6ef9e3ad62c76a9d DATA_SOURCES.tsv
fbf48fe05e2e55283171b4c7a90b6388cb3e82d44eb7eb5c70c75e0a2973714a LICENSE
4f23d36d7930cf3c6702de7135ecd7ef781235809b50c608f56c654c0ec8aba5 LICENSES.md
8f2e1fa3f4d582335b1486d465da986836f26d45ea334333692e11bf1f8279ea MANIFEST.md
5b14a73d64211e75c7ea7cf988741a96727babc44e094041d261d91893dcfcd5 PROVENANCE_SCHEMA.md
1dcdf166eb05e0a3c1214fd06904628ea4cb9b673304b929db1bf6103d5518c0 README.md
ba24cd85ffe6811b29928c46268ee28106b340c828e1453530b87255224c8e86 RELEASE_NOTES.md
96962af6c01ef76ffcc75697b328d0513e1ef75cc97a97b0ab5c4585e72248e6 docs/schemas/drug_discovery.md
2984ea49a550a3d3b91e51a0346a5316ba1e3520eb884a55ed877925a09317c1 docs/schemas/gene_profile.md
96b9ce0a18ab071c26c1270727b13a4cf45d8d4833842ea322c60e40cbf0b694 docs/schemas/guideline_signal.md
1e90d2145fbc4049d2f8354f9c30eae04ccd2e41ab923348541ec1506aa032cf docs/schemas/population_frequency.md
febfe8f9b31139c063b8ac4a5953a1e646da363359697e707fc3c7122d9feba7 docs/schemas/protein_context.md
7fb87293e8b1446941a9f5dcf3bc9d42c7d6fed242751fd7a86dd771efea7c73 docs/schemas/structural_variant.md
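
The 64-character hex digests above are consistent with SHA-256, and the layout matches the output of `sha256sum`. Assuming that is how the file was produced, a minimal sketch for re-verifying it with the Python standard library follows. The names `sha256_of` and `verify_checksums` are illustrative.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large artifacts are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(checksum_file: Path, root: Path) -> dict[str, bool]:
    """Return {relative_path: matches} for every '<digest> <path>' line."""
    results = {}
    for line in checksum_file.read_text().splitlines():
        if not line.strip():
            continue
        expected, relative_path = line.split(maxsplit=1)
        # sha256sum marks binary-mode entries with a leading '*'.
        name = relative_path.strip().lstrip("*")
        target = root / name
        results[name] = target.is_file() and sha256_of(target) == expected
    return results


if __name__ == "__main__":
    # Self-contained demo in a temporary directory with made-up file content.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "README.md").write_bytes(b"demo")
        digest = hashlib.sha256(b"demo").hexdigest()
        (root / "CHECKSUMS.txt").write_text(f"{digest}  README.md\n")
        print(verify_checksums(root / "CHECKSUMS.txt", root))
```

Any later edit to a listed file, including the TBD fields still to be filled in, invalidates its digest, so the checksum file would need to be regenerated as the last step before archiving.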
18 changes: 18 additions & 0 deletions CITATION.cff
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
cff-version: 1.2.0
message: "If you use this software or data-processing workflow, please cite the HeartBioPortal 3.0 manuscript and this archived release."
title: "HeartBioPortal DataHub"
version: "3.0.0-nar"
date-released: "2026-05-18"
license: "CC-BY-4.0"
repository-code: "https://github.com/HeartBioPortal/DataHub"
url: "https://heartbioportal.org/"
authors:
- name: "HeartBioPortal contributors"
- name: "TBD: confirm author list before release"
abstract: "Data integration, provenance, artifact publishing, and serving-datamart tooling for the HeartBioPortal 3.0 NAR Database Issue manuscript release."
keywords:
- HeartBioPortal
- cardiovascular genomics
- data integration
- provenance
- NAR Database Issue
39 changes: 39 additions & 0 deletions DATA_SOURCES.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
# Data Sources Summary

`DATA_SOURCES.tsv` is the machine-readable source inventory for the HBP 3.0 NAR release. This companion file summarizes the source families by HBP layer.

## Association and phenotype evidence

DataHub normalizes association rows from HBP legacy CVD/trait layers, Million Veteran Program summary-statistics inputs when available, GWAS Catalog when included, and other association profiles into canonical gene, variant, phenotype, p-value, ancestry, consequence, clinical-significance, and provenance fields. The final artifacts are association JSON/JSON.GZ payloads, variant-index payloads, phenotype rollups, and serving DuckDB tables. Source provenance should preserve input file, source dataset, source version, phenotype path, rsID/variant ID, p-value, genome build, and transformation notes. Controlled or non-public inputs must not be redistributed.

## Population-frequency context

Population-frequency context is expected to combine source-specific allele-frequency observations from resources such as ALFA, gnomAD v4, 1000 Genomes, 1000 Genomes 30X, TOPMed-derived public frequency resources, PAGE, HGDP-CEPH, HapMap, ExAC, SGDP, and 38KJPN where present in the production build. DataHub harmonizes rsID, allele, population label, population group, sample size, study/resource, genome build, and source provenance into population-frequency datamarts. Production totals must be verified from QA before release; the approximate 594.3 million frequency observations across 18.1 million rsIDs and 37 resources are not hard-coded in `ARTIFACT_MANIFEST.tsv` because they were not locally verifiable.
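
The three totals quoted above are separate tallies over the same harmonized rows: observations, distinct rsIDs, and distinct resources. The sketch below illustrates the difference. The field names follow the `key_fields` column of `ARTIFACT_MANIFEST.tsv`. The class name, the `resource` field, and every sample value are made up for illustration and do not come from the production build.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrequencyObservation:
    """One source-specific allele-frequency row after harmonization (assumed shape)."""
    rsid: str
    allele: str
    population_label: str
    population_group: str
    sample_size: int
    resource: str


def summarize(observations: list[FrequencyObservation]) -> dict[str, int]:
    """Count rows, distinct rsIDs, and distinct resources."""
    return {
        "observations": len(observations),
        "rsids": len({o.rsid for o in observations}),
        "resources": len({o.resource for o in observations}),
    }


# Made-up rows: the same rsID can appear once per population and per resource.
SAMPLE = [
    FrequencyObservation("rsEXAMPLE1", "A", "EUR", "European", 1000, "resource_a"),
    FrequencyObservation("rsEXAMPLE1", "A", "AFR", "African", 800, "resource_b"),
    FrequencyObservation("rsEXAMPLE2", "T", "EUR", "European", 1000, "resource_a"),
]

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Because one rsID contributes a row per population and per resource, the observation count is expected to be much larger than the rsID count.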

## Variant annotation

Variant annotation uses dbSNP, ClinVar, Ensembl Variation, ClinGen where included, and source-specific legacy fields. DataHub preserves rsID/variant IDs as variant-level keys for chart aggregation and records clinical significance, variation type, most-severe-consequence-like fields, source record IDs, source dataset, genome build, and transformation steps. Source licenses follow the original providers.

## Structural-variant evidence

Structural-variant evidence is currently represented by dbVar nstd102/ClinVar structural-variant seed payloads and dbVar nstd229/TOPMed structural-variant call-set artifacts. DataHub normalizes source DB, study/submission, SV ID, SV type, coordinates, event length, clinical significance when present, gene overlap, transcript overlap, and exon overlap. The local nstd229 report verifies 3,072,942 records, 3,040,582 variants, and 75,192 gene-level payloads. TOPMed-related licensing and redistribution constraints require final review before public archival of source or derived bulk artifacts.

## Protein context

Protein context connects variant associations to protein architecture through Ensembl gene/transcript/translation IDs, canonical and protein-coding isoforms, exon-to-protein coordinate mapping, RefSeq/UniProt cross-references, Ensembl features, EBI Proteins features, and InterPro domains/families/motifs/regions. DataHub harmonizes source-specific API records into gene-level protein-context payloads. Production totals around 66.9k isoforms and more than 3.2 million protein feature annotations require production QA confirmation before release.

## Gene profiles

Gene profiles integrate nomenclature, gene summaries, protein cross-references, ontology/pathway membership, and curated source metadata from sources such as HGNC, NCBI Gene, UniProtKB, GOA/Gene Ontology, Reactome, Human Protein Atlas, and ClinGen where included. DataHub should preserve source IDs, source versions, access dates, cross-reference IDs, and source-specific licensing notes.

## Clinical guidelines / guideline graph links

Clinical guideline artifacts are generated primarily by HCG and HCG-KG. DataHub consumes release JSON, graph exports, or vector/serving artifacts when present and links genes to guideline snippets, recommendations, evidence classes, evidence levels, conditions, biomarkers, drugs/interventions, and source documents. Guideline snippets are context for interpretation and are not automated medical advice.

## Drug-discovery / Drugs & Compounds layer

The drugs and compounds layer uses Open Targets Platform GraphQL API v4 and licensed DrugBank v5.1.12 inputs where available. DataHub should preserve the GraphQL query, variables, access date, source field names, molecule source, molecule ID, target ID, source action type, indication, trial phase/status, source version, and source license. The raw DrugBank full database is license-restricted and must not be committed or archived unless redistribution permission is confirmed. The reported drug-layer total of 17,128 gene-drug records across 1,839 gene files and 1,454 unique molecule names requires production QA confirmation.
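
One possible shape for the query-level provenance described above is sketched here. It is not DataHub's actual schema: the class `QueryProvenance`, its field names, the placeholder query text, and the variable name `target` are all assumptions, and no real Open Targets query is reproduced.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class QueryProvenance:
    """Assumed provenance record for one API request."""
    source: str
    source_version: str
    source_license: str
    access_date: str  # ISO 8601 date on which the request was made
    query: str
    variables: dict = field(default_factory=dict)

    def fingerprint(self) -> str:
        """SHA-256 over the query text and variables, independent of access date."""
        payload = json.dumps(
            {"query": self.query, "variables": self.variables}, sort_keys=True
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    record = QueryProvenance(
        source="Open Targets Platform GraphQL API",
        source_version="TBD",
        source_license="TBD; terms require confirmation",
        access_date="2026-05-18",
        query="<GraphQL query text exactly as sent>",  # placeholder, not a real query
        variables={"target": "EXAMPLE_GENE"},  # hypothetical variable name
    )
    print(json.dumps({**asdict(record), "fingerprint": record.fingerprint()}, indent=2))
```

Storing a fingerprint alongside the access date makes it possible to tell whether two payloads came from the same request made on different days.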

## Expression and shared-architecture layers

Expression payloads are imported from source-specific expression resources and existing HBP payloads where present. Shared genetic architecture is derived from association artifacts by comparing gene-level CVD and trait variant overlap. Both layers inherit redistribution constraints from their source data and must preserve source dataset, source version, input file, transformation, and HBP build version.
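
The overlap comparison behind the shared-architecture layer can be pictured as a set intersection for each gene. The sketch below assumes each gene has a mapping from phenotype to a set of variant IDs. The output keys follow the `key_fields` listed for `shared_architecture_summaries`. The function name and all sample values are hypothetical, and the production derivation may apply filters not shown here.

```python
def shared_variant_summary(
    gene: str,
    cvd: dict[str, set[str]],
    trait: dict[str, set[str]],
) -> list[dict]:
    """List CVD/trait phenotype pairs for one gene that share at least one variant."""
    rows = []
    for cvd_phenotype, cvd_variants in sorted(cvd.items()):
        for trait_phenotype, trait_variants in sorted(trait.items()):
            shared = cvd_variants & trait_variants  # variants present in both sets
            if shared:
                rows.append(
                    {
                        "gene": gene,
                        "cvd_phenotype": cvd_phenotype,
                        "trait_phenotype": trait_phenotype,
                        "shared_variant_count": len(shared),
                    }
                )
    return rows


if __name__ == "__main__":
    # Made-up gene, phenotypes, and variant IDs for illustration only.
    cvd = {"cvd_a": {"rs1", "rs2", "rs3"}, "cvd_b": {"rs9"}}
    trait = {"trait_x": {"rs2", "rs3", "rs4"}, "trait_y": {"rs5"}}
    for row in shared_variant_summary("GENE1", cvd, trait):
        print(row)
```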