Cross-Domain-Interoperability-Framework/validation

Files for validation of CDIF metadata

This repository contains JSON schema, JSON-LD frames, contexts, and SHACL rule sets for validating CDIF metadata documents.

Files

Current (2026 Schema with DDI-CDI/CSVW)

| File | Description |
|------|-------------|
| CDIFDiscoverySchema.json | JSON Schema for framed (tree) CDIF discovery profile metadata, generated by generate_validation_schema.py from CDIFDiscoveryProfile resolvedSchema |
| CDIFCompleteSchema.json | JSON Schema for framed (tree) CDIF complete profile metadata (discovery + data description + archive + provenance), generated by generate_validation_schema.py from CDIFcompleteProfile resolvedSchema |
| CDIFDataDescriptionSchema.json | JSON Schema for framed (tree) CDIF data description profile metadata (discovery + data description), generated by generate_validation_schema.py from CDIFDataDescriptionProfile resolvedSchema |
| generate_validation_schema.py | Generates framed-tree validation schemas from building block profile resolved schemas |
| CDIF-graph-schema-2026.json | JSON Schema for flattened JSON-LD graphs (@graph arrays), generated by generate_graph_schema.py |
| generate_graph_schema.py | Generates the graph schema from building block source schemas |
| ShaclValidation/generate_shacl_shapes.py | Generates composite SHACL shapes from building block rules.shacl files |
| ShaclValidation/generate_shacl_report.py | Generates markdown SHACL validation reports with severity grouping |
| ShaclValidation/CDIF-Discovery-Shapes.ttl | Composite SHACL shapes for CDIFDiscovery profile (generated by ShaclValidation/generate_shacl_shapes.py) |
| ShaclValidation/CDIF-Complete-Shapes.ttl | Composite SHACL shapes for CDIFcomplete profile (generated by generate_shacl_shapes.py --profile complete) |
| CDIF-frame-2026.jsonld | JSON-LD frame for 2026 schema |
| CDIF-context-2026.jsonld | JSON-LD context for authoring without namespace prefixes |
| FrameAndValidate.py | Python script for framing and validation |
| croissant/ConvertToCroissant.py | Converts CDIF JSON-LD to Croissant (mlcommons.org/croissant/1.0) format |
| validate_building_blocks.py | Validates building block schemas, SHACL shapes, and examples across the BB source tree |
| validate-cdif.bat | Windows batch script for oXygen XML Editor integration |
| batch_validate.py | Batch validation of CDIF metadata files across multiple file groups (JSON Schema + SHACL) |
| validate_conformance.py | Validates JSON-LD instances against the CDIF profiles they claim conformance to via schema:subjectOf/dcterms:conformsTo. Maps conformsTo URIs to profile/building-block schemas and reports per-file, per-profile results |
| geocodes_harvester.py | Harvests dataset metadata from the EarthCube GeoCodes SPARQL endpoint, extracts original JSON-LD from landing pages, and optionally converts to CDIF core or discovery profile format |
| DCAT/dcat_to_cdif.py | Converts DCAT JSON-LD catalogs to CDIF schema.org format. Maps DCAT/Dublin Core properties to schema.org equivalents per the CDIF DCAT implementation guide. See DCAT/README.md |

DDI-CDI Resolved Schema

| File | Description |
|------|-------------|
| ddi-cdi/ddi-cdi.schema_normative.json | Full DDI-CDI normative JSON Schema (395 definitions) |
| ddi-cdi/cls-InstanceVariable-resolved.json | Self-contained resolved schema for DDI-CDI InstanceVariable class |
| ddi-cdi/cls-InstanceVariable-resolved-README.md | Documentation for the resolved schema generation process |

Legacy (Pre-2026, in archive/)

| File | Description |
|------|-------------|
| CDIFDiscoverySchema.json | Hand-maintained discovery schema (superseded by generated version) |
| CDIFCompleteSchema.json | Hand-maintained complete schema (superseded by generated version) |
| CDIF-JSONLD-schema-2026.json | Original all-in-one framed tree schema (superseded by CDIFDiscoverySchema + CDIFCompleteSchema) |
| CDIF-JSONLD-schema-schemaprefix.json | JSON Schema for CDIF Discovery profile metadata with schema: prefixes |
| CDIF-frame.jsonld | JSON-LD frame for legacy schema |
| CDIF-context.jsonld | Legacy JSON-LD context |

Quick Start

Prerequisites

pip install PyLD jsonschema

Validate a Document

# Using Python script (default: 2026 schema)
python FrameAndValidate.py my-metadata.jsonld -v

# Using Windows batch script
validate-cdif.bat my-metadata.jsonld

Save Framed Output for Debugging

python FrameAndValidate.py my-metadata.jsonld -o framed.json -v

Batch Validate Multiple Files

batch_validate.py runs both JSON Schema and SHACL validation across multiple file groups:

python batch_validate.py

File groups validated:

  • testJSONMetadata -- 77 ADA metadata test files
  • cdifbook -- 10 cdifbook example documents
  • cdifProfiles -- 5 CDIF profile examples from building blocks
  • adaProfiles -- 36 ADA profile examples from building blocks

Output shows per-file results for each validation type with severity-aware reporting:

  • JSON Schema: PASS or FAIL
  • SHACL: PASS (clean), PASS (N warnings, M info), FAIL (N violations, M warnings), or SKIP (for generated output files like -croissant.json, -rocrate.json)

Group summaries and an overall summary list all violations and schema failures.
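
The per-file SHACL statuses listed above can be derived from per-severity counts. The following is a minimal sketch, assuming the counts have already been extracted from a SHACL report; the function name `shacl_status` and the `GENERATED_SUFFIXES` tuple are illustrative and are not taken from batch_validate.py, which may compute its output differently.

```python
# Hypothetical helper for illustration; not the batch_validate.py implementation.
GENERATED_SUFFIXES = ("-croissant.json", "-rocrate.json")  # files reported as SKIP


def shacl_status(filename, violations=0, warnings=0, infos=0):
    """Return a status string in the style of the batch report."""
    # Generated conversion outputs are skipped rather than validated.
    if filename.endswith(GENERATED_SUFFIXES):
        return "SKIP"
    # Any violation fails the file; warnings are reported alongside.
    if violations:
        return f"FAIL ({violations} violations, {warnings} warnings)"
    # Warnings and info messages alone do not fail the file.
    if warnings or infos:
        return f"PASS ({warnings} warnings, {infos} info)"
    return "PASS (clean)"


print(shacl_status("a.jsonld"))
print(shacl_status("a.jsonld", warnings=3, infos=1))
print(shacl_status("a.jsonld", violations=2, warnings=3))
print(shacl_status("a-croissant.json", violations=9))
```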

Current Validation Status

As of April 2026, validation across testJSONMetadata (77 files) and all 5 CDIF profile examples shows:

  • JSON Schema: 77/77 testJSONMetadata pass against all three schemas (Discovery, DataDescription, Complete)
  • Profile examples: 5/5 pass (Discovery, DiscoveryMinimal, DiscoveryComplete, DataDescription, Complete)
  • SHACL Violations: 0 across all files
  • SHACL Warnings/Info: Every file passes with warnings or info messages only. These flag optional-but-recommended properties (missing activity descriptions, contact points, physical data types, etc.)

SHACL severity levels are aligned with JSON Schema: properties that are optional in the JSON Schema are sh:Warning (not sh:Violation) in SHACL.

Validation Workflow

CDIF metadata is expressed as JSON-LD. Because the same JSON-LD content can be serialized in many equivalent shapes, a document must first be framed so that it has the structure the JSON Schema expects. The framing process:

  1. Reshapes the JSON-LD graph into a tree structure
  2. Ensures properties use the expected prefixes (e.g., schema:name)
  3. Embeds referenced nodes inline
  4. Normalizes arrays and single values

Step 1: Frame the JSON-LD Document

Use a JSON-LD processor to apply CDIF-frame-2026.jsonld to your metadata document.

Step 2: Validate Against Schema

Validate the framed output against the appropriate schema:

  • CDIFDiscoverySchema.json -- discovery profile only
  • CDIFDataDescriptionSchema.json -- discovery + data description
  • CDIFCompleteSchema.json -- discovery + data description + archive + provenance (default)

RO-Crate Conversion and Validation

RO-Crate conversion and validation tools (ConvertToROCrate.py, ValidateROCrate.py) have been moved to the CDIF packaging repository. These tools convert nested/compacted CDIF JSON-LD into RO-Crate 1.1 form via JSON-LD expand + flatten.

See the packaging repository documentation for conversion details, validation checks, and usage.

Croissant Conversion

croissant/ConvertToCroissant.py converts CDIF JSON-LD metadata to Croissant (mlcommons.org/croissant/1.0) JSON-LD, an ML-oriented dataset metadata format developed by MLCommons. Both formats build on schema.org and JSON-LD, so discovery-level metadata maps directly.

# Convert a CDIF document to Croissant
python croissant/ConvertToCroissant.py input.jsonld -o output-croissant.json

# Validate the output (requires: pip install mlcroissant)
mlcroissant validate --jsonld output-croissant.json

See croissant/README.md for detailed documentation on the conversion process, property mappings, example output files, and usage options. The full property-by-property mapping is in croissant/CDIFtoCroissant.md.

Usage Examples

Command Line (Recommended)

The FrameAndValidate.py script handles the complete workflow:

# Validate with 2026 schema (default)
python FrameAndValidate.py my-metadata.jsonld -v

# Save framed output
python FrameAndValidate.py my-metadata.jsonld -o framed.json -v

# Use legacy schema
python FrameAndValidate.py my-metadata.jsonld --frame archive/CDIF-frame.jsonld --schema archive/CDIF-JSONLD-schema-schemaprefix.json -v

Options:

  • -v, --validate - Validate against JSON Schema
  • -o, --output FILE - Save framed output to file
  • --schema FILE - Path to JSON Schema (default: CDIFCompleteSchema.json)
  • --frame FILE - Path to JSON-LD frame (default: CDIF-frame-2026.jsonld)

oXygen XML Editor

The validate-cdif.bat script enables validation from within oXygen XML Editor.

Setup

  1. Go to Tools → External Tools → Configure...
  2. Click New and configure:
| Field | Value |
|-------|-------|
| Name | CDIF Validate |
| Command | Path to validate-cdif.bat |
| Arguments | "${cf}" |
| Working directory | (leave empty) |

Usage

  1. Open a JSON-LD file in oXygen
  2. Go to Tools → External Tools → CDIF Validate
  3. Results appear in the oXygen console

Batch Script Options

validate-cdif.bat file.jsonld           # Validate with 2026 schema
validate-cdif.bat file.jsonld --framed  # Validate + save framed output
validate-cdif.bat file.jsonld --legacy  # Use pre-2026 schema
validate-cdif.bat --help                # Show help

Python

import json
from pyld import jsonld
import jsonschema

# Load the frame
with open('CDIF-frame-2026.jsonld') as f:
    frame = json.load(f)

# Load your JSON-LD metadata document
with open('my-metadata.jsonld') as f:
    doc = json.load(f)

# Load the schema
with open('CDIFCompleteSchema.json') as f:
    schema = json.load(f)

# Step 1: Frame the document
framed = jsonld.frame(doc, frame)

# Step 2: Validate against schema
try:
    jsonschema.validate(instance=framed, schema=schema)
    print("Validation successful!")
except jsonschema.ValidationError as e:
    print(f"Validation failed: {e.message}")

Required packages:

pip install PyLD jsonschema

JavaScript/Node.js

const jsonld = require('jsonld');
const Ajv = require('ajv');
const addFormats = require('ajv-formats');
const fs = require('fs');

async function validateCDIF(metadataPath) {
    // Load files
    const frame = JSON.parse(fs.readFileSync('CDIF-frame-2026.jsonld', 'utf8'));
    const doc = JSON.parse(fs.readFileSync(metadataPath, 'utf8'));
    const schema = JSON.parse(fs.readFileSync('CDIFCompleteSchema.json', 'utf8'));

    // Step 1: Frame the document
    const framed = await jsonld.frame(doc, frame);

    // Step 2: Validate against schema
    const ajv = new Ajv({ allErrors: true });
    addFormats(ajv);
    const validate = ajv.compile(schema);

    if (validate(framed)) {
        console.log('Validation successful!');
        return true;
    } else {
        console.log('Validation failed:', validate.errors);
        return false;
    }
}

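// Report file, framing, or schema-compilation errors instead of leaving the promise unhandled
validateCDIF('my-metadata.jsonld').catch((err) => {
    console.error(err);
    process.exitCode = 1;
});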

Required packages:

npm install jsonld ajv ajv-formats

Context Requirements

Your JSON-LD metadata documents must include a @context with namespace prefixes. Only schema and dcterms are required at the discovery level; additional prefixes are needed depending on which optional properties are used.

2026 Schema Requirements

Required (discovery level):

{
    "@context": {
        "schema": "http://schema.org/",
        "dcterms": "http://purl.org/dc/terms/"
    }
}

Optional prefixes (add as needed for the properties you use):

| Prefix | IRI | When needed |
|--------|-----|-------------|
| spdx | http://spdx.org/rdf/terms# | Checksum properties on distributions |
| dcat | http://www.w3.org/ns/dcat# | dcat:CatalogRecord on subjectOf |
| geosparql | http://www.opengis.net/ont/geosparql# | Spatial coverage geometry |
| prov | http://www.w3.org/ns/prov# | Provenance (wasGeneratedBy) |
| dqv | http://www.w3.org/ns/dqv# | Data quality measurements |
| cdi | http://ddialliance.org/Specification/DDI-CDI/1.0/RDF/ | DDI-CDI variable/data structure properties |
| csvw | http://www.w3.org/ns/csvw# | CSVW tabular data properties (data description level) |

Domain-specific metadata may also use extension namespace prefixes. For example, the XAS (X-ray absorption spectroscopy) test example uses:

| Prefix | IRI | Purpose |
|--------|-----|---------|
| xas | http://cdi4exas.org/ | XAS-specific types and properties (beamline, detector, edge energy, etc.) |
| cdifq | http://crossdomaininteroperability.org/cdifq/ | Placeholder namespace for data structure properties (nColumns, nRows) not yet assigned to a formal vocabulary |

The cdifq namespace is a temporary placeholder. Properties using it (such as row/column counts on data structures) may migrate to DDI-CDI, CSVW, or another standard vocabulary in the future. croissant/ConvertToCroissant.py includes cdifq in its output context so that these terms resolve correctly during JSON-LD processing.

Legacy Schema Requirements

{
    "@context": {
        "schema": "http://schema.org/",
        "dcterms": "http://purl.org/dc/terms/",
        "prov": "http://www.w3.org/ns/prov#",
        "dqv": "http://www.w3.org/ns/dqv#",
        "geosparql": "http://www.opengis.net/ont/geosparql#",
        "spdx": "http://spdx.org/rdf/terms#",
        "time": "http://www.w3.org/2006/time#"
    }
}

Authoring Instances Without Prefixes

If you prefer to author metadata without namespace prefixes (e.g., name instead of schema:name), you can use the CDIF-context-2026.jsonld context file. This context maps unprefixed property names to their full IRIs.

Example Instance Without Prefixes

{
    "@context": "https://your-server.org/CDIF-context-2026.jsonld",
    "@type": "Dataset",
    "@id": "https://example.org/dataset/123",
    "name": "My Dataset",
    "description": "A sample dataset description",
    "identifier": "dataset-123",
    "dateModified": "2024-01-15",
    "url": "https://example.org/data/123",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "subjectOf": {
        "@type": ["Dataset"],
        "additionalType": ["dcat:CatalogRecord"],
        "sdDatePublished": "2024-01-15"
    }
}

How It Works

The validation workflow handles both prefixed and unprefixed instances:

  1. Unprefixed instance references CDIF-context-2026.jsonld
  2. Framing with CDIF-frame-2026.jsonld transforms the instance
  3. The frame's context uses prefixed names, so the output has prefixed keys
  4. Validate against CDIFCompleteSchema.json

This means you only need one schema. The framing step normalizes all instances to the prefixed format regardless of how they were authored.

Deploying the Context

For production use, host CDIF-context-2026.jsonld at a stable URL and reference it in your instances:

{
    "@context": "https://your-server.org/CDIF-context-2026.jsonld",
    ...
}

Or embed the context directly in your instance by copying the contents of CDIF-context-2026.jsonld.

Schema Structure

The schema validates CDIF Discovery profile metadata with the following required fields:

  • @id - Resource identifier
  • @type - Must include schema:Dataset
  • @context - JSON-LD context with required prefixes
  • schema:name - Resource name
  • schema:identifier - Primary identifier
  • schema:dateModified - Last modification date
  • schema:subjectOf - Metadata about the metadata record (requires @type containing schema:Dataset and schema:additionalType containing dcat:CatalogRecord)
  • Either schema:url or schema:distribution - Access information
  • Either schema:license or schema:conditionsOfAccess - Usage terms
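
As a quick pre-check before running full schema validation, the required fields above can be tested with plain Python. This is only a sketch: the helper names `as_list` and `missing_discovery_fields` are hypothetical, it assumes a framed document with prefixed keys and a single schema:subjectOf object, and it does not replace validation against CDIFDiscoverySchema.json.

```python
def as_list(value):
    """Treat a missing value as [] and a scalar as a one-item list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def missing_discovery_fields(doc):
    """List required discovery-level fields absent from a framed document."""
    problems = [key for key in ("@id", "@context", "schema:name",
                                "schema:identifier", "schema:dateModified",
                                "schema:subjectOf") if key not in doc]
    if "schema:Dataset" not in as_list(doc.get("@type")):
        problems.append("@type (must include schema:Dataset)")
    if "schema:url" not in doc and "schema:distribution" not in doc:
        problems.append("schema:url or schema:distribution")
    if "schema:license" not in doc and "schema:conditionsOfAccess" not in doc:
        problems.append("schema:license or schema:conditionsOfAccess")
    # Assumption: schema:subjectOf is a single object; arrays are not handled.
    record = doc.get("schema:subjectOf")
    if isinstance(record, dict):
        if "schema:Dataset" not in as_list(record.get("@type")):
            problems.append("schema:subjectOf/@type")
        if "dcat:CatalogRecord" not in as_list(record.get("schema:additionalType")):
            problems.append("schema:subjectOf/schema:additionalType")
    return problems


example = {
    "@context": {"schema": "http://schema.org/",
                 "dcterms": "http://purl.org/dc/terms/"},
    "@id": "https://example.org/dataset/123",
    "@type": ["schema:Dataset"],
    "schema:name": "My Dataset",
    "schema:identifier": "dataset-123",
    "schema:dateModified": "2024-01-15",
    "schema:url": "https://example.org/data/123",
    "schema:license": "https://creativecommons.org/licenses/by/4.0/",
    "schema:subjectOf": {
        "@type": ["schema:Dataset"],
        "schema:additionalType": ["dcat:CatalogRecord"],
    },
}
print(missing_discovery_fields(example))  # prints []
```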

2026 Schema Additions

The 2026 schema adds support for:

Variables (schema:variableMeasured):

  • Items are anyOf PropertyValue-based (cdifVariableMeasured) or schema:StatisticalVariable
  • PropertyValue variables: typed as schema:PropertyValue with DDI-CDI extensions (cdi:intendedDataType, cdi:simpleUnitOfMeasure, cdi:describedUnitOfMeasure, cdi:uses, cdi:role)
  • cdi:role -- enum: MeasureComponent, AttributeComponent, DimensionComponent, DescriptorComponent, ReferenceValueComponent
  • StatisticalVariable: typed as schema:StatisticalVariable with schema:statType, schema:measuredProperty (required)
  • cdi:physicalDataType is required at the data description level (CDIFDataDescription/CDIFcomplete profiles), not at discovery level

Distributions:

  • cdi:StructuredDataSet - For structured formats (JSON, XML, HDF5, NetCDF)
  • cdi:TabularTextDataSet - For tabular text (wide format) with CSVW properties:
    • csvw:delimiter, csvw:header, csvw:headerRowCount
    • cdi:isDelimited OR cdi:isFixedWidth
    • cdi:hasPhysicalMapping - Links variables to physical representation
  • cdi:LongStructureDataSet - For long/narrow data format where each row is a single observation:
    • A descriptor column identifies which variable each row measures (cdi:role: DescriptorComponent)
    • A reference column holds the actual value (cdi:role: ReferenceValueComponent)
    • Optional CSVW properties (delimiter, header, etc.) and DDI-CDI physical properties
    • cdi:hasPhysicalMapping - Links variables to physical representation
    • SHACL rules enforce exactly one DescriptorComponent and at least one ReferenceValueComponent
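
The role-count rule for cdi:LongStructureDataSet can be illustrated outside SHACL. The sketch below assumes the components are available as a list of dicts that each carry a cdi:role string; the function name `check_long_structure_roles` is hypothetical, and the actual constraint is expressed in the SHACL shapes, not in Python.

```python
from collections import Counter


def check_long_structure_roles(components):
    """Check the long-format role counts: exactly one descriptor, one or more reference values."""
    # Assumption: each component is a dict with a plain-string cdi:role.
    roles = Counter(component.get("cdi:role") for component in components)
    errors = []
    if roles["DescriptorComponent"] != 1:
        errors.append(
            f"expected exactly 1 DescriptorComponent, found {roles['DescriptorComponent']}"
        )
    if roles["ReferenceValueComponent"] < 1:
        errors.append("expected at least 1 ReferenceValueComponent, found 0")
    return errors


columns = [
    {"schema:name": "site", "cdi:role": "DimensionComponent"},
    {"schema:name": "parameter", "cdi:role": "DescriptorComponent"},
    {"schema:name": "value", "cdi:role": "ReferenceValueComponent"},
]
print(check_long_structure_roles(columns))  # prints []
```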

Flattened Graph Schema

CDIF-graph-schema-2026.json is the graph-based counterpart to the framed tree schema. It validates flattened JSON-LD documents that use @graph arrays directly, without requiring framing first. This is useful for validating JSON-LD as it naturally comes out of RDF stores or JSON-LD flatten operations.

The schema is generated by generate_graph_schema.py from the CDIF building block source schemas.

Building Block Sources

The generator reads building block schemas from the metadataBuildingBlocks/_sources/ directory (the BuildingBlockSubmodule). The location is auto-detected or can be overridden:

# Auto-detect (looks for BuildingBlockSubmodule/_sources/ relative to script)
python generate_graph_schema.py

# Explicit path
python generate_graph_schema.py --bb-dir /path/to/_sources

# Environment variable
export CDIF_BB_DIR=/path/to/_sources
python generate_graph_schema.py

# Custom output path
python generate_graph_schema.py --output my-graph-schema.json

Graph Schema Usage

# Validate a flattened JSON-LD document directly
python -c "
import json, jsonschema
with open('CDIF-graph-schema-2026.json') as f: schema = json.load(f)
with open('my-flattened.jsonld') as f: doc = json.load(f)
jsonschema.validate(doc, schema)
print('Valid')
"

The graph schema accepts three input forms:

  • A {"@context": {...}, "@graph": [...]} document (the primary use case)
  • A bare array of typed objects
  • A single typed object
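
When preparing input for the graph schema it can help to know which of the three forms a document takes. A minimal sketch follows; the helper names `graph_input_form` and `nodes_of` are illustrative and are not part of this repository.

```python
def graph_input_form(doc):
    """Classify a parsed JSON-LD document into one of the three accepted forms."""
    if isinstance(doc, dict) and "@graph" in doc:
        return "graph"    # {"@context": {...}, "@graph": [...]}
    if isinstance(doc, list):
        return "array"    # bare array of typed objects
    if isinstance(doc, dict) and "@type" in doc:
        return "object"   # single typed object
    return "unknown"


def nodes_of(doc):
    """Return the node objects regardless of input form."""
    form = graph_input_form(doc)
    if form == "graph":
        return doc["@graph"]
    if form == "array":
        return doc
    if form == "object":
        return [doc]
    return []


print(graph_input_form({"@context": {}, "@graph": []}))
print(graph_input_form([{"@type": "schema:Person"}]))
print(graph_input_form({"@type": "schema:Dataset"}))
```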

Schema Structure (Graph)

The generated schema has this high-level structure:

  • root-graph: validates @context prefix declarations + @graph array of nodes
  • root-object: a nested if/then/else chain dispatching objects by @type to the correct type definition
  • id-reference: shared {"@id": "string"} definition for cross-node references
  • 24 type definitions: type-Dataset, type-Person, type-Organization, type-PropertyValue, type-DefinedTerm, type-CreativeWork, type-DataDownload, type-MediaObject, type-WebAPI, type-Action, type-HowTo, type-Place, type-ProperInterval, type-MonetaryGrant, type-Role, type-Activity, type-QualityMeasurement, type-Claim, type-CatalogRecord, type-Identifier, type-InstanceVariable, type-StructuredDataSet, type-TabularTextDataSet, type-LongStructureDataSet

Type dispatch is ordered most-specific-first (e.g., cdi:StructuredDataSet before schema:Dataset) so that subtypes are matched before their parent types.
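
The ordering matters because a node can carry both a subtype and its parent type. The sketch below mimics the most-specific-first dispatch in Python with a two-entry table; the real schema expresses this as a nested if/then/else chain, and the names `DISPATCH_ORDER` and `dispatch` are hypothetical.

```python
# Illustrative subset of the dispatch order: subtypes come before parent types.
DISPATCH_ORDER = [
    ("cdi:StructuredDataSet", "type-StructuredDataSet"),
    ("schema:Dataset", "type-Dataset"),
]


def dispatch(node):
    """Return the first matching type definition name, or None."""
    types = node.get("@type", [])
    if isinstance(types, str):
        types = [types]
    # The first match wins, so the order of DISPATCH_ORDER decides ties.
    for type_name, definition in DISPATCH_ORDER:
        if type_name in types:
            return definition
    return None


print(dispatch({"@type": ["schema:Dataset", "cdi:StructuredDataSet"]}))
print(dispatch({"@type": "schema:Dataset"}))
```

If the two entries were swapped, the multi-typed node would be validated as a plain type-Dataset and its subtype-specific constraints would never be checked.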

Key Transformations

The generator applies these transformations when reading building block source schemas:

  1. External $ref resolution -- Cross-building-block $refs (e.g., ../person/schema.yaml) are resolved to internal #/$defs/type-X references
  2. anyOf alternatives -- Properties that reference other building block types get anyOf [type-ref, id-reference] so they accept either inline objects or @id cross-references
  3. @type disambiguation -- Composite types get additional type markers for dispatch (e.g., cdifCatalogRecord becomes dcat:CatalogRecord, identifier adds cdi:Identifier)
  4. @context stripping -- Context declarations are removed from non-root types (the @context goes on the root-graph wrapper only)
  5. Composite type assembly -- Complex types like type-Dataset merge mandatory + optional building blocks; type-StructuredDataSet/type-TabularTextDataSet/type-LongStructureDataSet compose dataDownload + CDI extensions
  6. Extended provenance -- type-Activity built from cdifProv building block, requiring multi-typed @type: ["schema:Action", "prov:Activity"], merging base generatedBy properties (prov:used) with schema.org Action properties (schema:agent, schema:actionProcess, etc.). Instruments are nested within prov:used items via schema:instrument sub-key (instruments are prov:Entity subclasses). type-HowTo and type-Claim added as new dispatch types for methodology and assertion objects

Troubleshooting

Common Validation Errors

  1. Missing required property

    • Ensure all required fields are present
    • Check that schema:subjectOf contains required nested fields
  2. Type mismatch

    • Properties like schema:spatialCoverage and schema:temporalCoverage expect arrays
    • Check that @type values use the schema: prefix
  3. Invalid @type

    • Root @type must include schema:Dataset
    • For 2026 schema, variables must include both schema:PropertyValue and cdi:InstanceVariable
  4. Framing issues

    • Ensure your document has proper @id values for node references
    • Check that the @context is compatible with the frame
  5. dcterms:conformsTo syntax

    • Must use object syntax: [{"@id": "..."}] not ["..."]
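
For the dcterms:conformsTo case, bare strings can be rewritten into object syntax before validation. This is a small sketch with a hypothetical helper name and a placeholder URI; it is not part of the repository tooling.

```python
def normalize_conforms_to(value):
    """Coerce dcterms:conformsTo into a list of {"@id": ...} objects."""
    items = value if isinstance(value, list) else [value]
    normalized = []
    for item in items:
        if isinstance(item, str):
            # Bare IRI string: wrap it as a node reference.
            normalized.append({"@id": item})
        else:
            # Already an object: keep it unchanged.
            normalized.append(item)
    return normalized


# The URI below is a placeholder, not a real CDIF profile identifier.
print(normalize_conforms_to(["https://example.org/profile/discovery"]))
```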

Debugging

To see the framed output before validation:

python FrameAndValidate.py my-metadata.jsonld -o framed.json

Or in Python:

framed = jsonld.frame(doc, frame)
print(json.dumps(framed, indent=2))

SHACL Validation

In addition to JSON Schema validation, CDIF metadata can be validated using SHACL (Shapes Constraint Language) rules. SHACL validation operates on the RDF graph and can express constraints that JSON Schema cannot -- SPARQL-based targeting, cross-node relationships, and semantic inference.

The composite SHACL shapes are compiled from modular rules.shacl files in individual building blocks in the metadataBuildingBlocks repository. Two profiles are available:

  • discovery -- ShaclValidation/CDIF-Discovery-Shapes.ttl (64 shapes)
  • complete -- ShaclValidation/CDIF-Complete-Shapes.ttl (76 shapes, adds provenance + data description)

Quick start:

# Validate against discovery shapes
python ShaclValidation/ShaclJSONLDContext.py my-metadata.jsonld ShaclValidation/CDIF-Discovery-Shapes.ttl

# Generate a markdown validation report
python ShaclValidation/generate_shacl_report.py my-metadata.jsonld ShaclValidation/CDIF-Complete-Shapes.ttl -o report.md

# Regenerate shapes after building block changes
python ShaclValidation/generate_shacl_shapes.py --profile discovery
python ShaclValidation/generate_shacl_shapes.py --profile complete

See ShaclValidation/README.md for detailed documentation on the SHACL tools, shapes architecture, report format, and how to add new building block shapes.

Recommendation: Use both JSON Schema and SHACL validation for comprehensive coverage. batch_validate.py runs both automatically across multiple file groups.

Conformance Validation

validate_conformance.py inspects JSON-LD files for schema:subjectOf/dcterms:conformsTo claims and validates each file against the profile schemas it claims to conform to. Supports cdifCore, CDIFDiscovery, CDIFDataDescription, CDIFcomplete, and building block schemas (provenance, manifest/archive distribution).

# Validate a directory of JSON-LD files against their claimed profiles
python validate_conformance.py testJSONMetadata/

# Summary only
python validate_conformance.py testJSONMetadata/ --summary

# Verbose per-file error details
python validate_conformance.py testJSONMetadata/ --verbose

Conformance URIs with ada: prefix are ignored. URIs are normalized (trailing slashes stripped, dataDescription mapped to data_description).
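
The normalization rules can be sketched as follows. The function name and the example URIs are assumptions made for illustration; validate_conformance.py may apply these rules in a different order or with additional cases.

```python
def normalize_conformance_uri(uri):
    """Apply the documented normalization; return None for ignored URIs."""
    # URIs with the ada: prefix are ignored.
    if uri.startswith("ada:"):
        return None
    # Strip trailing slashes, then map dataDescription to data_description.
    uri = uri.rstrip("/")
    return uri.replace("dataDescription", "data_description")


# Placeholder URIs, not real CDIF profile identifiers.
print(normalize_conformance_uri("https://example.org/cdif/dataDescription/"))
print(normalize_conformance_uri("ada:profile/example"))
```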

GeoCodes Harvester

geocodes_harvester.py harvests dataset metadata from the EarthCube GeoCodes catalog (~170K indexed datasets). It queries the Blazegraph SPARQL endpoint, fetches original JSON-LD from source landing pages when available, and optionally converts records to CDIF profile format.

# List publishers and dataset counts
python geocodes_harvester.py --list-publishers

# Harvest 5 records from diverse publishers, convert to CDIF Discovery
python geocodes_harvester.py --count 5 --output ./examples --cdif discovery

# Harvest from a specific publisher
python geocodes_harvester.py --publisher "PANGAEA" --count 3 --output ./examples

# Harvest without CDIF conversion (raw schema.org JSON-LD)
python geocodes_harvester.py --count 5 --output ./raw-examples

The CDIF conversion handles: property prefixing (schema:), @context/@type normalization, @list wrapping for creators, distribution fixes, subjectOf with conformsTo, type mappings (FundingAgency to Organization, Grant to MonetaryGrant, Croissant sc:Dataset to Dataset), Person name synthesis, and sameAs array normalization. All conversions are documented in each record's subjectOf description. Extra properties from the source are preserved (open-world assumption).

DCAT Conversion

DCAT/dcat_to_cdif.py converts DCAT JSON-LD catalogs or individual dataset records to CDIF-conformant schema.org JSON-LD. Maps DCAT/Dublin Core properties to schema.org equivalents per the CDIF DCAT implementation guide.

# List datasets in a DCAT catalog
python DCAT/dcat_to_cdif.py catalog.jsonld --list

# Convert selected records, validate output
python DCAT/dcat_to_cdif.py catalog.jsonld --output ./examples \
  --select 0,3,5 --catalog-name "My Catalog" --catalog-url "https://example.org/" \
  --validate

Key mappings: dcterms:title → schema:name, dcterms:description → schema:description, dcterms:modified → schema:dateModified, dcterms:license → schema:license, dcterms:accessRights → schema:conditionsOfAccess, dcat:keyword → schema:keywords, dcat:Distribution → schema:DataDownload, dcterms:spatial → schema:spatialCoverage, dcterms:temporal → schema:temporalCoverage. Unmapped properties preserved (open world). Auto-detects Discovery vs Core profile based on spatial/temporal content.
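
The property-level part of this mapping amounts to a key rename that leaves unknown keys untouched. The sketch below shows only that idea; `DCAT_TO_SCHEMA` and `rename_properties` are hypothetical names, values are copied unchanged, and the dcat:Distribution to schema:DataDownload type mapping is left out because it is a type conversion rather than a property rename. DCAT/dcat_to_cdif.py does considerably more.

```python
# Property renames taken from the key mappings listed above.
DCAT_TO_SCHEMA = {
    "dcterms:title": "schema:name",
    "dcterms:description": "schema:description",
    "dcterms:modified": "schema:dateModified",
    "dcterms:license": "schema:license",
    "dcterms:accessRights": "schema:conditionsOfAccess",
    "dcat:keyword": "schema:keywords",
    "dcterms:spatial": "schema:spatialCoverage",
    "dcterms:temporal": "schema:temporalCoverage",
}


def rename_properties(record):
    """Rename mapped keys; keep unmapped keys as-is (open world)."""
    return {DCAT_TO_SCHEMA.get(key, key): value for key, value in record.items()}


source = {
    "dcterms:title": "River gauge readings",
    "dcat:keyword": ["hydrology"],
    "ex:custom": "kept unchanged",
}
print(rename_properties(source))
```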

See DCAT/README.md for the full property mapping table, PSDI catalog example, and known limitations.

MetadataExamples

The MetadataExamples/ directory contains sample CDIF JSON-LD documents for testing:

| File | Technique | Description |
|------|-----------|-------------|
| tof-htk9-f770.json | ToF-SIMS | Time-of-flight mass spectrometry particle analysis |
| xrd-2j0t-gq80.json | XRD | X-ray diffraction |
| xanes-2arx-b516.json | XANES | X-ray absorption near-edge structure |
| yv1f-jb20.json | -- | General dataset |
| test_se_na2so4-testschemaorg-cdiv3.json | XAS | X-ray absorption spectroscopy with DDI-CDI data structure (WideDataStructure, InstanceVariable, ValueMapping). Uses xas: and cdifq: extension namespaces |
| nwis-water-quality-longdata.json | Water Quality | NWIS groundwater nutrient analysis (464 rows, 20 columns) in cdi:LongStructureDataSet long (narrow) format with DescriptorComponent/ReferenceValueComponent roles, cdi:hasPhysicalMapping, and 5 MeasureComponent domain variables. Validates against graph schema (CDIF-graph-schema-2026.json) |
| prov-ocean-temp-example.json | Ocean Temperature | Extended provenance example demonstrating cdifProv building block: action chaining (schema:object/schema:result), multi-typed ["schema:Action", "prov:Activity"] activities, agents with Role wrappers, inline schema:HowTo methodology via schema:actionProcess with 3 steps, diverse instruments, facility location, and backward-compatible prov:used. Validates against graph schema |

Corresponding Croissant output files are in the croissant/ directory.

DDI-CDI Resolved Schema

The ddi-cdi/cls-InstanceVariable-resolved.json file is a standalone JSON Schema (Draft 2020-12) for the DDI-CDI InstanceVariable class, derived from ddi-cdi/ddi-cdi.schema_normative.json. It resolves all $ref references into a self-contained schema suitable for use in editors like oXygen without needing the full 395-definition DDI-CDI schema.

The resolved schema applies several transformations to make the schema practical:

  • Reverse properties removed - 767 _OF_ reverse relationship properties stripped (use JSON-LD @reverse instead)
  • catalogDetails removed - Catalog-level metadata omitted from all classes
  • Redundant classes omitted - cls-DataPoint, cls-Datum, cls-RepresentedVariable simplified to IRI-only references
  • XSD types inlined - Primitive types (xsd:string, xsd:integer, etc.) replaced with inline definitions
  • Patterns normalized - if/then/else array patterns converted to consistent anyOf
  • Frequency-based $ref resolution - Common definitions (>3 uses) in $defs; rare definitions inlined
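
The frequency-based rule can be pictured as counting how often each $ref target is used and splitting on a threshold. The sketch below is an illustration only: the function names are hypothetical, and the actual generation process is the one described in the resolved-schema README.

```python
from collections import Counter


def count_refs(node, counts=None):
    """Recursively count occurrences of each $ref target in a schema."""
    counts = Counter() if counts is None else counts
    if isinstance(node, dict):
        ref = node.get("$ref")
        if isinstance(ref, str):
            counts[ref] += 1
        for value in node.values():
            count_refs(value, counts)
    elif isinstance(node, list):
        for item in node:
            count_refs(item, counts)
    return counts


def partition_refs(schema, threshold=3):
    """Split $ref targets into shared (> threshold uses) and inlined (the rest)."""
    counts = count_refs(schema)
    shared = sorted(ref for ref, n in counts.items() if n > threshold)
    inlined = sorted(ref for ref, n in counts.items() if n <= threshold)
    return shared, inlined


toy_schema = {
    "properties": {
        "a": {"$ref": "#/$defs/Common"},
        "b": {"$ref": "#/$defs/Common"},
        "c": {"$ref": "#/$defs/Common"},
        "d": {"$ref": "#/$defs/Common"},
        "e": {"$ref": "#/$defs/Rare"},
    }
}
print(partition_refs(toy_schema))
```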

See ddi-cdi/cls-InstanceVariable-resolved-README.md for full details on the generation process, circular reference analysis, and transformation rationale.

Notes

  • The framed tree schemas (CDIFCompleteSchema.json, CDIFDataDescriptionSchema.json, CDIFDiscoverySchema.json) are generated from building block profile resolved schemas using generate_validation_schema.py. The hand-maintained originals and the all-in-one CDIF-JSONLD-schema-2026.json are in archive/.
  • Legacy schema (CDIF-JSONLD-schema-schemaprefix.json) is still available for older documents.
  • All schema.org elements require the schema: prefix for SHACL validation compatibility.
  • The frame ensures that after framing, the output structure matches what the JSON schema expects.
  • For SHACL validation, use the corresponding .shacl or .ttl files in this repository.
  • @type flexibility: All @type definitions in the framed schemas use anyOf to accept either a string ("schema:Dataset") or an array (["schema:Dataset"]). JSON-LD framing may compact single-element arrays to strings; FrameAndValidate.py recursively normalizes all @type values back to arrays.
  • spdx:Checksum typing: All spdx:checksum objects must include "@type": "spdx:Checksum". This is required by both the JSON Schema (required: ["@type"]) and SHACL shapes (sh:class spdx:Checksum).
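
The @type normalization mentioned in the notes can be sketched as a recursive walk that wraps string values in a list. This is an illustration with a hypothetical function name, not the code from FrameAndValidate.py.

```python
def normalize_types(node):
    """Recursively wrap string @type values in a list, modifying node in place."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "@type" and isinstance(value, str):
                # Reassigning an existing key is safe while iterating.
                node[key] = [value]
            else:
                normalize_types(value)
    elif isinstance(node, list):
        for item in node:
            normalize_types(item)
    return node


framed = {
    "@type": "schema:Dataset",
    "schema:distribution": [
        {"@type": "schema:DataDownload",
         "spdx:checksum": {"@type": "spdx:Checksum"}},
    ],
}
print(normalize_types(framed))
```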
