Files for validation of CDIF metadata

This repository contains JSON Schema files, JSON-LD frames and contexts, and SHACL rule sets for validating CDIF metadata documents.

Table of Contents

Files
Quick Start
Validation Workflow
- Step 1: Frame the JSON-LD Document
- Step 2: Validate Against Schema
RO-Crate Conversion and Validation
Croissant Conversion
Usage Examples
Context Requirements
Authoring Instances Without Prefixes
Schema Structure
Flattened Graph Schema
Troubleshooting
- Common Validation Errors
- Debugging
SHACL Validation
Conformance Validation
GeoCodes Harvester
DCAT Conversion
MetadataExamples
DDI-CDI Resolved Schema
Notes

Files

Current (2026 Schema with DDI-CDI/CSVW)

File	Description
`CDIFDiscoverySchema.json`	JSON Schema for framed (tree) CDIF discovery profile metadata, generated by `generate_validation_schema.py` from CDIFDiscoveryProfile resolvedSchema
`CDIFCompleteSchema.json`	JSON Schema for framed (tree) CDIF complete profile metadata (discovery + data description + archive + provenance), generated by `generate_validation_schema.py` from CDIFcompleteProfile resolvedSchema
`CDIFDataDescriptionSchema.json`	JSON Schema for framed (tree) CDIF data description profile metadata (discovery + data description), generated by `generate_validation_schema.py` from CDIFDataDescriptionProfile resolvedSchema
`generate_validation_schema.py`	Generates framed-tree validation schemas from building block profile resolved schemas
`CDIF-graph-schema-2026.json`	JSON Schema for flattened JSON-LD graphs (`@graph` arrays), generated by `generate_graph_schema.py`
`generate_graph_schema.py`	Generates the graph schema from building block source schemas
`ShaclValidation/generate_shacl_shapes.py`	Generates composite SHACL shapes from building block rules.shacl files
`ShaclValidation/generate_shacl_report.py`	Generates markdown SHACL validation reports with severity grouping
`ShaclValidation/CDIF-Discovery-Shapes.ttl`	Composite SHACL shapes for CDIFDiscovery profile (generated by `ShaclValidation/generate_shacl_shapes.py`)
`ShaclValidation/CDIF-Complete-Shapes.ttl`	Composite SHACL shapes for CDIFcomplete profile (generated by `generate_shacl_shapes.py --profile complete`)
`CDIF-frame-2026.jsonld`	JSON-LD frame for 2026 schema
`CDIF-context-2026.jsonld`	JSON-LD context for authoring without namespace prefixes
`FrameAndValidate.py`	Python script for framing and validation
`croissant/ConvertToCroissant.py`	Converts CDIF JSON-LD to Croissant (mlcommons.org/croissant/1.0) format
`validate_building_blocks.py`	Validates building block schemas, SHACL shapes, and examples across the BB source tree
`validate-cdif.bat`	Windows batch script for oXygen XML Editor integration
`batch_validate.py`	Batch validation of CDIF metadata files across multiple file groups (JSON Schema + SHACL)
`validate_conformance.py`	Validates JSON-LD instances against the CDIF profiles they claim conformance to via `schema:subjectOf/dcterms:conformsTo`. Maps conformsTo URIs to profile/building-block schemas and reports per-file, per-profile results
`geocodes_harvester.py`	Harvests dataset metadata from the EarthCube GeoCodes SPARQL endpoint, extracts original JSON-LD from landing pages, and optionally converts to CDIF core or discovery profile format
`DCAT/dcat_to_cdif.py`	Converts DCAT JSON-LD catalogs to CDIF schema.org format. Maps DCAT/Dublin Core properties to schema.org equivalents per the CDIF DCAT implementation guide. See DCAT/README.md

DDI-CDI Resolved Schema

File	Description
`ddi-cdi/ddi-cdi.schema_normative.json`	Full DDI-CDI normative JSON Schema (395 definitions)
`ddi-cdi/cls-InstanceVariable-resolved.json`	Self-contained resolved schema for DDI-CDI InstanceVariable class
`ddi-cdi/cls-InstanceVariable-resolved-README.md`	Documentation for the resolved schema generation process

Legacy (Pre-2026, in `archive/`)

File	Description
`CDIFDiscoverySchema.json`	Hand-maintained discovery schema (superseded by generated version)
`CDIFCompleteSchema.json`	Hand-maintained complete schema (superseded by generated version)
`CDIF-JSONLD-schema-2026.json`	Original all-in-one framed tree schema (superseded by CDIFDiscoverySchema + CDIFCompleteSchema)
`CDIF-JSONLD-schema-schemaprefix.json`	JSON Schema for CDIF Discovery profile metadata with `schema:` prefixes
`CDIF-frame.jsonld`	JSON-LD frame for legacy schema
`CDIF-context.jsonld`	Legacy JSON-LD context

Quick Start

Prerequisites

pip install PyLD jsonschema

Validate a Document

# Using Python script (default: 2026 schema)
python FrameAndValidate.py my-metadata.jsonld -v

# Using Windows batch script
validate-cdif.bat my-metadata.jsonld

Save Framed Output for Debugging

python FrameAndValidate.py my-metadata.jsonld -o framed.json -v

Batch Validate Multiple Files

batch_validate.py runs both JSON Schema and SHACL validation across multiple file groups:

python batch_validate.py

File groups validated:

testJSONMetadata -- 77 ADA metadata test files
cdifbook -- 10 cdifbook example documents
cdifProfiles -- 5 CDIF profile examples from building blocks
adaProfiles -- 36 ADA profile examples from building blocks

Output shows per-file results for each validation type with severity-aware reporting:

JSON Schema: PASS or FAIL
SHACL: PASS (clean), PASS (N warnings, M info), FAIL (N violations, M warnings), or SKIP (for generated output files like -croissant.json, -rocrate.json)

Group summaries and an overall summary list all violations and schema failures.
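
The SHACL status strings listed above can be derived from per-severity result counts. The following is a minimal sketch of that mapping, not the code in batch_validate.py; the function name `shacl_status`, its parameters, and the suffix-based skip check are assumptions made for illustration.

```python
# Hypothetical helper; not taken from batch_validate.py.
# Suffixes of generated output files that are skipped for SHACL.
GENERATED_SUFFIXES = ("-croissant.json", "-rocrate.json")


def shacl_status(filename, violations=0, warnings=0, infos=0):
    """Return a status string in the style described above."""
    # Generated conversion outputs are skipped rather than validated.
    if filename.endswith(GENERATED_SUFFIXES):
        return "SKIP"
    # Any violation fails the file; warnings are reported alongside.
    if violations:
        return f"FAIL ({violations} violations, {warnings} warnings)"
    # Warnings and info results do not fail the file.
    if warnings or infos:
        return f"PASS ({warnings} warnings, {infos} info)"
    return "PASS (clean)"
```

For example, `shacl_status("record.jsonld", warnings=3, infos=1)` gives a PASS status that still reports the three warnings, which matches the severity-aware behaviour described above.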

Current Validation Status

As of April 2026, validation across testJSONMetadata (77 files) and all 5 CDIF profile examples shows:

JSON Schema: 77/77 testJSONMetadata pass against all three schemas (Discovery, DataDescription, Complete)
Profile examples: 5/5 pass (Discovery, DiscoveryMinimal, DiscoveryComplete, DataDescription, Complete)
SHACL Violations: 0 across all files
SHACL Warnings/Info: All files pass with warnings/info only — these reflect optional-but-recommended properties (missing activity descriptions, contact points, physical data types, etc.)

SHACL severity levels are aligned with JSON Schema: properties that are optional in the JSON Schema are sh:Warning (not sh:Violation) in SHACL.

Validation Workflow

CDIF metadata is expressed as JSON-LD. To validate JSON-LD documents against the JSON Schema, you need to first frame the document to ensure it has the correct structure. The framing process:

Reshapes the JSON-LD graph into a tree structure
Ensures properties use the expected prefixes (e.g., schema:name)
Embeds referenced nodes inline
Normalizes arrays and single values

Step 1: Frame the JSON-LD Document

Use a JSON-LD processor to apply CDIF-frame-2026.jsonld to your metadata document.

Step 2: Validate Against Schema

Validate the framed output against the appropriate schema:

CDIFDiscoverySchema.json -- discovery profile only
CDIFDataDescriptionSchema.json -- discovery + data description
CDIFCompleteSchema.json -- discovery + data description + archive + provenance (default)

RO-Crate Conversion and Validation

RO-Crate conversion and validation tools (ConvertToROCrate.py, ValidateROCrate.py) have been moved to the CDIF packaging repository. These tools convert nested/compacted CDIF JSON-LD into RO-Crate 1.1 form via JSON-LD expand + flatten.

See the packaging repository documentation for conversion details, validation checks, and usage.

Croissant Conversion

croissant/ConvertToCroissant.py converts CDIF JSON-LD metadata to Croissant (mlcommons.org/croissant/1.0) JSON-LD, an ML-oriented dataset metadata format developed by MLCommons. Both formats build on schema.org and JSON-LD, so discovery-level metadata maps directly.

# Convert a CDIF document to Croissant
python croissant/ConvertToCroissant.py input.jsonld -o output-croissant.json

# Validate the output (requires: pip install mlcroissant)
mlcroissant validate --jsonld output-croissant.json

See croissant/README.md for detailed documentation on the conversion process, property mappings, example output files, and usage options. The full property-by-property mapping is in croissant/CDIFtoCroissant.md.

Usage Examples

Command Line (Recommended)

The FrameAndValidate.py script handles the complete workflow:

# Validate with 2026 schema (default)
python FrameAndValidate.py my-metadata.jsonld -v

# Save framed output
python FrameAndValidate.py my-metadata.jsonld -o framed.json -v

# Use legacy schema
python FrameAndValidate.py my-metadata.jsonld --frame archive/CDIF-frame.jsonld --schema archive/CDIF-JSONLD-schema-schemaprefix.json -v

Options:

-v, --validate - Validate against JSON Schema
-o, --output FILE - Save framed output to file
--schema FILE - Path to JSON Schema (default: CDIFCompleteSchema.json)
--frame FILE - Path to JSON-LD frame (default: CDIF-frame-2026.jsonld)

oXygen XML Editor

The validate-cdif.bat script enables validation from within oXygen XML Editor.

Setup

Go to Tools → External Tools → Configure...
Click New and configure:

Field	Value
Name	`CDIF Validate`
Command	Path to `validate-cdif.bat`
Arguments	`"${cf}"`
Working directory	(leave empty)

Usage

Open a JSON-LD file in oXygen
Go to Tools → External Tools → CDIF Validate
Results appear in the oXygen console

Batch Script Options

validate-cdif.bat file.jsonld           # Validate with 2026 schema
validate-cdif.bat file.jsonld --framed  # Validate + save framed output
validate-cdif.bat file.jsonld --legacy  # Use pre-2026 schema
validate-cdif.bat --help                # Show help

Python

import json
from pyld import jsonld
import jsonschema

# Load the frame
with open('CDIF-frame-2026.jsonld') as f:
    frame = json.load(f)

# Load your JSON-LD metadata document
with open('my-metadata.jsonld') as f:
    doc = json.load(f)

# Load the schema
with open('CDIFCompleteSchema.json') as f:
    schema = json.load(f)

# Step 1: Frame the document
framed = jsonld.frame(doc, frame)

# Step 2: Validate against schema
try:
    jsonschema.validate(instance=framed, schema=schema)
    print("Validation successful!")
except jsonschema.ValidationError as e:
    print(f"Validation failed: {e.message}")

Required packages:

pip install PyLD jsonschema

JavaScript/Node.js

const jsonld = require('jsonld');
const Ajv = require('ajv');
const addFormats = require('ajv-formats');
const fs = require('fs');

async function validateCDIF(metadataPath) {
    // Load files
    const frame = JSON.parse(fs.readFileSync('CDIF-frame-2026.jsonld', 'utf8'));
    const doc = JSON.parse(fs.readFileSync(metadataPath, 'utf8'));
    const schema = JSON.parse(fs.readFileSync('CDIFCompleteSchema.json', 'utf8'));

    // Step 1: Frame the document
    const framed = await jsonld.frame(doc, frame);

    // Step 2: Validate against schema
    const ajv = new Ajv({ allErrors: true });
    addFormats(ajv);
    const validate = ajv.compile(schema);

    if (validate(framed)) {
        console.log('Validation successful!');
        return true;
    } else {
        console.log('Validation failed:', validate.errors);
        return false;
    }
}

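// Report unexpected errors (missing files, framing failures) instead of
// leaving the promise rejection unhandled.
validateCDIF('my-metadata.jsonld').catch((err) => {
    console.error(err);
    process.exitCode = 1;
});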

Required packages:

npm install jsonld ajv ajv-formats

Context Requirements

Your JSON-LD metadata documents must include a @context with namespace prefixes. Only schema and dcterms are required at the discovery level; additional prefixes are needed depending on which optional properties are used.

2026 Schema Requirements

Required (discovery level):

{
    "@context": {
        "schema": "http://schema.org/",
        "dcterms": "http://purl.org/dc/terms/"
    }
}

Optional prefixes (add as needed for the properties you use):

Prefix	IRI	When needed
`spdx`	`http://spdx.org/rdf/terms#`	Checksum properties on distributions
`dcat`	`http://www.w3.org/ns/dcat#`	`dcat:CatalogRecord` on subjectOf
`geosparql`	`http://www.opengis.net/ont/geosparql#`	Spatial coverage geometry
`prov`	`http://www.w3.org/ns/prov#`	Provenance (wasGeneratedBy)
`dqv`	`http://www.w3.org/ns/dqv#`	Data quality measurements
`cdi`	`http://ddialliance.org/Specification/DDI-CDI/1.0/RDF/`	DDI-CDI variable/data structure properties
`csvw`	`http://www.w3.org/ns/csvw#`	CSVW tabular data properties (data description level)


Domain-specific metadata may also use extension namespace prefixes. For example, the XAS (X-ray absorption spectroscopy) test example uses:

| Prefix | IRI | Purpose |
|--------|-----|---------|
| `xas` | `http://cdi4exas.org/` | XAS-specific types and properties (beamline, detector, edge energy, etc.) |
| `cdifq` | `http://crossdomaininteroperability.org/cdifq/` | Placeholder namespace for data structure properties (`nColumns`, `nRows`) not yet assigned to a formal vocabulary |

The `cdifq` namespace is a temporary placeholder. Properties using it (such as row/column counts on data structures) may migrate to DDI-CDI, CSVW, or another standard vocabulary in the future. `croissant/ConvertToCroissant.py` includes `cdifq` in its output context so that these terms resolve correctly during JSON-LD processing.

Legacy Schema Requirements

```json
{
    "@context": {
        "schema": "http://schema.org/",
        "dcterms": "http://purl.org/dc/terms/",
        "prov": "http://www.w3.org/ns/prov#",
        "dqv": "http://www.w3.org/ns/dqv#",
        "geosparql": "http://www.opengis.net/ont/geosparql#",
        "spdx": "http://spdx.org/rdf/terms#",
        "time": "http://www.w3.org/2006/time#"
    }
}
```

Authoring Instances Without Prefixes

If you prefer to author metadata without namespace prefixes (e.g., name instead of schema:name), you can use the CDIF-context-2026.jsonld context file. This context maps unprefixed property names to their full IRIs.

Example Instance Without Prefixes

{
    "@context": "https://your-server.org/CDIF-context-2026.jsonld",
    "@type": "Dataset",
    "@id": "https://example.org/dataset/123",
    "name": "My Dataset",
    "description": "A sample dataset description",
    "identifier": "dataset-123",
    "dateModified": "2024-01-15",
    "url": "https://example.org/data/123",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "subjectOf": {
        "@type": ["Dataset"],
        "additionalType": ["dcat:CatalogRecord"],
        "sdDatePublished": "2024-01-15"
    }
}

How It Works

The validation workflow handles both prefixed and unprefixed instances:

Unprefixed instance references CDIF-context-2026.jsonld
Framing with CDIF-frame-2026.jsonld transforms the instance
The frame's context uses prefixed names, so the output has prefixed keys
Validate against CDIFCompleteSchema.json

This means you only need one schema. The framing step normalizes all instances to the prefixed format regardless of how they were authored.

Deploying the Context

For production use, host CDIF-context-2026.jsonld at a stable URL and reference it in your instances:

{
    "@context": "https://your-server.org/CDIF-context-2026.jsonld",
    ...
}

Or embed the context directly in your instance by copying the contents of CDIF-context-2026.jsonld.

Schema Structure

The schema validates CDIF Discovery profile metadata with the following required fields:

@id - Resource identifier
@type - Must include schema:Dataset
@context - JSON-LD context with required prefixes
schema:name - Resource name
schema:identifier - Primary identifier
schema:dateModified - Last modification date
schema:subjectOf - Metadata about the metadata record (requires @type containing schema:Dataset and schema:additionalType containing dcat:CatalogRecord)
Either schema:url or schema:distribution - Access information
Either schema:license or schema:conditionsOfAccess - Usage terms
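
Before running full schema validation, the required fields listed above can be checked quickly on a framed document. This is a minimal sketch that only tests key presence; it does not replace JSON Schema validation, and the names `REQUIRED`, `EITHER`, and `missing_fields` are hypothetical.

```python
# Hypothetical pre-check; field names are taken from the list above.
REQUIRED = [
    "@id",
    "@type",
    "@context",
    "schema:name",
    "schema:identifier",
    "schema:dateModified",
    "schema:subjectOf",
]

# Pairs where at least one of the two properties must be present.
EITHER = [
    ("schema:url", "schema:distribution"),
    ("schema:license", "schema:conditionsOfAccess"),
]


def missing_fields(doc):
    """Return a list describing required fields absent from a framed document."""
    missing = [key for key in REQUIRED if key not in doc]
    for first, second in EITHER:
        if first not in doc and second not in doc:
            missing.append(f"{first} or {second}")
    return missing
```

An empty list means every required key is present; the content of each value (for example, whether @type includes schema:Dataset) is still left to the schema.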

2026 Schema Additions

The 2026 schema adds support for:

Variables (schema:variableMeasured):

Each item must match one of two alternatives (anyOf): a PropertyValue-based variable (cdifVariableMeasured) or a schema:StatisticalVariable
PropertyValue variables: typed as schema:PropertyValue with DDI-CDI extensions (cdi:intendedDataType, cdi:simpleUnitOfMeasure, cdi:describedUnitOfMeasure, cdi:uses, cdi:role)
cdi:role -- enum: MeasureComponent, AttributeComponent, DimensionComponent, DescriptorComponent, ReferenceValueComponent
StatisticalVariable: typed as schema:StatisticalVariable with schema:statType, schema:measuredProperty (required)
cdi:physicalDataType is required at the data description level (CDIFDataDescription/CDIFcomplete profiles), not at discovery level

Distributions:

cdi:StructuredDataSet - For structured formats (JSON, XML, HDF5, NetCDF)
cdi:TabularTextDataSet - For tabular text (wide format) with CSVW properties:
- csvw:delimiter, csvw:header, csvw:headerRowCount
- cdi:isDelimited OR cdi:isFixedWidth
- cdi:hasPhysicalMapping - Links variables to physical representation
cdi:LongStructureDataSet - For long/narrow data format where each row is a single observation:
- A descriptor column identifies which variable each row measures (cdi:role: DescriptorComponent)
- A reference column holds the actual value (cdi:role: ReferenceValueComponent)
- Optional CSVW properties (delimiter, header, etc.) and DDI-CDI physical properties
- cdi:hasPhysicalMapping - Links variables to physical representation
- SHACL rules enforce exactly one DescriptorComponent and at least one ReferenceValueComponent
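
The role constraint for long-format data can be mirrored by a simple count. This is a minimal sketch, assuming the cdi:role values of a data set's components have already been collected into a list of strings; that extraction step depends on the document structure and is not shown. The function name is hypothetical and this is not the SHACL rule itself.

```python
# Hypothetical check mirroring the constraint described above.
def long_structure_roles_ok(roles):
    """Return True if roles has exactly one descriptor and at least one reference value."""
    descriptors = roles.count("DescriptorComponent")
    references = roles.count("ReferenceValueComponent")
    return descriptors == 1 and references >= 1
```

Other roles such as DimensionComponent or AttributeComponent may appear any number of times without affecting the result.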

Flattened Graph Schema

CDIF-graph-schema-2026.json is the graph-based counterpart to the framed tree schema. It validates flattened JSON-LD documents that use @graph arrays directly, without requiring framing first. This is useful for validating JSON-LD as it naturally comes out of RDF stores or JSON-LD flatten operations.

The schema is generated by generate_graph_schema.py from the CDIF building block source schemas.

Building Block Sources

The generator reads building block schemas from the metadataBuildingBlocks/_sources/ directory (the BuildingBlockSubmodule). The location is auto-detected or can be overridden:

# Auto-detect (looks for BuildingBlockSubmodule/_sources/ relative to script)
python generate_graph_schema.py

# Explicit path
python generate_graph_schema.py --bb-dir /path/to/_sources

# Environment variable
export CDIF_BB_DIR=/path/to/_sources
python generate_graph_schema.py

# Custom output path
python generate_graph_schema.py --output my-graph-schema.json

Graph Schema Usage

# Validate a flattened JSON-LD document directly
python -c "
import json, jsonschema
with open('CDIF-graph-schema-2026.json') as f: schema = json.load(f)
with open('my-flattened.jsonld') as f: doc = json.load(f)
jsonschema.validate(doc, schema)
print('Valid')
"

The graph schema accepts three input forms:

A {"@context": {...}, "@graph": [...]} document (the primary use case)
A bare array of typed objects
A single typed object
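
Code that post-processes validated documents may want to treat the three input forms uniformly. The following is a minimal sketch that reduces each form to a list of node objects; the function name `graph_nodes` is hypothetical and the graph schema itself does not require this step.

```python
# Hypothetical helper for handling the three accepted input forms.
def graph_nodes(doc):
    """Return the node objects of a @graph document, a bare array, or a single object."""
    # Form 2: a bare array of typed objects.
    if isinstance(doc, list):
        return doc
    # Form 1: a document with @context and @graph.
    if isinstance(doc, dict) and "@graph" in doc:
        return list(doc["@graph"])
    # Form 3: a single typed object.
    if isinstance(doc, dict) and "@type" in doc:
        return [doc]
    raise ValueError("expected a @graph document, an array, or a typed object")
```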

Schema Structure (Graph)

The generated schema has this high-level structure:

root-graph: validates @context prefix declarations + @graph array of nodes
root-object: a nested if/then/else chain dispatching objects by @type to the correct type definition
id-reference: shared {"@id": "string"} definition for cross-node references
24 type definitions: type-Dataset, type-Person, type-Organization, type-PropertyValue, type-DefinedTerm, type-CreativeWork, type-DataDownload, type-MediaObject, type-WebAPI, type-Action, type-HowTo, type-Place, type-ProperInterval, type-MonetaryGrant, type-Role, type-Activity, type-QualityMeasurement, type-Claim, type-CatalogRecord, type-Identifier, type-InstanceVariable, type-StructuredDataSet, type-TabularTextDataSet, type-LongStructureDataSet

Type dispatch is ordered most-specific-first (e.g., cdi:StructuredDataSet before schema:Dataset) so that subtypes are matched before their parent types.
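
The effect of most-specific-first ordering can be illustrated with an ordered lookup. This is a sketch only: the table holds just the two types named above (the generated schema has 24), and whether the generated if/then/else chain tests @type membership in exactly this way is an assumption.

```python
# Illustrative subset of the dispatch order; subtypes come before parent types.
DISPATCH_ORDER = [
    ("cdi:StructuredDataSet", "type-StructuredDataSet"),
    ("schema:Dataset", "type-Dataset"),
]


def dispatch(node):
    """Return the name of the first type definition whose @type marker matches."""
    types = node.get("@type", [])
    # @type may be a single string or an array of strings.
    if isinstance(types, str):
        types = [types]
    for type_name, definition in DISPATCH_ORDER:
        if type_name in types:
            return definition
    return None
```

With this ordering, a node typed as both schema:Dataset and cdi:StructuredDataSet resolves to type-StructuredDataSet; reversing the table would send it to type-Dataset instead.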

Key Transformations

The generator applies these transformations when reading building block source schemas:

External $ref resolution -- Cross-building-block $refs (e.g., ../person/schema.yaml) are resolved to internal #/$defs/type-X references
anyOf alternatives -- Properties that reference other building block types get anyOf [type-ref, id-reference] so they accept either inline objects or @id cross-references
@type disambiguation -- Composite types get additional type markers for dispatch (e.g., cdifCatalogRecord becomes dcat:CatalogRecord, identifier adds cdi:Identifier)
@context stripping -- Context declarations are removed from non-root types (the @context goes on the root-graph wrapper only)
Composite type assembly -- Complex types like type-Dataset merge mandatory + optional building blocks; type-StructuredDataSet/type-TabularTextDataSet/type-LongStructureDataSet compose dataDownload + CDI extensions
Extended provenance -- type-Activity built from cdifProv building block, requiring multi-typed @type: ["schema:Action", "prov:Activity"], merging base generatedBy properties (prov:used) with schema.org Action properties (schema:agent, schema:actionProcess, etc.). Instruments are nested within prov:used items via schema:instrument sub-key (instruments are prov:Entity subclasses). type-HowTo and type-Claim added as new dispatch types for methodology and assertion objects
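
The anyOf alternatives transformation can be pictured as a small schema fragment builder. This is a minimal sketch rather than the code in generate_graph_schema.py; the function name is hypothetical, and placing id-reference under #/$defs/ is an assumption based on the #/$defs/type-X pattern mentioned above.

```python
import json


# Hypothetical builder for the "inline object or @id reference" pattern.
def ref_or_inline(type_name):
    """Return a schema fragment accepting an inline typed object or an @id reference."""
    return {
        "anyOf": [
            {"$ref": f"#/$defs/{type_name}"},
            # Assumed location of the shared {"@id": "string"} definition.
            {"$ref": "#/$defs/id-reference"},
        ]
    }


print(json.dumps(ref_or_inline("type-Person"), indent=2))
```

A property built this way validates both `{"@type": "schema:Person", ...}` written inline and `{"@id": "..."}` pointing at another node in the @graph array.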

Troubleshooting

Common Validation Errors

Missing required property
- Ensure all required fields are present
- Check that schema:subjectOf contains required nested fields
Type mismatch
- Properties like schema:spatialCoverage and schema:temporalCoverage expect arrays
- Check that @type values use the schema: prefix
Invalid @type
- Root @type must include schema:Dataset
- For 2026 schema, variables must include both schema:PropertyValue and cdi:InstanceVariable
Framing issues
- Ensure your document has proper @id values for node references
- Check that the @context is compatible with the frame
dcterms:conformsTo syntax
- Must use object syntax: [{"@id": "..."}] not ["..."]
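
Documents that list dcterms:conformsTo values as plain strings can be converted to the object syntax before validation. The following is a minimal sketch; the function name `to_id_objects` is hypothetical and the example URI in the note below is a placeholder.

```python
# Hypothetical fix-up for the conformsTo syntax error described above.
def to_id_objects(values):
    """Return values as a list of {"@id": ...} objects."""
    # Accept a single string or object as well as a list.
    if isinstance(values, (str, dict)):
        values = [values]
    # Leave existing objects untouched; wrap bare strings.
    return [v if isinstance(v, dict) else {"@id": v} for v in values]
```

For example, `to_id_objects(["https://example.org/profile"])` returns `[{"@id": "https://example.org/profile"}]`.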

Debugging

To see the framed output before validation:

python FrameAndValidate.py my-metadata.jsonld -o framed.json

Or in Python:

framed = jsonld.frame(doc, frame)
print(json.dumps(framed, indent=2))

SHACL Validation

In addition to JSON Schema validation, CDIF metadata can be validated using SHACL (Shapes Constraint Language) rules. SHACL validation operates on the RDF graph and can express constraints that JSON Schema cannot -- SPARQL-based targeting, cross-node relationships, and semantic inference.

The composite SHACL shapes are compiled from modular rules.shacl files in individual building blocks in the metadataBuildingBlocks repository. Two profiles are available:

discovery — ShaclValidation/CDIF-Discovery-Shapes.ttl (64 shapes)
complete — ShaclValidation/CDIF-Complete-Shapes.ttl (76 shapes, adds provenance + data description)

Quick start:

# Validate against discovery shapes
python ShaclValidation/ShaclJSONLDContext.py my-metadata.jsonld ShaclValidation/CDIF-Discovery-Shapes.ttl

# Generate a markdown validation report
python ShaclValidation/generate_shacl_report.py my-metadata.jsonld ShaclValidation/CDIF-Complete-Shapes.ttl -o report.md

# Regenerate shapes after building block changes
python ShaclValidation/generate_shacl_shapes.py --profile discovery
python ShaclValidation/generate_shacl_shapes.py --profile complete

See ShaclValidation/README.md for detailed documentation on the SHACL tools, shapes architecture, report format, and how to add new building block shapes.

Recommendation: Use both JSON Schema and SHACL validation for comprehensive coverage. batch_validate.py runs both automatically across multiple file groups.

Conformance Validation

validate_conformance.py inspects JSON-LD files for schema:subjectOf/dcterms:conformsTo claims and validates each file against the profile schemas it claims to conform to. It supports cdifCore, CDIFDiscovery, CDIFDataDescription, CDIFcomplete, and building block schemas (provenance, manifest/archive distribution).

# Validate a directory of JSON-LD files against their claimed profiles
python validate_conformance.py testJSONMetadata/

# Summary only
python validate_conformance.py testJSONMetadata/ --summary

# Verbose per-file error details
python validate_conformance.py testJSONMetadata/ --verbose

Conformance URIs with ada: prefix are ignored. URIs are normalized (trailing slashes stripped, dataDescription mapped to data_description).
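
The normalization rules above can be sketched in a few lines. This is an illustration of the described behaviour, not the code in validate_conformance.py; the function name is hypothetical and the URI in the note below is a placeholder, not a real CDIF profile URI.

```python
# Hypothetical sketch of the conformsTo URI normalization described above.
def normalize_conforms_to(uri):
    """Return the normalized URI, or None if the URI is ignored."""
    # URIs with the ada: prefix are not checked.
    if uri.startswith("ada:"):
        return None
    # Strip trailing slashes, then map the alternate spelling.
    uri = uri.rstrip("/")
    return uri.replace("dataDescription", "data_description")
```

For example, `normalize_conforms_to("https://example.org/profiles/dataDescription/")` returns `"https://example.org/profiles/data_description"`.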

GeoCodes Harvester

geocodes_harvester.py harvests dataset metadata from the EarthCube GeoCodes catalog (~170K indexed datasets). It queries the Blazegraph SPARQL endpoint, fetches original JSON-LD from source landing pages when available, and optionally converts records to CDIF profile format.

# List publishers and dataset counts
python geocodes_harvester.py --list-publishers

# Harvest 5 records from diverse publishers, convert to CDIF Discovery
python geocodes_harvester.py --count 5 --output ./examples --cdif discovery

# Harvest from a specific publisher
python geocodes_harvester.py --publisher "PANGAEA" --count 3 --output ./examples

# Harvest without CDIF conversion (raw schema.org JSON-LD)
python geocodes_harvester.py --count 5 --output ./raw-examples

The CDIF conversion handles: property prefixing (schema:), @context/@type normalization, @list wrapping for creators, distribution fixes, subjectOf with conformsTo, type mappings (FundingAgency to Organization, Grant to MonetaryGrant, Croissant sc:Dataset to Dataset), Person name synthesis, and sameAs array normalization. All conversions are documented in each record's subjectOf description. Extra properties from the source are preserved (open-world assumption).

DCAT Conversion

DCAT/dcat_to_cdif.py converts DCAT JSON-LD catalogs or individual dataset records to CDIF-conformant schema.org JSON-LD. It maps DCAT/Dublin Core properties to schema.org equivalents per the CDIF DCAT implementation guide.

# List datasets in a DCAT catalog
python DCAT/dcat_to_cdif.py catalog.jsonld --list

# Convert selected records, validate output
python DCAT/dcat_to_cdif.py catalog.jsonld --output ./examples \
  --select 0,3,5 --catalog-name "My Catalog" --catalog-url "https://example.org/" \
  --validate

Key mappings: dcterms:title → schema:name, dcterms:description → schema:description, dcterms:modified → schema:dateModified, dcterms:license → schema:license, dcterms:accessRights → schema:conditionsOfAccess, dcat:keyword → schema:keywords, dcat:Distribution → schema:DataDownload, dcterms:spatial → schema:spatialCoverage, dcterms:temporal → schema:temporalCoverage. Unmapped properties are preserved (open-world assumption). The script auto-detects the Discovery vs. Core profile based on spatial/temporal content.
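
The property renames listed above amount to a lookup table with pass-through for unknown keys. This is a minimal sketch of that idea only; the real DCAT/dcat_to_cdif.py also restructures values and maps types such as dcat:Distribution, which is not shown here. The names `DCAT_TO_SCHEMA` and `rename_properties` are hypothetical.

```python
# Property renames taken from the key mappings listed above.
DCAT_TO_SCHEMA = {
    "dcterms:title": "schema:name",
    "dcterms:description": "schema:description",
    "dcterms:modified": "schema:dateModified",
    "dcterms:license": "schema:license",
    "dcterms:accessRights": "schema:conditionsOfAccess",
    "dcat:keyword": "schema:keywords",
    "dcterms:spatial": "schema:spatialCoverage",
    "dcterms:temporal": "schema:temporalCoverage",
}


def rename_properties(record):
    """Rename mapped keys and keep unmapped keys unchanged (open world)."""
    return {DCAT_TO_SCHEMA.get(key, key): value for key, value in record.items()}
```

A key with no entry in the table, such as a publisher-specific extension property, is carried over unchanged.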

See DCAT/README.md for the full property mapping table, PSDI catalog example, and known limitations.

MetadataExamples

The MetadataExamples/ directory contains sample CDIF JSON-LD documents for testing:

File	Technique	Description
`tof-htk9-f770.json`	ToF-SIMS	Time-of-flight mass spectrometry particle analysis
`xrd-2j0t-gq80.json`	XRD	X-ray diffraction
`xanes-2arx-b516.json`	XANES	X-ray absorption near-edge structure
`yv1f-jb20.json`	--	General dataset
`test_se_na2so4-testschemaorg-cdiv3.json`	XAS	X-ray absorption spectroscopy with DDI-CDI data structure (WideDataStructure, InstanceVariable, ValueMapping). Uses `xas:` and `cdifq:` extension namespaces
`nwis-water-quality-longdata.json`	Water Quality	NWIS groundwater nutrient analysis (464 rows, 20 columns) in `cdi:LongStructureDataSet` long (narrow) format with `DescriptorComponent`/`ReferenceValueComponent` roles, `cdi:hasPhysicalMapping`, and 5 MeasureComponent domain variables. Validates against graph schema (`CDIF-graph-schema-2026.json`)
`prov-ocean-temp-example.json`	Ocean Temperature	Extended provenance example demonstrating `cdifProv` building block: action chaining (`schema:object`/`schema:result`), multi-typed `["schema:Action", "prov:Activity"]` activities, agents with Role wrappers, inline `schema:HowTo` methodology via `schema:actionProcess` with 3 steps, diverse instruments, facility location, and backward-compatible `prov:used`. Validates against graph schema

Corresponding Croissant output files are in the croissant/ directory.

DDI-CDI Resolved Schema

The ddi-cdi/cls-InstanceVariable-resolved.json file is a standalone JSON Schema (Draft 2020-12) for the DDI-CDI InstanceVariable class, derived from ddi-cdi/ddi-cdi.schema_normative.json. It resolves all $ref references into a self-contained schema suitable for use in editors like oXygen without needing the full 395-definition DDI-CDI schema.

The resolved schema applies several transformations to make the schema practical:

Reverse properties removed - 767 _OF_ reverse relationship properties stripped (use JSON-LD @reverse instead)
catalogDetails removed - Catalog-level metadata omitted from all classes
Redundant classes omitted - cls-DataPoint, cls-Datum, cls-RepresentedVariable simplified to IRI-only references
XSD types inlined - Primitive types (xsd:string, xsd:integer, etc.) replaced with inline definitions
Patterns normalized - if/then/else array patterns converted to consistent anyOf
Frequency-based $ref resolution - Common definitions (>3 uses) in $defs; rare definitions inlined
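
The frequency-based rule can be illustrated by counting $ref occurrences and splitting on the threshold. This is a minimal sketch under the ">3 uses" rule stated above; how the actual generator counts references (for example, across nested or circular definitions) is not described here, and the function names are hypothetical.

```python
from collections import Counter


# Hypothetical sketch of frequency-based $ref classification.
def count_refs(node, counts=None):
    """Recursively count how often each $ref target appears in a schema."""
    counts = Counter() if counts is None else counts
    if isinstance(node, dict):
        ref = node.get("$ref")
        if isinstance(ref, str):
            counts[ref] += 1
        for value in node.values():
            count_refs(value, counts)
    elif isinstance(node, list):
        for item in node:
            count_refs(item, counts)
    return counts


def split_by_frequency(schema, threshold=3):
    """Return (shared, inlined): targets used more than threshold times, and the rest."""
    counts = count_refs(schema)
    shared = sorted(ref for ref, n in counts.items() if n > threshold)
    inlined = sorted(ref for ref, n in counts.items() if n <= threshold)
    return shared, inlined
```

Targets in the first list would stay in $defs; those in the second would be copied into each place they are used.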

See ddi-cdi/cls-InstanceVariable-resolved-README.md for full details on the generation process, circular reference analysis, and transformation rationale.

Notes

The framed tree schemas (CDIFCompleteSchema.json, CDIFDiscoverySchema.json) are generated from building block profile resolved schemas using generate_validation_schema.py. The hand-maintained originals and the all-in-one CDIF-JSONLD-schema-2026.json are in archive/.
The legacy schema (CDIF-JSONLD-schema-schemaprefix.json) is still available in archive/ for older documents.
All schema.org elements require the schema: prefix for SHACL validation compatibility.
The frame ensures that after framing, the output structure matches what the JSON schema expects.
For SHACL validation, use the corresponding .shacl or .ttl files in this repository.
@type flexibility: All @type definitions in the framed schemas use anyOf to accept either a string ("schema:Dataset") or an array (["schema:Dataset"]). JSON-LD framing may compact single-element arrays to strings; FrameAndValidate.py recursively normalizes all @type values back to arrays.
spdx:Checksum typing: All spdx:checksum objects must include "@type": "spdx:Checksum". This is required by both the JSON Schema (required: ["@type"]) and SHACL shapes (sh:class spdx:Checksum).
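
The @type normalization mentioned above can be sketched as a recursive walk over the framed document. This illustrates the idea only and is not the code in FrameAndValidate.py; the function name `normalize_types` is hypothetical.

```python
# Hypothetical sketch of recursive @type normalization.
def normalize_types(node):
    """Wrap every string-valued @type in a list, in place, and return the node."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "@type" and isinstance(value, str):
                # Replacing the value of an existing key is safe during iteration.
                node[key] = [value]
            else:
                normalize_types(value)
    elif isinstance(node, list):
        for item in node:
            normalize_types(item)
    return node
```

After this step, nested objects such as schema:subjectOf carry array-valued @type just like the root, so downstream code does not need to handle both shapes.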
