Redesign of repo tooling for mixed URI prefixes in 7.1#729
Redesign of repo tooling for mixed URI prefixes in 7.1#729tychonievich merged 13 commits intov7.1from
Conversation
Closes #727 Unblocks #722 and #723 There's a lot going on here, and I spent several weeks trying to minimize this PR, split it into pieces, etc, but couldn't find a way that worked. So here's an outline of what's changed. 1. In the spec, many tables (notably enumsets) now have an explicit URI entry. I made the URIs their own column, not the two-line cells used for URIs of events and attributes earlier, mostly because I thought we have space. We can refactor that into using <br> if we prefer. 2. I re-wrote `build/uri-def.py` from the ground up as `build/extract-yaml-tsv.py`. `uri-def.py` was a 50-line script I wrote that I slowly changed into a 500-line monstristy that I find almost impossible to read or maintain. And it had many assumptions about the spec's format not only in its code (that's inevitable when parsing human text into machine-readable files) ubt also in its design. Trying to add the more versatile URIs was such a headache I decided to use it as an excuse for a long-verdue re-write. `extract-yaml-tsv.py` uses classes to have the code mirror the files it is designed to produce. It uses type annotations (which pass `mypy`) to avoid various typing bugs. It tries to do one thing in one place instead of doing all things possible everywhere. It's 100 lines longer than `uri-def.py` but faster to run and easier to maintain. I spent some effort to match ideosynracies of `uri-def.py`, like when to use `strings`, `'strings'`, or `"strings"` in the emitted YAML; but sometimes dong that was not possible without replicating `uri-def.py`'s worse parts, so there are some trivial changes (notably in the order of lists) in the files it creates. 3. I (accidentally) fixed various bugs of omission. `uri-def.py` was sloppy in how it handled the textual explantions in sections that had either gedstruct rules with alternatives or tables. Fixing that with a more principled approach means adding extra `specification:` entries for around half of all YAML files. 4. I tried but failed to match the sorting of the various .tsv files from `uri-def.py`, which was based on some kind of sorting of structures before they were used to generate multiple lines each. I was tempted to scrap that and simply sort the final lines, which is easy and well-defined, but this commit does not do that, instead doing a half-hearted effort at mimickiing the old sorting. I did all this on 7.1 because that was the original motivation, but if we like it I can port it back to the 7.0 branch (either by adding explicit URIs to the 7.0 spec or by adding special-case handling of tag-to-URI) with relative ease.
There was a problem hiding this comment.
If the payload URI (line 23) is a v7.1 URI, shouldn't the structure URI also be v7.1?
There was a problem hiding this comment.
Discussion in GEDCOM Steering Committee meeting:
- The script should detect this and report it as an error
- Luther will update this PR to do so
There was a problem hiding this comment.
Pull request overview
This PR redesigns the repository tooling to support mixed URI prefixes (g7: and g7.1:) in version 7.1 of the GEDCOM specification. The changes address issue #727 by making URI handling more explicit and flexible.
Changes:
- Adds explicit URI columns to enumeration and month tables in specification markdown files
- Completely rewrites
uri-def.pyasextract-yaml-tsv.pywith better structure using classes and type annotations - Fixes bugs where textual explanations were omitted from YAML files
Reviewed changes
Copilot reviewed 140 out of 140 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| specification/gedcom-6-appendix-calendars.md | Added URI column to month tables for all calendars |
| specification/gedcom-3-structures-4-enumerations.md | Added URI column to all enumeration value tables and removed concatenation text |
| specification/gedcom-3-structures-3-meaning.md | Added deprecation notice for HEAD.SOUR.DATA |
| build/extract-yaml-tsv.py | New Python script with class-based design to extract YAML/TSV files from spec |
| build/uri-def.py | Deleted (replaced by extract-yaml-tsv.py) |
| build/Makefile | Updated to use new extract-yaml-tsv.py script |
| extracted-files/* | Updated YAML and TSV files with new URIs, labels, and specification text |
Co-authored-by: Dave Thaler <dthaler1968@gmail.com>
Co-authored-by: Dave Thaler <dthaler1968@gmail.com>
Co-authored-by: Dave Thaler <dthaler1968@gmail.com>
Co-authored-by: Dave Thaler <dthaler1968@gmail.com>
Co-authored-by: Dave Thaler <dthaler1968@gmail.com>
Co-authored-by: Dave Thaler <dthaler1968@gmail.com>
Per committee discussion summarized at FamilySearch/GEDCOM#729 (comment) Signed-off-by: Dave Thaler <dthaler@armidalesoftware.com>
* substructures.tsv can have empty string in first column Per committee discussion summarized at FamilySearch/GEDCOM#729 (comment) Signed-off-by: Dave Thaler <dthaler@armidalesoftware.com> * Apply suggestion from @dthaler --------- Signed-off-by: Dave Thaler <dthaler@armidalesoftware.com> Co-authored-by: Dave Thaler <dthaler@armidalesoftware.com>
|
Based on FamilySearch/GEDCOM.io#187, this should be refactored to split out the YAML and TSV processing, extracting the YAML first and then creating the TSV from the YAML, using the same YAML-to-TSV script for this repo and the registries repo. |
|
I think I resolved the previous issues: YAML is extracted by a separate file from TSV; TSV could be extracted from registries too; and URI version checks consider substructures, payloads, enumeration sets, and enumeration values. The checks meant I found and fixed many transitive URI version changes:
Since this already touches so many files, I also standardized to simple sorting for TSV files and substructure/superstructure lists to make the scripts simpler and more future-proof. |
…efitions to reference correct URIs
| type: structure | ||
|
|
||
| uri: https://gedcom.io/terms/v7/ANCI | ||
| uri: https://gedcom.io/terms/v7.1/ANCI |
There was a problem hiding this comment.
Question: to confirm, the payload is a pointer to an updated 7.1 SUBM record, so just want to confirm that the rule is the URI does need to change?
There was a problem hiding this comment.
That is how I've been applying the rule. Pointing to a new type means having a new URI.
There was a problem hiding this comment.
Discussed in steering committee: we can see the reason in both ways, but decided that updating to 7.1 is appropriate
| type: enumeration | ||
|
|
||
| uri: https://gedcom.io/terms/v7/enum-EVEN | ||
| uri: https://gedcom.io/terms/v7.1/enum-EVEN |
There was a problem hiding this comment.
I don't think this URI should need to change, correct? Same question for other enum-* files.
There was a problem hiding this comment.
The definition of g7:enum-EVEN was "either g7:INDI-EVEN or g7:FAM-EVEN." Since both of alternatives have new URIs, I think g7.1:enum-EVEN should too. Admittedly the updated URI is not in machine-readable part of the spec of these, but it is present in them.
There was a problem hiding this comment.
Discussed in steering committee:
- The text might in enumset-ENUMATTR be better seen as non-normative because there are ways that a source could reference a CENS without being about either INDI-CENS or FAM-CENS, such as a source that lists when each census was performed by some nation or agency. Hence, maybe don't change these URIs because they are non-normative.
- That might be more complicated for enumset-EVEN, which is used in NO and is thus referencing a specific structure type.
- Having two different CENS URIs (v7 for ENUMATTR, v7.1 for ENUM) would be confusing. As the only cost to changing the URIs is extra files in the registry, we might update these URIs for consistency and simplicity.
|
Discussed in steering committee
|
Closes #727
Unblocks #722 and #723
There's a lot going on here, and I spent several weeks trying to minimize this PR, split it into pieces, etc, but couldn't find a way that worked. So here's an outline of what's changed.
In the spec, many tables (notably enumsets) now have an explicit URI entry.
I made the URIs their own column, not the two-line cells used for URIs of events and attributes earlier, mostly because I thought we have space. We can refactor that into using
<br>if we prefer.I re-wrote
build/uri-def.pyfrom the ground up asbuild/extract-yaml-tsv.py.uri-def.pywas a 50-line script I wrote that I slowly changed into a 500-line monstristy that I find almost impossible to read or maintain. And it had many assumptions about the spec's format not only in its code (that's inevitable when parsing human text into machine-readable files) but also in its design. Trying to add the more versatile URIs was such a headache I decided to use it as an excuse for a long-overdue re-write.extract-yaml-tsv.pyuses classes to have the code mirror the files it is designed to produce. It uses type annotations (which passmypy) to avoid various typing bugs. It tries to do one thing in one place instead of doing all things possible everywhere. It's 100 lines longer thanuri-def.pybut faster to run and easier to maintain.I spent some effort to match idiosyncrasies of
uri-def.py, like when to usestrings,'strings', or"strings"in the emitted YAML; but sometimes dong that was not possible without replicatinguri-def.py's worse parts, so there are some trivial changes (notably in the order of lists) in the files it creates.I (accidentally) fixed various bugs of omission.
uri-def.pywas sloppy in how it handled the textual explanations in sections that had either gedstruct rules with alternatives or tables. Fixing that with a more principled approach means adding extraspecification:entries for around half of all YAML files.I tried but failed to match the sorting of the various .tsv files from
uri-def.py, which was based on some kind of sorting of structures before they were used to generate multiple lines each. I was tempted to scrap that and simply sort the final lines, which is easy and well-defined, but this commit does not do that, instead doing a half-hearted effort at mimicking the old sorting.I did all this on 7.1 because that was the original motivation, but if we like it I can port it back to the 7.0 branch (either by adding explicit URIs to the 7.0 spec or by adding special-case handling of tag-to-URI) with relative ease.