Rename AttributeType/adapter terminology to Codec #1300
Open
dimitri-yatsenko wants to merge 32 commits into pre/v2.0 from claude/clarify-column-type-names-2dpns
Add fixed-point decimal as a core DataJoint type, allowing it to be recorded in field comments using :type: syntax for reconstruction. This provides scientists with a standardized type for exact numeric precision use cases (financial data, coordinates, etc.). Co-authored-by: dimitri-yatsenko <[email protected]>
Change the core binary type from 'blob' to 'bytes' to: - Enable cross-database portability (LONGBLOB in MySQL, BYTEA in PostgreSQL) - Free up native blob types (tinyblob, blob, mediumblob, longblob) - Use Pythonic naming that matches the stored/returned type Update all documentation to include PostgreSQL type mappings alongside MySQL mappings, making the cross-database support explicit. Co-authored-by: dimitri-yatsenko <[email protected]>
Correct the dtype documentation to clarify: - longblob is a native MySQL type for raw binary data (not serialized) - <djblob> should be used as dtype for serialized Python objects Co-authored-by: dimitri-yatsenko <[email protected]>
PostgreSQL supports native ENUM via CREATE TYPE ... AS ENUM, which provides similar semantics to MySQL ENUM (efficient storage, value enforcement, definition-order ordering). DataJoint will handle the separate type creation automatically. Co-authored-by: dimitri-yatsenko <[email protected]>
- Rewrite attributes.md to prioritize core types over native types - Add timezone policy: all datetime values stored as UTC - Timezone conversion is a presentation concern, not database concern - Update storage-types-spec.md with UTC policy and CURRENT_TIMESTAMP example Co-authored-by: dimitri-yatsenko <[email protected]>
Core types: - Add `text` as a core type for unlimited-length text (TEXT in both MySQL and PostgreSQL) Type modifiers policy: - Document that SQL modifiers (NOT NULL, DEFAULT, PRIMARY KEY, UNIQUE, COMMENT) are not allowed - DataJoint has its own syntax - Document that AUTO_INCREMENT is discouraged but allowed with native types - UNSIGNED is allowed as part of type semantics Co-authored-by: dimitri-yatsenko <[email protected]>
- UTF-8 required: utf8mb4 (MySQL) / UTF8 (PostgreSQL) - Case-sensitive by default: utf8mb4_bin / C collation - Database-level configuration via dj.config, not per-column - CHARACTER SET and COLLATE modifiers not allowed in type definitions - Like timezone, encoding is infrastructure configuration Co-authored-by: dimitri-yatsenko <[email protected]>
- Reorganize "Special DataJoint-only datatypes" as "AttributeTypes" - Add naming convention explanation (dj prefix, x prefix, @store suffix) - List all built-in AttributeTypes with categories: - Serialization types: <djblob>, <xblob> - File storage types: <object>, <content> - File attachment types: <attach>, <xattach> - File reference types: <filepath> - Fix inconsistent angle bracket notation throughout docs - Update example to use int32 core type and include <djblob> - Expand naming conventions in Key Design Decisions section Co-authored-by: dimitri-yatsenko <[email protected]>
The @ character now indicates external storage (object store vs database):
- No @ = internal (database): <blob>, <attach>
- @ present = external (object store): <blob@>, <attach@store>
- @ alone = default store: <blob@>
- @name = named store: <blob@cold>
Key changes:
- Rename <djblob> to <blob> (internal) and <xblob> to <blob@> (external)
- Rename <xattach> to <attach@> (external variant of <attach>)
- Mark <object@>, <content@>, <filepath@> as external-only types
- Replace dtype property with get_dtype(is_external) method
- Use core type 'bytes' instead of 'longblob' for portability
- Add type resolution and chaining documentation
- Update Storage Comparison and Built-in AttributeType Comparison tables
- Simplify from 7 built-in types to 5: blob, attach, object, content, filepath
Type chaining at declaration time:
    <blob> → get_dtype(False) → "bytes" → LONGBLOB/BYTEA
    <blob@> → get_dtype(True) → "<content>" → json → JSON/JSONB
    <object@> → get_dtype(True) → "json" → JSON/JSONB
Co-authored-by: dimitri-yatsenko <[email protected]>
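The `@` convention described in this commit can be illustrated with a small parser. This is only a sketch, not DataJoint's implementation: `CodecSpec` and `parse_codec_spec` are hypothetical names, and the assumption that store names consist of word characters is mine.

```python
import re
from typing import NamedTuple, Optional


class CodecSpec(NamedTuple):
    """Parsed form of a declaration such as <blob>, <blob@>, or <blob@cold>."""
    name: str
    is_external: bool
    store: Optional[str]  # None means internal, or external with the default store


# Assumption: codec and store names are made of word characters only.
_SPEC = re.compile(r"^<(?P<name>[A-Za-z_]\w*)(?P<at>@(?P<store>\w*))?>$")


def parse_codec_spec(spec: str) -> CodecSpec:
    """Split a codec declaration into its name, external flag, and store name."""
    match = _SPEC.match(spec.strip())
    if match is None:
        raise ValueError(f"not a codec specification: {spec!r}")
    is_external = match.group("at") is not None  # any '@' means object store
    store = match.group("store") or None         # '@' alone selects the default store
    return CodecSpec(match.group("name"), is_external, store)


print(parse_codec_spec("<blob>"))
print(parse_codec_spec("<blob@>"))
print(parse_codec_spec("<blob@cold>"))
```

The presence of `@` alone decides internal versus external; the store name only selects which object store is used.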
Co-authored-by: dimitri-yatsenko <[email protected]>
Rename <content@> to <hash@> throughout documentation: - More descriptive: indicates hash-based addressing mechanism - Familiar concept: works like a hash data structure - Storage folder: _content/ → _hash/ - Registry: ContentRegistry → HashRegistry The <hash@> type provides: - SHA256 hash-based addressing - Automatic deduplication - External-only storage (requires @) - Used as dtype by <blob@> and <attach@> Co-authored-by: dimitri-yatsenko <[email protected]>
- Use '= CURRENT_TIMESTAMP : datetime' syntax (not SQL DEFAULT) - Use uint64 core type instead of 'bigint unsigned' native type Co-authored-by: dimitri-yatsenko <[email protected]>
DataJoint handles nullability through the default value syntax: - Attribute is nullable iff default is NULL - No separate NOT NULL / NULL modifier needed - Examples: required, nullable, and default value cases Co-authored-by: dimitri-yatsenko <[email protected]>
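As a rough illustration of the rule "nullable iff the default is NULL", the check below inspects a single attribute line of the form `name = default : type # comment`. The helper `is_nullable` is hypothetical and deliberately naive: it assumes the default value contains no `:` or `#` characters.

```python
def is_nullable(attribute_line: str) -> bool:
    """Return True when the attribute's default is NULL (sketch only)."""
    declaration = attribute_line.split("#", 1)[0]     # drop the trailing comment
    left, _, _type = declaration.partition(":")        # "name = default" | "type"
    _name, equals, default = left.partition("=")
    return bool(equals) and default.strip().lower() == "null"


print(is_nullable("session_id : int32"))                          # required
print(is_nullable("notes = null : text"))                         # nullable
print(is_nullable("created = CURRENT_TIMESTAMP : datetime"))      # has a default, not nullable
```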
Hash metadata (hash, store, size) is stored directly in each table's JSON column - no separate registry table is needed. Garbage collection now scans all tables to find referenced hashes in JSON fields directly. Co-authored-by: dimitri-yatsenko <[email protected]>
MD5 (128-bit, 32-char hex) is sufficient for content-addressed deduplication: - Birthday bound ~2^64 provides adequate collision resistance for scientific data - 32-char vs 64-char hashes reduces storage overhead in JSON metadata - MD5 is ~2-3x faster than SHA256 for large files - Consistent with existing dj.hash module (key_hash, uuid_from_buffer) - Simplifies migration since only storage format changes, not the algorithm Added Hash Algorithm Choice section documenting the rationale. Co-authored-by: dimitri-yatsenko <[email protected]>
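The size difference and the deduplication behaviour mentioned in this commit can be seen with the standard `hashlib` module. This is a sketch under my own assumptions: `content_key` and the in-memory `store` dict are hypothetical stand-ins, not the layout DataJoint uses.

```python
import hashlib


def content_key(data: bytes) -> str:
    """Content address for deduplication; not used as a security measure."""
    return hashlib.md5(data, usedforsecurity=False).hexdigest()


payload = b"example serialized blob"
md5_hex = content_key(payload)
sha256_hex = hashlib.sha256(payload).hexdigest()
print(len(md5_hex), len(sha256_hex))  # hex digest lengths of the two algorithms

# Hypothetical in-memory stand-in for a hash-addressed store:
# identical content maps to the same key, so it is stored once.
store = {}
for blob in (payload, b"another blob", payload):
    store.setdefault(content_key(blob), blob)
print(len(store))
```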
- uuid_from_file was never called anywhere in the codebase - uuid_from_stream only existed to support uuid_from_file - Inlined the logic directly into uuid_from_buffer - Removed unused io and pathlib imports Co-authored-by: dimitri-yatsenko <[email protected]>
The implementation plan was heavily outdated with: - Old type names (<content>, <xblob>, <xattach> vs <hash@>, <blob@>, <attach@>) - Wrong hash algorithm (SHA256 vs MD5) - Wrong paths (_content/ vs _hash/) - References to removed HashRegistry table All relevant design information is now in storage-types-spec.md. Implementation details (ObjectRef API, staged_insert) will be documented in user-facing API docs when implemented. Co-authored-by: dimitri-yatsenko <[email protected]>
- Rename DECIMAL to NUMERIC in native types (decimal is in core types) - Rename TEXT to NATIVE_TEXT (text is in core types) - Change BLOB references to BYTES in heading.py (bytes is the core type name) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
Terminology changes in spec and user docs: - "AttributeTypes" → "Codec Types" (category name) - "AttributeType" → "Codec" (base class) - "@register_type" → "@dj.codec" (decorator) - "type_name" → "name" (class attribute) The term "Codec" better conveys the encode/decode semantics of these types, drawing on the familiar audio/video codec analogy. Code changes (class renaming, backward-compat aliases) to follow. Co-authored-by: dimitri-yatsenko <[email protected]>
Design improvements for Python 3.10+:
- Codecs auto-register when subclassed via __init_subclass__
- No decorator needed - just inherit from dj.Codec and set name
- Use register=False for abstract base classes
- Removed @dj.codec decorator from all examples
New API:
    class GraphCodec(dj.Codec):
        name = "graph"
        def encode(...): ...
        def decode(...): ...
Abstract bases:
    class ExternalOnlyCodec(dj.Codec, register=False):
        ...
Co-authored-by: dimitri-yatsenko <[email protected]>
- Codec.get_dtype(is_external) now determines storage type based on whether @ modifier is present in the declaration
- BlobCodec returns "bytes" for internal, "<hash>" for external
- AttachCodec returns "bytes" for internal, "<hash>" for external
- HashCodec, ObjectCodec, FilepathCodec enforce external-only usage
- Consolidates <blob>/<xblob> and <attach>/<xattach> into unified codecs
- Adds backward compatibility aliases for old type names
- Updates __init__.py with new codec exports (Codec, list_codecs, get_codec)
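A minimal sketch of how such a dtype chain might resolve, assuming the external flag is propagated to inner codecs. The classes here are simplified stand-ins, and `CODECS` and `resolve_chain` are hypothetical names rather than DataJoint's API.

```python
class BlobCodec:
    name = "blob"

    def get_dtype(self, is_external: bool) -> str:
        # Internal blobs live in the database; external ones delegate to <hash>.
        return "<hash>" if is_external else "bytes"


class HashCodec:
    name = "hash"

    def get_dtype(self, is_external: bool) -> str:
        if not is_external:
            raise ValueError("<hash> is external-only: use <hash@> or <hash@store>")
        return "json"  # metadata about the stored object


CODECS = {codec.name: codec for codec in (BlobCodec(), HashCodec())}


def resolve_chain(spec: str) -> list[str]:
    """Follow get_dtype() until a core type (no angle brackets) is reached."""
    is_external = "@" in spec          # decided once, by the outermost declaration
    chain, dtype = [spec], spec
    while dtype.startswith("<"):
        codec_name = dtype[1:-1].partition("@")[0]
        dtype = CODECS[codec_name].get_dtype(is_external)
        chain.append(dtype)
    return chain


print(" -> ".join(resolve_chain("<blob>")))
print(" -> ".join(resolve_chain("<blob@cold>")))
```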
- Remove legacy codecs (djblob, xblob, xattach, content) - Use unified codecs: <blob>, <attach>, <hash>, <object>, <filepath> - All codecs support both internal and external modes via @store modifier - Fix dtype chain resolution to propagate store to inner codecs - Fix fetch.py to resolve correct chain for external storage - Update tests to use new codec API (name, get_dtype method) - Fix imports: use content_registry for get_store_backend - Add 'local' store to mock_object_storage fixture All 471 tests pass. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Rename attribute_type.py → codecs.py - Rename builtin_types.py → builtin_codecs.py - Rename test_attribute_type.py → test_codecs.py - Rename get_adapter() → lookup_codec() - Rename attr.adapter → attr.codec in Attribute namedtuple - Update all imports and references throughout codebase - Update comments and docstrings to use codec terminology All 471 tests pass. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Remove AttributeType alias (use Codec directly) - Remove register_type function (codecs auto-register) - Remove deprecated type_name property (use name) - Remove list_types, get_type, is_type_registered, unregister_type aliases - Update all internal usages from type_name to name - Update tests to use new API The previous implementation was experimental; no backward compatibility is needed for the v2.0 release. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add codec-spec.md: detailed API specification for creating codecs - Add codecs.md: user guide with examples (replaces customtype.md) - Remove customtype.md (replaced by codecs.md) Documentation covers: - Codec base class and required methods - Auto-registration via __init_subclass__ - Codec composition/chaining - Plugin system via entry points - Built-in codecs (blob, hash, object, attach, filepath) - Complete examples for neuroscience workflows 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
The detailed implementation specification has served its purpose. User documentation is now in object.md, codec API in codec-spec.md, and type architecture in storage-types-spec.md. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
Code cleanup: - Remove backward compatibility aliases (ObjectType, AttachType, etc.) - Remove misleading comments about non-existent DJBlobType/ContentType - Remove unused build_foreign_key_parser_old function - Remove unused feature switches (ADAPTED_TYPE_SWITCH, FILEPATH_FEATURE_SWITCH) - Remove unused os import from errors.py - Rename ADAPTED type category to CODEC Documentation fixes: - Update mkdocs.yaml nav: customtype.md → codecs.md - Fix dead links in attributes.md pointing to customtype.md Terminology updates: - Replace "AttributeType" with "Codec" in all comments - Replace "Adapter" with "Codec" in docstrings - Fix SHA256 → MD5 in content_registry.py docstring Version bump to 2.0.0a6 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
Filepath feature is now always enabled; no feature flag needed. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
File renames: - schema_adapted.py → schema_codecs.py - test_adapted_attributes.py → test_codecs.py - test_type_composition.py → test_codec_chaining.py Content updates: - LOCALS_ADAPTED → LOCALS_CODECS - GraphType → GraphCodec, LayoutToFilepathType → LayoutCodec - Test class names: TestTypeChain* → TestCodecChain* - Test function names: test_adapted_* → test_codec_* - Updated docstrings and comments 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
Tests now automatically start MySQL and MinIO containers via testcontainers. No manual `docker-compose up` required - just run `pytest tests/`.
Changes:
- conftest.py: Add mysql_container and minio_container fixtures that auto-start containers when tests run and stop them afterward
- pyproject.toml: Add testcontainers[mysql,minio] dependency, update pixi tasks, remove pytest-env (no longer needed)
- docker-compose.yaml: Update docs to clarify it's optional for tests
- README.md: Comprehensive developer guide with clear instructions for running tests, pre-commit hooks, and PR submission checklist
Usage:
- Default: `pytest tests/` - testcontainers manages containers
- External: `DJ_USE_EXTERNAL_CONTAINERS=1 pytest` - use docker-compose
Benefits:
- Zero setup for developers - just `pip install -e ".[test]" && pytest`
- Dynamic ports (no conflicts with other services)
- Automatic cleanup after tests
- Simpler CI configuration
Version bump to 2.0.0a7
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
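The switch between managed and external containers might look roughly like the helper below. This is a sketch, not the project's conftest.py: `database_endpoint` and `start_container` are hypothetical, the container object is assumed to expose `host`, `port`, and `stop()`, and reading `DJ_HOST`/`DJ_PORT` for the external case is my assumption.

```python
import os
from contextlib import contextmanager
from typing import Callable, Iterator, Mapping, Tuple

Endpoint = Tuple[str, int]


@contextmanager
def database_endpoint(
    start_container: Callable[[], object],
    env: Mapping[str, str] = os.environ,
) -> Iterator[Endpoint]:
    """Yield (host, port) of the test database, starting a container if needed."""
    if env.get("DJ_USE_EXTERNAL_CONTAINERS"):
        # Externally managed (e.g. docker-compose): nothing to start or stop.
        yield env.get("DJ_HOST", "localhost"), int(env.get("DJ_PORT", "3306"))
        return
    container = start_container()  # e.g. a thin wrapper around a testcontainers container
    try:
        yield container.host, container.port  # dynamic port chosen at start-up
    finally:
        container.stop()                      # automatic cleanup after the tests
```

A session-scoped pytest fixture could wrap this context manager so that the container is started only when a test actually requests it.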
- Update settings tests to accept dynamic ports (testcontainers uses random ports instead of default 3306) - Fix test_top_restriction_with_keywords to use set comparison since dj.Top only guarantees which elements are selected, not their order - Bump version to 2.0.0a8 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
1ae815b to
15412ea
Compare
- Register requires_mysql and requires_minio marks in pyproject.toml
- Add pytest_collection_modifyitems hook to auto-mark tests based on fixture usage
- Remove autouse=True from configure_datajoint fixture so containers only start when needed
- Fix test_drop_unauthorized to use connection_test fixture
Tests can now run without Docker:
    pytest -m "not requires_mysql" # Run 192 unit tests
Full test suite still works:
    DJ_USE_EXTERNAL_CONTAINERS=1 pytest tests/ # 471 tests
Bump version to 2.0.0a9
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
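Auto-marking by fixture usage could be sketched as the conftest.py hook below. The fixture and mark names come from the commit messages, but the mapping `FIXTURE_MARKS` and the exact logic are assumptions, not the project's code.

```python
# conftest.py (sketch)

# Assumed mapping from container fixtures to the marks registered in pyproject.toml.
FIXTURE_MARKS = {
    "mysql_container": "requires_mysql",
    "minio_container": "requires_minio",
}


def pytest_collection_modifyitems(config, items):
    """Mark each collected test according to the container fixtures it uses."""
    for item in items:
        # fixturenames covers fixtures requested directly or through other fixtures.
        used = set(getattr(item, "fixturenames", ()))
        for fixture, mark in FIXTURE_MARKS.items():
            if fixture in used:
                item.add_marker(mark)  # add_marker accepts a mark name as a string
```

With marks applied this way, `pytest -m "not requires_mysql"` deselects every test that would otherwise trigger a MySQL container.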
15412ea to
fa47f47
Compare
Summary
This PR implements a comprehensive redesign of the custom type system, renaming "AttributeType/adapter" terminology to "Codec" and providing a cleaner, more intuitive API. It also simplifies the testing infrastructure with testcontainers.
Codec API Redesign
Key Changes
- Renamed the `AttributeType` base class to `Codec` - The new name better reflects the purpose: encoding Python objects for database storage and decoding them on retrieval
- Auto-registration via `__init_subclass__` - Codecs automatically register when their class is defined; no decorator needed
- `get_dtype(is_external)` method - Codecs dynamically return their underlying storage type based on whether external storage is used
- Codecs use `<name>` or `<name@store>` syntax consistently
- Renamed `ADAPTED` to `CODEC` in internal code
Built-in Codecs
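| Codec | Internal dtype | External dtype |
|---|---|---|
| `<blob>` | `bytes` | `<hash@>` |
| `<hash@>` | - | `json` |
| `<object@>` | - | `json` |
| `<attach>` | `bytes` | `<hash@>` |
| `<filepath@store>` | - | `json` |

New Codec API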
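As an illustration of the encode/decode contract, here is a self-contained sketch of a graph codec. In DataJoint the class would inherit from `dj.Codec`; the plain class, the adjacency-dict representation, the method signatures, and the choice of `json` as the storage type are all assumptions made for this example.

```python
import json
from typing import Dict, Set

Graph = Dict[str, Set[str]]  # node -> set of successor nodes


class GraphCodec:  # in DataJoint this would subclass dj.Codec
    name = "graph"

    def get_dtype(self, is_external: bool) -> str:
        return "json"  # assumed storage type for this example

    def encode(self, graph: Graph) -> dict:
        """Convert the in-memory graph into a JSON-compatible structure."""
        return {
            "nodes": sorted(graph),
            "edges": sorted([src, dst] for src, targets in graph.items() for dst in targets),
        }

    def decode(self, stored: dict) -> Graph:
        """Rebuild the adjacency mapping from the stored structure."""
        graph: Graph = {node: set() for node in stored["nodes"]}
        for src, dst in stored["edges"]:
            graph[src].add(dst)
        return graph


codec = GraphCodec()
original = {"a": {"b", "c"}, "b": {"c"}, "c": set()}
stored = json.loads(json.dumps(codec.encode(original)))  # simulate a database round trip
restored = codec.decode(stored)
print(restored == original)
```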
Testing Infrastructure: Testcontainers
Tests now use testcontainers to automatically manage MySQL and MinIO containers. No manual `docker-compose up` required.
New Developer Workflow
Install the test dependencies with `pip install -e ".[test]"`, then run `pytest tests/`.
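Benefits
- Zero setup for developers - just `pip install` and `pytest`
- Dynamic ports (no conflicts with other services)
- Automatic cleanup after tests
- Simpler CI configuration
Fallback: External Containers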
For development/debugging with persistent containers, start them with docker-compose and run `DJ_USE_EXTERNAL_CONTAINERS=1 pytest tests/`.
Dead Code & Terminology Cleanup
Removed
- Legacy API (`AttributeType`, `register_type`, `list_types`, `get_type`, etc.)
- Backward compatibility aliases (`ObjectType`, `AttachType`, `XAttachType`, `FilepathType`)
- Unused `build_foreign_key_parser_old()` function
- Unused feature switches (`ADAPTED_TYPE_SWITCH`, `FILEPATH_FEATURE_SWITCH`)
- `enable_filepath_feature` test fixture
- Misleading comments about non-existent `DJBlobType`/`ContentType`
- `object-type-spec.md` (implementation complete, info now in `object.md`)
- `pytest-env` dependency (testcontainers handles configuration)
Renamed
- `ADAPTED` → `CODEC` in declare.py and heading.py
- `schema_adapted.py` → `schema_codecs.py`
- `test_adapted_attributes.py` → `test_codecs.py`
- `test_type_composition.py` → `test_codec_chaining.py`
- Test names, comments, and docstrings updated to use `Codec` terminology
Updated Terminology
- `content_registry.py` docstring: SHA256 → MD5
Documentation Updates
New Documentation
- `codec-spec.md` - Detailed API specification for creating custom codecs
- `codecs.md` - User guide with examples (replaces `customtype.md`)
- `README.md` - Comprehensive developer guide with test/pre-commit instructions
Updated Documentation
- `mkdocs.yaml` - Navigation updated: `customtype.md` → `codecs.md`
- `attributes.md` - Fixed dead links, updated terminology
- `docker-compose.yaml` - Clarified it's optional for tests
Removed Documentation
- `object-type-spec.md` (redundant with `object.md`)
- `customtype.md` (replaced by `codecs.md`)
Other Changes in This Branch
Settings System
- `save_*` methods and `set_password` function
Type System
- Core types (`int8`, `int16`, `int32`, `int64`, `uint8`, etc.)
- Added `decimal(n,f)` to core types
- Added `text` type and documented type modifier policy
External Storage
- `<object@>` type for managed file/folder storage
Infrastructure
- `unit/` and `integration/` directories
- Version bump to `2.0.0a7`
Test Plan
- Codec chaining (`<blob@>` → `<hash@>` → storage) works
- Tests run against external containers (`DJ_USE_EXTERNAL_CONTAINERS=1`)
🤖 Generated with Claude Code