Improve HED search - Phase II object-based search #1293

@VisLab

Plan 2: Improve current object-based search

HED's basic search capabilities are integral to analyses such as epoching. Recent work has addressed some of the implementation issues affecting search efficiency; this plan covers further potential improvements.

Fix remaining performance and correctness issues in the QueryHandler system from issue #1268. PRs 1281-1282 already fixed 5 of 10 items; this plan covers the remaining 5 (items #3, #6, #7, #8, #10).

Phase A: Make SearchResult hashable (fixes #7, #8)

The AND merge in ExpressionAnd.merge_and_groups() has O(n³) complexity (self-documented as "trash and slow") and the OR dedup in ExpressionOr.handle_expr() is O(n²). Both stem from linear scans to detect duplicate SearchResult objects.

  1. Add __hash__ and __eq__ to SearchResult in hed/models/query_util.py based on (id(group), frozenset(id(c) for c in children)) — this preserves the existing identity-based semantics (the current has_same_children method uses is checks)
  2. Replace the O(n³) dedup loop in ExpressionAnd.merge_and_groups() (hed/models/query_expressions.py lines 121-142) with set-based lookup
  3. Replace the O(n²) dedup loop in ExpressionOr.handle_expr() (hed/models/query_expressions.py lines 228-238) with set-based lookup
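
The three steps above can be sketched as follows. This is a minimal illustration, not the actual class: the SearchResult below is a toy stand-in that assumes only a group attribute and a children list, and dedup() is a hypothetical helper showing the set-based lookup that would replace the linear scans.

```python
class SearchResult:
    """Toy stand-in for SearchResult in hed/models/query_util.py (assumed shape)."""

    def __init__(self, group, children):
        self.group = group
        self.children = list(children)

    def _key(self):
        # Identity-based key: two results match only if they refer to the
        # very same group object and the very same child objects.
        return (id(self.group), frozenset(id(c) for c in self.children))

    def __eq__(self, other):
        if not isinstance(other, SearchResult):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        return hash(self._key())


def dedup(results):
    """Drop duplicate results in expected O(n) time, keeping first-seen order."""
    seen = set()
    unique = []
    for result in results:
        if result not in seen:  # hash lookup instead of a linear scan
            seen.add(result)
            unique.append(result)
    return unique


if __name__ == "__main__":
    group, a, b = ["group"], ["A"], ["B"]
    r1 = SearchResult(group, [a, b])
    r2 = SearchResult(group, [b, a])  # same objects, different order
    r3 = SearchResult(group, [a])
    print(len(dedup([r1, r2, r3, r1])))
```

Two caveats apply to this sketch. Because the key is built from id(), it is only stable while the referenced objects are alive; each result holds references to its group and children, so that holds for as long as the result itself is in the set. The frozenset also ignores child order and repeated children, which should be checked against what has_same_children currently does.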

Can start immediately — no dependencies.

Phase B: Resolve tag form inconsistencies (#10)

Parallel with Phase A.

Each find_* method in hed/models/hed_group.py uses a different tag property for comparison:

| Method | Tag property used | What it matches |
|---|---|---|
| find_tags() (L435) | short_base_tag.casefold() | Base tag without extension/value |
| find_wildcard_tags() (L464) | short_tag.casefold() prefix | Short form including value, prefix match |
| find_exact_tags() (L501) | HedTag.__eq__ | short_tag then org_tag.casefold() fallback |
| find_tags_with_term() (L566) | tag_terms tuple | All ancestor terms from schema |

  1. Audit whether each method's choice of tag form is correct for its intended semantics. In particular, find_exact_tags() is called from query_expressions.py with plain strings (e.g., "def/mydef"). HedTag.__eq__(str) compares self.casefold() == other.casefold(), where self.casefold() is str(self).casefold(), i.e. short_tag.casefold(). If a tag was written in long form in the source HED string, str(tag) still returns short_tag (when a schema entry exists), so short-form queries should match, but this needs verification with tests.
  2. Add targeted tests for cross-form matching scenarios:
    • Long-form written tags matched by short-form queries
    • Tags with extensions/values
    • Tags without a recognized schema entry (where str(tag) falls back to org_tag)
  3. Document the rationale for each find_* method's choice of tag form
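
The scenarios in step 2 can be illustrated with a toy model. ToyTag below is a hypothetical stand-in, not HedTag: it models only the string-comparison path as described in step 1 (str() returns short_tag when a schema entry exists and falls back to org_tag otherwise), and the tag strings are illustrative. Real tests would build tags from a loaded schema instead.

```python
class ToyTag:
    """Hypothetical stand-in for HedTag; models only comparison with a str."""

    def __init__(self, org_tag, short_tag=None):
        self.org_tag = org_tag      # text as written in the source HED string
        self.short_tag = short_tag  # None models "no schema entry found"

    def __str__(self):
        # Assumed behavior: short form when known, otherwise the original text
        return self.short_tag if self.short_tag is not None else self.org_tag

    def __eq__(self, other):
        if isinstance(other, str):
            return str(self).casefold() == other.casefold()
        return NotImplemented

    def __hash__(self):
        return hash(str(self).casefold())


if __name__ == "__main__":
    # Long-form written tag matched by a short-form query
    long_written = ToyTag("Event/Sensory-event", short_tag="Sensory-event")
    print(long_written == "sensory-event")

    # Tag with an extension/value
    with_value = ToyTag("Def/MyDef", short_tag="Def/MyDef")
    print(with_value == "def/mydef")

    # Tag without a schema entry: comparison falls back to org_tag
    unknown = ToyTag("Unknown/Path")
    print(unknown == "unknown/path", unknown == "path")
```

Each print corresponds to one bullet in step 2; in an actual test module these would become assertions against HedTag objects, which is where any divergence from the model above would show up.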

Phase C: Code quality (#6)

Parallel with Phase A.

  1. Replace the string-based wildcard check in the negation restriction (hed/models/query_handler.py line 141):
    if "?" in str(interior):
        raise HedQueryError("Cannot negate wildcards...")
    with a has_wildcard property on Expression subclasses that inspects the expression tree. The current check converts the entire subtree to a string and looks for ?, which is overly broad: ~(Event && Action) is accepted, but ~([A, ?]) is rejected even though the wildcard is nested inside a group.
  2. Remove stale TODO comments for already-fixed issues
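
A tree-based check for step 1 might look like the sketch below. All class names here (Expression, Tag, Wildcard, And, Group, QueryError) and the check_negation() helper are hypothetical stand-ins for illustration; the real Expression hierarchy in hed/models/query_expressions.py is shaped differently.

```python
class QueryError(ValueError):
    """Stand-in for HedQueryError in this sketch."""


class Expression:
    """Hypothetical base node holding child expressions."""

    def __init__(self, *children):
        self.children = list(children)

    @property
    def has_wildcard(self):
        # Default: a node has a wildcard if any of its children does
        return any(child.has_wildcard for child in self.children)


class Tag(Expression):
    def __init__(self, name):
        super().__init__()
        self.name = name


class Wildcard(Expression):
    @property
    def has_wildcard(self):
        return True


class And(Expression):
    pass


class Group(Expression):
    pass


def check_negation(interior):
    """Reject negation of a subtree that contains a wildcard node."""
    if interior.has_wildcard:
        raise QueryError("Cannot negate wildcards")


if __name__ == "__main__":
    check_negation(And(Tag("Event"), Tag("Action")))  # accepted
    try:
        check_negation(Group(Tag("A"), Wildcard()))
    except QueryError as err:
        print(f"rejected: {err}")
```

As written, this rejects the same nested case as the string check, but the decision is made on node types and no longer on rendered text. Because has_wildcard is a per-class property, a narrower rule (for example, treating wildcards inside groups differently) could be expressed by overriding it or adding a second property, which string matching cannot do.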

Phase D: Documentation alignment

  1. The HED search guide references QueryParser and TagExpressionParser — update to QueryHandler
  2. Document the @ (not-in-line) operator in user-facing documentation (currently only documented in code comments and tests)
  3. Add examples showing exact-match group with optional portion {A: B} and {A, B:} syntax

Files to modify

| File | What to change |
|---|---|
| hed/models/query_util.py | Add __hash__/__eq__ to SearchResult |
| hed/models/query_expressions.py | Replace dedup loops in merge_and_groups() and ExpressionOr.handle_expr() |
| hed/models/hed_group.py | Audit/document find_* methods (lines 435-600) |
| hed/models/hed_tag.py | Reference: __eq__ (L647), tag_terms (L55/342), short_tag (L87) |
| hed/models/query_handler.py | Refactor negation wildcard check (L141) |
| tests/models/test_query_handler.py | Add cross-form and edge-case tests |
| tests/models/test_query_util.py | Add SearchResult hash/equality tests |

Verification

  1. Run full test suite: python -m unittest discover tests -v
  2. Run spec tests: python -m unittest discover spec_tests -v
  3. Benchmark AND/OR dedup on HED strings with 50+ tags to verify performance improvement
  4. Test cross-form matching (long-form source strings with short-form queries and vice versa)
  5. Lint: ruff check hed/ tests/ && ruff format --check hed/ tests/ && typos
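
For step 3, a rough timing harness could look like the following. It is a sketch on synthetic data, not on real HED strings: make_items() is a hypothetical generator whose hashable tuples stand in for SearchResult objects, and the two dedup functions model the linear-scan and set-based approaches. A real benchmark would run actual queries through QueryHandler.

```python
import timeit


def dedup_linear(items):
    """Deduplicate with a list membership test (linear scan per lookup)."""
    unique = []
    for item in items:
        if item not in unique:
            unique.append(item)
    return unique


def dedup_hashed(items):
    """Deduplicate with a set membership test (expected O(1) per lookup)."""
    seen = set()
    unique = []
    for item in items:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique


def make_items(n_unique, repeats):
    # Synthetic hashable keys standing in for SearchResult objects
    return [(i, frozenset((i, i + 1))) for i in range(n_unique)] * repeats


if __name__ == "__main__":
    for n in (50, 200, 800):
        items = make_items(n, repeats=3)
        # Both strategies must agree before their timings are compared
        assert dedup_linear(items) == dedup_hashed(items)
        t_linear = timeit.timeit(lambda: dedup_linear(items), number=20)
        t_hashed = timeit.timeit(lambda: dedup_hashed(items), number=20)
        print(f"n={n:4d}  linear={t_linear:.4f}s  hashed={t_hashed:.4f}s")
```

Timings depend on the machine and input shape, so the useful signal is how the gap between the two columns grows with n.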

Decisions
