#828 Add support for structs and arrays in EBCDIC writer. #829

Merged
yruslan merged 3 commits into master from feature/828-writer-struct-occurs
Mar 6, 2026
Conversation


@yruslan yruslan commented Mar 4, 2026

Closes #828

Summary by CodeRabbit

  • New Features

    • Added a record-to-human-readable debugging utility for inspecting nested rows.
    • Introduced a new nested COBOL writer with RDW header support (big- and little-endian), nested/group/array handling, and RDW length validation.
  • Refactor

    • Replaced the previous basic writer implementation with the new nested writer approach.
  • Tests

    • Added tests for the debugging utility and nested COBOL writing with RDW handling.


coderabbitai bot commented Mar 4, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 75f17458-8f8d-424e-8f4a-bbb32854226c

📥 Commits

Reviewing files that changed from the base of the PR and between 6d830be and f0762e5.

📒 Files selected for processing (1)
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala
🚧 Files skipped from review as they are similar to previous changes (1)
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala

Walkthrough

Replaces the basic record combiner with a new NestedRecordCombiner and WriterAst for writing COBOL records (including nested groups and OCCURS), adds RDW header handling (big/little-endian), exposes a public UDF printRowUdf, removes BasicRecordCombiner, and adds unit/integration tests for the writer and UDF.

Changes

• Writer AST — spark-cobol/src/main/scala/.../writer/WriterAst.scala
  New sealed trait WriterAst with node case classes (Filler, PrimitiveField, GroupField, PrimitiveArray, GroupArray) and getter type aliases to model writer-side extraction and structure.
• Nested Record Combiner — spark-cobol/src/main/scala/.../writer/NestedRecordCombiner.scala
  New NestedRecordCombiner implementing RecordCombiner: builds a writer AST from the copybook and Spark schema, validates RDW parameters, supports big- and little-endian RDW headers, and serializes Rows into fixed-size byte arrays, handling groups, OCCURS, fillers and encodings.
• Selector change — spark-cobol/src/main/scala/.../writer/RecordCombinerSelector.scala
  selectCombiner now returns NestedRecordCombiner (replaces the previous BasicRecordCombiner choice).
• Removed implementation — spark-cobol/src/main/scala/.../writer/BasicRecordCombiner.scala
  Deleted BasicRecordCombiner and its companion object; the previous writer implementation is removed in favor of NestedRecordCombiner.
• Debugging UDF — spark-cobol/src/main/scala/.../utils/SparkUtils.scala
  Added a public val printRowUdf: UserDefinedFunction that formats a Spark Row (including nested Rows and sequences) into a readable string; imports updated to include Row and UDF types.
• Tests (UDF and writer) — spark-cobol/src/test/scala/.../utils/SparkUtilsSuite.scala, spark-cobol/src/test/scala/.../writer/NestedWriterSuite.scala
  Added a unit test for printRowUdf and an integration-style NestedWriterSuite that writes a DataFrame to COBOL bytes (EBCDIC) and asserts the produced bytes match the expected output.
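The WriterAst node hierarchy described in the summary can be sketched in simplified form. This is a hypothetical stand-in, not the actual cobrix code: node names mirror the summary, but the sizes and the record-size computation are illustrative (the real nodes carry copybook-derived metadata and Row getters).

```scala
// Hypothetical simplification of the WriterAst node hierarchy.
// Real nodes hold copybook field metadata and Row-extraction getters.
object WriterAstSketch {
  sealed trait WriterAst
  // A filler occupies bytes in the record but carries no data from the input row
  case class Filler(sizeBytes: Int) extends WriterAst
  case class PrimitiveField(name: String, sizeBytes: Int) extends WriterAst
  case class GroupField(name: String, children: Seq[WriterAst]) extends WriterAst
  case class PrimitiveArray(name: String, sizeBytes: Int, occurs: Int) extends WriterAst
  case class GroupArray(name: String, occurs: Int, children: Seq[WriterAst]) extends WriterAst

  // Total record size is the sum over the tree, with OCCURS multiplying element size
  def recordSize(node: WriterAst): Int = node match {
    case Filler(n)                 => n
    case PrimitiveField(_, n)      => n
    case GroupField(_, ch)         => ch.map(recordSize).sum
    case PrimitiveArray(_, n, occ) => n * occ
    case GroupArray(_, occ, ch)    => ch.map(recordSize).sum * occ
  }
}
```

Modeling the layout as a tree is what lets the writer serialize nested groups and OCCURS by a single recursive traversal, and it is also why an absent field must be replaced by an offset-preserving node rather than dropped.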

Sequence Diagram(s)

sequenceDiagram
    participant Selector as RecordCombinerSelector
    participant Combiner as NestedRecordCombiner
    participant AST as WriterAst Builder
    participant RDD as Partition/Rows
    participant Buffer as ByteBuffer

    Selector->>Combiner: selectCombiner(cobolSchema, readerParameters)
    activate Combiner
    Combiner->>Combiner: validate RDW params (hasRdw, endianness, lengths)
    Combiner->>RDD: processRDD(rdd, copybook, schema, recordSize, recordLengthHeader, startOffset, hasRdw, isRdwBigEndian)
    activate RDD
    RDD->>AST: constructWriterAst(copybook, schema)
    activate AST
    AST-->>RDD: WriterAst (GroupField root)
    deactivate AST

    loop per Row
        RDD->>Buffer: allocate byte[arraySize]
        RDD->>Buffer: write RDW header (if hasRdw)
        RDD->>Buffer: writeToBytes(WriterAst, row, buffer, offset)
        activate Buffer
        Buffer->>Buffer: traverse WriterAst (groups, primitives, arrays, fillers)
        Buffer-->>RDD: populated byte[]
        deactivate Buffer
    end

    RDD-->>Combiner: RDD[Array[Byte]]
    deactivate RDD
    Combiner-->>Selector: RDD[Array[Byte]]
    deactivate Combiner
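The "write RDW header" step in the diagram can be sketched with java.nio.ByteBuffer. This assumes the conventional 4-byte RDW layout (two length bytes followed by two zero bytes); endianness and whether the length counts the 4 header bytes are left to the caller, mirroring the PR's isRdwBigEndian and isRdwPartRecLength parameters.

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Sketch of RDW header encoding under the conventional mainframe layout:
// a 4-byte prefix whose first two bytes hold the record length and whose
// last two bytes are reserved zeros.
object RdwSketch {
  def rdwHeader(payloadLength: Int, bigEndian: Boolean, lengthIncludesHeader: Boolean): Array[Byte] = {
    val len = if (lengthIncludesHeader) payloadLength + 4 else payloadLength
    require(len >= 0 && len <= 0xFFFF, s"RDW length out of range: $len")
    val buf = ByteBuffer.allocate(4)
      .order(if (bigEndian) ByteOrder.BIG_ENDIAN else ByteOrder.LITTLE_ENDIAN)
    buf.putShort(len.toShort) // record length
    buf.putShort(0.toShort)   // reserved, always zero
    buf.array()
  }
}
```

For example, a 300-byte payload with a big-endian header that excludes the header bytes encodes as 0x01 0x2C 0x00 0x00.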

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

🐰 I hopped through Rows and nested trees,
I packed OCCURS and groups with nimble ease.
RDW headers gleam in endian light,
WriterAst guided bytes just right.
Hooray — carrots and records tonight! 🥕

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main changes: adding support for structs and arrays in the EBCDIC writer.
Linked Issues check ✅ Passed All code changes directly address the linked issue #828 requirements: WriterAst and NestedRecordCombiner implement struct and OCCURS array support; SparkUtils.printRowUdf adds debugging capability; tests validate the new functionality.
Out of Scope Changes check ✅ Passed All changes are in scope: new AST structures for structs/arrays, NestedRecordCombiner implementation, integration via RecordCombinerSelector, supporting utilities, and comprehensive tests.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


github-actions bot commented Mar 4, 2026

JaCoCo code coverage report - 'cobol-parser'

Overall Project 91.4% 🍏

There is no coverage information present for the Files changed

github-actions bot commented Mar 4, 2026

JaCoCo code coverage report - 'spark-cobol'

Overall Project 83.16% -8.26% 🍏
Files changed 47.4%

File Coverage
RecordCombinerSelector.scala 100% 🍏
SparkUtils.scala 92.15% 🍏
WriterAst.scala 92.06%
NestedRecordCombiner.scala 79.93% -20.78%

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (4)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala (1)

72-82: Extract duplicated getFieldDefinition into a shared writer utility.

This helper duplicates the one in BasicRecordCombiner, which increases the risk of the two drifting apart in error messaging.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala`
around lines 72 - 82, Extract the duplicated getFieldDefinition logic from
NestedRecordCombiner into a single shared utility (e.g., a new object/class like
WriterUtils or FieldDefinitionUtil) and have both NestedRecordCombiner and
BasicRecordCombiner call that shared method; move the current implementation
(which computes pic and usage by pattern-matching on field.dataType for Integral
and Decimal) into the new shared function signature getFieldDefinition(field:
Primitive): String, keep the exact behavior and return formatting, replace the
local implementations in NestedRecordCombiner and BasicRecordCombiner with calls
to WriterUtils.getFieldDefinition, and add any necessary imports so compilation
remains unchanged.
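The extraction the prompt describes can be sketched with simplified stand-in types. The real Primitive and data-type classes come from cobrix's parser AST, so the names and PIC formatting below are hypothetical; the point is the dedup pattern of moving the pattern match into one shared object.

```scala
// Hypothetical sketch of the shared utility the nitpick proposes.
// Integral/Decimal/Primitive are simplified stand-ins for cobrix parser types.
object WriterUtilsSketch {
  sealed trait DataType
  case class Integral(pic: String) extends DataType
  case class Decimal(pic: String) extends DataType
  case class Primitive(name: String, dataType: DataType)

  // Both combiners would call this single method instead of keeping private
  // copies, so PIC/error formatting cannot drift between implementations.
  def getFieldDefinition(field: Primitive): String = field.dataType match {
    case Integral(pic) => s"${field.name} PIC $pic"
    case Decimal(pic)  => s"${field.name} PIC $pic"
  }
}
```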
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/NestedWriterSuite.scala (1)

46-48: Remove commented debug scaffolding before merge.

These commented blocks make the test harder to scan and maintain.

Also applies to: 59-69, 84-90, 126-127

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/NestedWriterSuite.scala`
around lines 46 - 48, Remove the leftover commented debug scaffolding in
NestedWriterSuite.scala: delete the commented lines that reference
parsedCopybook and generateRecordLayoutPositions (e.g., "//val parsedCopybook =
CopybookParser.parse(copybook)" and
"//println(parsedCopybook.generateRecordLayoutPositions())"), as well as the
other commented println/debug blocks noted at the ranges (59-69, 84-90,
126-127); ensure no commented debug prints remain so the test file contains only
production test code and assertions.
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RecordCombinerSelector.scala (1)

23-35: Update selector Scaladoc to match current behavior.

Lines 27–28 still describe a basic-only combiner path, but Line 35 now always returns NestedRecordCombiner.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RecordCombinerSelector.scala`
around lines 23 - 35, The Scaladoc for selectCombiner is outdated: it claims
only a basic fixed record length combiner exists while the method
unconditionally returns NestedRecordCombiner; update the comment for def
selectCombiner(cobolSchema: CobolSchema, readerParameters: ReaderParameters):
RecordCombiner to accurately describe current behavior (it returns a
NestedRecordCombiner), remove or rephrase the sentence about "only basic fixed
record length combiner" and optionally note that future selection logic may be
added to return different RecordCombiner implementations.
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/utils/SparkUtilsSuite.scala (1)

871-886: Add nested/null coverage for printRowUdf behavior.

This test is useful, but adding one case with nested struct/array and one with a null struct input would better protect the recursive path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/utils/SparkUtilsSuite.scala`
around lines 871 - 886, Add two additional test cases in SparkUtilsSuite to
exercise printRowUdf's recursive handling: one case where the input row contains
a nested struct and array (e.g., a column with a struct containing an array and
another nested struct) and assert the pretty JSON matches expected output; and
one case where the struct passed into SparkUtils.printRowUdf is null (create a
DataFrame with a nullable struct column set to null) and assert the function
gracefully returns a JSON representation for a null record. Use the existing
pattern (create DataFrame, call
SparkUtils.printRowUdf(struct(...)).as("record"), convert with
SparkUtils.convertDataFrameToPrettyJSON and compare with assertResults) so the
tests target SparkUtils.printRowUdf and convertDataFrameToPrettyJSON paths.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/utils/SparkUtils.scala`:
- Around line 563-589: The UDF printRowUdf can throw NPE when invoked with a
null top-level Row; update the udf body to guard the input by checking if row ==
null (or row.isInstanceOf[Row] check) and return a safe string like "null"
before calling rowToString; keep the existing rowToString implementation for
non-null Rows (it already handles nested nulls), so modify only the outer lambda
in printRowUdf to return "null" when row is null and otherwise call
rowToString(row).

In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala`:
- Around line 156-170: The current buildPrimitiveNode returns None when a schema
field is missing which causes subsequent writes to shift and corrupt offsets;
instead, change buildPrimitiveNode (and the analogous group-node builder) to
always return a WriterAst: when fieldIndexOpt is present return the existing
PrimitiveField or PrimitiveArray (using PrimitiveField(p, row => row.get(idx))
and PrimitiveArray(p, row => row.getAs[mutable.WrappedArray[AnyRef]](idx))), and
when absent return an offset-preserving fallback WriterAst that emits the
correct number of blank/default bytes for the primitive (respecting p.occurs for
arrays and the primitive size/metadata for single fields) so children are still
written at their original offsets. Ensure the fallback implements the same
WriterAst interface used by the sequential writer so accumulated sizes remain
unchanged.
- Around line 35-39: The combine method in NestedRecordCombiner currently treats
RDW only for RecordFormat.VariableLength but lacks an explicit guard for
RecordFormat.VariableBlock; update the method to fail fast when
readerParameters.recordFormat == RecordFormat.VariableBlock by throwing a clear
exception (e.g., IllegalArgumentException) explaining VB is unsupported, and
keep RDW-related logic (isRdwBigEndian, isRdwPartRecLength, rdwAdjustment) only
for RecordFormat.VariableLength so behavior is unambiguous; locate this in the
combine(...) implementation and add the guard before any RDW/adjustment logic.
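The null guard the first inline comment asks for can be illustrated Spark-free. This sketch models a Row as Seq[Any] instead of org.apache.spark.sql.Row, so the types and formatting are assumptions; the shape matches the comment's advice: guard the top-level input, and let the recursive formatter handle nested nulls.

```scala
// Spark-free sketch of recursive row formatting with a top-level null guard.
// A Row is modeled as Seq[Any]; nested rows/arrays are both Seq here.
object PrintRowSketch {
  def rowToString(values: Seq[Any]): String =
    values.map(formatValue).mkString("[", ", ", "]")

  private def formatValue(v: Any): String = v match {
    case null           => "null"          // nested nulls are already handled
    case nested: Seq[_] => rowToString(nested.asInstanceOf[Seq[Any]])
    case other          => other.toString
  }

  // The outer entry point is where the NPE fix lives: return "null" for a
  // null top-level row instead of dereferencing it.
  def printRow(row: Seq[Any]): String =
    if (row == null) "null" else rowToString(row)
}
```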


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e277d245-6ffb-40a9-913a-656313526fb5

📥 Commits

Reviewing files that changed from the base of the PR and between 5e31b62 and 1e3736c.

📒 Files selected for processing (6)
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/utils/SparkUtils.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RecordCombinerSelector.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/WriterAst.scala
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/utils/SparkUtilsSuite.scala
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/NestedWriterSuite.scala

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala (1)

37-42: ⚠️ Potential issue | 🟠 Major

Fail fast for RecordFormat.VariableBlock instead of writing through an unsupported path.

VariableBlock currently falls through without an explicit guard, which makes behavior ambiguous for an unsupported format.

Proposed guard
 override def combine(df: DataFrame, cobolSchema: CobolSchema, readerParameters: ReaderParameters): RDD[Array[Byte]] = {
+    if (readerParameters.recordFormat == RecordFormat.VariableBlock) {
+      throw new NotImplementedError(
+        "Writing VariableBlock (VB) format is not supported. Use VariableLength or fixed-length formats."
+      )
+    }
     val hasRdw = readerParameters.recordFormat == RecordFormat.VariableLength
Based on learnings: Guideline: In the COBOL writer (BasicRecordCombiner and related writer classes), do not implement VariableBlock (VB) format writing; add an explicit guard and keep RDW behavior only for VariableLength.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala`
around lines 37 - 42, The combine method in NestedRecordCombiner.scala must
explicitly reject RecordFormat.VariableBlock instead of falling through; add a
guard at the start of combine that checks readerParameters.recordFormat ==
RecordFormat.VariableBlock and throws an IllegalArgumentException (or similar)
indicating VariableBlock is not supported, while preserving the existing
RDW-only logic that applies when readerParameters.recordFormat ==
RecordFormat.VariableLength and using the existing readerParameters fields
(isRdwPartRecLength, isRdwBigEndian, rdwAdjustment) for RDW handling.
🧹 Nitpick comments (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala (1)

159-160: Update docs/log text to match the current missing-field behavior.

The code now returns Filler(...) for absent fields, but comments still say “Returns None” and logs say “Will be skipped.”

Suggested wording update
- * Returns `None` when the field is absent from the schema (e.g. filtered out during reading).
+ * When the field is absent from the schema, returns a [[Filler]] node preserving byte length.
@@
-      log.error(s"Field '$path${p.name}' is not found in Spark schema. Will be skipped.")
+      log.error(s"Field '$path${p.name}' is not found in Spark schema. Will be replaced by filler.")
@@
- * Returns `None` when the field is absent from the schema.
+ * When the field is absent from the schema, returns a [[Filler]] node preserving byte length.
@@
-      log.error(s"Field '$path${g.name}' is not found in Spark schema. Will be skipped.")
+      log.error(s"Field '$path${g.name}' is not found in Spark schema. Will be replaced by filler.")

Also applies to: 180-181, 190-191, 222-223

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala`
around lines 159 - 160, Update the documentation comments and log messages in
NestedRecordCombiner to reflect the current behavior: absent schema fields are
returned/written as Filler(...) rather than None. Locate the comment blocks and
log calls inside the NestedRecordCombiner class (the doc comments around the
methods handling missing fields and the log statements that currently say "Will
be skipped") and change the phrasing to state that the method returns a Filler
record for missing fields and that such fields "will be written as Filler" (or
equivalent), ensuring all occurrences mentioned (the comment/log blocks around
the earlier ranges) are updated for consistency.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 418136d7-6c62-43af-9e65-4d9bf05d391d

📥 Commits

Reviewing files that changed from the base of the PR and between 1e3736c and 6d830be.

📒 Files selected for processing (3)
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/utils/SparkUtils.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala
💤 Files with no reviewable changes (1)
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala

@yruslan yruslan merged commit a76495d into master Mar 6, 2026
7 checks passed
@yruslan yruslan deleted the feature/828-writer-struct-occurs branch March 6, 2026 08:01


Development

Successfully merging this pull request may close these issues.

Add support of writing EBCDIC files with nested structs and OCCURS

1 participant