#828 Add support for structs and arrays in EBCDIC writer. #829
Conversation
Walkthrough

Replaces the basic record combiner with a new NestedRecordCombiner and WriterAst for writing COBOL records (including nested groups and OCCURS), adds RDW header handling (big/little-endian), and exposes a public UDF.

Changes
Sequence Diagram(s)

sequenceDiagram
participant Selector as RecordCombinerSelector
participant Combiner as NestedRecordCombiner
participant AST as WriterAst Builder
participant RDD as Partition/Rows
participant Buffer as ByteBuffer
Selector->>Combiner: selectCombiner(cobolSchema, readerParameters)
activate Combiner
Combiner->>Combiner: validate RDW params (hasRdw, endianness, lengths)
Combiner->>RDD: processRDD(rdd, copybook, schema, recordSize, recordLengthHeader, startOffset, hasRdw, isRdwBigEndian)
activate RDD
RDD->>AST: constructWriterAst(copybook, schema)
activate AST
AST-->>RDD: WriterAst (GroupField root)
deactivate AST
loop per Row
RDD->>Buffer: allocate byte[arraySize]
RDD->>Buffer: write RDW header (if hasRdw)
RDD->>Buffer: writeToBytes(WriterAst, row, buffer, offset)
activate Buffer
Buffer->>Buffer: traverse WriterAst (groups, primitives, arrays, fillers)
Buffer-->>RDD: populated byte[]
deactivate Buffer
end
RDD-->>Combiner: RDD[Array[Byte]]
deactivate RDD
Combiner-->>Selector: RDD[Array[Byte]]
deactivate Combiner
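The walkthrough and diagram above describe a WriterAst tree that is traversed once per row to populate a byte buffer. Below is a minimal, Spark-free sketch of such a node hierarchy; the node names (GroupField, PrimitiveField, Filler) come from the walkthrough, but the signatures are assumptions: Spark's Row is stood in by `Map[String, Any]`, and a real combiner would emit EBCDIC-encoded bytes rather than the ASCII used here.

```scala
// Hypothetical sketch of a WriterAst node hierarchy; not the actual cobrix code.
sealed trait WriterAst {
  // Writes this node into `buf` at `offset`, returning the next free offset.
  def write(record: Map[String, Any], buf: Array[Byte], offset: Int): Int
}

// Emits blank bytes so sibling fields keep their original offsets.
final case class Filler(size: Int) extends WriterAst {
  def write(record: Map[String, Any], buf: Array[Byte], offset: Int): Int = {
    java.util.Arrays.fill(buf, offset, offset + size, 0x40.toByte) // 0x40 = EBCDIC space
    offset + size
  }
}

// Writes one fixed-width field, padded/truncated to its declared size.
final case class PrimitiveField(name: String, size: Int) extends WriterAst {
  def write(record: Map[String, Any], buf: Array[Byte], offset: Int): Int = {
    val text = record.getOrElse(name, "").toString.padTo(size, ' ').take(size)
    text.getBytes("US-ASCII").copyToArray(buf, offset) // real writer would encode EBCDIC
    offset + size
  }
}

// A group writes its children sequentially, threading the offset through.
final case class GroupField(children: Seq[WriterAst]) extends WriterAst {
  def write(record: Map[String, Any], buf: Array[Byte], offset: Int): Int =
    children.foldLeft(offset)((off, child) => child.write(record, buf, off))
}
```

This mirrors the sequential traversal in the diagram: each node consumes exactly its declared byte width, so accumulated offsets stay consistent across groups, primitives, and fillers.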
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
JaCoCo code coverage report - 'cobol-parser'
JaCoCo code coverage report - 'spark-cobol'
Actionable comments posted: 3
🧹 Nitpick comments (4)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala (1)
72-82: Extract duplicated getFieldDefinition into a shared writer utility. This helper is duplicated with BasicRecordCombiner, which increases drift risk in error messaging.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala` around lines 72 - 82, Extract the duplicated getFieldDefinition logic from NestedRecordCombiner into a single shared utility (e.g., a new object/class like WriterUtils or FieldDefinitionUtil) and have both NestedRecordCombiner and BasicRecordCombiner call that shared method; move the current implementation (which computes pic and usage by pattern-matching on field.dataType for Integral and Decimal) into the new shared function signature getFieldDefinition(field: Primitive): String, keep the exact behavior and return formatting, replace the local implementations in NestedRecordCombiner and BasicRecordCombiner with calls to WriterUtils.getFieldDefinition, and add any necessary imports so compilation remains unchanged.

spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/NestedWriterSuite.scala (1)
46-48: Remove commented debug scaffolding before merge. These commented blocks make the test harder to scan and maintain.
Also applies to: 59-69, 84-90, 126-127
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/NestedWriterSuite.scala` around lines 46 - 48, Remove the leftover commented debug scaffolding in NestedWriterSuite.scala: delete the commented lines that reference parsedCopybook and generateRecordLayoutPositions (e.g., "//val parsedCopybook = CopybookParser.parse(copybook)" and "//println(parsedCopybook.generateRecordLayoutPositions())"), as well as the other commented println/debug blocks noted at the ranges (59-69, 84-90, 126-127); ensure no commented debug prints remain so the test file contains only production test code and assertions.

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RecordCombinerSelector.scala (1)
23-35: Update selector Scaladoc to match current behavior. Lines 27–28 still describe a basic-only combiner path, but line 35 now always returns NestedRecordCombiner.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RecordCombinerSelector.scala` around lines 23 - 35, The Scaladoc for selectCombiner is outdated: it claims only a basic fixed record length combiner exists while the method unconditionally returns NestedRecordCombiner; update the comment for def selectCombiner(cobolSchema: CobolSchema, readerParameters: ReaderParameters): RecordCombiner to accurately describe current behavior (it returns a NestedRecordCombiner), remove or rephrase the sentence about "only basic fixed record length combiner" and optionally note that future selection logic may be added to return different RecordCombiner implementations.

spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/utils/SparkUtilsSuite.scala (1)
871-886: Add nested/null coverage for printRowUdf behavior. This test is useful, but adding one case with a nested struct/array and one with a null struct input would better protect the recursive path.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/utils/SparkUtilsSuite.scala` around lines 871 - 886, Add two additional test cases in SparkUtilsSuite to exercise printRowUdf's recursive handling: one case where the input row contains a nested struct and array (e.g., a column with a struct containing an array and another nested struct) and assert the pretty JSON matches expected output; and one case where the struct passed into SparkUtils.printRowUdf is null (create a DataFrame with a nullable struct column set to null) and assert the function gracefully returns a JSON representation for a null record. Use the existing pattern (create DataFrame, call SparkUtils.printRowUdf(struct(...)).as("record"), convert with SparkUtils.convertDataFrameToPrettyJSON and compare with assertResults) so the tests target SparkUtils.printRowUdf and convertDataFrameToPrettyJSON paths.
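The suggested tests exercise a recursive rendering path over nested structs and arrays. As a Spark-free illustration of that recursion, here is a hypothetical sketch in which nested structs are modeled as Maps and arrays as Seqs; the actual printRowUdf operates on Spark Rows, so only the recursion shape is the point here.

```scala
// Hypothetical recursive renderer; a stand-in for the Row-based logic under test.
def render(value: Any): String = value match {
  case null         => "null"                 // null leaves render as the literal "null"
  case m: Map[_, _] => m.map { case (k, v) => s"$k=${render(v)}" }.mkString("{", ", ", "}")
  case s: Seq[_]    => s.map(render).mkString("[", ", ", "]")
  case other        => other.toString
}
```

A nested/null test case would then assert over inputs such as `Map("a" -> Seq(1, 2), "b" -> null)`, covering both the array branch and the null branch in one pass.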
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/utils/SparkUtils.scala`:
- Around line 563-589: The UDF printRowUdf can throw NPE when invoked with a
null top-level Row; update the udf body to guard the input by checking if row ==
null (or row.isInstanceOf[Row] check) and return a safe string like "null"
before calling rowToString; keep the existing rowToString implementation for
non-null Rows (it already handles nested nulls), so modify only the outer lambda
in printRowUdf to return "null" when row is null and otherwise call
rowToString(row).
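The guard described above is a one-line change at the outer lambda. The sketch below illustrates the pattern without Spark: `rowToString` and the Row type are stood in by placeholders (a Map), since only the outer null check is what the comment asks for.

```scala
// Placeholder for the existing rowToString, which already handles nested nulls.
def rowToString(row: Map[String, Any]): String =
  row.map { case (k, v) => s"$k: ${if (v == null) "null" else v}" }.mkString("[", ", ", "]")

// Hypothetical guarded wrapper: only the outer lambda changes, returning "null"
// for a null top-level input instead of throwing an NPE.
def printRowSafe(row: Map[String, Any]): String =
  if (row == null) "null" else rowToString(row)
```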
In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala`:
- Around line 156-170: The current buildPrimitiveNode returns None when a schema
field is missing which causes subsequent writes to shift and corrupt offsets;
instead, change buildPrimitiveNode (and the analogous group-node builder) to
always return a WriterAst: when fieldIndexOpt is present return the existing
PrimitiveField or PrimitiveArray (using PrimitiveField(p, row => row.get(idx))
and PrimitiveArray(p, row => row.getAs[mutable.WrappedArray[AnyRef]](idx))), and
when absent return an offset-preserving fallback WriterAst that emits the
correct number of blank/default bytes for the primitive (respecting p.occurs for
arrays and the primitive size/metadata for single fields) so children are still
written at their original offsets. Ensure the fallback implements the same
WriterAst interface used by the sequential writer so accumulated sizes remain
unchanged.
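The offset-preserving fallback described above can be reduced to a small pattern: never return `None` for a missing field, return a filler node of the same byte width instead. The sketch below uses simplified stand-in types (the real combiner works with cobrix Primitive nodes and Spark row accessors), so names and signatures here are assumptions.

```scala
// Simplified stand-ins for the combiner's internal node types.
sealed trait Node { def size: Int }
final case class Prim(name: String, size: Int) extends Node
final case class Fill(size: Int) extends Node

// Hypothetical builder: a present field becomes a primitive node; an absent one
// becomes a filler of identical total width, so sibling offsets never shift.
def buildPrimitiveNode(name: String, byteSize: Int, occurs: Int, fieldIndex: Option[Int]): Node =
  fieldIndex match {
    case Some(_) => Prim(name, byteSize * occurs)
    case None    => Fill(byteSize * occurs) // offset-preserving fallback
  }
```

Because both branches report the same `size`, the sequential writer's accumulated offsets remain unchanged whether or not the field exists in the Spark schema.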
- Around line 35-39: The combine method in NestedRecordCombiner currently treats
RDW only for RecordFormat.VariableLength but lacks an explicit guard for
RecordFormat.VariableBlock; update the method to fail fast when
readerParameters.recordFormat == RecordFormat.VariableBlock by throwing a clear
exception (e.g., IllegalArgumentException) explaining VB is unsupported, and
keep RDW-related logic (isRdwBigEndian, isRdwPartRecLength, rdwAdjustment) only
for RecordFormat.VariableLength so behavior is unambiguous; locate this in the
combine(...) implementation and add the guard before any RDW/adjustment logic.
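The fail-fast guard asked for above amounts to rejecting one enum value before any RDW logic runs. A minimal sketch, with the RecordFormat enum modeled locally so the snippet is self-contained (cobrix's actual enum lives in the cobol-parser module):

```scala
// Local stand-in for cobrix's RecordFormat enumeration.
sealed trait RecordFormat
case object FixedLength    extends RecordFormat
case object VariableLength extends RecordFormat
case object VariableBlock  extends RecordFormat

// Hypothetical guard placed at the top of combine(...): VB fails fast,
// everything else proceeds (RDW handling applies only to VariableLength).
def validateFormat(fmt: RecordFormat): Boolean = fmt match {
  case VariableBlock =>
    throw new IllegalArgumentException(
      "Writing VariableBlock (VB) format is not supported. Use VariableLength or fixed-length formats.")
  case _ => true
}
```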
---
Nitpick comments:
In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala`:
- Around line 72-82: Extract the duplicated getFieldDefinition logic from
NestedRecordCombiner into a single shared utility (e.g., a new object/class like
WriterUtils or FieldDefinitionUtil) and have both NestedRecordCombiner and
BasicRecordCombiner call that shared method; move the current implementation
(which computes pic and usage by pattern-matching on field.dataType for Integral
and Decimal) into the new shared function signature getFieldDefinition(field:
Primitive): String, keep the exact behavior and return formatting, replace the
local implementations in NestedRecordCombiner and BasicRecordCombiner with calls
to WriterUtils.getFieldDefinition, and add any necessary imports so compilation
remains unchanged.
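The shape of the extracted helper might look like the sketch below. The real signature takes a cobrix Primitive and matches on its dataType; here Integral and Decimal are simplified stand-in case classes, so field names and PIC formatting are assumptions rather than the library's actual output.

```scala
// Simplified stand-ins for cobrix's Integral/Decimal data types.
sealed trait CobolType
final case class Integral(precision: Int, signed: Boolean) extends CobolType
final case class Decimal(precision: Int, scale: Int, signed: Boolean) extends CobolType

// Hypothetical shared utility, so both combiners format field definitions identically.
object WriterUtils {
  def getFieldDefinition(name: String, dataType: CobolType): String = dataType match {
    case Integral(p, signed) =>
      val pic = (if (signed) "S" else "") + s"9($p)"
      s"$name PIC $pic"
    case Decimal(p, sc, signed) =>
      val pic = (if (signed) "S" else "") + s"9(${p - sc})V9($sc)"
      s"$name PIC $pic"
  }
}
```

With a single implementation, the error messages built from these definitions can no longer drift between NestedRecordCombiner and BasicRecordCombiner.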
In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RecordCombinerSelector.scala`:
- Around line 23-35: The Scaladoc for selectCombiner is outdated: it claims only
a basic fixed record length combiner exists while the method unconditionally
returns NestedRecordCombiner; update the comment for def
selectCombiner(cobolSchema: CobolSchema, readerParameters: ReaderParameters):
RecordCombiner to accurately describe current behavior (it returns a
NestedRecordCombiner), remove or rephrase the sentence about "only basic fixed
record length combiner" and optionally note that future selection logic may be
added to return different RecordCombiner implementations.
In
`@spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/utils/SparkUtilsSuite.scala`:
- Around line 871-886: Add two additional test cases in SparkUtilsSuite to
exercise printRowUdf's recursive handling: one case where the input row contains
a nested struct and array (e.g., a column with a struct containing an array and
another nested struct) and assert the pretty JSON matches expected output; and
one case where the struct passed into SparkUtils.printRowUdf is null (create a
DataFrame with a nullable struct column set to null) and assert the function
gracefully returns a JSON representation for a null record. Use the existing
pattern (create DataFrame, call
SparkUtils.printRowUdf(struct(...)).as("record"), convert with
SparkUtils.convertDataFrameToPrettyJSON and compare with assertResults) so the
tests target SparkUtils.printRowUdf and convertDataFrameToPrettyJSON paths.
In
`@spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/NestedWriterSuite.scala`:
- Around line 46-48: Remove the leftover commented debug scaffolding in
NestedWriterSuite.scala: delete the commented lines that reference
parsedCopybook and generateRecordLayoutPositions (e.g., "//val parsedCopybook =
CopybookParser.parse(copybook)" and
"//println(parsedCopybook.generateRecordLayoutPositions())"), as well as the
other commented println/debug blocks noted at the ranges (59-69, 84-90,
126-127); ensure no commented debug prints remain so the test file contains only
production test code and assertions.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e277d245-6ffb-40a9-913a-656313526fb5
📒 Files selected for processing (6)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/utils/SparkUtils.scala
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RecordCombinerSelector.scala
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/WriterAst.scala
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/utils/SparkUtilsSuite.scala
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/NestedWriterSuite.scala
…k schema it is replaced by a Filler.
♻️ Duplicate comments (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala (1)
37-42: ⚠️ Potential issue | 🟠 Major

Fail fast for RecordFormat.VariableBlock instead of writing through an unsupported path. VariableBlock currently falls through without an explicit guard, which makes behavior ambiguous for an unsupported format.

Based on learnings: Guideline: In the COBOL writer (BasicRecordCombiner and related writer classes), do not implement VariableBlock (VB) format writing; add an explicit guard and keep RDW behavior only for VariableLength.

Proposed guard
  override def combine(df: DataFrame, cobolSchema: CobolSchema, readerParameters: ReaderParameters): RDD[Array[Byte]] = {
+   if (readerParameters.recordFormat == RecordFormat.VariableBlock) {
+     throw new NotImplementedError(
+       "Writing VariableBlock (VB) format is not supported. Use VariableLength or fixed-length formats."
+     )
+   }
    val hasRdw = readerParameters.recordFormat == RecordFormat.VariableLength

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala` around lines 37 - 42, The combine method in NestedRecordCombiner.scala must explicitly reject RecordFormat.VariableBlock instead of falling through; add a guard at the start of combine that checks readerParameters.recordFormat == RecordFormat.VariableBlock and throws an IllegalArgumentException (or similar) indicating VariableBlock is not supported, while preserving the existing RDW-only logic that applies when readerParameters.recordFormat == RecordFormat.VariableLength and using the existing readerParameters fields (isRdwPartRecLength, isRdwBigEndian, rdwAdjustment) for RDW handling.
🧹 Nitpick comments (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala (1)
159-160: Update docs/log text to match the current missing-field behavior. The code now returns Filler(...) for absent fields, but comments still say “Returns None” and logs say “Will be skipped.”

Suggested wording update

- * Returns `None` when the field is absent from the schema (e.g. filtered out during reading).
+ * When the field is absent from the schema, returns a [[Filler]] node preserving byte length.
@@
- log.error(s"Field '$path${p.name}' is not found in Spark schema. Will be skipped.")
+ log.error(s"Field '$path${p.name}' is not found in Spark schema. Will be replaced by filler.")
@@
- * Returns `None` when the field is absent from the schema.
+ * When the field is absent from the schema, returns a [[Filler]] node preserving byte length.
@@
- log.error(s"Field '$path${g.name}' is not found in Spark schema. Will be skipped.")
+ log.error(s"Field '$path${g.name}' is not found in Spark schema. Will be replaced by filler.")

Also applies to: 180-181, 190-191, 222-223
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala` around lines 159 - 160, Update the documentation comments and log messages in NestedRecordCombiner to reflect the current behavior: absent schema fields are returned/written as Filler(...) rather than None. Locate the comment blocks and log calls inside the NestedRecordCombiner class (the doc comments around the methods handling missing fields and the log statements that currently say "Will be skipped") and change the phrasing to state that the method returns a Filler record for missing fields and that such fields "will be written as Filler" (or equivalent), ensuring all occurrences mentioned (the comment/log blocks around the earlier ranges) are updated for consistency.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala`:
- Around line 37-42: The combine method in NestedRecordCombiner.scala must
explicitly reject RecordFormat.VariableBlock instead of falling through; add a
guard at the start of combine that checks readerParameters.recordFormat ==
RecordFormat.VariableBlock and throws an IllegalArgumentException (or similar)
indicating VariableBlock is not supported, while preserving the existing
RDW-only logic that applies when readerParameters.recordFormat ==
RecordFormat.VariableLength and using the existing readerParameters fields
(isRdwPartRecLength, isRdwBigEndian, rdwAdjustment) for RDW handling.
---
Nitpick comments:
In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala`:
- Around line 159-160: Update the documentation comments and log messages in
NestedRecordCombiner to reflect the current behavior: absent schema fields are
returned/written as Filler(...) rather than None. Locate the comment blocks and
log calls inside the NestedRecordCombiner class (the doc comments around the
methods handling missing fields and the log statements that currently say "Will
be skipped") and change the phrasing to state that the method returns a Filler
record for missing fields and that such fields "will be written as Filler" (or
equivalent), ensuring all occurrences mentioned (the comment/log blocks around
the earlier ranges) are updated for consistency.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 418136d7-6c62-43af-9e65-4d9bf05d391d
📒 Files selected for processing (3)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/utils/SparkUtils.scala
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala
💤 Files with no reviewable changes (1)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/BasicRecordCombiner.scala
Closes #828
Summary by CodeRabbit
New Features
Refactor
Tests