
[VL] Enable enhanced tests for spark 4.0 & fix failures#11868

Open
infvg wants to merge 1 commit into apache:main from infvg:icebergspark40fix

Conversation

Contributor

@infvg infvg commented Apr 2, 2026

This PR enables enhanced tests for Spark 4.0 and fixes failing SQL queries in Iceberg caused by the new metadata columns.

@infvg infvg force-pushed the icebergspark40fix branch 2 times, most recently from 2887032 to ef0d7ac on April 2, 2026 08:54
@zhouyuan zhouyuan changed the title Enable enhanced tests for spark 4.0 & fix failures [VL] Enable enhanced tests for spark 4.0 & fix failures Apr 2, 2026
auto inputRowVector = batch.getRowVector();
auto inputRowType = asRowType(inputRowVector->type());

// Filter columns to match the expected schema (rowType_)
Contributor

The metadata columns should be appended at the end or the beginning of the schema, and the number of metadata columns should be a fixed value, so could we simplify the logic?

Contributor

Also, the metadata column names are specific names, so we only need to match the pattern to decide whether a column is a metadata column. Could you show an example schema to help us understand this issue?

Contributor Author

I think we can simplify it by using field IDs. Field IDs greater than Integer.MAX_VALUE - 200 are reserved for metadata columns:
https://iceberg.apache.org/spec/#reserved-field-ids

Contributor Author

We can just slice the columns and remove any additional columns that appear at the end, so we don't have to add any loops.

Contributor

Yes, that's what I want

@infvg infvg force-pushed the icebergspark40fix branch 5 times, most recently from e6dc9d7 to c34f5b7 on April 8, 2026 18:47
// Filter out metadata columns from the Spark output schema and reorder to match Iceberg schema
// Spark 4.0 may include metadata columns in the output schema during UPDATE operations,
// but these should not be written to the Iceberg table
val schemaFieldMap = schema.fields.map(f => f.name -> f).toMap
Contributor

You could use IntelliJ to debug here and see the difference between writeSchema and schema: StructType; also use slice to take only some of the columns.

@infvg infvg force-pushed the icebergspark40fix branch 2 times, most recently from a12d8da to 73c9e38 on April 8, 2026 20:31
Member

zhouyuan commented Apr 8, 2026

@infvg Thanks for the fix. Please fix the CI.

dataSink_->appendData(batch.getRowVector());
auto inputRowVector = batch.getRowVector();

auto outputRowVector = std::make_shared<RowVector>(
Contributor

Why do you need it? Can you just set the rowType_?

Co-authored-by: Yuan <yuanzhou@apache.org>
@infvg infvg force-pushed the icebergspark40fix branch from 73c9e38 to 7704838 on April 10, 2026 17:28

3 participants