Skip to content

Conversation

@jkhaliqi
Copy link
Contributor

@jkhaliqi jkhaliqi commented Dec 3, 2025

Description

These changes require this velox PR to land first : facebookincubator/velox#12846
Prestissimo decimal type is incorrect when inserting values in Parquet. Decimal precision, scale are written incorrectly to Parquet. Need to change presto velox parquet-tools inspect output on parquet tools created by with decimal to be similar to presto java.

To Reproduce

create table decimaltest(col0 DECIMAL(10, 2));
CREATE TABLE
insert into decimaltest values(1200.00);
INSERT: 1 row

parquet meta file.parquet

Before the PR the output for prestissimo would be showing the col0 type as DECIMAL(6,2) instead of DECIMAL(10,2).

Created by: parquet-mr version 2.6.0 (build parquet-cpp-velox)
Properties: (none)
Schema:
message schema {
  optional int32 col0 (DECIMAL(6,2));
}

Expected behavior
Java writer does not have this issue. It uses the right precision, scale. AND with this PR we are getting the same output for the DECIMAL type and scale.

Created by: parquet-mr version 1.13.1 (build db4183109d5b734ec5930d870cdae161e408ddba)
Properties: (none)
Schema:
message hive_schema {
  optional fixed_len_byte_array(5) col0 (DECIMAL(10,2));
}

Motivation and Context

#26113

Impact

Test Plan

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.
  • If adding new dependencies, verified they have an OpenSSF Scorecard score of 5.0 or higher (or obtained explicit TSC approval for lower scores).

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== NO RELEASE NOTE ==

@jkhaliqi jkhaliqi requested review from a team as code owners December 3, 2025 19:26
@prestodb-ci prestodb-ci added the from:IBM PR from IBM label Dec 3, 2025
@prestodb-ci prestodb-ci requested review from a team, auden-woolfson and pratyakshsharma and removed request for a team December 3, 2025 19:26
@sourcery-ai
Copy link
Contributor

sourcery-ai bot commented Dec 3, 2025

Reviewer's Guide

Adjusts how TableWrite nodes build their RowType so that DECIMAL precision and scale from the connector insert handle are preserved when writing Parquet, aligning Prestissimo behavior with the Java writer.

Sequence diagram for building TableWriteNode with correct DECIMAL precision/scale

sequenceDiagram
  actor User
  participant PrestoPlanner
  participant VeloxQueryPlanConverterBase
  participant InsertTableHandle
  participant ConnectorInsertTableHandle
  participant ParquetWriter

  User->>PrestoPlanner: submit INSERT with DECIMAL(10,2)
  PrestoPlanner->>VeloxQueryPlanConverterBase: toVeloxQueryPlan(tableWriteNode)

  VeloxQueryPlanConverterBase->>InsertTableHandle: connectorInsertTableHandle()
  InsertTableHandle-->>VeloxQueryPlanConverterBase: ConnectorInsertTableHandle

  VeloxQueryPlanConverterBase->>ConnectorInsertTableHandle: getColumnHandleDataTypes()
  ConnectorInsertTableHandle-->>VeloxQueryPlanConverterBase: tableTypes (DECIMAL(10,2))

  VeloxQueryPlanConverterBase->>VeloxQueryPlanConverterBase: toRowType(columns, tableTypes, typeParser)
  VeloxQueryPlanConverterBase-->>PrestoPlanner: TableWriteNode with RowType(DECIMAL(10,2))

  PrestoPlanner->>ParquetWriter: execute TableWriteNode
  ParquetWriter-->>User: Parquet file with DECIMAL(10,2) schema
Loading

Updated class diagram for TableWriteNode creation and RowType construction

classDiagram
  class VeloxQueryPlanConverterBase {
    +toVeloxQueryPlan(node, sourceVeloxPlan, tableWriteInfo, taskId) TableWriteNode
  }

  class TableWriteNode {
    +id
    +outputType RowType
    +columnNames
    +insertTableHandle InsertTableHandle
  }

  class InsertTableHandle {
    +connectorInsertTableHandle() ConnectorInsertTableHandle
  }

  class ConnectorInsertTableHandle {
    +getColumnHandleDataTypes() vector~TypePtr~
  }

  class RowType {
    +names vector~string~
    +types vector~TypePtr~
  }

  class VariableReferenceExpression {
    +name string
    +typeSignature
  }

  class TypeParser {
    +parse(typeSignature) TypePtr
  }

  class toRowType_overload_1 {
    +toRowType(variables vector~VariableReferenceExpression~, typeParser TypeParser) RowTypePtr
  }

  class toRowType_overload_2 {
    +toRowType(variables vector~VariableReferenceExpression~, types vector~TypePtr~, typeParser TypeParser, excludeNames unordered_set~string~) RowTypePtr
  }

  VeloxQueryPlanConverterBase --> TableWriteNode : creates
  TableWriteNode *-- RowType : outputType
  TableWriteNode *-- InsertTableHandle
  InsertTableHandle *-- ConnectorInsertTableHandle
  ConnectorInsertTableHandle ..> RowType : provides types

  VeloxQueryPlanConverterBase ..> toRowType_overload_1 : uses (unchanged)
  VeloxQueryPlanConverterBase ..> toRowType_overload_2 : uses for TableWriteNode

  toRowType_overload_1 ..> VariableReferenceExpression : reads types via TypeParser
  toRowType_overload_1 ..> TypeParser

  toRowType_overload_2 ..> VariableReferenceExpression : reads names
  toRowType_overload_2 ..> RowType : builds from names and connector types
  toRowType_overload_2 ..> TypeParser : available but not used for decimal precision
Loading

File-Level Changes

Change Details Files
Use connector-provided column data types when constructing the RowType for TableWrite nodes to preserve accurate DECIMAL metadata.
  • Add an overload of toRowType that takes VariableReferenceExpression list plus an explicit list of TypePtr and builds a RowType using only the variable names, optionally excluding some by name.
  • Retrieve column data types from the insert table handle via getColumnHandleDataTypes() in VeloxQueryPlanConverterBase::toVeloxQueryPlan.
  • Change the TableWriteNode construction to use the new toRowType overload with the connector-provided column types instead of parsing types from the query plan, ensuring correct DECIMAL precision and scale are used in Parquet writes.
presto-native-execution/presto_cpp/main/types/PrestoToVeloxQueryPlan.cpp

Possibly linked issues


Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

Copy link
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey there - I've reviewed your changes and found some issues that need to be addressed.

  • In the new toRowType overload, filtering variables via excludeNames without filtering types will misalign names and types; either filter types in sync or derive them inside the function instead of passing the original vector unchanged.
  • The types parameter in the new toRowType overload is passed by value and never modified; consider taking it as a const std::vector<TypePtr>& to avoid an unnecessary copy.
  • The typeParser parameter in the new toRowType overload is currently unused; if it is not needed, it should be removed to avoid confusion, or used if there was an intended purpose.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In the new `toRowType` overload, filtering `variables` via `excludeNames` without filtering `types` will misalign names and types; either filter `types` in sync or derive them inside the function instead of passing the original vector unchanged.
- The `types` parameter in the new `toRowType` overload is passed by value and never modified; consider taking it as a `const std::vector<TypePtr>&` to avoid an unnecessary copy.
- The `typeParser` parameter in the new `toRowType` overload is currently unused; if it is not needed, it should be removed to avoid confusion, or used if there was an intended purpose.

## Individual Comments

### Comment 1
<location> `presto-native-execution/presto_cpp/main/types/PrestoToVeloxQueryPlan.cpp:78-82` </location>
<code_context>
   return ROW(std::move(names), std::move(types));
 }

+RowTypePtr toRowType(
+    const std::vector<protocol::VariableReferenceExpression>& variables,
+    const std::vector<TypePtr> types,
+    const TypeParser& typeParser,
+    const std::unordered_set<std::string>& excludeNames = {}) {
+  std::vector<std::string> names;
+  names.reserve(variables.size());
</code_context>

<issue_to_address>
**issue (bug_risk):** Filtering names via `excludeNames` without filtering `types` risks a name/type size mismatch in the resulting `RowType`.

This overload drops variables whose names are in `excludeNames`, but still passes the original `types` to `ROW(...)`. If any names are excluded, `names.size()` will be less than `types.size()`, which can cause undefined behavior or misaligned columns in the resulting `RowType`. Please either filter `types` in parallel, disallow non-empty `excludeNames` here, or add a runtime assertion that `names.size() == types.size()` before constructing the `RowType`.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

@jkhaliqi jkhaliqi force-pushed the parquet_tools_decimal branch 2 times, most recently from 1d026c6 to a3d2fee Compare December 3, 2025 20:30
@jkhaliqi jkhaliqi changed the title fix(parquet): Fix inserting decimal type values precision and scale fix(native): Fix inserting decimal type values precision and scale Dec 3, 2025
@jkhaliqi jkhaliqi force-pushed the parquet_tools_decimal branch from a3d2fee to 0f7f023 Compare December 4, 2025 19:45
@jkhaliqi jkhaliqi force-pushed the parquet_tools_decimal branch from 0f7f023 to 566ff2f Compare December 5, 2025 22:06
@jkhaliqi jkhaliqi force-pushed the parquet_tools_decimal branch from 566ff2f to 061a7bb Compare December 5, 2025 22:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

from:IBM PR from IBM

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants