
feat: support read to arrow #116

Open
luoyuxia wants to merge 2 commits into apache:main from luoyuxia:support-read-to-arrow-final
Conversation

@luoyuxia
Contributor

Purpose

Linked issue: close #115

Brief change log

Tests

API and Format

Documentation

@luoyuxia force-pushed the support-read-to-arrow-final branch from 54e9acf to ecee807 on March 10, 2026 at 11:47
@luoyuxia force-pushed the support-read-to-arrow-final branch 3 times, most recently from b969a6c to 16f7ab1 on March 10, 2026 at 13:17
@luoyuxia force-pushed the support-read-to-arrow-final branch from 16f7ab1 to 42b2732 on March 10, 2026 at 13:30
.iter()
.flat_map(|ds| ds.data_file_entries().map(|(p, _)| p))
.map(|p| {
if !p.to_ascii_lowercase().ends_with(".parquet") {


only parquet?

Contributor Author


Yes, let's only support Parquet for now. We can add support for more formats in a follow-up PR.


👌

.iter()
.flat_map(|ds| ds.data_file_entries().map(|(p, _)| p))
.map(|p| {
if !p.to_ascii_lowercase().ends_with(".parquet") {

@XiaoHongbo-Hope Mar 10, 2026


I get your point that only Parquet is supported. But I am afraid this check is not enough; see DataFilePathFactory.formatIdentifier. Or can we find a better way to get the format?

Contributor Author

@luoyuxia Mar 10, 2026


I understand the concern. However, for parquet files we normally do not expect extra compression suffixes like .gz, since compression is handled within the parquet format itself.

So for the current parquet-only read path, I think checking for .parquet should be sufficient. If we later need to support other file naming conventions, we can revisit this and align with DataFilePathFactory.formatIdentifier.
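For illustration, a suffix-aware check along those lines might look like the sketch below. This is a hypothetical helper, not the PR's code or the actual DataFilePathFactory.formatIdentifier logic, and the compression-suffix list is made up for the example:

```python
# Hypothetical sketch: identify a data file's storage format while
# skipping known compression suffixes (suffix list is illustrative).
COMPRESSION_SUFFIXES = {"gz", "zstd", "lz4", "snappy"}

def format_identifier(path: str) -> str:
    # Take the file name and split it into dot-separated extensions.
    parts = path.lower().rsplit("/", 1)[-1].split(".")
    # Walk the extensions right-to-left, skipping compression suffixes,
    # so "data.json.gz" resolves to "json" rather than "gz".
    for ext in reversed(parts[1:]):
        if ext not in COMPRESSION_SUFFIXES:
            return ext
    return ""

print(format_identifier("bucket-0/data.parquet"))  # parquet
print(format_identifier("bucket-0/data.json.gz"))  # json
```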

By the way, from the Python side, it seems we only inspect the last extension via os.path.splitext(). That means a path like *.json.gz would be identified as gz, not json.

So the current behavior there does not appear to handle compressed suffixes in a generalized way either.

For parquet, this is less of a concern since we normally do not expect an extra compression suffix such as .parquet.gz.
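The os.path.splitext() behavior described above is easy to verify with a standalone snippet (the file names here are made up for illustration):

```python
import os.path

# splitext only strips the final extension, so a doubly suffixed file
# is reported by its compression suffix, not its underlying format.
print(os.path.splitext("data.json.gz"))   # ('data.json', '.gz')
print(os.path.splitext("data.parquet"))   # ('data', '.parquet')
```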


Thanks for your very kind explanation. And thanks again for pointing out the issue on the Python side; I will check it.

@XiaoHongbo-Hope

+1



Development

Successfully merging this pull request may close these issues.

support read parquet to arrow

2 participants