    .iter()
    .flat_map(|ds| ds.data_file_entries().map(|(p, _)| p))
    .map(|p| {
        if !p.to_ascii_lowercase().ends_with(".parquet") {
Yes, let's support only Parquet for now. We can add more formats in a follow-up PR.
I get your point that only Parquet is supported, but I am afraid this check is not enough; see DataFilePathFactory.formatIdentifier. Could we find a better way to get the format?
I understand the concern. However, for parquet files we normally do not expect extra compression suffixes like .gz, since compression is handled within the parquet format itself.
So for the current parquet-only read path, I think checking for .parquet should be sufficient. If we later need to support other file naming conventions, we can revisit this and align with DataFilePathFactory.formatIdentifier.
By the way, on the Python side it seems we only inspect the last extension via os.path.splitext(). That means a path like *.json.gz would be identified as gz, not json.
So the current behavior there does not appear to handle compressed suffixes in a generalized way either.
For parquet, this is less of a concern since we normally do not expect an extra compression suffix such as .parquet.gz.
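To make the difference concrete, here is a minimal Rust sketch of the three behaviors discussed in this thread. Only `is_parquet_path` mirrors the check in the diff. `last_extension`, `format_identifier`, and the `COMPRESSION_SUFFIXES` list are hypothetical names introduced for illustration; they are not part of this PR and are not a port of DataFilePathFactory.formatIdentifier.

```rust
use std::path::Path;

/// Case-insensitive suffix check, mirroring the `.parquet` test in the diff.
fn is_parquet_path(path: &str) -> bool {
    path.to_ascii_lowercase().ends_with(".parquet")
}

/// Hypothetical helper: returns only the last extension (without the dot),
/// which is comparable to inspecting the result of Python's os.path.splitext().
fn last_extension(path: &str) -> Option<String> {
    Path::new(path)
        .extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| ext.to_ascii_lowercase())
}

/// Assumed list of compression suffixes, for illustration only.
const COMPRESSION_SUFFIXES: [&str; 3] = ["gz", "zst", "snappy"];

/// Hypothetical helper: skips one trailing compression suffix so that
/// "data.json.gz" is identified as "json" rather than "gz".
fn format_identifier(path: &str) -> Option<String> {
    let file_name = Path::new(path).file_name()?.to_str()?;
    // Split off the last extension; a name without a dot has no format.
    let (stem, ext) = file_name.rsplit_once('.')?;
    let ext = ext.to_ascii_lowercase();
    if COMPRESSION_SUFFIXES.contains(&ext.as_str()) {
        // Look one level deeper for the actual format extension.
        let (_, inner) = stem.rsplit_once('.')?;
        return Some(inner.to_ascii_lowercase());
    }
    Some(ext)
}
```

For a plain `*.parquet` path all three helpers agree, which is why the suffix check is enough for the parquet-only read path. They only diverge once a compression suffix follows the format extension.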
Thanks for the very kind explanation, and thanks again for pointing out the issue on the Python side. I will check it.
+1
Purpose
Linked issue: close #115
Brief change log
Tests
API and Format
Documentation