[Json] Replace ArrayData with typed Array construction in json-reader#9497
[Json] Replace ArrayData with typed Array construction in json-reader#9497alamb merged 7 commits intoapache:mainfrom
ArrayData with typed Array construction in json-reader#9497Conversation
|
Any particular benchmark you want me to run? |
alamb
left a comment
There was a problem hiding this comment.
Thanks @liamzwbao -- this looks good to me -- I have just one question
This one should be good: https://github.com/apache/arrow-rs/blob/main/arrow-json/benches/json-reader.rs |
|
Hi @alamb, just noticed that we don't have benchmark for List yet. I will create one and get back to you later |
# Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. --> - Relates to #9497. # Rationale for this change <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> Add benchmark for `ListArray` in `json_reader` to support the performance evaluation of #9497 # What changes are included in this PR? <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> - Benchmarks for decoding and serialize json list to `ListArray`. - Benchmarks for `ListArray` and `FixedSizeListArray` json writer # Are these changes tested? <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> Benchmarks only # Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. If there are any breaking changes to public APIs, please call them out. --> No
6155c94 to
7daefe6
Compare
|
Hi @alamb, this PR is ready for the benchmark, could you please run |
|
run benchmark json_reader |
This comment was marked as outdated.
This comment was marked as outdated.
|
🤖 |
|
🤖: Benchmark completed Details
|
|
run benchmark json_reader |
|
🤖 |
|
🤖: Benchmark completed Details
|
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
ArrayDataBuilder with GenericListArray in ListArrayDecoderArrayDataBuilder with GenericListArray in ListArrayDecoder in json-reader
alamb
left a comment
There was a problem hiding this comment.
Thank you for this PR @liamzwbao
the benchmarks seem to show that this PR is slower for some reason. Could you look into that and figure out why?
This comment has been minimized.
This comment has been minimized.
alamb
left a comment
There was a problem hiding this comment.
Sorry I forgot to commit my comment. Seems like you already found it though
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
ebb46dd to
ab89d76
Compare
|
Hi @alamb, this is ready for the |
alamb
left a comment
There was a problem hiding this comment.
Thank you @liamzwbao -- this is really close -- I think there are a few small items for ListViewArray and the bugin the REE encoder.
|
|
||
| pub struct ListLikeArrayDecoder<O, const IS_VIEW: bool> { | ||
| data_type: DataType, | ||
| field: FieldRef, |
There was a problem hiding this comment.
this is a nice change (as this now clones an Arc rather than a DataType) 👍
arrow-json/src/reader/list_array.rs
Outdated
| data = data | ||
| .add_buffer(Buffer::from_vec(offsets)) | ||
| .add_buffer(Buffer::from_vec(sizes)); | ||
| let array = GenericListViewArray::<O>::try_new( |
There was a problem hiding this comment.
This now does full validation for LIstViewArray I think -- which is more expensive than the previous creation via data.build_unchecked()
I can see two alternatives:
- Add a
unsafe GenericListViewArray::new_uncheckedmethods - Leave this code as is and create an ArrayRef directly from the ArrayData
I would selfishly suggest 2 (and then a follow on PR to add unsafe GenericListViewArray::new_unchecked / rework this code) to make reviewing this one eaiser on my
There was a problem hiding this comment.
Revert to use ArrayDataBuilder for ListView, will create an issue to support GenericListViewArray::new_unchecked and improve later. Created an issue #9646
| } | ||
|
|
||
| let flat_data = self.decoder.decode(tape, pos)?; | ||
| let flat_array = self.decoder.decode(tape, pos)?; |
There was a problem hiding this comment.
as pointed out, there is something wrong with this code -- I recommend reverting the extra optimization from this PR and adding it as a follow on PR
There was a problem hiding this comment.
Revert the REE optimization, will fix the partition API and reimplement the optimization later
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
🐱 🚀 |
|
Hi @alamb, I have revert the refactor for ListView and the REE optimization as discussed. Also filed several followup issues. PTAL, thanks! |
ArrayData with typed Array construction in json-readerArrayData with typed Array construction in json-reader
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
alamb
left a comment
There was a problem hiding this comment.
Thank you @liamzwbao -- this looks great to me. I kicked off some benchmarks just to double check but i think this is a really nice improvement overall
I also ran the test from #9634 on this PR
And it passes:
running 1 test
test reader::tests::test_read_nested_run_end_encoded ... ok
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
run benchmarks json_reader |
|
Benchmarks look as good or better to me |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing issue-9298-replace-array-data (5469fac) to 61b5763 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
🚀 |
|
Thanks for sticking with this one @liamzwbao |
Which issue does this PR close?
ArrayDatawith direct Array construction, when possible #9298.Rationale for this change
While implementing
ListViewArrayDecoderin arrow-json, I noticed we could potentially retireArrayDataBuilderinsideListArrayDecoder. Therefore, I'd like to use a small PR here to make sure there's no regressionWhat changes are included in this PR?
Replace
ArrayDataBuilderwithGenericListArrayinListArrayDecoderAre these changes tested?
Covered by existing tests
Are there any user-facing changes?
No