Skip to content

feat: async stream read#9632

Open
ariel-miculas wants to merge 3 commits intoapache:mainfrom
ariel-miculas:async-stream-reader
Open

feat: async stream read#9632
ariel-miculas wants to merge 3 commits intoapache:mainfrom
ariel-miculas:async-stream-reader

Conversation

@ariel-miculas
Copy link
Copy Markdown

Which issue does this PR close?

Rationale for this change

Implement an async streamed reader for avro, this is similar to how datafusion handles json and csv scanning.
There are two main advantages:

  • we don't have to read the entire data upfront, which could require a lot of memory
  • data fetching and decoding can happen in parallel, leading to performance improvements

What changes are included in this PR?

Change the avro reader implementation to use async streams.

Are these changes tested?

Yes, added tests.

Are there any user-facing changes?

Users can now implement get_stream as part of the AsyncFileReader trait

@github-actions github-actions bot added arrow Changes to the arrow crate arrow-avro arrow-avro crate labels Mar 31, 2026
Copy link
Copy Markdown
Contributor

@AndreaBozzo AndreaBozzo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ariel-miculas

while i'm very in favor of the idea behind this, there a few concerns about the implementation

mut state: AvroReaderState<R>,
) -> impl Stream<Item = Result<RecordBatch, ArrowError>> + Send + 'static {
async_stream::try_stream! {
if state.meta.range.start >= state.meta.range.end {
Copy link
Copy Markdown
Contributor

@AndreaBozzo AndreaBozzo Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try_stream! is sequential? There's no actual prefetching or concurrent work happening, did i miss something?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation is entirely pull-based. The reader API exposes the stream, which can be driven by a dedicated async task (with the granularity of a record batch at a time) if beneficial to the application.

There is a comment about not being able to use the inner spawn helper because of other constraints on the trait design.

(Responding because I had an input on this change.)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my end-to-end benchmark, prefetching is done by the object-store implementation, specifically in the call to get_opts

tokio = { version = "1.0", optional = true, default-features = false, features = ["macros", "rt", "io-util"] }
tokio-util = { version = "0.7.18", default-features = false, features = ["io"] }
async-stream = "0.3.6"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Async stuff should be feature gated i think

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you mean feature gating AsyncFileReader entirely (which wasn't feature gated before) or having two implementations for it (which I disklike)?

Copy link
Copy Markdown
Contributor

@mzabaluev mzabaluev Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These dependencies should be optional and enabled by the "async" feature, like tokio above.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AsyncFileReader was indeed feature-gated before, so I'll make these two dependencies optional and add them as part of async feature

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate arrow-avro arrow-avro crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants