download speed in bigrquery on macos

*Edited* to reflect that the slowdown has nothing to do with the `bigrquerystorage`-vs-"duckdb extension" showdown, it's a difference between this package on macos and this package on linux.

---

TBH, I'm not certain if this is the best place for this question, but since the speed comparison is in `bigrquery`+`bigrquerystorage` (and `bigrquery` alone is too slow anyway), I thought it might apply.

I have a table in BQ that is 5M rows per query, but querying from within R is prohibitive. When pulling the 5M rows, anecdotally it hung for over 4h, locking up the process (could not break out, needed to `kill` R to try again); this hang is reproducible for me. When reducing the query to `n`-rows only, I'm able to get the data I need, but the times are an order of magnitude higher than what I can get in native python. Since the python code uses the same streaming interface (theoretically?), the difference in speed must be due to something else, likely within the R packages/functions. I don't expect R to be perfectly as fast as python when downloading, but an OOM seems like something else.

I'm not using `gcloud-cli`'s `bq` utility since that uses the REST api. Instead, Gemini crafted a simple python script (included below) that takes advantage of the streaming api; I offer it for demonstration.

The query used for the benchmarks (both in R and using the python script) is identical except for the `limit n`. The query is performant in google console, and does not exceed "partitioned table" boundaries. I believe that the query is not the problem. I can confirm that `bq_dataset_query(api="json")` is significantly slower than `api="arrow"`, so `bigrquerystorage` is making significant improvements on the data pull.

The native R `bench::mark()` on 100, 1000, etc rows of the table.

```r
# A tibble: 4 × 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory                     time            gc               
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>                     <list>          <list>           
1 100        797.77ms 868.74ms    1.08      1.38MB    0.108    10     1      9.26s <NULL> <Rprofmem [1,962 × 3]>     <bench_tm [10]> <tibble [10 × 3]>
2 1000          1.08s    1.15s    0.873    10.02MB    0.262    10     3     11.45s <NULL> <Rprofmem [15,398 × 3]>    <bench_tm [10]> <tibble [10 × 3]>
3 10000         4.27s    4.57s    0.217    97.44MB    0.717    10    33     46.04s <NULL> <Rprofmem [150,400 × 3]>   <bench_tm [10]> <tibble [10 × 3]>
4 100000       27.75s    29.4s    0.0343  822.98MB    1.06     10   308      4.86m <NULL> <Rprofmem [1,621,608 × 3]> <bench_tm [10]> <tibble [10 × 3]>
```

When I run the identical queries using python, for even the largest of these queries, I see times 1/10th from within R. The below times are summaries of 10 runs each size (similar to `bench::mark()` above), units are seconds. (I used `time bq.py ...` and captured the output, then summarized on the `real` time returned.)

```none
       n time_min time_median
1    100    1.821       1.898
2   1000    1.689       1.766
3  10000    1.876       1.984
4 100000    2.522       2.581
```

Do you have ideas on why this is?

<details><summary> bq.py without REST </summary>

```python
import argparse
import os
import time
from datetime import datetime
from google.cloud import bigquery
from google.oauth2 import service_account

def get_client(key_path=None):
    """Handles the fallback logic for authentication."""
    if key_path:
        print(f"[*] Using service account key: {key_path}")
        credentials = service_account.Credentials.from_service_account_file(key_path)
        return bigquery.Client(credentials=credentials, project=credentials.project_id)

    # If no key_path, bigquery.Client() automatically checks GOOGLE_APPLICATION_CREDENTIALS
    # or the gcloud ADC login you performed earlier.
    print("[*] Using default environment authentication.")
    return bigquery.Client()

def run_downloader():
    parser = argparse.ArgumentParser(description="High-speed BigQuery data downloader.")

    # Authentication & Format Options
    parser.add_argument("-k", "--key", help="Path to service account JSON key file.")
    parser.add_argument("-f", "--format", choices=["json", "parquet"], default="parquet",
                        help="Output format: 'json' (NDJSON) or 'parquet' (Default: parquet)")
    parser.add_argument("-o", "--output", help="Output filename. If omitted, a timestamped name is generated.")

    # The final argument: the query string
    parser.add_argument("query", help="The SQL query to execute.")

    args = parser.parse_args()

    # 1. Initialize Client
    client = get_client(args.key)

    # 2. Run Query
    print(f"[{datetime.now().strftime('%H:%M:%S')}] Submitting query...")
    query_job = client.query(args.query)

    # 3. Download via Storage Read API (High-Speed)
    print(f"[{datetime.now().strftime('%H:%M:%S')}] Downloading data via Storage API...")
    # This remains the fastest path for 5M+ rows
    df = query_job.to_dataframe(create_bqstorage_client=True)

    # 4. Handle Filename
    timestamp = datetime.now().strftime("%Y%m%dT%H%M%S")
    extension = "jsonl" if args.format == "json" else "parquet"
    output_file = args.output or f"query_results_{timestamp}.{extension}"

    # 5. Save Data
    print(f"[{datetime.now().strftime('%H:%M:%S')}] Saving {len(df)} rows to {output_file}...")

    if args.format == "json":
        # Saves as Newline Delimited JSON (NDJSON) to preserve types correctly
        df.to_json(output_file, orient="records", lines=True)
    else:
        # High-speed binary format
        df.to_parquet(output_file, engine="pyarrow")

    print(f"[{datetime.now().strftime('%H:%M:%S')}] Success!")

if __name__ == "__main__":
    run_downloader()
```

</details>

Anecdotally, when I go much above the 100,000 rows, in R the query will just hang with no return (after 8+ hours). I haven't found the right number that triggers the fail, since once it's hung I cannot interrupt R. Regardless of the un-breakable hang here, I think even the benchmarking of 100,000 rows is sufficient to warrant fixing it.

In the interim, I will likely use just the python script via `processx::process$new(...)` so that I can get reasonable turnaround on my queries.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

download speed in bigrquery on macos #85

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

download speed in bigrquery on macos #85

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions