Skip to content

download speed in bigrquery on macos #85

@r2evans

Description

@r2evans

Edited to reflect that the slowdown has nothing to do with the bigrquerystorage-vs-"duckdb extension" showdown, it's a difference between this package on macos and this package on linux.


TBH, I'm not certain if this is the best place for this question, but since the speed comparison is in bigrquery+bigrquerystorage (and bigrquery alone is too slow anyway), I thought it might apply.

I have a table in BQ that is 5M rows per query, but querying from within R is prohibitive. When pulling the 5M rows, anecdotally it hung for over 4h, locking up the process (could not break out, needed to kill R to try again); this hang is reproducible for me. When reducing the query to n-rows only, I'm able to get the data I need, but the times are an order of magnitude higher than what I can get in native python. Since the python code uses the same streaming interface (theoretically?), the difference in speed must be due to something else, likely within the R packages/functions. I don't expect R to be perfectly as fast as python when downloading, but an OOM seems like something else.

I'm not using gcloud-cli's bq utility since that uses the REST api. Instead, Gemini crafted a simple python script (included below) that takes advantage of the streaming api; I offer it for demonstration.

The query used for the benchmarks (both in R and using the python script) is identical except for the limit n. The query is performant in google console, and does not exceed "partitioned table" boundaries. I believe that the query is not the problem. I can confirm that bq_dataset_query(api="json") is significantly slower than api="arrow", so bigrquerystorage is making significant improvements on the data pull.

The native R bench::mark() on 100, 1000, etc rows of the table.

# A tibble: 4 × 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory                     time            gc               
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>                     <list>          <list>           
1 100        797.77ms 868.74ms    1.08      1.38MB    0.108    10     1      9.26s <NULL> <Rprofmem [1,962 × 3]>     <bench_tm [10]> <tibble [10 × 3]>
2 1000          1.08s    1.15s    0.873    10.02MB    0.262    10     3     11.45s <NULL> <Rprofmem [15,398 × 3]>    <bench_tm [10]> <tibble [10 × 3]>
3 10000         4.27s    4.57s    0.217    97.44MB    0.717    10    33     46.04s <NULL> <Rprofmem [150,400 × 3]>   <bench_tm [10]> <tibble [10 × 3]>
4 100000       27.75s    29.4s    0.0343  822.98MB    1.06     10   308      4.86m <NULL> <Rprofmem [1,621,608 × 3]> <bench_tm [10]> <tibble [10 × 3]>

When I run the identical queries using python, for even the largest of these queries, I see times 1/10th from within R. The below times are summaries of 10 runs each size (similar to bench::mark() above), units are seconds. (I used time bq.py ... and captured the output, then summarized on the real time returned.)

       n time_min time_median
1    100    1.821       1.898
2   1000    1.689       1.766
3  10000    1.876       1.984
4 100000    2.522       2.581

Do you have ideas on why this is?

bq.py without REST
import argparse
import os
import time
from datetime import datetime
from google.cloud import bigquery
from google.oauth2 import service_account

def get_client(key_path=None):
    """Handles the fallback logic for authentication."""
    if key_path:
        print(f"[*] Using service account key: {key_path}")
        credentials = service_account.Credentials.from_service_account_file(key_path)
        return bigquery.Client(credentials=credentials, project=credentials.project_id)

    # If no key_path, bigquery.Client() automatically checks GOOGLE_APPLICATION_CREDENTIALS
    # or the gcloud ADC login you performed earlier.
    print("[*] Using default environment authentication.")
    return bigquery.Client()

def run_downloader():
    parser = argparse.ArgumentParser(description="High-speed BigQuery data downloader.")

    # Authentication & Format Options
    parser.add_argument("-k", "--key", help="Path to service account JSON key file.")
    parser.add_argument("-f", "--format", choices=["json", "parquet"], default="parquet",
                        help="Output format: 'json' (NDJSON) or 'parquet' (Default: parquet)")
    parser.add_argument("-o", "--output", help="Output filename. If omitted, a timestamped name is generated.")

    # The final argument: the query string
    parser.add_argument("query", help="The SQL query to execute.")

    args = parser.parse_args()

    # 1. Initialize Client
    client = get_client(args.key)

    # 2. Run Query
    print(f"[{datetime.now().strftime('%H:%M:%S')}] Submitting query...")
    query_job = client.query(args.query)

    # 3. Download via Storage Read API (High-Speed)
    print(f"[{datetime.now().strftime('%H:%M:%S')}] Downloading data via Storage API...")
    # This remains the fastest path for 5M+ rows
    df = query_job.to_dataframe(create_bqstorage_client=True)

    # 4. Handle Filename
    timestamp = datetime.now().strftime("%Y%m%dT%H%M%S")
    extension = "jsonl" if args.format == "json" else "parquet"
    output_file = args.output or f"query_results_{timestamp}.{extension}"

    # 5. Save Data
    print(f"[{datetime.now().strftime('%H:%M:%S')}] Saving {len(df)} rows to {output_file}...")

    if args.format == "json":
        # Saves as Newline Delimited JSON (NDJSON) to preserve types correctly
        df.to_json(output_file, orient="records", lines=True)
    else:
        # High-speed binary format
        df.to_parquet(output_file, engine="pyarrow")

    print(f"[{datetime.now().strftime('%H:%M:%S')}] Success!")

if __name__ == "__main__":
    run_downloader()

Anecdotally, when I go much above the 100,000 rows, in R the query will just hang with no return (after 8+ hours). I haven't found the right number that triggers the fail, since once it's hung I cannot interrupt R. Regardless of the un-breakable hang here, I think even the benchmarking of 100,000 rows is sufficient to warrant fixing it.

In the interim, I will likely use just the python script via processx::process$new(...) so that I can get reasonable turnaround on my queries.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions