Edited to reflect that the slowdown has nothing to do with the bigrquerystorage-vs-"duckdb extension" showdown, it's a difference between this package on macos and this package on linux.
TBH, I'm not certain if this is the best place for this question, but since the speed comparison is in bigrquery+bigrquerystorage (and bigrquery alone is too slow anyway), I thought it might apply.
I have a table in BQ that is 5M rows per query, but querying from within R is prohibitive. When pulling the 5M rows, anecdotally it hung for over 4h, locking up the process (could not break out, needed to kill R to try again); this hang is reproducible for me. When reducing the query to n-rows only, I'm able to get the data I need, but the times are an order of magnitude higher than what I can get in native python. Since the python code uses the same streaming interface (theoretically?), the difference in speed must be due to something else, likely within the R packages/functions. I don't expect R to be perfectly as fast as python when downloading, but an OOM seems like something else.
I'm not using gcloud-cli's bq utility since that uses the REST api. Instead, Gemini crafted a simple python script (included below) that takes advantage of the streaming api; I offer it for demonstration.
The query used for the benchmarks (both in R and using the python script) is identical except for the limit n. The query is performant in google console, and does not exceed "partitioned table" boundaries. I believe that the query is not the problem. I can confirm that bq_dataset_query(api="json") is significantly slower than api="arrow", so bigrquerystorage is making significant improvements on the data pull.
The native R bench::mark() on 100, 1000, etc rows of the table.
# A tibble: 4 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 100 797.77ms 868.74ms 1.08 1.38MB 0.108 10 1 9.26s <NULL> <Rprofmem [1,962 × 3]> <bench_tm [10]> <tibble [10 × 3]>
2 1000 1.08s 1.15s 0.873 10.02MB 0.262 10 3 11.45s <NULL> <Rprofmem [15,398 × 3]> <bench_tm [10]> <tibble [10 × 3]>
3 10000 4.27s 4.57s 0.217 97.44MB 0.717 10 33 46.04s <NULL> <Rprofmem [150,400 × 3]> <bench_tm [10]> <tibble [10 × 3]>
4 100000 27.75s 29.4s 0.0343 822.98MB 1.06 10 308 4.86m <NULL> <Rprofmem [1,621,608 × 3]> <bench_tm [10]> <tibble [10 × 3]>
When I run the identical queries using python, for even the largest of these queries, I see times 1/10th from within R. The below times are summaries of 10 runs each size (similar to bench::mark() above), units are seconds. (I used time bq.py ... and captured the output, then summarized on the real time returned.)
n time_min time_median
1 100 1.821 1.898
2 1000 1.689 1.766
3 10000 1.876 1.984
4 100000 2.522 2.581
Do you have ideas on why this is?
bq.py without REST
import argparse
import os
import time
from datetime import datetime
from google.cloud import bigquery
from google.oauth2 import service_account
def get_client(key_path=None):
"""Handles the fallback logic for authentication."""
if key_path:
print(f"[*] Using service account key: {key_path}")
credentials = service_account.Credentials.from_service_account_file(key_path)
return bigquery.Client(credentials=credentials, project=credentials.project_id)
# If no key_path, bigquery.Client() automatically checks GOOGLE_APPLICATION_CREDENTIALS
# or the gcloud ADC login you performed earlier.
print("[*] Using default environment authentication.")
return bigquery.Client()
def run_downloader():
parser = argparse.ArgumentParser(description="High-speed BigQuery data downloader.")
# Authentication & Format Options
parser.add_argument("-k", "--key", help="Path to service account JSON key file.")
parser.add_argument("-f", "--format", choices=["json", "parquet"], default="parquet",
help="Output format: 'json' (NDJSON) or 'parquet' (Default: parquet)")
parser.add_argument("-o", "--output", help="Output filename. If omitted, a timestamped name is generated.")
# The final argument: the query string
parser.add_argument("query", help="The SQL query to execute.")
args = parser.parse_args()
# 1. Initialize Client
client = get_client(args.key)
# 2. Run Query
print(f"[{datetime.now().strftime('%H:%M:%S')}] Submitting query...")
query_job = client.query(args.query)
# 3. Download via Storage Read API (High-Speed)
print(f"[{datetime.now().strftime('%H:%M:%S')}] Downloading data via Storage API...")
# This remains the fastest path for 5M+ rows
df = query_job.to_dataframe(create_bqstorage_client=True)
# 4. Handle Filename
timestamp = datetime.now().strftime("%Y%m%dT%H%M%S")
extension = "jsonl" if args.format == "json" else "parquet"
output_file = args.output or f"query_results_{timestamp}.{extension}"
# 5. Save Data
print(f"[{datetime.now().strftime('%H:%M:%S')}] Saving {len(df)} rows to {output_file}...")
if args.format == "json":
# Saves as Newline Delimited JSON (NDJSON) to preserve types correctly
df.to_json(output_file, orient="records", lines=True)
else:
# High-speed binary format
df.to_parquet(output_file, engine="pyarrow")
print(f"[{datetime.now().strftime('%H:%M:%S')}] Success!")
if __name__ == "__main__":
run_downloader()
Anecdotally, when I go much above the 100,000 rows, in R the query will just hang with no return (after 8+ hours). I haven't found the right number that triggers the fail, since once it's hung I cannot interrupt R. Regardless of the un-breakable hang here, I think even the benchmarking of 100,000 rows is sufficient to warrant fixing it.
In the interim, I will likely use just the python script via processx::process$new(...) so that I can get reasonable turnaround on my queries.
Edited to reflect that the slowdown has nothing to do with the
bigrquerystorage-vs-"duckdb extension" showdown, it's a difference between this package on macos and this package on linux.TBH, I'm not certain if this is the best place for this question, but since the speed comparison is in
bigrquery+bigrquerystorage(andbigrqueryalone is too slow anyway), I thought it might apply.I have a table in BQ that is 5M rows per query, but querying from within R is prohibitive. When pulling the 5M rows, anecdotally it hung for over 4h, locking up the process (could not break out, needed to
killR to try again); this hang is reproducible for me. When reducing the query ton-rows only, I'm able to get the data I need, but the times are an order of magnitude higher than what I can get in native python. Since the python code uses the same streaming interface (theoretically?), the difference in speed must be due to something else, likely within the R packages/functions. I don't expect R to be perfectly as fast as python when downloading, but an OOM seems like something else.I'm not using
gcloud-cli'sbqutility since that uses the REST api. Instead, Gemini crafted a simple python script (included below) that takes advantage of the streaming api; I offer it for demonstration.The query used for the benchmarks (both in R and using the python script) is identical except for the
limit n. The query is performant in google console, and does not exceed "partitioned table" boundaries. I believe that the query is not the problem. I can confirm thatbq_dataset_query(api="json")is significantly slower thanapi="arrow", sobigrquerystorageis making significant improvements on the data pull.The native R
bench::mark()on 100, 1000, etc rows of the table.When I run the identical queries using python, for even the largest of these queries, I see times 1/10th from within R. The below times are summaries of 10 runs each size (similar to
bench::mark()above), units are seconds. (I usedtime bq.py ...and captured the output, then summarized on therealtime returned.)Do you have ideas on why this is?
bq.py without REST
Anecdotally, when I go much above the 100,000 rows, in R the query will just hang with no return (after 8+ hours). I haven't found the right number that triggers the fail, since once it's hung I cannot interrupt R. Regardless of the un-breakable hang here, I think even the benchmarking of 100,000 rows is sufficient to warrant fixing it.
In the interim, I will likely use just the python script via
processx::process$new(...)so that I can get reasonable turnaround on my queries.