Split cleanData and add Parquet exports by AbhirupaGhosh · Pull Request #11 · JRaviLab/amRdata

AbhirupaGhosh · 2026-02-19T15:55:20Z

Rename the original cleanData to cleanMetaData and add a roxygen skeleton. Introduce a writeCompressedParquet helper and export the cleaned metadata, AMR phenotype, genome data, and original metadata to compressed Parquet files, then create a separate Parquet-backed DuckDB database with views for metadata, amr_phenotype, genome_data, and original_metadata. Add a new cleanData function focused on feature matrices (genes/proteins/domains/etc.) that writes feature tables to Parquet and creates the corresponding views; remove the duplicated metadata Parquet exports from the feature-matrix flow. Minor whitespace and path-handling adjustments normalize paths and ensure output directories exist.

Description

What kind of change(s) are included?

Feature (adds or updates new capabilities)
Bug fix (fixes an issue).
Enhancement (adds functionality).
Breaking change (these changes would cause existing functionality to not work as expected).

Checklist

Please ensure that all boxes are checked before indicating that this pull request is ready for review.

I have read and followed the CONTRIBUTING.md guidelines.
I have searched for existing content to ensure this is not a duplicate.
I have performed a self-review of these additions (including spelling, grammar, and related).
I have added comments to my code to help provide understanding.
I have added a test which covers the code changes found within this PR.
I have deleted all non-relevant text in this pull request template.
Reviewer assignment: Tag a relevant team member to review and approve the changes.

eboyer221

Ran PR locally for Shigella flexneri and it worked as expected once I made the code changes suggested in this review (see inline comments below).
All expected tables present at the end of the run:

Metadata: metadata, amr_phenotype, genome_data, original_metadata
Genes: gene_count, gene_names, gene_seqs, struct
Proteins: protein_count, protein_names, protein_seqs, protein_members, genome_gene_protein
Domains: domain_count, domain_names

jananiravi

because of a lot of formatting changes, not sure what were the actual changes -- looks OK at a high-level. what's new here, @AbhirupaGhosh?

AbhirupaGhosh · 2026-04-22T19:31:50Z

> because of a lot of formatting changes, not sure what were the actual changes -- looks OK at a high-level. what's new here, @AbhirupaGhosh?

The PR started by splitting the functions that create Parquet files for metadata and for feature data. After that came minor changes to the regex patterns that read BV-BRC feature IDs, reformatting of the summary stats, and adding the missing libraries via @importFrom.
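
For readers unfamiliar with the feature IDs mentioned above, here is a minimal Python sketch of splitting a BV-BRC-style feature ID into its parts. It assumes IDs of the form fig|<genome_id>.<feature_type>.<index>. It is not the regex used in the package's R code, which is not shown in this conversation, and the function name and example ID are hypothetical.

```python
import re

# Assumed ID shape: fig|<taxon>.<version>.<feature_type>.<index>
# e.g. "fig|623.10.peg.42" (made-up example, not taken from the PR).
FEATURE_ID = re.compile(
    r"fig\|(?P<genome_id>\d+\.\d+)\.(?P<feature_type>[A-Za-z_]+)\.(?P<index>\d+)"
)


def split_feature_id(text):
    """Return (genome_id, feature_type, index) or None if no ID is found.

    Hypothetical helper for illustration only.
    """
    match = FEATURE_ID.search(text)
    if match is None:
        return None
    # The genome ID stays a string on purpose (see the note below the block).
    return (
        match.group("genome_id"),
        match.group("feature_type"),
        int(match.group("index")),
    )


parts = split_feature_id("fig|623.10.peg.42")
print(parts)
```

Keeping the genome ID as text matters in a sketch like this: read as a number, "623.10" would become 623.1 and no longer match the original ID.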

eboyer221 · 2026-06-02T16:45:53Z

Reviewed since my last pass, everything looks good so I am going to approve. I pushed a small fix because cleanMetaData() was still hard-coding absolute paths in its four parquet view definitions, so I applied the same basename() + SET file_search_path pattern from PR #24 to bring it in line with cleanData(). Five-line change. @AbhirupaGhosh

AbhirupaGhosh · 2026-06-02T16:50:17Z

> Reviewed since my last pass, everything looks good so I am going to approve. I pushed a small fix because cleanMetaData() was still hard-coding absolute paths in its four parquet view definitions, so I applied the same basename() + SET file_search_path pattern from PR #24 to bring it in line with cleanData(). Five-line change. @AbhirupaGhosh

Thanks @eboyer221. I didn't know how to point the paths to DuckDB.

eboyer221 · 2026-06-02T16:51:13Z

@AbhirupaGhosh Yeah, I think we have it formatted correctly now so the new SET file_search_path='%s' line right above tells DuckDB where to look, so the filename alone is enough to find the parquet. That way the .duckdb still works if the folder gets moved. I tested it locally, moved the folder to a new path and the views still resolved as long as file_search_path is set on reopen.
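
A rough sketch of the pattern described above, written in Python only to show the SQL that such a change would generate. The helper names and example paths are hypothetical, and the actual fix lives in the R function cleanMetaData(), which is not shown here. file_search_path and read_parquet are DuckDB's own setting and table function.

```python
import os


def sql_literal(value):
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def portable_view_statements(parquet_paths, search_dir):
    """Build SQL that defines views by file name only (hypothetical helper).

    The first statement tells DuckDB which directory to search; each view
    then refers to its Parquet file by basename instead of an absolute path.
    """
    statements = ["SET file_search_path=" + sql_literal(search_dir) + ";"]
    for path in parquet_paths:
        file_name = os.path.basename(path)          # drop the directory part
        view_name = os.path.splitext(file_name)[0]  # e.g. "metadata"
        statements.append(
            f'CREATE OR REPLACE VIEW "{view_name}" AS '
            f"SELECT * FROM read_parquet({sql_literal(file_name)});"
        )
    return statements


# Made-up output folder holding the four metadata Parquet files.
out_dir = "/data/Sfl"
paths = [
    out_dir + "/" + name + ".parquet"
    for name in ("metadata", "amr_phenotype", "genome_data", "original_metadata")
]
statements = portable_view_statements(paths, out_dir)
for statement in statements:
    print(statement)
```

Because the view definitions contain only file names, moving the folder just means issuing the SET statement with the new directory after reopening the database, which matches the behavior reported in the comment above.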

AbhirupaGhosh · 2026-06-02T16:52:33Z

> Yeah, I think we have it formatted correctly now so the new SET file_search_path='%s' line right above tells DuckDB where to look, so the filename alone is enough to find the parquet. That way the .duckdb still works if the folder gets moved. I tested it locally, moved the folder to a new path and the views still resolved as long as file_search_path is set on reopen.

Maybe we have to do the same thing in other scripts. You can go ahead and merge this PR then.

AbhirupaGhosh and others added 3 commits January 28, 2026 16:31

Update regex patterns for feature ID extraction

73f7b52

Style code (GHA)

8d494a5

AbhirupaGhosh requested review from charmvang, epbrenner and jananiravi February 19, 2026 15:55

AbhirupaGhosh self-assigned this Feb 19, 2026

AbhirupaGhosh and others added 3 commits February 19, 2026 15:56

Style code (GHA)

c52aaa7

Updating download logic

4edc964

Fixed trailing zero bug, fixed FTP timeout bug (?), fixed empty files hanging downloads, fixed imbalanced genome data sets (e.g., no .fna, yes .faa, yes .gff)
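
To illustrate the "imbalanced genome data sets" and "empty files" cases named in this commit message, here is a minimal Python sketch of a completeness check. It assumes one file per genome and type, named <genome_id>.fna, <genome_id>.faa, and <genome_id>.gff. That naming, the function name, and the example files are all hypothetical, and this is not the package's download logic.

```python
import os
import tempfile

REQUIRED_EXTENSIONS = (".fna", ".faa", ".gff")


def incomplete_genomes(directory, required=REQUIRED_EXTENSIONS):
    """Map genome ID -> list of required extensions that are missing or empty."""
    sizes_by_genome = {}
    for name in os.listdir(directory):
        stem, ext = os.path.splitext(name)  # "623.2.faa" -> ("623.2", ".faa")
        if ext not in required:
            continue
        size = os.path.getsize(os.path.join(directory, name))
        sizes_by_genome.setdefault(stem, {})[ext] = size

    problems = {}
    for genome_id, sizes in sizes_by_genome.items():
        # A missing file and a zero-byte file are treated the same way.
        bad = [ext for ext in required if sizes.get(ext, 0) == 0]
        if bad:
            problems[genome_id] = bad
    return problems


# Demo on a throwaway directory with made-up genome IDs.
with tempfile.TemporaryDirectory() as tmp:
    files = {
        "623.1.fna": ">contig\nACGT\n",
        "623.1.faa": ">prot\nMK\n",
        "623.1.gff": "##gff-version 3\n",
        "623.2.faa": ">prot\nMK\n",  # no .fna for this genome
        "623.2.gff": "",             # empty file
    }
    for name, content in files.items():
        with open(os.path.join(tmp, name), "w") as handle:
            handle.write(content)
    report = incomplete_genomes(tmp)

print(report)
```

A report like this could be used to skip or re-download a genome before feature extraction, rather than failing partway through.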

Style code (GHA)

5287d1b

AbhirupaGhosh assigned AbhirupaGhosh and epbrenner and unassigned AbhirupaGhosh Feb 26, 2026

AbhirupaGhosh commented Feb 27, 2026

Comment thread R/data_curation.R

AbhirupaGhosh mentioned this pull request Feb 27, 2026

Summary printing after retrieveMetadata #12

Open

AbhirupaGhosh commented Feb 27, 2026

Comment thread R/data_curation.R Outdated

AbhirupaGhosh commented Feb 27, 2026

Comment thread R/data_curation.R Outdated

AbhirupaGhosh and others added 10 commits March 11, 2026 16:31

Add CD-HIT parsing function and update DB logic

f04d3dc

Added a function to parse CD-HIT .clstr output into a long-format mapping of clusters to member feature ids. Updated database writing logic to include the new protein members table.
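
As an illustration of what a long-format cluster-to-member mapping looks like, here is a minimal Python sketch of parsing CD-HIT .clstr text. It is not the R function added in this commit. The field names and example IDs are hypothetical, and it assumes member IDs were not truncated by CD-HIT.

```python
import re
from typing import Iterable, List, NamedTuple


class ClusterMember(NamedTuple):
    cluster_id: int
    member_id: str
    length: int
    is_representative: bool


# Member lines look like: "0<TAB>350aa, >some_id... *" or "... at 98.85%".
_MEMBER = re.compile(r"^\d+\s+(\d+)(?:aa|nt),\s+>(.+?)\.\.\.\s+(\*|at\s.+)$")


def parse_clstr(lines: Iterable[str]) -> List[ClusterMember]:
    """Return one row per (cluster, member) pair, i.e. long format."""
    rows = []
    cluster_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            cluster_id = int(line.split()[1])  # ">Cluster 12" -> 12
            continue
        match = _MEMBER.match(line)
        if match is None or cluster_id is None:
            raise ValueError(f"Unrecognized .clstr line: {line!r}")
        length, member_id, status = match.groups()
        # "*" marks the representative sequence of the cluster.
        rows.append(ClusterMember(cluster_id, member_id, int(length), status == "*"))
    return rows


# Made-up example in .clstr layout.
example = """>Cluster 0
0\t350aa, >fig|623.1.peg.10... *
1\t348aa, >fig|623.2.peg.77... at 98.85%
>Cluster 1
0\t120aa, >fig|623.1.peg.11... *
"""
members = parse_clstr(example.splitlines())
for row in members:
    print(row)
```

Each row carries its cluster ID, so the result can be written directly as a table such as the protein_members table listed in the review.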

Style code (GHA)

6bcddb6

Merge branch 'cleanData' into minor_regex_change

394cfd7

Merge pull request #15 from JRaviLab/minor_regex_change

9da2f23

Minor regex change --> cleanData

Refactor data processing to use filtered_metadata

918b2e0

Style code (GHA)

e85f88f

Filter genome drug resistant phenotype in data processing

3461807

Style code (GHA)

577b251

Refactor resistance summary calculation

ea4d844

Updated resistance summary calculation to read from the database and include a count of resistant classes.

Style code (GHA)

93b5b2e

AbhirupaGhosh commented Mar 18, 2026

Comment thread R/data_processing.R Outdated

eboyer221 self-requested a review April 8, 2026 17:43

AbhirupaGhosh commented Apr 10, 2026

Comment thread R/data_processing.R Outdated

eboyer221 requested changes Apr 15, 2026

Comment thread R/data_processing.R

Comment thread R/data_processing.R Outdated

AbhirupaGhosh commented Apr 20, 2026

Comment thread R/data_processing.R

AbhirupaGhosh and others added 3 commits April 20, 2026 10:34

Apply suggestions from code review

1d92219

Co-authored-by: Abhirupa Ghosh <100681585+AbhirupaGhosh@users.noreply.github.com> Co-authored-by: Emily Boyer <130874527+eboyer221@users.noreply.github.com>

Style code (GHA)

f71dd45

generated NAMESPACE

29ab577

eboyer221 previously approved these changes Apr 21, 2026

jananiravi previously approved these changes Apr 22, 2026

Comment thread vignettes/intro.Rmd

Merge branch 'main' into cleanData

15d50de

AbhirupaGhosh dismissed stale reviews from jananiravi and eboyer221 via 15d50de June 2, 2026 15:41

apply basename + file_search_path to cleanMetaData parquet views

3a543cc

AbhirupaGhosh closed this Jun 2, 2026

AbhirupaGhosh reopened this Jun 2, 2026

eboyer221 approved these changes Jun 2, 2026

eboyer221 merged commit 8a8a472 into main Jun 2, 2026
1 of 13 checks passed

Split cleanData and add Parquet exports #11: eboyer221 merged 21 commits into main from cleanData

4 participants
