Skip to content

Split cleanData and add Parquet exports#11

Merged
eboyer221 merged 21 commits into
mainfrom
cleanData
Jun 2, 2026
Merged

Split cleanData and add Parquet exports#11
eboyer221 merged 21 commits into
mainfrom
cleanData

Conversation

@AbhirupaGhosh
Copy link
Copy Markdown
Contributor

Rename the original cleanData to cleanMetaData and add roxygen skeleton. Introduce a writeCompressedParquet helper and export cleaned metadata, AMR phenotype, genome data and original metadata to compressed Parquet files, then create a separate DuckDB (parquet-backed) with views for metadata, amr_phenotype, genome_data and original_metadata. Reintroduce a new cleanData function focused on feature matrices (genes/proteins/domains/etc.) that writes feature tables to Parquet and creates corresponding views; remove duplicated metadata parquet exports from the feature-matrix flow. Minor whitespace and path-handling adjustments to normalize paths and ensure output directories exist.

Description

What kind of change(s) are included?

  • Feature (adds or updates new capabilities)
  • Bug fix (fixes an issue).
  • Enhancement (adds functionality).
  • Breaking change (these changes would cause existing functionality to not work as expected).

Checklist

Please ensure that all boxes are checked before indicating that this pull request is ready for review.

  • I have read and followed the CONTRIBUTING.md guidelines.
  • I have searched for existing content to ensure this is not a duplicate.
  • I have performed a self-review of these additions (including spelling, grammar, and related).
  • I have added comments to my code to help provide understanding.
  • I have added a test which covers the code changes found within this PR.
  • I have deleted all non-relevant text in this pull request template.
  • Reviewer assignment: Tag a relevant team member to review and approve the changes.

AbhirupaGhosh and others added 3 commits January 28, 2026 16:31
Rename the original cleanData to cleanMetaData and add roxygen skeleton. Introduce a writeCompressedParquet helper and export cleaned metadata, AMR phenotype, genome data and original metadata to compressed Parquet files, then create a separate DuckDB (parquet-backed) with views for metadata, amr_phenotype, genome_data and original_metadata. Reintroduce a new cleanData function focused on feature matrices (genes/proteins/domains/etc.) that writes feature tables to Parquet and creates corresponding views; remove duplicated metadata parquet exports from the feature-matrix flow. Minor whitespace and path-handling adjustments to normalize paths and ensure output directories exist.
AbhirupaGhosh and others added 3 commits February 19, 2026 15:56
Fixed trailing zero bug, fixed FTP timeout bug (?), fixed empty files hanging downloads, fixed imbalanced genome data sets (e.g., no .fna, yes .faa, yes .gff)
Comment thread R/data_curation.R
Comment thread R/data_curation.R Outdated
Comment thread R/data_curation.R Outdated
AbhirupaGhosh and others added 10 commits March 11, 2026 16:31
Added a function to parse CD-HIT .clstr output into a long-format mapping of clusters to member feature ids. Updated database writing logic to include the new protein members table.
Minor regex change --> cleanData
Updated resistance summary calculation to read from the database and include a count of resistant classes.
Comment thread R/data_processing.R Outdated
@eboyer221 eboyer221 self-requested a review April 8, 2026 17:43
Comment thread R/data_processing.R Outdated
Copy link
Copy Markdown
Contributor

@eboyer221 eboyer221 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ran PR locally for Shigella flexneri and it worked as expected once I made the code changes suggested in this review (see inline comments below).
All expected tables present at the end of the run:

Metadata: metadata, amr_phenotype, genome_data, original_metadata
Genes: gene_count, gene_names, gene_seqs, struct
Proteins: protein_count, protein_names, protein_seqs, protein_members, genome_gene_protein
Domains: domain_count, domain_names

Comment thread R/data_processing.R
Comment thread R/data_processing.R Outdated
Comment thread R/data_processing.R
AbhirupaGhosh and others added 3 commits April 20, 2026 10:34
Co-authored-by: Abhirupa Ghosh <100681585+AbhirupaGhosh@users.noreply.github.com>
Co-authored-by: Emily Boyer <130874527+eboyer221@users.noreply.github.com>
eboyer221
eboyer221 previously approved these changes Apr 21, 2026
jananiravi
jananiravi previously approved these changes Apr 22, 2026
Copy link
Copy Markdown
Member

@jananiravi jananiravi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because of a lot of formatting changes, not sure what were the actual changes -- looks OK at a high-level. what's new here, @AbhirupaGhosh?

Comment thread vignettes/intro.Rmd
@AbhirupaGhosh
Copy link
Copy Markdown
Contributor Author

AbhirupaGhosh commented Apr 22, 2026

because of a lot of formatting changes, not sure what were the actual changes -- looks OK at a high-level. what's new here, @AbhirupaGhosh?

The PR started with separating functions to create parquets from Metadata and feature data. Then minor changes in regex patterns to read BV-BRC feature ids, reformating the summary stats, adding the missed libraries as @importFrom

@AbhirupaGhosh AbhirupaGhosh dismissed stale reviews from jananiravi and eboyer221 via 15d50de June 2, 2026 15:41
@eboyer221
Copy link
Copy Markdown
Contributor

Reviewed since my last pass, everything looks good so I am going to approve. I pushed a small fix because cleanMetaData() was still hard-coding absolute paths in its four parquet view definitions, so I applied the same basename() + SET file_search_path pattern from PR #24 to bring it in line with cleanData(). Five-line change. @AbhirupaGhosh

@AbhirupaGhosh
Copy link
Copy Markdown
Contributor Author

Reviewed since my last pass, everything looks good so I am going to approve. I pushed a small fix because cleanMetaData() was still hard-coding absolute paths in its four parquet view definitions, so I applied the same basename() + SET file_search_path pattern from PR #24 to bring it in line with cleanData(). Five-line change. @AbhirupaGhosh

Thanks @eboyer221. I didn't know how to point the paths to DuckDB.

@eboyer221
Copy link
Copy Markdown
Contributor

eboyer221 commented Jun 2, 2026

@AbhirupaGhosh Yeah, I think we have it formatted correctly now so the new SET file_search_path='%s' line right above tells DuckDB where to look, so the filename alone is enough to find the parquet. That way the .duckdb still works if the folder gets moved. I tested it locally, moved the folder to a new path and the views still resolved as long as file_search_path is set on reopen.

@AbhirupaGhosh
Copy link
Copy Markdown
Contributor Author

Yeah, I think we have it formatted correctly now so the new SET file_search_path='%s' line right above tells DuckDB where to look, so the filename alone is enough to find the parquet. That way the .duckdb still works if the folder gets moved. I tested it locally, moved the folder to a new path and the views still resolved as long as file_search_path is set on reopen.

Maybe we have to do the same thing in other scripts. You can go ahead and merge this PR then.

@eboyer221 eboyer221 merged commit 8a8a472 into main Jun 2, 2026
1 of 13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants