Rename the original cleanData to cleanMetaData and add roxygen skeleton. Introduce a writeCompressedParquet helper and export cleaned metadata, AMR phenotype, genome data and original metadata to compressed Parquet files, then create a separate DuckDB (parquet-backed) with views for metadata, amr_phenotype, genome_data and original_metadata. Reintroduce a new cleanData function focused on feature matrices (genes/proteins/domains/etc.) that writes feature tables to Parquet and creates corresponding views; remove duplicated metadata parquet exports from the feature-matrix flow. Minor whitespace and path-handling adjustments to normalize paths and ensure output directories exist.
Fixed trailing zero bug, fixed FTP timeout bug (?), fixed empty files hanging downloads, fixed imbalanced genome data sets (e.g., no .fna, yes .faa, yes .gff)
Added a function to parse CD-HIT .clstr output into a long-format mapping of clusters to member feature ids. Updated database writing logic to include the new protein members table.
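For readers unfamiliar with the CD-HIT .clstr layout, the parsing step described above can be sketched roughly as follows. This is a Python illustration, not the R function added in this PR; the function name `parse_clstr`, the output column names, and the example feature ids are all hypothetical. It assumes the usual .clstr layout: a `>Cluster N` header followed by member lines in which the id sits between `>` and `...`, with a trailing `*` marking the representative sequence.

```python
import re

# Member lines look like: "0<TAB>2799aa, >fig|562.1.peg.1... *"
# The id is everything between ">" and the literal "...".
MEMBER_RE = re.compile(r">(?P<id>.+?)\.\.\.\s*(?P<rest>.*)$")


def parse_clstr(lines):
    """Return one row per cluster member (long format).

    Column names are assumptions made for this sketch.
    """
    rows = []
    cluster_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # Header line, e.g. ">Cluster 12"
            cluster_id = int(line.split()[1])
            continue
        match = MEMBER_RE.search(line)
        if match is None or cluster_id is None:
            # Skip anything that is not a recognizable member line
            continue
        rows.append({
            "cluster_id": cluster_id,
            "feature_id": match.group("id"),
            # A lone "*" after the id marks the cluster representative
            "is_representative": match.group("rest").strip() == "*",
        })
    return rows


# Made-up example input (ids are illustrative only)
example = """>Cluster 0
0\t2799aa, >fig|562.1.peg.1... *
1\t2214aa, >fig|562.2.peg.7... at 80.00%
>Cluster 1
0\t150aa, >fig|562.3.peg.9... *
""".splitlines()

rows = parse_clstr(example)
for row in rows:
    print(row)
```

Keeping the result in long format (one row per cluster member) makes it straightforward to write as a single table, such as the protein members table mentioned above, and to join back to per-feature data by id.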
Minor regex change --> cleanData
Updated resistance summary calculation to read from the database and include a count of resistant classes.
Ran PR locally for Shigella flexneri and it worked as expected once I made the code changes suggested in this review (see inline comments below).
All expected tables present at the end of the run:
Metadata: metadata, amr_phenotype, genome_data, original_metadata
Genes: gene_count, gene_names, gene_seqs, struct
Proteins: protein_count, protein_names, protein_seqs, protein_members, genome_gene_protein
Domains: domain_count, domain_names
Co-authored-by: Abhirupa Ghosh <100681585+AbhirupaGhosh@users.noreply.github.com> Co-authored-by: Emily Boyer <130874527+eboyer221@users.noreply.github.com>
jananiravi left a comment
Because of the many formatting changes, I'm not sure what the actual changes were; it looks OK at a high level. What's new here, @AbhirupaGhosh?
The PR started with separating the functions that create parquets from metadata and from feature data. Then came minor changes in the regex patterns that read BV-BRC feature ids, reformatting the summary stats, adding the missed libraries as ...
I reviewed the changes since my last pass and everything looks good, so I am going to approve. I pushed a small fix because cleanMetaData() was still hard-coding absolute paths in its four parquet view definitions; I applied the same basename() + SET file_search_path pattern from PR #24 to bring it in line with cleanData(). It is a five-line change. @AbhirupaGhosh
Thanks @eboyer221. I didn't know how to point the paths to DuckDB.
@AbhirupaGhosh Yeah, I think we have it formatted correctly now. The new SET file_search_path='%s' line right above tells DuckDB where to look, so the filename alone is enough to find the parquet. That way the .duckdb still works if the folder gets moved. I tested it locally: I moved the folder to a new path and the views still resolved, as long as file_search_path is set on reopen.
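The pattern described in the comment above can be sketched as the SQL it would send to DuckDB. This is a Python illustration only, not the R code in the PR: the helper name `parquet_view_statements` and the example paths are hypothetical, and the sketch assumes the view name is simply the parquet filename without its extension. It only builds the statements; running them would need a DuckDB connection.

```python
import os


def parquet_view_statements(folder, parquet_paths):
    """Build SQL that sets the search folder, then defines views by bare filename.

    Hypothetical helper for illustration; names and layout are assumptions.
    """
    # Escape single quotes so the folder is a valid SQL string literal
    safe_folder = folder.replace("'", "''")
    statements = ["SET file_search_path='%s'" % safe_folder]
    for path in parquet_paths:
        # Keep only the filename so the view does not embed an absolute path
        filename = os.path.basename(path)
        view_name = os.path.splitext(filename)[0]
        statements.append(
            'CREATE OR REPLACE VIEW "%s" AS SELECT * FROM read_parquet(\'%s\')'
            % (view_name, filename.replace("'", "''"))
        )
    return statements


# Made-up example paths
statements = parquet_view_statements(
    "/data/run1",
    ["/data/run1/metadata.parquet", "/data/run1/amr_phenotype.parquet"],
)
for sql in statements:
    print(sql + ";")
```

Because the view definitions contain only filenames, moving the folder means changing just the `SET file_search_path` value, which, as noted above, has to be set again when the database is reopened.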
Maybe we have to do the same thing in other scripts. You can go ahead and merge this PR then.