Skip to content

PEDS-1257 Add Difference class with diff#9

Open
sonali-shaw wants to merge 3 commits into
mainfrom
feature/diff
Open

PEDS-1257 Add Difference class with diff#9
sonali-shaw wants to merge 3 commits into
mainfrom
feature/diff

Conversation

@sonali-shaw

Copy link
Copy Markdown

Created the class Difference in pfb_to_zip.py that generates the difference between two datasets writes it out as an avro file. This functionality can be used from the command line with -d (diff.zip will be downloaded). Additionally, there will be another file downloaded containing some relevant console output (such as withdrawn consent indicating deleted subjects).

@grugna grugna changed the title Add Difference class with diff PEDS-1618 Add Difference class with diff Jun 11, 2026
@grugna grugna requested a review from sgchoe June 11, 2026 16:15
Comment thread pfb_to_zip/pfb_to_zip.py Outdated
@grugna grugna changed the title PEDS-1618 Add Difference class with diff PEDS-1257 Add Difference class with diff Jun 11, 2026
Comment thread pfb_to_zip/pfb_to_zip.py Outdated
Comment thread pfb_to_zip/pfb_to_zip.py

@sgchoe sgchoe left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see my line level comments regarding output of removed records and reconsidering how record changes are determined.

Comment thread pfb_to_zip/pfb_to_zip.py Outdated
for records in new_nodes.values():
diff_records.extend(records)
else:
if self._strip_timestamps(new_nodes) != self._strip_timestamps(old[sid]):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you want to sort this before comparison, PFB doesn't guarantee ordering:

def normalize(records):
    return sorted(
        records,
        key=lambda r: json.dumps(r, sort_keys=True)
    )

Comment thread pfb_to_zip/pfb_to_zip.py Outdated

print(f"Written to {output_path} with {len(diff_records)} records")

def create_test_diff_avro(self, output_path: str):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unless there is a specific use for this the test one should be removed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants