Dev Notes 4
When using the report method, users will receive a JSON-like document specifying the duplicates in their records. The document will have the following structure:
{
"email": "Email to send notification to",
"fields": "Number of fields of the data set",
"records": "Number of records parsed (with duplicates)",
"warnings": "List of warning messages, if any",
"file": "Link to the generated file. Only if 'action' is 'remove' or 'flag'",
"strict_duplicates": {
"count": "Number of rows that are exact copies of other rows",
"ids": "List consisting of the IDs of the duplicate rows. Only if ID field is provided or can be determined",
"index_pairs": "List consisting of the positions of duplicate record pairs. Only if 'count' > 0"
},
"partial_duplicates": {
"count": "Number of rows that are partial copies of other rows",
"ids": "List consisting of the IDs of the duplicate rows. Only if ID field is provided or can be determined",
"index_pairs": "List consisting of the positions of duplicate record pairs. Only if 'count' > 0"
},
"To be continued..."
}

Actually, since we are no longer offering direct parsing of files, offering a JSON-like object with such information makes no sense. All people will receive is an email with the information.
In fact, I don't think people will ever use the report action alone, so I'll change the default to "flag" instead.
The remove action is pretty self-explanatory: the resulting file will omit duplicate rows.
The flag action is the default, as of Aug-4. The reasoning behind switching the default from report to flag is that, first, I don't think people will use report that often and, second, I'd rather have people receive more information rather than less, hence flag instead of remove as the default.
A recent conversation with John helped me clarify certain aspects of duplicate flagging. Here are the ideas:
- Add three fields to the dataset: `isDuplicate`, `duplicateType` and `duplicateOf`
- `isDuplicate` is a boolean field indicating whether or not the row is a duplicate of another row
- `duplicateType` is a controlled vocabulary indicating the type of duplicate: `full`, `partial` or any other
- `duplicateOf` is a list of all the other record IDs for which the current record is a duplicate. Even for strict duplicates, for the sake of consistency, it makes sense to make this field a list.
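
To make the three fields concrete, here is a minimal Python sketch of how they could be filled in for strict duplicates. It is not the actual implementation: the function name `flag_strict_duplicates`, the `id_field` parameter and the choice to mark every member of a duplicate group (rather than only the later copies) are all assumptions made for illustration, and partial duplicates are left out because their matching rule is not defined in these notes.

```python
from collections import defaultdict


def flag_strict_duplicates(records, id_field="id"):
    """Map each record ID to its three flag fields (hypothetical helper).

    Two records count as strict duplicates when every field except the ID
    is identical. Field values are assumed to be hashable (e.g. strings).
    """
    groups = defaultdict(list)
    for rec in records:
        # Content key: all fields but the ID, in a stable order.
        key = tuple(sorted((k, v) for k, v in rec.items() if k != id_field))
        groups[key].append(rec[id_field])

    flags = {}
    for ids in groups.values():
        is_dup = len(ids) > 1
        for rec_id in ids:
            flags[rec_id] = {
                "isDuplicate": is_dup,
                "duplicateType": "full" if is_dup else None,
                # Always a list for duplicates, even when there is one match.
                "duplicateOf": [i for i in ids if i != rec_id] if is_dup else None,
            }
    return flags


records = [
    {"id": "a", "name": "x"},
    {"id": "b", "name": "x"},
    {"id": "c", "name": "y"},
]
flags = flag_strict_duplicates(records)
```

In this sketch, records `a` and `b` point at each other through `duplicateOf`, while `c` gets null-like values for the type and the list.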
| Action | isDuplicate | Effect |
|---|---|---|
| report | no | Nothing |
| report | yes | Add to list |
| remove | no | Write row with no flags |
| remove | yes | Don't write row |
| flag | no | Write row with [0, null, null] |
| flag | yes | Write row with [1, type, list_of_dupes] |
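
The table above can be read as a small per-row dispatch. The following Python sketch is one possible way to write it, under assumptions: `apply_action` and its parameters are hypothetical names, rows are plain lists, and "Add to list" is taken to mean appending the row to a report list that the caller owns.

```python
def apply_action(action, row, is_duplicate, dup_type, dup_of, report):
    """Return the row to write, or None when nothing should be written.

    Hypothetical helper mirroring the action table; `report` is a list
    that collects duplicate rows when the action is "report".
    """
    if action == "report":
        if is_duplicate:
            report.append(row)  # "Add to list"
        return None  # report never writes rows to a file
    if action == "remove":
        # Duplicates are dropped; other rows are written with no flags.
        return None if is_duplicate else list(row)
    if action == "flag":
        if is_duplicate:
            return list(row) + [1, dup_type, dup_of]
        return list(row) + [0, None, None]
    raise ValueError("unknown action: %r" % (action,))


report = []
flagged = apply_action("flag", ["a", "x"], True, "full", ["b"], report)
clean = apply_action("flag", ["c", "y"], False, None, None, report)
```

Keeping the three flag values as trailing columns means a flagged file has the same shape for every row, whether or not the row is a duplicate.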
This repository is part of the VertNet project.
For more information, please check out the project's home page and GitHub organization page