Dev Notes 4

Dev notes: Available actions

report
New version
remove
flag
Table of effects

`report`

When using the report method, users will receive a JSON-like document specifying the duplicates in their records. The document will have the following structure:

{
    "email": "Email to send notification to",
    "fields": "Number of fields of the data set",
    "records": "Number of records parsed (with duplicates)",
    "warnings": "List of warning messages, if any",
    "file": "Link to the generated file. Only if 'action' is 'remove' or 'flag'",
    "strict_duplicates": {
        "count": "Number of rows that are exact copies of other rows",
        "ids": "List consisting of the IDs of the duplicate rows. Only if ID field is provided or can be determined",
        "index_pairs": "List consisting of the positions of duplicate record pairs. Only if 'count' > 0"
    },
    "partial_duplicates": {
        "count": "Number of rows that are partial copies of other rows",
        "ids": "List consisting of the IDs of the duplicate rows. Only if ID field is provided or can be determined",
        "index_pairs": "List consisting of the positions of duplicate record pairs. Only if 'count' > 0"
    },
    "To be continued..."
}
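
As a rough illustration of how the `strict_duplicates` block above could be filled in, here is a minimal Python sketch. The function name `strict_duplicate_summary` is hypothetical, and two things are my assumptions rather than decisions recorded in these notes: `count` is taken to be the number of later copies (the first occurrence is not counted), and each entry of `index_pairs` is `(position of first occurrence, position of copy)`.

```python
def strict_duplicate_summary(rows):
    """Summarize exact-copy rows as a dict with 'count' and 'index_pairs'.

    Assumption: a row counts as a duplicate only if an identical row
    appeared earlier, so the first occurrence is never counted.
    """
    first_seen = {}   # row content -> position of its first occurrence
    index_pairs = []  # (first occurrence, later copy)
    for position, row in enumerate(rows):
        key = tuple(row)
        if key in first_seen:
            index_pairs.append((first_seen[key], position))
        else:
            first_seen[key] = position

    summary = {"count": len(index_pairs)}
    # Per the structure above, 'index_pairs' is present only if 'count' > 0.
    if index_pairs:
        summary["index_pairs"] = index_pairs
    return summary
```

Calling it on a list of rows (each row a list of field values) gives a dict that could be placed under the `strict_duplicates` key.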

New version

Actually, since we no longer offer direct parsing of files, returning a JSON-like object with this information makes no sense. All people will receive is an email with the information.

In fact, I don't think people will ever use this function alone. I'll change the default to `flag` instead.

`remove`

Pretty self-explanatory. The resulting file will omit duplicate rows.

`flag`

This is the default action as of Aug-4. There are two reasons for switching the default from `report` to `flag`: first, I don't think people will use `report` that often; second, I'd rather give people more information than less, hence `flag` rather than `remove` as the default.

A recent conversation with John helped me clarify certain aspects of duplicate flagging. Here are the ideas:

Add three fields to the dataset: `isDuplicate`, `duplicateType` and `duplicateOf`
`isDuplicate` is a boolean field indicating whether or not the row is a duplicate of another row
`duplicateType` is a controlled vocabulary indicating the type of duplicate: full, partial or any other
`duplicateOf` is a list of the IDs of all the other records that the current record duplicates. Even for strict duplicates, for the sake of consistency, it makes sense to make this field a list.
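
A minimal Python sketch of what these three fields could look like for strict duplicates. Everything here beyond the three field names is an assumption for illustration: the function name `flag_strict_duplicates`, the `id` field name, records represented as dicts, and the choice to flag every member of a group of identical records (so each one lists all the others in `duplicateOf`).

```python
def flag_strict_duplicates(records, id_field="id"):
    """Return copies of records with the three duplicate fields added.

    Assumptions: each record is a dict with hashable values and a unique
    ID under id_field; all members of a group of identical records are
    flagged, not just the later copies.
    """

    def content_key(record):
        # Compare every field except the ID, which differs between copies.
        return tuple(sorted((k, v) for k, v in record.items() if k != id_field))

    # Group record IDs by record content.
    ids_by_content = {}
    for record in records:
        ids_by_content.setdefault(content_key(record), []).append(record[id_field])

    flagged = []
    for record in records:
        others = [i for i in ids_by_content[content_key(record)]
                  if i != record[id_field]]
        is_duplicate = bool(others)
        flagged.append({
            **record,
            "isDuplicate": is_duplicate,
            "duplicateType": "full" if is_duplicate else None,
            # Always a list when set, even if there is a single other record.
            "duplicateOf": others if is_duplicate else None,
        })
    return flagged
```

Partial duplicates would need a different comparison key (for example, a subset of fields), which is not sketched here.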

Table of effects

Action	isDuplicate	Effect
report	no	Nothing
report	yes	Add to list
remove	no	Write row with no flags
remove	yes	Don't write row
flag	no	Write row with `[0, null, null]`
flag	yes	Write row with `[1, type, list_of_dupes]`
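
The table can be read as a small dispatch on the action and the duplicate status of each row. Below is a minimal Python sketch of that reading; `apply_action` and its argument shapes are hypothetical, and I am assuming that "Add to list" means recording the row's position in a list of reported duplicates.

```python
def apply_action(action, rows, flags):
    """Apply one action to every row, following the table of effects.

    rows  : list of rows, each a list of field values
    flags : one (is_duplicate, duplicate_type, duplicate_of) tuple per row
    Returns (written_rows, reported_positions).
    """
    if action not in ("report", "remove", "flag"):
        raise ValueError("Unknown action: %r" % (action,))

    written = []   # rows that end up in the output file
    reported = []  # positions of duplicate rows (assumed meaning of "Add to list")
    for position, (row, flag) in enumerate(zip(rows, flags)):
        is_duplicate, duplicate_type, duplicate_of = flag
        if action == "report":
            # report / no: nothing. report / yes: add to list.
            if is_duplicate:
                reported.append(position)
        elif action == "remove":
            # remove / no: write row with no flags. remove / yes: don't write row.
            if not is_duplicate:
                written.append(list(row))
        else:  # flag
            extra = [1, duplicate_type, duplicate_of] if is_duplicate else [0, None, None]
            written.append(list(row) + extra)
    return written, reported
```

Python's `None` stands in for the `null` values shown in the table.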

This repository is part of the VertNet project.

For more information, please check out the project's home page and GitHub organization page
