Is your feature request related to a problem? Please describe.
NA
Describe the solution you'd like
**MIGHT BE EXPORTED TO A SEPARATE REPO/PACKAGE IF IMPLEMENTATION IS TOO COMPLEX**
The initial goal of this feature is to help the user identify geographic identifiers in their dataset. For example, suppose a column called geoid contains the values 21321, 22433, ... We might assume these are zip or county codes, but we cannot be sure. If the column were instead called county and contained 32212, 13131, ..., the column name alone would tell us what the values refer to.
I think this would require some knowledge of machine learning (NLP on the column names) combined with the structure of the entries within a column. For now, this assumes US datasets only. For example, a 5-digit number would let us narrow the candidates down to zip, county, or cbsa, and the name of the column might then help classify it. (Could we expand this to other countries?) This would require a fair amount of knowledge of geographic identifiers.
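To make the idea concrete, here is a rough rule-based sketch (no machine learning yet) of narrowing a column down by value structure and then by column name. The function name `candidate_geo_types` and the keyword lists in `NAME_HINTS` are made up for illustration and are not part of any existing package.

```python
import re

# Hypothetical keyword hints per geographic type; far from exhaustive.
NAME_HINTS = {
    "zip": ("zip", "zcta", "postal"),
    "county": ("county", "cnty"),
    "cbsa": ("cbsa", "metro", "msa"),
}

FIVE_DIGITS = re.compile(r"\d{5}")


def candidate_geo_types(column_name, values):
    """Return the set of US geographic types a column could plausibly hold.

    Assumption: only 5-digit identifiers (zip, county, cbsa) are considered.
    Values stored as integers lose leading zeros (e.g. 01001 becomes 1001)
    and will not match; a real implementation would need to handle that.
    """
    cleaned = [str(v).strip() for v in values]
    if not cleaned or not all(FIVE_DIGITS.fullmatch(v) for v in cleaned):
        return set()

    # Structure alone only narrows things down to these three types.
    candidates = {"zip", "county", "cbsa"}

    # If the column name contains a hint, keep only the hinted types.
    name = column_name.lower()
    hinted = {
        geo
        for geo, words in NAME_HINTS.items()
        if any(word in name for word in words)
    }
    return hinted or candidates
```

With this sketch, a column named geoid holding 21321, 22433 stays ambiguous (all three candidates are returned), while a column named county holding 32212, 13131 resolves to county only.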
Another worthwhile feature would be to determine the hierarchy of geographic data within a dataset. For example, a state column should rank above a county column, with countysubdivision below that. This should likely be added only after there is a working solution for detecting geoids in a dataset.
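A possible shape for the hierarchy step, assuming detection has already labelled each column with a geographic level. The `GEO_RANK` table and the `order_by_hierarchy` name are hypothetical, and only the three levels mentioned above are ranked.

```python
# Hypothetical ranking, broadest level first; only covers the levels named above.
GEO_RANK = {"state": 0, "county": 1, "countysubdivision": 2}


def order_by_hierarchy(detected):
    """Order column names from broadest to narrowest geographic level.

    `detected` maps a column name to its detected level, e.g. {"st": "state"}.
    Columns whose level is not in GEO_RANK are left out of the result.
    """
    known = {col: level for col, level in detected.items() if level in GEO_RANK}
    return sorted(known, key=lambda col: GEO_RANK[known[col]])
```

For example, `order_by_hierarchy({"cousub": "countysubdivision", "st": "state", "cnty": "county"})` puts the state column first, then county, then countysubdivision.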
Describe alternatives you've considered
NA
Additional context
NA