
Conversation

@JATAYU000
Contributor

Metadata

@codecov-commenter

codecov-commenter commented Jan 8, 2026

Codecov Report

❌ Patch coverage is 57.45721% with 174 lines in your changes missing coverage. Please review.
✅ Project coverage is 53.32%. Comparing base (c5f68bf) to head (96df5e3).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| openml/_api/resources/datasets.py | 31.55% | 128 Missing ⚠️ |
| openml/datasets/functions.py | 6.66% | 14 Missing ⚠️ |
| openml/_api/http/client.py | 82.60% | 12 Missing ⚠️ |
| openml/_api/resources/tasks.py | 87.23% | 6 Missing ⚠️ |
| openml/_api/runtime/fallback.py | 0.00% | 6 Missing ⚠️ |
| openml/_api/runtime/core.py | 81.48% | 5 Missing ⚠️ |
| openml/_api/__init__.py | 75.00% | 1 Missing ⚠️ |
| openml/_api/config.py | 96.87% | 1 Missing ⚠️ |
| openml/tasks/functions.py | 87.50% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1608      +/-   ##
==========================================
+ Coverage   53.02%   53.32%   +0.29%     
==========================================
  Files          36       46      +10     
  Lines        4326     4645     +319     
==========================================
+ Hits         2294     2477     +183     
- Misses       2032     2168     +136     

☔ View full report in Codecov by Sentry.

@JATAYU000
Contributor Author

JATAYU000 commented Jan 9, 2026

FYI @geetu040 Currently the get_dataset() function has 3 download requirements:

  • download_data : uses api_calls._download_minio_bucket() to download all the files in the bucket if the download_all_files param is True, and api_calls._download_minio_file() to download the dataset.pq file if it is not found in the cache. When the parquet download fails, it falls back to downloading the dataset.arff file with a GET request (see the sketch after this list).
  • download_features : if feature_file is passed via init, it is parsed during initialization; otherwise a GET request is made and the XML is cached.
  • download_qualities : if qualities_file is passed via init, it is parsed during initialization; otherwise a GET request is made and the XML is cached.
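
For concreteness, a rough sketch of the download_data flow, assuming openml's _api_calls helpers keep their current names (the parameter plumbing and the _download_text_file fallback are simplified approximations, not the actual implementation):

from pathlib import Path

from openml import _api_calls

def _download_data(parquet_url: str, arff_url: str, cache_dir: Path,
                   download_all_files: bool = False) -> Path:
    # Sketch only: the real code threads more parameters and error handling.
    if download_all_files:
        # pull every file in the dataset's minio bucket
        _api_calls._download_minio_bucket(parquet_url, cache_dir)
    try:
        # preferred path: the dataset.pq file, skipped if already cached
        _api_calls._download_minio_file(parquet_url, cache_dir / "dataset.pq")
        return cache_dir / "dataset.pq"
    except Exception:
        # fallback: fetch dataset.arff with a plain GET request
        arff = _api_calls._download_text_file(arff_url)
        (cache_dir / "dataset.arff").write_text(arff)
        return cache_dir / "dataset.arff"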

Issues:

  • The data files .pq and .arff are common across versions, so it doesn't make sense to download them multiple times.
  • Path handling for downloads that need to return a path, especially the data files. As mentioned in the meet, I can try a download-specific class which uses the cache mixin and is inherited only by the dataset resource.
  • The current implementation in OpenMLDataset has v1-specific parsing, which in my opinion should use the current interface (api_context) instead.

Example:

current load_features() ref link
This calls a function which downloads and returns a file path, and then parses from that file path.
This can be changed by updating that function's definition ref link to get -> parse -> return features instead of file paths:

def _get_dataset_features_file(did_cache_dir: str | Path | None, dataset_id: int) -> dict[int, OpenMLDataFeature]:
    ...
    return _features

Or by updating the Dataset class to use the underlying interface method from api_context directly:

def _load_features(self) -> None:
    ...
    self._features = api_context.backend.datasets.get_features(self.dataset_id)

Another option is to add return_path to client requests, which in my opinion would be wasteful: it adds a param to all the client methods for just the dataset resource, and for something that could be handled without it, as mentioned above.

@geetu040 geetu040 mentioned this pull request Jan 9, 2026
Contributor

@geetu040 geetu040 left a comment

Left an intermediate review. This is solid work and well done overall. Nice job. I'll look into the download part now.

Comment on lines +31 to +38
def list(
    self,
    limit: int,
    offset: int,
    *,
    data_id: list[int] | None = None,  # type: ignore
    **kwargs: Any,
) -> pd.DataFrame: ...
Contributor

Can we not have the same signature for all 3 methods: DatasetsAPI.list, DatasetsV1.list, DatasetsV2.list? Does it raise pre-commit failures since a few parameters might not be used?

Contributor Author

Oh, that v2 signature was experimental; idk how pre-commit did not catch that. Will make them the same.

Comment on lines +35 to +42
def list(
    self,
    limit: int,
    offset: int,
    *,
    data_id: list[int] | None = None,  # type: ignore
    **kwargs: Any,
) -> pd.DataFrame:
Contributor

You can simplify this using private helper methods.

Contributor Author

@geetu040 I could not find a good common block to make a helper, since the filters are passed via the URL in v1 and via a JSON object in v2, and both have different parsing (see the sketch below). If you have any specific idea on that, please let me know.
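
For illustration, a hedged sketch of that divergence (the endpoint paths and the client object are placeholders, not the real routes):

from typing import Any

def list_v1(client: Any, filters: dict[str, str]) -> str:
    # v1: filters are encoded into URL segments; the response is XML
    suffix = "/".join(f"{key}/{value}" for key, value in filters.items())
    return client.get(f"data/list/{suffix}")

def list_v2(client: Any, filters: dict[str, str]) -> dict[str, Any]:
    # v2: filters travel in a JSON payload; the response is JSON
    return client.post("datasets/list", json={"filters": filters})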

Contributor

Looking at DatasetsV1, I can think of these helper methods: _build_list_call, _parse_and_validate_xml, _parse_dataset.
You can do something similar for DatasetsV2, though they can be different.
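
A minimal sketch of that decomposition using the suggested names (helper bodies elided; the base-class plumbing that provides self._client is assumed):

from typing import Any

import pandas as pd

class DatasetsV1:
    def list(self, limit: int, offset: int, **kwargs: Any) -> pd.DataFrame:
        # each step delegates to a small, separately testable helper
        url = self._build_list_call(limit, offset, **kwargs)
        root = self._parse_and_validate_xml(self._client.get(url))
        rows = [self._parse_dataset(node) for node in root]
        return pd.DataFrame(rows)

    def _build_list_call(self, limit: int, offset: int, **kwargs: Any) -> str:
        ...  # assemble the v1 list URL with the filters appended

    def _parse_and_validate_xml(self, response: str) -> Any:
        ...  # parse the XML and check for the expected root element

    def _parse_dataset(self, node: Any) -> dict[str, Any]:
        ...  # flatten one XML node into a row dict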

Contributor Author

you can do something similar for DatasetsV2, though they can be different.

I see, that opens more options.

    bool
        True if the deletion was successful. False otherwise.
    """
    return openml.utils._delete_entity("data", dataset_id)
Contributor

if you implement the delete logic yourself instead of openml.utils._delete_entity, how would that look? I think it would be better.

Contributor Author

Makes sense. It would look like a DELETE request from the client along with exception handling; a rough sketch below.
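
A hedged sketch of how that could look (the endpoint path and the exception to catch are assumptions):

from typing import Any

from openml.exceptions import OpenMLServerException

class DatasetsV1:
    def __init__(self, client: Any) -> None:
        self._client = client  # the versioned HTTP client

    def delete(self, dataset_id: int) -> bool:
        """Delete a dataset; True on success, False otherwise."""
        try:
            # placeholder route mirroring the v1 convention
            self._client.delete(f"data/{dataset_id}")
        except OpenMLServerException:
            # server refused the deletion (not the owner, in use, ...)
            return False
        return True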

Comment on lines 456 to 461
def list(
    self,
    limit: int,
    offset: int,
    **kwargs: Any,
) -> pd.DataFrame:
Contributor

same as above, it can use private helper methods

# Minimalistic check if the XML is useful
if "oml:data_qualities_list" not in qualities:
    raise ValueError('Error in return XML, does not contain "oml:data_qualities_list"')
from openml._api import api_context
Contributor

Can't we have this import at the very top? Does it create a circular import error? If not, it should be moved to the top in all functions.

Contributor Author

It does raise a circular import error.
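
For context, the deferred import in the diff above works because the import only runs at call time, after both modules have finished initializing; a generic sketch (the backend method name is illustrative):

class OpenMLDataset:
    def _load_qualities(self) -> None:
        # imported here rather than at module top level: openml._api
        # imports from the datasets module, so a top-level import
        # would close the cycle and fail at import time
        from openml._api import api_context

        self._qualities = api_context.backend.datasets.get_qualities(self.dataset_id)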

@geetu040
Contributor

FYI @geetu040 Currently the get_dataset() function has 3 download requirements

Thanks for the detailed explanation, I now have a good understanding of the download mechanism.

download_data : uses api_calls._download_minio_bucket() to download all the files in the bucket if the download_all_files param is True, and api_calls._download_minio_file() to download the dataset.pq file if it is not found in the cache. When the parquet download fails, it falls back to downloading the dataset.arff file with a GET request.

Minio can be handled easily: we will use a separate client alongside HTTPClient, or implement its methods in HTTPClient, which works independently of the API version. A sketch of the separate-client option is below.
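
A minimal sketch of that option, assuming the minio package (the wrapper name, endpoint handling, and anonymous access are placeholders):

from pathlib import Path

from minio import Minio

class MinioClient:
    """Thin object-storage wrapper used alongside HTTPClient.

    Behaves the same under API v1 and v2, since the data files live
    in the same buckets regardless of which API version is in use.
    """

    def __init__(self, endpoint: str) -> None:
        # anonymous read-only access suffices for public datasets
        self._client = Minio(endpoint, secure=True)

    def download_file(self, bucket: str, object_name: str, dest: Path) -> Path:
        # fget_object streams the object straight to a local file
        self._client.fget_object(bucket, object_name, str(dest))
        return dest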

  • download_features : if feature_file is passed via init, it is parsed during initialization; otherwise a GET request is made and the XML is cached.

  • download_qualities : if qualities_file is passed via init, it is parsed during initialization; otherwise a GET request is made and the XML is cached.

These are actually different objects in the two APIs: v1 uses XML and v2 keeps them in JSON.

The data files .pq and .arff are common across versions, so it doesn't make sense to download them multiple times.

Yes, you are right, they are the same files and should not need to be downloaded again for each version. But isn't this true for almost all the HTTP objects? They may have a different format (XML or JSON) and a slightly different structure, but once parsed most are identical, so shouldn't this rule be applied to all the responses?

Path handling for downloads that need to return a path, especially the data files. As mentioned in the meet, I can try a download-specific class which uses the cache mixin and is inherited only by the dataset resource.

I don't understand this point

The current implementation in OpenMLDataset has v1-specific parsing, which in my opinion should use the current interface (api_context) instead.

Agreed, this should be handled by api_context.

Another option is to add return_path to client requests, which in my opinion would be wasteful: it adds a param to all the client methods for just the dataset resource, and for something that could be handled without it, as mentioned above.

Agreed, adding return_path for just one specific method of one resource is not preferred.


In conclusion, I may ask: if we ignore the fact that it downloads the .arff files for both versions separately, does everything else work out smoothly without any blockers? I think ignoring this part is not really bad, because conceptually this rule could be applied to almost every other response object.
