Description
Three related gaps in openml/datasets/ verified on openml==0.16.0. A PR fixing all three is in progress.
- `OpenMLDataset.__repr__` crashes with `KeyError` when quality keys are missing or NaN
- `OpenMLDataset.get_data()` raises a bare `KeyError` from pandas internals when `target` is invalid or filtered out
- `create_dataset` has a misleading `str` annotation on `default_target_attribute` despite the REST API allowing `None`
All three are scoped to openml/datasets/dataset.py and openml/datasets/functions.py.
Bug 1 - OpenMLDataset.__repr__ raises KeyError on partial or NaN qualities
_get_repr_body_fields accesses _qualities with direct dict indexing at line 300:
```python
if self._qualities is not None and self._qualities["NumberOfFeatures"] is not None:
    n_features = int(self._qualities["NumberOfFeatures"])
```

If the qualities dict exists but is missing `"NumberOfFeatures"` or `"NumberOfInstances"` (which happens for newly uploaded datasets where the server has only partially computed qualities), `repr()` crashes with a `KeyError`. The same crash occurs when the server returns NaN for these keys.
This is distinct from #847, which fixed the case where _qualities is None entirely. Here _qualities is a non-empty dict, specific keys are just absent or NaN.
Steps to Reproduce
```python
import openml

ds = openml.datasets.get_dataset(61, download_data=False, download_qualities=True)
ds._qualities = {"SomeOtherQuality": 1.0}
repr(ds)
```

Expected Results
repr(dataset) returns a valid string. Fields derived from missing or NaN quality keys are simply omitted from the output.
Actual Results
```
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "openml/base.py", line 20, in __repr__
    body_fields = self._get_repr_body_fields()
  File "openml/datasets/dataset.py", line 300, in _get_repr_body_fields
    if self._qualities is not None and self._qualities["NumberOfFeatures"] is not None:
KeyError: 'NumberOfFeatures'
```
Bug 2 - OpenMLDataset.get_data() raises a bare pandas KeyError on invalid target
get_data() calls data.drop(columns=[target_name]) at line 797 without first checking whether target_name is present in data.columns. When the column is absent because of a typo, or because include_row_id=False / include_ignore_attribute=False silently removed it, pandas raises a raw KeyError with no OpenML context. The user has no way of knowing whether the column never existed or was filtered out by an OpenML flag.
Steps to Reproduce
```python
import openml

ds = openml.datasets.get_dataset(61, download_data=True)
ds.get_data(target="nonexistent_column")
```

Expected Results
A `ValueError` with a clear, actionable message raised before the drop call, e.g.:

```
ValueError: Target column 'nonexistent_column' does not exist in this dataset.
Available columns: ['sepal_length', 'sepal_width', 'petal_length', 'class']
```

Or, when the column was filtered out by a flag:

```
ValueError: Target column 'id' was removed from the dataset because it is listed
as a row_id or ignore attribute. Available columns after filtering: [...]
```
Actual Results
```
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "openml/datasets/dataset.py", line 797, in get_data
    x = data.drop(columns=[target_name])
  File "pandas/core/frame.py", line 5603, in drop
    ...
KeyError: "['nonexistent_column'] not found in axis"
```
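A sketch of the proposed guard, assuming a check runs just before the existing `data.drop(columns=[target_name])` call; `drop_target` is a hypothetical stand-in for that portion of `get_data()`, not the real method:

```python
import pandas as pd

def drop_target(data: pd.DataFrame, target_name: str) -> pd.DataFrame:
    # Validate the target column before handing it to pandas, so the user
    # gets an OpenML-level message instead of a bare pandas KeyError.
    if target_name not in data.columns:
        raise ValueError(
            f"Target column '{target_name}' does not exist in this dataset. "
            f"Available columns: {list(data.columns)}"
        )
    return data.drop(columns=[target_name])

df = pd.DataFrame({"sepal_length": [5.1], "class": ["setosa"]})
try:
    drop_target(df, "nonexistent_column")
except ValueError as err:
    print(err)  # actionable message listing the available columns
```

The second proposed message (column removed by `include_row_id` / `include_ignore_attribute`) would follow the same pattern, branching on whether the name appears in the dataset's row_id or ignore attribute lists.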
Bug 3 - create_dataset has a misleading type annotation on default_target_attribute
The REST API allows default_target_attribute to be absent for unsupervised datasets. The Python function currently annotates it as str, and an in-code TODO explicitly acknowledges the mismatch:
```python
# TODO(eddiebergman): Function requires `default_target_attribute` exist but API allows None
default_target_attribute: str,
```

At runtime, `_expand_parameter(None)` happens to return `[]`, so no exception is raised, but the `str` annotation causes mypy / static-analysis errors, the docstring gives no indication that `None` is valid, and the TODO confirms this was never intentional. Users creating unsupervised datasets (e.g. clustering or anomaly detection) have no way of knowing `None` is safe to pass, and type checkers will flag it as an error.
This is the complement of #964, which added validation that a non-None target refers to an existing column. This issue is about the upstream problem: None should be formally supported in the first place.
Steps to Reproduce
```python
import pandas as pd
from openml.datasets import create_dataset

df = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
create_dataset(
    name="test", description="test", creator=None,
    contributor=None, collection_date=None, language="English",
    licence=None, attributes="auto", data=df,
    default_target_attribute=None,  # type checker raises error here
    ignore_attribute=None, citation="N/A",
)
```

Expected Results
`default_target_attribute=None` is formally accepted: no mypy error, no ambiguity. The annotation is `str | None`, the docstring explicitly documents `None` as valid for unsupervised datasets, and the TODO is resolved. Passing a non-None invalid column name still raises a `ValueError` (existing #964 behaviour preserved).
Actual Results
The call succeeds silently at runtime, but the `str` annotation causes mypy to flag it as a type error, and there is no documentation that `None` is intentionally supported.
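A sketch of the annotation change under discussion; `create_dataset_sketch` and this simplified `_expand_parameter` are illustrative stand-ins, not the actual openml implementations:

```python
from typing import Optional

def _expand_parameter(value: Optional[str]) -> list:
    # Simplified stand-in for openml's internal helper: None maps to an
    # empty target list, matching the current (silent) runtime behaviour.
    return [] if value is None else [value]

def create_dataset_sketch(
    name: str,
    default_target_attribute: Optional[str] = None,  # proposed: str | None instead of str
):
    # With the widened annotation, passing None for an unsupervised dataset
    # type-checks cleanly and keeps the existing runtime behaviour.
    targets = _expand_parameter(default_target_attribute)
    return {"name": name, "targets": targets}

print(create_dataset_sketch("clustering-data"))  # {'name': 'clustering-data', 'targets': []}
```

The only behavioural change is documentary: the signature and docstring would state that `None` is valid, so type checkers stop flagging legitimate unsupervised uploads.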
Scope
All three fixes are contained in two files:
- `openml/datasets/dataset.py` - Bugs 1 and 2
- `openml/datasets/functions.py` - Bug 3
No changes to the REST API surface, no new public methods, no changes outside the dataset subpackage.
Versions
openml: 0.16.0
Python: 3.10.18
OS: Windows-10-10.0.26200-SP0