
[FIX] : Three dataset robustness gaps in OpenMLDataset and create_dataset #1711

@phantom-712

Description


Three related gaps in openml/datasets/, verified on openml==0.16.0. A PR fixing all three is in progress.

  1. OpenMLDataset.__repr__ crashes with KeyError when quality keys are missing or NaN
  2. OpenMLDataset.get_data() raises a bare KeyError from pandas internals when target is invalid or filtered out
  3. create_dataset has a misleading str annotation on default_target_attribute despite the REST API allowing None

All three are scoped to openml/datasets/dataset.py and openml/datasets/functions.py.


Bug 1 - OpenMLDataset.__repr__ raises KeyError on partial or NaN qualities

_get_repr_body_fields accesses _qualities with direct dict indexing at line 300:

if self._qualities is not None and self._qualities["NumberOfFeatures"] is not None:
    n_features = int(self._qualities["NumberOfFeatures"])

If the qualities dict exists but is missing "NumberOfFeatures" or "NumberOfInstances" (which happens for newly uploaded datasets where the server has only partially computed qualities), repr() crashes with a KeyError. repr() also crashes when the server returns NaN for these keys: NaN passes the `is not None` check, and int(NaN) then raises a ValueError.

This is distinct from #847, which fixed the case where _qualities is None entirely. Here _qualities is a non-empty dict; specific keys are just absent or NaN.

Steps to Reproduce

import openml
 
ds = openml.datasets.get_dataset(61, download_data=False, download_qualities=True)
ds._qualities = {"SomeOtherQuality": 1.0}
repr(ds)

Expected Results

repr(dataset) returns a valid string. Fields derived from missing or NaN quality keys are simply omitted from the output.

Actual Results

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "openml/base.py", line 20, in __repr__
    body_fields = self._get_repr_body_fields()
  File "openml/datasets/dataset.py", line 300, in _get_repr_body_fields
    if self._qualities is not None and self._qualities["NumberOfFeatures"] is not None:
KeyError: 'NumberOfFeatures'

Bug 2 - OpenMLDataset.get_data() raises a bare pandas KeyError on invalid target

get_data() calls data.drop(columns=[target_name]) at line 797 without first checking whether target_name is present in data.columns. When the column is absent because of a typo, or because include_row_id=False / include_ignore_attribute=False silently removed it, pandas raises a raw KeyError with no OpenML context. The user has no way of knowing whether the column never existed or was filtered out by an OpenML flag.

Steps to Reproduce

import openml
 
ds = openml.datasets.get_dataset(61, download_data=True)
ds.get_data(target="nonexistent_column")

Expected Results

A ValueError with a clear, actionable message, raised before the drop call, e.g.:

ValueError: Target column 'nonexistent_column' does not exist in this dataset.
Available columns: ['sepal_length', 'sepal_width', 'petal_length', 'class']

Or, when the column was filtered out by a flag:

ValueError: Target column 'id' was removed from the dataset because it is listed
as a row_id or ignore attribute. Available columns after filtering: [...]

Actual Results

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "openml/datasets/dataset.py", line 797, in get_data
    x = data.drop(columns=[target_name])
  File "pandas/core/frame.py", line 5603, in drop
    ...
KeyError: "['nonexistent_column'] not found in axis"
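A sketch of the pre-drop validation, assuming get_data() can track which columns were filtered out by the row_id/ignore flags. check_target and filtered_columns are illustrative names, not existing openml internals:

```python
def check_target(
    target_name: str,
    columns: list[str],
    filtered_columns: tuple = (),
) -> None:
    """Raise a contextual ValueError instead of letting pandas raise a bare KeyError.

    Illustrative sketch of the proposed check; not existing openml code.
    """
    if target_name in columns:
        return
    if target_name in filtered_columns:
        # The column existed, but an OpenML flag removed it before the drop
        raise ValueError(
            f"Target column '{target_name}' was removed from the dataset because "
            f"it is listed as a row_id or ignore attribute. "
            f"Available columns after filtering: {columns}"
        )
    raise ValueError(
        f"Target column '{target_name}' does not exist in this dataset. "
        f"Available columns: {columns}"
    )
```

The drop call itself is unchanged; the check only runs when a target is given, so the common no-target path pays nothing.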

Bug 3 - create_dataset has a misleading type annotation on default_target_attribute

The REST API allows default_target_attribute to be absent for unsupervised datasets. The Python function currently annotates it as str, and an in-code TODO explicitly acknowledges the mismatch:

# TODO(eddiebergman): Function requires `default_target_attribute` exist but API allows None
default_target_attribute: str,

At runtime, _expand_parameter(None) happens to return [], so no exception is raised. However, the str annotation causes mypy and other static analyzers to flag the call, the docstring gives no indication that None is valid, and the TODO confirms the mismatch was never intentional. Users creating unsupervised datasets (e.g. clustering or anomaly detection) have no way of knowing None is safe to pass, and type checkers will report it as an error.

This is the complement of #964, which added validation that a non-None target refers to an existing column. This issue is about the upstream problem: None should be formally supported in the first place.

Steps to Reproduce

import pandas as pd
from openml.datasets import create_dataset
 
df = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
create_dataset(
    name="test", description="test", creator=None,
    contributor=None, collection_date=None, language="English",
    licence=None, attributes="auto", data=df,
    default_target_attribute=None,  # type checker raises error here
    ignore_attribute=None, citation="N/A",
)

Expected Results

default_target_attribute=None is formally accepted: no mypy error, no ambiguity. The annotation becomes str | None, the docstring explicitly documents None as valid for unsupervised datasets, and the TODO is resolved. Passing a non-None invalid column name still raises a ValueError (existing #964 behaviour preserved).

Actual Results

The call succeeds silently at runtime, but the str annotation causes mypy to flag it as a type error, and nothing documents that None is intentionally supported.
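A sketch of the proposed signature change, with a stand-in for _expand_parameter's observed None-to-[] behaviour. expand_target is an illustrative stand-in, not the real helper:

```python
from __future__ import annotations


def expand_target(default_target_attribute: str | None) -> list[str]:
    """Stand-in for _expand_parameter, mirroring the observed runtime behaviour.

    With the annotation widened to `str | None`, passing None type-checks cleanly.
    """
    if default_target_attribute is None:
        return []  # unsupervised dataset: no default target attribute
    # comma-separated targets are split into a list (illustrative assumption)
    return [attr.strip() for attr in default_target_attribute.split(",")]


print(expand_target(None))     # []
print(expand_target("class"))  # ['class']
```

Only the annotation and docstring change; the runtime path for None is already correct, so the fix carries no behavioural risk.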


Scope

All three fixes are contained in two files:

  • openml/datasets/dataset.py - Bugs 1 and 2
  • openml/datasets/functions.py - Bug 3

No changes to the REST API surface, no new public methods, no changes outside the dataset subpackage.


Versions

openml:  0.16.0
Python:  3.10.18 
OS:      Windows-10-10.0.26200-SP0
