
[FIX] : Three dataset robustness gaps in OpenMLDataset and create_dataset #1711

@phantom-712

Description


Three related gaps in openml/datasets/, verified on openml==0.16.0. A PR fixing all three is in progress.

  1. OpenMLDataset.__repr__ crashes with KeyError when quality keys are missing or NaN
  2. OpenMLDataset.get_data() raises a bare KeyError from pandas internals when target is invalid or filtered out
  3. create_dataset has a misleading str annotation on default_target_attribute despite the REST API allowing None

All three are scoped to openml/datasets/dataset.py and openml/datasets/functions.py.


Bug 1 - OpenMLDataset.__repr__ raises KeyError on partial or NaN qualities

_get_repr_body_fields accesses _qualities with direct dict indexing at line 300:

if self._qualities is not None and self._qualities["NumberOfFeatures"] is not None:
    n_features = int(self._qualities["NumberOfFeatures"])

If the qualities dict exists but is missing "NumberOfFeatures" or "NumberOfInstances" (which happens for newly uploaded datasets where the server has only partially computed qualities), repr() crashes with a KeyError. repr() also crashes when the server returns NaN for these keys: NaN passes the `is not None` check, and int(NaN) then raises a ValueError.

This is distinct from #847, which fixed the case where _qualities is None entirely. Here _qualities is a non-empty dict; specific keys are just absent or NaN.

Steps to Reproduce

import openml
 
ds = openml.datasets.get_dataset(61, download_data=False, download_qualities=True)
ds._qualities = {"SomeOtherQuality": 1.0}
repr(ds)

Expected Results

repr(dataset) returns a valid string. Fields derived from missing or NaN quality keys are simply omitted from the output.

Actual Results

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "openml/base.py", line 20, in __repr__
    body_fields = self._get_repr_body_fields()
  File "openml/datasets/dataset.py", line 300, in _get_repr_body_fields
    if self._qualities is not None and self._qualities["NumberOfFeatures"] is not None:
KeyError: 'NumberOfFeatures'

Bug 2 - OpenMLDataset.get_data() raises a bare pandas KeyError on invalid target

get_data() calls data.drop(columns=[target_name]) at line 797 without first checking whether target_name is present in data.columns. When the column is absent because of a typo, or because include_row_id=False / include_ignore_attribute=False silently removed it, pandas raises a raw KeyError with no OpenML context. The user has no way of knowing whether the column never existed or was filtered out by an OpenML flag.

Steps to Reproduce

import openml
 
ds = openml.datasets.get_dataset(61, download_data=True)
ds.get_data(target="nonexistent_column")

Expected Results

A ValueError with a clear, actionable message, raised before the drop call, e.g.:

ValueError: Target column 'nonexistent_column' does not exist in this dataset.
Available columns: ['sepal_length', 'sepal_width', 'petal_length', 'class']

Or, when the column was filtered out by a flag:

ValueError: Target column 'id' was removed from the dataset because it is listed
as a row_id or ignore attribute. Available columns after filtering: [...]

Actual Results

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "openml/datasets/dataset.py", line 797, in get_data
    x = data.drop(columns=[target_name])
  File "pandas/core/frame.py", line 5603, in drop
    ...
KeyError: "['nonexistent_column'] not found in axis"
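A sketch of the pre-drop validation, assuming get_data() can track which columns were filtered out by the row_id/ignore flags. check_target and filtered_columns are illustrative names, not existing openml internals:

```python
def check_target(
    target_name: str,
    columns: list[str],
    filtered_columns: tuple = (),
) -> None:
    """Raise a contextual ValueError instead of letting pandas raise a bare KeyError.

    Illustrative sketch of the proposed check; not existing openml code.
    """
    if target_name in columns:
        return
    if target_name in filtered_columns:
        # The column existed, but an OpenML flag removed it before the drop
        raise ValueError(
            f"Target column '{target_name}' was removed from the dataset because "
            f"it is listed as a row_id or ignore attribute. "
            f"Available columns after filtering: {columns}"
        )
    raise ValueError(
        f"Target column '{target_name}' does not exist in this dataset. "
        f"Available columns: {columns}"
    )
```

The drop call itself is unchanged; the check only runs when a target is given, so the common no-target path pays nothing.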

Bug 3 - create_dataset has a misleading type annotation on default_target_attribute

The REST API allows default_target_attribute to be absent for unsupervised datasets. The Python function currently annotates it as str, and an in-code TODO explicitly acknowledges the mismatch:

# TODO(eddiebergman): Function requires `default_target_attribute` exist but API allows None
default_target_attribute: str,

At runtime, _expand_parameter(None) happens to return [], so no exception is raised. However, the str annotation causes mypy and other static analyzers to flag the call, the docstring gives no indication that None is valid, and the TODO confirms the mismatch was never intentional. Users creating unsupervised datasets (e.g. clustering or anomaly detection) have no way of knowing None is safe to pass, and type checkers will report it as an error.

This is the complement of #964, which added validation that a non-None target refers to an existing column. This issue is about the upstream problem: None should be formally supported in the first place.

Steps to Reproduce

import pandas as pd
from openml.datasets import create_dataset
 
df = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
create_dataset(
    name="test", description="test", creator=None,
    contributor=None, collection_date=None, language="English",
    licence=None, attributes="auto", data=df,
    default_target_attribute=None,  # type checker raises error here
    ignore_attribute=None, citation="N/A",
)

Expected Results

default_target_attribute=None is formally accepted: no mypy error, no ambiguity. The annotation becomes str | None, the docstring explicitly documents None as valid for unsupervised datasets, and the TODO is resolved. Passing a non-None invalid column name still raises a ValueError (existing #964 behaviour preserved).

Actual Results

The call succeeds silently at runtime, but the str annotation causes mypy to flag it as a type error, and nothing documents that None is intentionally supported.
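A sketch of the proposed signature change, with a stand-in for _expand_parameter's observed None-to-[] behaviour. expand_target is an illustrative stand-in, not the real helper:

```python
from __future__ import annotations


def expand_target(default_target_attribute: str | None) -> list[str]:
    """Stand-in for _expand_parameter, mirroring the observed runtime behaviour.

    With the annotation widened to `str | None`, passing None type-checks cleanly.
    """
    if default_target_attribute is None:
        return []  # unsupervised dataset: no default target attribute
    # comma-separated targets are split into a list (illustrative assumption)
    return [attr.strip() for attr in default_target_attribute.split(",")]


print(expand_target(None))     # []
print(expand_target("class"))  # ['class']
```

Only the annotation and docstring change; the runtime path for None is already correct, so the fix carries no behavioural risk.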


Scope

All three fixes are contained in two files:

  • openml/datasets/dataset.py - Bugs 1 and 2
  • openml/datasets/functions.py - Bug 3

No changes to the REST API surface, no new public methods, no changes outside the dataset subpackage.


Versions

openml:  0.16.0
Python:  3.10.18 
OS:      Windows-10-10.0.26200-SP0
