Trust internal scikit-learn types needed for GB/HGB models#508

Open
cakedev0 wants to merge 2 commits into skops-dev:main from cakedev0:trust_gb_hgb_internals

Conversation

@cakedev0 cakedev0 commented Mar 30, 2026

This PR description was mostly written by AI. I reviewed it before submission.

Reference Issues/PRs

This will make #498 very easy, roughly a four-line change. #498 is not a bug, but it is still a significant point of friction for mlflow users and probably doesn't help them adopt skops.

What does this implement/fix? Explain your changes.

This PR adds support for persisting/loading the internal scikit-learn types needed by GradientBoosting and HistGradientBoosting models without surfacing them as untrusted types.

The implementation introduces a sklearn-specific internal-object path in skops/io/_sklearn.py for a small allowlist of trusted sklearn internals used by GB/HGB models.

CyHalfMultinomialLoss is serialized under the qualified module path sklearn._loss._loss.CyHalfMultinomialLoss rather than _loss.CyHalfMultinomialLoss, so that trust is anchored under the sklearn. prefix. Maybe this is overkill?
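To illustrate the anchoring idea (a hypothetical sketch, not the actual skops trust-checking code; the helper and allowlist names below are made up), qualifying trusted names under the `sklearn.` prefix means a bare module path can never match the allowlist:

```python
# Hypothetical sketch of prefix-anchored trust checking; the set and
# function names are illustrative, not the real skops internals.
TRUSTED_SKLEARN_INTERNALS = {
    "sklearn._loss._loss.CyHalfMultinomialLoss",
    "sklearn._loss.link.IdentityLink",
}

def is_trusted(qualified_name: str) -> bool:
    # Anchoring under "sklearn." ensures a shorter, ambiguous path
    # (e.g. "_loss.CyHalfMultinomialLoss") cannot match the allowlist.
    return (
        qualified_name.startswith("sklearn.")
        and qualified_name in TRUSTED_SKLEARN_INTERNALS
    )

print(is_trusted("sklearn._loss._loss.CyHalfMultinomialLoss"))  # True
print(is_trusted("_loss.CyHalfMultinomialLoss"))                # False
```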

The PR also adds targeted regression tests in skops/io/tests/test_persist.py covering GradientBoosting / HistGradientBoosting variants and checking that CyHalfMultinomialLoss is serialized under the qualified sklearn module path.

AI usage disclosure:

  • The code changes were developed with AI assistance in an interactive back-and-forth session.
  • I asked the assistant to probe which trusted types were needed for GradientBoosting/HistGradientBoosting, explain the relevant serialization/trust code paths, review security tradeoffs, and iterate on the implementation based on my feedback.
  • I reviewed the proposed changes, asked follow-up questions about the design and security properties, and requested specific adjustments such as using a qualified sklearn path for CyHalfMultinomialLoss and tightening/cleaning the sklearn-internal trust path.

@cakedev0 cakedev0 marked this pull request as ready for review March 30, 2026 17:29
@cakedev0 (Author)

I asked Codex to audit the added scikit-learn classes; here is its report (TL;DR: these are all simple, numeric-oriented classes and are safe to trust).


Per-class notes

Stateless numeric wrappers

These classes are thin mathematical wrappers with no meaningful mutable state and no custom deserialization hooks. In practice they only expose NumPy / SciPy computations, so trusting them does not add an obvious code-execution surface.

| Classes | What they do | Safety review |
| --- | --- | --- |
| `sklearn._loss.link.IdentityLink`, `sklearn._loss.link.LogLink`, `sklearn._loss.link.LogitLink`, `sklearn._loss.link.HalfLogitLink`, `sklearn._loss.link.MultinomialLogit` | Link functions used by the sklearn loss objects. | Safe: stateless numeric wrappers only; no `eval`/`exec`, no dynamic import, no custom `__setstate__`, no attacker-controlled callable invocation on load. |

Small passive data containers

These objects mainly carry parameters or bounds. The main caveat is that skops restores them through __new__ plus attribute injection, so constructor-time validation is bypassed.

| Classes | What they do | Safety review |
| --- | --- | --- |
| `sklearn._loss.link.Interval` | Dataclass storing lower/upper bounds and inclusiveness flags for valid value ranges. | Safe from code execution: only scalar fields and range checks; malformed values could violate invariants, but there is no code-exec hook. |
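The caveat about bypassed validation can be illustrated with a small sketch (the dataclass below is a made-up stand-in, not the real `sklearn._loss.link.Interval`): restoring via `__new__` plus attribute injection skips `__init__`/`__post_init__` entirely, so constructor-time checks never run.

```python
from dataclasses import dataclass

@dataclass
class Interval:
    # Illustrative stand-in; field names are hypothetical.
    low: float
    high: float

    def __post_init__(self):
        # Constructor-time validation, only run on normal construction.
        if self.low > self.high:
            raise ValueError("low must be <= high")

# Normal construction enforces the invariant:
try:
    Interval(2.0, 1.0)
except ValueError:
    pass  # raises as expected

# The "__new__ plus attribute injection" restore pattern bypasses it:
obj = Interval.__new__(Interval)
obj.__dict__.update({"low": 2.0, "high": 1.0})
# The invariant is now violated (obj.low > obj.high), but no code ran
# on load -- which is exactly the tradeoff described above.
```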

Python loss wrappers around trusted numeric backends

These classes wrap Cython loss kernels plus link objects and a few scalar flags. They do not define custom deserialization logic; in the skops path they are rebuilt by restoring plain attributes (closs, link, flags, intervals, sometimes quantile / n_classes).

| Classes | What they do | Safety review |
| --- | --- | --- |
| `sklearn._loss.loss.HalfSquaredError`, `sklearn._loss.loss.AbsoluteError`, `sklearn._loss.loss.PinballLoss`, `sklearn._loss.loss.HuberLoss`, `sklearn._loss.loss.HalfPoissonLoss`, `sklearn._loss.loss.HalfGammaLoss`, `sklearn._loss.loss.HalfBinomialLoss`, `sklearn._loss.loss.HalfMultinomialLoss`, `sklearn._loss.loss.ExponentialLoss` | Regression / classification loss objects used by GB / HGB and related estimators. | Safe: attribute-only restore pattern, no custom `__setstate__`, no dynamic execution logic; malformed trusted state could still cause wrong predictions or runtime errors because constructor validation is not re-run. |

Cython backend object

| Classes | What they do | Safety review |
| --- | --- | --- |
| `sklearn._loss._loss.CyHalfMultinomialLoss` | Cython implementation of multiclass loss / gradient / probability kernels. | Safe in the skops path: effectively stateless here (`__getstate__` is `None`), and skops bypasses Cython's pickle helper by reconstructing the instance via `__new__`, so there is no attacker-controlled function call during load. |

HistGradientBoosting internals with explicit state restoration

These are the only reviewed classes with meaningful state-restoration behavior. I checked those methods directly.

| Classes | What they do | Safety review |
| --- | --- | --- |
| `sklearn.ensemble._hist_gradient_boosting.binning._BinMapper` | Stores HGB binning metadata and maps raw features to integer bins. | Safe: deserialization goes through `BaseEstimator.__setstate__`, which only pops `_sklearn_version`, warns on version mismatch, then updates `__dict__`; no dynamic execution. |
| `sklearn.ensemble._hist_gradient_boosting.predictor.TreePredictor` | Stores HGB tree arrays and runs fast prediction / partial dependence kernels. | Safe: custom `__setstate__` only restores state and casts nodes to `PREDICTOR_RECORD_DTYPE` for cross-bitness compatibility; no import/eval/callback behavior. |
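The `TreePredictor`-style pattern can be sketched as follows (a hypothetical illustration; `NODE_DTYPE` and the field names are made up, not sklearn's actual `PREDICTOR_RECORD_DTYPE`): `__setstate__` restores plain attributes and then casts the node array to a canonical dtype, with no dynamic execution involved.

```python
import numpy as np

# Illustrative canonical dtype; not the real PREDICTOR_RECORD_DTYPE.
NODE_DTYPE = np.dtype([("value", np.float64), ("count", np.int64)])

class Predictor:
    def __setstate__(self, state):
        # Restore plain attributes only...
        self.__dict__.update(state)
        # ...then cast for cross-bitness compatibility (e.g. the array was
        # saved on a platform with 32-bit ints). No import/eval/callbacks.
        self.nodes = self.nodes.astype(NODE_DTYPE)

p = Predictor.__new__(Predictor)
saved = np.array([(1.0, 2)], dtype=[("value", "<f8"), ("count", "<i4")])
p.__setstate__({"nodes": saved})
print(p.nodes.dtype == NODE_DTYPE)  # True
```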
