Trust internal scikit-learn types needed for GB/HGB models#508

Open
cakedev0 wants to merge 2 commits into skops-dev:main from cakedev0:trust_gb_hgb_internals

Conversation

@cakedev0 cakedev0 commented Mar 30, 2026

This PR description was mostly written by AI. I reviewed it before submission.

Reference Issues/PRs

This will make #498 very easy, roughly a four-line change. #498 is not a bug, but it is still a significant point of friction for mlflow users and probably doesn't help them adopt skops.

What does this implement/fix? Explain your changes.

This PR adds support for persisting/loading the internal scikit-learn types needed by GradientBoosting and HistGradientBoosting models without surfacing them as untrusted types.

The implementation introduces a sklearn-specific internal-object path in skops/io/_sklearn.py for a small allowlist of trusted sklearn internals used by GB/HGB models.

CyHalfMultinomialLoss is serialized under the qualified module path sklearn._loss._loss.CyHalfMultinomialLoss rather than _loss.CyHalfMultinomialLoss, so that trust is anchored under the sklearn. prefix. Maybe this is overkill?
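To illustrate the anchoring idea (a hypothetical sketch, not the actual skops trust-checking code; the helper and allowlist names below are made up), qualifying trusted names under the `sklearn.` prefix means a bare module path can never match the allowlist:

```python
# Hypothetical sketch of prefix-anchored trust checking; the set and
# function names are illustrative, not the real skops internals.
TRUSTED_SKLEARN_INTERNALS = {
    "sklearn._loss._loss.CyHalfMultinomialLoss",
    "sklearn._loss.link.IdentityLink",
}

def is_trusted(qualified_name: str) -> bool:
    # Anchoring under "sklearn." ensures a shorter, ambiguous path
    # (e.g. "_loss.CyHalfMultinomialLoss") cannot match the allowlist.
    return (
        qualified_name.startswith("sklearn.")
        and qualified_name in TRUSTED_SKLEARN_INTERNALS
    )

print(is_trusted("sklearn._loss._loss.CyHalfMultinomialLoss"))  # True
print(is_trusted("_loss.CyHalfMultinomialLoss"))                # False
```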

The PR also adds targeted regression tests in skops/io/tests/test_persist.py covering GradientBoosting / HistGradientBoosting variants and checking that CyHalfMultinomialLoss is serialized under the qualified sklearn module path.

AI usage disclosure:

  • The code changes were developed with AI assistance in an interactive back-and-forth session.
  • I asked the assistant to probe which trusted types were needed for GradientBoosting/HistGradientBoosting, explain the relevant serialization/trust code paths, review security tradeoffs, and iterate on the implementation based on my feedback.
  • I reviewed the proposed changes, asked follow-up questions about the design and security properties, and requested specific adjustments such as using a qualified sklearn path for CyHalfMultinomialLoss and tightening/cleaning the sklearn-internal trust path.

@cakedev0 cakedev0 marked this pull request as ready for review March 30, 2026 17:29
@cakedev0 (Author)

I asked Codex to audit the added scikit-learn classes; here is its report (TL;DR: these are all simple, numeric-oriented classes and are safe to trust).


Per-class notes

Stateless numeric wrappers

These classes are thin mathematical wrappers with no meaningful mutable state and no custom deserialization hooks. In practice they only expose NumPy / SciPy computations, so trusting them does not add an obvious code-execution surface.

| Classes | What they do | Safety review |
| --- | --- | --- |
| `sklearn._loss.link.IdentityLink`, `sklearn._loss.link.LogLink`, `sklearn._loss.link.LogitLink`, `sklearn._loss.link.HalfLogitLink`, `sklearn._loss.link.MultinomialLogit` | Link functions used by the sklearn loss objects. | Safe: stateless numeric wrappers only; no `eval`/`exec`, no dynamic import, no custom `__setstate__`, no attacker-controlled callable invocation on load. |

Small passive data containers

These objects mainly carry parameters or bounds. The main caveat is that skops restores them through __new__ plus attribute injection, so constructor-time validation is bypassed.

| Classes | What they do | Safety review |
| --- | --- | --- |
| `sklearn._loss.link.Interval` | Dataclass storing lower/upper bounds and inclusiveness flags for valid value ranges. | Safe from code execution: only scalar fields and range checks; malformed values could violate invariants, but there is no code-exec hook. |
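The caveat about bypassed validation can be illustrated with a small sketch (the dataclass below is a made-up stand-in, not the real `sklearn._loss.link.Interval`): restoring via `__new__` plus attribute injection skips `__init__`/`__post_init__` entirely, so constructor-time checks never run.

```python
from dataclasses import dataclass

@dataclass
class Interval:
    # Illustrative stand-in; field names are hypothetical.
    low: float
    high: float

    def __post_init__(self):
        # Constructor-time validation, only run on normal construction.
        if self.low > self.high:
            raise ValueError("low must be <= high")

# Normal construction enforces the invariant:
try:
    Interval(2.0, 1.0)
except ValueError:
    pass  # raises as expected

# The "__new__ plus attribute injection" restore pattern bypasses it:
obj = Interval.__new__(Interval)
obj.__dict__.update({"low": 2.0, "high": 1.0})
# The invariant is now violated (obj.low > obj.high), but no code ran
# on load -- which is exactly the tradeoff described above.
```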

Python loss wrappers around trusted numeric backends

These classes wrap Cython loss kernels plus link objects and a few scalar flags. They do not define custom deserialization logic; in the skops path they are rebuilt by restoring plain attributes (closs, link, flags, intervals, sometimes quantile / n_classes).

| Classes | What they do | Safety review |
| --- | --- | --- |
| `sklearn._loss.loss.HalfSquaredError`, `sklearn._loss.loss.AbsoluteError`, `sklearn._loss.loss.PinballLoss`, `sklearn._loss.loss.HuberLoss`, `sklearn._loss.loss.HalfPoissonLoss`, `sklearn._loss.loss.HalfGammaLoss`, `sklearn._loss.loss.HalfBinomialLoss`, `sklearn._loss.loss.HalfMultinomialLoss`, `sklearn._loss.loss.ExponentialLoss` | Regression / classification loss objects used by GB / HGB and related estimators. | Safe: attribute-only restore pattern, no custom `__setstate__`, no dynamic execution logic; malformed trusted state could still cause wrong predictions or runtime errors because constructor validation is not re-run. |

Cython backend object

| Classes | What they do | Safety review |
| --- | --- | --- |
| `sklearn._loss._loss.CyHalfMultinomialLoss` | Cython implementation of multiclass loss / gradient / probability kernels. | Safe in the skops path: effectively stateless here (`__getstate__` is `None`), and skops bypasses Cython's pickle helper by reconstructing the instance via `__new__`, so there is no attacker-controlled function call during load. |

HistGradientBoosting internals with explicit state restoration

These are the only reviewed classes with meaningful state-restoration behavior. I checked those methods directly.

| Classes | What they do | Safety review |
| --- | --- | --- |
| `sklearn.ensemble._hist_gradient_boosting.binning._BinMapper` | Stores HGB binning metadata and maps raw features to integer bins. | Safe: deserialization goes through `BaseEstimator.__setstate__`, which only pops `_sklearn_version`, warns on version mismatch, then updates `__dict__`; no dynamic execution. |
| `sklearn.ensemble._hist_gradient_boosting.predictor.TreePredictor` | Stores HGB tree arrays and runs fast prediction / partial dependence kernels. | Safe: custom `__setstate__` only restores state and casts nodes to `PREDICTOR_RECORD_DTYPE` for cross-bitness compatibility; no import/eval/callback behavior. |
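The `TreePredictor`-style pattern can be sketched as follows (a hypothetical illustration; `NODE_DTYPE` and the field names are made up, not sklearn's actual `PREDICTOR_RECORD_DTYPE`): `__setstate__` restores plain attributes and then casts the node array to a canonical dtype, with no dynamic execution involved.

```python
import numpy as np

# Illustrative canonical dtype; not the real PREDICTOR_RECORD_DTYPE.
NODE_DTYPE = np.dtype([("value", np.float64), ("count", np.int64)])

class Predictor:
    def __setstate__(self, state):
        # Restore plain attributes only...
        self.__dict__.update(state)
        # ...then cast for cross-bitness compatibility (e.g. the array was
        # saved on a platform with 32-bit ints). No import/eval/callbacks.
        self.nodes = self.nodes.astype(NODE_DTYPE)

p = Predictor.__new__(Predictor)
saved = np.array([(1.0, 2)], dtype=[("value", "<f8"), ("count", "<i4")])
p.__setstate__({"nodes": saved})
print(p.nodes.dtype == NODE_DTYPE)  # True
```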
