Description
PySDK Version
PySDK 3.6.0
Describe the bug
load_feature_definitions_from_dataframe() in sagemaker.mlops.feature_store only recognizes numpy dtypes (float64, int64, etc.) but not pandas nullable dtypes (Float64, Int64, string). When a DataFrame uses nullable dtypes (common after calling pd.DataFrame.convert_dtypes()), all numeric columns are incorrectly mapped to StringFeatureDefinition.
To reproduce
import pandas as pd
from sagemaker.mlops.feature_store import load_feature_definitions_from_dataframe

# Create a DataFrame with numpy dtypes (works correctly)
df_numpy = pd.DataFrame({
    "id": [1, 2, 3],
    "price": [1.1, 2.2, 3.3],
    "name": ["a", "b", "c"],
})
print("numpy dtypes:", {c: str(df_numpy[c].dtype) for c in df_numpy.columns})
# {'id': 'int64', 'price': 'float64', 'name': 'object'}
defs = load_feature_definitions_from_dataframe(df_numpy)
for d in defs:
    print(f" {d.feature_name}: {d.feature_type}")

# Now convert to pandas nullable dtypes (common pattern)
df_nullable = df_numpy.convert_dtypes()
print("\nnullable dtypes:", {c: str(df_nullable[c].dtype) for c in df_nullable.columns})
# {'id': 'Int64', 'price': 'Float64', 'name': 'string'}
defs = load_feature_definitions_from_dataframe(df_nullable)
for d in defs:
    print(f" {d.feature_name}: {d.feature_type}")
Root cause
In sagemaker/mlops/feature_store/feature_utils.py, _INTEGER_TYPES and _FLOAT_TYPES only contain lowercase numpy dtype names:
_INTEGER_TYPES = {'int8', 'int16', 'int32', 'int64', 'int', 'uint8', 'uint16', 'uint32', 'uint64'}
_FLOAT_TYPES = {'float16', 'float32', 'float64', 'float'}
Pandas nullable dtype names are capitalized (Int64, Float64, etc.), so they match neither set and the columns fall through to the string type.
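The mismatch can be checked without the SDK. The sketch below uses local copies of the two sets quoted above and assumes, as the root cause describes, that the lookup is a plain membership test on str(dtype). The helper name in_numeric_sets is hypothetical and not part of the SDK.

```python
import pandas as pd

# Local copies of the sets quoted in the issue (not imported from the SDK).
INTEGER_TYPES = {"int8", "int16", "int32", "int64", "int",
                 "uint8", "uint16", "uint32", "uint64"}
FLOAT_TYPES = {"float16", "float32", "float64", "float"}


def in_numeric_sets(dtype) -> bool:
    """Hypothetical helper: True if the dtype's name is in either set."""
    name = str(dtype)
    return name in INTEGER_TYPES or name in FLOAT_TYPES


df = pd.DataFrame({"id": [1, 2, 3], "price": [1.1, 2.2, 3.3]})
df_nullable = df.convert_dtypes()

# numpy-backed columns are found; nullable columns are not.
numpy_hits = {c: in_numeric_sets(df[c].dtype) for c in df.columns}
nullable_hits = {c: in_numeric_sets(df_nullable[c].dtype) for c in df_nullable.columns}

print(numpy_hits)
print(nullable_hits)
```

Both numeric columns are found for the numpy-backed frame and missed for the nullable one, which is consistent with them ending up as StringFeatureDefinition.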
Suggested fix
Add nullable dtype names to the type sets:
_INTEGER_TYPES = {'int8', 'int16', 'int32', 'int64', 'int',
                  'Int8', 'Int16', 'Int32', 'Int64',
                  'uint8', 'uint16', 'uint32', 'uint64',
                  'UInt8', 'UInt16', 'UInt32', 'UInt64'}
_FLOAT_TYPES = {'float16', 'float32', 'float64', 'float',
                'Float16', 'Float32', 'Float64'}
Or use case-insensitive comparison in _generate_feature_definition().
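As a rough sketch of the case-insensitive option, and of a third option that avoids name matching altogether, the two functions below classify a dtype either by its lowercased name or with the public helpers in pandas.api.types. The function names and the returned labels ("Integral", "Fractional", "String") are illustrative only; this is not the SDK's _generate_feature_definition(). Note that pandas only ships Float32 and Float64 as nullable float dtypes, so a 'Float16' entry in the set would be harmless but never matched.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

# Local copies of the lowercase sets quoted in the issue.
INTEGER_TYPES = {"int8", "int16", "int32", "int64", "int",
                 "uint8", "uint16", "uint32", "uint64"}
FLOAT_TYPES = {"float16", "float32", "float64", "float"}


def classify_by_name(dtype) -> str:
    """Option 1: compare the lowercased dtype name against the sets."""
    name = str(dtype).lower()
    if name in INTEGER_TYPES:
        return "Integral"
    if name in FLOAT_TYPES:
        return "Fractional"
    return "String"


def classify_by_api(dtype) -> str:
    """Option 2: let pandas decide; covers numpy and nullable dtypes."""
    if is_integer_dtype(dtype):
        return "Integral"
    if is_float_dtype(dtype):
        return "Fractional"
    return "String"


df = pd.DataFrame({"id": [1, 2, 3], "price": [1.1, 2.2, 3.3], "name": ["a", "b", "c"]})

for frame in (df, df.convert_dtypes()):
    print({c: classify_by_api(frame[c].dtype) for c in frame.columns})
```

The pandas.api.types variant does not depend on how a dtype happens to be spelled, so it would also cover dtypes whose names are not in either set.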
Expected behavior
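Pandas nullable dtypes should be mapped the same way as their numpy counterparts: Int64 and Float64 columns should produce numeric feature definitions, not StringFeatureDefinition.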
System information
A description of your system. Please provide:
- SageMaker Python SDK version: 3.6.0
I think this was fixed or addressed before, but maybe that 2.x code didn't carry over to 3.x:
https://github.com/aws/sagemaker-python-sdk/pull/3740/changes