sklearn Data Preprocessing

from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer, Binarizer

# Min-max scaling linearly rescales each feature (column) to the [0, 1] range.
scaler = MinMaxScaler(feature_range=(0,1))
features = scaler.fit_transform(features)
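
As a minimal sketch of what min-max scaling computes, the example below compares `MinMaxScaler` with the per-column formula `(x - min) / (max - min)`. The toy `features` array is an assumption made up for illustration, not data from the original post.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: 3 samples, 2 features.
features = np.array([[1.0, 10.0],
                     [2.0, 20.0],
                     [3.0, 40.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(features)

# The same result computed by hand, column by column.
col_min = features.min(axis=0)
col_max = features.max(axis=0)
manual = (features - col_min) / (col_max - col_min)

print(scaled)
print(np.allclose(scaled, manual))
```

Because the minimum and maximum are taken per column, a single outlier in a feature compresses all other values of that feature toward one end of the range.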

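# Standardization centers each feature to mean 0 and scales it to variance 1. It does not change the shape of the distribution: the result is normal (Gaussian) only if the input already was.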
scaler = StandardScaler()
features = scaler.fit_transform(features)
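
The following sketch checks both properties on deliberately skewed data: after `StandardScaler`, each column has mean 0 and standard deviation 1, yet the data stays skewed. The exponential sample and the seed are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical skewed data: 1000 samples drawn from an exponential distribution.
rng = np.random.default_rng(0)
features = rng.exponential(scale=2.0, size=(1000, 2))

scaled = StandardScaler().fit_transform(features)

# Per-column mean and (population) standard deviation after scaling.
col_mean = scaled.mean(axis=0)
col_std = scaled.std(axis=0)

# The median stays below the mean, so the data is still right-skewed.
col_median = np.median(scaled, axis=0)

print(col_mean, col_std, col_median)
```

`StandardScaler` divides by the population standard deviation (`ddof=0`), which is why `scaled.std(axis=0)` with NumPy's default matches 1.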

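# Normalization rescales each sample (row) to unit norm (L2 by default) in the vector space model; it is common in classification and clustering. The goal is to give samples a common scale when similarity is computed with a dot product or another kernel.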
scaler = Normalizer()
features = scaler.fit_transform(features)
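
Unlike the two scalers above, `Normalizer` works row by row. The sketch below, using a made-up `features` array, shows that every row ends up with L2 norm 1, so rows pointing in the same direction become identical.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical toy data: rows 0 and 2 point in the same direction.
features = np.array([[3.0, 4.0],
                     [1.0, 0.0],
                     [6.0, 8.0]])

normalized = Normalizer(norm="l2").fit_transform(features)

# Every row now has Euclidean length 1.
row_norms = np.linalg.norm(normalized, axis=1)

# For unit-length rows, the dot product equals the cosine similarity.
cosine_0_2 = normalized[0] @ normalized[2]

print(normalized)
print(row_norms, cosine_0_2)
```

`Normalizer` is stateless: `fit` learns nothing from the data, so `transform` gives the same result for a sample regardless of which other samples are present.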

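# Feature binarization converts feature values to 0 or 1.
# For example, in a house-price prediction problem, the feature "is in a school district" is 1 if the house is in a school district and 0 otherwise.
# In sklearn you set a threshold: values greater than the threshold become 1, and values less than or equal to it become 0.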
scaler = Binarizer(threshold=3)
features = scaler.fit_transform(features)
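
A small sketch of the threshold rule, on a made-up array: note that a value exactly equal to the threshold maps to 0, because the comparison is strictly greater-than.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical toy data; 3.0 sits exactly on the threshold.
features = np.array([[1.0, 3.0, 5.0],
                     [4.0, 2.0, 3.5]])

binary = Binarizer(threshold=3).fit_transform(features)

print(binary)
```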

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, labels, train_size=0.8, random_state=0)

'''
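Parameters:
- arrays: the arrays to split, such as the feature matrix and the labels (all with the same number of samples)
- test_size: float - fraction of the samples to use for the test set; int - absolute number of test samples
- train_size: same as test_size, but for the training set
- random_state: int - random seed (with a fixed seed, the split is reproducible)
- shuffle: whether to shuffle the data before splitting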
'''
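
The snippets above call `fit_transform` on the whole dataset before splitting. A commonly recommended alternative, sketched here as an assumption rather than as the original post's workflow, is to split first and fit the scaler on the training set only, so that test-set statistics do not leak into preprocessing. The `features` and `labels` arrays are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 10 samples, 2 features, binary labels.
features = np.arange(20, dtype=float).reshape(10, 2)
labels = np.array([0, 1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, train_size=0.8, random_state=0, shuffle=True
)

# Fit on the training data only, then reuse the same statistics for the test data.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train.shape, X_test.shape)
```

With `train_size=0.8` and 10 samples, 8 rows go to the training set and the remaining 2 to the test set.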


https://wonderhoi.com/2024/09/24/sklearn-数据预处理/

Author

wonderhoi

Published on

September 24, 2024
