Unsupervised Learning on the Iris Dataset

Code:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, adjusted_rand_score

csv_path = '/Users/bakako/Downloads/archive/Iris.csv'
df = pd.read_csv(csv_path)

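# Convert the DataFrame to a NumPy array
dataset = df.values

# Drop the index column
dataset = np.delete(dataset, 0, axis=1)
# Drop the label column, keeping the four feature columns as floats
features = np.delete(dataset, -1, axis=1).astype(float)
# Drop the four feature columns, keeping the label column with shape (n, 1)
labels = np.delete(dataset, [0, 1, 2, 3], axis=1)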

# Split into training and test sets
train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.2
)

X = train_features

model = KMeans(n_clusters=3)
model.fit(X)

label_predict = model.labels_

x0 = X[label_predict == 0]
x1 = X[label_predict == 1]
x2 = X[label_predict == 2]

fig, axes = plt.subplots(1, 4, figsize=(32, 6))
# Raw data plotted by sepal length and width
axes[0].scatter(X[:, 0], X[:, 1], c='red', marker='o', label='see')
axes[0].set_xlabel('sepal length')
axes[0].set_ylabel('sepal width')
axes[0].legend(loc=2)

# Raw data plotted by petal length and width
axes[1].scatter(X[:, 2], X[:, 3], c='green', marker='o', label='see')
axes[1].set_xlabel('petal length')
axes[1].set_ylabel('petal width')
axes[1].legend(loc=2)

# Cluster assignments plotted on the sepal dimensions
axes[2].scatter(x0[:, 0], x0[:, 1], c='red', marker='o', label='label0')
axes[2].scatter(x1[:, 0], x1[:, 1], c='green', marker='*', label='label1')
axes[2].scatter(x2[:, 0], x2[:, 1], c='blue', marker='+', label='label2')
axes[2].set_xlabel('sepal length')
axes[2].set_ylabel('sepal width')
axes[2].legend(loc=2)

# Cluster assignments plotted on the petal dimensions
axes[3].scatter(x0[:, 2], x0[:, 3], c='red', marker='o', label='label0')
axes[3].scatter(x1[:, 2], x1[:, 3], c='green', marker='*', label='label1')
axes[3].scatter(x2[:, 2], x2[:, 3], c='blue', marker='+', label='label2')
axes[3].set_xlabel('petal length')
axes[3].set_ylabel('petal width')
axes[3].legend(loc=2)

# plt.show()

prediction = model.predict(test_features)

test_labels_num = []

for item in test_labels:
    if item == 'Iris-setosa':
        test_labels_num.append(0)
    elif item == 'Iris-versicolor':
        test_labels_num.append(1)
    elif item == 'Iris-virginica':
        test_labels_num.append(2)

print(accuracy_score(test_labels_num, prediction))
print(adjusted_rand_score(test_labels_num, prediction))

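The article 机器学习（二）之无监督学习：数据变换、聚类分析 (Machine Learning, Part 2: Unsupervised Learning, Data Transformation and Cluster Analysis) makes this point:

When evaluating clustering this way, a common mistake is to use accuracy_score instead of adjusted_rand_score, normalized_mutual_info_score, or another clustering metric. The problem with accuracy is that it requires the assigned cluster labels to match the ground truth exactly. But the cluster labels themselves are meaningless; the only thing that matters is which points end up in the same cluster.

Put simply, the three clusters x0, x1, and x2 can correspond to the three species in any of six orders: 012, 021, 102, 120, 201, 210. You therefore cannot know whether cluster 0 is Iris-setosa, Iris-virginica, or Iris-versicolor in the true labels. If each assignment is equally likely, there is a 5/6 chance that the cluster ids do not line up with the manual encoding.

For example: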
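The six possible assignments can be enumerated directly. The following is a minimal sketch on made-up toy data (the arrays `y_true` and `y_cluster` and the `scores` dictionary are illustrative names, not part of the original script): it relabels a perfect clustering under every permutation and shows that only one of the six yields a perfect accuracy_score.

```python
from itertools import permutations

import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical toy ground truth: 0 = setosa, 1 = versicolor, 2 = virginica.
y_true = np.array([0, 0, 1, 1, 2, 2])
# A perfect grouping whose cluster ids happen to be numbered differently.
y_cluster = np.array([2, 2, 0, 0, 1, 1])

scores = {}
for perm in permutations(range(3)):
    # perm[k] is the class id we pretend cluster k stands for.
    relabeled = np.array(perm)[y_cluster]
    scores[perm] = accuracy_score(y_true, relabeled)

for perm, score in scores.items():
    print(perm, score)
```

Only the permutation that happens to match the clustering's own numbering reaches 1.0; the other five score lower even though the grouping itself never changes.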

print(prediction)           
# [1 1 2 2 1 0 2 1 2 0 0 1 0 2 1 1 1 1 0 2 0 0 2 2 0 2 2 0 1 1]
print(test_labels_num)      
# [0 0 1 1 0 2 1 0 1 2 2 0 2 1 0 0 0 0 2 1 2 2 1 1 2 1 1 2 0 0]

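Here the clustering maps 0 to Iris-virginica, 1 to Iris-setosa, and 2 to Iris-versicolor.

That does not match the manual encoding, where 0 is Iris-setosa, 1 is Iris-versicolor, and 2 is Iris-virginica.

Using accuracy_score here is therefore misleading: in this example every predicted id differs from the manual code, so the accuracy is 0 even though the grouping is perfect. Use adjusted_rand_score(labels_true, labels_pred) instead; the closer its output is to 1, the better the clustering agrees with the true labels.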
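The mapping read off by eye above can also be recovered from a confusion matrix. This is a sketch, not part of the original script: `prediction` and `test_labels_num` are copied from the printed output, while `species`, `cluster_to_class`, and `aligned` are names introduced here for illustration. It assigns each cluster to its majority class, which is only safe when the clusters are reasonably clean; with a poor clustering, two clusters could map to the same class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Values copied from the printed example output.
prediction = np.array([1, 1, 2, 2, 1, 0, 2, 1, 2, 0, 0, 1, 0, 2, 1,
                       1, 1, 1, 0, 2, 0, 0, 2, 2, 0, 2, 2, 0, 1, 1])
test_labels_num = np.array([0, 0, 1, 1, 0, 2, 1, 0, 1, 2, 2, 0, 2, 1, 0,
                            0, 0, 0, 2, 1, 2, 2, 1, 1, 2, 1, 1, 2, 0, 0])
species = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']

# Rows are true classes, columns are cluster ids.
cm = confusion_matrix(test_labels_num, prediction)

# For each cluster (column), pick the true class that occurs most often.
cluster_to_class = cm.argmax(axis=0)
for cluster_id, class_id in enumerate(cluster_to_class):
    print(cluster_id, '->', species[class_id])

# Translate cluster ids into class ids before comparing with the truth.
aligned = cluster_to_class[prediction]
print((aligned == test_labels_num).mean())
```

After this alignment step an accuracy-style comparison becomes meaningful, at the cost of using the true labels to choose the mapping.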
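To make the difference between the metrics concrete, here is a small sketch on synthetic labels (all variable names are illustrative). It builds a perfect clustering with the same renumbering as in the example (true 0 becomes cluster 1, true 1 becomes cluster 2, true 2 becomes cluster 0) and compares accuracy_score with the two clustering metrics mentioned in the quote.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    adjusted_rand_score,
    normalized_mutual_info_score,
)

# Ten samples per class, encoded 0, 1, 2.
y_true = np.repeat([0, 1, 2], 10)
# Same grouping, but with cluster ids renumbered 0 -> 1, 1 -> 2, 2 -> 0.
relabel = np.array([1, 2, 0])
y_pred = relabel[y_true]

acc = accuracy_score(y_true, y_pred)
ari = adjusted_rand_score(y_true, y_pred)
nmi = normalized_mutual_info_score(y_true, y_pred)
print(acc, ari, nmi)

# Move one sample into the wrong cluster: the clustering scores drop below 1.
y_noisy = y_pred.copy()
y_noisy[0] = 2
ari_noisy = adjusted_rand_score(y_true, y_noisy)
print(ari_noisy)
```

Accuracy reports a complete failure for the renumbered labels, while both clustering metrics report perfect agreement and only decrease when the grouping itself changes.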

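In addition, the clustering chapter of the Scikit-Learn Chinese documentation (【Scikit-Learn 中文文档】聚类 - 无监督学习) notes:

adjusted_rand_score is symmetric: swapping the arguments does not change the score. It can therefore be used as a consensus measure.

Symmetry concerns the order of the arguments. The property that matters here is a related one: the score depends only on which samples are grouped together, not on the label values themselves. So I can renumber the manual 0/1/2 labels however I like, or simply pass the strings Iris-setosa, Iris-versicolor, and Iris-virginica directly, without changing the result.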

# adjusted_rand_score expects 1-D arrays, so flatten test_labels from (30, 1) to (30,).
# prediction already has shape (30,).
test_labels = test_labels.flatten()

print(adjusted_rand_score(test_labels, prediction))
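Both properties can be checked on a small hand-made example. The sketch below uses invented data and names (`true_names`, `true_codes`, `true_arbitrary`, `clusters`); it encodes the same ground truth three ways and also swaps the argument order, and in each case adjusted_rand_score should return the same value.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# The same ground truth under three different encodings (hypothetical data).
true_names = np.array(['Iris-setosa', 'Iris-setosa',
                       'Iris-versicolor', 'Iris-versicolor',
                       'Iris-virginica', 'Iris-virginica'])
true_codes = np.array([0, 0, 1, 1, 2, 2])
true_arbitrary = np.array([7, 7, 3, 3, 5, 5])

# An imperfect clustering: one versicolor sample is grouped with virginica.
clusters = np.array([1, 1, 2, 0, 0, 0])

ari_names = adjusted_rand_score(true_names, clusters)
ari_codes = adjusted_rand_score(true_codes, clusters)
ari_arbitrary = adjusted_rand_score(true_arbitrary, clusters)
# Symmetry: swapping the two arguments gives the same score.
ari_swapped = adjusted_rand_score(clusters, true_codes)

print(ari_names, ari_codes, ari_arbitrary, ari_swapped)
```

This is why the manual conversion to 0/1/2 in the script is unnecessary for adjusted_rand_score: the species strings can be passed as they are.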

https://wonderhoi.com/2023/12/22/鸢尾花的非监督学习/

Author

wonderhoi

Published on

December 22, 2023
