
Linear / order-preserving clustering in Python

Python

慕森卡 2021-10-19 16:30:23

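I want to group the numbers in a list according to how "large" they are compared with their neighbours, but I want the clusters to be contiguous wherever possible. To clarify, here is an example. Suppose you have the list

lst = [10, 11.1, 30.4, 30.0, 32.9, 4.5, 7.2]

With 3 groups, it is obvious how the clustering should go, and running the k-means algorithm from sklearn (see the code) confirms it. I run into trouble, however, when the numbers in the list are less "convenient". Suppose you have the following list:

lst = [10, 11.1, 30.4, 30.0, 32.9, 6.2, 31.2, 29.8, 12.3, 10.5]

My problem is now twofold.

First, I want some kind of "order-preserving, linear" clustering that takes the order of the data into account. For the list above, the clustering algorithm should give me output of the form

lst = [0,0,1,1,1,1,1,1,2,2]

Second, if you look at the output above, you will see that I want the value 6.2 to be placed in the second cluster, i.e. I want the clustering algorithm to treat it as an outlier rather than as a whole new cluster.

Edit: to clarify, I want to be able to specify the number of clusters in the linear clustering process, i.e. the "final total" number of clusters.

Code:

import numpy as np
from sklearn.cluster import KMeans

lst = [10, 11.1, 30.4, 30.0, 32.9, 4.5, 7.2]
km = KMeans(3,).fit(np.array(lst).reshape(-1,1))
print(km.labels_)
# [0 0 1 1 1 2 2]: OK output

lst = [10, 11.1, 30.4, 30.0, 32.9, 6.2, 31.2, 29.8, 12.3, 10.5]
km = KMeans(3,).fit(np.array(lst).reshape(-1,1))
print(km.labels_)
# [0 0 1 1 1 2 1 1 0 0]. Desired output: [0 0 1 1 1 1 1 1 2 2]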


3 Answers

慕田峪4524236


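As mentioned, I think a straightforward(ish) way to get the desired result is to use ordinary K-means clustering and then modify the resulting output as needed.

Explanation: the idea is to take the K-means output and iterate over it, keeping track of the previous item's cluster label and the current item's cluster label, and creating a new cluster only when a condition is met. The details are explained in the code comments.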

import numpy as np
from sklearn.cluster import KMeans

lst = [10, 11.1, 30.4, 30.0, 32.9, 4.5, 7.2]
km = KMeans(3,).fit(np.array(lst).reshape(-1,1))
print(km.labels_)
# [0 0 1 1 1 2 2]: OK output

lst = [10, 11.1, 30.4, 30.0, 32.9, 6.2, 31.2, 29.8, 12.3, 10.5]
km = KMeans(3,).fit(np.array(lst).reshape(-1,1))
print(km.labels_)
# [0 0 1 1 1 2 1 1 0 0]. Desired output: [0 0 1 1 1 1 1 1 2 2]

def linear_order_clustering(km_labels, outlier_tolerance = 1):
    '''Expects clustering outputs as an array/list'''
    prev_label = km_labels[0] #keeps track of last seen item's real cluster
    cluster = 0 #like a counter for our new linear clustering outputs
    result = [cluster] #initialize first entry
    for i, label in enumerate(km_labels[1:]):
        if prev_label == label:
            #just written for clarity of control flow,
            #do nothing special here
            pass
        else: #current cluster label did not match previous label
            #check if previous cluster label reappears
            #on the right of current cluster label position
            #(aka current non-matching cluster is sandwiched
            #within a reasonable tolerance)
            if (outlier_tolerance and
                prev_label in km_labels[i + 1: i + 2 + outlier_tolerance]):
                label = prev_label #if so, overwrite current label
            else:
                cluster += 1 #its genuinely a new cluster
        result.append(cluster)
        prev_label = label
    return result

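Note that I have only tested this with a tolerance of 1 outlier, and I cannot guarantee that it works as-is in all cases. It should be enough to get you started, though.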

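Output (k-means numbers its clusters arbitrarily, so the raw labels below differ from the comment above although the grouping is the same):

print(km.labels_)
result = linear_order_clustering(km.labels_)
print(result)

[1 1 0 0 0 2 0 0 1 1]
[0, 0, 1, 1, 1, 1, 1, 1, 2, 2]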
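The post-processing above starts from whatever labels k-means returns, so it does not by itself guarantee the requested final number of clusters. As a separate technique, not the method of this answer, one could skip k-means and search directly for the split of the list into exactly k contiguous runs that minimizes the total within-run squared deviation, using dynamic programming. The sketch below is pure Python; the name `contiguous_clusters` and its parameters are hypothetical, chosen only for illustration.

```python
from itertools import accumulate


def contiguous_clusters(values, k):
    """Split `values` into exactly k contiguous runs, minimizing the total
    within-run sum of squared deviations; return one label per value."""
    n = len(values)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(values)")

    # Prefix sums give the cost of any run values[i:j] in O(1).
    s1 = [0.0] + list(accumulate(values))
    s2 = [0.0] + list(accumulate(v * v for v in values))

    def cost(i, j):
        # Sum of squared deviations from the mean of values[i:j], j > i.
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[m][j]: minimal cost of splitting values[:j] into m runs.
    # cut[m][j]: start index of the last run in that optimal split.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = best[m - 1][i] + cost(i, j)
                if c < best[m][j]:
                    best[m][j], cut[m][j] = c, i

    # Walk the stored cut points backwards to recover the labels.
    labels = [0] * n
    j = n
    for m in range(k, 0, -1):
        i = cut[m][j]
        labels[i:j] = [m - 1] * (j - i)
        j = i
    return labels


lst = [10, 11.1, 30.4, 30.0, 32.9, 6.2, 31.2, 29.8, 12.3, 10.5]
print(contiguous_clusters(lst, 3))
# [0, 0, 1, 1, 1, 1, 1, 1, 2, 2]
```

With this cost function the value 6.2 is absorbed into the middle run because giving it a run of its own would force two other boundaries into worse places. A more extreme outlier, or a larger k, could still end up isolated, so this is a starting point rather than a general outlier treatment. The triple loop is O(k·n²), which is fine for short lists.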

Answered 2021-10-19

达令说


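I would solve this in a couple of passes. First, I would have one function/method that does the analysis to determine the cluster center of each group and returns an array of those centers. Then I would pass those centers, together with the list, into a second function/method that assembles the list of cluster IDs for each number in the list. Finally, I would return that list sorted.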
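The answer does not say how the centers are found, so the following is only one possible reading of the two passes. It assumes a simple 1-D rule for the first pass (sort the values, cut at the k - 1 widest gaps, take each group's mean) and nearest-center assignment for the second. The names `find_centers` and `assign_ids` are hypothetical.

```python
def find_centers(values, k):
    """Pass 1: return k cluster centers, sorted ascending.

    Assumed method: sort the values, cut at the k - 1 widest gaps,
    and take the mean of each resulting group.
    """
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and len(values)")
    ordered = sorted(values)
    # Positions g where the gap ordered[g + 1] - ordered[g] is widest.
    gaps = sorted(range(len(ordered) - 1),
                  key=lambda g: ordered[g + 1] - ordered[g],
                  reverse=True)[:k - 1]
    centers, start = [], 0
    for g in sorted(gaps):
        group = ordered[start:g + 1]
        centers.append(sum(group) / len(group))
        start = g + 1
    group = ordered[start:]
    centers.append(sum(group) / len(group))
    return centers


def assign_ids(values, centers):
    """Pass 2: label each value with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda c: abs(v - centers[c]))
            for v in values]


lst = [10, 11.1, 30.4, 30.0, 32.9, 6.2, 31.2, 29.8, 12.3, 10.5]
centers = find_centers(lst, 3)
print(assign_ids(lst, centers))
# [1, 1, 2, 2, 2, 0, 2, 2, 1, 1]
```

On the second list from the question this yields the same grouping as the k-means labels shown there: 6.2 gets a cluster of its own and the two runs of values near 10 share one ID. Read this way, the two passes group by value only, so an order-aware step would still be needed to obtain the contiguous output the question asks for.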

Answered 2021-10-19
