How to optimize label encoding for large datasets (scikit-learn)

1 answer

User

#1 · posted 2024-05-18 22:13:57

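Instead of looping with scikit-learn, you could try pure NumPy; I believe it will be faster.

If the inner lists always have the same number of elements (3 in your case?), you can try the following:

1. Prepare some data: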

import numpy as np

n = 5
xs = np.random.choice(list("qwertyuiopasdfghjklzxcvbnm"),3*n).reshape((-1,3))
xs
array([['z', 'h', 'd'],
       ['g', 'k', 'y'],
       ['t', 'c', 'o'],
       ['f', 'b', 's'],
       ['x', 'n', 'z']],
      dtype='<U1')

2. Encoding

ys = np.unique(xs, return_inverse=True)[1].reshape((-1, 3))
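As a minimal sketch of this step: `np.unique` with `return_inverse=True` returns the sorted unique values together with an integer code for every element, so indexing the unique values with the codes restores the original array. The helper names `encode` and `decode` and the small sample array are illustrative assumptions, not part of the original answer.

```python
import numpy as np

def encode(xs):
    """Return (classes, codes): sorted unique values and integer codes shaped like xs."""
    xs = np.asarray(xs)
    classes, inverse = np.unique(xs, return_inverse=True)
    # Reshape explicitly so the result matches xs regardless of whether
    # the installed NumPy returns the inverse flat or in the input's shape.
    return classes, inverse.reshape(xs.shape)

def decode(classes, codes):
    """Map integer codes back to the original values by fancy indexing."""
    return classes[codes]

xs = np.array([["z", "h", "d"], ["g", "k", "y"], ["x", "n", "z"]])
classes, ys = encode(xs)
restored = decode(classes, ys)
```

Keeping `classes` around is what makes decoding possible later without building a dictionary.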

3. Timing

n = 1000000
xs = np.random.choice(list("qwertyuiopasdfghjklzxcvbnm"),3*n).reshape((-1,3))

%timeit np.unique(xs, return_inverse=True)[1].reshape((-1,3))
849 ms ± 39.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Less than a second...
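`%timeit` is an IPython magic, so it only works in IPython or Jupyter. Outside those environments, a rough equivalent can be sketched with the standard-library `timeit` module. The smaller `n`, the fixed seed, and the function name `encode_once` are assumptions chosen so the sketch runs quickly; the numbers it prints will differ from the measurement above.

```python
import timeit
import numpy as np

n = 100_000  # smaller than the 1,000,000 rows above so the sketch runs quickly
rng = np.random.default_rng(0)  # fixed seed, only for repeatability
xs = rng.choice(list("qwertyuiopasdfghjklzxcvbnm"), 3 * n).reshape((-1, 3))

def encode_once():
    """Encode the whole array in a single np.unique call."""
    return np.unique(xs, return_inverse=True)[1].reshape((-1, 3))

# Best of 3 repeats, one call per repeat, in seconds.
best = min(timeit.repeat(encode_once, number=1, repeat=3))
print(f"best of 3: {best * 1e3:.1f} ms")
```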

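If you can show your complete code, we can compare runtimes.

Edit: converting back and forth with the encoding

Following @JCDJulian's comment (see below), I've added code snippets that use a dictionary to encode/decode at any point in the data processing:

First, you need dic if you want to encode: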

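# .ravel() keeps labels 1-D; NumPy 2.x returns the inverse in the input's shape
labels = np.unique(xs, return_inverse=True)[1].ravel()
dic = dict(zip(xs.flatten(), labels))

The encoding itself is:

ys = np.reshape([dic[v] for row in xs for v in row], (-1, 3))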
ys
array([[13,  5,  2],
       [ 4,  6, 12],
       [10,  1,  8],
       [ 3,  0,  9],
       [11,  7, 13]])

To decode, you need reverse_dic:

reverse_dic = dict(zip(labels.ravel(), xs.flatten()))
np.reshape([reverse_dic[v] for row in ys for v in row], (-1, 3))
array([['z', 'h', 'd'],
       ['g', 'k', 'y'],
       ['t', 'c', 'o'],
       ['f', 'b', 's'],
       ['x', 'n', 'z']],
      dtype='<U1')
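If the per-element dictionary lookup becomes a bottleneck, one possible alternative (not the method used in this answer) is to store the sorted unique values and look up new data with `np.searchsorted`. This is only a sketch: the names `fit_classes` and `encode_with` are hypothetical, and it assumes every value to encode was present when the classes were collected, raising an error otherwise.

```python
import numpy as np

def fit_classes(xs):
    """Collect the sorted unique values; np.unique always returns them sorted."""
    return np.unique(xs)

def encode_with(classes, xs):
    """Encode xs against previously collected classes using binary search."""
    xs = np.asarray(xs)
    codes = np.searchsorted(classes, xs)  # same shape as xs
    # searchsorted returns an insertion point even for unseen values,
    # so verify that each code really points at the value it encodes.
    safe = np.minimum(codes, len(classes) - 1)
    if not np.array_equal(classes[safe], xs):
        raise ValueError("input contains values not present in classes")
    return codes

train = np.array([["b", "a"], ["c", "a"]])
classes = fit_classes(train)
new_codes = encode_with(classes, np.array([["c", "b"]]))
decoded = classes[new_codes]  # decoding is plain indexing
```

Because `classes` is sorted, the codes agree with the ones `np.unique(..., return_inverse=True)` would assign to the same data.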

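Edit 2: arrays of arbitrary shape

For completeness, here is a solution for arrays of arbitrary shape.

Encoding:

# .ravel() keeps labels 1-D; NumPy 2.x returns the inverse in the input's shape
labels = np.unique(xs, return_inverse=True)[1].ravel()
dic = dict(zip(xs.flatten(), labels))
np.vectorize(dic.get)(xs)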
array([[13,  5,  2],
       [ 4,  6, 12],
       [10,  1,  8],
       [ 3,  0,  9],
       [11,  7, 13]])

Decoding:

reverse_dic = dict(zip(labels.ravel(), xs.flatten()))
np.vectorize(reverse_dic.get)(ys)
array([['z', 'h', 'd'],
       ['g', 'k', 'y'],
       ['t', 'c', 'o'],
       ['f', 'b', 's'],
       ['x', 'n', 'z']],
      dtype='<U1')

Note that the array's shape doesn't appear anywhere in the code!
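To illustrate the shape independence, here is a small sketch that applies the same dictionary approach to a 3-D array. The sample data and variable names are assumptions made for the example. Keep in mind that `np.vectorize` is a convenience wrapper around a Python-level loop, so it is not expected to be as fast as the `np.unique` call on its own.

```python
import numpy as np

# A 3-D array of labels; any other shape would work the same way.
xs = np.array(list("abcabcabcabc")).reshape((2, 3, 2))

# Build the lookup tables from flattened views, using plain Python types as keys.
labels = np.unique(xs, return_inverse=True)[1].ravel()
dic = dict(zip(xs.ravel().tolist(), labels.tolist()))
reverse_dic = {code: value for value, code in dic.items()}

ys = np.vectorize(dic.get)(xs)            # encode, shape preserved
back = np.vectorize(reverse_dic.get)(ys)  # decode, shape preserved
```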
