Generating unique values based on rows in a numpy array

def getCellId(self, valueSet):
    # Turn the set of values (a numpy vector) to a tuple so it can be hashed
    key = tuple(valueSet)
    # Try and simply return an existing ID for this key
    try:
        return self.attributeDict[key]
    except KeyError:
        # If the key was new (and didn't exist), try and generate a new Id by
        # adding one to the max of all current Id's. This will fail the very
        # first time we do this (as there will be no Id's yet), so in that
        # case, just assign the value '1' to the newId
        try:
            newId = max(self.attributeDict.values()) + 1
        except ValueError:
            newId = 1
        self.attributeDict[key] = newId
        return newId
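A minimal sketch of the same lookup-or-assign logic, assuming IDs are never removed from the dictionary (so the next ID is always the current size plus one, which avoids scanning every value with `max`). The `CellIndexer` class is a hypothetical container; the question only shows the method.

```python
import numpy as np


class CellIndexer:
    """Hypothetical holder for attributeDict; not part of the original question."""

    def __init__(self):
        self.attributeDict = {}

    def getCellId(self, valueSet):
        # Tuples are hashable, numpy arrays are not
        key = tuple(valueSet)
        # setdefault returns the stored ID if the key exists, otherwise it
        # stores and returns the next ID. Assumes entries are never deleted.
        return self.attributeDict.setdefault(key, len(self.attributeDict) + 1)


indexer = CellIndexer()
first = indexer.getCellId(np.array([1, 2, 3]))
second = indexer.getCellId(np.array([4, 5, 6]))
repeat = indexer.getCellId(np.array([1, 2, 3]))
print(first, second, repeat)
```

The first ID is 1, as in the original, and a repeated vector gets its earlier ID back.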

3 Answers

User

Answer 1 · edited 2024-06-26 14:58:47

If you only need hashing, try this:

import numpy as np

# create random data
a = np.random.randint(10, size=(5, 3, 3))

# create some identical 0-axis data
a[:,0,0] = np.arange(5)
a[:,0,1] = np.arange(5)

# create matrix with the hash values
h = np.apply_along_axis(lambda x: hash(tuple(x)),0,a)

h[0,0]==h[0,1]
# Output: True

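However, use this with caution and test it against your own code first. All I can say is that it works for this simple example.

Also, two different values can end up with the same hash value. This can always happen with a hash function, but such collisions are unlikely.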
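To make the collision caveat concrete, here is a small sketch. It relies on a CPython implementation detail (an assumption about the interpreter, not something stated in the answer): `hash(-1)` and `hash(-2)` are both `-2`, so two tuples differing only in those elements hash identically. Using the tuple itself as a dictionary key sidesteps the problem, because a dictionary compares keys for equality after hashing.

```python
import numpy as np

a = np.array([-1, 0, 3])
b = np.array([-2, 0, 3])

# tolist() yields plain Python ints, so the tuples hash like ordinary ints
key_a = tuple(a.tolist())
key_b = tuple(b.tolist())

# On CPython the two hashes coincide although the tuples differ
same_hash = hash(key_a) == hash(key_b)

# A dict keyed on the tuples still tells them apart
ids = {}
id_a = ids.setdefault(key_a, len(ids) + 1)
id_b = ids.setdefault(key_b, len(ids) + 1)
print(same_hash, id_a, id_b)
```

So storing raw hash values as IDs can merge distinct vectors, while using the hashable tuples as keys cannot.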

Edit: for comparison with the other solutions:

^{pr2}$

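User

Answer 2 · edited 2024-06-26 14:58:47

What is optimal is hard to say, since it depends on how many new keys versus existing keys you need to generate. But following your logic, this should be fairly fast: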

import collections
import hashlib

_key = 0

def _get_new_key():
    global _key
    _key += 1
    return _key

attributes = collections.defaultdict(_get_new_key)

def get_cell_id(series):
    # Reading a module-level dict needs no `global` statement.
    # tobytes() replaces the deprecated tostring()
    return attributes[hashlib.md5(series.tobytes()).digest()]

Edit:

I have now updated the procedure to loop over all data series, as in your question:

^{pr2}$

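The above performs 256x256 lookups/assignments, one per element array. There is of course no guarantee that md5 hashes will not collide. If that is a problem, you can switch to another hash from the same library.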
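A minimal sketch of such a loop, under stated assumptions: the array `A` is a small hypothetical stand-in with the data series along axis 0, `sha256` is swapped in for `md5` as one of the other hashes in `hashlib`, and `itertools.count` replaces the global counter.

```python
import collections
import hashlib
import itertools

import numpy as np

# Each previously unseen key receives the next integer, starting at 1
_counter = itertools.count(1)
attributes = collections.defaultdict(lambda: next(_counter))


def get_cell_id(series):
    # tobytes() copies the (possibly non-contiguous) series into a bytes object
    return attributes[hashlib.sha256(series.tobytes()).digest()]


# Hypothetical stand-in data: 3 values per series on an 8x8 grid
rng = np.random.default_rng(0)
A = rng.integers(0, 2, size=(3, 8, 8))

# One lookup/assignment per grid cell
ids = np.array([[get_cell_id(A[:, i, j]) for j in range(A.shape[2])]
                for i in range(A.shape[1])])
print(ids.shape, len(attributes))
```

Cells with identical series receive the same ID, and the number of distinct IDs equals the number of distinct series encountered.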

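Edit 2:

Given that you seem to perform most of the expensive operations along the first axis of the 3D array, I suggest you reorganize the array: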

In [254]: A2 = np.random.random((256, 256, 30))

In [255]: A2_strided = np.lib.stride_tricks.as_strided(A2, (A2.shape[0] * A2.shape[1], A2.shape[2]), (A2.itemsize * A2.shape[2], A2.itemsize))

In [256]: %timeit tuple(get_cell_id(S) for S in A2_strided)
10 loops, best of 3: 126 ms per loop

Not having to jump long distances in memory gives a speed-up of roughly 25%.
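As a sketch of that reorganization, assuming the original data has the series along axis 0 (the small array `A` here is a hypothetical stand-in): moving that axis to the end and making the result contiguous puts each series in one unbroken run of memory. For a C-contiguous array, a plain `reshape` then yields the same rows as the `as_strided` call, without hand-written strides.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((30, 16, 16))  # hypothetical: series along axis 0

# Move the series axis last and copy into contiguous memory
A2 = np.ascontiguousarray(np.moveaxis(A, 0, -1))  # shape (16, 16, 30)

# Flatten the two grid axes; this is a view, no further copy
rows = A2.reshape(-1, A2.shape[-1])  # shape (256, 30)

# The as_strided construction from the answer, for comparison
strided = np.lib.stride_tricks.as_strided(
    A2,
    (A2.shape[0] * A2.shape[1], A2.shape[2]),
    (A2.itemsize * A2.shape[2], A2.itemsize),
)
print(rows.shape, np.array_equal(rows, strided))
```

Row `i * 16 + j` of `rows` holds the series that was `A[:, i, j]` in the original layout.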

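Edit 3:

If you do not actually need to cache a hash-to-int lookup and only need the hashes themselves, and if the 3D array is of int8 type, then given the A2 and {} organization the time can be reduced somewhat further. Of this, 15 ms is spent in the tuple loop.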

In [9]: from hashlib import md5

In [10]: %timeit tuple(md5(series.tobytes()).digest() for series in A2_strided)
10 loops, best of 3: 72.2 ms per loop

User

Answer 3 · edited 2024-06-26 14:58:47

This could be one approach using basic numpy functions:

import numpy as np

# Random input for demo
arr = np.random.randint(0,3,[2,5,4])

# Get dimensions for later usage
m,n,k = arr.shape

# Reshape arr to a 2D array that has each slice arr[:, n, k] in each row
arr2d = np.transpose(arr,(1,2,0)).reshape([-1,m])

# Perform lexsort & get corresponding indices and sorted array 
sorted_idx = np.lexsort(arr2d.T)
sorted_arr2d =  arr2d[sorted_idx,:]

# Differentiation along rows for sorted array
df1 = np.diff(sorted_arr2d,axis=0)

# Look for changes along df1 that represent new labels to be put there
df2 = np.append([False],np.any(df1!=0,1),0)

# Get unique labels
labels = df2.cumsum(0)

# Store those unique labels in a n x k shaped 2D array
pos_labels = np.zeros_like(labels)
pos_labels[sorted_idx] = labels
out = pos_labels.reshape([n,k])

Sample run:

^{pr2}$
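For comparison, a sketch of the same labelling with `np.unique` instead of the manual lexsort/diff/cumsum steps. This is a swapped-in technique, not the answer's own method: the label numbering may differ (`np.unique` orders rows by their first column, while `np.lexsort` uses the last key as the primary one), but the grouping of identical slices is the same. The seeded input is a hypothetical stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.integers(0, 3, size=(2, 5, 4))  # hypothetical demo input
m, n, k = arr.shape

# One row per slice arr[:, i, j], as in the answer
arr2d = np.transpose(arr, (1, 2, 0)).reshape(-1, m)

# inverse[p] is the index of row p among the sorted unique rows
_, inverse = np.unique(arr2d, axis=0, return_inverse=True)

# reshape copes with the inverse being returned as 1D or 2D
out = np.asarray(inverse).reshape(n, k)
print(out)
```

Two positions in `out` share a label exactly when their slices along the first axis are identical.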
