Replace sequences of zeros with another value

Posted 2024-09-28 23:02:37


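I have a large dataset (> 200k values) in which I am trying to replace sequences of zeros with another value. A sequence of more than 2 zeros is an artifact and should be removed by setting it to np.nan.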

I have read "Searching a sequence in a NumPy array", but it does not quite fit my requirement, because I do not have a static pattern.

np.array([0, 1.0, 0, 0, -6.0, 13.0, 0, 0, 0, 1.0, 16.0, 0, 0, 0, 0, 1.0, 1.0, 1.0, 1.0])
# should be converted to this
np.array([0, 1.0, 0, 0, -6.0, 13.0, np.nan, np.nan, np.nan, 1.0, 16.0, np.nan, np.nan, np.nan, np.nan, 1.0, 1.0, 1.0, 1.0])

Please let me know if you need more information. Thanks in advance!


Results:

Thanks for the answers. Here are my (unprofessional) test results on 288240 points:

divakar took 0.016000ms to replace 87912 points
desiato took 0.076000ms to replace 87912 points
polarise took 0.102000ms to replace 87912 points

Since @Divakar's solution is the shortest and the fastest, I accepted it.


3 Answers

You can use groupby from the itertools package:

import numpy as np
from itertools import groupby

l = np.array([0, 1, 0, 0, -6, 13, 0, 0, 0, 1, 16, 0, 0, 0, 0])

def _ret_list(k, it):
    # number of elements in the iterator, i.e. the length of this run of equal items
    n = sum(1 for _ in it)

    if k == 0 and n > 2:
        # the run has more than two zeros: replace each zero with np.nan
        return [np.nan] * n
    else:
        # return the run of equal items unchanged
        return [k] * n

# group consecutive equal items and apply _ret_list to each group
processed_l = [_ret_list(k, g) for k, g in groupby(l)]
# flatten the list of runs and convert it to a NumPy array
processed_l = np.array([item for run in processed_l for item in run])

print(processed_l)

This gives you

[  0.   1.   0.   0.  -6.  13.  nan  nan  nan   1.  16.  nan  nan  nan  nan]

Note that every int is converted to float. See: "NumPy or Pandas: Keeping array type as integer while having a NaN value".
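To illustrate the int-to-float conversion mentioned above, here is a minimal sketch. The array names `ints` and `floats` are made up for this example and do not come from the answer.

```python
import numpy as np

ints = np.array([0, 1, 0, 0, 0])
print(ints.dtype.kind)   # 'i': an integer dtype (exact width depends on the platform)

# An integer array cannot store NaN, so the assignment is rejected
try:
    ints[2] = np.nan
    assigned = True
except ValueError:
    assigned = False

# Converting to float first makes room for NaN
floats = ints.astype(float)
floats[2:] = np.nan
print(floats.dtype)      # float64
```

This is why the result of any NaN-based replacement ends up as a float array even when the input holds only integers.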

Here is a function that works on lists:

import numpy as np

def replace(a_list):
    # Note: modifies a_list in place and also returns it
    for i in range(len(a_list) - 2):
        x, y, z = a_list[i:i+3]
        # Either a fresh run of three zeros, or a zero that extends a run
        # already replaced by NaN in an earlier step
        if (x == 0 and y == 0 and z == 0) or (np.isnan(x) and np.isnan(y) and z == 0):
            a_list[i] = np.nan
            a_list[i+1] = np.nan
            a_list[i+2] = np.nan
    return a_list

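Because the list is traversed in one direction, only two comparisons are needed: (0, 0, 0) and (NaN, NaN, 0), since zeros are replaced with NaN as the traversal proceeds.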
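To see why those two patterns are enough, the sketch below records which pattern matches at each position while scanning a short list. The helper name `trace_windows` is invented for this illustration and is not part of the answer.

```python
import math

def trace_windows(values):
    """Scan 3-wide windows left to right and record which pattern matched."""
    values = list(values)          # work on a copy
    nan = float("nan")
    fired = []
    for i in range(len(values) - 2):
        x, y, z = values[i:i + 3]
        if x == 0 and y == 0 and z == 0:
            fired.append((i, "0,0,0"))
        elif math.isnan(x) and math.isnan(y) and z == 0:
            fired.append((i, "NaN,NaN,0"))
        else:
            continue
        values[i:i + 3] = [nan, nan, nan]
    return values, fired

result, fired = trace_windows([1, 0, 0, 0, 0, 0, 2])
print(fired)  # [(1, '0,0,0'), (2, 'NaN,NaN,0'), (3, 'NaN,NaN,0')]
```

The first window of a long run matches (0, 0, 0). Every later zero in the same run is then seen behind two values that are already NaN, so it matches (NaN, NaN, 0).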

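This is basically a binary closing operation with a threshold requirement on the gap being closed. Here is an implementation based on it: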

import numpy as np
from scipy.ndimage import binary_closing

# Input array from the question
a = np.array([0, 1.0, 0, 0, -6.0, 13.0, 0, 0, 0, 1.0, 16.0, 0, 0, 0, 0, 1.0, 1.0, 1.0, 1.0])

# Pad with ones so that binary closing also works at the boundaries
a_extm = np.hstack((True, a != 0, True))

# Perform binary closing and look for the elements that did not change, indicating
# that the gaps in those places were above the threshold for closing
mask = a_extm == binary_closing(a_extm, structure=np.ones(3))

# Of those, skip the nonzero entries of the original array and set the rest to NaN
out = np.where(~a_extm[1:-1] & mask[1:-1], np.nan, a)

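There is also a way to avoid padding the array with boundary elements, which the earlier approach needs in order to handle the boundaries and which could be somewhat costly on a large dataset, as follows: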

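import numpy as np
from scipy.ndimage import binary_closing

# a is the 1-D input array from the question (as a NumPy array)

# Create the binary-closed mask: True where a zero run was too long to be closed
mask = ~binary_closing(a != 0, structure=np.ones(3))

# Zero runs touching either end are never closed, so decide them by their length
idx = np.where(a)[0]                          # indices of the nonzero elements
mask[:idx[0]] = idx[0] >= 3                   # leading run of zeros
mask[idx[-1]+1:] = a.size - idx[-1] - 1 >= 3  # trailing run of zeros

# Use the mask to set NaNs in a
out = np.where(mask, np.nan, a)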
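As a cross-check for the closing-based masks, the same result can be obtained with a run-length approach in plain NumPy. This is a different technique from the one in the answer, sketched here only for comparison. The function name `nan_long_zero_runs` and the `min_run` parameter are hypothetical.

```python
import numpy as np

def nan_long_zero_runs(a, min_run=3):
    """Return a float copy of `a` with zero runs of length >= min_run set to NaN."""
    out = np.array(a, dtype=float)            # always a fresh float copy
    is_zero = out == 0
    # Pad with False so every run of zeros has both a rising and a falling edge
    padded = np.concatenate(([False], is_zero, [False]))
    edges = np.diff(padded.astype(np.int8))
    starts = np.flatnonzero(edges == 1)       # index of the first zero in each run
    ends = np.flatnonzero(edges == -1)        # index one past the last zero in each run
    lengths = ends - starts
    # One flag per zero element: does it belong to a long run?
    in_long_run = np.repeat(lengths >= min_run, lengths)
    out[is_zero] = np.where(in_long_run, np.nan, 0.0)
    return out

a = np.array([0, 1.0, 0, 0, -6.0, 13.0, 0, 0, 0, 1.0, 16.0, 0, 0, 0, 0, 1.0, 1.0, 1.0, 1.0])
out = nan_long_zero_runs(a)
print(np.flatnonzero(np.isnan(out)))  # [ 6  7  8 11 12 13 14]
```

Because it compares run lengths directly, this version needs no special handling for runs at the start or end of the array, and the threshold can be changed without touching a structuring element.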
