Searching nested lists with a large amount of data

3 Answers

User

Answer #1 · edited 2024-09-28 01:26:31

How long does the loop take to finish? In my test case it takes only a couple of hundred milliseconds.

import random

# generate the nested lists
a = list('abcdefghijklmnop')
nested_list = [ [random.choice(a) for x in range(random.randint(1,30))]
                for n in range(700000)]

%%timeit -n 10
word = 'c'
b = [word in x for x in nested_list]
# 10 loops, best of 3: 191 ms per loop

Reducing each inner list to a set saves some time:

nested_sets = [set(x) for x in nested_list]
%%timeit -n 10
word = 'c'
b = [word in s for s in nested_sets]
# 10 loops, best of 3: 132 ms per loop

Once you have a list of sets, you can build a list of boolean tuples, one tuple per inner set. That gives no real time savings, though.

%%timeit -n 10
words = list('abcde')
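# tuple(...) is needed here: bare parentheses would create lazy generators, not boolean tuples
b = [tuple(word in s for word in words) for s in nested_sets]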
# 10 loops, best of 3: 749 ms per loop
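
The `%%timeit` cells above only run inside IPython/Jupyter. As a rough sketch, the same three variants can be timed in a plain script with the standard `timeit` module. The smaller list size (70,000 instead of 700,000), the fixed seed, and the helper names `scan_lists`, `scan_sets` and `scan_sets_multi` are my own assumptions for illustration; absolute timings will differ by machine.

```python
import random
import timeit

random.seed(0)  # assumption: fixed seed so runs are repeatable

# Same construction as above, but 10x smaller to keep the script quick.
letters = list('abcdefghijklmnop')
nested_list = [[random.choice(letters) for _ in range(random.randint(1, 30))]
               for _ in range(70_000)]
nested_sets = [set(x) for x in nested_list]  # built once, reused for every lookup
words = list('abcde')


def scan_lists(word):
    """Membership test against each inner list (linear scan per list)."""
    return [word in x for x in nested_list]


def scan_sets(word):
    """Membership test against each inner set (hash lookup per set)."""
    return [word in s for s in nested_sets]


def scan_sets_multi(ws):
    """One boolean tuple per inner set, one entry per word."""
    return [tuple(w in s for w in ws) for s in nested_sets]


for label, fn in [("lists, one word", lambda: scan_lists('c')),
                  ("sets, one word", lambda: scan_sets('c')),
                  ("sets, five words", lambda: scan_sets_multi(words))]:
    best = min(timeit.repeat(fn, number=3, repeat=3)) / 3
    print(f"{label}: {best * 1000:.1f} ms per loop")
```

Note that building `nested_sets` has its own one-time cost, so the set version only pays off if the same data is searched more than once.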

User

Answer #2 · edited 2024-09-28 01:26:31

When you only iterate once, it is better to use a generator expression.
A solution using the numpy.fromiter function:

import numpy as np

nested_list = [['a','b','c'],['a','b'],['b','c'],['c']]
words = 'c'
arr = np.fromiter((words in l for l in nested_list), int)

print(arr)

Output:

[1 0 1 1]

https://docs.scipy.org/doc/numpy/reference/generated/numpy.fromiter.html
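
A small variation on the snippet above, as a sketch: `np.fromiter` also accepts a `count` argument, so the output array can be allocated up front when the number of sublists is known, and `dtype=bool` gives a mask instead of 0/1 integers. The names `mask` and `hits` are mine, not from the answer.

```python
import numpy as np

nested_list = [['a', 'b', 'c'], ['a', 'b'], ['b', 'c'], ['c']]
word = 'c'

# The generator is consumed exactly once; count lets NumPy preallocate the result.
mask = np.fromiter((word in sub for sub in nested_list),
                   dtype=bool, count=len(nested_list))

# Indices of the sublists that contain the word.
hits = np.flatnonzero(mask)

print(mask.tolist())  # [True, False, True, True]
print(hits.tolist())  # [0, 2, 3]
```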

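User

Answer #3 · edited 2024-09-28 01:26:31

We can flatten the elements of all sublists into a single 1D array. Then we only need to look for 'c' within the bounds of each sublist in that flattened array. Based on this idea, there are two approaches, differing in how we count the occurrences of 'c'.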

Approach #1: one approach, using np.bincount -

lens = np.array([len(i) for i in nested_list])
arr = np.concatenate(nested_list)
ids = np.repeat(np.arange(lens.size),lens)
out = np.bincount(ids, arr=='c')!=0

Since, as stated in the question, nested_list does not change between iterations, we can reuse everything else and loop over only the last step.
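
The reuse described above might look roughly like this: `lens`, `arr` and `ids` are computed once, and only the `np.bincount` step runs per word. The `results` dict and the `minlength` argument are my additions; `minlength` just pins the output length to the number of sublists.

```python
import numpy as np

nested_list = [['a', 'b', 'c'], ['a', 'b'], ['b', 'c'], ['c']]
words = ['c', 'b']

# One-time setup: flatten the data and label each element with its sublist index.
lens = np.array([len(sub) for sub in nested_list])
arr = np.concatenate(nested_list)
ids = np.repeat(np.arange(lens.size), lens)

# Per-word step: count matches per sublist, then test for "at least one".
results = {}
for w in words:
    counts = np.bincount(ids, weights=(arr == w), minlength=lens.size)
    results[w] = counts != 0

print(results['c'].tolist())  # [True, False, True, True]
print(results['b'].tolist())  # [True, True, True, False]
```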

Approach #2: another approach, using np.add.reduceat and reusing arr and lens from the previous approach -

grp_idx = np.append(0,lens[:-1].cumsum())
out = np.add.reduceat(arr=='c', grp_idx)!=0

When looping over the words list, we can keep this approach vectorized at the final step by using np.add.reduceat along an axis, with broadcasting giving us a 2D boolean array, like so -

np.add.reduceat(arr==np.array(words)[:,None], grp_idx, axis=1)!=0

Sample run -

In [344]: nested_list
Out[344]: [['a', 'b', 'c'], ['a', 'b'], ['b', 'c'], ['c']]

In [345]: words
Out[345]: ['c', 'b']

In [346]: lens = np.array([len(i) for i in nested_list])
     ...: arr = np.concatenate(nested_list)
     ...: grp_idx = np.append(0,lens[:-1].cumsum())
     ...: 

In [347]: np.add.reduceat(arr==np.array(words)[:,None], grp_idx, axis=1)!=0
Out[347]: 
array([[ True, False,  True,  True],    # matches for 'c'
       [ True,  True,  True, False]])   # matches for 'b'
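
To sanity-check the vectorized result, it can be compared against the plain list-comprehension answer. Below is a minimal sketch that wraps the steps from the sample run in a helper; the name `contains_matrix` is hypothetical. One assumption is made explicit: `np.add.reduceat` does not treat a repeated index as an empty group (it returns the element at that index instead), so the sketch refuses input containing empty sublists rather than return wrong answers.

```python
import numpy as np


def contains_matrix(nested_list, words):
    """Return a (len(words), len(nested_list)) boolean array.

    Entry [i, j] is True when words[i] occurs in nested_list[j].
    Hypothetical helper; assumes every sublist is non-empty.
    """
    lens = np.array([len(sub) for sub in nested_list])
    if (lens == 0).any():
        raise ValueError("empty sublists are not supported by this sketch")
    arr = np.concatenate(nested_list)
    # Start offset of each sublist inside the flattened array.
    grp_idx = np.append(0, lens[:-1].cumsum())
    # Broadcast: one row of element-wise matches per word.
    eq = arr == np.array(words)[:, None]
    # Sum (logical OR for booleans) within each sublist's slice.
    return np.add.reduceat(eq, grp_idx, axis=1) != 0


nested_list = [['a', 'b', 'c'], ['a', 'b'], ['b', 'c'], ['c']]
words = ['c', 'b']

out = contains_matrix(nested_list, words)
expected = [[w in sub for sub in nested_list] for w in words]
print(out.tolist() == expected)  # True
```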
