How to efficiently check whether an element is in a list of lists in Python

Published 2024-09-30 19:32:26


I have a list like the following.

mylist = [[5274919, ["report", "porcelain", "firing", "technic"]], [5274920, ["implantology", "dentistry"]], [52749, ["method", "recognition", "long", "standing", "root", "perforation", "molar"]], [5274923, ["exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "method", "standing"]]]

I also have a list of concepts, as shown below.

myconcepts = ["method", "standing"]

For each concept in myconcepts, I want to count how many records in mylist contain it, i.e.:

"method" = 2 times in records (i.e. in `52749` and `5274923`)
"standing" = 2 times in records

My current code is as follows.

mycounting = 0
for concept in myconcepts:
  for item in mylist:
     if concept in item[1]:
       mycounting = mycounting + 1
print(mycounting)

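However, my actual mylist is very large, with about 5 million records, and myconcepts has about 10,000 concepts.

With my current code, getting the count for a single concept takes nearly a minute, which is very slow.

What is the most efficient way to do this in Python?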

For testing purposes, I have attached a small part of my dataset at: https://drive.google.com/file/d/1z6FsBtLyDZClod9hK8nK4syivZToa7ps/view?usp=sharing

I am happy to provide more details if needed.


3 answers

Change the concept lists to sets, so that each lookup is O(1). You can then use set intersection to count the matches in each record.

# set is a built-in type, so no import is needed
mylist = [
    [5274919, {"report", "porcelain", "firing", "technic"}], 
    [5274920, {"implantology", "dentistry"}], 
    [52749, {"method", "recognition", "long", "standing", "root", "perforation", "molar"}], 
    [5274923, {"exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "method", "standing"}]
]
myconcepts = {"method", "standing"}
mycounting = 0
for item in mylist:
    mycounting += len(set.intersection(myconcepts, item[1]))
print(mycounting)
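
The snippet above assumes each record already stores its words as a set, whereas the question's data uses lists. A minimal sketch of the conversion step, using the question's sample data; the names `records_as_sets` and `total` are illustrative and not from the original answer:

```python
mylist = [
    [5274919, ["report", "porcelain", "firing", "technic"]],
    [5274920, ["implantology", "dentistry"]],
    [52749, ["method", "recognition", "long", "standing", "root", "perforation", "molar"]],
    [5274923, ["exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "method", "standing"]],
]
myconcepts = {"method", "standing"}

# Convert each word list to a set once, so later membership tests are O(1) on average.
records_as_sets = [[record_id, set(words)] for record_id, words in mylist]

# Total number of (record, concept) matches across all records.
total = sum(len(myconcepts & words) for _, words in records_as_sets)
print(total)  # 4
```

The conversion itself is a full pass over the data, so it pays off mainly when the records are queried more than once.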

If you want a separate count for each concept, loop over myconcepts and use the in operator. You can put the results in a dictionary.

mycounting = {concept: sum(1 for l in mylist if concept in l[1]) for concept in myconcepts}
print(mycounting)  # {'standing': 2, 'method': 2}

This is still more efficient than using lists, because `concept in l[1]` is O(1) when l[1] is a set.
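
The dictionary comprehension above still walks every record once per concept. With about 10,000 concepts, a single pass over the records may be preferable. The following is only a sketch of that idea, not the answerer's code; it assumes the list-based records from the question, and the names `concept_set`, `counts` and `per_concept` are illustrative:

```python
from collections import Counter

mylist = [
    [5274919, ["report", "porcelain", "firing", "technic"]],
    [5274920, ["implantology", "dentistry"]],
    [52749, ["method", "recognition", "long", "standing", "root", "perforation", "molar"]],
    [5274923, ["exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "method", "standing"]],
]
myconcepts = ["method", "standing"]

concept_set = set(myconcepts)
counts = Counter()
for _, words in mylist:
    # intersection() accepts any iterable and returns a set, so a concept
    # repeated inside one record is still counted once for that record.
    counts.update(concept_set.intersection(words))

# Concepts that never occur get 0, because Counter returns 0 for missing keys.
per_concept = {concept: counts[concept] for concept in myconcepts}
print(per_concept)  # {'method': 2, 'standing': 2}
```

Each record is visited once, so the work grows with the total number of words rather than with records times concepts.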

Adapted from Method 3 at https://www.geeksforgeeks.org/python-count-the-sublists-containing-given-element-in-a-list/

from itertools import chain 
from collections import Counter 

mylist = [[5274919, ["report", "porcelain", "firing", "technic"]], [5274920, ["implantology", "dentistry"]], [52749, ["method", "recognition", "long", "standing", "root", "perforation", "molar"]], [5274923, ["exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "method", "standing"]]]

myconcepts = ["method", "standing"]

def countList(lst, x):
    " Counts the number of sublists in which item x appears "
    return Counter(chain.from_iterable(set(i[1]) for i in lst))[x] 

# Use dictionary comprehension to apply countList to concept list
result = {x:countList(mylist, x) for x in myconcepts}
print(result) # {'method':2, 'standing':2}

*Revision of the current method (compute the counts only once)*

def count_occurences(lst):
    " Number of counts of each item in all sublists "
    return Counter(chain.from_iterable(set(i[1]) for i in lst))

cnts = count_occurences(mylist)
result = {x:cnts[x] for x in myconcepts}
print(result) # {'method':2, 'standing':2}

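Performance (comparing the posted methods in a Jupyter notebook)

The results show that this method is close to the method from Barmar's post (36 µs vs. 42 µs).

The revision of the current method cuts the time roughly in half (from 36 µs to 19 µs). With more concepts (the question has more than 1000), this improvement should matter more.

However, on this small sample the original method is faster, at 2.55 µs per loop.

Method 1 (current method)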

%timeit { x:countList(mylist, x) for x in myconcepts}
#10000 loops, best of 3: 36.6 µs per loop

Revised current method:

%%timeit
cnts = count_occurences(mylist)
result = {x:cnts[x] for x in myconcepts}
# 10000 loops, best of 3: 19.4 µs per loop

Method 2 (from Barmar's post)

%%timeit
r = collections.Counter(flatten(mylist))
{i:r.get(i, 0) for i in myconcepts}
# 10000 loops, best of 3: 42.7 µs per loop

Method 3 (original method)

%%timeit

result = {}
for concept in myconcepts:
  mycounting = 0
  for item in mylist:
     if concept in item[1]:
       mycounting = mycounting + 1
  result[concept] = mycounting
  # 100000 loops, best of 3: 2.55 µs per loop
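
The timings above were measured on the four-record sample, so they say little about how the methods scale. One way to check this outside a notebook is to time them on larger synthetic data with the standard `timeit` module. The sketch below is an assumption-laden illustration: the vocabulary, the record and concept counts, and the function names `count_nested` and `count_single_pass` are all made up for the example, and no timing results are claimed.

```python
import random
import timeit
from collections import Counter

rng = random.Random(0)  # fixed seed so the synthetic data is reproducible
vocabulary = [f"word{i}" for i in range(500)]  # made-up vocabulary
records = [[record_id, rng.choices(vocabulary, k=8)] for record_id in range(2000)]
concepts = vocabulary[:50]  # made-up concept list

def count_nested(records, concepts):
    """Original approach: scan every record once per concept."""
    result = {}
    for concept in concepts:
        result[concept] = sum(1 for _, words in records if concept in words)
    return result

def count_single_pass(records, concepts):
    """Count all concepts in one pass over the records."""
    concept_set = set(concepts)
    counts = Counter()
    for _, words in records:
        counts.update(concept_set.intersection(words))
    return {concept: counts[concept] for concept in concepts}

# Both approaches must agree before their timings are worth comparing.
assert count_nested(records, concepts) == count_single_pass(records, concepts)

for fn in (count_nested, count_single_pass):
    seconds = timeit.timeit(lambda: fn(records, concepts), number=3)
    print(f"{fn.__name__}: {seconds / 3:.4f} s per run")
```

Raising the record and concept counts toward the sizes in the question gives a more realistic comparison than a four-record benchmark.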

You can flatten the input and then use collections.Counter:

import collections
myconcepts = ["method", "standing"]
mylist = [[5274919, ["report", "porcelain", "firing", "technic"]], [5274920, ["implantology", "dentistry"]], [5274921, ["method", "recognition", "long", "standing", "root", "perforation", "molar"]], [5274923, ["exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "standing"]]]
def flatten(d):
  for i in d:
    yield from [i] if not isinstance(i, list) else flatten(i)

r = collections.Counter(flatten(mylist))
result = {i:r.get(i, 0) for i in myconcepts}

Output:

{'method': 2, 'standing': 2}

Edit: counting by record:

result = {i:sum(i in b for _, b in mylist) for i in myconcepts}

Output:

{'method': 2, 'standing': 2}
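
The two variants in this answer give the same result here only because this answer's sample data lists "method" once per record. With the question's original data, where the last record contains "method" twice, flattening counts occurrences rather than records. A short sketch of the difference; the names `by_occurrence` and `by_record` are illustrative:

```python
from collections import Counter

def flatten(d):
    for i in d:
        yield from [i] if not isinstance(i, list) else flatten(i)

# Sample data from the question: the last record contains "method" twice.
mylist = [
    [5274919, ["report", "porcelain", "firing", "technic"]],
    [5274920, ["implantology", "dentistry"]],
    [52749, ["method", "recognition", "long", "standing", "root", "perforation", "molar"]],
    [5274923, ["exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "method", "standing"]],
]
myconcepts = ["method", "standing"]

# Flattening counts every occurrence, including repeats within one record.
occurrences = Counter(flatten(mylist))
by_occurrence = {c: occurrences.get(c, 0) for c in myconcepts}

# The record-based version counts each record at most once per concept.
by_record = {c: sum(c in words for _, words in mylist) for c in myconcepts}

print(by_occurrence)  # {'method': 3, 'standing': 2}
print(by_record)      # {'method': 2, 'standing': 2}
```

Since the question asks how many records contain each concept, the record-based count is the one that matches its expected output.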
