Fast(er) way to check whether a word is English by comparing it against a whitelist of English words?

englishwords = list(set(nltk.corpus.words.words()))
englishwords = [x.lower() for x in list(englishwords)]
englishwords = [ps.stem(w) for w in englishwords]
# this step takes too long:
shareholderletter = ' '.join(w for w in nltk.wordpunct_tokenize(shareholderletter) if w in englishwords)

1 answer

User

#1 · posted 2024-09-30 00:37:40

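You are checking something in otherthing, and your otherthing is a list.

Lists are great for storing things, but an "is x in it" lookup on a list takes O(n).

Use a set instead. It brings the lookup down to O(1) on average, and it removes all duplicates, so if you had duplicates, the collection you search in gets smaller as well.

If your set does not change afterwards, use a frozenset, which is immutable.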

Read: Documentation of sets
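Applied to the filter from the question, the idea could look roughly like the sketch below. The tiny word list and the sample sentence are made up for illustration, str.split stands in for nltk.wordpunct_tokenize, and lower-casing each token before the lookup is my assumption, not something the question's code does.

```python
# Hypothetical stand-in for the stemmed, lower-cased NLTK word list.
english_words = frozenset(["the", "company", "grew", "this", "year"])

# Hypothetical sample text; str.split() stands in for the NLTK tokenizer.
letter = "The company grew 12% this year , says el presidente"

# Each membership test against the frozenset is O(1) on average.
# Filtering token by token keeps the original order and any repeated words.
filtered = " ".join(w for w in letter.split() if w.lower() in english_words)

print(filtered)  # The company grew this year
```

The only change that matters for speed is the container type: the generator expression is the same as in the question, but the right-hand side of "in" is now a frozenset instead of a list.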

If you follow @DeepSpace's suggestion and make use of set operations, you get even better performance:

s = set(t.lower().strip() for t in ["Some", "text", "in", "set"])
t = set("Some text in a string that holds other words as well".lower().split())

print(s & t)  # show me all things that are in both sets (aka intersection)

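Output (Python 2 repr; on Python 3 the same three elements print as {'text', 'some', 'in'}, in arbitrary order):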

set(['text', 'some', 'in'])

See set operations
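Besides the intersection, the difference is handy here, because it gives you the words that are not on the whitelist. A small sketch reusing the example data from above (the variable names are mine):

```python
whitelist = {"some", "text", "in", "set"}
tokens = set("Some text in a string that holds other words as well".lower().split())

known = tokens & whitelist    # intersection: tokens that are on the whitelist
unknown = tokens - whitelist  # difference: tokens that are not on the whitelist

# sorted() only makes the printed order deterministic; sets are unordered.
print(sorted(known))    # ['in', 'some', 'text']
print(sorted(unknown))  # ['a', 'as', 'holds', 'other', 'string', 'that', 'well', 'words']
```

Keep in mind that set operations throw away word order and repeated words. If you need to rebuild the text afterwards, filter token by token against the set instead.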

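O(n): in the worst case, your word is the last of the 200k words in your list and you check the whole list, which takes 200k comparisons.

O(1): the lookup time is constant. No matter how many items are in the data structure, it takes the same time to check whether an item is present. To get this benefit, a set uses a more complex storage scheme that needs somewhat more memory than a list in order to perform this well on lookups.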
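To get a feeling for the memory cost, you can compare the container sizes with sys.getsizeof. This is only a rough sketch: the numbers are specific to CPython and your Python version, and getsizeof counts only the container itself, not the strings stored in it.

```python
import sys

# 200k distinct strings, stored once as a list and once as a set.
words = [str(i) for i in range(200_000)]
as_list = list(words)
as_set = set(words)

# Size of the container only (references / hash table), not of the strings.
list_bytes = sys.getsizeof(as_list)
set_bytes = sys.getsizeof(as_set)

print("list:", list_bytes, "bytes")
print("set: ", set_bytes, "bytes")
```

On CPython the set comes out larger, because its hash table keeps spare slots free so that lookups stay fast.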

Edit: the worst case, looking up a word that is not in the set/list:

import timeit

setupcode = """# list with some dupes
l = [str(i) for i in range(10000)] + [str(i) for i in range(10000)] + [str(i) for i in range(10000)]
# set of this list
s = set( l )
"""

print(timeit.timeit("""k = "10000" in l """,setup = setupcode, number=100))
print(timeit.timeit("""k = "10000" in s """,setup = setupcode, number=100))

0.03919574100000034    # checking 100 times if "10000" is in the list
0.00000512200000457    # checking 100 times if "10000" is in the set
