Get a list of the words in a string that are not in another list

User

#1 · edited 2024-10-01 05:02:49

I think the most direct approach is to use sets. For example:

s = "This is a test"
s2 = ["This", "is", "another", "test"]
set(s.split()) - set(s2)

# returns {'a'}
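
Note that a set difference discards both word order and duplicates. If either matters, one possible variation is to turn only the lookup side into a set and filter the words in their original order. This is a minimal sketch; the names `words_not_in` and `excluded` are hypothetical and chosen for illustration only.

```python
def words_not_in(text, excluded):
    """Return the words of `text` that are absent from `excluded`, in order.

    `words_not_in` and `excluded` are illustrative names, not from the answer.
    """
    excluded_set = set(excluded)  # average O(1) membership tests
    return [word for word in text.split() if word not in excluded_set]


s = "This is a test and a demo"
s2 = ["This", "is", "another", "test"]
print(words_not_in(s, s2))  # ['a', 'and', 'a', 'demo']
```

Unlike `set(s.split()) - set(s2)`, this keeps the repeated `'a'` and the order in which the words appeared.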

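However, depending on the size of the text, it may be worth using a generator so that the whole list of words is never held in memory at once. For example: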

import re

def itersplit(s, sep=None):
    exp = re.compile(r'\s+' if sep is None else re.escape(sep))
    pos = 0
    while True:
        m = exp.search(s, pos)
        if not m:
            if pos < len(s) or sep is not None:
                yield s[pos:]
            break
        if pos < m.start() or sep is not None:
            yield s[pos:m.start()]
        pos = m.end()

[word for word in itersplit(s) if word not in s2]

# returns ['a']
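
In the list comprehension above, `word not in s2` scans the whole list for every word. A possible refinement, sketched here under the assumption that words are separated only by whitespace, is to tokenize lazily with `re.finditer` and test membership against a set. The helper names `iter_words` and `iter_missing` are hypothetical.

```python
import re


def iter_words(text):
    """Yield whitespace-separated words one at a time, without building a list."""
    for match in re.finditer(r"\S+", text):
        yield match.group()


def iter_missing(text, known_words):
    """Lazily yield the words of `text` that are not in `known_words`."""
    known = set(known_words)  # built once, then average O(1) lookups
    return (word for word in iter_words(text) if word not in known)


s = "This is a test"
s2 = ["This", "is", "another", "test"]
print(list(iter_missing(s, s2)))  # ['a']
```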

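User

#2 · edited 2024-10-01 05:02:49

Using a set-based solution gives you O(len(nlwoorden)) for the whole thing. It should take another O(len(nlwoorden)) + O(len(tekst)) to make the two sets.

So the snippet you are looking for is basically the one given in the comments:

belangrijk = list(set(tekst.split()) - set(nlwoorden))

(assuming you want the result as a list again at the end)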

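User

#3 · edited 2024-10-01 05:02:49

Read and process the "woorden.txt" file line by line

Load all of nlwoorden into a set (this is more efficient than loading it into a list).
Read the big file piece by piece, split each piece into words, and write only the words that are not in nlwoorden to the result file.

Assuming your big 600 MB file has reasonably long lines (not one 600 MB line), I would do it like this: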

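# Load every known word into a set for fast membership tests.
nlwoorden = set()
with open("nlwoorden.txt") as f:
    for line in f:
        nlwoorden.update(line.split())

# Stream the big file line by line and keep only the unknown words.
with open("woorden.txt") as f, open("out.txt", "w") as fo:
    for line in f:
        newwords = set(line.split())
        newwords.difference_update(nlwoorden)
        if newwords:
            # Trailing newline keeps words from adjacent lines apart.
            fo.write(" ".join(newwords) + "\n")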
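
Because each line is handled on its own, the same unknown word can be written many times if it occurs on several lines. If every word should appear only once, and in order of first appearance, one option is to remember what has already been written. The sketch below is an assumption-laden variation, not the answer's method: `write_unknown_words` is a hypothetical helper, and `io.StringIO` objects stand in for the real files so the example runs on its own.

```python
import io


def write_unknown_words(src, dst, known_words):
    """Write each word of `src` that is not in `known_words` to `dst`, once."""
    seen = set()  # unknown words already written
    for line in src:
        for word in line.split():
            if word not in known_words and word not in seen:
                seen.add(word)
                dst.write(word + "\n")


known = {"de", "het", "een"}
src = io.StringIO("de kat zit\nhet huis en de kat\n")
dst = io.StringIO()
write_unknown_words(src, dst, known)
print(dst.getvalue().split())  # ['kat', 'zit', 'huis', 'en']
```

The trade-off is that `seen` grows with the number of distinct unknown words, so memory use is no longer bounded by the length of a single line.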

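Conclusion

This solution should not consume much memory, because you never read all the data from "woorden.txt" at once.

If your file is not split into lines, you will have to change the way you read it piece by piece. But I assume your file does contain newlines.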
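
For the case where the file has no usable line breaks, one way to read it piece by piece is in fixed-size chunks, carrying over a word that may have been cut at the chunk boundary. This is only a sketch: `iter_words_chunked` and its `chunk_size` parameter are hypothetical names, and an `io.StringIO` object stands in for the real file.

```python
import io


def iter_words_chunked(stream, chunk_size=1 << 16):
    """Yield whitespace-separated words from `stream`, reading fixed-size chunks."""
    tail = ""  # possibly incomplete word carried over from the previous chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        chunk = tail + chunk
        words = chunk.split()
        if words and not chunk[-1].isspace():
            # The last word may continue in the next chunk, so hold it back.
            tail = words.pop()
        else:
            tail = ""
        yield from words
    if tail:
        yield tail  # final word at end of input


known = {"beta", "delta"}
source = io.StringIO("alpha beta gamma\ndelta")
# A tiny chunk size is used here only to force words to straddle chunks.
print([w for w in iter_words_chunked(source, chunk_size=4) if w not in known])
# ['alpha', 'gamma']
```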
