What is the best way to search for a large number of words in a large number of files?

import re

wordlist = [...list of around 10000 english words...]
filelist = [...list of around 5000 filenames...]
wordlistre = re.compile('|'.join(wordlist), re.IGNORECASE)
discovered = []
for x in filelist:
    with open(x, 'r') as f:
        found = wordlistre.findall(f.read())
    if found:
        # Record the file name together with the words found in it.
        discovered.append((x, found))

3 Answers

User

#1 · Edited 2024-06-30 07:58:09

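The Aho-Corasick algorithm was designed for exactly this use case, and it is implemented in Unix as fgrep. POSIX defines the grep -F command to provide this functionality.

It differs from regular grep in that it accepts only fixed strings (not regular expressions) and is optimized for searching for a large number of strings at once.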
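To make the idea concrete, here is a minimal pure-Python sketch of Aho-Corasick. It is an illustration of the technique only, not the code grep uses, and the names `build_automaton` and `find_all` are made up for this example. The words go into a trie, each trie node gets a failure link to the longest proper suffix that is also a prefix of some word, and the text is then scanned once regardless of how many words there are.

```python
from collections import deque


def build_automaton(words):
    """Build a trie with failure links for a collection of fixed strings."""
    goto = [{}]    # goto[state][char] -> next state
    fail = [0]     # fail[state] -> state to fall back to on a mismatch
    output = [[]]  # output[state] -> words that end at this state

    # Phase 1: insert every non-empty word into the trie.
    for word in words:
        if not word:
            continue
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(word)

    # Phase 2: breadth-first pass to compute failure links.
    # Depth-1 states keep the default failure link to the root.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit the words that end at the fallback state.
            output[nxt] += output[fail[nxt]]
    return goto, fail, output


def find_all(text, automaton):
    """Return (start_index, word) for every occurrence of any word."""
    goto, fail, output = automaton
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in output[state]:
            hits.append((i - len(word) + 1, word))
    return hits
```

For a case-insensitive search, one option is to lowercase both the words and the text before calling these functions. Like grep -F without -w, this sketch reports substring matches, so "he" is found inside "ushers".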

To run it on a large number of files, either name the files explicitly on the command line or pass them through xargs:

xargs -a filelist.txt grep -F -f wordlist.txt

xargs fills the command line with as many file names as will fit and runs grep as many times as necessary.


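The exact number of files per invocation depends on the length of the individual file names and on the size of the ARG_MAX constant on your system.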
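The batching behavior described above can be sketched in Python. This is a simplified model, not how xargs is actually implemented: the helper name `batch_by_arg_size` and the `max_bytes` budget are assumptions for illustration, and the real limit also has to cover the environment and other overhead that this sketch ignores.

```python
def batch_by_arg_size(filenames, max_bytes,
                      base_cmd=("grep", "-F", "-f", "wordlist.txt")):
    """Yield command lines whose argument text stays within max_bytes."""
    # Each argument costs its encoded length plus one terminating NUL byte.
    base_cost = sum(len(arg.encode()) + 1 for arg in base_cmd)
    batch, cost = [], base_cost
    for name in filenames:
        name_cost = len(name.encode()) + 1
        # Start a new command once the next name would exceed the budget.
        if batch and cost + name_cost > max_bytes:
            yield list(base_cmd) + batch
            batch, cost = [], base_cost
        batch.append(name)
        cost += name_cost
    if batch:
        yield list(base_cmd) + batch
```

Each yielded list could be handed to `subprocess.run`. On POSIX systems `os.sysconf("SC_ARG_MAX")` reports the system limit, though a budget well below that value is the safer choice.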

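User

#2 · Edited 2024-06-30 07:58:09

Without knowing more about your data, two ideas are to use a dictionary instead of a list, and to reduce the amount of data that has to be searched or sorted. Also consider using re.split if your delimiters are not as clean as the ones below: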

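wordlist = 'this|is|it|what|is|it'.split('|')
d_wordlist = {}

# Bucket the words by first letter so each lookup checks a smaller set.
for word in wordlist:
    first_letter = word[0]
    d_wordlist.setdefault(first_letter, set()).add(word)

filelist = [...list of around 5000 filenames...]
discovered = {}

for x in filelist:
    with open(x, 'r') as f:
        # Split the file contents on whitespace to get individual words.
        for word in f.read().split():
            first_letter = word[0]
            # Letters with no bucket fall back to an empty tuple.
            if word in d_wordlist.get(first_letter, ()):
                # Record the match under this file name.
                discovered.setdefault(x, set()).add(word)

print(discovered)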
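The re.split suggestion can be sketched as follows for text whose words are separated by punctuation as well as whitespace. The function name `words_found` and the choice of delimiter pattern are assumptions for this example, and `wordset` is expected to hold lowercase words.

```python
import re


def words_found(text, wordset):
    """Return the members of wordset that occur as whole words in text."""
    # Treat any run of characters other than letters and apostrophes
    # as a delimiter, and compare in lowercase.
    tokens = re.split(r"[^A-Za-z']+", text.lower())
    # Set intersection does one hash lookup per token.
    return wordset.intersection(tokens)
```

Unlike the regular expression in the question, this matches whole words only, so "it" would not be reported for a file that contains only "item".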

User

#3 · Edited 2024-06-30 07:58:09

If you have access to a command line, you can run the following:

grep -i -f wordlist.txt -r DIRECTORY_OF_FILES

You need to create a file, wordlist.txt, that contains all of the words (one word per line).
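If the words already live in a Python list, as in the question, a small helper can produce that file. This is a sketch, and the function name `write_wordlist` is made up for illustration. It skips blank entries because an empty line in a grep pattern file acts as an empty pattern, which matches every line.

```python
def write_wordlist(words, path="wordlist.txt"):
    """Write one word per line, skipping blank entries."""
    with open(path, "w", encoding="utf-8") as f:
        for word in words:
            word = word.strip()
            if word:
                f.write(word + "\n")
```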

Any line in any file that matches any of the words is printed to STDOUT, prefixed with the path of the file it was found in.
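Assuming grep's usual `path:matching line` output for a recursive search, the results can be grouped by file in Python. This is a rough sketch with a hypothetical function name, `group_matches_by_file`. It splits on the first colon, so it will misparse paths that themselves contain a colon.

```python
from collections import defaultdict


def group_matches_by_file(grep_output):
    """Map each file path to the list of matching lines reported for it."""
    matches = defaultdict(list)
    for line in grep_output.splitlines():
        path, sep, text = line.partition(":")
        # Lines without a colon are not path:line records; skip them.
        if sep:
            matches[path].append(text)
    return dict(matches)
```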
