Strategies for speeding up a large-scale search

import os
import mmap
import glob

os.chdir("C:/mysearch/")
searchtermfile = "original_search_terms.txt"

# load list of 50,000 search terms into memory as a list
with open(searchtermfile, 'r') as f:
    searchtermlist = [line.strip() for line in f]
numberofsearchterms = len(searchtermlist)

# make a list of database files in the directory
dblist = glob.glob('databasepart*.txt')
sizedblist = len(dblist)

counterdb = 0  # counts the iterations over the database files
countersearchterms = 0  # counts the iterations over the search terms
previousstring = "DUMMY"  # a dummy value just for the first time it's used

# iterate first over list of file names
for nameoffile in dblist:
    counterdb += 1
    countersearchterms = 0
    # remove old notfound list, this iteration will make a new, shorter one.
    # os.remove raises an error if there is not already a notfound.txt file;
    # I always make sure there's an empty file with that name
    os.remove("notfound.txt")
    # map current database file (50 MB) into memory
    with open(nameoffile, 'r+b') as f:
        m = mmap.mmap(f.fileno(), 0)  # length 0 maps the entire file
    # iterate over search terms
    for searchstring in searchtermlist:
        countersearchterms += 1
        # mmap.find takes bytes in Python 3, so encode the term first
        if m.find(searchstring.encode()) == -1:
            with open("notfound.txt", "a") as myfile:
                myfile.write(searchstring + "\n")
        # this print line won't be there in the final code,
        # it's allowing me to see how fast this program runs
        print(str(counterdb) + " of " + str(sizedblist) + " & "
              + str(countersearchterms) + " of " + str(numberofsearchterms))
        previousstring = searchstring
    m.close()
    # reload saved list of not found terms as new search term list
    with open('notfound.txt', 'r') as f:
        searchtermlist = [line.strip() for line in f]
    numberofsearchterms = len(searchtermlist)

2 answers

User

#1 · edited 2024-09-23 06:26:36

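I have little experience with Python, so personally I would do this in C or C++. The problem is simplified by the fact that you are only looking for exact matches.

The inner loop is where all the time is spent, so that is where I would concentrate.

First, I would take the list of 5e4 terms and either sort them and put them in a table for binary search or, better yet, put them in a trie structure for letter-by-letter search.

Then, at each character position in the "sentence", call the search function. That should be quite fast. In principle a hash table has O(1) performance, but the constant factor matters. I would bet that a trie still beats it in this case, and you can tune it heavily.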
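The trie idea above can be sketched in Python, the language of the question. This is only a minimal illustration, not the answerer's implementation: the names `build_trie`, `find_terms`, and the `END` marker are hypothetical, and the trie is a plain nested dict rather than a tuned structure.

```python
END = ""  # hypothetical end-of-term marker; cannot collide with a one-character key


def build_trie(terms):
    """Build a nested-dict trie from an iterable of search terms."""
    root = {}
    for term in terms:
        node = root
        for ch in term:
            node = node.setdefault(ch, {})
        node[END] = term  # remember which term ends at this node
    return root


def find_terms(text, trie):
    """Return the set of terms that occur anywhere in text."""
    found = set()
    n = len(text)
    for start in range(n):  # try a match at every character position
        node = trie
        i = start
        while i < n and text[i] in node:
            node = node[text[i]]
            i += 1
            if END in node:
                found.add(node[END])
    return found


terms = ["ABC", "BC", "XYZ", "D"]
found = find_terms("ABCBAD", build_trie(terms))
not_found = [t for t in terms if t not in found]
```

With this approach each database file is scanned once for all terms together, instead of once per term as in the question's loop.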

User

#2 · edited 2024-09-23 06:26:36

You could, however, try using a regular expression:

>>> import re
>>> searchterms = ["A", "B", "AB", "ABC", "C", "BC"]
>>> # To match the longest sequences first, you need to place them at the beginning
>>> searchterms.sort(key=len, reverse=True)
>>> searchterms
['ABC', 'AB', 'BC', 'A', 'B', 'C']
>>> # Compile one big regex that searches for all terms together;
>>> # re.escape keeps terms containing regex metacharacters literal
>>> _regex = re.compile("(" + "|".join(map(re.escape, searchterms)) + ")")
>>> _regex.findall("ABCBADCBDACBDACBDCBADCBADBCBCBDACBDACBDACBDABCDABC")
['ABC', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'C', 'B', 'A', 'C', 'B', 'A', 'BC', 'BC', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'ABC', 'ABC']
>>>

If you are only interested in counting the matches, you can use finditer.
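A rough sketch of counting with `finditer` follows. The helper name `count_matches` is hypothetical, and `collections.Counter` is just one convenient way to tally the matches.

```python
import re
from collections import Counter


def count_matches(terms, text):
    """Count non-overlapping matches of each term, longest terms first."""
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered))
    # finditer yields match objects lazily, so no list of all matches is built
    return Counter(m.group(0) for m in pattern.finditer(text))


counts = count_matches(["A", "B", "AB", "ABC", "C", "BC"], "ABCBADCB")
```

Note that regex matches do not overlap: in this input "BC" only occurs inside "ABC", so it is never reported on its own. If the goal is to learn which terms are absent from the text, as in the question, that behaviour has to be taken into account.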
