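  • Split the reference database into 200 parts of 500,000 sentences each
  • Iterate over these partial databases, using mmap to map each one into memory
  • Load the list of search terms into an in-memory list
  • Iterate over the list using mmap's find (no regex, of course!) and write the terms that were not found to a new list of search terms
  • When the loop moves on to the next database, use that new, shorter file of search terms as the list
  • And so on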


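A few things occurred to me right away:
  • Would database files larger or smaller than 50 MB be better?
  • I'm sure I should keep the list of "not found" terms in memory and only write it to disk at the end of the process. I'm doing it this way for now so I can measure the process during this design phase.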

import glob
import mmap

searchtermfile = "original_search_terms.txt"

# load the list of 50,000 search terms into memory as a list of bytes
# (mmap.find() takes bytes in Python 3, so all files are opened in binary mode)
with open(searchtermfile, 'rb') as f:
    searchtermlist = [line.strip() for line in f]
    numberofsearchterms = len(searchtermlist)

# make a list of database files in the directory
dblist = glob.glob('databasepart*.txt')
sizedblist = len(dblist)

counterdb = 0  # counts the iterations over the database files

# iterate first over the list of file names
for nameoffile in dblist:
    counterdb += 1
    countersearchterms = 0  # counts the iterations over the search terms
    # map the current database file (50 MB) into memory, read-only
    with open(nameoffile, 'rb') as f:
        m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)  # length 0 maps the entire file
        # opening with 'wb' truncates the old notfound list, so no empty file
        # has to exist beforehand; this iteration writes a new, shorter one
        with open("notfound.txt", "wb") as myfile:
            # iterate over search terms
            for searchstring in searchtermlist:
                countersearchterms += 1
                if m.find(searchstring) == -1:
                    myfile.write(searchstring + b"\n")
                # this print line won't be there in the final code, it's allowing me to see how fast this program runs
                print(str(counterdb) + " of " + str(sizedblist) + " & " + str(countersearchterms) + " of " + str(numberofsearchterms))
        m.close()
    # reload saved list of not found terms as new search term list
    with open('notfound.txt', 'rb') as f:
        searchtermlist = [line.strip() for line in f]
        numberofsearchterms = len(searchtermlist)
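
For comparison, the in-memory variant mentioned in the question (keep the "not found" terms in memory and write them out only once) might look roughly like the sketch below. The function name `terms_not_in_any_file` is hypothetical, the terms are assumed to be `bytes`, and every database file is assumed to be non-empty, since `mmap` refuses to map an empty file.

```python
import glob
import mmap


def terms_not_in_any_file(terms, paths):
    """Return the set of terms (bytes) that occur in none of the given files."""
    remaining = set(terms)
    for path in paths:
        if not remaining:
            break  # every term has already been found somewhere
        with open(path, "rb") as f:
            # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
                # Keep only the terms that this file does not contain.
                remaining = {t for t in remaining if m.find(t) == -1}
    return remaining


if __name__ == "__main__":
    # File names follow the question; adjust them to your own layout.
    with open("original_search_terms.txt", "rb") as f:
        terms = [line.strip() for line in f if line.strip()]
    notfound = terms_not_in_any_file(terms, glob.glob("databasepart*.txt"))
    with open("notfound.txt", "wb") as out:
        for term in sorted(notfound):
            out.write(term + b"\n")
```

The list shrinks after each file exactly as in the original loop, but the disk is touched only once at the end.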

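I have less experience with Python, so personally I would do this in C or C++. The problem is simplified by the fact that you are only looking for exact matches.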



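Then, at each character position in the "sentence", call the search function. That should be fast. In principle a hash table has O(1) performance, but the constant factor matters. I would bet that a trie still beats it in this case, and you can tune it heavily.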
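
As a rough illustration of that idea in Python rather than C or C++, here is a minimal sketch of a trie built from nested dicts and probed at every character position. The names `build_trie`, `terms_starting_at` and `find_terms` are made up for this example, and no claim is made about how it performs compared with `mmap.find`.

```python
END = object()  # sentinel key marking "a complete term ends at this node"


def build_trie(terms):
    """Build a nested-dict trie from an iterable of strings."""
    root = {}
    for term in terms:
        node = root
        for ch in term:
            node = node.setdefault(ch, {})
        node[END] = term
    return root


def terms_starting_at(trie, text, start):
    """Yield every term that matches text beginning exactly at index start."""
    node = trie
    for i in range(start, len(text)):
        node = node.get(text[i])
        if node is None:
            return  # no term continues with this character
        if END in node:
            yield node[END]


def find_terms(trie, text):
    """Return the set of all terms occurring anywhere in text."""
    found = set()
    for start in range(len(text)):
        found.update(terms_starting_at(trie, text, start))
    return found


terms = ["A", "B", "AB", "ABC", "C", "BC", "XYZ"]
trie = build_trie(terms)
found = find_terms(trie, "zzABCzz")
not_found = set(terms) - found  # the terms that still need to be searched elsewhere
```

Each sentence is scanned once regardless of how many search terms there are, which is the property the answer relies on.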


>>> import re
>>> searchterms = ["A", "B", "AB", "ABC", "C", "BC"]
>>> # To match the longest sequences first, we need to place them at the beginning
>>> searchterms.sort(key=len, reverse=True)
>>> searchterms
['ABC', 'AB', 'BC', 'A', 'B', 'C']
>>> # Compile one big regex that searches for all terms together
>>> # (re.escape keeps terms containing regex metacharacters literal)
>>> _regex = re.compile("(" + "|".join(map(re.escape, searchterms)) + ")")
>>> # findall returns every match in order; `text` stands for the input string being searched
>>> _regex.findall(text)
['ABC', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'C', 'B', 'A', 'C', 'B', 'A', 'BC', 'BC', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'ABC', 'ABC']
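
One caveat when using such a combined regex to decide which terms were "not found": `findall` reports non-overlapping matches only, so a shorter term that occurs only inside a longer match is never reported. The small sketch below (with made-up sample data) shows the difference from a plain substring test.

```python
import re

terms = ["ABC", "AB", "BC"]
# Longest terms first, each one escaped so it is matched literally.
ordered = sorted(terms, key=len, reverse=True)
pattern = re.compile("|".join(map(re.escape, ordered)))

text = "xABCx"  # sample input, chosen so that the terms overlap

# Terms reported by the combined regex (non-overlapping matches only).
matched = set(pattern.findall(text))

# Terms that really occur in the text as substrings.
contained = {t for t in terms if t in text}

# Terms the regex would wrongly leave on the "not found" list.
missed = contained - matched
```

Here `matched` holds only `'ABC'`, while `'AB'` and `'BC'` are also substrings of the text, so a "not found" list built from `findall` alone would be too long.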


