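<p>You can build a shareable database by mapping each sentence to a hash, and then look up the data at the candidate locations that the hash points to.</p>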
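<pre><code>import hashlib
from collections import defaultdict
from io import StringIO

DATA = """applachian
rocky mountains
andes
sierra nevada
long mountain ranges of the world"""

def normalize(sentence):
    # Lowercase and collapse runs of whitespace so equivalent sentences hash alike.
    return " ".join(sentence.lower().split())

def stable_hash(text):
    # The built-in hash() of a str is randomized per process in Python 3,
    # so use a digest that stays the same after the db is saved and reloaded.
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def create_db(inf):
    # Map the hash of each normalized line to a list of (offset, length) pairs.
    db = defaultdict(list)
    offset = 0
    for line in inf:
        l = len(line)
        db[stable_hash(normalize(line))].append((offset, l))
        offset += l
    return db

def main():
    db = create_db(StringIO(DATA))
    # save this db, and in a different script, load it to retrieve:
    for needle in ["rocky", "sierra nevada"]:
        key = stable_hash(normalize(needle))
        for offset, length in db.get(key, []):
            print("possibly found at", offset, length)

if __name__ == '__main__':
    main()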
</code></pre>
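<p>This illustrates the idea: you build a database (stored as a pickle, for example) that maps every normalized search key to the locations where that key was found. You can then quickly retrieve the offset and length, seek to that position in the actual file, and do a proper ==-based comparison, since a hash match alone only tells you the sentence is possibly there.</p>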
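<p>The save, load, and verify steps described above could look roughly like the sketch below. It is only an illustration under a few assumptions: the file names <code>sentences.txt</code> and <code>index.pickle</code>, and the helpers <code>stable_hash</code>, <code>build_index</code>, <code>lookup</code> and <code>demo</code>, are made up for this example; the data file is opened in binary mode so that offsets are byte positions usable with <code>seek</code>; and MD5 stands in for any hash that is stable across processes.</p>

```python
import hashlib
import os
import pickle
import tempfile
from collections import defaultdict


def normalize(sentence):
    # Lowercase and collapse whitespace, as in the example above.
    return " ".join(sentence.lower().split())


def stable_hash(text):
    # Hypothetical helper: a digest that is identical in every process.
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def build_index(data_path):
    """Map each normalized line's hash to a list of (byte offset, byte length)."""
    index = defaultdict(list)
    offset = 0
    with open(data_path, "rb") as f:  # binary mode: offsets are bytes
        for raw in f:
            key = stable_hash(normalize(raw.decode("utf-8")))
            index[key].append((offset, len(raw)))
            offset += len(raw)
    return dict(index)


def lookup(data_path, index, needle):
    """Return the offsets where the needle really occurs, confirmed with ==."""
    wanted = normalize(needle)
    hits = []
    with open(data_path, "rb") as f:
        for offset, length in index.get(stable_hash(wanted), []):
            f.seek(offset)
            candidate = f.read(length).decode("utf-8")
            if normalize(candidate) == wanted:  # rules out hash collisions
                hits.append(offset)
    return hits


def demo():
    lines = ["applachian", "rocky mountains", "andes", "sierra nevada"]
    workdir = tempfile.mkdtemp()
    data_path = os.path.join(workdir, "sentences.txt")   # assumed file name
    index_path = os.path.join(workdir, "index.pickle")   # assumed file name
    with open(data_path, "wb") as f:
        f.write("\n".join(lines).encode("utf-8"))
    # "Indexing script": build the index and save it as a pickle.
    with open(index_path, "wb") as f:
        pickle.dump(build_index(data_path), f)
    # "Query script": load the pickle and look sentences up.
    with open(index_path, "rb") as f:
        index = pickle.load(f)
    return [lookup(data_path, index, n) for n in ["rocky", "Sierra  Nevada"]]


if __name__ == "__main__":
    print(demo())
```

<p>Only whole sentences are indexed, so <code>"rocky"</code> finds nothing, while <code>"Sierra  Nevada"</code> matches after normalization. Keep in mind that a pickle should only be loaded from a source you trust.</p>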