What is the fastest way to compare a large number of regexps against text in Python?

import re2, datetime, time

rules = [
    [[1], ["(?!.*ipiranga).*((?=.*posto)(?=.*petrobras).*|(?=.*petrobras)).*"]],
    [[2], ["(?!.*brasil).*((?=.*posto)(?=.*petrobras).*|(?=.*petrobras)).*"]],
]

# cache compile
compilled_rules = []
for rule in rules:
    compilled_rules.append([[rule[0][0]], [re2.compile(rule[1][0])]])

def get_rules(text):
    new_tweet = text.lower()
    for rule in compilled_rules:
        ok = 1
        if not re2.search(rule[1][0], new_tweet):
            ok = 0
        print(ok)

def test():
    t0 = datetime.datetime.now()
    i = 0
    time.sleep(1)
    while i < 1000000:
        get_rules("Acabei de ir no posto petrobras. Moro pertinho do posto brasil")
        i += 1
    t1 = datetime.datetime.now() - t0
    print("test")
    print(i)
    print(t1)
    print(i / t1.seconds)

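3 answers

User

Answer 1 · edited 2024-09-25 06:30:49

Taking this a step further, I created a Cython extension to evaluate the rules, and it is now very fast. I can handle about 70 requests per second with roughly 3000 regex rules.

In regex.pyx: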

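import re2
import string as pystring

# Returns [scope, term] for the first matching term in each scope.
cpdef list match_rules(char *pytext, dict compilled_rules):
    cdef int ok, scope, term
    cdef list response = []
    text = pystring.lower(pytext)  # Python 2 string API
    for scope, rules in compilled_rules.iteritems():
        ok = 1  # stays 1 until a term in this scope has matched
        for term, rule in rules.iteritems():
            if ok == 1:
                if re2.search(rule, text):
                    ok = 0  # skip the remaining terms of this scope
                    response.append([scope, term])
    return response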

Python code:

^{pr2}$
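For readers who want to try the idea without building the extension, here is a minimal pure-Python sketch of the same matching loop. It is not the answerer's calling code: the rule table, the term ids (10, 20), and the name match_rules_py are hypothetical, and the standard re module stands in for re2 so the sketch runs without extra packages.

```python
import re  # stand-in for re2 in this sketch

# Hypothetical rule table: scope id -> {term id -> pattern source}.
RAW_RULES = {
    1: {10: r"^(?!.*ipiranga)(?=.*petrobras)"},
    2: {20: r"^(?!.*brasil)(?=.*petrobras)"},
}

# Compile every pattern once, up front, instead of on each request.
COMPILED_RULES = {
    scope: {term: re.compile(src) for term, src in terms.items()}
    for scope, terms in RAW_RULES.items()
}


def match_rules_py(text, compiled_rules):
    """Return [scope, term] for the first matching term in each scope."""
    text = text.lower()
    response = []
    for scope, terms in compiled_rules.items():
        for term, rule in terms.items():
            if rule.search(text):
                response.append([scope, term])
                break  # same effect as the ok flag in the Cython version
    return response


result = match_rules_py(
    "Acabei de ir no posto petrobras. Moro pertinho do posto brasil",
    COMPILED_RULES,
)
print(result)
```

The nested dictionary mirrors what match_rules iterates over: an outer mapping of scopes and an inner mapping of terms to compiled patterns.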

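User

Answer 2 · edited 2024-09-25 06:30:49

Your rules appear to be the culprit here: because of the two .* separated by lookaheads, a very large number of permutations has to be checked before a match succeeds (or can be ruled out). Using re.search() without anchors makes this even worse. Also, the alternation that includes the posto part is superfluous: the regex matches whether or not posto appears in the string, so you might as well drop it entirely.

For example, your first rule can be rewritten as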

^(?!.*ipiranga)(?=.*petrobras)

without any change in results. If you are looking for exact words, you can optimize it further with word boundaries:

^{pr2}$

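Some measurements (using RegexBuddy):

Applied to the string Acabei de ir no posto petrobras. Moro pertinho do posto brasil, your first regex takes the regex engine about 4700 steps to find a match. If I remove the s from petrobras, it takes well over 100000 steps to determine that there is no match.

Mine matches in 230 steps (and fails in 260), so you gain a 20-400x speedup just by constructing the regex properly.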
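The comparison can be reproduced roughly with the standard re module. This is only a sketch: RegexBuddy's step counts are replaced by wall-clock timing, the helper names agree and time_pattern are made up for illustration, and absolute numbers will vary by machine and engine.

```python
import re
import timeit

TEXT = "acabei de ir no posto petrobras. moro pertinho do posto brasil"
NO_MATCH = TEXT.replace("petrobras", "petrobra")  # the "drop the s" case

ORIGINAL = re.compile(
    r"(?!.*ipiranga).*((?=.*posto)(?=.*petrobras).*|(?=.*petrobras)).*"
)
REWRITTEN = re.compile(r"^(?!.*ipiranga)(?=.*petrobras)")

SAMPLES = [TEXT, NO_MATCH, "petrobras", "ipiranga", ""]


def agree(samples):
    """True if both patterns give the same yes/no answer on every sample."""
    return all(
        bool(ORIGINAL.search(s)) == bool(REWRITTEN.search(s)) for s in samples
    )


def time_pattern(pattern, text, number=50):
    """Seconds taken by `number` searches of `pattern` over `text`."""
    return timeit.timeit(lambda: pattern.search(text), number=number)


# Ordering case: the excluded word comes before the required one.
ORDER_CASE = "posto ipiranga e posto petrobras"
original_hits_order_case = bool(ORIGINAL.search(ORDER_CASE))
rewritten_hits_order_case = bool(REWRITTEN.search(ORDER_CASE))

if __name__ == "__main__":
    print("same answers on samples:", agree(SAMPLES))
    for label, text in (("match", TEXT), ("no match", NO_MATCH)):
        print(label, "original :", time_pattern(ORIGINAL, text))
        print(label, "rewritten:", time_pattern(REWRITTEN, text))
```

One caveat worth knowing: with an unanchored search, the original pattern can still match when ipiranga appears before petrobras, because the search may begin after the excluded word. The anchored version rejects that string, which is presumably the intended behaviour.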

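User

Answer 3 · edited 2024-09-25 06:30:49

Besides optimizing the regex patterns themselves (which will make a huge difference), you could also try Google's RE2. It is supposed to be faster than Python's standard regex module.

It is written in C++, but there is a Python wrapper for RE2 from Facebook :)

Also, thanks to your question, I found a great read on regex matching!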
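One point to keep in mind when trying this: the RE2 engine itself does not support lookahead assertions, so rules written like the ones in the question cannot run on it natively. A possible workaround, sketched below under assumptions, is to split each rule into a pattern that must match and a pattern that must not match, and combine the two in Python. The RULES layout and the name matching_rule_ids are hypothetical, and the standard re module is used as the engine so the sketch runs anywhere. An RE2 wrapper with a compatible compile/search interface could be swapped in.

```python
import re as engine  # assumption: replace with an RE2 wrapper if one is installed

# Hypothetical layout: rule id -> (pattern that must match, pattern that must not).
RULES = {
    1: (engine.compile(r"petrobras"), engine.compile(r"ipiranga")),
    2: (engine.compile(r"petrobras"), engine.compile(r"brasil")),
}


def matching_rule_ids(text, rules=RULES):
    """Return ids of rules whose required pattern hits and forbidden one does not."""
    text = text.lower()
    hits = []
    for rule_id, (required, forbidden) in rules.items():
        if required.search(text) and not forbidden.search(text):
            hits.append(rule_id)
    return hits


hits = matching_rule_ids(
    "Acabei de ir no posto petrobras. Moro pertinho do posto brasil"
)
print(hits)
```

Neither pattern needs a lookahead or a leading .*, so each search is a single linear scan of the text.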
