How to extract multiple strings using regex

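User

#1 · edited 2024-06-28 20:37:15

This looks like some kind of DSL, a domain-specific language, so you could write a small parser for it. Here we use a parser library called parsimonious.

You need a small grammar and a NodeVisitor class: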

from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor

path = "(Canberra)-[:capital_of {}]->(Australia)"

class PathVisitor(NodeVisitor):
    grammar = Grammar(
        r"""
        path    = (pair junk?)+
        pair    = lpar notpar rpar

        lpar    = ~"[(\[]+"
        rpar    = ~"[)\]]+"

        notpar  = ~"[^][()]+"
        junk    = ~"[-:>]+"
        """
    )

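    def generic_visit(self, node, visited_children):
        # Fall back to the raw node when a rule has no visited children.
        return visited_children or node

    def visit_pair(self, node, visited_children):
        # A pair is: opening bracket, content, closing bracket; keep the content.
        _, value, _ = visited_children
        return value.text

    def visit_path(self, node, visited_children):
        # Each child is a [pair, optional junk] sequence; keep only the pair.
        return [child[0] for child in visited_children]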

pv = PathVisitor()
output = pv.parse(path)
print(output)

This yields

['Canberra', ':capital_of {}', 'Australia']

User

#2 · edited 2024-06-28 20:37:15

You're almost there. You need to define the groups you want to capture by wrapping each one in ().

The code looks like

import re
path = "(Canberra)-[:capital_of {}]->(Australia)"
pattern = r'\((.*)\)\-\[(:.*)\]\-\>\((.*)\)'
print(re.match(pattern,path).groups())

The output will be

('Canberra', ':capital_of {}', 'Australia')
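
The greedy `.*` groups work for this input, but they could over-match on strings that contain additional brackets. Below is a minimal sketch of a stricter variant, assuming node names never contain parentheses and relations never contain square brackets. The name `extract_parts` is made up for illustration and is not part of the original answer.

```python
import re

# Assumption: nodes contain no parentheses, relations contain no square brackets.
PATTERN = re.compile(r'\(([^()]*)\)-\[([^\[\]]*)\]->\(([^()]*)\)')

def extract_parts(path):
    """Return (node_a, relation, node_b), or None if the path does not match."""
    match = PATTERN.fullmatch(path)
    return match.groups() if match else None

print(extract_parts("(Canberra)-[:capital_of {}]->(Australia)"))
# ('Canberra', ':capital_of {}', 'Australia')
print(extract_parts("not a path"))
# None
```

Returning None instead of calling `.groups()` directly avoids an AttributeError when the input does not have the expected shape.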

User

#3 · edited 2024-06-28 20:37:15

If you don't need to use regex, you can use

s="(Canberra)-[:capital_of {}]->(Australia)"
entityA = s[1:].split(')-')[0]
entityB = s.split('->(')[-1][:-1]

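The input string (minus its leading parenthesis) is split on occurrences of the ')-' substring, and the first part is taken as the first entity.

A second split() is done on the '->(' substring; the last piece, with its trailing ')' sliced off, is the second entity.

So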

print(f'EntityA: {entityA}')
print(f'EntityB: {entityB}')

gives

EntityA: Canberra
EntityB: Australia
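
To make the two splits easier to follow, here is a small sketch that prints the intermediate lists. The variable names `left_parts` and `right_parts` are only illustrative, and the code assumes the input is well-formed (no validation is done).

```python
s = "(Canberra)-[:capital_of {}]->(Australia)"

# Drop the leading "(" and split on ")-": the first piece is the first entity.
left_parts = s[1:].split(')-')
print(left_parts)   # ['Canberra', '[:capital_of {}]->(Australia)']

# Split on "->(": the last piece is the second entity plus a trailing ")".
right_parts = s.split('->(')
print(right_parts)  # ['(Canberra)-[:capital_of {}]', 'Australia)']

entity_a = left_parts[0]
entity_b = right_parts[-1][:-1]  # slice off the trailing ")"
print(entity_a, entity_b)        # Canberra Australia
```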

Non-regex solutions are often faster for simple, fixed-format strings like this one.

Edit: timing added as requested in the comments.

import re

s="(Canberra)-[:capital_of {}]->(Australia)"
def regex_soln(s):
    pattern = r'\((.*)\)\-\[(:.*)\]\-\>\((.*)\)'
    rv = re.match(pattern,s).groups()
    return rv[0], rv[-1]

def non_regex_soln(s):
    return s[1:].split(')-')[0], s.split('->(')[-1][:-1]

%timeit regex_soln(s)
1.47 µs ± 60.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


%timeit non_regex_soln(s)
619 ns ± 30.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
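
`%timeit` only works inside IPython or Jupyter. As a rough sketch, the same comparison can be run as a plain script with the standard `timeit` module. The loop count `n` is an arbitrary choice, and the numbers will differ from those above depending on machine and Python version.

```python
import re
import timeit

s = "(Canberra)-[:capital_of {}]->(Australia)"

def regex_soln(s):
    pattern = r'\((.*)\)\-\[(:.*)\]\-\>\((.*)\)'
    rv = re.match(pattern, s).groups()
    return rv[0], rv[-1]

def non_regex_soln(s):
    return s[1:].split(')-')[0], s.split('->(')[-1][:-1]

# Both functions should agree before their speed is compared.
assert regex_soln(s) == non_regex_soln(s)

n = 100_000  # assumed loop count, chosen only to keep the run short
for fn in (regex_soln, non_regex_soln):
    seconds = timeit.timeit(lambda: fn(s), number=n)
    print(f"{fn.__name__}: {seconds / n * 1e9:.0f} ns per loop")
```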
