Using pyparsing, how do I group expressions matched by one or more (expr1 | expr2)?

Posted 2024-09-30 16:35:00


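My website lets users post a string containing several questions, each followed by multiple-choice answers. An enforced style guide allows regular expressions to parse the result, and the questions and MCQ options are then stored in a database to be returned later in randomized practice exams.

I want to switch to pyparsing because the regexes are not immediately readable and I feel somewhat locked in by them. I would like to be able to extend the question parser easily, and doing that with regex feels very cumbersome.

User input has the form: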

quiz = [<question-answer>, <q-start>]
<question-answer> = <question> + <answer>
<question> = [<q-text>, \n] ?!= <a-start>
<answer> = [<answer>, <a-start>]  ?!= <q-start>
<q-start> = <nums> + "." | ")"
<a-start> = <alphas> + "." | ")" 

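The long user-input string is split into question-answer blocks, each delimited by the q-start of the next question-answer block. A question is all the text between a q-start and the first a-start. The answers are a list of all the text between an a-start and either the next a-start or the following q-start.

Example text: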

3. A lesion that affects N. Solitarius will result in the patient having problems related to:
a. taste and blood pressure regulation
c. swallowing and respiration
b. smell and taste
d. voice quality and taste
e. whistling and chewing

4. A patient comes to your office complaining of weakness on the right side of their body. You notice that their head is
turned slightly to the left and their right shoulder droops. When asked to protrude their tongue, it deviates to the right. Eye
movements and eye-related reflexes appear to be normal. The lesion most likely is located in the:
c. left ventral medulla
a. left ventral midbrain
b. right dorsal medulla
d. left ventral pons
e. right ventral pons

5. A colleague {...}

The regexes I have been using:

# matches a question-answer block. Matching q-start until an empty line.
regex1 = r"(^[\t ]*[0-9]+[\)\.][\t ]+[\s\S]*?(?=^[\n\r]))" 

# Within question-answer block, matches everything that does not start with a-start
regex6 = r"(^(?!(^[a-fA-F][\)\.]\s+[\s\S]+)).*)"

# Matches all text between a-start and the following a-start, or until the question-answer substring block ends.
regex5 = r"(^[a-fA-F][\)\.]\s+[\s\S]+)"       

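A bit of Python with re then strips the question numbers and MCQ letters, joins questions that are broken across lines, and appends the MCQ options to a list.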
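
The post-processing described above can be sketched with the standard library alone. This is a minimal illustration, not the asker's actual code: it uses simplified line-anchored patterns instead of regex1, regex5 and regex6, the function name parse_quiz and the dictionary layout are hypothetical, and the sample text is abridged from the example above.

```python
import re

# Simplified, line-anchored versions of q-start and a-start (assumptions,
# not the regexes from the post).
Q_START = re.compile(r"^[ \t]*[0-9]+[.)][ \t]+")    # e.g. "3. " or "4) "
A_START = re.compile(r"^[ \t]*[a-fA-F][.)][ \t]+")  # e.g. "a. " or "c) "


def parse_quiz(text):
    """Return a list of {"question": str, "answers": [str, ...]} dicts."""
    quiz = []
    current = None  # which field a continuation line belongs to
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines only separate blocks
        if Q_START.match(line):
            # New question: strip the number and start a new block.
            quiz.append({"question": Q_START.sub("", line).strip(), "answers": []})
            current = "question"
        elif A_START.match(line) and quiz:
            # New answer: strip the letter and attach it to the latest question.
            quiz[-1]["answers"].append(A_START.sub("", line).strip())
            current = "answers"
        elif current == "question":
            # Continuation of a question broken across lines.
            quiz[-1]["question"] += " " + line.strip()
        elif current == "answers":
            # Continuation of an answer broken across lines.
            quiz[-1]["answers"][-1] += " " + line.strip()
    return quiz


sample = """3. A lesion that affects N. Solitarius will result in problems related to:
a. taste and blood pressure regulation
c. swallowing and respiration

4. A patient comes to your office. You notice that their head is
turned slightly to the left. The lesion most likely is located in the:
c. left ventral medulla
a. left ventral midbrain
"""

for item in parse_quiz(sample):
    print(item)
```

Because each answer is appended to the most recent question, the question-answer association is kept, which is the same property the pyparsing version needs to preserve.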

In pyparsing, I tried the following:

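from pyparsing import (Char, Group, LineEnd, LineStart, OneOrMore, Optional,
                       SkipTo, StringEnd, Suppress, Word, alphas, nums, oneOf)

EOL = Suppress(LineEnd())
delim = oneOf(". )")
q_start = LineStart() + Word(nums) + delim    # e.g. "3." or "3)"
a_start = LineStart() + Char(alphas) + delim  # e.g. "a." or "a)"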

question = Optional(EOL) + Group(Suppress(q_start) + OneOrMore(SkipTo(LineEnd()) + EOL, stopOn=a_start)).setResultsName('question', listAllMatches=True)

answer = Optional(EOL) + Group(Suppress(a_start) + OneOrMore( SkipTo(LineEnd()) + EOL, stopOn=(a_start | q_start | StringEnd()))).setResultsName('answer', listAllMatches=True)



# test holds the user-input string
qi = Group(OneOrMore(question|answer)).setResultsName('group', listAllMatches=True)
t = qi.parseString(test)
print(t.dump())

Result:

[[['The tectum of the midbrain comprises the:'], ['superior and inferior colliculi'], ['reticular formation'], ['internal arcuate fibers'], ['cerebellar peduncles'], ['pyramids'], ['Damage to the dorsal columns on one side of the spinal cord would results in:'], ['loss of MVP ipsilaterally below the level of the lesion'], ['hypertonicity of the contralateral limbs'], ['loss of pain and temperature contralaterally below the level of the lesion'], ['loss of MVP contralaterally above the level of the lesion'], ['loss of pain and temperature ipsilaterally above the level of the lesion']]]
- group: [[['The tectum of the midbrain comprises the:'], ['superior and inferior colliculi'], ['reticular formation'], ['internal arcuate fibers'], ['cerebellar peduncles'], ['pyramids'], ['Damage to the dorsal columns on one side of the spinal cord would results in:'], ['loss of MVP ipsilaterally below the level of the lesion'], ['hypertonicity of the contralateral limbs'], ['loss of pain and temperature contralaterally below the level of the lesion'], ['loss of MVP contralaterally above the level of the lesion'], ['loss of pain and temperature ipsilaterally above the level of the lesion']]]
  [0]:
    [['The tectum of the midbrain comprises the:'], ['superior and inferior colliculi'], ['reticular formation'], ['internal arcuate fibers'], ['cerebellar peduncles'], ['pyramids'], ['Damage to the dorsal columns on one side of the spinal cord would results in:'], ['loss of MVP ipsilaterally below the level of the lesion'], ['hypertonicity of the contralateral limbs'], ['loss of pain and temperature contralaterally below the level of the lesion'], ['loss of MVP contralaterally above the level of the lesion'], ['loss of pain and temperature ipsilaterally above the level of the lesion']]
    - answer: [['superior and inferior colliculi'], ['reticular formation'], ['internal arcuate fibers'], ['cerebellar peduncles'], ['pyramids'], ['loss of MVP ipsilaterally below the level of the lesion'], ['hypertonicity of the contralateral limbs'], ['loss of pain and temperature contralaterally below the level of the lesion'], ['loss of MVP contralaterally above the level of the lesion'], ['loss of pain and temperature ipsilaterally above the level of the lesion']]
      [0]:
        ['superior and inferior colliculi']
      [1]:
        ['reticular formation']
      [2]:
        ['internal arcuate fibers']
      [3]:
        ['cerebellar peduncles']
      [4]:
        ['pyramids']
      [5]:
        ['loss of MVP ipsilaterally below the level of the lesion']
      [6]:
        ['hypertonicity of the contralateral limbs']
      [7]:
        ['loss of pain and temperature contralaterally below the level of the lesion']
      [8]:
        ['loss of MVP contralaterally above the level of the lesion']
      [9]:
        ['loss of pain and temperature ipsilaterally above the level of the lesion']
    - question: [['The tectum of the midbrain comprises the:'], ['Damage to the dorsal columns on one side of the spinal cord would results in:']]
      [0]:
        ['The tectum of the midbrain comprises the:']
      [1]:
        ['Damage to the dorsal columns on one side of the spinal cord would results in:']

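This matches the questions and answers and correctly handles newlines that may break up a question or an answer. My problem is that they are not grouped the way I expected. I was expecting something like group[0] = question, answers[1:4] and group[2] = question, answers[1:4].

Does anyone have any suggestions?

Thanks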


1 answer
Answer 1 · posted 2024-09-30 16:35:00

I think you were on the right track. I took a separate pass at your parser and came up with a very similar structure, with just a few differences.

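from pyparsing import Combine  # plus the names already used in the question

# Assumes EOL, q_start, a_start and test as defined in the question.
question = Combine(q_start.suppress() + SkipTo(EOL + a_start))
answer = Combine(a_start.suppress() + SkipTo(EOL + (a_start | q_start | StringEnd())))
# One group per question, holding the question and its one or more answers.
q_a = Group(question("question") + answer[1, ...]("answers"))

# q_a[...] is shorthand for ZeroOrMore(q_a).
for t in q_a[...].parseString(test):
    print(t.dump())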

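The biggest difference is that the expression I use to parse the text is not just OneOrMore(question | answer); instead I define Group(question + OneOrMore(answer)). This creates one group for each question together with its associated answers. In your parser, listAllMatches creates only one results name for all the questions and another for all the answers, so every association between them is lost. Creating a group of "question + one or more answers" preserves those associations.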

If you want to remove the '\n' characters, a parse action will do that more easily than the EOL handling.
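
One way such a parse action might look is sketched below. The name join_lines is hypothetical and is not part of the answer. It relies on Combine producing a single string token, and on pyparsing accepting a parse action that takes only the token list and returns a replacement value.

```python
def join_lines(tokens):
    """Parse action: collapse newlines and runs of whitespace in the first token."""
    return " ".join(tokens[0].split())


# Attaching it (assumes the question and answer expressions from the answer).
# Do this before q_a is built, because question("question") makes a copy of
# the expression, and that copy only carries parse actions set earlier:
#   question.setParseAction(join_lines)
#   answer.setParseAction(join_lines)

# The function can be checked on its own with a plain list standing in for
# the parsed tokens.
print(join_lines([" A patient comes to your office\nturned slightly to the left\n"]))
```

Splitting on whitespace and re-joining with single spaces also removes the leading space that remains after the suppressed question number or answer letter.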
