Regex to extract names starting with MR | MS | DR

Posted 2024-10-06 09:46:48

I am trying to write a regex that identifies names starting with MR | MS | THE | DR.

For example:

      HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
 1    VIKRAM NATH,HONOURABLE MR. JUSTICE             1     1      0     3       5
      J.B.PARDIWALA
      HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
 2    VIKRAM NATH,HONOURABLE MR. JUSTICE VIPUL M.    0     1      0     0       1
      PANCHOLI
      HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
 3    VIKRAM NATH,HONOURABLE MR. JUSTICE ASHUTOSH   107    4     10     6      127
      J. SHASTRI

So the output should be

[THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH, MR. JUSTICE J.B.PARDIWALA]
[THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH, MR. JUSTICE VIPUL M. PANCHOLI]
and so on

But I am getting

THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH 
MR. JUSTICE             1     1      0     3       5
      J.B.PARDIWALA

I tried \s*HONOURABLE\s+(?=THE|MR|MS|DR)([^/\[\]\n]*)

HONOURABLE MR can be repeated any number of times.

Any help would be greatly appreciated.

Thanks in advance.


1 answer

Answer 1 · posted 2024-10-06 09:46:48

You can use

import re
text = """     HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
 1    VIKRAM NATH,HONOURABLE MR. JUSTICE             1     1      0     3       5
      J.B.PARDIWALA
      HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
 2    VIKRAM NATH,HONOURABLE MR. JUSTICE VIPUL M.    0     1      0     0       1
      PANCHOLI
      HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
 3    VIKRAM NATH,HONOURABLE MR. JUSTICE ASHUTOSH   107    4     10     6      127
      J. SHASTRI"""
# Strip digits, spaces and tabs from both ends of every line
text = re.sub(r'^[\d \t]+|[\d \t]+$', '', text, flags=re.M)
# print(text)  # uncomment to inspect the trimmed text
# One match per line that starts with HONOURABLE, plus its continuation lines
m = re.findall(r'^HONOURABLE\s+(.*(?:\n(?!HONOURABLE\b).*)*)', text, re.M)
for x in m:
    # Join the wrapped lines of each record with a space
    print(x.replace('\n',' '))

Output:

THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH,HONOURABLE MR. JUSTICE J.B.PARDIWALA
THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH,HONOURABLE MR. JUSTICE VIPUL M. PANCHOLI
THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH,HONOURABLE MR. JUSTICE ASHUTOSH J. SHASTRI

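Details

  • re.sub(r'^[\d \t]+|[\d \t]+$', '', text, flags=re.M) removes all spaces, tabs and digits at the start and end of every line in the text

  • r'^HONOURABLE\s+(.*(?:\n(?!HONOURABLE\b).*)*)' is a regex that matches the following in the "trimmed" text:

  • ^ - start of a line

  • HONOURABLE - the word HONOURABLE

  • \s+ - one or more whitespace characters

  • (.*(?:\n(?!HONOURABLE\b).*)*) - capturing group 1:

    • .* - the rest of the line
    • (?:\n(?!HONOURABLE\b).*)* - zero or more lines that do not start with HONOURABLE as a whole word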
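
Each captured record still joins the judges of a row with ",HONOURABLE", while the question asks for one list of names per row. A minimal sketch of one way to get there, assuming names within a row are always separated by a comma followed by HONOURABLE; the helper name names_per_row is hypothetical and not part of the original answer:

```python
import re

def names_per_row(text):
    """Return one list of names per table row (hypothetical helper)."""
    # Same trimming step as in the answer: strip digits, spaces and tabs
    # from both ends of every line.
    trimmed = re.sub(r'^[\d \t]+|[\d \t]+$', '', text, flags=re.M)
    # Same record pattern as in the answer: one record per leading HONOURABLE.
    records = re.findall(r'^HONOURABLE\s+(.*(?:\n(?!HONOURABLE\b).*)*)',
                         trimmed, re.M)
    rows = []
    for record in records:
        flat = record.replace('\n', ' ')
        # Assumption: names are separated by a comma followed by HONOURABLE.
        rows.append([name.strip()
                     for name in re.split(r',\s*HONOURABLE\s+', flat)])
    return rows

sample = """      HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
 1    VIKRAM NATH,HONOURABLE MR. JUSTICE             1     1      0     3       5
      J.B.PARDIWALA"""
print(names_per_row(sample))
# [['THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH', 'MR. JUSTICE J.B.PARDIWALA']]
```

If a second name in a row happened to start on its own line with HONOURABLE, this sketch would treat it as a separate record, so it only fits layouts like the one in the question.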

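Original answer

You can use

\bHONOURABLE\s+((?:THE|MR|MS|DR)[^,]*)

If you do not want line breaks in the resulting list items, you can replace them afterwards with .replace('\n', ' '). If you want to stop the match on the right at [, ] or /, add those characters to the negated character class, changing [^,] to [^][/,].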

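Details

  • \bHONOURABLE - the whole word HONOURABLE
  • \s+ - one or more whitespace characters
  • ((?:THE|MR|MS|DR)[^,]*) - capturing group 1: THE, MR, MS or DR followed by zero or more characters other than a comma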
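
The Python demo for this pattern uses a slightly stricter version with \b after the alternation. A small sketch of what that word boundary changes, run on a made-up input string (the names are purely illustrative and do not come from the question's data):

```python
import re

# Hypothetical input: "THERESA" begins with the letters THE but is not a title.
text = "HONOURABLE THERESA SMITH, HONOURABLE DR. JOHN DOE"

# Without \b, the alternative THE matches the first three letters of THERESA.
without_boundary = re.findall(r"\bHONOURABLE\s+((?:THE|MR|MS|DR)[^,]*)", text)
# With \b, THE|MR|MS|DR must end at a word boundary, so THERESA is skipped.
with_boundary = re.findall(r"\bHONOURABLE\s+((?:THE|MR|MS|DR)\b[^,]*)", text)

print(without_boundary)  # ['THERESA SMITH', 'DR. JOHN DOE']
print(with_boundary)     # ['DR. JOHN DOE']
```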

A Python demo:

import re
# \b after the alternation keeps THE|MR|MS|DR from matching the start of a longer word
rx = r"\bHONOURABLE\s+((?:THE|MR|MS|DR)\b[^,]*)"
text = "HONOURABLE THE CHIEF JUSTICE MR. JUSTICE\nVIKRAM NATH,HONOURABLE MR. JUSTICE ASHUTOSH\nJ. SHASTRI, HONOURABLE MS. ADITI GUPTA"
m = re.findall(rx, text)
# Replace line breaks with a space so that wrapped names stay separated
print([x.replace('\n', ' ') for x in m])

Output:

['THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH', 'MR. JUSTICE ASHUTOSH J. SHASTRI', 'MS. ADITI GUPTA']
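
The note about the negated character class can be tried out in the same way. This is only a sketch on a made-up input in which a name is followed by a bracketed remark; the input string is an assumption for illustration:

```python
import re

# Hypothetical input: the first name is followed by a bracketed remark.
text = "HONOURABLE MR. JUSTICE JOHN DOE [RETD], HONOURABLE MS. ADITI GUPTA"

# [^,] stops only at a comma, so the bracketed remark is captured too.
comma_only = re.findall(r"\bHONOURABLE\s+((?:THE|MR|MS|DR)\b[^,]*)", text)
# [^][/,] also stops at ']', '[' and '/'.
brackets_too = re.findall(r"\bHONOURABLE\s+((?:THE|MR|MS|DR)\b[^][/,]*)", text)

print(comma_only)
# ['MR. JUSTICE JOHN DOE [RETD]', 'MS. ADITI GUPTA']
print([x.strip() for x in brackets_too])
# ['MR. JUSTICE JOHN DOE', 'MS. ADITI GUPTA']
```

The strip() call is there because the shorter match ends with the space that precedes the opening bracket.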
