Regex optional match failing in Python

Taking some text and cleaning up/converting parts of it

filename = (r'.\4-12_4-26.txt')
import re
import sys

#Clean up output from the web to ensure that you have one category per line
f = open(filename)
w = open('cleantext.txt','w')

origdatepat = (r'(Ticket Date: )([0-9]+/[0-9]+/[0-9]+),( [0-9]+:[0-9]+ [PA]M)')
tickettypepat = (r'MIS Notes:.*(//[pewPEW]//)?.*')

print('Beginning Blank Line Removal')
for line in f:
    redate = re.search(origdatepat,line)
    retype = re.search(tickettypepat,line)
    if line == ' \n':
        line = ''
        print('Removing blank Line')
    #remove ',' from time and date line
    elif redate:
        line = redate.group(1) + redate.group(2)+ redate.group(3)+'\n'
        print('Redating... ' + line)
    elif retype:
        print(retype.group(0))
        print(retype.group(1))
        if retype.group(1) == '//p//':
            line = line + 'Type: Phone\n'
            print('Setting type for... ' + line)
        elif retype.group(1) == '//e//':
            line = line + 'Type: Email\n'
            print('Setting type for... ' + line)
        elif retype.group(1) == '//w//':
            line = line + 'Type: Walk-in\n'
            print('Setting type for... ' + line)
        elif retype.group(1) == ('' or None):
            line = line + 'Type: Ticket\n'
            print('Setting type for... ' + line)
    w.write(line)
print('Closing Files')
f.close()
w.close()

Here is some sample input.

Ticket No.: 20100426132
Ticket Date: 04/26/10, 10:22 AM
Close Date:
Primary User: XXX
Branch: XXX
Help Tech: XXX
Status: Pending
Priority: Medium
Application: xxx
Description: some issue
Resolution: some resolution
MIS Notes: some random stuff //p// followed by more stuff
Key Words:

Ticket No.: 20100426132
Ticket Date: 04/26/10, 10:22 AM
Close Date:
Primary User: XXX
Branch: XXX
Help Tech: XXX
Status: Pending
Priority: Medium
Application: xxx
Description: some issue
Resolution: some resolution
MIS Notes: //p//
Key Words:

Ticket No.: 20100426132
Ticket Date: 04/26/10, 10:22 AM
Close Date:
Primary User: XXX
Branch: XXX
Help Tech: XXX
Status: Pending
Priority: Medium
Application: xxx
Description: some issue
Resolution: some resolution
MIS Notes: //e// stuff....
Key Words:

Ticket No.: 20100426132
Ticket Date: 04/26/10, 10:22 AM
Close Date:
Primary User: XXX
Branch: XXX
Help Tech: XXX
Status: Pending
Priority: Medium
Application: xxx
Description: some issue
Resolution: some resolution
MIS Notes:
Key Words:

3 Answers

User

Answer 1 · edited 2024-10-01 17:31:47

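The pattern is ambiguous for your purposes. It is better to group the parts by prefix or suffix. In the example here I have chosen prefix grouping. Basically, if //p// occurs in the line, the prefix group is set (it may be an empty string, but it is not None). The suffix is everything after the //p// item, or everything in the line if //p// does not occur.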

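import re
lines = ['MIS Notes: //p//',
    'MIS Notes: prefix//p//suffix']

# Group 1 is the prefix before //p// (only set when //p// is present);
# group 2 is the suffix, or the whole remainder when //p// is absent.
tickettypepat = (r'MIS Notes: (?:(.*)//p//)?(.*)')
for line in lines:
    m = re.search(tickettypepat,line)
    print('line:', line)
    if m:
        print('groups:', m.groups())
    else:
        print('groups:', m)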

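Result:

line: MIS Notes: //p//
groups: ('', '')
line: MIS Notes: prefix//p//suffix
groups: ('prefix', 'suffix')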
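The same prefix-grouping idea can probably be stretched to cover all three markers from the question. This is only a minimal sketch: the marker-to-type mapping is taken from the asker's script, while ticket_type, TYPE_BY_MARKER and PATTERN are hypothetical names that do not appear in the original answer. The space after "MIS Notes:" is dropped here on the assumption that an empty notes line may have no trailing space.

```python
import re

# Hypothetical names; the mapping itself mirrors the question's script.
TYPE_BY_MARKER = {'p': 'Phone', 'e': 'Email', 'w': 'Walk-in'}

# Group 1: prefix before the marker, group 2: the marker letter,
# group 3: suffix (or the whole remainder when no marker is present).
PATTERN = re.compile(r'MIS Notes:(?:(.*)//([pewPEW])//)?(.*)')

def ticket_type(line):
    """Return the ticket type for a 'MIS Notes:' line, or None for other lines."""
    m = PATTERN.match(line)
    if m is None:
        return None               # not a MIS Notes line at all
    marker = m.group(2)
    if marker is None:
        return 'Ticket'           # notes line without any //x// marker
    return TYPE_BY_MARKER[marker.lower()]

print(ticket_type('MIS Notes: some random stuff //p// followed by more stuff'))  # Phone
print(ticket_type('MIS Notes: //e// stuff....'))                                 # Email
print(ticket_type('MIS Notes:'))                                                 # Ticket
```

Because the optional group now wraps the leading .* as well, the engine has to try the marker before it is allowed to give up on the group.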

User

Answer 2 · edited 2024-10-01 17:31:47

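Regexes are greedy, which means .* matches as much as it can, here the entire rest of the string. So there is nothing left for the optional group to match. group(0) is always the whole matched string.

Judging from your comment, why do you need a regex at all? Isn't something like this enough: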

if line.startswith('MIS Notes:'): # starts with that string
    data = line[len('MIS Notes:'):] # the rest is the interesting part
    if '//p//' in data:
        stuff, sep, rest = data.partition('//p//') # or something like that
    else:
        pass # other stuff
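The fragment above assumes a line variable already exists. A self-contained version might look like the following sketch; split_notes and PREFIX are hypothetical names chosen for illustration, and the marker defaults to //p// only because that is what the answer uses.

```python
PREFIX = 'MIS Notes:'

def split_notes(line, marker='//p//'):
    """Split a MIS Notes line around marker without using a regex.

    Returns (before, found, after), or None if the line is not a notes line.
    """
    if not line.startswith(PREFIX):
        return None
    data = line[len(PREFIX):]          # everything after the field name
    # str.partition returns (data, '', '') when the marker is absent,
    # so an empty separator tells us the marker was not found.
    before, sep, after = data.partition(marker)
    return before, bool(sep), after

print(split_notes('MIS Notes: a //p// b'))   # (' a ', True, ' b')
print(split_notes('MIS Notes: nothing'))     # (' nothing', False, '')
```

Since partition already reports a missing separator, the separate in test from the fragment is not strictly needed.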

User

Answer 3 · edited 2024-10-01 17:31:47

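MIS Notes:.*(//p//)?.* works like this, taking "MIS Notes: //p//" as the target:

1. MIS Notes: matches "MIS Notes:", no surprises here.
2. .* immediately runs to the end of the string (matching "MIS Notes: //p//" so far).
3. (//p//)? is optional. Nothing happens.
4. .* has nothing left to match; we are already at the end of the string. Since the star allows zero matches of the preceding atom, the regex engine stops and reports the entire string as the match, with the sub-group left unset (in Python, group(1) is None).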
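The walkthrough above can be checked directly with Python's re module. This is just a small sketch on the same target string; the variable names are made up for illustration.

```python
import re

m = re.search(r'MIS Notes:.*(//p//)?.*', 'MIS Notes: //p//')

whole = m.group(0)      # the entire line: the first .* swallowed everything
captured = m.group(1)   # None: the optional group never took part in the match
span = m.span(1)        # (-1, -1) is how re reports a group that did not match

print(repr(whole), captured, span)
```

The match succeeds, which is why the asker's elif retype: branch is entered, but the marker is never captured.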

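Now, when you change the regex to MIS Notes:.*(//p//).*, the behavior changes:

1. MIS Notes: matches "MIS Notes:", still no surprises here.
2. .* immediately runs to the end of the string (matching "MIS Notes: //p//" so far).
3. (//p//) is required. To satisfy it, the engine starts backtracking character by character (matching "MIS Notes: " so far).
4. (//p//) can match. Sub-group 1 is saved and contains "//p//".
5. .* runs to the end of the string. Hint: if you are not interested in what it matches, it is superfluous and you can remove it.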
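A side effect of the backtracking described above is that a greedy .* settles on the last //p// in the line when there are several. The sketch below illustrates this; the second input line is invented for the demonstration and does not come from the question.

```python
import re

pattern = re.compile(r'MIS Notes:.*(//p//).*')

single = pattern.search('MIS Notes: //p//')
print(single.group(1), single.start(1))   # //p// 11

# Made-up line with two markers: backtracking from the end of the
# string finds the LAST occurrence first.
twice_text = 'MIS Notes: a //p// b //p// c'
twice = pattern.search(twice_text)
print(twice.start(1) == twice_text.rfind('//p//'))   # True
```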

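Now, when you change the regex to MIS Notes:.*?//(p)//, the behavior changes again:

1. MIS Notes: matches "MIS Notes:", still no surprises here.
2. .*? is non-greedy and checks the following atom before it consumes another character (matching "MIS Notes: " so far).
3. //(p)// can match. Sub-group 1 is saved and contains "p".
4. Done. Note that no backtracking from the end of the string occurs, which saves time.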
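With the non-greedy quantifier the match stops at the first marker rather than the last. A short sketch, again using an invented line with two markers:

```python
import re

pattern = re.compile(r'MIS Notes:.*?//(p)//')

text = 'MIS Notes: a //p// b //p// c'   # made-up input with two markers
m = pattern.search(text)

print(m.group(1))   # p
print(m.group(0))   # MIS Notes: a //p//
```

Because nothing follows the marker in the pattern, group(0) ends right after the first //p//.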

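Now, if you know that there can be no / before the //p//, you can use MIS Notes:[^/]*//(p)//:

1. MIS Notes: matches "MIS Notes:", you get the idea.
2. [^/]* can fast-forward to the first slash (this is faster than .*?).
3. //(p)// can match. Sub-group 1 is saved and contains "p".
4. Done. Note that no backtracking occurs, which saves time. This should be faster than version #3.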
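The precondition matters: as soon as a lone slash appears before the marker, the negated-class version no longer matches, while the non-greedy version still does. The sketch below shows the difference; the line containing 04/26 is a made-up example, not taken from the question's data.

```python
import re

LAZY = re.compile(r'MIS Notes:.*?//(p)//')
NEGATED = re.compile(r'MIS Notes:[^/]*//(p)//')

clean = 'MIS Notes: some random stuff //p// followed by more stuff'
with_slash = 'MIS Notes: see 04/26 //p//'   # hypothetical line with a lone slash

clean_result = NEGATED.search(clean).group(1)   # 'p'
slash_negated = NEGATED.search(with_slash)      # None: [^/]* stops at "04/"
slash_lazy = LAZY.search(with_slash).group(1)   # 'p'

print(clean_result, slash_negated, slash_lazy)
```

So the faster pattern is only a safe choice when the notes text is known not to contain dates, paths or other slashes ahead of the marker.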
