Why doesn't this regular expression work in all cases?

@markwarner VIRGINIA - Mark Warner @senatorleahy VERMONT - Patrick Leahy NO @senatorsanders VERMONT - Bernie Sanders @orrinhatch UTAH - Orrin Hatch NO @jimdemint SOUTH CAROLINA - Jim DeMint NO @senmikelee UTAH -- Mike Lee @kaybaileyhutch TEXAS - Kay Hutchison @johncornyn TEXAS - John Cornyn @senalexander TENNESSEE - Lamar Alexander

    import re

    politicians = open('testfile.txt')
    text = politicians.read()

    # Grab the 'no' votes
    # Should be 11 entries
    regex = re.compile(r'(no\s@[\w+\d+\.]*\s\w+\s?\w+?\s?\W+\s\w+\s?\w+)', re.I)
    no = regex.findall(text)

    ## Make the list a string
    newlist = ' '.join(no)

    ## Replace the dashes in the string with a space
    deldash = re.compile(r'\s-*\s')
    a = deldash.sub(' ', newlist)

    # Delete 'NO' in the string
    delno = re.compile(r'NO\s')
    b = delno.sub('', a)

    # make the string into a list
    # problem with @jimdemint SOUTH CAROLINA Jim DeMint
    regex2 = re.compile(r'(@[\w\d\.]*\s[\w\d\.]*\s?[\w\d\.]\s?[\w\d\.]*?\s+?\w+)', re.I)
    lst1 = regex2.findall(b)

    for i in lst1:
        print(i)

2 Answers

User

Answer 1 · edited 2024-10-01 09:33:13

Because his state name has two words: South Carolina.

Changing your second regex to the following should help:

 (@[\w\d\.]*\s[\w\d\.]*\s?[\w\d\.]\s?[\w\d\.]*?\s+?\w+(?:\s\w+)?)

I added

(?:\s\w+)?

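This is an optional non-capturing group that matches a whitespace character followed by one or more word characters (letters, digits, or underscores).

http://regexr.com?31fv5 shows it correctly matching the input that has the NOs and dashes.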
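As a rough check of the difference, the sketch below runs the question's `regex2` and the extended pattern over the same string. The sample `b` is an assumption: it stands in for the question's cleaned string once the NOs and dashes are gone, and the names `original` and `patched` are only illustrative.

```python
import re

# Assumed stand-in for the question's cleaned string `b`
# (NOs and dashes already removed).
b = ("@senatorsanders VERMONT Bernie Sanders "
     "@jimdemint SOUTH CAROLINA Jim DeMint "
     "@senmikelee UTAH Mike Lee")

# The question's second regex, unchanged.
original = re.compile(
    r'(@[\w\d\.]*\s[\w\d\.]*\s?[\w\d\.]\s?[\w\d\.]*?\s+?\w+)', re.I)

# The same regex with the optional trailing word (?:\s\w+)? appended.
patched = re.compile(
    r'(@[\w\d\.]*\s[\w\d\.]*\s?[\w\d\.]\s?[\w\d\.]*?\s+?\w+(?:\s\w+)?)', re.I)

original_hits = original.findall(b)
patched_hits = patched.findall(b)

for before, after in zip(original_hits, patched_hits):
    print(repr(before), '->', repr(after))
```

With a two-word state, the original pattern runs out of word slots after the first name, so the last name is dropped; the optional group gives it one more word to consume. The next token after a complete entry starts with `@`, which `\w+` cannot match, so the optional group does not swallow the following handle.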

Edit: If you want a single master regex that captures and splits everything correctly, then after removing the NOs and dashes you can use

((@[\w]+?\s)((?:(?:[\w]+?)\s){1,2})((?:[\w]+?\s){2}))

You can play with it here: http://regexr.com?31fvk

The full match is in $1, the Twitter handle in $2, the state in $3, and the name in $4.

Each capture group works as follows:

(@[\w]+?\s)

This matches an @ sign followed by at least one word character, but as few as possible, up to a whitespace character.

((?:(?:[\w]+?)\s){1,2})

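This matches and captures one or two words, which should be the state. It works only because of the next group, which must contain exactly two words.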

((?:[\w]+?\s){2})

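This matches and captures two words, where a word is defined as as few characters as possible followed by a whitespace character.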
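The group breakdown above can be tried in Python roughly as follows. This is a sketch under assumptions: the sample string `b` is made up to resemble the cleaned input, and `master` and `records` are illustrative names. Python exposes the groups through `m.group(n)` rather than `$n`. Because every word in the pattern must be followed by whitespace, the sketch pads the input with one trailing space so the final entry can match.

```python
import re

# Assumed cleaned input (NOs and dashes already removed).
b = ("@senatorsanders VERMONT Bernie Sanders "
     "@jimdemint SOUTH CAROLINA Jim DeMint "
     "@senmikelee UTAH Mike Lee")

master = re.compile(
    r'((@[\w]+?\s)((?:(?:[\w]+?)\s){1,2})((?:[\w]+?\s){2}))')

# Each captured piece keeps its trailing whitespace, so strip it.
# The padding space is an assumption about how the input ends: without it
# the last entry has no whitespace after the final word and does not match.
records = [
    (m.group(2).strip(), m.group(3).strip(), m.group(4).strip())
    for m in master.finditer(b + ' ')
]

for handle, state, name in records:
    print(handle, '|', state, '|', name)
```

The state group is greedy and first tries two words; when that leaves too few words for the two-word name group, the engine backtracks to a one-word state. That is why the name group has to be fixed at exactly two words.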

User

Answer 2 · edited 2024-10-01 09:33:13

text=re.sub(' (NO|-+)(?= |$)','',text)

To capture everything:

re.findall(r'(@\w+) ([A-Z ]+[A-Z]) (.+?(?= @|$))',text)
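A minimal sketch of these two steps run together, assuming the whole input is one space-separated string with no newlines. The short `text` sample and the names `cleaned` and `entries` are assumptions for illustration.

```python
import re

# Assumed sample: a few entries from the question, flattened onto one line.
text = ('@senatorleahy VERMONT - Patrick Leahy NO '
        '@jimdemint SOUTH CAROLINA - Jim DeMint NO '
        '@senmikelee UTAH -- Mike Lee '
        '@johncornyn TEXAS - John Cornyn')

# Step 1: drop every standalone " NO" and every run of dashes.
cleaned = re.sub(r' (NO|-+)(?= |$)', '', text)

# Step 2: handle, all-caps state (one or more words), then the name,
# which runs lazily up to the next " @" or the end of the string.
entries = re.findall(r'(@\w+) ([A-Z ]+[A-Z]) (.+?(?= @|$))', cleaned)

for handle, state, name in entries:
    print(handle, '|', state, '|', name)
```

The state group relies on states being written entirely in upper case: it backtracks until the character after it is a space, so it stops before the mixed-case name and does not need to know how many words the state has.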

Or do both at once:

re.findall(r'(@\w+) ([A-Z ]+[A-Z])(?: NO| -+)? (.+?(?= @|$))',text)
