Find all text until the next regex match

Published 2024-10-04 05:34:37


I am trying to collect all the text up to the next regex match in Python. The data is a debate transcript from the web.

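Currently I iterate over all the p tags, identify the ones that carry a speaker label, and then want to append all the following text that has no speaker label to the previous match.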

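I'm not sure whether this is the best way to proceed, or whether it would be easier to simply search and group all the text in one pass. So far, I can only find the paragraphs that begin with at least three capital letters.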

import re
import requests as rq
from bs4 import BeautifulSoup as bs

r = rq.get('http://www.cbsnews.com/news/transcript-of-the-2015-gop-debate-9-pm/')
b = bs(r.text, 'html.parser')
# attrs must be a dict ({'class': 'entry'}), not a set ({'class', 'entry'})
debatetext = b.find('div', attrs={'class': 'entry'}).find_all('p')
# at least three capital letters followed, somewhere later, by a colon
pattern = re.compile(r'[A-Z][A-Z][A-Z].*:')
for line in debatetext:
    if pattern.search(line.text) is not None:
        print(line)

Sample text: [excerpt not preserved]

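Ideally, I would like the three lines that do not start with "BUSH:" to be appended to the first line, the one that begins with "BUSH:" (or with the name of whichever other candidate is speaking).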

Edit: larger sample

    <div class="entry" itemprop="articleBody" id="article-entry">...


<p>   CARSON:  -- extremely effectively.</p>
<p>   (APPLAUSE)</p>
<p>   BAIER:  Gentlemen, the next series of questions deals with ObamaCare and the role of the federal government.</p>
<p>   Mr. Trump, ObamaCare is one of the things you call a disaster.</p>
<p>   TRUMP:  A complete disaster, yes.</p>
<p>   BAIER:  Saying it needs to be repealed and replaced.</p>
<p>   TRUMP:  Correct.</p>
<p>   BAIER:  Now, 15 years ago, uncalled yourself a liberal on health care.  You were for a single-payer system, a Canadian-style system.</p>
<p>   Why were you for that then and why aren't you for it now?  TRUMP:  First of all, I'd like to just go back to one.  In July of 2004, I came out strongly against the war with Iraq, because it was going to destabilize the Middle East.  And I'm the only one on this stage that knew that and had the vision to say it.  And that's exactly what happened.</p>
<p>   BAIER:  But on ObamaCare...</p>
<p>   TRUMP:  And the Middle East became totally destabilized.  So I just want to say.</p>
<p>   As far as single payer, it works in Canada.  It works incredibly well in Scotland.  It could have worked in a different age, which is the age you're talking about here.</p>
<p>   What I'd like to see is a private system without the artificial lines around every state.  I have a big company with thousands and thousands of employees.  And if I'm negotiating in New York or in New Jersey or in California, I have like one bidder.  Nobody can bid.</p>
<p>   You know why?</p>
<p>   Because the insurance companies are making a fortune because they have control of the politicians, of course, with the exception of the politicians on this stage.</p>
<p>   But they have total control of the politicians.  They're making a fortune.</p>
<p>   Get rid of the artificial lines and you will have...</p>
<p>   (BUZZER NOISE)</p>
<p>   TRUMP:  -- yourself great plans.  And then we have to take care of the people that can't take care of themselves.  And I will do that through a different system.</p>
<p>   (CROSSTALK)</p>
<p>   BAIER:  Mr. Trump, hold up one second.</p>
<p>   PAUL:  I've got a news flash...</p>

2 answers

I reformatted the regular expression slightly, as follows:

pattern = re.compile(r'([A-Z]+):(.*)')

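The + matches one or more capital letters, so this is just a small cleanup of the earlier regex. I also changed it to create capture groups: the first is the run of capital letters before the ":", and the second is any text after the ":".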

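The second group (group(0) is the entire match and group(1) is the name) can now be appended to a dictionary, and the text that follows can be appended to it as well.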
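To make the group numbering concrete, here is a small sketch that applies the pattern above to a few lines copied from the sample in the question. The helper name parse_line is an assumption made for illustration and is not part of the original answer.

```python
import re

# Same pattern as above: group(1) is the speaker, group(2) is the text after ":".
pattern = re.compile(r'([A-Z]+):(.*)')

# Lines copied from the transcript sample in the question.
samples = [
    "   TRUMP:  A complete disaster, yes.",
    "   (APPLAUSE)",
    "   Mr. Trump, ObamaCare is one of the things you call a disaster.",
]

def parse_line(text):
    """Return (speaker, statement) for a labelled line, or None otherwise.

    parse_line is a hypothetical helper used only for this illustration.
    """
    match = pattern.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2).strip()

for text in samples:
    print(parse_line(text))
```

Because search() scans the whole string, a label that appears in the middle of a paragraph (such as the "TRUMP:" after "why aren't you for it now?") is also found, and whatever precedes it has to be handled separately.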

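To handle the statements that follow a labelled line but carry no label of their own, I used a state machine. Note that this works only because I assume that all text following a regex match belongs to the speaker found by that match.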
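The following is a minimal sketch of what such a state machine could look like. It is not the answerer's code: the function name group_by_speaker, the rule that skips stage directions in parentheses, and the handling of text that precedes a mid-paragraph label are all assumptions made for illustration.

```python
import re
from collections import defaultdict

pattern = re.compile(r'([A-Z]+):(.*)')

def group_by_speaker(paragraphs):
    """Collect statements per speaker (hypothetical helper).

    Unlabelled paragraphs are appended to the most recent statement
    of whoever spoke last.
    """
    statements = defaultdict(list)
    current = None  # the "state": the speaker we are currently collecting for
    for text in paragraphs:
        text = text.strip()
        if not text:
            continue
        # Assumption: lines such as "(APPLAUSE)" are stage directions, skip them.
        if text.startswith("(") and text.endswith(")"):
            continue
        match = pattern.search(text)
        if match:
            # Text before a mid-paragraph label still belongs to the old speaker.
            before = text[:match.start()].strip()
            if before and current is not None:
                statements[current][-1] += " " + before
            current = match.group(1)
            statements[current].append(match.group(2).strip())
        elif current is not None:
            statements[current][-1] += " " + text
    return dict(statements)

if __name__ == "__main__":
    demo = [
        "   TRUMP:  A complete disaster, yes.",
        "   BAIER:  Saying it needs to be repealed and replaced.",
        "   TRUMP:  Correct.",
    ]
    for speaker, lines in group_by_speaker(demo).items():
        print(speaker, lines)
```

With BeautifulSoup, the input could be something like the .text of each p tag in the entry div. A dictionary keyed by speaker loses the overall order of the debate, so a list of (speaker, statement) pairs would be the better structure if order matters.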


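This took some IRL help, but I think the solution works well for this example and may help others. I used it to analyze the second debate, and it worked well. I may modify it to store the statements in order, so that I can do some correlation analysis in combination with Twitter data.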

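Regarding "I'm not sure whether this is the best way to proceed, or whether it would be easier to simply search and group all the text in one pass": the best approach is the one that you understand and that solves the problem. This is quick and dirty, but it should get you started.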

import pprint

test_data="""    <div class="entry" itemprop="articleBody" id="article-entry">...


<p>   CARSON:    extremely effectively.</p>
<p>   (APPLAUSE)</p>
<p>   BAIER:  Gentlemen, the next series of questions deals with ObamaCare and the role of the federal government.</p>
<p>   Mr. Trump, ObamaCare is one of the things you call a disaster.</p>
<p>   TRUMP:  A complete disaster, yes.</p>
<p>   BAIER:  Saying it needs to be repealed and replaced.</p>
<p>   TRUMP:  Correct.</p>
<p>   BAIER:  Now, 15 years ago, uncalled yourself a liberal on health care.  You were for a single-payer system, a Canadian-style system.</p>
<p>   Why were you for that then and why aren't you for it now?  TRUMP:  First of all, I'd like to just go back to one.  In July of 2004, I came out strongly against the war with Iraq, because it was going to destabilize the Middle East.  And I'm the only one on this stage that knew that and had the vision to say it.  And that's exactly what happened.</p>
<p>   BAIER:  But on ObamaCare...</p>
<p>   TRUMP:  And the Middle East became totally destabilized.  So I just want to say.</p>
<p>   As far as single payer, it works in Canada.  It works incredibly well in Scotland.  It could have worked in a different age, which is the age you're talking about here.</p>
<p>   What I'd like to see is a private system without the artificial lines around every state.  I have a big company with thousands and thousands of employees.  And if I'm negotiating in New York or in New Jersey or in California, I have like one bidder.  Nobody can bid.</p>
<p>   You know why?</p>
<p>   Because the insurance companies are making a fortune because they have control of the politicians, of course, with the exception of the politicians on this stage.</p>
<p>   But they have total control of the politicians.  They're making a fortune.</p>
<p>   Get rid of the artificial lines and you will have...</p>
<p>   (BUZZER NOISE)</p>
<p>   TRUMP:    yourself great plans.  And then we have to take care of the people that can't take care of themselves.  And I will do that through a different system.</p>
<p>   (CROSSTALK)</p>
<p>   BAIER:  Mr. Trump, hold up one second.</p>
<p>   PAUL:  I've got a news flash...</p>"""

## look for 3 capital letters at the start of the second token
## assume every line starts with "<p>" (so won't test for it)

one_group = []
for record in test_data.split("\n"):
    record = record.strip()
    if record:
        split_rec = record.split()
        ## a name needs a second token with at least three characters
        found = len(split_rec) > 1 and len(split_rec[1]) >= 3
        if found:
            for ltr in split_rec[1][:3]:
                if ltr < "A" or ltr > "Z":
                    found = False

        ## found new name so print previous block
        if found and one_group:
            pprint.pprint(one_group)
            print()
            one_group = []
        one_group.append(record)

## last group
print(one_group)
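As a point of comparison, the "search and group all the text in one pass" idea from the question can be sketched with re.split. This is only an illustration under stated assumptions: a speaker label is taken to be at least three capital letters followed by a colon, the tags are assumed to have been stripped already, and the name split_by_speaker is hypothetical.

```python
import re

# Assumption: a speaker label is three or more capital letters followed by ":".
label = re.compile(r'\b([A-Z]{3,}):')

def split_by_speaker(text):
    """Return (speaker, statement) pairs in transcript order (hypothetical helper)."""
    # With one capture group, re.split returns
    # [text before first label, speaker1, text1, speaker2, text2, ...].
    parts = label.split(text)
    pairs = []
    for speaker, statement in zip(parts[1::2], parts[2::2]):
        # Collapse runs of whitespace, including the newlines between paragraphs.
        pairs.append((speaker, " ".join(statement.split())))
    return pairs

if __name__ == "__main__":
    sample = (
        "BAIER:  Saying it needs to be repealed and replaced.\n"
        "TRUMP:  Correct.\n"
        "BAIER:  But on ObamaCare...\n"
    )
    for speaker, statement in split_by_speaker(sample):
        print(speaker, "->", statement)
```

Because the split works on the whole text, a label that appears in the middle of a paragraph is handled the same way as one at the start, and the pairs keep the order of the debate. Stage directions such as (APPLAUSE) are not filtered out in this sketch and end up inside the preceding statement.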
