Regex to capitalize sentences in HTML paragraphs with Python

Posted 2024-05-02 08:37:29


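I want to take everything in an HTML document and capitalize the sentences (within paragraph tags). The input file has everything in all caps.

My attempt has two flaws: first, it removes the paragraph tags themselves, and second, it simply lower-cases everything in the match groups. I don't quite know how capitalize() works, but I assumed it would leave the first letter of sentences... capitalized.

There may be a much easier way to do this than regex, too. Here's what I have: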

import re

def replace(match):
    return match.group(1).capitalize()

with open('explanation.html', 'rbU') as inf:
    with open('out.html', 'wb') as outf:
        cont = inf.read()
        par = re.compile(r'(?s)\<p(.*?)\<\/p')
        s = re.sub(par, replace, cont)
        outf.write(s)

1 answer
User
Answer 1 · posted 2024-05-02 08:37:29

Here is an example using BeautifulSoup and NLTK:

from nltk.tokenize import PunktSentenceTokenizer
from bs4 import BeautifulSoup

html_doc = '''<html><head><title>abcd</title></head><body>
<p>i want to take everything in an HTML document and capitalize the sentences (within paragraph tags).
the input file has everything in all caps.</p>
<p>my attempt has two flaws - first, it removes the paragraph tags, themselves, and second, it simply lower-cases everything in the match groups.
 i don't quite know how capitalize() works, but I assumed that it would leave the first letter of sentences... capitalized.</p>
<p>there may be a much easier way to do this than regex, too. Here's what I have:</p>
</body>
</html>'''

soup = BeautifulSoup(html_doc, 'html.parser')

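for paragraph in soup.find_all('p'):
    text = paragraph.get_text()
    # Passing text to the constructor trains Punkt on this paragraph before tokenizing it.
    sent_tokenizer = PunktSentenceTokenizer(text)
    # capitalize() upper-cases the first character and lower-cases the rest of each sentence.
    sents = [x.capitalize() for x in sent_tokenizer.tokenize(text)]
    # Assigning .string replaces the paragraph's contents but keeps the <p> tag itself.
    paragraph.string = "\n".join(sents)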

print(soup)
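
If you would rather stay with regex, the two flaws in the question can be addressed roughly like this. The tags disappear because the pattern matches them but the replacement only returns group(1), and everything ends up lower-cased because capitalize() upper-cases only the first character of the whole string it is called on. This is a minimal sketch, not a robust HTML parser: the names PARAGRAPH, SENTENCE_BREAK, capitalize_sentences, fix_paragraph and capitalize_paragraphs are made up for illustration, and the sentence split is a naive "punctuation followed by whitespace" heuristic.

```python
import re

# Hypothetical pattern: captures the opening <p ...> tag, the body, and the closing </p>.
PARAGRAPH = re.compile(r'(?s)(<p\b[^>]*>)(.*?)(</p>)')
# Naive assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
SENTENCE_BREAK = re.compile(r'(?<=[.!?])\s+')

def capitalize_sentences(text):
    # capitalize() works on one string at a time, so apply it per sentence.
    sentences = SENTENCE_BREAK.split(text.strip())
    return ' '.join(s.capitalize() for s in sentences)

def fix_paragraph(match):
    # Return the tags along with the rewritten body so they are not dropped.
    opening, body, closing = match.groups()
    return opening + capitalize_sentences(body) + closing

def capitalize_paragraphs(html):
    return PARAGRAPH.sub(fix_paragraph, html)

sample = '<p class="x">FIRST SENTENCE. SECOND ONE!</p><div>UNTOUCHED</div>'
print(capitalize_paragraphs(sample))
# <p class="x">First sentence. Second one!</p><div>UNTOUCHED</div>
```

Note that any inline tags inside a paragraph would be lower-cased along with the text, and abbreviations such as "e.g." would be treated as sentence ends, which is why the parser-plus-tokenizer approach above is usually the safer choice.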
