Regex to capitalize the sentences in HTML paragraphs with Python

import re

def replace(match):
    return match.group(1).capitalize()

with open('explanation.html', 'rbU') as inf:
    with open('out.html', 'wb') as outf:
        cont = inf.read()
        par = re.compile(r'(?s)\<p(.*?)\<\/p')
        s = re.sub(par, replace, cont)
        outf.write(s)

1 Answer

User

#1 · Posted 2024-05-02 08:37:29

An example using BeautifulSoup and NLTK:

from nltk.tokenize import PunktSentenceTokenizer
from bs4 import BeautifulSoup

html_doc = '''<html><head><title>abcd</title></head><body>
<p>i want to take everything in an HTML document and capitalize the sentences (within paragraph tags).
the input file has everything in all caps.</p>
<p>my attempt has two flaws - first, it removes the paragraph tags, themselves, and second, it simply lower-cases everything in the match groups.
 i don't quite know how capitalize() works, but I assumed that it would leave the first letter of sentences... capitalized.</p>
<p>there may be a much easier way to do this than regex, too. Here's what I have:</p>
</body>
</html>'''

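soup = BeautifulSoup(html_doc, 'html.parser')

for paragraph in soup.find_all('p'):
    # Text of the paragraph only, without any nested tags.
    text = paragraph.get_text()
    # Train a Punkt tokenizer on this paragraph's own text, then split it into sentences.
    sent_tokenizer = PunktSentenceTokenizer(text)
    # capitalize() upper-cases the first character and lower-cases the rest of each sentence.
    sents = [x.capitalize() for x in sent_tokenizer.tokenize(text)]
    # Assigning to .string replaces the paragraph's contents but keeps the <p> tag itself.
    paragraph.string = "\n".join(sents)

print(soup)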
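
If installing NLTK is not an option, the same idea can be sketched with the standard library alone. This also shows why the attempt in the question misbehaves: the replacement function returns only group 1, so the tags themselves are dropped, and capitalize() is applied to the whole paragraph at once, so only its very first letter ends up upper-case. The sketch below captures the tags and writes them back, and uses a deliberately naive rule for sentence starts (a letter after ., ! or ? plus whitespace). The names PARAGRAPH, SENTENCE_START, capitalize_sentences and capitalize_html are made up for this illustration. It will mis-handle abbreviations and ellipses, it leaves a standalone "i" lower-case, and it also lower-cases any inline tags inside a paragraph, so treat it as a starting point rather than a replacement for a real tokenizer and parser.

```python
import re

# One <p ...>...</p> element; the opening and closing tags are captured
# so they can be written back unchanged. \b keeps <pre> from matching.
PARAGRAPH = re.compile(r'(?s)(<p\b[^>]*>)(.*?)(</p>)')

# Naive sentence start: the beginning of the text, or whitespace that
# follows '.', '!' or '?'. This is an assumption, not a real tokenizer.
SENTENCE_START = re.compile(r'(^\s*|[.!?]\s+)([a-z])')


def capitalize_sentences(text):
    """Lower-case everything, then upper-case the first letter of each sentence."""
    return SENTENCE_START.sub(
        lambda m: m.group(1) + m.group(2).upper(),
        text.lower(),
    )


def capitalize_html(html):
    """Apply capitalize_sentences to the body of every <p> element, keeping the tags."""
    def fix_paragraph(match):
        opening, body, closing = match.groups()
        return opening + capitalize_sentences(body) + closing

    return PARAGRAPH.sub(fix_paragraph, html)


if __name__ == "__main__":
    print(capitalize_html('<p>HELLO WORLD. THIS IS A TEST.</p>'))
    # prints: <p>Hello world. This is a test.</p>
```

Text outside the paragraph tags is left as it is, so headings and other elements keep their original case.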
